path: root/util/locale_database/ldml.py
Commit message (author, age; files changed, lines -removed/+added)
* Fix spacing inconsistencies brought to light by flake8 (Edward Welbourne, 8 days; 1 file, -2/+2)
  It has many grumbles about spacing, but at least this code is currently consistent about its departure from PEP8's spacing rules (and closer to Qt's) for the present. We can review whether to do a drastic spacing revolution later.
  Change-Id: Ife4e8a5b02b63434bd9c7ac7ba4cbc11b6311f9f
  Reviewed-by: Mate Barany <mate.barany@qt.io>
* Move LocaleScanner's INHERIT check from find upstream to __find (Edward Welbourne, 13 days; 1 file, -3/+12)
  When digesting CLDR v44.1's github form, some data (e.g. pt_BR's language endonym) were None that had perfectly sensible values in the zip-file form. Letting __find() yield INHERIT entries led to find() sometimes returning None, where __find() should have tried harder or raised an Error.
  This further amends commit bcdd51cfae24731a73d008add23d3c1e85bbd8d0 (after commit 0f770b0b34bcb5fa0a598b2ff76fe215fbc25f5c isolated its magic value).
  Pick-to: 6.7
  Task-number: QTBUG-115158
  Change-Id: I1af92a687cd50b8fd026c25f068c804a3516ef95
  Reviewed-by: Mate Barany <mate.barany@qt.io>
* Document LocaleScanner's constructor (Edward Welbourne, 2024-02-08; 1 file, -0/+7)
  I needed to know in order to make recent changes. Save the need to work it out again next time.
  Change-Id: Ibc606cbe2e6af16e6820fd753a643331a03cdfb3
  Reviewed-by: Ievgenii Meshcheriakov <ievgenii.meshcheriakov@qt.io>
* Move special-case LDML value to a module global (Edward Welbourne, 2024-01-29; 1 file, -2/+4)
  Giving it a symbolic name is clearer (and saves me the need to duplicate the comment when I add some more references to it).
  This amends commit bcdd51cfae24731a73d008add23d3c1e85bbd8d0
  Task-number: QTBUG-115158
  Change-Id: I7577e0cde783fcda840009c7aea46934964c6e4c
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
* ldml.LocaleScanner.__find(): only Error if no matches found (Edward Welbourne, 2024-01-29; 1 file, -9/+15)
  The existing caller returns early on finding a match, so never ran off the end of the iteration unless there were no matches. I'll soon be adding a new caller that wants to iterate all matches, so will run off the end even when there are some. So only raise the Error if we found nothing.
  Task-number: QTBUG-115158
  Change-Id: I1cae4674eb5e83c433554c15ecc4441b756f20eb
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
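The pattern this commit describes can be sketched as a generator that remembers whether it yielded anything. This is only an illustration, not the code in ldml.py: the names `Error`, `find_matches`, `candidates` and `wanted` are hypothetical stand-ins.

```python
class Error(Exception):
    """Hypothetical stand-in for the scripts' own error type."""


def find_matches(candidates, wanted):
    """Yield every candidate equal to wanted; complain only if none were.

    A caller that stops at the first match never reaches the end of the
    loop; one that iterates every match does, and must not be told that
    nothing was found.
    """
    found = False
    for item in candidates:
        if item == wanted:
            found = True
            yield item
    if not found:
        raise Error(f'No match found for {wanted!r}')
```

A caller wanting only the first match can use `next(find_matches(...))`; one wanting them all can wrap the call in `list()`.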
* Package DOM attributes for Node objects (Edward Welbourne, 2024-01-29; 1 file, -3/+8)
  The Supplement type did the needed mapping (using nodeValue when the value wasn't a string) and it turns out to be useful to do the same for the DOM object packaged by a Node, too. Pull out into a helper function, use dict-comprehension and expose as a method of Node.
  Change-Id: Ice6737a54a33372b45cf42152e3fdbf5f2da7ba4
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
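A helper of the kind described might look roughly like the following, assuming the DOM comes from Python's xml.dom.minidom. The function name `attributes_as_dict` is invented for this sketch and is not taken from ldml.py.

```python
from xml.dom import minidom


def attributes_as_dict(element):
    """Map a DOM element's attribute names to plain string values."""
    attrs = element.attributes
    if attrs is None:  # non-element nodes have no attributes
        return {}
    # Use the value as-is when it is already a string, else its nodeValue.
    return {name: value if isinstance(value, str) else value.nodeValue
            for name, value in attrs.items()}


doc = minidom.parseString('<currency type="USD" draft="contributed"/>')
packaged = attributes_as_dict(doc.documentElement)
```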
* Prepare to support taking CLDR data from its github upstream (Edward Welbourne, 2024-01-19; 1 file, -1/+4)
  We've previously used the zip-file form, but that's not been published for CLDR v44.1 - the advice on the list was to use github instead. That, however, has ↑↑↑ as a special value for fields, meaning to inherit from a parent locale. So special-case that value.
  I have verified that v44 from the zip file produces identical results to v44 from github, with this minor fix. As it happens v44.1 also produces identical results.
  Pick-to: 6.7 6.5
  Change-Id: I6eb0aedda7556753cdc83bb9d76652fbb68dc669
  Reviewed-by: Ievgenii Meshcheriakov <ievgenii.meshcheriakov@qt.io>
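To illustrate the special case: a look-up that walks from a locale towards its parents can treat ↑↑↑ the same as a missing entry. The sketch below assumes each locale's data is a plain dict and that the chain runs from most specific to root; `resolve` and the sample data are hypothetical.

```python
INHERIT = '↑↑↑'  # marker in the github form of CLDR: take the parent's value


def resolve(key, chain):
    """Return the first real value for key along a locale inheritance chain.

    chain lists per-locale dicts from the most specific locale to root.
    """
    for data in chain:
        value = data.get(key)
        if value is not None and value != INHERIT:
            return value
    raise KeyError(key)


# Illustrative data only, not copied from CLDR:
child = {'endonym': INHERIT}
parent = {'endonym': 'português'}
```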
* Cope with CLDR data conflict decimal == group (Edward Welbourne, 2023-08-01; 1 file, -1/+7)
  The digit-grouping and fractional-part separators need to be distinct for parsing to be able to distinguish between two thousand and one vs two and one thousandth. Thankfully ldml.py asserted this, so caught the glitch in CLDR v43's data where mn_Mong_MN over-rode mn's decimal, but not group, and thereby clashed with group.
  Fortunately the over-ride is marked as draft="contributed" so we can back out of the collision and limit the selection to draft="approved" values (but only when there *is* such a conflict, as plenty of locales have (compatible) draft data), thereby ignoring the conflicting contribution.
  Brought to the attention of cldr-users at: https://groups.google.com/a/unicode.org/g/cldr-users/c/6kW9kC6fz3g hopefully that'll lead to a saner resolution at v44.
  Task-number: QTBUG-111550
  Change-Id: I1332486e60481cb4494446c0c87d89d74bd317d4
  Reviewed-by: Ievgenii Meshcheriakov <ievgenii.meshcheriakov@qt.io>
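The fallback can be sketched as follows. This is a simplified model rather than the real code: candidates are assumed to be (value, draft) pairs ordered from the most specific locale to its parents, a missing draft attribute (None) is assumed to count as approved, and the separator values in the example are invented.

```python
def pick(candidates, approved_only=False):
    """Return the first acceptable value among (value, draft) pairs."""
    for value, draft in candidates:
        if approved_only and draft not in (None, 'approved'):
            continue  # skip draftier data only when asked to
        return value
    return None


def separators(decimal_candidates, group_candidates):
    """Select decimal and group separators, resolving a clash if possible."""
    decimal, group = pick(decimal_candidates), pick(group_candidates)
    if decimal == group:
        # Only on conflict: retry, restricted to approved values.
        decimal = pick(decimal_candidates, approved_only=True)
        group = pick(group_candidates, approved_only=True)
    if decimal == group:
        raise ValueError('decimal and group separators must differ')
    return decimal, group


# A contributed over-ride of decimal clashes with the inherited group:
clash = separators([(',', 'contributed'), ('.', None)], [(',', None)])
```

When there is no clash, contributed data is still used, matching the commit's remark that plenty of locales have compatible draft data.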
* Ignore parentLocales nodes with component="..." attributes (Edward Welbourne, 2023-08-01; 1 file, -3/+16)
  From CLDR v43, "The parentLocale elements now have an optional component attribute, with a value of segmentations or collations. These should be used for inheritance for those respective elements."
  Since we aren't extracting collation or segmentation data for the present, omit these elements from the scan for parentLocale information.
  Task-number: QTBUG-111550
  Change-Id: I42871929f539c1852471812801953f2fc8be0e8a
  Reviewed-by: Ievgenii Meshcheriakov <ievgenii.meshcheriakov@qt.io>
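As a rough illustration of the filtering, the sketch below scans a small hand-written XML fragment with xml.dom.minidom and skips any parentLocales group that carries a component attribute. The fragment, the locale names in it and the function `parent_locales` are made up for the example; the real scan lives in the CLDR-reading classes.

```python
from xml.dom import minidom

SAMPLE = '''<supplementalData>
  <parentLocales>
    <parentLocale parent="root" locales="az_Arab az_Cyrl"/>
  </parentLocales>
  <parentLocales component="segmentations">
    <parentLocale parent="root" locales="zh_Hant"/>
  </parentLocales>
</supplementalData>'''


def parent_locales(xml_text):
    """Map each locale to its parent, ignoring component-specific groups."""
    doc = minidom.parseString(xml_text)
    parents = {}
    for group in doc.getElementsByTagName('parentLocales'):
        if group.hasAttribute('component'):
            continue  # collation or segmentation inheritance: not wanted
        for node in group.getElementsByTagName('parentLocale'):
            parent = node.getAttribute('parent')
            for locale in node.getAttribute('locales').split():
                parents[locale] = parent
    return parents
```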
* Record a recent discovery: Suzhou isn't hanidec (Edward Welbourne, 2023-02-28; 1 file, -1/+2)
  Revise a comment in ldml.py about Suzhou "digits", since it turns out they aren't the same as hanidec, which is far from contiguous.
  Change-Id: Ia3947dbc5a927772026e55fe197c8ebce2540da2
  Reviewed-by: Ievgenii Meshcheriakov <ievgenii.meshcheriakov@qt.io>
* Use SPDX license identifiers (Lucie Gérard, 2022-05-16; 1 file, -27/+2)
  Replace the current license disclaimer in files by a SPDX-License-Identifier. Files that have to be modified by hand are modified. License files are organized under LICENSES directory.
  Task-number: QTBUG-67283
  Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
  Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
* locale_database: Use f-strings in Python code (Ievgenii Meshcheriakov, 2021-07-16; 1 file, -44/+40)
  Replace most uses of str.format() and string arithmetic by f-strings. This results in more compact code and the code is easier to read when using an appropriate editor.
  Task-number: QTBUG-83488
  Pick-to: 6.2
  Change-Id: I3409f745b5d0324985cbd5690f5eda8d09b869ca
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* locale_database: Don't use u prefix for strings in python files (Ievgenii Meshcheriakov, 2021-07-15; 1 file, -2/+2)
  This prefix is useless with Python 3.
  Task-number: QTBUG-83488
  Pick-to: 6.2
  Change-Id: Ic008d53fe506865759e9a5003f439f7ac107b9e6
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Convert CLDR scripts to Python 3 (Ievgenii Meshcheriakov, 2021-07-15; 1 file, -3/+3)
  The conversion is mostly done using the 2to3 script with manual cleanup afterwards.
  Task-number: QTBUG-83488
  Pick-to: 6.2
  Change-Id: I4d33b04e7269c55a83ff2deb876a23a78a89f39d
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Nomenclature change: s/countr/territor/g in locale scripts (Edward Welbourne, 2021-05-26; 1 file, -7/+7)
  Change the nomenclature used in the scripts and the QLocaleXML data format to use "territory" and "territories" in place of "country" and "countries". Does not change the generated source files.
  Change-Id: I4b208d8d01ad2bfc70d289fa6551f7e0355df5ef
  Reviewed-by: JiDe Zhang <zhangjide@uniontech.com>
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QLocale: simplify currency display name lookup (Edward Welbourne, 2020-11-17; 1 file, -7/+13)
  We were extracting several candidate display names from CLDR for each currency, joining them with semicolons, storing in a table, then using only the first entry from the list - where we should probably have used the first non-empty entry in any case. So instead extract the first non-empty candidate name from CLDR and store that simply, saving the need for semicolon-joining or parsing out the first entry from the thus-joined list. This significantly reduces the size of the currency name data table.
  Change-Id: I201d0528348d5fcb9eceb5df86211b9c77de3485
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
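The difference between the old and new selection is small enough to show directly. In this sketch the candidate list is invented, and `first_non_empty` is a hypothetical helper, not a function from the scripts.

```python
def first_non_empty(candidates):
    """Return the first candidate that is not empty, or '' if all are."""
    return next((name for name in candidates if name), '')


candidates = ['', 'US Dollar', 'US dollars']

# Old approach: join everything, later split and take the first entry,
# which here is the empty one.
old_style = ';'.join(candidates).split(';')[0]

# New approach: store just the first usable name.
new_style = first_non_empty(candidates)
```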
* Fix handling of Suzhou numbering system (Edward Welbourne, 2020-07-17; 1 file, -2/+4)
  This only arises when the system locale tells us to use its zero as our zero digit, since no CLDR locale uses it by default. Adapt an MS-specific QLocale::system() test to use Suzhou numbering, so as to test this.
  While updating the locale-restoration code to also restore the digits being set in that test, add restore code for the long time format, where previously only the short time format was restored. Add a comment to make it less likely one of those shall be missed in future.
  Fixes: QTBUG-85409
  Change-Id: I343324bb563ee0e455dfe77d4825bf8c3082ca30
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Support digit-grouping correctly (Edward Welbourne, 2020-07-14; 1 file, -0/+31)
  Read three more values from CLDR and add a byte to the bit-fields at the end of QLocaleData, indicating the three group sizes. This adds three new parameters to various low-level formatting functions. At the same time, rename ThousandsGroup to GroupDigits, more faithfully expressing what this (internal) option means.
  This replaces commit 27d139128013c969a939779536485c1a80be977e with a fuller implementation that handles digit-grouping in any of the ways that CLDR supports. The formerly "Indian" formatting now also applies to at least some locales for Bangladesh, Bhutan and Sri Lanka.
  Fixed Costa Rica currency formatting test that wrongly put a separator after the first digit; the locale (in common with several Spanish locales) requires at least two digits before the first separator.
  [ChangeLog][QtCore][Important Behavior Changes] Some locales require more than one digit before the first grouping separator; others use group sizes other than three. The latter was partially supported (only for India) at 5.15 but is now systematically supported; the former is now also supported.
  Task-number: QTBUG-24301
  Fixes: QTBUG-81050
  Change-Id: I4ea4e331f3254d1f34801cddf51f3c65d3815573
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
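One way to picture the three group sizes is the sketch below. It is written in Python for illustration only and is not Qt's implementation: the parameter names `top`, `higher` and `least` are assumptions, standing for the minimum number of digits before the first separator, the size of the higher groups and the size of the least significant group.

```python
def group_digits(digits, sep=',', top=1, higher=3, least=3):
    """Insert sep into a string of digits according to three group sizes."""
    if len(digits) < least + top:
        return digits  # too few digits before the first separator
    groups = [digits[-least:]]  # least significant group
    rest = digits[:-least]
    while len(rest) > higher:
        groups.append(rest[-higher:])
        rest = rest[:-higher]
    groups.append(rest)  # leading group, at most `higher` digits
    return sep.join(reversed(groups))
```

With the defaults this gives the familiar thousands grouping; `top=2` leaves four-digit numbers ungrouped, and `higher=2` gives the formerly "Indian" style.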
* Merge remote-tracking branch 'origin/5.15' into dev (Qt Forward Merge Bot, 2020-04-08; 1 file, -8/+9)
  Conflicts:
    examples/opengl/doc/src/cube.qdoc
    src/corelib/global/qlibraryinfo.cpp
    src/corelib/text/qbytearray_p.h
    src/corelib/text/qlocale_data_p.h
    src/corelib/time/qhijricalendar_data_p.h
    src/corelib/time/qjalalicalendar_data_p.h
    src/corelib/time/qromancalendar_data_p.h
    src/network/ssl/qsslcertificate.h
    src/widgets/doc/src/graphicsview.qdoc
    src/widgets/widgets/qcombobox.cpp
    src/widgets/widgets/qcombobox.h
    tests/auto/corelib/tools/qscopeguard/tst_qscopeguard.cpp
    tests/auto/widgets/widgets/qcombobox/tst_qcombobox.cpp
    tests/benchmarks/corelib/io/qdiriterator/qdiriterator.pro
    tests/manual/diaglib/debugproxystyle.cpp
    tests/manual/diaglib/qwidgetdump.cpp
    tests/manual/diaglib/qwindowdump.cpp
    tests/manual/diaglib/textdump.cpp
    util/locale_database/cldr2qlocalexml.py
    util/locale_database/qlocalexml.py
    util/locale_database/qlocalexml2cpp.py
  Resolution of util/locale_database/ is based on: https://codereview.qt-project.org/c/qt/qtbase/+/294250 and src/corelib/{text,time}/*_data_p.h were then regenerated by running those scripts.
  Updated CMakeLists.txt in each of
    tests/auto/corelib/serialization/qcborstreamreader/
    tests/auto/corelib/serialization/qcborvalue/
    tests/auto/gui/kernel/
  and generated new ones in each of
    tests/auto/gui/kernel/qaddpostroutine/
    tests/auto/gui/kernel/qhighdpiscaling/
    tests/libfuzzer/corelib/text/qregularexpression/optimize/
    tests/libfuzzer/gui/painting/qcolorspace/fromiccprofile/
    tests/libfuzzer/gui/text/qtextdocument/sethtml/
    tests/libfuzzer/gui/text/qtextdocument/setmarkdown/
    tests/libfuzzer/gui/text/qtextlayout/beginlayout/
  by running util/cmake/pro2cmake.py on their changed .pro files.
  Changed target name in
    tests/auto/gui/kernel/qaction/qaction.pro
    tests/auto/gui/kernel/qaction/qactiongroup.pro
    tests/auto/gui/kernel/qshortcut/qshortcut.pro
  to ensure unique target names for CMake.
  Changed tst_QComboBox::currentIndex to not test the currentIndexChanged(QString), as that one does not exist in Qt 6 anymore.
  Change-Id: I9a85705484855ae1dc874a81f49d27a50b0dcff7
* Check all matches for each XPath when searching (Edward Welbourne, 2020-04-02; 1 file, -63/+67)
  Previously, if we found one element with required attributes, we would search into it and ignore any later elements also with those required attributes. This meant that, if the first didn't contain the child elements we were looking for, we'd fail to find what we sought, if it was in a later matching element (e.g. with some ignored attributes). We would then go on to look for a match in a later file, where there might have been a match we should have found in the earlier file.
  Check all matches, rather than only the first match in each file. Do the search in each file "in parallel" to save reparsing the XPath. This clears the search code of rather hard-to-follow break/else handling in loops; and currently makes no change to the generated data.
  Change-Id: I86b010e65b9a1fc1b79e5fdd45a5aeff1ed5d5d5
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
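The search strategy can be pictured as advancing a whole set of nodes one path step at a time, so that every matching element is explored and the path is parsed only once for all starting points. The sketch assumes xml.dom.minidom and a path given as a list of tag names with no attribute tests; `find_everywhere` and the sample document (including its placeholder attribute) are invented for the illustration.

```python
from xml.dom import minidom


def find_everywhere(roots, path):
    """Return all nodes reached by following path from any of roots."""
    nodes = list(roots)
    for tag in path:
        # Step into every matching child of every current node.
        nodes = [child
                 for node in nodes
                 for child in node.childNodes
                 if child.nodeType == child.ELEMENT_NODE
                 and child.nodeName == tag]
    return nodes


# The first <symbols> lacks the child we want; the second has it.
doc = minidom.parseString(
    '<ldml><numbers>'
    '<symbols references="R1"/>'
    '<symbols><decimal>.</decimal></symbols>'
    '</numbers></ldml>')
hits = find_everywhere([doc.documentElement],
                       ['numbers', 'symbols', 'decimal'])
```

A search that committed to the first `symbols` element would come back empty here; passing several roots, one per file, corresponds to searching the files "in parallel".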
* Change QLocale to use CLDR's accounting formats for currencies (Edward Welbourne, 2020-04-02; 1 file, -1/+1)
  In particular, this changed the US currency formats for negative amounts to be parenthesised versions of the positive amount forms, rather than having a minus sign after the $ sign. Test updated.
  [ChangeLog][QtCore][QLocale] Currency formats are now based on CLDR's accounting formats, where they were previously mostly based (more or less by accident) on standard formats. In particular, this now means negative currency formats are specified, where available, where they (mostly) were not previously.
  Task-number: QTBUG-79902
  Change-Id: Ie0c07515ece8bd518a74a6956bf97ca85e9894eb
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
* Take CLDR's distinguished attributes into account (Edward Welbourne, 2020-04-02; 1 file, -15/+40)
  When doing XPATH searches, child nodes that have distinguished attributes that were not asked for should be skipped. This is part of the LDML spec and matters when resolving locale inheritance. Scan the LDML DTD (previously only scanned for the CLDR version) to find which attributes of which tags are ignorable - all others are distinguished - and take the result into account when performing XPATH searches.
  The XPath we were using for currency formats wasn't excluding currencyFormatLength elements with type="short" and patterns specific to thousands (and larger multiples); this is fixed by taking distinguished attributes into account. However, the XPATH also wasn't specifying the always distinguished attribute type="standard" that was, in practice, used for nearly all locales that weren't (wrongly) using short-forms for thousands; so type="standard" is now made explicit, so as to minimize the diff.
  This leaves only twenty-one locales with negative currency formats. A later commit shall switch to using accounting by default (it falls back via an alias to standard, in any case), thereby restoring the two mentioned below that were using it by accident, but the present change gives the minimal diff here.
  Thousands-specific formats replaced with sensible ones:
    * zh_Hant_{HK,MO} (Traditional Mandarin, Hong Kong and Macau)
    * eo_001 (Esperanto)
    * fr_CA (Canadian French)
    * ha_* (Hausa, when not written in Arabic)
    * es_{GT,MX,US} (Spanish - Guatemala, Mexico, USA)
    * sw_KE (Swahili, Kenya)
    * yi_001 (Yiddish)
    * mfe_MU (Morisyen, Mauritius)
    * lag_TZ (Langi, Tanzania)
    * mgh_MZ (Makhuwa Meetto, Mozambique)
    * wae_CH (Walser, Switzerland)
    * kkj_CM (Kako, Cameroon)
    * lkt_US (Lakota, USA)
    * pa_Arab_PK (Punjabi, in Arabic script, as used in Pakistan; uses arabext number system, whose currency falls back to latn's, for which pa_Arab over-rides the thousands-format).
  Format changed from an over-ridden type="accounting" to standard (so these lost a negative-specific form) in:
    * en_SI (English, Slovenia)
    * es_DO (Spanish, Dominican Republic; same)
  For some locales we were picking up over-rides of narrow or short list formats, or formats for or-lists or unit-lists rather than and-lists, in place of the standard list format, that these locales don't over-ride, provided by a parent locale. This changed list formats for:
    * en_CA, en_IN (dropped "Oxford" comma before "and")
    * qu_* (Quechua; dropped "utaq", presumably meaning "and")
    * ur_IN (Urdu, India; was using unit-list formats)
  [ChangeLog][QtCore][QLocale] Data used for currency formats in several locales and list patterns in some locales have changed due to now parsing the CLDR data more faithfully.
  Fixes: QTBUG-81344
  Change-Id: I6b95c6c37db92df167153767c1b103becfb0ac98
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
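The matching rule can be stated compactly: an element matches when it has every attribute that was asked for, and every other attribute it carries is ignorable. The sketch below works on plain dicts and takes the set of ignorable attribute names as given (the commit derives it from the LDML DTD); `element_matches` and the choice of `draft` as the ignorable attribute in the examples are illustrative assumptions.

```python
def element_matches(attrs, wanted, ignorable):
    """Decide whether an element with attrs satisfies a search step.

    attrs: the element's attributes as a dict.
    wanted: attributes the search asked for, with required values.
    ignorable: names of attributes that are not distinguished.
    """
    for name, value in wanted.items():
        if attrs.get(name) != value:
            return False
    # Any distinguished attribute that was not asked for rules it out.
    return all(name in wanted or name in ignorable for name in attrs)
```

Under this rule a search that does not mention `type` skips an element with `type="short"`, which is the behaviour that excluded the thousands-specific currency formats.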
* Take number system into account in currency format look-up (Edward Welbourne, 2020-04-02; 1 file, -1/+6)
  CLDR's currency formats do have number system variation, so take it into account. (The old xpathlite code clearly intended to do this, but failed at it due to looking for the wrong component of an XPATH to fix.) This changes the currency formats in use for
    * all Dutch locales (because nl.xml lists a currency format for arab before the one for latn, and they differ),
    * Punjabi, Urdu - specifically pa_Guru_IN, ur_Arab_PK (both like Dutch, arabext before latn; which is correct for pa_Arab_PK and ur_Arab_IN),
    * Sindhi (whose over-ride of latn currency format we were using, where we should be using arab's format, supplied by root's default),
    * Tatar (which specifies a generic currency format, which we were using, before one specific to latn, which we now use),
    * Tongan (same as Dutch),
    * Konkani (like Dutch, deva before latn) and
    * several North African Arabic locales (whose default number system is latn, rather than arab, but previously used arab's formats).
  Task-number: QTBUG-79902
  Change-Id: I18d8ec16bfd3a516d1bcd2f63bc7f7f15179a3f4
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
* Rework cldr2qlocalexml.py's reading of CLDR data (Edward Welbourne, 2020-04-02; 1 file, -18/+432)
  Move the code out to a CldrReader class in cldr.py, expand CldrAccess with the facilities that class needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant.
  This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits.
  Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR.
  Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely.
  Task-number: QTBUG-81344
  Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
* Move cldr2qtimezone.py's CLDR-reading to a CldrAccess class (Edward Welbourne, 2020-04-02; 1 file, -0/+140)
  This begins the process of replacing xpathlite.py, adding low-level DOM-access classes to ldml.py and the CldrAccess class to cldr.py.
  Moved a format comment from cldr2qtimezone.py's doc-string to the method of CldrAccess that does the actual reading.
  Task-number: QTBUG-81344
  Change-Id: I46ae3f402f8207ced6d30a1de5cedaeef47b2bcf
  Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>