path: root/src/corelib/text/qunicodetables.cpp
Commit message | Author | Age | Files | Lines
* Revise UCD-generated data files' SPDX headers | Edward Welbourne | 2024-04-22 | 1 | -1/+1
  The existing data comes under Unicode-DFS-2016 but future updates shall come under Unicode-3.0, so update the existing headers with the former and the generator script with the latter. Leave a note in the attribution file about this transitional state and how to resolve it.

  Replaced UNICODE_LICENSE.txt from src/corelib/text/ with LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download. This doesn't look like a rename but only actually adds some irrelevant lines about where on the Unicode website the upstream files (to which we do not apply this license) come from and changes some spacing.

  Pick-to: 6.7 6.5
  Fixes: QTBUG-121653
  Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a
  Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
  Reviewed-by: Kai Köhne <kai.koehne@qt.io>
* Unicode line breaking: Implement rules LB15a and LB15b | Ievgenii Meshcheriakov | 2024-02-08 | 1 | -3123/+3123
  The new rules were added in Unicode 15.1 (TR #14, revision 51). The rules read:

    LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW) [\p{Pi}&QU] SP* ×
    LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | ZW | eot)

  Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to represent quotation characters with context that matches left side of LB15a and right side of LB15b respectively. This way it is still possible to use the line breaking classes table.

  Also add a comment about the original source of the line break table.

  Task-number: QTBUG-121529
  Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
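A minimal sketch of how a quotation mark might be promoted to a context-resolved class so that the pair table can still be used. The enum, its reduced set of classes and both function names are hypothetical and only illustrate the left side of LB15a; this is not Qt's actual code.

```cpp
// Hypothetical, reduced set of line breaking classes for illustration only.
// SOT stands for "start of text"; QU_Pi is the context-resolved class.
enum class LB { SOT, BK, CR, LF, NL, OP, QU, QU_Pi, GL, SP, ZW, AL };

// True if `prev` appears on the left side of LB15a:
// (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW).
// A previously promoted quote is still a quotation mark, so it counts as QU.
constexpr bool matchesLB15aLeft(LB prev)
{
    switch (prev) {
    case LB::SOT: case LB::BK: case LB::CR: case LB::LF: case LB::NL:
    case LB::OP: case LB::QU: case LB::QU_Pi: case LB::GL: case LB::SP:
    case LB::ZW:
        return true;
    default:
        return false;
    }
}

// Resolve the class of a QU character: promote it to QU_Pi only when it is
// initial punctuation (\p{Pi}) and its left context satisfies LB15a.
constexpr LB resolveQuote(LB prev, bool isInitialPunctuation)
{
    return (isInitialPunctuation && matchesLB15aLeft(prev)) ? LB::QU_Pi : LB::QU;
}

static_assert(resolveQuote(LB::SP, true) == LB::QU_Pi, "space, then opening quote");
static_assert(resolveQuote(LB::AL, true) == LB::QU, "letter, then opening quote");
```

With the context folded into the class, a pair-table lookup keyed on QU_Pi can forbid the break after the quote without any further look-behind.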
* unicode: Import version 15.1 (UCD version 32) | Ievgenii Meshcheriakov | 2024-02-08 | 1 | -4384/+4458
  Add enumerator for the new Unicode version to QChar::UnicodeVersion.

  Remap new line breaking classes to their Unicode 15.0 values:
  * AK, AP and AS to AL,
  * VI and VF to CM.
  These are classes for new line breaking support for Indic scripts that require more work.

  Blacklist failing tests for now:
  * tst_QUrlUts46::idnaTestV2
  * tst_QTextBoundaryFinder::lineBoundariesDefault
  * tst_QTextBoundaryFinder::graphemeBoundariesDefault

  Regenerate the source files.

  Task-number: QTBUG-121529
  Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
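The remapping described above can be sketched as a small pure function. The enum and the function name are hypothetical; only the mapping itself (AK, AP, AS to AL; VI, VF to CM) comes from the commit message.

```cpp
// Hypothetical subset of line breaking classes, for illustration only.
enum class LineBreak { AL, CM, AK, AP, AS, VI, VF };

// Map the classes introduced in Unicode 15.1 back to the values the same
// characters had in Unicode 15.0; every other class passes through unchanged.
constexpr LineBreak remapToUnicode15_0(LineBreak lb)
{
    switch (lb) {
    case LineBreak::AK:
    case LineBreak::AP:
    case LineBreak::AS:
        return LineBreak::AL;
    case LineBreak::VI:
    case LineBreak::VF:
        return LineBreak::CM;
    default:
        return lb;
    }
}
```

Applying such a remap when the tables are generated would keep the run-time algorithm unaware of the new classes until it is taught the Indic rules.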
* Update Unicode data version string | Ievgenii Meshcheriakov | 2024-01-25 | 1 | -1/+1
  This amends c4e550703c2bdc1ee710507b8df9c0c9a118402e. The data version update was just forgotten when updating to Unicode 15.0.

  Pick-to: 6.5 6.6 6.7
  Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128
  Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Update UCD to Revision 30 | Ievgenii Meshcheriakov | 2022-10-11 | 1 | -5285/+5587
  This corresponds to Unicode version 15.0.0.

  Added the following scripts:
  * Kawi
  * Nag Mundari

  Full support of these scripts requires harfbuzz version 5.2.0, this version adds support for Unicode 15.0: https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0

  Fixes: QTBUG-106810
  Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Core: make Unicode Database constexpr | Yuhang Zhao | 2022-05-26 | 1 | -11/+11
  Task-number: QTBUG-100485
  Pick-to: 6.3 6.2
  Change-Id: I41480a34b14fd86a68a5c10b7e0f3d250e785d0f
  Reviewed-by: Marc Mutz <marc.mutz@qt.io>
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: Extract EastAsianWidth property | Ievgenii Meshcheriakov | 2022-05-24 | 1 | -9758/+9896
  This property is needed to properly implement the line breaking algorithm from UAX #14.

  Task-number: QTBUG-97537
  Pick-to: 6.3
  Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: Remove obsolete word break classes | Ievgenii Meshcheriakov | 2022-05-24 | 1 | -5/+5
  Remove E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier obsoleted by UTS #29, version 33 (Unicode 11.0.0).

  Task-number: QTBUG-97537
  Pick-to: 6.2 6.3
  Change-Id: If5dc36ae17cd8746bbe81b73bbcc0863181e4a7a
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Use SPDX license identifiers | Lucie Gérard | 2022-05-16 | 1 | -38/+2
  Replace the current license disclaimer in files by a SPDX-License-Identifier. Files that have to be modified by hand are modified. License files are organized under LICENSES directory.

  Task-number: QTBUG-67283
  Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
  Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
* Update UCD to Revision 28 | Ievgenii Meshcheriakov | 2021-10-18 | 1 | -6588/+7011
  This corresponds to Unicode version 14.0.0.

  Added the following scripts:
  * CyproMinoan
  * OldUyghur
  * Tangsa
  * Toto
  * Vithkuqi

  Full support of these scripts requires harfbuzz version 3.0.0, this version adds support for Unicode 14.0: https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0

  With this release 10 test cases in tst_qurluts46 were fixed, one additional test case is failing in tst_qtextboundaryfinder and is commented out. In total 62 line break test cases and 44 word break test cases are failing.

  A comment in src/corelib/text/qt_attribution.json was updated to include the URL of the page containing UCD version number.

  Fixes: QTBUG-94359
  Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* unicode: Regenerate qunicodetables{.cpp,_p.h} | Ievgenii Meshcheriakov | 2021-09-03 | 1 | -5762/+5845
  Run unicode utility to regenerate the Unicode tables. This reduces size of the IDNA mapping tables. Adjust the QUrl client code to use the new API.

  Task-number: QTBUG-85323
  Change-Id: Iaa8d6932e611f7aa4009a3fae2972de87b875cf8
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* unicode: Regenerate Unicode tables | Ievgenii Meshcheriakov | 2021-08-26 | 1 | -9404/+15286
  Re-run unicode utility to update the Unicode tables. This adds properties and mappings needed to implement UTS #46 (IDNA).

  Task-number: QTBUG-85323
  Change-Id: Id1de91caddd82095f8f8f2301bfd7bb2ee3fcafd
  Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
* Unicode: fix the extended grapheme cluster algorithm | Giuseppe D'Angelo | 2021-04-16 | 1 | -6230/+6340
  UAX #29 in Unicode 11 changed the EGC algorithm to its current form. Although Qt has upgraded the Unicode tables all the way up to Unicode 13, the algorithm has never been adapted; in other words, it has been working by chance for years. Luckily, MOST of the cases were dealt with correctly, but emoji handling actually manages to break it.

  This commit:
  * Adds parsing of emoji-data.txt into the unicode table generator. That is necessary to extract the Extended_Pictographic property, which is used by the EGC algorithm.
  * Regenerates the tables.
  * Removes some obsoleted grapheme cluster break properties, and adds the ones added in the meanwhile.
  * Rewrites the EGC algorithm according to Unicode 13. This is done by simplifying a lot the lookup table. Some rules (GB11, GB12, GB13) can't be done by the table alone so some hand-rolled code is necessary in that case.
  * Thanks to these fixes, the complete upstream GraphemeBreakTest now passes. Remove the "edited" version that ignored some rows (because they were failing).

  Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
  Pick-to: 6.1 6.0 5.15
  Fixes: QTBUG-92822
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* Use checked string iteration in case conversions | Edward Welbourne | 2020-08-29 | 1 | -0/+1
  The Unicode table code can only be safely called on valid code-points. So code that calls it must only pass it valid Unicode data. The string iterator's Unchecked methods only provide this guarantee when the string being iterated is guaranteed to be valid UTF-16; while client code should only use QString, QStringView and friends on valid UTF-16 data, we have no way to be sure they have respected that. So take the few extra cycles to actually check validity in the course of iterating strings, when the resulting code-points are to be passed to the Unicode table look-ups.

  Add tests that case mapping doesn't access Unicode tables out of range (it'll trigger the new assertion). Added some comments to qchar.h that helped me understand surrogates.

  Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
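A minimal sketch of what "checked" iteration over UTF-16 means, assuming that an unpaired surrogate is reported as U+FFFD. The function names are hypothetical and this is a standalone illustration, not Qt's string iterator.

```cpp
#include <cassert>
#include <cstddef>

constexpr bool isHighSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
constexpr bool isLowSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Decode the code point starting at s[pos] and advance pos past it.
// A surrogate without a matching partner yields U+FFFD, so the caller
// never receives a surrogate value to feed into a table look-up.
inline char32_t nextChecked(const char16_t *s, std::size_t len, std::size_t &pos)
{
    assert(pos < len);
    const char16_t c = s[pos++];
    if (!isHighSurrogate(c) && !isLowSurrogate(c))
        return c; // ordinary BMP code unit
    if (isHighSurrogate(c) && pos < len && isLowSurrogate(s[pos])) {
        const char16_t low = s[pos++];
        return ((char32_t(c) - 0xD800) << 10) + (char32_t(low) - 0xDC00) + 0x10000;
    }
    return 0xFFFD; // lone or misordered surrogate
}
```

An unchecked variant would skip the `isLowSurrogate(s[pos])` test and combine whatever follows a high surrogate, which is only safe when the input is known to be valid UTF-16.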
* Inline two macros in the unicode tables | Edward Welbourne | 2020-08-12 | 1 | -10/+6
  They were only used by one function each, in unicodetables.cpp, so don't need to be macros.

  Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Tidy up unicode table generation | Edward Welbourne | 2020-08-05 | 1 | -11/+8
  Eliminate some needless parentheses, tidy up some spacing and indentation and split some long lines. Change first += after declaration to initializer.

  Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QChar/QString: centralize case folding in qchar.cpp | Marc Mutz | 2020-05-09 | 1 | -0/+2
  There are (at least) two implementations of the low-level case-folding algorithm, one of which (for QChar::toLower()) seems to be wrong (it doesn't deal with special cases which expand to more than one code point).

  The algorithm hidden in QString and entangled with the QString detaching code makes reusing the code much harder.

  At the same time, the dependency of the algorithm on the unicode tables makes exposing a non-allocating result type in the public API hard. std::u16string would be an alternative if we can assure that all implementations use SSO with at least four characters.

  So, for the time being, leave this as internal API for use in an upcoming QStringView::toLower() as well as case-insensitive hashing.

  Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* QUnicodeTables: port to charNN_t | Marc Mutz | 2020-04-27 | 1 | -18/+8
  This makes existing calls passing uint or ushort ambiguous, so fix all the callers. There do not appear to be callers outside QtBase.

  In fact, the ...BreakClass() functions appear to be utterly unused.

  Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* QChar: port low-level functions from uint/ushort to char32/16_t | Marc Mutz | 2020-04-24 | 1 | -0/+10
  Now that the standard gives us proper types for UTF-16 and UTF-32 characters, use them. Will eventually make the code much easier to read than today, where uint could be an index as well as a char32_t.

  It also ensures that the result of e.g. QChar::highSurrogate() can still be implicitly converted to a QChar now that the QChar(non-character-integral-types) ctors are being made explicit.

  [ChangeLog][QtCore][QChar] All low-level functions (e.g. highSurrogate()) now take and return char16_t instead of ushort and char32_t instead of uint.

  Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
  Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
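To illustrate the typing this commit describes, here is a standalone sketch of surrogate helpers that take char32_t and return char16_t. These are free functions written for illustration (the arithmetic is the standard UTF-16 encoding); they are not copied from QChar.

```cpp
#include <type_traits>

// Code points above the BMP need two UTF-16 code units.
constexpr bool requiresSurrogates(char32_t ucs4) { return ucs4 >= 0x10000; }

// Leading (high) surrogate: 0xD800 + ((ucs4 - 0x10000) >> 10),
// folded into a single addition since 0xD800 - (0x10000 >> 10) == 0xD7C0.
constexpr char16_t highSurrogate(char32_t ucs4)
{
    return char16_t((ucs4 >> 10) + 0xD7C0);
}

// Trailing (low) surrogate: 0xDC00 plus the low ten bits.
constexpr char16_t lowSurrogate(char32_t ucs4)
{
    return char16_t(ucs4 % 0x400 + 0xDC00);
}

// The return type states that the result is a UTF-16 code unit, not an index.
static_assert(std::is_same<decltype(highSurrogate(U'a')), char16_t>::value,
              "highSurrogate() returns a UTF-16 code unit");
static_assert(highSurrogate(U'\U0001F600') == 0xD83D, "high half of U+1F600");
static_assert(lowSurrogate(U'\U0001F600') == 0xDE00, "low half of U+1F600");
```

With distinct character types in the signatures, a caller passing a plain integer index by mistake stands out in review, which is the readability gain the message refers to.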
* Update UCD to Revision 26 | Edward Welbourne | 2020-03-14 | 1 | -6336/+6731
  Include WordBreakTest.html, since a test uses sample strings from it, albeit without actually reading the file. Had to comment out more of the new tests, as at Revision 24, pending an update to harfbuzz and the text boundary detection code.

  Task-number: QTBUG-79631
  Task-number: QTBUG-79418
  Task-number: QTBUG-82747
  Change-Id: I0082294b09d67ffdc6a9b5c15acf77ad3b86f65f
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
* Update UCD data to Unicode 12.1.0's Revision 24 | Edward Welbourne | 2019-10-30 | 1 | -7341/+7849
  Had to teach the update program to accept category Lm as for Joining_Transparent, for the sake of a new ArabicShaping.txt entry. Added three new Unicode versions, several new scripts and a new word-break class.

  Updated UCD's test data for tst_QTextBoundaryFinder. This left 57 tests failing; I have commented out the data rows for those tests, pending someone with more knowledge addressing this.

  Task-number: QTBUG-79631
  Task-number: QTBUG-79418
  Change-Id: Ic33d3b3551195d47a84d98e84020f57a68f0b201
  Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
* QUnicodeTables: use array for case folding tables | Marc Mutz | 2019-09-04 | 1 | -2652/+2652
  Instead of four pairs of :1 :15 bit fields, use an array of four :1, :15 structs. This allows to replace the case folding traits classes with a simple enum that indexes into said array.

  I don't know what the WASM #ifdef'ed code is supposed to effect (a :0 bit-field is only useful to separate adjacent bit-field into separate memory locations for multi-threading), but I thought it safer to leave it in, and that means the array must be a 64-bit block of its own, so I had to move two fields around.

  Saves ~4.5KiB in text size on optimized GCC 10 LTO Linux AMD64 builds.

  Change-Id: Ib52cd7706342d5227b50b57545d073829c45da9a
  Reviewed-by: Lars Knoll <lars.knoll@qt.io>
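A minimal sketch of the layout this commit describes: an array of four {:1, :15} structs indexed by an enum, so one function serves all four conversions. All names and the sample table entry are assumptions made for illustration, and entries flagged as special are simply left unchanged here.

```cpp
#include <cstdint>

// Enum that indexes into the per-character array of conversions.
enum Case { LowerCase, UpperCase, TitleCase, CaseFold, NumCases };

struct CaseConversion {
    std::uint16_t special : 1; // mapping needs separate handling (not sketched)
    std::int16_t diff : 15;    // signed offset to the converted code point
};

struct Properties {
    CaseConversion cases[NumCases];
};

// One function for every kind of conversion; the enum picks the array slot.
constexpr char32_t convertCase(char32_t c, const Properties &p, Case which)
{
    return p.cases[which].special ? c : char32_t(c + p.cases[which].diff);
}

// Made-up entry for U+0041 'A': lower-casing and case-folding add 32.
constexpr Properties latinCapitalA = {{ {0, 32}, {0, 0}, {0, 0}, {0, 32} }};

static_assert(convertCase(U'A', latinCapitalA, LowerCase) == U'a', "A lowers to a");
static_assert(convertCase(U'A', latinCapitalA, UpperCase) == U'A', "A is already upper");
```

Compared with one traits class per conversion, passing the enum as an ordinary argument lets the compiler share a single body of code, which is a plausible source of the text-size saving quoted above.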
* QUnicodeTables: pack Properties struct | Marc Mutz | 2019-09-04 | 1 | -2638/+2638
  GCC doesn't like the sequence : 5 : 5 : 8 : 6 : 8 and inserts a :6 padding between the :5 and the :8 and a :2 padding between the :6 and the :8, growing the bitfield by 8 bits of embedded padding and another byte to bring the struct back to sizeof % 2 == 0.

  Fix by reshuffling the elements and adding a static_assert for the next round.

  Saves ~5KiB in QtCore executable size.

  Change-Id: I4758a6f48ba389abc2aee92f60997d42ebb0e5b8
  Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
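The padding effect can be reproduced with a reduced struct. The field names are hypothetical and bit-field layout is implementation-defined, so the sizes below are what common ABIs (GCC and Clang on x86-64, for instance) are expected to produce rather than a guarantee.

```cpp
#include <cstdint>

// Fields in the order 5, 5, 8, 6, 8: a bit-field does not straddle a 16-bit
// unit, so the first :8 is pushed past 6 unused bits and the second :8 past
// 2 unused bits, as the commit message describes.
struct Loose {
    std::uint16_t a : 5;
    std::uint16_t b : 5;
    std::uint16_t c : 8;
    std::uint16_t d : 6;
    std::uint16_t e : 8;
};

// Same fields reshuffled so each 16-bit unit is filled exactly:
// 8 + 8 in the first unit, 5 + 5 + 6 in the second.
struct Packed {
    std::uint16_t c : 8;
    std::uint16_t e : 8;
    std::uint16_t a : 5;
    std::uint16_t b : 5;
    std::uint16_t d : 6;
};

// Guard for "the next round": adding or reordering fields trips this check.
static_assert(sizeof(Packed) == 4, "Packed must stay at 32 bits");
```

Because the table holds one such struct per distinct property set, a couple of bytes saved per entry adds up across the whole array.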
* Move text-related code out of corelib/tools/ to corelib/text/ | Edward Welbourne | 2019-07-10 | 1 | -0/+13446
  This includes byte array, string, char, unicode, locale, collation and regular expressions.

  Change-Id: I8b125fa52c8c513eb57a0f1298b91910e5a0d786
  Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>