From be0aa6a9a230dc98994cb65d97b76be7ae695a44 Mon Sep 17 00:00:00 2001
From: Giuseppe D'Angelo
Date: Thu, 15 Apr 2021 14:39:51 +0200
Subject: Unicode: fix the extended grapheme cluster algorithm

UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words, it
has been working by chance for years. Luckily, MOST of the cases were
dealt with correctly, but emoji handling actually manages to break it.

This commit:

* Adds parsing of emoji-data.txt into the unicode table generator.
  That is necessary to extract the Extended_Pictographic property,
  which is used by the EGC algorithm.

* Regenerates the tables.

* Removes some obsolete grapheme cluster break properties and adds
  the ones introduced in the meantime.

* Rewrites the EGC algorithm according to Unicode 13. This is done by
  greatly simplifying the lookup table. Some rules (GB11, GB12, GB13)
  can't be expressed by the table alone, so hand-rolled code is
  necessary for them.

* Thanks to these fixes, the complete upstream GraphemeBreakTest now
  passes. Removes the "edited" version that ignored some rows (because
  they were failing).

Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira
Reviewed-by: Konstantin Ritt
(cherry picked from commit a794c5e287381bd056008b20ae55f9b1e0acf138)
Reviewed-by: Qt CI Bot
Reviewed-by: Lars Knoll
---
 util/unicode/README | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'util/unicode/README')

diff --git a/util/unicode/README b/util/unicode/README
index 2ff8176084..1b6efb237e 100644
--- a/util/unicode/README
+++ b/util/unicode/README
@@ -4,8 +4,8 @@ To update:
 * Find the data (UAX #44, UCD; not the XML version) at
   ftp://www.unicode.org/Public/zipped/$Version/
 * Unpack the zip file; for each file in data/, replace with the new
-  version; find the *BreakProperty.txt in auxiliary/. (These last are
-  only in the zip, not in the web-space's unpacked versions.)
+  version; find the *BreakProperty.txt in auxiliary/ and emoji-data.txt
+  in emoji/.
 * In tst_QTextBoundaryFinder's data/ sub-directory, update its files
   from the auxiliary/ sub-directory of the UCD data.
 * If needed, add an entry to enum QChar::UnicodeVersion for the new
-- 
cgit v1.2.3
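
Note on the emoji-data.txt step: the commit message says the table generator now parses emoji-data.txt to extract the Extended_Pictographic property. The sketch below is not Qt's generator code (which is not shown in this patch); it is a minimal Python illustration that assumes the usual UCD line layout of "code point or range ; property # comment". The names parse_property_ranges, LINE_RE and SAMPLE are hypothetical, and the sample lines only imitate the file's format rather than quoting it.

```python
import re
from typing import List, Tuple

# Assumed line layout: "XXXX" or "XXXX..YYYY", then ";", then a property name.
LINE_RE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_property_ranges(text: str, wanted: str) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) code point ranges carrying `wanted`."""
    ranges = []
    for raw in text.splitlines():
        # Drop the trailing "# ..." comment, then skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None or m.group(3) != wanted:
            continue
        first = int(m.group(1), 16)
        # A single code point is treated as a one-element range.
        last = int(m.group(2), 16) if m.group(2) else first
        ranges.append((first, last))
    return ranges


# Illustrative lines in the assumed format; not copied from the real file.
SAMPLE = """\
# hypothetical excerpt
00A9          ; Extended_Pictographic # single code point
1F600..1F64F  ; Emoji                 # a different property, ignored
1F000..1F0FF  ; Extended_Pictographic # range entry
"""

if __name__ == "__main__":
    for first, last in parse_property_ranges(SAMPLE, "Extended_Pictographic"):
        print(f"{first:04X}..{last:04X}")
```

A real generator would presumably merge these ranges into its per-code-point property table; returning plain (first, last) tuples keeps the sketch independent of any particular table layout.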