summaryrefslogtreecommitdiffstats
path: root/src/corelib/tools/qunicodetools.cpp
Commit message (Collapse)AuthorAgeFilesLines
* Update Text segmentation and line break data to Unicode 10.0Lars Knoll2018-01-031-91/+175
| | | | | | | | Also adjusted the text segmentation and line break algorithms so that they can handle the new data, and pass the test suite. Change-Id: Ib727fd80003e34e96458d7a681996de3fa3691e7 Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
* Fix libs build with msvc on Chinese locale on WindowsLiang Qi2017-02-011-1/+1
| | | | | | | | | | | | | | | | | Chinese locale means Code Page 936 here. It's also related with removing C4819 warnings. And it's also following Conventions in Qt source code: All code is ascii only (7-bit characters only, run man ascii if unsure) See also http://wiki.qt.io/Coding_Conventions Task-number: QTBUG-56155 Task-number: QTBUG-58161 Change-Id: I37fa7a0e6a82a16eaf80e1cc99be801099ab87de Reviewed-by: Friedemann Kleint <Friedemann.Kleint@qt.io> Reviewed-by: jian liang <jianliang79@gmail.com> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Support C++17 fallthrough attributeAllan Sandfeld Jensen2016-08-191-2/+2
| | | | | | | | | Replaces our mix of comments for annotating intended absence of break in switches with the C++17 attribute [[fallthrough]], or its earlier a clang extension counterpart. Change-Id: I4b2d0b9b5e4425819c7f1bf01608093c536b6d14 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Improve the script itemization algorithm to match Unicode 8.0Konstantin Ritt2016-04-271-47/+24
| | | | | | | | | | | | | | | Override preceding Common-s with a subsequent non-Inherited, non-Common script. This produces longer script runs, which automagically improves the shaping quality (as we don't lose the context anymore), the shaping performance (as we're typically shape a fewer runs), and the fallback font selection (when the font supports more than just a single language/script). Task-number: QTBUG-29930 Change-Id: I1c55af30bd397871d7f1f6e062605517f5a7e5a1 Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com> Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@qt.io>
* Updated license headersJani Heikkinen2016-01-151-14/+20
| | | | | | | | | | | From Qt 5.7 -> LGPL v2.1 isn't an option anymore, see http://blog.qt.io/blog/2016/01/13/new-agreement-with-the-kde-free-qt-foundation/ Updated license headers to use new LGPL header instead of LGPL21 one (in those files which will be under LGPL v3) Change-Id: I046ec3e47b1876cd7b4b0353a576b352e3a946d9 Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
* Update Unicode data & algorithms up to v8.0Konstantin Ritt2015-11-051-24/+37
| | | | | | | | | | | | | | | | | | | * Georgian lari currency symbol * A large collection of CJK unified ideographs * Emoji symbols and symbol modifiers * Letters to support the Ik language in Uganda, Kulango in the Côte d’Ivoire, and other languages of Africa * A set of lowercase Cherokee syllables, forming case pairs with the existing Cherokee characters * The Ahom script for support of the Tai Ahom language in India * Arabic letters to support Arwi—the Tamil language written in the Arabic script For more details, see http://www.unicode.org/versions/Unicode8.0.0/ [ChangeLog][QtCore] Unicode data updated to v.8.0 Change-Id: If255f95c9c45655b721369a116299da3cabbba0a Reviewed-by: Lars Knoll <lars.knoll@theqtcompany.com>
* Update Unicode data up to v7.0Konstantin Ritt2015-03-271-3/+3
| | | | | | | | | | | | | | | | | | | | | | * Two newly adopted currency symbols: the Azerbaijan manat and the Russia ruble * Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets * Twenty-three new lesser-used and historic scripts extending support for written languages of North America, China, India, other Asian countries, and Africa * Letters used in Teuthonista and other transcriptional systems, and a new notational set, Duployan For more details, see http://www.unicode.org/versions/Unicode7.0.0/ The Properties struct's .*Diff members were narrowed down to signed 15 bits and the unicodeVersion has been expanded to 8 bits. [ChangeLog][QtCore] Unicode data updated to v.7.0 Change-Id: I93ab6f79fa3b05f61abc7279f1d046834c1c1a0b Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Update copyright headersJani Heikkinen2015-02-111-7/+7
| | | | | | | | | | | | | | | | | | Qt copyrights are now in The Qt Company, so we could update the source code headers accordingly. In the same go we should also fix the links to point to qt.io. Outdated header.LGPL removed (use header.LGPL21 instead) Old header.LGPL3 renamed to header.LGPL3-COMM to match actual licensing combination. New header.LGPL-COMM taken in the use file which were using old header.LGPL3 (src/plugins/platforms/android/extract.cpp) Added new header.LGPL3 containing Commercial + LGPLv3 + GPLv2 license combination Change-Id: I6f49b819a8a20cc4f88b794a8f6726d975e8ffbe Reviewed-by: Matti Paaso <matti.paaso@theqtcompany.com>
* Update license headers and add new license filesMatti Paaso2014-09-241-19/+11
| | | | | | | | | - Renamed LICENSE.LGPL to LICENSE.LGPLv21 - Added LICENSE.LGPLv3 - Removed LICENSE.GPL Change-Id: Iec3406e3eb3f133be549092015cefe33d259a3f2 Reviewed-by: Iikka Eklund <iikka.eklund@digia.com>
* Fix several regressions in font selectionEskil Abrahamsen Blomfeldt2014-08-201-0/+6
| | | | | | | | | | | | | | | | | | In Qt 5.3.0 a change was added which automatically adapts Common script to surrounding scripts in accordance with the Unicode tr#24. This broke *a lot* of cases of font selection because the font selection algorithm is not prepared for handling characters with adapted scripts. We need to disable this change for now and redo it later with patches to font selection to avoid the regressions. [ChangeLog][Text] Fixed several regressions in font selection when combining different writing systems in the same text. Task-number: QTBUG-39930 Task-number: QTBUG-39860 Change-Id: Id02b5ae2403c06542ed5d81e7c4deb2e0c7d816e Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* Improve the Unicode script itemization implementationKonstantin Ritt2014-04-141-4/+34
| | | | | | | | | | | | | | | | | | Make it closer to the Unicode specs (UAX#24): * Common now inherits the preceding character's script, if any; * In a combining character sequence, if the base character is of Common script, the entire sequence is treated like if it were of the first non-Inherited, non-Common script in the sequence. See http://www.unicode.org/reports/tr24/tr24-21.html for more details. [ChangeLog][QtGui] Fixed regression in arabic text rendering. Task-number: QTBUG-28813 Task-number: QTBUG-29930 (related) Task-number: QTBUG-35836 Change-Id: Id85761965b08ca94c674d5f3613fe58b82b2ce9c Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@digia.com> Reviewed-by: Ahmed Saidi <justroftest@gmail.com>
* Update the Unicode Data and Algorithms up to Unicode 6.3.0Konstantin Ritt2014-01-141-29/+51
| | | | | | | | | | | | | | | | | | | | * Mongolian and Phags-pa characters have been given a Joining_Type classification for contextual shaping. As a part of these additions, one Phags-pa character has the Joining_Type value of L (Left Joining), which no character had been assigned before. * The unassigned code points in the Currency Symbols block have been given the Bidi_Class property value ET and the Line_Break property value PR, to help implementations support new currency symbols, when they are encoded. * Hebrew letters and basic punctuation marks have been assigned the newly introduced Word_Break property values Hebrew_Letter, Single_Quote, and Double_Quote. * The Bidi_Class property has been extended with four new values for directional isolates. For more details, see http://www.unicode.org/versions/Unicode6.3.0/ Change-Id: Iad62d02edc58a8497898dcd6d6c70d5aece317ea Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* initCharAttributes() micro-optimizationKonstantin Ritt2013-08-151-5/+12
| | | | | Change-Id: Id8e275c9b4ae0a9855b8ba8917824c79cde5a919 Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Move Unicode script itemization code from text engine to UnicodeToolsKonstantin Ritt2013-03-141-0/+45
| | | | | | | | | | | This is still the same trivial implementation with the only difference in that that it properly handles surrogate pairs and combining marks. This temporarily makes QTextEngine::itemize() insignificatly slower due to using intermediate buffer, until refactoring is done. Change-Id: I7987d6306b0b5cdb21b837968e292dd70abfe223 Reviewed-by: Eskil Abrahamsen Blomfeldt <eskil.abrahamsen-blomfeldt@digia.com>
* Hide Harfbuzz from the outer worldKonstantin Ritt2013-03-131-2/+1
| | | | | | | | Don't export, don't generate private headers, don't mention HB in API. Change-Id: I048ebd178bf4afaf9fda710a00933b95274cf910 Reviewed-by: Josh Faust <jfaust@suitabletech.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Fix potential issue in QTBF itemization codeKonstantin Ritt2013-03-041-1/+1
| | | | | | | Since 5.0, script is QChar::Script which isn't of 1:1 mapping to HB_Script Change-Id: I2d88f929d7d3c3c994076a4e92ea22370962c41c Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Merge remote-tracking branch 'origin/stable' into devFrederik Gladhorn2013-01-221-1/+1
|\ | | | | | | | | | | | | | | | | | | Conflicts: src/corelib/io/qsavefile_p.h src/corelib/tools/qregularexpression.cpp src/gui/util/qvalidator.cpp src/gui/util/qvalidator.h Change-Id: I58fdf0358bd86e2fad5d9ad0556f3d3f1f535825
| * Update copyright year in Digia's license headersSergio Ahumada2013-01-181-1/+1
| | | | | | | | | | Change-Id: Ic804938fc352291d011800d21e549c10acac66fb Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* | Merge branch 'stable' into devFrederik Gladhorn2013-01-041-1/+2
|\| | | | | | | | | | | | | | | | | | | | | Conflicts: examples/widgets/painting/shared/shared.pri src/corelib/tools/qharfbuzz_p.h src/corelib/tools/qunicodetools.cpp src/plugins/platforms/windows/accessible/qwindowsaccessibility.cpp src/plugins/platforms/windows/qwindowsfontdatabase.cpp Change-Id: Ibc9860abf570e5ce8b052fb88feb73ec35e64bd3
| * properly syncqt-ize harfbuzz headersOswald Buddenhagen2012-12-041-1/+1
| | | | | | | | | | | | | | | | | | we were already installing them into QtCore/private, so turn them into proper private headers to start with. this cleans up our project files. Change-Id: I0795f79e03b60b5854de9e4dc339e9b5a5e6fd87 Reviewed-by: Lars Knoll <lars.knoll@digia.com> Reviewed-by: Joerg Bornemann <joerg.bornemann@digia.com>
* | Update Qt internals to use QChar::ScriptKonstantin Ritt2012-12-211-3/+3
|/ | | | | | | | | | | | | | ...and remove the outdated QUnicodeTables::Script enum. QFontEngineData now has one extra slot that never used (engines[QChar::Script_Inherited]). engines[QChar::Script_Unknown], if accessed, would be set with a Box engine instance, and could be used as a minor optimization some time later. In order to preserve the existing behavior, we map all scripts up to Latin to Common. Change-Id: Ide4182a0f8447b4bf25713ecc3fe8097b8fed040 Reviewed-by: Pierre Rossi <pierre.rossi@gmail.com> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* QTBF: Fix issue with no splitting the words at "." (FULL STOP)Konstantin Ritt2012-11-231-0/+12
| | | | | | | | | | | | As of Unicode 5.1, some punctuation marks were mapped to MidLetter and MidNumLet for better URL and abbreviations handling which caused "hi.there" to be treated like if it were just a single word; until we have the Unicode Text Segmentation tailoring mechanism, retain the old behavior by remapping (some of) those characters back to their old values. Change-Id: I49dea6064f2ea40a82fc0b1bc3c4f0b4e803919f Reviewed-by: David Faure <david.faure@kdab.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* QTextBoundaryFinder: Introduce BoundaryReason::MandatoryBreak flagKonstantin Ritt2012-10-101-3/+3
| | | | | | | | | | that will be returned by boundaryReasons() when the boundary finder is at the line end position (CR, LF, NewLine Function, End of Text, etc.). The MandatoryBreak flag, if set, means the text should be wrapped at a given position. Change-Id: I32d4f570935d2e015bfc5f18915396a15f009fde Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Update the Unicode Data and Algorithms up to Unicode 6.2Konstantin Ritt2012-10-091-58/+63
| | | | | | | | | | | | | Version 6.2 of the Unicode Standard is a special release dedicated to the early publication of the newly encoded Turkish lira sign. In addition, there are some significant changes to the Unicode algorithms for text segmentation and line breaking to improve breaking for emoji symbols. For more details, see http://www.unicode.org/versions/Unicode6.2.0/ Change-Id: I21cfd4f307e41b41a19d36cce87f7a44c2661bc2 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* QCharAttributes: add wordStart/wordEnd flagsKonstantin Ritt2012-09-261-2/+32
| | | | | | | | | | | | | | | A simple heuristic is used to detect the word beginning and ending by looking at the word break property value of surrounding characters. This behaves better than the white-spaces based implementation used before and makes it possible to tailor the default algorithm for complex scripts. BIG FAT WARNING: The QCharAttributes buffer now has to have a length of string length + 1 for the flags at end of text. Task-Id: QTBUG-6498 Change-Id: I5589b191ffde6a50d2af0c14a00430d3852c67b4 Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* Change copyrights from Nokia to DigiaIikka Eklund2012-09-221-24/+24
| | | | | | | | Change copyrights and license headers from Nokia to Digia Change-Id: If1cc974286d29fd01ec6c19dd4719a67f4c3f00e Reviewed-by: Lars Knoll <lars.knoll@digia.com> Reviewed-by: Sergio Ahumada <sergio.ahumada@digia.com>
* A step out from Harfbuzz (reduce dependency)Konstantin Ritt2012-09-221-27/+49
| | | | | | | | | | | | | | | | | Introduce QCharAttributes and use it instead of HB_CharAttributes everywhere in Qt (in Harfbuzz, the HB_CharAttributes is only used in the text segmentation algorithm which has been moved from HB to Qt (well, most of it)). Rename some members to better reflect their meaning, remember to keep HB_CharAttributes in sync with QCharAttributes. Also replace HB_ScriptItem with a (temporary) QUnicodeTools::ScriptItem struct that will be replaced with a more efficient/friendly solution a bit later. The soft hyphen and the mandatory break detection has been factored out of the default text breaking algorithm to a higher level in order to refactor the QCharAttributes bitfields and to optimize the implementation for the common case. Change-Id: Ieb365623ae954430f1c8b2dfcd65c82973143eec Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Add QChar::SoftHyphen enum valueKonstantin Ritt2012-07-031-1/+1
| | | | | | | | | Just like for the QChar::ByteOrderMark, `ch == QChar::SoftHyphen` is much more readable than `ch == 0x00ad // (soft-hyphen)`, etc. Change-Id: I9c85f14cfd979037d35103c3259a435fd729b869 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Oswald Buddenhagen <oswald.buddenhagen@nokia.com>
* QUnicodeTables: some internal API renamingsKonstantin Ritt2012-06-221-23/+23
| | | | | | | | | | | | enums GraphemeBreak, WordBreak, and SentenceBreak has been renamed to GraphemeBreakClass, WordBreakClass, and SentenceBreakClass respectively, their values has been renamed to contain a '_' as logical enum-value separator (just like many other nums in Qt, e.g. LineBreakClass); *BreakFormat has been replaced with *Break_Extend (some format characters are kind of subtype of the extender characters, not vice versa). Change-Id: I9ddbcf8848da87409736c2d6d1798a62fa28cab8 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Improve the code generation by using Q_LIKELY/Q_UNLIKELYKonstantin Ritt2012-06-161-36/+34
| | | | | | | + reorder conditions in getWordBreaks() to make further updates more clear Change-Id: I1ca9adde066c3a48830f310202f7181585fac194 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Line Breaking Algorithm: handle the Object Replacement CharacterKonstantin Ritt2012-06-101-30/+31
| | | | | | | | See http://www.unicode.org/reports/tr14/#CB and http://www.unicode.org/reports/tr14/#LB20 for details Change-Id: Ice0aa2b2ce81f6e39839a353240420436eddd754 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Line Breaking Algorithm: don't break inside numeric expressionsKonstantin Ritt2012-06-101-6/+107
| | | | | Change-Id: I8362663454e4c6604ecb6289ae8009d47c78aeb1 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Update the Unicode Text Breaking Algorithm implementationKonstantin Ritt2012-06-101-281/+326
| | | | | | | | | | | | | | | | | | to make it conformant to the Unicode 6.1 specifications #14 and #29. The most important changes are: * The implementation has been reworked from scratch to fix all known bugs; * Separate-out the grapheme and the line breaking implementation to eliminate an overhead due to calculating unnecessary breaks; * Stop using deprecated SG class in favor of resolving pairs of surrogates; * A proper support for SMP code points; * Support for extended grapheme clusters (a drop-in replacement for the legacy grapheme clusters as of Unicode 5.1); * The hardcoded tailoring of UBA has been eliminated which breaks the 7 years-old lineBreaking test. Some later, we'll investigate if such a tailoring is still needed. Change-Id: I9f5867b3cec753b4fc120bc5a7e20f9a73d89370 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Shift positions for lineBreakTypeKonstantin Ritt2012-06-071-1/+4
| | | | | | | | | | | | | | | | | | | to keep them consistent with positions for all other flags. This changes the internal behavior so that attributes[0].lineBreakType now means "break opportunity at start of the text (before the first character in the string)" and is always assigned with HB_NoBreak to conform rule LB2 (see http://www.unicode.org/reports/tr14/#LB2). The current implementation is based on the sample implementation from tr14 that aimed to be as simple as possible rather than to be optimal. From now, we can use pieces of the attributes array "as is" without having to adjust some positions. Or we can analize some long text by chunks (e.g. paragraph by paragraph) and consume less memory. This introduces a minor overhead that will be eliminated shortly. Change-Id: Ic873a05a9d5203b1c3d5aff2e4445a3f034c4bd2 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Set the whiteSpace flag outside the grapheme and the line breaking loopKonstantin Ritt2012-06-071-7/+21
| | | | | | | | | | | | | | | The white spaces determination doesn't belong to the text breaking algorithm. A proper breaking implementation shouldn't assume spaces are break opportunities (actually, space is allowed to be a grapheme base); However, the whiteSpace flag should never be checked alone while iterating over the text to find the space sequence; the grapheme boundaries should always be taken into account. This covers the SMP code points in UTF-16 text and graphemes that consist of a space followed with one or more grapheme extenders. This introduces a minor overhead that would be eliminated some later. Change-Id: Ic2cc7f485631fd0b436fc256ce112ded5f94fc07 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* QTBF: fix the mandatory line breaks not being set in some casesKonstantin Ritt2012-05-301-4/+5
| | | | | | | | | e.g. for the text "Aaa bbb ccc.\r\nDdd eee fff." the \r\n wasn't treated as a hard line break or as a line break opportunity, ever. Quite ancient bug... Change-Id: I8d8497c55a3a4d51c27de99ccfe1e31f3bf4de77 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Introduce QUnicodeToolsKonstantin Ritt2012-05-301-11/+21
| | | | | | | | | | | | | Add QUnicodeTools namespace and rename qGetCharAttributes to initCharAttributes; Make it possible to disable tailoring globally by overriding qt_initcharattributes_default_algorithm_only value (useful for i.e. running the specification conformance tests); This is mostly a preparation step for the upcoming patches. Change-Id: I783879fd17b63b52d7983e25dad5b820f0515e7f Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* move the default text breaking algorithm impl from HarfBuzz to QtKonstantin Ritt2012-05-101-0/+398
there are several reasons to do this: * text breaking is not a shaper's job; * since the text breaking rules are bound to a specific Unicode version, updating Qt's internal unicode data would require updating the data in HB as well; * makes porting to HurfBuzz-NG some easier Change-Id: I0bbf8e8a343bc074696f4ddf2ae4e7fa32a61629 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>