summaryrefslogtreecommitdiffstats
path: root/src/corelib/codecs/qutfcodec.cpp
Commit message (Collapse)AuthorAgeFilesLines
* Update license headers and add new license filesMatti Paaso2014-09-241-19/+11
| | | | | | | | | - Renamed LICENSE.LGPL to LICENSE.LGPLv21 - Added LICENSE.LGPLv3 - Removed LICENSE.GPL Change-Id: Iec3406e3eb3f133be549092015cefe33d259a3f2 Reviewed-by: Iikka Eklund <iikka.eklund@digia.com>
* Merge remote-tracking branch 'origin/stable' into devJ-P Nurmi2014-06-051-11/+8
|\ | | | | | | | | | | | | | | | | | | Conflicts: mkspecs/features/qt.prf src/plugins/platforms/xcb/qxcbwindow.h src/tools/qdoc/qdocindexfiles.cpp src/widgets/kernel/qwidget_qpa.cpp Change-Id: I214f57b03bc2ff86cf3b7dfe2966168af93a5a67
| * UTF-8: always store the SIMD result, even if invalidThiago Macieira2014-05-271-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ASCII content, this improves the throughput because the conditional is no longer on the codepath to storing, so the processor can perform the store at the same time as it's doing the movemask operation. However, the gain is mostly theoretical: benchmarking with mostly ASCII content shows the algorithm running within 0.5% of the previous result (which is noise). For non-ASCII content, we're comparing the cost of doing a 16-byte store (which may be completely overwritten) with the loop copying and shifting left. Benchmarking shows a slight gain of a few percent. Change-Id: I28ef0021dffc725a922c539cc5976db367f36e78 Reviewed-by: Allan Sandfeld Jensen <allan.jensen@digia.com>
* | Merge remote-tracking branch 'origin/stable' into devSimon Hausmann2014-05-221-8/+36
|\| | | | | | | Change-Id: Ia36e93771066d8abcf8123dbe2362c5c9d9260fc
| * Fix stateful handling of invalid UTF-8 straddling buffer bordersThiago Macieira2014-05-131-8/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a UTF-8 sequences is too short, QUtf8Functions::fromUtf8 returns EndOfString. If the decoder is stateful, we must save the state and then restart it when more data is supplied. The new stateful decoder (8dd47e34b9b96ac27a99cdcf10b8aec506882fc2) mishandled the Error case by advancing the src pointer by a negative number, thus causing a buffer overflow (the issue of the task). And it also did not handle the len == 0 case properly, though neither did the older decoder. Task-number: QTBUG-38939 Change-Id: Ie03d7c55a04e51ee838ccdb3a01e5b989d8e67aa Reviewed-by: Kai Koehne <kai.koehne@digia.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* | Improve a few string operations with AVX2Thiago Macieira2014-05-211-15/+32
|/ | | | | | | | | | | | | | AVX2 brings the new PMOVZXBW instruction that extends from one 128-bit SSE register to an 256-bit AVX register. With that, the main decoding code is just two instructions (the loop requires a couple more to maintain the offset counter and do the end-of-loop check). This buys us another 4% performance improvement in the fromLatin1 code, calculated on top of the VEX-encoded SSE2 code (which is already a little better than plain SSE2). Change-Id: I675fa24de4fa97683b662f19d146047251f77359 Reviewed-by: Allan Sandfeld Jensen <allan.jensen@digia.com>
* Restore handling of BOMs in QString::fromUtf8Thiago Macieira2014-04-241-15/+29
| | | | | | | | | | | 8dd47e34b9b96ac27a99cdcf10b8aec506882fc2 removed the handling of the BOMs but did not document it. This brings the behavior back and adds a unit test so we don't break it again. Discussed-on: http://lists.qt-project.org/pipermail/development/2014-April/016532.html Change-Id: Ifb7a9a6e5a494622f46b8ab435e1d168b862d952 Reviewed-by: Olivier Goffart <ogoffart@woboq.com> Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Fix off-by-one error: the next ASCII character is next oneThiago Macieira2014-02-221-2/+2
| | | | | | | | | The bit scan function returns the index of the last non-ASCII character. The next ASCII is the one after this. This means all the benchmarks were made while reentering the SIMD loop uselessly... Change-Id: If7de485a63428bfa36d413049d9239ddda1986aa Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* QUtfCodec: don't encode invalid UCS-4 codepointsGiuseppe D'Angelo2014-02-071-8/+9
| | | | | | | | | | | | | | | | | | | The code didn't check for malformed surrogate pairs. That means that - high surrogates followed by *anything* were decoded as they formed a valid surrogate pair; - stray low surrogates were returned as-is. We can't return surrogate values in UCS-4, so properly detect these cases and return U+FFFD instead. [ChangeLog][QtCore][QTextCodec] Encoding a QString in UTF-32 will now replace malformed UTF-16 subsequences in the string with the Unicode replacement character (U+FFFD). Change-Id: I5cd771d6aa21ffeff4dd9d9e5a7961cf692dc457 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
* Add support for UTF-8 encoding/decoding with SIMDThiago Macieira2014-01-311-15/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Decoding from UTF-8 is easy: if the high bit is set, we fall back to the byte-by-byte decoding. Encoding to UTF-8 requires a little bit more work: to detect anything between 0x0080 and 0xffff, we have several options but none as easy as above. Multiple alternatives are in the benchmark code. In both loops, we do two things once we run into a non-ASCII character: first, we continue the loop for the remainder of ASCII characters in the buffer (which we can tell by checking the bits set in the mask), then we find the last non-ASCII character in that 16-character group, so we don't reenter the SSE code too soon. For the UTF-8 encoding, I have chosen the alternative that results in the best performance. It's closely tied to the alternative running the PMIN instruction, but that requires SSE 4.1. It's not worth the complexity. And quite counter-intuitively, the dedicated string instruction from SSE 4.2 performs most poorly of all solutions. This begs re-visiting the performance of the toLatin1 encoder. The best of 10 benchmark runs of this code were measured on my SandyBridge CPU @ 2.66 GHz (turbo @ 3.3 GHz), both as CPU cycles and as CPU ticks: Compared to: ICU Qt 4.7 non-SSE Qt 5.3 Data set fromUtf8 toUtf8 fromUtf8 toUtf8 fromUtf8 toUtf8 ASCII only 7.50x 6.22x 6.94x 7.60x 4.45x 4.90x 2-char UTF-8 1.17x 1.33x 1.64x 1.56x 1.01x 1.02x 3-char UTF-8 1.08x 1.18x 1.48x 1.33x 0.97x 0.92x 4-char UTF-8 1.05x 1.19x 1.20x 1.21x 0.97x 0.97x Creator data 3.62x 2.16x 2.60x 1.25x 1.78x 1.23x As shown by the numbers, the SSE-based code is slightly worse than the non-SSE code for dense non-ASCII strings. However, as evident in the Qt Creator data, most strings manipulated by applications are either pure ASCII or mostly so, so there's a net gain. Done-with: H. Peter Anvin <hpa@linux.intel.com> Change-Id: Ia74fbdfdcd7b088f6cba5048c03a153c01f5dbc1 Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Add a new UTF-8 decoder, similar to the encoder we've just addedThiago Macieira2014-01-091-86/+89
| | | | | | | | | | | | | | | | Like before, this is taken from the existing QUrl code and is optimized for ASCII handling (for the same reasons). And like previously, make QString::fromUtf8 use a stateless version of the codec, which is faster. There's a small change in behavior in the decoding: we insert a U+FFFD for each byte that cannot be decoded properly. Previously, it would "eat" all bad high-bit bytes and replace them all with one single U+FFFD. Either behavior is allowed by the UTF-8 specifications, even though this new behavior will cause misalignment in the Bradley Kuhn sample UTF-8 text. Change-Id: Ib1b1f0b4291293bab345acaf376e00204ed87565 Reviewed-by: Olivier Goffart <ogoffart@woboq.com> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Add a new UTF-8 encoder and use it from QStringThiago Macieira2014-01-091-45/+41
| | | | | | | | | | | | | | | | | This is a new and faster UTF-8 encoder, based on the code from QUrl. This code specializes for ASCII, which is the most common case anyway, especially since QString's "ascii" mode is actually UTF-8 now. In addition, make QString::toUtf8 use a stateless encoder. Stateless means that the function doesn't handle state between calls in the form of QTextCodec::ConverterState. This allows it to be faster than otherwise. The new code is in the form of a template so that it can be used from QJsonDocument and QUrl, which have small modifications to how the encoding is handled. Change-Id: I305ee0fd8523cc4ec74c2678cb9ea88b75bac7ac Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Allow non-character codes in utf8 stringsKurt Pattyn2013-10-171-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | Changed the processing of non-character code handling in the UTF8 codec. Non-character codes are now accepted in QStrings, QUrls and QJson strings. Unit tests were adapted accordingly. For more info about non-character codes, see: http://www.unicode.org/versions/corrigendum9.html [ChangeLog][QtCore][QUtf8] UTF-8 now accepts non-character unicode points; these are not replaced by the replacement character anymore [ChangeLog][QtCore][QUrl] QUrl now fully accepts non-character unicode points; they are encoded as percent characters; they can also be pretty decoded [ChangeLog][QtCore][QJson] The Writer and the Parser now fully accept non-character unicode points. Change-Id: I77cf4f0e6210741eac8082912a0b6118eced4f77 Task-number: QTBUG-33229 Reviewed-by: Lars Knoll <lars.knoll@digia.com> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* Update copyright year in Digia's license headersSergio Ahumada2013-01-181-1/+1
| | | | | Change-Id: Ic804938fc352291d011800d21e549c10acac66fb Reviewed-by: Lars Knoll <lars.knoll@digia.com>
* Change copyrights from Nokia to DigiaIikka Eklund2012-09-221-24/+24
| | | | | | | | Change copyrights and license headers from Nokia to Digia Change-Id: If1cc974286d29fd01ec6c19dd4719a67f4c3f00e Reviewed-by: Lars Knoll <lars.knoll@digia.com> Reviewed-by: Sergio Ahumada <sergio.ahumada@digia.com>
* QChar: add isSurrogate() and isNonCharacter() to the public APIKonstantin Ritt2012-05-161-4/+3
| | | | | | | | + QChar::LastValidCodePoint enum value that supercede the UNICODE_LAST_CODEPOINT macro replace uses of hardcoded values with the new API; remove leftovers Change-Id: I1395c9840b85fcb6b08e241b131794a98773c952 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
* add some useful methods to QUnicodeTables::Konstantin Ritt2012-05-101-15/+3
| | | | | | | in order to reduce code duplication and prepare the ground for upcoming changes Change-Id: I980244149f65384c9484bbec4682de8b7b848b08 Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* replace hardcoded values with a surrogate handling methodsKonstantin Ritt2012-04-131-3/+3
| | | | | | Change-Id: Ib41e08d835f2e8ca2e32b4025c6f5a99840f2e27 Reviewed-by: Oswald Buddenhagen <oswald.buddenhagen@nokia.com> Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* fix QUtf8 codec to disallow codes in range [U+fdd0..U+fdef]Konstantin Ritt2012-04-111-1/+1
| | | | | | | | | 0xfdef-0xfdd0 is definitely 31 and not 15 :) also fix all copy-pastes of this code (greping for '0xfdd0' helps ;) Change-Id: I8f3bd4fd9d85f9de066f0f5df378b9188c12bd48 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Denis Dzyubenko <denis.dzyubenko@nokia.com>
* Remove "All rights reserved" line from license headers.Jason McDonald2012-01-301-1/+1
| | | | | | | | | | As in the past, to avoid rewriting various autotests that contain line-number information, an extra blank line has been inserted at the end of the license text to ensure that this commit does not change the total number of lines in the license header. Change-Id: I311e001373776812699d6efc045b5f742890c689 Reviewed-by: Rohan McGovern <rohan.mcgovern@nokia.com>
* Update contact information in license headers.Jason McDonald2012-01-231-1/+1
| | | | | | | Replace Nokia contact email address with Qt Project website. Change-Id: I431bbbf76d7c27d8b502f87947675c116994c415 Reviewed-by: Rohan McGovern <rohan.mcgovern@nokia.com>
* Update copyright year in license headers.Jason McDonald2012-01-051-1/+1
| | | | | Change-Id: I02f2c620296fcd91d4967d58767ea33fc4e1e7dc Reviewed-by: Rohan McGovern <rohan.mcgovern@nokia.com>
* Remove duplicate check in utf endian detectionKent Hansen2011-10-061-14/+12
| | | | | | | | | | | This was excessive paranoia. Task-number: QTBUG-20482 Change-Id: Ia0c76651773e12f25ec5d62675d6f317b8d2df13 Reviewed-on: http://codereview.qt-project.org/6045 Reviewed-by: Qt Sanity Bot <qt_sanity_bot@ovi.com> Reviewed-by: Jędrzej Nowacki <jedrzej.nowacki@nokia.com> Reviewed-by: Lars Knoll <lars.knoll@nokia.com>
* Replace explicit surrogate handlers by inline methods of QChar classsuzuki toshiya2011-09-121-4/+4
| | | | | | | | | | | | Merge-request: 1284 Reviewed-by: Oswald Buddenhagen <oswald.buddenhagen@nokia.com> (cherry picked from commit 50af55095afe1ba048dde357b771485ef2188778) Change-Id: I0b1967145ad62243afc2060a6ae4ca141a9609fd Reviewed-on: http://codereview.qt-project.org/4587 Reviewed-by: Qt Sanity Bot <qt_sanity_bot@ovi.com> Reviewed-by: Oswald Buddenhagen <oswald.buddenhagen@nokia.com>
* Update licenseheader text in source files for qtbase Qt moduleJyri Tahtela2011-05-241-17/+17
| | | | | | | Updated version of LGPL and FDL licenseheaders. Apply release phase licenseheaders for all source files. Reviewed-by: Trust Me
* Initial import from the monolithic Qt.Qt by Nokia2011-04-271-0/+670
This is the beginning of revision history for this module. If you want to look at revision history older than this, please refer to the Qt Git wiki for how to use Git history grafting. At the time of writing, this wiki is located here: http://qt.gitorious.org/qt/pages/GitIntroductionWithQt If you have already performed the grafting and you don't see any history beyond this commit, try running "git log" with the "--follow" argument. Branched from the monolithic repo, Qt master branch, at commit 896db169ea224deb96c59ce8af800d019de63f12