diff options
author | Oswald Buddenhagen <oswald.buddenhagen@qt.io> | 2017-10-06 11:52:00 +0000 |
---|---|---|
committer | Oswald Buddenhagen <oswald.buddenhagen@qt.io> | 2017-10-06 16:48:38 +0200 |
commit | 50b7967a78f8f7b1b45444577662ba20ed2643c8 (patch) | |
tree | 1486c059593a413b66218e23b6025faa2d8e5011 /8.31/doc/pcreunicode.3 |
Initial import
Diffstat (limited to '8.31/doc/pcreunicode.3')
-rw-r--r-- | 8.31/doc/pcreunicode.3 | 225 |
1 files changed, 225 insertions, 0 deletions
diff --git a/8.31/doc/pcreunicode.3 b/8.31/doc/pcreunicode.3 new file mode 100644 index 0000000..e8dc80e --- /dev/null +++ b/8.31/doc/pcreunicode.3 @@ -0,0 +1,225 @@ +.TH PCREUNICODE 3 "14 April 2012" "PCRE 8.30" +.SH NAME +PCRE - Perl-compatible regular expressions +.SH "UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT" +.rs +.sp +From Release 8.30, in addition to its previous UTF-8 support, PCRE also +supports UTF-16 by means of a separate 16-bit library. This can be built as +well as, or instead of, the 8-bit library. +. +. +.SH "UTF-8 SUPPORT" +.rs +.sp +In order process UTF-8 strings, you must build PCRE's 8-bit library with UTF +support, and, in addition, you must call +.\" HREF +\fBpcre_compile()\fP +.\" +with the PCRE_UTF8 option flag, or the pattern must start with the sequence +(*UTF8). When either of these is the case, both the pattern and any subject +strings that are matched against it are treated as UTF-8 strings instead of +strings of 1-byte characters. +. +. +.SH "UTF-16 SUPPORT" +.rs +.sp +In order process UTF-16 strings, you must build PCRE's 16-bit library with UTF +support, and, in addition, you must call +.\" HTML <a href="pcre_compile.html"> +.\" </a> +\fBpcre16_compile()\fP +.\" +with the PCRE_UTF16 option flag, or the pattern must start with the sequence +(*UTF16). When either of these is the case, both the pattern and any subject +strings that are matched against it are treated as UTF-16 strings instead of +strings of 16-bit characters. +. +. +.SH "UTF SUPPORT OVERHEAD" +.rs +.sp +If you compile PCRE with UTF support, but do not use it at run time, the +library will be a bit bigger, but the additional run time overhead is limited +to testing the PCRE_UTF8/16 flag occasionally, so should not be very big. +. +. +.SH "UNICODE PROPERTY SUPPORT" +.rs +.sp +If PCRE is built with Unicode character property support (which implies UTF +support), the escape sequences \ep{..}, \eP{..}, and \eX can be used. +The available properties that can be tested are limited to the general +category properties such as Lu for an upper case letter or Nd for a decimal +number, the Unicode script names such as Arabic or Han, and the derived +properties Any and L&. A full list is given in the +.\" HREF +\fBpcrepattern\fP +.\" +documentation. Only the short names for properties are supported. For example, +\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported. +Furthermore, in Perl, many properties may optionally be prefixed by "Is", for +compatibility with Perl 5.6. PCRE does not support this. +. +. +.\" HTML <a name="utf8strings"></a> +.SS "Validity of UTF-8 strings" +.rs +.sp +When you set the PCRE_UTF8 flag, the byte strings passed as patterns and +subjects are (by default) checked for validity on entry to the relevant +functions. The entire string is checked before any other processing takes +place. From release 7.3 of PCRE, the check is according the rules of RFC 3629, +which are themselves derived from the Unicode specification. Earlier releases +of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit +values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 +to U+10FFFF, excluding U+D800 to U+DFFF. +.P +The excluded code points are the "Surrogate Area" of Unicode. They are reserved +for use by UTF-16, where they are used in pairs to encode codepoints with +values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs +are available independently in the UTF-8 encoding. (In other words, the whole +surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) +.P +If an invalid UTF-8 string is passed to PCRE, an error return is given. At +compile time, the only additional information is the offset to the first byte +of the failing character. The run-time functions \fBpcre_exec()\fP and +\fBpcre_dfa_exec()\fP also pass back this information, as well as a more +detailed reason code if the caller has provided memory in which to do this. +.P +In some situations, you may already know that your strings are valid, and +therefore want to skip these checks in order to improve performance, for +example in the case of a long subject string that is being scanned repeatedly +with different patterns. If you set the PCRE_NO_UTF8_CHECK flag at compile time +or at run time, PCRE assumes that the pattern or subject it is given +(respectively) contains only valid UTF-8 codes. In this case, it does not +diagnose an invalid UTF-8 string. +.P +If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what +happens depends on why the string is invalid. If the string conforms to the +"old" definition of UTF-8 (RFC 2279), it is processed as a string of characters +in the range 0 to 0x7FFFFFFF by \fBpcre_dfa_exec()\fP and the interpreted +version of \fBpcre_exec()\fP. In other words, apart from the initial validity +test, these functions (when in UTF-8 mode) handle strings according to the more +liberal rules of RFC 2279. However, the just-in-time (JIT) optimization for +\fBpcre_exec()\fP supports only RFC 3629. If you are using JIT optimization, or +if the string does not even conform to RFC 2279, the result is undefined. Your +program may crash. +.P +If you want to process strings of values in the full range 0 to 0x7FFFFFFF, +encoded in a UTF-8-like manner as per the old RFC, you can set +PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this +situation, you will have to apply your own validity check, and avoid the use of +JIT optimization. +. +. +.\" HTML <a name="utf16strings"></a> +.SS "Validity of UTF-16 strings" +.rs +.sp +When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are +passed as patterns and subjects are (by default) checked for validity on entry +to the relevant functions. Values other than those in the surrogate range +U+D800 to U+DFFF are independent code points. Values in the surrogate range +must be used in pairs in the correct manner. +.P +If an invalid UTF-16 string is passed to PCRE, an error return is given. At +compile time, the only additional information is the offset to the first data +unit of the failing character. The run-time functions \fBpcre16_exec()\fP and +\fBpcre16_dfa_exec()\fP also pass back this information, as well as a more +detailed reason code if the caller has provided memory in which to do this. +.P +In some situations, you may already know that your strings are valid, and +therefore want to skip these checks in order to improve performance. If you set +the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that +the pattern or subject it is given (respectively) contains only valid UTF-16 +sequences. In this case, it does not diagnose an invalid UTF-16 string. +. +. +.SS "General comments about UTF modes" +.rs +.sp +1. Codepoints less than 256 can be specified by either braced or unbraced +hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger values +have to use braced sequences. +.P +2. Octal numbers up to \e777 are recognized, and in UTF-8 mode, they match +two-byte characters for values greater than \e177. +.P +3. Repeat quantifiers apply to complete UTF characters, not to individual +data units, for example: \ex{100}{3}. +.P +4. The dot metacharacter matches one UTF character instead of a single data +unit. +.P +5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, or +a single 16-bit data unit in UTF-16 mode, but its use can lead to some strange +effects because it breaks up multi-unit characters (see the description of \eC +in the +.\" HREF +\fBpcrepattern\fP +.\" +documentation). The use of \eC is not supported in the alternative matching +function \fBpcre[16]_dfa_exec()\fP, nor is it supported in UTF mode by the JIT +optimization of \fBpcre[16]_exec()\fP. If JIT optimization is requested for a +UTF pattern that contains \eC, it will not succeed, and so the matching will +be carried out by the normal interpretive function. +.P +6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly +test characters of any code value, but, by default, the characters that PCRE +recognizes as digits, spaces, or word characters remain the same set as in +non-UTF mode, all with values less than 256. This remains true even when PCRE +is built to include Unicode property support, because to do otherwise would +slow down PCRE in many common cases. Note in particular that this applies to +\eb and \eB, because they are defined in terms of \ew and \eW. If you really +want to test for a wider sense of, say, "digit", you can use explicit Unicode +property tests such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, +the way that the character escapes work is changed so that Unicode properties +are used to determine which characters match. There are more details in the +section on +.\" HTML <a href="pcrepattern.html#genericchartypes"> +.\" </a> +generic character types +.\" +in the +.\" HREF +\fBpcrepattern\fP +.\" +documentation. +.P +7. Similarly, characters that match the POSIX named character classes are all +low-valued characters, unless the PCRE_UCP option is set. +.P +8. However, the horizontal and vertical white space matching escapes (\eh, \eH, +\ev, and \eV) do match all the appropriate Unicode characters, whether or not +PCRE_UCP is set. +.P +9. Case-insensitive matching applies only to characters whose values are less +than 128, unless PCRE is built with Unicode property support. Even when Unicode +property support is available, PCRE still uses its own character tables when +checking the case of low-valued characters, so as not to degrade performance. +The Unicode property information is used only for characters with higher +values. Furthermore, PCRE supports case-insensitive matching only when there is +a one-to-one mapping between a letter's cases. There are a small number of +many-to-one mappings in Unicode; these are not supported by PCRE. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 14 April 2012 +Copyright (c) 1997-2012 University of Cambridge. +.fi |