Diffstat (limited to 'chromium/third_party/cygwin/lib/perl5/5.10/pods/perlunicode.pod')
-rw-r--r-- | chromium/third_party/cygwin/lib/perl5/5.10/pods/perlunicode.pod | 1622 |
1 files changed, 0 insertions, 1622 deletions
diff --git a/chromium/third_party/cygwin/lib/perl5/5.10/pods/perlunicode.pod b/chromium/third_party/cygwin/lib/perl5/5.10/pods/perlunicode.pod
deleted file mode 100644
index 4e62fed0b51..00000000000
--- a/chromium/third_party/cygwin/lib/perl5/5.10/pods/perlunicode.pod
+++ /dev/null
@@ -1,1622 +0,0 @@
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement.  While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
L<the Perl Unicode tutorial, perlunitut|perlunitut> before reading
this reference document.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer.  See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes.  That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.
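A minimal sketch combining the two caveats above (C<use utf8> for UTF-8
literals in the source, and an I/O layer for encoding conversion); the
in-memory filehandle stands in for a real file:

```perl
use strict;
use warnings;
use utf8;   # the source below contains characters above 0xFF

my $str = "\x{263A}";   # WHITE SMILING FACE, one character internally

# An ":encoding(...)" layer converts from Perl's internal form to an
# external encoding on output; here we write into a scalar instead of
# a real file, which the same layer syntax supports.
my $bytes = '';
open my $fh, '>:encoding(UTF-8)', \$bytes or die "open: $!";
print {$fh} $str;
close $fh;

printf "chars=%d, encoded bytes=%d\n", length($str), length($bytes);
```

One character, U+263A, becomes three bytes under UTF-8, so character
length and byte length differ once a layer is involved.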
=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics.  For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
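The Latin-1 assumption in implicit upgrading can be observed directly;
a small sketch:

```perl
use strict;
use warnings;

# A byte string holding the single byte 0xE9 ("e" with acute accent
# in ISO 8859-1).
my $bytes = "\xE9";

# Concatenating with a Unicode string forces an upgrade of the byte
# string, and the upgrade decodes it as Latin-1.
my $uni = "\x{263A}" . $bytes;

# The second character keeps its Latin-1 meaning, U+00E9.
printf "U+%04X\n", ord(substr($uni, 1, 1));   # prints "U+00E9"
```

The ordinal value is preserved across the upgrade precisely because
the first 256 Unicode code points agree with Latin-1.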
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently.  If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect.  The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.  This translation is done without
regard to the system's native 8-bit encoding.

Under character semantics, many operations that formerly operated on
bytes now operate on characters.  A character in Perl is
logically just a number ranging from 0 to 2**31 or so.  Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\x{...}>
notation.  The Unicode code for the desired character, in hexadecimal,
should be placed in the braces.  For instance, a smiley face is
C<\x{263A}>.  This works for all characters, but for characters
under 0x100, note that Perl may use an 8-bit encoding
internally, for optimization and/or backward compatibility.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.  C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.

See L</"User-Defined Character Properties"> for more details.
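A short sketch of the escapes and class behavior described above:

```perl
use strict;
use warnings;
use charnames ':full';   # enables the \N{...} notation

# \N{...} and \x{...} denote the same character
my $smiley = "\N{WHITE SMILING FACE}";
print "same character\n" if $smiley eq "\x{263A}";

# Character classes consult the Unicode property database:
# U+5B57 is a CJK ideograph, and it matches \w.
print "word character\n" if "\x{5B57}" =~ /\w/;
```

Both tests succeed: the name lookup resolves to U+263A, and C<\w>
covers ideographs once the string carries character data.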
=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character.  C<\X> is equivalent to
C<< (?>\PM\pM*) >>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see pack('U0', ...) and pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when character input is provided.  Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  An operator that
specifically does not switch is C<vec()>.  Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats.  Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.  There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256.  Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold.  The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).

See L</"User-Defined Case Mappings"> for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Unicode Character Properties

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Brackets are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>.  Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant.  It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest".  C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.>

=over 4

=item General Category

Here are the basic Unicode General Category properties, followed by their
long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.
    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates.  C<Cs> is therefore not
supported.
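A sketch of matching against General Category properties, using the
short, long, single-letter, and negated forms from the table above:

```perl
use strict;
use warnings;

# Short two-letter form and its long-name equivalent
print "uppercase\n" if "A" =~ /\p{Lu}/;
print "uppercase\n" if "A" =~ /\p{UppercaseLetter}/;

# Single-letter property without braces
print "letter\n" if "x" =~ /\pL/;

# \P{...} is the negation: "x" carries no mark property
print "no mark\n" if "x" =~ /\P{M}/;
```

All four tests succeed, since C<\p{Lu}> and C<\p{UppercaseLetter}> name
the same property and "x" is a letter but not a mark.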
=item Bidirectional Character Types

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property       Meaning

    L              Left-to-Right
    LRE            Left-to-Right Embedding
    LRO            Left-to-Right Override
    R              Right-to-Left
    AL             Right-to-Left Arabic
    RLE            Right-to-Left Embedding
    RLO            Right-to-Left Override
    PDF            Pop Directional Format
    EN             European Number
    ES             European Number Separator
    ET             European Number Terminator
    AN             Arabic Number
    CS             Common Number Separator
    NSM            Non-Spacing Mark
    BN             Boundary Neutral
    B              Paragraph Separator
    S              Segment Separator
    WS             Whitespace
    ON             Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=item Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Balinese
    Bengali
    Bopomofo
    Braille
    Buginese
    Buhid
    CanadianAboriginal
    Cherokee
    Coptic
    Cuneiform
    Cypriot
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Glagolitic
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Kharoshthi
    Khmer
    Lao
    Latin
    Limbu
    LinearB
    Malayalam
    Mongolian
    Myanmar
    NewTaiLue
    Nko
    Ogham
    OldItalic
    OldPersian
    Oriya
    Osmanya
    PhagsPa
    Phoenician
    Runic
    Shavian
    Sinhala
    SylotiNagri
    Syriac
    Tagalog
    Tagbanwa
    TaiLe
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Tifinagh
    Ugaritic
    Yi

=item Extended property classes

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherIDStart
    OtherIDContinue
    OtherLowercase
    OtherMath
    OtherUppercase
    PatternSyntax
    PatternWhiteSpace
    QuotationMark
    Radical
    SoftDotted
    STerm
    TerminalPunctuation
    UnifiedIdeograph
    VariationSelector
    WhiteSpace

and there are further derived properties:

    Alphabetic      = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
    Lowercase       = Ll + OtherLowercase
    Uppercase       = Lu + OtherUppercase
    Math            = Sm + OtherMath

    IDStart         = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
    IDContinue      = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue

    DefaultIgnorableCodePoint
                    = OtherDefaultIgnorableCodePoint
                      + Cf + Cc + Cs + Noncharacters + VariationSelector
                      - WhiteSpace - FFF9..FFFB (Annotation Characters)

    Any             = Any code points (i.e. U+0000 to U+10FFFF)
    Assigned        = Any non-Cn code points (i.e. synonym for \P{Cn})
    Unassigned      = Synonym for \p{Cn}
    ASCII           = ASCII (i.e. U+0000 to U+007F)

    Common          = Any character (or unassigned code point)
                      not explicitly assigned to a script

=item Use of "Is" Prefix

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=item Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters.  The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters.  For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks.  It does not, for example, contain digits, because digits are
shared across many scripts.  Digits and similar groups, like
punctuation, are in a category called C<Common>.
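The script/block distinction can be seen with a digit: "7" lies in the
BasicLatin block but belongs to the C<Common> script, not C<Latin>.  A
small sketch:

```perl
use strict;
use warnings;

# Digits are shared across scripts, so they are script Common ...
print "Common\n"      if "7" =~ /\p{Common}/;
print "not Latin\n"   if "7" !~ /\p{Latin}/;

# ... yet they still sit in the BasicLatin block (In prefix),
# while a letter like "A" is in both the block and the script.
print "in block\n"    if "7" =~ /\p{InBasicLatin}/;
print "Latin\n"       if "A" =~ /\p{Latin}/;
```

This is why block tests (C<In...>) and script tests can give different
answers for the same character.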
For more about scripts, see the UAX#24 "Script Names":

    http://www.unicode.org/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix.  For example, the
Katakana block is referenced via C<\p{InKatakana}>.  The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAegeanNumbers
    InAlphabeticPresentationForms
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArabicSupplement
    InArmenian
    InArrows
    InBalinese
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuginese
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKStrokes
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksSupplement
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCoptic
    InCountingRodNumerals
    InCuneiform
    InCuneiformNumbersAndPunctuation
    InCurrencySymbols
    InCypriotSyllabary
    InCyrillic
    InCyrillicSupplement
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InEthiopicExtended
    InEthiopicSupplement
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGeorgianSupplement
    InGlagolitic
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKharoshthi
    InKhmer
    InKhmerSymbols
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLatinExtendedC
    InLatinExtendedD
    InLetterlikeSymbols
    InLimbu
    InLinearBIdeograms
    InLinearBSyllabary
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousSymbolsAndArrows
    InMiscellaneousTechnical
    InModifierToneLetters
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNKo
    InNewTaiLue
    InNumberForms
    InOgham
    InOldItalic
    InOldPersian
    InOpticalCharacterRecognition
    InOriya
    InOsmanya
    InPhagspa
    InPhoenician
    InPhoneticExtensions
    InPhoneticExtensionsSupplement
    InPrivateUseArea
    InRunic
    InShavian
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementalPunctuation
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSylotiNagri
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTaiLe
    InTaiXuanJingSymbols
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InTifinagh
    InUgaritic
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InVariationSelectorsSupplement
    InVerticalForms
    InYiRadicals
    InYiSyllables
    InYijingHexagramSymbols

=back

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is".  The subroutines can be defined in
any package.
The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines.  Each line must be one of the following:

=over 4

=item *

A single hexadecimal number denoting a Unicode code point to include.

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tabular characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.
=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
for all the characters except the characters in the property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.

    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

=head2 User-Defined Case Mappings

You can also define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).
The string returned by the subroutines needs to be three
hexadecimal numbers separated by tabulators: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines an uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tabulator characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported.  The references to "Level N"
and the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 11, in May 2005.
=over 4

=item *

Level 1 - Basic Unicode Support

        RL1.1   Hex Notation                    - done          [1]
        RL1.2   Properties                      - done          [2][3]
        RL1.2a  Compatibility Properties        - done          [4]
        RL1.3   Subtraction and Intersection    - MISSING       [5]
        RL1.4   Simple Word Boundaries          - done          [6]
        RL1.5   Simple Loose Matches            - done          [7]
        RL1.6   Line Boundaries                 - MISSING       [8]
        RL1.7   Supplementary Code Points       - done          [9]

        [1]  \x{...}
        [2]  \p{...} \P{...}
        [3]  supports not only minimal list (general category, scripts,
             Alphabetic, Lowercase, Uppercase, WhiteSpace,
             NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
             ASCII, Assigned), but also bidirectional types, blocks, etc.
             (see L</"Unicode Character Properties">)
        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
        [5]  can use regular expression look-ahead [a] or
             user-defined character properties [b] to emulate set
             operations
        [6]  \b \B
        [7]  note that Perl does Full case-folding in matching, not
             Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
             not with 1F80.  This difference matters for certain Greek
             capital letters with certain modifiers: the Full
             case-folding decomposes the letter, while the Simple
             case-folding would map it to a single character.
        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
             CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
             should also affect <>, $., and script line numbers;
             should not split lines within CRLF [c] (i.e. there is no
             empty line between \r and \n)
        [9]  UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
             U+10FFFF but also beyond U+10FFFF [d]

[a] You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{GreekAndCoptic}

which will match assigned characters known to be part of the Greek script.
Also see the Unicode::Regex::Set module; it does implement the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.

[b] '+' for union, '-' for removal (set-difference), '&' for intersection
(see L</"User-Defined Character Properties">)

[c] Try the C<:crlf> layer (see L<PerlIO>).

[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
U+FFFF (C<\x{FFFF}>).

=item *

Level 2 - Extended Unicode Support

        RL2.1   Canonical Equivalents           - MISSING       [10][11]
        RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
        RL2.3   Default Word Boundaries         - MISSING       [14]
        RL2.4   Default Loose Matches           - MISSING       [15]
        RL2.5   Name Properties                 - MISSING       [16]
        RL2.6   Wildcard Properties             - MISSING

        [10] see UAX#15 "Unicode Normalization Forms"
        [11] have Unicode::Normalize but not integrated to regexes
        [12] have \X but at this level . should equal that
        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
             clusters as a single grapheme cluster.
        [14] see UAX#29, Word Boundaries
        [15] see UAX#21 "Case Mappings"
        [16] have \N{...} but neither compute names of CJK Ideographs
             and Hangul Syllables nor use a loose match [e]

[e] C<\N{...}> allows namespaces (see L<charnames>).
=item *

Level 3 - Tailored Support

        RL3.1   Tailored Punctuation            - MISSING
        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
        RL3.3   Tailored Word Boundaries        - MISSING
        RL3.4   Tailored Loose Matches          - MISSING
        RL3.5   Tailored Ranges                 - MISSING
        RL3.6   Context Matching                - MISSING       [19]
        RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
        RL3.9   Possible Match Sets             - MISSING
        RL3.10  Folded Matching                 - MISSING       [20]
        RL3.11  Submatchers                     - MISSING

        [17] see UAX#10 "Unicode Collation Algorithms"
        [18] have Unicode::Collate but not integrated to regexes
        [19] have (?<=x) and (?=x), but look-aheads or look-behinds
             should see outside of the target substring
        [20] need insensitive matching for linguistic features other
             than case; for example, hiragana to katakana, wide and
             narrow, simplified Han to traditional Han (see UTR#30
             "Character Foldings")

=back

=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers.  To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 6 bytes, current character allocations
require 4 bytes), byte-order independent encoding.  For ASCII (and we
really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.
The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.

Another way to look at it is via bits:

 Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa    0aaaaaaa
            00000bbbbbaaaaaa    110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa    1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa    11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case uses
I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

If you try to generate surrogates (for example by using chr()), you
will get a warning, if warnings are turned on, because those code
points are not valid for a Unicode character.
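As a quick sanity check, the two formulas above round-trip; this is a
minimal sketch (remember that Perl division is floating-point, hence
the int() truncation):

```perl
use strict;
use warnings;

# Encode U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair using
# the formulas above, then decode it back.
my $uni = 0x1D11E;
my $hi  = int(($uni - 0x10000) / 0x400) + 0xD800;
my $lo  = ($uni - 0x10000) % 0x400 + 0xDC00;
printf "hi=%04X lo=%04X\n", $hi, $lo;    # hi=D834 lo=DD1E

my $back = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "back=%X\n", $back;               # back=1D11E
```

The resulting pair C<U+D834 U+DD1E> falls, as expected, inside the high
and low surrogate ranges.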
Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32.

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.
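The BOM signatures listed above can be used to guess the encoding of
raw input.  C<sniff_bom> below is a hypothetical helper, not part of
any module; note that the UTF-32 tests must come before the UTF-16
tests, because the UTF-32LE signature begins with the UTF-16LE one
(and a UTF-16LE stream starting with U+0000 would look the same, so
this is only a heuristic):

```perl
use strict;
use warnings;

# A hypothetical sketch: guess an encoding from the BOM bytes listed
# above.  Returns undef when no BOM is found (the data may still be
# perfectly valid Unicode without one).
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return undef;
}

print sniff_bom("\xFE\xFF\x00A") || "no BOM", "\n";   # UTF-16BE
```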
=back

=head2 Security Implications of Unicode

=over 4

=item *

Malformed UTF-8

Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.

=item *

Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.

In the first case, the set of C<\w> characters is either small--the
default set of alphabetic characters, digits, and the "_"--or, if you
are using a locale (see L<perllocale>), the C<\w> might contain a few
more letters according to your language and country.

In the second case, the C<\w> set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters I<somewhere> as C<\w>.  For example, your
locale might not think that LATIN SMALL LETTER ETH is a letter (unless
you happen to speak Icelandic), but Unicode does.

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.
Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the C<strict> pragma.

=back

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed.  There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:

=over 4

=item *

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

=item *

Perl tries really hard to work both with Unicode and the old
byte-oriented world.  Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.

=back

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.
For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.

One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for qx and system: how well will the
'command line interface' (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa.  The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  You have to be especially careful if you use
utf8::upgrade(): not every random byte string is valid UTF-8.

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.
The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.

=item *

C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.
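From pure Perl, the same well-formedness check can be sketched with
the Encode module; C<is_valid_utf8> below is a hypothetical helper,
not part of any API (XS code would call C<is_utf8_string()> directly):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# A Perl-level analogue of is_utf8_string(): attempt a strict decode
# and report whether it croaked on malformed input.  We pass a copy,
# since a true CHECK value lets decode() modify its argument.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("\xC3\xA9") ? "valid\n" : "invalid\n";  # valid
print is_valid_utf8("\xC3")     ? "valid\n" : "invalid\n";  # invalid
```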
=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
is useful for example for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.
Use of locales with
Unicode is discouraged.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly.  If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.  If the documentation does not talk about
Unicode at all, suspect the worst and probably look at the source to
learn how the module is implemented.  Modules written completely in
Perl shouldn't cause problems.  Modules that directly or indirectly
access code written in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit.  Choose an
encoding that you know the extension can handle.  Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding.  Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet.  The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function.
Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6.
In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope.  If you have code that is
working with 5.6, you will need some of the following adjustments to
your code.  The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off.  Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware.  Please
check the documentation to verify whether this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls.  A wrapper function will also make it easier to
adapt to future enhancements in your database driver.  Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data.  Please check the documentation to verify
whether that is still true.
  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program.  If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut