// Copyright (C) 2022 Giuseppe D'Angelo <dangelog@gmail.com>.
// Copyright (C) 2022 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Giuseppe D'Angelo <giuseppe.dangelo@kdab.com>
// Copyright (C) 2022 The Qt Company Ltd.
// SPDX-License-Identifier: LicenseRef-Qt-Commercial OR GFDL-1.3-no-invariants-only

//! [porting-to-qregularexpression]

    The QRegularExpression class introduced in Qt 5 implements Perl-compatible
    regular expressions and is a big improvement upon QRegExp in terms of APIs
    offered, supported pattern syntax, and speed of execution. The biggest
    difference is that QRegularExpression simply holds a regular expression,
    and is \e{not} modified when a match is requested. Instead, matching
    returns a QRegularExpressionMatch object, which is used to check the
    result of the match and to extract the captured substrings. The same
    applies to global matching and QRegularExpressionMatchIterator.
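
    As a minimal sketch of this workflow (the function name, the pattern, and
    the subject string are illustrative assumptions, not taken from the Qt
    sources):

    \code
    #include <QRegularExpression>
    #include <QString>

    void inspectMatch()
    {
        // The QRegularExpression object only holds the pattern.
        const QRegularExpression re(QStringLiteral("(\\d+)-(\\d+)"));

        // match() does not modify re; the result lives in a separate object.
        const QRegularExpressionMatch match = re.match(QStringLiteral("range 10-25"));
        if (match.hasMatch()) {
            const QString whole = match.captured(0);   // "10-25"
            const QString first = match.captured(1);   // "10"
            const auto start = match.capturedStart(0); // 6
            Q_UNUSED(whole); Q_UNUSED(first); Q_UNUSED(start);
        }
    }
    \endcode

    Because the match state is kept out of the QRegularExpression object, the
    same object can be reused for further matches.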

    Other differences are outlined below.

    \note QRegularExpression does not support all the features available in
    Perl-compatible regular expressions. The most notable one is the fact that
    duplicated names for capturing groups are not supported, and using them can
    lead to undefined behavior. This may change in a future version of Qt.

    \section3 Different pattern syntax

    Porting a regular expression from QRegExp to QRegularExpression may require
    changes to the pattern itself.

    In specific scenarios, QRegExp was too lenient and accepted patterns that
    are simply invalid when using QRegularExpression. These are easy to detect,
    because the QRegularExpression objects built with these patterns are not
    valid (see QRegularExpression::isValid()).

    In other cases, a pattern ported from QRegExp to QRegularExpression may
    silently change semantics. Therefore, it is necessary to review the
    patterns used. The most notable cases of silent incompatibility are:

    \list

    \li Curly braces are needed to use a hexadecimal escape like \c{\xHHHH}
        with more than 2 digits. A pattern like \c{\x2022} needs to be ported
        to \c{\x{2022}}, or it will match a space (\c{0x20}) followed by the
        string \c{"22"}. In general, it is highly recommended to always use
        curly braces with the \c{\x} escape, no matter the number of digits
        specified.

    \li A 0-to-n quantification like \c{{,n}} needs to be ported to \c{{0,n}}
        to preserve semantics. Otherwise, a pattern such as \c{\d{,3}} would
        match a digit followed by the exact string \c{"{,3}"}.

    \li QRegExp by default does Unicode-aware matching, while
        QRegularExpression requires a separate option; see below for more
        details.

    \li \c{.} in QRegExp matches all characters by default, including the
        newline character. QRegularExpression excludes the newline character
        by default. To include the newline character, set the
        QRegularExpression::DotMatchesEverythingOption pattern option.

    \endlist
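
    The ported forms of the patterns listed above could look like the
    following sketch. The function name, variable names, and subject strings
    are hypothetical and chosen only for illustration:

    \code
    #include <QChar>
    #include <QRegularExpression>
    #include <QString>

    void portedPatterns()
    {
        // QRegExp pattern "\\x2022" becomes "\\x{2022}" (U+2022 BULLET).
        const QRegularExpression bullet(QStringLiteral("\\x{2022}"));
        const bool bulletFound = bullet.match(QString(QChar(0x2022))).hasMatch(); // true

        // QRegExp pattern "\\d{,3}" becomes "\\d{0,3}".
        const QRegularExpression upToThreeDigits(QStringLiteral("^\\d{0,3}$"));
        const bool digitsOk = upToThreeDigits.match(QStringLiteral("12")).hasMatch(); // true

        // "." skips the newline character unless the pattern option is set.
        const QString twoLines = QStringLiteral("a\nb");
        const QRegularExpression dot(QStringLiteral("a.b"));
        const QRegularExpression dotAll(QStringLiteral("a.b"),
                                        QRegularExpression::DotMatchesEverythingOption);
        const bool plainDot = dot.match(twoLines).hasMatch();     // false
        const bool withOption = dotAll.match(twoLines).hasMatch(); // true

        Q_UNUSED(bulletFound); Q_UNUSED(digitsOk);
        Q_UNUSED(plainDot); Q_UNUSED(withOption);
    }
    \endcode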

    For an overview of the regular expression syntax supported by
    QRegularExpression, please refer to the
    \l{https://pcre.org/original/doc/html/pcrepattern.html}{pcrepattern(3)}
    man page, describing the pattern syntax supported by PCRE (the reference
    implementation of Perl-compatible regular expressions).

    \section3 Porting from QRegExp::exactMatch()

    QRegExp::exactMatch() served two purposes: it exactly matched a regular
    expression against a subject string, and it implemented partial matching.

    \section4 Porting from QRegExp's Exact Matching

    Exact matching indicates whether the regular expression matches the entire
    subject string. For example, the two classes yield the following results
    on the subject string \c{"abc123"}:

    \table
    \header \li                  \li QRegExp::exactMatch() \li QRegularExpressionMatch::hasMatch()
    \row    \li \c{"\\d+"}       \li \b false              \li \b true
    \row    \li \c{"[a-z]+\\d+"} \li \b true               \li \b true
    \endtable

    QRegularExpression has no direct equivalent of exact matching. If you want
    to be sure that the subject string matches the regular expression
    exactly, you can wrap the pattern using the QRegularExpression::anchoredPattern()
    function:

    \snippet code/doc_src_port_from_qregexp.cpp 0

    \section4 Porting from QRegExp's Partial Matching

    When using QRegExp::exactMatch(), if an exact match was not found, one
    could still find out how much of the subject string was matched by the
    regular expression by calling QRegExp::matchedLength(). If the returned length
    was equal to the subject string's length, then one could conclude that a partial
    match was found.

    QRegularExpression supports partial matching explicitly by means of the
    appropriate QRegularExpression::MatchType.
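
    A minimal sketch of explicit partial matching, assuming a hypothetical
    input mask of three digits, a dash, and four digits (the function name
    and the subject string are also illustrative):

    \code
    #include <QRegularExpression>
    #include <QString>

    void checkPartialInput()
    {
        const QRegularExpression re(QStringLiteral("^\\d{3}-\\d{4}$"));

        // Request a partial match if no complete match can be found.
        const QRegularExpressionMatch match =
            re.match(QStringLiteral("555-1"), 0,
                     QRegularExpression::PartialPreferCompleteMatch);

        const bool complete = match.hasMatch();       // false: the input is too short
        const bool partial = match.hasPartialMatch(); // true: more input could complete it
        Q_UNUSED(complete); Q_UNUSED(partial);
    }
    \endcode

    Here the distinction between a complete and a partial match is reported
    directly by the match object, rather than being inferred from a matched
    length.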

    \section3 Global matching

    Due to limitations of the QRegExp API, it was impossible to implement global
    matching correctly (that is, like Perl does). In particular, patterns that
    can match 0 characters (like \c{"a*"}) are problematic.

    QRegularExpression::globalMatch() implements Perl global match correctly, and
    the returned iterator can be used to examine each result.

    For example, if you have code like:

    \snippet code/doc_src_port_from_qregexp.cpp 1

    You can rewrite it as:

    \snippet code/doc_src_port_from_qregexp.cpp 2

    \section3 Unicode properties support

    When using QRegExp, character classes such as \c{\w}, \c{\d}, etc. match
    characters with the corresponding Unicode property: for instance, \c{\d}
    matches any character with the Unicode \c{Nd} (decimal digit) property.

    Those character classes only match ASCII characters by default when using
    QRegularExpression: for instance, \c{\d} matches exactly a character in the
    \c{0-9} ASCII range. It is possible to change this behavior by using the
    QRegularExpression::UseUnicodePropertiesOption pattern option.
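
    The difference can be sketched as follows, using U+0663 (ARABIC-INDIC
    DIGIT THREE), a character with the \c{Nd} property outside the ASCII
    range. The function and variable names are illustrative assumptions:

    \code
    #include <QChar>
    #include <QRegularExpression>
    #include <QString>

    void matchNonAsciiDigit()
    {
        const QString arabicIndicThree(QChar(0x0663));

        const QRegularExpression asciiDigit(QStringLiteral("\\d"));
        const QRegularExpression unicodeDigit(QStringLiteral("\\d"),
                                              QRegularExpression::UseUnicodePropertiesOption);

        const bool asciiOnly = asciiDigit.match(arabicIndicThree).hasMatch();     // false
        const bool withOption = unicodeDigit.match(arabicIndicThree).hasMatch(); // true
        Q_UNUSED(asciiOnly); Q_UNUSED(withOption);
    }
    \endcode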

    \section3 Wildcard matching

    There is no direct way to do wildcard matching in QRegularExpression.
    However, the QRegularExpression::wildcardToRegularExpression() method
    is provided to translate glob patterns into a Perl-compatible regular
    expression that can be used for that purpose.

    For example, if you have code like:

    \snippet code/doc_src_port_from_qregexp.cpp 3

    You can rewrite it as:

    \snippet code/doc_src_port_from_qregexp.cpp 4

    Please note though that some shell-like wildcard patterns might not be
    translated to what you expect. The following example code will silently
    break if simply converted using the above-mentioned function:

    \snippet code/doc_src_port_from_qregexp.cpp 5

    This is because, by default, the regular expression returned by
    QRegularExpression::wildcardToRegularExpression() is fully anchored.
    To get a regular expression that is not anchored, pass
    QRegularExpression::UnanchoredWildcardConversion as the conversion
    options:

    \snippet code/doc_src_port_from_qregexp.cpp 6

    \section3 Minimal matching

    QRegExp::setMinimal() implemented minimal matching by simply reversing the
    greediness of the quantifiers (QRegExp did not support lazy quantifiers,
    like \c{*?}, \c{+?}, etc.). QRegularExpression instead does support greedy,
    lazy, and possessive quantifiers. The QRegularExpression::InvertedGreedinessOption
    pattern option can be useful to emulate the effects of QRegExp::setMinimal():
    if enabled, it inverts the greediness of quantifiers (greedy ones become
    lazy and vice versa).
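
    As an illustrative sketch (the function name, the pattern, and the
    subject string are assumptions chosen for the example), the same pattern
    behaves as follows with and without the option:

    \code
    #include <QRegularExpression>
    #include <QString>

    void compareGreediness()
    {
        const QString subject = QStringLiteral("<b>bold</b>");

        const QRegularExpression greedy(QStringLiteral("<.+>"));
        const QRegularExpression inverted(QStringLiteral("<.+>"),
                                          QRegularExpression::InvertedGreedinessOption);

        const QString longest = greedy.match(subject).captured(0);    // "<b>bold</b>"
        const QString shortest = inverted.match(subject).captured(0); // "<b>"
        Q_UNUSED(longest); Q_UNUSED(shortest);
    }
    \endcode

    When only some quantifiers should be lazy, writing them explicitly (for
    instance \c{<.+?>}) is an alternative to inverting the whole pattern.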

    \section3 Caret modes

    The QRegularExpression::AnchorAtOffsetMatchOption match option can be used to
    emulate the QRegExp::CaretAtOffset behavior. There is no equivalent for the
    other QRegExp::CaretMode modes.
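
    A minimal sketch of anchoring a match at the offset, assuming a Qt
    version that provides QRegularExpression::AnchorAtOffsetMatchOption (the
    function name, the pattern, and the subject string are illustrative):

    \code
    #include <QRegularExpression>
    #include <QString>

    void matchAtOffset()
    {
        const QRegularExpression re(QStringLiteral("\\d+"));
        const QString subject = QStringLiteral("abc123");

        // The match must start exactly at the given offset.
        const bool atThree = re.match(subject, 3, QRegularExpression::NormalMatch,
                                      QRegularExpression::AnchorAtOffsetMatchOption)
                                 .hasMatch(); // true: "123" starts at offset 3
        const bool atZero = re.match(subject, 0, QRegularExpression::NormalMatch,
                                     QRegularExpression::AnchorAtOffsetMatchOption)
                                .hasMatch(); // false: offset 0 holds 'a'

        // Without the option, the engine searches forward from the offset.
        const bool searched = re.match(subject).hasMatch(); // true
        Q_UNUSED(atThree); Q_UNUSED(atZero); Q_UNUSED(searched);
    }
    \endcode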

//! [porting-to-qregularexpression]