author | Marc Mutz <marc.mutz@kdab.com> | 2017-11-22 15:48:02 +0100 |
---|---|---|
committer | Marc Mutz <marc.mutz@kdab.com> | 2020-06-03 19:13:54 +0200 |
commit | 6a3c6f939f29c83d53d2da0c3f53b814bdd02358 (patch) | |
tree | b0734ab85ce0839a80e440b42da4216ff7291378 /src/corelib/text/qstringtokenizer.cpp | |
parent | 1b33ee95e5c6e5e27f732fd273920861fdae486a (diff) |
Long live QStringTokenizer!
This class is designed as a C++20-style generator / lazy sequence, and
is the new return value of QString{,View}::tokenize().
It thus is more similar to a hand-coded loop around indexOf() than
QString::split(), which returns a container (the filling of which
allocates memory).
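To illustrate the difference between a lazy sequence and a filled container, here is a minimal standalone sketch in standard C++17. It is not the Qt implementation: `LazyTokenizer` and `collect()` are hypothetical names, `std::string_view::find()` stands in for indexOf(), and the sketch assumes a non-empty separator and only keeps views onto caller-owned strings.

```cpp
// Standalone illustration only; LazyTokenizer and collect() are hypothetical
// names and not part of Qt. Assumes a non-empty needle.
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

class LazyTokenizer {
public:
    struct sentinel {};  // empty end marker, deliberately a different type than iterator

    class iterator {
    public:
        iterator(std::string_view haystack, std::string_view needle)
            : m_rest(haystack), m_needle(needle)
        {
            assert(!needle.empty());  // an empty needle would never advance
            advance();                // compute the first token up front
        }

        std::string_view operator*() const { assert(!m_done); return m_token; }
        iterator &operator++() { advance(); return *this; }
        bool operator!=(sentinel) const { return !m_done; }

    private:
        // Finds the next token on demand; no container is filled.
        void advance()
        {
            if (m_exhausted) { m_done = true; return; }
            const std::size_t pos = m_rest.find(m_needle);
            if (pos == std::string_view::npos) {
                m_token = m_rest;     // last token: everything that is left
                m_exhausted = true;
            } else {
                m_token = m_rest.substr(0, pos);
                m_rest.remove_prefix(pos + m_needle.size());
            }
        }

        std::string_view m_rest, m_needle, m_token;
        bool m_exhausted = false;  // last token has been produced
        bool m_done = false;       // iteration has moved past the last token
    };

    // Only views are stored: the caller must keep both strings alive.
    LazyTokenizer(std::string_view haystack, std::string_view needle)
        : m_haystack(haystack), m_needle(needle) {}

    iterator begin() const { return iterator{m_haystack, m_needle}; }
    sentinel end() const { return {}; }

private:
    std::string_view m_haystack, m_needle;
};

// Eager counterpart for comparison: fills a container, which allocates.
std::vector<std::string> collect(const LazyTokenizer &tok)
{
    std::vector<std::string> out;
    for (std::string_view token : tok)
        out.emplace_back(token);
    return out;
}
```

Iterating `LazyTokenizer` directly in a ranged for loop needs only constant extra memory, while `collect()` shows the cost that a split()-style container return adds on top.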
The template arguments of QStringTokenizer intricately depend on the
arguments with which it is constructed, so QStringTokenizer cannot be used
directly without C++17 CTAD. To work around this issue, add a factory
function, qTokenize().
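The factory-function workaround can be sketched in standard C++ as follows. This is a simplified assumption, not how qTokenize() is implemented: `BasicTokenizer` and `makeTokenizer()` are hypothetical names, and the mapping from argument types to template arguments here is plain `std::decay_t`, whereas the real class uses a subtler mapping.

```cpp
// Standalone illustration only; BasicTokenizer and makeTokenizer() are
// hypothetical names and not part of Qt.
#include <string>
#include <type_traits>
#include <utility>

template <typename Haystack, typename Needle>
class BasicTokenizer {
public:
    BasicTokenizer(Haystack haystack, Needle needle)
        : m_haystack(std::move(haystack)), m_needle(std::move(needle)) {}

    const Haystack &haystack() const { return m_haystack; }
    const Needle &needle() const { return m_needle; }

private:
    Haystack m_haystack;
    Needle m_needle;
};

// Factory function: function templates have always deduced their template
// arguments, so callers never have to spell out BasicTokenizer<...>.
template <typename Haystack, typename Needle>
auto makeTokenizer(Haystack &&haystack, Needle &&needle)
{
    using Result = BasicTokenizer<std::decay_t<Haystack>, std::decay_t<Needle>>;
    return Result(std::forward<Haystack>(haystack), std::forward<Needle>(needle));
}
```

With C++17 CTAD, `BasicTokenizer tok{std::string("a,b"), ','};` deduces the same type that `makeTokenizer(std::string("a,b"), ',')` returns; the factory only requires C++14 return type deduction.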
LATER:
- ~Optimize QLatin1String needles (avoid repeated L1->UTF16 conversion)~
(out of scope for QStringTokenizer, should be solved in the respective
indexOf())
- Keep per-instantiation state:
* Boyer-Moore table
[ChangeLog][QtCore][QStringTokenizer] New class.
[ChangeLog][QtCore][qTokenize] New function.
Change-Id: I7a7a02e9175cdd3887778f29f2f91933329be759
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Diffstat (limited to 'src/corelib/text/qstringtokenizer.cpp')
-rw-r--r-- | src/corelib/text/qstringtokenizer.cpp | 357 |
1 files changed, 357 insertions, 0 deletions
diff --git a/src/corelib/text/qstringtokenizer.cpp b/src/corelib/text/qstringtokenizer.cpp
new file mode 100644
index 0000000000..043269a3ac
--- /dev/null
+++ b/src/corelib/text/qstringtokenizer.cpp
@@ -0,0 +1,357 @@
+/****************************************************************************
+**
+** Copyright (C) 2020 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Marc Mutz <marc.mutz@kdab.com>
+** Contact: http://www.qt.io/licensing/
+**
+** This file is part of the QtCore module of the Qt Toolkit.
+**
+** $QT_BEGIN_LICENSE:LGPL$
+** Commercial License Usage
+** Licensees holding valid commercial Qt licenses may use this file in
+** accordance with the commercial license agreement provided with the
+** Software or, alternatively, in accordance with the terms contained in
+** a written agreement between you and The Qt Company. For licensing terms
+** and conditions see https://www.qt.io/terms-conditions. For further
+** information use the contact form at https://www.qt.io/contact-us.
+**
+** GNU Lesser General Public License Usage
+** Alternatively, this file may be used under the terms of the GNU Lesser
+** General Public License version 3 as published by the Free Software
+** Foundation and appearing in the file LICENSE.LGPL3 included in the
+** packaging of this file. Please review the following information to
+** ensure the GNU Lesser General Public License version 3 requirements
+** will be met: https://www.gnu.org/licenses/lgpl-3.0.html.
+**
+** GNU General Public License Usage
+** Alternatively, this file may be used under the terms of the GNU
+** General Public License version 2.0 or (at your option) the GNU General
+** Public license version 3 or any later version approved by the KDE Free
+** Qt Foundation. The licenses are as published by the Free Software
+** Foundation and appearing in the file LICENSE.GPL2 and LICENSE.GPL3
+** included in the packaging of this file. Please review the following
+** information to ensure the GNU General Public License requirements will
+** be met: https://www.gnu.org/licenses/gpl-2.0.html and
+** https://www.gnu.org/licenses/gpl-3.0.html.
+**
+** $QT_END_LICENSE$
+**
+****************************************************************************/
+
+#include "qstringtokenizer.h"
+#include "qstringalgorithms.h"
+
+QT_BEGIN_NAMESPACE
+
+/*!
+    \class QStringTokenizer
+    \inmodule QtCore
+    \since 6.0
+    \brief The QStringTokenizer class splits strings into tokens along given separators
+    \reentrant
+    \ingroup tools
+    \ingroup string-processing
+
+    Splits a string into substrings wherever a given separator occurs,
+    and returns a (lazy) list of those strings. If the separator does
+    not match anywhere in the string, produces a single element
+    containing this string. If the separator is empty,
+    QStringTokenizer produces an empty string, followed by each of the
+    string's characters, followed by another empty string. The two
+    enumerations Qt::SplitBehavior and Qt::CaseSensitivity further
+    control the output.
+
+    QStringTokenizer drives QStringView::tokenize(), but, at least with a
+    recent compiler, you can use it directly, too:
+
+    \code
+    for (auto token : QStringTokenizer{string, separator})
+        use(token);
+    \endcode
+
+    \note You should never, ever, name the template arguments of a
+    QStringTokenizer explicitly. If you can use C++17 Class Template
+    Argument Deduction (CTAD), you may write
+    \c{QStringTokenizer{string, separator}} (without template
+    arguments). If you can't use C++17 CTAD, you must use the
+    QStringView::split() or QLatin1String::split() member functions
+    and store the return value only in \c{auto} variables:
+
+    \code
+    auto result = string.split(sep);
+    \endcode
+
+    This is because the template arguments of QStringTokenizer have a
+    very subtle dependency on the specific string and separator types
+    with which they are constructed, and they don't usually
+    correspond to the actual types passed.
+
+    \section Lazy Sequences
+
+    QStringTokenizer acts as a so-called lazy sequence, that is, each
+    next element is only computed once you ask for it. Lazy sequences
+    have the advantage that they only require O(1) memory. They have
+    the disadvantage that, at least for QStringTokenizer, they only
+    allow forward, not random-access, iteration.
+
+    The intended use-case is that you just plug it into a ranged for loop:
+
+    \code
+    for (auto token : QStringTokenizer{string, separator})
+        use(token);
+    \endcode
+
+    or a C++20 ranged algorithm:
+
+    \code
+    std::ranges::for_each(QStringTokenizer{string, separator},
+                          [] (auto token) { use(token); });
+    \endcode
+
+    \section End Sentinel
+
+    The QStringTokenizer iterators cannot be used with classical STL
+    algorithms, because those require iterator/iterator pairs, while
+    QStringTokenizer uses sentinels, that is, it uses a different
+    type, QStringTokenizer::sentinel, to mark the end of the
+    range. This improves performance, because the sentinel is an empty
+    type. Sentinels are supported from C++17 (for ranged for)
+    and C++20 (for algorithms using the new ranges library).
+
+    \section Temporaries
+
+    QStringTokenizer is very carefully designed to avoid dangling
+    references. If you construct a tokenizer from a temporary string
+    (an rvalue), that argument is stored internally, so the referenced
+    data isn't deleted before it is tokenized:
+
+    \code
+    auto tok = QStringTokenizer{widget.text(), u','};
+    // return value of `widget.text()` is destroyed, but content was moved into `tok`
+    for (auto e : tok)
+        use(e);
+    \endcode
+
+    If you pass named objects (lvalues), then QStringTokenizer does
+    not store a copy. You are responsible for keeping the named object's
+    data around for longer than the tokenizer operates on it:
+
+    \code
+    auto text = widget.text();
+    auto tok = QStringTokenizer{text, u','};
+    text.clear();      // destroy content of `text`
+    for (auto e : tok) // ERROR: `tok` references deleted data!
+        use(e);
+    \endcode
+
+    \sa QStringView::split(), QLatin1String::split(), Qt::SplitBehavior, Qt::CaseSensitivity
+*/
+
+/*!
+    \typedef QStringTokenizer::value_type
+
+    Alias for \c{const QStringView} or \c{const QLatin1String},
+    depending on the tokenizer's \c Haystack template argument.
+*/
+
+/*!
+    \typedef QStringTokenizer::difference_type
+
+    Alias for qsizetype.
+*/
+
+/*!
+    \typedef QStringTokenizer::size_type
+
+    Alias for qsizetype.
+*/
+
+/*!
+    \typedef QStringTokenizer::reference
+
+    Alias for \c{value_type &}.
+
+    QStringTokenizer does not support mutable references, so this is
+    the same as const_reference.
+*/
+
+/*!
+    \typedef QStringTokenizer::const_reference
+
+    Alias for \c{value_type &}.
+*/
+
+/*!
+    \typedef QStringTokenizer::pointer
+
+    Alias for \c{value_type *}.
+
+    QStringTokenizer does not support mutable iterators, so this is
+    the same as const_pointer.
+*/
+
+/*!
+    \typedef QStringTokenizer::const_pointer
+
+    Alias for \c{value_type *}.
+*/
+
+/*!
+    \typedef QStringTokenizer::iterator
+
+    This typedef provides an STL-style const iterator for
+    QStringTokenizer.
+
+    QStringTokenizer does not support mutable iterators, so this is
+    the same as const_iterator.
+
+    \sa const_iterator
+*/
+
+/*!
+    \typedef QStringTokenizer::const_iterator
+
+    This typedef provides an STL-style const iterator for
+    QStringTokenizer.
+
+    \sa iterator
+*/
+
+/*!
+    \typedef QStringTokenizer::sentinel
+
+    This typedef provides an STL-style sentinel for
+    QStringTokenizer::iterator and QStringTokenizer::const_iterator.
+
+    \sa const_iterator
+*/
+
+/*!
+    \fn QStringTokenizer(Haystack haystack, String needle, Qt::CaseSensitivity cs, Qt::SplitBehavior sb)
+    \fn QStringTokenizer(Haystack haystack, String needle, Qt::SplitBehavior sb, Qt::CaseSensitivity cs)
+
+    Constructs a string tokenizer that splits the string \a haystack
+    into substrings wherever \a needle occurs, and allows iteration
+    over those strings as they are found. If \a needle does not match
+    anywhere in \a haystack, a single element containing \a haystack
+    is produced.
+
+    \a cs specifies whether \a needle should be matched case
+    sensitively or case insensitively.
+
+    If \a sb is Qt::SkipEmptyParts, empty entries don't
+    appear in the result. By default, empty entries are included.
+
+    \sa QStringView::split(), QLatin1String::split(), Qt::CaseSensitivity, Qt::SplitBehavior
+*/
+
+/*!
+    \fn QStringTokenizer::const_iterator QStringTokenizer::begin() const
+
+    Returns a const \l{STL-style iterators}{STL-style iterator}
+    pointing to the first token in the list.
+
+    \sa end(), cbegin()
+*/
+
+/*!
+    \fn QStringTokenizer::const_iterator QStringTokenizer::cbegin() const
+
+    Same as begin().
+
+    \sa cend(), begin()
+*/
+
+/*!
+    \fn QStringTokenizer::sentinel QStringTokenizer::end() const
+
+    Returns a const \l{STL-style iterators}{STL-style sentinel}
+    pointing to the imaginary token after the last token in the list.
+
+    \sa begin(), cend()
+*/
+
+/*!
+    \fn QStringTokenizer::sentinel QStringTokenizer::cend() const
+
+    Same as end().
+
+    \sa cbegin(), end()
+*/
+
+/*!
+    \fn QStringTokenizer::toContainer(Container &&c) const &
+
+    Convenience method to convert the lazy sequence into a
+    (typically) random-access container.
+
+    This function is only available if \c Container has a \c value_type
+    matching this tokenizer's value_type.
+
+    If you pass in a named container (an lvalue), then that container
+    is filled, and a reference to it is returned.
+
+    If you pass in a temporary container (an rvalue, incl. the default
+    argument), then that container is filled, and returned by value.
+
+    \code
+    // assuming tok's value_type is QStringView, then...
+    auto tok = QStringTokenizer{~~~};
+    // ... rac1 is a QVector:
+    auto rac1 = tok.toContainer();
+    // ... rac2 is std::pmr::vector<QStringView>:
+    auto rac2 = tok.toContainer<std::pmr::vector<QStringView>>();
+    auto rac3 = QVarLengthArray<QStringView, 12>{};
+    // appends the token sequence produced by tok to rac3
+    // and returns a reference to rac3 (which we ignore here):
+    tok.toContainer(rac3);
+    \endcode
+
+    This gives you maximum flexibility in how you want the sequence to
+    be stored.
+*/
+
+/*!
+    \fn QStringTokenizer::toContainer(Container &&c) const &&
+    \overload
+
+    In addition to the constraints on the lvalue-this overload, this
+    rvalue-this overload is only available when this QStringTokenizer
+    does not store the haystack internally, as this could create a
+    container full of dangling references:
+
+    \code
+    auto tokens = QStringTokenizer{widget.text(), u','}.toContainer();
+    // ERROR: cannot call toContainer() on rvalue
+    // 'tokens' references the data of the copy of widget.text()
+    // stored inside the QStringTokenizer, which has since been deleted
+    \endcode
+
+    To fix, store the QStringTokenizer in a named variable:
+
+    \code
+    auto tokenizer = QStringTokenizer{widget.text(), u','};
+    auto tokens = tokenizer.toContainer();
+    // OK: the copy of widget.text() stored in 'tokenizer' keeps the data
+    // referenced by 'tokens' alive.
+    \endcode
+
+    You can force this function into existence by passing a view instead:
+
+    \code
+    func(QStringTokenizer{QStringView{widget.text()}, u','}.toContainer());
+    // OK: compiler keeps widget.text() around until after func() has executed
+    \endcode
+*/
+
+/*!
+    \fn qTokenize(Haystack &&haystack, Needle &&needle, Flags...flags)
+    \relates QStringTokenizer
+    \since 6.0
+
+    Factory function for QStringTokenizer. You can use this function
+    if your compiler doesn't, yet, support C++17 Class Template
+    Argument Deduction (CTAD), but we recommend direct use of
+    QStringTokenizer with CTAD instead.
+*/
+
+QT_END_NAMESPACE