Hello, all!
I'm here with a pitch for another component of the regex-powered string processing work, this one focused on how Regex interacts with the rich Unicode support in Swift's String and accompanying types. You can read portions of the draft proposal below or as a Markdown document in full.
Introduction
This proposal describes Regex's rich Unicode support during regex matching, along with the character classes and options that define and modify that behavior.
This proposal is one component of a larger regex-powered string processing initiative. For the status of each proposal, see this document. Discussion of other facets of the overall regex design is out of scope for this proposal and is better suited to the most relevant review.
Motivation
Swift's String type provides, by default, a view of Characters, or extended grapheme clusters, whose comparison honors Unicode canonical equivalence. Each character in a string can be composed of one or more Unicode scalar values while still being treated as a single unit, and it compares equal to any other canonically equivalent way of composing the same character:
let str = "Cafe\u{301}" // "Café"
str == "Café" // true
str.dropLast() // "Caf"
str.last == "é" // true (precomposed e with acute accent)
str.last == "e\u{301}" // true (e followed by composing acute accent)
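The same point shows up when counting elements in each view of the string. This is a small sketch that uses only standard library APIs; the constant names are illustrative.

```swift
let str = "Cafe\u{301}"  // "Café", written with a combining acute accent

// The default view counts extended grapheme clusters (Characters).
let characterCount = str.count

// The Unicode scalar view counts each scalar value separately.
let scalarCount = str.unicodeScalars.count

print(characterCount, scalarCount)  // prints "4 5"
```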
This default view is fairly novel. Most languages that support Unicode strings generally operate at the Unicode scalar level, and don't provide the same affordance for operating on a string as a collection of grapheme clusters. In Python, for example, Unicode strings report their length as the number of scalar values, and don't use canonical equivalence in comparisons:
cafe = u"Cafe\u0301"
len(cafe) # 5
cafe == u"Café" # False
Existing regex engines follow this same model of operating at the Unicode scalar level. To match canonically equivalent characters, or have equivalent behavior between equivalent strings, you must normalize your string and regex to the same canonical format.
import re
import unicodedata

# Matches a four-element string
re.match(u"^.{4}$", cafe) # None
# Matches a string ending with 'é'
re.match(u".+é$", cafe) # None
# Normalizing to NFC composes "e" + U+0301 into a single scalar
cafeComp = unicodedata.normalize("NFC", cafe)
re.match(u"^.{4}$", cafeComp) # <re.Match object...>
re.match(u".+é$", cafeComp) # <re.Match object...>
With Swift's string model, this behavior would be surprising and undesirable: Swift's default regex semantics must match the semantics of a String.
Other engines
Other regex engines match character classes (such as \w or .) at the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the . character class, other languages will only match the first part of an "e\u{301}" grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional \X metacharacter, which explicitly represents a single grapheme cluster.
Matching "Cafe\u{301}" | Pattern: ^Caf. | Remaining | Pattern: ^Caf\X | Remaining |
---|---|---|---|---|
C#, Rust, Go, Python | "Cafe" | "´" | n/a | n/a |
NSString, Java, Ruby, Perl | "Cafe" | "´" | "Café" | "" |
Other than Java's CANON_EQ option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.
Proposed solution
In a regex's simplest form, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regex that simply contains the same characters.
let str = "Cafe\u{301}" // "Café"
str.contains(/Café/) // true
From that point, small changes continue to comport with the element counting and comparison expectations set by String:
str.contains(/Caf./) // true
str.contains(/.+é/) // true
str.contains(/.+e\u{301}/) // true
str.contains(/\w+é/) // true
For compatibility with other regex engines and the flexibility to match at both Character and Unicode scalar level, you can switch between matching levels for an entire regex or within select portions. This powerful capability provides the expected default behavior when working with strings, while allowing you to drop down for Unicode scalar-specific matching.
By default, literal characters and Unicode scalar values (e.g. \u{301}) are coalesced into characters in the same way as a normal string, as shown above. Metacharacters, like . and \w, and custom character classes each match a single element at the current matching level.
For example, these matches fail, because by the time the parser encounters the "\u{301}" Unicode scalar literal, the full "é" character has been matched:
str.contains(/Caf.\u{301}/) // false - `.` matches "é" character
str.contains(/Caf\w\u{301}/) // false - `\w` matches "é" character
str.contains(/.+\u{301}/) // false - `.+` matches each character
Alternatively, we can drop down to use Unicode scalar semantics if we want to match specific Unicode sequences. For example, these regexes match an "e" followed by any modifier with the specified parameters:
str.contains(/e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar))
// true - matches an "e" followed by a Unicode scalar in the range U+0300 - U+0314
str.contains(/e\p{Nonspacing Mark}/.matchingSemantics(.unicodeScalar))
// true - matches an "e" followed by a Unicode scalar with general category "Nonspacing Mark"
Matching in Unicode scalar mode is analogous to comparing against a string's UnicodeScalarView: individual Unicode scalars are matched without combining them into characters or testing for canonical equivalence.
str.contains(/Café/.matchingSemantics(.unicodeScalar))
// false - "e\u{301}" doesn't match with /é/
str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar))
// true - "e\u{301}" matches with /e\u{301}/
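To make the analogy concrete, the sketch below compares the two string views directly and then repeats the four-element check from the Python example at each matching level. It assumes a Swift 5.7 or later toolchain and uses the extended #/.../# literal delimiters, which are available without a compiler flag. The constant names are illustrative only.

```swift
let str = "Cafe\u{301}"        // decomposed: "e" followed by U+0301
let precomposed = "Caf\u{E9}"  // precomposed: U+00E9

// Character-level comparison honors canonical equivalence...
let equalAsCharacters = (str == precomposed)
// ...while the Unicode scalar view compares scalar by scalar.
let equalAsScalars = str.unicodeScalars.elementsEqual(precomposed.unicodeScalars)

// Default (grapheme cluster) semantics: `.` consumes a whole Character.
let fourCharacters = str.wholeMatch(of: #/.{4}/#) != nil
// Unicode scalar semantics: `.` consumes a single Unicode scalar.
let fiveScalars = str.wholeMatch(of: #/.{5}/#.matchingSemantics(.unicodeScalar)) != nil

print(equalAsCharacters, equalAsScalars, fourCharacters, fiveScalars)
```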
Swift's Regex follows the level 2 guidelines for Unicode support in regular expressions described in Unicode Technical Standard #18, with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. In addition to selecting the matching semantics, Regex provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which correspond more closely with other regex engines.
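As a rough illustration of one such option, the sketch below restricts \w to ASCII word characters. It assumes the asciiOnlyWordCharacters() method name from the Regex API that shipped in Swift 5.7, which may differ from the spelling in this draft; the constant names are illustrative.

```swift
let word = "Caf\u{E9}"  // "Café"

// By default, `\w` uses Unicode-aware word characters, so "é" is included.
let unicodeWord = word.wholeMatch(of: #/\w+/#) != nil

// With ASCII-only word characters, `\w` behaves like [A-Za-z0-9_],
// so the non-ASCII "é" prevents a whole-string match.
let asciiWord = word.wholeMatch(of: #/\w+/#.asciiOnlyWordCharacters()) != nil

print(unicodeWord, asciiWord)
```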
Detailed design
First, we'll discuss the options that let you control a regex's behavior, and then explore the character classes that define your pattern.
Source compatibility
Everything in this proposal is additive, and has no compatibility effect on existing source code.
Effect on ABI stability
Everything in this proposal is additive, and has no effect on existing stable ABI.
Effect on API resilience
N/A
Future directions
Expanded options and modifiers
The initial version of Regex includes only the options described above. Filling out the remainder of options described in the Run-time Regex Construction proposal could be completed as future work, as well as additional improvements, such as adding an option that makes a regex match only at the start of a string.
Extensions to Character and Unicode Scalar APIs
An earlier version of this pitch described adding standard library APIs to Character and UnicodeScalar for each of the supported character classes, as well as convenient static members for control characters. In addition, regex literals support Unicode property features that don’t currently exist in the standard library, such as a scalar’s script or extended category, or creating a scalar by its Unicode name instead of its scalar value. These kinds of additions are
Byte semantic mode
A future Regex version could support a byte-level semantic mode in addition to grapheme cluster and Unicode scalar semantics. Byte-level semantics would allow matching individual bytes, potentially providing the capability of parsing string and non-string data together.
More general CharacterSet replacement
Foundation's CharacterSet type is in some ways similar to the CharacterClass type defined in this proposal. CharacterSet is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift Strings. The proposed CharacterClass type is a RegexBuilder-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the CharacterClass API or introducing a different type to fill that role.
Alternatives considered
Operate on String.UnicodeScalarView instead of using semantic modes
Instead of providing APIs to select whether Regex matching is Character-based or UnicodeScalar-based, we could provide methods to match against the different views of a string. This approach has multiple drawbacks:
- As the scalar level used when matching changes the behavior of individual components of a Regex, it’s more appropriate to specify the semantic level at the declaration site than the call site.
- With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach.
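The second drawback can be sketched with a regex that mixes levels. This is a minimal example, assuming the RegexBuilder module and the matchingSemantics(_:) method as they shipped in Swift 5.7; the constant names are illustrative, and \p{Mn} is the short form of the Nonspacing Mark general category.

```swift
import RegexBuilder

let str = "Cafe\u{301}"

// This portion matches scalar by scalar: an "e" followed by any
// scalar in the Nonspacing Mark (Mn) general category.
let accentedE = #/e\p{Mn}/#.matchingSemantics(.unicodeScalar)

// The enclosing regex keeps the default grapheme cluster semantics.
let regex = Regex {
    #/Caf/#
    accentedE
}

let matched = str.contains(regex)
print(matched)
```

Because the semantic level is attached to the component rather than to the call, the same str.contains(_:) call site serves both portions of the pattern.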
Binary word boundary option method
A prior version of this proposal used a binary method for setting the word boundary algorithm, called usingSimpleWordBoundaries(). A method taking a RegexWordBoundaryKind instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future.
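For illustration, the sketch below contrasts the default word boundaries with the simple algorithm. It assumes the wordBoundaryKind(_:) method and the .simple case from the Regex API that shipped in Swift 5.7; the sample string and constant names are hypothetical.

```swift
let str = "Don't look down!"

// Default: Unicode word boundaries keep "Don't" together, because
// an apostrophe between two letters is not treated as a break.
let defaultMatch = str.firstMatch(of: #/.+?\b/#)?.output

// Simple boundaries fall between word and non-word characters,
// so the apostrophe ends the first word.
let simpleMatch = str.firstMatch(of: #/.+?\b/#.wordBoundaryKind(.simple))?.output

print(defaultMatch ?? "", simpleMatch ?? "")
```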