[Pitch] Unicode for String Processing

Hello, all!

I'm here with a pitch for another component of the regex-powered string processing work, this one focused on how Regex interacts with the rich Unicode support in Swift's String and accompanying types. You can read portions of the draft proposal below or as a Markdown document in full.


Introduction

This proposal describes Regex's rich Unicode support during regex matching, along with the character classes and options that define and modify that behavior.

This proposal is one component of a larger regex-powered string processing initiative. For the status of each proposal, see this document — discussion of other facets of the overall regex design is out of scope of this proposal and better discussed in the most relevant review.

Motivation

Swift's String type provides, by default, a view of Characters, or extended grapheme clusters, whose comparison honors Unicode canonical equivalence. Each character in a string can be composed of one or more Unicode scalar values while still being treated as a single unit, equivalent to other ways of formulating the same character:

let str = "Cafe\u{301}" // "Café"
str == "Café"           // true
str.dropLast()          // "Caf"
str.last == "é"         // true (precomposed e with acute accent)
str.last == "e\u{301}"  // true (e followed by composing acute accent)

This default view is fairly novel. Most languages that support Unicode strings generally operate at the Unicode scalar level, and don't provide the same affordance for operating on a string as a collection of grapheme clusters. In Python, for example, Unicode strings report their length as the number of scalar values, and don't use canonical equivalence in comparisons:

cafe = u"Cafe\u0301"
len(cafe)                     # 5
cafe == u"Café"               # False

Existing regex engines follow this same model of operating at the Unicode scalar level. To match canonically equivalent characters, or have equivalent behavior between equivalent strings, you must normalize your string and regex to the same canonical format.

import re, unicodedata

# Matches a four-element string
re.match(u"^.{4}$", cafe)     # None
# Matches a string ending with 'é'
re.match(u".+é$", cafe)       # None

cafeComp = unicodedata.normalize("NFC", cafe)
re.match(u"^.{4}$", cafeComp) # <re.Match object...>
re.match(u".+é$", cafeComp)   # <re.Match object...>

With Swift's string model, this behavior would be surprising and undesirable — Swift's default regex semantics must match the semantics of a String.

Other engines

Other regex engines match character classes (such as \w or .) at the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the . character class, other languages will only match the first part of an "e\u{301}" grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional \X metacharacter, which explicitly represents a single grapheme cluster.

Matching "Cafe\u{301}" Pattern: ^Caf. Remaining Pattern: ^Caf\X Remaining
C#, Rust, Go, Python "Cafe" "´" n/a n/a
NSString, Java, Ruby, Perl "Cafe" "´" "Café" ""
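For comparison, a quick sketch of the proposed Swift behavior for the same input: under the default grapheme cluster semantics, . consumes the entire "é", so ^Caf. matches the whole string and nothing remains.

let match = "Cafe\u{301}".wholeMatch(of: /^Caf./)
print(match?.0 as Any)   // Optional("Café") - nothing remaining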

Other than Java's CANON_EQ option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.

Proposed solution

In a regex's simplest form, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regex that simply contains the same characters.

let str = "Cafe\u{301}"     // "Café"
str.contains(/Café/)        // true

From that point, small changes continue to comport with the element counting and comparison expectations set by String:

str.contains(/Caf./)        // true
str.contains(/.+é/)         // true
str.contains(/.+e\u{301}/)  // true
str.contains(/\w+é/)        // true

For compatibility with other regex engines and the flexibility to match at both Character and Unicode scalar level, you can switch between matching levels for an entire regex or within select portions. This powerful capability provides the expected default behavior when working with strings, while allowing you to drop down for Unicode scalar-specific matching.
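As a sketch of what mixing levels might look like (assuming the matchingSemantics(_:) option described in this pitch can be applied to a nested component), only the trailing portion here is matched at the Unicode scalar level:

import RegexBuilder

let mixed = Regex {
    "Caf"                                                   // grapheme cluster semantics
    /e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar)  // Unicode scalar semantics
}
"Cafe\u{301}".contains(mixed)   // true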

By default, literal characters and Unicode scalar values (e.g. \u{301}) are coalesced into characters in the same way as a normal string, as shown above. Metacharacters, like . and \w, and custom character classes each match a single element at the current matching level.

For example, these matches fail, because by the time the parser encounters the "\u{301}" Unicode scalar literal, the full "é" character has been matched:

str.contains(/Caf.\u{301}/)    // false - `.` matches the full "é" character
str.contains(/Caf\w\u{301}/)   // false - `\w` matches the full "é" character
str.contains(/.+\u{301}/)      // false - `.+` matches whole characters

Alternatively, we can drop down to Unicode scalar semantics when we want to match specific Unicode scalar sequences. For example, these regexes match an "e" followed by a combining mark in the specified range or category:

str.contains(/e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar))
// true - matches an "e" followed by a Unicode scalar in the range U+0300 - U+0314
str.contains(/e\p{Nonspacing Mark}/.matchingSemantics(.unicodeScalar))
// true - matches an "e" followed by a Unicode scalar with general category "Nonspacing Mark"

Matching in Unicode scalar mode is analogous to comparing against a string's UnicodeScalarView — individual Unicode scalars are matched without combining them into characters or testing for canonical equivalence.

str.contains(/Café/.matchingSemantics(.unicodeScalar))
// false - "e\u{301}" doesn't match with /é/
str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar))
// true - "e\u{301}" matches with /e\u{301}/

Swift's Regex follows the level 2 guidelines for Unicode support in regular expressions described in Unicode Technical Standard #18, with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. In addition, Regex provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which correspond more closely with other regex engines.
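For example, here's a hedged sketch of the ASCII option in action, assuming the asciiOnlyDigits() spelling listed later in this pitch: by default, \d matches any Unicode decimal digit, including the Devanagari digits below.

let digits = "१२३"                         // DEVANAGARI DIGIT ONE through THREE
digits.contains(/\d/)                      // true - Unicode decimal digits match
digits.contains(/\d/.asciiOnlyDigits())    // false - only "0"..."9" match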

Detailed design

First, we'll discuss the options that let you control a regex's behavior, and then explore the character classes that define your pattern.

Source compatibility

Everything in this proposal is additive, and has no compatibility effect on existing source code.

Effect on ABI stability

Everything in this proposal is additive, and has no effect on existing stable ABI.

Effect on API resilience

N/A

Future directions

Expanded options and modifiers

The initial version of Regex includes only the options described above. Filling out the remainder of the options described in the Run-time Regex Construction proposal could be completed as future work, along with additional improvements, such as adding an option that makes a regex match only at the start of a string.

Extensions to Character and Unicode Scalar APIs

An earlier version of this pitch described adding standard library APIs to Character and UnicodeScalar for each of the supported character classes, as well as convenient static members for control characters. In addition, regex literals support Unicode property features that don’t currently exist in the standard library, such as a scalar’s script or extended category, or creating a scalar by its Unicode name instead of its scalar value. These kinds of additions are left as future standard library work.
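As a purely illustrative sketch of what such additions might look like (none of these names are proposed API):

extension Character {
    /// A rough approximation of the regex \w class, for illustration only.
    var isWordCharacter: Bool { isLetter || isNumber || self == "_" }
}

extension Unicode.Scalar {
    /// A scalar looked up by its Unicode name, e.g. "COMBINING ACUTE ACCENT".
    init?(unicodeName: String) {
        // A real implementation would consult the Unicode Character Database.
        return nil
    }
}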

Byte semantic mode

A future Regex version could support a byte-level semantic mode in addition to grapheme cluster and Unicode scalar semantics. Byte-level semantics would allow matching individual bytes, potentially providing the capability of parsing string and non-string data together.

More general CharacterSet replacement

Foundation's CharacterSet type is in some ways similar to the CharacterClass type defined in this proposal. CharacterSet is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift Strings. The proposed CharacterClass type is a RegexBuilder-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the CharacterClass API or introducing a different type to fill that role.

Alternatives considered

Operate on String.UnicodeScalarView instead of using semantic modes

Instead of providing APIs to select whether Regex matching is Character-based vs. UnicodeScalar-based, we could instead provide methods to match against the different views of a string. This approach has multiple drawbacks:

  • As the semantic level used when matching changes the behavior of individual components of a Regex, it’s more appropriate to specify the semantic level at the declaration site than at the call site.
  • With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach.

Binary word boundary option method

A prior version of this proposal used a binary method for setting the word boundary algorithm, called usingSimpleWordBoundaries(). A method taking a RegexWordBoundaryKind instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future.
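A usage sketch, based on the option and kind names in this pitch (exact spellings may change in review): opting into simple, level 1 word boundaries for compatibility with other engines' \b.

let pattern = /\bswift\b/.ignoresCase().wordBoundaryKind(.unicodeLevel1)
"Swift is fun".contains(pattern)   // true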


I'm not sure which Regex-related proposal this fits under, but this behaviour was a little unexpected at first (running under the current nightly docker image):

let r = try! Regex(compiling: #"\n"#)
let c = "line1\r\nline2".contains(r)
print("contains? \(c)")

From here, I understand that \r\n is a grapheme cluster and therefore a single Character so it makes sense once I had made that link. However, with the introduction of Regex, it's going to be easier to do text processing in Swift and I think this will catch others out.

It might be worth mentioning this particular case when talking about the differences between Unicode and code-point-level text processing? I had naïvely thought that it would only make a difference to things like emoji and accented characters.

Right, this flows from the principle that a regex comprised of only character data matches with the same semantics as string equality. We treat "\r\n" as a single character, but it isn't one that's equivalent to "\n". \n in a regex isn't a metacharacter like \w or \d; it's just a literal newline, so it fails to match here. It's the same principle as "Cafe\u{301}".contains(/e/) == false.
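You can see the same thing with plain Character comparisons (a small illustration, not part of the pitch):

let crlf: Character = "\r\n"
crlf == "\n"                                    // false - one Character, not equal to "\n"
"line1\r\nline2".contains("\n" as Character)    // false - the only newline is inside "\r\n"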

The metacharacters that match vertical space all do match, however:

for regexStr in [#"\n"#, #"\s"#, #"\v"#, #"\R"#] {
    print(regexStr, "found?", "line1\r\nline2".contains(try! Regex(regexStr)))
}
// Prints:
// \n found? false
// \s found? true
// \v found? true
// \R found? true

I think a valid alternative is to treat \n, by default, as Character.isNewline (or as \R). We already treat \b specially under level 2 of UTS#18 by default.


Thank you. To be clear, I wasn’t saying the behaviour is incorrect, just noting that I was surprised by it. I thought it might be helpful to call out this specific case in any discussions of Unicode matching, because I’m probably not going to be the only one surprised by it 🙂.


"2.4 Default Case Conversion" is described as retracted, and it implies there could be various implementations:

Some implementations may choose to have a mixed solution, where they do full case matching on literals such as "Strauß", but simple case folding on character classes such as [ß].

I wonder how ignoresCase(_:) in this pitch is expected to work.
Would it be locale-independent?
Would it take some options, for example, to choose full case folding or simple case folding?

Just discovered that Raku (Perl 6) does this: Regexes | Raku Documentation

\n matches a logical newline. \N matches a single character that's not a logical newline.


I'll let @nnnnnnnn and @Alejandro get into specifics here, but in general the stdlib implements locale-insensitive, context-insensitive, un-tailored behavior in order to present a (quasi-)algebraic model of String. This means that e.g. the Greek sigma needs consistent treatment no matter where it appears. Locale-specific or specialized functionality can be layered on top via explicit API or options/customizations.

As a more technical division, and the line can be slightly blurry here, functionality based on the UCD is within the realm of the stdlib, while functionality based on the CLDR is the role of libraries layered on top (such as Foundation).

My guess from my reading of the Unicode Standard (section 3.13) is that it would implement the Default Case Folding algorithm, but another alternative is identifier caseless matching. Case folding does not always preserve normalization, so care may be needed for caseless canonically-equivalent comparisons.
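To illustrate part of why this is subtle (standard library behavior, not pitch API): full case conversion can change a string's length, so case operations don't round-trip.

"ß".uppercased()         // "SS" - one character becomes two
"Strauß".uppercased()    // "STRAUSS"
"STRAUSS".lowercased()   // "strauss" - the ß is not recovered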

I think that tailoring of case operations makes sense as explicit API, and it's most likely future work. So perhaps a more general question is: what is the extensibility story for locale-sensitive processing? Do we have a clear way to extend what is proposed in the future for these concerns, such that a localized library, e.g. Foundation, can add localized API and behavior on top?


This behavior would be my vote as well. I think it's better to fold CR-LF into regex's \n than treat it as a literal equivalent to String literal \n (that is, LF).


Not sure parsing is relevant; the reason isn't that we don't know about the \u{301} in the literal, but that the semantics of . match a full character. We could special-case concatenation of a combining scalar, and that should be addressed in an alternatives-considered section, especially if there are undesirable side effects of doing so.

Single-line mode (?s) dotMatchesNewlines()

Should this be the default?

Multi-line mode (?m) anchorsMatchNewlines()

Similarly, default or not?

Unicode word boundaries (?w) wordBoundaryKind(_:)

This is the default, right?

Semantic level (?Xu) matchingSemantics(_:)

This is newly introduced and specific to Swift, correct?

We might want a column for enabled/disabled by default

asciiOnlyDigits, asciiOnlyWhitespace, asciiOnlyWordCharacters, asciiOnlyCharacterClasses

Elsewhere we use an enum to cut down on all of these overloads (e.g. repetition behavior, word boundary level, semantic level, etc.). Is that possible here? Should character class customization using options receive an OptionSet?

Note that if RegexBuilder is imported, all of this API will be available on any component, which includes String and library parsers, so it makes sense to somewhat organize this API.
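For concreteness, a sketch of the OptionSet idea (all names here are hypothetical, not proposed API):

struct ASCIIOnlyClasses: OptionSet {
    let rawValue: Int
    static let digits         = ASCIIOnlyClasses(rawValue: 1 << 0)
    static let whitespace     = ASCIIOnlyClasses(rawValue: 1 << 1)
    static let wordCharacters = ASCIIOnlyClasses(rawValue: 1 << 2)
    static let all: ASCIIOnlyClasses = [.digits, .whitespace, .wordCharacters]
}
// Usage might then look like: regex.asciiOnly([.digits, .whitespace])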

€1 234,56 ["1", "234", "56"] ["€", "1", "234,56"]

What happened to the € in the first table entry?

public struct RegexWordBoundaryKind: Hashable {
  public static var unicodeLevel1: Self { get }
  public static var unicodeLevel2: Self { get }
}

This naming convention feels odd. Either we use an enum (or enum-like struct) for the UTS#18 levels (a concept beyond just word boundaries), or we give these names reflecting what word boundary algorithm is used, like simpleBoundaries and defaultBoundaries.
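Something like the following, with identifiers that are purely illustrative:

public struct RegexWordBoundaryKind: Hashable {
  // UTS#18 simple word breaks, as implemented by most other engines.
  public static var simpleBoundaries: Self { get }
  // Unicode default word boundaries (UAX#29), the proposed default.
  public static var defaultBoundaries: Self { get }
}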

When matching with Unicode scalar semantics, metacharacters and character classes always match a single Unicode scalar value, even if that scalar comprises part of a grapheme cluster.

What happens with case-folded matching if case conversion produces multiple scalars? I don't recall if case folded form has this as well, just making sure that . matches the same with and without case insensitivity. CC @Alejandro

With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, including using the \O metacharacter or .anyUnicodeScalar character class, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a String instance.

There's a lot of back-and-forth prose that switches between modes. Do you think you can give a clearer, algebraic reasoning for where boundaries occur? This may also help explain why some choices were made (e.g. producing EGC-aligned indices).

When a regex proceeds with grapheme cluster semantics from a position that isn't grapheme cluster aligned, it attempts to match the partial grapheme cluster that starts at that point. In the first call to contains(_:) below, \O matches a single Unicode scalar value, as shown above, and then the engine tries to match \s against the remainder of the family emoji character.

When is partial matching useful vs dropping to a more precise level? E.g., why is \O supported in grapheme-semantics mode? How does that behave around captures, that is could it be used to split a grapheme cluster? We treat \X as . (pending newline considerations) in grapheme cluster mode and \O as . (pending newline considerations) in scalar semantic modes.

Stepping up to a higher-level mode is useful and that's when alignment becomes more interesting. We could treat \O as any under max(currentMode, .scalarSemantics), \X as any under max(currentMode, .graphemeClusterSemantics), etc. Another alternative is to forbid \O in grapheme cluster semantic mode, but I'd prefer promoting it to \X.

Stepping down is where it'd be very useful to have an algebraic model to reason about this with, as it seems likely to introduce problems.

Regex syntax: (?X)... or (?X...) for grapheme cluster semantics, (?u)... or (?u...) for Unicode scalar semantics.

IIUC, this is completely novel to Swift and this proposal. That should be spelled out prominently, and it would cause an issue if these letters are ever used for important options in the future. An alternatives-considered section should go over why these letters were chosen over a more verbose solution, and another alternative is to treat it as future work from within a literal.

Regex syntax: (?U)... or (?U...)

RegexBuilder API:

The repetitionBehavior(_:) method lets you set the default behavior for all quantifiers that don't explicitly provide their own behavior. For example, you can make all quantifiers behave possessively, eliminating any quantification-caused backtracking.

Switching the default to possessive is a big deal and Swift's making a contribution to this space. However, possessive is not like the others: it dramatically changes the inputs recognized by the regex. Should these be globbed together?

Also, why is repetitionBehavior on String à la RegexComponent? These don't really make sense on leaf regex components; should they be on a protocol specific to builders and combinators?

When you pass nil, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default.

What about possessive?

Unicode scalar semantics: Matches a Unicode scalar that has a numericType property equal to .decimal. This includes the digits from the ASCII range, from the Halfwidth and Fullwidth Forms Unicode block, as well as digits in some scripts, like DEVANAGARI DIGIT NINE (U+096F). This corresponds to the general category Decimal_Number.

Grapheme cluster semantics: Matches a character made up of a single Unicode scalar that fits the decimal digit criteria above.

ASCII mode: Matches a Unicode scalar in the range 0 to 9.

Is ASCII a mode or a set of options? Should it be a mode?

Also, this prose convention is difficult to follow. Any way to separate out the definitions a bit more? The disclosure triangles are also not rendering as clickable, which makes this worse...

For Grapheme cluster semantics, a very relevant point is whether this is the same definition as API on Character or not.

Unicode scalar semantics: Matches a decimal digit, as described above, or an uppercase or small A through F from the Halfwidth and Fullwidth Forms Unicode block. Note that this is a broader class than described by the UnicodeScalar.properties.isHexDigit property, as that property only includes ASCII and fullwidth decimal digits.

Rationale?

Unicode property matching is extended to Characters with a goal of consistency with other regex character classes. For \p{Decimal} and \p{Hex_Digit}, only single-scalar Characters can match, for the reasons described in that section, above. For all other Unicode property classes, matching Characters can comprise multiple scalars, as long as the first scalar matches the property.

Does permissive reasoning really apply to all the rest? Did we check?

What about non-boolean properties like canonical combining class? Do those make sense under first-scalar interpretation?

[:lower:] \p{Lowercase} starts-with [a-z]

Out of curiosity, do fuzzy matching and the unification of [:...:] with \p{...} mean that this could be written as \p{lower}?

When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a ClosedRange would operate via its contains(_:) method, since that depends on String's Comparable conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in Regex.

But is NFKD the same behavior? Could you give examples? It seems like there's a lot of complexity here and I don't want mistakes or inconsistencies to slip through.

A custom character class will match a maximum of one Character or UnicodeScalar, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics.
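To make that concrete, a sketch of what the quoted rule would imply (hedging on the exact behavior):

let flag = "🇺🇸"   // one Character, two Unicode scalars: U+1F1FA, U+1F1F8
flag.contains(/[🇺🇸]/)                                     // true - matches one Character
flag.contains(/[🇺🇸]/.matchingSemantics(.unicodeScalar))   // false, per the quoted rule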

Is \R permitted inside a custom character class and if so, what does it do?

With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach.

Can you detail how boundaries work when switching between levels? E.g. grapheme-cluster -> scalar -> grapheme-cluster: how many grapheme cluster boundaries are checked, and where? And vice versa. I think this is a very important, albeit nuanced, aspect that can help illuminate the model.

A prior version of this proposal used a binary method for setting the word boundary algorithm, called usingSimpleWordBoundaries(). A method taking a RegexWordBoundaryKind instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future.

Should that be the approach taken for everything that might incorporate locale or tailoring? E.g. semantic modes and the alternate character classes?
