[Pitch] Character Classes for String Processing

Introduction

Declarative String Processing Overview presents regex-powered matching broadly, without details concerning syntax and semantics, leaving clarification to subsequent pitches. Regular Expression Literals presents more details on regex syntax such as delimiters and PCRE-syntax innards, but explicitly excludes discussion of regex semantics. This pitch aims to address a targeted subset of regex semantics: definitions of character classes. We propose a comprehensive treatment of regex character class semantics in the context of existing and newly proposed API directly on Character and Unicode.Scalar.

Character classes in regular expressions include metacharacters like \d to match a digit, \s to match whitespace, and . to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a character class to be any part of a regular expression literal that can match an actual component of a string.

Motivation

Operating over classes of characters is a vital component of string processing. Swift's String provides, by default, a view of Characters or extended grapheme clusters whose comparison honors Unicode canonical equivalence.

let str = "Cafe\u{301}" // "Café"
str == "Café"           // true
str.dropLast()          // "Caf"
str.last == "é"         // true (precomposed e with acute accent)
str.last == "e\u{301}"  // true (e followed by combining acute accent)

Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantic level of Unicode scalar values, there is little to no prior art to consult.

Other engines

Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the . character class, other languages will only match the first part of an "e\u{301}" grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional \X metacharacter, which explicitly represents a single grapheme cluster.

| Matching "Cafe\u{301}" | Pattern: ^Caf. | Remaining | Pattern: ^Caf\X | Remaining |
| --- | --- | --- | --- | --- |
| C#, Rust, Go | "Cafe" | "´" | n/a | n/a |
| NSString, Java, Ruby, Perl | "Cafe" | "´" | "Café" | "" |

Other than Java's CANON_EQ option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.

SE-0211 Unicode Scalar Properties added basic building blocks for classification of scalars by surfacing Unicode data from the UCD. SE-0221: Character Properties defined grapheme-cluster semantics for Swift for a subset of these. But many classifications used in string processing are combinations of scalar properties or ad hoc listings, and as such are not present in Swift today.
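For reference, those SE-0211 building blocks are already expressive enough to phrase many of the classifications discussed below. A small example using only existing API:

```swift
// Querying SE-0211 Unicode scalar properties surfaced from the UCD.
let nine: Unicode.Scalar = "९"  // DEVANAGARI DIGIT NINE (U+096F)

print(nine.properties.numericType == .decimal)  // true
print(nine.properties.name ?? "")               // DEVANAGARI DIGIT NINE
```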

Regardless of any syntax or underlying formalism, classifying characters is a worthy and much needed addition to the Swift standard library. We believe our thorough treatment of every character class found across many popular regex engines gives Swift a solid semantic basis.

Proposed Solution

This pitch is narrowly scoped to Swift definitions of character classes found in regexes. For each character class, we propose:

  • A name for use in API
  • A Character API, by extending Unicode scalar definitions to grapheme clusters
  • A Unicode.Scalar API with modern Unicode definitions
  • If applicable, a Unicode.Scalar API for notable standards like POSIX

We're proposing what we believe to be the Swiftiest definitions using Unicode's guidance for Unicode.Scalar and extending this to grapheme clusters using Character's existing rationale.

Broad language/engine survey

For these definitions, we cross-referenced Unicode's UTS#18 with a broad survey of existing languages and engines. We found that while these all support a subset of UTS#18, each language or framework implements a slightly different subset. The following table shows some of the variations:

| Language/Framework | Dot (.) matches | Supports \X | Canonical equivalence | \d matches FULLWIDTH digit |
| --- | --- | --- | --- | --- |
| ECMAScript | UTF-16 code unit (Unicode scalar in Unicode mode) | no | no | no |
| Perl / PCRE | UTF-16 code unit (Unicode scalar in Unicode mode) | yes | no | no |
| Python3 | Unicode scalar | no | no | yes |
| Raku | grapheme cluster | n/a | strings always normalized | yes |
| Ruby | Unicode scalar | yes | no | no |
| Rust | Unicode scalar | no | no | no |
| C# | UTF-16 code unit | no | no | yes |
| Java | Unicode scalar | yes | only in CANON_EQ mode | no |
| Go | Unicode scalar | no | no | no |
| NSRegularExpression | Unicode scalar | yes | no | yes |

We are still in the process of evaluating C++, RE2, and Oniguruma.

Detailed Design

Literal characters

A literal character (such as a or é) in a regex literal matches that particular character or code sequence. When matching at the semantic level of Unicode.Scalar, it should match the literal sequence of scalars. When matching at the semantic level of Character, it should match Character-by-Character, honoring Unicode canonical equivalence.

We are not proposing new API here as this is already handled by String and String.UnicodeScalarView's conformance to Collection.
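As an illustration of that existing behavior (no new API assumed), literal matching at each semantic level can be phrased with Collection algorithms:

```swift
let cafe = "Cafe\u{301}"

// Character level: honors canonical equivalence, so the decomposed
// "e" + U+0301 suffix matches the precomposed "é".
print(cafe.hasSuffix("é"))  // true

// Unicode.Scalar level: matches the literal scalar sequence only.
let scalars = Array(cafe.unicodeScalars)
print(scalars.contains("\u{301}" as Unicode.Scalar))  // true
print(scalars.contains("é" as Unicode.Scalar))        // false (U+00E9 not present)
```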

Unicode values: \u, \U, \x

Metacharacters that begin with \u, \U, or \x match a character with the specified Unicode scalar values. We propose these be treated exactly the same as literals.

Match any: ., \X

The dot metacharacter matches any single character or element. Depending on options and modes, it may exclude newlines.

\X matches any grapheme cluster (Character), even when the regular expression is otherwise matching at semantic level of Unicode.Scalar.

We are not proposing new API here as this is already handled by collection conformances.

While we would like for the stdlib to have grapheme-breaking API over collections of Unicode.Scalar, that is a separate discussion and out-of-scope for this pitch.
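For intuition, here is how the two semantic levels already differ over a multi-scalar grapheme cluster, using only existing stdlib API:

```swift
let flag = "🇺🇸"  // REGIONAL INDICATOR SYMBOL LETTER U + LETTER S

// At the Character level, `.` would consume the whole flag:
print(flag.count)                 // 1

// At the Unicode.Scalar level, `.` would consume only half of it,
// which is where `\X` comes in for scalar-semantic matching:
print(flag.unicodeScalars.count)  // 2
```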

Decimal digits: \d, \D

We propose \d be named "decimalDigit" with the following definitions:

extension Character {
  /// A Boolean value indicating whether this character represents
  /// a decimal digit.
  ///
  /// Decimal digits are comprised of a single Unicode scalar that has a 
  /// `numericType` property equal to `.decimal`. This includes the digits
  ///  from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
  ///  block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
  ///  (U+096F).
  ///
  /// Decimal digits are a subset of whole numbers, see `isWholeNumber`.
  ///
  /// To get the character's value, use the `decimalDigitValue` property.
  public var isDecimalDigit: Bool { get }

  /// The numeric value this character represents, if it is a decimal digit.
  ///
  /// Decimal digits are comprised of a single Unicode scalar that has a 
  /// `numericType` property equal to `.decimal`. This includes the digits
  ///  from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
  ///  block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
  ///  (U+096F).
  ///
  /// Decimal digits are a subset of whole numbers, see `wholeNumberValue`.
  ///
  ///     let chars: [Character] = ["1", "९", "A"]
  ///     for ch in chars {
  ///         print(ch, "-->", ch.decimalDigitValue)
  ///     }
  ///     // Prints:
  ///     // 1 --> Optional(1)
  ///     // ९ --> Optional(9)
  ///     // A --> nil
  public var decimalDigitValue: Int? { get }

}

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar is considered 
  /// a decimal digit.
  ///
  /// Any Unicode scalar that has a `numericType` property equal to `.decimal`
  /// is considered a decimal digit. This includes the digits from the ASCII
  /// range, from the _Halfwidth and Fullwidth Forms_  Unicode block, as well
  ///  as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F).
  public var isDecimalDigit: Bool { get }
}

\D matches the inverse of \d.

TBD: SE-0221: Character Properties did not define equivalent API on Unicode.Scalar, as it was itself an extension of single Unicode.Scalar.Properties. Since we're defining additional classifications formed from algebraic formulations of properties, it may make sense to put API such as decimalDigitValue on Unicode.Scalar as well as back-porting other API from Character (e.g. hexDigitValue). We'd like to discuss this with the community.

TBD: Character.isHexDigit is currently constrained to the subset of decimal digits that are followed by encodings of Latin letters A-F in various forms (all 6 of them... thanks Unicode). We could consider extending this to be a superset of isDecimalDigit by allowing and producing values for all decimal digits, one would just have to use the Latin letters to refer to values greater than 9. We'd like to discuss this with the community.
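As a sketch, the pitched definitions can be phrased in terms of existing SE-0211/SE-0221 API; the isProposedDecimalDigit name is ours, purely for illustration:

```swift
extension Unicode.Scalar {
  /// Sketch of the pitched `isDecimalDigit`, under a hypothetical name:
  /// any scalar whose `numericType` property is `.decimal`.
  var isProposedDecimalDigit: Bool {
    properties.numericType == .decimal
  }
}

extension Character {
  /// Grapheme-cluster extension: a single scalar with numeric type `.decimal`.
  var isProposedDecimalDigit: Bool {
    unicodeScalars.count == 1 && unicodeScalars.first!.isProposedDecimalDigit
  }
}

print(("९" as Character).isProposedDecimalDigit)  // true
print(("A" as Character).isProposedDecimalDigit)  // false
```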

Rationale

Unicode's recommended definition for \d is its numeric type of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its definition and is a proper subset of Character.isWholeNumber.

We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make this Character property restrictive, similar to isHexDigit and isWholeNumber and provide a way to access this value.

It's possible we might add future properties to differentiate Unicode's non-decimal digits, but that is outside the scope of this pitch.

Word characters: \w, \W

We propose \w be named "word character" with the following definitions:

extension Character {
  /// A Boolean value indicating whether this character is considered
  /// a "word" character.
  ///
  /// See `Unicode.Scalar.isWordCharacter`.
  public var isWordCharacter: Bool { get }
}

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar is considered
  /// a "word" character.
  ///
  /// Any Unicode scalar that has one of the Unicode properties
  /// `Alphabetic`, `Digit`, or `Join_Control`, or is in the
  /// general category `Mark` or `Connector_Punctuation`.
  public var isWordCharacter: Bool { get }
}

\W matches the inverse of \w.

Rationale

Word characters include more than letters, and we went with Unicode's recommended scalar semantics. We extend to grapheme clusters similarly to Character.isLetter, that is, subsequent (combining) scalars do not change the word-character-ness of the grapheme cluster.
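A sketch of the scalar-level definition in terms of existing Unicode.Scalar.Properties API (the name is hypothetical; the property set follows UTS #18's recommendation described above):

```swift
extension Unicode.Scalar {
  /// Sketch of the pitched `isWordCharacter` under a hypothetical name:
  /// Alphabetic or Join_Control scalars, decimal digits, marks, and
  /// connector punctuation.
  var isProposedWordCharacter: Bool {
    let category = properties.generalCategory
    return properties.isAlphabetic
      || properties.isJoinControl
      || category == .decimalNumber
      || category == .spacingMark
      || category == .nonspacingMark
      || category == .enclosingMark
      || category == .connectorPunctuation
  }
}

print(("_" as Unicode.Scalar).isProposedWordCharacter)  // true
print((" " as Unicode.Scalar).isProposedWordCharacter)  // false
```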

Whitespace and newlines: \s, \S (plus \h, \H, \v, \V, and \R)

We propose \s be named "whitespace" with the following definitions:

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar is considered 
  /// whitespace.
  ///
  /// All Unicode scalars with the derived `White_Space` property are 
  /// considered whitespace, including:
  ///
  /// - `CHARACTER TABULATION` (U+0009)
  /// - `LINE FEED (LF)` (U+000A)
  /// - `LINE TABULATION` (U+000B)
  /// - `FORM FEED (FF)` (U+000C)
  /// - `CARRIAGE RETURN (CR)` (U+000D)
  /// - `NEXT LINE (NEL)` (U+0085)
  public var isWhitespace: Bool { get }
}

This definition matches the value of the existing Unicode.Scalar.Properties.isWhitespace property. Note that Character.isWhitespace already exists with the desired semantics, which is a grapheme cluster that begins with a whitespace Unicode scalar.

We propose \h be named "horizontalWhitespace" with the following definitions:

extension Character {
  /// A Boolean value indicating whether this character is considered 
  /// horizontal whitespace.
  ///
  /// All characters with an initial Unicode scalar in the general 
  /// category `Zs`/`Space_Separator`, or the control character 
  /// `CHARACTER TABULATION` (U+0009), are considered horizontal 
  /// whitespace.
  public var isHorizontalWhitespace: Bool { get }    
}

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar is considered 
  /// horizontal whitespace.
  ///
  /// All Unicode scalars with the general category 
  /// `Zs`/`Space_Separator`, along with the control character 
  /// `CHARACTER TABULATION` (U+0009), are considered horizontal 
  /// whitespace.
  public var isHorizontalWhitespace: Bool { get }
}
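A sketch of the scalar-level definition in terms of existing properties (hypothetical name, for illustration only):

```swift
extension Unicode.Scalar {
  /// Sketch of the pitched `isHorizontalWhitespace` under a hypothetical
  /// name: the `Zs`/`Space_Separator` general category, plus TAB (U+0009).
  var isProposedHorizontalWhitespace: Bool {
    self == "\t" || properties.generalCategory == .spaceSeparator
  }
}

print(("\u{00A0}" as Unicode.Scalar).isProposedHorizontalWhitespace)  // true (NO-BREAK SPACE)
print(("\n" as Unicode.Scalar).isProposedHorizontalWhitespace)        // false
```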

We propose \v be named "verticalWhitespace" with the following definitions:

extension Character {
  /// A Boolean value indicating whether this character is considered 
  /// vertical whitespace.
  ///
  /// All characters with an initial Unicode scalar in the general 
  /// category `Zl`/`Line_Separator`, or one of the control characters
  /// listed under `Unicode.Scalar.isVerticalWhitespace`, are considered
  /// vertical whitespace.
  public var isVerticalWhitespace: Bool { get }    
}

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar is considered 
  /// vertical whitespace.
  ///
  /// All Unicode scalars with the general category 
  /// `Zl`/`Line_Separator`, along with the following control
  /// characters, are considered vertical whitespace:
  ///
  /// - `LINE FEED (LF)` (U+000A)
  /// - `LINE TABULATION` (U+000B)
  /// - `FORM FEED (FF)` (U+000C)
  /// - `CARRIAGE RETURN (CR)` (U+000D)
  /// - `NEXT LINE (NEL)` (U+0085)
  public var isVerticalWhitespace: Bool { get }
}

Note that Character.isNewline already exists with the definition required by UTS#18. TBD: Should we backport to Unicode.Scalar?

\S, \H, and \V match the inverse of \s, \h, and \v, respectively.

We propose that \R match everything in "verticalWhitespace" above, and additionally detect (and consume) the CR-LF sequence when applied to Unicode.Scalar. It is equivalent to Character.isVerticalWhitespace when applied to Characters.

We are similarly not proposing any new API for \R until the stdlib has grapheme-breaking API over Unicode.Scalar.
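For reference, the stdlib already treats CR-LF as a single Character, which is what makes the Character-level equivalence above fall out naturally:

```swift
// CR-LF is one grapheme cluster, hence one Character.
let crlf: Character = "\r\n"
print(crlf.isNewline)             // true
print(crlf.unicodeScalars.count)  // 2
```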

Rationale

Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept.

We use Unicode's recommended scalar semantics for horizontal whitespace and extend that to grapheme semantics similarly to Character.isWhitespace.

We use ICU's definition for vertical whitespace, similarly extended to grapheme clusters.

Control characters: \t, \r, \n, \f, \0, \e, \a, \b, \cX

We propose the following names and meanings for these escaped literals representing specific control characters:

extension Character {
  /// A horizontal tab character, `CHARACTER TABULATION` (U+0009).
  public static var tab: Character { get }

  /// A carriage return character, `CARRIAGE RETURN (CR)` (U+000D).
  public static var carriageReturn: Character { get }

  /// A line feed character, `LINE FEED (LF)` (U+000A).
  public static var lineFeed: Character { get }

  /// A form feed character, `FORM FEED (FF)` (U+000C).   
  public static var formFeed: Character { get }

  /// A NULL character, `NUL` (U+0000).   
  public static var nul: Character { get }

  /// An escape control character, `ESC` (U+001B).
  public static var escape: Character { get }

  /// A bell character, `BEL` (U+0007).
  public static var bell: Character { get }

  /// A backspace character, `BS` (U+0008).
  public static var backspace: Character { get }

  /// A combined carriage return and line feed as a single character denoting
  /// end-of-line.
  public static var carriageReturnLineFeed: Character { get }

  /// Returns a control character with the given value, Control-`x`.
  ///
  /// This method returns a value only when you pass a letter in 
  /// the ASCII range as `x`:
  ///
  ///     if let ch = Character.control("G") {
  ///         print("'ch' is a bell character:", ch == Character.bell)
  ///     } else {
  ///         print("'ch' is not a control character")
  ///     }
  ///     // Prints "'ch' is a bell character: true"
  ///
  /// - Parameter x: An upper- or lowercase letter to derive
  ///   the control character from.
  /// - Returns: Control-`x` if `x` is in the pattern `[a-zA-Z]`;
  ///   otherwise, `nil`.
  public static func control(_ x: Unicode.Scalar) -> Character?
}

extension Unicode.Scalar {
  /// Same as above, producing Unicode.Scalar, except for CR-LF...
}

We also propose isControl properties with the following definitions:

extension Character {
  /// A Boolean value indicating whether this character represents 
  /// a control character.
  ///
  /// Control characters are a single Unicode scalar with the
  /// general category `Cc`/`Control` or the CR-LF pair (`\r\n`).
  public var isControl: Bool { get }    
}

extension Unicode.Scalar {
  /// A Boolean value indicating whether this scalar represents 
  /// a control character.
  ///
  /// Control characters have the general category `Cc`/`Control`.
  public var isControl: Bool { get }
}

TBD: Should we have a CR-LF static var on Unicode.Scalar that produces a value of type Character?

Rationale

This approach simplifies the use of some common control characters, while making the rest available through a method call.
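For illustration, the classic Control-x mapping for ASCII letters just masks off the high bits of the letter's value; a sketch as a free function (name and placement are ours, not the pitched API):

```swift
/// Sketch of the pitched `Character.control(_:)` as a free function.
/// Control-X for an ASCII letter is the letter's value with the high bits
/// masked off; for example Control-G is 0x47 & 0x1F == 0x07 (BEL).
func controlCharacter(_ x: Unicode.Scalar) -> Character? {
  guard ("a"..."z").contains(x) || ("A"..."Z").contains(x) else { return nil }
  return Character(Unicode.Scalar(x.value & 0x1F)!)
}

print(controlCharacter("G") == "\u{07}")   // true: Control-G is BEL
print(controlCharacter("1") == nil)        // true: not a letter
```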

Unicode named values and properties: \N, \p, \P

\N{NAME} matches a Unicode scalar value with the specified name. \p{PROPERTY} and \p{PROPERTY=VALUE} match a Unicode scalar value with the given Unicode property (and value, if given).

While most Unicode-defined properties can only match at the Unicode scalar level, some are defined to match an extended grapheme cluster. For example, /\p{RGI_Emoji_Flag_Sequence}/ will match any flag emoji character, which are composed of two Unicode scalar values.

\P{...} matches the inverse of \p{...}.

Most of this is already present inside Unicode.Scalar.Properties, and we propose to round it out with anything missing, e.g. script and script extensions. (API is TBD, still working on it.)

Even though we are not proposing any Character-based API, we'd like to discuss with the community whether or how to extend them to grapheme clusters. Some options:

  • Forbid in any grapheme-cluster semantic mode
  • Match only single-scalar grapheme clusters with the given property
  • Match any grapheme cluster that starts with the given property
  • Something more-involved such as per-property reasoning

POSIX character classes: [:NAME:]

We propose that POSIX character classes be prefixed with "posix" in their name, with APIs for testing membership of Characters and Unicode.Scalars. Unicode.Scalar.isASCII and Character.isASCII already exist and can satisfy [:ascii:], and can be used in combination with new members like isDigit to represent individual POSIX character classes. Alternatively, we could introduce an option-set-like POSIXCharacterClass and func isPOSIX(_:POSIXCharacterClass), since POSIX is a fully defined standard. This would cut down significantly on the amount of API noise directly visible on Character and Unicode.Scalar. We'd like some discussion with the community here, noting that this will become clearer as more of the string processing overview takes shape.

POSIX's character classes represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which are covered elsewhere in this pitch and some of which already exist today. Some Character definitions are TBD and we'd like more discussion with the community.

| POSIX class | API name | Character | Unicode.Scalar | POSIX mode value |
| --- | --- | --- | --- | --- |
| [:lower:] | lowercase | (exists) | \p{Lowercase} | [a-z] |
| [:upper:] | uppercase | (exists) | \p{Uppercase} | [A-Z] |
| [:alpha:] | alphabetic | (exists: .isLetter) | \p{Alphabetic} | [A-Za-z] |
| [:alnum:] | alphaNumeric | TBD | [\p{Alphabetic}\p{Decimal}] | [A-Za-z0-9] |
| [:word:] | wordCharacter | (pitched) | (pitched) | [[:alnum:]_] |
| [:digit:] | decimalDigit | (pitched) | (pitched) | [0-9] |
| [:xdigit:] | hexDigit | (exists) | \p{Hex_Digit} | [0-9A-Fa-f] |
| [:punct:] | punctuation | (exists) | (port from Character) | [-!"#%&'()*,./:;?@[\\\]_{}] |
| [:blank:] | horizontalWhitespace | (pitched) | (pitched) | [ \t] |
| [:space:] | whitespace | (exists) | \p{Whitespace} | [ \t\n\r\f\v] |
| [:cntrl:] | control | (pitched) | (pitched) | [\x00-\x1f\x7f] |
| [:graph:] | TBD | TBD | TBD | [^ [:cntrl:]] |
| [:print:] | TBD | TBD | TBD | [[:graph:] ] |

Custom classes: [...]

We propose that custom classes function just like set union, and that range-based custom character classes function just like ClosedRange. Thus, we are not proposing any additional API.

That being said, providing grapheme cluster semantics is simultaneously obvious and tricky. A direct extension treats [a-f] as equivalent to ("a"..."f").contains(). Strings (and thus Characters) are ordered for the purposes of efficiently maintaining programming invariants while honoring Unicode canonical equivalence. This ordering is consistent but linguistically meaningless and subject to implementation details such as whether we choose to normalize under NFC or NFD.

let c: ClosedRange<Character> = "a"..."f"
c.contains("e") // true
c.contains("g") // false
c.contains("e\u{301}") // false, NFC uses precomposed é
c.contains("e\u{305}") // true, there is no precomposed e̅

We will likely want corresponding RangeExpression-based API in the future and keeping consistency with ranges is important.

We would like to discuss this problem with the community here. Even though we are not addressing regex literals specifically in this thread, it makes sense to produce suggestions for compilation errors or warnings.

Some options:

  • Do nothing, embrace emergent behavior
  • Warn/error for any character class ranges
  • Warn/error for character class ranges outside of a quasi-meaningful subset (e.g. ASCII, albeit one that is still subject to the issues above)
  • Warn/error for multiple-scalar grapheme clusters (albeit still subject to the issues above)

Future Directions

Future API

Library-extensible pattern matching will necessitate more types, protocols, and API in the future, many of which may involve character classes. This pitch aims to define names and semantics for exactly these kinds of API now, so that they can slot in naturally.

More classes or custom classes

Future API might express custom classes or need more built-in classes. This pitch aims to establish rationale and precedent for a large number of character classes in Swift, serving as a basis that can be extended.

More lenient conversion APIs

The proposed semantics for matching "digits" are broader than what the existing Int(_:radix:)? initializer accepts. It may be useful to provide additional initializers that can understand the whole breadth of characters matched by \d, or other related conversions.

27 Likes

I’m sorry if I somehow missed the explanation, but are "modes" (POSIX, Unicode.Scalar, Character) a property of a regular expression, or a parameter at the usage of a regular expression?

1 Like

A large number of "isFoo" properties across the two types might be a bit much. What if you introduced a type, CharacterAttributes (Naming Is Hard™), with these properties nested therein, and then have a single characterAttributes property on Character and UnicodeScalar?

struct CharacterAttributes {
  var isDecimalDigit: Bool { get } // \d
  var isLowercase: Bool { get } // [:lower:]
  ...
}

extension Character {
  var characterAttributes: CharacterAttributes { get }
}

To be clear, there's no need for the properties of this type to be stored—they can all be lazy or computed based on the character/scalar whence the instance was created. This implies the type does store said character/scalar, but nothing else.

1 Like

From "Proposed Solution":

so you didn't miss anything. It's something we're still exploring, as it's highly relevant to both the use site and potentially the compiler for diagnostics. There's a whole lot to discuss and define just within character class definitions, so ideally we can have that discussion in the overview thread.

I realize this is relevant to discussing arbitrary scalar properties in a grapheme-semantic mode. Let's assume we have enough static information available if we're discussing helpful diagnostics. We can also discuss what less-harmful fallback behavior could be with or without a signaling mechanism.

3 Likes

Why NUL as nul but BEL as bell? I would think the NULL character should be named null.

/// A NULL character, NUL (U+0000).
public static var nul: Character { get }

/// A bell character, BEL (U+0007).
public static var bell: Character { get }

3 Likes

That's a good note, agreed! We'll update that in the next draft.

1 Like

Nit: “match a Character“ isn’t meaningful for the other modes. Perhaps: “Metacharacters that begin with \u, \U or \x are treated as if they were replaced by a single Unicode scalar of the specified value.”

Nit: Does this text apply to all subsequent character classes that lack equivalent API on Unicode.Scalar.Properties? If so, should it move out from within a single character class’s description?

Does it make more sense to combine \N with \U and friends?

Brief comments—

Is this something that comes out of Unicode’s guidance? Might there be any counter examples? (For instance, can an emoji be a “word character”?—for if not, then if there’s a bona fide “word character” for which there’s also a variant selector that turns it into an emoji…)

I think someone else has already pointed out the null-NUL/bell-BEL inconsistency here.

Seems to me the same behavior for Unicode scalar escapes \U should apply here to \N and cousins.

Without having thought deeply about this, keeping the POSIX stuff separate seems nice, both for cutting down API noise and for keeping the Unicode-aware “word,” “digit,” etc. on a different (more directly visible) level from the POSIX stuff.

I think warning for anything outside the most “quasi-meaningful” subsets would be the safest here. We can rationalize all we want that subsets are otherwise consistent despite lack of semantic meaning, but I would imagine most people try to create subsets precisely because they want some semantically meaningful subset, so a warning at minimum seems appropriate.

1 Like

Are we sure that “word character” is semantically meaningful?

It seems similar to “vowel character”, in that certain symbols are only sometimes vowels, and likewise certain symbols are only sometimes word characters.

For example, many words include an apostrophe or a hyphen, but those characters also have other uses as punctuation where they are not part of a word.

I also suspect, but do not know for sure, that there may be some symbols which are word characters in one language, but non-word characters in another.

4 Likes

Let me take up an issue related to "Literal characters" and "Unicode values".

For example, a single scalar of regional indicator symbol such as U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A) cannot be a grapheme cluster according to UAX #29 - §3.1.1 and UTS #51 - §1.4.5 - ED-14.
I mean /\x{1F1E6}/ should be applied only with scalar semantics if we strictly follow the Unicode Standard.

On the other hand, Swift.Character accepts some invalid scalar sequence including just a single regional indicator symbol like below. (SR-6077)

let a: Character = "\u{1F1E6}" // Successfully compiled

In that sense, /\x{1F1E6}/ might practically be applicable with grapheme-cluster semantics in Swift. However, the result may be unexpected to the programmer who writes it.

We may have the similar options with ones described in "Custom classes" section.

  • Do nothing.
  • Warn/Error for any kind of this pattern.
  • Warn/Error for trying to apply grapheme-cluster semantics.

Note that we require validation for grapheme clusters if we want to generate warnings/errors.

1 Like

I think you're interpreting the Unicode spec incorrectly here actually. The text segmentation spec specifies that emoji flag sequences must not be broken, and emoji flag sequences are composed of 2 regional indicators. The UTS you referenced states:

  • A singleton Regional Indicator character is not a well-formed emoji flag sequence.

A singular regional indicator is not an emoji flag sequence, thus this rule does not apply to it at all. Because of the default breaking case in the text segmentation spec, a singular regional indicator is in fact a grapheme cluster itself.

Consider the following examples:

// U+1F1E7 = Regional Indicator B
// A🇧C
//
// In this case, the regional indicator is not paired with any other regional
// indicators, so there must be a boundary between the 'A' and the 'C'.
//
// Count = 3
let three = "A\u{1F1E7}C"

// U+1F1E6 = Regional Indicator A
// U+1F1E7 = Regional Indicator B
// U+1F1E8 = Regional Indicator C
// 🇦🇧🇨
//
// Now we have an interesting case with 3 regional indicators. Per the
// text segmentation spec, the first two regional indicators should be treated
// as an emoji flag sequence (a singular grapheme cluster). The last regional
// indicator, due to the fact that the first two form a grapheme cluster, is itself
// a grapheme cluster.
//
// Count = 2
let two = "\u{1F1E6}\u{1F1E7}\u{1F1E8}"
4 Likes

Control characters: \t, \r, \n, \f, \0, \e, \a, \b, \cX

For consistency with isHorizontalWhitespace, should the proposed tab property be renamed horizontalTab instead?

The \0 in Swift strings is always the null character, but in PCRE patterns it can be followed by (one or two) more octal digits.

1 Like

No. Your examples show only that Swift accepts such sequences.
You seem to have ignored UAX#29.

Table 1b. Combining Character Sequences and Grapheme Clusters

| Term | Regex | Notes |
| --- | --- | --- |
| combining character sequence | ccs-base? ccs-extend+ | A single base character is not a combining character sequence. However, a single combining mark is a (degenerate) combining character sequence. |
| extended combining character sequence | extended_base? ccs-extend+ | extended_base includes Hangul syllables |
| legacy grapheme cluster | crlf \| Control \| legacy-core legacy-postcore* | A single base character is a grapheme cluster. Degenerate cases include any isolated non-base characters, and non-base characters like controls. |
| extended grapheme cluster | crlf \| Control \| precore* core postcore* | Extended grapheme clusters add prepending and spacing marks. |

Table 1c. Regex Definitions

ccs-base := [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
ccs-extend := [\p{M}\p{Join_Control}]
extended_base := ccs-base | hangul-syllable
crlf := CR LF
legacy-core := hangul-syllable | ri-sequence | xpicto-sequence | [^Control CR LF]
legacy-postcore := [Extend ZWJ]
core := hangul-syllable | ri-sequence | xpicto-sequence | [^Control CR LF]
postcore := [Extend ZWJ SpacingMark]
precore := Prepend
RI-Sequence := RI RI
hangul-syllable := L* (V+ | LV V* | LVT) T* | L+ | T+
xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*

Because RI-Sequence is defined as only RI RI, a singleton RI is not a grapheme cluster nor an extended grapheme cluster.

By the way, the point is not here.
I want to say that scalar-value expressions would have similar problems to range expressions.

I assure you he has not.

Alejandro's example shows that Unicode accepts such sequences as degenerate grapheme clusters.

From UAX#29:

  1. Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark.

Swift, being an actual implementation of Unicode with its own additional semantics to ensure, has to constantly wrestle with the existence of degenerate grapheme clusters. They will come up in these discussions, because Unicode explicitly chose to permit their existence in favor of simplicity/speed. We follow Unicode even as it violates Collection algebra.

However, they are degenerate, so we are not burdened with ascribing meaning to the meaningless. We need to have defined behavior (in the strict UB-sense) in a world where they exist, but our regular API design intuition does not necessarily map directly on to them. We should ascribe meaning for the meaningful graphemes and degenerate cases can fall out naturally.

Some prior musings on the topic (I'm hoping Discourse links them properly):

These points become salient the moment we try to use character class definitions with API (@timv). I think here is probably a better place to discuss and clarify them than in the API pitch. This pitch doesn't call them out specifically (@nnnnnnnn we should in next draft), but its definitions extend to them.


Actually, @Alejandro is correct. It’s spelled out explicitly in what you’ve posted:

All regional indicators fall under \p{S}—they are assigned to general category So (symbol, other).

While degenerate grapheme clusters are also an issue, by my read, an isolated regional indicator isn’t even degenerate—it’s a bona fide character/grapheme cluster.
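This is easy to check against the Unicode character database. A quick sketch in Python (used here only because its `unicodedata` module exposes general categories directly, not because it models Swift's semantics):

```python
import unicodedata

# U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
ri = "\U0001F1FA"

# Regional indicators have general category So (Symbol, other),
# so they do fall under \p{S}, as quoted above.
print(unicodedata.category(ri))  # -> So

# A *pair* of regional indicators is what renders as a flag,
# e.g. U+1F1FA U+1F1F8 is the US flag: two scalars, but a single
# grapheme cluster under UAX #29's ri-sequence rule.
flag = "\U0001F1FA\U0001F1F8"
print(len(flag))  # -> 2  (scalar count, not grapheme count)
```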


@Alejandro @Michael_Ilseman @xwu
I'm very sorry; I'm the one who misread the UAX. I thought degeneracy applied to combining scalars, not to regional indicators.

So let me rephrase my question: how do we deal with degenerates in Regex?


To take this one step further, maybe the isFoo methods are obsoleted by true support for character classes? The cleanest way to implement this would be to force users to write CharacterClass.foo.contains(character), which may be a bit verbose... we could also do something like character.isMember(of: .foo), which is still more verbose than isFoo but doesn't require duplicating every character class.

That's effectively CharacterSet, I think.

  1. We pick definitions that are good for non-degenerates.

Their existence shouldn't make the world worse for meaningful characters. Their "purpose in life" is to make grapheme-breaking faster and simpler for non-degenerate strings.

  2. We make sure not to produce degenerates from normal grapheme-semantic operations, nor make existing "meaningless-but-benign" degenerates more degenerate.

That is, if the string didn't have any degenerate characters in it, we shouldn't introduce them through grapheme-semantic API. This sounds obvious, but making sure we don't violate this actually guides the design into a more consistent place.

For example, passing " \u{301}X" to a trim() grapheme-semantic API shouldn't produce "\u{301}X".

(A space with a combining mark may technically be considered "degenerate" by the phrasing of UAX#29, but it's benign in that it doesn't break algebra the way a dangling "\u{301}" does. It is "degenerate" to a renderer, and Unicode offers a lot of advice about rendering, but it's effectively benign to the standard library).
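Scalar-level processing makes it easy to produce exactly this kind of degenerate. Python's `str.strip`, which trims whitespace scalar by scalar, is a handy illustration of the behavior a grapheme-semantic trim() should avoid:

```python
import unicodedata

s = " \u0301X"  # space + U+0301 COMBINING ACUTE ACCENT + X

# At the grapheme level, the space and the combining accent form one
# (degenerate but benign) cluster, so there is no leading whitespace
# *character* to trim.

# Python strips at the scalar level, detaching the combining mark and
# leaving it dangling at the start of the string:
print(s.strip() == "\u0301X")  # -> True

# U+0301 is a nonspacing mark (Mn) with canonical combining class 230;
# stranded at the front of a string, it has no base to attach to.
print(unicodedata.category("\u0301"), unicodedata.combining("\u0301"))
```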

  3. We understand that degenerate cases might look weird, because they are.

Again, the goal is good, consistent behavior for non-degenerate strings. We're used to looking at an example and trying to figure out the most intuitive, consistent behavior for that specific example. But if we're looking at degenerate examples, which are by definition generators of counter-intuition, that tendency can lead us toward inconsistent treatment.

This is another reason why having a scalar-semantic mode is so important. From The Big Picture:

Formats such as JSON, CSV, plists, and even source code are data formats that have a textual presentation when rendered by an editor or terminal. Their stored content may be Unicode-rich, but the format itself should be processed as data.

For example, imagine processing a CSV document of all defined Unicode scalars. Inside such a document would appear the CSV field-separator , followed by U+0301 (Combining Acute Accent). If we were doing string processing, this would appear to us as a single (degenerate) grapheme cluster ,́, that is the comma combined with the following accent into a single grapheme cluster that does not compare equal to either piece. Instead, we want to process CSV at the binary-semantics level where we match field separators literally and everything in-between is opaque content which we present for interpretation by a higher-level layer.

(noting that "string processing" there is referring to grapheme semantics and "data" to scalar semantics)
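That CSV example is easy to demonstrate in any scalar-semantics engine. A small sketch in Python, whose `str.split` matches the separator literally at the scalar level:

```python
# A CSV-ish record whose middle field is the bare scalar U+0301.
record = "a,\u0301,b"

# Scalar-level ("data") processing matches the comma literally, even
# though ",\u0301" would fuse into one degenerate grapheme cluster
# under grapheme semantics, hiding the field separator.
fields = record.split(",")  # ['a', '\u0301', 'b']
print(len(fields))  # -> 3
```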


Because scalar semantics are important, we would want to be able to write a regular expression like /\x{301}/, which is regarded as degenerate when grapheme semantics are applied.
My intuition is that such regular expressions should be used only with scalar semantics.
Should we warn? Error only when trying to apply grapheme semantics? Or allow applying grapheme semantics even though it's meaningless?
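For comparison, Python's `re` is a scalar-semantics engine, so a pattern for the lone scalar U+0301 happily matches inside "e\u{301}":

```python
import re

s = "e\u0301"  # "é" as a two-scalar grapheme cluster

# Under scalar semantics, the combining accent is an ordinary,
# individually matchable element...
m = re.search("\u0301", s)
print(m.span())  # -> (1, 2)

# ...even though under grapheme semantics, s is a single character,
# and a match for the bare accent would be meaningless.
```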

💭

I imagine some process like $nfd =~ s/([\x{3099}\x{309A}])/$1 eq "\x{3099}" ? "\x{309B}" : "\x{309C}"/eg; in Perl (eq, since this is a string comparison). U+3099 and U+309A are nonspacing marks.
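For anyone not fluent in Perl, a rough Python equivalent of that substitution, mapping the combining kana voicing marks U+3099/U+309A to their non-combining counterparts U+309B/U+309C (the helper name `un_combine` is just for illustration):

```python
import re

def un_combine(nfd: str) -> str:
    # Replace COMBINING KATAKANA-HIRAGANA (SEMI-)VOICED SOUND MARK
    # with the standalone U+309B / U+309C forms.
    return re.sub(
        "[\u3099\u309a]",
        lambda m: "\u309b" if m.group() == "\u3099" else "\u309c",
        nfd,
    )

# KATAKANA LETTER HA + combining voiced mark -> HA + standalone mark
print(un_combine("\u30cf\u3099") == "\u30cf\u309b")  # -> True
```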
