- Authors: Nate Cook, Michael Ilseman
- Status: Draft pitch
Introduction
Declarative String Processing Overview presents regex-powered matching broadly, without details concerning syntax and semantics, leaving clarification to subsequent pitches. Regular Expression Literals presents more details on regex syntax such as delimiters and PCRE-syntax innards, but explicitly excludes discussion of regex semantics. This pitch and discussion aims to address a targeted subset of regex semantics: definitions of character classes. We propose a comprehensive treatment of regex character class semantics in the context of existing and newly proposed API directly on `Character` and `Unicode.Scalar`.
Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a character class to be any part of a regular expression literal that can match an actual component of a string.
Motivation
Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s, or extended grapheme clusters, whose comparison honors Unicode canonical equivalence.
let str = "Cafe\u{301}" // "Café"
str == "Café" // true
str.dropLast() // "Caf"
str.last == "é" // true (precomposed e with acute accent)
str.last == "e\u{301}" // true (e followed by composing acute accent)
Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult.
Other engines
Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster.
Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining |
---|---|---|---|---|
C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a |
NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` |
Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.
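The split shown in the table can be reproduced with Swift's existing string views. The following is a minimal sketch, assuming that a scalar-level `.` simply consumes one Unicode scalar; the variable names are illustrative only.

```swift
let cafe = "Cafe\u{301}" // "Café" spelled with a combining acute accent

// Grapheme-cluster view: the final "e" and U+0301 form one Character.
let characterCount = cafe.count

// Scalar view: the accent is a separate element.
let scalarCount = cafe.unicodeScalars.count

// What a scalar-level `^Caf.` would consume: the first four scalars only.
var scalarMatch = String.UnicodeScalarView()
scalarMatch.append(contentsOf: cafe.unicodeScalars.prefix(4))
let scalarRemainder = Array(cafe.unicodeScalars.dropFirst(4))

// What a grapheme-level `^Caf.` (or `^Caf\X`) would consume: four Characters.
let characterMatch = String(cafe.prefix(4))

print(characterCount, scalarCount)                // 4 5
print(String(scalarMatch), scalarRemainder.count) // Cafe 1
print(characterMatch == cafe)                     // true
```

At the scalar level the accent is left behind as unmatched input, while at the `Character` level the whole string is consumed.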
SE-0211 Unicode Scalar Properties added basic building blocks for classification of scalars by surfacing Unicode data from the UCD. SE-0221: Character Properties defined grapheme-cluster semantics for Swift for a subset of these. But, many classifications used in string processing are combinations of scalar properties or ad-hoc listings, and as such are not present today in Swift.
Regardless of any syntax or underlying formalism, classifying characters is a worthy and much needed addition to the Swift standard library. We believe our thorough treatment of every character class found across many popular regex engines gives Swift a solid semantic basis.
Proposed Solution
This pitch is narrowly scoped to Swift definitions of character classes found in regexes. For each character class, we propose:
- A name for use in API
- A `Character` API, by extending Unicode scalar definitions to grapheme clusters
- A `Unicode.Scalar` API with modern Unicode definitions
- If applicable, a `Unicode.Scalar` API for notable standards like POSIX
We're proposing what we believe to be the Swiftiest definitions using Unicode's guidance for `Unicode.Scalar` and extending this to grapheme clusters using `Character`'s existing rationale.
Broad language/engine survey
For these definitions, we cross-referenced Unicode's UTS#18 with a broad survey of existing languages and engines. We found that while these all support a subset of UTS#18, each language or framework implements a slightly different subset. The following table shows some of the variations:
Language/Framework | Dot (`.`) matches | Supports `\X` | Canonical Equivalence | `\d` matches FULL WIDTH digit |
---|---|---|---|---|
ECMAScript | UTF16 code unit (Unicode scalar in Unicode mode) | no | no | no |
Perl / PCRE | UTF16 code unit (Unicode scalar in Unicode mode) | yes | no | no |
Python3 | Unicode scalar | no | no | yes |
Raku | Grapheme cluster | n/a | strings always normalized | yes |
Ruby | Unicode scalar | yes | no | no |
Rust | Unicode scalar | no | no | no |
C# | UTF16 code unit | no | no | yes |
Java | Unicode scalar | yes | Only in `CANON_EQ` mode | no |
Go | Unicode scalar | no | no | no |
`NSRegularExpression` | Unicode scalar | yes | no | yes |
We are still in the process of evaluating C++, RE2, and Oniguruma.
Detailed Design
Literal characters
A literal character (such as `a`, `é`, or `í`) in a regex literal matches that particular character or code sequence. When matching at the semantic level of `Unicode.Scalar`, it should match the literal sequence of scalars. When matching at the semantic level of `Character`, it should match `Character`-by-`Character`, honoring Unicode canonical equivalence.
We are not proposing new API here as this is already handled by `String` and `String.UnicodeScalarView`'s conformance to `Collection`.
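As a small illustration of the two semantic levels, the sketch below compares the same text spelled two ways using only existing collection API. The variable names are illustrative.

```swift
let precomposed = "Caf\u{E9}"   // é as the single scalar U+00E9
let decomposed = "Cafe\u{301}"  // e followed by the combining acute accent U+0301

// Character-by-Character comparison honors Unicode canonical equivalence.
let equalAsCharacters = precomposed.elementsEqual(decomposed)

// Scalar-by-scalar comparison sees two different sequences (4 scalars versus 5).
let equalAsScalars = precomposed.unicodeScalars
    .elementsEqual(decomposed.unicodeScalars)

print(equalAsCharacters, equalAsScalars) // true false
```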
Unicode values: `\u`, `\U`, `\x`
Metacharacters that begin with `\u`, `\U`, or `\x` match a character with the specified Unicode scalar values. We propose these be treated exactly the same as literals.
Match any: `.`, `\X`
The dot metacharacter matches any single character or element. Depending on options and modes, it may exclude newlines.
`\X` matches any grapheme cluster (`Character`), even when the regular expression is otherwise matching at the semantic level of `Unicode.Scalar`.
We are not proposing new API here as this is already handled by collection conformances.
While we would like for the stdlib to have grapheme-breaking API over collections of `Unicode.Scalar`, that is a separate discussion and out-of-scope for this pitch.
Decimal digits: `\d`, `\D`
We propose `\d` be named "decimalDigit" with the following definitions:
extension Character {
/// A Boolean value indicating whether this character represents
/// a decimal digit.
///
/// Decimal digits are comprised of a single Unicode scalar that has a
/// `numericType` property equal to `.decimal`. This includes the digits
/// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
/// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
/// (U+096F).
///
/// Decimal digits are a subset of whole numbers, see `isWholeNumber`.
///
/// To get the character's value, use the `decimalDigitValue` property.
public var isDecimalDigit: Bool { get }
/// The numeric value this character represents, if it is a decimal digit.
///
/// Decimal digits are comprised of a single Unicode scalar that has a
/// `numericType` property equal to `.decimal`. This includes the digits
/// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
/// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
/// (U+096F).
///
/// Decimal digits are a subset of whole numbers, see `wholeNumberValue`.
///
///     let chars: [Character] = ["1", "९", "A"]
///     for ch in chars {
///         print(ch, "-->", ch.decimalDigitValue)
///     }
///     // Prints:
///     // 1 --> Optional(1)
///     // ९ --> Optional(9)
///     // A --> nil
public var decimalDigitValue: Int? { get }
}
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar is considered
/// a decimal digit.
///
/// Any Unicode scalar that has a `numericType` property equal to `.decimal`
/// is considered a decimal digit. This includes the digits from the ASCII
/// range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well
/// as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F).
public var isDecimalDigit: Bool { get }
}
`\D` matches the inverse of `\d`.
TBD: SE-0221: Character Properties did not define equivalent API on `Unicode.Scalar`, as it was itself an extension of single `Unicode.Scalar.Properties`. Since we're defining additional classifications formed from algebraic formulations of properties, it may make sense to put API such as `decimalDigitValue` on `Unicode.Scalar` as well as back-porting other API from `Character` (e.g. `hexDigitValue`). We'd like to discuss this with the community.
TBD: `Character.isHexDigit` is currently constrained to the subset of decimal digits that are followed by encodings of Latin letters `A-F` in various forms (all 6 of them... thanks Unicode). We could consider extending this to be a superset of `isDecimalDigit` by allowing and producing values for all decimal digits; one would just have to use the Latin letters to refer to values greater than `9`. We'd like to discuss this with the community.
Rationale
Unicode's recommended definition for `\d` is its numeric type of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its definition and is a proper subset of `Character.isWholeNumber`.
We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make this `Character` property restrictive, similar to `isHexDigit` and `isWholeNumber`, and to provide a way to access the digit's value.
It's possible we might add future properties to differentiate Unicode's non-decimal digits, but that is outside the scope of this pitch.
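The pitched definitions can be approximated today on top of the existing `Unicode.Scalar.Properties` API. The following is a minimal sketch, not the proposed implementation; the `sketch`-prefixed names are hypothetical stand-ins chosen to avoid implying that the pitched properties already exist.

```swift
extension Unicode.Scalar {
    /// Hypothetical stand-in for the pitched `isDecimalDigit`:
    /// true when the scalar's Numeric_Type is Decimal.
    var sketchIsDecimalDigit: Bool {
        properties.numericType == .decimal
    }
}

extension Character {
    /// Hypothetical stand-in for the pitched `decimalDigitValue`.
    var sketchDecimalDigitValue: Int? {
        // Only single-scalar characters qualify under the pitched definition.
        guard unicodeScalars.count == 1,
              let scalar = unicodeScalars.first,
              scalar.sketchIsDecimalDigit,
              let value = scalar.properties.numericValue
        else { return nil }
        return Int(value)
    }
}

// ASCII 1, DEVANAGARI DIGIT NINE, FULLWIDTH DIGIT FIVE, SUPERSCRIPT TWO, A
let digitSamples: [Character] = ["1", "\u{096F}", "\u{FF15}", "\u{B2}", "A"]
for ch in digitSamples {
    print(ch, "-->", ch.sketchDecimalDigitValue as Any)
}
```

SUPERSCRIPT TWO has a numeric type of "Digit" rather than "Decimal", so it is excluded here even though `Character.isWholeNumber` accepts it.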
Word characters: `\w`, `\W`
We propose `\w` be named "word character" with the following definitions:
extension Character {
/// A Boolean value indicating whether this character is considered
/// a "word" character.
///
/// See `Unicode.Scalar.isWordCharacter`.
public var isWordCharacter: Bool { get }
}
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar is considered
/// a "word" character.
///
/// Any Unicode scalar that has one of the Unicode properties
/// `Alphabetic`, `Digit`, or `Join_Control`, or is in the
/// general category `Mark` or `Connector_Punctuation`.
public var isWordCharacter: Bool { get }
}
`\W` matches the inverse of `\w`.
Rationale
Word characters include more than letters, and we went with Unicode's recommended scalar semantics. We extend to grapheme clusters similarly to `Character.isLetter`, that is, subsequent (combining) scalars do not change the word-character-ness of the grapheme cluster.
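The scalar definition and its grapheme-cluster extension can be sketched with existing properties. This is an illustration under assumptions: the `sketch`-prefixed names are hypothetical, and "Digit" is read here as the general category `Decimal_Number`, which is one plausible interpretation rather than something this pitch specifies.

```swift
extension Unicode.Scalar {
    /// Hypothetical stand-in for the pitched `isWordCharacter`.
    var sketchIsWordCharacter: Bool {
        let p = properties
        if p.isAlphabetic || p.isJoinControl { return true }
        switch p.generalCategory {
        case .decimalNumber,                                 // assumed reading of "Digit"
             .nonspacingMark, .spacingMark, .enclosingMark,  // general category Mark
             .connectorPunctuation:
            return true
        default:
            return false
        }
    }
}

extension Character {
    /// Only the first scalar decides; trailing combining scalars are ignored.
    var sketchIsWordCharacter: Bool {
        // A Character always contains at least one scalar.
        unicodeScalars.first!.sketchIsWordCharacter
    }
}

let wordSamples: [Character] = ["a", "_", "7", "e\u{301}", " ", "-"]
print(wordSamples.map { $0.sketchIsWordCharacter })
// [true, true, true, true, false, false]
```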
Whitespace and newlines: `\s`, `\S` (plus `\h`, `\H`, `\v`, `\V`, and `\R`)
We propose `\s` be named "whitespace" with the following definitions:
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar is considered
/// whitespace.
///
/// All Unicode scalars with the derived `White_Space` property are
/// considered whitespace, including:
///
/// - `CHARACTER TABULATION` (U+0009)
/// - `LINE FEED (LF)` (U+000A)
/// - `LINE TABULATION` (U+000B)
/// - `FORM FEED (FF)` (U+000C)
/// - `CARRIAGE RETURN (CR)` (U+000D)
/// - `NEXT LINE (NEL)` (U+0085)
public var isWhitespace: Bool { get }
}
This definition matches the value of the existing `Unicode.Scalar.Properties.isWhitespace` property. Note that `Character.isWhitespace` already exists with the desired semantics, which is a grapheme cluster that begins with a whitespace Unicode scalar.
We propose `\h` be named "horizontalWhitespace" with the following definitions:
extension Character {
/// A Boolean value indicating whether this character is considered
/// horizontal whitespace.
///
/// All characters with an initial Unicode scalar in the general
/// category `Zs`/`Space_Separator`, or the control character
/// `CHARACTER TABULATION` (U+0009), are considered horizontal
/// whitespace.
public var isHorizontalWhitespace: Bool { get }
}
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar is considered
/// horizontal whitespace.
///
/// All Unicode scalars with the general category
/// `Zs`/`Space_Separator`, along with the control character
/// `CHARACTER TABULATION` (U+0009), are considered horizontal
/// whitespace.
public var isHorizontalWhitespace: Bool { get }
}
We propose `\v` be named "verticalWhitespace" with the following definitions:
extension Character {
/// A Boolean value indicating whether this character is considered
/// vertical whitespace.
///
/// All characters with an initial Unicode scalar in the general
/// category `Zl`/`Line_Separator`, or the following control
/// characters, are considered vertical whitespace (see below)
public var isVerticalWhitespace: Bool { get }
}
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar is considered
/// vertical whitespace.
///
/// All Unicode scalars with the general category
/// `Zl`/`Line_Separator`, along with the following control
/// characters, are considered vertical whitespace:
///
/// - `LINE FEED (LF)` (U+000A)
/// - `LINE TABULATION` (U+000B)
/// - `FORM FEED (FF)` (U+000C)
/// - `CARRIAGE RETURN (CR)` (U+000D)
/// - `NEXT LINE (NEL)` (U+0085)
public var isVerticalWhitespace: Bool { get }
}
Note that `Character.isNewline` already exists with the definition required by UTS#18. TBD: Should we backport to `Unicode.Scalar`?
`\S`, `\H`, and `\V` match the inverse of `\s`, `\h`, and `\v`, respectively.
We propose `\R` include "verticalWhitespace" above with detection (and consumption) of the CR-LF sequence when applied to `Unicode.Scalar`. It is equivalent to `Character.isVerticalWhitespace` when applied to `Character`s.
We are similarly not proposing any new API for `\R` until the stdlib has grapheme-breaking API over `Unicode.Scalar`.
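To make the CR-LF detection concrete, here is a minimal sketch of how a scalar-level `\R` might consume input. The helper `consumeNewlineSequence(in:at:)` is hypothetical and is not proposed API; it only illustrates treating CR-LF as one unit when walking a `String.UnicodeScalarView`.

```swift
/// Hypothetical helper: if a newline sequence starts at `index`, returns the
/// index just past it, treating CR-LF as a single unit; otherwise `nil`.
func consumeNewlineSequence(
    in scalars: String.UnicodeScalarView,
    at index: String.UnicodeScalarView.Index
) -> String.UnicodeScalarView.Index? {
    guard index < scalars.endIndex else { return nil }
    let scalar = scalars[index]
    let next = scalars.index(after: index)

    if scalar.value == 0x0D {
        // CR followed by LF is consumed together.
        if next < scalars.endIndex, scalars[next].value == 0x0A {
            return scalars.index(after: next)
        }
        return next
    }

    // LF, VT, FF, NEL, or anything in general category Zl.
    let isVertical = [0x0A, 0x0B, 0x0C, 0x85].contains(scalar.value)
        || scalar.properties.generalCategory == .lineSeparator
    return isVertical ? next : nil
}

let newlineView = "a\r\nb".unicodeScalars
let crIndex = newlineView.index(after: newlineView.startIndex)
if let end = consumeNewlineSequence(in: newlineView, at: crIndex) {
    print(newlineView.distance(from: crIndex, to: end)) // 2
}
```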
Rationale
Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept.
We use Unicode's recommended scalar semantics for horizontal whitespace and extend that to grapheme semantics similarly to `Character.isWhitespace`.
We use ICU's definition for vertical whitespace, similarly extended to grapheme clusters.
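The horizontal and vertical definitions above can be approximated with existing scalar properties. This is a sketch under the definitions stated in this section; the `sketch`-prefixed names are hypothetical and exist only so the example compiles against today's standard library.

```swift
extension Unicode.Scalar {
    /// Zs/Space_Separator, plus CHARACTER TABULATION (U+0009).
    var sketchIsHorizontalWhitespace: Bool {
        value == 0x09 || properties.generalCategory == .spaceSeparator
    }

    /// Zl/Line_Separator, plus LF, VT, FF, CR (U+000A...U+000D) and NEL (U+0085).
    var sketchIsVerticalWhitespace: Bool {
        switch value {
        case 0x0A...0x0D, 0x85:
            return true
        default:
            return properties.generalCategory == .lineSeparator
        }
    }
}

extension Character {
    // Grapheme clusters are classified by their first scalar.
    var sketchIsHorizontalWhitespace: Bool {
        unicodeScalars.first!.sketchIsHorizontalWhitespace
    }
    var sketchIsVerticalWhitespace: Bool {
        unicodeScalars.first!.sketchIsVerticalWhitespace
    }
}

let crlf: Character = "\r\n"
print(crlf.sketchIsVerticalWhitespace, crlf.sketchIsHorizontalWhitespace)
// true false
```

Because `"\r\n"` is a single `Character` whose first scalar is CR, it is classified as vertical whitespace without any special casing.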
Control characters: `\t`, `\r`, `\n`, `\f`, `\0`, `\e`, `\a`, `\b`, `\cX`
We propose the following names and meanings for these escaped literals representing specific control characters:
extension Character {
/// A horizontal tab character, `CHARACTER TABULATION` (U+0009).
public static var tab: Character { get }
/// A carriage return character, `CARRIAGE RETURN (CR)` (U+000D).
public static var carriageReturn: Character { get }
/// A line feed character, `LINE FEED (LF)` (U+000A).
public static var lineFeed: Character { get }
/// A form feed character, `FORM FEED (FF)` (U+000C).
public static var formFeed: Character { get }
/// A NULL character, `NUL` (U+0000).
public static var nul: Character { get }
/// An escape control character, `ESC` (U+001B).
public static var escape: Character { get }
/// A bell character, `BEL` (U+0007).
public static var bell: Character { get }
/// A backspace character, `BS` (U+0008).
public static var backspace: Character { get }
/// A combined carriage return and line feed as a single character denoting
/// end-of-line.
public static var carriageReturnLineFeed: Character { get }
/// Returns a control character with the given value, Control-`x`.
///
/// This method returns a value only when you pass a letter in
/// the ASCII range as `x`:
///
/// if let ch = Character.control("G") {
/// print("'ch' is a bell character:", ch == Character.bell)
/// } else {
/// print("'ch' is not a control character")
/// }
/// // Prints "'ch' is a bell character: true"
///
/// - Parameter x: An upper- or lowercase letter to derive
/// the control character from.
/// - Returns: Control-`x` if `x` is in the pattern `[a-zA-Z]`;
/// otherwise, `nil`.
public static func control(_ x: Unicode.Scalar) -> Character?
}
extension Unicode.Scalar {
/// Same as above, producing Unicode.Scalar, except for CR-LF...
}
We also propose `isControl` properties with the following definitions:
extension Character {
/// A Boolean value indicating whether this character represents
/// a control character.
///
/// Control characters are a single Unicode scalar with the
/// general category `Cc`/`Control` or the CR-LF pair (`\r\n`).
public var isControl: Bool { get }
}
extension Unicode.Scalar {
/// A Boolean value indicating whether this scalar represents
/// a control character.
///
/// Control characters have the general category `Cc`/`Control`.
public var isControl: Bool { get }
}
TBD: Should we have a CR-LF static var on `Unicode.Scalar` that produces a value of type `Character`?
Rationale
This approach simplifies the use of some common control characters, while making the rest available through a method call.
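For readers unfamiliar with the Control-`x` convention, the mapping can be sketched as a free function. This is an illustration only: `sketchControl(_:)` is a hypothetical name, and it assumes the conventional ASCII mapping in which the control character is obtained by keeping the low five bits of the letter.

```swift
/// Hypothetical stand-in for the pitched `Character.control(_:)`.
func sketchControl(_ x: Unicode.Scalar) -> Character? {
    switch x.value {
    case 0x41...0x5A, 0x61...0x7A: // [A-Za-z]
        // Keeping the low five bits maps "A"/"a" to U+0001,
        // "G"/"g" to U+0007 (BEL), "M"/"m" to U+000D (CR), and so on.
        return Character(Unicode.Scalar(UInt8(x.value & 0x1F)))
    default:
        return nil
    }
}

if let ch = sketchControl("G") {
    print("'ch' is a bell character:", ch == "\u{07}")
}
// Prints "'ch' is a bell character: true"
```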
Unicode named values and properties: `\N`, `\p`, `\P`
`\N{NAME}` matches a Unicode scalar value with the specified name. `\p{PROPERTY}` and `\p{PROPERTY=VALUE}` match a Unicode scalar value with the given Unicode property (and value, if given).
While most Unicode-defined properties can only match at the Unicode scalar level, some are defined to match an extended grapheme cluster. For example, `/\p{RGI_Emoji_Flag_Sequence}/` will match any flag emoji character, each of which is composed of two Unicode scalar values.
`\P{...}` matches the inverse of `\p{...}`.
Most of this is already present inside `Unicode.Scalar.Properties`, and we propose to round it out with anything missing, e.g. script and script extensions. (API is TBD, still working on it.)
Even though we are not proposing any `Character`-based API, we'd like to discuss with the community whether or how to extend them to grapheme clusters. Some options:
- Forbid in any grapheme-cluster semantic mode
- Match only single-scalar grapheme clusters with the given property
- Match any grapheme cluster that starts with the given property
- Something more-involved such as per-property reasoning
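Two of the options above differ only in how they treat multi-scalar grapheme clusters, which a short sketch can make concrete. The method names below are hypothetical and are not proposed API; they take a key path into the existing `Unicode.Scalar.Properties`.

```swift
extension Character {
    /// Option: match only single-scalar grapheme clusters with the given property.
    func sketchMatchesIfSingleScalar(
        _ property: KeyPath<Unicode.Scalar.Properties, Bool>
    ) -> Bool {
        unicodeScalars.count == 1
            && unicodeScalars.first!.properties[keyPath: property]
    }

    /// Option: match any grapheme cluster that starts with the given property.
    func sketchMatchesLeadingScalar(
        _ property: KeyPath<Unicode.Scalar.Properties, Bool>
    ) -> Bool {
        unicodeScalars.first!.properties[keyPath: property]
    }
}

let accented: Character = "e\u{301}" // two scalars, one grapheme cluster
print(accented.sketchMatchesIfSingleScalar(\.isLowercase)) // false
print(accented.sketchMatchesLeadingScalar(\.isLowercase))  // true
```

The two strategies agree on single-scalar characters and diverge only once combining scalars are present.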
POSIX character classes: [:NAME:]
We propose that POSIX character classes be prefixed with "posix" in their name with APIs for testing membership of `Character`s and `Unicode.Scalar`s. `Unicode.Scalar.isASCII` and `Character.isASCII` already exist and can satisfy `[:ascii:]`, and can be used in combination with new members like `isDigit` to represent individual POSIX character classes. Alternatively, we could introduce an option-set-like `POSIXCharacterClass` and `func isPOSIX(_:POSIXCharacterClass)` since POSIX is a fully defined standard. This would significantly cut down on the amount of API noise directly visible on `Character` and `Unicode.Scalar`. We'd like some discussion with the community here, noting that this will become clearer as more of the string processing overview takes shape.
POSIX's character classes represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which are covered elsewhere in this pitch and some of which already exist today. Some Character definitions are TBD and we'd like more discussion with the community.
POSIX class | API name | `Character` | `Unicode.Scalar` | POSIX mode value |
---|---|---|---|---|
`[:lower:]` | lowercase | (exists) | `\p{Lowercase}` | `[a-z]` |
`[:upper:]` | uppercase | (exists) | `\p{Uppercase}` | `[A-Z]` |
`[:alpha:]` | alphabetic | (exists: `.isLetter`) | `\p{Alphabetic}` | `[A-Za-z]` |
`[:alnum:]` | alphaNumeric | TBD | `[\p{Alphabetic}\p{Decimal}]` | `[A-Za-z0-9]` |
`[:word:]` | wordCharacter | (pitched) | (pitched) | `[[:alnum:]_]` |
`[:digit:]` | decimalDigit | (pitched) | (pitched) | `[0-9]` |
`[:xdigit:]` | hexDigit | (exists) | `\p{Hex_Digit}` | `[0-9A-Fa-f]` |
`[:punct:]` | punctuation | (exists) | (port from `Character`) | `[-!"#%&'()*,./:;?@[\\\]_{}]` |
`[:blank:]` | horizontalWhitespace | (pitched) | (pitched) | `[ \t]` |
`[:space:]` | whitespace | (exists) | `\p{Whitespace}` | `[ \t\n\r\f\v]` |
`[:cntrl:]` | control | (pitched) | (pitched) | `[\x00-\x1f\x7f]` |
`[:graph:]` | TBD | TBD | TBD | `[^ [:cntrl:]]` |
`[:print:]` | TBD | TBD | TBD | `[[:graph:] ]` |
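As one way to picture the alternative mentioned above, the "POSIX mode value" column can be encoded behind a single membership method. This is a sketch, not proposed API: `SketchPOSIXClass` and `sketchIsPOSIX(_:)` are hypothetical names, a plain enum is used instead of an option set for brevity, and only a subset of the classes is covered.

```swift
/// Hypothetical, simplified stand-in for a `POSIXCharacterClass` type.
enum SketchPOSIXClass {
    case lower, upper, alpha, alnum, digit, xdigit, blank, space, cntrl
}

extension Unicode.Scalar {
    /// Tests membership using the ASCII-only "POSIX mode value" definitions.
    func sketchIsPOSIX(_ cls: SketchPOSIXClass) -> Bool {
        let v = value
        switch cls {
        case .lower:  return (0x61...0x7A).contains(v)             // [a-z]
        case .upper:  return (0x41...0x5A).contains(v)             // [A-Z]
        case .alpha:  return sketchIsPOSIX(.lower) || sketchIsPOSIX(.upper)
        case .digit:  return (0x30...0x39).contains(v)             // [0-9]
        case .alnum:  return sketchIsPOSIX(.alpha) || sketchIsPOSIX(.digit)
        case .xdigit: return sketchIsPOSIX(.digit)                 // [0-9A-Fa-f]
            || (0x41...0x46).contains(v) || (0x61...0x66).contains(v)
        case .blank:  return v == 0x20 || v == 0x09                // [ \t]
        case .space:  return v == 0x20 || (0x09...0x0D).contains(v) // [ \t\n\r\f\v]
        case .cntrl:  return v <= 0x1F || v == 0x7F                // [\x00-\x1f\x7f]
        }
    }
}

let eAcute: Unicode.Scalar = "\u{E9}"
print(eAcute.properties.isAlphabetic, eAcute.sketchIsPOSIX(.alpha)) // true false
```

The last line shows the difference between the `Unicode.Scalar` column and the POSIX mode column for a non-ASCII letter.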
Custom classes: [...]
We propose that custom classes function just like set union. We propose that range-based custom character classes function just like `ClosedRange`. Thus, we are not proposing any additional API.
That being said, providing grapheme cluster semantics is simultaneously obvious and tricky. A direct extension treats `[a-f]` as equivalent to `("a"..."f").contains()`. Strings (and thus Characters) are ordered for the purposes of efficiently maintaining programming invariants while honoring Unicode canonical equivalence. This ordering is consistent but linguistically meaningless and subject to implementation details such as whether we choose to normalize under NFC or NFD.
let c: ClosedRange<Character> = "a"..."f"
c.contains("e") // true
c.contains("g") // false
c.contains("e\u{301}") // false, NFC uses precomposed é
c.contains("e\u{305}") // true, there is no precomposed e̅
We will likely want corresponding `RangeExpression`-based API in the future and keeping consistency with ranges is important.
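For contrast, the same range at the `Unicode.Scalar` level compares by scalar value only, so the outcome depends on how the input happens to be encoded. A minimal sketch using existing API, with illustrative variable names:

```swift
let scalarRange: ClosedRange<Unicode.Scalar> = "a"..."f"

// Decomposed é: the base letter is in range, the combining accent is not.
let decomposedHits = "e\u{301}".unicodeScalars.map { scalarRange.contains($0) }

// Precomposed é (U+00E9): the single scalar is out of range.
let precomposedHits = "\u{E9}".unicodeScalars.map { scalarRange.contains($0) }

print(decomposedHits, precomposedHits) // [true, false] [false]
```

Unlike the `Character` range, no normalization is involved here, so canonically equivalent inputs can produce different results.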
We would like to discuss this problem with the community here. Even though we are not addressing regex literals specifically in this thread, it makes sense to produce suggestions for compilation errors or warnings.
Some options:
- Do nothing, embrace emergent behavior
- Warn/error for any character class ranges
- Warn/error for character class ranges outside of a quasi-meaningful subset (e.g. ASCII, albeit still has issues above)
- Warn/error for multiple-scalar grapheme clusters (albeit still has issues above)
Future Directions
Future API
Library-extensible pattern matching will necessitate more types, protocols, and API in the future, many of which may involve character classes. This pitch aims to define names and semantics for exactly these kinds of API now, so that they can slot in naturally.
More classes or custom classes
Future API might express custom classes or need more built-in classes. This pitch aims to establish rationale and precedent for a large number of character classes in Swift, serving as a basis that can be extended.
More lenient conversion APIs
The proposed semantics for matching "digits" are broader than what the existing `Int(_:radix:)?` initializer accepts. It may be useful to provide additional initializers that can understand the whole breadth of characters matched by `\d`, or other related conversions.
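One possible shape for such a conversion is sketched below. The initializer `Int(sketchDecimalDigits:)` is hypothetical and is not proposed API; it assumes base 10, accepts no sign or whitespace, and relies only on the existing `numericType` and `numericValue` scalar properties.

```swift
extension Int {
    /// Hypothetical lenient initializer: accepts any run of scalars whose
    /// Numeric_Type is Decimal, in any script, and returns nil on overflow.
    init?(sketchDecimalDigits text: String) {
        guard !text.isEmpty else { return nil }
        var result = 0
        for scalar in text.unicodeScalars {
            guard scalar.properties.numericType == .decimal,
                  let digit = scalar.properties.numericValue
            else { return nil }
            let (scaled, overflowedMul) = result.multipliedReportingOverflow(by: 10)
            let (sum, overflowedAdd) = scaled.addingReportingOverflow(Int(digit))
            if overflowedMul || overflowedAdd { return nil }
            result = sum
        }
        self = result
    }
}

let devanagari = "\u{096A}\u{0968}" // DEVANAGARI DIGIT FOUR, DEVANAGARI DIGIT TWO
print(Int(devanagari) as Any)                      // nil
print(Int(sketchDecimalDigits: devanagari) as Any) // Optional(42)
```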