Not sure parsing is relevant; the reason isn't that we don't know about the \u{301} in the literal, but that the semantics of . will match a full character. We could special-case concatenation of a combining scalar, and that should be addressed in Alternatives Considered, especially if there are undesirable side effects of doing so.
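To make the scenario concrete, a minimal sketch; the literal and the expected outcome are my illustration of the reasoning above, not an example taken from the proposal:

```swift
let cafe = "cafe\u{301}"      // "café": the final Character is "e" + U+0301, one grapheme cluster

// Under grapheme-cluster semantics, `.` consumes the entire "é" cluster, leaving
// nothing for a trailing \u{301} in the pattern to match against:
cafe.contains(/caf.\u{301}/)  // presumably no match, per the reasoning above
```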
Single-line mode (?s) dotMatchesNewlines()
Should this be the default?
Multi-line mode (?m) anchorsMatchNewlines()
Similarly, default or not?
Unicode word boundaries (?w) wordBoundaryKind(_:)
This is the default, right?
Semantic level (?Xu) matchingSemantics(_:)
This is newly introduced and specific to Swift, correct?
We might want a column for enabled/disabled by default
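For orientation, a hedged sketch of how the options in the rows above would chain onto a regex; method and member names follow the quoted proposal text and may differ from what ultimately ships:

```swift
import RegexBuilder

let regex = Regex {
  Anchor.startOfLine
  OneOrMore(.any)
  Anchor.endOfLine
}
.dotMatchesNewlines()                   // (?s)
.anchorsMatchNewlines()                 // (?m), name as quoted above
.wordBoundaryKind(.unicodeLevel2)       // (?w), member name from the API listing below
.matchingSemantics(.graphemeCluster)    // (?X)
```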
asciiOnlyDigits, asciiOnlyWhitespace, asciiOnlyWordCharacters, asciiOnlyCharacterClasses
Elsewhere we use an enum to cut down on all of these overloads (e.g. repetition behavior, word boundary level, semantic level, etc.). Is that possible here? Should character class customization via options receive an OptionSet?
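A hypothetical shape for that OptionSet, purely to illustrate the suggestion; the type and member names are not proposed API:

```swift
public struct ASCIIOnlyClasses: OptionSet {
  public var rawValue: Int
  public init(rawValue: Int) { self.rawValue = rawValue }

  public static let digits         = ASCIIOnlyClasses(rawValue: 1 << 0)
  public static let whitespace     = ASCIIOnlyClasses(rawValue: 1 << 1)
  public static let wordCharacters = ASCIIOnlyClasses(rawValue: 1 << 2)
  public static let all: ASCIIOnlyClasses = [.digits, .whitespace, .wordCharacters]
}

// One method instead of four overloads, e.g.:
// regex.asciiOnlyClasses([.digits, .whitespace])
```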
Note that if RegexBuilder is imported, all of these APIs will be available on any component, which includes String and library parsers, so it makes sense to somewhat organize this API.
€1 234,56 ["1", "234", "56"] ["€", "1", "234,56"]
What happened to the € in the first table entry?
public struct RegexWordBoundaryKind: Hashable {
  public static var unicodeLevel1: Self { get }
  public static var unicodeLevel2: Self { get }
}
This naming convention feels odd. Either we use an enum (or enum-like struct) for the UTS #18 levels (a concept beyond just word boundaries), or we give these names that reflect which word boundary algorithm is used, like simpleBoundaries and defaultBoundaries.
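In the same declaration-listing style as the excerpt above, the alternative naming might read like this (illustrative only, not proposed API):

```swift
public struct RegexWordBoundaryKind: Hashable {
  /// UTS #18 level 1: simple word boundaries.
  public static var simpleBoundaries: Self { get }
  /// UTS #18 level 2: default Unicode word boundaries per UAX #29.
  public static var defaultBoundaries: Self { get }
}
```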
When matching with Unicode scalar semantics, metacharacters and character classes always match a single Unicode scalar value, even if that scalar comprises part of a grapheme cluster.
What happens with case-folded matching if case conversion produces multiple scalars? I don't recall whether the case-folded form has this as well; just making sure that . matches the same with and without case insensitivity. CC @Alejandro
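One concrete instance of the concern, using a case conversion that is known to produce multiple scalars:

```swift
let sharpS: Character = "ß"     // U+00DF
String(sharpS).uppercased()     // "SS", two scalars; full case folding likewise maps ß to "ss"
// The question: does a case-insensitive `.` still consume exactly one
// character here, the same as it would without case insensitivity?
```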
With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, including using the \O metacharacter or .anyUnicodeScalar character class, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a String instance.
There's a lot of back-and-forth prose that switches between modes. Do you think you can give a clearer, algebraic reasoning for where boundaries occur? This may also help explain why some choices were made (e.g. producing EGC-aligned indices).
When a regex proceeds with grapheme cluster semantics from a position that isn't grapheme cluster aligned, it attempts to match the partial grapheme cluster that starts at that point. In the first call to contains(_:) below, \O matches a single Unicode scalar value, as shown above, and then the engine tries to match \s against the remainder of the family emoji character.
When is partial matching useful vs. dropping to a more precise level? E.g., why is \O supported in grapheme-semantics mode? How does that behave around captures; that is, could it be used to split a grapheme cluster? We treat \X as . (pending newline considerations) in grapheme cluster mode and \O as . (pending newline considerations) in scalar semantic modes.
Stepping up to a higher-level mode is useful, and that's when alignment becomes more interesting. We could treat \O as any under max(currentMode, .scalarSemantics), \X as any under max(currentMode, .graphemeClusterSemantics), etc. Another alternative is to forbid \O in grapheme cluster semantic mode, but I'd prefer promoting it to \X.
Stepping down is where it'd be very useful to have an algebraic model to reason about this with, as it seems likely to introduce problems.
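A tiny sketch of that promotion rule, as one way to state the algebra; the names are illustrative, not proposed API:

```swift
enum SemanticLevel: Int, Comparable {
  case unicodeScalar = 0, graphemeCluster = 1
  static func < (a: Self, b: Self) -> Bool { a.rawValue < b.rawValue }
}

// \O: "any", evaluated at max(currentMode, .unicodeScalar)
func levelForAnyScalar(current: SemanticLevel) -> SemanticLevel {
  max(current, .unicodeScalar)       // under grapheme cluster semantics this promotes \O to \X
}

// \X: "any", evaluated at max(currentMode, .graphemeCluster)
func levelForAnyGraphemeCluster(current: SemanticLevel) -> SemanticLevel {
  max(current, .graphemeCluster)     // always a full grapheme cluster
}
```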
Regex syntax: (?X)... or (?X...) for grapheme cluster semantics, (?u)... or (?u...) for Unicode scalar semantics.
IIUC, this is completely novel to Swift and this proposal. That should be spelled out prominently, and it would cause an issue if these letters are ever used for important options in the future. An Alternatives Considered section should go over why these letters were chosen over a more verbose solution, and another alternative is to treat this as future work from within a literal.
Regex syntax: (?U)... or (?U...)
RegexBuilder API:
The repetitionBehavior(_:) method lets you set the default behavior for all quantifiers that don't explicitly provide their own behavior. For example, you can make all quantifiers behave possessively, eliminating any quantification-caused backtracking.
Switching the default to possessive is a big deal, and Swift's making a contribution to this space. However, possessive is not like the others: it dramatically changes the inputs recognized by the regex. Should these be lumped together?
Also, why is repetitionBehavior on String à la RegexComponent? These options don't really make sense on leaf regex components; should they be on a protocol specific to builders and combinators?
When you pass nil, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default.
What about possessive?
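A hedged sketch of the interaction being described, assuming the builder quantifiers take an optional RegexRepetitionBehavior argument as in the proposal:

```swift
import RegexBuilder

let regex = Regex {
  OneOrMore(.word)                      // no explicit behavior: picks up the default set below
  Optionally(.whitespace, .reluctant)   // explicit behavior: stays reluctant
}
.repetitionBehavior(.possessive)        // default for the quantifiers that passed nil
// Note that the quoted text says the default is "either eager or reluctant", while
// the earlier paragraph says all quantifiers can be made possessive; hence the question.
```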
Unicode scalar semantics: Matches a Unicode scalar that has a numericType property equal to .decimal. This includes the digits from the ASCII range, from the Halfwidth and Fullwidth Forms Unicode block, as well as digits in some scripts, like DEVANAGARI DIGIT NINE (U+096F). This corresponds to the general category Decimal_Number.
Grapheme cluster semantics: Matches a character made up of a single Unicode scalar that fits the decimal digit criteria above.
ASCII mode: Matches a Unicode scalar in the range 0 to 9.
Is ASCII a mode or a set of options? Should it be a mode?
Also, this prose convention is difficult to follow. Any way to separate out the definitions a bit more? The disclosure triangles are also not rendering as clickable triangles, which makes this worse...
For Grapheme cluster semantics, a very relevant point is whether this is the same definition as API on Character or not.
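For reference, the existing Character APIs this could (or could not) line up with:

```swift
let nine: Character = "\u{096F}"   // DEVANAGARI DIGIT NINE
nine.isNumber                      // true
nine.isWholeNumber                 // true
nine.wholeNumberValue              // Optional(9)
// Is the grapheme-cluster-semantics definition of \d above intended to be
// equivalent to one of these properties, and if so, which one?
```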
Unicode scalar semantics: Matches a decimal digit, as described above, or an uppercase or small A through F from the Halfwidth and Fullwidth Forms Unicode block. Note that this is a broader class than described by the UnicodeScalar.properties.isHexDigit property, as that property only includes ASCII and fullwidth decimal digits.
Rationale?
Unicode property matching is extended to Characters with a goal of consistency with other regex character classes. For \p{Decimal} and \p{Hex_Digit}, only single-scalar Characters can match, for the reasons described in that section, above. For all other Unicode property classes, matching Characters can comprise multiple scalars, as long as the first scalar matches the property.
Does permissive reasoning really apply to all the rest? Did we check?
What about non-boolean properties like canonical combining class? Do those make sense under first-scalar interpretation?
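For example, canonical combining class is a per-scalar, non-boolean property; a first-scalar rule would behave like this (assuming \p{Canonical_Combining_Class=...} syntax is accepted at all):

```swift
let e: Character = "e\u{301}"   // "é" spelled as two scalars
e.unicodeScalars.map { $0.properties.canonicalCombiningClass.rawValue }
// [0, 230]: the base letter is notReordered, the combining acute accent is "above"
// Under a first-scalar rule, something like \p{Canonical_Combining_Class=Above}
// could never match this Character, even though it contains such a scalar.
```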
[:lower:] \p{Lowercase} starts-with [a-z]
Out of curiosity, do fuzzy matching and the unification of [:...:] with \p{...} mean that this could be written as \p{lower}?
When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a ClosedRange would behave via its contains(_:) method, since that depends on String's Comparable conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in Regex.
But is NFKD the same behavior? Could you give examples? It seems like there's a lot of complexity here and I don't want mistakes or inconsistencies to slip through.
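A couple of standard-library data points that could anchor such examples; the regex-side behavior is exactly the open question, so it is left as comments:

```swift
let precomposed: Character = "\u{E9}"   // "é" as a single scalar; its NFD form is "e" + U+0301
("a"..."z").contains(precomposed)       // false: Comparable ordering (NFC-based) puts U+00E9 after "z"
// An NFD-based [a-z] test would instead look at the decomposed "e" + U+0301.
// NFKD additionally applies compatibility decompositions, e.g.:
let ligature: Character = "\u{FB01}"    // "ﬁ", which NFKD decomposes to "f" + "i"
// Whether caseless [a-z] should therefore treat "ﬁ" as starting with "f" is the
// kind of example worth spelling out.
```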
A custom character class will match a maximum of one Character or UnicodeScalar, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics.
Is \R permitted inside a custom character class and if so, what does it do?
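To make the scalar-semantics caveat concrete, a hedged sketch using RegexBuilder's anyOf; the expected outcomes in the comments restate the quoted text rather than verified behavior:

```swift
import RegexBuilder

let family: Character = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
// 👨‍👩‍👧‍👦: one Character made of seven Unicode scalars

let cls = Regex { CharacterClass.anyOf([family]) }
String(family).contains(cls)                                    // grapheme semantics: the member is one Character
String(family).contains(cls.matchingSemantics(.unicodeScalar))  // scalar semantics: the class matches at most one
                                                                // scalar, so per the text above it may not match at all
```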
With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach.
Can you detail how boundaries work when switching between levels? E.g., grapheme cluster -> scalar -> grapheme cluster: how many grapheme cluster boundaries are checked, and where? And vice versa. I think this is a very important, albeit nuanced, aspect that can help illuminate the model.
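The shape of the question in builder form, sketched with the matchingSemantics(_:) method from the quoted API (names may differ from what ships):

```swift
import RegexBuilder

let mixed = Regex {
  OneOrMore(.word)                     // grapheme-cluster semantics (outer default)
  Capture {
    OneOrMore(.any)
  }
  .matchingSemantics(.unicodeScalar)   // scalar semantics for this capture only
  One(.whitespace)                     // back to grapheme-cluster semantics
}
// Where are grapheme cluster boundaries enforced here: at the start and end of
// the scalar-semantics capture, only at the outer match and capture boundaries, or both?
```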
A prior version of this proposal used a binary method for setting the word boundary algorithm, called usingSimpleWordBoundaries(). A method taking a RegexWordBoundaryKind instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future.
Should that be the approach taken for everything that might incorporate locales or tailoring? E.g. semantic modes and the alternate character classes?