SE-0363: Unicode for String Processing

Hello, Swift Community.

The review of SE-0363: Unicode for String Processing begins now and runs through July 11, 2022.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0363" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager


Maybe there should be a shorter direct representation
from

/\p{L}/

to

Regex {
  CharacterClass(
    .generalCategory(.uppercaseLetter),
    .generalCategory(.lowercaseLetter),
    .generalCategory(.titlecaseLetter),
    .generalCategory(.modifierLetter),
    .generalCategory(.otherLetter)
  )
}
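One sketch of what a shorter spelling could look like: defining the union once as a reusable extension, built from the same `CharacterClass` initializer the proposal provides. The `letterShorthand` name is purely hypothetical and not part of the proposal.

```swift
import RegexBuilder

// Hypothetical convenience, not in the proposal: a single class
// standing in for the five "letter" general categories of \p{L}.
extension CharacterClass {
    static var letterShorthand: CharacterClass {
        CharacterClass(
            .generalCategory(.uppercaseLetter),
            .generalCategory(.lowercaseLetter),
            .generalCategory(.titlecaseLetter),
            .generalCategory(.modifierLetter),
            .generalCategory(.otherLetter)
        )
    }
}

let regex = Regex { CharacterClass.letterShorthand }
```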

Typo

- Regex syntax (?D): Match only ASCII members for all POSIX properties (including digit, space, and word).
+ Regex syntax (?P): Match only ASCII members for all POSIX properties (including digit, space, and word).

Missing link at L1240

... [Run-time Regex Construction proposal][literals]

Extras at the end of L1321

- let regex1 = /\p{name=latin lowercase a}/.extendUnicodeProperty(\.name, by: .firstScalar)`.
+ let regex1 = /\p{name=latin lowercase a}/.extendUnicodeProperty(\.name, by: .firstScalar)

This is an extremely minor point, but I’d appreciate if the proposal title mentioned regex to make it easier to find later.


To support both matching on String's default character-by-character view and more broadly-compatible Unicode scalar-based matching, you can select a matching level for an entire regex or a portion of a regex constructed with the RegexBuilder API. [...]

These specific levels of matching [or to be specific, extended grapheme cluster-based matching rather than Unicode scalar-based matching], and the options to switch between them, are unique to Swift

I'm concerned that extended grapheme cluster-based matching is both unique to Swift and also the (silent) default. Compatibility with other regex engines has been a key consideration in this series of proposals, and it is specifically desired with respect to the classical regex syntax. As the proposal acknowledges, extended grapheme cluster-based matching by default decreases that level of broad compatibility.

Moreover, since there is no "inside-the-regex" syntax to enable the more compatible Unicode scalar mode and only the rather more verbose .matchingSemantics(.unicodeScalar), I worry about discoverability even if a user knows about this compatibility difference. That said, I do understand the rationale for not including totally new regex syntax, at least at this time, and could probably be on board with that approach were it not for the first consideration above about defaults.
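For reference, opting into the more compatible scalar mode under the proposal looks like this, a sketch using the proposed `matchingSemantics(_:)` API rather than any in-regex syntax:

```swift
let family = "👨‍👩‍👧‍👦"   // one Character composed of seven scalars

// Default (grapheme-cluster) semantics: `.` consumes a whole Character.
let graphemeDot = /./

// The proposal's only way to request scalar semantics is this method;
// there is no inside-the-regex flag for it.
let scalarDot = /./.matchingSemantics(.unicodeScalar)

// graphemeDot matches the whole family emoji; scalarDot matches "👨".
```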

In addition, threading these two "matching semantics" modes through the entire proposal also means that we have to review two distinct sets of character classes, one of them unique to Swift. They look very well thought out, and it is certainly possible that the proposed semantics for each character class is optimal as-is. But do we have a deep enough set of Unicode expertise in this community to approximate a "many eyes" approach to evaluating these details? I am skeptical of this. How many people have an informed opinion on halfwidth forms, for instance? It is unclear what we would do down the line (with source compatibility constraints, etc.) if, upon reflection, some definition adopted today turns out to have included or excluded characters in error or conflicts with a future Unicode decision—the Unicode Consortium makes certain backwards compatibility guarantees with respect to its own decisions but isn't under any obligation not to contradict what we decide.

Therefore, I wonder if there is another approach possible here—something along the lines below—which would satisfy the overarching (legitimate) need for Unicode correctness by default and consistency with the user-facing representation of String as a sequence of extended grapheme clusters as elements, while obviating the need for a distinct extended grapheme cluster-based regex matching scheme:

  • Support a "Unicode normalization–insensitive" option for regex, enabled by default for regex literals
  • Where the "Unicode normalization–insensitive" option is enabled, normalize the regex and match against the normalized string representation (i.e., do automatically what the proposal says one must do manually when using regex engines in other languages)
  • Drop grapheme cluster-level semantics from the proposal; in place of matchingSemantics, add an ignoresNormalizationForm() option (strawman name) that defaults to being enabled, and possibly an in-regex syntax for toggling the same option
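To make the middle bullet concrete, here is the manual version of that normalization step, the one the proposal says you must perform by hand when using regex engines in other languages. This sketch uses Foundation's `precomposedStringWithCanonicalMapping` (NFC) purely as an illustration of normalizing before a scalar-level comparison:

```swift
import Foundation

let decomposed = "cafe\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT
let precomposed = "caf\u{E9}"    // single precomposed scalar U+00E9

// The two are canonically equivalent but differ scalar-for-scalar.
// Normalizing both sides first makes a scalar-level match agree.
let normalized = decomposed.precomposedStringWithCanonicalMapping
```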

I have minor concerns re some spellings (e.g., dotMatches... versus anchorsMatch...—can't we use the singular for both flags, particularly since it's possible to use only one anchor?) and character class definitions (e.g., the mismatch where isHexDigit is more narrowly defined than the hexDigit class), which can probably be discussed later.


I agree, while my opinion was expressed in a different way.

I may have read it perversely, but I wonder why you use the term "normalization". :thinking:
I mean, we can't ignore the fact that some normalization forms, such as NFKC and NFKD, are not intended for searching.

Correct me if I'm wrong, but it sounds like what you're proposing is a normalize-every-input-and-match-by-scalar default. This model actually works pretty well until you come across a language whose Characters are not represented by a single scalar. Notoriously, this happens with emoji, but there are also Indic sequences that follow this rule (and there might be other cases!). Consider the following:

// This is a single grapheme composed of 5 scalars.
let string1 = "क़्‍त"
// This is a single grapheme composed of 7 scalars.
let string2 = "👨‍👩‍👧‍👦"

// Prints: 1
print(string1.count)
// Prints: 1
print(string2.count)

// Match any single character
let regex = /./

// Proposed: grapheme based matching semantics
// Prints: क़्‍त
print(try regex.firstMatch(in: string1)!)
// Prints: 👨‍👩‍👧‍👦
print(try regex.firstMatch(in: string2)!)

// ignoresNormalize and scalar based matching semantics
// Prints: क
print(try regex.firstMatch(in: string1)!)
// Prints: 👨
print(try regex.firstMatch(in: string2)!)

Without grapheme-based matching semantics we fail to match String's single character; instead we get something that doesn't even appear to be in our original input. For someone who has never used regex before, these results seem bizarre!

This is pretty inconsistent with the rest of our language model, where we want to be Unicode-correct. Having different outputs by default between string.count and the regex /./ seems like a direction that isn't toward Unicode correctness, even if it means being compatible with classical regex engines. From the get-go, our string model is vastly different from that of other languages' strings, so we have to diverge from being 100% compatible with other engines to ensure that we're consistent with the model we currently have and to continue trying to be as Unicode-correct as possible.


I'd like to give a little context to @Alejandro's reply regarding String's philosophy. This is ever-evolving but as we add more API and functionality to String, it starts to become clearer.

When it comes to strings and Unicode, there is no universal correctness, and strings are messy, messy things. Swift uses Unicode out of pragmatism, as it is the very best we have, despite its flaws. Unicode does not directly prescribe or dictate a programming model; it's a collection of definitions and occasional suggestions. It's an art and a science to eke out a sensible programming model from it.

Trying to doggedly assign meaning to the meaningless is a fruitless endeavor at best, and can harm realistic usage at worst. It's better to find some principles. Similarly, these principles are not to be doggedly held to at all costs. If the principles were sufficient to define a clear and correct universal programming model, then strings would be easy.

Swift's default string model: Characters

Swift's primary, default model of String is a collection of Characters ("extended" grapheme clusters) equal under Unicode canonical equivalence.
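Concretely, "equal under Unicode canonical equivalence" means:

```swift
let precomposed = "caf\u{E9}"    // é as a single scalar, U+00E9
let decomposed  = "cafe\u{301}"  // "e" + U+0301 COMBINING ACUTE ACCENT

// Same Characters, equal as Strings...
// ...even though the underlying scalar sequences differ.
```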

In Unicode, grapheme clusters are defined for the sole purpose of having renderers mostly agree with each other. Unicode says little to nothing more about them. Grapheme breaking is designed to be simple enough for renderers, at the cost of allowing all kinds of senseless constructs and corner cases in "unrealistic" usage. Unicode's recommendation (for renderers) is to have some reasonable fall-back behavior for the weird cases that don't arise organically, in order to allow the realistic cases to follow a more consistent model.

Swift chose to base its string model on top of grapheme clusters. Swift has to venture forth on its own here. Algorithms against Swift's model of string will be semantically incompatible with those algorithms against a different model of string. String provides ways to explicitly use a different model of string (for example the scalar and code unit views).

Weirdness can result from this decision, so we try to interpret Unicode's guidance as we develop some principles and choose between tradeoffs.

  • Principle: Degenerate cases can be weird in service of making realistic and important usage better

For example, str1.count + str2.count == (str1 + str2).count does not hold when str2's leading Character would combine with str1's trailing Character, breaking algebraic reasoning in this situation. But, that would be an example of str2 being an inorganic case ("degenerate"). Swift made the call that String's RangeReplaceableCollection conformance was so important for many practical reasons that an inconsistency under degenerate cases was an acceptable compromise.
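That concatenation caveat is easy to reproduce:

```swift
let str1 = "e"
let str2 = "\u{301}"   // U+0301 COMBINING ACUTE ACCENT on its own

// Each string is one Character on its own, but concatenating them
// fuses the accent onto the "e": (str1 + str2) is "é", a single
// Character, so the counts don't add up.
let combined = str1 + str2
```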

  • Principle: Normal/default API should not create degenerate cases where there originally were none
  • Corollary: any indices produced by normal/default operations should produce Character-aligned indices to avoid degeneracy

We should be hesitant to add functionality to String that could produce non-Character-aligned indices unless explicitly opted into. I realize this is currently violated by some of the inherited NSString APIs from Objective-C, and it's on-going work to replace them with APIs that are better and more ergonomic.


Note that there are multiple aspects of Regex regarding the concept of "compatibility", and there are many things which Regex could aim for compatibility with. There's the syntax of run-time and literal regexes, the behaviors associated with constructs such as repetition, the targeted feature set, and then there's the model of string to which a regex is applied. A regex declares an algorithm over a model of string, and this proposal establishes String's model as the default semantic model to be compatible with. E.g. Regex { CharacterClass.any } will match the same element that String.first produces.
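That last claim can be sanity-checked with a short sketch using the proposal's RegexBuilder API:

```swift
import RegexBuilder

let family = "👨‍👩‍👧‍👦"   // one Character made of seven scalars
let anyChar = Regex { CharacterClass.any }

// Under the default grapheme-cluster semantics, `.any` consumes
// exactly the element that `String.first` produces.
let match = family.firstMatch(of: anyChar)
```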

This is future work for a couple reasons. We're a little hesitant to add brand new regex syntax (with few eyeballs) at the same time we're introducing regex syntax in the first place. We'd also like to give more consideration to what a byte semantics of regex could be like. For example, applying a regex to a collection of UInt8s under the interpretation of UTF-8, though we'd need to figure out what the encoding validation story is there.

Note that the scalar semantics definition is directly prescribed by Unicode in UTS #18. Unicode doesn't describe grapheme-cluster semantics at all, so we don't risk incompatibility with Unicode itself; if it one day decides to do so, that's a whole new design area for the future.

Many of the common grapheme cluster semantic definitions are equivalent to the SE-0221 definitions, and similar reasoning can apply to the other common properties.

Future work includes the ability to adjust or dictate how properties are extended.

For the less common queries, or those that don't have as obvious an extension to grapheme cluster semantics, I think a conservative approach would be to treat the Extension column as non-normative. I.e. we pick a suitable default behavior but we're not formally locking that in until there's a clear need to revisit it with more information. There's a decently high chance that they're never revisited anyway, and a developer who cares about obscure Unicode details may want to work in the more precise scalar-semantic mode (which supports \X and \Y or anyGraphemeCluster and graphemeClusterBoundary).

An implementation strategy could be to throw a compilation error when in grapheme-semantic mode for these fairly obscure corner cases, encouraging the use of the more precise scalar semantics mode.

I believe you are referring to the option to enable matching under Unicode canonical equivalence, such as in Java. I think this could be fine future work. Grapheme-cluster semantics enables it by default (which is very natural, as normalization segments are always sub-sequences of grapheme clusters) and scalar semantics disables it by default. But it could be useful to selectively enable it in scalar semantics (or disable it in grapheme-cluster semantics, though that seems less useful).
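Assuming this refers to behavior like the following, where the default grapheme-cluster semantics matches under canonical equivalence without any extra option:

```swift
let decomposed = "cafe\u{301}"   // decomposed é: "e" + U+0301
let pattern = /caf\u{E9}/        // precomposed é (U+00E9) in the pattern

// Under the default grapheme-cluster semantics the match succeeds even
// though the scalar sequences differ, because Characters compare
// under canonical equivalence.
let found = decomposed.contains(pattern)
```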
