SE-0363: Unicode for String Processing

I'd like to give a little context to @Alejandro's reply regarding String's philosophy. This is ever-evolving, but as we add more API and functionality to String, it becomes clearer.

When it comes to strings and Unicode, there is no universal correctness, and strings are messy, messy things. Swift uses Unicode out of pragmatism, as it is the very best we have, despite its flaws. Unicode does not directly prescribe or dictate a programming model; it's a collection of definitions and occasional suggestions. It's an art and a science to eke out some sensible programming model from it.

Trying to doggedly assign meaning to the meaningless is a fruitless endeavor at best, and it can harm realistic usage at worst. It's better to find some principles. Similarly, these principles are not to be doggedly held at all costs. If the principles were sufficient to define a clear and correct universal programming model, then strings would be easy.

Swift's default string model: Characters

Swift's primary, default model of String is a collection of Characters (extended grapheme clusters), equal under Unicode canonical equivalence.
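A small sketch of what that model means in practice (this is existing String behavior, not new API):

```swift
// "é" written two ways: one precomposed scalar vs. "e" plus a combining accent.
let precomposed = "caf\u{E9}"
let decomposed  = "cafe\u{301}"

precomposed == decomposed   // true — equal under canonical equivalence
precomposed.count           // 4   — counted in Characters, not scalars or bytes
decomposed.count            // 4
```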

In Unicode, grapheme clusters are defined for the sole purpose of having renderers mostly agree with each other. Unicode says little to nothing more about them. Grapheme breaking is designed to be simple enough for renderers, at the cost of allowing all kinds of senseless constructs and corner cases in "unrealistic" usage. Unicode's recommendation (for renderers) is to have some reasonable fall-back behavior for the weird cases that don't arise organically, in order to allow the realistic cases to follow a more consistent model.

Swift chose to base its string model on grapheme clusters, so it has to venture forth on its own here. Algorithms written against Swift's model of string will be semantically incompatible with the same algorithms written against a different model of string. String provides ways to explicitly use a different model (for example, the scalar and code unit views).
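For example, the same simple measurement gives different answers depending on which model (view) you ask:

```swift
let flag = "🇨🇦"   // one Character built from two regional-indicator scalars

flag.count                 // 1 — the default Character model
flag.unicodeScalars.count  // 2 — the Unicode scalar view
flag.utf8.count            // 8 — the UTF-8 code unit view
```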

Weirdness can result from this decision, so we try to interpret Unicode's guidance as we develop some principles and choose between tradeoffs.

  • Principle: Degenerate cases can be weird in service of making realistic and important usage better

For example, str1.count + str2.count == (str1 + str2).count does not hold when str2's leading Character would combine with str1's trailing Character, breaking algebraic reasoning in this situation. But that would be an example of str2 being an inorganic ("degenerate") string. Swift made the call that String's RangeReplaceableCollection conformance was so important for many practical reasons that an inconsistency under degenerate cases was an acceptable compromise.
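Concretely (a minimal sketch of the degenerate case described above):

```swift
let str1 = "e"
let str2 = "\u{301}"   // a bare combining acute accent — a degenerate string

str1.count + str2.count   // 1 + 1 == 2
(str1 + str2).count       // 1 — the accent combines with "e" into a single "é"
```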

  • Principle: Normal/default API should not create degenerate cases where there originally were none
  • Corollary: any indices produced by normal/default operations should produce Character-aligned indices to avoid degeneracy

We should be hesitant to add functionality to String that could produce non-Character-aligned indices unless it is explicitly opted into. I realize this is currently violated by some of the NSString APIs inherited from Objective-C, and it's ongoing work to replace them with APIs that are better and more ergonomic.
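To illustrate the corollary with existing behavior (nothing new proposed here):

```swift
let flags = "🇺🇸🇨🇦"

// Indices produced by default String operations are Character-aligned:
let second = flags.index(after: flags.startIndex)
flags[second]   // "🇨🇦"

// Indices taken from a lower-level view can land inside a Character:
let scalars = flags.unicodeScalars
let inside = scalars.index(after: scalars.startIndex)
// `inside` falls between the two regional-indicator scalars of "🇺🇸",
// i.e. it is not on a Character boundary.
```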


Note that there are multiple aspects of Regex regarding the concept of "compatibility", and there are many things Regex could aim to be compatible with. There's the syntax of run-time and literal regexes, the behaviors associated with constructs such as repetition, the targeted feature set, and then there's the model of string to which a regex is applied. A regex declares an algorithm over a model of string, and this proposal establishes Swift.String as the default semantic model to be compatible with. For example, Regex { CharacterClass.any } will match the element that String.first returns.
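A rough sketch of that default in action, using the RegexBuilder API (the multi-scalar family emoji here is just an illustrative input):

```swift
import RegexBuilder

let family = "👩‍👩‍👧‍👦!"

// Under the default grapheme-cluster semantics, `.any` consumes one whole
// Character — the same element that `family.first` gives you.
let match = family.firstMatch(of: Regex { CharacterClass.any })
match?.output   // "👩‍👩‍👧‍👦"
family.first    // "👩‍👩‍👧‍👦"
```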

This is future work for a couple of reasons. We're a little hesitant to add brand new regex syntax (with few eyeballs on it) at the same time we're introducing regex syntax in the first place. We'd also like to give more consideration to what a byte semantics for regex could look like, for example applying a regex to a collection of UInt8 interpreted as UTF-8, though we'd need to figure out what the encoding validation story is there.

Note that the scalar semantics definition is directly prescribed by Unicode in UTS #18. Unicode doesn't describe grapheme-cluster semantics at all, and thus we don't risk incompatibility with Unicode itself; if they one day decide to do so, then that's a whole new design area for the future.

Many of the common grapheme cluster semantic definitions are equivalent to the SE-0221 definitions, and similar reasoning can apply to the other common properties.
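Roughly, and assuming the default grapheme-cluster semantics, that equivalence looks like this (a sketch, not normative):

```swift
let e: Character = "e\u{301}"   // "é" as a base letter plus a combining accent

// The SE-0221 Character property already reasons about the whole cluster:
e.isLetter                    // true

// The grapheme-cluster extension of \w classifies the same way:
"e\u{301}".contains(/^\w$/)   // true
```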

Future work includes the ability to adjust or dictate how properties are extended.

For the less common queries, or those that don't have as obvious an extension to grapheme cluster semantics, I think a conservative approach would be to treat the Extension column as non-normative. That is, we pick a suitable default behavior, but we're not formally locking it in until there's a clear need to revisit it with more information. There's a decently high chance that they're never revisited anyway, and a developer who cares about obscure Unicode details may want to work in the more precise scalar-semantic mode (which supports \X and \Y, or anyGraphemeCluster and graphemeClusterBoundary).
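A sketch of what switching between the two modes looks like with the matchingSemantics(_:) API from this proposal:

```swift
let s = "e\u{301}"   // "é" as two Unicode scalars

// Default grapheme-cluster semantics: "." consumes the whole Character.
s.contains(/^.$/)                                      // true

// Scalar semantics: "." consumes a single scalar, while \X still consumes
// a full grapheme cluster.
s.contains(/^.$/.matchingSemantics(.unicodeScalar))    // false
s.contains(/^\X$/.matchingSemantics(.unicodeScalar))   // true
```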

An implementation strategy could be to emit a compilation error for these fairly obscure corner cases when in grapheme-semantic mode, encouraging the use of the more precise scalar-semantics mode.

I believe you are referring to the option to enable matching under Unicode canonical equivalence, such as in Java. I think this could be fine future work. Grapheme-semantic mode enables it by default (which is very natural, as normalization segments are always subsequences of grapheme clusters), and scalar semantics disables it by default. But it could be useful to selectively enable it in scalar semantics (or disable it in grapheme-cluster semantics, though that seems less useful).
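A sketch of the defaults described above:

```swift
let decomposed = "e\u{301}"   // "é" written as two scalars

// Grapheme-cluster semantics compares Characters, so it matches under
// canonical equivalence by default:
decomposed.contains(/\u{E9}/)                                     // true

// Scalar semantics compares scalar-by-scalar with no canonicalization:
decomposed.contains(/\u{E9}/.matchingSemantics(.unicodeScalar))   // false
```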
