String case folding and normalization APIs

Michael_Ilseman · July 22, 2018, 10:53pm

Big +1 to the idea and effort, some comments on the details.

This is incorrect (and mostly irrelevant). See here

This is actually a non-goal. The purpose of ordering, beyond equality, is for maintaining programmer invariants such as the sorted-ness of a data structure. It does not try to provide a universal ordering appropriate for presentation to humans, such as UCA+DUCET tries to do.

Nit: FCD is not a normal form; it’s a subset of strings.

A big decision here is whether we should have eager APIs that produce new Strings, or if instead we should provide a (lazy) normalized/case-folded view, or both.

For example, String.UnicodeScalarView could have a var nfc property, which provides a view of lazily-NFC-normalized scalars. Similarly for the UTF-8 and UTF-16 views, which provide normalized code units. This could be in addition to, or in lieu of, the eager one on String.
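To make the lazy-view idea concrete, here is a minimal sketch of a view that transforms scalars on demand instead of building a new String. It is not normalization: the public stdlib does not expose a normalizer, so per-scalar lowercase mapping (`Unicode.Scalar.Properties.lowercaseMapping`, Swift 5+) stands in for case folding. The names `LazyLowercasedScalars` and `lazilyLowercased` are hypothetical and only illustrate the shape such an API might take.

```swift
// Hypothetical lazy view: yields lowercase-mapped scalars one at a time,
// without allocating a transformed copy of the whole string.
struct LazyLowercasedScalars: Sequence, IteratorProtocol {
    private var base: String.UnicodeScalarView.Iterator
    // Scalars of the current mapping that have not been yielded yet.
    private var pending: String.UnicodeScalarView.Iterator?

    init(_ scalars: String.UnicodeScalarView) {
        base = scalars.makeIterator()
        pending = nil
    }

    mutating func next() -> Unicode.Scalar? {
        while true {
            // Drain the mapping of the previous source scalar first.
            if let scalar = pending?.next() { return scalar }
            // Then pull the next source scalar; stop when the base is exhausted.
            guard let source = base.next() else { return nil }
            // A single scalar's mapping may expand to more than one scalar.
            pending = source.properties.lowercaseMapping.unicodeScalars.makeIterator()
        }
    }
}

extension String.UnicodeScalarView {
    // Hypothetical property, analogous in shape to the `nfc` view discussed above.
    var lazilyLowercased: LazyLowercasedScalars { LazyLowercasedScalars(self) }
}

// elementsEqual stops at the first mismatch, so only a prefix is transformed.
let matches = "\u{C0}BC".unicodeScalars.lazilyLowercased
    .elementsEqual("\u{E0}bc".unicodeScalars)
let differs = "ABC".unicodeScalars.lazilyLowercased
    .elementsEqual("xbc".unicodeScalars)
```

A real normalized view would need buffering for reordering and composition, but the consumer-facing shape (a Sequence of scalars or code units) would look much the same.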

For example, inside the standard library’s current comparison implementation, we have lots of fast-paths for common and already-normalized situations, but can fall back to a slow-path involving lazily-normalized UTF-16 code units. Avoiding an extra allocation and copy is beneficial, especially since comparison can early-exit. Additionally, even non-normal strings are often nearly-normal, so copying an entire string’s contents is not always needed.
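The fast-path/slow-path split can be sketched roughly as follows. This is an illustration, not the stdlib's implementation: the function name `canonicallyEqual` is hypothetical, and the O(n) ASCII scan stands in for the cheap checks the stdlib can make internally. It relies on the fact that an all-ASCII string is already in NFC, so two such strings are canonically equivalent exactly when their bytes match.

```swift
// Hypothetical helper showing a fast path for already-normalized input.
func canonicallyEqual(_ lhs: String, _ rhs: String) -> Bool {
    // Fast path: all-ASCII strings are already normalized, so a direct
    // code-unit comparison (which exits at the first mismatch) is enough.
    let lhsIsASCII = lhs.utf8.allSatisfy { $0 < 0x80 }
    let rhsIsASCII = rhs.utf8.allSatisfy { $0 < 0x80 }
    if lhsIsASCII && rhsIsASCII {
        return lhs.utf8.elementsEqual(rhs.utf8)
    }
    // Slow path: defer to String's ==, which respects canonical equivalence.
    return lhs == rhs
}

let asciiEqual = canonicallyEqual("swift", "swift")
let asciiDiffers = canonicallyEqual("abc", "abd")
// Precomposed é versus e followed by a combining acute accent.
let equivalent = canonicallyEqual("caf\u{E9}", "cafe\u{301}")
```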

Having an eager method on String gels very well with the goal of having performance flags that remember whether a String is already in the stdlib’s preferred normal form. In those cases, canonical equivalence is binary equivalence, so we can take a memcmp-like fast path. On the other hand, since the resulting type is String and not some kind of NormalizedString<NFC>, the normalization status is not represented in the type, and it is up to the programmer to remember it.
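For contrast, here is a rough sketch of what recording the normalization status in the type could look like. The wrapper name `NFCString` is hypothetical, and it normalizes eagerly with Foundation's `precomposedStringWithCanonicalMapping` because the stdlib has no public normalization API; treat it as an illustration of the trade-off rather than a proposed design.

```swift
import Foundation

// Hypothetical wrapper: the type itself records that the contents are NFC.
struct NFCString: Equatable {
    let value: String

    init(_ string: String) {
        // Eager normalization: allocates a new String once, up front.
        value = string.precomposedStringWithCanonicalMapping
    }

    // Both sides are known to be NFC, so canonical equivalence reduces to
    // code-unit equality, the memcmp-like fast path.
    static func == (lhs: NFCString, rhs: NFCString) -> Bool {
        lhs.value.utf8.elementsEqual(rhs.value.utf8)
    }
}

let composed = NFCString("caf\u{E9}")
let decomposed = NFCString("cafe\u{301}")
let sameAfterNormalizing = composed == decomposed
```

With a plain String result, the same fast path is only available if the string carries a flag or the programmer remembers that it was normalized.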

In other languages that provide such a default option, it is NFC, and users familiar with Unicode normalization may expect the same default in Swift if one exists. In Swift, however, strings use FCC normalization before comparison, and Swift users unfamiliar with Unicode normalization may expect the default option for explicit normalization to be the same as that for comparison.

Again, it’s currently NFC, but that’s not part of the stdlib's contract. The precise ordering results can vary version-to-version of Swift. For example, it likely will be changed in the future to be scalar-order rather than code-unit order.
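To see why the two orderings can disagree, here is a small sketch comparing the same pair of strings by scalars and by code units. It assumes the code units in question are UTF-16; UTF-8 code-unit order always agrees with scalar order, so the difference only shows up with surrogate pairs.

```swift
let bmp = "\u{FF61}"      // one scalar, one UTF-16 code unit (0xFF61)
let astral = "\u{1F600}"  // one scalar, surrogate pair 0xD83D 0xDE00

// Scalar order: U+FF61 < U+1F600.
let scalarOrder = bmp.unicodeScalars.lexicographicallyPrecedes(astral.unicodeScalars)

// UTF-16 code-unit order: 0xFF61 > 0xD83D, so the result flips.
let utf16Order = bmp.utf16.lexicographicallyPrecedes(astral.utf16)

// UTF-8 code-unit order: 0xEF < 0xF0, matching scalar order.
let utf8Order = bmp.utf8.lexicographicallyPrecedes(astral.utf8)
```

Code that persists sorted data across Swift versions should therefore not depend on which of these orderings `<` happens to use.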

I do think it would be useful to also give performance-minded users an abstract “stdlib preferred” normal form, which can accelerate comparisons using the aforementioned fast-paths. This stdlib-canonical form will almost certainly remain some kind of composed form for memory efficiency, even if it complicates some implementation. This could be something like (straw man):

extension NormalizationForm {
  public static var canonical: NormalizationForm { get }
}
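As a rough illustration of how that straw man might be used, the sketch below defines a stand-in `NormalizationForm` enum and a `normalized(_:)` method. Both are hypothetical: the enum cases, the method name, and the choice of NFC for `canonical` are assumptions made for the example, and the normalization itself is borrowed from Foundation's existing String properties.

```swift
import Foundation

// Hypothetical stand-in for the type under discussion.
enum NormalizationForm {
    case nfc, nfd, nfkc, nfkd

    // Assumption for this sketch only: the preferred form is NFC today.
    // Callers should treat it as opaque, since it could change between versions.
    static var canonical: NormalizationForm { .nfc }
}

extension String {
    // Hypothetical eager API, implemented with Foundation's normalizers.
    func normalized(_ form: NormalizationForm) -> String {
        switch form {
        case .nfc:  return precomposedStringWithCanonicalMapping
        case .nfd:  return decomposedStringWithCanonicalMapping
        case .nfkc: return precomposedStringWithCompatibilityMapping
        case .nfkd: return decomposedStringWithCompatibilityMapping
        }
    }
}

// Ask for whatever the library prefers rather than naming a form explicitly.
let key = "cafe\u{301}".normalized(.canonical)
let decomposedKey = "caf\u{E9}".normalized(.nfd)
```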