String case folding and normalization APIs

[NOTE: I'm posting this on behalf of @xwu, who authored this proposal and asked me to push it forward due to personal time constraints. It aligns nicely with some of my other Unicode interests, and I think it would be a non-intrusive and valuable addition to the string handling facilities of the standard library.]

cc: @Michael_Ilseman


Introduction

We propose to add APIs for case folding and normalization of Unicode strings to the standard library.

Motivation

The standard library offers Unicode-aware processing of strings by default. It relies on system ICU libraries to implement that support, but for reasons outlined in SE-0211, these APIs cannot be used in a straightforward way by end users.

The same Unicode extended grapheme cluster (known in Swift as a "character") can have more than one underlying representation. For example, Ă„ (00C4 LATIN CAPITAL LETTER A WITH DIAERESIS) and Ă„ (0041 LATIN CAPITAL LETTER A + 0308 + COMBINING DIAERESIS) are equivalent. Unicode-aware, locale-independent string comparisons must rely on normalization to ignore any "irrelevant" differences and on case folding to ignore case. (Case folding is an operation defined by the Unicode standard that's related to case conversion but stable across Unicode revisions, language-neutral, and context-insensitive.)

In Swift, the standard library offers Unicode-aware, locale-independent string comparisons, while Foundation augments these facilities with locale-specific comparisons. Since Swift 4.2, the standard library uses "Fast C Contiguous" (FCC) normalization when it performs (case-sensitive) string comparisons and sorting, producing results that are generally consistent with user expectations:

let x = "\u{00C4}"         // Ă„
let y = "\u{0041}\u{0308}" // Ă„
x == y                     // true

(Case-insensitive string comparisons require Foundation.)

For many users, these facilities may be sufficient. However, there are clear use cases that motivate direct access to normalization and case folding functionality. Consider the following scenario:

A recent pull request improves the swift-corelibs-foundation implementation of HTTPCookie to align with IETF standards. One improvement was the use of lowercased() prior to domain matching; this is sufficient for ASCII domain names.

However, suppose we wished (as might well be advisable for a language such as Swift that provides Unicode-aware string handling by default) to support internationalized domain names (IDNs). Conversion of IDNs to their ASCII form requires use of the Nameprep and Punycode algorithms; it's possible to
implement Punycode in Swift-native code, but Nameprep requires case folding and NFKC normalization and can't feasibly be implemented in Swift!

This example, we believe, is not one-of-its-kind. In general, code that will need to interoperate with other systems by handling non-ASCII text can be expected to transform input or produce output in ways that ignore "unwanted"
differences in representation, where what's "unwanted" is specified by the relevant standard and may differ from Swift's adopted FCC normalization. This is precisely why Unicode defines four normalization forms that decompose code points either canonically or compatibly (while ICU also provides access to two other "fast" algorithms specified in Unicode Technical Note #5, FCC being one of them):

  • Normalization Form D (NFD), canonical composition
  • Normalization Form C (NFC), canonical decomposition followed by canonical
    decomposition
  • Normalization Form KD (NFKD), compatibility decomposition
  • Normalization Form KC (NFKC), compatibility decomposition followed by
    canonical composition
  • "Fast C or D" (FCD)
  • "Fase C Contiguous" (FCC)

(See the proposed implementation for an introductory description of canonical and compatibility decomposition and these normalization forms.)

Other standards reference one or more of these normalization forms that conforming implementations should then use when processing text. We propose to expose these normalization forms and associated facilities in the standard library to enable such code to be written in native Swift.

Proposed solution

In SE-0211,
a significant number of Unicode properties were exposed to users; they adhere closely to the Unicode standard, and (as a result) not all of them correspond exactly with likely expectations of users not familiar with the standard. The property isEmoji, for example, does not align perfectly with what the average smartphone user might consider to be emoji for various reasons too complex to cover here. Since these properties were intended for advanced use only, and since there were so many of them, they added to a separate Unicode.Scalar.Properties structure rather than Unicode.Scalar itself.

An important question here is where case folding and normalization are appropriate to be exposed to users on String itself. Several reasons suggest that this is the case:

  • Unlike properties such as isEmoji, users unfamiliar with the Unicode standard are unlikely to have subtly erroneous preconceptions as to what case folding and normalization are.
  • There are not so many new APIs, as there were in SE-0211, such that their addition would unduly bloat the existing String API.
  • The case conversion operations uppercased() and lowercased() certainly have their place, but where users arbitrarily choose one or the other for language-independent caseless comparison (such as in the case of domain matching discussed above), the more appropriate facility is case folding, which would ideally be presented to users as an option exactly where case conversions are available.
  • In Foundation, NSString provides locale-sensitive case and diacritic folding in the method folding(options:locale:), but the standard library does not provide the Unicode-defined locale-independent counterpart as it does for case conversion.
  • Many (if not most) common languages provide Unicode normalization facilities, and often as methods available on their counterpart to the String type. These languages are not exclusively "low level" and many do not also provide the full complement of ICU APIs, suggesting that normalization in particular has more use cases than perhaps other Unicode facilities:
Language Normalization method
C# str.Normalize(form)
Java Normalizer.normalize(str, form)
JavaScript str.normalize(form)
Python unicodedata.normalize(form, str)
Rust str.nfc(), str.nfd(), etc.

Therefore, we propose adding two methods to StringProtocol (and, therefore, to the only two types that are permitted to conform to that protocol, String and Substring):

// Existing methods:
func lowercased() -> String
func uppercased() -> String

// Proposed methods
func caseFolded() -> String
func normalized(_ form: Unicode.NormalizationForm) -> String

The available Unicode normalization forms will be defined as follows:

/* non-frozen */ public enum NormalizationForm {
  case nfd
  case nfc
  case nfkd
  case nfkc
  case fcd
  case fcc
}

Detailed design

The detailed design, including proposed documentation, is available in the implementation linked above. We describe some detailed design considerations below.

Case folding

The proposed case folding API is modeled after the existing methods uppercased() and lowercased(). In line with Unicode properties recently added in SE-0211 (changesWhenUppercased, changesWhenLowercased,
changesWhenCaseFolded), it is spelled caseFolded() (note the use of camel case).

The implementation calls an ICU API to perform the Unicode-defined, language-independent, context-insensitive default case folding. Documentation will emphasize what sort of usage the case folding method is meant to support (caseless string matching) and what it's not recommended for (natural language text for human consumption).

No option is provided to use Turkic mappings (for dotless I's). Although linguistically important and supported by the underlying ICU API, it is only one of many possible language-specific case mappings. Providing this one mapping option dilutes the message that the function is (as the Unicode standard says) intended to be a language-neutral facility. Python, which also exposes the case folding API (str.casefold()) similarly does not provide a specific Turkic option.

Normalization

The normalization API is modeled after rounded(_:), taking one unlabeled option. Some may wonder whether a preposition such as "using" might be useful as an argument label. However, in our view, it does not read any better or worse, so we settled on the more concise option. Texts that discuss Unicode normalization often refer to "NFC strings" and "NFKC strings," and string.normalized(.nfc) can be read "string, normalized NFC" just as
value.rounded(.down) is intended to be read "value, rounded down." (Another possible argument label, form, was also considered; however, it is strictly redundant with the parameter type, Unicode.NormalizationForm.)

We also considered also whether normalization form names ought to be written out in some way rather than using the term-of-art abbreviation. Such a design was disfavored because it did not improve clarity but did increase verbosity. Specifically:

  • .nfd is more readable at a glance than .canonicalDecomposition.
  • .nfd and .nfkd are more visually distinct from each other than are .canonicalDecomposition and .compatibilityDecomposition.
  • A reader who is unfamiliar with Unicode normalization is no more likely to understand the term "canonical decomposition" than they are to understand "NFD": both are terms of art, and the latter is more easily googled.
  • A reader who has some familiarity with Unicode normalization is more likely to associate "NFD" with the relevant algorithm than "canonical decomposition" (or even "form D").
  • Corresponding properties added to Unicode.Scalar.Properties refer to these algorithms by their abbreviations: isNFDInert, isNFKDInert, etc.
  • The written-out name that would be used instead of .nfkc is .compatibilityDecompositionFollowedByCanonicalComposition, which is absurd.
  • The written-out name that would be used instead of .fcd would be .fastCOrD, which is not more readable and itself still uses the abbreviations "C" and "D."

We also considered whether there should be a default option for normalization form. In other languages that provide such a default option, it is NFC, and users familiar with Unicode normalization may expect the same default in Swift if one exists. In Swift, however, strings use FCC normalization before comparison, and Swift users unfamiliar with Unicode normalization may expect that the default option for explicit normalization to be the same as that for comparison. It is not particularly cumbersome to choose one or another normalization form explicitly; moreover, it adds clarity for the reader and avoids this issue regarding which normalization form ought to be the default.

The implementation calls an ICU API to perform the indicated normalization. Prior to doing so, it will call another ICU API to check quickly if the given input is definitely already normalized; if so, it will return self. In practice, most strings will already be normalized, for most normalization forms.

No separate API is currently proposed here (although, if a clear use case arises, it can be later added) for separately checking if a string is already normalized. Thus far, the envisioned use cases for the standard library APIs proposed here would use such a check, then return normalize if required or use the string as-is otherwise. However, Swift.String is a copy-on-write (CoW) type, and the proposed implementation of normalize(_:) already performs such a check.

Source compatibility

The proposed changes are strictly additive; there is no effect on source compatibility.

Effect on ABI stability

The proposed changes are strictly additive; there is no effect on the ABI of existing facilities.

Effect on API resilience

The Unicode.NormalizationForm enumeration is defined as non-@frozen so that future normalization forms can be added to Swift when they come into being.

Alternatives considered

An alternative to adding case folding to String (which may still have independent value) is to add caseFoldMapping to Unicode.Scalar.Properties to complement lowercaseMapping and others. Its absence appears to be an oversight. However, as discussed above, the conceptually related case conversion operations are already used by some users where case folding would be the preferred solution, and those are available as String methods. Moreover, Foundation offers a locale-aware counterpart as an NSString method. Therefore,
we consider it appropriate to expose caseFolded() where lowercased() and uppercased() are available, and to allow users to case-fold entire strings rather than to proceed through each code point to concatenate mapped strings.

Normalization can be exposed in a Normalizer type and not as a method on String. However, for the reasons outlined above, we feel that such a method is appropriate on String and does not cause undue API bloat.

Finally, we also considered including a titlecased() method. It would largely be similar in implementation to the case folding API (with the exception that a UBreakIterator is required to identify the beginning of words). However, the result of titlecasing by locale-independent Unicode rules is unsatisfactory for human consumption in obvious ways. Consider, for example:

# Written in (actual, real) Python, which *does* expose this API:
"we're titlecasing!".title()
# "We'Re Titlecasing!"

Discussions in the Python community reveal pervasive dissatisfaction and misunderstanding. However, as Guido van Rossum has pointed out, such discrepancies between the Unicode standard and non-expert user expectations cannot actually be reconciled even by means of alternative rules: for example, no rule could correctly titlecase both "we're" and "O'Brien." Although such an API would expose some useful functionality (around digraphs, for example), a reasonable conclusion would be that something so prone to mismatched user expectations is not an appropriate general-use API for the standard library. Moreover, for digraphs, the titlecase mapping can be obtained through the Unicode.Scalar.Properties property titlecaseMapping.

10 Likes

It sounds good overall (we definitely need string normalization in the standard library), but I have no context to evaluate the differences between the different normalization forms. At the very least, a default option should be provided. We shouldn't make beginners have to google + understand the nuances of the Unicode spec just to do a case/diacritic insensitive comparison...

Big +1 to the idea and effort, some comments on the details.

This is incorrect (and mostly irrelevant). See here

This is actually a non-goal. The purpose of ordering, beyond equality, is for maintaining programmer invariants such as the sorted-ness of a data structure. It does not try to provide a universal ordering appropriate for presentation to humans, such as UCA+DUCET tries to do.

Nit: FCD is not a normal form; it’s a subset of strings.

A big decision decision here is whether we should have eager APIs that produce new Strings, or if instead we should provide a (lazy) normalized/case-folded view, or both.

For example, String.UnicodeScalarView could have a var nfc property, which provides a view of lazily-NFC-normalized scalars. Similarly for the UTF-8 and UTF-16 views, which provide normalized code units. This could be in addition to, or in lieu of, the eager one on String.

For example, inside the standard library’s current comparison implementation, we have lots of fast-paths for common and already-normalized situations, but can fall back to a slow-path involving lazily-normalized UTF-16 code units. Avoiding an extra allocation and copy is beneficial, especially since comparison can early-exit. Additionally, even non-normal strings are often nearly-normal, so copying an entire string’s contents is not always needed.

Having an eager method on String gels very well with the goals of performance flags remembering whether a String is already in the stdlib’s preferred normal form. In those cases, canonical equivalence is binary equivalence, so we can take a memcmp-like fast path. On the other hand, since the resulting type is String and not some kind of NormalizedString<NFC>, the normalization status is not represented in the type and up to the programmer to remember.

In other languages that provide such a default option, it is NFC, and users familiar with Unicode normalization may expect the same default in Swift if one exists. In Swift, however, strings use FCC normalization before comparison, and Swift users unfamiliar with Unicode normalization may expect that the default option for explicit normalization to be the same as that for comparison.

Again, it’s currently NFC, but that’s not part of the stdlib's contract. The precise ordering results can vary version-to-version of Swift. For example, it likely will be changed in the future to be scalar-order rather than code-unit order.

I do think it would be useful for performance-minded users to also provide an abstract “stdlib preferred” normal form, which can accelerate comparisons using the aforementioned fast-paths. This stdlib-canonical form will almost certainly remain some kind of composed form for memory efficiency, even if it complicates some implementation. This could be something like (straw man):

extension NormalizationForm {
  public static var canonical: NormalizationForm { get }
}
9 Likes

These are very useful comments! I have to apologize for not having the time in the next few months to take this on directly to the finish line and thank @allevato for pushing it forward. Since it's being discussed, though, I feel I should chime in with some brief replies:

I struggled with how to describe that particular option; ICU offers an FCD normalizer, but it's not a unique form. Do you think it holds its own weight being exposed in Swift given that it's not really a normalization form?

Yes, that's the big question, in my view. Rust offers a view, for example, although all other languages surveyed in the draft proposal offer a string.

In addition to the arguments you give, one that pushed me towards eager String relates to compatibility mappings: a string will not necessarily be equivalent to its NFKC or NFKD counterpart according to Swift's definition of equivalence, and sometimes by a long stretch. The functionality would be usable as a view, but I think conceptually it's cleaner when normalization either returns the original string or allocates a totally new string.

Along the same lines, Unicode describes what single normalization operation is equivalent to a series of chained normalizations. For example, toNFD(toNFD(x)) = toNFD(x). In general:

...any chain of normalizations is equivalent to a single normalization, which is:

  • a compatibility normalization, if any normalization is a compatibility normalization
  • a composition normalization, if the final normalization is a composition normalization

However, exposing the functionality as a view may falsely give the user an impression that any series of chained transformations is equivalent to the last transformation (since accessing a view wouldn't change the underlying data), but that is not how Unicode specifies equivalence of chained normalization operations.

This is perfect, and it would provide a great way of implementing what @Jon_Hull has asked for, a default option that makes sense for the Swift standard library without committing to any particular normalization form forever. The initial implementation could, of course, be more naive and then it could be optimized later to be maximally efficient.

2 Likes

Has there been any movement on this? I see caseFolded as being particularly useful on the server for uniquing identifiers like user handles.

Not that I know of. Normalized views and case folding are still in the "future" category of Unicode APIs. Case folding can be fairly easily separated out into its own pitch adding String.caseFolded() and Unicode.Scalar.Properties.caseFolding: String.