This is a summary; some sections have been omitted for brevity. See the full proposal for the complete text.
Unicode Normalization
- Proposal: SE-NNNN
- Authors: Karl Wagner, Michael Ilseman (@Michael_Ilseman)
- Review Manager: TBD
- Status: Pitch
- Implementation: swiftlang/swift#75298
Introduction
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other.
Normalization is a fundamental operation when processing Unicode text, and as such deserves to be one of the core text processing algorithms exposed by the standard library. It is used by the standard library internally to implement basic String operations and is of great importance to other libraries.
Motivation
Normalization determines whether two pieces of text can be considered equivalent: if two equivalent pieces of text are normalized to the same form, they will consist of exactly the same Unicode scalars (and by extension, the same UTF-8/16 code-units).
Unicode defines two categories of equivalence:
- Canonical Equivalence
  A fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior.
  (UAX#15 Unicode Normalization Forms)
  Example: "Ω" (U+2126 OHM SIGN) is canonically equivalent to "Ω" (U+03A9 GREEK CAPITAL LETTER OMEGA).
- Compatibility Equivalence
  It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate.
  (UAX#15 Unicode Normalization Forms)
  Example: "Ⅳ" (U+2163 ROMAN NUMERAL FOUR) is compatibility equivalent to the ASCII string "IV".
Additionally, some scalars are equivalent to a sequence of scalars and combining marks. These are called canonical composites, and when producing the canonical or compatibility normal form of some text, we can further choose for it to contain either decomposed or precomposed representations of these composites.
These are different forms, but importantly are not additional categories of equivalence. Applications are free to compose or decompose text without affecting equivalence.
Example: The famous "é"
Decomposed | "e\u{0301}" (2 scalars: U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT) |
Precomposed | "é" (1 scalar: U+00E9 LATIN SMALL LETTER E WITH ACUTE) |
// Decomposed and Precomposed forms are canonically equivalent.
// Applications can choose to work with whichever form
// is more convenient for them.
assert("e\u{0301}" == "é")
Combining the two categories of equivalence with the choice of decomposed or precomposed representation defines the four normal forms:
| | Canonical | Compatibility |
|---|---|---|
| Decomposed | NFD | NFKD |
| Precomposed | NFC | NFKC |
Canonical equivalence is particularly important. The Unicode standard says that programs should treat canonically-equivalent strings identically, and are always free to normalise strings to a canonically-equivalent form internally without fear of altering the text's interpretation.
C6. A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.
The implications of this conformance clause are twofold. First, a process is never required to give different interpretations to two different, but canonical-equivalent character sequences. Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences.
Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. [...]
C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences.
- Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text.
Unicode Standard 15.0, 3.2 Conformance Requirements
Accordingly, as part of ensuring that Swift has first-class support for Unicode, it was decided that String's default Equatable semantics (the == operator) would test canonical equivalence. As a result, by default applications get the ideal behaviour described by the Unicode standard - for instance, if one inserts a String into an Array or Set, it can be found again using any canonically-equivalent String.
var strings: Set<String> = []
strings.insert("\u{00E9}") // precomposed e + acute accent
assert(strings.contains("e\u{0301}")) // decomposed e + acute accent
Other libraries would like similar Unicode support in their own data structures without requiring String for storage, or may require normalisation to implement specific algorithms, standards, or protocols. For instance, normalising to NFD or NFKD allows one to more easily remove diacritics for fuzzy search algorithms and spoof detection, and processing Internationalised Domain Names (IDNs) requires normalising to NFC.
Additionally, String can store and preserve any sequence of code-points, including non-normalised text -- however, since its comparison operators test canonical equivalence, in the worst case both operands will have to be normalised on-the-fly. Normalisation may allocate buffers and involves lookups into Unicode property databases, so this may not always be desirable.
The ability to normalise text in advance (rather than on-the-fly) can deliver some significant benefits. Recall that canonically-equivalent strings, when normalised to the same form, encode to the same bytes of UTF-8; so if our text is already normalised we can perform a simple binary comparison (such as memcmp), and our results will still be consistent with String's default operators. We pay the cost of normalisation once per string rather than paying it up to twice per comparison operation.
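As a minimal sketch of this idea, using the normalized(_:) API proposed later in this document (with elementsEqual standing in for memcmp on the raw UTF-8):
let a = "cafe\u{0301}".normalized(.nfc)
let b = "caf\u{00E9}".normalized(.nfc)
assert(a == b)                        // canonical equivalence (String.==)
assert(a.utf8.elementsEqual(b.utf8))  // simple binary comparison, same answer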
Consider a Trie data structure (which is often used with textual data):
        root
       /  |  \
      a   b   c
     / \   \   \
    p   t   a   a
   / \       \   \
  p   e       t   t
When performing a lookup, we compare the next element in the string we are searching for with the children of our current node and repeat that process to descend the Trie. For instance, when searching for the word "app", we descend from the root to the "a" node, then to the "p" node, etc. If the Trie were filled with normalised text and the search string were also normalised, these could be simple binary comparisons (with no allocations or table lookups) while still matching all canonically-equivalent strings. In fact, so long as we normalise everything going in, the fundamental operation of the Trie doesn't need to know anything about Unicode; it can just operate on binary blobs. Other data structures could benefit from similar techniques - everything from B-Trees (many comparisons) to Bloom filters (computing many hashes).
In summary, normalisation is an extremely important operation and there are many significant benefits to exposing it in the standard library.
Existing API
Currently, normalisation is only exposed via Foundation:
extension String {
var decomposedStringWithCanonicalMapping: String { get }
var decomposedStringWithCompatibilityMapping: String { get }
var precomposedStringWithCanonicalMapping: String { get }
var precomposedStringWithCompatibilityMapping: String { get }
}
There are many reasons to want to revise this interface and bring the functionality into the standard library:
- It is hard to find, using terminology most users will not understand. Many developers will hear about normalisation, and "NFC" and "NFD" are familiar terms of art in that context, but it's difficult to join the dots between "NFC" and precomposedStringWithCanonicalMapping. In JavaScript and many other languages, this operation is:
  "some string".normalize("NFC");
- It does not expose an interface for producing stabilised strings.
- It only accepts input text as a String. There are other interesting data structures which may contain Unicode text, and copying to a String can be a significant overhead for them. The existing API also does not support normalising a Substring or Character; only entire Strings.
- It eagerly normalises the entirety of its input. This is suboptimal when comparing strings or checking if a string is already normalised; applications typically want to early-exit as soon as the result is apparent.
- It is incompatible with streaming APIs. Streams provide their data in incremental chunks, not aligned to any normalisation boundaries. However, normalisation is not closed under concatenation: even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized. This means a program wanting to operate on a stream of normalised text cannot just normalise each chunk separately (a short sketch after this list illustrates the problem). In order to work with the existing API, they would have to forgo streaming entirely, buffer all of the incoming data, copy it into a String, then normalise the entire String at once.
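For example, here is a minimal sketch showing that concatenating two individually normalised chunks need not produce normalised text:
let x = "e"          // already in NFC
let y = "\u{0301}"   // COMBINING ACUTE ACCENT; on its own, also in NFC
let chunkwise = x + y
// 'chunkwise' now contains U+0065 followed by U+0301,
// which NFC would compose into U+00E9 -- so the concatenation
// of two normalised chunks is itself no longer normalised.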
Proposed solution
We propose 3 levels of API, targeting:
- Strings,
- Custom storage and incremental normalisation, and
- A stateful normaliser.
Additionally, we are proposing a handful of smaller enhancements to help developers process text using these APIs.
The proposal aims to advance text processing in Swift and unlock certain key use-cases, but it is not exhaustive. There will remain a healthy amount of subject matter for future consideration.
1. Strings
We propose to introduce functions on StringProtocol (String, Substring) and Character which produce a normalised copy of their contents:
extension Unicode {
@frozen
public enum CanonicalNormalizationForm {
case nfd
case nfc
}
@frozen
public enum CompatibilityNormalizationForm {
case nfkd
case nfkc
}
}
extension StringProtocol {
/// Returns a copy of this string in the given normal form.
///
/// The result is canonically equivalent to this string.
///
public func normalized(
_ form: Unicode.CanonicalNormalizationForm
) -> String
/// Returns a copy of this string in the given normal form.
///
/// The result _may not_ be canonically equivalent to this string.
///
public func normalized(
_ form: Unicode.CompatibilityNormalizationForm
) -> String
/// Returns a copy of this string in the given normal form,
/// if the result is stable.
///
/// A stable normalization will not change if normalized again
/// to the same form by any version of Unicode, past or future.
///
/// The result, if not `nil`, is canonically equivalent
/// to this string.
///
public func stableNormalization(
_ form: Unicode.CanonicalNormalizationForm
) -> String?
/// Returns a copy of this string in the given normal form,
/// if the result is stable.
///
/// A stable normalization will not change if normalized again
/// to the same form by any version of Unicode, past or future.
///
/// The result _may not_ be canonically equivalent to this string.
///
public func stableNormalization(
_ form: Unicode.CompatibilityNormalizationForm
) -> String?
}
extension Character {
/// Returns a copy of this character in the given normal form.
///
/// The result is canonically equivalent to this character.
///
public func normalized(
_ form: Unicode.CanonicalNormalizationForm
) -> Character
}
Character does not offer a stableNormalization function, as the definition of character boundaries is not stable across Unicode versions. While this doesn't technically matter for the purpose of normalisation, it seems wrong to mention stability in the context of characters while their boundaries remain unstable.
Character also does not offer compatibility normalisation, as the compatibility decomposition of a Character may result in multiple Characters. However, Characters may be normalised to their canonical equivalents.
Usage:
// Here, a database treats keys as binary blobs.
// We normalise at the application level
// to retrofit canonical equivalence for lookups.
func persist(key: String, value: String) throws {
guard let stableKey = key.stableNormalization(.nfc) else {
throw UnsupportedKeyError(key)
}
try writeToDatabase(binaryKey: stableKey.utf8, value: value)
}
func lookup(key: String) -> String? {
let normalizedKey = key.normalized(.nfc)
return lookupInDatabase(binaryKey: normalizedKey.utf8)
}
try! persist(key: "cafe\u{0301}", value: "Present")
lookup(key: "caf\u{00E9}") // ✅ "Present"
The Standard Library's preferred form and documenting String's sort order
String's comparison behaviour sorts canonically-equivalent strings identically, which already implies that it must behave as if its contents were normalised. However, it has never been documented which form it normalises to. We propose documenting it, and moreover documenting it in code:
extension Unicode.CanonicalNormalizationForm {
/// The normal form preferred by the Swift Standard Library.
///
/// String's conformance to `Comparable` sorts values
/// as if their contents were normalized to this form.
///
public static var preferredForm: Self { get }
}
This allows developers to use normalisation to achieve predictable performance, with the guarantee that their results are consistent with String's default operators.
struct NormalizedStringHeap {
// Stores normalised UTF8Views internally
// for cheaper code-unit level comparisons.
// [!] Requires String.UTF8View: Comparable
private var heap: Heap<String.UTF8View> = ...
mutating func insert(_ element: String) {
let normalized = element.normalized(.preferredForm)
heap.insert(normalized.utf8)
}
// This needs to be consistent with String.<
var min: String? {
heap.min.map { utf8 in String(utf8) }
}
}
If an application would like to take advantage of normalisation but doesn't have a preference for a particular form, the standard library's preferred form should be chosen.
2. Custom storage and incremental normalisation
For text in non-String storage, or operations which can early-exit, we propose introducing API which allows developers to lazily normalize any Sequence<Unicode.Scalar>. This API is exposed via a new .normalized namespace wrapper:
Namespace:
extension Unicode {
/// Normalized representations of Unicode text.
///
/// This type exposes `Sequence`s and `AsyncSequence`s which
/// wrap a source of Unicode scalars and lazily normalize it.
///
@frozen
public struct NormalizedScalars<Source> { ... }
}
extension Sequence<Unicode.Scalar> {
/// A namespace providing normalized versions of this sequence's contents.
///
public var normalized: NormalizedScalars<Self> { get }
}
Normalised sequence:
extension Unicode.NormalizedScalars
where Source: Sequence<Unicode.Scalar> {
/// The contents of the source, normalized to NFD.
///
public var nfd: NFD { get }
@frozen
public struct NFD: Sequence {
public typealias Element = Unicode.Scalar
}
// and same for NFC, NFKD, NFKC.
}
Usage:
struct Trie {
private class Node {
var children: [Unicode.Scalar: Node] = [:]
var hasTerminator = false
}
private var root = Node()
func contains(_ key: some StringProtocol) -> Bool {
var node = root
for scalar in key.unicodeScalars.normalized.nfc {
guard let next = node.children[scalar] else {
// Early-exit:
// We know that 'key' isn't in this Trie,
// no need to normalize the rest of it.
return false
}
node = next
}
return node.hasTerminator
}
}
We also propose async versions of the above, to complement the AsyncUnicodeScalarSequence available in Foundation.
extension AsyncSequence where Element == Unicode.Scalar {
/// A namespace providing normalized versions of this sequence's contents.
///
public var normalized: Unicode.NormalizedScalars<Self> { get }
}
extension Unicode.NormalizedScalars
where Source: AsyncSequence<Unicode.Scalar> {
/// The contents of the source, normalized to NFD.
///
public var nfd: AsyncNFD { get }
@frozen
public struct AsyncNFD: AsyncSequence {
public typealias Element = Unicode.Scalar
public typealias Failure = Source.Failure
}
// and same for NFC, NFKD, NFKC.
}
Usage:
import Foundation
let url = URL(...)
for try await scalar in url.resourceBytes.unicodeScalars.normalized.nfc {
// NFC scalars, loaded and normalized on-demand.
}
We do not propose exposing normalised scalars as a Collection. This is explained in Alternatives Considered.
3. Stateful normaliser
While Sequence and AsyncSequence-level APIs are sufficient for most developers, specialised use-cases may benefit from directly applying the normalisation algorithm. For these, we propose a stateful normaliser, which encapsulates the state of a single "logical" text stream and is fed "physical" chunks of source data.
extension Unicode {
/// A normalizer representing a single logical text stream.
///
/// The normalizer has value semantics, so it may be copied
/// and stored indefinitely, and is inherently thread-safe.
///
public struct NFDNormalizer: Sendable {
public init()
/// Returns the next normalized scalar,
/// consuming data from the given source if necessary.
///
public mutating func resume(
consuming source: inout some IteratorProtocol<Unicode.Scalar>
) -> Unicode.Scalar?
/// Returns the next normalized scalar,
/// iteratively invoking the scalar producer if necessary
///
public mutating func resume(
scalarProducer: () -> Unicode.Scalar?
) -> Unicode.Scalar?
/// Marks the end of the logical text stream
/// and returns remaining data from the normalizer's buffers.
///
public mutating func flush() -> Unicode.Scalar?
/// Resets the normalizer to its initial state.
///
/// Any allocated buffer capacity will be kept and reused
/// unless it exceeds the given maximum capacity,
/// in which case it will be discarded.
///
public mutating func reset(maximumCapacity: Int = default)
}
// and same for NFC, NFKD, NFKC.
}
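As an illustration, here is a minimal sketch of driving the proposed normaliser with chunked input; readNextChunk() is a hypothetical source of scalar chunks from a single logical stream:
func normalizeStream(readNextChunk: () -> [Unicode.Scalar]?) -> [Unicode.Scalar] {
  var normalizer = Unicode.NFDNormalizer()
  var output: [Unicode.Scalar] = []
  // Feed each physical chunk into the same logical stream.
  while let chunk = readNextChunk() {
    var iterator = chunk.makeIterator()
    while let scalar = normalizer.resume(consuming: &iterator) {
      output.append(scalar)
    }
    // resume returned nil: this chunk is exhausted, but the normalizer
    // may still be buffering scalars that need data from the next chunk.
  }
  // End of the logical stream: drain whatever remains in the buffers.
  while let scalar = normalizer.flush() {
    output.append(scalar)
  }
  return output
}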
Other Additions
We propose a range of minor additions related to the use of the above API.
Unicode.Scalar properties
So that streaming use-cases can efficiently produce stabilised strings, we will add a .isUnassigned property to Unicode.Scalar:
extension Unicode.Scalar {
public var isUnassigned: Bool { get }
}
Currently the standard library offers two ways to access this information:
scalar.properties.generalCategory == .unassigned
scalar.properties.age == nil
Unfortunately these queries are less amenable to fast paths covering large contiguous blocks of known-assigned scalars. We can significantly reduce the number of table lookups for the most common scripts with a simple boolean property.
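For example, a streaming producer might use the property to decide whether it can offer a stable normalisation at all (a sketch, assuming the property proposed above):
func containsOnlyAssignedScalars(_ scalars: some Sequence<Unicode.Scalar>) -> Bool {
  // A normalisation can only be guaranteed stable if every scalar
  // is assigned in the current Unicode version.
  return !scalars.contains { $0.isUnassigned }
}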
We will also add "Quick Check" properties, which are useful in a range of Unicode algorithms.
extension Unicode {
@frozen
public enum QuickCheckResult {
case yes
case no
case maybe
}
}
extension Unicode.Scalar.Properties {
// The QC properties for decomposed forms
// always return yes or no.
public var isNFD_QC: Bool { get }
public var isNFKD_QC: Bool { get }
// The QC properties for precomposed forms
// can return "maybe".
public var isNFC_QC: Unicode.QuickCheckResult { get }
public var isNFKC_QC: Unicode.QuickCheckResult { get }
}
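To illustrate how these might be used, here is a sketch of the UAX#15 quick-check scan for NFC, built on the properties proposed above together with the existing canonicalCombiningClass property:
func quickCheckNFC(_ scalars: some Sequence<Unicode.Scalar>) -> Unicode.QuickCheckResult {
  var result = Unicode.QuickCheckResult.yes
  var lastCanonicalClass: UInt8 = 0
  for scalar in scalars {
    let ccc = scalar.properties.canonicalCombiningClass.rawValue
    if ccc != 0 && ccc < lastCanonicalClass {
      return .no   // Combining marks are out of canonical order.
    }
    switch scalar.properties.isNFC_QC {
    case .no:    return .no
    case .maybe: result = .maybe  // Needs a full normalization check.
    case .yes:   break
    }
    lastCanonicalClass = ccc
  }
  return result
}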
Checking for Normalisation
It is possible to efficiently check whether text is already normalised. We offer this on all of the types mentioned above that are used for storing text:
- StringProtocol (String/Substring)
- Character
- Sequence<Unicode.Scalar>
- Collection<Unicode.Scalar>
- AsyncSequence where Element == Unicode.Scalar
extension Sequence<Unicode.Scalar> {
public func isNormalized(
_ form: Unicode.CanonicalNormalizationForm
) -> Bool
public func isNormalized(
_ form: Unicode.CompatibilityNormalizationForm
) -> Bool
}
Of note, we offer a test for compatibility normalisation on Character even though it does not have a .normalized() function for compatibility forms. Also, there is a unique implementation for Collection which can be more efficient than the one for single-pass sequences.
The results of these functions are definite, with no false positives or false negatives.
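For example, a library might use this check to avoid re-normalising input that is already in its preferred form (a sketch using the proposed APIs):
func canonicalKey(_ key: String) -> String {
  if key.isNormalized(.nfc) {
    return key                 // already NFC; avoid allocating a copy
  }
  return key.normalized(.nfc)
}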
Add common protocol conformances to String views
String's default comparison and equivalence operations ensure applications handle Unicode text correctly. Once we add normalisation APIs, developers will be able to take manual control over how these semantics are implemented - for instance, by ensuring all data in a Heap is normalised to the same form for efficient comparisons.
However, String does not always know when its contents are normalised. Instead, a developer who is maintaining this invariant themselves should be able to easily opt in to code-unit or scalar level comparison semantics.
struct NormalizedStringHeap {
// [!] Requires String.UTF8View: Comparable
var heap: Heap<String.UTF8View> = ...
mutating func insert(_ element: String) {
// .insert performs O(log(count)) comparisons.
//
// Now they are guaranteed to be simple binary comparisons,
// with no allocations or table lookups,
// while having the same semantics as String.<
heap.insert(element.normalized(.preferredForm).utf8)
}
}
We propose adding the following conformances to String's UTF8View, UTF16View, and UnicodeScalarView:
- Equatable. Semantics: Exact code-unit/scalar match.
- Hashable. Semantics: Must match Equatable.
- Comparable. Semantics: Lexicographical comparison of code-units/scalars.
These conformances will likely also be useful for embedded applications, where String itself may lack them.
Creating a String or Character from Scalars
This is a straightforward gap in String's API.
extension String {
public init(_: some Sequence<Unicode.Scalar>)
}
extension Character {
/// Returns `nil` if more than one extended grapheme cluster
/// is present.
public init?(_: some Sequence<Unicode.Scalar>)
}
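A brief sketch of how these initialisers might be used:
let scalars: [Unicode.Scalar] = ["e", "\u{0301}"]
let string = String(scalars)        // canonically equivalent to "é"
let character = Character(scalars)  // non-nil: the scalars form one grapheme cluster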
Acknowledgments
Alejandro Alonso (@Alejandro) originally implemented normalisation in the standard library. The proposed interfaces build on his work.