This is a summary; some sections have been omitted for brevity. See the full proposal for the complete text.
Unicode Normalization
- Proposal: SE-NNNN
- Authors: Karl Wagner, Michael Ilseman (@Michael_Ilseman)
- Review Manager: TBD
- Status: Pitch
- Implementation: swiftlang/swift#75298
Introduction
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other.
Normalization is a fundamental operation when processing Unicode text, and as such deserves to be one of the core text processing algorithms exposed by the standard library. It is used by the standard library internally to implement basic String operations and is of great importance to other libraries.
Motivation
Normalization determines whether two pieces of text can be considered equivalent: if two equivalent pieces of text are normalized to the same form, they will consist of exactly the same Unicode scalars (and by extension, the same UTF-8/16 code-units).
Unicode defines two categories of equivalence:
- Canonical Equivalence
  A fundamental equivalency between characters or sequences of characters which represent the same abstract character, and which when correctly displayed should always have the same visual appearance and behavior.
  (UAX#15 Unicode Normalization Forms)
  Example: "Ω" (U+2126 OHM SIGN) is canonically equivalent to "Ω" (U+03A9 GREEK CAPITAL LETTER OMEGA).
- Compatibility Equivalence
  It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate.
  (UAX#15 Unicode Normalization Forms)
  Example: "Ⅳ" (U+2163 ROMAN NUMERAL FOUR) is compatibility equivalent to the ASCII string "IV".
Additionally, some scalars are equivalent to a sequence of scalars and combining marks. These are called canonical composites, and when producing the canonical or compatibility normal form of some text, we can further choose for it to contain either decomposed or precomposed representations of these composites.
These are different forms, but importantly are not additional categories of equivalence. Applications are free to compose or decompose text without affecting equivalence.
Example: The famous "é"
Decomposed | "e\u{0301}" (2 scalars: U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT) |
Precomposed | "é" (1 scalar: U+00E9 LATIN SMALL LETTER E WITH ACUTE) |
// Decomposed and Precomposed forms are canonically equivalent.
// Applications can choose to work with whichever form
// is more convenient for them.
assert("e\u{0301}" == "é")
Combining the two categories of equivalence with the choice of decomposed or precomposed representation defines the four normal forms:
| | Canonical | Compatibility |
|---|---|---|
| Decomposed | NFD | NFKD |
| Precomposed | NFC | NFKC |
Canonical equivalence is particularly important. The Unicode standard says that programs should treat canonically-equivalent strings identically, and are always free to normalise strings to a canonically-equivalent form internally without fear of altering the text's interpretation.
C6. A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.
The implications of this conformance clause are twofold. First, a process is never required to give different interpretations to two different, but canonical-equivalent character sequences. Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences.
Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. [...]
C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences.
- Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text.
Unicode Standard 15.0, 3.2 Conformance Requirements
Accordingly, as part of ensuring that Swift has first-class support for Unicode, it was decided that String's default Equatable semantics (the == operator) would test canonical equivalence. As a result, by default applications get the ideal behaviour described by the Unicode standard - for instance, if one inserts a String into an Array or Set, it can be found again using any canonically-equivalent String.
var strings: Set<String> = []
strings.insert("\u{00E9}") // precomposed e + acute accent
assert(strings.contains("e\u{0301}")) // decomposed e + acute accent
Other libraries would like similar Unicode support in their own data structures without requiring String for storage, or may require normalisation to implement specific algorithms, standards, or protocols. For instance, normalising to NFD or NFKD allows one to more easily remove diacritics for fuzzy search algorithms and spoof detection, and processing Internationalised Domain Names (IDNs) requires normalising to NFC.
Additionally, String can store and preserve any sequence of code-points, including non-normalised text -- however, since its comparison operators test canonical equivalence, in the worst case both operands will have to be normalised on-the-fly. Normalisation may allocate buffers and involves lookups into Unicode property databases, so this may not always be desirable.
The ability to normalise text in advance (rather than on-the-fly) can deliver some significant benefits. Recall that canonically-equivalent strings, when normalised to the same form, encode to the same bytes of UTF-8; so if our text is already normalised we can perform a simple binary comparison (such as memcmp), and our results will still be consistent with String's default operators. We pay the cost of normalisation once per string rather than paying it up to twice per comparison operation.
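As a minimal sketch of this idea, using the normalized(_:) API proposed later in this document (with elementsEqual standing in for memcmp on the raw UTF-8):
let a = "cafe\u{0301}".normalized(.nfc)
let b = "caf\u{00E9}".normalized(.nfc)
assert(a == b)                        // canonical equivalence (String.==)
assert(a.utf8.elementsEqual(b.utf8))  // simple binary comparison, same answer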
Consider a Trie data structure (which is often used with textual data):
        root
       /  |  \
      a   b   c
     / \   \   \
    p   t   a   a
   / \       \   \
  p   e       t   t
When performing a lookup, we compare the next element in the string we are searching for with the children of our current node and repeat that process to descend the Trie. For instance, when searching for the word "app", we descend from the root to the "a" node, then to the "p" node, etc. If the Trie were filled with normalised text and the search string were also normalised, these could be simple binary comparisons (with no allocations or table lookups) while still matching all canonically-equivalent strings. In fact, so long as we normalise everything going in, the fundamental operation of the Trie doesn't need to know anything about Unicode; it can just operate on binary blobs. Other data structures could benefit from similar techniques - everything from B-Trees (many comparisons) to Bloom filters (computing many hashes).
In summary, normalisation is an extremely important operation and there are many significant benefits to exposing it in the standard library.
Existing API
Currently, normalisation is only exposed via Foundation:
extension String {
var decomposedStringWithCanonicalMapping: String { get }
var decomposedStringWithCompatibilityMapping: String { get }
var precomposedStringWithCanonicalMapping: String { get }
var precomposedStringWithCompatibilityMapping: String { get }
}
There are many reasons to want to revise this interface and bring the functionality into the standard library:
- It is hard to find, using terminology most users will not understand. Many developers will hear about normalisation, and "NFC" and "NFD" are familiar terms of art in that context, but it's difficult to join the dots between "NFC" and precomposedStringWithCanonicalMapping. In JavaScript and many other languages, this operation is:
  "some string".normalize("NFC");
- It does not expose an interface for producing stabilised strings.
- It only accepts input text as a String. There are other interesting data structures which may contain Unicode text, and copying to a String can be a significant overhead for them. The existing API also does not support normalising a Substring or Character; only entire Strings.
- It eagerly normalises the entirety of its input. This is suboptimal when comparing strings or checking if a string is already normalised; applications typically want to early-exit as soon as the result is apparent.
- It is incompatible with streaming APIs. Streams provide their data in incremental chunks, not aligned to any normalisation boundaries. However, normalisation is not closed under concatenation: even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized. This means a program wanting to operate on a stream of normalised text cannot just normalise each chunk separately (a short sketch after this list illustrates the problem). In order to work with the existing API, they would have to forgo streaming entirely, buffer all of the incoming data, copy it into a String, then normalise the entire String at once.
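For example, here is a minimal sketch showing that concatenating two individually normalised chunks need not produce normalised text:
let x = "e"          // already in NFC
let y = "\u{0301}"   // COMBINING ACUTE ACCENT; on its own, also in NFC
let chunkwise = x + y
// 'chunkwise' now contains U+0065 followed by U+0301,
// which NFC would compose into U+00E9 -- so the concatenation
// of two normalised chunks is itself no longer normalised.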
Proposed solution
We propose 3 levels of API, targeting:
- Strings,
- Custom storage and incremental normalisation, and
- A stateful normaliser.
Additionally, we are proposing a handful of smaller enhancements to help developers process text using these APIs.
The proposal aims to advance text processing in Swift and unlock certain key use-cases, but it is not exhaustive. There will remain a healthy amount of subject matter for future consideration.
1. Strings
We propose to introduce functions on StringProtocol (String, Substring) and Character which produce a normalised copy of their contents:
extension Unicode {
@frozen
public enum CanonicalNormalizationForm {
case nfd
case nfc
}
@frozen
public enum CompatibilityNormalizationForm {
case nfkd
case nfkc
}
}
extension StringProtocol {
/// Returns a copy of this string in the given normal form.
///
/// The result is canonically equivalent to this string.
///
public func normalized(
_ form: Unicode.CanonicalNormalizationForm
) -> String
/// Returns a copy of this string in the given normal form.
///
/// The result _may not_ be canonically equivalent to this string.
///
public func normalized(
_ form: Unicode.CompatibilityNormalizationForm
) -> String
/// Returns a copy of this string in the given normal form,
/// if the result is stable.
///
/// A stable normalization will not change if normalized again
/// to the same form by any version of Unicode, past or future.
///
/// The result, if not `nil`, is canonically equivalent
/// to this string.
///
public func stableNormalization(
_ form: Unicode.CanonicalNormalizationForm
) -> String?
/// Returns a copy of this string in the given normal form,
/// if the result is stable.
///
/// A stable normalization will not change if normalized again
/// to the same form by any version of Unicode, past or future.
///
/// The result _may not_ be canonically equivalent to this string.
///
public func stableNormalization(
_ form: Unicode.CompatibilityNormalizationForm
) -> String?
}
extension Character {
/// Returns a copy of this character in the given normal form.
///
/// The result is canonically equivalent to this character.
///
public func normalized(
_ form: Unicode.CanonicalNormalizationForm
) -> Character
}
Character does not offer a stableNormalization function, as the definition of character boundaries is not stable across Unicode versions. While this doesn't technically matter for the purpose of normalisation, it seems wrong to mention stability in the context of characters while their boundaries remain unstable.
Character also does not offer compatibility normalisation, as the compatibility decomposition of a Character may result in multiple Characters. However, Characters may be normalised to their canonical equivalents.
Usage:
// Here, a database treats keys as binary blobs.
// We normalise at the application level
// to retrofit canonical equivalence for lookups.
func persist(key: String, value: String) throws {
guard let stableKey = key.stableNormalization(.nfc) else {
throw UnsupportedKeyError(key)
}
try writeToDatabase(binaryKey: stableKey.utf8, value: value)
}
func lookup(key: String) -> String? {
let normalizedKey = key.normalized(.nfc)
return lookupInDatabase(binaryKey: normalizedKey.utf8)
}
try! persist(key: "cafe\u{0301}", value: "Present")
lookup(key: "caf\u{00E9}") // ✅ "Present"
The Standard Library's preferred form and documenting String's sort order
String's comparison behaviour sorts canonically-equivalent strings identically, which already implies that it must behave as if its contents were normalised. However, it has never been documented which form it normalises to. We propose documenting it, and moreover documenting it in code:
extension Unicode.CanonicalNormalizationForm {
/// The normal form preferred by the Swift Standard Library.
///
/// String's conformance to `Comparable` sorts values
/// as if their contents were normalized to this form.
///
public static var preferredForm: Self { get }
}
This allows developers to use normalisation to achieve predictable performance, with the guarantee that their results are consistent with String's default operators.
struct NormalizedStringHeap {
// Stores normalised UTF8Views internally
// for cheaper code-unit level comparisons.
// [!] Requires String.UTF8View: Comparable
private var heap: Heap<String.UTF8View> = ...
mutating func insert(_ element: String) {
let normalized = element.normalized(.preferredForm)
heap.insert(normalized.utf8)
}
// This needs to be consistent with String.<
var min: String? {
heap.min.map { utf8 in String(utf8) }
}
}
If an application would like to take advantage of normalisation but doesn't have a preference for a particular form, the standard library's preferred form should be chosen.
2. Custom storage and incremental normalisation
For text in non-String storage, or operations which can early-exit, we propose introducing API which allows developers to lazily normalize any Sequence<Unicode.Scalar>. This API is exposed via a new .normalized namespace wrapper:
Namespace:
extension Unicode {
/// Normalized representations of Unicode text.
///
/// This type exposes `Sequence`s and `AsyncSequence`s which
/// wrap a source of Unicode scalars and lazily normalize it.
///
@frozen
public struct NormalizedScalars<Source> { ... }
}
extension Sequence<Unicode.Scalar> {
/// A namespace providing normalized versions of this sequence's contents.
///
public var normalized: NormalizedScalars<Self> { get }
}
Normalised sequence:
extension Unicode.NormalizedScalars
where Source: Sequence<Unicode.Scalar> {
/// The contents of the source, normalized to NFD.
///
public var nfd: NFD { get }
@frozen
public struct NFD: Sequence {
public typealias Element = Unicode.Scalar
}
// and same for NFC, NFKD, NFKC.
}
Usage:
struct Trie {
private class Node {
var children: [Unicode.Scalar: Node] = [:]
var hasTerminator = false
}
private var root = Node()
func contains(_ key: some StringProtocol) -> Bool {
var node = root
for scalar in key.unicodeScalars.normalized.nfc {
guard let next = node.children[scalar] else {
// Early-exit:
// We know that 'key' isn't in this Trie,
// no need to normalize the rest of it.
return false
}
node = next
}
return node.hasTerminator
}
}
We also propose async versions of the above, to complement the AsyncUnicodeScalarSequence available in Foundation.
extension AsyncSequence where Element == Unicode.Scalar {
/// A namespace providing normalized versions of this sequence's contents.
///
public var normalized: Unicode.NormalizedScalars<Self> { get }
}
extension Unicode.NormalizedScalars
where Source: AsyncSequence<Unicode.Scalar> {
/// The contents of the source, normalized to NFD.
///
public var nfd: AsyncNFD { get }
@frozen
public struct AsyncNFD: AsyncSequence {
public typealias Element = Unicode.Scalar
public typealias Failure = Source.Failure
}
// and same for NFC, NFKD, NFKC.
}
Usage:
import Foundation
let url = URL(...)
for try await scalar in url.resourceBytes.unicodeScalars.normalized.nfc {
// NFC scalars, loaded and normalized on-demand.
}
We do not propose exposing normalised scalars as a Collection. This is explained in Alternatives Considered.
3. Stateful normaliser
While Sequence and AsyncSequence-level APIs are sufficient for most developers, specialised use-cases may benefit from directly applying the normalisation algorithm. For these, we propose a stateful normaliser, which encapsulates the state of a single "logical" text stream and is fed "physical" chunks of source data.
extension Unicode {
/// A normalizer representing a single logical text stream.
///
/// The normalizer has value semantics, so it may be copied
/// and stored indefinitely, and is inherently thread-safe.
///
public struct NFDNormalizer: Sendable {
public init()
/// Returns the next normalized scalar,
/// consuming data from the given source if necessary.
///
public mutating func resume(
consuming source: inout some IteratorProtocol<Unicode.Scalar>
) -> Unicode.Scalar?
/// Returns the next normalized scalar,
/// iteratively invoking the scalar producer if necessary
///
public mutating func resume(
scalarProducer: () -> Unicode.Scalar?
) -> Unicode.Scalar?
/// Marks the end of the logical text stream
/// and returns remaining data from the normalizer's buffers.
///
public mutating func flush() -> Unicode.Scalar?
/// Resets the normalizer to its initial state.
///
/// Any allocated buffer capacity will be kept and reused
/// unless it exceeds the given maximum capacity,
/// in which case it will be discarded.
///
public mutating func reset(maximumCapacity: Int = default)
}
// and same for NFC, NFKD, NFKC.
}
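As an illustration, here is a minimal sketch of driving the proposed normaliser with chunked input; readNextChunk() is a hypothetical source of scalar chunks from a single logical stream:
func normalizeStream(readNextChunk: () -> [Unicode.Scalar]?) -> [Unicode.Scalar] {
  var normalizer = Unicode.NFDNormalizer()
  var output: [Unicode.Scalar] = []
  // Feed each physical chunk into the same logical stream.
  while let chunk = readNextChunk() {
    var iterator = chunk.makeIterator()
    while let scalar = normalizer.resume(consuming: &iterator) {
      output.append(scalar)
    }
    // resume returned nil: this chunk is exhausted, but the normalizer
    // may still be buffering scalars that need data from the next chunk.
  }
  // End of the logical stream: drain whatever remains in the buffers.
  while let scalar = normalizer.flush() {
    output.append(scalar)
  }
  return output
}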
Other Additions
We propose a range of minor additions related to the use of the above API.
Unicode.Scalar properties
So that streaming use-cases can efficiently produce stabilised strings, we will add a .isUnassigned property to Unicode.Scalar:
extension Unicode.Scalar {
public var isUnassigned: Bool { get }
}
Currently the standard library offers two ways to access this information:
scalar.properties.generalCategory == .unassigned
scalar.properties.age == nil
Unfortunately these queries are less amenable to fast paths covering large contiguous blocks of known-assigned scalars. We can significantly reduce the number of table lookups for the most common scripts with a simple boolean property.
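For example, a streaming producer might use the property to decide whether it can offer a stable normalisation at all (a sketch, assuming the property proposed above):
func containsOnlyAssignedScalars(_ scalars: some Sequence<Unicode.Scalar>) -> Bool {
  // A normalisation can only be guaranteed stable if every scalar
  // is assigned in the current Unicode version.
  return !scalars.contains { $0.isUnassigned }
}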
We will also add "Quick Check" properties, which are useful in a range of Unicode algorithms.
extension Unicode {
@frozen
public enum QuickCheckResult {
case yes
case no
case maybe
}
}
extension Unicode.Scalar.Properties {
// The QC properties for decomposed forms
// always return yes or no.
public var isNFD_QC: Bool { get }
public var isNFKD_QC: Bool { get }
// The QC properties for precomposed forms
// can return "maybe".
public var isNFC_QC: Unicode.QuickCheckResult { get }
public var isNFKC_QC: Unicode.QuickCheckResult { get }
}
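To illustrate how these might be used, here is a sketch of the UAX#15 quick-check scan for NFC, built on the properties proposed above together with the existing canonicalCombiningClass property:
func quickCheckNFC(_ scalars: some Sequence<Unicode.Scalar>) -> Unicode.QuickCheckResult {
  var result = Unicode.QuickCheckResult.yes
  var lastCanonicalClass: UInt8 = 0
  for scalar in scalars {
    let ccc = scalar.properties.canonicalCombiningClass.rawValue
    if ccc != 0 && ccc < lastCanonicalClass {
      return .no   // Combining marks are out of canonical order.
    }
    switch scalar.properties.isNFC_QC {
    case .no:    return .no
    case .maybe: result = .maybe  // Needs a full normalization check.
    case .yes:   break
    }
    lastCanonicalClass = ccc
  }
  return result
}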
Checking for Normalisation
It is possible to efficiently check whether text is already normalised. We offer this on all of the types mentioned above that are used for storing text:
- StringProtocol (String/Substring)
- Character
- Sequence<Unicode.Scalar>
- Collection<Unicode.Scalar>
- AsyncSequence where Element == Unicode.Scalar
extension Sequence<Unicode.Scalar> {
public func isNormalized(
_ form: Unicode.CanonicalNormalizationForm
) -> Bool
public func isNormalized(
_ form: Unicode.CompatibilityNormalizationForm
) -> Bool
}
Of note, we offer a test for compatibility normalisation on Character even though it does not have a .normalized() function for compatibility forms. Also, there is a unique implementation for Collection which can be more efficient than the one for single-pass sequences.
The results of these functions are definite, with no false positives or false negatives.
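For example, a library might use this check to avoid re-normalising input that is already in its preferred form (a sketch using the proposed APIs):
func canonicalKey(_ key: String) -> String {
  if key.isNormalized(.nfc) {
    return key                 // already NFC; avoid allocating a copy
  }
  return key.normalized(.nfc)
}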
Add common protocol conformances to String views
String's default comparison and equivalence operations ensure applications handle Unicode text correctly. Once we add normalisation APIs, developers will be able to take manual control over how these semantics are implemented - for instance, by ensuring all data in a Heap is normalised to the same form for efficient comparisons.
However, String does not always know when its contents are normalised. Instead, a developer who is maintaining this invariant themselves should be able to easily opt in to code-unit or scalar level comparison semantics.
struct NormalizedStringHeap {
// [!] Requires String.UTF8View: Comparable
var heap: Heap<String.UTF8View> = ...
mutating func insert(_ element: String) {
// .insert performs O(log(count)) comparisons.
//
// Now they are guaranteed to be simple binary comparisons,
// with no allocations or table lookups,
// while having the same semantics as String.<
heap.insert(element.normalized(.preferredForm).utf8)
}
}
We propose adding the following conformances to String's UTF8View, UTF16View, and UnicodeScalarView:
- Equatable. Semantics: Exact code-unit/scalar match.
- Hashable. Semantics: Must match Equatable.
- Comparable. Semantics: Lexicographical comparison of code-units/scalars.
These conformances will likely also be useful for embedded applications, where String itself may lack them.
Creating a String or Character from Scalars
This is a straightforward gap in String's API.
extension String {
public init(_: some Sequence<Unicode.Scalar>)
}
extension Character {
/// Returns `nil` if more than one extended grapheme cluster
/// is present.
public init?(_: some Sequence<Unicode.Scalar>)
}
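A brief sketch of how these initialisers might be used:
let scalars: [Unicode.Scalar] = ["e", "\u{0301}"]
let string = String(scalars)        // canonically equivalent to "é"
let character = Character(scalars)  // non-nil: the scalars form one grapheme cluster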
Acknowledgments
Alejandro Alonso (@Alejandro) originally implemented normalisation in the standard library. The proposed interfaces build on his work.