Use string counts to short-circuit string comparisons

Michael_Ilseman · November 30, 2021, 4:57pm

This is a broad topic, and there are a lot of different angles and aspects in the space as well as in this thread, so I'll focus on improving the performance of equality checks between often-compared native Swift strings of sufficient length (at least 16 UTF-8 code units; anything shorter would be a small-form string). See Piercing the String Veil. These would not need to have the isNFC flag set, which @David_Smith's optimization requires.

If we care about this performance case, then we should cache the NFC-normalized code unit count. If the string is in NFC, this is equivalent to the code unit count, so no caching is needed when isNFC is set. Often, this cache will be set by hashing, which has to consume the entire string in NFC anyways (cc @lorentey). Strings stored in dictionaries and sets will typically be hashed and compared for equality, potentially multiple times, so this would save already-computed information for future use. Note that comparison aborts as soon as the inputs diverge, so comparison would only be able to set the cache if it has inspected the entire string.
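As a rough illustration of the short-circuit (not how the stdlib would implement it), here is a user-level sketch. `CountedString` and `nfcUTF8Count` are hypothetical names, the count is computed eagerly at init rather than lazily during hashing, and NFC is obtained through Foundation's `precomposedStringWithCanonicalMapping` instead of the stdlib's native normalization:

```swift
import Foundation

/// Hypothetical wrapper that remembers the NFC-normalized UTF-8 count.
/// A real implementation would keep this in the string's storage object
/// and fill it in lazily (for example, while hashing).
struct CountedString: Equatable {
    let value: String
    let nfcUTF8Count: Int

    init(_ value: String) {
        self.value = value
        // Foundation's NFC form; stands in for the stdlib's internal normalizer.
        self.nfcUTF8Count = value.precomposedStringWithCanonicalMapping.utf8.count
    }

    static func == (lhs: CountedString, rhs: CountedString) -> Bool {
        // Canonically equivalent strings share one NFC form, so a
        // differing NFC count proves inequality without reading contents.
        guard lhs.nfcUTF8Count == rhs.nfcUTF8Count else { return false }
        // Same count: fall back to the ordinary comparison.
        return lhs.value == rhs.value
    }
}

// Precomposed vs. decomposed spellings of the same text.
let precomposed = CountedString("caf\u{E9} au lait, s'il vous pla\u{EE}t")
let decomposed = CountedString("cafe\u{301} au lait, s'il vous plai\u{302}t")
let other = CountedString("a completely different and longer string")
```

The two spellings have different raw UTF-8 counts but the same NFC count, which is why the cached value has to be the normalized count and not just `utf8.count`.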

Since we do normalization natively, it should be much easier to count things, remember whether the input was already in NFC, etc. (cc @Alejandro). Another thing that can be set is the storage class's copy of the isNFC flag if we traverse the entire string and learn that it is in fact in NFC. This is a very common scenario, because strings tend to be put in NFC for compactness anyways (which is ultimately why the stdlib uses NFC instead of the simpler NFD).
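To make the "storage class's copy" idea concrete, here is a minimal sketch under stated assumptions: `SketchStorage` and `SketchString` are invented stand-ins for the native storage object and `String`, Foundation's `precomposedStringWithCanonicalMapping` stands in for native normalization, and nothing here is thread-safe (a real version would presumably need atomic updates):

```swift
import Foundation

/// Hypothetical stand-in for the heap-allocated storage behind a native string.
final class SketchStorage {
    let contents: String
    // Both fields are filled in the first time something walks the whole string.
    var cachedNFCCount: Int? = nil
    var isKnownNFC = false

    init(_ contents: String) { self.contents = contents }
}

/// Hypothetical stand-in for the String struct.
struct SketchString {
    let storage: SketchStorage

    init(_ s: String) { storage = SketchStorage(s) }

    /// Non-mutating, yet it can record what it learned, because the
    /// cache lives in the class instance rather than in the struct.
    func nfcUTF8Count() -> Int {
        if let cached = storage.cachedNFCCount { return cached }
        let nfc = storage.contents.precomposedStringWithCanonicalMapping
        let count = nfc.utf8.count
        storage.cachedNFCCount = count
        // If normalizing changed no code units, the input was already in NFC.
        storage.isKnownNFC = nfc.utf8.elementsEqual(storage.contents.utf8)
        return count
    }
}

let decomposedSketch = SketchString("cafe\u{301} is not stored in NFC here")
let asciiSketch = SketchString("plain ASCII text is always NFC")
```

Calling `nfcUTF8Count()` once on each value fills the cache; for the ASCII string it also flips `isKnownNFC`, which is the "learn that it is in fact in NFC" case.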

We can't update the struct String's isNFC bit, except inside mutating methods, so we'd add a check for the class's copy of the isNFC bit. With native normalization in place, we can also add something like:

extension String {
  public mutating func canonicalize()
}

That would eagerly bridge the string (if applicable), convert it to NFC, and analyze the content to set all of the corresponding fast bits like isASCII and isNFC, as well as a hasSingleScalarGraphemeClusters bit for when the character view is equivalent to the scalar view, etc. (It should check those properties after normalization, because Unicode.) This would be a superset of what's done inside makeContiguousUTF8, but unlike makeContiguousUTF8 it wouldn't preserve the contents of the string when viewed through the scalar or code unit views, because it normalizes.
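A user-level approximation of that behavior, purely as a sketch: `canonicalizeSketch()` and `CanonicalFacts` are hypothetical names, the real flags are internal and can't be set from outside the stdlib (so the facts are returned instead), and NFC again comes from Foundation's `precomposedStringWithCanonicalMapping`:

```swift
import Foundation

/// Hypothetical record of the "fast bits" a canonicalize() pass could set.
struct CanonicalFacts: Equatable {
    var isASCII: Bool
    var isNFC: Bool
    var hasSingleScalarGraphemeClusters: Bool
}

extension String {
    /// Sketch only: normalizes in place and reports what was learned.
    @discardableResult
    mutating func canonicalizeSketch() -> CanonicalFacts {
        // Convert to NFC. This changes the scalar and code unit views.
        self = self.precomposedStringWithCanonicalMapping
        // Ensure native, contiguous UTF-8 storage (bridging if needed).
        self.makeContiguousUTF8()
        // Analyze after normalization, since NFC can change the answers.
        return CanonicalFacts(
            isASCII: utf8.allSatisfy { $0 < 0x80 },
            isNFC: true,
            hasSingleScalarGraphemeClusters: count == unicodeScalars.count
        )
    }
}

var word = "cafe\u{301}"                      // "e" + combining acute accent
let scalarsBefore = word.unicodeScalars.count // 5 scalars, 4 characters
let facts = word.canonicalizeSketch()
let scalarsAfter = word.unicodeScalars.count  // 4 scalars after composing
```

Before normalization the character and scalar views disagree (4 characters, 5 scalars); afterwards they line up, which is why the properties are checked after the NFC step.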