String’s ABI and UTF-8

Michael_Ilseman · November 5, 2018, 11:17pm

We just landed String’s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents. NSStrings are still lazily bridged in to String without copying.

This does not immediately surface in the API, but allows for some important performance wins and gives a more consistent basis for future APIs providing efficient and direct access to the underlying code units. UTF-8 is a variable-width Unicode encoding built from one-byte code units, and is the preferred encoding for interoperating with C, systems programming, server-side programming, scripting, client-side programming, and tools that process source code and textual formats.
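To make the code-unit terminology concrete, here is a small sketch (not part of the original post; the sample characters are arbitrary) showing that a UTF-8 code unit is always one byte while a single Unicode scalar takes between one and four of them:

```swift
// Each entry is a single Unicode scalar: a, é, 你, 🐶.
let samples = ["a", "\u{E9}", "\u{4F60}", "\u{1F436}"]

// Number of one-byte UTF-8 code units needed for each scalar.
let widths = samples.map { $0.utf8.count }

for (sample, width) in zip(samples, widths) {
    print("\(sample): \(width) UTF-8 code unit(s), \(sample.utf16.count) UTF-16 code unit(s)")
}
```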

Performance

Unifying the storage representation for ASCII and Unicode-rich strings gives us a lot of performance wins. These wins are an effect of several compounding factors including a simpler model with less branching, on-creation encoding validation of native Strings (enabled by a faster validator), a unified implementation code path, a more efficient allocation and use of various bits in the struct, etc.

C Interoperability

By maintaining nul-termination in our storage, interoperability with C is basically free: we just use our pointer. This means that myString.withCString { ... } no longer needs to allocate, transcode, and later free its contents in order to supply the closure with a C-compatible string.

Quantifying this improvement as an nx faster ratio: it’s either millions of times faster or error: division by zero times faster, depending on how you measure.
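A minimal usage sketch of `withCString`, assuming Foundation is imported only to get `strlen`. The string and variable names are illustrative; the point is that the closure sees nul-terminated UTF-8, so C functions measure bytes rather than characters:

```swift
import Foundation   // only for strlen

let greeting = "h\u{E9}llo"   // "héllo": 5 characters, 6 UTF-8 bytes

// The pointer is nul-terminated UTF-8 and is only valid for the duration
// of the call, so it must not escape the closure.
let byteLength = greeting.withCString { cString in
    strlen(cString)
}

print(greeting.count, byteLength)
```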

Decoding

Walking over and decoding the Unicode scalar values that comprise a string is much more efficient now.

Strings of Chinese characters are traditionally a worst-case scenario for UTF-8 decoding performance relative to UTF-16, as UTF-8 resorts to a multi-byte encoding sequence while UTF-16 just stores the scalar value directly as a code unit. This is even worse in reverse, because a continuation-byte in UTF-8 does not communicate the distance to the start of the scalar.

But, this isn’t really an issue: on modern CPUs this increase in encoding complexity is more than offset by the decrease in model complexity by having a unified storage representation.

Walking the Unicode scalar values forwards on Chinese text is now over 3x faster than before and walking in reverse (harder) is now over 2x faster. ASCII benefits even more, despite the old model having a dedicated storage representation and fast paths for ASCII-only strings.
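To illustrate why reverse decoding is the harder direction, here is a rough sketch of the backwards scan. It is not the standard library's decoder; `isContinuation` and `scalarStart` are hypothetical helpers, and the input is assumed to be valid UTF-8:

```swift
let text = "\u{4F60}\u{597D}"   // "你好": two scalars, three UTF-8 bytes each
let bytes = Array(text.utf8)

// Continuation bytes have the bit pattern 10xxxxxx.
func isContinuation(_ byte: UInt8) -> Bool {
    return (byte & 0xC0) == 0x80
}

// Given the offset just past a scalar, scan backwards to find where it starts.
// A continuation byte says nothing about how far away the lead byte is,
// so the only option is to step back one byte at a time.
func scalarStart(endingAt end: Int, in bytes: [UInt8]) -> Int {
    var start = end - 1
    while start > 0 && isContinuation(bytes[start]) {
        start -= 1
    }
    return start
}

let lastStart = scalarStart(endingAt: bytes.count, in: bytes)
let firstStart = scalarStart(endingAt: lastStart, in: bytes)

// In everyday code the unicodeScalars view does this work in either direction.
let forward = Array(text.unicodeScalars)
let backward = Array(text.unicodeScalars.reversed())
```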

Small UTF-8 Strings

Swift 4.2 introduced a small-string representation on 64-bit platforms for strings of up to 15 ASCII code units in length that stores the value directly in the String struct without requiring an allocation or memory management. With a unified code path that supports UTF-8, we’re able to enhance small strings to support up to 15 UTF-8 code units in length. This means your most important non-ASCII strings, such as "smol 🐶! 😍", can in fact be smol.

We also added small-string support on 32-bit platforms, where we pack in strings of up to 10 UTF-8 code units directly into the String struct.
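A quick way to check whether a string is short enough for the small form is to count its UTF-8 code units. This is only a length check under the capacities quoted above; whether a particular String actually uses the small representation is an implementation detail that no public API exposes:

```swift
let smol = "smol 🐶! 😍"
let utf8Length = smol.utf8.count   // code units, not Characters

// Capacities quoted in the post: 15 code units on 64-bit, 10 on 32-bit.
let smallCapacity = MemoryLayout<Int>.size == 8 ? 15 : 10
let shortEnough = utf8Length <= smallCapacity

print(smol.count, utf8Length, shortEnough)
```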

Miscellaneous

Operations over the UTF-8 view are (obviously) dramatically faster on native Swift strings: ~10x depending on the nature of the operation.

Character-based String modifications, such as String.insert(_:Character) are around 5-10x faster.

Improved normality checking makes String hashing 2-4x faster when the contents are already in NFC (which is the case most of the time).
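For context on why normalization matters for hashing: String equality and hashing are based on canonical equivalence, so strings with different scalars can still compare and hash as equal. A small sketch with illustrative values:

```swift
let precomposed = "caf\u{E9}"    // é as a single scalar (NFC)
let decomposed = "cafe\u{301}"   // e followed by a combining acute accent (NFD)

// The underlying scalars differ...
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// ...but comparison and hashing treat the two strings as the same value.
let equal = precomposed == decomposed
let sameHash = precomposed.hashValue == decomposed.hashValue

print(sameScalars, equal, sameHash)
```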

Creating a String from UTF-8 contents à la String(decoding: codeUnits, as: UTF8.self) is around 5-6x faster.
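A short usage sketch of that initializer, with made-up byte values. Note that it repairs invalid input rather than failing:

```swift
let codeUnits: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]   // "café" in UTF-8
let decoded = String(decoding: codeUnits, as: UTF8.self)

// 0xFF can never appear in well-formed UTF-8, so it is replaced
// with U+FFFD REPLACEMENT CHARACTER instead of being rejected.
let repaired = String(decoding: [0x61, 0xFF, 0x62] as [UInt8], as: UTF8.self)

print(decoded, repaired)
```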

Efficient Cocoa Interoperability

Efficient interoperability with Cocoa is a huge selling point for Swift, and strings are lazily bridged to Objective-C. String’s storage class is a subclass of NSString at runtime, and thus has to answer APIs assuming constant-time access to UTF-16 code units. We solved this with a breadcrumbing strategy: upon first request from one of these APIs on large strings, we perform a fast scan of the contents to check the UTF-16 length, leaving behind breadcrumbs at regular intervals. This allows us to provide amortized constant-time access to transcoded UTF-16 contents by scanning between breadcrumbs.

This is leveraged by String.UTF16View, so Swift code that imports Foundation and assumes constant-time access to the view also benefits.

We’ll be tweaking and tuning the granularity of these breadcrumbs and improving the scanning time, but this strategy has been proving sufficient for maintaining performance in realistic use cases.
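The breadcrumb idea can be sketched as a toy model. This is not the standard library's implementation: the type, its members, and the default stride are hypothetical, and it records `String.Index` positions through the public UTF-16 view rather than scanning raw UTF-8 storage:

```swift
// Toy model of breadcrumbing; all names here are hypothetical.
struct UTF16Breadcrumbs {
    let string: String
    let stride: Int
    let utf16Count: Int
    // crumbs[k] is the position of UTF-16 offset k * stride.
    let crumbs: [String.Index]

    // One linear scan computes the UTF-16 length and drops a crumb
    // every `stride` code units.
    init(_ string: String, stride: Int = 32) {
        precondition(stride > 0, "stride must be positive")
        let utf16 = string.utf16
        var crumbs: [String.Index] = []
        var offset = 0
        var index = utf16.startIndex
        while index != utf16.endIndex {
            if offset % stride == 0 { crumbs.append(index) }
            index = utf16.index(after: index)
            offset += 1
        }
        self.string = string
        self.stride = stride
        self.utf16Count = offset
        self.crumbs = crumbs
    }

    // Jump to the nearest crumb at or before `offset`, then walk
    // at most stride - 1 code units from there.
    func codeUnit(at offset: Int) -> UInt16 {
        precondition(offset >= 0 && offset < utf16Count, "offset out of range")
        let utf16 = string.utf16
        let start = crumbs[offset / stride]
        let index = utf16.index(start, offsetBy: offset % stride)
        return utf16[index]
    }
}

let sample = String(repeating: "a\u{1F436}", count: 50)   // 3 UTF-16 code units per repetition
let breadcrumbs = UTF16Breadcrumbs(sample, stride: 16)
let expected = Array(sample.utf16)
let allMatch = expected.indices.allSatisfy { breadcrumbs.codeUnit(at: $0) == expected[$0] }
```

Because the walk after each jump is bounded by the stride, a lookup costs a constant amount of work once the initial scan has been paid for, which is the trade-off between crumb granularity and scanning time mentioned above.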

For performance improvements in Cocoa interoperability, we’re working on some sweet bridging optimizations (simpler on a unified storage representation), but it’s too early to report back findings. We expect wins here to be far more important than a higher constant-factor on UTF-16 access.

Current Microbenchmark Issues

We landed with some known microbenchmark regressions that we knew we could fix with some elbow grease. We’re now applying elbow grease. Since this is such a substantial model change, it is far more important from a risk-management perspective to land this now to expose any unknown issues. Even so, net performance is substantially better.

We also have known gaps in our String benchmarking. We will be closing those gaps and addressing any issues they expose.

Code Size

We haven’t started to tweak and tune code size, but this change already carries in some nice wins. A simpler model means less code and less reliance on heroic inlining for performance.

The stdlib binary is around 13% smaller with this change, which is a big win for Swift 5.0 applications that will back-deploy to pre-Swift-5 OSes. This also reduces memory usage and provides other system-wide benefits for post-Swift-5 OSes. The Foundation overlay is also around 5% smaller, as are others.

The source compatibility suite saw modest improvements, with an overall 2-3% shrinkage in total binary size. As I said, we haven’t started to tweak and tune, so this may improve more.

The Future of String Performance

Internal Improvements

We have many ideas for further performance enhancements to the internal implementation of String, such as:

Check for (or even guaranteeing) NFC-normalized contents upon creation, making canonical-equivalence comparison super fast
Cache more information on the storage class’s subsequent tail allocations, such as grapheme count and hash value
Perform fast single-scalar-grapheme checks and set relevant performance flags
Vectorize all the things, especially small strings

Low Level APIs

The most exciting aspect of the future of String performance is exposing low-level performant APIs. The unified storage representation allows us to expose low-level APIs on String that directly access the underlying storage. Previously, we’d have to expose a pair of each, one for ASCII storage and one for UTF-16 storage, and hope the developer remembers to test both paths. Now, we can expose something akin to the following (details/spellings for demonstration purposes only):

myString.withCodeUnits { codeUnitBuffer in 
  // Access the contents as a contiguous buffer of `UInt8`
  // Awesome synergy with the character literals pitch
  ...
}

let str = String(withInitialCapacity: 42) { contentsPtr in 
  // Initialize the string directly
  ...

  // Return the actual size we wrote in UTF-8 code units
  return actualSize

  // (UTF-8 validation is performed by String after closure is finished)
}

Of course, we need to figure out a strategy for communicating whether some existing String is native or a lazily-bridged NSString that does not provide contiguous UTF-8 contents. There are approaches with various tradeoffs: do the eager bridge, make everything optional, throw, trap, etc. Figuring this out will be the most important part of designing these APIs.
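For readers on a toolchain newer than the one discussed in this thread, two of these tradeoffs can be tried with standard library APIs that, to the best of my knowledge, shipped in later releases (Swift 5.1 or newer). They are not the spellings sketched above, and this is only an illustration of the "eager" and "optional" approaches:

```swift
var message = "caf\u{E9} \u{1F436}"   // 10 UTF-8 code units

// "Do the eager bridge": withUTF8 is mutating because it may first copy the
// contents into contiguous native UTF-8 storage before calling the closure.
let byteCount = message.withUTF8 { buffer in
    buffer.count
}

// "Make everything optional": the result is nil when contiguous UTF-8
// storage is not available, and no copy is made to obtain it.
let fastCount: Int? = message.utf8.withContiguousStorageIfAvailable { buffer in
    buffer.count
}

print(byteCount, fastCount as Any)
```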

Shared Strings

The branch also introduces support in the ABI (but currently not exposed in any APIs) for shared strings, which provide contiguous UTF-8 code units through some externally-managed storage. These enable future APIs allowing developers to create a String with shared storage from a [UInt8], Data, ByteBuffer, or Substring without actually copying the contents. Reads would be slightly slower, as they would require an extra level of pointer indirection, but avoiding the copy could be a big win depending on the situation.

How You Can Help

While we are attacking our known-unknowns (regressions and gaps in the benchmark suite), we would really like to get early feedback on the new String ABI. If you encounter any issues or performance regressions, please let us know.

Toolchains are available on swift.org; try out a "Trunk Development (master)" toolchain.

Huge thanks to @lorentey, @lancep, @johannesweiss, @David_Smith, @Erik_Eckstein, and @scanon for helping make this happen!

edit: Explicitly mentioned that NSStrings are still lazily-bridged in without copy.

tanner0101 · November 5, 2018, 11:50pm

Amazing work on this. I'm super excited to check it out as soon as the toolchains are ready.

Thanks @Michael_Ilseman et al!

ddunbar · November 6, 2018, 12:22am

Thanks @Michael_Ilseman for your amazing work to make this happen!!

taylorswift · November 6, 2018, 12:37am

Any movement on this? Create 0233-codepoint-and-character-literals.md by tayloraswift · Pull Request #939 · swiftlang/swift-evolution · GitHub

Michael_Ilseman · November 6, 2018, 12:41am

Off topic for this thread. Has it entered review? If you want it scheduled for review, ping a core team member.

Chris_Lattner3 · November 6, 2018, 12:45am

Great work @Michael_Ilseman, congratulations on landing this!

compnerd · November 6, 2018, 2:37am

Congratulations on getting this merged @Michael_Ilseman! It was quite an achievement.

anandabits · November 6, 2018, 2:45am

Very impressive! It’s exciting to see this land. Thanks for your hard work @Michael_Ilseman!

duan · November 6, 2018, 5:44am

Frickin amazing. Congratulations Michael.

I look forward to diving into past projects where I used UTF16View for performance and fixing them :)

xavier.lowmiller · November 6, 2018, 6:39am

Really looking forward to this landing in Swift 5!

Will this also benefit regex performance? Swift has some room for improvement in that area: https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/swift.html (see regex-redux)

SDGGiesbrecht · November 6, 2018, 9:00am

Most of this sounds awesome. I do have one concern though:

What does “guaranteeing” mean? Are you thinking of force‐normalizing every String to NFC?!?

Please don’t. I depend on a wrapper structure that enforces NFD in order to do reliable and efficient searches for combining scalars.

import Foundation

let string = "café"
let acute: Unicode.Scalar = "\u{301}"

let nfd = string.decomposedStringWithCanonicalMapping.unicodeScalars
print(nfd.firstIndex(of: acute))
// 4th scalar (→ part of 3rd cluster)
// (✓: The expected answer.)

let nfc = string.precomposedStringWithCanonicalMapping.unicodeScalars
print(nfc.firstIndex(of: acute))
// nil
// (✗: Illogical side effect of legacy encoding designs.)

If “guarantee” means String becomes locked to NFC, then this sort of thing will become impossible, and I will have to write my own String type from scratch.

If “guarantee” means String still allows NFD, but is guaranteed to start in NFC when first initialized, then things will still work but it will be inefficient, because composition and decomposition will happen back and forth unnecessarily.

IanPartridge · November 6, 2018, 10:51am

Congratulations on landing this, @Michael_Ilseman - it's an amazing achievement. I am keen to bench various Kitura workloads against these changes. Do you have a Linux toolchain available or should I wait for one to appear on swift.org? Thanks.

dimpiax · November 6, 2018, 12:32pm

Great improvement! Thanks for your work!

Michael_Ilseman · November 6, 2018, 4:40pm

No, this will not be a 100%-case guarantee. ABI stability (and lazily bridged NSString on Darwin platforms) dictates that String will always need to handle non-NFC-normalized contents. String has a performance flag that can be set when the contents are known to be NFC. If the bit is not set, the standard library has to treat the contents as non-NFC. ABI stability means that all future versions of the standard library have to handle strings with and without that bit set.

The performance goal is to find more ways to set that bit when we want performant comparison, but your example code won't break. There will always be ways of creating a String with a "leave my bits alone!" option. Whether that is the default, an explicit option, new or repurposed initializers, or it varies by use case, it's totally open to future change.

edit: grammar

SDGGiesbrecht · November 6, 2018, 4:44pm

Thanks for answering. Sounds good.

dabrahams · November 11, 2018, 1:29am

This is really exciting, but can you explain where the speedups come from? I wrote both the decoders, and they haven't changed AFAICS. The inherent complexity of decoding UTF-8 always made it measurably slower than decoding ASCII or UTF16. The cost of branching on the representation doesn't seem like it should show up in a test of the speed of walking scalar values on Chinese text.

Also, with regard to NSString bridging, is it lazy even for strings that could be represented in 15 UTF-8 code units?

Michael_Ilseman · November 12, 2018, 1:20am

Yes, they are still unchanged.

As I mentioned above, this is the result of many small but compounding simplifications to the model, whose net effect is greater than the cost of decoding a scalar from UTF-8 on modern CPUs. Some, but definitely not all, of them could have been retrofitted on the old model, but they would not have had the same degree of payoff and/or introduced even more performance issues and complexity.

A big one is simply that what were two different backing storage representations, with their behavior inlined into user code, are now just one. Under a 2-word String like we have in Swift 4.1 and later, String APIs handling lazily-bridged Cocoa strings (in general) will involve uninlinable function calls to compute their result. With UTF-8 native Strings, we no longer have two paths inlined into call sites with out-of-line calls to facilitate Cocoa strings. We just have the UTF-8 path in-line and one call to out-of-line support for (non-ASCII) Cocoa strings, since they cannot be fully inlined anyway.

These effects are greater than the cost of just a branch on read and also allow for more optimizable code (no heroic guessing at intractable problems).

The strategy is unchanged from 4.2 where only ASCII small strings exist: only tagged-pointer NSStrings are eagerly bridged.

Any arbitrary subclass of NSString can carry important associated information with it besides its contents (e.g. localized strings), so eagerly bridging them in would lose that information. Also, there would be performance issues with creating an object when ping-ponging between ObjC and Swift. But, tagged pointer NSStrings have the advantage of just being values indistinguishable from their contents, and there are no object allocations when ping-ponging between ObjC and Swift.

Karl · November 12, 2018, 11:48am

Wow, this is amazing. Much better than I ever hoped for.

Switching to UTF8 is great; it's the typical transmission and persistence encoding. Avoiding those transcoding overheads and awkward index translations will be great, and makes it more practical to read files and network pipes directly in to String buffers or to have them share storage.
The shared string stuff sounds like magic. I love it. I remember the pitch that Swift was aiming to be as fast (if not faster) than C - and, IMO, an important part of that is to make it as close to zero-cost as possible to layer a convenient, unicode-safe text API on top of some memory. I think this is going to let us do that.
Creating a String which shares storage with a Substring sounds interesting, since Substring is itself a kind-of workaround to have 2 Strings which share storage. What if slicing a String produced a borrowed String? Would mean we don't have to make everything generic on StringProtocol.
I'm interested in how much overhead we need to pay for this "breadcrumb" system. Even if it is created lazily, NSString doesn't have an ObjC API which doesn't assume UTF16 code units, so even trivial user code which could be made encoding-agnostic with a more opaque indexing/range API will still face an impedance mismatch. ObjC Foundation is under Apple's control - so would you consider adding that?
I have one small nit before String's API/ABI is finalised: we need a mutating version of this function which returns the number of matches:
```
public func replacingOccurrences<Target, Replacement>(of target: Target, with replacement: Replacement, options: String.CompareOptions = default, range searchRange: Range<Self.Index>? = default) -> String
  where Target : StringProtocol, Replacement : StringProtocol
```
Currently you have to go through NSMutableString and back, and I'm guessing that will get more costly if transcoding is involved when bridging. It looks like a simple oversight.
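A rough sketch of what such a helper could look like without going through NSMutableString. The method name `replaceOccurrences(of:with:)` is hypothetical and not an existing API; it leans on Foundation's `range(of:)` for the search and ignores the options and range parameters of the original:

```swift
import Foundation   // for range(of:) on StringProtocol

extension String {
    // Hypothetical helper, not an existing API: replaces every occurrence of
    // `target` in place and returns how many replacements were made.
    @discardableResult
    mutating func replaceOccurrences(of target: String, with replacement: String) -> Int {
        guard !target.isEmpty else { return 0 }
        var result = ""
        var remaining = self[...]   // Substring over the current contents
        var count = 0
        while let match = remaining.range(of: target) {
            // Copy everything before the match, then the replacement.
            result.append(contentsOf: remaining[..<match.lowerBound])
            result.append(contentsOf: replacement)
            remaining = remaining[match.upperBound...]
            count += 1
        }
        result.append(contentsOf: remaining)
        self = result
        return count
    }
}

var path = "a-b-c"
let replaced = path.replaceOccurrences(of: "-", with: "+")
print(path, replaced)
```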

benrimmington · November 12, 2018, 3:59pm

Can tagged pointers have associated values, and if so, does the bridging preserve them?

For example, some of the accessibility APIs can be used with any object, including NSString instances.

import Foundation
import ObjectiveC
import UIKit

// Create a tagged pointer.
let str = NSString(string: "str")
object_getClass(str) //-> __C.NSTaggedPointerString.Type

// Insert an associated value.
str.accessibilityHint = "hint"
assert(str.accessibilityHint == "hint")

// Remove all associated values.
objc_removeAssociatedObjects(str)
assert(str.accessibilityHint == nil)

Michael_Ilseman · November 16, 2018, 6:01pm

Currently, we could (not necessarily arguing we should) add an API like:

mySubstring.withSharedString { (str: String) in
  // ... call some API that requires String
  // ... I don't care about persisting the entire string if it's copied
}

You would pay for an object allocation (and ARC) for shared storage, but avoid paying for copying the contents themselves, in exchange for the shared storage potentially persisting the whole string allocation. In general, this potential to persist the whole string is a huge risk, but there are obviously circumstances where it is worth it or the risk is minimal.

I think where ownership could come in would be trying to avoid the cost of the object allocation and ARC on the shared storage String. For example, if you could reason about the lifetime or escapability, then the shared-string object could be stack-allocated (and effectively-immortal) for the closure. Or, you could imagine something analogous to withoutActuallyEscaping that would stack allocate and trap on escape.

Since we probably won't have anything like lifetime variables in our type system, I don't think we'd have the means to change slicing to produce a borrowed String. We can of course add subscript overloads that have different behavior, e.g. myString = str[copying: a..<b], myString = str[sharing: a..<b], etc.
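Of those two overloads, only the copying one can be written with today's public API. A minimal sketch, with the `copying:` label taken from the example spelling above (it is hypothetical, not an existing subscript):

```swift
extension String {
    // Hypothetical overload: converting a Substring to a String copies the
    // slice into fresh storage, so the result does not keep the original
    // allocation alive.
    subscript(copying range: Range<String.Index>) -> String {
        return String(self[range])
    }
}

let str = "Hello, world"
let a = str.startIndex
let b = str.index(a, offsetBy: 5)
let copied = str[copying: a..<b]
print(copied)
```

A `sharing:` counterpart would need the shared-storage support described earlier in the thread, which is not exposed through any API.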

Sorry, I got lost in the double negatives and I'm not sure what the question is. Yes, NSString APIs are heavy on assumed random-access to UTF-16 contents. For strings of a sufficient size that are bridged out and then operated on with these APIs, they will have to populate these breadcrumbs. Note that NSString does have properties such as utf8String and other ways of asking for contents in their current encoding. @David_Smith can speak further on those details and how String can answer these.

Beyond that, I can't really speculate on future APIs outside of the standard library.

Karl:

I have one small nit before String's API/ABI is finalised: we need a mutating version of this function which returns the number of matches:
public func replacingOccurrences<Target, Replacement>(of target: Target, with replacement: Replacement, options: String.CompareOptions = default, range searchRange: Range<Self.Index>? = default) -> String
  where Target : StringProtocol, Replacement : StringProtocol
Currently you have to go through NSMutableString and back, and I'm guessing that will get more costly if transcoding is involved when bridging. It looks like a simple oversight.

Right, this is from the Foundation overlay and operates on NSString (on Darwin platforms). I think it's clear that the standard library should provide simple find/replace operations on BidirectionalCollection and subsume this functionality (this will be part of an ergonomics push after ABI stability settles a bit).

As for transcoding overhead, that would only arise when a native non-ASCII String becomes an NSMutableString. Since the vast majority of contents for native Swift strings start out as UTF-8, you could view this as deferring the transcoding until absolutely necessary. The situations where this is necessary are becoming rarer over time.