Piercing the String Veil

Michael_Ilseman · March 17, 2019, 6:04am

Hi all, I would like to provide some background information about what is available in String’s ABI in Swift 5, and thoughts on what's possible to surface to performance-minded users.

String’s ABI encodes information for performant processing and future flexibility, allowing the standard library to run faster and produce smaller and more flexible code. But this information is not available as API for low-level programming, as it’s hidden behind the abstraction of the String type. With this information formally part of String’s ABI, we should lift the veil of the String type and start exposing these as API.

An opaque string is capable of handling all string operations through resilient function calls. Nothing about its implementation is exposed outside the standard library. Currently, opaque strings are used for lazily-bridged NSStrings that do not provide access to contiguous UTF-8 (i.e. ASCII) bytes in memory. This can be expanded in the future to support any other string forms we may want or need to add.

All other strings are contiguous, that is, they can provide a pointer and length to valid UTF-8 bytes in memory.

An indirect string has declared in the ABI that it is capable of providing a pointer to validly-encoded UTF-8 bytes in memory through a resilient function. Currently, indirect strings are used for NSStrings that can provide access to contiguous UTF-8 in memory. Since the means by which they get a UTF-8 pointer is behind a resilient function call, they can also be used to support shared strings, that is strings which share their contents with some other data structure (e.g. Array<UInt8>). Accessing the content of an indirect string requires an extra level of indirection (hence the name) to get the pointer, but after that, they hit all the fast-paths in String’s implementation.

Other contiguous strings are immediate, which means they are in Swift’s preferred form for the most efficient and direct access.

A small string packs its contents (up to 15 UTF-8 code units in length) directly in the String struct, without the need for a separate allocation.

A large string holds a reference to a class instance with tail-allocated contents at a statically known offset from the instance’s address. This means there is no additional indirection on top of the reference itself.
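To make the hierarchy above easier to hold in mind, here is a minimal sketch of the four forms as a plain enum. The type and its property names are hypothetical and exist only for illustration; the standard library does not expose anything like this.

```swift
// Hypothetical model of the forms described above; NOT a stdlib type.
enum StringForm {
    case opaque     // every operation goes through resilient calls
    case indirect   // contiguous; pointer obtained through a resilient call
    case small      // immediate; contents packed into the String struct
    case large      // immediate; tail-allocated contents at a known offset

    /// Opaque is the only form that cannot hand out pointer + length.
    var isContiguous: Bool { self != .opaque }

    /// Immediate forms are the contiguous ones with no extra hop.
    var isImmediate: Bool { self == .small || self == .large }
}
```

Read this way, "contiguous" splits into indirect and immediate, and "immediate" splits into small and large.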

Contiguity

The most significant distinction, both for the standard library’s implementation of String’s API and performance-minded developers, is whether a String can provide contiguous-UTF-8 contents and how to get the pointer. SE-0247, currently under review, aims to deliver this. It lifts the veil one layer to expose contiguity.
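As a rough illustration of what exposing contiguity looks like from the caller's side, the sketch below assumes the API from SE-0247 in the form that later shipped in Swift 5.1 (isContiguousUTF8, makeContiguousUTF8() and withUTF8), together with withContiguousStorageIfAvailable on the UTF-8 view. The example string is arbitrary.

```swift
var greeting = "Hello, w\u{F6}rld"   // 13 UTF-8 code units

// Borrow the bytes only if they are already contiguous; the result is nil
// for strings that cannot provide contiguous UTF-8.
let borrowedCount = greeting.utf8.withContiguousStorageIfAvailable { buffer in
    buffer.count
}

// Pay for the copy into contiguous storage once, up front, if needed.
if !greeting.isContiguousUTF8 {
    greeting.makeContiguousUTF8()
}

// `withUTF8` is mutating because it may have to make the string contiguous.
let nonASCIIBytes = greeting.withUTF8 { buffer in
    buffer.filter { $0 >= 0x80 }.count
}
```

The optional result of the first call is what lets performance-minded code choose between a pointer-based fast path and a generic fallback.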

Shared storage

The “indirect” contiguous form allows a string to share UTF-8 storage with another type rather than making a copy. Reads pay the cost of an extra level of indirection, but are still contiguous and fast.

This would allow Strings to be created sharing storage with an Array<UInt8>/Data/ByteBuffer/etc. UTF-8 validation should still happen (though we could consider an “unsafe” opt-out as in Rust), and there are some design decisions in how best to surface this.

Substring could have a withSharedString { ... } API, which vends such an indirect string for the purposes of interacting with APIs taking a String, without actually copying out the substring. Care would have to be taken to not overly-extend the lifetime of a much larger allocation. We could also consider an alternative approach similar to withoutActuallyEscaping, where the shared-storage class is stack-allocated and we issue a trap if it escaped.

In a world with more shared storage, we’d want to provide developers answers to the following:

Is immediate — Is this string in the fastest possible representation, i.e. the fewest levels of indirection
Excess storage estimate — How much memory could be freed up by copying into an immediate representation.

as well as an explicit copying method or initializer to force the string into an immediate representation.
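The questions above can be pictured with a toy stand-in for a shared string. Everything in this sketch is hypothetical (the SharedUTF8 type, its initializer and its members); it only models the idea of a validated window into bytes owned by something else, and it is not how String would implement it.

```swift
// Hypothetical stand-in for an indirect string; none of these names exist
// in the standard library.
struct SharedUTF8 {
    let owner: [UInt8]      // the (possibly much larger) allocation kept alive
    let range: Range<Int>   // the part this "string" actually uses

    /// Fails if the selected bytes are not valid UTF-8.
    init?(owner: [UInt8], range: Range<Int>) {
        let slice = owner[range]
        // `String(decoding:as:)` repairs invalid input with U+FFFD, so the
        // round trip reproduces the input only when it was already valid.
        guard String(decoding: slice, as: UTF8.self).utf8.elementsEqual(slice) else {
            return nil
        }
        self.owner = owner
        self.range = range
    }

    /// A shared string is never in the immediate form.
    var isImmediate: Bool { false }

    /// Bytes that could be released by copying out just the used range.
    var excessStorageEstimate: Int { owner.count - range.count }

    /// Explicit copy into a regular (immediate) String.
    func copyToImmediate() -> String {
        String(decoding: owner[range], as: UTF8.self)
    }
}

let request = Array("GET /index.html HTTP/1.1".utf8)
let path = SharedUTF8(owner: request, range: 4..<15)!   // "/index.html"
```

Here path keeps all 24 bytes of request alive in order to use 11 of them, which is the kind of lifetime extension the excess storage estimate is meant to surface.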

Small and large

The standard library has an internal _SmallString struct which is used for string contents of up to 15 UTF-8 code units, as well as ways to query if a string is in the small form. This could be useful to developers in rare situations, e.g. operating directly on the struct rather than spilling to the stack as withUTF8 would do. We would need to discover more use cases to justify its addition.
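Since the small-form query is internal, the closest a caller can get today is checking the size limit itself. The helper below is hypothetical and assumes a 64-bit platform, where a String value occupies 16 bytes; it reports whether contents could fit, not whether a given string is actually stored in the small form.

```swift
// Hypothetical helper, not a stdlib API. 15 is the small-form capacity
// quoted above and is assumed here to apply to 64-bit platforms.
func couldFitInSmallForm(_ string: String) -> Bool {
    string.utf8.count <= 15
}

let short = "caf\u{E9} au lait"               // 13 UTF-8 code units
let long = "a considerably longer string"     // 28 UTF-8 code units

let shortFits = couldFitInSmallForm(short)
let longFits = couldFitInSmallForm(long)

// The String struct itself: 16 bytes on 64-bit platforms.
let stringSize = MemoryLayout<String>.size
```

Note that the limit counts UTF-8 code units, not characters, so non-ASCII text reaches it sooner.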

Performance flags

String remembers performance-relevant information about its contents through the use of performance flags.

For example, a String that is known to be all-ASCII has a trivial UTF8View, UTF16View, and UnicodeScalarView. Also, mapping offsets between the two code unit views is trivial, so there is no need for any bookkeeping as part of Cocoa interop.

Current flags:

isASCII - whether the content is known to be entirely ASCII
- Trivial UTF8View, UTF16View, and UnicodeScalarView
- Grapheme breaking fast-paths (just have to guard for CR-LF)
- Trivial offset mapping between UTF-16 and UTF-8 for Cocoa interop
isNFC - whether the content is known to be in Normalization Form C
- Comparison is just comparing memory as-is
- Hashing is just hashing memory as-is
isNativelyStored - whether the string is immediate, large, and memory managed (i.e. not a literal)
- String is in a form that allows in-place mutation, if uniquely referenced
isTailAllocated - whether the string is immediate and large
- Contents are available at a known fixed-offset from the object’s header

String has room to grow 12 more flags at any point in the future, with the constraint that a value of 0 has to be semantically equivalent to 1, i.e. an unset flag must never change results, only speed. Flags represent opportunities for faster processing, but they’re not needed for correctness.

Flags such as isASCII are a best-effort: if not set it’s still possible that the String is ASCII. For example, a non-ASCII string whose last remaining non-ASCII character was just removed might not have the isASCII bit set, as that would require prohibitively expensive bookkeeping for every mutation.
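The best-effort behavior can be shown with a toy container. The sketch below is purely illustrative: the PerformanceFlags option set mirrors the flag names listed above, but the bit positions are made up, and FlaggedBytes is a hypothetical type that works on raw bytes rather than on a real String.

```swift
// Hypothetical mirror of the flags listed above; bit positions are invented
// for illustration and do not match String's real layout.
struct PerformanceFlags: OptionSet {
    let rawValue: UInt8
    static let isASCII = PerformanceFlags(rawValue: 1 << 0)
    static let isNFC = PerformanceFlags(rawValue: 1 << 1)
    static let isNativelyStored = PerformanceFlags(rawValue: 1 << 2)
    static let isTailAllocated = PerformanceFlags(rawValue: 1 << 3)
}

/// Toy byte container that maintains isASCII only when doing so is cheap.
struct FlaggedBytes {
    private(set) var bytes: [UInt8] = []
    private(set) var flags: PerformanceFlags = [.isASCII]   // empty is ASCII

    mutating func append(_ byte: UInt8) {
        bytes.append(byte)
        // Clearing is O(1): a single non-ASCII byte disproves the flag.
        if byte >= 0x80 { flags.remove(.isASCII) }
    }

    mutating func removeLast() {
        bytes.removeLast()
        // Setting the flag again would need a rescan, so it stays cleared.
    }
}

var sample = FlaggedBytes()
sample.append(0x41)    // "A"
sample.append(0xFF)    // a non-ASCII byte clears the flag
sample.removeLast()    // contents are ASCII again, but the flag is not set
```

After the last call the contents are all ASCII while the flag says "unknown", which is exactly the state a cleared flag is allowed to be in.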

There are a few levels at which we could expose working with these flags:

What do the flags say? e.g. var isKnownNFC: Bool { get } .
Perform a scan to answer the query, using the flags as just a fast-path. E.g. func isNFC(scan: Bool = true) -> Bool.
Perform a scan, updating the flags with the new information gleaned. E.g. mutating func updatePerformanceFlags().
Force the content into a form that has a flag set. E.g. mutating func makeNFC() or mutating func makeCanonical().
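The first three levels can be sketched for the ASCII flag on a small wrapper type. The wrapper and its member names are hypothetical and simply follow the naming pattern of the examples in the list; the fourth level (forcing content into a form) is left out because it would need a normalization routine that this sketch does not assume.

```swift
// Hypothetical wrapper; not existing String API.
struct ASCIITrackedString {
    private(set) var value: String
    // Level 1: what does the flag say? `false` means "unknown", not "no".
    private(set) var isKnownASCII: Bool

    init(_ value: String, isKnownASCII: Bool = false) {
        self.value = value
        self.isKnownASCII = isKnownASCII
    }

    // Level 2: the flag is a fast path; optionally fall back to a scan.
    func isASCII(scan: Bool = true) -> Bool {
        if isKnownASCII { return true }
        return scan && value.utf8.allSatisfy { $0 < 0x80 }
    }

    // Level 3: scan and remember what was learned.
    mutating func updatePerformanceFlags() {
        isKnownASCII = value.utf8.allSatisfy { $0 < 0x80 }
    }
}

var tracked = ASCIITrackedString("plain text")
let beforeFlag = tracked.isKnownASCII              // flag not set yet
let withoutScan = tracked.isASCII(scan: false)     // flag only, so unknown
let withScan = tracked.isASCII()                   // O(n) scan finds the answer
tracked.updatePerformanceFlags()                   // later queries are O(1)
```

The difference between level 2 and level 3 is only whether the cost of the scan is paid once or on every query.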

Future directions for opaque strings

String permits anything that can implement String’s interfaces through resilient functions as part of an opaque string form. All handling is done in out-of-line code to minimize the impact on regular strings.

Currently, this is only used for non-contiguous-ASCII lazily-bridged NSStrings. The only way for a developer to provide their own backing String representation here is to provide a subclass of NSString. In the future, we could use this path as a way to support existential strings, which could include strings of an unknown encoding, compressed strings, interned strings, etc., which gain flexibility in exchange for slower read access.

It is important that the existence of opaque strings has the minimal possible impact on the performance of contiguous strings. Opacity is given a dedicated bit in String's ABI. For example, on ARM64, supporting opaque strings only costs a single, 4-byte TBNZ instruction.

scanon · March 17, 2019, 2:58pm

Can you expand a bit on why an extra level of indirection would be incurred here, since that's not totally obvious?

Michael_Ilseman · March 17, 2019, 8:39pm

Right, a large (immediate and contiguous) string has tail-allocated contents at a statically known fixed-offset. To load the first UTF-8 code unit, we form the pointer by adding the offset (32) to the object reference itself, and then load. We've done one direct load from a pointer.

As currently represented, indirect (contiguous) strings provide that pointer through a resilient function call. Loading the first UTF-8 code unit involves first calling this function to get the pointer out, and then loading from that pointer. That function can be thought of as an extra level of indirection compared to large strings, which just loads from the reference itself with an offset.

Let's peel this back one layer. If we had some shared string form that was so frequent and important that we wanted to encode it in String's ABI (using a spare representation) for inlining purposes, it would still likely have an extra level of pointer indirection. For example, this would probably be __SharedStringStorage, which holds an owner (to track lifetime) and a pointer+length to UTF-8 (and supports breadcrumbing). We would be intentionally paying a code size cost to avoid a resilient function call for this representation.

Accessing the first UTF-8 code unit would involve loading the pointer from some offset from the object reference first, then loading the byte from that, meaning there's an extra level of pointer indirection at the beginning of every access. For an API such as the proposed withUTF8, a small constant cost to get the pointer is not a big deal compared to the processing that would follow. But it is a small constant cost that you will pay for every String API.
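The one-load versus two-load difference can be imitated with raw memory. This is only a sketch: the 32-byte header comes from the offset quoted above, while pointerOffset, the allocation layout of the "shared" object and the function itself are hypothetical and do not reflect the real storage classes.

```swift
func loadFirstCodeUnit() -> (large: UInt8, shared: UInt8) {
    let headerSize = 32       // offset quoted above for large strings
    let pointerOffset = 16    // hypothetical slot holding the shared pointer
    let bytes = Array("hi".utf8)

    // Large shape: header and contents live in one allocation.
    let large = UnsafeMutableRawPointer.allocate(
        byteCount: headerSize + bytes.count, alignment: 16)
    defer { large.deallocate() }
    large.advanced(by: headerSize).copyMemory(from: bytes, byteCount: bytes.count)

    // Shared shape: contents live elsewhere; the object stores their address.
    let elsewhere = UnsafeMutablePointer<UInt8>.allocate(capacity: bytes.count)
    defer { elsewhere.deallocate() }
    elsewhere.initialize(from: bytes, count: bytes.count)
    let shared = UnsafeMutableRawPointer.allocate(byteCount: headerSize, alignment: 16)
    defer { shared.deallocate() }
    shared.storeBytes(of: UnsafePointer(elsewhere), toByteOffset: pointerOffset,
                      as: UnsafePointer<UInt8>.self)

    // One load: the address is simply reference + constant.
    let fromLarge = large.load(fromByteOffset: headerSize, as: UInt8.self)

    // Two dependent loads: fetch the pointer first, then the byte.
    let pointer = shared.load(fromByteOffset: pointerOffset,
                              as: UnsafePointer<UInt8>.self)
    let fromShared = pointer.pointee

    return (fromLarge, fromShared)
}

let firstUnits = loadFirstCodeUnit()
```

Both paths read the same byte; the shared path just cannot start its second load until the first one has completed.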

Peeling it back even further, we could choose to bless certain kinds of shared strings that happen to encompass the entire storage of some other tail-allocated type (e.g. Array). If that type's offset just so happens to be the same as large strings (32), then we could consider representing it as though it were a large but not isNativelyStored. Otherwise, we'd have another blessed representation for that specific type's offset. This is technically feasible, but the code size tradeoff makes it unlikely unless one particular offset turns out to be very common in practice.

We would also want to start using a perf-flag to denote nul-termination soon. That would allow O(1) C-string bridging for such shared strings if known safe, and would ease migration if we did go down the route of using a large-form encoding.

johannesweiss · March 18, 2019, 2:20pm

Thanks so much for writing this up, Michael! I'm sure I'll link to this post quite often

Michael_Ilseman · February 24, 2021, 4:23pm

Hello everyone, I thought I'd post a decoder ring which helps anyone make sense of the stdlib source code which implements these String forms. The names in source code were established before my post (due to ABI stability), and were chosen to guide implementors (i.e. they appear in symbol names but not API).

"Immortal" denotes a string that has permanent lifetime, such as a string literal or a small-form string, and thus is not managed by ARC.

"Bridged" denotes that the object coming in originated in Objective-C and thus reference counting should be performed using the Objective-C runtime rather than Swift's runtime.

Large is used to refer to not-small, including everything except Immediate-Small. All of these forms share a similar bit-layout, including where the count and performance flags are located. Small strings have a completely different bit layout, since they pack their small contents directly into the struct, so differentiating between small and large is often one of the first considerations.

Native is used for Immediate-Large strings, as they are using the stdlib's native storage class which has a guaranteed fixed offset to the start of the tail-allocated code units. These are the fastest non-small strings as the pointer can be derived through addition. This offset is referred to as the nativeBias and is applied in reverse to string literal addresses, which saves a branch everywhere in any (usually inlined) code that accesses the content of a string. I.e., the code paths for reading from natively-stored tail-allocated strings and immortal string literals are identical.
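The arithmetic behind the reverse bias is small enough to show with plain integers. In this sketch the value 32 is the offset quoted earlier in the thread, the two addresses are invented, and the function name is hypothetical; it only demonstrates why a single formula can serve both forms.

```swift
let nativeBias: UInt = 32   // offset quoted earlier in this thread

// Invented addresses, treated as plain integers for illustration.
let storageObjectAddress: UInt = 0x1000   // instance of the native storage class
let literalBytesAddress: UInt = 0x8000    // literal's bytes in the binary

// What ends up stored in the string for each form.
let nativePayload = storageObjectAddress                 // the reference itself
let literalPayload = literalBytesAddress &- nativeBias   // bias applied in reverse

// One branch-free computation finds the code units for both forms.
func startOfCodeUnits(payload: UInt) -> UInt {
    payload &+ nativeBias
}

let nativeStart = startOfCodeUnits(payload: nativePayload)
let literalStart = startOfCodeUnits(payload: literalPayload)
```

Because the subtraction is done once when the literal's string is formed, readers never need to ask which of the two forms they are looking at.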

Shared is used for Indirect strings and Foreign is used for Opaque strings. In hindsight, "Indirect" and "Opaque" are probably better names, and luckily they are only present in the ABI in very minor ways, so we have the opportunity to rename them.

_StringObject and _SmallString are layout-equivalent with one another on 64-bit little endian platforms and byte swapped on 64-bit big endian platforms. On 32-bit, small strings need to be packed and unpacked, since we are more bit-constrained there. _SmallString is useful as a view of a subset of bit-patterns that _StringObject can take, allowing String functionality to be directly implemented.

Max_Desiatov · February 24, 2021, 8:40pm

Would you mind if this is added to the StringDesign.rst document? And as a side note, I'm going through a few other docs and converting them to Markdown, may I do that to StringDesign.rst too?

Michael_Ilseman · February 25, 2021, 4:53am

That is a very old (pre-Swift-1.0) document that is likely out of date (though it happened to guess some details right, such as UTF-8 encoding). If you do start an overhaul, perhaps archive that document as-is for historical reasons. Otherwise I wouldn't try to just append my post to the document.

If you are interested in overhauling it and writing a new design doc (i.e. doing all of the real work :-), I'd be happy to provide clarifications and content.