Piercing the String Veil
Hi all, I would like to provide some background information about what is available in String’s ABI in Swift 5, and thoughts on what's possible to surface to performance-minded users.
String’s ABI encodes information for performant processing and future flexibility, allowing the standard library to run faster and produce smaller and more flexible code. But this information is not available as API for low-level programming, as it’s hidden behind the abstraction of the String type. With this information formally part of String’s ABI, we should lift the veil of the String type and start exposing these as API.
An opaque string is capable of handling all string operations through resilient function calls. Nothing about their implementation is exposed outside the standard library. Currently, these are used for lazily-bridged NSStrings that do not provide access to contiguous UTF-8 (i.e. ASCII) bytes in memory. This can be expanded in the future to support any kind of future string forms we may want or need to add.
All other strings are contiguous, that is, they can provide a pointer and length to valid UTF-8 bytes in memory.
An indirect string has declared in the ABI that it is capable of providing a pointer to validly-encoded UTF-8 bytes in memory through a resilient function. Currently, indirect strings are used for NSStrings that can provide access to contiguous UTF-8 in memory. Since the means by which they get a UTF-8 pointer is behind a resilient function call, they can also be used to support shared strings, that is, strings which share their contents with some other data structure (e.g. Array<UInt8>). Accessing the content of an indirect string requires an extra level of indirection (hence the name) to get the pointer, but after that, they hit all the fast-paths in String’s implementation.
Other contiguous strings are immediate, which means they are in Swift’s preferred form for the most efficient and direct access.
A small string packs its contents (up to 15 UTF-8 code units in length) directly in the String struct, without the need for a separate allocation.
A large string holds a reference to a class instance with tail-allocated contents at a statically known offset from the instance’s address. This means there is no additional indirection on top of the reference itself.
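The relationship between these forms can be summarized with a small model. This is only an illustrative sketch: StringForm and its properties are hypothetical names, and the real representation is internal to the standard library rather than expressed as an enum.

```swift
// Hypothetical model of the forms described above. The real layout is
// internal to the standard library and is not an enum.
enum StringForm {
    case opaque    // every operation goes through resilient function calls
    case indirect  // contiguous UTF-8, pointer obtained via a resilient call
    case small     // immediate: contents packed into the String struct
    case large     // immediate: tail-allocated contents behind one reference

    /// Whether the form can provide a pointer and length to UTF-8 bytes.
    var isContiguous: Bool { self != .opaque }

    /// Whether the form is in Swift's preferred, most direct representation.
    var isImmediate: Bool { self == .small || self == .large }
}

for form in [StringForm.opaque, .indirect, .small, .large] {
    print(form, "contiguous:", form.isContiguous, "immediate:", form.isImmediate)
}
```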
Contiguity
The most significant distinction, both for the standard library’s implementation of String’s API and performance-minded developers, is whether a String can provide contiguous-UTF-8 contents and how to get the pointer. SE-0247, currently under review, aims to deliver this. It lifts the veil one layer to expose contiguity.
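As a rough sketch of what working with contiguity looks like, the following uses the names proposed in SE-0247 (isContiguousUTF8, makeContiguousUTF8(), withUTF8, and withContiguousStorageIfAvailable on the UTF8View). It assumes a toolchain in which that API is available as proposed.

```swift
var message = "Hello, w\u{F6}rld"

// Make sure the contents are contiguous UTF-8. For a string that is
// already contiguous, the check succeeds and nothing is copied.
if !message.isContiguousUTF8 {
    message.makeContiguousUTF8()
}

// withUTF8 is mutating because it may first have to make the contents
// contiguous. The buffer must not escape the closure.
let asciiByteCount = message.withUTF8 { bytes in
    bytes.reduce(0) { $0 + ($1 < 0x80 ? 1 : 0) }
}

// The UTF8View offers a non-mutating variant that returns nil when no
// contiguous storage is available.
let countViaView = message.utf8.withContiguousStorageIfAvailable { $0.count }

print(asciiByteCount, countViaView as Any)
```

The non-mutating variant is the one to reach for when the string cannot be mutated; callers then need a slow path for the nil case.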
Shared storage
The “indirect” contiguous form allows a string to share UTF-8 storage with another type rather than making a copy. Reads pay the cost of an extra level of indirection, but are still contiguous and fast.
This would allow Strings to be created sharing storage with an Array<UInt8>/Data/ByteBuffer/etc. UTF-8 validation should still happen (though we could consider an “unsafe” opt-out as in Rust), and there are some design decisions in how best to surface this.
Substring could have a withSharedString { ... } API, which vends such an indirect string for the purposes of interacting with APIs taking a String, without actually copying out the substring. Care would have to be taken to not overly-extend the lifetime of a much larger allocation. We could also consider an alternative approach similar to withoutActuallyEscaping, where the shared-storage class is stack-allocated and we issue a trap if it escaped.
In a world with more shared storage, we’d want to provide developers answers to the following:
- Is immediate — Is this string in the fastest possible representation, i.e. the fewest levels of indirection
- Excess storage estimate — How much memory could be freed up by copying into an immediate representation.
as well as an explicit copying method or initializer to force the string into an immediate representation.
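For contrast, here is what is available today, where building a String from bytes or from a Substring always copies. This sketch uses only existing API; the shared-storage initializers and queries discussed above are not shown because they do not exist yet.

```swift
// Building a String from an array of bytes copies them into
// String-owned storage.
let bytes: [UInt8] = Array("shared storage would avoid this copy".utf8)

// String(decoding:as:) validates the input, repairing any invalid
// sequences with U+FFFD, and copies the result.
let copied = String(decoding: bytes, as: UTF8.self)

// A Substring shares storage with its base string, which keeps the whole
// base allocation alive for as long as the slice lives.
let document = String(repeating: "lorem ipsum ", count: 1_000)
let word: Substring = document.prefix(5)

// Converting to String copies only the slice. This is the closest
// existing analogue to an explicit "copy into an immediate
// representation" operation.
let detached = String(word)

print(copied.utf8.count, detached)
```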
Small and large
The standard library has an internal _SmallString
struct which is used for string contents of up to 15 UTF-8 code units, as well as ways to query if a string is in the small form. This could be useful to developers in rare situations, e.g. operating directly on the struct rather than spilling to the stack as withUTF8
would do. We would need to discover more use cases to justify it’s addition.
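The 15-code-unit limit is measured in UTF-8 code units, not characters. The sketch below makes that concrete; couldFitSmallForm is a hypothetical helper that only checks the length, assuming the 64-bit limit quoted above. It does not query the actual representation, which has no public API.

```swift
/// Hypothetical helper: whether the contents are short enough for the
/// small form (assumes the 15-code-unit limit of 64-bit platforms).
/// This does not tell you which form a given string is actually in.
func couldFitSmallForm(_ string: String) -> Bool {
    return string.utf8.count <= 15
}

// The size of the String struct itself, which a small string's
// contents have to fit within.
print(MemoryLayout<String>.size)

let accented = "na\u{EF}ve caf\u{E9}"   // 10 characters, 12 UTF-8 code units
print(accented.count, accented.utf8.count, couldFitSmallForm(accented))
```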
Performance flags
String remembers performance-relevant information about its contents through the use of performance flags.
For example, a String that is known to be all-ASCII has a trivial UTF8View, UTF16View, and UnicodeScalarView. Also, mapping offsets between the two code unit views is trivial, so there is no need for any bookkeeping as part of Cocoa interop.
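The triviality of the views for ASCII content can be observed from the outside. The sketch below uses only public API; the explicit allSatisfy scan stands in for the internal isASCII flag, which is not exposed.

```swift
let ascii = "Hello, world"
let nonASCII = "h\u{E9}llo \u{1F30D}"

for string in [ascii, nonASCII] {
    // Public stand-in for the internal flag: scan the code units.
    let isAllASCII = string.utf8.allSatisfy { $0 < 0x80 }

    // For ASCII content, every code unit is one scalar in every encoding,
    // so the three views have the same length and offsets map one-to-one.
    print(isAllASCII,
          string.utf8.count,
          string.utf16.count,
          string.unicodeScalars.count)
}
```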
Current flags:
- isASCII - whether the content is known to be entirely ASCII
  - Trivial UTF8View, UTF16View, and UnicodeScalarView
  - Grapheme breaking fast-paths (just have to guard for CR-LF)
  - Trivial offset mapping between UTF-16 and UTF-8 for Cocoa interop
- isNFC - whether the content is known to be in Normal Form C
  - Comparison is just comparing memory as-is
  - Hashing is just hashing memory as-is
- isNativelyStored - whether the string is immediate, large, and memory managed (i.e. not a literal)
  - String is in a form that allows in-place mutation, if uniquely referenced
- isTailAllocated - whether the string is immediate and large
  - Contents are available at a known fixed-offset from the object’s header
String has room to grow 12 more flags at any point in the future, with the constraint that a flag value of 0 has to be semantically equivalent to 1. That is, flags represent opportunities for faster processing, but they’re not needed for correctness.
Flags such as isASCII are best-effort: if not set, it’s still possible that the String is ASCII. For example, a non-ASCII string whose last remaining non-ASCII character was just removed might not have the isASCII bit set, as that would require prohibitively expensive bookkeeping for every mutation.
There are a few levels at which we could expose working with these flags:
- What do the flags say? E.g. var isKnownNFC: Bool { get }.
- Perform a scan to answer the query, using the flags as just a fast-path. E.g. func isNFC(scan: Bool = true) -> Bool.
- Perform a scan, updating the flags with the new information gleaned. E.g. mutating func updatePerformanceFlags().
- Force the content into a form that has a flag set. E.g. mutating func makeNFC() or mutating func makeCanonical().
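To make the first three levels concrete, here is a minimal sketch using an ASCII flag instead of NFC, since ASCII can be checked without extra libraries. FlaggedString and all of its members are hypothetical; the real flags live inside String’s internal representation, and the fourth level (forcing a form) is omitted.

```swift
/// Hypothetical wrapper modeling a best-effort flag stored next to a String.
struct FlaggedString {
    private(set) var contents: String

    /// Level 1: what does the flag say? `true` means "known ASCII";
    /// `false` means "unknown", not "known non-ASCII".
    private(set) var isKnownASCII: Bool

    init(_ contents: String) {
        self.contents = contents
        self.isKnownASCII = contents.utf8.allSatisfy { $0 < 0x80 }
    }

    /// Level 2: answer the query, using the flag as just a fast path.
    func isASCII(scan: Bool = true) -> Bool {
        if isKnownASCII { return true }
        return scan && contents.utf8.allSatisfy { $0 < 0x80 }
    }

    /// Level 3: perform a scan and record what was learned.
    mutating func updatePerformanceFlags() {
        isKnownASCII = contents.utf8.allSatisfy { $0 < 0x80 }
    }

    /// Mutation leaves the flag alone instead of rescanning. Removal can
    /// never introduce non-ASCII content, so a set flag stays valid and
    /// a clear flag stays conservatively clear.
    mutating func removeAll(where shouldRemove: (Character) -> Bool) {
        contents.removeAll(where: shouldRemove)
    }
}

var text = FlaggedString("caf\u{E9}")
text.removeAll { $0 == "\u{E9}" }
print(text.isKnownASCII)           // false: the flag was not updated
print(text.isASCII(scan: false))   // false: flag only, no scan
print(text.isASCII())              // true: the scan finds only ASCII
text.updatePerformanceFlags()
print(text.isKnownASCII)           // true: the flag now reflects the scan
```

Note how a clear flag never produces a wrong answer, only a slower one, which is the correctness constraint the flags have to satisfy.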
Future directions for opaque strings
String permits anything that can implement String’s interfaces through resilient functions as part of an opaque string form. All handling is done in out-of-line code to minimize the impact on regular strings.
Currently, this is only used for non-contiguous-ASCII lazily-bridged NSStrings. The only way for a developer to provide their own backing String representation here is to provide a subclass of NSString. In the future, we could use this path as a way to support existential strings, which could include strings of an unknown encoding, compressed strings, interned strings, etc., which gain flexibility in exchange for slower read access.
It is important that the existence of opaque strings has the minimal possible impact on the performance of contiguous strings. Opacity is given a dedicated bit in String's ABI. For example, on ARM64, supporting opaque strings only costs a single, 4-byte TBNZ instruction.
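The shape of that check can be sketched as a single bit test guarding an out-of-line slow path. Everything here is hypothetical: ModelString, the discriminator field, and the bit position are illustrative only and do not reflect String’s actual layout.

```swift
/// Hypothetical stand-in for a string value with a discriminator word.
struct ModelString {
    var discriminator: UInt64
    var bytes: [UInt8]

    /// Assumed bit position, chosen for illustration only.
    static let opaqueBit: UInt64 = 1 << 63
}

/// Stands in for the resilient, out-of-line handling of opaque strings.
@inline(never)
func utf8CountOpaque(_ string: ModelString) -> Int {
    return string.bytes.count
}

/// The fast path pays only for one bit test before proceeding inline.
func utf8Count(_ string: ModelString) -> Int {
    if string.discriminator & ModelString.opaqueBit != 0 {
        return utf8CountOpaque(string)   // rare, out-of-line
    }
    return string.bytes.count            // common, inline
}

let regular = ModelString(discriminator: 0, bytes: Array("abc".utf8))
let opaque = ModelString(discriminator: ModelString.opaqueBit, bytes: Array("abcd".utf8))
print(utf8Count(regular), utf8Count(opaque))
```

Keeping the slow path in a separate, non-inlined function is what keeps the code size cost at the call site down to the branch itself.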