Pitch: Add a String Initializer with Access to Uninitialized Storage

Hi folks!

I've been working on improving the performance of NSString-to-String bridging, and @Michael_Ilseman pointed out that the more we can build it out of public API rather than adding a bunch of private stuff for Foundation's use, the better off everyone is.

This new initializer is designed to be consistent with the recently-accepted similar initializer on Array.

Add a String Initializer with Access to Uninitialized Storage

Introduction

This proposal suggests a new initializer for String that provides access to a String's uninitialized storage buffer.

Motivation

Bridging NSString to String currently requires using standard library internals to get good performance, which suggests a class of problem that the standard library is currently not well-equipped to solve: efficiently creating a String when you don't already have a contiguous buffer of bytes to initialize it with. In the NSString bridging case, that's using CFStringGetBytes to copy-and-transcode the contents of the NSString.

Proposed solution

Add a new String initializer that lets a program work with an uninitialized
buffer.

The new initializer takes a closure that operates on an
UnsafeMutableBufferPointer and an inout count of initialized elements. This
closure has access to the uninitialized contents of the newly created String's
storage, and must set the initialized count of the string before exiting.

let myCocoaString = NSString("The quick brown fox jumps over the lazy dog") as CFString
var myString = String(unsafeUninitializedCapacity: CFStringGetLength(myCocoaString)) { buffer, initializedCount in
    CFStringGetBytes(
        myCocoaString,
        buffer,
        …,
        &initializedCount
    )
}
// myString == "The quick brown fox jumps over the lazy dog"

Without this initializer we would have had to heap allocate an UnsafeMutableBufferPointer, copy the NSString contents into it, and then copy the buffer again as we initialized the String.

Detailed design

  /// Creates a new String with the specified capacity in UTF-8 code units then
  /// calls the given closure with a buffer covering the String's uninitialized
  /// memory.
  ///
  /// The closure should set `initializedCount` to the number of
  /// initialized code units, or 0 if it couldn't initialize the buffer
  /// (for example if the requested capacity was too small).
  ///
  /// This initializer replaces ill-formed UTF-8 sequences with the Unicode
  /// replacement character (`"\u{FFFD}"`); this may require resizing
  /// the buffer beyond its original capacity.
  ///
  /// The following examples use this initializer with the contents of two
  /// different `UInt8` arrays---the first with well-formed UTF-8 code unit
  /// sequences and the second with an ill-formed sequence at the end.
  ///
  ///     let validUTF8: [UInt8] = [67, 97, 102, 195, 169]
  ///     let s1 = String(unsafeUninitializedCapacity: validUTF8.count,
  ///                     initializingUTF8With: { (ptr, count) in
  ///         _ = ptr.initialize(from: validUTF8)
  ///         count = validUTF8.count
  ///     })
  ///     print(s1)
  ///     // Prints "Café"
  ///
  ///     let invalidUTF8: [UInt8] = [67, 97, 102, 195]
  ///     let s2 = String(unsafeUninitializedCapacity: invalidUTF8.count,
  ///                     initializingUTF8With: { (ptr, count) in
  ///         _ = ptr.initialize(from: invalidUTF8)
  ///         count = invalidUTF8.count
  ///     })
  ///     print(s2)
  ///     // Prints "Caf�"
  ///
  ///     let s3 = String(unsafeUninitializedCapacity: invalidUTF8.count,
  ///                     initializingUTF8With: { (ptr, count) in
  ///         _ = ptr.initialize(from: invalidUTF8)
  ///         count = 0
  ///     })
  ///     print(s3)
  ///     // Prints ""
  ///
  /// - Parameters:
  ///   - capacity: The number of UTF-8 code units worth of memory to allocate
  ///       for the String.
  ///   - initializer: A closure that initializes elements and sets the count of
  ///       the new String
  ///     - Parameters:
  ///       - buffer: A buffer covering uninitialized memory with room for the
  ///           specified number of UTF-8 code units.
  ///       - initializedCount: Set this to the number of elements in `buffer`
  ///           that were actually initialized by the `initializer`
  @inlinable @inline(__always)
  public init(
    unsafeUninitializedCapacity capacity: Int,
    initializingUTF8With initializer: (
      _ buffer: UnsafeMutableBufferPointer<UInt8>,
      _ initializedCount: inout Int
    ) throws -> Void
  ) rethrows

Specifying a capacity

The initializer takes the specific capacity that a user wants to work with as a
parameter. The buffer passed to the closure has a count that is exactly the
same as the specified capacity, even if the ultimate size of the new String is larger.

Guarantees after throwing

Unlike Array, there are no special considerations about the state of the buffer when an error is thrown.

Source compatibility

This is an additive change to the standard library,
so there is no effect on source compatibility.

Effect on ABI stability

The new initializer will be part of the ABI, and will result in calls to a new @usableFromInline symbol being inlined into client code. Use of the new initializer is gated by @available, though, so there's no back-deployment concern.

Effect on API resilience

The additional APIs will be a permanent part of the standard library,
and will need to remain public API.

Alternatives considered

Returning the new count from the initializer closure

This is more plausible for String than it was for Array, since there's no need to deal with deinitialization of elements in partially initialized buffers (UInt8s are trivial). However, it was considered more valuable to match Array's behavior here.

Returning a Bool to indicate success from the closure

Requiring people to either throw or check in the caller for an empty String return if the initializing closure fails is slightly awkward, but again, not sufficiently so to warrant being inconsistent with Array.

Validating UTF-8 instead of repairing invalid UTF-8

Matching the behavior of most other String initializers here also makes it more ergonomic to use, since it can be non-failable this way.
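
For comparison, the existing String(decoding:as:) initializer already repairs rather than fails. The sketch below uses it to show what repair looks like in practice; the byte values are illustrative and not taken from the proposal. Note that the repaired result is longer than the input, which is the kind of growth beyond the original capacity mentioned in the detailed design.

```swift
// Well-formed UTF-8 for "Café" (0xC3 0xA9 encodes "é").
let validUTF8: [UInt8] = [67, 97, 102, 0xC3, 0xA9]
// Ill-formed: the two-byte sequence is cut off after its lead byte.
let invalidUTF8: [UInt8] = [67, 97, 102, 0xC3]

// Both calls are non-failable; ill-formed input is repaired, not rejected.
let valid = String(decoding: validUTF8, as: UTF8.self)
let repaired = String(decoding: invalidUTF8, as: UTF8.self)

// The lone lead byte becomes U+FFFD, which takes three UTF-8 code units,
// so four input bytes produce a six-byte string.
print(repaired == "Caf\u{FFFD}", repaired.utf8.count)
```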


Woo! Huge +1

I think bridging undersells the motivation; it's generally useful and was included in String Essentials. It allows users to avoid having their own hacky buffers for small contents and avoids an extra allocation for large contents.

Within the standard library, the Float and Int formatting paths do the hacky stack-buffer thing.

@ravikandhadai could probably make good use of this in his logging efforts, as might the server-side loggers. @johannesweiss, do NIO or other server-side projects have a use for directly initializing strings?

@tbkka, would this be useful for Swift protobuf?

Would this be because UTF8.CodeUnit is a trivial type? Worth stating explicitly.

The real motivation is that a failable closure would make the init failable. That would really hurt use sites, forcing them into an ugly anti-pattern the way failable withContiguousStorageIfAvailable often does in practice.

edit: removed dangling text


Not only would this be useful for the Darwin Foundation but I think it would be super keen for swift-corelibs-Foundation as well. +1


Seems straightforward and useful to me. My main concern was about the confusion between capacity and count, but while double-checking if it was new terminology for String I found out that the ship has sailed on that already. I think the documentation on those existing capacity methods should be updated to talk about UTF-8 code units instead of ASCII characters to be consistent, though.


Good point. @nnnnnnnn what do you think?

So first of all, big +1 to this proposal 🙂.

@Michael_Ilseman, what exactly do you mean by 'directly initialising strings'? We do create plenty of strings from bytes but we usually already have them in a contiguous buffer from the network. So currently we just do String(decoding: ourBuffer, as: UTF8.self), which, if I understand this correctly, will do the same thing as the proposed initialiser if you already have such a contiguous buffer.

Where I could see this being super useful is when we're streaming data. So from the network we get arbitrarily sized contiguous chunks. But maybe the user wants to create a String that spans two ByteBufferViews, i.e. the string hasn't arrived from the network in one chunk. With the proposed API, we could create an API in NIO:

extension String {
    public init<C: Collection>(byteBufferViews: C) where C.Element == ByteBufferView {
        let overallCapacity = byteBufferViews.reduce(0, { $0 + $1.count })
        self = .init(unsafeUninitializedCapacity: overallCapacity) { targetPtr, initializedCount in
            // loop over views and append their contents to targetPtr,
            // then set initializedCount to the number of bytes written
        }
    }
}
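
A self-contained sketch of that loop, for anyone who wants to try the idea without NIO. It makes two assumptions: plain [UInt8] chunks stand in for ByteBufferView, and it uses the form of the initializer that later shipped in the Swift 5.3 standard library, where the closure returns the initialized count instead of setting an inout parameter as pitched here.

```swift
// Stand-in for chunks arriving from the network (hypothetical data);
// a real NIO version would iterate over ByteBufferViews instead.
let chunks: [[UInt8]] = [Array("Hello, ".utf8), Array("world".utf8)]
let overallCapacity = chunks.reduce(0) { $0 + $1.count }

let joined = String(unsafeUninitializedCapacity: overallCapacity) { buffer in
    // An empty buffer may have no base address; nothing to write in that case.
    guard let base = buffer.baseAddress else { return 0 }
    var written = 0
    for chunk in chunks {
        // Copy each chunk straight into the string's own storage.
        (base + written).initialize(from: chunk, count: chunk.count)
        written += chunk.count
    }
    // Report how many UTF-8 code units were initialized.
    return written
}

print(joined)
```

The total is computed up front so the closure never writes past the requested capacity; the string still validates (and if necessary repairs) the bytes after the closure returns.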

Your example code should probably use CFStringGetMaximumSizeForEncoding instead of CFStringGetLength for the capacity.

import Foundation

// Two UTF-16 code units.
let utf16String = NSString("\u{1F600}") as CFString
let utf16Count = CFStringGetLength(utf16String)
let utf16Range = CFRangeMake(0, utf16Count)

// Estimated six UTF-8 code units.
let utf8Encoding = CFStringBuiltInEncodings.UTF8.rawValue
let utf8Capacity = CFStringGetMaximumSizeForEncoding(
    utf16Count,
    utf8Encoding)

// Four UTF-8 code units (on lossless conversion).
let utf8LossByte: UInt8 = 0xFF
let utf8String = String(
    unsafeUninitializedCapacity: utf8Capacity) {
    utf8Buffer, utf8Count in
    _ = CFStringGetBytes(
        utf16String,
        utf16Range,
        utf8Encoding,
        utf8LossByte,
        false, // isExternalRepresentation
        utf8Buffer.baseAddress,
        utf8Capacity,
        &utf8Count)
}

Being able to intentionally underestimate the required capacity is actually a key design constraint for the use I have in mind, but you’re right, for examples I should make things conservatively correct by default.

My thoughts on this API have in some ways evolved, and in some ways have not budged at all and have even been reinforced 😅

It all comes down to shared Strings. They are in the ABI right now, so all of the performance wins that we've got in Swift 5 with UTF8 Strings are happening despite branching to check for external backing storage. All of the changes needed to expose this in the language are just surface level. That's just awesome.

So why do we even need a way to manually initialise String's storage, when we can so easily provide our own? Are there actual benefits to using String's built-in storage type for these cases? (I can only think of one: COW checks/in-place mutations. Can shared Strings not be extended to incorporate this?)

And then this doubles back to consistency with Array: if String can accommodate external storage with such great performance, it seems strange that Array couldn't do that, too. (Of course, Array is in many ways a lot simpler and always performed better than a Unicode-compliant String, so it has more to lose).

I asked about it when the equivalent Array API was proposed. The answer I got about the 'count' header seems pretty easy to overcome in the face of everything String is able to pull off. I'm sure we could devise a way to communicate the element count.


Anyway, on the proposal as written:

I'm pretty sure I remember being told that functions in Swift either succeed, throw, or something catastrophic/unrecoverable happens - i.e. a fatalError (as opposed to the 'exception' model). So IMO throwing is the right way for initialisation-closures to model failure. It's not clear if an empty String is necessarily always a failure condition.

We should drop the documentation comments about setting initializedCount to 0 on failure (again, throwing is the way to indicate a failure), and certainly the part about doing so if the capacity is not sufficient, since it implies the API might somehow give you less capacity than you asked for (i.e. an allocation failure, which should be a fatalError), and @David_Smith literally just said it was a "key design constraint" that you can intentionally ask for less and presumably append the rest later.

The most contentious issue is this:

People who are using an 'unsafe' initialiser are probably doing so because performance is critical. Validating the data is O(N), so I think it would be useful to provide a non-validating version as well.

Not important, but out of curiosity: do we know what happens if a String somehow manages to contain invalid UTF8? Is something likely to crash or fall into an infinite loop?

I was wondering about this approach (which I've been referring to as "adopting" storage) too; it's not unreasonable, but does have some tradeoffs:

  • I wouldn't expect __SharedStringStorage to perform as well as __StringStorage; an extra level of pointer indirection isn't a huge overhead, but it's not nothing. Perhaps more importantly, there's the cost of handling the storage owner.
  • We'd need to make sure the 15-codepoints-or-less case could compile down to the implementation we want, since SmallString initialization in bridging is extremely sensitive to small changes, and we would not want to do no-copy for SmallStrings.
  • In the case of NSString bridging, we don't already have a buffer (if CFStringGetCStringPtr returns NULL anyway), so for an escaping String I don't see any way around heap allocating a buffer and copying into it, which means paying two mallocs instead of one.

Indeed. I'm basically describing a performance hack here, but it was a necessary one for the change I'm making, so I felt it was worth calling out that it was possible. In short, throwing is currently too slow, but checking for 0 length input + empty String output is fast enough. I've filed a bug to make throwing faster. Whether that capability should be a documented part of the API semantics is something I could go either way on, but I'm generally wary of saying "the standard library gets to do XYZ, but everyone else doesn't" if we have an alternative.

Copying the data is also O(n), so there's no time complexity change here. Given that performance seems to be good for the (very performance-focused) bridging case, my inclination is to wait on providing a non-validating version until a) it becomes apparent that it's needed in practice, and b) we've vectorized the existing implementation and it's still insufficient.

Shared strings need their contents in contiguous managed storage. This pitch covers areas where we do not have contents in contiguous managed storage. The motivation section demonstrates one: interacting with a C API that will write into some storage, and we prefer to use String's native one for small-string optimizations. I also highlighted int/float formatting, where there is no pre-existing heap allocation for the contents and a stack buffer wouldn't have the right lifetime for a shared string.
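
To make the formatting case concrete, here is a rough sketch of writing an integer's decimal digits directly into string storage, with no intermediate stack buffer or Array. It is not the standard library's implementation: decimalString is a hypothetical helper, and it assumes the closure-returns-count form of the initializer that later shipped in the Swift 5.3 standard library rather than the inout form pitched in this thread.

```swift
// Hypothetical helper, not a standard library API.
func decimalString(_ value: UInt64) -> String {
    // UInt64.max has 20 decimal digits, so 20 code units always suffice.
    return String(unsafeUninitializedCapacity: 20) { buffer in
        var remaining = value
        var count = 0
        // Emit digits least-significant first.
        repeat {
            buffer[count] = UInt8(ascii: "0") + UInt8(remaining % 10)
            remaining /= 10
            count += 1
        } while remaining > 0
        // Reverse in place so the most significant digit comes first.
        var low = 0
        var high = count - 1
        while low < high {
            let tmp = buffer[low]
            buffer[low] = buffer[high]
            buffer[high] = tmp
            low += 1
            high -= 1
        }
        // Only the first `count` code units were initialized.
        return count
    }
}

print(decimalString(1234))
```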

Not quite. Shared strings will always be less efficient, as they require at least one extra level of indirection. Native strings are very carefully designed for efficient reads: they have tail-allocated contents at an offset that is burned into our ABI. No other representation can be as fast.

Users of shared strings would be willing to have a constant-factor overhead at the beginning of every read for the memory benefits of shared or adopted storage.

This pitch is orthogonal to shared strings.

In addition to indirection and small forms above, yes, mutation is another. For shared strings to benefit from COW, the owner of the original storage must itself honor COW semantics, which would be a subset of storage guarantees. Also, the original owner likely has a "count" and "capacity" field that would need to be updated, so shared strings would have to know how to interface with the owner, perhaps through some kind of runtime dispatch table. We have talked about such notions as long-term plans, and they all carry different tradeoffs. Again, this is all orthogonal to this pitch.

Array is fundamentally different: it is generic and contiguous in its element type, while String has a different storage representation from its concrete element type. Also, Strings are far more frequently bridged lazily.

All "contiguous UTF-8" strings maintain the invariant that they have validly encoded contents. We check on creation, when it is orders of magnitude cheaper to do, rather than on access. If we drop this guarantee from this API, they will be forced through the opaque slow path, which does validation at read-time.

"Unsafe" regards memory safety, not invariant safety. For example, String.UTF8View.index(after:) -> String.Index is memory safe, but might not be invariant-safe for a model assuming scalar alignment because it could produce a non-scalar-aligned index. This is a way that Swift is different than Rust, which would choose to call such an operation unsafe.
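
A small illustration of that index example, using only existing API. The string literal is an arbitrary choice; any string containing a multi-byte scalar would do.

```swift
// "é" as a single scalar (U+00E9), encoded in UTF-8 as 0xC3 0xA9.
let s = "\u{E9}"
let utf8Bytes = Array(s.utf8)

// Memory-safe: advancing by one UTF-8 code unit stays inside the string...
let mid = s.utf8.index(after: s.utf8.startIndex)

// ...but the resulting index sits in the middle of a scalar, so it has no
// corresponding position in the unicode scalar view.
let aligned = mid.samePosition(in: s.unicodeScalars)

print(utf8Bytes.count, aligned == nil)
```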

As for unsafe in the name, it does seem a little weird to me, but it's what Array chose. I suppose there are no bounds check guarantees inside the closure, so the closure's body could be considered memory-unsafe.


For Array the unsafe was a little more relevant because the element type might not be trivial, which means setting initializedCount incorrectly could result in memory unsafety. I think it would be reasonable to consider dropping that for String, but it would also be reasonable to keep the strict parallel.

Incidentally @Michael_Ilseman I have just encountered a corner of the NIO codebase that would like direct initialisation. In HTTP/2 we often need to decompress Huffman-encoded byte sequences into Strings. Right now I'm having to do this either via an intermediate Array or via a somewhat-hacky lazy Sequence, but it would probably be nicer to use direct initialisation to write the decoded bytes into the String storage.

The only downside is that I'll have to use an over-estimate of how much memory I need, but that's fine enough: we already have to do that if we want to use an intermediate Array.