Hi! I'd like to get some feedback on a new feature for the standard library: shared Substrings.
It's based on @allevato's prototype implementation, but with some significant changes.
Introduction
Shared Substrings give us a way to interpret buffers of bytes as Unicode text, without allocating new storage exclusively owned by a String object. This would, for example, allow developers to receive data from a file or network connection as an Array<UInt8> or Foundation Data object, and parse that data as text without copying it. Additionally, it gives developers of structured text objects (like URLs) greater control over how they organise their storage.
Motivation
It is common for applications to receive and manipulate text as blocks of bytes. For example, when reading data from a file or over a network connection, applications will typically traffic in types such as Data or ByteBuffer, some or all of which may be interpretable as text. The Swift standard library includes several utilities and algorithms for interpreting Unicode text. However, making use of these algorithms currently requires that the data is stored in a buffer created and managed by the String type.
Let's imagine a function which receives some payload data from a network connection as an Array and counts the number of characters it contains:
func countCharacters(in utf8Bytes: [UInt8]) -> Int {
    let text = String(decoding: utf8Bytes, as: UTF8.self)
    return text.count
}
Unfortunately, this copies the entire payload into new, owned storage - even though all we really want is to apply String's grapheme-traversal algorithms to the bytes in the existing buffer.
Furthermore, many libraries which make heavy use of Strings may want finer control over how they organise their storage. Take URLs for instance: a type representing a parsed URL might look something like this:
struct MyURL {
    var urlString: String
    var schemeEndIndex: String.Index
    var usernameEndIndex: String.Index?
    var passwordEndIndex: String.Index?
    var hostnameEndIndex: String.Index?
    var portEndIndex: String.Index?
    var pathEndIndex: String.Index?
    var queryStringEndIndex: String.Index?
    var fragmentEndIndex: String.Index?
}
This is roughly what Rust's URL type looks like. But just looking at it: every String.Index is 8 bytes on x86_64 (9 bytes for an optional), and String is 16 bytes - resulting in a minimum of 87 bytes (16 + 8 + 7 × 9, before any alignment padding) plus whatever dynamic storage the String might own. That's a very heavy object.
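These figures can be checked with MemoryLayout. The following is a minimal sketch, assuming a 64-bit platform; IndexHeavyURL is a hypothetical stand-in with the same stored properties as MyURL above. Because each optional index is padded to its alignment inside the struct, the in-line size of the whole type comes out larger than the plain sum of its field sizes.

```swift
// Hypothetical stand-in for MyURL above, with the same stored properties.
struct IndexHeavyURL {
    var urlString: String
    var schemeEndIndex: String.Index
    var usernameEndIndex: String.Index?
    var passwordEndIndex: String.Index?
    var hostnameEndIndex: String.Index?
    var portEndIndex: String.Index?
    var pathEndIndex: String.Index?
    var queryStringEndIndex: String.Index?
    var fragmentEndIndex: String.Index?
}

// Sum of the individual field sizes, ignoring alignment padding:
// one String, one non-optional index, seven optional indexes.
let fieldSizeSum = MemoryLayout<String>.size
    + MemoryLayout<String.Index>.size
    + 7 * MemoryLayout<String.Index?>.size

print("String:", MemoryLayout<String>.size)
print("String.Index:", MemoryLayout<String.Index>.size)
print("String.Index?:", MemoryLayout<String.Index?>.size)
print("Sum of field sizes:", fieldSizeSum)
print("In-line size of the struct:", MemoryLayout<IndexHeavyURL>.size)
```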
Even if we made this object a class, we end up with 2 heap allocations: one for MyURL itself, containing the indexes and the String, and again for String's storage. Maybe we would prefer to group these into a single allocation, like ManagedBuffer does, and maybe we'd like more control over the size of the String, so that we can use a smaller index type. Shared Substrings allow developers to explore these kinds of designs.
Proposed solution
This proposal would add the following API to the standard library, allowing developers to create Substrings from any of: a pointer-owner pair, a ManagedBuffer instance, or an ArraySlice.
extension Substring {
    /// Creates an immutable `Substring` whose backing store is shared with the UTF-8 data
    /// referenced by the given buffer pointer.
    ///
    /// The `owner` argument should manage the lifetime of the shared buffer.
    /// The `Substring` instance created by this initializer retains `owner` so that deallocation
    /// may occur after the substring is no longer in use. The buffer _must not_ be
    /// mutated while there are any strings sharing it.
    ///
    /// This initializer does not try to repair ill-formed UTF-8 code unit
    /// sequences. If any are found, the result of the initializer is `nil`.
    ///
    /// - Parameters:
    ///   - buffer: An `UnsafeBufferPointer` containing the UTF-8 bytes that
    ///     should be shared with the created `Substring`.
    ///   - owner: An object that owns the memory referenced by `buffer`.
    ///
    public init?(sharingStorage buffer: UnsafeBufferPointer<UInt8>, owner: AnyObject)

    /// Creates an immutable `Substring` whose UTF-8 backing store is shared with
    /// the given `ManagedBuffer` instance's elements.
    ///
    /// The `Substring` instance created by this initializer retains `buffer` so that deallocation
    /// may occur after the substring is no longer in use. The buffer _must not_ be
    /// mutated while there are any strings sharing it.
    ///
    /// This initializer does not try to repair ill-formed UTF-8 code unit
    /// sequences. If any are found, the result of the initializer is `nil`.
    ///
    /// - Parameters:
    ///   - buffer: A `ManagedBuffer` whose elements are UTF-8 bytes that
    ///     should be shared with the created `Substring`.
    ///   - range: The range of element offsets which should be included in the `Substring`.
    ///
    public init?<Header>(sharingElements buffer: ManagedBuffer<Header, UInt8>, range: Range<Int>)

    /// Creates an immutable `Substring` whose backing store is shared with the UTF-8 data
    /// in the given region of an array.
    ///
    /// This initializer does not try to repair ill-formed UTF-8 code unit
    /// sequences. If any are found, the result of the initializer is `nil`.
    ///
    /// - Parameters:
    ///   - array: An `ArraySlice` containing the UTF-8 bytes that
    ///     should be shared with the created `Substring`.
    ///
    public init?(sharingStorage array: ArraySlice<UInt8>)
}
There are several interesting things about this API:
- It allows creating a Substring, not a String. This aligns with SE-0163, which introduced Substring specifically to differentiate strings which own their backing storage from those which share a buffer and should not be stored long-term. A similar principle applies to shared strings. From SE-0163:

  Important: Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.

- Sharing storage requires value semantics/copy-on-write. Even though we ask people to not store Substrings long-term, that doesn't mean they won't, and an object which exposes Substring views of its storage must be mindful that those references may indeed escape. The owner object lets other references determine if any such references exist, and whether or not it is safe to mutate or deallocate the buffer.

- The returned Substrings are immutable. This means that the standard library will not mutate the bytes in the referenced buffer, and any attempted in-place modifications (like calling append) will copy to String-owned storage.

- ManagedBuffer and Array are special-cased. Even though we typically don't add convenience functions for low-level APIs, implementing these requires access to standard library internals. The special-casing is in terms of ArraySlice in order to reduce the number of entrypoints.
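The uniqueness check described in the second point can be sketched with isKnownUniquelyReferenced, which exists in the standard library today. This is only an illustration under stated assumptions: ByteStorage, SharedBytes and makeOwnerReference are hypothetical names, and the extra reference handed out here stands in for the one a shared Substring would hold on its owner.

```swift
// Hypothetical owner object: a class instance that holds the bytes.
final class ByteStorage {
    var bytes: [UInt8]
    init(bytes: [UInt8]) { self.bytes = bytes }
}

// Hypothetical value type that mutates its storage in place only when
// no other reference to that storage exists.
struct SharedBytes {
    private var storage: ByteStorage

    init(_ bytes: [UInt8]) { storage = ByteStorage(bytes: bytes) }

    var bytes: [UInt8] { storage.bytes }

    /// Hands out a second strong reference to the storage, standing in for
    /// the reference a shared Substring would keep on `owner`.
    func makeOwnerReference() -> AnyObject { storage }

    /// Appends a byte and returns `true` if the storage had to be copied first.
    @discardableResult
    mutating func append(_ byte: UInt8) -> Bool {
        var didCopy = false
        if !isKnownUniquelyReferenced(&storage) {
            // Someone else can still see the old buffer, so leave it untouched.
            storage = ByteStorage(bytes: storage.bytes)
            didCopy = true
        }
        storage.bytes.append(byte)
        return didCopy
    }
}

var schemeBytes = SharedBytes(Array("http".utf8))
schemeBytes.append(0x73) // uniquely referenced: mutated in place

let outstandingReference = schemeBytes.makeOwnerReference()
withExtendedLifetime(outstandingReference) {
    // A second reference is alive, so this append copies before mutating.
    schemeBytes.append(0x3A)
}
```

The same check also tells the owning type when it is safe to reuse or deallocate the buffer, which is the role the proposal gives to the owner object.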
Detailed design
With the above API, the character-counting example could be written as follows:
func countCharacters(in utf8Bytes: [UInt8]) -> Int {
    // The proposed initializer takes an ArraySlice, so slice the whole array.
    let text = Substring(sharingStorage: utf8Bytes[...])
    return text?.count ?? 0
}
Here, the creation of the shared Substring allows us to avoid having to allocate new storage.
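The ?? 0 fallback in countCharacters covers the case where the proposed initializer returns nil because the bytes are not well-formed UTF-8. That differs from String(decoding:as:), which repairs ill-formed input. A minimal sketch of the difference, using only API that exists today; isWellFormedUTF8 is a hypothetical helper and not part of the proposal:

```swift
/// Hypothetical helper: reports whether `bytes` is a well-formed UTF-8 sequence,
/// using the standard library's UTF-8 parser.
func isWellFormedUTF8(_ bytes: [UInt8]) -> Bool {
    var parser = Unicode.UTF8.ForwardParser()
    var iterator = bytes.makeIterator()
    while true {
        switch parser.parseScalar(from: &iterator) {
        case .valid:
            continue        // one scalar decoded, keep going
        case .emptyInput:
            return true     // reached the end without errors
        case .error:
            return false    // ill-formed sequence found
        }
    }
}

let wellFormed = Array("h\u{E9}llo".utf8)
let illFormed: [UInt8] = [0x61, 0xFF, 0x62] // 0xFF never appears in valid UTF-8

print(isWellFormedUTF8(wellFormed))
print(isWellFormedUTF8(illFormed))

// String(decoding:as:) repairs instead of failing: the invalid byte
// is replaced with U+FFFD REPLACEMENT CHARACTER.
let repaired = String(decoding: illFormed, as: UTF8.self)
print(repaired == "a\u{FFFD}b")
```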
The URL example consisting of a String and lots of indexes could look something like:
struct MyURL {
    struct Header {
        var schemeEndOffset: UInt16
        // ...
    }
    var storage: ManagedBuffer<Header, UInt8> = ...

    var scheme: Substring {
        let schemeEndIndex = Int(storage.header.schemeEndOffset)
        // Uses the ManagedBuffer initializer declared above (`sharingElements:range:`).
        return Substring(sharingElements: storage, range: 0..<schemeEndIndex)!
    }
    // etc.
}
Here, a single allocation contains both the indexes and the storage itself. Additionally, we have gained control over the index storage, by representing them as offsets into the UTF-8 bytes and limiting them to 2 bytes each. Using 0 to represent nil, this representation amounts to a 16 byte in-line prefix to the UTF-8 data.
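The single-allocation layout can be tried out with ManagedBuffer as it exists today. The sketch below is an assumption-laden illustration: URLHeader, makeURLStorage and copyScheme are hypothetical names, the eight offsets mirror the eight indexes of the earlier MyURL struct, and the scheme is read back with a copying String(decoding:as:) because the proposed sharing initializer is not available.

```swift
// Hypothetical header: eight 2-byte end offsets into the UTF-8 bytes,
// with 0 meaning "component absent".
struct URLHeader {
    var schemeEndOffset: UInt16 = 0
    var usernameEndOffset: UInt16 = 0
    var passwordEndOffset: UInt16 = 0
    var hostnameEndOffset: UInt16 = 0
    var portEndOffset: UInt16 = 0
    var pathEndOffset: UInt16 = 0
    var queryStringEndOffset: UInt16 = 0
    var fragmentEndOffset: UInt16 = 0
}

/// Allocates one buffer holding the header followed by the UTF-8 bytes.
func makeURLStorage(utf8: [UInt8], schemeEndOffset: UInt16) -> ManagedBuffer<URLHeader, UInt8> {
    let storage = ManagedBuffer<URLHeader, UInt8>.create(minimumCapacity: utf8.count) { _ in
        URLHeader(schemeEndOffset: schemeEndOffset)
    }
    storage.withUnsafeMutablePointerToElements { elements in
        // UInt8 is a trivial type, so no matching deinitialize is needed.
        elements.initialize(from: utf8, count: utf8.count)
    }
    return storage
}

/// Reads the scheme back out. This copies; with the proposed API the same
/// range would instead be exposed as a shared Substring.
func copyScheme(from storage: ManagedBuffer<URLHeader, UInt8>) -> String {
    storage.withUnsafeMutablePointers { header, elements -> String in
        let end = Int(header.pointee.schemeEndOffset)
        let bytes = UnsafeBufferPointer(start: elements, count: end)
        return String(decoding: bytes, as: UTF8.self)
    }
}

let urlStorage = makeURLStorage(utf8: Array("https://example.com/".utf8), schemeEndOffset: 5)
print(copyScheme(from: urlStorage))
print(MemoryLayout<URLHeader>.size) // eight UInt16 fields
```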
Source compatibility
These APIs are additive.
Effect on ABI stability
The functionality has been part of String's ABI since Swift 5.0.
Effect on API resilience
This proposal does not include any language features which would affect API resilience.
Alternatives considered
Create Strings rather than Substrings
Many languages distinguish between owned and borrowed strings: for example, Rust includes both String (owned) and str (borrowed) types, and C++ includes std::string (owned) and std::string_view (borrowed). Indeed, distinguishing between owned and borrowed storage is precisely why Swift defines both String (owned) and Substring (borrowed) types in the first place. Therefore, this proposal considers that Substring is the "correct" place for this API to live.
A closure-based API
The two most important things for users of this API to get right are the owner object's management of the buffer lifetime, and that all mutations first ensure that the buffer is not shared. Typically, Swift does this with closure-based APIs:
extension String {
    static func withStringView<Source, Result>(
        of: Source, perform: (String?) throws -> Result
    ) rethrows -> Result where Source: Sequence, Source.Element == UInt8 {
        // use withContiguousStorageIfAvailable, otherwise copy or return nil.
        // No need for 'owner' because the String should not escape.
    }
}

// Example:
func countCharacters(in utf8Bytes: [UInt8]) -> Int {
    return String.withStringView(of: utf8Bytes) { string in
        return string?.count ?? 0
    }
}
Unfortunately, this would severely hurt the usability of some of the motivating use-cases. For example, it would be rather unfortunate if a URL type could only expose its scheme or path string inside a closure scope.
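For reference, the call shape of such a closure-scoped API can be approximated with what exists today. The sketch below is a copying stand-in, not the zero-copy version this alternative describes: withDecodedString is a hypothetical free function, and it uses String(decoding:as:), which repairs ill-formed input instead of passing nil to the closure.

```swift
/// Hypothetical copying stand-in for the closure-based shape above.
/// A real implementation would borrow the source's storage instead of copying it.
func withDecodedString<Source, Result>(
    of source: Source,
    perform body: (String) throws -> Result
) rethrows -> Result where Source: Sequence, Source.Element == UInt8 {
    // Decode straight from contiguous storage when the source offers it...
    let fromContiguous = source.withContiguousStorageIfAvailable { buffer in
        String(decoding: buffer, as: UTF8.self)
    }
    // ...and fall back to collecting the bytes into an array otherwise.
    let string = fromContiguous ?? String(decoding: Array(source), as: UTF8.self)
    return try body(string)
}

// The String is only meant to be used inside the closure.
let characterCount = withDecodedString(of: Array("h\u{E9}llo".utf8)) { string in
    string.count
}
print(characterCount)
```

Even in this simplified form, the usability cost is visible: every use of the text has to be nested inside the closure.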
Promote Substring to String
@allevato's prototype shows another API we could implement using shared storage: the ability to "promote" a Substring to a String for a limited scope without copying:
extension Substring {
    /// Calls the given closure, passing it a `String` whose contents are equal to
    /// the substring but which shares the substring's storage instead of copying
    /// it.
    ///
    /// The `String` value passed into the closure is only valid during its
    /// execution.
    /// ...
    public func withSharedString<Result>(
        _ body: (String) throws -> Result
    ) rethrows -> Result
}
This proposal considers such an API to be a separate issue. It might be nice to have a way to break the split owned/borrowed type model in limited situations, but it isn't necessary for the motivating use-cases in this proposal.