Shared Substrings

Karl · July 18, 2020, 1:49pm

Hi! I'd like to get some feedback on a new feature for the standard library: shared Substrings.

It's based on @allevato's prototype implementation, but with some significant changes.

Introduction

Shared Substrings give us a way to interpret buffers of bytes as unicode text, without allocating new storage exclusively owned by a String object. It would, for example, allow developers to receive data from a file or network connection as an Array<UInt8> or Foundation Data object, and parse that data as text without copying it. Additionally, it gives developers of structured text objects (like URLs) greater control over how they organise their storage.

Motivation

It is common for applications to receive and manipulate text as blocks of bytes. For example, when reading data from a file or over a network connection, applications will typically traffic in types such as Data or ByteBuffer, some or all of which may be interpretable as text. The Swift standard library includes several utilities and algorithms for interpreting unicode text - however, making use of these algorithms currently requires that the data is stored in a buffer created and managed by the String type.

Let's imagine a function which receives some payload data from a network connection as an Array and counts the number of characters it contains:

func countCharacters(in utf8Bytes: [UInt8]) -> Int { 
  let text = String(decoding: utf8Bytes, as: UTF8.self)
  return text.count
}

Unfortunately, this copies the entire payload into new, owned storage, even though all we really want is to apply String's grapheme-traversal algorithms to the bytes in the existing buffer.

Furthermore, many libraries which make heavy use of Strings may want finer control over how they organise their storage. Take URLs for instance: a type representing a parsed URL might look something like this:

struct MyURL {
  var urlString: String
  var schemeEndIndex: String.Index
  var usernameEndIndex: String.Index?
  var passwordEndIndex: String.Index?
  var hostnameEndIndex: String.Index?
  var portEndIndex: String.Index?
  var pathEndIndex: String.Index?
  var queryStringEndIndex: String.Index?
  var fragmentEndIndex: String.Index?
}

This is roughly what Rust's URL type looks like. But just looking at it: every String.Index is 8 bytes on x86_64 (9 bytes for an optional), and String is 16 bytes, resulting in a minimum of 87 bytes (before any alignment padding between fields) plus whatever dynamic storage the String might own. That's a very heavy object.
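These figures can be checked on a given platform by asking MemoryLayout directly. The following is only a sketch: MyURLLayout is a hypothetical mirror of the struct above, and the printed numbers depend on the platform and compiler. Because each optional index is padded up to its alignment, the real struct size can come out noticeably larger than the simple sum of its fields.

```swift
// Hypothetical mirror of the MyURL struct above, used only to inspect its layout.
struct MyURLLayout {
  var urlString: String
  var schemeEndIndex: String.Index
  var usernameEndIndex: String.Index?
  var passwordEndIndex: String.Index?
  var hostnameEndIndex: String.Index?
  var portEndIndex: String.Index?
  var pathEndIndex: String.Index?
  var queryStringEndIndex: String.Index?
  var fragmentEndIndex: String.Index?
}

let stringSize = MemoryLayout<String>.size
let indexSize = MemoryLayout<String.Index>.size
let optionalIndexSize = MemoryLayout<String.Index?>.size

// Sum of the field sizes, ignoring any padding the compiler inserts between them.
let unpaddedSum = stringSize + indexSize + 7 * optionalIndexSize
// Size of the struct as actually laid out in memory.
let actualSize = MemoryLayout<MyURLLayout>.size

print("String:", stringSize, "Index:", indexSize, "Index?:", optionalIndexSize)
print("sum of fields:", unpaddedSum, "actual struct size:", actualSize)
```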

Even if we made this object a class, we end up with 2 heap allocations: one for MyURL itself, containing the indexes and the String, and again for String's storage. Maybe we would prefer to group these in to a single allocation, like ManagedBuffer does, and maybe we'd like more control over the size of the String, so that we can use a smaller index type. Shared Substrings allow developers to explore these kinds of designs.

Proposed solution

This proposal would add the following API to the standard library, allowing developers to create Substrings from either: a pointer-owner pair, a ManagedBuffer instance, or an ArraySlice.

extension Substring {

  /// Creates an immutable `Substring` whose backing store is shared with the UTF-8 data
  /// referenced by the given buffer pointer.
  ///
  /// The `owner` argument should manage the lifetime of the shared buffer.
  /// The `Substring` instance created by this initializer retains `owner` so that deallocation
  /// may occur after the substring is no longer in use. The buffer _must not_ be
  /// mutated while there are any strings sharing it.
  ///
  /// This initializer does not try to repair ill-formed UTF-8 code unit
  /// sequences. If any are found, the result of the initializer is `nil`.
  ///
  /// - Parameters:
  ///   - buffer: An `UnsafeBufferPointer` containing the UTF-8 bytes that
  ///     should be shared with the created `Substring`.
  ///   - owner: An object that owns the memory referenced by `buffer`.
  ///
  public init?(sharingStorage buffer: UnsafeBufferPointer<UInt8>, owner: AnyObject)

  /// Creates an immutable `Substring` whose UTF-8 backing store is shared with
  /// the given `ManagedBuffer` instance's elements.
  ///
  /// The `Substring` instance created by this initializer retains `buffer` so that deallocation
  /// may occur after the substring is no longer in use. The buffer _must not_ be
  /// mutated while there are any strings sharing it.
  ///
  /// This initializer does not try to repair ill-formed UTF-8 code unit
  /// sequences. If any are found, the result of the initializer is `nil`.
  ///
  /// - Parameters:
  ///   - buffer: A `ManagedBuffer` whose elements are UTF-8 bytes that
  ///     should be shared with the created `Substring`.
  ///   - range: The range of elements which should be included in the `Substring`.
  ///
  public init?<Header>(sharingElements buffer: ManagedBuffer<Header, UInt8>, range: Range<Int>)

  /// Creates an immutable `Substring` whose backing store is shared with the UTF-8 data
  /// in the given region of an array.
  ///
  /// This initializer does not try to repair ill-formed UTF-8 code unit
  /// sequences. If any are found, the result of the initializer is `nil`.
  ///
  /// - Parameters:
  ///   - array: An `ArraySlice` containing the UTF-8 bytes that
  ///     should be shared with the created `Substring`.
  ///
  public init?(sharingStorage array: ArraySlice<UInt8>)
}

There are several interesting things about this API:

It allows creating a Substring, not a String. This aligns with SE-0163, which introduced Substring specifically to differentiate strings which own their backing storage from those which share a buffer and should not be stored long-term. A similar principle applies to shared strings. From SE-0163:

Important

Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.

Sharing storage requires value semantics/copy-on-write. Even though we ask people not to store Substrings long-term, that doesn't mean they won't, and an object which exposes Substring views of its storage must be mindful that those references may indeed escape. The owner object lets the code managing the buffer determine whether any such references exist, and whether or not it is safe to mutate or deallocate the buffer.
The returned Substrings are immutable. This means that the standard library will not mutate the bytes in the referenced buffer, and any attempted in-place modifications (like calling append) will copy to String-owned storage.
ManagedBuffer and Array are special-cased. Even though we typically don't add convenience functions for low-level APIs, implementing these requires access to standard library internals. The special-casing is in terms of ArraySlice in order to reduce the number of entrypoints.
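The copy-on-write point can be illustrated with today's standard library. This is a minimal sketch, assuming a hypothetical ByteStorage owner class and TextBuffer wrapper (neither is part of the proposal). Before mutating, the wrapper uses isKnownUniquelyReferenced to ask whether anything else, such as an escaped shared view, still holds the owner.

```swift
// Hypothetical owner object; a shared Substring would retain an object like this.
final class ByteStorage {
  var bytes: [UInt8]
  init(_ bytes: [UInt8]) { self.bytes = bytes }
}

// Hypothetical wrapper that only mutates storage it holds uniquely.
struct TextBuffer {
  private(set) var storage: ByteStorage
  init(_ bytes: [UInt8]) { storage = ByteStorage(bytes) }

  mutating func append(_ byte: UInt8) {
    if !isKnownUniquelyReferenced(&storage) {
      // Another reference exists: copy first so the shared bytes never change.
      storage = ByteStorage(storage.bytes)
    }
    storage.bytes.append(byte)
  }
}

var buffer = TextBuffer(Array("http".utf8))
let escaped = buffer.storage        // stands in for an escaped shared view
buffer.append(UInt8(ascii: "s"))    // not unique, so the storage is copied first
print(escaped.bytes.count, buffer.storage.bytes.count)  // prints "4 5"
```

The escaped reference keeps seeing the original four bytes, which is the guarantee an owner would need to uphold for any Substring sharing its buffer.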

Detailed design

With the above API, the character-counting example could be written as follows:

func countCharacters(in utf8Bytes: [UInt8]) -> Int { 
  let text = Substring(sharingStorage: utf8Bytes[...])
  return text?.count ?? 0
}

Here, the creation of the shared Substring allows us to avoid having to allocate new storage.

The URL example consisting of a String and lots of indexes could look something like:

struct MyURL {
  struct Header {
    var schemeEndOffset: UInt16
    // ...
  }
  var storage: ManagedBuffer<Header, UInt8> = ...
  
  var scheme: Substring {
    let schemeEndIndex = Int(storage.header.schemeEndOffset)
    return Substring(sharingElements: storage, range: 0..<schemeEndIndex)!
  }
  // etc.
}

Here, a single allocation contains both the indexes, and the storage itself. Additionally, we have gained control over the index storage, by representing them as offsets in to the UTF8 bytes and limiting them to 2 bytes each. Using 0 to represent nil, this representation amounts to a 16 byte in-line prefix to the UTF8 data.
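The single-allocation layout can already be built with ManagedBuffer today; only the non-copying Substring is missing. The sketch below is an illustration under assumptions: SketchURL, its two-field Header, and the hard-coded scheme offset are hypothetical, and the scheme accessor copies the bytes with String(decoding:as:) where the proposed initializer would share them.

```swift
// Hypothetical URL type: header offsets and UTF-8 bytes share one allocation.
struct SketchURL {
  struct Header {
    var count: UInt16            // number of UTF-8 bytes stored
    var schemeEndOffset: UInt16  // UTF-8 offset where the scheme ends
  }
  var storage: ManagedBuffer<Header, UInt8>

  // Note: UInt16(_:) traps for values above 65535; a real type would validate first.
  init(_ text: String, schemeEnd: Int) {
    let utf8 = Array(text.utf8)
    storage = ManagedBuffer<Header, UInt8>.create(minimumCapacity: utf8.count) { _ in
      Header(count: UInt16(utf8.count), schemeEndOffset: UInt16(schemeEnd))
    }
    // Copy the bytes into the tail-allocated element storage.
    storage.withUnsafeMutablePointerToElements { elements in
      elements.initialize(from: utf8, count: utf8.count)
    }
  }

  // Copies the scheme out; the proposed API would share these bytes instead.
  var scheme: String {
    let end = Int(storage.header.schemeEndOffset)
    return storage.withUnsafeMutablePointerToElements { elements in
      String(decoding: UnsafeBufferPointer(start: UnsafePointer(elements), count: end),
             as: UTF8.self)
    }
  }
}

let url = SketchURL("https://example.com/", schemeEnd: 5)
print(url.scheme)  // prints "https"
```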

Source compatibility

These APIs are additive.

Effect on ABI stability

The underlying functionality has been part of String's ABI since Swift 5.0.

Effect on API resilience

This proposal does not include any language features which would affect API resilience.

Alternatives considered

Create `String`s rather than `Substring`s

Many languages distinguish between owned and borrowed strings: for example, Rust includes both String (owned) and str (borrowed) types, C++ includes std::string (owned) and std::string_view (borrowed). Indeed, distinguishing between owned and borrowed storage is precisely why Swift defines both String (owned) and Substring (borrowed) types in the first place. Therefore, this proposal considers that Substring is the "correct" place for this API to live.

A closure-based API

The 2 most important things for users of this API to get right are the owner object's management of the buffer lifetime, and that all mutations first ensure that the buffer is not shared. Typically, Swift does this with closure-based APIs:

extension String {

  static func withStringView<Source, Result>(
    of: Source, perform: (String?) throws -> Result
  ) rethrows -> Result where Source: Sequence, Source.Element == UInt8 {
    // use withContiguousStorageIfAvailable, otherwise copy or return nil.
    // No need for 'owner' because the String should not escape.
  }
}

// Example:
func countCharacters(in utf8Bytes: [UInt8]) -> Int { 
  return String.withStringView(of: utf8Bytes) { string in
    return string?.count ?? 0
  }
}

Unfortunately, this would severely hurt the usability of some of the motivating use-cases. For example, it would be rather unfortunate if a URL type could only expose its scheme or path string inside a closure scope.

Promote `Substring` to `String`

@allevato's prototype shows another API we could implement using shared storage: the ability to "promote" a Substring to a String for a limited scope without copying:

extension Substring {

  /// Calls the given closure, passing it a `String` whose contents are equal to
  /// the substring but which shares the substring's storage instead of copying
  /// it.
  ///
  /// The `String` value passed into the closure is only valid during its
  /// execution.
  /// ...
  public func withSharedString<Result>(
       _ body: (String) throws -> Result
  ) rethrows -> Result
}

This proposal considers such an API to be a separate issue. It might be nice to have a way to break the split owned/borrowing type model in limited situations, but isn't necessary for the motivating use-cases in this proposal.

Karl · July 18, 2020, 1:55pm

CC @Michael_Ilseman

One thing that might be nice to add is some way to bypass UTF8 verification. String would already be relying on me not to mutate the buffer under its feet: it's perfectly reasonable to ask that it also just trust me that my buffer contains only ASCII text.

Another nice thing would be if String.Index had methods which exposed UTF8 offsets, but that might better be left to a separate proposal.

For example, right now, String.Index has the following methods to convert to/from integer UTF16 offsets:

extension String.Index {
  // https://developer.apple.com/documentation/swift/string/index/3151848-init
  init<S>(utf16Offset offset: Int, in s: S) where S : StringProtocol

  // https://developer.apple.com/documentation/swift/string/index/3151849-utf16offset
  func utf16Offset<S>(in s: S) -> Int where S : StringProtocol
}

But there is strangely no equivalent for UTF8.
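In the meantime, the conversion can be written through the utf8 view. A rough sketch follows; the helper names indexAtUTF8Offset and utf8Offset(of:) are made up for illustration and are not standard library API.

```swift
extension StringProtocol {
  // Hypothetical helper: the index `offset` UTF-8 code units from the start,
  // or nil if the offset is negative or past the end.
  func indexAtUTF8Offset(_ offset: Int) -> Index? {
    guard offset >= 0 else { return nil }
    return utf8.index(utf8.startIndex, offsetBy: offset, limitedBy: utf8.endIndex)
  }

  // Hypothetical helper: the number of UTF-8 code units before `index`.
  func utf8Offset(of index: Index) -> Int {
    utf8.distance(from: utf8.startIndex, to: index)
  }
}

let text = "h\u{E9}llo"                  // "é" occupies two UTF-8 code units
let afterE = text.indexAtUTF8Offset(3)!  // one byte for "h" plus two for "é"
print(text[..<afterE])                   // prints "hé"
print(text.utf8Offset(of: afterE))       // prints "3"
```

An offset that lands in the middle of a multi-byte scalar produces an index that is not scalar-aligned, so callers would still need to know their offsets fall on character boundaries.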

allevato · July 18, 2020, 5:15pm

Thanks for driving this forward!

Unfortunately, I think using Substring instead of String for this API is the wrong choice for many situations—it's limiting to such a degree that the API would not be useful for many of its intended use cases.

The original use case that drove my implementation was writing bridging code for C/C++ APIs that returned char buffers or std::string objects (for which ownership transferred to the caller) that I wanted to get into a Swift String without a copy. The problem with using Substring as the currency type for this operation is simply that most Swift APIs that consume text operate on String values, store String values, etc. Therefore, the moment you want to do anything truly useful with the string, you'd have to convert it using String(substring), and you've lost every performance benefit that the shared storage API was supposed to provide.

The issue is with the following statement:

The bolded sentence is not always true. Consider the following function from Cmark (the C Markdown library):

/** Render a 'node' tree as an HTML fragment.  It is up to the user
 * to add an appropriate header and footer. It is the caller's
 * responsibility to free the returned buffer.
 */
CMARK_EXPORT
char *cmark_render_html(cmark_node *root, int options);

The C API has explicitly transferred ownership of the returned char buffer to the caller. "Strings which own their backing storage" doesn't have to mean that they initially created that backing storage, just that they manage it once they're given ownership of it. In this case, what is desired is an ability to create a String that takes ownership of an existing buffer to efficiently create one without a copy:

class FreeingOwner {
  private let buffer: UnsafeBufferPointer<UInt8>
  init(_ buffer: UnsafeBufferPointer<UInt8>) { self.buffer = buffer }
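  // free() expects a mutable pointer, so cast away the constness first.
  deinit { free(UnsafeMutablePointer(mutating: buffer.baseAddress)) }
}

let pointer = cmark_render_html(treeRoot, 0)
let buffer = UnsafeBufferPointer(start: pointer, count: strlen(pointer))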
let htmlString = String(sharingStorage: buffer, owner: FreeingOwner(buffer))

If I'm forced to create a Substring here, then I'm much more limited in what I can do with that value unless I create a copy or unless all the APIs I want to use support Substring or are generic over StringProtocol, which is extremely unlikely.

Karl · July 18, 2020, 5:21pm

SE-0248 introduced Substring.base, so you could still get a String type if you’re sure there’s no cost to storing the result long-term (like if it represents or takes ownership of the entire buffer).

allevato · July 18, 2020, 5:35pm

That would be a misuse of that API, though, and going through such hoops would unnecessarily obfuscate the code at the call site. The problem being solved is "I have an existing buffer that I own and I want to create a String and transfer ownership to it". I shouldn't have to go through Substring for that. Instead of giving people an escape hatch to get a String, let's just give them a clear API.

I can't speak for Rust's types, but the analogy between the proposed shared strings and C++'s std::string_view isn't correct. A std::string_view is nothing more than a pointer into an existing buffer and a length; there is no concept of ownership involved with string_view (so it's also unsafe). The equivalent in Swift would be a Substring initializer that takes a buffer pointer but no owner. But since the proposed initializers do take an owner object, we have the ability to transfer ownership completely to the object, which is what makes String a viable target type.

Karl · July 18, 2020, 5:42pm

Perhaps - that is something I’d like to get feedback on. String is certainly convenient, but Substring specifically exists to represent text from a shared buffer. It’s up to the community if they care about strictly enforcing that model or just letting everything be Strings because convenience is more important. They’re both viable options.

Perhaps it would be more correct to say that string_view is “unowned” rather than “borrowed“. It’s a similar concept, but of course it isn’t a direct equivalent.

woolsweater · July 18, 2020, 5:57pm

I like the direction here very much!

One point, perhaps minor, that springs to mind is the occasional mismatch in String's API between signed/unsigned bytes. I think this interface may want to include entry points that use CChar/Int8 to avoid friction talking to C APIs.
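To make the friction concrete, here is roughly what bridging a `char *` to UInt8 looks like with current APIs. This is only a sketch: strdup stands in for a C function that returns a caller-owned buffer, and String(decoding:as:) copies the bytes, which is exactly the copy the proposal aims to avoid.

```swift
import Foundation

// strdup stands in for a C API that hands back a caller-owned `char *`.
let cString: UnsafeMutablePointer<CChar> = strdup("caf\u{E9}")!
let length = strlen(cString)

// CChar and UInt8 have the same size and layout, so the bytes can be viewed as UInt8.
let decoded = cString.withMemoryRebound(to: UInt8.self, capacity: length) { bytes in
  String(decoding: UnsafeBufferPointer(start: UnsafePointer(bytes), count: length),
         as: UTF8.self)
}

// The String owns its own copy, so the C buffer can be freed immediately.
free(cString)
print(decoded, length)  // prints "café 5"
```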

allevato · July 18, 2020, 6:01pm

Substring exists to represent a slice of a String where the memory is owned by the base string—I don't think one should try to extrapolate semantics about other external memory buffers from that. The documentation for Substring says simply:

Substring

A slice of a string.

If you wished to implement the proposed APIs on Substring, then you would still need to create a String under the hood to act as the base, that String would be the owner of the buffer, and then your Substring is a "view" over the entire string. IMO, such an API would be a lie—it's returning something other than what it's actually creating, and it's returning something that is not what documentation of Substring says it is.

In my opinion, the crux of the issue is that Swift makes it extremely easy at the compiler level to interop with C APIs; there is no other language I can think of that has such a clean approach (thanks to embedding Clang directly), because all you have to do is import the module.

When it comes to working with those APIs at the runtime level, however, string-based APIs are where C interop suffers a great deal more, and it's one of the biggest obstacles today to Swift being used as a replacement for C and C++ in situations where performance is critical. In order to eliminate those obstacles, not only should we remove the performance penalties, but the APIs we create to do so must be as ergonomic as possible.

Karl · July 18, 2020, 6:55pm

The ergonomic difficulties that come from using Substring are not new. The responses in this thread summarise the situation well, and why it is the way it is: Swift is still sometimes so cumbersome - #4 by Ben_Cohen

But those responses do convince me that sharing should be expressed as a Substring by default. Maybe we could add a separate API on String for taking ownership of (not sharing) an external buffer?

allevato · July 18, 2020, 9:55pm

Ok, I think the link above provided some additional context that wasn't as clear before. Is your concern that the following would happen?

You have a MyURL type like the one above (slices into a managed buffer), but if the shared buffer initializers are on String, then the person implementing the type would also make the properties (like scheme) type String.
Someone accesses myURL.scheme and gets a String with the value "http".
They no longer need the rest of the URL, so they let it go out of scope.
They think they're hanging on to this small 4-byte string, but what they're really retaining is the original buffer, which contains 1MB worth of query parameters.
They would have realized this possible memory inefficiency if the type of the property had been Substring.

That's a reasonable concern, but it's a problem in this specific case because the initializer that takes a ManagedBuffer also takes a slice of the buffer. I think there's a different way to resolve this:

An initializer on String takes only a ManagedBuffer and no range argument, and assumes the entire buffer is the string contents.
An initializer on String takes an Array instead of an ArraySlice and uses the entire array.

Is that what you mean by "a separate API on String for taking ownership of (not sharing) an external buffer"? (I can't think of another way to distinguish owning vs. sharing in these situations, since both must retain the underlying buffer via an owner for the string or substring to be safe.)

Then, properties like those in MyURL would create a string that wraps the entire buffer and then get the relevant substring of that. This would require String.Index initializers with UTF-8 offsets, as you already pointed out, and I don't think it should be less efficient.

I personally would prefer this because it maintains the property that a Substring is a slice of a String, as opposed to something that is created from "thin air".

Now, in total fairness, the above problem can't be avoided for the UnsafeBufferPointer initializer, because once you have one of those, you can move it wherever you want, and there's no way to guarantee that the buffer passed in is exactly the same as what is owned by owner. But such is the nature of unsafe APIs.

Karl · July 19, 2020, 5:00pm

Unfortunately, slicing shared Strings of the entire buffer has a couple of drawbacks:

It means the entire buffer contents are checked as being valid UTF8, every time we get any component. We'd need an option to bypass that (e.g. URLs do percent-escaping and IDN transformations, so we can guarantee the contents are ASCII).
There are still plenty of use-cases (e.g. buffers in text editors) when we can't give any meaningful "base" for the string to be sliced from, other than the entire string itself. So "base" doesn't always have much meaning - library authors need to document what (if anything) users can assume the slice is part of.

Actually I kind of misspoke there, that's not what I mean. Here's hopefully a clearer explanation:

I think Substring is the "natural place" for sharing to live, as it already represents text data from a possibly larger allocation that shouldn't be stored long-term. Currently, that buffer only comes from a String, but this proposal would expand that. This is reinforced as I read posts like the one I linked to above, and by research I've done about other String models and how similar functionality is exposed there.

Take Rust, where &str represents a string slice and is actually a language-primitive type. It includes methods like from_utf8 and from_utf8_unchecked, and from what I can tell, it is by far the most common string representation you will see in libraries. The owned string type, String, is a library type which basically consists of a vector that can be implicitly converted to a string slice (&str). Rust developers seem to really like their model, and besides syntax, there are direct parallels to Swift's String vs Substring model. It’s not exactly the same (Rust’s string slice doesn’t take an owner, but lifetime is enforced through borrow scopes). If we were to extrapolate from Rust, I think these methods would be on Substring.

So when you say:

I wonder if that's a problem with this feature, or the ecosystem/language? Again, we have an almost-identical model to Rust, but their developers love this split and use their version of Substring for everything, while Swift developers (as evidenced in that post, and many like it) find it awkward. Maybe this calls for some broader thinking beyond shared Strings.

allevato:

class FreeingOwner {
  private let buffer: UnsafeBufferPointer<UInt8>
  init(_ buffer: UnsafeBufferPointer<UInt8>) { self.buffer = buffer }
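  // free() expects a mutable pointer, so cast away the constness first.
  deinit { free(UnsafeMutablePointer(mutating: buffer.baseAddress)) }
}

let pointer = cmark_render_html(treeRoot, 0)
let buffer = UnsafeBufferPointer(start: pointer, count: strlen(pointer))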
let htmlString = String(sharingStorage: buffer, owner: FreeingOwner(buffer))

You're right that it makes sense to create a String here - that despite the very good reasons we have to distinguish String and Substring in the type system, there are cases when the sharing doesn't really add up to much/any additional memory use in practice, and we want to explicitly allow long-term storage.

Besides your example: for a URL, the entire-URL string consisting of the buffer's entire contents can safely be exposed as a String and stored long-term without worrying about sharing causing memory leaks. There's a small cost for the header, but the library author can decide if it's worth copying to remove that.

So maybe what we need is something like I mentioned in the last part of "alternatives considered": an API to promote Substrings in to String. Would something like this work?

extension String {

  /// Creates a String referencing the same storage as the given Substring.
  ///
  /// - Warning: Using this initializer erases the signal which the type `Substring`
  ///   gives to developers, informing them that a piece of text is part of a shared
  ///   and potentially large allocation. Users are advised to only use this
  ///   when long-term storage of the result would not keep much more memory alive
  ///   than is necessary.
  /// 
  public init(allowingLongTermStorageOf substring: Substring)
}

Allowing you to write:

let pointer = cmark_render_html(treeRoot, 0)
let buffer = UnsafeBufferPointer(start: pointer, count: strlen(pointer))
let htmlString = String(allowingLongTermStorageOf:
  Substring(sharingStorage: buffer, owner: FreeingOwner(buffer))!
)

It's a little bit involved, but I think it gives sharing the best API (using Substring), and is nice and explicit about what the lifting to String entails.

One more thing I've been thinking about is adding an API to get the owner back out of the String. This could enable some really funky stuff, for example:

extension MyURL {
  var storage: ManagedBuffer<Header, UInt8>

  var urlString: String {
    let stringView = Substring(sharingElements: storage, range: 0..<storage.header.count)!
    return String(allowingLongTermStorageOf: stringView)
  }
  
  init?(_ string: String) {
    // If this is a MyURL.urlString, reference the existing storage.
    if let existingStorage = string.owner as? ManagedBuffer<MyURL.Header, UInt8>,
      string.withUTF8({ $0.count == existingStorage.header.count }) 
    {
      self.storage = existingStorage
      return
    }
    // Ordinary String. Create new storage and parse.
  }
}

So basically we can cache arbitrary associated data in Strings. There are lots of very cool things you can do with that, like caching Regex match information to speed up subsequent searches.

Jon_Shier · July 19, 2020, 5:22pm

I don't have the expertise for the larger conversation, but I do run into String / Substring issues quite a bit, so I wanted to speak briefly to this point. Perhaps the biggest issue with Substring is the lack of support within Swift and Apple's SDKs, as almost nothing uses StringProtocol. There isn't much Swift-native String functionality, and none of the imported Foundation APIs work with it. None of Apple's other APIs work with it either, so every time we have one it must be converted into a String.

I think if Swift were to make it easier to use Substrings where we currently use Strings, were to move much of Foundation's String API onto StringProtocol, and were to provide some way to easily use APIs unaware of Substring, we would see much higher use of the type.

allevato · July 19, 2020, 9:25pm

Then we should add that option, which I think is a requirement for this API to be minimally viable anyway. At least in the case of the UnsafeBufferPointer initializer, we're already in unsafe land (i.e., String has to trust that the caller won't modify the memory underneath it because there's no CoW like the ManagedBuffer or Array cases), so having an option that says "trust me, the API that gave me this buffer already guaranteed it's valid UTF-8" doesn't make things any less safe and is necessary for the performance that users need. Adding it to the ManagedBuffer and Array cases seems fine too.

I don't follow this argument; if the only meaningful base is the entire string itself, then just let that be the base. I'm not sure how to interpret "what (if anything) [...] the slice is part of" because by definition a slice has to be sliced from something. This goes back to my point about providing an API that creates Substrings from "thin air"; if the base isn't obvious, we shouldn't provide APIs that gloss over that. Make the caller create the whole String and slice it themselves so it's crystal clear.

But these two bullet points make me think we should drill deeper into the MyURL example because they leave me unclear about what some of the invariants of these shared strings would be under your proposed implementation.

Above, MyURL wraps a buffer and provides efficient non-copying access to non-overlapping slices of that buffer. It's not mentioned in the snippet, but presumably there's also a way to access the entire URL as a string, and based on the representation, I imagine that would also be a non-copying shared string covering the whole buffer.

You mentioned concern about the "create a String and then slice it" approach incurring UTF-8 validation of the entire string, absent an option to skip it. But consider the following properties:

All strings in Swift are valid UTF-8. Therefore, all substrings are also valid UTF-8.
For any substring s, there is a base string S accessible using the Substring.base property.

Let's imagine the user accesses MyURL.scheme for a URL that is 1MB of data. What are you proposing the behavior to be here?

The substring is the scheme and the base is the entire buffer, but only the scheme is UTF-8-validated or known to be valid UTF-8; validation is deferred for the entire buffer until the base is accessed (if at all).

This would allow the creation of a buffer where you can access slices of it that are valid Substring values, but which would cause a runtime failure when you try to access their base. That's a violation of the string APIs since a Substring must come from a valid String.

The substring is the scheme and the base is also just that slice of the buffer, so only the requested portion needs to be UTF-8-validated or known to be valid UTF-8.

This would be more efficient, but would be surprising to the user of this API. If I'm given specifically a Substring when I access the scheme, I'm going to expect that its base is the full URL, that its startIndex/endIndex point to the relevant locations within that base, and so forth.

I claim that the only answer to this that can satisfy both the required relationships between String and Substring and provide a Substring that is semantically meaningful/correct to the caller is:

The substring is the scheme and the base is the entire buffer, and the entire buffer has been UTF-8-validated or is known to be valid UTF-8.

Therefore, there is no advantage to providing a Substring initializer. The relationship and validation behavior is far clearer if the callee creates a String first and then slices it to get the Substring. All that's needed are the initializers to get indices from UTF-8 offsets.
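For comparison, this is how the "create the String, then slice it" relationship looks with an ordinary, owned String today. It is a sketch only: the URL text and the stored offset are made-up values, and no shared storage is involved.

```swift
let whole = "https://example.com/path"
let schemeEndOffset = 5  // hypothetical UTF-8 offset kept in a header

// Convert the UTF-8 offset into a String.Index via the utf8 view.
let utf8 = whole.utf8
let schemeEnd = utf8.index(utf8.startIndex, offsetBy: schemeEndOffset)

// Slicing yields a Substring whose base is the full, already validated String.
let scheme: Substring = whole[..<schemeEnd]

print(scheme)                // prints "https"
print(scheme.base == whole)  // prints "true"
```

The slice's indices are positions within the base, which is the expectation described above for anything handed out as a Substring.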

So, to this point:

Since Swift Substring has the user-accessible base property that is a non-optional String, how would you expand that without creating a String anyway at some point (either internally, or as a consequence of the user accessing that property) or without making the API more complicated to support different kinds of bases?

I'm not sure whether the Rust analogies hold because I can't find a way to take a &str and get the base String from it, because one may not exist (e.g., if the &str is a literal). I'm by no means a Rust expert so it's entirely possible I've missed something, but since we have the base property, we need to work within that constraint. If the claim is that Swift's string model is awkward, then trying to make Substring a currency type for slices of non-String buffers is only going to make it more complex, not less, without a much bigger overhaul of the API than you're proposing here.

saeta · July 20, 2020, 3:55pm

+1 to this. I have found UTF-8 verification to be a significant enough performance challenge that I had to abandon using Swift Strings in a library I wrote.

dabrahams · July 20, 2020, 4:08pm

Please no. String and Substring are safe types: when you write an API receiving them, you know you can store and manipulate them in perpetuity without violating memory safety. The whole point of StringProtocol is that you can build unsafe strings without violating the safety of these core types. It would be as wrong to create arrays that point to unmanaged memory.

Karl · July 20, 2020, 5:14pm

Could you explain how this makes either of those types unsafe? The concerns about storage lifetimes and which type to use are about leakage, not safety.

You can totally store a shared Substring for as long as you like: it’s just discouraged, is all.

MrMage · July 20, 2020, 5:46pm

Another +1. My app spends a considerable amount of time validating UTF-8 strings that are expected to be valid in the first place.

I am a bit concerned about the usefulness of Substring though. Having a similar option to obtain a String would be great. My use case would mostly involve using entire strings, so wasting memory by retaining a long string while only using a small slice of it would not be a concern.

dabrahams · July 20, 2020, 6:03pm

This API:

extension Substring {
  public init?(sharingStorage buffer: UnsafeBufferPointer<UInt8>, owner: AnyObject)
}

Allows me to create a new Substring that can outlive the buffer it references. If the buffer is deallocated or overwritten while (a copy of) the Substring is alive, you now have a “safe” type all of whose interesting operations violate memory safety.

MrMage · July 20, 2020, 6:07pm

dabrahams:

This API:

extension Substring {
  public init?(sharingStorage buffer: UnsafeBufferPointer<UInt8>, owner: AnyObject)
}

Allows me to create a new Substring that can outlive the buffer it references. If the buffer is deallocated or overwritten while (a copy of) the Substring is alive, you now have a “safe” type all of whose interesting operations violate memory safety.

My understanding is that the Substring would retain owner, and the user is expected to guarantee that the buffer does not get deallocated before owner is deallocated. In that case, Substring should automatically extend the buffer's lifetime as needed.

Karl · July 20, 2020, 6:17pm

Correct. The only way you would violate memory safety is if there was a bug in your code.
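The retain-the-owner contract can be modelled with plain reference counting. This is a hedged sketch rather than the proposed implementation: BufferOwner, LiveCounter and SharedView are hypothetical names, with SharedView standing in for a shared Substring that holds a pointer plus a strong reference to its owner.

```swift
// Hypothetical counter so the example can observe when the buffer is freed.
final class LiveCounter { var live = 0 }

// Hypothetical owner: allocates a buffer and frees it in deinit.
final class BufferOwner {
  let bytes: UnsafeMutableBufferPointer<UInt8>
  let counter: LiveCounter

  init(copying source: [UInt8], counter: LiveCounter) {
    bytes = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: source.count)
    _ = bytes.initialize(from: source)
    self.counter = counter
    counter.live += 1
  }

  deinit {
    bytes.deallocate()
    counter.live -= 1
  }
}

// Hypothetical stand-in for a shared Substring: a pointer plus a retained owner.
struct SharedView {
  let bytes: UnsafeBufferPointer<UInt8>
  let owner: AnyObject
}

let counter = LiveCounter()
var owner: BufferOwner? = BufferOwner(copying: Array("hello".utf8), counter: counter)
var view: SharedView? = SharedView(bytes: UnsafeBufferPointer(owner!.bytes), owner: owner!)

owner = nil                      // the view still retains the owner
let whileViewed = counter.live   // buffer is still allocated
view = nil                       // last reference gone: deinit frees the buffer
let afterRelease = counter.live
print(whileViewed, afterRelease) // prints "1 0"
```

Safety here still depends on the caller: nothing stops someone from passing a pointer that the owner does not actually manage, which is the scenario raised earlier in the thread.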

We already have plenty of unsafe APIs which can cause undefined behaviour if not used correctly, and UB anywhere is UB everywhere: an UnsafeMutableBufferPointer overflow somewhere could corrupt a neighbouring Array's memory, and even though Array is "safe", that array's behaviour is then undefined. No code can be resilient to every bug.

An example with Array:

class SomeClass {}
let boom = Array<SomeClass>(unsafeUninitializedCapacity: 10) { ptr, count in
  count = 5 
}

Maybe it's worth adding the word unsafe to the name for visibility? init(sharingUnsafeStorage:)?