Shared Substrings

Substring exists to represent a slice of a String where the memory is owned by the base string—I don't think one should try to extrapolate semantics about other external memory buffers from that. The documentation for Substring says simply:

Substring

A slice of a string.

If you wished to implement the proposed APIs on Substring, then you would still need to create a String under the hood to act as the base, that String would be the owner of the buffer, and then your Substring is a "view" over the entire string. IMO, such an API would be a lie—it's returning something other than what it's actually creating, and it's returning something that is not what documentation of Substring says it is.
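For anyone following along, here is a minimal sketch of the relationship being described, using only today's standard library. The URL literal is just an illustrative value.

```swift
let url = "https://example.com/index.html"

// Slicing returns a Substring; no character data is copied.
let scheme: Substring = url.prefix(5)

// `base` is a non-optional String: the whole string the slice was taken from.
let base: String = scheme.base

print(scheme)        // https
print(base == url)   // true
```

The point is that a Substring always has a complete String behind it, which is what any new initializer on Substring would have to provide as well.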


In my opinion, the crux of the issue is that Swift makes it extremely easy at the compiler level to interop with C APIs; there is no other language I can think of that has such a clean approach (thanks to embedding Clang directly), because all you have to do is import the module.

When it comes to working with those APIs at the runtime level, however, string-based APIs are where C interop suffers a great deal more, and it's one of the biggest obstacles today to Swift being used as a replacement for C and C++ in situations where performance is critical. In order to eliminate those obstacles, not only should we remove the performance penalties, but the APIs we create to do so must be as ergonomic as possible.

The ergonomic difficulties that come from using Substring are not new. The responses in this thread summarise the situation well, and why it is the way it is: Swift is still sometimes so cumbersome - #4 by Ben_Cohen

But those responses do convince me that sharing should be expressed as a Substring by default. Maybe we could add a separate API on String for taking ownership of (not sharing) an external buffer?

Ok, I think the link above provided some additional context that wasn't as clear before. Is your concern that the following would happen?

  1. You have a MyURL type like the one above (slices into a managed buffer), but if the shared buffer initializers are on String, then the person implementing the type would also make the properties (like scheme) type String.
  2. Someone accesses myURL.scheme and gets a String with the value "http".
  3. They no longer need the rest of the URL, so they let it go out of scope.
  4. They think they're hanging on to this small 4-byte string, but what they're really retaining is the original buffer, which holds 1MB worth of query parameters.
  5. They would have realized this possible memory inefficiency if the type of the property had been Substring.
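The scenario in the list above can be reproduced with ordinary Strings and Substrings today. This is only a sketch: `schemeOfLargeURL` is a made-up helper, and the sizes are chosen to roughly match the 1MB example.

```swift
// Builds a long URL-like string and returns only its first four characters.
func schemeOfLargeURL() -> Substring {
    let query = String(repeating: "q=1&", count: 250_000)   // 1,000,000 bytes
    let url = "http://example.com/?" + query
    return url.prefix(4)
}

let scheme = schemeOfLargeURL()
print(scheme.utf8.count)        // 4
print(scheme.base.utf8.count)   // 1000020: the whole string is still reachable

// Converting to String copies just the slice, so the large buffer can be freed
// once `scheme` itself goes away.
let owned = String(scheme)
print(owned.utf8.count)         // 4
```

With a Substring the type itself hints that the large allocation may still be alive; a shared String would give no such hint.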

That's a reasonable concern, but it's a problem in this specific case because the initializer that takes a ManagedBuffer also takes a slice of the buffer. I think there's a different way to resolve this:

  • An initializer on String takes only a ManagedBuffer and no range argument, and assumes the entire buffer is the string contents.
  • An initializer on String takes an Array instead of an ArraySlice and uses the entire array.

Is that what you mean by "a separate API on String for taking ownership of (not sharing) an external buffer"? (I can't think of another way to distinguish owning vs. sharing in these situations, since both must retain the underlying buffer via an owner for the string or substring to be safe.)

Then, properties like those in MyURL would create a string that wraps the entire buffer and then get the relevant substring of that. This would require String.Index initializers with UTF-8 offsets, as you already pointed out, and I don't think it should be less efficient.
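Until such index initializers exist, slicing by UTF-8 offsets can be approximated with the UTF-8 view. A rough sketch, where `slice(_:utf8Offsets:)` is a hypothetical helper and not a standard library API:

```swift
// Hypothetical helper: slices `string` by UTF-8 byte offsets.
// Returns nil if the range is out of bounds or does not fall on Character boundaries.
func slice(_ string: String, utf8Offsets range: Range<Int>) -> Substring? {
    let utf8 = string.utf8
    guard range.lowerBound >= 0, range.upperBound <= utf8.count else { return nil }
    let lower = utf8.index(utf8.startIndex, offsetBy: range.lowerBound)
    let upper = utf8.index(lower, offsetBy: range.count)
    // Reject offsets that would land in the middle of a Character.
    guard lower.samePosition(in: string) != nil,
          upper.samePosition(in: string) != nil else { return nil }
    return string[lower..<upper]
}

let url = "http://example.com"
if let scheme = slice(url, utf8Offsets: 0..<4) {
    print(scheme)   // http
}
```

The resulting Substring has the whole URL as its base, which is the property I would like to keep.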

I personally would prefer this because it maintains the property that a Substring is a slice of a String, as opposed to something that is created from "thin air".

Now, in total fairness, the above problem can't be avoided for the UnsafeBufferPointer initializer, because once you have one of those, you can move it wherever you want, and there's no way to guarantee that the buffer passed in is exactly the same as what is owned by owner. But such is the nature of unsafe APIs.

Unfortunately, slicing shared Strings of the entire buffer has a couple of drawbacks:

  • It means the entire buffer contents are checked as being valid UTF8, every time we get any component. We'd need an option to bypass that (e.g. URLs do percent-escaping and IDN transformations, so we can guarantee the contents are ASCII).
  • There are still plenty of use-cases (e.g. buffers in text editors) when we can't give any meaningful "base" for the string to be sliced from, other than the entire string itself. So "base" doesn't always have much meaning - library authors need to document what (if anything) users can assume the slice is part of.
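To make the validation cost concrete, here is a small sketch with existing APIs. The byte arrays are made-up sample data; the only claim is that both checks have to look at every byte of the input.

```swift
let ascii: [UInt8] = Array("scheme=http".utf8)
let broken: [UInt8] = [0x68, 0x69, 0xFF]   // "hi" followed by a byte that is never valid in UTF-8

// A cheap ASCII check still has to touch every byte once.
let isASCII = ascii.allSatisfy { $0 < 0x80 }

// String(decoding:as:) inspects every byte and substitutes U+FFFD for invalid sequences.
let repaired = String(decoding: broken, as: UTF8.self)

print(isASCII)                    // true
print(repaired == "hi\u{FFFD}")   // true
```

For a type that already guarantees ASCII contents, repeating that scan on the whole buffer for every component is the overhead I would like an option to skip.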

Actually I kind of misspoke there, that's not what I mean. Here's hopefully a clearer explanation:

I think Substring is the "natural place" for sharing to live, as it already represents text data from a possibly larger allocation that shouldn't be stored long-term. Currently, that buffer only comes from a String, but this proposal would expand that. This is reinforced as I read posts like the one I linked to above, and by research I've done about other String models and how similar functionality is exposed there.

Take Rust, where &str represents a string slice and is actually a language-primitive type. It includes methods like from_utf8 and from_utf8_unchecked, and from what I can tell, it is by far the most common string representation you will see in libraries. The owned string type, String, is a library type which basically consists of a vector that can be implicitly converted to a string slice (&str). Rust developers seem to really like their model, and besides syntax, there are direct parallels to Swift's String vs Substring model. It’s not exactly the same (Rust’s string slice doesn’t take an owner, but lifetime is enforced through borrow scopes). If we were to extrapolate from Rust, I think these methods would be on Substring.

So when you say:

I wonder if that's a problem with this feature, or the ecosystem/language? Again, we have an almost-identical model to Rust, but their developers love this split and use their version of Substring for everything, while Swift developers (as evidenced in that post, and many like it) find it awkward. Maybe this calls for some broader thinking beyond shared Strings.

You're right that it makes sense to create a String here - that despite the very good reasons we have to distinguish String and Substring in the type system, there are cases when the sharing doesn't really add up to much/any additional memory use in practice, and we want to explicitly allow long-term storage.

Besides your example: for a URL, the entire-URL string consisting of the buffer's entire contents can safely be exposed as a String and stored long-term without worrying about sharing causing memory leaks. There's a small cost for the header, but the library author can decide if it's worth copying to remove that.

So maybe what we need is something like I mentioned in the last part of "alternatives considered": an API to promote Substrings into String. Would something like this work?

extension String {

  /// Creates a String referencing the same storage as the given Substring.
  ///
  /// - Warning: Using this initializer erases the signal which the type `Substring`
  ///   gives to developers, informing them that a piece of text is part of a shared
  ///   and potentially large allocation. Users are advised to only use this
  ///   when long-term storage of the result would not keep much more memory alive
  ///   than is necessary.
  /// 
  public init(allowingLongTermStorageOf substring: Substring)
}

Allowing you to write:

let pointer = cmark_render_html(treeRoot, 0)
let buffer = UnsafeBufferPointer(start: pointer, count: strlen(pointer))
let htmlString = String(allowingLongTermStorageOf: 
  Substring(sharingStorage: buffer, owner: FreeingOwner(buffer))
)

It's a little bit involved, but I think it gives sharing the best API (using Substring), and is nice and explicit about what the lifting to String entails.


One more thing I've been thinking about is adding an API to get the owner back out of the String. This could enable some really funky stuff, for example:

extension MyURL {
  var storage: ManagedBuffer<Header, UInt8>

  var urlString: String {
    let stringView = Substring(sharingStorage: storage, range: 0..<header.count)
    return String(allowingLongTermStorageOf: stringView)
  }
  
  init?(_ string: String) {
    // If this is a MyURL.urlString, reference the existing storage.
    if let existingStorage = string.owner as? ManagedBuffer<MyURL.Header, UInt8>,
      string.withUTF8({ $0.count == existingStorage.header.count }) 
    {
      self.storage = existingStorage
      return
    }
    // Ordinary String. Create new storage and parse.
  }
}

So basically we can cache arbitrary associated data in Strings. There are lots of very cool things you can do with that, like caching Regex match information to speed up subsequent searches.

I don't have the expertise for the larger conversation, but I do run into String / Substring issues quite a bit, so I wanted to speak briefly to this point. Perhaps the biggest issue with Substring is the lack of support within Swift and Apple's SDKs, as almost nothing uses StringProtocol. There isn't much Swift-native String functionality, and none of the imported Foundation APIs work with it. None of Apple's other APIs work with it either, so every time we have one it must be converted into a String.

I think if Swift were to make it easier to use Substrings where we currently use Strings, were to move much of Foundation's String API onto StringProtocol, and were to provide some way to easily use APIs unaware of Substring, we would see much higher use of the type.

Then we should add that option, which I think is a requirement for this API to be minimally viable anyway. At least in the case of the UnsafeBufferPointer initializer, we're already in unsafe land (i.e., String has to trust that the caller won't modify the memory underneath it because there's no CoW like the ManagedBuffer or Array cases), so having an option that says "trust me, the API that gave me this buffer already guaranteed it's valid UTF-8" doesn't make things any less safe and is necessary for the performance that users need. Adding it to the ManagedBuffer and Array cases seems fine too.

I don't follow this argument; if the only meaningful base is the entire string itself, then just let that be the base. I'm not sure how to interpret "what (if anything) [...] the slice is part of" because by definition a slice has to be sliced from something. This goes back to my point about providing an API that creates Substrings from "thin air"; if the base isn't obvious, we shouldn't provide APIs that gloss over that. Make the caller create the whole String and slice it themselves so it's crystal clear.

But these two bullet points make me think we should drill deeper into the MyURL example because they leave me unclear about what some of the invariants of these shared strings would be under your proposed implementation.

Above, MyURL wraps a buffer and provides efficient non-copying access to non-overlapping slices of that buffer. It's not mentioned in the snippet, but presumably there's also a way to access the entire URL as a string, and based on the representation, I imagine that would also be a non-copying shared string covering the whole buffer.

You mentioned concern about the "create a String and then slice it" approach incurring UTF-8 validation of the entire string, absent an option to skip it. But consider the following properties:

  1. All strings in Swift are valid UTF-8. Therefore, all substrings are also valid UTF-8.
  2. For any substring s, there is a base string S accessible using the Substring.base property.

Let's imagine the user accesses MyURL.scheme for a URL that is 1MB of data. What are you proposing the behavior to be here?

  1. The substring is the scheme and the base is the entire buffer, but only the scheme is UTF-8-validated or known to be valid UTF-8; validation is deferred for the entire buffer until the base is accessed (if at all).

    This would allow the creation of a buffer where you can access slices of it that are valid Substring values, but which would cause a runtime failure when you try to access their base. That's a violation of the string APIs since a Substring must come from a valid String.

  2. The substring is the scheme and the base is also just that slice of the buffer, so only the requested portion needs to be UTF-8-validated or known to be valid UTF-8.

    This would be more efficient, but would be surprising to the user of this API. If I'm given specifically a Substring when I access the scheme, I'm going to expect that its base is the full URL, that its startIndex/endIndex point to the relevant locations within that base, and so forth.

I claim that the only answer to this that can satisfy both the required relationships between String and Substring and provide a Substring that is semantically meaningful/correct to the caller is:

  1. The substring is the scheme and the base is the entire buffer, and the entire buffer has been UTF-8-validated or is known to be valid UTF-8.

Therefore, there is no advantage to providing a Substring initializer. The relationship and validation behavior is far clearer if the callee creates a String first and then slices it to get the Substring. All that's needed are the initializers to get indices from UTF-8 offsets.
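As a sketch of what a caller can already rely on when it receives a Substring (the URL is an illustrative value): the slice's indices are positions in its base, so the component's location inside the full string can be recovered.

```swift
let url = "https://example.com/index.html"
let host = url.dropFirst(8).prefix(11)   // Substring "example.com"

let base = host.base
// The slice's indices are positions in the base string, so they can be used on it directly.
let sameText = base[host.startIndex..<host.endIndex]

// Byte offsets of the slice within the base.
let start = base.utf8.distance(from: base.utf8.startIndex, to: host.startIndex)
let end = base.utf8.distance(from: base.utf8.startIndex, to: host.endIndex)

print(sameText == host)  // true
print(start, end)        // 8 19
```

A Substring whose base was only the component itself would report offsets 0 and 11 here, which is the surprise described in the second option above.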

So, to this point:

Since Swift Substring has the user-accessible base property that is a non-optional String, how would you expand that without creating a String anyway at some point (either internally, or as a consequence of the user accessing that property) or without making the API more complicated to support different kinds of bases?

I'm not sure whether the Rust analogies hold because I can't find a way to take a &str and get the base String from it, because one may not exist (e.g., if the &str is a literal). I'm by no means a Rust expert so it's entirely possible I've missed something, but since we have the base property, we need to work within that constraint. If the claim is that Swift's string model is awkward, then trying to make Substring a currency type for slices of non-String buffers is only going to make it more complex, not less, without a much bigger overhaul of the API than you're proposing here.

+1 to this. I have found UTF-8 verification to be a significant enough performance challenge that I had to abandon using Swift Strings in a library I wrote.

Please no. String and Substring are safe types: when you write an API receiving them, you know you can store and manipulate them in perpetuity without violating memory safety. The whole point of StringProtocol is that you can build unsafe strings without violating the safety of these core types. It would be as wrong to create arrays that point to unmanaged memory.

Could you explain how this makes either of those types unsafe? The concerns about storage lifetimes and which type to use are about leakage, not safety.

You can totally store a shared Substring for as long as you like: it’s just discouraged, is all.

Another +1. My app spends a considerable amount of time validating UTF-8 strings that are expected to be valid in the first place.

I am a bit concerned about the usefulness of Substring though. Having a similar option to obtain a String would be great. My use case would mostly involve using entire strings, so wasting memory by retaining a long string while only using a small slice of it would not be a concern.

This API:

extension Substring {
  public init?(sharingStorage buffer: UnsafeBufferPointer<UInt8>, owner: AnyObject)
}

Allows me to create a new Substring that can outlive the buffer it references. If the buffer is deallocated or overwritten while (a copy of) the Substring is alive, you now have a “safe” type all of whose interesting operations violate memory safety.

My understanding is that the Substring would retain owner, and the user is expected to guarantee that the buffer does not get deallocated before owner is deallocated. In that case, Substring should automatically extend the buffer's lifetime as needed.
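Here is a rough sketch of the owner pattern as I understand it. `FreeingOwner` is the name used earlier in the thread, but this particular class, its `onFree` hook and the `holders` array (which stands in for a string retaining its owner) are my own assumptions for illustration.

```swift
// Hypothetical owner: frees its buffer when the last reference to it goes away.
final class FreeingOwner {
    let buffer: UnsafeMutableBufferPointer<UInt8>
    private let onFree: () -> Void

    init(copying bytes: [UInt8], onFree: @escaping () -> Void = {}) {
        buffer = .allocate(capacity: bytes.count)
        _ = buffer.initialize(from: bytes)
        self.onFree = onFree
    }

    deinit {
        buffer.deallocate()
        onFree()
    }
}

// Returns true if the buffer was freed only after the last holder released the owner.
func ownerKeepsBufferAlive() -> Bool {
    var freed = false
    var holders: [FreeingOwner] = []
    do {
        let owner = FreeingOwner(copying: Array("hello".utf8)) { freed = true }
        holders.append(owner)   // stands in for a string retaining its owner
    }
    let aliveWhileHeld = !freed
    let text = String(decoding: UnsafeBufferPointer(holders[0].buffer), as: UTF8.self)
    holders.removeAll()         // last reference dropped: deinit runs
    return aliveWhileHeld && freed && text == "hello"
}

print(ownerKeepsBufferAlive())  // true
```

This only covers the case where the owner really controls the allocation; it says nothing about buffers that something else may free behind the owner's back.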

Correct. The only way you would violate memory safety is if there was a bug in your code.

We already have plenty of unsafe APIs which can cause undefined behaviour if not used correctly, and UB anywhere is UB everywhere: an UnsafeMutableBufferPointer overflow somewhere could corrupt a neighbouring Array's memory, and even though Array is "safe", this object's behaviour is unknown. No code can be resilient even to bugs.

An example with Array:

class SomeClass {}
let boom = Array<SomeClass>(unsafeUninitializedCapacity: 10) { ptr, count in
  count = 5 
}
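For contrast, a correct use of the same initializer initializes exactly the elements it then reports as initialized. A minimal sketch with Int elements:

```swift
let squares = [Int](unsafeUninitializedCapacity: 10) { buffer, initializedCount in
    // Initialize exactly the elements that are then reported as initialized.
    for i in 0..<5 {
        (buffer.baseAddress! + i).initialize(to: i * i)
    }
    initializedCount = 5
}

print(squares)  // [0, 1, 4, 9, 16]
```

The snippet before this one skips the initialization step but still reports five elements, which is the misuse being discussed.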

Maybe it's worth adding the word unsafe to the name for visibility? init(sharingUnsafeStorage:)?

  1. The API as specified does not make those requirements on the user, so at the very least the doc comment needs to be different. This is not a minor point: the specification is part of the API.
  2. Yes, if the API were added it would absolutely have to have the word “unsafe” in the name.
  3. Even if you change the doc comment and the name, it is a major and serious change to the guarantees of the system that it will be possible to create an unsafe instance of Substring using a correct invocation of its own initializer. Even Array.init(unsafeUninitializedCapacity:initializingWith:) doesn't let you do that: after a correct initialization, there are no safe operations that will violate the memory safety of uses of that instance. I'm extremely wary of adding anything like that, and for me the burden of proof that this is a door we really want to open is on the proposers.

I haven’t used Rust but I have wanted to borrow a CString in Swift before. I agree with @dabrahams that we should want to keep String and Substring safe.

I would rather we opened up StringProtocol so that it is safe to add custom conforming types. One could be an UnsafeBorrowedString type of some kind with the appropriate APIs.

@dabrahams has a point. As pitched, this would be as incongruent as an Array that has unmanaged backing storage. Honestly I would love to have one of those for work with raw audio data, but I digress.

IMHO a closure-based API is the only viable way to achieve (most of) what is suggested here. And I think that could be very useful, despite not fulfilling the needs of something like a longer-lived MyURL type. A closure-based API would make it significantly harder to do something you shouldn’t with an unsafe (Sub)String instance.

Alternatively we could create a new (unsafe) type that conforms to StringProtocol. In this case (and maybe also even in its absence) there would be a need to make more of Apple’s APIs accept StringProtocol rather than String. (I haven’t tried but is this even feasible right now? Not sure if StringProtocol has Self requirements). To me this seems to be the only reasonable option for the MyURL case. To me personally it is also the less interesting case, but I wouldn’t argue against a proposal for it by any means.

Roughly speaking, that's called Unsafe[Mutable]BufferPointer. It lacks RangeReplaceableCollection conformance and does not range-check your indices, but you can create a simple wrapper that adds that checking if you want.

Not much; you can just return it from the closure. Regardless, such an API would still violate the invariant I'm trying to defend: once correctly constructed, a Substring is safe to use in perpetuity.

StringProtocol is not meant to be used as an existential type, and it has associated types, so it can't be. To create APIs that work on any StringProtocol, you make them generic:

func f(s: String)

becomes

func f<T: StringProtocol>(s: T)
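A small runnable version of that pattern, as a sketch: `countVowels(in:)` is a made-up function, but it shows one generic signature accepting both String and Substring without any conversion.

```swift
// Works with any StringProtocol conformer; no copy to String is needed.
func countVowels<S: StringProtocol>(in text: S) -> Int {
    text.reduce(0) { count, character in
        "aeiouAEIOU".contains(character) ? count + 1 : count
    }
}

let sentence = "hello world"
print(countVowels(in: sentence))            // 3
print(countVowels(in: sentence.prefix(5)))  // 2
```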

I appreciate your reply here @dabrahams, thanks for taking the time 🙂

I was just yesterday trying to change the capacity of an UnsafeMutableBufferPointer without deallocating and reallocating the entire thing (and avoiding a manual copying step in between) like you can with Array, i.e. array.removeLast(100), but it doesn't appear possible to just change the count directly. I could probably just construct a new Unsafe..BufferPointer with the same base address and the reduced count, but is that "legal"?
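For reference, this is roughly what constructing a shorter buffer pointer over the same memory looks like. It is only a sketch: the element type and sizes are arbitrary, and it assumes the original pointer is kept around so the full allocation can still be released through it.

```swift
let full = UnsafeMutableBufferPointer<Float>.allocate(capacity: 8)
full.initialize(repeating: 0)

// A shorter view over the first three elements. No memory is moved or freed.
let shorter = UnsafeMutableBufferPointer(rebasing: full[..<3])
shorter[0] = 1.5

let sharesMemory = shorter.baseAddress == full.baseAddress && full[0] == 1.5
let shorterCount = shorter.count

// The allocation is still eight elements long, so it is released through `full`.
full.deallocate()

print(shorterCount, sharesMemory)  // 3 true
```

Note that this changes only the view's count; the capacity of the underlying allocation stays the same.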

On that note, it would be incredible to have an API on Array that looks like removeLast(_ k: Int, keepingCapacity: Bool) as an analogue to removeAll(keepingCapacity: Bool); I probably wouldn't need the Unsafe API at that point.

Sorry for the tangent here, back to the actual topic...

Do you mean something like

let buffer: [UInt8] = ...
var escapedString: String?
buffer.withUnsafeTemporaryUTF8String { string in
  escapedString = string // evil?
}

I wonder if it's possible to disallow that from happening at the compiler level? In any case, wouldn't this be much the same as illegally saving the buffer from any of the other withUnsafe... closures?
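For comparison, this is how the existing withUnsafe... closures are normally meant to be used: whatever leaves the closure is an owning copy, never the pointer itself. A minimal sketch with a made-up byte array:

```swift
let buffer: [UInt8] = Array("hello".utf8)

// The pointer is only valid inside the closure, so return an owning copy, not the pointer.
let copied: String = buffer.withUnsafeBufferPointer { bytes in
    String(decoding: bytes, as: UTF8.self)
}

print(copied)  // hello
```

A temporary-string API would presumably need the same discipline, with the copy happening only when the string has to outlive the closure.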

First of all, I would think it should be obvious that You Shouldn't Do That™ (if you do, though, I do see the significant semantic difference of the resulting type in one still being called Unsafe and not in the other); secondly, wouldn't Swift actually "do better" here by creating a copy of the String and its underlying buffer automatically?

I am not familiar with the runtime enough to know whether it'd be possible (maybe it's already implemented) to create a copy of the String and maybe even verify its UTF8-compatibility if it is escaped, but it doesn't seem impossible or undesirable to me as a layman.

I do see an edge case of someone escaping a substring instead of the String itself here. Would it be unviable to have it still reference buffer in that case? I think the major issue we're facing is that some of the suggestions in this thread are talking about making temporary strings from entirely unmanaged (unsafe) memory, which would indeed make this edge case unviable. But using the array of (U)Int8s as above it doesn't seem impossible.


This appears to be a major hurdle given how pretty much all Apple APIs work (including parts of the Swift stdlib, IIRC). Foundation's APIs changing to generically take StringProtocol, for example, seems extremely unlikely.

The possibility of using protocols with associated type requirements as if they were existential types seemed to be swimming around a couple of years ago, but maybe it's a non-starter in reality (?). Until we can do that (or change significant amounts of API), this option doesn't seem viable for most use cases. It would mean copying to String every time devs want to use the custom type for anything not defined on StringProtocol directly, which kind of defeats the purpose.

When the doc comment says "The buffer must not be mutated while there are any strings sharing it.", that is meant to include deallocation.

So, I notice a subtle difference in your requirements here: for Substring, you take issue with a "correct invocation of [the] initializer" returning an invalid instance, but for Array, the requirement is looser: it must be a "correct initialization".

I would argue that passing a closure that fails to initialize all of its elements to Array.init(unsafeUninitializedCapacity:initializingWith:) is not a correct initialization, and it results in the same kind of "unsafe instance" as a shared Substring whose owner cannot guarantee the memory's lifetime. Yes, the objects are of the correct type and return from the initializer without any failure indication (so I guess they are "apparently correct invocations"), but they don't do what the API says they need to do.

And the issues with one are not easier to spot, nor easier to debug than the other. The buggy closure might fail to initialize any element of the Array, so perhaps users wouldn't even notice until much later in their programs. Array does not guarantee that all of its memory is correctly constructed after this call (because it is partly relying on the user), and any operation like calling .first or .dropFirst(5).first might in fact access uninitialized memory.

Firstly: this doesn't change anything in practice about Substring. The source of its data might not always be a String-owned buffer any more, but it will be a String-managed buffer, with ARC ensuring that the owner object lives for at least as long as the last Substring. Nothing changes for users of Substring.

Since I'm describing deallocation as a mutation, there is a single fundamental requirement for safety: all other references to the buffer must copy before mutating while owner is shared. This implicitly means they have to check that owner is not multiply-referenced before deallocating buffer. If you get a buffer from C and there are no other references, or they never mutate/deallocate - great! You may want to free the memory afterwards to be a good citizen, but that situation will never be unsafe.

Some things we can take just from this:

  1. Substrings constructed from shared Arrays are always safe.
  2. ManagedBuffers are safe if they use copy-on-write.
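To illustrate the copy-before-mutating rule, here is a sketch using a plain class as storage instead of a ManagedBuffer. `Storage` and `SharedText` are made-up names; the uniqueness check is the standard `isKnownUniquelyReferenced`.

```swift
final class Storage {
    var bytes: [UInt8]
    init(_ bytes: [UInt8]) { self.bytes = bytes }
}

struct SharedText {
    private var storage: Storage

    init(_ text: String) { storage = Storage(Array(text.utf8)) }

    var text: String { String(decoding: storage.bytes, as: UTF8.self) }

    mutating func append(_ more: String) {
        // Copy first if anyone else still references the storage.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(storage.bytes)
        }
        storage.bytes.append(contentsOf: more.utf8)
    }
}

var original = SharedText("http")
let snapshot = original      // shares the same Storage object
original.append("s")         // triggers a copy; `snapshot` is unaffected
print(original.text, snapshot.text)  // https http
```

Any string sharing the storage would count as one more reference, so a mutation by the original owner would copy rather than write underneath it.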

Is it 100% bullet-proof? Of course not. But at that point we're talking about things like over-releasing ManagedBuffers via Unmanaged, and those are rare use-cases with pretty much nothing anybody can do about them. We have very good tools which can help detect use-after-frees, and anybody that deep in unmanaged and unsafe APIs should be well aware of the need to test what they've created.

This thread has lots of discussion about the use-cases, and the effect on performance would be dramatic. This performance can be unlocked in a safe way, and it is not very difficult to do. I'd say it's harder to mess up than Array.init(unsafeUninitializedCapacity:initializingWith:).

OK, I understand your intent. As a technical matter, mutation, deallocation, and the ending of lifetimes are orthogonal, and it's the last of these that you want to ask the user to guarantee does not happen until the Substring is deallocated.

So, I notice a subtle difference in your requirements here: for Substring, you take issue with a "correct invocation of [the] initializer" returning an invalid instance, but for Array, the requirement is looser: it must be a "correct initialization".

That difference in phrasing was not intended to convey any difference in meaning, and I can't imagine what difference you could read into it. I never mentioned “returning an invalid instance.” I'm concerned with an invocation returning an instance that later becomes invalid.

I would argue that passing a closure that fails to initialize all of its elements to Array.init(unsafeUninitializedCapacity:initializingWith:) is not a correct initialization,

Agreed.

and it results in the same kind of "unsafe instance" as a shared Substring whose owner cannot guarantee the memory's lifetime.

I assume that by “owner” you mean the thing passed as the owner parameter, whose doc comment says:

If the word “optional” is to have any meaning at all, it has to mean I could pass something that's not really an object (e.g. Int.self as AnyObject), so I can build a Substring that is backed by unmanaged memory. If that's correct usage, it creates an instance that can become invalid to use at any time.

Firstly: this doesn't change anything in practice about Substring. The source of its data might not always be a String-owned buffer any more, but it will be a String-managed buffer, with ARC ensuring that the owner object lives for at least as long as the last Substring.

If that's the API you intend to propose, you need to change more things in the doc comment, including getting rid of the word “optional.” Especially for an unsafe API, it's crucial that it be documented rigorously, and that the conditions for using it correctly are simple to understand. I'm sorry to be hard-nosed about this, but the burden really has to fall on the proposer to get that right.

Since I'm describing deallocation as a mutation, there is a single fundamental requirement for safety: all other references to the buffer must copy before mutating while owner is shared. This implicitly means they have to check that owner is not multiply-referenced before deallocating buffer.

I don't understand what multiple references have to do with it. If even a single reference exists to the owner, you have to assume there's a Substring somewhere that depends on the owner.

If you get a buffer from C and there are no other references, or they never mutate/deallocate - great! You may want to free the memory afterwards to be a good citizen, but that situation will never be unsafe.

A typical C API may pass your callback a buffer that's only good for the duration of the callback's execution.

Presumably, you need the Substring so you can pass it to some API. That API is free to copy and store the Substring as long as it wants, and if it doesn't do so today it's free to start doing so tomorrow, and it won't tell you when it's done with the instance. Unless you have a way to keep the bytes in the buffer alive until that hypothetical instance goes away, you've created an unsafe instance.

This is different from the Array case because the guarantees you are asking for are dynamic properties of code that has yet to be executed at the time of construction, and cannot usually be given by the code constructing the Substring or by the Substring's own initializer.

This is qualitatively different in the ways I've described. It sounds to me like you're minimizing the importance of, or simply failing to recognize, the differences. Until we're actually acknowledging those differences and—as a community—making a rational choice about whether we want the library to go in that direction, I'm going to be opposed to it.

I'm all for unlocking performance. That said, I seriously doubt there are enough Substrings running around existing APIs that this change would enable a drop-in speedup. Instead, many Strings would need to change to Substring. At that point, we may as well start talking about making APIs generic over StringProtocol so that when you really need performance you can pass unmanaged bytes that don't even incur ARC overhead when copied.