Pitch: Contiguous Strings

SE PR and implementation.

edit: This revision removes some attributes (redundant with "Effect on API resilience" section) and adds more alternatives considered in response to feedback.

edit 2: More alternatives considered, in response to feedback.

Contiguous Strings

Introduction

One of the most common API requests from performance-minded users of string is a way to get direct access to the raw underlying code units. Now that Swift 5 uses UTF-8 for its preferred encoding, we can provide this.

“Contiguous strings” are strings that are capable of providing a pointer and length to validly encoded UTF-8 contents in constant time. Contiguous strings include:

  • All native Swift string forms
  • Lazily bridged Cocoa strings that provide a pointer to contiguous ASCII
  • Any “shared string” concept we may add in the future

Noncontiguous strings include:

  • Lazily-bridged Cocoa strings that don’t provide a pointer to contiguous ASCII (even if they do have contiguous UTF-16 code units)
  • Any “foreign string” concept we may add in the future

Contiguous strings are heavily optimized and more efficient to process than noncontiguous strings. However, noncontiguous string contents don’t have to be copied when imported into Swift, which is profitable for strings which may never be read from, such as imported NSString dictionary keys.

Swift-evolution thread: TBD

Motivation

In Swift 5.0, String.UTF8View supports withContiguousStorageIfAvailable(), which succeeds on contiguous strings but returns nil on noncontiguous strings. While it’s nice to run a fast algorithm when you can, users are left in the dark if this doesn’t succeed. Even if it’s known to probably succeed, guarding against nil and having a valid fall-back path is not ergonomic.

Proposed solution

We propose adding to String and Substring:

  • A way to query if a string is contiguous
  • A way to force a string to be contiguous
  • A way to run a closure over the raw UTF-8 contents of a string

(For rationale on why StringProtocol was excluded, see “What about StringProtocol?” in Alternatives Considered)

Detailed design

extension String {
  /// Returns whether this string is capable of providing access to
  /// validly-encoded UTF-8 contents in contiguous memory in O(1) time.
  ///
  /// Contiguous strings always operate in O(1) time for withUTF8 and always
  /// give a result for String.UTF8View.withContiguousStorageIfAvailable.
  /// Contiguous strings also benefit from fast-paths and better optimizations.
  ///
  public var isContiguous: Bool { get }

  /// If this string is not contiguous, make it so. If this mutates the string,
  /// it will invalidate any pre-existing indices.
  ///
  /// Complexity: O(n) if non-contiguous, O(1) if already contiguous
  ///
  public mutating func makeContiguous()

  /// Runs `body` over the content of this string in contiguous memory. If this
  /// string is not contiguous, this will first make it contiguous, which will
  /// also speed up subsequent access. If this mutates the string,
  /// it will invalidate any pre-existing indices.
  ///
  /// Note that it is unsafe to escape the pointer provided to `body`. For
  /// example, strings of up to 15 UTF-8 code units in length may be represented
  /// in a small-string representation, and thus will be spilled into
  /// temporary stack space which is invalid after `withUTF8` finishes
  /// execution.
  ///
  /// Complexity: O(n) if non-contiguous, O(1) if already contiguous
  ///
  public mutating func withUTF8<R>(
    _ body: (UnsafeBufferPointer<UInt8>) throws -> R
  ) rethrows -> R
}

// Contiguous UTF-8 strings
extension Substring {
  /// Returns whether this string is capable of providing access to
  /// validly-encoded UTF-8 contents in contiguous memory in O(1) time.
  ///
  /// Contiguous strings always operate in O(1) time for withUTF8 and always
  /// give a result for String.UTF8View.withContiguousStorageIfAvailable.
  /// Contiguous strings also benefit from fast-paths and better optimizations.
  ///
  public var isContiguous: Bool { get }

  /// If this string is not contiguous, make it so. If this mutates the
  /// substring, it will invalidate any pre-existing indices.
  ///
  /// Complexity: O(n) if non-contiguous, O(1) if already contiguous
  ///
  public mutating func makeContiguous()

  /// Runs `body` over the content of this substring in contiguous memory. If
  /// this substring is not contiguous, this will first make it contiguous,
  /// which will also speed up subsequent access. If this mutates the substring,
  /// it will invalidate any pre-existing indices.
  ///
  /// Note that it is unsafe to escape the pointer provided to `body`. For
  /// example, strings of up to 15 UTF-8 code units in length may be represented
  /// in a small-string representation, and thus will be spilled into
  /// temporary stack space which is invalid after `withUTF8` finishes
  /// execution.
  ///
  /// Complexity: O(n) if non-contiguous, O(1) if already contiguous
  ///
  public mutating func withUTF8<R>(
    _ body: (UnsafeBufferPointer<UInt8>) throws -> R
  ) rethrows -> R
}

(For rationale as to why withUTF8 is marked mutating, see “Wait, why is withUTF8 marked as mutating?” in Alternatives Considered.)

Source compatibility

All changes are additive.

Effect on ABI stability

All changes are additive. ABI-relevant attributes are provided in “Detailed design”.

Effect on API resilience

The APIs for String and Substring have their implementations exposed, as they are expressed in terms of assumptions and mechanisms already (non-resiliently) present in Swift 5.0.

Alternatives considered

Wait, why is withUTF8 marked as mutating?

If the string is noncontiguous, something has to happen to get contiguous UTF-8 contents in memory. We have 3 basic options here:

1. Copy into a new heap allocation, run the closure, then throw it away

This approach takes inspiration from Array’s withUnsafeBufferPointer, which runs directly on its contents if it can, otherwise it makes a temporary copy to run over. However, unlike Array, noncontiguous strings are much more common and there are no type-level guarantees of contiguity. For example, Array with a value-type Element is always contiguous, so the vast majority of performance sensitive operations over arrays (e.g. Array<Int>) can never trigger this allocation.

Because String does not have Array’s guarantee, this approach would harm the ability to reason about the performance characteristics of code. This approach would also keep the string on the slow-path for access after the method is called. If the contents are worth reading, they’re worth ensuring contiguity.

2. Trap if noncontiguous

This would be the ideal API for platforms without Cocoa interoperability (unless/until we introduce noncontiguous foreign strings, at least), as it will always succeed. However, this would be a terrible source of crashes in libraries and applications that forget to exhaustively test their code with noncontiguous strings. This would hit cross-platform packages especially hard.

Even for libraries confined to Cocoa-interoperable platforms, whether an imported NSString is contiguous or not could change version-to-version of the OS, producing difficult to reason about bugs.

3. Fine, make it mutating

The proposed compromise makes withUTF8 mutating, forcing the contents to be bridged if noncontiguous. This has the downside of forcing such strings to be declared var rather than let, but mitigates the downsides of the other approaches. If the contents are worth reading, they’re worth ensuring contiguity.

Or, introduce a separate type, ContiguousUTF8String, instead

Introducing a new type would introduce an API schism and further complicate the
nature of StringProtocol. We don't consider the downsides to be worth the upside
of dropping mutable.

What about StringProtocol?

Adding these methods to StringProtocol would allow code to be generic over String and Substring (and in the somewhat unlikely event we add new conformers, those as well). This can be done at any time, but there are some issues with doing this right now.

When it comes to adding these methods on StringProtocol, we have two options:

1. Add it as a protocol extension

This approach would add the functions in an extension, ideally resilient and versioned. Unfortunately, makeContiguous() wouldn’t be implementable, as we cannot reassign self to a concrete type, so we’d have to add some additional requirements to StringProtocol surrounding forming a new contiguous string of the same concrete type.

2. Add it as a requirement with a default implementation

This approach would add it as a customization hook that’s dynamically dispatched to the concrete type’s real implementations. The default implementation would be resilient and versioned and trap if called; any new conformers to StringProtocol would need to be versioned and accommodated here.

Adding new versioned-and-defaulted requirements to a protocol can be done at any point while preserving ABI stability. For now, we’re not sure it’s worth the extra witness table entries at this point. This also hits some of the pain points of option 1: any conformers to StringProtocol must satisfy makeContiguous’s post-condition of ensuring contiguity without changing concrete type.

This can be added in the future.

Name it isContiguousUTF8 and makeContiguousUTF8()

This proposal is introducing the concept of "contiguous strings" which is
blessed as the fast-path in String's stable ABI for strings that can provide
UTF-8 contents in contiguous memory. If a string cannot provide UTF-8 content in
contiguous memory, it does receive these benefits, even if it happens to have
content in some other encoding in contiguous memory.

We feel the concept of string contiguity in Swift is inherently tied to UTF-8,
and worth claiming the term "contiguous" unqualified in encoding. That being
said, this is a weakly held opinion and isContiguousUTF8 is acceptable as
well.

Put this on the UTF8View

Contiguous UTF-8 is baked into String's ABI as the fastest representation; this
is about directly exposing that fact to performance-minded users. Other views
will not have this, so we believe it makes sense to put directly on String.
UTF-8 is special for performance and contiguity concerns.

The Most Viable Alternative

Here's an alternative formulation that uses an explicit "UTF8" in the name and
avoids mutating and index-invalidation (by using strategy #1 above).

extension [Sub]String {
  /// <as above>
  public var isContiguousUTF8: Bool { get }

  /// Produce a contiguous UTF-8 string with the same contents as `self`.
  /// Returns `self` is already contiguous, otherwise returns a new copy.
  ///
  /// Complexity: O(1) if `self` is contiguous UTF-8, otherwise O(n)
  ///
  public var contiguousUTF8: [Sub]String { get }

  /// Runs `body` over the content of this string in contiguous memory. If this
  /// string is not contiguous UTF-8, this will first make a contiguous copy and
  /// run over that.
  ///
  /// Note that it is unsafe to escape the pointer provided to `body`, even if
  /// `self` is contiguous UTF-8. For example, strings of up to 15 UTF-8 code
  /// units in length may be represented in a small-string representation, and
  /// thus will be spilled into temporary stack space which is invalid after
  /// `withUTF8` returns.
  ///
  /// Complexity: O(1) if `self` is contiguous UTF-8, otherwise O(n)
  ///
  public func withUTF8<R>(
    _ body: (UnsafeBufferPointer<UInt8>) throws -> R
  ) rethrows -> R
}

This formulation would be preferable for situations where strings are nearly-
always contiguous UTF-8. Also, the computed property is less invasive than a
mutating method as it can be used on the right-hand-side of a let binding.

Choosing between this and the version proposed depends on how users expect to
use it. We're eagerly awaiting feedback during the review.

12 Likes

Thanks @Michael_Ilseman! I very very much support this pitch. Right now we have pretty awful code like this method and the one below in SwiftNIO which I would love to remove and your pitch allows for exactly that.

The mutating withUTF8 is a bit unfortunate because it would often need to propagate to ‘further out’ APIs to not throw the possible conversion away but I think that’s okay. There’s just no really good way to do this, we’ve hit this a number of times too. (Digression: For example, NIO1 stores the HTTP headers and URL as plain bytes and produces Strings lazily. That works great for benchmarks but if the user asks for the URL twice then they pay the conversion cost twice because we have to throw away the String because retrieving a header is non-mutating. In NIO2 we will go for eager String conversation because UTF8-native Strings make everything so much faster :+1:.)

1 Like

Yeah, unfortunately that's just how values work: the callee cannot mutate the value unless it was passed inout. This is the same for the alternative options as well as calling makeContiguous().

Agreed, I’m happy to pay this price for all the goodness value types give us elsewhere.

1 Like

Bikeshedding: I don't feel like it's obvious that "contiguous" means "contiguous UTF-8". That could be fixed either by naming these things *ContiguousUTF8, or by putting these operations on String.UTF8View instead.

7 Likes

Agree with this and it's likewise my only suggestion--isContiguousUTF8, makeContiguousUTF8, withContiguousUTF8.

1 Like

I understand the first two, but why this one?

It's not strictly necessary, but I feel it signals to the reader at least a little as to why it must be mutating.

Hello, I have two feedbacks.

This mutating method looks odd to me. I would expect a non mutating method, just like other string transformation methods lowercased(), etc:

public func contiguousUTF8() -> String { ... }

Since we're at it, why not define a new String type, ContiguousUTF8String? This would also be a solution to the mutating withUTF8 method.

Quick and dirty draft:

extension StringProtocol {
    public func contiguousUTF8String() -> ContiguousUTF8String { ... }
}

public struct ContiguousUTF8String: StringProtocol {
    public func withUTF8<R>(
        _ body: (UnsafeBufferPointer<UInt8>) throws -> R
    ) rethrows -> R { ... }
}
2 Likes

The intent is to claim the term "contiguous" for the fast-path blessed in String's ABI, which is UTF-8. I updated the alternatives considered section for this.

I updated the Alternatives Considered to address this.

1 Like

Thanks Michael. You wrote:

Or, introduce a separate type, ContiguousUTF8String, instead

Introducing a new type would introduce an API schism and further complicate the nature of StringProtocol. We don't consider the downsides to be worth the upside of dropping mutable.

I beg you not to dismiss the intent even if the initially proposed solution is ill-advised.

The intent is to get rid of the mutating methods, and I think we should try harder to remove them.

Unless I'm totally mistaken on the intended use cases of the pitch, I see two uses for contiguous UTF8 strings:

  1. One shot access to an efficient UTF8 storage.
  2. Repeated access to an efficient UTF8 storage.

The first use case looks a lot like what people can do today with string.utf8CString, or ContiguousArray(string.utf8). The common point is that this use case can live with a temporary allocation. (But we can and will avoid this allocation if the string is already contiguous.)

The second use case is the one that motivates more your mutating methods. Once converted to a contiguous UTF8 string, such a string must provide two guarantees: fast access and allocation-free access to the UTF8 storage.

So let me please give another attempt:

let string: String = /* any String, maybe a Cocoa string */

// One shot access to a contiguous UTF8 buffer:
string.contiguousUTF8.withUTF8 { buffer in ... } // Maybe allocate

// Repeated access to a contiguous UTF8 buffer:
let contiguousString = String(string.contiguousUTF8) // Maybe allocate
contiguousString.contiguousUTF8.withUTF8 { buffer in ... } // No allocation guaranteed
contiguousString.contiguousUTF8.withUTF8 { buffer in ... } // No allocation guaranteed

The mutating variant is still possible:

var string: String = /* any String, maybe a Cocoa string */
string = String(string.contiguousUTF8) // Maybe allocate
string.contiguousUTF8.withUTF8 { buffer in ... } // No allocation guaranteed
string.contiguousUTF8.withUTF8 { buffer in ... } // No allocation guaranteed

Do you think we can look at this with more details?

Sorry, I was not trying to dismiss the idea of eliminating the mutating keyword. In fact, I wrote more about attempts at removing it in the alternatives considered section than I did about the proposal itself!

Did you see the 3 strategies I presented? Do you have a different takeaway than what's given there? I'd love to hear your view.

These do not result in Strings, so you lose out on all of String's API. A big motivation behind efficient UTF-8 storage for users who had to resort to these means for efficiency's sake. They really regretted having to do this when they lost all of String's API and the ability to pass these as Strings (without making a new allocation and copy).

Ah, so I think I see what you're getting at. An alternative to makeContiguous() here would be var contiguousUTF8: String { get }, which returns self if it is contiguous, otherwise makes a copy.

But, if that results in a String, does String have withUTF8 on it? What does it do, i.e. is it allocate-and-throw-away, mutating, trapping, etc? See the reasoning concerning these 3 options from the pitch.

In practice what would happen underneath if you call makeContiguous or withUTF8 on a Substring?
Would that make the storage of the whole String (that it was sliced from) contiguous?

Bikeshedding: I don't feel like it's obvious that "contiguous" means "contiguous UTF-8". That could be fixed either by naming these things *ContiguousUTF8 , or by putting these operations on String.UTF8View instead.

The intent is to claim the term "contiguous" for the fast-path blessed in String's ABI, which is UTF-8. I updated the alternatives considered section for this.

Even if the default encoding is UTF-8, it feels natural that access to UTF8 should be kept on String.UTF8View. That part of jrose comment is not yet mentioned in alternatives considered.

Example:

var str: String = ...
str.utf8.withContiguousStorage( .. ) // Same behaviour as proposed String.withUTF8

But I can already guess that having a mutating function UTF8View won't work.

1 Like

Interestingly, the documentation for withContiguousStorageIfAvailable() states that the type will first be made contiguous (meaning myString.utf8's version should never return nil, even for bridged strings). That's what Array does (Array's implementation just calls in to withUnsafeBufferPointer). I think that's probably bad. We should change it.

Contiguous-ness in Swift makes me kind of depressed. For many generic algorithms, it's extremely important to be able to implement them efficiently. But we don't appear to have a consistent story about how to express or deal with it - it's a kind of grab-bag of assorted, ad-hoc concepts.

So the important questions for me are: "Why do these APIs not already exist for Array?" (and by extension Data/DispatchData) and "Is it worth adding them?"

1 Like

Thanks Michael for your reply. I know it's easier to talk while the iron is hot, but please give me a few hours or days until I find the time to give a proper answer.

Quick alternative, what do you all think?

extension [Sub]String {
  var isContiguousUTF8: Bool { get }
  var contiguousUTF8: String { get } // Returns self if contiguous, otherwise copy
  func withUTF8<R>(_ (UnsafeBufferPointer<UInt8>) throws -> R) rethrows -> R
}

Where withUTF8 pursues alternative #1 mentioned, that is, it will have to allocate and throw away if not already contiguous.

I'm wondering if, in practice, the burden of mutating is too high and contiguity will become more common over time. Or, would this just force people to put if !str.isContiguousUTF8 { str = str.contiguousUTF8 } throughout their code. Thoughts?

edit: More thoughts and rationale:

The reason for var contiguousUTF8 { get } instead of mutating func makeContiguous() is that you can use it on the RHS of a let binding: let myString = someFunction().contiguousUTF8.

While we don't like to sprinkle "go faster" methods/properties in code, this is less invasive than mutating functions. If the future favors contiguity, this pattern fades away more gracefully than mutating functions.

2 Likes

Only if it's part of a direct subscript access, as in str[..<idx].withUTF8 { ... }. Otherwise, slices are considered distinct values, and this would trigger a copy of the sliced contents.

That already exists as a withContiguousStorageIfAvailable. Contiguous UTF-8 is baked into String's ABI as the fastest representation; this is about directly exposing that fact to performance-minded users. Other views will not have this, so I think it makes sense to put directly on String for performance-sensitive code. UTF-8 is special in this regard.

As I mentioned in "Alternatives Considered", Arrays of values types are always contiguous, so it's less of an issue in practice.

I do agree that we will want to be refining the concept of contiguity more in the standard library. String would be a little different than the other types because it doesn't not have a generic Element type, nor does it store Elements in memory.

Instead of the mutating makeContiguous() method or non-mutating contiguousUTF8 property, could the existing initializers copy the characters to contiguous storage if necessary?

-let myString = someFunction().contiguousUTF8
+let myString = String(someFunction())

Instead of isContiguous or isContiguousUTF8, could we put a hasContiguousStorage property on the String.UTF8View to go alongside the withContiguousStorageIfAvailable(_:) method?

-myString.isContiguousUTF8
+myString.utf8.hasContiguousStorage

Do we also need similar APIs for UTF-16 storage? Swift on Windows will probably need to use wchar_t-based (UTF-16) APIs rather than char-based (ANSI code page) APIs. Unless a future version of Windows could set the default code page to UTF-8.

2 Likes