Pitch: UTF-8 Processing Over Unsafe Contiguous Bytes

link to gist

UTF-8 Processing Over Unsafe Contiguous Bytes

Introduction and Motivation

Native Strings are stored as validly-encoded UTF-8 bytes in a contiguous memory buffer. The standard library implements String functionality on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose exposing this functionality as API for more advanced libraries and developers.

This pitch focuses on a portion of the broader API and functionality discussed in Pitch: Unicode Processing APIs. That broader pitch can be divided into 3 kinds of API additions:

  1. Unicode processing API for working with contiguously-stored valid UTF-8 bytes
  2. Element-based stream processing functionality. E.g., a stream of UInt8 can be turned into a stream of Unicode.Scalar or Characters.
  3. Stream-of-buffers processing functionality, which provides a lower-level / more efficient implementation for the second area.

This pitch focuses on the first.

Proposed Solution

We propose UnsafeValidUTF8BufferPointer, which exposes an API surface similar to String's for working with validly-encoded UTF-8 code units in contiguous memory.

Detailed Design

UnsafeValidUTF8BufferPointer consists of a (non-optional) raw pointer and a length, with some flags bit-packed in.

/// An unsafe buffer pointer to validly-encoded UTF-8 code units stored in
/// contiguous memory.
///
/// UTF-8 validity is checked upon creation.
///
/// `UnsafeValidUTF8BufferPointer` does not manage the memory or guarantee
/// memory safety. Any overlapping writes into the memory can lead to undefined 
/// behavior.
///
@frozen
public struct UnsafeValidUTF8BufferPointer {
  @usableFromInline
  internal var _baseAddress: UnsafeRawPointer

  // A bit-packed count and flags (such as isASCII)
  @usableFromInline
  internal var _countAndFlags: UInt64
}

It differs from UnsafeRawBufferPointer in that its contents are guaranteed, upon construction, to be validly-encoded UTF-8. This guarantee speeds up processing significantly relative to performing validation on every read. It is unsafe because it is an API surface on top of UnsafeRawPointer, inheriting all of the unsafety therein: developers must manually guarantee invariants such as lifetime and exclusivity. It is based on UnsafeRawPointer rather than UnsafePointer<UInt8> so as not to bind memory to a type.

Validation and creation

UnsafeValidUTF8BufferPointer is validated at initialization time, and encoding errors are thrown.

extension Unicode.UTF8 {
  @frozen
  public enum EncodingErrorKind: Error {
    case unexpectedContinuationByte
    case expectedContinuationByte
    case overlongEncoding
    case invalidCodePoint

    case invalidStarterByte

    case unexpectedEndOfInput
  }
}
// All the initializers below are `throw`ing, as they validate the contents
// upon construction.
extension UnsafeValidUTF8BufferPointer {
  @frozen
  public struct DecodingError: Error, Sendable, Hashable, Codable {
    public var kind: UTF8.EncodingErrorKind
    public var offsets: Range<Int>
  }

  // ABI traffics in `Result`
  @usableFromInline
  internal static func _validate(
    baseAddress: UnsafeRawPointer, length: Int
  ) -> Result<UnsafeValidUTF8BufferPointer, DecodingError>

  @_alwaysEmitIntoClient
  public init(baseAddress: UnsafeRawPointer, length: Int) throws(DecodingError)

  @_alwaysEmitIntoClient
  public init(nulTerminatedCString: UnsafeRawPointer) throws(DecodingError)

  @_alwaysEmitIntoClient
  public init(nulTerminatedCString: UnsafePointer<CChar>) throws(DecodingError)

  @_alwaysEmitIntoClient
  public init(_: UnsafeRawBufferPointer) throws(DecodingError)

  @_alwaysEmitIntoClient
  public init(_: UnsafeBufferPointer<UInt8>) throws(DecodingError)
}
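
For illustration, here is a minimal sketch of how creation might look if the API is adopted as pitched (the isASCII accessor used here is introduced below); `bytes` is a hypothetical caller-managed buffer whose lifetime and contents the caller must guarantee:

func process(_ bytes: UnsafeBufferPointer<UInt8>) {
  do {
    // Validation happens once, here; the thrown error carries the kind of
    // encoding error and the byte offsets at which it occurred.
    let utf8 = try UnsafeValidUTF8BufferPointer(bytes)
    // Use `utf8` only while `bytes` remains valid and unmutated.
    print(utf8.isASCII)
  } catch {
    print("invalid UTF-8: \(error)")
  }
}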

Unsafety and encoding validity

Every way to construct an UnsafeValidUTF8BufferPointer ensures that its contents are validly-encoded UTF-8. Thus, it introduces no new source of unsafety beyond that inherent in unsafe pointers: the requirement that lifetime and exclusive access be enforced manually by the programmer. A write into this memory that violates encoding validity would also violate exclusivity.

If we did not guarantee UTF-8 encoding validity, we'd be open to new security and safety concerns beyond unsafe pointers.

With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that extends beyond the end of the memory buffer, so that the bounds associated with the buffer differ from the bounds dictated by its contents.

Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison).
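
As a concrete illustration (the byte values below are standard UTF-8 facts, not new API): U+002F ("/") has exactly one valid encoding, but a lenient decoder could also derive it from an ill-formed overlong sequence, which a byte-wise scan for 0x2F would miss.

// Valid UTF-8 encodes U+002F ("/") as a single byte:
let valid: [UInt8] = [0x2F]
// An overlong, ill-formed two-byte sequence whose payload bits also decode
// to U+002F; searching the bytes for 0x2F would not find it:
let overlong: [UInt8] = [0xC0, 0xAF]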

UnsafeValidUTF8BufferPointer is unsafe in all the ways that unsafe pointers are unsafe, but in no additional ways.

Accessing contents

Flags and raw contents can be accessed:

extension UnsafeValidUTF8BufferPointer {
  /// Returns whether the validated contents were all-ASCII. This is checked at
  /// initialization time and remembered.
  @inlinable
  public var isASCII: Bool

  /// Access the underlying raw bytes
  @inlinable
  public var rawBytes: UnsafeRawBufferPointer
}

Like String, UnsafeValidUTF8BufferPointer provides views for accessing Unicode.Scalars, UTF16.CodeUnits, and Characters.

extension UnsafeValidUTF8BufferPointer {
  /// A view of the buffer's contents as a bidirectional collection of `Unicode.Scalar`s.
  @frozen
  public struct UnicodeScalarView {
    public var buffer: UnsafeValidUTF8BufferPointer

    @inlinable
    public init(_ buffer: UnsafeValidUTF8BufferPointer)
  }

  @inlinable
  public var unicodeScalars: UnicodeScalarView

  /// A view of the buffer's contents as a bidirectional collection of `Character`s.
  @frozen
  public struct CharacterView {
    public var buffer: UnsafeValidUTF8BufferPointer

    @inlinable
    public init(_ buffer: UnsafeValidUTF8BufferPointer)
  }

  @inlinable
  public var characters: CharacterView

  /// A view of the buffer's contents as a bidirectional collection of transcoded
  /// `UTF16.CodeUnit`s.
  @frozen
  public struct UTF16View {
    public var buffer: UnsafeValidUTF8BufferPointer

    @inlinable
    public init(_ buffer: UnsafeValidUTF8BufferPointer)
  }

  @inlinable
  public var utf16: UTF16View
}

These are bidirectional collections, as in String. Their indices, however, are distinct from each other because they mean different things. For example, a scalar-view index is scalar aligned but not necessarily Character aligned, and a transcoded index which points mid-scalar doesn't have a corresponding position in the raw bytes.

extension UnsafeValidUTF8BufferPointer.UnicodeScalarView: BidirectionalCollection {
  public typealias Element = Unicode.Scalar

  @frozen
  public struct Index: Comparable, Hashable {
    @usableFromInline
    internal var _byteOffset: Int

    @inlinable
    public var byteOffset: Int { get }

    @inlinable
    public static func < (lhs: Self, rhs: Self) -> Bool

    @inlinable
    internal init(_uncheckedByteOffset offset: Int)
  }

  @inlinable
  public subscript(position: Index) -> Element { _read }

  @inlinable
  public func index(after i: Index) -> Index

  @inlinable
  public func index(before i: Index) -> Index

  @inlinable
  public var startIndex: Index

  @inlinable
  public var endIndex: Index
}


extension UnsafeValidUTF8BufferPointer.CharacterView: BidirectionalCollection {
  public typealias Element = Character

  @frozen
  public struct Index: Comparable, Hashable {
    @usableFromInline
    internal var _byteOffset: Int

    @inlinable
    public var byteOffset: Int { get }

    @inlinable
    public static func < (lhs: Self, rhs: Self) -> Bool

    @inlinable
    internal init(_uncheckedByteOffset offset: Int)
  }

  // Custom-defined for performance to avoid double-measuring
  // grapheme cluster length
  @frozen
  public struct Iterator: IteratorProtocol {
    @usableFromInline
    internal var _buffer: UnsafeValidUTF8BufferPointer

    @usableFromInline
    internal var _position: Index

    @inlinable
    public var buffer: UnsafeValidUTF8BufferPointer { get }

    @inlinable
    public var position: Index { get }

    public typealias Element = Character

    public mutating func next() -> Character?

    @inlinable
    internal init(
      _buffer: UnsafeValidUTF8BufferPointer, _position: Index
    )
  }

  @inlinable
  public func makeIterator() -> Iterator

  @inlinable
  public subscript(position: Index) -> Element { _read }

  @inlinable
  public func index(after i: Index) -> Index

  @inlinable
  public func index(before i: Index) -> Index

  @inlinable
  public var startIndex: Index

  @inlinable
  public var endIndex: Index
}

extension UnsafeValidUTF8BufferPointer.UTF16View: BidirectionalCollection {
  public typealias Element = UTF16.CodeUnit

  @frozen
  public struct Index: Comparable, Hashable {
    // Bitpacked byte offset and transcoded offset
    @usableFromInline
    internal var _byteOffsetAndTranscodedOffset: UInt64

    /// Offset of the first byte of the currently-indexed scalar
    @inlinable
    public var byteOffset: Int { get }

    /// Offset of the transcoded code unit within the currently-indexed scalar
    @inlinable
    public var transcodedOffset: Int { get }

    @inlinable
    public static func < (lhs: Self, rhs: Self) -> Bool

    @inlinable
    internal init(
      _uncheckedByteOffset offset: Int, _transcodedOffset: Int
    )
  }

  @inlinable
  public subscript(position: Index) -> Element { _read }

  @inlinable
  public func index(after i: Index) -> Index

  @inlinable
  public func index(before i: Index) -> Index

  @inlinable
  public var startIndex: Index

  @inlinable
  public var endIndex: Index
}
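
A brief sketch of how the views might be used, assuming the API above and a previously constructed `utf8: UnsafeValidUTF8BufferPointer`:

let scalars    = Array(utf8.unicodeScalars)  // [Unicode.Scalar]
let characters = Array(utf8.characters)      // [Character], grapheme-cluster segmented
let codeUnits  = Array(utf8.utf16)           // [UTF16.CodeUnit], transcoded on the fly

// Each view's index exposes the byte offset of the currently-indexed position.
var i = utf8.unicodeScalars.startIndex
while i < utf8.unicodeScalars.endIndex {
  print(i.byteOffset, utf8.unicodeScalars[i])
  i = utf8.unicodeScalars.index(after: i)
}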

Canonical equivalence

// Canonical equivalence
extension UnsafeValidUTF8BufferPointer {
  /// Whether `self` is equivalent to `other` under Unicode Canonical Equivalence
  public func isCanonicallyEquivalent(
    to other: UnsafeValidUTF8BufferPointer
  ) -> Bool

  /// Whether `self` orders less than `other` (under Unicode Canonical Equivalence,
  /// using normalized code-unit order)
  public func isCanonicallyLessThan(
    _ other: UnsafeValidUTF8BufferPointer
  ) -> Bool
}
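
For example (a sketch assuming the API above), the precomposed and decomposed encodings of "é" are canonically equivalent even though their bytes differ:

let precomposed: [UInt8] = [0xC3, 0xA9]        // U+00E9
let decomposed:  [UInt8] = [0x65, 0xCC, 0x81]  // U+0065 followed by U+0301

precomposed.withUnsafeBufferPointer { p in
  decomposed.withUnsafeBufferPointer { d in
    let lhs = try! UnsafeValidUTF8BufferPointer(p)
    let rhs = try! UnsafeValidUTF8BufferPointer(d)
    print(lhs.isCanonicallyEquivalent(to: rhs))      // true
    print(lhs.rawBytes.elementsEqual(rhs.rawBytes))  // false
  }
}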

Alternatives Considered

Other names

We're not particularly attached to the name UnsafeValidUTF8BufferPointer. Other names could include:

  • UnsafeValidUTF8CodeUnitBufferPointer
  • UTF8.UnsafeValidBufferPointer
  • UTF8.UnsafeValidCodeUnitBufferPointer
  • UTF8.ValidlyEncodedCodeUnitUnsafeBufferPointer
  • UnsafeContiguouslyStoredValidUTF8CodeUnitsBuffer

etc.

For isCanonicallyLessThan, another name could be canonicallyPrecedes, lexicographicallyPrecedesUnderNFC, etc.

Static methods instead of initializers

UnsafeValidUTF8BufferPointers could instead be created by static methods on UTF8:

extension Unicode.UTF8 {
  static func validate(
    ...
  ) throws -> UnsafeValidUTF8BufferPointer
}

Hashable and other conformances

UnsafeValidUTF8BufferPointer follows UnsafeRawBufferPointer and UnsafeBufferPointer in not conforming to Sendable, Hashable, Equatable, Comparable, Codable, etc.

UTF8.EncodingErrorKind as a struct

We may want to use the raw-representable struct pattern for UTF8.EncodingErrorKind instead of an exhaustive enum. That is, we may want to define it as:

extension Unicode.UTF8 {
  @frozen
  public struct EncodingErrorKind: Error, Sendable, Hashable, Codable {
    public var rawValue: UInt8

    @inlinable
    public init(rawValue: UInt8) {
      self.rawValue = rawValue
    }

    @inlinable
    public static var unexpectedContinuationByte: Self {
      .init(rawValue: 0x01)
    }

    @inlinable
    public static var overlongEncoding: Self {
      .init(rawValue: 0x02)
    }

    // ...
  }
}

This would allow us to grow the kinds of errors or add more error nuance in the future, at the loss of exhaustive switches inside catch blocks.

For example, an unexpected-end-of-input error, which happens when a scalar is in the process of being decoded but not enough bytes have been read, could be reported in different ways. It could be reported as a distinct kind of error (particularly useful for stream processing, which may want to resume with more content), or it could be an expectedContinuationByte covering the end-of-input position. As a value, it could be distinct or an alias of the same value.

Future Directions

A non-escapable ValidUTF8BufferView

Future improvements to Swift will enable a non-escapable type ("BufferView") to provide safely-unmanaged buffers via dependent lifetimes, for use within a limited scope. We should add a corresponding type for validly-encoded UTF-8 contents, following the same API shape.

Shared-ownership buffer

We could propose a managed or shared-ownership validly-encoded UTF-8 buffer. E.g.:

struct ValidlyEncodedUTF8SharedBuffer {
  var contents: UnsafeValidlyEncodedUTF8BufferPointer
  var owner: AnyObject?
}

where "shared" denotes that ownership is shared with the owner field, as opposed to an allocation exclusively managed by this type (the way Array or String would). Thus, it could be backed by a native String, an instance of Data or Array<UInt8> (if ensured to be validly encoded), etc., which participate fully in their COW semantics by retaining their storage.

This would enable us to create shared strings, e.g.

extension String {
  /// Does not copy the given storage, rather shares it
  init(sharing: ValidlyEncodedUTF8SharedBuffer)
}

Also, this could allow us to present API which repairs invalid contents, since a repair operation would need to create and manage its own allocation.

Alternative: More general formulation (:boom::cow:)

We could add the more general "deconstructed COW":

/// A buffer of `T`s in contiguous memory
struct SharedContiguousStorage<T> {
  var rawContents: UnsafeRawBufferPointer
  var owner: AnyObject?
}

where the choice of raw pointers is necessary to avoid binding the memory to a type, though other designs are possible too.

However, this type alone loses static knowledge of the UTF-8 validity, so we'd still need a separate type for validly encoded UTF-8.

Instead, we could parameterize over an unsafe-buffer-pointer-like protocol:

struct SharedContiguousStorage<UnsafeBuffer: UnsafeBufferPointerProtocol> {
  var contents: UnsafeBuffer
  var owner: AnyObject?    
}

extension String {
  /// Does not copy the given storage, rather shares it
  init(sharing: SharedContiguousStorage<UnsafeValidUTF8BufferPointer>)
}

Accessing the stored pointer would still need to be done carefully, as it would have lifetime dependent on owner. In current Swift, that would likely need to be done via a closure-taking API.
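
In that world, the scoped access might take a shape like the following; the name and signature are purely illustrative:

extension SharedContiguousStorage {
  /// Provides scoped access to the stored buffer. The buffer must not escape
  /// the closure, since its lifetime is tied to `owner`.
  func withUnsafeBuffer<R>(
    _ body: (UnsafeBuffer) throws -> R
  ) rethrows -> R
}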

protocol ContiguouslyStoredValidUTF8

We could define a protocol for validly-encoded UTF-8 bytes in contiguous memory, somewhat analogous to a low-level StringProtocol. Both an unsafe and a shared-ownership type could conform to provide the same API.

However, we'd want to be careful to future-proof such a protocol so that a ValidUTF8BufferView could conform as well. In the meantime, even if we add a shared-ownership type, Unicode processing operations can be performed by accessing the unsafe buffer pointer.

Extend to Element-based or buffer-based streams

We could define a segment of validly-encoded UTF-8 which is not necessarily aligned to any particular boundary. This would be a significantly different API shape from String's views. Accessing the start of the content would require passing in an initial state, and reaching the end would produce a state to be fed into the next segment.

It would be an awkward fit directly on top of Collection, so this would be a new API shape. For example, it could be akin to a StatefulCollection that, in addition to startIndex/endIndex, would have startState/endState. Concerns such as bidirectionality and where exactly endIndex points (the start or the end of the partial value at the tail) require further thought.
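
A purely hypothetical sketch of such a shape, for illustration only:

protocol StatefulCollection: Collection {
  associatedtype State
  /// Decoding state to feed in when starting at `startIndex`.
  var startState: State { get }
  /// Decoding state produced at `endIndex`, to be fed into the next segment.
  var endState: State { get }
}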

Regex or regex-like support

A future API addition would be to support Regex on such buffers.

Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as:

extension UnsafeValidUTF8BufferPointer.CharacterView {
  func matchCharacterClass(
    _: CharacterClass,
    startingAt: Index,
    limitedBy: Index    
  ) throws -> Index?

  func matchQuantifiedCharacterClass(
    _: CharacterClass,
    _: QuantificationDescription,
    startingAt: Index,
    limitedBy: Index    
  ) throws -> Index?
}

which would be useful for parser-combinator libraries that wish to expose String's model of Unicode by using the stdlib's accelerated implementation.

Transcoded views, normalized views, case-folded views, etc

We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for UnsafeValidUTF8BufferPointer, we should consider adding equivalents on String, Substring, etc. If we were to make any new protocols or changes to protocols, we'd want to also future-proof for a ValidUTF8BufferView.

For example, transcoded views can be generalized:

extension UnsafeValidUTF8BufferPointer {
  /// A view of the buffer's contents as a bidirectional collection of transcoded
  /// `Encoding.CodeUnit`s.
  @frozen
  public struct TranscodedView<Encoding: _UnicodeEncoding> {
    public var buffer: UnsafeValidUTF8BufferPointer

    @inlinable
    public init(_ buffer: UnsafeValidUTF8BufferPointer)
  }
}

Note that UTF-16 has such historical significance that, even with a fully-generic transcoded view, we'd likely want a dedicated, specialized type for UTF-16.

We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms.

Finally, case-folding functionality can be accessed in today's Swift via scalar properties, but we could also provide convenience collections ourselves.

UTF-8 to/from UTF-16 breadcrumbs API

String's implementation caches distances between UTF-8 and UTF-16 views, as some imported Cocoa APIs use random access to the UTF-16 view. We could formalize and expose API for this.

NUL-termination concerns and C bridging

UnsafeValidUTF8BufferPointer is capable of housing interior NUL characters, just like String. We could add flags and initialization options to detect a trailing NUL byte beyond the count and treat it as a terminator. In those cases, we could provide a withCStringIfAvailable-style API.
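
Such an API might take a shape like the following; the exact name and semantics are illustrative and would need design work:

extension UnsafeValidUTF8BufferPointer {
  /// Runs `body` with a NUL-terminated C string when the buffer is known to
  /// be followed by a trailing NUL byte; returns `nil` otherwise.
  func withCStringIfAvailable<R>(
    _ body: (UnsafePointer<CChar>) throws -> R
  ) rethrows -> R?
}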

Index rounding operations

Unlike String's, UnsafeValidUTF8BufferPointer's views have distinct Index types, which avoids a mess of problems. An interesting addition to both UnsafeValidUTF8BufferPointer and String would be explicit index-rounding operations for a desired behavior.

Canonical Spaceships

Should a ComparisonResult (or spaceship operator) be added to Swift, we could support that operation under canonical equivalence in a single pass, rather than with successive calls to isCanonicallyEquivalent(to:) and isCanonicallyLessThan(_:).

Other Unicode functionality

For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of String's API. Other functionality can be considered future work.

9 Likes

I have not read all of the proposal yet, but I would already remark that a Byte Order Mark at the beginning of a stream (the one that might correctly appear when using UTF-8 is EF BB BF) should be handled correctly (e.g. you would not want it in your stream of characters).

2 Likes

I'm a bit conflicted by this idea.

The reason that anybody cares that some binary data is valid UTF8 is because they want to do text processing with it. We have a type for that - String - it has a very thorough API, and there is a large body of code (in the standard library, Apple's SDK libraries, third-party libraries, etc) which relies on it as a currency type.

We have a separate Substring type, and a StringProtocol interface which abstracts over the two of them. It's quite uncommon to see code using StringProtocol; personally I always really value when I see libraries that do (it isn't just better for performance; it's also better for readability). It underscores just how deeply people want to use String as the currency type.

I fear that introducing yet another type (or set of types) would make the whole system too fractured and complex. Almost no libraries are going to support passing in your text content as a UnsafeValidUTF8BufferPointer or some ContiguouslyStoredValidUTF8, meaning it will still be incredibly awkward to use these types.

And it just shouldn't be this complex. UTF8 validation happens either way, whether you use String or this thing, and is IMO the least interesting part of this. The really interesting part is the actual text processing that you do post-validation. And all we're really talking about with this pitch is who owns the data - whether it lives in a String-owned buffer, or is provided externally.

For that reason, (as I've said before) shared strings are my preferred shape for this functionality. That way, we can get a String-typed value that is compatible with existing code. In my ideal version of this, I'd also be able to use any RandomAccessCollection<UInt8> as the backing storage.

(Note: I chose RandomAccessCollection because String.Index stores UTF8 offsets, and that's locked-in by ABI. With RAC, we can at least map between those offsets and the underlying collection indices in constant time)

I'm not sure this is accurate. Unsafe pointers are unsafe in a very specific way: they do not guarantee that accesses are in-bounds to a live region of memory. "Exclusivity" (as we typically use the term) does not apply to unsafe pointers, and does not require that memory is immutable while there is an unsafe buffer pointer over it.

If what is happening is that this type reads a leading byte of 1110_xxxx, then immediately reads 2 continuation bytes without a bounds-check (since it verified all encoded scalars at creation time and assumes the contents haven't changed), then it is actually introducing a new source of unsafety, as I understand it.

I would also be interested to know how much this actually saves.

FYI - Rust has a regex type which works on bytes. It does text matching, of course, but it also allows you to match arbitrary byte sequences as part of the pattern (so it doesn't require valid UTF8). I just discovered this the other day, so I thought I'd mention it if anybody's interested.

I don't think we need to explore it very deeply, but what I get from it is that there are probably two major use-cases:

  1. A Regex-like type which operates on arbitrary bytes. Since it doesn't require valid UTF8, it would presumably accept a plain some (Bidirectional?)Collection<UInt8> as input.

    If a third-party library wanted to implement something like that, they might use some of the APIs discussed here (I'm not sure if our existing UTF8 decoding APIs would be good enough -- they might be), but otherwise it's a completely separate thing.

  2. Text processing over known-valid UTF8. If we used shared strings, I think the existing Regex APIs should "just work".

7 Likes

I think it would be useful to have an initializer that specifically removes invalid characters for data recovery purposes. (Obviously this should not be the default behavior for safety reasons.) Likewise, it would also be useful to offer an iterator over raw bytes that yields characters or errors (along with the erroneous bytes).

There are differing opinions on what “correctly” means here. I believe some Windows string conversion APIs like to put a BOM at the beginning of a UTF-8 string. Round-tripping through this data type shouldn’t drop that.

Then it should be configurable. Note that the UTF-8 BOM is the same as the encoded ZERO WIDTH NO-BREAK SPACE. Of course, one might get a stream which begins with ZERO WIDTH NO-BREAK SPACE and which should be part of the text. This is one of those Unicode things that are not well solved.

1 Like

Isn’t that a higher level issue? Once you have a known valid UTF-8 stream, you can check if its first character is a BOM/ZWNBSP. You can’t safely do that before validating the UTF-8.

I read the proposal, and am still not quite sure where I would want to use this over just String, potentially with some more convenient accessors. Including some use cases in the Motivation could help us judge whether or not this is the best solution.

I would also recommend dropping “Valid” from the name. UTF-8 to me implies it is valid UTF-8.

4 Likes

I can say something about my use case, though of course this should be generalizable. One important feature is being able to check the correctness of the UTF-8 encoding; I currently do this "manually" in my XML parser. The reason is that in my XML applications I need to check whether a document only contains characters allowed by the application (and this might differ for different parts of the document), which is common in technical communication because you need to ensure that all characters get a correct visual representation on every target where the content will be consumed, and you want to do this during the parsing process for efficiency. Other XML parsers just skip invalid encoding, which is absolutely a no-go. Also, typical XML parsing really goes character by character with an explicit algorithm, because you need to e.g. switch to a different source file in between or react to an entity, so it is not a common parsing problem. Having a standard tool for such tasks at hand is, for me, a very welcome addition to the Swift standard library.

Thanks Karl, this is really helpful feedback.

This makes a really good point: the motivation needs to be more clearly articulated. The pitch frames the API through the lens of making progress towards lower-level Unicode processing facilities. Another lens could be making progress towards Piercing the String Veil. I'll take a stab at motivating through that lens (copying some text from that thread):

The stdlib's String is a high level type that internally supports many backing representations: opaque, indirect, large, and small.

An opaque string is capable of handling all string operations through opaque/resilient function calls. They are unable to provide a pointer to validly-encoded UTF-8 contents in contiguous memory. Currently, these are used for those lazily-bridged NSStrings that do not provide access to contiguous UTF-8.

An indirect string can provide a pointer to validly-encoded UTF-8 contents in contiguous memory through an opaque/resilient function call. Thus, indirect strings have an extra layer of indirection required in order to get that pointer compared to large or small strings.

A large string has validly-encoded UTF-8 contents in contiguous memory stored as a tail-allocation at a fixed, statically known offset from the object's base address. To get the UTF-8 pointer, we add the offset (32 bytes) to the object reference.

A small string packs its contents (up to 15 validly-encoded UTF-8 code units in length) directly in the String struct, without the need for a separate allocation. To get a UTF-8 pointer, we can spill it into a 16-byte stack buffer.

In essence, you can view every String API as having the following implementation pattern:

extension String {
  public func foo(...) -> T {
    if _isOpaque {
      return _opaqueFoo(...)
    }

    let utf8Buffer: UnsafeValidlyEncodedUTF8
    if _isSmall {
      // ... spill to stack buffer
      utf8Buffer = // ... pointer to stack buffer
    } else if _isLarge {
      utf8Buffer = // ... add a bias to get pointer to tail allocation
    } else {
      // _isShared (indirect): call an opaque function to get the pointer
      utf8Buffer = // ...
    }
    return utf8Buffer.foo(...)
  }

  internal func _opaqueFoo(...) -> T {
    // implementation suitable for opaque strings
  }
}

This proposal is about making UnsafeValidlyEncodedUTF8.foo(...) into API. This is the lower-level foundation upon which String is built and using UnsafeValidlyEncodedUTF8 avoids having to repeatedly re-branch for every sub-operation that's performed.

I think this adds weight to making sure this type is clearly delineated as an advanced facility that allows libraries to do low-level Unicode processing. A clearer way could be to host it under the Unicode or UTF8 namespace using one of the alternative names, such as UTF8.ValidlyEncodedCodeUnitUnsafeBufferPointer.

Shared strings are a very worthy, yet separate, aspect of "Piercing the String Veil" (or, making API for what String's ABI can do). This is why it's discussed in the future work section.

I think that there's a lot more nuance here that unsafe pointers never really addressed. If I remember correctly, the current unsafe pointers were also designed prior to exclusivity being more fully fleshed out.

Exclusivity is still a concern for safely or securely working with unsafe pointers in many domains. Reading data from a shared buffer which may have non-exclusive writes to portions of it can cause very carefully written, seemingly secure code to behave unsafely or insecurely. For example, the double-load problem, in which data is loaded in order to direct program logic and then re-loaded after a direction is taken, after the contents have been changed by a non-exclusive write.
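
A minimal sketch of the double-load hazard using plain unsafe buffers (hypothetical code, not part of the pitch; it assumes the buffer is suitably aligned):

func copyPayload(from shared: UnsafeRawBufferPointer,
                 into dest: UnsafeMutableRawBufferPointer) {
  guard shared.count >= 4, let src = shared.baseAddress, let dst = dest.baseAddress
  else { return }
  // First load: the length field is read and bounds-checked.
  let length = Int(shared.load(fromByteOffset: 0, as: UInt32.self))
  guard length <= dest.count, length <= shared.count - 4 else { return }
  // ... a non-exclusive writer changes the length field here ...
  // Second load: the value is re-read and then used without re-checking,
  // so the copy below can overflow `dest`.
  let again = Int(shared.load(fromByteOffset: 0, as: UInt32.self))
  dst.copyMemory(from: src + 4, byteCount: again)
}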

This nuance definitely needs to be more clearly fleshed out.

There's also the overlong encoding problem, in which continuation byte payloads could be overwritten to (invalidly) encode an alias of an ASCII value, bypassing bitwise equality checks.

That might be interesting, but as soon as you get outside of text processing, or more specifically searching within text, regexes have undesirable semantic defaults (unrestricted backtracking). More interesting for binary data would be linear automata composed as part of a binary-data parser-combinator library. If part of that binary data is UTF-8, then the future work discussed of adding the routines performed by the Regex engine would be very helpful to such a library.

3 Likes

If there is a UTF-8 BOM at the beginning, only after three bytes can you determine a code point, and at that moment you can check whether it is the UTF-8 BOM and skip it for the stream of characters, if configured accordingly.

I was interested to see this proposal, coming from the perspective of Embedded Swift, where String is disallowed. I would really love a way to do string-like manipulation without the baggage of ICU, even in a simpler or reduced way.

Looks like this proposal doesn’t help with that, and instead appears to be motivated by exposing a small performance improvement by essentially eliminating a switch case when performing String operations.

Although I am a big proponent of improving string processing performance, I guess I’m ultimately not the target audience of this proposal.

Basically, despite reading the OP and further elaborations, I don’t really understand the motivation for this change.

How far away are we from buffer views? Public standard library symbols are forever. It would be unfortunate to add a new unsafe type, then very shortly after introduce new functionality that could have been used to avoid creating an unsafe type.

"BufferView" is a non-escapable type, meaning it is only applicable when lifetimes follow the lexical nesting of code. You can view it as akin to a x.withContiguousStorageIfAvailable { ... } which enforces that the unsafe pointers passed in do not leak out. The addition of a dependent-lifetime attribute allows you to not have to nest usage in closures, however the lifetime is still lexical as it extends to the end of the current lexical scope. It cannot model non-lexical lifetimes.

There's effectively a lifetime management type trinity:

  1. Dynamically managed (via ARC types)
  2. Statically managed (via non-escapable types)
  3. Unmanaged (via unsafe types)

Until non-escapable types come, unsafe types are used to implement statically-managed lifetimes, unsafely. With non-escapable types, they can be done safely. However, unmanaged lifetimes still exist and unsafe types will still be used for non-lexically constrained lifetimes.

What are your specific concerns over adding stdlib symbols? Is it binary size of symbols, binary size of code, API visibility, compatibility maintenance, ...?

The stdlib does not use ICU. It does bundle some data tables for normalization and grapheme breaking, which may be too large for some embedded environments. Any API here that depends on such tables (the CharacterView and isCanonicallyEquivalent(to:)) would similarly be unavailable; however, there's nothing stopping the type itself and the other views from being available.

String is trickier to sub-divide because its conformances to Comparable and Collection rely on such tables.

Let me approach it from a third direction.

Currently, if you want to do String-like stuff over your UTF-8 bytes, you have to make an instance of String, which allocates a native storage class and copies all the bytes. You would then operate within the new String's views and map between String.Index and byte offsets in your original buffer.

If your bytes were the backing of a data structure, you'd need to decide if you wanted to cache such a new String instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation.

The stdlib currently only supports a small subset of String-like operations outside of String, such as decoding scalars, but it's not a lot and it's not optimized for contiguously stored valid-UTF-8 (which is why String doesn't call it).
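
For example, scalar decoding outside of String can be done today with the existing UnicodeCodec API (shown here for context; this code works in current Swift):

let bytes: [UInt8] = Array("héllo".utf8)
var iterator = bytes.makeIterator()
var decoder = UTF8()
decoding: while true {
  switch decoder.decode(&iterator) {
  case .scalarValue(let scalar): print(scalar)
  case .emptyInput:              break decoding
  case .error:                   print("encoding error"); break decoding
  }
}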

3 Likes

All unsafe functionality you expose will be used unsafely. We don’t need to fill out the trinity for the sake of it. I understand that several types have a contiguous collection of UTF-8 bytes as internal storage, so it’s useful to expose a view type as it lets people implement generic algorithms over UTF-8 types with the least overhead. What I don’t understand is why one might need that view type to have an unmanaged lifetime.

1 Like

Where would structs (like Int) be, is that (4)?