Pitch: UTF-8 Processing Over Unsafe Contiguous Bytes

Michael_Ilseman · January 30, 2024, 4:41pm

Thanks Karl, this is really helpful feedback.

This makes a really good point: the motivation needs to be more clearly articulated. The pitch frames the API through the lens of making progress towards lower-level Unicode processing facilities. Another lens is making progress towards Piercing the String Veil. I'll take a stab at motivating through that lens (copying some text from that thread):

The stdlib's String is a high level type that internally supports many backing representations: opaque, indirect, large, and small.

An opaque string handles all string operations through opaque/resilient function calls. It is unable to provide a pointer to validly-encoded UTF-8 contents in contiguous memory. Currently, this representation is used for those lazily-bridged NSStrings that do not provide access to contiguous UTF-8.

An indirect string can provide a pointer to validly-encoded UTF-8 contents in contiguous memory through an opaque/resilient function call. Thus, indirect strings have an extra layer of indirection required in order to get that pointer compared to large or small strings.

A large string has validly-encoded UTF-8 contents in contiguous memory stored as a tail-allocation at a fixed, statically known offset from the object's base address. To get the UTF-8 pointer, we add the offset (32 bytes) to the object reference.

A small string packs its contents (up to 15 validly-encoded UTF-8 code units in length) directly in the String struct, without the need for a separate allocation. To get a UTF-8 pointer, we can spill it into a 16-byte stack buffer.

In essence, you can view every String API as having the following implementation pattern:

extension String {
  public func foo(...) -> T {
    if _isOpaque {
      return _opaqueFoo(...)
    }

    let utf8Buffer: UnsafeValidlyEncodedUTF8
    if _isSmall {
      // ... spill to stack buffer
      utf8Buffer = // ... pointer to stack buffer
    } else if _isLarge {
      utf8Buffer = // ... add a bias to get pointer to tail allocation
    } else {
      // Indirect (shared) string
      utf8Buffer = // ... call an opaque function to get pointer
    }
    return utf8Buffer.foo(...)
  }

  internal func _opaqueFoo(...) -> T {
    // implementation suitable for opaque strings
  }
}

This proposal is about making UnsafeValidlyEncodedUTF8.foo(...) into API. This is the lower-level foundation upon which String is built and using UnsafeValidlyEncodedUTF8 avoids having to repeatedly re-branch for every sub-operation that's performed.
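To make the "branch once" idea concrete, here is a minimal sketch using API that exists today. The pitched type is not available, so an `UnsafeBufferPointer<UInt8>` stands in for it, and `countASCII(in:)` and `asciiCount(of:)` are hypothetical names made up for illustration. The representation is resolved a single time, and every sub-operation then runs directly on the contiguous bytes.

```swift
// Hypothetical buffer-level operation: plays the role of
// `UnsafeValidlyEncodedUTF8.foo(...)` from the pattern above.
func countASCII(in utf8: UnsafeBufferPointer<UInt8>) -> Int {
  var count = 0
  for byte in utf8 where byte < 0x80 {
    count += 1
  }
  return count
}

// Hypothetical String-level wrapper: resolve the representation once,
// then hand the contiguous bytes to the buffer-level operation.
func asciiCount(of string: String) -> Int {
  // Fast path: the string can vend contiguous UTF-8 directly.
  if let result = string.utf8.withContiguousStorageIfAvailable({ countASCII(in: $0) }) {
    return result
  }
  // Slow path (e.g. a lazily-bridged string without contiguous UTF-8):
  // `withUTF8` is mutating and may first copy into native storage.
  var copy = string
  return copy.withUTF8 { countASCII(in: $0) }
}
```

A library that performs many sub-operations would do all of them inside the one closure rather than going back through `String` each time.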

I think this adds weight to making sure the type is clearly delineated as an advanced facility that allows libraries to do low-level Unicode processing. A clearer way could be to host it under the Unicode or UTF8 namespace using one of the alternative names, such as UTF8.ValidlyEncodedCodeUnitUnsafeBufferPointer.

Shared strings are a very worthy, yet separate, aspect of "Piercing the String Veil" (or, making API for what String's ABI can do). This is why they're discussed in the future work section.

I think that there's a lot more nuance here that unsafe pointers never really addressed. If I remember correctly, the current unsafe pointers were also designed prior to exclusivity being more fully fleshed out.

Exclusivity is still a concern for safely or securely working with unsafe pointers in many domains. Reading data from a shared buffer which may have non-exclusive writes to portions of it can cause very carefully written, seemingly secure code to behave unsafely or insecurely. One example is the double-load problem, in which data is loaded in order to direct program logic and is then re-loaded after a direction is taken, by which point the contents have been changed by a non-exclusive write.
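A small, deterministic sketch of the double-load problem follows. Everything in it is hypothetical: the record layout (byte 0 holds a length), the function names, and the `interference` closure, which simulates a non-exclusive write landing between the two loads so that no real concurrency is needed.

```swift
// Loads the length twice: once to validate, once to use.
func lengthWithDoubleLoad(
  _ buffer: UnsafeMutableBufferPointer<UInt8>,
  capacity: Int,
  interference: () -> Void
) -> Int? {
  // First load: validate the length against the capacity.
  guard Int(buffer[0]) <= capacity else { return nil }
  interference()
  // Second load: may no longer be the value that was validated.
  return Int(buffer[0])
}

// Loads the length once into a local and only ever uses that copy.
func lengthWithSingleLoad(
  _ buffer: UnsafeMutableBufferPointer<UInt8>,
  capacity: Int,
  interference: () -> Void
) -> Int? {
  let length = Int(buffer[0])
  guard length <= capacity else { return nil }
  interference()
  return length
}

func demonstrateDoubleLoad() -> (doubleLoad: Int?, singleLoad: Int?) {
  let storage = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 16)
  defer { storage.deallocate() }
  storage.initialize(repeating: 0)

  storage[0] = 4
  let doubleLoad = lengthWithDoubleLoad(storage, capacity: 8) { storage[0] = 200 }

  storage[0] = 4
  let singleLoad = lengthWithSingleLoad(storage, capacity: 8) { storage[0] = 200 }

  return (doubleLoad, singleLoad)
}
```

The double-load variant passes the bounds check with 4 and then returns 200, while the single-load variant returns the validated 4. Copying into a local only protects that one value; it does not make the rest of the buffer safe to read.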

This nuance definitely needs to be more clearly fleshed out.

There's also the overlong encoding problem, in which continuation byte payloads could be overwritten to (invalidly) encode an alias of an ASCII value, bypassing bitwise equality checks.
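As a concrete illustration of an overlong alias, the byte pair 0xC0 0xAF carries the payload bits of "/" (U+002F) in a two-byte form that is not valid UTF-8. The sketch below uses a deliberately naive, hypothetical decoder (`naiveDecodeTwoByte`) to show how code that trusts the bytes would recover the ASCII value even though a bytewise search never sees it.

```swift
// 0xC0 0xAF is an overlong (invalid) two-byte encoding of "/" (U+002F).
let overlongSlash: [UInt8] = [0xC0, 0xAF]

// A bytewise search for the ASCII byte finds nothing.
let containsSlashByte = overlongSlash.contains(0x2F)

// Hypothetical naive decoder: extracts payload bits without validating
// that the lead byte is a legal two-byte lead (0xC2...0xDF).
func naiveDecodeTwoByte(_ lead: UInt8, _ continuation: UInt8) -> UInt32 {
  (UInt32(lead & 0x1F) << 6) | UInt32(continuation & 0x3F)
}
let naiveScalar = naiveDecodeTwoByte(overlongSlash[0], overlongSlash[1])

// The standard library's repairing decode does not yield "/" for these bytes.
let repaired = String(decoding: overlongSlash, as: UTF8.self)
let repairedContainsSlash = repaired.utf8.contains(0x2F)
```

Here `containsSlashByte` and `repairedContainsSlash` are both false, while `naiveScalar` is 0x2F. Code that assumes valid encoding and skips validation is exposed if the bytes change underneath it after they were validated.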

That might be interesting, but as soon as you get outside of text processing, or more specifically searching within text, regexes have undesirable semantic defaults (unrestricted backtracking). More interesting for binary data would be linear automata composed as part of a binary data parser combinator library. If part of that binary data is UTF-8, then the discussed future work of adding the routines performed by the Regex engine would be very helpful to that library.