[Pitch] UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Michael_Ilseman · January 27, 2025, 2:55pm

UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Proposal: SE-NNNN
Authors: Michael Ilseman, Guillaume Lessard
Review Manager: TBD
Status: Awaiting review
Bug: rdar://48132971, rdar://96837923
Implementation: DRAFT: UTF8Span by milseman · Pull Request #78531 · swiftlang/swift · GitHub
Upcoming Feature Flag:
Review: (pitch 1) (pitch 2)

Introduction

We introduce UTF8Span for efficient and safe Unicode processing over contiguous storage. UTF8Span is a memory safe non-escapable type similar to Span.

Native Strings are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements String's API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers.

Motivation

Currently, if a developer wants to do String-like processing over UTF-8 bytes, they have to make an instance of String, which allocates a native storage class, copies all the bytes, and is reference counted. The developer would then need to operate within the new String's views and map between String.Index and byte offsets in the original buffer.

For example, if these bytes were part of a data structure, the developer would need to decide to either cache such a new String instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation and potentially reference counting.

Furthermore, String may not be fully available on tightly constrained platforms, especially those that cannot support allocations. Both String and UTF8Span have some API that require Unicode data tables and that might not be available on embedded (String via its conformance to Comparable and Collection depend on these data tables while UTF8Span has a couple of methods that will be unavailable).

UTF-8 validity and efficiency

UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.

Failure to guarantee UTF-8 encoding validity creates security and safety concerns. With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differs from the bounds dictated by its contents.

Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison).

Proposed solution

We propose a non-escapable UTF8Span which exposes String functionality for validly-encoded UTF-8 code units in contiguous memory. We also propose rich API describing the kind and location of encoding errors.

Detailed design

UTF8Span is a borrowed view into contiguous memory containing validly-encoded UTF-8 code units.

@frozen
public struct UTF8Span: Copyable, ~Escapable {
  **TODO**: This might end up being UnsafeRawPointer? like in Span
  public var unsafeBaseAddress: UnsafeRawPointer

  /*
   A bit-packed count and flags (such as isASCII)

   ╔═══════╦═════╦═════╦══════════╦═══════╗
   ║  b63  ║ b62 ║ b61 ║  b60:56  ║ b56:0 ║
   ╠═══════╬═════╬═════╬══════════╬═══════╣
   ║ ASCII ║ NFC ║ SSC ║ reserved ║ count ║
   ╚═══════╩═════╩═════╩══════════╩═══════╝

   ASCII means the contents are all-ASCII (<0x7F).
   NFC means contents are in normal form C for fast comparisons.
   SSC means single-scalar Characters (i.e. grapheme clusters): every
     `Character` holds only a single `Unicode.Scalar`.
   */
  @usableFromInline
  internal var _countAndFlags: UInt64
}

UTF-8 validation

We propose new API for identifying where and what kind of encoding errors are present in UTF-8 content.

extension Unicode.UTF8 {
  /**

   The kind and location of a UTF-8 encoding error.

   Valid UTF-8 is represented by this table:

   ╔════════════════════╦════════╦════════╦════════╦════════╗
   ║    Scalar value    ║ Byte 0 ║ Byte 1 ║ Byte 2 ║ Byte 3 ║
   ╠════════════════════╬════════╬════════╬════════╬════════╣
   ║ U+0000..U+007F     ║ 00..7F ║        ║        ║        ║
   ║ U+0080..U+07FF     ║ C2..DF ║ 80..BF ║        ║        ║
   ║ U+0800..U+0FFF     ║ E0     ║ A0..BF ║ 80..BF ║        ║
   ║ U+1000..U+CFFF     ║ E1..EC ║ 80..BF ║ 80..BF ║        ║
   ║ U+D000..U+D7FF     ║ ED     ║ 80..9F ║ 80..BF ║        ║
   ║ U+E000..U+FFFF     ║ EE..EF ║ 80..BF ║ 80..BF ║        ║
   ║ U+10000..U+3FFFF   ║ F0     ║ 90..BF ║ 80..BF ║ 80..BF ║
   ║ U+40000..U+FFFFF   ║ F1..F3 ║ 80..BF ║ 80..BF ║ 80..BF ║
   ║ U+100000..U+10FFFF ║ F4     ║ 80..8F ║ 80..BF ║ 80..BF ║
   ╚════════════════════╩════════╩════════╩════════╩════════╝

   ### Classifying errors

   An *unexpected continuation* is when a continuation byte (`10xxxxxx`) occurs
   in a position that should be the start of a new scalar value. Unexpected
   continuations can often occur when the input contains arbitrary data
   instead of textual content. An unexpected continuation at the start of
   input might mean that the input was not correctly sliced along scalar
   boundaries or that it does not contain UTF-8.

   A *truncated scalar* is a multi-byte sequence that is the start of a valid
   multi-byte scalar but is cut off before ending correctly. A truncated
   scalar at the end of the input might mean that only part of the entire
   input was received.

   A *surrogate code point* (`U+D800..U+DFFF`) is invalid UTF-8. Surrogate
   code points are used by UTF-16 to encode scalars in the supplementary
   planes. Their presence may mean the input was encoded in a different 8-bit
   encoding, such as CESU-8, WTF-8, or Java's Modified UTF-8.

   An *invalid non-surrogate code point* is any code point higher than
   `U+10FFFF`. This can often occur when the input is arbitrary data instead
   of textual content.

   An *overlong encoding* occurs when a scalar value that could have been
   encoded using fewer bytes is encoded in a longer byte sequence. Overlong
   encodings are invalid UTF-8 and can lead to security issues if not
   correctly detected:

   - https://nvd.nist.gov/vuln/detail/CVE-2008-2938
   - https://nvd.nist.gov/vuln/detail/CVE-2000-0884

   An overlong encoding of `NUL`, `0xC0 0x80`, is used in Java's Modified
   UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts
   to bypass security measures.

   ### Reporting the range of the error

   The range of the error reported follows the *Maximal subpart of an
   ill-formed subsequence* algorithm in which each error is either one byte
   long or ends before the first byte that is disallowed. See "U+FFFD
   Substitution of Maximal Subparts" in the Unicode Standard. Unicode started
   recommending this algorithm in version 6 and is adopted by the W3C.

   The maximal subpart algorithm will produce a single multi-byte range for a
   truncated scalar (a multi-byte sequence that is the start of a valid
   multi-byte scalar but is cut off before ending correctly). For all other
   errors (including overlong encodings, surrogates, and invalid code
   points), it will produce an error per byte.

   Since overlong encodings, surrogates, and invalid code points are erroneous
   by the second byte (at the latest), the above definition produces the same
   ranges as defining such a sequence as a truncated scalar error followed by
   unexpected continuation byte errors. The more semantically-rich
   classification is reported.

   For example, a surrogate count point sequence `ED A0 80` will be reported
   as three `.surrogateCodePointByte` errors rather than a `.truncatedScalar`
   followed by two `.unexpectedContinuationByte` errors.

   Other commonly reported error ranges can be constructed from this result.
   For example, PEP 383's error-per-byte can be constructed by mapping over
   the reported range. Similarly, constructing a single error for the longest
   invalid byte range can be constructed by joining adjacent error ranges.

   ╔═════════════════╦══════╦═════╦═════╦═════╦═════╦═════╦═════╦══════╗
   ║                 ║  61  ║ F1  ║ 80  ║ 80  ║ E1  ║ 80  ║ C2  ║  62  ║
   ╠═════════════════╬══════╬═════╬═════╬═════╬═════╬═════╬═════╬══════╣
   ║ Longest range   ║ U+61 ║ err ║     ║     ║     ║     ║     ║ U+62 ║
   ║ Maximal subpart ║ U+61 ║ err ║     ║     ║ err ║     ║ err ║ U+62 ║
   ║ Error per byte  ║ U+61 ║ err ║ err ║ err ║ err ║ err ║ err ║ U+62 ║
   ╚═════════════════╩══════╩═════╩═════╩═════╩═════╩═════╩═════╩══════╝

   */
  @frozen
  public struct EncodingError: Error, Sendable, Hashable, Codable {
    /// The kind of encoding error
    public var kind: Unicode.UTF8.EncodingError.Kind

    /// The range of offsets into our input containing the error
    public var range: Range<Int>

    @_alwaysEmitIntoClient
    public init(
      _ kind: Unicode.UTF8.EncodingError.Kind,
      _ range: some RangeExpression<Int>
    )

    @_alwaysEmitIntoClient
    public init(_ kind: Unicode.UTF8.EncodingError.Kind, at: Int)
  }
}

extension UTF8.EncodingError {
  /// The kind of encoding error encountered during validation
  @frozen
  public struct Kind: Error, Sendable, Hashable, Codable, RawRepresentable {
    public var rawValue: UInt8

    @inlinable
    public init(rawValue: UInt8)

    /// A continuation byte (`10xxxxxx`) outside of a multi-byte sequence
    @_alwaysEmitIntoClient
    public static var unexpectedContinuationByte: Self

    /// A byte in a surrogate code point (`U+D800..U+DFFF`) sequence
    @_alwaysEmitIntoClient
    public static var surrogateCodePointByte: Self

    /// A byte in an invalid, non-surrogate code point (`>U+10FFFF`) sequence
    @_alwaysEmitIntoClient
    public static var invalidNonSurrogateCodePointByte: Self

    /// A byte in an overlong encoding sequence
    @_alwaysEmitIntoClient
    public static var overlongEncodingByte: Self

    /// A multi-byte sequence that is the start of a valid multi-byte scalar
    /// but is cut off before ending correctly
    @_alwaysEmitIntoClient
    public static var truncatedScalar: Self
  }
}

@_unavailableInEmbedded
extension UTF8.EncodingError.Kind: CustomStringConvertible {
  public var description: String { get }
}

@_unavailableInEmbedded
extension UTF8.EncodingError: CustomStringConvertible {
  public var description: String { get }
}

QUESTION: It would be good to expose this functionality via a general purpose validation API. Question is do we want a findFirstError or findAllErrors style API, both? E.g.:

extension UTF8 {
  public static func checkForError(
    _ s: some Sequence<UInt8>
  ) -> some UTF8.EncodingError {

  ... or

  public static func checkForAllErrors(
    _ s: some Sequence<UInt8>
  ) -> some Sequence<UTF8.EncodingError> {

Creation and validation

UTF8Span is validated at initialization time and encoding errors are diagnosed and thrown.


extension UTF8Span {
  @lifetime(codeUnits)
  public init(
    _validating codeUnits: consuming Span<UInt8>
  ) throws(UTF8.EncodingError) {

  @lifetime(borrow start)
  internal init(
    _unsafeAssumingValidUTF8 start: borrowing UnsafeRawPointer,
    _countAndFlags: UInt64
  )
}

NOTE: The final status of underscores, annotations, etc., are pending things like SE-0456 and Lifetime Dependencies.

Scalar processing

We propose a UTF8Span.ScalarIterator type that can do scalar processing forwards and backwards. Note that ScalarIterator itself is non-escapable, and thus cannot conform to IteratorProtocol, etc.

extension UTF8Span {
  public func _makeScalarIterator() -> ScalarIterator

  /// Iterate the `Unicode.Scalar`s  contents of a `UTF8Span`.
  public struct ScalarIterator: ~Escapable {
    public var codeUnits: UTF8Span

    /// The byte offset of the start of the next scalar. This is
    /// always scalar-aligned.
    public var currentCodeUnitOffset: Int { get }

    public init(_ codeUnits: UTF8Span)

    /// Decode and return the scalar starting at `currentCodeUnitOffset`.
    /// After the function returns, `currentCodeUnitOffset` holds the
    /// position at the end of the returned scalar, which is also the start
    /// of the next scalar.
    ///
    /// Returns `nil` if at the end of the `UTF8Span`.
    public mutating func next() -> Unicode.Scalar?

    /// Decode and return the scalar ending at `currentCodeUnitOffset`. After
    /// the function returns, `currentCodeUnitOffset` holds the position at
    /// the start of the returned scalar, which is also the end of the
    /// previous scalar.
    ///
    /// Returns `nil` if at the start of the `UTF8Span`.
    public mutating func previous() -> Unicode.Scalar?

    /// Advance `codeUnitOffset` to the end of the current scalar, without
    /// decoding it.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
    /// if at the end of the UTF8Span.
    public mutating func skipForward() -> Int

    /// Advance `codeUnitOffset` to the end of `n` scalars, without decoding
    /// them.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be
    /// fewer than `n` if at the end of the UTF8Span.
    public mutating func skipForward(by n: Int) -> Int

    /// Move `codeUnitOffset` to the start of the previous scalar, without
    /// decoding it.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
    /// if at the start of the UTF8Span.
    public mutating func skipBack() -> Bool

    /// Move `codeUnitOffset` to the start of the previous `n` scalars,
    /// without decoding them.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be
    /// fewer than `n` if at the start of the UTF8Span.
    public mutating func skipBack(by n: Int) -> Bool

    /// Reset to the nearest scalar-aligned code unit offset `<= i`.
    ///
    /// **TODO**: Example
    public mutating func reset(roundingBackwardsFrom i: Int)

    /// Reset to the nearest scalar-aligned code unit offset `>= i`.
    ///
    /// **TODO**: Example
    public mutating func reset(roundingForwardsFrom i: Int)

    /// Reset this iterator to code unit offset `i`, skipping _all_ safety
    /// checks.
    ///
    /// Note: This is only for very specific, low-level use cases. If
    /// `codeUnitOffset` is not properly scalar-aligned, this function can
    /// result in undefined behavior when, e.g., `next()` is called.
    ///
    /// For example, this could be used by a regex engine to backtrack to a
    /// known-valid previous position.
    ///
    public mutating func reset(uncheckedAssumingAlignedTo i: Int)

    /// Returns the UTF8Span containing all the content up to the iterator's
    /// current position.
    public func _prefix() -> UTF8Span

    /// Returns the UTF8Span containing all the content after the iterator's
    /// current position.
    public func _suffix() -> UTF8Span
  }
}

QUESTION: Is it worth also surfacing as isScalarAligned API on UTF8Span so it's a little easier to find and spell (as well as talk about in doc comments)?

Character processing

We similarly propose a UTF8Span.CharacterIterator type that can do grapheme-breaking forwards and backwards.

The CharacterIterator assumes that the start and end of the UTF8Span is the start and end of content.

Any scalar-aligned position is a valid place to start or reset the grapheme-breaking algorithm to, though you could get different Character output if if resetting to a position that isn't Character-aligned relative to the start of the UTF8Span (e.g. in the middle of a series of regional indicators).

@_unavailableInEmbedded
extension UTF8Span {
  public func _makeCharacterIterator() -> CharacterIterator

  /// Iterate the `Character` contents of a `UTF8Span`.
  public struct CharacterIterator: ~Escapable {
    public var codeUnits: UTF8Span

    /// The byte offset of the start of the next `Character`. This is 
    /// always scalar-aligned and `Character`-aligned.
    public var currentCodeUnitOffset: Int { get }

    public init(_ span: UTF8Span)

    /// Return the `Character` starting at `currentCodeUnitOffset`. After the
    /// function returns, `currentCodeUnitOffset` holds the position at the
    /// end of the `Character`, which is also the start of the next
    /// `Character`. 
    ///
    /// Returns `nil` if at the end of the `UTF8Span`.
    public mutating func next() -> Character?

    /// Return the `Character` ending at `currentCodeUnitOffset`. After the
    /// function returns, `currentCodeUnitOffset` holds the position at the
    /// start of the returned `Character`, which is also the end of the
    /// previous `Character`. 
    ///
    /// Returns `nil` if at the start of the `UTF8Span`.
    public mutating func previous() -> Character?

    /// Advance `codeUnitOffset` to the end of the current `Character`,
    /// without constructing it.
    ///
    /// Returns the number of `Character`s skipped over, which can be 0
    /// if at the end of the UTF8Span.
    public mutating func skipForward()

    /// Advance `codeUnitOffset` to the end of `n` `Characters`, without
    /// constructing them.
    ///
    /// Returns the number of `Character`s skipped over, which can be
    /// fewer than `n` if at the end of the UTF8Span.
    public mutating func skipForward(by n: Int)

    /// Move `codeUnitOffset` to the start of the previous `Character`,
    /// without constructing it.
    ///
    /// Returns the number of `Character`s skipped over, which can be 0
    /// if at the start of the UTF8Span.
    public mutating func skipBack()

    /// Move `codeUnitOffset` to the start of the previous `n` `Character`s,
    /// without constructing them.
    ///
    /// Returns the number of `Character`s skipped over, which can be
    /// fewer than `n` if at the start of the UTF8Span.
    public mutating func skipBack(by n: Int)

    /// Reset to the nearest character-aligned position `<= i`.
    public mutating func reset(roundingBackwardsFrom i: Int)

    /// Reset to the nearest character-aligned position `>= i`.
    public mutating func reset(roundingForwardsFrom i: Int)

    /// Reset this iterator to code unit offset `i`, skipping _all_ safety
    /// checks.
    ///
    /// Note: This is only for very specific, low-level use cases. If
    /// `codeUnitOffset` is not properly scalar-aligned, this function can
    /// result in undefined behavior when, e.g., `next()` is called. 
    ///
    /// If `i` is scalar-aligned, but not `Character`-aligned, you may get
    /// different results from running `Character` iteration.
    ///
    /// For example, this could be used by a regex engine to backtrack to a
    /// known-valid previous position.
    ///
    public mutating func reset(uncheckedAssumingAlignedTo i: Int)

    /// Returns the UTF8Span containing all the content up to the iterator's
    /// current position.
    public func prefix() -> UTF8Span

    /// Returns the UTF8Span containing all the content after the iterator's
    /// current position.
    public func suffix() -> UTF8Span
  }

}

Comparisons

The content of a UTF8Span can be compared in a number of ways, including literally (byte semantics) and Unicode canonical equivalence.

extension UTF8Span {
  /// Whether this span has the same bytes as `other`.
  @_alwaysEmitIntoClient
  public func bytesEqual(to other: UTF8Span) -> Bool

  /// Whether this span has the same bytes as `other`.
  @_alwaysEmitIntoClient
  public func bytesEqual(to other: some Sequence<UInt8>) -> Bool

  /// Whether this span has the same `Unicode.Scalar`s as `other`.
  @_alwaysEmitIntoClient
  public func scalarsEqual(
    to other: some Sequence<Unicode.Scalar>
  ) -> Bool

  /// Whether this span has the same `Character`s as `other`, using
  /// `Character.==` (i.e. Unicode canonical equivalence).
  @_unavailableInEmbedded
  @_alwaysEmitIntoClient
  public func charactersEqual(
    to other: some Sequence<Character>
  ) -> Bool
}

We also support literal (i.e. non-canonical) pattern matching against StaticString.

extension UTF8Span {
  static func ~=(_ lhs: UTF8Span, _ rhs: StaticString) -> Bool
}

Canonical equivalence and ordering

UTF8Span can perform Unicode canonical equivalence checks (i.e. the semantics of String.== and Character.==).

extension UTF8Span {
  /// Whether `self` is equivalent to `other` under Unicode Canonical
  /// Equivalence.
  @_unavailableInEmbedded
  public func isCanonicallyEquivalent(
    to other: UTF8Span
  ) -> Bool

  /// Whether `self` orders less than `other` under Unicode Canonical
  /// Equivalence using normalized code-unit order (in NFC).
  @_unavailableInEmbedded
  public func isCanonicallyLessThan(
    _ other: UTF8Span
  ) -> Bool
}

Extracting sub-spans

Slicing a UTF8Span is nuanced and depends on the caller's desired use. They can only be sliced at scalar-aligned code unit offsets or else it will break the valid-UTF8 invariant. Furthermore, if the caller desires consistent grapheme breaking behavior without externally managing grapheme breaking state, they must be sliced along Character boundaries. For this reason, we have exposed slicing as prefix and suffix operations on UTF8Span's iterators instead of Span's' extracting methods.

Queries

UTF8Span checks at construction time and remembers whether its contents are all ASCII. Additional checks can be requested and remembered.

extension UTF8Span {
  /// Returns whether the validated contents were all-ASCII. This is checked at
  /// initialization time and remembered.
  @inlinable
  public var isASCII: Bool { get }

  /// Returns whether the contents are known to be NFC. This is not
  /// always checked at initialization time and is set by `checkForNFC`.
  @inlinable
  @_unavailableInEmbedded
  public var isKnownNFC: Bool { get }

  /// Do a scan checking for whether the contents are in Normal Form C.
  /// When the contents are in NFC, canonical equivalence checks are much
  /// faster.
  ///
  /// `quickCheck` will check for a subset of NFC contents using the
  /// NFCQuickCheck algorithm, which is faster than the full normalization
  /// algorithm. However, it cannot detect all NFC contents.
  ///
  /// Updates the `isKnownNFC` bit.
  @_unavailableInEmbedded
  public mutating func checkForNFC(
    quickCheck: Bool
  ) -> Bool

  /// Returns whether every `Character` (i.e. grapheme cluster)
  /// is known to be comprised of a single `Unicode.Scalar`.
  ///
  /// This is not always checked at initialization time. It is set by
  /// `checkForSingleScalarCharacters`.
  @_unavailableInEmbedded
  @inlinable
  public var isKnownSingleScalarCharacters: Bool { get }

  /// Do a scan, checking whether every `Character` (i.e. grapheme cluster)
  /// is comprised of only a single `Unicode.Scalar`. When a span contains
  /// only single-scalar characters, character operations are much faster.
  ///
  /// `quickCheck` will check for a subset of single-scalar character contents
  /// using a faster algorithm than the full grapheme breaking algorithm.
  /// However, it cannot detect all single-scalar `Character` contents.
  ///
  /// Updates the `isKnownSingleScalarCharacters` bit.
  @_unavailableInEmbedded
  public mutating func checkForSingleScalarCharacters(
    quickCheck: Bool
  ) -> Bool
}

QUESTION: There is an even quicker quick-check for NFC (checking if all scalars are <0x300, which covers extended latin characters). Should we expose that level as well?

`UTF8Span` from `String`

We will add utf8Span-style properties to String and Substring, in line with however SE-0456 turns out.

`Span`-like functionality

A UTF8Span is similar to a Span<UInt8>, but with the valid-UTF8 invariant and additional information such as isASCII. We propose a way to get a Span<UInt8> from a UTF8Span as well as some methods directly on UTF8Span:

extension UTF8Span {
  @_alwaysEmitIntoClient
  public var isEmpty: Bool { get }

  @_alwaysEmitIntoClient
  public var storage: Span<UInt8> { get }

  /// Calls a closure with a pointer to the viewed contiguous storage.
  ///
  /// The buffer pointer passed as an argument to `body` is valid only
  /// during the execution of `withUnsafeBufferPointer(_:)`.
  /// Do not store or return the pointer for later use.
  ///
  /// - Parameter body: A closure with an `UnsafeBufferPointer` parameter
  ///   that points to the viewed contiguous storage. If `body` has
  ///   a return value, that value is also used as the return value
  ///   for the `withUnsafeBufferPointer(_:)` method. The closure's
  ///   parameter is valid only for the duration of its execution.
  /// - Returns: The return value of the `body` closure parameter.
  @_alwaysEmitIntoClient
  borrowing public func withUnsafeBufferPointer<
    E: Error, Result: ~Copyable & ~Escapable
  >(
    _ body: (_ buffer: borrowing UnsafeBufferPointer<UInt8>) throws(E) -> Result
  ) throws(E) -> dependsOn(self) Result
}

Getting a `String` from a `UTF8Span`

QUESTION: We should make it easier than String(decoding:as) to make a String copy of the UTF8Span, especially since UTF8Span cannot conform to Sequence or Collection. This will form an ARC-managed copy and not something that will share the (ephemeral) storage.

Source compatibility

This proposal is additive and source-compatible with existing code.

ABI compatibility

This proposal is additive and ABI-compatible with existing code.

Implications on adoption

The additions described in this proposal require a new version of the standard library and runtime.

Future directions

Alternatives considered

Invalid start / end of input UTF-8 encoding errors

see gist

An unsafe UTF8 Buffer Pointer type

see gist

Alternatives to Iterators

Functions

see gist

Collections

see gist

Acknowledgments

Karoy Lorentey, Karl, Geordie_J, and fclout, contributed to this proposal with their clarifying questions and discussions.

xwu · January 27, 2025, 3:24pm

A few nits:

Michael_Ilseman:

    public static var truncatedScalar: Self
[...]
  public func scalarsEqual(
    to other: some Sequence<Unicode.Scalar>
  ) -> Bool
[...]

Is it worth also surfacing as isScalarAligned API on UTF8Span

In current APIs, as far as I'm aware, we've never abbreviated "Unicode scalar" to "scalar"—even on String-specific APIs where there isn't any realistic possibility of confusion with SIMD scalars, etc. I'm thinking of unicodeScalars specifically as the prototype of this. Can we stick to that usage?

(I've always thought that if "Unicode scalar" is too much of a mouthful—which it is—Go's "rune" is the right term-of-art and not the at-best-meaningless-at-worst-misleading "scalar.")

It doesn't look like this conforms to IteratorProtocol—if not, can we find a different name for it than "Iterator" (even though I know that it aligns with ICU terminology)? Perhaps "Strider"?

This is a good caveat. To make it explicit, if we buy the argument that storage-is-better-than-span for the property, this would be utf8Storage; and if we buy the argument that span should be the Span-vending API, then this would be utf8Span?

FlorianPircher · January 27, 2025, 3:47pm

Thank you for the work in this area!

I think String only keeps an is-known-ASCII flag, not an is-ASCII flag. Does this mean that accessing a UTF8Span of a String would not be a constant-time operation, as strings for which the is-known-ASCII flag is false would need to be scanned for the is-ASCII status first? (In regard to String values with a UTF-8 backing; other strings would of course not be constant-time.)

Would a bytes-based comparison (<, maybe also<=>) API be appropriate? That would match the isCanonicallyEquivalent/isCanonicallyLessThan API.

I would welcome such an addition, as UTF8Span is a low-level API and so it would be unfortunate to perform both == and < when comparing string data where both is-equal and is-less-than is needed to be know.

benrimmington · January 27, 2025, 4:04pm

Alternatively, var contiguousUTF8: UTF8Span? to match the existing isContiguousUTF8 and makeContiguousUTF8() APIs. Should the result be optional, or would it have the same "eager copy behaviour" as SE-0456?

Michael_Ilseman · February 3, 2025, 5:18pm

I agree with going with either unicodeScalar or rune. I'll pitch the former as it doesn't introduce new terminology.

Yes, that is my thinking. It's a view of the storage, which is ultimately owned by someone else.

Michael_Ilseman · February 3, 2025, 5:52pm

Yes, String's bit is isKnownASCII, though it is very often the same as whether the content stores only ASCII.

Small strings keep their isASCII bit in sync with whether the contents are ASCII.

Large native string's storage class has an isASCII bit which is checked and set at initialization time and is cleared if we ever append non-ASCII content. However, if non-ASCII content is appended and later removed, meaning the string is back to being ASCII, we don't re-scan the entire string after removal to update the bit.

Non-UTF-8 strings will need to be transcoded into contiguous storage and that process can track ASCII-ness.

Shared strings (i.e. contiguous UTF-8 with an owner field) are created when bridging to ObjC and carry the string's known-ASCII-ness.

We could change UTF8Span's bit to isKnownASCII. This is still checked and definitively set when validating bytes, but when getting a span from a String it carries its known-ASCII meaning.

I wonder if we should expose the 3-state nature of these queries, that is:

known-ASCII
known-non-ASCII
unknown

using 2ish bits per flag to store this.

allevato · February 19, 2025, 3:01pm

The Language Steering Group discussed this pitch (#2688) recently, and we wanted to share some clarifying comments/questions.

Underscores in public APIs

Some of the APIs proposed here have been named with leading underscores because features that they rely on (such as lifetime annotations) are not yet completed. We think it's fine for dependencies like this to exist, but for the purposes of the proposal text, the APIs should be expressed as you would want their desired end state to be, without underscores.

Lifetime semantics

Similarly, since the syntax for lifetime annotations has not yet been decided, it will be easier to discuss the details if those annotations are omitted from the proposal text. We acknowledge, for example, that we will need to provide some syntax for an initializer to indicate that the lifetime of a new value is tied to that of one of its arguments, but this proposal doesn't need to pin itself to those specifics. Instead, we recommend just using doc comments on the relevant declarations to describe what the desired lifetime semantics are, and the implementation can adapt to whatever syntax eventually emerges.

Internal APIs

APIs that aren't part of the public API surface (like the internal UTF8Span.init) can be omitted from the proposal, even if such implementation details end up as part of the ABI because of inlinability. I don't think there were very many of these, however.

Custom iterator types vs. views

This iteration (pun intended?) of the proposal declares "iterator" types for scalar and character iteration that have a number of bespoke new APIs which combine the notions of a view and an index into a single type, whereas the earlier version of the proposal declared view types that were more similar in shape to the scalar view provided by String (and the character view provided by earlier versions of String).

We acknowledge that because UTF8Span is a non-escapable type, neither it nor its nested types can provide the usual collection or iterator conformances at this time (and you acknowledge this in the proposal as well). You also discuss in your alternatives considered section why an earlier version of a more manual decoding API that was based on integer offsets is insufficient, but we were curious why the current API surface for iterators was chosen over something closer to a separate "view and index" model implemented concretely, even without the standard conformances that those types have in String—such a model would also seem to be able to uphold the preconditions that offsets are properly aligned. If the proposed iterator APIs are a better choice, can you elaborate further on this?

fclout · February 19, 2025, 7:13pm

What does "skipping all safety checks" mean in reset(uncheckedAssumingAlignedTo:)? Can you cause an (uncaught) out-of-bounds read by resetting to a bad location?

One thing that you don't mention and that seems useful to ponder is that UTF8Span cannot bridge to a C string because it doesn't care whether the content is NUL-terminated.

I also have a number of less important drive-by comments:

Span being copyable, consuming Span<UInt8> is essentially meaningless in public init(_validating codeUnits: consuming Span<UInt8>).
The comment for skip methods say it returns a number of characters, but the signatures don't have a return value.
The typical property name to get a span out of a collection has changed to span (from storage)
When it comes to embedded Swift, the biggest obstacle to strings is that the platforms typically do not process user-defined strings, and consequently do not want to pay the code size cost of Unicode processing.

glessard · February 19, 2025, 10:25pm

Note that for initializer parameters, consuming is the default ownership, so this simply states the default explicitly.

It does clarify that the lifetime is copied from codeUnits to the new value, in the absence of an explicit lifetime annotation. It indeed means nothing else.

Michael_Ilseman · February 22, 2025, 9:35pm

Because Collections are Sequences, they also provide Iterators in addition to Indexes and SubSequencees (i.e. slices). That is, even with the view types, we'd want to define an iterator type with a next() method such that we'd be able to say:

// iter: UTF8Span.UnicodeScalarIterator (or UTF8Span.UnicodeScalarView.Iterator)
while let scalar = iter.next() {
  ...
}

Working with that iterator will be more efficient than working with indices and views via:

// view: UTF8Span.UnicodeScalarView
var curIdx = view.startIndex
while curIdx < view.endIndex {
  let scalar = view[curIdx]
  ...
  view.formIndex(after: &curIndex)
}

This is because iterators encapsulate and maintain alignment guarantees that indices cannot, even if we have a separate index type per view.

Idiomatic Collection-style interfaces support index interchange, even if "support" means reliably crashing after a dynamic check. Any idiomatic index-based interface would need to dynamically check for correct alignment in case the received index was derived from a different span. (There is a whole design space around smart indices and their tradeoffs, discussed in a lengthy appendix in the Span proposal).

This means that UTF8Span.UnicodeScalarView.subscript would have to check for scalar alignment of its given index, as it does not know whether it originally produced the passed index or not. Similarly, index(after:), index(before:), index(_:offsetBy:), etc., would make these checks on every call.

If we want to give the developer access to efficient formulations of index-style interfaces, we'd additionally propose uncheckedAssumingAligned: variants of nearly every method (subscript(uncheckedAssumingAligned i:), index(uncheckedAssumingAlignedAfter:), index(uncheckedAssumingAlignedBefore:), index(uncheckedAssumingAligned:offsetBy:), etc.). This also undermines the value of having an adapter to existing code patterns.

The question is whether we should provide less efficient Collection-style interfaces in addition to the more efficient iterator-style interfaces. That is, for each XIterator type proposed we could also propose XView, XView.Index, XView.SubSequence, etc.

A benefit of providing this adapter code is that pre-existing generic code over Collection might be able to be adapted to UTF8Span via copy-paste-and-edit.

The downside is that this pattern is less efficient and runs counter to UTF8Span's main goal of direct, low-level (but safe) interfaces.

If we do provide view adapter code, the API could look a little different in that UnicodeScalarIterator is called UnicodeScalarView.Iterator, prefix/suffix are slicing, and the reset() functionality is expressed by slicing the view before creating an iterator. However, this would also have the effect of scattering the efficient API use pattern across multiple types, intermingled with inefficient or ill-advised adaptor interfaces which have the more idiomatic names.

I think that there could be a place in some projects for view-like adapter code as a migration stop-gap. I chose to sever these adapters from this proposal and leave them as future work. For now, if we're going to present a somewhat-divergent formulation of an existing idiom, then it seems like the iterators provide the best option. Even if we had views, we would advise developers to use the iterators.

Finally, in the future there will likely be some kind of Container protocol for types that can vend segments of contiguous storage. In our case, the segment type is UTF8Span, while the element is decoded from the underlying UTF-8. It's likely easier and more straightforward to retrofit or deprecate a single UnicodeScalarIterator type than a collection of types interrelated to each other (and possibly needing associated type workarounds in the future).

Michael_Ilseman · February 24, 2025, 4:12pm

Could you help clarify the policy on some annotations? I get what you're saying about some of these ABI-affecting annotations and I'm wondering where SE should draw the line. I haven't kept up to date with all of the changes to the SE process so I apologize if I missed some guidance. I've erred on the side of providing more potentially-relevant info than less.

SE proposals are meant to address ABI stability concerns. For the most part, ABI is derived from public API and binary compatibility is derived from those API along with the contracts established in their docs.

Should proposals include/remove @frozen, which promotes a type's layout to ABI? Such an annotation can be relevant to reviewers, as it affects whether instances of a type can be handled/passed directly or indirectly, how it can be changed or extended in the future, etc. It closes future API doors and opens current performance doors.

UTF8Span, like String and a small number of other types, goes beyond just being @frozen by hand-promoting the interpretation of certain bits to ABI and thus the binary compatibility story. In this proposal, I represented that as an ASCII table on the (@usableFromInline internal) _countAndFlags property.

Should this information be included in the proposal, and if so where should it live?

This hand-promotion happens inside the body of @inlinable or @_alwaysEmitIntoClient methods. In general, @inlinable has the effect of formalizing the API contract into the binary compatibility story, and is not particularly relevant to SE. But in this case, it affects the ABI and future evolvability in a similar fashion to @frozen by ascribing meaning to values of certain bits in the type's representation.

(Further, the ABI-affecting @_alwaysEmitIntoClient annotation is underscored, which makes it look odd in a proposal).

What about platform-availability annotations such as @_unavailableInEmbedded? This seems like it is relevant to SE, but could also be noise for many reviewers.

For @lifetime, I will drop them in favor of doc comment descriptions of the lifetime.

Michael_Ilseman · February 24, 2025, 8:05pm

I have updated the gist to incorporate feedback.

I think the main annotation question I have is how to talk about the ABI of UTF8Span itself. We want to be able to communicate to SE what the type is and how it could evolve.

Basically, I want to say that this is a trivial 2-word struct whose lifetime is statically managed. Trivial 2-word comes from @frozen and listing its stored members in the proposal and statically managed comes from mentioning the : ~Escapable. This is similar to how the Span proposal specified both @frozen and the stored members (it did omit @usableFromInline).

If we do describe this layout, the next question is whether we describe the custom hand-coded bit interpretation of the performance flags. We don't have an analogy on SE AFAICT, as String predates SE and String's perf flags predate ABI stability in SE. It is very much ABI and it does show potential evolution directions and constraints. I could see arguments either way.

fclout · February 24, 2025, 10:56pm

Thanks for clarifying that reset(uncheckedAssumingAlignedTo:) skips bounds checks. I have another clarifying question: who is responsible for bounds-checking in the first place? I would assume that next() or previous() are; in this design, you can go out of bounds with reset(uncheckedAssumingAlignedTo:), but the next call to next() or previous() is going to trap or return nil. Is there a value that you can pass reset(uncheckedAssumingAlignedTo:) such that the next call to next() or previous() will read out of bounds? If so, the method should be @unsafe.

benrimmington · February 24, 2025, 11:03pm

Conditional on the alignment of its UInt64 stored property, I think the UTF8Span struct will either be:

2 words on a 64-bit platform.
3 or 4 words on a 32-bit platform?
5 or 6 or 8 words on a 16-bit platform?

At least for Embedded Swift, would it be better to store the count and flags separately?

Michael_Ilseman · February 25, 2025, 5:46pm

If an out-of-bounds code unit offset is passed in, it will not result in an out of bounds access via previous() and next() as they guard against the start/end position.

A goal is to make out-of-bounds access impossible even if the passed offset is in-bounds but not scalar aligned. This should work, but I need to verify and write some tests.

This would mean that reset(uncheckedAssumingAlignedTo:) doesn't need @unsafe as it isn't memory unsafe and doesn't do out of bounds reads. previous()/next() might consistently return garbage values, nil, or trap, but wouldn't be unsafe.

stephencelis · February 25, 2025, 6:03pm

Is there a toolchain that we can play around with?

Karl · February 25, 2025, 6:06pm

You wouldn't use a UInt64 for the count and flags of other platforms.

You need a count, and it doesn't need to be a full word because the type doesn't support viewing your entire addressable memory as UTF8, and then we use the spare bits to store some flags.

Michael_Ilseman · February 25, 2025, 8:09pm

I'm working on an updated one right now. I'll post when it's available.

[Pitch] UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Introduction

Motivation

UTF-8 validity and efficiency

Proposed solution

Detailed design

UTF-8 validation

Creation and validation

Scalar processing

Character processing

Comparisons

Canonical equivalence and ordering

Extracting sub-spans

Queries

UTF8Span from String

Span-like functionality

Getting a String from a UTF8Span

Source compatibility

ABI compatibility

Implications on adoption

Future directions

Streaming grapheme breaking

More alignments

Normalization

UnicodeScalarView and CharacterView

More algorithms

More validation API

Transcoded iterators, normalized iterators, case-folded iterators, etc

Regex or regex-like support

Canonical Spaceships

Exposing String's storage class

Track other bits

Putting more API on String

Generalize printing and logging facilities

Alternatives considered

Invalid start / end of input UTF-8 encoding errors

An unsafe UTF8 Buffer Pointer type

Alternatives to Iterators

Functions

Collections

Acknowledgments

Underscores in public APIs

Lifetime semantics

Internal APIs

Custom iterator types vs. views

`UTF8Span` from `String`

`Span`-like functionality

Getting a `String` from a `UTF8Span`

Exposing `String`'s storage class