[Pitch] UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

gist

UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Introduction

We introduce UTF8Span for efficient and safe Unicode processing over contiguous storage. UTF8Span is a memory-safe, non-escapable type similar to Span.

Native Strings are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements String's API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers.

Motivation

Currently, if a developer wants to do String-like processing over UTF-8 bytes, they have to make an instance of String, which allocates a native storage class, copies all the bytes, and is reference counted. The developer would then need to operate within the new String's views and map between String.Index and byte offsets in the original buffer.
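As a rough illustration of the status quo described above, here is a minimal sketch using only existing standard library API. The sample text and variable names are made up for the example.

```swift
// Bytes that live in some other data structure (sample content: "café au lait").
let bytes: [UInt8] = Array("caf\u{E9} au lait".utf8)

// Today: build a String, which copies the bytes into String's own storage.
let string = String(decoding: bytes, as: UTF8.self)

// Work in terms of String.Index...
let accent = string.firstIndex(of: "\u{E9}")!

// ...then map the index back to a byte offset in the original buffer.
let byteOffset = string.utf8.distance(from: string.utf8.startIndex, to: accent)
```

Here `byteOffset` points at the lead byte of the two-byte encoding of "é" in `bytes`. The copy and the index-to-offset mapping are exactly the overhead a borrowed view would avoid.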

For example, if these bytes were part of a data structure, the developer would need to decide to either cache such a new String instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation and potentially reference counting.

Furthermore, String may not be fully available on tightly constrained platforms, especially those that cannot support allocations. Both String and UTF8Span have some API that requires Unicode data tables, which might not be available on embedded platforms: String depends on these data tables through its conformances to Comparable and Collection, while UTF8Span has a couple of methods that will be unavailable.

UTF-8 validity and efficiency

UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.

Failure to guarantee UTF-8 encoding validity creates security and safety concerns. With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, and these can differ from the bounds dictated by its contents.

Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison).
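To make the overlong-encoding concern concrete, here is a small sketch. `lenientDecodeTwoBytes` is a hypothetical decoder written only for this example. It stands in for any decoder that combines payload bits without rejecting overlong forms.

```swift
// ".." followed by an overlong two-byte encoding of "/" (U+002F).
let input: [UInt8] = [0x2E, 0x2E, 0xC0, 0xAF]

// A byte-wise check for "/" (0x2F) finds nothing.
let byteCheckFindsSlash = input.contains(0x2F)

// Hypothetical lenient decoder: combines the payload bits of a two-byte
// sequence without rejecting overlong forms.
func lenientDecodeTwoBytes(_ b0: UInt8, _ b1: UInt8) -> UInt32 {
  (UInt32(b0 & 0x1F) << 6) | UInt32(b1 & 0x3F)
}
let smuggled = lenientDecodeTwoBytes(input[2], input[3])

// String(decoding:as:) repairs ill-formed input by substituting U+FFFD,
// so it never produces "/" from these bytes.
let repaired = String(decoding: input, as: UTF8.self)
let repairedHasSlash = repaired.unicodeScalars.contains("/")
```

The byte-wise check and the lenient decoder disagree about whether a "/" is present, which is the kind of mismatch that path-traversal style attacks rely on.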

Proposed solution

We propose a non-escapable UTF8Span which exposes String functionality for validly-encoded UTF-8 code units in contiguous memory. We also propose rich API describing the kind and location of encoding errors.

Detailed design

UTF8Span is a borrowed view into contiguous memory containing validly-encoded UTF-8 code units.

@frozen
public struct UTF8Span: Copyable, ~Escapable {
  // **TODO**: This might end up being UnsafeRawPointer? like in Span
  public var unsafeBaseAddress: UnsafeRawPointer

  /*
   A bit-packed count and flags (such as isASCII)

   ╔═══════╦═════╦═════╦══════════╦═══════╗
   ║  b63  ║ b62 ║ b61 ║  b60:56  ║ b55:0 ║
   ╠═══════╬═════╬═════╬══════════╬═══════╣
   ║ ASCII ║ NFC ║ SSC ║ reserved ║ count ║
   ╚═══════╩═════╩═════╩══════════╩═══════╝

   ASCII means the contents are all-ASCII (<= 0x7F).
   NFC means contents are in normal form C for fast comparisons.
   SSC means single-scalar Characters (i.e. grapheme clusters): every
     `Character` holds only a single `Unicode.Scalar`.
   */
  @usableFromInline
  internal var _countAndFlags: UInt64
}
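For readers unfamiliar with bit-packed fields, the layout in the comment above could be read and written roughly as follows. `CountAndFlags` and its member names are hypothetical stand-ins, not proposed API, and the sketch assumes the count occupies the low 56 bits.

```swift
// Hypothetical stand-in for the proposed `_countAndFlags` field.
struct CountAndFlags {
  var rawValue: UInt64

  static let asciiBit: UInt64 = 1 << 63
  static let nfcBit: UInt64 = 1 << 62
  static let sscBit: UInt64 = 1 << 61
  // Assumption: the count lives in the low 56 bits.
  static let countMask: UInt64 = (1 << 56) - 1

  init(count: Int, isASCII: Bool) {
    precondition(count >= 0 && UInt64(count) <= Self.countMask)
    rawValue = UInt64(count) | (isASCII ? Self.asciiBit : 0)
  }

  var count: Int { Int(rawValue & Self.countMask) }
  var isASCII: Bool { rawValue & Self.asciiBit != 0 }
  var isKnownNFC: Bool { rawValue & Self.nfcBit != 0 }
  var isKnownSingleScalarCharacters: Bool { rawValue & Self.sscBit != 0 }
}

var packed = CountAndFlags(count: 5, isASCII: true)
packed.rawValue |= CountAndFlags.nfcBit  // e.g. after a successful NFC check
```

Keeping the flags in the same word as the count means the span stays two words wide while still remembering the results of earlier scans.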

UTF-8 validation

We propose new API for identifying where and what kind of encoding errors are present in UTF-8 content.

extension Unicode.UTF8 {
  /**

   The kind and location of a UTF-8 encoding error.

   Valid UTF-8 is represented by this table:

   ╔════════════════════╦════════╦════════╦════════╦════════╗
   ║    Scalar value    ║ Byte 0 ║ Byte 1 ║ Byte 2 ║ Byte 3 ║
   ╠════════════════════╬════════╬════════╬════════╬════════╣
   ║ U+0000..U+007F     ║ 00..7F ║        ║        ║        ║
   ║ U+0080..U+07FF     ║ C2..DF ║ 80..BF ║        ║        ║
   ║ U+0800..U+0FFF     ║ E0     ║ A0..BF ║ 80..BF ║        ║
   ║ U+1000..U+CFFF     ║ E1..EC ║ 80..BF ║ 80..BF ║        ║
   ║ U+D000..U+D7FF     ║ ED     ║ 80..9F ║ 80..BF ║        ║
   ║ U+E000..U+FFFF     ║ EE..EF ║ 80..BF ║ 80..BF ║        ║
   ║ U+10000..U+3FFFF   ║ F0     ║ 90..BF ║ 80..BF ║ 80..BF ║
   ║ U+40000..U+FFFFF   ║ F1..F3 ║ 80..BF ║ 80..BF ║ 80..BF ║
   ║ U+100000..U+10FFFF ║ F4     ║ 80..8F ║ 80..BF ║ 80..BF ║
   ╚════════════════════╩════════╩════════╩════════╩════════╝

   ### Classifying errors

   An *unexpected continuation* is when a continuation byte (`10xxxxxx`) occurs
   in a position that should be the start of a new scalar value. Unexpected
   continuations can often occur when the input contains arbitrary data
   instead of textual content. An unexpected continuation at the start of
   input might mean that the input was not correctly sliced along scalar
   boundaries or that it does not contain UTF-8.

   A *truncated scalar* is a multi-byte sequence that is the start of a valid
   multi-byte scalar but is cut off before ending correctly. A truncated
   scalar at the end of the input might mean that only part of the entire
   input was received.

   A *surrogate code point* (`U+D800..U+DFFF`) is invalid UTF-8. Surrogate
   code points are used by UTF-16 to encode scalars in the supplementary
   planes. Their presence may mean the input was encoded in a different 8-bit
   encoding, such as CESU-8, WTF-8, or Java's Modified UTF-8.

   An *invalid non-surrogate code point* is any code point higher than
   `U+10FFFF`. This can often occur when the input is arbitrary data instead
   of textual content.

   An *overlong encoding* occurs when a scalar value that could have been
   encoded using fewer bytes is encoded in a longer byte sequence. Overlong
   encodings are invalid UTF-8 and can lead to security issues if not
   correctly detected:

   - https://nvd.nist.gov/vuln/detail/CVE-2008-2938
   - https://nvd.nist.gov/vuln/detail/CVE-2000-0884

   An overlong encoding of `NUL`, `0xC0 0x80`, is used in Java's Modified
   UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts
   to bypass security measures.

   ### Reporting the range of the error

   The range of the error reported follows the *Maximal subpart of an
   ill-formed subsequence* algorithm in which each error is either one byte
   long or ends before the first byte that is disallowed. See "U+FFFD
   Substitution of Maximal Subparts" in the Unicode Standard. Unicode started
   recommending this algorithm in version 6, and it has been adopted by the
   W3C.

   The maximal subpart algorithm will produce a single multi-byte range for a
   truncated scalar (a multi-byte sequence that is the start of a valid
   multi-byte scalar but is cut off before ending correctly). For all other
   errors (including overlong encodings, surrogates, and invalid code
   points), it will produce an error per byte.

   Since overlong encodings, surrogates, and invalid code points are erroneous
   by the second byte (at the latest), the above definition produces the same
   ranges as defining such a sequence as a truncated scalar error followed by
   unexpected continuation byte errors. The more semantically-rich
   classification is reported.

   For example, a surrogate code point sequence `ED A0 80` will be reported
   as three `.surrogateCodePointByte` errors rather than a `.truncatedScalar`
   followed by two `.unexpectedContinuationByte` errors.

   Other commonly reported error ranges can be constructed from this result.
   For example, PEP 383's error-per-byte can be constructed by mapping over
   the reported range. Similarly, a single error for the longest invalid byte
   range can be constructed by joining adjacent error ranges.

   ╔═════════════════╦══════╦═════╦═════╦═════╦═════╦═════╦═════╦══════╗
   ║                 ║  61  ║ F1  ║ 80  ║ 80  ║ E1  ║ 80  ║ C2  ║  62  ║
   ╠═════════════════╬══════╬═════╬═════╬═════╬═════╬═════╬═════╬══════╣
   ║ Longest range   ║ U+61 ║ err ║     ║     ║     ║     ║     ║ U+62 ║
   ║ Maximal subpart ║ U+61 ║ err ║     ║     ║ err ║     ║ err ║ U+62 ║
   ║ Error per byte  ║ U+61 ║ err ║ err ║ err ║ err ║ err ║ err ║ U+62 ║
   ╚═════════════════╩══════╩═════╩═════╩═════╩═════╩═════╩═════╩══════╝

   */
  @frozen
  public struct EncodingError: Error, Sendable, Hashable, Codable {
    /// The kind of encoding error
    public var kind: Unicode.UTF8.EncodingError.Kind

    /// The range of offsets into our input containing the error
    public var range: Range<Int>

    @_alwaysEmitIntoClient
    public init(
      _ kind: Unicode.UTF8.EncodingError.Kind,
      _ range: some RangeExpression<Int>
    )

    @_alwaysEmitIntoClient
    public init(_ kind: Unicode.UTF8.EncodingError.Kind, at: Int)
  }
}

extension UTF8.EncodingError {
  /// The kind of encoding error encountered during validation
  @frozen
  public struct Kind: Error, Sendable, Hashable, Codable, RawRepresentable {
    public var rawValue: UInt8

    @inlinable
    public init(rawValue: UInt8)

    /// A continuation byte (`10xxxxxx`) outside of a multi-byte sequence
    @_alwaysEmitIntoClient
    public static var unexpectedContinuationByte: Self

    /// A byte in a surrogate code point (`U+D800..U+DFFF`) sequence
    @_alwaysEmitIntoClient
    public static var surrogateCodePointByte: Self

    /// A byte in an invalid, non-surrogate code point (`>U+10FFFF`) sequence
    @_alwaysEmitIntoClient
    public static var invalidNonSurrogateCodePointByte: Self

    /// A byte in an overlong encoding sequence
    @_alwaysEmitIntoClient
    public static var overlongEncodingByte: Self

    /// A multi-byte sequence that is the start of a valid multi-byte scalar
    /// but is cut off before ending correctly
    @_alwaysEmitIntoClient
    public static var truncatedScalar: Self
  }
}

@_unavailableInEmbedded
extension UTF8.EncodingError.Kind: CustomStringConvertible {
  public var description: String { get }
}

@_unavailableInEmbedded
extension UTF8.EncodingError: CustomStringConvertible {
  public var description: String { get }
}
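The range-reporting rules in the doc comment above can be sketched as follows. This is not the proposed implementation. It reports only error ranges (not kinds), and `sequenceShape(forLead:)`, `maximalSubpartErrors(in:)`, and `joinAdjacent(_:)` are hypothetical helper names.

```swift
// Length and allowed second-byte range for each valid lead byte,
// following the table of valid UTF-8.
func sequenceShape(forLead lead: UInt8) -> (length: Int, second: ClosedRange<UInt8>)? {
  switch lead {
  case 0xC2...0xDF: return (2, 0x80...0xBF)
  case 0xE0: return (3, 0xA0...0xBF)
  case 0xE1...0xEC, 0xEE...0xEF: return (3, 0x80...0xBF)
  case 0xED: return (3, 0x80...0x9F)
  case 0xF0: return (4, 0x90...0xBF)
  case 0xF1...0xF3: return (4, 0x80...0xBF)
  case 0xF4: return (4, 0x80...0x8F)
  default: return nil  // continuation bytes, C0, C1, F5...FF
  }
}

// Each error is either one byte long or ends before the first disallowed byte.
func maximalSubpartErrors(in bytes: [UInt8]) -> [Range<Int>] {
  var errors: [Range<Int>] = []
  var i = 0
  while i < bytes.count {
    if bytes[i] < 0x80 { i += 1; continue }  // ASCII
    guard let shape = sequenceShape(forLead: bytes[i]) else {
      errors.append(i..<(i + 1))  // a byte that can never start a scalar
      i += 1
      continue
    }
    var j = i + 1
    var allowed = shape.second
    while j < i + shape.length, j < bytes.count, allowed.contains(bytes[j]) {
      j += 1
      allowed = 0x80...0xBF
    }
    if j < i + shape.length { errors.append(i..<j) }  // cut off early
    i = j
  }
  return errors
}

// "Longest range" reporting: merge error ranges that touch.
func joinAdjacent(_ ranges: [Range<Int>]) -> [Range<Int>] {
  var result: [Range<Int>] = []
  for range in ranges {
    if let last = result.last, last.upperBound == range.lowerBound {
      result[result.count - 1] = last.lowerBound..<range.upperBound
    } else {
      result.append(range)
    }
  }
  return result
}

// The byte sequence from the table: 61 F1 80 80 E1 80 C2 62.
let sample: [UInt8] = [0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62]
let maximal = maximalSubpartErrors(in: sample)
let longest = joinAdjacent(maximal)
// "Error per byte" reporting: split every range into one-byte ranges.
let perByte = maximal.flatMap { range in range.map { $0..<($0 + 1) } }
```

For the sample bytes this yields three maximal-subpart ranges (offsets 1 to 3, 4 to 5, and 6), one joined range covering offsets 1 to 6, and six single-byte ranges, matching the three rows of the table.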

QUESTION: It would be good to expose this functionality via a general-purpose validation API. Do we want a findFirstError-style API, a findAllErrors-style API, or both? E.g.:

extension UTF8 {
  public static func checkForError(
    _ s: some Sequence<UInt8>
  ) -> UTF8.EncodingError?

  // ... or

  public static func checkForAllErrors(
    _ s: some Sequence<UInt8>
  ) -> some Sequence<UTF8.EncodingError>
}

Creation and validation

UTF8Span is validated at initialization time and encoding errors are diagnosed and thrown.


extension UTF8Span {
  @lifetime(codeUnits)
  public init(
    _validating codeUnits: consuming Span<UInt8>
  ) throws(UTF8.EncodingError)

  @lifetime(borrow start)
  internal init(
    _unsafeAssumingValidUTF8 start: borrowing UnsafeRawPointer,
    _countAndFlags: UInt64
  )
}

NOTE: The final status of underscores, annotations, etc., is pending things like SE-0456 and Lifetime Dependencies.

Scalar processing

We propose a UTF8Span.ScalarIterator type that can do scalar processing forwards and backwards. Note that ScalarIterator itself is non-escapable, and thus cannot conform to IteratorProtocol, etc.

extension UTF8Span {
  public func _makeScalarIterator() -> ScalarIterator

  /// Iterate the `Unicode.Scalar` contents of a `UTF8Span`.
  public struct ScalarIterator: ~Escapable {
    public var codeUnits: UTF8Span

    /// The byte offset of the start of the next scalar. This is
    /// always scalar-aligned.
    public var currentCodeUnitOffset: Int { get }

    public init(_ codeUnits: UTF8Span)

    /// Decode and return the scalar starting at `currentCodeUnitOffset`.
    /// After the function returns, `currentCodeUnitOffset` holds the
    /// position at the end of the returned scalar, which is also the start
    /// of the next scalar.
    ///
    /// Returns `nil` if at the end of the `UTF8Span`.
    public mutating func next() -> Unicode.Scalar?

    /// Decode and return the scalar ending at `currentCodeUnitOffset`. After
    /// the function returns, `currentCodeUnitOffset` holds the position at
    /// the start of the returned scalar, which is also the end of the
    /// previous scalar.
    ///
    /// Returns `nil` if at the start of the `UTF8Span`.
    public mutating func previous() -> Unicode.Scalar?

    /// Advance `currentCodeUnitOffset` to the end of the current scalar,
    /// without decoding it.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
    /// if at the end of the UTF8Span.
    public mutating func skipForward() -> Int

    /// Advance `currentCodeUnitOffset` to the end of `n` scalars, without
    /// decoding them.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be
    /// fewer than `n` if at the end of the UTF8Span.
    public mutating func skipForward(by n: Int) -> Int

    /// Move `currentCodeUnitOffset` to the start of the previous scalar,
    /// without decoding it.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
    /// if at the start of the UTF8Span.
    public mutating func skipBack() -> Int

    /// Move `currentCodeUnitOffset` to the start of the previous `n` scalars,
    /// without decoding them.
    ///
    /// Returns the number of `Unicode.Scalar`s skipped over, which can be
    /// fewer than `n` if at the start of the UTF8Span.
    public mutating func skipBack(by n: Int) -> Int

    /// Reset to the nearest scalar-aligned code unit offset `<= i`.
    ///
    /// **TODO**: Example
    public mutating func reset(roundingBackwardsFrom i: Int)

    /// Reset to the nearest scalar-aligned code unit offset `>= i`.
    ///
    /// **TODO**: Example
    public mutating func reset(roundingForwardsFrom i: Int)

    /// Reset this iterator to code unit offset `i`, skipping _all_ safety
    /// checks.
    ///
    /// Note: This is only for very specific, low-level use cases. If
    /// `i` is not properly scalar-aligned, this function can
    /// result in undefined behavior when, e.g., `next()` is called.
    ///
    /// For example, this could be used by a regex engine to backtrack to a
    /// known-valid previous position.
    ///
    public mutating func reset(uncheckedAssumingAlignedTo i: Int)

    /// Returns the UTF8Span containing all the content up to the iterator's
    /// current position.
    public func _prefix() -> UTF8Span

    /// Returns the UTF8Span containing all the content after the iterator's
    /// current position.
    public func _suffix() -> UTF8Span
  }
}
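Since the rounding resets above are still marked as needing an example, here is one possible sketch of what scalar alignment and validity-assuming decoding mean at the byte level. It works on a plain array rather than a UTF8Span, and all function names are hypothetical. It relies on the input already being valid UTF-8.

```swift
// "a", "é" (C3 A9), "€" (E2 82 AC): scalars start at offsets 0, 1, and 3.
let utf8: [UInt8] = [0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC]

func isContinuation(_ byte: UInt8) -> Bool { byte & 0xC0 == 0x80 }

// Nearest scalar-aligned offset <= i. Assumes `bytes` is valid UTF-8.
func roundedBackwards(from i: Int, in bytes: [UInt8]) -> Int {
  var i = i
  while i < bytes.count, isContinuation(bytes[i]) { i -= 1 }
  return i
}

// Nearest scalar-aligned offset >= i. Assumes `bytes` is valid UTF-8.
func roundedForwards(from i: Int, in bytes: [UInt8]) -> Int {
  var i = i
  while i < bytes.count, isContinuation(bytes[i]) { i += 1 }
  return i
}

// Decode the scalar starting at aligned offset `i`, skipping validation.
func decodeScalar(at i: Int, in bytes: [UInt8]) -> (scalar: Unicode.Scalar, next: Int)? {
  guard i < bytes.count else { return nil }
  let lead = bytes[i]
  let length: Int
  switch lead {
  case ..<0x80: length = 1
  case ..<0xE0: length = 2
  case ..<0xF0: length = 3
  default: length = 4
  }
  // Payload bits of the lead byte: 7, 5, 4, or 3 bits.
  let leadMask: UInt8 = length == 1 ? 0x7F : 0xFF >> UInt8(length + 1)
  var value = UInt32(lead & leadMask)
  for k in 1..<length {
    value = (value << 6) | UInt32(bytes[i + k] & 0x3F)
  }
  return (Unicode.Scalar(value)!, i + length)
}

// Start from an unaligned offset, round back, then iterate forwards.
var offset = roundedBackwards(from: 2, in: utf8)
var decoded: [Unicode.Scalar] = []
while let step = decodeScalar(at: offset, in: utf8) {
  decoded.append(step.scalar)
  offset = step.next
}
```

Offset 2 falls inside "é", so rounding backwards lands on offset 1 and rounding forwards lands on offset 3. The decoder needs no bounds or range checks beyond the end-of-input test because validity was established up front.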

QUESTION: Is it worth also surfacing as isScalarAligned API on UTF8Span so it's a little easier to find and spell (as well as talk about in doc comments)?

Character processing

We similarly propose a UTF8Span.CharacterIterator type that can do grapheme-breaking forwards and backwards.

The CharacterIterator assumes that the start and end of the UTF8Span is the start and end of content.

Any scalar-aligned position is a valid place to start or reset the grapheme-breaking algorithm to, though you could get different Character output if resetting to a position that isn't Character-aligned relative to the start of the UTF8Span (e.g. in the middle of a series of regional indicators).
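The regional-indicator case can be reproduced with String today. The sketch below uses only existing API and builds the second string by dropping the first scalar, which mimics starting grapheme breaking at a scalar-aligned but not Character-aligned position.

```swift
// Four regional indicators: U, S, C, A.
let flags = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}"

// Breaking from the true start pairs them as "US" and "CA".
let fromStart = Array(flags)

// Breaking from one scalar in pairs "S" with "C" and leaves "A" alone.
let tail = String(String.UnicodeScalarView(flags.unicodeScalars.dropFirst()))
let fromMiddle = Array(tail)
```

Both results contain two Characters, but they are different Characters, which is why the iterator treats the start of the span as the start of content.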

@_unavailableInEmbedded
extension UTF8Span {
  public func _makeCharacterIterator() -> CharacterIterator

  /// Iterate the `Character` contents of a `UTF8Span`.
  public struct CharacterIterator: ~Escapable {
    public var codeUnits: UTF8Span

    /// The byte offset of the start of the next `Character`. This is 
    /// always scalar-aligned and `Character`-aligned.
    public var currentCodeUnitOffset: Int { get }

    public init(_ span: UTF8Span)

    /// Return the `Character` starting at `currentCodeUnitOffset`. After the
    /// function returns, `currentCodeUnitOffset` holds the position at the
    /// end of the `Character`, which is also the start of the next
    /// `Character`. 
    ///
    /// Returns `nil` if at the end of the `UTF8Span`.
    public mutating func next() -> Character?

    /// Return the `Character` ending at `currentCodeUnitOffset`. After the
    /// function returns, `currentCodeUnitOffset` holds the position at the
    /// start of the returned `Character`, which is also the end of the
    /// previous `Character`. 
    ///
    /// Returns `nil` if at the start of the `UTF8Span`.
    public mutating func previous() -> Character?

    /// Advance `currentCodeUnitOffset` to the end of the current `Character`,
    /// without constructing it.
    ///
    /// Returns the number of `Character`s skipped over, which can be 0
    /// if at the end of the UTF8Span.
    public mutating func skipForward() -> Int

    /// Advance `currentCodeUnitOffset` to the end of `n` `Character`s, without
    /// constructing them.
    ///
    /// Returns the number of `Character`s skipped over, which can be
    /// fewer than `n` if at the end of the UTF8Span.
    public mutating func skipForward(by n: Int) -> Int

    /// Move `currentCodeUnitOffset` to the start of the previous `Character`,
    /// without constructing it.
    ///
    /// Returns the number of `Character`s skipped over, which can be 0
    /// if at the start of the UTF8Span.
    public mutating func skipBack() -> Int

    /// Move `currentCodeUnitOffset` to the start of the previous `n`
    /// `Character`s, without constructing them.
    ///
    /// Returns the number of `Character`s skipped over, which can be
    /// fewer than `n` if at the start of the UTF8Span.
    public mutating func skipBack(by n: Int) -> Int

    /// Reset to the nearest character-aligned position `<= i`.
    public mutating func reset(roundingBackwardsFrom i: Int)

    /// Reset to the nearest character-aligned position `>= i`.
    public mutating func reset(roundingForwardsFrom i: Int)

    /// Reset this iterator to code unit offset `i`, skipping _all_ safety
    /// checks.
    ///
    /// Note: This is only for very specific, low-level use cases. If
    /// `i` is not properly scalar-aligned, this function can
    /// result in undefined behavior when, e.g., `next()` is called. 
    ///
    /// If `i` is scalar-aligned, but not `Character`-aligned, you may get
    /// different results from running `Character` iteration.
    ///
    /// For example, this could be used by a regex engine to backtrack to a
    /// known-valid previous position.
    ///
    public mutating func reset(uncheckedAssumingAlignedTo i: Int)

    /// Returns the UTF8Span containing all the content up to the iterator's
    /// current position.
    public func prefix() -> UTF8Span

    /// Returns the UTF8Span containing all the content after the iterator's
    /// current position.
    public func suffix() -> UTF8Span
  }

}

Comparisons

The content of a UTF8Span can be compared in a number of ways, including literally (byte semantics) and by Unicode canonical equivalence.

extension UTF8Span {
  /// Whether this span has the same bytes as `other`.
  @_alwaysEmitIntoClient
  public func bytesEqual(to other: UTF8Span) -> Bool

  /// Whether this span has the same bytes as `other`.
  @_alwaysEmitIntoClient
  public func bytesEqual(to other: some Sequence<UInt8>) -> Bool

  /// Whether this span has the same `Unicode.Scalar`s as `other`.
  @_alwaysEmitIntoClient
  public func scalarsEqual(
    to other: some Sequence<Unicode.Scalar>
  ) -> Bool

  /// Whether this span has the same `Character`s as `other`, using
  /// `Character.==` (i.e. Unicode canonical equivalence).
  @_unavailableInEmbedded
  @_alwaysEmitIntoClient
  public func charactersEqual(
    to other: some Sequence<Character>
  ) -> Bool
}
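The three levels of comparison differ for canonically equivalent but differently encoded text. The following sketch shows the distinction using String's existing views, since the proposed methods are not yet available.

```swift
let precomposed = "\u{E9}"   // "é" as a single scalar
let decomposed = "e\u{301}"  // "e" followed by a combining acute accent

// Byte semantics: C3 A9 versus 65 CC 81.
let sameBytes = precomposed.utf8.elementsEqual(decomposed.utf8)

// Scalar semantics: one scalar versus two.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// Character semantics: `Character.==` uses canonical equivalence.
let sameCharacters = precomposed.elementsEqual(decomposed)
```

Only the Character-level comparison treats the two spellings as equal. The byte and scalar comparisons are cheaper and are the right tool when literal identity is what matters.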

We also support literal (i.e. non-canonical) pattern matching against StaticString.

extension UTF8Span {
  static func ~=(_ lhs: StaticString, _ rhs: UTF8Span) -> Bool
}
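As a sketch of how literal pattern matching could look at a use site, the example below defines `~=` on a hypothetical `UTF8Bytes` wrapper around an array. Swift evaluates `case pattern:` as `pattern ~= value`, so the sketch takes the StaticString pattern as the first operand.

```swift
// Hypothetical stand-in for UTF8Span: a plain array of UTF-8 code units.
struct UTF8Bytes {
  var bytes: [UInt8]

  // Literal (byte-wise) match: pattern first, matched value second.
  static func ~= (pattern: StaticString, value: UTF8Bytes) -> Bool {
    pattern.withUTF8Buffer { $0.elementsEqual(value.bytes) }
  }
}

func classify(_ method: UTF8Bytes) -> String {
  switch method {
  case "GET": return "read"
  case "POST": return "write"
  default: return "other"
  }
}
```

Because the match is byte-wise, a canonically equivalent but differently encoded spelling of a pattern would not match.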

Canonical equivalence and ordering

UTF8Span can perform Unicode canonical equivalence checks (i.e. the semantics of String.== and Character.==).

extension UTF8Span {
  /// Whether `self` is equivalent to `other` under Unicode Canonical
  /// Equivalence.
  @_unavailableInEmbedded
  public func isCanonicallyEquivalent(
    to other: UTF8Span
  ) -> Bool

  /// Whether `self` orders less than `other` under Unicode Canonical
  /// Equivalence using normalized code-unit order (in NFC).
  @_unavailableInEmbedded
  public func isCanonicallyLessThan(
    _ other: UTF8Span
  ) -> Bool
}

Extracting sub-spans

Slicing a UTF8Span is nuanced and depends on the caller's desired use. A UTF8Span can only be sliced at scalar-aligned code unit offsets, or else the slice would break the valid-UTF8 invariant. Furthermore, if the caller desires consistent grapheme breaking behavior without externally managing grapheme breaking state, it must be sliced along Character boundaries. For this reason, we have exposed slicing as prefix and suffix operations on UTF8Span's iterators instead of Span's extracting methods.
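A quick sketch of why arbitrary byte offsets are not safe slice points, using array slices and String's repairing initializer in place of the proposed types:

```swift
let cafe: [UInt8] = Array("caf\u{E9}".utf8)  // 63 61 66 C3 A9

// Offset 3 is scalar-aligned: the slice is still valid UTF-8.
let alignedText = String(decoding: cafe[..<3], as: UTF8.self)

// Offset 4 splits "é" in half: the slice ends in a truncated scalar,
// which String(decoding:as:) repairs to U+FFFD.
let misalignedText = String(decoding: cafe[..<4], as: UTF8.self)
```

A Span<UInt8> can represent the second slice without complaint, but a UTF8Span must not, since later operations assume validity.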

Queries

UTF8Span checks at construction time and remembers whether its contents are all ASCII. Additional checks can be requested and remembered.

extension UTF8Span {
  /// Returns whether the validated contents were all-ASCII. This is checked at
  /// initialization time and remembered.
  @inlinable
  public var isASCII: Bool { get }

  /// Returns whether the contents are known to be NFC. This is not
  /// always checked at initialization time and is set by `checkForNFC`.
  @inlinable
  @_unavailableInEmbedded
  public var isKnownNFC: Bool { get }

  /// Do a scan checking for whether the contents are in Normal Form C.
  /// When the contents are in NFC, canonical equivalence checks are much
  /// faster.
  ///
  /// `quickCheck` will check for a subset of NFC contents using the
  /// NFCQuickCheck algorithm, which is faster than the full normalization
  /// algorithm. However, it cannot detect all NFC contents.
  ///
  /// Updates the `isKnownNFC` bit.
  @_unavailableInEmbedded
  public mutating func checkForNFC(
    quickCheck: Bool
  ) -> Bool

  /// Returns whether every `Character` (i.e. grapheme cluster)
  /// is known to be comprised of a single `Unicode.Scalar`.
  ///
  /// This is not always checked at initialization time. It is set by
  /// `checkForSingleScalarCharacters`.
  @_unavailableInEmbedded
  @inlinable
  public var isKnownSingleScalarCharacters: Bool { get }

  /// Do a scan, checking whether every `Character` (i.e. grapheme cluster)
  /// is comprised of only a single `Unicode.Scalar`. When a span contains
  /// only single-scalar characters, character operations are much faster.
  ///
  /// `quickCheck` will check for a subset of single-scalar character contents
  /// using a faster algorithm than the full grapheme breaking algorithm.
  /// However, it cannot detect all single-scalar `Character` contents.
  ///
  /// Updates the `isKnownSingleScalarCharacters` bit.
  @_unavailableInEmbedded
  public mutating func checkForSingleScalarCharacters(
    quickCheck: Bool
  ) -> Bool
}

QUESTION: There is an even quicker quick-check for NFC: checking whether all scalars are <0x300, which covers extended Latin characters. Should we expose that level as well?
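For a sense of how cheap that check would be, here is a sketch over raw bytes. It relies on the input already being valid UTF-8: U+0300 encodes as CC 80, so every scalar below U+0300 consists only of bytes below 0xCC. The function name is hypothetical.

```swift
// True when every scalar in `bytes` is below U+0300.
// Assumes `bytes` is valid UTF-8.
func allScalarsBelow0x300(validUTF8 bytes: some Sequence<UInt8>) -> Bool {
  bytes.allSatisfy { $0 < 0xCC }
}

let latin = allScalarsBelow0x300(validUTF8: "caf\u{E9}".utf8)      // "é" is C3 A9
let combining = allScalarsBelow0x300(validUTF8: "e\u{301}".utf8)   // U+0301 is CC 81
```

The check is a single pass with one comparison per byte and needs no Unicode data tables, so it would also be usable on embedded platforms.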

UTF8Span from String

We will add utf8Span-style properties to String and Substring, in line with however SE-0456 turns out.

Span-like functionality

A UTF8Span is similar to a Span<UInt8>, but with the valid-UTF8 invariant and additional information such as isASCII. We propose a way to get a Span<UInt8> from a UTF8Span as well as some methods directly on UTF8Span:

extension UTF8Span {
  @_alwaysEmitIntoClient
  public var isEmpty: Bool { get }

  @_alwaysEmitIntoClient
  public var storage: Span<UInt8> { get }

  /// Calls a closure with a pointer to the viewed contiguous storage.
  ///
  /// The buffer pointer passed as an argument to `body` is valid only
  /// during the execution of `withUnsafeBufferPointer(_:)`.
  /// Do not store or return the pointer for later use.
  ///
  /// - Parameter body: A closure with an `UnsafeBufferPointer` parameter
  ///   that points to the viewed contiguous storage. If `body` has
  ///   a return value, that value is also used as the return value
  ///   for the `withUnsafeBufferPointer(_:)` method. The closure's
  ///   parameter is valid only for the duration of its execution.
  /// - Returns: The return value of the `body` closure parameter.
  @_alwaysEmitIntoClient
  borrowing public func withUnsafeBufferPointer<
    E: Error, Result: ~Copyable & ~Escapable
  >(
    _ body: (_ buffer: borrowing UnsafeBufferPointer<UInt8>) throws(E) -> Result
  ) throws(E) -> dependsOn(self) Result
}

Getting a String from a UTF8Span

QUESTION: We should make it easier than String(decoding:as:) to make a String copy of the UTF8Span, especially since UTF8Span cannot conform to Sequence or Collection. This will form an ARC-managed copy and not something that will share the (ephemeral) storage.
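For reference, the existing spelling looks roughly like this when starting from contiguous bytes. The array here is only a stand-in for whatever owns the storage behind a UTF8Span.

```swift
var storage: [UInt8] = Array("hello".utf8)

// Copy the bytes into a new, independently owned String.
let copy = storage.withUnsafeBufferPointer { buffer in
  String(decoding: buffer, as: UTF8.self)
}

// Mutating the original storage afterwards does not affect the copy.
storage[0] = 0x6A  // "j"
let current = String(decoding: storage, as: UTF8.self)
```

The String owns its own storage, so it can safely outlive the buffer it was created from.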

Source compatibility

This proposal is additive and source-compatible with existing code.

ABI compatibility

This proposal is additive and ABI-compatible with existing code.

Implications on adoption

The additions described in this proposal require a new version of the standard library and runtime.

Future directions

Streaming grapheme breaking

see gist

More alignments

see gist

Normalization

see gist

UnicodeScalarView and CharacterView

see gist

More algorithms

see gist

More validation API

see gist

Transcoded iterators, normalized iterators, case-folded iterators, etc

see gist

Regex or regex-like support

see gist

Canonical Spaceships

see gist

Exposing String's storage class

see gist

Track other bits

see gist

Putting more API on String

see gist

Generalize printing and logging facilities

see gist

Alternatives considered

Invalid start / end of input UTF-8 encoding errors

see gist

An unsafe UTF8 Buffer Pointer type

see gist

Alternatives to Iterators

Functions

see gist

Collections

see gist

Acknowledgments

Karoy Lorentey, Karl, Geordie_J, and fclout contributed to this proposal with their clarifying questions and discussions.


A few nits:

In current APIs, as far as I'm aware, we've never abbreviated "Unicode scalar" to "scalar"—even on String-specific APIs where there isn't any realistic possibility of confusion with SIMD scalars, etc. I'm thinking of unicodeScalars specifically as the prototype of this. Can we stick to that usage?

(I've always thought that if "Unicode scalar" is too much of a mouthful—which it is—Go's "rune" is the right term-of-art and not the at-best-meaningless-at-worst-misleading "scalar.")


It doesn't look like this conforms to IteratorProtocol—if not, can we find a different name for it than "Iterator" (even though I know that it aligns with ICU terminology)? Perhaps "Strider"?


This is a good caveat. To make it explicit, if we buy the argument that storage-is-better-than-span for the property, this would be utf8Storage; and if we buy the argument that span should be the Span-vending API, then this would be utf8Span?


Thank you for the work in this area!

I think String only keeps an is-known-ASCII flag, not an is-ASCII flag. Does this mean that accessing a UTF8Span of a String would not be a constant-time operation, as strings for which the is-known-ASCII flag is false would need to be scanned for the is-ASCII status first? (In regard to String values with a UTF-8 backing; other strings would of course not be constant-time.)

Would a bytes-based comparison (<, maybe also <=>) API be appropriate? That would match the isCanonicallyEquivalent/isCanonicallyLessThan API.

I would welcome such an addition, as UTF8Span is a low-level API and so it would be unfortunate to perform both == and < when comparing string data where both the is-equal and is-less-than results are needed.


Alternatively, var contiguousUTF8: UTF8Span? to match the existing isContiguousUTF8 and makeContiguousUTF8() APIs. Should the result be optional, or would it have the same "eager copy behaviour" as SE-0456?


I agree with going with either unicodeScalar or rune. I'll pitch the former as it doesn't introduce new terminology.

Yes, that is my thinking. It's a view of the storage, which is ultimately owned by someone else.


Yes, String's bit is isKnownASCII, though it is very often the same as whether the content stores only ASCII.

Small strings keep their isASCII bit in sync with whether the contents are ASCII.

A large native string's storage class has an isASCII bit, which is checked and set at initialization time and is cleared if we ever append non-ASCII content. However, if non-ASCII content is appended and later removed, meaning the string is back to being ASCII, we don't re-scan the entire string after removal to update the bit.

Non-UTF-8 strings will need to be transcoded into contiguous storage and that process can track ASCII-ness.

Shared strings (i.e. contiguous UTF-8 with an owner field) are created when bridging to ObjC and carry the string's known-ASCII-ness.

We could change UTF8Span's bit to isKnownASCII. This is still checked and definitively set when validating bytes, but when getting a span from a String it carries its known-ASCII meaning.

I wonder if we should expose the 3-state nature of these queries, that is:

  • known-ASCII
  • known-non-ASCII
  • unknown

using 2ish bits per flag to store this.
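A rough sketch of what a 2-bit, 3-state flag could look like. `Knowledge` and `QueryFlags` are hypothetical names, and the bit position is arbitrary.

```swift
// Three states fit in two bits.
enum Knowledge: UInt8 {
  case unknown = 0
  case knownFalse = 1
  case knownTrue = 2
}

struct QueryFlags {
  var rawValue: UInt8 = 0

  // The ASCII query occupies the two lowest bits.
  var ascii: Knowledge {
    get { Knowledge(rawValue: rawValue & 0b11) ?? .unknown }
    set { rawValue = (rawValue & ~0b11) | newValue.rawValue }
  }
}

var query = QueryFlags()
let initial = query.ascii
query.ascii = .knownTrue   // validated: all ASCII
query.ascii = .knownFalse  // non-ASCII content appended
query.ascii = .unknown     // that content removed, no re-scan performed
```

With only a single bit, the last two states collapse into "not known to be ASCII", which is the situation described above for large native strings.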
