[Pitch] Safe UTF-8 Processing Over Contiguous Bytes

Michael_Ilseman · June 25, 2024, 6:19pm

Due to the character limit, I couldn't put the future directions or alternatives in the initial post.

Future directions

More alignments

Future API could include whether an index is "word aligned" (either simple or default), "line aligned", etc.

Normalization

Future API could include checks for whether the content is in a particular normal form (not just NFC).

UnicodeScalarView and CharacterView

Like Span, we are deferring adding any collection-like types to non-escapable UTF8Span. Future work includes adding view types and corresponding iterators.

For an example implementation of those see the UTFSpanViews.swift test file.

More Collectiony algorithms

We propose equality checks (e.g. scalarsEqual), as those are incredibly common and useful operations. We have (tentatively) deferred other algorithms until non-escapable collections are figured out.

However, we can add select high-value algorithms if motivated by the community. We'd want to

More validation API

Future work includes returning all the encoding errors found in a given input.

extension UTF8 {
  public static func checkAllErrors(
    _ s: some Sequence<UInt8>
  ) -> some Sequence<UTF8.EncodingError>
}

See _checkAllErrors in UTF8EncodingError.swift.
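To make the idea concrete, here is a minimal sketch of a scan that keeps going after the first error and reports every problem it finds. It is not the implementation in UTF8EncodingError.swift: `SketchUTF8Error`, its three kinds, and `sketchCheckAllErrors` are hypothetical names for illustration, and the error kinds are deliberately coarser than those of UTF8.EncodingError.

```swift
/// Hypothetical, simplified error type for this sketch only. The pitch's
/// UTF8.EncodingError distinguishes more kinds (overlong, surrogate, etc.).
struct SketchUTF8Error: Equatable {
  enum Kind { case unexpectedContinuationByte, invalidLeadByte, malformedScalar }
  var kind: Kind
  var range: Range<Int>
}

/// Scans the whole input and records every error instead of stopping at the
/// first one.
func sketchCheckAllErrors(_ bytes: [UInt8]) -> [SketchUTF8Error] {
  var errors: [SketchUTF8Error] = []
  var i = 0
  while i < bytes.count {
    // Number of continuation bytes expected after the lead byte, and the
    // allowed range for the first of them (narrowed for some lead bytes to
    // reject overlong and surrogate encodings).
    var expected = 0
    var first: ClosedRange<UInt8> = 0x80...0xBF
    switch bytes[i] {
    case 0x00...0x7F:
      i += 1
      continue
    case 0x80...0xBF:
      errors.append(SketchUTF8Error(kind: .unexpectedContinuationByte, range: i..<i + 1))
      i += 1
      continue
    case 0xC2...0xDF: expected = 1
    case 0xE0: expected = 2; first = 0xA0...0xBF
    case 0xED: expected = 2; first = 0x80...0x9F
    case 0xE1...0xEF: expected = 2
    case 0xF0: expected = 3; first = 0x90...0xBF
    case 0xF4: expected = 3; first = 0x80...0x8F
    case 0xF1...0xF3: expected = 3
    default:
      // 0xC0, 0xC1, and 0xF5...0xFF never appear in valid UTF-8.
      errors.append(SketchUTF8Error(kind: .invalidLeadByte, range: i..<i + 1))
      i += 1
      continue
    }

    // Consume continuation bytes; stop at the first byte that does not fit.
    var j = i + 1
    var complete = true
    for k in 0..<expected {
      let allowed: ClosedRange<UInt8> = (k == 0) ? first : (0x80...0xBF)
      guard j < bytes.count, allowed.contains(bytes[j]) else {
        complete = false
        break
      }
      j += 1
    }
    if !complete {
      // Covers truncated, overlong, and surrogate encodings in this sketch.
      errors.append(SketchUTF8Error(kind: .malformedScalar, range: i..<j))
    }
    // Resume at the offending byte so it is examined as a fresh lead byte.
    i = j
  }
  return errors
}
```

Because the scan resumes at the byte that broke the scalar, a single bad sequence can produce more than one error; whether a real API should coalesce those is left open here.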

Transcoded views, normalized views, case-folded views, etc

We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for UTF8Span, we should consider adding equivalents on String, Substring, etc.

For example, transcoded views can be generalized:

extension UTF8Span {
  /// A view of the span's contents as a bidirectional collection of
  /// transcoded `Encoding.CodeUnit`s.
  @frozen
  public struct TranscodedView<Encoding: _UnicodeEncoding> {
    public var span: UTF8Span

    @inlinable
    public init(_ span: UTF8Span)

    ...
  }
}
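For comparison, transcoding is already possible today through the stdlib's free function `transcode(_:from:to:stoppingOnError:into:)`, though eagerly rather than as a lazy view. The sketch below uses a plain `[UInt8]` in place of a UTF8Span and assumes the input is valid UTF-8; a `TranscodedView` would do the same work on demand as it is iterated.

```swift
// Input bytes, assumed for this example to be valid UTF-8.
let utf8Bytes: [UInt8] = Array("héllo €".utf8)

// Eagerly transcode UTF-8 code units into UTF-16 code units.
var utf16Units: [UInt16] = []
let hadError = transcode(
  utf8Bytes.makeIterator(),
  from: UTF8.self,
  to: UTF16.self,
  stoppingOnError: true
) { utf16Units.append($0) }
```

`hadError` reports whether any invalid input was encountered; with `stoppingOnError: true`, transcoding halts at the first error instead of emitting replacement characters.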

We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms.

Finally, case-folded functionality can be accessed in today's Swift via scalar properties, but we could provide convenience collections ourselves as well.
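As a rough sketch of the scalar-properties route, the helper below maps each scalar through `Unicode.Scalar.Properties.lowercaseMapping`. Two caveats: `lowercasedScalars` is a hypothetical name, and lowercasing is only a stand-in for case folding here, since the two differ for some scalars (for example, ß is unchanged by lowercasing but folds to "ss").

```swift
/// Hypothetical convenience: lowercases a string one scalar at a time using
/// the stdlib's per-scalar mapping. This approximates, but is not identical
/// to, Unicode case folding.
func lowercasedScalars(_ s: String) -> String {
  var result = ""
  for scalar in s.unicodeScalars {
    // lowercaseMapping is a String because one scalar can map to several.
    result += scalar.properties.lowercaseMapping
  }
  return result
}
```

A convenience collection would expose the same mapping lazily instead of building a new String.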

Regex or regex-like support

Future API additions could support Regex on UTF8Span. We'd expose grapheme-level semantics, scalar-level semantics, and introduce byte-level semantics.

Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as:

extension UTF8Span.CharacterView {
  func matchCharacterClass(
    _: CharacterClass,
    startingAt: Index,
    limitedBy: Index
  ) throws -> Index?

  func matchQuantifiedCharacterClass(
    _: CharacterClass,
    _: QuantificationDescription,
    startingAt: Index,
    limitedBy: Index
  ) throws -> Index?
}

which would be useful for parser-combinator libraries that wish to expose String's model of Unicode by using the stdlib's accelerated implementation.

Canonical Spaceships

Should a ComparisonResult (or spaceship) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than through separate calls to isCanonicallyEquivalent(to:) and isCanonicallyLessThan(_:).
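The single-pass shape can be sketched with a generic three-way comparison. This is only an illustration of the control flow: `SketchComparisonResult` and `threeWayCompare` are hypothetical names, and the code compares elements as given, without the normalization that canonical equivalence requires.

```swift
/// Hypothetical three-way result, standing in for a future ComparisonResult.
enum SketchComparisonResult { case less, equal, greater }

/// Walks both sequences once and stops at the first differing element.
/// Note: no normalization is performed, so this is NOT canonical comparison.
func threeWayCompare<S: Sequence, T: Sequence>(
  _ lhs: S, _ rhs: T
) -> SketchComparisonResult
where S.Element: Comparable, T.Element == S.Element {
  var a = lhs.makeIterator()
  var b = rhs.makeIterator()
  while true {
    switch (a.next(), b.next()) {
    case (nil, nil): return .equal    // both ended together
    case (nil, _?): return .less      // lhs is a proper prefix of rhs
    case (_?, nil): return .greater   // rhs is a proper prefix of lhs
    case let (x?, y?):
      if x < y { return .less }
      if y < x { return .greater }
      // Equal so far; keep scanning.
    }
  }
}
```

Answering "equal, less, or greater" this way takes one traversal, whereas an equality check followed by a less-than check may traverse the common prefix twice.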

Other Unicode functionality

For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of String's API. Other functionality can be considered future work.

Exposing `String`'s storage class

String's internal storage class is null-terminated valid UTF-8 (by substituting replacement characters) and implements range-replaceable operations along scalar boundaries. We could consider exposing the storage class itself, which might be useful for embedded platforms that don't have String.

Yield UTF8Spans in byte parsers

Span's proposal mentions a future direction of byte parsing helpers on a Cursor or Iterator type on RawSpan. We could extend these types (or analogous types on Span<UInt8>) with UTF-8 parsing code:

extension RawSpan.Cursor {
  public mutating func parseUTF8(length: Int) throws -> UTF8Span

  public mutating func parseNullTerminatedUTF8() throws -> UTF8Span
}
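A minimal sketch of how such a cursor might behave, under several assumptions: `ByteCursor` and `ByteParseError` are hypothetical, the cursor owns a `[UInt8]` rather than borrowing a RawSpan, and it returns String instead of a non-escapable UTF8Span so that it compiles on toolchains without these types.

```swift
/// Hypothetical errors for this sketch.
enum ByteParseError: Error { case outOfBounds, invalidUTF8, missingTerminator }

/// Hypothetical cursor over owned bytes, standing in for RawSpan.Cursor.
struct ByteCursor {
  let bytes: [UInt8]
  var position = 0

  init(_ bytes: [UInt8]) { self.bytes = bytes }

  /// Validates and consumes exactly `length` bytes as UTF-8.
  mutating func parseUTF8(length: Int) throws -> String {
    guard length >= 0, position + length <= bytes.count else {
      throw ByteParseError.outOfBounds
    }
    let slice = bytes[position..<position + length]
    // String(decoding:as:) repairs invalid input with U+FFFD, so the decoded
    // bytes match the input only if the input was already valid UTF-8.
    let decoded = String(decoding: slice, as: UTF8.self)
    guard decoded.utf8.elementsEqual(slice) else {
      throw ByteParseError.invalidUTF8
    }
    position += length
    return decoded
  }

  /// Validates and consumes bytes up to the next NUL, then skips the NUL.
  mutating func parseNullTerminatedUTF8() throws -> String {
    guard let terminator = bytes[position...].firstIndex(of: 0) else {
      throw ByteParseError.missingTerminator
    }
    let result = try parseUTF8(length: terminator - position)
    position += 1  // step over the terminator
    return result
  }
}
```

In both methods the cursor advances only after validation succeeds, so a failed parse leaves `position` unchanged and the caller can try a different interpretation of the same bytes.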

Track other bits

Future work includes tracking whether the contents are NULL-terminated (useful for C bridging), whether the contents contain any newlines or only a single newline at the end (useful for accelerating Regex .), etc.

Alternatives considered

Invalid start / end of input UTF-8 encoding errors

Earlier prototypes had .invalidStartOfInput and .invalidEndOfInput UTF-8 validation errors to communicate that the input was perhaps incomplete or not sliced along scalar boundaries. In this scenario, .invalidStartOfInput is equivalent to .unexpectedContinuation with the range's lower bound equal to 0 and .invalidEndOfInput is equivalent to .truncatedScalar with the range's upper bound equal to count.

This was rejected so as to not have two ways to encode the same error. There is no loss of information and .unexpectedContinuation/.truncatedScalar with ranges are more semantically precise.
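The equivalence can be written down directly. In this sketch, `SketchEncodingError` is a hypothetical stand-in with only the two kinds discussed here, and the two helper members are illustrative rather than proposed API.

```swift
/// Hypothetical stand-in for an encoding error carrying a kind and a byte range.
struct SketchEncodingError: Equatable {
  enum Kind { case unexpectedContinuation, truncatedScalar }
  var kind: Kind
  var range: Range<Int>

  /// What .invalidStartOfInput would have meant: a continuation byte at
  /// offset 0, suggesting the input was cut in the middle of a scalar.
  var isInvalidStartOfInput: Bool {
    kind == .unexpectedContinuation && range.lowerBound == 0
  }

  /// What .invalidEndOfInput would have meant: a truncated scalar that runs
  /// up to the end of the input, whose length is `count`.
  func isInvalidEndOfInput(count: Int) -> Bool {
    kind == .truncatedScalar && range.upperBound == count
  }
}
```

A caller that cares about incomplete input, such as a streaming decoder, can recover both conditions from the kind and range alone.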

An unsafe UTF8 Buffer Pointer type

An earlier pitch proposed an unsafe version of UTF8Span. Now that we have ~Escapable, a memory-safe UTF8Span is better.

Other names for basic operations

An alternative name for nextScalarStart(_:) and previousScalarStart(_:) could be something like scalarEnd(startingAt:) and scalarStart(endingAt:). Similarly, decodeNextScalar(_:) and decodePreviousScalar(_:) could be decodeScalar(startingAt:) and decodeScalar(endingAt:). These names are similar to index(after:) and index(before:).

However, in practice this buries the direction deeper into the argument label and is more confusing than the index(before/after:) analogues. This is especially true when the argument label contains unchecked or uncheckedAssumingAligned.

That being said, these names are definitely bikesheddable and we'd like suggestions from the community.

Other bounds or alignment checked formulations

For many operations that take an index that needs to be appropriately aligned, we propose foo(_:), foo(unchecked:), and foo(uncheckedAssumingAligned:).

foo(_:) and foo(unchecked:) have analogues in Span and foo(uncheckedAssumingAligned:) is the lowest level interface that a type such as Iterator would call (since it maintains index validity and alignment as an invariant).
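The layering of the three variants can be sketched over a plain byte array. `ValidUTF8Bytes` is a hypothetical stand-in for UTF8Span and assumes its contents were validated as UTF-8 beforehand; because Swift arrays check bounds on subscript anyway, "unchecked" here only means that the explicit precondition is skipped.

```swift
/// Hypothetical stand-in for a span over bytes already known to be valid UTF-8.
struct ValidUTF8Bytes {
  let bytes: [UInt8]

  /// A position is scalar-aligned if it is the end of the buffer or does not
  /// hold a continuation byte (0b10xx_xxxx). No explicit bounds check.
  func isScalarAligned(unchecked i: Int) -> Bool {
    i == bytes.count || (bytes[i] & 0xC0) != 0x80
  }

  /// Bounds-checked alignment query.
  func isScalarAligned(_ i: Int) -> Bool {
    precondition(i >= 0 && i <= bytes.count, "index out of bounds")
    return isScalarAligned(unchecked: i)
  }

  /// Lowest level: trusts the caller that `i` is in bounds and aligned.
  func nextScalarStart(uncheckedAssumingAligned i: Int) -> Int {
    // In valid UTF-8 the lead byte alone determines the scalar's length.
    let length: Int
    switch bytes[i] {
    case 0x00...0x7F: length = 1
    case 0xC2...0xDF: length = 2
    case 0xE0...0xEF: length = 3
    default: length = 4
    }
    return i + length
  }

  /// Skips the bounds check but still verifies alignment.
  func nextScalarStart(unchecked i: Int) -> Int {
    precondition(isScalarAligned(unchecked: i), "index is not scalar-aligned")
    return nextScalarStart(uncheckedAssumingAligned: i)
  }

  /// Fully checked: verifies bounds, then alignment.
  func nextScalarStart(_ i: Int) -> Int {
    precondition(i >= 0 && i < bytes.count, "index out of bounds")
    return nextScalarStart(unchecked: i)
  }
}
```

An iterator that only ever stores indices produced by nextScalarStart can call the `uncheckedAssumingAligned:` variant directly, since it maintains validity and alignment as its own invariant.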

We could additionally have a foo(assumingAligned:) overload that does bounds checking, but it's unclear what the use case would be.

Another alternative is to only have a variant that skips both bounds and alignment checks and call it foo(unchecked:). However, this use of unchecked: is far more nuanced than Span's and it's not the case that any i in 0..<count would be valid.

We could also only offer foo(_:) and foo(uncheckedAssumingAligned:). Unaligned API such as isScalarAligned(_:) and isScalarAligned(unchecked:) would keep their names.