Pitch: Unicode Processing APIs

You can find the gist here

This is a pitch for some low-level Unicode operations to enable libraries to implement their own String-like functionality and types. The purpose of this API is to present low-level core components out of which libraries can make types and higher-level API (i.e. tools for tool-makers).

I'm interested in hearing about capabilities that libraries may need and getting design feedback.

Introduction

String performs many Unicode operations using a mix of internal and publicly available functionality in the stdlib. The stdlib provides some limited general-purpose Unicode API, though this is a very dusty corner of the stdlib. We want to enable libraries to vend their own String-like types and functionality.

One interesting use case to consider are ephemeral byte strings backed by chunks of data in contiguous memory. These chunks of data could be synchronous (i.e. Sequence/Iterator) or asynchronous (i.e. AsyncSequence/AsyncIterator). Their buffers are ephemeral, meaning there is no general way to fully reset the stream back to an earlier state (i.e. there is no Index). The buffers might not be segmented along code unit, scalar, or grapheme cluster boundaries, meaning that a produced value might span segments.

We're aiming for a solution that provides:

  1. Simple, composable pieces to build up abstraction hierarchies
  2. Efficient, safe buffer-based interfaces to drill down through abstraction hierarchies
  3. A strategy for library-extensibility as well as compatibility with existing stdlib API and concepts

Note: Many of the approaches proposed are dependent on recently developed features such as typed throws and non-escapable values, whose ABI impacts may not be fully fleshed out. This may motivate splitting functionality across multiple proposals and releases

Decoding and Character API

Errors

Validation API and producing errors as part of decoding is an often-requested feature. Below are errors related to Unicode encodings:

extension Unicode.UTF8 {
  public enum DecodingError: Error {
    case expectedStarter
    case expectedContinuation
    case overlongEncoding
    case invalidCodePoint
    case invalidStarterByte
  }
}
extension Unicode.UTF16 {
  public enum DecodingError: Error {
    case expectedTrailingSurrogate
    case unexpectedTrailingSurrogate
  }
}
extension Unicode.UTF32 {
  public enum DecodingError: Error {
    case invalidCodePoint
  }
}

Alternative: a single error enum, noting that some error cases are irrelevant in some encodings

Alternative / Investigation: making the error type an associated type on a protocol for library-provided encodings to customize

When it comes to validating bytes, the byte source might be ephemeral (e.g. a single-pass Sequence) or it might be possible to re-visit contents by position (e.g. a Collection).

extension Unicode.UTFX { // `UTFX` meaning UTF8, UTF16, and UTF32
  public struct CollectionDecodingError<Index: Comparable>: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var range: Range<Index>
  }

  public struct ByteStreamDecodingError: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var bytes: (UInt8, UInt8?, UInt8?)
  }
}

Knowing the kind of encoding error and bytes involved can be very helpful. Overlong encodings are often an intentional attempt to compromise security. Some systems may want to use custom error-correction (i.e. 128 distinct replacement characters) such that the corrected bytes are valid Unicode while also preserving the original bits. Additionally, knowing the encoding error can be helpful for debugging.

Validation API and concerns are further discussed later in this document.

Endianness

Endianness (byte-ordering or memory ordering), denotes whether the first byte received contains the high bits or low bits of a code unit. It is relevant for multi-byte encodings (UTF-16 and UTF-32, but not UTF-8).

public enum Endianness {
  case little
  case big

  /// The platform's native byte-ordering in memory 
  public static var native: Self
}

Alternative: Endianness could be considered in combination with the encoding, yielding e.g. UTF16BE and UTF16LE encodings. Or, it could be considered a property of a serialization format.

Alternative: An alternate name for Endianness could be ByteOrder.

Decoding

The stdlib has some existing functionality along these lines, but it's inadequate for byte stream validation and decoding. Instance methods UnicodeCodec.decode and parseScalar do not produce meaningful error information and they operate only in terms of fully-formed code units and complete scalars.

We propose stateful ByteStreamDecoder structs. These are statically associated with an encoding and, at initialization time, a byte order. They store a small internal buffer of bytes until enough bytes have been seen to produce a Unicode scalar. They can be fed data in a byte at a time, which allows developers to feed data in as it is received and not worry about handling scalar alignment.

While all names in this rough draft are strawperson names, the method name consume below is a particularly strawy name. It is fairly unpalatable and meant to be a placeholder. Alternate names such as read, receive, input (as a present-tense verb), streamIn, next, feed, feedIn, etc., are not much better. The name decode doesn't carry the implication of an in-progress or suspended operation.

public protocol UnicodeByteStreamDecoder {
  /// Input a byte, returns a finished scalar or `nil`.
  /// Throws a decoding error.
  mutating func consume(
    _ byte: UInt8
  ) throws -> Unicode.Scalar?

  /// We've reached the end of input. If there's an unfinished
  /// scalar in progress, throws the appropriate encoding error
  func finalize() throws

  // Customization points:

  /// Read bytes until yielding a decoded scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  public mutating func consume(
    _ bytes: inout some IteratorProtocol<UInt8>
  ) throws -> Unicode.Scalar?

  /// Read bytes asynchronously until yielding a decoded scalar.
  ///
  /// Throws validation errors and rethrows upstream errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  public mutating func consume<AI: AsyncIteratorProtocol>(
    _ bytes: inout AI
  ) async throws -> Unicode.Scalar?
  where AI.Element == UInt8

  /// Read bytes starting from `position` and yielding a decoded scalar
  /// and the position of the start of the next scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  ///
  /// INVESTIGATE: take `position` inout so that it gets updated rather 
  ///   than requiring the caller to update a local `var`.
  ///
  /// INVESTIGATE: Alternative: take a slice `inout`, but
  ///   we'd want to make sure it makes sense for non-copyable
  ///   slices
  public mutating func consume<C: Collection<UInt8>>(
    _ bytes: C,
    startingFrom position: C.Index
  ) throws -> (Unicode.Scalar, scalarEnd: C.Index)?  


  /// INVESTIGATE: API to access the internal buffer, such as
  ///   whether it is empty, it's contents, clearing it, etc
}

extension Unicode.UTF8 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init()

    /// Input a single
    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

extension Unicode.UTF16 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

extension Unicode.UTF32 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ x: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}

// Default implementations
extension UnicodeByteStreamDecoder {
  ... see gist for re-declaration ...
}

A decoder with undecoded contents left in its buffer could be an indication of programmer error. The finalize method checks for this. We could also consider exposing some properties or access to the buffer itself.

Alternative: we could attempt to add new byte-stream overloads for decode() that throw errors and have the ability to resume with a different input source. For the purposes of this pitch, a separate type and namespace lets us explore the different semantics pitched.

Alternative: We could consider options to repair invalid input, possibly by also communicating what errors would have been reported.

Rejected Alternative: consume receives endianness. This has the downside of a more complex API contract and more branching for an uncommon use case of interleaving mixed-endianness byte streams.

Alternative: Statically separate out endianness, i.e. have a UTF16BE encoding or alternatively a UTF16BEByteStreamDecoder. It's not clear this would be an API improvement, and it still needs some benchmarking to demonstrate whether there's a meaningful performance difference.

Alternative: A single dynamically-parameterized decoder type. Such a type could receive the encoding at init-time and have a suitably large internal buffer for any anticipated encoding. This would result in many more run-time branches, however.

Alternative: A single type-parameterized decoder type. This would reduce the amount of API, though in effect we'd want to specialize for each of the presented encodings anyways. This may end up being in effect a different spelling of what is pitched here.

Typed throws

It likely makes sense to specify errors using typed throws. This may also motivate including decoding error types in the protocol.

Typed throws would be further motivated by the fact that some highly constrained or embedded systems may not have String available. This could be due to the inability to dynamically allocate memory or not having enough space to bundle the data necessary to implement String's semantics such as grapheme breaking and canonical equivalence data tables. Such environments could benefit from good UTF-8 decoding API and typed errors would make this API available.

Grapheme-breaking API

The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage. It uses underscored functionality in the stdlib to detect grapheme cluster boundaries.

Grapheme breaking requires looking ahead at the next scalar and keeping a few bits of state along the way. We could surface the underscored interfaces as API:

extension Unicode {
  public struct GraphemeBreaker {
    public init()

    /// Returns whether there was a grapheme break _before_
    /// `scalar`. Updates internal state and stores `scalar` for
    /// the next call.
    public mutating func consume(
      _ scalar: Unicode.Scalar
    ) -> Bool
  }
}

To build a Character-producing stream out of this, the caller either has to buffer scalars themselves or do some bookkeeping to track positions of the scalars fed in.

Character streams with buffering

The following is an example use of the GraphemeBreaker, and could be additional API or an alternate API for consideration.

public struct GraphemeFormer {
  ... see gist for implementation ...

  public init() {}

  /// Consumes `scalar`. Returns a completed `Character` if
  /// `scalar` was the start of the next grapheme cluster.
  public mutating func consume(
    _ scalar: Unicode.Scalar
  ) -> Character?

  /// Finishes and returns the in-progress Character
  public mutating func flush() -> Character?
}

Like the decoders, this uses an internal buffer: a String's UnicodeScalarView (which starts off using a small-form).

Example: Scalars and Characters from FileDescriptor

On Apple platforms, Foundation's FileHandle can asynchronously vend bytes, Unicode scalars, and Characters.

For an example use of the pitched API, let's implement that on System's FileDescriptor.

(see gist for code snippet)

The given code shows a simple use of the pitched API. However, it also shows the need for an efficient approach that can drill through abstraction layers to underlying bytes in buffers.

When the source of Unicode scalars is backed by a chunk of memory containing validly-encoded UTF-8, it is more efficient to work in terms of positions in that chunk.

Alternative naming: Use FooCharacterBar instead of FooGraphemeBar or FooGraphemeClusterBar throughout this proposal

There's a naming spectrum between Unicode's preferred terminology and Swift's.

One one end, if this were a standalone package aimed solely around providing an implementation of the Unicode standard via accelerated routines, there is an affordance to stick solely to Unicode terminology. Such a package could use the term "grapheme cluster" or "extended grapheme cluster" because "character" is not Unicode's terminology and could be ambiguous or confusing.

On the other end, anything in the String namespace or which vends a String type (including Character) would use use the term "character". Similarly anything outside of a dedicated Unicode package, library, namespace, module, or sub-module would as well.

As for what we are providing, we don't have a separate Unicode module or sub-module, largely for historical reasons. Unicode is an empty enum that functions as a namespace. Some precedent so far has found it better to lean towards Unicode terminology for items under Unicode. For example, Unicode.Scalar.Property.isGraphemeBase is the better name than Unicode.Scalar.Property.isCharacterBase.

Contiguous buffers and segmentation API

Many byte streams are backed by chunks of contiguous memory. Rather than read byte-at-a-time and return scalar-at-a-time using internal buffering, a byte stream decoder could communicate scalar-aligned positions in its upstream's backing buffers.

We explore functionality that reads from any byte source and vends chunks of validly-encoded, and validly-aligned (for a specified alignment) UTF-8. This enables efficient streaming operations, i.e. those that operate over a properly-aligned moving window of validly encoded UTF-8 bytes in contiguous memory. This involves areas of current investigation and could motivate and incorporate the more advanced lifetime management discussed.

Another important consideration is how to handle when a scalar, normalization, or grapheme cluster segment straddles multiple chunks of data. In that case, API may need to return a view into a new buffer which stores these contiguously.

Validation API

Validation looks at an entire input to ensure it is validly encoded. While it can be performed by decoding and discarding the contents, it can be done more efficiently as its own standalone operation if the original contents are meant to be kept in their original encoding.

extension Unicode.UTF8 {
  public static func validate<C: Collection<UInt8>>(
    _ bytes: C
  ) throws
}

// Available on UTF16 and UTF32, where endianness matters and
// where code units are not individual bytes
extension Unicode.UTF[16/32] {
  public static func validate<C: Collection<CodeUnit>>(
    _ codeUnits: C
  ) throws

  public static func validate<C: Collection<UInt8>>(
    _ bytes: C,
    endianness: Endianness
  ) throws
}

Alternative: Concrete functions taking a BufferView or some BufferViewable-like protocol

UTF-8 validity and efficiency

UTF-8 validation is particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.

However, if the input isn't actually valid, assuming validity leads to a new class of security concerns.

Memory safety is more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differs from the bounds dictated by its contents.

Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise any code that checks for the presence of a scalar value by looking at the encoded bytes.

One approach is to define API that takes a parameter assuming that its contents contain correctly-encoded UTF-8. Today, that is often done via a unsafeAssumingValidUTF8: UnsafeRawBufferPointer parameter, but this is unsafe in multiple ways that might not be clear to the caller. The UnsafeRawBufferPointer is memory-unsafe of course, but even if the caller knows the memory itself is safe, the contents might be invalidly encoded in a way that subtly bypasses correct behavior elsewhere in the program.

A type such as BufferView would help mitigate the memory unsafety of the pointer itself, but not the far more subtle problems of assuming valid UTF-8.

The rest of this pitch is interwoven with on-going investigations into non-escapable values and statically-reasoned lifetimes. As such, it could change depending on when or how that support arrives. This is similar to how the new Atomics API was originally implemented with unsafe constructs before ~Copyable support was available.

Valid UTF8 buffer views

UTF8.ValidBufferView is a buffer view whose contents are known to be valid UTF-8 as represented in the type system.

extension Unicode.UTF8 {
  public struct ValidBufferView {
    /// TO INVESTIGATE: This field's lifetime is tied to `self`, i.e. the lifetime
    /// of either `owner` or the lexical scope into which it was returned. Any `get` 
    /// accessors should be non-escapable
    public var bytes: BufferView

    /// An object that owns the memory, if the API needed to allocate memory.
    ///
    /// This is needed when validation or alignment needs to allocate to ensure
    /// the relevant content is is contiguous memory
    public var owner: AnyObject?

    /// Create from the validated contents of `c`. If `c` contains invalidly encoded
    /// UTF-8, throws an error. If `c` is valid and and provide a `BufferView`, will
    /// borrow that view. If `c` is valid but does not provide a `BufferView`, will
    /// allocate memory to provide a contiguous view.
    public init(validating c: some Collection<UInt8>) throws

    /// As `validating:`, but repairs any encoding errors. If a repair was made, a
    /// new allocation must be made for the corrected content.
    public init(repairing c: some Collection<UInt8>)
  }
}

Alignments

There are 3 particularly useful alignments to segment content such that common operations can be performed by only looking at one chunk of data at a time.

  1. Scalar aligned: decoding and validation
  2. Normalization-segment aligned: canonical comparison
  3. Grapheme-cluster aligned: forming Characters

Each successive segmentation is broader than the one before: every grapheme-cluster boundary is a normalization-segment boundary and every normalization-segment boundary is a scalar boundary.

Note: Normalization segments being sub-segments of grapheme-clusters is not technically guaranteed by the Unicode standard to always be true in future Unicode versions. Unicode is allowed to change the rules of grapheme breaking in future versions. That being said, a normalization segment is defined to start on a non-combining scalar, and grapheme clusters that break before non-combining scalars are nonsensical. Unicode handles nonsensical cases as degenerate cases and those cases do not break, though Unicode could change its mind in future versions. Because of this, the "sub-alignment" relationship between normalization segments and grapheme clusters should treated as illustrative for the reader and not a formal API guarantee into the future.

extension Unicode.UTF8 {
  public struct ValidScalarAlignedBufferView {
    public var buffer: ValidBufferView
  }

  public struct ValidNormalizationSegmentAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }

  public struct ValidGraphemeClusterAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }
}

extension Unicode.UTF8 {
  /// Transforms a sequence of buffer views to a sequence of valid buffer views
    ... see gist for declarations ...
}

Aligning data along these boundaries can be useful for implementing data structures that retain their own copy of the storage. Such a data structure may want to guarantee that it can vend a given view's Element by inspecting only a single chunk.

Alternative: Type-parameterize based on alignment instead, or even dynamic-value-parameterize based on alignment.

Accessing ranges

The above API, which provides alignment with normalization segments and grapheme clusters, can also provide a view of the bytes which comprise an individual normalization segment or grapheme cluster:

  ... see gist for view declarations ...

Normalization segments are particularly tricky to account for, as the normalization process could turn a single segment into multiple ones.

Alternative: Pending BufferView's final design with respect to self-slicing, a view of the bytes comprising a single normalization-segment or grapheme cluster might be represented using a Slice.

Creating Strings

String's decoding initializers are difficult to discover and use as they make use of metatypes: String(decoding: myBytes, as: UTF8.self). Attempts to rectify this have been saddled with compatibility concerns. This may be a good opportunity to make some progress on this. Alternatively, this is severable should it start to bog down the rest of this pitch.

The below String inits are straw-person named and intentionally presented in a naming-vacuum, that is without consideration for existing String API names. This helps us work on enumerating the functionality and presenting the entire API picture without simultaneously juggling some of the current issues in API names.

For example, SE-0405 String Initializers with Encoding Validation takes a stab at improving the story somewhat with nil-return inits, but it uses the same validating: name as the error-throwing inits below. Depending on exactly how this pitch takes shape and when it is ready for review, the below could be considered an amendment to SE-0405 or a straw-person naming-vacuum investigation.

// Strawperson assuming typed throws
extension String {
  /// Sequence-version of stdlib's `String.init(decoding: x, as: UTF8.self)`
  public init(repairingUTF8: some Sequence<UInt8>)

  /// Puts contents in stdlib-normal-form for fast comparison.
  /// `Character`s are the same, but scalars and code unit views
  /// could show different (i.e. normalized) contents
  public init(normalizingUTF8: some Sequence<UInt8>)

  /// Checks for errors and throws them: Sequence error version
  public init(
    validatingUTF8: some Sequence<UInt8>
  ) throws(UTF8.ByteStreamDecodingError)

  /// Checks for errors and throws them: Collection error version
  public init(
    validatingUTF8: some Collection<UInt8>
  ) throws(UTF8.CollectionDecodingError)

  /// This is a convenience spelling for either repairing or normalizing.
  /// We can pick/debate which would be better, there are reasonable
  /// arguments for either.
  public init(utf8: some Sequence<UInt8>)
}

Similarly, API which is parameterized over the encoding, as well as API over byte streams associated with endianness.

extension String {
  // Repairing
  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<Encoding.CodeUnit>, 
    as sourceEncoding: Encoding.Type
  )

  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<UInt8>, 
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  )

  ... see gist for `normalizing:` and `validating:` variants ...

Library extensibility and use cases

Encodings and protocols

The stdlib has existing protocols, though they can be difficult to conform to, difficult to use, and derived operations can be inefficient. More investigation is needed to see how to improve them or else how to fit new improvements into them.

We could consider adding protocols for encoding errors, decoder structs, etc., seeing if there's a good library-extensibility story here.

Good case studies include CESU-8 which uses UTF-16-style surrogate pairs for non-BMP scalars in a UTF-8-like encoding, resulting in up to 6 bytes per Unicode scalar value. Java's modified UTF-8 further extends CESU-8's approach by using an overlong encoding for NUL. These are not valid UTF-8 encodings, but they are valid Unicode encodings as surrogates must be paired. They tradeoff some of UTF-8's advantages for compatibility benefits.

WTF-8 allows unpaired surrogates and thus is not a valid Unicode encoding. It could be interesting to consider how to help support this kind of invalid encoding by creating individual code points instead whole Unicode scalar values.

There are also encodings that only encode a subset of Unicode, such as ASCII (which UTF-8 is a binary-compatible superset of) and Latin1 (which UTF-8 is not binary compatible with). Supporting these are tricky as transcoding is lossy and otherwise complete functions become partial functions.

The stdlib currently provides ASCII as a Unicode encoding, however as a subset encoding it has some sharp edges and follows different conventions from the actual Unicode encodings. We should consider sunsetting this encoding in favor of UTF-8, which is a strict superset. The stdlib's implementation should detect and fast-path UTF-8 when the contents happen to be only-ASCII anyways. We can provide optimized isASCII queries on String, byte buffers and byte streams, etc.

Libraries

The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage.

Foundation's AttributedString is built on BigString. Additionally, Foundation parses data formats such as plists which are encoded using UTF-16 in big endian byte order.

Foundation also normalizes paths, on some file systems, to a pre-Unicode-3.0 NFD. Unicode version 3.0 is important since it is only afterwards that normalization properties are stable. Other libraries may need to similarly specify a specific Unicode version and bundle their own data tables to drive normalization and, especially, decomposition. This could be done through a data-table provider protocol, though there may be efficiency concerns with working through such an abstraction. Either way, the normalization-segmentation API are helpful for performing custom decomposition.

Relatedly, a server-client library may wish to ensure that both the server and client are using the same version of Unicode for the purposes of canonical equivalence. While the properties for defined code points are stable, it is possible that an undefined code point could normalize differently in future Unicode versions. An alternative approach would be a quick scan for the presence of undefined code points.

We can look at using some of the byte-stream functionality over AsyncIterator to define (combiners / operators?) such as AsyncUnicodeScalarSequence, AsyncCharacterSequence, etc. These could be good API to have in the stdlib proper or in the Swift Async Algorithms package.

The WebURL package does a hefty about of Unicode processing and is a great example of the kinds of libraries that the stdlib should empower. It can serve as a good target for these improvements and many others.

Libraries such as Swift Syntax sometimes roll out their own decoding and would benefit from a standard approach.

I'm interested in hearing about other libraries and potential use cases.

Future directions

Normalization

A future direction is for String, UTF8.ValidBufferView, etc., to provide lazily-normalized views of their contents under NFC, NFD, as well as forms provided by libraries.

For the buffer-based API, a future direction could include composing and decomposing API, possibly driven by a library's data tables, along the lines of Foundation's path normalization described above.

BOM

(see gist for BOM discussion)

Shared and ephemeral strings

An often desired feature is to have String or Substring API available on storage that's owned by another object, e.g. shared substrings or using String's ABI support.

(see gist for shared and ephemeral strings discussion)

String API on validated UTF-8 bytes

UTF8.ValidBufferView could also have String's API on it, at least in some fashion. Future work could include Regex support, etc.

31 Likes

Typed throws should be used, as you propose. The ergonomics are better, it's more efficient (than using existentials), and as you note it permits use in more scenarios (e.g. Embedded Swift).

If they're combined, it undermines the benefit of using typed throws because you can no longer tell - from the code itself - what sort of errors a specific function can throw.

It is of course not always possible - or at least not practical & reasonable - to tailor each thrown error type to its function, if that results in an explosion of error types. But the division you have here seems reasonable and effective in this case.

I'm not sure they need to be conflated with what you have proposed, since while related they seem technically orthogonal, but it would be great to have things like:

  • Conversion / comparison based on various Unicode normalisation forms (or at least NFC & NFKC), to enable things like compatibility-based equality checks.
  • And/or (with the above) an elegant Swift wrapper over uspoof.h functionality.
  • Simply exposing some key underlying ICU APIs (with pleasant Swifty wrappers), like unumf_formatDecimal, so that 3rd party numerics may easily support localisation.

Just mentioning them since, well, you did ask, and also I want to seed the ideas. :slightly_smiling_face:

Particularly if these error types aren't frozen anyway, then I'd question whether this is worth the redundancy here—an invalid code point is an invalid code point: nothing is gained by repeating the encoding in the error type.

I think it makes sense to talk a little about this, likely denoting it as "future work".

The stdlib ships (compressed) data tables that can drive NFD and NFC for the stdlib's version of Unicode. We are a little sensitive to data size, so we'd be unlikely to ship NFKC/D data tables, data tables for previous versions of Unicode, etc., unless there's a particularly compelling need.

A library may want to vend its own data tables to drive composition or decomposition. We may want (possibly as future work) a library-extensibility story here. There's a little bit of finesse involved in designing such an interface as these are often performance-sensitive parts of code and there are many ways the client library's data tables may be encoded.

One example could be a server/client framework that wants to serialize normalized textual content under a particular version of Unicode, rather than the stdlib's version.

Another example would be a library that normalizes paths for the HFS+ file system. HFS+ predates the Unicode 3.x era, which is when many aspects of normalization was stabilized. A HFS+ path normalizer would want to plug in its own data tables to drive decomposition.

These would be outside the purview of the stdlib, as the stdlib does not use ICU or bundle this data. There is an ICU package by @allevato and Foundation does bundle much of this data (particularly localization), so those would be the better places for this API. But, if there's something the stdlib can do to help serve these libraries, e.g. providing protocols to extend some core algorithms, that could be interesting.

Right, this is a trade-off currently. However, that can be eliminated through future language enhancements (e.g. sum types, or enum inheritance). We can't be sure if & when those future enhancements will arrive, but at least it's a possibility - if a single error enum is used there's no apparent way to elegantly improve that in future.

That package is completely unmaintained :sweat_smile: The main sticking point that reduced my interest in it was that there was no way to get ICU cleanly integrated into it as a Swift package; you can't use libicucore.dylib on Apple platforms (they're considered private APIs in the app approval process) or you had to do a static library build of ICU separately and wire that into the rest of the build.

This was all before binary .xcframework packages were supported, which would probably be a fine approach to the problem today. Maybe I should revive it.

However, with Foundation now moving toward native Swift and open-source source-of-truth, I would love to see that library add support for more of the less-common ICU APIs if they would be deemed too large for the standard library.

3 Likes

Does the proposed solution incur any increases in the stdlib’s binary size or does it use the Unicode tables that are already included?

public protocol UnicodeByteStreamDecoder {

Should the new protocol be nested (SE-0404) in the Unicode enum namespace?

4 Likes
Full answer to this

ICU's uspoof checker implements confusable detection as specified by UAX39. In other words, we could implement exactly the same functionality in native Swift, without wrapping ICU, and I have actually considered doing that.

The reason I didn't is because confusable detection is very complex, and the services exposed by USpoofCheck are possibly best left as implementation details for use by experts. Good spoof-checking implementations need to be deliberate about which UAX39 options they select, and will typically want to augment it with other, domain-specific or contextual checks. Additionally, modern computing platforms could potentially offer better spoof detection than USpoofCheck; there's still a lot of research to be done.

I know this because I ported most of Chromium's IDN spoof checking in a proof-of-concept extension library for WebURL. AFAIK no system or third-party library exposes a browser-quality spoof-checking of domains, so it's an interesting thing to experiment with, and I wanted to make sure WebURL exposed good APIs for this because the processing can be quite intensive.

It uses ICU's uspoof, so if you want to try it out on a Mac, first run brew install icu4c, then swift test from the command line (OSX's built-in ICU doesn't include uspoof, and I couldn't get Xcode to set the right include flags for the brew-installed library, but it all works from the command line).

So what does Chromium do for spoof-checking? Well, it does use USpoofCheck, but it configures a custom set of allowed characters starting with UAX31's "Candidate Characters for Inclusion in Identifiers" with a bunch of extra characters removed - e.g. U+2010 HYPHEN is removed because it is confusable with U+002D HYPHEN-MINUS, and some characters are removed because they are allegedly rendered blank by certain default system fonts.

If there are no issues, it then checks a bunch of TLD-specific rules (e.g. Latin small letter thorn, þ, is only allowed in Icelandic domains because although it is used in that language, in other contexts it can be used to spoof b or p), and searches for extra digit lookalikes, Kana, and combining diacritics which may be considered potentially problematic even when used in a single script and require even more thorough processing.

As part of that more thorough processing, it performs a kind of loose match against a database of "skeletons". Skeletons are also defined by UAX39, and uspoof.h does indeed include a function for getting the skeleton of a string.

Chromium's implementation doesn't just call that function as-is, though - there's a bunch of custom rules which are used to generate up to 128 variants of each hostname, and skeletons are generated for each of those. Then they're tested against a database containing almost 5000 of the most popular domains. I haven't ported any of this, because WebURL and the spoofcheck library are third-party packages, and I don't think many applications would be willing to accept the increased download size even if I went through the effort of implementing it.

So that's a brief walkthrough of what a production-quality spoof-checker looks like; hopefully it illustrates that good detection is seriously involved. Even if we provided the functionality from ICU's uspoof.h, you'd still need very detailed study of several Unicode standards in order to use it effectively, and even then you'd likely need to augment it with context-specific logic (such as that for specific TLDs), before ultimately falling back to some kind of database.

Just providing USpoofCheck is not going to be useful for most people (and it may even be counterproductive if configured incorrectly, giving them a false sense of security); it is probably more productive to focus on higher-level but also more specific APIs -- e.g. APIs specifically for rendering domains.


But all of this analysis only looks at the text itself. There are way more factors which affect confusability analysis. For instance, the font and text size are huge factors. The UAX39 data tries to consider that:

The prospective confusables were gathered from a number of sources. Erik van der Poel contributed a list derived from running a program over a large number of fonts to catch characters that shared identical glyphs within a font, and Mark Davis did the same more recently for fonts on Windows and the Macintosh. Volunteers from Google, IBM, Microsoft and other companies gathered other lists of characters. These included native speakers for languages with different writing systems. The Unicode compatibility mappings were also used as a source. The process of gathering visual confusables is ongoing: the Unicode Consortium welcomes submission of additional mappings. The complex scripts of South and Southeast Asia need special attention. The focus is on characters that have Identifier_Status=Allowed, because they are of most concern.

The fonts used to assess the confusables included those used by the major operating systems in user interfaces. In addition, the representative glyphs used in the Unicode Standard were also considered. Fonts used for the user interface in operating systems are an important source, because they are the ones that will usually be seen by users in circumstances where confusability is important, such such as when using IRIS (Internationalized Resource Identifiers) and their sub-elements (such as domain names).

This process of manual inspection clearly is not ideal. Additionally, the UI fonts on Apple platforms have been tweaked several times over the years, but Apple is not specifically mentioned as contributing confusable data (which doesn't mean they don't). It is possible that Unicode's official data (used by ICU) is suboptimal on Apple devices; we can't say. What we can say is that the best confusable detection would account for the sophisticated text scaling, layout, and rendering performed by the OS frameworks.

Another thing to consider is that there are other sources of data which can sway the confusability analysis - for instance, you might already know which scripts the user understands, or you might have access to their bookmarks or browser history, letting you know that a domain which looks confusable is actually trusted.

I can't remember specifically where I read it, but I believe Chrome's omnibox does factor in these kinds of signals (because it's a browser; it can collect this data itself). Obviously that is quite sensitive information, and it would not be okay for every application to have access to it. Any API which considered this data would need to isolate it from the application to safeguard user privacy.

And if we really want to consider everything (and we do, especially if somebody is spoofing the hostname of a bank or online pharmacy), it's important to remember that Unicode isn't the only way that people spoof domains - elision is also not trivial.

IMO, all of this points to the OS as the best place to offer high quality spoof-checking, which doesn't just depend on the text itself, but also on how the text is rendered, and what we can reasonably expect the reader to discern.

To be even more specific, I think there should be some kind of specialised Label for displaying URLs and hostnames, which automatically handles elision and Unicode rendering where safe. I wanted to write one for WebURL but will probably never find the time.


There was recently a high-profile incident which exposed the limitations of Unicode's manually curated confusable tables:

Google-hosted malvertising leads to fake Keepass site that looks genuine | Ars Technica

The issue is that U+0137 LATIN SMALL LETTER K WITH CEDILLA ( ķ ) wasn't listed as a possible confusable for U+006B LATIN SMALL LETTER K ( k ). Since they both belong to the Latin block, all of that advanced mixed-script analysis described above didn't kick in. It's not the first time the tables missed a confusable, and it won't be last.

IMO, the best approach would be to use machine learning. If you search around, you'll find a fair amount of recent academic literature on the topic - it is promising, and there is evidence that it is being further developed with a view to possible production use.

TLDR: I don't think the standard library is the best place for this. Actually doing an excellent job (which is important - spoofing can have seriously bad consequences) requires data and tuning that is not really feasible outside of the OS.

4 Likes

Just a short list for what features I would like to have:

What I need a lot is the check of a large number of “classes” (e.g. if it is Greek or a large mathematical operator) a Unicode code point belongs to, I need a lot more (see the definitions) than defined in ICU.

I also need to use those classes in regex expressions, I can output according regex expression parts from the those classes, but of course I would like to use those classes directly in regex expression (directly using them in regex expressions is obviously kind of a separate topic).

I also reference a lot of Unicode code points by name, I do this efficiently by using according constants for Unicode code points (see for example there) in order to have easily understandable names in my code.

I also need to get either the ordinary or the mathematical form for some mathematical operators or Greek letters (see there).

Of course, checking a correct (UTF-8) encoding is very important, I do this “manually” in my XML parser, other XML libraries don‘t do this.

Assuming that fixing existing String APIs (e.g. first) is unacceptable (due to backwards-compatibility constraints), are there new APIs that could be introduced to at least provide the ability to work around the existing flaws? (e.g.)

I think the heart of it is some way to 'lazily' compare strings, meaning comparison algorithms that don't work on whole graphemes only but are smart enough to abort on the first byte that determines the result of the comparison. Thus closing the door on attacks like absurd numbers of repeated combining characters.

I don't have a good idea what this would look like, API-wise, but the need is pretty clear.

3 Likes

Does any part of this pitch involve something that would affect a case-folding or case-mapping story? From the talk of things like normalization being 'future' work, I'd guess not?

This proposal only uses existing data and data tables. Any size increase would be purely due to new code (just as in any API proposal).

Yes!

Just so I understand, would an example of what you're talking about be iterating the scalars present and for each scalar running a large number of queries, some of which may involve querying built-in character classes as well as Unicode Scalar Properties?

Custom character classes (i.e. not the syntactic kind but those defined by code) are a special case of Custom Regex Consumer. That is, a consumer matches some amount of input and advances, else fails, while the custom character class runs a predicate over a single Character and advances if it passes. However, the existing interface is very low-level and could be a little more awkward than necessary to use for this.

...

It seems like you have lots of interesting use cases. Strictly speaking, most of these are probably outside of the scope of the work pitched, but could be worthy directions for other API, including more extensions to Regex. CC @nnnnnnnn regarding custom character class API.

Binary comparison of Strings can be done by comparing along the sub-Character views, e.g. the UTF8View (or UnicodeScalarView). If you want to do canonical comparison, then you do sometimes need to inspect the entire normalization segment for ordering in general. For equality specifically, there might be a bail-out path when one side has a small normalization segment and the other has a longer one, which might be helpful if only one side is controlled by the attacker. This would take more investigation before I could affirm it.

Could you present what exactly you need? We have Unicode functionality already to get the case mappings for scalars (e.g. lowercase). Thus, anything that can vend you Unicode.Scalars can be appropriately (flat?)mapped today.

It would be interesting to explore lazily-case-mapped views, though.

2 Likes

It is about checking a single scalar for character classes. I prepare a map for each of my properties, but even if you wouldn't: such properties are usually bound to ranges and some single characters, but this is an implementation question, properties like Diacritic are also implemented "somehow" in an sensible way.

I will have a look at it, thanks, but what I would like to have is to use e.g. \{greek} (OK, this could not be a regex syntax but you understand what I mean) in a regex expression using the according custom character class, via a quick overview at least I do not see such an example in the implemented proposal.

1 Like

Swift Regex supports Unicode scripts in all their glorious and confusing complexity. We also support all the various engine's (and Unicode's) ways of spelling and referring to them. See Character Properties.

Some examples of various Greek-ness queries and various syntaxes:

\p{Greek}
\p{Script_Extensions=Greek}
\p{script=Greek}
\p{ISscript=isGreek}
\p{isGreek}
\p{Greek}
\p{sc=isGreek}
\p{sc=grek}
[[:script=Greek:]]
3 Likes

So the syntax exists and the missing point would be to allow own character classes to be used this way?

Should the #if _endian(big) platform condition also be formalized?

public enum ByteOrder: Sendable {
  case bigEndian
  case littleEndian

  // Or `current` or `host` or `init()`.
  public static var native: Self {
#if byteOrder(bigEndian)
    return .bigEndian
#elseif byteOrder(littleEndian)
    return .littleEndian
#endif
  }
}
2 Likes

This is very exciting!

I've been working on a UTF-8 rope, similar to BigString. Its internals are closer to xi-editor's rope, but its grapheme breaking implementation is similar to BigString's. Relevant code: Rope.swift, AttributedRope.swift, BTree.swift.

My primary reasons for building custom string storage were:

  1. Performance:
    • O(log n) subscripting and mutation.
    • O(1) counting of characters, scalars, etc.
  2. Efficiently count and iterate over larger segments: words, lines, etc.

There's a lot in this pitch that would have been useful to me, but most of the issues I've run into would be best solved by higher level abstractions that would let custom string types take advantage of existing code in the stdlib. Much of this could be built on top of the "collection-of-buffers" abstraction outlined in the pitch.

Some of these are definitely out of scope, but I think they might inform the design of in-scope APIs.

N.b. I haven't thought through the example APIs below. They probably don’t work and all names are strawpeople. For now I'm just trying to articulate problems I've had.

Streaming validation/repair during allocation

When allocating a string stored in a tree, the first step is to split the underlying flat byte buffer into chunks that are stored in the tree's leaves. I like the String(repairing:), String(normalizing:) and String(validating:) initializers, but for very large (i.e. multi-gigabyte) strings, allocating a String just to get the repairing: behavior before re-allocating your custom storage can be expensive. An API that can do this efficiently on the fly would be better.

Grapheme breaking

To get the O(log n) indexing behavior from storing a string in a balanced tree, leaves must be fixed size and Characters must be allowed to span multiple leaves. If you force leaves to be Character aligned, it becomes impossible to have fixed-size leaves, and in pathological cases – i.e. a single character with tons of combining codepoints – you can end up back at O(n).

This means each leaf has to store at minimum the following extra state:

  1. prefixCount – the number of bytes at the beginning of the leaf that continue a character from the previous leaf.
  2. lastCharContinues - whether nextLeaf().prefixCount > 0. Not technically necessary, but having to regularly load the next leaf to check wither chunk.bytes.endIndex is a character boundary isn't ideal.

Furthermore, this state has to be updated when mutating your BigString or appending two BigStrings:

s1 = "a" * 1000
s2 = "´" + "b" * 999
s3 = s1 + s2

s1.count == 1000
s2.count == 1000
s3.count == 1999 // the last "a" and "´" have combined, so there's one less character.

s1.root.height == 0 // a single node, which is a leaf
s2.root.height == 0

// s3 has two leaves
s3.root.height == 1
s3.root.children.count == 2

l1 = s1.root
l2 = s2.root
l3a = s3.root.children[0]
l3b = s3.root.children[1]

// s3's leaves have the same bytes as s1 and s2's, but their extra state has changed.
l1.bytes == l3a.bytes
l2.bytes == l3b.bytes

l1.prefixCount == 0; l1.lastCharContinues == false
l2.prefixCount == 0; l2.lastCharContinues == false

l3a.prefixCount == 0; l3a.lastCharContinues == true    // lastCharContinues is different
l3b.prefixCount == 2; l3b.lastCharContinues == false   // prefixCount is different

The proposed GraphemeFormer is too low level to deal with mutations like this without a lot of complex logic, which is easy to get wrong. BigString+Managing Breaks.swift gives a sense of what the logic looks like. The technique used in BigString also requires each leaf to store another piece of state – the GraphemeFormer having consumed everything up to the first byte of the leaf (this assumes the leaves are scalar-aligned).

It would be great to have a higher level API that makes this easier. One option would be to have an API like GraphemeCursor from the unicode_segmentation Rust crate. It allows you to start at an arbitrary unicode scalar offset, and ask for the next or previous grapheme boundary, and it will ask you for next and previous chunks as necessary until it can return a boundary.

This does have some problems: a pathological string with gigabytes of regional indicators (the codepoints used in flag emojis) requires an O(n) scan back to the first regional indicator, but I think this is true of String.index(before:) now, so maybe it’s not so bad. I also don’t know how easy it would be to keep character counts up to date using this API. Perhaps it’s enough, or perhaps there’s a better high level API without the downsides.

I don't know how this problem applies to other string data structures like gap buffers or piece tables.

Indexing APIs

Replicating String’s indexing behavior is difficult. It would be great if there was a way to use the same codepaths that String uses in a custom string type.

Consider the string "aé" (using the non-combining scalar U+00E9 LATIN SMALL LETTER E WITH ACUTE) with i == 2[utf8], which is non character-aligned inside "é".

  • Subscripting a character with a non-aligned index should round down to the nearest character boundary: s[i] == "é"
  • index(before:) of an aligned index moves to the left by one boundary, but an unaligned index moves to the left by two boundaries: s.index(before: i) == s.startIndex (skipping the boundary between "a" and "é").
  • index(after:) always moves to the right by a single boundary regardless of whether the index is aligned or not: s.index(after: i) == s.endIndex (not skipping a boundary).

It’s hard to know all the rules that String follows, let alone reimplement them.

Additionally, String.Index has performance improvements enabled by properties like _isCharacterAligned, _isScalarAligned, etc. It should be easy to apply these to a custom string type without reimplementing String's logic. Supporting these optimizations might also reduce the requirements on the grapheme breaking API, but I'm not sure.

It's possible all of this could be solved with a Unicode.Index type that can be embedded in your custom string's Index type, as well as a protocol for moving between boundaries. Here's a rough sketch.

extension Unicode {
    struct Index {
        var utf8Offset: Int { get }
    }
}

extension Unicode {
    enum Boundary {
        case utf8
        case utf16
        case scalar
        case character
        
        // Maybe these from UAX #29 too? How to make them optional?
        case word
        case sentence
        case line
    }
}

protocol UnicodeIndexWrapping {
    init(_ unicodeIndex: Unicode.Index)
    var unicodeIndex: { get }
}

protocol UnicodeIndexable {
    associatedtype Index: UnicodeIndexWrapping

    func unicodeIndex(ofBoundaryBefore i: Index, boundary: Unicode.Boundary) -> Index
    func unicodeIndex(ofBoundaryAfter i: Index, boundary: Unicode.Boundary) -> Index
    func unicodeIndex(ofBoundaryRoundingDown i: Index, boundary: Unicode.Boundary) -> Index
}

struct BigString: UnicodeIndexable {
    struct Index: UnicodeIndexWrapping {
        let unicodeIndex: Unicode.Index
    }
}

Actually sharing indexing logic with String is complicated by the fact that String.Index is frozen, but I'm hopeful there's a clever way around that.

UAX #29 word and sentence breaking

Swift has an implementation of UAX #29’s word and sentence breaking algorithms, including locale specific variants – enumerateSubstrings(in:options:_:) with .byWord, .bySentence, optionally combined with .localized. Being able to use the stdlib’s implementation with custom string types would be great.

Even better, there are times when you might want to customize this behavior (UAX #29 notes this). Some examples: having no internal word boundaries in a number with punctuation “1,234.56”, or having different word boundary rules for different programming languages in a text editor – e.g. in a language like Swift that uses camel case, with the string “fooBarBaz”, having word boundaries between “foo” and “Bar” and “Bar” and “Baz”. Extension points to allow this would be great.

A generic UTF16View from a collection of UTF-8 buffers

Building a UTF-16 view on top of UTF-8 storage is hard. I haven’t even attempted it. But it’s necessary to efficiently integrate with system frameworks that use UTF-16. It would be great if Unicode.Index (or another API) were able to deal with trailing surrogate indices, which don't actually exist in the underlying UTF-8 storage, and make it easy to return both leading and trailing surrogates when subscripting.

Similarly, if String’s UTF-16 breadcrumbing code could somehow be extracted and made generic, that would be even better.

Tools for implementing other views

It’s possible something like UnicodeIndexable would allow Swift to provide APIs for implementing other views on top of custom string storage, including views for words, sentences, and lines.

Indexing into a LinesView is particularly gnarly because s.lines[s.endIndex] should be valid and return "" for strings ending in an empty line. This breaks all sorts of Collection invariants/expectations which I have not found a good solution for.

Custom Unicode tables

This is a good idea. Here’s an article with more context.

Streaming normalization and denormalization

If you want to be able to present a good find/replace UI, you need to deal with normalization. You don’t necessarily want to normalize the string as you read it in because your app may need to write the string out to disk later, with minimal edits.

It would be great if Swift provided APIs to do streaming normalization and denormalization of custom string types.

Fixed-size arrays

This is pretty far afield, and would require a language change (there are ongoing discussions), but it’s still a pain point.

Right now, in my Rope, each leaf stores a String. I’d like to be able to store a fixed-size array of bytes. Partially this is for performance – storing the bytes directly in the leaf eliminates unnecessary pointer chasing to the String’s storage. But it’s also important for having more control so I can better interop with zero-copy C APIs (see below).

Interop with zero-copy C APIs with temporarily escaping buffers

This is a problem with String, but because I build on top of String (see “Fixed-size arrays”), it’s still a problem for me.

Consider this (simplified) API from Tree-sitter:

typedef struct {
  void *payload;
  const char *(*read)(void *payload, uint32_t byte_index, uint32_t *bytes_read);
} TSInput;

TSTree *ts_parser_parse(TSParser *self, TSInput input);

ts_parser_parse is designed to be zero-copy. During the execution of ts_parser_parse, your read function is called N times, giving you the ability to provide a pointer to a portion of your input, whatever size is convenient. In the case of BigString, the bytes in a single leaf is the obvious choice.

While the char pointer escapes read, it’s guaranteed not to escape ts_parser_parse.

Because String.withUTF8 et. al. do not allow their pointer to escape the passed in closure, there’s no way to implement this in Swift without copying the bytes of each leaf into a struct that wraps TSInput and (IIRC) manually allocating and freeing the memory yourself. This is a big hole in Swift – it's not that you can't use this zero-copy API without resorting to unsafe constructs; it's that you can't express it at all. There's no way to use ts_parser_parse from Swift without eventually doing a memcpy of the entire storage of your string.

From what I can remember of the BufferView vision doc, BufferViews would not solve this problem because supporting unsafe pointers is an explicit non-goal. I have not had a chance to read the StorageView proposal that came out of that, so maybe this has changed. Perhaps some of the escape analysis stuff that goes along with it might help too.

At minimum, if I had fixed-size arrays, I could do this unsafely with my own rope (I can guarantee that any unsafe pointer will live as long as the rope it’s derived from), but this is a general problem, and it would be great to have a general solution.

6 Likes