You can find the gist here
This is a pitch for some low-level Unicode operations to enable libraries to implement their own String-like functionality and types. The purpose of this API is to present low-level core components out of which libraries can make types and higher-level API (i.e. tools for tool-makers).
I'm interested in hearing about capabilities that libraries may need and getting design feedback.
Introduction
String performs many Unicode operations using a mix of internal and publicly available functionality in the stdlib. The stdlib provides some limited general-purpose Unicode API, though this is a very dusty corner of the stdlib. We want to enable libraries to vend their own String-like types and functionality.
One interesting use case to consider is ephemeral byte strings backed by chunks of data in contiguous memory. These chunks of data could be synchronous (i.e. Sequence/Iterator) or asynchronous (i.e. AsyncSequence/AsyncIterator). Their buffers are ephemeral, meaning there is no general way to fully reset the stream back to an earlier state (i.e. there is no Index). The buffers might not be segmented along code unit, scalar, or grapheme cluster boundaries, meaning that a produced value might span segments.
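For instance (an illustrative example, not from the pitch), a chunked byte source might split a single scalar's UTF-8 encoding across two buffers:
// The 4-byte UTF-8 encoding of U+1F600 (😀) is F0 9F 98 80. A chunked
// source might deliver it split across two chunks, so neither chunk on
// its own is complete, well-formed UTF-8.
let chunk1: [UInt8] = [0x68, 0x69, 0xF0, 0x9F] // "hi" + first half of U+1F600
let chunk2: [UInt8] = [0x98, 0x80, 0x21]       // second half of U+1F600 + "!"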
We're aiming for a solution that provides:
- Simple, composable pieces to build up abstraction hierarchies
- Efficient, safe buffer-based interfaces to drill down through abstraction hierarchies
- A strategy for library-extensibility as well as compatibility with existing stdlib API and concepts
Note: Many of the approaches proposed are dependent on recently developed features such as typed throws and non-escapable values, whose ABI impacts may not be fully fleshed out. This may motivate splitting functionality across multiple proposals and releases.
Decoding and Character API
Errors
Validation API, and the ability to produce errors as part of decoding, is an often-requested feature. Below are errors related to Unicode encodings:
extension Unicode.UTF8 {
  public enum DecodingError: Error {
    case expectedStarter
    case expectedContinuation
    case overlongEncoding
    case invalidCodePoint
    case invalidStarterByte
  }
}
extension Unicode.UTF16 {
  public enum DecodingError: Error {
    case expectedTrailingSurrogate
    case unexpectedTrailingSurrogate
  }
}
extension Unicode.UTF32 {
  public enum DecodingError: Error {
    case invalidCodePoint
  }
}
Alternative: a single error enum, noting that some error cases are irrelevant in some encodings
Alternative / Investigation: making the error type an associated type on a protocol for library-provided encodings to customize
When it comes to validating bytes, the byte source might be ephemeral (e.g. a single-pass Sequence) or it might be possible to re-visit contents by position (e.g. a Collection).
extension Unicode.UTFX { // `UTFX` meaning UTF8, UTF16, and UTF32
  public struct CollectionDecodingError<Index: Comparable>: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var range: Range<Index>
  }
  public struct ByteStreamDecodingError: Error {
    public var kind: Unicode.UTFX.DecodingError
    public var bytes: (UInt8, UInt8?, UInt8?)
  }
}
Knowing the kind of encoding error and bytes involved can be very helpful. Overlong encodings are often an intentional attempt to compromise security. Some systems may want to use custom error-correction (i.e. 128 distinct replacement characters) such that the corrected bytes are valid Unicode while also preserving the original bits. Additionally, knowing the encoding error can be helpful for debugging.
Validation API and concerns are further discussed later in this document.
Endianness
Endianness (byte ordering or memory ordering) denotes whether the first byte received contains the high bits or the low bits of a code unit. It is relevant for multi-byte encodings (UTF-16 and UTF-32, but not UTF-8).
public enum Endianness {
  case little
  case big

  /// The platform's native byte-ordering in memory
  public static var native: Self
}
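For illustration, here is roughly what byte order means when assembling a UTF-16 code unit from two received bytes (a sketch written against the pitched Endianness enum):
// Assemble one UTF-16 code unit from the next two bytes of a stream.
func codeUnit(_ first: UInt8, _ second: UInt8, byteOrder: Endianness) -> UInt16 {
  switch byteOrder {
  case .big:    return UInt16(first) << 8 | UInt16(second) // first byte carries the high bits
  case .little: return UInt16(second) << 8 | UInt16(first) // first byte carries the low bits
  }
}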
Alternative: Endianness could be considered in combination with the encoding, yielding e.g. UTF16BE and UTF16LE encodings. Or, it could be considered a property of a serialization format.
Alternative: An alternate name for Endianness could be ByteOrder.
Decoding
The stdlib has some existing functionality along these lines, but it's inadequate for byte stream validation and decoding. Instance methods UnicodeCodec.decode and parseScalar do not produce meaningful error information and they operate only in terms of fully-formed code units and complete scalars.
We propose stateful ByteStreamDecoder structs. These are statically associated with an encoding and, at initialization time, a byte order. They store a small internal buffer of bytes until enough bytes have been seen to produce a Unicode scalar. They can be fed data one byte at a time, which allows developers to feed data in as it is received without worrying about scalar alignment.
While all names in this rough draft are strawperson names, the method name consume below is a particularly strawy name. It is fairly unpalatable and meant to be a placeholder. Alternate names such as read, receive, input (as a present-tense verb), streamIn, next, feed, feedIn, etc., are not much better. The name decode doesn't carry the implication of an in-progress or suspended operation.
public protocol UnicodeByteStreamDecoder {
  /// Input a byte, returns a finished scalar or `nil`.
  /// Throws a decoding error.
  mutating func consume(
    _ byte: UInt8
  ) throws -> Unicode.Scalar?

  /// We've reached the end of input. If there's an unfinished
  /// scalar in progress, throws the appropriate encoding error.
  func finalize() throws

  // Customization points:

  /// Read bytes until yielding a decoded scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  mutating func consume(
    _ bytes: inout some IteratorProtocol<UInt8>
  ) throws -> Unicode.Scalar?

  /// Read bytes asynchronously until yielding a decoded scalar.
  ///
  /// Throws validation errors and rethrows upstream errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  mutating func consume<AI: AsyncIteratorProtocol>(
    _ bytes: inout AI
  ) async throws -> Unicode.Scalar?
  where AI.Element == UInt8

  /// Read bytes starting from `position`, yielding a decoded scalar
  /// and the position of the start of the next scalar.
  ///
  /// Throws validation errors.
  ///
  /// Returns `nil` when `bytes` is done. `bytes` may not
  /// have finished a scalar and `self` may contain some
  /// bytes of an in-progress scalar value.
  ///
  /// INVESTIGATE: take `position` inout so that it gets updated rather
  /// than requiring the caller to update a local `var`.
  ///
  /// INVESTIGATE: Alternative: take a slice `inout`, but
  /// we'd want to make sure it makes sense for non-copyable
  /// slices.
  mutating func consume<C: Collection<UInt8>>(
    _ bytes: C,
    startingFrom position: C.Index
  ) throws -> (Unicode.Scalar, scalarEnd: C.Index)?

  /// INVESTIGATE: API to access the internal buffer, such as
  /// whether it is empty, its contents, clearing it, etc.
}
extension Unicode.UTF8 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init()

    /// Input a single byte, returning a finished scalar or `nil`.
    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}
extension Unicode.UTF16 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}
extension Unicode.UTF32 {
  public struct ByteStreamDecoder: UnicodeByteStreamDecoder {
    public init(byteOrder: Endianness)

    public mutating func consume(
      _ byte: UInt8
    ) throws -> Unicode.Scalar?

    public func finalize() throws
  }
}
// Default implementations
extension UnicodeByteStreamDecoder {
... see gist for re-declaration ...
}
A decoder with undecoded contents left in its buffer could be an indication of programmer error. The finalize method checks for this. We could also consider exposing some properties or access to the buffer itself.
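As a usage sketch (written against the pitched API above, so it does not compile today), decoding a stream of UTF-8 bytes one at a time might look like:
var decoder = Unicode.UTF8.ByteStreamDecoder()
var scalars: [Unicode.Scalar] = []
do {
  for byte in receivedBytes {                // `receivedBytes`: any Sequence<UInt8> (hypothetical)
    if let scalar = try decoder.consume(byte) {
      scalars.append(scalar)                 // a complete scalar was decoded
    }                                        // otherwise the byte is buffered internally
  }
  try decoder.finalize()                     // throws if a partial scalar remains buffered
} catch {
  // e.g. Unicode.UTF8.DecodingError.expectedContinuation
}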
Alternative: we could attempt to add new byte-stream overloads for decode() that throw errors and have the ability to resume with a different input source. For the purposes of this pitch, a separate type and namespace lets us explore the different semantics pitched.
Alternative: We could consider options to repair invalid input, possibly by also communicating what errors would have been reported.
Rejected Alternative: consume receives endianness. This has the downside of a more complex API contract and more branching for an uncommon use case of interleaving mixed-endianness byte streams.
Alternative: Statically separate out endianness, i.e. have a UTF16BE encoding or alternatively a UTF16BEByteStreamDecoder. It's not clear this would be an API improvement, and it still needs some benchmarking to demonstrate whether there's a meaningful performance difference.
Alternative: A single dynamically-parameterized decoder type. Such a type could receive the encoding at init-time and have a suitably large internal buffer for any anticipated encoding. This would result in many more run-time branches, however.
Alternative: A single type-parameterized decoder type. This would reduce the amount of API, though in effect we'd want to specialize for each of the presented encodings anyways. This may end up being in effect a different spelling of what is pitched here.
Typed throws
It likely makes sense to specify errors using typed throws. This may also motivate including decoding error types in the protocol.
Typed throws would be further motivated by the fact that some highly constrained or embedded systems may not have String available. This could be due to the inability to dynamically allocate memory or not having enough space to bundle the data necessary to implement String's semantics, such as the grapheme breaking and canonical equivalence data tables. Such environments could benefit from good UTF-8 decoding API, and typed errors would make this API available.
Grapheme-breaking API
The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage. It uses underscored functionality in the stdlib to detect grapheme cluster boundaries.
Grapheme breaking requires looking ahead at the next scalar and keeping a few bits of state along the way. We could surface the underscored interfaces as API:
extension Unicode {
  public struct GraphemeBreaker {
    public init()

    /// Returns whether there was a grapheme break _before_
    /// `scalar`. Updates internal state and stores `scalar` for
    /// the next call.
    public mutating func consume(
      _ scalar: Unicode.Scalar
    ) -> Bool
  }
}
To build a Character-producing stream out of this, the caller either has to buffer scalars themselves or do some bookkeeping to track positions of the scalars fed in.
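For illustration, a caller that buffers scalars itself might wrap the breaker like this (a sketch against the pitched API, not proposed API itself):
var breaker = Unicode.GraphemeBreaker()
var pending = String.UnicodeScalarView()       // scalars of the in-progress cluster
var characters: [Character] = []
for scalar in scalars {                        // `scalars`: any sequence of Unicode.Scalar (hypothetical)
  if breaker.consume(scalar), !pending.isEmpty {
    characters.append(Character(String(pending)))   // a break occurred _before_ `scalar`
    pending.removeAll(keepingCapacity: true)
  }
  pending.append(scalar)
}
if !pending.isEmpty {
  characters.append(Character(String(pending)))     // flush the final cluster
}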
Character streams with buffering
The following is an example use of the GraphemeBreaker, and could be additional API or an alternate API for consideration.
public struct GraphemeFormer {
  ... see gist for implementation ...

  public init() {}

  /// Consumes `scalar`. Returns a completed `Character` if
  /// `scalar` was the start of the next grapheme cluster.
  public mutating func consume(
    _ scalar: Unicode.Scalar
  ) -> Character?

  /// Finishes and returns the in-progress Character
  public mutating func flush() -> Character?
}
Like the decoders, this uses an internal buffer: a String's UnicodeScalarView (which starts off using a small-form).
Example: Scalars and Characters from FileDescriptor
On Apple platforms, Foundation's FileHandle can asynchronously vend bytes, Unicode scalars, and Characters. For an example use of the pitched API, let's implement that on System's FileDescriptor.
(see gist for code snippet)
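A rough, simplified sketch of the shape such an implementation might take (hypothetical helper names, error handling elided; this is not the gist's snippet):
import SystemPackage

// Hypothetical sketch: an AsyncSequence of Unicode.Scalar over a FileDescriptor,
// built from the pitched byte-stream decoder.
struct AsyncUnicodeScalars: AsyncSequence {
  typealias Element = Unicode.Scalar
  let fd: FileDescriptor

  struct AsyncIterator: AsyncIteratorProtocol {
    let fd: FileDescriptor
    var chunk: [UInt8] = []
    var position = 0
    var decoder = Unicode.UTF8.ByteStreamDecoder()

    mutating func next() async throws -> Unicode.Scalar? {
      while true {
        if position == chunk.count {
          chunk = try await readChunk(from: fd)   // hypothetical chunked read helper
          position = 0
          if chunk.isEmpty { try decoder.finalize(); return nil }
        }
        let byte = chunk[position]
        position += 1
        if let scalar = try decoder.consume(byte) { return scalar }
      }
    }
  }
  func makeAsyncIterator() -> AsyncIterator { AsyncIterator(fd: fd) }
}
A Character-producing sequence could be layered on top in the same way using a GraphemeFormer.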
The given code shows a simple use of the pitched API. However, it also shows the need for an efficient approach that can drill through abstraction layers to underlying bytes in buffers.
When the source of Unicode scalars is backed by a chunk of memory containing validly-encoded UTF-8, it is more efficient to work in terms of positions in that chunk.
Alternative naming: Use FooCharacterBar instead of FooGraphemeBar or FooGraphemeClusterBar throughout this proposal
There's a naming spectrum between Unicode's preferred terminology and Swift's.
On one end, if this were a standalone package aimed solely at providing an implementation of the Unicode standard via accelerated routines, there is an affordance to stick solely to Unicode terminology. Such a package could use the term "grapheme cluster" or "extended grapheme cluster" because "character" is not Unicode's terminology and could be ambiguous or confusing.
On the other end, anything in the String namespace or which vends a String type (including Character) would use the term "character". Similarly, anything outside of a dedicated Unicode package, library, namespace, module, or sub-module would as well.
As for what we are providing, we don't have a separate Unicode module or sub-module, largely for historical reasons. Unicode is an empty enum that functions as a namespace. Some precedent so far has found it better to lean towards Unicode terminology for items under Unicode. For example, Unicode.Scalar.Property.isGraphemeBase is a better name than Unicode.Scalar.Property.isCharacterBase.
Contiguous buffers and segmentation API
Many byte streams are backed by chunks of contiguous memory. Rather than read byte-at-a-time and return scalar-at-a-time using internal buffering, a byte stream decoder could communicate scalar-aligned positions in its upstream's backing buffers.
We explore functionality that reads from any byte source and vends chunks of validly-encoded, and validly-aligned (for a specified alignment) UTF-8. This enables efficient streaming operations, i.e. those that operate over a properly-aligned moving window of validly encoded UTF-8 bytes in contiguous memory. This involves areas of current investigation and could motivate and incorporate the more advanced lifetime management discussed.
Another important consideration is how to handle the case where a scalar, normalization segment, or grapheme cluster straddles multiple chunks of data. In that case, API may need to return a view into a new buffer which stores these contiguously.
Validation API
Validation looks at an entire input to ensure it is validly encoded. While it can be performed by decoding and discarding the contents, it can be done more efficiently as its own standalone operation if the original contents are meant to be kept in their original encoding.
extension Unicode.UTF8 {
  public static func validate<C: Collection<UInt8>>(
    _ bytes: C
  ) throws
}

// Available on UTF16 and UTF32, where endianness matters and
// where code units are not individual bytes
extension Unicode.UTF[16/32] {
  public static func validate<C: Collection<CodeUnit>>(
    _ codeUnits: C
  ) throws

  public static func validate<C: Collection<UInt8>>(
    _ bytes: C,
    endianness: Endianness
  ) throws
}
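As a usage sketch against the pitched API, the thrown error would carry both the kind and the location of the first encoding error:
do {
  try Unicode.UTF8.validate(bytes)            // `bytes`: some Collection<UInt8> (hypothetical)
} catch let error as Unicode.UTF8.CollectionDecodingError<Array<UInt8>.Index> {
  // e.g. error.kind == .overlongEncoding; error.range locates the offending bytes
  report(error.kind, at: error.range)         // `report` is a hypothetical handler
}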
Alternative: Concrete functions taking a BufferView or some BufferViewable-like protocol
UTF-8 validity and efficiency
UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid UTF-8 for this reason.
However, if the input isn't actually valid, assuming validity leads to a new class of security concerns.
Memory safety is more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differ from the bounds dictated by its contents.
Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise any code that checks for the presence of a scalar value by looking at the encoded bytes.
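A classic example (well-known, and not specific to this pitch) is the overlong two-byte form of "/":
let wellFormed: [UInt8] = [0x2F]        // "/" (U+002F) encoded correctly
let overlong:   [UInt8] = [0xC0, 0xAF]  // ill-formed overlong encoding of the same value
// A byte-level scan for 0x2F never sees the second form, which is how some
// historical path-traversal filters were bypassed; a validating decoder must
// reject it (e.g. as UTF8.DecodingError.overlongEncoding).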
One approach is to define API that takes a parameter assuming that its contents contain correctly-encoded UTF-8. Today, that is often done via an unsafeAssumingValidUTF8: UnsafeRawBufferPointer parameter, but this is unsafe in multiple ways that might not be clear to the caller. The UnsafeRawBufferPointer is memory-unsafe of course, but even if the caller knows the memory itself is safe, the contents might be invalidly encoded in a way that subtly bypasses correct behavior elsewhere in the program.
A type such as BufferView would help mitigate the memory unsafety of the pointer itself, but not the far more subtle problems of assuming valid UTF-8.
The rest of this pitch is interwoven with on-going investigations into non-escapable values and statically-reasoned lifetimes. As such, it could change depending on when or how that support arrives. This is similar to how the new Atomics API was originally implemented with unsafe constructs before ~Copyable support was available.
Valid UTF8 buffer views
UTF8.ValidBufferView is a buffer view whose contents are known to be valid UTF-8 as represented in the type system.
extension Unicode.UTF8 {
  public struct ValidBufferView {
    /// TO INVESTIGATE: This field's lifetime is tied to `self`, i.e. the lifetime
    /// of either `owner` or the lexical scope into which it was returned. Any `get`
    /// accessors should be non-escapable.
    public var bytes: BufferView

    /// An object that owns the memory, if the API needed to allocate memory.
    ///
    /// This is needed when validation or alignment needs to allocate to ensure
    /// the relevant content is in contiguous memory.
    public var owner: AnyObject?

    /// Create from the validated contents of `c`. If `c` contains invalidly encoded
    /// UTF-8, throws an error. If `c` is valid and can provide a `BufferView`, will
    /// borrow that view. If `c` is valid but does not provide a `BufferView`, will
    /// allocate memory to provide a contiguous view.
    public init(validating c: some Collection<UInt8>) throws

    /// As `validating:`, but repairs any encoding errors. If a repair was made, a
    /// new allocation must be made for the corrected content.
    public init(repairing c: some Collection<UInt8>)
  }
}
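A usage sketch (pitched API; BufferView itself is still under design):
// Validate once, then hand the typed result to code that may assume validity.
let valid = try Unicode.UTF8.ValidBufferView(validating: bytes) // `bytes`: some Collection<UInt8> (hypothetical)
consumeValidUTF8(valid.bytes)   // hypothetical function requiring already-validated UTF-8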
Alignments
There are 3 particularly useful alignments to segment content such that common operations can be performed by only looking at one chunk of data at a time.
- Scalar aligned: decoding and validation
- Normalization-segment aligned: canonical comparison
- Grapheme-cluster aligned: forming Characters
Each successive segmentation is broader than the one before: every grapheme-cluster boundary is a normalization-segment boundary and every normalization-segment boundary is a scalar boundary.
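As a concrete illustration:
let e = "e\u{301}"               // "é" as U+0065 + U+0301 COMBINING ACUTE ACCENT
// e.unicodeScalars.count == 2, e.count == 1:
// two scalars, one normalization segment, one grapheme cluster

let flag = "\u{1F1FA}\u{1F1F3}"  // 🇺🇳 as two regional indicator scalars
// flag.unicodeScalars.count == 2, flag.count == 1:
// each regional indicator is non-combining and starts its own normalization
// segment, yet together they form a single grapheme cluster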
Note: Normalization segments being sub-segments of grapheme clusters is not technically guaranteed by the Unicode standard to always be true in future Unicode versions. Unicode is allowed to change the rules of grapheme breaking in future versions. That being said, a normalization segment is defined to start on a non-combining scalar, and grapheme clusters that break before combining scalars are nonsensical. Unicode handles nonsensical cases as degenerate cases and those cases do not break, though Unicode could change its mind in future versions. Because of this, the "sub-alignment" relationship between normalization segments and grapheme clusters should be treated as illustrative for the reader and not a formal API guarantee into the future.
extension Unicode.UTF8 {
  public struct ValidScalarAlignedBufferView {
    public var buffer: ValidBufferView
  }
  public struct ValidNormalizationSegmentAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }
  public struct ValidGraphemeClusterAlignedBufferView {
    public var buffer: ValidScalarAlignedBufferView
  }
}
extension Unicode.UTF8 {
  /// Transforms a sequence of buffer views to a sequence of valid buffer views
  ... see gist for declarations ...
}
Aligning data along these boundaries can be useful for implementing data structures that retain their own copy of the storage. Such a data structure may want to guarantee that it can vend a given view's Element by inspecting only a single chunk.
Alternative: Type-parameterize based on alignment instead, or even dynamic-value-parameterize based on alignment.
Accessing ranges
The above API, which provides alignment with normalization segments and grapheme clusters, can also provide a view of the bytes which comprise an individual normalization segment or grapheme cluster:
... see gist for view declarations ...
Normalization segments are particularly tricky to account for, as the normalization process could turn a single segment into multiple ones.
Alternative: Pending BufferView's final design with respect to self-slicing, a view of the bytes comprising a single normalization-segment or grapheme cluster might be represented using a Slice.
Creating Strings
String's decoding initializers are difficult to discover and use as they make use of metatypes: String(decoding: myBytes, as: UTF8.self). Attempts to rectify this have been saddled with compatibility concerns. This may be a good opportunity to make some progress on this. Alternatively, this is severable should it start to bog down the rest of this pitch.
The below String inits are straw-person named and intentionally presented in a naming-vacuum, that is without consideration for existing String API names. This helps us work on enumerating the functionality and presenting the entire API picture without simultaneously juggling some of the current issues in API names.
For example, SE-0405 String Initializers with Encoding Validation takes a stab at improving the story somewhat with nil-returning inits, but it uses the same validating: name as the error-throwing inits below. Depending on exactly how this pitch takes shape and when it is ready for review, the below could be considered an amendment to SE-0405 or a straw-person naming-vacuum investigation.
// Strawperson assuming typed throws
extension String {
  /// Sequence-version of stdlib's `String.init(decoding: x, as: UTF8.self)`
  public init(repairingUTF8: some Sequence<UInt8>)

  /// Puts contents in stdlib-normal-form for fast comparison.
  /// `Character`s are the same, but scalars and code unit views
  /// could show different (i.e. normalized) contents
  public init(normalizingUTF8: some Sequence<UInt8>)

  /// Checks for errors and throws them: Sequence error version
  public init(
    validatingUTF8: some Sequence<UInt8>
  ) throws(UTF8.ByteStreamDecodingError)

  /// Checks for errors and throws them: Collection error version
  public init(
    validatingUTF8: some Collection<UInt8>
  ) throws(UTF8.CollectionDecodingError)

  /// This is a convenience spelling for either repairing or normalizing.
  /// We can pick/debate which would be better, there are reasonable
  /// arguments for either.
  public init(utf8: some Sequence<UInt8>)
}
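A usage sketch with typed throws (same strawperson names):
do {
  let string = try String(validatingUTF8: bytes)   // `bytes`: some Sequence<UInt8> (hypothetical)
  use(string)                                      // hypothetical
} catch {
  // With typed throws, `error` is a UTF8.ByteStreamDecodingError:
  print("invalid UTF-8: \(error.kind), offending bytes: \(error.bytes)")
}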
Similarly, there is API parameterized over the encoding, as well as API over byte streams with an associated endianness:
extension String {
  // Repairing
  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<Encoding.CodeUnit>,
    as sourceEncoding: Encoding.Type
  )

  public init<Encoding: Unicode.Encoding>(
    repairing: some Sequence<UInt8>,
    as sourceEncoding: Encoding.Type,
    endianness: Endianness
  )

  ... see gist for `normalizing:` and `validating:` variants ...
}
Library extensibility and use cases
Encodings and protocols
The stdlib has existing protocols, though they can be difficult to conform to, difficult to use, and derived operations can be inefficient. More investigation is needed to see how to improve them or else how to fit new improvements into them.
We could consider adding protocols for encoding errors, decoder structs, etc., seeing if there's a good library-extensibility story here.
Good case studies include CESU-8, which uses UTF-16-style surrogate pairs for non-BMP scalars in a UTF-8-like encoding, resulting in up to 6 bytes per Unicode scalar value. Java's modified UTF-8 further extends CESU-8's approach by using an overlong encoding for NUL. These are not valid UTF-8 encodings, but they are valid Unicode encodings as surrogates must be paired. They trade off some of UTF-8's advantages for compatibility benefits.
WTF-8 allows unpaired surrogates and thus is not a valid Unicode encoding. It could be interesting to consider how to help support this kind of invalid encoding by creating individual code points instead of whole Unicode scalar values.
There are also encodings that only encode a subset of Unicode, such as ASCII (which UTF-8 is a binary-compatible superset of) and Latin1 (which UTF-8 is not binary compatible with). Supporting these is tricky as transcoding is lossy and otherwise complete functions become partial functions.
The stdlib currently provides ASCII as a Unicode encoding; however, as a subset encoding it has some sharp edges and follows different conventions from the actual Unicode encodings. We should consider sunsetting this encoding in favor of UTF-8, which is a strict superset. The stdlib's implementation should detect and fast-path UTF-8 when the contents happen to be only-ASCII anyways. We can provide optimized isASCII queries on String, byte buffers, byte streams, etc.
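In its naive form, such an isASCII query over bytes is a simple scan (an optimized implementation would process a word or SIMD vector at a time):
// Naive form of an `isASCII` query; shown for illustration only.
func isASCII(_ bytes: some Sequence<UInt8>) -> Bool {
  bytes.allSatisfy { $0 < 0x80 }
}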
Libraries
The Swift Collections package defines a BigString type, which provides String-like functionality over rope-like storage.
Foundation's AttributedString is built on BigString. Additionally, Foundation parses data formats such as plists which are encoded using UTF-16 in big endian byte order.
Foundation also normalizes paths, on some file systems, to a pre-Unicode-3.0 NFD. Unicode version 3.0 is important since it is only afterwards that normalization properties are stable. Other libraries may need to similarly specify a specific Unicode version and bundle their own data tables to drive normalization and, especially, decomposition. This could be done through a data-table provider protocol, though there may be efficiency concerns with working through such an abstraction. Either way, the normalization-segmentation API are helpful for performing custom decomposition.
Relatedly, a server-client library may wish to ensure that both the server and client are using the same version of Unicode for the purposes of canonical equivalence. While the properties for defined code points are stable, it is possible that an undefined code point could normalize differently in future Unicode versions. An alternative approach would be a quick scan for the presence of undefined code points.
We can look at using some of the byte-stream functionality over AsyncIterator to define (combiners / operators?) such as AsyncUnicodeScalarSequence, AsyncCharacterSequence, etc. These could be good API to have in the stdlib proper or in the Swift Async Algorithms package.
The WebURL package does a hefty amount of Unicode processing and is a great example of the kinds of libraries that the stdlib should empower. It can serve as a good target for these improvements and many others.
Libraries such as Swift Syntax sometimes roll their own decoding and would benefit from a standard approach.
I'm interested in hearing about other libraries and potential use cases.
Future directions
Normalization
A future direction is for String, UTF8.ValidBufferView, etc., to provide lazily-normalized views of their contents under NFC, NFD, as well as forms provided by libraries.
For the buffer-based API, a future direction could include composing and decomposing API, possibly driven by a library's data tables, along the lines of Foundation's path normalization described above.
BOM
(see gist for BOM discussion)
Shared and ephemeral strings
An often desired feature is to have String or Substring API available on storage that's owned by another object, e.g. shared substrings or using String's ABI support.
(see gist for shared and ephemeral strings discussion)
String API on validated UTF-8 bytes
UTF8.ValidBufferView could also have String's API on it, at least in some fashion. Future work could include Regex support, etc.