Due to the character limit, I couldn't put the future directions or alternatives in the initial post.
Future directions
More alignments
Future API could include whether an index is "word aligned" (either simple or default), "line aligned", etc.
Normalization
Future API could include checks for whether the content is in a particular normal form (not just NFC).
UnicodeScalarView and CharacterView
Like Span, we are deferring adding any collection-like types to non-escapable UTF8Span. Future work includes adding view types and corresponding iterators.
For an example implementation of those see the UTFSpanViews.swift test file.
More Collectiony algorithms
We propose equality checks (e.g. scalarsEqual), as those are incredibly common and useful operations. We have (tentatively) deferred other algorithms until non-escapable collections are figured out.
However, we can add select high-value algorithms if motivated by the community. We'd want to
More validation API
Future work includes returning all the encoding errors found in a given input.
extension UTF8 {
  public static func checkAllErrors(
    _ s: some Sequence<UInt8>
  ) -> some Sequence<UTF8.EncodingError>
}
See _checkAllErrors in UTF8EncodingError.swift.
Transcoded views, normalized views, case-folded views, etc
We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for UTF8Span, we should consider adding equivalents on String, Substring, etc.
For example, transcoded views can be generalized:
extension UTF8Span {
  /// A view of the span's contents as a bidirectional collection of
  /// transcoded `Encoding.CodeUnit`s.
  @frozen
  public struct TranscodedView<Encoding: _UnicodeEncoding> {
    public var span: UTF8Span

    @inlinable
    public init(_ span: UTF8Span)

    ...
  }
}
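To give a sense of what such a view would compute, here is a minimal sketch using the standard library's existing eager transcode(_:from:to:stoppingOnError:into:) function. It is not lazy and does not use UTF8Span; the helper name transcodeToUTF16 is hypothetical and only serves as an illustration.

```swift
/// Hypothetical helper: eagerly transcodes UTF-8 bytes to UTF-16 code units.
/// Returns nil if the input contains an encoding error.
/// A lazy TranscodedView would produce the same code units on demand.
func transcodeToUTF16(_ utf8Bytes: [UInt8]) -> [UInt16]? {
  var utf16Units: [UInt16] = []
  // The stdlib's `transcode` returns true when it detected an encoding error.
  let hadError = transcode(
    utf8Bytes.makeIterator(),
    from: UTF8.self,
    to: UTF16.self,
    stoppingOnError: true
  ) { utf16Units.append($0) }
  return hadError ? nil : utf16Units
}
```

A generic view would replace the fixed UTF16 target with the Encoding type parameter shown above.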
We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms.
Finally, case-folded functionality can be accessed in today's Swift via scalar properties, but we could provide convenience collections ourselves as well.
Regex or regex-like support
Future API additions could support Regexes on UTF8Span. We'd expose grapheme-level semantics, scalar-level semantics, and introduce byte-level semantics.
Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as:
extension UTF8Span.CharacterView {
  func matchCharacterClass(
    _: CharacterClass,
    startingAt: Index,
    limitedBy: Index
  ) throws -> Index?

  func matchQuantifiedCharacterClass(
    _: CharacterClass,
    _: QuantificationDescription,
    startingAt: Index,
    limitedBy: Index
  ) throws -> Index?
}
which would be useful for parser-combinator libraries that wish to expose String's model of Unicode by using the stdlib's accelerated implementation.
Canonical Spaceships
Should a ComparisonResult (or spaceship) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than with separate calls to isCanonicallyEquivalent(to:) and isCanonicallyLessThan(_:).
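As a rough illustration of the single-pass idea, the sketch below does a three-way comparison over two scalar sequences in one traversal. It assumes both inputs have already been brought into the same normal form (normalization is not shown), and the Ordering type and threeWayCompare function are hypothetical names, not proposed API.

```swift
/// Hypothetical stand-in for a future ComparisonResult / spaceship result.
enum Ordering { case ascending, same, descending }

/// Compares two scalar sequences lexicographically in a single pass.
/// Assumption: both inputs are already in the same normal form.
func threeWayCompare<S1: Sequence, S2: Sequence>(
  _ lhs: S1, _ rhs: S2
) -> Ordering
where S1.Element == Unicode.Scalar, S2.Element == Unicode.Scalar {
  var l = lhs.makeIterator()
  var r = rhs.makeIterator()
  while true {
    switch (l.next(), r.next()) {
    case (nil, nil): return .same        // both ended together
    case (nil, _?): return .ascending    // lhs is a proper prefix of rhs
    case (_?, nil): return .descending   // rhs is a proper prefix of lhs
    case let (a?, b?):
      if a < b { return .ascending }
      if a > b { return .descending }
      // Equal scalars: keep scanning.
    }
  }
}
```

Answering "equivalent?" and "less than?" separately would walk the common prefix twice; a single three-way result avoids that.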
Other Unicode functionality
For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of String's API. Other functionality can be considered future work.
Exposing String's storage class
String's internal storage class is null-terminated valid UTF-8 (by substituting replacement characters) and implements range-replaceable operations along scalar boundaries. We could consider exposing the storage class itself, which might be useful for embedded platforms that don't have String.
Yield UTF8Spans in byte parsers
Span's proposal mentions a future direction of byte parsing helpers on a Cursor or Iterator type on RawSpan. We could extend these types (or analogous types on Span<UInt8>) with UTF-8 parsing code:
extension RawSpan.Cursor {
  public mutating func parseUTF8(length: Int) throws -> UTF8Span
  public mutating func parseNullTerminatedUTF8() throws -> UTF8Span
}
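To make the shape of these helpers concrete, here is a minimal sketch over a plain [UInt8] buffer. ByteCursor and UTF8ParseError are hypothetical stand-ins for RawSpan.Cursor and a real validation error, and the helpers return String rather than UTF8Span so the sketch runs without the new span types.

```swift
/// Hypothetical error type; a real API would report a UTF-8 encoding error.
struct UTF8ParseError: Error {}

/// Hypothetical stand-in for a cursor over raw bytes.
struct ByteCursor {
  let bytes: [UInt8]
  var position = 0

  init(_ bytes: [UInt8]) { self.bytes = bytes }

  /// Parses exactly `length` bytes as validated UTF-8 and advances past them.
  mutating func parseUTF8(length: Int) throws -> String {
    guard length >= 0, length <= bytes.count - position else {
      throw UTF8ParseError()
    }
    let slice = bytes[position..<(position + length)]
    let decoded = String(decoding: slice, as: UTF8.self)
    // String(decoding:as:) repairs invalid input with U+FFFD, so the
    // input was valid only if it round-trips byte for byte.
    guard decoded.utf8.elementsEqual(slice) else { throw UTF8ParseError() }
    position += length
    return decoded
  }

  /// Parses up to (not including) the next NUL byte, then consumes the NUL.
  mutating func parseNullTerminatedUTF8() throws -> String {
    guard let terminator = bytes[position...].firstIndex(of: 0) else {
      throw UTF8ParseError()
    }
    let result = try parseUTF8(length: terminator - position)
    position += 1  // skip the terminator
    return result
  }
}
```

On failure the cursor position is left unchanged, so a caller could retry with a different interpretation of the bytes.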
Track other bits
Future work includes tracking whether the contents are NULL-terminated (useful for C bridging), whether the contents contain any newlines or only a single newline at the end (useful for accelerating the Regex . metacharacter), etc.
Alternatives considered
Invalid start / end of input UTF-8 encoding errors
Earlier prototypes had .invalidStartOfInput and .invalidEndOfInput UTF-8 validation errors to communicate that the input was perhaps incomplete or not sliced along scalar boundaries. In this scenario, .invalidStartOfInput is equivalent to .unexpectedContinuation with the range's lower bound equal to 0 and .invalidEndOfInput is equivalent to .truncatedScalar with the range's upper bound equal to count.
This was rejected so as to not have two ways to encode the same error. There is no loss of information and .unexpectedContinuation/.truncatedScalar with ranges are more semantically precise.
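The equivalence described above can be sketched with a simplified error value. EncodingErrorSketch and the two derived properties are hypothetical and only model the kind and range fields; they are not the pitched UTF8.EncodingError type.

```swift
/// Hypothetical, simplified model of a validation error: a kind plus the
/// byte range it covers.
struct EncodingErrorSketch: Equatable {
  enum Kind { case unexpectedContinuation, truncatedScalar }
  var kind: Kind
  var range: Range<Int>
}

extension EncodingErrorSketch {
  /// What the rejected `.invalidStartOfInput` case would have reported.
  var isInvalidStartOfInput: Bool {
    kind == .unexpectedContinuation && range.lowerBound == 0
  }

  /// What the rejected `.invalidEndOfInput` case would have reported,
  /// given the total byte count of the input.
  func isInvalidEndOfInput(count: Int) -> Bool {
    kind == .truncatedScalar && range.upperBound == count
  }
}
```

Because both conditions are derivable from the kind and range, dedicated cases would carry no extra information.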
An unsafe UTF8 Buffer Pointer type
An earlier pitch proposed an unsafe version of UTF8Span. Now that we have ~Escapable, a memory-safe UTF8Span is better.
Other names for basic operations
An alternative name for nextScalarStart(_:) and previousScalarStart(_:) could be something like scalarEnd(startingAt:) and scalarStart(endingAt:). Similarly, decodeNextScalar(_:) and decodePreviousScalar(_:) could be decodeScalar(startingAt:) and decodeScalar(endingAt:). These names are similar to index(after:) and index(before:).
However, in practice this buries the direction deeper into the argument label and is more confusing than the index(before/after:) analogues. This is especially true when the argument label contains unchecked or uncheckedAssumingAligned.
That being said, these names are definitely bikesheddable and we'd like suggestions from the community.
Other bounds or alignment checked formulations
For many operations that take an index that needs to be appropriately aligned, we propose foo(_:), foo(unchecked:), and foo(uncheckedAssumingAligned:).
foo(_:) and foo(unchecked:) have analogues in Span, and foo(uncheckedAssumingAligned:) is the lowest-level interface that a type such as Iterator would call (since it maintains index validity and alignment as an invariant).
We could additionally have a foo(assumingAligned:) overload that does bounds checking, but it's unclear what the use case would be.
Another alternative is to only have a variant that skips both bounds and alignment checks and call it foo(unchecked:). However, this use of unchecked: is far more nuanced than Span's and it's not the case that any i in 0..<count would be valid.
We could also only offer foo(_:) and foo(uncheckedAssumingAligned:). Unaligned API such as isScalarAligned(_:) and isScalarAligned(unchecked:) would keep their names.
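To illustrate how the checked and unchecked layers relate, here is a minimal sketch over a plain byte array. UTF8Bytes is a hypothetical stand-in for UTF8Span; it assumes the contents are already valid UTF-8 and that the end position counts as aligned. Array subscripting in Swift still bounds-checks, so "unchecked" here only marks which preconditions each layer is responsible for.

```swift
/// Hypothetical stand-in for UTF8Span over already-validated bytes.
struct UTF8Bytes {
  let bytes: [UInt8]

  /// Bounds-checked alignment query.
  func isScalarAligned(_ i: Int) -> Bool {
    precondition(i >= 0 && i <= bytes.count, "index out of bounds")
    return isScalarAligned(unchecked: i)
  }

  /// Skips the bounds precondition; the caller guarantees 0...count.
  func isScalarAligned(unchecked i: Int) -> Bool {
    if i == bytes.count { return true }  // assumption: end is aligned
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (bytes[i] & 0xC0) != 0x80
  }

  /// Fully checked: verifies both bounds and alignment.
  func nextScalarStart(_ i: Int) -> Int {
    precondition(i >= 0 && i < bytes.count, "index out of bounds")
    precondition(isScalarAligned(unchecked: i), "index not scalar-aligned")
    return nextScalarStart(uncheckedAssumingAligned: i)
  }

  /// Lowest level: the caller (e.g. an iterator) maintains bounds and
  /// alignment as an invariant.
  func nextScalarStart(uncheckedAssumingAligned i: Int) -> Int {
    var j = i + 1
    while j < bytes.count && (bytes[j] & 0xC0) == 0x80 { j += 1 }
    return j
  }
}
```

An iterator that only ever stores results of nextScalarStart can call the uncheckedAssumingAligned form directly, which is the use case described above.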