UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes
- Proposal: SE-NNNN
- Authors: Michael Ilseman, Guillaume Lessard
- Review Manager: TBD
- Status: Awaiting review
- Bug: rdar://48132971, rdar://96837923
- Implementation: swiftlang/swift#78531 (draft pull request by milseman)
- Upcoming Feature Flag:
- Review: (pitch 1) (pitch 2)
Introduction
We introduce UTF8Span for efficient and safe Unicode processing over contiguous storage. UTF8Span is a memory-safe, non-escapable type similar to Span.
Native Strings are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements String's API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers.
Motivation
Currently, if a developer wants to do String-like processing over UTF-8 bytes, they have to make an instance of String, which allocates a native storage class, copies all the bytes, and is reference counted. The developer would then need to operate within the new String's views and map between String.Index and byte offsets in the original buffer.
For example, if these bytes were part of a data structure, the developer would need to decide whether to cache such a new String instance or recreate it on the fly. Caching more than doubles the memory footprint and adds caching complexity. Recreating it on the fly adds a linear-time copy, a class instance allocation and deallocation, and potentially reference counting.
Furthermore, String may not be fully available on tightly constrained platforms, especially those that cannot support allocations. Both String and UTF8Span have some API that requires Unicode data tables, which might not be available on embedded platforms: String depends on these tables via its conformances to Comparable and Collection, while UTF8Span has a couple of methods that will be unavailable.
UTF-8 validity and efficiency
UTF-8 validation is a particularly common concern and the subject of a fair amount of research. Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's String type's native storage is guaranteed-valid-UTF8 for this reason.
Failure to guarantee UTF-8 encoding validity creates security and safety concerns. With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it that differ from the bounds dictated by its contents.
Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an overlong encoding, which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison).
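To make the overlong-encoding hazard concrete, here is a minimal sketch that uses only existing standard library API (String(decoding:as:), which repairs ill-formed input with U+FFFD). The helper name isValidUTF8 and the sample bytes are illustrative assumptions and are not part of this proposal.

```swift
// Hypothetical helper (not part of the proposal): valid UTF-8 survives a
// decode/re-encode round trip unchanged, because `String(decoding:as:)`
// replaces every ill-formed subsequence with U+FFFD (bytes EF BF BD).
func isValidUTF8(_ bytes: [UInt8]) -> Bool {
    Array(String(decoding: bytes, as: UTF8.self).utf8) == bytes
}

// "a/b" with "/" (U+002F) written as the overlong two-byte form C0 AF.
let overlong: [UInt8] = [0x61, 0xC0, 0xAF, 0x62]

// A byte-wise scan for "/" finds nothing, even though a lenient decoder
// that accepts overlong forms would produce U+002F here.
let sawSlashByte = overlong.contains(0x2F)

// Validation rejects the input outright.
let valid = isValidUTF8(overlong)

print(sawSlashByte, valid)  // false false
```

Code that filters on raw bytes is only trustworthy once the input has been validated, which is the guarantee a validated view provides up front.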
Proposed solution
We propose a non-escapable UTF8Span which exposes String functionality for validly-encoded UTF-8 code units in contiguous memory. We also propose rich API describing the kind and location of encoding errors.
Detailed design
UTF8Span is a borrowed view into contiguous memory containing validly-encoded UTF-8 code units.
@frozen
public struct UTF8Span: Copyable, ~Escapable {
// TODO: This might end up being UnsafeRawPointer? like in Span
public var unsafeBaseAddress: UnsafeRawPointer
/*
A bit-packed count and flags (such as isASCII)
╔═══════╦═════╦═════╦══════════╦═══════╗
║ b63 ║ b62 ║ b61 ║ b60:56 ║ b55:0 ║
╠═══════╬═════╬═════╬══════════╬═══════╣
║ ASCII ║ NFC ║ SSC ║ reserved ║ count ║
╚═══════╩═════╩═════╩══════════╩═══════╝
ASCII means the contents are all-ASCII (<=0x7F).
NFC means contents are in normal form C for fast comparisons.
SSC means single-scalar Characters (i.e. grapheme clusters): every
`Character` holds only a single `Unicode.Scalar`.
*/
@usableFromInline
internal var _countAndFlags: UInt64
}
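The packed layout can be illustrated with a small stand-alone sketch. It assumes the low 56 bits hold the count and the top three bits hold the flags, as in the diagram; the type CountAndFlags and all of its members are hypothetical names, not the proposed implementation.

```swift
// Illustrative only: a stand-in for the packed `_countAndFlags` word.
// The type and member names here are hypothetical.
struct CountAndFlags {
    var rawBits: UInt64

    static let asciiBit: UInt64 = 1 << 63
    static let nfcBit: UInt64 = 1 << 62
    static let sscBit: UInt64 = 1 << 61
    static let countMask: UInt64 = (1 << 56) - 1  // bits 55...0

    init(count: Int, isASCII: Bool) {
        precondition(count >= 0 && UInt64(count) <= Self.countMask)
        rawBits = UInt64(count) | (isASCII ? Self.asciiBit : 0)
    }

    var count: Int { Int(rawBits & Self.countMask) }
    var isASCII: Bool { rawBits & Self.asciiBit != 0 }
    var isKnownNFC: Bool { rawBits & Self.nfcBit != 0 }

    // Record the result of a later NFC scan without disturbing the count.
    mutating func markNFC() { rawBits |= Self.nfcBit }
}

var word = CountAndFlags(count: 5, isASCII: true)
word.markNFC()
print(word.count, word.isASCII, word.isKnownNFC)  // 5 true true
```

Keeping the flags in the same word as the count means a span stays two words wide while still remembering the results of earlier scans.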
UTF-8 validation
We propose new API for identifying where and what kind of encoding errors are present in UTF-8 content.
extension Unicode.UTF8 {
/**
The kind and location of a UTF-8 encoding error.
Valid UTF-8 is represented by this table:
╔════════════════════╦════════╦════════╦════════╦════════╗
║ Scalar value ║ Byte 0 ║ Byte 1 ║ Byte 2 ║ Byte 3 ║
╠════════════════════╬════════╬════════╬════════╬════════╣
║ U+0000..U+007F ║ 00..7F ║ ║ ║ ║
║ U+0080..U+07FF ║ C2..DF ║ 80..BF ║ ║ ║
║ U+0800..U+0FFF ║ E0 ║ A0..BF ║ 80..BF ║ ║
║ U+1000..U+CFFF ║ E1..EC ║ 80..BF ║ 80..BF ║ ║
║ U+D000..U+D7FF ║ ED ║ 80..9F ║ 80..BF ║ ║
║ U+E000..U+FFFF ║ EE..EF ║ 80..BF ║ 80..BF ║ ║
║ U+10000..U+3FFFF ║ F0 ║ 90..BF ║ 80..BF ║ 80..BF ║
║ U+40000..U+FFFFF ║ F1..F3 ║ 80..BF ║ 80..BF ║ 80..BF ║
║ U+100000..U+10FFFF ║ F4 ║ 80..8F ║ 80..BF ║ 80..BF ║
╚════════════════════╩════════╩════════╩════════╩════════╝
### Classifying errors
An *unexpected continuation* is when a continuation byte (`10xxxxxx`) occurs
in a position that should be the start of a new scalar value. Unexpected
continuations can often occur when the input contains arbitrary data
instead of textual content. An unexpected continuation at the start of
input might mean that the input was not correctly sliced along scalar
boundaries or that it does not contain UTF-8.
A *truncated scalar* is a multi-byte sequence that is the start of a valid
multi-byte scalar but is cut off before ending correctly. A truncated
scalar at the end of the input might mean that only part of the entire
input was received.
A *surrogate code point* (`U+D800..U+DFFF`) is invalid UTF-8. Surrogate
code points are used by UTF-16 to encode scalars in the supplementary
planes. Their presence may mean the input was encoded in a different 8-bit
encoding, such as CESU-8, WTF-8, or Java's Modified UTF-8.
An *invalid non-surrogate code point* is any code point higher than
`U+10FFFF`. This can often occur when the input is arbitrary data instead
of textual content.
An *overlong encoding* occurs when a scalar value that could have been
encoded using fewer bytes is encoded in a longer byte sequence. Overlong
encodings are invalid UTF-8 and can lead to security issues if not
correctly detected:
- https://nvd.nist.gov/vuln/detail/CVE-2008-2938
- https://nvd.nist.gov/vuln/detail/CVE-2000-0884
An overlong encoding of `NUL`, `0xC0 0x80`, is used in Java's Modified
UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts
to bypass security measures.
### Reporting the range of the error
The range of the error reported follows the *Maximal subpart of an
ill-formed subsequence* algorithm in which each error is either one byte
long or ends before the first byte that is disallowed. See "U+FFFD
Substitution of Maximal Subparts" in the Unicode Standard. Unicode started
recommending this algorithm in version 6, and it has been adopted by the W3C.
The maximal subpart algorithm will produce a single multi-byte range for a
truncated scalar (a multi-byte sequence that is the start of a valid
multi-byte scalar but is cut off before ending correctly). For all other
errors (including overlong encodings, surrogates, and invalid code
points), it will produce an error per byte.
Since overlong encodings, surrogates, and invalid code points are erroneous
by the second byte (at the latest), the above definition produces the same
ranges as defining such a sequence as a truncated scalar error followed by
unexpected continuation byte errors. The more semantically-rich
classification is reported.
For example, a surrogate code point sequence `ED A0 80` will be reported
as three `.surrogateCodePointByte` errors rather than a `.truncatedScalar`
followed by two `.unexpectedContinuationByte` errors.
Other commonly reported error ranges can be constructed from this result.
For example, PEP 383's error-per-byte can be constructed by mapping over
the reported range. Similarly, a single error for the longest invalid byte
range can be constructed by joining adjacent error ranges.
╔═════════════════╦══════╦═════╦═════╦═════╦═════╦═════╦═════╦══════╗
║ ║ 61 ║ F1 ║ 80 ║ 80 ║ E1 ║ 80 ║ C2 ║ 62 ║
╠═════════════════╬══════╬═════╬═════╬═════╬═════╬═════╬═════╬══════╣
║ Longest range ║ U+61 ║ err ║ ║ ║ ║ ║ ║ U+62 ║
║ Maximal subpart ║ U+61 ║ err ║ ║ ║ err ║ ║ err ║ U+62 ║
║ Error per byte ║ U+61 ║ err ║ err ║ err ║ err ║ err ║ err ║ U+62 ║
╚═════════════════╩══════╩═════╩═════╩═════╩═════╩═════╩═════╩══════╝
*/
@frozen
public struct EncodingError: Error, Sendable, Hashable, Codable {
/// The kind of encoding error
public var kind: Unicode.UTF8.EncodingError.Kind
/// The range of offsets into our input containing the error
public var range: Range<Int>
@_alwaysEmitIntoClient
public init(
_ kind: Unicode.UTF8.EncodingError.Kind,
_ range: some RangeExpression<Int>
)
@_alwaysEmitIntoClient
public init(_ kind: Unicode.UTF8.EncodingError.Kind, at: Int)
}
}
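The two derived reporting styles described in the documentation comment above can be sketched with plain ranges. This example hard-codes the maximal-subpart ranges for the byte sequence in the table (61 F1 80 80 E1 80 C2 62) rather than calling any proposed API; the variable names are illustrative.

```swift
// Maximal-subpart error ranges for the bytes 61 F1 80 80 E1 80 C2 62
// (the example table in the documentation comment).
let maximalSubparts: [Range<Int>] = [1..<4, 4..<6, 6..<7]

// Error per byte (as in PEP 383): split every range into one-byte ranges.
let errorPerByte: [Range<Int>] = maximalSubparts.flatMap { range in
    range.map { $0..<($0 + 1) }
}

// Longest range: merge ranges that touch end-to-start.
var longest: [Range<Int>] = []
for range in maximalSubparts {
    if let last = longest.last, last.upperBound == range.lowerBound {
        longest[longest.count - 1] = last.lowerBound..<range.upperBound
    } else {
        longest.append(range)
    }
}

print(errorPerByte.count)    // 6
print(longest == [1..<7])    // true
```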
extension UTF8.EncodingError {
/// The kind of encoding error encountered during validation
@frozen
public struct Kind: Error, Sendable, Hashable, Codable, RawRepresentable {
public var rawValue: UInt8
@inlinable
public init(rawValue: UInt8)
/// A continuation byte (`10xxxxxx`) outside of a multi-byte sequence
@_alwaysEmitIntoClient
public static var unexpectedContinuationByte: Self
/// A byte in a surrogate code point (`U+D800..U+DFFF`) sequence
@_alwaysEmitIntoClient
public static var surrogateCodePointByte: Self
/// A byte in an invalid, non-surrogate code point (`>U+10FFFF`) sequence
@_alwaysEmitIntoClient
public static var invalidNonSurrogateCodePointByte: Self
/// A byte in an overlong encoding sequence
@_alwaysEmitIntoClient
public static var overlongEncodingByte: Self
/// A multi-byte sequence that is the start of a valid multi-byte scalar
/// but is cut off before ending correctly
@_alwaysEmitIntoClient
public static var truncatedScalar: Self
}
}
@_unavailableInEmbedded
extension UTF8.EncodingError.Kind: CustomStringConvertible {
public var description: String { get }
}
@_unavailableInEmbedded
extension UTF8.EncodingError: CustomStringConvertible {
public var description: String { get }
}
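As a rough illustration of the classification and range rules above, the following sketch finds the first error in a byte array. It is not the proposed implementation: Kind and Failure are hypothetical stand-ins for the proposed types, and treating the lead bytes C0..C1 as overlong and F5..FF as invalid non-surrogate code points is an assumption made for this example.

```swift
// Hypothetical stand-ins for the proposed `UTF8.EncodingError` types.
enum Kind {
    case unexpectedContinuationByte, surrogateCodePointByte
    case invalidNonSurrogateCodePointByte, overlongEncodingByte
    case truncatedScalar
}
struct Failure: Equatable {
    var kind: Kind
    var range: Range<Int>
}

// Returns the first encoding error, or nil if `bytes` is valid UTF-8.
func firstError(in bytes: [UInt8]) -> Failure? {
    let continuation: ClosedRange<UInt8> = 0x80...0xBF
    var i = 0
    while i < bytes.count {
        let lead = bytes[i]
        let length: Int
        var second = continuation  // values allowed for byte 1
        switch lead {
        case 0x00...0x7F: i += 1; continue
        case 0x80...0xBF:
            return Failure(kind: .unexpectedContinuationByte, range: i..<i + 1)
        case 0xC0...0xC1:  // assumption: always-overlong lead bytes
            return Failure(kind: .overlongEncodingByte, range: i..<i + 1)
        case 0xC2...0xDF: length = 2
        case 0xE0: length = 3; second = 0xA0...0xBF
        case 0xED: length = 3; second = 0x80...0x9F
        case 0xE1...0xEF: length = 3  // E0 and ED were matched above
        case 0xF0: length = 4; second = 0x90...0xBF
        case 0xF1...0xF3: length = 4
        case 0xF4: length = 4; second = 0x80...0x8F
        default:  // assumption: F5...FF reported as invalid code points
            return Failure(kind: .invalidNonSurrogateCodePointByte, range: i..<i + 1)
        }
        // Byte 1 missing or not a continuation: the lead byte is truncated.
        guard i + 1 < bytes.count, continuation.contains(bytes[i + 1]) else {
            return Failure(kind: .truncatedScalar, range: i..<i + 1)
        }
        // A continuation outside the table's range gets the richer kind.
        if !second.contains(bytes[i + 1]) {
            let kind: Kind
            switch lead {
            case 0xE0, 0xF0: kind = .overlongEncodingByte
            case 0xED: kind = .surrogateCodePointByte
            default: kind = .invalidNonSurrogateCodePointByte  // F4 90..BF
            }
            return Failure(kind: kind, range: i..<i + 1)
        }
        // Remaining bytes: a truncation error covers the valid prefix so far.
        var j = i + 2
        while j < i + length {
            guard j < bytes.count, continuation.contains(bytes[j]) else {
                return Failure(kind: .truncatedScalar, range: i..<j)
            }
            j += 1
        }
        i += length
    }
    return nil
}

let sample: [UInt8] = [0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62]
print(firstError(in: sample) == Failure(kind: .truncatedScalar, range: 1..<4))  // true
```

Reporting every error (rather than the first) would additionally need to carry the surrogate, overlong, or invalid-code-point kind across the trailing bytes of the same sequence, which this sketch does not attempt.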
QUESTION: It would be good to expose this functionality via a general-purpose validation API. Do we want a findFirstError-style API, a findAllErrors-style API, or both? E.g.:
extension UTF8 {
public static func checkForError(
_ s: some Sequence<UInt8>
) -> UTF8.EncodingError?
// ... or
public static func checkForAllErrors(
_ s: some Sequence<UInt8>
) -> some Sequence<UTF8.EncodingError>
}
Creation and validation
UTF8Span is validated at initialization time and encoding errors are diagnosed and thrown.
extension UTF8Span {
@lifetime(codeUnits)
public init(
_validating codeUnits: consuming Span<UInt8>
) throws(UTF8.EncodingError)
@lifetime(borrow start)
internal init(
_unsafeAssumingValidUTF8 start: borrowing UnsafeRawPointer,
_countAndFlags: UInt64
)
}
NOTE: The final status of underscores, annotations, etc., is pending things like SE-0456 and Lifetime Dependencies.
Scalar processing
We propose a UTF8Span.ScalarIterator type that can do scalar processing forwards and backwards. Note that ScalarIterator itself is non-escapable, and thus cannot conform to IteratorProtocol, etc.
extension UTF8Span {
public func _makeScalarIterator() -> ScalarIterator
/// Iterate the `Unicode.Scalar`s contents of a `UTF8Span`.
public struct ScalarIterator: ~Escapable {
public var codeUnits: UTF8Span
/// The byte offset of the start of the next scalar. This is
/// always scalar-aligned.
public var currentCodeUnitOffset: Int { get }
public init(_ codeUnits: UTF8Span)
/// Decode and return the scalar starting at `currentCodeUnitOffset`.
/// After the function returns, `currentCodeUnitOffset` holds the
/// position at the end of the returned scalar, which is also the start
/// of the next scalar.
///
/// Returns `nil` if at the end of the `UTF8Span`.
public mutating func next() -> Unicode.Scalar?
/// Decode and return the scalar ending at `currentCodeUnitOffset`. After
/// the function returns, `currentCodeUnitOffset` holds the position at
/// the start of the returned scalar, which is also the end of the
/// previous scalar.
///
/// Returns `nil` if at the start of the `UTF8Span`.
public mutating func previous() -> Unicode.Scalar?
/// Advance `currentCodeUnitOffset` to the end of the current scalar,
/// without decoding it.
///
/// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
/// if at the end of the UTF8Span.
public mutating func skipForward() -> Int
/// Advance `currentCodeUnitOffset` to the end of `n` scalars, without
/// decoding them.
///
/// Returns the number of `Unicode.Scalar`s skipped over, which can be
/// fewer than `n` if at the end of the UTF8Span.
public mutating func skipForward(by n: Int) -> Int
/// Move `currentCodeUnitOffset` to the start of the previous scalar,
/// without decoding it.
///
/// Returns the number of `Unicode.Scalar`s skipped over, which can be 0
/// if at the start of the UTF8Span.
public mutating func skipBack() -> Int
/// Move `currentCodeUnitOffset` to the start of the previous `n` scalars,
/// without decoding them.
///
/// Returns the number of `Unicode.Scalar`s skipped over, which can be
/// fewer than `n` if at the start of the UTF8Span.
public mutating func skipBack(by n: Int) -> Int
/// Reset to the nearest scalar-aligned code unit offset `<= i`.
///
/// **TODO**: Example
public mutating func reset(roundingBackwardsFrom i: Int)
/// Reset to the nearest scalar-aligned code unit offset `>= i`.
///
/// **TODO**: Example
public mutating func reset(roundingForwardsFrom i: Int)
/// Reset this iterator to code unit offset `i`, skipping _all_ safety
/// checks.
///
/// Note: This is only for very specific, low-level use cases. If
/// `i` is not properly scalar-aligned, this function can
/// result in undefined behavior when, e.g., `next()` is called.
///
/// For example, this could be used by a regex engine to backtrack to a
/// known-valid previous position.
///
public mutating func reset(uncheckedAssumingAlignedTo i: Int)
/// Returns the UTF8Span containing all the content up to the iterator's
/// current position.
public func _prefix() -> UTF8Span
/// Returns the UTF8Span containing all the content after the iterator's
/// current position.
public func _suffix() -> UTF8Span
}
}
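The rounding behavior of the two checked reset operations can be sketched over a plain byte array. This is an illustration of the idea only, assuming the input is already valid UTF-8; the free functions roundingBackwards and roundingForwards are hypothetical names that mirror the proposed argument labels.

```swift
// A byte is a UTF-8 continuation byte when its top two bits are `10`.
func isContinuation(_ byte: UInt8) -> Bool { byte & 0xC0 == 0x80 }

// Nearest scalar-aligned offset `<= i` in validly-encoded `bytes`.
func roundingBackwards(from i: Int, in bytes: [UInt8]) -> Int {
    precondition(0 <= i && i <= bytes.count)
    var offset = i
    while offset > 0 && offset < bytes.count && isContinuation(bytes[offset]) {
        offset -= 1
    }
    return offset
}

// Nearest scalar-aligned offset `>= i` in validly-encoded `bytes`.
func roundingForwards(from i: Int, in bytes: [UInt8]) -> Int {
    precondition(0 <= i && i <= bytes.count)
    var offset = i
    while offset < bytes.count && isContinuation(bytes[offset]) {
        offset += 1
    }
    return offset
}

// "aé€" encodes as 61 | C3 A9 | E2 82 AC: scalars start at 0, 1, and 3.
let bytes = Array("a\u{E9}\u{20AC}".utf8)
print(roundingBackwards(from: 2, in: bytes))  // 1
print(roundingForwards(from: 2, in: bytes))   // 3
print(roundingForwards(from: 4, in: bytes))   // 6
```

Because a scalar is at most four bytes long, each rounding step inspects at most three continuation bytes in valid input.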
QUESTION: Is it worth also surfacing this as an isScalarAligned API on UTF8Span so it's a little easier to find and spell (as well as talk about in doc comments)?
Character processing
We similarly propose a UTF8Span.CharacterIterator type that can do grapheme-breaking forwards and backwards.
The CharacterIterator assumes that the start and end of the UTF8Span is the start and end of content.
Any scalar-aligned position is a valid place to start or reset the grapheme-breaking algorithm to, though you could get different Character output if resetting to a position that isn't Character-aligned relative to the start of the UTF8Span (e.g. in the middle of a series of regional indicators).
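The regional-indicator case can be demonstrated with String today, since its grapheme breaking follows the same rules. This sketch does not use the proposed iterator; it only compares the Characters produced when breaking starts at the beginning of the content versus one scalar later.

```swift
// Four regional-indicator scalars: U S C A, i.e. the flags "US" and "CA".
let full = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}"
// The same content starting one scalar later: S C A.
let shifted = "\u{1F1F8}\u{1F1E8}\u{1F1E6}"

// Grapheme breaking pairs regional indicators from wherever it starts,
// so the number of scalars per Character differs between the two.
let fullClusters = full.map { $0.unicodeScalars.count }
let shiftedClusters = shifted.map { $0.unicodeScalars.count }

print(fullClusters, shiftedClusters)  // [2, 2] [2, 1]
```

Starting one scalar late pairs S with C and leaves A on its own, so neither resulting Character matches one from the original segmentation.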
@_unavailableInEmbedded
extension UTF8Span {
public func _makeCharacterIterator() -> CharacterIterator
/// Iterate the `Character` contents of a `UTF8Span`.
public struct CharacterIterator: ~Escapable {
public var codeUnits: UTF8Span
/// The byte offset of the start of the next `Character`. This is
/// always scalar-aligned and `Character`-aligned.
public var currentCodeUnitOffset: Int { get }
public init(_ span: UTF8Span)
/// Return the `Character` starting at `currentCodeUnitOffset`. After the
/// function returns, `currentCodeUnitOffset` holds the position at the
/// end of the `Character`, which is also the start of the next
/// `Character`.
///
/// Returns `nil` if at the end of the `UTF8Span`.
public mutating func next() -> Character?
/// Return the `Character` ending at `currentCodeUnitOffset`. After the
/// function returns, `currentCodeUnitOffset` holds the position at the
/// start of the returned `Character`, which is also the end of the
/// previous `Character`.
///
/// Returns `nil` if at the start of the `UTF8Span`.
public mutating func previous() -> Character?
/// Advance `currentCodeUnitOffset` to the end of the current `Character`,
/// without constructing it.
///
/// Returns the number of `Character`s skipped over, which can be 0
/// if at the end of the UTF8Span.
public mutating func skipForward() -> Int
/// Advance `currentCodeUnitOffset` to the end of `n` `Character`s, without
/// constructing them.
///
/// Returns the number of `Character`s skipped over, which can be
/// fewer than `n` if at the end of the UTF8Span.
public mutating func skipForward(by n: Int) -> Int
/// Move `currentCodeUnitOffset` to the start of the previous `Character`,
/// without constructing it.
///
/// Returns the number of `Character`s skipped over, which can be 0
/// if at the start of the UTF8Span.
public mutating func skipBack() -> Int
/// Move `currentCodeUnitOffset` to the start of the previous `n`
/// `Character`s, without constructing them.
///
/// Returns the number of `Character`s skipped over, which can be
/// fewer than `n` if at the start of the UTF8Span.
public mutating func skipBack(by n: Int) -> Int
/// Reset to the nearest character-aligned position `<= i`.
public mutating func reset(roundingBackwardsFrom i: Int)
/// Reset to the nearest character-aligned position `>= i`.
public mutating func reset(roundingForwardsFrom i: Int)
/// Reset this iterator to code unit offset `i`, skipping _all_ safety
/// checks.
///
/// Note: This is only for very specific, low-level use cases. If
/// `i` is not properly scalar-aligned, this function can
/// result in undefined behavior when, e.g., `next()` is called.
///
/// If `i` is scalar-aligned, but not `Character`-aligned, you may get
/// different results from running `Character` iteration.
///
/// For example, this could be used by a regex engine to backtrack to a
/// known-valid previous position.
///
public mutating func reset(uncheckedAssumingAlignedTo i: Int)
/// Returns the UTF8Span containing all the content up to the iterator's
/// current position.
public func prefix() -> UTF8Span
/// Returns the UTF8Span containing all the content after the iterator's
/// current position.
public func suffix() -> UTF8Span
}
}
Comparisons
The content of a UTF8Span can be compared in a number of ways, including literally (byte semantics) and Unicode canonical equivalence.
extension UTF8Span {
/// Whether this span has the same bytes as `other`.
@_alwaysEmitIntoClient
public func bytesEqual(to other: UTF8Span) -> Bool
/// Whether this span has the same bytes as `other`.
@_alwaysEmitIntoClient
public func bytesEqual(to other: some Sequence<UInt8>) -> Bool
/// Whether this span has the same `Unicode.Scalar`s as `other`.
@_alwaysEmitIntoClient
public func scalarsEqual(
to other: some Sequence<Unicode.Scalar>
) -> Bool
/// Whether this span has the same `Character`s as `other`, using
/// `Character.==` (i.e. Unicode canonical equivalence).
@_unavailableInEmbedded
@_alwaysEmitIntoClient
public func charactersEqual(
to other: some Sequence<Character>
) -> Bool
}
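The difference between the three comparison levels can be seen with String's existing views, which have the same semantics. This sketch does not call the proposed methods; it compares a precomposed and a decomposed spelling of the same Character.

```swift
let precomposed = "\u{E9}"    // é as one scalar: bytes C3 A9
let decomposed = "e\u{301}"   // e + combining acute accent: bytes 65 CC 81

// Byte semantics: the encoded code units differ.
let sameBytes = precomposed.utf8.elementsEqual(decomposed.utf8)
// Scalar semantics: one scalar versus two.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
// Character semantics: `Character.==` uses Unicode canonical equivalence.
let sameCharacters = precomposed.elementsEqual(decomposed)

print(sameBytes, sameScalars, sameCharacters)  // false false true
```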
We also support literal (i.e. non-canonical) pattern matching against StaticString.
extension UTF8Span {
static func ~=(_ lhs: UTF8Span, _ rhs: StaticString) -> Bool
}
Canonical equivalence and ordering
UTF8Span can perform Unicode canonical equivalence checks (i.e. the semantics of String.== and Character.==).
extension UTF8Span {
/// Whether `self` is equivalent to `other` under Unicode Canonical
/// Equivalence.
@_unavailableInEmbedded
public func isCanonicallyEquivalent(
to other: UTF8Span
) -> Bool
/// Whether `self` orders less than `other` under Unicode Canonical
/// Equivalence using normalized code-unit order (in NFC).
@_unavailableInEmbedded
public func isCanonicallyLessThan(
_ other: UTF8Span
) -> Bool
}
Extracting sub-spans
Slicing a UTF8Span is nuanced and depends on the caller's desired use. A UTF8Span can only be sliced at scalar-aligned code unit offsets; slicing anywhere else would break the valid-UTF8 invariant. Furthermore, if the caller desires consistent grapheme breaking behavior without externally managing grapheme breaking state, it must be sliced along Character boundaries. For this reason, we have exposed slicing as prefix and suffix operations on UTF8Span's iterators instead of Span's extracting methods.
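A short sketch shows why an arbitrary byte offset is not a safe slicing point. It works on a plain byte array, and isValidUTF8 is a hypothetical helper built on String(decoding:as:), which repairs ill-formed input and therefore changes the bytes of anything invalid.

```swift
// Hypothetical helper: valid UTF-8 round-trips through String unchanged,
// while ill-formed input comes back with U+FFFD substitutions.
func isValidUTF8(_ bytes: ArraySlice<UInt8>) -> Bool {
    Array(String(decoding: bytes, as: UTF8.self).utf8).elementsEqual(bytes)
}

// "h€" encodes as 68 | E2 82 AC.
let bytes = Array("h\u{20AC}".utf8)
let alignedCut = 1, unalignedCut = 2

// Cutting on a scalar boundary leaves two valid halves.
print(isValidUTF8(bytes[..<alignedCut]), isValidUTF8(bytes[alignedCut...]))      // true true
// Cutting inside the euro sign leaves two invalid halves.
print(isValidUTF8(bytes[..<unalignedCut]), isValidUTF8(bytes[unalignedCut...]))  // false false
```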
Queries
UTF8Span checks at construction time and remembers whether its contents are all ASCII. Additional checks can be requested and remembered.
extension UTF8Span {
/// Returns whether the validated contents were all-ASCII. This is checked at
/// initialization time and remembered.
@inlinable
public var isASCII: Bool { get }
/// Returns whether the contents are known to be NFC. This is not
/// always checked at initialization time and is set by `checkForNFC`.
@inlinable
@_unavailableInEmbedded
public var isKnownNFC: Bool { get }
/// Do a scan checking for whether the contents are in Normal Form C.
/// When the contents are in NFC, canonical equivalence checks are much
/// faster.
///
/// `quickCheck` will check for a subset of NFC contents using the
/// NFCQuickCheck algorithm, which is faster than the full normalization
/// algorithm. However, it cannot detect all NFC contents.
///
/// Updates the `isKnownNFC` bit.
@_unavailableInEmbedded
public mutating func checkForNFC(
quickCheck: Bool
) -> Bool
/// Returns whether every `Character` (i.e. grapheme cluster)
/// is known to be comprised of a single `Unicode.Scalar`.
///
/// This is not always checked at initialization time. It is set by
/// `checkForSingleScalarCharacters`.
@_unavailableInEmbedded
@inlinable
public var isKnownSingleScalarCharacters: Bool { get }
/// Do a scan, checking whether every `Character` (i.e. grapheme cluster)
/// is comprised of only a single `Unicode.Scalar`. When a span contains
/// only single-scalar characters, character operations are much faster.
///
/// `quickCheck` will check for a subset of single-scalar character contents
/// using a faster algorithm than the full grapheme breaking algorithm.
/// However, it cannot detect all single-scalar `Character` contents.
///
/// Updates the `isKnownSingleScalarCharacters` bit.
@_unavailableInEmbedded
public mutating func checkForSingleScalarCharacters(
quickCheck: Bool
) -> Bool
}
QUESTION: There is an even quicker quick-check for NFC (checking if all scalars are <0x300, which covers extended Latin characters). Should we expose that level as well?
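The cheapest of these scans can be sketched as single passes over the content. The function names below are hypothetical; the proposal records such results in flag bits rather than exposing free functions. The second helper is the conservative check from the question above: it can confirm NFC for some content but says nothing about content it rejects.

```swift
// Hypothetical helpers; the proposal stores these results as flag bits.
func isAllASCII(_ bytes: some Sequence<UInt8>) -> Bool {
    bytes.allSatisfy { $0 < 0x80 }
}

// Conservative: `true` implies NFC, `false` means "unknown, scan further".
func allScalarsBelow0x300(_ scalars: some Sequence<Unicode.Scalar>) -> Bool {
    scalars.allSatisfy { $0.value < 0x300 }
}

let ascii = "cafe"
let latin1 = "caf\u{E9}"       // precomposed é (U+00E9)
let combining = "cafe\u{301}"  // e followed by U+0301, not in NFC
let cjk = "\u{4E2D}"           // in NFC, but above the 0x300 cutoff

print(isAllASCII(ascii.utf8), isAllASCII(latin1.utf8))  // true false
print(allScalarsBelow0x300(latin1.unicodeScalars))      // true
print(allScalarsBelow0x300(combining.unicodeScalars))   // false
print(allScalarsBelow0x300(cjk.unicodeScalars))         // false
```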
UTF8Span from String
We will add utf8Span-style properties to String and Substring, in line with however SE-0456 turns out.
Span-like functionality
A UTF8Span is similar to a Span<UInt8>, but with the valid-UTF8 invariant and additional information such as isASCII. We propose a way to get a Span<UInt8> from a UTF8Span as well as some methods directly on UTF8Span:
extension UTF8Span {
@_alwaysEmitIntoClient
public var isEmpty: Bool { get }
@_alwaysEmitIntoClient
public var storage: Span<UInt8> { get }
/// Calls a closure with a pointer to the viewed contiguous storage.
///
/// The buffer pointer passed as an argument to `body` is valid only
/// during the execution of `withUnsafeBufferPointer(_:)`.
/// Do not store or return the pointer for later use.
///
/// - Parameter body: A closure with an `UnsafeBufferPointer` parameter
/// that points to the viewed contiguous storage. If `body` has
/// a return value, that value is also used as the return value
/// for the `withUnsafeBufferPointer(_:)` method. The closure's
/// parameter is valid only for the duration of its execution.
/// - Returns: The return value of the `body` closure parameter.
@_alwaysEmitIntoClient
borrowing public func withUnsafeBufferPointer<
E: Error, Result: ~Copyable & ~Escapable
>(
_ body: (_ buffer: borrowing UnsafeBufferPointer<UInt8>) throws(E) -> Result
) throws(E) -> dependsOn(self) Result
}
Getting a String from a UTF8Span
QUESTION: We should make it easier than String(decoding:as:) to make a String copy of the UTF8Span, especially since UTF8Span cannot conform to Sequence or Collection. This will form an ARC-managed copy and not something that will share the (ephemeral) storage.
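For reference, the existing route looks like the following when starting from a temporary buffer pointer. This sketch uses only current standard library API and an ordinary array as the backing storage; it shows that the resulting String is an independent copy.

```swift
var storage = Array("hello".utf8)

// Copy the bytes into a new, ARC-managed String while the pointer is valid.
let copy = storage.withUnsafeBufferPointer { buffer in
    String(decoding: buffer, as: UTF8.self)
}

// Later changes to the original bytes do not affect the copy.
storage[0] = 0x6A  // "jello"
print(copy)  // hello
```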
Source compatibility
This proposal is additive and source-compatible with existing code.
ABI compatibility
This proposal is additive and ABI-compatible with existing code.
Implications on adoption
The additions described in this proposal require a new version of the standard library and runtime.
Future directions
Streaming grapheme breaking
see gist
More alignments
see gist
Normalization
see gist
UnicodeScalarView and CharacterView
see gist
More algorithms
see gist
More validation API
see gist
Transcoded iterators, normalized iterators, case-folded iterators, etc
see gist
Regex or regex-like support
see gist
Canonical Spaceships
see gist
Exposing String's storage class
see gist
Track other bits
see gist
Putting more API on String
see gist
Generalize printing and logging facilities
see gist
Alternatives considered
Invalid start / end of input UTF-8 encoding errors
see gist
An unsafe UTF8 Buffer Pointer type
see gist
Alternatives to Iterators
Functions
see gist
Collections
see gist
Acknowledgments
Karoy Lorentey, Karl, Geordie_J, and fclout contributed to this proposal with their clarifying questions and discussions.