SE-0464: UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

allevato · March 5, 2025, 6:42pm

Hello, Swift community!

The review of SE-0464: UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes begins now and runs through March 19th, 2025.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager by DM. When contacting the review manager directly, please put "SE-0464" in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

swift-evolution/process.md at main · apple/swift-evolution · GitHub

Thank you,

Tony Allevato
Review Manager

grynspan · March 5, 2025, 7:08pm

Could this type be generalized to something like StringSpan: StringProtocol, where it could then be initialized with a sequence of UTF-8 and otherwise behave like a string, but in the future it could also support UTF-16, UTF-32, or other encodings?

The behaviour of a type conforming to StringSpan would be more similar to std::string_view in C++, but still allow inspecting individual UTF-8 bytes as needed.

KeithBauerANZ · March 5, 2025, 9:23pm

It's in the "future directions", but I really feel like the ability to turn a UTF8Span into a String safely & without revalidating the UTF-8 is missing — .to_owned() is one of the most common things to do with an &str in Rust.

I also think StaticString should get a .utf8Span property.

I also feel like there's a missing unchecked initializer? Sometimes we have known-UTF-8 data from another source (e.g. SQLite) & revalidating doesn't seem necessary.

scanon · March 5, 2025, 11:15pm

Yeah, there’s no reason not to include this unless we’re actually unable to do so now for some reason.

fclout · March 6, 2025, 7:33am

Can someone help me understand the implications of a type being ~Escapable and BitwiseCopyable? I don't remember that Span is BitwiseCopyable (@glessard to confirm). Given they must have roughly the same constraints and insides, that seems suspicious.

Looking at the comparisons section, I'm noticing that there's a bytesEqual method that takes another UTF8Span, but that's the extent of comparisons with "Span" types. For instance, you can't use bytesEqual with a Span<UInt8>, or charactersEqual with another UTF8Span.

isKnownASCII raises several questions for me:

/// ASCII-ness is checked and remembered during UTF-8 validation, so this
/// is often equivalent to is-ASCII, but there are some situations where 
/// we might return `false` even when the content happens to be all-ASCII.
///
/// For example, a UTF-8 span generated from a `String` that at some point 
/// contained non-ASCII content would report false for `isKnownASCII`, even 
/// if that String had subsequent mutation operations that removed any 
/// non-ASCII content.

Does this mean that you can create a UTF8Span out of a String without going through validation?
Does this mean String also itself remembers whether it is entirely ASCII and passes it on to UTF8Span?
If yes, how come removing non-ASCII characters from a String can invalidate its ASCII bit but adding non-ASCII characters is handled correctly?

Both isKnownASCII and isKnownNFC rely on you having called the corresponding checkFor method at some point, but there's no way to know it's been called before.

Least importantly: EncodingError.init doesn't use argument labels, which seems unusual?

scanon · March 6, 2025, 1:34pm

Yes, if a String is UTF8 encoded (i.e. is represented by one of the native String representations), then no validation is necessary to get a UTF8Span, because the storage is already valid UTF8.

String has an isASCII bit in some of its representations. If this bit is set, then the String contains only ASCII bytes, and so the bit would also be set on a Span from the String.

String does not check the remaining characters for ASCII-ness on removal (this would be prohibitively expensive). So if the character removed is the only non-ASCII character in the string, there is no mechanism to detect this and set the string's isASCII bit. This does not "invalidate" it, however; the bit should be understood as set meaning "is definitely ASCII" and unset meaning "no claim is made about ASCII-ness", so the unset state can never be invalid.
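To illustrate the "set means definitely ASCII, unset means no claim" reading, here is a minimal sketch. `TextBuffer` and all of its members are hypothetical names for a toy model; this is not String's actual layout or implementation.

```swift
// Toy model only: `TextBuffer` is hypothetical, not String's real storage.
struct TextBuffer {
  private(set) var bytes: [UInt8] = []

  // true  => contents are definitely ASCII
  // false => no claim is made (contents may or may not be ASCII)
  private(set) var isKnownASCII = true

  mutating func append(_ byte: UInt8) {
    bytes.append(byte)
    // Inspecting the one incoming byte is O(1), so the flag is cleared eagerly.
    if byte >= 0x80 { isKnownASCII = false }
  }

  mutating func removeLast() {
    bytes.removeLast()
    // Re-setting the flag here would need an O(n) rescan, so it is left alone.
  }

  // Opt-in full scan that can upgrade "no claim" to "definitely ASCII".
  mutating func checkForASCII() -> Bool {
    if !isKnownASCII {
      isKnownASCII = bytes.allSatisfy { $0 < 0x80 }
    }
    return isKnownASCII
  }
}
```

In this model, appending the two bytes of "é" clears the flag, and removing them again leaves it cleared until `checkForASCII()` is called. The flag is conservative at every step, but never wrong.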

Michael_Ilseman · March 6, 2025, 4:21pm

StringProtocol requires Collection conformances, which are unavailable for non-escapable types.

I think there could be a future direction for abstracting over other encodings. It would likely be after we know what some kind of Container protocol looks like.

Something like this?

extension String {
  /// Makes a copy (skips validation)
  init(copying codeUnits: UTF8Span)
}

This is a little bit more complicated because StaticString has a small form for a single Unicode.Scalar. Unlike String's small form, this isn't UTF-8 and isn't laid out in memory such that we can access it by unsafe address. We'd either need a builtin to pass in a tiny stack buffer or some kind of coroutine accessor for it.

Yes, I think this would be useful. Such an init would probably be @unsafe.

Michael_Ilseman · March 6, 2025, 4:24pm

An alternative could be to use 2 bits to encode known-ness and have a yes/no/maybe return type.
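A rough sketch of what that could look like, assuming two bits per property packed into one byte. `Knowledge` and `Flags` are hypothetical names for illustration and are not part of the proposal.

```swift
// Hypothetical names throughout; this is not the proposed UTF8Span API.
enum Knowledge: UInt8 {
  case unknown = 0b00  // never checked
  case no      = 0b01  // checked: the property does not hold
  case yes     = 0b10  // checked: the property holds
}

struct Flags {
  private var raw: UInt8 = 0

  // Bits 0-1 hold ASCII knowledge.
  var ascii: Knowledge {
    get { Knowledge(rawValue: raw & 0b11) ?? .unknown }
    set { raw = (raw & ~0b11) | newValue.rawValue }
  }

  // Bits 2-3 hold NFC knowledge.
  var nfc: Knowledge {
    get { Knowledge(rawValue: (raw >> 2) & 0b11) ?? .unknown }
    set { raw = (raw & ~0b1100) | (newValue.rawValue << 2) }
  }
}
```

With this shape a caller can tell "checked and false" apart from "never checked", which a single bit cannot express.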

allevato · March 6, 2025, 4:54pm

(Removing my review manager hat)

Minor name bikeshedding

For the function isCanonicallyLessThan, we have prior art in Sequence that uses the term "precedes" rather than "less than" to represent ordering, via lexicographicallyPrecedes. I wonder if we should call this canonicallyPrecedes instead, to align with that nomenclature.

That does lose symmetry with isCanonicallyEquivalent(to:), however. I'm not sure how we could align it with Sequence.elementsEqual(_:). But it probably doesn't make sense to, either; that operation implies element-wise equality, whereas canonical equivalence is a property of the whole pair of spans rather than direct comparisons of the elements.
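For reference, the existing Sequence spellings mentioned above behave as follows on plain arrays. This only shows the prior art in the standard library, not anything on UTF8Span.

```swift
let a = [1, 2, 3]
let b = [1, 2, 4]

// Ordering uses "precedes" terminology on Sequence.
let aPrecedesB = a.lexicographicallyPrecedes(b)

// Equality on Sequence is strictly element-by-element.
let sameElements = a.elementsEqual(b)
```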

Pattern matching

I think the definition of the ~= in the proposal text has a typo, since the pattern should be the LHS and the value being matched should be the RHS. It looks like the implementation has it correct, though. (Review manager hat on: apologies for not catching that in my review pass! Review manager hat off.)

The documented behavior of ~= doing a non-canonical comparison feels like a potential foot-gun, though. Suppose users were able to write something like this:

let x: String
switch x {
case "é": // statements
}

let x: UTF8Span
switch x {
case "é": // statements
}

These two snippets, which look exactly the same on the surface, would have different outcomes depending on whether, for example, "é" was encoded as U+00E9 or as U+0065 U+0301.

Is there a strong reason to prefer byte-by-byte equality for this comparison? I imagine availability in embedded platforms is one, but is that enough to warrant the potential surprise across all platforms?
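To make the difference concrete, here is a small sketch that uses only String and its utf8 view, so it runs today. The byte-wise comparison stands in for what a non-canonical ~= would do; it is not the UTF8Span implementation.

```swift
let precomposed = "\u{00E9}"   // é as the single scalar U+00E9
let decomposed  = "e\u{0301}"  // e followed by U+0301 COMBINING ACUTE ACCENT

// String's == compares by canonical equivalence.
let canonicallyEqual = (precomposed == decomposed)

// Comparing the raw UTF-8 code units sees two different byte sequences:
// [0xC3, 0xA9] versus [0x65, 0xCC, 0x81].
let bytesEqual = precomposed.utf8.elementsEqual(decomposed.utf8)

// Pattern matching on String goes through ==, so either encoding matches.
func describe(_ s: String) -> String {
  switch s {
  case "\u{00E9}": return "matched"
  default: return "no match"
  }
}
```

Here `canonicallyEqual` is true while `bytesEqual` is false, and `describe(decomposed)` still reports a match. A byte-wise ~= on a span over the decomposed bytes would take the default branch instead.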

Nobody1707 · March 6, 2025, 5:24pm

I don't see how this can be true considering that StaticString has a utf8Start property that returns a stable UnsafePointer<UInt8> that is valid for the lifetime of the string. I could understand it if utf8Start returned an optional pointer, but it doesn't.

glessard · March 6, 2025, 5:29pm

Span and RawSpan are both ~Escapable & BitwiseCopyable. This isn't that much more interesting than saying ~Escapable & Copyable!

The consequences of a value being of a ~Escapable type exist at compile time only. They limit where the value can be stored. BitwiseCopyable describes how a value of that type can be copied, namely with the equivalent of memcpy(); it doesn't say anything about where it can be copied. A Span instance, which is ~Escapable & BitwiseCopyable, can only be copied within the scope of its valid lifetime; when it is copied, however, the underlying operation is a memcpy(), with no involvement from the runtime. This has nothing to do with the memory that is referred to by the Span, RawSpan or UTF8Span; just the pointer+count pair that forms the reference itself.

allevato · March 6, 2025, 5:30pm

The documentation for utf8Start states:

Accessing this property when hasPointerRepresentation is false triggers a runtime error.
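As a sketch of what that means for callers, here is one way to read the UTF-8 bytes of a StaticString without hitting that runtime error. `utf8Bytes(of:)` is a hypothetical helper; the StaticString members it uses (`hasPointerRepresentation`, `utf8Start`, `utf8CodeUnitCount`, `unicodeScalar`) are existing API.

```swift
// Hypothetical helper: copies out the UTF-8 bytes for either representation.
func utf8Bytes(of s: StaticString) -> [UInt8] {
  if s.hasPointerRepresentation {
    // Pointer form: utf8Start is valid, so view the bytes directly.
    let buffer = UnsafeBufferPointer(start: s.utf8Start, count: s.utf8CodeUnitCount)
    return Array(buffer)
  } else {
    // Scalar form: there is no stable address, so re-encode the scalar.
    return Array(String(s.unicodeScalar).utf8)
  }
}
```

The scalar branch has to produce new bytes rather than point at existing ones, which is the complication for handing out a span over that storage.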

fclout · March 6, 2025, 5:30pm

Thanks. Do you get an error if you try to unsafeLoad a Span out of a RawSpan? If not, what lifetime does it get?

Nobody1707 · March 6, 2025, 5:31pm

I'm not sure if that's better than returning an optional pointer, but I see how it works at least. Is there a reason we couldn't return an optional UTF8Span from a StaticString?

fclout · March 6, 2025, 5:36pm

It's weird that UTF8Span would take ownership of remembering whether the contents are ASCII/NFC, but not of whether it knows that it has the right answer. What are the use cases where you wouldn't be using (isKnownASCII || checkForASCII())?

glessard · March 6, 2025, 5:45pm

You get a compilation error: unsafeLoad() requires T: BitwiseCopyable where an Escapable requirement is implied.

glessard · March 6, 2025, 5:46pm

Mostly that we can't return an Optional of a non-escapable type yet. We could consider it soon.

Michael_Ilseman · March 6, 2025, 5:58pm

ASCII is the trivial case for many Unicode processing routines. ASCII-ness means that there's a 1-1 correspondence between Unicode scalars and the raw code units, and is a very common fast path to check when doing string processing.

UTF8Span.UnicodeScalarIterator.next() can just zero-extend the byte and advance by 1. UTF8Span.UnicodeScalarIterator.skip(by n:) adds n (or the distance to the end if less than n) to the codeUnitOffset. UTF8Span.CharacterIterator's operations can quickly check for \r\n via a 2-byte compare and then return its answer right away, etc.
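A minimal sketch of the scalar fast path, written over a plain byte array rather than a UTF8Span. `scalars(in:isKnownASCII:)` is a hypothetical helper, and the caller is assumed to pass `isKnownASCII: true` only when every byte is below 0x80.

```swift
// Hypothetical helper over [UInt8]; not the UTF8Span iterator itself.
func scalars(in bytes: [UInt8], isKnownASCII: Bool) -> [Unicode.Scalar] {
  if isKnownASCII {
    // Fast path: one byte per scalar, so zero-extend each byte.
    return bytes.map { Unicode.Scalar($0) }
  }
  // General path: full UTF-8 decoding (invalid input is repaired lossily).
  return Array(String(decoding: bytes, as: UTF8.self).unicodeScalars)
}
```

Both paths give the same result for ASCII input; the fast path simply skips the multi-byte decoding logic.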

Similarly, any (future) UTF-16 transcoding operations, including lazy views/iterators, benefit from knowing the 1-1 correspondence. They can just zero-extend/truncate to transcode, any buffers or moving windows can be appropriately sized (instead of pessimistically sized), etc.
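The same idea applied to transcoding, again as a sketch over a byte array. `transcodeToUTF16(_:isKnownASCII:)` is a hypothetical helper, not a proposed API.

```swift
// Hypothetical helper; assumes isKnownASCII is only true for all-ASCII input.
func transcodeToUTF16(_ bytes: [UInt8], isKnownASCII: Bool) -> [UInt16] {
  if isKnownASCII {
    // 1-1 correspondence: the output has exactly bytes.count code units,
    // so the buffer can be sized exactly rather than pessimistically.
    var output = [UInt16]()
    output.reserveCapacity(bytes.count)
    for byte in bytes {
      output.append(UInt16(byte))  // zero-extend
    }
    return output
  }
  // General path: decode UTF-8, then re-encode as UTF-16.
  return Array(String(decoding: bytes, as: UTF8.self).utf16)
}
```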

Known-ASCII is also the most content-relative fast path in String's own implementation.

These wouldn't want to scan the entire contents to determine ASCII-ness. Some use cases, though, might want to make the scan.

scanon · March 6, 2025, 7:18pm

It always has the right answer--either the contents are definitely ASCII, or it makes no claims about the contents. Either way, it's always correct.

Karl · March 6, 2025, 7:22pm

This is a very important type - our new broadest text currency type.

I need more time to digest it, but with regard to normalisation checks, and quick-check in particular, I can share some of the things I considered when omitting it from the normalisation proposal.

Basically, when you perform a quick check for normalisation, you can end up with one of 3 answers:

YES - definitely is normalised
NO - definitely is not normalised
MAYBE - could be normalised, but a more comprehensive (non-"quick") check is needed to say for sure

If we return a Bool, we would need to collapse NO and MAYBE to the same state, losing important information ("do I need to bother to run the comprehensive check?").

So you might think the ideal result looks something like this:

enum QuickCheckResult {
  case yes
  case no
  case maybe
}

Except that's not quite ideal, either. You see, if we reached a MAYBE character, it means there is some prefix of the String which was entirely YES (otherwise we would have early-exited). If you did want to resolve that MAYBE condition, we'd really want to preserve that information. So maybe it would instead look something like this:

enum QuickCheckResult<C: Collection> {
  case yes
  case no
  case maybe(requiresCheckFrom: C.Index)
}

And the way you want to use it would be something like:

var i = text.startIndex
while i < text.endIndex {

  let result = text[i...].isNormalized_QuickCheck(.nfc)
  switch result {
  case .yes: 
    return true
  case .no:  
    return false
  case .maybe(requiresCheckFrom: let startOfRemainder):
    let (isNormalized, resumeFrom) = text.isNormalized_resolveQuickCheckMaybe(.nfc, from: startOfRemainder)
    guard isNormalized else {
      return false
    }
    i = resumeFrom
  }

}

This is very, very rough and we can probably clean it up a lot. The reason I omitted this from the normalisation proposal was so I could avoid thinking about the design of it too much, so these mental sketches are as much as I've got (normalisation is a big enough proposal as it is).

But essentially what you want is to keep using the quick-check algorithm as much as possible, and only fall back to the slow path when you really, truly have no other choice.

This algorithm is already implemented in the isNormalized function I wrote for the normalisation proposal.

(Okay, admittedly it's structured slightly differently from the above because it tracks when a segment containing a MAYBE character ends, not when it begins, but conceptually it's the same thing.)
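For anyone who wants to play with the control flow, here is a self-contained toy version. It is heavily simplified and every name in it is hypothetical: each element is pre-labelled with its quick-check property, array offsets stand in for indices, the combining-class ordering check of the real algorithm is omitted, and the slow path is supplied by the caller as a closure.

```swift
// Toy stand-in for the per-scalar quick-check property.
enum QuickCheckProperty { case yes, no, maybe }

enum QuickCheckResult: Equatable {
  case yes
  case no
  case maybe(requiresCheckFrom: Int)
}

// Scans forward from `start` and stops at the first non-YES element.
func quickCheck(_ props: [QuickCheckProperty], from start: Int) -> QuickCheckResult {
  for i in start..<props.count {
    switch props[i] {
    case .yes: continue
    case .no: return .no
    case .maybe: return .maybe(requiresCheckFrom: i)
    }
  }
  return .yes
}

// Uses the quick check as much as possible; `resolveMaybe` is the slow path.
// It reports whether the MAYBE segment is normalized and where to resume.
func isNormalized(
  _ props: [QuickCheckProperty],
  resolveMaybe: (Int) -> (isNormalized: Bool, resumeFrom: Int)
) -> Bool {
  var i = 0
  while i < props.count {
    switch quickCheck(props, from: i) {
    case .yes:
      return true
    case .no:
      return false
    case .maybe(requiresCheckFrom: let start):
      let (segmentIsNormalized, resumeFrom) = resolveMaybe(start)
      guard segmentIsNormalized else { return false }
      precondition(resumeFrom > start, "slow path must make progress")
      i = resumeFrom
    }
  }
  return true
}
```

The point of carrying the offset in `.maybe` is visible in the loop: the YES prefix before `start` is never rescanned, and the quick check resumes as soon as the slow path hands control back.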

Another thing to consider: in a language like Swift, where Strings are compared with canonical equivalence by default, you very rarely need to manually normalise text. Anybody who is even going down this road, manually normalising or checking for normalisation, is already a bit of an advanced user, and if they're reaching for a normalisation quick-check no less, I think it's safe to say they are the kind of developer who would appreciate us returning the full fidelity of information captured by the quick-check algorithm.