SE-0464: UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Michael_Ilseman · March 26, 2025, 2:13pm

The more direct translation of Unicode's maximal subpart of an
ill-formed subsequence would be something like:

(codeUnitOffset: Int, length: Int)

Where length could be 1, 2, or 3. Range<Int> is a more Swift-canonical representation and I'm open to either formulation.

If we stick with Range<Int>, then I agree it should be codeUnitOffsets or byteOffsets. After a little bit more discussion, I'll update the proposal and ask the LSG to pick a name if we don't settle on one ourselves.

What about func skip(by nScalars: Int) and func skip(by nCharacters: Int)?

I had picked n as that's used elsewhere in the stdlib. In those uses, there's only one Element type in context, but here's the also the concept of code unit offsets.

Hmm, the analogy with Codable is a bit unfortunate, but it is an error in how content is encoded rather than in the encoding processes. It could also be named UTF8.InvalidEncoding or similar. Thoughts?

Those, like Codable's are a bit more operational in nature. The API are single-byte-taking and have an emptyInput case instead of reporting a truncated scalar at the known end of the content.

As to the discussion with @Karl and @scanon , handling unfinished input is better served by more dedicated API than catching this error and diagnosing if it was due to insufficient buffering.

Not a bad idea to precondition on these, the equivalent (codeUnitOffset: Int, length: Int) would have similar checks and we could have an unchecked equivalent to Range(uncheckedBounds:).

I guess a question is whether we want the raw value to be public. I thought it might be helpful for tooling, such as logging errors, etc., which is a beneficiary of error classification and diagnosis.

If we want it public, then I agree with making it failable, as that's supported by RawRepresentable.

Hmm, it seems like the proposed classification is providing a little more granularity by answering "what kind of invalid start byte" without needing the context of the rest of the input. The only case where Unicode maximal subpart of an ill-formed sequence (we need an acronym or something...) cares about context is truncated scalars, which is their special case (e.g. error length >=1).

benrimmington · March 26, 2025, 4:17pm

Range<Int> has an initializer that converts from a closed range.

Another initializer could be added:

extension Range where Bound: Strideable, Bound.Stride: SignedInteger {
  public init(start lowerBound: Bound, count distance: Bound.Stride)
}

byteOffsets would match the error names (e.g. overlongEncodingByte).
codeUnitOffsets would match the iterator APIs (currentCodeUnitOffset).

You could have Replacement in the name, because it represents the range of a U+FFFD "REPLACEMENT CHARACTER" when repairing.

Would the same error type be thrown by the Unicode.UTF8.ByteStreamDecoder APIs?

Michael_Ilseman · March 27, 2025, 6:22pm

I think byteOffsets is the best name, as it doesn't imply that the input content is encoded UTF-8 (and in fact, we just discovered it's invalid).

allevato · April 2, 2025, 12:51pm

Hi all,

The Language Steering Group tentatively accepts this proposal on principle,
but after discussion with the proposal authors, we are seeking the following
modifications:

The name of the error type has been a point of debate during the review.
The Language Steering Group is sympathetic to the argument that
EncodingError, even when namespaced under Unicode.UTF8, is potentially
confusing. While it can reasonably be read as an error of the encoding
(the noun) as opposed to during encoding (the verb), it could also be
interpreted as having the wrong polarity since the error occurs when
validating or decoding the bytes in the span.

Therefore, we would like to offer Unicode.UTF8.ValidationError as the name
of this error type. Such a name would also avoid any confusing overlap with
the Codable category of APIs.
The Language Steering Group agrees with the discussion in the review thread
that byteOffsets is a better name for the range stored in the
aforementioned error type.
Lastly, we believe that the unsafe initializer for UTF8Span from a
Span<UInt8> would benefit from the following defaulted argument:
```
@unsafe public init(
    unsafeAssumingValidUTF8 uncheckedCodeUnits: Span<UInt8>,
    isKnownASCII: Bool = false
)
```
This would provide an important optimization to clients who initialize a
UTF8Span from an external data source that is known a priori to be
ASCII-only, which we expect to be a potentially common use case. This
avoids a separate call to checkForASCII() that would undo the
performance benefit desired from using the unsafe initializer.

We think it's somewhat less likely that users would need to set isKnownNFC
or would have that information during unsafe initialization, and an initializer
that takes such a parameter can be trivially added in the future should the
need arise, so we are comfortable leaving that as a future direction.

Since neither the proposed name in #1 nor the initializer signature in #3 were
raised during review, the Language Steering Group wants to ensure that the
community has the opportunity to comment on these modifications. For this
reason, we are granting a brief extension of the review until April 7 to allow
discussion on these points.

Thanks,

—Tony Allevato
Review manager

Michael_Ilseman · April 4, 2025, 7:44pm

allevato:

Lastly, we believe that the unsafe initializer for UTF8Span from a
Span<UInt8> would benefit from the following defaulted argument:
@unsafe public init(
    unsafeAssumingValidUTF8 uncheckedCodeUnits: Span<UInt8>,
    isKnownASCII: Bool = false
)
This would provide an important optimization to clients who initialize a
UTF8Span from an external data source that is known a priori to be
ASCII-only, which we expect to be a potentially common use case. This
avoids a separate call to checkForASCII() that would undo the
performance benefit desired from using the unsafe initializer.

What's the reasoning for providing a default value for isKnownASCII?

xwu · April 4, 2025, 9:52pm

allevato:

Lastly, we believe that the unsafe initializer for UTF8Span from a
Span<UInt8> would benefit from the following defaulted argument:
@unsafe public init(
    unsafeAssumingValidUTF8 uncheckedCodeUnits: Span<UInt8>,
    isKnownASCII: Bool = false
)

Minor comment: if we add additional flags that are unchecked (unsafe or not), then perhaps we should label the first argument as something other than unsafeAssumingValidUTF8—as we are expanding the unchecked assumptions beyond just valid UTF-8.

(Incidentally, "unsafe assuming" can probably more succinctly be called "unchecked" per our existing precedents—so, maybe just uncheckedValid? There is another label in this proposal which is also unsafeAssuming... or uncheckedAssuming... where the same comment would apply—the "unchecked" part encompasses the "assuming" part.)

allevato · April 6, 2025, 12:34pm

I'll only speak for myself here (taking off my review manager hat), but it feels to me like the right balance of ergonomics. If you have a Span<UInt8> and don't know anything about its contents other than the fact that it's valid UTF-8, then initializing the UTF8Span with the default value never gives you incorrect behavior—you only miss out on potential optimizations. So, reducing the ceremony for folks using this initializer with potentially non-ASCII UTF-8 data makes the cases stand out more where isKnownASCII: true is passed.

Michael_Ilseman · April 6, 2025, 6:36pm

I think uncheckedValidUTF8 is a good alternative name for unsafeAssumingValidUTF8. I'm not sure what the best would be, as invalid UTF-8 can definitely lead to UB, but we also use unchecked elsewhere in the stdlib for that. Unsafe and unchecked are sometimes used interchangably.

For the other arguments, such as isKnownASCII, what would we name them?

init(uncheckedValidUTF8 codeUnits: Span<UInt8>, isKnownASCII: Bool = false, ...)

reads a little better to me than

init(uncheckedValidUTF8 codeUnits: Span<UInt8>, uncheckedIsKnownASCII: Bool = false, ...)

I do not have strong (or even weak) opinions here.

scanon · April 6, 2025, 7:29pm

For me, I'd probably prefer to shorten the labels a bit to something like:

let mySpan = UTF8Span(uncheckedUTF8: data, isASCII: true)

as this is a bit of an expert-mode API, users can be expected to read the documentation or read between the lines: if we're not validating UTF8, one assumes that we're also not validating your claim of ASCII-ness (and, as ever, ValidUTF8 is redundant; if it's UTF8, it's valid).

xwu · April 6, 2025, 7:38pm

I would go even further on the brevity of first argument. It's not just the "UTF-8–ness" that's unchecked, it's anything at all about the data; since the corresponding safe API is UTF8Span(validating:), I'd say this can just be UTF8Span(unchecked:...).

As for the second argument, though, I think the "known" part is load-bearing: as @allevato says, the consequences are not symmetric between a false-positive and a false-negative here, so even if it's an advanced API I think we want to encourage users to leave the default alone if they don't know for sure.

scanon · April 8, 2025, 8:39pm

Xiaodi convinced me that this should be

init(unchecked codeUnits: Span<UInt8>, isKnownASCII: Bool)

Regarding the reset API in the proposal, I think the same basic arguments apply (and also uncheckedAssumingAlignedTo is very problematic because it suggests the new position is "aligned to" the argument, rather than the assumption that it's aligned being unchecked), and so we should have:

mutating func reset(toUnchecked offset: Int)

allevato · April 10, 2025, 7:42pm

This proposal has been accepted with modifications.