SE-0464: UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

Yeah, I made a similar point to Michael when he was working on this proposal: if you want to write high-performance string processing that takes advantage of known-ASCII segments as efficiently as possible, you need something much finer-grained than just a tri-state "isASCII". Rather, you want an API along the lines of "give me a collection of pairs of spans, each of which consists of an all-ASCII prefix with some minimum length, followed by a suffix that contains some non-ASCII characters," which is a much higher-level thing than the rest of this proposal.
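
A rough sketch of the shape such an API could take, for concreteness. Everything here is hypothetical: the names are made up, ArraySlice stands in for Span to keep the example self-contained, and the minimum-prefix-length refinement is omitted.

struct ASCIIChunk {
    var asciiPrefix: ArraySlice<UInt8>  // a contiguous run of bytes < 0x80
    var suffix: ArraySlice<UInt8>       // the following run containing non-ASCII bytes
}

func asciiChunks(of bytes: [UInt8]) -> [ASCIIChunk] {
    var chunks: [ASCIIChunk] = []
    var start = bytes.startIndex
    while start < bytes.endIndex {
        // Extend the all-ASCII prefix as far as possible.
        let prefixEnd = bytes[start...].firstIndex { $0 >= 0x80 } ?? bytes.endIndex
        // Extend the non-ASCII suffix until the next ASCII run begins.
        let suffixEnd = bytes[prefixEnd...].firstIndex { $0 < 0x80 } ?? bytes.endIndex
        chunks.append(ASCIIChunk(asciiPrefix: bytes[start..<prefixEnd],
                                 suffix: bytes[prefixEnd..<suffixEnd]))
        start = suffixEnd
    }
    return chunks
}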

6 Likes

Well, the original property has three states, and you compacted it to two. I appreciate that this is accurately represented in the name of the property. However, if your question is "is it ASCII" instead of "is it definitely ASCII", then the property is not answering the question accurately.

I have no immediate use case for either isKnownAscii or a hypothetical isAscii, so it's ultimately none of my business, but I'm wondering if this is overfitted to the standard library's use case. It seems like this is only useful for optimizations.

Correct. It is only useful for optimizations, and that's the purpose for which it is intended (because it is very useful for optimizations).

6 Likes

The Unicode.UTF8.EncodingError struct has:

  • a range: Range<Int> property,
  • a range: some RangeExpression<Int> parameter,
  • an at: Int parameter.

I think these should be renamed to either byteOffsets or codeUnitOffsets.

The draft implementation uses range.relative(to: Int.min..<Int.max) to convert from a partial/closed range. But why should it accept partial ranges?


The Unicode.UTF8.EncodingError.Kind struct has:

  • a non-failable initializer that accepts all UInt8 raw values,
  • static properties for the known error kinds.

Can this type be an extensible enum, perhaps with String raw values? Otherwise, should the initializer be failable?

(The draft implementation of the description property uses fatalError() for unknown error kinds.)


The UTF8Span.UnicodeScalarIterator and UTF8Span.CharacterIterator structs have:

  • skip…(by:) methods that take a scalar/character count,
  • reset methods that take a code unit offset.

These currently have parameters named n and i, which might be clearer if renamed.

It's also not clear if the suffix() methods will include the current scalar/character.

Unicode.UTF8.EncodingError might be misnamed, because it isn't thrown during encoding — unlike the EncodingError used by Codable.

Other similar types are Unicode.ParseResult and UnicodeDecodingResult.


Unicode.UTF8.EncodingError.Kind is missing a case for:

As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.

It does make sense to accept at least some forms of range expressions, for example constructing a single-byte error at a given location the way the implementation of init(_:at:) does.

We could add overloads only for the forms we think might be commonly useful (normal range, closed range, etc.). But there's nothing wrong with partial ranges (e.g. an error on the first byte via ...0), and some RangeExpression is typically how the stdlib would generalize from Range to a range expression.

This is written as an extensible enum via the raw-representable struct with static members pattern. I did that because it's the only way to control the layout of the type. We could also drop RawRepresentable and make the raw values and initializers private (but keep the struct frozen).
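
For readers who haven't seen it, a minimal sketch of that pattern (the names and raw values here are illustrative, not the proposal's actual members):

public struct ErrorKind: RawRepresentable, Hashable {
    public var rawValue: UInt8
    public init(rawValue: UInt8) { self.rawValue = rawValue }

    // "Cases" are static members rather than enum cases, so new kinds can
    // be introduced later without changing the type's (frozen) layout.
    public static let truncatedScalar = ErrorKind(rawValue: 0)
    public static let overlongEncoding = ErrorKind(rawValue: 1)
}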

I'm not sure that this should be open to extension; UTF-8 encoding is fully defined, and encoding error diagnosis can be exhaustive.

Wouldn't those be overlong encodings? I suppose an interesting question is how broadly useful diagnosing the kind of error is. AFAICT, the proposed solution is novel in not only reporting the presence and location of an error, but also diagnosing the kind of error.

Callers may have their own domain error cases to map on top. E.g., a truncated scalar at the current end of a stream of bytes signifies that the caller needs to chunk more bytes.

I would agree with calling them overlong encodings. You can squint and argue that 0xf5 .. 0xff might be called something else, but overlong fits as well as any other name for the error would.

1 Like

Partial ranges would be converted to:

(0...).relative(to: Int.min ..< Int.max)  //-> 0 ..< 9223372036854775807
(..<0).relative(to: Int.min ..< Int.max)  //-> -9223372036854775808 ..< 0
(...0).relative(to: Int.min ..< Int.max)  //-> -9223372036854775808 ..< 1

Are negative bounds ever valid? Are empty ranges ever valid?

My interpretation of U+FFFD Substitution of Maximal Subparts and your documentation is:

  • the truncatedScalar error is for a subspan of 1–3 bytes;
  • the other errors are always for a single byte.

Therefore your initializers could reject other combinations of Kind and Range<Int>:

public init(_ kind: Kind, byteOffsets: Range<Int>) {
  // Offsets index into a span, so they can never be negative.
  _precondition(byteOffsets.lowerBound >= 0)
  if kind == .truncatedScalar {
    // A truncated scalar spans 1–3 bytes.
    _precondition(!byteOffsets.isEmpty)
    _precondition(byteOffsets.count < 4)
  } else {
    // Every other error kind describes exactly one byte.
    _precondition(byteOffsets.count == 1)
  }
  /* … */
}

If that's so then Kind.init can either be private as you've suggested, or failable by returning nil for unknown raw values.
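
A sketch of that failable alternative; the cutoff of four known kinds is an assumption for illustration, not the proposal's actual raw values:

public struct FailableKind: RawRepresentable, Hashable {
    public var rawValue: UInt8
    public init?(rawValue: UInt8) {
        // Reject unknown raw values up front, instead of trapping later
        // (e.g. in the description property).
        guard rawValue < 4 else { return nil }
        self.rawValue = rawValue
    }
}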


  • C0–C1 are overlongEncodingByte errors (for ASCII) — but only when followed by a continuation byte?

  • F5–F7, F9–FB, and FD–FE are invalidNonSurrogateCodePointByte errors — but only when followed by the expected 3–6 continuation bytes?

  • F8 and FC are either overlongEncodingByte or invalidNonSurrogateCodePointByte errors, depending on the values of the continuation bytes?

  • FF doesn't have a zero bit, unlike the bit patterns of other start bytes.

It might be simpler to have an invalidStartByte error for C0–C1 and F5–FF.
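
For what it's worth, that check is cheap; a one-line sketch (the function name is made up), using the disallowed set quoted above from Table 3-7:

func isDisallowedUTF8Byte(_ byte: UInt8) -> Bool {
    // C0–C1 and F5–FF can never appear anywhere in well-formed UTF-8.
    (0xC0...0xC1).contains(byte) || byte >= 0xF5
}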

I'm very fond of this proposal, and am very happy with the support for Span in there as well. And the utf8Span getter on String will allow everything to interoperate very well.

In the context of parsing, some of my libraries allow users to validate data without making a copy of a value (like a String). This proposal would tie into that perfectly by allowing users to borrow that UTF8 memory region and copy the data out to a String only if they strictly need to.

A definite +1 from me.

Speaking of encoding errors, I wonder if we really want to only provide validation returning such a rich error type.

My experience writing parsers for common interchange formats is that very little production code cares about the details of why an input was invalid, or which particular byte could not be parsed, but the requirement to return that information can impose costs on the invalid path. And when it comes to these kinds of formats, optimising the invalid path is also valuable.

One thing that we could possibly do is return a kind of opaque error type, which contains an index to start searching from. Then we'd offer an API on that type to resolve it to a specific error. The index wouldn't necessarily be exact, but it would be before the error, and save you searching from the very beginning if you do want to resolve it.
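
One possible shape for that idea; every name here is hypothetical, and the resolution below is a stand-in for a real re-scan, not a proposed API:

struct OpaqueUTF8Error: Error {
    // A position at or before the first invalid byte, so resolution can
    // re-scan from here rather than from the start of the input.
    var searchStart: Int

    // Deferred diagnosis: only callers that want details pay this cost.
    func resolved(in bytes: [UInt8]) -> (byteOffset: Int, description: String)? {
        for i in searchStart..<bytes.count {
            // Stand-in check for bytes that can never occur in UTF-8; a real
            // implementation would run the full validation state machine.
            if (0xC0...0xC1).contains(bytes[i]) || bytes[i] >= 0xF5 {
                return (byteOffset: i, description: "disallowed byte")
            }
        }
        return nil
    }
}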

I'm not familiar enough with the fastest UTF8 validation algorithms to know whether this is a significant limitation in practice. I took a quick glance at Lemire's SIMD UTF8 validation paper, and it appears they have a single error bit to represent invalid scalars and some overlong encodings (see table 8, on page 10). That's the kind of thing I mean - in order to return a specific error, we would need an extra step which disambiguates that (and to scan through the SIMD buffer to check the specific byte at which it occurred), even though almost no consumers care about that information.

2 Likes

Once you're on an error path, you generally don't care too much about performance. As long as you can keep the "the entire buffer is valid" and "give me the longest valid prefix" paths efficient, any small cost involved in reporting a good error is generally worth paying.

For something like validating UTF8, people expect it to be lightning fast, so I would argue you always care about performance. Even on the error path.

At the same time, that "good error" isn't necessarily a "good" error. How many times have you ever seen code that cared whether validation failed because of an overlong encoding vs an invalid scalar? Even among Unicode nerds, that's gotta be super rare.

The presence of unpaired surrogates is sometimes mildly interesting, but there's generally not much you can do to handle them (preserve them by treating the input as "WTF8", or substitute a replacement character... which is the same as you'd do for anything).

Having a bunch of information nobody cares about isn't really "good", in terms of having a good design.

1 Like

Realistically, we are talking about maybe a few tens of extra cycles (a very conservative estimate; realistically it can be less) to produce a "good" error, which is drowned out by the rest of the code on an error path. It just isn't a big issue. Do you always need that information? Do you know a priori when you're going to need it? Not usually.

You're making quite a lot of assumptions about this error path. I used to think that way as well, but my eyes were opened when I tried to integrate my URL parser into Swift's Regex engine.

Maybe you would think that attempting to parse an invalid URL string is such a bad mistake, and that this type is so high-level, that it deserves a rich error type telling you exactly which part of the string was invalid and in what way. It's very helpful in logs. If you're coming at it from the perspective of a GUI application developer, that's easily worth the cost.

But then Regex comes along. The way regexes work (and, in particular, the way patterns compose within a regex) involves testing very large inputs against the pattern and shrinking them down one character at a time until the engine finds a solution to the overall pattern. Invalid inputs are part of its normal operation. When you compose a URL parser into a regex, you might attempt to parse hundreds of invalid inputs before you finally get a match - if you bother to construct loggable, actionable errors for every one of those failures, you're wasting time and burning energy. And this can add up. In the case of URLs, people use these things in web crawlers and run these parsers over the entire internet. An efficient error path can be just as important as the success path, and the only reason I initially thought otherwise is that I didn't understand enough about all of the potential use-cases.

IIRC, I've seen gains in the 8% range by eliminating that rich error information. And when you start looking at the implementations of some of these parsers and the tricks they use to squeeze every bit out of every instruction, it seems weird to leave that kind of performance on the table in one execution path.

Now, it may be that for UTF8 specifically, "give me the longest valid prefix" can cover that. I haven't explored all of the use-cases enough to make such a statement. Is there a Regex-style use-case that repeatedly attempts to parse invalid content but doesn't care about the details of the failure (as most things don't)? I don't know. In any event, I couldn't find such an operation in this proposal. If it exists, could you point it out?

3 Likes

I would say that if you're repeatedly trying to parse invalid content, you're doing something wrong--repeatedly testing inputs and shrinking until valid is a catastrophic anti-pattern. I'm sure that it's necessary sometimes, but we'd really like to avoid it as much as possible. I think that it's quite likely that we're missing API to help developers avoid doing so, but the solution to that isn't to make the bad pattern faster, it's to add the API that lets you avoid it.

There is a very compelling argument to add a String.init(copying: UTF8Span) that just copies the bytes and can skip validation. There is also a potentially compelling argument, for low-level libraries such as those implementing data structures using custom allocations, for an @unsafe initializer on UTF8Span that skips the validation check.

However, the combination of the two creates a new kind of easily accessible backdoor to String's security and safety, namely the invariant that it holds validly encoded UTF-8 when in native form. I'd like to have more of a discussion around this.

I go back and forth on this. I think the arguments for the inclusion of both are well understood, so here's my best effort at an argument against adding them both.

Currently, String is 100% safe outside of crazy custom subclass shenanigans (only on ObjC platforms) or arbitrarily scribbling over memory (which is true of all of Swift). Both are highly visible and require writing many lines of advanced-knowledge code.

Without these two APIs, it is in theory possible to skip validation and produce a String instance of the indirect contiguous UTF-8 flavor through a custom subclass of NSString. But it is only available on Obj-C platforms and involves creating a custom subclass of NSString, having knowledge of lazy bridging internals (which can and sometimes do change from release to release of Swift), and writing very specialized code. The product would be an unsafe lazily bridged instance of String, whose lazy bridging overhead could more than offset any performance gains from the workaround itself.

While this gets into the philosophical debate around @unsafe itself, I think there should be some thought before adding new backdoors that are easy and relatively attractive to use in a rush. It's nice to know that even in a sketchy software code base, where other engineers are doing expedient unsafe workarounds, if you take a String it will be safe outside extreme circumstances (namely, a very unexpedient and difficult workaround that might perform worse anyways).

With these two APIs, you can get to UB via:

// Claim validity without checking it (the unsafe step):
let codeUnits = unsafe UTF8Span(unsafeAssumingValidUTF8: bytes)
...
// ...and later, far from any unsafe annotation, mint a "validated" String:
String(copying: codeUnits)

That is, you could get an unsafe instance of String that then gets passed around, ARC-copied, etc., throughout the system, evading testing and only manifesting in production. Furthermore, this String instance is separated in time and space from the actual unsafe annotation.

While this kind of situation is somewhat inherent to taint propagation via @unsafe, it is a new one for String and much more severe and egregious than what we've seen before.

5 Likes

I just want to point out that there is a third way to create a string without validating the bytes, namely by writing a custom struct with compatible layout and using unsafeBitCast to convert it to String.

(Unless that’s what you meant by “arbitrarily scribbling over memory”.)

I think we can consider that in the same bucket as scribbling over memory, sure =)

1 Like

Hi everyone,

Toward the end of the review period for this proposal, the Language Steering Group and the authors discussed some relatively minor changes to the proposed APIs based on feedback during the review. Instead of going through an entirely new review cycle, we've chosen to extend this review for one week (until April 2) so that the community may provide any additional feedback that they wish on these changes:

  • An @unsafe initializer has been added to UTF8Span to create an instance from an externally-known-to-be-valid Span of UTF8 code units.
  • A String initializer has been added to create an instance from a UTF8Span.
  • UTF8Span.isCanonicallyLessThan(_:) has been renamed to canonicallyPrecedes(_:).
  • The pattern match operator ~= has been removed from UTF8Span.

The proposal text has been updated to include these changes, and the diff can be viewed here.

Thanks,

—Tony Allevato
Review manager

Thanks, Tony. I realized I forgot to post instructions on where to find a toolchain and how to use one. I'll be keeping the initial comment of the PR up-to-date with instructions and improvements. Current toolchain instructions:

Current toolchain: https://ci.swift.org/job/swift-PR-toolchain-macos/1865/artifact/branch-main/swift-PR-78531-1865-osx.tar.gz

Usage

Before unpacking the tarball, clear attributes via xattr -c swift-PR-78531-*.tar

Use DYLD_LIBRARY_PATH and the bundled swift executable via

DYLD_LIBRARY_PATH=~/Downloads/Library/Developer/Toolchains/swift-PR-78531-xxxx.xctoolchain/usr/lib/swift/macosx  ~/Downloads/Library/Developer/Toolchains/swift-PR-78531-xxxx.xctoolchain/usr/bin/swift run

Known issues