SE-0464: UTF8Span: Safe UTF-8 Processing Over Contiguous Bytes

There is a very compelling argument to add a String.init(copying: UTF8Span) that just copies the bytes and can skip validation. There is also a potentially-compelling argument, for low-level libraries such as those implementing data structures using custom allocations, for an @unsafe init to UTF8Span that skips the validation check.

However, the combination of the two creates a new kind of easily accessible backdoor to String's security and safety, namely the invariant that it holds validly encoded UTF-8 when in native form. I'd like to have more of a discussion around this.

I go back and forth on this. I think the arguments for the inclusion of both are well understood, so here's my best effort at an argument against adding them both.

Currently, String is 100% safe outside of crazy custom subclass shenanigans (only on ObjC platforms) or arbitrarily scribbling over memory (which is true of all of Swift). Both are highly visible and require writing many lines of advanced-knowledge code.
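As a small illustration of that invariant using initializers that already ship today (not the two proposed APIs), here is a minimal sketch. It assumes a Swift 6.0 or later toolchain for `String(validating:as:)`; the byte values are made up for the example. Invalid input is either repaired or rejected, never stored as-is:

```swift
// 0xFF can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x61, 0xFF, 0x62]  // "a", invalid byte, "b"

// Repairing initializer: the invalid byte is replaced with U+FFFD.
let repaired = String(decoding: bytes, as: UTF8.self)

// Validating initializer (assumes Swift 6.0+): invalid input yields nil.
let validated = String(validating: bytes, as: UTF8.self)

print(repaired == "a\u{FFFD}b")
print(validated == nil)
```

Either way, the resulting String (if there is one) holds valid UTF-8, which is the guarantee under discussion.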

Without these two APIs, it is in theory possible to skip validation and produce a String instance of the indirect contiguous UTF-8 flavor through a custom subclass of NSString. But that route is only available on Obj-C platforms, requires knowledge of lazy bridging internals (which can and sometimes do change from release to release of Swift), and involves writing very specialized code. The product would be an unsafe lazily bridged instance of String, whose overhead could more than offset any performance gains from the workaround itself.

While this gets into the philosophical debate around @unsafe itself, I think there should be some thought before adding new backdoors that are easy and relatively attractive to use in a rush. It's nice to know that even in a sketchy software code base, where other engineers are doing expedient unsafe workarounds, if you take a String it will be safe outside extreme circumstances (namely, a very unexpedient and difficult workaround that might perform worse anyways).

With these two APIs, you can get to UB via:

let codeUnits = unsafe UTF8Span(unsafeAssumingValidUTF8: bytes)
...
String(copying: codeUnits) 

That is, you could get an unsafe instance of a String that then gets passed around, ARC copied, etc., throughout the system, and the resulting misbehavior can evade testing and only show up in production. Furthermore, this String instance is separated in time and space from the actual unsafe annotation.

While this kind of situation is somewhat inherent to taint propagation via @unsafe, it is a new one for String and much more severe and egregious than what we've seen before.
