[Pitch] String Input-validating Initializers

A proposal to add new String failable initializers that validate encoded input, and return nil when the input contains any invalid elements.

String initializers with encoding validation

Introduction

We propose adding new String failable initializers that validate encoded input, and return nil when the input contains any invalid elements.

Motivation

The String type guarantees that it represents well-formed Unicode text. When data representing text is received from a file, the network, or some other source, it may be relevant to store it in a String, but that data must be validated first. String already provides a way to transform data to valid Unicode by repairing invalid elements, but such a transformation is often not desirable, especially when dealing with untrusted sources. For example, a JSON decoder cannot transform its input; it must fail if a span representing text contains any invalid UTF-8.

This functionality has not been available directly from the standard library. It is possible to compose it using existing public API, but only at the cost of extra memory copies and allocations. The standard library is uniquely positioned to implement this functionality in a performant way.

Proposed Solution

We will add a new String initializer that can fail, returning nil, when its input is found to be invalid according to the encoding represented by a type parameter that conforms to Unicode.Encoding.

extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>, as: Encoding.Type
  )
}
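
For illustration, the proposed initializer would be used along these lines (the result comments assume the proposed behaviour; this API does not exist yet):

let valid: [UInt8] = [0xF0, 0x9F, 0x94, 0x8A]   // UTF-8 encoding of "🔊"
let invalid: [UInt8] = [0xC0, 0x80]             // overlong encoding of NUL; never valid UTF-8

String(validating: valid, as: UTF8.self)        // "🔊"
String(validating: invalid, as: UTF8.self)      // nil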

For convenience and discoverability, we will also provide initializers that specify the input encoding as part of the argument label:

extension String {
  public init?(validatingFromUTF8 codeUnits: some Sequence<UTF8.CodeUnit>)

  public init?(validatingFromUTF16 codeUnits: some Sequence<UTF16.CodeUnit>)

  public init?(validatingFromUTF32 codeUnits: some Sequence<UTF32.CodeUnit>)
}

These will construct a new String, returning nil when their input is found to be invalid according to the encoding specified by the label.
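
For illustration, a sketch of these convenience spellings (result comments assume the proposed behaviour):

let wellFormed: [UInt16] = [0xD83D, 0xDE00]     // a valid surrogate pair ("😀")
let loneSurrogate: [UInt16] = [0xD800]          // an unpaired high surrogate

String(validatingFromUTF16: wellFormed)         // "😀"
String(validatingFromUTF16: loneSurrogate)      // nil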

When handling data obtained from C, it is frequently the case that UTF-8 data is represented by CChar rather than UInt8. We will provide a convenience initializer for this use case, noting that such data typically resides in contiguous memory, and is therefore well served by an explicit abstraction for contiguous memory (UnsafeBufferPointer<CChar>):

extension String {
  public init?(validatingFromUTF8 codeUnits: UnsafeBufferPointer<CChar>)
}
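
A sketch of how this might look when bridging a non-null-terminated buffer received from C (the initializer shown is proposed, not existing API):

let cBytes: [CChar] = [104, 101, 108, 108, 111]  // "hello"
let greeting = cBytes.withUnsafeBufferPointer { buffer in
  String(validatingFromUTF8: buffer)             // "hello"
}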

String already features a validating initializer for UTF-8 input. It is intended for C interoperability, but its argument label does not convey the expectation that its input is a null-terminated C string. We propose to rename it in order to clarify this:

extension String {
  public init?(validatingCString nullTerminatedUTF8: UnsafePointer<CChar>)

  @available(Swift 5.XLIX, deprecated, renamed: "String.init(validatingCString:)")
  public init?(validatingUTF8 cString: UnsafePointer<CChar>)
}
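
For illustration, the same call before and after the proposed rename (validatingCString is the proposed spelling, not existing API):

let cString: [CChar] = [104, 105, 0]             // "hi" followed by NUL
let before = String(validatingUTF8: cString)     // existing spelling, to be deprecated
let after = String(validatingCString: cString)   // proposed spelling; both yield "hi"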

Detailed Design

Please see the gist for details.

Source Compatibility

This proposal is strictly additive.

ABI Compatibility

This proposal adds new functions to the ABI.

Implications on adoption

This feature requires a new version of the standard library.

Alternatives considered

The validatingUTF8 argument label

The argument label validatingUTF8 seems like it would have been preferable to validatingFromUTF8, but using the former would have been source-breaking. The C string validation initializer takes an UnsafePointer<UInt8>, but an [UInt8] can also be passed to it via implicit pointer conversion. Any use site that passes an [UInt8] to the C string validation initializer would have changed behaviour upon recompilation, going from treating a null character (\0) as the terminator of the C string to treating it as a valid character.
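
To illustrate the behaviour difference at stake (a sketch: the first call uses the existing C string initializer; the alternative behaviour is hypothetical):

let bytes: [CChar] = [104, 105, 0, 33]           // "hi", a NUL, then "!"

// Today this resolves to the C string initializer via the implicit
// array-to-pointer conversion, which stops at the NUL:
String(validatingUTF8: bytes)                    // "hi"

// If the validatingUTF8 label were reused for the new Sequence-based
// initializer, such a call site would instead validate every element and
// treat \0 as an ordinary, valid character, producing "hi\0!".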

Have the CChar-validating function take a parameter of type some Sequence<CChar>

This would produce a compile-time ambiguity on platforms where CChar is typealiased to UInt8 rather than Int8. Using UnsafeBufferPointer<CChar> as the parameter type will avoid such a compile-time ambiguity.

Acknowledgements

Thanks to Michael Ilseman, Tina Liu and Quinn Quinn for discussions about input validation issues.

SE-0027 by Zachary Waldowski was reviewed in February 2016, covering similar ground. It was rejected at the time because the design of String had not been finalized. The name String.init(validatingCString:) was included in that proposal. Lily Ballard later pitched a renaming of String.init(validatingUTF8:), citing consistency with other String API involving C strings.

10 Likes

I would prefer the argument label validatingAsUTF8: etc., to match validating: …, as: Unicode.UTF8 (assuming I got that syntax correct).

2 Likes

Fair point. I used "from" because the general case involves transcoding (to the internal storage encoding, UTF-8).

I would love to see this throw an error instead of blindly returning nil. If I'm dealing with a byte stream, having an error I could potentially use to diagnose why a string isn't validating would be very helpful.

12 Likes

I will add throwing an error in future directions. The current validation apparatus is not equipped to identify the locations of errors, and additional work is required to make that happen.

1 Like

I would be interested to see how much work is required to achieve this. It might be worth doing now.

8 Likes

Throwing an error here would maybe also allow for the use of the validatingUTF8: label since existing uses would then become errors. Unless, I suppose, they happen to already appear in an expression marked with try…

The minimum is an error that contains the first index after the last successfully decoded character. I don't feel confident that minimum is sufficient, but I'll soldier on.

The throwing case would require the input to be a Collection, however. It doesn't have all the functionality of failable initializers that take Sequence, so I believe we would still need the initializers proposed here.

The existing (internal) UTF-8 validation API could straightforwardly be modified. The UTF-16 and UTF-32 cases, however, would involve creating an alternate transcoding method that makes the input index available to its client code. It would be better to make that public API as well. That is a large enough change that I would prefer to make all of it a separate proposal!
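
For concreteness, a minimal sketch of what such an error might carry (hypothetical; not part of this pitch):

// Hypothetical error for a throwing, Collection-based validating initializer,
// carrying only the minimum discussed above.
struct ValidationError<Index>: Error {
  // The first position after the last successfully decoded character.
  let firstInvalidIndex: Index
}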

3 Likes

That's reasonable. I guess my question really boils down to this: in the hypothetical future where we have these, do we want to also have the failable version?

What can you actually do when the init from sequence fails, though? You're just done.

5 Likes

What I failed to say clearly earlier is that since the throwing version seems to require the input to be Collection rather than Sequence, it isn't strictly more general than the failable initializers proposed here. Given that, I think there is a justification for both kinds of initializers.

To include the failing index in the error? It could still have some context, like the last 50 elements or so, and the reason for failing. It could also carry the tail of the sequence, which could be useful, e.g., to resume scanning the string. If a real index is available it could be included; if there is no real index, an "element count" could be included instead (where the count starts from 0 and increments for every element in the sequence).

To include a lot of context as you suggest requires either
a) that the input be a Collection or even BidirectionalCollection, so that we can reconstruct the context when an error is encountered, or
b) that we allocate memory to keep a copy of recent data.

The latter seems terribly inefficient! We want validation to be efficient, and keeping context would involve lots of extra work even on the happy path. It's okay to do some extra work on the error path.

Including an offset from the beginning of a Sequence seems like a workaround for not having required a Collection. If the input Sequence really isn't replayable, then an offset would be useless.

So that's why I think the throwing initializer needs to be paired with Collection to be useful.

On the nature of the error: what is the need to include more than a single index (or offset)? Does this require an excessive amount of extra work? It would be nice to not have to expose distinct validateReportingErrors() and validate() functions for each encoding because keeping track of data for error reporting ends up costing too much.

Would it help to consider the alternative of a static factory or converting function?

Initializers are a highly-contended space where API design consistency necessitates a fairly broad set of considerations, particularly when inlined.

Tailor-made static factories can themselves be distinguished by target (encoding) and behavior (throwing). That could support e.g., rolling out available UTF8 support now and others later, or using different error-handling for different encodings.

Both initializers and static String factories mean that String has to know about all possible/relevant input sources, instead of those sources knowing about a relatively consistent String.

A converter function from Sequence<Encoding.CodeUnit> to String could be specialized by code-unit type, so instead of loading String with N unit-type initializers, each unit type converts itself to String. Converter implementations might be in a better position to offer partial results, recovery, and error specifics without undue copying. Using a common name like toString would aid discoverability.
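
A rough sketch of that shape, expressed here in terms of the initializer proposed in this pitch (the name toString and the extension are hypothetical):

extension Sequence where Element == UTF8.CodeUnit {
  // Hypothetical converter function; returns nil if the bytes are not valid UTF-8.
  func toString() -> String? {
    String(validating: self, as: UTF8.self)
  }
}

let text = Array("abc".utf8).toString()          // "abc"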

If needed, experience with the static factories or converters can inform decisions around initializers, and tooling can be built to migrate the initial uses of static factories to the eventually-adopted initializer.

It's clearly a fall-back that shouldn't inhibit substantive discussion on initializers, but it might help sequence the implementations and avoid latecomers delaying the early, ready ones.

Functions that return Optional have never been the right choice since Swift 2.0. It puts the onus of error information propagation on someone else, forever taking up more of other people's time. If they'd like to transform a failure into their own error, that's fine, but it should never be required.

This is not so: the rationale for the role of Optional in error handling is detailed in Error Handling Rationale:

Simple domain errors

A simple domain error is something like calling String.toInt() on a string that isn't an integer. The operation has an obvious precondition about its arguments, but it's useful to be able to pass other values to test whether they're okay. The client will often handle the error immediately.

Conditions like this are best modeled with an optional return value. They don't benefit from a more complex error-handling model, and using one would make common code unnecessarily awkward. For example, speculatively trying to parse a String as an integer in Java requires catching an exception, which is far more syntactically heavyweight (and inefficient without optimization).

An initializer that validates UTF-8 code units is an example of an API that has an obvious precondition and serves the role of testing whether the argument meets that condition.

10 Likes

IMO, it isn't necessary to throw a more detailed error.

We already have String(decoding:as:), which will repair malformed data by inserting replacement characters. This initialiser will allow those with known-valid text to assert that replacements did not occur.
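
For example (the validating line assumes the proposed behaviour):

let bytes: [UInt8] = [0x61, 0x62, 0xFF, 0x63]    // 0xFF never occurs in valid UTF-8

String(decoding: bytes, as: UTF8.self)           // "ab\u{FFFD}c" — repaired
String(validating: bytes, as: UTF8.self)         // nil — proposed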

There is only one reason: because the byte stream did not contain valid text.

Throwing a more detailed error than that is even less useful than having Int.init?(String) throw an error, IMO. Any sort of partial decoding and custom repairing that you want to do is a sufficiently advanced use-case that you should use things like UnicodeCodec directly and build up a buffer of scalars.


Speaking of which, we could really do with a String initialiser which accepts a sequence/collection of Unicode scalars. It is effectively the same as UTF32, but the UTF32.CodeUnit is UInt32, not Unicode.Scalar. I ended up having to do some exceptionally ugly stuff to implement this efficiently (the standard library does something similar).

Please let me remove that code! I really, really, really want to remove it!
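
For reference, one way to do this today via the Unicode scalar view (straightforward, though not necessarily as efficient as a dedicated initializer would be):

let scalars: [Unicode.Scalar] = ["H", "é", "🙂"]
var view = String.UnicodeScalarView()
view.append(contentsOf: scalars)
let s = String(view)                             // "Hé🙂"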


While I appreciate the intention, I do not think these conveniences are necessary. We would need to add similar repairing initialisers for symmetry, which adds up to a significant amount of API surface, and these spellings are not actually shorter than the alternative.

String(decoding: x, as: UTF8.self) can be a bit annoying to type in a debugger, but spellings like this also aren't ideal.

6 Likes

I think people understand that, but being able to see where the invalid byte occurred and what was there instead can be useful for understanding what was wrong with the stream. Errors aren't only used to recover within the program but to allow investigation into why the failure occurred in the first place.

1 Like

Yeah but that same logic applies to Int.init?(String). Maybe it encountered a non-numeric character, or maybe it overflowed. There are even more interesting failure conditions than with String, but we don't bother saying which one occurred. I don't think that debugging corrupt data streams is a motivation to add String initialisers.

Firstly, a corrupted stream will not always result in invalidly-encoded data. Strings do not include checksums or parity bits.

Secondly, if you need fault tolerance, we offer repairing invalid streams (and today that's actually all we offer, so it's available everywhere). If you want to identify bits that were repaired, look for Unicode replacement characters (U+FFFD).
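
For example (a sketch; note this cannot distinguish a replacement character that was already present in the input):

let bytes: [UInt8] = [0x61, 0xFF, 0x62]                         // 0xFF is invalid UTF-8
let repaired = String(decoding: bytes, as: UTF8.self)           // "a\u{FFFD}b"
let wasRepaired = repaired.unicodeScalars.contains("\u{FFFD}")  // true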

If you need even more detail than that, the standard library also provides Unicode decoders which you can invoke directly.

I'm just talking about producing errors, not corruption or fault tolerance. That other APIs don't return errors isn't an argument for new API to not return errors.

1 Like

Yes, precisely.

In the past I've written file format parsers (primarily CSV, but others as well) that attempt to guess the encoding of a file if it's not supplied by the calling code. This involves complicated logic that looks at byte order marks and the first ~100 bytes to make an educated guess about what encoding the file is in. Under the hood, I was doing this by repeatedly trying to create strings with a specific encoding.

In these cases, all I'd get back was nil, which meant hours upon hours of guess-and-check work to try to come up with the right heuristics about the file format. If, instead, I could get back an error saying "At byte offset 42, there was a byte that should've been part of a surrogate pair but was the wrong value…", that would've been extremely helpful in narrowing down better attempts to deal with unknown-encoding string data.

4 Likes