[Accepted with modifications] SE-0405: String Initializers with Encoding Validation

Hi folks,

The review of SE-0405: String Initializers with Encoding Validation ended on August 29. The Language Steering Group has decided to accept the proposal with modifications to the validatingAsUTF8: overloads.

Review feedback was light overall but in favor of adding initializers for string validation. The Language Steering Group believes that init?(validating:as:) taking a Sequence of code units is the right shape for this API in general.
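For readers who want to see the shape in use, here is a minimal sketch. It assumes a toolchain and standard library that already ship SE-0405 (Swift 6 or later, to my knowledge); the variable names are only illustrative.

```swift
// Assumes a standard library that includes SE-0405 (Swift 6 or later).

// "Café" as well-formed UTF-8 code units.
let cafeBytes: [UInt8] = [0x43, 0x61, 0x66, 0xC3, 0xA9]
let validUTF8 = String(validating: cafeBytes, as: UTF8.self)

// The same bytes with the final continuation byte dropped: ill-formed UTF-8.
let truncatedBytes: [UInt8] = [0x43, 0x61, 0x66, 0xC3]
let invalidUTF8 = String(validating: truncatedBytes, as: UTF8.self)

// The same initializer works for other encodings, e.g. UTF-16.
let utf16Units: [UInt16] = [0x0048, 0x0069] // "Hi"
let validUTF16 = String(validating: utf16Units, as: UTF16.self)

// A lone high surrogate is not valid UTF-16.
let loneSurrogate: [UInt16] = [0xD800]
let invalidUTF16 = String(validating: loneSurrogate, as: UTF16.self)

print(validUTF8 as Any, invalidUTF8 as Any, validUTF16 as Any, invalidUTF16 as Any)
```

Unlike String(decoding:as:), which repairs ill-formed input with replacement characters, the validating initializer returns nil, so the caller decides how to handle bad data.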

More discussion centered around the two proposed init?(validatingAsUTF8:) overloads, with some reviewers asking whether they were necessary given the proposed more general API or whether the choice of parameter type was the most useful. The proposal argued that because UTF-8 is "the most common case", a UTF-8-specific overload would be convenient and discoverable.

The Language Steering Group discussed these overloads and we feel that convenience and discoverability alone are not enough of a motivating factor to add separate, less general overloads. The key question we asked ourselves was, "if the tooling were improved to make UTF-8 usage of the general API more discoverable, would we still want the less general API?" and we agreed that the answer to this question was that we would not. Therefore, we believe that this is an area that tooling can and should improve. For APIs like init?(validating:as:) in particular that take a small set of protocol-conforming metatypes, we would like to see autocomplete have better support for filling in the possible choices as the user writes code.

Regarding the other proposed init?(validatingAsUTF8:) overload intended mainly for C interop, the Language Steering Group believes that an overload init?(validating: some Sequence<Int8>, as: UTF8.self) aligns nicely with the general API above and this pair of APIs portably supports data obtained from C without concern for the underlying type signedness.
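As a rough illustration of the signedness point, the sketch below passes the same bytes once as Int8 and once as UInt8. It assumes a standard library that includes SE-0405 (Swift 6 or later); the names are illustrative only.

```swift
// Assumes a standard library that includes SE-0405 (Swift 6 or later).

// "Café" as signed bytes, the way a C `char` buffer looks on platforms where
// `char` is signed. Bytes above 0x7F are written via their bit patterns.
let signedBytes: [Int8] = [
  0x43, 0x61, 0x66,
  Int8(bitPattern: 0xC3), Int8(bitPattern: 0xA9),
]
let fromSigned = String(validating: signedBytes, as: UTF8.self)

// The same data as unsigned bytes goes through the general initializer.
let unsignedBytes = signedBytes.map { UInt8(bitPattern: $0) }
let fromUnsigned = String(validating: unsignedBytes, as: UTF8.self)

print(fromSigned as Any, fromUnsigned as Any)
```

Because both spellings are the same at the call site, code that holds a sequence of CChar should not need to care whether CChar is Int8 or UInt8 on the current platform.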

The Language Steering Group also agrees—even though the removal of the init?(validatingAsUTF8:) overloads eliminates the name near-collision with the existing init?(validatingUTF8:) initializer—that the latter initializer is still poorly named and that it should be deprecated and replaced by init?(validatingCString:).

In summary, we are accepting the following set of APIs:

extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as: Encoding.Type)

  public init?(
    validating codeUnits: some Sequence<Int8>,
    as: UTF8.Type)

  public init?(
    validatingCString nullTerminatedUTF8: UnsafePointer<CChar>)

  @available(Swift 5.XLIX, deprecated, renamed:"String.init(validatingCString:)")
  public init?(validatingUTF8 cString: UnsafePointer<CChar>)
}
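
A short sketch of how the renamed C-string initializer might be called, assuming a standard library that includes SE-0405 (Swift 6 or later). The buffers here are made up for illustration; real code would typically receive the pointer from a C API.

```swift
// Assumes a standard library that includes SE-0405 (Swift 6 or later).

// A null-terminated C string containing "Hi".
let cGreeting: [CChar] = [0x48, 0x69, 0x00]
let greeting = cGreeting.withUnsafeBufferPointer { buffer in
  // The array is non-empty, so baseAddress is non-nil here.
  String(validatingCString: buffer.baseAddress!)
}

// 0xFF can never appear in well-formed UTF-8, so validation fails.
// `truncatingIfNeeded` works whether CChar is Int8 or UInt8.
let cGarbage: [CChar] = [CChar(truncatingIfNeeded: 0xFF), 0x00]
let garbage = cGarbage.withUnsafeBufferPointer { buffer in
  String(validatingCString: buffer.baseAddress!)
}

print(greeting as Any, garbage as Any)
```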

Thank you to everyone who participated in the pitch and proposal review! Your contributions help make Swift a better language.

—Tony Allevato
Review Manager

Is this the first instance in the stdlib of an argument whose value is forced by the type system but not defaulted by the API author?

An interesting point. The purpose of spelling the API this way is so that users who have an instance of some Sequence<CChar> can use the validating APIs in a platform-agnostic way; it is deliberately meant not to expose any capabilities or spellings not possible with an instance of some Sequence<UInt8>.

You are right that, technically, we could do one better with:

public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Int8>,
    as: Encoding.Type) where Encoding.CodeUnit == UInt8

But I am not sure we gain much here. (Is ASCII.self a valid argument?)
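
To make the constrained signature concrete, here is a sketch written as a free function rather than a String initializer. makeValidatedString is a hypothetical name, not a standard library API, and the body simply forwards to the general init?(validating:as:), so it assumes a standard library that includes SE-0405 (Swift 6 or later).

```swift
// Hypothetical helper, not a standard library API. It accepts signed bytes
// for any encoding whose code unit is UInt8.
func makeValidatedString<Encoding: Unicode.Encoding>(
  signedCodeUnits: some Sequence<Int8>,
  as encoding: Encoding.Type
) -> String? where Encoding.CodeUnit == UInt8 {
  // Reinterpret each signed byte as unsigned, then defer to the general API.
  let unsigned = signedCodeUnits.map { UInt8(bitPattern: $0) }
  return String(validating: unsigned, as: encoding)
}

let okBytes: [Int8] = [0x4F, 0x4B] // "OK"
let viaASCII = makeValidatedString(signedCodeUnits: okBytes, as: Unicode.ASCII.self)
let viaUTF8 = makeValidatedString(signedCodeUnits: okBytes, as: UTF8.self)

// A byte with the high bit set is outside the ASCII range.
let highBit: [Int8] = [-1]
let rejected = makeValidatedString(signedCodeUnits: highBit, as: Unicode.ASCII.self)

print(viaASCII as Any, viaUTF8 as Any, rejected as Any)
```

Under this spelling, Unicode.ASCII.self type-checks as an argument because its CodeUnit is UInt8, while UTF16.self would be rejected at compile time.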

I do think this spelling is a lot less “weird.” The originally accepted spelling can easily cause the reader to question their understanding, since an argument of fixed metatype has no meaningful semantics. The average Swift programmer will have seen functions that operate on Type values and functions that take (always? defaulted) Type arguments to satisfy the compiler’s insistence that all type parameters appear in the function signature; failure to pattern-match this new API would likely cause persistent discomfort in programmers who haven’t followed this evolution process.

I also think your proposed spelling more clearly illustrates that this API exists solely to treat UInt8 and Int8 equivalently. It makes explicit the stdlib’s assumption that all character encodings are unsigned. I think that is a reasonable assumption.

Several of the decode methods are overloaded in this way, e.g.:

public protocol SingleValueDecodingContainer {
    // ...
    func decode(_ type: Bool.Type) throws -> Bool
    func decode(_ type: String.Type) throws -> String
    // ...
}

That said, it does seem like ASCII.self would probably also make sense as an argument…

I agree with Xiaodi, it should be ... where Encoding.CodeUnit == UInt8. That with CChar was in fact my original draft, but the ambiguities derailed that.

Yeah, the ability to support ASCII.self makes me prefer the generic version as well. I can't speak for the rest of the steering group, but I may have been laser-focused on UTF8.self since it was in the context of replacing the validatingAsUTF8: initializer, but where Encoding.CodeUnit == UInt8 would also achieve that while being more general.

is the idea to introduce a new top-level typealias ASCII = Unicode.ASCII? i wouldn’t want that to get in the way of client-defined ASCII abstractions.

That is not being considered here.
