[Pitch] String Input-validating Initializers

Sorry, I guess I didn't make it clear.

Instead of the function signature you're proposing:

extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as: Encoding.Type
  )
}

We would change the second argument to be an instance of Encoding rather than a type:

extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as: Encoding // 👈 instance, not a type
  )
}

This would allow the use of static member syntax.

Additionally, I explained why I believe it must be trivial to construct an instance of any Unicode.Encoding-conforming type. Basically, the protocol only has static members, meaning there is no instance state relevant to the protocol conformance; any user-defined encodings will almost certainly be empty types. So everybody should be able to adopt this.
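
For illustration, here is a minimal sketch of the idea, written against the existing String(decoding:as:) initializer rather than the pitched one. The `utf8` static member and the instance-taking overload are assumptions made up for this example; neither is standard library API.

```swift
// Hypothetical static member (not in the standard library). The `Self ==`
// constraint is what lets `.utf8` be found by implicit member lookup.
extension Unicode.Encoding where Self == Unicode.UTF8 {
  static var utf8: Unicode.UTF8 { Unicode.UTF8() }
}

extension String {
  // Hypothetical overload taking an encoding *instance* instead of a metatype.
  // The instance carries no state; only its type is used.
  init<C: Collection, Encoding: Unicode.Encoding>(
    decoding codeUnits: C,
    as encoding: Encoding
  ) where C.Element == Encoding.CodeUnit {
    self.init(decoding: codeUnits, as: Encoding.self)
  }
}

let bytes: [UInt8] = [0x68, 0x69]
let greeting = String(decoding: bytes, as: .utf8)  // instead of `as: UTF8.self`
print(greeting)
```

The call site reads `.utf8` rather than `UTF8.self`, which is the part that code completion can offer.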


I wouldn't assign a very high value to using Sequence over Collection for these inits. Much of Sequence outside of Collection is a historical accident that we didn't have time to fix up before ABI stability. It's also asymmetric with the other initializers we have on String. It's nice when you get it for free, though.

Another side-benefit of constraining over Collection is that I believe all of this can be always-emit-into-client (https://github.com/apple/swift/blob/9e73dad31110312356469364ba314603d8cff4b7/stdlib/public/core/StringCreate.swift#L214).


Agreed.

There's a balance here where on the one hand we want to allow for smaller pitches to the stdlib and on the other we want to make sure the total API is fleshed out.

String(utf8: myBytes) is long-overdue as creating a String from UTF-8 needs better discoverability. If we're adding input-validating convenience overloads, they will show up in code completion and it's possible that common usage would be to use them with a force-unwrap at the end, when I think it would be better for common usage to do input-correction.

This also very quickly opens the door to discussing input-normalizing API.

  1. decoding:as: preserves exact scalar values and does input-correction
  2. validating:as: preserves exact scalar values and does failable input-validation
  3. normalizing:as: normalizes/canonicalizes scalar values, speeding up comparisons and searches, and does input-correction

And a question for String(utf8: myBytes) is whether it has the semantics of 1 or 3.
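
A rough sketch of how semantics 1 and 2 differ on the same malformed input, using only API that exists today. The `validatedUTF8` helper is hypothetical, and its round-trip comparison is just a stand-in for real validation. The last lines show what "preserves exact scalar values" means: decoding keeps a decomposed "é" as two scalars, even though String comparison already treats it as equal to the precomposed form.

```swift
let malformed: [UInt8] = [0x61, 0xFF, 0x62]  // 0xFF never appears in valid UTF-8

// 1. Input-correction: the invalid byte is replaced with U+FFFD.
let corrected = String(decoding: malformed, as: UTF8.self)

// 2. Input-validation (hypothetical helper): fail instead of repairing.
// Valid UTF-8 round-trips unchanged, so any difference means a repair happened.
func validatedUTF8(_ bytes: [UInt8]) -> String? {
  let decoded = String(decoding: bytes, as: UTF8.self)
  return decoded.utf8.elementsEqual(bytes) ? decoded : nil
}

// Exact scalars are preserved by decoding: "e" followed by U+0301 stays two scalars.
let decomposed = String(decoding: [0x65, 0xCC, 0x81] as [UInt8], as: UTF8.self)
let precomposed = "\u{E9}"

print(corrected.unicodeScalars.map { $0.value })
print(validatedUTF8(malformed) as Any)
print(decomposed == precomposed, decomposed.unicodeScalars.count)
```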


This is pitching initializers that, in the process of creating a String, fail if the input is invalidly encoded.

A separate and greatly needed batch of functionality is better Unicode processing, validating, and correcting APIs over code units (particularly when contiguous in memory). Briefly (using straw-person names; see "[Roadmap] Language support for BufferView"):

  1. struct BufferView: ~Escapable based functionality and ABI
  2. protocol BufferViewable based inlinable (or always-emit-into-client) API
  3. Chunking API over [Async]Sequence<CodeUnit> (i.e. create a moving window of in-memory code units and handle truncation)

Validation can throw specific errors in terms of e.g. the nth code unit in the input.

I believe this would also be more appropriate for this kind of use than trying to create a String, especially if the input is incomplete or very large.
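
As a sketch of what "errors in terms of the nth code unit" could look like without building a String at all: the error type and function name below are made up for illustration, and the implementation leans on the standard library's Unicode.UTF8.ForwardParser rather than anything from the BufferView roadmap.

```swift
// Hypothetical error type: reports where the first invalid sequence starts.
struct UTF8ValidationError: Error, Equatable {
  let offset: Int
}

// Hypothetical validator: walks the input scalar by scalar and throws at the
// first ill-formed sequence. No String is created.
func validateUTF8(_ bytes: some Sequence<UInt8>) throws {
  var parser = Unicode.UTF8.ForwardParser()
  var input = bytes.makeIterator()
  var offset = 0
  while true {
    switch parser.parseScalar(from: &input) {
    case .valid(let encodedScalar):
      offset += encodedScalar.count  // number of code units in this scalar
    case .error:
      throw UTF8ValidationError(offset: offset)
    case .emptyInput:
      return
    }
  }
}

do {
  try validateUTF8([0x61, 0xFF, 0x62] as [UInt8])
} catch let error as UTF8ValidationError {
  print("invalid UTF-8 at code unit", error.offset)
}
```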

Relatedly, there's also the decodeCString static functions which give you more precise info:

@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: UnsafePointer<Encoding.CodeUnit>?,
  as encoding: Encoding.Type, 
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)? 

@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: [Encoding.CodeUnit],
  as encoding: Encoding.Type,
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)? where Encoding : _UnicodeEncoding

It's possible we could in the future have something like a static func decodeBytes(_: some BufferViewable) throws -> String which gives you much more error information as well as creating the String.
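
For reference, this is roughly how the existing array-based decodeCString quoted above behaves on a null-terminated buffer containing one invalid byte. Only existing API is used here; the input bytes are made up for the example.

```swift
let cBytes: [UInt8] = [0x61, 0xFF, 0x62, 0x00]  // null-terminated, 0xFF is invalid

// Default behaviour: repair, and report that a repair was made.
let repaired = String.decodeCString(cBytes, as: UTF8.self)

// Strict behaviour: return nil instead of repairing.
let strict = String.decodeCString(
  cBytes, as: UTF8.self, repairingInvalidCodeUnits: false)

if let repaired {
  print(repaired.result.unicodeScalars.count, repaired.repairsMade)
}
print(strict == nil)
```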


Where are these static members defined, and how does auto-complete find them?

I gave a full example in the post above for the existing String(decoding:as:) initialiser.

I can copy/paste that in to an Xcode project and it works, and has autocomplete. Does it not work for you?

Ah, I see. I'd missed a part of your post (user error) and misconstrued the example as hypothetical.

Here is a draft implementation of this proposal as a staged package:

Note that up to this point the (all-or-nothing) validation code has not been updated. Making it better in a variety of ways is follow-up work to this proposal. A final implementation in the standard library would be slightly different, but functionally the same.

I think that throwing an error with no information other than "we failed" is functionally identical to optional (you can even use try? to make it an optional) and it means that the future direction is simply adding more cases to the error with more fine-grained information. Is there a reason that ending up with a throwing and an optional initializer is preferable?
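
To make the comparison concrete, here is a small sketch. The error type and the `validatedString(utf8:)` function are hypothetical stand-ins for a throwing initializer, and the round-trip check is only a placeholder for real validation.

```swift
// Hypothetical error with a single case today; finer-grained cases could be
// added later without changing the function's signature.
enum StringValidationError: Error {
  case invalidEncoding
}

// Hypothetical throwing counterpart of a failable validating initializer.
func validatedString(utf8 bytes: [UInt8]) throws -> String {
  let decoded = String(decoding: bytes, as: UTF8.self)
  guard decoded.utf8.elementsEqual(bytes) else {
    throw StringValidationError.invalidEncoding
  }
  return decoded
}

// Callers who only want an Optional write `try?`.
let rejected: String? = try? validatedString(utf8: [0xFF])
let accepted: String? = try? validatedString(utf8: [0x68, 0x69])
print(rejected as Any, accepted as Any)
```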


Interestingly I do not see returning Result<String, Error> being discussed as an alternative. FWIW it is somewhat more "round trip compatible" with throw compared to optional. Has Result gone out of vogue?

Result was never really in vogue, especially for the core team. It was a type of necessity added to the standard library both as a hedge against the time it would take for true async support and for the niche functionality it could provide immediately (typed error propagation), and because of its resulting popularity in the community due to those uses. While I've used it extensively in the past for complex typed-error flows and more modern promise-like behavior with async, its overall ergonomics don't recommend it for normal error production. While its typed errors can be beneficial in some circumstances, most users don't care about that most of the time. And really, since Result<String, Error> is just Result { try stringProducer() }, there's no benefit to producing it directly at all. Additionally, the same performance issues that arise with throws arise with Result, and may be worse given the slightly more optimal runtime behavior of throws, so it doesn't gain us anything here.
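
A short sketch of that round trip. `stringProducer` and `InvalidInput` are placeholders invented for the example; the ASCII-only check is a stand-in for whatever validation the real function would do.

```swift
struct InvalidInput: Error {}

// Placeholder throwing function: accepts ASCII only, for brevity.
func stringProducer(_ bytes: [UInt8]) throws -> String {
  guard bytes.allSatisfy({ $0 < 0x80 }) else { throw InvalidInput() }
  return String(decoding: bytes, as: UTF8.self)
}

// throws -> Result: wrap the call in Result's catching initializer.
let success = Result { try stringProducer([0x68, 0x69]) }
let failure = Result { try stringProducer([0xFF]) }

// Result -> throws: get() rethrows the stored error.
let roundTripped = try? success.get()
print(roundTripped as Any)
```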


If there's enough interest we could spin this off. To me the simplest solution would be to make throw a function accepting a closure; that closure would be evaluated when the function in question is called with "try" and not evaluated when it is called with "try?":

    throw { SomeError.error(param1, makeCostlyParam()) }

(those who fancy autoclosure would prefer that form:)

    throw(SomeError.error(param1, makeCostlyParam()))

Plus the relevant (hopefully not massive) changes to the compiler.

If this is done, the "try?" would really become a zero cost abstraction similar in performance to returning an Optional.


OTOH, this optimises the failure path. Normally throwing an error (or an exception in other languages) is considered an exceptional event, and typically only the "happy path" is worth making "zero cost", as the "unhappy path" won't happen too often.

Well, that's the age-old philosophical religious war, isn't it? 😆

Objective-C has true exceptions that are genuinely only used for really exceptional stuff - where you're basically going to crash most of the time anyway - and has manual error handling (NSError etc) instead for things that are softer errors; things that often make sense to handle gracefully. [1]

Swift doesn't really have the same setup. There's returning nil (plus failable initialisers) but that's not really equivalent to NSError since it provides absolutely no information on why the failure occurred. So Swift exceptions have to serve that purpose.

Thus, I don't think Swift realistically allows you to avoid throwing exceptions - potentially frequently - in real-world, correctly-functioning code.

Easy to test:

Then just put "Swift throw" in Xcode's console output filter and see throws happening in realtime, along with the current count. They don't happen too often in my app.

it is a lot harder to write a chain type over two (or more) base collections that conforms to Collection than it is to write a chain type that only conforms to Sequence, because Sequence only requires an iterator.

i have written the latter kind of wrapper type countless times. the former is something i only attempt as a last resort.
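
for anyone who hasn't written one, a Sequence-only chain is roughly this much code (a sketch; the `Chain` name is just for illustration). a Collection version would also need an index type that tracks which base it points into, plus startIndex, endIndex, index(after:) and a subscript.

```swift
// Sequence-only concatenation of two base sequences.
struct Chain<A: Sequence, B: Sequence>: Sequence where A.Element == B.Element {
  let first: A
  let second: B

  struct Iterator: IteratorProtocol {
    var a: A.Iterator
    var b: B.Iterator

    mutating func next() -> A.Element? {
      // Drain the first base, then fall through to the second.
      if let element = a.next() { return element }
      return b.next()
    }
  }

  func makeIterator() -> Iterator {
    Iterator(a: first.makeIterator(), b: second.makeIterator())
  }
}

let chained = Array(Chain(first: [1, 2], second: 3...4))
print(chained)
```

this type can be passed to an initializer constrained to Sequence, but not to one constrained to Collection.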


To refine my position: I think that if this pitch proposes convenience initializers for validation, it should also include convenience initializers for (error-correcting) decoding, which is the more common/preferred API path. This would elevate the visibility of decoding.

E.g.:

extension String {
  @_alwaysEmitIntoClient
  public init(decodingUTF8 bytes: some Collection<UInt8>) {
    self.init(decoding: bytes, as: UTF8.self)
  }

  // UTF-16 code units are UInt16, not bytes.
  @_alwaysEmitIntoClient
  public init(decodingUTF16 codeUnits: some Collection<UInt16>) {
    self.init(decoding: codeUnits, as: UTF16.self)
  }

  // UTF-32 code units are UInt32.
  @_alwaysEmitIntoClient
  public init(decodingUTF32 codeUnits: some Collection<UInt32>) {
    self.init(decoding: codeUnits, as: UTF32.self)
  }
}

Future work includes normalizing inits and picking a default for String(utf8: myBytes).

This is still a lot of API surface area.

I made a new Xcode project, only imported Foundation, and here's what autocomplete gives me for String.init:

[Image: String initialisers]

Now we're considering adding init(validatingFromUTF8:), and maybe init(decodingFromUTF8:) and init(normalizingFromUTF8:)? And then all of those again for UTF-16? And then again for UTF-32?

It's too much. If we accept that there are discoverability issues, might I suggest that we are overwhelming users with too many initialisers? Adding yet more initialisers might not be the answer we seek and may even be counterproductive.

--

Also, the 3 Unicode codecs are not of equal importance. I don't think they all deserve convenience initialisers.

  1. UTF8 is obviously vital - it doesn't have endianness concerns, so it's the best format to use for documents and data which may be used on multiple systems (e.g. basically everything on the internet or stored to disk). A UTF8 initialiser is also good for ASCII strings. I would support a convenience initialiser for UTF8.

  2. UTF16 is occasionally useful, but at least an order of magnitude less so than UTF8. For us, it's mostly important for bridging to NSString, but we mostly do that through actual bridging. I doubt that so many users manually bridge strings by decoding UTF16 code-units that it's worth adding specific validatingUTF16: and decodingUTF16: initialisers.

  3. UTF32 is another order of magnitude less common than UTF16 (or even two); it is almost never used by regular programmers. It's useful for implementing Unicode algorithms, but Swift-native implementations should probably prefer to work in terms of Unicode.Scalar. It's very hard to justify a set of convenience initialisers for UTF32.

--

Final note: as shown above, some existing APIs are named dangerously closely to the proposed ones:

  • init?(utf8String:)
  • init?(validatingUTF8:)

These take [CChar] parameters, and will both fatalError if the data isn't null-terminated.

I don't think users will be happy that init(validatingFromUTF8:) works but init(validatingUTF8:) crashes their programs. Ditto for init(decodingFromUTF8:) vs. init(utf8String:).

There's also init(utf16CodeUnits:count:) from Foundation, which takes an UnsafePointer<unichar>. I don't even know why we're still exposing that API. It should be deprecated.

Maybe we could also rename Foundation's init(bytes:encoding:) to init(bytes:legacyEncoding:) or something? To emphasise that it's for ancient encodings only.


I agree that the three don't have the same importance. I wonder if we could just add the UTF-8 case and use its documentation to point users towards the generic initializer.

The stdlib's init?(validatingUTF8:) is motivated by C interop and is badly named; I propose renaming it. The main validating initializer pitched here is necessary if we ever want the ability to deprecate Foundation's init?(utf8String:), though what happens to Foundation is not determined by this pitch.


I am also supportive of (and generally in favor of) convenience inits only for UTF8.


I've updated the proposal in Proposal for String Validating Initializers by glessard · Pull Request #2110 · apple/swift-evolution · GitHub. This removes the UTF-16 and UTF-32 convenience initializers.

From your other recent pitch:

If strlen is commonly reimplemented, then should it be in the standard library, possibly as an UnsafeBufferPointer initializer?

(It probably wouldn't validate as UTF-8/16/32, so maybe off-topic here.)


I don't think anybody should reimplement strlen, and it's probably not needed in the standard library beyond the wrappers that already exist. One such wrapper does happen to be the first library call in String.init(cString:), which is why I used that idea as an example.

One direction of interest is a wrapper for pointer+terminator, for C interop. That is probably different than our pointer+length wrappers.
