[Pitch] String Input-validating Initializers

Michael_Ilseman · July 19, 2023, 1:39pm

I wouldn't assign a very high value to using Sequence over Collection for these inits. Much of Sequence outside of Collection is a historical accident that we didn't have time to fix up before ABI stability. It's also asymmetric with the other initializers we have on String. It's nice when you get it for free, though.

Another side-benefit of constraining over Collection is that I believe all of this can be always-emit-into-client (swift/stdlib/public/core/StringCreate.swift at 9e73dad31110312356469364ba314603d8cff4b7 · swiftlang/swift · GitHub).

Agreed.

There's a balance here where on the one hand we want to allow for smaller pitches to the stdlib and on the other we want to make sure the total API is fleshed out.

String(utf8: myBytes) is long overdue, as creating a String from UTF-8 needs better discoverability. If we're adding input-validating convenience overloads, they will show up in code completion, and common usage could easily become calling them with a force-unwrap at the end, when I think it would be better for common usage to do input-correction.

This also very quickly opens the door to discussing input-normalizing API.

1. decoding:as: preserves exact scalar values and does input-correction
2. validating:as: preserves exact scalar values and does failable input-validation
3. normalizing:as: normalizes/canonicalizes scalar values, speeding up comparisons and searches, and does input-correction

And a question for String(utf8: myBytes) is whether it has the semantics of 1 or 3.
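To make the difference between decoding and normalizing concrete, here is a small sketch using only the existing String(decoding:as:). It shows the two properties of the decoding:as: flavor: invalid bytes are corrected to U+FFFD, and the scalars that were in the input are stored exactly as given, even when a canonically equivalent precomposed form exists. The byte values are just illustrative; a normalizing initializer is not shown because no such API exists today.

```swift
// Input-correction: 0xFF can never appear in UTF-8, so it is replaced by U+FFFD.
let corrected = String(decoding: [0x61, 0xFF, 0x62] as [UInt8], as: UTF8.self)

// Exact scalar preservation: "e" + U+0301 (combining acute) stays as two scalars.
let decomposed = String(decoding: [0x65, 0xCC, 0x81] as [UInt8], as: UTF8.self)
// The precomposed U+00E9 is a single scalar.
let precomposed = String(decoding: [0xC3, 0xA9] as [UInt8], as: UTF8.self)

// String comparison honors canonical equivalence...
let comparesEqual = (decomposed == precomposed)
// ...but the stored scalars differ, which is what a normalizing init would change.
let sameScalars = decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)

print(corrected, comparesEqual, sameScalars)
```

A normalizing variant would presumably store both inputs in the same canonical form, which is where the faster comparisons and searches would come from.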

This is pitching initializers that, in the process of creating a String, fail if the input is invalidly encoded.

A separate and greatly needed batch of functionality is better Unicode processing, validating, and correcting APIs over code units (particularly when contiguous in memory). Briefly (using straw-person names and [Roadmap] Language support for BufferView):

struct BufferView: ~Escapable based functionality and ABI
protocol BufferViewable based inlinable (or always-emit-into-client) API
Chunking API over [Async]Sequence<CodeUnit> (i.e. create a moving window of in-memory code units and handle truncation)

Validation can throw specific errors in terms of e.g. the nth code unit in the input.

I believe this would also be more appropriate for this kind of use than trying to create a String, especially if the input is incomplete or very large.
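As a rough illustration of validation that reports where the input went wrong, here is a sketch built on the existing Unicode.UTF8.ForwardParser rather than on any BufferView-based API. The names InvalidUTF8Error and validateUTF8 are made up for this example and are not proposed API.

```swift
// Hypothetical error type: where the first invalid sequence starts and how long it is.
struct InvalidUTF8Error: Error, Equatable {
  let offset: Int  // index of the first code unit of the invalid sequence
  let length: Int  // number of code units the parser rejected
}

// Hypothetical helper: walks the bytes without creating a String and
// throws at the first invalid sequence.
func validateUTF8<Bytes: Collection>(_ bytes: Bytes) throws
where Bytes.Element == UInt8 {
  var parser = Unicode.UTF8.ForwardParser()
  var iterator = bytes.makeIterator()
  var offset = 0
  while true {
    switch parser.parseScalar(from: &iterator) {
    case .valid(let encoded):
      offset += encoded.count  // advance past a well-formed scalar
    case .emptyInput:
      return                   // reached the end with no errors
    case .error(let length):
      throw InvalidUTF8Error(offset: offset, length: length)
    }
  }
}

do {
  try validateUTF8([0x61, 0xFF, 0x62] as [UInt8])
} catch let error as InvalidUTF8Error {
  print("invalid UTF-8 starting at code unit \(error.offset)")
} catch {
  print("unexpected error: \(error)")
}
```

This only validates; it never allocates a String, which is the point when the input is incomplete or very large.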

Relatedly, there's also the decodeCString static functions which give you more precise info:

@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: UnsafePointer<Encoding.CodeUnit>?,
  as encoding: Encoding.Type, 
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)? 

@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: [Encoding.CodeUnit],
  as encoding: Encoding.Type,
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)?
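For reference, a quick sketch of how the array overload behaves with and without repair. The input bytes are just an example; note that the array has to be null-terminated.

```swift
// "hi", then a stray 0xFF, then the terminating NUL.
let cString: [UInt8] = [0x68, 0x69, 0xFF, 0x00]

// Default behavior: repair invalid code units and report that repairs happened.
let lenient = String.decodeCString(cString, as: UTF8.self)

// With repair disabled, invalid input makes the whole call return nil.
let strict = String.decodeCString(
  cString, as: UTF8.self, repairingInvalidCodeUnits: false)

if let decoded = lenient {
  print(decoded.result, decoded.repairsMade)
}
print(strict == nil)
```

So you can find out whether anything was repaired, but not where or what, which is the gap a throwing API with richer errors would fill.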

It's possible we could in the future have something like a static func decodeBytes(_: some BufferViewable) throws -> String which gives you much more error information as well as creating the String.