In the case of a non-replayable stream it strikes me as potentially more valuable to have even some limited error info (e.g. what code unit value caused the failure), because the error is otherwise totally opaque and potentially unreproducible. With `Int.init?(String)` you can always inspect the `String` value to glean whatever error information you want, but this isn't the case for an arbitrary sequence.
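For instance, here's a quick illustration of inspecting a replayable input after the fact:

```swift
let input = "12a4"
if Int(input) == nil {
    // The input is still around, so we can recover diagnostics ourselves.
    if let bad = input.firstIndex(where: { !$0.isNumber }) {
        let offset = input.distance(from: input.startIndex, to: bad)
        print("first non-digit character at offset \(offset)")
    }
}
```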
We've also been thinking about adding conveniences for that initializer like `String(utf8: x)` etc. I think these shorthand initializers make reaching for the right thing a lot easier, but that's for another proposal.
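Something like this, as a rough sketch (the `init?(utf8:)` spelling is just a placeholder, and it assumes the pitched validating initializer exists to forward to):

```swift
extension String {
    /// Hypothetical shorthand that forwards to the pitched validating initializer.
    public init?(utf8: some Sequence<UInt8>) {
        self.init(validating: utf8, as: UTF8.self)
    }
}
```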
Undoubtedly so, but @glessard has explained pretty convincingly why all of this information does not fall out automatically. It doesn’t seem reasonable for the design of a general-use initializer to optimize for non-replayable streams, where the cost of computing the offset and the nature of the invalidity (e.g. that it was an unpaired surrogate) is paid by all users.
One major usability issue with the initializers that require a type parameter, such as `String(decoding: x, as: UTF8.self)`, is that the list of valid type parameters is hard to find. This has led countless developers to use the Foundation API that transits through `NSString`. We need to do better.
It's not just non-replayable streams. It's also streams that don't reasonably fit into memory. My file format parsers lazily loaded chunks of bytes, so they could handle multi-gigabyte files without consuming too much memory while parsing. The files exist on disk, so I could reasonably translate a byte offset into an actual position in the file, and then inspect that position with other apps that can gracefully handle opening huge files.
I understand that the current implementation doesn't lend itself well to this model. But this is a pitch, and I'm focusing on the proposed developer experience. Having a thrown error definitively improves that experience, and folks who aren't interested in it can trivially ignore it with `try?`.
If the performance cost of errors is too much for such a core API, it seems like additional throwing overloads would solve both issues, if errors are felt worthwhile. So the real question is: if errors were zero cost, should this API produce them?
Those folks would not trivially be able to avoid paying the performance cost of generating the error information that they then have to throw away.
As the saying goes, “If my grandmother had wheels, she would be a tractor.” Yours is a reasonable question to ask if errors were zero-cost: put another way, even if they were zero-cost, that wouldn’t mean the most appropriate design for an API is to expose them. But here the errors are not zero-cost, so that isn’t the question.
If specialized uses justify additional APIs, then I agree it’s reasonable to consider them additive to the pitched feature and subset them out as a future direction, so that the present discussion focuses on the best design for the core API being considered.
Ah—unless I misunderstand, you're describing a feature that's pretty much totally different from the stated pitch here: input-validating initializers for `String`.
If your stream of bytes doesn't reasonably fit into memory, then calling an initializer that creates a string from those bytes, however elaborately it can report a failure, would be devastating if the stream is valid. You are not actually interested in creating any such string; and I am not aware of any precedent for must-fail initializers other than on `Never`.
Maybe the Foundation API is too easy to find ;)
But here's another idea - and it could apply to this API as well, so I think it's on-topic:
Currently, the `String(decoding:as:)` initialiser, and the proposed validating initialiser, accept an encoding using a type such as `UTF8.self`. This precludes us from taking advantage of static member syntax:
```swift
extension Unicode.Encoding.Type where Self == UTF8.self {
  // Not a thing. We can't do this.
  // Error: Cannot extend a metatype 'any Unicode.Encoding.Type'
}

// This extension compiles, but it doesn't work with static member syntax.
extension Unicode.Encoding where Self == UTF8 {
  public static var utf8: Self.Type { Self.self }
}
```
In order to use static member syntax, we would need to provide an instance of the encoding:
```swift
extension String {
  @inlinable
  public init<Encoding: Unicode.Encoding>(
    decoding codeUnits: some Collection<Encoding.CodeUnit>,
    as: Encoding // <- instance, not a type!
  ) {
    self = String(decoding: codeUnits, as: Encoding.self)
  }
}

extension Unicode.Encoding where Self == UTF8 {
  public static var utf8: Self { .init() }
}
extension Unicode.Encoding where Self == UTF16 {
  public static var utf16: Self { .init() }
}
extension Unicode.Encoding where Self == UTF32 {
  public static var utf32: Self { .init() }
}

func test(_ input: some Collection<UInt8>) {
  String(decoding: input, as: .utf8) // Works!
}
```
Moreover, Xcode (and I assume other SourceKit-LSP editors) will include these static members in autocomplete suggestions.
But can we do this? We use a protocol-based system because it is at least theoretically possible for somebody to implement legacy encodings such as Latin-1 or SHIFT-JIS, and `String(decoding: ..., as: .shiftjis)` will decode the bytes from SHIFT-JIS into Unicode. The standard library types are all just empty structs with plain `init()`s, so we can create instances of them without issue, but could it perhaps be inconvenient to instantiate a user-defined encoding?
As it turns out, the `Unicode.Encoding` protocol consists exclusively of static members. It cannot contain any state or options, and so we can reasonably say that even user-defined encodings will almost always be empty types, and should always be able to support a trivial `init()`.
And so it should be possible for any `Unicode.Encoding` to be passed to this function as an instance, meaning they should all be able to make use of static member syntax.
Alternatively, we could expand static member syntax so it works with metatypes. But this is a lot easier.
The argument label approach is easier still.
If this is a valid argument against having a throwing API, then I'd say it should be a valid argument to make it optimal to call `try? foo()` compared to `try foo()`. Perhaps there could be two generated versions of `foo()`, one which is actually called when used with `try` and another when it's called with `try?` - the latter won't incur error generation overhead.
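Roughly the pattern below, done manually today (hypothetical functions; the ASCII-only check stands in for real validation):

```swift
struct DecodingError: Error {
    var offset: Int // extra bookkeeping the fast path never pays for
}

// Cheap failable variant: no position tracking, just nil on failure.
func decodeFast(_ bytes: some Sequence<UInt8>) -> String? {
    var result = ""
    for byte in bytes {
        guard byte < 0x80 else { return nil }
        result.unicodeScalars.append(Unicode.Scalar(byte))
    }
    return result
}

// Full-fidelity variant: tracks the offset so the error can report it.
func decode(_ bytes: some Sequence<UInt8>) throws -> String {
    var result = ""
    for (offset, byte) in bytes.enumerated() {
        guard byte < 0x80 else { throw DecodingError(offset: offset) }
        result.unicodeScalars.append(Unicode.Scalar(byte))
    }
    return result
}
```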
I empathise with both sides here - I too have often wanted better diagnostics from failable initialisers, such as string-to-numeric converters, and yet I also appreciate that these are very common operations that much of the time don't need deeper diagnostics and are performance-sensitive. It seems like if it has to be a choice between the two then neither choice is satisfying.
Is it a false dichotomy, though?
Can this be solved generally, as a lateral solution to this debate? e.g. could the Swift compiler automagically produce specialisations of throwing functions that just return nil on failure (stripping out any then-dead code related to exception generation) if the caller is ignoring the exception anyway (e.g. with `try?`)?
Aside from making the code simpler - no need to manually write both versions - it could be advantageous to leave that decision to the compiler, as it can holistically evaluate whether the specialisation is actually worth it (in code size etc., potentially guided by PGO, taking into account the ripple effect into child functions, etc.).
Or perhaps this can be done - a little more explicitly and manually - via a macro that can be attached to an exception-producing function to synthesise a failable variant of the function?
Yes it is. This pitch is about adding failable initializers, and adding these does not prevent other additions at a later time.
Future directions include error-reporting initialization, as well as standalone validation, and transcoding improvements.
In the meantime, are there arguments against making the proposed changes at all?
Yes, that's a good point. But the pattern I was using for file parsing still applies: I had a chunk of bytes and I wanted to see if it would parse as `WhateverEncoding.self`. I knew where my chunk started relative to the overall file, and if the error could tell me where the problem was within the current chunk, I could still use that to locate the error within the overall file.
I'm not quite convinced providing some useful error information would incur a significant performance cost or be impossible for single-pass `Sequence`s.
Since there is no implementation to look at yet this is just guesswork, but I would assume the implementation would build up the transcoded/validated UTF-8 representation in a buffer as it iterates over the input code units. This would mean that when an error is detected, the prefix string up to (but excluding) the code unit causing the failure is already in memory, in a form that could pretty easily be wrapped in a `String` and included in the thrown error with no extra effort on the happy path.
This alone would imo be pretty useful, and require no replaying of sequences or passing of indices. For bonus points, a bit of extra info about how exactly the next code unit failed validation wouldn't hurt, but wouldn't be super necessary either.
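In sketch form, the error could carry that prefix along (a hypothetical type; nothing like this is pitched):

```swift
struct UTF8ValidationFailure: Error {
    /// Everything that decoded successfully before the failure; already sitting
    /// in the output buffer, so cheap to hand back.
    var validPrefix: String
    /// The code unit that failed validation.
    var offendingCodeUnit: UInt8
}
```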
Yeah, but it's not really scalable. There are already more than enough String initialisers, and not all of the proposed conveniences pull their weight - I doubt people decode UTF-32 so often that it's worth its own entry point. At the same time, static member syntax was designed precisely to make these kinds of APIs easier to use.
As for errors - I'm just not seeing the use-case. Character-set detection was mentioned, but that is an extremely complex operation, and has been a subject of academic study for decades (check out the references). It generally relies on heuristics and statistical analysis, and there are a plethora of libraries which perform better or worse at certain kinds of text, or in certain languages, etc - from `libchardet` (Mozilla's Universal Character Detector), to `uchardet` (forked from Mozilla), to Google's `compact_enc_det` (supposedly better at shorter text samples), to ICU's `CharsetMatch`. There's also `Charamel`, which looks interesting - it uses machine learning models, which would seem to suit this kind of problem well.
In short: character set detection is hard. Errors thrown by `String.init` are likely not going to be enough for a good quality implementation. We already expose text decoding APIs which somebody could use as part of a detector library, but even then I think it's more likely that they'd write their own. If you're analysing a byte stream for patterns, it can be counterproductive to abstract those patterns away.
And so I'm just not seeing a convincing use-case, bearing in mind the initialiser we already offer, which will repair malformed data by injecting replacement characters.
Unless I'm mistaken, though, the kind of syntax you're proposing involves adding a compiler feature — that's just not the scope of this pitch.
We could go with the approach used by Foundation, and specify the encoding with an enum. This would have the same flaw as labels: it could only define what's in the standard library, and it would also be easily confusable with the Foundation API.
Sorry, I guess I didn't make it clear.
Instead of the function signature you're proposing:
```swift
extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as: Encoding.Type
  )
}
```
We would change the second argument to be an instance of `Encoding` rather than a type:
```swift
extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as: Encoding // 👈 instance, not a type
  )
}
```
This would allow the use of static member syntax.
Additionally, I explain why I believe it must be trivial to construct an instance of any `Unicode.Encoding`-conforming type. Basically, it only has static members, meaning there is no instance state relevant to the protocol conformance; any user-defined encodings will almost certainly be empty types. So everybody should be able to adopt this.
I wouldn't assign a very high value to using `Sequence` over `Collection` for these inits. Much of `Sequence` outside of `Collection` is a historical accident that we didn't have time to fix up before ABI stability. It's also asymmetric with the other initializers we have on `String`. It's nice when you get it for free, though.
Another side-benefit of constraining over `Collection` is that I believe all of this can be always-emit-into-client (swift/stdlib/public/core/StringCreate.swift at 9e73dad31110312356469364ba314603d8cff4b7 · swiftlang/swift · GitHub).
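A minimal, unoptimized sketch of such a `Collection`-constrained overload, using the public parser hooks on `Unicode.Encoding` (the real implementation would build UTF-8 storage directly; `@_alwaysEmitIntoClient` is the underscored attribute the standard library uses for client-emitted bodies):

```swift
extension String {
  @_alwaysEmitIntoClient
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Collection<Encoding.CodeUnit>,
    as encoding: Encoding.Type
  ) {
    var result = ""
    var iterator = codeUnits.makeIterator()
    var parser = Encoding.ForwardParser()
    loop: while true {
      switch parser.parseScalar(from: &iterator) {
      case .valid(let encoded):
        result.unicodeScalars.append(Encoding.decode(encoded))
      case .error:
        return nil // invalid input: fail rather than repair
      case .emptyInput:
        break loop
      }
    }
    self = result
  }
}
```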
Agreed.
There's a balance here where on the one hand we want to allow for smaller pitches to the stdlib and on the other we want to make sure the total API is fleshed out.
`String(utf8: myBytes)` is long-overdue, as creating a String from UTF-8 needs better discoverability. If we're adding input-validating convenience overloads, they will show up in code completion, and it's possible that common usage would be to use them with a force-unwrap at the end, when I think it would be better for common usage to do input-correction. This also very quickly opens the door to discussing input-normalizing API:
1. `decoding:as:` preserves exact scalar values and does input-correction
2. `validating:as:` preserves exact scalar values and does failable input-validation
3. `normalizing:as:` normalizes/canonicalizes scalar values, speeding up comparisons and searches, and does input-correction
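In signature form, the three might look like this (straw-person sketches following the pitch's conventions, bodies omitted as in the pitch; only 2 is actually proposed):

```swift
extension String {
  // 1: repairs invalid sequences with U+FFFD, preserves valid scalars exactly (existing)
  public init<Encoding: Unicode.Encoding>(
    decoding codeUnits: some Collection<Encoding.CodeUnit>, as: Encoding.Type)

  // 2: preserves exact scalar values, fails on invalid input (pitched)
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>, as: Encoding.Type)

  // 3: repairs invalid sequences and canonicalizes scalars (e.g. to NFC),
  //    speeding up later comparisons and searches (future direction)
  public init<Encoding: Unicode.Encoding>(
    normalizing codeUnits: some Collection<Encoding.CodeUnit>, as: Encoding.Type)
}
```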
And a question for `String(utf8: myBytes)` is whether it has the semantics of 1 or 3.
This is pitching initializers that, in the process of creating a String, fail if the input is invalidly encoded.
A separate and greatly needed batch of functionality is better Unicode processing, validating, and correcting APIs over code units (particularly when contiguous in memory). Briefly (using straw-person names and [Roadmap] Language support for BufferView):
- `struct BufferView: ~Escapable`-based functionality and ABI
- `protocol BufferViewable`-based inlinable (or always-emit-into-client) API
- Chunking API over `[Async]Sequence<CodeUnit>` (i.e. create a moving window of in-memory code units and handle truncation)

Validation can throw specific errors in terms of e.g. the `n`th code unit in the input.
I believe this would also be more appropriate for this kind of use than trying to create a String, especially if the input is incomplete or very large.
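As a straw-person sketch of standalone validation that reports a position without materializing a String at all (hypothetical names, using the stdlib's public UTF-8 parser):

```swift
struct UTF8ValidationError: Error {
    /// Offset of the first code unit that failed validation.
    var codeUnitOffset: Int
}

func validateUTF8(_ codeUnits: some Collection<UInt8>) throws {
    var offset = 0
    var iterator = codeUnits.makeIterator()
    var parser = UTF8.ForwardParser()
    while true {
        switch parser.parseScalar(from: &iterator) {
        case .valid(let encoded):
            offset += encoded.count // advance by the scalar's code unit length
        case .error:
            throw UTF8ValidationError(codeUnitOffset: offset)
        case .emptyInput:
            return
        }
    }
}
```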
Relatedly, there are also the `decodeCString` static functions, which give you more precise info:
```swift
@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: UnsafePointer<Encoding.CodeUnit>?,
  as encoding: Encoding.Type,
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)?

@inlinable public static func decodeCString<Encoding: _UnicodeEncoding>(
  _ cString: [Encoding.CodeUnit],
  as encoding: Encoding.Type,
  repairingInvalidCodeUnits isRepairing: Bool = true
) -> (result: String, repairsMade: Bool)? where Encoding : _UnicodeEncoding
```
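For example, with the array-based overload (repairing is the default):

```swift
// 0xFF is never valid UTF-8; the trailing 0 is the NUL terminator.
let bytes: [UInt8] = [0x48, 0x69, 0xFF, 0x21, 0x00]

if let (result, repairsMade) = String.decodeCString(bytes, as: UTF8.self) {
    print(result)      // Hi�! (invalid byte replaced with U+FFFD)
    print(repairsMade) // true
}

// With repairs disabled, invalid input yields nil instead:
let strict = String.decodeCString(bytes, as: UTF8.self,
                                  repairingInvalidCodeUnits: false)
print(strict == nil) // true
```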
It's possible we could in the future have something like a `static func decodeBytes(_: some BufferViewable) throws -> String` which gives you much more error information as well as creating the String.
Where are these static members defined, and how does auto-complete find them?