SE-0405: String Initializers with Encoding Validation

allevato · August 8, 2023, 1:24pm

Hi folks,

The review of SE-0405: String Initializers with Encoding Validation begins now and runs through August 22nd, 2023.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager by DM. When contacting the review manager directly, please put "SE-0405" in the subject line.

Trying it out

If you'd like to try this proposal out, you can use it as a Swift package from the swift-evolution-staging repository. To do so, add the following entry to the dependencies of your Swift package:

    .package(
        url: "https://github.com/apple/swift-evolution-staging",
        branch: "input-validating-string-initializers"
    )

and add the following to the dependencies of the target that will be importing it:

    .product(
        name: "SE0000inputValidatingStringInitializers",
        package: "swift-evolution-staging"
    )

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

Thank you,

Tony Allevato
Review Manager

wadetregaskis · August 8, 2023, 3:45pm

I'm supportive of the overall direction. It's such an obvious and conceptually simple improvement.

Though it's technically orthogonal, is this the time to clean up the existing initialisers, rather than add yet more? There are already so many initialisers for String - especially with Foundation overlaid - and many are seemingly overlapping or (at first glance) equivalent in functionality, such that picking the right one is already non-trivial (and Xcode's autocomplete is usually overwhelmed by the plethora of options, offering no real help).

What if this proposal instead introduced more flexible initialiser(s) and deprecated a bunch of the existing ones that would then be superfluous? e.g.:

extension String {
    init?<Encoding: Unicode.Encoding>(
        fromUnicode: some Sequence<Encoding.CodeUnit>,
        encoding: Encoding.Type = .utf8,
        terminator: Encoding.CodeUnit? = nil,
        normaliseTo: Normalisation = .NFC,
        invalidCodeSequences: InvalidCodeSequenceAction = .failAndReturnNil)

    init?<Encoding: String.Encoding>(
        fromBytes: some Sequence<UInt8>,
        encoding: Encoding.Type = .utf8,
        terminator: UInt8? = nil,
        normaliseTo: Normalisation = .NFC,
        invalidCodeSequences: InvalidCodeSequenceAction = . failAndReturnNil)

    enum InvalidCodeSequenceAction {
        case failAndReturnNil
        case omit
        case replaceWith(String)
    }

    enum Normalisation {
        // Unicode experts might have better ideas for what the options
        // here should be.  This is just for illustration.
        //
        // I'm also going by my understanding that String today does not
        // force a certain normalised form in its internal storage -
        // presumably there's a reason that's necessary.
        case asIs
        case NFD
        case NFC
        case NFKD
        case NFKC
}

Debatably there could be a version for UnsafePointer (though I'm happy leaving it to callers to conform that to UnsafeBufferPointer first, using whatever context-specific knowledge is required to determine the length).

glessard · August 8, 2023, 4:38pm

As noted in the future directions, this is all work we want to do. What is proposed here is already done and needed: we can use it to delete slower code from swift-foundation and every end-user will benefit. The rest, however, is neither done nor sketched out or even budgeted for. I do not want to hold a simple (good) API proposal hostage to an un-designed ideal. If we play our cards right, the API proposed in SE-0405 will largely look like the everything cases, but with defaulted parameters.

You do raise an interesting question: do the String initializers need to separate the input-repairing case from the input-validating case? I believe they do. They inherently return different types (String vs. String?,) and separating them allows us to have better fast paths, because those fast paths do different things.

esummers · August 8, 2023, 4:58pm

I feel having an optional constructor that can never fail is a poor compromise since it will alway need an override.

In general, I feel the opposite. I'm ok with lots of simple constructors. Kitchen sink constructors are what make me uneasy...

Joseph_Heck · August 8, 2023, 7:29pm

+1

Although the set of initializers for String gets increasingly complex, I think the option for strict, non-repairing initialization is overdue, and extremely worthwhile.

I'm also looking forward to the API surface of initializing to get simpler, although I'm not holding my breath ;-) In any case, huge +1 to this proposal and incremental step towards (in my opinion), a clearly "better" situation.

xwu · August 21, 2023, 10:24pm

I agree with the proposed direction to add init?(validating:as:) to the standard library, and I think the rationale given passes the bar we've previously stated for additions to the standard library (namely: "It is possible to compose it using existing public API, but only at the cost of extra memory copies and allocations. The standard library is uniquely positioned to implement this functionality in a performant way."). I also agree that separate APIs for input validation (rather than validation-as-part-of-initialization) as well as normalization are properly separate topics that can be separated from this proposal.

As to fit with the Swift standard library, however, I would echo @wadetregaskis's concern about adding additional overloads, particularly with respect to the convenience API init?(validatingAsUTF8:). If this API were a standalone proposal, I do not think the justification that there are a lot of overloads and, thus with the most appropriate being hard to find, adding this one to make it easier to pick out from the bunch would pass muster. Indeed, it is adding to the very problem that it is purportedly trying to alleviate. It calls to mind that phrase from Virgil's Aeneid (XII:45): aegrescitque medendo ("and by the remedy he grows sicker").

I am also concerned that the distinction between the new init?(validatingAsUTF8:) and the now to-be-deprecated init?(validatingUTF8:) is entirely too subtle. Leaving the word "as" to be the user-visible distinguishing mark between the old and new APIs which have a difference in expectation with regard to nul-termination does not strike me as a principled move here; instead, I'd think that the resulting set of String APIs would be more coherent if this proposal added only init?(validating:as:) APIs—including versions that take CChar—and renamed the existing init?(validatingUTF8:).

Regarding the overload that takes CChar elements: the rationale that there would be ambiguity if an overload that takes some Sequence<CChar> is used on platforms where CChar is aliased to UInt8 seems solvable to me in ways other than restricting to UnsafeBufferPointer<CChar>:

Such an overload could be #if'd such that it is present only on platforms that alias CChar to Int8
Such an overload could instead be written to take some Sequence<Int8> so that the same functionality is present on all platforms regardless of what CChar is aliased to
Such an overload could be designated @_disfavoredOverload

It may still be the case that, of all these alternatives, restricting to UnsafeBufferPointer is the best, but the proposal does not make the case.

Regarding the renamed init?(validatingCString:), a small nit and a suggestion:

The nit: the internal parameter name isn't part of the API, but many years ago I commented on the standard library spelling of nul to indicate the NUL character and was told that this was indeed the intended house style; it is inconsistent therefore that this proposal adopts the spelling nullTerminatedCodeUnits.

The suggestion: The renamed API nowhere tells us that it's validating the input as UTF-8: the proposed renaming adds the crucial detail about nul-termination but then loses the detail about encoding. The existing String API includes the property utf8CString (full disclosure: I was the one who wrote the proposal to rename nulTerminatedUTF8CString to its present name) and I think it would be cromulent to parallel that here: init?(validatingUTF8CString:).

wadetregaskis · August 22, 2023, 12:05am

Agreed. My example was overly simplistic, in that regard. It would be preferable if using omit or replaceWith(String) made the initialiser infallible. Unfortunately Swift doesn't intrinsically support that, so it'd require manually duplicating the initialisers.

For clarity the failable one arguably should still have an invalidCodeSequences parameter, it should just be limited to the failAndReturnNil value. So that the similarity with the infallible initialiser is clearer. I can see the counter-argument that it's unnecessary boilerplate, though.

I suspect a good portion of that is subjective, and based on personal experience. I bear the scars of Java, for example, and that informs my preferences in no small way.

Trying to focus on the [moreso] objective aspects, my case in short for fewer [equivalent] constructors is that they are:

Easier to work with for auto-complete (less chance of overflowing Xcode's suggestion pop-up, and more likely to make meaningful distinctions clear such as between initialisers that can fail vs those that cannot).
More discoverable. It'd be uncommon for someone to exhaustively read the full documentation for every String initialiser (especially since there's already so many, and some are provided only through the Foundation overlay), so most folks aren't going to learn the full breadth of the API. Whereas having few[er] constructors makes it more likely that you'll investigate most or all of them.

Also, once you do find one which seems to fit the bill, you have a nice menu of relevant options (initialiser parameters) laid out for you, from which you can naturally discover what the possibilities are (and also be prompted to think about things you wouldn't have considered otherwise).

I do find it curious, though, that I spent a decade working with Objective-C, which has many init methods by necessity (no method overloading, no default arguments, etc), and yet I really like Objective-C. Furthermore I feel like a few bigger initialisers is more in the spirit of Objective-C. I'm not sure how to rationalise that.

glessard · August 22, 2023, 2:00am

I am merely aware that the pre-existing UTF8-repairing initializer for C strings is init(decodingCString nullTerminatedCodeUnits:). I re-used that parameter label, since these initializers will be siblings.

I like the validatingUTF8CString label, but it feels inconsistent with the label its sibling has, "decodingCString". I don't think we would want to rename both for consistency, so I went with a close analogue of the existing one.

glessard · August 22, 2023, 2:20am

I had not thought of making the overload disappear in the right (or wrong) conditions. I have to think about that.
I don't quite like the idea of just defining it for some Sequence <Int8>, but really it's just a mild distaste.

This leaves the discoverability problem intact. There is no auto-complete for the possible arguments to the as portion of the initializer. I would like for auto-complete to work there, but I do not know of a good solution for this. Karl suggested that these type parameters could be replaced by instances, but since there is simply nothing useful to be done with a hypothetical instance of a decoder, it feels like the wrong direction.

xwu · August 22, 2023, 2:59am

This is a very good point, but I'd rather we solve it—as in, actually make autocomplete work for the argument to as, given that we already have init(decoding:as:) and, with this proposal, init(validating:as:)—rather than adding ad hoc overloads to work around the problem.

It would seem to me that one route to that solution is something that extends SE-0299, and another is to rework both init(decoding:as:) and init(validating:as:) to work with the static member lookup rules we currently have. I do agree, though, that simply having instances of decoders that don't do anything isn't ideal, even though that would seem the most straightforward solution that works with existing static member lookup rules. Perhaps there are other alternatives too?

glessard · August 22, 2023, 9:32pm

Solving the autocomplete problem would also be my preference. Realistically, however, the issue has remained unsolved since String.init(decoding:as:) became API 6 years ago. I would prefer a solution to exist soon than a waiting even longer for a better solution.

allevato · August 22, 2023, 9:53pm

Hi folks,

The Language Steering Group discussed this proposal today. In general, we feel that API design should not be motivated primarily by tooling deficiencies—the "tail wagging the dog," so to speak. In this particular case, we discussed whether specific initializer overloads for UTF-8 would be warranted if tooling like autocomplete made it easier to discover the init(validating:as:) variant, and we don't believe that they would.

We feel that these tooling problems can and should be addressed; for example, since the metatype argument for the encoding is constrained by a protocol, SourceKit could offer a list of all types conforming to that protocol as the completion suggestions when using this API.

With that context, we are extending the review of this proposal for one week (until August 29) to allow this discussion to continue and get more feedback from the community.

allevato · September 5, 2023, 7:03pm

This proposal has been accepted with modifications.