Type Safety of String Indices

string

(Jeremy David Giesbrecht) #1

SE‐0180 came up again during the review of SE‐0241, which seeks to do some damage control.

I was not around during the original review of SE‐0180. I wish I was, because it has become the only accepted evolution proposal I have ever wished could be undone. I have been just biting my tongue in silence and trying to live with it. Since others are apparently noticing problems it caused too, I thought I would start a discussion about the issues it has been causing me. A complete reversal of SE‐0180 is probably out of the question, but maybe we can find something that can be still done about some of the issues?

SE‐0180 erased the difference between the indices of a string’s various views:

Today String shares an Index type with its CharacterView but not with its UTF8View , UTF16View , or UnicodeScalarView . This proposal redefines String.UTF8View.Index , String.UTF16View.Index , and String.CharacterView.Index as typealiases for String.Index [...]

The rational given was this:

The different index types are supported by a set of Index initializers, which are failable whenever the source index might not correspond to a position in the target view[.] [...] The result is a great deal of API surface area for apparently little gain in ordinary code, that normally only interchanges indices among views when the positions match up exactly (i.e. when the conversion is going to succeed). Also, the resulting code is needlessly awkward.

I have found just the opposite. The “ordinary code” which “only interchanges indices when the conversion is going to succeed” has taken a massive safety hit, and now requires extra boilerplate to deal with illogical failability.

(The following examples all assume a global constant let café = "cafe\u{301}".)

Before SE‐0180 landed, I had lots of code like this:

func use(index: String.Index) {
    let scalarIndex = index.samePosition(in: café)
    // ...
}
use(index: café.startIndex)

It was all perfectly valid and logical. Since the parameter is a valid character index, it must have a corresponding scalar index. The standard library knew this, and the conversion method returned a non‐optional.

But then SE‐0180 landed, and it stopped compiling. There were two possible ways to deal with it:

  1. Obey the fix‐it:

    func use(index: String.Index) {
        let scalarIndex = index.samePosition(in: café)! // ← Added “!”.
        // ...
    }
    

    Great, just add one character and all is well—at first. To begin with, the method is only applied in the same places as before, so the indices it receives are always valid and everything continues to work. Unfortunately, over time code completion suggests to use it in new places and sooner or later, something somewhere provides it an invalid character index, forgetting to convert it from UTF8View.Index and passing it directly through the typealias:

    use(index: café.utf8.index(before: café.utf8.endIndex))
    

    Even this bug isn’t caught at first, because many indices will still just happen to be valid. Eventually you get an inconsistent crash, which occurs far from the real cause.

  2. Rethink according to the new model. String.Index isn’t really a character index; it’s really a UTF8View.Index that might be usable in some string‐based APIs with undocumented behaviour occurring whenever it isn’t (more no that later). This rewrite is ready for anything:

    func use(index: String.Index) {
        guard let validated = index.samePosition(in: café) else {
            return
        }
    
        let scalarIndex = validated.samePosition(in: café.unicodeScalars)!
        // ...
    }
    

    Now the method is bulletproof. But the tradeoffs aren’t negligible. There are extra run‐time unwrapping and validity checks. This Void‐returning method may silently do nothing. A version with a return value needs to become optional, forcing every call site to also handle an optionality which shouldn’t be there. If the return type is already optional, such as func firstMatch(for: Pattern, in: Range<String.Index>) -> Range<String.Index>? then it conflates two different semantics into nil: “There is no match.” vs “I cannot search because the index is invalid.”

Now the “ordinary code” which “only interchanges indices when the conversion is going to succeed”, has to choose between hard to track bugs and an avalanche of API changes to switch into the domain of unordinary code which interchanges indices with unknown validity.

To reuse SE‐0180’s own words, I feel that now “the resulting code [really] is needlessly awkward.”

There is actually a third option. It’s one I shudder at even more, but it is employed in several places in the standard library—the strategy of silently rounding to a valid index.

For example, the String’s subscript for getting substrings apparently rounds the indices, though I can find no documentation anywhere that describes its intended behaviour for such cases.

This code is not logically valid, yet right now neither compiler nor documentation tell you that it isn’t.

let index = café.utf8.index(before: café.utf8.endIndex) // The last byte.
let before = String(café[..<index])
let after = String(café[index...])
print(before + after) // “café”
print(after.utf8.count) // 2 bytes?!?

I wish this didn’t compile. But even knowing that the current String.Index model requires it to compile, I would at least expect the second line to fail at runtime. But it doesn’t, instead entering a bizarre, unexpected and undefined territory, injecting nebulous bugs into later logic. We might pat ourselves on the back that the two halves still make the same whole on the fourth line, but the developer asked for one byte in the top line and silently ended up with two by the fifth. Any byte offset logic that follows will be headed for a train wreck. At least they are confined to five short lines in the example, but in practice the component calls could be in far‐flung methods spread across a library. And again, the whole thing may appear to work just fine for a long time, merely because the developer neglected to test it with anything outside ASCII.

So what I’m asking is this: Can we find a way to restore type safety to string indices, or at least a way to provide a type‐safe alternative alongside the status quo? Can there be an elegant way to declare method parameters that have a compile‐time guarantee that they are valid indices of a particular view? Is there some other solution to these issues?

(Possibly interested persons: @Michael_Ilseman, @lorentey)


Pre-pitch: Let's solve the String.Index encoded offset problem
(Norio Nomura) #2

FYI, this behavior seems to be changed in Swift 5.(It is probably a bug)
SR-9802: String.UTF8View.index(before:) is not compatible between Swift 4.x and Swift 5


(Karoy Lorentey) #3

In my mental model, there is no need to use any conversion methods -- String's various views all share the same index space, so I can safely pass an UTF-32 index to a UTF-8 view of the same string, and I'm golden:

let café = "🦜cafe\u{301}"

Array(café)                // ⟹ ["🦜", "c", "a", "f", "é"]
Array(café.unicodeScalars) // ⟹ [129436, 99, 97, 102, 101, 769]
Array(café.utf16)          // ⟹ [55358, 56732, 99, 97, 102, 101, 769]
Array(café.utf8)           // ⟹ [240, 159, 166, 156, 99, 97, 102, 101, 204, 129]

café.indices.map { café.unicodeScalars[$0] }       // ⟹ [129436, 99, 97, 102, 101]
café.indices.map { café.utf16[$0] }                // ⟹ [55358, 99, 97, 102, 101]
café.indices.map { café.utf8[$0] }                 // ⟹ [240, 99, 97, 102, 101]

café.unicodeScalars.indices.map { café.utf16[$0] } // ⟹ [55358, 99, 97, 102, 101, 769]
café.unicodeScalars.indices.map { café.utf8[$0] }  // ⟹ [240, 99, 97, 102, 101, 204]
café.unicodeScalars.indices.map { café[$0] }       // ⟹ ["🦜", "c", "a", "f", "é", "\u{301}"] ☆☆☆

café.utf16.indices.map { café[$0] }                // ⟹ ["🦜", "🦜", "c", "a", "f", "é", "\u{301}"] ☆☆☆
café.utf16.indices.map { café.unicodeScalars[$0] } // ⟹ [129436, 129436, 99, 97, 102, 101, 769] ★★★
café.utf16.indices.map { café.utf8[$0] }           // ⟹ [240, 240, 99, 97, 102, 101, 204] ★★★

café.utf8.indices.map { café[$0] }                 // ⟹ ["🦜", "🦜", "🦜", "🦜", "c", "a", "f", "é", "\u{301}", "\u{301}"] ☆☆☆
café.utf8.indices.map { café.unicodeScalars[$0] }  // ⟹ [129436, 129436, 129436, 129436, 99, 97, 102, 101, 769, 769] ★★★
café.utf8.indices.map { café.utf16[$0] }           // ⟹ [55358, 55358, 55358, 55358, 99, 97, 102, 101, 769, 769] ★★★

This behavior is well-defined for all combinations. I agree that rounding off partial positions (exhibited in lines marked with ★★★) is unusual. (Arguably, this isn't quite consistent, either: String is vending partial elements, as seen in lines marked ☆☆☆, while other views aren't.)

The samePosition and init?(_:, within:) APIs are there to use when this behavior is undesirable. However, there shouldn't ever be a need to call them twice in a sequence, like in your option (2) -- this should do just fine:

func use(index: String.Index) { 
    // We need to assume that `index` is a valid index that came 
    // from some view of `café`. There is no way to verify this, 
    // other than trying to use it and to see if it traps or results 
    // in nonsensical values.

    // Now if for some reason we *need* `index` to fall on a scalar
    // boundary, we can call `samePosition(in:)` or `init?(_, within:):`
    // to ensure this:
    precondition(index.samePosition(in: café.unicodeScalars) != nil,
      "Invalid index: not on a scalar boundary")
    
    // `index` is okay
    ...
}

However, this example is probably too abstract. It's strange for a function to take a standalone string index without also taking an explicit string parameter. Indices are meaningless without their corresponding collection instance, and the ambiguity goes away the instant we provide context using a particular string view.

For example, here is a way to retrieve the first word from a string (using a highly questionable definition for "word"). It has issues, but its use of indices seems fine to me; I feel that adding an extra conversion step (like we had to in Swift 3) would not help understanding the code at all.

extension String {
  var firstWord: Substring? {
    guard let i = self.utf8.firstIndex(of: 32) else { return nil }
    return self[..<i]
  }
}

(Jeremy David Giesbrecht) #4

After tracking down and reading the pitch, review and re‐review of SE‐0180, it appears these people may be interested too: @dabrahams , @jrose, @Lily_Ballard, @xwu, @Drew_Crawford1.


Yup. Here’s a real world one that used to make sense, but is now brittle:

(It’s still somewhat reduced; some generics have been specialized for clarity—forgive me if I introduced typos in the process.)

extension String.UnicodeScalarView.Index {

    /// Returns the position of the cluster that contains this index.
    func cluster(in clusters: String) -> String.Index {
        let string = String(clusters)

        var copy = self
        var position = samePosition(in: string)

        while position == nil {
            // What does the next line do now if its not really a UnicodeScalarView.Index?
            copy = string.unicodeScalars.index(before: copy)
            position = copy.samePosition(in: string)
        }

        guard let result = copy.samePosition(in: string) else {
            unreachable()
        }
        return result
    }
}

And from there:

extension Range where Bound == String.UnicodeScalarView.Index {

    /// Returns the range of clusters that contains this range.
    func clusters(in clusters: String) -> Range<String.Index> {
        let lower = lowerBound.cluster(in: clusters)
        if let upper = upperBound.samePosition(in: String(clusters)) {
            return lower ..< upper
        } else {
            return lower ..< clusters.index(after: upperBound.cluster(in: clusters))
        }
    }
}

It is then used in ways like this:

let nfd = café.decomposedStringWithCanonicalMapping()
for scalarRange in nfd.unicodeScalars.matches(for: "\u{301}") {
    let converted = scalarRange.clusters(in: nfd)
    let offsets = nfd.distance(from: nfd.startIndex, to: converted.lowerBound)
        ..< nfd.distance(from: nfd.startIndex, to: converted.upperBound)
    print("Don’t use that symbol at \(offsets.lowerBound)!")

    let replacement = nfd[converted].unicodeScalars.filter { $0 != "\u{301}" }
    let stringifiedReplacement = String(String.UnicodeScalarView(replacement))
    print("Replace it with \(stringifiedReplacement)")
}

Does/should the noted line in the bottom‐level function trap? The documentation has no answer. If it rounds down, we could be indirectly asking what comes before the start index. If it rounds up, the range conversion method up one level could call past the end index. The function could start with a precondition check that verifies it can convert to itself, but that’s extra runtime overhead and extra code.

So is there—or can there be again—a better way for the intermediate functions to declare that they accept only the kind of index they are designed to handle, without introducing a long cascade of unnecessary optionality? Before Swift 4 this sort of thing just worked and the compiler enforced its integrity. Now it hurts my head just trying to reason about it—and it is really easy to forget a conversion and enter the twilight zone. It just occurred to me that even if optionality were embraced all the way up the chain, the top level caller will likely implicitly unwrap, which sets everything back to square one if some conversion was neglected before the call.

Edit: Found and fixed at least two typos already. Wrong time of night to be trying to do this...


(Jordan Rose) #5

Heh. I saw your thread; the short form of my reaction is "I agree because strong type safety is Good, but at this point source compatibility is more important, oh well". That precludes me joining the fight to convince others that this would be worth it to change [back].


(Jeremy David Giesbrecht) #6

I’m not necessarily asking to change it back. I understand the need for source compatibility. There are also aspects of SE‐0180 I do like.

Right now I’m more interested in answers to these questions:

  1. Does anyone out there has a good suggestion of how to repair/write these sorts of methods (the ones I have, not the ones in the Standard Library). Maybe that’s some sort of language change (e.g. undo the typealiasing, but provide string subscripts for all kinds of indices). Maybe it means a module redesign instead (e.g. creating type‐safe wrappers for everything). I haven’t found an idea that is both feasible and elegant, but someone else might.

    In this vein it occurred to me (finally) in the middle of the night last night that switching the bottom function of the last example to treat any index as a UTF‐8 index would solve everything for that particular case. (That’s provided a string cannot be holding an incomplete UTF‐8 sequence. I used to assume that was the case. But Norio just demonstrated that to be false in Swift 5. Is that a bug or not?)

  2. What is the intended behaviour now for functions like index(before:)? If it is pinned down and documented, then I can at least know what I am building on to make the next layer do what it is supposed to. Right now I can only test what it does and make guesses, but Norio demonstrated that those observations don’t match from one patch to the next, so what does happen and what might happen aren’t necessarily the same.

    What should be happening here?:

    let string = "e\u{301}"
    let invalid = string.unicodeScalars.index(after: string.startIndex)
    
    var towardStart = invalid
    while towardStart > string.startIndex {
        print("Scalar Offset: \(string.unicodeScalars.distance(from: string.unicodeScalars.startIndex, to: towardStart))")
        print("Character: \(string[towardStart])")
        print("Prefix: \(string[..<towardStart])")
        print("Suffix: \(string[towardStart...])")
        print("Backing up...")
        towardStart = string.index(before: towardStart)
    }
    print("Scalar Offset: \(string.unicodeScalars.distance(from: string.unicodeScalars.startIndex, to: towardStart))")
    print("Character: \(string[towardStart])")
    print("Prefix: \(string[..<towardStart])")
    print("Suffix: \(string[towardStart...])")
    
    print("")
    
    var towardEnd = invalid
    while towardEnd < string.endIndex {
        print("Scalar Offset: \(string.unicodeScalars.distance(from: string.unicodeScalars.startIndex, to: towardEnd))")
        print("Character: \(string[towardEnd])")
        print("Prefix: \(string[..<towardEnd])")
        print("Suffix: \(string[towardEnd...])")
        print("Advancing...")
        towardEnd = string.index(after: towardEnd)
    }
    

    Swift 4.2.1 outputs this:

    Scalar Offset: 1
    Character: ́
    Prefix: e
    Suffix: ́
    Backing up...
    Scalar Offset: 0
    Character: e // ← Uh...?!?
    Prefix:
    Suffix: é
    
    Scalar Offset: 1
    Character: ́
    Prefix: e
    Suffix: ́
    Advancing...
    

    It seems to select the next valid index. Can that be relied on? Might it someday trap instead? Might it someday round, causing an indirect trap?

    ...and where on earth did it get the idea that “e” was the entire character? That index wasn’t even invalid.


(Lily Ballard) #7

My general feeling is having a single String.Index for all views is an ergonomic win. There is a slight loss in the case of "I want to write a function that only behaves properly if the input index lies on a unicode scalar boundary" as you can no longer represent that in the type system, but that's largely solved by simply defining explicit semantics for what happens when you take an index from e.g. the utf8 view and use it in e.g. the unicodeScalars view. I was under the impression that SE-180 actually did define those semantics, though I just took another look and what I see seems wrong:

With respect to subscripts, an index that does not fall on an exact boundary in a given String or Substring view will be treated as falling at its encodedOffset in the underlying code units, with the actual contents of the result being an emergent property of applying the usual Unicode rules for decoding those code units. For example, when slicing a string with an index i that falls between two Character boundaries, i.encodedOffset is treated as a position in the string's underlying code units, and the Character s of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units.

Since this is talking about code units, not unicode scalars, even assuming code units is UTF-16 this description can produce invalid strings (e.g. by splitting a surrogate pair).

I think we need to define semantics for the following:

  1. Subscripting when an index doesn't fall on a boundary for the current view.
  2. Calculating the next before an index, and
  3. Calculating an index after an index.

#2 and #3 are I think really easy to define. index(before:) produces the largest index in the view that compares less than the input. If the input index is contained in the view's indices, then the behavior matches expectation. If the input index is not contained in indices, then this is equivalent to rounding up to the nearest boundary and then applying index(before:). Similarly for index(after:), it's equivalent to rounding the index down to the nearest boundary and then applying index(after:).

For subscripting, we basically just have to decide if we want to round down or up. I'm not sure which is appropriate. For the character-based String / Substring view, we'd actually apply rounding based on unicodeScalars (as splitting a grapheme cluster apart still produces valid strings). This way e.g. str[..<idx] and Substring(str.unicodeScalars[..<idx]) will produce the same strings. This is already the current behavior, but it should be made explicit.

What I definitely don't want to ever see is to throw a fatal error if I use an index from one view in a different view. An index that is valid for one view should be valid for all views.


(Lily Ballard) #8

One more operation we'd have to define the semantics of: str.unicodeScalars.distance(from: str.startIndex, to: someUTF8Index) (where the UTF-8 index doesn't fall on a scalar boundary). I'm not sure what the behavior of this should be. Experimentally, right now it throws a fatal error (Swift 4 says "Fatal error: String index is out of bounds", Swift 5 just says "Fatal error:" with no message which is pretty weird). The simplest approach is to keep the panic and simply document it.

We also have to figure out the semantics of handling UTF-8 indices in UTF-16 views (and vice-versa). The previous definition (e.g. largest index that sorts before the input) actually does give us an explicit answer, I'm just not sure how these indexes sort today. SE-180 provides a loose definition (based on encodedOffset) but also admits that this is a weak ordering, and if we're replacing encodedOffset with a per-view calculation then we lose even this definition. My recommendation is to say a UTF-8 or UTF-16 index that doesn't fall on a unicode scalar boundary should be rounded down when interpreted in a UTF-16 or UTF-8 (respectively) view, prior to applying the subscript or index(before:) / index(after:) logic.


(Jeremy David Giesbrecht) #9

I can go with that if it is explicit in the API contract. I think anyway—I asked before if strings can carry invalid UTF‐8 or not. What happens if even startIndex is invalid as a scalar index?

In the character/scalar realm that can makes sense, and I don’t mind it, because “degenerate” does not mean invalid. But with the UTF views wouldn’t that enable the creation of the very slices which throw a wrench in what we just agreed about index(before:)?

Let’s say the function I posted above walked backward through UTF‐8 instead of scalars until it found a valid index. I thought UTF‐8 would be the lowest type of index, and therefore completely safe. But what you suggested would mean someone could send me a UTF‐16 index that could be rounded down to startIndex before doing the logic of index(before:)? Right?

This is turning out to be a whole lot more complicated than I realized at first. Maybe there really is no way around starting each method with a self‐conversion precondition? But then you just said that that very thing would annoy you if you were the user... Grrr... I don’t like problems without solutions.


(Lily Ballard) #10

They're not supposed to. AFAIK in Swift 4 there's no way to construct a Swift String that contains invalid code units. It's hard but possible to construct an NSString that contains unpaired surrogates, but my recollection is when consuming such a string from Swift, the unpaired surrogates turn into U+FFFD replacement characters. The fact that Swift 5 right now is happy to construct a String that contains invalid UTF-8 is definitely a bug.

I'm not quite sure what you mean. In what way does the ability to use all valid UTF-8 indices in the unicodeScalars view conflict with index(before:)?

Good point, you could end up in a situation where someIdx > str.utf8.startIndex is true and also str.utf8.index(before: someIdx) would return an invalid index because it would round down to startIndex first. Given that, I think we can refine the behavior of index(before:) and index(after:). Given the following:

  • There exists a total order between all indices that can be expressed in the UTF-8 view (which includes all indices that lie on a unicode scalar boundary).
  • There exists a total order between all indices that can be expressed in the UTF-16 view (which includes all indices that lie on a unicode scalar boundary).

We can then say that two String indices are "comparable" if they can both be expressed in the same UTF-8 or UTF-16 view. With that definition in hand, index(before:) becomes:

"This returns the largest index within the receiver view that is both comparable to the argument and compares less than the argument."

Similarly for index(after:), this would return the smallest index that is both comparable to the argument and compares greater than the argument.

In most cases this is equivalent to the previous definition. In the specific case of taking a UTF-16 non-character-boundary index and calling utf8.index(before: idx) (or a UTF-8 index in the utf16 view), index(before:) will effectively round down to the unicode scalar boundary and index(after:) will round up.

This way, if idx > utf8.startIndex, then utf8.index(before: idx) will always yield a valid index (same thing for index(after:) and endIndex).


(Michael Ilseman) #11

As the person on the hook to hard-code into Swift ABI the most efficient possible unified String.Index across multiple backing encodings and transcoded views, while preserving as much semantic behavior with older versions of Swift and saving space for future growth, I strongly sympathize with this sentiment.

However, the ability to use grapheme-aligned indices with a scalar view, or scalar-aligned indices with a code unit view, is an improvement to ergonomics. The potential issues come from going the opposite direction, and I think this is something we can alleviate in the future.

SE-0180 states:

With respect to subscripts, an index that does not fall on an exact boundary in a given String or Substring view will be treated as falling at its encodedOffset in the underlying code units, with the actual contents of the result being an emergent property of applying the usual Unicode rules for decoding those code units.

This "emergent property" is also emerging from the actual encoding the String happens to use. Using non-scalar-aligned indices on a transcoded view, in addition to being a bad idea, might have different behavior depending on the underlying encoding.

There is no problem with Index interchange from a higher-granularity view to a lower one, e.g. grapheme-aligned indices are always scalar-aligned. Scalar-aligned indices always exhibit the same emergent behavior regardless of underlying encoding, i.e. whether the view is transcoded or not.

As String has become more ergonomic and continues to become more ergonomic, the set of developers using index interchange, especially from a (potentially transcoding) code-unit view to a different view should continue to shift further and further into rare expert-use domain where explicit control and more elaborate safe-guards would be appreciated.

It's possible that in a future Swift version, say Swift n, we can overhaul SE-0180. Using swift-version, deployment target checking, etc., we could change behavior specifically on Swift n code running with a Swift n stdlib/runtime, to trap for non-aligned indices (especially sub-scalar indices) rather than auto-aligning / emergent behavior. We could then provide more/safer API for explicit control / guards.


(Michael Ilseman) #12

Swift 4 permitted invalid code units, the views did validation on the fly at read time substituting in the replacement character. Swift 5 established an internal (currently unspecified but burned into ABI) invariant that all native strings contain validly encoded UTF-8 contents by validating at creation time, like Rust, when it is far more efficient to do so. Non-native strings can still contain invalid contents and those are lazily validated on read.

If you have a non-native string whose startIndex does not begin a scalar, then you should get behavior as though that scalar was replaced with U+FFFD. Exactly how many U+FFFDs are used for replacement is a complicated issue and currently unspecified. That's a whole other rabbit hole.

String.Index already conforms to Comparable. Encoded offsets are in the most significant bits and transcoded offsets in the least significant bits.


(Lily Ballard) #13

That's true in the implementation, but semantically, two non-scalar-boundary indices where one comes from a transcoded view (or both come from different transcoded views) do not have a defined ordering. This is what I meant by "comparable", that the 2 indices have a defined order. At runtime you can always compare any 2 indices but if you do so with 2 "uncomparable" indices the result you get is implementation-defined.


(Michael Ilseman) #14

We are in violent agreement that there is a partial order exposed by transcoding ;-)


(Jeremy David Giesbrecht) #15

I said that because then you could do this...

String(otherString[misalignedUTF8Range])

...and get a string with no valid startIndex.

But Michael addressed this afterward:


I see the value in that part of what SE‐0180 provided. Making that easier was a definite improvement. To the designers and implementors: Thanks.

Also thanks, Michael, for your detailed reply. It gives me a much clearer idea of what String does and doesn’t claim to do, so I already feel safe(r) again reasoning about edge cases when I’m coding.


One last question about this:

So that means Swift 5 will trap the code at the top of this post during the initializer, right? Do I understand correctly that the Substring will behave like the non‐native strings you mention and not trap during the slice?

Or will the new String borrow the underlying storage of the old and not notice that it is invalid?