String slicing can produce “invalid” substrings

I am trying to understand how string slicing works with indices taken from a different string view. The section “Comparison and Subscript Semantics” of SE-0180 (String Index Overhaul) states:

... For example, when slicing a string with an index i that falls between two Character boundaries, i.encodedOffset is treated as a position in the string's underlying code units, and the Characters of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units.

Does that mean that string slicing should always produce a valid substring, no matter what indices are used? Here is an example which produces “strange” results:

let s = "€"
print(Array(s.utf8)) // [226, 130, 172]

let u = s.utf8
let from = u.index(u.startIndex, offsetBy: 1)
let to = u.index(u.startIndex, offsetBy: 2)

let s2 = s[from..<to]
print(s2.debugDescription)  // ""
print(Array(s2.utf8))       // [130]

So apparently we get a substring whose .utf8 view is an invalid UTF-8 sequence. Working with that substring produces more strange results. In Xcode 10/Swift 4.2:

print(s2.count) // 1
print(Array(s2.unicodeScalars)) // Thread 1: Fatal error: String index is out of bounds

and in Xcode 9.4/Swift 4.1:

print(s2.count) // Thread 1: Fatal error: cannot increment beyond endIndex
print(Array(s2.unicodeScalars)) // Never terminates

Is this the expected behavior? Is it my responsibility to check that the UTF-8 indices fall on grapheme cluster boundaries before using them for slicing?

Or is this a bug, and s[from..<to] should return some valid substring?

Regards, Martin

As far as I know, this is the intended behavior. You're supposed to use someIndex.samePosition(in:) or String.Index.init(_:within:) when you're unsure if an index represents a valid position in another view. Both of these APIs are failable.

In your example:

let s = "€"
let u = s.utf8
let from = u.index(u.startIndex, offsetBy: 1)

from.samePosition(in: s) // returns nil
String.Index(from, within: s) // returns nil
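
If you want to slice by UTF-8 offsets without risking a misaligned substring, the check can be wrapped in a small helper. This is only a sketch: characterAlignedSlice(of:utf8Offsets:) is a name I made up, not a standard library API, and it simply returns nil whenever either bound fails the String.Index(_:within:) check shown above.

```swift
// Hypothetical helper (not part of the standard library): slices `s` by
// UTF-8 offsets, but only if both bounds fall on Character boundaries.
func characterAlignedSlice(of s: String, utf8Offsets range: Range<Int>) -> Substring? {
    let u = s.utf8
    guard range.lowerBound >= 0,
          // Walk the UTF-8 view, refusing offsets past the end.
          let lo = u.index(u.startIndex, offsetBy: range.lowerBound, limitedBy: u.endIndex),
          let hi = u.index(u.startIndex, offsetBy: range.upperBound, limitedBy: u.endIndex),
          // Failable conversion: nil if the position is not a Character boundary.
          let from = String.Index(lo, within: s),
          let to = String.Index(hi, within: s)
    else { return nil }
    return s[from..<to]
}

print(characterAlignedSlice(of: "€", utf8Offsets: 1..<2) == nil)   // true
print(characterAlignedSlice(of: "a€b", utf8Offsets: 1..<4)! == "€") // true
```

The cost is that a misaligned range gives you nil instead of a substring, so the caller has to decide what to do in that case.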

I'm not sure if "returning strange data" is the intended behavior or if these invalid accesses should trap. The same problem can occur if you subscript a String with a previously valid index that got invalidated, e.g. because the string was mutated. It might trap or it might return bogus data.

I don't think the Collection protocol mandates that subscript accessors must trap on invalid indices. The documentation only states that it's the caller's responsibility to make sure the index is valid before invoking the subscript:

subscript(position: Self.Index) -> Self.Element { get }

Parameters

position
The position of the element to access. position must be a valid index of the collection that is not equal to the endIndex property.
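
To illustrate what "the caller's responsibility" can look like in practice, here is a minimal sketch of a checked accessor. The name element(ifValid:) is hypothetical, and the indices.contains check can be linear-time for collections such as String, so treat it as an illustration of the contract rather than something to use in a hot path.

```swift
extension Collection {
    // Hypothetical helper: returns the element only if `position` is one of
    // the collection's valid subscript indices, and nil otherwise.
    func element(ifValid position: Index) -> Element? {
        return indices.contains(position) ? self[position] : nil
    }
}

let numbers = [10, 20, 30]
print(numbers.element(ifValid: 1) == 20)  // true
print(numbers.element(ifValid: 3) == nil) // true (3 is endIndex, not a valid subscript index)
```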


You are probably right. However, I still have trouble grasping how the above quote from SE-0180 applies in this situation. In particular,

... and the Characters of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units.

sounds to me as if the result would always be some valid character sequence.

Nah, it does what it says: it determines the characters in the result by performing standard Unicode grapheme breaking on the sequence of code units; in your case, it looks like performing grapheme breaking on the code units results in a runtime trap.
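
For contrast, a case where the quoted rule does seem to give a well-formed result is an index that falls between two Character boundaries but still sits on a Unicode scalar boundary. The following is a small sketch of that situation, assuming a decomposed "é" (base letter plus combining accent); it does not cover indices that land in the middle of a scalar's UTF-8 encoding, which is the case discussed above.

```swift
// "é" written as "e" + U+0301 COMBINING ACUTE ACCENT: one Character, two scalars.
let s = "e\u{301}"
let scalars = s.unicodeScalars
// A scalar boundary that lies inside the single Character.
let mid = scalars.index(after: scalars.startIndex)

let head = s[..<mid]
let tail = s[mid...]

print(s.count)           // 1
print(head == "e")       // true
print(head.count)        // 1
print(tail.count)        // 1 (the combining mark on its own)
print(Array(tail.utf8))  // [204, 129]
```

Here each slice contains whole scalars, so grapheme breaking on the resulting code units has valid input to work with.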