TLDR: SE-0180 is fundamentally incompatible with BidirectionalCollection
's stated requirements, and this is causing real problems. We should do something about this! (I’m not sure what.)
(Sorry, this is going to be long and complicated. It's one of those cases where superficially things seem to be working fine, but the closer you look, the more rotten things appear.)
BidirectionalCollection
has a requirement about the consistency of indices:
Indices that are moved forward and backward in a bidirectional collection move by the same amount in each direction. That is, for any index
i
into a bidirectional collectionc
:
- If
i >= c.startIndex && i < c.endIndex
,c.index(before: c.index(after: i)) == i
.- If
i > c.startIndex && i <= c.endIndex
c.index(after: c.index(before: i)) == i
.
This seems like a very sensible requirement. However, a strict reading of it effectively disallows the collection from supporting any non-reachable indices, i.e., indices that aren't reachable from startIndex
or endIndex
by repeated applications of index(after:)
/index(before:)
.
This may seem like pointless quibbling, but it actually has some significant practical consequences -- because String
in the Standard Library is a BidirectionalCollection
that supports non-reachable indices!
About half a decade ago, SE-0180 (String Index Overhaul) changed things so that indices can be used interchangeably between all of String
's various views, including String
itself. Therefore, the full set of supported indices in a String
include not just the ones in indices
, but also ones that address individual Unicode Scalars within a Character
, or even individual code units within the string's UTF-8 or UTF-16 encodings.
If we accept "any index" to mean what it normally means, this implies that String
has been violating BidirectionalCollection
's requirements since Swift 4:
let s = "🇺🇸"
// U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
// U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
let i = s.unicodeScalars.index(after: s.startIndex) // regional indicator S
let j = s.index(after: i) // endIndex
let i2 = s.index(before: j) // regional indicator U
print(i == i2) // false!
Let's now see some examples where this violation (sometimes mixed in with Unicode edge cases and bugs in String
's current implementation) leads to highly unexpected/surprising results.
Vexing Edge Case #1
We never really adapted the implementation of basic String
algorithms to fully support non-reachable indices. This makes for truly bizarre behavior sometimes -- a relatively easy to understand example is String.distance(from:to:)
, which simply forwards to the default generic algorithm that assumes that all indices are reachable:
let s = "🇺🇸"
// U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
// U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
let i = s.unicodeScalars.index(after: s.startIndex)
print(s.distance(from: s.startIndex, to: i)) // Fatal error: String index is out of bounds
i
isn't reachable from the start index, so distance
skips right over it and falls off the end of the collection. Oops!
I think this is clearly broken, and I'm working on a PR that fixes this class of problems -- things trapping where they shouldn't, Substring
s with an unreachable endIndex
not properly conforming to Collection
etc.
If the behavior in the current version of my PR survives into a release, s.distance(from: s.startIndex, to: i)
will return 1, which makes more sense to me: we can reach (and go past) i
by going just one step forward from the start.
Similarly, s.distance(from: i, to: s.startIndex)
would return -1, which is nice and symmetric, exactly how I would naively expect distance
would work in a BidirectionalCollection
.
However, as it happens, my naive expectation is wrong -- let's take a closer look.
Vexing Edge Case #2
The way Unicode's grapheme breaking algorithm is defined sometimes makes for surprising results. For example, cutting off the first Unicode scalar from a single Character
sometimes blows up the remaining scalars into multiple splinters:
let s = "\u{1f9df}\u{200d}\u{2640}\u{fe0f}" // 🧟♀️
// U+1F9DF ZOMBIE
// U+200D ZERO WIDTH JOINER
// U+2640 FEMALE SIGN
// U+FE0F VARIATION SELECTOR-16
let start = s.startIndex
print(s[start]) // "🧟♀️"
let i = s.unicodeScalars.index(after: start)
print(s[i]) // "\u{200d}" (zero width joiner)
let j = s.index(after: i)
print(s[j]) // "\u{2640}\u{fe0f}" (emoji variant of the female sign)
let k = s.index(after: j)
print(k == s.endIndex) // true
let i2 = s.index(before: k)
print(s[i2]) // "🧟♀️"
print(i2 < i) // true
Note how the zero width joiner and the emojified female sign become two individual characters when we skip over the base scalar. This is how extended grapheme clusters work in Unicode, and from what I can tell, String
's behavior here also follows the intent of SE-0180 -- the result is "an emergent property of applying the usual Unicode rules for decoding those code units".
The big consequence of this is that String.distance
isn't necessarily symmetric: if we start from i
we need to take two steps to reach the end of the string, but when going backwards, we immediately skip over it in a single step!
print(s.distance(from: i, to: s.endIndex)) // 2
print(s.distance(from: s.endIndex, to: i)) // -1
How is that for an "emergent property"?
(This is with the changes in my draft PR. In current Swift, s.distance(from: s.endIndex, to: i)
actually traps due to the bug above. I think that's even worse.)
From what I can understand, this is a fundamental consequence of SE-0180. We cannot completely fix it without rolling back at least some part of it.
Vexing Edge Case #3
It gets somewhat worse! In some cases, removing a single scalar from the start of an extended grapheme cluster can change the meaning of the rest of the scalars in the string, causing them to rearrange to form entirely different Character
s.
For example, consider regional indicator symbols, a.k.a. flag emojis. These are built up from two-letter codes that identify the corresponding region. For example, the string "🇨🇦🇺🇸"
consists of the following four Unicode scalars:
U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C
U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
Note how these just spell out "CAUS" (as in the country codes CA and US). Of particular note is that the two halves of a single flag are only distinguished by their position, not their value.
Now if we skip over the first scalar C, the three remaining ones get rearranged to spell out "🇸" -- the Australian flag (AU) followed by a degenerate cluster for S.
let s = "🇨🇦🇺🇸" // (CAUS)
let i = s.unicodeScalars.index(after: s.startIndex)
print(s[i]) // "🇦🇺" (AU)
let j = s.index(after: i)
print(s[j]) // "🇸" (S)
let k = s.index(after: j) // endIndex (in my impl)
The US flag was lost! By skipping the first code point, we change not only the character that includes it, but the subsequent character, too. And there is no limit how many characters may be affected, either:
let s = "🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸"
let i = s.unicodeScalars.index(after: s.startIndex)
print(s[i...]) // "🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸"
("" is the Australian flag (AU) followed by the Seychelles flag (SC).)
When i
is not reachable, String.index(after: i)
can evidently sometimes return an unreachable index too. When unreachable indices are involved, we can get a completely different set of characters when we go forward from i
up to j
than when we go backwards from j
down to i
within a single String
!
As in example #2, this also an inherent Unicode weirdness that we cannot fix without at least partially rejecting SE-0180.
But like example #1, there are implementation bugs here, too. Per SE-0180, the substring s[i...]
ought to behave like a string that consists of the same Unicode scalars. The print statement above does behave as if this was the case; unfortunately Substring.index(before:)
in 5.6 simply forwards to the base string's String.index(before:)
implementation, so it may look at (and be affected by) data outside the bounds of the substring -- which is violating all sorts of expectations about how Collections work.
(The bounds of a substring are trivially reachable indices within the substring itself, even if they happen to be nonreachable in the base string. In the substring case this elevates the emergent curiosities above into major violations of basic Collection requirements.)
Summary of Lessons
-
SE-0180 is fundamentally incompatible with
BidirectionalCollection
's stated requirements. If we take the requirements exactly as written, thenString
as of Swift 4 and above does not actually conform toBidirectionalCollection
. If we want to fix this, we need to tweakSE-0180
,BidirectionalCollection
, or both. -
In Swift 5.6 and below, basic
String
algorithms likedistance(from:to:)
,index(_:offsetBy:)
andindex(_:offsetby:limitedBy:)
do not work correctly when given unreachable indices. Additionally,Substring
s whose bounds are unreachable are currently pretty much completely broken, clearly violating both the intent and the letter of SE-0180, as well asCollection
's requirements. -
Unicode's grapheme breaking algorithms can lead to funny edge case behavior sometimes.
Potential Solutions
I don't think we can (or should) do anything about the last issue, but I can see multiple ways we could tackle the first two. The four obvious ideas I can think of are:
-
Leave things as they are and pretend that things are fine (until someone complains loudly enough).
Why mess with something few people are complaining about? I have to assume that relatively few people use scalar (or UTF-8/16) indices on
String
, and the few who do limit themselves to simple subscript invocations, never passing these indices to anything else. (Because I hope otherwise people would be complaining very loudly about all the nonsensical runtime errorsString
tends to currently run into.)However, I think it's only a matter of time until these issues cause real problems. Also, the current broken state probably leads folks to think that the issues are entirely due to Unicode's complexity, which gives the wrong impression. It's true that the technical details of Unicode contribute to the issue, but that isn't nearly the full story.
I hate this option.
-
Adjust
BidirectionalCollection
's docs to replace "any indexi
" with "any reachable indexi
" above, allowing bidirectional collections to support non-reachable indices. Where possible/feasible, update default implementations accordingly, or document the previously implicit expectations.This would formalize existing practice by adapting
BidirectionalCollection
toString
instead of vice versa. Unfortunately, it does nothing to fix existing algorithms that were written assuming that all indices are reachable.The current default implementations of
BidirectionalCollection
's algorithms (such asdistance(from:to:)
clearly cannot cope with unreachable indices. We could generalize some of them by replacing their use of==
in loop termination conditions with<
, but this isn't always enough.For example, the default
SubSequence
typeSlice
simply forwards its algorithms to the base collection. However, this doesn't always work correctly if the slice's bounds happen to fall on unreachable indices -- in the case of aString
, such slices can have direction-sensitive counts. (Substring
can fix these by limiting grapheme breaking to work within the substring bounds. This requires access to the low-level implementation that simply isn't possible in the genericSlice
case.)Worse, even changing
==
to<
can lead to bad complications. For certain (rare) collection types (linked lists come to mind),Index.<
may be dramatically slower than==
, and changing the default algorithms could lead to such a drastic performance regression that it would effectively be a source breaking change.(
Index.<
is currently also relatively rarely exercised, which means some high-profile collection types could easily be shipping broken implementations; such bugs would suddenly become critical issues after this change.)I hate this option, too.
-
Reverse SE-0180's explicitly (if very loosely) stated intention about the semantics of such indices and have
String
round all indices down to the nearest wholeCharacter
boundary. (Similar to how it already rounds indices that address individual code units down to the nearest whole Unicode scalar boundary.)In this world,
String
would still have non-reachable indices, but they would be treated as equivalent to a reachable index. This would work great at hiding most of the Unicode weirdness like examples #2 and #3. Unfortunately,String.Index.==
which would need to continue to distinguish between indices pointing to different parts of storage, which means that we would still have the same issues with<
vs==
in termination conditions.This would also be a breaking change, as it would change the return value of
s[i]
ands[i ..< j]
onString
. If the breakage was unmitigated, existing code (and binaries!) would start returning unexpected results. (Mitigating this would be very difficult, but perhaps not impossible. For example, we could perhaps somehow tie it to a language version. This involves parts of the stdlib that aren't inlinable (and must remain non-inlinable), which makes things both easier and far more tricky at the same time.)I hate this option as well.
-
Completely roll back SE-0180, and reintroduce separate index types for each string view.
This would fix all of these problems, but it would be massively source- and ABI-breaking. In theory, we could mitigate these by introducing new language features (such as an ABI-transparent "newtype" facility), but it'd take a lot effort implement that, and (afaik) this isn't currently planned.
Even without the ABI issues, I believe Swift is long past the point where we can make sweeping API overhauls to such a fundamental type as String; I suspect this rules out redefining its Index types.
I hate all four of these options.
As a lazy Standard Library engineer whose job depends on not breaking existing code (or, worse, existing binaries) if at all possible, I'm leaning towards going with some sort of intellectually unsatisfying compromise between options (1) or (2).
As linked above, I have a draft PR up https://github.com/apple/swift/pull/41417 that aims to see just how many SE-0180-related bugs we can fix without revising that proposal. (And while we're it, to also fix some long-standing index validation issues that aren't strictly SE-0180-related. Some of these I have already spun off into separate PRs (#41598, #41572, with more spinoffs likely coming later.)
Unfortunately, fixing the fundamental conflict with BidirectionalCollection would most likely require an intervention from Swift Evolution -- but at this point I am sort of at loss as to what exactly we should propose to do about it.
Ideas please!