String Index unification vs BidirectionalCollection requirements

lorentey · March 15, 2022, 5:31am

TLDR: SE-0180 is fundamentally incompatible with BidirectionalCollection's stated requirements, and this is causing real problems. We should do something about this! (I’m not sure what.)

(Sorry, this is going to be long and complicated. It's one of those cases where superficially things seem to be working fine, but the closer you look, the more rotten things appear.)

BidirectionalCollection has a requirement about the consistency of indices:

Indices that are moved forward and backward in a bidirectional collection move by the same amount in each direction. That is, for any index i into a bidirectional collection c :

If i >= c.startIndex && i < c.endIndex , c.index(before: c.index(after: i)) == i .

If i > c.startIndex && i <= c.endIndex c.index(after: c.index(before: i)) == i .

This seems like a very sensible requirement. However, a strict reading of it effectively disallows the collection from supporting any non-reachable indices, i.e., indices that aren't reachable from startIndex or endIndex by repeated applications of index(after:)/index(before:).

(By this requirement, the successor/predecessor of a non-reachable index must also be non-reachable (or the index would actually be reachable). But this wouldn't be practical -- for example, it seems natural to expect that loops that call `index(after:)` would always eventually reach the `endIndex`, even if they started from a non-reachable index.)

This may seem like pointless quibbling, but it actually has some significant practical consequences -- because String in the Standard Library is a BidirectionalCollection that supports non-reachable indices!

About half a decade ago, SE-0180 (String Index Overhaul) changed things so that indices can be used interchangeably between all of String's various views, including String itself. Therefore, the full set of supported indices in a String include not just the ones in indices, but also ones that address individual Unicode Scalars within a Character, or even individual code units within the string's UTF-8 or UTF-16 encodings.

If we accept "any index" to mean what it normally means, this implies that String has been violating BidirectionalCollection's requirements since Swift 4:

let s = "🇺🇸"
// U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
// U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S

let i = s.unicodeScalars.index(after: s.startIndex) // regional indicator S
let j = s.index(after: i) // endIndex
let i2 = s.index(before: j) // regional indicator U

print(i == i2) // false!

Let's now see some examples where this violation (sometimes mixed in with Unicode edge cases and bugs in String's current implementation) leads to highly unexpected/surprising results.

Vexing Edge Case #1

We never really adapted the implementation of basic String algorithms to fully support non-reachable indices. This makes for truly bizarre behavior sometimes -- a relatively easy to understand example is String.distance(from:to:), which simply forwards to the default generic algorithm that assumes that all indices are reachable:

let s = "🇺🇸"
// U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
// U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
let i = s.unicodeScalars.index(after: s.startIndex)
print(s.distance(from: s.startIndex, to: i)) // Fatal error: String index is out of bounds

i isn't reachable from the start index, so distance skips right over it and falls off the end of the collection. Oops!

I think this is clearly broken, and I'm working on a PR that fixes this class of problems -- things trapping where they shouldn't, Substrings with an unreachable endIndex not properly conforming to Collection etc.

If the behavior in the current version of my PR survives into a release, s.distance(from: s.startIndex, to: i) will return 1, which makes more sense to me: we can reach (and go past) i by going just one step forward from the start.

Similarly, s.distance(from: i, to: s.startIndex) would return -1, which is nice and symmetric, exactly how I would naively expect distance would work in a BidirectionalCollection.

However, as it happens, my naive expectation is wrong -- let's take a closer look.

Vexing Edge Case #2

The way Unicode's grapheme breaking algorithm is defined sometimes makes for surprising results. For example, cutting off the first Unicode scalar from a single Character sometimes blows up the remaining scalars into multiple splinters:

let s = "\u{1f9df}\u{200d}\u{2640}\u{fe0f}" // 🧟‍♀️
// U+1F9DF ZOMBIE
// U+200D ZERO WIDTH JOINER
// U+2640 FEMALE SIGN
// U+FE0F VARIATION SELECTOR-16

let start = s.startIndex
print(s[start]) // "🧟‍♀️"

let i = s.unicodeScalars.index(after: start)
print(s[i]) // "\u{200d}" (zero width joiner)

let j = s.index(after: i)
print(s[j]) // "\u{2640}\u{fe0f}" (emoji variant of the female sign)

let k = s.index(after: j)
print(k == s.endIndex) // true

let i2 = s.index(before: k)
print(s[i2]) // "🧟‍♀️"

print(i2 < i) // true

Note how the zero width joiner and the emojified female sign become two individual characters when we skip over the base scalar. This is how extended grapheme clusters work in Unicode, and from what I can tell, String's behavior here also follows the intent of SE-0180 -- the result is "an emergent property of applying the usual Unicode rules for decoding those code units".

The big consequence of this is that String.distance isn't necessarily symmetric: if we start from i we need to take two steps to reach the end of the string, but when going backwards, we immediately skip over it in a single step!

print(s.distance(from: i, to: s.endIndex)) // 2
print(s.distance(from: s.endIndex, to: i)) // -1

How is that for an "emergent property"?

(This is with the changes in my draft PR. In current Swift, s.distance(from: s.endIndex, to: i) actually traps due to the bug above. I think that's even worse.)

From what I can understand, this is a fundamental consequence of SE-0180. We cannot completely fix it without rolling back at least some part of it.

Vexing Edge Case #3

It gets somewhat worse! In some cases, removing a single scalar from the start of an extended grapheme cluster can change the meaning of the rest of the scalars in the string, causing them to rearrange to form entirely different Characters.

For example, consider regional indicator symbols, a.k.a. flag emojis. These are built up from two-letter codes that identify the corresponding region. For example, the string "🇨🇦🇺🇸" consists of the following four Unicode scalars:

U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C
U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S

Note how these just spell out "CAUS" (as in the country codes CA and US). Of particular note is that the two halves of a single flag are only distinguished by their position, not their value.

Now if we skip over the first scalar C, the three remaining ones get rearranged to spell out "🇸" -- the Australian flag (AU) followed by a degenerate cluster for S.

let s = "🇨🇦🇺🇸" // (CAUS)

let i = s.unicodeScalars.index(after: s.startIndex)
print(s[i]) // "🇦🇺" (AU)
let j = s.index(after: i)
print(s[j]) // "🇸" (S)
let k = s.index(after: j) // endIndex (in my impl)

The US flag was lost! By skipping the first code point, we change not only the character that includes it, but the subsequent character, too. And there is no limit how many characters may be affected, either:

let s = "🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸"
let i = s.unicodeScalars.index(after: s.startIndex)
print(s[i...]) // "🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸🇨🇦🇺🇸"

("" is the Australian flag (AU) followed by the Seychelles flag (SC).)

When i is not reachable, String.index(after: i) can evidently sometimes return an unreachable index too. When unreachable indices are involved, we can get a completely different set of characters when we go forward from i up to j than when we go backwards from j down to i within a single String!

As in example #2, this also an inherent Unicode weirdness that we cannot fix without at least partially rejecting SE-0180.

But like example #1, there are implementation bugs here, too. Per SE-0180, the substring s[i...] ought to behave like a string that consists of the same Unicode scalars. The print statement above does behave as if this was the case; unfortunately Substring.index(before:) in 5.6 simply forwards to the base string's String.index(before:) implementation, so it may look at (and be affected by) data outside the bounds of the substring -- which is violating all sorts of expectations about how Collections work.

(The bounds of a substring are trivially reachable indices within the substring itself, even if they happen to be nonreachable in the base string. In the substring case this elevates the emergent curiosities above into major violations of basic Collection requirements.)

Summary of Lessons

SE-0180 is fundamentally incompatible with BidirectionalCollection's stated requirements. If we take the requirements exactly as written, then String as of Swift 4 and above does not actually conform to BidirectionalCollection. If we want to fix this, we need to tweak SE-0180, BidirectionalCollection, or both.
In Swift 5.6 and below, basic String algorithms like distance(from:to:), index(_:offsetBy:) and index(_:offsetby:limitedBy:) do not work correctly when given unreachable indices. Additionally, Substrings whose bounds are unreachable are currently pretty much completely broken, clearly violating both the intent and the letter of SE-0180, as well as Collection's requirements.
Unicode's grapheme breaking algorithms can lead to funny edge case behavior sometimes.

Potential Solutions

I don't think we can (or should) do anything about the last issue, but I can see multiple ways we could tackle the first two. The four obvious ideas I can think of are:

Leave things as they are and pretend that things are fine (until someone complains loudly enough).

Why mess with something few people are complaining about? I have to assume that relatively few people use scalar (or UTF-8/16) indices on String, and the few who do limit themselves to simple subscript invocations, never passing these indices to anything else. (Because I hope otherwise people would be complaining very loudly about all the nonsensical runtime errors String tends to currently run into.)

However, I think it's only a matter of time until these issues cause real problems. Also, the current broken state probably leads folks to think that the issues are entirely due to Unicode's complexity, which gives the wrong impression. It's true that the technical details of Unicode contribute to the issue, but that isn't nearly the full story.

I hate this option.
Adjust BidirectionalCollection's docs to replace "any index i" with "any reachable index i" above, allowing bidirectional collections to support non-reachable indices. Where possible/feasible, update default implementations accordingly, or document the previously implicit expectations.

This would formalize existing practice by adapting BidirectionalCollection to String instead of vice versa. Unfortunately, it does nothing to fix existing algorithms that were written assuming that all indices are reachable.

The current default implementations of BidirectionalCollection's algorithms (such as distance(from:to:) clearly cannot cope with unreachable indices. We could generalize some of them by replacing their use of == in loop termination conditions with <, but this isn't always enough.

For example, the default SubSequence type Slice simply forwards its algorithms to the base collection. However, this doesn't always work correctly if the slice's bounds happen to fall on unreachable indices -- in the case of a String, such slices can have direction-sensitive counts. (Substring can fix these by limiting grapheme breaking to work within the substring bounds. This requires access to the low-level implementation that simply isn't possible in the generic Slice case.)

Worse, even changing == to < can lead to bad complications. For certain (rare) collection types (linked lists come to mind), Index.< may be dramatically slower than ==, and changing the default algorithms could lead to such a drastic performance regression that it would effectively be a source breaking change.

(Index.< is currently also relatively rarely exercised, which means some high-profile collection types could easily be shipping broken implementations; such bugs would suddenly become critical issues after this change.)

I hate this option, too.
Reverse SE-0180's explicitly (if very loosely) stated intention about the semantics of such indices and have String round all indices down to the nearest whole Character boundary. (Similar to how it already rounds indices that address individual code units down to the nearest whole Unicode scalar boundary.)

In this world, String would still have non-reachable indices, but they would be treated as equivalent to a reachable index. This would work great at hiding most of the Unicode weirdness like examples #2 and #3. Unfortunately, String.Index.== which would need to continue to distinguish between indices pointing to different parts of storage, which means that we would still have the same issues with < vs == in termination conditions.

This would also be a breaking change, as it would change the return value of s[i] and s[i ..< j] on String. If the breakage was unmitigated, existing code (and binaries!) would start returning unexpected results. (Mitigating this would be very difficult, but perhaps not impossible. For example, we could perhaps somehow tie it to a language version. This involves parts of the stdlib that aren't inlinable (and must remain non-inlinable), which makes things both easier and far more tricky at the same time.)

I hate this option as well.
Completely roll back SE-0180, and reintroduce separate index types for each string view.

This would fix all of these problems, but it would be massively source- and ABI-breaking. In theory, we could mitigate these by introducing new language features (such as an ABI-transparent "newtype" facility), but it'd take a lot effort implement that, and (afaik) this isn't currently planned.

Even without the ABI issues, I believe Swift is long past the point where we can make sweeping API overhauls to such a fundamental type as String; I suspect this rules out redefining its Index types.

I hate all four of these options.

As a lazy Standard Library engineer whose job depends on not breaking existing code (or, worse, existing binaries) if at all possible, I'm leaning towards going with some sort of intellectually unsatisfying compromise between options (1) or (2).

As linked above, I have a draft PR up https://github.com/apple/swift/pull/41417 that aims to see just how many SE-0180-related bugs we can fix without revising that proposal. (And while we're it, to also fix some long-standing index validation issues that aren't strictly SE-0180-related. Some of these I have already spun off into separate PRs (#41598, #41572, with more spinoffs likely coming later.)

Unfortunately, fixing the fundamental conflict with BidirectionalCollection would most likely require an intervention from Swift Evolution -- but at this point I am sort of at loss as to what exactly we should propose to do about it.

Ideas please!

lorentey · March 15, 2022, 6:13am

FWIW, until very recently, I considered non-reachable indices to be largely outside the scope of Collection requirements — which is basically the “everything is fine” view. Vexing examples #2 and #3 above convinced me to adjust this somewhat. E.g., I did not previously consider that distance could become fully asymmetric.

SDGGiesbrecht · March 15, 2022, 6:45am

Welcome to my world.

When SE‐0180 went live all sorts of things started going wrong. Internally, the methods used to explicitly round up or down to valid indices broke because the APIs they used suddenly started “helpfully” pre‐rounding. Externally, with a hole blown through type safety, clients started finding all sorts of ways to break algorithms by feeding mismatched indices.

When it was brought up here at Swift, the response was basically that SE‐0180 was so useful for normal developers, that the inconveniences it caused for niche developers (you and me) exercising strings in wacky ways should just be accepted in the trade‐off. These issues were not unique to strings, because LazyFilterCollection or some such exhibited the same behaviours. So essentially option 0 was consciously chosen and it survived until you posted just now.

There were a few more months of aggravating issue discovery, but eventually they tapered off. Documentation was improved to be more explicit about algorithms only supporting indices acquired from the collection instance itself, not a derived view or filter, and that SE‐0180 did not apply to the API in question. Guards were added wherever clients supplied indices, so that only valid and reachable indices ever reached deeper into the libraries. In time, the skies turned blue again.

lorentey · March 15, 2022, 6:52am

Fun historical note: this was actually the originally proposed version of SE-0180; it got changed during review because of code compatibility concerns. (Which got largely thrown out by all the UTF-8 work a year or so later, with minimal problems.) It’s tricky to foresee what will actually cause problems.

(As I implied above, rounding indices down would not have helped with the BidirectionalCollection requirement — that would still be violated.)

taylorswift · March 15, 2022, 7:07am

i usually mix indices when moving between [UInt8] and String.UTF8View, usually through Sequence APIs. i would do everything in [UInt8] when scanning for an ASCII separator, like a slash '/', but some features of String, like human-readable ASCII literals, make that hard to do so.

in the long term, it might be a good idea to introduce an “intermediate” level string type, like UTF8Bytes, so people can offload onto that, and String can focus more on unicode safety.

lorentey · March 15, 2022, 7:43am

Hm, that’s not great. Introducing explicit rounding methods would have helped resolve these problems for everyone. Unfortunately SE-0180 deferred those to a followup proposal that we never did get around to write. It’s never too late, though!

I admit I was firmly in this camp, too, until I took a closer look at these consequences. I only changed my tune this weekend, when I finally accepted that String.distance cannot possibly be made symmetric.

Even with the problems above, I find it hard to dismiss just how helpful SE-0180 was in reducing API-induced noise in String handling code. I do not fondly remember the old Index conversion initializers — in the vast majority of cases, I only needed to use them in the non-failable direction, when they were extremely inconvenient, and they just made string code unreadably complicated.

(People are justifiably complaining about the verbosity of string.index(string.startIndex, offsetBy: 4) even today. Those conversion initializers were even worse.)

Interesting. I view LazyFilterCollection‘s reuse of the underlying collection’s Index type as an obvious API design bug, because of the way it encourages accidental misuse — especially if the underlying Index type is a strideable type like Int.

I think this case is distinct from String‘s. I don’t really look at filtered indices as valid in LazyFilterCollection at all: I consider them more like runtime errors that we fail to diagnose due to performance considerations.

How do you check that an index is reachable in a String? Are you doing a two-step index(before:) + index(after:) dance, or did you have to replicate parts of the grapheme breaking algorithm? (I hope it’s just the dance!)

I suspect the stdlib ought to provide direct tools to help with this, if only to allow performance improvements — the before+after jiggle works great, but I think it includes some redundant/unnecessary work that we could eventually optimize away.

If nothing else, adding explicit index rounding methods would definitely help with API usability for people who need to do precise Unicode work. We should probably do it soon.

However, would that be enough to revert to the “everything is fine” position about to non-reachable indices in a BidirectionalCollection?

It would at least let us point to a way to normalize indices.

Hm, how would this new type solve String’s problems with BidirectionalCollection? IIUC, String.UTF8View is already largely free of the issues I posted, so introducing an alternative to it probably would not have much of an impact at all.

SDGGiesbrecht · March 15, 2022, 9:01am

Yes, the reduction of verbosity was an improvement. In hindsight it might have been accomplished better by adding overloaded subscripts to the views instead of unifying the index type. The logically non‐failable ones could have just returned the respective standard slice, and the twilight zone ones we are talking about here could have returned new strings or unique types or not been provided in the first place. Any of those options would have insulated them from interacting with collection related conformances.

I suspect you could get a nearly source compatible implementation like that which preserves the useful bits but turns the nonsense into compiler errors. But when it comes to ABI compatibility, it would require all sorts of magic that likely does not exist.

I realize this comment likely does not provide anything actionable regarding the current state of things. I guess I am mostly just thinking out loud.

In various ways. Each instance had a tradeoff between thoroughness and performance. Their purpose was mostly to train clients to be more careful on their end, not to make invalid calls survivable. Due to things like flag series you mentioned, you have to walk a long way to tell for sure.

bobergj · March 15, 2022, 9:22am

Another data point: Assuming I know the index comes from another of the String's views I use String.index.samePosition(in:). Where necessary, iterating the index backwards or forwards in its originating String view until samePosition indicates a valid result. It would be nice if stdlib came with a rounding function, but it's not a big deal.

Nevin · March 15, 2022, 2:25pm

While I agree that you have highlighted some important issues, I do not see how this claim follows:

The stated requirement that you quote says (paraphrased), “For any index in startIndex ..< endIndex, calling both index(before:) and index(after:) in either order gets you back to the same index where you began.”

It does not say anything about unreachable indices, except that if you start from one and take a step forward and a step backward, you end up at the same (unreachable) index. This is exactly how I would expect unreachable indices to behave.

Are you saying that String violates that?

• • •

The documentation for Collection includes this passage:

That is an explicit statement that unreachable indices are emphatically not valid. Furthermore, the documentation for index(after:), distance(from:to:) and other such operations all include the parameter description:

i: A valid index of the collection.

So the behavior you describe is in fact the direct result of a programmer violating the documented requirements of the methods they are calling.

• • •

The behavior of SE–180 is also quite clear: subscripting with an unreachable (and hence invalid) index is handled by rounding to a valid index.

xwu · March 15, 2022, 5:01pm

When String advances to the index after an unreachable index, you get a reachable index as a result*. When String advances to the index before that reachable index, you get the preceding reachable index, which is not the unreachable index you started with. It cannot give you back the same (unreachable) index you started with...because then by construction that index would then be reachable from another reachable index.

*) This does point to a way of reconciling SE-0180 and BidirectionalCollection requirements without tweaking either the proposal or the protocol's semantic requirements: string.index(after: unreachableIndex) can refuse to give a reachable index as a result; in the most straightforward implementation, it would be a runtime trap to pass an unreachable index as an argument (likewise for index(before:), distance(from:to:), etc.). This could be practically very breaking but, as you point out, all current uses of index(after:) with an unreachable index are documented as a violation of the requirements of the method and not guaranteed to have any stated behavior. (This does not preclude SE-0180 guaranteeing that subscripting a string with an unreachable index is handled by rounding.)

taylorswift · March 15, 2022, 5:11pm

part of the reason why there is demand for correctness-challenged String APIs, and why we have those APIs today, is because it is very difficult to work with [UInt8], so everyone who is actually using ASCII buffers in a String wrapper complains loudly about String pedantry.

String.UTF8View isn’t really a solution, it’s more of a bridge between [UInt8] storage representations, and String/Unicode.Scalar literals.

lorentey · March 15, 2022, 8:54pm

The problem with this is that it's not just subscripts -- to have a consistent API, every method that takes an index would need to have multiple overloads. API begets more API, endlessly.

Creating a String index protocol would've reduced this to two overloads per method; this works okay for RangeExpression.

This makes sense.

That works fine, but it's a bit wasteful -- each invocation of String.Index.samePosition(in:) does the full job of rounding down the index, only to compare the result against the original index and then discard it.

It seems to me exposing direct stdlib API to align indices would definitely be useful!

Yes. See the example immediately preceding Vexing Edge Case #1.

This is not what SE-180 requires. To quote directly from the proposal:

let s = "e\u{301}galite\u{301}"           // "égalité"
let i = Array(s.unicodeScalars.indices)
print(s[i[1]...])                         // "◌́galité"
print(s[..<i.last!])                      // "égalite"
print(s[i[1])                             // "◌́"

Note the last example.

I believe this clarification was added to try putting the genie back in the bottle, shortly after SE-0180 landed. This is not an indefensible viewpoint, but I have two issues with it:

BidirectionalCollection's requirement talks about "all indices", not "all valid indices". (See solution #1 above.)
It shifts the blame for bad results from the library to clients. If we think String.distance(from:to:) must not be given unreachable indices, then ideally the library should actively diagnose this, by reliably trapping. (Or rounding, I guess.)

Also, the library should provide better tools for normalizing indices. The failable conversion methods are a quick short term fix that we got stuck with for 5 years.

xwu:

*) This does point to a way of reconciling SE-0180 and BidirectionalCollection requirements without tweaking either the proposal or the protocol's semantic requirements: string.index(after: unreachableIndex) can refuse to give a reachable index as a result; in the most straightforward implementation, it would be a runtime trap to pass an unreachable index as an argument (likewise for index(before:) , distance(from:to:) , etc.). This could be practically very breaking but, as you point out, all current uses of index(after:) with an unreachable index are documented as a violation of the requirements of the method and not guaranteed to have any stated behavior. (This does not preclude SE-0180 guaranteeing that subscripting a string with an unreachable index is handled by rounding.)

I think I agree. Internally, we can mitigate the runtime performance impact of having to validate indices by adding a bit to String.Index that signals whether it was generated by String itself, or some other, non-Character-aligned view.

Note that SE-0180 explicitly allows creating a subscript whose bounds are unreachable indices. index(after:), distance(from:to:) etc will need to work in such a substring, and to a certain extent, the indices it vends ought to be compatible with the base string.

lorentey · March 15, 2022, 9:35pm

That has been very helpful, thanks all. I think I have a plan that may work:

Solution #1 seems to be the winner.

Adjust BidirectionalCollection's docs to replace "any index i" with "any reachable index i" above, allowing bidirectional collections to support non-reachable indices. Where possible/feasible, update default implementations accordingly, or document the previously implicit expectations.

The right wording is "any valid index", repeating Collection's definition of what a valid index is.

I think this may count as a doc fix, therefore not requiring a separate evolution proposal, although I will probably run this by a Core Team member just to make sure.
We can fully embrace the requirement that String.index(after:), String.distance(from:to:), .index(_:offsetBy:) etc must be given "valid" (i.e., reachable) indices by improving these methods to reliably detect when that's not the case, and trap.
Detecting that a random index is on a Character boundary is a quite expensive operation -- it currently requires taking a full step back and forth (i == index(after: index(before: i)))), which we do not want to run in each and every invocation of String.index(after:) itself. Luckily, we don't have to -- we can have keep track of whether an index is known to be character-aligned within the index value itself, and only do the alignment jiggle when that flag isn't set. So regular code that doesn't mix indices from different views will see minimal/no performance impact.
If trapping is too much of a potential compatibility problem, then we can have these methods round down to the nearest reachable index instead. (Possibly emitting a runtime warning, and still trapping if the executable was linked with some future version of the Swift Standard Library.) This is solution #2 above.

Rounding indices down resolves all of the Unicode weirdnesses that lead to distance(from:to:) being ill-defined. subscript's behavior will be a bit weird/inconsistent, but that is arguably by design.

(Note that subscripting operations (either with an individual index or a range/range expression) must not round down the indices they are given, because that would violate SE-0180. (Making those round too would require an evolution proposal & probably a bunch of compatibility mitigation work.))
At minimum, the documentation of these methods on String/Substring need to explicitly call out that they cannot be used on unaligned/unreachable/"invalid" indices.

I believe this is just a doc fix that doesn't need a proposal.
We really, really, really need to improve String's index portability story by adding explicit, public methods that round indices down to the nearest reachable/aligned/valid index. The existing failable conversions are throwing away the exact piece of data that people need, for no good reason.

This will require a short proposal, fulfilling the promise in SE-0180 itself.

SDGGiesbrecht · March 15, 2022, 9:58pm

Yes, that sounds like the standard library equivalent of what our libraries did, which seems to have worked out okay in the end.

Remembering that we are juggling a tradeoff here and a partial net may be better than a full net, I ask this just for clarity: Is the dance step sufficient in your example?

That is a wonderful relief.

Does Substring also need to be repaired so that it properly arrives at its start and end indices even when they would be invalid for the parent string, and so that stepping away from said indices properly arrives at the first actual boundary by the parent string’s definition? Or does that naturally happen as a result of your other points?

lorentey · March 15, 2022, 11:06pm

Yes! index(before:) implements a full search for the start of the flag run, otherwise it would get out of sync with index(after:). (It also looks back in a couple of other edge cases; iirc those are more limited.)

I think we simply need to rewrite Substring so that its indexing methods never ever look outside the bounds of the substring. (The PR I have already implements most of this work.) SE-0180 explicitly requires irregular substrings to work like strings that contain precisely the same Unicode scalars, so that's what I'll need to implement.

This means within substrings that have bounds that are invalid indices in the base string, sometimes indices that are considered valid in the substring will be invalid in their base strings. This is somewhat weird, but it follows directly from SE-0180.

(Subscripting (incl. range subscripting) the base string with substring indices will still usually return the same character as in the substring itself, except in cases where that cannot possibly work -- e.g. if the substring ended on a partial character in its base, then its last index will still address the full character in the base string. This seems fine to me!)

lorentey · March 15, 2022, 11:28pm

It seems we did actually end up with a compromise between 1 & 2, although it's up for debate how intellectually unsatisfying it is:

SDGGiesbrecht · March 15, 2022, 11:30pm

Yes, that is fine too. It just ultimately needs its start and end indices to be valid and reachable within itself even when they are not valid for the parent. Any intervening indices need only be internally consistent, not necessarily consistent with the parent. Any collection algorithm that depends on a slice being consistent with its base will not have been able to acquire a misaligned slice through the collection conformance anyway.

masters3d · March 16, 2022, 12:26pm

It would be great if the fix aligned Array and String API so that trappings was not needed. I am personally allergic to trapping APIs and much rather deal with optional returns or throwing.

Edit: reference Behavior of index(after:)

wowbagger · March 24, 2022, 3:03pm

I think String can be visualized as something like a bidirectional skip list, whose levels from bottom to top contain code units, code points, and grapheme clusters, respectively.

So, alternatively, what if a new MultiLevelBidirectionalCollection protocol is introduced which is refined by BidirectionalCollection? The new protocol can have more appropriate requirements for collection types that have multiple levels of indexing.

It still might not help against vexing edge case #3, though.

Karl · March 24, 2022, 4:16pm

Isn't the point that this violates Collection's requirements?

My understanding of Collection is that the indices of a slices must all be valid in the parent, and must return the same values. "intervening indices need only be internally consistent, not necessarily consistent with the parent" very much goes against everything I've ever understood about these types. The power of slices very much comes from the idea that they adjust the startIndex and endIndex, but otherwise forward (almost) everything to their parents. With that small tweak to the collection conformance, pretty much everything else "just works".

People compose algorithms. You may obtain a substring through a string-specific method and pass that in to an algorithm or data structure which operates on generic Collections.

I'd rather we revisited SE-0180 than break the relationship between a slice and its parent/base collection, if that is an option.