I ran into this question on StackOverflow, which got me curious:
I'm aware of the caveats associated with breaking up extended grapheme clusters along incorrect boundaries and creating broken substrings with invalid Unicode character sequences.
Is there a safe way to take an index from one string (e.g. stringA below), and transform it so it points to the same range of characters in another string (stringB below), without jumping through the hoop of manually deriving the distance like so:
let stringA = "abc👨👩👧👦xyz"
let stringB = "1234567890ABCDEFGHJI1234567890" // Long enough so we can show the issue below, and not just crash
let fourthCharIndexA = stringA.index(stringA.startIndex, offsetBy: +3)
print(stringA[fourthCharIndexA]) // 👨👩👧👦
print(stringB[fourthCharIndexA]) // Invalid: 4567890ABCDEFGHJI12345678
// EDITED: below used to be `stringB.distance...`, which was a typo.
let distance = stringA.distance(from: stringA.startIndex, to: fourthCharIndexA) // the distance is string-agnostic, right?
let fourthCharIndexB = stringB.index(stringB.startIndex, offsetBy: +distance)
print(stringB[fourthCharIndexB]) // Correct: 4
Furthermore, is there an API to do the same transformation, but for a Range<String.Index>?
Perhaps. Though I see the utility in a case where you want to do a bunch of O(1) accesses by an Int index, this sorta case doesn't seem to warrant jumping through those hoops, IMHO.
As you stated in “Substring is your friend” over on Stack Overflow, using Substring and String.Index sidesteps the whole problem whenever you are dealing with slices, since all slices of the same string share indices with the base string.
But when you are dealing with two separate strings, as in your post here, there is simply no way around it. You have to convert the index to an offset and then convert that offset to an index in the new string. If you have to do it a lot, you can make it more ergonomic by extending Range with a map method and String.Index with a method that does the offset‐and‐back‐again conversion. But it will still require iterating the string prefixes under the hood, and there is by definition nothing you can do to sidestep it.
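A minimal sketch of what such extensions might look like. The method name `translated(from:to:)` is made up for illustration and is not part of the standard library, and instead of a generic map on Range this sketch uses a simpler variant constrained to String.Index bounds. Both methods return nil when the target string is too short:

```swift
extension String.Index {
    /// Hypothetical helper: finds the index in `target` that sits at the same
    /// Character offset as this index does in `source`.
    /// Returns nil if `target` has fewer Characters than that offset.
    func translated(from source: String, to target: String) -> String.Index? {
        // O(n): walks the prefix of `source` to count Characters.
        let offset = source.distance(from: source.startIndex, to: self)
        // O(n) again: walks the prefix of `target`, stopping at its end.
        return target.index(target.startIndex, offsetBy: offset, limitedBy: target.endIndex)
    }
}

extension Range where Bound == String.Index {
    /// Hypothetical helper: translates both bounds with the method above.
    func translated(from source: String, to target: String) -> Range<String.Index>? {
        guard let lower = lowerBound.translated(from: source, to: target),
              let upper = upperBound.translated(from: source, to: target)
        else { return nil }
        return lower..<upper
    }
}

let sourceText = "a\u{301}bc🎉xyz" // 7 Characters; "a" + U+0301 forms one Character
let targetText = "1234567890"

let indexInSource = sourceText.index(sourceText.startIndex, offsetBy: 3)
if let indexInTarget = indexInSource.translated(from: sourceText, to: targetText) {
    print(targetText[indexInTarget]) // 4
}

let rangeInSource = indexInSource..<sourceText.index(indexInSource, offsetBy: 2)
if let rangeInTarget = rangeInSource.translated(from: sourceText, to: targetText) {
    print(targetText[rangeInTarget]) // 45
}
```

The call site gets shorter, but each translation still walks both string prefixes, so the cost is unchanged.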
I believe the way you compute distance is incorrect. You are essentially using indices from stringA to index into stringB. You can see this by changing stringA:
let stringA = "ábc👨👩👧👦xyz"
// ...
stringB.distance(from: stringA.startIndex, to: fourthCharIndexA) // 4
stringA.distance(from: stringA.startIndex, to: fourthCharIndexA) // 3
I believe the only safe way of converting indices between collections is indeed a combination of distance(from:to:) and index(_:offsetBy:).
Ah, no. I know the string matters (that distance(from:to:) is being called on), but I wasn't fully confident that the resulting distance: Int is fully safe to use elsewhere. As far as I can tell, it's a count of the number of characters (not tied to any particular underlying encoding), so I figured it's string-agnostic, but I wasn't fully sure.
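A quick way to check that assumption is to compare the Character distance with the counts of the encoded views. The sample strings below are made up for illustration:

```swift
let sample = "a\u{301}bc🎉" // "a" + combining acute accent, "b", "c", "🎉"

// distance(from:to:) on a String counts Characters (grapheme clusters).
let characterDistance = sample.distance(from: sample.startIndex, to: sample.endIndex)
print(characterDistance)            // 4

// The encoded views all give different counts for the same text.
print(sample.unicodeScalars.count)  // 5
print(sample.utf16.count)           // 6
print(sample.utf8.count)            // 9

// Because the distance is a plain Character count, it can be applied to
// another string, provided that string has at least that many Characters.
let other = "12345"
let otherIndex = other.index(other.startIndex, offsetBy: characterDistance)
print(other[..<otherIndex])         // 1234
```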
It’s safe as long as you’re using the same version of Unicode. So within the program execution you’d be fine.
I wouldn’t advise encoding the string and offset into JSON and loading it on a different device or a year later though. Grapheme cluster definitions can change, and then you might be pointing at something other than what you thought.
Instead of persisting a string and subrange, persist the three component strings: (1) up to the subrange, (2) the subrange, (3) everything after the subrange. That way the break points won’t move no matter what Unicode decides to redefine. You can always rejoin them on the other end (counting their new lengths beforehand if necessary).
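One way this could look, as a sketch only. The type `PersistedSelection` and its members are hypothetical names, and JSON via Codable is just one possible storage format:

```swift
import Foundation

/// Hypothetical container that stores the three pieces instead of a
/// string plus a Character offset.
struct PersistedSelection: Codable {
    let before: String
    let selected: String
    let after: String

    init(in string: String, range: Range<String.Index>) {
        before = String(string[..<range.lowerBound])
        selected = String(string[range])
        after = String(string[range.upperBound...])
    }

    /// Rejoins the pieces and recomputes the range in the joined string.
    func rejoined() -> (string: String, range: Range<String.Index>) {
        let string = before + selected + after
        // Assumption of this sketch: measure the seams in UTF-8 code units,
        // which do not depend on grapheme-breaking rules.
        let utf8 = string.utf8
        let lower = utf8.index(utf8.startIndex, offsetBy: before.utf8.count)
        let upper = utf8.index(lower, offsetBy: selected.utf8.count)
        return (string, lower..<upper)
    }
}

let message = "a\u{301}bc🎉xyz"
let start = message.index(message.startIndex, offsetBy: 3)
let selection = PersistedSelection(in: message, range: start..<message.index(start, offsetBy: 2))

do {
    let data = try JSONEncoder().encode(selection)
    let decoded = try JSONDecoder().decode(PersistedSelection.self, from: data)
    let (string, range) = decoded.rejoined()
    print(string[range]) // 🎉x
} catch {
    print("Round trip failed: \(error)")
}
```

The sketch measures the pieces in UTF-8 code units rather than Characters when rejoining, so the recomputed bounds land on the original seams even if a future Unicode version would merge clusters across them. Counting Characters in each piece instead is closer to the wording above, but it could drift in that situation.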
If the use case is significantly more complicated, then simply converting to an array of characters (as strings) would persist all the original grapheme breaks.
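A rough sketch of that idea, with made-up variable names. Each Character becomes its own String, so an Int offset into the array keeps pointing at the same piece of text regardless of how the joined string would be segmented later:

```swift
let originalText = "a\u{301}bc🎉xyz"

// One array element per Character, captured under the current grapheme rules.
let pieces: [String] = originalText.map { String($0) }
print(pieces.count)               // 7

// Int offsets and ranges now address array elements, not the string itself.
print(pieces[3])                  // 🎉
print(pieces[3..<5].joined())     // 🎉x

// Rejoining yields the original text again.
let rejoinedText = pieces.joined()
print(rejoinedText == originalText) // true
```

An array of strings like this can be persisted with any format that supports string arrays, for example JSON.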