Is there a safe way to shift Range<String.Index>?

AlexanderM · April 16, 2020, 7:36pm

I ran into this question on StackOverflow, which got me curious:

I'm aware of the caveats associated with breaking up extended grapheme clusters along incorrect boundaries, and creating broken substrings with invalid Unicode character sequences.

Is there a safe way to take an index from one string (e.g. stringA below), and transform it so it points to the same range of characters in another string (stringB below), without jumping through the hoop of manually deriving the distance like so:

let stringA = "abc👨‍👩‍👧‍👦xyz"
let stringB = "1234567890ABCDEFGHJI1234567890" // Long enough so we can show the issue below, and not just crash

let fourthCharIndexA = stringA.index(stringA.startIndex, offsetBy: +3)
print(stringA[fourthCharIndexA]) // 👨‍👩‍👧‍👦
print(stringB[fourthCharIndexA]) // Invalid: 4567890ABCDEFGHJI12345678

// EDITED: below used to be `stringB.distance...`, which was a typo.
let distance = stringA.distance(from: stringA.startIndex, to: fourthCharIndexA) // the distance is string-agnostic, right?
let fourthCharIndexB = stringB.index(stringB.startIndex, offsetBy: +distance)

print(stringB[fourthCharIndexB]) // Correct: 4

Furthermore, is there an API that does the same transformation for a Range<String.Index>?

taylorswift · April 16, 2020, 7:39pm

if you are doing this kind of operation a lot would it not make more sense to store the string as a [Character] rather than a String?
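A minimal sketch of what that could look like, assuming the conversion cost is paid once up front. The sample strings and variable names are made up for illustration, and "e\u{301}" stands in for any Character made of more than one Unicode scalar:

```swift
// Convert each string once; afterwards plain Int offsets work on both arrays.
let charactersA = Array("abce\u{301}xyz")   // [Character]; "e\u{301}" is a single Character
let charactersB = Array("1234567890")

let offset = 3                      // the fourth Character in either array
let fromA = charactersA[offset]     // O(1) subscript, no String.Index involved
let fromB = charactersB[offset]

// Convert back to a String only when one is needed.
let sliceB = String(charactersB[offset ..< offset + 2])
```

The trade-off is the extra memory for the arrays and the O(n) conversion in each direction.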

AlexanderM · April 16, 2020, 7:44pm

Perhaps. I can see the utility when you want to do a bunch of O(1) accesses by an Int index, but this sort of case doesn't seem to warrant jumping through those hoops, IMHO.

taylorswift · April 16, 2020, 7:44pm

alternatively, this could work

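// `matchingIndex` is a hypothetical name for illustration.
// Walks both strings in lockstep and returns the index in `b` at the
// position where `target` first occurs in `a`, or nil if there is none.
func matchingIndex(of target:Character, in a:String, appliedTo b:String) -> String.Index?
{
    for (i, j):(String.Index, String.Index) in zip(a.indices, b.indices)
    {
        if a[i] == target
        {
            return j
        }
    }
    return nil
}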

or if you don’t care about the nil case,

b[zip(a.indices, b.indices).first{ a[$0.0] == "👨‍👩‍👧‍👦" }!.1]

SDGGiesbrecht · April 16, 2020, 8:35pm

As you stated under “Substring is your friend” over on Stack Overflow, using Substring and String.Index sidesteps the whole problem whenever you are dealing with slices, since all slices of the same string share indices with the base string.
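As a small illustration of that index sharing (the sample string and names are made up for this sketch):

```swift
let base = "abce\u{301}xyz"          // "e\u{301}" is one Character built from two scalars
let slice = base.dropFirst(3)        // a Substring covering the last four Characters

// An index taken from the slice is also a valid index into the base string.
let index = slice.startIndex
let sameCharacter = base[index] == slice[index]

// The same holds for ranges: no offset arithmetic is needed.
let range = slice.startIndex ..< slice.index(slice.startIndex, offsetBy: 2)
let viaBase = base[range]
let viaSlice = slice[range]
```

Both subscripts read the same storage, so `viaBase` and `viaSlice` hold the same two Characters.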

But when you are dealing with two separate strings like your post here, there is simply no way around it. You have to convert to an offset and then from that offset to the new string. If you have to do it a lot, you can make it more ergonomic by extending Range with a map method and String.Index with a method that does the offset‐and‐back‐again conversion. But it will still require iterating the string prefixes under the hood and there is by definition nothing you can do to sidestep it.
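One possible shape for those two extensions, as a rough sketch. The names `mapBounds` and `converted(from:to:)` are hypothetical and not standard library API, and the code assumes the target string is at least as long as the offset being converted:

```swift
extension Range {
    /// Hypothetical helper: builds a new range by transforming both bounds.
    /// Named `mapBounds` to avoid clashing with `Sequence.map` on countable ranges.
    func mapBounds<T: Comparable>(_ transform: (Bound) -> T) -> Range<T> {
        return transform(lowerBound) ..< transform(upperBound)
    }
}

extension String.Index {
    /// Hypothetical helper: treats `self` as an index into `source` and returns
    /// the index at the same Character offset in `target`.
    /// Both steps walk the string prefix, so this is O(n).
    func converted(from source: String, to target: String) -> String.Index {
        let offset = source.distance(from: source.startIndex, to: self)
        return target.index(target.startIndex, offsetBy: offset)
    }
}

let stringA = "abce\u{301}xyz"
let stringB = "1234567890"

let rangeA = stringA.index(stringA.startIndex, offsetBy: 3)
    ..< stringA.index(stringA.startIndex, offsetBy: 5)
let rangeB = rangeA.mapBounds { $0.converted(from: stringA, to: stringB) }
```

Here `stringA[rangeA]` covers the fourth and fifth Characters of `stringA`, and `stringB[rangeB]` covers the same positions in `stringB`.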

AlexanderM · April 16, 2020, 8:52pm

I suspected so, just wanted to make sure I wasn't missing anything. Here's what I came up with

// Models the offset between a `parent` Collection and another collection made from a `SubSequence` of `parent`
struct CollectionOffset<C: Collection> {
	let offset: Int
	let parent: C
	
	init(of slice: C.SubSequence, in parent: C) {
		self.offset = parent.distance(from: parent.startIndex, to: slice.startIndex)
		self.parent = parent
	}
	
	func convert<C2: Collection>(indexInParent: C.Index, toIndexIn slice: C2) -> C2.Index
		where C.Element == C2.Element, C.Index == C2.Index {
		let distance = parent.distance(from: parent.startIndex, to: indexInParent)
		let distanceInNewSlice = distance - offset
		return slice.index(slice.startIndex, offsetBy: distanceInNewSlice)
	}
	
	func convert<C2: Collection>(rangeInParent: Range<C.Index>, toRangeIn slice: C2) -> Range<C2.Index>
		where C.Element == C2.Element, C.Index == C2.Index {
		let newLowerBound = self.convert(indexInParent: rangeInParent.lowerBound, toIndexIn: slice)
		
		let span = self.parent.distance(from: rangeInParent.lowerBound, to: rangeInParent.upperBound)
		let newUpperBound = slice.index(newLowerBound, offsetBy: span)
		
		return newLowerBound ..< newUpperBound
	}
}


let string = "1234567890"

let rangeIntoOriginalString = string.index(string.startIndex, offsetBy: +4) ..< string.index(string.startIndex, offsetBy: +8)
let substring = string[rangeIntoOriginalString]
precondition(substring == "5678")

let newString = String(substring)

let offset = CollectionOffset(of: substring, in: string)

do { // Demonstrate CollectionOffset.convert(indexInParent:toIndexIn:)
	let indexIntoOriginalString = substring.startIndex
	assert(substring[indexIntoOriginalString] == "5")

	let indexIntoNewString = offset.convert(indexInParent: indexIntoOriginalString, toIndexIn: newString)
	assert(newString[indexIntoNewString] == "5")
}

do { // Demonstrate CollectionOffset.convert(rangeInParent:toRangeIn:)
	let rangeIntoNewString = offset.convert(rangeInParent: rangeIntoOriginalString, toRangeIn: newString)
	assert(newString[rangeIntoNewString] == "5678")
}

I also show a String-specific specialization in my answer to the question in the original post.

What do you think?

SDGGiesbrecht · April 16, 2020, 9:45pm

Yes, that will work.

The constraints of the problem seem strange to me though. My sniffer says the real issue is in the surrounding design.

jenox · April 16, 2020, 10:39pm

AlexanderM:

Is there a safe way to take an index from one string (e.g. stringA below), and transform it so it points to the same range of characters in another string ( stringB below), without jumping through the hoop of manually deriving the distance like so:

let stringA = "abc👨‍👩‍👧‍👦xyz"
let stringB = "1234567890ABCDEFGHJI1234567890" // Long enough so we can show the issue below, and not just crash

let fourthCharIndexA = stringA.index(stringA.startIndex, offsetBy: +3)
print(stringA[fourthCharIndexA]) // 👨‍👩‍👧‍👦
print(stringB[fourthCharIndexA]) // Invalid: 4567890ABCDEFGHJI12345678

let distance = stringB.distance(from: stringA.startIndex, to: fourthCharIndexA) // the distance is string-agnostic, right?
let fourthCharIndexB = stringB.index(stringB.startIndex, offsetBy: +distance)

print(stringB[fourthCharIndexB]) // Correct: 4

I believe the way you compute distance is incorrect. You are essentially using indices from stringA to index into stringB. You can see this by changing stringA:

let stringA = "ábc👨‍👩‍👧‍👦xyz"
// ...
stringB.distance(from: stringA.startIndex, to: fourthCharIndexA) // 4
stringA.distance(from: stringA.startIndex, to: fourthCharIndexA) // 3

I believe the only safe way of converting indices between collections is indeed a combination of distance(from:to:) and index(_:offsetBy:).
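A minimal sketch of that combination for arbitrary collections. The function name `translate` is hypothetical, and it uses `index(_:offsetBy:limitedBy:)` so that a target that is too short yields nil instead of a crash:

```swift
/// Hypothetical helper: returns the index in `target` at the same element offset
/// that `index` has in `source`, or nil if `target` has fewer elements than that.
/// Note that the result can be `target.endIndex`, which is not subscriptable.
func translate<A: Collection, B: Collection>(
    _ index: A.Index, from source: A, to target: B
) -> B.Index? {
    // The distance must be measured on the collection the index belongs to.
    let offset = source.distance(from: source.startIndex, to: index)
    return target.index(target.startIndex, offsetBy: offset, limitedBy: target.endIndex)
}

let stringA = "abce\u{301}xyz"
let stringB = "1234567890"
let indexA = stringA.index(stringA.startIndex, offsetBy: 3)

let indexB = translate(indexA, from: stringA, to: stringB)   // fourth Character of stringB
let tooShort = translate(indexA, from: stringA, to: "12")    // nil: only two Characters
```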

AlexanderM · April 16, 2020, 10:48pm

I believe the only safe way of converting indices between collections is indeed a combination of distance(from:to:) and index(_:offsetBy:).

As far as I understand, that's exactly what I thought I was doing. Could you elaborate?

SDGGiesbrecht · April 16, 2020, 10:53pm

There was a minor typo in this line:

let distance = stringB.distance(from: stringA.startIndex, to: fourthCharIndexA)

You meant stringA.distance(...). All of us besides @jenox read it as though it were stringA without noticing the mistake.

It doesn’t really change anything that’s been said.

AlexanderM · April 16, 2020, 11:25pm

Ah, nice catch. I fixed that typo.

jenox · April 16, 2020, 11:35pm

Hm, could you explain what you mean by "the distance is string-agnostic, right?" then?

As I understood it you were asking if the string you call this method on impacts the result, and it does; that's why I pointed out the error.

AlexanderM · April 16, 2020, 11:50pm

Ah, no. I know that it matters which string distance(from:to:) is called on, but I wasn't fully confident that the resulting distance: Int is safe to use elsewhere. As far as I can tell, it's a count of the number of characters (not tied to any particular underlying encoding), so I figured it's string-agnostic, but I wasn't fully sure.

Thanks for pointing out that error, too!

SDGGiesbrecht · April 16, 2020, 11:57pm

It’s safe as long as you’re using the same version of Unicode. So within the program execution you’d be fine.

I wouldn’t advise encoding the string and offset into JSON and loading it on a different device or a year later though. Grapheme cluster definitions can change, and then you might be pointing at something other than what you thought.

taylorswift · April 17, 2020, 2:35am

what would be a resilient way to store a string position?

SDGGiesbrecht · April 17, 2020, 2:52am

Instead of persisting a string and subrange, persist the three component strings: (1) up to the subrange, (2) the subrange, (3) everything after the subrange. That way the break points won’t move no matter what Unicode decides to redefine. You can always rejoin them on the other end (counting their new lengths beforehand if necessary).
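A rough sketch of that three-part approach, assuming JSON as the storage format. The type name `PersistedSelection` and its members are hypothetical. When rebuilding the range, this sketch counts UTF-8 code units rather than Characters, which is a choice made here so that the result does not depend on grapheme breaking rules at all:

```swift
import Foundation

/// Hypothetical model: a selection stored as three strings instead of a string plus offsets.
struct PersistedSelection: Codable {
    let prefix: String      // everything before the subrange
    let selection: String   // the subrange itself
    let suffix: String      // everything after the subrange

    init(_ range: Range<String.Index>, in string: String) {
        prefix = String(string[..<range.lowerBound])
        selection = String(string[range])
        suffix = String(string[range.upperBound...])
    }

    /// Rejoins the pieces and rebuilds the range from the lengths of the stored pieces.
    func restored() -> (string: String, range: Range<String.Index>) {
        let string = prefix + selection + suffix
        let utf8 = string.utf8
        let lower = utf8.index(utf8.startIndex, offsetBy: prefix.utf8.count)
        let upper = utf8.index(lower, offsetBy: selection.utf8.count)
        return (string, lower ..< upper)
    }
}

let text = "abce\u{301}xyz"
let range = text.index(text.startIndex, offsetBy: 3) ..< text.index(text.startIndex, offsetBy: 5)

// Round-trip through JSON, as if saving now and loading later.
let data = try! JSONEncoder().encode(PersistedSelection(range, in: text))
let loaded = try! JSONDecoder().decode(PersistedSelection.self, from: data)
let restored = loaded.restored()
```

After the round trip, `restored.string[restored.range]` covers the same text as `text[range]` did when it was saved.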

If the use case is significantly more complicated, then simply converting to an array of characters (as strings) would persist all the original grapheme breaks.
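For instance, a sketch under the assumption that the array is what gets persisted and that positions are stored as plain Int offsets into it (names are illustrative):

```swift
let text = "abce\u{301}xyz"

// One String per Character, fixed at the time of saving.
let pieces = text.map { String($0) }    // [String]

// Int offsets into `pieces` stay meaningful even if grapheme rules change later.
let selection = pieces[3 ..< 5].joined()

// The full text can be rebuilt whenever it is needed.
let rejoined = pieces.joined()
```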