Subscripting in String extension to access substrings and characters with Int types

SDGGiesbrecht · February 26, 2019, 8:09pm

I personally love String.Index (well, over Int at least). When you understand what is going on underneath, it makes a lot more sense.

let string = "¡Buenos días, Señor!"
let index = string.index(string.startIndex, offsetBy: 10) // O(n)
doSomething(with: string[index]) // O(1)
doSomethingElse(with: string[index]) // O(1)
doSomethingCompletelyDifferent(with: string[index]) // O(1)

With String.Index, the whole thing is just O(n), but if string had Int indices instead, it would be:

let string = "¡Buenos días, Señor!"
let index = string.index(string.startIndex, offsetBy: 10) // O(1) (or just `let index = 10`)
doSomething(with: string[index]) // O(n)
doSomethingElse(with: string[index]) // O(n)
doSomethingCompletelyDifferent(with: string[index]) // O(n)

Yikes, that is three times slower!

In addition, it is usually best to write collection algorithms generically on Collection anyway, and to do that, you need to use T.Index.

All that goes to say that I think it is very much worth learning to use Index right away instead of playing with Int first.

There are times when it is helpful to internally make something which is slow or risky easier to do, because you know that in your limited context, the dangers or gotchas do not apply.

In such cases, this is the extension I would write to provide integer subscripts into a string when you can guarantee that it really will not end up misused:

extension Collection {
    private func convert(offset: Int, startingAt start: Index) -> Index {
        // This is as fast as it gets – really.
        return index(start, offsetBy: offset)
    }
    subscript(offset offset: Int) -> Element {
        return self[convert(offset: offset, startingAt: startIndex)]
    }
    subscript<R>(offsets offsets: R) -> SubSequence where R : RangeExpression, R.Bound == Int {
        // This covers most range types at maximum efficiency.
        let resolved = offsets.relative(to: 0 ..< Int.max) // The only thing relying on the `endIndex` is `PartialRangeFrom`, which is handled below instead anyway.
        let lowerBound = convert(offset: resolved.lowerBound, startingAt: startIndex)
        let upperBound = convert(offset: resolved.upperBound - resolved.lowerBound, startingAt: lowerBound)
        return self[lowerBound ..< upperBound]
    }
    subscript(offsets offsets: PartialRangeFrom<Int>) -> SubSequence {
        // This overload skips crawling all the way to the end.
        return self[convert(offset: offsets.lowerBound, startingAt: startIndex)...]
    }
}

It is worth noting several differences from your implementation above:

It is generic over Collection, so it is more concise and works not just on String, but on ArraySlice, Substring, String.UnicodeScalarView, etc.
The subscript has an explicit label. This keeps it from causing ambiguities for the compiler where Index itself happens to be Int. It also serves as a little reminder at the call site that we are computing an offset and thus possibly iterating parts of the collection.

The conversion function skips a lot of extra work that it did in yours. You had the following code; I have inserted comments.

private func nearestIndex(pos:Int) -> String.Index {
    if pos >= count { // ← You have just iterated the whole string.
        // Are you sure this is sane? Why not just let it trap as out of bounds?
        // This means endIndex will trap, but not anything afterward; Kind of unexpected.
        return index(before: endIndex)
    }
    if pos <= 0 {
        // Same as endIndex above.
        return startIndex
    }
    if pos < (count / 2) { // ← You’ve just iterated the entire string. O(count)
        // This reiterates the first half, becoming O(count + n).
        return index(startIndex, offsetBy: pos)
    } else {
        // And this reiterates the second half, now O(count + (count − n))
        return index(endIndex, offsetBy: -(count-pos))
    }
}

The total complexity averages O(2.5(count)), when the simple implementation is just O(0.5(count)). That is 5 times slower.

The range conversions are generic over RangeExpression. It reduces the boilerplate. Only PartialRangeFrom can improve efficiency by not resolving the endIndex. Resolving the startIndex for the PartialRangeUpTo/Through is a no‐op. So work is only actually done for the indices that need it.
Lastly, when iterating toward the upperBound of a range, we start at the lowerBound that we already know, instead of reiterating everything before it too.

¿Más complicado que parece? Por eso utilizamos Index.
More complicated than it looks? That is why we use Index.

*The API and implementation illustrated here is more‐or‐less what @Michael_Ilseman, has informally proposed in several places.