Improving indexing into Swift Strings?

johnno1962 · October 25, 2020, 5:04pm

At the moment, the developer experience of accessing individual characters or ranges in Swift Strings feels rather suboptimal. To access the 6th character of a string (offset 5), for example, you need to remember to perform the following dance:

str[str.index(str.startIndex, offsetBy: 5)]

Having grown frustrated with this of late, I ended up writing a small package, StringIndex. By using a few new operators, it shortens this operation to the more concise:

str[str.startIndex+5]

while not compromising the Unicode correctness or performance of the indexing operation. Under the covers, addition and subtraction on a String.Index create a temporary structure storing the index and the offset, for which there are new subscripts implemented on StringProtocol. In this way, the offsetting of the index can be performed “late” in the subscript operator, when the target string is known, without having to specify it a third time.

Thankfully, indexing into a string isn’t something you have to do very often but when you do it feels more complicated than it needs to be. Perhaps these operators could be brought into the standard library at some point to make Swift strings more welcoming to novice Swift users.

xwu · October 25, 2020, 5:46pm

This is the subject of a number of prior threads. It's unclear if there's much to say at the moment on the topic which hasn't already been covered in the conversations below, and as a result some of these conversations have devolved until they're locked:

Lantua · October 25, 2020, 5:58pm

Rather, a more recent development would be [Returned for revision] SE-0265: Offset-Based Access to Indices, Elements, and Slices

johnno1962 · October 25, 2020, 6:00pm

Indeed, it's certainly a known problem. I'm trying to put forward a (not entirely novel, looking at the threads you listed) solution albeit one that is specific to String. Something has to be done about the ergonomics of indexing into a string - at the moment it is not great. Did you get a chance to look at the solution I'm pitching? It's pretty straightforward code:

// Basic operators to offset String.Index when used in a subscript
public func + (index: String.Index?, offset: Int) -> String.OffsetIndex {
    precondition(index != nil, "nil String.Index being offset by \(offset)")
    return String.OffsetIndex(index: index!, offset: offset)
}
public func - (index: String.Index?, offset: Int) -> String.OffsetIndex {
    return index + -offset
}

extension String {

    /// Represents an index to be offset
    public struct OffsetIndex: Comparable {
        let index: Index, offset: Int

        // Chaining offsets in expressions
        public static func + (index: OffsetIndex, offset: Int) -> OffsetIndex {
            return OffsetIndex(index: index.index, offset: index.offset + offset)
        }
        public static func - (index: OffsetIndex, offset: Int) -> OffsetIndex {
            return index + -offset
        }

        // Mixed String.Index and OffsetIndex in range
        public static func ..< (lhs: OffsetIndex, rhs: Index?) -> Range<OffsetIndex> {
            return lhs ..< rhs + 0
        }
        public static func ..< (lhs: Index?, rhs: OffsetIndex) -> Range<OffsetIndex> {
            return lhs + 0 ..< rhs
        }

        /// Required by Comparable check when creating ranges
        public static func < (lhs: OffsetIndex, rhs: OffsetIndex) -> Bool {
            return false // slight cheat here as we don't know the string
        }
    }
}

extension StringProtocol {
    public typealias OffsetIndex = String.OffsetIndex
    public typealias OISubstring = String // Can/should? be Substring

    // Subscripts on StringProtocol for OffsetIndex type
    public subscript (offset: OffsetIndex) -> Character {
        get {
            return self[index(offset.index, offsetBy: offset.offset)]
        }
        set (newValue) {
            self[offset ..< offset+1] = OISubstring(String(newValue))
        }
    }

    // lhs ..< rhs operator
    public subscript (range: Range<OffsetIndex>) -> OISubstring {
        get {
            let from = range.lowerBound, to = range.upperBound
            return OISubstring(self[index(from.index, offsetBy: from.offset) ..<
                                    index(to.index, offsetBy: to.offset)])
        }
        set (newValue) {
            let before = self[..<range.lowerBound]
            let after = self[range.upperBound...]
            self = Self(String(before) + String(newValue) + String(after))!
        }
    }
    // ..<rhs operator
    public subscript (range: PartialRangeUpTo<OffsetIndex>) -> OISubstring {
        get {
            return self[startIndex ..< range.upperBound]
        }
        set (newValue) {
            self[startIndex ..< range.upperBound] = newValue
        }
    }
    // lhs... operator
    public subscript (range: PartialRangeFrom<OffsetIndex>) -> OISubstring {
        get {
            return self[range.lowerBound ..< endIndex]
        }
        set (newValue) {
            self[range.lowerBound ..< endIndex] = newValue
        }
    }

    // Misc.
    public mutating func replaceSubrange<C>(_ bounds: Range<OffsetIndex>,
        with newElements: C) where C : Collection, C.Element == Character {
        self[bounds] = OISubstring(newElements)
    }
    public mutating func insert<S>(contentsOf newElements: S, at i: OffsetIndex)
        where S : Collection, S.Element == Character {
        replaceSubrange(i ..< i, with: newElements)
    }
    public mutating func insert(_ newElement: Character, at i: OffsetIndex) {
        insert(contentsOf: String(newElement), at: i)
    }
}

Lantua · October 25, 2020, 6:22pm

I think you should remove the optional from most of these APIs (you even have a precondition to check against that). It doesn't make much sense to add an offset to a nil index. You could be interpreting nil as startIndex, but that's probably unnecessary.

Also, I'm not sure how you can extend +/- to other collection types since they're operating on Collection.Index, not Collection. Or are you planning this to be a String-only thing?

johnno1962 · October 25, 2020, 6:30pm

The nil-able index argument was an afterthought. It allows you to specify something like:

    let firstWord = str[..<(str.firstIndex(of: " ")+0)]

and sweep the force unwrap of firstIndex under the carpet while having it still fail. I'm only considering String at this stage. Perhaps the approach could be generalised.

AlexanderM · October 25, 2020, 6:33pm

Adding a subscript that accepts an Int (instead of a String.Index) is trivial.

It's intentionally omitted, because its siren song will attract people into doing the wrong thing (doing repeated linear-time subscripting operations in a loop, making it accidentally quadratic, or worse).

I wish there was an ergonomic option that doesn't ruin performance, but I haven't seen anything achieve that. Characters in the "human understanding" sense (extended grapheme clusters, as represented by Swift.Character, not single Unicode scalars as in Swift.UnicodeScalar, or bytes) aren't subscriptable in constant time.
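To illustrate why a Character has no fixed width, here is a small sketch using only standard library views. The sample string is an arbitrary choice: an "e" followed by a combining acute accent.

```swift
// "é" written as a base letter plus a combining acute accent (U+0301).
let accented = "e\u{301}"

// One user-perceived character (extended grapheme cluster)...
let characterCount = accented.count
// ...made of two Unicode scalars...
let scalarCount = accented.unicodeScalars.count
// ...stored as three UTF-8 code units.
let utf8Count = accented.utf8.count

print(characterCount, scalarCount, utf8Count)
```

Because the number of code units per Character varies, reaching the n-th Character means walking the string from a known index.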

Swift pushes users towards approaching string processing from a different perspective, that doesn't involve repeated O(N) subscript calls.
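A minimal sketch of the difference, using only the standard library (the sample string is arbitrary). The first loop re-walks the string from startIndex on every pass; the second visits each index once.

```swift
let text = "héllo wörld"

// Accidentally quadratic: every index(_:offsetBy:) call walks from startIndex,
// and text.count is itself a linear-time operation.
var viaOffsets: [Character] = []
for offset in 0..<text.count {
    viaOffsets.append(text[text.index(text.startIndex, offsetBy: offset)])
}

// Linear: iterate over the indices the string hands out.
var viaIndices: [Character] = []
for index in text.indices {
    viaIndices.append(text[index])
}
```

Both loops collect the same characters; only the amount of work differs.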

Lantua · October 25, 2020, 6:33pm

I don't think it's a scenario where you should eat the force unwrapping like that. It's not +'s fault that the user failed to create a valid index.
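For comparison, a sketch of the first-word extraction with the optional handled explicitly, using only the standard library. The function name and the choice to return the whole string when there is no space are assumptions made for illustration.

```swift
// Hypothetical helper: returns the text before the first space,
// or the whole string if it contains no space.
func firstWord(of text: String) -> Substring {
    guard let space = text.firstIndex(of: " ") else {
        return text[...]
    }
    return text[..<space]
}

print(firstWord(of: "hello world"))
```

Here the caller decides what a missing space means, rather than the operator trapping on a nil index.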

It's tricky with these operations:

public func + (index: String.Index, offset: Int) -> String.OffsetIndex { ... }
public func - (index: String.Index, offset: Int) -> String.OffsetIndex { ... }

One thing you could do would be to have it apply to any type, not just index of some collection, which would be weird.

public func +<X>(index: X, offset: Int) -> OffsetIndex<X> { ... }

We could also restrict X to be comparable, which is weird still, but perhaps less so.
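A rough sketch of what that Comparable-constrained version might look like. The generic OffsetIndex type and the Collection subscript below are hypothetical and are not taken from the StringIndex package.

```swift
// Hypothetical generic counterpart of String.OffsetIndex.
struct OffsetIndex<Base: Comparable> {
    let base: Base
    let offset: Int
}

// These overloads apply to any Comparable type, not just collection indices,
// which is the "weird" part.
func + <Base: Comparable>(base: Base, offset: Int) -> OffsetIndex<Base> {
    return OffsetIndex(base: base, offset: offset)
}

func - <Base: Comparable>(base: Base, offset: Int) -> OffsetIndex<Base> {
    return OffsetIndex(base: base, offset: -offset)
}

extension Collection {
    // The offset is applied late, once the collection is known.
    subscript(position: OffsetIndex<Index>) -> Element {
        return self[index(position.base, offsetBy: position.offset)]
    }
}

let word = "hello"
let second = word[word.startIndex + 1]
let last = word[word.endIndex - 1]
```

Note that for collections indexed by Int, such as Array, an expression like startIndex + 1 would still resolve to ordinary integer addition, so the overloads only matter for opaque index types.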

johnno1962 · October 25, 2020, 9:59pm

I agree. That was one of those afterthoughts it is best not to have.

I tried generalising the index type, but the code degenerated into a mass of chevrons that didn't really add much, and it seems safer to only declare operators on a narrow set of types. There is also Substring to consider, so perhaps it's best keeping this specific to String.Index and StringProtocol.

johnno1962 · October 25, 2020, 10:17pm

If you are saying that Swift seeks to deter people from using potentially non-performant constructs by obscurity, then it has succeeded, and how, with str[str.index(str.startIndex, offsetBy: 5)], but even that won't deter the resourceful novice from putting it in a loop. I understand how the design arrived at that point, but I think we can do slightly better and make one-off String manipulations that should be straightforward, straightforward.

wowbagger · October 26, 2020, 12:35am

As far as I understand, this pitch is not about replacing or complementing String.Index with Int. Strings will still be indexed by String.Index.

What is pitched is essentially a syntactic sugar for getting a substring when you have an index and a character count from that known index.
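As a point of reference, this is roughly what that operation looks like today with only the standard library. The helper name slice(_:from:count:) and the sample text are made up for illustration.

```swift
// Hypothetical helper: take `count` characters starting at a known index.
func slice(_ text: String, from start: String.Index, count: Int) -> Substring {
    let end = text.index(start, offsetBy: count)
    return text[start ..< end]
}

let text = "Hello, Swift strings"
// A known index: two characters past the comma.
let afterComma = text.index(text.firstIndex(of: ",")!, offsetBy: 2)
let word = slice(text, from: afterComma, count: 5)
```

With the pitched operators, the same slice would presumably be spelled along the lines of text[afterComma ..< afterComma+5].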

johnno1962 · October 26, 2020, 7:03am

That's right, this pitch isn't less correct. Just more convenient. I've reworked the StringIndex package to have the new String.OffsetIndex temporary be an enum including cases .start, .end (pinched from @Michael_Ilseman's recently reviewed proposal) and adding .first(of: Character) and .last(of: Character) so the following is now possible:

let fifthChar: Character = string[.start+4]
let firstWord: Substring = string[..<(.first(of:" "))]
let stripped: Substring = string[.start+1 ..< .end-1]
let lastWord: Substring = string[(.last(of: " ")+1)...]

If this can be achieved more abstractly through the Collection protocol, then all power to it but for now I have something I can work with that builds as a package or Pod all the way back to Swift 4.2.
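For comparison, here is a sketch of how those four lines might be written with only the existing standard library API. The sample string is an assumption, and the force unwraps will trap if it contains no space.

```swift
let string = "(hello world)"

// string[.start+4]
let fifthChar: Character = string[string.index(string.startIndex, offsetBy: 4)]

// string[..<(.first(of: " "))]
let firstWord: Substring = string[..<string.firstIndex(of: " ")!]

// string[.start+1 ..< .end-1]
let stripped: Substring =
    string[string.index(after: string.startIndex) ..< string.index(before: string.endIndex)]

// string[(.last(of: " ")+1)...]
let lastWord: Substring = string[string.index(after: string.lastIndex(of: " ")!)...]
```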

BrentM · October 27, 2020, 8:54pm

This has always been my biggest pet peeve about Swift.

I’ve always resorted to writing wrappers around string indexes (similar to what the OP showed) to make working with them slightly less annoying.

QuinceyMorris · October 27, 2020, 11:37pm

I'm a little confused about what's happening here.

It looks like @johnno1962 is taking the syntax of @Michael_Ilseman's proposal, and making the same proposal but limited to String only.

The original proposal was returned for reasons that were important to the core team. Why would we think those reasons wouldn't also apply to this new proposal, too?

Couldn't we just fix the original proposal? It doesn't seem it's languished because that's impossible, but maybe just because no one has had time to do it?

wowbagger · October 27, 2020, 11:55pm

I think they are ~~very~~ different proposals. SE-0265 introduces OffsetBound as an abstraction of a collection's ~~endpoints~~ indices. This pitch overloads a few operators that simplify offsetting string indices using existing String and StringProtocol things, and defines a few new subscripts that work with these operators.

QuinceyMorris · October 28, 2020, 12:07am

Maybe it's technically different, but according to an earlier post in this thread, it introduced a new String.OffsetIndex type, which it then enhanced to include abstractions of the endpoints. That sounds to me basically the same as OffsetBound.

My real question stands, though. If this version of the proposal is acceptable, why can't we move forward with the more comprehensive proposal?

Conversely, if we can't move forward with the more comprehensive proposal, why wouldn't its stated defects apply here too?

johnno1962 · October 28, 2020, 8:13am

The semantics are essentially the same and of course the more abstract proposal should proceed. There is an argument for keeping it StringProtocol-specific, however, given the special performance considerations compared to other collections.

I've updated the drop-in StringIndex package to include another idea which has been doing the rounds: safe indexing, with subscripts prefixed by the label safe: that return optional types for when the index is invalid. Assigning to an invalid index is still a crash.

xwu · October 28, 2020, 1:15pm

@QuinceyMorris’s point then comes to the fore: that proposal can’t proceed without revisions to address the very weighty concerns discussed by the core team, and any design with the same semantics can’t either. It is not a trivial or solved design problem.

It must be repeated, as it has been numerous times, that this is not the sense in which “safe” is used in Swift. The existing subscripting facilities are all safe, because trapping is safe. After much discussion here, the term that previously gained traction for what you describe was “lenient.”
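A minimal sketch of what a lenient offset subscript could look like, built on the standard library's index(_:offsetBy:limitedBy:). The lenient: label and the Int offset parameter are assumptions for illustration and are not the StringIndex package's API.

```swift
extension StringProtocol {
    // Hypothetical lenient subscript: returns nil instead of trapping
    // when the offset falls outside the string.
    subscript(lenient offset: Int) -> Character? {
        guard offset >= 0,
              let position = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              position != endIndex
        else { return nil }
        return self[position]
    }
}

let sample = "abc"
let inside = sample[lenient: 1]
let outside = sample[lenient: 3]
```

The explicit checks are needed because limitedBy: only guards the direction of travel, and because reaching endIndex exactly yields a valid index that still cannot be subscripted.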

johnno1962 · December 14, 2020, 9:21am

Undeterred, I have continued to develop this idea/package and a potentially interesting abstraction has emerged, that of “index expressions”. This was fuelled by a situation I encountered where I wanted to find the second occurrence of a character in a String which as things stand is not at all easy using the available String model in Swift without having to resort to Foundation.
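For context, a sketch of one way to find a second occurrence with only the standard library. The function name secondIndex(of:in:) is made up for illustration.

```swift
// Hypothetical helper: index of the second occurrence of `target`, if any.
func secondIndex(of target: Character, in text: String) -> String.Index? {
    guard let first = text.firstIndex(of: target) else { return nil }
    // A Substring shares indices with its base string, so the result
    // can be used directly on `text`.
    return text[text.index(after: first)...].firstIndex(of: target)
}

let fruit = "banana"
let secondA = secondIndex(of: "a", in: fruit)
```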

An index expression is a sequence of either .start, .end, .first(of: "target"), .last(of: "target") or integer offsets that are chained together using the + operator to specify a future navigation within a string. These expressions can then be converted into a concrete String.Index by realising them against a particular string. Examples would be:


.start+5 // 6th character in a string

.start + .first(of: "a") // location of first letter "a" in a string

.start + .first(of: "a") + 1 + .first(of: "a") // location of second letter "a" in a string

.end + .last(of: #"\w+"#, regex: true) // location of start of last word in a string

.end + .last(of: #"\w+"#, regex: true, end: true) // location of end of last word in a string

These expressions have no meaning in themselves until they are evaluated with respect to a particular target string, either by using them in a subscript as I mention above or by using a new method on StringProtocol: "string".index(of: .start+5). I’ve been using the package for a couple of months now and these four primitives seem to have most of the bases covered for any gnarly sub-String manipulation tasks one might encounter.
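The idea of realising an expression against a string can be sketched with a toy evaluator. This is not the package's implementation: the Anchor and Step enums and the resolve function are hypothetical, and regex targets are left out.

```swift
// Hypothetical model: an expression is an anchor followed by navigation steps.
enum Anchor { case start, end }
enum Step {
    case offset(Int)
    case first(of: Character)   // search forwards from the current position
    case last(of: Character)    // search backwards from the current position
}

// Walks the steps against a concrete string; nil if any step cannot be satisfied.
func resolve<S: StringProtocol>(_ anchor: Anchor, _ steps: [Step], in s: S) -> S.Index? {
    var position = anchor == .start ? s.startIndex : s.endIndex
    for step in steps {
        switch step {
        case .offset(let n):
            let limit = n < 0 ? s.startIndex : s.endIndex
            guard let next = s.index(position, offsetBy: n, limitedBy: limit) else { return nil }
            position = next
        case .first(let target):
            guard let next = s[position...].firstIndex(of: target) else { return nil }
            position = next
        case .last(let target):
            guard let next = s[..<position].lastIndex(of: target) else { return nil }
            position = next
        }
    }
    return position
}

let banana = "banana"
// Counterpart of: .start + .first(of: "a") + 1 + .first(of: "a")
let secondA = resolve(.start, [.first(of: "a"), .offset(1), .first(of: "a")], in: banana)
```

The expression itself carries no string; only the call to resolve ties it to one, which is what allows the same expression to be reused across strings.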

github.com

johnno1962/StringIndex/blob/main/Sources/StringIndex/StringIndex.swift

//
//  StringIndex.swift
//  StringIndex
//
//  Created by John Holdsworth on 25/10/2020.
//  Copyright © 2020 John Holdsworth. All rights reserved.
//
//  A few operators simplifying offsetting a String index
//
//  Repo: https://github.com/johnno1962/StringIndex.git
//
//  $Id: //depot/StringIndex/Sources/StringIndex/StringIndex.swift#34 $
//

import Foundation

// Basic operators to offset String.Index when used in a subscript
public func + (index: String.Index?, offset: Int) -> String.OffsetIndex {
    return .offsetIndex(index: index, offset: offset)
}

This file has been truncated.

Lantua · December 14, 2020, 10:30pm

+ wouldn't be a good fit. We're not doing traditional pointer arithmetic, nor are we adding two indices together. Maybe we can figure out a new operator for that. That may even be a good thing since the type checker perf will be less of a concern.