Subscripting in String extension to access substrings and characters with Int types

applecoding · February 26, 2019, 11:26am

Hello,

I made an implementation like an extension of String, to work with substrings and characters with Int types and not String.Index. It uses String.Index internally and error control in code, but you can use all the ranges and positions in a String with Int.

It implements ClosedRange, Range, PartialRangeFrom, PartialRangeUpTo and PartialRangeThrough. Always returns a Substring. And you can use an Int to retrieve a Character type of the element of the position you asked.

It will be great to implement in the language and with this, you have a more simple way to work with substrings doing:

testString[15..<25] (if lowerBound is minus than 0, uses startIndex)
testString[15...250] (if string is shorter returns to endIndex)
testString[20...] (from this index to the end of the string)
testString[..<21] (from the beginning until this position minus 1)
testString[...21] (from the beginning until this position)
testString[5] (return the character at this position)

I made all of the code and is tested and prepared to make the pull request.

Thanks in advance
Julio César Fernández
Apple Coding

Jumhyn · February 26, 2019, 1:19pm

The reason that String doesn’t allow this today is that subscripting is expected to be a constant-time operation (and, in Collection, this is made explicit). The variable-width nature of Characters makes this infeasible for string without sacrificing efficiency in other areas.

Lantua · February 26, 2019, 2:46pm

As @Jumhyn said, having subscript doing more than O(1) work is semantically incorrect.

What we need is to allowed another set of functionality that alleviate that constraint. This has come up in:

Short hand for Starting Index and End Index
Why are String Offset so Complicated

And I’m sure many times more, but I’m not profecient enough with search bar on this site.

Bear · February 26, 2019, 2:51pm

Why would performance cost make anything semantically correct or incorrect? Senantics is about expressing things, not about their work behind the scenes. I see nothing wrong with the subscripts not being O(1).

applecoding · February 26, 2019, 3:06pm

I'm understand the problem with Character and the use of Collection of these.
Be honest: people is doing terrible things in code to solve this problem because to work with String.Index is a headache.
We need a REAL solution, and we need that language will implement the best possible solution and more efficient directly for use of any developer.
The best solution is to extend the String with a pure Swift implementation, better than anyone can do it. The best possible solution using Swift with the tools we've got today.

This is my solution:

extension String {    
    private func nearestIndex(pos:Int) -> String.Index {
        if pos >= count {
            return index(before: endIndex)
        }
        if pos <= 0 {
            return startIndex
        }
        if pos < (count / 2) {
            return index(startIndex, offsetBy: pos)
        } else {
            return index(endIndex, offsetBy: -(count-pos))
        }
    }
    
    subscript (range:ClosedRange) -> Substring {
        return self[nearestIndex(pos: range.lowerBound)...nearestIndex(pos: range.upperBound)]
    }
    
    subscript (range:Range) -> Substring {
        return self[nearestIndex(pos: range.lowerBound)..<nearestIndex(pos: range.upperBound)]
    }
    
    subscript (range:PartialRangeFrom) -> Substring {
        return self[nearestIndex(pos: range.lowerBound)...index(before: endIndex)]
    }
    
    subscript (range:PartialRangeUpTo) -> Substring {
        return self[startIndex..<nearestIndex(pos: range.upperBound)]
    }
    
    subscript (range:PartialRangeThrough) -> Substring {
        return self[startIndex...nearestIndex(pos: range.upperBound)]
    }
}

I think it's elegant and the best you can do with the actual problem. I think the language must provide the best solution for each problem, and not a complex answer that not help to fix the problem, making the people uses a worse solution.

let cadena = "A long time ago in a galaxy far far away"

cadena.count

cadena[5...25]

cadena[5...250]

cadena[-5..<25]

cadena[1...41]

cadena[...30]

cadena[10...]

cadena[..<30]

This is the way I see it: a developer uses an inefficient way with a bad function (offsetBy). We can give us the best possible solution at this point, made in Swift itself.

Thanks in advance

Lantua · February 26, 2019, 3:30pm

My writing was unclear, apologies for that.
The fault is not in subscript not being O(1) per se; it is that we have subscript(_:), without argument label, that guarantees to be O(1).
These new subscript can cause confusion, or even collision if Index is Int.
That’s part of the reason why there’s suggestion of subscript(offset:) popping up in the pitch I posted above.

Agreed, this is a real problem, one that I’ve in so far managed to work around. Others may not be so lucky.
Just that it’s a little hard for a good solution that everyone would agree upon to arise.

This is an interesting take on it. I’m not sure about the use case just yet. Will need to wait for others to comment on it.

Jumhyn · February 26, 2019, 3:51pm

Complexity is not “behind the scenes,” though. Reasoning about the complexity/efficiency of programs that you’re writing is impossible if you don’t have guarantees about the complexity of the subroutines you’re calling. It’s very much part of the contract of a method to say “if you double the input size, the runtime of this method will not change.”

Furthermore, program design will tend to mirror API design. If Swift were to introduce ergonomic Int indexing for String, it would encourage users to design programs which are using String improperly. Nothing is preventing motivated users who really want to ignore complexity concerns from defining their own extensions or types to allow for this type of indexing, but including it in the standard library amounts to a “seal of approval” of sorts that this is the “correct” way to do things. I believe this is largely what people wish to avoid by including Int indexing in the standard library.

Karl · February 26, 2019, 4:01pm

FWIW, I've experimented with this kind of thing before, and I think the most practical solution is to wrap the String together with an index-cache similar to the "breadcrumbs" we use for fast UTF-16 indexes. You can write the wrapper generically so that, for the cost of a small cache, any Collection can be "promoted" to a RandomAccessCollection and gain convenient offset-based subscripting and slicing using integers.

In the past we've discussed using a similar buffer to promote all Sequences to Collections.

applecoding · February 26, 2019, 4:14pm

Believe me that I don't understand anything. We are approaching to something that makes substrings better than the 99% of programmers that uses Swift. Not perfect? For sure. But we are talking about a solution in String type using Swift itself. Not to promote to the entire Collection or Sequence protocols. If you have an array, you can subscript with Int. And I know String.Index is because of the length of each character that is not equal.

Sincerely, I think that discard a good solution in Swift only because it's not perfect, but believe me is way better than to use String.Index... Users is designing programs now using String.Index and offsetBy because there's no other way to work with Strings. And best of all: if someday there's a better solution, change the subscript and that's it... still working. But sincerely I think a language in 5.1 version cannot work with substrings in the actual way. A temporary solution at this time I think will be better. It's my opinion, with all my respect.

I'm teaching Swift to hundred of students in Spain for so many years, till the first version. And to work with substrings it was always a headache for all and something that nobody understand. To force the user to use extensions that the common users cannot know how to made.

Karl · February 26, 2019, 4:46pm

I'm saying that you want something that looks like this:

// 1. A place to hold the index cache.

struct IndexCached<Wrapped>: RandomAccessCollection where Wrapped: Collection {
  var wrapped: Wrapped
  var breadcrumbs: [Wrapped.Index]

  typealias SubSequence = IndexCached<Wrapped.SubSequence>
}

// 2. Some convenience extensions.

extension IndexCached where Wrapped: Equatable {
  static func ==(lhs: IndexCached, rhs: Wrapped) -> Bool { return lhs.wrapped == rhs }
}

// 3. Offset-based subscripting for _all_ RACs.

extension RandomAccessCollection {
  // offset-based subscripting, including special handling of negative offsets, etc.
  subscript(offsets: Range<Int>) -> SubSequence { /* ... */ }
}

// 4. Easily create a cached version of a collection.

extension Collection {
  // Perhaps make this 'mutating' to enforce exclusive access?
  func withRandomAccess<R>(_ execute: (IndexCached<Self>)->R) -> R {
    return execute(IndexCached(self))
  }
}

And you use it like this:

self.textView.text.withRandomAccess { str in
  // Parse from text from your string.
  if str[-5...] == "error" { 
    let errorMsg = str[..<-5].wrapped // Substring.
  }
}

Jumhyn · February 26, 2019, 4:48pm

To be fair, I think that's overestimating things by a fair amount. Teaching programming is a valid point to bring up, since many exercises tend to revolve around contrived examples, and Strings are an attractive type to work with in a learning environment. However, having worked on large Swift code bases for several years at this point, the use cases for subscripting Strings by Ints has basically never come up without there being a more "correct" way of solving the problem.

applecoding · February 26, 2019, 4:58pm

In my experience for many years, people uses Swift at basic user level. When I'm teaching I'm trying to open their minds and I try to teach how to think in functional, extensions or even in a protocol oriented way. But when people trying to use substrings (again, in my experience) their only solution it's to look for an inefficient extension or any function that someone puts in stackoverflow. Meanwhile, there's no solution to fight against the String.Index type.

Believe me, I'm understand perfectly the need to make the things perfectly fit the stdlib, protocols and the rest of stuff. But I think: maybe sometimes the easier solution is the more practical. Maybe not perfect, but practical while all of us found the perfect solution to make it better.

That's all.

Jumhyn · February 26, 2019, 5:36pm

Read through the other threads on this topic and see how you feel about the alternatives suggested. I think many people here would argue that feeling like you’re “[fighting] against the String.Index” type is a feature, not a bug. It encourages you to find alternate designs and approaches to express your ideas more efficiently and correctly. If someone were expressing annoyance that storing their numbers as Strings was cumbersome because the addition operators concatenated the strings rather than adding the numbers they represent, the response would be similar. Obviously that’s an extreme example, but I think it illustrates the point that there are good reasons why working with String in the way you describe is made cumbersome.

Nevin · February 26, 2019, 6:51pm

The technical term for this is syntactic salt: “A feature designed to make it harder to write bad code.”

SDGGiesbrecht · February 26, 2019, 8:09pm

I personally love String.Index (well, over Int at least). When you understand what is going on underneath, it makes a lot more sense.

let string = "¡Buenos días, Señor!"
let index = string.index(string.startIndex, offsetBy: 10) // O(n)
doSomething(with: string[index]) // O(1)
doSomethingElse(with: string[index]) // O(1)
doSomethingCompletelyDifferent(with: string[index]) // O(1)

With String.Index, the whole thing is just O(n), but if string had Int indices instead, it would be:

let string = "¡Buenos días, Señor!"
let index = string.index(string.startIndex, offsetBy: 10) // O(1) (or just `let index = 10`)
doSomething(with: string[index]) // O(n)
doSomethingElse(with: string[index]) // O(n)
doSomethingCompletelyDifferent(with: string[index]) // O(n)

Yikes, that is three times slower!

In addition, it is usually best to write collection algorithms generically on Collection anyway, and to do that, you need to use T.Index.

All that goes to say that I think it is very much worth learning to use Index right away instead of playing with Int first.

There are times when it is helpful to internally make something which is slow or risky easier to do, because you know that in your limited context, the dangers or gotchas do not apply.

In such cases, this is the extension I would write to provide integer subscripts into a string when you can guarantee that it really will not end up misused:

extension Collection {
    private func convert(offset: Int, startingAt start: Index) -> Index {
        // This is as fast as it gets – really.
        return index(start, offsetBy: offset)
    }
    subscript(offset offset: Int) -> Element {
        return self[convert(offset: offset, startingAt: startIndex)]
    }
    subscript<R>(offsets offsets: R) -> SubSequence where R : RangeExpression, R.Bound == Int {
        // This covers most range types at maximum efficiency.
        let resolved = offsets.relative(to: 0 ..< Int.max) // The only thing relying on the `endIndex` is `PartialRangeFrom`, which is handled below instead anyway.
        let lowerBound = convert(offset: resolved.lowerBound, startingAt: startIndex)
        let upperBound = convert(offset: resolved.upperBound - resolved.lowerBound, startingAt: lowerBound)
        return self[lowerBound ..< upperBound]
    }
    subscript(offsets offsets: PartialRangeFrom<Int>) -> SubSequence {
        // This overload skips crawling all the way to the end.
        return self[convert(offset: offsets.lowerBound, startingAt: startIndex)...]
    }
}

It is worth noting several differences from your implementation above:

It is generic over Collection, so it is more concise and works not just on String, but on ArraySlice, Substring, String.UnicodeScalarView, etc.
The subscript has an explicit label. This keeps it from causing ambiguities for the compiler where Index itself happens to be Int. It also serves as a little reminder at the call site that we are computing an offset and thus possibly iterating parts of the collection.

The conversion function skips a lot of extra work that it did in yours. You had the following code; I have inserted comments.

private func nearestIndex(pos:Int) -> String.Index {
    if pos >= count { // ← You have just iterated the whole string.
        // Are you sure this is sane? Why not just let it trap as out of bounds?
        // This means endIndex will trap, but not anything afterward; Kind of unexpected.
        return index(before: endIndex)
    }
    if pos <= 0 {
        // Same as endIndex above.
        return startIndex
    }
    if pos < (count / 2) { // ← You’ve just iterated the entire string. O(count)
        // This reiterates the first half, becoming O(count + n).
        return index(startIndex, offsetBy: pos)
    } else {
        // And this reiterates the second half, now O(count + (count − n))
        return index(endIndex, offsetBy: -(count-pos))
    }
}

The total complexity averages O(2.5(count)), when the simple implementation is just O(0.5(count)). That is 5 times slower.

The range conversions are generic over RangeExpression. It reduces the boilerplate. Only PartialRangeFrom can improve efficiency by not resolving the endIndex. Resolving the startIndex for the PartialRangeUpTo/Through is a no‐op. So work is only actually done for the indices that need it.
Lastly, when iterating toward the upperBound of a range, we start at the lowerBound that we already know, instead of reiterating everything before it too.

¿Más complicado que parece? Por eso utilizamos Index.
More complicated than it looks? That is why we use Index.

*The API and implementation illustrated here is more‐or‐less what @Michael_Ilseman, has informally proposed in several places.

applecoding · February 26, 2019, 9:05pm

OK. I will make some comments like you... First of all thanks for your comments:

extension Collection {
    private func convert(offset: Int, startingAt start: Index) -> Index {
        // This is the same index offsetBy I used it
        return index(start, offsetBy: offset)
    }
    subscript(offset offset: Int) -> Element {
        // if you want to check one of the latest characters, you will go to the entire string from the beginning. No sense.
        return self[convert(offset: offset, startingAt: startIndex)]
    }
    subscript(offsets offsets: R) -> SubSequence where R : RangeExpression, R.Bound == Int {
        // I like it, but if you put a range out of index... BOOOM!!!!
        let resolved = offsets.relative(to: 0 ..< Int.max) // The only thing relying on the `endIndex` is `PartialRangeFrom`, which is handled below instead anyway.
        let lowerBound = convert(offset: resolved.lowerBound, startingAt: startIndex)
        let upperBound = convert(offset: resolved.upperBound - resolved.lowerBound, startingAt: lowerBound)
        return self[lowerBound ..< upperBound]
    }
    subscript(offsets offsets: PartialRangeFrom) -> SubSequence {
        // BOOM! 2 
        return self[convert(offset: offsets.lowerBound, startingAt: startIndex)...]
    }
}

I created a mix of your implementation and mine. If made O(0.5) in the offsetBy because I will from the beginning or the end depends on the position of the character.

extension String {
    private func nearestIndex(pos:Int, total:Int) -> String.Index {
        if pos >= total {
            return index(before: endIndex)
        }
        if pos <= 0 {
            return startIndex
        }
        if pos < (total / 2) {
            return index(startIndex, offsetBy: pos)
        } else {
            return index(endIndex, offsetBy: -(total-pos))
        }
    }
    
    subscript(offset: Int) -> Character {
        return self[nearestIndex(pos: offset, total:count)]
    }
    
    subscript(offsets: R) -> Substring where R : RangeExpression, R.Bound == Int {
        let total = count
        let resolved = offsets.relative(to: 0 ..< Int.max)
        let lowerBound = nearestIndex(pos: resolved.lowerBound, total: total)
        let upperBound = nearestIndex(pos: resolved.upperBound, total: total)
        return self[lowerBound ..< upperBound]
    }
    
    subscript(offsets: PartialRangeFrom) -> SubSequence {
        let lowerBound = nearestIndex(pos: offsets.lowerBound, total: count)
        return self[lowerBound...]
    }
}

Now you can test with this and you have perfect error control for out of index...

let cadena = "A long time ago in a galaxy far far away"

cadena[5...25]

cadena[5...250]

cadena[-5..<25]

cadena[1...41]

cadena[...30]

cadena[10...]

cadena[..<30]

Thanks for your suggestions.

Julio César

SDGGiesbrecht · February 26, 2019, 9:59pm

You have to go through the entire string from the beginning. There is no alternative. Either:

You start from the beginning and go until you hit the index. Avg. O(0.5(count))
You start from the beginning and go all the way to the end in order to figure out where the end is. From that you know which end of the string is closer so you can walk from there again until you hit the index. Avg. O(1.5(count))

Your new implementation successfully defers the initial computation of the count out of the method, but you still require the client to compute it in order to give it to you. So the extra complexity is still there, it is just in someone else’s hands.

The unusual handling of invalid indices results in some very surprising behaviour:

XCTAssert(cadena[..<100].elementsEqual(cadena))
// Fails: “A long time ago in a galaxy far far awa”

XCTAssertEqual(cadena[100], cadena.last!)
// Passes as is, but fix the first test and instead, BOOM!

There is another solution out there that efficiently handles all except the tolerance of out‐of‐bounds indices, without any need for extensions. Walk the string once creating an array.* Then after that you are dealing with a random access collection with integer indices:

let cadena = Array("A long time ago in a galaxy far far away") // O(n)

// O(1) from here on:
cadena[5...25]

cadena[5...]

cadena[..<25]

cadena[1...]

cadena[...30]

cadena[10...]

cadena[..<30]

*This will bloat the memory footprint, which is why String does not work this way by default. The array also will not correctly merge characters, mishandling Unicode equivalence if you start mutating it.

applecoding · February 26, 2019, 10:51pm

I will continue to think a more efficient solution. Thanks a lot for your comments. I know it will be a better solution. It just found it.

applecoding · February 26, 2019, 11:03pm

Fix it the error:

subscript <R>(offsets: R) -> Substring where R : RangeExpression, R.Bound == Int {

let total = count

let resolved = offsets.relative(to: 0 ..< Int.max)

let lowerBound = nearestIndex(pos: resolved.lowerBound, total: total)

let upperBound = index(after: nearestIndex(pos: resolved.upperBound, total: total))

return self[lowerBound ..< upperBound]

}

SDGGiesbrecht · February 26, 2019, 11:23pm

Now you have other issues to solve.

XCTAssertEqual(cadena[..<39].count, 39)
XCTAssert(cadena[..<0].isEmpty)

I have other things vying for my time, so that is likely the last I will say on the subject. But good luck on your hunt for a solution! I hope you continue to contribute your ideas in this forum.