Search for substr from position

ljh · December 1, 2020, 3:34pm

Can I search for a substring starting from a specific position, with something like this?

firstIndex(of: String, at: String.Index) -> String.Index?

Like this in C++:

// search from position 5
string const s = "This is a string";
string::size_type n = s.find("is", 5);

or in Go:

s := "This is a string"
n := strings.Index(s[5:], "is")

Lantua · December 1, 2020, 3:40pm

I don't think there's any such function in the standard library, not even an O(nm) one. Closest I can think of is to use NSRegularExpression.

Nevin · December 1, 2020, 10:44pm

There are some NSString searching methods in Foundation, but I don’t know if they’ll fit your needs.

If you decide to write your own, here are some thoughts:

When the string you’re searching for is short, the naive algorithm is fast. When it’s long, you’re better off with a linear-time algorithm.

If you’re searching for whole words, you can skip ahead to the start of the next word each time there’s a mismatch.

There’s a whole body of research on string search algorithms (see, e.g., the wiki article or this list or this Stack Overflow post).

I’m not an expert in the field, but my understanding is that the two-way algorithm is held in high regard.

• • •

If this functionality were ever added to Swift, I suspect it would be written as a method on Collection, with a signature like so:

extension Collection where Element: Equatable {
  func firstRange<C: Collection>(of target: C) -> Range<Index>?
    where C.Element == Element
  {
    // Insert implementation here
  }
}

So that it can be optimized, it would probably also be a protocol requirement of Collection, so that e.g. BidirectionalCollection could provide its own implementation.
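To make the signature above concrete, here is a minimal sketch of the naive O(nm) approach mentioned earlier. The name naiveFirstRange(of:) is hypothetical (chosen to avoid clashing with any real API), and treating an empty target as matching at the start is an assumption, not something settled in this thread.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical name, not a standard library API: returns the range of
    /// the first occurrence of `target`, using the naive O(n*m) scan.
    func naiveFirstRange<C: Collection>(of target: C) -> Range<Index>?
        where C.Element == Element
    {
        // Assumption: an empty target matches at the very start.
        guard !target.isEmpty else { return startIndex..<startIndex }

        var start = startIndex
        while start != endIndex {
            // Walk both collections in lockstep from the candidate start.
            var i = start
            var j = target.startIndex
            while i != endIndex, j != target.endIndex, self[i] == target[j] {
                formIndex(after: &i)
                target.formIndex(after: &j)
            }
            // The whole target was consumed, so this candidate is a match.
            if j == target.endIndex { return start..<i }
            formIndex(after: &start)
        }
        return nil
    }
}

// Searching from position 5 is then just a matter of slicing first.
let text = "This is a string"
let match = text.dropFirst(5).naiveFirstRange(of: "is")
```

Because a slice shares indices with its base collection, the returned range can be used directly on the original string.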

xwu · December 1, 2020, 11:27pm

You may be interested in the following proposed addition to Swift Algorithms.

github.com/apple/swift-algorithms

Add methods for finding subsequences in a collection (apple:main ← apple:nate/firstrange, opened 5 Nov 2020 by natecook1000, work in progress)

Description: Methods for finding the first, last, or all ranges of a given subsequence.

With that, the requested functionality is spelled:

s.dropFirst(5).firstRange(of: "is")
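As a sketch of the same slicing idea with an API that already exists, Foundation's range(of:) can stand in for the proposed firstRange(of:). The variable names here are only for illustration.

```swift
import Foundation

let s = "This is a string"

// dropFirst(5) returns a Substring that shares its indices with `s`.
let tail = s.dropFirst(5)

// Foundation's range(of:) is used here in place of the proposed firstRange(of:).
let found = tail.range(of: "is")

if let range = found {
    // The range is valid in the original string, so no offset needs to be
    // added back (unlike the slice-relative result in the Go example).
    let offset = s.distance(from: s.startIndex, to: range.lowerBound)
    print(offset, s[range])
}
```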

ljh · December 2, 2020, 10:09am

Quoting SDGGiesbrecht, from "Find multiple substrings from specified index":

To implement the exact method you asked for, you would do this:
import Foundation
extension String {
    func firstIndex(of: String, at: String.Index) -> String.Index? {
        return self[at...].range(of: of)?.lowerBound
    }
}
But there are many pitfalls in what you want to do with it, because String indices are not integer offsets.

I think @SDGGiesbrecht's solution is nice.
As for the pitfalls: if I keep passing a value of type String.Index for the at: argument, will everything be fine? (An integer value cannot be passed where a String.Index is expected, right?)

var s = "This also is a string"
var sep = "is"
var i1 = s.firstIndex(of: sep, at: s.startIndex)
if var i1 = i1 {
    var i2 = s.firstIndex(of: sep, at: s.index(after: i1))
    if var i2 = i2 {
        print("'\(s)'")
        print("'\(s[i1...i2])'")
        for _ in sep {
            i1 = s.index(after: i1)
        }
        i2 = s.index(before: i2)
        print("'\(s[i1...i2])'")
        print("'\(s[i1...i2].trimmingCharacters(in: .whitespacesAndNewlines))'")
    }
}

SDGGiesbrecht · December 2, 2020, 10:18pm

Yes, that will work, and it does not stumble into any of the repeated integer conversion pitfalls I was talking about.

(Caveat: At this point I am assuming you can guarantee sep will never be "", in which case I don’t know off the top of my head what will happen when it reaches range(of:) in the extension method.)
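If you would rather not depend on what range(of:) does with an empty string, one option is to make that case explicit. This is only a sketch: the firstIndex(of:from:) name and the choice to treat an empty needle as "not found" are assumptions of mine.

```swift
import Foundation

extension String {
    /// Hypothetical variant of the helper above that handles the
    /// empty-needle case itself instead of relying on range(of:).
    func firstIndex(of needle: String, from start: String.Index) -> String.Index? {
        // Assumption: an empty needle is treated as "not found".
        guard !needle.isEmpty else { return nil }
        return self[start...].range(of: needle)?.lowerBound
    }
}

let sample = "This also is a string"
let hit = sample.firstIndex(of: "is", from: sample.startIndex)
let empty = sample.firstIndex(of: "", from: sample.startIndex)
```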

Three parts of it might still be doing slightly more work than necessary.

Twice you have used this pattern (with i1 and with i2):
```
var x = y()
if var x = x {
  // ...
```
Those two var declarations create two separate variables, even though one shadows the other. (The compiler is probably even warning you that the first one is never changed and could be switched to a let.) You can compress that pattern directly into this:
```
if var x = y() {
  // ...
```
That way you are only storing one variable.
The innermost loop...
```
for _ in sep {
  i1 = s.index(after: i1)
}
```
...could be reduced to...
```
i1 = s.index(i1, offsetBy: sep.count)
```
(count must be doing a similar loop of some form under the hood, but it might have inside information allowing it to do so without the overhead of dispatching to the index(after:) method in each iteration.)
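A small sketch comparing the two forms. The limitedBy: variant is an addition of mine, not part of the suggestion above; it returns nil instead of trapping if the offset would run past the end.

```swift
let s = "This also is a string"
let sep = "is"

// Manual stepping: one index(after:) call per character of `sep`.
var stepped = s.startIndex
for _ in sep {
    stepped = s.index(after: stepped)
}

// A single call that advances by the same number of characters.
let jumped = s.index(s.startIndex, offsetBy: sep.count)

// limitedBy: yields nil rather than crashing when the offset overshoots.
let overshoot = s.index(s.startIndex, offsetBy: 100, limitedBy: s.endIndex)

print(stepped == jumped, overshoot == nil)
```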
The conversion to a closed range...
```
i2 = s.index(before: i2)
print("'\(s[i1...i2])'")
```
...could be simplified by just using a half-open range directly:
```
print("'\(s[i1..<i2])'")
```
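To illustrate the difference between the two range operators: `...` includes the upper bound and `..<` excludes it. In this sketch the offsets 2 and 10 are hard-coded because that is where "is" starts in this particular string.

```swift
let s = "This also is a string"

// Offsets 2 and 10 are where "is" starts in this particular string.
let i1 = s.index(s.startIndex, offsetBy: 2)
let i2 = s.index(s.startIndex, offsetBy: 10)

// `...` includes the character at i2, while `..<` stops just before it.
let inclusive = s[i1...i2]
let exclusive = s[i1..<i2]

// Stepping i2 back by one and using `...` selects the same characters as `..<`.
let steppedBack = s[i1...s.index(before: i2)]

print("'\(inclusive)'", "'\(exclusive)'", steppedBack == exclusive)
```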

ljh · December 3, 2020, 4:41am

Thanks for the help.

I rewrote my code following your suggestions.

The code won't crash even if sep is "".

import Foundation
extension String {
    func firstIndex(of: String, at: String.Index) -> String.Index? {
        return self[at...].range(of: of)?.lowerBound
    }
}

let s = "This also is a string"
let sep = "is"
if var i1 = s.firstIndex(of: sep, at: s.startIndex) {
    if let i2 = s.firstIndex(of: sep, at: s.index(after: i1)) {
        print("'\(s)'")
        print("'\(s[i1...i2])'")
        i1 = s.index(i1, offsetBy: sep.count)
        print("'\(s[i1..<i2])'")
        print("'\(s[i1..<i2].trimmingCharacters(in: .whitespacesAndNewlines))'")
    }
}
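The same pattern extends to any number of matches by searching again from the end of the previous one. This is a rough sketch only: allIndices(of:in:) is a hypothetical helper name, and it assumes non-overlapping matches and returns nothing for an empty needle.

```swift
import Foundation

/// Hypothetical helper: collects the start index of every
/// non-overlapping occurrence of `needle` in `haystack`.
func allIndices(of needle: String, in haystack: String) -> [String.Index] {
    // Assumption: an empty needle has no matches.
    guard !needle.isEmpty else { return [] }

    var result: [String.Index] = []
    var searchStart = haystack.startIndex
    // Each search begins where the previous match ended.
    while let range = haystack[searchStart...].range(of: needle) {
        result.append(range.lowerBound)
        searchStart = range.upperBound
    }
    return result
}

let line = "This also is a string"
let offsets = allIndices(of: "is", in: line).map {
    line.distance(from: line.startIndex, to: $0)
}
```

The indices stay as String.Index values throughout; they are converted to integer offsets only at the end, for display.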

phoneyDev · December 4, 2020, 1:02am

If we're doing a code review, I'll point out that one of Swift's API design principles is fluency. range(of: of) isn't very fluent IMO, and neither is [at...]. I recommend giving the parameters separate internal names here. Something like:

    func firstIndex(of subString: String, at index: String.Index) -> String.Index? {
        return self[index...].range(of: subString)?.lowerBound
    }

You might prefer different names, but range(of: of) and [at...] are odd.

You should be able to read your code out loud and it should sound normal.

AlexanderM · December 4, 2020, 1:25am

I would call this firstRange(of:after:)

But anyways, there’s a whole series of operations you’d want to do before or after a certain point in a string (capitalizing, lower casing, sorting, searching, replacing, etc.). There’s no point in bloating each API with after: Index parameters.

Instead, you can compose orthogonal components like slicing (to pick which part to act on) and a normal API like firstIndex(of:).
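A quick sketch of what that composition might look like, using only standard library calls. The variable names are illustrative, and the offset 5 is just the position from the original question.

```swift
let s = "This is a string"

// Pick the part to act on by slicing once.
let start = s.index(s.startIndex, offsetBy: 5)
let tail = s[start...]

// Searching: the ordinary firstIndex(of:) works on the slice unchanged,
// and the index it returns is valid in the original string.
let nextS = tail.firstIndex(of: "s")

// Other operations compose the same way, with no after: parameter needed.
let shouted = String(s[..<start]) + tail.uppercased()
```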