How to make `String.Index` with `Int` offset (distance) in O(1) time?

Hoon_H · September 8, 2019, 12:41am

String has O(1) substring extraction using String.Index

@available(swift 4)
@inlinable public subscript (r: Range<String.Index>) -> Substring { get }

But making a new index with offset (distance) takes O(n).

/// - Complexity: O(*n*), where *n* is the absolute value of `n`.
@inlinable public func index(_ i: String.Index, offsetBy n: String.IndexDistance) -> String.Index

I think this is because String internally performs grapheme cluster validation. (guessed by function names on call stack in Instruments) And it seems now this validation cost is my major bottleneck for large string processing.

Is there any way to make a new String.Index with offset in O(1)? Regardless of safe or unsafe. All offsets I have are all correct as they're obtained only by calling String methods.

David_Smith · September 8, 2019, 1:09am

There's a lot of paths through the String code, if you post your Instruments time profile we might be able to give better advice

For example, in many cases what looks like slowness due to String itself is actually slowness due to bridged Strings, which can work differently.

In general though, no. You might be able to get what more like what you want by using string.utf8 instead of the string itself, though be aware that that will produce different results for multi code point grapheme clusters.

SDGGiesbrecht · September 8, 2019, 2:40am

Please clarify what methods you mean. Anything you get from a String’s methods ought to already be an Index, not an offset. (Older, bridged Foundation methods may provide Int values, but these do not represent offsets into String—they are offsets into String.UTF16View.)

Hoon_H · September 8, 2019, 1:01pm

Unfortunately, I rollback my code so I can't post it now. I'll try it again in another day.

I tried to avoid bridged state by calling makeContiguousUTF8() and asserting isContiguousUTF8 == true. All strings passed assertion, but now I think this wouldn't be enough to avoid bridging. It could be a bridging issue as like you talked. I'll consider this on my next trial. Thanks.

I am willing to use String.utf8, but its documentation clearly says String.utf8.index(,offsetBy:) is O(n). And that's so weird because there's no reason to show O(n) performance on the method except Swift performing some validation.

Hoon_H · September 8, 2019, 1:06pm

I would use only String.distance(from:,to:) and String.utf8.distance(from:,to:). Also I treat returned offset number as an opaque type (though it's actually just Int), and didn't manipulate the number in any way.

SDGGiesbrecht · September 8, 2019, 4:02pm

But distance(from:to) is O(n) too though. That seems to mean you received an Index at O(1), but instead of using it directly at O(1), you are converting it to an offset at O(n) cost and then hoping to avoid the O(n) cost of converting it back into an Index. Even if you could magically achieve the second conversion in O(1), you would still be at O(n) overall.

Isn’t the straightforward solution to avoid both conversions and stay in Index land?

Hoon_H · September 8, 2019, 5:11pm

That is exactly what I would like to do. My idea was calculate indices and offsets only once and using them later quickly. Initial calculation is okay with O(n), but I want later uses can be done O(1).

I see your point of avoiding both conversions. I agree to your opinion for constant strings, but still worry about my case. I'm working on a text editor, and need to deal with millions of mutating Substrings that have essentially random indices. Now my implementation is based on String, and I saw great performance raise when I converted it into Substring based. But I had to rollback because I'm worrying about high maintenance cost of dealing with Substring indices. That is doable, but difficult. At least, it's not really straightforward for me. If I can do index + offset in O(1) time, I think my work can be simplified greatly and overall maintenance cost can also be decreased..

Lantua · September 8, 2019, 5:20pm

That is fishy. Substring uses the same index system as the base String. So the index computation should be identical.

Did you by any chance allocate new String from substring repeatedly like this

let foo = String(someString[index1..<index2])

? If so, I'd suggest that you try to avoid such allocation and work with StringProtocol instead of String, and possibly pass full-range substring like this

doSomething(someString[...])

. Though without seeing the code, that's about as far as I can guess/suggest.

SDGGiesbrecht · September 8, 2019, 7:13pm

Provided nothing will be mutated, you can convert the String to an Array of characters in O(n):

let string = "..."
let indexed: [Character] = Array(string)

After that all look‐ups into indexed will be O(1).

If you are worried about memory usage as well (the Array’s storage is significantly larger than the equivalent String), you can compress it somewhat by keeping an Array of Index values instead, alongside the original string:

struct IndexedString {
    let string: String
    let indices: [String.Index]
    init(string: String) { // O(n)
        string = string
        indices = Array(string.indices)
    }
    subscript(offset: Int) -> Character { // O(1)
        return string[indices[i]]
    }
    // [Similar forwarding methods and protocol conformances can go here.]
}

Please note that these strategies fall apart if you plan to include mutating functionality:

let string = "cafe"
let indexed = Array(string)

let appendix = "́" // U+0301, a combining character.
let indexedAppendix = Array(appendix)

let concatenation = string.append(contentsOf: appendix)
let indexedConcatenation = indexed.append(contentsOf: indexedAppendix)
// It is now logically broken.
// This assertion no longer holds:
assert(indexedConcatenation.count == concatenation.count)

However, it is safe to mutate only the string, and create the indexing cache anew from scratch. But that makes each mutation O(n), so it may or may not actually be a net improvement, depending on how it gets used.

There may be other optimizations that can be done in particular use cases, if other invariants are known about the text being processed or how it will be used, so that it can be deduced that certain complications will never be encountered and otherwise dangerous shortcuts can be taken. But to do that would require much more knowledge of the the code and its surroundings.

Hoon_H · September 9, 2019, 3:11am

Yes that is what my current codebase does. Instantiation of each fragment into new Strings. And that's why I claim replacing them to Substring gonna gain great performance raise. Instantiation and copying cost. As a text editor, split large text into many small string is typical task.

But to use Substring, I have to deal with randomized indices -- indices from different Substring instances are not equal anymore even if they represent same relative positions for same contents. And that is what I want to avoid. If I have index(:,offsetBy) in O(1), this can be solved by storing relative offset.
Though it's possible to deal with such randomized indices, it's far more difficult to maintain.

Lantua · September 9, 2019, 3:22am

I don't think having O(1) index(:offsetBy:) is going to help much in this regard. As I said, Substring index is that same as String. In fact, Substring just call the same function in the base String.

Since it's a typical task, I'd say that's all the more so to use Substring or even StringProtocol, because instantiating String does cost, a lot, and could even be the one causing grapheme cluster validation.

If it helps, I remember that UTF8View has fast path (O(1) IIRC, but I'm not too sure). I don't exactly remember how to guide the code through that path. My guess would be to use ASCII string, or pure Swift String.

Hoon_H · September 9, 2019, 3:42am

@SDGGiesbrecht
Yeah I have considered several other approaches, but not really satisfied. I'm sorry for lack of actual code now. I should have kept the branch...

What I want is just String.Index + String.Index.Distance in O(1) time. Both values are known and correct and essentially simple int+int operation under the hood. I just don't have (or find) exposed API.

@Lantua
I agree this would be more ergonomic problem rather than performance. I just want to reduce TCO by avoiding more potential to make bugs. Maybe not a big deal to some people, but very important to me. This is why I was hesitating to adapt Substring. But as like you talked, String is too expensive, so I'm very likely to switch to Substring. Thanks for advise.

I looked into UTF8View hoping to find such method. Actually that is my original question! Just having a index + offset method in O(1). But I had no luck to find out such function. All of index/offset functions in UTF8View say O(1) only if collection is RandomAccessCollection. UTF8View is just BidirectionalCollection and it's O(n). The only thing I could find that conforming RandomAccessCollection was Unicode.Scalar.UTF8View. And that doesn't help a lot.

Lantua · September 9, 2019, 3:56am

I'd also suggest further that you try to implement it in terms of generic StringProtocol, since they do generally have the same set of API, but generic one could easily be maintained should you change your mind (from Substring to, say, UTF8View).

That's how I usually work with String, and it usually result in performant (maintainable?) code. It lets you avoid most unnecessary type gymnastic and hidden conversion. (IMO) It helps you think in term of Swift String.

It's not that simple. String.IndexDistance is computed in terms of Characters, and has no relation whatsoever to the underlying storage.
That's why you need to traverse the whole length to find out where it lands.
Only when it is known to have fixed-sized elements (ASCII String, mostly; I don't know if we have mechanism to explicitly force that), can we do O(1) pointer arithmetic.

And I think that's about as much as you could find from docs, the fast path I mentioned are pretty much implementation details that could go away at anytime.

DeFrenZ · September 9, 2019, 10:22am

I don't understand exactly: are you finding an index in a substring, and then want to use it in another substring that contains the same index?

Hoon_H · September 9, 2019, 11:40am

Though I would like to try, but Swift documentation explicitly prohibits conforming StringProtocol.

Do not declare new conformances to StringProtocol . Only the String and Substring types in the standard library are valid conforming types.

This kind of warning usually discovered if there are some hidden magic, therefore I don't really want to try it.

Counting distance takes O(n). That is well know and I also understand it well.

UTF-8/16 are in-memory live compression algorithm of 32-bit Unicode characters values.
These compression algorithms do not support O(1) time traversing.
String count thing in Charcater (extended grapheme clusters).
String.utf8 count things in UTF-8 code unit.

But what I am talking is not "traversing". Traversing is done before processing, I already have correct indices and distances. And what I want is "making a new index" from the values.

String.utf8 distances are reported as number of UTF-8 code-units. This is defined in Swift String documents. Though String.Index is an opaque types, I am pretty sure that is just integer byte offset of some encoding binary (either of UTF-816/32) under the hood. Otherwise, there's no way to get substring in O(1). If underlying storage is UTF-8 encoded binary, indexing and distance (offset) all can be done with byte offsets. I know and understand Swift prohibits access to underlying integer values of indices to prevent instancing incorrect indices. But I was hoping to find some kind of "quick hatch" API for people seeking rare scenario. For now, the only possible solution is dealing with unsafe/pointer-based/no-copy APIs that I really don't want.

Lantua · September 9, 2019, 11:45am

That is correct. Conforming new types to StringProtocol is discouraged, but using it is not.

You can definitely do,

func foo<S: StringProtocol>(someData: S) {
  ...
} 
extension StringProtocol {
  func bar() {
    ...
  }
}

and call

let someString: String = ...
foo(someData: someString) // S is String
foo(someData: someString[...]) // S is Substring

xwu · September 9, 2019, 11:49am

It's basically impossible to discuss what's going on without the specifics of what you're trying to do. We're all trying to guess:

...but now you're mentioning issues such as conforming to StringProtocol, which totally doesn't make any sense in this context.

Until you can share the concrete details of what you're trying to do, I don't think this conversation is going to produce any useful results.

Hoon_H · September 9, 2019, 12:14pm

@DeFrenZ
Though I am getting lost what I was intended at first...

let s = "ABCABC"
let i1 = "ABC".endIndex
let s1 = s[s.startIndex..<i1]
let s2 = s[i1...]

Now s1 and s2 contains same characters but have different indices, and their indices are not interchangeable.

This is unlike String instances that can interchange their indices if they have same content.

@xwu
Basically I was just asking for a method index1 = index + distance in O(1) time. If such method exists, I want to update my text editor codebase (currently based on String) to use Substring and relative offsets.

As people pointed such method won't be useful, so I was trying to explain how it can be useful.

Conforming StringProtocol could make sense because if I can make a completely new string type from scratch, I can build optimized one for my needs from UTF-8 byte array though it's not very realistic. Anyway, it's not just a complete fantasy if it is allowed. Therefore, I thought @Lantua was talking about the case. I think it's still in context and makes sense.

Though I got helped somewhat in this discussion, I agree that this conversion would not make any more useful result. I'll come up later with better examples if possible.

Lantua · September 9, 2019, 12:37pm

That's not quite. You want to do certain text processing, and believe that O(1) index movement is required. While others think you might be able to avoid doing it altogether.

xwu · September 9, 2019, 12:39pm

Hoon_H:

@DeFrenZ
Though I am getting lost what I was intended at first...
let s = "ABCABC"
let i1 = "ABC".endIndex
let s1 = s[s.startIndex..<i1]
let s2 = s[i1...]
Now s1 and s2 contains same characters but have different indices, and their indices are not interchangeable.

It’s always invalid to use indices from one string to subscript another. In the future, Swift may detect such usage and fatal error; it may happen to work for now, but there is no guarantee it won’t silently give an incorrect result in any future version of Swift.