How to make `String.Index` with `Int` offset (distance) in O(1) time?

Lantua · September 9, 2019, 12:43pm

As @xwu said, it's not safe to do it like that. It could also lead to unexpected behaviour if your index lands in the middle of grapheme.

Not necessarily. There's one more requirement that indices must be able to be shared among different Views of the same string (UTF8View, UTF16View). This make index type rather complex under the hood. We do have fast path if string is simple, but if you use user-input, it's unlikely to stay simple.

Hoon_H · September 9, 2019, 12:55pm

@Lantua Well I thought it is required at first, and now I changed my opinion. It's time to move on to hard ways.

@Lantua Actually I just need method only for UTF-8 and don't need anything for any other encodings. Swift Strings already have some special methods for UTF-8 encoding, so I was hoping to get one like them.

@xwu Yeah I didn't know that so far and I realized that can be unsafe. Thanks for pointing it out. Strings are hard. And realizing this issue would be the biggest result of this discussion to me. I think Swift has to detect and make an error for this cases at least in debug build.

Lantua · September 9, 2019, 1:33pm

Speaking of fast UTF8, there's this [Accepted with modification] SE-0247: Contiguous Strings.

So you can drop-down to array of UInt8 if you need to. Though contiguous string already have fast index manipulation, even for existing APIs.

Hoon_H · September 10, 2019, 12:20am

@Lantua Yup this is what I was talking as unsafe/pointer-based alternative. But I didn't know that UTF8View is also implementing contiguous memory pointer access. It seems this can make some difference to me.

Though I didn't liked pointer controlling, but now I realized that pointer based is not really different with index + distance operation in safety. I think this is worth to try. Thanks.

Lantua · September 10, 2019, 12:31am

Index approach is slightly safer, as it still detects if the value goes out-of-bound.

Even with all the discussion thus far, it's ought to be pretty fast to directly manipulate String, sometimes while jumping around the related type (Substring, Views, etc). That's one of the String design goal.

It'd be great if you can post some code snippet that is slow, but should be fast. Some people here would be interested, and may even want to bake common use case (if not already) into compiler.

Hoon_H · September 10, 2019, 12:36am

I'll post here if I can reproduce similar situation though I am not sure.

Michael_Ilseman · September 11, 2019, 3:14am

If you called makeContiguousUTF8() then indexing operations such as distance, offsetting, etc., on the UTF8View will be O(1).

This is not currently documented as such: I believe we should change the documentation to make this guarantee. I don't know if that would require some participation in swift-evolution or precisely how we'd want to state it. @nnnnnnnn, you've thought a lot more about stdlib documentation practices, what do you think?

Lantua · September 11, 2019, 3:24am

Is ContiguousUTF8 equivalent to fastUTF8 in the stdlib implementation? I tried to dig around but couldn’t find much, and quite reluctant to use them interchangably.

Michael_Ilseman · September 11, 2019, 3:32am

Yes

fastUTF8 is just an internal name for the concept, predating the API (though present in the stdlib ABI). You shouldn't be able to use fastUTF8, unless you mean making inferences from inspecting the stdlib's source code.

Lantua · September 11, 2019, 3:40am

Yes, that’s what happened. I saw them being used a lot when checking for fast path.

nnnnnnnn · September 11, 2019, 6:05am

Generally we try to treat the things we document as promises for future behavior, so whether or not we can add a guarantee to the docs depends on whether that guarantee was a part of an evolution discussion. For example, sorting has been stable since Swift 5.0, but we don’t promise that that’s true in the docs because we haven’t made that commitment yet.

That said, SE-0247 is super clear that the point of calling makeContiguousUTF8() is to get random-access performance when working with a string’s utf-8 encoded contents. Given that, the docs should include the guarantee! Probably on that method and on the UTF8View docs, as well.

gnuoyd · September 11, 2019, 3:42pm

Is there any circumstance where one can rely on those operations to be
O(1) on the UTF16View? That would facilitate creating a fast, compact
implementation in Swift of NSTextStorage, which is indexed by UTF-16
code units.

Dave

Michael_Ilseman · September 11, 2019, 6:05pm

They are currently amortized O(1) thanks to breadcrumbing, but there are some caveats in practice. You'll get constant-ish access, but with higher coefficients than if it were natively encoded as UTF-16. You will pay a O(n) scan the very first time you call such a method on the UTF16View. If you mutate the content, that will invalidate the results of the scan, so it will be performed again the next time.

There's a little more written in the blog post and some forum discussions.

Swift String still bridges to NSString lazily, and still services NSString's API in amortized O(1) time with these breadcrumbs. Since NSStrings are often accessed through message sends in the first place (at least without access to implementation details), this amortization cost is not generally significant. When we did performance analysis, we absolutely hammered these public interfaces in app-like scenarios and couldn't measure a reasonable regression. But, in theory, if you could subvert Objective-C message sends and CF bridging checks on NSString, then you might tell the difference.

We hope to speed up the constant factors considerably with SIMD work, which I just filed a bug to track. The savings would allow us to make better performance vs memory tradeoffs. We also don't have any API to expose what the current tradeoff is. If you have any benchmarks or scenarios where you can observe salient performance differences, please let us know.

We don't formally document this guarantee; not until we have a better idea of where opaque or indirect strings might be headed. But, any String that could be lazily bridged to NSString has to provide performance roughly indistinguishable from O(1).

If you want to provide something like a NSTextStorage equivalent, but supporting both modern UTF-8 interfaces and legacy UTF-16 interfaces, you could consider making the backing be contiguous UTF-8 and rely on these breadcrumbs. Obviously, Array<UInt16> would be faster and more direct, but depending on the operation (especially if it's behind a message send), it may not matter. Otherwise, you can also maintain your own mapping between UTF8View indices and UTF16View indices at the cost of greater memory impact.

gnuoyd · September 12, 2019, 2:25am

Thank you, Michael! This looks just like the information (and
reassurance) I need to take the next step with my rope implementation.

Dave

Michael_Ilseman · September 12, 2019, 3:11am

I opened PR-27135 for that. Let me know how you think it would be best communicated.

@Ben_Cohen, how do you feel about this? Should we at least float it for comment on swift-evolution?

Joe_Groff · September 12, 2019, 4:18pm

Lantua:

That is correct. Conforming new types to StringProtocol is discouraged, but using it is not.

You can definitely do,
func foo<S: StringProtocol>(someData: S) {
  ...
} 
extension StringProtocol {
  func bar() {
    ...
  }
}
and call
let someString: String = ...
foo(someData: someString) // S is String
foo(someData: someString[...]) // S is Substring

Using StringProtocol is probably going to be overkill. If you want an API to accept Strings and Substrings, I think it would be better to only take Substring, since going from String to Substring with s[...] is cheap, and that will avoid the cost of either protocol indirection or specialization code size blowup in the implementation.

Hoon_H · October 3, 2019, 4:51am

I fully switched my codebase to Substring with UTF-8 offset based. Simple, performant and far easier to develop/maintain. Thanks for help.