Fastest way to advance by char count and get that distance in utf16?

Jesse_Grosjean · March 10, 2022, 10:12pm

Here is what I'm doing now:

var start = string.startIndex // 1
let end = string.index(start, offsetBy: 10) // 2
let length: CFIndex = string.utf16.distance(from: start, to: end) // 3

Is that the best way to do it?

I'm surprised that the second line is about 4x faster than the third line.

My goal: I have a string broken up into saved lengths, which are Ints representing character counts. I want to iterate over my string using those character lengths, but I want to generate CFRanges.

Thanks,
Jesse

jrose · March 10, 2022, 10:23pm

That sounds about right. It makes sense that the third line is fast: it has to convert from one encoding of Unicode to another, but doesn’t care what the codepoints are. Character splitting, on the other hand, has to know whether “e” and COMBINING ACUTE ACCENT should be treated as a single unit “é”, and even if that doesn’t come up in your string, it has to check that that doesn’t come up.

EDIT: I read that backwards! I too am surprised that counting UTF-16 is slower than counting Characters.

ksluder · March 10, 2022, 10:37pm

Is string a native Swift string? Those are UTF-8 nowadays.

jrose · March 10, 2022, 10:45pm

Right, I'm assuming a UTF-8 backing and (half the work of) transcoding to get a UTF-16 count instead (for CFRange). Counting Characters (extended grapheme clusters) should be about equally hard for UTF-8 and UTF-16.

Jesse_Grosjean · March 10, 2022, 10:50pm

Yes, it is a native Swift string. Now that I've verified that I'm not doing anything too wrong, I'll try to write a smaller benchmark that I can share.

SDGGiesbrecht · March 10, 2022, 10:52pm

Saved character counts are not stable across Unicode versions, so if they are saved or transmitted separately from the string itself, they risk pointing at something different after they are reloaded.

If you disallow undefined scalars, then you can pick a normalization form, apply it after loading, and save the ranges as scalar distances in that normalization form. Otherwise, to be completely stable, you need to save raw bytes (not strings) and save the ranges as byte distances.
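To make the difference between these units concrete, here is a minimal sketch that measures the same prefix as Characters, Unicode scalars, UTF-16 code units, and UTF-8 bytes. The sample string is an assumption chosen for illustration, and no normalization is applied; it only shows that the four counts can disagree.

```swift
// "café" spelled with "e" followed by COMBINING ACUTE ACCENT (U+0301).
let text = "cafe\u{301} au lait"
let start = text.startIndex
let end = text.index(start, offsetBy: 4)   // four Characters: c, a, f, é

// The same span, measured in four different units.
let characters = text.distance(from: start, to: end)
let scalars = text.unicodeScalars.distance(from: start, to: end)
let utf16Units = text.utf16.distance(from: start, to: end)
let utf8Bytes = text.utf8.distance(from: start, to: end)

print(characters, scalars, utf16Units, utf8Bytes)   // 4 5 5 6
```

Only the first number depends on grapheme-breaking rules, which is why it is the one that can change between Unicode versions.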

Unless the context guarantees the string is long enough, you probably want to use index(_:offsetBy:limitedBy:) instead.
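As a small sketch of that difference, assuming a sample string shorter than the requested offset: the limited variant returns nil where the unlimited one would fail at runtime.

```swift
let text = "short"
let start = text.startIndex

// index(_:offsetBy:) with an offset of 10 would be a runtime error here,
// because the string has only 5 Characters. The limited variant returns nil.
let pastEnd = text.index(start, offsetBy: 10, limitedBy: text.endIndex)

// An offset that fits inside the string returns a usable index.
let inRange = text.index(start, offsetBy: 3, limitedBy: text.endIndex)

print(pastEnd == nil)                                     // true
print(inRange.map { String(text[start..<$0]) } ?? "nil")  // sho
```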

Jesse_Grosjean · March 11, 2022, 3:47pm

Well my head hurts...

Outside of where I'm actually using this code in my app, string.index(start, offsetBy: 10) is the expensive call and string.utf16.distance(from: start, to: end) is the cheap call. Inside my app, where I care, it's the opposite.

Maybe this IS the problem. What's a foolproof way to make sure that I'm dealing with a native Swift string? I am trying:

let maybeNotNative: String = ...
var forceNative = String(maybeNotNative[...])
forceNative.makeContiguousUTF8()

But I still get odd performance in that case.

Jesse
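One possible check, sketched here under the assumption that "native" means contiguous UTF-8 storage, is the standard library's isContiguousUTF8 property, which can be read before and after makeContiguousUTF8(). The sample string stands in for maybeNotNative; whether this explains the timings in the app is not established by this thread.

```swift
// Stand-in for a string of unknown origin; a literal is used for illustration.
var text = "na\u{EF}ve caf\u{E9}"

// Reports whether the storage is already contiguous UTF-8.
let wasContiguous = text.isContiguousUTF8

// Copies into native contiguous UTF-8 storage if needed; otherwise does nothing.
text.makeContiguousUTF8()

// withUTF8 exposes the contiguous bytes directly.
let byteCount = text.withUTF8 { $0.count }

print(wasContiguous, text.isContiguousUTF8, byteCount)
```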

Jesse_Grosjean · March 11, 2022, 7:26pm

Maybe done here.

To recap, I'm trying to iterate over a string. I pass in lengths in Character units. I want to get out CFRanges (which are measured in UTF-16 code units). The two key methods to do this are:

String.index(_:offsetBy:limitedBy:) // (1) Advance existing index by chars
String.UTF16View.distance(from:to:) // (2) Calculate CFRange length

It seems that the performance relation between those two functions varies based on the underlying string and on the advance length.

Shorter run lengths mean more time spent calculating (2) relative to (1).
When the string contains non-ASCII characters, (2) gets even more expensive relative to (1).

... I think.
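The loop described in this recap can be sketched as follows. UTF16Range and utf16Ranges(in:characterLengths:) are hypothetical names introduced here; the struct stands in for CFRange so the example needs no imports, and the sample string and lengths are made up.

```swift
// Hypothetical stand-in for CFRange: both fields count UTF-16 code units.
struct UTF16Range: Equatable {
    var location: Int
    var length: Int
}

func utf16Ranges(in string: String, characterLengths: [Int]) -> [UTF16Range] {
    var ranges: [UTF16Range] = []
    var start = string.startIndex
    var location = 0
    for count in characterLengths {
        // (1) Advance by Characters; stop if the string is too short.
        guard let end = string.index(start, offsetBy: count,
                                     limitedBy: string.endIndex) else { break }
        // (2) Measure the same span in UTF-16 code units.
        let length = string.utf16.distance(from: start, to: end)
        ranges.append(UTF16Range(location: location, length: length))
        location += length
        start = end
    }
    return ranges
}

// "héllo 👋 world", split into runs of 6, 2, and 5 Characters.
let ranges = utf16Ranges(in: "h\u{E9}llo \u{1F44B} world",
                         characterLengths: [6, 2, 5])
for range in ranges {
    print(range.location, range.length)
}
```

The waving-hand emoji is one Character but two UTF-16 code units, so the second range has length 3 even though it covers only 2 Characters.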

Alejandro · March 11, 2022, 10:31pm

Yeah, unfortunately a native string in this case is worse than a bridged one. For a pure ASCII native string this is not an issue, but anything outside of ASCII requires transcoding both indices (start and end) to UTF-16 offsets, although transcoding start is trivial. For a UTF-16 backed string, this operation is trivial (end - start). And yes, what Jordan said about counting characters is true: it's pretty much equally hard for both UTF-8 and UTF-16 backed strings.
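A different technique from the one discussed above, offered only as a sketch: walk the Characters once and add up each Character's UTF-16 width, so there is no separate distance(from:to:) call per run. The function name utf16Lengths(of:characterLengths:) is hypothetical, and whether this is actually faster has not been measured here.

```swift
func utf16Lengths(of string: String, characterLengths: [Int]) -> [Int] {
    var lengths: [Int] = []
    var characters = string.makeIterator()
    for count in characterLengths {
        var length = 0
        var taken = 0
        // Consume up to `count` Characters, summing their UTF-16 widths.
        // If the string runs out early, the partial (possibly zero) sum is kept.
        while taken < count, let character = characters.next() {
            length += character.utf16.count
            taken += 1
        }
        lengths.append(length)
    }
    return lengths
}

// "héllo 👋 world", split into runs of 6, 2, and 5 Characters.
let lengths = utf16Lengths(of: "h\u{E9}llo \u{1F44B} world",
                           characterLengths: [6, 2, 5])
print(lengths)   // [6, 3, 5]
```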