Can I ensure that two strings can share indexes?

Jesse_Grosjean · February 1, 2023, 9:41pm

The problem I am trying to solve is that I have a shared NLTokenizer. The expensive operation is tokenizer.string = .... I would like to avoid setting the string if it's already been set with that same string.

Generally I want something like:

if tokenizer.string != newString {
    tokenizer.string = newString
}

And then I will get String.Index ranges from the tokenizer and use those indexes on newString. This mostly works, but sometimes two strings can be equal (unicode equal) even if they have different underlying unicode code points. When that happens the indexes can also be incompatible and crash.

To avoid this I have changes my logic to something like:

if tokenizer.string.utf8 != newString.utf8 {
    tokenizer.string = newString
}

This seems to fix the problem and avoid the crashing cases that I'm aware of, I'm just wondering if it's an OK way to do things. Specific questions:

If two strings have same utf8 bytes can I assume indices are compatible?
What's the best way to see if two strings have same utf8 bytes. UTF8View does not implement ==. Right now I check that counts are the same and then loop through looking for any byte that's different. Simple, but maybe there's a better way?

Thanks,
Jesse

1-877-547-7272 · February 1, 2023, 9:49pm

It's a good idea to try using collection algorithms (like Sequence.elementsEqual(_:)) instead of raw loops. Dave Abrahams has a good talk on this topic.

SDGGiesbrecht · February 1, 2023, 11:11pm

Not if their internal memory representation is different. .utf8 will compute UTF‐8 bytes for you if necessary, but the string’s own indices might be pointing at UTF‐16 locations that do not exist in UTF‐8 or vice versa. String’s indices also cache stuff that may not be valid in a string that is conceptually identical but located elsewhere.

In general, unless a particular type says otherwise, collection indices are only ever safe to use in the same instance that generated them, and only if it has not been mutated.

If that were really your question, then for strings of unknown origin, the least amount of abstract work would be doing $0.unicodeScalars.elementsEqual($1.unicodeScalars) (although what is actually fastest will depend on the particular string instances).

However, if you have already done makeContiguousUTF8(), then it will be more efficient to do $0.utf8.elementsEqual($1.utf8).

(You can swap elementsEqual(_:) for == if the latter happens to be defined at some point, since it could conceivably take shortcuts like checking memory addresses.)

lorentey · February 2, 2023, 12:30am

I had half of a whole reply typed up, but @SDGGiesbrecht put it far more succinctly!

UTF8View does not implement ==.

This is a constant pain point for me too: at minimum, all string views and slices need to be Equatable, Hashable, and CustomStringConvertible -- this is a basic and obvious usability problem. I'd love to make progress towards fixing this.

Unfortunately, adding the missing conformances to the stdlib may not be possible, as it would be a potentially ABI breaking change.

However, we could at least add correct and fully public implementations for all requirements, and perhaps ship the conformance declarations as a standard source module, outside of the core stdlib itself.

tera · February 2, 2023, 1:29am

Jesse_Grosjean:

The problem I am trying to solve is that I have a shared NLTokenizer. The expensive operation is tokenizer.string = .... I would like to avoid setting the string if it's already been set with that same string.

Generally I want something like:
if tokenizer.string != newString {
    tokenizer.string = newString
}

How long are the strings in question? Is the "tokenizer.string = string" assignment itself slow or does it cause slowness down the road? If the former - how long (e.g. in µs) does it take and how long does the "==" take which you want to use for the test? Also you should consider how often will the == test take the "true" vs "false" path in practice for your data (99% compared to 1% could make a great deal of difference in the overall timing results).

Note that the string EQ is slowest when EQ operation can't use the quick path for some reason (comparing the underlying pointers which happen to be equal) but the strings themselves are equal. Or consider the two long strings that are equal up until the very last byte. For two random strings - EQ is typically fast (quickly returns false after comparing the lengths or the first few characters).

I asked a similar question some time ago, back then there was no known answer.

Also consider the following "quick positive" string equivalence check:

extension String {
    /// quickly returns `true` if the two strings are definitely equal (the same thing)
    /// may return `false` for two strings that are equal when compared with `==`
    /// definitely returns `false` for two != strings
    func isSame(_ other: String) -> Bool {
        precondition(MemoryLayout<String>.size == 16)
        let lhs = unsafeBitCast(self, to: (UInt64, UInt64).self)
        let rhs = unsafeBitCast(other, to: (UInt64, UInt64).self)
        return lhs == rhs
    }
}

// usage:

if !tokenizer.string.isSame(newString) {
    tokenizer.string = newString
}

Karl · February 2, 2023, 1:31am

Note that the generic Sequence.elementsEqual is not always optimal. Since you appear to be concerned about performance, it may be better to use String.withUTF8 to get unsafe buffer pointers (it will also convert the string's internal storage to contiguous UTF8, which is almost always worth it), check the counts are equal, then use memcmp.

Related issue: [SR-15712] UBP<UInt8>/URBP.elementsEqual is slow · Issue #57990 · apple/swift · GitHub

elementsEqual is not a customisation point, so String.UTF8View isn't able to provide these kinds of fast-paths for you. It may be possible if it implemented ==, though.

Alternatively, it would be great if the generic implementation could recognise these opportunities via withContiguousStorageIfAvailable.

Jesse_Grosjean · February 2, 2023, 2:44am

Thanks, and yes that's a better way to think about it for my use case. In 99% of cases that I want to skip assignment to the tokenizer it will be the same instance. I'm going to use this approach and see how it goes.

Jesse_Grosjean · February 2, 2023, 2:53am

@tera I'm getting beyond my actual knowledge here, but is it possible to write a generic "same bytes" function that could be used in optimization cases where the goal is a quick check for equality?

public func isSameBytes<T>(lhs: T, rhs: T) -> Bool {
    var lhs = lhs
    var rhs = rhs
    return memcmp(&lhs, &rhs, MemoryLayout<T>.size) == 0
}

Also I have read that a mutating function will cause a COW to copy even if the method doesn't do any mutation. That doesn't happen if you just assign to a var (like in the above code) right?

taylorswift · February 2, 2023, 3:00am

this depends on where the isKnownUniquelyReferenced(_:) boundary is set. there is a lot of truly terrible Box<T> tutorial code out there that does copy storage on every formal mutation, but a quality data structure library will most likely place the checks close enough that they only copy storage if storage is actually written to according to the semantics of the data structure.

if lhs and rhs are locals, and memcmp only uses the formal mutation to obtain a memory address, then no copy-on-write can possibly happen, because there is nowhere the implementations of types could possibly run the code to copy themselves.

if T is a type with an allocation, then MemoryLayout<T> is just the size of the box. you are comparing pointer values, but it won’t detect if two separately-allocated instances of T have the same bytes in their storage.

tera · February 2, 2023, 3:38am

Jesse_Grosjean:

@tera I'm getting beyond my actual knowledge here, but is it possible to write a generic "same bytes" function that could be used in optimization cases where the goal is a quick check for equality?
public func isSameBytes<T>(lhs: T, rhs: T) -> Bool {
    var lhs = lhs
    var rhs = rhs
    return memcmp(&lhs, &rhs, MemoryLayout<T>.size) == 0
}

Good call to make it that generic. This is what I'd do:

public func isSameBytes<T>(_ lhs: UnsafePointer<T>, _ rhs: UnsafePointer<T>) -> Bool {
    memcmp(lhs, rhs, MemoryLayout<T>.size) == 0
}

Use with & on the call site.

    if tokenizer.string == nil || !isSameBytes(&tokenizer.string!, &newString) {
        tokenizer.string = newString
    }

taylorswift · February 2, 2023, 3:52am

but @Jesse_Grosjean note that in your original example it was not possible for “implicit COW to happen”, but in this instance it is, because taking an instance property (&token.string) inout is allowed to run get or _modify to yield the value, and this code may be where isKnownUniqueReferenced(_:) is checked.

tera · February 2, 2023, 12:25pm

Interestingly we've just reinvented identity operator.

infix operator ==== : ComparisonPrecedence
infix operator !=== : ComparisonPrecedence

func ==== <T>(lhs: T, rhs: T) -> Bool {
    var lhs = lhs
    var rhs = rhs
    return memcmp(&lhs, &rhs, MemoryLayout<T>.size) == 0
}

func !=== <T>(lhs: T, rhs: T) -> Bool {
    !(lhs ==== rhs)
}

It gives the same result as "===" for reference types, and gives the useful "are these bit for bit equivalent values" for value types.

This raises a question: perhaps the "===" operator could be extended to work with value types?