Jordan, what you wrote is sound. I'm not sure how applicable it is, though... Strings (as well as other value types like Arrays/Dictionaries/Sets/Data) are not thread-safe in Swift. You (impersonally) just can't modify a string in one thread and access it in another thread without synchronisation, be it for the purpose of getting its hash, its length, its contents, or anything else!
Woah! Nifty!
The cached hash would need to be updated, but this needn't even be atomic, because all uses of it must be well-ordered before or after the mutation.
By this, do you mean that the thread that updates the cache needs to emit a barrier after it has written the hash value?
If an ordinary variable is modified on one thread, then accessed (read or modified) on another thread, those threads must synchronize in order to well-order those events. That is just a general requirement for the correctness of memory: all mutations of ordinary memory must be totally ordered, and all reads of that memory must be well-ordered into "slots" in that total order. Basically:
allocation < initialization < reads < mutation < reads < mutation < ... < reads < deinitialization < deallocation
Because this total order is already required, invalidating the hash cache on a string buffer during a mutation can never be concurrent with any other operation on the buffer.
In other words, you would not specifically need barriers when updating the cache during a mutation because you know that there must already be appropriate barriers before/after any accesses to the cache on any other thread.
While that is true, formally neither Swift nor C++ allows mixing atomic and non-atomic operations on the same memory location, IIUC.
So if the memory location requires atomic operations sometimes (which it does), all operations on that location must be atomic. Regardless of what other information we have about whether the location is actually being shared.
Well, Swift doesn't currently expose any atomic operations in the language; that's in pitch. If we wanted freedom to emit these already-well-ordered accesses optimally, I think we'd probably do that by adding a memory ordering that's weaker than relaxed, rather than by allowing ordinary memory accesses to atomic objects via some sort of type-punning. In practice, it would only help with double-word atomics: we'd be able to emit them with loads and stores that would normally be subject to tearing. But if that were important for performance, it'd be easy to do.
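For illustration, here is a hedged sketch of what such a hash cache could look like today using the swift-atomics package (relaxed is the weakest ordering the package offers); all type and method names besides the Atomics APIs are made up:

import Atomics

final class StringStorage {
    var contents: [UInt8] = []
    // 0 stands in for "no cached hash" in this toy sketch.
    let cachedHash = ManagedAtomic<Int>(0)

    func hashValue(compute: ([UInt8]) -> Int) -> Int {
        // Relaxed ordering suffices here: any thread that can reach this
        // storage is already well-ordered relative to its mutations.
        var h = cachedHash.load(ordering: .relaxed)
        if h == 0 {
            h = compute(contents)
            cachedHash.store(h, ordering: .relaxed)
        }
        return h
    }

    func mutate(_ body: (inout [UInt8]) -> Void) {
        // The mutator has exclusive access to this storage, so this
        // store cannot be concurrent with any load above.
        body(&contents)
        cachedHash.store(0, ordering: .relaxed)
    }
}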
Please clarify why (and when) it needs atomic operations. I assume you are talking about a location that stores the hash value.
Yes. Setting aside the other reasons that pre-calculating hash values is not feasible, any values would presumably be calculated on demand and cached in the string's storage itself. That would require atomic operations.
Resetting the value back to <uncalculated> on mutation could, in theory, happen nonatomically, since if you're mutating the string, you have unique ownership of its storage where the value is held. But, as mentioned, we don't support mixed atomic/nonatomic operations.
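To make the unique-ownership point concrete, here is a rough sketch of how a copy-on-write string could justify a nonatomic reset; isKnownUniquelyReferenced is the real standard-library primitive, but everything else is illustrative:

final class Buffer {
    var bytes: [UInt8] = []
    var cachedHash: Int? = nil
}

struct MyString {
    private var buffer = Buffer()

    mutating func append(_ byte: UInt8) {
        if !isKnownUniquelyReferenced(&buffer) {
            // Another copy can still see the old buffer; clone it
            // before mutating (copy-on-write).
            let copy = Buffer()
            copy.bytes = buffer.bytes
            buffer = copy
        }
        buffer.bytes.append(byte)
        // We now hold the only reference to this buffer, so a plain
        // nonatomic store cannot race with readers of it.
        buffer.cachedHash = nil
    }
}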
That being said, in my experience writing plenty of data types where I've done things like this, it's almost always a mistake. Downstream code comes to expect that operations which really cannot be constant-time (e.g. hashing a long string) are constant-time, making those components tightly coupled in very subtle ways. It's better for that downstream code to perform its own caching explicitly (e.g. using something like the PreHashed wrapper suggested by @michelf). Swift's embracing of value semantics really helps you out there.
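As a hedged illustration of what such an explicit-caching wrapper might look like (the name PreHashed comes from the thread; this is not necessarily @michelf's exact version):

struct PreHashed<Wrapped: Hashable>: Hashable {
    let wrapped: Wrapped
    private let cachedHash: Int

    init(_ wrapped: Wrapped) {
        self.wrapped = wrapped
        // Compute the hash once, up front, and reuse it thereafter.
        var hasher = Hasher()
        hasher.combine(wrapped)
        self.cachedHash = hasher.finalize()
    }

    static func == (lhs: Self, rhs: Self) -> Bool {
        // Mismatched cached hashes cheaply rule out most inequalities.
        lhs.cachedHash == rhs.cachedHash && lhs.wrapped == rhs.wrapped
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(cachedHash)
    }
}

Because the wrapper is immutable, value semantics guarantee the cached hash can never go stale.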
This bit is exactly what I am trying to get you to clarify :-) Why?
Here's a very simple string simulation (a class Ref standing in for a string's storage):
final class Ref {
    var contents: UInt64 = 0 {
        didSet { updateHash() }
    }
    var hash: UInt64 = 0xDEADBEEF_FEEDFACE

    func updateHash() {
        hash = contents ^ 0xDEADBEEF_FEEDFACE
    }
    func validateHash() {
        precondition(hash == contents ^ 0xDEADBEEF_FEEDFACE)
    }
}
Obviously if you use it unprotected all hell breaks loose:
let ref = Ref()

// thread 1:
while true {
    let v = UInt64.random(in: 0 ... UInt64.max)
    ref.contents = v
}

// thread 2:
while true {
    ref.validateHash()
}
so you'd need some means of synchronisation, e.g.:
import Foundation

let lock = NSLock()
let ref = Ref()

// thread 1:
while true {
    let v = UInt64.random(in: 0 ... UInt64.max)
    lock.withLock {
        ref.contents = v
    }
}

// thread 2:
while true {
    lock.withLock {
        ref.validateHash()
    }
}
Note, we do not need atomic operations to update or access the memoized hash location: it is already protected by the outer synchronisation.
It is quite similar for an implementation that resets the stored hash value to nil on every change, and whose hashValue accessor checks that location and recomputes it when needed:
final class Ref {
    var contents: UInt64 = 0 {
        didSet { hash = nil }   // invalidate the cache on every mutation
    }
    private var hash: UInt64?
    var hashValue: UInt64 {
        if hash == nil {
            hash = recalculateHash()
        }
        return hash!
    }
    func recalculateHash() -> UInt64 {
        contents ^ 0xDEADBEEF_FEEDFACE   // same toy hash as before
    }
}
I get where you're coming from here, but I think that's a bit defeatist. It's true that changes to the String internals might cause people issues - e.g. raising the threshold on string length for caching the hash value - but they're pretty rare, wouldn't usually happen outside of a toolchain update (a new Xcode major version) or an OS major version, and as such would kind of just be lost in the noise of the general fixes and refactors necessary for any such new release.
Between major releases - and across most of them, surely, since these things wouldn't change that often - it should be pretty stable, which makes it acceptable to rely on for most purposes.
I think it's appropriate to think of things like this - Stdlib / Foundation optimisations - in the same way as compiler optimisations: they shouldn't change the functionality itself, but they're free to make it faster and free to change over time. The more the compiler / stdlib / Foundation can do in this respect, the better, since even if it makes their implementations a bit gnarly, it's worth it given just how many programs benefit.
Guys, I'm still thinking about this. I can implement it myself using:
struct StringMetadata {
    let string: String
    let count: Int
    let isASCII: Bool

    init(_ string: String) {
        self.string = string
        self.count = string.count
        self.isASCII = string.allSatisfy { $0.isASCII }
    }
}
How about adding a compile-time switch to turn this functionality on for the string type? I think it would really benefit those of us whose strings are almost all ASCII.
Is it doable? Thanks again.
@Michael_Ilseman has already said above that if two strings are native and ASCII, then this is already how string comparison works:
Are you seeing otherwise?
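For context, a rough sketch of the kind of fast path being described (this is not the stdlib's actual code):

// If both strings are known to be all-ASCII, canonical equivalence
// collapses to byte equality, so a plain byte compare suffices.
func fastEquals(_ a: [UInt8], _ b: [UInt8], bothKnownASCII: Bool) -> Bool {
    if bothKnownASCII {
        return a == b
    }
    // Otherwise, fall back to full Unicode-aware comparison.
    return String(decoding: a, as: UTF8.self) == String(decoding: b, as: UTF8.self)
}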
Well, yes. My app deals with an array of strings: 370,000 entries averaging 14 characters each, 92% ASCII and 8% Unicode. I have implemented a StringMetadata library via
import Foundation

// Counters for the string, count, and ascii tests, just for testing:
private var sCount = 0 // comparisons decided by the full string test
private var cCount = 0 // comparisons decided by the count test
private var aCount = 0 // comparisons decided by the ascii test
private var tCount = 0 // total count

struct StringMetadata: Equatable, Comparable {
    let string: String
    let count: Int
    let isASCII: Bool

    init(_ string: String) {
        self.string = string
        self.count = string.count
        self.isASCII = string.allSatisfy { $0.isASCII }
        if self.isASCII {
            aCount += 1 // to see what percentage are ascii vs unicode
        }
        tCount += 1 // total
    }

    static func == (lhs: StringMetadata, rhs: StringMetadata) -> Bool {
        tCount += 1 // total, just for testing
        if lhs.count != rhs.count { // test count first, most productive
            cCount += 1 // just for testing
            return false
        } else if lhs.isASCII != rhs.isASCII { // the ascii test helps a lot for unicode searches
            aCount += 1 // just for testing
            return false
        } else {
            sCount += 1 // just for testing
            return lhs.string == rhs.string
        }
    }

    // Comparable requires <; forward to the underlying string.
    static func < (lhs: StringMetadata, rhs: StringMetadata) -> Bool {
        lhs.string < rhs.string
    }
}
to do timing tests, without having to make other code changes.
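For reference, here's a minimal usage sketch of the struct above (the sample words are made up):

let words = ["apple", "Banane", "crème brûlée"]
let entries = words.map(StringMetadata.init)   // built once, up front
let needle = StringMetadata("crème brûlée")    // non-ASCII probe
// Mismatched counts and ASCII flags short-circuit most == calls.
print(entries.contains(needle))                // true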
With this code it does take about 29 times longer to build the StringMetadata array than to build the standard String array, but then searching for an ASCII string is nine times faster, and searching for a Unicode string is 60 times faster.
So I wonder if providing a compiler switch to turn on this sort of caching could be helpful for many general string-handling cases.
Are your strings native to Swift, or are you getting them from Obj-C Foundation APIs?
func readStringFromFile(theFile: Substring) -> String {
    let filePath = String(theFile)
    let path = URL(fileURLWithPath: filePath)
    let text = try? String(contentsOf: path)
    return text!
}
Yup, String(contentsOf:) is a Foundation API, so it's possible that the string is bridged rather than native. That problem will solve itself as more of Foundation is reimplemented in Swift.
I'd be curious whether you can reproduce the same performance characteristics with native strings.
Note that if you use String(contentsOf:encoding:) (i.e. pass the encoding explicitly), then it will produce native Strings for UTF-8 and ASCII.
Ah, I was unaware of the issue. Switching to
func readStringFromFile(theFile: Substring) -> String {
    var aStr = ""
    let filePath = String(theFile)
    let fileManager = FileManager.default
    if let fileData = fileManager.contents(atPath: filePath) {
        if let fileContentString = String(data: fileData, encoding: .utf8) {
            aStr = fileContentString
        } else {
            print("Failed to convert file data to string.")
        }
    } else {
        print("Failed to read file.")
    }
    return aStr
}
indeed sped up the String case. Now StringMetadata is only four times faster searching for ASCII, and only six times faster searching for Unicode.
I simplified to:
func readStringFromFile(theFile: Substring) -> String {
    let text = try? String(contentsOf: URL(fileURLWithPath: String(theFile)), encoding: .utf8)
    return text!
}
StringMetadata is now 3.5 times faster than String when searching for ASCII, and 7.5 times faster when searching for Unicode, in my application.
If there were a compiler switch to turn this sort of "optimization" on natively, I would use it!
Thanks, guys, for your help!