Jordan, what you wrote is sound. I'm not sure how applicable it is, though... Strings (as well as other value types like Arrays/Dictionaries/Sets/Data) are not thread-safe in Swift. You (impersonally) just can't modify a string in one thread and access it in another thread without synchronisation, be it for the purpose of getting its hash, its length, its contents, or anything else!
Woah! Nifty!
The cached hash would need to be updated, but this needn't even be atomic, because all uses of it must be well-ordered before or after the mutation.
By this, do you mean that the thread that updates the cache needs to emit a barrier after it has written the hash value?
If an ordinary variable is modified on one thread, then accessed (read or modified) on another thread, those threads must synchronize in order to well-order those events. That is just a general requirement for the correctness of memory: all mutations of ordinary memory must be totally ordered, and all reads of that memory must be well-ordered into "slots" in that total order. Basically:
allocation < initialization < reads < mutation < reads < mutation < ... < reads < deinitialization < deallocation
Because this total order is already required, invalidating the hash cache on a string buffer during a mutation can never be concurrent with any other operation on the buffer.
In other words, you would not specifically need barriers when updating the cache during a mutation because you know that there must already be appropriate barriers before/after any accesses to the cache on any other thread.
While that is true, formally neither Swift nor C++ allows mixing atomic and non-atomic operations on the same memory location, IIUC.
So if the memory location requires atomic operations sometimes (which it does), all operations on that location must be atomic. Regardless of what other information we have about whether the location is actually being shared.
Well, Swift doesn't currently expose any atomic operations in the language; that's in pitch. If we wanted freedom to emit these already-well-ordered accesses optimally, I think we'd probably do that by adding a memory ordering that's weaker than relaxed, rather than by allowing ordinary memory accesses to atomic objects via some sort of type-punning. In practice, it would only help with double-word atomics: we'd be able to emit them with loads and stores that would normally be subject to tearing. But if that were important for performance, it'd be easy to do.
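For illustration, here is a hedged sketch of what such a hash cache could look like today using the swift-atomics package (relaxed is the weakest ordering the package offers); all type and method names besides the Atomics APIs are made up:

import Atomics

final class StringStorage {
    var contents: [UInt8] = []
    // 0 stands in for "no cached hash" in this toy sketch.
    let cachedHash = ManagedAtomic<Int>(0)

    func hashValue(compute: ([UInt8]) -> Int) -> Int {
        // Relaxed ordering suffices here: any thread that can reach this
        // storage is already well-ordered relative to its mutations.
        var h = cachedHash.load(ordering: .relaxed)
        if h == 0 {
            h = compute(contents)
            cachedHash.store(h, ordering: .relaxed)
        }
        return h
    }

    func mutate(_ body: (inout [UInt8]) -> Void) {
        // The mutator has exclusive access to this storage, so this
        // store cannot be concurrent with any load above.
        body(&contents)
        cachedHash.store(0, ordering: .relaxed)
    }
}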
Please clarify why (and when) it needs atomic operations. I assume you are talking about a location that stores the hash value.
Yes. Setting aside the other reasons that pre-calculating hash values is not feasible, any values would presumably be calculated on demand and cached in the string's storage itself. That would require atomic operations.
Resetting the value back to <uncalculated> on mutation could, in theory, happen nonatomically, since if you're mutating the string, you have unique ownership of its storage where the value is held. But, as mentioned, we don't support mixed atomic/nonatomic operations.
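To make the unique-ownership point concrete, here is a rough sketch of how a copy-on-write string could justify a nonatomic reset; isKnownUniquelyReferenced is the real standard-library primitive, but everything else is illustrative:

final class Buffer {
    var bytes: [UInt8] = []
    var cachedHash: Int? = nil
}

struct MyString {
    private var buffer = Buffer()

    mutating func append(_ byte: UInt8) {
        if !isKnownUniquelyReferenced(&buffer) {
            // Another copy can still see the old buffer; clone it
            // before mutating (copy-on-write).
            let copy = Buffer()
            copy.bytes = buffer.bytes
            buffer = copy
        }
        buffer.bytes.append(byte)
        // We now hold the only reference to this buffer, so a plain
        // nonatomic store cannot race with readers of it.
        buffer.cachedHash = nil
    }
}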
That being said, in my experience writing plenty of data types where I've done things like this, it's almost always a mistake. Downstream code comes to expect that operations which really cannot be constant-time (e.g. hashing a long string) are constant-time, making those components tightly coupled in very subtle ways. It's better for that downstream code to perform its own caching explicitly (e.g. using something like the PreHashed wrapper suggested by @michelf). Swift's embracing of value semantics really helps you out there.
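As a hedged illustration of what such an explicit-caching wrapper might look like (the name PreHashed comes from the thread; this is not necessarily @michelf's exact version):

struct PreHashed<Wrapped: Hashable>: Hashable {
    let wrapped: Wrapped
    private let cachedHash: Int

    init(_ wrapped: Wrapped) {
        self.wrapped = wrapped
        // Compute the hash once, up front, and reuse it thereafter.
        var hasher = Hasher()
        hasher.combine(wrapped)
        self.cachedHash = hasher.finalize()
    }

    static func == (lhs: Self, rhs: Self) -> Bool {
        // Mismatched cached hashes cheaply rule out most inequalities.
        lhs.cachedHash == rhs.cachedHash && lhs.wrapped == rhs.wrapped
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(cachedHash)
    }
}

Because the wrapper is immutable, value semantics guarantee the cached hash can never go stale.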
This bit is exactly what I am trying to get you to clarify :-) Why?
Here's a very simple string simulation (a class Ref standing in for a string's storage):
final class Ref {
    var contents: UInt64 = 0 {
        didSet { updateHash() }
    }
    var hash: UInt64 = 0xDEADBEEF_FEEDFACE

    func updateHash() {
        hash = contents ^ 0xDEADBEEF_FEEDFACE
    }
    func validateHash() {
        precondition(hash == contents ^ 0xDEADBEEF_FEEDFACE)
    }
}
Obviously if you use it unprotected all hell breaks loose:
let ref = Ref()

// thread 1:
while true {
    let v = UInt64.random(in: 0 ... UInt64.max)
    ref.contents = v
}

// thread 2:
while true {
    ref.validateHash()
}
so you'd need some means of synchronisation, e.g.:
import Foundation

let lock = NSLock()
let ref = Ref()

// thread 1:
while true {
    let v = UInt64.random(in: 0 ... UInt64.max)
    lock.withLock {
        ref.contents = v
    }
}

// thread 2:
while true {
    lock.withLock {
        ref.validateHash()
    }
}
Note, we do not need atomic operations to update or access the memoized hash location: it is already protected by the outer synchronisation.
It is quite similar for an implementation that resets the stored hash value to nil on every change, and whose hashValue accessor checks that location and recomputes it when needed:
final class Ref {
    var contents: UInt64 = 0 {
        didSet { hash = nil }   // invalidate the cache on every mutation
    }
    private var hash: UInt64?
    var hashValue: UInt64 {
        if hash == nil {
            hash = recalculateHash()
        }
        return hash!
    }
    func recalculateHash() -> UInt64 {
        contents ^ 0xDEADBEEF_FEEDFACE   // same toy hash as before
    }
}
I get where you're coming from here, but I think that's a bit defeatist. It's true that changes to the String internals might cause people issues - e.g. raising the threshold on string length for caching the hash value - but they're pretty rare, wouldn't usually happen outside of a toolchain update (a new Xcode major version) or an OS major version, and as such would kind of just be lost in the noise of the general fixes and refactors necessary for any such new release.
Between major releases - and across most of them, surely, since these things wouldn't change that often - it should be pretty stable, which makes it acceptable to rely on for most purposes.
I think it's appropriate to think of things like this - Stdlib / Foundation optimisations - in the same way as compiler optimisations: they shouldn't change the functionality itself, but they're free to make it faster and free to change over time. The more the compiler / stdlib / Foundation can do in this respect, the better, since even if it makes their implementations a bit gnarly, it's worth it given just how many programs benefit.
Guys, I'm still thinking about this. I can implement it myself using:
struct StringMetadata {
    let string: String
    let count: Int
    let isASCII: Bool

    init(_ string: String) {
        self.string = string
        self.count = string.count
        self.isASCII = string.allSatisfy { $0.isASCII }
    }
}
How about adding a compile-time switch to turn this functionality on for the string type? I think it would really benefit those of us whose strings are almost all ASCII.
Is it doable? Thanks again.
@Michael_Ilseman has already said above that if two strings are native and ASCII, then this is already how string comparison works:
Are you seeing otherwise?
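For context, a rough sketch of the kind of fast path being described (this is not the stdlib's actual code):

// If both strings are known to be all-ASCII, canonical equivalence
// collapses to byte equality, so a plain byte compare suffices.
func fastEquals(_ a: [UInt8], _ b: [UInt8], bothKnownASCII: Bool) -> Bool {
    if bothKnownASCII {
        return a == b
    }
    // Otherwise, fall back to full Unicode-aware comparison.
    return String(decoding: a, as: UTF8.self) == String(decoding: b, as: UTF8.self)
}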
Well, yes. My app deals with an array of strings: 370,000 entries averaging 14 characters each, 92% ASCII and 8% Unicode. I have implemented a StringMetadata library via
import Foundation

// Counters for the string, count, and ascii tests, just for testing:
private var sCount = 0 // comparisons decided by the full string test
private var cCount = 0 // comparisons decided by the count test
private var aCount = 0 // comparisons decided by the ascii test
private var tCount = 0 // total count

struct StringMetadata: Equatable, Comparable {
    let string: String
    let count: Int
    let isASCII: Bool

    init(_ string: String) {
        self.string = string
        self.count = string.count
        self.isASCII = string.allSatisfy { $0.isASCII }
        if self.isASCII {
            aCount += 1 // to see what percentage are ascii vs unicode
        }
        tCount += 1 // total
    }

    static func == (lhs: StringMetadata, rhs: StringMetadata) -> Bool {
        tCount += 1 // total, just for testing
        if lhs.count != rhs.count { // test count first, most productive
            cCount += 1 // just for testing
            return false
        } else if lhs.isASCII != rhs.isASCII { // the ascii test helps a lot for unicode searches
            aCount += 1 // just for testing
            return false
        } else {
            sCount += 1 // just for testing
            return lhs.string == rhs.string
        }
    }

    // Comparable requires <; forward to the underlying string.
    static func < (lhs: StringMetadata, rhs: StringMetadata) -> Bool {
        lhs.string < rhs.string
    }
}
to do timing tests, without having to make other code changes.
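For reference, here's a minimal usage sketch of the struct above (the sample words are made up):

let words = ["apple", "Banane", "crème brûlée"]
let entries = words.map(StringMetadata.init)   // built once, up front
let needle = StringMetadata("crème brûlée")    // non-ASCII probe
// Mismatched counts and ASCII flags short-circuit most == calls.
print(entries.contains(needle))                // true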
With this code it does take about 29 times longer to build the StringMetadata array than to build the standard String array, but then searching for an ASCII string is nine times faster, and searching for a Unicode string is 60 times faster.
So I wonder if providing a compiler switch to turn on this sort of caching could be helpful for many general string-handling cases.
Are your strings native to Swift, or are you getting them from Obj-C Foundation APIs?
func readStringFromFile(theFile: Substring) -> String {
    let filePath = String(theFile)
    let path = URL(fileURLWithPath: filePath)
    let text = try? String(contentsOf: path)
    return text!
}
Yup, String(contentsOf:) is a Foundation API, so it's possible that the string is bridged rather than native. That problem will solve itself as more of Foundation is reimplemented in Swift.
I'd be curious whether you can reproduce the same performance characteristics with native strings.
Note that if you use String(contentsOf:encoding:) (i.e. pass the encoding explicitly), then it will produce native Strings for UTF-8 and ASCII.
Ah, I was unaware of the issue. Switching to
func readStringFromFile(theFile: Substring) -> String {
    var aStr = ""
    let filePath = String(theFile)
    let fileManager = FileManager.default
    if let fileData = fileManager.contents(atPath: filePath) {
        if let fileContentString = String(data: fileData, encoding: .utf8) {
            aStr = fileContentString
        } else {
            print("Failed to convert file data to string.")
        }
    } else {
        print("Failed to read file.")
    }
    return aStr
}
indeed sped up the String case. Now StringMetadata is only four times faster searching for ASCII, and only six times faster searching for Unicode.
I simplified to:
func readStringFromFile(theFile: Substring) -> String {
    let text = try? String(contentsOf: URL(fileURLWithPath: String(theFile)), encoding: .utf8)
    return text!
}
StringMetadata is now 3.5 times faster than String when searching for ASCII, and 7.5 times faster when searching for Unicode, in my application.
If there were a compiler switch to turn this sort of "optimization" on natively, I would use it!
Thanks, guys, for your help!