Hi swift users! Recently I tried to write some parser in swift, and faced with one bottleneck. Basicly, I have sequence of words separated by whitespace, and I want to get a set of unique words (this is very simplified, but shows the problem):
func uniqWords(from str: String, words: inout Set<String>) {
...
while (....) {
let word: Substring = ....
words.insert(word)
}
}
var result = Set<String>()
for f in files {
uniqWords(from: readfile(f), &result)
}
Of course, that will not work, because types don’t match. There are two workaraounds:
// First way
func uniqWords1(from str: String, words: inout Set<String>) {
...
while (....) {
let word: Substring = ....
let stringWord = String(word) // <- Extra allocation
words.insert(stringWord)
}
}
var result = Set<String>()
for f in files {
uniqWords(from: readfile(f), &result)
}
// Second way
func uniqWords2(from str: String, words: inout Set<Substring>) {
...
while (....) {
let word: Substring = ....
words.insert(word)
}
}
var acc = Set<Substring>()
let filesContent = files.map(readfile) // <- keeping all files in memory
for f in filesContent {
uniqWords(from: f, &result)
}
let result = Set(acc.map(String.init)) // <- allocating new set
And both of them are not perfect in terms of performance. First will allocate many additional strings that will be discarded, second (1) will create two sets (2) requires to keep all initial data in memory.
So, do you know the way to do it better? Is there a way to check existence of element without actually allocating it?
What's the problem with that? The initial data is already in memory (str), and it doesn't escape via the function either.
Your other option would be to create a new split that returns a Sequence of String yourself. It could still be a lot of allocation (more than option 2 most likely), but they don't stay alive for long.
As I said, example is very simplified. In actual parser there are a lot of strings coming from different files, and set is passed as an inout parametr.
This doesn't help to avoid extra allocations of string that will be discarded in set insertion. I've updated code to be more explicit about the problem.
Allocation is done only once per unique element (there is also additional hash calculation, but thats ok, the critical path is to skip elements that are already in set).
Another way is to have special method in Set that accepts hashValue and equating function to check element existence:
It’s quite likely you’re needlessly optimising here. Swift has a special representation for small strings wherein it packs the string contents into the String struct, avoiding any allocations. If most of your substrings are short (up to 15 Unicode code units on 64-bit platforms) creating Strings from them should be a very inexpensive operation.
Also, if you wanted another fun way to solve this problem (which won’t necessarily be any faster), you could try building a trie.
I’ve measured it with tools, it’s about 25-30% off all execution time. Another ~40% goes to hash calculation. And no, they are not less than 15 bits, its file paths, usually 100-200 chars.
Yep, I tried to build trie also, but got much worse result.
It’s an interesting idea. Though it’s very easy to accidentally escape the temp String, which you just did in the sample code. String(word) won’t know if it’s a temp string, you’ll need to insert data outside the block.
To extend @orobio’s idea, perhaps a final class-backed Set would help here. Just do comparison with class, and performs convertToString separately in each element.
final class Word: Hashable {
// do the same thing as @orobio did
// convert internal storage to `String`
func convertToString() {...}
}
func uniqueWords3(...) {
for word in str.split(" ") {
let candidate = Word(word)
if result.insert(candidate).inserted {
candidate.convertToString()
}
}
}
now you’ll perform conversion only at the new unique elements.
PS
Don’t forget to release build when benchmarking .