Checking element existence in set without allocating it?

Hi swift users! Recently I tried to write some parser in swift, and faced with one bottleneck. Basicly, I have sequence of words separated by whitespace, and I want to get a set of unique words (this is very simplified, but shows the problem):

func uniqWords(from str: String, words: inout Set<String>) {
  ...
  while (....) {
    let word: Substring = ....
    words.insert(word)
  }
}
var result = Set<String>()
for f in files {
  uniqWords(from: readfile(f), &result)
}

Of course, that will not work, because types don’t match. There are two workaraounds:

// First way
func uniqWords1(from str: String, words: inout Set<String>) {
  ...
  while (....) {
    let word: Substring = ....
    let stringWord = String(word) // <- Extra allocation
    words.insert(stringWord)
  }
}
var result = Set<String>()
for f in files {
  uniqWords(from: readfile(f), &result)
}
// Second way
func uniqWords2(from str: String, words: inout Set<Substring>) {
  ...
  while (....) {
    let word: Substring = ....
    words.insert(word)
  }
}
var acc = Set<Substring>()
let filesContent = files.map(readfile) // <- keeping all files in memory
for f in filesContent {
  uniqWords(from: f, &result)
}
let result = Set(acc.map(String.init)) // <- allocating new set

And both of them are not perfect in terms of performance. First will allocate many additional strings that will be discarded, second (1) will create two sets (2) requires to keep all initial data in memory.

So, do you know the way to do it better? Is there a way to check existence of element without actually allocating it?

What's the problem with that? The initial data is already in memory (str), and it doesn't escape via the function either.

Your other option would be to create a new split that returns a Sequence of String yourself. It could still be a lot of allocation (more than option 2 most likely), but they don't stay alive for long.

As I said, example is very simplified. In actual parser there are a lot of strings coming from different files, and set is passed as an inout parametr.

Even better, you can directly add them to the resulting set without converting them back to Set:

func uniqWords(from str: String, result: inout Set<String>) {
  let comps = str.split(separator: " ")
  result.formUnion(Set(comps).lazy.map(String.init))
}

formIntersection and formSymmetricDifference also have overloads that take arbitrary Sequence.

Not sure if you'd need lazy to prevent eager computation. That depends on the workload at hands.

Edit:

Why do I bother creating a Set when formUnion can just take them in. :woman_facepalming:

func uniqWords(from str: String, result: inout Set<String>) {
  result.formUnion(str.split(separator: " ").lazy.map(String.init))
}

which is very similar to (1) but with lazy.

func uniqWords1lazy(from str: String) -> Set<String> {
    let comps = str.split(separator: " ").lazy.map(String.init)
    return Set(comps)
}

This doesn't help to avoid extra allocations of string that will be discarded in set insertion. I've updated code to be more explicit about the problem.

It might help if Substring will have a way to give temporary String object, without actually allocating new buffer, something like this:

extension Substring {
  func withTemporaryString(_ block: (String) -> Void) { ... }
}

And putting word to set would be:

let word: Substring = ....
word.withTemporaryString {
  guard !words.contains($0) else { return }
  words.insert(String(word))
}

Allocation is done only once per unique element (there is also additional hash calculation, but thats ok, the critical path is to skip elements that are already in set).

Another way is to have special method in Set that accepts hashValue and equating function to check element existence:

extension Set {
  func contains<T>(
    _ obj: T, 
    hashValue: Int, 
    equalityCheck: (Element, T) -> Bool
  ) -> Bool {...}
}

And putting word to set would be:

let word: Substring = ....
let alreadyContains = 
  words.contains(word, hashValue: word.hashValue) { $0[...] == $1 }
if !alreadyContains { words.insert(String(word)) }

Personally, I prefer the second way, as it's more flexible, but, as far as I understand, both of them are not possible in current stdlib.

I think that's correct. It would be nice if Set could be made more flexible in some way. For now, perhaps something like the following could work?

enum Word: Equatable, Hashable {
    case string(String)
    case substring(Substring)
    
    init(_ string: String)       { self = .string(string) }
    init(_ substring: Substring) { self = .substring(substring) }
    
    func storedAsString() -> Word {
        switch self {
            case .string(_)                : return self
            case .substring(let substring) : return .string(String(substring))
        }
    }
    
    static func == (left: Word, right: Word) -> Bool {
        switch (left, right) {
            case (.string(let left)   , .string(let right))    : return left == right
            case (.string(let left)   , .substring(let right)) : return left == right
            case (.substring(let left), .string(let right))    : return left == right
            case (.substring(let left), .substring(let right)) : return left == right
        }
    }
    
    func hash(into hasher: inout Hasher) {
        switch self {
            case .string(let string)       : hasher.combine(string)
            case .substring(let substring) : hasher.combine(substring)
        }
    }
}

func uniqWords(from str: String, words: inout Set<Word>) {
    let comps = str.split(separator: " ").map(Word.init)
    for comp in comps {
        if !words.contains(comp) {
            words.insert(comp.storedAsString())
        }
    }
}

It’s quite likely you’re needlessly optimising here. Swift has a special representation for small strings wherein it packs the string contents into the String struct, avoiding any allocations. If most of your substrings are short (up to 15 Unicode code units on 64-bit platforms) creating Strings from them should be a very inexpensive operation.

Also, if you wanted another fun way to solve this problem (which won’t necessarily be any faster), you could try building a trie.

I’ve measured it with tools, it’s about 25-30% off all execution time. Another ~40% goes to hash calculation. And no, they are not less than 15 bits, its file paths, usually 100-200 chars.

Yep, I tried to build trie also, but got much worse result.

Thanks! I will try to measure it, interesting how that indirection will perform.

It’s an interesting idea. Though it’s very easy to accidentally escape the temp String, which you just did in the sample code. String(word) won’t know if it’s a temp string, you’ll need to insert data outside the block.

To extend @orobio’s idea, perhaps a final class-backed Set would help here. Just do comparison with class, and performs convertToString separately in each element.

final class Word: Hashable {
  // do the same thing as @orobio did

  // convert internal storage to `String`
  func convertToString() {...}
}

func uniqueWords3(...) {
  for word in str.split(" ") {
    let candidate = Word(word)
    if result.insert(candidate).inserted {
      candidate.convertToString()
    }
  }
}

now you’ll perform conversion only at the new unique elements.

PS
Don’t forget to release build when benchmarking :stuck_out_tongue:.