Observation alternatives discussions

Dave, we could leave that other thread alone and move here.

I'm with you! But let's see how this could translate to practice.

Imagine this very simple setup:

struct User {
    var firstName: String
    var lastName: String
    var gender: Gender
    var dob: Date
    var weight: Double
    var height: Double
    // some other fields
}

var users: [User] = ...

let oldUsers = users
users.insert(user, at: 12345)
users.remove(at: 12345)

let changed = users != oldUsers // 🤔

How do you compare the users variable efficiently?
It looks like an O(N) operation to me. Even if the individual User comparison is quick (hitting the "same strings" fast path and comparing the rest, perhaps via memcmp), we are still talking about O(N) complexity. And we know we could do better (with or without observation).

I think this is just a normal programming/optimization problem.

Step 1 might be adding CoW around your User struct. Both indirect enum cases and existentials do that automatically, IIUC, though I don't know if there's any provision for a fast-path equality check that returns true when two instances share storage. If not, you'd have to implement CoW manually (and please submit a feature request). Now you've still got O(N) but the constant is way lower and there's much more data locality.
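For illustration, a manual CoW wrapper with a storage-identity fast path for equality might look like this (a minimal sketch; Box and CoW are made-up names, not an existing API):

```swift
// A reference box holding the actual storage.
final class Box<T> {
    var value: T
    init(_ value: T) { self.value = value }
}

// Copy-on-write wrapper: copies share storage until one of them mutates.
struct CoW<T: Equatable>: Equatable {
    private var box: Box<T>
    init(_ value: T) { box = Box(value) }

    var value: T {
        get { box.value }
        set {
            // Copy the storage only if someone else is still referencing it.
            if !isKnownUniquelyReferenced(&box) {
                box = Box(box.value)
            }
            box.value = newValue
        }
    }

    static func == (lhs: CoW, rhs: CoW) -> Bool {
        // Fast path: shared storage means definitely equal, no O(N) walk.
        lhs.box === rhs.box || lhs.box.value == rhs.box.value
    }
}
```

A copy that was never mutated compares equal via the `===` identity check in O(1); only after a mutation forces a storage copy does `==` fall back to the element-wise comparison.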

To do better than that, you'd want a different data structure than a bare array. One simple move might be to wrap the array in a data structure that records additional information about each mutation. The details obviously depend on what kinds of mutations you're going to make. There is probably a way to exploit in-memory B-trees to get very good worst-case performance for these comparisons, though I haven't thought about the details much. Hmm, with plain arrays I bet you could exploit @sean-parent's "Russian coatcheck algorithm" to do really well for a wide variety of applications.
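As a toy illustration of "recording additional information about each mutation" (the names and design here are made up, not a worked-out proposal), a wrapper could keep a generation counter plus a log of touched indices:

```swift
// Array wrapper that records which indices each mutation touched,
// so a later diff only needs to look at the logged indices.
struct TrackedArray<Element> {
    private(set) var elements: [Element] = []
    private(set) var generation = 0
    private(set) var touchedIndices: Set<Int> = []

    mutating func append(_ element: Element) {
        touchedIndices.insert(elements.count)
        elements.append(element)
        generation += 1
    }

    subscript(index: Int) -> Element {
        get { elements[index] }
        set {
            elements[index] = newValue
            touchedIndices.insert(index)
            generation += 1
        }
    }

    // After a consumer has processed the changes, it resets the log;
    // the generation counter keeps growing for cheap "anything changed?" checks.
    mutating func clearChangeLog() {
        touchedIndices.removeAll()
    }
}
```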

Note that in most cases, a precise diff isn't necessary; it's enough to know a superset of what changed at some lower granularity, and depending on the application if the diffs are getting large/complex/slow you might want your differencing algorithm to give up and report that everything changed (lowest granularity).

If this is an important kind of problem for Swift programmers and Foundation's collection diffing isn't enough to make it efficient and easy, someone should be developing the libraries and/or language features to support it. Maybe you'll discover what those APIs should be by trying it yourself.
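For reference, the collection diffing mentioned here ships as difference(from:) in the standard library (SE-0240, Ordered Collection Diffing). It produces a precise diff, but it's not cheap in the worst case, so it doesn't by itself answer the efficiency question:

```swift
let old = ["a", "b", "c"]
let new = ["a", "c", "d"]

// difference(from:) yields removals and insertions (Element: Equatable).
let diff = new.difference(from: old)
for change in diff {
    switch change {
    case .remove(let offset, let element, _):
        print("removed \(element) at offset \(offset)")
    case .insert(let offset, let element, _):
        print("inserted \(element) at offset \(offset)")
    }
}

// Applying the diff to the old value reconstructs the new one.
let rebuilt = old.applying(diff)
```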


Never heard of that, thank you.

But how exactly could it help?

var users: [User] = ...

let oldUsers = users
users.append(user)
users.remove(at: users.count - 1)

let changed = users != oldUsers // 🤔 Still O(N)

Yep. I was thinking about having a unique ID along with data structures like array / dictionary, where every change changes that ID. There would be false negatives (like in the above example), but at least every comparison operation would be quick. It might be worth introducing another comparison operator with the meaning "definitely equal", which would always give its answer in O(1) time, at the price of sometimes giving false negatives.

            "normal EQ"   "definitely equal"
A A            true          true or false
A B           false              false

Depends on the details of the application of course, but if the underlying data structure could track regions that may have changed (say groups of 1000 elements), or if it was a Deque with CoW'd segments, the fact that most operations on Sean's registry leave items in their existing positions could be easily exploited.

Maybe TreeSet and TreeDictionary from swift-collections could be helpful here.


Exploring this possibility further. Something like this would do:

import Foundation

extension UUID {
    // Foundation has no UUID.zero; define one as the all-zero UUID
    static let zero = UUID(uuid: (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
}

struct ArrayWrapper<T> {
    var uuid: UUID = .zero
    
    var array: [T] {
        didSet { changed() }
    }
    
    mutating func changed() {
        // all empty arrays share the zero ID; any other change gets a fresh one
        uuid = array.isEmpty ? .zero : UUID()
    }
    
    mutating func append(_ element: T) {
        array.append(element)
    }
    
    ...
}

If UUID() is not performant enough to be used on every change, then perhaps allocate a few random numbers up front and increment them on every change:

struct UniqueIdentifier: Equatable {
    var a: Int = .random(in: .min ... .max)
    var b: Int = .random(in: .min ... .max)
    var c: Int = .random(in: .min ... .max)
    
    mutating func change() {
        a &+= 1
        b &+= 1
        c &+= 1
    }
}

The most interesting bit:

extension ArrayWrapper {
    // gives an approximate result in O(1) time
    func mostLikelyEqual(_ other: Self) -> Bool {
        var this = self
        var other = other
        // quick positive check (bitwise comparison, which covers
        // the case of a shared array buffer reference)
        if memcmp(&this, &other, MemoryLayout<Self>.size) == 0 {
            return true
        }
        // quick positive check (comparing uuid's)
        if this.uuid == other.uuid {
            return true
        }
        // can't get the result quickly
        return false
    }
}

However.... this still won't pass the above test:

var users: ArrayWrapper<User> = ...

let oldUsers = users
users.append(user)
users.remove(at: users.count - 1)

let changed = users.mostLikelyEqual(oldUsers) // false 🤔

Another, very different approach that would solve more cases (including the above) is to maintain a quick and reasonably wide hash:

struct Array<T> {
    // hypothetical extra field, maintained incrementally on every mutation
    var qhash: (Int, Int, Int, Int)
}

Compared to a normal hash (which requires O(N) to calculate), this quick hash would be maintained and updated on every change operation. E.g. appending an element to the end of the array would take the "quick hash" of the element, combine it with the index of the newly inserted element (for an append that would be count - 1), and merge it into the existing quick hash of the array. Inserting an element in the middle of the array would have to recalculate/recombine the qhash contributions of half the array, but in principle that would not increase the time complexity of the insertion operation (it would remain O(N), dominated by the cost of memory I/O). Similarly for strings, dictionaries, Data, etc. The "most likely equal" check would become:

    // O(1) complexity
    func mostLikelyEqual(_ other: Self) -> Bool {
        // quick negative check (comparing counts)
        if count != other.count {
            return false
        }
        var this = self
        var other = other
        // quick positive check (comparing references)
        if memcmp(&this, &other, MemoryLayout<Self>.size) == 0 {
            return true
        }
        // quick positive check (comparing qhashes)
        // can result in a false positive due to a hash collision
        return qhash == other.qhash
    }

This would be a probabilistic check with a possibility of false positives due to hash collisions, but with no false negatives (a false result would mean the two values are definitely not equal). To reduce the probability of false positives to a negligible value, a wide hash could be used, e.g. 256 bits.
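A minimal sketch of the incrementally maintained hash, under some simplifying assumptions of mine: a single 64-bit Int rather than the 256-bit hash suggested above, XOR-combined per-(element, index) contributions, and only append/removeLast implemented. A nice side effect of XOR-combining is that contributions cancel, so this version even passes the append-then-remove test above:

```swift
struct QuickHashedArray<Element: Hashable> {
    private(set) var elements: [Element] = []
    private(set) var qhash: Int = 0

    // Per-(element, index) contribution. Hasher is randomly seeded per
    // process, which is fine for comparisons within a single run.
    private static func contribution(of element: Element, at index: Int) -> Int {
        var hasher = Hasher()
        hasher.combine(element)
        hasher.combine(index)
        return hasher.finalize()
    }

    mutating func append(_ element: Element) {
        // XOR-combining keeps the running hash updatable in O(1) on append.
        qhash ^= Self.contribution(of: element, at: elements.count)
        elements.append(element)
    }

    @discardableResult
    mutating func removeLast() -> Element {
        let element = elements.removeLast()
        // XOR of the same contribution cancels the earlier append.
        qhash ^= Self.contribution(of: element, at: elements.count)
        return element
    }

    // O(1) probabilistic check: false means definitely not equal;
    // true can (very rarely) be a hash collision.
    func mostLikelyEqual(_ other: Self) -> Bool {
        elements.count == other.elements.count && qhash == other.qhash
    }
}
```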

Edit: the mostLikelyEqual check could also include the lhs.count == rhs.count check, to further restrict false positives to hash collisions between collections of the same size (updated above).

Edit 2: for small POD types (e.g. Int, or a struct / tuple of a few small POD values, etc.) there's no need to maintain a stored qhash variable, as it could be calculated on the fly in effectively constant time:

protocol SmallPOD {}

extension SmallPOD {
    
    func combine(hash: inout (Int, Int, Int, Int), byte: UInt8, index: Int) {
        // ...
    }
    
    var qhash: (Int, Int, Int, Int) {
        let size = MemoryLayout<Self>.size
        precondition(size < 100) // let's keep this 100*O(1), which is still O(1)
        var hash = (0, 0, 0, 0)
        withUnsafeBytes(of: self) { buffer in
            for i in 0 ..< size {
                combine(hash: &hash, byte: buffer[i], index: i)
            }
        }
        return hash
    }
    
    func mostLikelyEqual(_ other: Self) -> Bool {
        let size = MemoryLayout<Self>.size
        precondition(size < 100) // let's keep this 100*O(1), which is still O(1)
        var this = self
        var other = other
        return memcmp(&this, &other, size) == 0
    }
}

Floating-point values would expose the infamous anomaly: Float.nan would be mostLikelyEqual to another Float.nan, which disagrees with the normal == behaviour.
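The anomaly can be seen directly. Float.nan is a fixed quiet-NaN bit pattern, so the bytes match even though IEEE 754 says NaN never equals itself:

```swift
import Foundation

var a = Float.nan
var b = Float.nan
let bitwiseEqual = memcmp(&a, &b, MemoryLayout<Float>.size) == 0

print(a == b)        // false: IEEE 754, NaN never equals itself
print(bitwiseEqual)  // true: identical bit patterns
```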