Observation alternatives discussions

Dave, we could leave that other thread alone and move here.

I'm with you! But let's see how this could translate to practice.

Imagine this very simple setup:

struct User {
    var firstName: String
    var lastName: String
    var gender: Gender
    var dob: Date
    var weight: Double
    var height: Double
    // some other fields
}

var users: [User] = ...

let oldUsers = users
users.insert(user, at: 12345)
users.remove(at: 12345)

let changed = users != oldUsers // 🤔

How do you compare the users variable efficiently?
It looks like an O(N) operation to me. Even if the individual User comparison is quick (hitting the "same strings" fast path and comparing the rest, perhaps via memcmp), we are still talking about O(N) complexity. And we know we could do better (with or without observation).

I think this is just a normal programming/optimization problem.

Step 1 might be adding CoW around your User struct. Both indirect enum cases and existentials do that automatically, IIUC, though I don't know if there's any provision for a fast-path equality check that returns true when two instances share storage. If not, you'd have to implement CoW manually (and please submit a feature request). Now you've still got O(N) but the constant is way lower and there's much more data locality.
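For illustration, a manual CoW wrapper with a storage-identity fast path for equality might look like this (a minimal sketch; Box and CoW are made-up names, not an existing API):

```swift
// A reference box holding the actual storage.
final class Box<T> {
    var value: T
    init(_ value: T) { self.value = value }
}

// Copy-on-write wrapper: copies share storage until one of them mutates.
struct CoW<T: Equatable>: Equatable {
    private var box: Box<T>
    init(_ value: T) { box = Box(value) }

    var value: T {
        get { box.value }
        set {
            // Copy the storage only if someone else is still referencing it.
            if !isKnownUniquelyReferenced(&box) {
                box = Box(box.value)
            }
            box.value = newValue
        }
    }

    static func == (lhs: CoW, rhs: CoW) -> Bool {
        // Fast path: shared storage means definitely equal, no O(N) walk.
        lhs.box === rhs.box || lhs.box.value == rhs.box.value
    }
}
```

A copy that was never mutated compares equal via the `===` identity check in O(1); only after a mutation forces a storage copy does `==` fall back to the element-wise comparison.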

To do better than that, you'd want a different data structure than a bare array. One simple move might be to wrap the array in a data structure that records additional information about each mutation. The details obviously depend on what kinds of mutations you're going to make. There is probably a way to exploit in-memory B-trees to get very good worst-case performance for these comparisons, though I haven't thought about the details much. Hmm, with plain arrays I bet you could exploit @sean-parent's "Russian coatcheck algorithm" to do really well for a wide variety of applications.
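As a toy illustration of "recording additional information about each mutation" (the names and design here are made up, not a worked-out proposal), a wrapper could keep a generation counter plus a log of touched indices:

```swift
// Array wrapper that records which indices each mutation touched,
// so a later diff only needs to look at the logged indices.
struct TrackedArray<Element> {
    private(set) var elements: [Element] = []
    private(set) var generation = 0
    private(set) var touchedIndices: Set<Int> = []

    mutating func append(_ element: Element) {
        touchedIndices.insert(elements.count)
        elements.append(element)
        generation += 1
    }

    subscript(index: Int) -> Element {
        get { elements[index] }
        set {
            elements[index] = newValue
            touchedIndices.insert(index)
            generation += 1
        }
    }

    // After a consumer has processed the changes, it resets the log;
    // the generation counter keeps growing for cheap "anything changed?" checks.
    mutating func clearChangeLog() {
        touchedIndices.removeAll()
    }
}
```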

Note that in most cases, a precise diff isn't necessary; it's enough to know a superset of what changed at some lower granularity, and depending on the application if the diffs are getting large/complex/slow you might want your differencing algorithm to give up and report that everything changed (lowest granularity).

If this is an important kind of problem for Swift programmers and Foundation's collection diffing isn't enough to make it efficient and easy, someone should be developing the libraries and/or language features to support it. Maybe you'll discover what those APIs should be by trying it yourself.
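For reference, the collection diffing mentioned here ships as difference(from:) in the standard library (SE-0240, Ordered Collection Diffing). It produces a precise diff, but it's not cheap in the worst case, so it doesn't by itself answer the efficiency question:

```swift
let old = ["a", "b", "c"]
let new = ["a", "c", "d"]

// difference(from:) yields removals and insertions (Element: Equatable).
let diff = new.difference(from: old)
for change in diff {
    switch change {
    case .remove(let offset, let element, _):
        print("removed \(element) at offset \(offset)")
    case .insert(let offset, let element, _):
        print("inserted \(element) at offset \(offset)")
    }
}

// Applying the diff to the old value reconstructs the new one.
let rebuilt = old.applying(diff)
```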


Never heard of that, thank you.

But how exactly could it help?

var users: [User] = ...

let oldUsers = users
users.append(user)
users.remove(at: users.count - 1)

let changed = users != oldUsers // 🤔 Still O(N)

Yep. I was thinking about having a unique ID along with data structures like array / dictionary, where every change changes that ID. There would be false negatives (like in the above example), but at least every comparison operation would be quick. It might be worth introducing another comparison operator with the meaning "definitely equal", which would always give its answer in O(1) time, at the price of sometimes giving false negatives.

            "normal EQ"   "definitely equal"
A A            true          true or false
A B           false              false

Depends on the details of the application of course, but if the underlying data structure could track regions that may have changed (say groups of 1000 elements), or if it was a Deque with CoW'd segments, the fact that most operations on Sean's registry leave items in their existing positions could be easily exploited.

Maybe TreeSet and TreeDictionary from swift-collections could be helpful here.


Exploring this possibility further. Something like this would do:

import Foundation

extension UUID {
    // Foundation has no UUID.zero; define one as the all-zero UUID
    static let zero = UUID(uuid: (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
}

struct ArrayWrapper<T> {
    var uuid: UUID = .zero
    
    var array: [T] {
        didSet { changed() }
    }
    
    mutating func changed() {
        // all empty arrays share the zero ID; any other change gets a fresh one
        uuid = array.isEmpty ? .zero : UUID()
    }
    
    mutating func append(_ element: T) {
        array.append(element)
    }
    
    ...
}

If UUID() is not performant enough to be used on every change, then perhaps allocate a few random numbers up front and increment them on every change:

struct UniqueIdentifier: Equatable {
    var a: Int = .random(in: .min ... .max)
    var b: Int = .random(in: .min ... .max)
    var c: Int = .random(in: .min ... .max)
    
    mutating func change() {
        a &+= 1
        b &+= 1
        c &+= 1
    }
}

The most interesting bit:

extension ArrayWrapper {
    // gives an approximate result in O(1) time
    func mostLikelyEqual(_ other: Self) -> Bool {
        var this = self
        var other = other
        // quick positive check (bitwise comparison, which covers
        // the case of a shared array buffer reference)
        if memcmp(&this, &other, MemoryLayout<Self>.size) == 0 {
            return true
        }
        // quick positive check (comparing uuid's)
        if this.uuid == other.uuid {
            return true
        }
        // can't get the result quickly
        return false
    }
}

However.... this still won't pass the above test:

var users: ArrayWrapper<User> = ...

let oldUsers = users
users.append(user)
users.remove(at: users.count - 1)

let changed = users.mostLikelyEqual(oldUsers) // false 🤔

Another, very different approach that would solve more cases (including the above) is to maintain a quick and reasonably wide hash:

struct Array<T> {
    // hypothetical extra field, maintained incrementally on every mutation
    var qhash: (Int, Int, Int, Int)
}

Compared to a normal hash (which requires O(N) to calculate), this quick hash would be maintained and updated on every change operation. E.g. appending an element to the end of the array would take the "quick hash" of the element, combine it with the index of the newly inserted element (for an append that would be count - 1), and merge it into the existing quick hash of the array. Inserting an element in the middle of the array would have to recalculate/recombine the qhash contributions of half the array, but in principle that would not increase the time complexity of the insertion operation (it would remain O(N), dominated by the cost of memory I/O). Similarly for strings, dictionaries, Data, etc. The "most likely equal" check would become:

    // O(1) complexity
    func mostLikelyEqual(_ other: Self) -> Bool {
        // quick negative check (comparing counts)
        if count != other.count {
            return false
        }
        var this = self
        var other = other
        // quick positive check (comparing references)
        if memcmp(&this, &other, MemoryLayout<Self>.size) == 0 {
            return true
        }
        // quick positive check (comparing qhashes)
        // can result in a false positive due to a hash collision
        return qhash == other.qhash
    }

This would be a probabilistic check with a possibility of false positives due to hash collisions, but with no false negatives (a false result would mean the two values are definitely not equal). To reduce the probability of false positives to a negligible value, a wide hash could be used, e.g. 256 bits.
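A minimal sketch of the incrementally maintained hash, under some simplifying assumptions of mine: a single 64-bit Int rather than the 256-bit hash suggested above, XOR-combined per-(element, index) contributions, and only append/removeLast implemented. A nice side effect of XOR-combining is that contributions cancel, so this version even passes the append-then-remove test above:

```swift
struct QuickHashedArray<Element: Hashable> {
    private(set) var elements: [Element] = []
    private(set) var qhash: Int = 0

    // Per-(element, index) contribution. Hasher is randomly seeded per
    // process, which is fine for comparisons within a single run.
    private static func contribution(of element: Element, at index: Int) -> Int {
        var hasher = Hasher()
        hasher.combine(element)
        hasher.combine(index)
        return hasher.finalize()
    }

    mutating func append(_ element: Element) {
        // XOR-combining keeps the running hash updatable in O(1) on append.
        qhash ^= Self.contribution(of: element, at: elements.count)
        elements.append(element)
    }

    @discardableResult
    mutating func removeLast() -> Element {
        let element = elements.removeLast()
        // XOR of the same contribution cancels the earlier append.
        qhash ^= Self.contribution(of: element, at: elements.count)
        return element
    }

    // O(1) probabilistic check: false means definitely not equal;
    // true can (very rarely) be a hash collision.
    func mostLikelyEqual(_ other: Self) -> Bool {
        elements.count == other.elements.count && qhash == other.qhash
    }
}
```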

Edit: the mostLikelyEqual check could also include the lhs.count == rhs.count check, to further restrict false positives to hash collisions between collections of the same size (updated above).

Edit 2: for small POD types (e.g. Int, or a struct / tuple of a few small POD values, etc.) there's no need to maintain a stored qhash variable, as it could be calculated on the fly in effectively constant time:

protocol SmallPOD {}

extension SmallPOD {
    
    func combine(hash: inout (Int, Int, Int, Int), byte: UInt8, index: Int) {
        // ...
    }
    
    var qhash: (Int, Int, Int, Int) {
        let size = MemoryLayout<Self>.size
        precondition(size < 100) // let's keep this 100*O(1), which is still O(1)
        var hash = (0, 0, 0, 0)
        withUnsafeBytes(of: self) { buffer in
            for i in 0 ..< size {
                combine(hash: &hash, byte: buffer[i], index: i)
            }
        }
        return hash
    }
    
    func mostLikelyEqual(_ other: Self) -> Bool {
        let size = MemoryLayout<Self>.size
        precondition(size < 100) // let's keep this 100*O(1), which is still O(1)
        var this = self
        var other = other
        return memcmp(&this, &other, size) == 0
    }
}

Floating-point values would expose the infamous anomaly: Float.nan would be mostLikelyEqual to another Float.nan, which disagrees with the normal == behaviour.
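The anomaly can be seen directly. Float.nan is a fixed quiet-NaN bit pattern, so the bytes match even though IEEE 754 says NaN never equals itself:

```swift
import Foundation

var a = Float.nan
var b = Float.nan
let bitwiseEqual = memcmp(&a, &b, MemoryLayout<Float>.size) == 0

print(a == b)        // false: IEEE 754, NaN never equals itself
print(bitwiseEqual)  // true: identical bit patterns
```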