[Pitch #2] Add `isKnownIdentical` Method for Quick Comparisons to `Equatable`

Lets assume we implement this proposal, and Array implements isKnownIdentical(to:)
What would happen if at some point in time we want to change the backend storage for Array from COW to something else? Because the new approach has many advantages, but can no longer do the quick comparison check. I.e. isKnownIdentical(to:) will always return nil.

Will this then be considered a breaking change?

If so, then this proposal feels risky, as we can no longer change the internal implementation of an array (an implementation detail), because we added coupling of the API to an implementation detail.

This could off course be a valid trade off. Yet, is the problem big enough to warrant that?

1 Like

You lost me here as well: memcmp does the trick, in case of 80 byte structure it compares 80 bytes (same as equal) in case of 8 byte COW implementation it compares 8 bytes (so quicker than equal).

Perhaps a better example would be:

struct Element: Equatable {
  // A struct with about 80 bytes
  var a: Int64 = 0 {
    didSet {
        b = a*2
        c = a*3
        d = a*4
        ...
    }
  }
  private (set) var b: Int64 = 0
  private (set) var c: Int64 = 0
  private (set) var d: Int64 = 0
  private (set) var e: Int64 = 0
  private (set) var f: Int64 = 0
  private (set) var g: Int64 = 0
  private (set) var h: Int64 = 0
  private (set) var i: Int64 = 0
  private (set) var j: Int64 = 0
}

in which case "quick" equality could compare just the variable a without comparing the rest of the gang (as they are all dependent). And if I had a custom equatable implementation it would do the same, it's just if I am lazy and allowed Swift to generated Equatable implementation it would be comparing all 80 bytes.

1 Like

My misgivings about linking "identicality" to equality is that two elements can be considered equal while still containing a payload that makes them non-identical. And two elements can be identical but non-equal (anything containing NaN values). By putting them both inside Equatable you end up with the view that identity is strictly a shortcut for equality (which is true in many cases, but not all).

So if the goal is to detect whether a new value is identical to a previous one to save processing time, I think this should be done without involving equality. I still believe my suggestion from an earlier pitch would be best: a bitwise comparaison by default (with an optional cap on byte count) and a CustomKnownIdenticalComparable protocol for types that needs to customize the behavior. This makes "identical" a concept unburdened by the contradictions of "equal".

3 Likes

I think it's an important question to consider at some point in time… but I'm not sure this pitch — or potential formal design proposal — would need to be that point in time.

This specific pitch at this moment in time remains mostly agnostic about specific concrete types that should override isKnownIdentical with a custom behavior. The implication is that this decision could then be made by library maintainers through their own process. If the process to override isKnownIdentical on Array needs a design proposal review then so be it… but my expectation is that discussion could also take place on a GitHub RFC diff without a design proposal review — we aren't explicitly adding new APIs to Array… we would be overriding what at that point would be an existing API for some new runtime behavior.

That being said… if this pitch-proposal should also engage in a thoughtful conversation about not only the protocol APIs but also engage in a thoughtful conversation about tactical details of how concrete types from Standard Library should or should not override that protocol API, then I am open to updating this pitch-proposal to start to address some of those concrete types more directly.

Library maintainers that want to override isKnownIdentical with custom behavior could weigh those trade offs through their preferred channels. You are absolutely right if we say it is a small "exposure" of an important implementation detail.

If this proposal ships with no opinions about standard library concrete types that should override isKnownIdentical there is still utility and impact from then having the option to experiment with optimizations on algorithms in Standard Library that operate on generic contexts over Equatable. I believe @tera has some ideas about places we could start with that.

Moving down from an abstract and conceptual discussion about the risks of exposing this "window" into implementation details (where I agree that library maintainers want to move thoughtfully before adopting a "contract" for fast checking), if we look at some standard library concrete types and move the discussion more pragmatically then I would probably make the argument the example of Array would be an example of a "safe" type to adopt this operation.

If at some point in the future Array ever "lost" its COW behavior to perform copy (without mutation) in constant time that would have some pretty serious downstream performance impacts. I believe the remaining "variable size" COWs from standard library (String, Dictionary, etc) would also have the same problem. Yes it is true that exposing an identity equality operation would "lock" these concrete types into a contract… but the other POV here is these concrete types are already (in a way) locked into a very similar contract by performance. As long as we have the ability to perform a copy without a mutation in constant time then I believe we should always have the ability to perform a check for identity equality.

Sure. The memcmp returns a "correct" answer here in the sense that if our memcmp return true then our Element instances must return true for a == value equality operator. But we have two main goals with isKnownIdentical:

  • If isKnownIdentical returns true these instances must be equal by value.
  • The isKnownIdentical must return in constant time and should perform its operation meaningfully faster than a check for value equality.

If our Element.isKnownIdentical was built on memcmp… it's essentially just another way of looking at value equality. It's a "constant" time operation is the sense that the memory footprint of Element is fixed and it's not a variable-length data structure like Array… but it's "linear" time if we look at it from the POV as operating linearly over the memory footprint of the data structure itself. If Element was a CowBox we could then return identity equality with a check on a 8 byte pointer which is meaningfully faster than checking all 80 bytes for what is essentially the same work as value equality — which is exactly the amount of work that isKnownIdentical is built to avoid performing.

My argument would be that Element in this case should not implement isKnownIdentical with memcmp. Element should implement isKnownIdentical with nil to indicate that a meaningful concept of identity equality for our purposes is not exposed.

I see what you mean, thank you.

However keep in mind that identical being true doesn't always mean the two things are equal (e.g. NaN != NaN but could be bitwise identical).

This itself would be a breaking change... If I am not mistaken CoW is documented behaviour of Array.

3 Likes

Hi Rick,

Thank you for your clear answers! And I now support the proposal :smiley:

This makes a lot of sense to me. The maintainers of any library should be responsible for overriding this behavior. And depending on the library could engage in a different types of process. Like a Standard Library type like Array, I can imagine using Swift Evolution or at least use the forums to get feedback on the proposal. But that is indeed up to those responsible for the library. Especially because the lock into a 'performance contract'.

I agree that discussing which concrete types should or should not override this behavior is out of scope for this proposal.

I imagine that if one of the standard types were to consider changing its backing storage implementation, we might first bring that into specialized types. And we can always later make a trade off whether for instance the array or dictionary syntax will create an COWBackedArray or an NovelBackedArray.

Good point!

1 Like

This is a fair point… and my expectation is that library maintainers could — and should — thoughtfully consider situations such as that before they return true from isKnownIdentical. This specific pitch outlines a contract those library maintainers should adopt in terms of semantics and performance… but library maintainers also have to operate correctly on their concrete types with their knowledge and context about how those concrete types work.

I wish there were a way to communicate that in the API.

1 Like

Quite important choice here:

  • Option 1. identical=true should mean equal=true
    So for Doubles (at least when they are NaN's) isIdentical should be returning nil and isIdentical is not a simple bitwise memcmp style comparison. It is backed by a protocol (is a customisation point).

  • Option 2a. identical=true does not necessarily mean equal=true, the two things are unrelated. In this option isIdentical is simply memcmp and there is no way to customise it

  • Option 2b. as before identical=true does not necessarily mean equal=true, the two things are unrelated. In this option isIdentical is backed by a protocol and it is a customisation point.

I think all these options are worth considering and I won't say one is definitely superior to another.

Note that memcmp comparison (Option 2a) could be meaningfully faster than a check for value equality especially if you have to call == for each field in a struct and that function call can't be optimised away (for example the fields' types in question are provided by another module).

1 Like

I'm not completely sure I follow how that relates to the current pitch, but the semantics enforced by the current pitch is that if a concrete type returns true for isKnownIdentical then this implies == equality by value. How an arbitrary library maintainer chooses whether or not to return true for isKnownIdentical for an arbitrary concrete type is not currently considered as part of this pitch. But these are important questions and we absolutely encourage library maintainers to discuss that through their appropriate workflows and channels when choosing to override isKnownIdentical.

This pitch still does not ship with an opinion about how library maintainers should implement isKnownIdentical (other than it is constant time and faster than ==). Our opinion is what should be returned. How library maintainers choose to return true could very well be memcmp on concrete types… but we don't need to ship an opinion on that here in this pitch.

Equatable requires a couple axioms. One of it is Reflexivity (a == a is always true) but all floating point types violate it e.g NaN == NaN must be true according to Equatable but it isn’t. This proposal requires those axioms as well. Because of that I don’t think that it is really an issue that needs to be addressed in this proposal.

2 Likes

I'd like to see a convincing example where one would like to use a custom isKnownIdentical hence the question should it be a customisation point to begin with instead of a non-customisable, freestanding, written once and set in stone, memcmp based func isIdentical<T>(_ lhs: T, _ rhs: T) -> Bool. The above "struct Element" is not quite convincing for me: that in some cases equatable would be as quick as isKnownIdentical is not a show stopper.

2 Likes

Hmm… I would maybe start with a comment from the previous pitch:

I do want to mostly keep this pitch review "out of the weeds" on tactical discussion over implementation details… but it could also be relevant to consider a "case study" specifically as an example of when a library maintainer might want or need a custom isKnownIdentical that is not memcmp.

One place we could start is our "prior art": the _isIdentical method on String:

AFAIK the only stored instance property on String is _guts:

But our _isIdentical is not memcmp over _guts… it's actually comparing value equality over _StringGuts.rawBits:

Which is itself the value of _StringObject.rawBits:

But wait… that's the 64-bit version of rawBits. We have a different version of rawBits for 32-bit:

Why is all that conditional logic and custom code needed? I have no idea… but I know I don't want to break it.

To go back to our String._isIdentical method… an approach on memcmp would compare the bytes of _StringObject directly without respecting all the work we built on rawBits.

If memcmp on String ever returned a "false positive" — returning true for identity equality when == value equality returns false — that would be considered a bug. If memcmp on String never returned a false positive but did return false negatives — returning false when the existing production logic on _isIdentical returns true — that would not exactly be a "bug"… but it would indicate that our library maintainer does know how to override isKnownIdentical as an impactful and non-trivial improvement over a default implementation on memcmp.

I'm not sure what to say… an important opinion of this pitch is that isKnownIdentical must return in constant time and meaningfully faster than value equality. It's trending to a "circular" argument in the sense that I am saying "this is the way" it should be "because the pitch says so"… but that's a strong opinion we are making for now. If a type does not have any ability to answer isKnownIdentical without a comparison equal in effort to normal value equality, then that type should return nil.

1 Like

I must admit I do not fully understand this bit. Is it only applicable to non copyable types? How would I be able to even attempt to compare the two (presumably identical) variables if the type is non copyable to begin with? (without always getting a known result of false, I mean). Could you give an example?

This bit is fully understandable. Yes, there could be some false negatives due to padding, which is not guaranteed to be zeroed. Not a show stopper on my list as it doesn't break the semantic contract (false positive would), but YMMV of course.

Great detective work, thank you.

So Int and other similar types would return nil in their isKnownIdentical implementation?

1 Like

I am not sure I have enough context on this right now to give a good example. Maybe @lorentey can help with some more information on that?

Not to sound like a "broken record"… but my intention is that this pitch-proposal would ship without making its own strong opinions about specific concrete types from standard library that should override isKnownIdentical. My preference is for that decision to remain the domain of the library maintainers outside of this specific pitch-proposal design review cycle.

But if you are asking me for a "weak" opinion on this topic… then I would say that these are examples of standard library types that should return nil for isKnownIdentical:

  • Bool
  • Int
  • InlineArray

Again… I want the library maintainers to use their domain-specific knowledge and context to help them make those decisions on their own under the semantic and performance contract of the pitch-proposal. But my weak opinion on this topic is that Int should not expose a concept of identity equality.

1 Like

Or it could probably be solved by not providing isKnownIdentical for non copyable types or always returning nil / false for those.

I see. nil would be quite painful for optimisations like these, for such algorithms it would be better if such types were not conforming to the protocol in the first place, but I guess this is not on the table if isIdentical is coupled with Equatable.

BTW, what happens with this type with your proposal:

struct Storage: Equatable {
    var items: [String] 
    var count: Int
}

Would the default autogenerated isKnownIdentical of Storage be returning nil because one of its members says nil?

It's also not clear what Array<Something> should do when Something has a custom isKnownIdentical that returns false or nil. Or is it ok to have one Array identical to another even when its members are not?

1 Like

I think the sticking point for me is the inclusion in Equatable. isKnownIdentical returns an Optional<Bool> (an unusual type to see outside a generic context), and it has a default implementation that returns nil.

This results in static ambiguity about whether a given type has any real notion of identicality (also somewhat unusual in Swift – protocols tend to convey ability). Under this proposal, this maybe-implemented behavior would form fully 50% of the mass of Equatable.

I think returning to a discrete Distinguishable protocol, but having it refine Equatable strikes the right balance. Container types could then conditionally conform to Distinguishable when their contents did, just like they do with Equatable, but we don't get stuck in the theoretically valid but practically unlikely situation of Distinguishable-but-not-Equatable

7 Likes

The free function could still require Equatable conformance without using any functional requirements of it but just the semantic/axioms of Equatable. Mutex and Atomic do not conform to Equatable and I don’t think they ever will.

1 Like

Hmm… so the thinking is the pitch could be amended to enforce that library maintainers must not override isKnownIdentical on a type that is ~Copyable? I suppose in its current form the pitch leaves the decision about ~Copyable up to the library maintainers… if those library maintainers choose to override isKnownIdentical on a type that is ~Copyable, then I'm not completely sure it must be the job of the protocol member itself to try and prevent those bugs from happening. And that becomes even more important if for some reason there would exist some kind of legit use-case where a specialized type that is ~Copyable can return a meaningful answer for isKnownIdentical and the library maintainers really do want to move forward with that.

TBH… I believe there could be more opportunities in this pitch to directly address how this might work alongside Copyable and ~Copyable and I would like to keep hearing POVs on this topic if you or anyone else is interested in continuing that discussion.

Correct me if I'm wrong… but I believe that specific code snippet is built on a version of the API design that returned false to indicate "must not be equal"… as opposed to the pitch in its current form where false indicates "might not be equal". Is that correct?

Just to be a little more clear… we do not propose any "codegen" or hacking on the compiler to synthesize different isKnownIdentical implementations at compile time based on how different types are defined. All our pitch recommends for now is a new member method on Equatable with a default value that returns nil. In this example Storage.isKnownIdentical would return nil because there is no custom override declared here.

Another broken record response… but I do want to generally keep this pitch away from any strong opinions about concrete types in standard library that might choose to override isKnownIdentical. But to consider Array as a case study of how a library maintainer might choose to implement isKnownIdentical, I would start with the existing "fast path" on equality checking for Array:

Comparing the identity of our buffer storage references is constant time… we do not need or want this operation to scale linearly with the N elements our Array might contain.

When we talk about "container" types… we generally see isKnownIdentical as performing a "shallow" check on the root of the container itself — which should not be a "deep" check through the complete tree of N elements.

Thoughts on Immutable Collections

An inspiration for fast identity checking on collections was the ImmutableJS library from Lee Byron. Modeling data collections as immutable objects enabled important performance optimizations when pairing ReactJS with "Immutable" Flux or Redux. Checking whether or not two data collections "might have changed" became a constant-time operation: a "shallow" operation on the identity of the collection without a "deep" operation through all N elements contained in the collection.

React.js Conf 2015 - Immutable Data and React