[Pitch] `Distinguishable` protocol for Quick Comparisons

Hi! Here is a pitch derived from this discussion:

Distinguishable protocol and isIdentical Function for Quick Comparisons

Introduction

I propose a new Distinguishable protocol and isIdentical function for determining if two instances are distinguishable in constant-time.

Motivation

Suppose we have some code that listens to values from an AsyncSequence. Every value received from the AsyncSequence is then used to perform some work that scales linearly with the size of the Element:

func doLinearOperation<T>(with: T) {
  //  perform some operation
  //  scales linearly with T
}

func f1<S>(sequence: S) async throws where S: AsyncSequence {
  for try await element in sequence {
    doLinearOperation(with: element)
  }
}

Here we perform a linear operation on every value received from sequence. How much performance does this use?

Suppose we know that the work performed in doLinearOperation is only necessary when element is not equal (here we define "equal" to be "value equality"). The first call to doLinearOperation is important, and the next calls to doLinearOperation are only important if element is not equal-by-value to the last element that was used to perform doLinearOperation.

When we know that Element conforms to Equatable we can choose to "memoize" our values before we perform doLinearOperation:

func f2<S>(sequence: S) async throws where S: AsyncSequence, S.Element: Equatable {
  var oldElement: S.Element?
  for try await element in sequence {
    if oldElement != element {
      oldElement = element
      doLinearOperation(with: element)
    }
  }
}

When our sequence produces many elements that are equal-by-value, "eagerly" passing that element to doLinearOperation performs more work than necessary. Performing a check for value-equality before we pass that element to doLinearOperation saves us the work from performing more doLinearOperation than necessary, but we have now traded performance in a different direction. Because we know that the work performed in doLinearOperation scales linearly with the size of the Element, and we know that the == operator also scales linearly with the size of the Element, we now perform two linear operations whenever our sequence delivers a new value that is not equal-by-value to the previous input to doLinearOperation.

At this point our product engineer has to make a tradeoff: do we "eagerly" perform the call to doLinearOperation without a preflight check for value equality on the expectation that sequence will produce many non-equal values, or do we perform the call to doLinearOperation with a preflight check for value equality on the expectation that sequence will produce many equal values?

There is a third path forward… a "quick" check against elements that returns in constant-time and guarantees these types must be equal by value.

Prior Art

Swift.String already ships a public-but-underscored API that returns in constant time:[1]

extension String {
  /// Returns a boolean value indicating whether this string is identical to
  /// `other`.
  ///
  /// Two string values are identical if there is no way to distinguish between
  /// them.
  ///
  /// Comparing strings this way includes comparing (normally) hidden
  /// implementation details such as the memory location of any underlying
  /// string storage object. Therefore, identical strings are guaranteed to
  /// compare equal with `==`, but not all equal strings are considered
  /// identical.
  ///
  /// - Performance: O(1)
  @_alwaysEmitIntoClient
  public func _isIdentical(to other: Self) -> Bool {
    self._guts.rawBits == other._guts.rawBits
  }
}

We don't see this API currently being used in standard library, but it's possible this API is already being used to optimize performance in private frameworks from Apple.

Proposed Solution

Many types in Swift and Foundation are "copy-on-write" data structures. These types present as value types, but can leverage a reference to some shared state to optimize for performance. When we copy this value we copy a reference to shared storage. If we perform a mutation on a copy we can preserve value semantics by copying the storage reference to a unique value before we write our mutation: we "copy" on "write".

This means that many types in Swift and Foundation already have some private reference that can be checked in constant-time to determine if two values are identical. Because these types copy before writing, if two values are identical by their shared storage must be equal by value.

Suppose our Element conformed to a Distinguishable protocol. This Element now adopts a function that can return in constant time if two values are identical and must be equal by-value

We can now refactor our operation on AsyncSequence to take advantage of this:

func f3<S>(sequence: S) async throws where S: AsyncSequence, S.Element: Distinguishable {
  var oldElement: S.Element?
  for try await element in sequence {
    if oldElement.isIdentical(to: element) == false {
      oldElement = element
      doLinearOperation(with: element)
    }
  }
}

What has this done for our performance? We know that doLinearOperation performs a linear operation over element. We also know that isIdentical returns in constant-time. If isIdentical returns true we skip performing doLinearOperation. If isIdentical returns false we perform doLinearOperation, but this is now one linear operation. We will potentially perform this linear operation even if the element returned is equal by-value, but since the preflight check to confirm value equality was itself a linear operation, we now perform one linear operation instead of two.

Detailed Design

Here is a new protocol defined in Standard Library:

@available(SwiftStdlib 6.3, *)
protocol Distinguishable {
  /// - Performance: O(1)
  func isIdentical(to other: Self) -> Bool
}

Because Distinguishable is similar but orthogonal to value equality, we give infra engineers the ability to define what "identity equality" means for them and the types that adopt Distinguishable.

The most common adoptions would be on Standard Library and Foundation types that are copy-on-write values that also conform to Equatable. Examples include Array and Dictionary. If two Array instances return true for isIdentical, then these two Array instances must be equal by-value.

Source Compatibility

This code is additive. The protocol definition is guarded by available. The adoptions on Standard Library and Foundation types are guarded by available.

Impact on ABI

Introducing a new protocol and adopting that protocol on existing types does not have to be ABI breaking. The implication is that we want to "get it right" and settle on all the types that should adopt Distinguishable because if we ship Distinguishable in 6.3 and then decide that an existing type should adopt Distinguishable in 6.4 that would break ABI.

Determining exactly what types from Standard Library and Foundation should adopt Distinguishable should block the landing of Distinguishable, but does not necessarily need to block the design review of Distinguishable itself. We can follow up on a design review of Distinguishable with an audit of Standard Library and Foundation to agree of what types should adopt Distinguishable.

Future Directions

Engineers outside of Swift and Foundation also ship copy-on-write data structures that conform to Equatable. An example is TreeDictionary from swift-collections. We would like engineers to be able to easily adopt Distinguishable on their own types.

Alternatives Considered

Could we "overload" the === operator from AnyObject? Because Distinguishable makes no promises or assumptions about the nature of the type itself, this proposal recommends keeping the === operator only on AnyObject types.

Instead of publishing an isIdentical function which implies two types must be equal, could we think of things from the opposite direction? Could we publish a maybeDifferent function which implies two types might not be equal? This then introduces some potential ambiguity for product engineers: to what extent does maybeDifferent imply "probably different"? This ambiguity could be settled with extra documentation on the protocol, but isIdentical solves that ambiguity up-front.

Without breaking ABI, we could add the isIdentical function directly to Equatable. Because we expect most types that would adopt Distinguishable would return true to indicate two instances must be equal by-value, there are arguments that this should itself belong in Equatable. At this point we have now coupled these two concepts and reduced the flexibility available to product engineers. A "pure" value type might conform to Equatable without any underlying reference that can return true in constant time to define identity equality in constant time. The "default" behavior on these types would be to always return false for isIdentical: we can't make any constant-time judgement about any potential value-equality.

Acknowledgments

Thanks @dnadoba for suggesting this isIdentical function should exist on a new protocol.

Thanks @ben_cohen for helping to think through and generalize the original use-case and problem-statement.

Thanks @Slava_Pestov for helping to investigate ABI stability implications of a new protocol.

Thanks @lorentey for shipping the optimization on Swift.String that shows us a precedent for this kind of operation.[1:1]


  1. swift/stdlib/public/core/String.swift at swift-6.1-RELEASE · swiftlang/swift · GitHub ↩︎ ↩︎

8 Likes

I'm not going to stay heavily invested in this, but I think if you want to weaken Equatable to something O(1), it's worth it being a tri-state. Two ASCII strings of unequal length are "definitely unequal"; two strings with the same backing storage are "definitely equal"; two strings with the same length and different storage we can't say anything about.

(I'm not actually saying we should or shouldn't implement this for strings, it was just a convenient example.)

9 Likes

Hmm… I think product engineers generally reach for copy-on-write data structures to optimize for values that contain some storage of variable length or to optimize for some large storage of fixed length (and a side-optimization there is nested copy-on-write data structures that can fold down to just one retain operation when the parent container is copied).

For data structures of variable length I believe there is a natural concept of "length" or "size" that can be used for a "definitely unequal" check.

For data structures of fixed length I'm not so sure I could easily picture what that "definitely unequal" check would look like. These CoWs could return a meaningful true value for "definitely equal" but might not have any clear ability to return a meaningful true value for "definitely unequal": which could imply they just always have to return false for that one.

My goal with Distinguishable was to define a protocol that maps to both of these CoWs. This would imply that variable CoWs like Swift.String and Foundation.Data can adopt Distinguishable as well as user defined CoWs like large data structures of fixed length.

I was about to make the same tri-state suggestion. Something like this:

protocol ShortcutEquatable {
   /// Evaluates to the same thing as `==` if equality can be 
   /// determined in constant time, otherwise returns `nil`.
   func shortcutEquality(to other: Self) -> Bool?
}

One advantage of returning a tri-state value is that it forces the user to think about the third case (the "unknown" case expressed as a nil here). It's thus more difficult to misinterpret the function as testing equality. (Perhaps it should be a tri-state enum to make this even more obvious, with equal, unequal, and unknown cases.)

It's also interesting to note that this protocol can be implemented even when the array elements are not Equatable:

extension Array: ShortcutEquatable {
   /// Evaluates to the same thing as `==` if equality can be 
   /// determined without looking at the contained elements,
   /// otherwise returns `nil`.
   func shortcutEquality(with other: Array) -> Bool? {
      if (storage is the same) || (count == 0 && other.count == 0) {
         return true
      } else if count != other.count {
         return false
      } else {
         return nil
      }
   }
}
2 Likes

I think that leads back to my previous comment:

  • For copy-on-write data structures that map to some variable length (like count or length), there is a natural ability to map to a false value.
    • For types like Array and Dictionary, this variable length is already exposed as a public API, which implies that a product engineer looking for a "definitely unequal" value already has a constant-time ability to determine that.
  • For copy-on-write data structures that do not map to some variable length, I do not understand what value a false value would communicate to a product engineer building on this. For that data structure it seems like we would only be returning either true or nil.

Also just from brainstorming I can think of a lot of useful applications from being able to know if isIdentical returns true, but I can't think off the top of my head of how or why splitting apart "we are sure these are not equal" from "we are not sure these are not equal" would be an impactful change with useful applications. Why might a product engineer want or need this to be returned as a tri-state? Is there precedent from other languages or ecosystems for that?

1 Like

I think you’re overindexing on copy-on-write, but I admit I’m having a hard time thinking of when “definitely unequal” would be worth knowing. Abstractly, I assume every optimized Equatable implementation would start with the O(1) checks anyway, both positive and negative*, and then fall back on checks that require O(N) work; thus, the purpose of separating the O(1) check is if you would rather do something different to avoid falling into the O(N) case. Or, as Ben put it in the previous thread:

The motivation being, an observer might be about to do some linear operation anyway when values have changed, and that linear operation is fairly similar in effort to an equality check in cases of false change notifications, so they might as well do it even when nothing has changed rather than pay the price of first checking for definitive inequality, then performing that work.

Consequently, if there were a use for “definitely unequal”, it would be that the observer would do expensive work if the objects are equal, and it’s okay to do that work if the objects are inequal, but we’d skip it if we could. I can contrive some cases like that by making the object some kind of key: if the key is one of the ones I care about, I will use it to do an operation on the value associated with it. Only…keys are usually cheap enough to compare fully, and even if they’re expensive it’s rare that the cost of the check will outweigh whatever work you might do with them.

Still, even with that I think michelf is right that the tri-state makes it more likely people will use the method correctly. It also makes it much simpler to say “nil is always a correct response for this method, if a bad one”, just like “not hashing anything is always a valid hash(into:), if a bad one”.

* @David_Smith reminded me of an infamous case where this is not true: memcmp does not check whether you gave it the same pointer in both arguments, the idea being that if you need to optimize your call to memcmp, the caller should be the one to check pointer equality, because that saves the overhead of a function call!

2 Likes

I think the main reason for a tri-state is to make the result clearer by having one of the possible results an "unknown equality status" value. Perhaps making it a tri-state is overengineering, but I'd argue that isIdentical sounds pretty much the same as isEqual until you dig into documentation, and sadly will behave the same in many cases (dependent on underlying implementation details people won't be familiar with, and which are probably subject to change with time). This opens the door to mistakes not being caught early.

But I realize now that a tri-state is a bit inconvenient too because you always want to check for the union of the "unknown" state and one other.

As for the usefulness of the "known not equal" return value, I don't have any concrete example of my head.


Ultimately, perhaps just using a better name would ease my fear. For instance:

func isKnownEqualInConstantTime(to other: Self) -> Bool

This is the same pattern as isKnownUniquelyReferenced and is pretty self-explanatory so you can't really mistake this for normal equality. And I think the example in the pitch reads better too:

if !oldElement.isKnownEqualInConstantTime(to: element) {
  oldElement = element
  doLinearOperation(with: element)
}
Alternative names
func isFastKnownEqual(to other: Self) -> Bool
func isKnownEqual(to other: Self) -> Bool
func mayBeUnequal(to other: Self) -> Bool  // inverted result vs. isIdentical
func mayBeNotEqual(to other: Self) -> Bool // inverted result vs. isIdentical
1 Like

My thoughts here is that Distinguishable and isIdentical would "practically speaking" map in the direction of a "quick" value equality check on types from standard library and Foundation… but I also would prefer for Distinguishable to not by itself directly imply that a product engineer adopting Distinguishable must mentally map its concept to value equality.

I believe it should be supported for a product engineer to build a type that does not adopt Equatable and does adopt Distinguishable. If the function we have to adopt is named with "equal" it seems like we are sending a mixed message there WRT the formal relationship between Equatable and Distinguishable.

I can see pros and cons with either approach. I'm not completely opposed to something along the lines of "known equal" or "must be equal"… but I'm leaning in the direction of isIdentical and presenting the protocol to product engineers as a comparison that is related to — but not necessarily dependent on — any concept of value equality.

1 Like

One potential gotcha with computed "definitely unequal" is that I would expect most of those implementations to perform some check over "count" or "length". The recommendation from Swift is that types conforming to Collection might return count in linear time. AFAIK most Collection types from standard library either do conform to RandomAccessCollection or confirm in their documentation that count is constant time… but product engineers rolling their own Collection types might still be performing this work in linear time.

Then, given the protocol name is Distinguishable, wouldn't the name isDistinct(from:) be a better fit?

You also avoid a negation here:

if oldElement.isDistinct(from: element) {
  oldElement = element
  doLinearOperation(with: element)
}
4 Likes

Without more context on what isDistinct does that looks like it would perform work similar to a != check for value inequality. If I was just stepping into this code from somewhere else I don't think that is very clear what kind of a check is happening. I also don't see any precedent inside or around Swift for this operation to be named isDistinct.

Here's another example from swift-markdown:

Given the precedent from this public function on swift-markdown, the public functions in swift-collections, the public underscored function on Swift.String, and the upcoming public functions on Span and RawSpan, I still believe that a protocol built around isIdentical remains consistent with the prior art in this space and is the best ("not-perfect-but-least-confusing") choice.

1 Like

I appreciate the initiative to address the issue presented in the pitch. However, I believe the problem isn't prevalent enough to justify the inclusion of a new protocol in the Swift standard library. This concern is particularly relevant if the protocol imposes additional constraints on conforming types. For example, should the Array type conform to this protocol, any decision to alter its underlying mechanism, such as moving away from reference type or copy-on-write storage, could result in a breaking change due to protocol conformance requirements.

Nevertheless, if there's considerable consensus about the necessity of such a protocol in the standard library, I propose the following considerations to mitigate potential issues:

  1. Optional Protocol Conformance: Ensure protocol adoption is optional, allowing types like Collection to conform independently of this new protocol.

  2. Explicit Tri-state Result Handling: Define the protocol results as a tri-state (for example: .definitelyEqual, .definitelyNotEqual, .undetermined). This clearly distinguishes it from Equatable, providing a better representation of the domain than a simple Bool?.

  3. Flexibility in Implementation: Accept that if the internal implementation changes, making definite answers infeasible, conforming types should be allowed to return .undetermined consistently.

5 Likes

Good point! Or consider a type that holds a cached value of a hash - if the hash values are different we know the “not same” answer in O(1) time.

So tri-state makes total sense. In fairness it could be modelled by Optional<Bool> (no idea if that’s better or worse than an enum with 3 values).

2 Likes

Correct. One of the alternatives considered was to add this work directly to Equatable without breaking ABI. But then that would couple the new work too closely with Equatable types that might not need or want to have anything to say about "identity equality".

Well… I can't say hate it… I was still a little more personally in favor of the isIdentical concept but I wouldn't hate definitely[Not]Equal. But this sort of leads me back to point number one. I would prefer to ship a Distinguishable protocol independent of the formal requirements on Equatable. But if the requirements of Distinguishable informally imply value equality… then I think maybe there's an argument to be made in favor of just making that coupling more explicit by adopting Equatable on Distinguishable?

@available(SwiftStdlib 6.3, *)
protocol Distinguishable: Equatable {
  /// - Performance: O(1)
  ...
}

One potential idea I had here that might "bulk up" the protocol would be something like a runtime check where the infra engineer adopting Distinguishable can then return a Bool to signal to a product engineer whether or not "definite [in]equality" can be determined:

@available(SwiftStdlib 6.3, *)
protocol Distinguishable: Equatable {
  /// - Performance: O(1)
  ...
  func canDetermineDefinitiveEquality(to other: Self) -> Bool
  func canDetermineDefinitiveInequality(to other: Self) -> Bool
  ...
}

A product engineer would not need to perform this check before attempting to determine definitive [in]equality… but might choose to condition their runtime behavior based on what the infra engineer adopting Distinguishable has to say at runtime about whether or not their internal storage makes a check like this possible or meaningful.

One gotcha I might see with returning an enum value or optional Bool is the product engineer is then "paying the price" for determining the state of a case they might not be interested in:

extension Array: ShortcutEquatable {
   /// Evaluates to the same thing as `==` if equality can be 
   /// determined without looking at the contained elements,
   /// otherwise returns `nil`.
   func shortcutEquality(with other: Array) -> Bool? {
      if (storage is the same) || (count == 0 && other.count == 0) {
         return true
      } else if count != other.count {
         return false
      } else {
         return nil
      }
   }
}

If I am a product engineer and I want this API to determine if two instances are "definitely not equal" then I don't need or want to pay the cost of first determining if these two instances are "not definitely equal". It's a small constant time operation… but since this API is meant as a performance optimization then it also makes sense to help product engineers perform as little work as possible.

Returning the tri-state across two operations:

@available(SwiftStdlib 6.3, *)
protocol Distinguishable {
  /// - Performance: O(1)
  func isDefinitelyEqual(to other: Self) -> Bool
  /// - Performance: O(1)
  func isDefinitelyNotEqual(to other: Self) -> Bool
}

Gives the product engineer the ability to explicitly define what work they want the infra engineer to do. And a product engineer has the option to then perform both together and use the "false and false" case to imply what we would otherwise define as a true undetermined case.

TBF… this would give the infra engineer the opportunity to return true for both at the same time as a programmer error… which would be bad.

3 Likes

One data point is that Sequence actually has a semi-hidden customization point that uses the Bool? tripartite result for containment checks: _customContainsEquatableElement:

This method uses an optional result because not every collection can provide a speed increase over a linear search, like Array. For collections that can, like Set or Range, the implementation of this method lets them provide fast containment checks even in generic contexts.

2 Likes

When it’s important for the implementation of a protocol method to be O(1) do we have any way to tell the compiler to warn if the implementation on a conforming type has a higher complexity?

If Distinguishable had a default implementation of isIdentical(to:) that just immediately returns the .unknown case, wouldn’t that just let Equatable types opt-in anytime later without an ABI break?

Edit: I guess the suggested “tri-state value” is not actually part of the pitch, but returning false by default would be, well, not the same but it would still cover the ABI break issue. I don’t like that though. I think a default implementation makes sense only if we have an .unknown return value case.

Alternatively, we could use a bi-state enum:

enum PossibleEquality {
   /// Values are known to be equal.
   case definitelyEqual
   /// Value's equality is unknown.
   case maybeUnequal
}

protocol PossibleEquatable {
   /// Evaluates to `definitelyEqual` if equality can be determined 
   /// quickly in constant-time, otherwise returns `maybeUnequal`.
   func possibleEquality(with other: Self) -> PossibleEquality
}

// in usage:
if a.possibleEquality(with: b) == .maybeUnequal {
    ... do the work ...
}
// or:
if a.possibleEquality(with: b) == .definitlyEqual {
    ... do the shortcut ...
}

This way you keep the work to a minimum but still have to spell things ultra clearly at the point of use. In fact it's clearer than a tri-state because with a tri-state you always want to match two of the three cases, forcing either a negation or testing for two values.

(And then—if the motivation arises—we could use add a different protocol for testing the reverse, possibleUnequality, returning an enum with definitlyUnequal and maybeEqual.)

Thank you for your explanation.

I like the suggestion of splitting this into two operations. This is clearly different from Equatable and could be implemented returning a bool or two case enum. (Here I’d probably go for just a bool).

The issue that you could implement both returning true, I’d say, that’s what tests are for.

1 Like