[Pitch #2] Add `isIdentical` Methods for Quick Comparisons to Concrete Types

Hi! Here is a new pitch refined from an earlier one:

Add isIdentical Methods for Quick Comparisons to Concrete Types

Introduction

We propose new isIdentical instance methods to concrete types for determining in constant-time if two instances must be equal by-value.

Motivation

Suppose we have some code that listens to strings from an AsyncSequence. Every string received from the AsyncSequence is then used to perform some work that scales linearly with the size of the string:

func doLinearOperation(with string: String) {
  //  perform some operation
  //  scales linearly with string
}

func f1<S>(sequence: S) async throws
where S: AsyncSequence, S.Element == String {
  for try await string in sequence {
    doLinearOperation(with: string)
  }
}

Suppose we know that doLinearOperation only performs important work when string is not equal to the last value (here we define “equal” to imply “value equality”). The first call to doLinearOperation is important, and the next calls to doLinearOperation are only important if string is not equal by-value to the last string that was used to perform doLinearOperation.

Since we know that String conforms to Equatable, we can choose to “memoize” our values before we perform doLinearOperation:

func f2<S>(sequence: S) async throws
where S: AsyncSequence, S.Element == String {
  var oldString: String?
  for try await string in sequence {
    if oldString == string { continue }
    oldString = string
    doLinearOperation(with: string)
  }
}

When our sequence produces many strings that are equal by-value, “eagerly” passing that string to doLinearOperation performs more work than necessary. Performing a check for value-equality before we pass that string to doLinearOperation saves us the work from performing doLinearOperation more than necessary, but we have now traded performance in a different direction. Because we know that the work performed in doLinearOperation scales linearly with the size of the string, and we know that the == operator also scales linearly with the size of the string, we now perform two linear operations whenever our sequence delivers a new string that is not equal by-value to the previous input to doLinearOperation.

At this point our product engineer has to make a tradeoff: do we “eagerly” perform the call to doLinearOperation without a preflight check for value equality on the expectation that sequence will produce many non-equal values, or do we perform the call to doLinearOperation with a preflight check for value equality on the expectation that sequence will produce many equal values?

There is a third path forward… a “quick” check against String values that returns in constant-time and guarantees these instances must be equal by value. We can add a similar check to additional concrete types from Standard Library.

Prior Art

String already ships a public-but-underscored API that returns in constant time.[1]

extension String {
  /// Returns a boolean value indicating whether this string is identical to
  /// `other`.
  ///
  /// Two string values are identical if there is no way to distinguish between
  /// them.
  ///
  /// Comparing strings this way includes comparing (normally) hidden
  /// implementation details such as the memory location of any underlying
  /// string storage object. Therefore, identical strings are guaranteed to
  /// compare equal with `==`, but not all equal strings are considered
  /// identical.
  ///
  /// - Performance: O(1)
  @_alwaysEmitIntoClient
  public func _isIdentical(to other: Self) -> Bool {
    self._guts.rawBits == other._guts.rawBits
  }
}

We don’t see this API currently being used in standard library, but it’s possible this API is already being used to optimize performance in private frameworks from Apple.

Many more examples of isIdentical functions are currently shipping in Swift-Collections[2][3][4][5][6][7][8][9][10][11][12][13], Swift-Markdown[14], and Swift-CowBox[15]. We also support isIdentical on the upcoming Span and RawSpan types from Standard Library.[16]

Proposed Solution

Many types in Standard Library are “copy-on-write” data structures. These types present as value types, but can leverage a reference to some shared state to optimize for performance. When we copy this value we copy a reference to shared storage. If we perform a mutation on a copy we can preserve value semantics by copying the storage reference to a unique value before we write our mutation: we “copy” on “write”.

This means that many types in Standard Library already have some private reference that can be checked in constant-time to determine if two values are identical. Because these types copy before writing, two values that are identical by their shared storage must be equal by value.

Suppose our _isIdentical method from String was no longer underscored. We could now refactor our operation on AsyncSequence to:

func f3<S>(sequence: S) async throws
where S: AsyncSequence, S.Element == String {
  var oldString: String?
  for try await string in sequence {
    if oldString?.isIdentical(to: string) ?? false { continue }
    oldString = string
    doLinearOperation(with: string)
  }
}

What has this done for our performance? We know that doLinearOperation performs a linear operation over string. We also know that isIdentical returns in constant-time. If isIdentical returns true we skip performing doLinearOperation. If isIdentical returns false we perform doLinearOperation, but this is now one linear operation. We will potentially perform this linear operation even if the element returned is equal by-value, but since the preflight check to confirm value equality was itself a linear operation, we now perform one linear operation instead of two.

Detailed Design

Here is a new method defined on String:

@available(SwiftStdlib 6.3, *)
extension String {
  /// Returns a boolean value indicating whether this string is identical to
  /// `other`.
  ///
  /// Two string values are identical if there is no way to distinguish between
  /// them.
  ///
  /// Comparing strings this way includes comparing (normally) hidden
  /// implementation details such as the memory location of any underlying
  /// string storage object. Therefore, identical strings are guaranteed to
  /// compare equal with `==`, but not all equal strings are considered
  /// identical.
  ///
  /// - Complexity: O(1)
  @available(SwiftStdlib 6.3, *)
  public func isIdentical(to other: Self) -> Bool { ... }
}

We propose adding isIdentical methods to the following concrete types from Standard Library:

  • String
  • Substring
  • Array
  • ArraySlice
  • ContiguousArray
  • Dictionary
  • Set

The methods follow the same pattern from String. Every isIdentical method is an instance method that takes one parameter of the same type and returns a Bool value in constant-time indicating two instances must be equal by value. Here is an example on Array:

@available(SwiftStdlib 6.3, *)
extension Array {
  /// Returns a boolean value indicating whether this array is identical to
  /// `other`.
  ///
  /// Two array values are identical if there is no way to distinguish between
  /// them.
  ///
  /// Comparing arrays this way includes comparing (normally) hidden
  /// implementation details such as the memory location of any underlying
  /// array storage object. Therefore, identical arrays are guaranteed to
  /// compare equal with `==`, but not all equal arrays are considered
  /// identical.
  ///
  /// - Complexity: O(1)
  @available(SwiftStdlib 6.3, *)
  public func isIdentical(to other: Self) -> Bool { ... }
}

Source Compatibility

This proposal is additive and source-compatible with existing code.

Impact on ABI

This proposal is additive and ABI-compatible with existing code.

Future Directions

Any Standard Library types that are copy-on-write values that also conform to Equatable would be good candidates to add isIdentical functions.

Alternatives Considered

Overload for ===

Could we “overload” the === operator from AnyObject? This proposal considers that question to be orthogonal to our goal of exposing identity equality with the isIdentical instances methods. We could choose to overload ===, but this would be a larger “conceptual” and “philosophical” change because the === operator is currently meant for AnyObject types — not value types like String and Array.

Overload for Optionals

When working with Optional string values we can add the following overload:

@available(SwiftStdlib 6.3, *)
extension Optional {
  @available(SwiftStdlib 6.3, *)
  public func isIdentical(to other: Self) -> Bool
  where Wrapped == String {
    switch (self, other) {
    case let (value?, other?):
      return value.isIdentical(to: other)
    case (nil, nil):
      return true
    default:
      return false
    }
  }
}

Because this overload needs no private or internal symbols from Standard Library, we can omit this overload from our proposal. Product engineers that want this overload can choose to implement it for themselves.

Alternative Semantics

Instead of publishing an isIdentical function which implies two types must be equal, could we think of things from the opposite direction? Could we publish a maybeDifferent function which implies two types might not be equal? This then introduces some potential ambiguity for product engineers: to what extent does “maybe different” imply “probably different”? This ambiguity could be settled with extra documentation on the protocol, but isIdentical solves that ambiguity up-front. The isIdentical function is also consistent with the prior art in this space.

In the same way this proposal exposes a way to quickly check if two String values must be equal, product engineers might want a way to quickly check if two String values must not be equal. This is an interesting idea, but this can exist as an independent proposal. We don’t need to block the review of this proposal on a review of isNotIdentical semantics.

Acknowledgments

Thanks @Ben_Cohen for helping to think through and generalize the original use-case and problem-statement.


  1. swift/stdlib/public/core/String.swift at swift-6.1.2-RELEASE · swiftlang/swift · GitHub ↩︎

  2. swift-collections/Sources/DequeModule/Deque._Storage.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  3. swift-collections/Sources/HashTreeCollections/HashNode/_HashNode.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  4. swift-collections/Sources/HashTreeCollections/HashNode/_RawHashNode.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  5. swift-collections/Sources/RopeModule/BigString/Conformances/BigString+Equatable.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  6. swift-collections/Sources/RopeModule/BigString/Views/BigString+UnicodeScalarView.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  7. swift-collections/Sources/RopeModule/BigString/Views/BigString+UTF8View.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  8. swift-collections/Sources/RopeModule/BigString/Views/BigString+UTF16View.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  9. swift-collections/Sources/RopeModule/BigString/Views/BigSubstring.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  10. swift-collections/Sources/RopeModule/BigString/Views/BigSubstring+UnicodeScalarView.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  11. swift-collections/Sources/RopeModule/BigString/Views/BigSubstring+UTF8View.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  12. swift-collections/Sources/RopeModule/BigString/Views/BigSubstring+UTF16View.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  13. swift-collections/Sources/RopeModule/Rope/Basics/Rope.swift at 1.2.0 · apple/swift-collections · GitHub ↩︎

  14. swift-markdown/Sources/Markdown/Base/Markup.swift at swift-6.1.1-RELEASE · swiftlang/swift-markdown · GitHub ↩︎

  15. Swift-CowBox/Sources/CowBox/CowBox.swift at 1.1.0 · Swift-CowBox/Swift-CowBox · GitHub ↩︎

  16. swift-evolution/proposals/0447-span-access-shared-contiguous-storage.md at main · swiftlang/swift-evolution · GitHub ↩︎

2 Likes

Looks like you might have posted two very similar pitches in rapid succession today. Can you clarify the difference? :slight_smile:

1 Like

Ahh… good question!

These are two similar-slash-different different — but independent — pitches:

  • This pitch is proposing a method for identity equality to a well-defined set of concrete types from standard library.
  • The other pitch is proposing a method for identity equality to the Equatable protocol. The other pitch does not ship with its own strong opinions about what concrete types should or should not override that method with their own concept of identity equality.

The pitches are conceptually and philosophically solving for similar problems… but can also be reviewed and shipped independently. Mixed feedback on one proposal does not necessarily need to block the other proposal from shipping.

1 Like

In what scenarios are you pitching that the one method would return true but the other false, and vice versa?

2 Likes

This specific pitch is concerned with methods on concrete types:

extension String {
  func isIdentical(to other: String) -> Bool {
    self._isIdentical(to: other)
  }
}

This specific pitch does not ship with its own opinions about whether or not a similar method should be added to Equatable… but if another pitch did add a similar method to Equatable it should just forward to the same method to override the default behavior:

extension String: Equatable {
  func isKnownIdentical(_ lhs: String, _ rhs: String) -> Bool? {
    lhs.isIdentical(to: rhs)
  }
}

Is it your pitch, then, that in all cases on a concrete type with both methods, isIdentical and isKnownIdentical would have identical behavior? If so, can you explain why they're pitched with distinct names and separately?

Or to put the question another way—what distinguishes the semantics of your two pitched methods? Why do you feel that we need both?

1 Like

A few ideas:

The isKnownIdentical pitch is a new method on a protocol with a default implementation. The pitch does not explicitly say how library maintainers should override that default implementation and I expect the decision to override that default implementation on any specific type — like String or Array — can take place outside of a proposal design review.

The biggest difference in semantics is the return value. This specific pitch proposes methods on concrete types that return true or false to indicate identity equality. The method on the other pitch can also return nil to indicate this type is not able to determine identity equality in a way that is meaningfully faster than a linear-time check for value equality.

The two pitches are similar but can land and ship independently. We can add public methods for identity equality to concrete types without any generic protocol method and we can also add a public method for identity equality to a generic protocol method without any public methods for identity equality on any concrete types.

I suppose, then, my question above boils down to—what purpose does it serve to have distinct notions of "identity equality" and "value equality" for a type where it is not possible to determine the former "meaningfully" differently from the latter?

Or, concretely: if a given type implements ==, isIdentical, and isKnownIdentical, and I as a user obtain nil from calling isKnownIdentical, when would I then need to call isIdentical instead of ==? Or is there some other combination or scenario I've missed where all three come into play?

We could just as well sever one pitch into several parts. My question to you is not what we can do, but as the author of both pitches having studied the problem in depth, why should we ship two distinct designs to serve the same ends? I think this ends up being isomorphic to the first question.

2 Likes

This specific question might be better suited for the other thread which is focused on isKnownIdentical. The isKnownIdentical method is added to Equatable with a default implementation. The isKnownIdentical pitch does not ship its own opinions about specific concrete types that should define their own custom override. AFAIK that decision can be up to the maintainers of the library and can exist outside the scope of a proposal design review. The expectation is that library maintainers should "do the right thing" in the same way that library maintainers can override any other default implementation on any other protocol.

The original pitch was unified: a protocol method for generic constraints and methods on concrete implementations. The discussion from that original pitch changed my mind: two independent pitches is a better idea for now. I am open to hearing more pros and cons if unifying the pitches back together is a better idea… but for right now I still believe two pitches that can be considered and shipped independently is better.

Sorry to harp on this, but it’s the crux of my question: …better for what use case?

I have specific feedback about your point as to isKnownIdentical which I'll save for the other thread (specifically, that protocol requirements are not bags of syntax and must as the subject of their review discuss the expected semantics required of implementations).

However, my question here is specifically regarding the pitched API here, isIdentical. I'm not sure I understood your reply: If isKnownIdentical returns nil because there is no way to "meaningfully" obtain the answer, in what scenario would I want to resort to isIdentical rather than ==?

Put another way, is there a distinct semantics you're pitching here for isIdentical that must be distinguished both from == and from isKnownIdentical? Or, isomorphically, going back to the above: what's convinced you that having two independent designs is "better"?

2 Likes

That was the answer: my POV was splitting apart "one large" pitch was better because two small pitches could be considered and shipped independently. Mixed feedback on Pitch B would not need to block shipping or landing Pitch A if the feedback on Pitch A was universally positive.

If the feedback on these two pitches is that the pitches must be unified back together before a proposal review begins then I am open to changing my mind about that. For now… my POV is still that two small pitches is better.

I'm also not sure I actually understand the question:

  1. isKnownIdentical returns nil.
  • This is the case where a type either does not override the default isKnownIdentical method or does override the isKnownIdentical method and returns nil.
  1. in what scenario would I want to resort to isIdentical rather than ==?
  • isIdentical returns true in constant time to indicate these types must be equal by value. isIdentical returns false to indicate we can't tell if value equality is true or false.

We might be getting a little into the weeds about implementation review… but just to clarify my expectations when isIdentical returns false: this is a type that can compare identity equality (for example: because of some underlying copy-on-write storage pointer reference) but can't return true for these two instances at this moment in time (for example: the copy-on-write storage pointer references are two different addresses).

This I see as different than whether or not two types can or cannot determine a meaningful identity equality at all. If a value type has no underlying pointer reference that can compare in constant time that would be an example of when returning nil is appropriate. A value type like InlineArray might not have meaningful ability to return an answer for identity equality without a linear-time comparison through all N elements… which is not the semantics we want for this method.

I will say that as a purely pragmatic matter, the title and content of these pitches was so similar that I myself thought this was an accidental repost with the name changed from isKnownIdentical to isIdentical—I think it will cause undue confusion to try to run these discussions simultaneously without wires getting crossed. If they're truly intended to be separable and independently justifiable then I'd probably recommend trying to sequence them purely to ease the cognitive load on the community of trying to keep them separate.

6 Likes

I agree this is good advice from a logistical perspective.

Perhaps I've misunderstood your intention in pitching these separately. As @Jumhyn points out, the implication of having two separate pitches where each feature is to be independently assessed is that they are orthogonal features. You're staking a claim, then, that you have studied the design space and concluded that there are major use cases not served by either feature alone, and/or that there are use cases where the combination of the two features composes well together. Hence my questions as to what those use cases are.

It is not the case that pitches must be the smallest quantum of a feature, nor does that sizing necessarily make it easier to pitch or review. Indeed, we've got a recent and long thread where community members point out that having multiple documents that tackle similar use cases inhibits understanding. For effective review, you end up having to describe how your multiple pitches compose—something which may then require an overarching vision document, for example.

Incidentally, for clarity: the user-observable behavior of an API (for example, what value is returned when the pitched method is called) is not what we mean by "implementation" and is part of the proposed design that is subject to Evolution review—indeed, it is the meat of what we consider for this process and is front-and-center for your pitch.

When we say that Evolution is separate from "implementation," we mean those nuts and bolts which are not user-observable: for example, we do not quibble here about what you name any library-internal helper functions.

2 Likes

As to the substance of the pitch:

If a user has two values a and b, it seems to me that it's enough to know that those specific values are (or aren't) known to share the same underlying storage. It doesn't strike me as particularly salient for most (all?) such users whether the type of a and b is such that no other values c and d can share (or be known to share) their storage.

On the flip side, if a user cares specifically whether a type T (not types—nothing in the other API is about comparing more than one type, if I understand correctly) has some notion of sharing underlying storage, why do we restrict them from asking the question unless they have or deliberately materialize two values a and b? A type-level question—if there is a need to answer it—is most straightforwardly addressed by a type-level API. This could be a static property; or in this case, it is semantically a capability which would be modeled by a protocol. Users could then express the query as something like: T is HasPotentiallySharedStorage.

1 Like

Thank you for your pitch Rick, I understand the clarity you seek from adding a new method but in practice defining identity and making it apply to value types (with CoW) is the conceptual larger leap more than extending === to apply to value types.

People understand == and === currently, CoW / reference types help by value types could help people choose between == and === as much as “I need to reach for isIdentical because just being equal does not fit the bill”.

Extending === and giving a brief explanation around it would IMHO make it more likely for newcomers to do the right thing and for people with a better grasp of why being identical can be more through than an equality check the thought process of overloading === for value types will not be a problem. They may actually welcome it instead of another method or yet another keyword (not that you are proposing one :)).

4 Likes

That's a fair point. If an implementation was ready and we planned for a real proposal review they don't have to run concurrently with each other.

That's a fair point. But both pitches are also standalone features that could ship independently. Reviewers could very well have the opinion that both pitches should ship as one — and that's a fair opinion — but splitting these two pitches apart did not result in two "broken" features that are incomplete and blocked on each other. We could decide to ship Pitch A without Pitch B. We could ship Pitch B without Pitch A. Or we could ship them both.

That's a fair point. For now… we do document in our pitch the "contract" we expect library maintainers to follow when overriding the default behavior of isKnownIdentical in terms of semantics and performance. I believe I thought the conversation was trending "tactical" from a question about how a library maintainer would choose to coordinate between isKnownIdentical and isIdentical.

Just to help me understand… I believe you are directing this feedback at the concept of a protocol method that returns nil to indicate these types do not have a concept of identity equality? Is that correct? This feedback is not directed at the concept of implementing identity equality on concrete types like String that only return a true or false? Is that correct?

I believe you are thinking in two possible directions:

protocol Equatable {
  static var canCheckForIdentical: Bool { get }
  
  func isIdentical(to other: Self) -> Bool
}

extension Equatable {
  static var canCheckForIdentical: Bool { false }
  
  func isIdentical(to other: Self) -> Bool { false }
}

extension String: Equatable {
  static var canCheckForIdentical: Bool { true }
  
  func isIdentical(to other: Self) -> Bool {
    self._isIdentical(to: other)
  }
}

Here we define a isIdentical method on Equatable that does not return an optional Bool and a static method for returning whether or not this type can meaningfully determine identity equality. A reasonable "default" behavior for both methods is false: Equatable types by default do not meaningfully determine identity equality and if a product engineer attempts to determine identity equality we return false. At that point… our String (and Array etc) adopts isIdentical to override the default behavior of Equatable but this method is the same name with the same signature.

I believe the other direction you are thinking is:

protocol HasPotentiallySharedStorage {
  func isIdentical(to other: Self) -> Bool
}

extension String: HasPotentiallySharedStorage {
  func isIdentical(to other: Self) -> Bool {
    self._isIdentical(to: other)
  }
}

Which is a new standalone protocol independent of Equatable. The isIdentical method returns a non-optional Bool and the method we adopt on String (and Array etc) matches the same name and signature.

Does that sound like closer to what direction you are thinking? I still would have disagreements with both these approaches as sketched out here and I still would prefer the pitch in its current form… but it might help if I knew that I was addressing your points more clearly after I had a more clear example to draw from.

1 Like

I think my POV on === for now is that while === might make sense as one option to expose identity equality… I don't believe I have a strong opinion right now that === should be the only option to expose identity equality. My POV is we can and should keep consistent with the documented prior art in this space: an isIdentical method that returns a Bool is a well-established pattern in Standard Library and Swift Packages.

If we did support === at some point in the future… I wouldn't see us then deprecating or obsoleting any of the existing isIdentical methods. These could exist as complimentary.

Since I don't see a discussion about === as blocking a discussion about isIdentical… this pitch keeps === as an "Alternative Considered"… but it could also exist as a "Future Direction".

Yes, those directions would address that point of feedback, so I think you're interpreting what I meant to communicate correctly.

(However, as I said over in the other thread—the two threads going to the same design problem—I suspect key use cases are directly expressible by composing facilities we already have or will soon inevitably have...)

2 Likes

The "Introduction", "Motivation", "Prior Art", and "Proposed Solution" sections of the two proposals are essentially the same (very high on the déjà vu scale).

3 Likes