[Pitch #2] Add `isIdentical` Methods for Quick Comparisons to Concrete Types

Thank you for helping me understand some points a little more clearly. I believe some previous comments address some more details that could be relevant to that:

We currently consider an independent protocol outside Equatable to be an "alternative considered". The previous pitch took this direction… but the previous pitch review and discussion compared the pros and cons of an independent protocol and I changed my mind after hearing those points. We feel that keeping this on Equatable is the right trade off for now for this pitch.

My understanding is similar: we have access to Span for types like Array… but we will not AFAIK ever have access to an easy Span transformation for types like Dictionary. Coupling our "identity equality" to Span limits our flexibility to make use of this API.

The previous pitch did not include a nil returning "default" value. This was added after the previous pitch review. We feel like nil communicates important information and context that is meaningfully different than what a false default value would represent.

You're not wrong… it's a valid point and if these two pitches did turn into formal proposal design reviews I see no reason they must run concurrently with each other.

We've had pitches complete with "Partially Accepted" as an outcome. Were these two pitches presented together and the one half was well-received and the other not, that could happen without having introduced the confusion of two seemingly identical pitches (yes, I do understand the distinction, but I didn't before you explained it in the comments). I guess that's moot, though, as it's too late to avoid confusion now.

Edit: I guess the acceptance is actually for formal Proposals, not pitches, so I suppose that a single unified Proposal could still be made with both "halves" of the Proposal being independently evaluable.

1 Like

A related use case that comes to mind is how some older APIs (associated objects in Obj-C come immediately to mind) encourage you to declare some statically allocated value and the identity of that string is used as the key into some data structure. (IIRC, most associated object keys are just a pointer to itself, guaranteeing uniqueness without caring about the concrete value, but you could imagine that strings could be used for this.)

For example, a similar Swift API could enforce that keys can't be injected at runtime by requiring them to be passed around as StaticStrings. If we need to look up values based on those keys, we don't want to pay the cost of doing string comparisons—whether the key matches is determined solely by the identity of the underlying read-only storage.

I noticed that StaticString wasn't included among among the supported types in this proposal. I think there's an argument that it would be reasonable to include it. (There's probably also an argument to skip it.) Let's assume that we added it. Given two StaticStrings, we could determine whether they're identical using this API, but the lookup would still be awkward because we'd have to reach for the underlying pointer directly (which StaticString does expose, but other APIs may not).

While it's probably more of a future direction than something required by this proposal, have you given thought to an "identity hash" operation that would be the hashing analogue to this quasi-equality comparison, and what that might look like?

1 Like

The LSG talked about this, and we had a few bits of feedback we wanted to convey as a group:

  1. The example code in the proposal seems pretty abstract; it’d be better if you could provide a more concrete algorithm that would benefit from this.
  2. It looks like you’re missing some stdlib types that should have this method, or at least deserve discussion about why they’re not included:
    • Character - should definitely provide it
    • String.*View - should definitely provide it
    • StaticString - certainly implementable; perhaps unclear how it would be used
    • StaticBigInt - certainly implementable; perhaps unclear how it would be used
    • KeyValuePairs - certainly implementable; perhaps unclear how it would be used
  3. The description of the method on String (and associated types) probably needs to discuss how this interacts with a couple different aspects of the string representation:
    • The small string representation presumably counts as a separate identity, such that (a) two small strings are identical if they represent the same data and (b) a small string is never identical to a string in a buffer.
    • It’s unclear what happens with the “summary bits” like “all the characters are known to be ASCII”, which we believe are not guaranteed to be correct in slices (that is, a slice that has the property may not have the bit set if was formed from a string that the bit would not have held for).
5 Likes

(speaking just for myself)

I know the isIdentical name has precedent from Span, but I'm a little concerned about it. There are a lot of languages where equality checks are done with a method call, including Java (equals) and Objective-C (isEqualTo:).[1] My concern is that someone coming to Swift from one of these languages might look for a method named something like that and find isIdentical(to:), making it an attractive nuisance. I think it would be better if the method had a more specific name that talked about the representation of the value, like hasSameRepresentation(as:).


  1. Usually when I start talking about Java and Objective-C, I can reflexively throw in C#, but in C# Equals actually still just compares reference identity for many of the common data structures. If you want to compare two arrays in C#, you have to write the algorithm out yourself. ↩︎

7 Likes

To add to this, String.isIdentical(to: String) sounds like something that a user could conceivably believe does a linear comparison other than canonical equivalence (like comparing UTF-8 code units).

3 Likes

Food for bikeshedding thought: many Swift programmers will already be familiar with isKnownUniquelyReferenced(_:) and isKnownASCII as false-negative fast paths. So isKnownIdenticalTo(_:)seems like a good place to start from.

7 Likes

Please correct me if I'm wrong… but AFAIK product engineers are unblocked today to build their own isIdentical implementation on StaticString with public APIs:

extension StaticString {
  func isIdentical(to other: Self) -> Bool {
    guard
      self.hasPointerRepresentation == other.hasPointerRepresentation,
      self.isASCII == other.isASCII
    else {
      return false
    }
    if self.hasPointerRepresentation {
      return self.utf8Start == other.utf8Start && self.utf8CodeUnitCount == other.utf8CodeUnitCount
    } else {
      return self.unicodeScalar == other.unicodeScalar
    }
  }
}

This already looks like a constant-time operation. I don't see any room for a ton of impact from building a new isIdentical symbol directly in standard library on StaticString. I'm not blocking another engineer from proposing that… I'm just not personally making myself available for writing the implementation (and the tests… and the benchmarks…) at this time.

Looking from the other direction… shipping isIdentical on StaticString does not unblock any of the isIdentical implementations already being proposed. They are conceptually related… but orthogonal and complementary.

I'm not completely sure I understand what your idea is here… are you suggesting that we can implement a new method that returns a "hashValue" after trading accuracy for speed? This might be an interesting idea… but I don't see why that blocks anything being proposed here currently. We also don't really want to expose or escape any transformation of the underlying identity. We don't want to expose the identity to give product engineers creative ways to make use of the identity… we want to use the identity inside to then return something we believe is useful: if two instances are indistinguishable from each other.

Sure, anyone could write it today, but your proposal is specifically defining a family of APIs and adding it to concrete types, and one of the questions we discussed was "are there any other stdlib types that it makes sense to add this to" (see John's reply as well for some others).

As for not making yourself available to write the implementation, you pretty much did already write it in your reply :sweat_smile:

I'm not asking for specific changes here, it's more of a thought exercise. Identity as it's being defined by your proposal can be used to define a different notion of equality of two values, and there's a fairly straightforward generalization of that concept that extends to hashability (since the two would need to be kept in sync)—it's that the hash value would be determined using the same values being compared by isIdentical(to:). If the identity of a value is useful for comparison, then I'm trying to discern whether there are situations where the corresponding "identity hash" would also be useful. The use case "I have a compile-time static value that I want to track by its location in memory instead of its value" is one that comes to mind. Some kind of "intern table" for large CoW values might be another where it's useful to track the identity itself (or a value derived from it) instead of just whether two already-known values are identical.

3 Likes

Hmm… I'm not completely sure I understand how detailed of an example you are looking for. Would you accept some linear-ish operation at a kind of high level that might use code from outside of standard library?

  • We want to securely hash a String value with CryptoKit?
  • We want to parse a String with a JSONDecoder or JSONSerialization?

Something at that scope? Kind of saying "here is place we need to do operation X" as opposed to writing all the code that does operation X? Is that the right idea? Would you have another proposal you could think of that was accepted that might give me another idea for me to see how detailed this example should be?

Short answer… at this time I am not volunteering to write these implementations as part of this proposal. Longer answers:

Character

AFAIK product engineers currently have the ability to transform Character to String with a public API:

Which uses this constructor from String:

The implication here AFAIK is that transforming Character to String is a constant time operation. We might pay a small performance penalty from an extra retain when we copy those StringGuts… but this penalty should not scale linearly with the amount of data in our Character.

If we ship isIdentical on String and product engineers can trivially transform Character to String in a constant-time operation… I do not currently see the impact in shipping isIdentical directly on Character at this time.

From another direction… do we know the practical limitations on how many dimensions one Character value could actually represent? Even if one Character value was a high-order composed sequence like an emoji we don't practically expect the dimension of this Character to ever be truly unbounded? Correct? The implication there is that comparing two Character values with a potential isIdentical method might not return its result meaningfully faster than just checking two Character values for conventional value equality using the == operator.

UnicodeScalarView

Similar to our situation with Character we already have a public API to transform a UnicodeScalarView back to a String in constant time:

Which them implies we can use the isIdentical method on String.

We also might pay a small performance penalty for the extra retain… but I do not see the impact from improving this small performance penalty being great enough to volunteer to write this implementation as part of this proposal at this time.

UTF16View

Similar story here:

This API is available from the 4.0 standard library release which was 2017.[1] A potential isIdentical method on UTF16View could potentially backdeploy to earlier releases… but I do not see the impact here being great enough to volunteer to write this implementation as part of this proposal at this time.

UTF8View

In addition to the ability to transform UTF8View back to String from the 4.0 standard library:

We also have a span property available for the latest releases:

And Span also implements its own isIdentical method.

A potential isIdentical method on UTF8View could potentially backdeploy to earlier releases… but I do not see the impact here being great enough to volunteer to write this implementation as part of this proposal at this time.

StaticString

Please see my earlier comments:

StaticBigInt

I can't actually find many places we use this across Swift or Apple other than here:

For some more context on that we describe integers of 20 bytes as "extremely large":

Would a new isIdentical method directly on StaticBigInt save a significant amount of performance as an alternative to transforming StaticBigInt to its Equatable wrapper and then checking for value equality? What is the "practical" upper bound on how many bytes engineers plan to store here and how does that upper bound compare to the width of one pointer on this machine?

At this time I do not see a ton of impact here for adding isIdentical to StaticBigInt.

A potential alternative would be to add a span to StaticBigInt that exposes the underlying buffer. I did not see StaticBigInt discussed as part of the SE-0456 review but this might be worth discussing as part of a different proposal. I am not volunteering to add the span property myself as this time.

From a different POV StaticBigInt is a relatively new type that is only available from the 5.8 standard library that shipped in 2023.[2] My current priorities are focusing on standard library types that are more widely available and this is where I see the most impact.

KeyValuePairs

The span property added to KeyValuePairs is backdeployed to 5.0.[2:1] A potential isIdentical method on KeyValuePairs could potentially backdeploy to earlier releases… but I do not see the impact here being great enough to volunteer to write this implementation as part of this proposal at this time.


Some constant themes from these new types:

  • Adding support for isIdentical on these new types does not unblock adding support for isIdentical on any of the types currently included in this proposal.
  • Our proposal currently focuses on what we see as the highest priority and most impactful types to add isIdentical. Just because we can add isIdentical to more types does not imply we must.
  • I'm not arguing that any of these types must never be added.
  • My current understanding of the evolution process is that if I proposal adding isIdentical to a type I am also volunteering to either write the implementation myself or recruit an engineer to volunteer to write the implementation. None of these types seem impactful enough for me to make that commitment… but if we had some flexibility here I would not be opposed to adding these types to the proposal review only if LSG understands I am adding this on the expectation that someone else can write the implementation on their own time and I am not currently planning to prioritize either being that engineer or finding that engineer.
Side discussion on Engineering as a Volunteer

As you probably figured out by now… I am not an Apple employee. The project lead does not give me any impact-slash-money for building this code. As you might not know… I am not an employee anywhere. There is no employer or manager giving me impact for volunteering to contribute any of this.

The "ruthless prioritization" here is that any and all "scope creep" is actually costing me impact from time that I could be spending engineering for money as a consultant or employee.

It's not a bad thing… I offered to volunteer and I understand what volunteering means… but just for some additional context here I will push back strongly against any additional asks here that I feel are orthogonal, non-blocking, or lacking in clear impact if that additional code can be completed in the future by another volunteer or an employee of the project lead.

All that being said… I am assuming good intent on the part of any reviewer that proposes adding work to new types. I know you all mean well… and I don't think these are bad ideas… I'm just not offering to be the volunteer to implement those ideas at this time. My hope is you can also assume good intent on my part when it comes to the constrains I am operating under.

Three ideas on that:

  • Detailed conversations about header doc comments specifically as they relate to implementations details on that specific concrete type might be more appropriate for the diff reviews on GitHub. I'm not opposed to some more general discussion about header doc structure or design taking place on the proposal review… but getting too far into the weeds about header doc comments on specific types might not be a great fit for the proposal review. But I am open to hearing other ideas about that if there exists clear precent or compelling reasons that proposal review is the correct place.
  • But the bigger point on that topic is that the implementation details don't really matter… and that's sort of the point. Product engineers making use of isIdentical don't really need to know or worry about how the type is determining the value… what product engineers care about is what can we do with this value after we know it is true. For String values returning true implies that two values must be equal by value. That's it. Pretty simple to me. I see no need to expose implementation details of our internal storage in header doc.
  • These types are all open source. If a product engineer really wanted to know how this value is computed there is nothing stopping them from reading the code themselves to see.

  1. Swift 4.0 Released! | Swift.org ↩︎

  2. Swift 5.8 Released! | Swift.org ↩︎ ↩︎

Our "prior art" in this space goes back years… well before Span was built:

I certainly think you are entitled to your own opinion on what the name should be… but I would then ask you for the data to back up this opinion. What prior art would support this opinion as being more correct?

I do think it's fine (and encouraged) for reviewers to tell me what their opinion might be… but the current opinion in this proposal comes with several documented instances of prior art backing up our claim that:

  • yes this is the correct name
  • product engineers are not going to be confused

Explaining what your opinion might be is a good first step… but then I would like for the persuasion to begin. Please persuade me why this new opinion is more correct than my current opinion taking into account the years worth of prior art we have documented.

If the new opinion does not have data backing it up… it's trending in the direction of conjecture. Conjecture is not necessarily bad… and if two conjectures are battling it out head to head maybe then we can discuss more of the "what ifs" and "might happens". But if our existing opinion documents all this prior art I think that presents a much stronger argument.

You are welcome to disagree with my suggestion; I was clear that I'm speaking for myself.

However, the proposal doesn't actually discuss alternatives to the name at all. It does have a "Prior Art" section with a lot of links to existing methods with this name; however, almost all of them are internal, and it's likely that they're all just copying a single decision that was made without too much thought during stdlib development because, after all, it was a purely internal method. I don't think it's unreasonable to ask for the proposal to discuss alternative names, even if you ultimately prefer this one.

It is relevant that Span uses this name, but (1) String et al. are much more prominent types than Span and (2) Span's interface is very new and could easily be fixed if we decided that this wasn't the right name.

5 Likes

No, a single Character can be arbitrarily large. In practice, most characters are small. But so are most Arrays. isIdentical (a [masked?] bitwise comparison of 16B) can absolutely be much faster than String.==.

4 Likes

It’s very important to be clear that for String and its various views, Substring, and Character, .isIdentical would not be equivalent to .isIdentical called on the span properties of the two strings (because of the small string representation; two identical small strings will generally have non-identical spans unless the compiler can statically prove that they are identical).

If John’s feedback was about “implementation details”, I would agree with this. But John’s feedback is about precisely defining the semantics of the operation that you’re proposing, and therefore an essential part of the proposal.

2 Likes

You seem to be saying that it’s ok to spend less time on API design because it’s open source. I’m not on the Language Working Group but I would be surprised if they would agree.

The reason I suggested isKnownIdenticalTo(_:) is because my understanding is that this API might return false even if two values are “identical,” for whatever definition of “identical” the implementor chooses for that specific type. This clearly must differ from the definition of equality or else there is no reason to add this method to a type. It also must be allowed to differ from bitwise equality of the value representation, or else it could be a single global function isIdentical<A, B>(_: A, _: B) -> Bool.

5 Likes

So that's a fair point… and I encourage engineers to propose types that should be added to this proposal. But if a type that is proposed to add to this proposal can accomplish the same functionality and semantics using existing public API then I see that as indicating this would not be an impactful change for me to support at this time.

The existing types currently in the proposal all leverage private or internal symbols to perform their identity checking. We need a new API in standard library because there is no public API available today for product engineers to do this.

Sure… another fair point. But then what comes after the implementation? Here is an RFC diff open for adding isIdentical to String and Substring:

The diff is currently 232 LOC. And that's before engineers request any additional changes. The actual implementation code is 1 LOC for String and 2 LOC for Substring. All the rest is tests and benchmarks… which I would also expect must ship for any additional types that were added to the proposal.

One small detail worth pointing out again is that we currently propose adding isIdentical to types that are not necessarily themselves Equatable. An example is Array. Array is only Equatable if its Element is Equatable. But we do not currently propose restricting our isIdentical implementation to only be available if Element is Equatable.

I'm still not completely sure that I follow… but for types that adopt Equatable the semantics of isIdentical return true to imply that two instances must be equal by value. But the semantics of isIdentical do not make any promises about value equality after returning false. So I think maybe any potential pain points about Hashable are kind of academic for now… unless there's something I'm missing about that.

This sounds interesting… but my POV so far is this can remain out of scope of this specific proposal. But if there do exist some legit reasons that a discussion about Hashable should block review here then I am open to hearing more about that.

Is not x.span.isIdentical(to: y.span) a valid implementation of the proposed function for two arrays x and y?

1 Like

We do discuss one alternative name in our proposal: maybeDifferent. But this isn't actually a different name for the same semantics… it's actually a different name and different semantics.

One potential issue here is I don't always have great visibility into historical engineering decisions that were made internally at Apple. Here is an example: we added a public and underscored isIdentical method to String over two years ago.

This diff points to a rdar that I can't open. What's inside there? What mysteries does this hold? What lore could I uncover? Was isIdentical controversial? Were there long debates taking place internally? Or… was it not controversial at all and the engineers all agreed this was the right name?

This is a fair point. The proposal could be updated with different names discussed as "Alternatives Considered" along with maybeDifferent.

I did search through the SE-0447 pitch and review threads for Span and I see zero discussion or debate about isIdentical. And those pitch threads began on 2024-02-06 and ended 2024-10-21. So for eight full months there was apparently zero controversy about isIdentical.

While I don't necessarily suggest that the lack of any documented controversy about adding isIdentical to Span is some kind of definitive "proof" that this is the right name… it does look like one more data point for me to add to the scale. This together with the prior art across standard library and swift-collections tell me this is the right name and at this time I have not seen any compelling argument to persuade me to change my opinion about that.

1 Like

This is a fair point. Character can be arbitrarily large and a conventional == operation performs a linear comparison across all N bytes. Which can be slow at scale.

The alternative then is my previous comment:

At scale the constant factor performance penalty from an additional retain is a legit tradeoff to save us from performing all the work in the == operator. Shipping isIdentical on String then gives us isIdentical on Character at the expense of one retain for each Character.

I'm not sure I agree with this POV. The semantic contract of isIdentical on String is that two String values that return true from isIdentical must also be equal by value. That's it. We don't make any promises about what a return value of false from isIdentical means… product engineers use this API to trade accuracy for speed and that's our semantic contract.

The implication is that all these APIs must defend against any "false positives": where two values that return true from isIdentical are not equal by value. These are Big Bad Bugs. But we understand that this API will deliver "false negatives": where two values that return false from isIdentical are equal by value. This is ok. This is not a bug.

The span property on UTF8View would satisfy our semantic contract: a return value of true implies value equality. I don't see the "extra" false negatives as a major issue. It's an opportunity for optimization… and this could certainly be an argument in favor of a dedicated isIdentical at some point in the future… but as a temporary workaround this option — along with transforming UTF8View back to String with an extra retain — are legit ways for product engineers to make use of this idea without blocking this specific proposal review on implementing a new isIdentical directly on UTF8View.

I'm not sure I agree with this POV. The semantics of our operation have already been defined: returning true from isIdentical implies value equality. That's it. Discussions about the implementation details of how this type arrived at that return value do not necessarily need to fall under the umbrella of the semantics of the operation IMO. I'm open to hearing more POVs on this specific topic… and I'm not saying that adding more implementation details in the header doc comments is a bad idea… but I do not currently see how or why that discussion must take place in a proposal design review: a diff review on GitHub still looks like the more appropriate place for that to happen.

What might help here is data. Would you have a previous proposal review you could pull up to show under what circumstances discussions about header doc comments need to take place as API design review and not in a GitHub diff implementation review? And to show under what circumstances header doc comments that expose implementation details would actually begin to cross over into what LSG considers to be design?