SE-0494: Add `isIdentical(to:)` Methods for Quick Comparisons to Concrete Types

No, I don't think you are reading that right. I understood @xwu's point to be that Hasher itself 'haphazardly' changes its hashing behavior such that we could quite easily adopt any change in hashing behavior we want without worry—in some sense it is 'maximally' unreliable and free of Hyrum's law.

6 Likes

OK, thanks, Frederick. I'm really tired, I must not have parsed this right. :pensive_face: I apologize; I removed that outburst.

Hasher gets closer to the ideal, but it is not free of Hyrum's Law, either. Changing the hash encoding of a type is technically still an observable change, and this does regularly lead to issues. This is particularly the case if hash(into:) ever gets marked @inlinable -- which sadly it often is. (Even I found myself being forced to mark it inlinable in the stdlib (within generic types in particular), because the alternative was unacceptably slow.)

Hasher also prominently (and very explicitly) does not document the specific hash function it implements (I suppose this is what you meant by referring to "undocumented" details). This (and random seeding) was indeed designed to allow us to change the hash function at will -- however, this still does not prevent Hyrum's Law from raising its ugly head: if anyone ever breaks Hasher's type barrier to extract its random seed, or to replace it with their own choice, then all bets are off again. I cannot really prevent that; all I can do is to hope it won't prevent us from shipping a replacement, if it ever becomes necessary. A theoretical problem is not an actual problem, and it should not prevent us from landing routine changes.

Trying to over-specify an entry point is counter-productive -- I think our job is to establish useful abstractions that free clients from having to think about every single tiny detail. Nobody but String's maintainers should care whether some hidden bit in its representation is compared by isIdentical(to:) -- in practice, practically no-one would need know about it, and the few people who do can trivially look it up in the implementation.

(Over-explaining things is also dangerous; developers do get scared when faced with complex documentation, and in my experience they often end up trying to reimplement the same thing on their own, often ending up breaking encapsulation and/or implementing a far less efficient home-brew alternative.)

Sure, it is an important question. However, whether or not an API ships inlinable is not usually fodder for Swift Evolution discussions -- it is an implementation detail that tends to be refined on our way to shipping the thing. @inlinable may get added or removed (or, say, replaced with @_alwaysEmitIntoClient) due to regular, routine engineering work; it would be quite messy to have to formally update the associated proposal whenever we do that.

4 Likes

Quite right—and I agree these decisions in the scenarios you outline are an implementation detail and would like them to remain as such.

My point being—sorry if I'm being ineffective at articulating it—that by giving a definition of identicality in the vein that you have for this proposed API, that "there must be no way to distinguish between one instance and a copy of the other through their public API surface," I fear you've incorporated all observable behavior into the semantics of the operation, and thereby elevated all of these concerns into API-level concerns.

I think what you're saying is that it'd be a can of worms, which is what I'm saying too. My view is that by (perhaps in some cases unnecessarily strictly) adhering to the view that the operation compares what you've termed representational equality rather than trying to define the semantics based on distinctions via the behavior of any part of the public API, we avoid all of the above worms.

The alternative I find very intriguing which you've laid out is to make the implementation opaque—which, quite rightly, is not something we normally have to discuss but in this case would be essential to guaranteeing the semantics you've articulated without restricting future evolution of the type. To my mind, documenting opaqueness here is in a similar vein to how we document per-execution Hasher behavior—also not something we typically do but important for understanding that functionality.

7 Likes

A potentially "weaker" semantic guarantee is to worry only about "value" state:

There are two main semantic implications of our proposed operation on String: a true value implies that other is identical-indistinguishable to self. An important implication of being identical-indistinguishable is also being interchangeable-substitutable. And we already know that substitutable implies equal.

But interchangeable-substitutable is a weaker semantic guarantee than identical-indistinguishable. Interchangeable-substitutable is a guarantee on the public "value" state. Identical-indistinguishable is a guarantee on the public value and non-value state.

If LSG insisted that this operation must ship as isKnownSubstitutable(for:) we would no longer be making any guarantees about the subset of public state that is non-value. We would only be making a semantic guarantee about the public value state.

This of course drops potentially useful and impactful information on the floor… if the engineer using our function wants to know if these two representations are true "copies" identical-indistinguishable then we no longer communicate that.

If LSG wanted to get real fancy we could ship isIdentical(to:) and isKnownSubstitutable(for:). But I don't really see any need to overthink this and have to then produce documentation explaining the difference between isIdentical and isKnownSubstitutable and ==. Let's just go ahead pick one and ship that.

If LSG wants to ship isIdentical(to:) then we should ship isIdentical(to:) and stay consistent with the prior art from Span. If LSG wants to ship isKnownSubstitutable(for:) then we should ship that and add it to Span. If LSG wants to ship both for one strong guarantee and one weak guarantee then we should ship that.

But after 100 comments I'm ready to just hand it over to LSG and ask them to pick one. I agree that isIdentical(to:) is the right choice and the right semantic guarantee for now. LSG is welcome to take that advice and recommendation and make their decision using all the information available here to them. But I'm sort of beginning to run out of ideas how else to express my opinions without just repeating something that was already covered here.

2 Likes

OK; I think I get your point!

What bothers me about the representational definition is that it necessarily breaks through the veil of type abstraction, opening the magic box -- it inherently has to talk about the bits and blobs that are actually stored in memory, and it has to somehow explain how to sort them into two buckets based on whether they contribute to an instance's identity or not. Even assuming we wish to avoid giving type authors any freedom to reserve certain bits for special effects, any potential layout-derived padding bytes still need to be addressed.

As a type author, I'd find it deeply uncomfortable to have to spell out whether my abstract type's layout happens to include any such unspecified/padding bytes, much less to try to figure out where exactly they are in the representation. As a low-level stdlib engineer working on @frozen types, I would most likely be able to reason about these things, but Swift types generally do not have stable layout -- so it isn't very practical at all to try reasoning about their underlying bytes. (I believe the layout algorithm isn't documented, and it may change between compiler releases.)

For this reason, I think I would prefer to avoid recommending developers to implement their own isIdentical(to:) operations by comparing the bytes exposed by withUnsafeBytes(of:). I think that would have the potential of making quite a mess, even if everyone would limit this activity to types that they own/maintain (which I do not expect they would).

In contrast, the functional/operational specification approach avoids punching through abstraction boundaries. It's pretty easy to figure out if a field affects the result of any public API -- and finding that out immediately settles the matter, as we've seen just now when you pulled up UTF8Span's fresh new properties.

It also lends itself to a readable, non-magical (and safe) implementation: at first approximation, we simply have to compare all direct stored properties through isIdentical(to:), ===, bitPattern comparisons, or regular == invocations, as needed.

3 Likes

Ah, I see your point. I agree it would be unwise to recommend straight-through bitwise comparison of contiguous memory as a general strategy for implementing this facility, for the reasons that you state. Whether it’s the best way of squeezing the highest performance out of standard library types, that’s an implementation detail that I’d happily leave to standard librarians.

To go back to Hashable, as I think it’s illustrative, you’ve been very clear that a proper implementation should feed every member to the hasher that contributes to value equality. For this API, I agree that a parallel exhortation that you have to compare every member that contributes to identicality would suffice. But it still leaves the question of how to know which members that would include or exclude.

When talking about representation, I don’t necessarily think it always has to drop down to the level of discussing which bit is stored exactly where. (Perhaps in this I’m overly influenced by IEEE floating point, where there’s the notion of a value’s representation as sign, exponent, and significand distinct both from the bit pattern encoding and from the value itself.)

I wonder if we could converge on a (handwavy) notion that values of a type are represented in memory by various fields, perhaps based on what Swift must copy in order to correctly make an independent copy of the value, and that these fields should be compared. The exact memory locations are not necessarily needed to do this (unless desired as a handrolled optimization for advanced use) but it would be a notion that is independent of what the rest of the API surface area of the type happens to be. Indeed, we end up at the same point where, to a first approximation, all directly stored properties need to be involved in the comparison.

1 Like

If we ever introduce the "opposite" operation (that would give a definite "not same" quickly or "don't know"), what would we name it?

Consider the following table that includes a row for that distant future hypothetical operation:

Name Synonym for Complexity Examples
== "is equal" O(n) Current Equatable
isIdentical "is identical" O(1) References or bits being equal
TBD "definitively not equal" O(1) Data counts or hashes being different

Note that "not identical" would not be the right name for that operation as we could quickly tell different values apart even when they are not identical, the examples would be differers data counts or different hashes. "notEqual" - again, not quite right, as the false could then be misinterpreted as "equal" instead of "don't know". "knownToBeNotEqual" or, perhaps, removing the double negative - "knownDifferent" sounds right.

Then it word be tempting to rename the second row of this table as well to get to a consistent table:

Name Synonym for Complexity Examples
== "is equal" O(n) Current Equatable
isKnownEqual "known to be equal" O(1) References or bits being equal
isKnownDifferent "known to be different" O(1) Data counts or hashes being different

OTOH, “isEqual”, “isKnownIdentical” and “isKnownDifferent” trichotomy is very precise in meaning and could also work.


On something else that was brought up above: performance flags. I admit I don't have the full picture here, but it does feel to me that:

  • they should NOT be masked out / ignored during discussed identity check. Even if they are not exposed via public API they could be observable (e.g. via they performance implications).
  • for "long" strings they should probably be in the underlying string object to begin with, not in the outer 16-byte String value. If done so they would be totally irrelevant to this topic.
2 Likes

As of right now I'm still opposed to putting "equal" in the name itself. We propose a new method on Array without constraining that availability based on whether or not this Array is Equatable. Given all the extended and spirited discussions over 100 comments about choosing a name that communicates correctly… it seems to me like we are not improving correctness by returning true for isKnownEqual if that Array itself has no ability to return true from == because it has no == operator to begin with.

I would also argue that isKnownIdentical is actually not very precise in meaning. A return value of true seems to imply "we know" that other is identical to self. A return value of false seems to imply we don't-slash-can't know that other is identical to self. The actual concrete types we have proposed for this operation do know: it's a binary decision. It either is or it isn't and there really is not any sort of legit "undetermined state" here where we would be unable to return a meaningful result.

Seungsahn Haengwon

The Korean Zen Master Seungsahn Haengwon practiced "Don't Know" Buddhism. Students would ask "What is the nature of the Buddha" and the answer would be "Don't know".

The "alternatives considered" are just that: alternatives that are interesting enough to briefly discuss in our proposal document… but we see no compelling reason to include them in our detailed design at this moment.

As an abstract thought exercise is it impactful to then not only debate the name of the current operation being proposed on its own but to then begin a secondary debate for the name of a hypothetical second operation we explicitly decided not to include in this proposal? And then begin a tertiary debate back to the current operation being proposed leveraging the hypothetical second operation?

Don't know. There's my answer. Don't know. What is the right name for "must not be equal"? Don't know. How would the name we choose for "must not be equal" live alongside "is identical"? Don't know.

1 Like

Know what, I have to agree, it's logical. Between isIdentical isKnownEqual and isKnownIdentical only the first two make sense, the last one doesn't as you rightfully pointed out.

Seungsahn Haengwon

I mean at some point in future we may find ourselves in a "if we had thought about that upfront we'd have chosen another name but we didn't so now we have to live with that decision" situation and my thought exercise above was an attempt to prevent that.

1 Like

I’m struggling to understand what you mean here. It must depend on a particular interpretation of “identical.”

This proposal will establish a naming pattern (a term of art) for future APIs, and it makes sense to choose a name that preserves the flexibility to return false negatives.

A representation that is "identical-indistinguishable" is a representation that is "interchangeable-substitutable". And a representation that is "interchangeable-substitutable" is equal by the existing definition from Equatable.

But not all representations that are interchangeable-substitutable are also identical-indistinguishable.

TBF… I've been a little sloppy here before in the pitch threads about using "false negative" for a situation where other is not identical to self but other is equal to self. But this is probably not a completely correct "false negative" because it's a false negative looking at the implications of what a return value of true would give us: identical-indistinguishable implies interchangeable-substitutable implies equal.

I fail to see any compelling reason why a concrete type would need an API to return false indicating "we can't" determine if other is an identical representation. The concrete types in this proposal can determine if other is an identical representation and I don't know if there is any reason a library maintainer would ever not have the ability to return a meaningful answer for that operation.

For the sake of simplicity let's suppose is(Known)Identical is implemented in terms of
memcmp(...) == 0

Then for any two values we know the result, it is either false or true.

And false means:

we know it is not identical

and not that:

we don't know if it's identical or not

Now, false could also mean or be interpreted as:

we don't know it is equal or not

in those regards isKnownEqual could be a reasonable choice. But definitely not isKnownIdentical.

1 Like

As I said before: if an engineer here believed that isKnownEqual was a "good" choice I would strongly argue that isKnownSubstitutable is the better choice. When the concrete type adopts Equatable this gives us the same information and semantic guarantees. When the concrete type does not adopt Equatable this no longer implies something that can't be possible because we don't have an == operator.

But either choice is still a weaker semantic guarantee than isIdentical. Returning true from isIdentical tells us that other is identical-indistinguishable and therefore interchangeable-substitutable. Returning true from isKnownEqual or isKnownSubstitutable tells us that other is interchangeable-substitutable without telling us other is also identical-indistinguishable. It's potentially useful and impactful information we are dropping on the floor and I see no compelling we want or need to ship a weaker guarantee for engineers.

2 Likes

You convinced me on that one as well... After all the two arrays might be identical but not necessarily equal or even Equatable... Or the infamous Nan peculiarity. I take my vote back from both isKnownIdentical and isKnownEqual. On isKnownSubstitutable... I don't quite like the sound of it, but on the other hand we want something deliberately scary, so it might work.

1 Like

Again… please keep in mind that choosing isKnownSubstitutable is not just a "more scary" name for the same semantic guarantees… it's a more scary name for weaker semantic guarantees than what is currently being proposed. I do not believe this is the correct tradeoff for us… but the community can continue to share their opinions and LSG can use their best judgement next week.

I searched above and found yours:

One drawback here is we are now making a weaker guarantee. Returning true to indicate other is identical to self is a stronger statement that returning true to indicate we know other is substitutable for self. We're dropping potentially useful and impactful information on the floor with that. This is also inconsistent with all our prior art that indicates isIdentical(to:) should be the correct name for this operation.

Could you expand on that? I don't quite understand what you are saying here. Maybe an example will help.

Expand on what exactly? I see a statement about identical implying substitutable and a different statement about prior art in this space.

Well, not exactly. String is the difficult case here where the implementation details have bearing on the semantics and (as you say) the naming also.

Specifically, we have reserved but unused performance bits in String. Certainly, you can require the semantics you've articulated above for String. However, if the semantics are that we must never return false if self and other are identical in all observable ways, then the implementation must not require these unused performance bits to be the same.

Unless I'm mistaken, that is not the approach taken in this proposal in the form it's being reviewed. It's certainly trickier to implement, requiring that the implementation never be emitted into the client and never inlined, and requiring that it be revisited in tandem with any changes to the type. Also, requiring careful bit-by-bit accounting in this way is—for the same reasons that @lorentey outlined above—not exactly ideal semantics for generalization to end-user types.

The pragmatic approach sufficient for the motivating use case is to compare the reserved bits and return false when they don't match, because we can't know for sure. Again, unless I'm mistaken, it is the approach that's actually presented in this proposal. What that means is that the semantics are such that we are leaving room to return false negatives—that is just another way of stating what we've said above, that representational equality is sufficient but overly strict. (Whether, in practice, it is actually possible to construct otherwise identical values that differ in the unused performance bits without bitwise manipulation is another matter.)

1 Like

Ok… fair enough. Here's a Dirty Trick:

let s1 = "Hello, SE-0494!"
let s2 = s1
print(s2.isIdentical(to: s1)) // true
let s3 = BitterFlipper(s2)
print(s3.isIdentical(to: s2)) // false
print(s3.isIdentical(to: s1)) // false

Here we have:

  1. We know that s2 is identical to s1.
  2. We know that s3 flipped some unused performance bit on s2.
  3. We know that s3 is not identical to s2.
    3.1. You are looking for some observable way that s3 is different from s2.
  4. We know that s3 is not identical to s1.
    4.1. But we also know that s2 is identical to s1.

So we have s3 and s2 that produce an observable difference across some public API:

  • print(s3.isIdentical(to: s1)) // false
  • print(s2.isIdentical(to: s1)) // true

Is this "cheating"? Is this "not fair"? Maybe… but if you were looking for some public API that returns some observable difference between two values that might have flipped some unused performance bit: well… here you go.

@lorentey provided an excellent example of String’s special bits that keep 2 instances identical while these bits differ.
The definitions of these functions in proposal are felt hard to understand, as they seems similar to ==.

I’ve found the following clarifications useful for myself:

if parts of the type's representation do not impact the actual result of any operation (which is clearly the case with String's opportunistic performance bits), then I think it would be quite reasonable to leave the matter or comparing them to the type author's judgment

we do not have a way to mechanically (and quickly) compare the representations of two instances of a copyable type in a reliable way that skips over unspecified padding bytes

This clarify that isIdentical is not about raw representation

This clearly explain the goal

This helps better understand the semantics and the goal

These will be good to mention in the proposal as guideline or warning for those how decide implement these functions to own types.

This helps to better understand both semantics and that isIdentical operation itself is very type specific. All those questions about Double.nan, -0 / +0 and similar become clear.

For me it seems to be good to mention this example for better understanding the nuances and relation between isIdentical and ==.

I’ve recently talked to some folks about this term, more concretely about String._isIdentical. Some of them are frustrated by its meaning and difference with ==, another think about it like “oh, this is like == but faster“.

Personally agree with the term “isIdentical“ as this term definitely says it is not about equality.

The only problem I see is that all these methods will be easily discoverable in public API and misunderstood.
Current is String._isIdentical is not easily discoverable with autocomplete.
However, I think it is enough to add proper documentation where at the beginning it is clearly stated that isIdentical:

  1. Differs a lot from ==. Precisely it is not a fast way to compare values for equality.
  2. Not for general usage / for specific needs

If isIdentical term seems to be misleading for LSG as the result of review, then my choice is isKnownIdentical(to:) – such term sounds far less similar to isEqual for folks from another languages, beginners and those who even hasn’t read the API documentation before using it.
If so, current isIdentical presented in Span types should be deprecated, unfortunately.

1 Like