[Pitch #2] Add `isIdentical` Methods for Quick Comparisons to Concrete Types

I'm looking at the stdlib source code, and I'm pretty sure the definition of this on String could've been copy-pasted to all of the character and view types in way less time than has been spent arguing against it here. They all just store a StringGuts and need to compare the bits.

Edit: The ideal implementation would be to just add an internal isIdentical to StringGuts and use it from all these types that wrap it.

6 Likes

Correct. This is a legit workaround for a missing isIdentical on Array. The bummer is that we are blocked on backdeploying the Array.span property:

The RFC implementation diff implementing isIdentical on Array is using _alwaysEmitIntoClient:

So AFAIK we will have no runtime OS requirements on Array.isIdentical.

The issue I have with this line of thinking is that you're asking users to rely on undocumented, non-obvious, and internal implementation details. Consider the documentation for String.init(Character):

Creates a string containing the given character.

That's it; that's all it says. It doesn't say that the returned String will share the exact same bit pattern or storage as the original Character. Someone could reasonably assume that creating a String from a Character might allocate a new copy of it, because that's what it looks like it's doing. If I have the following code:

func f(_ x: Character, _ y: Character) {
  _ = String(x).isIdentical(to: String(y))
}

As written, that looks like potentially two allocations, in which case a reader would never expect that to return true. The fact that x and y have their underlying storage shared by the new Strings is something most Swift users don't need to think or care about.

Indeed, a future version of the standard library could do something different here if it wanted. I can't think of a reason why it would, but I don't think there's anything preventing it. It shouldn't impact users because there's no current way to access the guts of a String or Character, so it wouldn't be an observable change.

On the other hand, you're proposing to add a new public API that makes those implementation choices observable. For that reason, it's important that both the behavior of the API and its usage be rationally defined. I don't think String(x).isIdentical(to: String(y)) is very rational, because it still requires users to know undocumented and unpromised behavior of String.init(Character) to understand it.

I also don't think that stating "We don't make any promises about what a return value of false from isIdentical means" regarding some of the cases being discussed here is helpful, because by that logic, we could simply implement isIdentical as return false and it would still technically satisfy every correctness requirement described here. But again, that wouldn't be a rational API.

3 Likes

Since isIdentical is being defined on some non-equatable types (such as unconstrained Array), the semantics can't be described in terms of == such as "a.isIdentical(to: b) implies a == b". If a and b are not equatable, does that mean we are allowed to define isIdentical in anyway they see fit and it would be valid? Would it be valid to always return false for Array.isIdentical?

The docs for Equatable go into great detail to explain that conformances should form an equivalence relation, and it even recites the axioms for equivalence relation. I would think at the very least the docs for this method should mention that isIdentical is expected to form an equivalence relation as a baseline to make it a reasonable concept. And further that it is more granular than == when defined on equatable types, i.e. if a.isIdentical(to: b), then a == b.

4 Likes

Further, identity does not imply value equality for all types that conform to Equatable. If I have let a = [Double.nan], then a == a is false, but a.isIdentical(to: a) is true. This (and the other examples people have mentioned) are why actually precisely specifying the semantics of this operation is so important.

6 Likes

Well ackshually, it's worth pointing out that Array already makes use of a shared buffer fast path and evaluates a == a as true, and changing that behavior has been deemed a won't-fix.

6 Likes

Right, my point here is that the criterion for inclusion in this pitch can't be that there aren't public APIs already that allow end user implementation. In fact, for most if not all concrete stdlib types at issue here, one could do a reasonable job using unsafe bit-casting APIs (plus or minus some bit masking).

Moreover, the implementation effort required for any concrete type that can reasonably implement to this API should be relatively trivial even when it relies on private, safe APIs. This would be more or less by construction, as the whole point of the exercise is that a reasonable implementation requires some performant way to rule two values in as identical; as that operation becomes more elaborate to implement so too would it raise concerns about getting bogged down in performance. Which is to say that for this particular API, I'm skeptical that there can be a concrete stdlib type which might be reasonable to include in the proposal theoretically but which would require an impracticable amount of implementation effort.

With respect to the String.isIdentical operation, the feedback from the steering group (as I understand it) is that we'd like serious consideration about how the semantics of that operation can guarantee more than that: we think that there are behaviors other than just true-means-identical which aren't implementation detail but essential for users to understand what they can expect to do and rely on. Particularly, as has been said, since most strings are small (at least in certain applications), whether someone would reach for this operation in their code can turn on whether it always returns false in that scenario.

This also ties somewhat into the feedback about presenting concrete, real-world algorithms that can benefit from this API. Even if we eventually find some protocolization of this functionality that makes fewer guarantees, for concrete types where there's only one or a narrow spectrum of reasonable implementations that guarantee certain additional behaviors, it can be important for a performant API to document them where it makes a difference for relevant use cases.

2 Likes

Why having two generic types here?

I'm using this for in-house project:

func ==== <T> (lhs: T, rhs: T) -> Bool {
   // a simple memcmp based implementation
}

Works like a charm although false negatives due to different padding bytes are expected to happen every now and then.

OMG.. "Value semantics" they say..

let a = [Double.nan]
print(a == a)                       // true
print([Double.nan] == [Double.nan]) // false

Not sure if this is what @allevato meant or something else: say I have two values, and I already know the hash of the first one. Then I it turns out that the two values are identical I don't have to spend time calculating the hash of the second one - it is the same.

2 Likes

Other span types?

  • MutableSpan and MutableRawSpan
  • OutputSpan and OutputRawSpan
  • UTF8Span

Other view types?

  • Dictionary.Keys and Dictionary.Values
  • Substring.*View

Other types?

  • CollectionDifference
  • RangeSet and RangeSet.Ranges

I can help with the implementation and tests. I don't think benchmarks are needed.

Could all methods share the same nonspecific doc comment? For example, by writing "instance" rather than "array" or "string".

1 Like

Other span types?

  • MutableSpan and MutableRawSpan

  • OutputSpan and OutputRawSpan

  • UTF8Span

We elected to pass on proposing isIdentical(to:) on the mutable variants (including OutputSpan/OutputRawSpan) because they require exclusive access to their memory. This means that uncovering an identical access to the same memory would be an exclusivity violation. In practice, then, any answer of true would be very bad news. They do all have either a span or a bytes property, so those could be used to ascertain an absence of bad news.

We seem to have forgotten to add isIdentical(to:) for UTF8Span. We can use its span property in the meantime.

5 Likes

At this point… I think it might be a good idea for us to slow down, catch our breath, and maybe use this time to reset or reboot what direction we want to take these conversations.

What is it we want? I'm not talking about "tactical" things like "this engineer wants a different name" or "this engineer wants a different type" or "this engineer wants a different documentation comment". At a higher level… what is it we both want out of this process?

I think if you looked at what we are both trying to do here… we would have a lot in common:

  • We want Swift to be safe.
  • We want Swift to be robust.
  • We want Swift to help engineers make great products.

Of course… we may very well disagree on details about how we want to accomplish those goals. The good news here is that I can continue assuming good intent. When you and others leave feedback that I disagree with I can continue assuming you are acting in line with our shared goals to make Swift as good as it can be. At that point I would hope we all can reciprocate those vibes: if I disagree with your feedback you can continue assuming I am acting in line with our shared goals to make Swift as good as it can be.


With all that being said… I think we can try and think a little more tactically about what is left blocking this pitch from moving forward. There has been a lot of recent activity here this week… which is good! I want to try and collect what seem to be the most important "buckets" of feedback I am seeing:

  • Defining and documenting semantics. The original proposal offered one generic "template" of header doc comments on String and left additional improvements on that template up to the library maintainers that submit and review the concrete implementations on GitHub diffs. The feedback is asking for some more upfront thought here in the design review.
  • Adding new types to the proposal. The original proposal focused on what we saw as the highest impact types that would benefit in standard library. The feedback is asking for more types from standard library to be included in this proposal. None of the feedback I have seen here is asking to remove any of the types from the original proposal.
  • Changing the name of isIdentical.

When I look at these in the other direction… I can begin to see the order that I believe we can have the most immediate impact by focusing our discussion:

  • Changing the name of isIdentical. Debating the name of this function is not impactful if we don't reach any agreement on what types are going to have this new function as part of this proposal.
  • Adding new types to the proposal. Because we all agree that the types from the original proposal should be included, debating what new types might be added is not impactful if we don't reach any agreement on how the semantics of the functions on the original types should be defined and documented.
  • Defining and documenting semantics. This to me looks like the most important blocker to settle on. Can we agree what semantics are defined and documented by this proposal? My advice is we start here — working from the types in the original proposal — and try to keep that discussion going without getting too distracted right this moment on what new types might be added or what a different name of this function might eventually be.

Here are some thoughts I have collected for a discussion about semantics:

The previous pitch introduced a Distinguishable protocol with an isIdentical member. We defined a set of types from standard library that would adopt Distinguishable. Over time we dropped the Distinguishable protocol idea and went with isIdentical methods on concrete types without a protocol.

I'm not sure I'm ready to propose going back to a Distinguishable protocol… but there might be value here in expanding our proposal with something like an "informal protocol". This attempts to provide some kind of "unified" value statements about what any concrete type that implements an isIdentical method publishes as a semantic contract.

The original proposal coupled the semantics of isIdentical to Equatable. But this can introduce ambiguity:

  • String always adopts Equatable.
  • Array sometimes adopts Equatable.
  • Span never adopts Equatable.

If we can agree that there should be some kind of basic and consistent semantics that all isIdentical implementations should communicate maybe then we can turn our attention to the concrete types from the original proposal and how those concrete types might then choose to refine those semantics.

I can offer to update the proposal to include the header doc comments on every type we propose. While I would normally expect review of header doc comments to take place on diff reviews among the library maintainers, I can offer to be more flexible here and open up the design review to include header doc comments. If an engineer has improvements to suggest how we communicate the semantics of isIdentical on String or Array then we can have that discussion here.

One issue here is how the isIdentical semantics on an Equatable type might introduce any future problems if that type also adopts Hashable. I'm not completely sure I understand what was being communicated in the feedback there… I would like to hear more about this if anyone can help with some more ideas or examples there.

My preference right now would be to focus our collective efforts on trying to work together through these blockers on our semantics before we move on to discussing additional types or different names. I might choose to prioritize my time responding to feedback that focuses on our semantics while this remains unresolved.


Please let me know if these ideas work for you as a general strategy and framework to help keep us moving forward. Thanks!

1 Like

The idea of having a focused discussion of each of these in turn makes sense to me, if you want to manage it that way.

So you'd like to start with the semantics, and specifically to what extent we should be specifying the behavior of isIdentical beyond "it's allowed to return false if the representations are really different"?

1 Like

So this is a fair point. The initial types proposed all do have the ability to return meaningful true values… but I'm not opposed to updating the proposal to indicate that library maintainers looking to add their own isIdentical methods should prioritize meaningful concrete implementations that are not just always returning false: which might be semantically "legal" under our contract but is effectively a useless API for product engineers.

1 Like

These are fair points. I can offer to open up more conversation about the implementations on these concrete types as a design review.

One potential side effect here is that if we move a discussion about exposing implementation details of a concrete type out of diff review and up here in design review I might need some more help from the library maintainers that are actively building and reviewing code on that concrete type to also help here in design review and help either fact check what I might assume to be true about these types or fill in missing information that I can't figure out for myself.

I'm not sure I would go so far as to say I want to discourage engineers from posting here with comments on additional types or changing names. But I do feel like I am personally not going to prioritize actively engaging with that feedback at this moment in time while I believe the more urgent priority for us remains unresolved.

Correct. I can think through this some more but my basic ideas so far were in the previous comment. And anyone that wants to expand on their previous feedback is welcome and encouraged to respond.

What I'm wondering is if there is some sort of "clean" way to attempt to build a framework here that we can not only define precise semantics of isIdentical on concrete types like String… but that framework is then flexible enough that a future type like StaticString could potentially build isIdentical and still fit the same sort of overall design. So there is some overall consistency between these implementations even if we also want individual types to perform and communicate different refinements on what these "implicitly inherited" semantics communicate. And if all that also conceptually "backdeploys" to the existing isIdentical implementation on Span then that would be a great side effect.

I also want to try and think forward because I also want this to ship on adjacent libraries like Foundation:

And Collections:

So what can we build here that is "precise" enough to resolve the incoming feedback about semantic ambiguity and also "flexible" enough to "fit" on new types that might not even exist yet? I don't have an easy answer yet but I can try and put some more thought to that.

I think we’d at least like to have the property that value-copying something and then checking isIdentical should always return true.

3 Likes

That would require an expensive implementation for non-CoW types like InlineArray, effectively rendering it inapplicable. Perhaps a decent tradeoff?

I don’t think there’s any point in providing this on a type like InlineArray.

1 Like

Well… maybe. In a general sense… I wouldn't say I would be opposed to promoting this requirement (which I see as a refinement on a requirement that self is always identical to self) to library maintainers as a sort of "north star" axiom that we strongly recommend to either adopt on their isIdentical implementations or provide some kind of clear documentation to product engineers why this axiom would not be appropriate to enforce.

But then things get a little dicier if we are proposing to move up that axiom to what are planning for a potential "meta contract" that we expect all concrete isIdentical implementations to conform to.

Consider this example:

So here we have:

  • A container that conforms to Equatable.
  • A container that should return true from == when all elements also return true from ==.
  • But this container returns true from == when all elements do not also return true from ==.

So… if this container adopts isIdentical on the condition that returning true from isIdentical must return true from ==… we sort of at that point "lock in" this weird situation where if we wanted to fix this bug at some point in the future we can't because at that point we would have self being identical to self and self not being value-equal to self. If we then attempted to solve for that by having self not being identical to self we would then be breaking the axiom we codified up in the meta-contract.

It's one reason why I believe my preference here is any such meta-contact that applies to all concrete implementations is as weak as possible. The "sub-contracts" that are defined directly on the concrete implementations can then be free to refine the meta-contract as appropriate.

Correct. Here is our opinion on that: