[Pitch #2] Add `isIdentical` Methods for Quick Comparisons to Concrete Types

We can't enforce it, but plenty of standard library collections/algorithms provide as part of their documentation the expected time complexity. For example, a RandomAccessCollection that has O(n) index operations would violate the protocol's semantic requirements in a way that the compiler can't detect.

I wonder if we can tighten what it means to be "meaningfully faster". Would it be reasonable to say that the implementation must be O(1), or are there expected implementations where that wouldn't be feasible? If so, would it be sufficient to say that the time complexity of isIdentical(to:) must be better than that of == by more than just a constant factor (e.g., O(log n) for an O(n) algorithm, O(n) for an O(n2) algorithm, etc.), or is that not a strong enough guarantee?

1 Like

I’m not sure it makes sense to compare these operations to the way we document the Collection hierarchy, because the performance characteristics there are designed with the expectation that there will be many diverse conformers as well as diverse generic implementations which may want to rely on performance characteristics.

I worry that we’re getting a bit bogged down in trying to figure out what the fully abstract guidance we would give to clients if we were designing this as a protocol requirement such that prospective conformers would know when it was appropriate and generic code would be able to meaningfully rely on the guarantees. But that seems like overkill for what we need to do here.

5 Likes

I don't really understand what this comment is asking for. This comment is implying that we should determine our desired semantics from "real-world motivating examples". But this comment is specifically directed at my attempt to begin to define semantics across any type that implements isIdentical. I can't give you real-world examples… it's already a discussion about an abstract type.

That being said… the clear prior art in this space is Span.isIdentical. Span is not Equatable. Here is all SE-0447 tells us about isIdentical:

Identifying whether a Span is a subrange of another:

When working with multiple Span instances, it is often desirable to know whether one is identical to or a subrange of another. We include functions to determine whether this is the case, as well as a function to obtain the valid offsets of the subrange within the larger span:

extension Span where Element: ~Copyable {
  /// Returns true if the other span represents exactly the same memory
  public func isIdentical(to span: borrowing Self) -> Bool
  
  /// Returns the indices within `self` where the memory represented by `span`
  /// is located, or `nil` if `span` is not located within `self`.
  ///
  /// Parameters:
  /// - span: a span that may be a subrange of `self`
  /// Returns: A range of offsets within `self`, or `nil`
  public func indices(of span: borrowing Self) -> Range<Index>?
}

Across 30 comments in a pitch thread, 36 comments in a second pitch thread, and 78 comments in a review thread, not one engineer asked for real-world examples to justify the inclusion of an isIdentical function and not one engineer asked for an additional discussion to clarify the semantics of the isIdentical function.

What I am not saying is that every attempt to ship a new type in standard library that implements isIdentical and is not Equatable must not present real-world motivating examples during design review. That is a decision that can take place between a proposal author and LSG. If LSG makes the decision that a new type should present real-world motivating examples during a proposal review then the proposal author can present those examples. I have zero problem with this.

What I am saying is that I am not persuaded that we must enforce here β€” in this proposal β€” that every type that implements isIdentical and is not Equatable must present real-world motivating examples. This clearly is not a requirement that must always be enforced β€” just look at Span.

All the concrete types in this specific proposal are either always Equatable (like String) or conditionally Equatable (like Array).

While Span predates this proposal I do believe it would be desirable if the discussion of what "general" behaviors we want and expect isIdentical to adopt "in the abstract" could conceptually "backdeploy" to what already shipped on Span. But I am not 100 percent committed to this POV and I can drop this if it slows us down from shipping this.

The problem here is that for a type that is not a Collection there's no clear concept being communicated by what O(n) really means… would that be n in terms of bytes? I don't think this would always be a very impactful thing to try and define at the abstract level we are talking about for any isIdentical method.

AFAIK there do not currently exist very many public examples of "fixed-width" copy-on-write data structures in standard library. The CoWs that do ship are optimizing for "variable-width" data structures: collections where you keep some number of things.

But fixed-width CoWs are legit tools to optimize for performance. The memory footprint of a fixed-width CoW could be 80 bytes for data plus 8 bytes for a pointer on a 64 bit architecture. A test for value-equality might need to check all 80 bytes. This is a constant-time operation. An isIdentical test against the heap storage only needs to check 8 bytes. This is a constant time operation that is meaningfully faster than a check for value equality. The fact that isIdentical is "constant" time does not really communicate very much because the == operator is already constant time. What matters to product engineers making use of this type is that isIdentical is meaningfully faster: an order of magnitude faster.

The concrete types in this proposal are all variable-width CoWs where a == operator performs work over some N elements. And our concrete types do tell product engineers their isIdentical methods are constant-time. But I am currently not persuaded that we must enforce a constant-time isIdentical for what we are proposing at this "meta" level semantic contract. What matters is that isIdentical should always be meaningfully faster.

1 Like

Very fair point and crosses out the first two options from the above list:

how this feature will work for structs.

  1. the feature (whatever the syntax is chosen) works out of the box
  2. I'd need to conform S to some protocol, then it works regardless of the fields

I do not understand why we don't support custom structs in this pitch e.g. via (3) and (4) above:

  1. I'd need to conform S to some protocol, but it will compile and work only if the fields conform to the protocol
  2. I'd need to provide my own implementation of this feature.

The mentioned S is made out of "is identical" supporting types:

struct S {
    let items: [Double]
    let value: String
}

so the check for is(known)identical could be quick:

func isIdentical(to other: Self) -> Bool {
    items.isIdentical(other.items) &&
    value.isIdentical(other.value)
}

and meaningfully faster than comparison, just elaborate to write manually for a struct with lots of fields. If we provide autosynthesized versions of EQ and hash for user struct types I see no reason not to do this for isIdentical.

Or do you merely mean that custom types support (if any) are better discussed in a separate pitch?


IRT not having a protocol for isIdentical – well, in that case we'd have that implemented by third party developers themselves:

protocol HasIdentical { ... } // "mine"
extension String: HasIdentical {} // so this is not retroactive
extension Double: HasIdentical {}
extension Array: HasIdentical {}

Having the protocol in the standard library would effectively stop that infestation.

Tera, do you understand that β€œcustom type support” in this instance simply means defining one’s own method named isIdentical and then calling it wherever you want to use it? Absolutely nothing about this proposal is builtin, implicit, or even metaprogrammed.

4 Likes

Except for StaticString, StaticBigInt, and KeyValuePairs β€” if you decide to include them.

StaticString is interesting because the compiler can deduplicate their storage, but I don't know if that behavior should be included in tests for identical strings.

StaticBigInt would probably require a new Builtin.isIdentical_IntLiteral(_:) function (implemented in C++).

2 Likes

I would be fine with walking back a discussion about what these abstract requirements might look like and focus just on the concrete types if LSG confirms that agreeing on abstract requirements would not need to block this proposal from moving forward.

@john_mccall… this actually got me thinking about some earlier feedback from LSG:

Was my takeaway from that feedback to be that every concrete type in this proposal must have at least one corresponding "concrete" and "real world" algorithm example presented as documentation?

My takeaway was that the "intro" portion of the proposal should have one (or maybe two or three) less abstract examples for introducing the concept to engineers reading the proposal… but those examples do not need to include every concrete type we are proposing for isIdentical. Is that the correct idea there?

I'm not speaking for the LSG here, but I think some consideration about the common properties we'd want isIdentical to have across types is helpful for guiding which types we apply it to in the standard library and what implementation we use and document for those types. I don't think we need to overly concern ourselves with nailing down precise semantics that will serve as a guide to the universe of Swift programmers for when they should add isIdentical to their own types, or which would be applied to a hypothetical Distinguishable protocol which is decidedly not being proposed. That's all I was trying to steer us away from.

3 Likes

To me this feels unnatural, as isIdentical feature is very similar to Equatable / Hashable for which we provide auto-synthesising. But.. even without autosyntesising and/or an established protocol – it would still be a step in the right direction and further improvements in the area could wait for another pitch and another day.

1 Like
Discussion about Infix Operators and Symmetry

Just an observation here specifically WRT the discussion about enforcing semantics across all concrete types:

So the "unstated implication" here is that if we defined this operation through === we are limiting the flexibility that library maintainers might need.

Suppose for example a library maintainer had some code path where they needed to forward === through to Span.isIdentical:

struct S {
  static func ===(_ lhs: Self, _ rhs: Self) -> Bool {
    guard
      lhs.hasStorage,
      rhs.hasStorage
    else {
      return lhs.span.isIdentical(to: rhs.span)
    }
    return lhs.storage === rhs.storage
  }
}

So it's not completely idiomatic here for S to "promote" lhs to be the receiver of the isIdentical operation on span. Unless maybe if Span.isIdentical gave us the formal guarantee of symmetry… which I do not currently see guaranteed in the documentation.

The option then might be to enforce that === must also be an symmetrical operation such that a === b is the same as b === a. But I'm not yet persuaded that symmetry must be enforced at an abstract level across all these operations. I have no problem with library maintainers of a concrete type making that decision for themselves when appropriate to do so. I'm just not convinced we need to always enforce that everywhere.

…

But now that I go back and think through again… there must be some kind of convention here that infix operators do not take labels and still distinguish between arguments. How else could a < implementation not just flip the arguments and return the wrong answer?

It looks like the declaration of === does not formally confirm Reflexivity, Symmetry, or Transitivity:

So unless maybe that is hiding out somewhere else… we do not explicitly guarantee that x === y must imply y === x. Which is surprising…

I do think that even without this explicit semantic guarantee any product engineer that looks at === probably expects it to respect Symmetry. And if we defined a new set of === operators on these types I think it would be fair to assume those project engineers also expect Symmetry there.

It's not formally defined, but it's trivial since === is literally just pointer equality.

However, I would be strongly opposed to repurposing === for what's being proposed here. Today in Swift, === has a very specific meaning, which is that both operands are reference types and they are referentially the same object. It would be really harmful for fundamental value types like Array<Element: Equatable> and String to implement both == and ===, because that would become a really easy source of subtle typos/bugs. It would also be a rough teaching experience to have to explain when to use == or === for a specific set of value types ("you want ==, === does handwavy things that you don't need to worry about right now..."). It's easier to teach today because the latter just doesn't compile, so you don't have to mention it until you start teaching reference types.

(Sure, the same thing about confusability of the operators can happen with class types today, but it's much less common for classes to implement Equatable than for value types.)

This is fundamentally a low-level performance-driven operation and its usage should be very obvious at the call site, not given the privilege of an operator that would make it blend in easily with other common value-type operations.

12 Likes

There is not consensus going into this that Span.isIdentical should be aligned with the API pitched here. We might ultimately converge on that, but it is not something to be assumed.

As you point out, a single-sentence doc comment fully specifies both the use case and the semantics of Span.isIdentical to the entire satisfaction of the community and steering group, and that is not the case here.

(To be explicit, Span models a safe, non-owned, non-escaping view of memory, and Span.isIdentical is the safe counterpart to an unsafe buffer pointer operation, which fixes both the desired use casesβ€”i.e., to supplant uses of the unsafe operationβ€”and the semanticsβ€”i.e., that of the unsafe operation it is supplanting.)

3 Likes

Look… I don't know you… I haven't met you… I haven't worked with you… but I like you. You are a good engineer. You are a smart engineer. And any attempt on my part to push back on feedback delivered by you should please not be interpreted as being inconsistent with those core beliefs I bring to the table.

This specific sub-discussion is trending in a direction that is making it challenging for me to continue working to a shared understanding. The feedback and requests are either confusing to the point I can't make an effective judgement on the potential impact of continuing to work to a shared understanding, or not confusing to the point I can make a judgement that this is not the most impactful place we should be focusing our time.

The comment that seems to be blocking us here:

implies that the request here is that a "meta-contract" and informal "pseudo-protocol" documented in our proposal should include real-world motivating examples not just for the isIdentical implementations on our concrete types. The incoming request is for real-world motivating examples on the abstract type.

We've already heard from engineers on LSG that this proposal should not be considering a generalized API or protocol:

It's confusing to me when a subset of feedback coming from LSG indicates we should not be focusing on a generalized API and another subset of feedback coming from LSG indicates that focusing on a generalized API should be a blocking requirement before we can ship this API on concrete types. Before I attempt to keep thinking through the potential impact of focusing on this… could you help me understand how and why those two sets of feedback can be consistent with each other?

So this is a fair point. If formalizing the meta contract that all concrete types must follow was a "secondary" goal of this proposal… conceptually backdeploying that contract back to Span would be β€” at best β€” a "tertiary" goal.

But whatever the relationship will be between a meta contract and Span is not important to the specific point that was being made in my response. The prior art from Span tells us that at some point in the future an engineer could propose and ship a new value type on standard library that is not Equatable and ships with isIdentical and "a single-sentence doc comment [that] fully specifies both the use case and the semantics" is enough to unblock LSG from approving the proposal. The implication here is that our meta contract does not need to produce real-world motivating examples here in this proposal. It is enough for us to say in this proposal that an engineer proposing isIdentical on a new type in the future may choose to document their code with the use case and semantics. We don't need to overthink this.

Continuing with the feedback: [emphasis added]

In reference to "a single-sentence doc comment" that "fully specifies both the use case and the semantics" being "not the case here"… is this feedback directed at the current doc comments being proposed on the concrete types? If so:

I believe I have already acknowledged the feedback here and communicated that I would update the proposal to incorporate header doc comments in design review. This is a known TODO and I don't believe we need to continue a debate reinforcing that the LSG is looking for more clarity on that.

If that feedback was instead directed at missing detailed semantic guarantees on the abstract type… then why is this so important? I can appreciate that this feedback communicated "a what"… I am now looking for "a why". I am looking for a feedback to persuade me why a continued sub-discussion to reach more agreements on abstract requirements on concrete types that are not included in this proposal and might not even exist yet is an impactful sub-discussion for us to continue prioritizing at this time.

1 Like

I could use some help from LSG understanding these requests:

It's unclear to me to what extent the behavior of "smol" String values should need to be documented as a "public" semantic contract.

Investigating the "threshold" at which a String value becomes "large" brought me here:

Which indicates 16 bytes is no longer small. But that's 64-bit… here is the threshold for 32-bit:

But wait… there's an additional threshold to document when Android is 64-bit:

So if the incoming feedback from LSG is that the public header documentation on String.isIdentical should be communicating the semantic behavior of small values… are we really asking to reveal these details to product engineers? I would normally make the argument this all falls under the umbrella of "implementation details"… but:

Suppose we didn't reveal the specific byte thresholds β€” across different architectures β€” that a String becomes large. Suppose all we revealed is that "small strings" below some opaque threshold "can be compared by value instead of by reference"? At that point… I fail to see why this communicates any value to product engineers. What product engineers care about is our semantic guarantee that returning true from isIdentical implies == returns true. Breaking down exactly how and why "small" values have a different code path to return true from isIdentical does not really seem to be an impactful use of our time. When isIdentical returns true it implies that == returns true. And those semantics are respected if we compare two strings by value or by reference. Product engineers don't need to know or care what's going on internally.

Please help me understand what I might be missing here. And if anyone is free to volunteer to help guide me through the String implementations to highlight any important edge cases I should know about here that would be very helpful. But from what I can tell this conversation about documenting the behavior of small strings does not deliver impact. There's "no there there".

This might be a clue where the feedback from LSG was coming from:[emphasis added]

Is the implication that when two strings are small they always return false? Even for smol.isIdentical(to: smol)? Nothing I have seen so far in my investigation indicates this is going to happen. Two small strings are compared by value. A subtle implication here is that for two small strings the isIdentical function is essentially returning false to indicate "must not be equal" as opposed to "we don't know"… but I do not believe this is important to document as an official supported use-case of the API.

I could also use help understanding what "a small string is never identical to a string in a buffer" is requesting.

I'm trying to work through the LSG feedback asking for more details about "summary bits" like "all the characters are known to be ASCII". I don't understand what LSG is asking for here.

Does anyone have a test case they could volunteer to share using the production swift toolchain that maybe shows off this specific use-case? Where summary bits should β€” or should not β€” affect the result of isIdentical? The production toolchain ships the _isIdentical underscored method that performs the same work we propose. So any test case using String._isIdentical would highlight an identical issue that would show up with our proposed method that ships without an underscore.

A test case against _isIdentical that also gives me a working example of "a small string is never identical to a string in a buffer" would also be very helpful.

Here are some tests I am running locally against production swift:

All of my feedback is on the topic of concrete types. If you have some type in mind that never conforms to Equatable but should have this API, then we need to know what one can usefully do with that API on that type. Whatever API is added to the standard library will need to have thought-through semantics. What those semantics are should look to, and be well justified by, what we want users actually to do with the API we add.

1 Like

No, I am arguing that such behavior would be quite unfortunate actually, and therefore the proposed API should not do that, and crucially it should also guarantee that it doesn't do that.

What is the behavior of the proposed String.isIdentical when two strings differ only in the "known ASCII" bit? Why should it have that behavior and not some other behavior?

1 Like

SGTM. Let's plan to keep the discussion about an abstract definition of isIdentical lightweight in the proposal. If and when we propose a concrete type that is not Equatable we can make sure there exist some sensible semantics and legit use-cases that would make that impactful.

So please correct me if I am wrong here… but the public documentation of String makes no mention to product engineers about the existence of different code paths to store small string values inline:

Please share any additional documentation I should reference. But if we do not already tell product engineers that small string values are stored inline… why would we then have to introduce this concept to product engineers for isIdentical?

I am afraid I do not have the ability to give you a good answer now because I am not sure I understand the question. Attempting to compare "cafΓ©" and "cafe" from isIdentical must return false: these two strings are not equal by value.

The feedback from LSG referenced slices. Is that the clue? Do I attempt to slice "cafΓ©" down to "caf" and that leads to some undetermined state where one slice believes it is ASCII and one slice believes it is not ASCII?