[Pitch #2] Add `isIdentical` Methods for Quick Comparisons to Concrete Types

xwu · August 21, 2025, 2:06am

Proposals are expected to comprehensively describe the observable behavior of any APIs added, enough that someone else can implement them. This may or may not be straightforward to do referencing only documented, user-facing details of the Swift language. Public documentation will describe only the subset of observable behaviors that are guaranteed to end users and explainable in the terminology of the surface language. It is neither necessary to limit the detailed design of your proposal to that subset, nor would it be sufficient detail for all proposals, unless you intend every observable behavior to be an API guarantee that can be explained using only existing public features and APIs.

Now, it is also important to reckon with how we'd describe and teach any feature to end users. As it happens, in this case, when you document that x.isIdentical(to: x) for all x, then you have guaranteed that small strings will compare their inline storage in the implementation. So this is not a problem—though again, to be clear, that particular piece of feedback was about the observable behavior and not the documentation of it.

No, you're being asked to consider what happens when two strings that both have the value "cafe" are queried for identity when one of these values has the "known ASCII" bit set but the other doesn't.

tera · August 21, 2025, 10:30am

I was able making two strings (short or long) different only in their isAscii bit via bit fiddling, although it would be great to see an example how do it via legitimate means.

In my attempts of doing so via legitimate means the short string's isAscii was always up to date as if the string was revalidating it after every change and for long strings a change that would affect isAscii bit was also triggering COW, in which case isIdentical would have to be false due to guts pointer difference.

Regardless of how to get into that situation, I'd say that isIdentical should return false for strings which are bit different in any of their 16 storage bytes (including isAscii bit):

for simplicity of implementation
it's a safer choice, e.g. some code might assume that if isIdentical is true then it could use one string instead of another, but not be prepared to see a sudden change of isAscii bit.

xwu · August 21, 2025, 12:35pm

Can any non-mutating operation on a string change its "known ASCII" bit?
Do the same properties apply to other flags (we should review them comprehensively)?

Is there any safe API that produces different observable behavior with respect to any trailing padding bits (e.g., in the small string representation)? If so, should those bits be masked or not masked for the purposes of this operation?

Does anything change with respect to Substring?
(On that note, while this proposal includes the custom slice types for String and Array, is there reason to add this API for the slice types of Dictionary, Set, and ContiguousArray? If so, are there any custom considerations there?)

vanvoorden · August 21, 2025, 7:04pm

As of right now I don't have very many great answers here. I can continue trying to open up the code to try to learn what was happening… but I could also really use some extra help from anyone that wants to dive in and volunteer to help work on answering these questions:

Our existing "fast path" algorithms compare the rawBits directly.^[1]^[2]^[3] Which I think implies we are calibrated to compare flags like isAscii. For the value equality checks… it might not be so bad because if we fail the fast path we will always fall through to the conventional O(n) algorithm to confirm value equality. But was the check directly against rawBits without controlling for differences in flags like isAscii made on purpose? Was this intentional? Was there some historical reason we explicitly wanted to check isAscii during a fast path identical check?
How and where do we codify what it means for two String values to be "equal by value" and what do the tests look like where we confirm that flags like isAscii would never affect the return value from the == operator?
This might all be academic if there do not exist any public APIs that can even get us into this state. What is the repro or the test that produces this outcome? Is the clue that we need to slice down a String that is not ASCII to two Substring values that contain no characters that are not ASCII… but it is "undefined" which Substring will carry the isAscii flag?
- If we did find a code path that gives us this outcome with public APIs what is the specific semantic guarantee made by String that is broken if we return false from isIdentical when isAscii is not equal? Where is that semantic guarantee documented?

This is probably the best idea if we do choose to rethink fast path checking across all code paths and then improve our early exist fast return from the existing value-equality operations. We could even improve performance for everyone by reducing the amount of unnecessary O(n) comparisons that take place for two strings that should be considered "identical" but differ in bits that should not be affecting the fast path.

This code looks pretty complex… I'm not completely confident I would be the best engineer to write that new implementation. I would absolutely like some help to figure out what is possible and where all the trapdoors and underwater stones are. And no matter who writes the code this should ship with a lot of unit tests if we refactor the way String checks for value equality everywhere.

vanvoorden · August 21, 2025, 7:06pm

I could also use some more help from LSG to understand this request if anyone is free to explain this one again to me.

tera · August 21, 2025, 7:20pm

if you compare all 16 bytes of storage to implement isIdentical that would happen automatically - inline string will have isInline bit set (one of the high bits) and the "out of line" stored sting will not have that bit set, hence the two will always be different, so I wouldn't worry about it other then perhaps mentioning this fact in the proposal.

As for those other interesting questions - my position is unchanged: disregard all those nuances and compare the whole N bits of the type storage (e.g. 16 bytes as per MemoryLayout<String>.size). My hypothesis is that the padding bytes (e.g. of a small string representation) would not be a frequent problem (those who thinks otherwise please provide a counter example) and the worst outcome is that isIdentical will be returning false in some extra ~1% of real world cases.

The less custom this machinery is – the better. And if we ever want to change that, say to be more smart in treating the padding bytes – we could change that later on, as it will not be a breaking change to go from isIdentity returning false in one version of Swift to true in the next version of Swift for two differently padded but otherwise identical strings, or another mentioned things. All IMHO.

scanon · August 21, 2025, 7:52pm

Well, that's certainly true, because the small string representation has no padding.

tera · August 21, 2025, 7:58pm

I mean those bytes shown here as X:

"123" ->
31 32 33 XX XX XX XX XX XX XX XX XX XX XX XX e3
|...     |...                                |...
payload  padding                             flags and length

vanvoorden · August 21, 2025, 8:05pm

I think for this — and for the buffer question above — the POV from LSG is not so much "do there exist false negatives" but to what extent do false negatives break semantic guarantees: either the existing semantic guarantees on String or the new semantic guarantees on isIdentical?

For example:

a.isIdentical(to: b) == true implies "there is no way to distinguish between a and b".
- a.isIdentical(to: b) != true implies "there is some way to distinguish between a and b".
Does there exist any public API transformation f on String such that the public semantic guarantee of f is that f(a) returns b such that "there is no way to distinguish between a and b".
- Does f have the ability to remove the isASCII flag bit from a before returning b?
- Does f have the ability to transform a that is inline and return b that is not inline?
Do we then have a situation where f(a) returns b which must be "indistinguishable" according to the semantic guarantee of f… and b must be "distinguishable" according to the semantic guarantee of isIdentical? This is what we want to figure out here in design review.

scanon · August 21, 2025, 8:08pm

Those bytes are always zero, and are part of the representation, they are not padding.

vanvoorden · August 22, 2025, 9:26pm

@glessard would you be available to help volunteer with an implementation if UTF8Span was added to this proposal?

vanvoorden · August 22, 2025, 11:58pm

I opened a task issue on my evolution fork if anyone is interested in a more targeted discussion about potential new types to add to the proposal. Please leave a comment here or on the task issue if you are interested in volunteering to help write implementations. Thanks!

github.com/vanvoorden/swift-evolution

[RFC] Add `isIdentical` Methods to Concrete Types

opened 11:56PM - 22 Aug 25 UTC

vanvoorden

### Draft Proposal https://github.com/swiftlang/swift-evolution/pull/2875 ### …Current Progress | | Impact | Volunteer | Diff | --- | --- | --- | --- | | `String` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82055 | `Substring` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82055 | `Array` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82438 | `ArraySlice` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82438 | `ContiguousArray ` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82438 | `Dictionary` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82439 | `Set` | ★ ★ ★ ★ ★ | @vanvoorden | https://github.com/swiftlang/swift/pull/82439 | `Character` | --- | --- | --- | | `String.UnicodeScalarView ` | --- | --- | --- | | `String.UTF16View ` | --- | --- | --- | | `String.UTF8View ` | --- | --- | --- | | `StaticString ` | --- | --- | --- | | `StaticBigInt ` | --- | --- | --- | | `KeyValuePairs ` | --- | --- | --- | | `Substring.UnicodeScalarView ` | --- | --- | --- | | `Substring.UTF16View ` | --- | --- | --- | | `Substring.UTF8View ` | --- | --- | --- | | `Dictionary.Keys ` | --- | --- | --- | | `Dictionary.Values ` | --- | --- | --- | | `UTF8Span ` | --- | --- | --- | ### Request for Help Please leave a comment if you can help provide any more details on these types that are not yet implemented: * Are we missing a type? * What else might be appropriate for this proposal? * What is the impact of this type? * How important is it we ship this type *as soon as possible* on Standard Library? * Are you willing to volunteer to write the implementation? * Are you also available to answer design questions on this type from engineers during proposal review?

vanvoorden · September 2, 2025, 8:58pm

The new version of the proposal is a significant rewrite that I believe addresses many of the outstanding questions from reviewers so far:

Motivation
- Greatly expanded real-world examples
- One command line utility and one SwiftUI app
Prior Art
- Discussion of existing "fast path" value equality
- Discussion of existing alternatives to value equality:
  - withUnsafeBufferPointer
  - withUnsafeBytes and memcmp
  - Span
Proposed Solution
- Discussion of "abstract" axioms and principles
- Discussion of types that are not Equatable
- Application to previous real-world examples
Detailed Design
- All concrete types are presented
  - Header documentation comments include important axioms and principles
Future Directions
- Additional concrete types from Standard Library
  - Could ship in a follow up "second round" proposal
Alternatives Considered
- Exposing Identity
  - Discussion of why we do not want to escape identity
- Different Names
  - Examples of different names suggested in pitch review
- Generic Contexts
  - Discussion of why we are not proposing a protocol

I believe the major outstanding blockers from LSG were:

Determine what to do about String and Substring:
- Should "summary bits" like the isASCII flag be included when we compare for identity equality?
Determine what types must ship in this proposal:
- What types from Future Directions must we include here in this "first round" proposal?

Looking closer at those last two… I would like to hear back more from LSG for:

Do these still remain blockers to start a proposal review?
Would these remain blockers to accept or approve a proposal after review started?

For some more thoughts on those:

Determine what to do about String and Substring

It looks like the comments and feedback here concerned what to do about two strings that compare as "equal-substitutable-interchangeable" but contain different summary bits like isASCII. I believe another discussion was around what happens if a string that is stored inline is somehow copied into a buffer and returned as a new string.

I opened up the discussion to try and collect more details on this feedback… but I still am not very clear on why these questions belong at the level of a proposal design review. These questions — to me — seem much more appropriate for a diff implementation review.

The biggest missing pieces for me here — and this is where I have not heard or seen any very clear answers so far:

What public API operates as a transformation on a String that returns "the same" string back without identical summary bits
What semantic guarantees are made about that public API in the documentation? Is that public API currently guaranteed to return a String value that "is identical" to the previous one?

Those same two questions apply to what happens if a String value is transformed from an inline value to be copied into a buffer.

If there exists no public API today that can repro what we are talking about "in the abstract" then I'm not completely clear why we are having this discussion right now. It does not seem like an impactful place for us to be blocking this proposal. If there does exist a public API today but that public API makes no guarantees about the "identity" of the returned value… then it seems to me like there is no existing semantic contract our new proposal could be breaking. Again… this does not seem like an impactful place for us to be blocking this proposal.

Please help me understand… are we still blocking this proposal review on having solid answers to these questions about String? Could you please help me understand why these are not questions that can — and should — take place during diff implementation review?

If LSG feels strongly that this proposal cannot ship in its current form because of these unresolved questions on String and Substring I would also be open to moving String and Substring to the "Future Directions". This would unblock us on shipping isIdentical(to:) on the "core" collection types: Array, Set, and Dictionary. Which then brings us to the next blocker:

Determine what types must ship in this proposal

As of right now I am still against growing the scope of this proposal with additional types. I am not at all opposed to additional types shipping as part of a second round follow up proposal. I just have not heard or seen any very compelling reason why additional types must ship as part of this proposal. There is no technical reason AFAIK why one of the new types being proposed would need to block an implementation on Array or Dictionary. If someone does have a legit technical reason to block on new types then please respond back and help me understand what that would be.

Please take a look at the new proposal draft and please let me know if you feel we can move forward from this one. Or please help me understand what would be left blocking and why we feel like that should be a blocker on a proposal review. Thanks!

github.com/swiftlang/swift-evolution

[stdlib] add is identical methods to concrete types

main ← vanvoorden:add-is-identical-methods

opened 10:29PM - 04 Jun 25 UTC

vanvoorden

+713 -0

Many types in Standard Library are “copy-on-write” data structures. These types …present as value types, but can leverage a reference to some shared state to optimize for performance. When we copy this value we copy a reference to shared storage. If we perform a mutation on a copy we can preserve value semantics by copying the storage reference to a unique value before we write our mutation: we “copy” on “write”. This means that many types in Standard Library already have some private reference that can be checked in constant-time to determine if two values are identical. Because these types copy before writing, two values that are identical by their shared storage *must* be equal by value.

John_McCall · September 9, 2025, 9:17pm

Hi, Rick. The LSG talked about this today.

We think that the concrete implementations should document that values produced by copying the same value, with no intervening mutations, will compare identical. Otherwise we don't think it's necessary to document the behavior of details like the string summary bits, beyond saying that equal values produced in other ways may or may not compare identical.
We would like at least the String view types to be included in the proposal. However, as an exception to the general rule requiring implementations, we are comfortable with not implementing isIdentical for these types before the proposal goes to review. Furthermore, you don't need to be on the hook to implement them in any case; other maintainers will just make sure they're implemented before the feature is released.

Does that sound acceptable?

vanvoorden · September 9, 2025, 10:18pm

I actually had some questions about that and started this side discussion:

The implication seems to be that if we consider SE-0390 to be the "canon" definition of "what is a copy" then we already have the implication that a copy is identical.

I was considering collecting some thoughts from that thread and then writing an "appendix" on the proposal as an attempt to connect the dots between "copies" and representations that are "identical-indistinguishable" and "interchangeable-substitutable"… but I'm not sure I have enough context to try and answer some of the ambiguity there once and for all.

Could you please confirm if the feedback to update the proposal and explicitly confirm that a copy must be identical was directed at the "abstract" section where we attempted to generalize isIdentical(to:) or to the "concrete" sections where we wrote out header documentation for every concrete type being presented? Nevermind… I did not read correctly. The feedback was to update the header doc comments on the concrete types. SGTM.

I believe we already do document that currently for every concrete type:

  /// string storage object. Therefore, identical strings are guaranteed to
  /// compare equal with `==`, but not all equal strings are considered
  /// identical.

So I think this already covered the request that we confirm that equal values do not imply identical values. Would you be able to confirm if LSG was looking for extra documentation for that on top of what was already being proposed?

Just to help me confirm… are we talking about:

String.UnicodeScalarView
String.UTF16View
String.UTF8View
Substring.UnicodeScalarView
Substring.UTF16View
Substring.UTF8View

The String views plus the Substring views? Is that correct?

My plan would be to just copy over the header doc comments from String and Substring for these new concrete types. But if anyone thinks of some kind of weird gotchas here that could impact the proposed semantics then please let me know and we can think through that together.

That's a big help for me and I appreciate the extra flexibility on that. Thanks!

John_McCall · September 10, 2025, 1:49am

I think that's fine, we don't need to wordsmith doc comments here as long as we're in agreement about the intended semantics.

That seems right.

You don't actually need to include specific "header doc" comments for every method in the proposal document; it's not like we're committing to using your proposed text as the documentation. The proposal should describe the semantics of the methods, to an appropriate level of detail and in a manner that other community members can readily understand, and explain how they're expected to be used. That's it. Being overly redundant can actually detract from the readability of the proposal.

vanvoorden · September 15, 2025, 6:01pm

The new version of the proposal made a few small edits to address feedback from LSG:

Detailed Design
- Added String view types.
  - String.UnicodeScalarView
  - String.UTF16View
  - String.UTF8View
- Added Substring view types.
  - Substring.UnicodeScalarView
  - Substring.UTF16View
  - Substring.UTF8View
- Added semantic guarantees that copies are identical representations.

Please let me know if you see any remaining blockers left before we are ready to begin moving to a proposal review. Thanks!

github.com/swiftlang/swift-evolution

[stdlib] add is identical methods to concrete types

main ← vanvoorden:add-is-identical-methods

opened 10:29PM - 04 Jun 25 UTC

vanvoorden

+793 -0

Many types in Standard Library are “copy-on-write” data structures. These types …present as value types, but can leverage a reference to some shared state to optimize for performance. When we copy this value we copy a reference to shared storage. If we perform a mutation on a copy we can preserve value semantics by copying the storage reference to a unique value before we write our mutation: we “copy” on “write”. This means that many types in Standard Library already have some private reference that can be checked in constant-time to determine if two values are identical. Because these types copy before writing, two values that are identical by their shared storage *must* be equal by value.

vanvoorden · September 15, 2025, 7:18pm

Side Discussion with Self-Promotion

Swift-CowBox will synthesize a heap allocation on your behalf to transform your "traditional" Swift struct to a copy-on-write data structure. CowBox will preserve value semantics publicly while moving your stored properties to a heap allocation reference internally which can be compared for identity equality. CowBox will then synthesize an isIdentical(to:) method. If your struct adopts Equatable then CowBox will also add that identity equality "fast path" as an early exit in the == operator.

There is a performance penalty from allocating an extra reference… but if you make many copies over your app lifecycle you can still see measurable performance wins. You would also be reducing the amount of retain calls needed on every copy if your stored instance properties are themselves copy-on-write data structures.