Proposals are expected to comprehensively describe the observable behavior of any APIs added, enough that someone else can implement them. This may or may not be straightforward to do referencing only documented, user-facing details of the Swift language. Public documentation will describe only the subset of observable behaviors that are guaranteed to end users and explainable in the terminology of the surface language. It is neither necessary to limit the detailed design of your proposal to that subset, nor would it be sufficient detail for all proposals, unless you intend every observable behavior to be an API guarantee that can be explained using only existing public features and APIs.
Now, it is also important to reckon with how we'd describe and teach any feature to end users. As it happens, in this case, when you document that x.isIdentical(to: x) for all x, then you have guaranteed that small strings will compare their inline storage in the implementation. So this is not a problem—though again, to be clear, that particular piece of feedback was about the observable behavior and not the documentation of it.
No, you're being asked to consider what happens when two strings that both have the value "cafe" are queried for identity when one of these values has the "known ASCII" bit set but the other doesn't.
I was able making two strings (short or long) different only in their isAscii bit via bit fiddling, although it would be great to see an example how do it via legitimate means.
In my attempts of doing so via legitimate means the short string's isAscii was always up to date as if the string was revalidating it after every change and for long strings a change that would affect isAscii bit was also triggering COW, in which case isIdentical would have to be false due to guts pointer difference.
Regardless of how to get into that situation, I'd say that isIdentical should return false for strings which are bit different in any of their 16 storage bytes (including isAscii bit):
for simplicity of implementation
it's a safer choice, e.g. some code might assume that if isIdentical is true then it could use one string instead of another, but not be prepared to see a sudden change of isAscii bit.
Can any non-mutating operation on a string change its "known ASCII" bit?
Do the same properties apply to other flags (we should review them comprehensively)?
Is there any safe API that produces different observable behavior with respect to any trailing padding bits (e.g., in the small string representation)? If so, should those bits be masked or not masked for the purposes of this operation?
Does anything change with respect to Substring?
(On that note, while this proposal includes the custom slice types for String and Array, is there reason to add this API for the slice types of Dictionary, Set, and ContiguousArray? If so, are there any custom considerations there?)
As of right now I don't have very many great answers here. I can continue trying to open up the code to try to learn what was happening… but I could also really use some extra help from anyone that wants to dive in and volunteer to help work on answering these questions:
Our existing "fast path" algorithms compare the rawBits directly.[1][2][3] Which I think implies we are calibrated to compare flags like isAscii. For the value equality checks… it might not be so bad because if we fail the fast path we will always fall through to the conventional O(n) algorithm to confirm value equality. But was the check directly against rawBitswithout controlling for differences in flags like isAscii made on purpose? Was this intentional? Was there some historical reason we explicitly wanted to check isAscii during a fast path identical check?
How and where do we codify what it means for two String values to be "equal by value" and what do the tests look like where we confirm that flags like isAscii would never affect the return value from the == operator?
This might all be academic if there do not exist any public APIs that can even get us into this state. What is the repro or the test that produces this outcome? Is the clue that we need to slice down a String that is not ASCII to two Substring values that contain no characters that are not ASCII… but it is "undefined" which Substring will carry the isAscii flag?
If we did find a code path that gives us this outcome with public APIs what is the specific semantic guarantee made by String that is broken if we return false from isIdentical when isAscii is not equal? Where is that semantic guarantee documented?
This is probably the best idea if we do choose to rethink fast path checking across all code paths and then improve our early exist fast return from the existing value-equality operations. We could even improve performance for everyone by reducing the amount of unnecessary O(n) comparisons that take place for two strings that should be considered "identical" but differ in bits that should not be affecting the fast path.
This code looks pretty complex… I'm not completely confident I would be the best engineer to write that new implementation. I would absolutely like some help to figure out what is possible and where all the trapdoors and underwater stones are. And no matter who writes the code this should ship with a lot of unit tests if we refactor the way String checks for value equality everywhere.
if you compare all 16 bytes of storage to implement isIdentical that would happen automatically - inline string will have isInline bit set (one of the high bits) and the "out of line" stored sting will not have that bit set, hence the two will always be different, so I wouldn't worry about it other then perhaps mentioning this fact in the proposal.
As for those other interesting questions - my position is unchanged: disregard all those nuances and compare the whole N bits of the type storage (e.g. 16 bytes as per MemoryLayout<String>.size). My hypothesis is that the padding bytes (e.g. of a small string representation) would not be a frequent problem (those who thinks otherwise please provide a counter example) and the worst outcome is that isIdentical will be returning false in some extra ~1% of real world cases.
The less custom this machinery is – the better. And if we ever want to change that, say to be more smart in treating the padding bytes – we could change that later on, as it will not be a breaking change to go from isIdentity returning false in one version of Swift to true in the next version of Swift for two differently padded but otherwise identical strings, or another mentioned things. All IMHO.
I think for this — and for the buffer question above — the POV from LSG is not so much "do there exist false negatives" but to what extent do false negatives break semantic guarantees: either the existing semantic guarantees on String or the new semantic guarantees on isIdentical?
For example:
a.isIdentical(to: b) == true implies "there is no way to distinguish between a and b".
a.isIdentical(to: b) != true implies "there is some way to distinguish between a and b".
Does there exist any public API transformation f on String such that the public semantic guarantee of f is that f(a) returns b such that "there is no way to distinguish between a and b".
Does f have the ability to remove the isASCII flag bit from a before returning b?
Does f have the ability to transform a that is inline and return b that is not inline?
Do we then have a situation where f(a) returns b which must be "indistinguishable" according to the semantic guarantee of f… andbmust be "distinguishable" according to the semantic guarantee of isIdentical? This is what we want to figure out here in design review.
I opened a task issue on my evolution fork if anyone is interested in a more targeted discussion about potential new types to add to the proposal. Please leave a comment here or on the task issue if you are interested in volunteering to help write implementations. Thanks!
The new version of the proposal is a significant rewrite that I believe addresses many of the outstanding questions from reviewers so far:
Motivation
Greatly expanded real-world examples
One command line utility and one SwiftUI app
Prior Art
Discussion of existing "fast path" value equality
Discussion of existing alternatives to value equality:
withUnsafeBufferPointer
withUnsafeBytes and memcmp
Span
Proposed Solution
Discussion of "abstract" axioms and principles
Discussion of types that are not Equatable
Application to previous real-world examples
Detailed Design
All concrete types are presented
Header documentation comments include important axioms and principles
Future Directions
Additional concrete types from Standard Library
Could ship in a follow up "second round" proposal
Alternatives Considered
Exposing Identity
Discussion of why we do not want to escape identity
Different Names
Examples of different names suggested in pitch review
Generic Contexts
Discussion of why we are not proposing a protocol
I believe the major outstanding blockers from LSG were:
Determine what to do about String and Substring:
Should "summary bits" like the isASCII flag be included when we compare for identity equality?
Determine what types must ship in this proposal:
What types from Future Directions must we include here in this "first round" proposal?
Looking closer at those last two… I would like to hear back more from LSG for:
Do these still remain blockers to start a proposal review?
Would these remain blockers to accept or approve a proposal after review started?
For some more thoughts on those:
Determine what to do about String and Substring
It looks like the comments and feedback here concerned what to do about two strings that compare as "equal-substitutable-interchangeable" but contain different summary bits like isASCII. I believe another discussion was around what happens if a string that is stored inline is somehow copied into a buffer and returned as a new string.
I opened up the discussion to try and collect more details on this feedback… but I still am not very clear on why these questions belong at the level of a proposal design review. These questions — to me — seem much more appropriate for a diff implementation review.
The biggest missing pieces for me here — and this is where I have not heard or seen any very clear answers so far:
What public API operates as a transformation on a String that returns "the same" string back without identical summary bits
What semantic guarantees are made about that public API in the documentation? Is that public API currently guaranteed to return a String value that "is identical" to the previous one?
Those same two questions apply to what happens if a String value is transformed from an inline value to be copied into a buffer.
If there exists no public API today that can repro what we are talking about "in the abstract" then I'm not completely clear why we are having this discussion right now. It does not seem like an impactful place for us to be blocking this proposal. If there does exist a public API today but that public API makes no guarantees about the "identity" of the returned value… then it seems to me like there is no existing semantic contract our new proposal could be breaking. Again… this does not seem like an impactful place for us to be blocking this proposal.
Please help me understand… are we still blocking this proposal review on having solid answers to these questions about String? Could you please help me understand why these are not questions that can — and should — take place during diff implementation review?
If LSG feels strongly that this proposal cannot ship in its current form because of these unresolved questions on String and Substring I would also be open to moving String and Substring to the "Future Directions". This would unblock us on shipping isIdentical(to:) on the "core" collection types: Array, Set, and Dictionary. Which then brings us to the next blocker:
Determine what types must ship in this proposal
As of right now I am still against growing the scope of this proposal with additional types. I am not at all opposed to additional types shipping as part of a second round follow up proposal. I just have not heard or seen any very compelling reason why additional types must ship as part of this proposal. There is no technical reason AFAIK why one of the new types being proposed would need to block an implementation on Array or Dictionary. If someone does have a legit technical reason to block on new types then please respond back and help me understand what that would be.
Please take a look at the new proposal draft and please let me know if you feel we can move forward from this one. Or please help me understand what would be left blocking and why we feel like that should be a blocker on a proposal review. Thanks!
We think that the concrete implementations should document that values produced by copying the same value, with no intervening mutations, will compare identical. Otherwise we don't think it's necessary to document the behavior of details like the string summary bits, beyond saying that equal values produced in other ways may or may not compare identical.
We would like at least the String view types to be included in the proposal. However, as an exception to the general rule requiring implementations, we are comfortable with not implementing isIdentical for these types before the proposal goes to review. Furthermore, you don't need to be on the hook to implement them in any case; other maintainers will just make sure they're implemented before the feature is released.
I actually had some questions about that and started this side discussion:
The implication seems to be that if we consider SE-0390 to be the "canon" definition of "what is a copy" then we already have the implication that a copy is identical.
I was considering collecting some thoughts from that thread and then writing an "appendix" on the proposal as an attempt to connect the dots between "copies" and representations that are "identical-indistinguishable" and "interchangeable-substitutable"… but I'm not sure I have enough context to try and answer some of the ambiguity there once and for all.
Could you please confirm if the feedback to update the proposal and explicitly confirm that a copy must be identical was directed at the "abstract" section where we attempted to generalize isIdentical(to:) or to the "concrete" sections where we wrote out header documentation for every concrete type being presented? Nevermind… I did not read correctly. The feedback was to update the header doc comments on the concrete types. SGTM.
I believe we already do document that currently for every concrete type:
/// string storage object. Therefore, identical strings are guaranteed to
/// compare equal with `==`, but not all equal strings are considered
/// identical.
So I think this already covered the request that we confirm that equal values do not imply identical values. Would you be able to confirm if LSG was looking for extra documentation for that on top of what was already being proposed?
Just to help me confirm… are we talking about:
String.UnicodeScalarView
String.UTF16View
String.UTF8View
Substring.UnicodeScalarView
Substring.UTF16View
Substring.UTF8View
The String views plus the Substring views? Is that correct?
My plan would be to just copy over the header doc comments from String and Substring for these new concrete types. But if anyone thinks of some kind of weird gotchas here that could impact the proposed semantics then please let me know and we can think through that together.
That's a big help for me and I appreciate the extra flexibility on that. Thanks!
I think that's fine, we don't need to wordsmith doc comments here as long as we're in agreement about the intended semantics.
That seems right.
You don't actually need to include specific "header doc" comments for every method in the proposal document; it's not like we're committing to using your proposed text as the documentation. The proposal should describe the semantics of the methods, to an appropriate level of detail and in a manner that other community members can readily understand, and explain how they're expected to be used. That's it. Being overly redundant can actually detract from the readability of the proposal.
Swift-CowBox will synthesize a heap allocation on your behalf to transform your "traditional" Swift struct to a copy-on-write data structure. CowBox will preserve value semantics publicly while moving your stored properties to a heap allocation reference internally which can be compared for identity equality. CowBox will then synthesize an isIdentical(to:) method. If your struct adopts Equatable then CowBox will also add that identity equality "fast path" as an early exit in the == operator.
There is a performance penalty from allocating an extra reference… but if you make many copies over your app lifecycle you can still see measurable performance wins. You would also be reducing the amount of retain calls needed on every copy if your stored instance properties are themselves copy-on-write data structures.