Regression in Set with Swift 5.5

oliverkrakora · October 28, 2021, 1:21pm

Hello,
I came across a crash on Xcode 13(iOS 15) that wasn't present in previous iOS releases(< Xcode 13). I found the change that is responsible for that regression. The reason why the crash occurs is that I have a Hashable type that wraps a non Hashable type and allows me to perform Set operations on two different non Hashable types(it's an exotic and not safe solution, but it worked before that change) and since formIntersection now "mixes" my hashable wrapper types the force cast to my specific type fails. I am able to work around that issue, but I was wondering what happens with this kind of regressions in the Swift Evolution Process since the benefit of the change clearly outweighs my use case.

Here the example of my code that shows that the types in in the Set are mixed after doing an intersection:

struct A {
    let a: Int
    let b: Int
}

struct B {
    let a: Int
    let b: Int
}

let a = [A(a: 1, b: 1), A(a: 1, b: 2), A(a: 1, b: 3)]
let b = [B(a: 2, b: 1), B(a: 1, b: 2), B(a: 2, b: 3)]

var aSet = Set(a.map { PartialHashable(object: $0, keyPaths: [\A.a, \A.b]) })
let bSet = Set(b.map { PartialHashable(object: $0, keyPaths: [\B.a, \B.b]) })

aSet.formIntersection(bSet)
let isMixed = !aSet.allSatisfy { type(of: $0.object) == A.self }
print("isMixed: \(isMixed)")

When I run this code in Xcode 12 isMixed is false and with Xcode 13 it's true

hisekaldma · October 29, 2021, 9:34am

When you make a type Hashable you’re also making it Equatable, which means that you’re promising that instances of the type that are equal are also substitutable. Code that’s generic over an Equatable type, like Set, relies on this promise and is allowed to substitute instances that are equal.

So the problem here is that you’re not upholding your end of the bargain. You’re promising that two instances of PartialHashable can be substituted, but then you’re expecting them not to be. That’s not going to work, and that it ever worked was pure coincidence. Set, and other types that require Equatable, can change their implementation at any time, e.g. to improve performance, and in doing so might start substituting instances that weren’t being substituted before.

oliverkrakora · October 29, 2021, 2:06pm

Makes total sense, I haven't thought about the substitution part when I implemented my solution, thank you for pointing that out so clearly!

lorentey · October 30, 2021, 2:21am

@hisekaldma's explanation is excellent, thanks!

Unfortunately, this is Hyrum's Law in action -- Set's intersection methods do not document which input set the result's members come from, but they also do not randomize this, so naturally there is code out there that implicitly assumes the original behavior.

Sadly we did not realize this optimization was source-breaking.

Beta testing did not uncover any obvious binary compatibility issues arising from this change, but I imagine this is just because in most cases, Set's methods get specialized into the client module, so existing binaries did not pick up this change.

Now that people start recompiling their code in the new stdlib in Swift 5.5, this behavioral change is evidently triggering source compatibility issues, which isn't good.

It is possible to restore the original behavior without losing the performance improvement. I think we should do that! This will include a code size regression, as we'd probably need to implement the small.intersect(large) and large.intersect(small) cases on two completely separate code paths.

(Unfortunately 5.5 has shipped now, which puts us in a somewhat uncomfortable position -- developers whose code is affected by this change may now be busy updating it to instead rely on the new behavior; and if so, then fixing the stdlib to restore the original will break them twice. But if this proves to be a widespread problem, reverting to the original behavior is likely still the right thing to do.)

Note: (The rest of this post is a rant about the concept "substitutability" -- it's largely irrelevant to the topic at hand.)

Equatable's documentation does include this passage:

Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values.

However, sadly this sentence is very misleading and I find it best to ignore it. Two values that are "substitutable" in one context may have differences that are very much relevant in another. For instance, "Caf\u{c9}" == "Cafe\u{301}" is true, but these two strings are definitely not substitutable with each other in a universal sense.

The way I look at it, Equatable simply defines some arbitrary equivalence relation that the type's author thought to be useful. (And sometimes it's not even an equivalence relation, as is the case with standard floating point types... )

The purpose of a Set is to remember at most one member of each equivalence class defined by this relation, and, given a value, to quickly determine if the set already has a member of the same equivalence class. The exact value that gets remembered clearly does 100% matter -- or Set wouldn't need to provide separate APIs for insert(_:) and update(with:).

"Equality implies substitutability" is a gross oversimplification at best.

Lantua · October 30, 2021, 2:54am

I did rely a lot on the distinction between the things in the set and the things being added in the past. While that violates substitutability, all APIs point to that being one of the expected usages. It is handy when [Key: Value] are so common they might as well be Set<(Key, Value)>, or that the tuple (Key, Value) comes out very naturally via other calculations.

I might need to re-audit my past code from that optimization .

Is it possible to add a flag whether the larger set is the other set and act accordingly to maintain the old behaviour?

wowbagger · October 30, 2021, 10:37am

In addition to the grapheme cluster example, I think the message in documentation also ignores the identity for reference types. For example, you can easily have equal but not substitutable nodes in a graph.

tera · October 30, 2021, 7:55pm

Karoy, very enlightening analysis as usual, thank you. Topics around equivalence are mind-blowing: we may have values which are equal but not identical, or identical but not equal, or values which are not equal to themselves (nan != nan) or otherwise violating math axioms (rock < paper < scissors < rock).

whoever did the fix clearly haven't thought about the reference types case. if you go back in time, what would you do here, leave existing "intersection" API as it was and introduce a new one like so?

struct IntersectionOptions: OptionSet {
    init(rawValue: UInt)
    static let optimized = IntersectionOptions(rawValue: 1)
}

func intersection(_ other: Set<Element>, options: IntersectionOptions) -> Set<Element>

is this path still an option now or is it too late?

ps. if to start completely from scratch (alas something we can't do), the name intersected(with:) would be appropriate for the old unoptimised intersection behaviour and the name intersection is good for the new optimised behaviour, perhaps even in a static func form, or as a * operator.

lorentey · October 30, 2021, 10:18pm

Oh, I don’t think these operations deserve that much API. We don’t need to overthink this; the existing operations are fine, we just need to make sure that for items that are in both inputs, high-level Set operations always take their results from the first set, i.e., self.

(We should also explicitly document this, although that might require going through Swift Evolution, and it’s difficult to justify the expense of doing that. Related note: we still haven’t updated Sequence.sorted()‘s docs to guarantee stable ordering.)

To turn this into something positive, this would be an ideal excuse for landing the optimizations in PR #21300.

tera · October 30, 2021, 10:28pm

yep, after i wrote my comment above i came up with the same idea (so it is both quick and preserving the old behaviour):

   @inlinable
   public __consuming func intersection(_ other: Set<Element>) -> Set<Element> {
	 var newSet = Set<Element>()
	 if count > other.count {
	   for member in other {
		 if let element = (first { member == $0 }) {
		   newSet.insert(element)
		 }
	   }
	 } else {
	   for member in self {
		 if other.contains(member) {
		   newSet.insert(member)
		 }
	   }
	}
	return newSet
  }

bzamayo · November 1, 2021, 2:48pm

I'm surprised you think this is even something that needs fixing. I was under the impression as long as Set's behaviour respected the Hashable guarantees of its inputs, then all was dandy — and all of these operations do that. Is guaranteeing where the elements originate from really something that needs to be done?

(No criticism in tone intended, I'm just genuinely surprised by this.)

scanon · November 1, 2021, 3:55pm

I'm not sure that this is worth doing--if it had been caught during beta, it might have been, but even if we "fix" this now, there will be lots of devices with the 5.5 behavior in the wild, so developers still won't be able to depend on the old behavior for the next few years.

We could, I suppose, add an alwaysEmitIntoClient version with the old behavior, but we'd have to maintain the old non-aEIC entry point as well, and I'm not sure this is all really worth the hassle (especially given that there's a "trivial" workaround for anyone bitten by this).

That said, I think people are increasingly aware that a lot of the keyed-collection API has some rough edges around equatable, and it's worth keeping track of these issues for a more comprehensive reworking down the line?

tera · November 1, 2021, 11:11pm

if i want an optimised version of:

func ordered_intersection(_ other: Set<Element>) -> Set<Element> {
	var newSet = Set<Element>()
	for member in self {
		if other.contains(member) {
			newSet.insert(member)
		}
	}
	return newSet
}

func quick_and_ordered_intersection(...) -> ... {
	// same as the above but iterates over a smaller set
	// still maintains the fact that elements are taken from "self"
}

that maintains its observable behaviour (takes elements from "self") but cleverly iterates over a smaller set to do this quickly, shall i implement such a method myself? if Set's intersect is not such a thing (now or ever) can this new method be added to standard library in addition? would be +1 from me in any form (whether this is a new method or a new documented behaviour of the old intercept - no particular preference).

lorentey · November 2, 2021, 1:39am

I would not necessarily consider this source compat issue a bug if the documentation explicitly called out that the selection is subject to change. (But even then, it would be counter-productive to deliberately change things only to mess with code that didn't read the docs.)

That we shipped this change is not a drop-everything-the-sky-is-falling issue (it isn't even necessarily something we must fix), but it's still an issue.

Looking past this immediate problem, I think Set.union and Set.intersection are underspecified. Given that Set's API is actively encouraging people to care about equal-but-not-identical elements, it would very much make sense for these operations to specify which input set the members of the result come from. (Just like OrderedSet's equivalent methods specify the ordering of the result.)

I see no practical reason these operations cannot always take dupes from their first input, so we should make sure they do that.

Specialization takes care of this objection reasonably well, I think. (As demonstrated by us not realizing this was an issue until long after 5.5 shipped.) This issue is an esoteric detail of a relatively rarely used API; we don't need to overthink the fix.

If we wanted to change the docs (which I don't think we should right now), then we could simply state that the behavior is only guaranteed from Swift 5.6 on (or whatever). The same goes for Sequence.sorted() and similar cases.

I agree, this clearly wouldn't be worth the hassle. We can't fix past releases, but we can at least make sure to do the right thing in future ones.

(People can always write extensions that polyfill the right behavior on older releases.)

Hm, I'm not entirely sure I know what these rough edges are -- the core Set and Dictionary APIs are very deliberate about preserving/maintaining key identity, and they have been for years.

(Besides us spuriously changing Set.intersect's behavior in 5.5, the one minor issue I'm aware of is that Dictionary doesn't provide efficient API that corresponds to Set.update(with:) -- dictionary keys are immutable. However, one can simply remove the key and re-add its value with the updated key. We also have some API wrinkles around in-place mutations of Dictionary values, but that's completely unrelated to Equatable.)

Equatable has some fundamental issues, of course -- we have the mostly nonsensical substitutability "requirement", we have floating point types violating reflexivity, and we transparently bridge ObjC/Swift types that have incompatible definitions for equality (NSString/String being the major offender). But how can we fix these by tweaking hashing collection APIs?

lorentey · November 2, 2021, 5:17am

I submitted PR #40012, which reimplements the dramatic Set improvements from my old PR #21300.

This also restores the original, proper Set.intersect behavior without losing the optimization we shipped in 5.5.

jonathanpenn · November 2, 2021, 5:55pm

This is news to me. Does this mean that we do ensure a stable ordering?

lorentey · November 2, 2021, 6:02pm

The stdlib’s sort algorithm has been stable since Swift 4.2 or Swift 5 or so.

We technically do not ensure a stable ordering, because the documentation does not guarantee this.

hisekaldma · November 2, 2021, 6:39pm

That this still hasn’t been documented is one of my biggest gripes with the std lib. It leaves users with two equally bad options:

a) implement their own stable sort (even though the std lib one is actually stable!), or
b) assume that the std lib one is going to stay stable even though it isn’t guaranteed to.

sjavora · November 2, 2021, 7:12pm

Are there more such documentation changes that would require a proposal? If so, we could collect them all and at least get it over with with one review

lorentey · November 2, 2021, 10:31pm

That would be an excellent idea! Are you volunteering to write it up?

So right now we have:

Sequence.sorted(), .sorted(by:), MutableCollection.sort(), .sort(by:) should document that the sort algorithm is stable. (Starting from Swift x.y)
Set.union(_:), Set.intersection(_:), and Set.formIntersection(_:) (all overloads) should document that members of the result that appear in both inputs are taken from self. (Interestingly, formUnion already has this guarantee.) (Starting from Swift 5.6)

Anything else?

sjavora · November 3, 2021, 3:39pm

Well, I've never done a proposal or even a pitch before, but I suppose here I could pull of the implementation if it's just comments