[Proposal] Set literals

crontab · July 4, 2024, 8:34am

I am surprised this has only been discussed once and it was in 2016: [Proposal] Set literal and Set type syntax unless there were others that I can't find. Starting a new thread since the previous one is too old and too negative and dismissive towards sets (I wonder why!).

Been thinking about making sets first-class citizens ever since I began coding in Swift. Seems like such a big omission and disregard of a data type that deserves to be in the front row seat.

I do notice in open-source code people using arrays where sets would be more appropriate likely because arrays are easier to declare and use. For example, an array of days of week is almost always meant to be a set but you rarely see it declared as such for some reason. There are many more similar examples.

So, set type:

typealias IntSet = :[Int]
typealias StringSet = :[String]
typealias DaysOfWeek = :[DayOfWeek]

Set literals:

let empty: IntSet = :[] // alternatively = :[Int]()
let longMonths = :[0, 2, 4, 6, 7, 9, 11]
let weekend: DaysOfWeek = :[.saturday, .sunday]

Overall the "price" of this suggestion is just the :[ which becomes a new multicharacter special symbol. To me, it beats the previously proposed [_: ] and other possible variations as relatively more aesthetical.

Optionally the following operators could be defined for sets, but I can see why someone would oppose these in favor of explicit functions:

a * b // intersection, commutative
a * i // membership test, commutative; alternatively could be `i ~ a`
a + b // union, commutative
a - b // subtraction

Lovers of sets, unite!

vns · July 4, 2024, 8:45am

typealias DaysOfWeek = Set<DayOfWeek>
let weekend: DaysOfWeek = [.saturday, .sunday]

That will literally work as of right now. With the previous thread pointing out that there are type inference available as well:

let longMonths: Set = [0, 2, 4, 6, 7, 9, 11]

You simply need to be explicit that type is Set and than initialize with literal. So I don't think there is a need for any additional syntax.

Quedlinbug · July 4, 2024, 8:53am

OrderedSet is a better match.

crontab · July 4, 2024, 8:53am

Yes but I presume there is a cost of conversion if it's not an array literal, i.e.

let set1: Set = [a, b, c]

And actually I'm not sure if there isn't an overhead even in case of array literals. But I admit I know very little about how much Swift can do at compile time vs. run time.

Besides, my point still stands I think that sets will get a chance of being used more often if they become first-class citizens.

crontab · July 4, 2024, 9:00am

Not for enumerable types, because they are already ordered. OrderedSet is good for e.g. LRU type storage and overall is more exotic I think than the ordinary sets.

vns · July 4, 2024, 9:07am

I'm not sure Sets being less used because there is no specific literal syntax for them. It is more combination of unawareness (new developers tent to dig into collection types less and less), lack of ordering which is important quite often, and small size of data to operate on (I can argue that for most of the data we operate on, Array.contains is actually cheaper than using Set). Personally, I quite often find myself in last category when reach for Set: the need to add Hashable conformance + zero to negative win in performance on small collections make it actually less useful than array. You end up with relatively small portion of day-to-day tasks in which you'd be better with set.

ktraunmueller · July 4, 2024, 9:12am

But the array literal could contain duplicates, which would be silently dropped in the Set initializer, right?

Ideally the code shouldn't compile if the literal contains duplicate values (i.e. does not constitute a set).

vns · July 4, 2024, 9:18am

Yes, which IMO perfectly expected behaviour as you reach for a set. And I don't think it should prevent from compilation (or even can, if we pass variables there). I can think of Python and C++ (kinda) right now as examples, and they behave the same way on literal init, e.g. in Python:

x = {1, 1, 1}
print(x)

will output {1}

ktraunmueller · July 4, 2024, 9:25am

Overall, I would second this feature. I see no reason why sets should be considered "lesser" collections than, say, arrays. Even though this discussion is just about literals.

In general, I think using the right data types can make code simpler, easier to read, and more robust.

If I have a problem at hand that requires exclusivity of values, I reach for Set. If I need an ordered collection that allows duplicates, I reach for Array. If I need both exclusivity and ordering, I would use OrderedSet.

Using an Array for a problem that requires exclusivity opens the door for bugs which would simply not be possible when using a Set.

crontab · July 4, 2024, 9:25am

While true, it's a potentially dangerous assumption (that certain arrays will always be small) because you can easily overlook a situation with an array coming from user input which means opening the door for DoS attacks.

In the world of structured concurrency this is not a big problem anymore as most data (or all?) that's implicitly Sendable is also Hashable.

vns · July 4, 2024, 9:38am

That's not just an assumption, that derivable from the majority of the applications and end-up in preferability of arrays. And user input processing is completely another topic I'd say, where you should limit how much it is possible to send or how much you read (unbounded read from unknown source you cannot trust is bad despite data structure being used).

Of course, one should be careful making decision, maybe add some assertion to catch cases when data size has grown if you made assumptions about size in particular case. But just using set, without understanding why, can also have dangerous implications.

That seem to be two unrelated things to me... I'm not sure how's that helps? You still need to deal with Hashable, maybe this type outside of your module, and hashing comes at price anyway.

crontab · July 4, 2024, 9:45am

Not really, and actually I'm not sure anymore about your previous thesis that small arrays may be faster than sets.

Imagine an array of 10 strings, that's up to 10 string comparisons for linear search. Compare that to a single .hash() call, a single pass on one string and some trivial 64-bit int operations for a set .contains(). Would be interesting to measure but I think Set will win even for small arrays.

As for sendability and hashability: everything that's easily copyable is also easily hash-computable. Do you have examples of the opposite? I can't think of any.

Quedlinbug · July 4, 2024, 9:45am

Your thread is not about the original enumeration. It is about collections based on it.

Lancelotbronner · July 4, 2024, 9:47am

If there is an overhead to conversion from an array literal, then I’m pretty sure it can be considered an optimizer issue and fixed without the need for a proposal.

For that reason I don’t feel like the proposed literal syntax adds much value.

I would however like to have some operators for SetAlgebra types. I often use OptionSet for C bitsets and miss the simple operators.

Karl · July 4, 2024, 10:05am

It depends on the strings. A string comparison can early-exit as soon as it is established that the strings are not equal. Hashing needs to visit the entire string (and then still needs to perform a string comparison if there is a potential match, because of the possibility of collisions). And hashing a string is not trivial - in the worst case it involves Unicode normalisation of the entire string.

Hash tables also take much more memory, and their literals cannot be statically initialised by the compiler because they involve a random seed chosen at runtime.

Each of these data types have their uses, and they are not universal. There are quantities and kinds of data for which some implementation strategies are better than others.

So I disagree with your premise, and actually I think the opposite - that developers often overuse Set and Dictionary for very small amounts of easily-comparable data which would be better stored in an Array. That's not to say that Set literals are a bad idea, but that I would not make such sweeping assumptions as you do that people mostly use arrays out of ignorance.

vns · July 4, 2024, 10:07am

I tested and there is no difference at all: either with Array or with Set it took same time. I still expected an array to be faster maybe, but even equal time makes sense. Because hashing is not free, for string this is passing whole string anyway. So linear search with contains or hashing has same timing (UPD: @Karl in the post above has made more detailed overview of the differences).

And if you hash something more complex (some compound struct), I'd expect it to grow as well in time. Of course, you can find exceptions, even with same data types, yet that's why they exceptions.

I'm not saying that implementation is hard, it is that you need to make additional steps to conform to it, which can be also easy as Swift will synthesise for you, yet if that type from outside of your current module, you'll have to write it by hand (and maintain later). That's all compared to an array is unnecessarily much more complicated, than just Array.contains, while you gain nothing (as rough generasization).

crontab · July 4, 2024, 10:32am

For how many items, 10? You can't generalize based on a case of 10 strings because after all there are computational complexity truths that can't be beaten

I should have clarified that I meant trivial types like Int, String etc. then enums and also structs and arrays that enclose only trivial types. Anything that is not trivially and implicitly hashable, is also not trivially sendable which means you will tend to avoid it in your modern Swift code.

crontab · July 4, 2024, 10:37am

Maybe, but the set type's primary purpose is a guarantee of exclusivity/uniqueness, that is what developers I think often overlook. (I don't know why this discussion slipped into performance and memory issues, to me it's not the most important thing in this case.)

vns · July 4, 2024, 10:49am

As there were pointed out, we actually can generalize for majority of strings that amount of work for linear search vs hashing and maintaining set at least require the same amount of work, not even measuring it. We cannot say that this is true in 100% cases, and that’s not what I am trying to say. I’m only saying that we can expect it to be true for majority of them.

I am also not trying to say that sets inherently useless or bad of any kind. There are plenty of legitimate cases when it is beneficial to reach for it, sometimes despite hashing costs. We’ve started discussion on the topic from sets being often dismissed because of syntax details, and my point was (and remains) that lesser "popularity" driven by other factors. After all, what we actually need is better understanding of data structures and when to use them. I am not advocating for blind use of array instead of sets on small data just because there is no difference either, only that in more cases than not it turns out better strategy.

Your premise lies in the land of concurrency, but so far there are a lot of valid use cases where you don’t need or don’t want concurrency, yet still will need hashing capabilities. Even in the concurrency, we have actors that can remove the need of sendability if all operations hold inside. That’s actually the same kind of generalization you are agains with array vs set: even through there are majority of cases when that is valid, there are exceptions to that.

vns · July 4, 2024, 10:54am

I think this case falls into first category I’ve stated earlier: unawareness of set as data structure in overall that can hold such guarantees as uniqueness.