Algorithms.uniqued() is not ergonomic

phoneyDev · October 25, 2021, 7:08pm

I decided today to add the Algorithms package to my current project with the intention of replacing all the usages of list = Array(Set(list)) with list = list.uniqued(). I was surprised to find that that code doesn't compile. Other examples like this also don't work:

        var animals = ["dog", "pig", "cat", "ox", "dog", "cat"]
        animals = animals.uniqued() // Cannot assign value of type 'UniquedSequence<[String], String>' to type '[String]'
        animals = Array(animals.uniqued()) // Works

The problem is that uniqued() looks like this

public func uniqued() -> UniquedSequence<Self, Element> {

and not like I would have expected it to:

public func uniqued() -> [Element] {

Curiously, uniqued(on:) is declared like I would expect:

public func uniqued<Subject: Hashable>(
  on projection: (Element) throws -> Subject
) rethrows -> [Element] {

Related methods like map, filter, sorted etc. all return [Element]. I understand that I can get what I want by casting to Array(list.uniqued()) but that doesn't seem very ergonomic to me. I could just stick with Array(Set(list)). I could write my own uniqued() as I have on other projects but that doesn't seem like the right solution either.

What are other people doing about this?

1-877-547-7272 · October 26, 2021, 12:11am

UniquedSequence is an example of a “lazy adapter” — it conforms to the Sequence protocol, but it isn’t an Array. Instead, it computes its results lazily. This means that if you use another lazy adapter, then your program can avoid allocating intermediate storage. Methods like reversed(), enumerated(), and joined() also work this way.

To learn how to handle lazy adapters, I’d recommend watching the following video:

hisekaldma · October 26, 2021, 7:34am

This is most likely hiding a bug. Set is unordered, so Array(Set(list)) will scramble whatever order list had.

If you want to preserve the original order, use Array(list.uniqued()) or write your own order-preserving uniqued().

If you don’t care about preserving the original order, either keep it as a Set to express that it’s unordered, or use Set(list).sorted(by:) and sort it in a way that makes sense.

Nevin · October 26, 2021, 1:00pm

If you do want to maintain the order of the remaining elements, here’s an in-place implementation:

extension RangeReplaceableCollection where Element: Hashable {
  mutating func removeDuplicates() {
    var alreadySeen: Set<Element> = []
    removeAll { !alreadySeen.insert($0).inserted }
  }
}

anon9791410 · October 26, 2021, 2:01pm

C#'s IEnumerable / LINQ does everything this way. Then you call ToArray() at the end if necessary.

uniqued doesn't have an ergonomics problem. Swift not having a way to use initializers with a fluent syntax, and instead sometimes returning arrays from Sequence / Collection methods, but not always, is the ergonomic problem.

phoneyDev · October 26, 2021, 3:01pm

OK I get it. uniqued() has a design that is beneficialif it's used in a pipeline before the end because it doesn't allocate an Array. But the suggested solution Array(list.uniqued()) obviates that benefit since it allocates the memory anyway.
If the suggestion "is write your own," like @nevin has done, I think there's a problem. I suggest that if one doesn't wish to opt-in to the lazy adaptor strategy there should still be a method in the Algorithms library that uniques an Array and returns an Array. I'm not sure if it can have the same method name or needs to have a different name, maybe uniquing(), but it should be there. For now I've written my own uniqed() method and Algorithms is backed out of the project.

nnnnnnnn · October 27, 2021, 5:16pm

@phoneyDev Thanks for sharing your experience with this! Whenever possible, the operations in the Algorithms package lazily compute their results to avoid unnecessary computation and allocations. The resulting UniquedSequence still has access to the rest of the Sequence operations (so you can call map or filter afterward), but it is inconvenient when you just need an array.

We've talked before about adding a collect() or toArray() method a few different times in Swift Evolution, but haven't ever taken that step. If you could just call something like that, would that resolve the ergonomic issue? For example:

var animals = ["dog", "pig", "cat", "ox", "dog", "cat"]
animals = animals.uniqued().collect()

divanise · October 27, 2021, 5:48pm

Looking at the source, uniqued() always returns a lazy sequence, while uniqued(on:) returns a lazy sequence when called on a lazy sequence, and an array otherwise. That means you should have another alternative:

var animals = ["dog", "pig", "cat", "ox", "dog", "cat"]
animals = animals.uniqued { $0 }

I can't verify that myself right now, unfortunately, so I may be misunderstanding the overload resolution in this case — I've only read the code, not actually tried to run it.

I certainly agree with the overall sentiment, but I'd also argue that the differing return types on uniqued() and uniqued(on:) on non-lazy sequences qualifies as an ergonomic problem of its own.

divanise · October 27, 2021, 5:51pm

And I just realized I skimmed over this part of the original post:

Still, I think explicitly pointing out uniqued { $0 } as another alternative was worthwhile, so I'll leave my original post as is.

anon9791410 · October 27, 2021, 7:22pm

I don’t recall seeing these conversations, but I’d like to see something happen in that area. Really, I’d like an operator (e.g. …) in the standard library that would allow for

animals.uniqued()…(.init)

but even though that’s good for the general case, it doesn’t read as well as a method that performs a subset of such an operator’s job.

collect is better than toArray, because e.g. Array is to Sequence as String is to Substring. The method, which should be added to all applicable types, should have a universal name.

In the graphics world, we’d call this method baked.
In the audio world, frozen.
There are probably many others for other domains.

Is collect the domain-independent word for bake and freeze?

mtsrodrigues · October 27, 2021, 7:38pm

It would be really nice to add a collect() method to LazySequenceProtocol. Laravel has something similar:

$lazyCollection = LazyCollection::make(function () {
    yield 1;
    yield 2;
    yield 3;
});

$collection = $lazyCollection->collect();

phoneyDev · October 27, 2021, 8:23pm

The reason I wanted to convert my Array(Set(list)) code to uniqued() is that uniqued() reads better. You read the pipeline left to right and each step of the pipeline is clearly shown. I'm not sure how you even say Array(Set(list)) out loud in a way that makes clear what it's doing. So yes list.uniqued().collect() reads better than Array(Set(list)). Having said that, a single method that uniques an Array would probably be preferable. If there are other uses for collect() then that would be a reasonable solution to this but it wouldn't make much sense if this is the only use for that method. Having two similar uniquing methods, one that is lazy and one that isn't might cause confusion I guess.

christopherweems · October 28, 2021, 6:52am

Maybe this helps?

extension Sequence where Element : Hashable {
    @_disfavoredOverload
    public func uniqued() -> [Element] {
        Array(self.uniqued())
    }
}

It avoids the Array allocation when .chained().withOtherMethods() and allows your original example to compile and run as you expected. Thanks to @_disfavoredOverload, it should only apply when the type of the variable you're updating is explicitly an Array.

Edit: Missed the requirement for Element: Hashable in the first version.

David_Sweeris · October 28, 2021, 6:36pm

Does this work? I’m on my phone so I can’t check.

extension LazySequenceProtocol {
  var industrious: [Element] {
    Array(self)
  }

LucianoPAlmeida · October 29, 2021, 2:22am

This is interesting @nnnnnnnn! There were any main concerns about taking this approach brought by people on those evolution discussions? I think this will be really nice, and coming from .NET where Linq has a lot of those conveniences ToArray, ToList, and so on ... I would say that are really useful.

Jon_Shier · October 29, 2021, 3:05am

My concern with adding such methods is that it encourages people to materialize an Array when they don't need it, creating a potential footgun, depending on the size of the collection. At the least it's an attractive nuisance which may trick users into thinking it's required if they later want to, say, loop over the elements. So I'm personally okay not having it since it shouldn't be required very often and when it is, it's usually best to have the user materialize it explicitly by initializing an Array or other concrete collection.

LucianoPAlmeida · October 29, 2021, 12:06pm

This is definitely a valid point, and from experience, our team had a lot of profiling work and significant improvements on GC load just by removing those calls to ToArray or ToList in application hot paths where not need, but as you mentioned they were because people actually think it was required or just copied the same chained operation from another place.

So I definitely agree with this point and even add that it can cause extreme high complexity in cases like:

for let elt in a_huge_list.toArray() {
   for (let _ in elt.another_list.toArray()) { // This could be a high complexity increase 
    ...
   }
}

christopherweems · October 29, 2021, 3:07pm

If we go down this path, I'd like to suggest naming this .asArray().

scanon · October 29, 2021, 3:10pm

Another alternative, which is maybe too magic, would be to add a disfavored overload that produces an Array. Then, when used in a type context that doesn't determine a return type you would get the UniquedSequence result, but when used in a context that requires an array (like assigning to an existing variable), you would get an Array if needed.

David_Smith · October 29, 2021, 7:47pm

My not-very-actionable hunch here is that generics improvements might be able to partially fix this situation from the other end. If it were natural to write code that operated on Collection and so on, we might not feel as much of a need to convert things to Array.