Array Subtraction

michaeleisel · September 7, 2021, 7:36pm

This method returns a sequence with all the elements from one array, A, that are not present in another array, B. This is the array version of Set's subtract method.

let animals = ["Cow", "Bulldog", "Labrador"]
return animals.without(["Bulldog", "Labrador"]) // ["Cow"]

We can see that this is an operation Swift developers are already trying to do from SO questions (1, 2), and it exists in other languages like Ruby.

If people are interested in this, then there are a few questions:

Would it return a lazy sequence or an array? (my preference is a lazy sequence)
Would it require all elements to conform to Hashable? (my preference is yes, or else we can't guarantee O(n) time, where n is the number of elements in A)

ben-cohen · September 7, 2021, 10:54pm

As you note with the Hashable conformance, you would probably want to build a Set of the subtractions to avoid a linear scan per element test. So it wouldn't be truly lazy.

Nevin · September 8, 2021, 12:40am

This sounds quite similar to removeAll(where:)

tera · September 8, 2021, 12:44am

i like this feature, as the naïve "array -> to set -> subtract another set -> back to array" alternative is not good: it'll mess with element order and duplicated elements.

quick and untested attempt to make a version that allows conditional hashable conformance:

func - <T: Equatable> (_ a: [T], _ b: [T]) -> [T] {
    if let a = a as? [AnyHashable], let b = b as? [AnyHashable] {
        let set = Set(b)
        return a.filter { !set.contains($0) } as! [T]
    } else {
        return a.filter { b.contains($0) }
    }
}

but that's good only if those "as? [AnyHashable]" casts are O(1) ... are they?

without conditional hashable requirement it's simpler:

extension Array where Element: Hashable {
    static func - (a: Self, b: Self) -> Self {
        let set = Set(b)
        return a.filter { !set.contains($0) }
    }
}

as per the name choice, i'd personally prefer "-" as shown above (plus the related -= mutating variant), but that might be not everyone's cup of tea. "without" is ok. "subtracting" (plus the related mutating variant "subtract") is also good.

ben-cohen · September 8, 2021, 1:37am

They are not.

michaeleisel · September 8, 2021, 2:22pm

It would be lazy in that it would go from O(n+m) to O(m) time upfront, where n and m are the lengths of the two arrays. And if this call is being chained with another one that takes a sequence, i.e. it doesn't need to be an array right after the call, it would skip unnecessary O(n) array creation.

Karl · September 8, 2021, 3:02pm

As @michaeleisel says, there is no requirement that creation of a lazy sequence/collection must be cheap or O(1).

LazySequenceProtocol

A sequence on which normally-eager sequence operations are implemented lazily.

Lazy sequences can be used to avoid needless storage allocation and computation, because they use an underlying sequence for storage and compute their elements on demand. For example, doubled in this code sample is a sequence containing the values 2 , 4 , and 6 .
let doubled = [1, 2, 3].lazy.map { $0 * 2 }
Each time an element of the lazy sequence doubled is accessed, the closure accesses and transforms an element of the underlying array.

The docs

I'm firmly in the camp that the type being described here would indeed be truly lazy. It computes its elements on demand. As I read it, that's the only criteria. In order to achieve that, it may need to perform additional calculations or allocate on creation, and that's fine. Whether or not those are "needless" depends on the lazy collection, and whether those creation-time calculations actually reduce the overall calculations or allocations when used for its intended purpose.

ben-cohen · September 8, 2021, 6:06pm

Couple of things here:

You appear to be language lawyering something that is not written up to specification quality. In these cases, we need a principles-based rather than rules-based approach, where the real reasons why things are lazy are discussed and then considered in the context of a new proposal. It's also worth considering clarifying the docs, though making guarantees requires an evolution proposal. But absence of such a proposal does not mean you should treat what is written as "the rules".
You are pointing to the rules for LazySequenceProtocol but that is not what we are talking about here. Rather, we are talking about whether the default behavior of this function should be lazy. If it isn't, then there's a potential follow-on task to create a .lazy version.

The underlying principles here that have driven these decisions in the past:

Something should be lazy by default if there are no downsides to that laziness. So for example, reversing a bi-di collection or enumerating a sequence is lazy, because there is no risk or cost to that. Whereas lazy mapping is explicit because there are two downsides: closures can capture and mutate state so iterating behaves differently each time which is a no-no for collections, and unexpected cost of running an expensive mapping operation unexpectedly multiple times.

When something is made explicitly lazy, one reason to choose to do that is to guarantee no allocations. This would not fulfill that criteria.

I'm generally pretty skeptical of "m+n" complexity guarantees, especially as they rely on assumptions about "normal" use (i.e. the subtractee is always longer than the subtraction) that may not always hold, and also because it's a step too far in expecting people to read the docs carefully to understand the consequences of laziness.

tera · September 8, 2021, 6:40pm

lazy vs non lazy choice aside, how do we feel about the pitch itself? +1 from me.

Karl · September 8, 2021, 8:32pm

I don't think so; I'm trying to figure out what it is that people expect from a "lazy" sequence or collection. A lazy variable is simple enough to understand - there's only one thing you can do with it, which is to read the value. But what is a lazy collection? From what the documentation says, it appears it was intended to represent a sequence or collection whose elements are calculated on-demand.

And I think a lazy collection which needs to do some calculations in order to achieve that still fit perfectly well within that model.

Apologies; your comment says "you would probably want to build a Set of the subtractions to avoid a linear scan per element test. So it wouldn't be truly lazy."

I took that as meaning that any collection which requires a Set be constructed wouldn't even qualify as being "truly lazy". This topic has come up several times before (e.g. whether a lazy filter collection which scanned for its startIndex on creation would qualify as lazy), so I thought you were alluding to a similar issue.

1-877-547-7272 · September 8, 2021, 9:07pm

I think it would be better to build on the Sequence protocol rather than just Array. Something like this:

extension Sequence where Element: Hashable {
    func removingAll(in set: Set<Element>) -> RemovingSequence<Self> {
        return RemovingSequence(removedElements: set, base: self)
    }
}

struct RemovingSequence<Base: Sequence>: Sequence where Base.Element: Hashable {
    let removedElements: Set<Element>
    let base: Base
    
    struct Iterator: IteratorProtocol {
        let removedElements: Set<Element>
        var iterator: Base.Iterator
        
        mutating func next() -> Base.Element? {
            guard let element = iterator.next() else { return nil }
            
            guard !removedElements.contains(element) else { return next() }
            
            return element
        }
    }
    
    func makeIterator() -> Iterator {
        return Iterator(removedElements: removedElements, iterator: base.makeIterator())
    }
}

If the user wanted to create the Set from another sequence, they could do so manually.

Paul-Ollivier · September 14, 2021, 10:29am

I like the proposition, lazy sequence / array aside