SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

bzamayo · December 16, 2019, 9:59pm

I read the above explanations and I am still inclined towards indices over ranges. It’s not a big deal either way but I don’t see RangeSet not being an actual collection as a reason against using the name indices, I see the operation as returning indexes and that’s what it does. The fact that the indexes are returned in this slightly-indirect form in a RangeSet, for performance reasons, is irrelevant IMO as to the purpose of the method.

I'm also not sure the Collection.range(at:) method holds its weight. The proposal doesn't really say why it is so important that it needs a convenience method — maybe it comes up a lot and I simply can't imagine the use cases. On a raw 'character count battle', writing collection.range(at: index) isn't saving much compared to typing index..<collection.index(after: index) manually.

Apart from those minor niggles, I can't find any meaningful issues with the newly stripped-down proposal and very much support its addition to the standard library.

dabrahams · December 17, 2019, 9:56pm

Actually, I was only trying to say that using “range” to denote a contiguous set of positions in a collection would be very much coining a term of art, because we don't say collections have ranges, and if we did, the first assumption I'd make was that we meant a range of values in the collection, rather than a contiguous range of positions in the collection. “Region” doesn't have the same problem, and thus doesn't need to be treated as a term of art, because we do regularly think and speak of collections spatially. So if you know what a collection is, you know what a region of the collection delineates a range of positions.

John_McCall · December 17, 2019, 10:45pm

Taking off my review-manager hat:

That does seems somewhat undermined by the fact that the type for regions is always just Range. But I agree that, under the current design, it makes sense to either have a term of art for this or to name the methods things like indexRanges(where:).

The alternative design would be for RangeSet to act like a collection of indices and provide some alternative view for getting at the ranges.

Ben_Cohen · December 17, 2019, 11:09pm

The trouble is it can't do this without capturing the collection too (or abandoning the optimization), because it's needed to get the individual indices from the ranges (even with strideable ranges, because index types are reused by wrappers like lazy filtered collections).

John_McCall · December 17, 2019, 11:12pm

Ah, that's fair.

Although that's interesting about the lazy filtered collections. This is beside the point, but does that mean it's essentially never correct to use the striding operations unless you know it's okay for the collection? Why do we not wrap the indices?

Ben_Cohen · December 17, 2019, 11:13pm

I think given our time again this would probably have been a better plan.

John_McCall · December 17, 2019, 11:14pm

Understood, thanks.

kiel · December 18, 2019, 2:22am

I’m curious. I might also be misunderstanding something too but: what’s the rush?

Why not get it right the first time than commit to supporting something less than ideal into the foreseeable future? I guess not every ideal should be implemented.

John_McCall · December 18, 2019, 2:23am

That particular choice — the index type of the lazy filter collections — has already been committed to; it’s in the ABI.

kiel · December 18, 2019, 2:25am

That’s what I misunderstood. Thanks!

dabrahams · December 18, 2019, 3:20am

`gather` Should Be Independent of `RangeSet`

[I have a few issues with the proposal that I'll be raising in separate posts.]

One of the issues I had with the previous proposal was that it failed to separate independent concerns. The most glaring examples—the methods that mutate a RangeSet based on a Collection—were removed from in this proposal, but one still remains: there's an overload of gather that takes a RangeSet argument.

To review, the proposed gather methods are as follows:

extension MutableCollection {
    /// Collects the elements at the given indices just before the element at 
    /// the specified index.
    @discardableResult
    public mutating func gather(
        _ indices: RangeSet<Index>, at insertionPoint: Index
    ) -> Range<Index>

    /// Collects the elements that satisfy the given predicate just before
    /// the element at the specified index.
    @discardableResult
    public mutating func gather(
        at insertionPoint: Index, where predicate: (Element) -> Bool
    ) -> Range<Index>
}

Why is the first method's first parameter a RangeSet? Its algorithm will work with any representation of a set of a collection's indices. In fact, all it needs is a predicate that tells us whether the element at a given index should be gathered. If someone is already working with a Set<Index>, or already has a predicate that answers the question, we should not force them to reconstitute it as a RangeSet<Index>. That code looks like this:

let s = someSetOfIndices()

// O(a.count), plus an allocation
let rs = RangeSet(
        a.indices.lazy
            .filter { i in s.contains(i) }
            .map { i in i..<a.index(after: i))

a.gather(rs, at: destination)

Actually, that's pretty fancy code. It's more likely they'd end up with this, which performs worse:

let s = someSetOfIndices()

// Same big-O, but probably many allocations
var rs = RangeSet<A.Index>()
for i in a.indices where s.contains(i) {
   rs.insert(contentsOf: i..<a.index(after: i))
}

a.gather(rs, at: destination)

and I doubt most people would be aware of the difference.

The second gather overload is almost what we'd need to solve the problem, but its predicate works on elements, rather than indices. With a gather method that takes a predicate over indices, the user could just write:

let s = someSetOfIndices()
a.gather(at: destination) { i in s.contains(i) }

That version of gather could be declared as follows (but see below for a problem):

    /// Collects the elements at positions satisfying the given predicate
    /// just before the element at the specified index, preserving their
    /// relative order.
    @discardableResult
    public mutating func gather(
        at insertionPoint: Index, where predicate: (Index) -> Bool
    ) -> Range<Index>

The One True `gather`

It's tempting to think we could use this new gather method to do the work of the second overload:

a.gather(at: destination) { i in somePredicateOverElements(a[i]) }

The problem is that the closure captures a, which will force the first mutation made by gather to copy a's storage, even if it was otherwise uniquely-referenced.

It's also problematic to add this new gather method as an overload, because there are common cases where the Index type is the same as that of the Element (e.g. Array<Int>), which would make invoking the method ambiguous.

So, if we are going to solve this problem with one method taking a predicate argument, it should be a binary predicate over elements and indices:

    /// Collects the elements satisfying the given predicate over their
    /// values and positions just before the element at the specified index,
    /// preserving their relative order.
    @discardableResult
    public mutating func gather(
        at gatheringPlace: Index, 
        where predicate: (Element, Index) -> Bool
    ) -> Range<Index>

Using it with a RangeSet or a Set<Index> would be equally simple:

a.gather(at: gatheringPlace) { _, i in someSetOfIndices.contains(i) }

Once you have this method, it seems to me that the second overload buys very little; you get to drop “, _” from

a.gather(at: gatheringPlace) { e, _ in somePredicate(e) }

The Generic Approach

Another way of making gather independent of RangeSet would be to

introduce a primitive set protocol that only supports containment testing
make the first gather overload generic over the set type
preserve the second gather overload

But this seems to add lots of complexity with little benefit. You still need two overloads. That primitive set protocol is, semantically, nothing more than a predicate function that answers the containment question. Also, if what you're starting with is a predicate over indices, you'd have to build a conforming wrapper type just to use it, which would interrupt the flow of code.

Conclusion

gather should be a single method taking a binary predicate as suggested above. The simplicity of the result, both for the library and for its users, is telling us that's the right design. I hope we'll listen.

-Dave

Nevin · December 18, 2019, 3:53am

I thought the benefit of RangeSet as relates to gather is that the indices are already sorted in ascending order and grouped by contiguous ranges. That should allow substantial improvements over an arbitrary predicate-set, right?

• • •

Regardless, I can’t shake the feeling that a great many of the difficulties, complexities, and performance traps related to collection algorithms could be entirely eliminated if we had functionality like myCollection.indicesWithoutIncrementingRetainCount. That would allow the obvious way to mutate the elements of a generic MutableCollection, it would allow the simple version of gather that @dabrahams described, and in general it would make indices work the way people expect instead of triggering unwanted copy-on-write.

dabrahams · December 18, 2019, 4:33am

Actually not: any RangeExpression where Bound == C.Index represents a contiguous region of a Collection. That includes, for example, ClosedRange.

Technicalities aside, even if the claim were true, it would support the name regions. Method names should reflect semantics, not types. I don't think ranges reflects semantics particularly well, but even if it did, it would be at a disadvantage. When you read it, you don't know whether the author's intention is to refer to the type, or to the more general meaning of the word. Sometimes you just can't avoid using the same word as an involved type name, no matter how you try, but if you can reflect semantics well with another word, that's almost always the better choice.

IMO-ly y'rs,
Dave

dabrahams · December 18, 2019, 4:45am

Ah, well: first of all, the lookups in RangeSet are O(log N), and there are N lookups, so it doesn't change the big-O of gather to know you have a RangeSet.

But yes, there's a constant factor. How can we avoid that overhead? IMO the right answer is to provide a “hinted” contains method on such sorted sets. For RangeSet the hint type would be Int, and you'd use it like this:

var hint: Int?
a.gather(at: gatheringPlace) { _, i in someRangeSet.contains(i, &hint) }

All sets can support this pattern (though the hint will only speed lookups in sorted sets), and if/when we extend the set protocol hierarchy, we can build the pattern in so algorithms can use it generically.

CTMacUser · December 18, 2019, 12:54pm

dabrahams:

So, if we are going to solve this problem with one method taking a predicate argument, it should be a binary predicate over elements and indices:
/// Collects the elements satisfying the given predicate over their
/// values and positions just before the element at the specified index,
/// preserving their relative order.
@discardableResult
public mutating func gather(
    at gatheringPlace: Index, 
    where predicate: (Element, Index) -> Bool
) -> Range<Index>

As I said in another post on discontinuous element manipulation, (having only) a predicate-based method is inherently flawed. It must test all elements, even when only a relatively few will be targeted. Imagine touching only 10 elements out of 1000. An sibling function that can specify its targets by exact indices stored in ranges, (sorted) sets, etc. is still needed.

curt · December 18, 2019, 2:32pm

dabrahams:

That version of gather could be declared as follows (but see below for a problem):

    /// Collects the elements at positions satisfying the given predicate
    /// just before the element at the specified index, preserving their
    /// relative order.
    @discardableResult
    public mutating func gather(
        at insertionPoint: Index, where predicate: (Element) -> Bool
    ) -> Range<Index>

Did you intend the argument to the predicate here to be an Index?

dabrahams · December 18, 2019, 3:30pm

Uh, yeah, thanks. Gonna fix up that post if I still can.

dabrahams · December 18, 2019, 3:55pm

I'm not so sure both are needed, at least initially; see what you think of my reasoning. First, this is not a big-O difference—the algorithm needs to touch all the elements anyway—so one should only expect to observe an effect when the predicate is expensive. Second, the cost of hinted lookups in a sorted set is extremely low, because the algorithm always makes its tests in order. In fact, when the algorithm is inlinable, a good optimizer could generate the same code.

Further, if you think about the divide-and-conquer nature of the algorithm itself, it's not obvious how to take good advantage of a sorted data structure, because the divisions in the RangeSet are in different places. I can imagine, someday, having a parallel version of the algorithm that might benefit from slicing up the RangeSet, but that is extreme speculation.

I'd like to see some numbers and possibly hear from optimization experts before complicating the library on this account.

benrimmington · December 18, 2019, 6:27pm

Would there be type checker issues?

regions(where:) method of Collection
regions property of DataProtocol

Chris_Lattner3 · December 18, 2019, 6:42pm

The gather method seems very subtle to me, particularly given it mutates in place and does not return a new collection. This behavior conflicts with the typical meaning of gather (as in "scatter and gather") in other domains. Has anyone considered something like moveElements as the base name?

SE-0270 (Review #2): Add Collection Operations on Noncontiguous Elements

gather Should Be Independent of RangeSet

The One True gather

The Generic Approach

Conclusion

`gather` Should Be Independent of `RangeSet`

The One True `gather`