[Pitch #2] Add `indexed()` and `Collection` conformances for `enumerated()` and `zip(_:_:)`

Hi all,

This is a second pitch of the indexed() proposal (initial pitch thread, review thread). The following changes have been made since the original proposal:

  • The return type of the indexed() method has been renamed from Indexed to IndexedCollection.
  • The conditional RandomAccessCollection conformance of Zip2Sequence has been removed in favor of a new zip(_:_:) overload for random-access collections that returns a new Zip2RandomAccessCollection type, due to implementation difficulties. This has been elaborated on in the Alternatives considered.

I want to highlight a trade-off made regarding the BidirectionalCollection conformance of the Zip2Sequence and EnumeratedSequence types. EnumeratedSequence is simpler, so I'll explain the trade-off in terms of that type, but all of the below applies to both types equally.

Here’s what the Collection conformance could look like:

extension EnumeratedSequence: Collection where Base: Collection {
    struct Index {
        let base: Base.Index
        let offset: Int
    }
    var startIndex: Index {
        Index(base: _base.startIndex, offset: 0)
    }
    var endIndex: Index {
        Index(base: _base.endIndex, offset: 0)
    }
    func index(after index: Index) -> Index {
        Index(base: _base.index(after: index.base), offset: index.offset + 1)
    }
    subscript(index: Index) -> (offset: Int, element: Base.Element) {
        (index.offset, _base[index.base])
    }
}

extension EnumeratedSequence.Index: Comparable {
    static func == (lhs: Self, rhs: Self) -> Bool {
        return lhs.base == rhs.base
    }
    static func < (lhs: Self, rhs: Self) -> Bool {
        return lhs.base < rhs.base
    }
}

Here’s what the Bidirectional conformance could look like. The question is: should Base be required to conform to BidirectionalCollection or RandomAccessCollection?

extension EnumeratedSequence: BidirectionalCollection where Base: ??? {
    func index(before index: Index) -> Index {
        let currentOffset = index.base == _base.endIndex ? _base.count : index.offset
        return Index(base: _base.index(before: index.base), offset: currentOffset - 1)
    }
}

Notice that calling index(before:) with the end index requires computing the count of the base collection. This is an O(1) operation if the base collection is RandomAccessCollection, but O(n) if it's BidirectionalCollection.

Option 1: where Base: BidirectionalCollection

A direct consequence of index(before:) being O(n) when passed the end index is that some operations like last are also O(n):

extension BidirectionalCollection {
    var last: Element? {
        isEmpty ? nil : self[index(before: endIndex)]
    }
}
 
// A bidirectional collection that is not random-access.
let evenNumbers = (0 ... 1_000_000).lazy.filter { $0.isMultiple(of: 2) }
let enumerated = evenNumbers.enumerated()
 
// This is still O(1), ...
let endIndex = enumerated.endIndex
 
// ...but this is O(n).
let lastElement = enumerated.last!
print(lastElement) // (offset: 500000, element: 1000000)

However, since this performance pitfall only applies to the end index, iterating over a reversed enumerated collection stays O(n):

// A bidirectional collection that is not random-access.
let evenNumbers = (0 ... 1_000_000).lazy.filter { $0.isMultiple(of: 2) }

// Reaching the last element is O(n), and reaching every other element is another combined O(n).
for (offset, element) in evenNumbers.enumerated().reversed() {
    // ...
}

In other words, this could make some operations unexpectedly O(n), but it’s not likely to make operations unexpectedly O(n²).

Option 2: where Base: RandomAccessCollection

If EnumeratedSequence’s conditional conformance to BidirectionalCollection is restricted to when Base: RandomAccessCollection, then operations like last and last(where:) will only be available when they’re guaranteed to be O(1):

// A bidirectional collection that is not random-access.
let str = "Hello"

let lastElement = str.enumerated().last! // error: value of type 'EnumeratedSequence<String>' has no member 'last'

That said, some algorithms that can benefit from bidirectionality such as reversed() and suffix(_:) are also available on regular collections, but with a less efficient implementation. That means that the code would still compile if the enumerated sequence is not bidirectional, it would just perform worse — the most general version of reversed() on Sequence allocates an array and adds every element to that array before reversing it:

// A bidirectional collection that is not random-access.
let str = "Hello"

// This no longer conforms to `BidirectionalCollection`.
let enumerated = str.enumerated()

// As a result, this now returns a `[(offset: Int, element: Character)]` instead
// of a more efficient `ReversedCollection<EnumeratedSequence<String>>`.
let reversedElements = enumerated.reversed()

The base collection needs to be traversed twice either way, but the defensive approach of giving the BidirectionalCollection conformance a stricter bound ultimately results in an extra allocation.


Taking all of this into account, I’ve gone with option 1 for the sake of giving collections access to more algorithms and more efficient overloads of some algorithms. Conforming these collections to BidirectionalCollection when the base collection conforms to the same protocol is less surprising. I don’t think the possible performance pitfalls pose a large enough risk in practice to negate these benefits.

Comments are welcome! If this direction has downsides that weren't considered yet, then I’d love to hear about it.

6 Likes

How does the addition of indexed() intersect with users of the Algorithms package? I'd hate to lose that functionality on older OSes just because the standard library added it.

Edit: I think this is also a larger question for language evolution and Apple's side libraries in general. I think the swift-crypto framework has at least a partial solution here, since it ships an API that duplicates CryptoKit but under a separate import. Hard to do that with the standard library though.

I think you’ve made the right call for this feature.

However I also think that the difficulties you encountered speak to the need for a better way to achieve “manual” dynamic dispatch.

It is often said that Swift has exactly 3 ways to achieve dynamic dispatch: protocol requirements, subclass overrides, and manual casting.

The problem is that manual casting only works when you know ahead of time the type that you want to cast to.

Once SE-309 lands it will be possible to as?-cast the individual sequences to RandomAccessCollection and extract their counts, falling back to lockstep iteration when the casts fail.

(Technically it is already possible to achieve comparable semantics with the workaround seen here, but I won’t guess about the runtime overhead involved.)

That will allow the optimal time-complexity for Zip2Sequence.count in a single implementation. It could also be incorporated into implementations for things like index(_:offsetBy:), by getting the counts of as many of the sequences as are random-access.

• • •

But in the more general case, even SE-309 is not sufficient.

For example, and this is getting rather far afield, if we wanted to add some functionality to Collection where Element: BinaryInteger, and give it an optimized implementation when the elements actually conform to FixedWidthInteger, then SE-309 would not help, unless I’ve misunderstood something

It's not our plan to remove indexed() from the Algorithms package!

In general, I think it's fair to assume that the standard library version of functionality will always be as good or better than the package version. Ideally we'll strike upon a general solution that allows you to pick up the stdlib version when it's available, while continuing to use the package version when it isn't.

We'll play around with techniques like the one we pitched here to find a good fit for these package to stdlib migrations.

2 Likes

as? RandomAccessCollection (post-SE-309) was considered, but we decided against it because the performance penalty of the dynamic cast and the prevention of inlining and other optimizations do not seem to outweigh the benefits of only having a single overload for zip.

This approach does indeed work! We suspect it will have similar performance drawbacks though, but it would have been an excellent workaround if overloading zip wasn't on the table.

Agreed, being able to efficiently determine at runtime whether we're dealing with a random-access collection or not has benefits far beyond zip, e.g. for a more efficient index(_:offsetBy:) for FlattenSequence. Unfortunately the solutions available to us at the moment all have significant drawbacks.

1 Like