Implementation of Collection Versions of Zip

Introduction

I'm currently implementing zipLongest(_:_:) (apple/swift#18769).

In the pitch phase, @Ben_Cohen mentioned the possibility of providing Collection versions of zip. Indeed, this would be a nice feature to have for both zipLongest(_:_:) and zip(_:_:).

This means that we have to introduce two new types, namely ZipLongest2Collection and Zip2Collection. (I'll primarily focus on the first, but everything is applicable to the latter as well.)

It turns out that it is not as straightforward as I thought it would be and would love some feedback from the community.

Current Implementation

An always up-to-date implementation can be found as a Swift package here.

Current Points Of Discussion

The current points of discussion can be found here.

:+1:

Out-of-date part of original post.

Implementation

Overload zipLongest(_:_:) specifically for Collections. It acts as the constructor for a ZipLongest2Collection:

@inlinable public func zipLongest
    <Collection1: Collection, Collection2: Collection>
    (_ collection1: Collection1, _ collection2: Collection2)
    -> ZipLongest2Collection<Collection1, Collection2> {
  return ZipLongest2Collection(
    _collection1: collection1, _collection2: collection2
  )
}

The general definition of ZipLongest2Collection:

@_fixed_layout public struct ZipLongest2Collection
    <Collection1: Collection, Collection2: Collection> {
  @usableFromInline internal let _collection1: Collection1
  @usableFromInline internal let _collection2: Collection2

  @inlinable public init (
    _collection1 collection1: Collection1,
    _collection2 collection2: Collection2
  ) {
    (_collection1, _collection2) = (collection1, collection2)
  }
}

Conformance to Collection:

extension ZipLongest2Collection: Collection {
  public typealias Index = Int
  public typealias Element = (Collection1.Element?, Collection2.Element?)

  @inlinable public var startIndex: Index { return 0 }

  @inlinable public var endIndex: Index {
    return Swift.max(collection1.count, collection2.count)
  }

  @inlinable public subscript (index: Index) -> Element {
    let index1 = collection1.index(collection1.startIndex, offsetBy: index)
    let index2 = collection2.index(collection2.startIndex, offsetBy: index)

    return (
      collection1.indices.contains(index1) ? .some(collection1[index1]) : .none,
      collection2.indices.contains(index2) ? .some(collection2[index2]) : .none
    )
  }

  @inlinable public func index (after i: Index) -> Index { return i + 1 }
}

For completion's sake, the implementations for BidirectionalCollection and RandomAccessCollection:

extension ZipLongest2Collection: BidirectionalCollection
   where Collection1: BidirectionalCollection, Collection2: BidirectionalCollection {
  @inlinable public func index (before i: Index) -> Index { return i - 1 }
}

extension ZipLongest2Collection: RandomAccessCollection
  where Collection1: RandomAccessCollection, Collection2: RandomAccessCollection {}

A simple example (for context and to gain an intuition of how the algorithm works):

print(Array(zipLongest([0, 1, 2], [0])))
// Prints: "[(Optional(0), Optional(0)), (Optional(1), nil), (Optional(2), nil)]"

Rationale

Index

We need a concrete type for Index, but which one should we choose?

  1. A pair is a compound type, this means that we have two indices, one for Collection1 and one for Collection2. They might have the same Index, but we can't make this a requirement as it would leave out too many use-cases.

  2. We can't make Index a compound type from the indices of both Collections.

  3. We can't use the Index of the longest Collection since we can only confirm the longest Collection at run-time.

From deduction, the most obvious choice here is Int. It's got precedence in Array, other types from the Standard Library and C. This leads to the next issue.

Bottlenecks

Since our Index differs from the underlying indices, we can't compare them. This means that we need alternative solutions to implement endIndex and subscript(_:).

endIndex

We can use count to retrieve our endIndex:

@inlinable public var endIndex: Index {
  return Swift.max(collection1.count, collection2.count)
}

But now we've hit a bottleneck. We're expected to provide our endIndex in O(1).

count is O(1) if the Collection conforms to RandomAccessCollection, otherwise it's O(n), where n is the length of the Collection.

subscript(_:)

Here we use index(_:offsetBy:) to retrieve the underlying indices and contains(_:) and subscript(_:) to check for the existence and retrievement of its associated Element:

@inlinable public subscript (index: Index) -> Element {
  let index1 = collection1.index(collection1.startIndex, offsetBy: index)
  let index2 = collection2.index(collection2.startIndex, offsetBy: index)

  return (
    collection1.indices.contains(index1) ? .some(collection1[index1]) : .none,
    collection2.indices.contains(index2) ? .some(collection2[index2]) : .none
  )
}

Again, its time complexity is O(n) where O(1) is expected unless the underlying Collections conform to RandomAccessCollection.

(contains(_:) is needed here for safety. We're not certain if one of the elements lies outside of the shorter Collection.)

last

The default implementation of last, by way of BidirectionalCollection, is the following:

public var last: Element? {
  return isEmpty ? nil : self[index(before: endIndex)]
}

Same story here, our implementation of endIndex makes this O(n) and RandomAccessCollection saves the day once again.

Points of Discussion

Taking performance semantics and usability into consideration, what would be the best way forward? How should these types be implemented? (Or maybe you see room for improvement?)

Currently I see two options:

  1. If the implementation specified above meets the threshold of Standard Library inclusion, properly document all complexity exceptions.

  2. Only conditionally conform to RandomAccessCollection iff Collection1 and Collection2 conform to RandomAccessCollection. This excludes all other Collections down the protocol chain. For these we've still got ZipLongest2Sequence as a back-up.

2 Likes

It's not immediately clear to me why this is the case, but you probably know better that I do because I haven't done much work on custom collections. Do you mind explaining this? My idea of the natural index type would be the pair of indices that you advance simultaneously, one from each collection, with a custom endIndex to avoid having to do an O(n) work there. Where does this approach fail?

1 Like

It's actually pretty simple. Index has to conform to Comparable. Since built-in tuples are non-nominal, and therefore can't conform to protocols, they can't be used. It's the reason why you can't use a tuple as a Key in a Dictionary (since it can't conform to Hashable).

I wonder if it'll be worth it to cache the count for each constituent collection. Also cache both endIndex values and the starting index of the longer collection's suffix, if the collections are unequal in length. The only time that inappropriately O(n) stuff happens would be in the initializer, to set up this cache.

Edit: the last "to" was accidentally "so."

Sure, but the compound type doesn't have to be a tuple. e.g. here's a very rough sketch of a minimal implementation of Zip2Collection with an enum index (forgive the poorly chosen names and any errors):

struct Zip2Collection<S1: Collection, S2: Collection>: Collection {
  private var s1: S1
  private var s2: S2
  
  typealias Element = (S1.Element, S2.Element)
  
  enum Zip2CollectionIndex: Comparable {
    case pair(S1.Index, S2.Index)
    case end
    
    static func < (lhs: Zip2CollectionIndex, rhs: Zip2CollectionIndex) -> Bool {
      switch (lhs, rhs) {
      case (.end, .end): return false
      case (.end, _): return false
      case (_, .end): return true
      case let (.pair(lhsi1, lhsi2), .pair(rhsi1, rhsi2)):
        guard (lhsi1 < rhsi1) == (lhsi2 < rhsi2) else { fatalError("index mismatch")}
        return lhsi1 < rhsi1
      }
    }
  }
  
  typealias Index = Zip2CollectionIndex
  
  var startIndex: Index { return .pair(s1.startIndex, s2.startIndex) }
  var endIndex: Index { return .end }
  
  subscript(index: Index) -> Element {
    guard case let .pair(i1, i2) = index else { fatalError("invalid index") }
    return (s1[i1], s2[i2])
  }
  
  func index(after i: Index) -> Index {
    switch i {
    case .end: fatalError("can't advance endIndex")
    case let .pair(i1, i2):
      let nexti1 = s1.index(after: i1)
      let nexti2 = s2.index(after: i2)
      if nexti1 == s1.endIndex || nexti2 == s2.endIndex {
        return .end
      }
      return .pair(nexti1, nexti2)
    }
  }
  
  init(_ s1: S1, _ s2: S2) {
    self.s1 = s1
    self.s2 = s2
  }
}

A real implementation would want to hide the index type better, so invalid indices can't be directly constructed, and do a lot of specialisation for performance reasons.

2 Likes

You pretty much have to use a custom Index here anyway.

public struct Index: Comparable {
    let i1: Collection1.Index
    let i2: Collection2.Index

    public static func <(_ lhs: Index, _ rhs: Index) {
        return return lhs.iLonger < rhs.iLonger
    }
}

Obviously, you have to include logic to determine which of the two constituent indexes has a longer range. This is done in the initializer as mentioned in my previous reply. This means you have to include a third property in the Index type to record which index property to use for comparisons.

... Hmm ...

public enum Index: Comparable {
    case first(Collection1.Index)
    case second(Collection2.Index)

    public static func <(_ lhs: Index, _ rhs: Index) {
        switch (lhs, rhs) {
        case (.first(let fl), .first(let fr)):
            return fl < fr
        case (.second(let sl), .second(let sr)):
            return sl < sr
        default:
            preconditionFailure("Mixed index states")
        }
    }
}

I forgot that we have to cache both constituents' elements' indexes in the outer Index:

public enum Index: Comparable {
    case useFirst(Collection1.Index, Collection2.Index)
    case useSecond(Collection1.Index, Collection2.Index)

    public static func <(_ lhs: Index, _ rhs: Index) {
        switch (lhs, rhs) {
        case (.useFirst(let fl, _), .useFirst(let fr, _)):
            return fl < fr
        case (.useSecond(_, let sl), .useSecond(_, let sr)):
            return sl < sr
        default:
            preconditionFailure("Mixed index states")
        }
    }
}

Let's improve that again:

public struct Index: Comparable {
    enum Internals: Equatable {
        case useFirst(Collection1.Index, Collection2.Index?)
        case useSecond(Collection1.Index?, Collection2.Index)
    }

    let state: Internals

    public static func < (lhs: Index2, rhs: Index2) -> Bool {
        switch (lhs.state, rhs.state) {
        case (.useFirst(let fl, _), .useFirst(let fr, _)):
            return fl < fr
        case (.useSecond(_, let sl), .useSecond(_, let sr)):
            return sl < sr
        default:
            preconditionFailure("Mixed index states")
        }
    }
}

This has been really helpful, thank you.

My response was originally longer, but I'll leave that for when I have a somewhat final version ready.

I've been hacking away at it a bit more. Currently, the design takes cues from both Dictionary.Index and ClosedRange.Index:

extension ZipLongest2Collection {
  @_frozen public enum Index {
    case index (Pair)
    case endIndex
  }
}

extension ZipLongest2Collection.Index {
  @_fixed_layout public struct Pair {
    public let first: Collection1.Index
    public let second: Collection2.Index

    @inlinable internal init (_ first: Collection1.Index, _ second: Collection2.Index) {
      (self.first, self.second) = (first, second)
    }
  }
}

extension ZipLongest2Collection.Index: Comparable {
  @inlinable public static func < (lhs: ZipLongest2Collection.Index, rhs: ZipLongest2Collection.Index) -> Bool {
    switch (lhs, rhs) {
      case (.endIndex, _):
        return false

      case (_, .endIndex):
        return true

      case let (.index(lhsPair), .index(rhsPair)):
        return lhsPair < rhsPair
    }
  }
}

extension ZipLongest2Collection.Index.Pair: Comparable {
  @inlinable public static func < (lhs: ZipLongest2Collection.Index.Pair, rhs: ZipLongest2Collection.Index.Pair) -> Bool {
    return lhs.first < rhs.first && lhs.second < rhs.second
  }
}

Here .index is not directly constructable since the initialiser of ZipLongest2Collection.Index.Pair is internal.

Dictionary is built around the same construct; you can't directly construct a Dictionary.Index. Instead, it has index(OfKey:) which returns an Index? and an additional public subscript(_:) that takes a Key and returns a Value? (which is not a Collection requirement as Key != Index).

I'd say, let's steal this construct.

This would be our equivalent of index(OfKey:):

@inlinable public func index (first: Collection1.Index, second: Collection2.Index) -> Index? { <#...#> }

Similarly, we could create two additional safe subscripts that validate the input:

@inlinable public subscript (_ index: Int) -> Element? { <#...#> }
@inlinable public subscript (_ first: Collection1.Index, _ second: Collection2.Index) -> Element? { <#...#> }

Here Int is our Key and Element? our Value?.

I think this design has a lot of potential.

Yeah, something like this seems like a good approach, thanks for working on it. I would only say that

return lhs.first < rhs.first && lhs.second < rhs.second

should be a fatal error on mismatch instead of a silent false, since there should be no way for this to occur if the index initialisers are internal and you're careful to advance the iterators simultaneously. This definition breaks some of the semantic guarantees of <, e.g. both a < b and b < a can be false.

This is probably not something you're asking about (I haven't read the whole post yet), but I think it's possible that the range check inside the index(_:offsetBy:) will trap for a shorter collection here. Using index(_:offsetBy:limitedBy:) would not have this limitation, it would also eliminate the need for collection.indices.contains(...) call later.


A ClosedRange.Index approach to implementing ZipLongestCollection.Index won't work for BidirectionalCollection conformance, which requires index(before:). There is no obvious way to derive a last valid index value (upd: that is, no obvious way of doing it in constant time). For ClosedRange it's trivial, it's just an upperBound, here – not so much.

1 Like

What do you mean exactly here with mismatch? We should also provide the false case, no?

I had another go at it. (It's really tricky to get right.)

These are all the cases for which (lhs < rhs) == true:

         0 1    0 1 2    0 1 2
first   |β—–|β——|  | |β—–|β——|  | |●|
second  |β—–|β——|  | |●|    | |β—–|β——|
         (1)     (2)      (3)

β—– = lhs, β—— = rhs
  1. lhs.first.index < rhs.first.index && lhs.second.index < rhs.second.index, or;
  2. lhs.first.index < rhs.first.index && lhs.second.index == rhs.second.endIndex, or;
  3. lhs.first.index == rhs.first.endIndex && lhs.second.index < rhs.second.index.

Given the following definition:

public enum Index: Comparable {
  @_fixed_layout public struct Position: Comparable {
    public let first: (index: Collection1.Index, endIndex: Collection1.Index)
    public let second: (index: Collection2.Index, endIndex: Collection2.Index)

    @inlinable internal init (
      first: (index: Collection1.Index, endIndex: Collection1.Index),
      second: (index: Collection2.Index, endIndex: Collection2.Index)
    ) {
      (self.first, self.second) = (first, second)
    }

    @inlinable public static func < (lhs: Position, rhs: Position) -> Bool {
      return
        lhs.first.index < rhs.first.index && lhs.second.index < rhs.second.index ||
        lhs.first.index < rhs.first.index && lhs.second.index == rhs.second.endIndex ||
        lhs.first.index == rhs.first.endIndex && lhs.second.index < rhs.second.index
    }

    @inlinable public static func == (lhs: Position, rhs: Position) -> Bool {
      return lhs == rhs
    }
  }

  case position (Position)
  case pastEnd

  @inlinable public static func < (lhs: Index, rhs: Index) -> Bool {
    switch (lhs, rhs) {
      case (.pastEnd, _):
        return false

      case (_, .pastEnd):
        return true

      case let (.position(lhsPosition), .position(rhsPosition)):
        return lhsPosition < rhsPosition
    }
  }
}

Renamed Pair to Position and .endIndex to .pastEnd to better reflect the docs for Index ("Valid indices consist of the position of every element and a β€œpast the end” position that’s not valid for use as a subscript argument.") and naming conventions in the source code of the Standard Library.

Am I missing something here? We're still able to use endIndex in O(1), right? Can't find anything in the docs.

I'll ignore the end indices and just focus on the case of valid indices, because you can either represent the end index as a separate enum case or test for it everywhere like in your latest example (this approach makes it hard to support bidirectionality though, and is generally more complex). By mismatch I mean any case where lhs.first < rhs.first does not equal lhs.second < rhs.second (i.e. one is true and the other is false). This can only happen as the result of a logical error in the implementation, e.g. accidentally advancing only one index of the pair and not the other. So this situation should be a fatal error, instead of just returning false like in your examples. See the code in my earlier post for an example.

Ah, good catch.

"Index mismatch" might be a bit too vague as a message. "asymmetry doesn't hold" probably is too.

The checks for endIndex are needed in the case of zipLongest(_:_:) (not for zip(_:_:) though) since the endIndex of the shortest Collection stays the same while iterating past this point.

I'll look into it. :+1:

Still wondering if doing a O(n) double-pass through both sequences on init to cache both counts and the longer collection's suffix's index would give better performance. (Should the double run-through and caching be delayed until the first time you need to go beyond startIndex instead?)

Someone with a deeper knowledge of the inner guts of Collections will have to tell you if caching would be beneficial in this case or not.

Currently, the only downside to not having access to the length of the underlying Collections is that the index argument is not validated in index(after:) when the underlying Collections are only Collections. index is validated in index(before:) since we have access to distance(from:to:). (I will provide an overload for index(after:) on BidirectionalCollection that does validate the index.)

It turns out that Index can be implemented without a nested enum. Optional indices were at one point also implemented and discarded. The final design will probably end up looking like this:

@_fixed_layout public struct Index: Comparable {
  public let index1: Collection1.Index
  public let index2: Collection2.Index

  @inlinable internal init (_ index1: Collection1.Index, _ index2: Collection2.Index) {
    (self.index1, self.index2) = (index1, index2)
  }

  @inlinable public static func < (lhs: Index, rhs: Index) -> Bool {
    precondition(
      (lhs.index1 < rhs.index1) == (lhs.index2 < rhs.index2),
      "Indices failed to hold their antisymmetric property (lhs < rhs should imply !(rhs < lhs))"
    )

    return lhs.index1 < rhs.index1 && lhs.index2 < rhs.index2
  }
}

Are you fixing the index for the shorter collection to be that collection's endIndex whenever you're in the territory of the longer collection's suffix?

Exactly. We could use nil or the .pastEnd case of an enum, but it's all the same really.

This still would make implementing index(before:) hard to say the least. When one collection is longer by 2 elements, how would the last index (not the endIndex, but the one before that) of a zipped collection be obtained?

Here's the current implementation of index(before:):

extension ZipLongest2Collection: BidirectionalCollection, RandomAccessCollection where
    Collection1: RandomAccessCollection, Collection2: RandomAccessCollection {
  @inlinable public func index (before index: Index) -> Index {
    switch (index.index1, index.index2) {
      case (collection1.startIndex, collection2.startIndex):
        preconditionFailure("Can't decrement beyond startIndex")

      case (let index1, collection2.endIndex) where
          collection1.distance(from: collection1.startIndex, to: index1) >
          collection2.distance(from: collection2.startIndex, to: collection2.endIndex):
        return Index(collection1.index(before: index1), collection2.endIndex)

      case (collection1.endIndex, let index2) where
          collection2.distance(from: collection2.startIndex, to: index2) >
          collection1.distance(from: collection1.startIndex, to: collection1.endIndex):
        return Index(collection1.endIndex, collection2.index(before: index2))

      case (collection1.startIndex, _), (_, collection2.startIndex):
        preconditionFailure("Can't decrement beyond startIndex")

      case let (index1, index2):
        return Index(collection1.index(before: index1), collection2.index(before: index2))
    }
  }
}

To answer your question:

typealias Index = ZipLongest2Collection<[Int], [Int]>.Index

let xs = zipLongest([Int](), [0, 1]) // (Or zip(longest:_:), still undecided.)

assert(xs.endIndex == Index(0, 2))
assert(xs.index(before: xs.endIndex) == Index(0, 1))
assert(xs.index(before: xs.index(before: xs.endIndex)) == xs.startIndex)

I'll post a repo containing a Swift package with the implementation and unit tests later today so people can follow along and play with (and poke holes in) it.