Improved zip API

dennisvennink · December 19, 2018, 7:57pm

Proposal: SE-NNNN
Author: Dennis Vennink
Review Manager: To Be Determined
Status: Awaiting Review
Implementation: apple/swift#18769
Bug: SR-3760

Introduction

This proposal addresses the limitations of the Standard Library's current zip API by adding improvements in the form of enhancements and additions.

It's comprised of:

the types Zip[2–6]Index, Zip[2–6]Sequence and ZipLongest[2–6]Sequence;
the global functions zip(_:_:[_:[_:[_:[_:]]]]), zipLongest(_:_:[_:[_:[_:[_:]]]]);
and the extension method unzip() on Sequence and LazyCollectionProtocol.

Discussions

It has been discussed in the following Evolution forum threads:

Its development has been discussed in the following Development forum thread:

Implementation of Collection Versions of zip.

Motivation

The Standard Library's current zip API consists solely of the global function zip(_:_:), which creates a Sequence of pairs from a pair of Sequences by zipping their respective elements until the shortest Sequence is exhausted, truncating the latter elements of the longest Sequence.

It's primarily used to iterate over two Sequences simultaneously:

func dotProduct <T: Numeric> (_ vector1: [T], _ vector2: [T]) -> T {
  return zip(vector1, vector2).map(*).reduce(0, +)
}

print(dotProduct([-1.5, 2], [4, 3]))
// Prints "0.0"

And since it's lazily implemented we can also zip over (pseudo-)infinite Sequences; which is a common functional pattern:

extension Sequence {
  func enumerated () -> Zip2Sequence<PartialRangeFrom<Int>, Self> {
    return zip(0..., self)
  }
}

print(Array("abc".enumerated()))
// Prints "[(0, "a"), (1, "b"), (2, "c")]"

However, it's missing certain protocol conformances, and features found in many other programming languages, like non-truncating and inverse zip operations and support for higher arities.

Missing `protocol` Conformances

`Collection`, `BidirectionalCollection` and `RandomAccessCollection`

Currently, Zip2Sequence's only conformance is to Sequence. Since SE-0234 removed SubSequence as an associatedtype from Sequence, we can extend Zip2Sequence to conditionally conform to protocols higher up the hierarchy without introducing a new type.

Adding these additional conformances has a number of benefits:

It makes their properties and methods available for use.
It makes existing type inferred code that uses operations with improved default implementations higher up the hierarchy more capable or performant.
It guarantees that the instance can be non-destructively consumed by iteration when the underlying Sequences conform to Collection.

However, the granularity of Sequence's protocol hierarchy makes a correct implementation of these conformances non-trivial.

Each Collection places additional conditional constraints on the underlying Sequences to comply with its semantics. These constraints hinder members lower in the protocol hierarchy from using a Collection's more capable or performant default implementations in a generic context. This means that these members will have to be reimplemented on an unconstrained extension. But which members specifically can only be found out by manually tracking code paths in the Standard Library's source code as the compiler doesn't provide fixits.

`LazySequenceProtocol`

Because Zip2Sequence doesn't conform to LazySequenceProtocol when its underlying Sequences conform to LazySequenceProtocol, some subsequent operations for which lazy variants exist are not used, even though Zip2Sequence itself is implemented lazily:

func id <T> (_ x: T) -> T {
  return x
}

print(type(of: zip([0, 1, 2].lazy, "abc".lazy).map(id)))
// Prints "Array<(Int, Character)>"

Here, we workaround this limitation by calling lazy on zip(_:_:)'s result to make consequent operations lazy again:

print(type(of: zip([0, 1, 2].lazy, "abc".lazy).lazy.map(id)))
// Prints "LazyMapSequence<Zip2Sequence<LazySequence<Array<Int>>, LazySequence<String>>, (Int, Character)>"

While this doesn't have a negative impact on performance—as LazyMapSequence will unwrap LazySequence—it's not exactly elegant.

Non-Truncating `zip` Operation

Some successive operations require all elements of the underlying Sequences, including the latter elements of the longest Sequence. Such an operation is useful when the lengths of the underlying Sequences are indeterminable at compile time and helps to prevent potential correctness traps. It's also versatile in that it composes nicely with other operations.

Both eager and lazy concrete type-returning implementations of this operation can be created from existing operations and language constructs.

Here, we use state and a while loop to create an eager imperative implementation:

func zipLongest <Sequence1: Sequence, Sequence2: Sequence> (_ sequence1: Sequence1, _ sequence2: Sequence2) -> [(Sequence1.Element?, Sequence2.Element?)] {
  var (iterator1, iterator2) = (sequence1.makeIterator(), sequence2.makeIterator())
  var result = [(Sequence1.Element?, Sequence2.Element?)]()

  result.reserveCapacity(Swift.max(sequence1.underestimatedCount, sequence2.underestimatedCount))

  while true {
    let (element1, element2) = (iterator1.next(), iterator2.next())

    if element1 == nil && element2 == nil {
      break
    } else {
      result.append((element1, element2))
    }
  }

  return result
}

print(zipLongest([0, 1, 2], [0]))
// Prints "[(Optional(0), Optional(0)), (Optional(1), nil), (Optional(2), nil)]"

And here, we use sequence(state:next:) and lazy to create a lazy functional implementation:

func zipLongest <Sequence1: Sequence, Sequence2: Sequence> (_ sequence1: Sequence1, _ sequence2: Sequence2) -> LazySequence<UnfoldSequence<(Sequence1.Element?, Sequence2.Element?), (Sequence1.Iterator, Sequence2.Iterator)>> {
  return sequence(state: (sequence1.makeIterator(), sequence2.makeIterator())) {
    let (element1, element2) = ($0.0.next(), $0.1.next())

    return element1 == nil && element2 == nil ? nil : (element1, element2)
  }.lazy
}

print(Array(zipLongest([0, 1, 2], [0])))
// Prints "[(Optional(0), Optional(0)), (Optional(1), nil), (Optional(2), nil)]"

However, these implementations won't scale along with the capabilities of the underlying Sequences. A type that does, similar to Zip2Sequence, is tricky to implement for the aforementioned reasons.

Inverse `zip` Operation

Most programming languages that include zip also include some variant of its inverse operation, usually named unzip, as an idiom. Swift should follow suit.

Fortunately, this operation doesn't require any new types. But implementing it is tricky as some common performance and correctness traps exist.

Here, reduce(into:_:) can't reserve the capacity up-front:

func unzip <Base: Sequence, Element1, Element2> (_ sequence: Base) -> ([Element1], [Element2]) where Base.Element == (Element1, Element2) {
  return sequence.reduce(into: ([Element1](), [Element2]())) { (result, element) in
    result.0.append(element.0)
    result.1.append(element.1)
  }
}

And a lazy variant can't be implemented on a type that only conforms to LazySequenceProtocol since the operation requires multiple passes.

Support for Higher Arities

zip is currently limited to a pair of Sequences.

Here, we workaround it by nesting multiple calls to zip(_:_:) and calling map(_:) on its result to realign its elements:

import UIKit

let components = (
  red: sequence(state: 0.0...1.0) { CGFloat.random(in: $0) },
  green: sequence(state: 0.0...1.0) { CGFloat.random(in: $0) },
  blue: sequence(state: 0.0...1.0) { CGFloat.random(in: $0) },
  alpha: repeatElement(CGFloat(1.0), count: 3)
)

zip(components.red, zip(components.green, zip(components.blue, components.alpha)))
  .map { UIColor(red: $0, green: $1.0, blue: $1.1.0, alpha: $1.1.1) }
  .forEach { print($0) }
// Prints "UIExtendedSRGBColorSpace 0.357433 0.784562 0.903349 1"
// Prints "UIExtendedSRGBColorSpace 0.411219 0.622004 0.842976 1"
// Prints "UIExtendedSRGBColorSpace 0.615279 0.888622 0.651203 1"

However, this becomes less readable and increasingly more complicated for higher arities and is less performant.

For non-truncating zip operations this becomes a readability issue as both the values and tuples will be wrapped in Optionals.

jrose · December 19, 2018, 10:43pm

compactUnzip seems like a trap to me because the two sequences are compacted independently, rather than dropping the element if either value is nil or if both values are nil. Is this operation precedented in any other languages?

jawbroken · December 20, 2018, 5:04am

Thanks for the well-written proposal. zipLongest fills an obvious hole in the standard library, and the rest should be nice to have when needed. My main concern is compactUnzip, for similar reasons to @jrose and also because it seems very niche. I think this should probably be removed and (perhaps, in future) replaced by a compact function on arrays of optionals, which would allow the functionality to be easily recreated by composition. I don't really want to see a proliferation of compact___ versions of methods that are roughly equivalent to just compacting the output(s).

dennisvennink · December 20, 2018, 11:43am

I figured compactUnzip(_:) would be a point of contention.

Initially I put it in because unzip(_:) doesn't let you get back to the underlying Sequences in a zipLongest(_:_:):

let xs = [0, 1, 2]
let ys = [0]

print(unzip(zipLongest(xs, ys)) == (xs, ys))
// false

print(compactUnzip(zipLongest(xs, ys)) == (xs, ys))
// true

Later on I discovered that it could also be used to create an immutable variant of MutableCollection.partition(by:) that returns a tuple of Sequences—more closely resembling Haskell's partition—instead of mutating the elements in-place and returning an Index:

let (even, uneven) = compactUnzip((0..<10).map { (n: Int) -> (Int?, Int?) in
  n.isMultiple(of: 2) ? (n, nil) : (nil, n)
})

print(even, uneven, separator: ", ")
// Prints "[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]"

This would also work on all Sequences.

unzip(_:) and compactUnzip(_:) are analogous to map(_:) and compactMap(_:), respectively:

let xs = zip([0, 1, 2], [0])
let ys = zipLongest([0, 1, 2], [0])

print(unzip(xs) == (xs.map { $0.0 }, zs.map { $0.1 }))
// true

print(compactUnzip(ys) == (ys.compactMap { $0.0 }, ys.compactMap { $0.1 }))
// true

It would also be possible to add a default argument that specifies its behaviour:

let xs = [(nil, 0), (1, nil), (nil, nil), (2, 2)]

print(compactUnzip(xs))
// ([1, 2], [0, 2])

print(compactUnzip(xs, all: true))
// ([2], [2])

Here all defaults to false.

The closest I can think of is Haskell's catMaybes which would be equivalent to compactMap { $0 } (or compact(), if we had it).

I share your concern regarding composability as well. However, compact() would work on a Sequence containing Optionals, not on a Sequence containing tuples that contain Optional elements.

salutis · December 20, 2018, 6:31pm

I wonder why zip is a free function in Swift. Kotlin, Ruby, Rust, and Scala, all expose zip as a method. Anybody knows why zip is not declared as a method on Collection / Sequence which would make it much more discoverable?

dennisvennink · December 20, 2018, 7:13pm

I did some digging in the commit history. 70d728e from April 29th 2015 mentions the following:

API review has reached consensus that zip() reads better as a free function.

By the way, in Xcode you can use alt–esc or ctrl–space to pull up completions regardless of whether or not you're typing. This should also include all global functions.

michelf · December 20, 2018, 7:19pm

I think it makes some sense for zip to be a free function because of the similarity between the tuples produced and the parenthesis for the two collection arguments: zip(a, b) produces a collection of (A, B).

I don't think there is any similar reason justifying unzip as a free function. To me collection.unzip() reads much better than unzip(collection).

bzamayo · December 20, 2018, 7:37pm

From API Design Guidelines:

Free functions are used only in special cases, when there's no obvious self.

That's the basis for how I've reasoned why zip is a free function — although you could make an argument that Collection.zipped(to:) does have semantic meaning as to the ordering of the elements in the pairs.

jawbroken · December 21, 2018, 3:39am

Isn't the output of your unzip a tuple of sequences (or really, just two separate sequences) that can then be individually compacted if the elements are optionals (e.g. they came from zipLongest), not a sequence of tuples containing optionals? Maybe I don't understand your point.

dennisvennink · December 21, 2018, 9:57am

This proposal also adds additional higher arity unzip(_:)s up to 6. This means compact() would add arity number of additional passes through the underlying Arrays. So sure, it would work, but we trade in performance for composability.

Having said that, I'm leaning towards removing compactUnzip(_:) from the proposal. In the spirit of iteration, it's probably better to observe how people will start using zipLongest and then follow it up with an additional proposal that suits their needs, if needed.

dennisvennink · January 4, 2019, 12:55am

I've included your feedback into the implementation and added tests and preliminary benchmarks.

Final changes:

compactUnzip(_:) has been removed.
unzip() is now an extension on Sequence and LazyCollectionProtocol.
All of the types' properties are now internal.

Code reviews are welcome, but it does require some knowledge of gyb. (A non-gyb version of the implementation can be found for arity 2 in the Detailed Design section of the proposal.)

Improved zip API

Introduction

Discussions

Motivation

Missing protocol Conformances

Collection, BidirectionalCollection and RandomAccessCollection

LazySequenceProtocol

Non-Truncating zip Operation

Inverse zip Operation