Add Various Zip-Related Types and Operations

Introduction

This proposal adds various zip-related types and operations to the standard library.

Motivation

In functional programming, zip and unzip are irreplaceable operations. zip maps an n-tuple of sequences to a sequence of n-tuples. It exhausts the input sequences up to the length of either the shortest or the longest sequence. In the case of shortest sequence exhaustion, all mapped elements have a value, but the rest of the elements of the longer sequences are not included. In the case of longest sequence exhaustion, the latter elements of the shorter sequences in the n-tuples have no value, but all elements are included. unzip is the inverse of zip; it maps a sequence of n-tuples to an n-tuple of sequences.

Currently, the standard library only contains the generic function zip(_:_:) that takes two Sequence parameters and returns a Sequence of 2-tuples using shortest sequence exhaustion. An operation using longest sequence exhaustion is oft-requested on the mailing lists and forums. An unzip operation complements both and completes the zip-related API for 2-tuples.
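For concreteness, here is a small runnable sketch of the existing shortest-exhaustion behaviour; the dropped trailing element is exactly what a longest-exhaustion variant would preserve:

```swift
let numbers = [1, 2, 3]
let letters = ["a", "b"]

// The standard library's zip(_:_:) stops at the shorter input
// (shortest sequence exhaustion).
let shortest = Array(zip(numbers, letters))
print(shortest)
// [(1, "a"), (2, "b")]; the trailing 3 is dropped
```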

Proposed Solution

This proposal adds: the generic namespace Zip, containing the types ShortestSequence and LongestSequence; the generic function zipLongest(_:_:); an unzip() method on Sequence whose Element is a 2-tuple; and the typealiases Zip2Sequence and Zip2Iterator.

Detailed Design

A new generic namespace Zip is introduced:

public struct Zip <Sequence1: Sequence, Sequence2: Sequence> { ... }

Zip contains two optimised Sequence types. The first is ShortestSequence (which is almost identical to Zip2Sequence):

public struct ShortestSequence: Sequence {
  internal let sequence1: Sequence1
  internal let sequence2: Sequence2

  internal init (_ sequence1: Sequence1, _ sequence2: Sequence2) {
    (self.sequence1, self.sequence2) = (sequence1, sequence2)
  }

  public struct Iterator: IteratorProtocol {
    internal var iterator1: Sequence1.Iterator
    internal var iterator2: Sequence2.Iterator
    internal var reachedEnd = false

    internal init (_ iterator1: Sequence1.Iterator, _ iterator2: Sequence2.Iterator) {
      (self.iterator1, self.iterator2) = (iterator1, iterator2)
    }

    public mutating func next() -> (Sequence1.Iterator.Element, Sequence2.Iterator.Element)? {
      if self.reachedEnd {
        return nil
      }

      guard let element1 = self.iterator1.next(), let element2 = self.iterator2.next() else {
        self.reachedEnd.toggle()

        return nil
      }

      return (element1, element2)
    }
  }

  public func makeIterator () -> Iterator {
    return Iterator(self.sequence1.makeIterator(), self.sequence2.makeIterator())
  }
}

And LongestSequence, which is a new type capable of longest sequence exhaustion:

public struct LongestSequence: Sequence {
  internal let sequence1: Sequence1
  internal let sequence2: Sequence2

  internal init (_ sequence1: Sequence1, _ sequence2: Sequence2) {
    (self.sequence1, self.sequence2) = (sequence1, sequence2)
  }

  public struct Iterator: IteratorProtocol {
    internal var iterator1: Sequence1.Iterator
    internal var iterator2: Sequence2.Iterator
    internal var reachedEnd = false

    internal init (_ iterator1: Sequence1.Iterator, _ iterator2: Sequence2.Iterator) {
      (self.iterator1, self.iterator2) = (iterator1, iterator2)
    }

    public mutating func next() -> (Sequence1.Iterator.Element?, Sequence2.Iterator.Element?)? {
      if self.reachedEnd {
        return nil
      }

      let element1 = self.iterator1.next()
      let element2 = self.iterator2.next()

      if element1 == nil && element2 == nil {
        self.reachedEnd.toggle()

        return nil
      }

      return (element1, element2)
    }
  }

  public func makeIterator () -> Iterator {
    return Iterator(self.sequence1.makeIterator(), self.sequence2.makeIterator())
  }
}

zip(_:_:) will now return a ShortestSequence:

public func zip <Sequence1: Sequence, Sequence2: Sequence> (_ sequence1: Sequence1, _ sequence2: Sequence2) -> Zip<Sequence1, Sequence2>.ShortestSequence {
  return Zip<Sequence1, Sequence2>.ShortestSequence(sequence1, sequence2)
}

zipLongest(_:_:) is a new generic function that returns a LongestSequence:

public func zipLongest <Sequence1: Sequence, Sequence2: Sequence> (_ sequence1: Sequence1, _ sequence2: Sequence2) -> Zip<Sequence1, Sequence2>.LongestSequence {
  return Zip<Sequence1, Sequence2>.LongestSequence(sequence1, sequence2)
}

Finally, unzip() is an extension on Sequence where Element is equal to 2-tuples:

extension Sequence {
  public func unzip <Type1, Type2> () -> ([Type1], [Type2]) where Self.Element == (Type1, Type2) {
    return self.reduce(into: ([Type1](), [Type2]())) { (result, element) in
      result.0.append(element.0)
      result.1.append(element.1)
    }
  }
}
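As a usage sketch (restating the proposed extension so the snippet is self-contained and runnable):

```swift
extension Sequence {
  // Proposed unzip(): splits a sequence of 2-tuples into two arrays in one pass.
  func unzip<Type1, Type2>() -> ([Type1], [Type2]) where Element == (Type1, Type2) {
    return self.reduce(into: ([Type1](), [Type2]())) { result, element in
      result.0.append(element.0)
      result.1.append(element.1)
    }
  }
}

let pairs = [(1, "a"), (2, "b"), (3, "c")]
let (pairNumbers, pairLetters) = pairs.unzip()
print(pairNumbers) // [1, 2, 3]
print(pairLetters) // ["a", "b", "c"]
```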

Source Compatibility

To prevent code breakage, typealiases for Zip2Sequence and Zip2Iterator will be redirected to Zip.ShortestSequence and Zip.ShortestSequence.Iterator, respectively:

@available(*, deprecated, renamed: "Zip.ShortestSequence")
public typealias Zip2Sequence<Sequence1, Sequence2> = Zip<Sequence1, Sequence2>.ShortestSequence where Sequence1: Sequence, Sequence2: Sequence
@available(*, deprecated, renamed: "Zip.ShortestSequence.Iterator")
public typealias Zip2Iterator<Sequence1, Sequence2> = Zip<Sequence1, Sequence2>.ShortestSequence.Iterator where Sequence1: Sequence, Sequence2: Sequence

Effect on ABI Stability

Not applicable.

Effect on API Resilience

Not applicable.

Alternatives Considered

The main points of focus were the naming and return type of zipLongest(_:_:): would it return Optionals or not? If not, then variants with defaults or a closure would have to be introduced. Using default values meant that dummy placeholder values had to be used, which was not ideal. The variant with a closure makes using the function harder. These were all quickly dismissed as they're all derivatives of the Optional-returning variant. It turned out that .lazy.map(_:) can be used to efficiently implement all other variants.

While tempted, implementations for arities higher than 2 have been deferred until variadic generics have been introduced.

You write, " zip and unzip are irreplaceable operations". I have certainly found that to be true with zip. I have never needed unzip even once. (I convert to an array and index and grab .0 or .1 as needed.)

Your motivation section describes what the features are but not where they are needed or how they are used. I can see you have put a lot of thought and work into this, but without concrete motivating examples of use, I would probably vote against inclusion should this proposal come to review.

A while ago, I developed a pad-with-nil version of zip with @davedelong and Oisin Kidney (recent version). I have not used it in real code in the several years of its existence. I suspect it's the kind of solution that library developers like to build, motivated more by potential than by real-world use.

Here's an example where unzip would be useful: Training a Text Classifier with Create ML and the Natural Language Framework - Flight School

The implementation the author uses instead:

let (texts, labels): ([String], [String]) =
    corpus.reduce(into: ([], [])) { (columns, row) in
        columns.0.append(row.text)
        columns.1.append(row.label)
    }

is sub-optimal since it doesn't reserve space up-front. But doing that would in turn rule out using reduce(into:), maybe hurting readability. Other implementations might do it one column at a time, which takes two passes and might be inefficient. And it looks like the expression needed type context to work (or perhaps for clarity). As such, I think there's a reasonable case for a standard library addition on the grounds of clarity, ease of use, and performance.
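For reference, pre-reserving is straightforward with a plain loop; unzipReserving is a hypothetical helper name for this sketch, not an API from the thread:

```swift
// A hypothetical eager unzip over arrays that reserves capacity up front,
// avoiding reallocation during the appends.
func unzipReserving<T1, T2>(_ pairs: [(T1, T2)]) -> ([T1], [T2]) {
  var firsts = [T1]()
  var seconds = [T2]()
  firsts.reserveCapacity(pairs.count)
  seconds.reserveCapacity(pairs.count)
  for (first, second) in pairs {
    firsts.append(first)
    seconds.append(second)
  }
  return (firsts, seconds)
}

let (texts, labels) = unzipReserving([("good", "pos"), ("bad", "neg")])
print(texts)  // ["good", "bad"]
print(labels) // ["pos", "neg"]
```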

I think this would make for a great proposal.

Personally, I'm a bit skeptical of namespacing this kind of thing. I'd prefer just leaving the types at the top level.

It would be good to add to this the creation of collection versions of Zip as well. Two forward collections zipped together can be forward collections. Two random-access collections can be random access (via conditional conformance). Note, they both need to be random-access to go beyond forward-only because you need to be able to compute length in constant time to know which is longer in order to make e.g. zipped.last computable in constant time.

The fact that pre-allocation of space rules out reduce(into:) is a gap in that API that I pointed out in the discussion and review of that proposal. The conversation about this in the discussion thread started here: Reduce with inout - #37 by anandabits. There was also quite a bit of conversation in response to my review of that proposal: SE-0171: Reduce with inout - #5 by anandabits

I think we should fill the gap that prevents an efficient and straightforward eager implementation of unzip from being written outside the standard library rather than accept a proposal like this one on the grounds that an efficient solution is not straightforward.

Regarding the proposal itself, I think it would be more inline with the rest of the standard library if unzip were lazy rather than eager. It does not make use of a closure and is similar in nature to enumerated and reversed.

I think the proposal should provide concrete motivating use cases for zipLongest. I am specifically curious as to whether they would benefit from a design that retained information about the "common prefix" with non-optional pairs of values (i.e. the result when the usual zip is used). If so, perhaps LongestSequence should also provide commonPrefix and partialSuffix views (these are straw man names).

This argument could also be leveled against map, which can be replicated with the mutating reduce as well. But map as a stand-alone is far more readable, and avoids the problem of users forgetting/not knowing about reserving capacity ahead of time. Tortured attempts to cram a problem into a use of reduce are an anti-pattern IMO.

Now, unzip is obviously not as common as map. But it's common enough that many languages have a version of it as an idiom, and Swift should too.

It would maybe help if Array had an init that took a capacity argument. That would allow you to pass an array with a size hint into reduce, and if it was taken consuming, should allow that array to be uniquely referenced, mutated, and returned all in one shot. But this wouldn't change the fact that unzip would be worthwhile as both more readable and as a way of helping users get the performance tweak for free.

I don't think this can work, since you need to produce two separate things. If you don't mind them being independently lazily produced, all you need is two lazy maps (which requires no addition to the library). The point of unzip is to create a way of producing two arrays eagerly in one pass rather than (at least) two.

I'm not arguing that people should be using reduce directly at all. I am arguing that it should be possible to implement higher level algorithms in as efficient and straightforward a manner as possible using the tools provided by the standard library. Sometimes this would be facilitated by providing a reduce algorithm that takes the accumulator inout. I think it is a valuable and fundamental building block.

This is not currently possible, relies on optimizations that may not always happen, and does not address all use cases for an inout accumulator. reduce with an inout accumulator is the most primitive version of reduce we could have in Swift. Is there a compelling reason not to provide it in the standard library?

I agree. I'm not arguing against including it. I'm only arguing that it should be possible to write it in an efficient and straightforward manner whether it is included or not. The primitive necessary to do so is extremely general and will be useful in the implementation of other algorithms that will not be provided by the standard library. Without the necessary primitive user code will suffer from either unnecessary complexity or suboptimal performance.

Secondarily, if the necessary version of reduce is provided then the suboptimal implementation users are likely to write ceases to be a motivation for this proposal. That does not detract from other motivation for moving it forward but it should be evaluated only on that motivation.

I see. I think this should be made clear and motivated directly in the proposal. I don't think it is obvious and I wonder how often the eager behavior is the right choice. Clearly if independent arrays are actually required the eager version will be superior. On the other hand, if they are not required and independent passes over the resulting sequences are all that is necessary then allocating the arrays would be less than ideal.

I don't know enough about concrete use cases to determine whether I think the eager version belongs in the standard library. If most use cases for unzipping would benefit from laziness then including an eager version and not a lazy version in the standard library would be a disservice to users. A proposal that moves forward should contain sufficient motivation and detail about concrete use cases to help answer this question.

Taking arguments consuming is planned for as part of prep for move-only types. Once we have it, this shouldn't be relying on optimizations – rather, passing a new value into a consuming argument ought to be something you can rely on as resulting in a uniquely referenced value. @Joe_Groff can maybe confirm.

Yes – for the reasons I think were discussed at the time of introducing the mutating reduce (and that I don't think need relitigating): that reduce is barely more than some sugar around a fairly simple loop, which can sometimes improve readability. As soon as you make reduce take inout arguments, you go from reduce being a nice, short, simple alternative to a for loop, to instead writing obfuscated code that would be better suited to a loop. Just write the loop.

That's exactly why I quoted a recent use case where an eager unzip was a win. I feel this case is common enough to justify it (and, like I say, so do other languages). And crowdsourcing these examples is kinda the point of the pitch phase.

These trade-offs are unavoidable, and worrying something might be misused, even though it has legitimate uses, is a poor reason to not put it in the standard library.

The policy currently in place with the std lib is: if laziness is unambiguously "free" (like reversed()) make it lazy by default. If there are downsides to the laziness (e.g. multiple passes would incur multiple costs – lazy.map is not just about capturing the closure), make it eager, and if there's a lazy equivalent, put that under .lazy. This would be in keeping with that.

Why couldn't you unzip just by providing access to either base iterator? Or do you mean to start with a stream of tuples (not a Zip2) and produce a stream or collection from one of its members? And if so, why shouldn't there be a mapKeypath function that does this for arrays or streams of all types? Or am I (as usual) completely missing the point?

Right. This also suggests that Zip2 ought to have properties that let you get back to the zipped things. But sometimes you just have a collection of pairs.

Ideally, yeah, you could use a keypath wherever you want a function from a value to a property of the value. Then you could foo.map(\.bar). But unzip would still be useful for when you want to split something into two arrays in a single pass.

Thanks for explaining this. It is slightly different than what I had previously understood.

TIL:

error: key path cannot reference tuple elements

Proposal 0: Make Zip2Sequence's base streams 1 and 2 public.
Proposal 1: Allow key paths to reference tuple elements.

btw, TI(also)L that you can't pass an arbitrary number of key paths with T, U, V, ... etc. and construct a usable non-Any type from them, e.g. ([T], [U], [V], ...).

I think the pitch needs more motivation detail about the chosen solution. I think unzip and zipLongest would both be good and straightforward additions to the standard library, but I don't really understand the implementation strategy, e.g. why Zip, which is really something like a SequencePair, is independently useful. Are there other things you would use SequencePair for?

It seems like the simpler implementation strategy would be just to have zipLongest as a top level function with whatever return type makes sense (maybe Zip2Sequence can just be reused). unzip is trickier, because it should probably just be a method on a Sequence of tuples, but that would require parameterised extensions if I understand correctly. But perhaps consistency arguments can be made for it just being a top-level function, at least for now.

I'll reword this in the next edit. It's a bit hyperbolic.

You're absolutely right. I will add use cases for both zipLongest(_:_:) and unzip() and will add documentation to all operations.

Here's one off the top of my head. I recently had to compare two Sequences of tuples (this overload does not exist in the Standard Library since tuples can't conform to Equatable yet). zip(_:_:) is not suitable for this application since it gives a false positive. I had to use zipLongest(_:_:) (or use a for loop, but I'd rather use reusable, generic functions):

let xs = [(0, 1), (0, 2)]
let ys = [(0, 1), (0, 2), (0, 3)]

print(zip(xs, ys).allSatisfy({ $0.0 == $0.1 }))
// true πŸ‘Ž

print(zipLongest(xs, ys).allSatisfy({ $0.0 == $0.1 }))
// false πŸ‘
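For Collections specifically, the false positive can also be avoided today by comparing counts before zipping; a minimal workaround sketch, not the proposed API:

```swift
let xs = [(0, 1), (0, 2)]
let ys = [(0, 1), (0, 2), (0, 3)]

// zip alone silently ignores ys's extra element: a false positive.
let naive = zip(xs, ys).allSatisfy { $0.0 == $0.1 }
// Guarding with a count comparison restores the correct answer.
let guarded = xs.count == ys.count && zip(xs, ys).allSatisfy { $0.0 == $0.1 }
print(naive, guarded) // true false
```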

What I like about it is the 1:1 mapping from Zip2Sequence to Zip<Sequence1, Sequence2>. If a user wants to extend the operations to other arities, they can do so in Zip3, ..., ZipN. If there's a performance penalty or any other major downside, then this is unacceptable and I'll revert to something like Zip2ShortestSequence.

Another idea was to perhaps use a protocol to group both Sequences:

protocol ZipSequence: Sequence {}

extension Zip2ShortestSequence: ZipSequence {}
extension Zip2LongestSequence: ZipSequence {}

print(zip([0, 1], [0, 1]) is ZipSequence)
// true

I'll look into specialised versions. Your dotSwift talk will be very helpful in getting this off the ground. :+1:

public struct SequencePair <Sequence1: Sequence, Sequence2: Sequence> {
  public enum Zip {
    public struct ShortestSequence: Sequence { ... }
    public struct LongestSequence: Sequence { ... }
  }

  internal let sequence1: Sequence1
  internal let sequence2: Sequence2

  internal init (_ sequence1: Sequence1, _ sequence2: Sequence2) {
    (self.sequence1, self.sequence2) = (sequence1, sequence2)
  }

  public var zip: (shortest: Zip.ShortestSequence, longest: Zip.LongestSequence) {
    return (shortest: Zip.ShortestSequence(self.sequence1, self.sequence2), longest: Zip.LongestSequence(self.sequence1, self.sequence2))
  }

  public var product: ProductSequence {
    return ...
  }
}

let xs = [0, 1]
let ys = ["a", "b", "c"]
let sequencePair = SequencePair(xs, ys)

// Zip with shortest exhaustion:
print(Array(sequencePair.zip.shortest))
// [(0, "a"), (1, "b")]

// Zip with longest exhaustion:
print(Array(sequencePair.zip.longest))
// [(.some(0), .some("a")), (.some(1), .some("b")), (.none, .some("c"))]

// The Cartesian product of two `Sequence`s (functional alternative to nested for loops):
print(Array(sequencePair.product))
// [(0, "a"), (0, "b"), (0, "c"), (1, "a"), (1, "b"), (1, "c")]

Take your pick.

It can't. I've tried, unfortunately.

A section explaining to non-FP programmers why this is useful would help, too.

I'm currently meditating on whether to ditch 2-tuples for Pairs:

public struct Pair <Type1, Type2> {
  public let first: Type1
  public let second: Type2

  public init (_ first: Type1, _ second: Type2) {
    (self.first, self.second) = (first, second)
  }
}

SequencePair would then become Pair<Type1: Sequence, Type2: Sequence>:

extension Pair where Type1: Sequence, Type2: Sequence  {
  public enum Zip {
    public struct ShortestSequence: Sequence { ... }
    public struct LongestSequence: Sequence { ... }
  }

  public var zip: (shortest: Zip.ShortestSequence, longest: Zip.LongestSequence) {
    return (shortest: Zip.ShortestSequence(self.first, self.second), longest: Zip.LongestSequence(self.first, self.second))
  }
}

Here zip.shortest is a Sequence of Pair<Type1, Type2>s and zip.longest is a Sequence of Pair<Optional<WrappedType1>, Optional<WrappedType2>>s.

The cool thing about this approach is that we can create a fill(_:_:) method on Pair<Optional<WrappedType1>, Optional<WrappedType2>>:

extension Pair {
  func fill <WrappedType1, WrappedType2> (_ first: WrappedType1, _ second: WrappedType2) -> Pair<WrappedType1, WrappedType2> where Type1 == Optional<WrappedType1>, Type2 == Optional<WrappedType2> {
    return Pair<WrappedType1, WrappedType2>(self.first ?? first, self.second ?? second)
  }
}

So you could do:

Array(Pair([0, 1, 2], [true, false]).zip.longest.map({$0.fill(2, true)}))
// [Pair(0, true), Pair(1, false), Pair(2, true)]

If only we could get the compiler to map 2-tuples to the Pair type. :thinking:

It’s possible to build up a lot of Sequence-returning functions without creating concrete types for each individual use case. We already have the universally applicable UnfoldSequence returned from the sequence function. I’d suggest we explore the design space and implementations using this technique. See:

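As a sketch of that technique, a longest-exhaustion zip can be written with the standard library's sequence(state:next:) and no new concrete type. zipLongest here is the pitched name, implemented purely as an illustration, not standard library API:

```swift
// Longest-exhaustion zip built on sequence(state:next:), which returns an
// UnfoldSequence; no custom Sequence type is needed.
func zipLongest<S1: Sequence, S2: Sequence>(
  _ s1: S1, _ s2: S2
) -> UnfoldSequence<(S1.Element?, S2.Element?), (S1.Iterator, S2.Iterator)> {
  return sequence(state: (s1.makeIterator(), s2.makeIterator())) { state in
    let first = state.0.next()
    let second = state.1.next()
    // Stop only when both inputs are exhausted.
    if first == nil && second == nil { return nil }
    return (first, second)
  }
}

let result = Array(zipLongest([1, 2], ["a", "b", "c"]))
print(result)
// [(Optional(1), Optional("a")), (Optional(2), Optional("b")), (nil, Optional("c"))]
```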