Add Various Zip-Related Types and Operations


(Ben Cohen) #4

I think this would make for a great proposal.

Personally, I'm a bit skeptical of namespacing this kind of thing. I'd prefer just leaving the types at the top level.

It would be good to add to this the creation of collection versions of Zip as well. Two forward collections zipped together can be forward collections. Two random-access collections can be random access (via conditional conformance). Note, they both need to be random-access to go beyond forward-only because you need to be able to compute length in constant time to know which is longer in order to make e.g. zipped.last computable in constant time.
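To make the complexity argument concrete, here is a minimal sketch (the function name is mine, not proposed API) of why `zipped.last` needs both bases to be random-access:

```swift
// Hypothetical helper (name assumed): the last element of zip(c1, c2),
// computable in O(1) only because both collections are random-access,
// which makes `count` and `index(_:offsetBy:)` constant-time.
func zippedLast<C1: RandomAccessCollection, C2: RandomAccessCollection>(
  _ c1: C1, _ c2: C2
) -> (C1.Element, C2.Element)? {
  let n = Swift.min(c1.count, c2.count)  // O(1) for random-access collections
  guard n > 0 else { return nil }
  let i1 = c1.index(c1.startIndex, offsetBy: n - 1)
  let i2 = c2.index(c2.startIndex, offsetBy: n - 1)
  return (c1[i1], c2[i2])
}

print(zippedLast([1, 2, 3], ["a", "b"])!)
// (2, "b")
```

With merely forward collections, computing `min(count, count)` would itself be a linear walk, so nothing beyond forward traversal can be offered.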


Implementation of Collection Versions of Zip
(Matthew Johnson) #5

The fact that pre-allocation of space rules out reduce(into:) is a gap in that API that I pointed out in the discussion and review of that proposal. The conversation about this in the discussion thread started here: Reduce with inout. There was also quite a bit of conversation in response to my review of that proposal: SE-0171: Reduce with inout

I think we should fill the gap that prevents an efficient and straightforward eager implementation of unzip from being written outside the standard library rather than accept a proposal like this one on the grounds that an efficient solution is not straightforward.

Regarding the proposal itself, I think it would be more in line with the rest of the standard library if unzip were lazy rather than eager. It does not make use of a closure and is similar in nature to enumerated and reversed.

I think the proposal should provide concrete motivating use cases for zipLongest. I am specifically curious as to whether they would benefit from a design that retained information about the "common prefix" with non-optional pair values (i.e. the result when the usual zip is used). If so, perhaps `LongestSequence` should also provide `commonPrefix` and `partialSuffix` views (these are straw-man names).


(Ben Cohen) #6

This argument could also be leveled against map, which can be replicated with the mutating reduce as well. But map as a stand-alone is far more readable, and avoids the problem of users forgetting/not knowing about reserving capacity ahead of time. Tortured attempts to cram a problem into a use of reduce are an anti-pattern IMO.

Now, unzip is obviously not as common as map. But it's common enough that many languages have a version of it as an idiom, and Swift should too.

It would maybe help if Array had an init that took a capacity argument. That would allow you to pass an array with a size hint into reduce, and if it was taken consuming, should allow that array to be uniquely referenced, mutated, and returned all in one shot. But this wouldn't change the fact that unzip would be worthwhile as both more readable and as a way of helping users get the performance tweak for free.

I don't think this can work, since you need to produce two separate things. If you don't mind them being independently lazily produced, all you need is two lazy maps (which requires no addition to the library). The point of unzip is to create a way of producing two arrays eagerly in one pass rather than (at least) two.
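For reference, the kind of eager, single-pass unzip being discussed could be sketched like this (a free function for illustration; the pitch may spell it differently):

```swift
// Sketch: eager unzip in one pass, reserving capacity up front where
// the sequence can provide a size hint.
func unzip<S: Sequence, A, B>(_ sequence: S) -> ([A], [B]) where S.Element == (A, B) {
  var first: [A] = []
  var second: [B] = []
  first.reserveCapacity(sequence.underestimatedCount)
  second.reserveCapacity(sequence.underestimatedCount)
  for (a, b) in sequence {
    first.append(a)
    second.append(b)
  }
  return (first, second)
}

let (numbers, letters) = unzip([(1, "a"), (2, "b"), (3, "c")])
// numbers == [1, 2, 3], letters == ["a", "b", "c"]
```

The whole point is the single traversal plus the capacity reservation, which is exactly the part users tend to forget when rolling their own.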


(Matthew Johnson) #7

I'm not arguing that people should be using reduce directly at all. I am arguing that it should be possible to implement higher level algorithms in as efficient and straightforward a manner as possible using the tools provided by the standard library. Sometimes this would be facilitated by providing a reduce algorithm that takes the accumulator inout. I think it is a valuable and fundamental building block.

This is not currently possible: it relies on optimizations which may not always happen, and it does not address all use cases for an inout accumulator. reduce with an inout accumulator is the most primitive version of reduce we could have in Swift. Is there a compelling reason not to provide it in the standard library?
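The variant under discussion could be sketched as a user-side extension (a sketch only; note that the standard library's `reduce(into:_:)` from SE-0171 takes its initial value by value and returns the result, whereas this one passes the accumulator itself inout):

```swift
extension Sequence {
  // Sketch: reduce with a genuinely inout accumulator, so the caller can
  // pre-reserve capacity (or otherwise prepare storage) before the loop,
  // and no copy of the accumulator is introduced by the algorithm itself.
  func reduce<Result>(
    into result: inout Result,
    _ combine: (inout Result, Element) throws -> Void
  ) rethrows {
    for element in self {
      try combine(&result, element)
    }
  }
}

var sums = (0, 0)
[(1, 10), (2, 20)].reduce(into: &sums) { $0.0 += $1.0; $0.1 += $1.1 }
// sums == (3, 30)
```

Because the only `&`-marked argument matches the inout parameter, this overload does not collide with the standard library's by-value `reduce(into:_:)`.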

I agree. I'm not arguing against including it. I'm only arguing that it should be possible to write it in an efficient and straightforward manner whether it is included or not. The primitive necessary to do so is extremely general and will be useful in the implementation of other algorithms that will not be provided by the standard library. Without the necessary primitive, user code will suffer from either unnecessary complexity or suboptimal performance.

Secondarily, if the necessary version of reduce is provided, then the suboptimal implementation users are likely to write ceases to be a motivation for this proposal. That does not detract from the other motivations for moving it forward, but it should be evaluated only on those motivations.

I see. I think this should be made clear and motivated directly in the proposal. I don't think it is obvious, and I wonder how often the eager behavior is the right choice. Clearly, if independent arrays are actually required, the eager version will be superior. On the other hand, if they are not required and independent passes over the resulting sequences are all that is necessary, then allocating the arrays would be less than ideal.

I don't know enough about concrete use cases to determine whether I think the eager version belongs in the standard library. If most use cases for unzipping would benefit from laziness then including an eager version and not a lazy version in the standard library would be a disservice to users. A proposal that moves forward should contain sufficient motivation and detail about concrete use cases to help answer this question.


(Ben Cohen) #8

Taking arguments consuming is planned for as part of prep for move-only types. Once we have it, this shouldn't be relying on optimizations – rather, passing a new value into a consuming argument ought to be something you can rely on as resulting in a uniquely referenced value. @Joe_Groff can maybe confirm.

Yes – for the reasons I think were discussed at the time of introducing the mutating reduce (and that I don't think need relitigating): that reduce is barely more than some sugar around a fairly simple loop, which can sometimes improve readability. As soon as you make reduce take inout arguments, you go from reduce being a nice, short, simple alternative to a for loop, to instead writing obfuscated code that would be better suited to a loop. Just write the loop.


(Ben Cohen) #9

That's exactly why I quoted a recent use case where an eager unzip was a win. I feel this case is common enough to justify it (and, like I say, so do other languages). And crowdsourcing these examples is kinda the point of the pitch phase.

These trade-offs are unavoidable, and worrying something might be misused, even though it has legitimate uses, is a poor reason to not put it in the standard library.

The policy currently in place with the std lib is: if laziness is unambiguously "free" (like reversed()) make it lazy by default. If there are downsides to the laziness (e.g. multiple passes would incur multiple costs – lazy.map is not just about capturing the closure), make it eager, and if there's a lazy equivalent, put that under .lazy. This would be in keeping with that.
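Concretely, that policy with existing standard library API (nothing new here):

```swift
let numbers = [1, 2, 3]

// reversed() on a bidirectional collection is lazy "for free":
// ReversedCollection is just a view, with no allocation and no deferred work.
let backwards = numbers.reversed()

// map is eager by default because laziness has a cost here: every pass
// over a lazy map re-runs the closure.
let eager = numbers.map { $0 * 2 }      // allocates [2, 4, 6] once
let deferred = numbers.lazy.map { $0 * 2 }  // closure runs on each access
```

Under that policy an eager `unzip` with a `.lazy` counterpart would be the consistent shape, since multiple passes over a lazily unzipped pair repeat the traversal cost.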


(Erica Sadun) #10

Why couldn't you unzip just by providing access to either base iterator? Or do you mean to start with a stream of tuples (not a Zip2) and produce a stream or collection from one of its members? And if so, why shouldn't there be a mapKeypath function that does this for arrays or streams of all types? Or am I (as usual) completely missing the point?


(Ben Cohen) #11

Right. This also suggests that Zip2 ought to have properties that let you get back to the zipped things. But sometimes you just have a collection of pairs.

Ideally, yeah, you could use a keypath wherever you want a function from a value to a property of the value. Then you could foo.map(\.bar). But unzip would still be useful for when you want to split something into two arrays in a single pass.
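(For the record, `foo.map(\.bar)` has since landed: SE-0249, key path expressions as functions, shipped in Swift 5.2. The type and values below are just for illustration.)

```swift
struct Person {
  let name: String
  let age: Int
}

let people = [Person(name: "Ada", age: 36), Person(name: "Grace", age: 45)]

// SE-0249: a key path literal can stand in for a function (Root) -> Value.
let names = people.map(\.name)
// ["Ada", "Grace"]
```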


(Matthew Johnson) #12

Thanks for explaining this. It is slightly different than what I had previously understood.


(Erica Sadun) #13

TIL:

error: key path cannot reference tuple elements

Proposal 0: Promote Zip2Sequence's baseStream 1 and 2 to public.
Proposal 1: Allow key paths to reference tuple elements.

btw, TI(also)L that you can't pass an arbitrary number of key paths with T, U, V, etc. and construct a usable non-Any type from them, e.g. ([T], [U], [V], ...).


#14

I think the pitch needs more motivation detail about the chosen solution. I think unzip and zipLongest would both be good and straightforward additions to the standard library, but I don't really understand the implementation strategy, e.g. why Zip, which is really something like a SequencePair, is independently useful. Are there other things you would use SequencePair for?

It seems like the simpler implementation strategy would be just to have zipLongest as a top level function with whatever return type makes sense (maybe Zip2Sequence can just be reused). unzip is trickier, because it should probably just be a method on a Sequence of tuples, but that would require parameterised extensions if I understand correctly. But perhaps consistency arguments can be made for it just being a top-level function, at least for now.


(Dennis Vennink) #15

I'll reword this in the next edit. It's a bit hyperbolic.

You're absolutely right. I will add use cases for both zipLongest(_:_:) and unzip() and will add documentation to all operations.

Here's one off the top of my head. I recently had to compare two Sequences of tuples (this overload does not exist in the Standard Library since tuples can't conform to Equatable yet). zip(_:_:) is not suitable for this application since it gives a false positive. I had to use zipLongest(_:_:) (or use a for loop, but I'd rather use reusable, generic functions):

let xs = [(0, 1), (0, 2)]
let ys = [(0, 1), (0, 2), (0, 3)]

print(zip(xs, ys).allSatisfy({ $0.0 == $0.1 }))
// true πŸ‘Ž

print(zipLongest(xs, ys).allSatisfy({ $0.0 == $0.1 }))
// false πŸ‘

(Dennis Vennink) #16

What I like about it is the 1:1 mapping from Zip2Sequence to Zip<Sequence1, Sequence2>. If a user wants to extend the operations to other arities they can do so in Zip3, ..., ZipN. If there's a performance penalty or any other major downside, then this is unacceptable and I'll revert to something like Zip2ShortestSequence.

Another idea was to perhaps use a protocol to group both Sequences:

protocol ZipSequence: Sequence {}

extension Zip2ShortestSequence: ZipSequence {}
extension Zip2LongestSequence: ZipSequence {}

print(zip([0, 1], [0, 1]) is ZipSequence)
// true

I'll look into specialised versions. Your dotSwift talk will be very helpful in getting this off the ground. :+1:


(Dennis Vennink) #18
public struct SequencePair <Sequence1: Sequence, Sequence2: Sequence> {
  public enum Zip {
    public struct ShortestSequence: Sequence { ... }
    public struct LongestSequence: Sequence { ... }
  }

  internal let sequence1: Sequence1
  internal let sequence2: Sequence2

  internal init (_ sequence1: Sequence1, _ sequence2: Sequence2) {
    (self.sequence1, self.sequence2) = (sequence1, sequence2)
  }

  public var zip: (shortest: Zip.ShortestSequence, longest: Zip.LongestSequence) {
    return (shortest: Zip.ShortestSequence(self.sequence1, self.sequence2), longest: Zip.LongestSequence(self.sequence1, self.sequence2))
  }

  public var product: ProductSequence {
    return ...
  }
}

let xs = [0, 1]
let ys = ["a", "b", "c"]
let sequencePair = SequencePair(xs, ys)

// Zip with shortest exhaustion:
print(Array(sequencePair.zip.shortest))
// [(0, "a"), (1, "b")]

// Zip with longest exhaustion:
print(Array(sequencePair.zip.longest))
// [(.some(0), .some("a")), (.some(1), .some("b")), (.none, .some("c"))]

// The Cartesian product of two `Sequence`s (functional alternative to nested for loops):
print(Array(sequencePair.product))
// [(0, "a"), (0, "b"), (0, "c"), (1, "a"), (1, "b"), (1, "c")]

Take your pick.

It can't. I've tried, unfortunately.


#19

A section with reasoning for non-FP programmers on why this is useful would help, too.


(Dennis Vennink) #20

I'm currently meditating on whether to ditch 2-tuples for Pairs:

public struct Pair <Type1, Type2> {
  public let first: Type1
  public let second: Type2

  public init (_ first: Type1, _ second: Type2) {
    (self.first, self.second) = (first, second)
  }
}

SequencePair would then become Pair<Type1: Sequence, Type2: Sequence>:

extension Pair where Type1: Sequence, Type2: Sequence  {
  public enum Zip {
    public struct ShortestSequence: Sequence { ... }
    public struct LongestSequence: Sequence { ... }
  }

  public var zip: (shortest: Zip.ShortestSequence, longest: Zip.LongestSequence) {
    return (shortest: Zip.ShortestSequence(self.first, self.second), longest: Zip.LongestSequence(self.first, self.second))
  }
}

Here zip.shortest is a Sequence of Pair<Type1, Type2>s and zip.longest is a Sequence of Pair<Optional<WrappedType1>, Optional<WrappedType2>>s.

The cool thing about this approach is that we can create a fill(_:_:) method on Pair<Optional<WrappedType1>, Optional<WrappedType2>>:

extension Pair {
  func fill <WrappedType1, WrappedType2> (_ first: WrappedType1, _ second: WrappedType2) -> Pair<WrappedType1, WrappedType2> where Type1 == Optional<WrappedType1>, Type2 == Optional<WrappedType2> {
    return Pair<WrappedType1, WrappedType2>(self.first ?? first, self.second ?? second)
  }
}

So you could do:

Array(Pair([0, 1, 2], [true, false]).zip.longest.map({$0.fill(2, true)}))
// [Pair(0, true), Pair(1, false), Pair(2, true)]

If only we could get the compiler to map 2-tuples to the Pair type. :thinking:


(Pavol Vaskovic) #21

It’s possible to build up a lot of Sequence returning functions without creating concrete types for each individual use case. We already have universally applicable UnfoldSequence returned from the sequence function. I’d suggest we explore the design space and implementations using this technique. See:
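For example, the pitched zipLongest could be sketched on top of `sequence(state:next:)` without introducing any new concrete type (the name follows the pitch; this is just an illustration of the technique):

```swift
// Sketch: zipLongest built on sequence(state:next:), so the only concrete
// type involved is the existing UnfoldSequence. The state is the pair of
// base iterators; the sequence ends once both are exhausted.
func zipLongest<S1: Sequence, S2: Sequence>(
  _ s1: S1, _ s2: S2
) -> UnfoldSequence<(S1.Element?, S2.Element?), (S1.Iterator, S2.Iterator)> {
  return sequence(state: (s1.makeIterator(), s2.makeIterator())) { state in
    let (a, b) = (state.0.next(), state.1.next())
    return (a == nil && b == nil) ? nil : (a, b)
  }
}

print(Array(zipLongest([0, 1], ["a", "b", "c"])))
// [(Optional(0), Optional("a")), (Optional(1), Optional("b")), (nil, Optional("c"))]
```

The trade-off is that the return type exposes the implementation strategy, which matters if the proposal later wants to add conditional Collection conformances; a dedicated struct keeps that door open.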


(Dennis Vennink) #22

I've been thinking about this too. I've experimented with AnySequence, but there was a slight performance hit. UnfoldSequence would indeed be a viable alternative. :+1:


(Ben Cohen) #23

Please don't. We shouldn't add a type that competes with a built-in language feature. That will lead to impedance mismatches when you have one, but need the other.

I realize there are limitations to work around in not being able to extend structural types, but the answer should not be to add new named types, which would need to exist forever even after the language limitations are resolved.


Roadmap for and state of single element tuple types?
(Dennis Vennink) #24

I completely agree, hence:

Ideally this would be a variadic Tuple type. (0, true) would then be a literal for Tuple<Int, Bool>, the same way "string" is for String.