[Pitch] Remove the single-pass requirement on Sequence

Hi everyone,

I've been waiting until the new forums were rolled out to post this, and I'm excited to finally get a chance to.

Sequence in the standard library has two semantic capabilities: sequences may be infinite, which means they keep generating values forever, and they may be single-pass, which means if they are iterated a second time, they may not produce the same values (or any values at all!).

I think Swift should require all sequences to safely be multi-pass, and I think the argument for this is quite strong. I will lay out that arugment in this post.

True single-pass sequences in Swift are extremely rare. By my account, there are only a few situations where they can happen at all:

  1. Some situations (_DropFirstSequence in the standard library is the only one I know of) where both the Sequence and the Iterator are the same object, and that object is a reference type. This means that when the iterator is mutated, the sequence is mutated as well.
  2. Sequences defined with blocks, such as with the constructor functions sequence(first:next:) and sequence(state:next:), where they also capture external elements instead of purely relying on their first or state values.
  3. Sequences based on network data, which I have never seen in practice.
  4. Sequences of random data, which generate random elements as they are iterated.

Case 1 is a easy to avoid. Use structs, or make the Iterator and Sequence separate types.

Case 2 is easy to avoid as well. I would argue that code that captures external values in those blocks is worse than code that does not.

Case 3 is ill-suited for sequences and would be better expressed using streams or reactive paradigms.

Case 4 is a small case, and random data being different each time you access it isn't particularly unexpected anyway. Swift's randomness story isn't very well-defined at the moment, and it's possible that a RandomSequence could be initialized with a seed that would create the exact same sequence each time if necessary.

Single pass sequences are so rare in Swift code that I usually create this contrived situation to test functions single pass sequences (h/t Nate Cook):

let singlePassSequence = sequence(first: 1, next: { $0 == 4 ? nil : $0 + 1 }).dropFirst()

for e in singlePassSequence { print(e) }
// 2
// 3
// 4
for e in singlePassSequence { print(e) }
// no output

While single pass sequences themselves are rare, code that operates on sequences (for example, a function in an extension on Sequence) is much more common, and so the onus of handling these single-pass sequences is foisted upon all this code. This code is written by Swift consumers much more often than new Sequences, and so this change will easy the burden on them (with some minor cost to the standard library maintainers). While it can sometimes be a fun coding challenge to write some operation in a single-pass fashion, it usually a burden and an afterthought. I'd rather Swift be optimized away from this type of "gotcha!" code.

Further, if the idea is to have a construct whose iteration is destructive, Swift already has that: Iterator. This change will brighten the line that Swift makes between Sequence and Iterator. It will mean that one is always mutating/destructive and one is always safe and always readable, instead of the blurry world we have today where one is always mutating/destructive and the other...sometimes can be?

To execute this change, we would need very few changes in Swift itself: we'd need to change _DropFirstSequence to be a multi-pass sequence, update the documentation a little bit to reflect the new semantics, and do an audit of the other sequences in the standard library to make sure there are no other cases of single-pass sequences. The impact on consumer code would be very minimal.

Some may ask: if we make this change, what is the value in keeping Sequence around at all? I think there are two main reasons. First, sequences can still be infinite, which is a useful semantic distinction commonly used in Swift code and provided for in the standard library (c.f., 1...), and second, sequences are still a really easy way to create an iterable thing without having to muck around with creating an Index type, IndexDistance, and so on.

As a cherry on top of the semantic simplicity we would gain from this change, it would also let us move the first property from Collection to Sequence. (Properties are never supposed to mutate their owner, and since sequences currently can be single-pass, in rare cases, first actually mutates the owner.)

If there is interest in this change, I would be happy to write up a more full proposal and open a pull request and sample implementation.

7 Likes

There's been discussion about relaxing the requirement for Collection to have a finite number of elements. Then, what you're asking for would be simply a Collection.

What use case are you thinking of that cannot be a Collection today?

I would have to think more about it, but any time you want to iterate something but don't want to deal with indexes. Off the top of my head, RandomSequence, iterating the nodes in a linked list, a tree, or a graph. I think other posters here can think of even more. Markov chains?

I think I must be misunderstanding you. Any multipass sequence would have a meaningful index. That's inherent to being multipass: by construction, the nth item is always the nth item, so n would be a meaningful index for the item.

But that's not germane to the question at hand. I was wondering what multipass sequence today can't conform to Collection such that it's motivated you to pitch this idea.

Yeah, I think we are misunderstanding each other. I don't quite follow you. There are plenty of sequences today that are multipass and can't conform to Collection. UnfoldSequence generally is a great example.

let seq = sequence(first: 1, next: { $0 + 1 }) (or 1...) are both multipass sequences that can never be collections (because of Collection's semantic finiteness requirement).

Now, you are right that anything that can be expressed as a multipass UnfoldSequence could be rewritten as a Collection, but that's not really how we use Swift today. We use sequences for a number of reasons, including the ease of defining them (as we see with UnfoldSequence's entire existence).

UnfoldSequence can't conform to Collection because it's not guaranteed to be multipass. The predicate next does not have to do the same thing every time. Nor, mind you, does first need to be immutable.

Even though your first example is in fact a multipass sequence, Swift has no notion of function "purity," so your idea would actually require UnfoldSequence to cease to be a Sequence.

Meanwhile, there's no semantic impediment to 1... being a Collection even today because, for Int, there are a finite number of values. There are other, practical issues, but conditional conformance for 1... is definitely being examined.

There's plenty of situations in Swift where the limitations on a particular type are enforced via semantics rather than via the compiler itself. (A simple example of this is Collection requiring its endIndex to be reachable from its startIndex; there are many others) In this case, usage of UnfoldSequence with impure functions (or a mutated first value) wouldn't be allowed on a semantic level (rather than a type-system level), because it would generate a sequence that breaks a semantic requirement — that sequences should be multi-pass.

If infinite collections do happen, then you're absolutely right — this proposal would be evaluated in a different light. But for now, infinite collections are still an idea; this is a different approach in the same space. To my mind, it's a simpler solution to implement and solves a lot of the same problems.

Lastly, to answer your earlier question, my motivation for pitching this is that writing functions that operate on sequences always requires me to backtrack and change it to handle single-pass sequences, when I know that single-pass sequences are wildly rare.

That's an example of a protocol's semantic requirements for a type. It's very problematic to allow initialization of values of a concrete type that violate semantic guarantees. We do have some scenarios in the standard library where this comes to pass, and it's not good every time.

Not to mention, what would be your justification for restricting use of sequence(first:next:) in this way? It's a very useful function, and you'd remove some of that usefulness--for what benefit?

There are many options here that don't require changing the semantics of standard library protocols. Your functions can document that being a multipass sequence is a precondition; this would have the identical effect to your proposal of changing the semantics of Sequence, but without affecting everyone. Or you can constrain your function to take only collections. What are some concrete examples that you've encountered where it becomes an issue?

@xwu One good example is grouping pairs of elements together in a new sequence. I use this all the time. For multi-pass sequences, this code is very straightforward:

extension Sequence where ... {
    func eachPair() -> AnySequence<(Element, Element)> {
        return AnySequence(zip(self, self.dropFirst()))
    }
}

I usually put this code into an extension on Sequence.

When you modify this function to handle single-pass sequences, it becomes a lot more complex:

extension Sequence where ... {

    typealias Pair = (Element, Element)

    func eachPair() -> AnySequence<Pair> {
        var iterator = self.makeIterator()
        guard var previous = iterator.next() else { return AnySequence([]) }
        return AnySequence({ () -> AnyIterator<Pair> in
            return AnyIterator({
                guard let next = iterator.next() else { return nil }
                defer { previous = next }
                return (previous, next)
            })
        })
    }
}

Quite a difference. And I don't feel comfortable just documenting that the first version is only works on multi-pass sequences, because in a lot of cases you don't know if a library is handing you a single-pass or a multi-pass sequence, and assuming it's multi-pass could create a subtle bug that you find out about way later.

Putting this code on Collection isn't that satisfying either; the algorithm really does only require iteration, which suggests that Sequence is the right place for it.

My concerns are also ideological to a degree. Reading a sequence shouldn't destroy it. Violating command-query separation seems bad.

1 Like

Your second algorithm only requires iteration, but the first requires both iteration and that the sequence be multipass (aka a collection).

These are all choices that are up to you to make. You can choose to support single-pass sequences or not to do so in your own functions, and Swift doesn't stand in the way of either choice. In what scenarios have you been unsure whether the return value from a library function is single-pass or multi-pass, and how would implementing your idea help you to know?

It seems that your idea boils down to not wanting the Swift standard library to support single-pass sequences at all. Yet people will still use them, whether in the form of sequence(first:next:) or random number generators.

I'd also point out that combining your eachPair method with a random sequence is a great example of why understanding the semantic requirement is important—when implemented properly (like the second version), the second element of each pair is the first element of the pair that follows it. When implemented without taking single-pass into account, like the first version, this wouldn't be true—each pair would be generated fresh and you wouldn't have the correspondence. (This is true even for seeded random number generators.)

5 Likes

The example in my post: sequence(first: 1, next: { $0 == 4 ? nil : $0 + 1 }).dropFirst() is a great example of a sequence I wouldn't expect to be single-pass, but is.

Yes, exactly! In a world with my proposal, if people want to make single-pass sequences, they "can", in the same way that they "can" make infinite collections now. The compiler will allow it, but you still shouldn't.

This is no doubt a problem, although I do think it's less of a problem until Swift nails down its randomness story. There are resolutions to the problem, like seeding each RandomSequence once, when it is initialized. That way, you can iterate it more than once, and get the same values each time, but any new RandomSequence you make will have different elements. I'm not an expert on randomness, though, so I admit I may be a bit out of my depth here.

My issue with this proposal is that, putting aside the semantics of Sequence, the protocol has a syntactic meaning: "Instances can be iterated with a for/in loop". I'm not comfortable with the suggestion that it should be impossible to use a for/in loop on a single-pass sequence.

Maybe the semantics should be severed from the syntax: we should have an Iterable protocol, which is refined by Sequence, which is refined by Collection. But we then have to ask ourselves, what is Sequence for? Any type which can be multipass can have reusable indices. (If nothing else, the iterator's state can be packaged up as an "index".) So when do we want to provide a multipass guarantee, but not want to provide indices?

Swift’s randomness story isn’t very well-defined at the moment, and it’s possible that a RandomSequence could be initialized with a seed that would create the exact same sequence each time if necessary.

There are certainly some types of RNGs which use a seed and thus could be multipass. There are also types of RNGs which really, really can't, and they're usually the ones with the strongest randomness guarantees. RNGs are a separate discussion (and one I've removed myself from), but I don't think we should design an RNG system which specifically excludes the strongest implementations.

1 Like

I agree with Brent. It would be unfortunate if we can't use for...in to read content of, say, a packet.

I'm not sure I understand the point. How would your idea help you know whether that value is a single-pass sequence or not, or to use it correctly? And where have you encountered such values being returned from a library without documentation that it's single-pass, as you say occurs in "a lot of cases"?

Soroush mentioned the traversal of data structures such as linked lists, trees, and graphs. These cases seem like good examples to me.

As mentioned above, one can forcibly define indices for any multiply enumerable sequence. But that doesn't necessarily mean that such indices are semantically meaningful or of practical value. What's the natural interpretation of the endIndex of a linked list? The fact that you can torture such sequences into being indexable doesn't mean that you should, or that you should be required to.

More broadly, I don't think the sequence provider's perspective is all that relevant. I see the thrust of this proposal as being more about the consumer's perspective. Given that you want to implement a generic facility such as eachPair(), how can you write it to cover the broadest possible spectrum of inputs, and how can Swift best support that?

The current taxonomy forces some code to be written in terms of Collection when it could just as well be written in terms of multiply enumerable Sequences, were those a thing. It's hard to quantify exactly how often that happens, but my subjective impression is that it's a common situation---common enough that on reading the OP, I immediately thought, "Oh yes, I know where he's going with this." I've often required Collections as inputs because I needed reiterability and knew that in the instant case, the arguments would happen to be Collections. But I'm always aware of the code smell.

Cartesian products are a paradigmatic example. They're used every-freaking-where, and you just can't define them on Sequences unless you're willing to transcribe at least one complete sequence to fixed storage. Yes, they're easy to implement in terms of Collections, but often, you don't control the data sources. All I want is to be able to write

for elt1 in seq1 {
    for elt2 in seq2 {
        // Do something with (elt1, elt2)...
    }
}

No explicit indexes needed, wanted, or visible on the horizon.

I have the sense that this was intended as something of a straw man; nevertheless, +1 from me for this variant. :grinning:

I think there's a role for nonreenumerable sequences. I just wish they weren't as prominent in the API as they currently are. The middle layer (Sequence above) should be the general-purpose lowest common denominator. It's fine with me if Iterables don't support things like Cartesian products without assistance; they'd be genuinely lower-level constructs.

It's also not clear to me why going from two to three layers should compromise for...in (or syntax generally) in any way. Couldn't all these layers support that?

3 Likes

I also want an Iterable protocol to sever the semantics of Sequence from it's syntax.
Not (directly) for its multi-enumerability, but for its order, in Unordered Collections.
Because I want to be able to express that while the order of elements could be decided by calls to arc4random(), each element I receive in one iteration, I will receive in a second iteration, eventually.

3 Likes

A Collection doesn't need to be either bidirectionally or random-access traversible--that's guaranteed by more refined protocols. But, an index has meaningful information for any multipass sequence. For one, you know you have the same value if you have the same index (inherent to being multipass), and that can be pretty valuable (no pun intended). And to answer your question, the natural interpretation of the endIndex of a linked list is that, if you have advanced from startIndex, you know that there are no other indices you haven't visited.

Let me ask you, what "meaning" or "value" for indices of a collection would be "tortured" for linked lists, trees, or graphs but not for other Collection types that don't conform to BidirectionalCollection or other *Collection protocols?

I think, given Nate's feedback, in fact the current design helps you to write correct algorithms for the broadest spectrum of inputs. By ignoring the possibility of a single-pass sequence, the "easier" implementation of eachPair() would not be able to accommodate even seedable (i.e., repeatable) pseudorandom sequences, whereas a trivial reworking required by the current semantics of Sequence is correct for all conforming sequence types.

To entertain the notion of changing the protocol hierarchy of Swift sequences, this will require empiric evidence.

Core team members have stated that their intention is that types that conform to Sequence but not Collection should be rare. I asked Soroush, and I will ask you too, what libraries have you encountered where this poses a problem, and for what algorithms that you want to implement?

Your stated for elt1 in seq1 { for elt2 in seq2 { ... } } can be constrained to Sequence without problems today, so I'm not sure what that example is supposed to illustrate in the way of proposed changes.

I'm putting this first because my whole complaint may be based on a misunderstanding of the API.

If I understand it correctly, this form is syntactically fine but semantically undefined: "A conforming sequence that is not a collection is allowed to produce an arbitrary sequence of elements in [a] second for-in loop."

Is it a recommended approach to allow generic functions to assume that their input Sequences are in fact multiply iterable? (In other words, just let everything collapse at run time if inappropriate Sequences happen to be supplied as arguments?) This doesn't seem very formal, but it does have a certain simplicity and appeal. I had formerly assumed that since Sequences were not guaranteed to be reiterable, you should not presume to reiterate them.

When "torture" enters the API discussion, can "Hitler" be far behind? :-) Sorry, my fault.

You're right: indexing on something like a graph or tree is not inherently more obscure than indexing on any other abstract Collection. However, I'd argue that much of the indexing in the standard library is already pretty tortured. Lord help you if you need to do complex character-level processing on a String without first converting it to an array of Characters.

Mind you, I don't claim that there's anything wrong here. Indexing is abstract algebra, and libraries must be full-featured and formally correct. I just question whether, as an average developer who just wants to emit a sequence of values within the context of a specific app, I should be forced to define my own abstract algebra of data structure indices. It's extraneous. If nobody is ever going to call that code, I shouldn't be required to write it just to conform to the standard library's requirements.

I'm not sure I agree that the rewrite is trivial. Nate didn't post the "correct" code; Soroush did, as an example of how tedious it is to be correct under the current standards.

In this case, the "correct" code is longer and more complicated but also potentially faster. However, for something like a Cartesian product (all combinations of elements from multiple sequences), the current regime imposes additional overhead costs.

I acknowledge the rest of this as hand-waving, but I don't think libraries are the real issue here. This is about how the patterns and examples set by the standard library generalize to other code and influence the use of Swift generally. Of course robust libraries promote abstractions to their highest possible level, e.g., Collection vs. Sequence. Who cares if a library developer has to spend a few hours developing an abstract indexing algebra for a particular type of collection? Somebody will eventually use that stuff.

But, the standard library is also a general model for how to structure Swift code. Simple, everyday things should be simple. It makes sense for the standard library to accommodate these straightforward use cases even when doing so wouldn't significantly change the architecture of the standard library itself.

Given that, it makes sense for the standard protocol hierarchy to acknowledge the existence of sequences that are multiply iterable but not indexable.