[Pitch] Make Collection Super-Convenient and Retire Sequence

dabrahams · April 24, 2018, 2:19pm

Let's Retire `Sequence`

TL;DR Evaluated against its documented purpose, Sequence is not pulling its weight. It creates pitfalls for library users and complexity for everyone. That said, defining a Sequence is super easy. We should make it that easy to define a new Collection.

Note: for earlier discussion on this topic, please see this post and, if you have the fortitude, the foregoing thread.

Costs of `Sequence` In Its Current Form

The biggest problem with Sequence is that it is so easily and regularly misused. An arbitrary Sequence supports only a single traversal, so any generic code (or extension of Sequence) that attempts to read the elements multiple times is incorrect. Reading the elements includes the use of for...in loops, but also methods declared to be non-mutating—such as map, filter, and reduce- that must actually change the value of an arbitrary sequence. It's hard to overstate the harm done to code readability, since code that appears to be pure hides side-effects that occur in the general case.

Because the vast majority of Sequence models (and all the ones supplied by the standard library) support multiple passes, generic code over Sequences is typically never tested with a single-pass Sequence. Finally, because it is at the root of the protocol hierarchy and has so few requirements, it is very attractive to implement Sequence where implementing Collection would be more appropriate. Making a type conform to Sequence instead of Collection has both efficiency and capability costs for operations on that type.

Because the definitions of Sequence.SubSequence and Collection.SubSequence conflict, it is currently impossible to create a generic type that conditionally conforms to Sequence or Collection based on its parameters. This is a clue that the design is wrong in some fundamental way. Furthermore, Sequence's SubSequence-creation operations—whose spellings can't involve indices or subscripts beceause Sequence doesn't support them—block progress on unifying these slicing APIs under a subscript syntax.

Why Do We Have `Sequence` and `IteratorProtocol`?

These protocols were originally created to support for looping over arbitrary sequences. The code

for x in seq { something(x) }

would be compiled as:

__i = seq.makeIterator()
while let x = __i.next() { something(x) }

where seq could have any type conforming to Sequence. It was an extremely simple model that directly addressed the needs of the for loop and could support trivial generic algorithms such as reduce, map, and filter besides.

The Sequence model was, however, too simple to support nontrivial algorithms such as reverse, sort, and binary search. The nontrivial algorithms had a common need to represent, and revisit, a position—or index—in the sequence.

When we discover that an already-useful protocol lacks the requirements needed to support an important usage, we have to decide where to add those requirements: to the original protocol, or to a new refinement of the original protocol. To make that decision, we ask whether any known models of the protocol would be unable to efficiently support the new requirement. This is, for example, why BidirectionalCollection and RandomAccessCollection are distinct protocols: random access is an important capability for some algorithms, but there is no efficient way for important collections such as Dictionary and String to support it.

Some real-world sequences (e.g. a series of network events) only support making a single pass over the elements. We'll call these sequences streams from here on to avoid confusion. Our experience with C++ input iterators told us that representing anything like a “position” in such a stream was fraught with difficulty. Based on these factors, combined with the high (at the time) engineering cost associated with changing the way the compiler implemented for...in to interact with a more complicated protocol such as Collection, we decided that streams deserved their own protocol… without ever discovering a model that couldn't efficiently support multi-pass traversal. Arguably—for a reasonable definition of “efficient”—the GeneratorCollection adapter described below demonstrates that no such model exists.

But Sequences Can be Infinite

Currently Collection models are required to have a finite number of elements, but a Sequence that is not a Collection can be infinte. This requirement is motivated as follows in the documentation:

The fact that all collections are finite guarantees the safety of many sequence operations, such as using the contains(_:) method to test whether a collection includes an element.

It should be noted first off that the comment is misleading: finiteness does not make any difference to memory safety, which is what “safe” means in Swift. The only possible consequence of allowing infinite collections is that some algorithms might run forever or exhaust memory, neither of which is a safety problem.

More importantly, though, the motivation is hollow:

When it comes to termination or memory exhaustion, there's little practical difference between an infinite collection and one that is simply huge. We don't have a problem with 0...UInt64.max as a collection, and yet no reasonable program can process all of its elements.
The standard library provides many Sequence algorithms that may never terminate when the target is inifinite: among them, We provide min()/max(), starts(with:), elementsEqual().

Since it would do no harm to allow infinite collections, support for infinite sequences is not legitimate a motivation for the existence of Sequence as a protocol separate from Collection.

But Some Sequences are Fundamentally Single-Pass

While it's easy to build a Sequence that is single-pass as an artifact of how it is defined, true streams are extremely rare. For example, the following is only single-pass because of a design choice:

// Would be multipass if declared as a struct
class Seq : Sequence, IteratorProtocol {
    var i = 0
    func makeIterator() -> Seq { return self }
    func next() -> Int? { return i >= 10 ? nil : (i, i+=1).0 }
}
let s = Seq()
print(Array(s)) // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(Array(s)) // [] oops, s was mutated

A true stream would require significant additional storage or computation to
support multiple passes over its elements. There are two known cases:

The initial state of the element generation procedure, which is modified as new elements are generated, is very costly to construct and to store, such as in the Mersenne Twister pseudo-random number generator.
The sequence represents some volatile, non-reproducible data stream, such as readings from a temperature sensor. In these cases, the lowest levels of the operating system are often buffering input in a way that makes multiple passes possible, and GeneratorCollection could easily be used to adapt the others.

The Proposal

I propose we define a default Index type for Collections that provide an Iterator, as shown in Sequence.swift, and retire Sequence.

Migrating Existing Code

Immediately removing Sequence would break unacceptable amounts of existing code, so I propose making it into a deprecated typealias for Collection as shown in Sequence.swift. Even so, some code will break (e.g. any extension made for both Sequence and Collection becomes ambiguous), so this change shouldn't be undertaken lightly.

Migrating Existing `Sequence` Conformances

Under this proposal, existing types that conform to Sequence will newly and automatically conform to Collection. It will be up to the user to correctly identify sequences that are true streams and adapt them accordingly (see below), but there is a heuristic that can be partially automated: an Iterator whose next method does not mutate its stored properties, but calls functions with unknown side-effects, is often consuming the source elements. False positives for such a test seem to be rare, but there are examples such as AnyIterator (which would of course be exempted).

Migrating Existing `Sequence` Constraints

Standard library code that uses Sequence as a constraint, but is not already and separately present on Collection, would be changed to use Collection instead. Sequence constraints in the wild should be changed to Collection as well.

Adapting Streams

If Sequence is no longer going to mean “single-pass,” we'll need some way to migrate code that is passing a stream to a Sequence operation such as map. We can add multipass capability using an adapter that buffers blocks of elements as they are visited for the first time. The accompanying GeneratorCollection.swift demonstrates.

Needlessly buffering an entire stream would be unacceptable, but traversing a GeneratorCollection just once can be much cheaper than that, as the example demonstrates: just one block buffer is allocated and reused throughout the traversal. This overhead of this block buffer is real, but I argue that in the rare places it would be needed, it will seldom be significant: true streams already pay substantial performance costs for their large state, I/O, or both. In the rare instances where this cost is unacceptable, algorithms can be overloaded to accept generators. Array initialization is one likely candidate.

Open Questions

I've tried to think of everything, but there are still some questions for which I don't have answers.

What is the Protocol for Single-Pass Iteration?

Semantically, any stream of Ts can be captured in a generator function of the form ()->T?. Generators avoid the impression given by Sequence that it can be traversed without mutation and by IteratorProtocol that it can be independently copied and stored. If the need should arise to build generic systems over single-pass sequences, generators would almost work… except that, as non-nominal types, they can't be extended. Any algorithm over streams would have to be defined as a free function, which I think is unacceptable.

One possibility is to not define a protocol at all, but instead, one generic type that can be extended with algorithms:

struct Stream<T> {
   public let next: ()->T?
}

extension Stream {
   func map<U>(_ transform: (T)->U) -> [U]
}

To properly capture the single-pass nature of streams, though, we need the ability to express in the type system that any interesting operation consumes the stream. This capability is expected to arrive with ownership support, but is not yet available. In my opinion the problem of how to support single-pass iteration should be solved when we have full ownership support, and not before.

How do We Support Move-Only Element Types?

With full ownership support, the language will gain move-only (noncopyable) types, which are not supported by the signature of IteratorProtocol.next, which returns its argument by value and thus would have to consume the source. Iteration over an Array of move-only types should be a non-mutating operation that either borrows, or gets shared read-only access, to each element. The solution to this problem is as yet unknown.

Whither `IteratorProtocol`?

Geiven everything said above, it's not even clear that IteratorProtocol should exist in the long run. Collection certainly doesn't need the protocol or associated type: iteration over c could be expressed as repeated calls to popFront() on a copy of c[...]. I don't propose to remove it now, but where it should end up is anybody's guess.

What about methods like `suffix(n)`?

Today, x.suffix(n) is supported by Collection with performance O(count). We probably shouldn't have done that. This method was originally defined only for BidirectionalCollection with O(n) performance, but when we discovered that it was possible to implement it for Sequence (with reduced performance guarantees) we went ahead. However, that meant that Collection, which only had forward iteration, had to support it too. I'm not sure if there are other Collection methods whose existence there was driven by trying to provide maximum functionality for Sequence, but this one, at least, should be reconsidered.

Acknowledgements

Special thanks to Nate Cook for his initial implementation of SequenceIndex and to @bjhomer for his idea of collapsing Sequence and Collection.

bjhomer · April 24, 2018, 3:57pm

It looks like under the new definition of Sequence proposed here, defining next() or makeIterator() is sufficient to define a Collection. That's great.

Is there a benefit to limiting this functionality to types declared as a Sequence? Could we instead declare this:

typealias Sequence = Collection

and then implement the same functionality on Collection? Given only their names, it's not immediately apparent to me what the difference should be between a "Sequence" and a "Collection". If the only difference is that Sequence is more convenient to define, I wonder if we could take those same benefits and apply them to Collection.

nuclearace · April 24, 2018, 4:17pm

Thanks for the write-up Dave. Can you speak more to the source compatibility of this change? Making changes to core protocols, especially this late in the game, seems a bit scary. Specifically, will making this change introduce any potentially hidden semantic changes that weren't already being misused?

I agree with your suggestion that we wait to add a Stream<T> till more of the ownership system is in place.

dabrahams · April 24, 2018, 6:58pm

@bjhomer That's a very interesting idea, but… OMG it works—I love my Swift community! I think I'll have to revise some of the proposal's conclusions in light of this.

@nuclearace Source compatibility is always a gamble; almost any change can break source—it's just a matter of probability—and the one I propose is very intrusive so is more likely to break source than most. I think it probably gets much better overall with @bjhomer's suggestion, but there are still obvious cases that will break:

extension Sequence { func f(){} }
extension Collection { func f(){} } // Error: f already defined

The only mitigating factor is that they're relatively easy to fix.

davedelong · April 24, 2018, 7:39pm

I love the suggestion @bjhomer makes to kill Sequence by making it a typealias.

Also, an observation: We already have infinite, single-pass sequences in the form of AnyIterator<T>. I don't think we need a new type (Stream<T> or whatever) to represent an infinite sequence. They're rare enough in practice that I don't think we need a way to generalize their existence. (Although if we leave it at that, we may want to deal with the fact that I can init an Array with an Iterator, which isn't great in the face of infinite iterators. However, we already have this problem, so ... )

Jens · April 24, 2018, 8:47pm

Did you mean "to represent a single-pass sequence"?

From the proposal:

Some real-world sequences (e.g. a series of network events) only support making
a single pass over the elements. We’ll call these sequences streams from
here on to avoid confusion.

hlovatt · April 24, 2018, 8:51pm

I like this proposal and @bjhomer’s suggestion of a typealias.

Infinite collections have to be processed lazily therefore why not make lazy collections single pass and potentially infinite? Whereas non-lazy collections are by definition finite and multipass. Further lazy collections cannot be used in for loops, that way there is no possibility of nested for loops contravening the single pass requirement.

dabrahams · April 24, 2018, 9:09pm

I just updated the proposal to include @bjhomer's suggestion.

jawbroken · April 25, 2018, 9:13am

Seeing this as a complete pitch, I have moved from agnostic to positive about making this change to Sequence. In particular, I agree that most Sequences that are currently single-pass are accidentally so, because of the relative ease of conforming to Sequence vs Collection, and providing an equally simple way to define a Collection is a very nice solution to this issue.

dabrahams · April 25, 2018, 1:33pm

I got a couple of private questions about this pitch this morning, which I'll answer here:

Q: If a collection can be infinite, what is count going to do? A: it can loop forever, just like an implementation of max() might for an infinite Sequence of Floats today, or it can trap, like a recursive implementation of the same method might.
Q: What will happen to the sequence(…) functions? A: they would automatically continue to work, but would now create a conforming Collection. As with everything that newly gains a Collection conformance, it would be up to the user to ensure multipass behavior (usually by inspection but at worst by using something like GeneratorCollection).

nnnnnnnn · April 25, 2018, 2:41pm

I think I'd describe my current opinion of this plan as "cautiously optimistic." It's really interesting that the source compatibility issues are actually lessened by just aliasing Sequence than the inversion you originally described. For example, in cases where someone has extended Sequence, those extensions should still work for any types that they did before.

One way of thinking about this change is that it's adding a new way to build a custom Collection type — before, you had to provide, at minimum, lower and upper index bounds and an (Index) -> Element subscript, now you'll be able to just provide a () -> Element? method and be done with it. Even though there won't be a separate protocol, this fits well with the existing Collection hierarchy in that if you need a bit more performance you can often just do a bit more work in your implementation. In a way we'd still have the same hierarchy:

protocol	benefits	considerations
`Collection` (via `next()` [née `Sequence`])	easy to implement	forward traversal only, unoptimized index type
`Collection` (via indexes)	better indexing performance	forward traversal only, custom index types can be more work
`BidirectionalCollection`	backward traversal	need to add `index(before:)`
`RandomAccessCollection`	better overall performance	need to handle offsets and `distance`

soroush · April 26, 2018, 12:08pm

There's one part of your table I don't follow — how do you get better indexing performance by implementing custom indexes? SequenceIndex keeps a reference to its element so you can return that in O(1) time.

In fact, you might be able to remove that entirely, because calling next() on Base.Iterator should give you the same value?

Regarding the proposal as a whole — I'm not sure. It seems a lot like using a sledgehammer to crack a nut, although it does solve a problems. My preference is still towards semantically changing Sequence to be multipass (a la [Pitch] Remove the single-pass requirement on Sequence), purely for the limited effects on existing source, but Dave's pitch also solves the associatedtype weirdness when jumping from Sequence to Collection.

nnnnnnnn · April 26, 2018, 3:58pm

The index that you get if you just write the next() method is pretty heavyweight—each index needs to contain an Int counter, the element at that index, and the full iteration state. If you can write a smaller index type (or even use Int or something!) you should get better performance.

Of course, this is a case where you should really only need to optimize this if it's causing problems. If you just need to create a lightweight sequence type and then iterate it, you wouldn't even be touching any of the indices.

dabrahams · April 26, 2018, 7:16pm

This demonstrates that it's possible. It's a tradeoff between storing the element and calling the next() function twice as often in an advance-index-and-subscript loop. I think storing the element probably wins, but it's just a guess.

Jon_Hull · April 28, 2018, 9:55am

Thanks for writing this up @dabrahams!

I have spent quite a bit of time thinking about this since our initial discussion and I am convinced we really need to spend some time considering both the containment aspect of sequence/collection (i.e. they are a collection of values) and the ordering aspect (i.e. those values are ordered in some way... even if the ordering is arbitrary).

I had this all written down at some point, but seemed to misplace that. Please forgive any errors as I am trying to recreate it from memory after several months.

Containment

On the containment side, let's start with something fairly abstract:

//Something which contains (possibly uncountably many) values of some type
protocol Container {
    associatedtype Element
    
    var isEmpty:Bool {get} //Does the container contain any elements

    func randomElement() -> Element? //A random element from the container
    func randomElement(using rng: inout RandomNumberGenerator) -> Element? //A random element using a particular RNG
}

extension Container where Element:Equatable {
    func contains(_ target:Element) -> Bool {...} //Does the container contain the target element?
}

It may seem strange that I started by allowing uncountably many elements, but note that this is also compatible with things like Range, and would give things like Ranges and Collections a common base protocol. At this level of abstraction, all we can do is ask what type of Element the container/collection has, if it is empty, and ask for a random element. If the element is equatable, we can ask if it is contained in the Container. Notice that we can't yet iterate over the things in any order.

Next up:

//A Container which has a countable number of items which can be iterated through in a reasonable time
protocol CountableContainer : Container {
    var count : Int {get}
    func contains(where: (Element)->Bool) -> Bool
    func unorderedElements() -> Iterator<Element>
    //Any other functions enabled by unordered iteration (e.g. map, filter, etc...)
}

Here we gain our count, and the ability to iterate over the container's contents in an arbitrary (undefined) ordering. This could underly CountableRange, but notice that this also fits Set and Dictionary better than Collection does, because they shouldn't guarantee an ordering. CountableContainers should be for...in-able. They should also be equatable if their Element is Equatable.

I would also argue without providing details, that we should provide NonEmptyContainer, and if it isn't possible to combine naturally with CountableContainer, NonEmptyCountableContainer (which would be the equivalent of CountableContainer & NonEmptyContainer w/o the diamond issue). These would underly ClosedRange and CountableClosedRange.

I would also like to see a SingleContainer, which contains a single element.

Ordering

There are a couple different ways we could go for ordering. We could define a protocol which creates/defines an ordering of a particular Element. For simplicity's sake, I am going to instead posit the idea of a "natural ordering", where there is one ordering which is obvious.

protocol Collection : CountableContainer, Ordered {...}
protocol Range : Container, Ordered {...}

That is, Collection is just a CountableContainer with a natural ordering defined somehow within it. Range is a Container with a natural ordering defined on it. With Collection, we get things like first, last, and the ability to index into it with a subscript.

I just looked at the clock and I need to be up in a couple of hours. I can explain better later if needed, but that should give you the gist...

Tino · April 28, 2018, 11:01am

Deprecating Sequence obviously solves all issues that protocol has ;-), so imho it's preferable over what we have now.
But apparently, this step alone doesn't solve all problems, but turns some of them into Collection-issues.

It's also a big change in one very important aspect of Swift, so I say it's crucial to get this right on the first try to avoid a situation similar to what we had with access control.

Therefor, I think it makes sense to split this huge topic into threads:
There are now at least three different designs that have been suggested, and supplementary to separate threads for all of those, imho it would make sense to collect the aspects of the status quo that should be improved.

On top of that, we could also look at existing solutions -- Swift may be different than some other languages, but collections are quite fundamental, so I'm sure we can learn from some of them.

jawbroken · April 28, 2018, 12:41pm

unorderedElements() seems like a lie, because the Iterator is going to iterate over them in some order. Presumably that order would be fixed (because there is no advantage in having it somehow be random for any implementation of any collection I've ever heard of), so you can easily define all the things you save for Collection on it (first, last, indexing, etc). But this is all back in Set/Sequence thread territory where some people are going to be adamant that these are unordered types, when actually the order is just not controllable/specified/whatever. I remain very positive about the pitch in this thread, which actually eliminates some of the hierarchy while retaining the simplicity of Sequence, but negative about building a more complex hierarchy with benefits that are very limited at best, as far as I can tell.

Chris_Lattner3 · April 28, 2018, 10:30pm

I like the direction this proposal is going, and I agree with your premises. I don't think that Sequence is carrying its weight, and I agree that anything actually single pass that is modeled in terms of Sequence is probably broken in practice.

I think that merging Sequence into Collection is very tantalizing and making a separate Stream protocol makes a lot of sense.

I would love to see this happen, but it is also hard to evaluate the effects and details of this proposal, I think those details can only be understood with a prototype implementation.

-Chris

Avi · April 29, 2018, 3:57am

If you are going to be pedantic, the very act of iteration imposes an order, even if that order is ephemeral. The name unorderedElements() implies no reliable ordering, and that seems good enough. As I understand it, one of the issues with sets and dictionaries is the requirement that order be preserved across iterations, provided no mutations of state. The proposed name implies that this requirement has been dropped, and it implies that the order may be arbitrary.

dabrahams · April 29, 2018, 5:53am

FWIW, that's a misunderstanding. There's no desire to allow consecutive iterations to present different orderings. In fact doing anything that changed the iteration order in subsequent passes would break thread-safety of these value types (or would require that they employ expensive and otherwise-unneeded locks and/or atomics).

HTH,
Dave