[Pitch] Make Collection Super-Convenient and Retire Sequence

There are a lot of changes happening in Swift at the moment: we will hopefully have generalized existentials for protocols, and also ownership that allows for move semantics. I think both of these would have a large effect on a collections API. Therefore I would suggest not putting any effort into Collection until these two items come to pass.

PS Just to be clear: I am in favour of updating the collections API, and I criticised its design from very early on, but I think existentials and ownership will have a profound effect.

1 Like

I am supportive of addressing the issues with the collection protocols in general. However, like Howard I also think it might be better to take a step back and revisit them more holistically when other language features are in place.

I have been having some difficulty forming an opinion regarding the relatively narrow proposed changes that resolve some issues while introducing new ones. Allowing valid conformances of Collection to trap when count is accessed seems particularly egregious to me. IMO we should be willing to commit to a design with strong semantics associated with a protocol.

It is ok for a requirement to have preconditions that apply to all valid conformances. For example, requiring indices provided to be valid indices into the current instance is perfectly sensible. On the other hand, allowing individual conformances to have idiosyncratic preconditions (or worse, simply not support a required member) seems awful to me.

It should be possible to write generic algorithms that are valid for every type that declares a valid conformance to the protocols required by the algorithm. It should also be possible to clearly state a semantic requirement that is common in a broad family of algorithms (such as the requirement that it be possible to process every element).

I haven’t had as much time as I would like recently but am continuing to think about this topic. It is very important to get it right if or when we do have the opportunity to make some breaking changes in this area. Perhaps some targeted changes make sense in the near future but perhaps they would simply cause churn without getting us closer to a true solution to the underlying problems.

It would be much easier to evaluate that if we had a clear vision of what the long-term goal is when we have an ownership system, etc. This would include the iteration redesign @dabrahams has mentioned and perhaps some other as-yet-unstated goals. I wonder if a manifesto-style document may be in order here.

4 Likes

Nobody's proposing to do that. Collection.count would always have the precondition “the collection is finite.” The fact that some collection types always satisfy that precondition is perfectly normal.

I think there's a valid concern here, which I'd phrase as, “lots of algorithms that currently have no preconditions will gain a precondition that the collection is finite, and that's unfortunate.” But if you throw 0..<Int.max at most of these algorithms, they'll fall over as well, with the practical result being a trap or an apparent hang, so I'm not very compelled by this concern.
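To make the point above concrete, here is a minimal sketch (using the standard library's `sequence(first:next:)` to build a conceptually unbounded sequence, a stand-in for the kinds of inputs under discussion) of how lazy, early-terminating algorithms cope with unbounded input while eager ones fall over:

```swift
// A conceptually infinite sequence of natural numbers.
let naturals = sequence(first: 0, next: { $0 + 1 })

// Early-terminating algorithms are fine: iteration stops as soon as
// the predicate is satisfied.
let firstEven = naturals.first(where: { $0 % 2 == 0 })  // Optional(0)

// Eager algorithms are not: this would iterate forever (or, for
// 0..<Int.max, appear to hang / exhaust memory).
// let doubled = naturals.map { $0 * 2 }
```

The practical outcome is the same whether the input is truly infinite or merely astronomically large, which is the point being made here.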

1 Like

I'm not sure generalized existentials will have much effect on a rework of the collection protocol hierarchy, but I do think that ownership will. Being able to express in the type system that a given sequence won't (or will) be consumed by iteration would be very helpful towards producing a useful set of protocols.

1 Like

Sure. I think the reason I see this one differently is because the entire semantics of this member is to express the precise size of the collection. Including this “requirement” without requiring the collection to be finite in the first place feels like a pretty clear contradiction in the semantics of the protocol to me. count simply makes no sense at all as a requirement of a protocol that allows infinite conformances.

I have been thinking a lot about this issue since you posted your current thoughts last week. IMO it gets at a deeper design issue. I am still thinking about it, but the “container” direction that some have mentioned is one that I feel warrants further consideration.

Here are a few of the things I’ve been thinking about:

  1. Many algorithms require the ability to process every element. Ideally this would be a semantic requirement of a protocol and not a precondition that is repeated all over the place.
  2. Many types can guarantee this capability. Most obviously all container types. If every element was explicitly constructed prior to invoking an algorithm (which is true of all containers) it should be possible to process every element.
  3. There are some “lazy” types such as ranges which have some instances for which processing every element is sensible and other instances for which it is not sensible. These types seem to be at the heart of the design issue we are facing.
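Point 3 can be illustrated with Swift's existing Range type, where the "sensible" and "not sensible" instances are indistinguishable to the type system:

```swift
let small = 0..<5
let huge = 0..<Int.max

// Both values have the same type, Range<Int>. Nothing in the type
// system distinguishes the instance you can traverse from the one
// you practically can't.
let doubled = small.map { $0 * 2 }     // fine: [0, 2, 4, 6, 8]
// let impossible = huge.map { $0 * 2 } // same code, would exhaust memory/time

// Note that count itself is not the problem for Range: it is O(1).
let hugeCount = huge.count             // Int.max, computed instantly
```

This is why "must be finite" preconditions are slippery: the precondition is really about whether every element can be processed in practice, which is a property of the instance, not the type.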

As I mentioned earlier, I don’t have any strong opinions yet. This is a very tough problem especially given the source compatibility concerns we have to consider. I really appreciate your willingness to take it on despite the difficulty.

1 Like

Even putting aside the slim difference between an infinite Collection and one that is very large, I would say that a key problem with splitting the hierarchy into FiniteCollection and InfiniteCollection is that the type system will not be able to help you maintain the separation in many cases. So this will require an escape hatch, like .assumingFinite or similar, to assert that the Collection is actually finite and convert between them, or to call methods that you know are reasonable for your particular infinite Collection and given parameters.
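A minimal sketch of what such an escape hatch could look like, using a hypothetical marker protocol rather than a real rework of the hierarchy (`PossiblyInfinite`, `assumingFinite`, and `Naturals` are all invented names for illustration):

```swift
// Hypothetical: a source whose element count may be unbounded.
protocol PossiblyInfinite {
    associatedtype Element
    func makeIterator() -> AnyIterator<Element>
}

extension PossiblyInfinite {
    /// Escape hatch: the caller asserts finiteness and gets an Array.
    /// If the assertion is wrong, this never returns.
    func assumingFinite() -> [Element] {
        Array(AnySequence(makeIterator))
    }
}

struct Naturals: PossiblyInfinite {
    func makeIterator() -> AnyIterator<Int> {
        var n = 0
        return AnyIterator { defer { n += 1 }; return n }
    }
}

struct ThreeFives: PossiblyInfinite {
    func makeIterator() -> AnyIterator<Int> {
        var inner = [5, 10, 15].makeIterator()
        return AnyIterator { inner.next() }
    }
}

let ok = ThreeFives().assumingFinite()   // [5, 10, 15]
// Naturals().assumingFinite()           // assertion is wrong: hangs
let safe = Array(AnySequence(Naturals().makeIterator).prefix(3))  // [0, 1, 2]
```

The trouble, as noted above, is that the type system cannot check the assertion, so the escape hatch just moves the "must be finite" precondition from the algorithm to the conversion point.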

Things like count, map, etc. should obviously not be called directly on an infinite Collection, though map can be on a lazy view, but other parts of the API are not so clear cut. Quickly skimming the Collection documentation and Sequence documentation, it seems like it would be ambiguous as to whether the resulting Collection is finite or infinite, or whether the method is reasonable to call (e.g. won't infinite loop) as it depends on the contents of the Collection and the given parameters, in the following cases:

  • subscript(bounds: Range<Self.Index>)
  • index(of:)
  • index(where:)
  • distance(from:to:)
  • split(separator:maxSplits:omittingEmptySubsequences:), which can be safe given judicious use of maxSplits and some knowledge of the contents
  • first(where:)
  • prefix(while:)
  • lexicographicallyPrecedes(_:), theoretically safe if you know the two are not equal, I guess

And I'm not immediately sure about things like prefix(upTo:), suffix(from:), and things like starts(with:) which are safe if only one of the Sequences is infinite, but not if both are. So the situation is kind of messy and it's hard to strike a balance between using the type system to help developers and removing or hiding possibly useful functionality.
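The "depends on the contents and parameters" ambiguity shows up clearly with prefix(while:). A small sketch, again using `sequence(first:next:)` as a stand-in for an infinite collection:

```swift
let naturals = sequence(first: 0, next: { $0 + 1 })

// Terminates: the predicate eventually fails, so the result is finite.
let belowThree = naturals.prefix(while: { $0 < 3 })  // [0, 1, 2]

// Never terminates: the predicate always holds, so prefix(while:)
// eagerly consumes elements forever.
// let everything = naturals.prefix(while: { $0 >= 0 })
```

No protocol split can distinguish these two calls: they differ only in the closure passed in, which is exactly the messiness described above.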

Yes I agree. Like I said, I’m not sure what the best solution is.

1 Like

To clear things up: count would never trap in Dave’s proposal, it just potentially wouldn’t terminate. So I think we are searching for a protocol requirement that lets us sidestep the halting problem?

FWIW, in Scala 2.12 the Traversable trait has the following relevant customization points:

Size info operations isEmpty, nonEmpty, size, and hasDefiniteSize: Traversable collections can be finite or infinite. An example of an infinite traversable collection is the stream of natural numbers Stream.from(0). The method hasDefiniteSize indicates whether a collection is possibly infinite. If hasDefiniteSize returns true, the collection is certainly finite. If it returns false, the collection has not been fully elaborated yet, so it might be infinite or finite.

The next refinement in the collections hierarchy, the trait Iterable, adds operations that work from the end (takeRight, dropRight). And it is the refinement after that, the trait Seq, that represents sequences.

A sequence is a kind of iterable that has a length and whose elements have fixed index positions, starting from 0.

Scala is currently in the middle of its third rework of collections for the upcoming version 2.13, which will remove Traversable and go with just Iterable… a rough correspondence to Dave’s proposal here.
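For comparison, a rough Swift analogue of Scala's hasDefiniteSize customization point might look like the sketch below (all names here are hypothetical; this is not proposed API):

```swift
// Hypothetical protocol mirroring Scala 2.12's Traversable size info.
protocol Traversing {
    associatedtype Element
    /// `true` guarantees the source is finite; `false` means it has not
    /// been fully elaborated, so it may be finite or infinite.
    var hasDefiniteSize: Bool { get }
}

struct Naturals: Traversing {
    typealias Element = Int
    var hasDefiniteSize: Bool { false }  // conceptually unbounded
}

struct Container<T>: Traversing {
    typealias Element = T
    var elements: [T] = []
    var hasDefiniteSize: Bool { true }   // fully materialized
}
```

Note the asymmetry Scala documents: a `true` answer is a guarantee, but a `false` answer is only "unknown", so this is a runtime query rather than a type-level distinction.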

The following are slicing operations, and thus are always “lazy” and work fine with infinite Collections:

Also, in your survey, don't neglect the algorithms that are currently already defined on Sequence, many of which will never finish or will trap if the input is infinite, e.g. map(_:), reduce(_:_:), filter(_:), min(), max().

To be clear, I'd love to have a better answer for infinite sequences and collections, but my sense is that no answer exists that is better in practice, especially given the existence of very large collections, of memory constraints on the machine, of time constraints on our users' patience, etc. And even if you live in the idealized world of pure abstractions, where only true infinities are bad, it is a problem that exists today for many Sequence methods, and that we wouldn't be making any worse. So, IMO, the thing to do is improve the parts of the design for which we have a clear path to forward progress, and explicitly not spend energy trying to make improvements that are unlikely to bear fruit.

I don't think the separation between technically finite (but very large) and infinite is really useful. If we decide we need to deal with infinite collections (e.g. using lazy map), then I think we should make the split point based on whether it can be iterated over in a "reasonable" time... where reasonable is defined somewhat hand-wavily.

Thanks for the corrections. These kinds of details weren't that clear from my brief skim of the online documentation.

If it wasn't clear, I was trying to explore the API that is “ambiguous” when used with an infinite Collection, i.e. sometimes the returned Collection will be infinite and sometimes it will be finite; or sometimes it will cause an infinite loop and sometimes it won't depending on the contents of the collection and the given parameters. All the methods you mention I already presumed would definitely be removed from the hypothetical InfiniteCollection protocol.

I completely agree with you. I was just trying to explore what trying to maintain this separation might look like, but it only confirmed to me that this probably isn't worth enforcing, or possible to enforce, at the type system level.

1 Like

I think conceptualizing a one-pass sequence as a Stream better represents things like a random number generator or user input. The output can be created but it can't be repeated and it may be infinite.

As you started out discussing, Sequence is most commonly a walk along knowable data, so is multipass. A walk produces iterable values. A knowable data source can have multiple walks, like striding, which is what @soroush and I were discussing in https://forums.swift.org/t/pitch-adding-strideable-sequences. We quickly converged on behavior that unified sequences and collections.

As for streams, I started putting together a very rough concept of Stream last week because I wanted to think through the problem space: stream.swift · GitHub. I slightly disagree with @davedelong about its utility (he writes, "They’re rare enough in practice that I don’t think we need a way to generalize their existence"). I think any real-world input, whether audio from a microphone, feeds from a website, tweets from social media, or, as mentioned above, random numbers and user input, can be well represented as a stream.
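A minimal sketch of the one-pass idea (these names are hypothetical and are not the linked gist's API):

```swift
/// A one-pass, possibly infinite source of elements. Unlike Sequence,
/// there is no implied ability to restart or replay the walk.
protocol Stream {
    associatedtype Element
    /// Produces the next element, or nil when the stream is exhausted.
    mutating func next() -> Element?
}

/// A pull-based stream: each value is created on demand and, once
/// consumed, cannot be re-read from the same stream.
struct CountingStream: Stream {
    private var state = 0
    mutating func next() -> Int? {
        defer { state += 1 }
        return state
    }
}

var s = CountingStream()
let a = s.next()  // Optional(0)
let b = s.next()  // Optional(1) — the 0 is gone; single-pass
```

The single-pass nature falls out of the `mutating` requirement: advancing the stream changes its state, so there is no separate iterator to hand out twice.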

7 Likes

The proposal for Stream looks a lot like an RxSwift Observable. Do we need such a thing in the standard library?

You may be right. You have certainly spent far more time than the rest of us thinking about this.

At a minimum, I think we should make the use of “finite” in the documentation more precise. For example:

a collection's indices form a finite range of the positions of the collection's elements

Given that we allow 0..<Int.max, this sentence leans toward the mathematical notion of "finite". At the same time we have algorithm precondition documentation that says "must be finite". This clearly carries a narrower connotation: that it must be possible in practice to process every element. Obviously your proposal to allow infinite collections eliminates the former connotation, leaving us with the latter.

Dave's answer quoted below implies that trapping is valid behavior. That would actually be better than an infinite loop IMO.

I don't think so. We are looking for a way to express the precondition that the input must have a size small enough to allow each element to be processed in practice. Several people have made the observation that this precondition is easily satisfied for all instances of container types.

The trouble comes when we want algorithms that process every element to also work with types that produce elements on demand. There isn't a precise way to distinguish 0..<5 from 0..<Int.max at runtime let alone compile time.

I think it is important to distinguish streams like random number generators from streams like user input. The former are able to produce the next element any time a program requests it. The latter have inverted control: the next event is produced and provided to the program when an external event occurs. "Stream" should only have one of these meanings in Swift. The discussion of streams-as-single-pass-sequences clearly implies streams which are able to produce the next element on demand.

This might be considered pedantic, but I think it is more accurate to say something like "the output cannot be repeated unless the stream has value semantics and a copy is made prior to iteration when multiple passes are necessary".
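That qualification can be shown with a small value-semantic sketch (the `Counter` type here is invented for illustration):

```swift
// A single-pass source with value semantics: produces 0, 1, 2, then ends.
struct Counter {
    private var state = 0
    mutating func next() -> Int? {
        guard state < 3 else { return nil }
        defer { state += 1 }
        return state
    }
}

var stream = Counter()
let savedCopy = stream            // value-semantic copy made BEFORE iterating
while stream.next() != nil {}     // first pass fully consumes `stream`

var replay = savedCopy            // the copy still holds the original state
let again = replay.next()         // Optional(0): a second pass is possible
```

With reference semantics (or with no copy taken up front), the consumed state is shared and the replay is genuinely impossible, which is the distinction being drawn.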

3 Likes

If I understand this correctly, trap here means stack overflow. I don’t see anything you can lift from here into types to get any static analysis support from the compiler... I don’t get your point.

See my reply to Erica above. Streams which produce elements on demand are very different from "reactive streams".

1 Like

I see.

I interpret it to mean that if we allow infinite collections and all instances of a given conforming type have infinite elements it would be reasonable to implement count by calling fatalError(). What would be the point of entering an infinite loop in that case?

For the sake of discussion let's consider a protocol ContainedCollection which has a semantic requirement that conforming types must store their elements internally. All of the algorithms that have the finite precondition could instead require Self: ContainedCollection. The compiler obviously cannot verify whether a conformance meets this semantic requirement, but that is true of many semantic requirements. But the fact that the compiler cannot verify this does not mean it is not useful! If that were true we would not need refined protocols such as RandomAccessCollection which exist to encode complexity requirements.

There are very good reasons ContainedCollection is not the design we have, but it is certainly possible, and sometimes very useful (see RandomAccessCollection), to encode semantic requirements like this.
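For concreteness, here is a minimal sketch of the hypothetical ContainedCollection idea described above (the protocol and the algorithm name are both invented for this discussion):

```swift
/// Hypothetical: a conformance semantically promises that every element
/// is stored internally, so it is always possible to process all of them.
/// Like RandomAccessCollection's complexity guarantees, the compiler
/// cannot check this promise; it is a documented semantic requirement.
protocol ContainedCollection: Collection {}

extension Array: ContainedCollection {}
// Range is deliberately NOT conformed: 0..<Int.max is a valid Collection,
// but traversing every element is impractical.

/// An algorithm that must visit every element expresses that in its
/// signature instead of a "must be finite" precondition in the docs.
func occurrenceCount<C: ContainedCollection>(of c: C) -> Int {
    var n = 0
    for _ in c { n += 1 }
    return n
}

let n = occurrenceCount(of: [10, 20, 30])  // 3
```

The compiler enforces only the conformance, not the semantics, but moving the requirement into the type signature means it is checked at every call site rather than repeated as prose in every algorithm's documentation.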