Introduce a `MaterializableCollection` protocol

This pitch assumes that the pitch Make Collection Super-Convenient and Retire Sequence moves ahead. If you are not familiar with that pitch, you may wish to read it first.

TL;DR: Many algorithms need to process every element of a collection. This is a significant and pervasive requirement that results in an infinite loop and/or a crash when it is not met. We should elevate this requirement from a precondition on individual algorithms to a semantic requirement of a new MaterializableCollection protocol, and move the relevant algorithms from Collection to this protocol.

Finite and infinite collections

The topic of finite vs. infinite sequences and collections has come up in quite a few Swift Evolution threads. @dabrahams summed up the reason no meaningful progress has been made in this area in the above-mentioned pitch:

When it comes to termination or memory exhaustion, there’s little practical difference between an infinite collection and one that is simply huge. We don’t have a problem with 0...UInt64.max as a collection, and yet no reasonable program can process all of its elements.

In other words, what we care about in practice when we talk about "finite" collections is termination of the algorithm. The mathematical notion of finitude is a necessary but not sufficient constraint. Termination requires a finite collection that is also small enough that every element can be materialized, and therefore processed, in practice.

Problems with the current design

The current design does not encode the requirement of materializability in the language. Instead, the standard library documents preconditions on specific algorithms. The implied expectation is that user-defined code that is generic over Collection will do the same when necessary and that callers will be diligent in meeting these preconditions.

Unfortunately, this precondition is quite pervasive. Roughly half of the members available on Collection require every element to be processed in the general case. I haven't done any analysis, but I would speculate that the ratio is even higher for third-party extensions.

In practice, most types conforming to Collection are able to materialize all of their elements. There are several reasons for this. The most commonly encountered collections store their elements, so the elements have already been materialized. Further, instances of finite collections that do not store their elements, such as ranges, are often (usually?) small enough to materialize their elements on demand without trouble. Most importantly, infinite collections (for example, a lazy mathematical sequence) are not allowed at all.

Complexity and efficiency

@dabrahams provided very useful criteria for evaluating when a new protocol should be introduced:

When we discover that an already-useful protocol lacks the requirements needed to support an important usage, we have to decide where to add those requirements: to the original protocol, or to a new refinement of the original protocol. To make that decision, we ask whether any known models of the protocol would be unable to efficiently support the new requirement. This is, for example, why BidirectionalCollection and RandomAccessCollection are distinct protocols: random access is an important capability for some algorithms, but there is no efficient way for important collections such as Dictionary and String to support it.

Collection clearly does not require the ability to materialize all elements of a conforming type, yet such a requirement is necessary for a large number of algorithms. This raises the question: are any known models of Collection unable to efficiently materialize all of their elements? The answer depends on how we define "efficiently". I argue that we should not restrict the meaning of "efficiently" to Big-O complexity.

It requires quite a stretch of the imagination to say that an infinite collection, or even 0...Int.max, is able to efficiently support an O(n) algorithm in practice. In such cases theoretical Big-O complexity is irrelevant to actual usage in real-world programs. The logic that argues for BidirectionalCollection and RandomAccessCollection also argues for MaterializableCollection.

The pitch

As stated previously, I propose introducing a MaterializableCollection protocol. All members which must process every element in the general case will be moved from Collection to the new protocol. Note: this includes count, thus allowing infinite collections without requiring them to implement count.

Some requirements and extensions of Collection can be implemented by a BidirectionalCollection or a RandomAccessCollection without materializing every element in the general case. For example, all RandomAccessCollections can implement count without traversing the collection, and all BidirectionalCollections can work from the end of the collection without traversing it, thus supporting members such as last, dropLast, etc. Members for which this is the case will be moved to both MaterializableCollection and either BidirectionalCollection or RandomAccessCollection.
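
To make this concrete, here is a minimal sketch of the proposed split. The protocols are renamed (PitchedCollection and so on) so the sketch compiles alongside today's standard library; the names, and the exact set of members that move, are illustrative rather than a final design.

```swift
// A minimal sketch of the proposed hierarchy. The "Pitched" prefix is
// only to avoid clashing with the real standard library protocols.
protocol PitchedCollection {
    associatedtype Element
    associatedtype Index: Comparable
    var startIndex: Index { get }
    var endIndex: Index { get }
    subscript(position: Index) -> Element { get }
    func index(after i: Index) -> Index
    // Note: no count, map, reduce, etc. here. Members that must process
    // every element in the general case move to the refinement below.
}

// Semantic requirement: every element of any conforming instance can be
// materialized (visited) in practice.
protocol PitchedMaterializableCollection: PitchedCollection {}

extension PitchedMaterializableCollection {
    // Whole-collection members live on the refinement; count is shown
    // as a representative example.
    var count: Int {
        var n = 0
        var i = startIndex
        while i != endIndex {
            n += 1
            i = index(after: i)
        }
        return n
    }
}
```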

I have spent some time doing a more detailed analysis of how individual members would be impacted but am omitting that from this pitch. If there is sufficient interest I intend to collaborate with anyone interested to flesh out a more complete detailed design.

Collections such as Array and String that store their elements internally will conform to MaterializableCollection. Collections that do not store their elements internally may conform to MaterializableCollection, but only when all instances are statically known to be small enough to be materialized in practice (for example, AllCases collections may fall into this category).

Collections such as CountableRange will not conform to MaterializableCollection directly. Instead, they will provide a materializable view that allows users to assert that the instance they are working with is materializable. The standard library would provide a generic MaterializableCollectionWrapper type (this name is a strawman) that can be used with any Collection.
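
Building on the sketch above, the wrapper and view might look roughly like this (again, the names are strawmen):

```swift
// A trivial forwarding wrapper: conforming it to the materializable
// protocol is the caller's assertion that this particular instance is
// small enough to materialize in practice.
struct MaterializableCollectionWrapper<Base: PitchedCollection>: PitchedMaterializableCollection {
    let base: Base
    var startIndex: Base.Index { base.startIndex }
    var endIndex: Base.Index { base.endIndex }
    subscript(position: Base.Index) -> Base.Element { base[position] }
    func index(after i: Base.Index) -> Base.Index { base.index(after: i) }
}

extension PitchedCollection {
    // The "materializable view" referred to throughout this pitch.
    var materializable: MaterializableCollectionWrapper<Self> {
        MaterializableCollectionWrapper(base: self)
    }
}
```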

Migrating existing code

This pitch has a significant impact on source compatibility which will need to be carefully considered.

Migrating code that uses concrete Collection types that do not conform to MaterializableCollection

Concrete types that conform to Collection but will not conform to MaterializableCollection (such as CountableRange) will no longer have members which require materializability. Code that uses these members on instances of the concrete type will need to use the materializable property to assert that this usage is valid.
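
As a hypothetical before/after (the "after" line assumes the materializable view sketched earlier, which does not exist in today's Swift):

```swift
let range = 0..<1_000_000

// Today: reduce is available directly on the range.
// let sum = range.reduce(0, +)

// Under the pitch: the caller asserts materializability explicitly.
// let sum = range.materializable.reduce(0, +)
```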

Migrating Collection constraints

Generic code that uses a Collection constraint along with members that require materializability will need to be migrated to require MaterializableCollection instead. This will have ripple effects, which will need to be addressed either by strengthening a generic constraint to MaterializableCollection or by using the materializable view.
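
For example, a generic algorithm whose finiteness precondition currently lives only in documentation would state the requirement in its signature instead. A sketch, reusing the renamed protocols from above:

```swift
// Before: func average<C: Collection>(_ c: C) -> Double
//         with a documented precondition that c is finite.

// After: the requirement is part of the constraint.
func average<C: PitchedMaterializableCollection>(_ c: C) -> Double
    where C.Element == Double {
    var total = 0.0
    var n = 0
    var i = c.startIndex
    while i != c.endIndex {
        total += c[i]
        n += 1
        i = c.index(after: i)
    }
    return n == 0 ? 0 : total / Double(n)
}
```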

Migrating user-defined Collection conformances

Many user-defined collections will need to be updated to conform to MaterializableCollection instead of just Collection.

Open questions

What terminology should be used?

The term "materializable" was chosen as it represents the requirement in a very concrete way but there are alternatives that could be considered. Most notably, Haskell defines the type class Traversable which is defined as the "class of data structures that can be traversed from left to right, performing an action on each element." I would be happy to revise this pitch to use the name TraversableCollection if feedback indicates that is a better choice.

Should the materializable view be provided on Collection itself or only on concrete types?

If we are going to support infinite collections there will be some conforming types for which no instances are truly materializable. On the other hand, the footnote at the end of this pitch discusses an advanced use case where it may be useful to "pretend" that they are materializable.

How does MaterializableCollection interact with MutableCollection?

Mutability seems to imply that elements are stored, and MutableCollection includes algorithms such as sort that process every element. Should MutableCollection refine MaterializableCollection? If so, what is the impact?
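
One possible answer, sketched with the renamed protocols from earlier; whether this refinement is actually desirable is exactly the open question:

```swift
// If mutation implies stored (hence materializable) elements, the
// refinement might simply look like this.
protocol PitchedMutableCollection: PitchedMaterializableCollection {
    subscript(position: Index) -> Element { get set }
}
```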

Is there a path to a similar design which would allow a smoother migration?

Instead of moving requirements from Collection to MaterializableCollection, perhaps we could phase in the materializable requirement on Collection itself and repurpose Sequence, or introduce a new protocol, with the more relaxed requirement of a "multi-pass, potentially infinite" collection.

Footnote on advanced use cases

@dabrahams mentioned an advanced use case for using O(n) algorithms on infinite collections:

You just expect an O(N) algorithm on an infinite collection not to terminate. Furthermore, I don’t think I’d want to outlaw the use of O(N) algorithms on infinite collections. I lost the battle long ago against side-effectful parameters to algorithms like reduce, which is one reason we have .lazy (there are other good reasons too, though). Given that we’re not re-litigating that question, I have no problem with someone calling reduce in a thread on a potentially infinite collection of messages from another process, and using its side-effects to do whatever they want.

It is possible to support this use case by using a materializable view on the infinite collection. This doesn't align with the "materializable" terminology precisely but that seems like an acceptable drawback to me.

Counter-proposal:

• The protocol for materializable collections should be spelled Collection
• The protocol for non-materializable collections should be spelled Sequence

Furthermore, since non-materializable collections may be computed lazily, they should not be required to have indices, nor to be finite, nor to be multi-pass.

I appreciate the effort that went into this pitch but it seems to introduce complexity and truly massive code breakage with no obvious benefit. The definition of “Materializable” is too vague to be useful (how big is too big?), and seems like it would introduce premature capability cliffs, where some part of the system decides that a collection can't be efficiently processed before another part of the system filters, strides, or slices it. There's no obvious answer for how many times you can concatenate a RangeReplaceable + Materializable collection to itself before it becomes non-Materializable, and there's no obvious way to represent that. My premise in the post that's cited in the pitch is that trying to represent the distinction “too big to traverse” in the type system is a losing battle with heavy costs, and I still believe that.

If I read this correctly, you are proposing that we strengthen the requirement on Collection from finite to materializable and otherwise do nothing. Is that correct? That would not accommodate ranges and other similar types.

Thanks for taking time to respond to my pitch Dave!

I understand. I thought you articulated the fact of the matter very clearly, even if I would like to pursue the battle a bit further in search of a solution with acceptable costs. You've been at it longer than I have; have you explored anything similar to this approach in the past?

The premise of my pitch is partly that this is an important question and the answer is not clearly called out in our code today. It is precisely those cases where the answer is least clear and most dependent upon programmer judgement that deserve to be highlighted. This is why I am arguing that we should require the use of a view in these contexts.

In other cases which are extraordinarily common the answer is immediately obvious. When a collection stores its elements it is certainly not too big to be materialized.

You have a very good point here but I don't think having to choose an answer is worse than allowing infinite collections that include a count property and expose algorithms that process every element.

Can you elaborate further on what you have in mind here? The pitched design is intended to support all capabilities that are currently available. The only difference should be that it may be necessary to use the materializable view in some cases. The decision to "trust" the other part of the system to perform the necessary filter, stride or slice becomes explicit.

Putting some difficulties and subtleties aside (like count being fine to call on some Collections that you can't map, etc.), I would say that this only substitutes the programmer judgement of "should I call these obviously O(n) algorithms on a collection with a very large (or infinite) n" with "should I call .materializable and then do the same thing". So the judgement call is identical, and the .materializable has no benefit beyond reminding a programmer that the collection could be very large, for some definition of large. This might be worthwhile if the error were very dangerous to make or hard to diagnose, but I'm not sure that is the case here. Infinite (or very large) loops seem fairly benign to me, because it is usually obvious when they happen and a stack trace or similar debugging tool will point directly to the location of the issue. Do you disagree with that?

I imagine you have in mind something like 0...Int.max, right? Ranges and similar types would still conform to RandomAccessCollection, which would include count. There may be some subtleties I haven't considered, but this is not one of them.
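
For the record, this is already how ranges behave today: count comes from the distance between the bounds, not from traversal.

```swift
let huge = 0..<Int.max
print(huge.count)    // Int.max, computed in O(1) from the bounds
print(huge.isEmpty)  // false, also O(1)
```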

First, I want to be clear that in many, many cases .materializable would not be necessary at all. In cases where it is, O(n) behavior may not be obvious to the reader of the code for any number of reasons, including inexperience as a programmer or lack of familiarity with the algorithm in question. But this usage-site clarity is not the primary motivation for the pitch.

The primary motivation is to require generic code to accurately express when it needs the ability to materialize a collection in order to function correctly. Generic algorithms written as extensions on the collection protocols are common but they are not the only code that works with generic collections. In other contexts the materializability requirement may be much less obvious than it is in the case of "obviously O(n) algorithms".

I believe there would be a meaningful increase in clarity if all such code were able (and required!) to express its requirements in code rather than in documentation (if the author is responsible enough to document such things) or not at all. I also suspect we may regret it if we allow infinite collections without also introducing something like a materializable constraint.

An exponential number of ways would be a nice first approach, I think. Whenever you detect that the result's size isn't polynomial in the size of the input, the compiler could say "nope, this result is clearly insane. I cannot guarantee it will be Materializable like the input is"

This approach I tend to like. That's the main reason why I like strongly typed languages. They allow code itself to refuse to compile if its semantics aren't being used correctly.

I didn't have any specific example in mind, and I was just saying that the two things are not necessarily related, but your example makes it clear that it's somewhat confusing. Should this algorithm that uses count be generic over MaterializableCollection or RandomAccessCollection? Perhaps there are no useful examples that don't require other functionality that makes it clear, though. It's not my primary concern, anyway.

Probably, because most Collection use is with Array and Dictionary, but my point was that these are the exact same cases where you can reasonably call the methods directly currently, so we're discussing whether to require .materializable for other cases. I would say what you're asking here is whether the documentation about an algorithm requiring O(n) time should be moved into the type system. To me this is still "just" documentation, because it isn't really possible to model and enforce in the type system; e.g., are the results of a prefix(while:) call on an infinite collection materializable?

Relatedly, exponential-time (or polynomial-time with a large exponent) algorithms have similar issues. They can be written generically on Collection but can't practically be called even for a modestly sized Array, so MaterializableCollection doesn't help there. And then there are algorithms with impractical space requirements to consider too. Without very heroic measures in the type system (that I know nothing about), whether an algorithm will complete in a reasonable amount of time and use a reasonable amount of memory is always going to be a judgement call that depends on context. So, to reiterate my previous question: is this error dangerous enough that the documentation benefits are worthwhile?

Not generally. If you have specific knowledge about the contents of the collection and the predicate you're using, you may know that it is in a specific case. And you could of course roll the dice and hope for the best. Therefore this algorithm would move to MaterializableCollection and would only be available on infinite collections via the .materializable view, as sketched below.
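
To illustrate with the earlier sketch, a toy infinite (or effectively infinite) collection would conform only to the base protocol, and the opt-in happens at the use site (prefix(while:) on the wrapper is hypothetical here):

```swift
// The natural numbers, conforming only to the base protocol.
struct Naturals: PitchedCollection {
    var startIndex: Int { 0 }
    var endIndex: Int { .max }  // effectively unbounded
    subscript(position: Int) -> Int { position }
    func index(after i: Int) -> Int { i + 1 }
}

let naturals = Naturals()
// naturals.prefix(while: { $0 < 100 })                  // unavailable under the pitch
// naturals.materializable.prefix(while: { $0 < 100 })   // caller rolls the dice
```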

Quoting from the pitch, here is the general plan:

"All members which must process every element in the general case will be moved from Collection to the new protocol. [...] Members for which this is the case will be moved to both MaterializableCollection and either BidirectionalCollection or RandomAccessCollection."

If you are interested in seeing how each algorithm would be handled I can post the results of the detailed analysis I did. I omitted that from the pitch in order to keep it focused on the conceptual changes.

I'm going to generalize the question slightly to ask whether it is beneficial to encode efficiency requirements in the type system. Clearly the answer is yes in some cases or we would not have RandomAccessCollection. We could replace this protocol with documentation on algorithms that specifies how the complexity of the algorithm degrades when index movement is O(n) instead of O(1) but we don't, and for very good reasons.

The most significant difference I see is that a collection either meets the efficiency requirements of RandomAccessCollection or it does not. In the case of MaterializableCollection there exist models (such as ranges) where some instances do meet the requirement and some do not (as well as a gray area in between which is highly context-dependent). Supporting these types requires some additional complexity, but along with that we gain the ability to see explicitly where a programmer judgment is being made.

Do you think encoding the efficiency requirements of RandomAccessCollection as a protocol is a good idea? If you do but remain unconvinced about MaterializableCollection what differences between them account for the difference in opinion?

Choosing what to encode (and not to encode) in a type system is always a judgement call. Many programmers still prefer dynamic languages! On balance, my opinion is that the benefits of encoding materializability are worthwhile, especially if collections are allowed to be infinite. If I felt otherwise I would not have pitched it.

Sorry, the prefix(while:) thing was a rhetorical question, explaining why I think this is more a documentation thing than actually encoding the complexity in the type system. For many parts of the API, the type system can't help you determine if the resulting Collection is materializable or not (I did a brief non-exhaustive survey in one of the other Sequence/Collection threads).

I remain unconvinced about MaterializableCollection versus RandomAccessCollection because of the exact difference you mention. The type system can't help much, and the harm of an infinite loop seems benign, because the error is generally obvious and "local". I don't see your question strictly as a generalisation of mine, because in cases where the type system can easily help (e.g. RandomAccessCollection) and/or the error is particularly harmful or hard to diagnose, the tradeoff is different.