Consistent lazy API

I've started work on a proposal that addresses the absence of lazy implementations of some Sequence-related operations.

In it I propose conforming all existing lazily implemented Sequences to LazySequenceProtocol:

Show / Hide

The following Sequences can't conform to LazySequenceProtocol because they don't contain an underlying Sequence and therefore can't conform to both Sequence and LazySequenceProtocol:

  • AnyBidirectionalCollection
  • AnyCollection
  • AnyIterator
  • AnyRandomAccessCollection
  • AnySequence
  • Array
  • ArraySlice
  • ClosedRange
  • CollectionOfOne
  • ContiguousArray
  • Dictionary
  • EmptyCollection
  • EmptyCollection.Iterator
  • KeyValuePairs
  • PartialRangeFrom
  • Range
  • Repeated
  • Set
  • StrideThrough
  • StrideTo
  • String
  • String.UnicodeScalarView
  • String.UTF16View
  • String.UTF8View
  • Substring
  • Substring.UnicodeScalarView
  • Substring.UTF16View
  • Substring.UTF8View
  • UnfoldSequence
  • Unicode.Scalar.UTF16View
  • UnsafeBufferPointer
  • UnsafeMutableBufferPointer
  • UnsafeMutableRawBufferPointer
  • UnsafeRawBufferPointer
  • UnsafeRawBufferPointer.Iterator

The following types can't conform to LazySequenceProtocol because they aren't sufficiently lazy:

  • DropFirstSequence
  • DropWhileSequence

DropFirstSequence can't conform to LazySequenceProtocol because makeIterator() pre-emptively drops the first k elements due to performance reasons. For the latter type a lazy implementation already exists.

The following Sequences already conform to LazySequenceProtocol and require no modifications:

  • LazyDropWhileSequence
  • LazyFilterSequence
  • LazyMapSequence
  • LazyPrefixWhileSequence
  • LazySequence
  • ReversedCollection
  • Slice

The following Sequences can conform to LazySequenceProtocol:

  • LazyFilterSequence.Iterator
  • LazyPrefixWhileSequence.Iterator
  • LazyMapSequence.Iterator

The following Sequences can conform to LazySequenceProtocol on the condition that Base conforms to LazySequenceProtocol:

  • EnumeratedSequence
  • EnumeratedSequence.Iterator
  • FlattenSequence
  • FlattenSequence.Iterator
  • IteratorSequence
  • JoinedSequence
  • PrefixSequence
  • ReversedCollection.Iterator

The conformance of FlattenSequence to LazySequenceProtocol means we can remove LazySequence from the return types of joined() and flatMap(_:) on LazySequenceProtocol. The old implementations can be deprecated. It also requires the explicit declaration of FlattenCollection.SubSequence.

The following Sequences can conform to LazySequenceProtocol on the condition that Elements conforms to LazySequenceProtocol:

  • DefaultIndices
  • IndexingIterator

Finally, Zip2Sequence can conform to LazySequenceProtocol on the condition that Sequence1 and Sequence2 conform to LazySequenceProtocol.

Then, for those inherited members from Sequence that still don't contain a lazy implementation on LazySequenceProtocol, overloads and types (conforming to LazySequenceProtocol and conventionally prefixed with Lazy) can be introduced:

  • dropFirst(_:) (LazyDropFirstSequence)
  • dropLast(_:) (LazyDropLastSequence)
  • reversed() (LazyReversedSequence)
  • shuffled() / shuffled(using:) (LazyShuffledSequence)
  • sorted() / sorted(by:) (LazySortedSequence)
  • split(maxSplits:omittingEmptySubsequences:isSeparator:) / split(separator:maxSplits:omittingEmptySubsequences:) (LazySplitSequence)
  • suffix(_:) (LazySuffixSequence)

But before I do any further work I want to gauge if this is the right direction to take so I can determine the scope of the proposal.

For instance, I can imagine some backlash against a lazy implementation that on the first call to next() requires the traversal of the entire underlying Sequence plus additional memoisation, like reversed().

Is this the right approach in your opinion? And if not, where would you draw the line for inclusion?

1 Like

One thing that I'd like to point out is that all the X.Iterator types that you've listed, though technically conform to Sequence, should not be considered for LazySequenceProtocol. They should not be sequences in the first place, but that we can't change.

As for the "meat" of the proposal. I'm very much in favor of cleaning up the lazy story for the standard library, however it is not easy to review all the proposed APIs without more concrete details on the implementation. For example, you mention reversed that simply can't be truly lazy. I believe the same applies to sorted. Perhaps we should not have lazy variants of these methods to avoid confusion, and clearly document them as "terminating the lazy chain" or something?

1 Like

That makes a lot of sense, I'll remove them from the list. Would it be possible to deprecate their conformance to Sequence? And if so, would it make sense to add it to this proposal?

That's a good idea!

Here's a summary:

  • dropFirst(_:):

    • The first call to next() takes O(k) where k is the number of elements to drop. Subsequent calls to next() take O(1).
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred.
    • Requires no memoisation.
    • Conformance to protocols higher up the hierarchy isn't needed as dropFirst(_:) is overloaded on Collection to return a SubSequence (which defaults to Slice, which conforms to LazySequenceProtocol).
  • dropLast(_:):

    • The first call to next() takes O(k) where k is the number of elements to drop. Subsequent calls to next() take O(1).
    • Requires memoisation. The amount of memory to reserve is k.
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred and if the last element in the memoised Array is nil.
    • Conformance to protocols higher up the hierarchy isn't needed as dropLast(_:) is overloaded on Collection to return a SubSequence (which defaults to Slice, which conforms to LazySequenceProtocol).
  • reversed():

    • The first call to next() takes O(n) where n is the size of the underlying Sequence. Subsequent calls to next() take O(1).
    • Requires memoisation. The amount of memory to reserve is underestimated.
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred.
    • Conformance to protocols higher up the hierarchy isn't needed as reversed() is overloaded on BidirectionalCollection to return a ReversedCollection (which conforms to LazySequenceProtocol).
  • shuffled() / shuffled(using:):

    • The first call to next() takes O(n) where n is the size of the underlying Sequence. Subsequent calls to next() take O(1).
    • Requires memoisation. The amount of memory to reserve is underestimated.
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred.
  • sorted() / sorted(by:):

    • The first call to next() takes O(n log n) where n is the size of the underlying Sequence. Subsequent calls to next() take O(1).
    • Requires memoisation. The amount of memory to reserve is underestimated.
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred.
  • split(...):

    • All calls to next() take O(k) where k is the distance between each split.
    • Requires memoisation. The amount of memory to reserve is unknown.
    • Has been discussed on the forums and has been positively received by the core team. @CTMacUser is probably able to tell us more.
  • suffix(_:):

    • The first call to next() takes O(n) where n is the size of the underlying Sequence. Subsequent calls to next() take O(1).
    • Requires memoisation. The amount of memory to reserve is known.
    • Requires a Boolean check with each invocation of next() that checks if the first call to next() has occurred and if the last element in the memoised Array is nil.
    • Conformance to protocols higher up the hierarchy isn't needed as suffix(_:) is overloaded on Collection to return a SubSequence (which defaults to Slice, which conforms to LazySequenceProtocol).

I'd say LazyDropFirstSequence, LazyDropLastSequence and LazySplitSequence are worth pursuing? The rest simply isn't a good fit.

3 Likes

Deprecate, probably. It is introduced by an extension to Sequence where Self.Iterator == Self, which @Ben_Cohen unsuccessfully tried to remove at some point. I don't exactly remember why, but believe there was some code out there that relied on it. In any case, I don't think it makes sense to include it in this proposal.

Thanks for the detailed description of the proposed behaviors! I am on the fence here. On one hand, I strongly believe that any operation that claims to be lazy should not take linear (or worse) time, and that aligns with your last statement, on the other hand though it is soooo tempting to be able to take any method call chain applied to a sequence and make it better by just prepending it with a single .lazy. Without these "more expensive than linear" variants, .lazys will have to be strategically placed after every laziness terminating call.

How about we proceed with the non-controversial ones for now? The proposal can (and IMO should) contain a dedicated section discussing why exactly the worse-than-linear lazy functions are not that good. I can see 2 possible outcomes: 1) community decides that the pros outweigh the cons and the proposal gets extended to include everything; or 2) the proposal does not change scope, but we get some valuable input from the community and perhaps improve lazy a little more at a later date.

1 Like

My opinion on whether or not we should have a dedicated lazy type for something: You should be able to either reduce memory allocations or reduce computation when consuming only part of the sequence by using a lazy type. split does this (if you only grab the first two items from a lazy split sequence, you avoid having to split the rest of the sequence), but I don't think any of the other proposed additions do, since they all calculate the entire thing the non-lazy version would have if you try to use them at all.

1 Like

The argument can indeed be made that including these operations is more ergonomic.

Which makes me think of another approach. We could also add the overloads, but make them return their eager variant wrapped in a LazySequence. This way subsequent operations stay lazy. Added and improved documentation, both on the operations themselves as well as lazy, alerts the user of its performance characteristics.

Noted!

2 Likes

dropFirst(_:), dropLast(_:) and split(…) are all in the same class of operations whose laziness depends on the arguments that are passed in.

Out of all operations I do agree that the split(…) operations are the ones that will see the most dramatic performance improvements.

DropFirstSequence is already very lazy. The only difference is that LazyDropFirstSequence defers the dropping of elements until the first time next() is called, mimicking DropWhileSequence and LazyDropWhileSequence.

For a lazy implementation, dropLast(_:) requires buffering up till k elements. Something like AnySequence(0..<10000).dropLast(1000).prefix(1) doesn't require the elements >=1. So buffering till the end can potentially be wasteful.

In these three cases I think we should add their more lazy implementations, add really good documentation and give the user the option to decide which variant to pick.

I like that!

2 Likes

Wouldn't you need O(n) reading on the first call to next()? The first post-sorted element may be the final pre-sorted element.

Also, for the sorting, shuffling, and reversing methods, since they all require reading in every element then doing the analysis, there is no advantage over reading the sequence into a collection then applying the eager variant. So there's no need for lazy versions.

1 Like

Only in the best case when the Sequence is already sorted. As far as I can see, the modified Timsort algorithm used in Swift 5 has an average and worst case complexity of O(n log n).

I would expect a “lazy” sort to be implemented as a priority queue. Ingestion using Floyd’s Method would be O(n), subsequent reads O(log n). (This is equivalent to a heapsort generator.)

It would give some potential laziness benefits in cases like hugeList.lazy.sorted(by: whatever).prefix(10).

I don’t think this is common enough to leave a performance footgun lying around on a low shelf, though. Standard library laziness should be reserved for use cases where the first next is better than O(n).

3 Likes

...Wait, best case? I meant O(n) as worst....

Oops, O(n * log n) is worse than O(n), right? I read what you originally wrote as O(log n), which is better than linear I believe. So you are reading every element first on the first next() then doing some post-processing.

1 Like