A lazy version of Collection.split

CTMacUser · December 22, 2018, 1:17am

Related to SR-6691, I reworked the type described in the "Collection Protocols" section of the LazySplitCollection sample type in the article "Conditional Conformance in the Standard Library" at the Swift Blog by @Ben_Cohen. The LazyEmptyAdmittingUnlimitedSplitCollection type is the same as Ben's, the LazyEmptyAdmittingSplitCollection and LazyUnlimitedSplitCollection types introduce maximum split counts and empty sequence skipping, and LazySplitCollection would be the final type to be actually submitted.

This project is actually a poor example of conditional conformance. The only thing you can actually optionally extend is make the Index type Hashable based on the wrapped collection's index type, and said optional conformance is automatically defined. A split collection can never be bidirectional because discovering when a split limit is reached is inherently forward-only. This may warrant including the variants without maxSplits support (i.e. maxSplits is infinite) in the Standard library, since there's no other way for bidirectional traversal.

For the types that support skipping empty subsequences, I had to use shenanigans with a class to get around mutability restrictions when I need to set a cache on the first run of startIndex. This has the benefit of making startIndex (and isEmpty) amortized O(1) instead of O(n). The problem of using a cache is that mutating the wrapped collection ruins everything. LazyFilterSequence calculates its startIndex on demand, with a penalty of always running at near-O(n) time. Which way is better; should I change mine to match filter's?

I put the code in a gist, to see if anyone can check if I messed up anything, especially the various conditions of LazySplitCollection when I run out of splits and/or various empty-subsequence states.

A bigger challenge to work on next: LazySplitSequence. I have to somehow pass on a copy of a partially-expended wrapped-level iterator so the sequence-level iterator and use it for the next subsequence.

Ben_Cohen · December 26, 2018, 8:09pm

Hi @CTMacUser,

For this reason, I don't think a lazy split should have a max splits capability. That mainly exists as a kind of halfway-to-laziness feature of the eager split so it's benefit is really marginal in a lazy version. Lazy bidirectionality is far more valuable.

Alternatively: instead of the default argument approach the eager version has, write two versions of split: one that has a max splits argument, and one that doesn't. Have them return different types. The type returned from the no-max-splits version can be bidirectional. This also means the simpler no-max-splitting version doesn't need to spend time at runtime checking if it needs to account for a max splits. However, this is a lot of complexity and not really worth it (opaque types would let you hide the complexity in the future, but could still lead to confusing type checker diagnostics). The nice thing about this approach is it can be added later in an ABI-stable way so there's no downside to leaving it off for now.

My suggestion would be for the lazy sequence split to pull off [Element] chunks.

So all you need to do is stash the base iterator in the lazy splitting iterator, then consume from it into an array that takes the place of the old "subsequence" that it then returns. The iterator will also have to stash one element if coalescing empty regions (because it needs to hunt forward for the first non-empty one, thus consuming the first element of it from the iterator).

This is admittedly no longer quite so lazy (you're producing arrays for the fields) but I don't think you can do it the other way without getting entangled with reference semantics for the iterator.

CTMacUser · December 26, 2018, 11:46pm

How could that be ABI-stable? We would be going from:

extension LazyCollectionProtocol {

    public func split(maxSplits: Int = Int.max, omittingEmptySubsequences: Bool = true, whereSeparator matches: @escaping (Element) -> Bool) -> LazySplitCollection<Elements>

}

in Swift 5 + x to

extension LazyCollectionProtocol {

    public func split(maxSplits: Int, omittingEmptySubsequences: Bool = true, whereSeparator matches: @escaping (Element) -> Bool) -> LazySplitCollection<Elements>
    public func split(omittingEmptySubsequences: Bool = true, whereSeparator matches: @escaping (Element) -> Bool) -> LazyUnlimitedSplitCollection<Elements>

}

in Swift 6 + x. Would we be hiding the user-level difference between LazySplitCollection and LazyUnlimitedSplitCollection somehow?

beccadax · December 27, 2018, 3:07am

Removing a default argument is source-breaking but ABI-compatible. If you simultaneously add another overload which preserves source compatibility, the only source breaks that would result would be from people explicitly stating that something should be LazySplitCollection.

(That's part of why opaque result types are interesting—they avoid even allowing people to explicitly declare things that would break in these situations.)

Ben_Cohen · December 27, 2018, 5:43am

I am saying, provide an overload that doesn’t have a maxSplits argument at all (instead of taking one defaulted to Int.max). It can return a type that doesn’t account for a max number of splits. Then, at the same time or later or (my preference as I don’t think it’s worth having in the lazy case) never, you add an overload that does take a maxSplit arg and returns a type that accounts for it.

Adding it later is abi compatible because the old function remains. You are just adding a new function (and type).

Ben_Cohen · December 27, 2018, 6:00am

Note, you can do the same trick with coalescing empty fields. And I think the coalesced version can just use a lazy filter to eliminate empty fields, so you don’t need a new type for it. Again, opaque types would be a win here. Tho this approach doesn’t scale too well with combinations of options...

TellowKrinkle · December 27, 2018, 7:21am

Is there a reason some structs use a range (or struct containing a range) as their index while LazyEmptyAdmittingUnlimitedSplitCollection uses just a single Base.index and has to calculate the end of the range in the indexing subscript?

CTMacUser · December 27, 2018, 4:35pm

LazyEmptyAdmittingUnlimiitedSplitCollection is my adaptation of the sample type from the blog, so it keeps the same limitation of a simpler Index type at the cost of having to calculate the endpoint every time. The other sample types came later, and follow the suggestion at the end of the blog to change the Index type to also cache the end of each split range.

CTMacUser · December 31, 2018, 3:58am

Ben_Cohen:

CTMacUser:

A bigger challenge to work on next: LazySplitSequence . I have to somehow pass on a copy of a partially-expended wrapped-level iterator so the sequence-level iterator and use it for the next subsequence.

My suggestion would be for the lazy sequence split to pull off [Element] chunks.

So all you need to do is stash the base iterator in the lazy splitting iterator, then consume from it into an array that takes the place of the old "subsequence" that it then returns. The iterator will also have to stash one element if coalescing empty regions (because it needs to hunt forward for the first non-empty one, thus consuming the first element of it from the iterator).

This is admittedly no longer quite so lazy (you're producing arrays for the fields) but I don't think you can do it the other way without getting entangled with reference semantics for the iterator.

I was going to do the lazy-sequence-of-lazy-sequences route, but then I considered trying your lazy-sequence-of-eager-sequences idea, and what common code both can use. A starting point would be something like a filter; given an Iterator, a separator closure, and other split arguments, make another iterator that vends just the non-separator elements. Further, include an index for which split sub-sequence that element would be part of. For instance, given

"ABCDEF"

with "D" as a separator. You would get as a return from this splitting iterator:

("A", 1), ("B", 1), ("C", 1), ("E", 2), ("F", 2)

The problem with this kind of return are sequences like:

"DABCDDEFD"

and elect to include empty sub-sequences, giving split-element-index pairs of:

("A", 2), ("B", 2), ("C", 2), ("E", 4), ("F", 4)

You can infer where the purely empty sub-sequences by where the index suddenly jumps. But, unless the iterator includes a property for the count of total sub-sequences after it starts returning nil from next(), you'd miss that there's another empty sub-sequence at index 5. So I had to add place markers for empty sub-sequences:

(Empty, 1), ("A", 2), ("B", 2), ("C", 2), (Empty, 3), ("E", 4), ("F", 4), (Empty, 5)

Both the slightly- and really-lazy split sequences' iterators wrap around this iterator. The version with eager sub-sequences was pretty quick. The reference-link dance needed for the deeply lazy version took some work, but there's some more issues, coming up in a later post.