Supporting collection slice by slices of a given size parameter

Hello Swift Community :))

Based on the slice collection API the Swift Std have today we mostly can slice a collection by a separator or a closure that takes each value and decides if it gonna slice in that position.

The pitch here is to provide a method on the collections that allows us to slice a collection in slices of a given size.
Example:
Given the array: ["1", "2", "3","4", "5", "6","7", "8", "9", "10"]

Calling array.split(size: 3) will produce the output:

[ArraySlice(["1", "2", "3"]), ArraySlice(["4", "5", "6"]), ArraySlice(["7", "8", "9"]), ArraySlice(["10"])]

Proposed Implementation

extension RandomAccessCollection {
    
    func split(size: Int) -> [SubSequence] {
        precondition(size > 0, "Split size should be greater than 0.")
        
        var idx = startIndex
        var splits = [SubSequence]()
        splits.reserveCapacity(count/size)
        while idx < endIndex {
            let advanced = Swift.min(index(idx, offsetBy: size), endIndex)
            splits.append(self[idx..<advanced])
            idx = advanced
        }
        return splits
    }
}

let array = (1...10).map({ "\($0)" })

print("Result:      \(array.split(size: 3))")
print("Result Even: \(array.split(size: 5))")

// Result:     [ArraySlice(["1", "2", "3"]), ArraySlice(["4", "5", "6"]), ArraySlice(["7", "8", "9"]), ArraySlice(["10"])]
// Result Even: [ArraySlice(["1", "2", "3", "4", "5"]), ArraySlice(["6", "7", "8", "9", "10"])]

This is a common API in some other libs in other languages like

Some points to consider here:

  • I'm not 100% sure if split(size:) is actually the best naming here. Another option I thought will be split(ofSize: ).
  • A forEachSlice(size: Int, _ body: (SubSequence) throws -> Void) rethrows like the ruby each_slice is not the topic of this pitch, but it's something to mention here for maybe a futher discussion.

That's it for the pitch, looking forward to hear the thoughts from the community :slight_smile:

5 Likes

To me, this feels very related to this pitch, which is specific to adjacentPairs, except yours seems like a more dynamic version (instead of always 2, you can specify the grouping size). Maybe take a look at that thread and pitch your idea in there as well? If they're as related as I think they are then the community may opt to include your more dynamic idea since it satisfies a larger subset of developers.

2 Likes

It's not just that, the groups here don't overlap.

2 Likes

Essentially, these seem to me both variations of a sliding window, differing in: window size, step size and what to do when the underlying collection ends (split returns non-full windows, adjacentPairs stops iteration).

These parameters could be added to the wrapper type built for adjacentPairs @mpangburn mentioned in the other thread, then both adjacentPairs and split could be implemented using that. Advanced users could use the wrapper directly to get more custom behaviour (a bigger window sliding one at a time to calculate statistics, iterating a list n elements at a time with one overlapping item, ...).

One negative this could potentially bring is allowing the wrapper to return non-full windows could make it harder to allow it to return constant-size subsequences once we have variadic generics.

1 Like

As @ahti says, I think there are several related APIs here that would probably benefit from being designed and proposed together. The extension to a sliding window of n elements would be useful, e.g. moving averages, n = 3 would let you render triangle strips, etc. The main question that I have is how many different methods to split this into. It is theoretically possible to handle them all with a single method and a ton of parameters, but that's not great from a usability or naming perspective.

2 Likes

Hey guys,
Those are interesting points, thank you :))
I agree that it could be a interesting thing if we put this together with adjacentPair proposal as you guys said. In fact I believe if we put a ovelap size parameter on this split function and do some changes on the implementation we can produce the almost the same behavior of adjacentPairs, only difference is that we will deal with Subsequence and not tuples.

Example

extension RandomAccessCollection {
    
    func split(size: Int, overlapSize overlap: Int) -> [SubSequence] {
        precondition(size > 0, "Split size should be greater than 0.")
        precondition(size > overlap, "Split size should be greater than overlap.")

        var idx = startIndex
        var splits = [SubSequence]()
        splits.reserveCapacity(count/(size - overlap))
        var advanced = idx
        while advanced < endIndex {
            advanced = Swift.min(index(idx, offsetBy: size), endIndex)
            splits.append(self[idx..<advanced])
            idx = index(advanced, offsetBy: -overlap)
        }
        return splits
    }
}

let array = (1...5).map({ $0 })
print("Overlaping: \(array.split(size: 2, overlapSize: 1))")
// Overlaping: [ArraySlice([1, 2]), ArraySlice([2, 3]), ArraySlice([3, 4]), ArraySlice([4, 5])]

We can put a default value of 0 on the overlap size, so we still have the same behavior proposed with non overlaping slices.
Also, we can also add another boolean parameter indicating if the method should return only full windows and ignore the last incomplete one.
Making those changes allow the method behavior to be more dynamic as you guys mentioned, and also can make it a useful API in more use cases.

I think this would be a useful compliment to split, since it's quite tricky to do by hand and feels pretty common. I'm pretty skeptical of the overlapping feature though. I think that's probably overgeneralizing.

Note, it doesn't need to return an Array eagerly. This feature falls under the rule of thumb followed by the std lib that things should be lazy if there isn't a major downside to them being lazy, since it doesn't take closure. It also doesn't need to be restricted to random-access collections.

Here's a sketch of a lazy version:

struct ChunkedCollection<Base: Collection> {
  let base: Base
  let size: Int
}

extension ChunkedCollection: Collection {
  typealias Index = Base.Index
  typealias Element = Base.SubSequence
  var startIndex: Base.Index { return base.startIndex }
  var endIndex: Base.Index { return base.endIndex }
  subscript(i: Index) -> Element { return base[i..<index(after: i)] }
  func index(after i: Base.Index) -> Base.Index {
    return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
  }
}

extension Collection {
  func chunked(into size: Int) -> ChunkedCollection<Self> {
    return ChunkedCollection(base: self, size: size)
  }
}

let s = "abracadabra"
print(s.chunked(into: 4).joined(separator: ","))
// abra,cada,bra

This would need more work to make it into the std lib (needs preconditions, might be more efficient to have a custom index type that stashed the chunk end as well) but not too much.

1 Like

Oh, also this is an interested case where ChunkedCollection can conditionally be a BidirectionalCollection when Base: RandomAccessCollection and not when Base: BidirectionalCollection, because it needs to calculate the size of the final chunk in constant time.

Hmm, I think something that does sliding windows deserves to be in the standard library eventually, but you might be right that combining the two would be overgeneralising. Thinking about this some more it might also be a place where having two separate, simpler utilities for this might actually be useful performance-wise.

Just for reference, both Ruby and Rust have separate methods for these two, each_cons / each_chunk and windows / chunks respectively.

2 Likes

Also, wrt the shed, chunked(into: 4) to me reads very much like it'd produce four chunks, not chunks of size <= 4.

6 Likes

I was envisioning such functionality to use a shift or slide instead of an overlap. As implemented the current code does support (for some scenario, but not all) a negative overlap. So with overlap: -1, it produces [[1,2], [4,5]]. Not sure if there's any real world usage for such result, but fully supporting this kind of result would make the function as generic as it can get, probably too generic for the taste of some of the folks here.

Of course, negative overlap might be hard to grasp; forcing the code to be rewritten using a shift or slide concept instead.

chunked(by: 4), perhaps?

1 Like

Hey @Ben_Cohen :))
I totally agree with this lazy version.
Just a few points here are:

  • About the overlapping feature, I agree that maybe is overgeneralizing here, but also agree with @ahti maybe it is something to have in a separated method but still in the stdlib :))
  • Do you think that the eager version as an additional to the lazy version can also add some value to the proposal? I've seem for example array.reversed() has both a version that returns a array and other the returns ReversedCollection and sometimes is useful to have the eager one to use the array directly. What do you guys think?
  • As @ahti mention I also agree that chunked(into: 4) reads like it'd produce four chunks, not chunks of size <= 4. Maybe some options that we can consider are chunked(with: 4), chunked(withSize: 4).
  • Also, I have implemented a first version BidirectionalCollection conformance based from your code previous implementation and suggestion to make the conformance with RangeRepleacebleCollection :))
struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Base.Index) -> Base.Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection where Base: RandomAccessCollection {
    subscript(i: Index) -> Element {
        if i == startIndex && base.count%size != 0 {
            return base[i..<base.index(i, offsetBy: base.count%size)]
        }
        return base[i..<index(after: i)]
    }
    
    func index(before i: Index) -> Index {
        return base.index(i, offsetBy: -size, limitedBy: base.startIndex) ?? base.startIndex
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Chunk size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[8, 9, 10], [5, 6, 7], [2, 3, 4], [1]]

Surely the with is redundant, wouldn't chunked(size: ) work fine?

1 Like

This is an incorrect implementation. According to the docs:

func reversed()

Returns a view presenting the elements of the collection in reverse order.

So your reversed collection should return [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

If you wanted your result, you should call:

let r = s
    .reversed()            // [10, 9, 8, ..., 2, 1]
    .chunked(into: 3)      // [[10, 9, 8], ..., [4, 3, 2], [1]]
    .map { $0.reversed() } // [[8, 9, 10], ..., [2, 3, 4], [1]]

Hey @sveinhal :))

I believe it makes total sense and it seems more like a correct behavior.

I did update the implementation to this behavior

struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Base.Index) -> Base.Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection {
    func index(before i: Index) -> Index {
        let offset = i == base.endIndex && base.count%size != 0 ? base.count%size: size
        return base.index(i, offsetBy: -offset)
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Split size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })
print(s.chunked(into: 3).map({ [Int]($0) })) // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

This is interesting because we can implement BidirectionalCollection conformance without constraint the ChunkedCollection to Any protocol like before was constrained to RangeRepleacebleCollection.

Hey guys :))
I've added the constraint for RandomAccessCollection as @Ben_Cohen mention to guarantee the constant time O(1). Also when base conditionally conforms to RandomAccessCollection, we can implement count in constant time too.

struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Index) -> Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection where Base: RandomAccessCollection {
    func index(before i: Index) -> Index {
        var offset = size
        if  i == base.endIndex && base.count%size != 0 {
            offset = base.count%size
        }
        return base.index(i, offsetBy: -offset)
    }
    
    var count: Int {
        return base.count%size != 0 ? base.count/size + 1 : base.count/size
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Split size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })
print(s.chunked(into: 3).map({ [Int]($0) })) // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

@Ben_Cohen Do you think it's good to start drafting an evolution proposal?

As @ahti already points out, chunked(into: 4) can only sensibly mean that you get four chunks, not that you get chunks of four. A noun here would be more consistent with other methods that expose views, like String.utf8 or Int.words, so I'd suggest chunks(of:).

7 Likes

Here is the first draft of an evolution proposal Supporting split collection in chunks of a given size parameter
Of course, it needs more work and any feedback is welcome :))

This idea seemed very familiar. In fact I worked on it myself. Maybe you can take a look at what's already been done:

Hope these'll help.