Supporting splitting a collection into slices of a given size

I think this would be a useful complement to split, since it's quite tricky to do by hand and feels pretty common. I'm pretty skeptical of the overlapping feature though. I think that's probably overgeneralizing.

Note, it doesn't need to return an Array eagerly. This feature falls under the std lib's rule of thumb that things should be lazy if there isn't a major downside to them being lazy, since it doesn't take a closure. It also doesn't need to be restricted to random-access collections.

Here's a sketch of a lazy version:

struct ChunkedCollection<Base: Collection> {
  let base: Base
  let size: Int
}

extension ChunkedCollection: Collection {
  typealias Index = Base.Index
  typealias Element = Base.SubSequence
  var startIndex: Base.Index { return base.startIndex }
  var endIndex: Base.Index { return base.endIndex }
  subscript(i: Index) -> Element { return base[i..<index(after: i)] }
  func index(after i: Base.Index) -> Base.Index {
    return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
  }
}

extension Collection {
  func chunked(into size: Int) -> ChunkedCollection<Self> {
    return ChunkedCollection(base: self, size: size)
  }
}

let s = "abracadabra"
print(s.chunked(into: 4).joined(separator: ","))
// abra,cada,bra

This would need more work to make it into the std lib (needs preconditions, might be more efficient to have a custom index type that stashed the chunk end as well) but not too much.
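For what it's worth, the custom-index idea might look roughly like this. This is only a sketch under the same assumptions as the code above; the ChunkedCollection2 / chunked2 names are made up here just to avoid clashing with the earlier sketch:

```swift
// Sketch: a chunked view whose index stashes the chunk's bounds,
// so the subscript doesn't have to recompute the chunk end.
struct ChunkedCollection2<Base: Collection> {
    let base: Base
    let size: Int

    struct Index: Comparable {
        let range: Range<Base.Index>  // this chunk's bounds in the base

        static func < (lhs: Index, rhs: Index) -> Bool {
            return lhs.range.lowerBound < rhs.range.lowerBound
        }
    }
}

extension ChunkedCollection2: Collection {
    typealias Element = Base.SubSequence

    var startIndex: Index {
        let end = base.index(base.startIndex, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
        return Index(range: base.startIndex..<end)
    }
    var endIndex: Index {
        return Index(range: base.endIndex..<base.endIndex)
    }
    subscript(i: Index) -> Element {
        return base[i.range]  // no extra index arithmetic here
    }
    func index(after i: Index) -> Index {
        let start = i.range.upperBound
        let end = base.index(start, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
        return Index(range: start..<end)
    }
}

extension Collection {
    func chunked2(into size: Int) -> ChunkedCollection2<Self> {
        precondition(size > 0, "Chunk size should be greater than 0.")
        return ChunkedCollection2(base: self, size: size)
    }
}

print("abracadabra".chunked2(into: 4).joined(separator: ","))
// abra,cada,bra
```

The trade-off is a bigger index (two base indices instead of one) in exchange for a cheaper subscript.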


Oh, also this is an interesting case where ChunkedCollection can conditionally be a BidirectionalCollection when Base: RandomAccessCollection but not when merely Base: BidirectionalCollection, because it needs to calculate the size of the final chunk in constant time.

Hmm, I think something that does sliding windows deserves to be in the standard library eventually, but you might be right that combining the two would be overgeneralising. Thinking about this some more, having two separate, simpler utilities might actually be useful performance-wise too.

Just for reference, both Ruby and Rust have separate methods for these two: each_cons / each_slice and windows / chunks respectively.
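In that spirit, a lazy sliding-windows view in the same style as the chunked sketch upthread could look like this. Just a sketch: WindowsCollection and windows(of:) are made-up names, and computing endIndex via count is a simplification that's only cheap for random-access bases:

```swift
// Sketch of a lazy sliding-windows view, analogous to Rust's
// `windows`; the names here are invented, not a real API.
struct WindowsCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension WindowsCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence

    var startIndex: Base.Index { return base.startIndex }
    // The last valid window starts (size - 1) elements before the
    // end of the base, so endIndex is one past that start.
    var endIndex: Base.Index {
        return base.index(base.startIndex, offsetBy: Swift.max(0, base.count - size + 1))
    }
    subscript(i: Index) -> Element {
        return base[i..<base.index(i, offsetBy: size)]
    }
    func index(after i: Index) -> Index {
        return base.index(after: i)
    }
}

extension Collection {
    func windows(of size: Int) -> WindowsCollection<Self> {
        precondition(size > 0, "Window size should be greater than 0.")
        return WindowsCollection(base: self, size: size)
    }
}

print([1, 2, 3, 4, 5].windows(of: 2).map { Array($0) })
// [[1, 2], [2, 3], [3, 4], [4, 5]]
```

Unlike chunking, every window here has exactly `size` elements; a base with fewer elements than `size` produces an empty view.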


Also, wrt the bikeshed: chunked(into: 4) to me reads very much like it'd produce four chunks, not chunks of size <= 4.


I was envisioning such functionality using a shift or slide instead of an overlap. As implemented, the current code does support (for some scenarios, but not all) a negative overlap. So with overlap: -1, it produces [[1, 2], [4, 5]]. Not sure if there's any real-world usage for such a result, but fully supporting this kind of result would make the function as generic as it can get, probably too generic for the taste of some of the folks here.

Of course, negative overlap might be hard to grasp, which would force the code to be rewritten using a shift or slide concept instead.
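To make the shift/slide framing concrete, here's a sketch of an eager helper that takes both a chunk size and a stride; chunks(of:every:) is a made-up name, not a proposed API. stride == size gives plain chunking, stride < size gives overlapping chunks, and stride > size gives the gapped result above:

```swift
// Sketch: eager chunking with a separate stride, written as an
// Array-returning helper for simplicity. Names are invented.
extension Collection {
    func chunks(of size: Int, every stride: Int) -> [SubSequence] {
        precondition(size > 0 && stride > 0)
        var result: [SubSequence] = []
        var start = startIndex
        while start != endIndex {
            // Each chunk spans up to `size` elements from `start`.
            let end = index(start, offsetBy: size, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            // The next chunk starts `stride` elements later.
            start = index(start, offsetBy: stride, limitedBy: endIndex) ?? endIndex
        }
        return result
    }
}

let xs = [1, 2, 3, 4, 5]
print(xs.chunks(of: 2, every: 2))  // [[1, 2], [3, 4], [5]]
print(xs.chunks(of: 2, every: 3))  // [[1, 2], [4, 5]], i.e. overlap: -1
print(xs.chunks(of: 3, every: 1))  // [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5], [5]]
```

Note the trailing partial windows in the last example; that's one place where a dedicated windows utility would behave differently.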

chunked(by: 4), perhaps?


Hey @Ben_Cohen :))
I totally agree with this lazy version.
Just a few points here:

  • About the overlapping feature, I agree that it's maybe overgeneralizing here, but I also agree with @ahti that it may be worth having as a separate method, still in the stdlib :))
  • Do you think an eager version in addition to the lazy one could also add value to the proposal? I've seen, for example, that array.reversed() has both a version that returns an Array and one that returns a ReversedCollection, and sometimes it's useful to have the eager one to use the array directly. What do you guys think?
  • As @ahti mentioned, I also agree that chunked(into: 4) reads like it'd produce four chunks, not chunks of size <= 4. Some options we could consider are chunked(with: 4) or chunked(withSize: 4).
  • Also, I have implemented a first version of the BidirectionalCollection conformance based on your previous implementation and your suggestion to constrain the conformance to RandomAccessCollection :))
struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Base.Index) -> Base.Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection where Base: RandomAccessCollection {
    subscript(i: Index) -> Element {
        if i == startIndex && base.count%size != 0 {
            return base[i..<base.index(i, offsetBy: base.count%size)]
        }
        return base[i..<index(after: i)]
    }
    
    func index(before i: Index) -> Index {
        return base.index(i, offsetBy: -size, limitedBy: base.startIndex) ?? base.startIndex
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Chunk size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[8, 9, 10], [5, 6, 7], [2, 3, 4], [1]]

Surely the with is redundant; wouldn't chunked(size:) work fine?


This is an incorrect implementation. According to the docs:

func reversed()

Returns a view presenting the elements of the collection in reverse order.

So your reversed collection should return [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

If you wanted your result, you should call:

let r = s
    .reversed()            // [10, 9, 8, ..., 2, 1]
    .chunked(into: 3)      // [[10, 9, 8], ..., [4, 3, 2], [1]]
    .map { $0.reversed() } // [[8, 9, 10], ..., [2, 3, 4], [1]]

Hey @sveinhal :))

I believe it makes total sense and seems like the correct behavior.

I've updated the implementation to this behavior:

struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Base.Index) -> Base.Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection {
    func index(before i: Index) -> Index {
        let offset = i == base.endIndex && base.count % size != 0 ? base.count % size : size
        return base.index(i, offsetBy: -offset)
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Chunk size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })
print(s.chunked(into: 3).map({ [Int]($0) })) // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

This is interesting because we can implement the BidirectionalCollection conformance without constraining ChunkedCollection to any additional protocol, whereas before it was constrained to RandomAccessCollection.

Hey guys :))
I've added the constraint to RandomAccessCollection as @Ben_Cohen mentioned, to guarantee constant time O(1). Also, when Base conforms to RandomAccessCollection, we can implement count in constant time too.

struct ChunkedCollection<Base: Collection> {
    let base: Base
    let size: Int
}

extension ChunkedCollection: Collection {
    typealias Index = Base.Index
    typealias Element = Base.SubSequence
    var startIndex: Base.Index { return base.startIndex }
    var endIndex: Base.Index { return base.endIndex }
    subscript(i: Index) -> Element { return base[i..<index(after: i)] }
    
    func index(after i: Index) -> Index {
        return base.index(i, offsetBy: size, limitedBy: base.endIndex) ?? base.endIndex
    }
}

extension ChunkedCollection: BidirectionalCollection where Base: RandomAccessCollection {
    func index(before i: Index) -> Index {
        var offset = size
        if i == base.endIndex && base.count % size != 0 {
            offset = base.count % size
        }
        return base.index(i, offsetBy: -offset)
    }
    
    var count: Int {
        return base.count % size != 0 ? base.count / size + 1 : base.count / size
    }
}

extension Collection {
    func chunked(into size: Int) -> ChunkedCollection<Self> {
        precondition(size > 0, "Chunk size should be greater than 0.")
        return ChunkedCollection(base: self, size: size)
    }
}

let s = (1...10).map({ $0 })
print(s.chunked(into: 3).map({ [Int]($0) })) // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

let r: ReversedCollection<ChunkedCollection<[Int]>> = s.chunked(into: 3).reversed()
print(r.map({ [Int]($0) })) // [[10], [7, 8, 9], [4, 5, 6], [1, 2, 3]]

@Ben_Cohen Do you think it's good to start drafting an evolution proposal?

As @ahti already points out, chunked(into: 4) can only sensibly mean that you get four chunks, not that you get chunks of four. A noun here would be more consistent with other methods that expose views, like String.utf8 or Int.words, so I'd suggest chunks(of:).


Here is the first draft of an evolution proposal: Supporting split collection in chunks of a given size parameter
Of course, it needs more work, and any feedback is welcome :))

This idea seemed very familiar. In fact, I worked on it myself. Maybe you can take a look at what's already been done:

Hope these help.

Sorry, something about this part bugs me. What about:

extension ChunkedCollection: BidirectionalCollection where Base: RandomAccessCollection {
    func index(before i: Index) -> Index {
        let stragglerCount = base.count % size
        let lastChunkCount = stragglerCount == 0 ? size : stragglerCount
        let offset = (i == base.endIndex) ? lastChunkCount : size
        return base.index(i, offsetBy: -offset)
    }

    var count: Int {
       let (wholeChunkCount, stragglerCount) = base.count.quotientAndRemainder(dividingBy: size)
       return wholeChunkCount + stragglerCount.signum()
    }
}

But looking at this, maybe we can go full random-access:

extension ChunkedCollection: RandomAccessCollection where Base: RandomAccessCollection {
    func index(before i: Index) -> Index {
        let stragglerCount = base.count % size
        let lastChunkCount = stragglerCount == 0 ? size : stragglerCount
        let offset = (i == base.endIndex) ? lastChunkCount : size
        return base.index(i, offsetBy: -offset)
    }

    func distance(from start: Index, to end: Index) -> Int {
        let (chunkDelta, stragglerDelta) = base.distance(from: start, to: end).quotientAndRemainder(dividingBy: size)
        return chunkDelta + stragglerDelta.signum()
    }

    func index(_ i: Index, offsetBy n: Int) -> Index {
        // Renamed the parameter so it doesn't shadow distance(from:to:).
        let stragglerCount = base.count % size
        let lastExactIndex = base.index(base.endIndex, offsetBy: -stragglerCount)
        if stragglerCount != 0, n == 1 + distance(from: i, to: lastExactIndex) {
            return base.endIndex
        } else {
            return base.index(i, offsetBy: n * size)
        }
    }
}

(Warning: I'm not sure I got the mod-for-negative-numbers correct for distance(from: to:).)

(Warning: I wrote both clips in the forum panel; neither has been compiled or tested.)

Hey Daryle :)
Those are really interesting materials, and for sure they can help us move forward with this proposal.
I'll definitely take a look at the links and at the sectioned sequences implementation :))

Those are interesting implementations :))
About going full RandomAccessCollection, that's an interesting point. Can you elaborate more on the pros and cons of that?

About your index(before:) suggestion, one advantage maybe is that we don't have to calculate base.count % size on every iteration, since it's only used when i == base.endIndex. That's why I put the if i == base.endIndex && base.count % size != 0 statement, so that when i is not base.endIndex it doesn't even execute the other part of the statement. I know it seems like doing the same operation twice, but another way to do it would be:

func index(before i: Index) -> Index {
    var offset = size
    if i == base.endIndex {
        let remainder = base.count % size
        if remainder != 0 {
            offset = remainder
        }
    }
    return base.index(i, offsetBy: -offset)
}

That way we don't execute the same operation twice and also don't calculate it on every iteration, and I believe it could be an improvement to the proposal :))

About the count suggestion: since base.count is O(1), I think the two are just differently styled implementations with the same cost.
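For what it's worth, the two spellings can be checked against each other; this little loop (just a sanity check, not proposal code) confirms they agree, since both compute the ceiling of base.count / size:

```swift
// Sanity check that the two O(1) count spellings agree:
// `c % s != 0 ? c / s + 1 : c / s` versus
// `quotientAndRemainder` plus `signum()`.
for c in 0...20 {
    for s in 1...5 {
        let a = c % s != 0 ? c / s + 1 : c / s
        let (q, r) = c.quotientAndRemainder(dividingBy: s)
        let b = q + r.signum()
        assert(a == b, "mismatch for count \(c), size \(s)")
    }
}
print("both count implementations agree")
```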

Yes, the collection should definitely be random access if its underlying collection is random access. But note, you have to explicitly conform to both because it's a conditional conformance i.e.:

extension ChunkedCollection: BidirectionalCollection, RandomAccessCollection 
where Base: RandomAccessCollection {
  // etc
}

This seems surprising at first because normally you get base protocol conformance "for free" when you conform to a protocol. But with conditional conformance you have to spell it out. See the relevant part of SE-0143 for details.

Pros vs. cons doesn't really apply here. Can the results be computed in constant time instead of linear time, i.e. with calculations instead of resorting to single steps and counting them? If the answer is yes, then we can switch to RandomAccessCollection.


Here is the proposal PR to Swift Evolution.
Any feedback is welcome :))
