Chunking Collections and Strings in Swift 5.1

Hi there! I’m trying to write some code that splits collections into subsequences based on chunks. An array of integers and a chunk size of 2, for instance, transforms from [1, 2, 3, 4, 5, 6] to [[1, 2], [3, 4], [5, 6]]. I can achieve this nicely using the following code:

extension Collection {
    
    public func chunked(into size: Int) -> [SubSequence] {
        var chunks: [SubSequence] = []
        var i = startIndex
        
        while let nextIndex = index(i, offsetBy: size, limitedBy: endIndex) {
            let chunk = self[i ..< nextIndex]
            chunks.append(chunk)
            i = nextIndex
        }
        
        let finalChunk = self[i ..< endIndex]
        if finalChunk.isEmpty == false {
            chunks.append(finalChunk)
        }
        
        return chunks
    }

}

But I feel like that could be better. I can rewrite it a bit so that it works on any Collection where the Index is an Int:

extension Collection where Index == Int {
    
    public func chunked(into size: Index.Stride) -> [SubSequence] {
        return stride(from: startIndex, to: endIndex, by: size).map { index in
            let end = self.index(index, offsetBy: size, limitedBy: self.endIndex) ?? endIndex
            
            return self[index ..< end]
        }
    }

}

But that means that I can’t use it with String, so I need to have another extension:

extension StringProtocol {
    
    public func chunked(into size: Int) -> [SubSequence] {
        var chunks: [SubSequence] = []
        
        var i = startIndex
        
        while let nextIndex = index(i, offsetBy: size, limitedBy: endIndex) {
            chunks.append(self[i ..< nextIndex])
            i = nextIndex
        }
        
        let finalChunk = self[i ..< endIndex]
        
        if finalChunk.isEmpty == false {
            chunks.append(finalChunk)
        }
        
        return chunks
    }
    
}

Is there a more concise way to achieve the desired result in Swift 5.1 without having separate code paths for String and other Collections?

If you want it to work with String as well, I'd suggest that you drop the Index == Int requirement and use the StringProtocol implementation in all cases. Unless you need highly performant code, it'll probably work fine (you can of course benchmark it later).

Thanks for the suggestion! I’ve tried a few permutations of this. First, if we drop the Index == Int requirement:

extension Collection {
    
    public func chunked(into size: Index.Stride) -> [SubSequence] {
        precondition(
            size > 0,
            "Calls to `chunked(into:)` must have a `size` greater than 0")
        
        return stride(from: startIndex, to: endIndex, by: size).map { index in
            let endIndex = self.index(index, 
                                      offsetBy: size, 
                                      limitedBy: self.endIndex) ?? self.endIndex
            
            return self[index ..< endIndex]
        }
    }

}

This fails with this error:

CollectionChunks.swift:38:16: error: cannot invoke 'stride' with an argument list of type '(from: Self.Index, to: Self.Index, by: Int)'
        return stride(from: startIndex, to: endIndex, by: size).map { index in
               ^

Next, let’s try constraining it to Collections where the Index conforms to Strideable:

extension Collection where Index: Strideable {
    
    public func chunked(into size: Index.Stride) -> [SubSequence] {
        precondition(
            size > 0,
            "Calls to `chunked(into:)` must have a `size` greater than 0")
        
        return stride(from: startIndex, to: endIndex, by: size).map { index in
            let endIndex = self.index(index, 
                                      offsetBy: size, 
                                      limitedBy: self.endIndex) ?? self.endIndex
            
            return self[index ..< endIndex]
        }
    }

}

This fails with this error:

CollectionChunks.swift:40:49: error: argument type 'Self.Index.Stride' does not conform to expected type 'BinaryInteger'
                                      offsetBy: size, 
                                                ^

OK, so we make sure that the Stride of the Index conforms to BinaryInteger:

extension Collection where Index: Strideable, Index.Stride: BinaryInteger {
    
    public func chunked(into size: Index.Stride) -> [SubSequence] {
        precondition(
            size > 0,
            "Calls to `chunked(into:)` must have a `size` greater than 0")
        
        return stride(from: startIndex, to: endIndex, by: size).map { index in
            let endIndex = self.index(index, 
                                      offsetBy: size, 
                                      limitedBy: self.endIndex) ?? self.endIndex
            
            return self[index ..< endIndex]
        }
    }

}

Now we appear to hit a deprecation error:

CollectionChunks.swift:39:33: error: 'index(_:offsetBy:limitedBy:)' is unavailable: all index distances are now of type Int
            let endIndex = self.index(index, 
                                ^~~~~
Swift.Collection:11:17: note: 'index(_:offsetBy:limitedBy:)' was obsoleted in Swift 5.0
    public func index<T>(_ i: Self.Index, offsetBy n: T, limitedBy limit: Self.Index) -> Self.Index? where T : BinaryInteger
                ^

The only combination I’ve had that’s worked is constraining Index to Int, but then I can’t use it on String. I’m happy to continue trying permutations of generic constraints if we can find something that works in all cases!

Most likely non of the permutations will work, String.Index is one of the most bare-minimum Index out there. Aside from being a BidirectionalCollection, it doesn't conform to anything else.

By that, if you have a working implementation for String and StringProtocol, likely it'll work for all other kinds of Collection. So I'm suggesting that you use the StringProtocol implementation on general Collection as well.

extension Collection {
    public func chunked(into size: Int) -> [SubSequence] {
        var chunks: [SubSequence] = []
        chunks.reserveCapacity((underestimatedCount + size - 1) / size)
        
        var currentIndex = startIndex
        while let nextIndex = index(currentIndex, offsetBy: size, limitedBy: endIndex) {
            chunks.append(self[currentIndex ..< nextIndex])
            currentIndex = nextIndex
        }
        
        let finalChunk = suffix(from: currentIndex)
        if !finalChunk.isEmpty {
            chunks.append(finalChunk)
        }
        return chunks
    }
}

So you'll have only one implementation to maintain. I added some tweaks to your code, don't mind it.

1 Like

While I'm at it, let me throw in my spin of the implementation as well.

extension Collection {
    public func chunked(into size: Int) -> [SubSequence] {
        var chunks: [SubSequence] = []
        chunks.reserveCapacity((underestimatedCount + size - 1) / size)
        
        var residual = self[...]
        while !residual.isEmpty {
            let splitIndex = index(residual.startIndex, offsetBy: size, limitedBy: endIndex) ?? endIndex
            chunks.append(residual.prefix(upTo: splitIndex))
            residual = residual.suffix(from: splitIndex)
        }
        
        return chunks
    }
}

My favorite one yet:

extension Collection {
    public func chunked(into size: Int) -> [SubSequence] {
        var chunks: [SubSequence] = []
        chunks.reserveCapacity((underestimatedCount + size - 1) / size)
        
        var residual = self[...], splitIndex = startIndex
        while formIndex(&splitIndex, offsetBy: size, limitedBy: endIndex) {
            chunks.append(residual.prefix(upTo: splitIndex))
            residual = residual.suffix(from: splitIndex)
        }
        
        return residual.isEmpty ? chunks : chunks + CollectionOfOne(residual)
    }
}
2 Likes

I might consider making a type that does this lazily, holding onto an iterator for the base sequence and vending each chunk as needed. Then, if you really need an Array, you can use map to get one:

let s = whateverSequenceYouHave()

let lazyChunkedSequence = s.chunked(size: 3)

let arrayOfArrays = lazyChunkedSequence.map(Array.init)
1 Like

Thanks for that! I like your version with formIndex(_:offsetBy:limitedBy:). Between that and using prefix(upTo:) and suffix(from:), this is pretty readable.

I couldn’t figure out how to vend instances of T.SubSequence instead of [T.Element], but here’s a take on that:

public struct LazyChunkSequence<T: Collection>: Sequence, IteratorProtocol {
    
    private var baseIterator: T.Iterator
    private let size: Int
    
    fileprivate init(over collection: T, chunkSize size: Int) {
        baseIterator = collection.lazy.makeIterator()
        self.size = size
    }
    
    mutating public func next() -> [T.Element]? {
        var chunk: [T.Element] = []
        
        var remaining = size
        while remaining > 0, let nextElement = baseIterator.next() {
            chunk.append(nextElement)
            remaining -= 1
        }
        
        return chunk.isEmpty ? nil : chunk
    }
    
}

extension Collection {
    
    public func lazyChunkSequence(size: Int) -> LazyChunkSequence<Self> {
        return LazyChunkSequence(over: self, chunkSize: size)
    }
    
}

Here’s my proof-of-concept, which works for any Sequence, and thus does not involve either Index nor SubSequence:

final class ChunkedSequence<Base: Sequence>: Sequence, IteratorProtocol {
  var chunkSize: Int
  var iterator: Base.Iterator
  var cache: Base.Element?
  
  init(base: Base, size: Int) {
    precondition(size > 0)
    chunkSize = size
    iterator = base.makeIterator()
    cache = iterator.next()
  }
  
  func next() -> Chunk<Base>? {
    return (element == nil) ? nil : Chunk(base: self)
  }
  
  func nextElement() -> Base.Element? {
    let x = cache
    cache = iterator.next()
    return x
  }
}
struct Chunk<Base: Sequence>: Sequence, IteratorProtocol {
  var base: ChunkedSequence<Base>
  var n: Int = 0
  
  init(base: ChunkedSequence<Base>) {
    self.base = base
  }
  
  mutating func next() -> Base.Element? {
    guard n < base.chunkSize else { return nil }
    n += 1
    return base.nextElement()
  }
}
extension Sequence {
  func chunked(size: Int) -> ChunkedSequence<Self> {
    return ChunkedSequence(base: self, size: size)
  }
}

Edit:

Although, this implementation creates an extraneous empty Chunk at the end if size exactly divides count. That can be solved, but I’m calling it a night.

Edit 2: Fixed

1 Like

If you want sequence of SubSequence, you can do:

public struct ChunkSequence<T: Collection>: Sequence where T.SubSequence == T {
    public struct Iterator<T: Collection>: IteratorProtocol where T.SubSequence == T {
        fileprivate var residual: T, size: Int
        
        public mutating func next() -> T? {
            guard !residual.isEmpty else {
                return nil
            }
            
            let index = residual.index(residual.startIndex, offsetBy: size, limitedBy: residual.endIndex) ?? residual.endIndex
            
            let result = residual.prefix(upTo: index)
            residual = residual.suffix(from: index)
            
            return result
        }
    }
    
    fileprivate var data: T, size: Int
    
    public func makeIterator() -> Iterator<T> {
        return .init(residual: data[...], size: size)
    }
}

extension Collection {
    public func chunked(size: Int) -> ChunkSequence<SubSequence> {
        return ChunkSequence(data: self[...], size: size)
    }
}

Edit:

Split Sequence and Iterator as per @Ben_Cohen's suggestion.

Try not to follow the Self-Iterator (anti-)pattern. It's a bit more code, but if you declare an Iterator nested type served up via makeIterator it ensures the sequence is properly multi-pass, which, given it's based on a collection, it ought to be.

4 Likes

I have a feeling that others have both considered the problem before and its solutions.

This code is incorrect. It's assuming that all the indices in the collection are consecutive integers, i.e. you're assuming that index(after: i) is i+1.

4 Likes

I believe it would be correct if the extension were on a RandomAccessCollection where Index == Int.

No, it would still be incorrect. RAC guarantees O(1) offsetting and distance, but does not require that the index values are consecutive.

Does this imply that the correct way is to use an iterator and use next() each time you need the next element in a chunk?

You can use indices via index(after:), index(_:offsetBy:), etc.. The issue is that you're calling the global function stride(from:to:by:) on Ints. stride will perform integer striding, not index advancement within some collection.

2 Likes

Quick version that follows more of a collection consumer approach. This also has an example that illustrates the issue with the Int constrained version.

extension Collection {
  public func chunked_orignal(into size: Int) -> [SubSequence] {
    var chunks: [SubSequence] = []
    var i = startIndex

    while let nextIndex = index(i, offsetBy: size, limitedBy: endIndex) {
      let chunk = self[i ..< nextIndex]
      chunks.append(chunk)
      i = nextIndex
    }

    if i != endIndex {
      chunks.append(self[i...])
    }

    return chunks
  }
}

print("Original")
print([1,2,3,4,5,6,7,8,9,10].chunked_orignal(into: 2))
print([1,2,3,4,5,6,7,8,9,10].lazy.filter { $0 % 2 == 0 }.chunked_orignal(into: 2).map { Array($0) })

extension Collection where Index == Int {
  public func chunked_broken(into size: Index.Stride) -> [SubSequence] {
    return stride(from: startIndex, to: endIndex, by: size).map { index in
      let end = self.index(index, offsetBy: size, limitedBy: self.endIndex) ?? endIndex
        return self[index ..< end]
      }
  }
}

print("Broken")
print([1,2,3,4,5,6,7,8,9,10].chunked_broken(into: 2))
print([1,2,3,4,5,6,7,8,9,10].lazy.filter { $0 % 2 == 0 }.chunked_broken(into: 2).map { Array($0) })


extension Collection {
  public func chunked_consumer(into size: Int) -> [SubSequence] {
    var chunks: [SubSequence] = []
    var rest = self[...]
    while !rest.isEmpty {
      chunks.append(rest.prefix(size))
      rest = rest.dropFirst(size)
    }
    return chunks
  }
}

print("Consuming")
print([1,2,3,4,5,6,7,8,9,10].chunked_consumer(into: 2))
print([1,2,3,4,5,6,7,8,9,10].lazy.filter { $0 % 2 == 0 }.chunked_consumer(into: 2).map { Array($0) })

The prefix/dropFirst combo is the sort of thing discussed here under "Collection Consumers".

2 Likes

I like this implementation. Short and simple.

I would like to see this sort of function implemented into the swift standard library. Many languages have "chunking" built in.

I just released a package (there's still some work), but I'm hoping to get some of the types into the standard library given [SR-6867] "Chunking" split · Issue #49416 · apple/swift · GitHub, [SR-6864] Add a "cycle" method to the standard library · Issue #49413 · apple/swift · GitHub and Issues · apple/swift-issues · GitHub

For chunking there are two strategies and lazy support:

Array((1 ... 10).chunks(of: 3, exact: false)) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
Array((1 ... 10).chunks(of: 3, exact: true)) == [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Array((1 ... 10).chunksInexactly(of: 3)) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
Array((1 ... 10).chunksExactly(of: 3)) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

I focused more on laziness but the chunks API support eagerness on Collection and Sequence

Edit:

I don't have support for Swift 5.1, but I think it could be added easily.

1 Like