Add Sequence.split(maxLength:)

The standard library defines two split functions:

public func split(
    maxSplits: Int = Int.max,
    omittingEmptySubsequences: Bool = true,
    whereSeparator isSeparator: (Element) throws -> Bool
  ) rethrows -> [SubSequence]

// where Element: Equatable
public func split(
    separator: Element,
    maxSplits: Int = Int.max,
    omittingEmptySubsequences: Bool = true
  ) -> [SubSequence]

I would like to pitch adding another variant, which splits according to a maximum row-length:

public func split(maxLength: Int) -> [SubSequence]

...as well as a lazy version.

Reshaping data like this can be very useful - e.g. turning a flat collection of 100 elements into a 10x10 2D collection. It can be particularly helpful for Strings.

The implementation is trivial - see this gist. If there is support for adding it, I'll write up a proper proposal. Here are some examples:

let numbers = (0..<40).split(maxLength: 10)
for row in numbers { print(row) }

// 0..<10
// 10..<20
// 20..<30
// 30..<40

let lines = "So this is a story all about how my life got flipped, turned upside-down".lazy.split(maxLength: 10)
for line in lines { print(line) }

// So this is
//  a story a
// ll about h
// ow my life
//  got flipp
// ed, turned
//  upside-do
// wn
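
The gist is not reproduced in this thread. As a minimal sketch of how the eager version could look, assuming it is written as an extension on Collection and that a non-positive maxLength is a programmer error (both are my assumptions, not necessarily what the gist does):

```swift
extension Collection {
    /// Hypothetical sketch of the pitched function: consecutive slices of
    /// at most `maxLength` elements each; only the last slice may be shorter.
    func split(maxLength: Int) -> [SubSequence] {
        precondition(maxLength > 0, "maxLength must be greater than zero")
        var result: [SubSequence] = []
        var start = startIndex
        while start != endIndex {
            // index(_:offsetBy:limitedBy:) returns nil when the offset would
            // move past endIndex, so fall back to endIndex for the last row.
            let end = index(start, offsetBy: maxLength, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}
```

Each row is a slice of the original collection, so no elements are copied; only the outer Array of rows is allocated.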

Some initial thoughts--

It would be nice to consider what bundle of related functions might be broadly useful in the same vein, and therefore if this particular functionality might be generalizable in some way.

Based on the use cases you've demonstrated, I don't think Array is the right return type. It would be more efficient, I think, to have a lazy type so that we're not making an unnecessary copy. This is just one way in which it probably serves us best not to lump this together with the existing split functions.

(Naming nit: in the standard library, it's referred to as count rather than length.)


The gist includes a lazy version, and I'm using maxLength to parallel Sequence.prefix(_ maxLength: Int) (although admittedly, that's an internal parameter label).

As for related functionality: I'm not sure. We have a predicate-based function, so it makes sense to also have a positional/length-based version.

Good catch. That can be changed :)


What I’m saying is that I think this intuitively feels like it meets the criteria for being lazy by default. It’s been outlined elsewhere how the core team has reasoned about this in the past, and I think it deserves serious consideration here.

What I’m referring to is this notion that it’s a reshaping API. Thus, what is the reciprocal operation? What other operations would we want to allow easy reshaping of Sequences? Can they be grouped together with each other (rather than lumping this with split)? Those are, I think, questions to be explored here.


Since the Element type of the lazy version is SubSequence, that would mean requiring a minimum of Collection.
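
To make the Collection requirement concrete, here is a rough sketch of a lazy wrapper. The name LazyChunks is hypothetical, and for brevity it only conforms to Sequence rather than being a full lazy Collection. It produces each row by slicing the base between two indices, which is something a single-pass Sequence cannot offer:

```swift
/// Hypothetical lazy wrapper: rows are sliced out of `base` on demand
/// instead of being collected into an Array up front.
struct LazyChunks<Base: Collection>: Sequence, IteratorProtocol {
    let base: Base
    let maxLength: Int
    var position: Base.Index

    init(_ base: Base, maxLength: Int) {
        precondition(maxLength > 0, "maxLength must be greater than zero")
        self.base = base
        self.maxLength = maxLength
        self.position = base.startIndex
    }

    /// Returns the next row, or nil once the base is exhausted.
    mutating func next() -> Base.SubSequence? {
        guard position != base.endIndex else { return nil }
        let end = base.index(position, offsetBy: maxLength, limitedBy: base.endIndex)
            ?? base.endIndex
        let row = base[position..<end]
        position = end
        return row
    }
}

// Usage: nothing is sliced until the loop asks for the next row.
for row in LazyChunks("abcdefg", maxLength: 3) {
    print(row)
}
```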

I'm totally fine with that - it's there for consistency, but in fact, the other Sequence.split functions are quite strange and probably worth deprecating. They literally just copy the sequence into an Array and split that instead, so they all return [ArraySlice<Element>]. The Collection versions don't do that, and return a [SubSequence], as this one does.
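
A small illustration of that difference, assuming a Swift 5 or later toolchain. AnySequence is used here only to hide the Collection API of the underlying Array:

```swift
// A type-erased wrapper that exposes only the Sequence API.
let sequence = AnySequence([1, 2, 0, 3, 4])

// Sequence.split buffers the elements, so the pieces are slices of an Array.
let sequenceParts: [ArraySlice<Int>] = sequence.split(separator: 0)

// Collection.split returns slices of the original collection instead.
let text = "a,b,c"
let textParts: [Substring] = text.split(separator: ",")

print(sequenceParts)
print(textParts)
```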

If a user really can't lift their generic constraints above Sequence to Collection, there are probably important reasons for that (i.e. the sequence truly is single-pass, and they are trying to avoid copies). If they want to copy into an Array, I think it's safe to assume that they're knowledgeable enough to do that by themselves.

We already have .joined(), which is available on all sequences and is lazy by default, which supports your argument that this too should be. We could also add a variant for variable-length rows, something like:

func split<S>(lengths: S) -> LazyVariableWidthSplitCollection<Self, S>
  where S: Sequence, S.Element == Int
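
As a rough sketch of the variable-width idea, here is an eager counterpart that returns an Array rather than the lazy LazyVariableWidthSplitCollection named above (that type is not defined anywhere in this thread). I'm assuming that leftover elements are dropped once the lengths run out and that a row is cut short if the collection ends first; a real design would need to decide both points:

```swift
extension Collection {
    /// Hypothetical eager sketch: one row per entry in `lengths`.
    func split<S: Sequence>(lengths: S) -> [SubSequence] where S.Element == Int {
        var result: [SubSequence] = []
        var start = startIndex
        for length in lengths {
            precondition(length >= 0, "row lengths must not be negative")
            // Stop producing rows once the collection is exhausted.
            guard start != endIndex else { break }
            let end = index(start, offsetBy: length, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}

let rows = Array(0..<6).split(lengths: [1, 2, 3])
// joined() lazily flattens the rows again, which makes it the reciprocal operation.
let flat = Array(rows.joined())
print(rows)
print(flat)
```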

As for more general-purpose reshaping, AFAIK the two main ways to implement it are:

  • statically enforce the rank of the collection (i.e. with generic wrappers, whose Element types eventually terminate at some scalar), or
  • make the elements a recursive, infinite tower of collections and expect the user to keep track of how many times they subscript (à la TensorFlow's ShapedArray).
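
To illustrate the contrast, a very rough sketch. The Shaped type below is hypothetical and much simpler than anything in TensorFlow; it only shows the idea of flat storage plus a runtime shape, where every subscript returns the same type again:

```swift
// Static rank: the nesting depth is part of the type, so the compiler
// knows that two subscripts reach a scalar.
let staticGrid: [[Int]] = [[0, 1, 2], [3, 4, 5]]
let scalar: Int = staticGrid[1][2]

/// Hypothetical dynamically shaped value: flat storage plus a runtime shape.
struct Shaped<Scalar> {
    var scalars: [Scalar]
    var shape: [Int]

    init(scalars: [Scalar], shape: [Int]) {
        precondition(shape.reduce(1, *) == scalars.count,
                     "shape must match the number of scalars")
        self.scalars = scalars
        self.shape = shape
    }

    /// Always returns another Shaped, one rank lower; the caller has to
    /// keep track of how many times it is valid to subscript.
    subscript(i: Int) -> Shaped {
        precondition(!shape.isEmpty && i >= 0 && i < shape[0], "index out of range")
        let stride = scalars.count / shape[0]
        return Shaped(scalars: Array(scalars[i * stride ..< (i + 1) * stride]),
                      shape: Array(shape.dropFirst()))
    }

    /// Reshaping only swaps the shape; the flat storage is untouched.
    func reshaped(to newShape: [Int]) -> Shaped {
        return Shaped(scalars: scalars, shape: newShape)
    }
}

let grid = Shaped(scalars: Array(0..<6), shape: [2, 3])
let tall = grid.reshaped(to: [3, 2])
print(scalar, grid[1][2].scalars, tall[2][0].scalars)
```

The static version rejects a third subscript at compile time, while the dynamic one can be reshaped at any point but only traps at runtime when subscripted too deeply.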

This pitch, like the existing stdlib split/join functions, concerns itself with the former. But there are certain advantages to the latter (e.g. reshaping the collection at any time). There may be a place for that in the standard library, but it's sufficiently large to deserve its own proposal.
