Add Sequence.split(maxLength:)

Karl · February 11, 2020, 11:18pm

The standard library defines two split functions:

public func split(
    maxSplits: Int = Int.max,
    omittingEmptySubsequences: Bool = true,
    whereSeparator isSeparator: (Element) throws -> Bool
  ) rethrows -> [SubSequence]

// where Element: Equatable
public func split(
    separator: Element,
    maxSplits: Int = Int.max,
    omittingEmptySubsequences: Bool = true
  ) -> [SubSequence]

I would like to pitch adding another variant, which splits according to a maximum row-length:

public func split(maxLength: Int) -> [SubSequence]

...as well as a lazy version.

Reshaping data like this can be very useful - e.g. turning a flat collection of 100 elements into a 10x10 2D collection. It can be particularly helpful for Strings.

The implementation is trivial - see this gist. If there is support for adding it, I'll write up a proper proposal. Here are some examples:

let numbers = (0..<40).split(maxLength: 10)
for row in numbers { print(row) }

// 0..<10
// 10..<20
// 20..<30
// 30..<40

let lines = "So this is a story all about how my life got flipped, turned upside-down".lazy.split(maxLength: 10)
for line in lines { print(line) }

// So this is
//  a story a
// ll about h
// ow my life
//  got flipp
// ed, turned
//  upside-do
// wn
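For readers without the gist at hand, here is a minimal sketch of how an eager version could look. It is an assumption about the shape of the implementation, not the gist's actual code; only the `split(maxLength:)` signature comes from the pitch above.

```swift
extension Collection {
    /// Hypothetical sketch: splits the collection into consecutive
    /// slices holding at most `maxLength` elements each.
    func split(maxLength: Int) -> [SubSequence] {
        precondition(maxLength > 0, "maxLength must be greater than zero")
        var result: [SubSequence] = []
        var start = startIndex
        while start != endIndex {
            // Step forward by maxLength, clamping to endIndex so the
            // final row may be shorter than the others.
            let end = index(start, offsetBy: maxLength, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}

let rows = (0..<40).split(maxLength: 10)
// rows == [0..<10, 10..<20, 20..<30, 30..<40]

let text = "flipped, turned upside-down".split(maxLength: 10)
// text contains "flipped, t", "urned upsi", "de-down"
```

Because each row is a `SubSequence`, the elements are not copied; only the outer array of slices is allocated.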

xwu · February 11, 2020, 11:25pm

Some initial thoughts--

It would be nice to consider what bundle of related functions might be broadly useful in the same vein, and therefore if this particular functionality might be generalizable in some way.

Based on the use cases you've demonstrated, I don't think Array is the right return type. It would be more efficient, I think, to have a lazy type so that we're not making an unnecessary copy. This is just one way in which it probably serves us best not to lump this together with the existing split functions.

(Naming nit: in the standard library, it's referred to as count rather than length.)

Karl · February 11, 2020, 11:28pm

The gist includes a lazy version, and I'm using maxLength to parallel Sequence.prefix(_ maxLength: Int) (although admittedly, that's an internal parameter label).

As for related functionality: I'm not sure. We have a predicate-based function, so it makes sense to also have a positional/length-based version.

xwu · February 11, 2020, 11:55pm

Good catch. That can be changed :)

xwu · February 11, 2020, 11:58pm

What I’m saying is that I think this intuitively feels like it meets the criteria for being lazy by default. It’s been outlined elsewhere how the core team has reasoned about this in the past, and I think it deserves serious consideration here.

xwu · February 12, 2020, 12:01am

What I’m referring to is this notion that it’s a reshaping API. Thus, what is the reciprocal operation? What other operations would we want to allow easy reshaping of Sequences? Can they be grouped together with each other (rather than lumping this with split)? Those are, I think, questions to be explored here.

Karl · February 12, 2020, 12:08am

Since the Element type of the lazy version is SubSequence, that would mean requiring a minimum of Collection.
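To illustrate why slicing pushes the requirement up to Collection, here is a rough sketch of a lazy wrapper. The type name `ChunkedSlices` is made up for this example and is not taken from the gist; the point is only that producing `Base.SubSequence` needs index-based access to the base.

```swift
/// Hypothetical lazy wrapper: produces slices of `base` on demand
/// instead of building an array of rows up front.
struct ChunkedSlices<Base: Collection>: Sequence, IteratorProtocol {
    let base: Base
    let maxLength: Int
    var position: Base.Index

    init(_ base: Base, maxLength: Int) {
        precondition(maxLength > 0, "maxLength must be greater than zero")
        self.base = base
        self.maxLength = maxLength
        self.position = base.startIndex
    }

    /// Returns the next row, or nil once the base is exhausted.
    mutating func next() -> Base.SubSequence? {
        guard position != base.endIndex else { return nil }
        // Slicing by index range is what requires Collection rather than Sequence.
        let end = base.index(position, offsetBy: maxLength, limitedBy: base.endIndex)
            ?? base.endIndex
        let row = base[position..<end]
        position = end
        return row
    }
}

var lazyRows = ChunkedSlices("abcdefg", maxLength: 3)
let firstRow = lazyRows.next()  // only the first row has been computed so far
```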

I'm totally fine with that - it's there for consistency, but in fact, the other Sequence.split functions are quite strange and probably worth deprecating. They literally just copy the sequence into an Array and split that instead, so they all return [ArraySlice<Element>]. The Collection versions don't do that, and return an [SubSequence], as this one does.
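A quick way to see the difference in return types, using `AnySequence` to hide the underlying collection so that only the Sequence overload applies. The explicit type annotations are there so the compiler checks the claim:

```swift
// Wrapped in AnySequence, the value is only known to be a Sequence.
let single = AnySequence([1, 2, 0, 3, 4])

// Sequence.split hands back slices of a temporary Array.
let fromSequence: [ArraySlice<Int>] = single.split(separator: 0)

// Collection.split hands back slices of the original collection.
let fromCollection: [Substring] = "a,b,c".split(separator: ",")

print(type(of: fromSequence))
print(type(of: fromCollection))
```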

If a user really can't lift their generic constraints above Sequence to Collection, there are probably important reasons for that (i.e. the sequence truly is single-pass, and they are trying to avoid copies). If they want to copy into an Array, I think it's safe to assume that they're knowledgeable enough to do that by themselves.

We already have .joined(), which is available on all sequences and is lazy by default, which supports your argument that this too should be. We could also add a variant for variable-length rows, something like:

func split<S>(lengths: S) -> LazyVariableWidthSplitCollection<Self, S>
  where S: Sequence, S.Element == Int
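As a rough illustration of the variable-length idea, here is an eager sketch that returns an array instead of the hypothetical `LazyVariableWidthSplitCollection`. The behaviour at the edges is my assumption: a row is clamped if the collection runs out, and any elements left after the last length are dropped.

```swift
extension Collection {
    /// Hypothetical eager counterpart to the signature above: row i
    /// holds at most lengths[i] elements.
    func split<S: Sequence>(lengths: S) -> [SubSequence] where S.Element == Int {
        var result: [SubSequence] = []
        var start = startIndex
        for length in lengths {
            // Assumption: stop as soon as the collection is exhausted.
            guard start != endIndex else { break }
            precondition(length >= 0, "lengths must not be negative")
            let end = index(start, offsetBy: length, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}

let parts = "2020-02-11".split(lengths: [4, 1, 2, 1, 2])
// parts contains "2020", "-", "02", "-", "11"
```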

As for more general-purpose reshaping, AFAIK the two main ways to implement it are:

1. statically enforce the rank of the collection (i.e. with generic wrappers, whose Element types eventually terminate at some scalar), or
2. make the elements a recursive, infinite tower of collections and expect the user to keep track of how many times they subscript (ala TensorFlow's ShapedArray).

This, like the existing stdlib split/join functions, concerns itself with the former. But there are certain advantages to the latter (e.g. reshaping the collection at any time). There may be a place for that in the standard library, but it's sufficiently large to deserve its own proposal.

CTMacUser · February 25, 2020, 9:51pm

This is a "chunked sequence," which has been brought up before:

LucianoPAlmeida · February 25, 2020, 10:13pm

As Daryle mentioned, this was discussed before, and some of those pitches have even been opened as proposals:

bob · March 2, 2020, 2:09am

Another minor naming nit: the parameter should be labeled maxCountPerSplit to make clear that it is not maxSplitCount.

zwaldowski · March 13, 2020, 8:23pm

The recent advent of the preview package inspired me to draft something up in this vein. I've spelled it as split(every:) (don't worry, I'm sure we'll bikeshed that; after rereading the above I found split(maxLength:) somewhat compelling) with an implementation here and a proposal document underway.

I am drafting something that very closely mirrors split(maxSplits:omittingEmptySubsequences:whereSeparator:) — in naming, documentation, and semantics — to purposefully match the existing patterns of the stdlib. I'm no longer convinced that a perfect™️ lazy implementation will help more developers at this point than having anything would.

zwaldowski · March 13, 2020, 10:15pm

Rough outline of a proposal: 0000-split-every.md · GitHub

xwu · March 13, 2020, 11:00pm

I like the goal of a fluent reading style! split(every:) is still a little cryptic, though, IMO; have you considered something like split(intoGroupsOf: 4)?

LucianoPAlmeida · March 14, 2020, 2:37pm

Hi @zwaldowski @xwu :)
This seems similar to the Chunked Collection proposal here.
My 2 cents about it:

"Eager splitting follows the default behavior of the rest of the library."

It seems reasonable, but I think it could be a good thing to have both lazy and eager ... like other methods in the stdlib e.g. reversed()
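For reference, this is roughly how `reversed()` offers both flavours today. The type annotations select the overload explicitly:

```swift
let numbers = [1, 2, 3]

// BidirectionalCollection.reversed(): a view over the original array,
// no elements are copied.
let view: ReversedCollection<[Int]> = numbers.reversed()

// Sequence.reversed(): eagerly builds a new array.
let copy: [Int] = numbers.reversed()

print(Array(view), copy)
```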

the performance difference of a completely lazy split is likely small

I think it should make a real difference for a large collection/sequence; perhaps we should benchmark this to get a better notion of the performance impact of lazy vs eager ...

Maybe another name to consider is splits(of size: )? @xwu What do you think? Kinda like chunks(of size) ...

Hope those insights are helpful :))

zwaldowski · March 14, 2020, 3:25pm

I don’t disagree; but I’m not sure I agree, either. I’m actively searching for an alternate spelling. But I don’t think adding extra words alone enhances clarity.

For instance, the groups in intoGroupsOf says something that is not ambiguous at the point of use — the verb “split”, the return type, and how the returned value will often immediately get used (in a method chain or as the subject of a for loop) all IMO impart that split splits something into groups and not, say, ice cream sundaes. I also look to the rest of the Collection API, where you won’t find prefix(lessThanOrEqualToInCount:) or suffix(elementsAtIndexesStartingFrom:).

zwaldowski · March 14, 2020, 3:52pm

I didn’t mean to imply I didn’t think there should be one; we can do both, in multiple parts, for a more manageable set of proposals. I say this in the draft proposal text.

I like it, but a verb phrase makes more sense than a noun phrase when there is a side effect being performed (in this case, making an array).

xwu · March 14, 2020, 4:01pm

chunks(of: 4) is idiomatic English; it'd be a pretty self-explanatory API. No one says splits(of: 4), however; it's simply not English.

Making an array is not a side effect; that's the return value.
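To compare how the spellings floated so far read at the call site, here is a small sketch where they all forward to one helper. Every declaration below is hypothetical, including the helper name `rows(ofAtMost:)`; none of these exist in the standard library.

```swift
extension Collection {
    /// Shared hypothetical implementation behind every candidate name.
    func rows(ofAtMost count: Int) -> [SubSequence] {
        precondition(count > 0, "count must be greater than zero")
        var result: [SubSequence] = []
        var start = startIndex
        while start != endIndex {
            let end = index(start, offsetBy: count, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }

    // Candidate spellings from this thread, all with the same behaviour.
    func split(maxLength: Int) -> [SubSequence] { rows(ofAtMost: maxLength) }
    func split(every count: Int) -> [SubSequence] { rows(ofAtMost: count) }
    func split(intoGroupsOf count: Int) -> [SubSequence] { rows(ofAtMost: count) }
    func chunks(of count: Int) -> [SubSequence] { rows(ofAtMost: count) }
}

let data = Array(1...10)
let a = data.split(maxLength: 4)
let b = data.split(every: 4)
let c = data.split(intoGroupsOf: 4)
let d = data.chunks(of: 4)
```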

LucianoPAlmeida · March 14, 2020, 10:07pm

Hmm, I see ... I was just looking for a simpler option because IMO split(intoGroupsOf: 4) is a bit verbose, although it makes what the method does very clear. That's why I think chunks(of:) would be a good choice here.

I see ... Thanks for the answer

zwaldowski · March 15, 2020, 4:49pm

I agree that it reads well. I don't think it's discoverable. While "chunking" is occasionally a term used in software, I don't know that I'd reach for "chunk" in autocomplete.

Plus, not to be indelicate, chunk/chunks/chunking does not have a positive connotation in several dialects of English.

There are several instances where the Standard Library disagrees with you. There is a balance to be made between developing the perfect APIs in isolation and fitting in with the patterns that already exist.

Like @Karl at the top of the thread, I feel adding variants to the split family is uncontroversial and fitting.

xwu · March 15, 2020, 6:17pm

If you can find examples in the standard library, I’m all ears. (Here, split is the past participle, following the “ed/ing” rule, not the verb in the active voice.)