Add Sequence.split(maxLength:)

The gist includes a lazy version, and I'm using maxLength to parallel Sequence.prefix(_ maxLength: Int) (although admittedly, that's an internal parameter label).

As for related functionality: I'm not sure. We have a predicate-based function, so it makes sense to also have a positional/length-based version.
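For readers skimming the thread, here is a minimal eager sketch of what a length-based split could look like. It is not the code from the gist; the body is an assumption written only to illustrate the pitched `split(maxLength:)` spelling, and it is constrained to Collection so it can return slices.

```swift
extension Collection {
    /// Hypothetical sketch: consecutive subsequences of at most `maxLength` elements.
    /// The last subsequence may be shorter than `maxLength`.
    func split(maxLength: Int) -> [SubSequence] {
        precondition(maxLength > 0, "maxLength must be positive")
        var result: [SubSequence] = []
        var start = startIndex
        while start != endIndex {
            // index(_:offsetBy:limitedBy:) returns nil when the offset would pass endIndex.
            let end = index(start, offsetBy: maxLength, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}

let rows = Array(1...7).split(maxLength: 3)
let rowArrays = rows.map { Array($0) }   // [[1, 2, 3], [4, 5, 6], [7]]
```

The slices share storage with the original array, so no elements are copied until a caller converts a row to an Array.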

Good catch. That can be changed :)

What I’m saying is that I think this intuitively feels like it meets the criteria for being lazy by default. It’s been outlined elsewhere how the core team has reasoned about this in the past, and I think it deserves serious consideration here.

What I’m referring to is this notion that it’s a reshaping API. Thus, what is the reciprocal operation? What other operations would we want to allow easy reshaping of Sequences? Can they be grouped together with each other (rather than lumping this with split)? Those are, I think, questions to be explored here.

Since the Element type of the lazy version is SubSequence, that would mean requiring a minimum of Collection.
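To make the Collection requirement concrete, here is a rough sketch of a lazy variant. The type name `LazyChunks` is hypothetical and this is not the gist's implementation; it only shows that producing `SubSequence` elements on demand needs indices and slicing, which a bare Sequence does not offer.

```swift
/// Hypothetical lazy wrapper: yields slices of `base` one at a time.
struct LazyChunks<Base: Collection>: Sequence, IteratorProtocol {
    let base: Base
    let maxLength: Int
    var position: Base.Index

    init(_ base: Base, maxLength: Int) {
        precondition(maxLength > 0, "maxLength must be positive")
        self.base = base
        self.maxLength = maxLength
        self.position = base.startIndex
    }

    /// Computes the next slice only when asked for it.
    mutating func next() -> Base.SubSequence? {
        guard position != base.endIndex else { return nil }
        let end = base.index(position, offsetBy: maxLength, limitedBy: base.endIndex)
            ?? base.endIndex
        let chunk = base[position..<end]
        position = end
        return chunk
    }
}

let chunks = LazyChunks("abcdefg", maxLength: 3).map { String($0) }
// chunks == ["abc", "def", "g"]
```

A fuller version would presumably conform to Collection itself so that the result can be indexed, but that is beyond this sketch.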

I'm totally fine with that - it's there for consistency, but in fact, the other Sequence.split functions are quite strange and probably worth deprecating. They literally just copy the sequence into an Array and split that instead, so they all return [ArraySlice<Element>]. The Collection versions don't do that, and return a [SubSequence], as this one does.
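The difference in return types is easy to see with the existing `split(separator:)`. The two helper functions below are hypothetical names used only to pin down the generic context; the return types in their signatures are what the standard library gives you at each constraint level.

```swift
// Constrained to Sequence: the result is slices of an Array.
func piecesOfSequence<S: Sequence>(_ s: S) -> [ArraySlice<Int>] where S.Element == Int {
    return s.split(separator: 0)
}

// Constrained to Collection: the result is slices of the collection itself.
func piecesOfCollection<C: Collection>(_ c: C) -> [C.SubSequence] where C.Element == Int {
    return c.split(separator: 0)
}

// StrideThrough is a Sequence but not a Collection.
let viaSequence = piecesOfSequence(stride(from: 3, through: -3, by: -1))
// ClosedRange<Int> is a Collection; its SubSequence is Slice<ClosedRange<Int>>.
let viaCollection = piecesOfCollection(-2...2)
```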

If a user really can't lift their generic constraints above Sequence to Collection, there are probably important reasons for that (i.e. the sequence truly is single-pass, and they are trying to avoid copies). If they want to copy into an Array, I think it's safe to assume that they're knowledgeable enough to do that by themselves.

We already have .joined(), which is available on all sequences and is lazy by default, which supports your argument that this too should be. We could also add a variant for variable-length rows, something like:

func split<S>(lengths: S) -> LazyVariableWidthSplitCollection<Self, S>
  where S: Sequence, S.Element == Int
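As a rough illustration of the variable-length idea, here is an eager sketch. It deliberately returns an array rather than the lazy `LazyVariableWidthSplitCollection` named above, and it assumes that elements left over once `lengths` runs out are simply dropped; both choices are assumptions for the example, not part of the pitch.

```swift
extension Collection {
    /// Hypothetical eager sketch: one subsequence per entry in `lengths`.
    /// Assumption: leftover elements are dropped when `lengths` is exhausted.
    func split<S: Sequence>(lengths: S) -> [SubSequence] where S.Element == Int {
        var result: [SubSequence] = []
        var start = startIndex
        for length in lengths {
            guard start != endIndex else { break }
            precondition(length > 0, "each length must be positive")
            // Clamp the final row to endIndex if the collection runs short.
            let end = index(start, offsetBy: length, limitedBy: endIndex) ?? endIndex
            result.append(self[start..<end])
            start = end
        }
        return result
    }
}

let ragged = Array(1...6).split(lengths: [1, 2, 3])
let raggedArrays = ragged.map { Array($0) }   // [[1], [2, 3], [4, 5, 6]]

// joined() is the existing reciprocal operation; it flattens lazily.
let flattened = Array(ragged.joined())        // [1, 2, 3, 4, 5, 6]
```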

As for more general-purpose reshaping, AFAIK the two main ways to implement it are:

  • statically enforce the rank of the collection (i.e. with generic wrappers, whose Element types eventually terminate at some scalar), or
  • make the elements a recursive, infinite tower of collections and expect the user to keep track of how many times they subscript (à la TensorFlow's ShapedArray).

This pitch, like the existing stdlib split/join functions, concerns itself with the former. But there are certain advantages to the latter (e.g. reshaping the collection at any time). There may be a place for that in the standard library, but it's sufficiently large to deserve its own proposal.

This is a "chunked sequence," which has been brought up before:

As Daryle mentioned, this was discussed before, and some of those pitches have even been opened as proposals:

Another minor naming nit: the parameter should be labeled maxCountPerSplit to make clear that it is not maxSplitCount.

The recent advent of the preview package inspired me to draft something up in this vein. I've spelled it as split(every:) (don't worry, I'm sure we'll bikeshed that; after rereading the above I found split(maxLength:) somewhat compelling) with an implementation here and a proposal document underway.

I am drafting something that very closely mirrors split(maxSplits:omittingEmptySubsequences:whereSeparator:) — in naming, documentation, and semantics — to purposefully match the existing patterns of the stdlib. I'm no longer convinced that a perfect™️ lazy implementation will help more developers at this point than having anything would.

Rough outline of a proposal: https://gist.github.com/zwaldowski/c1c93097d24d3024e8e4d940ff7f75d6

I like the goal of a fluent reading style! split(every:) is still a little cryptic, though, IMO; have you considered something like split(intoGroupsOf: 4)?

Hi @zwaldowski @xwu :)
This seems similar to the Chunked Collection proposal here.
My 2 cents about it:

"Eager splitting follows the default behavior of the rest of the library."

It seems reasonable, but I think it could be a good thing to have both lazy and eager versions, like other methods in the stdlib, e.g. reversed().

"The performance difference of a completely lazy split is likely small"

I think it would make a real difference for a large collection/sequence. Perhaps we should benchmark this to get a better idea of the performance impact of lazy vs. eager.

Maybe another name to consider: splits(of size:)? @xwu What do you think? Kinda like chunks(of size:) ...

Hope those insights are helpful :))

I don’t disagree; but I’m not sure I agree, either. I’m actively searching for an alternate spelling. But I don’t think adding extra words alone enhances clarity.

For instance, the groups in intoGroupsOf says something that is not ambiguous at the point of use: the verb “split”, the return type, and how the returned value will often immediately get used (in a method chain or as the subject of a for loop) all IMO impart that split splits something into groups and not, say, ice cream sundaes. I also look to the rest of the Collection API, where you won’t find prefix(lessThanOrEqualToInCount:) or suffix(elementsAtIndexesStartingFrom:).

I didn’t mean to imply I didn’t think there should be one; we can do both, in multiple parts, for a more manageable set of proposals. I say this in the draft proposal text.

I like it, but a verb phrase makes more sense than a noun phrase when there is a side effect being performed (in this case, making an array).

chunks(of: 4) is idiomatic English; it'd be a pretty self-explanatory API. No one says splits(of: 4), however; it's simply not English.

Making an array is not a side effect; that's the return value.

Hmm, I see 👍 ... I was just looking for a simpler option because IMO split(intoGroupsOf: 4) is a bit verbose, although it makes what the method does very clear 🙂 That's why I think chunks(of:) would be a good choice here.

I see ... Thanks for the answer 👍

I agree that it reads well. I don't think it's discoverable. While "chunking" is occasionally a term used in software, I don't know that I'd reach for "chunk" in autocomplete.

Plus, not to be indelicate, chunk/chunks/chunking does not have a positive connotation in several dialects of English.

There are several instances where the Standard Library disagrees with you. There is a balance to be made between developing the perfect APIs in isolation and fitting in with the patterns that already exist.

Like @Karl at the top of the thread, I feel adding variants to the split family is uncontroversial and fitting.

If you can find examples in the standard library, I’m all ears. (Here, split is the past participle, following the “ed/ing” rule, not the verb in the active voice.)

You're right. What an irregular word.

But are filter, the drop family, dump, readLine, finalize not named for the verb action they take to produce the result?


Anyway.

“Split every 3” also works in the past tense IMHO, but I feel like this is going to be a distraction.

I’m leaning back towards maxLength: to reduce the amount of distraction, so we can focus on discussing the benefit to a Swift user, even though the bikeshedding will likely continue unabated.


Additional side thought: intoGroupsOf doesn’t clearly communicate the semantic that a group may contain fewer elements than the given amount in a way that every: (forming a grammatical phrase) or maxLength: (not forming one) do.

I think I mentioned this earlier; but in case I haven't: in the Swift standard library, the terminology is usually "count" rather than "length." I'd be fine with something like split(maxCount:), although you do run into the ambiguity as to whether it's the maximum number of splits or the maximum number of elements per split. For full clarity, it'd probably have to be something like split(maxCountPerSplit:).
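The ambiguity is easier to see next to the existing API, where a `max…` label already limits the number of splits rather than the size of each piece. A small example using only the current standard library:

```swift
let values = [1, 0, 2, 0, 3]

// maxSplits caps how many times the collection is split,
// not how many elements each piece may contain.
let limited = values.split(separator: 0, maxSplits: 1)
let limitedArrays = limited.map { Array($0) }   // [[1], [2, 0, 3]]
```

A reader who knows `maxSplits:` could plausibly read a bare `maxCount:` the same way, which is the argument for the longer `maxCountPerSplit:` label.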

I disagree; it is well understood by anyone who has gone to elementary school that if you are asked to split into groups of three, sometimes there's a group of two. In fact I think that's the virtue of using a more colloquial expression such as that, because this implication is actually very strongly communicated.
