An official proposal for "diverges(from:)"

CTMacUser · August 15, 2018, 8:44pm

Inspired by count(where:) on Sequence, I've put up a more formal proposal for the Collection.diverges(from:) method and related methods. I need a champion to put up an implementation branch; the actual code is given in the proposal, the champion needs to supply test cases.

Besides that, is there anything else the proposal needs to be put on the queue? Any suggestions to clean up the prose, if necessary, would be appreciated. Is this idea interesting enough to be put on the queue? As I said in the proposal, and in my first post on the subject, the need for diverges(from:) comes up when implementing other algorithms.

jawbroken · August 16, 2018, 1:16am

I think the general idea seems useful, but the proposed versions seem kind of hard to use due to the 3-tuple return combined with optional types. The usage examples you give mostly just test these for optionality, and they all ignore the prefix count. I also notice that the C++ version returns the pair of iterators, making it easier to pick up from where you left off, a use case that you might want to consider. In general I would be interested in seeing more use cases and alternative implementations.

I don't have time right now to think a lot about this, but one possibility would be something like zip(_:_:), returning a (lazy?) sequence type that provided this information in some form. e.g. return a sequence of enum cases roughly like .equal(_.SubSequence), .notEqual(Element, Element), .firstSuffix(S1.SubSequence), .secondSuffix(S2.SubSequence), but with better names and more thought about the performance requirements and details (e.g. should .notEqual provide elements or equal length subsequences where they differ?).

divergence([2, 3, 4, 5], [2, 3, 5, 7, 11])
// roughly [.equal([2, 3]), .notEqual(4, 5), .notEqual(5, 7), .secondSuffix([11])]
//      or [.equal([2, 3]), .notEqual([4, 5], [5, 7]), .secondSuffix([11])]

Or if that is too general and almost all the use cases really only need the length of the common prefix and the differing elements, you could consider returning a struct with the common prefix length and an enum with a case containing the differing pair of (non-optional) elements and other cases describing the other possibilities (sequences are the same, hit the end of one sequence or the other).

gregtitus · August 17, 2018, 2:32am

I think the tuple return types are reasonable, given the current nature of Sequence being potentially only iterable once. The sort-of equivalent to the C++ version returning a pair of iterators is the Collection extension that returns a pair of indexes.

This proposal would get a +1 from me.

Minor implementation note, I think the main loop in findDivergence() would be cleaner as

    while true {
      switch (iterator.next(), otherIterator.next()) {
      case let (e?, oe?) where try areEquivalent(e, oe):
        prefixCount += 1
      case let (optE, optOE):
        return (prefixCount, optE, optOE)
      }
    }

Ben_Cohen · August 17, 2018, 8:27pm

I think this would be a great addition to Collection, returning the index of the point of divergence. The painful nature of the return type is an indication, to me, that this does not belong on Sequence.

Regarding naming: "diverges from" sounds like a predicate. I would suggest something like Collection.divergentIndex(verus:) (don't like that arg label much tho).

Ben_Cohen · August 17, 2018, 8:40pm

Hi @CTMacUser – is there a reason why you're not able to put up the implementation as a PR? If there's anything unclear about how to integrate into the stdlib or how the tests should be written, I'm sure people would be happy to help over on the stdlib dev forum. Writing the implementation and tests is an integral part of putting together the proposal, as it helps bottom out any edge cases or implementation nuances that might otherwise be missed.

CTMacUser · August 21, 2018, 8:09am

You can understand the need for a common-prefix count and the differing element values, but you need a method to return all three when the sequence is single-pass. Would a return of:

(offset: Int, patch: (Element?, OtherSequence.Element?))

look better? It's reminiscent of Sequence.enumerated and zipSequence, along with Unix diff. Although the name "patch" works better when both sub-elements are non-nil.

Speaking of zipSequence, I saw another thread of zipLongest, and it's neat. The current zip-sequence system seems like it would almost work for diverges, but its information on the end of the sequences is wrong. What zipLongest provides is better for this work.