Persistent Sequence

CTMacUser · March 18, 2019, 12:24am

This extends one of my previous posts and another idea that many (many, many, many) have posted about before. I wrote most of it before I wrote the corresponding code.

Am I the only one who read this part of the documentation for Sequence:

Repeated Access

The Sequence protocol makes no requirement on conforming types regarding
whether they will be destructively consumed by iteration. As a
consequence, don't assume that multiple for-in loops on a sequence
will either resume iteration or restart from the beginning:
for element in sequence {
    if ... some condition { break }
}

for element in sequence {
    // No defined behavior
}
In this case, you cannot assume either that a sequence will be consumable
and will resume iteration, or that a sequence is a collection and will
restart iteration from the first element. A conforming sequence that is
not a collection is allowed to produce an arbitrary sequence of elements
in the second for-in loop.

To establish that a type you've created supports nondestructive iteration,
add conformance to the Collection protocol.

and saw that there's a gap for an intermediate protocol? I've had inklings about it since the first time I read that part of the docs, but I have been thinking about it more for the past few months.

This protocol was originally just an empty marker protocol. It's still up to the conforming type's developer to make sure the sequence is multi-pass; that its makeIterator() returns iterators that can run independently of each other. A month or so ago, I brainstormed about potential applications, something useful in general and can quash complaints about the protocol not "pulling its weight." I came up with: cycles. Not generating them, but detecting them.

A few days ago, I came up with another usage pattern, one that requires the protocol to not be empty: giving users their "offset" interface. Offset-dereference can be provided at the sequence level because we're working with multi-pass sequences. It's normally linear time, but I made the interface a requirement so that persistent sequences that are also random-access collections can override to provide constant time.

/// A `Sequence` that allows multiple passes.
protocol PersistentSequence: Sequence {

    /// Visit each element found from the given range of offsets.
    func forEach<R: RangeExpression>(across offsets: R, _ body: (Element) throws -> Void) rethrows where R.Bound == Int

    /// Access an element referenced by an offset.
    subscript(offset o: Int) -> Element { get }

}

extension PersistentSequence {

    func forEach<R>(across offsets: R, _ body: (Element) throws -> Void) rethrows where R : RangeExpression, R.Bound == Int {
        switch offsets {
        case is UnboundedRange:
            try forEach(body)
        case let r as PartialRangeFrom<Int>:
            var iterator = makeIterator()
            for _ in 0..<r.lowerBound {
                _ = iterator.next()!
            }
            while let e = iterator.next() {
                try body(e)
            }
        default:
            let bounds = offsets.relative(to: 0..<Int.max)
            var iterator = makeIterator()
            for _ in 0..<bounds.lowerBound {
                _ = iterator.next()!
            }
            for _ in bounds {
                try body(iterator.next()!)
            }
        }
    }

    subscript(offset o: Int) -> Element {
        var iterator = makeIterator()
        for _ in 0..<o {
            _ = iterator.next()
        }
        return iterator.next()!
    }

}

extension DropFirstSequence: PersistentSequence where Base: PersistentSequence {}
extension PrefixSequence: PersistentSequence where Base: PersistentSequence {}
// Can't do DropWhileSequence (for now)
// Can't do IteratorSequence (ever)
extension StrideTo: PersistentSequence {}
extension StrideThrough: PersistentSequence {}
// Can't do UnfoldSequence (for now)
extension EnumeratedSequence: PersistentSequence where Base: PersistentSequence {}
extension FlattenSequence: PersistentSequence where Base: PersistentSequence, Base.Element: PersistentSequence {}
extension LazyMapSequence: PersistentSequence where Base: PersistentSequence {}
extension JoinedSequence: PersistentSequence where Base: PersistentSequence, Base.Element: PersistentSequence {}
extension Zip2Sequence: PersistentSequence where Sequence1: PersistentSequence, Sequence2: PersistentSequence {}
extension LazyFilterSequence: PersistentSequence where Base: PersistentSequence {}
extension LazyDropWhileSequence: PersistentSequence where Base: PersistentSequence {}
extension LazyPrefixWhileSequence: PersistentSequence where Base: PersistentSequence {}
extension LazySequence: PersistentSequence where Base: PersistentSequence {}

// I don't think LazySequenceProtocol should/can be made to conform to PersistentSequence (conditionally when Elements: PersistentSequence).

You need both methods because:

The subscript lets you use Int.max as a potential offset. And it's the version everyone's been asking for.
The bulk processor lets you handle multiple elements in linear time, instead of quadratic.

I tried making an AnyPersistentSequence that wraps an instance of AnySequence, but that would use the default definition of the two required methods and not dynamically dispatch to the inner persistent-sequence's version. So the actual type would have to be added to the GYB macros in the file.

Above, StrideTo and StrideThrough unconditionally conform to PersistentSequence, so we would just change their base protocol. The big deal is that we'll do the same for Collection. Collection would add a default implementations like:

// My workaround code during testing.
extension PersistentSequence where Self: Collection {

    func forEach<R>(across offsets: R, _ body: (Element) throws -> Void) rethrows where R : RangeExpression, R.Bound == Int {
        let relativeBounds = offsets.relative(to: 0..<count)
        let si = index(startIndex, offsetBy: relativeBounds.lowerBound)
        let ei = index(si, offsetBy: relativeBounds.count)
        try self[si..<ei].forEach(body)
    }

    subscript(offset o: Int) -> Element {
        return self[index(startIndex, offsetBy: o)]
    }

}

Here's sample code for cycle detection. It can't be done for Sequence because it's allowed to be single-pass. It shouldn't be done for Collection because it always has to be finite. (However, it will work for collections as long as its suffix repeats at least once.)

/// The cycle length of a periodic sequence, or its length if it's finite.
fileprivate enum CycleLength {
    /// The length of a finite sequence.
    case delay(Int)
    /// The length of the cycle.
    case period(Int)
}

extension PersistentSequence {

    /// Find the period for this infinite cyclical sequence, unless it isn't, using the given predicate as the equivalence test.
    fileprivate func periodLength(maxAttempts limit: inout Int, by areEquivalent: (Element, Element) throws -> Bool) rethrows -> CycleLength? {
        // Brent's algorithm <https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm>.
        var iterator = makeIterator()
        guard let first = iterator.next() else { return .delay(0) }
        guard let second = iterator.next() else { return .delay(1) }

        var count = 2
        var power = 1, lambda = 1
        var tortoise = first, hare = second
        while try !areEquivalent(tortoise, hare) {
            guard limit > 0 else { return nil }

            limit -= 1
            if power == lambda {
                tortoise = hare
                power <<= 1
                lambda = 0
            }
            guard let nextHare = iterator.next() else { return .delay(count) }

            count += 1
            hare = nextHare
            lambda += 1
        }
        return .period(lambda)
    }

    /// Find the period and initial delay for this infinite cyclical sequence, unless it isn't, using the given predicate as the equivalence test.
    func detectCycle(maxAttempts: Int = Int.max, by areEquivalent: (Element, Element) throws -> Bool) rethrows -> (delay: Int, period: Int)? {
        precondition(maxAttempts > 0)  // maybe >= 0

        // Brent's algorithm <https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm>.
        var limit = maxAttempts
        guard let span = try periodLength(maxAttempts: &limit, by: areEquivalent) else {
            return nil
        }

        switch span {
        case .delay(let mu):
            return (delay: mu, period: 0)
        case .period(let lambda):
            // Put tortoise and hare one cycle apart.
            var tortoiseIterator = makeIterator()
            var hareIterator = dropFirst(lambda).makeIterator()

            // Find when the two pointers agree.
            var mu = 0, count = lambda
            while true {
                let te = tortoiseIterator.next()!  // count >= 1, retreads hare's path
                guard let he = hareIterator.next() else {
                    // This could only happen if we got "lucky" during periodLength()
                    // and here we read further than that method did.
                    // (Unless one of you can show that'll never happen....)
                    return (delay: count, period: 0)
                }
                count += 1
                guard try !areEquivalent(te, he) else { break }
                guard limit > 0 else { return nil }

                limit -= 1
                mu += 1
            }
            return (delay: mu, period: lambda)
        }
    }

}

extension PersistentSequence where Element: Equatable {

    /// Find the period and initial delay for this infinite cyclical sequence, unless it isn't.
    func detectCycle(maxAttempts: Int = Int.max) -> (delay: Int, period: Int)? {
        return detectCycle(maxAttempts: maxAttempts, by: ==)
    }

}

We'll first demonstrate with an array:

extension Array: PersistentSequence {}  // work-around

[Int]().detectCycle()  // (delay: 0, period: 0)
[1].detectCycle()      // (delay: 1, period: 0)
[1, 2].detectCycle()   // (delay: 2, period: 0)

It doesn't take long to detect a cycle; I got just one repetition here:

func makeBlock(cycleCount: Int) -> [Int] {
    precondition(cycleCount >= 0)

    // This sequence is from <https://en.wikipedia.org/wiki/Cycle_detection>.
    // 2, 0, 6, 3, 1, 6, 3, 1, 6, 3, 1, ....
    return [2, 0] + Array(repeatElement([6, 3, 1], count: cycleCount).joined())
}

makeBlock(cycleCount: 0).detectCycle()  // (delay: 2, period: 0)
makeBlock(cycleCount: 1).detectCycle()  // (delay: 5, period: 0)
makeBlock(cycleCount: 2).detectCycle()  // (delay: 2, period: 3)
makeBlock(cycleCount: 3).detectCycle()  // (delay: 2, period: 3)

You can limit the number of tries:

makeBlock(cycleCount: 2).detectCycle(maxAttempts: 6)  // nil
makeBlock(cycleCount: 2).detectCycle(maxAttempts: 7)  // (delay: 2, period: 3)

I got it to work on a truly unlimited sequence:

/// An infinite sequence from the Wikipedia example.
struct WikipediaSequence {
    /// A value of the sequence.
    enum State: Int {
        case two = 2, zero = 0, six = 6, three = 3, one = 1

        /// The following state in the cycle.
        var next: State {
            switch self {
            case .two:
                return .zero
            case .zero:
                return .six
            case .six:
                return .three
            case .three:
                return .one
            case .one:
                return .six
            }
        }
    }
}

extension WikipediaSequence: Sequence {

    struct Iterator: IteratorProtocol {
        var state = State.two

        mutating func next() -> Int? {
            defer { state = state.next }
            return state.rawValue
        }
    }

    __consuming func makeIterator() -> Iterator {
        return Iterator()
    }

}

extension WikipediaSequence: PersistentSequence {

    func forEach<R>(across offsets: R, _ body: (Int) throws -> Void) rethrows where R : RangeExpression, R.Bound == Int {
        for o in offsets.relative(to: 0..<Int.max) {
            try body(self[offset: o])
        }
    }

    subscript(offset o: Int) -> Int {
        switch o {
        case 0:
            return State.two.rawValue
        case 1:
            return State.zero.rawValue
        case let x where (x - 2).isMultiple(of: 3) && x >= 0:
            return State.six.rawValue
        case let x where (x - 2) % 3 == 1:
            return State.three.rawValue
        case let x where (x - 2) % 3 == 2:
            return State.one.rawValue
        default:
            preconditionFailure("Can't use negative offset")
        }
    }

}

let ws = WikipediaSequence()
ws.detectCycle(maxAttempts: 6)  // nil
ws.detectCycle(maxAttempts: 7)  // (delay: 2, period: 3)

Note my detectCycle implementation doesn't use the required methods; I added overrides since WikipediaSequence can provide them in constant time.

I didn't realize that the cycle-detection routine is easy to fool; it doesn't really look past the prefix and two cycles, so you can have non-periodic garbage and/or finite-ness past that. But I would consider those pre-condition failures (and/or you don't mind passing a Collection as long as it's long enough). It also means it can be fooled by a cycle that starts with a mini-cycle (for example: 2 3 1 2 3 1 0 9 9 2 3 1 2 3 1 0 9 9...).

The required methods are meant for non-collection persistent sequences; if you're extensively using them for collections, especially in loops, you may be "holding it wrong." There are so many fans that I'm worried about those methods' foot-gun (or leg-shotgun) potential. I could have left them for the user to define, and take their own risk on, but that would mean throwing away random-access optimizations for random-access collections and directly-computable non-collection sequences. (Remember, these methods are linear by default, so their loops are quadratic by default!) Use functional programming to trim the result set before dereference.

The multi-element version originally took a Range<Int> before I changed it to RangeExpression. Updating did make figuring out the default implementations harder. (Hmm, should I change the core method to take a sequence of ranges? That's the maximum case to go through all desired offsets in one pass to minimize quadratic code.)

We could go down to making it a marker protocol; no offsetting methods provided by default, and maybe no cycle detection. I feel it's important to have a way to enshrine multi-pass in code instead by implicit convention; you can't code-detect with convention. I feel that Collection is meant for containers, so forcing mathematically computable (possibly infinite) sequences into Collection would be inappropriate, and we need something in-between Sequence and Collection to model those. And it's a way to code sequence routines without having compromise correctness over worrying whether or not a sequence may be single-pass.

I'm not sure whether the required methods should crash (which their default implementations do now) if the sequence doesn't have enough elements to support the targeted offset(s), or if they should do-nothing/return-nil. What do all of you think? (For the multi-element version, what should happen if only some of the offsets are supported?)

I also have ideas for the "Dictionary and Set aren't really Sequences" problem, and I'm not sure whether to do the PersistentSequence proposal first or do DaSarS first, which includes PS within it. I decided to go with PS because it's smaller and potentially less controversial.

CTMacUser · March 21, 2019, 7:23am

I've been thinking of tweaking the hierarchy a bit. One of those previous mega-threads had a note that a Sequence doesn't necessarily have to publish the same list of elements each time. (This requires a sequence type that's either a class, to hide the mutating, or one that publishes iterators that access an unstable data source, like a random number generator.) So I should separate a sequence being repeatable and one that repeats a stable sequence.

ReusableSequence: Sequence, each Iterator returned from makeIterator() has state independent from each other, i.e. it's multi-pass. (Each iterator can still individually share state with the source sequence.)
PersistentSequence: ReusableSequence, a reusable sequence where each iterator returned vends identical underlying sequences if there's no mutation of the sequence between calls to makeIterator().

Of course, Collection would refine PersistentSequence. The cycle-detection function would still be require a PersistentSequence, but the subscript/forEach(across:) methods may move to ReusableSequence or maybe Sequence.