Supporting varied collections of ranges

As we’ve been building out the Algorithms package, and looking toward future additions for searching and matching within strings and collections, we’ve identified some more kinds of collections of ranges, each with their own semantics and twists on their particular details.

The RangeSet type, added via SE-0270, represents a set of indices in a collection by coalescing them into non-overlapping, non-adjoining ranges. This makes it a great fit for storing the positions, of, say, all the alphanumeric characters in a string. However, it can’t represent all the occurrences of a substring, which might adjoin, or even overlap, and need to be treated as separate entities:

let matches = "abababa".overlappingRanges(of: "aba")
// equivalent to [0..<3, 2..<5, 4..<7]

We’d like to have a consistent set of APIs to use with all different kinds of collections of ranges, so that you can call methods like removeSubranges(_:) without thinking about the specific semantics of a particular collection.

First, these are the types of range collections that I’ve identified so far:

  • The RangeSet type represents all the parts of a collection that match some criteria. While it isn’t a collection itself, it presents a collection of the ranges it encompasses through its ranges property — the ranges in this collection are always sorted in ascending order, and never overlap or even adjoin (e.g. 0..<5 and 5..<10 are coalesced into 0..<10).
  • The SubSequenceFinder (actual name tbd) type in progress in this PR is the result of a call to allRanges(of:). It is a collection of ranges that match the given subsequence, and its ranges are always in ascending order and never overlap. However, because each range represents a separate found instance, they may adjoin ("abcabc".allRanges(of: "abc") yields the equivalent of [0..<3, 3..<6]). Notably, this collection of ranges is only forward-traversable — reverse traversal can yield different subranges due to overlapping matches, so it is prohibited.
  • FromEndSubSequenceFinder is a separate type that yields ranges starting at the end of a collection rather than the beginning. These ranges do not overlap and are always in descending order, unlike SubSequenceFinder.
  • As noted in the introduction, the OverlappingSubSequenceFinder contains ranges that can overlap, and are always in ascending order. This collection yields the same ranges in either forward or backward order, so it can be bidirectional.
  • Any of the above collections can be transformed through collection operations, like prefix(), dropFirst(), lazy.filter { ... }, or even reversed().
  • Finally, someone can manually generate an arbitrary collection of arbitrary ranges, with no guarantees about the relationship between ranges in terms of ordering or anything else.

Here’s a graphic depicting the different types described above:

So, we’d like to be able to use any of these types when calling operations that work with multiple ranges at once. The methods added as part of SE-0270 only take a RangeSet as a parameter — what’s the best way to broaden the types that we can pass?

The primary challenge is that, in order for these operations to be efficient, the ranges that we pass must satisfy at least some of RangeSet’s preconditions — being in ascending order and not overlapping — that aren’t reasonable to expect for the caller. That is, you should be able to write something like this without having to think about the order of the ranges in matches:

let matches = myString.allRangesFromEnd(of: "needle")
myString.removeSubranges(matches.prefix(2))

To resolve this, I’d like to propose that we add overloads that are generic over collections of ranges. These overloads would convert the passed collection into a RangeSet, and then call the RangeSet-based method directly — the remove/move algorithms could continue to rely on the most efficient representations of the ranges that they’re working on. This approach is similar to our design for converting RangeExpression types into concrete ranges, just without an explicit protocol.

extension RangeReplaceableCollection {
    // Existing method
    public mutating func removeSubranges(_ subranges: RangeSet<Index>)

    // Additional method
    public mutating func removeSubranges<S: Sequence>(_ subranges: S) where S.Element == Range<Index>
}

This approach would work for all of the types described above, but would include the overhead of generating the new RangeSet instance, even when the passed collection already has ranges that meet the method’s requirements (like SubSequenceFinder). Additionally, we would need two of each of these methods that use a RangeSet as the parameter for their work.

Alternatively, we could introduce a protocol for collections of ranges. Types that conform to a hypothetical RangeProvidingCollection protocol would implement a method that returns a collection of ranges in ascending order:

protocol RangeProvidingCollection: Collection where
    Element == Range<Bound>
{
	associatedtype Bound: Comparable
	associatedtype AscendingRanges: Collection where
		AscendingRanges.Element == Element

	/// Returns non-overlapping ranges represented by this
    /// collection, in ascending order.
    func ascendingRanges() -> AscendingRanges
}

Of the types above, only SubSequenceFinder could trivially conform to RangesProvider, though RangeSet could as well if we move the collection view from the ranges property to the type itself. The remainder of the types would need to perform the same kind of sorting/coalescing as described above. In addition, RangeProvidingCollection would be another, narrowly-scoped protocol on top of the already complex collection protocol hierarchy. Given that existing complexity, I’m reticent to add another with such a restricted purview.


What do people think between these two solutions? Or is this problem even worth solving, given that we could require that people manually coalesce ranges into a RangeSet before calling these methods?

3 Likes

As a first impression, the protocol-based approach seems more in line with Swift’s design principles.

(Also, not really relevant, but I think the spelling “allRanges” should include overlapping ranges, and the non-overlapping version should be called something like “disjointRanges”.)

3 Likes

I had missed this question earlier. I wonder what other algorithms can be represented if the protocol is added, and I'm not sure they exist. It seems to me that the only use for it is the removeSubranges method.

Given that, I would favor the overload instead of the protocol.