[Prototype] Protocol-powered generic trimming, searching, splitting,

Michael_Ilseman · September 30, 2019, 11:36pm

Hi all, I’d like to share a prototype of protocol-powered generic Collection APIs for trimming, searching, splitting, replacing, etc. You can browse the code here and contributed as a PR. Instructions and an overview is given in the README, presented below.

This is a very early prototype of what will likely be broken out into multiple pitches.

Collection Consumers and Searchers Prototype

Early prototype of generalized collection consumers and searchers, APIs powered by them, and a smattering of conformers.

Inspired by Rust’s RFC-2500.

Introduction

This prototype demonstrates two main protocols, CollectionConsumer and CollectionSearcher, and a suite of API generic over them.

Example:

"  \n\tThe quick brown fox\n".trim { $0.isWhitespace }
// "The quick brown fox"

[1,2,3,2,1,5,5,9,0,9,0,9,0,9].matches(PalindromeFinder()).map { $0.count }
// [5, 2, 7]

let number = try! NSRegularExpression(pattern: "[-+]?[0-9]*\\.?[0-9]+")
var str = "12.34,5,.6789\n10,-1.1"

str.split(number)
// [",", ",", "\n", ","]

str.matches(number).joined(separator: " ")
// "12.34 5 .6789 10 -1.1"

str.replaceAll(number) { "\(Double($0)!.rounded())" }
// str == "12.0,5.0,1.0\n10.0,-1.0"

How to Use

Add as package dependency:

URL: GitHub - milseman/swift-evolution: This maintains proposals for changes and user-visible enhancements to the Swift Programming Language.
branch: collections_om_nom_nom

Import with import Prototype_CollectionConsumerSearcher

Brief Background: The Royal Road to Regex

Native regex literals will be built on top of generic functionality and used with generic APIs. These can be broken down into 4 stages:

Consumers: nibbling off the front or back of a Collection
Searchers: finding a needle in a haystack (i.e. a Collection)
Validators: constructing types in the process of consuming/searching
Destructuring Pattern Matching Syntax: Like ~=, but can bind a value

Each stage delivers important API and syntax improvements to users alongside powerful extension points for libraries. Native regex literals would conform to these protocols and leverage new syntax, allowing them to be used by the same generic APIs.

This prototype covers stages 1 and 2.

Caveats and Disclaimers (aka “This is Just a Prototype”)

Why So Much API?

I’m not sure yet what will be a super convenient overload vs excessive bloat, so I erred on the side of defining anything I thought might be useful. For the prototype’s purposes, this lets us see a broader array of functionality enabled from these constructs.

I Hate Your Names!

Since this is an early prototype and there’s a lot of API here, names were chosen to be clearly distinguished and follow a regular convention. I welcome any contributions towards solving one of the hardest problems in computer science: naming things.

But How Does it Perform?

No performance work has been done so far.

This prototype is written in a naive fashion without any tuning or optimization. Simple overloads are written in terms of the more general constructs, allowing for better testing and demonstration of how to use them, but this technique can result in overhead. In a perfect world, these could be optimized away (at the cost of some compilation time), but the reality of optimizers is that real world constraints lead to heuristic-driven decisions.

Once we have a design in place, we can write the benchmarks and do tuning. We’ll especially want to provide fast-paths for certain concrete types. E.g. if a String is normalized, operate directly over its UTF-8 bytes.

Can I Deploy This in Production?

Please don’t. Some parts have been heavily tested, but some have not. This is just meant to generate ideas and early feedback on the approach.

Overview

Protocols

`CollectionConsumer`

Conformers to CollectionConsumer are able to perform a match anchored at some point:

public protocol CollectionConsumer {
  associatedtype C: Collection

  func consume(_: C, from: C.Index) -> C.Index?
}

public protocol BidirectionalCollectionConsumer: CollectionConsumer
where C: BidirectionalCollection {
  // Return value is 1-past-the-end
  func consumeBack(_: C, endingAt: C.Index) -> C.Index?
}

Consumer state can be created at the time of initialization (e.g. compiling a regex), persisted across many collections, and across repeated invocations for any given collection.

Generic API powered by CollectionConsumer:

extension Collection {
  public func trimFront<CC: CollectionConsumer>(_: CC) -> SubSequence
  where CC.C == Self

  public func starts<CC: CollectionConsumer>(with: CC) -> Bool where CC.C == Self
}

extension BidirectionalCollection {
  public func trimBack<CC: BidirectionalCollectionConsumer>(_: CC) -> SubSequence 
  where CC.C == Self

  public func trim<CC: BidirectionalCollectionConsumer>(_: CC) -> SubSequence
  where CC.C == Self

  public func ends<CC: BidirectionalCollectionConsumer>(with: CC) -> Bool
  where CC.C == Self
}

extension RangeReplaceableCollection {
  public mutating func stripFront<CC: CollectionConsumer>(_: CC) where CC.C == Self 
}

extension RangeReplaceableCollection where Self: BidirectionalCollection {
  public mutating func stripBack<CC: BidirectionalCollectionConsumer>(cc: CC) 
  where CC.C == Self

  public mutating func strip<CC: BidirectionalCollectionConsumer>(cc: CC)
  where CC.C == Self
}

extension Collection where SubSequence == Self {
  @discardableResult
  public mutating func eat<CC: CollectionConsumer>(_: CC) -> SubSequence
  where CC.C == Self

  @discardableResult
  public mutating func eatAll<CC: CollectionConsumer>(_: CC) -> [SubSequence]
  where CC.C == Self
}

`CollectionSearcher`

Conformers to CollectionSearcher are able to perform a match anywhere from a staring point (not an anchor).

public protocol CollectionSearcher {
  associatedtype C: Collection

  associatedtype State = ()

  func preprocess(_: C) -> State

  func search(_: C, from: C.Index, _: inout State) -> Range<C.Index>?
}

public protocol BidirectionalCollectionSearcher: CollectionSearcher
where C: BidirectionalCollection {
  func searchBack(
    _ c: C, endingAt idx: C.Index, _: inout State
  ) -> Range<C.Index>?
}

Naively, any consumer can be made into a searcher with O(n * m) complexity. To achieve better performance, efficient searchers avoid redundant computation by managing state. Searcher conformers can choose to manage 3 levels of state:

Persisted state across all inputs, created during init
- Examples: compile a regex, preprocess the search term
State persisted across calls for a given instance, created during preprocess(_:C)
- Examples: check if a String is normalized, build up a suffix tree
State updated across calls for a given instance, via the inout State parameter to search
- Examples: Z-algorithm’s Z array, fuzzy search context

Many searchers have no per-collection or cross-call state, so convenience implementations are provided for them:

extension CollectionSearcher where State == () {
  public func preprocess(_: C) -> State { return () }
}

Generic API powered by CollectionSearcher:


extension Collection {
  public func firstRange<CS: CollectionSearcher>(_: CS) -> Range<Index>?
  where CS.C == Self

  public func first<CS: CollectionSearcher>(_:) -> SubSequence?
  where CS.C == Self

  public func matches<CS: CollectionSearcher>(
    _: CS
  ) -> CollectionMatchSequence<Self, CS> // A Sequence of matched C.SubSequence

  public func split<CS: CollectionSearcher>(
    _: CS, maxSplits: Int = Int.max, omittingEmptySubsequences: Bool = true
  ) -> [SubSequence] where CS.C == Self
}

extension BidirectionalCollection {
  public func lastRange<CS: BidirectionalCollectionSearcher>(
    _: CS
  ) -> Range<Index>? where CS.C == Self

  public func last<CS: BidirectionalCollectionSearcher>(
    _: CS
  ) -> SubSequence? where CS.C == Self
}

extension RangeReplaceableCollection {
  public mutating func replaceFirst<C: Collection, CS: CollectionSearcher>(
    _: CS, with: (SubSequence) -> C
  ) where CS.C == Self, C.Element == Element

  public mutating func replaceAll<C: Collection, CS: CollectionSearcher>(
    _: CS, with: (SubSequence) -> C
  ) where CS.C == Self, C.Element == Element
}

extension RangeReplaceableCollection where Self: BidirectionalCollection {
  public mutating func replaceLast<
    C: Collection, CS: BidirectionalCollectionSearcher
  >(_: CS, with: (SubSequence) -> C) where CS.C == Self, C.Element == Element
}

extension Collection where SubSequence == Self {
  @discardableResult
  mutating func eat<CS: CollectionSearcher>(untilFirst: CS) -> SubSequence
  where CS.C == Self

  @discardableResult
  mutating func eat<CS: CollectionSearcher>(throughFirst: CS) -> SubSequence
  where CS.C == Self
}

extension BidirectionalCollection where SubSequence == Self {
  @discardableResult
  mutating func eat<CS: BidirectionalCollectionSearcher>(
    untilLast: CS
  ) -> SubSequence where CS.C == Self

  @discardableResult
  mutating func eat<CS: BidirectionalCollectionSearcher>(
    throughLast: CS
  ) -> SubSequence where CS.C == Self
}

Conformers

This prototype adds conformances for types from Foundation for user convenience:

extension CharacterSet: CollectionConsumer, CollectionSearcher {
  // For String, respecting grapheme-cluster boundaries
}
extension NSRegularExpression: CollectionConsumer, CollectionSearcher {
	// For String
}
extension Scanner: CollectionConsumer, CollectionSearcher {
  // For String, by forwarding on to `charactersToBeSkipped`	
}

Since CharacterSet is effectively just a set of Unicode.Scalar, it doesn’t have to be limited to just String. This prototype adds the ability to construct a consumer/searcher for any collection of Characters, where it operates respecting grapheme-cluster boundaries, as well as for any collection of Unicode.Scalar, which ignores grapheme-cluster boundaries:

extension CharacterSet {
  public func searcher<C: Collection>(
    for: C.Type = C.self
  ) -> CharacterSearcher<C> where C.Element == Character

  public func searcher<C: Collection>(
    for: C.Type = C.self
  ) -> UnicodeScalarSearcher<C> where C.Element == Unicode.Scalar
}

(This approach will get a lot better when opaque result types can specify associated types)

This prototype also demonstrates how to conform 3rd party library types to these protocols, by providing conformances for:

PalindromeFinder: A custom-written palindrome finder
StdlibPattern from the stdlib prototype, which demonstrates a technique for retroactively providing this functionality to a stable protocol
Pattern from PatternKit, as an example conformance for an existing SPM package
TODO: Provide a NSPredicate conformance, getting around the untyped-ness somehow
TODO: Provide a generic 2-way search algorithm when Element: Comparable, which will perhaps be the default search algorithm
TODO: Provide a generic Z Algorithm to demonstrate cross-call state.

Tour of API, Naming Convention, and Lexicon

CollectionConsumer: A consumer returns a match anchored at a given position, else nil
- Name taken directly from Rust, but not ideal (has nothing in common with __consuming, could imply mutation, etc.)
- “Trimmer” could be another name, but that overweights one specific use for it
CollectionSearcher: A searcher returns the next match anywhere after a given position, else nil
BidirectionalCollectionConsumer / BidirectionalCollectionSearcher
- Can run matching logic backwards over a BidirectionalCollection
Generalized trim, non-mutating operations returning the rest of the collection after a consumption from the head/tail/both.
- trimFront is available on all collections and consumes a prefix. trimBack and trim are available on bidirectional collections/consumers
  - trimBack consumes a suffix and trim consumes both a prefix and a suffix.
  - All of the APIs below exist for all 3 forms, but I will just be listing trim
- trim(_: BidirectionalCollectionConsumer): Generalized trim operation
- trim(_: Element), trim(_: (Element) -> Bool) trims off all leading and trailing occurrences
- trim(in: Set<Element>) trims off the leading and trailing elements which are present in the set
  - trim(anyOf: Sequence): does the same thing; it is a convenience overload present in many languages/libraries
  - Argument label helps disambiguate Set and Array literals, which use the same syntax
- trim(exactly: Sequence): trims a single occurrence of a prefix and suffix
  - Argument label helps disambiguate Character and String literals, which use the same syntax
- strip is the mutating version of trim
  - There is a corresponding strip API for every trim API for RangeReplaceableCollecitons
Generalization of existing APIs through consumers
- starts(with:) / ends(with:) taking CollectionConsumer / BidirectionalCollectionConsumer
- TODO: consider adding drop / dropBack, prefix / suffix, removeFirst / removeLast
Generalized eat available on slices: advances the slice’s range as part of a successful match
- Om nom nom
- All eat methods return what was eaten as a @discardableResult, else nil if nothing happened.
- eat(_:CollectionConsumer): generalized eat operation
- eat(upTo:Index) adjusts the slice’s lowerBound, eat(from:Index) adjusts the slice’s upperBound
- eat(untilFirst: CollectionSearcher), eat(throughFirst: CollectionSearcher) generalized eat until/through the first match in the slice
  - Convenience overloads for (Element) -> Bool), Set<Element>, Element
- eat(untilLast: BidirectionalCollectionSearcher), eat(throughLast: BidirectionalCollectionSearcher) generalized eat until/through the last match in the slice
  - Convenience overloads for (Element) -> Bool), Set<Element>, Element,
- eat() , eat(one: (Element) -> Bool), eat(one: Element), eat(oneIn: Set<Element>) eats a single Element off
- eat(count: Int), eat(while: (Element) -> Bool), eat(many: Element), and eat(whileIn: Set<Element>) eat many off
- eat(exactly: Sequence) eats one occurrence of a given prefix
Generalized find and replace operations
- firstRange(_:CollectionSearcher) / first(_:CollectionSearcher) return the range/slice of the first match
  - Convenience overload taking Sequence
  - Corresponding lastRange and last for BidirectionalCollectionSearcher
- replaceFirst(_: CollectionSearcher, with: (SubSequence) -> Collection passes first match to a closure to generate the replacement
  - replaceAll does them all
  - Corresponding replaceLast for BidirectionalCollectionSearcher
Generalized access to matches from a searcher
- Collection.matches(_: CollectionSearcher) produces a lazy sequence of successful search matches
  - TODO: Make an unknown-count collection, like other lazy operations
  - TODO: Make it bidirectional if the collection/searcher is
- Collection.split(_:CollectionSearcher, …) -> [SubSequence]: generalized split

TODO: Slicing Problem

Some conformers would like to be applied to both C and C.SubSequence. One approach is to change the protocol requirement to take a slice, and add overloads over C.SubSequence for every API.

Workaround: for conformers generic over the collection, there is no issue. For concrete type conformers, they can often vend generic helper structs. See CharacterSet.searcher(for:) pattern above.

What’s Next?

I want to let this circulate for a while to collect ideas, opinions, and use cases.

Before any formal pitch or proposal, I want to have a good idea of what type validators will look like in order to avoid making incongruous API decisions early on.

CTMacUser · October 2, 2019, 7:50am

For the protocols, since associated types are visible to those that use the protocol or a conforming type, you should use a name that isn’t a single letter. Would “Client” work?

shawnthroop · October 2, 2019, 8:12am

There’s a lot to like here, thanks for such a detailed contribution. Personally, I’m in favor of and extensible API and I have a use case in mind for a custom CollectionSearcher.

On the topic of naming the associated types, what about similar to Optional’s Wrapped.

protocol CollectionConsumer {
    associatedtype Consumed: Collection
    //...
}

protocol CollectionSearcher {
    associatedtype Searched: Collection
    //...
}

Michael_Ilseman · October 2, 2019, 5:54pm

I had C as a placeholder name because I couldn't think of anything good to use instead. I think we'd want to share the associated type name between consumers and searchers, since many searchers can be both. I even considered making CollectionSearcher refine CollectionConsumer, but haven't come across a sufficiently compelling reason to. I want to see how validators shake out to help inform the design.

If we had to have two different names, I do like "Searched" / "Consumed". I'm also hoping for a better name than "consumer" as well, but the best alternative I could think of was "trimmer".

SDGGiesbrecht · October 2, 2019, 8:16pm

I have something like this I have been using for years. The functionality is more‐or‐less the same, but the API is significantly different. (I’ll use your terms here though.) I mention it because there is one performance Achilles’ heel I have never been able solve. And it has probably drained more of my time than anything else I’ve ever tried to fix.

In order to compose consumers (such as to find 100 consecutive pairs of (PalindromeFinder + NSRegularExpression)) you need to be able to store heterogeneous lists of various patterns where the target collection is a certain type (such as [CollectionConsumer where C == String]). There is no direct way to do that in Swift right now. The solution I use right now is to have an abstract class, CollectionConsumerBase<Target>, that conforms to CollectionConsumer, and that every consumer wishing to participate in composition must be a subclass of. Then it is possible to store them in arrays: [CollectionConsumerBase<String>]. The problem is that this obliterates all manner of optimizability and also injects dynamic dispatch and the passing around of mountains of protocol witness tables. An immense amount of tailoring has provided certain short cuts for the compiler that it can then inline, so that a few of the most common applications are lightning fast. But if you try to search a collection for something that the implementations haven’t been hand‐tailored for, then suddenly you are not just dealing with molasses in January, but something more like concrete running up a cliff at absolute zero.

Do you have a plan for making conformers efficiently composable? Because it likely needs to be a central design factor. I doubt it will be possible to tack on retroactively and still achieve reasonable performance (if that is possible at all).

After drafting the above, I decided to look at the PatternKit that you linked. It seems to attempt to sidestep the problem by not allowing sequences of n patterns, only concatenations of pairs. I guess you can still achieve ConcatenatedExpression<String, ConcatenatedExpression<String, ConcatenatedExpression<String, ConcatenatedExpression<String, ConcatenatedExpression<String, ConcatenatedExpression<String, String>>>>>> and so on. I am curious to hear from anyone with experience using that package. Does that make call sites cumbersome, or does it go mostly unnoticed to clients? And what does it do to binary size? Does the compiler specialize all of them and result in an explosion in the number of compiled types? I also notice very little in the package is marked @inlinable. How does it feel performance‐wise when used from another module?

Michael_Ilseman · October 2, 2019, 9:28pm

Could you explain the problems you solve here? What are the semantics of composition? Are you trying to find 100 palindromes that also satisfy a regex, or are you expecting your input to be an alternating sequence of palindromes and something satisfying the regex?

For the former, I think that's what I mean by "validators" (in search of a better name), which I haven't worked through yet. A pass/fail validator would take a slice of a collection and return whether it matches. A type-producing validator would also return a value of a given type for that range in the process of matching.

For example, a tokenizer would be a consumer as well as a token-producing validator.

This is trying to present subtyping while preserving layout equivalence, using object references as a universal layout for the elements. Array requires layout equivalence of its elements, so you have to resort to boxing either through existentials (unavailable due to the presence of associated types) or type erasers (which seems equivalent to what you've done).

Unfortunately, I don't have any good answers. Type-parameter composition (like you mentioned later) is the only alternative I can think of.

My example conformances for PatternKit are for demonstrating how 3rd party libraries could use the feature, not necessarily an endorsement of any approach . I haven't worked with it beyond just this exercise.

This is exactly the sort of issue that opaque result types ameliorates. Unfortunately, you can't currently specify bound associated types. If you have a strong motivating use case for this, definitely push on it. This prototype has some places it would be useful, but not critical:

extension CharacterSet {
  public func searcher<C: Collection>(
    for: C.Type = C.self
  ) -> CharacterSearcher<C> where C.Element == Character {
    return CharacterSearcher(self)
  }
  public func searcher<C: Collection>(
    for: C.Type = C.self
  ) -> UnicodeScalarSearcher<C> where C.Element == Unicode.Scalar {
    return UnicodeScalarSearcher(self)
  }
}

which means that [Character|UnicodeScalar]Searcher have to be public types.

Without extensive annotation and tuning, I would expect the performance to be catastrophic. This is not the fault of the package author, that's just how generic composition + separate compilation falls out.

@ctxppc , any thoughts?

SDGGiesbrecht · October 2, 2019, 10:10pm

I had the second variant in mind, but it doesn’t really matter either way. By “composition”, I mean creating a new conformer by putting several other conformers together in some way. Any of these are tricky to do in a performant way:

A pattern that represents a heterogeneous group of n sub‐patterns concatenated together.
A pattern that represents a heterogeneous group of n sub‐patterns that are alternate possibilities.
A pattern that represents a heterogeneous group of n sub‐patterns that must all be true of the same match at once.
Anything combining the above by using each other as sub‐patterns.

It is easy to design an API that can do that—and do it fast—for a fixed target collection (i.e. for String instead of Collection). It is also easy to design an API that can do that for a generic collection when speed is unimportant. But getting generics to work without severe performance sacrifices is extremely hard—and maybe even impossible without new language features and/or new or better optimizations available to the compiler.

The fact that the closest I have gotten thus far requires conformers to inherit from a common base class demonstrates that such functionality cannot not be feasibly added on from the outside.

Edit: The original posts may have sounded confusing because for some reason I repeatedly wrote homogenous when I meant heterogeneous. I must have been tired that day. I went back and fixed them so they actually make sense.

SDGGiesbrecht · October 17, 2019, 6:21am

Woohoo! Knowing the pairwise, specializable style of PatternKit must logically be faster than the type‐erased N‐ary stuff I had, I went back and looked over all the known call sites to see if they would be compatible with the opposite style, and how awkward the switch would be. That is when I realized that everything falls into one of the two fast categories. Either:

The call site has fixed number of components at compile time. In this case pairwise types can easily be strung together with operators instead of listing n‐aries with array literals. The compiler can infer the fixed type and specialize everything. (At least with @inlinable in play.)
The only way not to have a fixed number of components at compile time is in the context of looping or mapping or the like. And all of those operations require a consistent type on their inputs, outputs and accumulators anyway. So wherever N‐ary types are actually needed, the lists must be homogeneous anyway. In this case the list can be stored in a strongly‐typed array, and without superclasses or protocol existentials, there is no dynamic dispatch and the compiler can again specialize everything.

	homogeneous	heterogeneous
fixed‐length	fast (either)	fast (nested, tuple‐like storage)
n‐ary	fast (array storage)	[impossible anyway]

By abandoning the idea of n‐ary heterogeneous patterns, and instead providing two separate pattern types with tuple‐like and array‐like functionality, everything can be fast. (And in the event I have overlooked something, there is still a type‐erasing wrapper that can force anything into the N‐ary variants, at the expense of slowness again.)

With so much code now running multiple times faster, I’m ecstatic! Thanks for sparking the idea, @Michael_Ilseman, @ctxppc.