Pitch: Rule<T: Collection> — A Composable Pattern Recognizer for any Collection
Introduction
I would like to pitch a new standard library type, Rule<T: Collection>, a universal, composable pattern recognizer for any Swift Collection.
A Rule answers one question: "Does this pattern match here, and if so, how much input does it consume?" Everything else — repetition, alternatives, lookahead, capturing — is built by composing rules together using a small set of orthogonal primitives.
Because Rule is generic over Collection, the same combinators work on String, String.UTF8View, [Character], [UInt8], or any other collection type. You define domain-specific matchers for your element type; the library provides the composition.
This pitch covers:
- The generic core: Rule<T>, matchers, combinators, quantifiers, lookahead, and captures.
- Standard matchers for common collection types (e.g. String, [UInt8]).
- The RuleLibrary type for recursive and late-bound grammars.
Motivation
Swift's Regex and RegexBuilder (SE-0350, SE-0351) brought powerful string matching to the language. But regular expressions were designed in an era of simple byte streams and ASCII text. Swift's world is broader than that.
1. They work only on String — and not with all text representations
Regex operates on String, but Swift has many ways to look at text: String.UTF8View, String.UnicodeScalarView, [Character], raw [UInt8] buffers. A compiler working on UTF-8 bytes cannot use Regex. A text processor working on [Character] — where every grapheme cluster, every emoji, every combined character like à is a single element — cannot use Regex either.
This matters because Swift deliberately distinguishes these views. Regex only supports one of them.
2. They cannot express recursive structures
Matching nested parentheses, balanced brackets, or any recursive grammar is outside the regex formalism. This is not a limitation that can be fixed with more features — it is a fundamental boundary of regular languages. Anyone who needs to parse structured data must leave Regex behind entirely and build a separate parser.
3. Extending them requires framework changes
If RegexBuilder does not have the combinator you need, you wait for a future proposal. You cannot plug in your own matching logic. With a closure-based design, any developer can write a Rule { sequence, range in ... } and it composes with every other rule immediately — no framework changes, no proposals, no waiting.
4. The mental model changes between domains
String matching uses Regex. Byte matching uses manual loops. Token matching uses switch statements. The underlying composition primitives are always the same — sequence, choice, repetition, lookahead — but every domain reinvents them from scratch.
5. What is missing is a universal pattern recognizer
Swift has a strong tradition of generic, composable abstractions: Collection, Sequence, Comparable. Pattern matching is a gap. There is no generic answer to the question "does this sequence of elements match this pattern?" — regardless of whether the elements are characters, bytes, tokens, or something else entirely.
Rule fills that gap.
Consider how the same composition works across different collection types:
// Classic string matching — familiar territory
let decimal: Rule<String> = .sequence(
.oneOrMore(.unicode(.decimalDigits)),
.word("."),
.oneOrMore(.unicode(.decimalDigits))
)
// The same pattern on raw UTF-8 — for compilers, network parsers,
// and anywhere you work with bytes instead of String
let decimal: Rule<String.UTF8View> = .sequence(
.oneOrMore(asciiDigit),
.byte(0x2E), // "."
.oneOrMore(asciiDigit)
)
// [Character] — where à, 한, and 🏳️‍🌈 are each ONE element.
// No Unicode scalar complications, no combining character surprises.
let displayName: Rule<[Character]> = .oneOrMore(.firstMatch(letter, digit, emoji, .element(where: { $0 == "_" })))
// And the same primitives work far beyond text —
// recognizing a swipe gesture in a touch event stream
let swipeRight: Rule<[TouchEvent]> = .sequence(
touchDown,
.oneOrMore(movedRight),
touchUp
)
The first three examples solve real pain points in string and text processing that Regex cannot address. The last one shows that the same composition primitives work on any sequence where patterns matter.
Proposed Solution
A new generic type Rule<T: Collection> that provides:
- A universal contract: every rule is a function (T, Range<T.Index>) -> RuleMatch<T>?. Match or fail, nothing more.
- Composable primitives: a small, orthogonal set of combinators that work on any collection type.
- Structural captures: a tree of named ranges that records what matched and where, without mixing in value interpretation.
- User extensibility: if the library does not have what you need, write a Rule { sequence, range in ... } closure and it plugs right into the composition system.
The key insight is that the composition primitives (sequence, choice, repetition, lookahead) are collection-agnostic. Only the leaf matchers are domain-specific. By separating these concerns, one set of combinators serves every collection type.
Comparison with RegexBuilder
| Capability | RegexBuilder | Rule |
|---|---|---|
| Works on String | ✓ | ✓ |
| Works on any Collection | ✗ | ✓ |
| Recursive grammars | ✗ | ✓ (via RuleLibrary) |
| User-defined matchers | Limited | Full (closure-based) |
| Captures | Typed tuples | Named tree (separate phase) |
| Value transformation | Built into captures | Separate phase |
Rule provides a general foundation that covers the same space as RegexBuilder — and extends it to any Collection type, recursive grammars, and user-defined matchers.
Detailed Design
Core Types
/// The result of a successful rule evaluation.
struct RuleMatch<T: Collection> {
let range: Range<T.Index>
var captures: [RuleCapture<T>]
}
/// A named region within the input, forming a tree of nested captures.
struct RuleCapture<T: Collection> {
let name: String
let range: Range<T.Index>
let innerCaptures: [RuleCapture<T>]
}
/// A composable pattern that evaluates a Collection within a given range.
struct Rule<T: Collection> {
init(evaluationBlock: @escaping (T, Range<T.Index>) -> RuleMatch<T>?)
func evaluate(_ sequence: T, in range: Range<T.Index>) -> RuleMatch<T>?
}
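To make the contract concrete, here is a minimal, self-contained sketch of how these three types could fit together. This is illustrative code written for this pitch, not the proposed implementation — it shows the closure-based core plus one leaf matcher layered on top of it:

```swift
// Minimal, self-contained sketch of the proposed core (illustrative only).
struct RuleCapture<T: Collection> {
    let name: String
    let range: Range<T.Index>
    let innerCaptures: [RuleCapture<T>]
}

struct RuleMatch<T: Collection> {
    let range: Range<T.Index>
    var captures: [RuleCapture<T>] = []
}

struct Rule<T: Collection> {
    private let evaluationBlock: (T, Range<T.Index>) -> RuleMatch<T>?

    init(evaluationBlock: @escaping (T, Range<T.Index>) -> RuleMatch<T>?) {
        self.evaluationBlock = evaluationBlock
    }

    func evaluate(_ sequence: T, in range: Range<T.Index>) -> RuleMatch<T>? {
        evaluationBlock(sequence, range)
    }
}

extension Rule {
    /// Matches a single element satisfying a predicate, consuming one position.
    static func element(where predicate: @escaping (T.Element) -> Bool) -> Rule<T> {
        Rule { sequence, range in
            guard range.lowerBound < range.upperBound,
                  predicate(sequence[range.lowerBound]) else { return nil }
            return RuleMatch(range: range.lowerBound..<sequence.index(after: range.lowerBound))
        }
    }
}

let digit = Rule<String>.element(where: { $0.isNumber })
let input = "7a"
let match = digit.evaluate(input, in: input.startIndex..<input.endIndex)
// `match` covers exactly the "7" at the start; evaluating "abc" would return nil.
```

Everything else — combinators, quantifiers, lookahead — can be layered on the same public initializer without touching the core.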
Primitive Set
The entire library is built from a small set of orthogonal primitives:
Matchers — leaf rules that inspect elements:
extension Rule {
/// Always succeeds, consuming nothing.
static func empty() -> Rule<T>
/// Matches any single element.
static func any() -> Rule<T>
/// Matches a single element satisfying a predicate.
static func element(where predicate: @escaping (T.Element) -> Bool) -> Rule<T>
/// Succeeds only at the end of the remaining range.
static func endOfSequence() -> Rule<T>
}
Combinator — sequential composition:
extension Rule {
/// Matches rules consecutively. Fails if any rule fails.
static func sequence(_ rules: Rule<T>...) -> Rule<T>
}
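One way sequence could be implemented purely in terms of the public initializer (a sketch, not the proposed implementation): evaluate each sub-rule at the position where the previous one stopped, and concatenate the captures.

```swift
// Sketch: sequential composition threads a cursor through the sub-rules.
extension Rule {
    static func sequence(_ rules: Rule<T>...) -> Rule<T> {
        Rule { sequence, range in
            var cursor = range.lowerBound
            var captures: [RuleCapture<T>] = []
            for rule in rules {
                // Each sub-rule sees only the not-yet-consumed remainder.
                guard let match = rule.evaluate(sequence, in: cursor..<range.upperBound) else {
                    return nil  // one failure fails the whole sequence
                }
                captures.append(contentsOf: match.captures)
                cursor = match.range.upperBound
            }
            return RuleMatch(range: range.lowerBound..<cursor, captures: captures)
        }
    }
}
```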
Selectors — choosing between alternatives:
extension Rule {
/// Returns the first rule that matches. Order defines priority.
static func firstMatch(_ rules: Rule<T>...) -> Rule<T>
/// Returns the rule that consumes the most input. Order does not matter.
static func longestMatch(_ rules: Rule<T>...) -> Rule<T>
}
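The difference only shows when alternatives overlap. A sketch with the proposed API, matched against the input "integer":

```swift
let keyword    = Rule<String>.word("in")
let identifier = Rule<String>.oneOrMore(.unicode(.letters))

// firstMatch tries alternatives in order: keyword succeeds first,
// consuming just "in".
let a: Rule<String> = .firstMatch(keyword, identifier)

// longestMatch evaluates all alternatives: identifier wins,
// consuming all of "integer".
let b: Rule<String> = .longestMatch(keyword, identifier)
```

A tokenizer would typically reach for longestMatch; a PEG-style grammar with deliberate priorities would use firstMatch.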
Quantifiers — repetition:
extension Rule {
/// Repeats a rule between min and max times (greedy).
static func `repeat`(_ rule: Rule<T>, min: UInt, max: UInt) -> Rule<T>
static func zeroOrOne(_ rule: Rule<T>) -> Rule<T>
static func zeroOrMore(_ rule: Rule<T>) -> Rule<T>
static func oneOrMore(_ rule: Rule<T>) -> Rule<T>
/// Instance property for readability: `sign.optional`
var optional: Rule<T> { get }
}
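A plausible reading of how the convenience quantifiers relate to the general `repeat` (this sketch assumes that max: .max denotes unbounded repetition, which the pitch does not spell out):

```swift
// Possible desugaring of the convenience quantifiers:
// .zeroOrOne(rule)   ≡  .repeat(rule, min: 0, max: 1)
// .zeroOrMore(rule)  ≡  .repeat(rule, min: 0, max: .max)
// .oneOrMore(rule)   ≡  .repeat(rule, min: 1, max: .max)
// rule.optional      ≡  .zeroOrOne(rule)
```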
Lookahead — zero-width assertions:
extension Rule {
/// Succeeds if rule would match, consumes nothing.
static func expecting(_ rule: Rule<T>) -> Rule<T>
/// Succeeds if rule would NOT match, consumes nothing.
static func notExpecting(_ rule: Rule<T>) -> Rule<T>
}
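A sketch of lookahead in use, assuming greedy, non-backtracking quantifiers (backtracking semantics would be settled in detailed review):

```swift
// An integer literal that must not be the start of a floating-point number.
// The lookahead inspects what follows but consumes nothing.
let integer: Rule<String> = .sequence(
    .oneOrMore(.unicode(.decimalDigits)),
    .notExpecting(.word("."))
)
// "42" matches; for "42.5" the digits match but the assertion fails.
```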
Annotation — structural captures:
extension Rule {
/// Wraps the matched range in a named capture.
func capture(as name: String) -> Rule<T>
}
Convenience — repeat with termination:
extension Rule {
/// Repeats body until end would match. Does NOT consume end.
static func `repeat`(_ body: Rule<T>, as name: String? = nil, endBefore end: Rule<T>) -> Rule<T>
/// Repeats body until end matches, then consumes end as well.
static func `repeat`(_ body: Rule<T>, as name: String? = nil, endAfter end: Rule<T>) -> Rule<T>
}
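These read naturally for delimiter-terminated content. A sketch with the proposed API:

```swift
// A C-style block comment: consume everything up to AND including "*/".
let blockComment: Rule<String> = .sequence(
    .word("/*"),
    .repeat(.any(), as: "Body", endAfter: .word("*/"))
)

// endBefore stops in front of the terminator instead — useful when the
// terminator belongs to whatever rule comes next:
let lineBody: Rule<String> = .repeat(.any(), as: "Line", endBefore: .word("\n"))
```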
Indirection: RuleLibrary
For recursive and mutually-recursive grammars, rules need to reference each other before they are defined. RuleLibrary provides late binding with recursion depth tracking:
class RuleLibrary<I: Hashable, T: Collection> {
let maxDepth: Int
private(set) var depth: Int
init(maxDepth: Int = 1_000)
subscript(identifier: I) -> Rule<T>? { get set }
}
extension Rule {
/// Creates a rule that delegates to a named rule in a library.
/// Lookup happens at evaluation time, enabling mutual recursion.
static func rule<I: Hashable>(_ identifier: I, in library: RuleLibrary<I, T>) -> Rule<T>
}
Example — parsing a simplified JSON-like format where values can be numbers, words, or nested { key: value } objects:
let lib = RuleLibrary<String, String>()
let digit = Rule<String>.oneOrMore(.unicode(.decimalDigits))
let word = Rule<String>.oneOrMore(.unicode(.letters))
let gap = Rule<String>.zeroOrMore(.unicode(.whitespaces))
lib["value"] = .firstMatch(
digit.capture(as: "Number"),
word.capture(as: "Word"),
.rule("object", in: lib)
)
lib["pair"] = .sequence(
word.capture(as: "Key"),
gap, .word(":"), gap,
.rule("value", in: lib)
)
lib["object"] = .sequence(
.word("{"), gap,
.rule("pair", in: lib),
.zeroOrMore(.sequence(.word(","), gap, .rule("pair", in: lib))),
gap, .word("}")
).capture(as: "Object")
let input = "{name: Max, age: 30, address: {city: Berlin}}"
if let match = lib["object"]?.evaluate(input, in: input.startIndex..<input.endIndex) {
print(match.debugDescription)
}
// Object [0..<45]
// ├─ Key [1..<5] → "name"
// ├─ Word [7..<10] → "Max"
// ├─ Key [12..<15] → "age"
// ├─ Number [17..<19] → "30"
// ├─ Key [21..<28] → "address"
// ╰─ Object [30..<44]
//    ├─ Key [31..<35] → "city"
//    ╰─ Word [37..<43] → "Berlin"
The recursive "object" rule references itself through the library — "value" can be a "Word", a "Number", or another "object". The capture tree reflects the nesting, making it straightforward to walk the structure and extract values in a separate phase.
String-Specific Matchers
Standard matchers for Rule<String>:
extension Rule where T == String {
/// Matches a character whose Unicode scalar belongs to the given CharacterSet.
static func unicode(_ characterSet: CharacterSet) -> Rule<T>
/// Matches specific Unicode scalars.
static func unicode(_ scalars: UnicodeScalar...) -> Rule<T>
/// Matches Unicode scalars within ranges.
static func unicode(_ ranges: ClosedRange<UnicodeScalar>...) -> Rule<T>
/// Matches an exact string at the current position.
static func word(_ word: String, caseInsensitive: Bool = false) -> Rule<T>
}
Equivalent matchers can be provided for other collection types — String.UTF8View, [Character], [UInt8], and so on. The standard library would ship matchers for the most common types. But the real strength of Rule is that if something is missing, you add it yourself: write a Rule { sequence, range in ... } closure and it composes with every other rule immediately. No proposals, no framework changes — just your code, plugged into the same composition system.
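As an example of that extensibility, a single-byte matcher for [UInt8] could be written directly against the public initializer (a sketch; byte(_:) for [UInt8] is not part of the surface shown above):

```swift
extension Rule where T == [UInt8] {
    /// Matches exactly one byte equal to `value`.
    static func byte(_ value: UInt8) -> Rule<[UInt8]> {
        Rule { bytes, range in
            guard range.lowerBound < range.upperBound,
                  bytes[range.lowerBound] == value else { return nil }
            return RuleMatch(range: range.lowerBound..<bytes.index(after: range.lowerBound),
                             captures: [])
        }
    }
}

// Composes with the generic combinators like any other rule:
let crlf: Rule<[UInt8]> = .sequence(.byte(0x0D), .byte(0x0A))
```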
Three-Phase Architecture
The design follows a strict separation of concerns:
Input  →  Rules  →  Capture Tree  →  Your Code
            ↑             ↑               ↑
         Patterns     Structure       Semantics
Rules decide whether something matches and how much it consumes. They know nothing about what the match means.
Captures record what was matched and where, organized as a tree. They carry no values, no types, no interpretation.
Your code walks the capture tree and decides what to do with it — compute a value, highlight syntax, report errors, or anything else.
This separation is deliberate. The same capture tree can serve different purposes: a syntax highlighter reads the ranges, a compiler reads the values, a linter reads the structure. Mixing these concerns into the rule system would limit reuse.
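For instance, extracting the integers from the JSON-like capture tree shown earlier stays entirely in user code; this sketch walks the declared RuleCapture shape directly:

```swift
// Interpretation lives outside the rule system: walk names and ranges,
// and convert to typed values only here.
func numbers(in capture: RuleCapture<String>, of input: String) -> [Int] {
    var result: [Int] = []
    if capture.name == "Number", let value = Int(input[capture.range]) {
        result.append(value)
    }
    for child in capture.innerCaptures {
        result.append(contentsOf: numbers(in: child, of: input))
    }
    return result
}
```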
Design Decisions
Why captures are name + range only (no value transformers)
I considered adding a transformer closure that converts matched ranges into typed values directly in RuleCapture. The problem is generics: every capture in a tree can produce a different type, leading to type erasure (Any), existential types, or heterogeneous trees. This complexity ripples through every combinator.
The clean alternative: transform in a separate phase. Navigate the capture tree with find() and findAll(), read the ranges, and convert values in your own code. This keeps captures as a simple, serializable data structure.
Why firstMatch and longestMatch instead of a single choice
The original design had a single choice combinator. In practice, the two use cases are fundamentally different:
- firstMatch — order matters, first success wins (priority-based, like PEG).
- longestMatch — order does not matter, longest match wins (greediness-based, like tokenizers).
Separate names make the behavior self-documenting.
Why expecting / notExpecting instead of lookahead / not
"Lookahead" is regex jargon that requires prior knowledge. "Not" is ambiguous. expecting and notExpecting read like English, form a natural pair, and their zero-width behavior is implied by the name.
Why closures instead of a Context generic parameter
I considered Rule<T: Collection, Context> for shared state. This infects every type signature and prevents composition between rules with different context types. Since Rule is closure-backed, you simply capture state in the closure environment when needed.
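For example, a memoizing wrapper can keep its cache in captured state. This is a sketch only (keyed by start position alone, which assumes the rule is always evaluated against the same upper bound):

```swift
// Shared mutable state lives in the closure environment — no Context parameter.
func memoized(_ rule: Rule<String>) -> Rule<String> {
    var cache: [String.Index: RuleMatch<String>?] = [:]
    return Rule { sequence, range in
        if let cached = cache[range.lowerBound] { return cached }
        let result = rule.evaluate(sequence, in: range)
        cache[range.lowerBound] = result
        return result
    }
}
```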
Source Compatibility
This is purely additive — new types and extensions. No existing APIs are changed.
Alternatives Considered
Extend RegexBuilder to support generic collections
This would require fundamental changes to the Regex type, which is deeply tied to string processing and the regex formalism. The recursive grammar support alone would change the computational model from regular to context-free.
Use Swift macros for grammar definitions
Macros could provide DSL syntax, but the core composition primitives are still needed. Macros could be a future addition on top of Rule, not a replacement.
Result builders for rule composition
Result builders could eliminate commas in sequences and enable if/else syntax. However, Rule's existing API is already compact:
// With result builders
Rule {
oneOrMore(unicode(.decimalDigits))
word(".")
oneOrMore(unicode(.decimalDigits))
}
// Current API — barely different
.sequence(
.oneOrMore(.unicode(.decimalDigits)),
.word("."),
.oneOrMore(.unicode(.decimalDigits))
)
The practical gain is small — a few dots and commas. But there is also a downside: with a result builder, the composition semantics become implicit. Is the block a sequence? A firstMatch? A longestMatch? With the current API, the intent is always visible at the call site. Removing that explicitness runs counter to Swift's emphasis on clarity at the point of use.
Future Directions
- Standard matchers for additional collection types — String.UTF8View, [Character], [UInt8], and others. The generic core is ready; the matchers are additive and can be introduced incrementally.
- Frequently needed, reusable rules — patterns like caching/memoization are already possible today through closure capture into external data structures (following the same principle as RuleLibrary). If specific patterns prove to be widely needed, they could be added to the library as composable rules — always as new rules, never as new parameters on the core type.
Acknowledgments
This design draws on ideas from PEG (Parsing Expression Grammars), parser combinators (particularly their Haskell and Scala traditions), and the design philosophy of Swift's own RegexBuilder.