String Consumption
Hi all, I just want to give a quick update on some directions discussed last year. ABI stability drew a lot of focus and attention, and I want to make sure these avenues are not forgotten.
Collection consumers
Processing a string, or any Collection really, typically involves tracking and advancing a position (i.e. an Index). This makes something akin to Slice a great basis for adding consuming operations, which advance indices from the front (or from the back, if the collection is bidirectional).
There are at least 3 ways to surface this:
- A new wrapper type
- Add these to Slice
- Add these to all Collections that are their own SubSequence
Here’s an example of approach 3, which would put this functionality in a lot of namespaces (Slice, Substring, even Data). Having these broadly available could help discoverability, but we also don’t want to pollute namespaces. These are all “mutating” because they advance indices, but not because they modify the underlying collection. We need to see some usage to evaluate the tradeoffs.
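For instance, here is a minimal sketch of what such consuming operations might look like, assuming extensions on collections that are their own SubSequence; the names `eat` and `tryEat` are illustrative, not a settled API:

extension Collection where SubSequence == Self, Element: Equatable {
    // Consume and return the first element, advancing past it;
    // return nil if the collection is empty.
    mutating func eat() -> Element? {
        guard let head = first else { return nil }
        self = dropFirst()
        return head
    }

    // Consume the given prefix if present; return whether it was consumed.
    mutating func tryEat<Prefix: Sequence>(_ prefix: Prefix) -> Bool
        where Prefix.Element == Element {
        var i = startIndex
        for element in prefix {
            guard i < endIndex, self[i] == element else { return false }
            formIndex(after: &i)
        }
        self = self[i...]
        return true
    }
}

var line: Substring = "let x = 42"
if line.tryEat("let ") {
    print(line) // line is now "x = 42"
}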
This was discussed previously here.
Regex
I wrote more about regexes here (which has better formatting than the original mailing-list post predating the forums). Regexes did not make it in Swift 5, but we should still pursue them.
Quick recap
1. Generalize pattern-matching through something like:
protocol Pattern {
    associatedtype In
    associatedtype Out
    func match(_: In) -> Out?
}
Since that post from last year, I think we will probably want to model partial matches from the front, meaning that the result of match could also return the index up to which it matched.
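As a rough illustration of that idea (the protocol shape and names below are assumptions, not the proposed design), a prefix-matching variant could hand back both the value and the index it consumed up to:

protocol PrefixPattern {
    associatedtype In: Collection
    associatedtype Out
    // Return the matched value and the index just past the match,
    // or nil if the input does not match from the front.
    func matchPrefix(_ input: In) -> (Out, In.Index)?
}

// Example conformance: match a run of ASCII digits at the front of a Substring.
struct IntegerPattern: PrefixPattern {
    func matchPrefix(_ input: Substring) -> (Int, Substring.Index)? {
        let digits = input.prefix(while: { $0.isASCII && $0.isNumber })
        guard let value = Int(digits) else { return nil }
        return (value, digits.endIndex)
    }
}

let input: Substring = "42 apples"
if let (value, rest) = IntegerPattern().matchPrefix(input) {
    print(value, input[rest...]) // value is 42; the remainder is " apples"
}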
2. Introduce syntax for a new kind of value-binding-pattern:
let myValue: T = // ... something of type T
let pattern: MyTPattern<(U,U)> = // ... something that matches a T to (U, U)
let otherPattern: OtherTPattern<V> = // ... something that matches a T to V
switch myValue {
case (let a: Int, let b: Int) <- pattern: // ... tries `pattern`, then tries default U->Int pattern
case let pair <- pattern: // ... tries `pattern`, pair has type (U,U)
case let d: Double <- otherPattern: // ... tries `otherPattern`, then tries default V->Double pattern
case let num <- otherPattern: // ... tries `otherPattern`, num has type V
}
3. Regexes are patterns:
struct Regex<T> { /* ... */ }
extension Regex: Pattern {
    typealias In = Substring
    typealias Out = T
    func match(_ s: In) -> Out? { /* ... */ }
}
4. Language integration
There was straw-man syntax for language integration and a lot of discussion about the feature set in the original post. Raw string literals could let us prototype a library-driven approach at first, and custom string interpolations could even let us prototype interpolating sub-expressions. But we will definitely want to integrate with the language, picking up features like let-bindings for named captures.
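To make the prototyping idea concrete, here is one hedged sketch of a library-only Regex<T> backed by NSRegularExpression and fed raw string literals. The backing store, capture handling, and every name and signature here are my assumptions for illustration, not a proposed API:

import Foundation

struct Regex<T> {
    let pattern: String
    let transform: ([Substring]) -> T?

    func match(_ input: Substring) -> T? {
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
        let string = String(input)
        let range = NSRange(string.startIndex..., in: string)
        guard let result = regex.firstMatch(in: string, range: range) else { return nil }
        // Collect the capture groups (ranges 1..<numberOfRanges) as Substrings.
        let captures = (1..<result.numberOfRanges).compactMap { i -> Substring? in
            Range(result.range(at: i), in: string).map { string[$0] }
        }
        return transform(captures)
    }
}

// Raw string literals keep the pattern readable: #"(\d+)\.(\d+)"# instead of "(\\d+)\\.(\\d+)".
let version = Regex<(major: Int, minor: Int)>(pattern: #"(\d+)\.(\d+)"#) { captures in
    guard captures.count == 2,
          let major = Int(captures[0]),
          let minor = Int(captures[1]) else { return nil }
    return (major: major, minor: minor)
}

if let v = version.match("4.2") {
    print(v.major, v.minor) // 4 2
}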
Some guidance
The two greatest strengths of regexes are their ability to express complex operations succinctly and their broad familiarity among developers. When it comes to integrating regexes into Swift, these two strengths need to be balanced against each other.
The ability to succinctly express complex operations can quickly turn into a hazard, where what the regex actually does, and how costly it is, diverges from what the expression appears to say. Regexes can scale poorly and quickly become unmaintainable. As the old joke goes, “I had a problem, so I wrote a regex. Now I have two problems” (link). To remedy this in Swift, we should leverage key language features that help keep this complexity under control. For example:
- The ability to interpolate subexpressions. This lets us refactor complex regexes (see the sketch after this list).
- The ability to `let`-bind captures. This integrates named capture groups into the code itself.
- The ability to use strong types. Types can have a default pattern.
- The ability to pick the right view. E.g. `.` means a Character on String, a Unicode.Scalar on UnicodeScalarView, a UInt8 on UTF8View, etc.
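As a tiny illustration of the first point, assuming nothing more than plain strings standing in for patterns, interpolation lets a large pattern be assembled from small, named, testable pieces:

// Hypothetical subpatterns; raw string literals avoid doubled backslashes.
let year  = #"(\d{4})"#
let month = #"(\d{2})"#
let day   = #"(\d{2})"#

// The composite pattern reads like the structure it encodes.
let isoDate = "\(year)-\(month)-\(day)" // (\d{4})-(\d{2})-(\d{2})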
Again, this was discussed in much more detail here. However, this needs to be balanced against the second strength of regexes: pre-existing familiarity. As much as possible, we should stick to standard conventions.
PEGs - When to move off of regexes
Just as important as knowing how to use a regex is knowing when not to use one. Regexes excel at “needle in a haystack” searches and lexical analysis. However, they are a poor tool for discerning and analyzing structure in a string; that is, they make terrible parsers.
Parsing Expression Grammars (PEGs) directly model the execution of a parser; they are more powerful, but less succinct. The sub-expressions in a PEG are simpler and more straightforward than regexes (they’re essentially “ratcheting”, one-pass regexes), but they can be combined in powerful ways to produce parsers. PEGs were alluded to in last year’s regex discussion and are what Perl 6 uses for its grammars.
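As a rough sketch of that flavor (the combinator names below are invented for illustration): each piece consumes a prefix of the input and, once it succeeds, is never revisited, while ordered choice tries alternatives in order and commits to the first success.

struct Parser<Output> {
    // Consume a prefix of the input, returning the output and the remainder,
    // or nil on failure. A successful parse is never backtracked (ratcheting).
    let run: (Substring) -> (Output, Substring)?
}

func literal(_ text: String) -> Parser<Substring> {
    Parser { input in
        guard input.hasPrefix(text) else { return nil }
        return (input.prefix(text.count), input.dropFirst(text.count))
    }
}

// Ordered choice: try alternatives in order, commit to the first that succeeds.
func choice<T>(_ parsers: Parser<T>...) -> Parser<T> {
    Parser { input in
        for parser in parsers {
            if let result = parser.run(input) { return result }
        }
        return nil
    }
}

// Sequencing: run one parser, then the next on whatever remains.
func sequence<A, B>(_ first: Parser<A>, _ second: Parser<B>) -> Parser<(A, B)> {
    Parser { input in
        guard let (a, rest) = first.run(input),
              let (b, remainder) = second.run(rest) else { return nil }
        return ((a, b), remainder)
    }
}

let keyword = choice(literal("let"), literal("var"))
let declaration = sequence(keyword, literal(" x"))
if let (matched, rest) = declaration.run("let x = 1") {
    print(matched, rest) // matched is ("let", " x"); rest is " = 1"
}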
We will want to explore variants of or extensions to PEGs that support left-recursion, error handling, and other affordances. Trying to avoid left-recursion when expressing a grammar complicates the user model and can produce parses that don’t directly follow the true, left-recursive grammar. This is a deep topic, and you can read more about one approach here, which includes links to several papers on the topic.
Text Streams
edit: I forgot to call out @omochimetaru's StringStream and discussion in this thread. Streams are definitely an interesting direction for the standard library in the future. We'll want to see when it makes sense to model resources with move-only structs (unique identity / affine types) vs classes (shared).