String Consumption

Hi all, I just want to give a quick update on some directions discussed last year. ABI stability drew a lot of focus and attention, and I want to make sure these avenues are not forgotten.

Collection consumers

Processing a string, or any Collection really, typically involves tracking and advancing a position (i.e. an Index). This makes something akin to Slice a great basis to add consuming operations, which advance indices from the front (or back if bidirectional).

There are at least 3 ways to surface this:

  1. A new wrapper type
  2. Add these to Slice
  3. Add these to all Collections that are their own SubSequence

Here’s an example of approach 3, which would put this functionality in a lot of namespaces (Slice, Substring, even Data). Having these broadly available could help discoverability, but we also don’t want to pollute namespaces. These are all “mutating” because they advance indices, but not because they modify the underlying collection. We need to see some usage to evaluate the tradeoffs.
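As a rough sketch of what approach 3 might look like (method names like `eat` and `peek` are placeholders here, not a settled API):

```swift
// A sketch of approach 3: consuming operations on any Collection that is
// its own SubSequence (Substring, Slice, Data, ...). The mutating methods
// reassign `self` to a narrower slice; the underlying storage is untouched.
extension Collection where SubSequence == Self {
    /// Returns the next element without consuming it.
    func peek() -> Element? {
        return first
    }

    /// Consumes and returns the next element, or nil if empty.
    mutating func eat() -> Element? {
        guard let element = first else { return nil }
        self = dropFirst()
        return element
    }

    /// Consumes the given prefix and returns true, or consumes nothing
    /// and returns false if the collection doesn't start with it.
    mutating func eat<C: Collection>(prefix: C) -> Bool
        where C.Element == Element, Element: Equatable {
        guard starts(with: prefix) else { return false }
        self = dropFirst(prefix.count)
        return true
    }
}

var line: Substring = "let x = 1"
assert(line.eat(prefix: "let "))   // consumes "let "
assert(line.peek() == "x")         // looks ahead without consuming
assert(line.eat() == "x")          // consumes "x"
```

Because `SubSequence == Self`, `dropFirst()` returns something assignable back to `self`, which is what makes the "mutating but non-modifying" behavior described above possible.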

This was discussed previously here.


I wrote more about regexes here (which has better formatting than the original mailing-list post predating the forums). Regexes did not make it into Swift 5, but we should still pursue them.

Quick recap

1. Generalize pattern-matching through something like:
protocol Pattern {
  associatedtype In
  associatedtype Out
  func match(_: In) -> Out?
}

Since that post from last year, I think we will probably want to model partial matches from the front, meaning that match could also return the index up to which it matched.
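Purely as a sketch of that direction (the protocol and member names below are made up, not a proposed design):

```swift
// Sketch only: a pattern that may match just a prefix of its input,
// reporting both the value and how far it matched. All names here
// (PrefixPattern, matchPrefix, Literal) are hypothetical.
protocol PrefixPattern {
    associatedtype In: Collection
    associatedtype Out
    /// On success, returns the produced value and the index just past
    /// the matched prefix; nil if there is no match at the front.
    func matchPrefix(_ input: In) -> (value: Out, upperBound: In.Index)?
}

/// A trivial conformance: matches a literal string at the front of a Substring.
struct Literal: PrefixPattern {
    let text: String
    func matchPrefix(_ input: Substring) -> (value: Substring, upperBound: Substring.Index)? {
        guard input.starts(with: text) else { return nil }
        let end = input.index(input.startIndex, offsetBy: text.count)
        return (input[input.startIndex..<end], end)
    }
}

let input: Substring = "foobar"
if let match = Literal(text: "foo").matchPrefix(input) {
    assert(match.value == "foo")
    assert(input[match.upperBound...] == "bar")   // the unconsumed remainder
}
```

Returning the `upperBound` is what lets a caller chain patterns front-to-back over the same input, in the same spirit as the consuming operations above.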

2. Introduce syntax for a new kind of value-binding-pattern:
let myValue: T = // ... something of type T
let pattern: MyTPattern<(U,U)> = // ... something that matches a T to (U, U)
let otherPattern: OtherTPattern<V> = // ... something that matches a T to V
switch myValue {
  case (let a: Int, let b: Int) <- pattern: // ... tries `pattern`, then tries default U->Int pattern
  case let pair <- pattern: // ... tries `pattern`, pair has type (U,U)
  case let d: Double <- otherPattern: // ... tries `otherPattern`, then tries default V->Double pattern
  case let num <- otherPattern: // ... tries `otherPattern`, num has type V
}
3. Regexes are patterns:
struct Regex<T> { /* ... */ }
extension Regex: Pattern {
  typealias In = Substring
  typealias Out = T
  func match(_ s: In) -> Out? { /* ... */ }
}
4. Language integration

There was straw man syntax for language integration and a lot of discussion about feature set in the original post. Raw-string literals could allow us to prototype a library-driven approach at first, and custom string interpolations could even give us the ability to prototype interpolating sub-expressions. But we will definitely want to integrate with the language, picking up features like let-bindings for named captures.

Some guidance

The two greatest strengths of regexes are their ability to express complex operations succinctly and their broad familiarity among developers. When it comes to integrating regexes into Swift, these two strengths need to be balanced against each other.

The ability to succinctly express complex operations can quickly turn into a hazard, where the actual execution and complexity diverge from what the expression suggests. Regexes can scale poorly and quickly become unmaintainable. As the old joke goes, “I had a problem so I wrote a regex. Now I have two problems” (link). To remedy this in Swift, we should leverage key language features that help keep this complexity under control. For example:

  1. The ability to interpolate subexpressions. This lets us refactor complex regexes.
  2. The ability to let-bind captures. This integrates named capture groups into the code itself.
  3. The ability to use strong types. Types can have a default pattern.
  4. The ability to pick the right view. E.g. `.` matches a Character on String, a Unicode.Scalar on UnicodeScalarView, a UInt8 on UTF8View, etc.
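A small illustration of point 4: the same string has a different element count (and thus a different natural meaning for `.`) in each view.

```swift
// The same string seen through different views; a "." in a regex would
// naturally mean a different element in each one.
let flag = "🇺🇸"
assert(flag.count == 1)                 // String: one Character (grapheme cluster)
assert(flag.unicodeScalars.count == 2)  // two regional-indicator scalars
assert(flag.utf8.count == 8)            // each scalar encodes to 4 UTF-8 bytes
```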

Again, this was discussed in much more detail here. However, this needs to be balanced against the second strength of regexes: pre-existing familiarity. As much as possible, we should stick to standard conventions.

PEGs - When to move off of regexes

Just as important as knowing how to use a regex is knowing when not to use one. Regexes excel at “needle in a haystack” search and lexical analysis. However, they are a poor tool for discerning and analyzing structure in a string; that is, they make terrible parsers.

Parsing Expression Grammars (PEGs) directly model the execution of a parser; they are more powerful than regexes, but less succinct. The sub-expressions in a PEG are simpler and more straightforward than regexes (they’re essentially “ratcheting”, one-pass regexes), but they can be combined in powerful ways to produce parsers. PEGs were alluded to in last year’s regex discussion and are what Perl 6 uses for its grammars.
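To make the “ratcheting” point concrete, here is a toy ordered-choice combinator (all names below are invented for this sketch): once an alternative succeeds, the choice commits and later failures never backtrack into it.

```swift
// A toy ordered-choice combinator to illustrate PEG "ratcheting";
// `Parser`, `orElse`, and `literal` are invented names.
struct Parser<Out> {
    let run: (inout Substring) -> Out?

    /// PEG-style ordered choice: try `self`; only if it fails (with the
    /// input restored), try `other`. A success commits the choice.
    func orElse(_ other: Parser) -> Parser {
        Parser { input in
            let saved = input
            if let value = self.run(&input) { return value }
            input = saved
            return other.run(&input)
        }
    }
}

/// Matches (and consumes) a literal at the front of the input.
func literal(_ s: String) -> Parser<String> {
    Parser { input in
        guard input.hasPrefix(s) else { return nil }
        input.removeFirst(s.count)
        return s
    }
}

var input: Substring = "structX"
// Ordered choice is not symmetric: "str" wins here because it is tried
// first and the choice ratchets, even though "struct" would also match.
assert(literal("str").orElse(literal("struct")).run(&input) == "str")
```

This sensitivity to ordering is exactly why PEG sub-expressions are simpler to reason about locally (no backtracking across a committed choice), while whole grammars require more care than the equivalent regex alternation.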

We will want to explore variants or extensions to PEGs that support left-recursion, error handling, and other affordances. Trying to avoid left-recursion when expressing a grammar complicates the user model and can produce results that don’t directly follow the true grammar with left-recursion. This is a deep topic, and you can read more about one approach here, which includes links to several papers on the topic.

Text Streams

edit: I forgot to call out @omochimetaru's StringStream and discussion in this thread. Streams are definitely an interesting direction for the standard library in the future. We'll want to see when it makes sense to model resources with move-only structs (unique identity / affine types) vs classes (shared).


I don’t have anything of substance to say right now, but I want to thank you for all the work you’re doing to make String better @Michael_Ilseman. I want to thank you as well for writing about both work that has been done and the work that is on the roadmap.


Personally, this is the first thing I add to any collection I need to work with (and there's already an ad-hoc implementation of this on UnsafeBufferPointer hiding in the standard library as part of KeyPath's implementation.) I think it's valuable enough to be worth including in all Collections which are their own SubSequence.


While ByteBuffer has a bunch of specialised functions for this sort of work, it would be very useful to see (3) come to fruition, as we could then consider extending ByteBufferView to be self-slicing to inherit these methods, which would provide a nice symmetry with other containers in Swift. We may also want to consider extending ByteBuffer with some of these methods (though if ByteBufferView has them, it shouldn’t matter).


`eat` and `peek` could be pitched whenever. I wonder if we should do a closure-based (Character) -> Bool seek now, or invest more effort in a Pattern type. Find/replace/split is in a similar boat, and we want a Pattern that's useful for regexes, PEGs, parser combinators, etc.

Rust's str::Pattern seems like a good start. I don't think we'd limit it to Strings, rather have it for all Collections. This would (I think) be distinct from the protocol used for destructuring value-binding-patterns.
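For a sense of shape only (every name below is hypothetical), a collection-generic search pattern in that spirit might look like:

```swift
// Rough shape of a search-oriented pattern, loosely in the spirit of
// Rust's str::Pattern but for any Collection. All names are hypothetical,
// and the search below is a naive O(n*m) scan.
protocol SearchPattern {
    associatedtype Searched: Collection
    /// The range of the first match at or after `start`, or nil.
    func firstRange(in haystack: Searched, from start: Searched.Index) -> Range<Searched.Index>?
}

/// Matches a fixed sequence of elements (e.g. a needle in a String).
struct ElementsPattern<Searched: Collection> where Searched.Element: Equatable {
    let needle: [Searched.Element]
}

extension ElementsPattern: SearchPattern {
    func firstRange(in haystack: Searched, from start: Searched.Index) -> Range<Searched.Index>? {
        var lower = start
        while lower != haystack.endIndex {
            var h = lower
            var n = needle.startIndex
            // Walk the needle against the haystack from `lower`.
            while n != needle.endIndex, h != haystack.endIndex, haystack[h] == needle[n] {
                haystack.formIndex(after: &h)
                n += 1
            }
            if n == needle.endIndex { return lower..<h }
            haystack.formIndex(after: &lower)
        }
        return needle.isEmpty ? start..<start : nil
    }
}

let text: Substring = "hello world"
let range = ElementsPattern<Substring>(needle: Array("lo")).firstRange(in: text, from: text.startIndex)
assert(range != nil && text[range!] == "lo")
```

Returning a `Range<Index>` rather than a bare `Index` keeps it useful for find/replace/split, per the discussion above.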

I would love to see the regex and PEG work get explored, as well as improvements to pattern matching that will inevitably be driven by that work!


In my experience, searching for individual Characters just isn't so interesting in practice; the only advantage over an API that lets you search in terms of String is that you only need to return an Index instead of a Range. Most of the operations you build on top don't end up caring that your delimiter is a single Character (think consumeUntil or three-way split). I'd rather not add the Character APIs for this reason.

Great! While pattern matching and the rest are also welcome, these would be much more ergonomic primitives for people writing their own parsers than raw index manipulation.

Awesome work. There are a few things I've noticed. For some reason, I always end up parsing things in Swift. Unfortunately, I don't have a lot of time to condense my thoughts, but I figured some "real-world" experience could be informative. So here's a bunch of random thoughts / braindump:

There are many different ways to parse things, from parser combinators (very abstract) to writing a parser by hand (not as bad as one might think). Parser combinators are very "functional" in their API, which might not be for everyone. Performance could also be an issue (lots of closures), and if you write combinators with backtracking support, performance can become very unpredictable.

I would love for parser-related APIs to be available on Collection (possibly even Sequence). That means it works on e.g. String, but also Substring (which could be very efficient!), Data or an array of Tokens.

I've recently written a number of parsers that use an extended subset of the eat-like combinators, and overall, it's pretty nice, as it gives you very fine-grained control. However, there are a few issues I noticed, which might be quite specific to what I'm trying to do. I want to write parsers that have decent error reporting. This means that my functions either need to throw or return a Result. The nice thing about throws is that you can write your statements in a natural flow: first you try to parse a [, then a type name, and then a ]. There's no deep nesting (compared to Result). However, one problem is that it's quite tedious to write higher-level combinators. For example, when I parse multiple declarations, my code looks something like:

many(while: { someCondition() }, { try $0.parseDecl() }, separator: { try $0.comma() })

Especially the trys and $0s get very noisy. Here's another (similar) line:

let identifiers = try many(until: ")", { try $0.parseIdentifier() }, separator: { try $0.comma() })

// With parser combinators, this would look more like:

let identifiers = try many(until: ")", parseIdentifier, comma)

The many combinator looks like this (I've only defined it on Substring but it's easy to make more generic):

  mutating func many<A>(until end: (inout Substring) throws -> (),
                        _ f: (inout Substring) throws -> A,
                        separator: (inout Substring) throws -> Bool) throws -> [A] {
      var result: [A] = []
      while (try? end(&self)) == nil {
          result.append(try f(&self))
          if try separator(&self) {
              continue
          } else {
              try end(&self)
              return result
          }
      }
      return result
  }
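For concreteness, here is a self-contained toy in the same style; `expect` and `identifier` below are stand-ins I invented, not the post's actual helpers:

```swift
// A self-contained toy of the throws-based combinator style; `expect`
// and `identifier` are invented stand-ins, not the post's actual API.
struct ToyParseError: Error { let message: String }

extension Substring {
    mutating func expect(_ s: String) throws {
        guard hasPrefix(s) else { throw ToyParseError(message: "expected \(s)") }
        removeFirst(s.count)
    }

    mutating func identifier() throws -> Substring {
        let name = prefix(while: { $0.isLetter })
        guard !name.isEmpty else { throw ToyParseError(message: "expected identifier") }
        removeFirst(name.count)
        return name
    }

    /// Same shape as the `many` combinator in the post above.
    mutating func toyMany<A>(until end: (inout Substring) throws -> (),
                             _ f: (inout Substring) throws -> A,
                             separator: (inout Substring) throws -> Bool) throws -> [A] {
        var result: [A] = []
        while (try? end(&self)) == nil {
            result.append(try f(&self))
            if try separator(&self) { continue }
            try end(&self)
            return result
        }
        return result
    }
}

var list: Substring = "a,b,c)"
let ids = try! list.toyMany(until: { try $0.expect(")") },
                            { try $0.identifier() },
                            separator: { (try? $0.expect(",")) != nil })
assert(ids == ["a", "b", "c"])
```

The `try`/`$0` noise the post complains about is visible at the call site; combinators passed as named functions would read more cleanly, as in the parser-combinator comparison above.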

One thing I found useful in my parsers is to do something like this:

public struct ParseError: Error {
    public var message: String
    public var position: Substring.Index
    var debugFile: StaticString
    var debugLine: UInt

    init(message: String, position: Substring.Index, debugFile: StaticString = #file, debugLine: UInt = #line) {
        self.message = message
        self.position = position
        self.debugFile = debugFile
        self.debugLine = debugLine
    }

    public var localizedDescription: String {
        return "Parse Error: \(self)"
    }
}

extension Substring {
    func err(_ message: String, position: Index? = nil, file: StaticString = #file, line: UInt = #line) -> ParseError {
        return ParseError(message: message, position: position ?? startIndex, debugFile: file, debugLine: line)
    }
}

That means I can easily write an err("expected identifier"). Because my parsers generally operate on Substring, the startIndex is the position in the source string.

One challenge I see is coming up with a reasonable set of methods. There are a lot of useful ones, and the question is: how do we design a minimal API that's still very powerful? I wonder if writing a library first (not standard lib) could help there. Once it's adopted and we know more about how people use it, we could move part of that into the standard library.

If it's helpful, here's a not-cleaned-up version of the methods I currently use:

And maybe also informative (unfortunately, I can't share the entire parser) is an example parsing method (on Substring):

    // We call this after we've seen a `struct` keyword
    mutating func parseStructBody() throws -> ExprWithPosition {
        let pos = startIndex
        try eatSpace(required: true)
        let (name, _) = try parseIdentifier()
        try eatSpaceOrComment()
        let body: [(Identifier, Identifier)] = try curly {
            try $0.while2_({ $0.peek() != "}" }, do: { try $0.parseStructProperty() })
        }
        let items = try Dictionary(body, uniquingKeysWith: {
            throw err("Duplicate property name \($1)")
        })
        try eatSpaceOrComment()
        try eat(expecting: "in")
        try eatSpaceOrComment()
        let cont = try parseExpr()
        return ExprWithPosition(Expr.structDecl(name: name, body: items, cont: cont), pos)
    }

Anecdotally, in the few languages I've worked on, I've almost always come to regret throwing while parsing. Most of the time you want to skip the invalid tokens to recover, and if every one of your consumption functions can throw an error, it makes it more difficult to do that. For example, if you miss the in keyword in your example, your parser will just stop executing and you won't be able to recover and continue diagnosing errors in the rest of the file.


Same, I've always tried, then shied away from, error-throwing for parse errors. I typically steal @Joe_Groff's idea and use error tokens/nodes, and perhaps set a "error happened" bit somewhere in my parser context.
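A minimal sketch of that error-token style (the enum and function below are illustrative only): instead of throwing, the tokenizer records the problem and keeps going, so later input still gets diagnosed.

```swift
// Sketch of the error-token idea: emit an `.error` token and recover,
// rather than aborting the whole parse. The shape is illustrative only.
enum Token: Equatable {
    case identifier(String)
    case number(Int)
    case error(String)   // carries a diagnostic; parsing continues past it
}

func tokenize(_ input: String) -> [Token] {
    var tokens: [Token] = []
    for word in input.split(separator: " ") {
        if let n = Int(word) {
            tokens.append(.number(n))
        } else if word.allSatisfy({ $0.isLetter }) {
            tokens.append(.identifier(String(word)))
        } else {
            // Recovery point: record the problem and move on.
            tokens.append(.error("unexpected token: \(word)"))
        }
    }
    return tokens
}

let tokens = tokenize("foo 42 $! bar")
assert(tokens == [.identifier("foo"), .number(42),
                  .error("unexpected token: $!"), .identifier("bar")])
```

Note how `bar` is still tokenized after the bad input, which is exactly what a throwing tokenizer would lose.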


Thanks for the shoutout. To be clear, it wasn't originally my idea; I was trying to describe an architecture that it seems like many compiler developers have independently discovered.


Interesting bit of parallel evolution there, Adobe's fascinating 'Adam' user interface language (Adobe Software Technology Lab: ASL Overview) uses a very similar value poisoning concept. The idea there is that you can do things like have your UI update to show "the calculated value, but greyed out" or something to indicate that it was out of range.


Not throwing errors but having error recovery makes a lot of sense. I completely agree that it's often (almost always?) preferable.

In the library that I wrote, the many combinator returns an array of things. I wonder if it could also be useful for it to be lazy somehow, though that could lead to a type explosion (similar to Sequence's .lazy).

For what it’s worth, and I’m just throwing it out there for ideas: I developed a pattern-matching library a few years ago. It’s generic over all Sequences with Equatable elements, and pattern types (like quantifiers, literals, and capture groups) are simply types conforming to a common protocol, Pattern.

I’m excited for first-class pattern-matching in Swift though, especially for directly binding matched subsequences/substrings!


I like it! When we consider an equivalent to Rust's str::Pattern, it would be cool to use this as a major use-case.