Declarative String Processing Overview

Michael_Ilseman · October 1, 2021, 3:35pm

This overview is an early presentation of something we very much intend to ship in the near term, as well as the basis of a larger story to unfold over time. The challenge is to deliver significant and meaningful features while allowing for future extension, particularly in an ABI-stable world. Present too early and people are understandably disappointed when other language priorities take focus. Present too broadly, and there isn't anything concrete enough to talk about, much less deliver.

I'm very happy and excited to discuss the broader topic! That being said, much of the focus in the near-term will be on fleshing out and shipping what is presented here.

This overview is pretty focused on String processing, especially of the regular languages+ kind, but the intention is to support generic parsing, asynchrony, and low-level processing using a shared technological foundation.

It's not clear yet whether Pattern specifically would be generic or if we'd have a family of types unified by a protocol.

This kind of composition is very appealing. Another example I've had in mind is processing the components of FilePath using shell glob-style matching, and dropping down to string matching to process the content of any specific components.
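As a rough illustration of the component-wise idea, here is a minimal sketch that applies a glob to each path component and leaves the content of each component to ordinary string matching. Everything here is an assumption for illustration: `globMatches` and `pathMatches` are hypothetical helpers, and a plain `String` split on "/" stands in for `FilePath` components.

```swift
/// Minimal glob match supporting `*` (any run of characters) and `?` (any one character).
/// Hypothetical helper; not part of any proposed API.
func globMatches(_ pattern: Substring, _ text: Substring) -> Bool {
    guard let p = pattern.first else { return text.isEmpty }
    switch p {
    case "*":
        // `*` either matches nothing, or consumes one character and tries again.
        return globMatches(pattern.dropFirst(), text)
            || (!text.isEmpty && globMatches(pattern, text.dropFirst()))
    case "?":
        return !text.isEmpty && globMatches(pattern.dropFirst(), text.dropFirst())
    default:
        return text.first == p && globMatches(pattern.dropFirst(), text.dropFirst())
    }
}

/// Matches a path component by component: one glob per component.
/// Hypothetical helper; splitting a String stands in for FilePath components.
func pathMatches(_ path: String, componentGlobs: [String]) -> Bool {
    let components = path.split(separator: "/")
    guard components.count == componentGlobs.count else { return false }
    return zip(componentGlobs, components).allSatisfy { glob, component in
        globMatches(Substring(glob), component)
    }
}
```

The outer level works on a collection of components while the inner level works on characters, which is the layering the paragraph above describes.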

On support for library-extensible literals, it's not clear yet whether there would be an ExpressibleByRegexLiteral protocol. Or, if there were, what would be handed off to the library (e.g. a String of the regex or some kind of AST). It is the case that regex literals are understood by the compiler and statically parsed (and compiled, at least to a bytecode).

One advantage would be allowing libraries that wrap other engines (PCRE, ICU, JavaScriptCore, etc.) to take literals. It's unlikely captures would be strongly typed by such engines, so there might be a difference between a literal that surfaces capture information and one that doesn't. One potential disadvantage or area for confusion would be if a library has fundamentally different semantics for a construct (beyond Unicode concerns). For example, quantification in PEGs is possessive / "ratcheting". There would also have to be some mechanism for conformers to clearly communicate what features they support.
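To make the semantic difference concrete, here is a small sketch that hand-codes the pattern `a*a` as a whole-string match under both readings. Both functions are hypothetical and written only for this illustration; neither reflects how any real engine is implemented.

```swift
/// Greedy, backtracking reading of `a*a`: take the longest run of "a" first,
/// then give characters back until the trailing "a" can match at the end.
/// Hypothetical helper for illustration only.
func matchesBacktracking(_ input: String) -> Bool {
    let chars = Array(input)
    let run = chars.prefix(while: { $0 == "a" }).count
    // Try letting `a*` consume `taken` characters, from most to fewest.
    for taken in stride(from: run, through: 0, by: -1) {
        if taken < chars.count, chars[taken] == "a", taken + 1 == chars.count {
            return true
        }
    }
    return false
}

/// Possessive ("ratcheting") reading of `a*a`: `a*` keeps everything it consumed,
/// so the trailing "a" never has anything left to match.
/// Hypothetical helper for illustration only.
func matchesPossessive(_ input: String) -> Bool {
    let chars = Array(input)
    let taken = chars.prefix(while: { $0 == "a" }).count
    return taken < chars.count && chars[taken] == "a" && taken + 1 == chars.count
}
```

With the input "aaa", the backtracking version succeeds and the possessive version fails, even though the pattern text is identical. That is the kind of divergence a shared literal syntax would need to communicate.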

Generalizing further (and growing more speculative and fuzzy in the process), it might make sense to have library-extensible matching literals. Regex-style literals only really make sense for collections whose Element type has a single-Character representation. For example, /123/ matching Array<Int> would match [1, 2, 3] instead of [123].
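A minimal sketch of why that is, assuming each character of the literal maps to exactly one element. The function `elements(ofLiteral:)` is hypothetical and restricted to ASCII digits to keep the example small.

```swift
/// Interprets each character of a literal as one Int element.
/// Hypothetical helper; returns nil if any character is not an ASCII digit.
func elements(ofLiteral literal: String) -> [Int]? {
    var result: [Int] = []
    for ch in literal {
        guard ch.isASCII, let digit = ch.wholeNumberValue else { return nil }
        result.append(digit)
    }
    return result
}

// The literal text "123" becomes three elements, not the single value 123.
let pattern = elements(ofLiteral: "123") ?? []
let matchesDigits = [1, 2, 3, 4].starts(with: pattern)
let matchesWhole = [123].starts(with: pattern)
```

Here `matchesDigits` is true and `matchesWhole` is false, because the literal is consumed one character per element.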

I was very happy to see the (at the time, Perl 6) community embracing PEGs. Using individually less powerful constructs ("tokens") allows you to algebraically compose them into more powerful grammars, resulting in a full-fledged recursive descent parser. I also love the terminology, like "ratchet". I didn't pay attention to a lot of the semantic details, such as how they choose to map regex ambiguity to unambiguous PEGs, and I haven't poked around in their implementation.

The prior incarnation's literal syntax was heavily inspired by Raku. In the end, familiarity (and compatibility, modulo Unicodey stuff) with traditional regex syntax is the killer feature for the literal. For more complex or multi-line constructs we're likely to encourage users to use Patterns rather than add another literal kind to the language. Basically, we'd rather spend our design/"weirdness" budget on Pattern and filling out the big picture than another literal syntax, even if I personally like Raku's syntax better than the traditional one.

As for language integration, this is a big topic that I hope to make steady progress on. One small incremental step in line with this overview and what currently exists in Swift could be Regex-backed (or perhaps even Pattern-backed) enums.

enum CalculatorToken: Regex {
  case wholeNumber = /\d+/
  case identifier = /\w+/
  case symbol = /\p{Math}/
  ...
}
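For comparison, something similar can be approximated today with a String-backed enum and Foundation's NSRegularExpression. This is only a sketch of the idea, not the feature described above: the enum name `StringBackedToken` and the `firstToken(in:)` helper are hypothetical, the raw values are ordinary strings holding patterns, and the `symbol` case is left out to keep the example short.

```swift
import Foundation

// Raw values are plain pattern strings; nothing is checked at compile time.
enum StringBackedToken: String, CaseIterable {
    case wholeNumber = "\\d+"
    case identifier = "\\w+"

    /// Returns the first case (in declaration order) whose pattern matches
    /// at the start of `input`, together with the matched text.
    static func firstToken(in input: String) -> (token: StringBackedToken, text: String)? {
        let fullRange = NSRange(input.startIndex..., in: input)
        for token in allCases {
            guard
                let regex = try? NSRegularExpression(pattern: token.rawValue),
                let match = regex.firstMatch(in: input, options: [.anchored], range: fullRange),
                let range = Range(match.range, in: input)
            else { continue }
            return (token, String(input[range]))
        }
        return nil
    }
}
```

The gap between this and the enum above is that the patterns here are only validated at run time and captures are untyped, which is part of what compiler-understood literals would address.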

We have thought about and discussed extending this further to define grammars using indirect enums, where the enum cases are the produced AST nodes. I'm a little wary of this approach. What I want to avoid, i.e. my disaster scenario, is shipping what amounts to a toy feature in the language that doesn't scale to real problems. This can detract from real improvements and can encourage developers to go down the wrong path initially ("but if you want to write a parser for reals, you should instead ..."). In my experience shipping binary-stable libraries, even basic enum-like constructs shouldn't be modeled as Swift enums if you care about memory layout or extensibility.

From The Big Picture:

We want powerful functionality presented as normal Swift code, not forced into a particular formalism. In academia, the computational complexity class of a formalism is often the most salient point. That's a really nice thing to have and know, but it's usually not even in the top-5 concerns. For example, imagine adding a typo-correction feature to an existing parser for a programming language: surfacing in-scope corrections would be context-sensitive, and furthermore, candidates would be weighted by things such as edit distance or even lexical distance.
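As a small sketch of the weighting part of that example, here is a plain Levenshtein edit distance used to rank in-scope names against a misspelled one. The function names, the `maxDistance` cutoff, and the tie-breaking by name are all assumptions for illustration; a real typo corrector would weigh more signals than this.

```swift
/// Levenshtein edit distance between two strings, compared by Character.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    // previous[j] holds the distance between the first i-1 characters of s
    // and the first j characters of t.
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}

/// Ranks in-scope names by edit distance to the typo, closest first.
/// Hypothetical helper; `maxDistance` is an arbitrary cutoff.
func corrections(for typo: String, inScope names: [String], maxDistance: Int = 2) -> [String] {
    return names
        .map { (name: $0, distance: editDistance(typo, $0)) }
        .filter { $0.distance <= maxDistance }
        .sorted { ($0.distance, $0.name) < ($1.distance, $1.name) }
        .map { $0.name }
}
```

The `names` argument is where context sensitivity comes in: which names are in scope depends on where in the program the typo appears, which a formalism-first design has trouble expressing.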

Vanilla PEGs still tend to produce parse-trees rather than ASTs as language syntax is coerced to fit a PEG. Examples include left-recursive grammars (though there are extensions for this) and operator associativity (though there are also extensions for this).
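The associativity point can be sketched with subtraction over single-digit operands. A right-recursive rule fits a PEG directly but yields a right-leaning tree, while a repetition rule needs a separate left fold to produce the tree the language actually means. The `Expr` type and both parse functions are hypothetical and written only for this illustration.

```swift
indirect enum Expr {
    case number(Int)
    case subtract(Expr, Expr)

    var value: Int {
        switch self {
        case .number(let n): return n
        case .subtract(let lhs, let rhs): return lhs.value - rhs.value
        }
    }
}

/// Right-recursive rule: expr <- digit '-' expr / digit
/// The tree leans right, so "8-3-2" is read as 8-(3-2).
func parseRightRecursive(_ chars: ArraySlice<Character>) -> Expr? {
    guard let first = chars.first, let n = first.wholeNumberValue else { return nil }
    let rest = chars.dropFirst()
    if rest.first == "-", let rhs = parseRightRecursive(rest.dropFirst()) {
        return .subtract(.number(n), rhs)
    }
    return rest.isEmpty ? .number(n) : nil
}

/// Repetition rule: expr <- digit ('-' digit)*
/// The flat list is folded to the left by hand, so "8-3-2" is read as (8-3)-2.
func parseLeftFolded(_ chars: ArraySlice<Character>) -> Expr? {
    guard let first = chars.first, let n = first.wholeNumberValue else { return nil }
    var result = Expr.number(n)
    var rest = chars.dropFirst()
    while rest.first == "-" {
        guard let next = rest.dropFirst().first, let m = next.wholeNumberValue else { return nil }
        result = .subtract(result, .number(m))
        rest = rest.dropFirst(2)
    }
    return rest.isEmpty ? result : nil
}
```

For "8-3-2" the right-recursive version evaluates to 7 and the folded version to 3. The fold is the extra step that turns a grammar-shaped parse tree into the intended AST.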

(Again, the caveat about the difference between near-term and long-term discussions applies here.)

This is exactly the kind of power we want to enable in 3rd party libraries. We haven't figured out all the extension points yet, but our repo has a generic PEG frontend inspired by LPeg.