Declarative String Processing Overview

Sure seems reasonable to design it that way. We allow users to shadow standard library operators; this would just be more of the same. All existing code would continue to work.

This overview is an early presentation of something we very much intend to ship in the near term, as well as the basis of a larger story to unfold over time. The challenge is to deliver significant and meaningful features while allowing for future extension, particularly in an ABI-stable world. Present too early (:sweat_smile:) and people are understandably disappointed when other language priorities take focus. Present too broadly (:sweat_smile:) and there isn't anything concrete enough to talk about, much less deliver.

I'm very happy and excited to discuss the broader topic! That being said, much of the focus in the near-term will be on fleshing out and shipping what is presented here.

This overview is pretty focused on String processing, especially of the regular languages+ kind, but the intention is to support generic parsing, asynchrony, and low-level processing using a shared technological foundation.

It's not clear yet whether Pattern specifically would be generic or if we'd have a family of types unified by a protocol.

This kind of composition is very appealing. Another example I've had in mind is processing the components of FilePath using shell glob-style matching, and dropping down to string matching to process the content of any specific components.
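
As a rough present-day sketch of that kind of composition (assuming the swift-system FilePath API and POSIX fnmatch for the glob step; the eventual Pattern-based design would presumably look quite different):

import SystemPackage   // swift-system's FilePath (import System on Apple platforms)
import Darwin          // fnmatch; use Glibc on Linux

// Glob-match individual path components, then drop down to ordinary string
// processing on the text of any component that matched.
let path: FilePath = "/var/log/app/startup.log"
let logComponents = path.components.filter { component in
    fnmatch("*.log", component.string, 0) == 0
}
for component in logComponents {
    // plain String processing from here on
    print(component.string.dropLast(4))   // "startup"
}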

On support for library-extensible literals, it's not clear yet whether there would be an ExpressibleByRegexLiteral protocol. Or, if there was, what would be handed off to the library (e.g. a String of the regex or some kind of AST). It is the case that regex literals are understood by the compiler and statically parsed (and compiled, at least to a bytecode).

One advantage would be allowing libraries that wrap other engines (PCRE, ICU, JavaScriptCore, etc.) to take literals. It's unlikely captures would be strongly typed by such engines, so there might be a difference between a literal that surfaces capture information and one that doesn't. One potential disadvantage, or area for confusion, would be if a library has fundamentally different semantics for a construct (beyond Unicode concerns). For example, quantification in PEGs is possessive / "ratcheting". There would also have to be some mechanism for conformers to clearly communicate what features they support.
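
Purely as a hypothetical illustration -- the protocol name and its requirement below are assumptions, not a settled design -- a library wrapping an existing engine might conform along these lines, receiving the raw pattern text and surfacing only untyped captures:

import Foundation

// Hypothetical protocol -- whether the payload is a String or some AST is an open question.
protocol ExpressibleByRegexLiteral {
    init(regexLiteralPattern: String)
}

// A wrapper around the ICU-backed NSRegularExpression could then accept /…/ literals.
// Its captures would be untyped, and it would need some way to communicate which
// constructs it actually supports.
struct ICURegex: ExpressibleByRegexLiteral {
    let engine: NSRegularExpression
    init(regexLiteralPattern: String) {
        // Assumes the compiler already validated the literal's syntax; the engine
        // may still reject constructs it doesn't implement.
        engine = try! NSRegularExpression(pattern: regexLiteralPattern)
    }
}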

Generalizing further (and growing more speculative and fuzzy in the process), it might make sense to have library-extensible matching literals. Regex-style literals only really make sense for collections whose Element type has a single-Character representation. For example, /123/ matching Array<Int> would match [1, 2, 3] rather than [123].

I was very happy to see the (at the time, Perl 6) community embracing PEGs. Using individually less powerful constructs ("tokens") allows you to algebraically compose them into more powerful grammars, resulting in a full-fledged recursive descent parser. I also love the terminology, like "ratchet". I didn't pay attention to a lot of the semantic details, such as how they choose to map regex ambiguity to unambiguous PEGs, and I haven't poked around in their implementation.

The prior incarnation's literal syntax was heavily inspired by Raku. In the end, familiarity (and compatibility, modulo Unicodey stuff) with traditional regex syntax is the killer feature for the literal. For more complex or multi-line constructs we're likely to encourage users to use Patterns rather than add another literal kind to the language. Basically, we'd rather spend our design/"weirdness" budget on Pattern and filling out the big picture than on another literal syntax, even if I personally like Raku's syntax better than the traditional one.

As for language integration, this is a big topic that I hope to make steady progress on. One small incremental step, in line with this overview and what currently exists in Swift, could be Regex-backed (or perhaps even Pattern-backed) enums.

enum CalculatorToken: Regex {
  case wholeNumber = /\d+/
  case identifier = /\w+/
  case symbol = /\p{Math}/
  ...
}

We have thought about and discussed extending this further to define grammars using indirect enums, where the enum cases are the produced AST nodes. I'm a little wary of this approach. What I want to avoid, i.e. my disaster scenario, is shipping what amounts to a toy feature in the language that doesn't scale to real problems. This can detract from real improvements and can encourage developers to go down the wrong path initially ("but if you want to write a parser for reals, you should instead ..."). In my experience shipping binary-stable libraries, even basic enum-like constructs shouldn't be modeled as Swift enums if you care about memory layout or extensibility.
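
For concreteness, here's the kind of enum-like construct that resilient libraries tend to reach for instead: a struct with static members, which can gain new "cases" without breaking layout or exhaustive switches. This is a generic sketch of mine, not anything from the overview:

// A resilient, extensible alternative to a frozen Swift enum.
public struct TokenKind: Equatable, Hashable {
    private var rawValue: UInt8
    private init(_ rawValue: UInt8) { self.rawValue = rawValue }

    public static let wholeNumber = TokenKind(0)
    public static let identifier  = TokenKind(1)
    public static let symbol      = TokenKind(2)
    // New kinds can be added later without changing the type's size or ABI.
}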

From The Big Picture:

We want powerful functionality presented as normal Swift code, not forced into a particular formalism. In academia, the computational complexity class of a formalism is often the most salient point. That's a really nice thing to have and know, but it's usually not even in the top-5 concerns. For example, imagine adding a typo-correction feature to an existing parser for a programming language: surfacing in-scope corrections would be context-sensitive, and furthermore, candidates would be weighted by things such as edit distance or even lexical distance.

Vanilla PEGs still tend to produce parse-trees rather than ASTs as language syntax is coerced to fit a PEG. Examples include left-recursive grammars (though there are extensions for this) and operator associativity (though there are also extensions for this).

(Again, caveat about the difference between near-term and long-term discussions here).

This is exactly the kind of power we want to enable in 3rd party libraries. We haven't figured out all the extension points yet, but our repo has a generic PEG frontend inspired by LPeg.

9 Likes

My suggestion wasn’t specifically for regex literals, but for any literals. You could do ExpressibleByCustomLiteral, then use extremely restrained requirements to do parsing that runs during compilation.

The interpretation of the literal would be dependent on the type, with no “default” type inference. I imagine you could do something with StaticString as an intermediate form.
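
A minimal sketch of that shape, with every name here hypothetical; today this is just an ordinary runtime initializer, and the compile-time evaluation is the piece that doesn't exist yet:

// Hypothetical protocol: the compiler hands over the raw literal text as a StaticString,
// and (ideally) the initializer could be evaluated during compilation.
protocol ExpressibleByCustomLiteral {
    init?(customLiteral: StaticString)
}

struct Version: ExpressibleByCustomLiteral {
    let major: Int, minor: Int, patch: Int
    init?(customLiteral: StaticString) {
        // Parse "1.2.3"-style text; rejecting malformed literals would ideally be a compile error.
        let parts = customLiteral.description.split(separator: ".").compactMap { Int($0) }
        guard parts.count == 3 else { return nil }
        major = parts[0]; minor = parts[1]; patch = parts[2]
    }
}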

I'm not sold on the argument for the Regexp literals ... if I'm understanding the argument correctly for this kind of regexp design -- it's basically because it's familiar and similar to existing regexp literals from other languages? But why not skip regexp and invest directly in a Parsing Expression Grammar type literal instead?

I'd love a literal that takes inspiration from the ohm language philosophy -- ohm/philosophy.md at master · harc/ohm · GitHub -- over a literal that offers a 'familiar but crappy' experience by regurgitating the Perl-compatible regular expression style ... This is probably a bad place for floating alternative pet ideas ... but if the proposal does end up with regexp literals, I hope PEG-like alternative literal designs are at least mentioned in the alternatives considered section ...

(And just to try and be a little more clear what kind of literal I'm advocating for -- I'll go ahead and put a specific half-baked notion of what a 'better' string processing literal might look like -- based on ohm's peg definition grammar ...)

Summary

(I would love a PEG literal based on ohm's PEG definition grammar -- ohm/syntax-reference.md at master · harc/ohm · GitHub ...). In that system the grammar is defined entirely separately from the semantic actions. To define semantic actions, you define a dictionary of [string: function] matching the names of the rules in the grammar. Swift would easily allow for a type-safe version of this ... (and it seems to me like the runtime needed to implement Pattern would directly support this type of literal as well ...)

// hypothetical ohm-like peg literal

#Grammar HttpHeaderGrammar {
     whitespace := " " // redefine the whitespace used by this grammar -- these characters are skipped when matching 'syntactic' rules, which are the rules that start with uppercase letters in the ohm design
     HttpHeader = httpPreamble HttpFields
     httpPreamble = "HTTP/" httpVersion digit+ newline
     httpVersion = digit "." digit
     HttpFields = (DateField | ContentTypeField | ignoredJunkFields)*
     DateField = "Date: " Date.rfc1123 newline
     ContentTypeField = "Content-Type: " MimeType newline
     ignoredJunkFields = ~(whitespace)+ ":" whitespace ~(newline)* newline
}

// The compiler synthesizes a protocol representing the semantic actions that need to be
// implemented to take action based on the grammar (HttpHeaderGrammar.HttpHeader would mean
// synthesizing the protocol appropriate for using HttpHeader as the initial symbol of the
// grammar). To use it, adopt the protocol and implement the semantic actions; if desired,
// reuse the same grammar for multiple purposes. In the ohm design you optionally implement
// one method per rule, and any rule without a matching action gets a default fallback
// implementation of its semantic action. In Swift I think you'd probably want a spelling in
// the grammar literal to opt certain rules out of the default implementation, so that the
// grammar indicates which rules you expect to implement semantic actions for, and the
// compiler-synthesized protocol guides you on which rules need methods...

struct ExtractInfoFromHttpHeaderGrammar: HttpHeaderGrammar.HttpHeader {
        func HttpHeader (httpPreamble: NonTerminalNode, HttpFields: NonTerminalNode) {
            (httpPreamble.eval(), HttpFields.eval())
        }
        
        func httpPreamble (literal1: TerminalNode, _httpVersion: NonTerminalNode, httpCodeDigits: RepeatedNode, newline: TerminalNode) {
            Int(httpCodeDigits.capture)
        }
        
        func DateField (dateLiteral: TerminalNode, dateMatch: NonTerminalNode, newline: TerminalNode) {
            HTTPHeaderField.date(dateMatch.capture)
        }
        
        func ContentTypeField(contentTypeLiteral: TerminalNode, mimeType: NonTerminalNode, newline: TerminalNode) {
            HTTPHeaderField.contentType(mimeType.capture)
        }
}

Example of using a parsing expression grammar rule declaration syntax inside of pattern builder

Group {
    #Grammar { ~(whitespace)+: whitespace ~(newline)* newline }
    Newline()
}

The Pattern Builder proposal looks pretty awesome! I'm just not a big fan of the unreadable regexp literal syntax and design and hope there is at least an argument made in the eventual proposal for why a more powerful literal design was not chosen ...

2 Likes

Could we maybe put a pin in the literals and just focus on Pattern for now? I don’t think anyone would be particularly upset if it initially used string literals (with runtime-checking), or just skipped regex entirely (potentially splitting it into a package).

6 Likes

I want to make sure we are addressing concerns with the choice of / as a delimiter. I don't currently know enough about the parser impact of this or the implications of shadowing, but this is something we'll be figuring out over the next few weeks.

@beccadax or @hamishknight , any current intuition here?

A version literal doesn't sound completely absurd to me, especially if they're useful for more things than just package versions. It's totally conceivable that Swift could gain support for first-class Version structs (I think @lorentey thought about this a while ago) and version literals could support that.

Probably orthogonal to string processing :slight_smile:

Overhauling and unifying literal support is independently good, though I'm not sure if it's strictly a prerequisite for this work. As we discuss the literal, let's keep our eyes open for potential generalizations and whether we are hitting issues with the current system.

I wish I had your confidence!

What exactly is the basis of your "should" here? Should we undo custom string interpolations? Any rationale?

The topic of library-driven compilation is a broad one that I'm very interested in and will definitely need some co-evolution with this effort. Part of that could be library-provided parsers, but I don't know if we need to design that whole subsystem before we can deliver real value to our developers.

I'm somewhat inclined to encourage refactoring into Pattern at that point, but it's possible if there are compelling use cases.

This is a really neat approach. I don't have a ton to say about this direction right now, but I want to keep it in consideration.

A fully custom literal would be one which the compiler essentially says nothing about. That might be independently useful and interesting, but it seems mostly orthogonal to what we're talking about here. We want the compiler to understand the syntactic structure of the regex literal so it can provide features like typed capture groups. It could even be a means for libraries to declare their feature set and evolve over time using availability annotations.

We would want/need a default type for use in generic API, at least for regex literals.

(These are more fundamental "why" questions and questions of resource allocation. I do want to follow up on this, but it deserves a dedicated reply).

Actually, now seems like precisely the time to be having the discussion. I'm open to spin-off threads if the volume here is too large (so far it doesn't seem so), and we will certainly be having separate threads when proposing more specific details.

5 Likes

Custom string interpolations build on top of the existing string literals, no? The only thing they can build on is string literals. In fact, it is impossible to use custom string interpolation without guaranteeing that any string literal could be used.

This is because string literals follow specific rules, and the conforming type cannot add more rules. For many purposes, that is sufficient. For some, it is not. And I don’t think only the standard library should benefit from having more control over literal interpretation.
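
Concretely, ExpressibleByStringInterpolation refines ExpressibleByStringLiteral, so a conforming type must also accept a plain string literal, and its only extension point is the set of appendInterpolation overloads inside the literal's fixed syntax. A small illustrative conformance (my own example, not anything proposed here):

struct SQLFragment: ExpressibleByStringInterpolation {
    var text = ""

    // Required because string interpolation builds on string literals:
    // every conformer must also accept a plain string literal.
    init(stringLiteral value: String) { text = value }
    init(stringInterpolation: Interpolation) { text = stringInterpolation.text }

    struct Interpolation: StringInterpolationProtocol {
        var text = ""
        init(literalCapacity: Int, interpolationCount: Int) {}
        mutating func appendLiteral(_ literal: String) { text += literal }
        // Custom hooks can only add behavior inside \(...); they can't change
        // how the surrounding literal itself is lexed.
        mutating func appendInterpolation(bind value: Int) { text += "?" }
    }
}

let q: SQLFragment = "SELECT * FROM users WHERE id = \(bind: 42)"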

Do we need to support that, though? For example, integer literals default to Int because that is almost always the desired type. Now imagine that the compiler complained about ambiguity instead of choosing a default. It would still work, right?

Tangent

I’m not saying Swift should not make assumptions about what type literals are. That’s generally a reasonable tradeoff to make, even if it means people accidentally use an Int when they meant to use UInt. If the same literal can have a custom meaning, however, no reasonable default exists.

let thirtyTwoInt: Int32 = 6
let sixtyFourInt: Int64 = 8
let integer = 42 // Probably an Int
let foundationDecimal = 3.14 // Double, should have specified Decimal

let foundationURL: URL = 'https://apple.com'
let semver: Version = '1.2.5-beta.4.newFeature+21AF-26D-3'
let stdlibRegex = '([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*' // Who the hell knows?

Consider this: the standard library has one form of regular expression syntax, and a package offers another. A function accepts regular expressions from either, and outlines that with some form of protocol and extension. If you pass a custom literal, and you don’t say which of the possible formats you’re using, the compiler shouldn’t make an assumption. It should throw an error, which could then be trivially resolved using coercion at the call site.
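
Today's literal protocols already behave this way when a non-default interpretation is wanted, which is roughly what that call-site coercion would look like:

// The same literal text, interpreted as three different conforming types.
let asString    = "a"                    // defaults to String
let asCharacter = "a" as Character       // coercion selects a non-default conformer
let asScalar    = "a" as Unicode.Scalar  // ditto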

If you made bespoke literals for everything, you’d run out of reasonable options for syntax fast. Just look at the pushback on using /regex/! Now consider that you could justify such a literal for pretty much any type with an optional initializer accepting a String. Why should Regex get special treatment over WebURL?

To me, this sounds rather familiar. Result builders have a very similar role, no? They aren’t fully custom, but they are quite flexible while retaining a degree of compiler control.

I want something similar, but with literals known at compile time (StaticString). I’m not certain that is possible, but I am confident it is worth finding out.

There are good reasons to have regex expressions checked at compile-time, but they are not even remotely unique ones. If people force-unwrap URLs on a regular basis, they can do the same for regex.
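
For reference, the runtime-checked status quo being alluded to looks something like this (using Foundation's NSRegularExpression as a stand-in engine):

import Foundation

// Both of these are checked at run time; malformed input traps instead of
// being rejected by the compiler.
let url = URL(string: "https://apple.com")!
let regex = try! NSRegularExpression(pattern: #"([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*"#)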

2 Likes

This is what I’d prefer:

let regex: Regex = '([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*'
print(type(of: regex))
// Prints Regex<(Substring, Substring?, Substring)>

For comparison, this already works:

let set: Set = [1, 2, 3]
print(type(of: set))
// Prints Set<Int>

The trick would be devising a way for allowing unspecialized generics to determine their own specialization at compile-time without baking it into the language itself. Hardly trivial, I admit.

1 Like

Let me first off say the pattern builder approach is quite an elegant solution to a really hard problem, and including things like the Foundation formatters really makes it mesh nicely. However, I do worry that the other items in the pattern builder might not be very useful outside of that builder, and may take up valuable names in global name lookup.

Have we considered perhaps allowing result builders to have certain additional functions in their scope by some mechanism? For example, in the case of PatternBuilder, it could specify an "additional scope" to search: a type defined on that result builder (perhaps with a decoration identifying how it interacts with that scope) such that types within that scope are considered to be "inside the scope of the builder".

straw-api syntax:


@resultBuilder
struct PatternBuilder {
  @scope(additional)
  enum Scope {
    struct OneOrMore { ... }
  }
  ...
}

where @scope(additional) indicates that this type is an addition to the global scope. Alternative usages could have @scope(restricted), which restricts the functions/types available in the builder to those within that type's scope. This would allow for a whole lot of additions to that scope (and the ability to extend them, since it is really just a type), as well as functions that would ONLY be callable in that scope or variables that would ONLY be usable in that scope.

I know this perhaps could be seen as scope creep... (pun partially intended). But it seems it would allow a lot of freedom for picking the right names for the pattern-building operations without worrying about sitting on a really good name that already exists or may exist in client code. One prominent case would be Optionally. Instead, that could be defined as a member of the scope (where those members are resolved first, ahead of the standard lookup rules), such that PatternBuilder.Scope.Optional would be resolved when using the term Optional in the PatternBuilder syntax.

6 Likes

Yes! And it's in line with the direction we took for custom string interpolation:

We wouldn't necessarily need extra attributes, just a bound associated type.

2 Likes

It might be good, however, to consider allowing the replacement of namespaces/scopes for other uses. One could envision a use case unrelated to pattern matching that offers a set of specifically picked methods -- ones that avoid heap allocations, or avoid blocking, etc. -- as the only available members of that scope. I'd guess that a replacing scope would need some sort of decorator, unless we have a builder method for that (which would be kinda meta...).

1 Like

I’m not sure I follow here. Are you proposing the ability to have declarations that are only accessible by implementations? A similar thing has often been proposed for protocols.

What classes of grammars will this be able to parse? Will the Pattern be able to parse nested structures?

I believe there would be some level of conflict with any existing custom prefix or postfix / operators if we used / as a delimiter. The problem is that we need to be able to tell while parsing whether or not we're dealing with a regex literal, and I'm not sure we can do some kind of clever delayed parsing, as the literal might span multiple expressions, e.g.

let x = /5; let y = 5/

Should that be parsed as a regex literal or as two let bindings each using a custom / operator?

We might be able to mitigate the source breakage in certain cases; the best I can immediately think of would be to do a pre-parse over all the files in the module looking for the tokens prefix operator / and postfix operator /, and suppressing regex literal parsing in that case. However, that would be pretty gross, and has a bunch of downsides:

  • We'd no longer be able to do a standalone parse of a single Swift file in a module
  • It would suppress regex literals for the entire module; prefix / and postfix / essentially wouldn't be able to co-exist with regex literals using / delimiters (though we could allow additional # delimiters to disambiguate, similar to string literals, allowing e.g. #/regex/#)
  • It wouldn't avoid source breakage for cases where modules import prefix / or postfix / operators (though client modules could manually re-declare e.g. prefix operator / to work around that)
6 Likes

Very exciting proposal! One important thing I wasn't able to find in this thread is a look at the API surface of a Pattern itself? Do structs like OneOrMore and Newline all implement some common protocol? What do I need to do to implement my own Pattern? Apologies if this has been explained already.

4 Likes

I think this is an inherently intractable problem if we want to have a reasonable parsing story. Delimiters must not start or end with anything that is a valid operator head.

This is one of the reasons I’m against any new specialized literal syntax. Parsers shouldn’t need to worry about a DSL literal anymore than they need to worry about a result builder or property wrapper.

1 Like

Excellent point.

While it’s possible to come up with heroics that would allow both uses simultaneously much of the time from the perspective of the compiler, there’s still the issue that human users have to be able to understand what they’re seeing.

If we were starting from scratch, it’s clear we wouldn’t dream of supporting / as both custom prefix or postfix operator and simultaneously regex delimiter. And it’s doubtful users would want to define a custom prefix or postfix operator that clashes with a delimiter in this way.

I can think of two ways forward, then:

  • Either choose another delimiter—of note, we’ve been saving single quotes for a “worthy” literal type for a while
  • Or choose / as the delimiter and deprecate support for / as a character for custom prefix or postfix operators, which is somewhat unprecedented but not likely to impact any appreciable number of users; as a transitory measure, use a “parse both ways” approach to keep things working for existing code until the next Swift version
6 Likes

High level feedback:

Pattern

  • Can a Pattern produce a UNIX-style regex, associated with the version of the Pattern that created it?
  • I would like to check in both the Pattern and the UNIX regex, but in production I would probably only want to use the UNIX regex. I like the flexibility of building a regex using Patterns, but it would be really nice if I could see the "rendered" regex somehow. Perhaps this could be built into Playgrounds or a similar app.

Regular unix style regex.

  • Can we make the init a compile time error somehow? If the regex is not valid, fail during compilation.
  • Can we have some kind of API version for this? For example when creating a UNIX style regex, it would be nice if we could evolve the API to fix bugs in the future. Strawman suggestion below.
    let regex:Regex<V1> = /([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/
    let regex:Regex<V2> = ([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*

How about we use single quote to support a broader set of DSL literals? We can use the character next to the opening/closing quote to determine the literal DSL syntax. In the case of regex, we can use '/ as opening and /' as closing. Then we could have a different DSL literal that uses something like '< and >' as opening and closing tokens.

3 Likes

I think using textual prefixes would make it easier to look up documentation and would also provide a wider range of possible prefixes than symbols (see Scala’s string literal prefixes for an example of this). Not sure if this would cause problems with Swift’s parser, though?

E.g., we could have r'[a-z]+' as a regex literal.

3 Likes