[Pitch] Regex Syntax

xwu · March 6, 2022, 6:03pm

Mmm, I would have the opposite expectation for code that's parsed in Swift (as regex literals would be here): whitespace is generally not significant.

Has that been the case for Raku? I would be reassured if the empirical experience there has been that it's mostly fine, and if not then certainly we should worry about the same.

ksluder · March 6, 2022, 6:06pm

Whitespace is significant within quotation marks even in Swift. In every other language with regular expressions that I have used, they have behaved like quoted strings rather than parenthesis-delimited expressions.

xwu · March 6, 2022, 6:12pm

For me at least, the most exciting part about regex literals as they're proposed for Swift is that they're not going to be in quotation marks and won't behave like quoted strings (the type of these values will reflect the parsed syntax tree, and you'll be able to do tuple destructuring for matches, etc.). And since that's the overall vibe we're going for, I'd expect whitespace not to be significant.

ksluder · March 6, 2022, 6:52pm

As far as I know, delimiters have not yet been decided.

[quote]and won't behave like quoted strings (the type of these values will reflect the parsed syntax tree, and you'll be able to do tuple destructuring for matches, etc.).
[/quote]

A richer representation doesn’t mean they have to have unexpected behavior.

scanon · March 6, 2022, 6:54pm

It's worth keeping in mind that even with a non-string regex literal, we will still want to allow initialization of regexes from strings, using a common interior syntax. If you're writing a text editor, you want your users to be able to provide a regex as a string and use it to perform search and replace operations, for example.
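As a minimal sketch of the text-editor scenario described above, assuming Swift 5.7 or later (where `Regex` can be built from a `String` at run time): the helper name `replaceAll` and its parameter labels are hypothetical and chosen only for illustration.

```swift
// Hypothetical helper: compile a user-supplied pattern at run time and use it
// for a replace-all, as a text editor's search field might.
// Assumes Swift 5.7 or later.
func replaceAll(in text: String, matching pattern: String, with replacement: String) throws -> String {
    // Regex(_:) parses the string at run time and throws on malformed syntax.
    let regex = try Regex(pattern)
    return text.replacing(regex, with: replacement)
}

do {
    let edited = try replaceAll(in: "a1b22c", matching: "[0-9]+", with: "#")
    print(edited) // prints "a#b#c"
} catch {
    // A malformed pattern typed by the user ends up here instead of crashing.
    print("Invalid pattern: \(error)")
}
```

Because the pattern is only known at run time, syntax errors surface as thrown errors rather than compile-time diagnostics, which is why the call site handles them explicitly.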

benlings · March 6, 2022, 9:52pm

How about 2 different literals - one for ‘regex’ that is whitespace sensitive, and another for ‘multi line regex’ that isn’t? I know that delimiters aren’t up for discussion now, but some sort of parallel to "…" and """…""" might make sense? Or maybe it’s more like string vs raw string?

I can see that there are advantages to encouraging people to use non-whitespace-significant regex as maybe the default. But I also think that it could be annoying or bug-inducing to have to rewrite any pre-existing regexes to ‘escape’ the whitespace. So, it would be useful to be able to write both, and possibly not just with a flag at the end.
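For reference, the two behaviours being contrasted can be compared with a run-time regex. This is a minimal sketch, assuming Swift 5.7 or later and the inline `(?x)` extended-syntax flag familiar from traditional regex engines; `fullyMatches` is a hypothetical helper and says nothing about how the literals themselves will end up behaving.

```swift
// Hypothetical helper: does the whole input match the pattern?
// Assumes Swift 5.7 or later.
func fullyMatches(_ pattern: String, _ input: String) throws -> Bool {
    let regex = try Regex(pattern)
    return try regex.wholeMatch(in: input) != nil
}

do {
    // Traditional syntax: the space in the pattern is a literal character.
    print(try fullyMatches("a b", "a b"))      // true
    print(try fullyMatches("a b", "ab"))       // false

    // Extended syntax via the inline (?x) flag: unescaped whitespace is ignored.
    print(try fullyMatches("(?x) a b", "ab"))  // true
} catch {
    print("Invalid pattern: \(error)")
}
```

Pasting an existing whitespace-sensitive pattern such as `a b` into a whitespace-insensitive context would silently change what it matches, which is the rewrite hazard raised above.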

Michael_Ilseman · March 9, 2022, 12:08am

Something else that's interesting is mentioned in the "Introduce a novel syntax" alternative, which is that we're developing an experimental extended syntax for Swift which goes a little further. This is not proposed here and is likely to be future work (it's a lot to bite off at the same time), but the rough and ever-evolving idea is:

1. All ASCII values outside of [A-Za-z0-9] are reserved for metacharacters and should be escaped or quoted for literal treatment.
2. All whitespace is non-semantic unless escaped; # is supported for end-of-line comments (pending delimiter, perhaps // too).
3. Quoted literal content uses double-quotes, so you can say "a.b" instead of \Qa.b\E. These would be Swift string literals eventually supporting interpolation, raw strings, etc.
4. Clearer capture group syntax and defaults: (...) is non-capturing, (_: ...) for unnamed capture, and (name: ...) for named, etc.
5. Support Swift-syntax ranges for ranged quantification, i.e. x{3..<8} for x{3,7}.
6. Use of other now-free delimiters, e.g. <...>, as a way of naming builtins such as character classes and anchors, perhaps also an interpolation syntax or way to refer to in-scope declarations.
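The experimental forms listed above are not implemented, so they cannot be run. The traditional spellings they would replace can be, and the sketch below exercises two of them (`\Qa.b\E` and `x{3,7}`) with a run-time regex, assuming Swift 5.7 or later. `matchesWhole` is a hypothetical helper; the Swift range `3..<8` appears only as ordinary Swift code to illustrate why `x{3..<8}` would denote the same repetition counts as `x{3,7}`.

```swift
// Hypothetical helper: does the whole input match the pattern?
// Assumes Swift 5.7 or later.
func matchesWhole(_ pattern: String, _ input: String) throws -> Bool {
    let regex = try Regex(pattern)
    return try regex.wholeMatch(in: input) != nil
}

do {
    // Traditional quoting: inside \Q...\E the dot is a literal character.
    print(try matchesWhole(#"\Qa.b\E"#, "a.b"))  // true
    print(try matchesWhole(#"\Qa.b\E"#, "axb"))  // false

    // Traditional ranged quantifier x{3,7}: 3 to 7 repetitions, inclusive.
    // The half-open Swift range 3..<8 covers exactly the same counts.
    for count in 0...9 {
        let input = String(repeating: "x", count: count)
        let matched = try matchesWhole("x{3,7}", input)
        precondition(matched == (3..<8).contains(count))
    }
    print("x{3,7} agrees with 3..<8 for counts 0 through 9")
} catch {
    print("Invalid pattern: \(error)")
}
```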

This clearly breaks compatibility with existing regex syntax, so it would need to be clearly delineated and makes sense as future work. There's significant value to allowing things like command-line tools and search fields access to traditional regex syntax, so this wouldn't take the place of what we're proposing.

Another practical reason to consider this future work is that the current effort is pushing the state of the art of the Swift compiler: our regex parser is written as a stand-alone pure-Swift library that gets bundled up and incorporated with the C++ Swift compiler. The Swift compiler yields lexing/parsing state to our library, which then yields back to the Swift compiler after lexing/parsing. A "year 2" of overhauling parts of the Swift lexer could include handling string literals in such a library, making it natural for the regex parser to support embedded Swift string literals. Alternatively, in the nearer term if this is deemed high-value, we could support just basic string literals at first.

If we're debating an extended syntax by default for literals (or one of multiple literals), I'm not sure how much value Perl-style xx gives us. Result builders seem like the better way to separate components across lines with comments. The syntax above, especially points 1-3, gives us a more compelling extended syntax.

rdemarest · April 5, 2022, 3:54am

I think there's an issue in the proposal: it says that HexDigit can take a-zA-Z?

HexDigit   -> [0-9a-zA-Z]

I think you meant

HexDigit   -> [0-9a-fA-F]
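To make the difference concrete, here is a small sketch that reuses the two bracket ranges as ordinary regex character classes, assuming Swift 5.7 or later. In the proposal `HexDigit` is a grammar production rather than a regex, so this only illustrates which characters each range admits; `matchesWhole` is a hypothetical helper.

```swift
// Hypothetical helper: does the whole input match the pattern?
// Assumes Swift 5.7 or later.
func matchesWhole(_ pattern: String, _ input: String) throws -> Bool {
    let regex = try Regex(pattern)
    return try regex.wholeMatch(in: input) != nil
}

do {
    let tooBroad = "[0-9a-zA-Z]+"   // range as originally written
    let hexOnly = "[0-9a-fA-F]+"    // corrected range

    // "1F" is valid hexadecimal and is accepted by both ranges.
    print(try matchesWhole(tooBroad, "1F"))  // true
    print(try matchesWhole(hexOnly, "1F"))   // true

    // "1g" is not hexadecimal, yet the broader range accepts it.
    print(try matchesWhole(tooBroad, "1g"))  // true
    print(try matchesWhole(hexOnly, "1g"))   // false
} catch {
    print("Invalid pattern: \(error)")
}
```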

hamishknight · April 5, 2022, 9:51am

Good catch! Fixed in "Fix HexDigit definition in RegexSyntax.md" by hamishknight (Pull Request #253, apple/swift-experimental-string-processing).

Michael_Ilseman · April 8, 2022, 2:36pm

I have merged an update that pulls in run-time construction and AnyRegexOutput:

github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md


# Regex Syntax and Run-time Construction

* Proposal: [SE-NNNN](NNNN-filename.md)
* Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
* Review Manager: [Ben Cohen](https://github.com/airspeedswift)
* Status: **Awaiting review**
* Implementation: https://github.com/apple/swift-experimental-string-processing
  * Available in nightly toolchain snapshots with `import _StringProcessing`

## Introduction

A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. We propose the ability to create a regex at run time from a string containing regex syntax (detailed here), API for accessing the match and captures, and a means to convert between an existential capture representation and concrete types.

The overall story is laid out in [SE-0350 Regex Type and Overview][overview] and each individual component is tracked in [Pitch and Proposal Status][pitches].

## Motivation

Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.

This file has been truncated.

@hamishknight can you update the link in your original post to point to the new version? Thanks.

hamishknight · April 8, 2022, 3:29pm

Unfortunately it seems you can't edit old posts. Updated pitch thread: [Pitch #2] Regex Syntax and Run-time Construction

hamishknight · April 8, 2022, 4:57pm

Apologies for the late reply on this. We plan on mentioning it in the regex literal pitch, as we feel it's more of a detail of the literal itself than the syntax of the regex engine.

Paul_Cantrell · April 8, 2022, 8:51pm

Agreed, and I’d even go further and say that, per Steve Canon’s comment, it’s best to hew to tradition with this syntax so as to favor reuse of existing regexes from other languages, and save the bold new design ideas for the new DSL.

People are going to get frustrated fast if they can’t copy and paste that “regex for valid emails, attempt 6003” answer from Stack Overflow.