[Pitch] Regular Expression Literals

(I attached this to the wrong reply target... I'm sorry :bowing_man:)

I think modes (or flags) of regex should be expressed by modifiers like /regex/.mode(), rather than /.../m syntax. It would allow customized modes in libraries.


By the way, does regex literal support interpolation? I sometimes want to combine regexes like /\(regex1)|\(regex2)/.
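For comparison, this kind of combination is already possible when the patterns are plain strings passed to Foundation's NSRegularExpression. A minimal sketch, assuming two made-up sub-patterns (only the names regex1 and regex2 come from the post; their contents are illustrative):

```swift
import Foundation

// Hypothetical sub-patterns; the contents are illustrative only.
let regex1 = "[a-z]+"
let regex2 = "[0-9]+"

// Ordinary string interpolation builds the combined pattern.
// Non-capturing groups keep each alternative self-contained.
let combined = try! NSRegularExpression(pattern: "(?:\(regex1))|(?:\(regex2))")

let input = "abc 123"
let matchCount = combined.numberOfMatches(
    in: input,
    options: [],
    range: NSRange(input.startIndex..., in: input)
)
print(matchCount) // 2
```

The catch is that the pieces are unchecked strings until the combined pattern is compiled at runtime, which is the gap a literal with interpolation would be meant to close.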

3 Likes

It'd be nice. But it'll have to be another syntax because \( already has the meaning of a parenthesis character (escaped) inside a regex.

1 Like

One syntax I like is this one:

#([a-z0-9]+)

Upsides:

  • It's less verbose than #regex().
  • The opening delimiter "#(" is unambiguous with any existing Swift syntax, so there's no need for a lot of new lexing rules.
  • It's balanced so it's easy to figure out where it ends when mixed with other expressions.
  • There's one less character to escape because parens already need to be escaped anyway.
  • Can write an empty regex.
  • It's also a nice touch that those parens look like a capture group because it will accurately represent capture group 0.

One downside:

string.match(#([a-z0-9]+))
  • That's a lot of parentheses when the regex is already inside a parenthesized expression.

I can't say I dislike the idea of single quotes '' regexes. It would certainly read better when using the regex inside a parenthesized expression like a function call:

string.match('[a-z0-9]+')

Upsides:

  • It's more lightweight than #regex() or #().
  • The delimiters ' are unambiguous with any existing Swift syntax so lexing is easy and unlikely to produce unexpected results.
  • It's using a delimiter character not used elsewhere in Swift, so it's easy to read where the regex syntax starts and ends, like in the parenthesized expression above.
  • Can write an empty regex.
  • It looks like a string, so it's not expected its content will be Swift syntax.

Downsides:

  • You need to escape single quotes with \' in the regex. In my experience this is less common than having to escape /.
  • It looks like a string, and could be confused for one.

Now, trying to compare with the syntax in the pitch itself:

string.match(/[a-z0-9]+/)

Upsides:

  • It's more lightweight than #regex() or #()
  • It closely matches the regex syntax in a couple of other languages.

Downsides:

  • The opening delimiter / is ambiguous with existing Swift syntax, so there's a need for more complex lexing rules.
  • Syntax highlighters not based on SourceKit are more likely to do a bad job at telling apart regex from non-regex stuff, or properly identifying the boundaries of the regex literal.
  • You need to escape slashes with \/ in the regex, which in my experience are a more frequent occurrence than single quotes. Also, I find escaping slashes more confusing than escaping other characters because it's the same shape mirrored that gets repeated: /\/\/.*/.
  • Cannot write an empty regex.

It's not unworkable, but I don't see many upsides to this choice.

8 Likes

If interpolation was desired, a syntax like #/<stuff>/# that was suggested above may provide a way to allow interpolation without conflicting too much with commonly written regex, in a way that is consistent with raw strings.

Ex: #/\#(regex1)|\#(regex2)/#
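The raw string behavior this suggestion mirrors can be checked in Swift as it stands. A small sketch with an illustrative variable (name is an assumption, not something from the thread): in a #"..."# string a plain backslash is literal text, and interpolation needs the extra # after the backslash.

```swift
let name = "id"

// In a raw string, a plain backslash sequence is literal text...
let literal = #"\d+\(name)"#

// ...and interpolation requires the delimiter's # after the backslash.
let interpolated = #"\d+\#(name)"#

print(literal)       // \d+\(name)
print(interpolated)  // \d+id
```

A #/.../# regex literal following the same rule would leave \( free to mean an escaped parenthesis, as in ordinary regex syntax.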

3 Likes

I’m torn on this. I personally quite dislike regular expressions (for reasons already stated), but I do think interoperation with other languages is important. For example, I think it’s important that front and backend validations of user input agree, and needing to re-express the same pattern in a different format introduces the opportunity for discrepancy, and thus, a bad UX.

On the flip side, I see this as like, a legacy interoperability feature, which I don’t think should justify language-level changes.

I was hoping that it could just piggy-back off raw string literals, like mentioned at the bottom of the main post. I guess the compiler wouldn’t have an issue syntax highlighting the contents as a regex, when the inferred type is Regex or whatever.

But then it occurred to me that this faces the same issue as the sharp delimitation between operator and identifier characters. It’s not that the compiler couldn’t be made to figure it out otherwise, but other code, like website highlighting (e.g. in GitHub), non-LSP connected editors, etc. wouldn’t be smart enough to do it.

Put succinctly: I think it’s an important goal to end up with a syntax that’s distinct enough that a “dumb” parser could readily identify it.

7 Likes

I understand this thread is about literals not about semantics, but I think they would be related to each other.

Let me take my comment in the overview thread:

My thoughts are

  • If Swift adopts similar literals with e.g. Perl, semantics should be similar with Perl.
  • If Swift adopts different semantics from e.g. Perl, literals should be different from Perl.

That would be called perceived affordance.

4 Likes

To be honest, I quite doubt the assumption that adding regex literals makes interoperation with other languages easier. It's dipping my toe into the semantics problem, but if the semantics of the same regex literal are different, it breaks interoperation. I really hope that the default behavior of Swift regex becomes grapheme-semantic, but a grapheme-semantic regex written for Swift will cause bugs in other languages whose String operations are scalar-semantic.
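The gap between the two semantics is easy to see on a single string. A minimal illustration (the sample string is my own choice): Swift's String counts grapheme clusters, while a scalar-based view of the same text sees more elements, so a pattern such as "exactly one character" means different things in each model.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
let s = "e\u{301}"

print(s.count)                 // 1 (grapheme clusters, i.e. Character values)
print(s.unicodeScalars.count)  // 2 (Unicode scalars)
print(s.utf16.count)           // 2 (UTF-16 code units)
```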

3 Likes

I'm quite excited by this pitch, and I have some ideas for what I'd like to do with it.

I think the proposal's motivation section should be expanded, as it essentially starts out from the position that "regexes exist" and so we should add support for them. IMO, the motivation is more that:

  • Text processing is critically important to most of the domains Swift wants to support, including applications, servers, and especially scripts.
  • Our current pattern-matching is inadequate, based on overly-verbose, generic collection methods (sometimes supplemented by Foundation)
  • When terseness and productivity are most important, there is already an industry-standard compact syntax for expressing patterns: regexes.

If you need to do some simple parsing of a machine-generated log file or database (such as the Unicode data tables), Swift makes it possible... but not easy, and not concise.

The pattern-matching DSL proposed elsewhere is great, but it involves a fair amount of ceremony and visually dominates the code around it. It's readable, but also not that concise. Regexes are primarily useful for simple patterns - e.g. split these log lines into (time, severity, message) based on a given format, and whilst they are powerful enough that they can scale to the moon, like always, it's up to the developer to ensure their code stays readable.
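To make the log-line case concrete, here is roughly what it takes with the Foundation API available today. This is a sketch under assumptions: the log format ("HH:MM:SS [SEVERITY] message") and the helper name parseLogLine are invented for illustration.

```swift
import Foundation

// Hypothetical helper: splits a line such as "12:34:56 [WARN] disk almost full"
// into (time, severity, message). The format is assumed, not from the thread.
func parseLogLine(_ line: String) -> (time: String, severity: String, message: String)? {
    let pattern = #"^(\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }

    let fullRange = NSRange(line.startIndex..., in: line)
    guard let match = regex.firstMatch(in: line, options: [], range: fullRange),
          let time = Range(match.range(at: 1), in: line),
          let severity = Range(match.range(at: 2), in: line),
          let message = Range(match.range(at: 3), in: line)
    else { return nil }

    // Captures come back as untyped ranges that must be converted by hand.
    return (String(line[time]), String(line[severity]), String(line[message]))
}

if let parsed = parseLogLine("12:34:56 [WARN] disk almost full") {
    print(parsed.time, parsed.severity, parsed.message)
}
```

Most of the function is range bookkeeping rather than the pattern itself, which is the ceremony a literal with typed captures would aim to remove.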

When your regexes get too large or complex, I'd imagine the compiler's refactoring engine would be able to rewrite them using the Swift pattern DSL, extract it as a function, etc. The point is that the language scales to the complexity of the pattern, so both simple and complex patterns are convenient to use and easy to maintain.

I also want to remind people of this post from 2016(!) after Swift 3 was released:

It has taken a while, but the goal is to be better than Perl. Realistically, we can't do that unless we have a way to express simple patterns without a huge amount of ceremony. Any other regex-like pattern literal would just be confusing because regexes are so ubiquitous, and be subject to the same criticism that they could potentially be abused.


As for the proposed design, I think it's really excellent, and a great demonstration of what we can do with the generic builder transform (so far used only by result builders, IIRC). I really like the idea that my code will be able to get the regex AST through the builder, so we can know something about what the regex is going to do and how to incorporate it in to a larger pattern.

One thing that this highlights, though, is that we need to move our other builders - e.g. ExpressibleByStringInterpolation, to the new generic builder model, otherwise we won't be able to compose regexes with string literals and other patterns.

For instance, picture something like the popular JavaScript library path-to-regexp for Swift. It takes a path string, potentially including regexes or other patterns, and returns a pattern object. The best approach would seem to be to use a string interpolation with regex segments, e.g.:

url.matches(path: "/books/id_\(/\d+/)")

I'm guessing that there will be a buildCharacterClass_d callback so I can build a pattern which captures and returns an Int, but since ExpressibleByStringInterpolation uses mutating appendInterpolation calls, those types cannot be reflected in the type of the pattern object or returned by the url.matches function.
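The limitation is visible in a custom interpolation type written against today's protocol. A minimal sketch, assuming a hypothetical PathPattern type with an appendInterpolation(name:) overload (neither is an existing library API): every segment is appended to one mutable StringInterpolation value, so whatever a segment would capture cannot show up in the static type of the result.

```swift
// Hypothetical pattern type; names and design are illustrative only.
struct PathPattern: ExpressibleByStringInterpolation {
    enum Component: Equatable {
        case literal(String)
        case capture(name: String)
    }

    var components: [Component]

    struct StringInterpolation: StringInterpolationProtocol {
        var components: [Component] = []

        init(literalCapacity: Int, interpolationCount: Int) {}

        mutating func appendLiteral(_ literal: String) {
            if !literal.isEmpty { components.append(.literal(literal)) }
        }

        // Each segment mutates the same accumulator; its type is erased here.
        mutating func appendInterpolation(name: String) {
            components.append(.capture(name: name))
        }
    }

    init(stringLiteral value: String) {
        components = [.literal(value)]
    }

    init(stringInterpolation: StringInterpolation) {
        components = stringInterpolation.components
    }
}

let pattern: PathPattern = "/users/\(name: "user-id")/image-\(name: "img-id").png"
print(pattern.components.count) // 5
```

The result is always just PathPattern, however many captures the literal contains; a builder-style transform would be needed for the capture types to flow into the pattern's type.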

This shouldn't be a factor in whether this proposal is accepted, but I just wanted to point out that we may need to adjust other parts of the standard library for this feature to really shine.


As for the delimiter discussion, please also consider what those delimiters might look like as part of a string interpolation. For example:

"/books/id_\(/\d+/)/info/\(/.*/)"
"/books/id_\(#regex(\d+))/info/\(#regex(.*))"
"/books/id_\(#/\d+/#)/info/\(#/.*/#)"
"/books/id_\((\d+))/info/\((.*))"

Personally, I think #regex(...) and #/.../# add too much ceremony.

Also, it might be interesting if there was a way for ExpressibleByStringInterpolation to allow omitting regex delimiters within interpolation segments. It is also a kind of concise DSL which is particularly attractive for text patterns, and removing delimiters in contexts where regexes are common helps the pattern stay readable:

"/books/id_\(\d+)/info/\(.*)"

It's added complexity, and generally I don't like that, but I think the benefit is significant.

8 Likes

I think the #/.../# is the least confusing of these examples.

2 Likes

How about:

"/books/id_\('\d+')/info/\('.*')"

1 Like

We'd need something different than \() for that. I think \{} should work:

"/books/id_\{\d+}/info/\{.*}"

But I'm not sure why we'd want this over '/books/id_\d+/info/.*'.

Where interpolation might shine is if you also want to use Swift expressions with \():

"/\(sectionPath)/id_\{\d+}/info/\{.*}"

Also, and maybe it's not a good idea, we could make string interpolation the only syntax to create a regex. You'd write "\{/[a-z]+/}" instead of /\/[a-z]+\// or '/[a-z]+/'. A bit more obscure, but we aren't stealing any delimiters from regular Swift code that way.


Edit: I realize capture groups might be a bit confusing inside of an interpolated string. Also that the reverse — interpolating Swift expressions inside the regex, in the middle of a capture group for instance — would be much more interesting.

3 Likes

I'm not sure how scalable it is. I see the logic in keeping parsing and formatting coupled, but often you want to keep the model decoupled from both, because it can be parsed from or formatted into different "languages".

I am strongly opposed.

Regular expression literals, in the standard forms that exist today, are antithetical to Swift’s goal of clarity at the point of use. They form a dense jumble of arcane symbols all mashed together.

I can appreciate this point of view, but I don't think this is necessarily antithetical to that goal. As I frequently say, any non-trivial regex is indistinguishable from a cat having walked over a keyboard. But regular expressions do exist. They are a useful tool that you ultimately need to use from time to time in Swift, and which already permeate many codebases.

Much like needing to support pointers, you just have to be pragmatic at times. I mean, it's not like Swift is currently a paragon of "clarity at the point of use". There are a good many counter examples to this that I would argue are worse for clarity than native regular expression support (method overloading and not requiring self. for all method and property calls being just two deadly sins).

Given regular expressions are already used in Swift, just with an awkward and clunky API, I'd say this proposal is a net improvement to clarity and a win for pragmatic language design.

4 Likes

I particularly like this because it could compose a nice clear statement:

for path in paths.matching(#regex(/foo/bar/baz\d+$)) {...}

But I'm not sure why we'd want this over '/books/id_\d+/info/.*'.

This is what I was coming to ask.

Having written a ton of Perl back in the day, I can read REs with // delimiters, but the moment I found out about Perl's ability to have others, I switched.

s'/foo(\d+$)'/bar\1' is easier to parse than s/\/foo(\d+$)/\/bar\1/ for me, especially as things get more complicated. Just the removal of escapes is worth it.

I've also used s_/foo(\d+$)_/bar\1_ and s!/foo(\d+$)!/bar\1!

I think in the end, I'd rather like

myString.matching( #regex(/foo(\d+$)), substituting: #regex(/bar\1) )

or even

myString.matching( #re(/foo(\d+$)), substituting: #re(/bar\1) )

or

myString.matching( #re'/foo(\d+$)', substituting: #re'/bar\1' )

Just going by the syntax coloring, the last is great.
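For reference, the same substitution can be written today with a raw string and NSRegularExpression, and it already avoids escaping the slashes. A minimal sketch with an illustrative input; note that Foundation's replacement templates spell the back-reference $1 rather than \1.

```swift
import Foundation

// The raw string keeps the pattern free of doubled backslashes and escaped slashes.
let regex = try! NSRegularExpression(pattern: #"/foo(\d+)$"#)

let input = "/foo42"
let result = regex.stringByReplacingMatches(
    in: input,
    options: [],
    range: NSRange(input.startIndex..., in: input),
    withTemplate: "/bar$1" // $1 refers to the first capture group
)
print(result) // /bar42
```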

3 Likes

For things like path-to-regexp, the point is that you're not writing a regex - you're writing a path string DSL with embedded patterns that may be regexes or non-regex patterns, e.g.:

"/users/\(name: "user-id")/image-\(name: "img-id").png"
        ^----------------^ captures whole segment

To see why that is beneficial over a plain regex, we can look at path-to-regexp itself. It turns this path string DSL into a regex, so we can compare what the DSL is doing:

// path-to-regexp syntax:
"/users/:user_id/image-:img_id.png"
// Equivalent regex:
/^\/users\/(?:([^\/]+?))\/image-(?:([^\/]+?))\.png\/?$/i
// path-to-regexp syntax:
"/books/id_(\\d+)/info/*"
// Equivalent regex:
/^\/books\/id_(\d+)\/info\/(.*)\/?$/i

And these are quite simple patterns, so it's no wonder why people prefer to use a DSL with smaller regexes sprinkled in.

If we did have a way to avoid double-delimited regexes in string interpolations, we could create something very nice - competitive even with path-to-regexp's pattern strings in terms of compactness and expressiveness, but with compile-time syntax checking, strongly-typed captures, etc.

// This is actually doable today...
"/users/\(name: "user-id")/image-\(name: "img-id").png" // swift
"/users/:user_id/image-:img_id.png" // js

// But this would be cool to have as well.
"/books/id_\(\d+)/info/\(.*)" // swift
"/books/id_(\\d+)/info/*" // js

I'm not sure if this is possible from a parser standpoint, but if I could wish for something, this is what I'd wish for.

I'm very strongly -1 on this pitch.

PCRE is NOT the pinnacle of a regex literal design and choosing it for historical reasons is a major mistake and lost opportunity in my opinion. I'd much rather see a good regex literal design. I wonder if the stated goals of copy/paste compatibility with stack-overflow couldn't perhaps be better met by adding a Regex(pcre:) initializer that takes a string in pcre syntax as an argument without forcing a suboptimal regex literal syntax into the language ...

This is true for any non-trivial regex written with PCRE syntax -- but it's just not always true for more sane regular expression encodings ... PCRE is absolutely not the best possible regular expression literal design in terms of read and write-ability ...

Seeing the proposals in this thread to support interpolating regular expressions via an incredibly ugly syntax exacerbates my disappointment with the approach this proposal takes ... it appears to me that the swift community is actively choosing a poor literal design for reasons that are not well justified. The main reason seems to be 'this is the only kind of regular expression literal that most programmers have used and many folks have 20+ years of experience with this specific literal' and there seems to be a blindspot to alternatives ... I hope that the proposal will at least acknowledge that despite some programmers having 20 years of experience with it, the PCRE literal design itself has a lot of problems ...

Problems with PCRE design include:

  • characters that are meaningful to the pattern matching engine are frequently used as terminals introducing the requirement for a large amount of character escaping in practice -- and knowing which characters to escape requires awareness of all regexp control symbols
  • special character classes like \w convey nothing about the pattern they are trying to match - I for one have to look it up every time I see it (hmm does that mean whitespace?)
  • whitespace within the literal is semantic and thus cannot be used to organize patterns into visual chunks that are easier to parse or intended to correspond to some higher-level human concept
  • multiline literals are not supported
  • parentheses are overloaded for all of capturing, order of operations, and terminals (with escaping)
  • regex interpolation (if supported) requires sigil heavy visually difficult syntax which usually leads to the same subpattern being copy/pasted multiple times within a given regex rather than simply defined once and re-used by named reference

In my opinion, a single design change would illuminate a path that could correct all these flaws and lead to a vastly superior literal design in the end ...

  1. terminals inside a regex literal must be inside of a delimiter (example delimiter: double-quote ")

This change would immediately allow for actual identifiers to be used within the literal (when not inside the terminal delimiters) -- so instead of \w, \s, and . we could have 'word', 'space', any. Instead of /#\(regex1)|#\(regex2)/ interpolation we could have sigil free interpolation ... Instead of escape characters everywhere for all regexp control sigils, we simply have quotes around the terminal usages which makes it immediately clear which sigils are being used as terminals and which not ...

let identifier = /alpha word*/
let hexdigit = /"0x" ("A"..."F" | "a"..."f")+/
let someDumbFormatExpression = /identifier "=" identifier ("+" | "*" | "-") hexdigit newline/

or -- avoiding the '//' as the terminators since if we don't use PCRE there is literally no reason to prefer those specific delimiters ...

let identifier = #re { alpha word* }
let hexdigit = #re { "0x" ("A"..."F" | "a"..."f")+ }
let someDumbExpression = #re { identifier "=" identifier ("+" | "*" | "-") hexdigit newline }

Sorry to add noise to this thread with a repeated comment offering more or less the same perspective as an earlier comment ... but I've seen a whole lot of "I don't like regular expressions", "regular expressions are inherently noisy/hard to parse" type comments in the thread -- and while I think it's very much the case that these statements are true for PCRE style literals specifically -- I don't think these feelings are anywhere near as valid for pattern matching expressions in general -- and imo some pushback on these claims is deserved ...

I'd love if this proposal would include a fairly large list of "example strings to parse/extract data from" -- so that the discussion could include concrete 'here's what it could look like to solve these problems with an alternative regexp syntax' -- that way claims about brevity gained/lost and the value of familiarity could be more realistically evaluated ... I don't see this discussion as doing a good job of representing any discarded alternatives that are not being chosen given the strong desire to inherit the historical baggage/familiarity of PCRE ...

Another exercise that I think would be a good stress test for this and for Pattern matching in general ... would be to define extractions for languages that are example based and not fully known in advance. In the real world this is often the problem faced -- first define something to get some data out of a messy format based on the examples you have, then modify those patterns as more sample data comes in. Often this involves defining and modifying 'heuristics' to try and extract data from not-well-defined data sources. Real world problem shape is roughly along these lines:
(1) here's two examples of the samples of the string we need to extract data from
(2) implement something
(3) ok that worked on a lot of sample data, but here's two more samples we came across that it didn't work right on
(4) modify
(5) ok here's two more new samples it did the wrong thing on ...
(6) ... repeat

In the real world formulation of the use cases for regex -- readability, understandability, and changeability matter a lot. regex/pattern matching/peg are super good tools for the problem and there's no reason imo to make these very useful literals be extremely difficult to read and harder to modify than they need to be ...

15 Likes

I may be misunderstanding Karl, but the path-to-regexp example doesn’t seem to me to be a very good fit for String interpolation of regex literals. Taking this example:

If I’m understanding correctly this produces a String? (By normal string-interpolation rules, if we assume the regex literals get parsed to regex values.) But then the library has to parse that string to extract some path components, as well as the regexes themselves: the library would have to runtime parse a String to get back type information that was already known at compile-time (the regex literal types, which got lost in the String interpolation).

This seems to mean that I can misuse the path-to-regexp library rather easily. Whatever string representation the regex gets interpolated to, I can put in the string literal too:

"\d*\\d*\\\\d*id_\(\d+)/info/\(.*)" // one of these is probably a problem

If I’m mixing strings and regexes I would rather expect the regex syntax to allow interpolating a string (as a regex terminal). And I get that path-to-regexp offers a richer DSL than just regexes, but it seems like using String as the representation type is going in the wrong direction: I think you really want a path-to-regexp DSL that preserves all the type information (path prefix, regex capture, etc etc). I wonder if the Pattern part of the two-pronged approach gets you closer?

1 Like

Does the design of this pitch imply that there is simply no way to interpolate a string into a regex literal expression? As in

let theWord: String = getTheWord()
let interpolateTheWord: String = "the word was \(theWord) and the word was good"
let matchTheWord: Regex = /\W\(theWord)\W/ // nope only pure PCRE syntax here?

(If interpolation is possible I would expect it to treat the string as a regex ... I want to say "literal" but that term is now rather overloaded. Regex terminal, per @breathe? The regex that matches exactly that string and no other.)
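With string-based patterns, the "regex that matches exactly that string" behavior has to be requested explicitly. A short sketch using Foundation's NSRegularExpression.escapedPattern(for:); the sample word and the matches helper are assumptions for illustration.

```swift
import Foundation

// Illustrative value containing a regex metacharacter.
let theWord = "a.b"

// Escaping turns the string into a pattern that matches it literally.
let escaped = NSRegularExpression.escapedPattern(for: theWord)
let matchTheWord = try! NSRegularExpression(pattern: #"\W"# + escaped + #"\W"#)

// Hypothetical convenience wrapper around firstMatch.
func matches(_ text: String) -> Bool {
    let range = NSRange(text.startIndex..., in: text)
    return matchTheWord.firstMatch(in: text, options: [], range: range) != nil
}

print(matches(" a.b "))  // true
print(matches(" axb "))  // false: the "." was escaped, so it is not a wildcard
```

Forgetting the escaping step silently changes the meaning of the pattern, which is one argument for having interpolation into a literal treat strings as terminals by default.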

Once I start on this, I also think about interpolating regexes into regexes, then I'm naming them after grammar production rules, and before you know it I'm at Pattern. But is it indeed the principle behind this pitch that this slippery slope stops before it even gets started?

1 Like

No, it would return a custom pattern type which conforms to ExpressibleByStringInterpolation.

The idea is that you would have 2 ways to build a pattern - either using the compact string syntax:

let p: PathPattern = "/books/id_\(\d+)/info/\(.*)"

or some kind of result-builder DSL syntax if you have something which has outgrown that shorthand:

PathPattern {
  "books"
  Segment {
    "id_"
    Regex { /(\d+)/ }.capture()
  }
  "info"
  Regex { /(.*)/ }.capture()
}

In neither case would we be parsing regexes at runtime. That's something path-to-regexp does because it's in JavaScript and can assume a JIT, but hardly anybody actually cares what the path DSL parses down to - they just want a convenient way to match a path against a pattern.

The Swift version would be more like "path-to-opaque-pattern-object", I would think.
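A rough idea of the builder-style form, written with result builders as they exist today. Everything here is hypothetical (RoutePattern, RouteComponent, and RouteBuilder are invented names), and the regex segments are stored as plain strings because regex literals are what the pitch would add.

```swift
// Hypothetical component model: literal path text or an embedded pattern.
enum RouteComponent: Equatable {
    case literal(String)
    case pattern(String)
}

@resultBuilder
struct RouteBuilder {
    // Bare strings in the builder body become literal path segments.
    static func buildExpression(_ text: String) -> [RouteComponent] {
        [.literal(text)]
    }

    static func buildExpression(_ component: RouteComponent) -> [RouteComponent] {
        [component]
    }

    static func buildBlock(_ parts: [RouteComponent]...) -> [RouteComponent] {
        parts.flatMap { $0 }
    }
}

// An opaque pattern object; nothing is parsed from a string at runtime here.
struct RoutePattern {
    let components: [RouteComponent]

    init(@RouteBuilder _ body: () -> [RouteComponent]) {
        components = body()
    }
}

let route = RoutePattern {
    "books"
    RouteComponent.pattern(#"id_(\d+)"#)
    "info"
    RouteComponent.pattern("(.*)")
}
print(route.components.count) // 4
```

This sketch still erases capture types into one array; carrying them in the pattern's type is the part that would depend on the builder transform discussed earlier in the thread.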