[Pitch #2] Regex Literals

hamishknight · April 13, 2022, 5:56pm

Regex Literals

Authors: Hamish Knight, Michael Ilseman, David Ewing

Introduction

We propose the introduction of regex literals to Swift source code, providing compile-time checks and typed-capture inference. Regex literals help complete the story told in Regex Type and Overview.

Motivation

In Regex Type and Overview we introduced the Regex type, which is able to dynamically compile a regex pattern:

let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
// regex: Regex<AnyRegexOutput>

The ability to compile regex patterns at run time is useful for cases where the pattern is, e.g., provided as user input. However, it is suboptimal when the pattern is statically known, for a number of reasons:

Regex syntax errors aren't detected until run time, and explicit error handling (e.g. try!) is required to deal with these errors.
No special source tooling support, such as syntactic highlighting, code completion, and refactoring support, is available.
Capture types aren't known until run time, and as such a dynamic AnyRegexOutput capture type must be used.
The syntax is overly verbose, especially for e.g. an argument to a matching function.
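A minimal sketch of the first point, assuming a Swift 5.7 or later toolchain, where the run-time initializer is spelled Regex(_:) rather than the Regex(compiling:) used in this pitch. The helper name compilesAtRunTime is hypothetical. It shows that a malformed pattern held in a string is only rejected once the program runs:

```swift
// Assumption: Swift 5.7+, where run-time construction is `try Regex(pattern)`.

/// Hypothetical helper: reports whether `pattern` compiles as a regex at run time.
func compilesAtRunTime(_ pattern: String) -> Bool {
  do {
    _ = try Regex(pattern)   // Regex<AnyRegexOutput>
    return true
  } catch {
    // The syntax error only surfaces here, while the program is running.
    return false
  }
}

let validPattern = #"(\w+)\s+(\d+)"#
let invalidPattern = #"(\w+"#   // missing closing parenthesis

let validCompiles = compilesAtRunTime(validPattern)
let invalidCompiles = compilesAtRunTime(invalidPattern)
```

With a regex literal, the equivalent of invalidPattern would be rejected by the compiler instead.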

Proposed solution

A regex literal may be written using /.../ delimiters:

// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/
// regex: Regex<(Substring, identifier: Substring, hex: Substring)>

Forward slashes are a regex term of art. They are used as the delimiters for regex literals in, e.g., Perl, JavaScript and Ruby. Perl and Ruby additionally allow for user-selected delimiters to avoid having to escape any slashes inside a regex. For that purpose, we propose the extended literal #/.../#.

An extended literal, #/.../#, avoids the need to escape forward slashes within the regex. It allows an arbitrary number of balanced # characters around the literal and escape. When the opening delimiter is followed by a new line, it supports a multi-line literal where whitespace is non-semantic and line-ending comments are ignored.

The compiler will parse the contents of a regex literal using the regex syntax outlined in Regex Construction, diagnosing any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Regex literals allow editors and source tools to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see Regex builder DSL).

A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder:

// A regex for extracting a currency (dollars or pounds) and amount from input 
// with precisely the form /[$£]\d+\.\d{2}/
let regex = Regex {
  Capture { /[$£]/ }
  TryCapture {
    /\d+/
    "."
    /\d{2}/
  } transform: {
    Amount(twoDecimalPlaces: $0)
  }
}

This flexibility allows for terse matching syntax to be used when it's suitable, and more explicit syntax where clarity and strong types are required.
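As a hedged, runnable variant of the example above: the sketch below assumes a Swift 5.7 or later toolchain with the RegexBuilder module, uses the extended #/.../# spelling so that no compiler flag is needed, and substitutes a plain Double for the Amount type, which is not defined in this pitch. The matching method is spelled wholeMatch(of:) in shipped toolchains.

```swift
import RegexBuilder

// Literals and builder components mixed in a single regex.
let price = Regex {
  Capture { #/[$£]/# }          // currency symbol
  TryCapture {
    #/\d+/#
    "."
    #/\d{2}/#
  } transform: {
    Double($0)                  // stand-in for the pitch's Amount type
  }
}
// price: Regex<(Substring, Substring, Double)>

let pounds = "£12.50".wholeMatch(of: price)
let missingSymbol = "12.50".wholeMatch(of: price)

if let pounds {
  print(pounds.output.1, pounds.output.2)
}
```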

Due to the existing use of / in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. Some of these ambiguities require a couple of source breaking language changes, and as such the /.../ syntax requires upgrading to a new language mode in order to use.

Detailed design

Named typed captures

Regex literals have their capture types statically determined by the capture groups present. This follows the same inference behavior as the DSL, and is explored in more detail in Strongly Typed Captures. One aspect of this that is currently unique to the literal is the ability to infer labeled tuple elements for named capture groups. For example:

func matchHexAssignment(_ input: String) -> (String, Int)? {
  let regex = /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/
  // regex: Regex<(Substring, identifier: Substring, hex: Substring)>
  
  guard let match = regex.matchWhole(input), 
        let hex = Int(match.hex, radix: 16) 
  else { return nil }
  
  return (String(match.identifier), hex)
}

This allows the captures to be referenced as match.identifier and match.hex, in addition to numerically (like unnamed capture groups) as match.1 and match.2. This label inference behavior is not available in the DSL, however users are able to bind captures to named variables instead.
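A small sketch of the two access styles side by side, assuming a Swift 5.7 or later toolchain (where matchWhole is spelled wholeMatch(of:)) and using the extended delimiter so no language-mode flag is required. The variable names are illustrative only.

```swift
let assignment = #/(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/#
// assignment: Regex<(Substring, identifier: Substring, hex: Substring)>

var byLabel: (String, Int)? = nil
var byPosition: (String, Int)? = nil

if let match = "count = 1F".wholeMatch(of: assignment) {
  // Access through the inferred tuple labels.
  if let hex = Int(match.hex, radix: 16) {
    byLabel = (String(match.identifier), hex)
  }
  // Access the same captures by position.
  if let hex = Int(match.2, radix: 16) {
    byPosition = (String(match.1), hex)
  }
}
```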

Extended delimiters `#/.../#`, `##/.../##`

Backslashes may be used to write forward slashes within the regex literal, e.g /foo\/bar/. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced number signs. This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example:

let regex = #/usr/lib/modules/([^/]+)/vmlinuz/#
// regex: Regex<(Substring, Substring)>
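To illustrate the difference in noise, the sketch below writes the same pattern twice inside extended delimiters: once with unescaped slashes, and once with every slash escaped as a bare /.../ literal would require. It assumes a Swift 5.7 or later toolchain, and the sample path is made up for the example.

```swift
// Slashes written directly, thanks to the extended delimiter.
let extended = #/usr/lib/modules/([^/]+)/vmlinuz/#

// The same pattern with every slash escaped, as /.../ delimiters would need.
let escaped = #/usr\/lib\/modules\/([^\/]+)\/vmlinuz/#

let path = "usr/lib/modules/5.15.0-generic/vmlinuz"   // hypothetical input

let viaExtended = path.wholeMatch(of: extended)?.output.1
let viaEscaped = path.wholeMatch(of: escaped)?.output.1
```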

The number of # characters may be further increased to allow the use of e.g /# within the literal. This is similar in style to the raw string literal syntax introduced by SE-0200, however it has a couple of key differences. Backslashes do not become literal characters. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline.

let regex = #/
  usr/lib/modules/ # Prefix
  (?<subpath> [^/]+)
  /vmlinuz          # The kernel
/#
// regex: Regex<(Substring, subpath: Substring)>

Escaping of backslashes

This syntax differs from raw string literals #"..."# in that it does not treat backslashes as literal within the regex. A string literal #"\n"# represents the literal characters \n. However a regex literal #/\n/# remains a newline escape sequence.

One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals however, it instead suggests that backslashes should retain their semantic meaning. This enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used.

With string literals, escaping can be tricky without the use of raw syntax, as backslashes may have semantic meaning to the consumer, rather than the compiler. For example:

// Matches '\' <word char> <whitespace>* '=' <whitespace>* <digit>+
let regex = try NSRegularExpression(pattern: "\\\\\\w\\s*=\\s*\\d+", options: [])

In this case, the intent is not for the compiler to recognize any of these sequences as string literal escapes, it is instead for NSRegularExpression to interpret them as regex escape sequences. As such, a raw string may be used to treat the backslashes literally, allowing NSRegularExpression to directly process the escapes, e.g #"\\\w\s*=\s*\d+"#.

However this is not an issue for regex literals, as the regex parser is the only possible consumer of such escape sequences. Such a regex can be directly spelled as:

let regex = /\\\w\s*=\s*\d+/
// regex: Regex<Substring>

Backslashes still require escaping to be treated as literal, however we don't expect this to be as common of an occurrence as needing to write a regex escape sequence such as \s, \w, or \p{...}, within a regex literal with extended delimiters #/.../#.
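The contrast between raw strings and extended regex literals can be checked directly. This is a sketch assuming a Swift 5.7 or later toolchain; the inputs are illustrative.

```swift
let rawString = #"\n"#        // two characters: a backslash and "n"
let newlineRegex = #/\n/#     // still the newline escape sequence

let matchesNewline = "a\nb".contains(newlineRegex)
let matchesRawText = rawString.contains(newlineRegex)

// A literal backslash still needs escaping inside a regex literal.
let backslashAssignment = #/\\\w\s*=\s*\d+/#
let matchesAssignment = #"\x = 42"#.wholeMatch(of: backslashAssignment) != nil
```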

Multi-line mode

Extended regex delimiters additionally support a multi-line mode when the opening delimiter is followed by a new line. For example:

let regex = #/
  # Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  (?<kind>    \w+)                \s\s+
  (?<date>    \S+)                \s\s+
  (?<account> (?: (?!\s\s) . )+)  \s\s+ # Note that account names may contain spaces.
  (?<amount>  .*)
  /#

In this mode, extended regex syntax (?x) is enabled by default. This means that whitespace becomes non-semantic, and end-of-line comments are supported with # comment syntax.

This mode is supported with any (non-zero) number of # characters in the delimiter. Similar to multi-line strings introduced by SE-0168, the closing delimiter must appear on a new line. To avoid parsing confusion, such a literal will not be parsed if a closing delimiter is not present. This avoids inadvertently treating the rest of the file as regex if you only type the opening.
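As a usage sketch, the multi-line literal above can be applied to the sample line from its comment. This assumes a Swift 5.7 or later toolchain (matching via wholeMatch(of:)); the comment inside the literal is shortened here, and the variable names are illustrative.

```swift
let statementLine = #/
  (?<kind>    \w+)                \s\s+
  (?<date>    \S+)                \s\s+
  (?<account> (?: (?!\s\s) . )+)  \s\s+  # account names may contain spaces
  (?<amount>  .*)
  /#

let line = "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
let parsed = line.wholeMatch(of: statementLine)

if let parsed {
  // Whitespace in the literal is non-semantic, so only the \s escapes match spaces.
  print(parsed.kind, parsed.date, parsed.account, parsed.amount)
}
```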

Ambiguities with comment syntax

Line comment syntax // and block comment syntax /* will continue to be parsed as comments. An empty regex literal is not a particularly useful thing to express, but can be written as #//# if desired. * would be an invalid starting character of a regex, and therefore does not pose an issue.

A parsing conflict does however arise when a block comment surrounds a regex literal ending with *, for example:

/*
let regex = /[0-9]*/
*/

In this case, the block comment prematurely ends on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with */ in a string literal, though it is more likely to occur in a regex given the prevalence of the * quantifier. This issue can be avoided in many cases by using line comment syntax // instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines.

Ambiguity with infix operators

There is a minor ambiguity when infix operators are used with regex literals. When used without whitespace, e.g x+/y/, the expression will be treated as using an infix operator +/. Whitespace is therefore required for regex literal interpretation, e.g x + /y/. Alternatively, extended literals may be used, e.g x+#/y/#.

Regex syntax limitations

In order to help avoid further parsing ambiguities, a /.../ regex literal will not be parsed if it starts with a space, tab, or ) character, though the latter is already invalid regex syntax. This restriction may be avoided by using the extended #/.../# literal.

Rationale

This is due to two main parsing ambiguities. The first arises when a /.../ regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, in particular within a Regex builder:

let digit = Regex {
  TryCapture(OneOrMore(.digit)) { Int($0) }
}
// Matches against <digit>+ (' + ' | ' - ') <digit>+
let regex = Regex {
   digit
   / [+-] /
   digit
}

Instead of being parsed as three result builder elements, the second of which is a regex literal, this is instead parsed as a single operator chain with the operands digit, [+-], and digit. This will therefore be diagnosed as semantically invalid.

To avoid this issue, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side.

If a space or tab is needed as the first character, it must either be escaped, e.g.:

let regex = Regex {
   digit
   /\ [+-] /
   digit
}

or an extended literal must be used, e.g.:

let regex = Regex {
   digit
   #/ [+-] /#
   digit
}

The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function or subscript, for example:

let arr: [Double] = [2, 3, 4]
let x = arr.reduce(1, /) / 5

The / in the call to reduce is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is ). This should have minimal impact, as this would not be valid regex syntax anyway.

It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section.

Language changes required

In addition to ambiguities listed above, there are also some parsing ambiguities that require the following language changes in a new language mode:

Deprecation of prefix operators containing the / character.
Parsing /, and /] as the start of a regex literal if a closing / is found, rather than an unapplied operator in an argument list. For example, fn(/, /) becomes a regex literal rather than 2 unapplied operator arguments.

Prefix operators containing `/`

We need to ban prefix operators starting with /, to avoid ambiguity with cases such as:

let x = /0; let y = 1/
let z = /^x^/

Prefix operators containing / more generally also need banning, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g:

let x = !/y / .foo()

Today, this is interpreted as the prefix operator !/ on y. With the banning of prefix operators containing /, it becomes prefix ! on a regex literal, with a member access .foo.

Postfix / operators do not require banning, as they'd only be treated as regex literal delimiters if we are already trying to lex as a regex literal.

`/,` and `/]` as regex literal openings

As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is ). However it does not solve the issue when the next character is , or ]. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex.

For example:

// Ambiguity with comma:
func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {}
foo(/, /)

// Also affects cases where the closing '/' is outside the argument list.
func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int { 0 }
bar(/, 2) + bar(/, 3)

// Ambiguity with right square bracket:
struct S {
  subscript(_ fn: (Int, Int) -> Int) -> Int { 0 }
}
func baz(_ x: S) -> Int {
  x[/] + x[/]
}

foo(/, /) is currently parsed as two unapplied operator arguments. bar(/, 2) + bar(/, 3) is currently parsed as two independent calls that each take an unapplied / operator reference. Both of these will become regex literal arguments, /, / and /, 2) + bar(/ respectively (though the latter will produce a regex error).

To disambiguate these cases, users will need to surround at least the opening / with parentheses, e.g:

foo((/), /)
bar((/), 2) + bar(/, 3)

func baz(_ x: S) -> Int {
  x[(/)] + x[/]
}

This takes advantage of the fact that a regex literal will not be parsed if the first character is ).
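The parenthesized spelling is already valid Swift today, so code can be written in a form that should read the same way under either interpretation. A minimal sketch, where the function name combine is hypothetical:

```swift
let arr: [Double] = [2, 3, 4]

// (/) is an operator reference; since ")" follows the slash,
// it is not a candidate for a regex literal opening.
let x = arr.reduce(1, (/)) / 5        // ((1 / 2) / 3) / 4, then / 5

// Hypothetical function taking two operator references.
func combine(_ f: (Int, Int) -> Int, _ g: (Int, Int) -> Int) -> Int {
  g(f(12, 3), 2)
}

let y = combine((/), (/))             // (12 / 3) / 2
```

Parenthesizing both references is stricter than the rule above requires, but it keeps the call sites uniform.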

Source Compatibility

As explored above, two source breaking changes are needed for /.../ syntax:

Deprecation of prefix operators containing the / character.
Parsing /, and /] as the start of a regex literal if a closing / is found, rather than an unapplied operator in an argument list. For example, fn(/, /) becomes a regex literal rather than two unapplied operator arguments.

As such, both these changes and the /.../ syntax will be introduced in Swift 6 mode. However, projects will be able to adopt the syntax earlier by passing the compiler flag -enable-bare-regex-syntax. Note this does not affect the extended delimiter syntax #/.../#, which will be usable immediately.

Future Directions

Modern literal syntax

We could support a more modern Swift-like syntax in regex literals. For example, comments could be done with // and /* ... */, and quoted sequences could be done with "...". This would however be incompatible with the syntactic superset of regex syntax we intend to parse, and as such may need to be introduced using a new literal kind, with no obvious choice of delimiter.

However, such a syntax would lose out on the familiarity benefits of standard regex, and as such may lead to an "uncanny valley" effect. It's also possible that the ability to use regex literals in the DSL lessens the benefit that this syntax would bring.

Alternatives Considered

Given the fact that /.../ is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. It should be noted that the syntax has become less popular in some communities such as Perl, however we still feel that it is a compelling choice, especially with extended delimiters #/.../#. Additionally, while there are some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities, and would therefore not require source breaking changes.

Prefixed quote `re'...'`

We could choose to use re'...' delimiters, for example:

// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'

The use of a two-letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to extended and multi-line literals using re#'...'# and re'''...''' respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might lead users to confuse it with a raw string literal.

Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as (?'name'), (?('name')), \g'name', \k'name', as well as callout syntax (?C'arg'). The use of a single quote conflicts with the re'...' delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g (?<name>), \k<name>, and (?C"arg"). Those could be required instead. An extended regex literal syntax e.g re#'...'# would also avoid this issue.

Prefixed double quote `re"..."`

This would be a double quoted version of re'...', more similar to string literal syntax. This has the advantage that single quote regex syntax e.g (?'name') would continue to work without requiring the use of the alternative syntax or extended literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference.

Single letter prefixed quote `r'...'`

This would be a slightly shorter version of re'...'. While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings.

Single quotes `'...'`

This would be an even more concise version of re'...' that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that '...' denotes a regex as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules).

We could help distinguish it from a string literal by requiring e.g '/.../', though it may not be clear that the / characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of '...' as a future literal kind.

Magic literal `#regex(...)`

We could opt for a more explicitly spelled out literal syntax such as #regex(...). This is a more heavyweight option, similar to #selector(...). As such, it may be considered syntactically noisy as e.g. a function argument: str.match(#regex([abc]+)) vs str.match(/[abc]+/).

Such a syntax would require the containing regex to correctly balance parentheses for groups, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex.

We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as #regex(/.../). However this is even more heavyweight, and it may be unclear that / is part of the delimiter rather than part of an argument. Alternatively, we could replace the internal delimiter with another character such as #regex`...` , #regex{...}, or #regex/.../. However those would be inconsistent with the existing #literal(...) syntax and the first two would overload the existing meanings for the `` and {} delimiters.

It should also be noted that #regex(...) would introduce a syntactic inconsistency where the argument of a #literal(...) is no longer necessarily valid Swift syntax, despite being written in the form of an argument.

Shortened magic literal `#(...)`

We could reduce the visual weight of #regex(...) by only requiring #(...). However it would still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax.

Using a different delimiter for multi-line

Instead of re-using the extended delimiter syntax #/.../# for multi-line regex literals, we could choose a different delimiter for it. Unfortunately, the obvious choice for a multi-line regex literal would be to use /// delimiters, in accordance with the precedent set by multi-line string literals """. This signifies a (documentation) comment, and as such would not be viable.

Reusing string literal syntax

Instead of supporting a first-class literal kind for regex, we could instead allow users to write a regex in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to the Regex type.

let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"#

However we decided against this because:

We would not be able to easily apply custom syntax highlighting and other editor features for the regex syntax.
It would require a Regex contextual type to be treated as a regex, otherwise it would be defaulted to String, which may be undesired.
In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex.
Regex-specific escape sequences such as \w would likely require the use of raw string syntax #"..."#, as they are otherwise invalid in a string literal.
It wouldn't be compatible with other string literal features such as interpolations.

No custom literal

Instead of adding a custom regex literal, we could require users to explicitly write try! Regex(compiling: "[abc]+"). This would be similar to NSRegularExpression, and loses all the benefits of parsing the literal at compile time. This would mean:

No source tooling support (e.g syntax highlighting, refactoring actions) would be available.
Parse errors would be diagnosed at run time rather than at compile time.
We would lose the type safety of typed captures.
More verbose syntax is required.

We therefore feel this would be a much less compelling feature without first class literal support.

rvsrvs · April 13, 2022, 7:11pm

Looking at the alternatives considered, it is unclear to me which would be source-breaking and which would not. Given that my preference would be for less consistency with, e.g. Perl, over breaking existing Swift code, even at the expense of slightly more visual noise, it would be good to know which alternatives are breaking and which are not.

hamishknight · April 13, 2022, 7:40pm

None of the alternatives would require source breaking changes, I've edited it to clarify this

Jeehut · April 13, 2022, 8:55pm

Having used regexes in Ruby quite a bit, I love that you’re opting for the same syntax - I’ve always liked it a lot! And with ‘#/’ I don’t even have to escape slashes, which is really useful as I’m parsing URLs or paths quite often.

+1 from me!

stackotter · April 13, 2022, 9:39pm

I think something like #regex(blah) would be more consistent with swift’s other more advanced literal types such as color literals. It would also avoid the comment ambiguity and would avoid the breaking change of banning / in operators (which would affect the swift-case-paths package for example which has a significant number of users). I believe that the operator breaking change should be avoided at all costs.

ksluder · April 13, 2022, 9:48pm

This seems like a high price for a syntax that is entirely superseded by the #/ ... /# syntax.

jayton · April 13, 2022, 9:56pm

Yes, it seems odd that “just” #/.../# isn’t in Alternatives Considered.

1-877-547-7272 · April 13, 2022, 10:12pm

Looks good! I personally like the ability to use regexes the natural way (/.../ instead of #/.../#) — the ambiguous cases seem rare and easily identifiable with syntax highlighting. Overall, I support this pitch.

A few nitpicks:

The link here takes me to a 404 page.

hamishknight:

let regex = #/
  # Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  (?<kind>    \w+)                \s\s+
  (?<date>    \S+)                \s\s+
  (?<account> (?: (?!\s\s) . )+)  \s\s+ # Note that account names may contain spaces.
  (?<amount>  .*)
  /#

# as the comment delimiter feels unnatural for Swift. Could we not use something like #// instead? Are multi-line regexes necessary at all? IMO this isn't significantly better than

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  #/(?x)(?<kind>    \w+)                \s\s+/#
  #/(?x)(?<date>    \S+)                \s\s+/#
  #/(?x)(?<account> (?: (?!\s\s) . )+)  \s\s+/# // Note that account names may contain spaces.
  #/(?x)(?<amount>  .*)/#
}

blangmuir · April 13, 2022, 10:29pm

Minor clarification: should this be return (String(match.identifier), hex)?

xwu · April 13, 2022, 10:53pm

While it seems obvious at first glance that /.../ is far superior to the alternatives (I was super eager to read the pitch to see how this was all made to work out!), the sheer number of scenarios outlined here where the workaround (or one of them) is “just use #/.../# instead,” coupled with the source compatibility issues, really quickly sours the initial impression.

I’m concerned that we’re proposing syntax restrictions that are basically heuristics such that f(*, /) is fine and f(/, *) is totally dandy (as long as we don’t then try to divide the result on the same line!) but f(/, /) means something totally different.

That #/.../# could be adopted immediately without a language version because it doesn’t share the same source compatibility issues is no small matter either.

I don’t necessarily think this means that bare /.../ has to be entirely jettisoned if there’s some more restrictive set of syntax rules that leans into this “just use #/.../# instead” approach more fully, limiting bare syntax to the simplest of cases where there’s no ambiguity—if such a set of rules can be found.

scanon · April 13, 2022, 10:58pm

It's not "just" consistency with Perl. In as much as there's an accepted spelling for regexes (especially in the UNIX tools world), it's /.../. It's not universal, but among languages and tools that have a dedicated delimiter, it's the most common by a wide margin.

Aesthetics and visual load matter. To pick an extreme example, if we used #[...]# for array literals in Swift, most people would rightly think that was absurd. To my mind, the question boils down to:

Is /.../ a better default syntax than #/.../#?
Yes. In complex regexes, it doesn't matter much, both just disappear into the line noise. In simple regexes, however, it does make a difference.
Are regexes a sufficiently important use case to deserve a "special" syntax?
This is somewhat subjective. I'm a numerics guy, I don't use them that often, I can easily put forward an argument that regexes are not any more special as program literals than, say, a concise language for linear algebra snippets or SQL queries. If one holds this view, then something like #regex(...) is pretty attractive, since it is trivially extensible to other domains.

But I have to acknowledge that regexes are part of the backbone of idiomatic string processing in Swift, which has heretofore been one of the largest pain points for adopters coming from other languages. From this perspective, yes, they probably do deserve a special lightweight literal syntax, and /.../ is the natural candidate.

I wholeheartedly agree with this in principle, but in practice, it turns out that these don't actually happen very often at all.

stackotter · April 13, 2022, 11:00pm

I completely agree with this argument. “It’s familiar in other languages” is a common argument for the single slash delimiters and I just don’t think that it’s at all worth it given all the complexities and edge cases it introduces. Both the #/ or #regex( syntaxes remove all of the ambiguities that I’ve noticed so far, and they seem like they would act much more desirably whichever is chosen.

stackotter · April 13, 2022, 11:11pm

That situation may not happen in practice, but there are ways you could use the CasePaths library that would cause ambiguity and therefore be broken with the new regex syntax if single slash delimiters were chosen. The pitch also wants to ban / as a custom operator of a certain type which would break the CasePaths library.

rvsrvs · April 13, 2022, 11:26pm

It's not "just" consistency with Perl. In as much as there's an accepted spelling for regexes (especially in the UNIX tools world), it's /.../ . It's not universal, but among languages and tools that have a dedicated delimiter, it's the most common by a wide margin.

My point was not to say that we shouldn't look at what the broader programming community has adopted notationally, but that backwards compatibility also matters and that on this particular change it weights more heavily with me personally than sharing common notation with other languages. I fully acknowledge other points of view here.

Aesthetics and visual load matter.

Certainly, aesthetics and visual load matter. I've been (quietly) disappointed in several language changes in the recent past for that very reason.

I'm one of those who uses the case path operator frequently and it would be both disruptive and unaesthetic to force another operator into my code at this point. While I can't say I spend my life with regexes these days, I have done and expect I will in the future do a fair amount of parsing. On balance I'd prefer the #/ ... /# notation especially if it is going to be there anyway.

To pick an extreme example, if we used #[...]# for array literals in Swift, most people would rightly think that was absurd.

I'm quite sympathetic to this. Had the / ... / syntax entered the language at the same time as [ ... ] I would acknowledge that syntax as plain common sense if not genius.

I'm a numerics guy, I don't use them that often, I can easily put forward an argument that regexes are not any more special as program literals than, say, a concise language for linear algebra snippets or SQL queries. If one holds this view, then something like #regex(...) is pretty attractive, since it is trivially extensible to other domains.

I find this argument very appealing. You've convinced me. There's a smiley there, but I'm quite serious. At this point, I'd actually vote for #regex(...) based on the extensibility combined with backwards compatibility.

xwu · April 13, 2022, 11:54pm

I wholeheartedly agree with both points and would be thrilled to see a solution which enables the use of the “bare” syntax.

Yet, after reviewing this pitch, I end up agreeing with @ksluder and @jayton. Clearly, then, for me at least the issue boils down to additional considerations than the two points above, and those considerations are enough to sway my opinion despite complete agreement with you on those two questions.

Others are articulating what those additional considerations are, and I’m not blessed with enough time to organize my thoughts as clearly as I’d like. However, I’d vaguely gesture in the direction of the question as to whether the proposed syntax “fits” Swift’s current design and direction.

My overall gestalt is that it fits very poorly—catastrophically poorly (in the original Greek sense of a sudden turn or overturning), given the need to break existing unrelated uses. I dearly wish that weren’t the case, for the reasons you outline above about how nice the syntax is.

Jeehut · April 14, 2022, 12:11am

The authors of swift-case-paths said themselves that their wish is actually for Swift to adopt case paths the same way key paths are currently supported. So in the long run this specific library should be rendered unnecessary anyways and we could just use backslashes when they are officially adopted into Swift to be consistent with key paths.

Jon_Shier · April 14, 2022, 12:14am

Doesn't really help in the meantime, especially when there's no timeline for that to happen, or even a guarantee that it will.

rvsrvs · April 14, 2022, 12:14am

Unfortunately, there seems to be no interest in additional work on optics. The wait could be very long.

Jeehut · April 14, 2022, 12:17am

But I wouldn’t bother if the syntax was #/, I’m currently escaping my String regex patterns that way anyways just to never have to escape anything. So I would be totally happy if /…/ wasn’t possible but it’d require #/…/#, too.

masters3d · April 14, 2022, 12:38am

To me there are two concerns:

being able to type check a regex at compile time.
provide syntax highlighting and completion for regex

I think the proposal would improve if these two concerns could be somehow separated and later composed. Being able to check if a regex literal is correct at compile time should be doable as an initializer on regex (if not, why not?). For simple enough regex I would suspect it would be nice to be able to pass a string pattern to a compile time regex init.

I think if we were able to do compile time check on regex based on string patterns then most of the simple regex cases could use this version of comptime. With the simple cases taken care of by a comptime then we could focus this proposal on the more advanced cases that might want multiline which use #’s. We should only support the # version of the regex literal and forgo the version that starts with / since that is meant to address the simpler cases.

I like a version of regex#….#.