SE-0354: Regex Literals

hooman · May 4, 2022, 4:42pm

To be clear, I am not advocating for those specific options. I was explaining how assuming a separate syntactic context within #something(...) can provide flexibility to support even those things if we wanted to. The main ideas I am trying to communicate are:

I propose we define the contents of #something(...) (what appears between those parentheses) as a new syntactic context where we can use delimiters to mean something without affecting their meaning outside of that context.

I also propose we define a special trailing literal sugar for those contexts so that e.g. #re('[regex]+') could be written as #re'[regex]+' without breaking anything.

Multi-line extension of the above syntax is also possible with the same technique of surrounding with #s.

To summarize: I am OK with #/.../# as well as any variant of the above logic that we come up with. But not at all OK with bare /.../. I also agree that trying to keep / as part of regex literal syntax is helpful.

hooman · May 4, 2022, 4:45pm

This is an instance of a variant that is covered by the design I propose above.

scanon · May 4, 2022, 5:01pm

What are the # and the (...) actually buying us here? If we were going to do this, I would simply use re'...', and get rid of the noise.

Jumhyn · May 4, 2022, 5:04pm

The #, at least, buys us a clearer indication of "literal" than direct juxtaposition of identifier characters with a single-quote-delimited string of characters, IMO.

scanon · May 4, 2022, 5:23pm

The #, at least, buys us a clearer indication of "literal" than direct juxtaposition of identifier characters with a single-quote-delimited string of characters, IMO.

I don't buy that at all. # is used as a prefix for all sorts of non-literal stuff (#if, #file, #line, ...), and all of the most common literals (strings, arrays, dictionaries, numbers ...) do not have a # prefix. I think you can reasonably argue that # indicates something, but it's not "literal".

codename · May 4, 2022, 5:36pm

I don’t find the # to indicate ‘literal’ necessarily, but rather in combination with the '...' delimiters that it indicates a different kind of literal — the program literal:

As such, regex literals are more like "program literals" than "data literals"

ben-cohen · May 4, 2022, 5:52pm

That something is generally "do special compiler-integrated thing". Supply a line number, check the runtime version in a way that affects availability checking, add/remove code before compiling, allow unescaped " in this string literal. Literals like #selector are a subset of those special things the compiler does. But as you observe, only for specific (often platform-specific or esoteric) literals. The regular literals are certainly special compiler-integrated things, but don't use # for well-understood reasons. The question then becomes "are regexes esoteric and thus the noise is unimportant, unlike arrays or dictionaries where it would be unacceptable".
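As a rough illustration of the "special compiler-integrated things" listed above, here is a minimal sketch that uses one #-prefixed form per case. The function and variable names (`whereAmI`, `buildKind`, `hasModernOS`, `quoted`) are made up for this example, and the `DEBUG` flag is an assumption about build settings.

```swift
// #line and #function are filled in by the compiler with source-location values.
func whereAmI(line: Int = #line, function: String = #function) -> String {
    return "\(function):\(line)"
}

// #if adds or removes code before compilation.
// DEBUG is an assumed compilation condition; it may or may not be set.
func buildKind() -> String {
    #if DEBUG
    return "debug"
    #else
    return "release"
    #endif
}

// #available checks the OS version at runtime and also informs
// the compiler's availability checking inside the branch.
func hasModernOS() -> Bool {
    if #available(macOS 13, iOS 16, *) {
        return true
    }
    return false
}

// Extended string delimiters allow an unescaped " inside the literal.
let quoted = #"She said "hi""#
```

None of these four forms is a literal in the same sense as an array or a number, which is the distinction the posts above are debating.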

Jumhyn · May 4, 2022, 6:10pm

It's not only used for literals, but it's used for a whole lot of literals and literal-like things (which, IMO, includes #file, #line and family)—enough that I do have the association of "# in value position probably means some sort of literal" in my mind.

jeremy · May 4, 2022, 6:12pm

I thought that the suggested use of # (at least in #/.../#) was by analogy with extended string delimiters. That seems quite different from how it is used in #available if etc.

hooman · May 4, 2022, 6:29pm

It is about a logical framework to guide how we derive these kinds of things. That extra # that remains after applying trailing literal sugar does indicate something:

A # prefix generally indicates that whatever carries it affects how the compiler treats what comes next. A regex literal is not an ordinary literal. It is actually a program that the compiler should syntax-check and validate, even though it is interpreted at runtime. Putting aside string interpolation, which is an exception because of how ubiquitous string literals are, other kinds of literals that represent a template or source code in some foreign language or template format deserve and need special distinction, and I think #-prefixed keywords are a perfect fit for this.

Now the question is: do we want to accept regex literals as an integral part of the language and give them all the privileges of something like a string literal, even at the expense of feature removals and added complexity in the language, or are we willing to keep some distance? I believe we should keep some distance from this particular literal syntax, as it is not Swifty at all and does not feel like fully native Swift.

I fully support providing excellent support for regexes as part of the steps we are taking to improve the string processing story of the language: providing compatibility with, and improving upon, what is already out there, and letting developers who already have regex knowledge bring their prior experience and hard-built literals from other environments to Swift.

I disagree with giving legacy regex literal syntax fully native Swift blessing. Look at what we did with string interpolation: We ditched the traditional C/Unix way and invented our own wonderful "\(...)" solution and Swift is so much better for it. I agree that it is not practical to do this with literal regex syntax (yet?), but I think we should at least keep some distance by that very same slightly noisy # prefix. Maybe one day we can use something like '...' to represent the true native Swift literal pattern matching format.
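To make the string interpolation comparison concrete, here is a small sketch contrasting a C-style format string with Swift's "\(...)" form. The variable names are illustrative only, and `String(format:)` comes from Foundation rather than the standard library.

```swift
import Foundation

let count = 3

// C/Unix-style: a format specifier in the string, arguments supplied separately.
let formatted = String(format: "%ld items", count)

// Swift-style: the expression is embedded directly in the literal.
let interpolated = "\(count) items"
```

Both produce the same text; the difference is only in where the expression lives relative to the literal.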

hooman · May 4, 2022, 6:32pm

What I am arguing in the previous reply to @scanon is that the proposed literal syntax is not Swifty enough to earn the full embrace of giving it /.../ (especially considering the costs).

codename · May 4, 2022, 8:02pm

Being so richly supported by the compiler (to provide validation and diagnostics, type-safety etc. beside being a custom literal) and likely tooling (for syntax highlighting etc.), I think it makes sense to embrace this notion of having a special compiler treatment.
IMO this also reinforces that, being an embedded DSL, the contents of the literal are governed by a different ("special") set of rules than regular Swift source code. Consequently I don't find the # to be noise at all, but relevant to the context (more so than e.g. an Array literal).

I think the platform/language integration model is a reasonable fit for an embedded language form such as Regex (as mentioned above).

While different in their own way I would also consider these a sort of (dynamic) 'compiler literal' in a sense. IMHO embracing this for Regex literals could also encourage a more coherent mental model of such "compiler-integrated" features and potentially reduce the 'esoteric' feeling of existing #literal forms.

tem · May 4, 2022, 8:33pm

While I'm sympathetic toward the goal of delivering regexes with the simplest and most recognizable literal syntax, I do find some of the opposing arguments compelling. So I was wondering how the "only #/../# extended literals" alternative would look.

// short regex literals
let regex = Regex {
  Capture { #/[$£]/# }
  TryCapture {
    #/\d+/#
    "."
    #/\d{2}/#
  } transform: {
    Amount(twoDecimalPlaces: $0)
  }
}

// medium
let regex =  /([ab])|\d+/
let regex = #/([ab])|\d+/#

// long+
let regex =  /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/
let regex = #/(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/#

My opinion is that the wrapping #'s hardly make a difference to the legibility of any individual regex literal. Short ones remain relatively short and simple, long ones remain long and obtuse.

However, in cases where many (probably short) literals are used, such as with the regex DSL, the sheer number of #'s looks a bit jarring to me. Maybe with the right syntax highlighting theme it would be fine (e.g. with a subtle gray on the #'s).

Also, I imagine that typing #/ in a good editor should autocomplete /# ahead of the cursor, making it that much easier to type (although this is NOT the case today with raw strings in Xcode and VS Code!).

Of course, the other big issue is that the extended literal syntax strongly suggests existence of the bare version. Perhaps even so much that even if the bare syntax were rejected, library authors might from now on avoid using prefix / operators, considering its future uncertain, which would in turn give the space to bare regex literals to eventually justify the breakage?

What about allowing / to be wrapped in backticks to disambiguate it as an operator rather than the start delimiter of a bare regex literal?

prefix func / (...) -> ...
let casepath = `/`Enum.a      // parse error today

Similar to:

func await (...) -> ...
`await`(...)                  // OK

Not great, but also not that bad? It would still cause a source break but would allow continued use of an operator with semblance to the backslash. Perhaps I'm missing something obvious as to why this is not already allowed today.
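For reference, this is a minimal sketch of how backtick escaping already works for reserved words, which is the mechanism the suggestion above would extend. The function name and `echoed` are illustrative. Backtick-quoting an operator such as / is the hypothetical part and is not shown, since it does not parse today.

```swift
// `repeat` is a reserved word, so backticks are required to use it as a name.
func `repeat`(_ text: String, times: Int) -> String {
    return String(repeating: text, count: times)
}

// The call site needs the same backticks to disambiguate from the keyword.
let echoed = `repeat`("ab", times: 3)
```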

hooman · May 4, 2022, 9:23pm

To avoid the issue you raise, we should not call it the extended syntax, but a foreign literal syntax and interpret # more like a compiler directive than its analogy to its role in String. Also, #/.../# will behave differently in how it interprets / and this by itself will cause compatibility issues. You won't be able to simply copy/paste a foreign /.../ regex and just enclose it in a pair of #s.

That is why I proposed we start with something that does not use balanced #s. For example, #re/.../ which would interpret '/' exactly the same as /.../ and would get extended behavior when used as #re/.../# or #re#/.../#, and then consider #/.../# family as its shorthand syntax. We will use the shorthand all the time in practice, but this will define away the issue.

tem · May 4, 2022, 9:55pm

If I'm reading the discussion right in the old [Pitch] Regex Syntax - #12 by Michael_Ilseman and the current draft of Regex Syntax and Runtime Construction then escaped slashes would be treated the same in both bare and extended literals:

A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded by a backslash, in which case it has no effect, e.g. \% is literal %. However this does not apply to either non-whitespace Unicode characters, or to unknown ASCII letter character escapes, e.g. \I is invalid and would produce an error.

Because backslashes are not treated as literal in "raw"/extended literals (unlike raw strings).

This syntax differs from raw string literals #"..."# in that it does not treat backslashes as literal within the regex. A string literal #"\n"# represents the literal characters \n. However a regex literal #/\n/# remains a newline escape sequence.

So the backslashes in #/\/path\/to\/files/# would just be redundant.
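The difference between raw strings and extended regex literals described above can be sketched as follows. This assumes Swift 5.7 or later (and, on Apple platforms, an OS version that ships `Regex`); the function name `slashDemo` and its tuple labels are made up for the example.

```swift
// Assumes Swift 5.7+, where #/.../# regex literals are enabled by default.
func slashDemo() -> (rawCount: Int, newline: Bool, escaped: Bool, unescaped: Bool) {
    // In a raw string the backslash is literal: two characters, '\' and 'n'.
    let rawString = #"\n"#

    // In an extended regex literal, \n is still a newline escape.
    let newline = #/\n/#

    // Escaping '/' is allowed but redundant inside #/.../#.
    let escaped = #/path\/to\/files/#
    let unescaped = #/path/to/files/#

    let path = "path/to/files"
    return (
        rawCount: rawString.count,
        newline: "a\nb".contains(newline),
        escaped: path.wholeMatch(of: escaped) != nil,
        unescaped: path.wholeMatch(of: unescaped) != nil
    )
}
```

Under these assumptions both the escaped and unescaped patterns match the same path, which is the sense in which the backslashes are redundant.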

I can kind of see the logic there, but I could ask: why is there no shorthand for the non-extended #re/.../ ? It seems like either way you approach the #/.../# syntax it would be odd not to have the bare version.

hooman · May 4, 2022, 10:10pm

I am referring to what is said in this proposal:

(Emphasis mine)

This indicates that without #s, for the bare /.../, we need to escape forward slashes and that is how all existing regexes that use this bare form already work.

Because there is no good reason for using it in native Swift. We only need it when we are copy/pasting an existing /.../ literal from outside Swift (with forward slashes already escaped). Such a case is more foreign and deserves more attention, because we don't need to escape forward slashes in normal Swift strings. If we are copy/pasting an existing string to turn it into a regex, that bare format would be a poor choice.

scanon · May 4, 2022, 11:04pm

This is simply false. The experience of the standard library team as we've been working on the feature has been that it's quite desirable to use the literals for new regexes, even in the presence of the DSL (often as a component of the DSL). This actually enhances readability on the whole, because the literal can be more concise for simple usage without introducing undue complexity, allowing users to more quickly reason about the Regex as a whole.

We very much do not expect their usage to be restricted to pasting regexes from other languages. Certainly some people will use them only in this fashion, but we expect that most people will use both syntaxes pretty freely.

hooman · May 4, 2022, 11:33pm

I agree that for people who are fluent in classic regexes (bare slash style) this is much more convenient and readable. The question is what percentage of Swift programmers are expected to be fluent with classic regexes? How would they feel when they encounter this? For them, I suspect, escaping / will feel inconsistent with the rest of the language.

Nobody1707 · May 5, 2022, 12:46am

I'm (weakly) on the #/.../# side of the argument, but I don't see why escaping the / would feel foreign. It'd be exactly the same as escaping a " in a string.

Nobody1707 · May 5, 2022, 1:32am

Yes. Typo.