SE-0354: Regex Literals

This is about having a logical framework to guide how we derive this kind of syntax. The extra # that remains after applying the trailing literal sugar does indicate something:

A # prefix generally indicates that whatever carries it affects how the compiler treats what comes next. A regex literal is not an ordinary literal. It is actually a program that the compiler should syntax-check and validate, even though it is interpreted at runtime. Putting aside string interpolation, which is an exception because of how ubiquitous string literals are, other kinds of literals that represent a template or source code in some foreign language or template format deserve and need special distinction, and I think #-prefixed keywords are a perfect fit for this.

Now the question is: do we want to accept regex literals as an integral part of the language and give them all the privileges of something like a string literal, even at the expense of feature removals and added complexity in the language, or are we willing to keep some distance? I believe we should keep some distance from this particular literal syntax, as it is not Swifty at all and does not feel like fully native Swift.

I fully support providing excellent support for regexes as part of the steps we are taking to improve the string processing story of the language: providing compatibility, improving upon what is already out there, and letting developers who have prior knowledge and hard-built literals from other environments bring them to Swift.

I disagree with giving the legacy regex literal syntax a fully native Swift blessing. Look at what we did with string interpolation: we ditched the traditional C/Unix way and invented our own wonderful "\(...)" solution, and Swift is so much better for it. I agree that it is not practical to do this with literal regex syntax (yet?), but I think we should at least keep some distance by keeping that very same slightly noisy # prefix. Maybe one day we can use something like '...' to represent the true native Swift literal pattern matching format.

7 Likes

What I am arguing in the previous reply to @scanon is that the proposed literal syntax is not Swifty enough to earn the full embrace of giving it /.../ (especially considering the costs).

2 Likes

Since regex literals are so richly supported by the compiler (for validation, diagnostics, type safety, etc., besides being a custom literal) and likely by tooling (for syntax highlighting etc.), I think it makes sense to embrace this notion of special compiler treatment.
IMO this also reinforces that, being an embedded DSL, the contents of the literal are governed by a different ("special") set of rules than regular Swift source code. Consequently I don't find the # to be noise at all, but relevant to the context (more so than e.g. an Array literal).

I think the platform/language integration model is a reasonable fit for an embedded language form such as Regex (as mentioned above).

While different in their own way, I would also consider these a sort of (dynamic) 'compiler literal'. IMHO embracing this for Regex literals could also encourage a more coherent mental model of such "compiler-integrated" features and potentially reduce the 'esoteric' feeling of existing #literal forms.

3 Likes

While I'm sympathetic toward the goal of delivering regexes with the simplest and most recognizable literal syntax, I do find some of the opposing arguments compelling. So I was wondering how the "only #/../# extended literals" alternative would look.

// short regex literals
let regex = Regex {
  Capture { #/[$£]/# }
  TryCapture {
    #/\d+/#
    "."
    #/\d{2}/#
  } transform: {
    Amount(twoDecimalPlaces: $0)
  }
}

// medium
let regex =  /([ab])|\d+/
let regex = #/([ab])|\d+/#

// long+
let regex =  /(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/
let regex = #/(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)/#

My opinion is that the wrapping #'s hardly make a difference to the legibility of any individual regex literal. Short ones remain relatively short and simple, long ones remain long and obtuse.

However, in cases where many (probably short) literals are used, such as with the regex DSL, the sheer number of #'s looks a bit jarring to me. Maybe with the right syntax highlighting theme it would be fine (e.g. with a subtle gray on the #'s).
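For anyone who wants to try the DSL example above, here is a self-contained sketch. `Amount(twoDecimalPlaces:)` is not defined anywhere in the thread, so this version swaps in `Double.init` as a stand-in transform; the names `priceRegex`, `priceMatch`, `currency`, and `amount`, and the sample input, are assumptions made for illustration.

```swift
import RegexBuilder

// Extended-delimiter literals used as components of the regex DSL.
// Double.init is a stand-in for the thread's undefined Amount(twoDecimalPlaces:).
let priceRegex = Regex {
  Capture { #/[$£]/# }        // currency symbol
  TryCapture {
    #/\d+/#
    "."
    #/\d{2}/#
  } transform: {
    Double($0)                // returning nil would reject the match
  }
}

// Hypothetical input, chosen only to exercise both captures.
let priceMatch = "Total: £12.34".firstMatch(of: priceRegex)
let currency = priceMatch?.output.1   // Substring?
let amount = priceMatch?.output.2     // Double?
```

Only the extended `#/.../#` form is used here, so the sketch does not depend on bare-slash literals being enabled.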

Also, I imagine that typing #/ in a good editor should autocomplete /# ahead of the cursor, making it that much easier to type (although this is NOT the case today with raw strings in Xcode and VS Code!).

Of course, the other big issue is that the extended literal syntax strongly suggests the existence of the bare version. Perhaps so much so that, even if the bare syntax were rejected, library authors might from now on avoid using prefix / operators, considering their future uncertain, which would in turn give bare regex literals the space to eventually justify the breakage?

What about allowing / to be wrapped in backticks to disambiguate it as an operator rather than the start delimiter of a bare regex literal?

prefix func / (...) -> ...
let casepath = `/`Enum.a      // parse error today

Similar to:

func await (...) -> ...
`await`(...)                  // OK

Not great, but also not that bad? It would still cause a source break but would allow continued use of an operator with semblance to the backslash. Perhaps I'm missing something obvious as to why this is not already allowed today.

9 Likes

To avoid the issue you raise, we should not call it the extended syntax but a foreign literal syntax, and interpret # more like a compiler directive than by analogy with its role in string literals. Also, #/.../# will behave differently in how it interprets /, and this by itself will cause compatibility issues. You won't be able to simply copy/paste a foreign /.../ regex and just enclose it in a pair of #s.

That is why I proposed we start with something that does not use balanced #s. For example, #re/.../ which would interpret '/' exactly the same as /.../ and would get extended behavior when used as #re/.../# or #re#/.../#, and then consider #/.../# family as its shorthand syntax. We will use the shorthand all the time in practice, but this will define away the issue.

3 Likes

If I'm reading the discussion right in the old [Pitch] Regex Syntax - #12 by Michael_Ilseman and the current draft of Regex Syntax and Runtime Construction, then escaped slashes would be treated the same in both bare and extended literals:

A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded by a backslash, in which case it has no effect, e.g \% is literal % . However this does not apply to either non-whitespace Unicode characters, or to unknown ASCII letter character escapes, e.g \I is invalid and would produce an error.

Because backslashes are not treated as literal in "raw"/extended literals (unlike raw strings).

This syntax differs from raw string literals #"..."# in that it does not treat backslashes as literal within the regex. A string literal #"\n"# represents the literal characters \n . However a regex literal #/\n/# remains a newline escape sequence.

So the backslashes in #/\/path\/to\/files/# would just be redundant.
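A small sketch of the two behaviours quoted above, using only the extended `#/.../#` form. The variable names and the sample text are assumptions for illustration, not taken from the proposal.

```swift
// Raw string: the backslash is literal, so this holds two characters, "\" and "n".
let rawString = #"\n"#

// Extended regex literal: \n is still a newline escape.
let newline = #/\n/#

// Escaped and unescaped slashes describe the same pattern inside #/.../#.
let escapedPath = #/path\/to\/files/#
let unescapedPath = #/path/to/files/#

// Hypothetical sample text containing the path and a real newline.
let text = "see path/to/files\nfor details"

let rawCount = rawString.count
let hasNewline = text.contains(newline)
let bothMatch = text.contains(escapedPath) && text.contains(unescapedPath)
```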

I can kind of see the logic there, but I could ask: why is there no shorthand for the non-extended #re/.../ ? It seems like either way you approach the #/.../# syntax it would be odd not to have the bare version.

I am referring to what is said in this proposal:

(Emphasis mine)

This indicates that without #s, for the bare /.../, we need to escape forward slashes and that is how all existing regexes that use this bare form already work.

Because there is no good reason for using it in native Swift. We only need it when we are copy/pasting an existing /.../ literal from outside Swift (with forward slashes already escaped). Such a case is more foreign and deserves more attention: we don't need to escape forward slashes in normal Swift strings, and if we are copy/pasting an existing string to turn it into a regex, that bare format would be a poor choice.

2 Likes

This is simply false. The experience of the standard library team as we've been working on the feature has been that it's quite desirable to use the literals for new regexes, even in the presence of the DSL (often as a component of the DSL). This actually enhances readability on the whole, because the literal can be more concise for simple usage without introducing undue complexity, allowing users to more quickly reason about the Regex as a whole.

We very much do not expect their usage to be restricted to pasting regexes from other languages. Certainly some people will use them only in this fashion, but we expect that most people will use both syntaxes pretty freely.

7 Likes

I agree that for people who are fluent in classic regexes (bare slash style) this is much more convenient and readable. The question is what percentage of Swift programmers are expected to be fluent with classic regexes? How would they feel when they encounter this? For them, I suspect, escaping / will feel inconsistent with the rest of the language.

2 Likes

I'm (weakly) on the #/.../# side of the argument, but I don't see why escaping the / would feel foreign. It'd be exactly the same as escaping a " in a string.
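To make the analogy concrete, here is a rough sketch. A bare literal for the date pattern would be written `/\d{2}\/\d{2}\/\d{4}/`; it appears only in this sentence, not in the code, because bare literals may need to be enabled in the compiler. The sample strings and all variable names are assumptions.

```swift
// In an ordinary string the delimiter " has to be escaped...
let quoted = "She said \"hi\""
// ...unless the raw-string form is used.
let quotedRaw = #"She said "hi""#

// The regex counterpart: the escaped form of the pattern still works inside
// #/.../#, but the extended delimiters make the escapes optional.
let dateEscaped = #/\d{2}\/\d{2}\/\d{4}/#
let dateExtended = #/\d{2}/\d{2}/\d{4}/#

// Hypothetical input with a slash-separated date.
let sample = "Due 24/05/2022"
let escapedResult = sample.firstMatch(of: dateEscaped)?.output    // Substring?
let extendedResult = sample.firstMatch(of: dateExtended)?.output  // Substring?
```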

1 Like

Yes. Typo.

I think this is subjective, and depends on how you use regexes and what you typically match. I encounter / far too many times in what I am doing. It is in URLs, dates, normal text, SKUs, part numbers, math expressions, (Unix) file paths, etc. I find it very inconvenient to have to escape /. On the other hand, I rarely encounter " in the strings I am working with.

2 Likes

This proposal, and the review discussion, tease at some of the fundamental design inflection points for the Swift language. I appreciate immensely the various topics and counterpoints that have surfaced on this review thread. Those comments will provide a valuable signal to the core team to evaluate this proposal.

Speaking from my perspective of the proposal, I strongly favor the proposed language change.

The regular expression proposals intend to close further the gaps on long-standing goals for Swift's string-processing capabilities. Since Swift's inception, there has been a standing goal for string-processing to be powerful while also clean and readable. Simple regular expressions are a time-tested and recognizable way to express intent for matches in strings. Combining them directly into the language with Swift's Unicode-first model will dramatically improve Swift's string processing capabilities.

One of Swift's goals is clean, clear syntax. We have always given specific consideration to the treatment of essential concepts in the language to achieve the broader goals of clean and intelligible code. These principles appear throughout Swift, starting with dropping mandatory semicolons and the parentheses around conditions. Regular expressions are essential because first-class string processing in a general-purpose programming language is essential. As such, first-class regular expression integration into Swift should adhere to these principles. While #/.../# could support the feature on its own, it doesn't achieve the same level of clean syntax that we can achieve with the proposed /.../ syntax.

I know there are concerns about the complexity of the parser rules as outlined in the proposal. I thank the proposal authors for calling those out so thoroughly. I do not find these rules concerning. Swift has many existing rules in the parser for making much of the intuitive syntax in Swift "just work," some far more complicated than the ones mentioned in the proposal. While we should not aim to throw kerosene on a fire, I do not believe that is the case here. The proposed parsing rules are well within a threshold of complexity for the parser's reasoning and implementation. Further, after running this change through millions of lines of code, this change triggered only a couple of instances of parsing ambiguity. Statistically, the data suggests the parsing rules are not a concern in practice when working with the Swift code people write today.

I believe the most crucial question here is the tough call around source compatibility and whether or not any source breaks are tolerable at any stage in the evolution of Swift, even when staged. Swift is still a language evolving in service of its users. While new essential features aren't added to Swift with the same regularity as in its first nascent years, our work on rounding out the fundamentals of the language isn't over. With Swift concurrency, new keywords (such as "async") were added to the language because those concepts are fundamental to modern programming, even though they also introduced a source break. It would have been calamitous not to give concurrency the clean and clear syntax it needed and deserved. Source breaks should not be inflicted recklessly on Swift code, as they can potentially destabilize the Swift ecosystem and burden developers. While some noteworthy points have surfaced in this review thread, I believe the proposal strikes a good balance that allows the language to move forward with the proper application of blessed syntax for regular expressions while giving users a path to move at their own pace. I also believe the exhibited data on the amount of code trialed on the new syntax shows that the source break will not manifest pervasively. The combination of the infrequency of source breaks in practice and the staging via a language mode convinces me there is a good path forward here, as outlined in the proposal.

17 Likes

I'm sorry, but we don't have the data needed to make this statement. We have only the narrow view of the compatibility suite, which says nothing about the number of apps affected by each break. We really don't know how much breakage this will cause, just that it will be some number of orders of magnitude greater than what's reflected in the compatibility suite.

I won't reiterate my other points here, I just wanted to point out the data issue.

5 Likes

That sounds way better than making prefix / impossible to express. With this, migration can happen independently for each module and can be automated. Coordinating API changes between libraries also becomes optional.

6 Likes

It is true that there is a wide population of code out there, and that nobody can audit it all. We can only draw inferences based on the data that we have.

I cannot share specific numbers, but there is a lot of Swift code at Apple, and we ran this change over that code and encountered one project that had an issue. Of course, that population of code may not be representative of all the kinds of Swift projects out there (leading to selection bias), but I believe that this population of code is not wildly different from much of the code in existence.

So I will rephrase my statement: I believe, based on the code this change has been tested on, that the break won't be pervasive. The break will, however, impact some codebases more than others.

We know there is a break here, and for that reason the proposed source break is intentionally staged as outlined in the proposal. For me, the question isn't about zero tolerance to source breakages, but about tradeoffs of what this change means both now and in the long-term, how much cavitation it will have on the ecosystem in practice, etc. A source break should not be considered lightly, and I do believe it is being considered in its weight when evaluating the value of what is proposed.

6 Likes

Thank you for the well-written overall review of the proposal and for engaging with the community, Ted, but I still feel that we are mixing up the general comments (on the need for regex literals, on source breakage and the attention paid to it, on the need to allow the language to make source-breaking changes, etc.) with the delimiter choice.

I do not personally see why and how what you or the proposal authors or the review manager wrote supports the absolute need of allowing bare /…/ syntax over just #/…/#. Considering the latter allows for more readable regex strings where you do not have to escape / characters, I do not see how all the arguments about readability and power and clarity support it; if anything, they seem to do the opposite.

How would the proposal change for the worse if the bare /…/ syntax were ditched in favor of just the #/…/# option, so that we did not have to escape / in our regexes?

7 Likes

I cannot share specific numbers, but there is a lot of Swift code at Apple, and we ran this change over that code and encountered one project that had an issue. Of course, that population of code may not be representative of all the kinds of Swift projects out there (leading to selection bias), but I believe that this population of code is not wildly different from much of the code in existence.

Selection bias aside, given what I know of Apple’s (non-public) policies against the use of open source software, it’s literally impossible for Apple’s internal ecosystem to be representative of the Swift ecosystem.

“Pervasive” still seems a rather nebulous standard for limiting source breakage, but at least we have some label on it now. I hope we see more concrete guidance around this issue from the Core Team in the future, namely things like whether phase in periods, migrations, or other mitigations make breakages more generally acceptable.

2 Likes

You say that as if it is fact. But clearly that is not true. It’s your opinion, and it is clearly a hot topic of contention.

You’re of course free to express your opinion on the matter, and have made that perfectly clear. But it doesn’t move this discussion forward.

Do you have examples? Even if we accept your claim of “actively reducing clarity” at face value, regexes still have utility. How do you propose to solve what regex literals solve?

Express.js-style routing is one example, where regexes are used to match handlers to incoming HTTP requests based on URL matching. The regexes are usually small, matched mostly against strings, with one or two capture groups.

That style is prevalent throughout modern web server apps, and Vapor could benefit from something like it.

This style of programming where small regex snippets can be inlined into a function argument, followed by a closure literal, is concise, familiar and with fairly little noise. At least many people think so, based on the popularity of this syntax and API design.

With this proposal, Swift could not only unleash the power of this syntax to projects such as Vapor, but it would allow strongly typed captures and named captures, making rewrites/refactorings safer and less fragile. It could provide syntax highlighted literals to help human parsing, and immediately draw attention to captured parameters.

There are probably a lot of examples of complex and highly unreadable regexes around. I’m not sure we can ever stop people from misusing features or writing bad code.
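As a rough illustration of what strongly typed, named captures could look like for a route, here is a minimal sketch. The route shape, the capture names `id` and `slug`, and the sample path are all hypothetical; this is not Vapor's or Express.js's API, only the standard `Regex` type matching a plain string.

```swift
// Hypothetical route pattern; `id` and `slug` are illustrative capture names.
// The named groups become labeled tuple elements in the match output.
let route = #/users/(?<id>\d+)/posts/(?<slug>\w+)/#

// Hypothetical request path (leading slash omitted to keep the sketch simple).
let routeMatch = "users/42/posts/hello".wholeMatch(of: route)

// Captures are Substrings checked at compile time; a misspelled
// label such as `output.idd` would fail to compile.
let userID = routeMatch.flatMap { Int($0.output.id) }   // Int?
let slug = routeMatch?.output.slug                      // Substring?
```

Renaming a capture group in the literal turns every stale use of the old label into a compile error, which is the refactoring safety being described.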

But having an already existing feature (regexes) become type-safe, compile-time checked, syntax highlighted, and refactor-safe is clearly an improvement. All of which is made possible by literals.

7 Likes

Somewhat off topic (only somewhat, because I agree that regexes are not usually helpful), but I don't think that Express.js-style routing uses regexes in any way. URL pattern matching usually needs a proper parser to deal correctly with escaping, parameters, fragments, path components and so on. (I also implemented my own parser for that in MacroExpress.)