SE-0354: Regex Literals

benlings · May 9, 2022, 6:55pm

Paul_Cantrell:

The hidden change in the meaning of all whitespace when #/…/# contains a newline:
#/ (foo|bar)(d|f|t) /#
matches " foot "

…but IIUC…
#/ (foo|bar)
   (d|f|t) /#
does not match " foot "

…which seems to me like a footgun.

To be a multi line literal, this would need to be

#/
(foo|bar)
(d|f|t)
/#

I think this wouldn’t be too confusing because there’s an extra newline in the middle.

This one is possibly more problematic, but wouldn’t you also expect it not to match because of the extra new lines at the start and end?

Avi · May 9, 2022, 6:58pm

In your side-by-side example, my brain literally filtered out the #'s. I had to look twice to find them.

ebg · May 9, 2022, 7:10pm

As a non-regex power user, I don't understand #/.../# and how that makes it easier to pick out tokens which are visually similar. The clearest to me is #regex(...), as there is precedent from regular method calls and #selector() that the stuff of interest is inside the parentheses.

Paul_Cantrell · May 9, 2022, 7:10pm

Ah, indeed! Yes, thanks. And that does raise sufficient user doubts about the leading space, doesn’t it?

Possibly? But I do think it's surprising that what seems like an innocent reformatting suddenly makes it match "helloworld" where it did not.

The “innocent reformatting” angle perhaps becomes more compelling with a longer RE that cries out for line breaks:

/Hell+o+, world!+ (How are you (today|tonight)\?)?/

→ matches "Hellllooo, world!!! How are you tonight?"

#/
  Hell+o+,
  world!+
  (
    How are you
    (today|tonight)\?
  )?
/#

→ matches "Hellllooo,world!!!Howareyoutonight?"

I'd venture that this is surprising. And it's an especially insidious surprise, because there is no compile-time diagnostic to help; it’s only (possibly case-specific and easy-to-miss) runtime behavior that changes.
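
A minimal sketch of the behavior change described above, assuming the semantics of the proposal as it was eventually implemented in Swift 5.7 (a multi-line #/…/# literal ignores whitespace inside the pattern). The shortened pattern and all constant names are illustrative only:

```swift
// Requires Swift 5.7+ (and macOS 13+ on Apple platforms).

// Single-line extended literal: the space after the comma is matched literally.
let singleLine = #/Hell+o+, world!+/#

// Multi-line extended literal: whitespace inside the pattern is ignored,
// so this behaves like the pattern Hell+o+,world!+
let multiLine = #/
  Hell+o+,
  world!+
/#

let spaced = "Hellooo, world!!"
let unspaced = "Hellooo,world!!"

// The "innocent reformatting" changes which inputs match.
let singleMatchesSpaced = spaced.wholeMatch(of: singleLine) != nil
let multiMatchesSpaced = spaced.wholeMatch(of: multiLine) != nil
let multiMatchesUnspaced = unspaced.wholeMatch(of: multiLine) != nil
```

Under these assumptions only the first and third checks succeed, and nothing at compile time flags the difference.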

Once the bug is apparent, the answer is perhaps hard to discover. However, an explicit “ignore whitespace” mode could come with a helpful compiler diagnostic that both makes the behavior change visible at compile time, and makes the solution apparent:

error: regular expression literal cannot span multiple lines

fixit: enable ‘ignore whitespace’ mode
       note: use `\ ` to match spaces

Douglas_Gregor · May 9, 2022, 7:19pm

(snip)

We have a number of source breaks queued up for Swift 6 already, as noted in my other thread, all of which have already gone through the evolution process. You can make a consistent argument that all of those are bad, if you'd like, and we should retract or rework every one of those accepted proposals.

However, I have yet to hear an objective argument that this source break is worse than those other, accepted source breaks. It's certainly not true based on the scale of the break---several of those changes have much more widespread effect, both in terms of the # of projects affected and how much work there is to resolve the break.

(Ignoring the already-discussed issue with the quote above, I want to make a different point)

We should all be open to changing our minds as good arguments and improvements come up in the discussion. And we should encourage folks who have changed their minds to say so, especially proposal authors who quite literally set the flow of the conversation by making specific, actionable proposals.

Swift has whitespace sensitivity in quite a number of places, including around operators, with calls, with multi-line string literals, and with disambiguating < between a less-than operator and a generic parameter list. The rules of these were debated at length, and if you wrote them all out in great detail, as has been done with /.../ here, they'd be pretty scary, too. But they rarely come up in practice, so we get to use < as both a generic-parameter-list introducer and a custom operator, ( for both calls and tuples, etc.

I mean, isn't it confusing that

let x = someLongFunctionName(a)

and

let x = someLongFunctionName
         (a)

have different meanings? What about:

3.14159

vs.

3
.14159

Should we have picked a different syntax for tuple literals, array literals, dictionary literals, and floating point literals because the newline rule is weird and we can come up with confusing cases? Or do those confusing cases come up so rarely, and are diagnosed so easily in the compiler, that fixating on them leads us down a road to worse overall design?

This seems to be a common argument:

That exact same argument applies to escaping " in string literals. Is the use of "..." for string literals also a mistake, and Swift really should have required #"..."# for all string literals because some of them would need to escape the "?

I don't see why regex literals with / are special here. Rather, I think they should follow what string literals do: there's a single-character delimiter (/ for regex, " for string) and if you need that character within the literal, you either escape it or go to the raw form (#/.../# or #"..."#).
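
The parallel between the two kinds of literal can be sketched as follows. This is only an illustration: the sample strings and constant names are made up, and the escaped regex is built from a string at run time so that the snippet does not depend on bare /.../ parsing being enabled:

```swift
// Requires Swift 5.7+ (and macOS 13+ on Apple platforms).

// String literal: escape the delimiter, or switch to the raw form.
let escapedString = "say \"hi\""
let rawString = #"say "hi""#

// Regex: the extended delimiters let a slash appear unescaped.
let rawRegex = #/usr/lib/#

// The same pattern with the slash escaped, as it would appear between bare slashes.
// Built via the throwing string initializer; try! is acceptable for a fixed pattern.
let escapedRegex = try! Regex(#"usr\/lib"#)

let stringsAgree = escapedString == rawString
let rawRegexMatches = "usr/lib".wholeMatch(of: rawRegex) != nil
let escapedRegexMatches = "usr/lib".wholeMatch(of: escapedRegex) != nil
```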

Let's imagine we take just #/.../#: how would we explain the inconsistency between regex literals and string literals to a Swift developer that didn't follow this discussion?

I keep seeing the assertion that "there's nothing to recommend the bare /.../" syntax, so I'll try to summarize in bullets real quick:

- /.../ is precedented in several other languages
- /.../ is analogous to the other literals in the language, nearly all of which have special parsing rules whose potential for confusion has not had a practical impact
- #/.../# is useful for the case where there is a need to escape / or go multi-line, but is unnecessary noise for the vast majority of regex literals

Doug

xAlien95 · May 9, 2022, 7:43pm

As a really minor side note, one field where #/.../# is slightly superior is in user experience. When writing #/, an IDE can auto-insert /# after the cursor position, enclosing the literal as already done with string/dictionary/array/tuple literals and avoiding any major disruption in the semantic checking of the content that follows. With a bare /.../, there wouldn't be any indication about the role of the inserted /, since it could serve as an operator, the beginning of a comment (either /* or //) or a regex literal.
As I said, a really minor side effect.

Avi · May 9, 2022, 7:47pm

This source break removes a capability that is currently used by third parties. Do any of the source breaks have this property?

Douglas_Gregor · May 9, 2022, 8:09pm

Several of them mean you have to go to an alternative, more heavyweight syntax to do what you're doing today, sure. Assuming that all we'll need to do is back-tick prefix operator / to resolve the conflict, the introduction of await in Swift 5.5 is quite close in spirit: if you had code like await(f()) before Swift 5.5, it had to get back-ticks around await to compile in 5.5.
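
For readers who did not hit the await break, a small sketch of what that migration looks like. The function names here are hypothetical; only the back-tick quoting is the point:

```swift
// A function that happens to be named `await`, as code written before
// Swift 5.5 could declare without any special quoting.
func `await`<T>(_ value: T) -> T { value }

func f() -> Int { 42 }

// From Swift 5.5 on, the call site needs back-ticks so that the name is
// treated as an identifier rather than as the await keyword.
let result = `await`(f())
```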

Doug

Jumhyn · May 9, 2022, 8:10pm

IMO the relative controversy over the bare regex syntax indicates that there's a large portion of the community that does not view this source break as 'worth it' compared to the break for, say, any P. I don't think opponents of the 'bare' regex syntax should need to argue that the source break here is somehow worse than other source breaks, except to say that the benefit of the source break does not outweigh its costs.

Given that the argued benefit is based on an "aesthetic preference" it feels a bit dismissive to me to say that you've seen no "objective" argument that this source break is worse than other accepted breaks. Indeed, the exact same objective data about the size of this break could support the conclusion that this break is 'better' or 'worse' overall entirely based on the (subjective) perceived benefit.

And further, the fact that we have accepted other, large source breaks for Swift 6 is to me a good reason to be more skeptical of additional source breaks, not less. I don't think we should too quickly discount the additional marginal cost of more migrations to verify, more libraries to update, etc., just because we already have larger source breaks planned.

Avi · May 9, 2022, 8:16pm

/ will no longer be usable as a prefix operator. There is no workaround for that being proposed.

Ben_Cohen · May 9, 2022, 8:17pm

FYI based on this being suggested upthread, the proposal authors are exploring using backticks to avoid requiring removal of operator prefix / – putting it on a similar footing to await. It's possible this could be combined with a heuristic that only two slashes on a line would trigger requiring backticks.

Douglas_Gregor · May 9, 2022, 8:51pm

I'd like to think that the arguments I stated at the end of my post are about precedent with regex literals in other languages and consistency with other literals in Swift.

Additionally, there is an objective measure of source breakage, and that's how much code will be affected by the change. We know, objectively, that the scale of source code breakage from this proposal is far less than any P. A tiny fraction of projects will be affected by this proposal (even fewer if this pans out) vs. nearly 100% of projects for any P and Sendable. We're not even in the same ballpark here w.r.t. source breakage, and we shouldn't pretend we are.

Perhaps I should make my point differently: there has been a lot of discussion here about source breakage, and I find the amount of concern expressed is completely out of proportion to the actual demonstrated source breakage from the change.

Yes, it's fine to subjectively say that this amount of source break isn't worth it in your opinion, but evaluating that argument means being realistic about both (1) the actual cost of the source break, and (2) the downsides from adopting an alternative syntax like #/.../#. The more data we get, (1) seems smaller than first anticipated, and the more I think about the relationship of regex literals to other parts of the language, the more (2) seems to grow.

The problem with this line of argument is that I, or anyone else, can selectively wield it for any syntax I don't like, so long as it has the tiniest potential source break. And using that argument for this proposal, rather than something like any P or Sendable, would establish a baseline of unacceptable source breakage so low that essentially nothing can change from now on for Swift 6.

Doug

Jon_Shier · May 9, 2022, 9:22pm

Aside from the near 100% source breaks of things like any, which are easy to see, Swift lacks the tooling necessary to objectively determine the degree of source breakage of any change, unless you limit "objective" to simply mean "relative to the size of the language". By that measure, sure, the removal of / as an operator is objectively small, since it's only one character and operator among many. But I don't think that would be a useful measure to most people.

I think most people would consider "objective" to be more meaningful as a consideration of "How many Swift projects will this break?" or "How many changes are required to fix this?" The Swift ecosystem currently has no way of measuring those, or any, impacts. We can guess, based on the relative popularity of the CasePaths library and The Swift Composable Architecture in general, that it would number from hundreds to thousands of projects, but without actual stats we can't get more precise. So I don't think you can say "the amount of concern expressed is completely out of proportion to the actual demonstrated source breakage from the change" when you can't see the demonstrated source breakage. For anyone using the mentioned libraries, this breakage will be just as bad as the breakage from any and much harder to fix. With any you can simply mass apply the fixit and be done with it. There's no workaround here (perhaps yet?).

So while this change has a relatively small impact to the language, you can't say the same thing about the Swift ecosystem itself. So it's probably a good idea to stop thinking about the impact as "objectively" small.

Douglas_Gregor · May 9, 2022, 10:08pm

The numbers we have include "16/2968 projects in the Swift Package Index", "0 projects in the source compatibility suite" and "1 project out of all of the Swift code at Apple". In my experience, even the most minor source break we unintentionally make during the normal flow of compiler development breaks more projects than the above, so from my perspective as compiler implementer those results are very, very good.

You have claimed that this data is not representative, and there's no way to definitively counter your claim because we can't see most of the Swift source code in the world. Maybe CasePath's 547 stars undercount its influence on the wider Swift ecosystem. The 6.1k stars for the Swift Composable Architecture might be a better indicator, or maybe not. We're guessing here, but I do want to point out a bit of old precedent: a while ago we took away $ as a bare identifier, breaking the 4.2k-starred Dollar library, because it was the right thing to do.

There is most certainly a workaround, and I'm surprised that you didn't know about it: define a single-parameter function that does what prefix / does. Maybe we call it casePathRoot, so

/Authentication.authenticated

becomes

casePathRoot(Authentication.authenticated)

If this suggestion works out for this proposal, it'll be

`/`Authentication.authenticated

So there is a workaround, and it's a fairly localized fix to uses of the prefix / operator. It is similar to any in its locality but on a demonstrably smaller scale. I think we can agree on this part?
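
A rough sketch of the named-function workaround. Everything here is a simplified, hypothetical stand-in: CasePathSketch is not the CasePaths library's type, and the Authentication enum is invented to mirror the example above:

```swift
enum Authentication: Equatable {
  case authenticated(String)
  case unauthenticated
}

// Hypothetical, simplified stand-in for a case path: it only stores the
// function that embeds a payload into the enum.
struct CasePathSketch<Root, Value> {
  let embed: (Value) -> Root
}

// Hypothetical named replacement for what the prefix / operator does.
func casePathRoot<Root, Value>(
  _ embed: @escaping (Value) -> Root
) -> CasePathSketch<Root, Value> {
  CasePathSketch(embed: embed)
}

// The call site changes from an operator prefix to an ordinary function call.
let path = casePathRoot(Authentication.authenticated)
let value = path.embed("token")
```

The fix stays local to each use of the operator, which is the property being compared to the any migration.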

All of this is temporary, of course. Key paths absolutely should be able to refer to enum cases, and when they do, they'll be preferred to CasePaths because they can integrate better in the rest of the language. When that happens, do we then revisit the /.../ syntax or has it been forever taken by one library?

The reality is that we're both extrapolating from the data we have, because fundamentally that's all we can do when most of the Swift code in the world is closed-source. I've looked at that data and I'm comfortable with this source break: I know how narrow it is and how the smooth rollout of it will go. Every hat I wear in the Swift world has a lightning rod attached to the top, so I don't take source breaks lightly.

Doug

idrougge · May 9, 2022, 11:01pm

What is your evaluation of the proposal?

½
While regex literals can sometimes be useful, they do not have to look exactly like in certain other languages (notably Perl) which, due to their niche, place a lot more emphasis on regexes than Swift is likely to do. Some of these languages, notably, implemented regex literals several years before implementing raw string literals, which already serve to make regex strings more workable.

Is the problem being addressed significant enough to warrant a change to Swift?

Not if it involves adding special-cased syntax which is both source-breaking and non-extensible.

I am particularly thinking of the / … / syntax, but I feel that any kind of syntax that is tailored only for regex literals would couple Swift too closely to a single sub-language which has no particular shared history with Swift. I know that whilst I do sometimes resort to a regular expression to solve a problem, I write a lot more raw JSON literals, whereas others may write SQL literals or HTML literals. And I think VB.NET has XML literals. If each kind of literal should have its unique literal delimiters, we would run out of character combinations.

I therefore prefer that alternatives like #Regex(…) or #re"…" be used. They are purely additive, do not add as many special cases or complexities to lexing and parsing and are more obvious at point of use.

Does this proposal fit well with the feel and direction of Swift?

No. Swift literals have so far been limited to the types in the standard library, whilst also adoptable for other types through ExpressibleBy*Literal. There is little precedent for adding support for a language-within-a-language, the closest being ResultBuilders, which are also used for regexes in this proposal as well as being applicable to many other use cases instead of just being a kind of SwiftUI literal.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I have used them in JavaScript, but I find regex literals there to be given an unnecessarily prominent place in the language with no hints regarding their nature. It is not easy to search for two slashes interspersed with random characters. A syntax like Regex() or something similar with normal letters would enable the newcomer to look up its meaning and usage and better fit in with Swift’s ideal of progressive disclosure.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I have read the proposal twice, as well as having read the pitch thread.

hborla · May 9, 2022, 11:39pm

I've been following the review thread here so I don't have exactly the perspective you are asking for, but I'm not intimately familiar with regexes in general, and I have a fair amount of experience teaching introductory programming students, so I feel I can still give some useful insight on the corner cases here. I also work on diagnostics in the compiler, so I can give some insight into what's possible for letting programmers know what went wrong in these corner cases.

Similar to the whitespace rules in this proposal, the compiler also has whitespace rules for operators, and I think those whitespace rules save us from these corner cases when operators are directly applied to arguments. I could be wrong about that, but I've found it difficult to cook up an applied operator expression that compiles today, and does not compile under SE-0354, and even if we found one, you'd have to write the code in an extremely specific way.

That leaves unapplied operator references to / as the only problematic case. Now, unapplied operator references are already problematic, because they must have a concrete contextual type in order to compile in the first place. That means it must be an argument to a function, an assignment to a local variable that already has a type, etc. Otherwise, overload resolution would not be able to disambiguate between all of the global overloads of /. This context is extremely helpful, because it means the error is nearly always going to manifest as a type mismatch between the Regex type and the operator function type, which means it's fairly easily detectable in the diagnostics code to produce a tailored message and a fix-it to, e.g., surround the operator reference in back ticks.
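
To make the "concrete contextual type" requirement concrete, here is a small sketch in current Swift. The constant names are illustrative, and nothing here depends on how SE-0354 ends up parsing a bare /:

```swift
// An unapplied reference to / needs a contextual type to select an overload.
let divide: (Double, Double) -> Double = (/)

// As a function argument, the parameter type supplies that context:
// reduce starts at 64, then computes 64 / 8 = 8 and 8 / 2 = 4.
let quotient = [8.0, 2.0].reduce(64.0, /)

// Without any context the reference cannot be resolved:
// let ambiguous = (/)   // error: ambiguous use of operator '/'
```

Because the context is always present in code that compiles today, a regex literal turning up in the same position would surface as a type mismatch against a known function type.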

Now, to answer your question of "What do developers expect let y = foo(a, /, b).reduce(1, /) to do? How confident are they?" more directly. If we accept SE-0354 and you're asking somebody "What do you expect let y = foo(a, /, b).reduce(1, /) to do?", you're clearly asking a trivia question, because if someone attempts to write this in their project, they'll find out what it does when they get a (hopefully actionable after a bit of tailoring!) message telling them what went wrong and how to fix it. Trivia like this is extremely easy to cook up for any programming language given a combination of features. For example, I could ask all of you right now, what do you expect to happen given this code?

func test(_ arg1: String = "", _ arg2: Int = 1) {}

test(10)

or

func test(x: Double) -> Double {
  x/2/.pi
}

and many programmers would get the answers wrong, but it doesn't matter, because programmers rarely hit this behavior in practice, and in the rare case when somebody does, they find out what happens immediately with the error message produced. This is how people learn! People don't randomly encounter code that doesn't compile, and as long as the error message is actionable -- which should be fairly straightforward to achieve in all of the corner cases that I've seen in this thread, since they're all just argument-to-parameter type mismatches -- I don't think there's significant confusion or astonishment being introduced here.

I'd also like to note that, if we were to only parse #/.../# and not the bare /.../, attempting to write a bare /.../, which programmers might reasonably try given the parallel with extended string literals, would result in a pile of unhelpful parser diagnostics. The best way to guide programmers away from invalid syntax is to actually make the compiler understand the syntax anyway for the purpose of detecting it and diagnosing it as invalid, so we'd likely want to have that sort of parser support in place even if we were to choose only supporting the extended syntax.

tim1724 · May 9, 2022, 11:44pm

If we can add support for backtick-quoted operators then I'd be ok with bare /.../ syntax and I'd be fully onboard with the proposal as written. It still wouldn't be my first choice of syntax but it would be good enough.

Ben_Cohen · May 10, 2022, 12:00am

Another minor point regarding this example:

foo in this case is presumably something like "zip then map" i.e. zip(a,b).map(/). In Swift, though, you would not write that function signature with the operation in the middle (despite making the operation "infix" between the two sequences having a certain appeal), because trailing closures encourage the function argument to always be the last argument. And when that is the case, you get foo(a, b, /) which fixes the parse.

Yes, it is still possible, perhaps including functions that will take two unapplied instances of / – but it becomes increasingly less likely, and this seems to reinforce the reason why source compatibility testing has shown parsing ambiguity to be effectively a non-issue for this proposal (as distinct from the impact of eliminating operator prefix /, which is certainly not a non-issue, whether or not you believe it's acceptable impact).

Jon_Shier · May 10, 2022, 12:39am

My fundamental issue is that "objective" keeps being bandied about in this thread as if, due to some equation, this source breakage is okay when others aren't. As you've clearly stated, that's not how it works. Instead, we all balance data, experience, and preference, and different people reach different conclusions about acceptability. So repeatedly pointing to the very limited data we have to support this breakage, when we know the actual impact will be some magnitude greater than what we see (libraries using / X libraries using that as a dependency X number of users of that dependency X number of uses by those users) is, at best, myopic. My replies have attempted to point out that relying solely on the available data is very limiting, nothing more.

Now, there is a lot Swift and Apple could be doing to gather more data about things like this, but that's a discussion for another thread.

michelf · May 10, 2022, 1:10am

It's interesting to note that inside this regular expression you encounter a closing parenthesis first, which should be a parse error when parsing the regex. Presumably, the Swift parser could treat this situation as not-a-regex (just like it could treat a single / on a line as not-a-regex) and parse it as normal Swift code.