SE-0354: Regex Literals

hooman · May 5, 2022, 1:53am

I think this is subjective, and depends on how you use regexes and what you typically match. I encounter / far too many times in what I am doing. It is in URLs, dates, normal text, SKUs, part numbers, math expressions, (Unix) file paths, etc. I find it very inconvenient to have to escape /. On the other hand, I rarely encounter " in the strings I am working with.
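
To make the escaping cost concrete, here is a minimal sketch that matches a slash-separated date. The pattern, the sample input, and the names `datePattern` and `input` are illustrative assumptions, not taken from the proposal. It assumes Swift 5.7 or later, where the extended #/.../# delimiter is available.

```swift
// Extended delimiter: "/" inside the pattern needs no escaping.
let datePattern = #/(\d{4})/(\d{2})/(\d{2})/#

// With the bare delimiter, the same pattern would need every "/" escaped:
//   /(\d{4})\/(\d{2})\/(\d{2})/

let input = "released 2022/05/05"
if let match = input.firstMatch(of: datePattern) {
    // Captures are typed as Substring and accessed by tuple position.
    print(match.1, match.2, match.3)  // prints "2022 05 05"
}
```

For patterns dominated by / (URLs, paths, dates), the extended form stays closer to the text being matched.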

tkremenek · May 5, 2022, 3:20am

This proposal, and the review discussion, tease at some of the fundamental design inflection points for the Swift language. I appreciate immensely the various topics and counterpoints that have surfaced on this review thread. Those comments will provide a valuable signal to the core team to evaluate this proposal.

Speaking from my perspective of the proposal, I strongly favor the proposed language change.

The regular expression proposals intend to close further the gaps on long-standing goals for Swift's string-processing capabilities. Since Swift's inception, there has been a standing goal for string-processing to be powerful while also clean and readable. Simple regular expressions are a time-tested and recognizable way to express intent for matches in strings. Combining them directly into the language with Swift's Unicode-first model will dramatically improve Swift's string processing capabilities.

One of Swift's goals is clean, clear syntax. We have always given specific consideration to the treatment of essential concepts in the language to achieve the broader goals of clean and intelligible code. These principles run throughout Swift, starting with dropping mandatory semicolons or parentheses around conditions. Regular expressions are essential because first-class string processing in a general-purpose programming language is essential. As such, first-class regular expression integration into Swift should adhere to these principles. While #/.../# could support the feature on its own, it doesn't achieve the same level of clean syntax that we can achieve with the proposed /.../ syntax.

I know there are concerns about the complexity of the parser rules as outlined in the proposal. I thank the proposal authors for calling those out so thoroughly. I do not find these rules concerning. Swift has many existing rules in the parser for making much of the intuitive syntax in Swift "just work," some far more complicated than the ones mentioned in the proposal. While we should not aim to throw kerosene on a fire, I do not believe that is the case here. The proposed parsing rules are well within a threshold of complexity for the parser's reasoning and implementation. Further, after running this change through millions of lines of code, this change triggered only a couple of instances of parsing ambiguity. Statistically, the data suggests the parsing rules are not a concern in practice when working with the Swift code people write today.

I believe the most crucial question here is the tough call around source compatibility and whether or not any source breaks are tolerable at any stage in the evolution of Swift, even when staged. Swift is still evolving in service of its users. While new essential features aren't added to Swift with the same regularity as in its first nascent years, our work on rounding out the fundamentals of the language isn't over. With Swift concurrency, new keywords (such as "async") were added to the language because those concepts are fundamental to modern programming, even though they also introduced a source break. It would have been calamitous not to give concurrency the clean and clear syntax it needed and deserved. Source breaks should not be inflicted recklessly on Swift code as they can potentially destabilize the Swift ecosystem and burden developers. While some noteworthy points have surfaced in this review thread, I believe the balance struck in this proposal allows the language to move forward with the proper application of blessed syntax for regular expressions while giving users a path to move at their own pace. I also believe the exhibited data on the amount of code trialed on the new syntax shows that the source break will not manifest pervasively. The combination of the infrequency of occurrence of source breaks in practice, and the staging via a language mode, convinces me there is a good path forward here, as outlined in the proposal.

Jon_Shier · May 5, 2022, 3:34am

I'm sorry, but we don't have the data needed to make this statement. We have only the narrow view of the compatibility suite, which says nothing about the number of apps affected by each break. We really don't know how much breakage this will cause, just that it will be some orders of magnitude greater than what's reflected in the compatibility suite.

I won't reiterate my other points here, I just wanted to point out the data issue.

michelf · May 5, 2022, 4:07am

tem:

What about allowing / to be wrapped in backticks to disambiguate it as an operator rather than the start delimiter of a bare regex literal?
prefix func / (...) -> ...
let casepath = `/`Enum.a      // parse error today

That sounds way better than making prefix / impossible to express. With this, migration can happen independently for each module and can be automated. Coordinating API changes between libraries also becomes optional.

tkremenek · May 5, 2022, 4:21am

It is true that there is a wide population of code out there, and that nobody can audit it all. We can only draw inferences based on the data that we have.

I cannot share specific numbers, but there is a lot of Swift code at Apple, and we ran this change over that code and encountered one project that had an issue. Of course, that population of code may not be representative of all the kinds of Swift projects out there (leading to selection bias), but I believe that this population of code is not wildly different from much of the code in existence.

So I will rephrase my statement: I believe, based on the code this change has been tested on, that the break won't be pervasive. The break will, however, impact some codebases more than others.

We know there is a break here, and for that reason the proposed source break is intentionally staged as outlined in the proposal. For me, the question isn't about zero tolerance for source breakages, but about tradeoffs of what this change means both now and in the long-term, how much cavitation it will have on the ecosystem in practice, etc. A source break should not be considered lightly, and I do believe it is being given its due weight when evaluating the value of what is proposed.

Panajev · May 5, 2022, 4:56am

Thank you for the well-written overall review of the proposal and for engaging with the community, Ted. I still feel, though, that we are mixing the delimiter choice with other topics: the general need for regex literals, source breaking and how much attention is paid to it, the need to allow the language to make source-breaking changes, etc.

I do not personally see why and how what you or the proposal authors or the review manager wrote supports the absolute need of allowing bare /…/ syntax over just #/…/#. Considering the latter allows for more readable regex strings where you do not have to escape / characters, I do not see how all the arguments about readability and power and clarity support it. If anything, they seem to do the opposite.

How is the proposal changing for the worse if bare syntax /…/ were to be ditched for just the #/…/# option and we did not have to escape / in our regexes instead?

Jon_Shier · May 5, 2022, 5:09am

I cannot share specific numbers, but there is a lot of Swift code at Apple, and we ran this change over that code and encountered one project that had an issue. Of course, that population of code may not be representative of all the kinds of Swift projects out there (leading to selection bias), but I believe that this population of code is not wildly different from much of the code in existence.

Selection bias aside, given what I know of Apple’s (non-public) policies against the use of open source software, it’s literally impossible for Apple’s internal ecosystem to be representative of the Swift ecosystem.

“Pervasive” still seems a rather nebulous standard for limiting source breakage, but at least we have some label on it now. I hope we see more concrete guidance around this issue from the Core Team in the future, namely things like whether phase in periods, migrations, or other mitigations make breakages more generally acceptable.

sveinhal · May 5, 2022, 7:20am

You say that as if it is fact. But clearly that is not true. It’s your opinion, and it is clearly a hot topic of contention.

You’re of course free to express your opinion on the matter, and have made that perfectly clear. But it doesn’t move this discussion forward.

Do you have examples? Even if we accept your claim of “actively reducing clarity” at face value, regexes still have utility. How do you propose to solve what regex literals solve?

Express.js-style routing is one example, where regexes are used to match handlers to incoming HTTP requests based on URL matching. The regexes are usually small, matched against mostly strings, with one or two capture groups.

That style is prevalent throughout modern web server apps, and Vapor could benefit from something like it.

This style of programming where small regex snippets can be inlined into a function argument, followed by a closure literal, is concise, familiar and with fairly little noise. At least many people think so, based on the popularity of this syntax and API design.

With this proposal, Swift could not only unleash the power of this syntax to projects such as Vapor, but it would allow strongly typed captures and named captures, making rewrites/refactorings more safe and less fragile. It could provide syntax highlighted literals to help human parsing, and immediately draw attention to captured parameters.

There are probably a lot of examples of complex and highly unreadable regexes around. I’m not sure we can ever stop people from misusing features or writing bad code.

But having an already existing feature (regexes) become type-safe, compile-time checked, syntax highlighted, and refactor-safe is clearly an improvement. All of which is made possible by literals.
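
As a rough sketch of what strongly typed, named captures could look like for a route of this kind: the route shape and the names `route`, `describe(path:)`, `userID`, and `slug` are hypothetical and not part of Vapor or the proposal. It assumes Swift 5.7 or later and uses the extended delimiter so the slashes in the path need no escaping.

```swift
// Hypothetical route pattern with two named captures.
let route = #/^/users/(?<userID>\d+)/posts/(?<slug>\w+)$/#

// Returns a description of the matched route, or nil if the path does not match.
func describe(path: String) -> String? {
    guard let match = path.firstMatch(of: route) else { return nil }
    // Named captures become labelled tuple elements of type Substring,
    // so a typo such as `match.userId` is a compile-time error.
    return "user \(match.userID), post \(match.slug)"
}

print(describe(path: "/users/42/posts/hello") ?? "no match")  // prints "user 42, post hello"
```

Renaming a capture group in the literal changes the labels of the match's output tuple, so any stale use fails to compile rather than failing at runtime.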

sveinhal · May 5, 2022, 12:06pm

They can. But they can also use a somewhat less expressive pattern-matching DSL implemented as string literals.

Nevin · May 5, 2022, 1:11pm

We literally just discussed this a few posts earlier in the thread. Yes, of course I am expressing my opinion. We all are. That is the purpose of these review threads.

It has previously been made clear on these forums that we are expected to express our opinions, and we are not expected to couch them with “in my opinion”.

What do you mean by “move forward”?

In my opinion, the “forward” direction involves discarding traditional regex literal syntax entirely. I consider the idea of introducing regex literals to be a major step backwards.

In my opinion, this proposal would move Swift in the wrong direction. It would make the language less readable, harder to learn, and overall worse.

There are other proposals to introduce a more powerful, more Swifty, comprehensive string parsing and pattern matching system. I consider that to be the “forward” direction.

I believe those proposals should be developed to fruition so that Swift gains a powerful and clearly readable string processing system.

Then, after Swift has gained such a feature, I believe we should give people time to become familiar with it, to gain experience with it, and to develop best practices for it.

Once Swift has such a first-class string processing system which prioritizes clarity as well as utility, and once programmers have been using it for a few years, then we could check back again to see if anyone still wants legacy regex literals.

If they do, then we will at that point be in a position to consider the actual long-term tradeoffs. We will be able to decide whether adding those literals would be worth the cost in language complexity and lack of clarity, when they do not bring any new expressivity since Swift would already have its own powerful, native, approachable string parsing feature.

In addition to this current proposal going in the wrong direction, I also think it is sequentially out of order. We should get the comprehensive Swifty solution first, and then later consider the crufty legacy syntax.

sveinhal · May 5, 2022, 1:28pm

Of course you're expressing opinions. I think everybody understands that. I'm not objecting to the lack of qualifiers such as "in my opinion", which I agree just add noise. But I'm complaining about the lack of justifications for your opinions, and that opinions stated without justifications come off as non-opinions.

Maybe I've just missed them? This thread is quite long.

Also, I'm not talking about the direction in which Swift is evolving, but the direction of the discussion at hand. Can you help move the discussion forward by providing weight to your argument? Maybe give examples of better solutions to the problem this proposal aims to solve. Show how we have, in your opinion, as much (or even more) clarity without literals. You may sway people in your direction by providing alternatives.

Panajev · May 5, 2022, 1:38pm

Agreed, but this is orthogonal to whether the bare /…/ syntax is worth the trouble of its implementation and readability issues (escaping / does not seem better than not having to) compared to #/…/#.

rvsrvs · May 5, 2022, 1:40pm

Can you speak to the possibility of staging this with changes to optics to include enum case keypaths? That has been mentioned several times as a desired feature to ameliorate some of the breakage pain, but there has been no mention of combining the two features before a final breaking change is made in Swift 6.

ben-cohen · May 5, 2022, 2:11pm

Posting as review manager for some moderation feedback:

Nevin is right here. People should generally accept that people are always representing their own perspective. Generally speaking, it's best to avoid meta-commentary on how people are arguing.

This is especially true in areas of taste – for example, there is no "right" answer to whether #/.../# looks worse than /.../. Nevertheless, it's good if people can try their best to substantiate their views. In cases of code aesthetics, examples are particularly helpful.

That said, @Nevin it would also be good for you to tone down the rhetorical flamethrower a little. Referring to things as "dumpster fires" isn't really appropriate language, and certainly won't help persuade the core team of your position.

I understand it was posted good-naturedly, but please avoid including humorous pics in your posts to evolution threads. This is particularly true of snowclone-style memes ("X all the things", "No, it is the X that are wrong" etc). They're not really appropriate for proposal reviews (other areas of the forum are perhaps more OK with it). People often link to XKCD as an aside, but a link is enough.

masters3d · May 5, 2022, 3:09pm

// bare
let regex = Regex {
   digit
   /\ [+-] /
   digit
}
// extended

let regex = Regex {
   digit
   #/ [+-] /#
   digit
}
// The above is from the proposal. If we want to optimize for this type of regex composition and the # are not acceptable, then let's use '/.../' (I did see the note in the alternatives).

let regex = Regex {
   digit
   '/ [+-] /'
   digit
}

I personally think that we should not optimize the syntax for small partial regexes.

Zollerboy1 · May 5, 2022, 4:07pm

I'm definitely +1 for introducing RegEx literals into Swift. IMHO they have immense value in various kinds of string processing and I have always been massively bothered every time I had to use the ObjC-style RegEx-API.

Of course we also get the 'more swifty' RegEx builder DSL and that is definitely a good thing that I will use whenever a literal would be too cumbersome to write and read, but, unlike some others in this thread, I don't think that we can introduce that DSL exclusively and leave literals completely out.
Especially for short matches it will be so incredibly much nicer to just write a little one-liner instead of writing a multi-line result builder closure.

Additionally, I think, RegEx literals will be particularly handy while conceptualizing and testing out your code and can then be translated into the DSL later on.

However, like many others, I'm greatly concerned about the choice of delimiter for the literals and the set of trade-offs it is going to bring. Even if the data we currently have were really representative and the amount of source breakage indeed very low, I'd still think it should not be necessary to ban a whole set of operators using a common symbol, which have been possible to write since Swift 1.0, or to introduce complicated parser rules that may not be much of a burden for the compiler but IMHO very much are for humans.
Also, and correct me if I'm wrong, until now, Swift had a very clear division between operator characters and delimiter characters and it'd be a shame to lose that.

I think there are much more compelling delimiter choices (just #/.../#, '/.../', #re'...', or even #re/.../, just to name a few again), and the proposal does too little to argue for /.../ (except for "it's a term of art, although it is becoming less popular, but it's still a compelling choice") or against all the other choices, given the consequences.

YOCKOW · May 5, 2022, 4:41pm

I am very disappointed to hear such a statement from the project lead because it seems as if the conclusion has been made before the deadline.

As some folks already pointed out, regex literals have very subjective problems rather than technical issues.
Now I feel that it may be trivial whether or not the new syntax will break source compatibility.
What matters is whether regex literals are really clear or illegible. It may be less important what delimiters are chosen.
Of course, "every man to his taste". No one can judge which is correct. Consequently the discussion could get heated.

Regular expressions were born over half a century ago (as someone also pointed out in this thread).
At that time, to be short was to be the best because our ancestors had to input regexes onto a terminal by hand.
Today, we have "code completion" or something like it. A DSL can be entered with a few keystrokes.
Do we actually need regex literals?
Is to be short to be clear?

There are definitely only subjective answers, but my subjective answer is no.

The core team is the elite. The elite has much knowledge about many languages and related things.
Regex literals are easy to write and read for those who are wise in such syntaxes.
That is also a subjective perception.

What about newcomers?
They would regard regex literals as ancient magic spells. As a matter of fact, regex is just legacy.
On the other hand, DSL consists of simple English words such as OneOrMore.
It is very clear to read. Even non-programmers can understand it.

To be short is not to be clean nor clear.

I know Mr. Lattner is not on the core team any longer. However, this utterance of his is still very valuable to us.
I don't want to guess that the reason why the core team sent him off was to introduce this kind of ugly syntax.

Just in case, my stance is:

johnno1962 · May 5, 2022, 4:59pm

Let me slide into this review to restate my firm opposition from the outset to the valiant efforts of the proposal authors to support the 'bare' /regex/ syntax in Swift and my surprise to see it being brought to review in the proposal. It simply is not worth the known lexical contortions and source breaking of popular open source projects it would inflict on the language. At best it could be presented as a "future direction" for when these issues have been thought through. I just don't see #/regex/# being much less recognisable or "clean" as an alternative "term of art" or the need to support both.

Not strictly relevant to the review, given that ( and ) are special characters important to regex syntax I don't see a syntax like #re(stuff) being much more promising. Any literal for something as subtle as a regex terminated by a single character simply isn't going to cut it.

Karl · May 5, 2022, 5:37pm

Aside

You've mentioned this a couple of times. It's more of a procedural point, but I agree. For other proposals as well, we're seeing members of the core team jump into pitches and reviews to defend even the tiniest minutiae. For the recent light-weight generics pitch, every member of the core team who works on the compiler/stdlib - Doug, John, Joe, Ben, etc - all of them were there, defending tiny details about syntax.

It's unfair to the community, IMO. It's obviously valuable to hear what the core team members think, but it does often lead to situations where, to argue with any tiny point, suddenly volunteer community members who are given very little time to review a proposal and given little/no insight in to future plans are met by the core team in defensive formation, building a wall and blocking the debate.

It's very corporate, if you know what I mean ;)

The core team ultimately makes the decision, in a closed process without even any meeting minutes, and the secrecy is so extreme that even the founder of the project and former Apple executive (I think he was on the "executive leadership" website at one point...?) finds the process intolerable. It has never been a democracy, and it is absolutely open to arbitrary "I just want this" kinds of decisions.

So it is obvious why the words of the core team can shift the debate. They can instantly render the efforts of community members moot with a single random musing that they just "aren't feeling" a particular feature. The core team do not need to justify anything.

@tkremenek 's point is basically that - he says he thinks /.../ looks nicer. Full stop. Case closed. You can try to convince him (or other core team members), but at the end of the day they shape the future of the language based on arbitrary feelings like that.

That's just how this system works. And that itself isn't necessarily unworkable, but it does mean that the musings of core team members have extraordinary weight. Increasingly, they are being less cautious about how they throw that weight around.

Personally, I've basically followed Chris' example and don't care about swift-evolution any more. I think the community is at an all-time low, and the core team are too far removed to even notice it. I somehow got roped in to this discussion, but for the last few months I generally don't comment/like/anything on evolution proposals any more. After years in the community, I think the process is a waste of time for community members; they are more like "soft announcements" than reviews.

tkremenek · May 5, 2022, 5:54pm

I understand why some may be concerned with me expressing an opinion they disagree with. As I said in my post, I am stating my opinion. I suspect folks would rather hear the diversity of the views and perspectives, with their reasoning to support those perspectives rather than not hearing them.

All of the signals from reviews feed into the core team's decision-making discussion. Of course, I have an opinion, which I will express in the core team discussion, but the signal raised in review threads is always considered.

Everyone: Let's please keep this thread civil. This topic is polarizing with very different viewpoints. We can politely ask people to explain their rationale and perspective, even if we won't agree with them entirely or potentially at all.