SE-0354: Regex Literals

allevato · May 6, 2022, 5:49pm

Yes, this is quite a surprising departure from what we've been told in the past about not wanting there to be different "dialects of Swift" via the use of feature flags.

Numerous times, there have been discussions about making improvements around controlling warnings emitted for the use of deprecated declarations and their conflicting nature with -warnings-as-errors. Many of those discussions have been sidelined with the proclamation that "we don't want dialects of Swift". And now here we are, proposing that very thing be added for a different feature.

I don't have a lot of skin in the game here; I'll continue to work with/on Swift whether /.../ or #/.../# or neither or both end up being accepted. My main interest at this point is having some consistency and an understanding either of why this situation is different or of whether the core team's position has changed/evolved since those other discussions, since that would inform how I approach future Swift Evolution discussions.

hamishknight · May 6, 2022, 6:22pm

Warning seems reasonable to me, though note we will reject unknown letter escape sequences, which should avoid confusion in that case. I've added this to the list of warnings to implement (Implement parser warnings · Issue #380 · apple/swift-experimental-string-processing · GitHub).

hamishknight · May 6, 2022, 6:22pm

dhoepfl:

How about:

func foo(_ a: [Int], _ o: (_: Int, _: Int) -> Int, _ b: [Int]) -> [Int] { [4] }
func foo(_ a: [Int], _ r: Regex<Substring>) -> [Int] { [2] }

let a = [1,2,3]
let b = [4,5,6]

let x = foo(a, +, b).reduce(1, /)
let y = foo(a, /, b).reduce(1, /)
let z = foo(a, /, b).reduce(1, +)

Would this compile? What value will y have?

foo(a, /, b).reduce(1, /) would unfortunately become a regex literal and require disambiguation by writing it as e.g. foo(a, (/), b).reduce(1, /). We might be able to extend the ) heuristic to check for any unbalanced ) between the delimiters, which would help avoid most of these ambiguities. I will investigate this further.
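A minimal sketch of the parenthesized-operator workaround, for illustration only. The `combine` helper is hypothetical and stands in for `foo`; the example compiles the same way whether or not bare regex literals are enabled, because `(/)` can only be read as an operator reference.

```swift
// Hypothetical helper standing in for `foo`: applies `op` pairwise to `a` and `b`.
func combine(_ a: [Int], _ op: (Int, Int) -> Int, _ b: [Int]) -> [Int] {
    zip(a, b).map { pair in op(pair.0, pair.1) }
}

let a = [8, 9]
let b = [2, 3]

// Wrapping the operator in parentheses makes it an unambiguous function
// reference, so no `/` here can be mistaken for a regex delimiter.
let quotients = combine(a, (/), b)      // [8 / 2, 9 / 3]
let folded = quotients.reduce(24, (/))  // (24 / 4) / 3
print(quotients, folded)
```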

hamishknight · May 6, 2022, 6:23pm

tem:

What about allowing / to be wrapped in backticks to disambiguate it as an operator rather than the start delimiter of a bare regex literal?
prefix func / (...) -> ...
let casepath = `/`Enum.a      // parse error today
Similar to:
func await (...) -> ...
`await`(...)                  // OK
Not great, but also not that bad? It would still cause a source break but would allow continued use of an operator with semblance to the backslash. Perhaps I'm missing something obvious as to why this is not already allowed today.

That's an interesting idea! It seems reasonable to allow backticks on operators as well as identifiers, and that would allow disambiguation of operators from regex literals. I will investigate this further.

hooman · May 6, 2022, 6:32pm

Speaking of @tem's idea. As a less noisy alternative to # that we can explain away why it is used and what it means, there is another possibility:

    `/.../`
    // Extended format being:
    ``/.../`` // add ` as needed

This saves ' and does not conflict with existing uses. What do you guys think?

1-877-547-7272 · May 6, 2022, 6:54pm

This is inaccurate; . within an operator is allowed and has special parsing rules. ..< is a valid operator, but >.. is not.

From The Swift Programming Language:

You can also define custom operators that begin with a dot ( . ). These operators can contain additional dots. For example, .+. is treated as a single operator. If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere. For example, +.+ is treated as the + operator followed by the .+ operator.
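As a small illustration of the rule quoted above, here is a sketch that declares the book's `.+.` example as a real operator. The element-wise meaning given to it is an assumption made purely for the example.

```swift
// `.+.` begins with a dot, so it may contain further dots and is lexed
// as a single operator token.
infix operator .+. : AdditionPrecedence

// Hypothetical meaning for the example: element-wise addition of two arrays.
func .+. (lhs: [Int], rhs: [Int]) -> [Int] {
    zip(lhs, rhs).map { pair in pair.0 + pair.1 }
}

let sums = [1, 2, 3] .+. [10, 20, 30]
print(sums)

// By contrast, per the quoted passage, `+.+` is treated as `+` followed
// by `.+`, so it cannot be declared as one operator.
```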

masters3d · May 6, 2022, 9:07pm

+1 good alternative

hooman:

Speaking of @tem's idea. As a less noisy alternative to # that we can explain away why it is used and what it means, there is another possibility:
    `/.../`
    // Extended format being:
    ``/.../`` // add ` as needed
This saves ' and does not conflict with existing uses. What do you guys think?

tem · May 6, 2022, 10:03pm

I like backticks as a delimiter because it's already widely used when embedding code in Markdown, and regexes are more program than data. Backticks are also not as noisy as pound signs.

But there is a slight issue that would make it ambiguous if backticks could also surround operators in the future:

`/.*/`    // operator or regex literal?

By the way, a potential issue with backticks surrounding operators is that they could be juxtaposed with backticked identifiers:

`/``default`

which at the very least doesn't look clear. But it would be extremely rare.

Backticks could also extend neatly to generalized foreign language multi-line literals by mirroring Markdown:

```sql
SELECT *
FROM users;
```

But I'm not sure what the compact one-line version would look like, whereas #sql'SELECT * FROM users;' would work well I think (using backticks even, but I can't figure out how to put backticks inside backticks here).

Right now I'm more in favor of the "allow only extended regex literals" alternative with the proposed #/../# syntax, even though I still think that anything but a clear and firm promise to never introduce /../ may have a self-reinforcing effect if CasePaths gets assimilated into the language, because library authors would avoid creating new operators that are in jeopardy of future source break, wouldn't they? I think #/../# could be quite noisy in some situations but the best way to find out is to let people use it for some time.

Either that or completely rejecting regex literals (for now) to give us some more time to evaluate the alternatives. All of the other new regex-related features will be plenty to get excited about when they drop!

Edit:

Actually, (I've had another change of heart) if the popular CasePaths library is given enough time to migrate the source breakage (via the -enable-bare-slash-regex compiler flag) then I guess accepting /../ would be the pragmatic choice, if not the ideal one. Aside from the concrete source break, cutting into the very limited pool of viable custom operator symbols hurts a bit, but custom operators are admittedly an already somewhat esoteric feature and if the actual demand for /-containing (prefix) operators is very low (which it seems to be, outside of CasePaths), then it seems like a reasonable compromise.
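For reference, a rough sketch of how a package could opt in to that flag through SwiftPM, assuming the `unsafeFlags` build setting is acceptable for the target in question. The package and target names are hypothetical.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "MyLibrary",  // hypothetical package name
    targets: [
        .target(
            name: "MyLibrary",  // hypothetical target name
            swiftSettings: [
                // Opts this target in to bare /.../ regex literals.
                .unsafeFlags(["-enable-bare-slash-regex"])
            ]
        )
    ]
)
```

As far as I know, SwiftPM restricts packages that use unsafe flags from being consumed as versioned dependencies, so this is probably most useful for trying the mode out in a root package.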

Additionally, generalized foreign language literals are not precluded by this choice. Deciding on regex literals before a more general design might well make regex literals inconsistent with those future foreign language literals, but that's not a big issue. I don't even know of any languages with such a generalized feature (Markdown code blocks don't count), whereas regex literals do exist in other languages, so it's ok for regex literals to be treated specially in Swift too IMO (even if regexes are often obtuse and 'arcane' or 'legacy').

I'm very interested in a comprehensive, more modern, replacement for regexes as @Nevin has advocated, but I'm not aware of any existing prototypes/designs that we could evaluate today. It seems like something that could take years and I don't see the benefit of holding back on first-class regex support that has already been (mostly?) implemented. If there is some all-new design that's clearly better than regexes, then I don't see why that would be precluded by regexes existing in the language. There were vague claims to that effect, but one could also counter, vaguely, that having first-class regex support sets the bar for any alternative to have to clear, and would make it easier to compare and contrast the two.

YOCKOW · May 6, 2022, 11:46pm

Sorry for one more digression.

This could be my last post in this thread.

Thank you for noticing my posts.

I am a mere bikeshedding person. I mean I thought it was enough for me to write my own stance.
I didn't imagine I would mention something about the formality. I just expressed my fear, though.
The reason I had concerns may be that I had heard, secondhand, that this was not the first time they had behaved this way.

———

Also my intention is being distorted:

My fear is not that their opinion is different from mine, but is that they don’t seem neutral.

Douglas_Gregor · May 7, 2022, 12:46am

allevato:

Jumhyn:

I initially expressed ambivalence about this, but other comments have convinced me that this is a surprising departure from precedent that I don't believe has been adequately addressed in the review thread (unless I've missed a comment somewhere), and I'm not sure I understand the implications.

Would a later proposal (such as improved optics features) which had to make source-breaking changes to the bare regex syntax be considered 'truly' source breaking, since it would break -enable-bare-regex-syntax mode? Why is this being proposed as a production flag rather than an unreviewed -enable-experimental-bare-regex-syntax ?

Yes, this is quite a surprising departure from what we've been told in the past about not wanting there to be different "dialects of Swift" via the use of feature flags.

I commented about this earlier in the thread. To quote myself:

I just started a separate thread to discuss this promised design. My central thesis here is that we want to have a general way for Swift 5.x to opt into the source-breaking changes we've queued up for Swift 6 one at a time. This keeps coming up for Swift 6 features (-warn-concurrency for data-race safety; requests for a "require any on all existentials" mode) because folks want to adopt new features as soon as they can. This isn't creating permanent dialects, which we want to avoid: it's creating an incremental adoption path that smooths the transition to Swift 6. This way, developers won't have to confront every single breaking change all at once when they flip the language mode, which could be daunting. We can do better for the incremental adoption path and also get the syntax we want.

We have already made a number of source-breaking changes to the language that are queued up for Swift 6, and several of them have far larger impact on source code than what's being discussed here (any and Sendable checking will hit pretty much every bit of Swift code everywhere). We have to manage this transition well, or we'll end up with a permanent Swift 5/6 split. Against that backdrop, I consider the problems with the source-break of /.../ to be fairly minor.

That's why I'm looking for arguments as to why /.../ is the wrong destination that don't rely on the source compatibility angle. There really haven't been that many---folks that only want #/.../# but not /.../ tend to cite source compatibility alone. #regex(...) gained early favor in this thread, but I've already said why I think it's worse than the other options presented.

Doug

johnno1962 · May 7, 2022, 4:06am

But avoiding source breaks is important, Doug, and even one of the proposal authors conceded this was a showstopper during the pitch phase.

For me the strategy of managing the transition using feature flags is worse than the original problem, breaking the idea that Swift syntax is a linear progression forward. Far better not to create the problem in the first place, no?

The "necessity" for the source break arises directly from the fact that the bare /regex/ syntax is the wrong destination. This, in turn, is a consequence of what is in my opinion the naïve view that it will ever be possible to contain the full range of possible regexes inside single-character delimiters, let alone one that is already an operator in Swift. It's as futile as trying to construct the enclosure for a tiger with secondhand chicken wire and as unlikely to end well. The result is weird escaping rules and whitespace sensitivity that need to be documented, and the occasional mis-parses that have already been mentioned.

You need a distinct introducer that is not currently part of the language, for example #/ to switch the lexer into regex tokenising mode and a distinct terminator /# to cater for elements that may come up inside the regex. The #/regex/# syntax is no great beauty but fits this requirement well and also borrows from raw strings the notion that while it is essentially a string \ escapes are passed through.
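A short sketch of the parallel with raw strings, assuming a Swift 5.7 or later toolchain on a platform where the `Regex` type is available (for example macOS 13 or later). The pattern and input are made up for illustration.

```swift
// Raw string: inside #"..."#, backslashes are passed through untouched.
let rawString = #"\d+/\d+"#

// Extended regex literal: `#/` opens and `/#` closes, so the `/` inside
// the pattern needs no escaping.
let fraction = #/(\d+)/(\d+)/#

// In bare-slash mode the same pattern would have to be written with the
// inner slash escaped: /(\d+)\/(\d+)/

if let match = "3/4".wholeMatch(of: fraction) {
    print(match.1, match.2)
}
```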

Panajev · May 7, 2022, 6:15am

Another reason some, like myself, disagree that it is the right destination is that it adds a lot more noise to the regex itself: / has to be escaped, and it is not so infrequent in regexes that aren't trivially short.

I'm not sure why, but the source compatibility issue is being hand-waved away, as if this change were clearly and evidently understood to be as important as the other source-breaking changes you mentioned and thus a necessity (putting the burden of proof on the community)… and this is what I have not seen: the reason why having the bare /…/ is important (the visual noise angle seems a bit overemphasised). The other changes were bigger source-breaking changes, but their importance was also very high.

Still, from a clarity point of view alone, escaping / seems a much worse scenario than decorating the delimiters as #/…/# … and then the cases and libraries it breaks are the cherry on top :).

jayton · May 7, 2022, 12:10pm

I, for one, touched on source compatibility, but my central complaint is that /.../ complicates the mental model of the language while also being a manifestly bad syntax for regexes, since matching slashes is very common.

If the answer to that is to use “extended” literals either every time a slash is needed, or all the time, it’s hard to see how using /.../ some of the time will be an aesthetic or legibility advantage.

(Incidentally, citing the legacy of ed, which is mainly remembered for its outstandingly user-hostile syntax, in support of proposed Swift syntax is… difficult to describe politely.)

Michael_Ilseman · May 7, 2022, 1:18pm

That is a gross mischaracterization of our dialogue. If you're willing to engage in good faith, I'm happy to engage with you further on these costs or any technical details. But stop misrepresenting my views.

The core team has an aesthetic preference for /.../, in an if-we-had-a-time-machine kind of scenario. We don't have a time machine, and so this carries certain costs and impacts. The goal of this proposal is to accurately and fully detail those costs and present the most workable solution we can. It's up to the core team to weigh the value of an aesthetic preference against the impact of doing it now.

Breaking a popular 3rd party library is one of those costs, and in my view is the "most compelling". Especially because your other points didn't make much sense:

The first argument you link to:

This proposal does in fact propose a #/regex/# syntax for the contained-/ problem and other benefits. The proposal also goes to great pains to describe how the lexer decides what to do when encountering a /. If you read the proposal, you will see that both of these are present, and if there's any other information you need to help understand how Swift's lexer works, you can ask for clarification.

The second argument:

affects a different programming language than Swift.

And the rest of your arguments were addressed in the portion of that reply that you didn't quote:

Again, the syntactic/semantic blurring affects a different programming language than the one we're proposing these changes to.

Ben_Cohen · May 7, 2022, 2:19pm

Can you expand on why this differs from string literals, which have the same situation for their delimiter?

I am guessing the reasons would be either:

" is uncommon in strings, but / is more common in regexes; or
strings are so fundamental to the language that even though the same reasoning applies, #"..."# for all strings would be unacceptable, but is more acceptable for the less common case of regex literals.

I'd be interested if there's a third reason I'm missing. I think the first reason is the most common one cited. Speaking personally, I don't find / cropping up in so many regexes (bear in mind they must appear in the expression, not the matched string) that it justifies a blanket "just always use the escaping one" rule of thumb. Though I gather that might be the general feeling in the Perl community?

Jumhyn · May 7, 2022, 2:29pm

A couple more plausible reasons come to mind (though I don’t really agree with the second, insofar as it implies I’d drop the ‘bare’ string literal syntax were source compatibility not a concern):

" as the string literal delimiter is a much more pervasive term of art than / is for regexes.
The source compatibility break required to change string literal delimiters would be far too large, compared to solving it at the proposal stage for regex literals, even though the same issues apply to both.

johnno1962 · May 7, 2022, 2:44pm

Calm down man, I've been patiently trying to engage you in good faith for 18 months now trying to point out that in my experience trying to shoe-horn Perl regex syntax into Swift wasn't perhaps a particularly good idea.

And now you find yourself sitting on a review having to find more and more inventive strategies to stage a source break (I've been there), which is in my opinion very avoidable if you would just rephrase bare /regex/ syntax to be a future direction that can't be pursued until TCA makes room for it.

This was an interesting and relevant thread about the lengths someone had to go to in order to replicate Ruby's parsing rules for regex literals as a result of using / as a single-character delimiter. I remain unconvinced that the analysis in the review is exhaustive of all the problems that users will encounter.

I tried to do you a favour by not quoting that as Hamish was a little more honest further down the same thread.

Speaking freely I'm tired of trying to nudge this conversation in the right direction when the proposers simply don't seem to be taking input. Deprecating TCA for the sake of an aesthetic preference is out of the question ("compelling counter-argument" is the phrase you used) in the real world and let's not waste energy trying to pretend it is or it can be "managed".

Ben_Cohen · May 7, 2022, 4:22pm

Posting as review manager for some moderation feedback:

This thread is getting a little heated again, so I'll ask everyone to keep in mind it's important to stay civil. I know it's easy to get swept up in an argument and slip into a more jousting style of debate – I'm guilty of it myself sometimes – but in formal review threads it's particularly important to engage with arguments on their merits.

@johnno1962 Michael's complaint that you were representing his view in bad faith seems well-founded.* It is clearly not the case that "the proposal authors conceded this was a showstopper". Michael was merely acknowledging it was worthy of serious consideration. So to say

is off-base. This is also demonstrated by Hamish's posts:

"Not taking input" and "not conceding that they are wrong and you are right" are not the same things.

Additionally,

is also a misrepresentation of what's being proposed. Making this change would not deprecate CasePaths. It would require CasePaths to migrate to a different operator or take a different approach (such as adopting a native key path feature if it's implemented). It is perfectly reasonable to say this is unacceptable – but that's not the same thing as claiming the whole of the Composable Architecture framework is being threatened with deprecation.

In all of these cases and others, it appears rhetorical force is being used to strengthen an argument. This approach doesn't work (the people you need to convince – the Core Team – will not find your posts more compelling) but it does have the effect of making the evolution thread more hostile. Please take a step back and think about how to make your case without these techniques.

* Nevertheless @Michael_Ilseman it's better to just point out the actual meaning of your quoted passage, rather than call someone out for misquoting it. If someone consistently does this kind of thing, the review manager will step in to ask the person to stop.

masters3d · May 7, 2022, 4:49pm

Ben_Cohen:

Making this change would not deprecate CasePaths. It would require CasePaths to migrate to a different operator or take a different approach (such as adopting a native key path feature if it's implemented). It is perfectly reasonable to say this is unacceptable – but that's not the same thing as claiming the whole of the Composable Architecture framework is being threatened with deprecation.

In all of these cases and others, it appears rhetorical force is being used to strengthen an argument. This approach doesn't work (the people you need to convince – the Core Team – will not find your posts more compelling) but it does have the effect of making the evolution thread more hostile. Please take a step back and think about how to make your case without these techniques.

The contention in this thread has been almost laser-focused on the bare, singly delimited /…/ form for simple regex literals.

I would suggest that the core team defer the bare syntax topic to a new proposal focused on deprecating prefix / and repurposing it for simple regex literals.

It would be great if the new language group that is being put together could tackle these types of changes.

Ben_Cohen · May 7, 2022, 5:02pm

The extent to which proposals should be broken up into separate proposals is something we've discussed a fair bit in the core team, especially when it comes to large groups of themed proposals such as we've seen with concurrency, string processing, and generics. It is tricky, because huge numbers of "micro-proposals" can lead to fragmented reviews that interrelate but are hard to tie together. They can also create review fatigue.

For example, this proposal was split apart from the very closely related proposal around regular expression "interior" syntax. The consequence of this was that the other proposal received almost no commentary – and what commentary it did receive was very closely related to literals. So in that case, it seems it may have been more separate proposals than was necessary (though another factor is that "giant" proposals are hard to read, even if they do form a cohesive whole).

In very many proposals, there is often "one specific thing" that drives a lot of the discussion. But it is usually the case that breaking up the proposal isn't the right fix for this. One approach the core team has taken is to accept a proposal "in principle", but put it back into review for further feedback on other aspects (sometimes combined with amendments to the proposal addressing feedback during review).