SE-0354: Regex Literals

Aha! I'd been hoping to see a stat like that. I wrote above:

1 out of 2894 certainly meets my threshold for “tolerable.” That removes my concern about allowing /…/. (Other concerns from the OP still stand, none dealbreakers.)

1 Like

It should be noted that I was only talking about projects that broke due to ambiguity — I didn't count projects that broke due to the use of a prefix / operator. (I probably should have said 1 out of 2879 and not included those projects at all — I'll edit my original post.)

Here are the findings in full:

5 Likes

To repeat from the pitch thread, the issue is that the CasePaths package shown in the previous post is actually quite popular because it is a dependency of swift-composable-architecture (aka TCA). The CasePaths / operator gets used in every SwiftUI end-user application that uses TCA.

My expectation is that most of the public packages tested above are frameworks and not apps, and that therefore those numbers are probably not a good reflection of how many end-user apps are using TCA. Under this proposal, each of those apps (whose number is not known) would get a very pervasive source break that users of TCA are not looking forward to. (The operator is used multiple times in every screen of a TCA app.)

That operator was not arbitrarily chosen; it's there primarily to deal with a shortcoming in Swift's optics system (i.e. Swift has lenses but not prisms). / was chosen because of its likeness to the \ operator for keypaths (lenses). The pitch thread discussed the possibility that a future version of Swift could incorporate direct language support for casepaths (prisms), but that's a completely separate evolution proposal on a completely separate timeframe.
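To make the lens/prism gap concrete, here is a minimal sketch in plain Swift. The `User` and `LoadState` types and the `loadedValue` helper are made up for illustration; none of this is the CasePaths API.

```swift
// Hypothetical types, for illustration only.
struct User {
    var name: String
}

enum LoadState {
    case loaded(Int)
    case failed(String)
}

// Key path (the "lens"): built-in syntax for reading and writing a property.
let nameLens = \User.name
var user = User(name: "Blob")
user[keyPath: nameLens] = "Blob Jr."

// Enum case (the "prism"): there is no built-in path syntax,
// so each extraction has to be written by hand.
func loadedValue(_ state: LoadState) -> Int? {
    if case let .loaded(value) = state { return value }
    return nil
}

// Embedding comes for free, because a case constructor is already a function.
let embed: (Int) -> LoadState = LoadState.loaded
```

A library that wants a single value bundling `embed` with the extraction has to invent its own spelling for it, which is the role the prefix / plays in CasePaths.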

Personally, I'd be happy with direct language support for casepaths and to see TCA use that support (as would TCA's authors by their own admission). That would be great and I would heartily applaud it. Even so, doing that won't go back and fix all the existing code that will no longer compile.

You can't make an omelette without breaking eggs. The question is: do we really like omelettes that much?

7 Likes

This is my vote as well.

1 Like

I opened a PR with some updates and clarifications.

I broke out some of the aesthetic motivation into multiple sentences instead of it all being glommed into one potentially-confusing paragraph. Thanks to @Karl and others who requested some clarification.

I added a future direction for library-extensible support (Thanks again to @Karl for pointing out its omission):

Library-extensible protocol support

A regex literal describes a string processing algorithm which can be run over some model of String. The precise semantics of running over extended grapheme clusters vs Unicode scalar values is part of [Unicode for String Processing][regex-unicode]. Libraries may wish to extend this behavior, but the approach presented by various ExpressibleBy* protocols is underpowered as libraries would need access to the structure of the algorithm itself.

A better (and future) approach is to open up the regex parser's AST, API, and AST actions to libraries. Here are some examples of why a library might want to customize regex:

A library may wish to provide support for a different or higher level model of string. For example, using localized comparison or tailored grapheme-cluster breaks. Such a use case would need access to the structure of the string processing algorithm literal.

A library may wish to provide support for running over another engine, such as ICU, PCRE, or Javascript. Such a use case would want to pretty-print Swift's regex syntax into one of these syntax variants.

A library may wish to provide their own higher-level structure around which regex literals can be embedded for the purpose of multi-tier processing. For example, processing URLs where regex literal-character portions would be converted into percent-encoded equivalents (with some kind of character class customization/mapping as well). Additionally, a library may have the desire to explicitly delineate patterns that evaluate within a component vs patterns spanning multiple components. Such an approach would benefit from access to the real AST and rich semantic API.

I added an alternative for forbidding features not present in the DSL (thanks to @ensan-hcl for mentioning this):

Restrict feature set to that of the builder DSL

The regex builder DSL is unable to provide some of the features presented, such as named captures as tuple labels. An alternative could be to cut those features from the literal out of concern they may lead to an over-use of the literals. However, to do so would remove the clearest demonstration of the need for better type-level operations including working with labeled tuples.

Similarly, there is no literal equivalent for some of the regex builder features, but that isn't an argument against them. The regex builder DSL has references, which serve this role (though not as concisely), and they are useful beyond just naming captures.

Regex literals should not be outright avoided; they should be used well. Artificially hampering their usage doesn't provide any benefit and we wouldn't want to lock these limitations into Swift's ABI.
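The contrast between named captures and builder references can be sketched as follows. This assumes the `#/.../#` delimiter and the RegexBuilder API as they later shipped in Swift 5.7; the date-like pattern and variable names are only illustrative.

```swift
import RegexBuilder

// Literal form: the capture name becomes a tuple label on the match output.
let literal = #/(?<year>\d+)-(?<month>\d+)/#

// Builder form: no tuple labels, so a Reference identifies the capture instead.
let year = Reference(Substring.self)
let built = Regex {
    Capture(as: year) { OneOrMore(.digit) }
    "-"
    Capture { OneOrMore(.digit) }
}

let input = "2022-05"

// Labeled tuple access on the literal's output.
let literalYear = input.firstMatch(of: literal)?.output.year

// Reference-based access on the builder's match.
let builtYear = input.firstMatch(of: built).map { $0[year] }
```

Both spellings reach the same capture; the literal is terser, while the reference is an ordinary value that can be passed around.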

And I added a sub-section to discussion about #regex(...) extensibility to foreign language snippets (thanks to @rvsrvs for reminding me of our extensive discussion from the pitch thread):

On future extensibility to other foreign language snippets

One of the benefits of #regex(...) or re'...' is the extensibility to other kinds of foreign language snippets, such as SQL. Nothing in this proposal precludes a scalable approach to foreign language snippets using #lang(...) or lang'...'. If or when that happens, regex could participate as well, but the proposed syntax would still be valuable as regex literals are unique in their prevalence as fragments passed directly to API, as well as components of a result builder DSL.

12 Likes

I would like to expand a bit more on the possibilities opened up by the magic literal solution:

We can have a local contextual scope for the arguments of the magic literals (generally, not just for #regex). This means that we could have a separate contextual interpretation for the syntax and delimiters of such arguments (parameters).

This way, we can define #regex/…/ to be a shorthand for trailing literals, the same way that map{…} is a shorthand for map( {…} ).

The full syntax would be #regex( /…/ ), and / would keep its meaning everywhere else. This opens the possibility of supporting Perl-style modifier prefixes for the regex literal as well as even additional arguments to select the language variant and behavior of the literal. For example, we would be able to support all the Perl variants noted by @tim1724 such as: #regex(qr{...}) or even #regex(Perl, qr{...}). Privileging #regex with # shorthand would give us #(qr{...}).

This would offer a general extensible foundation to deal with such things and we can add syntactic sugar to the frequently used ones, the same thing that we do with [Int] vs Array<Int>.

On the other hand, a tiny bit of syntactic noise (e.g. #/.../ instead of /.../) might not be such a bad thing. As far as I know, the intent is to discourage the overuse of bare regex literal and a tiny bit of noise may tip the scales in the right direction.

I will address the comments that downplay the extent of the damage to the language caused by adopting the bare /.../ later when/if I can spare some time to do so.

2 Likes

I posted my message above before reading this, but IMHO the logic I propose to get from #regex(...) to #/.../ still stands and offers a better solution that avoids source-breaking changes.

  • What is your evaluation of the proposal?

-0.25

Honesty compels me to admit that, despite regexes being generally terrible API, these literals will play an important part in Swift's string processing story. However, I see these literals as an API of last resort where copy pasting an existing regex is most important, or where inline capture is required. Therefore they're unworthy of the privileged syntax given to them by this proposal.

Nothing in this proposal justifies the breakage and edge cases introduced by the use of /. In fact, the proposal itself spends more time explaining the edge cases introduced by this usage than the actual capabilities of the literals themselves. That, in and of itself, should indicate that / is not a good choice.

In addition to the general issues introduced by the use of /, which the proposal actually addresses fairly comprehensively, I have to once again reiterate and echo the concerns voiced in this review and the various pitches leading up to it: nothing in this proposal justifies the relatively massive source breaking change this use of / represents. @rvsrvs is 100% correct: breaking TCA will break hundreds, if not thousands of apps. In fact, it seems likely this represents the largest source break since Swift declared source stability five years ago. Of course, we can't quantify this breakage due to Apple's continued neglect of the Swift ecosystem and the lack of analytics around package usage. But the library itself has over 6k stars on GitHub, making it one of the most popular Swift libraries out there.

In addition to the concrete impact this breakage will have, this raises important considerations for the community in general. Specifically, how popular does a library have to be for Apple to avoid breaking it (when such breakage is easily avoidable)? If TCA isn't popular enough, is Alamofire? Alamofire currently has over 37k stars on GitHub and is usually recognized as, if not the most popular Swift library, certainly one of the most popular. Yet it represents just a single entry in the compatibility suite. Does that mean it's subject to breakage at any point? Why would people spend their valuable time developing unique solutions to problems in the Swift community when it could be broken at any time?

  • Is the problem being addressed significant enough to warrant a change to Swift?

Probably, though the other parts of the regex proposals are far more valuable.

  • Does this proposal fit well with the feel and direction of Swift?

Not especially. Nasty, complex, inline literals are usually a feature of last resort. But, given the placement within Swift's sphere of features, it could be. If this proposal was a first implementation of custom literals, then it might be more valuable. If it was exploring the future of protocols for literals, that might be valuable. If it was exploring generalized inline captures from literals, that might be more valuable. But in the end this proposal feels most like something we want for copy / paste compatibility for regexes we find on Stack Overflow, which has never been a priority for Swift before.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Regexes in other languages suck pretty hard. The overall set of proposals certainly puts Swift out in front of them, for the most part.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been tracking the various pitches and proposals and tracking the development of the string processing library on GitHub.

16 Likes

There's no question that / is the canonical delimiter for regex. But that comes from contexts where they are not used in direct conjunction with other syntax (grep, sed, etc.). To my mind, it's awfully important to know just how many other general-purpose programming languages use this literal syntax embedded as a normal expression. I don't claim full knowledge*, but I know of three, the same ones mentioned in the proposal: Javascript, Ruby, and Perl. Are there others? And then of those, how many:

  • use / for comments (Perl and Ruby use #)
  • have custom operators like Swift (that may include slashes)

Any? And then when we add on operator references as expressions, I think Swift is all alone in this region of syntactical space.
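For anyone unsure what "operator references as expressions" means here, a small sketch in plain Swift (the numbers are arbitrary):

```swift
// `/` as an ordinary infix operator.
let half = 10 / 2

// `/` as an unapplied operator reference: the operator itself is the argument.
// Folds as ((1000 / 100) / 5) / 2.
let quotient = [100, 5, 2].reduce(1000, /)
```

In the second line a bare `/` sits where an expression is expected, which is exactly the position where a `/.../` literal would also have to be recognized.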

So, while I agree that this would be kind of a cool feature to have, I am skeptical that slash-literals-as-expression rises to the universality that the proposal suggests. And given that, I don't think the costs would be worth it.

A final point is that there will always be some places where code is read, that can't avail themselves of rigorous highlighting logic. For example, GitHub. A straightforwardly unambiguous delimiter and rules make it easier for that highlighting to be decent, and it aids reading even when there is no highlighting. An extra mark, like r'/[a-z]+/' or a pair of octothorpes, is no great burden in my opinion, and might even be a positive.


*And for some reason I'm having a devil of a time searching the web for this information.

7 Likes

Not to take anything away from the rest of your review, but this argument is not very convincing. At best, it would be an argument that the proposal document isn't very well drafted, which would be unfortunate but also something that reviewers would ideally look past to focus on the substance of the proposal. In this case, though, it's just misplaced: the capabilities of the literal are variously covered in SE-0355 (the regex syntax) and SE-0350 (the general API of the Regex type), so you wouldn't expect them to be redundantly described in SE-0354, which is specific to wrapping up the regex syntax into a literal.

Like I said, this doesn't take anything away from the rest of your review; your review stands by itself well enough without trying to make a point about the quality of the proposal document.

Not sure I see your point here. If the literal syntax were fully included in one of the other proposals, my point would be the same, I would've just said the literal syntax section spends more time describing its edge cases rather than the actual functionality. Perhaps the point would've been blunted since the relative size isn't as extreme, but the point stands. Boiling it down into its own proposal just makes the point more obvious. But I'll express it differently (and more generally) if that helps: new features which have to be described more in terms of the edge cases they have than on the actual functionality they add probably aren't a good idea.

5 Likes

No, you have it backwards. You are using the fact that the description of something has been pulled into a separate proposal (or section of a proposal) from the description of its functionality as evidence that it doesn't have much functionality.

1 Like

My understanding is that, without this proposal, the only things Swift loses by requiring use of the Regex initializer are inline captures and their integration into the Regex DSL. Is that incorrect? That you can represent a regex as a string seems separate from the actual literal, is it not?
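A rough sketch of the difference being asked about, assuming the `Regex` string initializer and the `#/.../#` literal as they later shipped in Swift 5.7 (the pattern and input are made up):

```swift
// Built from a string at run time: a bad pattern throws,
// and the captures are type-erased (Regex<AnyRegexOutput>).
let dynamic = try Regex(#"(\d+)-(\d+)"#)

// Built from a literal: the pattern is checked at compile time and the
// captures appear in the type (Regex<(Substring, Substring, Substring)>).
let typed = #/(\d+)-(\d+)/#

let input = "10-20"

// Typed output: captures are tuple elements.
let typedLow = input.firstMatch(of: typed).map { $0.output.1 }

// Type-erased output: captures are looked up by index and may be absent.
let dynamicLow = input.firstMatch(of: dynamic).flatMap { $0.output[1].substring }
```

So the string initializer does cover the "regex as a string" case; what the literal adds is the compile-time checking and the statically typed captures.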

Note that while this is certainly a consideration, the use in Javascript of / for division, regexes, and comments, suggests this is not really a problem in practice. This, combined with the fact that regexes cannot start with a space or ) means syntax highlighting using cross-language highlighting techniques found in places like GitHub or Textmate etc can be implemented with a very high degree of accuracy.

2 Likes

This point seems specious. There's zero connection between the use of something in JavaScript and whether it's a problem in practice. (I'm half joking, but still.)

2 Likes

Are you speaking as the review manager? I don't see how my point is in any way inappropriate to a review of this proposal.

No, not speaking as review manager here. Your comment was perfectly appropriate. I am just pointing out the situation already exists in Javascript and (unless I'm mistaken) is not challenging to syntax highlight.

What is your evaluation of the proposal?

I think it's a good idea to have regex literals. But the /.../ syntax appears to create a lot of bizarre situations in Swift that would be better avoided. Having only #/.../# delimiters would be perfectly fine and avoid creating a disruption.

Is the problem being addressed significant enough to warrant a change to Swift?

I don't think the /.../ syntax makes it worth dealing with the breakage it would cause. But in general having a way to express regex literals seems worth it.

Does this proposal fit well with the feel and direction of Swift?

To me, this proposal does a good job of discrediting the /.../ syntax by listing all the limitations and features that would have to be slightly broken in unexpected ways. It's sort of going timidly backward on things that existed since Swift 1.0. There's no evidence unapplied operators and prefix / are causing any harm, which should be the threshold for breaking them.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I have occasionally used languages using / as a regex delimiter, and I don't feel it's a good delimiter. A good delimiter is one that you almost never need to escape, and in my experience those are generally balanced delimiters like (...). But #/.../# seems good too, as I can hardly see any situation where you'd need to escape something in it.
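As an illustration of the escaping point, a sketch assuming the `#/.../#` delimiter as it later shipped in Swift 5.7; the path pattern is only an example:

```swift
// With a bare-slash delimiter, every `/` in the pattern needs a backslash:
//   /usr\/lib\/modules\/([^\/]+)\/vmlinuz/
// With the extended delimiter, the slashes can be written as they are:
let kernelPath = #/usr/lib/modules/([^/]+)/vmlinuz/#

// The first capture group is the directory name between `modules/` and `/vmlinuz`.
let version = "usr/lib/modules/5.15.0/vmlinuz"
    .firstMatch(of: kernelPath)?.output.1
```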

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Read the proposal and the conversation about it. Some part of the earlier pitch.

10 Likes

The idea here is that tools such as GitHub, which are general cross-language tools, might be significantly disadvantaged by hard-to-handle rules even if we think it's OK for the Swift parser and tooling to incorporate them.

Given this concern, I think it's reasonable to cite Javascript as a counter-example.

I guess we could enumerate the possibilities:

  1. Highlighting Javascript regexes in these tools is not a big problem, and won't be for Swift either
  2. Highlighting Javascript regexes was a massive pain, but hopefully that pain can be re-used for Swift
  3. Highlighting Javascript regexes is a major issue, and should be a cautionary tale for Swift
  4. Javascript is sufficiently different from Swift that it's fine for JS but not Swift

I must admit I'm making an informed guess but I'm pretty sure the answer is 1. I'm not aware of 3 being the case, and playing around with GitHub and Textmate suggests its highlighting of JS regexes is pretty robust. For 4, Swift and JS certainly differ, but I'm not sure if any of those differences are material here. Possibly JS's rules around semicolon insertion vs Swift's multi-line statements might be one.

2 Likes

JavaScript has no way to define custom operators, so it cannot serve as a rebuttal. In fact, a certain number of projects will be broken by bare /.../, as already mentioned in this thread and the pitch thread.

2 Likes