SE-0354: Regex Literals

I very clearly and explicitly stated the syntax I object to. There isn't a single # in my previous comment.

  • What is your evaluation of the proposal?

-1. I'd rather have seen a good regex literal design implemented than a respin of familiar but horrible designs of old.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Probably. Though, to be honest, it's hard for me to see this literal design as much of an improvement over just encoding regular expressions into strings. These literals offer little to no readability gain, in my opinion.

  • Does this proposal fit well with the feel and direction of Swift?

+0

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

This feels almost identically bad to the most common regex literal designs in other languages I've used, modulo some Swift-specific problems associated with the delimiter choice. (I don't find this surprising, since the explicit goal was to copy the most familiar regex literal syntax.)

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I paid attention to early pitches and completely lost interest when the goal seemed to be to maximize familiar syntax at any cost

5 Likes

I remember some of the previous proposals mentioned having user types which could be expressed as Regex literals. Is that part of another proposal or has it been dropped (at least for now)?

Generally in Swift, the standard library types are constructed from literals using protocols, such as ExpressibleByStringLiteral, ExpressibleByIntegerLiteral, etc - but this seems to be the first time (as far as I know) where a type is just outright created by the compiler from a literal with no corresponding protocol. This is all the proposal says about it:

The compiler will parse the contents of a regex literal using regex syntax outlined in Regex Construction, diagnosing any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Regex literals allows editors and source tools to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL.

That in itself is worth calling out, but actually, I ask because I've been thinking about building URL patterns using a regex-like syntax.

We can't just use a normal regular expression, because even literal segments may need to be matched through percent-encoding (k may need to match k, %6b and %6B; non-ASCII characters need longer sequences of bytes) and take other normalization processes into account. It really is a custom pattern, but there are lots of interesting ways you can express them with regexes or incorporate regexes, and as part of that, I may want/need a deeper understanding of the AST or the ability to construct my own type from a user's literal (or restrict some regex features). I haven't really thought deeply about what I would need for that, or prototyped it, but it is something I'd like to play with so I'd like to know what happened to the protocol.
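To make the percent-encoding point concrete, here is a minimal sketch, assuming Foundation is available. The helper `matchesLiteral` is hypothetical and only illustrates why a plain character comparison is not enough; it is not part of any URL library. Note that 0x6B is the ASCII code for k, while 0x4B is K.

```swift
import Foundation

// Hypothetical helper: does `candidate` spell the literal `segment`
// once percent-decoding has been applied?
func matchesLiteral(_ candidate: String, segment: String) -> Bool {
    // removingPercentEncoding returns nil for malformed escapes,
    // which then simply fails the comparison.
    candidate.removingPercentEncoding == segment
}

let plain = matchesLiteral("k", segment: "k")          // same character
let lowerHex = matchesLiteral("%6b", segment: "k")     // escape, lowercase hex digits
let upperHex = matchesLiteral("%6B", segment: "k")     // escape, uppercase hex digits
let otherLetter = matchesLiteral("%4B", segment: "k")  // decodes to "K", not "k"
let nonASCII = matchesLiteral("%C3%A9", segment: "é")  // two UTF-8 bytes for one character

print(plain, lowerHex, upperHex, otherLetter, nonASCII)
```

A real URL pattern would also have to deal with normalization and reserved characters, which this sketch ignores.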

8 Likes

That is future work. The ExpressibleBy* approach is geared around the needs of data literals, which it barely serves. Regex literals are more akin to algorithm literals. Thus, I think it is better to improve the library-compiler interface here. From an early thread:

Even better than raw mode is the ability for the regex parser to pretty-print its AST using a requested syntax variant.

Either way this is incremental and future work. Nothing in the proposal precludes this.

This is a fascinating use case and I'm really interested in exploring it! (outside the scope of this specific proposal, of course).

4 Likes

I don't think we should hamper regex literals because of compiler limitations preventing the same goodness from appearing for the DSL. Doing so is counterproductive if we care about these limitations actually being addressed. Tuples have been muddy and under-featured in Swift (lots of historical reasons). There's never been as tangible a demonstration of this limitation within the Swift toolchain until now.

As an aside (I don't think this is necessarily your argument, but I can see it being related), I support developers having a policy or linter rules against using certain features. I could understand regex literals being that feature for some. However, I think regex literals serve a useful role for large swaths of developers.

4 Likes

I addressed this thoroughly in the pitch thread. Copying some of that content here for easy viewing:

Note there is no work being done, AFAICT, on SQL literals or these other theoretical use cases. I'm personally interested, but they're not the plan of record.

Also, it seems clear that foreign source fragments might have their own needs above and beyond regex's, so we certainly wouldn't want to limit them with regex-based assumptions. I think it's better to design the general facility in light of general usage, which is multiple releases out as it involves further evolving the compiler-library interaction story. The regex work actually advances it behind the scenes, but it needs more examples than just regex to help complete it.

2 Likes

+0.75. I support the decision to use / as a delimiter for regexes, though I have some concerns about other various minutiae.


I feel that many posters on this thread and the previous thread are overstating the harm that comes from the ambiguous cases listed in the proposal. While there definitely are cases where there is ambiguity, these cases seem very rare. I've never once needed to write anything like foo(/, /) or bar[/] + bar[/] in Swift. In the previous thread, Mishal Shah found that only 1 out of 2,879 projects on the Swift Package Index broke due to the ambiguities introduced in this proposal. To date, I haven't seen a case where an ambiguity would occur in realistic Swift code. Additionally, standalone operators already have cases where parentheses or an explicit closure are required e.g. let divide: (_, _) -> Int = (/).

Here's how I imagine these ambiguities will play out:

  1. It's extremely unlikely that a Swift programmer will encounter a situation where they have to disambiguate between two / operators and a /.../ regex literal in the first place.
  2. Even if they do get into that situation, they will recognize the situation due to syntax highlighting and, oftentimes, related errors. They can then use parentheses or closures to disambiguate.
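As a rough illustration of the disambiguation described above, the sketch below passes the division operator as a value in the two ways mentioned: wrapped in parentheses, and as an explicit closure. The function `combine` is a hypothetical stand-in for something like foo; both spellings avoid a bare / sitting where a regex literal could begin and end.

```swift
// Hypothetical function that takes an operator as its first argument.
func combine(_ op: (Int, Int) -> Int, _ a: Int, _ b: Int) -> Int {
    op(a, b)
}

// Parenthesized operator reference: unambiguous under either parsing rule.
let viaParens = combine((/), 12, 4)

// Explicit closure: more verbose, but never mistaken for a literal.
let viaClosure = combine({ $0 / $1 }, 12, 4)

// Storing the operator in a constant, as in the post's example.
let divide: (Int, Int) -> Int = (/)

print(viaParens, viaClosure, divide(9, 3))  // prints "3 3 3"
```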

I get the appeal of only having #/.../# as a language designer. But as a language user, the # characters are just noise and the ambiguities are rare enough that they don't really pull their weight, especially considering that plain /.../ is a term of art for regular expressions. Objective-C developers know about the clutter that comes from repeatedly using a special character (like #) to maintain backwards compatibility.


In regard to the CasePaths library, I agree that the deprecation of prefix / is unfortunate. However, I don't think we should hold Swift back for the sake of one library — especially since the library could switch to another operator, like |. Unless I'm mistaken, it seems that swift-syntax should be able to automate replacing / with | in existing codebases. And if case paths do get natively implemented in the language, delimited by \, then people would have to rewrite their code anyway.


I still think that named captures should be supported by the DSL if they're supported by literals. It's not a dealbreaker for me if that doesn't happen, but I do think this sort of thing is antithetical to how literals and the DSL are supposed to work. Regex literals are supposed to be terse, succinct expressions while the DSL is supposed to be more powerful, readable, and composable at the expense of being more verbose. Requiring programmers to un-DSLize (for lack of a better term) their Regex in order to have named captures would undermine the power, readability, and composability that the DSL is supposed to have over literals.

Reference is sort of similar to named captures, but I don't think it's close enough. It doesn't have the same semantics as named captures and replaces type system guarantees with runtime checks and confusing rules.
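For readers comparing the two forms, here is a small sketch, assuming Swift 5.7 or later with the RegexBuilder module available. It uses the extended #/.../# delimiters so it does not depend on the bare-slash flag. The pattern and variable names are illustrative only.

```swift
import RegexBuilder

let input = "2022-05"

// Literal form: named captures become labels on the output tuple,
// so the names are checked by the type system.
let literal = #/(?<year>\d{4})-(?<month>\d{2})/#
let literalMatch = input.firstMatch(of: literal)
let literalYear = literalMatch?.output.year
let literalMonth = literalMatch?.output.month

// DSL form: captures are named indirectly through Reference values,
// which are looked up on the match at run time.
let year = Reference(Substring.self)
let month = Reference(Substring.self)
let dsl = Regex {
    Capture(as: year) { Repeat(.digit, count: 4) }
    "-"
    Capture(as: month) { Repeat(.digit, count: 2) }
}
let dslMatch = input.firstMatch(of: dsl)
let dslYear = dslMatch.map { $0[year] }
let dslMonth = dslMatch.map { $0[month] }

print(literalYear ?? "", literalMonth ?? "", dslYear ?? "", dslMonth ?? "")
```

The literal's output type carries the labels year and month, whereas the DSL's output tuple is unlabeled and the association lives in the Reference values.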


I have reservations about the -enable-bare-regex-syntax flag as well. I'd love to be able to use /.../ from day one, but I'm worried about creating a new dialect of Swift. If Swift 6 is coming out soon, though, it's less of an issue. I'd like to know what the intended use of this flag is. Is it just for regex enthusiasts? Or is it something that's intended to be added by default to new Swift packages and Xcode projects?

Also, how would this flag work with features like playgrounds or the REPL?

7 Likes

Aha! I'd been hoping to see a stat like that. I wrote above:

1 out of 2894 certainly meets my threshold for “tolerable.” That removes my concern about allowing /…/. (Other concerns from the OP still stand, none dealbreakers.)

1 Like

It should be noted that I was only talking about projects that broke due to ambiguity — I didn't count projects that broke due to the use of a prefix / operator. (I probably should have said 1 out of 2879 and not included those projects at all — I'll edit my original post.)

Here are the findings in full:

5 Likes

To repeat from the pitch thread, the issue is that the CasePaths package shown in the previous posting is actually quite popular because it is a dependency of swift-composable-architecture (aka TCA). The CasePaths / operator gets used in every SwiftUI end-user application that uses TCA.

My expectation is that most of the public packages tested above are frameworks and not apps, and that therefore those numbers are probably not a good reflection of how many end-user apps are using TCA. Under this proposal, each of those apps (whose number is not known) would get a very pervasive source break that users of TCA are not looking forward to. (it's used in every screen in a TCA app multiple times).

That operator was not arbitrarily chosen; it's there primarily to deal with a shortcoming in Swift's optics system (i.e. Swift has lenses but not prisms). / was chosen because of its likeness to the \ operator for keypaths (lenses). The pitch thread discussed the possibility that a future version of Swift could incorporate direct language support for casepaths (prisms), but that's a completely separate evolution proposal on a completely separate timeframe.
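To illustrate the lens/prism distinction, here is a minimal sketch. The `Prism` type below is hypothetical and written only for this example; it is not the CasePaths API. Key paths give a lens into a stored property for free, while the enum-case equivalent has to be built by hand.

```swift
struct User { var name: String }

enum Loading {
    case idle
    case loaded(User)
}

// Lens: the language provides this directly via key-path syntax.
let nameLens = \User.name

// Hypothetical prism: a pair of functions to extract a case's payload
// (which may fail) and to embed a payload back into the enum.
struct Prism<Root, Value> {
    let extract: (Root) -> Value?
    let embed: (Value) -> Root
}

let loadedPrism = Prism<Loading, User>(
    extract: { root in
        if case let .loaded(user) = root { return user }
        return nil
    },
    embed: Loading.loaded
)

let user = User(name: "Blob")
let extractedName = loadedPrism.extract(.loaded(user))?[keyPath: nameLens]
let missing = loadedPrism.extract(.idle)
print(extractedName ?? "none", missing == nil)
```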

Personally, I'd be happy with direct language support for casepaths and to see TCA use that support (as would TCA's authors by their own admission). That would be great and I would heartily applaud it. Even so, doing that won't go back and fix all the existing code that will no longer compile.

You can't make an omelette without breaking eggs. The question is: do we really like omelettes that much?

7 Likes

This is my vote as well.

1 Like

I opened a PR with some updates and clarifications.

I broke out some of the aesthetic motivation into multiple sentences instead of it all being glommed into one potentially-confusing paragraph. Thanks to @Karl and others who requested some clarification.

I added a future direction for library-extensible support (Thanks again to @Karl for pointing out its omission):

Library-extensible protocol support

A regex literal describes a string processing algorithm which can be run over some model of String. The precise semantics of running over extended grapheme clusters vs Unicode scalar values is part of [Unicode for String Processing][regex-unicode]. Libraries may wish to extend this behavior, but the approach presented by various ExpressibleBy* protocols is underpowered as libraries would need access to the structure of the algorithm itself.

A better (and future) approach is to open up the regex parser's AST, API, and AST actions to libraries. Here are some examples of why a library might want to customize regex:

A library may wish to provide support for a different or higher level model of string. For example, using localized comparison or tailored grapheme-cluster breaks. Such a use case would need access to the structure of the string processing algorithm literal.

A library may wish to provide support for running over another engine, such as ICU, PCRE, or JavaScript. Such a use case would want to pretty-print Swift's regex syntax into one of these syntax variants.

A library may wish to provide their own higher-level structure around which regex literals can be embedded for the purpose of multi-tier processing. For example, processing URLs where regex literal-character portions would be converted into percent-encoded equivalents (with some kind of character class customization/mapping as well). Additionally, a library may have the desire to explicitly delineate patterns that evaluate within a component vs patterns spanning multiple components. Such an approach would benefit from access to the real AST and rich semantic API.

I added an alternative for forbidding features not present in the DSL (thanks to @ensan-hcl for mentioning this):

Restrict feature set to that of the builder DSL

The regex builder DSL is unable to provide some of the features presented, such as named captures as tuple labels. An alternative could be to cut those features from the literal out of concern they may lead to an over-use of the literals. However, to do so would remove the clearest demonstration of the need for better type-level operations including working with labeled tuples.

Similarly, there is no literal equivalent for some of the regex builder features, but that isn't an argument against them. The regex builder DSL has references, which serve this role (though not as concisely), and they are useful beyond just naming captures.

Regex literals should not be outright avoided; they should be used well. Artificially hampering their usage doesn't provide any benefit and we wouldn't want to lock these limitations into Swift's ABI.

And I added a sub-section to discussion about #regex(...) extensibility to foreign language snippets (thanks to @rvsrvs for reminding me of our extensive discussion from the pitch thread):

On future extensibility to other foreign language snippets

One of the benefits of #regex(...) or re'...' is the extensibility to other kinds of foreign language snippets, such as SQL. Nothing in this proposal precludes a scalable approach to foreign language snippets using #lang(...) or lang'...'. If or when that happens, regex could participate as well, but the proposed syntax would still be valuable as regex literals are unique in their prevalence as fragments passed directly to API, as well as components of a result builder DSL.

12 Likes

I would like to expand a bit more on the possibilities opened up by the magic literal solution:

We can have a local contextual scope for the arguments of the magic literals (generally, not just for #regex). This means that we could have a separate contextual interpretation for the syntax and delimiters of such arguments (parameters).

This way, we can define #regex/…/ to be a shorthand for trailing literals, the same way that map{…} is a shorthand for map( {…} ).

The full syntax would be #regex( /…/ ), and / would keep its meaning everywhere else. This opens the possibility of supporting Perl-style modifier prefixes for the regex literal as well as even additional arguments to select the language variant and behavior of the literal. For example, we would be able to support all the Perl variants noted by @tim1724 such as: #regex(qr{...}) or even #regex(Perl, qr{...}). Privileging #regex with # shorthand would give us #(qr{...}).

This would offer a general extensible foundation to deal with such things and we can add syntactic sugar to the frequently used ones, the same thing that we do with [Int] vs Array<Int>.

On the other hand, a tiny bit of syntactic noise (e.g. #/.../ instead of /.../) might not be such a bad thing. As far as I know, the intent is to discourage the overuse of bare regex literals, and a tiny bit of noise may tip the scales in the right direction.

I will address the comments that downplay the extent of the damage to the language caused by adopting the bare /.../ later when/if I can spare some time to do so.

2 Likes

I posted my message above before reading this, but IMHO the logic I propose to get from #regex(...) to #/.../ still stands and offers a better solution that avoids source-breaking changes.

  • What is your evaluation of the proposal?

-0.25

Honesty compels me to admit that, despite regexes being generally terrible API, these literals will play an important part in Swift's string processing story. However, I see these literals as an API of last resort where copy pasting an existing regex is most important, or where inline capture is required. Therefore they're unworthy of the privileged syntax given to them by this proposal.

Nothing in this proposal justifies the breakage and edge cases introduced by the use of /. In fact, the proposal itself spends more time explaining the edge cases introduced by this usage than the actual capabilities of the literals themselves. That, in and of itself, should indicate that / is not a good choice.

In addition to the general issues introduced by the use of /, which the proposal actually addresses fairly comprehensively, I have to once again reiterate and echo the concerns voiced in this review and the various pitches leading up to it: nothing in this proposal justifies the relatively massive source breaking change this use of / represents. @rvsrvs is 100% correct: breaking TCA will break hundreds, if not thousands of apps. In fact, it seems likely this represents the largest source break since Swift declared source stability five years ago. Of course, we can't quantify this breakage due to Apple's continued neglect of the Swift ecosystem and the lack of analytics around package usage. But the library itself has over 6k stars on GitHub, making it one of the most popular Swift libraries out there.

In addition to the concrete impact this breakage will have, this raises important considerations for the community in general. Specifically, how popular does a library have to be for Apple to avoid breaking it (when such breakage is easily avoidable)? If TCA isn't popular enough, is Alamofire? Alamofire currently has over 37k stars on GitHub and is usually recognized as, if not the most popular Swift library, certainly one of the most popular. Yet it represents just a single entry in the compatibility suite. Does that mean it's subject to breakage at any point? Why would people spend their valuable time developing unique solutions to problems in the Swift community when it could be broken at any time?

  • Is the problem being addressed significant enough to warrant a change to Swift?

Probably, though the other parts of the regex proposals are far more valuable.

  • Does this proposal fit well with the feel and direction of Swift?

Not especially. Nasty, complex, inline literals are usually a feature of last resort. But, given the placement within Swift's sphere of features, it could be. If this proposal was a first implementation of custom literals, then it might be more valuable. If it was exploring the future of protocols for literals, that might be valuable. If it was exploring generalized inline captures from literals, that might be more valuable. But in the end this proposal feels most like something we want for copy / paste compatibility for regexes we find on Stack Overflow, which has never been a priority for Swift before.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Regexes in other languages suck pretty hard. The overall set of proposals certainly puts Swift out in front of them, for the most part.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've been tracking the various pitches and proposals and tracking the development of the string processing library on GitHub.

16 Likes

There's no question that / is the canonical delimiter for regex. But that comes from contexts where they are not used in direct conjunction with other syntax (grep, sed, etc.). To my mind, it's awfully important to know just how many other general-purpose programming languages use this literal syntax, embedded as a normal expression. I don't claim full knowledge*, but I know of three, the same mentioned in the proposal: JavaScript, Ruby, and Perl. Are there others? And then of those, how many:

  • use / for comments (Perl and Ruby use #)
  • have custom operators like Swift (that may include slashes)

Any? And then when we add on operator references as expressions, I think Swift is all alone in this region of syntactical space.

So, while I agree that this would be kind of a cool feature to have, I am skeptical that slash-literals-as-expression rises to the universality that the proposal suggests. And given that, I don't think the costs would be worth it.

A final point is that there will always be some places where code is read, that can't avail themselves of rigorous highlighting logic. For example, GitHub. A straightforwardly unambiguous delimiter and rules make it easier for that highlighting to be decent, and it aids reading even when there is no highlighting. An extra mark, like r'/[a-z]+/' or a pair of octothorpes, is no great burden in my opinion, and might even be a positive.


*And for some reason I'm having a devil of a time searching the web for this information.

7 Likes

Not to take anything away from the rest of your review, but this argument is not very convincing. At best, it would be an argument that the proposal document isn't very well drafted, which would be unfortunate but also something that reviewers would ideally look past to focus on the substance of the proposal. In this case, though, it's just misplaced: the capabilities of the literal are variously covered in SE-0355 (the regex syntax) and SE-0350 (the general API of the Regex type), so you wouldn't expect them to be redundantly described in SE-0354, which is specific to wrapping up the regex syntax into a literal.

Like I said, this doesn't take anything away from the rest of your review; your review stands by itself well enough without trying to make a point about the quality of the proposal document.

Not sure I see your point here. If the literal syntax were fully included in one of the other proposals, my point would be the same; I would've just said the literal syntax section spends more time describing its edge cases than the actual functionality. Perhaps the point would've been blunted since the relative size isn't as extreme, but the point stands. Boiling it down into its own proposal just makes the point more obvious. But I'll express it differently (and more generally) if that helps: new features which have to be described more in terms of their edge cases than in terms of the functionality they add probably aren't a good idea.

5 Likes

No, you have it backwards. You are using the fact that the description of something has been pulled into a separate proposal (or section of a proposal) from the description of its functionality as evidence that it doesn't have much functionality.

1 Like

My understanding is that, without this proposal, the only things Swift loses by requiring use of the Regex initializer are inline captures and their integration into the Regex DSL. Is that incorrect? That you can represent a regex as a string seems separate from the actual literal, is it not?
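As a rough sketch of the difference being asked about, assuming Swift 5.7 or later: the string-based initializer parses the pattern at run time and yields type-erased captures, while a literal is parsed at compile time and carries a typed capture tuple. The extended #/.../# delimiters are used so the example does not depend on the bare-slash flag; the pattern itself is illustrative only.

```swift
let input = "10-20"

// Run-time construction from a string: the pattern is parsed when this
// line executes (hence `try`), and the output is type-erased.
let dynamicRegex = try Regex(#"(\d+)-(\d+)"#)
let dynamicMatch = input.firstMatch(of: dynamicRegex)
let dynamicLow = dynamicMatch?.output[1].substring

// Literal: the pattern is parsed by the compiler, and the output type is
// inferred as a tuple of (whole match, first capture, second capture).
let typedRegex = #/(\d+)-(\d+)/#
let typedMatch = input.firstMatch(of: typedRegex)
let typedLow = typedMatch?.output.1

print(dynamicLow ?? "", typedLow ?? "")
```

With the string form, a malformed pattern or a wrong capture index only surfaces when the code runs; with the literal, both are diagnosed at compile time.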