SE-0354: Regex Literals

ksluder · May 4, 2022, 12:54am

My point is that there has been this flavor of argument that /[Tt]his/ is not immediately apparent to be a regex to anyone who has ever used one. I personally find that argument very hard to believe.

Jon_Shier · May 4, 2022, 12:57am

What does the recognition of regexes (the stuff between the /) have to do with the use of / in the first place? My point is there is none, and the proposal makes no first principle argument for its use. Use of #regex() would be even easier to recognize since it has, you know, words, but even #/ would meet your criteria. No one made the argument that using / would make regexes unrecognizable.

ksluder · May 4, 2022, 1:02am

Slashes are as close to a universally recognized delimiter as regexes have. /[Tt]his/ is immediately recognizable as a regex to anyone who has ever seen one in any language or tool that supports them, even if those languages or tools also support alternative delimiters. That has value.

The meaning of #regex() is of course self-evident, and #/[Tt]his/# might be deduced, but they are unique to Swift and therefore lack that value of familiarity.

scanon · May 4, 2022, 1:06am

I don't believe that this is the case, and even if it were, programming languages are not the only thing that's relevant. The use of / for the regexes is woven throughout UNIX tooling--ed has used / since 1969, and has been followed by most other tools; even when they also support other delimiters, / is the one near-universal. This is a rich history to turn our backs on.

Jon_Shier · May 4, 2022, 1:07am

Three languages, two of which don't even recommend naked /, are nowhere near a universe. If you're arguing about recognizability, I would argue the delimiter doesn't matter at all. Anyone seeing a mess of seemingly random characters will recognize it as a regex (assuming they know what they are), as that's all it really could be! So I find the familiarity argument to be extremely weak, as I've already mentioned. But that wasn't the original point you replied to anyway.

ksluder · May 4, 2022, 1:09am

Case in point: GitHub is an extremely “modern” tool that uses / as an accelerator for search, which is the operation which originally established the association between / and regular expressions in the first place.

Jon_Shier · May 4, 2022, 1:18am

I'll have to pull out my copy of UNIX: A Memoir to see what it says there, but that doesn't really expand the point made in the proposal. (Edit: It says nothing on the choice of /.) Nor do I recall a first principles argument there either (IIRC it was used because it was available on the keyboards of the time and wasn't already a control character). Given the trend away from naked / delimiters, a rich history of incomprehensible UNIX commands isn't a good argument to me.

It doesn't seem surprising to me that GitHub, built in Ruby, would use Ruby's regex syntax.

Karl · May 4, 2022, 2:19am

And yet, maybe we should? Looking at the Wikipedia page for ed:

Features

Features of ed include:

available on essentially all Unix systems (and mandatory on systems conforming to the Single Unix Specification).

support for regular expressions

powerful automation can be achieved by feeding commands from standard input

(In)famous for its terseness, ed gives almost no visual feedback,[7] and has been called (by Peter H. Salus) "the most user-hostile editor ever created", even when compared to the contemporary (and notoriously complex) TECO.[2] For example, the message that ed will produce in case of error, and when it wants to make sure the user wishes to quit without saving, is "?". It does not report the current filename or line number, or even display the results of a change to the text, unless requested. Older versions (c. 1981) did not even ask for confirmation when a quit command was issued without the user saving changes.

Maybe it's not a coincidence that "support for regular expressions" was an innovation brought to us by the creators of "the most user-hostile editor ever created"

But then, maybe we should also be brave enough to admit that not everything in the past was incredibly well thought out. We are the people of today; we are the ones who have to live with this stuff. Stop keeping us beholden to this ancient baggage! Just like people in the past made systems to suit their needs, we should not be scared to remake those systems for the needs of Swift developers over half a century (53 years) later.

Let's just admit it: using the / for regexes didn't work out so well. Forward slashes are super useful in lots of important text formats, such as paths and URLs, and having to escape characters in a regex increases its complexity at least tenfold.

This problem even has a name: leaning toothpick syndrome (Wikipedia):

In computer programming, leaning toothpick syndrome (LTS) is the situation in which a quoted expression becomes unreadable because it contains a large number of escape characters, usually backslashes ("\"), to avoid delimiter collision.

The official Perl documentation introduced the term to wider usage; there, the phrase is used to describe regular expressions that match Unix-style paths, in which the elements are separated by slashes /. The slash is also used as the default regular expression delimiter, so to be used literally in the expression, it must be escaped with a backslash \, leading to frequent escaped slashes represented as \/. If doubled, as in URLs, this yields \/\/ for an escaped //.

A similar phenomenon occurs for DOS/Windows paths, where the backslash is used as a path separator, requiring a doubled backslash \\ – this can then be re-escaped for a regular expression inside an escaped string, requiring \\\\ to match a single backslash. In extreme cases, such as a regular expression in an escaped string, matching a Uniform Naming Convention path (which begins \\) requires 8 backslashes \\\\\\\\ due to 2 backslashes each being double-escaped.

It is a famous problem.
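To make the backslash arithmetic in the quoted passage concrete, here is a minimal Swift sketch. It assumes a toolchain where `Regex` and the extended `#/.../#` literal are available; the path and the variable names are made up for illustration. It builds the same pattern three ways and counts the backslashes each spelling needs.

```swift
// A UNC path begins with two literal backslashes. A raw string keeps the
// subject itself readable. (The server and share names are hypothetical.)
let path = #"\\server\share"#

// Pattern in an ordinary (escaped) string literal: each of the two
// backslashes is escaped once for the regex and once more for the string,
// which gives eight in total.
let fromEscapedString = try Regex("^\\\\\\\\server")

// Pattern in a raw string literal: only the regex-level escaping remains,
// so four backslashes.
let fromRawString = try Regex(#"^\\\\server"#)

// Pattern as an extended regex literal: likewise four backslashes.
let fromLiteral = #/^\\\\server/#

print(path.firstMatch(of: fromEscapedString) != nil)
print(path.firstMatch(of: fromRawString) != nil)
print(path.firstMatch(of: fromLiteral) != nil)
```

All three spellings describe the same pattern; only the amount of escaping differs, which is the readability cost the Wikipedia passage is describing.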

(By the way - that Wikipedia article has a better breakdown of regex literals in modern programming languages than this actual proposal to add regex literals to a modern programming language. More background information is still needed!).

The regex work is very forward-looking and innovative. Clearly, a lot of time and effort has been spent on the API, supported features, matching semantics, etc - but this area feels like it hasn't seen the same level of attention, and it's a shame because IMO it is the very worst thing about regexes and I would love for us to have something better.

Nevin · May 4, 2022, 2:46am

Yes, exactly. Regular expressions as traditionally written are a dense jumble of inscrutable symbols. We need to completely overhaul how string parsing and matching is spelled, to mesh with Swift’s principle of clarity at the point of use.

We should prioritize a powerful, convenient, and Swifty string processing system, which looks and feels the way people expect of Swift.

Once that is in the language, then, after people have 2 or 3 years of experience parsing strings “the Swift way”, with clear and easily readable syntax, if there is still a desire for ugly illegible regex syntax, then maybe we could consider it.

But we absolutely 100% definitely should not introduce the terrible, unreadable, user-hostile way of spelling regular expressions before then.

We don’t need to bikeshed the delimiters. We need to scrap the entire design of regex literals and make something massively better.

scanon · May 4, 2022, 3:07am

Fortunately, the proposal contains a solution to this problem!

1-877-547-7272 · May 4, 2022, 4:52am

I understand why / was chosen, but I don't think it's the only sensible operator for case paths. |Enumeration.case resembles \Type.property well enough IMO. (Plus, on many keyboards, | and \ are on the same key!)

Most previous source-breaking versions of Swift have included a migrator tool to update older code. Perhaps the new migrator should be made aware of TCA/CasePaths? Or maybe there's a solution that could be generalized to all libraries...

Even if the Swift Migrator doesn't automatically update TCA code, I believe a custom migrator could be made with SwiftSyntax.

Modifier prefixes aren't very Swifty — they look inelegant and don't fit in with the rest of the language. Earlier versions of SE-0200 used modifier prefixes and were criticized for it. Furthermore, I don't think having quote-like operators like Perl's would add much value to Swift — operations like q and qq are redundant with string literals and other operations like s or tr would make more sense as APIs on the String type than literals. An extendable delimiter syntax (like ###/.../###) makes more sense for Swift's custom delimiters than Perl's custom delimiter syntax — delimiters like (), [], and {} in Swift have specific meanings already that aren't related to regex.

I'm not sure what you mean by "base principles" here, but / is generally considered the term-of-art delimiter for regexes. From Wikipedia's page on regular expressions:

Delimiters

When entering a regex in a programming language, they may be represented as a usual string literal, hence usually quoted; this is common in C, Java, and Python for instance, where the regex re is entered as "re" . However, they are often written with slashes as delimiters, as in /re/ for the regex re .

I think #/.../# handles the /-in-regex case well enough that it's hard to justify an esoteric new syntax.
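As a rough illustration of the /-in-regex case, the sketch below matches a URL with the extended delimiters. The URL and names are invented for the example, and the bare form appears only as a comment so that the sketch does not depend on bare `/.../` being enabled.

```swift
// Hypothetical input containing several forward slashes.
let url = "https://example.com/usr/local/bin"

// With bare delimiters, every slash in the pattern would need escaping:
//     /https:\/\/([^\/]+)\/usr\/local\/bin/

// With extended delimiters, the slashes inside the pattern are written as-is.
let pattern = #/https://([^/]+)/usr/local/bin/#

if let match = url.wholeMatch(of: pattern) {
    // match.1 is the first capture group, here the host name.
    print(match.1)
}
```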

What is this "language compatibility argument"? Swift isn't designed to be compatible with any other language, and I can't find anyone arguing for /.../ on the basis of compatibility with other languages.

There are people who argue for /.../ on the basis that the / delimiter is a term of art, but that's an entirely different argument. They're not arguing that you should be able to copy a regex from some Perl code and put it into some Swift code; they're arguing that the syntax is familiar to many people and that it clearly denotes a regex literal to anyone familiar with regular expressions.

#await# wouldn't have had any source-breaking effects at all. But requiring programmers to write #await# before each asynchronous call would have been ridiculous. Requiring #/.../# for every regex is similarly ridiculous.

The justification is that reserving prefix / is required to support /.../ syntax?

Increased verbosity (which you seem to be advocating for) does not actually result in an increase in clarity. To maximize clarity, you must carefully balance brevity and verbosity.

I disagree with the notion that regular expressions are inherently illegible and incomprehensible. It's not that hard to figure out that /0x([0-9A-Fa-f]+)/ matches a hexadecimal literal, especially with the web and/or a bit of practice.
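For readers who want to try the hexadecimal pattern, here is a small sketch. The function name `parseHex` is made up for illustration, and the extended `#/.../#` spelling is used so that the example does not depend on the bare form under discussion.

```swift
// Parses text such as "0xFF" into an integer, or returns nil when the text
// is not a hexadecimal literal of that shape.
func parseHex(_ text: String) -> Int? {
    // Same pattern as in the post, written with extended delimiters.
    let hexLiteral = #/0x([0-9A-Fa-f]+)/#
    guard let match = text.wholeMatch(of: hexLiteral) else { return nil }
    // match.1 holds the digits captured after the "0x" prefix.
    return Int(match.1, radix: 16)
}

print(parseHex("0xFF") as Any)
print(parseHex("0xZZ") as Any)
```

The capture group is what makes the digits available as `match.1`; without the parentheses only the whole match would be returned.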

I realize there are some regexes that are much more complex, and I do agree that they can be illegible and incomprehensible. But Swift programmers will be able to use the DSL to avoid these kinds of regex literals. And those regex DSLs will often contain a bunch of smaller regex literals!

let integerRegex = Regex {
    ChoiceOf {
        /0b([0-1]+)/
        /0o([0-7]+)/
        /([0-9]+)/
        /0x([0-9A-Fa-f]+)/
    }
}
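A variant of the snippet above that can be run as a script might look like the following. It is only a sketch under a few assumptions: it imports `RegexBuilder`, uses the extended `#/.../#` spelling instead of the bare form, drops the capture groups to keep the output type simple, and lists the prefixed forms before plain decimal, because `ChoiceOf` tries its alternatives in order.

```swift
import RegexBuilder

// Each alternative is a small regex literal; ChoiceOf tries them in order.
let integerRegex = Regex {
    ChoiceOf {
        #/0b[0-1]+/#
        #/0o[0-7]+/#
        #/0x[0-9A-Fa-f]+/#
        #/[0-9]+/#
    }
}

// wholeMatch requires the entire string to match one of the alternatives.
print("0x1F".wholeMatch(of: integerRegex) != nil)
print("0b102".wholeMatch(of: integerRegex) != nil)

// prefixMatch only anchors at the start, so the order of the alternatives
// decides which one claims the prefix.
if let prefix = "0x1F".prefixMatch(of: integerRegex) {
    print(prefix.output)
}
```

With the decimal alternative listed before the hexadecimal one, a prefix match on "0x1F" would stop after the leading "0", which is why the order is changed in this sketch.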

Avi · May 4, 2022, 5:27am

What I dislike the most about using a bare / as the delimiter is that it sets a precedent that the Core Team will take an existing feature and completely break it to suit their whims. / has always been valid as a custom unary operator in Swift. Now it won't be. Not because it's unavoidable, but because the Core Team wants it that way.

In all the pitch and review threads for the Regex feature, I haven't seen anyone positing that a delimiter other than / would be harmful at all. Literally everyone who has weighed in would accept #//#, and yet the Core Team is persisting in getting their own way because reasons. Reasons they won't even articulate.

YOCKOW · May 4, 2022, 7:56am

Yes, you are! …Sorry for a corny joke.
I agree it was a hyperbolic remark. Of course, I didn’t mean I wanted to insist it’s the fact.

I fear that the core team would equate our earnest discussions with just "customer ratings" which they can ignore.
I fear that "evolution process" would become a mere facade and that they would regard the forum as their exhibition.

Even if you agree with the core team on this proposal, what about future proposals?
The community might develop "learned helplessness", and Swift would get worse for the community.
This issue is not exclusive to this proposal.

ensan-hcl · May 4, 2022, 8:10am

Can we exclude some 'incomprehensible' regexes from regex literals? If there is some neat subset of regex syntax which is acceptably readable, then we can limit regex literals to that subset, so that people must use the Regex DSL when they want to write a more complex regex. I believe there should be some kind of guardrail to avoid falling into chaos.

scanon · May 4, 2022, 1:25pm

A digression: I put the question to my neighbor Doug McIlroy. He believes Ken copied that binding from the qed editor on the Berkeley Timesharing System (where it was restricted to literal search, rather than regex).

Nevin · May 4, 2022, 1:40pm

Please do not falsely attribute incorrect motivations to others. I have consistently advocated for clarity, because clarity is what I value, and clarity is what Swift officially stands for.

To maximize clarity, you must maximize clarity.

Traditional regex literals do not maximize clarity. They do not even meet a basic minimal threshold of clarity. They actively reduce clarity, and must be opposed in Swift.

Strong disagree. Even on a “simple” regex like that, even for someone who is aware that regexes exist and how they work, the example you give is still an absurd mess of symbols. It completely fails to meet Swift’s goal of clarity at the point of use.

Furthermore, if we were to make the huge mistake of introducing regex literals into Swift, then by definition those literals would be part of Swift. In other words, the question “Do you know Swift?” would always and forevermore thereafter include within itself the subquestion “Do you know regex literal syntax?”

A programmer could not truthfully say “I know Swift” unless they also know regex literals. Because those literals would be part of the language.

And that would be a bad thing.

Whatever syntax we end up choosing should prioritize clarity. And that immediately rules out traditional regex literals.

We should design Swift’s string-parsing features to be so clear, so convenient, and so powerful that nobody ever wants regex literals, that nobody ever misses regex literals, and that everybody who looks at the Swift design is appalled that anyone ever thought regex literals were anything other than a complete dumpster-fire of a syntax.

Nevin · May 4, 2022, 1:48pm

This exact point has previously been litigated on these forums: everyone is expressing their opinions, and it would be a waste of time to preface everything with “in my opinion”.

sveinhal · May 4, 2022, 1:50pm

Sure, but you're certainly using pretty absolute words to express what you must know to be not-so-absolute. But I digress.

Nevin · May 4, 2022, 2:00pm

In my opinion, the words I used have scarcely begun to scratch the surface of how thoroughly and completely I regard regex literals as being unclear, user-hostile, and entirely incompatible with Swift’s goal, philosophy, and design ethos of clarity at the point of use.

codename · May 4, 2022, 3:44pm

Overall, +1 ...

I think the general direction is great, and I’m excited to see all of the string processing work become a reality, including Regex and Regex literals. While I would definitely prefer the Regex DSL for certain applications, I think the convenience of having a traditional regex engine at one's disposal may be a selling feature for Swift in the world of scripting, where text processing (whether machine or user text) is not uncommon.

That said, while I could live with the proposed delimiter syntax, I do not think it is as desirable as it is made out to be, nor does it fit in particularly well with the language as it is now, and I think Swift can do better.

I would also be content with the suggestion of extended-delimiters-only (eliminating the need for a source break or disallowing once-allowed operators), though I could see the reasoning for the syntax being confusing/arbitrary for a newcomer in absence of the basic /.../ form.

Side question

Perhaps it is not in the scope of this proposal, but one thing I have not seen mentioned is whether inline documentation (e.g. Xcode's Quick Help) will be supported for individual Regex operators (e.g. \d).
If so I think this would be an unprecedented, but all the more incredible experience for Regex support in a language, and could clarify and reinforce what features are supported (and their behavior) at the point of use without e.g. needing to run a Regex different ways just to inspect a certain operator's behavior.

To throw another interesting idea in the mix, which I haven’t seen discussed directly...

Literal form `#r'...'`

let regex = #r'[Rr]eg(ular\s)?[Ee]x(pression)?'

And (multiline):

let multilineRegex = #r'''
[Rr]eg(ular\s)?
[Ee]x(pression)?
'''

As mentioned in the proposal, potential conflicts with single quotes within a Regex could be easily avoided using an extended form (#r#'...'# et al.) similar to the one proposed (and that of String). One could perhaps be bothered by the multiple # characters in close proximity serving different purposes, though I do not see this as an issue. An alternative could be to use the multiline version instead.
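The `#r'...'` spelling is only a suggestion in this post and is not valid Swift. For comparison, here is a sketch of the same two patterns written with the extended-delimiter form from the proposal, assuming a toolchain that implements it. In the multi-line form, whitespace inside the pattern is ignored, so the line breaks are purely visual.

```swift
// Single-line form with extended delimiters.
let regex = #/[Rr]eg(ular\s)?[Ee]x(pression)?/#

// Multi-line form: the opening delimiter is followed by a newline and the
// closing delimiter sits on its own line. Whitespace between the pattern
// pieces is not matched; \s is still an explicit whitespace escape.
let multilineRegex = #/
  [Rr]eg(ular\s)?
  [Ee]x(pression)?
/#

for candidate in ["Regular Expression", "regex", "RegEx"] {
    print(candidate.wholeMatch(of: regex) != nil,
          candidate.wholeMatch(of: multilineRegex) != nil)
}
```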

1. I like seeing the # prefix which signifies the ‘program’ nature of the literal, in that it is something beyond just a regular ‘data’ literal. This also reduces the novelty of using letter-prefixed quotes (e.g. re'...') since the letter is directly following the “compiler sign” #.
2. From a language perspective it seems an r/re quoted form has similar precedent to /.../
3. The syntax is clearly extensible to any future program literals (were they to be introduced)
4. Using elevated characters as delimiters (", ', etc.) makes them easier to distinguish from the contents of the delimiters, and thus easier to parse visually
5. It does not preclude other uses of '...' (but rather sets a precedent for similar future constructs)
6. Using reserved characters (# as well as ') means it does not break any existing Swift source code and retains a simpler mental model without edge cases (regardless of whether such edge cases would commonly be encountered)
7. Avoiding / sidesteps a common pitfall of the standard delimiters in languages that have used it
8. Multiline delimiters can use the same established mental model as multiline strings (as well as Markdown code fences!)
9. [Just a broad guess] While perhaps somewhat novel, I would expect such a syntax to feel more immediately familiar to existing Swift developers, in that it would “fit in” comfortably with the rest of the language as they see it
10. While being short and succinct (an r longer than the proposed /.../), the syntax is both more descriptive IMO (with that one letter) and seems more likely to be found in an internet search (“what is swift #r...?”)

[Edit: I would be almost equally supportive of #re'...' instead of #r'...' as long as the # was retained (instead of a bare letter-prefix), though I prefer the succinctness of #r'...']

Counter arguments:
From the proposal (on re'...'):

it is unusual for a Swift literal to be prefixed in this way.

See point #1 (above).

We also feel that its similarity to a string literal might have users confuse it with a raw string literal.

In association with the # character this syntax becomes further separated from a bare string literal. While the # could perhaps be associated with a raw string, this seems unlikely to cause much confusion in Swift IMHO (see answer below as well).

From the proposal (on r'...'):

While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings.

While possible, this seems unlikely for anyone vaguely familiar with Regex syntax, since the content of the literal is a Regex. Using a # also makes this somewhat distinct from e.g. Python’s r'...' raw string syntax.

Furthermore, whatever confusion remains is unlikely to last long considering the strong type system support surrounding the literal (especially in that most literals won’t be used in isolation). After all, the primary places one is likely to see a Regex literal for the first time is either in educational material (such as TSPL) or someone else’s code, where surrounding context (the Regex type, text matching algorithms etc) should give a pretty good idea of what it is for.

Anyway, my 10¢ [as a hobby app developer].
