SE-0354 (Second Review): Regex Literals

Jon_Shier · May 18, 2022, 8:49pm

This feature doesn't seem to work on the latest toolchains. With Xcode 13.4 and the appropriate flags (-enable-experimental-string-processing, -enable-bare-slash-regex, and -disable-availability-checking), attempts to run anything using the regex features fail with error code 11 from a macOS command line tool. Is there a particular setup we need to get this working correctly?

From the crash log, it looks like it can't load the string processing library and fails on symbol lookup.

Jon_Shier · May 18, 2022, 9:23pm

Bizarrely, when I have the debugger attempt to attach on launch it crashes lldb-rpc-server because it can't find libswiftCore.

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace DYLD, Code 1 Library missing
Library not loaded: @rpath/libswiftCore.dylib
Referenced from: /Users/USER/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT-SNAPSHOT-2022-05-17-a.xctoolchain/System/Library/PrivateFrameworks/LLDB.framework/Versions/A/LLDB
Reason: tried: '/Users/USER/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT-SNAPSHOT-2022-05-17-a.xctoolchain/System/Library/PrivateFrameworks/LLDB.framework/Versions/A/../../../../../../../../Library/Frameworks/libswiftCore.dylib' (no such file), '/Users/USER/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT-SNAPSHOT-2022-05-17-a.xctoolchain/System/Library/PrivateFrameworks/LLDB.framework/Versions/A/../../../../../Developer/Library/Frameworks/libswiftCore.dylib' (no such file), '/Users/USER/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT-SNAPSHOT-2022-05-17-a.xctoolchain/System/Library/PrivateFrameworks/LLDB.framework/Versions/A/../../../../Developer/Library/Frameworks/libswiftCore.dylib' (no such file), '/Users/USER/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT
(terminated at launch; ignore backtrace)

Ben_Cohen · May 18, 2022, 9:32pm

Are you trying to run your executable from the command line? If so, you'll need to set DYLD_LIBRARY_PATH to make sure you're using the toolchain's standard library, which has the regex stuff in it, instead of the one on your Mac:

➜  ~ cat regex.swift
let r = /hello world/
let s = "hello world"
let m = try! r.wholeMatch(in: s)
print(m!.output)
➜  ~ xcrun --toolchain "Swift 5.7 Development Snapshot 2022-05-17 (a)" swiftc -enable-bare-slash-regex -Xfrontend -disable-availability-checking regex.swift
➜  ~ ./regex                                                                                                                             
[1]    56021 segmentation fault  ./regex
➜  ~ DYLD_LIBRARY_PATH=/Library/Developer/Toolchains/swift-5.7-DEVELOPMENT-SNAPSHOT-2022-05-17-a.xctoolchain/usr/lib/swift/macosx ./regex                   
hello world
➜  ~

Jon_Shier · May 18, 2022, 9:33pm

Nope, just through Xcode. I'll try adding the library path manually.

Ben_Cohen · May 18, 2022, 9:38pm

Yeah, I see the same problem. It should Just Work from Xcode with a toolchain. Let me do some digging, sorry about that. It does look like you can launch from the command line as I do above in the meantime.

hamishknight · May 19, 2022, 4:07pm

I think it's desirable if it avoids breaking source in cases that are somewhat common. It would indeed fix this particular issue. However, at least from the source compatibility testing we've done so far, I'm not entirely sure that these cases are particularly common. I still remain slightly concerned that such a rule could potentially lead to an odd editing experience (i.e while typing a regex containing spaces, the literal would change meaning if space is the current character being typed, assuming both delimiters are present during typing). I think we may want to get the core team's opinion on the tradeoff being made here.

Michael_Ilseman:

nnnnnnnn:
We could enable semantic whitespace within option-setting groups, like this:
let r = #/
  (?-x:hello world)
  /#
// matches "hello world"
The entirety of the parenthesized expression would need to be on one line, but that seems like it would still be useful. Future directions could include interpolations of string literals or String instances.
That example matches my intuition of how this should behave. @hamishknight, what do you think?

That seems reasonable to me

Yes, it does

IMO this would meet that bar, it seems fairly misleading to me that:

#/
  foo bar
/#

would match foobar.
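To make the concern concrete, here is a minimal sketch. It assumes a Swift 5.7 or later toolchain in which a multi-line `#/ ... /#` literal treats whitespace as non-semantic (the behavior under discussion here); the variable names are made up for illustration.

```swift
// Assumption: multi-line literals use extended syntax, so the space
// between "foo" and "bar" is not part of the pattern.
let re = #/
  foo bar
/#

// Try both spellings of the input against the whole pattern.
let matchesJoined = "foobar".wholeMatch(of: re) != nil
let matchesSpaced = "foo bar".wholeMatch(of: re) != nil
print(matchesJoined, matchesSpaced)
```

Under that assumption the pattern accepts "foobar" and rejects "foo bar", which is the surprise a warning would be meant to flag.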

A couple of other ways of suppressing the warning could be:

Use an inline comment foo(?#)bar
Just write it as foobar instead
Have the warning be suppressed if you explicitly write (?x) in the regex

The last in particular seems quite reasonable to me, assuming that intentional non-semantic whitespace between literal characters isn't a case that comes up too often.

Paul_Cantrell · May 19, 2022, 4:30pm

+1. If we’re going with (1) allowing multiline literals and (2) enabling extended mode without any explicit flag, I’d vote strongly in favor of this approach.

Mulling it over, I’m leaning toward either option 1 (with this approach) or option 3 from my list above.

johnno1962 · May 20, 2022, 10:29am

I really think we're missing a trick dismissing the more assertively differentiating syntax for the multiline "extended" regex literal:

    let re = #///
            my long regex # a comment
            ///#

IMO it would be more consistent with other string literals if #/regex/# were not allowed to extend over more than one line and magically slip into multi-line mode. The more verbose format is perfectly fine to lex/parse provided you don't attempt to parse ///.../// as well on principle. I have a working toolchain supporting this syntax that compiles and runs "MovieSwift" and other code containing extensive doc commenting without issue.

As to whitespace rules in extended mode, here is an excerpt from the perlre man page which is likely to be a good starting point:

   "/x" and  "/xx"

   A single "/x" tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a bracketed character class.  You can use this to break up your regular
   expression into more readable parts.  Also, the "#" character is treated as a metacharacter introducing a comment that runs up to the pattern's closing delimiter, or to the end of the current
   line if the pattern extends onto the next line.  Hence, this is very much like an ordinary Perl code comment.  (You can include the closing delimiter within the comment only if you precede it
   with a backslash, so be careful!)

   Use of "/x" means that if you want real whitespace or "#" characters in the pattern (outside a bracketed character class, which is unaffected by "/x"), then you'll either have to escape them
   (using backslashes or "\Q...\E") or encode them using octal, hex, or "\N{}" escapes.  It is ineffective to try to continue a comment onto the next line by escaping the "\n" with a backslash or
   "\Q".

   You can use "(?#text)" to create a comment that ends earlier than the end of the current line, but "text" also can't contain the closing delimiter unless escaped with a backslash.

   A common pitfall is to forget that "#" characters begin a comment under "/x" and are not matched literally.  Just keep that in mind when trying to puzzle out why a particular "/x" pattern
   isn't working as expected.

   Starting in Perl v5.26, if the modifier has a second "x" within it, it does everything that a single "/x" does, but additionally non-backslashed SPACE and TAB characters within bracketed
   character classes are also generally ignored, and hence can be added to make the classes more readable.
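As a rough Swift counterpart to the Perl rules quoted above, the sketch below assumes a Swift 5.7 or later toolchain where a multi-line `#/ ... /#` literal ignores unescaped whitespace and treats `#` as the start of an end-of-line comment. The pattern, input, and names are invented for illustration.

```swift
// Assumption: extended syntax is in effect inside the multi-line literal.
let datedLabel = #/
  (\d{4}) - (\d{2}) - (\d{2})   # date: spaces around the hyphens are ignored
  \s+                           # real whitespace has to be spelled out
  (\w+)                         # one-word label
/#

// Hypothetical input line.
let input = "2022-05-20 review"
let match = input.wholeMatch(of: datedLabel)
if let match {
  print(match.1, match.2, match.3, match.4)
}
```

The `\s+` line is the Swift analogue of Perl's advice to escape or encode any whitespace you actually want to match.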

Paul_Cantrell · May 20, 2022, 5:28pm

Hmm, using #///…///# for extended mode is an idea worth considering. I added two variations of it to my list of options above.

ensan-hcl · May 25, 2022, 2:06pm

Is there any possibility of adopting the capability proposed in SE-0359 for regex literals? I thought the alternative #regex("...") could be justified by that proposal, if we consider it as a function that takes a @const _ regex: String parameter.

Michael_Ilseman · May 25, 2022, 2:15pm

Could you elaborate? It seems SE-0359 would be an (incomplete) implementation mechanism at best, but otherwise it’s hard to see how it’s relevant to the discussion surrounding having a literal vs not. Are you arguing for no literal?

ensan-hcl · May 25, 2022, 2:28pm

Ah, sorry. I had this comment from the pitch thread in mind.

[Pitch #2] Regex Literals

The problem with #regex("...") is that it looks like a string literal argument to a magic literal, when in fact the quotes are part of the delimiter itself. For example, you wouldn't be able to do:
let pattern = "[abc]+"
let regex = #regex(pattern)
which would likely be unexpected.

[Pitch #2] Regex Literals

Note it wouldn't be entirely like StaticString (or literals in general), as you also wouldn't be able to intermix any expressions between #regex(...) and the "..." argument. For example you wouldn't be able to write:
#regex(b ? "[abc]" : "[def]")
Or, if you were, it would lose out on editor support.

As a counterargument to #regex("..."), he pointed out that the ("...") part behaves differently from other functions, since it does not allow expressions like #regex(x) or even #regex(condition ? "abc" : "def").

However, the proposed @const parameters can support such behavior. Therefore, if they are introduced, the #regex("...") signature becomes less unfamiliar, because its behavior is very similar to that of @const parameters.

I also understand that the #regex("") alternative has other problems, as written in the proposal. I'm not suggesting anything here; I was just curious about how SE-0359 relates to this proposal.

hamishknight · May 25, 2022, 4:55pm

johnno1962:

I really think we're missing a trick dismissing the more assertively differentiating syntax for the multiline "extended" regex literal:
    let re = #///
            my long regex # a comment
            ///#
IMO it would be more consistent with other string literals if #/regex/# were not allowed to extend over more than one line and magically slip into multi-line mode. The more verbose format is perfectly fine to lex/parse provided you don't attempt to parse ///.../// as well on principle.

While I do agree that #/// is parsable without much issue, I'm not convinced that it offers much of a benefit over #/. It's quite a bit more verbose, and doesn't indicate "extended syntax" any more than #/. Additionally, the presence of # in the delimiter seems to imply that there would also be a /// version of the literal, which there is not.

While the extra verbosity of #/// could be warranted to be consistent with """, I'm not convinced that the consistency is particularly desirable, as it follows completely different semantics. For example, whitespace is non-semantic and backslashes treat newlines as literal, rather than eliding them:

let str = """
  a\
  b\
  c
"""
// str = "  a  b  c"

let re = #/
  a\
  b\
  c
/#
// re = /a\nb\nc/

Talking with @beccadax about the rationale for using """ delimiters for multi-line string literals, the two main reasons were:

Editing. It was felt that typing " and temporarily messing up the source highlighting of the rest of the file was a bad experience.
Visual weight. It was felt that a single " written after potentially paragraphs of text would be difficult to notice.

However, neither of these is a serious issue for regex literals. #/ has plenty of visual weight, is likely to be infrequently used in the single-line case, and the compiler requires a closing /# on a new line before the literal is treated as multi-line.

Ben_Cohen · May 25, 2022, 5:16pm

The premise of this second review is that the core team has accepted in principle the need for a succinct spelling for regex literals. So discussion of how to achieve a longer spelling, even in light of SE-0359, is mostly obviated.

However, SE-0359 doesn't really answer Hamish's first point here. He is pointing out that the quotation marks are misleading. It looks like a string literal argument, but it is not a string literal argument. Yes, if it were a string literal argument, then it could be required to be a @const string literal (addressing his second point). But that's not what it is. It is not a string, but a regex literal, the spelling of the delimiter of which contains a ".

Note that this is really a fairly small aspect of what regex literals bring. In future perhaps even compile-time validation could be done through a string literal. But the contents of the regex literal have an effect on the type system because capture groups become part of the structure of the returned type.
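A small sketch of that last point, assuming a Swift 5.7 or later toolchain: the literal's capture groups show up in the static type of the match, while the same pattern built from a string at run time only exposes type-erased output. The pattern and names are illustrative.

```swift
// Literal: the inferred type is
// Regex<(Substring, year: Substring, month: Substring)>.
let yearMonth = #/(?<year>\d{4})-(?<month>\d{2})/#
let typed = "2022-05".wholeMatch(of: yearMonth)
if let typed {
  print(typed.year, typed.month)   // named, statically typed captures
}

// Run-time construction from a string yields Regex<AnyRegexOutput>.
// try! is used only because the pattern is a known-good constant.
let dynamic = try! Regex(#"(?<year>\d{4})-(?<month>\d{2})"#)
let erased = "2022-05".wholeMatch(of: dynamic)
if let erased {
  // Captures are looked up by position and may be nil.
  print(erased.output[1].substring ?? "", erased.output[2].substring ?? "")
}
```

With the string-based version, a typo in the pattern or a wrong capture index is only discovered when the code runs.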

Paul_Cantrell · May 25, 2022, 5:48pm

Hamish, I tend to agree with most of that. I quibble on a couple of points:

Agreed that it doesn’t specifically indicate extended mode, but I’d argue that it does at least indicate something has changed here.

In other words, I’d argue that this is at least marginally less surprising:

#/hello world/#   ✅ matches "hello world"

#///
hello world       ❌ does not match "hello world"
///#

If nothing else, that #/// induces an “oh, that looks different” reaction, which is not nothing.

I doubt that! URLs and paths seem like a primary use case:

#/https://forums.swift.org/t/(.+)/(\d+)/(\d+)/#

Certainly in my experience with Ruby, custom delimiters for single-line regexes are common.
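For illustration, here is how that URL pattern might be used with the extended `#/ ... /#` delimiters, assuming a Swift 5.7 or later toolchain. The pattern is copied as written (so each unescaped `.` matches any character), and the sample URL is a made-up placeholder rather than a real topic.

```swift
// The extended delimiters mean the slashes in the URL need no escaping.
let topicURL = #/https://forums.swift.org/t/(.+)/(\d+)/(\d+)/#

// Hypothetical URL, for illustration only.
let sample = "https://forums.swift.org/t/some-topic/123/4"
let urlMatch = sample.wholeMatch(of: topicURL)
if let urlMatch {
  print(urlMatch.1, urlMatch.2, urlMatch.3)
}
```

The same pattern written with bare `/.../` delimiters would need every `/` in the path escaped.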

All that said, I do generally agree with your points, particularly the confusing false parallel between #/// and """.

Using #/// for extended mode is not an entirely satisfactory solution.

frozendevil · May 25, 2022, 10:43pm

I'm struggling a bit to understand this—IIRC in the original set of regex related pitches the argument for having both regex literals and builders was that the small/easy/obvious regexes could be represented as concise literals, while the larger more complicated ones could be represented as (theoretically) easier to read builders. As someone who hasn't used a regex-literal-supporting language in my day job, multi-line literals feel like a fairly esoteric feature (although clearly my perspective is limited)... given the ongoing conversation on behavior, it seems like it would make sense to at least move this piece of functionality to a later release, and use the intervening time to settle on a design and gauge how much demand there is.

Michael_Ilseman · May 25, 2022, 10:59pm

That's close to the intent and it is certainly the case that literals shine the most for small/easy/obvious regexes. But we're supporting full regex literals; that is, we're not artificially hampering literals by removing important syntax that run-time strings support, nor are we removing features currently inexpressible in the builders (see alternative).

A little more rationale from earlier in this thread:

Michael_Ilseman:

A key feature of regex syntax is broad compatibility with other engine syntaxes. (Note that literal delimiters are not motivated by a copy-paste scenario, but the regex syntax contained is). A key feature of regex literals is compile-time knowledge of the same regex syntax to drive compiler errors and type inference. We want to encourage people to use literals whenever possible instead of using run-time regex compilation from a string.

Without any story for multi-line non-semantic whitespace, a lot of the value of a literal is harmed. The direct workaround would be to represent these regexes as run-time compiled strings with explicit types provided, i.e. don't use literals. The other workarounds are to heavily re-work the regex either into a single-line literal or to convert to a builder (which could have further reaching implications).

...

We're not talking about shipping the "good" feature and the "bad" feature. We're talking about two good features with their own, mutually complementary, strengths and weaknesses.

If there's a "bad" feature here to nudge people off of, it's using run-time construction for statically-known regexes. If we do not have a multi-line non-semantic whitespace literal solution to offer, then we are nudging people onto this "bad" path.

I'm not sure what new information we would acquire. It can be harder to explain a retro-fitted story than present the whole story up front. Additionally, delaying it would put more code on the "bad" path of run-time construction from strings.
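For reference, the run-time workaround described above ("run-time compiled strings with explicit types provided") might look roughly like the sketch below. It assumes a Swift 5.7 or later toolchain and that `(?x)` in a run-time pattern enables non-semantic whitespace and `#` comments; the pattern and names are invented.

```swift
// A multi-line raw string holding an extended-syntax pattern.
let pattern = #"""
  (?x)       # turn on extended syntax for the rest of the pattern
  (\d+)      # amount
  \s+
  (\w+)      # unit
  """#

// The capture types must be spelled out by hand and are only checked
// when this line runs. try! is used because the pattern is a constant.
let amountAndUnit = try! Regex(pattern, as: (Substring, Substring, Substring).self)

let parsed = "42 apples".wholeMatch(of: amountAndUnit)
if let parsed {
  print(parsed.1, parsed.2)
}
```

If the tuple type and the pattern drift apart, the mismatch surfaces as a thrown error at run time rather than as a compiler diagnostic, which is the cost being weighed here.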

frozendevil · May 26, 2022, 1:50am

Got it, thanks! I was looking in the proposal for multi-line literal discussion

This is where I lose clarity a bit, I think. This is again from the perspective of someone who's been primarily an Objective-C/Swift coder, but I can't recall ever seeing a multi-line regex "in the wild". It feels like a leap to say that people will insist on using run-time construction over switching to builders.

(Again without broad knowledge of the regex ecosystem across languages) The discussion in this thread makes it seem like there isn't strong established precedent for how to handle syntax for extended literals—in contrast to the more typical single-line version.

The two main things I would be interested in are whether there's desire for this feature at all and, if so, real-world examples that may influence design decisions (e.g. @Paul_Cantrell's example of single-line #/ /# being important for URLs).

That said, I just did a very unscientific grep through the source for RubyGems and found a handful of multi-line literals. My original concern really boils down to worry that multi-line literals are more speculative than pragmatic, which it's now clear to me isn't the case. I do still think waiting and observing how the dynamics between literals and builders play out would be ideal, but I'm much more ambivalent about it.

xwu · May 26, 2022, 3:14am

hamishknight:

Michael_Ilseman:

@hamishknight What's the status of the no-trailing-whitespace rule? Is it desirable, or not, and would it fix this issue?

I think it's desirable if it avoids breaking source in cases that are somewhat common. It would indeed fix this particular issue. However, at least from the source compatibility testing we've done so far, I'm not entirely sure that these cases are particularly common. I still remain slightly concerned that such a rule could potentially lead to an odd editing experience (i.e while typing a regex containing spaces, the literal would change meaning if space is the current character being typed, assuming both delimiters are present during typing). I think we may want to get the core team's opinion on the tradeoff being made here.

I'd like to echo that it'd be really great to get some consideration of this—not only for avoiding nearly any source breaks, but also because it kind of makes sense from an aesthetic standpoint that neither leading nor trailing whitespace would be permitted.

(And I don't mean "aesthetic" here solely to mean the kind of satisfaction that one gets from an elegant math proof or something, but also, more crucially, in the sense that users who see foo(/, /) with their squishy human eyeballs are likely to interpret it in a certain way due to how they've been conditioned to make sense of whitespace, and it would be more user-friendly if Swift's actual rules aligned with that intuition.)

As I said earlier, I agree that the editing experience is a legitimate consideration, but in the case of a regex, when typing out a line that starts something like let x = /, I can't see how an editor would know to insert a closing delimiter, and in the absence of a paired / the syntax highlighting wouldn't flit back and forth between regex and non-regex in the scenario where a user is typing out a new line of code.

...What if, instead of supporting bare multiline #/ ... /# and merely suppressing warnings with (?x) in the regex, we required an explicit opening #/(?-x) in order to use a multiline literal at all?

It would (a) have the advantage of being very explicit about the change in syntactic behavior going from a one-line literal to a multi-line literal; while (b) still answering @Michael_Ilseman's rationale for supporting multiline literals in the first place; and (c) hewing to the insightful analysis that "Region 2" of the design space (semantic whitespace in multiline literals) is undesirable.

Yes, the syntax becomes more verbose, but in this context we're talking about a regex literal that's complicated enough to be best written spanning multiple lines, perhaps including comments just to explain what's going on. Being required to throw in a few characters for an explicit indicator of how whitespace is being handled doesn't seem so outlandish.

AliSoftware · May 26, 2022, 12:47pm

Re: multi-line RegExes and extended mode.

I'd tend to agree here.

Worth noting: some people might worry that using the Builder DSL for the same regex would be too verbose, or too different from a multi-line extended regex copy-pasted from elsewhere. But since it will be valid to mix the DSL with literals, a middle ground is already possible under the current proposals. That lowers the barrier to using builders and, IMHO, is one more argument for not supporting multi-line extended regex literals and pushing towards this instead:

import RegexBuilder

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  /(?<kind>\w+)/; fieldBreak
  /(?<date>\S+)/; fieldBreak
  /(?<account>(?:(?!\s\s).)+)/; fieldBreak  // Note that account names may contain spaces.
  /(?<amount>.*)/
}

To me the mere possibility of using this when you need multi line and comments means that supporting extended mode in multiline literals is probably not worth it (given the debate / ambiguity of which white spaces would be trimmed if we wanted to support it)
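A variant of the builder sketch above that avoids the bare-slash flag is shown below, as an illustration only. It assumes a Swift 5.7 or later toolchain with the RegexBuilder module, swaps the named groups for explicit `Capture` components, and reuses the sample line from the comment in the original example; the variable names are invented.

```swift
import RegexBuilder

// Two or more whitespace characters separate the fields.
let fieldBreak = #/\s\s+/#

let statementLine = Regex {
  Capture(#/\w+/#)               // kind
  fieldBreak
  Capture(#/\S+/#)               // date
  fieldBreak
  Capture(#/(?:(?!\s\s).)+/#)    // account; may contain single spaces
  fieldBreak
  Capture(#/.*/#)                // amount
}

let line = "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
let fields = line.wholeMatch(of: statementLine)
if let fields {
  print(fields.1, fields.2, fields.3, fields.4)
}
```

Ordinary Swift comments and line breaks do the job that extended mode would do inside a single literal, at the cost of a few more delimiters.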