SE-0354 (Second Review): Regex Literals

I don't think it's a good idea to make subtle changes to how whitespace is handled in whitespace-ignoring mode compared to other languages. It would make copy-pasting whitespace-ignoring regexes from elsewhere error-prone. If someone were to use whitespace to align elements of the regex in columns, using the same regex in Swift would have different semantics:

let delimited = #/
  \(  .*  \)  |
  \[  .*  \]  |
  \{  .*  \}  |
   <  .*   >
/#

I suppose a warning could work, but how do you disable that warning without rewriting the regex?

You can write a significant space with [ ] in Perl /x mode, so I assume it'd work similarly here. Whitespace is not ignored in a character class.

4 Likes

In SE-0355: Regex Syntax and Runtime Construction, we are proposing a unified non-semantic whitespace behavior that treats whitespace as non-semantic both inside and outside custom character classes:

In both PCRE and Perl, this is enabled through the (?x), and in later versions, (?xx) matching options. The former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.

Oniguruma, Java, and ICU however enable the more broad behavior under (?x). We therefore propose following this behavior, with (?x) and (?xx) being treated the same.

Oh! This is surprising to me given I'm used to Perl and PCRE, not Java or ICU. I suppose it makes sense, but I'd be a bit baffled by [ ] not matching a space. Will check the other thread.
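
For anyone else coming from Perl or PCRE, here is a minimal sketch of writing a significant space without relying on [ ]. It assumes the SE-0355 behavior quoted above (whitespace is ignored everywhere under (?x)) and uses the runtime Regex(_:) initializer with a \x20 hex escape. The pattern and the inputs are made up for illustration.

```swift
// Assumes Swift 5.7+ (macOS 13+ on Apple platforms).
// Under (?x) the layout spaces are ignored; \x20 is an explicit U+0020 space.
let spaced = try Regex(#"(?x) hello \x20 world"#)

// Only the input that really contains the space should match as a whole.
let matchesSpaced = "hello world".wholeMatch(of: spaced) != nil
let matchesJoined = "helloworld".wholeMatch(of: spaced) != nil
print(matchesSpaced, matchesJoined)
```

If the proposed rule holds, the first check succeeds and the second fails, which is the reverse of what the unescaped layout spaces might suggest at a glance.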

1 Like

Would it be possible to unify the behavior of multiline regex literals with that of regexes initialized at runtime from multiline strings, but in the other direction from that discussed in the SE-0355 review thread? Namely, to require (?x) explicitly for non-semantic whitespace behavior regardless of the regex literal delimiter (while still eliding the first and last newline in a multiline regex literal and any indentation less than the closing delimiter's)?

It would be super if the final design could achieve the goal that all of the following ultimately mean the same thing (modulo static typing, etc.), or at least for as many expressions as possible:

let a = Regex("<some regex>")
let b = Regex("""
  <some regex>
  """)
let c = #/<some regex>/#
let d = #/
  <some regex>
  /#
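
To make the difference concrete, here is a small sketch with "a b" standing in for <some regex>. It assumes the behavior as currently proposed (a multi-line literal implies extended syntax, a single-line literal or a string does not); the names keepsSpace and the sample input are only for illustration.

```swift
// Assumes Swift 5.7+ (macOS 13+ on Apple platforms).
let a = try Regex("a b")
let b = try Regex("""
  a b
  """)
let c = #/a b/#
let d = #/
  a b
  /#

// a, b and c treat the space as a literal character.
// d is parsed with non-semantic whitespace, so its space is dropped.
let keepsSpace = [
  "a b".wholeMatch(of: a) != nil,
  "a b".wholeMatch(of: b) != nil,
  "a b".wholeMatch(of: c) != nil,
  "a b".wholeMatch(of: d) != nil,
]
print(keepsSpace)
```

Under the proposed rules only the last form behaves differently, which is the inconsistency this post is asking about.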

9 Likes

What is the use of a raw literal with many #s, e.g., ###/.../###? The only benefit over #/.../# seems to be access to /# as part of the regex. A long run of # inside the regex is already unlikely, given that ##### could be contracted to #{5}. Am I missing something?

2 Likes

I’d like to see discussion about whether the rule should be that all whitespace is non-semantic or just leading and trailing whitespace.

8 Likes

One reasonable rule — not necessarily advocating, just musing — would be as follows:

  1. Remove comments
  2. For each line, remove all leading and trailing whitespace
  3. Remove newlines
  4. Any whitespace that remains is significant

For example:

    #/
        (
            hello        # morning
            |
            good night   # evening  (this and only this space character is preserved)
        )
        (
            ,\s+
            every
            (body|one)
        )?
    /#

…would be equivalent to:

/(hello|good night)(,\s+every(body|one))?/

Edit to add: We might want an additional rule that any space preceded by a backslash is not removed in step 2, so that this works:

#/
  hello\       # space after backslash is not removed, but subsequent spaces on this line are
  world
/#
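
The rule above, including the backslash-space addition, can be sketched as a plain string transformation. This is only an illustration of the musing, not how the compiler or any proposal implements it; collapseExtendedRegex is a hypothetical name, and it simply treats any unescaped # as the start of a comment.

```swift
/// Hypothetical helper: applies the four steps above to the body of a multi-line regex.
func collapseExtendedRegex(_ source: String) -> String {
    let isBlank: (Character) -> Bool = { $0 == " " || $0 == "\t" }
    var pattern = ""
    for line in source.split(separator: "\n", omittingEmptySubsequences: false) {
        // Step 1: remove a comment, i.e. everything from the first unescaped `#`.
        var code = ""
        var escaped = false
        for ch in line {
            if ch == "#" && !escaped { break }
            escaped = (ch == "\\") && !escaped
            code.append(ch)
        }
        // Step 2: remove leading and trailing spaces and tabs.
        var trimmed = String(code.drop(while: isBlank))
        var firstTrailing: Character? = nil
        while let last = trimmed.last, isBlank(last) {
            firstTrailing = trimmed.removeLast()
        }
        // Extra rule: whitespace directly after an escaping backslash survives.
        let backslashes = trimmed.reversed().prefix(while: { $0 == "\\" }).count
        if let kept = firstTrailing, backslashes % 2 == 1 {
            trimmed.append(kept)
        }
        // Step 3: joining without a separator removes the newlines.
        // Step 4: whatever whitespace is left inside `trimmed` stays significant.
        pattern += trimmed
    }
    return pattern
}

let collapsed = collapseExtendedRegex("""
    hello\\       # the space after the backslash survives
    world
    """)
print(collapsed)  // prints: hello\ world
```

Note that the sketch never parses the regex itself, so a # inside a character class would also be treated as a comment; a real implementation would need to decide how to handle that.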

Edit again: With the thread move, my reply to a reply to this post got out of order; note that I found @hamishknight’s counterargument compelling and prefer their proposed alternative.

7 Likes

Apologies, I moved some of these comments from the previous thread for further discussion, so they are a bit out of order (e.g. @Paul_Cantrell's post above this one then got later replies that are now above it).

3 Likes

Regarding the new syntax: how does Swift diagnose incorrect regular expression syntax? What do you get here, for instance?

let foo = (/hello|(world))/;

2 Likes

@Ben_Cohen, do we have a toolchain with the currently proposed behavior to check such things ourselves?

2 Likes

The approach taken with the regex proposals (as with Swift Concurrency) is that the work is getting integrated under a compiler flag (-enable-bare-slash-regex in this case) while under review. This means you can use the nightly toolchains from swift.org (either main or release/5.7) to try out the feature. But it looks like recent nightly toolchain builds haven't been posted yet – I'm checking on this and the latest 5.7 branch should be available shortly.

That said, looking specifically at the diagnostics currently output by the compiler when code is invalid should not be considered something that is covered by this review.

The primary reason for this is that the bar for evolution proposals is a prototype implementation that demonstrates how the feature is used. The expectation is not that this prototype is yet "shippable" or even mergeable into the main branch without additional work. Part of the work to get it to that point, which happens after proposal acceptance, is often quality-of-implementation work such as good quality diagnostics when the compiler hits invalid code.

Of course, sometimes having this kind of QoI is highly desirable at the proposal stage. Without it, reviewers need to reason about the results of using a fully productized implementation, not just the prototype provided for review. A similar example is runtime performance optimization – with some proposals, performance is a key driver and so not having the final fully optimized implementation may present challenges to reviewers who might be considering whether, say, such a proposal is a worthwhile tradeoff versus the complexity it might add to the language.

Nevertheless, requiring a full production-worthy implementation is felt to be too high a bar for a proposal to make it to the review stage. So we ask reviewers to bear with the proposal and try to work through these things on paper instead.

Feedback on whether that bar should be raised is welcome, but would be more appropriate on a dedicated thread, probably one in the Evolution/Discussion category. Feedback on diagnostic implementation is also welcome, but probably belongs in the Development/Compiler category.


So to bring it back to the immediate question, I guess it really needs to come back as another question: as a human looking at that code, on paper what would the ideal diagnostic be for this code?

let foo = (/hello|(world))/;

Once there's consensus amongst us humans on what the "right" diagnostic is for this code (bearing in mind you can have the compiler emit more than one diagnostic for two different interpretations), we can discuss whether it's possible, given the parsing rules, to have the compiler emit them. If the answer might be "no", then that's very relevant to the proposal review. Such feedback might lead to reconsidering deprecating the prefix / operator, for example.

It's worth noting that diagnostics on invalid code are able to use more information than is available when parsing valid code. For example, in the f(/,/) case, the diagnostic can make use of knowledge from the type checker that there isn't a unary function that would accept a Regex but there is a binary function that takes two binary functions.
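
As a reminder of why a bare / is hard to classify in argument position, here is a minimal sketch of / used as an ordinary function value. Wrapping the operator in parentheses is shown as one way to keep the reference unambiguous; the values are arbitrary and quotient is just an illustrative name.

```swift
// `/` here is the division operator passed as a function value, not a regex delimiter.
// The parentheses make that explicit regardless of how regex literals are parsed.
let quotient = [2, 3].reduce(24, (/))  // (24 / 2) / 3
print(quotient)  // prints: 4
```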

9 Likes

Thanks for the detailed response. I completely understand that we can't expect much from the diagnostics at this stage.

On the other hand, I think playing with a rough implementation of the rules and seeing how the compiler reacts to various situations can give more insight into whether the current rules are going to be enough for a good developer experience.

For example, what is going to happen in a place like a playground, where the compiler is continuously trying to parse and diagnose as you type? Being in the middle of a regex literal is a totally new and weird place for the compiler to be.

For other literal types, there are good, distinct indicators at least for their beginning, but / can be harder to detect at the start of a regex literal. For example, an editor can confidently insert a closing delimiter as we type the opening delimiter (which helps the compiler with partially typed code), but this is only possible with / if the compiler already expects a regex literal in that position. I want to get a better feeling for how often that context is available to the compiler, to see how the experience of typing a regex literal is going to compare to, say, typing a string.

3 Likes

What's the rationale for the extended literal (#/.../#) enabling free-spacing mode (?x) by default, as opposed to other modes such as case-insensitive mode? I read the doc a few times but don't see it. Furthermore, is there a way to disable it?

Update: the Swift 5.7 toolchain snapshot as of last night is now available on swift.org.

4 Likes

To avoid any misunderstanding: #/ followed by a newline (and with a matching newline preceding the /#) enables extended-syntax (non-semantic whitespace + # comments) mode. #/.../# alone does not do it.

You might still ask why the multi-line literal is not also case-insensitive as well as whitespace-insensitive, of course.

It looks like no:

➜  ~ cat multiline.swift
if #available(macOS 9999, *) {
    let r = #/
        (?-x hello world)
    /#
}
➜  ~ xcrun --toolchain "Swift 5.7 Development Snapshot 2022-05-15 (a)" swiftc -enable-bare-slash-regex multiline.swift
multiline.swift:3:9: error: cannot parse regular expression: extended syntax may not be disabled in multi-line mode
        (?-x hello world)
        ^
➜  ~

This probably needs clarification/justification in the proposal.

3 Likes

Huh, the proposal doesn't mention case insensitivity. Is that a part of the proposed regex ecosystem at all? Seems like it belongs in here somewhere. (Apologies if I missed it.)

It's part of the regex syntax proposal:

let r = /(?i:h)ello (?i:w)orld/
let m = try! r.firstMatch(in: "Hello World")
print(m!.output) // prints Hello World

3 Likes

It occurs to me that another line of argument is that Swift simply should not support extended mode at all. Once again, I am musing, not necessarily advocating. The argument is that the concise literal syntax is best for short regexes; any regex that does not fit on a single line should use the builder DSL to break it into multiple lines.

Wondering how this plays out, I tried translating @hamishknight’s example from above:

…into a builder DSL expression with a similar spirit of formatting:

import RegexBuilder

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g. "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  // No layout spaces inside the literals: whitespace is significant in a single-line regex.
  Capture(/\w+/,            as: kind);    fieldBreak
  Capture(/\S+/,            as: date);    fieldBreak
  Capture(/(?:(?!\s\s).)+/, as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,             as: amount)
}

Is that compelling enough to dispense with extended mode altogether? I’m not sure.

The repetition of Reference(Substring.self) is certainly unsatisfying, and makes me wish again for the DSL to support named capture groups as tuple labels to parallel the behavior of literals. (One day, hopefully!)

If we’re willing to dispense with the clarity and safety of named capture groups, the DSL builder version isn't such a bad alternative to extended mode:

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/); fieldBreak             // kind
  Capture(/\S+/); fieldBreak             // date
  Capture(/(?:(?!\s\s).)+/); fieldBreak  // account (Note that account names may contain spaces.)
  Capture(/.*/)                          // amount
}

I’d say that the builder is an improvement for my own multiline example from above, although it's probably less representative of common usage than Hamish’s example:

#/
    (
        hello        # morning
        |
        good night   # evening  (this and only this space character is preserved)
    )
    (
        ,\s+
        every
        (body|one)
    )?
/#

Regex {
	ChoiceOf {
		"hello"       // morning
		"good night"  // evening  (no special handling of space character necessary)
	}
	Optionally {
		/,\s+/
		"every"
		/body|one/
	}
}

Perhaps multiline / extended mode won’t pull its weight as a feature in Swift? I’m not sure I’ve convinced myself here, but it’s worth considering the question.

12 Likes

Is there a reason we can't specify matching options as flags following the closing / (or /#) like in other languages?

let firstPart  = /abc | d /xi
let secondPart = /ef  | gh/xi

I don't see it mentioned in the proposal. I suppose this omission could be for disambiguating with the / operator. It seems to me this will affect how easily regexes can be copy-pasted from other places, so it is worth a note.

It can be rewritten like this of course:

let firstPart  = /(?xi)abc | d /
let secondPart = /(?xi)ef  | gh/

so functionality isn't left out, only familiarity.
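
A quick sketch of the inline form, using the runtime Regex(_:) initializer so that it does not depend on the bare-slash flag. It assumes (?xi) combines extended syntax and case-insensitivity as described in the regex syntax proposal; the inputs are made up.

```swift
// Assumes Swift 5.7+ (macOS 13+ on Apple platforms).
// (?xi): x ignores the layout spaces, i ignores case.
let firstPart = try Regex("(?xi)abc | d ")

// "ABC" and "D" should match as a whole; the raw pattern text itself should not.
let results = ["ABC", "D", "abc | d "].map { $0.wholeMatch(of: firstPart) != nil }
print(results)
```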

1 Like

Could the parser go even further? If there is a valid interpretation without regex literals, use it. I think this would remove all ambiguities and all source breakage.

func foo(_ x: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: (_: Int, _: Int) -> Int, _ y: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: Regex<Substring>) {}

// Not regex:          vs.  Regex:
foo(/).reduce(4, /)         foo(#/).reduce(4, /#)
foo(/, /)                   foo(#/, /#)

// Must be regex -  '/' is not a postfix unary operator
foo(/, 4/)

In other words, treat /…/ as syntactic sugar for #/…/# that can only be used when unambiguous.

Or would that result in new/bigger problems?

1 Like