SE-0354 (Second Review): Regex Literals

Lantua · May 16, 2022, 10:11pm

What's the rationale for the extended literal (#/.../#) enabling free-spacing mode (?x) by default, but not other modes such as case-insensitive mode? I read the doc a few times but don't see it. Furthermore, is there a way to disable it?

Ben_Cohen · May 17, 2022, 12:58am

Update: the Swift 5.7 toolchain snapshot as of last night is now available on swift.org.

Ben_Cohen · May 17, 2022, 1:09am

To avoid any misunderstanding: #/ followed by a newline (and with a matching newline preceding the /#) enables extended-syntax (non-semantic whitespace + # comments) mode. #/.../# alone does not do it.
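
A minimal sketch of that distinction, assuming a Swift 5.7 or later toolchain (and, on Apple platforms, a deployment target that includes the Regex runtime). The names single and multi are illustrative only, and the #/.../# delimiters are used throughout so that no compiler flag is needed:

```swift
// Single-line #/.../#: whitespace is semantic, so the space is part of the pattern.
let single = #/hello world/#

// Multi-line #/ ... /#: extended syntax is on, so layout whitespace and
// `#` comments are ignored; the space is matched explicitly with \s.
let multi = #/
  hello   # greeting
  \s
  world
  /#

let singleMatchesSpaced = "hello world".wholeMatch(of: single) != nil
let multiMatchesSpaced = "hello world".wholeMatch(of: multi) != nil
let multiMatchesJoined = "helloworld".wholeMatch(of: multi) != nil
print(singleMatchesSpaced, multiMatchesSpaced, multiMatchesJoined)
```

Both literals should accept "hello world", while the multi-line one should reject "helloworld" because the only whitespace it matches is the explicit \s.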

You might still ask why the multi-line literal is not also case-insensitive as well as whitespace-insensitive, of course.

As for whether there is a way to disable it, it looks like not:

➜  ~ cat multiline.swift
if #available(macOS 9999, *) {
    let r = #/
        (?-x hello world)
    /#
}
➜  ~ xcrun --toolchain "Swift 5.7 Development Snapshot 2022-05-15 (a)" swiftc -enable-bare-slash-regex multiline.swift
multiline.swift:3:9: error: cannot parse regular expression: extended syntax may not be disabled in multi-line mode
        (?-x hello world)
        ^
➜  ~

This probably needs clarification/justification in the proposal.

Paul_Cantrell · May 17, 2022, 3:06am

Huh, the proposal doesn't mention case insensitivity. Is that a part of the proposed regex ecosystem at all? Seems like it belongs in here somewhere. (Apologies if I missed it.)

Ben_Cohen · May 17, 2022, 3:15am

It's part of the regex syntax proposal:

let r = /(?i:h)ello (?i:w)orld/
let m = try! r.firstMatch(in: "Hello World")
print(m!.output) // prints Hello World

Paul_Cantrell · May 17, 2022, 4:54am

It occurs to me that another line of argument is that Swift simply should not support extended mode at all. Once again, I am musing, not necessarily advocating. The argument is that the concise literal syntax is best for short regexes, and that any regex that does not fit on a single line should use the builder DSL to break it into multiple lines.

Wondering how this plays out, I tried translating @hamishknight’s example from above:

hamishknight:

let regex = #/
  # Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  (?<kind>    \w+)                \s\s+
  (?<date>    \S+)                \s\s+
  (?<account> (?: (?!\s\s) . )+)  \s\s+ # Note that account names may contain spaces.
  (?<amount>  .*)
  /#

…into a builder DSL expression with a similar spirit of formatting:

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?: (?!\s\s) . )+/, as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Is that compelling enough to dispense with extended mode altogether? I’m not sure.

The repetition of Reference(Substring.self) is certainly unsatisfying, and makes me wish again for the DSL to support named capture groups as tuple labels to parallel the behavior of literals. (One day, hopefully!)

If we’re willing to dispense with the clarity and safety of named capture groups, the DSL builder version isn't such a bad alternative to extended mode:

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/); fieldBreak             // kind
  Capture(/\S+/); fieldBreak             // date
  Capture(/(?:(?!\s\s).)+/); fieldBreak  // account (Note that account names may contain spaces.)
  Capture(/.*/)                          // amount
}
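
For anyone who wants to try this, here is a rough runnable sketch of the unnamed-capture version, assuming Swift 5.7 or later with the RegexBuilder module available. The names lineRegex, input and match are made up for illustration, fieldBreak is hoisted out of the builder, and #/.../# is used in place of /.../ only so that it compiles without the bare-slash flag:

```swift
import RegexBuilder

// Two or more whitespace characters separate the fields.
let fieldBreak = #/\s\s+/#

let lineRegex = Regex {
  Capture(#/\w+/#); fieldBreak             // kind
  Capture(#/\S+/#); fieldBreak             // date
  Capture(#/(?:(?!\s\s).)+/#); fieldBreak  // account (may contain single spaces)
  Capture(#/.*/#)                          // amount
}

let input = "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
let match = input.wholeMatch(of: lineRegex)

if let match {
  // output is a tuple: (whole match, kind, date, account, amount)
  let (_, kind, date, account, amount) = match.output
  print(kind, date, account, amount, separator: " | ")
}
```

The captures come back positionally, which is exactly the loss of clarity mentioned above: nothing but the comments ties position 3 to "account".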

I’d say that the builder is an improvement for my own multiline example from above, although it's probably less representative of common usage than Hamish’s example:

 #/
     (
         hello        # morning
         |
         good night   # evening  (this and only this space character is preserved)
     )
     (
         ,\s+
         every
         (body|one)
     )?
/#

Regex {
  ChoiceOf {
    "hello"       // morning
    "good night"  // evening  (no special handling of space character necessary)
  }
  Optionally {
    /,\s+/
    "every"
    /body|one/
  }
}

Perhaps multiline / extended mode won’t pull its weight as a feature in Swift? I’m not sure I’ve convinced myself here, but it’s worth considering the question.

michelf · May 17, 2022, 10:59am

Is there a reason we can't specify matching options as flags following the closing / (or /#) like in other languages?

let firstPart  = /abc | d /xi
let secondPart = /ef  | gh/xi

I don't see it mentioned in the proposal. I suppose this omission could be for disambiguating with the / operator. It seems to me this will affect how easily regexes can be copy-pasted from other places, so it should be worth a note.

It can be rewritten like this of course:

let firstPart  = /(?xi)abc | d /
let secondPart = /(?xi)ef  | gh/

so functionality isn't left out, only familiarity.
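
A small sketch of what the inline (?xi) form changes, assuming Swift 5.7 or later. The names literalSpaces and flagged are hypothetical, and #/.../# is used instead of /.../ only to avoid depending on the bare-slash flag:

```swift
// No options: the spaces are literal, so the alternatives are "abc " and " d ".
let literalSpaces = #/abc | d /#

// (?xi): whitespace in the pattern is ignored and matching is case-insensitive,
// so the alternatives are effectively "abc" and "d" in any letter case.
let flagged = #/(?xi)abc | d /#

let plainRejectsAbc = "abc".wholeMatch(of: literalSpaces) == nil
let flaggedAcceptsUpperAbc = "ABC".wholeMatch(of: flagged) != nil
let flaggedAcceptsUpperD = "D".wholeMatch(of: flagged) != nil
print(plainRejectsAbc, flaggedAcceptsUpperAbc, flaggedAcceptsUpperD)
```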

dhoepfl · May 17, 2022, 12:40pm

Could the parser go even further? If there is a valid interpretation without regex literals, use it. I think this would remove all ambiguities and all source breakage.

func foo(_ x: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: (_: Int, _: Int) -> Int, _ y: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: Regex) {}

// Not regex:          vs.  Regex:
foo(/).reduce(4, /)         foo(#/).reduce(4, /#)
foo(/, /)                   foo(#/, /#)

// Must be regex -  '/' is not a postfix unary operator
foo(/, 4/)

Treating /…/ as syntactic sugar over #/…/# that can only be used when unambiguous.

Or would that result in new/bigger problems?

dhoepfl · May 17, 2022, 12:52pm

I’d say: Just leading/trailing whitespace.
The hello world example is convincing for me.
(I assume (?xx) would make all whitespace non-semantic)

Maybe:

/…/, #/…/# -> all whitespace is semantic
Multi line #/…/#, Single line ##/…/## -> leading/trailing (after comment removal) whitespace non-semantic
Multi line ##/…/##, Single line ###/…/### -> all whitespace non-semantic

Ben_Cohen · May 17, 2022, 2:14pm

Parsing happens before other parts of the compilation pipeline, so does not have access to semantic information like the types of function arguments (and other similar things – a notable example being whether or not the expression is inside a result builder).

Factoring in that kind of thing would have far-reaching implications for things like compilation time, and the ability for non-Swift compilers to parse Swift (including potentially factoring out a componentized Swift parser from the Swift compiler).

Diagnostics for failed parses can be produced with the help of that information, though, so fixits can benefit from it.

Michael_Ilseman · May 17, 2022, 3:28pm

This is what is proposed and it's unlikely that you'd commonly want more than one #. It's similar to "raw" strings, though perhaps even more rare.

Note that case insensitivity is a semantic option and not a syntactic one, which is why its primary expression is via API. E.g. /abc/.ignoresCase(). We do support the regex syntax for enabling and disabling it. It's possible to argue that any semantic option could/should be set or unset by different literal syntax, but it is a little odd and I'm not aware of much precedent.
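
As a hedged illustration of the two spellings, assuming Swift 5.7 or later and the ignoresCase() method as pitched in the Unicode for String Processing proposal (the variable names are invented for this sketch):

```swift
// Semantic option applied through API on an existing regex value.
let viaAPI = #/abc/#.ignoresCase()

// The same option expressed in regex syntax.
let viaSyntax = #/(?i)abc/#

// Default behaviour for comparison: case-sensitive.
let caseSensitive = #/abc/#

let apiMatchesUpper = "ABC".wholeMatch(of: viaAPI) != nil
let syntaxMatchesUpper = "ABC".wholeMatch(of: viaSyntax) != nil
let defaultMatchesUpper = "ABC".wholeMatch(of: caseSensitive) != nil
print(apiMatchesUpper, syntaxMatchesUpper, defaultMatchesUpper)
```

Because ignoresCase() is an ordinary method on a regex value, it should compose with builder components in a way that a literal-level flag would not.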

Regarding multi-line regexes, traditionally, a newline sequence encoded into a regex would be treated verbatim and match that exact sequence. This is rarely what is actually desired; and if you're splitting a regex across multiple lines for organization or clarity purposes, you nearly always want non-semantic whitespace as well.

The area in the Venn diagram where you want to split a regex across lines, ignore the newlines and surrounding spaces, but keep semantic whitespace within a line for long runs of verbatim content is very small. I'm not aware of any precedent (which doesn't argue we shouldn't do it, but does question how high that demand is).

Jumhyn · May 17, 2022, 4:14pm

Is the .ignoresCase() an API that's actually being proposed somewhere? I couldn't find it from a quick search but also might have missed one of the proposals...

Michael_Ilseman · May 17, 2022, 4:41pm

Paul_Cantrell:

…into a builder DSL expression with a similar spirit of formatting:

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?: (?!\s\s) . )+/, as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Is that compelling enough to dispense with extended mode altogether? I’m not sure.

You'd need to remove the whitespace inside those regexes, so you'd have:

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?:(?!\s\s).)+/,    as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Are you envisioning the scenario where a multi-line regex treats contained newlines as verbatim content, or would they be outright forbidden?

Similarly, what does a newline in a literal with semantic whitespace entail? Verbatim treatment or error? What about spaces around the next line?

Syntactic options are a little different in practice than semantic options, even though they use the same mechanism in traditional regex syntax. (Regex syntax conflates things that the builders treat orthogonally or via API).

The i would preferably be spelled as regex.ignoresCase(), which extends well to structured builders. E.g., string literals are verbatim by default, but you could add that to ignore case for just that component. (Assuming we want the API directly on String, otherwise it might be spelled "literal content".regex.ignoresCase())

Ignoring whitespace could be a modifier, but that implies a semantic change. E.g., it seems like /abc/.ignoringWhitespace() intends to match the input "a b\r\nc".

The ignoresCase() API is in [Pitch] Unicode for String Processing.

@nnnnnnnn any update or thoughts here?

Paul_Cantrell · May 17, 2022, 4:50pm

Oh, thanks, missed that. Fixed the OP.

Is there a verbatim context in this proposal? I thought that #/…/# as proposed either (1) ignores whitespace or (2) has to be on a single line. Apologies if I missed that….

Michael_Ilseman · May 17, 2022, 5:05pm

I mean what should the compiler behavior be for a #/.../# literal that has a newline inside, in the context of your scenario where there is no multi-line/extended mode? Would the compiler reject it or would the newline be treated as verbatim content of the regex?

Paul_Cantrell · May 17, 2022, 5:06pm

In my scenario, the compiler would reject it. (Again, musing, not necessarily advocating. I'm on the fence myself.)

stackotter · May 17, 2022, 9:39pm

Someone may have brought this up already. But would the new parsing rules still misparse someFunction(/MyEnum.case1, /MyEnum.case2)? (casepaths syntax btw)

tim1724 · May 17, 2022, 10:05pm

Yes. I think it would because it matches all four of these conditions:

The least intrusive fix is probably to put parentheses around the first expression:

someFunction((/MyEnum.case1), /MyEnum.case2)

This causes the potential regex literal to fail the "unbalanced )" condition, thus causing it not to be parsed as a regex.

Another choice would be to split the arguments across multiple lines, to make it fail the "closing / on the same line" condition:

someFunction(
    /MyEnum.case1,
    /MyEnum.case2
)

nnnnnnnn · May 17, 2022, 10:05pm

This is a key point — the Unicode proposal includes API for the options that have semantic effects (e.g. ignoresCase(), anchorsMatchNewlines()), but not syntactic effects (e.g. x/xx for extended syntax or n for only capturing named groups). Those syntactic options feed back into things like parsing success or the compile-time output type inference, so method calls that will be evaluated at runtime aren't a good match.

michelf:

Is there a reason we can't specify matching options as flags following the closing / (or /#) like in other languages?

let firstPart  = /abc | d /xi
let secondPart = /ef  | gh/xi

I don't see it mentioned in the proposal. I suppose this omission could be for disambiguating with the / operator. It seems to me this will be impacting how easy regexes can be copy-pasted from other places, so it should be worth a note.

I think you're likely right about the reason — we can add a note about this to the proposal.

nnnnnnnn · May 17, 2022, 11:00pm

We don't allow semantic whitespace inside a multiline regex because there ends up being ambiguity over whether the host language or the embedded language should be responsible for line breaks. We could enable semantic whitespace within option-setting groups, like this:

let r = #/
  (?-x:hello world)
  /#
// matches "hello world"

The entirety of the parenthesized expression would need to be on one line, but that seems like it would still be useful. Future directions could include interpolations of string literals or String instances.