SE-0354 (Second Review): Regex Literals

Could the parser go even further? If there is a valid interpretation without regex literals, use it. I think this would remove all ambiguities and all source breakage.

func foo(_ x: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: (_: Int, _: Int) -> Int, _ y: (_: Int, _: Int) -> Int) -> [Int] { [] }
func foo(_ x: Regex) {}

// Not regex:          vs.  Regex:
foo(/).reduce(4, /)         foo(#/).reduce(4, /#)
foo(/, /)                   foo(#/, /#)

// Must be regex -  '/' is not a postfix unary operator
foo(/, 4/)

In other words, treat /…/ as syntactic sugar for #/…/# that can only be used when it is unambiguous.

Or would that result in new/bigger problems?

I’d say: Just leading/trailing whitespace.
The hello world example is convincing for me.
(I assume (?xx) would make all whitespace non-semantic)

Maybe:

  • /…/, #/…/# -> all whitespace is semantic
  • Multi-line #/…/#, single-line ##/…/## -> leading/trailing (after comment removal) whitespace non-semantic
  • Multi-line ##/…/##, single-line ###/…/### -> all whitespace non-semantic

Parsing happens before other parts of the compilation pipeline, so does not have access to semantic information like the types of function arguments (and other similar things – a notable example being whether or not the expression is inside a result builder).

Factoring in that kind of thing would have far-reaching implications for things like compilation time, and the ability for non-Swift compilers to parse Swift (including potentially factoring out a componentized Swift parser from the Swift compiler).

Diagnostics for failed parses can be produced with the help of that information, though, so fixits can benefit from it.

This is what is proposed and it's unlikely that you'd commonly want more than one #. It's similar to "raw" strings, though perhaps even more rare.

Note that case insensitivity is a semantic option and not a syntactic one, which is why it is primarily expressed via API, e.g. /abc/.ignoresCase(). We do support the regex syntax for enabling and disabling it. It's possible to argue that any semantic option could/should be set or unset by different literal syntax, but it is a little odd and I'm not aware of much precedent.
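
A minimal sketch of the two spellings, assuming the `ignoresCase()` method as it exists in the Swift 5.7 standard library. The `#/…/#` delimiters are used only so the snippet does not depend on bare-slash literals being enabled; the constant names are illustrative.

```swift
// API spelling: the option is applied to an existing regex value.
let viaAPI = #/abc/#.ignoresCase()

// Regex-syntax spelling: the option is set inside the pattern itself.
let viaSyntax = #/(?i)abc/#

let input = "ABC"
let apiMatched = input.wholeMatch(of: viaAPI) != nil
let syntaxMatched = input.wholeMatch(of: viaSyntax) != nil

// Without either option the match is case sensitive.
let plainMatched = input.wholeMatch(of: #/abc/#) != nil
```

Both option-setting forms should match "ABC", while the plain literal should not.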

Regarding multi-line regexes, traditionally, a newline sequence encoded into a regex would be treated verbatim and match that exact sequence. This is rarely what is actually desired; and if you're splitting a regex across multiple lines for organization or clarity purposes, you nearly always want non-semantic whitespace as well.

The area in the Venn diagram where you want to split a regex across lines, ignore the newlines and surrounding spaces, but keep semantic whitespace within a line for long runs of verbatim content is very small. I'm not aware of any precedent (which doesn't argue we shouldn't do it, but does question how high that demand is).

Is the .ignoresCase() an API that's actually being proposed somewhere? I couldn't find it from a quick search but also might have missed one of the proposals...

You'd need to remove the whitespace inside those regexes, so you'd have:

import RegexBuilder

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format e.g "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?:(?!\s\s).)+/,    as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Are you envisioning the scenario where a multi-line regex treats contained newlines as verbatim content, or would they be outright forbidden?

Similarly, what does a newline in a literal with semantic whitespace entail? Verbatim treatment or error? What about spaces around the next line?

Syntactic options are a little different in practice than semantic options, even though they use the same mechanism in traditional regex syntax. (Regex syntax conflates things that the builders treat orthogonally or via API).

The i would preferably be spelled as regex.ignoresCase(), which extends well to structured builders. E.g., string literals are verbatim by default, but you could add that to ignore case for just that component. (Assuming we want the API directly on String, otherwise it might be spelled "literal content".regex.ignoresCase())

Ignoring whitespace could be a modifier, but that implies a semantic change. E.g., it seems like /abc/.ignoringWhitespace() intends to match the input "a b\r\nc".
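
To illustrate why ignoring whitespace is a property of how the pattern is read rather than of how the input is matched, here is a small sketch using the inline `(?x)` option. The names are illustrative, and the extended delimiters are used only to avoid depending on bare-slash literals.

```swift
// Whitespace in the pattern is part of the pattern.
let verbatim = #/a b c/#

// With extended syntax enabled, whitespace in the pattern is skipped
// while parsing, so this is the same regex as `abc`.
let extended = #/(?x) a b c/#

let verbatimMatchesSpaced = "a b c".wholeMatch(of: verbatim) != nil
let verbatimMatchesTight = "abc".wholeMatch(of: verbatim) != nil
let extendedMatchesTight = "abc".wholeMatch(of: extended) != nil

// Extended syntax does not make the regex skip whitespace in the input.
let extendedMatchesSpaced = "a b c".wholeMatch(of: extended) != nil
```

The last constant is the important one: extended syntax changes what the pattern means, not which inputs get their whitespace skipped, which is why a modifier named like ignoringWhitespace() would suggest the wrong thing.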

This is in [Pitch] Unicode for String Processing

@nnnnnnnn any update or thoughts here?

Oh, thanks, missed that. Fixed the OP.

Is there a verbatim context in this proposal? I thought that #/…/# as proposed either (1) ignores whitespace or (2) has to be on a single line. Apologies if I missed that….

I mean what should the compiler behavior be for a #/.../# literal that has a newline inside, in the context of your scenario where there is no multi-line/extended mode? Would the compiler reject it or would the newline be treated as verbatim content of the regex?

In my scenario, the compiler would reject it. (Again, musing, not necessarily advocating. I'm on the fence myself.)

Someone may have brought this up already. But would the new parsing rules still misparse someFunction(/MyEnum.case1, /MyEnum.case2)? (casepaths syntax btw)

Yes. I think it would because it matches all four of these conditions:

The least intrusive fix is probably to put parentheses around the first expression:

someFunction((/MyEnum.case1), /MyEnum.case2)

This causes the potential regex literal to fail the "unbalanced )" condition, thus causing it not to be parsed as a regex.

Another choice would be to split the arguments across multiple lines, to make it fail the "closing / on the same line" condition:

someFunction(
    /MyEnum.case1,
    /MyEnum.case2
)

This is a key point — the Unicode proposal includes API for the options that have semantic effects (e.g. ignoresCase(), anchorsMatchNewlines()), but not syntactic effects (e.g. x/xx for extended syntax or n for only capturing named groups). Those syntactic options feed back into things like parsing success or the compile-time output type inference, so method calls that will be evaluated at runtime aren't a good match.
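
As a rough illustration of the compile-time side of this, the sketch below shows that the capture structure written in a literal determines the regex's static output type. The pattern and names are made up for the example. It does not use the n option itself, only the general point that anything changing which groups capture would also have to change this type, which a method call evaluated at runtime cannot do.

```swift
// One unnamed and one named capture in the literal...
let entry = #/(\d+)-(?<name>\w+)/#

// ...give this output type, checked by the compiler.
let typed: Regex<(Substring, Substring, name: Substring)> = entry

let match = "42-swift".wholeMatch(of: typed)
let number = match?.output.1
let name = match?.output.name
```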

I think you're likely right about the reason — we can add a note about this to the proposal.

We don't allow semantic whitespace inside a multiline regex because there ends up being ambiguity over whether the host language or the embedded language should be responsible for line breaks. We could enable semantic whitespace within option-setting groups, like this:

let r = #/
  (?-x:hello world)
  /#
// matches "hello world"

The entirety of the parenthesized expression would need to be on one line, but that seems like it would still be useful. Future directions could include interpolations of string literals or String instances.

If we aren't supporting Perl-like regex modifiers after the literal (i.e., /foo/i), then are there any situations where a valid regex literal and an identifier character would be juxtaposed like that? I can't think of any but I could very well be missing something. But if not, could the lexer detect that and reject it as a regex? I think we could add digits as regex non-followers as well.

I just thought of another weird one, extending the above idea to "what juxtaposed characters might we want to prevent a regex literal". I don't think it's a big deal and parentheses can be used to disambiguate this, but just to continue this line of thinking:

foo(/MyEnum.case1, /(bar))

Does this get parsed as a regex literal being called via a callAsFunction extension, or as two applications of prefix /?

I think you're right. It would be worth studying that as a possible fifth condition, as that would catch a lot of common cases, including most of the CasePath uses.

I'm pretty sure it would parse as the former.

I'm not entirely sure I'm comfortable with the use of multiline regex literals. It feels like many expect them to have a magic "only match the whitespace I want it to match" heuristic, but deciding which whitespace to match is difficult, and any useful system is likely to be hard to explain. If we have to have them at all, then I think I'd prefer to make them work exactly like multiline strings (with the same rules for removing indentation) and leave the remaining whitespace as significant.
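
For reference, this is how the existing multiline string rules behave, which is the model being suggested here. It is a plain string example, not a regex literal.

```swift
// Indentation up to the column of the closing delimiter is stripped;
// everything else, including the line break and the extra two spaces
// before "cd", is kept.
let text = """
    ab
      cd
    """

let expected = "ab\n  cd"
let sameContent = text == expected
```

Under the suggestion above, a multiline regex literal laid out the same way would match a line break and two spaces between "ab" and "cd" rather than ignoring them.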

I think in practice I'd most likely use the Regex builder DSL containing a bunch of single-line regex literals rather than a multiline literal with extended syntax. I can't see a good reason to use the multiline literal syntax unless I really want newlines to be significant. (And in the odd case where I do want extended syntax, I can always force it with (?x) anyway.)

In my opinion, the growing list of parsing rules for this simple literal type is starting to become a bit symptomatic of larger problems with the associated syntax changes. Straying away from the single forward slash feels like it would simplify parsing quite a bit, meaning using the #regex(stuff) syntax or just the extended syntax.

Here’s another example that could be a bit harder to fix:

someFunc(/MyEnum.someCase, /) (the second slash is being passed as the division operator)

Very similarly to another recent example, this is another one that can be fixed either by putting the first argument in parentheses or by separating the two arguments onto separate lines.

Either like this:

someFunc((/MyEnum.someCase), /)

Or like this:

someFunc(
    /MyEnum.someCase,
    /
)

I know it can be fixed by writing the code differently, but that doesn't change the fact that it's a breaking change. And it's a relatively easily avoidable breaking change with little gain, in my opinion.

I understand that I'm probably fighting a losing battle because there are a lot of people who want Swift's regex literals to look exactly like their counterparts in other languages. However, I do have one more concern.

Almost all syntax highlighters used on the web are regex-based, and I find that they have enough trouble highlighting Swift as-is; regex literals with all their associated rules are surely going to make that worse. This is in contrast to string literals, which can be parsed relatively easily because " isn't used for anything else (whereas the / delimiter is also the division operator).

Copying in my reply from the other thread, as I think it helps clarify and establish some reasoning for further discussion.


With respect to non-semantic whitespace, the literal proposal presents these 3 cases:

// 1
/whitespace is significant/

// 2
#/whitespace is significant/#

// 3
#/
  whitespace is **not** significant     # nor are comments
/#

Getting behavior such as in #3 is highly desirable, via some delimiter-enabled way. We couldn't find a better one than #/ followed by a newline. The alternative #///'s has some issues with comments, and /// isn't workable AFAICT, but @hamishknight can you comment further?

An argument could be made that #2 should be non-semantic as well, as "extended delimiter" could mean "extended syntax" (and we'd likely error out on a line-ending comment). The downside is that changing a /hello world/ to a #/hello world/# would change meaning of whitespace and that would be weird (as you point out). I'd (weakly) recommend against this direction.

An argument could also be made that all regex literals, including #1, have non-semantic whitespace. That does get weird with the no-leading-space lexing rule (which IIUC we could restrict to start-of-line if we needed to). It's also surprising that /hello world/ doesn't match "hello world", without the newline that #3 has.

To clarify, the below are all compilation errors. The multi-line story only happens if the #/ is immediately followed by a newline:

// Error
/
  abcd
/

// Error
/ ab
  cd
/ 

// Error
#/ ab
   cd
/#

// Ok
#/
  ab
  cd
/#
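
A small runnable sketch of the "Ok" case, assuming the multi-line behavior described in this thread (whitespace and `#` comments are non-semantic once `#/` is immediately followed by a newline). The names are illustrative.

```swift
// Opening delimiter followed directly by a newline: extended syntax,
// so the line breaks, indentation and trailing comment are not matched.
let multiline = #/
  ab   # first half
  cd
/#

let matchesJoined = "abcd".wholeMatch(of: multiline) != nil
let matchesWithNewline = "ab\ncd".wholeMatch(of: multiline) != nil
```

If the behavior is as described, only the first check succeeds, because the line break between "ab" and "cd" belongs to the source layout and not to the pattern.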