SE-0354 (Second Review): Regex Literals

Michael_Ilseman · May 17, 2022, 4:41pm

Paul_Cantrell:

…into a builder DSL expression with a similar spirit of formatting:

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format, e.g. "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?: (?!\s\s) . )+/, as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Is that compelling enough to dispense with extended mode altogether? I’m not sure.

Without extended mode, whitespace inside a literal is matched verbatim, so you'd need to remove the whitespace inside those regexes. You'd have:

let kind = Reference(Substring.self)
let date = Reference(Substring.self)
let account = Reference(Substring.self)
let amount = Reference(Substring.self)

let regex = Regex {
  // Match a line of the format, e.g. "DEBIT  03/03/2022  Totally Legit Shell Corp  $2,000,000.00"
  let fieldBreak = /\s\s+/
  Capture(/\w+/,               as: kind);    fieldBreak
  Capture(/\S+/,               as: date);    fieldBreak
  Capture(/(?:(?!\s\s).)+/,    as: account); fieldBreak  // Note that account names may contain spaces.
  Capture(/.*/,                as: amount)
}

Are you envisioning the scenario where a multi-line regex treats contained newlines as verbatim content, or would they be outright forbidden?

Similarly, what does a newline in a literal with semantic whitespace entail? Verbatim treatment or error? What about spaces around the next line?

Syntactic options are a little different in practice than semantic options, even though they use the same mechanism in traditional regex syntax. (Regex syntax conflates things that the builders treat orthogonally or via API).

The i option would preferably be spelled as regex.ignoresCase(), which extends well to structured builders. E.g., string literals are verbatim by default, but you could add that modifier to ignore case for just that component. (This assumes we want the API directly on String; otherwise it might be spelled "literal content".regex.ignoresCase().)
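
As a rough illustration of the component-scoped spelling, here is a minimal sketch. It assumes the ignoresCase() modifier as it exists on Regex in Swift 5.7 and later, and that a modifier applied to one component inside a builder affects only that component. The names and input strings are made up for the example.

```swift
import RegexBuilder

// Hypothetical example: only the second component ignores case.
let tagged = Regex {
    "kind: "                          // string literal, matched verbatim
    Regex { "debit" }.ignoresCase()   // case-insensitive for this component only
}

// The literal prefix stays case-sensitive; the wrapped component does not.
let matchesUpperValue = "kind: DEBIT".wholeMatch(of: tagged) != nil
let matchesUpperPrefix = "KIND: DEBIT".wholeMatch(of: tagged) != nil
```

Under those assumptions, the first check should succeed and the second should fail, since the prefix never opted in to case insensitivity.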

Ignoring whitespace could be a modifier, but that implies a semantic change. E.g., it seems like /abc/.ignoringWhitespace() intends to match the input "a b\r\nc".
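
To make the syntactic versus semantic distinction concrete, here is a small sketch using the traditional (?x) extended-syntax flag through the run-time Regex(_:) initializer, assuming a Swift 5.7 or later toolchain. The flag changes how the pattern text is parsed, not which inputs match.

```swift
// (?x) makes whitespace in the *pattern* non-semantic.
// The input is still matched exactly as before.
let spaced = try! Regex("(?x) a b c")

// Same language as the pattern "abc": whitespace in the pattern was dropped.
let matchesCompactInput = "abc".wholeMatch(of: spaced) != nil

// Whitespace in the *input* is not skipped, so this is not a match.
let matchesSpacedInput = "a b c".wholeMatch(of: spaced) != nil
```

A modifier that made the second check succeed would be a semantic option, which is a different feature from extended syntax.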

This is in [Pitch] Unicode for String Processing

@nnnnnnnn any update or thoughts here?