SE-0354 (Second Review): Regex Literals

At this point, we have six credible suggestions about how to handle multiline regexes. (NB: All of the below use the #/…/# syntax for any regex literals containing newlines, so this is not a delimiter question!)

It’s probably useful to summarize them, since the discussion has become quite tangled:

  1. Strip all unescaped whitespace (traditional extended mode), and emit warnings about unescaped spaces that look suspicious

    Advantage: Allows traditional extended regexes, formatted for readability

    Disadvantage: May be confusing when users first encounter it; requires verbose manual escaping of spaces and/or a flag to disable the warnings; the “hello world” footgun still exists despite the warning

  2. Remove leading and trailing whitespace (and comments) from each line

    Advantage: Somewhat intuitive behavior

    Disadvantage: Harms the ability to add internal whitespace for readability

  3. Don't allow regex literals to span multiple lines at all; use the regex DSL instead for formatting and commenting long regexes

    Advantage: Encourages people to use the regex builder DSL, which has numerous readability advantages and requires no confusing new rules about whitespace

    Disadvantage: Forces people to use the regex builder DSL, which is more verbose and in some cases clumsier, and (currently) discourages named captures

  4. Use the DSL for formatting long regexes, as in 3, but allow multiline regex literals and treat newlines + whitespace as significant

    Advantage: There is currently no other proposed facility for preserving literal newlines in a regex, which can be useful for matching large chunks of formatted text

    Disadvantage: Interaction with surrounding code gets messy. (How does it handle indentation, for example? Is the rule the same as multiline strings? What are the rules for a bare leading or trailing newline? Is all this really better than explicit \n? etc.)

  5. Combine 1+4: multiline regexes are literal by default (4), but some extra syntax enables extended mode where all unescaped whitespace is ignored (1)

    Advantages: Covers all the bases, more or less

    Disadvantages: Maximally confusing, may not actually carry its weight

  6. Use #///…///# as a second, separate delimiter to enable extended mode, and either (6a) disallow multiline #/…/# or (6b) allow multiline #/…/# and have it treat whitespace as significant

    Advantages: Might mitigate the “hello world” footgun, since it’s slightly less easy to accidentally enable, and the delimiter change could help signify that the meaning of whitespace changes

    Disadvantages: May be excessive and unnecessary. Option 6b poses all the problems of Option 4 above.
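
For anyone who hasn’t hit the “hello world” footgun in practice, here is a rough sketch of how the whitespace rules in options 1 and 2 differ. It does not use the proposed literal syntax at all: it leans on Foundation’s existing NSRegularExpression, with its .allowCommentsAndWhitespace option standing in for extended mode. The helpers `matches` and `strippingLineEdges` are hypothetical names made up for this illustration, and the option 2 helper only trims whitespace (it does not strip comments).

```swift
import Foundation

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
func matches(_ pattern: String, in text: String,
             options: NSRegularExpression.Options = []) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Hypothetical sketch of option 2: trim leading and trailing whitespace
// from each line, then join the lines. Comment stripping is left out.
func strippingLineEdges(_ pattern: String) -> String {
    pattern
        .split(separator: "\n", omittingEmptySubsequences: false)
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .joined()
}

let text = "hello world"

// Whitespace is significant: the space is part of the pattern.
let literal = matches("hello world", in: text)

// Option 1 (extended mode): the unescaped space is dropped, so the
// pattern is effectively "helloworld".
let extended = matches("hello world", in: text,
                       options: .allowCommentsAndWhitespace)

// Option 1 with the space escaped by hand.
let escaped = matches("hello\\ world", in: text,
                      options: .allowCommentsAndWhitespace)

// Option 2: only the edges of each line are trimmed, so the interior
// space survives.
let trimmed = strippingLineEdges("   hello world   \n   \\d+   ")
let option2 = matches(trimmed, in: "hello world42")

print(literal, extended, escaped, trimmed, option2)
```

The point is just that the same pattern text silently stops matching under option 1 unless the space is escaped, while option 2 keeps interior spaces but gives up the ability to add them purely for readability.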
