I thought I would give some more analysis of the different regions in the design space.
We have a 2-dimensional (with some wrinkles) design space and we need to fit in /.../ and #/.../#. To illustrate this space, I'll use the re'...' syntax from the alternatives considered. This is not an attempt to re-litigate the core team's aesthetic preferences. I'm using the alternative because it was designed to treat these design dimensions as orthogonal, but I argue that desirability is not equally distributed across this plane.
Here re is normal syntax and rx is extended; ' is for single-line and ''' is for multi-line.
/// Region 1
re'whitespace is significant'
/// Region 2
re'''
whitespace is still significant
but leading and trailing are trimmed
'''
re'''
whitespace is still significant
\ but leading and trailing are trimmed
'''
/// Region 3
rx'''
whitespace \s isn't \s significant \s (unless\ escaped)
'''
/// Region 4
rx'whitespace \s isn't \s significant \s (unless\ escaped)'
Region 1: Semantic whitespace in single-line literals
This is a highly desirable region to support as it maximizes compatibility and familiarity.
This is the behavior of / as proposed. The no-leading-whitespace rule makes it particularly difficult to have non-semantic whitespace with the bare / delimiter.
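As a minimal sketch of region 1, the snippet below shows whitespace inside a single-line literal being matched verbatim. It uses #/.../# on a single line rather than bare /, on the assumption that the bare delimiter may need to be enabled separately depending on the toolchain; the names greeting, withSpace, and withoutSpace are purely illustrative.

```swift
// Single-line literal: the space between "hello" and "world" is part of the pattern.
let greeting = #/hello world/#

let withSpace = "say hello world".firstMatch(of: greeting)
let withoutSpace = "say helloworld".firstMatch(of: greeting)

print(withSpace != nil)     // true
print(withoutSpace != nil)  // false
```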
Region 2: Semantic whitespace in multi-line literals
Initially this point looks promising given intuition from string literals and this specific example being a long run of verbatim content. But the precise meaning is not clear: should the newlines be preserved as verbatim content like string literals?
Traditionally, a newline sequence encoded into a regex would be treated verbatim and match that exact sequence. This includes any byte sequence that would be a newline within the regex literal, e.g. a CR-LF verbatim matches a CR-LF in the input. And that's fine for run-time string content. But when it comes time to embed a literal in the host language (Swift), the host language handles this structure.
Keeping the newlines as verbatim content but trimming seems like it would be surprising. Dropping the newlines and trimming diverges from string literals but even that can be surprising. In the example shown, the space separating words has to be added/escaped because of where the line break is. If the escape was instead at the end of the line, would that restore a verbatim newline?
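For comparison, here is how Swift's existing multi-line string literals answer these questions, which is the intuition a region 2 regex literal would be measured against. This is only a sketch of string literal behavior, not of any proposed regex behavior; the names preserved and joined are illustrative.

```swift
// Newlines between lines are kept verbatim; indentation matching the
// closing delimiter is stripped, as are the first and last line breaks.
let preserved = """
    whitespace is still significant
    but leading and trailing are trimmed
    """

// A backslash at the end of a line removes that line break instead of
// restoring it, so the two lines are joined.
let joined = """
    whitespace is still significant \
    but leading and trailing are trimmed
    """

print(preserved.contains("\n"))  // true
print(joined.contains("\n"))     // false
```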
There are a lot of details we could work through, but this region does not seem all that desirable to land upon. It's also entirely unprecedented AFAIK, which doesn't necessarily argue against doing it, but lends some credence to the argument that this isn't a particularly pragmatic or useful region.
For actual regexes, long runs of verbatim content are fairly rare, and overall the balance tilts towards non-semantic whitespace being more helpful than confusing. Thus, we are proposing not to target this region.
Region 3: Non-semantic whitespace in multi-line literals
This is a highly desirable region and it's broadly precedented by other languages' extended or multi-line literal approaches. It splits a regex across multiple lines, ignores the newlines contained, and turns on non-semantic whitespace.
This is what's proposed by a #/ followed by a newline. For example, to quickly capture a couple of portions of the transaction used in the overview proposal's example:
// CREDIT 03/01/2022 Payroll from employer $200.23
let regex = #/
(?<date> \d{2} / \d{2} / \d{4})
(?<middle> \P{currencySymbol}+)
(?<currency> \p{currencySymbol})
/#
// Regex<(Substring, date: Substring, middle: Substring, currency: Substring)>
(Note that in this use case I'm not using a strongly-typed Foundation.Date, which represents an instant in time and thus requires a priori knowledge of locale and/or timezone).
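To make the example concrete, here is a rough sketch of applying that regex to the transaction line from the comment. The transaction constant and the printed values are illustrative; the sketch assumes a toolchain where multi-line #/.../# literals and named captures are available.

```swift
let transaction = "CREDIT 03/01/2022 Payroll from employer $200.23"

// Newlines and spaces inside the literal are ignored, so the pattern
// can be laid out one capture per line.
let regex = #/
  (?<date>     \d{2} / \d{2} / \d{4})
  (?<middle>   \P{currencySymbol}+)
  (?<currency> \p{currencySymbol})
/#

if let match = transaction.firstMatch(of: regex) {
    print(match.date)      // 03/01/2022
    print(match.currency)  // $
}
```

Because the whitespace in the pattern is non-semantic, the spaces around the date in the input are picked up by the middle capture rather than by anything written between the groups.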
Some contingent of developers might shy away from this in favor of converting everything to builders, and that's totally fine. Everyone has their own conversion curve between literals and builders. But this is still a very useful and valuable region to support.
Region 4: Non-semantic whitespace in single-line literals
This is an interesting region and is commonly supported by other languages' single-line literals. It loses the instant familiarity and compatibility of region 1. This is the default for some languages like Raku and can aid in separating delimiter noise from regex content.
This is currently supported explicitly through the use of (?x), but note that syntactic options pertain to the interior regex syntax and wouldn't affect things like how delimiters are parsed. If we were to adopt a no-trailing-whitespace rule, then the following would hold:
/(?x) non semantic whitespace / // Invalid
#/(?x) non semantic whitespace /# // Valid
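As a small sketch of the (?x) form, the snippet below compares a single-line literal that opts into non-semantic whitespace with its compact equivalent. The names spaced and compact and the time-like input are illustrative, and it assumes the extended delimiter accepts whitespace next to the delimiters as described here.

```swift
// (?x) turns on non-semantic whitespace for the rest of the pattern,
// so the spaces around ":" are ignored.
let spaced = #/(?x) \d{2} : \d{2} /#
let compact = #/\d{2}:\d{2}/#

print("12:30".wholeMatch(of: spaced) != nil)    // true
print("12:30".wholeMatch(of: compact) != nil)   // true
print("12 : 30".wholeMatch(of: spaced) != nil)  // false
```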
An alternative could be to enter non-semantic whitespace mode if the #/ delimiter is followed by any whitespace, not just a newline, allowing the following:
#/ non semantic whitespace /#
We're (weakly) arguing against this direction because that seems like a more surprising on-ramp to non-semantic whitespace than restricting it to when the regex is split across multiple lines. A regex split across multiple lines is a much stronger signal that whitespace is handled differently by Swift.
For non-semantic whitespace content that fits on a single line, sources can still use the multi-line variant:
#/
non semantic whitespace
/#