[Pitch] Regular Expression Literals

breathe · October 14, 2021, 6:06pm

Swift didn't choose the familiar $() for string interpolation because using \() is strictly smarter, and the non-standard choice reduces the amount of 'escape character surprise' that readers and writers commonly encounter ... Isn't there something strictly smarter to be done with regular expressions to reduce the amount of escape character soup required when defining them ...? The stated goal of being able to copy/paste regular expressions from other languages found on Stack Overflow could be achieved by a regular expression format translator website or tool ...

In my opinion -- much of the bad reputation that regular expressions have derives from the fact that PCRE-style regexps are strictly stupid, as they require far too much use of escape characters ... You have to know and internalize the entire set of control characters before you can reasonably use PCRE -- both when reading and writing -- because it's impossible to know whether a character you type might be a special regexp control character without looking it up in the list of all special characters ... A regexp syntax that required syntax/delimiters around non-special characters, by contrast, would do a lot to reduce the required usage of escape characters in practice and would make for a more beautiful and readable regexp life. It's also easier to learn how the regular expression mechanism works when it's actively obvious syntactically which characters have special meaning and which are concrete terminals. In my opinion, separating the regexp control characters from the character literals would produce only a small loss of familiarity and increase character count by only a small amount (or not at all in some cases, depending on the patterns being matched).
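
To make the escape soup concrete, here is a minimal sketch. It assumes Foundation's `NSRegularExpression` as a stand-in for a PCRE-style engine (it is not the syntax proposed in the pitch), and the sample string and variable names are made up for illustration:

```swift
import Foundation

// Literal text we want to match exactly (a made-up example string).
let literal = "Total (USD): $3.50+tax"

// Hand-escaped pattern: each of ( ) $ . + needs a backslash to be read literally.
// A raw string (#"..."#) keeps Swift's own string escaping out of the picture.
let pattern = #"Total \(USD\): \$3\.50\+tax"#

let range = NSRange(literal.startIndex..., in: literal)

// With the escapes in place, the pattern matches the literal text.
let escapedRegex = try! NSRegularExpression(pattern: pattern)
let matchedEscaped = escapedRegex.firstMatch(in: literal, range: range) != nil

// Without them, the same characters are read as operators (a group, an anchor,
// a wildcard, a quantifier), so the text no longer matches itself.
let unescapedRegex = try! NSRegularExpression(pattern: literal)
let matchedUnescaped = unescapedRegex.firstMatch(in: literal, range: range) != nil

// Count the backslashes the writer had to get right by hand.
let escapeCount = pattern.filter { $0 == "\\" }.count

print(matchedEscaped, matchedUnescaped, escapeCount)
```

Nothing in the unescaped string hints at which five characters are special; you only find out by knowing the list or by watching the match fail.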

I would hypothesize as well that a large part of the perceived value of 'familiarity' in this domain actually comes from the fact that PCRE specifically is harder to learn than it needs to be ... Regular expressions as a mechanism are super easy to understand, but PCRE is simply difficult for humans to parse due to the lack of non-semantic whitespace and the number of escape characters required in common use. The brain has to enter a super-linear mode and read in a very strict left-to-right fashion, as the syntax actively undermines our normal natural-language character grouping and chunking faculties ...