[Pitch] Regular Expression Literals

Regular Expression Literals

  • Authors: Hamish Knight, Michael Ilseman

Introduction

We propose to introduce a first-class regular expression literal into the language that can take advantage of library support to offer extensible, powerful, and familiar textual pattern matching.

This is a component of a larger string processing picture. We would like to start a focused discussion surrounding our approach to the literal itself, while acknowledging that evaluating the utility of the literal will ultimately depend on the whole picture (e.g. supporting API). To aid this focused discussion, details such as the representation of captures in the type system, semantic details, extensions to lexing/parsing, additional API, etc., are out of scope of this pitch and thread. Feel free to continue discussion of anything related in the overview thread.

Motivation

Regular expressions are a ubiquitous, familiar, and concise syntax for matching and extracting text that satisfies a particular pattern. Syntactically, a regex literal in Swift should:

  • Support a syntax familiar to developers who have learned to use regular expressions in other tools and languages
  • Allow reuse of many regular expressions not specifically designed for Swift (e.g. from Stack Overflow or popular programming books)
  • Allow libraries to define custom types that can be constructed with regex literals, much like string literals
  • Diagnose at compile time if a regex literal uses capabilities that aren't allowed by the type's regex dialect

Further motivation, examples, and discussion can be found in the overview thread.

Proposed Solution

We propose the introduction of a regular expression literal that supports the PCRE syntax, in addition to new standard library protocols ExpressibleByRegexLiteral and RegexLiteralProtocol that allow for the customization of how the regex literal is interpreted (similar to string interpolation). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to ExpressibleByRegexLiteral will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and @available.

Note: This pitch concerns language syntax and compiler changes alone; it does not state which features the stdlib should support in the initial version or in future versions.

Detailed Design

A regular expression literal will be introduced using / delimiters, within which the compiler will parse PCRE regex syntax:

// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/

The above regex literal will be inferred to be the default regex literal type Regex. Errors in the regex will be diagnosed by the compiler.

Regex here is a stand-in type; further details about the type, such as whether or how it will scale to strongly typed captures, are still under investigation.

How best to diagnose grapheme-semantic concerns is still under investigation and probably best discussed in their corresponding threads. For example, Range<Character> is not countable and ordering is not linguistically meaningful, so validating character class ranges may involve restricting to a semantically-meaningful range (e.g. ASCII). This is best discussed in the (upcoming) character class pitch/thread.

The compiler will then transform the literal into a set of builder calls that may be customized by adopting the ExpressibleByRegexLiteral protocol. Below is a straw-person transformation of this example:

// let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/
let regex = {
  var builder = T.RegexLiteral()

  // __A5 = /([[:alpha:]]\w*)/
  let __A1 = builder.buildCharacterClass_POSIX_alpha()
  let __A2 = builder.buildCharacterClass_w()
  // The * quantifier on \w (builder name assumed by analogy with buildOneOrMore)
  let __A3 = builder.buildZeroOrMore(__A2)
  let __A4 = builder.buildConcatenate(__A1, __A3)
  let __A5 = builder.buildCaptureGroup(__A4)

  // __B1 = / = /
  let __B1 = builder.buildLiteral(" = ")

  // __C3 = /([0-9A-F]+)/
  let __C1 = builder.buildCustomCharacterClass(["0"..."9", "A"..."F"])
  let __C2 = builder.buildOneOrMore(__C1)
  let __C3 = builder.buildCaptureGroup(__C2)

  let __D1 = builder.buildConcatenate(__A5, __B1, __C3)
  return T(regexLiteral: builder.finalize(__D1))
}()

In this formulation, the compiler fully parses the regex literal, calling mutating methods on a builder which constructs an AST. Here, the compiler recognizes syntax such as ranges and classifies metacharacters (buildCharacterClass_w()). Alternate formulations could involve less reasoning (buildMetacharacter_w), or more (buildCharacterClass_word). We'd like community feedback on this approach.
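To make the formulation above concrete, here is a minimal sketch of a builder that constructs an AST from the straw-person calls, written by hand because no compiler support exists. The names SketchNode and SketchBuilder, the shape of the AST, and the buildZeroOrMore method (for the * quantifier on \w) are all assumptions made for illustration; the pitch does not specify any of them.

```swift
// Hypothetical AST for illustration; the pitch does not define one.
indirect enum SketchNode: Equatable {
    case posixAlpha
    case wordCharacter
    case literal(String)
    case customClass([ClosedRange<Character>])
    case zeroOrMore(SketchNode)
    case oneOrMore(SketchNode)
    case capture(SketchNode)
    case concatenate([SketchNode])
}

// Hypothetical builder; method names mirror the straw-person transformation.
struct SketchBuilder {
    // Number of capture groups seen so far.
    private(set) var captureCount = 0

    func buildCharacterClass_POSIX_alpha() -> SketchNode { .posixAlpha }
    func buildCharacterClass_w() -> SketchNode { .wordCharacter }
    func buildLiteral(_ text: String) -> SketchNode { .literal(text) }
    func buildCustomCharacterClass(_ ranges: [ClosedRange<Character>]) -> SketchNode {
        .customClass(ranges)
    }
    func buildZeroOrMore(_ node: SketchNode) -> SketchNode { .zeroOrMore(node) }
    func buildOneOrMore(_ node: SketchNode) -> SketchNode { .oneOrMore(node) }
    func buildConcatenate(_ nodes: SketchNode...) -> SketchNode { .concatenate(nodes) }

    // Mutating because it records how many captures the literal contains.
    mutating func buildCaptureGroup(_ node: SketchNode) -> SketchNode {
        captureCount += 1
        return .capture(node)
    }
}

// Hand-written equivalent of /([[:alpha:]]\w*) = ([0-9A-F]+)/
var builder = SketchBuilder()
let identifier = builder.buildCaptureGroup(
    builder.buildConcatenate(
        builder.buildCharacterClass_POSIX_alpha(),
        builder.buildZeroOrMore(builder.buildCharacterClass_w())))
let separator = builder.buildLiteral(" = ")
let hexNumber = builder.buildCaptureGroup(
    builder.buildOneOrMore(
        builder.buildCustomCharacterClass(["0"..."9", "A"..."F"])))
let tree = builder.buildConcatenate(identifier, separator, hexNumber)
```

A conforming type could then walk such a tree to compile a matcher, or reject nodes it does not support.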

Additionally, it may make sense for the stdlib to provide a RegexLiteral conformer that just constructs a string to pass off to a string-based library. Such a type might assume all features are supported unless communicated otherwise, and we'd like community feedback on mechanisms to communicate this (e.g. availability).
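A string-constructing builder of the kind described above might look roughly like the sketch below. StringRenderingBuilder and buildZeroOrMore are hypothetical names, and the rendering is deliberately naive: it does not escape metacharacters in literals or add grouping around multi-character operands of a quantifier, both of which a real implementation would need.

```swift
// Hypothetical builder that renders each construct back to pattern text,
// so the result can be handed to a string-based regex engine.
struct StringRenderingBuilder {
    func buildCharacterClass_POSIX_alpha() -> String { "[[:alpha:]]" }
    func buildCharacterClass_w() -> String { "\\w" }
    // Simplification: literal text is passed through without escaping.
    func buildLiteral(_ text: String) -> String { text }
    func buildCustomCharacterClass(_ ranges: [ClosedRange<Character>]) -> String {
        "[" + ranges.map { "\($0.lowerBound)-\($0.upperBound)" }.joined() + "]"
    }
    // Simplification: assumes the operand is a single atom.
    func buildZeroOrMore(_ pattern: String) -> String { pattern + "*" }
    func buildOneOrMore(_ pattern: String) -> String { pattern + "+" }
    func buildCaptureGroup(_ pattern: String) -> String { "(" + pattern + ")" }
    func buildConcatenate(_ parts: String...) -> String { parts.joined() }
}

let renderer = StringRenderingBuilder()
let rendered = renderer.buildConcatenate(
    renderer.buildCaptureGroup(
        renderer.buildConcatenate(
            renderer.buildCharacterClass_POSIX_alpha(),
            renderer.buildZeroOrMore(renderer.buildCharacterClass_w()))),
    renderer.buildLiteral(" = "),
    renderer.buildCaptureGroup(
        renderer.buildOneOrMore(
            renderer.buildCustomCharacterClass(["0"..."9", "A"..."F"]))))
```

Because every construct maps to text, such a type has no natural way to reject unsupported features, which is why the question of how to communicate support matters.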

The ExpressibleByRegexLiteral and RegexLiteralProtocol protocols

New ExpressibleByRegexLiteral and RegexLiteralProtocol protocols will be introduced to the standard library, and will serve a similar purpose to the existing literal protocols ExpressibleByStringInterpolation and StringInterpolationProtocol.

public protocol ExpressibleByRegexLiteral {
  associatedtype RegexLiteral : RegexLiteralProtocol = DefaultRegexLiteral
  init(regexLiteral: RegexLiteral)
}

public protocol RegexLiteralProtocol {
  init()

  // Informal builder requirements for building a regex literal
  // will be specified here.
}

Types conforming to ExpressibleByRegexLiteral will be able to provide a custom type that conforms to RegexLiteralProtocol, which will be used to build the resulting regex value. A default conforming type will be provided by the standard library (DefaultRegexLiteral here).
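As a rough illustration of how a conformance might fit together, the sketch below redeclares the two protocols locally (they are not part of the standard library) and builds a value by hand, standing in for the calls the compiler would synthesize. CaptureCountingLiteral, CaptureSummary, and the parameterless buildCaptureGroup() are hypothetical and exist only for this example.

```swift
// Local stand-ins for the pitched protocols, declared here so the sketch compiles.
protocol RegexLiteralProtocol {
    init()
}

struct DefaultRegexLiteral: RegexLiteralProtocol {
    init() {}
}

protocol ExpressibleByRegexLiteral {
    associatedtype RegexLiteral: RegexLiteralProtocol = DefaultRegexLiteral
    init(regexLiteral: RegexLiteral)
}

// Hypothetical custom builder: it only counts capture groups.
struct CaptureCountingLiteral: RegexLiteralProtocol {
    var captureCount = 0
    init() {}
    mutating func buildCaptureGroup() { captureCount += 1 }
}

// Hypothetical type constructed from the custom builder.
// RegexLiteral is inferred as CaptureCountingLiteral from the initializer.
struct CaptureSummary: ExpressibleByRegexLiteral {
    let captureCount: Int
    init(regexLiteral: CaptureCountingLiteral) {
        captureCount = regexLiteral.captureCount
    }
}

// Hand-written stand-in for what the compiler might synthesize
// for a literal containing two capture groups.
var literal = CaptureSummary.RegexLiteral()
literal.buildCaptureGroup()
literal.buildCaptureGroup()
let summary = CaptureSummary(regexLiteral: literal)
```

The pairing mirrors ExpressibleByStringInterpolation, where the conforming type names the builder and receives the finished builder in its initializer.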

Libraries can extend regex handling logic for their domains. For example, a higher-level library could provide linguistically richer regular expressions by incorporating locale, collation, language dictionaries, and fuzzier matching. Similarly, libraries wrapping different regex engines (e.g. NSRegularExpression) can support custom regex literals.

Opting into certain regex features

We intend for the compiler to completely parse the PCRE syntax. However, types conforming to RegexLiteralProtocol might not be able to handle the full feature set. The compiler will look for corresponding function declarations inside RegexLiteralProtocol and will emit a compilation error if missing. Conforming types can use @available on these function declarations to communicate versioning and add more support in the future.

This approach of lookup combined with availability allows the stdlib to support more features over time.
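One way to picture the lookup-plus-availability idea is the sketch below, where a builder declares a method for a construct it cannot handle and marks it unavailable. LimitedBuilder and buildLookbehind are hypothetical names; a versioned attribute such as @available(macOS 13, *) could be used in the same position to gate a feature on a release instead.

```swift
// Hypothetical builder that supports only a subset of regex constructs.
struct LimitedBuilder {
    func buildLiteral(_ text: String) -> String { text }
    func buildOneOrMore(_ pattern: String) -> String { "(?:" + pattern + ")+" }

    // Declared but unavailable: any use is rejected at compile time
    // with the message below.
    @available(*, unavailable, message: "lookbehind is not supported by this engine")
    func buildLookbehind(_ pattern: String) -> String {
        fatalError("unreachable")
    }
}

let limited = LimitedBuilder()
let supported = limited.buildOneOrMore(limited.buildLiteral("ab"))

// Uncommenting the next line produces a compile-time error, which is the
// kind of diagnostic a literal using an unsupported feature would get:
// let rejected = limited.buildLookbehind("x")
```

A construct with no corresponding declaration at all would fail lookup, while a declared-but-unavailable one lets the type attach an explanatory message.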

Impact of using / as the delimiter

On comment syntax

Single line comments use the syntax //, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden.

While not conflicting with the syntax proposed in this pitch, it's also worth noting that the // comment syntax (in particular documentation comments that use ///) would likely preclude the ability to use /// as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be provided through raw regex literals. Alternatively, it could be inferred from the regex options provided. For example, a regex that uses the multi-line option /(?m)/ could be allowed to span multiple lines.

Multi-line comments use the /* delimiter. As such, a regex literal starting with * wouldn't be parsed. This however isn't a major issue as an unqualified * is already invalid regex syntax. An escaped /\*/ regex literal wouldn't be impacted.

On custom infix operators using the / character

Choosing / as the delimiter means there will be a conflict for infix operators containing / in cases where whitespace isn't used, for example:

x+/y/+z

Should the operators be parsed as +/ and /+ respectively, or should this be parsed as x + /y/ + z?

In this case, things can be disambiguated by the user inserting additional whitespace. We therefore could continue to parse x+/y/+z as a binary operator chain, and require additional whitespace to interpret /y/ as a regex literal.
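The existing behavior can be observed with a small sketch that declares +/ and /+ as custom infix operators. The operator bodies are arbitrary and chosen only so the grouping is visible in the result; they are not from the pitch.

```swift
// Hypothetical custom operators containing the / character.
infix operator +/ : AdditionPrecedence
infix operator /+ : AdditionPrecedence

func +/ (lhs: Int, rhs: Int) -> Int { lhs * 10 + rhs }
func /+ (lhs: Int, rhs: Int) -> Int { lhs * 100 + rhs }

let x = 1
let y = 2
let z = 3

// With no whitespace, the lexer reads the tokens x, +/, y, /+, z,
// so this is the binary operator chain (x +/ y) /+ z.
let chained = x+/y/+z
```

Under the rule suggested above, this code would keep its meaning, and a regex literal would need the spaced-out form to be recognized.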

On custom prefix and postfix operators using the / character

There will also be parsing ambiguity with any user-defined prefix and postfix operators containing the / character. For example, code such as the following poses an issue:

let x = /0; let y = 1/

Should this be considered to be two let bindings, with each initialization expression using prefix and postfix / operators, or is it a single regex literal?

This also extends more generally to prefix and postfix operators containing the / character, e.g:

let x = </<0; let y = 1</<

Is this a regex literal /<0; let y = 1</ with a prefix and postfix < operator applied, or two let bindings each using prefix and postfix </< operators?

There are no easy ways of resolving these ambiguities. A regex literal parsed with / delimiters will therefore likely need to be introduced under a new language version mode, along with a deprecation of prefix and postfix / operators. Some prefix and postfix operators containing / may be disambiguated with parentheses, but we may have to figure out a way to refer to the operator explicitly or deprecate prefix and postfix (but not infix) operators containing /.

On the existing division operator /

The existing division operator / raises fewer concerns than the above cases. However, there are cases that currently parse as a sequence of binary operations where the user might be expecting a regex literal.

For example:

extension Int {
  static func foo() -> Int { 0 }
}

let x = 0
/ 1 / .foo()

Today, this is parsed as a single binary operator chain 0 / 1 / .foo(), with .foo() becoming an argument to the / operator. This is because while Swift does have some parser behavior that is affected by newlines, generally newlines are treated as whitespace, and expressions therefore may span multiple lines. However the user may well be expecting the second line to be parsed as a regex literal.
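A runnable variant of this example shows the current parse. It uses nonzero values and a hypothetical helper two() so that the division does not trap at run time.

```swift
extension Int {
    // Hypothetical helper, analogous to foo() in the example above.
    static func two() -> Int { 2 }
}

// The second line continues the expression, so this is 8 / 2 / Int.two().
let quotient: Int = 8
/ 2 / .two()
```

The implicit member expression .two() is resolved against Int because it is the right-hand operand of the second division.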

This is also potentially an issue for result builders, for example:

SomeBuilder {
  x
  / y /
  z
}

Today this is parsed as SomeBuilder { x / y / z }, however it's likely the user was expecting this to become a result builder with 3 elements, the second of which being a regex literal.
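The single-element parse can be checked with a small result builder that just collects its elements. ElementCollector and collect are hypothetical names used only for this sketch.

```swift
// Hypothetical result builder that gathers Int elements into an array.
@resultBuilder
enum ElementCollector {
    static func buildBlock(_ elements: Int...) -> [Int] { elements }
}

func collect(@ElementCollector _ body: () -> [Int]) -> [Int] { body() }

let x = 12
let y = 3
let z = 2

// The three lines form one expression, x / y / z, so the builder
// receives a single element rather than three.
let elements = collect {
    x
    / y /
    z
}
```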

There is currently no source compatibility impact as both cases will continue to parse as binary operations. The user may insert a ; on the prior line to get the desired regex literal parsing. However, this may not be sufficient, and we may need to change parsing rules (under a version check) to favor parsing regex literals in these cases. We'd like to discuss this further with the community.

It's worth noting that this is similar to an ambiguity that already exists today with trailing closures, for example:

SomeBuilder {
  SomeType()
  { print("hello") }
  AnotherType()
}

{ print("hello") } will be parsed as a trailing closure to SomeType() rather than as a separate element to the result builder.

It can also currently arise with leading dot syntax in a result builder, e.g:

SomeBuilder {
  SomeType()
  .member
}

.member will be parsed as a member access on SomeType() rather than as a separate element that may have its base type inferred by the parameter of a buildExpression method on the result builder.

Future Directions

Typed captures

Typed captures would statically represent how many captures, and of what kind, are present in a regex literal. They could produce a Substring for a regular capture, Substring? for a zero-or-one capture, and Array<Substring> (or a lazy collection) for a zero(or one)-or-more capture. These are worth exploring, especially in the context of the start of variadic generics support, but we'd like to keep this pitch and discussion focused on the details presented.

Other regex literals

Multi-line extensions to regex literals are considered future work. Generally, we'd like to encourage refactoring into Pattern when the regex gets to that degree of complexity.

User-specified choice of quote delimiters is considered future work. A related approach to this could be a "raw" regex literal analogous to raw strings. For example (total strawperson), an approach where n #s before the opening delimiter would require n #s after the closing delimiter, as well as requiring n-1 #s to access metacharacters.

// All of the below are trying to match a path like "/tmp/foo/bar/File.app/file.txt"

/\/tmp\/.*\/File\.app\/file\.txt/
#//tmp/.*/File\.app/file\.txt/#
##//tmp/#.#*/File.app/file.txt/##
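For comparison, the existing raw string literal rules that this strawperson borrows from can be sketched as follows. This only demonstrates today's string literals; it is not a claim about how raw regex literals would behave.

```swift
// An ordinary string literal needs every backslash doubled.
let escapedPattern = "\\/tmp\\/.*\\/File\\.app\\/file\\.txt"

// With one # on each side, backslashes are taken literally.
let rawPattern = #"\/tmp\/.*\/File\.app\/file\.txt"#

// Adding more #s lets the body contain the "# sequence itself.
let doubleHashed = ##"contains "# inside"##
```

The first two constants hold the same text, which is the property a raw regex literal would presumably aim for with / in place of the backslash.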

"Swiftier" literals, such as those with non-semantic whitespace (e.g. Raku's), are future work. We'd want to strongly consider using a different backing technology for Swifty matching literals, such as PEGs.

Fully-custom literal support, that is, literals whose bodies are not parsed and for which no default type is available, is orthogonal to this work. It would require support for compilation-time Swift libraries in addition to Swift APIs for the compiler and type system.

Further extension to Swift language constructs

Other language constructs, such as raw-valued enums, might benefit from further regex enhancements.

enum CalculatorToken: Regex {
  case wholeNumber = /\d+/
  case identifier = /\w+/
  case symbol = /\p{Math}/
  ...
}

As mentioned in the overview, general-purpose extensions to Swift (syntactic) pattern matching could benefit regex:

func parseField(_ field: String) -> ParsedField {
  switch field {
  case let text <- /#\s?(.*)/:
    return .comment(text)
  case let (l, u) <- /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/:
    return .scalars(Unicode.Scalar(hex: l) ... Unicode.Scalar(hex: u ?? l))
  case let prop <- GraphemeBreakProperty.init:
    return .property(prop)
  }
}

Other semantic details

Further details about the semantics of regex literals, such as what definition we give to character classes, the initial supported feature set, and how to switch between grapheme-semantic and scalar-semantic usage, are still under investigation and outside the scope of this discussion.

Alternatives considered

Using a different delimiter to /

As explored above, using / as the delimiter has the potential to conflict with existing operators using that character, and may necessitate:

  • Changing of parsing rules around chained / over multiple lines
  • Deprecating prefix and postfix operators containing the / character
  • Requiring additional whitespace to disambiguate from infix operators containing /
  • Requiring a new language version mode to parse the literal with / delimiters

However one of the main goals of this pitch is to introduce a familiar syntax for regular expression literals, which has been the motivation behind choices such as using the PCRE regex syntax. Given the fact that / is an existing term of art for regular expressions, we feel that if the aforementioned parsing issues can be solved in a satisfactory manner, we should prefer it as the delimiter.

Reusing string literal syntax

Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an ExpressibleByRegexLiteral conforming type.

let regex: Regex = "([[:alpha:]]\w*) = ([0-9A-F]+)"

However we decided against this because:

  • We would not be able to easily apply custom syntax highlighting for the regex syntax
  • It would require an ExpressibleByRegexLiteral contextual type to be treated as a regex, otherwise it would be defaulted to String, which may be undesired
  • In an overloaded context it may be ambiguous whether a string literal is meant to be interpreted as a literal string or regex
  • Regex escape sequences aren't currently compatible with string literal escape sequence rules, e.g \w is currently illegal in a string literal
  • It wouldn't be compatible with other string literal features such as interpolations
35 Likes

Overall, this seems like an awesome addition to Swift, and I'm glad to see it! :+1:

The impacts seem quite large, when there seems to be a reasonable option that satisfies both term-of-art familiarity and Swift familiarity.

I'm not 100% sure why #/ would have to be an "other" literal syntax and it couldn't just be the literal syntax. As mentioned in Other regex literals there's a parallel to raw strings AND it still has the familiar slashes.

Copying from another language would simply mean typing ## and pasting between the hashes, and it would seem that a lot of the impacts to the language would be mitigated heavily by using #/. Please correct me if I'm wrong.

By opting for #/, would it not eliminate a large swath of parsing issues to solve?

In the end, what is the material/tangible benefit of using just /? The only reason mentioned is satisfying an existing term of art. Are there specific workflows requiring precisely the / delimiter? I'm just curious why this point is so heavily preferred.

14 Likes

I am strongly opposed.

Regular expression literals, in the standard forms that exist today, are antithetical to Swift’s goal of clarity at the point of use. They form a dense jumble of arcane symbols all mashed together.

Introducing regex literals into Swift would be actively harmful to the language, by encouraging programmers to write code that hinders readability, whose meaning is non-obvious, and which cannot be understood at a glance.

Rather than “ubiquitous, familiar, and concise”, as the pitch claims, I see regex literals instead as being obscure, esoteric, and inscrutable. They are the specialized jargon of a particular problem domain, concocted out of a desire to prioritize character counts over comprehensibility.

While that goal may perhaps have served a purpose at the time they were invented last century, it is no longer a useful aim and we should not reshape a modern language to accommodate a feature so thoroughly incompatible with present-day design standards.

• • •

We should instead focus on designing a rich, powerful, first-class pattern-matching and string parsing system for Swift, built to prioritize self-documenting readability, that looks and feels like ordinary Swift code at the point of use.

Then, after we have built the infrastructure and created a proper solution which enables programmers to parse and match strings in native Swift, if there is still a reason to consider introducing regex literals, we may revisit the question.

But until then—until we have seen what we can do without sacrificing readability—I see no reason to compromise the principles of the language. The costs to code comprehension are high, and the benefits are not commensurate.

39 Likes

Swift didn't choose the familiar $() for string interpolation because using \() is strictly smarter and the non-standard choice reduces the amount of 'escape character surprise' that readers and writers will commonly encounter ... isn't there something strictly smarter to be done with regular expressions to reduce the amount of escape character soup required when defining regular expressions ...? The stated goal to be able to copy/paste regular expressions from other languages found on stackoverflow could be achieved by a regular expression format translator website or tool ...

In my opinion -- much of the bad reputation that regular expressions have derives from the fact that PCRE style regexps are strictly stupid as they require far too much use of escape characters ... You have to know and internalize the entire set of control characters before you can reasonably use PCRE -- both when reading and writing -- it's impossible to know whether a character you type might be a special regexp control character without looking it up in the list of all special characters ... A regexp syntax that required syntax/delimiters around non-special characters by contrast would do a lot to reduce the required usage of escape characters in practice and would make for a more beautiful and readable regexp life. It's also easier to learn how the regular expression mechanism works when it's actively obvious syntactically which characters have special meaning and which are concrete terminals. In my opinion, separating the regexp control characters from the character literals would produce only a small loss of familiarity and increase character count by only a small amount (or not at all in some cases depending on the patterns being matched).

I would hypothesize as well that a large part of the perceived value of 'familiarity' in this domain actually comes from the fact that PCRE specifically are harder to learn than is needed ... Regular expressions as a mechanism are super easy to understand but PCRE are just simply difficult for humans to parse due to the lack of non-semantic whitespace and the number of escape characters required in common use. The brain has to enter a super-linear mode and read in a very strict left to right fashion as the syntax actively undermines our normal natural language character grouping and chunking faculties ...

9 Likes

By opting for #/ , would it not eliminate a large swath of parsing issues to solve?

I think this is reason enough to choose #/…/ over /…/. It completely bypasses any source compatibility issues.

But even though the pitch does call out custom literals as “orthogonal” to this feature, I would also like to pitch #regex(…) as a forward-looking choice. In a world in which Swift supports user-defined literals, #literal-name(contents) seems a likely spelling. Choosing that for regular expression literals now would enable porting the implementation to a custom literal infrastructure in the future, perhaps as a litmus test.

12 Likes

That is essentially the spelling for (Objective-C) selectors: #selector(selectorName(_:)). We have a string literal which the compiler parses and type-checks at build time.

10 Likes

Yes, and it’s also related to the syntax for some Apple-specific literals like #imageLiteral and #colorLiteral.

6 Likes

To be clear, we are absolutely planning on doing this in the form of the Pattern result builder work, and we definitely want to encourage users to reach for its more versatile pattern matching functionality for more complex patterns. However for simple patterns that can be used as e.g arguments to collection algorithms or cases in a switch statement, the terseness and ubiquity of regex literals can be incredibly useful. These features suit different use cases and can complement each other.

13 Likes

I think the folks who are developing these pitches feel that we have already gotten far enough to conclude that there is, indeed, a reason to consider introducing regex literals.

Back in July, I wrote a ~2000-line prototype of a result-builder-based pattern matching implementation. One of the impressions I got from it was that even fairly small patterns are very verbose in the builder syntax, and that this severely limits how often you would want to write inline patterns in a builder syntax.

For instance, imagine you're trying to find a 0x-prefixed hex number in a string:

if let hexDigits = input.captures(/0x([0-9a-fA-F]+)/),
   let number = Int(hexDigits, radix:16) {
    ...
}

The regex here is simple enough that you can probably at least guess at its purpose even if you don't fully understand the syntax. And it's compact enough that you can just drop it into the middle of a statement and get on with things.

Now imagine the equivalent with a pattern builder syntax (remember, this is not the syntax we'll end up pitching—it's loosely based on an old prototype of it, and in particular, one which used the term Parser instead of Pattern):

if let hexDigits = input.captures({
                       "0x"
                       PredicateParser(\.isHexDigit)
                           .repeated(1...)
                           .captured()
                   }),
   let number = Int(hexDigits, radix:16) {
    ...
}

Is the version where you're using named types and methods for everything more self-explanatory? Absolutely. Is its being more self-explanatory actually valuable here, in the middle of an if let condition, where you just want to match some hex digits? Honestly, I don't think it is. The information density of the builder syntax is extremely low, so in small, inline examples it's usually much more disruptive than regex syntax unless it's composing powerful named patterns in a way that a regex can't.

Now, does regular expression syntax fall apart eventually as a pattern gets more complex? Absolutely. But by the time regex syntax is beginning to fall apart, builder syntax has accumulated enough statements that you probably don't want to use an inline closure at all anymore—you should extract it into a separate named declaration instead. And once it's there, you may find that some of the leaves of your pattern are actually well-suited to regexes once more.

We entered the design process with an open mind about whether we wanted to support regex syntax, but we think our prototypes are sufficient to conclude that they would be valuable.

29 Likes

Overall, +1. But I don't like how disruptive introduction of the classic /regex/ delimiter is. I think something like #regex(...) is a bit too noisy. I would love to keep it as close to the classic delimiter as possible while avoiding the disruption. I prefer what I've already proposed: '/regex/'. I think even #/regex/# is not ideal, as # overshadows /, but I can live with it (especially #/regex/ variant) and prefer it to the disruption.

3 Likes

I, for one, agree with essentially all of this pitch.

As I commented previously, I am very relieved that the authors arrived at the same conclusion that @beccadax describes, which is that there is a place for regex syntax alongside a pattern builder syntax. I continue to believe that attempting to merge them into one thing by inventing a more readable regex or making pattern builders more concise will make the overall result distinctly poorer.

I am satisfied, having read this pretty complete enumeration of the possible collisions with existing syntax, that the disruption to currently valid Swift code is minimal in practice. The authors could consider including some exploration on how the proposal might affect custom / operators used for third-party path types. But overall, departing from this recognizable and ergonomic regex literal syntax to preserve custom prefix and postfix operators doesn’t seem like the right trade-off, and I have yet to see an alternative spelling that is in the same league in terms of pulchritude.

My comment re the examples of ambiguity below—

let x = 0
/ 1 / .foo()
SomeBuilder {
  x
  / y /
  z
}

—is that they are sufficiently strange as-is, with or without the introduction of regex literals, that it seems reasonable to me to warn users even now. However the compiler ultimately wants to interpret these, there will be human readers who will reasonably be confused. It is not onerous to disambiguate, and Swift should encourage users to do so.

7 Likes

Personally I'd love to see regex literals. No, the syntax isn't pretty. But for those of us with 20 or 30 years of experience using them, they're very convenient and allow us to re-use well tested code from other projects. But I'd expect most Swift users to utilize a more advanced and less-cryptic Pattern facility in most programs. The biggest use of regexes would most likely be in scripting and similar one-off uses, or for very simple regular expressions (e.g., /^[0-9]+$/)

But I find the bare /foo/ syntax to be a horrible fit for Swift. It seems unnecessary to introduce so many weird edge cases in the parsing for the sake of a syntax that probably shouldn't be commonly used at all. A more verbose syntax such as #/foo/# or #regex(/foo/) or #regex(foo) or #regex/foo/ would be a better fit. (We could allow a choice of several bracketing characters, much like Perl 5's quote and regex operators although probably with only a short list of allowed characters rather than the crazy freedom Perl allows.)

10 Likes

I personally dislike very much using / as a delimiter for regex in other languages, simply because / is a common character to use in a regular expression and you end up with escape monstrosities like /\/[a-zA-Z0-9]+\//. So in practice in other languages I'm always using other delimiters (when available) in order to avoid this.

I particularly like @hooman's suggestion #regex(/[a-zA-Z0-9]+/) because unescaped parens in the expression are already required to nest properly for capture groups.

Otherwise I do like the direction of this pitch.

10 Likes

Could you share some more details about precisely how parsing ambiguities would be resolved in the case of using the standard division operator?

Looking through our code base, I see a handful of line-wrapped expressions of this form:

let result = (Double(someValue) - Double(someOtherValue))
  / Double(somethingElse) / someOtherThing

Would this be unambiguous because a regular expression literal wouldn't be juxtaposed with another identifier, so / Double(somethingElse) / someOtherThing must refer to division?

But let's tweak this a little bit:

let result: Double = (Double(someValue) - Double(someOtherValue))
  / Double(somethingElse) / .greatestFiniteMagnitude

Should this divide Double(somethingElse) by the inferred static property Double.greatestFiniteMagnitude, or be a regular expression with a member access to the instance property greatestFiniteMagnitude? This could be potentially resolved by removing the space after the second / (which would be my choice anyway), but it's a place where users don't have to do that today.

But more importantly, is it even possible to use the context of surrounding tokens to resolve ambiguities here? Wouldn't it seriously complicate lexical analysis if you're given a sequence like / ... / foo and you can't know what kind of token(s) everything up until foo are going to be until you get to foo? You wouldn't be able to disambiguate based on neighboring tokens, but only based on neighboring characters, would you?

Overall, I'm very pro-Swift getting native support for regular expression literals, especially if they're compiled into an optimized machine at compile-time and have support for binding capture groups. But I'm not convinced by the argument that the / delimiter is so commonplace in representing regular expressions that using something else would cause confusion, given the potential confusion it could cause elsewhere in the language. The slash delimiter may be commonplace in Unix command line tools (sed, awk, etc.) and scripting languages (Perl, PHP), but the only mainstream programming languages that immediately come to mind when I think about languages that use the slash delimiter for regular expressions are JavaScript/TypeScript. I think the likelihood of users from other programming languages or environments being confused by a different delimiter is much smaller than the likelihood for confusion by trying to retrofit / as a regex delimiter.

Of the alternatives suggested here, I'm very fond of #/foo/#. It's not ambiguous with anything else in the language today (that I'm aware of), and it mirrors raw string literals nicely in that it can allow unescaped \ and # inside it, as well as unescaped / that isn't immediately followed by a #. And that could be supported by stacking # the same way we do for raw strings: ##/this regex has a /# in it!/##

8 Likes

Is there a way to set modifiers/options/flags (e.g. /foo/i), as all other languages using / support them? I believe /(?i)foo/ is less well known and more cryptic than /foo/i.

Assuming Swift's Regex can't be initialized from a String value (Regex(string: strVal)), punting on the "raw" regex literal (#/.../#) is unfortunate, maybe too unfortunate. There's no way to treat / as a normal character. Also, #/.../# could be used to disambiguate all the cases listed in the "Impact of using / as the delimiter" section. Most other languages supporting /.../ have alternative quotes (e.g. qr{...}), or at least the regex is constructible from a string (e.g. RegExp("...")).

I think those are great points in favor of regex() -- and it leaves open a path for future rich (maybe user defined) literals, as well as borrowing Perl's ability to use non-default delimiters for a given regex (e.g. m{^/path/name/} in Perl).

It isn't a huge increase over /pattern/ unless the pattern is trivially short, and it is a great decrease in complexity of finding rules that let us use slashes without breaking "too much" existing valid Swift code.


The pitch states

So at this point, the parser just parses them as it does now.

If we really need to change the parsing rule (under a version check), that would be something like "treat the / at the beginning of a line as the start of a regular expression literal".

EDIT: Wait, yeah I now understand your concern, @allevato . Let's see...

It's the sentence right after that one that concerns me:

So I hope the authors can elaborate more on what changes they think might be necessary. I think requiring semicolons to disambiguate a terminated statement from a wrapped one would be a major shortcoming (although result builders seem to have already cracked the door open a bit to that, based on the other examples they gave).

Right—a rule like that would make it impossible to wrap expressions such that binary operators are placed at the beginning of the next line when they occur at line breaks, without introducing an awkward special case for division, even if it was guarded by a language mode flag. (I'm not just saying that because operator-at-the-beginning-of-the-line is the default wrapping style used by swift-format, although that's part of the reason.) Asking users to rewrap their code when upgrading to the next version of Swift would be... unfortunate.
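As an illustration of the wrapping style in question, here is a short sketch that is valid Swift today. The variable names are made up, and the "regex at start of line" rule is the hypothetical one discussed above, not something the pitch commits to.

```swift
let total = 10
let count = 2
let scale = 5

// Operator-leading line wrapping: each continuation line begins with "/".
// Today this parses as (total / count) / scale.
// Under a rule that treats a line-leading "/" as the start of a regex literal,
// the text between the two slashes could instead be read as a pattern.
let result = total
    / count
    / scale

print(result)
```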

The Core Team is always reminding us that source compatibility is a high priority for changes to the language—even if a new language version flag is added, users should be able to upgrade their project without major disruptions, or my understanding is that the benefit should at least outweigh the disruption. This has blocked a number of places where Swift could smooth out long-time inconsistencies or rough edges. I hope that for a totally new feature, where we have the freedom to choose any delimiter we want at the beginning, we'll be consistent with that philosophy and not choose a delimiter that has the potential to be incredibly risky in terms of introducing ambiguity in existing code.


You could support both / and e.g. #/ by only allowing the former when the type is expected, and using #/ as a shorthand for Regex() elsewhere.

let p: Regex = /0x([0-9a-fA-F]+)/
let q = #/0x([0-9a-fA-F]+)/
let r = Regex(/0x([0-9a-fA-F]+)/)

This mirrors how assignment works more generally; if the type is explicitly specified the right hand side can often omit some information, if it can be inferred by the compiler.

enum T {case a, b}
let i: T = .a
let j =  T.b


var a: Int? = 1
var b = Optional<Int>(1)

Overall, as a big fan and fluent user of RegExes, I really like the pitch and the fact that we're planning on introducing RegEx literals in Swift.

My main concern, like many others above, is with all the impacts of using /.

At first I was like "yeah / makes sense because of term of art"… but now that I've realised how ambiguous it could be in all these situations (including with people already using libraries like PointFree's CasePath), I'm now +💯 on @ksluder 's suggestion of using #regex(…), which also matches precedents in Swift like #color(…) and #image(…) for other literals, and thus seems to fit very well here, all while avoiding the issues with /.

As for / being a term-of-art kind of separator, I'm not so sure that's strongly true. I mean yes many other languages use it for RegExes, but:

  • Languages like Ruby also allow things like %r{…} as an alternative for /…/ or RegExp(…)
  • If you're copy/pasting your RegEx from another language or from StackOverflow, it is really not a big deal to replace /…/ with #regex(…): what matters is not the delimiters but what's inside them.

Overall, the argument that "this is what other languages use as delimiter and it will make it easy to copy/paste from SO" is not critical imho. Making it easy to copy/paste PCRE-compliant RegEx content from SO or another language? Definitely yes, 100%. Considering it just as important to also be able to copy the delimiters around that RegEx content from the SO answer, without having to replace the surrounding / with #regex() when pasting into Swift? Not so much.
The important thing imho to be able to copy/paste from another source is the RegEx itself, not its delimiters.


So, TL;DR: Since #regex(…) seems to match with other precedent for specific literals in Swift, and avoids all the issues that / delimiters would raise, I think it's an excellent candidate.
