SE-0357: Regex String Processing Algorithms

nitpick in "Detailed design" > "CustomConsumingRegexComponent":

    func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, match: Match)?

The return type should be (upperBound: String.Index, output: RegexOutput)?, shouldn't it?
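
To make the corrected signature concrete, here is a minimal sketch of a conformance, assuming the `output:` label and the `RegexOutput` associated type as suggested above. `ASCIIDigitsParser` is a hypothetical type made up for this illustration and does not appear in the proposal; the sketch assumes Swift 5.7 or later and calls `consuming` directly rather than through a regex.

```swift
// Hypothetical parser for illustration only; not part of the proposal.
struct ASCIIDigitsParser: CustomConsumingRegexComponent {
    typealias RegexOutput = Int

    func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: Int)? {
        var end = index
        var value = 0
        // Consume ASCII digits until the bound or a non-digit is reached.
        // Overflow handling is omitted to keep the sketch short.
        while end < bounds.upperBound,
              input[end].isASCII,
              let digit = input[end].wholeNumberValue {
            value = value * 10 + digit
            end = input.index(after: end)
        }
        // Returning nil signals that nothing was consumed at this position.
        guard end > index else { return nil }
        return (upperBound: end, output: value)
    }
}

let text = "42 apples"
let parsed = try ASCIIDigitsParser().consuming(
    text, startingAt: text.startIndex, in: text.startIndex..<text.endIndex)
if let parsed {
    // `output` carries the parsed value; `upperBound` is where consumption stopped.
    print(parsed.output, text[parsed.upperBound...])
}
```

The tuple's second element is the component's own output type, which is why the `output: RegexOutput` spelling reads more naturally than `match: Match`.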


One more thing.

There is an explanation of why Collection is used instead of Sequence, and why it is Collection where SubSequence == Substring. But is there any explanation of why it is not StringProtocol where SubSequence == Substring and StringProtocol where Self: RangeReplaceableCollection?

This is just a question; I'm not necessarily asking for that.

I'm not sure if this is the intended behavior but I find it confusing:

    let regex = Regex { OneOrMore(.any) }
    print("abc".wholeMatch(of: regex)!.0) // prints "abc", as expected
    print("abc".suffix(1).wholeMatch(of: regex)!.0) // also prints "abc"

Are the semantics of Substring.wholeMatch intentionally defined so that the regex considers the full extent of the base String? If so, I think some clarification in the proposal would really help in understanding the rationale behind it.

Thanks, opened apple/swift-evolution pull request #1646, "[SE-0357] Typos and fixes".

StringProtocol is... not well designed, and it is insufficient for this purpose on its own. We'd still need the where clause, as you pointed out. At that point, overly constraining ourselves isn't that useful, and we definitely don't want to be pushing anyone to conform to StringProtocol.

Being able to vend a Substring means that you're also creating a proper String for the base property, and doing so as a subsequence means you're not (likely) doing a copy every time. It's conceivable that we'll support some notion of (safely!) shared strings in the future with move semantics and it's conceivable that a type may vend a string-view. At that point, it's just better to use the slice (or string) as the adapter type without having another conformance to StringProtocol.

TL;DR: StringProtocol doesn't get us anything and we want there to be less of it in the future, not more.
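
As a rough illustration of what the `Collection where SubSequence == Substring` constraint affords, the sketch below writes a generic function against that constraint and reads the `base` of the vended slice. The function name `describe` is hypothetical and chosen only for this example.

```swift
// Hypothetical helper: generic over any collection that vends Substring slices.
func describe<C: Collection>(_ input: C) -> (slice: String, base: String)
where C.SubSequence == Substring {
    // Slicing the full range yields a Substring without requiring StringProtocol.
    let whole: Substring = input[input.startIndex..<input.endIndex]
    // `base` is the String that the slice is a view into.
    return (slice: String(whole), base: whole.base)
}

let full = "hello world"
let tail = full.dropFirst(6)   // a Substring covering "world"

let fromString = describe(full)
let fromSubstring = describe(tail)
print(fromString, fromSubstring)
```

Both String and Substring satisfy the constraint, so no StringProtocol conformance is involved at any point.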

This is mentioned in an alternative considered:

Matches-anywhere semantics for ~=

Some scripting languages use semantics akin to contains for this purpose. But in Swift, this would deviate from how switch normally behaves, such as when switching over an Array.

Matches-anywhere can be perilous when the input is dynamic and unpredictable: new inputs might just happen to trigger an earlier case in non-obvious ways. Matches-anywhere can be opted into by quantifying around the regex, and a modifier on regex to make this more convenient is future work.

I'd be interested in discussing further and how we can make it more intuitive or otherwise flesh out this rationale.
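
For reference, the difference between the two semantics can be sketched with the proposed algorithms directly, without involving ~= at all. This assumes Swift 5.7 or later and uses the extended `#/.../#` literal delimiters so that no compiler flag is needed.

```swift
let pattern = #/abc/#
let input = "xxabcxx"

// Whole-match semantics: the regex must account for the entire input.
let wholeMatches = input.wholeMatch(of: pattern) != nil

// Matches-anywhere semantics: akin to `contains`.
let matchesAnywhere = input.contains(pattern)

// Opting into matches-anywhere by quantifying around the regex.
let paddedWholeMatches = input.wholeMatch(of: #/.*abc.*/#) != nil

print(wholeMatches, matchesAnywhere, paddedWholeMatches)
```

Under whole-match semantics a case for `abc` would skip this input, while the padded form and `contains` both accept it.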

Matches-anywhere semantics for ~=

I had read the proposal and the alternatives section before this thread. Only when reading @tim1724's concerns about the surprising semantics of ~= did I understand my slight dissonance with this part of the proposal.

I think making ~= equivalent to wholeMatch should be reconsidered. The rationale for the current semantics is to make the behavior identical to that of other types, e.g. Array. At least for me, this reasoning is outweighed by the inherent semantics of regular expressions themselves: to search for patterns within strings ("why is grep called grep? Because it greps for things.").

The workaround mentioned in the alternatives section also works the other way round:

Matches-anywhere can be opted into by quantifying around the regex

    // Having to use ...
    case /.*abc.*/:
    // ... is worse than:
    case /^abc$/:

The /.*abc.*/ workaround makes me uncomfortable. It might need backtracking, with an unknown performance hit. It also depends on non-possessive QuantificationBehavior. Even the semantics of . (dot) are not 100% crystal clear (whether it matches newlines or not).

So if there's any room for debate: I'm all for ~= means contains.
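
Whichever default is chosen, the other behavior can be opted into in user code. As a sketch only, the wrapper below gives matches-anywhere semantics to a `case` by defining its own ~= overload. `MatchesAnywhere` and `classify` are hypothetical names invented for this example and are not part of the proposal; the sketch assumes Swift 5.7 or later.

```swift
// Hypothetical wrapper type; not part of the proposal.
struct MatchesAnywhere<Output> {
    let regex: Regex<Output>
    init(_ regex: Regex<Output>) { self.regex = regex }
}

// Pattern-match operator for the wrapper: true if the regex matches anywhere.
func ~= <Output>(pattern: MatchesAnywhere<Output>, input: String) -> Bool {
    input.contains(pattern.regex)
}

func classify(_ line: String) -> String {
    switch line {
    case MatchesAnywhere(#/error/#):
        return "error"
    case MatchesAnywhere(#/warn(ing)?/#):
        return "warning"
    default:
        return "other"
    }
}

print(classify("disk error on sda"), classify("warning: low battery"), classify("all good"))
```

Note that the first matching case wins, which is exactly the hazard the proposal describes for dynamic input: a line containing both words is classified by whichever case comes first.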

(Continuing a discussion from the pitch thread about a method on String to find if it’s matched completely by a Regex. Summary: There are corresponding methods for firstMatch and prefixMatch, but not wholeMatch.)

Inspired by this PR to resolve overload issues because of String’s conformance to RegexComponent:

What about equalElements(pattern: some RegexComponent) -> Bool? If other methods that also have string overloads were named similarly, that could tie in quite nicely.

Over at the other review thread for SE-0354 Ben motivated me to try the new bare regex syntax. While I was playing around with the excellent new APIs, I also tried the replace/replacing functions proposed here.

This worked well:

    let str = "A banana a day keeps the doctor away."
    let result = str.replacing(/A banana/, with: "Any fruit") // Any fruit a day keeps the doctor away.

The following did not. I wish regex based substitution was also possible, like this:

    let result = str.replacing(/A banana (.*)/, with: /Any fruit \1/) // with \1 or $1

(I'm not sure which proposal this feature belongs to, whether this is only a new string processing algorithm or it is something that requires deeper support on the Regex type itself)

The with: argument you have is not a regex but is instead a template string, where captures get substituted. Something like .* inside that string would be verbatim content. I.e. it's for outputting and not processing input. A template string is future work (if it's compelling).

You can write that code using the replacing variant that takes the actual match:

    let result = str.replacing(/A banana (.*)/) { match in
      "Any fruit " + match.1
    }

(keeping your change in whitespace after banana)

This looks like a bug. I opened Issue #420.

Plenty of room for debate:

I don't have strong opinions on ~= specifically and I could see it going either way. One argument could be that if it's used with a literal, matches-anywhere is more intuitive, and it's more likely to be used with literals in if statements or all-literal case statements.

I don't know how strong the precedent or intuition is behind matching Array's behavior.

:thinking: :thought_balloon:

Poor Investigation

| Language | Regular Expression | Regex Literal | Regex in case expression | Matching semantics |
|---|---|---|---|---|
| C | n/a | n/a | n/a | n/a |
| C# | :white_check_mark: | n/a | n/a | n/a |
| Go | :white_check_mark: | n/a | n/a | n/a |
| Java | :white_check_mark: (Pattern) | n/a | n/a | n/a |
| JavaScript | :white_check_mark: | :white_check_mark: | n/a | n/a |
| Kotlin | :white_check_mark: | n/a | n/a (when statement) | n/a |
| Perl | :white_check_mark: | :white_check_mark: | :white_check_mark: (given-when statement) | anywhere-match |
| PHP | :white_check_mark: | n/a | n/a | n/a |
| Ruby | :white_check_mark: | :white_check_mark: | :white_check_mark: (case-when statement) | anywhere-match |
| Rust | :white_check_mark: | n/a | n/a (match expression) | n/a |

It seems that regexes can legitimately be used in switch-like statements in only a few languages. Those few languages adopt "matches-anywhere semantics" (and the same semantics would also apply to the methods of regex classes in other languages if regexes were used in case expressions by some workaround).

Our intuition would favor matches-anywhere semantics because such use cases do exist in other languages.
Our reason would accept whole-match semantics because it is safer and Swift prefers safety.
Our resignation would refuse to use regexes in case expressions at all because that is what the majority of languages do.

Hmm, we seem to lack a decisive factor.

I think Range has precedent for contains.

There have been discussions earlier about the "correct" semantics for option sets and other sets, but I believe those match against the full set.

Personally, I'm in favor of whole-match semantics by default for ~=. Swift's pattern matching has always matched the entire value. If "hello" doesn't match "hello world" and [2, 3] doesn't match [1, 2, 3], then why should /hello/ match "hello world"?

Swift's switches already deviate from the semantics of most languages with the lack of implicit fallthrough. As YOCKOW has shown, most languages don't allow regexes in switches anyway. Additionally, in Perl (one of the only languages that does support regexes in switches), switches are considered "highly experimental" and shouldn't be used.
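
The existing ~= precedents mentioned in this thread can be checked directly. The sketch below uses only standard-library overloads of ~= (the Equatable one for String and Array, and the range one), so it does not depend on the regex proposal.

```swift
let greeting = "hello world"
let numbers = [1, 2, 3]

// Equatable-based ~=: the pattern must equal the entire value.
let stringMatches = "hello" ~= greeting
let arrayMatches = [2, 3] ~= numbers

// Range-based ~=: the pattern matches any value contained in the range.
let rangeMatches = (1...5) ~= 3

print(stringMatches, arrayMatches, rangeMatches)
```

So both styles already exist: String and Array patterns compare the whole value, while a Range pattern tests containment.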

This is false, as demonstrated above. A Range matches any element within the range.

Speaking personally, I have really gone back and forth on this.

One other data point to consider: whole match leads to a loss of expressivity, since you no longer get to use ^, $, etc. to specify a whole match rather than a partial one if whole matching is all you get regardless.

One could always write case /.*foo.*/: to achieve partial matches, or case /(?m).*foo.*/: if the string contains newlines, so strictly speaking there’s no loss of expressivity.

That said, that newline footgun, and the awkwardness of the syntax to fix it, are perhaps arguments against whole string matching.

This is determinative for me — bracketing an expression in ^/$ is simple and idiomatic for regular expressions, much more so than converting a partially matching expression to one that matches an entire input.

On the other hand, .* is also a clear, idiomatic sigil in a regex context that "there may be arbitrary amounts of other data here"

I guess it depends what kind of precedent we want to set if/when we add more advanced grammars and non-string pattern matching.

Clear and idiomatic indeed, but wrong!

If Swift cases enable whole string matching, then /.*fo+.*/ does not match "foo\nbar", because . does not match newlines except in multiline mode. The correct spelling would be /(?m).*foo.*/ (I think? Although the syntax proposal does not mention . matching newlines in multiline mode as it does in other languages…).

Conversely, if Swift cases match partial strings by default, the naive solution of /^fo+$/ to match "fooooo" but exclude "foo\nbar" is in fact correct, because in the proposed syntax, ^ and $ match beginning / end of string, not beginning / end of line, when not in multiline mode. (This is a notable departure from other languages, such as Ruby, which would require /\Afo+\Z/ in order to exclude "foo\nbar".)

Either behavior is potentially confusing, but this asymmetry suggests to me that partial matching may be slightly less of a footgun.
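
The asymmetry described above can be sketched as follows. Instead of inline flags, this uses the `dotMatchesNewlines()` and `anchorsMatchLineEndings()` modifier methods as I understand them from the Regex API that eventually shipped (Swift 5.7 or later); treat the exact spelling of the opt-ins as an assumption relative to the proposal text under review here.

```swift
let multiLine = "foo\nbar"

// Whole-match with padding: `.` does not cross the newline by default.
let padded = #/.*fo+.*/#
let paddedDefault = multiLine.wholeMatch(of: padded) != nil
let paddedDotAll = multiLine.wholeMatch(of: padded.dotMatchesNewlines()) != nil

// Partial match with anchors: ^ and $ mean start/end of input by default.
let anchored = #/^fo+$/#
let anchoredSingle = "fooooo".contains(anchored)
let anchoredMulti = multiLine.contains(anchored)
let anchoredPerLine = multiLine.contains(anchored.anchorsMatchLineEndings())

print(paddedDefault, paddedDotAll, anchoredSingle, anchoredMulti, anchoredPerLine)
```

The padded form silently fails on multi-line input until the dot behavior is changed, whereas the anchored form excludes "foo\nbar" without any extra options.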

That would be the semantic definition of the any character class, which would probably be formalized as part of [Pitch] Unicode for String Processing.

Minor process note, we should probably treat the definitions in the syntax description as temporary and update them after the Unicode pitch happens. I do think it's clearer to have them there, but they're not normative.

That being said, could we discuss departure from other languages more? Perl, Python, ICU/NSRegularExpression, Rust, JavaScript, C#, ... all have the behavior of being beginning/end of input unless multi-line is specified. I couldn't quickly figure out Ruby, but that might be the language that diverges from the pack here.

It might end up being the case that we argue for multi-line as the default.
