SE-0357: Regex String Processing Algorithms

Hello, Swift Community.

The review of SE-0357: Regex String Processing Algorithms begins now and runs through May 23, 2022.

This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal builds on the Regex type previously accepted, building a number of algorithms based on it.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0357" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

  • What is your evaluation of the proposal?

+1. As someone who writes a lot of Perl scripts for munging text files, this is a very welcome addition to Swift!

  • Is the problem being addressed significant enough to warrant a change to Swift?

Easier and more flexible string processing is a very common request. It would be great to be able to point people to these operations when they are otherwise attempting to do very questionable things with String.

  • Does this proposal fit well with the feel and direction of Swift?

It feels like a very Swifty design.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

This compares very well with other languages I've used. It's a very comprehensive set of string operations, with well thought-out naming. It is extremely obvious how and when to use most of these APIs.

The only thing that will probably trip me up multiple times is the proposed behavior for the ~= operator. I'd generally expect it to match subsequences of a string, not requiring the regex to match the complete string. Normally I'd expect to use a regex like /^..$/ if I want to match the full string. I'm sure I'll be surprised the first few times I use this.

  switch "abcde" {
      case /abc/:     // I'd expect this regular expression
      case /.*abc.*/: // to work like this one
      case /^abc$/:   // rather than this one
      default:      
  }

That will be very surprising to people accustomed to Perl, sed, grep, etc. If we want people to be able to copy and paste regexes from other systems, then I'd change the meaning of ~= to match the default used on those systems and require people to write /^abc$/ if they only want to match the full string.

my $a = "abcde";
if ($a =~ /abc/) { print "Found it!"; } # this will match in Perl

So I guess with this proposal I'll have to learn to change my regular expression to /.*abc.*/ instead when I want this behavior.
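For comparison, here is a minimal sketch of the two semantics using the matching methods from this proposal, assuming a Swift 5.7 or later toolchain. It uses the extended literal form #/.../# so that no bare-slash compiler flag is needed; the variable names are only illustrative.

```swift
let input = "abcde"
let pattern = #/abc/#

// Whole-match semantics: the regex must consume the entire string.
let whole = input.wholeMatch(of: pattern)

// Matches-anywhere semantics, comparable to Perl's =~ or grep.
let anywhere = input.firstMatch(of: pattern)
let found = input.contains(pattern)

// Padding the pattern with .* lets a whole match succeed.
let padded = input.wholeMatch(of: #/.*abc.*/#)

print(whole == nil)                   // true
print(anywhere?.output ?? "no match") // abc
print(found)                          // true
print(padded?.output ?? "no match")   // abcde
```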

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've read this proposal carefully, and I've been trying to keep track of all the regex-related proposals and forum threads.

+1
After reading through the proposal, I feel this will be a welcome addition to Swift. The functions are well crafted and, due to the solid foundation provided by the string processing framework, far surpass what I'm accustomed to in Java. This, along with its companion proposals, will make Swift a first-class text processing platform.

Nitpick in "Detailed design" > "CustomConsumingRegexComponent":

    func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, match: Match)?

The return type should be (upperBound: String.Index, output: RegexOutput)?, shouldn't it?


One more thing.

There is an explanation of why Collection is used instead of Sequence, and of why it's Collection where SubSequence == Substring. But is there any explanation of why it's not StringProtocol where SubSequence == Substring and StringProtocol where Self: RangeReplaceableCollection?

This is just a question; I'm not necessarily asking for that change.

I'm not sure if this is the intended behavior but I find it confusing:

let regex = Regex { OneOrMore(.any) }
print("abc".wholeMatch(of: regex)!.0) // prints "abc", as expected
print("abc".suffix(1).wholeMatch(of: regex)!.0) // also prints "abc"

Are the semantics of Substring.wholeMatch intentionally defined so that the regex considers the full extent of the base String? If so, I think some clarification in the proposal would really help readers understand the rationale behind it.

Thanks, opened apple/swift-evolution pull request #1646, "[SE-0357] Typos and fixes".

StringProtocol is... not well designed, and it is insufficient for this purpose on its own. We'd still need the where clause, as you pointed out. At that point, overly constraining ourselves isn't that useful, and we definitely don't want to be pushing anyone to conform to StringProtocol.

Being able to vend a Substring means that you're also creating a proper String for the base property, and doing so as a subsequence means you're not (likely) doing a copy every time. It's conceivable that we'll support some notion of (safely!) shared strings in the future with move semantics and it's conceivable that a type may vend a string-view. At that point, it's just better to use the slice (or string) as the adapter type without having another conformance to StringProtocol.

TL;DR: StringProtocol doesn't get us anything and we want there to be less of it in the future, not more.

This is mentioned in an alternative considered:

Matches-anywhere semantics for ~=

Some scripting languages use semantics akin to contains for this purpose. But in Swift, this would deviate from how switch normally behaves, such as when switching over an Array.

Matches-anywhere can be perilous when the input is dynamic and unpredictable: new inputs might just happen to trigger an earlier case in non-obvious ways. Matches-anywhere can be opted into by quantifying around the regex, and a modifier on regex to make this more convenient is future work.

I'd be interested in discussing further and how we can make it more intuitive or otherwise flesh out this rationale.
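As one illustration of what an explicit opt-in could look like today, here is a sketch under stated assumptions. The Anywhere wrapper, its ~= overload, and the classify function are hypothetical names invented for this example and are not part of SE-0357; the overload simply forwards to the proposal's contains method.

```swift
// Hypothetical wrapper (not part of SE-0357) that opts a regex into
// matches-anywhere semantics when used as a switch pattern.
struct Anywhere<Output> {
    let regex: Regex<Output>
}

// Pattern-match operator for the hypothetical wrapper: true when the
// regex matches any part of the value.
func ~= <Output>(pattern: Anywhere<Output>, value: String) -> Bool {
    value.contains(pattern.regex)
}

func classify(_ input: String) -> String {
    let mentionsABC = Anywhere(regex: #/abc/#)
    let onlyDigits = #/[0-9]+/#
    switch input {
    case mentionsABC:
        // Matches-anywhere, because the case goes through the wrapper.
        return "contains abc"
    case _ where input.wholeMatch(of: onlyDigits) != nil:
        // Whole-match, spelled out explicitly.
        return "all digits"
    default:
        return "other"
    }
}

print(classify("xxabcxx")) // contains abc
print(classify("12345"))   // all digits
print(classify("hello"))   // other
```

Spelling the choice out at the use site keeps the peril mentioned above visible: a reader can see which cases may be triggered by a match in the middle of a dynamic input.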

Matches-anywhere semantics for ~=

I read the proposal and the alternatives section before this thread. Only when reading @tim1724's concerns about the surprising semantics of ~= did I understand my slight dissonance with this part of the proposal.

I think making ~= equivalent to wholeMatch should be reconsidered. The rationale for the current semantics is to make the behavior identical to that of other types, e.g. Array. At least for me, this reasoning is outweighed by the inherent purpose of regular expressions themselves: to search for patterns within strings ("why is grep called grep? Because it greps for things.").

The workaround mentioned in the alternatives section also exists the other way round:

Matches-anywhere can be opted into by quantifying around the regex

// Having to use ...
      case /.*abc.*/:
// ... is worse than:
      case /^abc$/:

The /.*abc.*/ workaround makes me uncomfortable. It might need backtracking, with an unknown performance hit. It also depends on non-possessive QuantificationBehavior. Even the semantics of . (dot) are not 100% crystal clear (matching newline or not).

So if there's any room for debate: I'm all for ~= means contains.

(Continuing a discussion from the pitch thread about a method on String to find if it’s matched completely by a Regex. Summary: There are corresponding methods for firstMatch and prefixMatch, but not wholeMatch.)

Inspired by this PR to resolve overload issues because of String’s conformance to RegexComponent:

What about equalElements(pattern: some RegexComponent) -> Bool? If other methods that also have string overloads were named similarly, that could tie in quite nicely.

Over at the other review thread for SE-0354 Ben motivated me to try the new bare regex syntax. While I was playing around with the excellent new APIs, I also tried the replace/replacing functions proposed here.

This worked well:

let str = "A banana a day keeps the doctor away."
let result = str.replacing(/A banana/, with: "Any fruit") // Any fruit a day keeps the doctor away.

The following did not. I wish regex-based substitution were also possible, like this:

let result = str.replacing(/A banana (.*)/, with: /Any fruit \1/) // with \1 or $1

(I'm not sure which proposal this feature belongs to, whether this is only a new string processing algorithm or it is something that requires deeper support on the Regex type itself)

The with: argument you have is not a regex but is instead a template string, where captures get substituted. Something like .* inside that string would be verbatim content. I.e. it's for outputting and not processing input. A template string is future work (if it's compelling).

You can write that code using the replacing variant that takes the actual match:

let result = str.replacing(/A banana (.*)/) { match in
  "Any fruit " + match.1
}

(keeping your change in whitespace after banana)

This looks like a bug. I opened Issue #420.

Plenty of room for debate:

I don't have strong opinions on ~= specifically and I could see it going either way. One argument could be that if it's used with a literal, the matches-anywhere is more intuitive, and it's more likely to be used with literals in if or all-literal case statements.

I don't know how strong the precedent or intuition is behind matching Array's behavior.

🤔 💭

Poor Investigation

Language   | Regular Expression | Regex Literal | Regex in case expression  | Matching semantics
-----------+--------------------+---------------+---------------------------+-------------------
C          | n/a                | n/a           | n/a                       | n/a
C#         | ✅                 | n/a           | n/a                       | n/a
Go         | ✅                 | n/a           | n/a                       | n/a
Java       | ✅ (Pattern)       | n/a           | n/a                       | n/a
JavaScript | ✅                 | ✅            | n/a                       | n/a
Kotlin     | ✅                 | n/a           | n/a (when statement)      | n/a
Perl       | ✅                 | ✅            | ✅ (given-when statement) | anywhere-match
PHP        | ✅                 | n/a           | n/a                       | n/a
Ruby       | ✅                 | ✅            | ✅ (case-when statement)  | anywhere-match
Rust       | ✅                 | n/a           | n/a (match expression)    | n/a

It seems that regexes can legitimately be used with switch statements in only a few languages. Those few languages adopt "matches-anywhere semantics" (and the same semantics would also apply to the methods of regex classes in other languages, if regexes could be used unofficially in case expressions there).

Our intuition would favor matches-anywhere semantics because there are indeed such use cases in other languages.
Our reason would accept whole-match semantics because it is safer and Swift prefers safety.
Our resignation would refuse to use regexes in case expressions at all, because that is what the majority of languages do.

Hmm, we seem to lack a decisive factor.

I think Range has precedent for contains.

There have been discussions earlier about the "correct" semantics for option sets and other sets, but I believe those match against the full set.

Personally, I'm in favor of whole-match semantics by default for ~=. Swift's pattern matching has always matched the entire value. If "hello" doesn't match "hello world" and [2, 3] doesn't match [1, 2, 3], then why should /hello/ match "hello world"?

Swift's switches already deviate from the semantics of most languages with the lack of implicit fallthrough. As YOCKOW has shown, most languages don't allow regexes in switches anyway. Additionally, in Perl (one of the only languages that does support regexes in switches), switches are considered "highly experimental" and shouldn't be used.

This is false, as demonstrated above. A Range matches any element within the range.
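Both precedents exist in the standard library today, which a small sketch can show. It calls the ~= operator directly, which is what a switch case does behind the scenes; the constant names are only illustrative.

```swift
// ClosedRange: the pattern matches any value it contains.
let rangeMatch = (1...10) ~= 5

// Equatable types such as String and Array: the pattern must equal
// the whole value.
let stringMatch = "hello" ~= "hello world"
let arrayMatch = [2, 3] ~= [1, 2, 3]

print(rangeMatch)  // true
print(stringMatch) // false
print(arrayMatch)  // false
```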

Speaking personally, I have really gone back and forth on this.

One other data point to consider: whole match leads to a loss of expressivity, since you no longer get to use ^, $, etc. to specify a whole match rather than a partial one if whole matching is all you get regardless.

One could always write case /.*foo.*/: to achieve partial matches, or case /(?s).*foo.*/: if the string contains newlines, so strictly speaking there’s no loss of expressivity.

That said, that newline footgun, and the awkwardness of the syntax to fix it, are perhaps arguments against whole string matching.
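The newline footgun can be sketched with the proposal's matching methods. In Swift's regex syntax, the option that lets . match newlines is (?s), also available as the dotMatchesNewlines() method on Regex; the sample text and constant names below are only illustrative.

```swift
let text = "first line\nfoo here\nlast line"
let padded = #/.*foo.*/#

// By default . does not match a newline, so the padded pattern cannot
// span the whole multi-line string.
let plain = text.wholeMatch(of: padded)

// Opting in to dot-matches-newline, inline or via the Regex method.
let inlineOption = text.wholeMatch(of: #/(?s).*foo.*/#)
let methodOption = text.wholeMatch(of: padded.dotMatchesNewlines())

// Matches-anywhere needs no padding at all.
let anywhere = text.contains(#/foo/#)

print(plain == nil)        // true
print(inlineOption != nil) // true
print(methodOption != nil) // true
print(anywhere)            // true
```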

This is determinative for me — bracketing an expression in ^/$ is simple and idiomatic for regular expressions, much more so than converting a partially matching expression to one that matches an entire input.
