SE-0357: Regex String Processing Algorithms

On the other hand, .* is also a clear, idiomatic sigil in a regex context that "there may be arbitrary amounts of other data here"

I guess it depends what kind of precedent we want to set if/when we add more advanced grammars and non-string pattern matching.

Clear and idiomatic indeed, but wrong!

If Swift cases enable whole string matching, then /.*fo+.*/ does not match "foo\nbar", because . does not match newlines except in multiline mode. The correct spelling would be /(?m).*foo.*/ (I think? Although the syntax proposal does not mention . matching newlines in multiline mode as it does in other languages…).

Conversely, if Swift cases match partial strings by default, the naive solution of /^fo+$/ to match "fooooo" but exclude "foo\nbar" is in fact correct, because in the proposed syntax, ^ and $ match beginning / end of string, not beginning / end of line, when not in multiline mode. (This is a notable departure from other languages Ruby, which would require /\Afo+\Z/ in order to exclude "foo\nbar".)

Either behavior is potentially confusing, but this asymmetry suggests to me that partial matching may be slightly less of a footgun.

4 Likes

That would be the semantic definition of the any character class, which would probably be formalized as part of [Pitch] Unicode for String Processing.

Minor process note, we should probably treat the definitions in the syntax description as temporary and update them after the Unicode pitch happens. I do think it's clearer to have them there, but they're not normative.

That being said, could we discuss departure from other languages more? Perl, Python, ICU/NSRegularExpression, Rust, Javascript, C#, ... all have the behavior of being beginning/end of input unless multi-line is specified. I couldn't quickly figure out Ruby, but that might be the language that diverges from the pack here.

It might end up being the case that we argue for multi-line as the default.

2 Likes

Yes, Ruby appears to be the exception here:

$ ruby -e 'puts "match" if "foo\nbar" =~ /^fo+$/'
match

Note that this is specifically the built-in given statement that was intended as a replacement for the switch statement provided by the unofficial (but extremely commonly used) Switch module. The problems with given were due not to the use of regexes (which have been used with switch for decades without issue) but rather are due to problems with the ~~ ("smartmatch") operator upon which given was based.

The switch statement provided by the Switch module is frequently used with regexes and when used with regexes it uses Perl's =~ operator, so it operates with "matches anywhere" semantics, as that's how that operator works in Perl:

use Switch;

switch ("foo bar") {
    case /foo/    { print "Match foo\n"; next } # next makes it fall through to the next case
    case /bar/    { print "Match bar\n"; next }
    case /baz/    { print "Match baz\n"; next }
    case /^foo$/  { print "Match exactly foo\n"; next }
}         

The first two regexes match the string, so it outputs

Match foo
Match bar

The last case explicitly specifies that it needs to match against the entire string (using ^ and $) so it doesn't match the string.


I'd strongly prefer that Swift also use this behavior when using the ~= operator (and thus switch statements) with regular expressions. It's how regular expressions work pretty much everywhere. Even people who have never used them outside of grep will expect this behavior. (Although grep does have a -x flag to force it to match an entire line, it's more idiomatic to use ^..$ in the regular expression instead. I'd be willing to bet that the vast majority of grep users don't know about the -x flag and always use ^..$ when they want to match whole lines.)

12 Likes

Would it be possible to allow regexes to be applied to buffers of Unicode scalars?

It seems like quite a few Unicode algorithms are expressed at the scalar level, and it can happen during intermediate processing that scalars are the most natural storage format rather than UTF-8.

For example, Unicode presentation of domain names. The way the Punycode algorithm works is by encoding a list of random-access insertions in to a buffer of scalars, and it does it in code-point order (e.g. a list of where to insert every "a", then every "b", then every "c", etc - you loop back over the string multiple times, and the string grows every time you insert something. The encoding also accounts for this growth -- in units of unicode scalars, of course). It is hugely impractical to decode this as anything other than a buffer of scalars.

But then, once I have decoded part of a domain, I want to allow developers to match custom patterns, and the idea is that they will make heavy use of scalar properties such as script, joining behaviour, text direction, etc. There are a bunch of heuristics which go in to deciding how best to render a domain, and it can depend on the user and which scripts they personally are familiar with (Punycode isn't a universally safer option - consider that http://岍岊岊岅岉岎.com in Punycode is http://xn--citibank.com).

It would be nice if developers could express those patterns using regexes, but I'm giving them a buffer of Unicode scalars (an AnyRandomAccessCollection<Unicode.Scalar>), not a String or Substring, for the reasons I described.

At an implementation level I would expect these kinds of regexes would probably benefit from matching on decoded scalars anyway. And if I imagine what a truly great, next-level solution would look like, these heuristics might one day also involve other subsystems, such as the font rendering engine, which would probably also appreciate decoded scalars. So I'd really like to keep my data in this form if at all possible, while also giving developers the ability to match regex patterns.

2 Likes
  1. Why does the consuming method of the CustomConsumingRegexComponent protocol have both index and bounds parameters? I checked some examples:

    • CDoubleParser in the Swift Regex: Beyond the basics video (at 19:16) doesn't use the bounds parameter.

    • IntParser and CurrencyParser in the CustomTests.swift file both use input[index..<bounds.upperBound] rather than input[bounds]. They also guard that index != bounds.upperBound, which seems unnecessary.

  2. Would a half-open or partial range be clearer than the upperBound of the return type?

  3. For a type such as SemanticVersionParser in the RegexDSLTests.swift file, would it be more efficient to store a static let _regex: Regex, instead of building it for each consuming method call?


CustomConsumingRegexComponent
/// A protocol allowing custom types to function as regex components by
/// providing the raw functionality backing `prefixMatch`.
public protocol CustomConsumingRegexComponent: RegexComponent {
    /// Process the input string within the specified bounds, beginning at the given index, and return
    /// the end position (upper bound) of the match and the produced output.
    /// - Parameters:
    ///   - input: The string in which the match is performed.
    ///   - index: An index of `input` at which to begin matching.
    ///   - bounds: The bounds in `input` in which the match is performed.
    /// - Returns: The upper bound where the match terminates and a matched instance, or `nil` if
    ///   there isn't a match.
    func consuming(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: RegexOutput)?
}

CDoubleParser
import Darwin

struct CDoubleParser: CustomConsumingRegexComponent {
    typealias RegexOutput = Double

    func consuming(
        _ input: String, startingAt index: String.Index, in bounds: Range<String.Index>
    ) throws -> (upperBound: String.Index, output: Double)? {
        input[index...].withCString { startAddress in
            var endAddress: UnsafeMutablePointer<CChar>!
            let output = strtod(startAddress, &endAddress)
            guard endAddress > startAddress else { return nil }
            let parsedLength = startAddress.distance(to: endAddress)
            let upperBound = input.utf8.index(index, offsetBy: parsedLength)
            return (upperBound, output)
        }
    }
}

1 Like

To answer question 3: The test was written this way specifically for testing CustomConsumingRegexComponent, i.e. testing a regex whose whole match (.0) output is not Substring. The implementation was sort of "bootstrapped" using Regex because it's easier than writing a custom parser. We could have used strtod like CDoubleParser, but a custom result type made it more interesting :)

2 Likes

Why is CustomConsumingRegexComponent.consuming()'s first parameter a String rather than a Substring? Doesn't this lead to excess allocations?

In addition to the unclear situation Ben asks about (startingAt vs. bounds), CustomConsumingRegexComponent.consuming() can either throw or return nil in order to signal a non-match. The documentation should make it clear whether one is preferable, and in what situations.

1 Like

I’m no regex expert but I definitely find whole string matching the intuitive option for switch, that’s what it does in every other case i can think of.

3 Likes

Review Manager Update

Thank you for everyone who participated in the review. The core team has concluded that the proposal should be accepted in principle, but that it would benefit from further discussion of the specific issue of whole match vs partial match for ~=. A second review focused on this aspect will be conducted shortly.

5 Likes

@Michael_Ilseman talked about this in the pitch thread: [Pitch] Regex-powered string processing algorithms - #7 by Michael_Ilseman

2 Likes

Review Conclusion

The proposal has been accepted with modifications (the modification being to subset out ~= for now).

1 Like