Declarative String Processing Overview

siracusa · September 30, 2021, 1:58pm

Setting aside Patterns and Raku Grammars for now, and just looking at "normal" regexes in Swift, everything looks OK right up until we hit the .capture part of the example in this proposal:

line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)?
    .captures.flatMap { (l, u, p) in
      guard let property = Unicode.GraphemeBreakProperty(p) else {
        return nil
      }
      let scalars = Unicode.Scalar(hex: l)! ... Unicode.Scalar(hex: u ?? l)!
      return (scalars, property)
    }

Something like this looks much nicer:

let (l, u, p) = line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)

…but then there's the pesky question of types (and "Swift-y" error handling).

I'm still not sure the proposed feature meets the "simple things should be simple" guideline. Having native regexes is great for eliminating the sprawl of manual parsing code, but if each actual use of a regex brings along its own "sidecar sprawl" of capture code, that's not great either.

kylemacomber · September 30, 2021, 3:24pm

We definitely plan to devote a focused follow-up proposal to the algorithm signatures. What you propose is actually where we started!

We decided to add the explicit captures to these examples to distinguish between the matched substrings and the captured substrings. The idea is match might be defined something like:

func match<Capture>(_ pattern: Regex<Capture>) -> MatchResult<Capture>?

struct MatchResult<Captures> {
    var matches: Matches // Collection where Element == Substring
    var captures: Captures
}

Because you might not care about the captures and just want to enumerate the matched substrings:

let line = "007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>"
line.match(/[0-9A-F]{4,6}/)?.matches.forEach {
    print($0)
}

// Prints "007F"
// Prints "009F"
// Prints "007F"
// Prints "009F"

Alternatively, extracting matches and captures could be defined as two different algorithms.

Michael_Ilseman · September 30, 2021, 3:53pm

My dream is named captures via let bindings inside the result builder, allowing you to interact with them later in the builder as a kind of Swift take on backreferences. Something like monadic bind would allow you to accumulate and forward values (though I don't know enough about result builder design rationale to know whether this is sacrilege).

The direct way would be by calling Collection.distance() with the returned indices. We could expose a method on the match result to calculate that for you.

We're unlikely to eagerly track the count during matching, as we'd rather operate directly over the UTF-8 bytes whenever possible without having to run grapheme-breaking for everything.

I definitely do want to talk about them! But that's a much longer reply that I'm drafting.

As outlined, the result of a match would be some kind of match object containing information about the overall match alongside captures, so you'd have to stick a .captures on the end.

That being said, I'm a big fan of allowing types to define custom tuple destructuring, so it's possible that statement would work as-is. It does open the door to ad-hoc type relationships when you can't just parameterize over a generic, but no more so than destructing matching or even collection algorithms.

I expect there to be a lot of pressure on the final design to eliminate such boilerplate, so it's good to be talking about, and identifying, ad hoc type relationships now.

Right, API design might obviate concerns about the return type.

siracusa · September 30, 2021, 4:36pm

Somewhat related to the discussion about captures, will named captures be supported? And if so, how will those names manifest in Swift?

In Perl's implementation, named captures can be used as back-references within the regex, and they also manifest outside the match in the special %+ hash:

my $re = qr/^(?<key>\w+)=(?:(?<number>\d+)|(?<string>\S+))$/;

"foo=bar" =~ $re;
print "k: $+{key}, n: $+{number}, s: $+{string}\n";

"foo=123" =~ $re;
print "k: $+{key}, n: $+{number}, s: $+{string}\n";

This prints:

k: foo, n: , s: bar
k: foo, n: 123, s:

Assuming named captures are supported in Swift's regex syntax, do they only work as back-references within the regex, or can they influence the representation of the captures outside the match?

More broadly, which features will be supported by Swift's regex syntax? Everything in pcre?

mayoff · September 30, 2021, 4:54pm

How is the regex literal syntax going to affect the availability and use of / as a prefix operator?

In particular, the swift-case-paths package uses prefix operator / as a convenience for constructing case paths (example syntax: let successPath = /Result<Int, Error>.success). If / is no longer usable as a prefix operator, it's going to be a detriment to most users of swift-case-paths, including most users of the swift-composable-architecture package.

1-877-547-7272 · September 30, 2021, 5:48pm

siracusa:

Something like this looks much nicer:
let (l, u, p) = line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)
…but then there's the pesky question of types (and "Swift-y" error handling).

This code:

guard let (l, u, p) = line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/).captures else {
    handleFailure()
}
// use l, u, and p

would work (assuming captures is an Optional property). It’s more verbose than your example, but it’s still clear.

We could add a method to “regular-expressible” types called captures(from: RegEx) that would would be equivalent to match(_: RegEx).captures.

This confusion over where to put an API has come up before — I think we need official guidance from the Core Team on how to decide whether an API should go into the standard library, Foundation, or a separate package. For now, though, here are the rules I follow:

If an API is meant for general-purpose use and doesn’t rely on operating system services (i.e. numbers, collections, and collection algorithms), then it should go in the standard library.
If an API is meant for general-purpose use and relies on operating system services (i.e. networking), then it should go in Foundation.
If an API is meant for a specific domain of programming (like complex numbers) or we want to test-drive an API before its inclusion in the standard library / Foundation (like some Swift Collections algorithms), then we should put it in a separate package.

I’d consider regular expressions and Pattern as being meant for general-purpose use, so IMO they should go in the standard library.

I don’t think we should avoid regular expressions just because of their obscure syntax — regexes can be very useful in the right context. While it’s true that regular expressions can obfuscate the meaning of code, being too verbose can also obfuscate code. I think that allowing regexes within Pattern builders is a good way to help balance the obscurity of regexes with the verbosity of Pattern. Good documentation on Swift’s regular expression syntax should also help with reading regexes.

I don’t think that it’s possible to create a regex-like “convenient and powerful pattern-matching syntax that reads like normal Swift code” — as @xwu pointed out earlier in the thread, making Pattern more terse or making regular expressions more readable would lead to undesirable compromises.

I think named captures should be part of Swift’s pattern-matching syntax. So it would be something like

guard case /^(?<let key>\w+)=(?:(?<let number>\d+)|(?<let string>\S+))$/ else {
    handleFailure()
}
//use key, number, and string

(Obviously, we should use a Swiftier syntax for this.)

Have we considered combining the RegEx and Pattern types? Regular expression literals could simply be a shorthand syntax for Pattern.

Saklad5 · September 30, 2021, 6:34pm

I think adding new forms of built-in literal should be done extremely carefully. I agree that unchecked string literals aren’t ideal, but I think this is a broader issue than accepting regular expressions.

Consider Version in Swift Package Manager. I often use the explicit Version(x, y, z) initializer instead of using a string literal like ”x.y.z”, specifically because I’d rather have the compile-time check. However, I think we can all agree making a special literal just for that would be rather absurd.

Swift had a similar issue with the proliferation of attributes, which has largely been nipped in the bud using property wrappers. Would it be possible to implement a similar system for custom literals, with the ability to interpret said literals at compile-time?

typesanitizer · September 30, 2021, 6:50pm

Could you share what the types of the core operations (alternation r1|r2, sequencing r1r2 and repetition r*) will be? I see two main directions here:

Build on top of type sequences.
Use lots of overloads (SwiftUI style).

Not sure which one you have in mind.

I'm wondering how one would go about building generic functionality on top of these strongly-typed regexes. For example, one cannot add conformances for tuples in library code today, which means that adding conditional conformances for regexes and patterns would require different extensions, one per tuple size...

Is that the right branch? Maybe I'm missing something, but I see parser combinators in that branch, but I don't see example tests using regexes.

I understand this is a very tempting design direction, but my experience with using strongly-typed regex libraries has been not been very good, so much so that I would like to caution against this direction. In comparable languages:

Using the strongly-typed regexes in Haskell is complicated and not very useful, beyond trivial cases which can be verified by inspection anyways.
Using the weakly-typed regexes in Rust is relatively straightforward and familiar when coming from other tools/languages (e.g. Python, grep and sed).

kylemacomber · September 30, 2021, 7:03pm

I think it'd make sense for named captures to translate to named tuple elements:

let regex = /(?<lower>[0-9A-F]+)(?:\.\.(?<upper>[0-9A-F]+))?\s*;\s(?<property>\w+).*/
print(type(of: regex))
// Prints Regex<(lower: Substring, upper: Substring?, property: Substring)>

It's not as clear what the right syntax would be for the result builder based syntax.

I think we're going to figure out Swift's regex feature set as part of discussions like this. We want folks to be able to re-use their existing knowledge and existing regexes, so that argues for broad compatibility. At the same time, we also have this powerful, expressive result builder syntax, so we have the flexibility to not add every last exotic feature to the regex literal.

Saklad5 · September 30, 2021, 7:23pm

If there’s one thing I’m sure about, it’s that adding more specialized types of literal to the language itself is a slippery slope. If compile-time regular-expression checking can’t be implemented in actual Swift code (like result builders and property wrappers), it should not be added.

If we could add the ability to define and parse custom literals at compile-time, I think this would be an excellent point to break out single quotes. They’ve basically been kept for a rainy day in Swift up to this point.

Saklad5 · September 30, 2021, 7:30pm

Actually, here’s a thought: if you allowed compile-time parsing, couldn’t this feature be used to bootstrap itself? You could use Pattern to interpret regex literals.

timdecode · September 30, 2021, 7:58pm

Very exciting! I'm going to link swift-parsing because a brief scan didn't find it linked already. It has a similar composable API minus the result builder syntax.

stephencelis · September 30, 2021, 11:41pm

I hope this means the team is considering Pattern as being able to process any generic kind of input?

For instance, you could process a URLRequest (or NIO request) into some domain-specific routing, which would benefit server-side Swift projects and API clients alike.

Here's a sketch of how that might look, interleaving parsers of URL requests with parsers of strings:

struct User: Decodable {
  let email: String
  let password: String
}

enum Route {
  case home
  case archive(page: Int?, perPage: Int?)
  case article(id: Int)
  case signUp(User)	
}

urlRequest.match {
  Alternation {
    Group {
      Method.get
      PathEnd()
    }
    .capture { Route.home }

    Group {
      Method.get
      PathComponents {
        "archive"
      }
      PathEnd()
      Optionally {
        QueryItem("page", Int.FormatStyle().capture())
      }
      Optionally {
        QueryItem("per-page", Int.FormatStyle().capture())
      }
    }
    .capture { Route.archive(page: $0, perPage: $1) }
		
    Group {
      Method.get
      PathComponents {
        "archive"
        Int.FormatStyle()
      }
      PathEnd()
    }
    .capture { Route.article(id: $0) }
		
    Group {
      Method.post
      PathComponents {
        "sign-up"
      }
      PathEnd()
      Body(decoder: JSONDecoder(), as: User.self)
    }
    .capture { Route.signUp($0) }
  }
}

(We do just this sort of thing in swift-parsing.)

ahti · October 1, 2021, 3:36am

The one most relevant that comes to mind was the id3 MLLT header (section 4.6 here). The first few bytes contain some metadata, plus two bit lengths. The rest of the frame is then filled with alternating integers of the two bit lengths. I don't have the code I wrote back then anymore, but here is a java implementation I found, which one could say actually employs some form of multi-tiered processing by switching to ParsableBitArray for the tail.

dhoepfl · October 1, 2021, 5:39am

Pleased to see that regular expressions in Swift.

A few notes:

Alternation is ambiguous
Think of OneOrMore { Alternation { "a" "b" } } - does this require a and b to alternate?).
I’d suggest OneOf, AnyOne, or AnyOf

How will tricky types for captures work, like:

OneOrMore {
   Alternation {
      OneOrMore(.digit).capture { Int($0) }
      OneOrMore(.whitespace).capture { MyEnum.whitespace }
      Optionally {
         Repeat(.letter)
      }.capture { String("Word: \($0)") }
   }
}

(Array of tuple of optionals?)

Will it be possible to change the / delimiter?
It is annoying if you have to escape each occurance of a path separator.
Maybe similar to Strings? 1-877-547-7272’s proposal would work (I’d use #/.../# though).

Will there be a multiline regex syntax? (Would break on lots of comments …)

 line
    .match(///
        ([0-9A-F]+)
        (?:\.\.([0-9A-F]+))?
        \s*;\s
        (\w+)
        .*
        ///)

… allowing inline comments?

 line
    .match(///
        ([0-9A-F]+)           // Hex scalar
        (?:\.\.([0-9A-F]+))?  // Optional scalar range
        \s*;\s                // Separator
        (\w+)                 // Property
        .*                    // Ignore rest
        ///)

1-877-547-7272 · October 1, 2021, 8:09am

I think this is a good time to consider adding union types to the standard library. We could make them using enumerations:

enum UnionOf2<First, Second> {
    case first(First)
    case second(Second)
}
// as well as UnionOf3, UnionOf4, etc.

or we could add union types to the language itself, similarly to tuples:

let x = (String | Int).0("example")
switch x {
case .0(let string):
    print(0, string)
case .1(let number):
    print(1, number)
}
let namedExample = (string: String | number: Int).string("example")

While the second one would be a more complex change, it would allow for infinitely many types within the union.

Union types would provide a good way to express regular expressions like this:

let r = /(\w+)|(\s+)/
print(type(of: r))
// prints RegEx<UnionOf2<Substring, Substring>>
// or prints RegEx<(Substring | Substring)>

Note that these union types aren’t the same as the union types mentioned in the commonly-rejected proposals list — they are fully compatible with Swift’s type system.

Union types have also been recently discussed in the precise error typing thread.

Yeah, after further consideration, I don’t think it’s a good idea to reuse " — regexes are different enough from strings that we’d want different syntax highlighting, and we wouldn’t be able to highlight regexes differently than strings if we reused string literal syntax (unless we added type inference to the syntax highlighting algorithm, which would create its own problems). I think #/.../# is an acceptable syntax, though we should still try to think of alternatives. Ideally the syntax we end up using would support a raw string–like syntax (i.e. #####/file/path/#####). I don’t think it’s as important to have multi-line syntax or inline comments — Patterns can be used to achieve a similar effect and inline comments aren’t allowed within multi-line strings anyway.

line.match {
    /([0-9A-F]+)/          // Hex scalar
    /(?:\.\.([0-9A-F]+))?/ // Optional scalar range
    /\s*;\s/               // Separator
    /(\w+)/                // Property
    /.*/                   // Ignore rest
}

JanWillemBrands · October 1, 2021, 10:13am

Alternation feels a bit ambiguous indeed, maybe Alternative is better ? In the context of converting from the large number of BNF variants out there, maybe this is a consistent set:

...|... this or the other thing becomes Alternative { ... ... }

[...] zero or one of these becomes Optional { ... }

{...} or (...)* zero or more of these becomes Repetition { ... }

<...> or (...)+ one or more of these becomes Repetition(from: 1) { ... }

(...)2..4 two or three of four of these becomes Repetition(from: 2, upto: 4) { ... }

(... ...) a bunch of things becomes Group { ... ... } but we might not need this because Alternative is bracketed.

xwu · October 1, 2021, 11:13am

Whitespace rules are such that I'm fairly sure there's no unavoidable ambiguity between infix / and the proposed literal.

xAlien95 · October 1, 2021, 11:31am

There could be ambiguity with custom pre/post-fix operators, though.

prefix operator /
postfix operator /

prefix func / (value: Int) -> Int { value }
postfix func / (value: Int) -> Int { value }

let a = 3
let b = /a/
let c = /a.hashValue/

This is valid Swift today, will the compiler warn about ambiguity? Will it infer b and c to be Ints instead of regular expression literals?

siracusa · October 1, 2021, 1:22pm

I endorse the Perl solution here, which is to allow any of a small set of "balanced" delimiters. For example, when matching against a file path in Perl, rather than doing this:

# "Leaning toothpicks" disease:
$path =~ /^\/usr\/local\/bin$/;

I'd do this instead:

$path =~ m{^/usr/local/bin$};

These also work:

$path =~ m[^/usr/local/bin$];
$path =~ m(^/usr/local/bin$);

I suspect the m will be scary to many, and arguably it's not very Swift-y, but it could be replaced with Regex or something similarly more verbose.

Maybe this is a more general problem with quoted strings in Swift…