Declarative String Processing Overview

:tada::tada::tada:
I could not agree with this more! I noodled around a bit with some possibilities in this area a few months ago and quickly realized that making a Pattern builder more terse or making a regex type more readable quickly leads to poor compromises.

There’s some bikeshedding that I’m sure will take place for the names used in a Pattern, but the overall direction here is simply fantastic.

23 Likes

Overall I really like the direction here. I had previously spoken against including classic regex syntax, but strongly typed captures and the idea of a refactoring action to expand a regex into a pattern tip the scales for me :slight_smile:

Wrt the bigger picture, I think the idea of good support for multi-tiered processing is very appealing, even beyond strings in data: I've implemented a few binary parsers in recent years, and was always unsatisfied with how different parsing libraries handled extracting information from bit ranges that don't line up with byte boundaries (e.g. parsing 4 6-bit numbers from a 3 byte section of data).

10 Likes

Big +1 to this. I really like the Pattern result-builder. As you say, it's more much readable, and if it can indeed generate efficient code, it would be an obvious improvement.

That is rather unforunate. The standard library only has an eager split; there is a lazy version in swift-algorithms, but that's a very big package and some projects may be reluctant to add such massive dependencies. We have a lazy filter, but it's a badly-behaved Collection which can lead to performance and code-size penalties. There is definitely room for improvement when it comes to the standard library's collection algorithms, even outside of this effort.

This looks really great. The biggest issue I see is that result-builders don't allow you to control type resolution. Adding types like "Group", "Newline", or "Repeating" to the top-level standard library seems problematic. The topic has come up several times before, though, so there is certainly interest in the community.

Is it possible for the compiler to disambiguate the result-builder from the regular contains(where:) function we have today?

Can we do this with RangeReplaceableCollection? It has... certain quirks, you might say (little things, like invalidating all indices whenever you mutate the collection) that make this difficult to implement generically. I'd definitely be +1 to replacing RRC itself to remedy this.

If the matches are returned as slices and then processed in bulk, using them for replacement means quadratic performance due to excessive COWs. Replacing-as-you-go can be more efficient, but only if you can stop the collection removing excess capacity until all operations are finished. IMO, that would be a critical part of an RRC replacement.

6 Likes

Really love this, super excited to see it make its way into Swift!

Yeah, this is unfortunate. Result builders do allow you to control type inference, so we could probably have a solution that surfaced these as .group, .newline etc. without further language extensions, but then we lose the nice type-based syntax. We could always put these in some sort of Patterns module I suppose, so as not to pollute the global namespace with the cost of requiring an extra import.

5 Likes

Super excited! Only question I have at the moment is, would it require newer iOS/iPadOS SDK to use on those platforms?

Declarative parsers like this are really powerful, and conceptually I find a lot easier to understand.

I’ve being playing around with similar ideas with generator functions in JavaScript (GitHub - RoyalIcing/yieldparser: Parse using JavaScript generator functions — it’s like components for parsing). I think of the model as being akin to components for parsing. The power of components is that allow you to compose larger solutions out of smaller solutions, and they force you to give those smaller pieces a name.

import { parse, mustEnd } from 'yieldparser';

function* Digit() {
  const [digit]: [string] = yield /^\d+/;
  const value = parseInt(digit, 10);
  if (value < 0 || value > 255) {
    return new Error(`Digit must be between 0 and 255, was ${value}`);
  }
  return value;
}

function* IPAddress() {
  const first = yield Digit;
  yield '.';
  const second = yield Digit;
  yield '.';
  const third = yield Digit;
  yield '.';
  const fourth = yield Digit;
  yield mustEnd;
  return [first, second, third, fourth];
}

parse('1.2.3.4', IPAddress());
/*
{
  success: true,
  result: [1, 2, 3, 4],
  remaining: '',
}
*/

Another interesting library is elm/parser (GitHub - elm/parser: A parsing library, focused on simplicity and great error messages) which uses parser combinators. Their prior art is also worth reading: parser/comparison.md at master · elm/parser · GitHub

2 Likes

While I (sort of) see the appeal of Patterns, I find Raku’s Grammars to be a more useful higher-level structure for regexes. Could such a thing be built in Swift without any language features beyond those proposed above?

5 Likes

There’s a lot to think about here, and I’m still taking it all in.

As a first point though, I want to strongly agree with this:

The examples shown with regexes are essentially opaque to the reader. Even knowing what regular expressions are and how they work, their visual appearance is inscrutable.

Swift strongly values clarity at the point of use, and regular expressions are antithetical to that. Normal Swift code reads nearly like a grammatical English sentence. Regular expressions (with existing syntax) read like a jumble of arbitrary symbols.

I think that we should view regexes as a compatibility feature, not a recommended usage pattern. Something to avoid whenever possible, not something to reach for in the common case.

If we can make a convenient and powerful pattern-matching syntax that reads like normal Swift code, that will be amazing. Then we can recommend using that for all situations, and relegate regexes to the margins.

16 Likes

I love everything about this! Love seeing this happening!! :heart_eyes:

4 Likes

I love the idea of this, thank you so much for writing this.

One thing I would improve is being able to bind variable names where you declare them. In your graphemeBreakPropertyData, it'd be cool if we can somehow declare the variable names inside the result builder, rather than using captures. I realise this is not a trivial change and something that needs a lot more thinking / designing.

16 Likes

I’m excited to see this get addressed! I love the flexibility of having both RegEx and Pattern.

I do have concerns about RegEx literal syntax, though. / is already a valid prefix and postfix operator, so code like /a/ could be reinterpreted. I think we should consider reusing ", like Swift does with Characters (instead of using ' like other C-style languages). For example:

let regex: RegEx = "([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*"

or

let regex: RegEx = #"([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*"#

This wouldn’t be a simple String — the RegEx would still be evaluated at compile time and there would ideally be better syntax highlighting. We may also want to change the meaning of \ escapes (like in the first example), but we could instead keep \ the same and suggest the use of raw strings (like in the second example). Using " would give us valuable things from string literals, like

  • Multi-line literals
  • Raw string syntax
  • String interpolation syntax (if we decide not to change the meaning of \) that could be used to combine RegExes.

Also, what does . mean? The proposal seems to use “scalar” and “character” interchangeably, but they mean different things in Swift (“character” usually refers to a grapheme cluster, not a Unicode scalar). I think it should refer to a grapheme cluster — it would encourage correct string processing, which is a goal of Swift. However, it should be noted that no other regular expression implementation that I know of uses . to refer to a grapheme cluster — even the Unicode ICU implementation.

3 Likes

Also, does this imply we’re taking Group out of SwiftUI and putting it into the standard library? I think it’s a good idea, but would the ABI support it?

This looks like a fantastic start! Great work to all involved.

You're right to notice this issue; we're thinking we might need to propose a result builder feature to fix it. Part of the reason we're posting this document is so that, when we start proposing the individual features needed to make regexes and patterns work, Evolution will have an idea of how these random-looking features will fit together.

The Big Picture document suggests that metacharacters might shift meaning based on the string view you're matching, so . would match a Character when the target was a String, but it would match a UnicodeScalar when the target was a String.UnicodeScalarView.

15 Likes

We could work around this using dot notation.

// Assuming `PatternBuilder` constructs patterns from blocks of type `PatternBlock`
struct PatternOfOneOrMore: PatternBlock { ... }
enum PatternBlockShortenedNames: PatternBlock {}
extension PatternBlock where Self == PatternBlockShortenedNames {
    typealias OneOrMore = PatternOfOneOrMore
}
line.match {
    .OneOrMore(.hexDigit)
}

I would put Group in the standard library, though — it’s pretty useful in result builders (SwiftUI already uses it with eight different result builders!).

Does this need to be in standard library? The result builder part feels like it could sit in a separate lib just fine.

7 Likes

What formats have you had to parse?

The pattern I would recommend for this kind of task is a strongly-typed wrapper around the raw bytes, with computed vars that will do the masking and shifting.

struct EncodedPixel: RawRepresentable {
  var rawValue: (UInt8, UInt8, UInt8)

  var red: UInt32 { ... unpack ... }
  var green: UInt32 { ... unpack ... }
  var blue: UInt32 { ... unpack ... }
  var alpha: UInt32 { ... unpack ... }
}

A failable init can do error checking, if necessary for your format. A encoded pixel parser's conformance to something like CollctionConsumer (as an example interface for composition) would just try to init and advance 3 bytes.

Definitely improvements to be made to the current laziness situation (CC @timv). Even with an ideal presentation of laziness, it's not quite a drop-in for the script as written (e.g. use of integer subscripting).

A potential issue with approaches which accumulate and transform closures is that they rely on Swift's general-purpose optimizer, whose inlining heuristics are not necessarily tuned for your exact use case :-). It can also be foiled by such pesky real-world concerns as separate compilation.

It's still early, but we're hoping to leverage static compilation and/or staged compilation techniques for the parser library, inspired somewhat by work done in Scala. @rxwei will be exploring this more deeply.

I'm hoping we can extend result builders by allowing conformers to provide a type to serve as a namespaces for lookup, similar to custom string interpolation.

That would also open the door to defining and using custom operators (e.g. postfix +) inside of a result builder without bloating their global overload set.

We'll definitely have to figure out trailing closure ambiguity here. This seems like an issue that will keep coming up with broader application of result builders.

There's also more potential ambiguity with collection algorithms. For example, str.contains("A") will currently resolve "A" as a Character because there's only the Element-taking variant. But if there's an overload for a collection-with-the-same-element, it would prefer String for a literal. And, we're showing the use of literals inside a result builder to represent a literal subpattern. Semantically they would all be equivalent, but we like code to keep calling the functions they think they're calling :-).

We'll have to tackle these problems as we deploy the collection algorithms. For the consumers/searchers prototype way back when, I fell back to using argument labels to differentiate, but hopefully we can provide something cleaner here.

@timv is exploring these algorithms in more depth.

I agree that RRC is basically broken for batch operations and wrote more about it here. A version of replaceSubrange that returns the newly created range can help in some cases, but would still be inefficient.

At the time I was hoping more work on ownership and generalized accessors would make the design pattern obvious. This is definitely a bridge that needs to be crossed.

We defintely want to ship runtime support as part of Swift. Whether you access API directly from the Swift default module or need an explicit import Pattern is still up for debate. I'm hoping that result builder namespaces makes this decision clearer.

Yup, I'm a big fan of the programmer experience afforded by parser combinators and will defintely check out that work in Elm. As mentioned in the big picture, we also want to be able to leverage compiler technology to get around some of the performance issues with parser combinator libraries, especially since Swift supports separate compilation.

7 Likes

I love this. However, I have one question. With either method, regex or pattern, how can you determine how many characters were consumed?

Setting aside Patterns and Raku Grammars for now, and just looking at "normal" regexes in Swift, everything looks OK right up until we hit the .capture part of the example in this proposal:

line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)?
    .captures.flatMap { (l, u, p) in
      guard let property = Unicode.GraphemeBreakProperty(p) else {
        return nil
      }
      let scalars = Unicode.Scalar(hex: l)! ... Unicode.Scalar(hex: u ?? l)!
      return (scalars, property)
    }

Something like this looks much nicer:

let (l, u, p) = line.match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)

…but then there's the pesky question of types (and "Swift-y" error handling).

I'm still not sure the proposed feature meets the "simple things should be simple" guideline. Having native regexes is great for eliminating the sprawl of manual parsing code, but if each actual use of a regex brings along its own "sidecar sprawl" of capture code, that's not great either.

3 Likes

We definitely plan to devote a focused follow-up proposal to the algorithm signatures. What you propose is actually where we started!

We decided to add the explicit captures to these examples to distinguish between the matched substrings and the captured substrings. The idea is match might be defined something like:

func match<Capture>(_ pattern: Regex<Capture>) -> MatchResult<Capture>?

struct MatchResult<Captures> {
    var matches: Matches // Collection where Element == Substring
    var captures: Captures
}

Because you might not care about the captures and just want to enumerate the matched substrings:

let line = "007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>"
line.match(/[0-9A-F]{4,6}/)?.matches.forEach {
    print($0)
}

// Prints "007F"
// Prints "009F"
// Prints "007F"
// Prints "009F"

Alternatively, extracting matches and captures could be defined as two different algorithms.

5 Likes
Terms of Service

Privacy Policy

Cookie Policy