[Pitch] Regex builder DSL

This jumped out at me when skimming through as well. If the Float and Substring matches are indeed non-optional, it would be great to explain in the text why it’s necessary and how it works in practice.

More specifically, returning nil signals a local match failure and the engine backtracks to try an alternative. throwing aborts pattern matching.

What to do with custom abort errors and how to surface them throughout API is an open question. CC @itingliu. I opened API design: Custom abort errors and API threading.

That would be the equivalent of (?:\w\s)+, that is a concatenation of the two classes. Instead, you'd probably want to say:

OneOrMore {
  CharacterClass {
    CharacterClass.word
    CharacterClass.whitespace
  }
}

And we have to use the explicit CharacterClass before the .word etc. because of limitations of result builder syntax. I believe @rxwei was showing a formulation that doesn't have that because it's using the initializers directly. (It's possible that we could omit the .characterClass part for this formulation, though).

That listing is not building a ChoiceOf (i.e. an alternation or ordered-choice of alternatives), i.e. the if statement gets resolved at result builder construction-time. I'm not sure how that type signature follows, though. @rxwei?

Thanks for this! It's been great to see the string processing repo grow over the last year-ish. My main question is, while I'd love to have this in the standard library, what are the motivations to having this in the stdlib as opposed to a package? There's as much precedent in other language ecosystems for pattern-matching capabilities being a separate package as there is built-in to the standard library, so I'd love to hear what your thoughts are.

While I like the API in general, I was wondering about the possibility of extending this syntax to non-String Collections and custom Element types.

6 Likes

Strong +1 to this, I'm hoping this would be available on [UInt8], which is a natural way to parse binary formats.

3 Likes

@rxwei Will it be possible to declare a regex pattern like this?

struct Pattern: RegexComponent {
    
    var regex: some RegexComponent {
        Regex {
            Capture {
                OneOrMore(.word)
            }
            "."
            Capture {
                OneOrMore(.word)
            }
        }
    }
    
}
1 Like

While I hope to have a chance to review the proposal in detail, I just want to say quickly while I have the time: nice work! This is a really cool proposal, and I’m delighted to see people exploring further into creating eDSLs like this in Swift.

Hmm, this is an interesting choice. So, given a block like this:

  if <1> {
   // captures Int
  } else if <2> {
   // captures Float
  } else if <3> {
   // captures nothing
  } else {
   // captures (String, Double)
  }

I assume the overall capture is one of these:

  • (Int?, Float?, (String, Double)?)
  • (Int?, Float?, Void?, (String, Double)?)
  • (Int?, Float?, String?, Double?)
  • (Int?, Float?, Void?, String?, Double?)

Abstractly, it seems like it would be better if we could produce a OneOf<Int, Float, Void, (String, Double)>, but of course that would require both variadic generics and some unclear way of testing and projecting out the different components.

7 Likes

I also think that a tuple with optional values is not ideal in Swift.

Has the idea of anonymous sum types - as an analog to tuples ever been explored?

Like (Int | String) which could perhaps be tested like:

let a: (Int | String) = 1
switch a {
  case .0(let value):
    () // value is an Int
  case .1(let value):
    () // value is a String
}

EDITED: Changed $0 to .0 and $1 to .1 in strawman syntax.

1 Like

I agree it looks clunky and could use some sugar. Note that this proposal does not introduce CharacterClass, so this is an out-of-scope example, and the clunky example uses what works in the existing implementation. CharacterClass will be introduced soon by @nnnnnnnn in a separate pitch. I imagine a simplified version could be something like:

OneOrMore {
    CharacterClass(.word, .whitespace)
}

I made an error in the example at the top of the detailed design section. buildEither(first:) and buildEither(second:) currently return non-optionals, and therefore require all branches to have the same return type. Here's the fixed example:

Regex {
   regex0 // Regex<Substring>
   regex1 // Regex<(Substring, Int)>
   if condition {
     regex2 // Regex<(Substring, Float)>
   } else {
     regex3 // Regex<(Substring, Float)>
   }
} // Regex<(Substring, Int, Float)>
Regex {
  let e0 = RegexComponentBuilder.buildExpression(regex0) // Component<Regex<Substring>>
  let e1 = RegexComponentBuilder.buildExpression(regex1) // Component<Regex<(Substring, Int)>>
  let e2: Regex<(Substring, Float)>
  if condition {
    let comp = RegexComponentBuilder.buildExpression(regex2) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  } else {
    let comp = RegexComponentBuilder.buildExpression(regex3) // Component<Regex<(Substring, Float)>>
    e2 = RegexComponentBuilder.buildEither(first: comp) // Regex<(Substring, Float)>
  }
  let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
  let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: e1)
  let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: e2)
  return r2
} // Regex<(Substring, Int, Float)>

I'm not quite sure if builder support is relevant for this. The way Capture(..., transform: ...) and TryCapture(..., transform: ...) work is by storing the failable/throwing closure inside the underlying regex program, and the closure will not be called until matching. The failable/throwing closure is not a component of the builder.

As for the naming of Capture and TryCapture, we could unify the both capture operations under Capture with different argument labels, e.g. transform: and tryTransform:. Would love to hear some opinions on this.

Yeah, in @John_McCall's example, we could have a tuple of optionals, but it may be confused with the result of ChoiceOf. An anonymous sum type or a sum type with variadic cases would be nice. Currently all branches are expected to have the same type.

Yes, and the protocol requirement var regex has already been marked with @RegexComponentBuilder. However the type of var regex would not be some RegexComponent, because the protocol requires a Regex<Output> type. Plus, a big difference between a regex and a SwiftUI view is that a user actually wants to access the regex output (Output) with non-opaque types. A working example would be:

struct Pattern: RegexComponent {
    var regex: Regex<(Substring, Substring, Substring)> {
         Capture {
             OneOrMore(.word)
         }
         "."
         Capture {
             OneOrMore(.word)
         }
    }
}

I did wonder if it would be possible to write var regex: Regex<_> and have the Output inferred, but @hborla reminded me that it would mean return type inference and complicate type checking.

2 Likes

In the fullness of time, we want to have a generic Pattern type with a builder DSL like this as shown in Declarative String Processing Overview to support general-purpose data processing. Regex is a scoped-down version of that, but aims to offer better familiarity and adopt regex conventions, e.g. to have .0 refer to the whole match followed by captures. Regex could conform to a future PatternComponent protocol and be used in a pattern builder block.

2 Likes

I would add that regexes make for a very poor data processing solution, e.g. due to their use of unrestricted backtracking. Not to mention that a huge swath of concerns are inherently string-specific, such as character classes, literals where elements are separated by adjacency, etc.

Far better would be a data pipeline system comprised of combinations of linear automata that run over a moving window (even if dynamically resizable) over asynchronous sources. This is clearly future work :grinning:.

5 Likes

I know this is a broader question than this pitch, but isn’t unrestricted backtracking a major problem even for string processing? What controls does the programmer have to avoid it under this model?

3 Likes

We support the typical: possessive quantification and atomic grouping. There's options to switch default to possessive, etc., which would bring regex in line with vanilla PEGs.

Future work include supporting more backtracking engine directives (ala PCRE).

I think adding engine limiters (future work) is particularly interesting, given that pathological behavior is often a function of what the input string ends up being. Given the matching engine is built for something more general purpose (originally PEGs), coming up with a good limiter story would help inform other domains.

5 Likes

Okay. I can understand why we might not want to change the default for regexp literals, since people will expect to be able to copy things like .*, from other sources, but should we consider that for these alternative forms and require backtracking to be more opt-in?

Regarding the naming - the argument of the map() operator is called transform, but at the call site we rarely see it. So when I read the named argument transform I immediately thought that I would have an easier time reading it if it was actually a map instead.

Would it be possible to make map a modifier to Capture?
Like:

Regex {
  Capture {
     OneOrMore(.word)
  }
  .map(User.init)
}

That would be the non-throwing scenario.

I don't know if such a spelling would be possible, but I think that I would prefer reading that over a transform argument.

4 Likes

Regarding throwing or non-throwing transforms:
Couldn’t this be implicit?
If the transform throws the parsing fails, if it can’t throw then the transform simply can’t be the source for failing?
I don’t think it would be surprising to the user even if the throwing and non-throwing APIs were the same.

Would an array be sufficient?

OneOrMore {
  [.word, .whitespace, "A"..."F"]
}

I would say that an array literal may very likely be confused with concatenation, despite the similarities to textual regexes. I think we could fold it under ChoiceOf. But as mentioned earlier, character classes are out of scope for this proposal.

A map would have very different semantics. With map you typically want to transform the entire generic parameter Output. However, Capture's transform only applies to the subpattern captured. More concretely:

// Non-nested capture
Capture {
  OneOrMore(.word)
} transform: { /* transform the captured content to T */ }
// => Capture<(Substring, T)>

// Nested capture
Capture {
  Capture {
    OneOrMore(.word)
  } transform: { /* transform the inner captured content to U */ }
  OneOrMore(.word)
} transform: { /* transform the outer captured content to T, NOT including the inner */ }
// => Capture<(Substring, T, U)>

But of course we could define a mapFirstCapture(_:), overload it 10 times and let it transform the first capture (T in the example above), but I think that would be less clear than having a transform: parameter.

TryCapture causes matching to fail if the transform closure fails, whereas Capture always succeeds. I think it comes down to clarity, and I don't feel too strongly about one way or another.

1 Like

I don't know what you mean by implicit. Here's the most reasonable behavior IMO:

  1. Return a .some T means it succeeded and matching should continue, storing the value in the capture
  2. Return a nil means matching failed and the engine should try an alternative (if there is one)
  3. throwing means that matching should abort, no further alternatives are tried.