SE-0350: Regex Type and Overview

There is no OS dependency.

The Swift runtimes shipped in those toolchains have this work incorporated into them. When applications are deployed to ABI-stable platforms, they use the copy of the runtime available on that OS.

The costs of back-deployment include much more than OS support; they include process changes involving how to carve up binaries under certain conditions.

I would love to see this work back deployed as much as anyone. Beyond that I don't have much say in the matter or visibility into the hairy process details.

3 Likes

And do we have any numbers for those things, from prototypes or other sources? Or is it just speculation/FUD? The proposal doesn't discuss them, and I haven't seen those numbers elsewhere. Complex nested generic types are favoured by important frameworks such as SwiftUI and Combine, as well as language features such as result builders, so it is reasonable to expect significant optimisation effort around that pattern. I'm not sure we should be so discouraged that we don't even take a look.

So you're saying it is no longer reasonable to discuss alternative implementation approaches in swift-evolution?

Unfortunately, it is not feasible to build my own prototype or do anything other than theorise. There has been an entire team building the regex engine as their actual, full-time jobs for the last 6 months, whereas I am a lone volunteer who can sometimes find time to share my development experience/ideas with the community. The review runs for 10 days, start to finish.

The points that are reasonable/not reasonable for community members to make must be seen in that context.

The proposal authors made a decision about the object model, and it must have been justified on some basis. If they didn't perform any experiments (which is what I'm asking about), then they too were theorising about the potential benefits of their design relative to others. The alternatives I'm theorising about are straw men to probe the thinking of the proposal authors and discover why the model is the way it is, because the proposal doesn't mention those considerations.

I don't feel the discussion points were actually addressed, but I also don't have the inclination to keep asking or engaging with this prolonged announcement. Community opinions are already worth nothing (even among the core team, it appears only a few opinions are actually meaningful), so it is quite literally a waste of time. Rather than a spirit of "jolly co-operation", the attitude here seems to be more hostile and abrasive, and I don't care for it. Discussion successfully shut down.

5 Likes

You could go to any struct in any API and argue that it should instead be a protocol. But to do so, you should bring rationale as to why.

Your rationale was based on a mistaken understanding of how compilation works. I addressed this in my replies.

1 Like

This is an approach that has been tried a few times in a few different languages (including by a few members of the Swift Standard Library and Core teams), and while it can produce attractive microbenchmarks, it has almost always proved to be a bad idea at the macro scale. In particular, as Michael mentioned, even if we set aside witness tables and other associated Swift generics overhead, optimizing a fixed pipeline for each pattern you want to match causes significant code-size expansion when there are multiple patterns in use, as compared to a more flexible bytecode interpreter. A bytecode interpreter makes better use of instruction caches and memory, and can also benefit from microarchitectural resources that are shared across different patterns. There is a tradeoff w.r.t. branch prediction resources, where separately compiled patterns may have more decisive branch history data, but a shared bytecode engine has much more data to use; this tradeoff tends to fall on the side of a bytecode engine, but it does not always do so.

I should also note that nothing actually prevents AOT or JIT compiling of a restricted bytecode if we believe it will be advantageous, but compiling or interpreting arbitrary Swift code at runtime is rather more unattractive, since both the type system and language are undecidable.

Even absent this rationale, we would probably not go with an approach that encodes regex programs directly into the type system, simply because it is more complex while delivering little benefit. But it also brings very real drawbacks, which is why we have avoided it here.

6 Likes

Pleased that the overall effort is coming together. Taken as a whole, this is a significant feature and fits well (certainly, better than the existing facilities) with Swift's overall string processing story. I have used regular expressions in other languages, but it's hard to compare, as no single proposal under review details how the final feature will look in Swift. I have followed many parts of these interconnected proposals closely.

With respect to the specific type introduced as part of this proposal, I have a few nits to pick:

  • I wonder about the naming of Regex(compiling:). What else could it be doing with the supplied argument other than compiling it? Even if it could be something other than compiling, is that a plausible misinterpretation if the argument label were omitted? Overall I feel like Regex(_:) would be both more succinct and more readable (see the call-site sketch after this list).

  • It's not clear to me whether String.firstMatch(of:) is being proposed here alongside Regex.firstMatch(in:), but this is a nice example of where the prepositions matter in terms of clarifying what's going on. I wonder why it's not similarly Regex.matchPrefix(of:) rather than Regex.matchPrefix(_:) as it's currently proposed. Similarly, Regex.matchWhole(_:) would read a bit oddly at the call site and I wonder if some massaging of the name would help. (Maybe, Regex.matchEntirety(of:)? But that seems a little bleh.)
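For the first nit, a quick call-site comparison (Regex(_:) here is the suggested alternative spelling, not the API as proposed):

// As proposed: the argument label names the operation.
let proposed = try Regex(compiling: #"\d+"#)

// The suggested alternative, with the label omitted
// (hypothetical; not part of the proposal as reviewed).
let suggested = try Regex(#"\d+"#)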

6 Likes

Are matchWhole(_:) and matchPrefix(_:) equivalent to anchored patterns (^, $, \A, \Z, \z)?

Will it be possible to enumerate all of the (possibly overlapping) matches?
By using firstMatch(in:) with adjusted substrings, or another API?

Can the AttributedString type (and related views) participate?
Can the compiler infer AttributedSubstrings rather than Substrings?

2 Likes

The argument can be quoted, e.g. Regex(quoting: myString). This would be similar to try Regex(compiling: #"\Q\#(myString)\E"#), but not exactly the same, as any interior \E would terminate the quote early.
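A sketch of that near-equivalence (assuming Regex(quoting:) is non-throwing, since quoting a string cannot fail):

let myString = #"1+1 and \E oops"#

// Fully literal: matches the characters of myString exactly.
let quoted = Regex(quoting: myString)

// Nearly the same, but the interior \E in myString ends the \Q...\E
// quote early, so the rest is parsed as regex syntax again.
let spelled = try Regex(compiling: #"\Q\#(myString)\E"#)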

Compiling a regex from a run-time string is not something we expect to see used everywhere the way literals are. They will most likely appear as their own declaration in source code rather than be used directly as an expression.

E.g. str.firstRange(of: /.../) is more common usage than str.firstRange(of: try! Regex(compiling: someUserString))

The string/collection algorithms will be proposed in the dedicated proposal.

I have a PR up that switches the spellings of these to all be noun phrases:

try regex.firstMatch(in: input)
try regex.wholeMatch(in: input)
try regex.prefixMatch(in: input)

I think this is a little clearer and makes more consistent use of the word "match" in the API.

For algorithms, the preposition of is used.

input.firstMatch(of: regex)
input.wholeMatch(of: regex)
input.prefixMatch(of: regex)

The operations hosted directly on Regex are the 3 core ways of using patterns and these surface the full information of a match failure via throws. A thrown error, e.g. from a custom component, terminates matching early and that error will be surfaced through the Regex API. This is a more advanced, library-oriented use case than the convenience algorithms. The algorithms are implemented in terms of these via try?, i.e. they coalesce failure reasons.
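A sketch of that layering (illustrative only, not the actual standard library implementation):

extension String {
  // The convenience algorithm coalesces the core operation's thrown
  // failure reasons into a simple nil via try? (SE-0230 flattens the
  // resulting optional).
  func firstMatch<Output>(of regex: Regex<Output>) -> Regex<Output>.Match? {
    try? regex.firstMatch(in: self)
  }
}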

3 core operations:

  • Try to match from the front of the input
    • prefixMatch
    • Customized via the protocolized parser interface (see below)
  • Try to match the entire input
    • wholeMatch
    • Customized via the result builder TryCapture transform closures
  • Try to find the first match in the input
    • firstMatch
    • Customization is future work, involves modelling search state

Regex algorithms are powerful enough that you could emulate any one in terms of the others. For example, firstRange(of: /pattern/) can be emulated by getting the first capture from wholeMatch(of: /.*?(pattern).*/).
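A sketch of that emulation (assuming regex literals from the companion proposal):

let input = "one pattern two"
if let match = try? /.*?(pattern).*/.wholeMatch(in: input) {
  let capture = match.output.1       // the Substring matched by (pattern)
  let range = capture.startIndex..<capture.endIndex
  print(input[range])                // "pattern" — the range firstRange(of: /pattern/) would find
}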

However, they are distinct for the purposes of customization and use in code. For example, Int.init?(_: String, radix: Int) by itself would be a wholeMatch-style int-validator, failing on overflow or unrecognized characters. Some kind of int-parser, which might have configurable behavior for backing off on overflow or treating unrecognized characters as the end of the number, would be prefixMatch-style. An int-searcher could remember information about the input to speed up subsequent search attempts for the purposes of matches(of:), which invokes firstMatch in a loop.
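For concreteness, using the standard library's existing behavior:

// Int.init?(_:) behaves like a wholeMatch-style validator: the entire
// input must parse, or the result is nil.
Int("123")                   // 123
Int("123abc")                // nil — unrecognized trailing characters fail the whole match
Int("99999999999999999999")  // nil — overflow fails
// A prefixMatch-style parser (hypothetical here) would instead consume
// "123" from "123abc" and report the index where it stopped.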

We've only formally protocolized the matchPrefix interface, which is the key way to interact with parsers. Others are future work.

3 Likes

Apologies in advance if this is feedback in the wrong review thread; it was a bit unclear to me where to put it and still reach the right people.

I just wanted to point out ripgrep@GitHub as a reference benchmark for regexp engine performance - it might be useful to benchmark against (one would need to build a 'swiftgrep') to see how the new engine's performance stacks up (maybe you're already familiar with ripgrep).

Admin / review manager, feel free to break this out if inappropriate in this thread.

An allOverlappingMatches API is future work. This can be achieved by collecting all non-overlapping matches from every start position in the input. Ideally we'd utilize knowledge from matching itself to speed this up, and how to do so effectively will be more apparent as we develop and optimize the engine more.
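A minimal sketch of that emulation on top of the proposed firstMatch(in:), assuming its Substring overload (a real allOverlappingMatches would be smarter than this):

func overlappingMatches<Output>(
  of regex: Regex<Output>, in input: String
) throws -> [Regex<Output>.Match] {
  var matches: [Regex<Output>.Match] = []
  var start = input.startIndex
  while start < input.endIndex,
        let match = try regex.firstMatch(in: input[start...]) {
    matches.append(match)
    // Resume just past this match's start so overlapping matches
    // beginning later are still found.
    start = input.index(after: match.range.lowerBound)
  }
  return matches
}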

For now, you can get the ranges over the backing string and convert the indices. I don't recall the status of API to do so. Applying regex to types other than strings is future work, as is interweaving concerns such as matching arbitrary attributes or localization. I am very interested in processing tiered textual data, but we're already covering and proposing a lot here.

2 Likes

Right, performance is a huge area in and of itself. For us, we're proposing high-level Unicode semantics by default, so the high-order bit is skipping Unicode analysis when we don't need it. We're compiling down to a bytecode, so a big area for further performance improvements is compiler optimizations. Simple peephole optimizations can go a long, long way, but so can more complex analysis such as finding the first set of required terminal characters. Additionally, there are various execution strategies such as breadth-first search over an NFA when the regex (upon analysis) doesn't need determinism. And, of course, directly emitting a DFA for very simple regex.
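As a toy illustration of the peephole idea (the instruction names below are invented for the sketch, not the engine's actual bytecode):

enum Inst: Equatable {
  case matchChar(Character)
  case matchString(String)
}

// Fuse runs of single-character matches into one string match.
func peephole(_ program: [Inst]) -> [Inst] {
  var out: [Inst] = []
  for inst in program {
    if case .matchChar(let c) = inst, case .matchString(let s)? = out.last {
      out[out.count - 1] = .matchString(s + String(c))
    } else if case .matchChar(let c) = inst {
      out.append(.matchString(String(c)))
    } else {
      out.append(inst)
    }
  }
  return out
}
// peephole([.matchChar("a"), .matchChar("b")]) == [.matchString("ab")]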

For the purposes of API review, the key is making sure that API doesn't preclude performance improvements.

2 Likes

That’s fair enough - just wanted to share it in case a benchmark was desired (see "ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}" on Andrew Gallant's Blog for more details). Thanks.

1 Like

CC @jmschonfeld for ease of use with AttributedString and whether there's some easily-addressable API friction. (With the caveat that this is future work and formally outside the scope of the proposals.) Friction is most likely solved through API on AttributedString. But, it would definitely be interesting and clarifying to see how it plays out using the current API.

Yes I agree with Michael that based on this proposal, you will likely have to match ranges/indices over the backing string, and then convert those indices to AttributedString.Index types. We do offer initializers for AttributedString.Index and Range<AttributedString.Index> that convert from String.Index and Range<String.Index> respectively to help with this (for example, see this API). The future direction that Michael mentioned about applying regex to non-String types could help a lot here in the future if these features eventually supported a generic Collection where Element == Character (which includes both String and AttributedString.CharacterView).
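A sketch of that flow, assuming the Foundation conversion initializers linked above together with the proposal's Regex(compiling:):

import Foundation

var text = AttributedString("call 555-0100 today")
let plain = String(text.characters)

let phone = try Regex(compiling: #"\d{3}-\d{4}"#)
if let match = try phone.firstMatch(in: plain),
   // Convert the match's Range<String.Index> into the attributed string.
   let attrRange = Range(match.range, in: text) {
  text[attrRange].link = URL(string: "tel:5550100")
}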

1 Like

This is going to be a little OT, but:

I'm interested in how Processor -> mutating func cycle() is implemented, mainly the switch opcode part:

mutating func cycle() {
  switch opcode {
  case .invalid: fatalError("Invalid program")
  case .nop: …
  case .decrement: …
  // other…
  }
}

I may be wrong, but in the current Swift this will be a branch prediction miss on every switch (well, unless we have a long sequence of identical opcodes).

I assume that this will be compiled as a jump table (I will check in the morning, it's 12pm in Wrocław and I don't have Swift 5.5). But even then the jump itself is shared by all executions (Processor.cycle runs in a tight while true loop), so the CPU can't guess where to jump next (or more precisely: it guesses incorrectly).

The obvious fix is to dispatch the next jump at the end of every instruction handler, but this kind of requires a goto-like feature in the language. For reference, this is what CPython does, where DISPATCH is basically #define DISPATCH_GOTO() goto *opcode_targets[opcode].

Does this matter?

Not really.

It would matter if we had a very long sequence of bytecode to execute. Even then, I assume that the actual micro-speed of processing a single instruction would not be critical for performance.

Anyway, the reason why I am interested in it is that the Violet - Python VM written in Swift also does the switch thingy.

The question is: was any alternative implementation of Processor -> mutating func cycle() considered? This is one of the perfect use cases for goto, which Swift does not have.

EDIT: I forgot that you can also solve this with tail calls, but Swift does not guarantee tail call optimisation.
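For what it's worth, one alternative sometimes tried in Swift is a table of per-opcode handler closures (everything below is a hypothetical toy, not the proposed engine). Note that this still funnels every dispatch through a single shared indirect call site, so it doesn't reproduce computed-goto dispatch either; only guaranteed tail calls or a real goto would.

enum Opcode: Int { case invalid, nop, decrement, halt }

struct Processor {
  var program: [Opcode]
  var pc = 0
  var accumulator = 10
  var running = true
}

typealias Handler = (inout Processor) -> Void

// One handler per opcode, indexed by rawValue.
let handlers: [Handler] = [
  { _ in fatalError("Invalid program") },   // .invalid
  { _ in },                                 // .nop
  { $0.accumulator -= 1 },                  // .decrement
  { $0.running = false },                   // .halt
]

var p = Processor(program: [.nop, .decrement, .decrement, .halt])
while p.running {
  let op = p.program[p.pc]
  p.pc += 1
  handlers[op.rawValue](&p)
}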

3 Likes

This was once true of simple branch predictors, but modern branch target predictors (as seen on microarchitectures from the last decade or so) do not work like this at all. It's a few years out of date already, but https://hal.inria.fr/hal-01100647/document is the usual reference. The key takeaway is that mispredict rates for this style of workload have dropped from 10-20 per 1000 instructions down to 1 per 1000 or less, so that the traditional tricks are frequently unnecessary.

In any event, this is fundamentally tangential; the API is what's under review. This would be a concern if the API surface prevented us from adopting an approach with better performance in the future, but that doesn't apply here.

10 Likes

Review Update

The period for the review has concluded. The core team has decided to start a second review covering some amendments to the proposal.