SE-0350: Regex Type and Overview

Ben_Cohen · April 5, 2022, 4:35am

Hello, Swift community.

The review of SE-0350: Regex Type and Overview begins now and runs through April 18, 2022.

This review is part of a collection of proposals for better string processing in Swift. The proposal authors have put together a proposal overview with links to in-progress pitches and reviews. This proposal introduces the fundamental type, Regex, to the standard library, and outlines how it will be used, including setting up future proposals. It will be run simultaneously with one of the ways to create the Regex type: from a result builder DSL.

As with the concurrency initiative last year, the core team acknowledges that reviewing a large number of interlinked proposals can be challenging. In particular, acceptance of one of the proposals should be considered provisional on future discussions of follow-on proposals that are closely related but have not yet completed the evolution review process. Similarly, reviewers should hold back on in-depth discussion of a subject of an upcoming review. Please do your best to review each proposal on its own merits, while still understanding its relationship to the larger feature.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. If you do email me directly, please put "SE-0350" somewhere in the subject line.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

As always, thank you for contributing to Swift.

Ben Cohen

Review Manager

SDGGiesbrecht · April 5, 2022, 4:58am

Is there a good reason it is specific to String, and not generic over the searched collection? I can easily imagine cases where users would need to apply it to String.UnicodeScalarView or Data instead to properly satisfy a particular specification. Some such types would make little sense with the literals, but all could benefit from the DSL representation.

(If it already came up in the pitch thread, just link to the existing conversation; no need to repeat yourself.)

stevapple · April 5, 2022, 6:00am

Sorry for messing up these two — the one you’re expecting is described in the big picture, which I would see as a detailed and clear future direction, or something like "the 2nd stage".

hassila · April 5, 2022, 6:28am

@Michael_Ilseman not sure best way to feedback, but in "the big picture" linked above you asked:

By "event processing" (working title, suggestions welcome), I mean being able to detect and respond to patterns over ephemeral "events" issued from an asynchronous source.

I believe "event stream processing" or "complex event processing" are fairly well established terms of art - given that we have AsyncStream I'd lean towards the former as the best fit for Swift.

If you'd want to avoid two words going with "stream processing" would probably be better than "event processing" as it alludes to the temporal nature of the processing.

(on a side note, very interested in where the 'data processing' ends up, especially if low-level efficient binary deserializers can be realised with an ergonomic API it would be awesome)

SDGGiesbrecht · April 5, 2022, 7:27am

That document only seems to indicate that it was a consideration at some point; it does not say why it was discarded or deferred. To rephrase my question, what is the reason we are not doing the following?

struct Regex<Element, Output> where Element: Equatable {
  func matchWhole<C>(_ c: C) throws -> Regex.Match<C>?
    where C: Collection
  // ...
  public struct Match<Searched> {
    public let range: Range<Searched.Index>
    public var output: Output
    // ...
  }
}

Because then...

let regex = Regex {
  OneOrMore(0x00)
}
let match = regex.firstMatch(
  in: Data([0x01, /* → */ 0x00, 0x00, 0x00, /* ← */ 0x01, 0x00])
)

Michael_Ilseman · April 5, 2022, 7:09pm

Regex is very heavily geared towards textual data, as nearly every construct, concept, and option beyond the core operations only make sense for text. Even the core operations themselves are still pretty heavily geared for search within text and require a concept of position (i.e. Collection.Index).

A question regarding the DSL representation is whether the DSL should provide a radically different model than the literal. Since the DSL is producing Regex instances and can embed literals, it is very much a Regex DSL. Significant divergence would be uncanny.

The original overview from way back when discussed a Pattern<T>, implying more of a parser combinator system with recursively nested structural captures and history preservation. That's future work and difficult to incorporate into result builders (and the type system) as they are today. Trying to make the DSL be both a Regex DSL and a Pattern DSL leads into an uncanny valley that fails to serve the needs of either. Instead, a future Pattern can interoperate with Regex through the custom components interfaces (see CustomMatchingRegexComponent).

Regexes excel at finding needles in haystacks, but they are poor at recognizing the recursive structure of stacks of hay. FWIW, we're developing the fundamental capabilities that underlie both and treating regex as a particular presentation of those capabilities. PEG-like systems (including parser combinators) would have a different presentation style, ideally with significant improvements to Swift's DSL story (either better result builders or something to supersede them). This is outside the scope of these proposals, but I am always happy to discuss the topic.

hassila:

@Michael_Ilseman not sure best way to feedback, but in "the big picture" linked above you asked:

By "event processing" (working title, suggestions welcome), I mean being able to detect and respond to patterns over ephemeral "events" issued from an asynchronous source.

I believe "event stream processing" or "complex event processing" are fairly well established terms of art - given that we have AsyncStream I'd lean towards the former as the best fit for Swift.

If you'd want to avoid two words going with "stream processing" would probably be better than "event processing" as it alludes to the temporal nature of the processing.

(on a side note, very interested in where the 'data processing' ends up, especially if low-level efficient binary deserializers can be realised with an ergonomic API it would be awesome)

While data processing can use the same underlying capabilities of the matching engine powering Regex as presented, it would benefit from a very different presentation than Regex. Most notably, local backtracking within a moving window over asynchronous sources. This is outside of the scope of these proposals.

I'll leave it up to the review manager to decide if we should spin off a thread for further discussions around non-String processing.

As for the semantics of processing Unicode scalars or bytes with Regex, here's some clarifications:

Models of string

The primary, or default, model would be String's model in which characters are extended grapheme clusters comparable under canonical equivalence. Thus . behaves the same as dropFirst().

Degenerate grapheme clusters is a nuanced topic, but a good rule is that if the original input did not contain any degenerates no "ordinary" operation on String would produce one. You'd have to drop down to a scalar or lower view or otherwise request sub-grapheme-cluster processing. For example, this means that while we may not require a grapheme break to be present between two adjacent scalars in a regex, we would require a grapheme break around any captured subpatterns (including the overall match), at least when in grapheme-semantic mode.

String also provides a UnicodeScalarView and regex can operate over that using scalar semantics, which is commonly referred to as "Unicode mode" in other engines. There, . corresponds to unicodeScalars.dropFirst() and comparison uses the raw value. This can be presented as API in a variety of ways, for example:


myString.unicodeScalars.firstMatch(of: regex)

myString.firstMatch(of: regex.scalarSemantic)

myString.firstMatch(of: regex, options: .scalarSemantic)

Which are just different ways to surface the semantic modes as API.

Another thing discussed would be a so-called byte-semantic mode, corresponding to String.UTF8View for validly encoded content and perhaps including other input sources permitting invalidly encoded content. There, ASCII literal values would be interpreted as their encoded values, non-ASCII literals as a series of their UTF-8 encoded values, and byte literals to match everything else.

All of this is currently being developed as part of the Unicode for String Processing proposal. Unfortunately I do not know the current status or details of that proposal, especially regarding byte-semantic mode. @nnnnnnnn, anything to clarify here?

Karl · April 6, 2022, 7:29pm

Why is Regex a single type?

I noticed this when reading the DSL proposal, but it's really about the Regex type itself, so I thought it was more appropriate to post the question here.

One of the goals of the regex builder DSL is allowing the developers to easily compose regexes from common currency types and literals, or even define custom patterns to use for matching. We introduce RegexComponent, a protocol that unifies all types that can represent a component of a regex.
public protocol RegexComponent {
 associatedtype Output
 @RegexComponentBuilder
 var regex: Regex<Output> { get }
}

The thing that struck me were the similarities to SwiftUI. You could imagine a conformance to RegexComponent, with its single, result-builder property regex would look something like a SwiftUI View and its body.

In SwiftUI, the result-builder captures information about the view hierarchy in its generic type, and that is propagated because the body property returns some View - i.e. a bespoke type, for its subtree.

That's really nice for composability - it means I can take chunks of my view hierarchy, encapsulate them in to computed properties also returning some View, and easily split a complex view in to manageable units:

// Before:

var body: some View {
  VStack {
    Group {
      // Header stuff
    }
    GroupBox {
      // Login box.
    }
    Group {
      // More content
    }
  }
}

// After:

var header: some View {
   Group {
      // Header stuff
    }
}

var loginBox: some View {
    GroupBox {
      // Login box.
    }
}

var footer: some View {
    Group {
      // More content
    }
}

var body: some View {
  VStack {
    header
    loginBox
    footer
  }
}

And even though you break all of this up, you don't lose any optimisation capability. The full tree of information is still there. Those computed properties can be replaced by full nominal types, with their own body properties, and everything just works. It's a real success story for using composition to build complex structures in Swift.

But because Regex is a single type, it doesn't quite work the same way. If I'm using the Regex DSL and split a complex pattern in to subpatterns the same way I did with the SwiftUI view, I'd introduce optimisation and possibly expressiveness barriers because a regex's static type is determined only by its output type, not its contents.

So... why?

Instead of this, we could make Regex a protocol with an associated type. If it used the new primary associated type feature to be spelled some/any Regex<Output>, wouldn't it be usable in all of the same places as the proposed version using a single concrete type? but it would give the flexibility of allowing each regex's type to be unique to its contents, in the same way that SwiftUI view hierarchies are.

Has this idea been explored already? Or was this API designed before the new generics features?

rxwei · April 7, 2022, 8:14am

Would you rather write

// Style 1
let domainPattern = Regex {
  word
  OneOrMore {
    "."
    Capture(word)
  }
}
let (whole, topLevel) = domainPattern.firstMatch(in: string)

or the following?

// Style 2
struct DomainPattern: RegexComponent {
  var regex: Regex<(Substring, Substring)> {
    word
    OneOrMore {
      "."
      Capture(word)
    }
  }
}
let domainPattern = DomainPattern()
let (whole, topLevel) = domainPattern.firstMatch(in: string)

Both are actually supported with the proposed regex builder design. But I think style 1 is vastly simpler than style 2 for the most common use cases, and therefore use style 1 for most examples in the DSL proposal. Why force developers to define a new struct when it's not necessary?

This may be best discussed in the regex builder proposal, but with style 1 the developer can totally modularize a regex with the current design of regex builder.

let subject = ChoiceOf {
  "I"
  "you"
}
let object = ChoiceOf {
  "apple"
  "orange"
}
let sentence = Regex {
  subject
  "like"
  object
}

The use of stored properties instead of computed properties is intentional, because multiple uses of the same pattern can be compiled more efficiently into calls to the same subroutine. A computed property would increase the code size of the underlying pattern matching program.

kiel · April 7, 2022, 10:47am

How long will I have to wait to use this in my apps? e.g.: Do I have to wait until enough of my users are on an OS with certain system libraries etc available?

Karl · April 7, 2022, 1:50pm

It is also entirely possible to instantiate a concrete conformance to a protocol, is it not? I don't understand this argument that developers would be forced to define new structs unless they want to -- and at least in the SwiftUI parallel, people often do want to.

The stored property vs. computed property thing isn't what I was asking; you could also split those SwiftUI views in to stored properties, and you could store the opaque regexes I'm suggesting in to local/global variables. The point is that you can cut the (view/pattern) hierarchy up, and because all of its nodes conform to a common protocol, you can return the piece as a some Protocol without introducing a barrier between type-level abstractions and value-level abstractions.

My understanding of the compiler is that it can more easily follow which logic will be executed by the program if that is propagated through static types. As the proposal points out, regexes are programs - and yet with the current design, two different Regex-es have no difference in the static type system. They are just two different arrays of instructions - two different arrays of numbers, essentially. It's a value-level abstraction so their difference is at the value level.

You were very careful to avoid that trap in your example. It's a really subtle difference:

let subject = ChoiceOf {
  "I"
  "you"
}

This is actually an instance of a concrete type, like a VStack in SwiftUI. If you broke that out one more level (like a property or RegexComponent) and had to erase it behind a Regex, it would suddenly become a separate program.

I'm not sure that's a great model.

Michael_Ilseman · April 8, 2022, 12:46am

I think it would help if you could give an example of an improvement here. Note that regexes are compiled into a byte code for the matching engine, currently at run time. Static and staged compilation is future work. I'm not sure what advantage encoding the algorithm into the type system would give you vs just compiling the regex.

Karl · April 8, 2022, 2:00am

Currently, what is being proposed is that when regexes are broken up in to components, they are split in to units like this:

var regex: Regex<Void> {  // Void? I don't know. Let's say Void.
  ChoiceOf {
    "I"
    "you"
  }
}

Every regex, whether created via the DSL, literals, or parsing a string, has the same type if its output type is the same. So every component is its own program, compiled separately, in to a separate array of bytecode, and as far as the type system is concerned, all of these regexes/programs look about the same. The difference between one regex and another is the difference between an array of 3 integers and some other array of 4 integers.

But as the proposal points out, regexes are programs. I think the goal should be that, at least for regex literals (and hopefully for the DSL to some extent), one day we might not even need a bytecode or interpreter. I think the ideal case is if each literal was its own function or type that gets generated and optimised as if you wrote it in Swift. The model should pretend that's the case and allow for that, even if the implementation isn't there.

If we used an opaque type to abstract away concrete implementations, those kinds of options remain open.

let myRegex: some Regex<(Substring)> = /hello, (world|mars)!/
//           ^^^^^^^^^^^^^^^^^^^^^^^
// The concrete type of myRegex might be:
// __swift_regex_SomeLibrary_someFile_myRegex
//
// struct __swift_regex_SomeLibrary_someFile_myRegex: Regex {
//  typealias Output = (Substring)
//  (... Optimised implementation generated from the regex literal ...)
// }
//
// None of that matters to the developer - it's just "some Regex"

// And you should still be able to get type inference
// for the capture type (AFAIK)

let myRegex: some Regex<_> = /hello, (world|mars)!/
//                     ^^^

At the same time, when you break things in to components, you can easily return the underlying ChoiceOf<X, Y, ...> via the opaque type, meaning it can actually be incorporated in to the larger regex and compiled together.

var component: some Regex<Void> {
  ChoiceOf {
    "world"
    "mars"
  }
}

var regex: some Regex<Void> {
  "hello, "
  component  // <- no separate compilation or bytecode array
  "!"
}

There are tradeoffs to this approach, but I don't think they're too bad provided we have the recent new generics capabilities. The design I'm floating here would not have been feasible without them.

But I just didn't really see a discussion of it in the proposal, and I'd like to hear what the team developing the new regex functionality think about this and other possible designs for the regex type.

Michael_Ilseman · April 8, 2022, 3:14am

No, that's not how it works. Compilation is deferred until the regex is used. At that point the whole tree is compiled into byte code.

Static compilation is listed as a future direction and clearly useful for those that can generate simple DFAs. There is a significant code size savings to byte code since the logic inside the interpreter is shared.

How does the proposed approach preclude this?

I'm struggling to see how any of this is relevant. What does encoding the program an additional time in the type signature give us?

Ben_Cohen · April 8, 2022, 4:03am

The tradeoffs of returning a complex nested generic type are not limited to the ergonomics of the type returned by the builder (though those are significant, even with the new generics features). These types increase binary size, build times, and dirty memory use from things like runtime metadata, considerably in some cases.

To justify a nested generic design like this, there need to be material benefits to outweigh these costs. Sometimes those benefits are part of the actual API design. For example, exposing the structure of the types is what allows the Lazy system to support conditional conformances. Other times, like SwiftUI, the benefits are more to the implementation. But in both cases, the default if we were evaluating the API would be "why must you do this?" rather than, if they'd implemented a simpler model returning a concrete type without such high costs, "why didn't you do this?".

It might be you have an idea of an implementation that could take advantage of the type system to provide a performance boost that the current design couldn't achieve. But it is probably not reasonable to theorize there might be one, and then expect the proposal authors to justify why the haven't pursued an API that could capitalize on that (but that worsened UX and performance in other ways).

benlings · April 8, 2022, 9:05pm

I have a question about the meaning of this phrase in the detailed design section of the Regex.match() functions:

Returns nil if no match and throws on abort

What does 'abort' mean in this context? Under what circumstances would this happen and what sort of error(s) would be thrown?

Related to this, I think it would be good to specify how cancellation of matching (and possibly compiling?) will work. I'm considering how it would be possible to protect against ReDoS attacks. .Net has a MatchTimeout, but presumably in Swift this could be done with Task cancellation (and therefore timeouts when they arrive)?

Ben_Cohen · April 9, 2022, 4:33am

The core team has been discussing a clearer format and requirement for laying out when proposals include a runtime component. Usually these are included in the "ABI Stability" section of proposal documents, but not consistently. John has a PR with some suggested updated wording for the proposal template.

In the case of this proposal, there is certainly a runtime requirement: the Regex type will need to be present in the runtime to use this feature.

There should be no assumptions that features requiring runtime support will back-deploy on platforms with ABI stability. The vendor may make the support available on existing platforms, but back deploying runtime support is not always practical for technical and other reasons and will vary from platform to platform. How vendors ship Swift, and how that relates to their products, is outside the scope of the Swift open source project. To make this concrete, Apple (a vendor who ships the Swift runtime in its operating systems) does not generally discuss future unreleased products.

There are rare cases where the design of a proposal might need to accommodate backward deployment specifically. One example is the introduction of Sendable in SE-0302. When those concerns arise, the aspects of the design concerning enabling backward deployment will be discussed in the proposal.

Michael_Ilseman · April 9, 2022, 4:08pm

User code that participates in pattern matching can return a value successfully, nil on failure, or throw on abort. nil denotes a local failure so that the engine will backtrack to try an alternative, and throwing will abort the whole pattern matching pipeline and surface the error through these APIs.

Yes, engine limiters are future work. User-specified timeouts are one important formulation of this. We could also consider limiters that fire when the number of bytecode instructions executed diverges significantly from the input length.

benlings · April 10, 2022, 9:08am

Thanks - that sounds useful.

In the meantime, before that is available, would it be possible to include some calls to Task.checkCancellation() during the evaluation of the match? The use case I'm thinking of for this for evaluating a user-specified Regex in an interactive application. If the match operation is taking too long, it should be possible for the user to cancel this.

From SE-0304:

Functions which perform a large amount of synchronous computation may wish to periodically check for cancellation explicitly.

Michael_Ilseman · April 10, 2022, 12:59pm

Tracked in Issue 262.

kiel · April 11, 2022, 7:44am

I guess what is motiving my question is this: I want to be confident the question of back deployment was taken seriously. It is a great feature, the result of a lot of hard work by the authors and I (as I am sure others are too) are excited to use it and would not like to have to write and maintain filthy string processing code for years while we wait for our user bases to adopt a certain runtime. So please forgive the naive question. What specifically is the reason? If Apple is not the only runtime vendor, then it seems like this question should be publicly answerable, otherwise vendors will not know what to include in the runtimes they vend. I guess the reason is more a case that the Regex type abstracts over some lower level C API or something that is not available on older OSes / runtimes? In that case, how do the toolchains work on current runtimes in Swift code simply importing _StringProcessing? Perhaps the toolchain is customised to deal with the runtime issues for the purposes of development. So I guess there is nothing apps can do to fill the gaps in older runtimes?

I understand you've probably got better things to do with your time than entertain my naive questions about highly technical topics. So, if I was to receive a generous reply, I would be delighted to at least hear that the question of back deployment was taken seriously but it is not reasonably possible to overcome.

Thanks for the other info about back deployment in future proposals, much appreciated.