Should Regex be Sendable?

While reading [Pitch] Predicate Regex Support, I noticed that Regex<T> is not Sendable.

In light of the restrictions on non-sendable types coming to the language, I think it should be. I think it is fairly standard practice to define a regex as a global constant, and to use it from a variety of concurrent contexts.

My understanding is that the basic implementation of Regex<T> is thread-safe: that it contains a compiled, immutable program. When you look for a match, an executor is created, which interprets the instructions encoded in the regex and contains all of the mutable state related to the matching operation.
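A toy sketch of that split, for illustration only. `Instruction`, `Program`, `Executor`, and `wholeMatch` are hypothetical names and do not reflect the actual Regex internals. The point is just that the compiled program is immutable and shareable, while each match gets its own executor holding the mutable state.

```swift
// Toy model of "immutable program + per-match executor".
// All names are hypothetical; this is not how Regex is actually implemented.

enum Instruction: Sendable {
    case character(Character)  // match one specific character
    case digit                 // match a single ASCII digit
}

// The "compiled" program: immutable after creation, so safe to share.
struct Program: Sendable {
    let instructions: [Instruction]
}

// All mutable matching state lives here; a fresh one is created per match.
struct Executor {
    let program: Program
    var consumed = 0

    mutating func run(on input: String) -> Bool {
        var index = input.startIndex
        for instruction in program.instructions {
            guard index < input.endIndex else { return false }
            let current = input[index]
            switch instruction {
            case .character(let expected):
                guard current == expected else { return false }
            case .digit:
                guard current.isASCII && current.isNumber else { return false }
            }
            index = input.index(after: index)
            consumed += 1
        }
        // Whole match: every instruction succeeded and no input is left over.
        return index == input.endIndex
    }
}

func wholeMatch(_ program: Program, _ input: String) -> Bool {
    var executor = Executor(program: program)  // per-call mutable state
    return executor.run(on: input)
}

// A global constant that any number of concurrent callers could share.
let dollarDigit = Program(instructions: [.character("$"), .digit])
```

Because `Program` only has immutable, Sendable stored properties, the compiler can verify its Sendable conformance without any unchecked escape hatch.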

AFAICT there are a few minor API issues with Regex which prevent it from being Sendable:

  • It may contain captures with transformation closures, and those closures are not marked @Sendable.

    import RegexBuilder

    let doubleValueRegex = Regex {
        "$"
        Capture {
            OneOrMore(.digit)
            "."
            Repeat(.digit, count: 2)
        } transform: { Double($0)! }  // <--
        Anchor.endOfLine
    }
    

    The documentation for these closures describes their intended use:

    transform

    A closure that takes the substring matched by component and returns a new value to capture. If transform throws an error, matching is abandoned and the error is returned to the caller.

    I don't think it is necessary to access non-sendable state in order to perform that duty.

  • CustomConsumingRegexComponent would need to refine Sendable, since they are stored in the Regex and their logic is invoked as part of the matching process.

  • It may also be necessary to make the RegexComponent protocol refine Sendable. (I don't think it is, since its .regex computed property is the only thing really needed to incorporate a component into another regex, but I haven't worked through all the details.)

Currently, these are all places where non-thread-safe behaviour may be introduced as part of a Regex match, and fixing them would be potentially source-breaking.
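To make the first two bullets concrete, here is a sketch of what the requirement might look like on stand-in types. `ToyCapture`, `ToyConsumingComponent`, and `LeadingDollar` are hypothetical and are not RegexBuilder API; only the `@Sendable` annotation and the `Sendable` refinement are the point.

```swift
// Hypothetical stand-in for a capture that stores its transform closure.
struct ToyCapture<Output>: Sendable {
    let transform: @Sendable (Substring) -> Output
}

// Hypothetical stand-in for a custom component protocol that refines Sendable.
protocol ToyConsumingComponent: Sendable {
    // Returns the unconsumed remainder, or nil if the component does not match.
    func consume(_ input: Substring) -> Substring?
}

// A stateless transform, like the one in the example above, satisfies @Sendable unchanged.
let amount = ToyCapture<Double?> { Double($0) }

// A component with no stored state is trivially Sendable.
struct LeadingDollar: ToyConsumingComponent {
    func consume(_ input: Substring) -> Substring? {
        input.first == "$" ? input.dropFirst() : nil
    }
}
```

A transform that captured a non-Sendable class instance would be diagnosed under strict concurrency checking, which is exactly the source-compatibility concern.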

With this thread, I'd like to do a few things:

  1. Make the community aware of the issue. It seems we may be getting close to Swift 6, and I think this needs addressing as part of that overall effort. Pretty much everything else in the standard library has been audited for sendability AFAIK.

    The Foundation proposal I mentioned previously works around it by storing a String (and their workaround is always going to be necessary because it also needs to be Codable, so I don't know if they're motivated to bring up the Regex Sendable issue as a separate thing). I couldn't find any other threads about this issue specifically, so I assume it is not widely known.

  2. Discuss these changes. I think they are very reasonable, and vastly preferable to the alternative of a non-Sendable Regex<T>, but others may disagree.

  3. Ask the Regex authors whether there are other sources of non-Sendable behaviour that they are aware of.

Thanks


I’ll note that someone could be doing something stateful with their closures, so this would be a source-breaking change. I agree that a Sendable Regex makes sense though…


Yeah, that seems to be the crux of the problem - Regexes can call out to arbitrary other code. All of that code would need to be Sendable first, before Regex could be.

Although maybe Regex could conditionally conform to Sendable when all its dependencies do?

Would it be enough for a Regex to emit a sendable property? Like a classic string version of the Regex that leaves out transforms, or would that be too lossy/in fact harder to implement?

I think that'd be a sharp edge, given not all Regexes can be expressed in traditional regex string form. It'd be all too easy for folks to cargo-cult use of such a "Sendable workaround" without understanding the ramifications.

Granted it is an oft-requested feature otherwise, just for debugging purposes if nothing else. It's easy to see the appeal for Regex-to-string conversion, and I think it's perfectly valid for debugging and similar purposes - but I don't see a way to prevent it likely being abused for sending (if Regex doesn't support Sendable directly).


There's nothing fundamentally wrong with the transform closures, or with including custom regex components expressed using Swift code. But they need to be Sendable for the overall regex to be Sendable.

They don't even need to be pure functions - they can use shared mutable state, so long as they take proper precautions for concurrent operation (e.g. using locks or atomics).
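A minimal sketch of that idea, assuming a lock-based counter as the shared state. `ParseCounter` and `makeCountingTransform` are hypothetical helpers, not part of any Regex API; `@unchecked Sendable` is used because the compiler cannot see that the lock protects `count`.

```swift
import Foundation

// Hypothetical helper: shared mutable state made safe for concurrent use with a lock.
final class ParseCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var count = 0

    func increment() {
        lock.lock()
        defer { lock.unlock() }
        count += 1
    }

    var value: Int {
        lock.lock()
        defer { lock.unlock() }
        return count
    }
}

// Builds a transform that is stateful (it counts invocations) yet still @Sendable,
// because the only state it captures is itself Sendable.
func makeCountingTransform(_ counter: ParseCounter) -> @Sendable (Substring) -> Double? {
    return { text in
        counter.increment()
        return Double(text)
    }
}
```

The closure is not a pure function, but every access to the shared count goes through the lock, so it can be called from several threads at once.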

Considering what the transform closure and custom regex components are supposed to do, I don't think it's unreasonable to require them to be Sendable.

For instance, the capture's transform is there to integrate with other parsers, so a regex can return a Double rather than a Substring. That's a really nice feature. Typically string parsers don't access shared mutable state, instead just taking the string and possibly some flags as their inputs.


Thanks for bringing this issue up, @Karl! I think your description of the problem is spot on: it's common to define regexes as globals—which makes sense; they act like little functions—and it should therefore be safe to use them as globals in a concurrent context.

There are a few different solutions here, any of which would require going through the evolution process:

  1. Making Regex itself sendable would require the two changes you mentioned: marking Capture/transform closures @Sendable and having CustomConsumingRegexComponent refine the Sendable protocol. From my understanding, those changes wouldn't be source-breaking (since sendability checking still only emits warnings), and since Sendable is only a marker protocol with no ABI presence, they wouldn't require availability or cause library evolution issues.

    Moreover, it does seem correct that transformations and custom components generally should be sendable as well. Importantly, the Regex type makes no promises whatsoever about when those bits of custom code are invoked, so transformations and custom components with exciting side effects won't be reliable. Caching/memoization/logging – those seem like reasonable or expected side effects that are also possible to implement in a sendable way.

    Conditional Sendable conformance is unfortunately out – the types that comprise an individual Regex aren't available at the point where that conditional conformance would be determined, so there's no way to distinguish a regex that includes a transform closure from one that doesn't.

  2. Introducing a parallel SendableRegex type is also possible. This would require duplicating some of the RegexComponent infrastructure, but probably not everything, and would require the same changes in the new structure as those in solution #1. (This doesn't really seem worth it to me.)

  3. Providing a sendable wrapper for a regex at runtime is akin to #2, but the creation step for a regex would be the same as it is now. We could do this sendable checking during compilation (edit: the compilation within the regex instance, not the Swift compiler), so that it wouldn't need to be repeated each time a regex is used. Unfortunately, we'd need to rule out any regex that had a transform or a custom component, so the set of sendable regexes would be smaller than the total, and this would discourage using some powerful Regex features.
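A sketch of why the conditional conformance mentioned under #1 has nothing to hang on. `ToyRegex` and `Tally` are hypothetical; the only assumption is the one stated above, that the component types are no longer visible once you hold a regex value, so the generic parameter describes the output and nothing else.

```swift
// Hypothetical model: the only generic parameter is the output type.
struct ToyRegex<Output> {
    // The stored closure's type says nothing about whether it is Sendable.
    let transform: (Substring) -> Output?
}

final class Tally { var hits = 0 }  // deliberately not Sendable

// A regex whose transform touches no shared state.
func makeStateless() -> ToyRegex<Double> {
    ToyRegex<Double>(transform: { Double($0) })
}

// A regex whose transform mutates unprotected shared state.
func makeStateful(_ tally: Tally) -> ToyRegex<Double> {
    ToyRegex<Double>(transform: { text in
        tally.hits += 1
        return Double(text)
    })
}

// Both functions return exactly the same type, ToyRegex<Double>,
// so no `where` clause on Output can tell the two values apart.
```

Distinguishing them would need the closure's sendability to show up in the type, which is what marking the transform `@Sendable` in #1 does unconditionally.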

Before moving forward, I think it's important to have a better understanding about what kinds of non-sendable transforms and custom components are out there. Of those, some may be legitimate, and some may already be fragile or implementation-dependent.

Two other related notes:

  • There's a compilation step the first time a regex is used that is already protected via atomics.
  • There's recent work (to support the predicate regex) to generate a literal pattern string when possible—these are of course sendable, but represent a subset of all sendable regexes (and would require re-compilation)

It would be source breaking, but not ABI breaking (someone may be using non-sendable capture/transform closures today, and they would no longer be able to compile).


Just my two cents: making Regex unconditionally Sendable, with the changes to Capture/transform that @nnnnnnnn mentioned, might prove restrictive, just as with key paths, where all captures are required to be Sendable. That means code that has nothing to do with concurrency still has to provide Sendable types, which is not always possible.


Yeah, I've been pondering this too. Requiring Sendable unconditionally feels concerningly blunt.

That said, it seems that it somewhat hinges on whether it's considered acceptable to have stateful capture / transform closures to begin with. I tend to assume it is, but then I can't recall ever actually doing that. Are there known use-cases like that?

We have faced a similar problem in the various AsyncSequence algorithms that take a closure, both in the standard library and in swift-async-algorithms. The concrete AsyncSequence types want to be Sendable if both their base async sequence and the supplied closure are Sendable. However, we can only be generic over the Sendable conformance of the base async sequence; hence, we have to enforce @Sendable on the closures. This allows us to make the types Sendable; however, it limits their functionality a bit in the context of actors if the AsyncSequence never escapes the actor.
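Roughly, the pattern looks like the sketch below. `ToyMap` is a hypothetical, simplified stand-in that wraps a plain Sequence rather than an AsyncSequence to keep it short; it is not the actual standard library or swift-async-algorithms code.

```swift
// Simplified stand-in: a plain Sequence is used instead of AsyncSequence for brevity.
struct ToyMap<Base: Sequence, Output> {
    let base: Base

    // There is no way to be generic over "is this closure Sendable?",
    // so @Sendable has to be required unconditionally.
    let transform: @Sendable (Base.Element) -> Output

    // Applies the transform eagerly; a real algorithm type would be lazy.
    func collect() -> [Output] {
        base.map(transform)
    }
}

// Sendability can only be made conditional on the base.
extension ToyMap: Sendable where Base: Sendable {}
```

The cost is visible at the use site: a caller that never sends the value anywhere still has to pass a `@Sendable` closure.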

Sometimes I wish functions would be concrete types inside Swift so they can participate in protocol conformances and generics.


Functions can participate in protocol conformances (well, they can just conform to Sendable) and can be used in generics. For example:

struct Box<Value> {
    var value: Value
}

extension Box: Sendable where Value: Sendable {}

func assertSendable(_ sendable: (some Sendable).Type) {}

assertSendable(Box<() -> Void>.self) // warning: Type '() -> Void' does not conform to the 'Sendable' protocol
assertSendable(Box<@Sendable () -> Void>.self)

What is missing is a type constraint that lets us specify the shape of a function, so that we can actually call it and not just pass it around.
Something like this:

struct Foo<Function> where Function: () -> Void { // error: Type 'Function' constrained to non-protocol, non-class type '() -> Void'
    var function: Function
}

Is it correct, then, that for a regex which doesn’t make use of any of this machinery, it would be safe to ‘smuggle’ it across isolation boundaries using unsafe mechanisms, or is that a misunderstanding?