[Pitch #2] Regex Syntax and Run-time Construction

hamishknight · April 8, 2022, 3:26pm

Hi everyone, we'd like to present an update to the Regex Syntax pitch in preparation for formal review. This pitch continues to cover the syntax within a regex literal, but has been expanded to also cover the API for compiling a Regex pattern at run-time, including the type-erased AnyRegexOutput capture type.

The behavior of unrecognized escape sequences has also changed: unknown letter escapes (e.g. \I), as well as non-whitespace Unicode escapes, are now diagnosed as errors.

This is a refinement of a previous pitch, which was discussed on this thread.

benlings · April 10, 2022, 12:15pm

A few questions about regexes & matches with existential output types.

Will there be any way to introspect the capture groups on Regex<AnyRegexOutput>? For regexes with a concrete output (e.g. Regex<(Substring, Substring, name: Substring)>) there is no need for this because it's encoded in the type signature. However, for those created at runtime (e.g. with user provided strings), it would be useful to be able to find out how many groups there are, the names of the groups (if any) and go between names and group numbers.

Related to the name/number mapping, should it be explicitly stated how they are related? It looks like if there are duplicate names in the regex pattern, names could have many indexes. Or would they only map to the single (last?) index to give the same behaviour as back references?
In the Collection conformance of AnyRegexOutput, am I right to assume that the indexing matches that of the typed output? (i.e. 0 is the whole match, followed by capture groups)
Will there be a way to get named captures on Match? (or convert between names and indexes)
Related to the above, will the capture names be required when casting between Regex<Output>.Match and AnyRegexOutput?

For example

let regex = try! Regex(compiling: "(?<name>abc)(de)")
let match = try! regex.matchWhole("abcde")!
let typed1 = match.as((Substring, name: Substring, Substring).self) // Can cast with names?
let typed2 = match.as((Substring, Substring, Substring).self) // Can cast without names? 
let typed3 = match.as((whole: Substring, foo: Substring, bar: Substring).self) // Can cast with different names?

rxwei · April 11, 2022, 6:22pm

Good point. It should be possible to add a property on Regex or RegexComponent to get the number of captures.

extension RegexComponent {
    public var captureCount: Int { get }
}

According to regex101, PCRE2 seems to require a unique name for each capture. The current implementation of Regex.init(compiling:) doesn't throw an error when there are duplicate names, but it seems like it should.

Yes, that's correct.

Perhaps we should add a string subscript to both Regex.Match and AnyRegexOutput?

extension Regex.Match where Output == AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

extension AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

They are not required.

scanon · April 11, 2022, 6:22pm

This is worth an issue on the repo.

hamishknight · April 11, 2022, 6:36pm

Yeah, we still need to implement that. Note, though, that there is a mode, (?J), which allows duplicate group names, so we will still need to handle coalescing duplicate names for typed captures, though perhaps the parser could mark the captures which have duplicate names.

benlings · April 11, 2022, 7:55pm

.NET’s regex does allow duplicate names, and I think the last is used for back-references and for accessing the group (see Grouping Constructs in Regular Expressions - .NET | Microsoft Learn).

Ruby’s regex also allows duplicate names and its API allows access to all the indexes (see Class: Regexp (Ruby 3.1.1)).

For Swift, with the group names being reflected in the type system, I don’t think it would be possible to have both named access to duplicate groups and numbering that is consistent with back reference numbers. I guess you could have the group name be an array of the captures, but having the match component numbering be the same as back reference numbering seems to have been an important part of the design, and I doubt it is worth breaking for this.

Edit: this page summarises the differences much better than I have above: Regex Tutorial - Named Capturing Groups - Backreference Names. Maybe the answer would be to only allow duplicate names within branch reset groups, where they will map to the same group index?

Michael_Ilseman · April 12, 2022, 4:38pm

I want to explore this area a bit more thoroughly. The below is more of a first-principles approach and I'm not arguing that it should be API or not, nor am I saying what should happen now vs be considered as future work.

Casting

We have a failable member function, as, to go to a concrete type and an initializer to come from a concrete type.

Casting between AnyRegexOutput and the concrete output tuple creates Substrings from the internal storage representation (which only has a single strong reference to the input). This is convenient for use sites, but it is sub-optimal if the receiver doesn't actually need to materialize every Substring capture contained.

AnyRegexOutput <-> Output

To keep the storage representation, but get typed access, we can cast the Match object. This involves some run-time reflection, but doesn't materialize the individual Substrings.

Regex<AnyRegexOutput>.Match <-> Regex<Output>.Match

Finally, we can cast the regex itself such that it will produce concretely-typed matches when used.

Regex<AnyRegexOutput> <-> Regex<Output>

Querying captures

AnyRegexOutput is a collection of its existentially-typed captures (the first of which is the matched portion of the input). Note that a captureCount on it would be the same as .count, so it's not super compelling to add this unless it helps with API consistency on a broader level.

AnyRegexOutput.captureCount: Int { get } // Same as `.count`

More compelling is adding this API to Match and Regex, whether existential or not:

Regex.Match.captureCount: Int { get }
Regex.captureCount: Int { get }

Asking whether a named capture is present and what its number is (straw-person names):

Regex.captureNumber(forNamed: String) -> Int?
Regex.Match.captureNumber(forNamed: String) -> Int?
AnyRegexOutput.captureNumber(forNamed: String) -> Int?

Or alternatively we could produce a Dictionary<String, Int> for convenience, noting that would require materializing the dictionary.
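As a rough illustration of the two query shapes above (a per-name lookup versus a materialized dictionary), here is a minimal sketch. CaptureNames, its names array, and numbersByName are hypothetical stand-ins invented for this example, not pitched API, and the choice that the first capture wins when names are duplicated is an assumption made only for the sketch.

```swift
// Hypothetical stand-in for a regex's capture metadata; not part of any real Regex API.
struct CaptureNames {
    // One entry per capture, in capture-number order.
    // Entry 0 is the whole match, which never has a name.
    let names: [String?]

    // Number of the first capture with this name, or nil if no capture has it.
    func captureNumber(forNamed name: String) -> Int? {
        names.firstIndex(of: name)
    }

    // Eagerly materialized name-to-number table.
    // Assumption for this sketch: the first capture wins on duplicate names.
    var numbersByName: [String: Int] {
        var table: [String: Int] = [:]
        for (number, name) in names.enumerated() {
            if let name = name, table[name] == nil {
                table[name] = number
            }
        }
        return table
    }
}

// Metadata for the pattern (?<name>abc)(de), written out by hand.
let metadata = CaptureNames(names: [nil, "name", nil])
let number = metadata.captureNumber(forNamed: "name")
let missing = metadata.captureNumber(forNamed: "other")
```

The lookup method does a linear scan on each call and allocates nothing, while the dictionary pays its construction cost up front, which is the trade-off mentioned above.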

For getting the capture out, we can add (as Richard said) subscripts. However, I would strongly consider whether we should treat those subscripts as returning optional values instead of trapping. They are more analogous to dictionary's key-based subscript than its index-based subscript, especially in dynamically-constructed scenarios.
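To make the dictionary analogy concrete, here is a minimal sketch of a type-erased capture list where the positional subscript traps on a bad position (like Array) and the name-based subscript returns nil for an unknown name (like Dictionary's key subscript). ErasedCaptures and its stored properties are hypothetical names for illustration only, and the captures are filled in by hand rather than produced by a real match.

```swift
// Hypothetical stand-in for a type-erased match output; not an actual API.
struct ErasedCaptures {
    let names: [String?]      // names[0] is the whole match, always unnamed
    let values: [Substring]   // values[0] is the whole match

    // Positional access: traps when out of range, like Array's subscript.
    subscript(position: Int) -> Substring {
        values[position]
    }

    // Name-based access: returns nil for an unknown name, like Dictionary's key subscript.
    subscript(name: String) -> Substring? {
        guard let position = names.firstIndex(of: name) else { return nil }
        return values[position]
    }
}

// Captures for (?<name>abc)(de) matched against "abcde", written out by hand.
let input = "abcde"
let captures = ErasedCaptures(
    names: [nil, "name", nil],
    values: [input[...], input.prefix(3), input.suffix(2)]
)

let named = captures["name"]      // Optional: the name might not exist
let unknown = captures["other"]   // nil instead of a trap
let second = captures[2]          // non-optional: the caller vouches for the position
```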

A similar question exists for the reference-taking subscript, where the references present are not reflected in the type signature. If there is separation in time/space between the Regex and its match, this can produce unexpected traps. The counterargument is that because these are actual instances of a Reference type, rather than arbitrary strings from who knows where, they are far more akin to indices than dictionary keys. This is especially so since these regexes tend to be statically constructed. I find this counterargument compelling.

We most likely should, at the very least, have some way of querying presence of names and references without trapping.

One final note, perhaps AnyRegexOutput.Element should have a name of AnyRegexCapture, especially if we want to use this existential more prominently. On the other hand, it's nice to have fewer top-level names.

I haven't thought too deeply about whether it's worth reifying the capture metatype to support more type-level operations directly on Regex (and this would clearly be severable).

benrimmington · April 16, 2022, 4:43pm

Could a String-based @dynamicMemberLookup also be added?
This would support .name and .0 members.
(I think the compiler will attempt the existing KeyPath-based lookup first.)
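For reference, a minimal sketch of what String-based dynamic member lookup looks like on a type-erased capture list. DynamicCaptures and its stored properties are hypothetical names for illustration, the captures are filled in by hand, and returning an optional for unknown names is an assumption. Only a named member is shown; this sketch does not cover how a .0 member would behave.

```swift
// Hypothetical stand-in for a type-erased match output; not an actual API.
@dynamicMemberLookup
struct DynamicCaptures {
    let captureNames: [String?]    // entry 0 is the whole match, always unnamed
    let captureValues: [Substring] // entry 0 is the whole match

    // String-based dynamic member lookup: `captures.foo` calls this with "foo".
    // Unknown names yield nil (an assumption for this sketch).
    subscript(dynamicMember name: String) -> Substring? {
        guard let position = captureNames.firstIndex(of: name) else { return nil }
        return captureValues[position]
    }
}

// Captures for (?<name>abc)(de) matched against "abcde", written out by hand.
let subject = "abcde"
let dynamicCaptures = DynamicCaptures(
    captureNames: [nil, "name", nil],
    captureValues: [subject[...], subject.prefix(3), subject.suffix(2)]
)

let byMember = dynamicCaptures.name      // resolved at run time via the subscript
let absent = dynamicCaptures.surname     // compiles, but is nil: no such capture
```

One consequence visible here is that a misspelled capture name still compiles, since any identifier is accepted as a dynamic member.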