[Discussion] Fetch Regex Named Captures Dynamically?

charlieMonroe · June 28, 2022, 6:38am

I've been playing with Regex lately, and while I love it, I may be missing something: it seems to me that named captures cannot be fetched dynamically. Let me give you an example:

Fetching some data from some string - for example, you have lines in a log and you need to fetch some ID from various lines.

The issue here is that the strings have evolved over time, so it's not feasible to fit all possible formats, so you have multiple regexes that have one thing in common - there's a capture named "id". Examples:

/particle-(?<id>\d+)/
/uuid: (?<id>[a-f\d]+)/

This works great for small regexes like the above whose result type is (String, id: String), so you can put them into an array and iterate.

The issue arises when one of the regexes adds an additional capture:

/(foo|bar)=(?<id>\d+)/

Suddenly, the result type is (String, String, id: String) and this regex can no longer be added into this list.

Yes, it's an easy fix for this particular example:

/(?:foo|bar)=(?<id>\d+)/

... by making the first group non-capturing, but... This will not work in case I have two groups that I want to capture:

/(?<name>particle)-(?<id>\d+)/
/(?<name>uuid): (?<id>[a-f\d]+)/

Now what about

/(?<id>\d+): (?<name>\w+)/

The reversed order of ID and name disqualifies the regex from being in an array with the others - even though it matches exactly the same-named groups...

I've thought about this and came up with a few possible solutions:

Dynamic look-up:

let myRegexes: [some RegexComponent] = [...]
for regex in myRegexes {
   guard
      let match = foo.wholeMatch(of: regex),
      let id = match["id"], // String
      let name = match["name"] // String
   else {
      continue
   }

    ...
}

The regex match would allow fetching the value of a named group by name. If the match doesn't contain it, nil would be returned, or an error thrown...

Auto-conforming protocols:

@autoconforming
protocol NameAndID {
   var id: String { get }
   var name: String { get }
}

let myRegexes: [Regex<NameAndID>] = [ ... ]

The idea here is to define an interface that values automatically conform to if they have the required fields; the conformance would be emitted by the compiler when they are passed to a method that requires it.

The reason for this is that regex outputs are tuples, which cannot (to my knowledge) be extended to conform to protocols. This way, (String, id: String, name: String) would automatically conform to NameAndID, and so would (String, name: String, id: String), even though the id and name fields are in a different order, and so would (String, name: String, String, String, id: String).

Or am I simply missing some very simple solution?

paiv · June 28, 2022, 7:06am

Type-system questions aside, a single scan with a complex regex will be quicker than multiple scans with shorter regexes. Make one long regex with multiple captures, each named uniquely, and check which one captured.
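A minimal sketch of that combined-regex idea, assuming a Swift 5.7+ toolchain. The helper `extractID` and the capture names `particleID` and `uuid` are made up for illustration; the regex is built from a string so the match is type-erased and captures can be probed by name.

```swift
// Hypothetical helper: one alternation with uniquely named captures,
// then check which capture actually participated in the match.
func extractID(from line: String) throws -> (kind: String, id: String)? {
    let combined = try Regex<AnyRegexOutput>(
        #"particle-(?<particleID>\d+)|uuid: (?<uuid>[a-f\d]+)"#
    )
    guard let match = line.firstMatch(of: combined) else { return nil }

    for name in ["particleID", "uuid"] {
        // `substring` is nil when the named group did not take part in the match.
        if let id = match[name]?.substring {
            return (kind: name, id: String(id))
        }
    }
    return nil
}
```

In real code the regex would be compiled once and reused rather than rebuilt on every call; it is kept inside the function here only to keep the sketch self-contained.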

charlieMonroe · June 28, 2022, 7:49am

Yes, but... it's not always that easy. The regexes can be supplied by various parts of the app, by the server, etc. I've included simple regexes for the sake of an easy-to-follow example, but it can be much, much more complex.

Two examples:

- Writing a regex evaluator and debugger (a tool for developers, like https://regex101.com): there's no way of iterating over captured groups and their names. I find this fairly limiting, mainly when you create the regex from a string rather than via the regex builder or regex literals.
- You have an app with a lot of integrations, and each needs to provide a regex to capture two variables. There are a lot of unknowns, and you cannot just build a single regex for the 100 or 1000 integrations and manually match them. You need this to be automatic: you add a new integration, you write a new regex.
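For the debugger-style case, here is a rough sketch of what iterating over captures and their names can look like once the type-erased API is available (Swift 5.7 release toolchains and later, as far as I know). `describeCaptures` is a hypothetical helper name; it relies on `AnyRegexOutput` being a collection whose elements expose `name` and `substring`.

```swift
// Hypothetical helper for a regex-debugger style tool: compile a pattern
// supplied at run time and list every capture with its name (if any).
func describeCaptures(of pattern: String, in input: String) throws
    -> [(name: String?, text: String?)]
{
    let regex = try Regex<AnyRegexOutput>(pattern)
    guard let match = input.firstMatch(of: regex) else { return [] }

    // `match.output` is an `AnyRegexOutput`, a collection of captures.
    // Element 0 is the whole match; unnamed groups have a nil `name`.
    return match.output.map { capture in
        (name: capture.name, text: capture.substring.map { String($0) })
    }
}
```

A group that exists in the pattern but did not participate in the match shows up with a nil `text`, which is usually what a debugger wants to display anyway.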

Even more specific example based on the second one - you are making an app that goes through your browser history and monitors videos that you've watched.

To do so, it needs to know about the ID of the video - for https://www.youtube.com/watch?v=vZYsQDqhIWo and https://youtu.be/vZYsQDqhIWo you should get vZYsQDqhIWo - this way it can detect duplicates.

But there's not just YouTube, there are other sites as well (Vimeo, Bilibili, ...). So you write parsers for these - individual structs that conform to some metadata extracting protocol that defines a regex for parsing out the ID - fine, you can define it as Regex<(Substring, id: Substring)>

And you then find which integration matches the link and let it process it further. The issue here is that you may want to optionally include additional matches in the regex so that you don't need to create a new one unnecessarily - e.g. the YouTube-related regex may additionally look for a (list=(?<playlist>[^&]+))? parameter that would extract the playlist information, etc.

And here you are starting to paint yourself into a corner. So generally you need to create a single regex just for matching the ID and then a new one for the rest - that seems unnecessary to me and mainly leads to maintaining two regexes that can change over time instead of one...

xAlien95 · June 28, 2022, 10:10am

charlieMonroe:

Dynamic look-up:

let myRegexes: [some RegexComponent] = [...]
for regex in myRegexes {
   guard
      let match = foo.wholeMatch(of: regex),
      let id = match["id"], // String
      let name = match["name"] // String
   else {
      continue
   }

    ...
}

Almost there, you need to use AnyRegexOutput if your goal is to dynamically access captured groups:

let regexes: [Regex<AnyRegexOutput>] = [
  .init(/(?<name>particle)-(?<id>\d+)/),
  .init(/(?<name>uuid): (?<id>[a-f\d]+)/),
  .init(/(?<id>\d+): (?<name>\w+)/),
]
let foo = "particle-1"

for regex in regexes {
  guard
    let match = foo.wholeMatch(of: regex),
    let name = match["name"]?.substring,
    let id = match["id"]?.substring
  else {
    continue
  }

  print(id, name)
}
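Building on the loop above, the NameAndID shape from the original question can be approximated today with a plain struct instead of an auto-conforming protocol. This is only a sketch: `NameAndID` here is a struct rather than the proposed protocol, and `firstNameAndID` is a hypothetical helper. The regexes are constructed from strings so the snippet does not depend on any compiler flag.

```swift
// A plain struct standing in for the proposed auto-conforming protocol.
struct NameAndID {
    let id: String
    let name: String

    // Fails if the match lacks an "id" or "name" capture,
    // or if either group did not participate in the match.
    init?(_ match: Regex<AnyRegexOutput>.Match) {
        guard
            let id = match["id"]?.substring,
            let name = match["name"]?.substring
        else { return nil }
        self.id = String(id)
        self.name = String(name)
    }
}

// Hypothetical helper: try each type-erased regex in turn.
func firstNameAndID(in text: String, using regexes: [Regex<AnyRegexOutput>]) -> NameAndID? {
    for regex in regexes {
        if let match = text.wholeMatch(of: regex), let result = NameAndID(match) {
            return result
        }
    }
    return nil
}
```

The order of the groups and any extra captures no longer matter, at the cost of moving the "has an id and a name" check from compile time to run time.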

charlieMonroe · June 28, 2022, 10:23am

Unless there are some changes that I don't know about (testing in latest Xcode beta), this code won't compile:

- Cannot convert value of type 'Regex<(Substring, name: Substring, id: Substring)>' to expected argument type 'String': the Regex initializer only takes a string, so I would need to use regex strings, which are a PITA (escaping, etc.) and lose compile-time checking.
- match["name"]?.substring: Cannot convert value of type 'String' to expected argument type 'Int'. The output subscript only takes an index; it doesn't take the name of the capture group.
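On the escaping point: if string-based construction is the only option on a given toolchain, raw string literals take most of the pain out of it. The sketch below assumes a toolchain where `Regex<AnyRegexOutput>` can be created from a string; `compileAll` is a hypothetical helper. What is lost is compile-time validation, so a malformed pattern shows up as a thrown error instead.

```swift
// Raw string literals (#"..."#) avoid double-escaping backslashes.
// The trade-off: a malformed pattern surfaces as a thrown error at run time
// rather than as a compiler diagnostic.
func compileAll(_ patterns: [String]) -> [Regex<AnyRegexOutput>] {
    patterns.compactMap { pattern -> Regex<AnyRegexOutput>? in
        do {
            return try Regex<AnyRegexOutput>(pattern)
        } catch {
            // Report and skip patterns that fail to parse.
            print("Skipping invalid pattern \(pattern): \(error)")
            return nil
        }
    }
}
```

Whether skipping or failing hard is the right reaction to a bad pattern depends on where the patterns come from; skipping is just the simplest thing to show.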

xAlien95 · June 29, 2022, 4:07pm

The Xcode beta probably isn't up to date with the current state of apple/swift-experimental-string-processing. Regex<AnyRegexOutput> has an initializer to type-erase an existing regular expression:

apple/swift-experimental-string-processing/blob/e87149a08d3d81ffb2a7beaac590e326a0a88c29/Sources/_StringProcessing/Regex/AnyRegexOutput.swift#L183-L192

          @available(SwiftStdlib 5.7, *)
          extension Regex where Output == AnyRegexOutput {
            /// Creates a type-erased regex from an existing regex.
            ///
            /// Use this initializer to fit a regex with strongly-typed captures into the
            /// use site of a type-erased regex, i.e. one that was created from a string.
            public init<Output>(_ regex: Regex<Output>) {
              self.init(node: regex.root)
            }
          }

Regarding the SwiftFiddle link in my previous post

I removed it since it doesn't preserve compiler flags. For the code snippet to work as expected, you need to manually select nightly-main from the dropdown menu and add -enable-bare-slash-regex to the compiler flags (the gear icon in the toolbar).
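As far as I understand SE-0354, the flag is only needed for the bare `/.../` form; the extended-delimiter form `#/.../#` should be accepted without it. A sketch of the same three regexes written that way, assuming a Swift 5.7+ toolchain that includes the type-erasing initializer (`erasedRegexes` and `nameAndID(in:)` are illustrative names):

```swift
// Extended-delimiter literals keep compile-time checking of the pattern
// while avoiding the -enable-bare-slash-regex flag.
let erasedRegexes: [Regex<AnyRegexOutput>] = [
    .init(#/(?<name>particle)-(?<id>\d+)/#),
    .init(#/(?<name>uuid): (?<id>[a-f\d]+)/#),
    .init(#/(?<id>\d+): (?<name>\w+)/#),
]

// Returns the "name" and "id" captures of the first regex that matches.
func nameAndID(in text: String) -> (name: Substring, id: Substring)? {
    for regex in erasedRegexes {
        guard
            let match = text.wholeMatch(of: regex),
            let name = match["name"]?.substring,
            let id = match["id"]?.substring
        else { continue }
        return (name: name, id: id)
    }
    return nil
}
```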

Michael_Ilseman · June 29, 2022, 4:56pm

@xAlien95 is right that not all APIs have made it into a beta yet.

@charlieMonroe thanks for the example, I translated it to an in-repo test here, though note it does construction via run-time strings to avoid a bootstrapping dependency. Could you share a little more about your use case? I'm really interested in improving the ergonomics of type erasing regexes and stress testing the API.

charlieMonroe · June 29, 2022, 5:44pm

@xAlien95 - thanks for the info, was not aware of that!

@Michael_Ilseman - here's generally what I've written above, but in some (pseudo-)code.

protocol MetadataExtractor: Identifiable {

    /// Generally, any output that has an "id" field.
    /// (`any Tuple[with: \.id]` is pseudo-syntax for such a type.)
    static var urlRegex: Regex<any Tuple[with: \.id]> { get }

    /// URL associated with this extractor.
    var url: URL { get }

    /// Extracts some metadata from an HTML source at URL. The metadata
    /// object contains title, preview, description, and can potentially
    /// contain playlist information, etc.
    func extractMetadata(from source: String) throws -> Metadata

}

extension MetadataExtractor {

    /// Assumes that the extractor can only be initialized with
    /// a URL that matches Self.urlRegex.
    var id: String {
        return String(url.absoluteString.wholeMatch(of: Self.urlRegex)!.id)
    }

}

struct YouTubeExtractor: MetadataExtractor {

     // Here's a mismatch that the Regex also has a "playlist" capture group.
     static let urlRegex = #/https?://(?:[^/]*\.)?(?:youtu\.be/|youtube\.com/watch\?v=)(?<id>[a-zA-Z0-9_-]+)(?:&(?:.+&)?playlist=(?<playlist>[a-zA-Z0-9_-]+))?/#

     let url: URL

     func extractMetadata(from source: String) throws -> Metadata {
          var metadata = Metadata(url: self.url, id: self.id)
          metadata.title = source.firstMatch(of: ....)
          
          // ...

          if let playlistID = self.url.absoluteString.wholeMatch(of: Self.urlRegex)!.playlist {
              // Extract playlist info...
          }
          
          return metadata
     }

}

let urls: [URL] = [] // Placeholder: in practice, 10,000 URLs from browsing history

for url in urls {
     // Returns an instance for a URL.
     guard let extractor = MetadataExtractor.extractor(for: url) else {
          continue
     }

     // source(at:) would be a custom extension on URLSession that converts data to string.
     let metadata = try await extractor.extractMetadata(from: session.source(at: url))
}

Yes, I know the example is fairly primitive. But as @paiv mentioned, it's better to have one regex than several small ones. In my experience and benchmarks, it was a major improvement when each extractor cached its compiled Regex, which could then also be reused by the code for additional information, like the playlist info, but also for other things: e.g. for http://www.arte.tv/guide/en/068399-013-A/vox-pop-private-education you can easily extract the language from the /en/ part.
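A compilable approximation of the pseudo-code above, using type-erased regexes in place of the imaginary `any Tuple[with: \.id]` type. Everything named here (`LinkMatcher`, `videoInfo(for:)`, `examplePatterns`) is hypothetical, and the URL patterns are deliberately simplified. The point is that each pattern is compiled once and the same match serves both the required "id" and the optional "playlist".

```swift
// Hypothetical registry: each integration contributes one pattern with an
// "id" group and, optionally, extra groups such as "playlist".
struct LinkMatcher {
    private let regexes: [Regex<AnyRegexOutput>]

    // Compile every pattern once so the regexes can be reused across URLs.
    init(patterns: [String]) throws {
        regexes = try patterns.map { try Regex<AnyRegexOutput>($0) }
    }

    func videoInfo(for link: String) -> (id: String, playlist: String?)? {
        for regex in regexes {
            guard
                let match = link.wholeMatch(of: regex),
                let id = match["id"]?.substring
            else { continue }
            // nil both when the regex has no "playlist" group and when the
            // optional group did not participate in the match.
            let playlist = match["playlist"]?.substring
            return (id: String(id), playlist: playlist.map { String($0) })
        }
        return nil
    }
}

// Illustrative patterns only; real URL formats are more varied.
let examplePatterns = [
    #"https?://(?:www\.)?(?:youtu\.be/|youtube\.com/watch\?v=)(?<id>[A-Za-z0-9_\-]+)(?:&list=(?<playlist>[^&]+))?"#,
    #"https?://vimeo\.com/(?<id>\d+)"#,
]
```

With this shape, adding an integration means adding one pattern, and extra named groups in one pattern do not affect the others.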

Aside from all the above, here's another example for fetching attributes from HTML - this is my current implementation that I use now:

public func value(ofInputFieldNamed fieldName: String) -> String? {
	return self.firstOccurrence(of: "VALUE", inRegexes:
		"<input[^>]+(name|id)=\"\(fieldName)\"[^>]+value=\"(?P<VALUE>[^\"]+)\"",
		"<input[^>]+value=\"(?P<VALUE>[^\"]+)\"[^>]+(name|id)=\"\(fieldName)\""
	)
}

Where it iterates over the supplied regexes, finds a first match and extracts "VALUE" from the match. This takes into account that the 'name/id' field may or may not precede the 'value' field. Currently, this would need to be done with a series of if-returns.
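For comparison, a sketch of how such a helper might look on top of the type-erased Regex API, assuming a Swift 5.7+ toolchain. `firstCapture(named:in:patterns:)` is a hypothetical stand-in for `firstOccurrence(of:inRegexes:)`, written as free functions rather than a String extension, and the group is spelled `(?<value>...)`. Note that `fieldName` is interpolated without escaping, so the sketch assumes it contains no regex metacharacters.

```swift
// Hypothetical stand-in for the helper above: try each pattern in order and
// return the named capture from the first pattern that matches.
func firstCapture(named name: String, in text: String, patterns: [String]) -> String? {
    for pattern in patterns {
        guard
            let regex = try? Regex<AnyRegexOutput>(pattern),
            let match = text.firstMatch(of: regex),
            let value = match[name]?.substring
        else { continue }
        return String(value)
    }
    return nil
}

// Assumes `fieldName` contains no regex metacharacters (it is not escaped here).
func value(ofInputFieldNamed fieldName: String, in html: String) -> String? {
    firstCapture(named: "value", in: html, patterns: [
        #"<input[^>]+(?:name|id)="\#(fieldName)"[^>]+value="(?<value>[^"]+)""#,
        #"<input[^>]+value="(?<value>[^"]+)"[^>]+(?:name|id)="\#(fieldName)""#,
    ])
}
```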

I have similar helpers for extracting various things where the order can differ, etc. I know that these are fairly primitive examples, but they illustrate my use case.

If you need more information, please feel free to ask.

Azzam-dev · October 25, 2024, 7:46am

Thanks for pointing out the .init(/.../).
I was struggling to understand why I was getting this error:

Cannot convert value of type 'Regex<(Substring, Substring, Substring)>' to expected argument type 'Regex'