A regular expressions library

I've created a regular expressions library for Swift! Check it out here. I would appreciate some feedback. Here are some examples:

var inputText = "name: Chris Lattner"

// create the regular expression object
let regex = try Regex(
    pattern: "name: ([a-z]+) ([a-z]+)",
    regexOptions: [.caseInsensitive]
)
 
if let match = try inputText.regexMatch(regex) {
    print("full match: '\(match.fullMatch)'")
    print("first capture group: '\(match.groups[0]!.match)'")
    print("second capture group: '\(match.groups[1]!.match)'")
    
    // perform a replacement on the first capture group
    inputText.replaceSubrange(
        match.groups[0]!.range, with: "Steven"
    )
    
    print("after replacing text: '\(inputText)'")
}

// full match: 'name: Chris Lattner'
// first capture group: 'Chris'
// second capture group: 'Lattner'
// after replacing text: 'name: Steven Lattner'

let name = "Charles Darwin"
let reversedName = try name.regexSub(
    #"(\w+) (\w+)"#,
    with: "$2 $1"
    // $1 and $2 represent the
    // first and second capture group, respectively.
    // $0 represents the entire match.
)
// reversedName = "Darwin Charles"

My library also supports naming the capture groups. For example:

var inputText = "season 8, EPISODE 5; season 5, episode 20"

// create the regular expression object
let regex = try Regex(
    pattern: #"season (\d+), Episode (\d+)"#,
    regexOptions: [.caseInsensitive],
    groupNames: ["season number", "episode number"]
    // the names of the capture groups
)
        
let results = try inputText.regexFindAll(regex)
for result in results {
    print("fullMatch: '\(result.fullMatch)'")
    print("capture groups:")
    for captureGroup in result.groups {
        print("    '\(captureGroup!.name!)': '\(captureGroup!.match)'")
    }
    print()
}
let firstResult = results[0]
// perform a replacement on the first full match
inputText.replaceSubrange(
    firstResult.range, with: "new value"
)
print("after replacing text: '\(inputText)'")

// fullMatch: 'season 8, EPISODE 5'
// capture groups:
//     'season number': '8'
//     'episode number': '5'
//
// fullMatch: 'season 5, episode 20'
// capture groups:
//     'season number': '5'
//     'episode number': '20'
//
// after replacing text: 'new value; season 5, episode 20'

great work! have you thought about removing the NSRegularExpression dependency and leveraging swift’s Optional<T> and Array<T> types to model ? and *, respectively? string-based regular expressions are so limiting after all, and most of the time you want to do more with capture groups than just match a substring. i have a working sketch of this idea here but of course, it still needs a lot more design work to refine it into a usable framework.

Thank you for your response.

Are you suggesting that I implement my own regex engine? I wouldn't even know where to begin doing something like that. And if I did, I would want it to have all of the same features as NSRegularExpression, such as lookahead and lookbehind. That sounds extremely complicated and beyond my skillset.

I'm not sure what a non-string-based regular expression pattern would look like. Could you give me an example? And how does Swift's Optional type fit into this?

it’s not as hard as it sounds, if you start with the basic features and work your way up to the more advanced functionality :)

let’s say you want to parse a tv episode descriptor like your last example. to make things interesting, you might want to allow a user to specify both a single episode and a range of episodes. but if a show only has one season, a user might only specify the episode number. so a string-based regex for that might look like this:

/\s*(season\s+(\d+)\s*,\s*)?episode\s+(\d+)\s*(-\s*(\d+)\s*)?/
     ^~~~~~~~~~~~~~~~~~~~~            ^~~~~   ^~~~~~~~~~~~~~~
              ^~~~~              episode start     ^~~~~
              season                             episode end 
     optional capture group 0               optional capture group 2

which would produce

((season:Substring)?, episodeStart:Substring, (episodeEnd:Substring)?)

let episodeStart = capture.episodeStart
guard let season = capture.0?.season, let episodeEnd = capture.2?.episodeEnd
...

and accept all of the following strings:

"season 1, episode 1"
" season 1 ,episode  1"
"  episode  20 "
" episode  9- 2"
"season 5 , episode 2-4"
"season 5, episode 2 - 4"
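
for comparison, here is a rough sketch of how that string-based pattern could be run through NSRegularExpression. parseDescriptor and its tuple labels are made up for illustration and are not part of any library. wrapping the pattern in ^(?: )$ is my own addition, so that only whole-string matches count.

```swift
import Foundation

// the pattern from above, as a raw string so backslashes need no escaping
let pattern = #"\s*(season\s+(\d+)\s*,\s*)?episode\s+(\d+)\s*(-\s*(\d+)\s*)?"#

// hypothetical helper: returns nil unless the whole input matches
func parseDescriptor(_ input: String)
    -> (season: String?, episodeStart: String, episodeEnd: String?)? {
    // ^(?: ... )$ anchors the match without shifting the group numbers
    guard let regex = try? NSRegularExpression(pattern: "^(?:" + pattern + ")$") else {
        return nil
    }
    let whole = NSRange(input.startIndex..., in: input)
    guard let match = regex.firstMatch(in: input, options: [], range: whole) else {
        return nil
    }
    // a group that did not take part in the match has location NSNotFound
    func group(_ index: Int) -> String? {
        let range = match.range(at: index)
        guard range.location != NSNotFound, let r = Range(range, in: input) else {
            return nil
        }
        return String(input[r])
    }
    // groups 2, 3 and 5 are the season, episode start and episode end digits
    guard let start = group(3) else { return nil }
    return (season: group(2), episodeStart: start, episodeEnd: group(5))
}

let full = parseDescriptor("season 5, episode 2 - 4")
let short = parseDescriptor("  episode  20 ")
```

every capture comes back as a string (or nil), so turning them into an Int and a Range<Int> is still left to the caller.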

but it would be a lot easier to use if we could express this regex pattern using swift, instead of a string literal using its own niche syntax. what if we could do the following?

struct Digit:Parseable 
{
    // regex for [\d]
}
struct Space:Parseable 
{
    // regex for [\s]
}

struct Season:Parseable.Terminal 
{
    static 
    let token:String = "season"
}
struct Episode:Parseable.Terminal 
{
    static 
    let token:String = "episode"
}
struct Comma:Parseable.Terminal 
{
    static 
    let token:String = ","
}
struct Hyphen:Parseable.Terminal 
{
    static 
    let token:String = "-"
}

// parser for /\d+/
struct Integer:Parseable 
{
    let value:Int 
    
    static 
    func parse(_ context:inout ParsingInput) throws -> Self 
    {
        let head:Digit      = try .parse(&context), 
            body:[Digit]    =     .parse(&context)
        // pretend we have an init that takes a digit sequence
        return .init([head] + body) 
    }
}
// parser for /\s+/
struct Whitespace:Parseable 
{
    static 
    func parse(_ context:inout ParsingInput) throws -> Self 
    {
        let _:Space         = try .parse(&context), 
            _:[Space]       =     .parse(&context)
        return .init() 
    }
}
struct Title:Parseable 
{
    let season:Int, 
        episodes:Range<Int>
    
    static 
    func parse(_ context:inout ParsingInput) throws -> Self 
    {
        let _:Whitespace?   =     .parse(&context), 
            season:
                List<Season, 
                List<Whitespace, 
                List<Integer, 
                List<Whitespace?, 
                List<Comma, Whitespace?>>>>>? = 
                                  .parse(&context), 
            _:Episode       = try .parse(&context), 
            _:Whitespace    = try .parse(&context), 
            start:Integer   = try .parse(&context), 
            _:Whitespace?   =     .parse(&context), 
            end:
                List<Hyphen, 
                List<Whitespace?, 
                List<Integer, Whitespace?>>>? = 
                                  .parse(&context)
        return .init(season: (season?.body.body.head.value ?? 1) - 1, 
            episodes: start.value - 1 ..< 
                (end?.body.body.head.value ?? start.value)) 
    }
}

of course, it would be the job of your library to define the protocols Parseable and Parseable.Terminal, implement List<T, U>, and to conform Optional<T> and Array<T> to Parseable. (hint, you can use do-catch to implement the requirement for Optional<T>, and use your Optional<T> implementation to implement Array<T>)
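
to make the hint concrete, here is a minimal sketch of what those library pieces might look like. everything in it is an assumption made up for illustration: the shape of ParsingInput, the ParsingError type, a top-level Terminal protocol standing in for Parseable.Terminal, and the "no progress" check in the Array conformance.

```swift
// hypothetical sketch: none of these names come from an existing library
struct ParsingError: Error {}

struct ParsingInput {
    let buffer: [Character]
    var position = 0
    init(_ string: String) { buffer = Array(string) }

    // consumes and returns the next character, or nil at the end of input
    mutating func next() -> Character? {
        guard position < buffer.count else { return nil }
        defer { position += 1 }
        return buffer[position]
    }
}

protocol Parseable {
    static func parse(_ input: inout ParsingInput) throws -> Self
}

// a fixed token such as "season"; conforming types only supply `token`
protocol Terminal: Parseable {
    static var token: String { get }
    init()
}
extension Terminal {
    static func parse(_ input: inout ParsingInput) throws -> Self {
        for expected in token {
            guard input.next() == expected else { throw ParsingError() }
        }
        return Self()
    }
}

// models `?`: on failure, rewind the read position and produce nil
extension Optional: Parseable where Wrapped: Parseable {
    static func parse(_ input: inout ParsingInput) -> Wrapped? {
        let start = input.position
        do {
            return try Wrapped.parse(&input)
        } catch {
            input.position = start
            return nil
        }
    }
}

// models `*`: keep parsing optional elements until one fails
extension Array: Parseable where Element: Parseable {
    static func parse(_ input: inout ParsingInput) -> [Element] {
        var elements: [Element] = []
        while true {
            let start = input.position
            // stop on failure, or when an element consumed no input
            guard let element = Optional<Element>.parse(&input),
                  input.position > start else { break }
            elements.append(element)
        }
        return elements
    }
}

// sequences two parsers; nesting List types chains longer sequences
struct List<Head: Parseable, Body: Parseable>: Parseable {
    let head: Head, body: Body
    static func parse(_ input: inout ParsingInput) throws -> List {
        let head = try Head.parse(&input)
        let body = try Body.parse(&input)
        return List(head: head, body: body)
    }
}

struct Season: Terminal { static let token = "season" }
struct Space: Terminal { static let token = " " }
struct Digit: Parseable {
    let value: Int
    static func parse(_ input: inout ParsingInput) throws -> Digit {
        guard let c = input.next(), c.isASCII, let v = c.wholeNumberValue
        else { throw ParsingError() }
        return Digit(value: v)
    }
}

var input = ParsingInput("season 12")
let parsed = Optional<List<Season, List<Space, [Digit]>>>.parse(&input)
let number = parsed?.body.body.reduce(0) { $0 * 10 + $1.value }  // 12
```

the rewind in the Optional conformance is what makes the nested List types safe to wrap in `?`: if any later element fails, the whole optional group gives back the input it consumed.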

And how would I use these structs to parse data from an input string?

your implementation for the ParsingInput type in the example above would store the input, and the current buffer read position. so you might define a convenience method in an extension to Parseable that sets up the input buffer from a String argument.
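
a minimal sketch of that idea, assuming a ParsingInput that keeps the characters in an array next to an integer read position. the names buffer, position, isAtEnd and ParsingError, the parse(_:String) overload, and the simplified Integer parser are all hypothetical.

```swift
// hypothetical sketch; the names below are assumptions, not an existing API
struct ParsingError: Error {}

struct ParsingInput {
    let buffer: [Character]   // the input being parsed
    var position = 0          // the current read position
    init(_ string: String) { buffer = Array(string) }
    var isAtEnd: Bool { position == buffer.count }
}

protocol Parseable {
    static func parse(_ input: inout ParsingInput) throws -> Self
}

extension Parseable {
    // convenience entry point: wraps the string in a ParsingInput,
    // runs the parser, and rejects any input left over afterwards
    static func parse(_ string: String) throws -> Self {
        var input = ParsingInput(string)
        let value = try parse(&input)
        guard input.isAtEnd else { throw ParsingError() }
        return value
    }
}

// simplified stand-in for the Integer parser in the earlier example
struct Integer: Parseable {
    let value: Int
    static func parse(_ input: inout ParsingInput) throws -> Integer {
        var digits = ""
        while !input.isAtEnd,
              input.buffer[input.position].isASCII,
              input.buffer[input.position].isNumber {
            digits.append(input.buffer[input.position])
            input.position += 1
        }
        guard let value = Int(digits) else { throw ParsingError() }
        return Integer(value: value)
    }
}

let episode = try? Integer.parse("20")    // parses the whole string
let invalid = try? Integer.parse("20x")   // nil: trailing input is rejected
```

with something like this in place, a caller never touches ParsingInput directly and only ever calls parse on the outermost type.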

I don't understand how any of this code works. Could you give me an example in which you use these structs to parse a string containing a season and episode number?