Regular Expressions in Swift

Currently, regular expressions in Swift are clunky and annoying to deal with for the most part. This is primarily because using them means reaching for the API provided by NSRegularExpression, which is fairly limited and not very Swifty. Because of this, I wanted to ask the community for its thoughts on the future of regular expressions in Swift. Is this something that should be added soon? What should it look like in Swift? Should regexes be given first-class support? What are some implementation challenges that may be involved? What steps can be taken to move this area of strings forward? Anything else related is welcome too.

This has been discussed before in [1], [2] and in @Michael_Ilseman's State of String: ABI, Performance, Ergonomics, and You! doc.

9 Likes

The 3 things that immediately come to mind as important to have in Swift regex support:

  • It ought to be generic:
    Regex<Element>, Regex<Collection> or <R: Regex> where R.Element == Character

  • It ought to be able to support importing a variety of flavors of literal regex:
    try StringRegex<Unicode.Scalar>(flavor: .posix(.basic), #"\(ab\)\1"#)

  • It ought to support a reasonably verbose function/operator/method form,
    because literal regexes are often illegible in their compactness.
    (names to be bike-shedded):
    choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end) // "[abcd]{1,3}$"
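
For illustration, the verbose form in the last bullet could be prototyped today as a thin layer that only assembles a pattern string. This is a minimal sketch, not a proposed design: `Fragment`, `Anchor`, `choose(oneOf:)`, `repeated(_:)` and `anchor(at:)` are hypothetical names, and the sketch assumes the characters passed to `choose(oneOf:)` are plain alphanumerics that need no escaping.

```swift
// Hypothetical builder: each value wraps a piece of regex source text.
struct Fragment {
    let source: String

    // Appends a bounded quantifier. Assumes `source` is a single atom
    // (such as a character class), so no extra grouping is added.
    func repeated(_ range: ClosedRange<Int>) -> Fragment {
        Fragment(source: source + "{\(range.lowerBound),\(range.upperBound)}")
    }

    // Concatenates two fragments.
    static func + (lhs: Fragment, rhs: Fragment) -> Fragment {
        Fragment(source: lhs.source + rhs.source)
    }
}

enum Anchor { case start, end }

// Builds a character class. Assumes `characters` contains no regex metacharacters.
func choose(oneOf characters: String) -> Fragment {
    Fragment(source: "[" + characters + "]")
}

func anchor(at position: Anchor) -> Fragment {
    Fragment(source: position == .start ? "^" : "$")
}

let pattern = choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end)
print(pattern.source) // [abcd]{1,3}$
```

A real design would presumably build a structured value rather than a string, so that the flavor could be chosen when the pattern is rendered.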

21 Likes

I've used the NSRegularExpression API a fair bit in my day, and my personal opinion is that its biggest flaw is that it uses strings. I don't think the proper solution is Swift methods or DSLs either. Instead, we should have RegEx literals that are checked at compile time and that we can Option-click in Xcode to get documentation on the expression we wrote. I think converting strings to RegEx is worth keeping around, but most of the time I hardcode the expressions, and in that case literals would be really nice.

21 Likes

I fully agree @calebkleveter, I think that compile-time checking for regular expressions and the concept of RegEx literals would be extremely useful. If it were checked at compile-time, a hypothetical RegEx literal could also be highlighted accordingly which would really make them easier to read. Because of this, a unique syntax is probably in order to differentiate it from a string literal (maybe using / instead of ").

NSRegularExpression also has other problems that make it harder to use. Firstly, it is not geared for use in Swift, as it is fundamentally tied to Objective-C and makes use of types like NSString, NSRange, NSMutableString, etc., even using Objective-C in its documentation. Moreover, it makes use of pointers, which makes it even less user-friendly. In addition, NSRegularExpression's algorithms are not generic over StringProtocol, which makes working with Substring more of a hassle.

There are also a number of algorithms missing for regular expressions, such as splitting with a RegEx as the delimiter, matching or replacing a RegEx in a string up to a specified maximum number of times, removing matches of a RegEx from a string (as opposed to replacing occurrences with an empty string, as I currently do), lazy iteration through matches of a pattern in a string as an alternative to NSRegularExpression's enumerateMatches(in:options:range:using:), etc. Furthermore, if this were added to the standard library, maybe Foundation could provide extensions that allow it to easily work with files.
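
To give a sense of the boilerplate involved, here is a rough sketch of the first missing algorithm (splitting on a RegEx delimiter) built on top of NSRegularExpression. The method name `split(separatedByRegex:maxSplits:)` is made up for this example, and the sketch assumes the pattern never produces zero-width matches.

```swift
import Foundation

extension String {
    // Hypothetical helper: splits the receiver wherever `pattern` matches,
    // performing at most `maxSplits` splits.
    func split(separatedByRegex pattern: String, maxSplits: Int = .max) throws -> [Substring] {
        let regex = try NSRegularExpression(pattern: pattern)
        let fullRange = NSRange(startIndex..<endIndex, in: self)
        var pieces: [Substring] = []
        var cursor = startIndex
        for match in regex.matches(in: self, options: [], range: fullRange).prefix(maxSplits) {
            // Convert the NSRange (UTF-16 offsets) back into String indices.
            guard let range = Range(match.range, in: self) else { continue }
            pieces.append(self[cursor..<range.lowerBound])
            cursor = range.upperBound
        }
        // Whatever follows the last delimiter is the final piece.
        pieces.append(self[cursor...])
        return pieces
    }
}

let fields = try "a1b22c333d".split(separatedByRegex: #"\d+"#)
print(fields) // ["a", "b", "c", "d"]
```

Note the round trip through NSRange: this is the kind of friction a native implementation could remove.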

Having a nice, succinct, swifty API for regular expressions with first-class support would make a big difference in terms of readability. For example, right now it is not as easy as it probably should be to check if a string matches a RegEx pattern:

// Now:
let somePattern = #"..."#
let matchesRegex = someString.range(of: somePattern, options: .regularExpression, range: nil, locale: nil) != nil 

This has a problem: matchesRegex may be false either because the pattern couldn't be matched or because the pattern is an invalid RegEx, and we can't tell which. To distinguish the two, one would need NSRegularExpression.

guard let _ = try? NSRegularExpression(pattern: somePattern) else {
   fatalError("Regular expression pattern is invalid.")
}
// check for match ...
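
For completeness, one way the check could be written out so that the three outcomes stay distinct is sketched below. The `MatchOutcome` enum and the `check(_:against:)` function are hypothetical names introduced only for this example.

```swift
import Foundation

// Hypothetical result type separating "no match" from "invalid pattern".
enum MatchOutcome: Equatable {
    case matched, noMatch, invalidPattern
}

func check(_ string: String, against pattern: String) -> MatchOutcome {
    // The initializer throws if the pattern cannot be compiled.
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return .invalidPattern
    }
    let range = NSRange(string.startIndex..., in: string)
    return regex.firstMatch(in: string, options: [], range: range) == nil ? .noMatch : .matched
}

print(check("abc123", against: #"\d+"#)) // matched
print(check("abc", against: #"\d+"#))    // noMatch
print(check("abc", against: "("))        // invalidPattern
```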

In a hypothetical implementation, it could be as easy as the following:

let matchesRegex = someString.matches(/.../)
// Compile-time error is thrown if RegEx literal is invalid

Lastly, a native Swift implementation has the potential to be quite powerful, as it could leverage Swift's behaviour around characters and grapheme clusters. Consider how the current Foundation-based approach behaves:

import Foundation

extension StringProtocol {
    func matches<T>(_ pattern: T) -> Bool where T: StringProtocol {
        return self.range(of: pattern, options: .regularExpression, range: nil, locale: nil) != nil
    }
}

let str = "\u{D55C}" // 한
let pattern = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ

print(str) // 한
print(pattern) // 한

print(str.unicodeScalars.elementsEqual(pattern.unicodeScalars)) // false
print(str == pattern) // true
print(str.matches(pattern)) // false

In Swift, string equality allows for the same characters composed in different ways to be considered equal (while their respective unicode scalars are not necessarily equal). Because NSRegularExpression does not leverage this type of equality, checking if str matches pattern returns false, even though String's default semantics dictate that the two are in fact equal. A Swift implementation could allow for the use of such semantics with the unicode equality available as an explicit option.
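
Until something like that exists, one possible workaround is to normalize both the subject and the pattern to the same Unicode normalization form before matching. The sketch below uses Foundation's `precomposedStringWithCanonicalMapping` (NFC); the helper name `matchesNormalized(_:pattern:)` is hypothetical, and normalizing a pattern is only safe when it contains literal text, since it could change the meaning of things like character classes.

```swift
import Foundation

// Hypothetical helper: brings both sides to NFC before invoking the regex engine.
func matchesNormalized(_ string: String, pattern: String) -> Bool {
    let subject = string.precomposedStringWithCanonicalMapping
    let normalizedPattern = pattern.precomposedStringWithCanonicalMapping
    return subject.range(of: normalizedPattern, options: .regularExpression) != nil
}

let syllable = "\u{D55C}"             // 한 as one precomposed scalar
let jamo = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ as three conjoining scalars

print(matchesNormalized(syllable, pattern: jamo)) // true
print(matchesNormalized(jamo, pattern: syllable)) // true
```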

Swift's API for working with strings and extracting information is lacking and regular expressions would be a very good step in the right direction.

A few questions about regular expressions:

  1. Do you think that regular expressions should have their own literal syntax? If so, how should it look?
  2. Should regular expressions be incorporated into the standard library or available as a standalone module (potentially with compiler support)?
  3. What algorithms should be available to work with regular expressions?

8 Likes

  1. As I mentioned earlier, yes, I think there should be literals. As far as syntax goes, as long as we are actually writing RegEx, I'm not too picky on what delimiters are used. Using an opening and closing forward-slash like JavaScript is fine with me.

  2. RegEx is a pretty heavy subject, so I think it would make sense to put it in a separate module with compiler support. It would make sense to call this module RegEx, but then name-spacing gets weird because I think it would also make sense to just call the main type RegEx and then you have RegEx.RegEx and that can become a real issue down the road. In either case, I don't want it in Foundation. That module is already such a dumping ground and I don't want it getting any worse.

  3. While I think your equality operator is a cool idea, I'm going to vote against it. On first glance it gives the impression that we are checking that the string is equal to the RegEx expression. Instead, something like ~= would make more sense, being the operator for 'roughly equal to'. The String.matches(_: RegEx) would also make sense to have, along with overloads for methods that do any sort of String matching so they can take in RegEx instead.

4 Likes

@calebkleveter, the equality operator was only used to show the current semantics of String in that example. If we were to introduce an operator to check if a string matches a regular expression pattern, I would also agree that ~= is the right choice as it is literally the pattern matching operator and it would allow RegEx patterns to be used to match a string in a switch statement.
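
As a rough sketch of how that could look with today's tools, the wrapper type below (the names `RegexPattern` and `classify(_:)` are hypothetical) overloads ~= so that a pattern can appear directly in a case:

```swift
import Foundation

// Hypothetical wrapper around NSRegularExpression.
struct RegexPattern {
    let regex: NSRegularExpression

    init(_ source: String) throws {
        regex = try NSRegularExpression(pattern: source)
    }
}

// The pattern-matching operator: the pattern goes on the left, the value on the right.
func ~= (pattern: RegexPattern, value: String) -> Bool {
    let range = NSRange(value.startIndex..., in: value)
    return pattern.regex.firstMatch(in: value, options: [], range: range) != nil
}

func classify(_ input: String) throws -> String {
    let number = try RegexPattern(#"^\d+$"#)
    let word = try RegexPattern(#"^[A-Za-z]+$"#)
    switch input {
    case number: return "number"
    case word: return "word"
    default: return "other"
    }
}

print(try classify("42"))    // number
print(try classify("hello")) // word
print(try classify("4 2"))   // other
```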

1 Like

Brainstorming a bit, compile-time recognition and validation of regular expressions could also weave into the type system in really unique ways. For example, if the compiler knows how many capture groups a regex has, its match method could return a tuple with exactly that many elements, making destructuring a breeze:

let re = /(\d{4})-(\d{2})-(\d{2})/
print(type(of: re))  // "Regex<(String, String, String)>" 🤷🏻‍♂️

if let (year, month, day) = re.match("2020-03-09") {
  // do something with the match
} else {
  // no match
}

It could even parse named capture groups as tuple labels, if you wanted something a little more formal to pass around:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
print(type(of: re))  // "Regex<(year: String, month: String, day: String)>" 🤷🏻‍♂️

if let match = re.match("2020-03-09") {
  print(match.year, match.month, match.day)
}

Take it further: what if you could add a type annotation using any type that conformed to, say, LosslessStringConvertible, and the match would only succeed if the regex match was valid and all the type conversions succeeded?

let re = /(?<year: Int>\d{4})-(?<month: Int>\d{2})-(?<day: Int>\d{2})/
print(type(of: re))  // "Regex<(year: Int, month: Int, day: Int)>" 🤷🏻‍♂️

let match = re.match("2020-03-09")
  // match = (year: 2020, month: 3, day: 9)
let match = re.match("202X-03-09")
  // match = nil

This glosses over a lot of API complexity (like what if you need to iterate over multiple occurrences, or what if you need more information about the capture groups than just the text that was extracted, like the indexes into the string where they were found), but it would also be great to have an API that makes the simple cases extremely simple while also taking full advantage of the power of Swift's type safety.
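
Without compiler support the arity can't be derived from the pattern, but the typed-conversion part can be approximated by hand. The following is only a sketch for the three-group case; `match3(_:in:)` is a made-up helper, and the caller has to spell out the tuple type that a literal-aware compiler would infer.

```swift
import Foundation

// Hypothetical helper: matches `pattern` (which must contain exactly three
// capture groups) and converts each captured string to the requested type.
func match3<A: LosslessStringConvertible, B: LosslessStringConvertible, C: LosslessStringConvertible>(
    _ pattern: String, in string: String
) -> (A, B, C)? {
    guard let regex = try? NSRegularExpression(pattern: pattern),
          regex.numberOfCaptureGroups == 3,
          let match = regex.firstMatch(in: string, options: [],
                                       range: NSRange(string.startIndex..., in: string))
    else { return nil }

    // Returns the text of capture group `i`, or nil if it did not participate.
    func group(_ i: Int) -> String? {
        Range(match.range(at: i), in: string).map { String(string[$0]) }
    }

    // The overall match fails if any conversion fails.
    guard let a = group(1).flatMap({ A($0) }),
          let b = group(2).flatMap({ B($0) }),
          let c = group(3).flatMap({ C($0) })
    else { return nil }
    return (a, b, c)
}

let date: (Int, Int, Int)? = match3(#"(\d{4})-(\d{2})-(\d{2})"#, in: "2020-03-09")
print(date as Any) // Optional((2020, 3, 9))
```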

33 Likes

I can't find the thread anymore, but another interesting idea that's been brought up in the past is introducing something like F# active patterns which would allow user defined types to support destructuring in pattern matches. It's definitely a more complicated and general feature, but might allow defining the bulk of regex support in libraries.

3 Likes

FYI, there’s an updated discussion since the first link (only 6 months old, so relatively recent), with a prototype implementation: [Prototype] Protocol-powered generic trimming, searching, splitting,

It’s a good time to think about this. We recently added RangeSet and DiscontiguousSlice, and I think that might have reignited interest in predicate-matching (perhaps something like this pitch). If we can come up with a cohesive design for all of this, we should have a solid foundation for one-off patterns via regex literals.

2 Likes

Just to throw some more ideas into the mix:

  • regex pattern matching is the dual of printing/formatting. I see them as very similar to string interpolation in a lot of ways. Just like interpolation, matching should be type driven (types should specify their matching rules) and there should be some way to customize formatting (e.g. the equivalent of printf style modifiers).

  • regex matching in Swift should integrate with pattern matching in general.

  • Perl 6 has some really great things in this department. That community has spent a very, very large amount of time thinking about regexes. Perl 6 is not taking off in a huge way as a general language, but it makes sense to look at the things they are really great at and learn from them.

44 Likes

This came up in a conversation I had the other day, so I decided to have some fun messing around to see how far I could get with an ergonomic regex definition in Swift today. Turns out, pretty far! I implemented just a small (but functional) set of regex operators, which allows you to write this:

var last: String = ""
switch "abcbbdaabc" {
case ("abc" | "b" | "da").as({ last = $0 })*: // matches "(abc|b|da)*", with a capture
    print(last) // prints "abc"
default:
    print("no match!")
}

The missing piece of the puzzle here is really first-class pattern binding, which would avoid the up-front definition of last and the annoying closure boilerplate in as(_:), but overall it worked out better than I expected!

Implementation, no warranty of any kind, may be prone to bugs
enum Regex {
    case str(String)
    indirect case either(Regex, Regex)
    indirect case kleene(Regex)
    indirect case concat([Regex])
    indirect case capture(Regex, (String) -> Void)

    static func concat(_ regs: Regex...) -> Regex {
        .concat(regs)
    }

    func `as`(_ capture: @escaping (String) -> Void) -> Regex {
        .capture(self, capture)
    }

    func match(_ string: String, index: String.Index) -> (Bool, String.Index) {
        switch self {
        case .str(let s):
            if let toIndex = string.index(index, offsetBy: s.count, limitedBy: string.endIndex) {
                return (string[index..<toIndex] == s, toIndex)
            } else {
                return (false, string.endIndex)
            }
        case .either(let reg1, let reg2):
            let result1 = reg1.match(string, index: index)
            if result1.0 {
                return (true, result1.1)
            }

            let result2 = reg2.match(string, index: index)
            if result2.0 {
                return (true, result2.1)
            }
            return (false, max(result1.1, result2.1))
        case .kleene(let reg):
            var resultIndex = index
            while true {
                let next = reg.match(string, index: resultIndex)
                // Stop on failure, or on a zero-width match that would otherwise loop forever.
                guard next.0, next.1 > resultIndex else { break }
                resultIndex = next.1
            }
            return (true, resultIndex)
        case .concat(let regexes):
            var currentIndex = index
            for regex in regexes {
                let result = regex.match(string, index: currentIndex)
                if result.0 == false {
                    return (false, result.1)
                }
                currentIndex = result.1
            }
            return (true, currentIndex)
        case .capture(let regex, let capture):
            let result = regex.match(string, index: index)
            if result.0 {
                capture(String(string[index..<result.1]))
                return (true, result.1)
            } else {
                return (false, result.1)
            }
        }
    }
}

extension Regex: ExpressibleByStringLiteral {
    init(stringLiteral: String) {
        self = .str(stringLiteral)
    }
}

postfix operator *
postfix operator +

extension Regex {
    static func |(_ lhs: Regex, _ rhs: Regex) -> Regex {
        return .either(lhs, rhs)
    }

    static postfix func *(_ reg: Regex) -> Regex {
        return .kleene(reg)
    }

    static postfix func +(_ reg: Regex) -> Regex {
        return .concat(reg, .kleene(reg))
    }
}

func ~=(_ regex: Regex, _ rhs: String) -> Bool {
    return regex.match(rhs, index: rhs.startIndex).0
}

🙂

6 Likes

I think that introducing a special literal for regular expressions is not a good idea. It will not only further complicate the language, but it will also make the feature quite unreadable. That would be devastating, especially, when taking beginners into consideration.

This is an excellent example of how regular expressions could be introduced to the language without trading readability for short-term convenience.

5 Likes

I dunno about this. I get that some regexes are difficult to read. I wouldn't have thought regexes are tools for beginners. I'd have also thought the existing regex convention is almost second nature to many experienced developers, and that it's understood by a larger and more diverse set of people in a team who may not be skilled in Swift.

I'd rather wait for native, literal regex types with compile time and performance goodness. If I got a non-trivial regex to put into an app, this verbosity would be an unwelcome friction point, increasing cognitive load and introducing errors translating between and matching concepts. It'd also be quite difficult to debug with a regex tester.

I think we'd all get pretty frustrated if we argued against introducing special literals for mathematics because "it would complicate the language" and "make it unreadable". Compare:

let y = x*x + m*x + c
let y = x.squared.plus(m.multiplied(by: x)).plus(c)

16 Likes

IMO it’s unfair to compare math operations, which most people learn at a really young age and regularly use in their everyday life, with a feature like RegEx. Yes, it would certainly make sense to introduce some basic operators, but I think that to make a RegEx expressive there needs to be a way to combine readability with operators and existing features. For example:

This: {2,} could become: {2...}

The flag g could become .global

That is introducing Ranges, Optionals and other suitable features to RegEx so that it fits in better with Swift. Just to clarify, I’m not suggesting replacing each and every operator with another currently built-in, just tailoring RegEx for Swift.
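
To make the range idea a bit more concrete, here is a small sketch of how Swift range expressions might map onto standard quantifier syntax. The `quantifier(_:)` overloads are hypothetical and only produce the corresponding pattern text.

```swift
// Hypothetical mapping from Swift range expressions to regex quantifiers.
func quantifier(_ range: ClosedRange<Int>) -> String {
    "{\(range.lowerBound),\(range.upperBound)}"
}

func quantifier(_ range: PartialRangeFrom<Int>) -> String {
    "{\(range.lowerBound),}"
}

func quantifier(_ range: PartialRangeThrough<Int>) -> String {
    "{0,\(range.upperBound)}"
}

print(quantifier(2...))   // {2,}
print(quantifier(1...3))  // {1,3}
print(quantifier(...3))   // {0,3}
```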

3 Likes

Reinventing the wheel for Regular Expressions seems like a bad area to focus on. Their syntax is pretty much universal and interchangeable (though I'm well aware some languages have their own features). Deviating too much from the norm feels unnecessary. Regex isn't actually a tool for beginners to begin with, and I fear you'll only confuse (or even frustrate) developers who are already familiar with Regex.

And this really hurts to say, because conservatism is a thing I often fight against when it comes to technology. 😛

As far as I know, there already are expressive Regular Expression modules for Swift. And I wouldn't be opposed to making this a standard feature in Swift (whether in the standard library or a Regex module). But personally I'd rather see more attention and focus on the inclusion of Regex literals in Swift than on other implementations.

Regex literals are one of the delightful things I like in JavaScript (though the language isn't so delightful to some 😛). They're easier to catch and read with a quick eye, and probably help IDEs a bit with parsing for syntax highlighting... perhaps.

1 Like

Personally I would welcome a Swift-way of handling RegExes, even though I don’t believe they warrant a specific literal type: ExpressibleByStringLiteral ought to be enough.

What I would personally require though is for the type to be generic on the number and type of its capture groups. Any implementation that doesn’t do that isn’t good enough for inclusion in official Swift modules imho (unless such design is demonstrated to be impossible/impractical/less usable)

1 Like

I'd love to see Regex become an "object" in its own right along the lines of Class, Struct and Enum. Chris suggested to explore Perl 6 ideas and implementation of Regex and I very much agree with him.
https://docs.perl6.org/language/regexes

4 Likes

But we already have introduced special literals for mathematics. All we are given that aligns with mathematics are the basic operators +-*/ and, I suppose, %. But we have no mathematical notation for exponentiation, no roots, no integrals or derivatives, no dot or cross products, set union or set intersection, no summations, no products, and so on; we have to write all of those out.

As for readability, mathematical expressions in code tend to stay small, with larger expressions broken down into smaller steps where possible. I don't see that with regular expressions.

2 Likes

I would actually argue the other way. Granted, I do have my biases and can practically read and write RegEx just as easily as Swift itself.

RegEx syntax is supposed to be concise, not self-documented, and I think there is a good reason for that. If you are using RegEx, that would mean the find/replace operation that you are running will be rather complicated (otherwise you would simply use the StringProtocol methods built into the standard library since they are easier to read and write). The result being that if you use a 'user friendly' DSL, your expressions will become so hideously long to the point where it could easily be just as hard to read and impossible to reasonably type check.

This is one of those cases in which I think developers who actually need RegEx should bite the bullet and learn the syntax. And not too many developers actually need RegEx. (As the old joke goes, you have a problem, so you decide to solve it with RegEx. Now you have two problems.) Even I learned it in a case where I should have used a different tool (in this case, parsing SPM manifest files; I should have used SourceKit instead).

As far as using a standard-like RegEx syntax that is 'Swiftified', I'm not convinced that's a good idea either. RegEx is something you should only have to learn once. This, sadly, isn't really the case since each language already has its own dialect, but we can mitigate that by not creating our own and using Perl's dialect instead.

15 Likes

I think this sums up my opinion and experiences pretty well. I learned RegEx pretty early on when I was learning to program and thought it was so useful and probably reached for it a little too often (I began programming in Ruby and Python). Later on, I worked for a company where a senior engineer used RegEx for everything. This engineer had been at the company almost since it was first formed and had created several scripts used extensively internally and also externally by some of our more hands-on customers.

Within a year after this engineer left the company, many of his scripts began to break and I was tasked with fixing his collection of scripts. I quickly learned that he used regexes for everything from validating that a variable was a float by casting it to a string and performing a regex search, to splitting strings on a single character and then getting the last group. He was extremely well-versed with RegEx, which made him appear to be an advanced developer, but it ended up being the one and only tool he reached for whenever a problem arose. As a result, much of his code had sub-optimal performance, was difficult to understand, and a nightmare to maintain.

My experience of rewriting his scripts drove me to avoid using regexes unless they were truly the right tool for the job. I worry that making RegExes too easy could lead to a situation where beginners unknowingly reach for the wrong tool just because it's easy and works.
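
For what it's worth, both of the tasks mentioned above have direct standard-library equivalents in Swift that need no regex at all. A minimal sketch (the function names `isFloat(_:)` and `lastField(of:separator:)` are made up for illustration):

```swift
// Validating that text is a float: let the number parser decide.
func isFloat(_ text: String) -> Bool {
    Double(text) != nil
}

// Splitting on a single character and taking the last piece.
func lastField(of text: String, separator: Character) -> Substring? {
    text.split(separator: separator).last
}

print(isFloat("3.14"))                         // true
print(isFloat("3.14abc"))                      // false
print(lastField(of: "a,b,c", separator: ",")!) // c
```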

Don't get me wrong though, RegExes are still extremely useful and I absolutely consider them necessary in some instances. I do believe they deserve first-class literal support to make constructing and using them simpler than it is today; however, I would be opposed to creating or using a library that makes RegExes constructible using human-readable syntax or APIs.

RegExes are an advanced solution for advanced problems, and human-readable RegEx construction would lead to novice developers using an expert-level feature too early, contrary to the Swift goal of "progressive disclosure." Rather, we should eventually incorporate them into the language using the easily recognized, portable regex literal syntax that many other languages provide (e.g., PCRE). I imagine this would look similar to string literals, maybe even a little closer to raw string syntax (to avoid issues with escape sequences). There should be a first-class regex type/syntax that can simplify common regex operations (e.g., retrieval of capture groups), but human-readable construction of regexes should be rejected.
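
The escape-sequence point is easy to see with today's string literals. In the small sketch below (the variable names are arbitrary), the same pattern is written once as an ordinary string and once as a raw string:

```swift
// In an ordinary string literal every regex backslash must be doubled.
let escaped = "\\d{4}-\\d{2}-\\d{2}"

// A raw string literal keeps the pattern exactly as a regex tester would show it.
let raw = #"\d{4}-\d{2}-\d{2}"#

print(escaped == raw) // true
```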

8 Likes