Regular Expressions in Swift

Currently in Swift, regular expressions are clunky and annoying to deal with for the most part. This is primarily because using them means reaching for the API provided by NSRegularExpression, which is fairly limited and not very Swifty. Because of this, I wanted to ask the community about its thoughts on the future of regular expressions in Swift. Is this something that should be added soon? What should this look like in Swift? Should regexes be given first-class support in Swift? What are some implementation challenges that may be involved? What steps can be taken to move this area of strings forward? Or anything else related.

This has been discussed before in [1], [2] and in @Michael_Ilseman's State of String: ABI, Performance, Ergonomics, and You! doc.

6 Likes

The 3 things that immediately come to mind as important to have in Swift regex support:

  • It ought to be generic:
    Regex<Element>, Regex<Collection> or <R: Regex> where R.Element == Character

  • It ought to be able to support importing a variety of flavors of literal regex:
    try StringRegex<Unicode.Scalar>(flavor: .posix(.basic), #"\(ab\)\1"#)

  • It ought to support a reasonably verbose function/operator/method form,
    because literal regexes are often illegible in their compactness.
    (names to be bike-shedded):
    choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end) // "[abcd]{1,3}$"
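
To make the third point a bit more concrete, here is a rough sketch of how such a builder could render down to an ordinary pattern string. Every name in it (Pattern, Anchor, choose(oneOf:), repeated(_:), anchor(at:)) is hypothetical and only mirrors the spelling used above; a real design would more likely build a typed representation than concatenate strings.

```swift
import Foundation

// Hypothetical builder type: nothing here exists in the standard library.
struct Pattern {
    // The regex source this builder value renders to.
    let source: String

    // Repeats the wrapped pattern between the bounds of `range` times.
    func repeated(_ range: ClosedRange<Int>) -> Pattern {
        Pattern(source: "\(source){\(range.lowerBound),\(range.upperBound)}")
    }

    // Concatenation: match `lhs`, then `rhs`.
    static func + (lhs: Pattern, rhs: Pattern) -> Pattern {
        Pattern(source: lhs.source + rhs.source)
    }
}

enum Anchor { case start, end }

// Builds a character class from the characters of `characters`,
// escaping the few characters that are special inside a class.
func choose(oneOf characters: String) -> Pattern {
    let escaped = characters
        .map { "\\]^-".contains($0) ? "\\\($0)" : String($0) }
        .joined()
    return Pattern(source: "[\(escaped)]")
}

// Renders a start-of-input or end-of-input anchor.
func anchor(at position: Anchor) -> Pattern {
    Pattern(source: position == .start ? "^" : "$")
}

let pattern = choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end)
print(pattern.source) // [abcd]{1,3}$
```

The rendered string can be passed to the existing Foundation APIs, for example "xxab".range(of: pattern.source, options: .regularExpression).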

15 Likes

I've used the NSRegularExpression API a fair bit in my day, and my personal opinion is that the biggest flaw in the API is that it uses strings. I don't think that the proper solution is using Swift methods or DSLs either. Instead, we should have RegEx literals that are checked at compile time and that we can Option-click in Xcode to get documentation on the expression we wrote. I think the ability to convert strings to RegEx would be something to keep around, but most of the time I hardcode the expressions, and in that case literals would be really nice.

14 Likes

I fully agree @calebkleveter, I think that compile-time checking for regular expressions and the concept of RegEx literals would be extremely useful. If it were checked at compile-time, a hypothetical RegEx literal could also be highlighted accordingly which would really make them easier to read. Because of this, a unique syntax is probably in order to differentiate it from a string literal (maybe using / instead of ").

NSRegularExpression also has other problems that make it harder to use. Firstly, it is not geared for use in Swift, as it is fundamentally tied to Objective-C and makes use of types like NSString, NSRange, NSMutableString, etc., even using Objective-C in its documentation. Moreover, it makes use of pointers, which makes it even less user-friendly. As well, NSRegularExpression's algorithms are not generic over StringProtocol, making working with Substring more of a hassle.

There are also a bunch of algorithms missing for regular expressions, such as splitting with a RegEx as the delimiter, matching or replacing a regex in a string up to a specified maximum number of times, removing matches of a RegEx from a string (as opposed to replacing occurrences with an empty string, as I currently do), and lazy iteration through the matches of a pattern in a string as an alternative to NSRegularExpression's enumerateMatches(in:options:range:using:). Furthermore, if this were added to the standard library, maybe Foundation could provide extensions that allow it to easily work with files.
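
As an illustration of the first missing algorithm, this is roughly what splitting on a RegEx delimiter takes today when layered on NSRegularExpression. The method name split(separatorPattern:) is made up for this sketch, and it assumes the pattern never produces empty matches.

```swift
import Foundation

extension String {
    // Hypothetical helper: splits the string wherever `pattern` matches.
    // Throws if `pattern` is not a valid regular expression.
    func split(separatorPattern pattern: String) throws -> [Substring] {
        let regex = try NSRegularExpression(pattern: pattern)
        let fullRange = NSRange(startIndex..<endIndex, in: self)
        var pieces: [Substring] = []
        var pieceStart = startIndex
        for match in regex.matches(in: self, options: [], range: fullRange) {
            // Convert the UTF-16 based NSRange back into String indices.
            guard let range = Range(match.range, in: self) else { continue }
            pieces.append(self[pieceStart..<range.lowerBound])
            pieceStart = range.upperBound
        }
        // Whatever follows the last delimiter is the final piece.
        pieces.append(self[pieceStart...])
        return pieces
    }
}

// The pattern is a known-good literal, so try! is acceptable in this demo.
let parts = try! "a1b22c".split(separatorPattern: "[0-9]+")
print(parts) // ["a", "b", "c"]
```

Note how much of the body is NSRange bookkeeping, which is the kind of friction a native API could remove.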

Having a nice, succinct, swifty API for regular expressions with first-class support would make a big difference in terms of readability. For example, right now it is not as easy as it probably should be to check if a string matches a RegEx pattern:

// Now (requires Foundation; someString is any String):
let somePattern = #"..."#
let matchesRegex = someString.range(of: somePattern, options: .regularExpression, range: nil, locale: nil) != nil

This is problematic because matchesRegex is false both when the pattern fails to match and when the pattern is an invalid RegEx, and we can't tell which. To distinguish the two cases, one would need NSRegularExpression.

guard let _ = try? NSRegularExpression(pattern: somePattern) else {
   fatalError("Regular expression pattern is invalid.")
}
// check for match ...
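
Filling in the elided match check, a minimal sketch could look like the following. The free function matches(_:pattern:) is a hypothetical helper, not an existing API; it surfaces an invalid pattern as a thrown error rather than a silent false.

```swift
import Foundation

// Hypothetical helper: throws if `pattern` is not a valid regular expression,
// otherwise reports whether it matches anywhere in `string`.
func matches(_ string: String, pattern: String) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern)
    let range = NSRange(string.startIndex..<string.endIndex, in: string)
    return regex.firstMatch(in: string, options: [], range: range) != nil
}

do {
    print(try matches("swift-5", pattern: #"\w+-\d"#)) // true
    // The unbalanced parenthesis makes this pattern invalid, so this call throws.
    print(try matches("swift", pattern: "(unclosed"))
} catch {
    print("Invalid pattern: \(error)")
}
```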

In a hypothetical implementation, it could be as easy as the following:

let matchesRegex = someString.matches(/.../)
// A compile-time error is emitted if the RegEx literal is invalid

Lastly, a native Swift implementation has the potential to be quite powerful as it could leverage Swift's behaviour around characters and grapheme clusters.

extension StringProtocol {
    func matches<T>(_ pattern: T) -> Bool where T: StringProtocol {
        return self.range(of: pattern, options: .regularExpression, range: nil, locale: nil) != nil
    }
}

let str = "\u{D55C}" // 한
let pattern = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ

print(str) // 한
print(pattern) // 한

print(str.unicodeScalars.elementsEqual(pattern.unicodeScalars)) // false
print(str == pattern) // true
print(str.matches(pattern)) // false

In Swift, string equality allows for the same characters composed in different ways to be considered equal (while their respective Unicode scalars are not necessarily equal). Because NSRegularExpression does not leverage this type of equality, checking if str matches pattern returns false, even though String's default semantics dictate that the two are in fact equal. A Swift implementation could allow for the use of such semantics, with Unicode-scalar equality available as an explicit option.
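
Until something like that exists, one blunt approximation is to normalize both the input and the pattern to the same Unicode form before matching. The sketch below uses Foundation's precomposedStringWithCanonicalMapping (NFC); the method name matchesCanonically(_:) is hypothetical, and normalizing the pattern text is only safe here because the pattern is plain text with no escapes or character classes.

```swift
import Foundation

extension String {
    // Hypothetical helper: approximates canonical-equivalence matching by
    // normalizing both the receiver and the pattern to NFC first.
    func matchesCanonically(_ pattern: String) -> Bool {
        let normalizedSelf = precomposedStringWithCanonicalMapping
        let normalizedPattern = pattern.precomposedStringWithCanonicalMapping
        return normalizedSelf.range(of: normalizedPattern, options: .regularExpression) != nil
    }
}

let precomposed = "\u{D55C}"                  // one precomposed scalar
let decomposed = "\u{1112}\u{1161}\u{11AB}"   // three conjoining jamo scalars
print(precomposed.matchesCanonically(decomposed))
```

This only covers canonical composition. It does not give the matcher grapheme-cluster awareness, which is what a native implementation could offer.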

Swift's API for working with strings and extracting information is lacking and regular expressions would be a very good step in the right direction.

A few questions about regular expressions:

  1. Do you think that regular expressions should have their own literal syntax? If so, how should it look?
  2. Should regular expressions be incorporated into the standard library or available as a standalone module (potentially with compiler support)?
  3. What algorithms should be available to work with regular expressions?

4 Likes

  1. As I mentioned earlier, yes, I think there should be literals. As far as syntax goes, as long as we are actually writing RegEx, I'm not too picky on what delimiters are used. Using an opening and closing forward-slash like JavaScript is fine with me.

  2. RegEx is a pretty heavy subject, so I think it would make sense to put it in a separate module with compiler support. It would make sense to call this module RegEx, but then name-spacing gets weird because I think it would also make sense to just call the main type RegEx and then you have RegEx.RegEx and that can become a real issue down the road. In either case, I don't want it in Foundation. That module is already such a dumping ground and I don't want it getting any worse.

  3. While I think your equality operator is a cool idea, I'm going to vote against it. On first glance it gives the impression that we are checking that the string is equal to the RegEx expression. Instead, something like ~= would make more sense, being the operator for 'roughly equal to'. The String.matches(_: RegEx) would also make sense to have, along with overloads for methods that do any sort of String matching so they can take in RegEx instead.

2 Likes

@calebkleveter, the equality operator was only used to show the current semantics of String in that example. If we were to introduce an operator to check if a string matches a regular expression pattern, I would also agree that ~= is the right choice as it is literally the pattern matching operator and it would allow RegEx patterns to be used to match a string in a switch statement.
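
As a sketch of what that could look like with today's tools, overloading ~= for NSRegularExpression is enough to use a pattern as a switch case. The classify function and the two patterns are purely illustrative.

```swift
import Foundation

// Lets an NSRegularExpression appear as a `case` pattern when switching over a String.
func ~= (pattern: NSRegularExpression, value: String) -> Bool {
    let range = NSRange(value.startIndex..<value.endIndex, in: value)
    return pattern.firstMatch(in: value, options: [], range: range) != nil
}

// Illustrative only: labels the input according to which pattern matches first.
func classify(_ input: String, date: NSRegularExpression, time: NSRegularExpression) -> String {
    switch input {
    case date: return "date"
    case time: return "time"
    default: return "unknown"
    }
}

// Both patterns are known-good literals, so try! is acceptable in this demo.
let isoDate = try! NSRegularExpression(pattern: #"^\d{4}-\d{2}-\d{2}$"#)
let clockTime = try! NSRegularExpression(pattern: #"^\d{2}:\d{2}$"#)

print(classify("2020-03-09", date: isoDate, time: clockTime)) // date
print(classify("12:30", date: isoDate, time: clockTime))      // time
print(classify("hello", date: isoDate, time: clockTime))      // unknown
```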

Brainstorming a bit, compile-time recognition and validation of regular expressions could also weave into the type system in really unique ways. For example, if the compiler knows how many capture groups a regex has, its match method could return a tuple with exactly that many elements, making destructuring a breeze:

let re = /(\d{4})-(\d{2})-(\d{2})/
print(type(of: re))  // "Regex<(String, String, String)>" 🤷🏻‍♂️

if let (year, month, day) = re.match("2020-03-09") {
  // do something with the match
} else {
  // no match
}

It could even parse named capture groups as tuple labels, if you wanted something a little more formal to pass around:

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
print(type(of: re))  // "Regex<(year: String, month: String, day: String)>" 🤷🏻‍♂️

if let match = re.match("2020-03-09") {
  print(match.year, match.month, match.day)
}

Take it further: what if you could add a type annotation using any type that conformed to, say, LosslessStringConvertible, and the match would only succeed if the regex match was valid and all the type conversions succeeded?

let re = /(?<year: Int>\d{4})-(?<month: Int>\d{2})-(?<day: Int>\d{2})/
print(type(of: re))  // "Regex<(year: Int, month: Int, day: Int)>" 🤷🏻‍♂️

let valid = re.match("2020-03-09")
  // valid = (year: 2020, month: 3, day: 9)
let invalid = re.match("202X-03-09")
  // invalid = nil

This glosses over a lot of API complexity (like what if you need to iterate over multiple occurrences, or what if you need more information about the capture groups than just the text that was extracted, like the indexes into the string where they were found), but it would also be great to have an API that makes the simple cases extremely simple while also taking full advantage of the power of Swift's type safety.
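
For comparison, here is approximately what the typed date example costs when written by hand against NSRegularExpression. The function matchDate(_:) is a hypothetical one-off; the point of the idea above is that the compiler could derive all of this from the literal.

```swift
import Foundation

// Hypothetical helper: matches an ISO-style date and converts each capture to Int.
// Returns nil if the text does not match or any conversion fails.
func matchDate(_ string: String) -> (year: Int, month: Int, day: Int)? {
    // The pattern is a fixed literal, so a failure here would be a programmer error.
    let regex = try! NSRegularExpression(pattern: #"^(\d{4})-(\d{2})-(\d{2})$"#)
    let fullRange = NSRange(string.startIndex..<string.endIndex, in: string)
    guard let match = regex.firstMatch(in: string, options: [], range: fullRange) else {
        return nil
    }

    // Capture group 0 is the whole match; groups 1...3 are year, month, day.
    let fields = (1...3).compactMap { index -> Int? in
        guard let range = Range(match.range(at: index), in: string) else { return nil }
        return Int(string[range])
    }
    guard fields.count == 3 else { return nil }
    return (year: fields[0], month: fields[1], day: fields[2])
}

if let date = matchDate("2020-03-09") {
    print(date.year, date.month, date.day) // 2020 3 9
}
print(matchDate("202X-03-09") == nil) // true
```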

23 Likes

I can't find the thread anymore, but another interesting idea that's been brought up in the past is introducing something like F# active patterns which would allow user defined types to support destructuring in pattern matches. It's definitely a more complicated and general feature, but might allow defining the bulk of regex support in libraries.

3 Likes

FYI, there's an updated discussion since the first link (only 6 months old, so relatively recent), with a prototype implementation: [Prototype] Protocol-powered generic trimming, searching, splitting,

It's a good time to think about this. We recently added RangeSet and DiscontiguousSlice, and I think that might have reignited interest in predicate-matching (perhaps something like this pitch). If we can come up with a cohesive design for all of this, we should have a solid foundation for one-off patterns via regex literals.

Just to throw some more ideas into the mix:

  • regex pattern matching is the dual of printing/formatting. I see them as very similar to string interpolation in a lot of ways. Just like interpolation, matching should be type driven (types should specify their matching rules) and there should be some way to customize formatting (e.g. the equivalent of printf style modifiers).

  • regex matching in Swift should integrate with pattern matching in general.

  • Perl 6 has some really great things in this department. That community has spent a very large amount of time thinking about regexes. Perl 6 is not taking off in a huge way as a general language, but it makes sense to look at the things they are really great at and learn from them.

26 Likes