Regular Expressions in Swift

For me, the RegEx engine should be developed without any concern for novice programmers. That should be the least of our worries. We will do the language a disservice if the first thing we think of is how this language feature may be abused by novices.

2 Likes

This is a very interesting story. Have you considered submitting it to thedailywtf.com?

I don't think decreasing usability is a valid option for deterring abuse. It will just make it harder for people to learn it, and those who want to abuse it still will (that RegEx-savvy engineer in your anecdote for example).

I think we can take some inspiration from how Xcode helps with writing RegEx for finding/replacing words in the text editor, and translate it into something a bit more code-like:

4 Likes

I respectfully disagree. Every single new feature design is evaluated for the harm that can be done by beginners. Swift aims to make correct programming simple and if a feature is deemed to create a trap that novice programmers will undoubtedly fall into then the feature is either changed to design around that problem, documented appropriately (if the risk is deemed worth it), or rejected altogether. Swift tries to eliminate foot-guns in many areas and we are able to because of this exact proposal process where people think about the answers to questions like "how will a novice programmer use this?". If we never thought about how code was used by beginners then Swift would be a very different language.

Swift indexes are a prime example where the Swift language chose a specific direction because using integer-based indexes can lead to bugs that are challenging to find and fix. Is it more difficult for novice programmers who are only used to integer-based indexes? Absolutely. Is that a bad thing? Maybe sometimes, but if it can be proven to help reduce bugs then the goal has been accomplished.
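To make the index point concrete, here is a small sketch of why String does not offer integer subscripts. The sample string is my own and chosen only for illustration: one user-visible character can span several code units, so an integer offset is ambiguous unless you say which view you mean.

```swift
// "Café!" spelled with a combining accent: "e" followed by U+0301.
let word = "Cafe\u{301}!"

// A Character is a grapheme cluster, so the views disagree on length.
let characterCount = word.count        // 5 characters
let utf16Count = word.utf16.count      // 6 UTF-16 code units

// word[3] does not compile. An index has to be derived from the string itself,
// which makes the (linear) cost of walking the string visible at the call site.
let fourth = word.index(word.startIndex, offsetBy: 3)
let accented = word[fourth]            // one Character made of two Unicode scalars

print(characterCount, utf16Count, accented.unicodeScalars.count)
```

An integer offset of 4 would mean "!" in the Character view but the combining accent in the UTF-16 view, which is exactly the kind of hard-to-find bug the opaque index type is meant to prevent.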

Now how strong of an argument is the "beginner footgun" line of thought? Not sure. I'll let others decide, but I have seen it used in other proposals so it does at least carry some weight.

I hardly think that enforcing the use of regex literals in a common format, like PCRE, has any effect whatsoever on how usable regexes can be in Swift. If that were the case then why would almost every other major language require developers to hand-create their regexes using a string (or string-like) syntax? The current state of most languages requires you to be familiar with regex syntax to use them. Having that exact same restriction does not make Swift regexes any less usable than Python regexes.

If anything, creating a human-readable API would have more usability limitations than creating regexes from scratch. I doubt the entire power of regexes could even be made into a complete human-readable API. That's not to say that there couldn't be a human-readable API which covers most of the more common use-cases, but there will be limitations to any wrapper library.


I personally believe that human-readable regex composition belongs in a third-party library and not in the standard library. My initial thoughts are that human-readable regex composition may end up being useful only for simpler cases, and more complex cases would still be better declared directly through a regex string/literal. If someone wants to create their own wrapper on top of some base RegEx functionality, I'm all for that! I'm sure people would use it, but I think that an implementation in the standard library should be kept minimal.

Of course everything I've said is just my opinion and y'all are welcome to disagree :slight_smile: All of this is just my interpretation of Swift's history and what I believe would be best for the language based on my personal experiences and opinions.

4 Likes

I don't see how two of the arguments made here fit together: that regexes are a specialty tool only to be used by experts for specific problems, and that regexes should be incorporated into the language and get special syntax to make them easy to construct and use.

If we can find a design that is more approachable and handles more classes of problems better (including scaling to bigger problems, which regexes fail at spectacularly), I'd much prefer it, even at the cost of terseness (fwiw, once a regex gets just a little complex, I often break it down into multiple lines to make it more understandable, so I don't think that would be a huge problem anyway).

Of course that is not to say there should be no regexes at all in Swift, but I don't really think they deserve special syntax treatment or a place in the standard library or something like that.

Edit: Btw, I don't think those advocating for a more human-readable approach here are proposing to create just a more verbose, spelled-out form of the exact functionality of regex, with a one-to-one mapping between them, but rather a re-thinking of the basic approach to matching/parsing strings (and please let's not stop at strings; byte sequences, bit sequences, collections of any kind of element; most of the basic concepts for parsing/matching apply and should be usable with all of those).

3 Likes

Regexes should be in the standard library by the same standards anything else is in the standard library. (This is from memory, I can't find the post where these were first summarized.)

  • Correctness of implementation. Correctly integrating something like PCRE with String would be difficult at best, and largely impossible for most people.
  • Performance. Until Swift eliminates the current limitations of cross-module optimization, it's unlikely that any solution would be as performant as necessary without inclusion in the standard library.
  • Uniformity. A single ExpressibleByRegularExpressionLiteral and such support would go a long way towards making regexes first class citizens in Swift. New types of literal are only possible from the standard library, AFAIK.

Perhaps the only reason (aside from implementation time and complexity) regexes aren't in the language yet is the existence of NSRegularExpression. However, this type is rather terrible to use in Swift and has its own limitations. Performance isn't that great either. Eliminating this dependency is good for Swift in the long term.

6 Likes

What even is this conversation? The preceding thread and the numerous documents linked to, including the extensive discussion in the String Manifesto, specifically detail the possibilities that come with first-class support for regex literals; it is a highly anticipated feature that's been part of the roadmap for improving Swift string handling for years, a roadmap which has been extensively discussed and iterated upon.

It's as though all of it's gone out the window and now we're "re-thinking the basic approach" to strings as if none of this work has happened. Let's nip this in the bud and agree that there is approximately zero chance that the future of Swift will involve choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end).

9 Likes

Excuse me? Did I hit a sore spot there?

That's a little needlessly hostile, isn't it? Aren't we here to try and work together to make Swift the best language it could be?


Now, in all honesty this might not be perfectly clear from my post, but I was addressing the sentiment expressed in a number of posts preceding mine that regex literals in Swift should be pretty much PCRE or a similar syntax with no or little adjustments. So when I say "regex" in my post, I mean just that, regexes as they have been written for decades now.

Looking back through this thread and some of the linked posts and materials (including the String Manifesto, which, by the way, merely touches on regexes), I see extensive discussion and interest in (a) adapting classic regex syntax to be more swifty, including dropping complex features, making others more verbose for the sake of clarity, allowing composition of simpler pieces of expression into a bigger whole, lifting some aspects of parsing/matching into types, (b) alternative approaches such as Parsing Expression Grammars and Parser Combinators that are not based on classic regular expressions at all, and (c) the possibility of designing a system that works with all kinds of sequences of things, not just strings.

Perl 6 with its language-level concepts of regular expressions and grammars in particular makes repeat appearances, but I'm gonna go out on a limb here and assume people don't mean to copy the Perl-typical explosion of symbols upon symbols outright and rather mean to take its level of language integration as inspiration for a swifty variant thereof.

Now, don't get me wrong, I completely agree a solution to matching/parsing is very important for Swift, and should be tackled soon. And I never said I wanted Swift's future to involve choose(oneOf: "abcd").repeated(1...3) + anchor(at: .end).

But I will re-state/clarify my opinion that just taking the same old regular expression syntax we know (or rather, look up) and love (or rather, endure), sticking it into a special literal and calling it a day is not good enough and does not in that form deserve to make it into the Swift language. And honestly, apart from compile-time syntax checking, which many other things people stick in string literals would also benefit from and would thus be better handled by a general feature (or a linter), I don't see how integrating these into the language would result in much benefit compared to a library solution.

Of course, throwing away everything about preexisting designs just for the sake of it is no good either, but we should critically assess every component of regular expression syntax, and see what parts we should adopt into the syntax of "the Swift parsing/matching solution", and what parts we should reject outright or replace with swiftier (which I expect will often entail greater verbosity) designs.

And this assessment imho should include the discussion of whether we want to base our design on regular expression syntax at all, or whether something like PEGs might be a better basic construct to build upon, in particular if it could facilitate parsing binary formats using the same constructs people know from string parsing already.

5 Likes

My reply was not aimed at calling you out. You might have been trying to spark some critical assessment, but I think you'll agree that the conversation in this thread over the last 24 hours is anything but that.

We need to move away from "swifty", a catch-all word that has no specific meaning. Recall that Swift's overall design goal includes both clarity and concision. It's fine to critique some specific design as sacrificing too much clarity for concision, but it is definitely not the case that writing Swift "will often entail greater verbosity", or at the very least, that is not the goal.

This is where I have to return to the question: what even is this conversation? There is no sense in simultaneously wondering whether to incorporate regexes at all into the language and to discuss the features we would want if we decided to do so and to bikeshed how we would spell those features.

It's like having a group of people vaguely talking about cakes, with one person wondering if cake is good enough to eat, another person figuring out what kind of cake people want to eat, and another person making a grocery list. No one's ever getting any cake and we are all wasting our time.

1 Like
Meritless meta-musings

Maybe. I use it basically as shorthand for "fits into Swift the language nicely", maybe signifying a consciously chosen balance of clarity and concision, and did not mean the greater verbosity to be a consequence of "swiftiness" in general, but as applied to existing features of regex syntax, which are often represented by only a symbol or two, so any change that is not just changing the symbol would necessarily entail greater verbosity.

Well, then allow me to play the Uno reverse card and ask: What even is this reply? If you don't deem this conversation worth one's time (and I see why you might feel so), why participate?

I see it like this: We all agree there is hunger to be stilled, and it's not going to get better without getting something to eat. Some (many?) people would like cake, and are talking aloud about what kinds of cake one could get, maybe to convince others of specific kinds of cakes, or of the idea of getting cake in general. Others would maybe prefer something more hearty and are raising the question of whether a burger wouldn't do just as well or better. Others yet are trying to see what each would entail. Where can we even get cake/burgers at this time of day?

I see why you might feel this is quite pointless, as no one is making very concrete plans, and the conversation is meandering, but I don't think every conversation always has to be a laser-focused discussion. The time for that will come about when we all get so hungry we have to make a decision. At that point, we might be better off having meandered a bit and having considered this and that possibility (if only in passing) earlier on.

But if you don't like talking food while you're getting hungry and have made up your mind already, you are free to sit back and let the others talk among themselves, no?

5 Likes

Perhaps it is because I am very hungry (both literally and figuratively), but the purpose of my reply is to try to urge clarity so that we do have a laser-focused discussion, here and now. To do so, we need first to bound the parameters of the discussion more tightly.

The standard library isn't special with regard to cross-module optimization. It marks methods as @inlinable in the same way other modules can. So "limitations of cross-module optimization" isn't a reason to include something in the standard library.
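For readers who have not used the attribute, here is a minimal sketch of what marking a method @inlinable looks like in an ordinary library. The type and the module it would live in (say, a hypothetical TextTools package) are invented for illustration and are not taken from any real library.

```swift
// Imagine this type is part of a hypothetical library module, e.g. TextTools.
public struct FieldCounter {
    public var separator: Character

    public init(separator: Character) {
        self.separator = separator
    }

    // @inlinable makes the body part of the module's interface, so the
    // optimizer may inline or specialize it inside client modules.
    // The body may only reference public or @usableFromInline declarations.
    @inlinable
    public func fieldCount(in line: String) -> Int {
        var count = 1
        for character in line where character == separator {
            count += 1
        }
        return count
    }
}

print(FieldCounter(separator: ",").fieldCount(in: "a,b,c"))
```

The trade-off is that an inlinable body becomes something clients may have compiled into their own binaries, so changing it later is more constrained than changing an ordinary method.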

(there are other performance-related reasons to include something in the standard library, but not this)

5 Likes

Has anything been discussed by the Core Team about the pattern matching possibilities in Swift? Have any RegEx alternatives been explored? Has there been discussion about a different feature with better potential?

Interesting to see regular expressions bubble to the surface again. It's clearly an itch evolution needs to scratch. The NSRegularExpression class is not at all as lightweight as regular expressions deserve to be, and while it could be improved with a few well-chosen extensions, its use of NSString under the covers makes it subject to speed regressions if you try to process large amounts of data. I'd like to throw a few ideas into the mix.

First, for me, it is not regex literals themselves that are interesting, though it would be nice if they could be validated at compile time by flagging them with a special syntax. /regex/ seems a good precedent to pick up, if not particularly Swifty. It would be subject to many of the escaping concerns of strings themselves, so I'd suggest the option of an analogue to raw strings, something like #/regex/#, to use in practice.

What is of interest is how regexes combine for the basic operations of match, iterate and replace when operating on a target string, and I'd like to float, one more time, the idea of using subscripting into a string with a regex. Why on earth subscripts, I hear you say? They have the advantage that they are atomic and not subject to operator precedence, but they also have a unique property: while they are effectively a function call, they can also be assigned to using the setter. Bear with me. Certainly it's an idea that takes some getting used to, but remember there was once a time when subscripts were only for arrays, and accessing dictionary values using subscript syntax was novel and perhaps non-intuitive. It's not currently an idiom in any language I'm aware of, but there's almost a mangled logic to it: what else would you use to refer to a non-trivial range in a String other than a regular expression?

Fleshing out the idea...

Since generic subscripts became available in Swift, it is possible to concoct something along the lines of

let datePattern = #"(\d{4})-(\d{2})-(\d{2})"#
let date = "2018-01-01"
if let (year, month, day): (String, String, String) = date[datePattern] {
    print( year, month, day )
}

Using the symmetry of subscripts, it's possible to write a tuple back into a string

var date2 = "0000-00-00"
date2[datePattern] = ("2018", "01", "01")
XCTAssertEqual(date, date2)

Finally, for iterating the following also works:

let dates = "2018-01-01 2019-02-02 2020-03-03"
for (year, month, day): (String, String, String) in dates[datePattern] {
    print( year, month, day )
}

This isn't a flight of fantasy. All these constructs already work with Swift as-is using this package (7 stars :grinning:). Sure, it would be nice to type check that the number of groups in a pattern matches the number of elements in the tuple being used, and perhaps to have named capture groups or even types other than String, but the two ideas are complementary, and literals with more smarts can be worked into a more ambitious plan later.
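For anyone wondering how a get/set regex subscript could work at all, here is a minimal sketch built on NSRegularExpression. It is not the implementation of the package mentioned above: the labeled subscript `firstMatch` is a name invented for this example, it returns an array of strings rather than a typed tuple, and the setter assumes the capture groups are not nested.

```swift
import Foundation

extension String {
    // Helper: first match of `pattern` in the whole string, or nil if the
    // pattern is invalid or does not match.
    private func firstResult(for pattern: String) -> NSTextCheckingResult? {
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
        return regex.firstMatch(in: self, options: [],
                                range: NSRange(startIndex..., in: self))
    }

    /// Hypothetical subscript: reads or writes the capture groups of the first match.
    subscript(firstMatch pattern: String) -> [String]? {
        get {
            guard let match = firstResult(for: pattern) else { return nil }
            // Group 0 is the whole match; the captures start at group 1.
            return (1..<match.numberOfRanges).map { group in
                Range(match.range(at: group), in: self).map { String(self[$0]) } ?? ""
            }
        }
        set {
            guard let replacements = newValue,
                  let match = firstResult(for: pattern),
                  replacements.count == match.numberOfRanges - 1
            else { return }
            // Replace the last group first so the offsets of earlier groups stay valid.
            for group in (1..<match.numberOfRanges).reversed() {
                if let range = Range(match.range(at: group), in: self) {
                    replaceSubrange(range, with: replacements[group - 1])
                }
            }
        }
    }
}

let datePattern = #"(\d{4})-(\d{2})-(\d{2})"#
let date = "2018-01-01"
if let parts = date[firstMatch: datePattern] {
    print(parts)
}

var date2 = "0000-00-00"
date2[firstMatch: datePattern] = ["2018", "01", "01"]
print(date2 == date)
```

Returning an array sidesteps the generic tuple handling, at the cost of losing the compile-time shape that the tuple version hints at.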

13 Likes

Subscripting is an interesting idea, but how do you handle iteration to find (and modify) additional matches?

It's a bit of a long story. Regexes aren't for the faint of heart after all. The third snippet is an example of iterating over all matches in a string. For modifying multiple matches there is the very eccentric construct of passing a closure over matches. For example, to capitalise words in a string:

        str[#"(\w)(\w*)"#] = {
            (groups: (first: String, rest: String), stop) -> String in
            return groups.first.uppercased()+groups.rest.lowercased()
        }

The closure is called once for each match and the range of the match replaced with the return value. If these operator shorthands aren't to your taste, there are more comprehensible named functions in an extension to StringProtocol, e.g.:

    public func replacing<T>(regex: RegexLiteral, pos: Int? = nil, group: Int? = nil,
                             exec closure: @escaping (T, UnsafeMutablePointer<ObjCBool>) -> String) -> String {
        return RegexImpl<T>(pattern: regex).replacing(target: self, pos: pos, group: group, exec: closure)
    }

The subscript operator calls this function under the covers. T can be a tuple or array of String, Substring?, etc., representing the capture groups in the implementation.
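A named, closure-based replacement in the spirit of the function quoted above could be sketched on plain NSRegularExpression like this. `replacingMatches(of:with:)` is a hypothetical name for this example, not the package's API, and the closure receives the capture groups as a simple array of strings.

```swift
import Foundation

extension String {
    /// Returns a copy in which every match of `pattern` is replaced by the
    /// string that `transform` returns for that match's capture groups.
    func replacingMatches(of pattern: String,
                          with transform: ([String]) -> String) -> String {
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return self }
        let matches = regex.matches(in: self, options: [],
                                    range: NSRange(startIndex..., in: self))
        var result = self
        // Walk backwards so a replacement never shifts the ranges still to be processed.
        for match in matches.reversed() {
            let groups = (1..<match.numberOfRanges).map { group -> String in
                Range(match.range(at: group), in: self).map { String(self[$0]) } ?? ""
            }
            if let range = Range(match.range, in: result) {
                result.replaceSubrange(range, with: transform(groups))
            }
        }
        return result
    }
}

// Capitalise each word: first letter upper case, the rest lower case.
let title = "hELLO swift wORLD".replacingMatches(of: #"(\w)(\w*)"#) { groups in
    groups[0].uppercased() + groups[1].lowercased()
}
print(title)   // Hello Swift World
```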

That, too, is interesting, but very not-Swifty. It looks like the closure is being assigned to the place where the regex matches, and that's completely nonsensical. I doubt anything like that would ever make it to the stdlib.

1 Like

So do I :slightly_smiling_face:. I've just been experimenting with what is possible as a shorthand to bring to Swift some of the expressiveness of Perl when processing strings. A more realistic extension to StringProtocol along the lines of the following would, IMO, help move some of the rough edges of NSRegularExpression out of sight.

Edit: but ultimately these could never be in the stdlib, as NSRegularExpression is part of Foundation, unless another portable, ICU-compatible, performant, perhaps more Unicode-correct regex engine comes to the fore, and they don't grow on trees.

That's how things work here, isn't it? (ok, sometimes we get cake, but most of the time it's when I actually wanted pizza ;-).

I strongly agree with those who expressed the opinion that regular expressions are overrated: many people seem to be really obsessed with them in an unhealthy way, and waste time figuring out the right magic strings even when there are simpler and better alternatives.
I don't think we would invent (without any prior art) syntax like

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This just does not look like Swift (and it's a different language indeed). How can you explain that we removed i++, and then add something like the regex-example above (can you see what it is doing?)?
If it is integrated, imho there ought to be really good reasons; familiarity isn't enough, because there's not one true standard for regex.

One motivation could be performance:
If Swift gets some sort of regex compiler, it might be possible that the result could be really fast, and it could be ensured that the expression is actually valid...
On the other hand, I don't think any regex-implementation which does its work at runtime is qualified for inclusion in the stdlib (but the stdlib is full of stuff that I would never put there :smiley:).

7 Likes

:point_up_2::point_up_2::point_up_2::point_up_2::point_up_2:

6 Likes

Wouldn't that look a lot less scary if you broke it up into semantic chunks?

let recipient = [a-z0-9!#...
let domainName = [a-z0-9]...
let ipAddress = (2(5[0-5]...
let emailAddress = recipient + "@" + domainName + "|" + ipAddress
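As a runnable illustration of that chunking idea, here is a sketch using NSRegularExpression. The sub-patterns are deliberately simplified stand-ins that I made up for this example; they are not the elided pieces of the long pattern quoted earlier. Note that the alternation is wrapped in a non-capturing group so that the "|" only chooses between the two host forms.

```swift
import Foundation

// Simplified, hypothetical building blocks (not the full pattern from the thread).
let recipient  = #"[a-z0-9._%+-]+"#
let domainName = #"(?:[a-z0-9-]+\.)+[a-z]{2,}"#
let octet      = #"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"#
let ipAddress  = #"\[(?:\#(octet)\.){3}\#(octet)\]"#

// Compose the chunks; anchors make the pattern match the whole candidate.
let emailAddress = "^\(recipient)@(?:\(domainName)|\(ipAddress))$"

func isPlausibleEmail(_ candidate: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: emailAddress,
                                               options: [.caseInsensitive])
    else { return false }
    let whole = NSRange(candidate.startIndex..., in: candidate)
    return regex.firstMatch(in: candidate, options: [], range: whole) != nil
}

print(isPlausibleEmail("jane@example.com"))      // true
print(isPlausibleEmail("jane@[192.168.0.1]"))    // true
print(isPlausibleEmail("not an email"))          // false
```

Each named chunk can be tested on its own, which is most of the readability win; the composed pattern is still only checked when it is compiled at run time.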

The verbose syntax proposed earlier would be a whole page long I think.

But this common example would be found somewhere in the library anyway, right?