[Proposal] a syntax for type-safe data detection in strings

The Swift 5.7 string processing features (regex builder DSL + typed matches in regex) are neat. Inspired by that work, I have an idea for the next chapter in the Swift string processing story: type-safe data detectors.

The problem with regex

The problem with using regexes for data parsing is that they require you to think at the wrong level of abstraction (i.e. in terms of character sets rather than data types). To construct a regex, you first need to carefully examine how the data is formatted in text, and also take into account its "context" (the other characters in the string that precede it).

The process of writing a regex to parse out data (even with the builder DSL) is tedious when we "just want our data!". And the final regex (whether in concise form or builder long form) obfuscates what it is you're actually trying to parse out of the string.
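
For comparison, here is roughly what the regex route looks like for the test-suite log line used later in this post. Treat it as a sketch: the pattern is an illustrative assumption that only fits this exact sentence shape, and every capture still has to be converted from text by hand.

```swift
// Sketch: extracting three values from a log line with a Swift 5.7 regex literal.
// The pattern is an illustrative assumption and only fits this sentence shape.
let log = "Executed 4 tests, with 1 failure in 0.009 seconds"

// #/…/# is the extended-delimiter form of a regex literal.
let pattern = #/Executed (\d+) tests?, with (\d+) failures? in (\d+\.\d+) seconds/#

var testCount: Int?
var failureCount: Int?
var timeTaken: Double?

if let match = log.firstMatch(of: pattern) {
    // Captures come back as Substring, so each one still needs converting.
    let (_, tests, failures, seconds) = match.output
    testCount = Int(tests)
    failureCount = Int(failures)
    timeTaken = Double(seconds)
}
```

Note how the pattern has to spell out the surrounding words ("Executed", "tests", "with") even though only the three values are of interest.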

Proposed solution: type-safe data detecting in String

Here's an example of how we'd parse the data out of a test suite log:

let (testCount, failureCount, timeTaken) = "Executed 4 tests, with 1 failure in 0.009 seconds".find(.number, .number, .time)!

testCount // 4
failureCount // 1
timeTaken // 0.009 seconds

The find method on string returns a tuple populated with the requested types.

let successCount = testCount - failureCount // 3

Another couple of examples:

let (date, temperature, humidity) = "On August 23, 2022 the temperature in Chicago was 68.3 °F (with a humidity of 74%)".find(.date, .temperature, .percentage)!

date // August 23, 2022
temperature // 68.3 °F
humidity // 74%

let (earnings, fileSize, url) = "Total Earnings From PDF: $12.2k (3.25 MB, at https://lifeadvice.co.uk/pdfs/download?id=guide)".find(.currency, .fileSize, .url)!

earnings // 12,200 USD
fileSize // 3.25 MB
url // https://lifeadvice.co.uk/pdfs/download?id=guide

Dates & numbers come in lots of different formats, but by working on the level of abstraction of data types, you're able to ignore the specifics of the format of the data in the particular string you're working with.

Implementation

I have a working implementation of this string parsing syntax available here.

I have included around 30 common data types out of the box (including dates, URLs, email addresses, percentages, units of different kinds, etc.) and support up to 6 data points per call (until we get variadic generics in Swift 6 :tada:). You can also easily extend the system with parsers for your own custom types, and I have examples of how this approach can be used for data transformation as well.
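
To illustrate why the number of data points per call is capped until variadic generics arrive, and what a parser for a custom type could look like, here is a minimal sketch. It is not SoulverCore's actual API: the Detector protocol, IntDetector, VersionDetector, Version and this find overload are all names invented for illustration.

```swift
// Hypothetical sketch, not SoulverCore's real API: every name below is invented.

/// Something that can locate one typed value in a piece of text.
protocol Detector {
    associatedtype Output
    /// Returns the first value found and the index just past it, or nil.
    func detect(in text: Substring) -> (value: Output, end: String.Index)?
}

/// A built-in style detector for plain integers.
struct IntDetector: Detector {
    func detect(in text: Substring) -> (value: Int, end: String.Index)? {
        guard let match = text.firstMatch(of: #/\d+/#),
              let value = Int(match.output)
        else { return nil }
        return (value: value, end: match.range.upperBound)
    }
}

/// A custom type, plus a detector for it (versions such as "5.7.1").
struct Version: Equatable {
    var major: Int, minor: Int, patch: Int
}

struct VersionDetector: Detector {
    func detect(in text: Substring) -> (value: Version, end: String.Index)? {
        guard let match = text.firstMatch(of: #/(\d+)\.(\d+)\.(\d+)/#),
              let major = Int(match.1),
              let minor = Int(match.2),
              let patch = Int(match.3)
        else { return nil }
        return (value: Version(major: major, minor: minor, patch: patch),
                end: match.range.upperBound)
    }
}

// Static members on the protocol allow the leading-dot call syntax (.int, .version).
extension Detector where Self == IntDetector {
    static var int: IntDetector { IntDetector() }
}
extension Detector where Self == VersionDetector {
    static var version: VersionDetector { VersionDetector() }
}

extension String {
    // Without variadic generics, each arity needs its own overload;
    // this is the two-value one, and three to six would follow the same shape.
    func find<A: Detector, B: Detector>(_ a: A, _ b: B) -> (A.Output, B.Output)? {
        guard let first = a.detect(in: self[...]),
              let second = b.detect(in: self[first.end...])
        else { return nil }
        return (first.value, second.value)
    }
}

let result = "Built 12 targets for release 5.7.1".find(.int, .version)
```

In this sketch result is an optional (Int, Version) tuple, so the call site stays type-safe without the caller ever touching a pattern.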

My implementation uses the SoulverCore math engine for parsing. SoulverCore is closed source, but it's written in 100% Swift and works on Linux & Windows, so it's a good proof of concept for what could become part of platform-independent Foundation to support this feature in a future version of Swift.

Performance is acceptable (though significantly slower than regex): SoulverCore can do around 6k parse operations/second on my Intel i9 MacBook Pro, and 10k+ parse operations/second on my friend's M1.

Conclusion

I've been using regular expressions for years, but I've never loved using them (certainly not in the way I love using Swift).

Most regexes are just trying to get you some data, so why can't computers be smart and just give you the data?

My proposal is Swifty, in the sense that the syntax is concise and clear, returned values are type-safe, and my proof of concept demonstrates this can be done in a performant manner.

This is my first post here so please be gentle, but I'm very open to feedback, suggestions and criticisms. I'm also just happy to be contributing to the discussion on how to make Swift "better at string processing than Perl". Cheers.

This looks very useful. Kudos! I can totally see myself using this or something like it very frequently.

Having said that, it's unclear to me what you are proposing. A few thoughts/questions:

  • As you write, the library that is the basis for your implementation is closed source. Are you suggesting open-sourcing it and contributing its functionality to the Swift project in some way?

  • I don't think functionality such as this needs to live in the standard library. (And much of it can’t be part of the standard library anyway because it depends on Foundation types such as URL, Date, and Measurement.)

  • In the Apple ecosystem, Foundation would be a natural place for this functionality, as a modern replacement of NSDataDetector.

    But additions to Foundation aren't really managed through the Swift Evolution process and don't usually come from the community. (There have been some pitches from Apple folks about the recent refinements to URL etc.)

    Moreover, I think many (most?) people on this forum would prefer not to add even more stuff to Foundation. The open source Swift ecosystem would be in a better place if we had more single-purpose "Swifty" libraries that each replaced one aspect of Foundation. One great example is @Karl’s WebURL library; your library could be another.

  • If we agree that this should be a self-contained library, the big question is how to gain traction, i.e. get people to use it. I imagine that publishing the library as a sort of official part of the Swift project (like Swift Collections, Swift Algorithms, SwiftNIO) can certainly help with this. Is this what you have in mind?


A specific question about your sample code: is the order of the search terms in the source string important? E.g. in your date/temperature/humidity example, the order of the data types in .find(.date, .temperature, .percentage) matches the order in which they appear in the input string.

Would this also work if we changed it to, say, .find(.temperature, .percentage, .date)? In other words, does the search for the next term start after the previous match or does it always start at the beginning of the input string?

Thanks Ole, great questions.

I think string processing is somewhat special in that having the right tools "out of the box" makes the language itself better.

While watching WWDC22, it struck me that, if the job is data extraction/transformation, regex is probably not the right tool. We actually want something like NSDataDetector with better ergonomics. And what would that look like?

I posted here because I think it's worth having the discussion about whether a tool like this could make sense as part of Swift (or cross-platform Foundation). Maybe the answer is no, but in my view having something like it would make the Swift string story stronger.

Yes, order matters.

Yes, each search starts after the previous match. It looks for the first term; once found, it continues from there looking for the next term, ignoring everything else it comes across, and so on.
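
This left-to-right behaviour can be illustrated with plain Swift regexes. The sketch below is not how SoulverCore does it: findInOrder is a hypothetical helper, and the three patterns are deliberately crude stand-ins for real date, temperature and percentage detectors.

```swift
// Hypothetical helper illustrating ordered search; not SoulverCore's implementation.
// Each pattern is searched for only in the text that follows the previous match.
func findInOrder(_ patterns: [Regex<Substring>], in text: String) -> [Substring]? {
    var remainder = text[...]
    var results: [Substring] = []
    for pattern in patterns {
        guard let match = remainder.firstMatch(of: pattern) else { return nil }
        results.append(match.output)
        // Continue after this match, ignoring everything before it.
        remainder = remainder[match.range.upperBound...]
    }
    return results
}

let text = "On August 23, 2022 the temperature in Chicago was 68.3 °F (with a humidity of 74%)"

// Crude stand-in patterns, assumed for illustration only.
let year = #/\b\d{4}\b/#
let temperature = #/\d+(?:\.\d+)? °F/#
let percentage = #/\d+(?:\.\d+)?%/#

// Same order as in the text: all three are found.
let inOrder = findInOrder([year, temperature, percentage], in: text)

// Year requested last: nothing is left to search after "74%", so this is nil.
let reordered = findInOrder([temperature, percentage, year], in: text)
```

A design that restarted every search at the beginning of the string would make the second call succeed, at the cost of ambiguity when the same type appears more than once.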

A “byte-level data parser” is included in the future directions of SE-0363. Is this something you’ve thought of?

I'm certainly not suggesting regex support in Swift shouldn't continue to improve. It's a good tool to have as well.

We also need a tool at a higher level of abstraction than regex - one we can reach for when we just need to parse or transform data. It's such a common & finicky task - so a great opportunity to let the language/frameworks do the work for us.

Not unlike how SwiftUI is at a higher level of abstraction than AppKit/UIKit. It's a better tool for the job in many (but not all) cases. Despite what Josh told us at WWDC, a solid SDK story should offer tools at various levels of abstraction, depending on your needs. Higher levels need not replace lower levels.

While I like the idea of "type-safe data detection", it sounds a lot like a library to me, implemented as a collection of ready-to-use Regexes that could live in an official swift-data-detectors or swift-regex-extensions repository. The main reason for me to think like this is that while it's easy to provide data detectors for such things as number or duration (I think time is not a good name for a time interval, I expected something like "5 AM" on first read), it's already starting to get more complicated with things like temperature or date.

If data detectors were available in the Swift standard library, I would expect them all to work correctly and support any localized formatting. I think that is too complex a task for what is supposed to be a core set of functionality that more complex functionality (which data detection is, in my opinion) builds upon. Since you mention that your implementation isn't as fast as Regexes in Swift, have you considered reimplementing (some portion of) the library based on Regexes?
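
As a rough idea of what one such ready-to-use, Regex-based detector might look like, here is a sketch using the RegexBuilder DSL with a typed capture. The percentage name is hypothetical, and the pattern assumes "." as the decimal separator, so it does not address the localization concern above.

```swift
import RegexBuilder

// Sketch of a reusable, typed "percentage" detector built from a Regex.
// Assumes "." as the decimal separator; localized formats are not handled.
let percentage = Regex {
    TryCapture {
        OneOrMore(.digit)
        Optionally {
            "."
            OneOrMore(.digit)
        }
    } transform: { Double($0) }  // the capture is delivered as a Double
    "%"
}

// The match output is (Substring, Double), so .1 is already a number.
let humidity = "with a humidity of 74%".firstMatch(of: percentage)?.1
```

Because the conversion lives inside the regex, callers get a Double rather than a Substring, which is the part of the type-safety story that a library of prebuilt regexes could provide.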

Even if I'm wrong and some data detectors could actually make their way into the Standard library, I'm sure a step-by-step approach would be taken where shipping functionality as part of the swift-preview library first would be advisable.
