Declarative String Processing Overview

Michael_Ilseman · September 30, 2021, 1:11pm

What formats have you had to parse?

The pattern I would recommend for this kind of task is a strongly-typed wrapper around the raw bytes, with computed vars that will do the masking and shifting.

struct EncodedPixel: RawRepresentable {
  var rawValue: (UInt8, UInt8, UInt8)

  var red: UInt32 { ... unpack ... }
  var green: UInt32 { ... unpack ... }
  var blue: UInt32 { ... unpack ... }
  var alpha: UInt32 { ... unpack ... }
}

A failable init can do error checking, if necessary for your format. A encoded pixel parser's conformance to something like CollctionConsumer (as an example interface for composition) would just try to init and advance 3 bytes.

Definitely improvements to be made to the current laziness situation (CC @timv). Even with an ideal presentation of laziness, it's not quite a drop-in for the script as written (e.g. use of integer subscripting).

A potential issue with approaches which accumulate and transform closures is that they rely on Swift's general-purpose optimizer, whose inlining heuristics are not necessarily tuned for your exact use case :-). It can also be foiled by such pesky real-world concerns as separate compilation.

It's still early, but we're hoping to leverage static compilation and/or staged compilation techniques for the parser library, inspired somewhat by work done in Scala. @rxwei will be exploring this more deeply.

I'm hoping we can extend result builders by allowing conformers to provide a type to serve as a namespaces for lookup, similar to custom string interpolation.

That would also open the door to defining and using custom operators (e.g. postfix +) inside of a result builder without bloating their global overload set.

We'll definitely have to figure out trailing closure ambiguity here. This seems like an issue that will keep coming up with broader application of result builders.

There's also more potential ambiguity with collection algorithms. For example, str.contains("A") will currently resolve "A" as a Character because there's only the Element-taking variant. But if there's an overload for a collection-with-the-same-element, it would prefer String for a literal. And, we're showing the use of literals inside a result builder to represent a literal subpattern. Semantically they would all be equivalent, but we like code to keep calling the functions they think they're calling :-).

We'll have to tackle these problems as we deploy the collection algorithms. For the consumers/searchers prototype way back when, I fell back to using argument labels to differentiate, but hopefully we can provide something cleaner here.

@timv is exploring these algorithms in more depth.

I agree that RRC is basically broken for batch operations and wrote more about it here. A version of replaceSubrange that returns the newly created range can help in some cases, but would still be inefficient.

At the time I was hoping more work on ownership and generalized accessors would make the design pattern obvious. This is definitely a bridge that needs to be crossed.

We defintely want to ship runtime support as part of Swift. Whether you access API directly from the Swift default module or need an explicit import Pattern is still up for debate. I'm hoping that result builder namespaces makes this decision clearer.

Yup, I'm a big fan of the programmer experience afforded by parser combinators and will defintely check out that work in Elm. As mentioned in the big picture, we also want to be able to leverage compiler technology to get around some of the performance issues with parser combinator libraries, especially since Swift supports separate compilation.