Declarative String Processing Overview

Michael_Ilseman · September 29, 2021, 11:13pm

Declarative String Processing Overview

Introduction

String processing is hard and the current affordances provided by the Swift Standard Library are underpowered. We propose adding two new declarative string processing APIs—a familiar Regex literal and a more powerful Pattern result builder—to help make Swift string processing fast and easy.

This is a large feature that will ultimately be divided into multiple Swift Evolution proposals. This initial pitch is intended to prompt discussion about the high level direction and to introduce the key prongs of the feature and their relationship to one another.

This overview is the work of a number of members of the Swift team (Alex Alonso, Nate Cook, Michael Ilseman, Kyle Macomber, Becca Royal-Gordon, Tim Vermeulen, and Richard Wei) as well as Ishan Bhargava, who implemented a prototype of regular expression literals with strongly-typed captures.

Example

The Swift Standard Library is implementing native grapheme breaking for String, which requires preprocessing Unicode data tables.

Here's a snippet of the data:

# ================================================

000A          ; LF # Cc       <control-000A>

# Total code points: 1

# ================================================

0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>
000B..000C    ; Control # Cc   [2] <control-000B>..<control-000C>
000E..001F    ; Control # Cc  [18] <control-000E>..<control-001F>
007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>
00AD          ; Control # Cf       SOFT HYPHEN
061C          ; Control # Cf       ARABIC LETTER MARK
180E          ; Control # Cf       MONGOLIAN VOWEL SEPARATOR
200B          ; Control # Cf       ZERO WIDTH SPACE
200E..200F    ; Control # Cf   [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028          ; Control # Zl       LINE SEPARATOR
2029          ; Control # Zp       PARAGRAPH SEPARATOR
202A..202E    ; Control # Cf   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE

Each relevant line is of the form:

0000..10FFFF  ; Property # Comment

The first column (delimited by the ;) is a hex number or range of hex numbers that represent Unicode scalar values.
The second column is the grapheme break property that applies to this range of scalars.
Everything after the # is a comment.
Entries are separated by newlines.

This is a very simple data format to process, and when we try to do so we quickly see inadequacies with the status quo.

Naive handwritten parser

A straight-forward approach to tackling this problem is to use the standard library's generic collection algorithms like split, map, and filter.

extension Unicode.Scalar {
  // Try to convert a hexadecimal string to a scalar
  init?<S: StringProtocol>(hex: S) {
    guard let val = UInt32(hex, radix: 16), let scalar = Self(val) else {
      return nil
    }
    self = scalar
  }
}

func graphemeBreakPropertyData(
  forLine line: String
) -> (scalars: ClosedRange<Unicode.Scalar>, property: Unicode.GraphemeBreakProperty)? {
  let components = line.split(separator: ";")
  guard components.count >= 2 else { return nil }

  let splitProperty = components[1].split(separator: "#")
  let filteredProperty = splitProperty[0].filter { !$0.isWhitespace }
  guard let property = Unicode.GraphemeBreakProperty(filteredProperty) else {
    return nil
  }

  let scalars: ClosedRange<Unicode.Scalar>
  let filteredScalars = components[0].filter { !$0.isWhitespace }
  if filteredScalars.contains(".") {
    let range = filteredScalars
      .split(separator: ".")
      .map { Unicode.Scalar(hex: $0)! }
    scalars = range[0] ... range[1]
  } else {
    let scalar = Unicode.Scalar(hex: filteredScalars)!
    scalars = scalar ... scalar
  }
  return (scalars, property)
}

This code gets the job done, but it suffers in readability, maintainability, and scalability.

It is difficult to read and understand quickly, one has to mentally process the line multiple times.
Hardcoded subscripts, force unwraps, etc., are fragile to changes in the format or the script itself.
This does multiple passes over the input, allocating multiple temporary data structures in the process.

Ideally, we'd process this string the same way we read the file: from left to right.

Single-pass handwritten parser

By following a consumer pattern, we can extract the relevant information in a single pass over the input.

// ... consumer helpers like `eat(exactly:)`, `eat(while:)`, and `peek()` ...

// Try to parse a Unicode scalar off the input
private func parseScalar(_ str: inout Substring) -> Unicode.Scalar? {
  let val = str.eat(while: { $0.isHexDigit })
  guard !val.isEmpty else { return nil }

  // Subtle potential bug: if this init fails, we need to restore
  // str.startIndex. Because of how this is currently called, the bug won't
  // manifest now, but could if the call site is changed.
  return Unicode.Scalar(hex: val)
}

func graphemeBreakPropertyData(
  forLine line: String
) -> (scalars: ClosedRange<Unicode.Scalar>, property: Unicode.GraphemeBreakProperty)? {
  var line = line[...]
  guard let lower = parseScalar(&line) else {
    // Comment or whitespace line
    return nil
  }

  let upper: Unicode.Scalar
  if line.peek(".") {
    guard !line.eat(exactly: "..").isEmpty else {
      fatalError("Parse error: invalid scalar range")
    }
    guard let s = parseScalar(&line) else {
      fatalError("Parse error: expected scalar upperbound")
    }
    upper = s
  } else {
    upper = lower
  }

  line.eat(while: { !$0.isLetter })
  let name = line.eat(while: { $0.isLetter || $0 == "_" })
  guard let prop = Unicode.GraphemeBreakProperty(name) else {
    return nil
  }

  return (lower ... upper, prop)
}

This implementation is more scalable and maintainable, but at the cost of approachability.

It executes in a single pass over the input, without intermediary allocations.
Buried assumptions in the naive code are explicit failures here.
But, this consumer pattern is very low-level and using it well requires care and expertise. For example, backtracking has to be manually handled and reasoned about, as unnecessary backtracking quickly saps performance.

Proposed Solution

Declarative APIs for string processing have the potential to be approachable, maintainable, and scalable.

Regular Expressions

A commonly used tool for this kind of pattern matching and data extraction is regular expressions.

Consider these two lines:

007F..009F    ; Control # Cc  [33] <control-007F>..<control-009F>
00AD          ; Control # Cf       SOFT HYPHEN

We can match them and extract the data using the regular expression:

/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/

Let's break it down:

([0-9A-F]+) matches one or more hex digits, capturing the first scalar
(?:\.\.([0-9A-F]+))? optionally matches the .. and captures the second scalar
\s*;\s matches one or more spaces, a semicolon, and a space
(\w+) matches one or more word characters, capturing the grapheme break property
.* matches zero or more of any character (the rest of the line)

We propose adding a new regular expression literal, with strongly typed captures, to Swift. Using Regex, we can re-implement graphemeBreakPropertyData like so:

func graphemeBreakPropertyData(
  forLine line: String
) -> (scalars: ClosedRange<Unicode.Scalar>, property: Unicode.GraphemeBreakProperty)? {
  line
    .match(/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/)?
    .captures.flatMap { (l, u, p) in
      guard let property = Unicode.GraphemeBreakProperty(p) else {
        return nil
      }
      let scalars = Unicode.Scalar(hex: l)! ... Unicode.Scalar(hex: u ?? l)!
      return (scalars, property)
    }
}

This code reads from left to right and doesn't require any hard-coded indices. Regex is generic over its captures, which the compiler infers from the capturing groups in the literal:

let regex = /([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/
print(type(of: regex))
// Prints Regex<(Substring, Substring?, Substring)>

Note: The type of the second capture, the end of the scalar range, is optional. This is because the corresponding capture group in the regex is optional. (If the capturing group was repeated using a * or + quantifier, it would correspond to a lazy collection.)

Strongly typed captures make it more convenient and safer to post-process match results, e.g. enabling the use of tuple destructuring and the nil-coalescing operator in graphemeBreakPropertyData.

Regular expressions are compact, powerful, and often fast. Their syntax is familiar and ubiquitous. Though regexes come in different flavors, knowledge and patterns acquired in one tool (e.g. Perl) can often be applied in another (e.g. Xcode). An important goal of adding regular expressions to Swift is to facilitate this kind of reuse.

From `Regex` to `Pattern`

Regular expressions originated for use in Unix command-line arguments and text editor search fields because they are very terse, representing in a single line what otherwise would be an entire program. Due to this lineage, regular expression syntax has a few disadvantages, especially when used within a general-purpose programming language:

The terse syntax can be hard to remember (does \w mean "word" or "whitespace"?) and difficult to read (the .. and ; in the example regex are obfuscated by the metacharacters).
The lack of library support encourages reinventing the wheel—simplistic regexes are commonly misused to parse deceptively complex formats like dates and currency, when rich libraries of sophisticated parsers already exist (e.g. Foundation's FormatStyle).
Regular expressions occupy an awkward middle ground of being too powerful to compose together, but not powerful enough to recognize recursive structures (contrast with PEGs). Extensions such as back-references catapult matching to being NP-complete, yet they still cannot be used to write parsers.

Swift prizes clarity over terseness. Regular expressions are great for simple matching, but as they grow in complexity we want to be able to bring the full power of Swift and its libraries to bear.

`Pattern` Builder

The downsides of regular expressions motivate a more versatile result builder syntax for declaring a Pattern:

func graphemeBreakPropertyData(
  forLine line: String
) -> (scalars: ClosedRange<Unicode.Scalar>, property: Unicode.GraphemeBreakProperty)? {
  line.match {
    OneOrMore(.hexDigit).capture { Unicode.Scalar(hex: $0) }

    Optionally {
      ".."
      OneOrMore(.hexDigit).capture { Unicode.Scalar(hex: $0) }
    }

    OneOrMore(.whitespace)
    ";"
    OneOrMore(.whitespace)

    OneOrMore(.word).capture(GraphemeBreakProperty.init)

    Repeat(.anyCharacter)
  }?.captures.map { (lower, upper, property) in
    let scalars = lower ... (upper ?? lower)
    return (scalars, property)
  }
}

Character classes and quantifiers are spelled out, making them more readable and discoverable via code completion.
String literals make punctuation matching simple and clear: the two dots and the semicolon are critical parts of our format and they stand out inside literals.
Capture groups can be processed inline, improving locality and strong typing. Compare Pattern<(Unicode.Scalar, Unicode.Scalar?, GraphemeBreakProperty)> vs. Regex<(Substring, Substring?, Substring)>. (Inferring the generic argument of Pattern from the capturing groups in the result builder will require language improvements.)
Failure to construct a Unicode.Scalar or GraphemeBreakProperty will exit matching early, just like in consumer-pattern code.

Sophisticated features like inline capture group processing feel right at home with the result builder syntax because it’s all just regular Swift code—it isn't nearly as natural to try to force this kind of functionality into the regex literal.

Consider the last capture group that uses GraphemeBreakProperty.init. GraphemeBreakProperty is defined as:

enum GraphemeBreakProperty: UInt32 {
  case control = 0
  case extend = 1
  case prepend = 2
  case spacingMark = 3
  case extendedPictographic = 4

  init?(_ str: String) {
    switch str {
    case "Extend":
      self = .extend
    case "Control", "CR", "LF":
      self = .control
    case "Prepend":
      self = .prepend
    case "SpacingMark":
      self = .spacingMark
    case "Extended_Pictographic":
      self = .extendedPictographic
    default:
      return nil
    }
  }
}

If GraphemeBreakProperty.init returns nil, the match fails. This is really convenient, since the table includes property names we want to ignore. To get the same level of checking with a traditional regex, we would have had to duplicate all the property names into an alternation. Since capture processing participates in pattern matching (i.e. it can signal a match failure), it can be used to prune the search space early, which is an advantage over post-processing the results of a traditional regex.

This kind of composition is incredibly powerful. Pattern supports the interpolation of a wide variety of sophisticated existing parsers, like Foundation's FormatStyles.

Consider parsing an HTTP header:

HTTP/1.1 200 OK
Connection: close
Proxy-Connection: close
Via: HTTP/1.1 localhost (IBM-PROXY-WTE)
Date: Thu, 02 Sep 2021 18:05:45 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=15768000
Last-Modified: Thu, 02 Sep 2021 17:54:18 GMT
Accept-Ranges: bytes
Content-Length: 6583
Content-Type: text/html; charset=UTF-8

We can extract the HTTP status code, date, and content type with the following pattern:

let match = header.match {
  Group {
    "HTTP/"
    Double.FormatStyle()
    Int.FormatStyle().capture()
    OneOrMore(.letter)
    Newline()
  }
  .skipWhitespace

  Repeating {
    Alternation {
      Group {
        "Date: "
        Date.FormatStyle.rfc1123.capture { HTTPHeaderField.date($0) }
        Newline()
      }
      Group {
        "Content-Type: "
        MimeType.FormatStyle().capture { HTTPHeaderField.contentType($0) }
        Newline()
      }
      Group {
        /[-\w]+: .*/
        Newline()
      }
    }
  }
}
.caseInsensitive

print(type(of: match))
// Prints (Int, [HTTPHeaderField])?

Do we want both `Pattern` and `Regex`?

Yes!

Pattern uses a more versatile syntax (just regular Swift code!) and supports matching more complex languages than Regex. But Pattern can't compete with the familiarity and ubiquity of traditional regular expression syntax. Regex literals work especially well in conjunction with API such as collection algorithms presented below.

We think Pattern and Regex can complement one another by:

Allowing the use of Regex literals within Pattern builders, alongside a rich library of off the shelf parsers. This will let folks choose succinct expressions when they want, but still nudge them towards more powerful and general constructs.
Adding a refactor action to transform Regex literals into Pattern builders. This allows rapid prototyping using Regex literals with an easy off-ramp to something more maintainable and powerful.

Collection Algorithms

We intended to extend and add generic consumer and searcher algorithms to the standard library for operating over collections using patterns or regexes.

Consider contains, which today can only check for the presence of a single Element:

let str = "Hello, World!"
str.contains("Hello") // error: cannot convert value of type 'String' to expected argument type 'String.Element' (aka 'Character')

As part of this effort, we'll be adding a variant of contains that invokes a "searcher" of the same element type:

// The below are all equivalent
str.contains("Hello") || str.contains("Goodbye")
str.contains(/Hello|Goodbye/)
str.contains {
  Alternation {
    "Hello"
    "Goodbye"
  }
}

The kinds of algorithms that can be added or enhanced by consumers and searchers include:

firstRange(of:), lastRange(of:), allRanges(of:), contains(_:)
split(separator:)
trim(_:), trimPrefix(_:), trimSuffix(_:)
replaceAll(_:with:), removeAll(_:), moveAll(_:to:)
match(_:), allMatches(_:)

Future Work

The Swift operator ~= allows libraries to extend syntactic pattern matching by returning whether matching succeeded or not. An enhancement to this would allow libraries to produce a result as part of a destructuring pattern match, allowing patterns and regexes to be used inside case syntax and directly bind their captures to variables.

func parseField(_ field: String) -> ParsedField {
  switch field {
  case let text <- /#\s?(.*)/:
    return .comment(text)
  case let (l, u) <- /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/:
    return .scalars(Unicode.Scalar(hex: l) ... Unicode.Scalar(hex: u ?? l))
  case let prop <- GraphemeBreakProperty.init:
    return .property(prop)
  }
}

The "Big Picture"

"String processing" as presented above is touching on something broader: processing content that might span from simple binary data (UInt8) to semantics-rich entities (Character or even generic over Equatable). Such content may be readily available for fully-synchronous processing, derived in contiguous chunks from an asynchronous source, or we may even be reacting live to some incoming stream of ephemeral content from a fully asynchronous source.

The "big picture" is a complex multi-dimensional (and non-orthogonal!) design space that we need some way to talk and reason about to make progress. Swift is source- and ABI-stable, meaning decisions we make now can positively or negatively impact the ability of Swift to meet future needs. Thinking ahead about different areas of this design space can help us avoid painting ourselves into a corner and can help guide us toward more general, broadly-useful approaches.

For musings on this topic, see The Big Picture.

davedelong · September 29, 2021, 11:32pm

This is one of those "shut up and take my money"-kind-of-pitches

xwu · September 30, 2021, 1:12am

I could not agree with this more! I noodled around a bit with some possibilities in this area a few months ago and quickly realized that making a Pattern builder more terse or making a regex type more readable quickly leads to poor compromises.

There’s some bikeshedding that I’m sure will take place for the names used in a Pattern, but the overall direction here is simply fantastic.

ahti · September 30, 2021, 1:44am

Overall I really like the direction here. I had previously spoken against including classic regex syntax, but strongly typed captures and the idea of a refactoring action to expand a regex into a pattern tip the scales for me

Wrt the bigger picture, I think the idea of good support for multi-tiered processing is very appealing, even beyond strings in data: I've implemented a few binary parsers in recent years, and was always unsatisfied with how different parsing libraries handled extracting information from bit ranges that don't line up with byte boundaries (e.g. parsing 4 6-bit numbers from a 3 byte section of data).

Karl · September 30, 2021, 1:50am

Big +1 to this. I really like the Pattern result-builder. As you say, it's more much readable, and if it can indeed generate efficient code, it would be an obvious improvement.

That is rather unforunate. The standard library only has an eager split; there is a lazy version in swift-algorithms, but that's a very big package and some projects may be reluctant to add such massive dependencies. We have a lazy filter, but it's a badly-behaved Collection which can lead to performance and code-size penalties. There is definitely room for improvement when it comes to the standard library's collection algorithms, even outside of this effort.

This looks really great. The biggest issue I see is that result-builders don't allow you to control type resolution. Adding types like "Group", "Newline", or "Repeating" to the top-level standard library seems problematic. The topic has come up several times before, though, so there is certainly interest in the community.

Is it possible for the compiler to disambiguate the result-builder from the regular contains(where:) function we have today?

Can we do this with RangeReplaceableCollection? It has... certain quirks, you might say (little things, like invalidating all indices whenever you mutate the collection) that make this difficult to implement generically. I'd definitely be +1 to replacing RRC itself to remedy this.

If the matches are returned as slices and then processed in bulk, using them for replacement means quadratic performance due to excessive COWs. Replacing-as-you-go can be more efficient, but only if you can stop the collection removing excess capacity until all operations are finished. IMO, that would be a critical part of an RRC replacement.

Jumhyn · September 30, 2021, 2:00am

Really love this, super excited to see it make its way into Swift!

Yeah, this is unfortunate. Result builders do allow you to control type inference, so we could probably have a solution that surfaced these as .group, .newline etc. without further language extensions, but then we lose the nice type-based syntax. We could always put these in some sort of Patterns module I suppose, so as not to pollute the global namespace with the cost of requiring an extra import.

nh7a · September 30, 2021, 2:10am

Super excited! Only question I have at the moment is, would it require newer iOS/iPadOS SDK to use on those platforms?

Patrick_Smith · September 30, 2021, 2:14am

Declarative parsers like this are really powerful, and conceptually I find a lot easier to understand.

I’ve being playing around with similar ideas with generator functions in JavaScript (GitHub - JavaScriptRegenerated/yieldparser: Parse using JavaScript generator functions — it’s like components but for parsing!). I think of the model as being akin to components for parsing. The power of components is that allow you to compose larger solutions out of smaller solutions, and they force you to give those smaller pieces a name.

import { parse, mustEnd } from 'yieldparser';

function* Digit() {
  const [digit]: [string] = yield /^\d+/;
  const value = parseInt(digit, 10);
  if (value < 0 || value > 255) {
    return new Error(`Digit must be between 0 and 255, was ${value}`);
  }
  return value;
}

function* IPAddress() {
  const first = yield Digit;
  yield '.';
  const second = yield Digit;
  yield '.';
  const third = yield Digit;
  yield '.';
  const fourth = yield Digit;
  yield mustEnd;
  return [first, second, third, fourth];
}

parse('1.2.3.4', IPAddress());
/*
{
  success: true,
  result: [1, 2, 3, 4],
  remaining: '',
}
*/

Another interesting library is elm/parser (GitHub - elm/parser: A parsing library, focused on simplicity and great error messages) which uses parser combinators. Their prior art is also worth reading: parser/comparison.md at master · elm/parser · GitHub

siracusa · September 30, 2021, 3:00am

While I (sort of) see the appeal of Patterns, I find Raku’s Grammars to be a more useful higher-level structure for regexes. Could such a thing be built in Swift without any language features beyond those proposed above?

Nevin · September 30, 2021, 3:21am

There’s a lot to think about here, and I’m still taking it all in.

As a first point though, I want to strongly agree with this:

The examples shown with regexes are essentially opaque to the reader. Even knowing what regular expressions are and how they work, their visual appearance is inscrutable.

Swift strongly values clarity at the point of use, and regular expressions are antithetical to that. Normal Swift code reads nearly like a grammatical English sentence. Regular expressions (with existing syntax) read like a jumble of arbitrary symbols.

I think that we should view regexes as a compatibility feature, not a recommended usage pattern. Something to avoid whenever possible, not something to reach for in the common case.

If we can make a convenient and powerful pattern-matching syntax that reads like normal Swift code, that will be amazing. Then we can recommend using that for all situations, and relegate regexes to the margins.

AliSoftware · September 30, 2021, 3:22am

I love everything about this! Love seeing this happening!!

Chris_Eidhof · September 30, 2021, 3:57am

I love the idea of this, thank you so much for writing this.

One thing I would improve is being able to bind variable names where you declare them. In your graphemeBreakPropertyData, it'd be cool if we can somehow declare the variable names inside the result builder, rather than using captures. I realise this is not a trivial change and something that needs a lot more thinking / designing.

1-877-547-7272 · September 30, 2021, 4:12am

I’m excited to see this get addressed! I love the flexibility of having both RegEx and Pattern.

I do have concerns about RegEx literal syntax, though. / is already a valid prefix and postfix operator, so code like /a/ could be reinterpreted. I think we should consider reusing ", like Swift does with Characters (instead of using ' like other C-style languages). For example:

let regex: RegEx = "([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*"

or

let regex: RegEx = #"([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*"#

This wouldn’t be a simple String — the RegEx would still be evaluated at compile time and there would ideally be better syntax highlighting. We may also want to change the meaning of \ escapes (like in the first example), but we could instead keep \ the same and suggest the use of raw strings (like in the second example). Using " would give us valuable things from string literals, like

Multi-line literals
Raw string syntax
String interpolation syntax (if we decide not to change the meaning of \) that could be used to combine RegExes.

Also, what does . mean? The proposal seems to use “scalar” and “character” interchangeably, but they mean different things in Swift (“character” usually refers to a grapheme cluster, not a Unicode scalar). I think it should refer to a grapheme cluster — it would encourage correct string processing, which is a goal of Swift. However, it should be noted that no other regular expression implementation that I know of uses . to refer to a grapheme cluster — even the Unicode ICU implementation.

1-877-547-7272 · September 30, 2021, 4:29am

Also, does this imply we’re taking Group out of SwiftUI and putting it into the standard library? I think it’s a good idea, but would the ABI support it?

dlbuckley · September 30, 2021, 7:46am

This looks like a fantastic start! Great work to all involved.

beccadax · September 30, 2021, 8:02am

You're right to notice this issue; we're thinking we might need to propose a result builder feature to fix it. Part of the reason we're posting this document is so that, when we start proposing the individual features needed to make regexes and patterns work, Evolution will have an idea of how these random-looking features will fit together.

The Big Picture document suggests that metacharacters might shift meaning based on the string view you're matching, so . would match a Character when the target was a String, but it would match a UnicodeScalar when the target was a String.UnicodeScalarView.

1-877-547-7272 · September 30, 2021, 9:31am

We could work around this using dot notation.

// Assuming `PatternBuilder` constructs patterns from blocks of type `PatternBlock`
struct PatternOfOneOrMore: PatternBlock { ... }
enum PatternBlockShortenedNames: PatternBlock {}
extension PatternBlock where Self == PatternBlockShortenedNames {
    typealias OneOrMore = PatternOfOneOrMore
}

line.match {
    .OneOrMore(.hexDigit)
}

I would put Group in the standard library, though — it’s pretty useful in result builders (SwiftUI already uses it with eight different result builders!).

Lantua · September 30, 2021, 11:18am

Does this need to be in standard library? The result builder part feels like it could sit in a separate lib just fine.

Michael_Ilseman · September 30, 2021, 1:11pm

What formats have you had to parse?

The pattern I would recommend for this kind of task is a strongly-typed wrapper around the raw bytes, with computed vars that will do the masking and shifting.

struct EncodedPixel: RawRepresentable {
  var rawValue: (UInt8, UInt8, UInt8)

  var red: UInt32 { ... unpack ... }
  var green: UInt32 { ... unpack ... }
  var blue: UInt32 { ... unpack ... }
  var alpha: UInt32 { ... unpack ... }
}

A failable init can do error checking, if necessary for your format. A encoded pixel parser's conformance to something like CollctionConsumer (as an example interface for composition) would just try to init and advance 3 bytes.

Definitely improvements to be made to the current laziness situation (CC @timv). Even with an ideal presentation of laziness, it's not quite a drop-in for the script as written (e.g. use of integer subscripting).

A potential issue with approaches which accumulate and transform closures is that they rely on Swift's general-purpose optimizer, whose inlining heuristics are not necessarily tuned for your exact use case :-). It can also be foiled by such pesky real-world concerns as separate compilation.

It's still early, but we're hoping to leverage static compilation and/or staged compilation techniques for the parser library, inspired somewhat by work done in Scala. @rxwei will be exploring this more deeply.

I'm hoping we can extend result builders by allowing conformers to provide a type to serve as a namespaces for lookup, similar to custom string interpolation.

That would also open the door to defining and using custom operators (e.g. postfix +) inside of a result builder without bloating their global overload set.

We'll definitely have to figure out trailing closure ambiguity here. This seems like an issue that will keep coming up with broader application of result builders.

There's also more potential ambiguity with collection algorithms. For example, str.contains("A") will currently resolve "A" as a Character because there's only the Element-taking variant. But if there's an overload for a collection-with-the-same-element, it would prefer String for a literal. And, we're showing the use of literals inside a result builder to represent a literal subpattern. Semantically they would all be equivalent, but we like code to keep calling the functions they think they're calling :-).

We'll have to tackle these problems as we deploy the collection algorithms. For the consumers/searchers prototype way back when, I fell back to using argument labels to differentiate, but hopefully we can provide something cleaner here.

@timv is exploring these algorithms in more depth.

I agree that RRC is basically broken for batch operations and wrote more about it here. A version of replaceSubrange that returns the newly created range can help in some cases, but would still be inefficient.

At the time I was hoping more work on ownership and generalized accessors would make the design pattern obvious. This is definitely a bridge that needs to be crossed.

We defintely want to ship runtime support as part of Swift. Whether you access API directly from the Swift default module or need an explicit import Pattern is still up for debate. I'm hoping that result builder namespaces makes this decision clearer.

Yup, I'm a big fan of the programmer experience afforded by parser combinators and will defintely check out that work in Elm. As mentioned in the big picture, we also want to be able to leverage compiler technology to get around some of the performance issues with parser combinator libraries, especially since Swift supports separate compilation.

Patrick_Gili · September 30, 2021, 1:13pm

I love this. However, I have one question. With either method, regex or pattern, how can you determine how many characters were consumed?

Declarative String Processing Overview

Declarative String Processing Overview

Introduction

Example

Naive handwritten parser

Single-pass handwritten parser

Proposed Solution

Regular Expressions

From Regex to Pattern

Pattern Builder

Do we want both Pattern and Regex?

Collection Algorithms

Future Work

The "Big Picture"

From `Regex` to `Pattern`

`Pattern` Builder

Do we want both `Pattern` and `Regex`?