This is definitely beyond the scope of the present discussion. It would be wonderful to have in the distant future.
I'm not sure that the traditional regex syntax is 'concise in a good way' -- if concision is measured in terms of 'ease of writing' instead of just character count ...
/([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*/
The traditional regexp syntax introduces an excessive need to escape control characters because non-control characters and control characters are allowed in the same lexical contexts, with no delimiters required ... A version of the syntax that required delimiters around the non-control characters would be far more writeable/readable in my opinion, and would allow things like named character classes ...
/[(digit|"A"..."F")+] (":.." [(digit|"A"..."F")+])? whitespace* ";" whitespace [word+] any*/
Just changing the syntax to require quotes around the non-control character sequences opens up a lot of options for a more sane syntax ...
Going further, a strawman 'just embed ohm as a literal' type literal (with [] brackets added as a shorthand for 'non-named capturing') wouldn't be that much less concise either ...
let pattern = #Grammar { ( [(digit | "A"..."F")+] (".." [(digit | "A"..."F")*])? whitespace* ";" whitespace [word*] newline)* }
line.match(pattern).captures.flatMap { (l, u, p) in ... }
(I'm not really sure how the non-named capture syntax should work -- but I think something with [] could be reasonable)
Such a style would be a little longer by character count -- but not by much -- and I think there are clear gains in readability and writeability at the cost of a very small number of extra characters to type ...
// we are programmers -- we love non-semantic whitespace and naming things!!
let pattern = #Grammar {
  hexword = (digit | "A"..."F")+
  line = [hexword] (".." [hexword])? whitespace* ";" whitespace [word*] newline
  lines = line*
}
line.match(pattern).lines.flatMap { (l, u, p) in }
An ohm-style literal could pay off even more as the processing needs become more complex ...
let pattern = #Grammar {
  hexword = (digit | "A"..."F")+
  line = [hexword] (".." [hexword])? whitespace* ";" whitespace [word*] [microsoft_line_extension?] newline
  microsoft_line_extension = <...blah...>
  lines = line*
}
line.match(pattern).lines.flatMap { (l, u, p, m) in ... }
I think the lack of delimiters around the 'non-pattern-matching' parts of the syntax is a big issue with the 'old familiar regex style literal', and in my opinion the design choices made in the past are not worth carrying forward for reasons of familiarity alone ... Mixing control and non-control characters requires a lot of characters to be escaped in practical use, which really obscures the logic of the pattern match in a way that tends to be both harder to read and harder to write ... I don't think 'being concise' necessarily requires keeping the control and non-control character types mixed like this, and I think there's at least an argument for a literal style that does more to reduce the control-sequence-escaping soup of traditional perl-style regexp literals ...
(quoting out of order)
Yes, for example: in Expr.h
If I'm understanding the question correctly, probably overloads, but we're hoping variadic generics can help with the insanity there.
I do share some concerns about using tuples for anything serious in Swift. I feel they make sense intuitively and on first impression for captures (and labels for named captures), but it's entirely possible we'll run into a brick wall at some point. Hopefully, this will help motivate variadic generics, improvements to tuples, and any story for easily making tuple-like structs.
Could you share more? For the literals, the underlying type is always Substring (the slice type), and we're adding on whether it's optional or an aggregate. We're not trying to turn it into an Int by implicitly invoking an initializer, for instance.
"Building generic functionality" is a big topic and not strictly required as part of this initial push, but I think we'd regret passing up the opportunity to establish a good technical foundation for generic functionality. I'm not sure this is exactly the info you're asking about, but I'll use your question as an excuse for an info dump on one narrow "slice" of this topic :-)
(Code examples, protocols, declarations, etc., below are not being formally proposed. They help highlight the domain of what is possible, but intentionally prioritize semantic clarity over code prettiness.)
Protocol-powered matching algorithms
The overview talks about enabling collection algorithms (CC @timv), and the bare minimum would be concrete overloads taking Regex or Pattern or something similar. But we're likely to generalize matching via protocols. The core interface for non-capturing matching (i.e. matching that doesn't produce a value) could be consume below, and a value-producing interface could be match below. Exactly how this is formulated, what the protocol hierarchy is, when types get bound, etc., is all TBD. Here's a basic sketch:
protocol CollectionConsumer {
  associatedtype Input: Collection
  associatedtype Position = Input.Index
  func consume(_: Input, startingFrom: Position) -> Position?
}

protocol CollectionMatcher: CollectionConsumer {
  associatedtype Value
  func match(_: Input, startingFrom: Position) -> (Value, resumeFrom: Position)?
}

extension CollectionMatcher {
  func consume(_ c: Input, startingFrom idx: Position) -> Position? {
    match(c, startingFrom: idx)?.resumeFrom
  }
}

extension CollectionMatcher where Value == () {
  func match(_ c: Input, startingFrom idx: Position) -> (Value, resumeFrom: Position)? {
    guard let r = consume(c, startingFrom: idx) else { return nil }
    return ((), r)
  }
}
The generalization of a look-around assertion is just a special case of a consumer (e.g. consume(...) != nil).
Regex/Pattern would conform, of course, but so could Int (nibble an integer off), Foundation's date format styles (nibble a date off), or 3rd-party types (arbitrary nibbling).
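As a hedged illustration of that kind of conformance, an integer "nibbler" might look like the sketch below. The protocol is a condensed restatement of the strawman interface above, and IntMatcher is entirely hypothetical, not proposed API:

```swift
// Condensed restatement of the strawman matcher interface, for self-containment.
// All names here are sketches from the discussion above, not a settled design.
protocol CollectionMatcher {
    associatedtype Input: Collection
    associatedtype Value
    func match(_: Input, startingFrom: Input.Index) -> (Value, resumeFrom: Input.Index)?
}

// Hypothetical conformer that nibbles a decimal integer off the front of a substring.
struct IntMatcher: CollectionMatcher {
    func match(_ input: Substring, startingFrom idx: Substring.Index)
        -> (Int, resumeFrom: Substring.Index)? {
        var end = idx
        while end != input.endIndex, input[end].isWholeNumber {
            end = input.index(after: end)
        }
        guard end != idx, let value = Int(input[idx..<end]) else { return nil }
        return (value, resumeFrom: end)
    }
}

let line: Substring = "123abc"
// IntMatcher().match(line, startingFrom: line.startIndex) nibbles 123 off the
// front and resumes at the "a".
```

A Foundation date format style could conform analogously, nibbling a Date off the front of the input.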
We are talking about being able to interleave library calls as part of matching, so we need some interface to express this. The consume/match interface above could also be the interface for composition with library calls. That means any conformers could be called (mechanism TBD) within a Pattern result builder. Effectively, conformers are sub-patterns, so long as all the types line up (more on that later).
Conforming to this protocol does require providing a binding for Input (which could be a generic parameter). This could be a problem for types that, for whatever reason, cannot provide one (many-to-many type relationships, working around HKTs, etc., are TBD).
Similarly, the generalization of a character class is a function (Element) -> Bool.
protocol CharacterClass {
  associatedtype Element // Commonly `Equatable` or `Hashable`, but doesn't need to be.
  func contains(_: Element) -> Bool
}
Character classes don't need to bind Input, just their Element. Any character class could be used as a consumer (by binding Input) through the use of subscript and index(after:), mechanism TBD.
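One imaginable shape for that TBD mechanism is an adapter that lifts any character class into a single-element consumer by stepping the index with index(after:). This is purely hypothetical; the protocols below are condensed restatements of the sketches above:

```swift
// Condensed restatements of the strawman protocols, for self-containment.
protocol CharacterClass {
    associatedtype Element
    func contains(_: Element) -> Bool
}

protocol CollectionConsumer {
    associatedtype Input: Collection
    func consume(_: Input, startingFrom: Input.Index) -> Input.Index?
}

// Hypothetical adapter: a character class consumes exactly one matching
// element, binding Input and advancing with index(after:).
struct ClassConsumer<C: CharacterClass, Input: Collection>: CollectionConsumer
    where Input.Element == C.Element
{
    var characterClass: C
    func consume(_ input: Input, startingFrom idx: Input.Index) -> Input.Index? {
        guard idx != input.endIndex, characterClass.contains(input[idx]) else {
            return nil
        }
        return input.index(after: idx)
    }
}

// Example character class: hexadecimal digits.
struct HexDigit: CharacterClass {
    func contains(_ c: Character) -> Bool { c.isHexDigit }
}

let hex = ClassConsumer<HexDigit, Substring>(characterClass: HexDigit())
// hex.consume(someLine, startingFrom: someLine.startIndex) would step past a
// leading hex digit, or return nil if there isn't one.
```

Repetition (consuming a run rather than one element) would then be a combinator over such consumers, mechanism equally TBD.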
You could even imagine regex literals applying to any collection for which there is a bijection between Element and Character.
protocol Bijection {
  associatedtype From
  associatedtype To
  func mapFrom(_: From) -> To
  func mapTo(_: To) -> From
}

protocol RegexApplicableCollection: Collection, Bijection
  where To == Element, From: StringProtocol
{
  // ...
}
And you could even imagine all kinds of extra stuff built on top of the notion of a CollectionMatcher whose Value type is round-trippable through String (e.g. pretty printers). It goes on and on (provided the many-to-many or HKT issues don't get in the way...).
An alternative generalization of matching could be:
protocol MatchProcessor {
  func run(_: inout MatchingEngine)
}
We've traded being bound to Input for an API on MatchingEngine, which is TBD. A processor could query aspects of matching state, potentially fetch the collection and current position (mechanism TBD), and interact with or drive pattern matching (signal failures, push/pop from a backtracking stack, etc.). This would of course mean designing a stable API/ABI for MatchingEngine, potentially exposing implementation details. But this kind of approach works well for extending pattern matching over non-collections, such as asynchronous streams of ephemeral content.
Multi-stage type binding
You could imagine a subset of matching functionality that doesn't need to bind any types. For example, /..*/ could match the entirety of any non-empty collection. For such a matcher to conform to this interface, we could have:
struct MatchNonEmpty<Input: Collection>: CollectionMatcher {
  typealias Value = Int
  func consume(_ c: Input, startingFrom start: Position) -> Position? {
    start == c.endIndex ? nil : c.endIndex
  }
  func match(_ c: Input, startingFrom start: Position) -> (Value, resumeFrom: Position)? {
    let end = c.endIndex
    guard start != end else { return nil }
    return (c.distance(from: start, to: end), resumeFrom: end)
  }
}
A generic parameter Input allows this pattern to apply to any collection of any element type, because this matcher doesn't care. We pretty much have to parameterize, though, as a type can conform to a protocol in only one way.
Semantically, regexes are like matchers generic over Input: Collection where Element == Character (albeit with extra weirdness because String is weird, more on that later). They have similar semantics when applied to String as to Array<Character>, though we'd of course want to fast-path String's implementation to exploit its UTF-8 storage. Regexes have bound the Element type to Character; that is, they are past Element-binding time.
Even after regexes are "compiled" (for some TBD notion of compilation) or even linked/loaded (TBD), they don't necessarily have to be restricted to a particular kind of collection. There is a later point in time at which the collection type gets bound, a later point when a regex gets associated with a particular instance of that collection type, and a later point when it is run on that instance. Of course, these could all be the same instant in time, as in myString.split(/\s+/), but each provides more information for compilation/specialization.
On the other hand, whenever we bind a type, we are either dropping generality or limited by what can be accomplished through parameterization. Something that only binds Element is perfectly applicable to Sequence and AsyncSequence in addition to Collection. Something that works on a moving window over asynchronously delivered chunked data might bind its collection to be the buffer. But it would additionally need the means to interact with that window (e.g. peek/fill the buffer, eat/drop processed portions) and the means to restrict look-around/backtracking to the window.
If Patterns are composed using the CollectionConsumer interface above, that means that Patterns will have to deal with binding or parameterizing-away the Input type. This might end up forcing some many-to-many or HKT issues (hic sunt dracones).
This is not only about prefix/postfix operators. For example, foo+/bar/+baz also becomes ambiguous. Of course, we can make a rule that an unspaced infix operator cannot be used with regex literals, but we still need to decide what to do here.
Similarly
foo()
/ bar / .baz()
is currently parsed as foo() <infix-operator '/'> bar <infix-operator '/'> .baz()
However, this can be parsed as
foo();
<regex-literal ' bar '>.baz()
in this proposal.
I think Regex should hold the parse target type, like Regex<String, _>. If metacharacters change their meaning as in this quote, applying a regex that was intended to be used for UnicodeScalar to String is just a bug, but it cannot be prevented if Regex doesn't hold the target type.
I think that what kind of syntax should be used for regular expressions is not the most important question. We can decide it later.
Really?
Well, almost all of us may want both of them. I think, however, Regex might break Swift's String ecosystem.
We must remember that, in Swift, String is a Collection of Characters, not of Unicode.Scalars.
Let's take "café" as an example.
"café".precomposedStringWithCanonicalMapping
consists of 4 scalars; U+0063
, U+0061
, U+0066
, and U+00e9
.
"café".decomposedStringWithCanonicalMapping
consists of 5 scalars; U+0063
, U+0061
, U+0066
, U+0065
, and U+0301
.
Although they have different number of scalars, they are regarded as the same string.
print("café".precomposedStringWithCanonicalMapping == "café".decomposedStringWithCanonicalMapping)
// -> Prints "true"
Furthermore, "é" and "e" are distinct Characters:
print("café".precomposedStringWithCanonicalMapping.contains("e"))
// -> Prints "false"
print("café".decomposedStringWithCanonicalMapping.contains("e"))
// -> Prints "false"
print("café".precomposedStringWithCanonicalMapping.contains("\u{00e9}"))
// -> Prints "true"
print("café".precomposedStringWithCanonicalMapping.contains("\u{0065}\u{0301}"))
// -> Prints "true"
print("café".decomposedStringWithCanonicalMapping.contains("\u{00e9}"))
// -> Prints "true"
print("café".decomposedStringWithCanonicalMapping.contains("\u{0065}\u{0301}"))
// -> Prints "true"
On the other hand, a Regex /e+/ (or some other syntax like #/e+/#, if adopted) would be interpreted into a Pattern such as
OneOrMore("e" as Unicode.Scalar).capture { ... }
Imagine such a Regex world...
"café".decomposedStringWithCanonicalMapping.contains(/e+/)
returnstrue
."café".precomposedStringWithCanonicalMapping.contains(/é+/)
may returnfalse
.
Even more, let's imagine split with Regex...
"café".decomposedStringWithCanonicalMapping.split(/e+/)
returns ["\u{0063}\u{0061}\u{0066}", "\u{0301}"]
I want to insist that a string consisting of only the single Unicode scalar U+0301 (whose category is Mn) is invalid.
(You may not know, but these problems matter a great deal for Hiragana (ひらがな), Katakana (カタカナ), Kanji (漢字), and so on.)
The points are:
- Regex is a pattern matcher for Unicode.ScalarView, not for String, at least in the Swift world.
- Only a Pattern whose element is Character or a collection of Characters (such as Substring) should be a matcher for String.
I think result builders may need to be beefed up to support the proposed syntax, right? Otherwise you're looking at a lot of overloads. For example, from the beginning of the thread:
Group {
  "Date: "
  Date.FormatStyle.rfc1123.capture { HTTPHeaderField.date($0) }
  Newline()
}
Would this evaluate to something like the following?
Group<(Pattern<Void>, Pattern<Date>, Pattern<Void>)>
Somewhere, these Voids would need to be ignored so that we can work with just Date. But maybe, as you said, variadic generics will come with the tools needed to do just that?
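To make the overload concern concrete, here is a purely hypothetical sketch. Pattern and PatternBuilder are stand-ins invented for illustration (nothing here is proposed API), showing the kind of overload that would be needed for just one arity and one placement of the Voids:

```swift
import Foundation // Date, used only as a phantom capture type here

// Hypothetical stand-in for the pitched Pattern type.
struct Pattern<Captures> {}

@resultBuilder
enum PatternBuilder {
    // Without variadic generics, dropping Void captures takes one overload
    // per arity and per placement of the Voids; this covers only the
    // (Void, Date-like, Void) shape of the Group example above.
    static func buildBlock<T>(
        _ p0: Pattern<Void>, _ p1: Pattern<T>, _ p2: Pattern<Void>
    ) -> Pattern<T> {
        Pattern<T>()
    }
}
```

With variadic generics, one could instead imagine a single buildBlock over a parameter pack that filters the Void components out of the capture tuple.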
This is called a "branch" by re_format(7) (*BSD, macOS) and by regex(7) (Linux / GNU):
A (modern) RE is one or more non-empty branches, separated by `|'. It matches anything that matches one of the branches.
I agree that "Alternation" is misleading.
Dave
Just to throw my two pence in. On the basic concept I think this is very much a +1, both regexes and patterns. In fact I only really have one concern: Alternation is a pretty obscure and unclear name. In fact, if I read it, my first thought is "these should be matched alternately" rather than "any of these should match". Something like AnyOf might be a better option. Other than that, this looks like it could be a good use of a result builder (just make sure we have enough documentation of all the options, as I suspect there will be many).
When applied with grapheme-cluster semantics (the default if applied to String), it would match grapheme cluster by grapheme cluster, and comparison obeys canonical equivalence. There are some features that might not be supported, e.g. generalizing some scalar properties to grapheme clusters. Resulting indices would be grapheme-cluster aligned.
When applied with scalar semantics (the default if applied to String.UnicodeScalarView), it would do scalar-by-scalar matching with binary semantics. Resulting indices would be scalar-aligned.
TBD is application to one of the encoded views, but it will probably closely adhere to scalar semantics.
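The difference between those two defaults can be illustrated with existing String API. This is plain stdlib behavior today, independent of the proposed engine:

```swift
// "café" with a decomposed accent: 4 grapheme clusters, 5 scalars.
let cafe = "cafe\u{0301}"

// Grapheme-cluster level: Character comparison obeys canonical equivalence,
// so "e" followed by a combining acute equals the precomposed "é".
let graphemeCount = cafe.count                                   // 4
let containsPrecomposedE = cafe.contains("\u{00E9}")             // true

// Scalar level: binary semantics; U+00E9 is not literally among the scalars.
let scalarCount = cafe.unicodeScalars.count                      // 5
let containsScalarE9 = cafe.unicodeScalars.contains("\u{00E9}")  // false
```

A regex applied at each level would be expected to follow the same two behaviors.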
These would also likely have different character classes: one which maps to a Character property (and we should add new ones as part of this effort), and one which follows the normal ands and ors based on scalar properties. Character classes would likely be customizable (e.g. POSIX mode, or even supplying a custom one), mechanism TBD.
If, as I understand it, this differs from how other languages handle regex matching, this could be surprising for folks in a very likely scenario which we are frankly inviting: applying a regular expression they found on Stack Overflow, which has been reliably vetted, to a Swift String, relying on the fact that Swift supports the same regular expression features as other languages, only to find they obtain a different result.
This is one of the main reasons I'm opposed to baking regex into the language itself: it is literally impossible for Swift to maintain compatibility with every form of regular expression out there.
A dedicated regular expression literal will never be capable of using any regular expression.
What B says here is really sensible.
Reading and writing fluency is important. Conciseness and familiarity can contribute to fluency, but they don't really contribute to fluency with regular expressions. Other considerations apply.
Consider that on an average UNIX-y system you may find about five different dialects of regular expression. So when we speak of familiar expressions, we're not even talking about one thing.
Dave
Regex compatibility is a non-goal. Ideally the Swift standard library picks a standard and uses it.
Which, again, is why it shouldn't get special treatment in the language.
Other languages differ from each other in how they handle regex matching, sometimes greatly. Definitions of standard character classes differ, some engines match scalars, and some engines even support matching under canonical equivalence or even grapheme-cluster semantics (IIRC Perl 6 did). Often these are configured through adverbs, though there's usually a default (which differs among languages). A broader cross-language comparison of features and classes is forthcoming (CC @nnnnnnnn).
We're talking about matching similarly to other engines, modulo Swift's more recent interpretation of Unicode correctness. This isn't that different from other concerns, such as how counting characters in String differs from other languages. It would probably be more surprising for a regex match to produce indices that are not Character-aligned, or to exhibit different behavior than walking the contents yourself.
Figuring this stuff out may produce more evidence for the "don't do literals" camp. Doing the literals does have some procedural benefits in directing efforts and I'm interested in leveraging those benefits even if they don't land :-). Most of the core technical work is the same: literals would effectively desugar to a shared representation, and from there is where all the hard stuff happens. They should be tackled in a way that improves aspects of the stdlib, e.g. character classes come with more Character properties and corresponding API on Unicode.Scalar.
A potential source of evidence for the "do literals" camp could be when we see them used with generic collection algorithms (CC @timv) or subpatterns within the result builder. If we do literals via an ExpressibleByRegexLiteral protocol (and I'm drafting an argument that we should, if we're doing them), then literals could conceivably be passed to NSRegularExpression for ICU semantics, JSCore for JavaScript(ish) semantics (and JIT), a PCRE wrapper library for PCRE semantics, etc.
Note that doing PCRE-esque regex literals doesn't preclude matching literals or custom literals in the future. The literals also don't preclude parsing support inside Pattern, or even a choice of defaults inside Pattern (e.g. the default Alternation/OneOf could be PEG-style ordered choice, and the refactoring action would supply the boolean parameter to choose the other kind). There is still a lot TBD in Pattern, though.
I don't see how that could be done without implementing custom literals in general too.
Also, attempting to process meaningful Unicode via regex in a language that doesn't handle Unicode the way Swift does is a recipe for disaster, so I think the likelihood of encountering such regexes originally written for another language is quite slim.
Just because it would break horribly doesnât make it unlikely.
Training people to make context-free assumptions based on literal delimiters would make it especially likely.
I'll be glad if Character (grapheme-cluster) semantics can be applied.
On the other hand, I agree with @xwu's point, because I also feel such a feature would confuse some folks, especially those coming from other languages.
For example, some widely used libraries such as PCRE, Oniguruma, etc. have an expression to specify a Script name, such as \p{Hiragana}.
A scalar has a Script property, but a Character doesn't.
Some people may want to write /\p{Hiragana}/; however, it cannot be applied to String, only to String.UnicodeScalarView.
They might think, "I just want to write a usual regular expression and want it to be applied to String."
I'm looking forward to the result. But I think which semantics is in the majority matters more than the fact that there are differences.
We may have to discuss first whether such a feature in Swift should be called Regex or some other, more appropriate name, and whether or not it should have a similar appearance to existing regular expressions in other languages.
In the context of Pattern: I am wondering how wide the scope of this pitch is. Is it the intention that we can describe e.g. the Swift grammar as a Pattern?