Declarative String Processing Overview

ensan-hcl · October 7, 2021, 12:22pm

How can PatternBuilder deal with recursive pattern? For example, how do you express Swift's multiLineComment which matches /* ... */ but also /* ... /* ... */ ... */ and not /* ... */ ... */?

JanWillemBrands · October 7, 2021, 3:00pm

Something like this ?

func nestedComment() {
    "/*"
    Repeat(.anyCharacter)
    Optional { nestedComment() }
    Repeat(.anyCharacter)
    "*/" 
}

ensan-hcl · October 7, 2021, 3:10pm

Won't nestedComment cause infinite recursion? I'm worried about this point.

JanWillemBrands · October 7, 2021, 3:12pm

it only recurses when it encounters additional /* opening brackets, no?

JanWillemBrands · October 12, 2021, 1:22pm

Question: would the below Pattern be rejected with 'ambiguous syntax' or return the four distinct parses ?

"xxx".match {
          ZeroOrMore { "x" }
          ZeroOrMore { "x" }
      }

dhoepfl · October 12, 2021, 2:03pm

You can’t.
Pure regular expression are limited to regular grammars.
Nested comments are not regular but context-free.

That being said, it is not clear to me what type of grammar the PatternBuilder should be capable of. If they should implement anything above regular grammars: Please split the enhancements. Actually, I think this would be good anyways.

I guess, from a compiler developer point of view, one probably will want to have PatternBuilder first, adding syntactical sugar (“regex literals”) that simply gets transcribed into a PatternBuilder later.

On the other hand, If I’d have to choose only one, I’d pick regex literals over a more capable PatternBuilder …

See above.

/* ... /* ... */ ... */ would be parsed as:

func nestedComment() {                Consumes:
    "/*"                              "/*"
    Repeat(.anyCharacter)             " ... /* ... */ ... "  (A)
    Optional { nestedComment() }      <nothing>
    Repeat(.anyCharacter)             <nothing>              (B)
    "*/"                              "*/"
}

(For non-greedy parsing, swap A and B)

You would need non-greedy parsing and OptionalButGreedyAndRequiredIfPossible. The non-/greedy dance is required for e.g. /*1/*2/*3*/4*/5*/, otherwise the first recursion could detect /*2/*3*/ as a valid nested comment, leaving 4*/5 as second Repeat(.anyCharacter) for the outer comment.

I’m not even sure if this would work in all cases, you just cannot parse non-regular languages using a regular parser. (Which it no longer is if you add recursion and these non-/greedy switches.)

It depends. Most RegEx parsers I know run by default in a “greedy mode” where they try to match as much of the pattern as soon as possible. In that mode, the first ZeroOrMore would consume all x, leaving an empty match to the second one. If supported, the “non-greedy-mode”, the opposite is true: The first matches an empty string, forcing the second ZeroOrMore to take the rest. Both modes always result in an unambiguous solution.

JanWillemBrands · October 12, 2021, 4:36pm

100%, in the context of RegEx's. But if we could make Pattern's cover context-free grammars that would make Swift pretty special and make it worth the much more verbose spelling...

It would also be an opportunity to marry the two formats, using RegEx syntax for token definitions and Pattern's for the grammar part.

And actually, in the greedy scenario I think the parse fails as follows:

func nestedComment() {                Consumes:
    "/*"                              "/*"
    Repeat(.anyCharacter)             " ... /* ... */ ... */"
    Optional { nestedComment() }      <nothing>
    Repeat(.anyCharacter)             <nothing> 
    "*/"                              <fails>
}

If Pattern were backed by a context-free parser, and it would implement longest-match-wins, things would be different :)

Michael_Ilseman · October 12, 2021, 6:05pm

I want to note that the overview presented here is confined to the borders of "regex" (even if those borders are fuzzy), but we do want Pattern (or some related type) to support parsers. Foresight and forethought is essential to settling on a good long-term design here, but specific details are not strictly necessary in the context of this overview. I'm happy to discuss this here and now, or in a related thread, but just making it clear that it's a little more speculative and certainly any bikeshedding here would be premature.

What is outlined above is talking about a subset of what Pattern may support, but it is my intention for Pattern (or a closely related type) to be capable of at least anything vanilla PEGs can. PEGs describe an algorithm for parsing context-free languages* and some context sensitive languages.

* Well, it's complicated, but certainly all unambiguous CFGs and the vast majority of all CFGs. It used to be unproven whether they can handle all CFGs, but I haven't checked the latest literature on this subject.

In my mind, yes, in the fullness of time. Again, regex is what's being overviewed and discussed here, but my goal is to do this in such a way that patterns can be used for real, industrial strength parsers for real languages such as Swift.

The PEG prototype has a test for C and Swift style comments. Note that this PEG prototype has no syntax, that is the test is constructing ASTs by hand and feeding that to an engine. So, it's merely a demonstration of the power of vanilla PEGs and not what the developer experience will look like. (This is also why it's difficult to discuss concretely so early, especially on a forum, as our minds tend to latch on to syntax over semantics).

That's correct. Your grammar is not left-recursive, so even vanilla PEGs handle them just fine. Your Optional would desugar into an ordered-choice between a recursive call or doing nothing. This establishes a save point before the recursive call, so when the recursive call eventually fails we'll back up to before the / in e.g. /x). (Left recursive grammars can be supported through a notion of forward progress, if necessary).

JanWillemBrands:

Question: would the below Pattern be rejected with 'ambiguous syntax' or return the four distinct parses ?
"xxx".match {
          ZeroOrMore { "x" }
          ZeroOrMore { "x" }
      }

This gets to the definition of the match API. Does it return all matches, all possible matches (and if so how are they differentiated), the first match, etc. It also gets to what the semantics of ZeroOrMore are specifically, e.g. along the lines of greedy/lazy and non-possessive/possessive. If ZeroOrMore is PEG's greedy+possessive then it would fail to match, having exhausted all the xs. Obviously we want to provide the alternatives as well.

Pattern's names and choices of default don't have to align with what one would often find in regexes, though it does have to be a superset and we do want the result of a refactoring action to look approachable. Hopefully this starts becoming clearer as we converge on regex functionality while trying to also adapt our prototypes such as PEGs.

This overview is splitting off the regex part, at least for the purposes of discussion and narrowing in on shipping something. I think it's far too early to talk about same-type vs different-types-with-protocol, etc., and that will become more obvious as we converge.

If we are doing parsers, we are far far more likely to go with something along the lines of PEGs (recognition systems / analytic grammars) rather than CFGs (generative systems/grammars). Non-determinism and ambiguity are anathema if you can about how something is parsed, which limits their applicability and scalability to real-world problems.

I have links to papers and webpages scattered throughout my responses, the original post, and documents in the repo. I'm happy to link or share for anything that's still unclear here.

Saklad5 · October 12, 2021, 7:57pm

I completely agree. One of Swift’s biggest strengths is being deterministic and consistent (at least in theory), and this shouldn’t be an exception.

dhoepfl · October 13, 2021, 6:52am

If this was the case, no patter would ever match that contains “.*” anywhere other than at the end.

Greedy/Non-Greedy test using perl (leaving out the nesting):

> echo "/* ... /* ... */ ... */" | \
     perl -pe 's|(/\*)(.*)(.*)(\*/)|<\1>\n<\2>\n<\3>\n<\4>|'
</*>
< ... /* ... */ ... >
<>
<*/>
> echo "/* ... /* ... */ ... */" | \
     perl -pe 's|(/\*)(.*?)(.*?)(\*/)|<\1>\n<\2>\n<\3>\n<\4>|'
</*>
<>
< ... /* ... */ ... >
<*/>

JanWillemBrands · October 13, 2021, 11:48am

Ah yes, it's greedy until it isn't. Why did it match only on the last */ bracket? Because if it didn't it would have some left-overs? Why must we worry about this?
A Pattern with it's declarative syntax should as much as possible be decoupled from a particular execution mechanism and relieve us from too much worrying.
What Swift rookie wouldn't like to avoid visiting regex101.com and simply write:

let nestedComment: Pattern = {
    "/*" 
    ZeroOrMore { .anyCharacter }
    OneOrZero { nestedComment() }
    ZeroOrMore { .anyCharacter }
    "*/"
}

YOCKOW · October 13, 2021, 2:05pm

I wonder how to precompile a regular expression for a range such as [A-B].

Given a character "A\u{20DD}"(LATIN CAPITAL LETTER A + COMBINING ENCLOSING CIRCLE).

print(Character("A\u{20DD}") > Character("A")) // -> true
print(Character("A\u{20DD}") < Character("B") ) // -> true

let regex = /([A-B])/

switch "A\u{20DD}" {
case let something <- regex:
  // What is the type of `something`?
}

something would be Character("A\u{20DD}") when applied with grapheme-cluster semantics.
something would be Unicode.Scalar("A") when applied with scalar semantics.

However, at the point of declaration let regex = /([A-B])/, it is impossible to determine whether A and B are Characters or Unicode.Scalars.

Michael_Ilseman · October 14, 2021, 4:45pm

Regex literal pitch is up: [Pitch] Regular Expression Literals. It has a lot more detail and is a great place to continue discussion surrounding the literal itself, it's supported syntax, and any compiler/syntactic concerns around it. It doesn't speak to API or the semantics of matching, which we can continue to discuss here.

Michael_Ilseman · October 14, 2021, 5:02pm

Character class ranges in grapheme-semantic regexes deserves focused discussion. Comparison is linguistically meaningless for grapheme clusters. Then again, so is just about any interpretation on Character, but that doesn't mean that Swift cannot supply common-sense semantics. We have options as to how we want to define this, and a sensible starting point would be < on Character itself, noting that Range<Character> is uncountable. For the literal, options include warning or erroring out

This isn't anything new, and we will have plenty of similar concerns surrounding our handling of e.g. degenerate grapheme clusters.

Right, whether a regex is scalar-semantic or grapheme-semantic may not be know at declaration site, so we may have to invent a way of carrying this information forwards if we want to provide different diagnostics for each.

Also note that many common ranges can be otherwise expressed as a combination of known character classes, so there's some opportunities for nice fixits.

YOCKOW · October 15, 2021, 2:58pm

(Digression: I just remember that I was confused by the change of comparison strategy for Characters in Swift 4.1 -> 4.2 so that I introduced normalized characters in my own library.)

Regardless of the way to compare Characters, it would be most intuitive that

string.contains(where: { ("A"..."B").contains($0) }) == string.contains(/[A-B]/)

is always true.

Anyway, I think we have to keep some consistency in the language since Character-wise comparison is related to not only Regex but also any types associated with String.

By the way, in Swift >= 4.2,

(Character("A")...Character("B")).contains("A\u{20DD}") is true.
(Character("A")...Character("B")).contains("A\u{0308}") is false.

I don't want to say which one is correct, but I think there are sometimes some cases that we want to reconcile the results.

It would be better that we have choices like:

let comparisonResult = someCharacter.compare(anotherCharacter, normalization: .decomposed)
let characterRange = CharacterRange("A"..."B", normalization: .precomposed)

Could we also suppose that key paths, functions, and closures would be available in Regex like below?

// Pseudo-syntax

let keyPathRegex = /$(\.isWholeNumber)/

func isWholeNumber(_ character: Character) -> Bool { character.isWholeNumber }
let funcRegex = /$(isWholeNumber)/

let closureRegex = /$({ $0.isWholeNumber })/

Michael_Ilseman · October 15, 2021, 3:28pm

Yes, when in grapheme-semantic mode, I believe that the most important design point is to maintain equivalent behavior as if you were walking the characters yourself by hand.

Thus, we shouldn't split grapheme clusters in resulting indices. We should generalize what we can to grapheme clusters and reject the rest. More on that in the upcoming discussion on character classes.

There is no universal concept of "correct" here, only a universal concept of "consistent". Ordering is used for programming invariants but is linguistically meaningless.

YOCKOW:

It would be better that we have choices like:

let comparisonResult = someCharacter.compare(anotherCharacter, normalization: .decomposed)
let characterRange = CharacterRange("A"..."B", normalization: .precomposed)

For programming invariants, ordering under NFD can also be interesting, but irrelevant to your concerns. In grapheme-semantic mode, they would all combine and be the same grapheme cluster either way. Grapheme segmentation doesn't split normalization segments, and vice versa.

We do want a lot more normalization API on String. Now that @Alejandro merged native normalization support, this is a possibility.

If you want to meaningfully compare those characters, you also need:

Collation data (too big to ship with stdlib)
Locale or other tailoring (too high level a concern for stdlib)
Know the application domain (even higher level than locale)

as the same character orders differently in different languages and even orders differently in different applications of the same language (e.g. German phonebook order vs German dictionary order). Also, what even is a "character" for ordering purposes is different than for counting purposes, e.g. some locales have digraphs that "stick" together for ordering.

No, we don't want to deviate from standard regex syntax for the literal. This is a great thing to have in Pattern though.

A related concern, which you might be actually asking for, is how to tailor custom character classes through the use of key paths, closures, etc. That is definitely something we want to support and we're baking support into the implementation early. I don't now how best this will surface in API (is it contextual?, etc.) yet, so that's a great thing to be discussing.

Saklad5 · October 15, 2021, 5:18pm

I think Swift should focus on picking a standard and sticking to it religiously. Doesn’t the Unicode Consortium have guidelines for this? Normalization forms, etcetera?

JanWillemBrands · October 16, 2021, 9:44am

Since we'll have both Pattern and Regex I wonder how they work together:

let sentence: Pattern = { "Mary has a little lamb." }

The spaces between the words can be factored out:

let spaces: Regex = ' +'
let sentence: Pattern = { "Mary" spaces "has" spaces "a" spaces "little" spaces "lamb." }

but we usually care more about the words than the spaces between them, so let's make the spaces transparent:

let spaces = ' +'
let sentence: Pattern = { "Mary" "has" "a" "little" "lamb." }.skip(spaces)

Mary's brother John also has a little lamb:

let spaces = ' +'
let person: Pattern = Alternatives { "Mary" "John" }
let sentence = { person "has" "a" "little" "lamb." }.skip(spaces)

10 minutes later:

let spaces   = ' +'
let comment  = { "/*" '.*' OneOrZero { comment } '.*' "*/" }
let person   = Alternatives { "Mary" "John" }
let word     = '\w+'
let sentence = { person
                 OneOrMore { word }
                 "." 
                }.skip([spaces, comment])

But now we have a problem because Mary is both a person and a word, so we'll need a way to e.g. set the matching priority of string literals higher than regex literals like .prefer(.stringLiterals).

I guess I'm asking about composability.

JanWillemBrands · October 16, 2021, 10:02am

Emoji Ordering, v12.0 (unicode.org)

YOCKOW · October 17, 2021, 4:02am

Thank you for the response to my crude questions and also for clarifying what I wanted to say.
Although comparison and some other things about grapheme clusters should be tailor-made as you mentioned, I suspect that regex doesn't have enough potential for such customization.
That's why my impression was that (at least as to grapheme-semantics) all we need is Pattern then Regex is merely the icing on the cake.