Declarative String Processing Overview

I think Regex should hold parse target type, like Regex<String, _>. If metacharacters change its meaning as this quote, applying regex which was intended to be used for UnicodeScalar to String is just a bug, but it cannot be prevented if Regex doesn't hold target type.

I think that what kind of syntax should be used for regular expressions is not the most important.
We can decide it later.


Really?
Well, almost all of us may want both of them. I think, however, Regex might break Swift's String ecosystem.

We must remember, in Swift, String is a Collection of Characters, not of Unicode.Scalars.

Let's take "café" as an example.

"café".precomposedStringWithCanonicalMapping consists of 4 scalars; U+0063, U+0061, U+0066, and U+00e9.
"café".decomposedStringWithCanonicalMapping consists of 5 scalars; U+0063, U+0061, U+0066, U+0065, and U+0301.

Although they have different number of scalars, they are regarded as the same string.

print("café".precomposedStringWithCanonicalMapping == "café".decomposedStringWithCanonicalMapping)
// -> Prints "true"

Furthermore, "é" and "e" are distinct Characters:

print("café".precomposedStringWithCanonicalMapping.contains("e"))
// -> Prints "false"

print("café".decomposedStringWithCanonicalMapping.contains("e"))
// -> Prints "false"

print("café".precomposedStringWithCanonicalMapping.contains("\u{00e9}"))
// -> Prints "true"

print("café".precomposedStringWithCanonicalMapping.contains("\u{0065}\u{0301}"))
// -> Prints "true"

print("café".decomposedStringWithCanonicalMapping.contains("\u{00e9}"))
// -> Prints "true"

print("café".decomposedStringWithCanonicalMapping.contains("\u{0065}\u{0301}"))
// -> Prints "true"

On the other hand, Regex /e+/ (or some other syntax like #/e+/# would be adopted) is interpreted into Pattern such as

OneOrMore("e" as Unicode.Scalar).capture { ... }

Imagine such a Regex world...

  • "café".decomposedStringWithCanonicalMapping.contains(/e+/) returns true. :scream:
  • "café".precomposedStringWithCanonicalMapping.contains(/é+/) may return false. :scream:

Even more, let's imagine split with Regex...

"café".decomposedStringWithCanonicalMapping.split(/e+/) returns ["\u{0063}\u{0061}\u{0066}", "\u{0301}"] :scream:
I want to insist that the string that consists of only one unicode scalar U+0301 (whose category is Mn) is invalid.

(You may not know but, these problems do big matters to Hiragana(ひらがな), Katakana(カタカナ), Kanji(漢字), and so on.)


The points are:

  • Regex is a pattern matcher for Unicode.ScalarView, not for String, at least in Swift world.
  • Only Pattern whose element is Character or a collection of Characters (such as Substring) should be a matcher for String
1 Like

I think result builders may need to be beefed up to support the proposed syntax, right? Otherwise you're looking at a lot of overloads. For example, from the beginning of the thread:

Group {
  "Date: "
  Date.FormatStyle.rfc1123.capture { HTTPHeaderField.date($0) }
  Newline()
}

Would this evaluate to something like?

Group<(Pattern<Void>, Pattern<Date>, Pattern<Void>)>

Somewhere, these Voids would need to be ignored so that we can work with just Date. But maybe, as you said, variadic generics will come with the tools needed to do just that?

This is called a "branch" by re_format(7) (*BSD, macOS) and by
regex(7) (Linux / GNU):

     A (modern) RE is one= or more non-empty= branches, separated by `|'.  It
     matches anything that matches one of the branches.

I agree that "Alternation" is misleading.

Dave

1 Like

Just to throw my two pence in. On the basic concept I think this is very much a +1, both regexes and patterns. In fact I only really have one concern: Alternation is a pretty obscure and unclear name. In fact if I read it my first thought is “these should be matched alternately” rather than “any of these should match”. Something like AnyOf might be a better option. Other than that, this looks like it could be a good use of a result builder (just make sure we have enough documentation of all the options, as I suspect there will be many :wink:)

4 Likes

When applied with grapheme-cluster semantics (the default if applied to String), it would match grapheme-cluster by grapheme-cluster and comparison obeys canonical equivalence. There are some features that might not be supported, e.g. generalizing some scalar properties to grapheme clusters. Resulting indices would be grapheme-cluster aligned.

When applied with scalar semantics (the default if applied to String.UnicodeScalarView), then it would have scalar-by-scalar matching with binary semantics. Resulting indices would be scalar-aligned.

TBD is application to one of the encoded views, but it will probably closely adhere to scalar semantics.

These would also likely have different character classes, one which maps to a Character property (and we should add new ones as part of this effort) and one which follows the normal ands and ors based on scalar property. Character classes would likely be customizable (e.g. POSIX mode, or even supply a custom one), mechanism TBD.

1 Like

If—as I understand it—this differs from how other languages handle regex matching, this could be surprising for folks in a very likely scenario which we are frankly inviting: applying a regular expression they find on Stack Overflow which has been reliably vetted to a Swift String, relying on the fact that Swift supports the same regular expression features as other languages, only to find they obtain a different result.

2 Likes

This is one of the main reasons I’m opposed to baking regex into the language itself: it is literally impossible for Swift to maintain compatibility with every form of regular expression out there.

A dedicated regular expression literal will never be capable of using any regular expression.

2 Likes

What B says here is really sensible.

Reading and writing fluency is important. Conciseness and familiarity
can contribute to fluency, but they don't really contribute to fluency
with regular expressions. Other considerations apply.

Consider that on an average UNIX-y system you may find about five
different dialects of regular expression. So when we speak of familiar
expressions, we're even not talking about one thing.

Dave

Regex compatibility is a non-goal. Ideally the Swift standard library picks a standard and uses it.

Which, again, is why it shouldn’t get special treatment in the language.

1 Like

Other languages differ from each other in how they handle regex matching, sometimes greatly. Definitions of standard character classes differ, some engines match scalars, some engines even support matching under canonical equivalence matching or even grapheme-cluster semantics (IIRC Perl 6 did). Often these are configured through adverbs, though there's usually a default (which differs among languages). A broader cross-language comparison of features and classes is forthcoming (CC @nnnnnnnn).

We're talking matching similarly to other engines, modulo Swift's more recent interpretation of Unicode correctness. This isn't that different than other concerns, such as how counting characters in String is different than other languages. It would probably be more surprising for a regex match to produce indices that are not Character-aligned or to exhibit different behavior than walking the contents yourself.

Figuring this stuff out may produce more evidence for the "don't do literals" camp. Doing the literals does have some procedural benefits in directing efforts and I'm interested in leveraging those benefits even if they don't land :-). Most of the core technical work is the same: literals would effectively desugar to a shared representation, and from there is where all the hard stuff happens. They should be tackled in a way that improves aspects of the stdlib, e.g. character classes come with more Character properties and corresponding API on Unicode.Scalar.

A potential source of evidence for the "do literals" camp could be when we see them used with generic collection algorithms (CC @timv) or subpatterns within the result builder. If we do literals via an ExpressibleByRegexLiteral protocol (and I'm drafting an argument that we should if we're doing them), then literals could conceivably be passed to NSRegularExpression for ICU semantics, JSCore for JavaScript(ish) semantics (and JIT), a PCRE wrapper library for PCRE-semantics, etc.

Note that doing PCRE-esque regex literals doesn't preclude matching literals or custom literals in the future. The literals also don't preclude parsing support inside Pattern or even choice of defaults inside Pattern (e.g. default Alternation/OneOf could be PEG-style ordered choice by default, and the refactoring action would supply the boolean parameter to choose the other kind). There is still a lot TBD in Pattern though.

6 Likes

I don’t see how that could be done without implementing custom literals in general too.

Also, attempting to process meaningful unicode via regex in a language that doesn't handle unicode like Swift does is a recipe for disaster, so I think the likelihood of encountering such regexes originally from another language is quite slim.

Just because it would break horribly doesn’t make it unlikely.

Training people to make context-free assumptions based on literal delimiters would make it especially likely.

I'll be glad if Character(grapheme-cluster) semantics can be applied.
On the other hand, I agree with @xwu's indication because I also feel such feature would confuse some folks especially from other languages.

For example, some libraries such as PCRE, Oniguruma, etc. that are widely used have an expression to specify Script name such as \p{Hiragana}.
A scalar has a property of Script, but Character doesn't.
Some people may want to write /\p{Hiragana}/, however, it cannot be applied to String, but only to String.UnicodeScalarView.
They might think "I just want to write an usual Regular Expression and want it to be applied to String."

I'm looking forward to the result. But, I think it is important that which semantics is the majority rather than the fact that there are differences.


We may have to discuss first whether such feature in Swift should be called Regex or another appropriate name, and whether or not it should have similar appearance with existing regular expressions in other languages. :nerd_face:

In the context of Pattern I am wondering how wide the scope of this pitch is. Is it the intention that we can describe e.g. the Swift grammar as a Pattern ?

How can PatternBuilder deal with recursive pattern? For example, how do you express Swift's multiLineComment which matches /* ... */ but also /* ... /* ... */ ... */ and not /* ... */ ... */?

1 Like

Something like this ?

func nestedComment() {
    "/*"
    Repeat(.anyCharacter)
    Optional { nestedComment() }
    Repeat(.anyCharacter)
    "*/" 
}
1 Like

Won't nestedComment cause infinite recursion? I'm worried about this point.

it only recurses when it encounters additional /* opening brackets, no?

3 Likes