Strings in Swift 4

mwhiteside.dev · January 27, 2017, 4:01am

Thanks for pointing this out to me.

-Matt

···

On Jan 25, 2017, at 13:10, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

on Tue Jan 24 2017, Matt Whiteside <swift-evolution@swift.org> wrote:

On Jan 22, 2017, at 15:40, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
Right, the only sensible semantics for a one sided range with an
open end point is that it goes to the end of the collection. I see
a few different potential colors to paint this bikeshed with, all of
which would have the semantics “c[i..<c.endIndex]”:

1) Provide "c[i...]":
2) Provide "c[i..<]":
3) Provide both "c[i..<]" and "c[i...]":

Since all of these operations would have the same behavior, it comes down to subjective questions:

a) Do we want redundancy? IMO, no, which is why #3 is not very desirable.
b) Which is easier to explain to people? As you say, "i..< is shorthand for i..<endIndex" is nice and simple, which leans towards #2.

c) Which is subjectively nicer looking? IMO, #1 is much nicer
typographically. The ..< formulation looks like symbol soup,
particularly because most folks would not put a space before ].

There is no obvious winner, but to me, I tend to prefer #1. What do other folks think?
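
For reference, option #1 is the spelling that later shipped in Swift 4 as one-sided ranges (SE-0172). A minimal sketch of the equivalence being discussed, with an illustrative array and index that are not from the thread:

```swift
// Illustrative collection and index; any Collection with Comparable indices works.
let c = [10, 20, 30, 40, 50]
let i = 2

// The explicit form that every option in the list is shorthand for.
let explicit = c[i..<c.endIndex]

// Option #1: a postfix "..." meaning "from i to the end of the collection".
let oneSided = c[i...]
```

Both slices cover the same elements, so the choice between the spellings is purely about syntax.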

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

The variadic use cases don't always have ... appearing inside angle
brackets. See “pack expansion” in the cppreference.com article
"Parameter pack (since C++11)", for example.

--
-Dave

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

anandabits · January 25, 2017, 12:51am

Hey Matthew,

Do you have an example of doing parser combinators without FP? I'd be very interested...

Hey Chris, looks like Dave provided a pretty good example of the kind of thing I was talking about. Does that answer your questions?

···

On Jan 24, 2017, at 10:14 AM, Chris Eidhof <chris@eidhof.nl> wrote:

On Tue, Jan 24, 2017 at 4:52 PM, Matthew Johnson <matthew@anandabits.com <mailto:matthew@anandabits.com>> wrote:

On Jan 24, 2017, at 2:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

Some obvious drawbacks of each approach:

1. Lots of work, probably hard to get right?
2. Only way to do this, afaik, is using lots of functional programming which might scare people off. Also probably it's hard to get performance as fast as 1.

FWIW, it is quite possible to do things very similar to parser combinators without functional programming. What you need is a way to create and compose small parser fragments, ideally an EDSL approaching something like EBNF that allows users to build a grammar out of the parser fragments, and a way to execute / interpret the resulting grammar during parsing.

I'd love to see this. Do you mean "possible today" or "it would be possible"?

One really big thing that I took away learning parser combinators is that grammars are composable, whereas parsers themselves are not. Parser combinators express grammars.

For example, when you have a Swift parser available, and you want to "embed" it inside a Markdown parser, that's hard to do. Whereas composing the Markdown grammar with the Swift grammar is a lot easier.

It'd be nice to have that composability.

Chris

The functional approach would not be the most idiomatic approach in Swift and as you note, it probably wouldn’t have the performance a more idiomatic approach could achieve (too much copying).

My intuition is that a hybrid 1 / 2 approach might be best: do as much as possible in the library and let the design drive new language enhancements where necessary.

3. No clear integrated way to do this
4. You still have to know how to write a parser.

I would think that 4. would be a good step forward, and 1/2 would definitely benefit from this.

Also, I'd love to have this functionality on sequence/collection types, rather than Strings. For example, it can be tremendously helpful to parse a binary format using proper parsers. Or maybe you would want to use an event-driven XML parser as "tokenizer" and parse that. Plenty of cool possibilities.

On Tue, Jan 24, 2017 at 8:46 AM, Russ Bishop via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 2:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com <mailto:alvaradojoshua0@gmail.com>> wrote:

Taken from NSHipster <http://nshipster.com/nsregularexpression/>:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

-Joe

I fully agree. I think we could learn something from Perl 6 grammars. As PCREs are to languages without regex, Perl 6 grammars are to languages with PCREs.

A lot of really crappy user interfaces and bad tools come down to half-assed parsers; maybe we can do better? (Another argument against rushing it).

Russ

--
Chris Eidhof

Ben_Cohen · January 25, 2017, 9:08pm

I've never seen anyone start a string with a combining character on purpose,

It will occur as a byproduct of the process of attaching a diacritic
to a base character.

Unless you're in the business of writing a text editor, I don't know
if that's a common use case.

I don't either, to be honest. But the experts I consult with keep
reassuring me that it's an important one.

Would it be possible that the Unicode experts' use cases are different from non-experts' use cases? It would make sense to put people who know a lot about Unicode in charge of handling complex Unicode operations, and that makes that use case very important to them, but through their hard work no one else needs to care about it.

though I'm familiar with just one natural language that needs
combining characters. I can imagine that it could be a convenient
feature in other natural languages.

However, if Swift Strings are now designed for machine processing
and less for human language convenience, for me, it's easy enough to
justify a safe default in the context of machine processing: `a+b`
will not combine the end of `a` with the start of `b`. You could do
this by inserting a ◌ that `b` could combine with if necessary.

You can do it, but it trades one semantic problem for a usability
problem, without solving all the semantic problems: you end up with
a.count + b.count == (a+b).count, sure, but you still don't satisfy
the usual law of collections that (a+b).contains(b.first!) if b is
non-empty, and now you've made it difficult to attach diacritics to
base characters.
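
The two collection laws mentioned here can be checked directly. This is a small sketch of the behavior without any inserted spacer, using an illustrative base letter and combining accent that are not from the thread:

```swift
let a = "e"
let b = "\u{0301}"      // COMBINING ACUTE ACCENT on its own: one Character
let joined = a + b      // the accent attaches to "e", giving a single Character

// Counts of the parts versus the count of the whole.
let partsCount = a.count + b.count
let joinedCount = joined.count

// The "usual law": the concatenation should contain b's first element.
let containsFirstOfB = joined.contains(b.first!)
```

Here the parts count 2 Characters but the concatenation counts 1, and the concatenation no longer contains `b.first!` as an element.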

"Difficult".

What kind of processing would you suggest on a variable "b" in the
expression "\(a),\(b)" to ensure that the result can be split with a
comma?

I'm sorry, I don't understand what you're driving at, here.

Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 1) { $0 == "," }.

For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.

How do I fix it?

One option (once Character acquires a unicodeScalars view similar to String’s) would be:

s.split { $0.unicodeScalars.first == "," }

There’s probably also a case to be made for a String-specific overload split(separator: UnicodeScalar) in which case you’d pass in the scalar of “,”. This would replicate similar behavior to languages that use code points as their “character”.
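
A rough sketch of the problem and the two workarounds described above, assuming a toolchain where Character exposes a unicodeScalars view. The variable names are illustrative only:

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user-controlled; starts with a combining accent
let joined = "\(a),\(b)"

// Character-level split: "," plus U+0301 is one grapheme cluster, so nothing matches.
let byCharacter = joined.split(separator: ",")

// Workaround 1: look only at the first scalar of each Character.
let comma: Unicode.Scalar = ","
let byFirstScalar = joined.split { $0.unicodeScalars.first == comma }

// Workaround 2: split the unicode scalar view, which treats "," as its own element.
let byScalar = joined.unicodeScalars.split(separator: comma)
```

Note the difference between the two workarounds: the first discards the whole separator Character, accent included, while the second leaves the accent at the start of the second piece.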

Alternatively, the right solution is to sanitize your input before the interpolation. Sanitization is a big topic, of which this is just one example. Essentially, you are asking for this kind of sanitization to be automatically applied for all range-replaceable operations on strings for this specific use case. I’m not sure that’s a good precedent to set. There are other ways in which Unicode can be abused that wouldn’t be covered; should we be sanitizing for those too on all low-level operations?

This would also have pretty far-reaching implications across lots of different types and operations. For example, it’s not just on append:

var s = "pokemon"
let i = s.index(of: "m")!
// insert not just \u{0301} but also a separator?
s.insert("\u{0301}", at: i)
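
To make the effect of that insert concrete, here is the same snippet with the counts recorded. It is only a sketch, and it uses firstIndex(of:), the later name for index(of:):

```swift
var s = "pokemon"
let countBefore = s.count

// Insert a combining acute accent immediately before "m".
let i = s.firstIndex(of: "m")!
s.insert("\u{0301}", at: i)

// The accent attaches to the preceding "e" instead of becoming a new element.
let countAfter = s.count
```

With no spacer inserted, the string becomes "pokémon" and its Character count does not change.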

It also would apply to in-place mutation on slices, given you can do this:

var a = [1,2,3,4]
a[0...2].append(99)
a // [1,2,3,99,4]

In this case, suppose you appended "e" to a slice that ended between "m" and "\u{0301}". The append operation on the substring would need to look into the outer string, see that the next scalar is a combining character, and then insert a spacer element in between them.

We would still need the ability to append modifiers to characters legitimately. If users could not do this by inserting/appending these modifiers into String, we would have to put this logic onto Character, which would need to have the ability to range-replace within its scalars, which adds a lot to the complexity of that type. It would also be fiddly to use, given that String is not going to conform to MutableCollection (because mutation on an element cannot be done in constant time). So you couldn’t do it in-place, i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.

···

On Jan 24, 2017, at 7:02 PM, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

Le 24 janv. 2017 à 11:33, Dave Abrahams via swift-evolution <swift-evolution@swift.org> a écrit :

Félix

anandabits · January 25, 2017, 8:08pm

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

+1 from me as well, I agree with Joe that Swift can learn a lot from Perl 6 grammars and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regexes aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regexes: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.

This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see several reasons:

1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.

3. Performance: Many regexes are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regexes that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide-and-conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regexes, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.

4. Pattern matching greatness: One of the most obnoxious/error-prone aspects of regexes in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off-by-one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.

5. Scanner/“Formatter” integration: Taking the above one step further, we could have default patterns for known types (and make it extensible to user defined types of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):

case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.
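
For comparison, here is a minimal sketch of the numbered-capture style that point 4 criticizes, written against Foundation's NSRegularExpression. The input and pattern follow the "42: Chris" example; the variable names are illustrative, and this is only one way to write it:

```swift
import Foundation

let input = "42: Chris"
let regex = try! NSRegularExpression(pattern: "(\\d+): (\\w+)", options: [])
let whole = NSRange(input.startIndex..<input.endIndex, in: input)

var id: Int? = nil
var name: String? = nil

// Captures are addressed by position: 1 is the digits, 2 is the word.
// Reordering the groups in the pattern silently changes what these indices mean.
if let match = regex.firstMatch(in: input, options: [], range: whole),
   let idRange = Range(match.range(at: 1), in: input),
   let nameRange = Range(match.range(at: 2), in: input) {
    id = Int(input[idRange])          // separate "string to int" step
    name = String(input[nameRange])
}
```

The positional indices and the explicit Int conversion are the two pieces that the straw-man `let id: Int` binding would make unnecessary.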

Anyway, to summarize, I think that getting regexes into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.

Another good reason to bake them into pattern matching is that it would make it easier to optimize when you want to match one of multiple patterns. Often, you don't want just one grammar, but possibly one of many, and it'd be nice if switch-ing over multiple string patterns led to a reasonably efficient DFA/NFA/rec-descent machine based on the needs of the grammars being matched.

+1 to the comments by Chris and Joe. We should do as much as we can in the library, but compile-time error detection, optimizability and syntactic convenience are important reasons to bake some support into the language itself.

···

On Jan 25, 2017, at 12:23 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 24, 2017, at 9:35 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:
On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

-Joe

Chris_Eidhof · January 25, 2017, 6:52pm

One concern I have with implementing this through pattern matching is that
it might make it really hard to optimize. For example, consider the
following hypothetical fragment:

switch string {
case "x" (let remainder):
return arc4random() % 2 == 0 ? parse1() : parse2()

This would parse a string starting with x, and then continue parsing with
either parse1 or parse2.

This is strictly more powerful (it allows you to implement parsers that are
"dynamically" constructed depending on the input), and therefore less
optimizable.

However, IANACE (I am not a compiler engineer), so hopefully my pessimism
is unjust.

···

On Wed, Jan 25, 2017 at 7:23 PM, Joe Groff <jgroff@apple.com> wrote:

--
Chris Eidhof

Chris_Lattner · January 26, 2017, 4:36am

There are two important use cases for regex's: the literal case
(e.g. /aa+b*/) and the dynamically computed case. The former is
really what we’re talking about here, the latter should obviously be
handled with some sort of Regex type which can be formed from string
values or whatever.

Ideally these patterns interoperate so that you can combine them.

Yes, as I mentioned, the regex literal should form something of the Regex type. Any API that takes a Regex would work with them.

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

Please specify some more details about what the problem is, because I’m not seeing it. Lots of existing regex implementations work with “(…)*” patterns by binding to the last value. From my perspective, this is pragmatic, useful, and proven. What is your specific concern?

When you say “real” parsers, you’re implicitly insulting the “unreal” parsers, without explaining what the “real” ones are, or why they matter. Please provide specific use cases that would be harmed by this approach.

Unless we were willing to dramatically expand how patterns work, this
requires baking support into the language.

I don't understand the "Unless" part of that sentence. It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

I’m not a believer in this approach, but someone could argue that we should allow arbitrary user-defined syntactic expansion of the pattern grammar, similar to how we allow syntactic expansion of the expression grammar through operator definitions. This is what I meant by “dramatically expanding” how patterns work.

Anyway, to summarize, I think that getting regex’s into the language
is really important and expect them to be widely used. As such, I
think it is worth burning compiler/language complexity to make them be
truly great in Swift.

Thanks for this post, Chris; it clarifies beautifully the reasons that
we're not trying to tackle regexes right away.

… and hopefully why it is important to consider them first as a language feature, not a crazy DSL you’d try to build in the library through a ton of overloaded operators.

-Chris

···

On Jan 25, 2017, at 7:32 PM, Dave Abrahams <dabrahams@apple.com> wrote:

Douglas_Gregor · January 26, 2017, 5:49am

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

The variadic use cases don't always have ... appearing inside angle
brackets. See “pack expansion” in the cppreference.com article
"Parameter pack (since C++11)", for example.

AFAIK, we have no serious / concrete design proposal for variadic generics, so it remains unclear to me that we would syntactically follow the C++ model. The C++ model seems very influenced by its instantiation based approach.

This aspect of the C++ model of variadic generics (variadic templates) is not influenced by the instantiation-based approach. The “...” suffix meaning “expand this pattern for each element in the parameter packs” was a concise syntax using an existing C++ token (...) that didn’t have any grammatical ambiguities [*] and composed well with constrained generics. I do think the model translates well to Swift, but obviously it’s not the only model we should consider.

In any case, it seems like an obviously good tradeoff to make the syntax for variadic generics more complicated if it makes one sided ranges more beautiful.

Given that we have “...” already as a binary operator for ranges, it would probably be confusing for a “...” postfix operator to mean something totally different than “a range without a specified end point”, even if “...” as a postfix operator is pointless (i.e., means the same thing as “..<”).

Frankly, I still think it was a mistake to introduce “...” as a binary operator for ranges, because closed ranges aren’t useful enough to burn an operator that was *already* overloaded with variadic functions.

- Doug

[*] There’s one ambiguity, but it was easily addressed.

···

On Jan 25, 2017, at 5:49 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 25, 2017, at 1:10 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

Russ_Bishop1 · January 26, 2017, 7:20am

If you haven’t seen Perl 6 grammars, I highly suggest taking a look.

My off-the-cuff straw-man attempt to apply this to Swift. (Fair warning, I’m no expert on Perl 6 grammars, just an enthusiast)

grammar FooGrammar {
    token top {
        <string> |
        <number>
    }
    token string {
        pattern {
            '\"' <literal>\w+ '\"'
        }
        matched {
            // literal is in scope and holds the match
            // manipulate the AST
            return SomeTypeAdoptingAstNodeProtocol(x: y, value: literal)
        }
    }
    token number {
        pattern {
            /regex here/
        }
        matched {
        }
    }
}

// produces an AST:
let ast = FooGrammar.parse("some input")

The key is that you use regexes for the basic tokens but you combine tokens more or less along BNF lines. When a token matches, the matched() block executes and you can manipulate the resulting AST. Presumably the AST types would come from the standard library.

That is also a stepping stone to a hygienic macro system where the macro invocation can receive an AST representing the syntax inside the macro. You could even require a macro to define a grammar so the compiler could at least validate the syntax of a macro is well-formed.

Russ

···

On Jan 25, 2017, at 7:27 AM, Chris Eidhof <chris@eidhof.nl> wrote:

Great stuff!

I wonder if built-in grammars (that's what Perl calls them) would work only for things that are backed by string literals, or if it's worth the time/effort to make them work for other kinds of data as well. For example, what if you write a grammar to tokenize (yielding some sequence of `Token`s), and then want to parse those `Token`s? Or what if you want to parse other kinds of data? Or should we try to make the 80% case work (only provide grammar/regex literals for Strings) to avoid complexity?

I think it's worth looking at parser combinators. The really cool thing about them is that they provide a few basic elements, and well-defined ways to combine those elements. Specifically, it's interesting to look at the problems surrounding parser combinators and seeing how we could do better:

- Error messages can be cryptic
- Because it's built on top of the language, you can't really do stuff like binding subexpressions to local variables
- Once you allow something like flatMap in a parser combinator library, you lose the possibility to optimize by rewriting the parser structure. This is because the parser that comes out of the rhs of a flatMap can depend on the parsed output of the lhs.
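
The flatMap point can be illustrated with a very small combinator sketch. Everything here (the Parser type, digit, letter, repeated, lengthPrefixed) is hypothetical and written only for this example; it is not taken from SwiftParsec or any other library:

```swift
// A parser consumes a prefix of the input and returns a value, or nil on failure.
struct Parser<A> {
    let run: (inout Substring) -> A?
}

extension Parser {
    func map<B>(_ transform: @escaping (A) -> B) -> Parser<B> {
        Parser<B> { input in self.run(&input).map(transform) }
    }

    // The second parser is built from the first parser's *output*,
    // so its structure is unknown until parsing is already under way.
    func flatMap<B>(_ next: @escaping (A) -> Parser<B>) -> Parser<B> {
        Parser<B> { input in
            let original = input
            guard let a = self.run(&input), let b = next(a).run(&input) else {
                input = original   // backtrack on failure
                return nil
            }
            return b
        }
    }
}

let digit = Parser<Int> { input in
    guard let c = input.first, c.isASCII, let value = c.wholeNumberValue else { return nil }
    input.removeFirst()
    return value
}

let letter = Parser<Character> { input in
    guard let c = input.first, c.isLetter else { return nil }
    input.removeFirst()
    return c
}

// Runs `p` exactly `count` times, failing (and backtracking) otherwise.
func repeated<A>(_ p: Parser<A>, count: Int) -> Parser<[A]> {
    Parser<[A]> { input in
        let original = input
        var results: [A] = []
        for _ in 0..<count {
            guard let a = p.run(&input) else { input = original; return nil }
            results.append(a)
        }
        return results
    }
}

// Length-prefixed text: a digit n followed by exactly n letters.
let lengthPrefixed = digit.flatMap { n in repeated(letter, count: n) }.map { String($0) }

var input: Substring = "3abcde"
let parsed = lengthPrefixed.run(&input)
```

Because `repeated(letter, count: n)` is only constructed once `n` has been parsed, nothing can inspect or rewrite the full grammar ahead of time, which is the optimization barrier described in the last bullet.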

To me it seems like there's a lot of (exciting) work to be done to get this right :).

On Wed, Jan 25, 2017 at 6:35 AM, Chris Lattner <sabre@nondot.org <mailto:sabre@nondot.org>> wrote:
On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

+1 from me as well. I agree with Joe that Swift can learn a lot from Perl 6 grammars and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regex’s aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regex's: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.

This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see several reasons:

1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.

3. Performance: Many regex’s are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regex’s that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide and conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regex’s, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.

4. Pattern matching greatness: One of the most obnoxious/error prone aspects of regex’s in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off by one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.

5. Scanner/“Formatter” integration: Taking the above one step farther, we could have default patterns for known types (and make it extensible to user defined types of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):

case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.
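
For contrast with reasons 4 and 5, this is roughly what the same “42: Chris” match looks like with numbered capture groups, using Foundation's NSRegularExpression. It is only a sketch of the status quo being criticized; the variable names are illustrative. Note the positional group indices and the separate string-to-int step.

```swift
import Foundation

let input = "42: Chris"
// The pattern is an ordinary string, so syntax errors surface only at runtime.
let regex = try! NSRegularExpression(pattern: "(\\d+): (\\w+)")
let whole = NSRange(input.startIndex..., in: input)

var id: Int? = nil
var name: String? = nil

if let match = regex.firstMatch(in: input, range: whole),
   // Groups are addressed by position: 1 is the digits, 2 is the word.
   let idRange = Range(match.range(at: 1), in: input),
   let nameRange = Range(match.range(at: 2), in: input) {
    id = Int(input[idRange])            // manual string-to-int conversion
    name = String(input[nameRange])
}
```

Reordering or adding a parenthesized group in the pattern silently changes which index means what, which is the off-by-one hazard described above.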

Anyway, to summarize, I think that getting regex’s into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.
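
As an illustration of the performance argument (reason 3), the literal /aa+b*/ is regular, so it can be lowered to a small state machine. The hand-written DFA below is a sketch of what a compiler could in principle generate; the `State` enum and the function names are hypothetical, and it matches the whole input rather than searching within it.

```swift
// States of a DFA for the pattern aa+b*
enum State {
    case start   // nothing read yet
    case oneA    // read a single "a"
    case manyA   // read two or more "a"s (accepting)
    case bs      // read the "a"s followed by at least one "b" (accepting)
}

// One transition; nil means there is no edge and the input is rejected.
func step(_ state: State, _ ch: Character) -> State? {
    switch (state, ch) {
    case (.start, "a"): return .oneA
    case (.oneA, "a"), (.manyA, "a"): return .manyA
    case (.manyA, "b"), (.bs, "b"): return .bs
    default: return nil
    }
}

// Runs the DFA over the whole input in a single pass with no backtracking.
func matchesAAPlusBStar(_ input: String) -> Bool {
    var state = State.start
    for ch in input {
        guard let next = step(state, ch) else { return false }
        state = next
    }
    return state == .manyA || state == .bs
}
```

Each character is inspected exactly once, which is the property that makes the "single state machine for a switch over several regexes" idea attractive.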

-Chris

--
Chris Eidhof

Jon_Hull · January 26, 2017, 11:49am

I had a realization a few weeks ago that regexes with capture groups actually correspond to a type, where successive capture groups form a tuple and recursive ones form arrays of the capture groups they recurse (and ‘?’ conveniently forms an optional). For example the type for the regex above would be (Int,String). Those types could be pretty hairy for complex regexes though.

let (id,name) = /(\d+): (\w+)/

I’m not sure how pleasant it would be for complex regexes, but at least the type checker would keep you honest. Just wanted to throw it out there in case it jogs an idea in someone else…

Thanks,
Jon

···

On Jan 25, 2017, at 7:32 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

4. Pattern matching greatness: One of the most obnoxious/error prone
aspects of regex’s in many languages is that when you match a pattern,
the various matches are dumped into numbered result values (often by
the order of the parens in the pattern). This is totally barbaric: it
begs for off by one errors, often breaks as the program is being
evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

Chris_Lattner · January 28, 2017, 12:55am

I wonder if built-in grammars (that's what Perl calls them) would work only for things that are backed by string literals, or if it's worth the time/effort to make them work for other kinds of data as well. For example, what if you write a grammar to tokenize (yielding some sequence of `Token`s), and then want to parse those `Token`s? Or what if you want to parse other kinds of data? Or should we try to make the 80% case work (only provide grammar/regex literals for Strings) to avoid complexity?

I don’t have a strong opinion on this matter. I can definitely see the elegance of being able to pattern match non-string data with the regex features. Certainly things like parsing fixed packet formats coming off a network seem like good candidates for this sort of thing.

That said, it isn’t clear to me that this would be widely-used enough to be worth the complexity cost. If it just drops into the existing model (e.g. the string model works on sequences of bytes, so this just falls out of it) then that would be great. If it requires massive complexity for little gain, then probably not. We can see when it comes time to actually design and build this functionality out and lazily evaluate the decision based on what we know then.

I think it's worth looking at parser combinators.

Yep, I’m a fan, they are definitely very nice in many cases!

To me it seems like there's a lot of (exciting) work to be done to get this right :).

Totally. Let’s start by getting the essential bones of the String design right :-)

-Chris

···

On Jan 25, 2017, at 7:27 AM, Chris Eidhof <chris@eidhof.nl> wrote:

mwhiteside.dev · January 28, 2017, 1:47am

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

AFAIK, we have no serious / concrete design proposal for variadic generics, so it remains unclear to me that we would syntactically follow the C++ model. The C++ model seems very influenced by its instantiation based approach.

In any case, it seems like an obviously good tradeoff to make the syntax for variadic generics more complicated if it makes one sided ranges more beautiful.

-Chris

Thanks for sharing your thoughts on this. It’s hard to disagree with your point.

My only other thought is that there is some elegance to sharing the same syntax at compile time and runtime for the conceptually similar operation of “give me the rest of the items in the list”.

-Matt

Felix_Cloutier1 · January 26, 2017, 3:54am

Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 2) { $0 == "," }.

For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.

How do I fix it?

One option (once Character acquires a unicodeScalars view similar to String’s) would be:

s.split { $0.unicodeScalars.first == "," }

My two main objections to this are that (1) this drops the acute accent (although that's probably an acceptable sacrifice in the face of purposefully bad input); and (2) it's annoying to me that you have to drop below the Character level to safely perform a task this simple.

There’s probably also a case to be made for a String-specific overload split(separator: UnicodeScalar) in which case you’d pass in the scalar of “,”. This would replicate similar behavior to languages that use code points as their “character”.
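
The behavior under discussion can be reproduced with a short sketch. It assumes a recent Swift toolchain in which Character exposes a unicodeScalars view; the variable names are illustrative, and maxSplits is set to 1 here because that yields at most two pieces.

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user input that starts with a combining acute accent
let serialized = "\(a),\(b)"

let commaCharacter: Character = ","
let commaScalar: Unicode.Scalar = ","

// Character level: the comma and the accent form a single Character,
// so no element equals "," and the string is not split.
let byCharacter = serialized.split(separator: commaCharacter, maxSplits: 1)

// Character level, comparing only the first scalar: the split happens,
// but the accent is discarded together with the separator.
let byFirstScalar = serialized
    .split(maxSplits: 1) { $0.unicodeScalars.first == commaScalar }
    .map { String($0) }

// Scalar level: the comma is found and the accent stays at the front of `b`.
let byScalar = serialized.unicodeScalars
    .split(separator: commaScalar, maxSplits: 1)
    .map { String(String.UnicodeScalarView($0)) }
```

The scalar-level variant is the only one of the three that round-trips both strings, at the cost of dropping below the Character level.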

The way they're being built, I'm leaning towards the opinion that Strings wouldn't be the right tool to serialize anything. Unfortunately, in a world of XML, JSON, YAML, Markdown and such, they're also a very obvious choice.

Alternatively, the right solution is to sanitize your input before the interpolation. Sanitization is a big topic, of which this is just one example. Essentially, you are asking for this kind of sanitization to be automatically applied for all range-replaceable operations on strings for this specific use case. I’m not sure that’s a good precedent to set. There are other ways in which Unicode can be abused that wouldn’t be covered, should we be sanitizing for those too on all low-level operations?

I agree that the general Unicode abuse problem cannot be solved. The novel thing here is that Swift is one of the first languages to bring grapheme-cluster-aware strings to a wide audience, and doing so, it introduces a class of bugs that have essentially no precedent. I feel like this should worry people a little bit. People have been able to abuse RTL overrides for several years now, and we found that it's a problem to users but machines are pretty good at dealing with it. However, if you'll allow me to dramatize, these are characters that basically eat their neighbor.

This would also have pretty far-reaching implications across lots of different types and operations. For example, it’s not just on append:

var s = "pokemon"
let i = s.index(of: "m")!
// insert not just \u{0301} but also a separator?
s.insert("\u{0301}", at: i)

It also would apply to in-place mutation on slices, given you can do this:

var a = [1,2,3,4]
a[0...2].append(99)
a // [1,2,3,99,4]

In this case, suppose you appended "e" to a slice that ended between "m" and "\u{0301}". The append operation on the substring would need to look into the outer string, see that the next scalar is a combining character, and then insert a spacer element in between them.
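
Both behaviors referred to above can be checked with a short sketch. It uses firstIndex(of:), the current spelling of the index(of:) call shown earlier, and the variable names are illustrative.

```swift
var s = "pokemon"
let i = s.firstIndex(of: "m")!
let countBefore = s.count

// The accent is inserted just before "m", that is, directly after "e".
s.insert("\u{0301}", at: i)

// It combines with the preceding "e": one more scalar, but the same number of Characters.
let countAfter = s.count
let scalarCount = s.unicodeScalars.count

// In-place mutation through a slice writes back into the outer array.
var a = [1, 2, 3, 4]
a[0...2].append(99)
```

The inserted element did not become a new Character; it changed an existing one, which is the "eats its neighbor" effect described earlier in the thread.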

We would still need the ability to append modifiers to characters legitimately. If users could not do this by inserting/appending these modifiers into String, we would have to put this logic onto Character, which would need to have the ability to range-replace within its scalars, which adds a lot to the complexity of that type. It would also be fiddly to use, given that String is not going to conform to MutableCollection (because mutation on an element cannot be done in constant time). So you couldn’t do it in-place, i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.

I'd argue that no one should feel particularly great about writing code points to a collection that exposes Characters in return. Have any alternatives around modifying a Unicode scalar view been explored? I don't have any problem with making it impossible to add a Character-that-is-not-a-Character to a String's Character view if you can opt in to Unicode scalars when you mean it.

Félix

···

Le 25 janv. 2017 à 13:08, Ben Cohen <ben_cohen@apple.com> a écrit :

dabrahams · January 26, 2017, 7:15pm

There are two important use cases for regex's: the literal case
(e.g. /aa+b*/) and the dynamically computed case. The former is
really what we’re talking about here, the latter should obviously be
handled with some sort of Regex type which can be formed from string
values or whatever.

Ideally these patterns interoperate so that you can combine them.

Yes, as I mentioned, the regex literal should form something of the
Regex type. Any API that takes a Regex would work with them.

But I think we want distinct types for some of these patterns so they
can capture compile-time knowledge. That's why

https://github.com/apple/swift/blob/main/test/Prototypes/PatternMatching.swift#L32

    //===--- Niceties ---------------------------------------------------------===//

    enum MatchResult<Index: Comparable, MatchData> {
    case found(end: Index, data: MatchData)
    case notFound(resumeAt: Index?)
    }

    protocol Pattern {
      associatedtype Element : Equatable
      associatedtype Index : Comparable
      associatedtype MatchData = ()

      func matched<C: Collection>(atStartOf c: C) -> MatchResult<Index, MatchData>
      where C.Index == Index, C.Element == Element
    }

    extension Pattern {
      func found<C: Collection>(in c: C) -> (extent: Range<Index>, data: MatchData)?
      where C.Index == Index, C.Element == Element

has a Pattern protocol. If you mean “type” in a looser sense that
admits protocols, then we are aligned.

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

Please specify some more details about what the problem is, because
I’m not seeing it. Lots of existing regex implementations work with
"(…)*” patterns by binding to the last value. From my perspective,
this is pragmatic, useful, and proven. What is your specific concern?

My specific concern is that merely capturing the last match is
inadequate to many real parsing jobs.

When you say “real” parsers, you’re implicitly insulting the “unreal"
parsers,

No offense intended, truly. As a PL guy I assumed you'd know what I
meant. As you know, regexes aren't sufficiently powerful to handle
parsing languages like Swift, and even if they were, retaining only the
last match of a capture would be insufficient to go from recognizing
valid input (parsing) to semantic analysis.

without explaining what the “real” ones are, or why they matter.
Please provide specific use cases that would be harmed by this
approach.

I'm talking about the kinds of parsers made possible by Perl 6 grammars,
which can be recursive. Some examples:

parsing - Example of Perl 6 grammar with operator precedence rules - Stack Overflow

Unless we were willing to dramatically expand how patterns work, this
requires baking support into the language.

I don't understand the "Unless" part of that sentence. It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

I’m not a believer in this approach, but someone could argue that we
should allow arbitrary user-defined syntactic expansion of the pattern
grammar, similar to how we allow syntactic expansion of the expression
grammar through operator definitions. This is what I meant by
“dramatically expanding” how patterns work.

Oh, I see what you mean. You need to bake *something* into the
language. That thing could either be regex support or it could be
something more general, like a macro system that allowed beautiful regex
support to be built in a library. Well, I'd love to have the latter,
but wouldn't be willing to sacrifice much quality-of-user-experience
in order to get it. It would have to be roughly indistinguishable
from the end-user's point-of-view.

···

on Wed Jan 25 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

On Jan 25, 2017, at 7:32 PM, Dave Abrahams <dabrahams@apple.com> wrote:

--
-Dave

tali · January 26, 2017, 3:06pm

Well, the regex would have a type of its own, but it would probably have a
generic Result parameter, which would include your `(Int,String)`.

You only get your (id,name) pair when you match some input against that regex.

E.g. something like:

     let r: Regex<(Int, String)> = /(\d+): (\w+)/
     switch input {
     case r(let id, let name): print("\(id): \(name)")
     }
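
The idea of a regex type carrying a generic result parameter can be approximated without any language support. The sketch below is hypothetical: `TypedPattern` and `idAndName` are made-up names, and the matcher is written by hand to stand in for /(\d+): (\w+)/ (it leans on Int's own parsing, so it is only an approximation of \d+).

```swift
// Stand-in for a typed regex: Output records what a successful match produces.
struct TypedPattern<Output> {
    let match: (String) -> Output?
}

// Hand-written approximation of /(\d+): (\w+)/ yielding (Int, String).
let idAndName = TypedPattern<(Int, String)> { input in
    guard let colon = input.firstIndex(of: ":") else { return nil }
    let idText = input[..<colon]
    let rest = input[input.index(after: colon)...]
    // The id must parse as an integer and be followed by ": ".
    guard let id = Int(idText), rest.hasPrefix(" ") else { return nil }
    let name = rest.dropFirst()
    // The name must be a non-empty run of word characters.
    guard !name.isEmpty,
          name.allSatisfy({ $0.isLetter || $0.isNumber || $0 == "_" }) else { return nil }
    return (id, String(name))
}

// The tuple type is checked at compile time; no numbered groups are involved.
let parsed = idAndName.match("42: Chris")
if let (id, name) = parsed {
    print("\(id): \(name)")
}
```

What this cannot do is bind `id` and `name` inside a `case` pattern, which is the part that would need language support.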

···

Am 2017-01-26 12:49, schrieb Jonathan Hull via swift-evolution:

I had a realization a few weeks ago that regexes with capture groups
actually correspond to a type, where successive capture groups form a
tuple and recursive ones form arrays of the capture groups they
recurse (and ‘?’ conveniently forms an optional). For example the
type for the regex above would be (Int,String). Those types could be
pretty hairy for complex regexes though.

let (id,name) = /(\d+): (\w+)/

--
Martin

dabrahams · January 28, 2017, 3:59am

It's actually much simpler to do for ordinary collections than it is for
Strings, for which you want to optimize pattern matching based on
knowledge of string-specific properties such as encoding, normalization,
etc. The Collection matching API is pretty straightforward
(https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L32).
Extending it to be able to take advantage of auxiliary encoding
information is what's going to add complexity.

···

on Fri Jan 27 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

On Jan 25, 2017, at 7:27 AM, Chris Eidhof <chris@eidhof.nl> wrote:

I wonder if built-in grammars (that's what Perl calls them) would
work only for things that are backed by string literals, or if it's
worth the time/effort to make them work for other kinds of data as
well. For example, what if you write a grammar to tokenize (yielding
some sequence of `Token`s), and then want to parse those `Token`s?
Or what if you want to parse other kinds of data? Or should we try
to make the 80% case work (only provide grammar/regex literals for
Strings) to avoid complexity?

I don’t have a strong opinion on this matter. I can definitely see
the elegance of being able to pattern match non-string data with the
regex features. Certainly things like parsing fixed packet formats
coming off a network seem like good candidates for this sort of thing.

That said, it isn’t clear to me that this would be widely-used enough
to be worth the complexity cost. If it just drops into the existing
model (e.g. the string model works on sequences of bytes, so this just
falls out of it) then that would be great. If it requires massive
complexity for little gain, then probably not. We can see when it
comes time to actually design and build this functionality out and
lazily evaluate the decision based on what we know then.

--
-Dave

anandabits · January 26, 2017, 7:27pm

There are two important use cases for regex's: the literal case
(e.g. /aa+b*/) and the dynamically computed case. The former is
really what we’re talking about here, the latter should obviously be
handled with some sort of Regex type which can be formed from string
values or whatever.

Ideally these patterns interoperate so that you can combine them.

Yes, as I mentioned, the regex literal should form something of the
Regex type. Any API that takes a Regex would work with them.

But I think we want distinct types for some of these patterns so they
can capture compile-time knowledge. That's why
https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L32
has a Pattern protocol. If you mean “type” in a looser sense that
admits protocols, then we are aligned.

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

Please specify some more details about what the problem is, because
I’m not seeing it. Lots of existing regex implementations work with
"(…)*” patterns by binding to the last value. From my perspective,
this is pragmatic, useful, and proven. What is your specific concern?

My specific concern is that merely capturing the last match is
inadequate to many real parsing jobs.

When you say “real” parsers, you’re implicitly insulting the “unreal"
parsers,

No offense intended, truly. As a PL guy I assumed you'd know what I
meant. As you know, regexes aren't sufficiently powerful to handle
parsing languages like Swift, and even if they were, retaining only the
last match of a capture would be insufficient to go from recognizing
valid input (parsing) to semantic analysis.

without explaining what the “real” ones are, or why they matter.
Please provide specific use cases that would be harmed by this
approach.

I'm talking about the kinds of parsers made possible by Perl 6 grammars,
which can be recursive. Some examples:

parsing - Example of Perl 6 grammar with operator precedence rules - Stack Overflow

Unless we were willing to dramatically expand how patterns work, this
requires baking support into the language.

I don't understand the "Unless" part of that sentence. It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

I’m not a believer in this approach, but someone could argue that we
should allow arbitrary user-defined syntactic expansion of the pattern
grammar, similar to how we allow syntactic expansion of the expression
grammar through operator definitions. This is what I meant by
“dramatically expanding” how patterns work.

Oh, I see what you mean. You need to bake *something* into the
language. That thing could either be regex support or it could be
something more general, like a macro system that allowed beautiful regex
support to be built in a library. Well, I'd love to have the latter,
but wouldn't be willing to sacrifice much quality-of-user-experience
in order to get it. It would have to be roughly indistinguishable
from the end-user's point-of-view.

+1. I really like the directions the team is thinking about!

···

On Jan 26, 2017, at 1:15 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:
on Wed Jan 25 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

On Jan 25, 2017, at 7:32 PM, Dave Abrahams <dabrahams@apple.com> wrote:

--
-Dave

dgoldsmith · January 27, 2017, 1:02am

To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in more “traditional” regex implementations. See:

ICU User Guide | ICU Documentation

For example:

Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the (?i) flag within a pattern itself. Unicode case insensitive matching is complicated by the fact that changing the case of a string may change its length. See FAQ - Character Properties, Case Mappings and Names for more information on Unicode casing operations.

Examples:
  • pattern "fussball" will match "fußball" or "fussball"
  • pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball".
  • pattern "ß" will find occurrences of "ss" or "ß"
  • pattern "s+" will not find "ß"

and

Flag w (UREGEX_UWORD): Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

If regexes are going to be used on human language text, these are all important considerations.

Debbie

···

On Jan 26, 2017, at 11:15 AM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

on Wed Jan 25 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

On Jan 25, 2017, at 7:32 PM, Dave Abrahams <dabrahams@apple.com> wrote:

Chris_Lattner · January 28, 2017, 12:51am

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

Please specify some more details about what the problem is, because
I’m not seeing it. Lots of existing regex implementations work with
"(…)*” patterns by binding to the last value. From my perspective,
this is pragmatic, useful, and proven. What is your specific concern?

My specific concern is that merely capturing the last match is
inadequate to many real parsing jobs.

Sure, depending on how the grammar is defined, the compiler will know when multiple matches are possible. If multiple matches are possible, it is straightforward to bind them into an array of results instead of a single scalar result.
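
The difference between "bind the last value" and "bind an array" can be seen with Foundation's NSRegularExpression today. This is a sketch with illustrative names, and the array is obtained with a second, simpler pattern rather than from the repeated group itself.

```swift
import Foundation

let input = "1,2,3"
let whole = NSRange(input.startIndex..., in: input)

// A capture group under a repetition keeps only its final iteration.
let repeated = try! NSRegularExpression(pattern: "(?:(\\d+),?)+")
var lastCapture: String? = nil
if let match = repeated.firstMatch(in: input, range: whole),
   let range = Range(match.range(at: 1), in: input) {
    lastCapture = String(input[range])
}

// Recovering every occurrence needs a separate pass over the input.
let single = try! NSRegularExpression(pattern: "\\d+")
let all: [Int] = single.matches(in: input, range: whole).compactMap { m in
    Range(m.range, in: input).flatMap { Int(input[$0]) }
}
```

A grammar-aware compiler could see that the group sits under a repetition and give it an array type directly, removing the need for the second pass.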

Unless we were willing to dramatically expand how patterns work, this
requires baking support into the language.

I don't understand the "Unless" part of that sentence. It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

I’m not a believer in this approach, but someone could argue that we
should allow arbitrary user-defined syntactic expansion of the pattern
grammar, similar to how we allow syntactic expansion of the expression
grammar through operator definitions. This is what I meant by
“dramatically expanding” how patterns work.

Oh, I see what you mean. You need to bake *something* into the
language. That thing could either be regex support or it could be
something more general, like a macro system that allowed beautiful regex
support to be built in a library. Well, I'd love to have the latter,
but wouldn't be willing to sacrifice much quality-of-user-experience
in order to get it. It would have to be roughly indistinguishable
from the end-user's point-of-view.

Right. The standard approach we take in Swift has been to start with something baked into the language, then generalize it out to the stdlib over time if there is a reason to. I’d love to see the various magic around Optional be accessible to other types, for example.

-Chris

···

On Jan 26, 2017, at 11:15 AM, Dave Abrahams <dabrahams@apple.com> wrote:

dabrahams · January 27, 2017, 3:31am

To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in
more “traditional” regex implementations. See:

ICU User Guide | ICU Documentation

For example:

Case insensitive matching is specified by the
UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
(?i) flag within a pattern itself. Unicode case insensitive
matching is complicated by the fact that changing the case of a
string may change its length. See
FAQ - Character Properties, Case Mappings and Names for more information on
Unicode casing operations.

Examples:
  • pattern "fussball" will match "fußball" or "fussball"
  • pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball"
  • pattern "ß" will find occurrences of "ss" or "ß"
  • pattern "s+" will not find "ß"

These all appear to be issues for users to consider rather than design
issues for the regex implementation. Am I mistaken?
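The length change that the ICU guide mentions is easy to see with the standard library alone. This is only a small sketch of the casing behavior, not of how any regex engine handles it:

```swift
let lower = "fußball"
let upper = lower.uppercased()

// "ß" has no single-character uppercase form here, so it expands to "SS".
print(upper)                     // FUSSBALL
print(lower.count, upper.count)  // 7 8
```

Because the two strings differ in length, a case-insensitive matcher cannot simply compare them position by position.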

and

w UREGEX_UWORD Controls the behavior of \b in a pattern. If set,
word boundaries are found according to the definitions of word found
in Unicode UAX 29, Text Boundaries. By default, word boundaries are
identified by means of a simple classification of characters as
either “word” or “non-word”, which approximates traditional regular
expression behavior. The results obtained with the two options can
be quite different in runs of spaces and other non-word characters.

If regexes are going to be used on human language text, these are all
important considerations.

Yup, but I don't see how they affect the design, other than that maybe
matching on a LocalizedString type would use UREGEX_UWORD by default.

···

on Thu Jan 26 2017, Deborah Goldsmith <swift-evolution@swift.org> wrote:

--
-Dave

anandabits · January 28, 2017, 1:32am

Right. The standard approach we take in Swift has been to start with something baked into the language, then generalize it out to the stdlib over time if there is a reason to. I’d love to see the various magic around Optional be accessible to other types, for example.

-Chris

+1. I love how Swift takes this approach. I would love to see Optional sugar expanded such that types like Result, for example, could take advantage of it.