Strings in Swift 4

Use the `extendedASCII` view to parse it. :^)

In all seriousness, though, maybe this just means you should be parsing at the `unicodeScalars` level or below when you're doing this kind of thing.
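To make the comma question concrete, here is a minimal sketch using only the standard library. The inputs are made up for illustration: `b` deliberately begins with a combining acute accent, so the comma and the accent fuse into one `Character`, and a `Character`-level split never sees a bare comma.

```swift
// Hypothetical inputs: `b` begins with a combining acute accent (U+0301).
let a = "x"
let b = "\u{301}y"
let joined = "\(a),\(b)"

// Character-level split: the comma and the accent form a single Character,
// so no element equals "," and nothing is split.
let comma: Character = ","
let byCharacter = joined.split(separator: comma)

// Scalar-level split: the comma is still its own Unicode scalar.
let commaScalar: Unicode.Scalar = ","
let byScalar = joined.unicodeScalars.split(separator: commaScalar).map { piece -> String in
    var text = ""
    text.unicodeScalars.append(contentsOf: piece)
    return text
}

print(byCharacter.count)  // one piece: the split found no separator
print(byScalar.count)     // two pieces: "x" and the accent followed by "y"
```

Splitting at the scalar level recovers both fields, at the price of handing back a second field that starts with an isolated combining character.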

···

On Jan 23, 2017, at 9:59 PM, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

You can do it, but it trades one semantic problem for a usability problem, without solving all the semantic problems: you end up with a.count + b.count == (a+b).count, sure, but you still don't satisfy the usual law of collections that (a+b).contains(b.first!) if b is non-empty, and now you've made it difficult to attach diacritics to base characters.

"Difficult".

What kind of processing would you suggest on a variable "b" in the expression "\(a),\(b)" to ensure that the result can be split with a comma?

--
Brent Royal-Gordon
Architechies

Now that I think more about it, tainting strings with a comparison behavior may be a bad idea:

  // not such a good idea, after all
  let foo = "foo".comparison(case: .insensitive, locale: .current)

Problems are:

- Two strings tainted with the same comparison behavior should be comparable, but two strings tainted with different behaviors should not. I'm not sure that the Swift type system allows that. In the current state of affairs, such restrictions are only available to *types*, which implies that comparisons should be types as well. But types can't be built at runtime from other types: there's no way for a runtime-chosen locale to be able to build a type.

- Tainting strings is an old and unsolved problem. If you're allowed to taint for comparison, why can't you taint for confidence (user input), needing escaping (for HTML, shell, SQL, JavaScript, etc.), and a whole bunch of other dimensions I haven't even thought of? We're already discussing different types for strings and substrings... Do we really want a combinatorial explosion of types?

- Should we restrict support for comparison behaviors to strings only? `Sequence.lexicographicallyPrecedes(_:by:)` is interested in comparison behaviors as well, and it extends much further than just Strings.

So all in all, I think that isolating comparison behaviors from Strings may be a much better idea.
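A minimal sketch of what "isolating comparison behaviors from Strings" could look like. `StringComparison` and its members are hypothetical names, not a proposed API; the point is only that the behavior is an ordinary value, so it can be chosen at runtime and handed to any API that takes a `by:` predicate.

```swift
// Hypothetical type: a comparison behavior that lives apart from any string.
struct StringComparison {
    // Maps a string to the key that is actually compared.
    let key: (String) -> String

    static let caseSensitive = StringComparison(key: { $0 })
    static let caseInsensitive = StringComparison(key: { $0.lowercased() })

    func areInIncreasingOrder(_ lhs: String, _ rhs: String) -> Bool {
        return key(lhs) < key(rhs)
    }
}

let names = ["bob", "alice", "Carol"]
let comparison = StringComparison.caseInsensitive  // could be chosen at runtime

// Works with sorting...
let sortedNames = names.sorted(by: comparison.areInIncreasingOrder)

// ...and with sequences of strings, not just single strings.
let precedes = ["a", "B"].lexicographicallyPrecedes(
    ["A", "c"], by: comparison.areInIncreasingOrder)
```

Nothing here taints the strings themselves, so the question of comparing two differently tainted strings never arises.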

Gwendal

···

On Jan 24, 2017, at 08:18, David Hart <david@hartbit.com> wrote:

Seems like a good solution to me.

On 24 Jan 2017, at 05:29, Gwendal Roué via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 24, 2017, at 04:31, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

The operands and sense of the comparison are kind of lost in all this garbage. You really want to see `foo < bar` in this code somewhere, but you don't.

Yeah, we thought about trying to build a DSL for that, but failed. I think the best possible option would be something like:

foo.comparison(case: .insensitive, locale: .current) < bar

The biggest problem is that you can build things like

   fu = foo.comparison(case: .insensitive, locale: .current)
   br = bar.comparison(case: .sensitive)
   fu < br // what does this mean?

We could even prevent such nonsense from compiling, but the cost in library API surface area is quite large.

Is it? I think we're talking, for each category of operation that can be localized like this:

* One type to carry an operand and its options.
* One method to construct this type.
* One alternate version of each operator which accepts an operand+options parameter. (I'm thinking it should always be the right-hand side, so the long stuff ends up at the end; Larry Wall noted this follows an "end-weight principle" in natural languages.)

I suspect that most solutions will at least require some sort of overload on the comparison operators, so this may be as parsimonious as we can get.
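The three pieces listed above can be sketched roughly as follows. All names are hypothetical, and only case-insensitivity is modelled (via `lowercased()`, with no locale support), so treat it as an illustration of the shape rather than a design.

```swift
// One type to carry an operand and its options (hypothetical name).
struct CaseInsensitiveOperand {
    let value: String
}

extension String {
    // One method to construct this type.
    func ignoringCase() -> CaseInsensitiveOperand {
        return CaseInsensitiveOperand(value: self)
    }
}

// One alternate version of the operator; the options ride on the right-hand side.
func < (lhs: String, rhs: CaseInsensitiveOperand) -> Bool {
    return lhs.lowercased() < rhs.value.lowercased()
}

let foo = "apple"
let bar = "Banana"

let plain = foo < bar                    // ordinary String comparison
let relaxed = foo < bar.ignoringCase()   // case-insensitive comparison
```

Because no `<` is declared between two wrapped operands, an expression like `fu < br` with two differently configured operands simply would not compile under this sketch.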

SQL has the `collate` keyword:

  -- sort users by email, case insensitive
  select * from users order by email collate nocase
  -- look for a specific email, in a case insensitive way
  select * from users where email = 'foo@example.com' collate nocase

It is used as a decorator that modifies an existing SQL snippet (a sort descriptor in the first query, a comparison in the second).

When designing an SQL builder for Swift, I chose the `nameColumn.collating(.nocase)` approach, because it allowed a common Swift syntax for both use cases:

  // sort users by email, case insensitive
  User.order(nameColumn.collating(.nocase))
  // look for a specific email, in a case insensitive way
  User.filter(nameColumn.collating(.nocase) == "foo@example.com")

Yes, it comes with extra operators so that nonsensical comparisons are avoided.

But it just works.
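For readers who want to see how such a decorator might be wired up, here is a toy sketch. `Column`, `CollatedColumn`, `Collation` and `Predicate` are invented for this example and are not the API of any real library; the sketch only generates SQL text.

```swift
// Hypothetical miniature query builder; not the API of any real library.
enum Collation: String {
    case binary = "BINARY"
    case nocase = "NOCASE"
}

struct Column {
    let name: String

    func collating(_ collation: Collation) -> CollatedColumn {
        return CollatedColumn(column: self, collation: collation)
    }
}

struct Predicate {
    let sql: String
    let arguments: [String]
}

struct CollatedColumn {
    let column: Column
    let collation: Collation

    // Used as a sort descriptor.
    var orderingSQL: String {
        return "\(column.name) COLLATE \(collation.rawValue)"
    }
}

// Used as a comparison; the value is kept apart as a bound argument.
func == (lhs: CollatedColumn, rhs: String) -> Predicate {
    return Predicate(
        sql: "\(lhs.column.name) = ? COLLATE \(lhs.collation.rawValue)",
        arguments: [rhs])
}

let emailColumn = Column(name: "email")
let ordering = emailColumn.collating(.nocase).orderingSQL
let filter = emailColumn.collating(.nocase) == "foo@example.com"
```

The extra `==` overload is the "extra operator" cost mentioned above: the collation travels with the column, and only comparisons that have been given a meaning are declared.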

Gwendal

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

Some obvious drawbacks of each approach:

1. Lots of work, probably hard to get right?
2. Only way to do this, afaik, is using lots of functional programming which might scare people off. Also probably it's hard to get performance as fast as 1.

FWIW, it is quite possible to do things very similar to parser combinators without functional programming. What you need is a way to create and compose small parser fragments, ideally an EDSL approaching something like EBNF that allows users to build a grammar out of the parser fragments, and a way to execute / interpret the resulting grammar during parsing.

The functional approach would not be the most idiomatic approach in Swift and as you note, it probably wouldn’t have the performance a more idiomatic approach could achieve (too much copying).

My intuition is that a hybrid 1 / 2 approach might be best: do as much as possible in the library and let the design drive new language enhancements where necessary.
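As a rough illustration of composing small parser fragments without a functional style, here is a sketch built from a protocol and a few structs. `Fragment`, `Literal`, `OneOrMore` and `Sequenced` are hypothetical names; a real design would also need alternation, captures and error reporting.

```swift
// Hypothetical protocol: a fragment tries to match at `index` and returns
// the index just past the match, or nil on failure.
protocol Fragment {
    func match(_ input: String, from index: String.Index) -> String.Index?
}

struct Literal: Fragment {
    let text: String

    func match(_ input: String, from index: String.Index) -> String.Index? {
        guard input[index...].hasPrefix(text) else { return nil }
        return input.index(index, offsetBy: text.count)
    }
}

struct OneOrMore: Fragment {
    let inner: Fragment

    func match(_ input: String, from index: String.Index) -> String.Index? {
        guard var end = inner.match(input, from: index) else { return nil }
        // Keep matching while progress is made (guards against empty matches).
        while let next = inner.match(input, from: end), next > end {
            end = next
        }
        return end
    }
}

struct Sequenced: Fragment {
    let parts: [Fragment]

    func match(_ input: String, from index: String.Index) -> String.Index? {
        var position = index
        for part in parts {
            guard let next = part.match(input, from: position) else { return nil }
            position = next
        }
        return position
    }
}

// The grammar is plain data: roughly "a" followed by one or more "b", then "c".
let grammar = Sequenced(parts: [
    Literal(text: "a"),
    OneOrMore(inner: Literal(text: "b")),
    Literal(text: "c"),
])

let input = "abbbc"
let matchedAll = grammar.match(input, from: input.startIndex) == input.endIndex
```

The grammar is built as a value and interpreted during parsing, which is the "create, compose, then execute" split described above.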

···

On Jan 24, 2017, at 2:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

3. No clear integrated way to do this
4. You still have to know how to write a parser.

I would think that 4. would be a good step forward, and 1/2 would definitely benefit from this.

Also, I'd love to have this functionality on sequence/collection types, rather than Strings. For example, it can be tremendously helpful to parse a binary format using proper parsers. Or maybe you would want to use an event-driven XML parser as "tokenizer" and parse that. Plenty of cool possibilities.

On Tue, Jan 24, 2017 at 8:46 AM, Russ Bishop via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com> wrote:

Taken from NSHipster <http://nshipster.com/nsregularexpression/>:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

-Joe

I fully agree. I think we could learn something from Perl 6 grammars. As PCREs are to languages without regex, Perl 6 grammars are to languages with PCREs.

A lot of really crappy user interfaces and bad tools come down to half-assed parsers; maybe we can do better? (Another argument against rushing it).

Russ


--
Chris Eidhof

I agree that being able to implement parsers in a nice way can be a huge
step forward in being really good at string processing.

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g.
https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing
an NSScanner alternative).

Some obvious drawbacks of each approach:

1. Lots of work, probably hard to get right?
2. Only way to do this, afaik, is using lots of functional programming
which might scare people off. Also probably it's hard to get performance as
fast as 1.

No, you don't need to use lots of
FP. https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L359
is a counterexample.

3. No clear integrated way to do this
4. You still have to know how to write a parser.

I would think that 4. would be a good step forward, and 1/2 would
definitely benefit from this.

Also, I'd love to have this functionality on sequence/collection types,
rather than Strings.

Yes, that's the plan.

···

on Tue Jan 24 2017, Chris Eidhof <chris-AT-eidhof.nl> wrote:

For example, it can be tremendously helpful to parse a binary format
using proper parsers. Or maybe you would want to use an event-driven
XML parser as "tokenizer" and parse that. Plenty of cool
possibilities.


--
-Dave

On Jan 23, 2017, at 20:45, Dave Abrahams <dabrahams@apple.com> wrote:

doesn't necessarily mean that ignoring that case is the right
thing to do. In fact, it means that Unicode won't do anything to
protect programs against these, and if Swift doesn't, chances are that
no one will. Isolated combining characters break a number of
expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)

Of course, this can be documented, but people want easy, and documentation is hard.
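A short sketch of how these expectations fail with an isolated combining character, using today's standard library spellings (`starts(with:)` and `contains(_:)` stand in for the `startsWith` and `find` written above).

```swift
let a = "e"
let b = "\u{301}"           // COMBINING ACUTE ACCENT on its own
let joined = a + b           // a single Character, rendered as "é"

// Character counts are not additive: 1 on the left, 1 + 1 on the right.
let countsAdd = joined.count == a.count + b.count

// The first Character of `joined` is "e" plus the accent, which is not "e".
let startsWithA = joined.starts(with: a)

// The isolated accent is not an element of the joined string either.
let containsFirstOfB = joined.contains(b.first!)

// At the Unicode scalar level the arithmetic does hold.
let scalarCountsAdd =
    joined.unicodeScalars.count == a.unicodeScalars.count + b.unicodeScalars.count
```

The first three results are `false` and the last is `true`, which is one way of seeing why the scalar view keeps coming up in this thread.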

Yes. Unfortunately they also want the ability to append a string
consisting of a combining character to another string and have it
append. And they don't want to be prevented from forming
valid-but-defective Unicode strings.

[…]

Can you suggest an alternative that doesn't violate the Unicode
standard and supports the expected use-cases, somehow?

I'm not sure I understand. Did we go from "this is a
degenerate/defective
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again>
case that we shouldn't bother with" to "this is a supported use case
that needs to work as-is"?

No. The Unicode standard says it's a valid string, so we shouldn't
prohibit it. The standard also says it's a corner case for which it
isn't worth making heroic efforts to create sensible semantics. It's
totally in keeping with the Unicode standards that we treat it as
proposed.

In a domain as complex as String processing, we need a guiding star,
and that star is the Unicode standard. I'm very reluctant to do
anything that clashes with the spirit of the standard.

I've never seen anyone start a string with a combining character on purpose,

It will occur as a byproduct of the process of attaching a diacritic
to a base character.

Unless you're in the business of writing a text editor, I don't know
if that's a common use case.

I don't either, to be honest. But the experts I consult with keep
reassuring me that it's an important one.

though I'm familiar with just one natural language that needs
combining characters. I can imagine that it could be a convenient
feature in other natural languages.

However, if Swift Strings are now designed for machine processing
and less for human language convenience, for me, it's easy enough to
justify a safe default in the context of machine processing: `a+b`
will not combine the end of `a` with the start of `b`. You could do
this by inserting a ◌ that `b` could combine with if necessary.

You can do it, but it trades one semantic problem for a usability
problem, without solving all the semantic problems: you end up with
a.count + b.count == (a+b).count, sure, but you still don't satisfy
the usual law of collections that (a+b).contains(b.first!) if b is
non-empty, and now you've made it difficult to attach diacritics to
base characters.

"Difficult".

What kind of processing would you suggest on a variable "b" in the
expression "\(a),\(b)" to ensure that the result can be split with a
comma?

I'm sorry, I don't understand what you're driving at, here.

···

on Mon Jan 23 2017, Félix Cloutier <swift-evolution@swift.org> wrote:

On Jan 22, 2017, at 9:54 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:

--
-Dave

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

+1 from me as well, I agree with Joe that Swift can learn a lot from Perl 6 grammars and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regex’s aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regex's: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.

This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see several reasons:

1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.

3. Performance: Many regex’s are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regex’s that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide and conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regex’s, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.

4. Pattern matching greatness: One of the most obnoxious/error prone aspects of regex’s in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off by one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:

   case /(let id: \d+): (let name: \w+)/: print(id); print(name)

Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.

5. Scanner/“Formatter” integration: Taking the above one step farther, we could have default patterns for known types (and make it extensible to user defined types of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):

   case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.
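The straw-man syntax above is not something the compiler accepts, so as a point of comparison here is a hand-written stand-in for the same match, using only the standard library. `matchRecord` is a hypothetical helper, and `\w` is approximated as letters, numbers and underscore. What it shares with the straw man is the result: named, typed bindings instead of numbered captures.

```swift
// Hand-written stand-in for the straw-man pattern
// `(let id: Int): (let name: \w+)`, matching the whole input.
func matchRecord(_ input: String) -> (id: Int, name: Substring)? {
    // Leading ASCII digits, converted to Int as part of the match.
    let digits = input.prefix(while: { $0.isASCII && $0.isNumber })
    guard let id = Int(digits) else { return nil }

    // The literal separator ": ".
    let rest = input[digits.endIndex...]
    guard rest.hasPrefix(": ") else { return nil }

    // An approximation of \w+ that must run to the end of the input.
    let name = rest.dropFirst(2).prefix(while: {
        $0.isLetter || $0.isNumber || $0 == "_"
    })
    guard !name.isEmpty, name.endIndex == input.endIndex else { return nil }

    return (id, name)
}

if let record = matchRecord("42: Chris") {
    print(record.id + 1)   // `id` is already an Int, no conversion step
    print(record.name)
}
```

Even at this size the helper is noticeably longer than the one-line pattern it imitates, which is roughly the argument being made for language support.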

Anyway, to summarize, I think that getting regex’s into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.
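To illustrate the performance point (3) above, this is roughly the state machine one would expect for the regular pattern /aa+b*/, written out by hand. It is a sketch of the idea only: it checks a whole-string match and says nothing about how a compiler would actually generate or optimize such code.

```swift
// Hand-compiled DFA for /aa+b*/, matching the whole input.
// start -a-> oneA -a-> manyA (accepting) -b-> trailingB (accepting)
enum MatchState {
    case start, oneA, manyA, trailingB
}

func matchesPattern(_ input: String) -> Bool {
    var state = MatchState.start
    for scalar in input.unicodeScalars {
        switch (state, scalar) {
        case (.start, "a"):
            state = .oneA
        case (.oneA, "a"), (.manyA, "a"):
            state = .manyA
        case (.manyA, "b"), (.trailingB, "b"):
            state = .trailingB
        default:
            return false   // no transition from this state: reject
        }
    }
    return state == .manyA || state == .trailingB
}
```

Each input scalar is examined exactly once and there is no backtracking, which is the property that makes the DFA route attractive for patterns that really are regular.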

-Chris

···

On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

The variadic use cases don't always have ... appearing inside angle
brackets. See “pack expansion” at
http://en.cppreference.com/w/cpp/language/parameter_pack
for example.

···

on Tue Jan 24 2017, Matt Whiteside <swift-evolution@swift.org> wrote:

On Jan 22, 2017, at 15:40, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
Right, the only sensible semantics for a one sided range with an
open end point is that it goes to the end of the collection. I see
a few different potential colors to paint this bikeshed with, all of
which would have the semantics “c[i..<c.endIndex]”:

1) Provide "c[i...]":
2) Provide "c[i..<]":
3) Provide both "c[i..<]" and "c[i...]":

Since all of these operations would have the same behavior, it comes down to subjective questions:

a) Do we want redundancy? IMO, no, which is why #3 is not very desirable.
b) Which is easier to explain to people? As you say, "i..< is shorthand for i..<endIndex" is nice and simple, which leans towards #2.

c) Which is subjectively nicer looking? IMO, #1 is much nicer
typographically. The ..< formulation looks like symbol soup,
particularly because most folks would not put a space before ].

There is no obvious winner, but to me, I tend to prefer #1. What do other folks think?
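For reference, the `c[i...]` spelling (option #1) is available in current Swift as a one-sided range, so the equivalence under discussion can be checked directly. The values below are arbitrary.

```swift
let c = [10, 20, 30, 40, 50]
let i = 2

// The explicit form and the one-sided form select the same elements.
let explicit = c[i..<c.endIndex]
let oneSided = c[i...]

// The same spelling works for String indices.
let s = "key=value"
let equalsSign = s.firstIndex(of: "=")!   // the example input is known to contain "="
let tail = s[equalsSign...]
```

Both slices run to the end of the collection, which is the semantics every option above was meant to have.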

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

--
-Dave

Woah, not to take this totally down a different path, but I thought
ContiguousArray was being deprecated?

···

On Wed, Jan 25, 2017 at 18:48 Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:

On Wed, Jan 25, 2017, at 04:54 PM, Ben Cohen wrote:
> I’m normally all in favor of the “don’t give people features, or they'll
> use them too much” argument but in this case I don’t think it applies.

That's not what I'm calling for at all. In fact, ContiguousArray and co.
are a great example of the problem I'm having here. After reading,
learning, profiling, and tuning, more than once on my teams has a
correct use of ContiguousArray been shot down by "why isn't this just
Array?" during code review. I've more than once had to babysit an angry
coworker or walk a confused student through why they have a variable of
type ArraySlice and not Array.
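For anyone who has not hit the ArraySlice surprise yet, a small sketch of where it comes from (arbitrary values):

```swift
let values = [10, 20, 30, 40, 50]

// Slicing an Array yields an ArraySlice, not another Array.
let middle = values[1...3]

// A slice keeps the indices of its base, so it does not start at 0.
let sliceStart = middle.startIndex
let firstElement = middle[middle.startIndex]

// Converting back to Array copies the elements and rebases indices at 0.
let copy = Array(middle)
let copyStart = copy.startIndex
```

A variable initialized from a subscript range therefore gets the slice type unless it is converted explicitly, which is usually the source of the confusion.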

I cannot emphasize more thoroughly that I want all this power (and
more!) to exist in the stdlib, but, and don't take this the wrong way,
the concern that I'm voicing is the team must balance the desire for a
perfect, beautiful, complete String model and how, in practice, it
actually gets used — a set of possibilities which includes "not at all"
and many varieties of "incorrectly".

Best,
  Zachary Waldowski
  zach@waldowski.me

Sorry, it looks like I left you hanging on this–luckily I found it when I was cleaning my inbox.

Overall, I believe the issue I have with the Swift String indexing
model is that indices cannot be operated on like an Int can–you can
multiply, divide, square, whatever you want on integer indices, while
String.Index only allows for what is essentially addition and
subtraction. Now, I get that these operations may not make sense on
most Strings; the existing API covers them well. However, there are
cases where these operations would be convenient, such as when
dealing with fixed-length records or tables of data; almost invariably
these are stored as ASCII. Thus, for these cases, I believe that there
should be some way to let String know that we are dealing with
something that is purely ASCII, so that it can allow us to use these
operations in an efficient manner (for example, having an optional
.asciiString property that conforms to RandomAccess; since I don’t
believe that extendedASCII does).

We could decide to make it random access at the cost of ruling out some
less-used but still-significant encodings of the string's backing store,
such as Shift-JIS. I personally am unconvinced that the marginal extra
convenience gained by random access to extendedASCII would be worth the
loss of the ability to operate directly on such encodings.

Such an API would keep the existing String paradigm, which is what is
needed most of the time, but allowing for random access when the data
can be guaranteed to support it.

We can easily make an ASCIIString that conforms to Unicode and provides
RandomAccessCollection conformance to all of its views. That random
access would not be preserved **in the type system** when ASCIIString is
wrapped in a String—the String's ExtendedASCIIView would only conform to
BidirectionalCollection—but the underlying efficiency characteristics
*would* be preserved, dynamically.
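Until something like an ASCIIString exists, one workaround for the fixed-length-record case is to copy the UTF-8 code units into an array and use integer arithmetic on that. The table layout, the `.` padding and the `record` helper below are all made up for the example, and the approach is only valid when the data really is pure ASCII.

```swift
// Hypothetical fixed-width table: each record is exactly `recordWidth`
// ASCII bytes, padded on the right with ".".
let table = "ALICE...BOB.....CAROL..."
let recordWidth = 8

// Array<UInt8> is integer-indexed and random access.
let bytes = Array(table.utf8)
precondition(bytes.allSatisfy { $0 < 0x80 }, "expected pure ASCII")

func record(_ n: Int) -> String {
    let start = n * recordWidth          // plain integer arithmetic
    var field = bytes[start ..< start + recordWidth]
    // Strip the "." padding (0x2E) from the end of the field.
    while field.last == 0x2E {
        field.removeLast()
    }
    return String(decoding: field, as: UTF8.self)
}

let second = record(1)
let recordCount = bytes.count / recordWidth
```

This costs one copy of the data up front, which is the kind of overhead a type-level guarantee of ASCII storage could avoid.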

I’m not sure if I’m getting my point across, please do let me know if
you don’t quite get what I mean.

I'm pretty sure I get what you mean. Let me know if you don't think so.

···

on Sat Feb 04 2017, Saagar Jha <swift-evolution@swift.org> wrote:

Saagar Jha

On Jan 20, 2017, at 5:55 PM, Ben Cohen <ben_cohen@apple.com> wrote:

On Jan 20, 2017, at 2:58 PM, Saagar Jha via swift-evolution <swift-evolution@swift.org> wrote:

Sorry if I wasn’t clear; I’m looking for indexing using Int, instead of using formIndex.

Question: why do you think integer indices are so desirable?

Integer indexing is simple, but also encourages anti-patterns
(tortured open-coded while loops with unexpected fencepost errors,
conflation of positions and distances into a single type) and our
goal should be to make most everyday higher-level operations, such
as finding/tokenizing, so easy that Swift programmers don’t feel
they need to resort to loops as often.

Examples where formIndex is so common yet so cumbersome that it
would be worth efforts to create integer-indexed versions of string
might be indicators of important missing features on our collection
or string APIs. So do pass them along.

(There are definitely known gaps in them today – slicing needs
improving as the manifesto mentions for things like slices from an
index to n elements later. Also, we need support for in-place
remove(where:) operations. But the more commonly needed cases we
know about that aren’t covered, the better)
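As a sketch of the "slice from an index to n elements later" case with the APIs that exist in current Swift (the input string is made up):

```swift
let text = "key=value;rest"

// Find a position by searching rather than by counting.
let equals = text.firstIndex(of: "=")!   // the example input is known to contain "="
let valueStart = text.index(after: equals)

// "From an index to n elements later", clamped to the end of the string.
let n = 5
let valueEnd = text.index(valueStart, offsetBy: n, limitedBy: text.endIndex)
    ?? text.endIndex
let value = text[valueStart..<valueEnd]

// The same slice without explicit index arithmetic.
let sameValue = text[valueStart...].prefix(n)

// In-place removal by predicate, spelled `removeAll(where:)` in current Swift.
var letters = Array("a1b2c3")
letters.removeAll(where: { $0.isNumber })
```

The `prefix(n)` form avoids the fencepost handling entirely, which is the kind of higher-level operation being asked for above.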


--
-Dave

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

Some obvious drawbacks of each approach:

1. Lots of work, probably hard to get right?
2. Only way to do this, afaik, is using lots of functional programming which might scare people off. Also probably it's hard to get performance as fast as 1.

FWIW, it is quite possible to do things very similar to parser combinators without functional programming. What you need is a way to create and compose small parser fragments, ideally an EDSL approaching something like EBNF that allows users to build a grammar out of the parser fragments, and a way to execute / interpret the resulting grammar during parsing.

The functional approach would not be the most idiomatic approach in Swift and as you note, it probably wouldn’t have the performance a more idiomatic approach could achieve (too much copying).

My intuition is that a hybrid 1 / 2 approach might be best: do as much as possible in the library and let the design drive new language enhancements where necessary.

Yes, the first program I wrote in Swift was a packrat parser (now used in several of my projects). It uses custom operators to let you write the grammar directly in swift, which is nice because it can be combined directly with code which gets run when a rule is matched.

One of the most time consuming parts of that was writing a custom string scanner. I would really like to see support for basic string scanning, including a standard library implementation of something like NSCharacterSet, built into the new String. Or rather, I would like features built in which make building a string scanner trivial and lightweight. Basically just a couple of methods saying things like: given an index and a character set, return a range including the contiguous group of characters in that set directly after the index (with options for case/diacritic sensitivity, etc…). With that and better slices, it should be fairly easy to build a lightweight/efficient scanner...
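The kind of method being asked for might look something like this. `rangeOfRun(from:where:)` is a hypothetical name, and a closure stands in for the character set, so options such as case or diacritic sensitivity are not modelled.

```swift
extension String {
    // Hypothetical helper: the range of the contiguous run of characters
    // satisfying `isMember`, starting at `start`. The range is empty when
    // the character at `start` is not in the set.
    func rangeOfRun(from start: Index,
                    where isMember: (Character) -> Bool) -> Range<Index> {
        var end = start
        while end < endIndex, isMember(self[end]) {
            formIndex(after: &end)
        }
        return start..<end
    }
}

// A scanner is then little more than a cursor moved from run to run.
let source = "width = 42px"
var cursor = source.startIndex

let identifier = source.rangeOfRun(from: cursor, where: { $0.isLetter })
cursor = identifier.upperBound

let spacing = source.rangeOfRun(from: cursor, where: { $0 == " " || $0 == "=" })
cursor = spacing.upperBound

let number = source.rangeOfRun(from: cursor, where: { $0.isNumber })
```

Slicing `source` with the returned ranges yields substrings without copying, which is where better slices would make such a scanner lightweight.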

···

On Jan 24, 2017, at 7:52 AM, Matthew Johnson via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 24, 2017, at 2:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

3. No clear integrated way to do this
4. You still have to know how to write a parser.

I would think that 4. would be a good step forward, and 1/2 would definitely benefit from this.

Also, I'd love to have this functionality on sequence/collection types, rather than Strings. For example, it can be tremendously helpful to parse a binary format using proper parsers. Or maybe you would want to use an event-driven XML parser as "tokenizer" and parse that. Plenty of cool possibilities.

On Tue, Jan 24, 2017 at 8:46 AM, Russ Bishop via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 2:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com <mailto:alvaradojoshua0@gmail.com>> wrote:

Taken from NSHipster <http://nshipster.com/nsregularexpression/&gt;:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

-Joe

I fully agree. I think we could learn something from Perl 6 grammars. As PCREs are to languages without regex, Perl 6 grammars are to languages with PCREs.

A lot of really crappy user interfaces and bad tools come down to half-assed parsers; maybe we can do better? (Another argument against rushing it).

Russ

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
Chris Eidhof

Hey Matthew,

Do you have an example of doing parser combinators without FP? I'd be very
interested.

I agree that being able to implement parsers in a nice way can be a huge
step forward in being really good at string processing.

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g.
https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by
providing an NSScanner alternative).

Some obvious drawbacks of each approach:

1. Lots of work, probably hard to get right?
2. Only way to do this, afaik, is using lots of functional programming
which might scare people off. Also probably it's hard to get performance as
fast as 1.

FWIW, it is quite possible to do things very similar to parser combinators
without functional programming. What you need is a way to create and
compose small parser fragments, ideally an EDSL approaching something like
EBNF that allows users to build a grammar out of the parser fragments, and
a way to execute / interpret the resulting grammar during parsing.

I'd love to see this. Do you mean "possible today" or "it would be
possible"?

One really big thing that I took away learning parser combinators is that
grammars are composable, whereas parsers themselves are not. Parser
combinators express grammars.

For example, when you have a Swift parser available, and you want to
"embed" it inside a Markdown parser, that's hard to do. Whereas composing
the Markdown grammar with the Swift grammar is a lot easier.

It'd be nice to have that composability.
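For readers who have not used parser combinators, here is a small sketch of the closure-based style that option 2 refers to. The type `Parser` and the helpers `many1`, `literal`, and `then` are names invented for this example; they are not taken from SwiftParsec or any other library.

```swift
/// A parser consumes a prefix of `input` and returns a value, or returns
/// nil and leaves `input` as it found it.
struct Parser<Output> {
    let run: (inout Substring) -> Output?
}

/// Matches one or more characters satisfying `predicate`.
func many1(_ predicate: @escaping (Character) -> Bool) -> Parser<Substring> {
    return Parser { input in
        let match = input.prefix(while: predicate)
        guard !match.isEmpty else { return nil }
        input.removeFirst(match.count)
        return match
    }
}

/// Matches exactly the characters of `text`.
func literal(_ text: String) -> Parser<Void> {
    return Parser { input in
        guard input.starts(with: text) else { return nil }
        input.removeFirst(text.count)
        return ()
    }
}

extension Parser {
    /// Sequencing: runs `self`, then `next`. If either fails, the input is
    /// restored and the combined parser fails.
    func then<B>(_ next: Parser<B>) -> Parser<(Output, B)> {
        return Parser<(Output, B)> { input in
            let saved = input
            guard let a = self.run(&input), let b = next.run(&input) else {
                input = saved
                return nil
            }
            return (a, b)
        }
    }
}

// Small fragments are composed into a grammar for lines like "42: Chris".
let digits = many1 { $0.isASCII && $0.isNumber }
let word = many1 { $0.isLetter }
let entryParser = digits.then(literal(": ")).then(word)

var input: Substring = "42: Chris"
let result = entryParser.run(&input)
// result?.0.0 is the digit run, result?.1 is the name
```

The point about composability shows up in the last few lines: `entryParser` is assembled from values that could equally be reused inside a larger grammar.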

Chris

···

On Tue, Jan 24, 2017 at 4:52 PM, Matthew Johnson <matthew@anandabits.com> wrote:

On Jan 24, 2017, at 2:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

The functional approach would not be the most idiomatic approach in Swift
and as you note, it probably wouldn’t have the performance a more idiomatic
approach could achieve (too much copying).

My intuition is that a hybrid 1 / 2 approach might be best: do as much as
possible in the library and let the design drive new language enhancements
where necessary.


--
Chris Eidhof

I've never seen anyone start a string with a combining character on purpose,

It will occur as a byproduct of the process of attaching a diacritic
to a base character.

Unless you're in the business of writing a text editor, I don't know
if that's a common use case.

I don't either, to be honest. But the experts I consult with keep
reassuring me that it's an important one.

Would it be possible that the Unicode experts' use cases are different from non-experts' use cases? It would make sense to put people who know a lot about Unicode in charge of handling complex Unicode operations, and that makes that use case very important to them, but through their hard work no one else needs to care about it.

though I'm familiar with just one natural language that needs
combining characters. I can imagine that it could be a convenient
feature in other natural languages.

However, if Swift Strings are now designed for machine processing
and less for human language convenience, for me, it's easy enough to
justify a safe default in the context of machine processing: `a+b`
will not combine the end of `a` with the start of `b`. You could do
this by inserting a ◌ that `b` could combine with if necessary.

You can do it, but it trades one semantic problem for a usability
problem, without solving all the semantic problems: you end up with
a.count + b.count == (a+b).count, sure, but you still don't satisfy
the usual law of collections that (a+b).contains(b.first!) if b is
non-empty, and now you've made it difficult to attach diacritics to
base characters.
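The two properties being discussed can be checked directly with today's concatenation behavior. This is only a sketch of the current semantics, not of the proposed ◌-inserting variant.

```swift
let a = "e"
let b = "\u{0301}"   // COMBINING ACUTE ACCENT on its own
let joined = a + b   // the accent attaches to the "e"

print(a.count + b.count)          // prints "2"
print(joined.count)               // prints "1"
print(joined.contains(b.first!))  // prints "false"
```

So with plain concatenation both the count identity and the `contains` law fail for this input.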

"Difficult".

What kind of processing would you suggest on a variable "b" in the
expression "\(a),\(b)" to ensure that the result can be split with a
comma?

I'm sorry, I don't understand what you're driving at, here.

Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 1) { $0 == "," }.

For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.

How do I fix it?
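The scenario can be reproduced as follows. The second half of the sketch splits at the `unicodeScalars` level, which is one possible answer along the lines suggested earlier in the thread; it is not presented as the agreed fix. It uses maxSplits: 1 so that at most two pieces come back.

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user input starting with a combining accent
let serialized = "\(a),\(b)"

// Character level: the comma and U+0301 form a single Character,
// so there is no "," Character to split on.
let byCharacter = serialized.split(separator: ",", maxSplits: 1)
print(byCharacter.count)      // prints "1"

// Unicode scalar level: the comma is still its own scalar.
let byScalar = serialized.unicodeScalars.split(separator: ",", maxSplits: 1)
let pieces = byScalar.map { String(String.UnicodeScalarView($0)) }
print(pieces.count)           // prints "2"
```

Splitting on scalars recovers both strings here, at the cost of working below the Character abstraction.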

Félix

···

On Jan 24, 2017, at 11:33, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

Another good reason to bake them into pattern matching is that it would make it easier to optimize when you want to match one of multiple patterns. Often, you don't want just one grammar, but possibly one of many, and it'd be nice if switch-ing over multiple string patterns led to a reasonably efficient DFA/NFA/rec-descent machine based on the needs of the grammars being matched.

-Joe
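For comparison, this is roughly what library-level pattern matching looks like today by overloading `~=`. The type `LinePrefix` and the function `classify` are hypothetical names for this sketch. Note that the cases are tried one after another, which is the series-of-checks behavior the message above hopes a built-in design could replace with a single state machine.

```swift
/// Hypothetical pattern type: matches strings that begin with `prefix`.
struct LinePrefix {
    let prefix: String
}

/// Lets `LinePrefix` values appear as `case` patterns when switching on a String.
func ~= (pattern: LinePrefix, value: String) -> Bool {
    return value.hasPrefix(pattern.prefix)
}

func classify(_ line: String) -> String {
    switch line {
    case LinePrefix(prefix: "#"):
        return "heading"
    case LinePrefix(prefix: "- "):
        return "list item"
    default:
        return "paragraph"
    }
}

print(classify("# Strings in Swift 4"))   // prints "heading"
```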

···

On Jan 24, 2017, at 9:35 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

+1 from me as well. I agree with Joe that Swift can learn a lot from Perl 6 grammars, and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regex’s aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regex's: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.

This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see several reasons:

1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.

3. Performance: Many regex’s are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regex’s that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide and conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regex’s, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.

4. Pattern matching greatness: One of the most obnoxious/error-prone aspects of regex’s in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off-by-one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:

   case /(let id: \d+): (let name: \w+)/: print(id); print(name)

Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.

5. Scanner/"Formatter" integration: Taking the above one step further, we could have default patterns for known types (and make it extensible to user-defined types, of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):

   case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.

Anyway, to summarize, I think that getting regex’s into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.
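For contrast with the straw man syntax in points 4 and 5, this is a sketch of how the same “42: Chris” match looks with Foundation's NSRegularExpression. The function name `parseEntry` is made up for the example. It shows the numbered groups, the manual string-to-Int conversion, and a malformed pattern being rejected only at run time (point 1).

```swift
import Foundation

/// Hypothetical helper: parses lines such as "42: Chris".
func parseEntry(_ line: String) -> (id: Int, name: String)? {
    // The pattern is an ordinary string, so it is only validated when this runs.
    let regex = try! NSRegularExpression(pattern: "^(\\d+): (\\w+)$")
    let whole = NSRange(line.startIndex..., in: line)
    guard let match = regex.firstMatch(in: line, options: [], range: whole),
          // Captures are addressed by number, in order of the parentheses.
          let idRange = Range(match.range(at: 1), in: line),
          let nameRange = Range(match.range(at: 2), in: line),
          // The "string to int" step that point 5 would like to eliminate.
          let id = Int(line[idRange])
    else { return nil }
    return (id: id, name: String(line[nameRange]))
}

let entry = parseEntry("42: Chris")
// entry?.id is 42 and entry?.name is "Chris"

// An unbalanced paren is reported when the initializer throws at run time.
let broken = try? NSRegularExpression(pattern: "(\\d+")
```

Swapping the two groups in the pattern would silently swap `id` and `name`, which is the maintenance hazard point 4 describes.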

Great stuff!

I wonder if built-in grammars (that's what Perl calls them) would work only
for things that are backed by string literals, or if it's worth the
time/effort to make them work for other kind of data as well. For example,
what if you write a grammar to tokenize (yielding some sequence of
`Token`s), and then want to parse those `Token`s? Or what if you want to
parse other kinds of data? Or should we try to make the 80% case work (only
provide grammar/regex literals for Strings) to avoid complexity?
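To show what the two-stage case could look like, here is a hand-written sketch: one function turns characters into `Token`s, and a second, generic over any collection of tokens, parses them. The `Token` cases, the tiny grammar, and both function names are invented for this example.

```swift
enum Token: Equatable {
    case number(Int)
    case plus
}

/// Stage 1: characters to tokens. Returns nil on an unexpected character.
func tokenize(_ source: String) -> [Token]? {
    var tokens: [Token] = []
    var rest = Substring(source)
    while let c = rest.first {
        if c == " " {
            rest.removeFirst()
        } else if c == "+" {
            tokens.append(.plus)
            rest.removeFirst()
        } else {
            let digits = rest.prefix(while: { $0.isASCII && $0.isNumber })
            // An empty digit run (unexpected character) makes Int(_:) fail.
            guard let value = Int(digits) else { return nil }
            tokens.append(.number(value))
            rest.removeFirst(digits.count)
        }
    }
    return tokens
}

/// Stage 2: tokens to a value, for the grammar  sum := number ("+" number)*
func parseSum<C: Collection>(_ tokens: C) -> Int? where C.Element == Token {
    var rest = tokens[...]
    guard case .number(let first)? = rest.popFirst() else { return nil }
    var total = first
    while !rest.isEmpty {
        guard case .plus? = rest.popFirst(),
              case .number(let next)? = rest.popFirst() else { return nil }
        total += next
    }
    return total
}

let sum = tokenize("1 + 20 + 3").flatMap { parseSum($0) }
print(sum as Any)   // prints "Optional(24)"
```

A grammar feature limited to string literals would cover stage 1 but not stage 2, which is the gap the question above is pointing at.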

I think it's worth looking at parser combinators. The really cool thing
about them is that they provide a few basic elements, and well-defined ways
to combine those elements. Specifically, it's interesting to look at the
problems surrounding parser combinators and seeing how we could do better:

- Error messages can be cryptic
- Because it's built on top of the language, you can't really do stuff like
binding subexpressions to local variables
- Once you allow something like flatMap in a parser combinator library, you
lose the possibility to optimize by rewriting the parser structure. This is
because the parser that comes out of the rhs of a flatMap can depend on the
parsed output of the lhs.

To me it seems like there's a lot of (exciting) work to be done to get this
right :).

···

On Wed, Jan 25, 2017 at 6:35 AM, Chris Lattner <sabre@nondot.org> wrote:


--
Chris Eidhof

AFAIK, we have no serious / concrete design proposal for variadic generics, so it remains unclear to me that we would syntactically follow the C++ model. The C++ model seems very influenced by its instantiation based approach.

In any case, it seems like an obviously good tradeoff to make the syntax for variadic generics more complicated if it makes one sided ranges more beautiful.

-Chris

···

On Jan 25, 2017, at 1:10 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

The variadic use cases don't always have ... appearing inside angle
brackets. See “pack expansion” in “Parameter pack (since C++11)” on
cppreference.com, for example.

I agree that being able to implement parsers in a nice way can be a
huge step forward in being really good at string processing.

+1 from me as well. I agree with Joe that Swift can learn a lot from
Perl 6 grammars, and we should take the time to do it right. Below I
say “regex” a lot, but I really mean a more general grammar system
(and even Perl 5 regex’s aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language
(e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner
alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t
preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regex's: the literal case
(e.g. /aa+b*/) and the dynamically computed case. The former is
really what we’re talking about here, the latter should obviously be
handled with some sort of Regex type which can be formed from string
values or whatever.

Ideally these patterns interoperate so that you can combine them.

Regex literals in an expression context should default to producing
the Regex type of course.

This means that when you pass a regex literal into an API call
(e.g. split on a string), it is really just creating something of
Regex type, and passing it down. If you wanted to introduce a parser
combinator DSL, you could totally plug it into the system, by having
the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see
several reasons:

1. Diagnostics: These will be heavily used by people, and you want to
have good compiler error and warning messages for them. You want to
be able to validate the regex at compile time, not wait until runtime
to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with
other languages, we should strive to make the basic regex syntax
familiar and obvious. I’d argue that /aa+b*/ should “just work” and
do the thing you think it does. Relying on a combinator library to do
that would be crazy.

3. Performance: Many regex’s are actually regular, so they can be
trivially compiled into DFAs. There is a well understood body of work
that can be simply dropped into the compiler to do this. Regex’s that
are not regular can be compiled into hybrid DFA/NFA+backtracking
schemes, and allowing a divide and conquer style of compiler
optimization to do this is the path that makes the most sense (to me
at least). Further, if you switch on a string and have a bunch of
cases that are regex’s, you’d obviously want the compiler to generate
a single state machine (like a lexer), not check each pattern in
series.

4. Pattern matching greatness: One of the most obnoxious/error prone
aspects of regex’s in many languages is that when you match a pattern,
the various matches are dumped into numbered result values (often by
the order of the parens in the pattern). This is totally barbaric: it
begs for off by one errors, often breaks as the program is being
evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local
variables. For example if you were trying to match something like
“42: Chris”, you should be able to use straw man syntax like this:

   case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match. Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.
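The Kleene star limitation is easy to observe with an existing engine. The sketch below uses Foundation's NSRegularExpression with a pattern invented for the example: the group inside the repetition matches three times, but only the final match is retained.

```swift
import Foundation

let list = "1,2,3"
// Group 1 sits inside a repetition, so it is matched once per list element.
let regex = try! NSRegularExpression(pattern: "^(?:(\\d+),?)*$")
let match = regex.firstMatch(in: list, options: [],
                             range: NSRange(list.startIndex..., in: list))

// Only the last iteration's capture is available afterwards.
let lastOnly = match
    .flatMap { Range($0.range(at: 1), in: list) }
    .map { String(list[$0]) }
print(lastOnly as Any)   // prints "Optional("3")"
```

A binding such as `let id` inside a starred group would therefore need to produce a collection of matches, or some richer structure, to avoid losing "1" and "2".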

Unless we were willing to dramatically expand how patterns work, this
requires baking support into the language.

I don't understand the "Unless" part of that sentence. It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

5. Scanner/"Formatter" integration: Taking the above one step further,
we could have default patterns for known types (and make it extensible
to user-defined types, of course). For example, \d+ is the obvious
pattern for integers, so you should be able to write the above like
this (in principle):

   case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this
eliminates the need for a “string to int” conversion after the pattern
is matched, because id would be bound as type Int already.

Yup.

Anyway, to summarize, I think that getting regex’s into the language
is really important and expect them to be widely used. As such, I
think it is worth burning compiler/language complexity to make them be
truly great in Swift.

Thanks for this post, Chris; it clarifies beautifully the reasons that
we're not trying to tackle regexes right away.

···

on Tue Jan 24 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

--
-Dave

Thanks for pointing this out to me.

-Matt

···

On Jan 25, 2017, at 13:10, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

on Tue Jan 24 2017, Matt Whiteside <swift-evolution@swift.org> wrote:

On Jan 22, 2017, at 15:40, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
Right, the only sensible semantics for a one sided range with an
open end point is that it goes to the end of the collection. I see
a few different potential colors to paint this bikeshed with, all of
which would have the semantics “c[i..<c.endIndex]”:

1) Provide "c[i...]":
2) Provide "c[i..<]":
3) Provide both "c[i..<]" and "c[i...]":

Since all of these operations would have the same behavior, it comes down to subjective questions:

a) Do we want redundancy? IMO, no, which is why #3 is not very desirable.
b) Which is easier to explain to people? As you say, "i..< is shorthand for i..<endIndex" is nice and simple, which leans towards #2.

c) Which is subjectively nicer looking? IMO, #1 is much nicer
typographically. The ..< formulation looks like symbol soup,
particularly because most folks would not put a space before ].

There is no obvious winner, but to me, I tend to prefer #1. What do other folks think?
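As a concrete reading of option 1, the sketch below assumes a compiler that accepts the postfix `...` spelling (to my knowledge, Swift 4 and later do) and checks that it means the same thing as the explicit half-open range.

```swift
let c = [10, 20, 30, 40]
let i = 2

let tail = c[i...]              // option 1 spelling
let explicit = c[i..<c.endIndex]

print(tail == explicit)         // prints "true"
print(Array(tail))              // prints "[30, 40]"
```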

I also prefer #1. It’s a shame that this conflicts with the potential
syntax for variadic generics. Is there really no way around this?
I’m showing my ignorance on compilers here, but couldn’t the fact that
variadic generics will be inside angle brackets be used to
distinguish?

The variadic use cases don't always have ... appearing inside angle
brackets. See “pack expansion” in “Parameter pack (since C++11)” on
cppreference.com, for example.

--
-Dave


Hey Matthew,

Do you have an example of doing parser combinators without FP? I'd be very interested.

Hey Chris, looks like Dave provided a pretty good example of the kind of thing I was talking about. Does that answer your questions?

···

On Jan 24, 2017, at 10:14 AM, Chris Eidhof <chris@eidhof.nl> wrote:

On Tue, Jan 24, 2017 at 4:52 PM, Matthew Johnson <matthew@anandabits.com> wrote:

On Jan 24, 2017, at 2:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:


On Tue, Jan 24, 2017 at 8:46 AM, Russ Bishop via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com> wrote:

Taken from NSHipster <http://nshipster.com/nsregularexpression/>:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

-Joe

I fully agree. I think we could learn something from Perl 6 grammars. As PCREs are to languages without regex, Perl 6 grammars are to languages with PCREs.

A lot of really crappy user interfaces and bad tools come down to half-assed parsers; maybe we can do better? (Another argument against rushing it).

Russ

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
Chris Eidhof

I've never seen anyone start a string with a combining character on purpose,

It will occur as a byproduct of the process of attaching a diacritic
to a base character.

Unless you're in the business of writing a text editor, I don't know
if that's a common use case.

I don't either, to be honest. But the experts I consult with keep
reassuring me that it's an important one.

Would it be possible that the Unicode experts' use cases are different from non-experts' use cases? It would make sense to put people who know a lot about Unicode in charge of handling complex Unicode operations, and that makes that use case very important to them, but through their hard work no one else needs to care about it.

though I'm familiar with just one natural language that needs
combining characters. I can imagine that it could be a convenient
feature in other natural languages.

However, if Swift Strings are now designed for machine processing
and less for human language convenience, for me, it's easy enough to
justify a safe default in the context of machine processing: `a+b`
will not combine the end of `a` with the start of `b`. You could do
this by inserting a ◌ that `b` could combine with if necessary.

You can do it, but it trades one semantic problem for a usability
problem, without solving all the semantic problems: you end up with
a.count + b.count == (a+b).count, sure, but you still don't satisfy
the usual law of collections that (a+b).contains(b.first!) if b is
non-empty, and now you've made it difficult to attach diacritics to
base characters.

"Difficult".

What kind of processing would you suggest on a variable "b" in the
expression "\(a),\(b)" to ensure that the result can be split with a
comma?

I'm sorry, I don't understand what you're driving at, here.

Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 1) { $0 == "," }, so that only the first comma acts as the separator.

For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.

How do I fix it?

One option (once Character acquires a unicodeScalars view similar to String’s) would be:

s.split { $0.unicodeScalars.first == "," }

There’s probably also a case to be made for a String-specific overload split(separator: UnicodeScalar) in which case you’d pass in the scalar of “,”. This would replicate similar behavior to languages that use code points as their “character”.
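As a rough illustration of the problem and of both suggestions, here is how this plays out in a current Swift toolchain, where `Character` does expose a `unicodeScalars` view. The variable names are made up for the example:

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user-controlled; starts with a combining acute accent
let joined = "\(a),\(b)"

// Character-level split: "," followed by U+0301 forms a single Character,
// and that Character is not equal to ",", so nothing is split.
let byCharacter = joined.split(maxSplits: 1, whereSeparator: { $0 == "," })

// Suggestion 1: look only at the first scalar of each Character.
let comma: Unicode.Scalar = ","
let byFirstScalar = joined.split(maxSplits: 1, whereSeparator: {
    $0.unicodeScalars.first == comma
})

// Suggestion 2: split at the scalar level, as a String-specific
// split(separator: UnicodeScalar) overload might do.
let byScalar = joined.unicodeScalars
    .split(separator: comma, maxSplits: 1)
    .map { String(String.UnicodeScalarView($0)) }
```

Note the difference between the two: the first variant treats the whole combined Character as the separator, so the stray accent is dropped along with the comma, while the scalar-level split hands `b` back unchanged.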

Alternatively, the right solution is to sanitize your input before the interpolation. Sanitization is a big topic, of which this is just one example. Essentially, you are asking for this kind of sanitization to be automatically applied for all range-replaceable operations on strings for this specific use case. I’m not sure that’s a good precedent to set. There are other ways in which Unicode can be abused that wouldn’t be covered, should we be sanitizing for those too on all low-level operations?
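One possible shape for such a sanitization step, sketched under the assumption that prefixing a dotted circle (U+25CC, the ◌ mentioned earlier in the thread) is an acceptable policy. `sanitizedForJoining` is a hypothetical helper, and the general-category check relies on `Unicode.Scalar.properties`, which needs Swift 5 or later:

```swift
// Returns `text` unchanged unless it starts with a combining mark, in which
// case a dotted circle is prepended so the mark has a base to attach to.
func sanitizedForJoining(_ text: String) -> String {
    guard let first = text.unicodeScalars.first else { return text }
    switch first.properties.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark:
        return "\u{25CC}" + text
    default:
        return text
    }
}

let userInput = "\u{0301}screw you"
let serialized = "hello," + sanitizedForJoining(userInput)
let fields = serialized.split(maxSplits: 1, whereSeparator: { $0 == "," })
```

This only covers the one abuse discussed here; it does nothing about other ways user-supplied Unicode can interfere with a delimiter-based format.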

This would also have pretty far-reaching implications across lots of different types and operations. For example, it’s not just on append:

var s = "pokemon"
let i = s.index(of: "m")!
// insert not just \u{0301} but also a separator?
s.insert("\u{0301}", at: i)
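For reference, a small sketch of what that insertion does without any separator logic, written against a current toolchain (where `index(of:)` is spelled `firstIndex(of:)`). The names `word`, `position` and `countBefore` are made up for the example:

```swift
var word = "pokemon"
let position = word.firstIndex(of: "m")!
let countBefore = word.count

// The combining accent is inserted before "m", so it attaches to the
// preceding "e" instead of forming a Character of its own.
word.insert("\u{0301}", at: position)

let countAfter = word.count
let scalarCount = Array(word.unicodeScalars).count
```

The Character count stays the same while the scalar count grows by one, which is exactly the behaviour a "safe" insert would have to special-case.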

It also would apply to in-place mutation on slices, given you can do this:

var a = [1,2,3,4]
a[0...2].append(99)
a // [1,2,3,99,4]

In this case, suppose you appended "e" to a slice that ended between "m" and "\u{0301}". The append operation on the substring would need to look into the outer string, see that the next scalar is a combining character, and then insert a spacer element in between them.

We would still need the ability to append modifiers to characters legitimately. If users could not do this by inserting/appending these modifiers into String, we would have to put this logic onto Character, which would need to have the ability to range-replace within its scalars, which adds a lot to the complexity of that type. It would also be fiddly to use, given that String is not going to conform to MutableCollection (because mutation on an element cannot be done in constant time). So you couldn’t do it in-place, i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.

···

On Jan 24, 2017, at 7:02 PM, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

Le 24 janv. 2017 à 11:33, Dave Abrahams via swift-evolution <swift-evolution@swift.org> a écrit :

Félix


I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.

+1 from me as well, I agree with Joe that Swift can learn a lot from Perl 6 grammars and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regexes aren’t regular :-)

There are a couple of possibilities that come to mind directly:

1. Build parsers right into the language (like Perl 6 grammars)
2. Provide a parser combinator language (e.g. https://github.com/davedufresne/SwiftParsec).
3. Rely on external tools like bison/yacc/etc.
4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).

My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:

There are two important use cases for regexes: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.

This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.

So why bless regex literals with language support at all? I see several reasons:

1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.

2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.

3. Performance: Many regexes are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regexes that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide and conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regexes, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.

4. Pattern matching greatness: One of the most obnoxious/error prone aspects of regexes in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off-by-one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf!

You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:

   case /(let id: \d+): (let name: \w+)/: print(id); print(name)

Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.

5. Scanner/“Formatter” integration: Taking the above one step further, we could have default patterns for known types (and make it extensible to user defined types of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):

   case /(let id: Int): (let name: \w+)/: print(id); print(name)

In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.
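For comparison with the straw man syntax, here is a sketch of the closest equivalent using the regex literals that later shipped in the standard library. It assumes Swift 5.7 or later (and a sufficiently recent OS runtime on Apple platforms); `entryPattern` and `describe` are made-up names. Captures are bound by name rather than by position, but they arrive as `Substring`, so the conversion to `Int` is still a separate step:

```swift
// Extended-delimiter regex literal with two named captures.
let entryPattern = #/(?<id>\d+): (?<name>\w+)/#

// Returns a short description of `line`, or "no match" if it does not have
// the shape "<digits>: <word>".
func describe(_ line: String) -> String {
    guard let match = line.wholeMatch(of: entryPattern) else { return "no match" }
    // Named captures replace numbered groups.
    let name = match.output.name
    // The id capture is a Substring; convert it explicitly.
    guard let id = Int(match.output.id) else { return "no match" }
    return "\(id) -> \(name)"
}
```

A call such as `describe("42: Chris")` binds `id` and `name` without counting parentheses, which addresses point 4; the typed binding from point 5 would remove the explicit `Int(...)` call.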

Anyway, to summarize, I think that getting regexes into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.

Another good reason to bake them into pattern matching is that it would make it easier to optimize when you want to match one of multiple patterns. Often, you don't want just one grammar, but possibly one of many, and it'd be nice if switch-ing over multiple string patterns led to a reasonably efficient DFA/NFA/rec-descent machine based on the needs of the grammars being matched.

+1 to the comments by Chris and Joe. We should do as much as we can in the library, but compile-time error detection, optimizability and syntactic convenience are important reasons to bake some support into the language itself.

···

On Jan 25, 2017, at 12:23 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 24, 2017, at 9:35 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution@swift.org> wrote:

-Joe
