[Pitch] Raw mode string literals

(John Holdsworth) #1

Hello S/E,

I’d like to put forward a perhaps rather banal change to the Swift lexer
primarily intended to make entering regular expression patterns easier.

https://github.com/DoubleSpeak/swift-evolution/blob/master/proposals/NNNN-raw-string-escaping.md

With a raw literal, i.e. a string prefixed by “r”, the \ character would have no
special role at all and would be processed like any other character, e.g.

    r"\n\(var)\n" == "\\n\\(var)\\n"

    r"\?\y\=" == "\\?\\y\\="

    r"c:\windows\system32" == "c:\\windows\\system32"

    r"""
        Line One\
        Line Two\
        """ == "Line One\\\nLine Two\\"

I had considered another version of the proposal where known escapes
were still processed, but it proved too difficult to reason about exactly what was
contained in the string.
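For readers who want to check the equivalences listed above: the r"…" prefix is only a pitch, so the sketch below substitutes the `#"…"#` delimiter form that later shipped in Swift 5. That substitution is an assumption made to get something runnable; it is not the syntax pitched here. Each assertion compares a raw literal against the conventionally escaped string.

```swift
// Assumes Swift 5 or later; #"…"# stands in for the pitched r"…" prefix.

// Backslash escapes and \( ) interpolation are left untouched in a raw literal.
let newlineAndInterpolation = #"\n\(var)\n"#
assert(newlineAndInterpolation == "\\n\\(var)\\n")

// Sequences that are not valid escapes in an ordinary literal are fine here.
let regexLike = #"\?\y\="#
assert(regexLike == "\\?\\y\\=")

// Windows-style path without doubled backslashes.
let windowsPath = #"c:\windows\system32"#
assert(windowsPath == "c:\\windows\\system32")

// Multi-line form: a trailing backslash is an ordinary character,
// not a line continuation.
let multiline = #"""
    Line One\
    Line Two\
    """#
assert(multiline == "Line One\\\nLine Two\\")
```

The ordinary literals on the right-hand side need twice as many backslashes to describe the same characters, which is the readability cost the pitch is aimed at.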

There is an example toolchain available for testing:

http://johnholdsworth.com/swift-LOCAL-2017-11-23-a-osx.tar.gz

Can we shepherd this minor additive change into Swift 4.1?

John

[Pitch v2]: Raw strings and SE-0200
SE-0200: "Raw" mode string literals
(Chris Lattner) #2

On Nov 23, 2017, at 9:43 AM, John Holdsworth via swift-evolution <swift-evolution@swift.org> wrote:

> I’d like to put forward a perhaps rather banal change to the Swift lexer
> primarily intended to make entering regular expression patterns easier.
>
> https://github.com/DoubleSpeak/swift-evolution/blob/master/proposals/NNNN-raw-string-escaping.md

Hi John,

A lot of people (myself included) are interested in getting regexes into Swift. I don’t think there is consensus on how to do this, but I’m personally a fan of adding first-class support with the classical /a[b*]c/ syntax. Until we figure out that path forward for regexes, I think they aren’t the right motivation for this proposal.

-Chris

(Brent Royal-Gordon) #3

1. Even in our shining pattern matching future—a future which I, for one, am eager to hasten—we will still need to interoperate with NSRegularExpression and other Perl 5-compatible regex engines.

2. Code generation.

3. Windows-style paths.

4. Doesn’t LaTeX use backslashes?

5. Etc.

I think the Motivation section undersells this proposal. Regexes are a strong short-run use case, but in the long run, we’ll need this for other things. In both cases, though, raw literals will be a useful addition to the language, improving the clarity of Swift code much like multiline literals already have.

On Nov 23, 2017, at 11:15 AM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

> Until we figure out that path forward for regex’s, I think they aren’t the right motivation for this proposal.

--
Brent Royal-Gordon
Sent from my iPhone

(Xiaodi Wu) #4

This proposed addition addresses a known pain point, to be sure, but I
think it has many implications for the future direction of the language and
I'd like to explore them here.

The tl;dr version is that I'm not sure this is the right direction in which
to head to address the issue of regex ergonomics, and that the issue also
implicates other weaknesses in terms of literals for which I think the
solution exacerbates rather than solves the underlying problem. [Chris's
email just came through and cut to the meat of it, but I'll keep writing
this email and complete my thoughts.]

We've been talking on this list for quite some time about supporting `/this
syntax/` for a regex literal. Whether this can be accomplished or not
within the Swift 5 timeframe (and I think a reasonable barebones
implementation could be), the question here is whether your proposed
addition serves any useful purpose in a future version of Swift in which we
do have such a literal. After all, your motivation (and a very valid one)
is that regex literals are too difficult to type. I very strongly believe
that the solution here is what we've been talking about all along: actual
regex literals.

We should certainly move any discussion about regex literals into its own
thread, but to make it clear that I'm not simply suggesting that we
implement something in Swift 10 instead of addressing a known pain point
now, here's a sketch of how Swift 5 could make meaningful progress:

- Teach the lexer about basic /pattern/flag syntax.
- Add an `ExpressibleByRegularExpressionLiteral`, where the initializer
would be something like `init(regularExpressionLiteralPattern: String,
flags: RegularExpressionFlags)` where RegularExpressionFlags would be an
OptionSet type.
- Add conformance to `ExpressibleByRegularExpressionLiteral` to
`NSRegularExpression`.
- Have no default `RegularExpressionLiteralType` for now so that, in the
future, we can discuss and design a Swift standard library regular
expression type, which is justifiable because we've baked in language
support for the literal. This can be postponed.
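As a rough illustration of the sketch above (not an agreed design): `ExpressibleByRegularExpressionLiteral` and `RegularExpressionFlags` are the hypothetical names from the list, while the individual flag names and the `FoundationRegex` wrapper are invented for this example. The last line writes out by hand the call the compiler might emit for a literal such as /a[b*]c/i. A wrapper struct is used instead of a direct `NSRegularExpression` conformance purely to keep the example self-contained.

```swift
import Foundation

// Hypothetical flags type; the sketch only says it would be an OptionSet.
struct RegularExpressionFlags: OptionSet {
    let rawValue: Int
    static let caseInsensitive = RegularExpressionFlags(rawValue: 1 << 0)
    static let multiline = RegularExpressionFlags(rawValue: 1 << 1)
}

// Hypothetical protocol: the lexer hands over the pattern text and flags.
protocol ExpressibleByRegularExpressionLiteral {
    init(regularExpressionLiteralPattern: String, flags: RegularExpressionFlags)
}

// Illustrative wrapper around NSRegularExpression (name made up here).
struct FoundationRegex: ExpressibleByRegularExpressionLiteral {
    let regex: NSRegularExpression?

    init(regularExpressionLiteralPattern pattern: String, flags: RegularExpressionFlags) {
        var options: NSRegularExpression.Options = []
        if flags.contains(.caseInsensitive) { options.insert(.caseInsensitive) }
        if flags.contains(.multiline) { options.insert(.anchorsMatchLines) }
        // An invalid pattern leaves `regex` nil; a real design would diagnose it.
        regex = try? NSRegularExpression(pattern: pattern, options: options)
    }

    func matches(_ text: String) -> Bool {
        guard let regex = regex else { return false }
        let range = NSRange(text.startIndex..., in: text)
        return regex.firstMatch(in: text, options: [], range: range) != nil
    }
}

// What a literal like /a[b*]c/i might lower to under this sketch.
let pattern = FoundationRegex(regularExpressionLiteralPattern: "a[b*]c",
                              flags: [.caseInsensitive])
```

Because the protocol only sees a pattern string, any validation happens inside the conforming type at run time, which is the trade-off discussed later in the thread.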

Now, suppose we can motivate "raw" strings with some other use case. The
proposed syntax remains problematic and I'd like to discuss that for a
little bit.

I believe it was Chris who originally explained some time ago that it was a
deliberate design decision not to have little dangling bits on literals
such as "1.0f", opting instead for more readable spellings such as "1.0 as
Float". This `r` prefix clearly undoes that deliberate design decision
and is inconsistent with the direction of Swift. There are other options here,
should a use case arise that successfully motivates "raw" strings. For
example, one might be to use the long-reserved single-quoted 'string
literal' for this purpose, if this is judged to be a significant enough
feature to justify it.

But there's another matter here that I'd like to touch on. Namely, for all
literal types, the `ExpressibleBy*` protocols expose a processed version of
the literal. This leads to various issues that require hacky workarounds at
best. For instance, BigInt types can't initialize arbitrarily large values
with integer literals, because the literal has to be representable as a
built-in fixed-width integer; Decimal initializes its floating-point
literal values through a Double intermediary, which can lead to undesirable
results; and so on. What these issues have in common with your use case
here is the inability of `ExpressibleBy*` types to receive the underlying
"raw" literal as it is input. If we could come up with a holistic solution
here, we might be able to dispense with having any distinct syntax for
"raw" strings *and* solve all of these issues at once.


(Daniel Duan) #5

I, eh, literally just ran into a situation where this feature would have been super useful:

Someone sent me a JSON string with “\” in it. I needed to plug it into the UI logic to test something but I had to escape the “\”s again to make JSONSerialization happy.

A raw string literal syntax, especially when combined with multi line literals, would be very helpful.
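A small sketch of that situation, with a made-up JSON payload. Since the r"…" form is only pitched, the `#"…"#` syntax that later shipped in Swift 5 is used as a stand-in; that choice is an assumption of this example.

```swift
import Foundation

// JSON text exactly as it might arrive, containing backslash escapes:
//   {"path": "C:\\temp", "note": "line1\nline2"}

// Ordinary literal: every backslash and quote has to be escaped again.
let escaped = "{\"path\": \"C:\\\\temp\", \"note\": \"line1\\nline2\"}"

// Raw literal: the JSON text can be pasted in unchanged.
let pasted = #"{"path": "C:\\temp", "note": "line1\nline2"}"#

// Both literals describe the same characters.
assert(escaped == pasted)

// JSONSerialization then interprets the JSON-level escapes itself.
let object = (try? JSONSerialization.jsonObject(with: Data(pasted.utf8))) as? [String: String]
```

After parsing, `object?["path"]` holds a single backslash and `object?["note"]` a real newline, because those escapes belong to JSON rather than to the Swift literal.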

That said, how and whether this syntax composes with multi-line literals, interpolation, and the potential regex literal Chris mentioned earlier deserves thorough consideration and deliberation.

Daniel Duan

Sent from my iPhone


(Chris Lattner) #6

On Nov 23, 2017, at 11:10 AM, Brent Royal-Gordon <brent@architechies.com> wrote:

> On Nov 23, 2017, at 11:15 AM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
>
>> Until we figure out that path forward for regex’s, I think they aren’t the right motivation for this proposal.
>
> 1. Even in our shining pattern matching future—a future which I, for one, am eager to hasten—we will still need to interoperate with NSRegularExpression and other Perl 5-compatible regex engines.

We already interoperate with those other engines. When we have an awesome default answer, I don’t see why we’d be compelled to sugar them any more.

> 2. Code generation.
>
> 3. Windows-style paths.
>
> 4. Doesn’t LaTeX use backslashes?

Right, I’m only objecting to regex as the motivation.

-Chris

(Xiaodi Wu) #7

On Thu, Nov 23, 2017 at 1:12 PM, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

> On Nov 23, 2017, at 11:15 AM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
>
>> Until we figure out that path forward for regex’s, I think they aren’t the
>> right motivation for this proposal.
>
> 1. Even in our shining pattern matching future—a future which I, for one,
> am eager to hasten—we will still need to interoperate with
> NSRegularExpression and other Perl 5-compatible regex engines.

Can you explain why such interoperability would need _raw string literals_
as opposed to regex literals?

> 2. Code generation.

Can you elaborate on this?

> 3. Windows-style paths.

Sure, Windows-style paths use the backslash. But the useful exercise is not
to enumerate every place where backslashes are used, but rather the places
where they're often used in _literals_. How often are you hardcoding
Windows-style paths? Wouldn't an ergonomic API accept the forward slash
also? Even many Microsoft-vended Windows utilities do so.

> 4. Doesn’t LaTeX use backslashes?

Again, it's not often that you're hardcoding LaTeX literals. Let's not turn
this into an exercise in listing all places where you've seen a backslash used.

(Chris Lattner) #8

On Nov 23, 2017, at 10:35 AM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

> This proposed addition addresses a known pain point, to be sure, but I think it has many implications for the future direction of the language and I'd like to explore them here.

Thanks for writing this up, Xiaodi.

> We should certainly move any discussion about regex literals into its own thread, but to make it clear that I'm not simply suggesting that we implement something in Swift 10 instead of addressing a known pain point now, here's a sketch of how Swift 5 could make meaningful progress:
>
> - Teach the lexer about basic /pattern/flag syntax.
> - Add an `ExpressibleByRegularExpressionLiteral`, where the initializer would be something like `init(regularExpressionLiteralPattern: String, flags: RegularExpressionFlags)` where RegularExpressionFlags would be an OptionSet type.
> - Add conformance to `ExpressibleByRegularExpressionLiteral` to `NSRegularExpression`.
> - Have no default `RegularExpressionLiteralType` for now so that, in the future, we can discuss and design a Swift standard library regular expression type, which is justifiable because we've baked in language support for the literal. This can be postponed.

This approach could make sense, but it makes a couple of assumptions that I’m not certain are the right way to go (to be clear, I’m not certain that they’re wrong either!).

Things I’d like to carefully consider:

1) We could make the compiler parse and validate regex literals at compile time:

a) this allows the compiler to emit diagnostics (with fixits!) on malformed literals.

b) When the compiler knows the grammar of the regex, it can precompile the regex into a DFA table or static executable code, rather than runtime compiling into a bytecode.

c) however, the compiler can’t parse the literal unless it knows the dialect it corresponds to. While we could parameterize this somehow (e.g. as a requirement in ExpressibleByRegularExpressionLiteral), if we weren’t bound by backwards compatibility, we would just keep things simple and say “there is one and only one grammar”. I’d argue that having exactly one grammar supported by the // syntax is also *better* for users, rather than saying “it depends on what library you’re passing the regex into”.

2) I’d like to explore the idea of making // syntax be *patterns* instead of simply literals. As a pattern, it should be possible to bind submatches directly into variable declarations, eliminating the need to count parens in matches or other gross things. Here is strawman syntax with a dumb example:

if case /([a-zA-Z]+: let firstName) ([a-zA-Z]+: let lastName)/ = getSomeString() {
   print(firstName, lastName)
}

3) I see regex string matching as the dual to string interpolation. We already provide the ability for types to specify a default way to print themselves, and it would be great to have default regexes associated with many types, so you can just say “match an Int here” instead of having to match [0-9]+ and then do a failable conversion to Int outside the regex.

4) I’d like to consider some of the advances that Perl 6 added to its regex grammar. Everyone knows that modern regexes aren’t actually regular anyway, so it raises the question of how far to take it. If nothing else, I appreciate the freeform structure supported (including inline comments), which makes them more readable.
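For comparison with the strawman in point 2, here is roughly what the same match takes with today's `NSRegularExpression`, where captures are fetched by counting parentheses. This is only a sketch; the helper name `splitName` is made up for the example.

```swift
import Foundation

func splitName(_ text: String) -> (first: String, last: String)? {
    guard let regex = try? NSRegularExpression(pattern: "([a-zA-Z]+) ([a-zA-Z]+)",
                                               options: []) else {
        return nil
    }
    let whole = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole),
          // Group numbers are positional: 1 is the first "(", 2 the second.
          let firstRange = Range(match.range(at: 1), in: text),
          let lastRange = Range(match.range(at: 2), in: text) else {
        return nil
    }
    return (String(text[firstRange]), String(text[lastRange]))
}

let name = splitName("Ada Lovelace")
```

Adding or reordering a group in the pattern silently changes which index means what, which is the bookkeeping that binding `firstName` and `lastName` inside the pattern would remove.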

We should also support a dynamic regex engine, because there are sometimes reasons to construct regexes at runtime. This could be handled by having the Regex type support a conversion from String or something, orthogonal to the language support for regex literals/patterns.

-Chris


(John Holdsworth) #9

I’m beginning to wish I hadn’t tied this proposal so strongly to regular expressions!
It is indeed the wrong motivation. Even as a ten-year veteran of Perl development,
I’m not sure we want to bake it into the language quite so tightly (isn’t it a part of
Foundation?). What would /regex/ represent - an instance of NSRegularExpression?
Would the flags be pattern options or matching options? This is a whole other debate.

For me the focus of raw strings was a sort of super-literal literal which has many
applications. The r"literal" syntax has a precedent in Python, and there seemed
to be a syntactic gap that could be occupied, but perhaps there are other alternatives
we could discuss. It would be a shame to see ‘quoted strings’ be used for this, however.
I still live in hope that one day they will be used for single-character Unicode values.

John


(Xiaodi Wu) #10

On Thu, Nov 23, 2017 at 2:14 PM, John Holdsworth via swift-evolution <swift-evolution@swift.org> wrote:

> I’m beginning to wish I hadn’t tied this proposal so strongly to regular
> expressions! It is indeed the wrong motivation. Even as a ten year veteran
> of Perl development I’m not sure we want to bake it into the language quite
> so tightly (isn’t a part of Foundation?) What would /regex/ represent - an
> instance of NSRegularExpression? Would the flags be pattern options or
> matching options? This is a whole other debate.
>
> For me the focus of raw strings was a sort of super-literal literal which
> has many applications. The r”literal” syntax has a precedent in Python and
> there seemed to be a syntactic gap that could be occupied but perhaps there
> are other alternatives we could discuss. It would be a shame to see ‘quoted
> strings’ be used for this however. I still live in hope one day it will be
> used for single character UNICODE values.

Since what passes for a single character changes by Unicode revision--such
as whenever they get around to enumerating the permitted modifying
attributes of the poop emoji--it is quite impossible (and Swift's
`Character` doesn't attempt to) to enforce single-characterness at compile
time. We should put any such notions to rest up front.


(Tony Allevato) #11

On Thu, Nov 23, 2017 at 12:21 PM Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

> On Thu, Nov 23, 2017 at 2:14 PM, John Holdsworth via swift-evolution <swift-evolution@swift.org> wrote:
>
>> It would be a shame to see ‘quoted strings’ be used for this however.
>> I still live in hope one day it will be used for single character UNICODE
>> values.
>
> Since what passes for a single character changes by Unicode
> revision--such as whenever they get around to enumerating the permitted
> modifying attributes of the poop emoji--it is quite impossible (and Swift's
> `Character` doesn't attempt to) to enforce single-characterness at compile
> time. We should put any such notions to rest up front.

Unless I'm misunderstanding you here, I don't think that's true: writing
something like `let c: Character = "ab"` is definitely a compile-time
error: https://gist.github.com/allevato/ae267e2aaaa7939d6233d66a87b48fc0

To the original point though, I don't think Swift needs to use single
quotes for single characters (or single scalars). Type inference already
infers Characters from single-character String literals in contexts where a
Character is expected, and the only time you need to be explicit is if
you're trying to resolve an overload or initialize a variable by itself.
Using single quotes to avoid writing "as Character" would feel like a waste.
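The inference described above can be seen in a few lines. This is a minimal sketch; `isVowel` is a made-up helper used only to supply a `Character` context.

```swift
// The annotation selects Character for a one-character literal.
let a: Character = "a"

// Coercion is needed only when there is no other type context.
let b = "b" as Character

// With no context at all, the literal defaults to String.
let c = "c"

func isVowel(_ ch: Character) -> Bool {
    return "aeiou".contains(ch)
}

// The parameter type is enough context here; no "as Character" needed.
let result = isVowel("e")

// let bad: Character = "ab"   // rejected at compile time, as noted above
```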


(Xiaodi Wu) #12

On Thu, Nov 23, 2017 at 5:33 PM, Chris Lattner <clattner@nondot.org> wrote:

> This approach could make sense, but it makes a couple of assumptions that
> I’m not certain are the right way to go (to be clear, I’m not certain that
> they’re wrong either!).
>
> Things I’d like to carefully consider:
>
> 1) We could make the compiler parse and validate regex literals at compile
> time:
>
> a) this allows the compiler to emit diagnostics (with fixits!) on
> malformed literals.
>
> b) When the compiler knows the grammar of the regex, it can precompile the
> regex into a DFA table or static executable code, rather than runtime
> compiling into a bytecode.
>
> c) however, the compiler can’t parse the literal unless it knows the
> dialect it corresponds to. While we could parameterize this somehow (e.g.
> as a requirement in ExpressibleByRegularExpressionLiteral), if we weren’t
> bound by backwards compatibility, we would just keep things simple and say
> “there is one and only one grammar”. I’d argue that having exactly one
> grammar supported by the // syntax is also *better* for users, rather than
> saying “it depends on what library you’re passing the regex into”.

I think we've circled back to a topic that we've discussed here before. I
do agree that having more of this validation at compile time would improve
the experience. However, I can see a few drawbacks to the _compiler_ doing
the validation:

- In the absence of a `constexpr`-like facility, supporting runtime
expressions would mean we'd be writing the same code twice, once in C++ for
compile-time validation of literal expressions and another time in Swift
for runtime expressions.

- As seen in these discussions about string literals where users want to
copy and paste text and have it "just work," supporting only one dialect in
regex literals will inevitably lead users to ask for other types of regex
literals for each individual flavor of regex they encounter.

Just like ExpressibleByDictionaryLiteral doesn't deduplicate keys, leaving
that to Dictionary, I think regex literals are better off not validating
literal expressions (or, maybe, doing only the barest sanity check),
leaving the rest to concrete regex types. As you point out with validation
of integer overflows during constant folding, we could get enough
compile-time validation even without teaching the compiler itself how to
validate the literal.

> 2) I’d like to explore the idea of making // syntax be *patterns* instead
> of simply literals. As a pattern, it should be possible to bind submatches
> directly into variable declarations, eliminating the need to count parens
> in matches or other gross things. Here is strawman syntax with a dumb
> example:
>
> if case /([a-zA-Z]+: let firstName) ([a-zA-Z]+: let lastName)/ = getSomeString() {
>    print(firstName, lastName)
> }

This is an interesting idea. But is it significantly more usable than the
same type having a collection of named matches using the usual Perl syntax?

  if case /(?<firstName>[a-zA-Z]+) (?<lastName>[a-zA-Z]+)/ = getSomeString() {
    print(Regex.captured["firstName"], Regex.captured["lastName"])
  }

> 3) I see regex string matching as the dual to string interpolation. We
> already provide the ability for types to specify a default way to print
> themselves, and it would be great to have default regex’s associated with
> many types, so you can just say “match an Int here” instead of having to
> match [0-9]+ and then do a failable conversion to Int outside the regex.
>
> 4) I’d like to consider some of the advances that Perl 6 added to its
> regex grammar. Everyone knows that modern regex’s aren’t actually regular
> anyway, so it begs the question of how far to take it. If nothing else, I
> appreciate the freeform structure supported (including inline comments)
> which make them more readable.

Sounds like we want multiline regex literals :)

> We should also support a dynamic regex engine as well, because there are
> sometimes reasons to runtime construct regex’s. This could be handled by
> having the Regex type support a conversion from String or something,
> orthogonal to the language support for regex literals/patterns.

(Magnus Ahltorp) #13

Erlang has something very similar for binaries, where constructing and matching a binary is part of the syntax.

For example:

C = <<A:1, B:63>>

constructs a 64-bit binary C where the first bit comes from the integer variable A, and the rest from the integer variable B. In Erlang, this syntax is exactly the same when matching and constructing, so the corresponding syntax for matching is:

<<A:1, B:63>> = C

where the 64-bit binary C is matched so that the first bit is put in the integer variable A and the rest is put in the integer variable B. If we don't want to match a variable to an integer but keep it as a binary, we can just mark the variable matching as a binary:

<<A:8, B/binary>> = C

which means that A will still be an integer (but in this case 8 bits wide) and B will be a binary.

Or if we want to do different things based on the first bit:

case C of
    <<0:1, B:63>> ->
        B;
    <<1:1, B:63>> ->
        B + 10
end
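For contrast, the last Erlang example written in today's Swift needs explicit shifts and masks, since there is no bit-level pattern syntax. This is a minimal sketch, assuming the 64-bit binary is held in a `UInt64` with the first bit as the most significant one; the function name `decode` is made up.

```swift
func decode(_ c: UInt64) -> UInt64 {
    let a = c >> 63                 // <<A:1, ...>>: the first (most significant) bit
    let b = c & (UInt64.max >> 1)   // <<..., B:63>>: the remaining 63 bits
    switch a {
    case 0:
        return b
    default:
        return b + 10
    }
}

let lowBitClear = decode(5)
let highBitSet = decode((UInt64(1) << 63) | 5)
```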

Making this kind of powerful syntax work for regular expressions would be very nice in Swift.

(And I would like it for binaries as well)

/Magnus

24 Nov. 2017 08:33 Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

> 2) I’d like to explore the idea of making // syntax be *patterns* instead of simply literals. As a pattern, it should be possible to bind submatches directly into variable declarations, eliminating the need to count parens in matches or other gross things. Here is strawman syntax with a dumb example:
>
> if case /([a-zA-Z]+: let firstName) ([a-zA-Z]+: let lastName)/ = getSomeString() {
>    print(firstName, lastName)
> }

(Xiaodi Wu) #14

I’m beginning to wish I hadn’t tied this proposal so strongly to regular expressions! It is indeed the wrong motivation. Even as a ten year veteran of Perl development I’m not sure we want to bake it into the language quite so tightly (isn’t it a part of Foundation?). What would /regex/ represent - an instance of NSRegularExpression? Would the flags be pattern options or matching options? This is a whole other debate.

For me the focus of raw strings was a sort of super-literal literal which has many applications. The r"literal" syntax has a precedent in Python and there seemed to be a syntactic gap that could be occupied, but perhaps there are other alternatives we could discuss. It would be a shame to see ‘quoted strings’ be used for this however. I still live in hope one day it will be used for single character UNICODE values.

Since what passes for a single character changes by Unicode
revision--such as whenever they get around to enumerating the permitted
modifying attributes of the poop emoji--it is quite impossible (and Swift's
`Character` doesn't attempt to) to enforce single-characterness at compile
time. We should put any such notions to rest up front.

Unless I'm misunderstanding you here, I don't think that's true: writing
something like `let c: Character = "ab"` is definitely a compile-time
error: https://gist.github.com/allevato/ae267e2aaaa7939d6233d66a87b48fc0

Hmm, yes, it still attempts to make a best effort, it seems. I had thought
that this compile-time check was removed altogether, as it cannot be done
in the general case.
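A small sketch of the distinction under discussion, assuming a current Swift toolchain: a `Character` is a grapheme cluster and may be made of several Unicode scalars, while a literal that is clearly two clusters is rejected when assigned to a `Character`.

```swift
// One Character (grapheme cluster) built from two Unicode scalars:
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented: Character = "e\u{301}"
let scalarCount = String(accented).unicodeScalars.count
print(scalarCount)   // 2

// Two separate grapheme clusters do not fit in a Character literal:
// let c: Character = "ab"   // rejected at compile time
```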

···

On Thu, Nov 23, 2017 at 2:47 PM, Tony Allevato <tony.allevato@gmail.com> wrote:

On Thu, Nov 23, 2017 at 12:21 PM Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Nov 23, 2017 at 2:14 PM, John Holdsworth via swift-evolution <swift-evolution@swift.org> wrote:

To the original point though, I don't think Swift needs to use single
quotes for single characters (or single scalars). Type inference already
infers Characters from single-character String literals in contexts where a
Character is expected, and the only time you need to be explicit is if
you're trying to resolve an overload or initialize a variable by itself.
Using single quotes to avoid writing "as Character" would feel like a waste.

Agree.

On 23 Nov 2017, at 19:10, Brent Royal-Gordon <brent@architechies.com> wrote:

On Nov 23, 2017, at 11:15 AM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

Until we figure out that path forward for regex’s, I think they aren’t
the right motivation for this proposal.

1. Even in our shining pattern matching future—a future which I, for
one, am eager to hasten—we will still need to interoperate with
NSRegularExpression and other Perl 5-compatible regex engines.

2. Code generation.

3. Windows-style paths.

4. Doesn’t LaTeX use backslashes?

5. Etc.

I think the Motivation section undersells this proposal. Regexes are a
strong short-run use case, but in the long run, we’ll need this for other
things. In both cases, though, raw literals will be a useful addition to
the language, improving the clarity of Swift code much like multiline
literals already have.

--
Brent Royal-Gordon
Sent from my iPhone

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(^) #15

i still think single quotes should be used as an alternate literal for
UInt8, like char. there’s a lot of cases where you’re working with
low-level 8-bit ASCII data and both String and Character and Unicode.Scalar
are inappropriate, and typing out hex literals makes code *less* readable.

···

On Thu, Nov 23, 2017 at 3:47 PM, Tony Allevato via swift-evolution <swift-evolution@swift.org> wrote:


Unless I'm misunderstanding you here, I don't think that's true: writing
something like `let c: Character = "ab"` is definitely a compile-time
error: https://gist.github.com/allevato/ae267e2aaaa7939d6233d66a87b48fc0

To the original point though, I don't think Swift needs to use single
quotes for single characters (or single scalars). Type inference already
infers Characters from single-character String literals in contexts where a
Character is expected, and the only time you need to be explicit is if
you're trying to resolve an overload or initialize a variable by itself.
Using single quotes to avoid writing "as Character" would feel like a waste.

(Tony Allevato) #16

This could be solved by extending the existing string literal handling and
letting type inference do the rest. The real problem here is that
`UInt8(ascii: X)` is annoying to write when you're dealing with a large
amount of low-level data.

If UInt8 conformed to ExpressibleByUnicodeScalarLiteral, you could get most
of the way there—you'd just have to have it fail at runtime for anything
outside 0...127. But then you could write `let c: UInt8 = "x"` and it would
just work.

Failing at runtime is undesirable though, so you could take it further and
add an ExpressibleByASCIILiteral protocol which would be known to the
compiler, and it would emit an error at compile time if the literal wasn't
a single ASCII character (like it does today for Character).

One of the things I think is really elegant about Swift is that string
literals are untyped by themselves and take on an appropriate type based on
the context they're used in. Handling different kinds of strings should
leverage and extend that mechanism, not add new syntax.
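As a rough illustration of the ExpressibleByUnicodeScalarLiteral idea above, the sketch below uses a hypothetical wrapper type, `ASCIIByte`, rather than conforming `UInt8` itself (which the standard library does not do). The ASCII check happens at runtime here; that is the trap a compiler-known ExpressibleByASCIILiteral would turn into a compile-time error.

```swift
// `ASCIIByte` is a hypothetical wrapper used only for illustration.
struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    let value: UInt8

    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        // Runtime check for the 0...127 range discussed above.
        precondition(scalar.isASCII, "literal is not ASCII")
        value = UInt8(scalar.value)
    }
}

// Type inference picks the literal's type from context, with no new syntax.
let x: ASCIIByte = "x"
print(x.value == UInt8(ascii: "x"))   // true
```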

···

On Thu, Nov 23, 2017 at 2:43 PM Kelvin Ma <kelvin13ma@gmail.com> wrote:


i still think single quotes should be used as an alternate literal for
UInt8, like char. there’s a lot of cases where you’re working with
low-level 8-bit ASCII data and both String and Character and Unicode.Scalar
are inappropriate, and typing out hex literals makes code *less* readable.

(Chris Lattner) #17

<email reordered a bit below to make responding easier>:

I think we've circled back to a topic that we've discussed here before. I do agree that having more of this validation at compile time would improve the experience. However, I can see a few drawbacks to the _compiler_ doing the validation:

- As seen in these discussions about string literals where users want to copy and paste text and have it "just work," supporting only one dialect in regex literals will inevitably lead users to ask for other types of regex literals for each individual flavor of regex they encounter.

Focusing first on the user model instead of implementation details:

I don’t see why this is desirable at all. If someone came to the Perl community and said “I want to use unmodified tcl regexp syntax”, the Perl community would politely tell them to buzz off. They can just use string literals.

Allowing // syntax to support different grammars makes the Swift language more complex for users (independent of implementation details) and I don’t see any benefit to allowing that. IMO, we’d be much better off by having a single blessed syntax, make it work as well as possible, and steer the community strongly towards using it.

Someone wanting to use NSRegularExpression or a bsd regex library or whatever can use string literals, just like they do now. This has the *advantage* that you don’t look at the code using //’s and think it does something it doesn’t.

- In the absence of a `constexpr`-like facility, supporting runtime expressions would mean we'd be writing the same code twice, once in C++ for compile-time validation of literal expressions and another time in Swift for runtime expressions.

Agreed. There are various ways we could factor this logic, including having the regex parser + tree representation be literally linked into both the compiler and stdlib. I don’t think the cost is great, and we definitely do such things already. If we do this right, the functionality can subsume tools like flex as well, which means we’d get a net reduction of complexity in the whole system.

2) I’d like to explore the idea of making // syntax be *patterns* instead of simply literals. As a pattern, it should be possible to bind submatches directly into variable declarations, eliminating the need to count parens in matches or other gross things. Here is strawman syntax with a dumb example:

if case /([a-zA-Z]+: let firstName) ([a-zA-Z]+: let lastName)/ = getSomeString() {
   print(firstName, lastName)
}

I don’t know if this is the ideal way to do this; as I mentioned before, I think we need to have a concerted design effort that considers such things. Regex functionality does fit naturally with pattern matching though, so I don’t think we should discard it too early.

This is an interesting idea. But is it significantly more usable than the same type having a collection of named matches using the usual Perl syntax?

  if case /(?<firstName>[a-zA-Z]+) (?<lastName>[a-zA-Z]+)/ = getSomeString() {
    print(Regex.captured["firstName"], Regex.captured["lastName"])
  }

Personally, I really don’t like this. It turns a structured problem into one that violates DRY and loses the structure inherent in the solution. Also, while theoretically the dictionary could be optimized away, in practice that would be difficult to do without heroics.

3) I see regex string matching as the dual to string interpolation. We already provide the ability for types to specify a default way to print themselves, and it would be great to have default regex’s associated with many types, so you can just say “match an Int here” instead of having to match [0-9]+ and then do a failable conversion to Int outside the regex.

4) I’d like to consider some of the advances that Perl 6 added to its regex grammar. Everyone knows that modern regex’s aren’t actually regular anyway, so it begs the question of how far to take it. If nothing else, I appreciate the freeform structure supported (including inline comments) which make them more readable.

Sounds like we want multiline regex literals :)

Yes, I absolutely do, but I want the // syntax to imply them. It’s “single line” literal syntax that we should eliminate by default.

-Chris

···

On Nov 24, 2017, at 11:12 AM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

(Thorsten Seitz) #18

This proposed addition addresses a known pain point, to be sure, but I think it has many implications for the future direction of the language and I'd like to explore them here.

Thanks for writing this up Xiaodi,

We should certainly move any discussion about regex literals into its own thread, but to make it clear that I'm not simply suggesting that we implement something in Swift 10 instead of addressing a known pain point now, here's a sketch of how Swift 5 could make meaningful progress:

- Teach the lexer about basic /pattern/flag syntax.
- Add an `ExpressibleByRegularExpressionLiteral`, where the initializer would be something like `init(regularExpressionLiteralPattern: String, flags: RegularExpressionFlags)` where RegularExpressionFlags would be an OptionSet type.
- Add conformance to `ExpressibleByRegularExpressionLiteral` to `NSRegularExpression`.
- Have no default `RegularExpressionLiteralType` for now so that, in the future, we can discuss and design a Swift standard library regular expression type, which is justifiable because we've baked in language support for the literal. This can be postponed.
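The list above could be spelled roughly as follows. This is only a sketch: the protocol and `RegularExpressionFlags` do not exist in Swift, the single flag shown is an assumption, and `PatternBox` is a hypothetical wrapper standing in for the proposed `NSRegularExpression` conformance. Without lexer support for /pattern/flag, the initializer has to be called by hand.

```swift
import Foundation

// Hypothetical option set; the thread only says it would be an OptionSet type.
struct RegularExpressionFlags: OptionSet {
    let rawValue: Int
    static let caseInsensitive = RegularExpressionFlags(rawValue: 1 << 0)
}

// Hypothetical protocol mirroring the initializer sketched in the list.
protocol ExpressibleByRegularExpressionLiteral {
    init(regularExpressionLiteralPattern: String, flags: RegularExpressionFlags)
}

// Hypothetical conforming type that defers validation to NSRegularExpression.
struct PatternBox: ExpressibleByRegularExpressionLiteral {
    let pattern: String
    let flags: RegularExpressionFlags

    init(regularExpressionLiteralPattern: String, flags: RegularExpressionFlags) {
        self.pattern = regularExpressionLiteralPattern
        self.flags = flags
    }

    // The pattern is only checked here, at runtime, by the concrete regex type.
    func compiled() throws -> NSRegularExpression {
        let options: NSRegularExpression.Options =
            flags.contains(.caseInsensitive) ? [.caseInsensitive] : []
        return try NSRegularExpression(pattern: pattern, options: options)
    }
}

// Stand-in for what the compiler would emit for a literal like /a[b*]c/i.
let box = PatternBox(regularExpressionLiteralPattern: "a[b*]c",
                     flags: [.caseInsensitive])
```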

This approach could make sense, but it makes a couple of assumptions that I’m not certain are the right way to go (to be clear, I’m not certain that they’re wrong either!).

Things I’d like to carefully consider:

1) We could make the compiler parse and validate regex literals at compile time:

a) this allows the compiler to emit diagnostics (with fixits!) on malformed literals.

b) When the compiler knows the grammar of the regex, it can precompile the regex into a DFA table or static executable code, rather than runtime compiling into a bytecode.

c) however, the compiler can’t parse the literal unless it knows the dialect it corresponds to. While we could parameterize this somehow (e.g. as a requirement in ExpressibleByRegularExpressionLiteral), if we weren’t bound by backwards compatibility, we would just keep things simple and say “there is one and only one grammar”. I’d argue that having exactly one grammar supported by the // syntax is also *better* for users, rather than saying “it depends on what library you’re passing the regex into”.

I think we've circled back to a topic that we've discussed here before. I do agree that having more of this validation at compile time would improve the experience. However, I can see a few drawbacks to the _compiler_ doing the validation:

- In the absence of a `constexpr`-like facility, supporting runtime expressions would mean we'd be writing the same code twice, once in C++ for compile-time validation of literal expressions and another time in Swift for runtime expressions.

- As seen in these discussions about string literals where users want to copy and paste text and have it "just work," supporting only one dialect in regex literals will inevitably lead users to ask for other types of regex literals for each individual flavor of regex they encounter.

Just like ExpressibleByDictionaryLiteral doesn't deduplicate keys, leaving that to Dictionary, I think regex literals are better off not validating literal expressions (or, maybe, doing only the barest sanity check), leaving the rest to concrete regex types. As you point out with validation of integer overflows during constant folding, we could get enough compile-time validation even without teaching the compiler itself how to validate the literal.

2) I’d like to explore the idea of making // syntax be *patterns* instead of simply literals. As a pattern, it should be possible to bind submatches directly into variable declarations, eliminating the need to count parens in matches or other gross things. Here is strawman syntax with a dumb example:

if case /([a-zA-Z]+: let firstName) ([a-zA-Z]+: let lastName)/ = getSomeString() {
   print(firstName, lastName)
}

This is an interesting idea. But is it significantly more usable than the same type having a collection of named matches using the usual Perl syntax?

  if case /(?<firstName>[a-zA-Z]+) (?<lastName>[a-zA-Z]+)/ = getSomeString() {
    print(Regex.captured["firstName"], Regex.captured["lastName"])
  }

Definitely. Not only is it much more readable, it is much safer as well, as the compiler will tell you that a name is not defined on a typo. Furthermore, as Chris suggested, this can be extended to directly get out other types than strings in a typesafe way (which should be extensible to user defined types conforming to a specific protocol).

3) I see regex string matching as the dual to string interpolation. We already provide the ability for types to specify a default way to print themselves, and it would be great to have default regex’s associated with many types, so you can just say “match an Int here” instead of having to match [0-9]+ and then do a failable conversion to Int outside the regex.

4) I’d like to consider some of the advances that Perl 6 added to its regex grammar. Everyone knows that modern regex’s aren’t actually regular anyway, so it begs the question of how far to take it. If nothing else, I appreciate the freeform structure supported (including inline comments) which make them more readable.

Sounds like we want multiline regex literals :)

Absolutely.

-Thorsten

···

On 24.11.2017 at 20:13, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Nov 23, 2017 at 5:33 PM, Chris Lattner <clattner@nondot.org> wrote:

On Nov 23, 2017, at 10:35 AM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

We should also support a dynamic regex engine as well, because there are sometimes reasons to runtime construct regex’s. This could be handled by having the Regex type support a conversion from String or something, orthogonal to the language support for regex literals/patterns.


(Chris Lattner) #19

There is another way to handle this: we can diagnose it the same way we diagnose statically identifiable overflow of arithmetic operations:

(swift) 128 as Int8
<REPL Input>:1:1: error: integer literal '128' overflows when stored into 'Int8'
128 as Int8
^
(swift) 1+127 as Int8
<REPL Input>:1:2: error: arithmetic operation '1 + 127' (on type 'Int8') results in an overflow
1+127 as Int8
~^~~~

The way this happens is through constant folding at the SIL level, which emits diagnostics when it constant folds the “should trap” bit on these operations to true. The code in question is in lib/SILOptimizer/Mandatory/ConstantPropagation.cpp

-Chris

···

On Nov 23, 2017, at 3:07 PM, Tony Allevato via swift-evolution <swift-evolution@swift.org> wrote:

This could be solved by extending the existing string literal handling and letting type inference do the rest. The real problem here is that `UInt8(ascii: X)` is annoying to write when you're dealing with a large amount of low-level data.

If UInt8 conformed to ExpressibleByUnicodeScalarLiteral, you could get most of the way there—you'd just have to have it fail at runtime for anything outside 0...127. But then you could write `let c: UInt8 = "x"` and it would just work.

Failing at runtime is undesirable though

(^) #20

aren’t all literals evaluated at compile time?

···

On Thu, Nov 23, 2017 at 6:07 PM, Tony Allevato <tony.allevato@gmail.com> wrote:

This could be solved by extending the existing string literal handling and
letting type inference do the rest. The real problem here is that
`UInt8(ascii: X)` is annoying to write when you're dealing with a large
amount of low-level data.

If UInt8 conformed to ExpressibleByUnicodeScalarLiteral, you could get
most of the way there—you'd just have to have it fail at runtime for
anything outside 0...127. But then you could write `let c: UInt8 = "x"` and
it would just work.

Failing at runtime is undesirable though, so you could take it further and
add an ExpressibleByASCIILiteral protocol which would be known to the
compiler, and it would emit an error at compile time if the literal wasn't
a single ASCII character (like it does today for Character).

One of the things I think is really elegant about Swift is that string
literals are untyped by themselves and take on an appropriate type based on
the context they're used in. Handling different kinds of strings should
leverage and extend that mechanism, not add new syntax.

On Thu, Nov 23, 2017 at 2:43 PM Kelvin Ma <kelvin13ma@gmail.com> wrote:


i still think single quotes should be used as an alternate literal for
UInt8, like char. there’s a lot of cases where you’re working with
low-level 8-bit ASCII data and both String and Character and Unicode.Scalar
are inappropriate, and typing out hex literals makes code *less* readable.