[Pitch] Raw mode string literals

One other minor and obscure point: if the compiler is aware of the regex grammar, it can properly type the matches. I can imagine the following cases:

if case /(let name: [a-zA-Z]+) (let count: Int)/ = getSomeString() {
   print(name, count)
}

-> name has type String, count has type Int (and matches [0-9]+)

if case /(let name: [a-zA-Z]+)? (let count: Int)/ = getSomeString() {
   print(name, count)
}

-> name has type String?

if case /(let name: [a-zA-Z]+)* (let count: Int)/ = getSomeString() {
   print(name, count)
}

-> name has type [String]

etc. Even if we don’t have a “default regex” for types, it would still be awesome to be able to write:

if case /(let name: [a-zA-Z]+) (let count: Int: [0-9]+)/ = getSomeString() {
   print(name, count)
}

and have that transparently invoke the failable Int(_:) initializer on the matched string and check the result.
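For comparison, here is a rough sketch of what the last case has to look like today when written by hand. It assumes Foundation's NSRegularExpression, anchors the pattern to the whole string (the pitch does not say whether `if case` would), and uses a made-up helper name, parseNameAndCount; none of this is proposed API.

```swift
import Foundation

// Hypothetical helper: emulates `/(let name: [a-zA-Z]+) (let count: Int: [0-9]+)/`
// with existing APIs. Returns nil if the pattern does not match the whole
// string, or if the digits do not fit in an Int.
func parseNameAndCount(_ input: String) -> (name: String, count: Int)? {
    guard let regex = try? NSRegularExpression(pattern: "^([a-zA-Z]+) ([0-9]+)$") else {
        return nil
    }
    let whole = NSRange(input.startIndex..<input.endIndex, in: input)
    guard let match = regex.firstMatch(in: input, options: [], range: whole),
          let nameRange = Range(match.range(at: 1), in: input),
          let countRange = Range(match.range(at: 2), in: input),
          // The failable initializer performs the check that the pitch would
          // have the compiler insert automatically.
          let count = Int(input[countRange]) else {
        return nil
    }
    return (String(input[nameRange]), count)
}

if let result = parseNameAndCount("widgets 42") {
    print(result.name, result.count)  // widgets 42
}
```

Note that a string of digits too long for Int matches the regex but still yields nil, which is the "check" half of "invoke and check".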

-Chris

···

On Nov 24, 2017, at 4:15 PM, Chris Lattner <clattner@nondot.org> wrote:

than the same type having a collection of named matches using the usual Perl syntax?

  if case /(?<firstName>[a-zA-Z]+) (?<lastName>[a-zA-Z]+)/ = getSomeString() {
    print(Regex.captured["firstName"], Regex.captured["lastName"])
  }

Personally, I really don’t like this. It turns a structured problem into one that violates DRY and loses the structure inherent in the solution. Also, while theoretically the dictionary could be optimized away, in practice that would be difficult to do without heroics.
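To make the DRY point concrete, here is a small sketch of the dictionary-of-captures style. `Regex.captured` does not exist, so this uses a hypothetical stand-in function, capturedByName, built on Foundation's NSRegularExpression with numbered groups and a caller-supplied list of names.

```swift
import Foundation

// Hypothetical stand-in for `Regex.captured`: pairs each capture group with a
// caller-supplied name and returns the first match's groups as a dictionary.
func capturedByName(pattern: String, names: [String], in input: String) -> [String: String] {
    var captured: [String: String] = [:]
    let whole = NSRange(input.startIndex..<input.endIndex, in: input)
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: input, options: [], range: whole) else {
        return captured
    }
    // Group 0 is the whole match, so name i belongs to group i + 1.
    for (offset, name) in names.enumerated() where offset + 1 < match.numberOfRanges {
        if let range = Range(match.range(at: offset + 1), in: input) {
            captured[name] = String(input[range])
        }
    }
    return captured
}

let nameParts = capturedByName(pattern: "([a-zA-Z]+) ([a-zA-Z]+)",
                               names: ["firstName", "lastName"],
                               in: "Grace Hopper")
// Each name is spelled once where it is captured and again at every use.
// A typo such as "fristName" compiles fine and only shows up at run time as nil.
print(nameParts["firstName"] ?? "?", nameParts["lastName"] ?? "?")
```

Every lookup is also Optional and String-typed, so the structure the pattern already expresses has to be re-established by the caller.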

Yes, these are very interesting options to explore, and you're right that
if we want to go down this road, then we'd need to imbue regex literals
with certain "smarts" as opposed to having lenient regex literal parsing
that entirely defers validation to a concrete regex type conforming to
ExpressibleByRegularExpressionLiteral.

I don't think it's an all-or-nothing proposition, though, as to whether the
literal or the conforming type performs the validation. Personally, I think
one of the strengths of Swift's literals is that they are intrinsically
untyped and that multiple concrete types are expressible by them. Whether
or not we think one or another regex syntax is best doesn't necessarily
mean we need to preclude other regex engines from interacting with a regex
literal. Rather, just like string interpolation literals allow the compiler
to parse some "stuff" inside the quotation marks, we can have some syntax
that allows for regex patterns to have segments parsed by the compiler for
binding without locking down regex syntax entirely. For instance, just as
the compiler parses `\(...)` inside string literals, suppose it parses
`(let...)` and `(var...)` inside regex literals.
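As a very rough illustration of that split, the sketch below pulls the `(let name: ` and `(var name: ` markers out of a pattern and leaves everything else untouched for whichever regex engine consumes it. The function name extractBindings is made up, it runs at run time on a plain String rather than in the compiler, and it ignores the typed form such as `(let count: Int)`.

```swift
import Foundation

// Hypothetical preprocessing step: strips `(let name: ` / `(var name: ` binding
// markers from a pattern, leaving an ordinary capture group behind, and
// records the binding names in the order they appear.
func extractBindings(from pattern: String) -> (pattern: String, names: [String]) {
    // The marker pattern is a fixed, valid regex, so `try!` cannot fail here.
    let marker = try! NSRegularExpression(pattern: "\\((?:let|var) ([A-Za-z_][A-Za-z0-9_]*): ")
    let whole = NSRange(pattern.startIndex..<pattern.endIndex, in: pattern)
    var names: [String] = []
    for match in marker.matches(in: pattern, options: [], range: whole) {
        if let range = Range(match.range(at: 1), in: pattern) {
            names.append(String(pattern[range]))
        }
    }
    let stripped = marker.stringByReplacingMatches(in: pattern, options: [],
                                                   range: whole, withTemplate: "(")
    return (stripped, names)
}

let lowered = extractBindings(from: "(let name: [a-zA-Z]+) (let count: [0-9]+)")
print(lowered.pattern)  // ([a-zA-Z]+) ([0-9]+)
print(lowered.names)    // ["name", "count"]
```

One caveat: if the pattern also contains ordinary capture groups, the recorded names no longer line up one-to-one with group numbers, so a real design would need to track group indices as well.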

···

Well, this hasn’t exactly gone in the direction I was expecting. Are raw literals off the table?

I have plenty of opinions about how to integrate Regex though. It’s not the literal that’s
the problem but the operators that surround it, and these operators need to let you
check for a match, get the match or groups, replace matches, iterate over matches, etc.

One notion I explored was the idea that, in the same way an index or key of a collection
refers to a subset of the data, subscripting a string with a string (regex) refers to
a subset of the string. It’s also atomic, so you don’t have to worry about precedence.

Bear with me; it works out better than you might expect once you get used to the idea.
It’s certainly very succinct for common cases.

var input = "Now is the time for all good men to come to the aid of the party"

// basic regex match
input["\\w+"] != nil

// replace by assignment
input["men"] = "folk"
print(input)

// individual groups can be accessed
input["(all) (\\w+)", 2]

// and assigned to
input["the (\\w+)", 1] = "_$1_"
print(input)

Operators only get you so far though. This is the point where I wished
it was possible to define a setter for a subscript type without a getter.
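For anyone who wants to try the subscript idea without the playground, here is a minimal sketch of the single-pattern form only. It is not the playground's implementation: it assumes Foundation's NSRegularExpression, reads or replaces just the first match, treats the replacement as literal text (no `$1` templates), and the helper name rangeOfFirstRegexMatch is invented for this example.

```swift
import Foundation

extension String {
    // Hypothetical helper: range of the first match of `pattern`, or nil if
    // the pattern is invalid or does not match.
    func rangeOfFirstRegexMatch(_ pattern: String) -> Range<String.Index>? {
        let whole = NSRange(startIndex..<endIndex, in: self)
        guard let regex = try? NSRegularExpression(pattern: pattern),
              let match = regex.firstMatch(in: self, options: [], range: whole) else {
            return nil
        }
        return Range(match.range, in: self)
    }

    // Reading returns the first match; assigning replaces it. Assigning nil,
    // or using a pattern that does not match, leaves the string unchanged.
    subscript(pattern: String) -> String? {
        get {
            guard let range = rangeOfFirstRegexMatch(pattern) else { return nil }
            return String(self[range])
        }
        set {
            guard let replacement = newValue,
                  let range = rangeOfFirstRegexMatch(pattern) else { return }
            replaceSubrange(range, with: replacement)
        }
    }
}

var sentence = "Now is the time for all good men"
print(sentence["\\w+"] ?? "no match")  // Now
sentence["men"] = "folk"
print(sentence)  // Now is the time for all good folk
```

The getter is there only because Swift requires one; a write-only subscript is exactly what the paragraph above wishes for.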

// capitalising words using closure
print(input.replacing(pattern: "(_?)(\\w)(\\w*)") {
    (groups, stop) in
    return groups[1]!+groups[2]!.uppercased()+groups[3]!
})

// parsing a properties file
let props = """
    name1 = value1
    name2 = value2
    """

var params = [String: String]()
for groups in props.matching(pattern: "(\\w+)\\s*=\\s*(.*)") {
    params[String(groups[1]!)] = String(groups[2]!)
}
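The loop above depends on a `matching(pattern:)` method from the playground. A minimal sketch of one way such a method could be written follows; it assumes Foundation's NSRegularExpression and returns an eager array rather than whatever sequence type the playground actually uses.

```swift
import Foundation

extension String {
    // Hypothetical `matching(pattern:)`: one array per match, where element 0
    // is the whole match and element n is capture group n (nil if that group
    // did not take part in the match). An invalid pattern yields no matches.
    func matching(pattern: String) -> [[Substring?]] {
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
        let whole = NSRange(startIndex..<endIndex, in: self)
        var results: [[Substring?]] = []
        for match in regex.matches(in: self, options: [], range: whole) {
            var groups: [Substring?] = []
            for index in 0..<match.numberOfRanges {
                if let range = Range(match.range(at: index), in: self) {
                    groups.append(self[range])
                } else {
                    groups.append(nil)
                }
            }
            results.append(groups)
        }
        return results
    }
}

let settings = "name1 = value1\nname2 = value2"
var parsed = [String: String]()
for groups in settings.matching(pattern: "(\\w+)\\s*=\\s*(.*)") {
    // Unwrapping with `if let` avoids the force-unwraps used above.
    if let key = groups[1], let value = groups[2] {
        parsed[String(key)] = String(value)
    }
}
print(parsed)
```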

I prepared a playground taking this idea to its illogical conclusion.

John

SwiftRegex4.playground.zip (16 KB)

···

Right, but the string literal syntaxes we have (single and multiline) do not allow different grammars (e.g. escape sequences) depending on the type they are inferred as. Wouldn’t it be odd if a string literal accepted “\x12\u1212\U00001212” when it converts to a “const char *” but accepted “\u{12345}” when passed to a bridged Dart API?

-Chris

···

On Nov 24, 2017, at 7:52 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

Yes, these are very interesting options to explore, and you're right that if we want to go down this road, then we'd need to imbue regex literals with certain "smarts" as opposed to having lenient regex literal parsing that entirely defers validation to a concrete regex type conforming to ExpressibleByRegularExpressionLiteral.

I don't think it's an all-or-nothing proposition, though, as to whether the literal or the conforming type performs the validation. Personally, I think one of the strengths of Swift's literals is that they are intrinsically untyped and that multiple concrete types are expressible by them.

...And here we come full circle. The original proposal is precisely to have
a different type of string literal that accepts/rejects different escape
sequences. In my initial reply, I wrote that (should raw strings be
sufficiently motivated that some sort of solution is clearly desirable) one
avenue to explore is redesigning literals to allow conforming types to
access the "raw" literal, free of all but the most minimal processing, so
that the type can choose the grammar rather than the literal. In so doing,
we avoid having to hardcode new "flavors" of string literal.
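To sketch what "the type chooses the grammar" could mean, the example below invents a protocol, ExpressibleByRawLiteralText, whose initializer receives the literal's source text with no escape processing. Nothing like it exists in Swift today; the two conforming types and their names are also made up, and the literal text is passed in as an ordinary String because the compiler hook is the missing piece.

```swift
// Hypothetical protocol: the compiler would hand the conforming type the
// literal's source text exactly as written, and the type decides what is valid.
protocol ExpressibleByRawLiteralText {
    init?(rawLiteralText: String)
}

// Accepts C-style `\xNN` escapes only; any other backslash sequence is rejected.
struct CStyleBytes: ExpressibleByRawLiteralText {
    var bytes: [UInt8] = []

    init?(rawLiteralText: String) {
        var rest = Substring(rawLiteralText)
        while let ch = rest.first {
            rest = rest.dropFirst()
            if ch != "\\" {
                bytes.append(contentsOf: String(ch).utf8)
                continue
            }
            // After a backslash, require `x` followed by exactly two hex digits.
            let digits = rest.dropFirst().prefix(2)
            guard rest.first == "x", digits.count == 2,
                  digits.allSatisfy({ $0.isHexDigit }),
                  let byte = UInt8(digits, radix: 16) else { return nil }
            bytes.append(byte)
            rest = rest.dropFirst(3)
        }
    }
}

// Accepts Swift-style `\u{...}` escapes only.
struct SwiftStyleText: ExpressibleByRawLiteralText {
    var text = ""

    init?(rawLiteralText: String) {
        var rest = Substring(rawLiteralText)
        while let ch = rest.first {
            rest = rest.dropFirst()
            if ch != "\\" {
                text.append(ch)
                continue
            }
            // After a backslash, require `u{`, hex digits, and a closing `}`.
            guard rest.hasPrefix("u{"), let close = rest.firstIndex(of: "}"),
                  let value = UInt32(rest[rest.index(rest.startIndex, offsetBy: 2)..<close], radix: 16),
                  let scalar = Unicode.Scalar(value) else { return nil }
            text.unicodeScalars.append(scalar)
            rest = rest[rest.index(after: close)...]
        }
    }
}

let cSource = "\\x41\\x42"  // the six-character-per-escape text \x41\x42
print(CStyleBytes(rawLiteralText: cSource)?.bytes ?? [])      // [65, 66]
print(SwiftStyleText(rawLiteralText: cSource) == nil)         // true
print(SwiftStyleText(rawLiteralText: "\\u{48}i")?.text ?? "") // Hi
```

The same source text is accepted by one type and rejected by the other, which is the behavior under debate here.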

It is precisely the observation of these repeated requests for new flavors of
string literals, as well as the existing shortcomings of integer and float
literals in supporting BigInt and Decimal types, that leads me to think
that we should allow exactly what you describe as "odd."
