Pure Bikeshedding: Raw Strings (why yes, again!)

BigZaphod · July 3, 2018, 4:06pm

What if raw strings were just stored 'externally' and referenced instead of trying to do it all inline? After all, it seems the idea with raw strings is to treat them as a resource. You wouldn't inline an image file, would you?

let str = #string(filename)

Then the string could be found by including the whole filename with the given name, or perhaps it could be sourced from a strings file (something like the asset catalogs that Xcode supports), etc.

xwu · July 3, 2018, 4:09pm

I think this is a fine idea, and at that point we could support both interpolations and escape sequences using the custom delimiter! To be clear, though, it ends up being rather a different feature altogether: we'd be building custom delimiters and custom escape sequences into string literals, but they wouldn't be "raw strings" really. But it's certainly a great alternative approach to addressing the five use cases.

anandabits · July 3, 2018, 4:13pm

This could be useful in some cases but I see it as orthogonal to raw strings. It certainly wouldn't work for the kind of code generation I have been doing.

Nevin · July 3, 2018, 4:19pm

BigZaphod:

What if raw strings were just stored 'externally' and referenced instead of trying to do it all inline? After all, it seems the idea with raw strings is to treat them as a resource. You wouldn't inline an image file, would you?
let str = #string(filename)
Then the string could be found by including the whole filename with the given name, or perhaps it could be sourced from a strings file (something like the asset catalogs that Xcode supports), etc.

This is a great idea, and makes for much cleaner Swift code. It could also take an optional second parameter to specify how to handle escaping:

let str = #string(named: "Asset Name", escapedWith: "")  // no escaping

The default value for the second parameter would be “\”, so standard string interpolation is the default. This fits with progressive disclosure, whereby alternative escape sequences can be ignored until required.

• • •

If we still want in-line raw strings in Swift files, we could use a similar spelling as well. I had been holding out hope that we might land on a simple and elegant syntax such as “any odd number greater than 3 of consecutive double-quote characters” as the delimiter.

But if that is not sufficient because the ability to interpolate is important, then I think we should stick with the established family of compile-time-literal syntax we already have. Something like this:

let str =
#stringLiteral(named: "Raw string example 1", escapedWith: "#####")
// Insert contents of string literal here
#endStringLiteral(named: "Raw string example 1")

Yes, it is verbose, but it is also extremely clear. For a rarely-used feature, clarity at the point of use is vastly more important than brevity.

anandabits · July 3, 2018, 4:20pm

We could extend it this far, but I'm not sure we should without strong motivation. I suspect use cases for using other escape sequences in a pseudo-raw string are sufficiently rare that we don't need to design around them. On the other hand, string interpolation offers an enormous benefit to readability in code generation.

Erica_Sadun · July 3, 2018, 5:19pm

As someone who depends on code generation, I strongly agree. If there is a simple (emphasizing simple and fixed) interpolation solution, I'd really like to incorporate it.

griotspeak · July 3, 2018, 6:51pm

Doesn't this push information about the filesystem into portions of compilation that we might want to keep ignorant of the filesystem?

beccadax · July 3, 2018, 6:56pm

I haven't responded to this before because it's just sort of…okay, I guess? It's perfectly machine-parseable, but I'm worried the "N# token might not be visually distinct enough to help users locate it in a very noisy string. The total namespace of twelve alternate delimiters is probably adequate, but not infinite. (Unless you imagine this would support multi-digit numbers, which extends the available delimiter set but makes them even less visually distinct.) The "N# token is moderately gross, but less than a fully arbitrary delimiter, and it only gets gross when you need a second alternate. It's fine, just kind of meh.

I agree—we should at least mention it and say we don't think it'll be a problem because there's no plausible way to interpret it as a key path. Don't really need much more than that.

Actually, the image literal syntax visually looks like you're inlining an image. But as the proposal says, our use case is strings which belong in the source. They aren't merely an outside resource—they're integral to the source and need to be edited and maintained with it. And in the case of raw strings, they'd be unmaintainable if you had to escape everything.

anandabits · July 3, 2018, 7:00pm

Great! What is your thought on the design I suggested using the custom delimiter? This is a relatively simple enhancement to the design of raw strings with custom delimiters you have in the latest draft.

It is very unlikely that a raw string with a custom delimiter will contain the character sequence \<delimiter>(. If it does need to contain such a sequence, the conflict is easily resolved by tweaking the custom delimiter. This is no different than any other delimiter conflict that might occur in the string content. Further, a conflict is likely to produce a compilation error, calling attention to itself when interpolation is not actually intended. The content following the \<delimiter>( sequence is unlikely to be a valid expression in the current context followed by a closing ) by coincidence.

beccadax · July 3, 2018, 7:02pm

John, Erica, and I have been talking privately. Processing escapes like this is a little trickier than most of what we've tested, but we're going to try to prototype this (but without the leading backslash—it's redundant) and see what we think.

anandabits · July 3, 2018, 7:09pm

Awesome! Are you referring to the backslash in the string interpolation? If you can make it work without that please do.

beccadax · July 3, 2018, 7:12pm

I mean the leading backslash before the entire string literal. It's pretty, but I don't think it actually adds any meaning. We're still hashing out the details, though.

anandabits · July 3, 2018, 7:26pm

I thought that is how you are distinguishing between raw strings and non-raw custom-delimited strings. Are you dropping that distinction? I thought it was a pretty cool part of the design!

johnno1962 · July 3, 2018, 9:45pm

That's the way I saw it. What you are suggesting is additive to the proposal we just published, where a raw string is always prefixed by \. If it has a custom delimiter, you can make the escape character \ active again by following it with the custom delimiter, allowing for selective interpolation or even, as Xiaodi suggests, any of the normal cooked escapes.

\#"\#(this) will interpolate"#
\#"\(this) will not"#
\#"\#n this will be a newline"#
\#"\n this will be two characters \ an n"#
#"this is a cooked string and will \(interpolate)"#

If this is what you were suggesting it seems like a sensible solution to raw-but-can-still-interpolate to me. Brent and Erica have been inspired by this to make a bigger change to the design just proposed which they’re about to present. Over to you Brent and Erica...

anandabits · July 3, 2018, 9:53pm

This is basically what I had in mind, although expanded to work with all escape sequences. I suppose that's reasonable and consistent even if interpolation is really the only motivation with enough importance to do this.

I'm curious to see what additional changes Brent and Erica come up with.

Erica_Sadun · July 3, 2018, 11:36pm

An Alternate String Literal Design

As John already mentioned, Brent and I have come up with the following alternate design based on feedback from this thread and motivated by my production code. I know this is a reach. Any hostility should be directed to me and me alone.

This design moves in a slightly different direction but it takes inspiration from the same place as our most recent proposal draft: Adopt Rust-style delimiters and use them to enable a single mode of raw, cooked, and conventional string literals all using the same grammar.

String Literals

First, a review of what we have been discussing:

A conventional string literal is exactly what you use in Swift today. It allows you to use escape sequences like \\ and \" and \u{n} to express backslashes, quotes, and Unicode scalars, among other special character sequences.
A raw string literal ignores escape sequences. It allows you to paste raw code, meaning the sequence \\\n represents three backslashes followed by the letter n, not a backslash followed by a line feed.
A "cooked" string literal (I believe we take the term from C++) allows you to adapt the leading and trailing delimiters so you can include quote marks within the string but retain interpolated sequences. This allows a string to have content like She said "\(phrase)" to him, where the quotes do not need escaping and phrase is expanded to its evaluated content.
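To make the motivation concrete, here is a minimal sketch that uses only today's conventional literals, so it assumes nothing about the proposed syntax. The names `processed`, `spelledOut`, `phrase`, and `dialog` are illustrative only. It shows how much escaping is needed to get the raw text and the quoted dialog described above.

```swift
// Conventional literal: escape sequences are processed.
let processed = "\\\n"        // one backslash followed by a line feed

// To get the literal text "three backslashes and the letter n",
// every backslash has to be doubled.
let spelledOut = "\\\\\\n"

// Quotes inside a conventional literal must be escaped,
// but interpolation works as usual.
let phrase = "Hello"
let dialog = "She said \"\(phrase)\" to him"

print(processed.count)    // 2
print(spelledOut.count)   // 4
print(dialog)             // She said "Hello" to him
```

Raw and cooked literals aim to remove exactly this doubling and quote escaping.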

Our Design

Our design powers up a conventional String literal and in doing so, allows you to access features normally associated with raw and cooked literals.

In this design, there is only one variety of string literal, with no special "raw" syntax. A string literal is either

a sequence of characters surrounded by double quotation marks ("), or
a string that spans several lines surrounded by three double quotation marks.

These are examples of Swift string literals:

"This is a single line Swift string literal"

"""
    This is a multi line
    Swift string literal
    """

In this form, the revised string design acts exactly like any other string. You use escape sequences including string interpolation exactly as you would today. A backslash escape tells the compiler that a sequence should be interpolated, interpreted as an escaped character, or represent a Unicode scalar. Escape sequences include:

The special characters \0 (null character), \\ (backslash), \t (horizontal tab), \n (line feed), \r (carriage return), \" (double quotation mark) and \' (single quotation mark)
Arbitrary Unicode scalars, written as \u{n}, where n is a 1–8 digit hexadecimal number with a value equal to a valid Unicode code point
Interpolated expressions, introduced by \( and terminated by )
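As a quick illustration of the three kinds of escape sequences listed above, here is a short sketch in conventional literals. The variable names and values are made up for the example.

```swift
// Special character: \t is a horizontal tab.
let columns = "name\tvalue"

// Unicode scalars: U+0048 is "H", U+0069 is "i".
let greeting = "\u{48}\u{69}"

// Interpolation: the expression between \( and ) is evaluated.
let count = 3
let summary = "Found \(count + 1) items"

print(columns.count)   // 10
print(greeting)        // Hi
print(summary)         // Found 4 items
```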

Expanding Delimiters

Our design includes custom string delimiters. You may pad a string literal with one or more # (pound, U+0023) characters:

"This is a Swift string literal"

#"This is also a Swift string literal"#

####"So is this"####

The number of pound signs at the start of the string (in these examples, zero, one, and four) must match the number of pound signs at the end of the string. "This", #"This"#, and ##"This"## represent identical string values.

static-string-literal -> " quoted-text " |
   """ multiline-quoted-text """ |
   # static-string-literal #

Adding pound signs changes the string delimiter, allowing you to "cook" a string and include unescaped double quotes:

#"She said, "This is dialog!""#
// The quoted text is `She said, "This is dialog!"`

If you do add a backslash, it is interpreted as an extra character. This string literal includes both the backslash and both double quote marks inside the string delimiters (#" and "#):

#"A \"quote"."#

If for some reason you need to include #" or "# in your quoted text, adjust the number of delimiter pound signs. This need should be rare.
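The delimiter rules in this section can be sketched as follows, assuming a compiler that accepts the #-delimited literals described here (Swift 5 or later does). The variable names are illustrative only.

```swift
// The same value written with zero, one, and two pound signs.
let plain = "This"
let one = #"This"#
let two = ##"This"##
print(plain == one && one == two)   // true

// With a pound-sign delimiter, double quotes need no escaping.
let dialog = #"She said, "This is dialog!""#
print(dialog)                       // She said, "This is dialog!"

// A backslash before a quote is kept as an ordinary character.
let withBackslash = #"A \"quote"."#
print(withBackslash.count)          // 11

// To include "# in the text, add one more pound sign to the delimiter.
let tricky = ##"Use "# to close a one-pound literal"##
print(tricky)
```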

Escaping

The second, and more impactful, change in this design is that any escape sequence in a string literal must match the number of pound signs used to delimit either end of the string.

Here is the degenerate case. It is a normal string with no pound signs.

"This string has an \(escaped) interpolated item"

Strings using customized delimiters add pound sign(s) after the leading backslash, as in these examples which produce identical results:

#"This string has an \#(escaped) interpolated item"#

####"This string has an \####(escaped) interpolated item"####

The escape sequence delimiter matches the extra delimiters given to the string. Any backslash that is not followed by the correct number of pound signs is treated as raw text. Each of these examples produces the exact characters of the quoted text between the quote marks:

#"This is not \(interpolated)"# 

"This is not \#(interpolated)"

#"This is not \##(interpolated)"#

This escaping rule reproduces the raw string behavior from our original proposal but adds string interpolation on demand. We feel this is a huge feature for code generation applications.
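Here is a small sketch of the matching rule, again assuming a compiler that accepts these literals (Swift 5 or later does). The value of `escaped` is made up for the example. Only the forms I am confident compile are shown; the other two "not interpolated" examples above depend on the rule exactly as drafted here and may be rejected by a shipping compiler.

```swift
let escaped = "value"

// The escape carries the same number of pound signs as the delimiter,
// so all three literals interpolate and produce the same string.
let a = "This string has an \(escaped) interpolated item"
let b = #"This string has an \#(escaped) interpolated item"#
let c = ####"This string has an \####(escaped) interpolated item"####
print(a == b && b == c)   // true

// A bare backslash inside a #-delimited literal is plain text,
// so nothing is interpolated and no variable needs to exist.
let rawText = #"This is not \(interpolated)"#
print(rawText)            // This is not \(interpolated)
```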

Summary

We feel this is a conceptual leap of elegance that simplifies all our workarounds and collapses them into one general solution. It retains Rust-inspired custom delimiters, offers all the features of both "cooked" and "raw" strings, introduces raw string interpolation, and does this all without adding a new special-purpose string type to Swift.

Yes, this approach requires slightly more work than our original design:

You must use pound signs for any raw string.
You must use a more cumbersome interpolation sequence for raw and cooked strings.

Hopefully the tradeoffs are worth it in terms of added expressibility and the resulting design is sufficiently elegant.

anandabits · July 3, 2018, 11:43pm

I think this new design is extremely elegant in its minimalism. It would be very useful in the code I have been writing lately.

The differences compared to the previous design plus custom-delimiter-based interpolation are relatively minor, making it difficult to justify the added complexity of having raw strings, raw strings with custom delimiters and cooked strings (i.e. “normal” strings with custom delimiters).

I wholeheartedly support this update and can’t wait to see it reviewed!

hooman · July 3, 2018, 11:57pm

I haven't thought this through, but it looks very promising. I really like it.

beccadax · July 4, 2018, 12:04am

Just wanted to highlight that, although Erica's examples all show interpolations, this rule applies to other escape sequences as well. In a #"..."# string, \#n produces a newline character, while \n and \##n do not. You probably wouldn't see this too often, though—after all, one of the major reasons to use #"..."# is to avoid having to escape things!
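A minimal sketch of the newline case, assuming a compiler that accepts these literals (Swift 5 or later does). The \##n form is left out because a shipping compiler may not accept it. The names are illustrative only.

```swift
// \#n matches the single-pound delimiter, so it is a real line feed.
let twoLines = #"line one\#nline two"#

// \n has no pound sign, so it stays as a backslash and the letter n.
let oneLine = #"line one\nline two"#

print(twoLines.split(separator: "\n").count)   // 2
print(oneLine)                                 // line one\nline two
```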

nonsensery · July 4, 2018, 12:17am

I think I like this latest proposal, but this part gives me pause:

Erica_Sadun:

Any backslash that is not followed by the correct number of pound signs is treated as raw text. Each of these examples produces the exact characters of the quoted text between the quote marks:
"This is not \#(interpolated)"

For the simple case of a "conventional" string literal, this feels ... wrong?

Compare:

let x = "You're \#1"
// OK; `x` has the value: You're #1

to:

let x = "You're \#1"
//               ^
//               error: invalid escape sequence in literal

(The latter is the behavior in current versions of the language.)