Pure Bikeshedding: Raw Strings (why yes, again!)

Lily_Ballard · June 21, 2018, 7:35pm

I'm a fan of Rust's raw strings. The r#"foo"# syntax works quite well, and in practice you almost never need more than one #.

Erica_Sadun · June 21, 2018, 8:55pm

At this time, I'd like to move forward with a Rust-inspired proposal rewrite. Thank you @huon for putting Rust forward and @Lily_Ballard (among others) for feedback about real-world use.

Although I think many suggestions like @beccadax's Lisp-like approach are clever, I think that to pick a good match for the language we need something clear and concise (and already road-tested) that works for both one-line regexes and multi-line pastes.

Preliminary Design

[signifier][delimiter x n]"<raw string contents>"[delimiter x n]

[signifier][delimiter x n]"""
    <raw string contents>
    """[delimiter x n]

A Swift raw string starts with the signifier, followed by zero or more of the delimiter and a U+0022 (double-quote) character. The raw string body can contain any sequence of Unicode characters and is terminated only by another double-quote character, followed by the same number of delimiter characters that preceded the opening double-quote character.

All Unicode characters contained in the raw string body represent themselves. The double-quote and backslash characters have no special meaning, except when a double-quote is followed by at least as many delimiter characters as were used at the start of the raw string.
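To make the termination rule concrete, here is a minimal sketch of a scanner for the single-line form. It is only an illustration, not the compiler's lexer: the names `RawStringToken` and `scanRawString` are made up for this example, and the `raw` signifier and `#` delimiter are placeholders, since neither has been chosen yet.

```swift
/// Result of scanning one raw string literal from the front of `source`.
/// (Hypothetical type, used only for this sketch.)
struct RawStringToken: Equatable {
    let body: String        // the literal's contents, taken verbatim
    let delimiterCount: Int // how many delimiters surrounded the quotes
}

/// Scans a single-line literal of the form
/// [signifier][delimiter x n]"<contents>"[delimiter x n].
/// `signifier` and `delimiter` are assumptions; the thread has not settled on either.
func scanRawString(_ source: String,
                   signifier: String = "raw",
                   delimiter: Character = "#") -> RawStringToken? {
    guard source.hasPrefix(signifier) else { return nil }
    let chars = Array(source.dropFirst(signifier.count))
    var index = 0

    // Count the opening delimiters (zero is allowed).
    while index < chars.count, chars[index] == delimiter { index += 1 }
    let count = index

    // The opening double quote must follow immediately.
    guard index < chars.count, chars[index] == "\"" else { return nil }
    index += 1
    let bodyStart = index

    // Only a double quote followed by at least `count` delimiters ends the body.
    while index < chars.count {
        if chars[index] == "\"" {
            let closeEnd = index + 1 + count
            if closeEnd <= chars.count,
               chars[(index + 1)..<closeEnd].allSatisfy({ $0 == delimiter }) {
                return RawStringToken(body: String(chars[bodyStart..<index]),
                                      delimiterCount: count)
            }
        }
        index += 1
    }
    return nil // unterminated literal
}
```

With one delimiter, the interior quotes in `raw#"a "raw" string"#` pass through untouched, because none of them is followed by a `#`.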

Naming the signifier and the delimiter

Based on the review feedback, the signifier should not be r. We can go as simple as raw or consider a prefixed attribute like @raw or @rawString. (Concision is good.) Sound off. I'll keep notes. Don't worry about which one is your favorite yet, but it would help if you stated why.

As a delimiter, the pound sign can read "heavy" as it occupies a lot of the space given to present each character. It is also normally associated with keywords as a prefix, which may complicate implementation. The pound sign is rarely used, and when used rarely exceeds one use at each end of the string. If you have a better candidate that does not interfere with Swift parsing (or if you support #), please speak up and say so.

Thank you in advance for your focused feedback.

johnno1962 · June 21, 2018, 11:57pm

My +1 to any of the following (in order of preference)

Just adopt Rust’s convention

r"a \raw string"
r#"a "raw" string"#
r##"a "#raw" string"##

Using a prefix frees up the optional custom delimiter a little

@raw"a \raw string"
@raw#"a "raw" string"#
@raw("a "raw" string")

#raw would be just as valid.

#raw"a \raw string"
#raw#"a "raw" string"#
#raw("a "#raw" string")

The latter leaves the door open to something like:

@rawInterpolating"a raw string but \(interpolating)"

Dante-Broggi · June 22, 2018, 12:22am

An orthogonal question to the syntax of raw strings is whether the following would be valid for raw strings. It currently is for normal strings, but I consider it a bug: SR-6920

let quote = "̠"

These are the Unicode scalars:
{LATIN SMALL LETTER L}{LATIN SMALL LETTER E}{LATIN SMALL LETTER T}{SPACE}{LATIN SMALL LETTER Q}{LATIN SMALL LETTER U}{LATIN SMALL LETTER O}{LATIN SMALL LETTER T}{LATIN SMALL LETTER E}{SPACE}{EQUALS SIGN}{SPACE}{QUOTATION MARK}{COMBINING MINUS SIGN BELOW}{QUOTATION MARK}

xwu · June 22, 2018, 12:27am

I think, as an orthogonal question, this is best addressed in a separate thread. It's an important topic, though.

jawbroken · June 22, 2018, 1:56am

I don't see this as much of a downside, because calling attention to the different parsing rules seems like it might even be a good thing. I like # as a delimiter modifier, and would rather just have one simple form than open it up to arbitrary ones. I don't see the need to support #raw("a "#raw" string") and similar from @johnno1962's examples.

One question that is still open in my mind is if this should be supported for regular string literals. Is this desirable (and implementable):

let s = #"non-raw string with "quotation marks" that don't need to be escaped"#

If this is not desirable then why do we need a r/raw/#raw signifier at all, i.e. why isn't the above the syntax for a raw string, always requiring at least one #? Perhaps it isn't clear enough that the string is raw, or perhaps isn't implementable (ambiguity or other parsing difficulties?), or perhaps someone thinks there will be a need for further types of string in future so they want a syntax that generalises well. Does anyone have any opinions about this? Will further types of strings be required, like the “raw string with interpolation” that @johnno1962 mentions? My feeling is “hopefully not”.

I would also appreciate it if someone would weigh in on the difficulty of implementation here, and perhaps how well this will be tolerated by external tools like text editors. What is your experience there for Rust raw strings, @Lily_Ballard?

beccadax · June 22, 2018, 3:42am

I’m going to throw out two ideas that are really tempting, but might be wrong:

1. The signifier should be \. Mnemonic: You’re escaping the entire string at once, instead of escaping individual things inside the string.
2. I’m not at all convinced that we should only support alternate delimiters on multiline strings, but if we go that direction, the obvious alternate delimiter is simply adding more quote marks than the minimum of three. This has the right visual weight and doesn’t have the “punctuation soup” effect of many alternatives. (Also, if we do this, the alternate delimiter feature should probably be orthogonal to raw strings.)

jawbroken · June 22, 2018, 4:03am

You're right that 1. is a tempting interpretation. Does this conflict with key paths though? One of the benefits of # is that it's not allowed to be used in an operator, for various good reasons. I just checked if the same was true of \ and the compiler was confused about whether I was trying to use a key path. Even if it's not technically ambiguous, it might be confusing to have two very different uses for \ outside of string literals. I guess it will always appear right next to the quotation marks, \" or \""" so perhaps it will appear grouped enough with them to not be interpreted separately.

Lily_Ballard · June 22, 2018, 4:23am

I don't see why we should intentionally design a raw string format that cannot contain every possible typeable string. I should be able to take any raw string in my source and wrap it in a raw string if I want to. This is trivially solved by Rust's solution, e.g. if I have r"foo\nbar" I can wrap that in r#"r"foo\nbar""#, which can then itself be wrapped in r##"r#"r"foo\nbar""#"##, etc. It's still a simple rule, and it's rare enough in practice to hit this that most people won't even need to concern themselves with it (and the people that do need this will appreciate having it).
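The wrapping rule can be sketched in a few lines of Swift. This is only an illustration under assumptions: `delimitersNeeded` and `wrapRaw` are names made up for this example, and the default `r` signifier mirrors the Rust spelling rather than anything decided for Swift.

```swift
/// Returns the smallest number of delimiters that lets `body` appear verbatim
/// between the quotes without terminating the literal early.
/// (Hypothetical helper for this sketch.)
func delimitersNeeded(for body: String, delimiter: Character = "#") -> Int {
    let chars = Array(body)
    var needed = 0
    var index = 0
    while index < chars.count {
        if chars[index] == "\"" {
            // Measure the run of delimiters directly after this quote.
            var run = 0
            while index + 1 + run < chars.count, chars[index + 1 + run] == delimiter {
                run += 1
            }
            // A quote followed by `run` delimiters would close any literal
            // that uses `run` or fewer delimiters, so one more is required.
            needed = max(needed, run + 1)
        }
        index += 1
    }
    return needed
}

/// Wraps `body` in a Rust-style raw string literal with just enough delimiters.
func wrapRaw(_ body: String, signifier: String = "r", delimiter: Character = "#") -> String {
    let fence = String(repeating: delimiter,
                       count: delimitersNeeded(for: body, delimiter: delimiter))
    return signifier + fence + "\"" + body + "\"" + fence
}
```

Applying `wrapRaw` to r"foo\nbar" and then to its own result reproduces the two nested literals above, adding exactly one # per level.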

I was never directly involved in adding support for raw strings to editors for Rust, but I don't recall hearing anyone complain about it being difficult. In particular, I think it's pretty darn easy with Vim's syntax definition language to handle.

That would make it impossible to have a multiline string whose text begins with a quotation mark.

Interesting idea, but also strikes me as a bit weird. And it wouldn't really be consistent with any potential future support for other string literal prefixes.

For comparison, Rust also has b"foo" for byte strings (this produces a &'static [u8; N]). I could imagine us wanting to do something similar.

jawbroken · June 22, 2018, 4:30am

I think you misunderstood me. To clarify, I support the repeated delimiter solution, but not the alternative custom delimiters like the () in the example I quoted.

Thanks for the example of another custom string-like thing that might require a different prefix. I'm not sure this would be the solution chosen for Swift, because I guess it would turn the string literal into a different type entirely, no longer compatible with ExpressibleByStringLiteral types.

beccadax · June 22, 2018, 5:06am

A multiline string is delimited by three quotes and a newline. Going to N quotes and a newline is fine, because the newline delimits the delimiter.

Lily_Ballard · June 22, 2018, 5:31am

Good point. But in that case I'd very strongly argue against restricting raw string literals to multiline strings, because that would make it much more awkward to use for no benefit whatsoever.

cukr · June 22, 2018, 6:26am

The Swifty way to do that would be making Array ExpressibleByStringLiteral and using it like this: "foo" as [UInt8]
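Literal coercion of this kind goes through ExpressibleByStringLiteral. A minimal sketch follows, using a hypothetical wrapper type `ByteString` (not part of the standard library) instead of extending Array itself, so the example stays self-contained:

```swift
/// Hypothetical wrapper type for this sketch; not a standard library type.
struct ByteString: ExpressibleByStringLiteral, Equatable {
    let bytes: [UInt8]

    // The compiler calls this when a string literal appears where a
    // ByteString is expected.
    init(stringLiteral value: String) {
        bytes = Array(value.utf8)
    }
}

// Both spellings pick the literal initializer above.
let viaAnnotation: ByteString = "foo"
let viaCoercion = "foo" as ByteString
```

The literal stays an ordinary string literal in the source; only the inferred type changes, so no new prefix is needed for this case.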

griotspeak · June 22, 2018, 8:08am

Would #N" work, where N is an integer that must match at the closing? A custom delimiter, but restricted in complexity.

mdiep · June 22, 2018, 1:47pm

You mean you want to do this?

/// If you need to use multiline raw strings, prefix `"""` with raw
let multiLineText = raw####"""
    /// ### Usage
    /// For example, lazily call `foo`
    /// ```
    /// mySequence.lazy.foo({ "$\($0).00" })
    /// ```
    """####

You could do that just like Markdown does—by increasing the number of backticks used to delimit the string.

let multiLineText = ````
    /// ### Usage
    /// For example, lazily call `foo`
    /// ```
    /// mySequence.lazy.foo({ "$\($0).00" })
    /// ```
    ````

johnno1962 · June 22, 2018, 2:17pm

Looks like I was unclear -- my current preference is to adopt Rust’s well-researched and proven design as is for raw strings, as it is concise and sufficiently flexible for most if not all likely use cases.

The only problem, as @Erica_Sadun mentions, is that in the last review there was quite a lot of pushback against using just “r” as the signifier, as it wasn’t “Swifty” enough. Using @raw vs. #raw is a moot point, but I’d prefer either over just “raw” or \. Being more specific, these longer signifiers allow more flexibility with any optional delimiter for the Lexer, but need not do so.

As to ease of implementation: if the syntax is unambiguous, they are all implementable. You don’t need to know the specific Lexer code (which is here); just ask yourself whether, as the compiler scans from left to right and decides which type of token is starting, the syntax is ambiguous against another language construct in Swift. As for naïve editor support, or rather designing raw strings so that most existing editors cope with them as they do with multiline strings: provided the raw string contains pairs of “ characters, things don’t go too far wrong.

Erica_Sadun · June 22, 2018, 9:11pm

Could we separate @raw from the opening quote (or pound) and treat it as an attribute on the string? Or is it possible simply to look for (#+)".*"($1) and skip the keyword entirely?
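As a rough feasibility check of the keyword-free form, here is a sketch using NSRegularExpression. It rests on a few assumptions: the function name `firstPoundDelimitedBody` is made up, the backreference is written `\1` (which is what the ICU regex engine expects inside a pattern, rather than `$1`), and `.*` is made non-greedy so the match stops at the first valid closing delimiter.

```swift
import Foundation

/// Finds the first `#..."..."#...` literal in `source` and returns its body.
/// (Hypothetical helper for this sketch; a real lexer would not use a regex.)
func firstPoundDelimitedBody(in source: String) -> String? {
    // Group 1 captures the opening pounds; \1 requires the same run of
    // pounds right after the closing quote. Group 2 is the body.
    let pattern = "(#+)\"(.*?)\"\\1"
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.dotMatchesLineSeparators]) else {
        return nil
    }
    let whole = NSRange(source.startIndex..., in: source)
    guard let match = regex.firstMatch(in: source, options: [], range: whole),
          let bodyRange = Range(match.range(at: 2), in: source) else {
        return nil
    }
    return String(source[bodyRange])
}
```

Because the pattern needs at least one #, it only covers the always-delimited variant. It also knows nothing about comments or ordinary string literals, so it shows that the shape is recognizable, not that the grammar is unambiguous.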

beccadax · June 22, 2018, 10:22pm

I'm very hesitant about @raw because currently, I believe @ is only used in types and declarations, not in expressions. Swift usually uses # when we want to introduce some kind of special keyword in an expression, and we usually treat it syntactically as though it were either a variable or a function call. That's where my suggestion of #raw("") (and #raw(((""))) by extension) came from.

masters3d · June 22, 2018, 11:09pm

We just went over this with #unknown vs @unknown

I like @raw ”””” way better

johnno1962 · June 23, 2018, 5:17pm

I feel this is coming down to a two-horse race. The first contender is the original proposal based on Python’s r"", reinvigorated by Rust’s addition of zero or more # characters around the string to satisfy the need for custom delimiters. I agree with @beccadax that @ is only used in declarations, so the second contender is #rawSOMETHING"a raw string"SOMETHING. SOMETHING could be zero or more # characters, even if it looks a bit odd, or perhaps a more general case: "(" followed by anything not containing a double quote, mirrored at the end somehow, though some may want to restrict this. Examples could include:

#raw###"a string"###

#raw"a string"
#raw("a string")
#raw(SQL"a string"SQL)

I suggest we focus on these two(three?) alternatives and try to decide between them or, as @jawbroken pithily put it:

If this pitch thread doesn’t reach a conclusion, I’d recommend putting a proposal with both forward for review and letting the process/Core Team decide.