Pure Bikeshedding: Raw Strings (why yes, again!)

Erica_Sadun · July 1, 2018, 7:32pm

I believe \ would be far Swiftier than raw. It already carries the Swift connotation of escaping, in this case escaping the entire string rather than just the next character within a string. Further, adding # as a Rust-eze delimiter covers all cases, both single-line and multi-line.

Removing raw as a prefix would address the Core Team's concerns about the form not fitting naturally into the language and allow us to create raw string literals for both string literal initializable and RawStringLiteralInitializable, for example, for a Regex implementation.

beccadax · July 1, 2018, 9:15pm

I mentioned backslash positively way upthread, and I still like it. The main objection I heard was that people might think it has something to do with key paths, but in practice I doubt that will happen—it’s hard to think of a plausible interpretation.

Tino · July 1, 2018, 9:48pm

I'm a little bit disappointed that we discussed one aspect of raw strings ad nauseam (actually, I could finish the post here... ;-), but ignored another:
One of the few things that is possibly clear is that we want custom character sequences to terminate a string, and there's just no agreement how those should be defined.

But throughout the debate, there was silent consent that the character sequence that's used for escaping and string interpolation just gets disabled without replacement.
Therefore I looked into the motivation section to check why we actually want raw strings:

I'm not sure about the importance of raw strings for education (aren't Playgrounds supposed to be used for that?)
I don't think we should care for Windows paths - not because I don't care for Windows users, but because using bare strings to identify a location on a volume imho is worse than using platform neutral alternatives like NSURL
I don't know about Dialogue
Regular expressions are a separate story

That leaves two use cases that imho belong in the same bucket: Including code - either from the host language, or something like XML or SQL.
I think for all those metaprogramming tasks, string interpolation is extremely valuable.
Afaik, Rust doesn't have interpolation for either kind of string, but we would have to travel back in time to the dark ages of strfmt... do we want to go there?

jawbroken · July 2, 2018, 4:37am

If you're including code from the host language then having string interpolation enabled is going to necessarily cause issues, so a true raw string seems worthwhile to me. There might be some uses for a string type that disables escaping but still supports interpolation, which is one of the reasons I was asking if we want the syntax to be extensible to multiple types of string. However, I think raw-with-interpolation is perhaps less valuable because you can handle a lot of use cases by e.g. having a header and footer raw string, then using interpolation and regular strings to generate the “middle” content. And you can mix these on a single line in ways that are easy to optimise at compile time, e.g. raw"rawHeader" + "middle \(content)" + raw"rawFooter" or \"rawHeader" + "middle \(content)" + \"rawFooter" or whatever the syntax ends up being.
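To make the header/footer approach concrete, here is a minimal sketch. It assumes a toolchain where #-delimited literals are available (Swift 5 or later), which stands in for whichever raw spelling is finally chosen. The names `content`, `header`, `footer` and `page` are made up for illustration.

```swift
// Assumption: Swift 5 or later, where #"..."# delimits a literal in which
// backslashes and double quotes need no escaping.
let content = "generated body"  // hypothetical value to interpolate

// Raw pieces: the backslashes and inner quotes are taken literally.
let header = #"<!-- C:\templates\page "v1" -->"#
let footer = #"<!-- end of \templates\page -->"#

// Only the regular string in the middle uses interpolation.
let page = header + "\n  middle \(content)\n" + footer
print(page)
```

Whether this reads better than a single escaped string probably depends on how large the raw pieces are compared with the interpolated part.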

Tino · July 2, 2018, 8:16am

That would be what we have to do... I tried, and I think it's awful, so I would rather add some escapes than break the string into pieces (unless there are really big chunks that contain special characters and don't need interpolation).
For things like Windows paths, "true" raw strings definitely only make sense for completely static content, and for many other scenarios, there will be a tough decision, because it isn't obvious whether raw strings and concatenation have a better tradeoff than normal strings and escaping.

Did anyone post a real-world example for the use of raw strings yet? Maybe @Erica_Sadun could show what her current solutions look like, so that we can evaluate the discussed variants?

ezfe · July 2, 2018, 9:48pm

I like this idea and it is very similar to what I was about to write. I think that spelling out raw (versus r) and using some character (the # in your example) to differentiate it is a good way forward.

lancep · July 2, 2018, 11:42pm

johnno1962:

There is one other possibility: using \ as the raw string differentiator which, while a bit “magical” and not extensible does at least have some semantic sense to it. A sort of “escape the string as a whole”.
\"a raw string with \ in it"
\##"""
    a raw string with """ in it
    """##
Any support for \ over raw as a prefix? It has a certain appeal - would this be “swiftier”?

This is probably my favorite idea so far

johnno1962 · July 2, 2018, 11:57pm

Concrete proposal

Rather than have this thread ruminate indefinitely and to give it some focus @Erica_Sadun, @beccadax and I have prepared a proposal which we would like to collect one last round of feedback on before putting forward for review.

To cut a long proposal short: given the varying opinions on #raw, raw, r etc., we propose that the “differentiator” prefix for a raw string be the \ character, for which there seemed to be some support, and that the custom delimiter be zero or more # characters surrounding the quotes of the single-line or multiline string, Rust style. I’ve been using a toolchain with the \ character, and while it’s not as explicit or self-explanatory as some might have liked, it does carry some sense, is memorable and pleasingly concise.

The following are valid raw strings according to the proposal:

\"a raw string"
\#"a "raw" string"#
\"""
	a multiline raw string
	"""

And the proposal introduces custom delimiters to normal cooked Swift strings.

#"a "cooked" string"#
##"""
	print(#"""
		a multiline cooked string with """ in it
		"""#)
	"""##

If there is anything you strongly disagree with in the proposal let us know - better now than during review - but please mention the alternative you preferred.

The revised proposal in full is here.

Thanks everybody for your input and a big thanks to @Erica_Sadun for relaunching the pitch.

xwu · July 3, 2018, 1:01am

This is an excellent update! The proposed syntax is quite elegant and the revised proposal has improved significantly. I'm very glad to see that alternative delimiters and raw strings have been made orthogonal features. Overall, it feels very "swifty" in its current iteration.

I do want to present a significant alternative solution to the stated problems; I sketched this out briefly in a message to @Erica_Sadun but will flesh it out here. To be clear, I wouldn't be distraught if it isn't acclaimed to be a superior solution: your proposal is excellent. However, I do think that your proposal would be strengthened by considering such an alternative if only to weigh it fully against your own solution--

Consider the five stated use cases of this proposal:
Metaprogramming, regular expressions, pedagogy (specifically: code snippets), formatted data and DSLs, Windows paths.

It is true that all of these benefit from not requiring escaping. However, raw strings (as their name suggests) have fewer syntactic constraints than their "cooked" counterparts, whereas all of these use cases involve contents that have more syntactic constraints than arbitrary free text.

Now, consider the following two features of Swift:

/*
  This is a comment.
  /* This is a nested comment. */
  In C, this wouldn't do the intuitive thing!
*/

#if false
val x = 42 // Error: Consecutive statements on a line must be separated by ';'
// Swift parses inside conditional compilation blocks!
#endif

These examples show that, where possible, Swift helps the user do "the right thing" even in cases where what's written isn't compiled. Other languages don't always do so, and we might rightly regard this as a "swifty" characteristic of the language.

Can we provide the same benefits for the five motivating use cases? I think so. Let's consider a new species of literal, which I will call the code literal:

Just as Double is expressible by either integer literals or float literals, so too would String be expressible by either string literals or code literals.

The code literal would take essentially the same syntax as that of code blocks in GitHub-flavored Markdown, with three or more backticks followed optionally by a language name; the same rules would apply as to indent stripping as for multiline string literals:

  let x = ```swift
    let y = 42
  ```

In terms of its simplest implementation, there is no need for a code literal to be implemented any differently than a raw string literal. However, several advantages present themselves:

Ease of reading: For the human reader, it can be immediately discerned that the embedded contents are not free text but rather some structured code block. If a language name is given after the backticks, the reader has a visible marker to tell them how to parse the contents.
Syntax highlighting: For an editor or IDE that already supports syntax highlighting for multiple languages, a code literal would permit its embedded contents (if written in a supported language) to be correctly highlighted.
Swift-specific features: For code generation of Swift in Swift, or for code snippets to teach Swift, the compiler can parse the contents of the literal just as it parses the contents of conditional compilation blocks [*]. Besides the benefit of diagnosing some incorrect code, Swift-specific parsing would enable nested code literals without the need for alternative delimiters [**] just as we support nested block comments:

  let x = ```swift
    let y = ```swift
      let z = 42
    ```
  ```

[*] This could, of course, be disabled either by not indicating that the code is written in Swift, or by surrounding the embedded code with a conditional compilation block that requires a future Swift version (which disables parsing).

[**] There is no need to abandon alternative delimiters, however; as you propose, it can be an orthogonal feature.

All of these potential enhancements would advance Swift's support for the five stated use cases, not only getting rid of hard-to-read and hard-to-write escape sequences, but also providing other features to help users write what they intend and read what others intended.

Down the road, one might envision library types that are ExpressibleByCodeLiteral but not ExpressibleByStringLiteral. Indeed, one might even envision a design where the initializer is spelled init(codeLiteral: String, language: String?), and a conforming type could have in its implementation of that initializer a precondition that language is some particular value; with constexpr support, this could be a precondition that is checked at compile time.
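As a rough sketch of that last idea: `ExpressibleByCodeLiteral` below is a hypothetical protocol, not something that exists in Swift, and `SQLQuery` is an invented conforming type. With no compiler support for code literals, the initializer is called by hand and the language check happens at run time rather than compile time.

```swift
// Hypothetical protocol: neither it nor code literals exist in Swift today.
protocol ExpressibleByCodeLiteral {
    init(codeLiteral: String, language: String?)
}

// Invented example type that only accepts literals tagged as "sql".
struct SQLQuery: ExpressibleByCodeLiteral {
    let text: String

    init(codeLiteral: String, language: String?) {
        // A runtime check here; the post imagines this being verified at
        // compile time once constexpr-style evaluation is available.
        precondition(language == "sql", "SQLQuery expects a literal tagged as sql")
        self.text = codeLiteral
    }
}

// Stand-in for what the compiler would do when it meets a tagged code literal.
let query = SQLQuery(codeLiteral: "SELECT * FROM users", language: "sql")
print(query.text)
```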

Is the lack of a single-line syntax for such code literals a major drawback? In my view, not so much. As stated here, "raw strings" are envisioned to be a rarely used feature. In any case, the benefit of not escaping characters really increases with longer strings. Similarly, syntax highlighting and other features provide much greater benefit when the text is longer. Moreover, many of these use cases are likely to require multiple lines much more commonly than single lines of code.

beccadax · July 3, 2018, 2:22am

What if you want to generate the start of a code literal in one code literal, and the end of that code literal in a different code literal? Without an interpolation syntax of some sort, this would be a necessary feature to fully support generating Swift from within Swift.

xwu · July 3, 2018, 2:40am

Good question. In this alternative scheme, you'd use backticks without labeling the fragment as Swift (since neither fragment would parse as Swift):

let a = ````
  let embedded = ```
  print("Hello")
  
````

let b = ````
  
  print("World")
  ```
````

let c = a + b

You'd lose syntax highlighting and parsing, but that seems acceptable for what, as I've said before, appears to be a pretty niche use.

beccadax · July 3, 2018, 2:51am

Actually, let's back up a step. How does it do nesting at all? You say the language keyword is optional, so the starting and ending delimiters are exactly the same. The other constructs which support nesting all have distinct starting and ending delimiters so the lexer can keep a depth count (or stack or something; I've never looked at the implementation). It can't be indentation-sensitive, because the indentation would have to come from the ending delimiter, and the indentation of the first ending delimiter it encountered would always appear to be fine. So can you explain, at the level of a lexer looking at a pointer into a byte view of UTF-8 source code, how you'd parse this syntax?
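For reference, the depth-count approach for constructs with distinct delimiters can be sketched as follows. This is an illustration only, not the Swift lexer's actual code, and `endOfBlockComment` is a name made up for the example.

```swift
/// Returns the offset just past the `*/` that closes the block comment
/// starting at the beginning of `source`, or nil if it is not a comment
/// or is unterminated.
func endOfBlockComment(in source: String) -> Int? {
    let chars = Array(source)
    guard chars.count >= 2, chars[0] == "/", chars[1] == "*" else { return nil }

    var depth = 0
    var i = 0
    while i + 1 < chars.count {
        if chars[i] == "/" && chars[i + 1] == "*" {
            depth += 1          // an opener always increases the depth
            i += 2
        } else if chars[i] == "*" && chars[i + 1] == "/" {
            depth -= 1          // a closer always decreases it
            i += 2
            if depth == 0 { return i }
        } else {
            i += 1
        }
    }
    return nil                  // ran out of input while still nested
}

let nested = "/* outer /* inner */ still outer */ code"
if let end = endOfBlockComment(in: nested) {
    print(String(nested.prefix(end)))
}
```

The counter works only because every delimiter can be classified as an opener or a closer on sight. When both are the same run of backticks, the second run encountered could be either, so there is nothing to count.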

xwu · July 3, 2018, 3:22am

Ah, I see your point. To allow nested literals without language keywords, we could attempt heuristics tracking minimum possible indentation but it would not be an easily explicable rule, nor would it be sufficiently reliable. So, nesting would be possible only in the case where the nested code literal also uses a language keyword (though the inner and outer language keywords don't need to match):

  let outer =
  ```swift
  let inner =
  ```c
  #include <stdio.h>
  ```
  ```

That said, I'm OK with this restriction. If I'm right that the language keyword in such a scheme would permit sufficiently worthwhile benefits (syntax highlighting, parsing), users would be willing to identify the language used whenever possible. Alternatively, if this idea is of interest to users, we could debate the virtue of requiring a language keyword, which could then be none.

jawbroken · July 3, 2018, 4:20am

Great work on the proposal, thank you to everyone for putting it all together. The orthogonal nature of raw markers, alternative delimiters, and single versus multi-line strings makes them straightforward and easy to understand, and avoids a lot of special casing from other proposals. The only additional points that I would suggest covering are @Tino's question about whether interpolation should be allowed and perhaps some discussion about the reuse of KeyPath syntax here (e.g. you mention that backticks are used for escaped identifiers as a downside, but not that \ is used for key paths).

Tino · July 3, 2018, 5:07am

I think the strength of the current proposal is that it minimizes reasons for strong disagreement:
For every aspect, there are other concepts that perform better, but all of those come with their own downsides where they perform worse.

'raw\string' is nice, intuitive and concise, but isn't flexible
r#"raw\string"# is still quite concise, but ugly and not intuitive
raw#"raw\string"# is more clear, but also less concise and still ugly
#stringLiteral(... is very flexible and intuitive, but also very verbose

I'd only consider one change:
Don't mirror the introducer, but rather repeat it, e.g.
let rawString = \#"raw\string#"

The mirroring might put the focus on the end of the raw string, but especially when you do concatenation, it might be more interesting where "regular" code starts.

As you now included an enhancement for regular strings as well, we could also extend it a little bit further and use the sequence of # for general masking of characters with special meaning, so that

print(#"The backslash has a special meaning, and \(1 + 1) is replaced with #\(1 + 1)#")

would output

The backslash has a special meaning, and \(1 + 1) is replaced with 2

Of course, this could be seen as counter-productive to the goal of adding raw strings, but imho it wouldn't be that bad not to double the number of string literal types we have...

johnno1962 · July 3, 2018, 5:28am

Don’t worry about reusing \; the lexer code has no difficulty telling a possibly delimited string from a KeyPath expression.

I like the idea of “raw but interpolating” strings; in fact, I tried to float the idea earlier on in the thread and in the previous pitch, without success. It could not be the default, however, so we are in search of a syntax that would enable it. I already have the implementation.

If people found the following syntax credible we could include it as an optional part of the proposal as I’m not sure what the core team would make of it.

\#"a raw delimited string"#
\(#"a "raw" delimited string that contains an \(interpolation)"#)

jawbroken · July 3, 2018, 5:38am

I wasn't so much worried about parsing ambiguity, I just think that it would strengthen the proposal if you mentioned that you are reusing the \ character which is already used for key paths, and explain why you think that reuse isn't confusing or limiting for future key path enhancements, etc. I don't think it's necessary or advisable to include raw-but-interpolating strings in this proposal (as there's clearly enough to discuss already), just that you should mention it in the “alternatives considered” section to show that you thought about it and decided that “truly raw” strings were more important to address at this point.

masters3d · July 3, 2018, 6:09am

Custom delimiters should be limited to raw strings.
Is custom the right word to use if we only support #?

anandabits · July 3, 2018, 3:52pm

The latest draft of the proposal looks very good. The discussion appears to be converging on a very Swifty design.

I have been writing some Swift that generates Swift lately. One thing I have noticed is that raw strings would come in very handy in many places, but only if string interpolation was still possible. Almost all of the strings involved in code generation include interpolations.

Given that code generation is one of the primary motivating use cases for raw strings, I wonder if it might be possible to find a way to support interpolation. One idea I had was to leverage custom delimiters. For example:

\#"""
    struct \#(name) {
        \#(properties)
    }
"""#

\##"""
    struct \##(name) {
        \##(properties)
    }
"""##

In the above example the custom delimiter is used between the \ and ( in the string interpolation. Use of the custom delimiter provides flexibility to synthesize code that performs string interpolation (and even code containing raw strings that themselves include interpolation using a different custom delimiter).
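To try the idea in running code, here is a sketch that assumes Swift 5 or later, where a literal delimited by # treats `\#(` as interpolation and leaves plain `\(` alone. Note that this spelling has no leading \ prefix, so it illustrates the mechanism rather than the exact syntax written above. `name` and `properties` are hypothetical inputs.

```swift
// Assumption: Swift 5 or later. In a #-delimited literal, `\#(...)` interpolates
// while a plain `\(...)` is kept as literal text for the generated code.
let name = "Point"                 // hypothetical generator input
let properties = "var x: Int"      // hypothetical generator input

let generated = #"""
    struct \#(name) {
        \#(properties)
        var description: String { "x = \(x)" }
    }
    """#
print(generated)
```

The generated source keeps its own `\(x)` interpolation untouched, which is the case that plain escaping makes awkward.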

I apologize for not participating in the thread and bringing this up earlier, but I hope there is still time to consider including something like this. As the proposal is currently written, I would probably use custom delimiters for normal strings on occasion but would unfortunately not be able to use raw strings very often. I have been able to manage escape sequences reasonably well in code generation and would be unlikely to be willing to give up string interpolation. Giving that up would incur a much more significant cost in readability than I am currently paying in escape sequences.

griotspeak · July 3, 2018, 4:04pm

I'll try once more before giving up but can we not count repeated characters, please? Tradition is wonderful and all but… why?

\#"a raw string with " in it"#
\#1"a raw string that needs "# in it"1#