SE-0200: "Raw" mode string literals

I think the #raw(delim"content"delim) idea is the best solution, but I'd change a tiny thing and spell it

#string(delim"content"delim)

instead. Not because raw could be associated with something else (bytes...), but because it looks like Swift will have quite a lot of different literals soon, which aren't really related.
#string could be the tool to tie them all together

#string(.raw, "content")
#string(.multiline, "
   content
")
#string(.regex, "[0-9]{1,3}")
#string(.regular, "I'm \(self)")

That would probably be my starting point for a language designed with a handful of different string literal types in mind.
In this model, "raw" would be the default choice, and "" would be syntactic sugar for "regular" - and there could be as many string literals with different rules as you like.

Sorry, to be clear, I don't have a super strong opinion about the concrete syntax, I just want to make sure that this covers the maximally general case without requiring yet-another syntax for an even more corner case down the road.

-Chris

i always thought #hashtag things were ugly and should be avoided. but then again, i feel like after we get /regex/ literals the need for true raw string literals will be basically zero, so i think this is the best solution. raw strings are hacky corner-case things, and it’s fine if they look that way

No need to apologize, I may have to excuse myself — reading my own question, I guess it may look more critical than curious (I don't mind being critical, but not in disguise).
So, taking your answer into account: I'd prefer not to have an extra delim-parameter ;-)

Swift is a bit tricky to parse. That’s why the Swift project is building toolkits for syntax highlighters.

I think multi-line raw strings complicate the current proposal quite a bit. I do agree that custom delimiters are probably worth tackling as part of this proposal.

#raw("" something \ simple "")
#raw("@"something some c""o"m"plicated \\"@") // "@" delimiter

alternative

##for the /\.\.siimple case##
#@#something ##some c""o"m"plicated \\#@# // "@" delimiter

most text editors highlight using regexes (i know atom and gedit do this). atom has no support for external tokenizers. you can argue about whether this is correct or not but there’s still the fact that if a grammar is hard for a regex to read, it’s also hard for a human to read

It sounds like most people’s use cases would be served by native support for regex literals, which seem to have broad support. Furthermore, large strings with arbitrary contents and no escaping are the subject of another proposal, for file-content literals.

This leaves only the narrow problem space of short strings which contain backslashes and/or quotes. Now, Windows file paths can be constructed from an array using joined(separator:) like so:

let path = ["C:", "Users", "\(userName)", "My Documents"].joined(separator: "\\")

And strings with quotes can be written as multiline literals:

let quote = """
            "This quote is very memorable." – Randall Munroe
            """

For other short strings, such as snippets of Swift code, the existing syntax with escape characters isn’t *too* onerous, so it might be best to defer any action in this space until after regex and file literals become available. Then, if there is still a strong case for raw strings, we can consider adding them.
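For anyone who wants to try the two workarounds above, here is a minimal sketch that runs in current Swift. The `userName` value is a made-up placeholder for illustration; everything else is taken from the snippets above.

```swift
// Hypothetical user name, purely for illustration.
let userName = "alice"

// Only the separator needs an escaped backslash; the components stay clean.
let path = ["C:", "Users", "\(userName)", "My Documents"].joined(separator: "\\")
print(path)

// Inside a multiline literal, double quotes need no escaping.
// The closing delimiter's indentation is stripped from the content line.
let quote = """
    "This quote is very memorable." – Randall Munroe
    """
print(quote)
```

Neither workaround helps when the text is full of backslashes that are not separators, which is the case left over for regex or raw literals.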

So…not a fan of HTML?

sighs in colorful monospaced text

I’ve been thinking about this a bit, and I think there is a way of doing “all of the above”: that is, of unifying all of the use cases presented so far with a simple, general syntax. To get this, we have to give up the idea that we’re talking about “raw” strings (we’ve already been talking about other use cases as well), and start thinking in terms of generalized strings. Sorry about the length of the post, but I've broken it down into three major steps:


Step 1 defines a lexical token I’m going to call an “introducer”, which is a kind of lexical attribute identifying what follows as a [generalized] string. I don’t care how this is spelled, but for the sake of concreteness, I’ll use #string here. In the simplest case, the introducer is just placed before a string, separated by arbitrary whitespace (or some other simple rule), so:

	let x = #string "some string contents"

or:

	let x = #string """
		some
		string
		contents
		"""

In this simplest case, the introducer doesn’t change anything — it’s transparent to the lexical structure.


Step 2 resurrects the idea of a configurable escape character, first suggested in this thread by @benrimmington. I’m proposing that an escape character should be subject to the following constraints, to ease the lexical analysis:

  • It cannot be whitespace.
  • It must be a single grapheme, aka Character.
  • It must be a single Unicode code point, aka UnicodeScalar.

The single-code-point requirement makes the character easy to recognize lexically. The single-grapheme requirement would be checked later, by the syntax analyzer that has full knowledge of Unicode graphemes. That is to say, the lexical analyzer might accept some escape characters that are later rejected by the syntax analyzer.
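To make the two-stage check concrete, here is a rough sketch in today's Swift. The function names and the idea of testing the candidate against the literal text "#string" are my own assumptions for illustration, not part of any proposal. A combining accent is a handy example of a character the lexical check accepts but the grapheme check rejects, because it fuses with the last letter of the introducer.

```swift
/// Lexical stage (assumed): exactly one Unicode scalar, and not whitespace.
func passesLexicalCheck(_ candidate: String) -> Bool {
    let scalars = candidate.unicodeScalars
    guard scalars.count == 1, let scalar = scalars.first else { return false }
    return !scalar.properties.isWhitespace
}

/// Later stage (assumed): the scalar must stand as its own grapheme
/// when it follows the introducer, i.e. it adds exactly one Character.
func passesGraphemeCheck(_ candidate: String, after introducer: String = "#string") -> Bool {
    return (introducer + candidate).count == introducer.count + 1
}

print(passesLexicalCheck("•"), passesGraphemeCheck("•"))
// U+0301 COMBINING ACUTE ACCENT is a single scalar, but it attaches to the "g".
print(passesLexicalCheck("\u{301}"), passesGraphemeCheck("\u{301}"))
```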

For concrete syntax, again I don’t care what it is, but I’m going to use the simplest thing I can think of, a suffix on the introducer, such as the bullet in the following example:

	let x = #string• "some string contents•nwith an internal new-line"

or (with interpolation):

	let x = #string• """
		some
		•(someVariable)
		contents
		"""

A backslash is still the default escape character, so the simple introducer #string actually means #string\. Obviously, if the escape character doesn’t appear in the string itself, the string is [very close to being] raw. But not quite, so…
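Here is a small sketch of what "cooking" a string with a configurable escape character might look like, written as an ordinary function in today's Swift. The name `cook` and the tiny set of escapes it handles (newline, tab, and the escape character itself) are assumptions for illustration; interpolation is deliberately left out.

```swift
/// Hypothetical helper: interprets escapes introduced by `escape`.
/// Handles only n, t and a doubled escape character; anything else is kept as written.
func cook(_ text: String, escape: Character = "\\") -> String {
    var result = ""
    var iterator = text.makeIterator()
    while let ch = iterator.next() {
        // Ordinary characters pass through untouched.
        guard ch == escape else {
            result.append(ch)
            continue
        }
        // A trailing escape character with nothing after it is kept literally.
        guard let next = iterator.next() else {
            result.append(ch)
            break
        }
        switch next {
        case escape: result.append(escape)
        case "n": result.append("\n")
        case "t": result.append("\t")
        default:
            // Unknown escape: keep both characters rather than guessing.
            result.append(ch)
            result.append(next)
        }
    }
    return result
}

// With a bullet as the escape character, backslashes are plain text.
print(cook("C:\\temp•nnext line", escape: "•"))
```

The point of the sketch is just that backslashes stop being special once a different escape character is chosen.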


Step 3 adds the idea of a custom delimiter. Again, to keep things easy for lexical analysis, the delimiter must be a non-whitespace, single-grapheme, single-code-point character. In this case, there is no need for any syntax to define the delimiter; it’s defined by use. Thus, the following are all the same String value:

	#string "abc"
	#string xabcx // crappy choice of delimiter, but valid
	#string /abc/

and the multi-line version of that last example looks like this:

	#string /// 
		some
		string
		contents
		///

(I’m assuming the separator “wins” over double- or triple-slash comments, but that doesn’t have to be so.) Mixed custom delimiters and escape characters look like this:

	#string• /.,;\•/'"/ 
// period, comma, semicolon, backslash, slash, single quote, double quote

The nice thing that falls out of this is that these are proper strings:

	#string 'abc'
	#string `abc`

regardless of what other uses single-quotes and back-ticks may have, elsewhere in the grammar. (There is no conflict or ambiguity.) Regex strings should fall out pretty nicely, too, with a careful choice of escape character, or delimiter, or both.
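The single-line case of Step 3 can be sketched as a toy scanner in today's Swift. This is only an illustration under my own assumptions: the function name is made up, the introducer is hard-coded as "#string", and escape characters and the multi-line form are ignored.

```swift
/// Hypothetical scanner: the first non-whitespace character after the
/// introducer is taken as the delimiter, and the contents run up to its
/// next occurrence. Returns nil if the input does not have that shape.
func generalizedStringContents(_ source: String) -> String? {
    let introducer = "#string"
    guard source.hasPrefix(introducer) else { return nil }
    // Skip the whitespace between the introducer and the opening delimiter.
    let rest = source.dropFirst(introducer.count).drop(while: { $0.isWhitespace })
    guard let delimiter = rest.first else { return nil }
    let body = rest.dropFirst()
    // The delimiter is "defined by use": the same character closes the string.
    guard let end = body.firstIndex(of: delimiter) else { return nil }
    return String(body[..<end])
}

print(generalizedStringContents("#string \"abc\"") ?? "nil")
print(generalizedStringContents("#string xabcx // crappy choice of delimiter") ?? "nil")
print(generalizedStringContents("#string /abc/") ?? "nil")
```

All three calls yield the same contents, which is the property Step 3 is after.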


I’ve probably overlooked something important, but AFAICT:

  • all strings are representable
  • in most cases, the string remains pretty readable
  • no interpolations or standard escape sequences are arbitrarily excluded
  • truly raw strings are representable
  • strings can be pasted/embedded “raw-ly” into other strings about as easily as any other solution that’s been discussed (I think)
  • the new syntax is (I hope) pretty much describable with a regex, reducing the complexity of implementing string recognition in IDEs and editors (I think)
  • inexperienced users aren’t going to trip over the generalized syntax by accident
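
As a rough check on the "describable with a regex" point, here is a sketch of a pattern with a backreference that recognizes the single-line form. The pattern and the helper name are my own assumptions, it uses Foundation's NSRegularExpression, and it ignores escape characters and the multi-line form entirely.

```swift
import Foundation

// Assumed pattern: introducer, optional whitespace, one non-whitespace
// delimiter (group 1), lazily matched contents (group 2), same delimiter again.
let generalizedStringPattern = "#string\\s*(\\S)(.*?)\\1"

/// Hypothetical helper: returns the contents of the first generalized
/// string found in `source`, or nil if there is none.
func firstGeneralizedString(in source: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: generalizedStringPattern, options: []) else {
        return nil
    }
    let whole = NSRange(source.startIndex..., in: source)
    guard let match = regex.firstMatch(in: source, options: [], range: whole),
          let contents = Range(match.range(at: 2), in: source) else {
        return nil
    }
    return String(source[contents])
}

print(firstGeneralizedString(in: "let x = #string /abc/") ?? "nil")
print(firstGeneralizedString(in: "let y = #string 'abc'") ?? "nil")
```

Whether a highlighter could cope with the escape-character suffix as well is a separate question this sketch does not answer.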

Just to make sure I understand your idea, the “introducer” is optional if you’re not customizing the string parsing/formatting? So this is still a perfectly good string?

let foo = "bar \(bast)"

Yes, that's correct.

I didn't say that part very well, because I was trying to make the additional point that the introducer is "harmless" in the familiar string case — it's the same with or without it, so the proposed syntax is a true generalization of the current syntax (I think).

FYI, this email from the Java community has some interesting discussion about raw string literal design tradeoffs:

http://mail.openjdk.java.net/pipermail/amber-spec-experts/2018-March/000446.html

Good read — my first thought was "why did nobody post this in the thread for multiline strings?"... but it looks like Swift is somewhat quicker than Java ;-)

Interesting read, but I’m sure we’re taking the right road keeping raw-ness orthogonal to multi-lined-ness.
I look at it as being 4 possibilities for the complexity of 2.

I would happily take the design described in the linked post, just swapping the backticks ` for single quotes ':

'Raw string'
''Raw string's birthday''
'''Raw 'string''s birthday'''

That's simple to write and non-distracting when reading. And for the rare situation where you need a single quote at the start or end of the string, you can solve the problem in a very intuitive manner like this:

"'" + 'string' + "'"

I find this workaround more elegant than using a string with arbitrary delimiters.

two single quotes in a row '' look awfully like one double quote "

... but imho there's no real win if all four combinations are expressed in different ways:
If we had a system of delimiter strings like

         Single line   Multi line
Raw      '             ''+
Cooked   "             ""+

the two things would really be separated (choice of delimiter indicates raw, number of delimiters indicates multiline).
But imho the way multiline strings are done already blurs the line, as you can have unescaped double quotes in them (so they already fulfill a big use case of raw strings).
Also, I'm not sure that it makes sense to make a big difference between single and multiline strings: Imho the "Java way" of utilizing libraries isn't that stupid:

"""
 We have this
"""
"""
 But we could also do it this way
""".frontTrimmed
"""So this would also be allowed"""

But anyways, review period is over, and I wonder if anybody will really read this huge bunch of posts ;-)

the multiline raw and the single line cooked look exactly the same…
