I've been thinking about this a bit, and I think there is a way of doing "all of the above": that is, of unifying all of the use cases presented so far with a simple, general syntax. To get this, we have to give up the idea that we're talking about "raw" strings (we've already been talking about other use cases as well), and start thinking in terms of generalized strings. Sorry about the length of the post, but I've broken it down into three major steps:
Step 1 defines a lexical token I'm going to call an "introducer", which is a kind of lexical attribute identifying what follows as a [generalized] string. I don't care how this is spelled, but for the sake of concreteness, I'll use #string here. In the simplest case, the introducer is just placed before a string, separated by arbitrary whitespace (or some other simple rule), so:
let x = #string "some string contents"
or:
let x = #string """
some
string
contents
"""
In this simplest case, the introducer doesn't change anything: it's transparent to the lexical structure.
Step 2 resurrects the idea of a configurable escape character, first suggested in this thread by @benrimmington. I'm proposing that an escape character should be subject to the following constraints, to ease the lexical analysis:
- It cannot be whitespace.
- It must be a single grapheme, aka Character.
- It must be a single Unicode code point, aka UnicodeScalar.
The single-code-point requirement makes the character easy to recognize lexically. The single-grapheme requirement would be checked later, by the syntax analyzer that has full knowledge of Unicode graphemes. That is to say, the lexical analyzer might accept some escape characters that are later rejected by the syntax analyzer.
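To make the three constraints concrete, here is a minimal sketch of the check. The function name isValidEscapeCharacter is hypothetical (it is not part of any proposal or compiler API), and it tests a whole candidate string at once rather than splitting the work between lexer and syntax analyzer as described above:

```swift
// Hypothetical helper, for illustration only: checks the three
// constraints proposed for a custom escape character.
func isValidEscapeCharacter(_ candidate: String) -> Bool {
    // Exactly one grapheme (one Character) that is also exactly
    // one code point (one Unicode.Scalar).
    guard candidate.count == 1,
          candidate.unicodeScalars.count == 1,
          let scalar = candidate.unicodeScalars.first else {
        return false
    }
    // It cannot be whitespace.
    return !scalar.properties.isWhitespace
}
```

Under these rules a bullet or a backslash would be accepted, while a space, or an "e" followed by a combining accent (one grapheme, two code points), would be rejected.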
For concrete syntax, again I don't care what it is, but I'm going to use the simplest thing I can think of, a suffix on the introducer, such as the bullet in the following example:
let x = #string• "some string contents•nwith an internal new-line"
or (with interpolation):
let x = #string• """
some
•(someVariable)
contents
"""
A backslash is still the default escape character, so the simple introducer #string actually means #string\. Obviously, if the escape character doesn't appear in the string itself, the string is [very close to being] raw. But not quite, so...
Step 3 adds the idea of a custom delimiter. Again, to keep things easy for lexical analysis, the delimiter must be a non-whitespace, single-grapheme, single-code-point character. In this case, there is no need for any syntax to define the delimiter; it's defined by use. Thus, the following are all the same String value:
#string "abc"
#string xabcx // crappy choice of delimiter, but valid
#string /abc/
and the multi-line version of that last example looks like this:
#string ///
some
string
contents
///
(I'm assuming the delimiter "wins" over double- or triple-slash comments, but that doesn't have to be so.) Mixed custom delimiters and escape characters look like this:
#string• /.,;\•/'"/
// period, comma, semicolon, backslash, slash, single quote, double quote
The nice thing that falls out of this is that these are proper strings:
#string 'abc'
#string `abc`
regardless of what other uses single-quotes and back-ticks may have, elsewhere in the grammar. (There is no conflict or ambiguity.) Regex strings should fall out pretty nicely, too, with a careful choice of escape character, or delimiter, or both.
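Here is a rough sketch of how the three steps might combine when recognizing a single-line generalized string. Everything in it is an assumption for illustration: parseGeneralizedString is a hypothetical name, whitespace is required between a plain #string and its delimiter, only the escapes n, t, the escape character itself, and the delimiter are handled, and interpolation and the multi-line form are left out.

```swift
// Hypothetical sketch, not a compiler API: recognizes single-line
// generalized strings such as  #string• /a•/b/  and returns the value.
func parseGeneralizedString(_ source: String) -> String? {
    let introducer = "#string"
    guard source.hasPrefix(introducer) else { return nil }
    var rest = Array(source.dropFirst(introducer.count))  // [Character]

    // Step 2: optional escape-character suffix; backslash is the default.
    var escape: Character = "\\"
    if let first = rest.first, !first.isWhitespace {
        escape = first
        rest.removeFirst()
    }

    // Skip whitespace between the introducer and the delimiter.
    var i = 0
    while i < rest.count, rest[i].isWhitespace { i += 1 }

    // Step 3: the delimiter is defined by use.
    guard i < rest.count else { return nil }
    let delimiter = rest[i]
    guard delimiter != escape else { return nil }
    i += 1

    var result = ""
    while i < rest.count {
        let c = rest[i]
        if c == escape {
            // An escape must be followed by a recognized character.
            guard i + 1 < rest.count else { return nil }
            let next = rest[i + 1]
            switch next {
            case "n": result.append("\n")
            case "t": result.append("\t")
            case escape, delimiter: result.append(next)
            default: return nil  // unknown escape in this sketch
            }
            i += 2
        } else if c == delimiter {
            return result  // closing delimiter; anything after it is ignored
        } else {
            result.append(c)
            i += 1
        }
    }
    return nil  // unterminated string
}
```

With this sketch, the "abc", xabcx, /abc/, 'abc' and `abc` spellings all yield the same three-character value, and in the mixed example the backslash is an ordinary character because the bullet has taken over as the escape.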
I've probably overlooked something important, but AFAICT:
- all strings are representable
- in most cases, the string remains pretty readable
- no interpolations or standard escape sequences are arbitrarily excluded
- truly raw strings are representable
- strings can be pasted/embedded "raw-ly" into other strings about as easily as any other solution that's been discussed (I think)
- the new syntax is (I hope) pretty much describable with a regex, reducing the complexity of implementing string recognition in IDEs and editors (I think)
- inexperienced users aren't going to trip over the generalized syntax by accident