SE-0200: "Raw" mode string literals


(John Holdsworth) #144

Sure, but they wouldn’t quite be raw strings as indentation removal would apply.


(John Holdsworth) #145

The interpolation feature is only enabled if you opt in by double bracketing the literal:

#raw(("a string \(var)"))

Otherwise they are raw.


(Adrian Zubarev) #146

That wouldn’t make much sense to me, like the removal of the wrapping backslash, because in my code base the longest raw string I could possibly create would be around 70 characters.


(John Holdsworth) #147

Making an exception for \ at the end of the line has some merits but I think we need to keep processing of raw strings as simple as possible.


(Adrian Zubarev) #148

That’s what I’m thinking as well so the following is fine by me:

#raw(```
   multi-line version is in the lines between the delimiters
   keep indentation and keep wrapping \
   slash but remove other rules
   ```)

(Xiaodi Wu) #149

I figured I’d share here some thoughts on the design space which I’ve already shared with @johnno1962 off-list. I think it’s useful to have a common vocabulary, putting some names to the problems that we’re trying to solve with this proposal.

As it happens, Wikipedia has a wonderfully thorough series of articles on just this topic, so these are some condensed notes that I took, as well as some reflections on how those points apply to Swift in particular:

Link to Gist


Notes on string literals

Most programming languages use delimiters to surround a string literal. A
known issue that arises due to the use of delimiters is delimiter collision,
which arises when the delimiter(s) themselves need to be represented in the
literal.

Solutions to delimiter collision

  • Paired quotes
    Different opening and closing delimiters; solves a limited subset of delimiter
    collision problems as it can permit only balanced, nested strings.
    Supported in PostScript (parentheses), Visual Basic .NET (curly quotes).
  • Escape characters and sequences
    A very commonly used solution.
    Already supported in Swift.
  • "Doubling up" delimiters
    Similar in concept to escaping the delimiter, two consecutive delimiters are
    interpreted as a literal character.
    Supported in Basic, Fortran, Pascal, Smalltalk.
  • Dual delimiters
    For example, a literal may be delimited by either single quotes or double
    quotes.
    Supported in Fortran, JavaScript, PHP, Python.
    A form of dual delimiters is supported in Swift in that " can be used
    without escaping inside multiline string literals.
  • Configurable multiple delimiters
    Here document-style strings are one variant; the user must know that the
    chosen delimiter will not appear in the quoted string or predict which
    sequences of characters are unlikely to appear.
    Supported in Perl, Ruby, C++11, Lua.
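
As a small illustration of the dual-delimiter point for Swift, the sketch below
spells the same string twice, once with escaped quotes and once inside a
multiline literal. The variable names and the sample sentence are made up for
the example.

```swift
// Escaping the delimiter inside a single-line literal.
let escaped = "She said \"hello\" and left."

// Inside a multiline literal, a lone " needs no escape.
let unescaped = """
    She said "hello" and left.
    """

// Both literals spell the same string.
print(escaped == unescaped)
```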

The principal drawback to the use of escape characters is leaning toothpick
syndrome, a concept first widely introduced in Perl. The principal use cases
in which the issue arises are:

  • Regular expressions matching Unix-style paths
  • Windows paths: most pathologically, regular expressions matching Windows
    Uniform Naming Convention paths, which begin with the prefix \\ that
    requires double-escaping (\\\\\\\\)
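
The UNC case can be made concrete with a short sketch in current Swift. It
assumes Foundation's regular-expression search on String; the path and the
variable names are invented for illustration.

```swift
import Foundation

// A UNC path whose actual characters are: \\server\share
let path = "\\\\server\\share"

// The regex for the two-backslash prefix is \\\\ (four characters),
// which takes eight backslashes to spell in a Swift string literal.
let prefixPattern = "\\\\\\\\"

// Anchor the pattern at the start of the path and search.
let match = path.range(of: "^" + prefixPattern, options: .regularExpression)

print(prefixPattern.count)  // characters actually present in the pattern
print(match != nil)         // whether the prefix was found
```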

Solutions to leaning toothpick syndrome

  • Custom delimiters
    In Perl, characters other than / can be used as delimiters for regular
    expressions.
  • Raw strings
    See table below for comparative syntax.
    Language  Syntax
    C#        @"string"
    C++11     R"xxx(string)xxx", where xxx is an optional custom delimiter
    Go        `string`
    Python    r"string"
    Scala     """string""" (no interpolation) or raw"string" (interpolation)

Some conclusions

Many more languages offer raw strings than custom delimiters. The former
is aimed specifically at mitigating the issue of leaning toothpick syndrome,
which arises when using escape sequences. The latter is an alternative to
the use of escape sequences.

In Swift, both escape sequences and string interpolation segments are prefixed
with \. This is a deliberate design choice; Swift differs from languages such
as Scala where the two have distinct spelling. Scala offers a raw interpolator
syntax (interpolation but no escaping) as well as other
variations. Swift’s deliberate design choice likely rules out such a design:
instead, string literals will support both interpolation and escaping or
neither.
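
A minimal sketch of the shared backslash prefix in current Swift follows; the
names are arbitrary and the strings are only examples.

```swift
let name = "world"

// One backslash introduces either an escape sequence or an interpolation.
let cooked = "hello \(name)\n"

// Escaping the backslash itself suppresses both at once, leaving the
// characters \(name)\n in the string.
let literal = "hello \\(name)\\n"

print(cooked.count)   // "hello world" plus one newline character
print(literal.count)  // every character of the source spelling is kept
```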

Generally, languages support single-line raw strings. This trend likely
reflects the insight that leaning toothpick syndrome is most pathological in the
case of regular expressions. Although multiline raw strings would permit
unmodified embedding of source code, such a use case is not a primary motivation
because, in the absence of custom delimiters, raw strings actually disable a
solution to delimiter collision which may be necessary for the embedded code.

Support for custom delimiters for regular expressions obviates the need
for raw strings to overcome leaning toothpick syndrome involving forward
slashes but not backslashes.

Syntax

Swift eschews numeric literal suffixes such as f and l; users have largely
rejected r"string" syntax on that basis, and it is unlikely that @"string"
in the style of C# would find greater acceptance.

Single backticks already serve another role in Swift. Multiple backticks may
still be considered, but the use of multiple backticks for single-line raw
string literals may be considered inconsistent given current syntax for
single-line and multiline string literals.

Given such considerations, the remaining options include either more verbose
spellings such as raw"string" or the single quote option 'string'.


(Chris Lattner) #150

Thank you for the great summary @xwu :clap:. I really appreciate your thoughtful contributions to this forum, particularly attempts to distill the essence out of the sometimes chaotic contributions to the forum.

Given that we have single and multi-line string literals, and that regex literal syntax using //'s seems likely, the remaining unserved use case is the swath of “other” that is not covered well by any of those.

I’m not sure how wide the audience of this “other” is, but it seems that (if we need to solve for it) we should go with the maximally powerful solution that can blast away any problematic cases, even if it is syntactically onerous. The idea is that few cases need this level of treatment, but if (and only if) they are common enough to require a solution, they can tolerate syntactic excess at the edges.

To me, that seems to imply that niceties like string interpolation syntax are not important. It also seems to say that the ability to have custom delimiters is important, because that is the most general solution to the individual problems (but it doesn’t mean that we have to go whole hog and allow emoji as delimiters!). I agree that r"xx" syntax feels unnatural, but maybe something like:

#raw("delim", delim"…"delim) could work, for some limited idea of what the “delim” string can contain? If we require the quotes in the specified places but also require the delimiter to exist, then this allows unique-enough strings like:

#raw("x", x"crazy\(ain't it?!?@#?"x) 

which isn’t too bad. Because these things are only rarely used, maybe something like this could work?

All that said, it really isn’t clear to me that this is motivated enough to be worth language complexity to support. I’m glad there is extensive discussion though, so we can decide and legislate it once and for all.

-Chris


(Chris Lattner) #151

Also, I have to say that attempts to use ‘’ to solve this aren’t really motivating to me. I tend to believe in the use of ‘’ delimiters for characters, but also don’t think they are particularly useful for the purposes of raw strings: a raw string can very likely contain both a single and a double quote, so the Perl approach of forcing you to choose seems unappetizing.

The other benefit of the approach I’m pitching above is that it lends itself to a natural “multiline” raw string literal syntax of:

#raw("x", x"""
    crazy\(ain't it?!?@#?
    """x) 

Though we’d have to be careful to not allow "'s in the delimiter or something (to avoid ambiguity).

-Chris


(Tino) #152

What is the benefit of having the delimiter written out three times, instead of just using the opening delimiter to close the string?


(Goffredo Marocchi) #153

Simpler maybe to have it as an explicit parameter than as an implicit one?


(Brent Royal-Gordon) #154

If we want to do this, I think we should also have a way to apply it to a non-raw string. Strawman example:

#cooked("x", x"""
    crazy\(verb) it?!?@#?
    """x)

That suggests we should consider this custom delimiter feature, whatever it ends up looking like, to be a separate feature from raw strings.


(^) #155

String handling in Swift emphatically does not resemble scripting languages and intentionally so. Python for example has no concept of characters, only strings of length 1, has no concept of grapheme clusters, allows O(1) integer subscripting, and has no concept of encoding views.


(^) #156

can i just say custom delimiters are going to be problematic for any text editor other than Xcode. Swift is already an absolute nightmare to syntax highlight; I don’t think any highlighter is better than about a 6 on a scale of 1 to 10 at the moment, and this is only going to make it worse


(Ben Rimmington) #157

Instead of custom delimiters, how about using extra parentheses?

#raw("This string contains " and stops here.")

#raw(("This string contains ") and stops here."))

#raw((("This string contains ")) and stops here.")))

(Tino) #158

I don’t buy such arguments in general: If someone insists on suboptimal tools, he can either live with the problem or just refrain from using the problematic feature (and I consider Xcode suboptimal as well, so that’s not just saying “I don’t have that problem” ;-)

It’s another story that Swift imho is too complicated - but I don’t think this proposal is a game changer in this respect.


(^) #159

it wouldn’t be as big a deal if xcode was free and available for linux but it’s not. im not really someone who argues over text editors but when you have no other choice


(^) #160

people say :’))))))) a lot that’s a lot of matching parentheses


(Tino) #161

I think the #raw(delim"content"delim) idea is the best solution, but I’d change a tiny thing and spell it

#string(delim"content"delim)

instead. Not because raw could be associated with something else (bytes…), but because it looks like Swift will have quite a lot of different literals soon, which aren’t really related.
#string could be the tool to tie them all together

#string(.raw, "content")
#string(.multiline, "
   content
")
#string(.regex, "[0-9]{1, 3}")
#string(.regular, "I'm \(self)")

That would probably be my starting point for a language designed with a handful of different string literal types in mind.
In this model, “raw” would be the default choice, and “” would be syntactic sugar for “regular” - and there could be as many string literals with different rules as you like.


(Chris Lattner) #162

Sorry, to be clear, I don’t have a super strong opinion about the concrete syntax, I just want to make sure that this covers the maximally general case without requiring yet-another syntax for an even more corner case down the road.

-Chris


(^) #163

i always thought #hashtag things were ugly and should be avoided but then again i feel like after we get /regex/ literals the need for true raw string literals will be basically zero so i think this is the best solution. raw strings are hacky corner-case things and it’s fine if they look that way