Pure Bikeshedding: Raw Strings (why yes, again!)

You shouldn't actually need one, but << and >> would be alternatives if you want to stick with ASCII.

<< and >> are operators tho

Gah. Of course. Sorry.

for all 103 posts on this thread, why can’t we just ditch the whole balanced delimiters thing and try something like

let string:String = @raw($%) "i "can't" think of \why i actually need''' this$%"

where the argument of @raw() is the thing the parser looks for, immediately followed by a " to terminate the string. (with the obvious limitation that the sigil can’t contain the ) character.)

@raw(terminator) "string-content terminator"

so, all that is needed to begin a raw string is the " character, but the custom terminator followed by a " is needed to end the string.
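To make the scanning rule concrete, here is a minimal sketch of what the lexer would have to do. The helper name `rawBody(of:terminator:)` is made up for illustration and is not part of any proposal; it assumes the input is the source text immediately after the opening " character.

```swift
import Foundation

// Hypothetical helper: returns the contents of a `@raw(terminator) "..."`
// literal, given the source text that follows the opening quote.
func rawBody(of text: String, terminator: String) -> String? {
    // The literal ends at the first occurrence of the terminator
    // immediately followed by a double quote.
    let closing = terminator + "\""
    guard let end = text.range(of: closing) else { return nil }
    // Everything before the closing sequence is taken verbatim.
    return String(text[..<end.lowerBound])
}

// The example from above, written with today's escapes so it compiles.
let source = "i \"can't\" think of \\why i actually need''' this$%\""
let body = rawBody(of: source, terminator: "$%")
```

Note that nothing in the body is interpreted, so quotes and backslashes pass straight through; the only thing the content cannot contain is the terminator followed by a ".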

of course this would be yet another blow to 3rd party and Linux Swift tooling, but it seems that was never a priority anyway so let’s do this why don’t we

I ran a quick-and-dirty search of most of my closed-source Objective-C projects from the last ten years. (This was mainly my code, but some of it was open-source dependencies.) 468 of 114,609 lines (0.41%) in Objective-C implementation files contained a character literal. 319 of these were cases in switch statements, and most of them involved parsing in one way or another. Most of the hits were in support libraries (Mustache, GRDB, SBJSON, a generated parser for App Store receipt files) rather than application code.

(apple/swift is a little heavier on the character literals—2,319 of 434,375 lines of C++, for 0.53% of all lines.)

I also did a similar analysis on a closed-source Ruby project (again, including dependencies) I work on. The numbers here are less certain because the regex is much looser, but 4,609 of its 32,360 lines (14%) appeared to contain a single-quoted string literal. About 40% of the strings in this project appear to be single-quoted. Of those, only 220 appear to actually contain characters that would need escaping, but that's still 0.68% of all lines, more common than character literals in either of the C-style code bases I examined.

(Why are so many single-quoted unnecessarily? Well, when raw literals are just as easy to write, you tend to use them for little alphanumeric strings where you know you should never need escapes or interpolations.)

Now, the Ruby code is working in a way more string-y environment than the Objective-C code; nearly a third of the lines involve at least one string literal. But that's sort of the point, isn't it? There are many different kinds of programming, and our personal experience may not provide a representative sample.

Stupid question (exploring both uses for single quotes):

Could 'A' be parsed as a "character literal" (new kind of literal), and 'ABC' be parsed as a "string literal" (the good old one)? Biggest pain point I foresee: this would require that we have defined what a character literal is, before single-quote raw strings could come to life.

These were well covered in the very first post of the thread, and I would say the first two of them (metaprogramming and pedagogy) almost require custom delimiters for legibility because they have code, including Swift code with raw strings, within raw strings.

This is a great summary, thanks, and is what I was trying to get at by asking if #" "# should mean a raw string or if it should be an alternative delimiter for a normal string. If we can find an accepted solution for alternate delimiters then the raw strings part seems easier to me (e.g. just prefix with raw, take single quotes if you can wrest them from the Character literal people, backticks if that makes sense, etc). I think these should probably be designed together though, and not split into separate proposals, because there is a two-way constraint on syntax.

I like this, including the chosen delimiters. Perhaps we can use something else than just single quotes for unicode scalar literals. For example:

let asciiA: Int8 = #u'A'

I think this would be inconsistent. If you remove two characters from 'ABC' it will suddenly typecheck to a character instead of a string, while you might still want a string.

Yes, I like it too. @beccadax has nailed it as far as I'm concerned. The only objection raised since he posted seems to be that we might want to use ' for single character literals, but that ain't gonna happen according to Commonly Rejected Changes.

Summary of some things:

  • Although ' may not be used for single character literals, there are some important explorations into their use and burning them on raw strings (a fairly niche use) may be inadvisable. I really like Brent's approach, but will we be allowed to use '?
  • The original design r"..." was rejected in part for not being Swifty, that is, taking on the look and feel and characteristics of existing parts of the language. Similar approaches like raw"..." and #raw"..." carry the same issues.
  • The Rust approach with adjustable delimiter counts has been highly successful. We will likely incorporate that into our design, although we may not keep the details and I certainly want to drop the leading r.
  • The top contenders for a Rusty approach are # and ` (pound sign and backtick). The advantage of backtick is that it already preserves the meaning of "code voice" and "literal", as you are used to in markdown.
  • I don't personally think arbitrary delimiters are needed, as the Rusty-approach combined with existing string syntax covers all cases. (`""This is a `String` instance," she \(said). "I like `String`s""`) Just multiply the delimiter as needed. You rarely need to do so, even for complex regex.
  • There are two questions of approach: discoverability ("how do I do a raw string in Swift", easy to search for) and recognition ("Why do some strings in Swift start with # (or `)?"). I think both are relatively easy to search for.
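As a rough illustration of "just multiply the delimiter as needed", the sketch below computes how many delimiter characters a given piece of raw content would require. The function name is hypothetical, and it assumes the closing sequence is a " followed by the delimiters (the Rust rule), with no escapes processed inside the content.

```swift
// Hypothetical helper: the smallest delimiter count such that the content
// never contains a quote followed by that many delimiter characters.
func delimitersNeeded(for content: String, delimiter: Character = "#") -> Int {
    var needed = 0
    // Length of the delimiter run directly after the most recent quote,
    // or nil when we are not directly after a quote.
    var run: Int? = nil
    for character in content {
        if character == "\"" {
            // Any quote in the content rules out the plain "..." form.
            run = 0
            needed = max(needed, 1)
        } else if character == delimiter, let current = run {
            // A quote followed by k delimiters forces at least k + 1.
            run = current + 1
            needed = max(needed, current + 2)
        } else {
            run = nil
        }
    }
    return needed
}

let plain = delimitersNeeded(for: "plain text")            // no quotes at all
let quoted = delimitersNeeded(for: "say \"hi\"")           // quotes, none followed by #
let anchor = delimitersNeeded(for: "href=\"#sidebar\"")    // a quote followed by one #
```

Under these assumptions most content needs zero or one delimiter, which matches the claim that multiplying the delimiter is rarely necessary.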

Another idea:
Would we want an arbitrary 'raw' string literal usable as a normal string literal, in all cases?
If not then 'raw' string literal initialization could be a protocol chain parallel to the normal string literal protocols.
In that case we could have e.g. UInt8: ExpressibleAsRawUnicodeScalarLiteral, where I think it is general consensus that UInt8 should not be expressible as a standard unicode scalar literal (with ").

we’ve been over this many times

Decision time?

I feel that in the last 24 hours we’ve finally been able to fix the dimensions of this problem and all that remains is to ascribe specific values to each of these dimensions for an optimum syntax for raw string literals. We crossed the finishing line with @brendax’s and others’ analysis that custom delimiters should be teased away from raw strings to be a separate concept with broader applicability (though I believe not so completely decoupled that it should be discussed as a separate proposal). This brings the number of string literal possibilities to 8, but rest assured there is only ever one underlying String type.

Brent’s analysis breaks string syntax into three dimensions: whether the string is multiline, whether the literal should be processed as a raw string literal, and whether there is a custom delimiter. How to detect multilined-ness is already decided by whether the quote character is typed once or three times. This leaves two further roles to ascribe syntax to: what you could call the “differentiator”, which determines whether the processing of the literal should be raw or not, and the “custom or alternate delimiter”.

In Brent’s proposal the “differentiator” was whether you used single (') or double (") quotes, but that doesn’t seem to be available as an option, so we’re going to have to use a prefix of some sort. The original proposal’s “r” prefix seems to be beyond the pale, so for now I would suggest “#raw” as a starting point.

The final component is the custom delimiter sequence which (adopting Rust’s model) is zero or more of a specific character which adopts this role because it immediately precedes the quote character and is repeated after the close quote. Rust uses # and this has been the basis of much of the discussion. Brent’s proposal switched it for ` (backtick) in his examples, which confused the heck out of me at the time.
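For anyone who wants to see the matching rule spelled out, here is a small sketch of how a lexer might split such a literal. `delimitedBody(of:delimiter:)` is an invented name, the # default is only an assumption since the delimiter character is still undecided, and escape processing is deliberately left out.

```swift
import Foundation

// Hypothetical helper: extracts the body of a literal such as ##"..."##.
func delimitedBody(of literal: String, delimiter: Character = "#") -> String? {
    // Count the delimiter characters in front of the opening quote.
    let count = literal.prefix(while: { $0 == delimiter }).count
    let rest = literal.dropFirst(count)
    guard rest.first == "\"" else { return nil }

    // The closing sequence is a quote followed by the same number of delimiters.
    let closing = "\"" + String(repeating: delimiter, count: count)
    let body = rest.dropFirst()

    // The first closing sequence must also be the end of the literal.
    guard let end = body.range(of: closing), end.upperBound == body.endIndex else {
        return nil
    }
    return String(body[..<end.lowerBound])
}

let single = delimitedBody(of: "#\"a string with a \" in it\"#")
let double = delimitedBody(of: "##\"href=\"#sidebar\"\"##")
```

With zero delimiters this degenerates to the existing rule (the first quote closes the string), which is why the same mechanism can serve both raw and normal strings.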

In concrete terms Brent was proposing:

'a raw string'
`'a raw string with a ' in it'`

Which despite appearances conceptually isn’t very different from

#raw"a raw string"
#raw#"a raw string"#

or even

#raw`"a raw string with a " in it"`

The advantage of separating the custom delimiter out as a concept is that it will now also apply to cooked/normal strings

#"a string with a " in it"#
`"a string with a " in it"`

One can even look to the future and see custom delimiters applied to regex literals where similar escaping dilemmas are likely to crop up.

#/a gnarly regex with a / in it/#
#regex#"a gnarly regex with a " in it"#

So you can see we’re left with a couple of decisions. The first is what should the custom delimiter character be: # or ` (backtick) or perhaps \.

The second decision, what the prefix should be, I expect to be a little more difficult to find consensus on, as it is very subjective what is “swifty” or “ugly”. #raw isn’t great and in combination with still more #’s can result in something like #raw###"a string"###, but keep in mind requiring a custom delimiter is going to be far more the exception than the rule and generally confined to single line raw strings (and strings).

#raw is ultimately a compromise most people won’t have to live with every day given how rare usage of raw strings will be but at least it's explicit for the uninitiated to google without being overly verbose.

So, I am suggesting we can get started on writing up a proposal along these lines soon for eventual review. If you can’t live with #raw or have other comments or feel we need more time please reply but also put forward something you feel would gain wider acceptance.

I think that perhaps #raw###"a string"### is a step too far because it looks pretty unbalanced. If repeating # is the proposed alternative delimiter then I think raw###"a string"### might be more acceptable. I'm still not entirely convinced that single quotes are off the table here, though there's going to be pushback from Character literal people.

Edit: For example, the proposal that @taylorswift linked says “Current proposals for raw string literals use r-prefixes (r")” which clearly is no longer necessarily true.

and now current proposals use the hashtag. i think most ppl would agree hashtags around raw strings are less silly than hashtags surrounding character literals.

I don't think it's a given that syntax for the two proposals would just swap around. There are a lot fewer constraints on a Character literal syntax, because it only needs to support single characters and not possibly multi-line strings with arbitrary-length content. So I presume that a different solution could be found for Character literals if single quotes were taken for raw strings, though I don't want to sidetrack this thread by discussing the specifics.

As I said earlier, if 'raw' and normal string literals use separate, but similar, protocols, one type could implement different behavior depending on whether single or double quotes were used: e.g. possible implementations for UInt8 could be "0" as UInt8 == 0, but '0' as UInt8 == 48, while still having '' not permit escaping.
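A rough sketch of what I mean, with the caveat that both protocols below are invented for illustration and do not exist in the standard library. Since there is no single-quote literal syntax for the compiler to lower from, the initializers are called by hand here.

```swift
// Hypothetical stand-in for the normal (double-quoted) literal protocol.
protocol ExpressibleByCookedScalar {
    init(cookedScalar: Unicode.Scalar)
}

// Hypothetical stand-in for a parallel 'raw' (single-quoted) literal protocol.
protocol ExpressibleByRawScalar {
    init(rawScalar: Unicode.Scalar)
}

extension UInt8: ExpressibleByCookedScalar, ExpressibleByRawScalar {
    // "0" as UInt8 == 0: read the scalar as a decimal digit
    // (falls back to 0 for anything that is not a number).
    init(cookedScalar: Unicode.Scalar) {
        self = UInt8(String(cookedScalar)) ?? 0
    }

    // '0' as UInt8 == 48: take the ASCII code point.
    // UInt8(ascii:) traps for non-ASCII scalars.
    init(rawScalar: Unicode.Scalar) {
        self = UInt8(ascii: rawScalar)
    }
}

let cooked = UInt8(cookedScalar: "0")
let raw = UInt8(rawScalar: "0")
```

The point is only that one type can conform to both chains and give each spelling its own meaning; the actual protocol names and requirements would be up to the proposal.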

If we’re heading in this direction, let’s look at the pieces more closely:

  • Why the leading pound? # usually syntactically mimics a constant, function, or statement; here, it’s just sort of a floating token on the front of something else. And in constructs like #raw##"href="#sidebar""##, I think it significantly contributes to the claustrophobic feel.

  • I also think we might want to use backticks instead of pounds for the alternate delimiter. Again, this is partly because pound is a character with a ton of visual weight; if you have a mass of confusing backslashed punctuation in a string literal, removing the backslashes but adding a few pound characters is a two-steps-forward-one-step-back thing. Also, backticks are rarely used—of all the characters the lexer would allow us to choose, I suspect it's the least likely to appear in a string literal.

Four options:

// Leading pound with pound delimiter
run(#raw##"swift -### > "`mktemp \"tmp XXXX\""`""##)
// Leading pound with backtick delimiter
run(#raw``"swift -### > "`mktemp \"tmp XXXX\""`""``)
// No leading pound with pound delimiter
run(raw##"swift -### > "`mktemp \"tmp XXXX\""`""##)
// No leading pound with backtick delimiter
run(raw``"swift -### > "`mktemp \"tmp XXXX\""`""``)

Maybe this will sound funny, but then at least I contributed a few laughs.

How about introducing a new RawString type or perhaps dedicated initializer with the existing String type:

let rawString: RawString = "same as normal string but without escaping"

let rawString = String(raw: "same as normal string but without escaping")

but make it in such a way that the compiler intentionally does not escape those. Is that even possible?

Anyway, if it is, that feels very swifty to me, so I just wanted to share with you...

Yes why do we keep going over it when it plainly states in the Commonly Rejected Changes document that it ain't never gonna happen?