Allow more characters (like whitespaces and punctuations) for escaped identifiers

Another consideration is any runtime API that does, or may in the future, want to parse qualified Swift symbol names, for things like dynamic type or method lookup. If identifiers are allowed to include punctuation marks like . or <, for instance, this could confuse an API that tried to look up a type by name:

struct `Foo<Int>.Bar` { }

struct Foo<T> { struct Bar { } }

let t = typeByName("Foo<Int>.Bar")

It might be prudent to keep characters that are significant in the type grammar off-limits from identifiers to avoid introducing escaping problems for runtime APIs.
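To make the ambiguity concrete, here is a minimal sketch of a name parser that simply splits on the period. The helper `naiveComponents(of:)` is hypothetical and is not an existing runtime API; it only illustrates why a string like "Foo<Int>.Bar" cannot be told apart from a single escaped identifier once punctuation is allowed inside names.

```swift
// Hypothetical helper, not a real runtime API: split a qualified name
// on "." with no notion of escaping.
func naiveComponents(of qualifiedName: String) -> [String] {
    qualifiedName.split(separator: ".").map(String.init)
}

// The caller may have meant the single type named `Foo<Int>.Bar`,
// but a naive parser always sees a nested type Bar inside Foo<Int>.
let components = naiveComponents(of: "Foo<Int>.Bar")
print(components) // ["Foo<Int>", "Bar"]
```

Both declarations above would map to the same component list, so the lookup has no way to choose between them.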


That's not the point I was replying to, which was the statement "with this proposal, all identifiers have two syntactic forms, only one of them being always parseable". That read as if it was implying that it was this proposal that made that functionality possible, but it's already possible today. Did I misinterpret what you were saying?

I didn't miss it; more importantly, statements like these are unnecessarily antagonistic. Let's stick to the technical merits of the discussion.

I have experience writing code generators as well (I'm one of the maintainers of swift-protobuf), so I do understand the issues involved, especially when translating identifiers from one schema to another.

Generators today could take the easy way out and escape every identifier if they wanted to, because the language allows it. I don't think we've seen that to any great degree, and I don't think the chances are that much higher that we'd see it a lot more with new identifier rules. That's just conjecture on my part, but that's what your concerns were as well; do you have any concrete reasons to believe that generated code will suffer because of this change?

Code generators are also a very small subset of the day-to-day code written and read in Swift. I'm not sure that the possibility of someone writing a "bad" code generator should be a mark against a feature. And as someone who uses generated code in a number of my projects, I'm not sure I'd care that much if someone escaped all the identifiers in the generated code, because I don't look at the generated implementation that often. I'm usually more interested in viewing an interface-only API digest provided by Xcode, which would presumably only escape the identifiers that actually need it (since the escaping is not actually part of the identifier in the AST). But I realize that reasonable people may disagree on this point.

That's fine, though—the language doesn't have to provide an API for every possible language/architecture to identify identifiers. The grammar rules for identifiers in Swift are already fairly complex, especially with regard to the ranges of acceptable Unicode code points. To my knowledge, there's not an API anywhere today that allows third-parties to exactly match that in their own tooling regardless of language/architecture, so Swift providing one to third-parties who write their tools in C and Swift would still be a major improvement. And again, that would make it available to third-party tooling, satisfying the requirement in your original post; if someone chooses to write that tooling in a language that doesn't provide access to that API, then that's their choice, and they need to work around that decision.


Yes you did. I suggest a re-read.

That's a good point! We should definitely consider this.

One possibility would be for the API to parse the identifier the same way that the compiler would, thus requiring escaping inside the string if you wanted to handle identifiers that otherwise contained special delimiters:

struct `Foo<Int>.Bar` { }  // #1
struct Foo<T> { struct Bar { } }  // #2

let t = typeByName("Foo<Int>.Bar")  // #2
let t = typeByName("`Foo<Int>.Bar`")  // #1

There's some possible ambiguity about symbols that would need to be escaped in source but not in the string API call, like:

struct `Foo Bar` {}

// Should this work? The API probably doesn't *need* to escape the
// identifier here.
let t = typeByName("Foo Bar")

// Or should we require this, for consistency with source?
let t = typeByName("`Foo Bar`")

Off the top of my head, I'm not sure I have a strong preference on this one.
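As a rough sketch of the "parse it the way the compiler would" option, the following splitter treats a backtick-delimited span as one opaque component. The function `qualifiedNameComponents(_:)` is a hypothetical name used for illustration; no such API exists in the runtime today, and a real implementation would also need to handle generic arguments.

```swift
/// Hypothetical helper: splits a qualified name on "." while treating
/// backtick-delimited spans as single identifiers.
/// Returns nil when a backtick is left unclosed.
func qualifiedNameComponents(_ name: String) -> [String]? {
    var components: [String] = []
    var current = ""
    var insideBackticks = false

    for character in name {
        switch character {
        case "`":
            // The delimiters are not part of the identifier itself.
            insideBackticks.toggle()
        case "." where !insideBackticks:
            // An unescaped period separates two components.
            components.append(current)
            current = ""
        default:
            current.append(character)
        }
    }
    guard !insideBackticks else { return nil }
    components.append(current)
    return components
}

print(qualifiedNameComponents("Foo<Int>.Bar")!)   // ["Foo<Int>", "Bar"]
print(qualifiedNameComponents("`Foo<Int>.Bar`")!) // ["Foo<Int>.Bar"]
```

Note that this sketch accepts "Foo Bar" without backticks; whether that should be allowed is exactly the open question above.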


Thanks @Joe_Groff, valid consideration.

I do like this option, as it feels more coherent with the approach of the proposal by keeping the rule that "every char is allowed because this is an escaped identifier". If we go with that, I am more inclined to always respect the grammar, since it mirrors how we can statically refer to a type too:

`Foo Bar`() // Valid
let t = typeByName("`Foo Bar`") // Valid
Foo Bar() // Compiler error
let t = typeByName("Foo Bar") // Runtime error

I do not fully understand whether _typeByName currently supports only Swift mangled names or also the example you mentioned. Can you confirm which is the case?
If we don't currently support qualified complex type names, do you think @allevato's suggested option could be a valid one to implement if and when the runtime API supports them?

_typeByName currently only supports mangled names, that's correct, so it wouldn't immediately be a concern because the mangling handles special characters already. My concern was about hypothetical future APIs that might want to parse identifier names in their human-consumable form.


It also supports top-level classes by name, no?

class Foo {}

print(_typeByName("Module.Foo")!) // Module.Foo

This feels unimportant and not worth the extra complication. It does not meet the threshold, in my view.


Great. There were always some edge places where I needed the identifier to start with a number.

enum Dimension {
  case `1D`
  case `2D`
  case `3D`
  ...
}

enum Union3<A, B, C> {
  case `1`(A)
  case `2`(B)
  case `3`(C)
}

Hi Gwendal, thanks for bringing up the topic of code and documentation generators!

I do agree that the proposal, like any change to the grammar, may have an effect on this type of program, and I want to share why I believe the impact may not be significant.

The most popular code generators, documentation generators, and linters, such as Jazzy, Sourcery, SwiftLint and SourceDocs, all use SourceKit under the hood.
Since backticks are currently (prior to the proposal) treated as leading and trailing trivia, tools that already support escaped identifiers will automatically support the proposed change once they update SourceKit (an update that may be required anyway for new Swift versions).
Do you have concrete examples of popular tools where this change does have an impact?

Regarding the HTML documentation security concern, I find it hard to see how this proposal can contribute to the issue.
Any printed methods, identifiers, or comments should already be escaped, since Swift already supports many characters outside the seven-bit ASCII range that HTML accepts without escaping. For example, the < and > signs are in fact part of method declarations, and they should already be escaped.
I dug deeper and checked Jazzy's implementation: it does not have the issue mentioned, because it either escapes the Swift declaration or wraps it in a code tag.
I am assuming other documentation generators may do the same.
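For illustration only, this is roughly the kind of escaping a documentation generator needs before printing a declaration into HTML. The function `htmlEscaped(_:)` is a hypothetical helper written for this example and is not taken from Jazzy or any other tool.

```swift
/// Hypothetical helper: escapes the characters that are significant
/// in HTML text and attribute values.
func htmlEscaped(_ text: String) -> String {
    var result = ""
    for character in text {
        switch character {
        case "&": result += "&amp;"
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        case "\"": result += "&quot;"
        case "'": result += "&#39;"
        default: result.append(character)
        }
    }
    return result
}

// Generic declarations already contain < and > today, so a generator
// that prints them must escape regardless of this proposal.
print(htmlEscaped("func map<T>(_ transform: (Element) -> T) -> [T]"))
// func map&lt;T&gt;(_ transform: (Element) -&gt; T) -&gt; [T]
```

An escaped identifier such as `Foo<Int>.Bar` would pass through the same function unchanged in approach, which is why the proposal should not add a new class of problem here.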

I fully agree that we should be mindful of how a change may impact not only the language but the language ecosystem too, so thanks for raising this and prompting me to review the current state of Swift code and documentation generators.


That information is so out of date that when it actually was (sort of) true some people on this forum weren’t born yet. 😉


But yes, as a former contributor to Jazzy and SwiftLint who went on to write in‐house replacements, this won’t make any significant difference to any such tools. In case you missed it, you also have the lead developer (@allevato) of Swift’s official formatter (swift-format) actively campaigning to have this added.

This would actually make things much easier. Right now I have about 200 lines of code just dedicated to producing valid identifiers. The ability to slap spaced grave accents on either end and use a file name as‐is would be so much simpler.
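A minimal sketch of what that simplification could look like, assuming the proposed rules. Two assumptions are baked in and are not confirmed by this thread: that backticks and line breaks stay off-limits inside escaped identifiers, and that a deliberately simplified ASCII-only check is good enough to decide when escaping can be skipped (it ignores keywords and non-ASCII identifier characters). Both function names are hypothetical.

```swift
/// Assumption: a simplified, ASCII-only notion of a "plain" identifier.
/// Real Swift identifier rules are broader, and keywords are ignored here.
func isPlainASCIIIdentifier(_ name: String) -> Bool {
    guard let first = name.first, first.isASCII,
          first.isLetter || first == "_" else { return false }
    return name.dropFirst().allSatisfy {
        $0.isASCII && ($0.isLetter || $0.isNumber || $0 == "_")
    }
}

/// Hypothetical generator helper: uses the file name as-is, wrapping it
/// in backticks only when the simplified check above fails.
func identifier(forFileName fileName: String) -> String? {
    // Assumption: backticks and line breaks cannot appear inside an
    // escaped identifier, so such names are rejected.
    guard !fileName.isEmpty,
          !fileName.contains(where: { $0 == "`" || $0.isNewline })
    else { return nil }
    return isPlainASCIIIdentifier(fileName) ? fileName : "`\(fileName)`"
}

print(identifier(forFileName: "Icon")!)          // Icon
print(identifier(forFileName: "Launch Screen")!) // `Launch Screen`
print(identifier(forFileName: "1D")!)            // `1D`
```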


But the real reason I want it is that outside the English world, I’ve found that camel case just doesn’t always cut it and there is no legible solution without access to apostrophes, hyphens and other currently invalid joiners. French is one of the least problematic, but since I know at least one other person in this thread speaks it, I’ll use it for my examples. Which of the following uses the “right” style?

  • « aujourd’hui »

    • aujourdHui
    • aujourdhui
    • aujourd_Hui
    • aujourd_hui
    • aujourdeHui
    • `aujourd’hui`
  • « Faire quelque chose jusqu’à ce moment‐là. »

    • faireQuelqueChose(jusquÀ: ceMomentLà)
    • faireQuelqueChose(jusquà: ceMomentlà)
    • faireQuelqueChose(jusqu_À: ceMoment_Là)
    • faireQuelqueChose(jusqu_à: ceMoment_là)
    • faireQuelqueChose(jusqueÀ: ceMomentLà)
    • faireQuelqueChose(jusqueÀ: ceMoment_Là)
    • faireQuelqueChose(`jusqu’à`: `ceMoment‐là`)

To me, the currently valid ones all feel weird and are hard to choose between. But I would choose the last one in a heartbeat if it were available, especially if code completion can suggest it before the problematic character is typed, and can automatically place the accents on either side of the identifier.


But Swift is a language based on English. Why does it matter that its grammar and syntax fit poorly with other languages?

Do you also want to support localized versions of keywords, allowing arguments to be spelled before the function name, or adding argument labels after argument values?

The two that make some sense to me are:

faireQuelqueChose(jusquÀ: ceMomentLà)

and

`faire quelque chose`(`jusqu’à`: `ce moment‐là`)

Either write it all camel-case or all "correctly". Mixing camel-case with apostrophes and dashes is the weird thing to do in my opinion. I'd rather just use camel-case like the rest of Swift. But that's just an opinion and we can all disagree on styling guidelines.

Apostrophes are weird because there are (at least) two of them you might use. Did you mean to use the straight one (') or the curved one (’)? You used the typographically correct one above, but someone with a different keyboard configuration will use a straight one and it'll fail to match the function's name. Combine this with the necessary surrounding backticks and those identifiers are pretty inconvenient to type.

While not strictly related to the topic at hand, I'd also like to bring back the unsolved issue of characters not being normalized by the compiler. For instance, the à above could be represented as either U+00E0 ('à') or U+0061 U+0300 ('a' + combining grave accent), but the compiler will see the two representations as different identifiers despite their being equivalent in Unicode. This is a similar issue to the apostrophe above, except that it's not even visible at all.

So maybe we'd need more than comparison by Unicode canonical equivalence if we are to accept apostrophes and other punctuation in identifiers. Think about … vs ..., the hyphen - vs the en and em dashes, or " vs “ vs ”.
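Swift's own String type can be used to illustrate both points. This is only a sketch of the distinction: it shows how String comparison behaves at run time and says nothing about how the compiler compares identifiers, which is the claim made above.

```swift
let precomposed = "\u{E0}"   // "à" as the single scalar U+00E0
let decomposed = "a\u{300}"  // "a" followed by U+0300 COMBINING GRAVE ACCENT

// String equality in Swift respects canonical equivalence...
print(precomposed == decomposed) // true

// ...but the underlying scalars and bytes differ, which is what a
// scalar-by-scalar or byte-by-byte comparison would see.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)) // false
print(Array(precomposed.utf8)) // [195, 160]
print(Array(decomposed.utf8))  // [97, 204, 128]

// Canonical equivalence does not help with look-alike punctuation:
// U+0027 (straight) and U+2019 (curved) are simply different characters.
print("aujourd'hui" == "aujourd\u{2019}hui") // false
```

So normalization alone would unify the two spellings of à, but matching the two apostrophes would need some additional, looser rule.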


While this would be independently nice to have, I think it's best to separate it from the discussion of this pitch. There are a number of blocking technical issues which prevent implementing Unicode normalization for identifiers in the compiler anytime soon. It will likely involve dropping the standard library's ICU dependency in favor of sharing a slimmed-down version of the data files between the compiler and stdlib, which would be a large effort.


-1
I expect most people who think this is a terrible idea are keeping quiet to avoid getting flamed (certainly that's what the reactions in a local Slack dev group suggest), so I guess I'll have to speak up.

Escaped identifiers are a drag to parse visually and increasing their occurrence in code would make code look worse.

The example from the proposed solution looks terrible and I'd personally hate to work with code like this. Save prose for comments.

Do you believe there are no use cases where readability is enhanced? If not, why should we be afraid that developers will abuse the ability? It should be left to policy, not limitations in the parser.


LOL. Human nature and history suggest that it's not about "fear" but about seeing the patterns during the last 30 years of software engineering. Plenty of bad code out there. Think custom operators and operator overloading in C++ … a lot of scary bad code resulted.

I admit that I haven't run into as much bad emoji in identifiers and operators as I feared might appear, so maybe it won't be as bad as it could be…

I don't see any significant gain. I can maybe see a slight interest in the testing context, but really not that much, and I wouldn't want to have to deal with a test file full of tests with crazy long identifiers and backticks all over the place (ones like the quote in my post). Just too darn ugly.

There are likely to be many consequences that we regret, it seems to me. Here's a simple one: when we want to quote code in markdown it'll be like `` `My Crazy Routine Name - the one I really like, you know?`() ``: so much punctuation noise….

anyway.

Maybe we should move to the review thread (SE-0275: Allow more characters (like whitespaces and punctuations) for escaped identifiers) now that it is up.


Didn't realize that. Thank you. Moved my -1 to that. Someone who has the permissions necessary to edit the proposal at the top could maybe add a link to the review thread to make it obvious that there is one now and direct people to it.

Thanks for the suggestion, I edited the original post to include the link to the proposal review.
