Allow more characters (like whitespaces and punctuations) for escaped identifiers

allevato · December 29, 2019, 6:56pm

Ah, right, it's coming back to me now (I tinkered with the implementation briefly a few months ago). The Identifier type has an isOperator() method (and some helpers) that is used throughout Sema and elsewhere to determine whether an identifier is an operator or not, and it only looks at the first code point because right now that's enough to distinguish it.

If you wanted to apply my suggested rule above that a backticked identifier is an operator if and only if all of its characters are operator characters, then you could modify that function accordingly, and that should fix the issue you're seeing (assuming there are no other required changes elsewhere). There's a comment in isOperator() about caching that calculation, and if you're checking every character instead of just the first one, it would probably be a good opportunity to resolve that. (There's also some duplication that would be nice to clean up, because the code point ranges for operators are listed both in Lexer.cpp and Identifier.h.)
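To make the difference between the two checks concrete, here is a minimal sketch. It is not the compiler's implementation (which is C++ in Identifier.h); the function names are hypothetical, and the character set is deliberately limited to the ASCII operator characters, whereas the real grammar also admits many Unicode ranges.

```swift
// Deliberately simplified: only the ASCII operator characters are listed.
// The real grammar also allows a number of Unicode ranges, omitted here.
let asciiOperatorCharacters: Set<Character> = [
    "/", "=", "-", "+", "!", "*", "%", "<", ">", "&", "|", "^", "~", "?",
]

// Behavior as described above: only the first character is inspected.
// (Hypothetical name, for illustration only.)
func isOperatorByFirstCharacter(_ name: String) -> Bool {
    guard let first = name.first else { return false }
    return asciiOperatorCharacters.contains(first)
}

// Suggested rule: an operator if and only if every character is an
// operator character. (Hypothetical name, for illustration only.)
func isOperatorByAllCharacters(_ name: String) -> Bool {
    return !name.isEmpty && name.allSatisfy { asciiOperatorCharacters.contains($0) }
}
```

Under these assumptions, an escaped identifier such as "+ plus" is classified as an operator by the first check but not by the second, which is the case the suggested rule is meant to handle.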

Another issue I remember running into: Make sure to add some tests that run some escaped identifiers through -emit-silgen and then feed that SIL back into the compiler to test SILPrinter and SILParser. Currently, SILPrinter only escapes identifiers that match keywords, so you'll need to extend that logic to cover other identifiers that require escaping under the proposed new rules, and then also make sure that SILParser handles those correctly (which hopefully falls out naturally because I think the lexer is the same).

benrimmington · December 30, 2019, 5:31am

macOS Catalina has a thousand system frameworks (in "/System/Library/Frameworks/" and "/System/Library/PrivateFrameworks/").

Why can't each team have a three-letter prefix for their modules? For example, swift-tools-support-core has the TSCUtility module.

allevato · December 30, 2019, 7:01am

This is a bit of a non sequitur, IMO; even if the proposed behavior wasn't accepted or didn't allow for the module names I want to use, the behavior we have implemented today in Bazel would remain: we automatically derive the module name by converting the build target label //path/to/package:target_name to path_to_package_target_name.

Since modules are already uniquely identified by the path to their build target, there's no benefit to having users assign names manually to them instead of generating the names. That would only serve as a vector for introducing possible error, because developers are human and can make mistakes, and the cost of such a mistake could be a broken build and/or a difficult migration. (Interestingly, the example you give somewhat proves my point, because TSCUtility was originally named Utility in SwiftPM and could not be used in a build that had any other target named Utility anywhere in its SPM dependency graph. This isn't hypothetical; I actually ran into it once.)

So for the purposes of the discussion here, the focus should be on how the proposed feature could improve the ability of the language to use the existing label as the module name compared to the existing transformation that we do today. Different naming schemes also don't satisfy some of the desired tooling goals mentioned previously, like being able to use the import lines in source files to generate build definitions because the names would be an exact match instead of being elements in the codomain of a non-reversible transformation.
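As a rough illustration of why the current transformation is non-reversible, here is a minimal sketch. The helper name is hypothetical and the actual Bazel rule may differ in its details; the sketch only assumes that "/" and ":" are both replaced by "_", as in the example label above.

```swift
// Hypothetical helper: derives a module name from a build target label by
// dropping the leading "//" and replacing "/" and ":" with "_".
// This is an assumption based on the example label, not the real Bazel logic.
func derivedModuleName(fromLabel label: String) -> String {
    var trimmed = Substring(label)
    if trimmed.hasPrefix("//") {
        trimmed = trimmed.dropFirst(2)
    }
    let mapped = trimmed.map { c -> Character in
        (c == "/" || c == ":") ? "_" : c
    }
    return String(mapped)
}
```

Under these assumptions, //path/to/package:target_name and //path/to:package_target_name both produce path_to_package_target_name, so the label cannot be recovered from the module name alone.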

jawbroken · December 31, 2019, 8:06am

I think the right answer for backticked identifiers is that they're equivalent to their non-backticked form (if the non-backticked form is possible to use). This handles + but also functionName, etc. You're otherwise free to expand the space of backticked-only identifiers.

codedbypm · January 2, 2020, 10:12pm

Is it really useful for non-test method names? In that specific case, what about a syntax in the style of an annotation?

@test("given X when Y then Z") {
    // your test code here
}

adellibovi · January 2, 2020, 11:00pm

In my opinion, introducing a new annotation would increase the complexity of the feature and make it harder to understand. For example, what is @test? Is it a function, or is it an annotation that takes a string and a closure? How would that work with the Xcode test runner? What would be the benefit of using @test?

adellibovi · January 3, 2020, 4:08pm

Hello everyone, a quick update!

The proposal now includes the defined grammar for escaped identifiers and how to handle Objective-C interoperability. I will now try to move it forward by asking for feedback from the Core Team, as it looks like there are no remaining incomplete points to be addressed.

Thanks again to all of you who helped shape this proposal, in particular to @allevato for helping me out with some parts of the implementation.

clayellis · January 3, 2020, 4:59pm

Will this enable identifiers beginning with a number? I assume it won't, but asking anyways.

adellibovi · January 3, 2020, 5:26pm

Yes it will!

The proposal removes any character constraint (apart from the $ prefix; in that case the compiler will emit a diagnostic error, since the $ prefix is reserved for compiler internals).

My theory is that non-escaped identifiers cannot start with a number because of numeric literal parsing efficiency and (maybe) because mangling did not support it.
Escaped identifiers will not conflict with numeric parsing, and mangling does support names starting with a digit; therefore, I saw no reason to keep this limitation.

May I ask why you assumed that starting with a number would still not be allowed? I am asking to understand whether the proposal (specifically the grammar section) was not clear enough or whether it was something that you were expecting. If you have in mind any use cases for identifiers starting with a number, I would love to hear them!

clayellis · January 3, 2020, 5:34pm

That's actually great to hear. I can think of two cases that will immediately benefit:

Allowing typed HTTP status codes to be identified by the raw status code (e.g. HTTPStatus.'300')
Allowing for asset names to be represented directly (e.g. a wrapper around SF Symbols could use identifiers that match the asset name exactly like '10.circle' instead of _10_circle.)

(I used ' in place of the backtick because Markdown formatting was tripping over it.)

I didn't read the grammar, but now that I've read it, it's clear. I assumed that this:

was the approach being taken which would've disallowed identifiers beginning with numbers (I think.)

gwendal.roue · January 3, 2020, 6:03pm

With this proposal, all identifiers have two syntactic forms, only one of them being always parseable.

A bad side effect is that this proposal will break code generators, documentation generators, and other programs of this family.

A fix, for those generators, will be to escape all identifiers, just in case they could not be parsed raw.

// Welcome to the future
import `TheModule`
class `TheClass`: `TheProtocol` {
    func `theFunc`() { ... }
    func `when I told you escaping was useful`() { ... } 
}

It reminds me of generated SQL (except that almost nobody reads SQL, when we read Swift all day long):

-- Say hello to double quotes, just in case
SELECT * FROM "player" WHERE "id" = 1

Let's take a practical example, starting with our own tooling: the Swift interface generator embedded in Xcode. It escapes `default`, `self`, and the like, but preserves other identifiers without escaping them, for legibility. Thank you, this is a quality tool.

The Swift interface generator ships with a function which decides if an identifier should be escaped or not (returns true for default, false for foobar).

This function is currently simple (it only has to check for a known list of keywords).

This function will become complex.

It is likely that this function will not be easily available to third-party tooling. Those tools may just give up and escape everything. Or ship with a poor man's buggy implementation.

It is likely that this function will be slow, forcing performance-focused tools to, again, escape everything.

With this proposal, all Swift identifiers can mean something in other languages.

This gives a security consideration: as long as HTML documentation generators are not "fixed" for this change, we'll see funny code appear:

func `<script>alert("pwnd")</script>`() { ... }
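For illustration, the kind of fix a documentation generator would need is sketched below: escaping identifier text before embedding it in HTML. The function name is hypothetical and this is a minimal sketch rather than any particular generator's implementation.

```swift
// Hypothetical helper: escapes the characters that are significant in HTML
// so that an identifier can be embedded as plain text.
func htmlEscaped(_ text: String) -> String {
    var result = ""
    for character in text {
        switch character {
        case "&": result += "&amp;"
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        case "\"": result += "&quot;"
        case "'": result += "&#39;"
        default: result.append(character)
        }
    }
    return result
}
```

With this in place, the identifier above would be rendered as inert text instead of being interpreted as markup.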

I stop here, but I'd be happy if the authors of the pitch would consider all those unfortunate consequences with care :-) Do we want "fallacies developers think about Swift identifiers" web pages to flourish?

allevato · January 3, 2020, 6:34pm

But isn't that already the case today? You can escape identifiers today that don't actually need to be escaped:

let `x` = 5
print(x)    // 5
print(`x`)  // also 5

"More" complex, but it's difficult to imagine it being significantly more complex, based on the implementation already provided by the PR author. Now instead of checking a list of keywords, it also checks the token to determine if it contains any non-identifier-safe code points (with a special case for property wrapper dollar signs).
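As a rough sketch of what such a check might look like (not the implementation from the PR): the keyword list below is an illustrative subset, the identifier-safe test is simplified to ASCII letters, digits, and underscores, and the special case for dollar signs is omitted. All names are hypothetical.

```swift
// Illustrative subset only; the compiler's real keyword list is longer.
let sampleKeywords: Set<String> = ["default", "self", "class", "func", "let", "var"]

// Simplified assumption: only ASCII letters, digits, and "_" are treated as
// identifier-safe, and a digit is not allowed in the first position.
func isIdentifierSafe(_ scalar: Unicode.Scalar, isFirst: Bool) -> Bool {
    switch scalar {
    case "a"..."z", "A"..."Z", "_":
        return true
    case "0"..."9":
        return !isFirst
    default:
        return false
    }
}

// Returns true when the name is a keyword or contains any code point that
// is not identifier-safe under the simplified rule above.
func needsEscaping(_ name: String) -> Bool {
    if sampleKeywords.contains(name) { return true }
    for (index, scalar) in name.unicodeScalars.enumerated() {
        if !isIdentifierSafe(scalar, isFirst: index == 0) { return true }
    }
    return false
}
```

Under these assumptions, default, 10.circle, and Foo Bar would need escaping, while foobar would not.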

Special cases are unfortunate, of course, but that brings us to the next point:

Why do you think it's unlikely for that function to be available to third-party tooling? The Swift syntax parser is already factored out into a dylib that ships with the toolchain for use by SwiftSyntax. I think it would be entirely reasonable and within the realm of possibility to add the relevant C binding to that library and Swift API to SwiftSyntax to answer the question "does this identifier need to be escaped?"

Moreover, since the list of keywords that the compiler checks is already not available to third-party tooling, those tools already have to duplicate some logic and may get it wrong. That list of keywords may also differ subtly between Swift versions (although hopefully it no longer does, for source compatibility reasons). The list of keywords where this is necessary is also somewhat non-obvious, because certain reserved words are context-sensitive. This recent thread is a great example, where the word set can be used as an identifier outside of a computed property, but in one specific location in an accessor block, it must be escaped or it's interpreted as a keyword:

struct S {
  var set = [0]  // OK, "set" is unambiguous here

  var first: Int? {
    set.first  // error: "set" treated as start of accessor
  }
}

So the conditions under which code generators have to mangle or escape identifiers are already somewhat fraught with edge cases, and I don't think expanding the space of valid escaped identifiers exacerbates that significantly—especially if we take the opportunity to provide clients with an API that matches the one used by the compiler, which would be an improvement over the state of things today.

gwendal.roue · January 3, 2020, 6:34pm

adellibovi:

It will enable to have method names (or other identifiers) with a more readable and natural language like the following:
func `test validation should succeed when input is less than ten`()

Since my previous message may sound like it ruins the hopes of many, I'd like to suggest two things which may address the original motivation:

@description("test validation should succeed when input is less than ten")
func test#() { ... }

@description("test validation should fail when input is more than twenty")
func test#() { ... }

The first is a suffix (here #) which has the compiler generate a unique name for the function, preserving its prefix. The name is unknown to the programmer, but unique in the relevant scope. XCTest, for example, finds as many test prefixed selectors as expected.

The second is a free-form annotation (here @description) which is made available at runtime for whatever purpose (like printing something) - I don't know how, this is just the baby of an idea.

gwendal.roue · January 3, 2020, 6:38pm

I can, but I don't. You missed the paragraph about quality generators.

Why do you think it's unlikely for that function to be available to third-party tooling?

Because generators are written in many languages, and run on many architectures, most of them won't have access to the holy dylib.

My post contains other objections. You don't have to rush :-)

Joe_Groff · January 3, 2020, 6:57pm

Another consideration is runtime API that does, or may in the future, want to be able to parse qualified Swift symbol names, for things like dynamic type or method lookup. If identifiers are allowed to include punctuation marks like . or <, for instance, this could confuse an API that tried to look up a type by name:

struct `Foo<Int>.Bar` { }

struct Foo<T> { struct Bar { } }

let t = typeByName("Foo<Int>.Bar")

It might be prudent to keep characters that are significant in the type grammar off-limits from identifiers to avoid introducing escaping problems for runtime APIs.

allevato · January 3, 2020, 7:08pm

That's not the point I was replying to, which was the statement "with this proposal, all identifiers have two syntactic forms, only one of them being always parseable". That read as if it was implying that it was this proposal that made that functionality possible, but it's already possible today. Did I misinterpret what you were saying?

I didn't miss it; more importantly, statements like these are unnecessarily antagonistic. Let's stick to the technical merits of the discussion.

I have experience writing code generators as well (I'm one of the maintainers of swift-protobuf), so I do understand the issues involved, especially when translating identifiers from one schema to another.

Generators today could take the easy way out and escape every identifier if they wanted to, because the language allows it. I don't think we've seen that to any great degree, and I don't think the chances are that much higher that we'd see it a lot more with new identifier rules. That's just conjecture on my part, but that's what your concerns were as well; do you have any concrete reasons to believe that generated code will suffer because of this change?

Code generators are also a very small subset of the day-to-day code written and read in Swift. I'm not sure that the possibility of someone writing a "bad" code generator should be a mark against a feature. And as someone who uses generated code in a number of my projects, I'm not sure I'd care that much if someone escaped all the identifiers in the generated code, because I don't look at the generated implementation that often. I'm usually more interested in viewing an interface-only API digest provided by Xcode, which would presumably only escape the identifiers that actually need it (since the escaping is not actually part of the identifier in the AST). But I realize that reasonable people may disagree on this point.

That's fine, though—the language doesn't have to provide an API for every possible language/architecture to identify identifiers. The grammar rules for identifiers in Swift are already fairly complex, especially with regard to the ranges of acceptable Unicode code points. To my knowledge, there's not an API anywhere today that allows third-parties to exactly match that in their own tooling regardless of language/architecture, so Swift providing one to third-parties who write their tools in C and Swift would still be a major improvement. And again, that would make it available to third-party tooling, satisfying the requirement in your original post; if someone chooses to write that tooling in a language that doesn't provide access to that API, then that's their choice, and they need to work around that decision.

gwendal.roue · January 3, 2020, 7:10pm

Yes you did. I suggest a re-read.

allevato · January 3, 2020, 7:17pm

That's a good point! We should definitely consider this.

One possibility would be for the API to parse the identifier the same way that the compiler would, thus requiring escaping inside the string if you wanted to handle identifiers that otherwise contained special delimiters:

struct `Foo<Int>.Bar` { }  // #1
struct Foo<T> { struct Bar { } }  // #2

let t = typeByName("Foo<Int>.Bar")  // #2
let t = typeByName("`Foo<Int>.Bar`")  // #1

There's some possible ambiguity about symbols that would need to be escaped in source but not in the string API call, like

struct `Foo Bar` {}

// Should this work? The API probably doesn't *need* to escape the
// identifier here.
let t = typeByName("Foo Bar")

// Or should we require this, for consistency with source?
let t = typeByName("`Foo Bar`")

Off the top of my head, I'm not sure I have a strong preference on this one.
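To illustrate the first possibility, here is a minimal sketch of how a string-based lookup API might split a qualified name while treating backtick-delimited spans as opaque. The helper is hypothetical; it does not parse generic arguments (so a "." inside unescaped angle brackets would still split), and no such runtime API exists today.

```swift
// Hypothetical helper: splits a qualified name on "." while treating
// anything between backticks as a single opaque identifier.
// Generic argument lists are not parsed in this sketch.
func splitQualifiedName(_ name: String) -> [String] {
    var components: [String] = []
    var current = ""
    var insideBackticks = false
    for character in name {
        if character == "`" {
            // Entering or leaving an escaped identifier.
            insideBackticks.toggle()
        } else if character == "." && !insideBackticks {
            // An unescaped dot separates two components.
            components.append(current)
            current = ""
        } else {
            current.append(character)
        }
    }
    components.append(current)
    return components
}
```

Under these assumptions, the escaped string yields the single component Foo<Int>.Bar, while the unescaped string yields the two components Foo<Int> and Bar, matching cases #1 and #2 above.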

adellibovi · January 3, 2020, 9:53pm

Thanks @Joe_Groff, valid consideration.

I do like this option, as it feels more coherent with the approach of the proposal by keeping the "every char is allowed because this is an escaped identifier" rule. If we go with that, I am more inclined to always respect the grammar, since it mirrors how we can statically reference a type too:

`Foo Bar`() // Valid
let t = typeByName("`Foo Bar`") // Valid
Foo Bar() // Compiler error
let t = typeByName("Foo Bar") // Runtime error

I do not fully understand whether _typeByName currently supports only Swift mangled names or also the example you mentioned. Can you confirm which is the case?
If we don't currently support qualified complex type names, do you think @allevato's suggested option may be a valid one that could be implemented if/when the runtime API supports them?

Joe_Groff · January 3, 2020, 9:55pm

_typeByName currently only supports mangled names, that's correct, so it wouldn't immediately be a concern because the mangling handles special characters already. My concern was about hypothetical future APIs that might want to parse identifier names in their human-consumable form.