Allow more characters (like whitespaces and punctuations) for escaped identifiers

In my opinion the backtick shouldn't be part of the actual identifier, as it is more a syntactical help for the compiler to pick the method name with spaces and punctuation. Basically, #function will just be test foo ` bar and that may solve already the issue you are raising. :smile:

Edit: I checked and the backtick are already not part of the identifier definition i.e.: `x` and x are exactly the same

2 Likes

Thank you for bringing up the topic of identifiers with non-identifier characters! This is something I've been thinking about for a while, and the use case you describe is a great one—and probably a better starting sales pitch than what I need it for, so I'm glad you started the discussion instead of me :smile:

The use case I'm most interested in is allowing non-identifier characters in module names. At Google (and other companies using Bazel in a monorepo), a particular app could be made up of tens (or possibly hundreds!) of Swift and Objective-C modules at different paths in the monorepo, owned by various different teams. Each build target has a label of the form //path/to/package:target_name. Since module names cannot collide anywhere in the build graph for an application, we can't rely on teams to choose their own module names because two teams could choose something common like Utility for some internal library. So, we mangle the Bazel target label to turn it into an identifier, and the Swift code has to do:

import path_to_package_target_name

This works, but the mapping is not reversible (to try to keep it as simple and obvious as possible), so there is the potential for collision in rare cases, and there's still some mental load to convert the label (which you already know, because you have to express the dependency in your build file) into the module name.

I would love to allow Bazel users to write this instead:

import `//path/to/package:target_name`

This would have a couple huge benefits:

  1. There's no mental load to convert—the module name is the target label, period.
  2. It's now reversible, which means we can build great tooling around this. Specifically, we can make Swift source files be the source of truth and generate the build files (i.e., the dependency lists) from them, instead of making the user manually write them in two places. We can't do that today unless we maintained a master mapping from module names back to build target labels somewhere.

Now, backticked identifiers doesn't solve my problem completely—we'd need an alternate way to pass these modules to the compiler since you can't have a file named //path/to/package:target_name.swiftmodule (well, not easily), but that's a separate driver/frontend issue that I don't think needs to impact this feature.

So, huge +1 to this idea in general, and it should apply uniformly to all identifiers (modules, variables, functions, etc.). Backticks already mean "escape this reserved word that isn't a suitable identifier on its own and make it an identifier", so replacing "reserved word" with "sequence of characters" seems like the exact right thing to do.

I strongly disagree that we should have arbitrary restrictions like these (and not only because it would prevent my use case above). Many programming language features can be abused, but instead, we just trust users to make intelligent, grown-up decisions about their code. With identifiers, you can already do confusing things today:

struct A {}
let a = Α()  // error: use of unresolved identifier 'Α'

(Line 1 is Latin uppercase A; line 2 is Greek uppercase Alpha).

And that's not even touching emoji, which Swift has allowed emoji in identifiers since day 1 and we haven't seen an epidemic of users trying to shove those into identifiers, so I think we can trust users here as well. If someone gives identifiers an unusable or confusing name, good solutions include making a lint rule for it or calling it out in a code review, but not crippling the feature arbitrarily and limiting legitimate use cases.

I think raw string literals are a better thing to emulate here, by extending the grammar for backticked identifiers, because it generalizes more nicely than backslash escaping:

func `test foo bar`() {}
func #`test foo`bar`#() {}
func ##`test foo`#bar`##() {}

But these are certainly rare scenarios.

Tools would most likely have to be updated anyway to handle identifiers that contain backticks or non-identifier characters properly, so I don't think we need more rules like that concatenation one. They would just add complexity with little benefit.

4 Likes

My initial reaction (well, after the “huh…”) is that this seems fine iff backticks are applied universally. i.e. the tokeniser treats them the same anywhere they appear (other than in certain obvious exceptions, like inside string literals), as a way of suspending normal rules on whitespace, or other symbols, delineating tokens.

That to me seems justifiable from an ideological perspective - the “here’s your way out of whatever awkward edge cases may arise, because naming is hard”. Not something necessarily recommended, but fairly harmless and easy to comprehend if & when you encounter it for the first time.

If this were restricted to just certain places, such as method names (and IIRC variable names already), then I feel that I’d have to scrutinise it more heavily - e.g. is it a good idea to allow writing essentially arbitrary human language in a method name; is this something that’s better handled by documentation / comments, or decorators; is this really that much better than just using underscores (which a test harness could trivially replace with spaces if prettiness of the output is the concern); etc.

I suspect it would also serve humans well - if not also the tokeniser - to not allow implicit concatenation of backticked content with non-backticked content, i.e. func test`foo bar` should not be allowed; use func `testfoo bar` instead. It’s simpler to reason about (again, by defining ` as essentially a special token delimiter).

5 Likes

Thanks everyone for this first round of feedbacks, I do appreciate the different point of views.

I will soon update the original pitch including some of your suggestions, mainly:

  • extending support to every kind of escaped identifiers (methods, property, imports etc..)
  • clarifying that Swift already supports referencing and calling to escaped identifiers (i.e.: foo.`method`() or foo.`property`), so the proposal can keep what is already in place.

In the meanwhile... I wanted to share a sneak preview of a working prototype that is fully compatible with Xcode test runner :partying_face:

17 Likes

And that's not even touching emoji, which Swift has allowed emoji in identifiers since day 1 and we haven't seen an epidemic of users trying to shove those into identifiers, so I think we can trust users here as well. If someone gives identifiers an unusable or confusing name, good solutions include making a lint rule for it or calling it out in a code review, but not crippling the feature arbitrarily and limiting legitimate use cases.

I see you point, good call.

In the meanwhile... I wanted to share a sneak preview of a working prototype that is fully compatible with Xcode test runner :partying_face: ...

Looks good!

That's great!

One thing that occurred to me after my post above was that the identifiers I wanted to use (//path/to/package:target_name) contain operator characters (and indeed, start with one). If we want to allow backticks to escape non-identifier characters in identifiers, we need to give consideration to how operator characters are handled. Some open questions and thoughts which are partly motivated by my own needs/use case:

  • Should operator characters be allowed in backticked identifiers? I think so; not only for my import use case, but it might be nice to write func `test +`() { ... } if I'm testing the + operator of a custom type.

  • Should backticks turn sequences entirely composed of operator characters into regular non-operator identifiers? For example, should `..<` or `+` be treated as separate non-operator identifiers to ..< and +? I think the answer should be no; that could lead to confusion, and there has also been some interest in using backticks around operators to reference them as type members, and I think these two features would tie nicely together.

  • What about backticked identifiers that contain mixed operator and non-operator characters? Any difference depending on whether the identifier starts with an operator character vs. just containing one? I think it should be fine to mix them, and I don't think it should the behavior should differ whether the identifier starts with an operator character or not; in both cases, it should be a regular identifier. (Selfishly, these are both important to my module use case.)

So, to summarize, IMO a backticked identifier may contain operator characters, but a backticked sequence that contains only operator characters is still an operator, not a regular identifier. In other words:

static func + (lhs: Foo, rhs: Foo) -> Foo {}  // an operator, of course
static func `+` (lhs: Foo, rhs: Foo) -> Foo {}  // equivalent, still an operator
let `-` = 5  // still not allowed, `-` is the same as -, an operator

func `test +`() {}  // a regular identifier

Now, strict application of my proposed rule could allow some weird situations:

func `+ -`() {}  // a regular identifier, because SPACE is not an operator character

We could try to add more rules, like "an identifier may not consist only of operator characters and whitespace, even when surrounded by backticks," but I don't know if adding more rules to address that would actually help or if it would just make things more complicated. I think it goes back to trusting users to not do silly things, just as we do with Unicode support today, and enforcing additional rules through style guides, linters, and code reviews. After all, we can put all the mechanical rules that we can dream of but nothing would stop a user from naming an identifier jidjfosijfsiodno, so figuring out where to draw the line is important, but also challenging.

2 Likes

Backticks might also be required (by the compiler) or optional (for readability) if the SE-0111 regression is fixed.

One of the further suggestions was for multi-line compound names (with insignificant whitespace).

If I had to choose, I'd prefer to reserve backticks for closures with compound names.

Thanks for bringing up this edge case.

I extended the grammar to give it a try and see how the compiler behaves.

  • func `test +`() { ... } this look legit and already works in the prototype
  • Defining a static func + (lhs: Self, rhs: Self) or static func `+` (lhs: Self, rhs: Self) is exactly the same thing, therefore, in this case, it will be considered as operator. And, as side effect, that means we can already reference as Self.`+` for free! Which IMHO is a nice thing. (cc: @dan-zheng maybe you are interested in the findings :slight_smile: )
  • Unfortunately, when "starts with an operator" is where things get tricky and with my current knowledge I am not sure where/how this is handled. For your use case, though, we could just do path://path/to/package:target_name instead and it will work :smiley: So maybe we could limit this edge case for the sake of implementation simplicity. What do you think?
2 Likes

It's great to see that these two things fell out naturally already!

To clarify, I wasn't concerned about backticked identifiers starting with an operator, just with an operator character, so that should significantly simplify the problem (we can't know at lexing time whether a specific sequence of characters is defined as an operator somewhere, only whether it could be an operator). From a quick glance at your implementation, it looks like your isValidIdentifierEscaped{Continuation,Start}CodePoint functions would already treat something like `//path/to/package:target_name` as an identifier, correct?

For what it's worth, dropping the leading double-slash from the label would be a reasonable compromise if I had to for my use case, but I think it's possibly cleaner if there are fewer restrictions.

1 Like

Correct, in the current implementation those are considered as valid identifiers, but are currently conflicting with Sema, which believes (from my understanding, I could be wrong) those identifiers are operators (for func, not actually for imports). I will keep investigating :slightly_smiling_face:

1 Like

Ah, right, it's coming back to me now (I tinkered with the implementation briefly a few months ago). The Identifier type has an isOperator() method (and some helpers) that is used throughout Sema and elsewhere to determine whether an identifier is an operator or not, and it only looks at the first code point because right now that's enough to distinguish it.

If you wanted to apply my suggested rule above that a backticked identifier is an operator if and only if all of its characters are operator characters, then you could modify that function accordingly, and that should fix the issue you're seeing (assuming there are no other required changes elsewhere). There's a comment in isOperator() about caching that calculation, and if you're checking every character instead of just the first one, it would probably be a good opportunity to resolve that. (There's also some duplication that would be nice to clean up, because the code point ranges for operators are listed both in Lexer.cpp and Identifier.h.)

Another issue I remember running into: Make sure to add some tests that run some escaped identifiers through -emit-silgen and then feed that SIL back into the compiler to test SILPrinter and SILParser. Currently, SILPrinter only escapes identifiers that match keywords, so you'll need to extend that logic to cover other identifiers that require escaping under the proposed new rules, and then also make sure that SILParser handles those correctly (which hopefully falls out naturally because I think the lexer is the same).

3 Likes

macOS Catalina has a thousand system frameworks (in "/System/Library/Frameworks/" and "/System/Library/PrivateFrameworks/").

Why can't each team have a three-letter prefix for their modules? For example, swift-tools-support-core has the TSCUtility module.

This is a bit of a non sequitur, IMO; even if the proposed behavior wasn't accepted or didn't allow for the module names I want to use, the behavior we have implemented today in Bazel would remain: we automatically derive the module name by converting the build target label //path/to/package:target_name to path_to_package_target_name.

Since modules are already uniquely identified by the path to their build target, there's no benefit to having users assign names manually to them instead of generating the names. That would only serve as a vector for introducing possible error, because developers are human and can make mistakes, and the cost of such a mistake could be a broken build and/or a difficult migration. (Interestingly, the example you give somewhat proves my point, because TSCUtility was originally named Utility in SwiftPM and could not be used in a build that had any other target named Utility anywhere in its SPM dependency graph. This isn't hypothetical; I actually ran into it once.)

So for the purposes of the discussion here, the focus should be on how the proposed feature could improve the ability of the language to use the existing label as the module name compared to the existing transformation that we do today. Different naming schemes also don't satisfy some of the desired tooling goals mentioned previously, like being able to use the import lines in source files to generate build definitions because the names would an exact match instead of being elements in the codomain of a non-reversible transformation.

2 Likes

I think the right answer for backticked identifiers is that they're equivalent to their non-backticked form (if the non-backticked form is possible to use). This handles + but also functionName, etc. You're otherwise free to expand the space of backticked-only identifiers.

1 Like

Is it really useful for non-test method names? In that specific case, what about a syntax in the style of an annotation?

@test(“given X when Y then Z”) {
    // your test code here
}
1 Like

In my opinion the introduction of a new annotation will increase the complexity and the understanding of the feature, for example, what is @test? is it a function or is it annotation that takes a string and a closure? How that will work with Xcode test runner? What would be the benefit of using @test?

3 Likes

Hello everyone, a quick update!

The proposal now includes the defined grammar for escaped identifiers and how to handle Objective-C interoperability. I will now try to move it forward by asking feedback from the Core Team, as it looks like there are no remaining incomplete points to be addressed.

Thanks again to all of you, who helped shaping this proposal, in particular to @allevato for helping me out for some parts of the implementation.

3 Likes

Will this enable identifiers beginning with a number? I assume it won't, but asking anyways.

Yes it will! :grin:

The proposal removes any character constrain (apart from prefix, in that case the compiler will emit a diagnostic error since prefix is reserved for compiler internals).

My theory is that non escaped identifiers are avoiding starting with a number because of numeric literals parsing efficiency and (maybe) mangling was not supporting it.
Escape identifiers will not conflict with numeric parsing and mangling does support names starting with a digit, therefore, I saw no reason to keep this limitation.

May I ask why you were assuming that starting with a number would still not be enabled? I am asking to understand if from the proposal (specifically the grammar section) was not clear enough or if it was something that you are expecting. If you have in mind any use case for numeric-starting-identifiers I would love to hear those! :slight_smile:

1 Like

That's actually great to hear. I can think of two cases that will immediately benefit:

  • Allowing typed HTTP status codes to be identified by the raw status code (e.g. HTTPStatus.'300')
  • Allowing for asset names to be represented directly (e.g. a wrapper around SF Symbols could use identifiers that match the asset name exactly like '10.circle' instead of _10_circle.)

(I used ' in place of the backtick because Markdown formatting was tripping over it.)

I didn't read the grammar, but now that I've read it, it's clear. I assumed that this:

was the approach being taken which would've disallowed identifiers beginning with numbers (I think.)

2 Likes
Terms of Service

Privacy Policy

Cookie Policy