Allow more characters (like whitespaces and punctuations) for escaped identifiers

adellibovi · December 27, 2019, 4:45pm

Allow more characters (like whitespaces and punctuations) for escaped identifiers

Proposal: [SE-XXXX]
Authors: Alfredo Delli Bovi
Review Manager: TBD
Status: Awaiting review
Implementation: apple/swift#28966

Introduction

Swift has a beautifully concise yet expressive syntax.
As part of that, escaped identifiers are adopted to allow usage of reserved keywords.
This proposal extends the characters allowed in escaped identifiers to more Unicode scalars, such as whitespace and punctuation.
It would allow method names (or other identifiers) to read more like natural language, as in the following:

func `test validation should succeed when input is less than ten`()

Motivation

Naming can be hard, and descriptive method names, such as those in tests, may result in declarations that are hard to read because they lack whitespace, punctuation, and other symbols. Enabling natural language would improve readability.

Maintainers of several projects in the Swift Source Compatibility Suite use testing frameworks such as Quick instead of plain Swift method declarations because, among other reasons, those frameworks express test descriptions elegantly.

Other modern languages like F# and Kotlin saw the value in supporting natural language for escaped identifiers. Today, naming methods with spaces and punctuation is, for those languages, a standard for tests, widely adopted and supported by different test runners and reporting tools.

Proposed solution

This proposal extends the current grammar for all escaped identifiers (properties, methods, types, etc.) by allowing nearly every Unicode scalar; the exact exclusions are listed under Grammar.

A declaration of an escaped identifier follows the existing backtick syntax.

func `test validation should succeed when input is less than ten`()
var `some var` = 0

The same applies when referencing one.

`test validation should succeed when input is less than ten`()
foo.`property with space`

In fact, allowing a larger set of characters removes current limitations. For example, it lets us reference an operator, which currently produces an error.

let add = Int.`+`

Grammar

This proposal wants to replace the following grammar:

identifier → ` identifier-head identifier-characters opt `

with:

identifier → ` escaped-identifier `
escaped-identifier → Any Unicode scalar value except U+000A or U+000D or U+0060
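
As a rough illustration of the rule above, here is a minimal sketch in today's Swift that checks whether a string could sit between the backticks. The function name is hypothetical and is not part of the proposal or of the compiler, and rejecting an empty body is an assumption, since the grammar above does not say.

```swift
// Hypothetical helper, for illustration only: checks whether `body` could appear
// between backticks under the proposed grammar.
func isValidEscapedIdentifierBody(_ body: String) -> Bool {
    // Scalars the proposed grammar excludes: LF (U+000A), CR (U+000D), backtick (U+0060).
    let excluded: Set<Unicode.Scalar> = ["\u{000A}", "\u{000D}", "\u{0060}"]
    // Assumption: an empty body is rejected; the grammar does not state this.
    guard !body.isEmpty else { return false }
    // Every remaining Unicode scalar is accepted, including spaces and punctuation.
    return body.unicodeScalars.allSatisfy { !excluded.contains($0) }
}
```

Under this sketch, "some var" and "+" are accepted, while anything containing a line break or a backtick is rejected.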

Objective-C Interoperability

Objective-C declarations do not support every Unicode scalar value.
To expose an escaped identifier that contains a character Objective-C does not support, we can sanitize its name with the existing @objc annotation, as follows:

@objc(sanitizedName)

Source compatibility

This feature is strictly additive.

Effect on ABI stability

This feature does not affect the ABI.

Effect on API resilience

This feature does not affect the API.

Alternatives considered

We considered extending the grammar for method declarations only. This was later discarded because we want to keep usage consistent, and it would be hard to explain why an escaped identifier supports one set of characters in one context and a different set in another.

Thanks for taking the time to read this post. Any feedback is appreciated.

Review is currently taking place at: SE-0275: Allow more characters (like whitespaces and punctuations) for escaped identifiers

anandabits · December 27, 2019, 4:58pm

+1. I would really like to be able to write test methods this way.

yuriferretti · December 27, 2019, 6:45pm

+1 as you mentioned, this would be very useful for tests!!

zoltanL · December 27, 2019, 7:04pm

This is a nice idea, I support it, +1

Is the back-tick syntax for test methods only?
Would you be able to call them manually at all? E.g.

 func `some function name`() { ... }

 ...

#warning(bikeshedding)
// is it:
 object.`some function name`()
 // or
 object.someFunctionName()
 // or
 object.some_function_name()

adellibovi · December 27, 2019, 7:15pm

Thanks, let me reply to your questions.

Is the back-tick syntax for test methods only?

No, it may apply to every method. For test methods it feels more natural. I am not sure it is a good idea for other kinds of methods, but that is up to the developer.

Would you be able to call them manually at all?

Yes. For calling it, I was thinking of using object.`some function name`() in order to keep the same syntax and to avoid implicit method-name conversion within a Swift codebase. It may be different within an Obj-C context, though.

DevAndArtist · December 27, 2019, 7:35pm

Shouldn't this be generalized to all identifiers?

I think this would be useful if we get compound names.

var `some func`(a:b:): (Int, Int) -> Int = ...

// usage
`some func`(a: 1, b: 2)

Also, would it be possible to generalize a line break? Test method names are sometimes very long, but some of us have a fixed character width such as 80 characters. In these cases such names won't fit, and there is currently no way to break up the identifier into multiple lines. I think this would be a good opportunity.

Maybe we could reuse some rules from multi line string literals.

adellibovi · December 27, 2019, 8:29pm

Thanks Adrian for the feedback!

I definitely agree about properties. We should have the same behavior for both properties and methods; it would be easier to explain and we would keep usage consistent.

Regarding the line break, I hear your point and want to share my thoughts, since it is a topic I was thinking about too.
Based on statistics for the English language, the average word has 6 characters and the average sentence has around 15-20 words, which means 80 characters fit around 12 words, so the issue may not come up very often. Since supporting line breaks would increase both design and implementation complexity for a case that looks rare, I would prefer to leave this decision for a future improvement. What do you think? In any case, I believe this proposal is a step in that direction and partially lays the foundation to support it.

zoltanL · December 27, 2019, 8:37pm

I would add that function names of this nature have to start with an alphabetic character and end with an alphanumeric one, or some similar rule.

A string of whitespace and/or punctuation symbols only would be highly confusing, back-ticks or not.

DevAndArtist · December 27, 2019, 8:39pm

It's okay by me if the initial proposal doesn't support line breaks. It would be great, though, if the final design could leave some room for a potential future extension to multi-line identifiers.

DevAndArtist · December 27, 2019, 8:50pm

Generally I think we should look at string literal rules for that kind of feature.

func `test foo \` bar`() {
  print(#function) // prints >`test foo ` bar`<
}

Also, do we want some kind of concatenation rule? Xcode and other tooling probably won't pick up test identifiers that start with a backtick (just an assumption, I haven't actually tested it).

For example:

func test_`foo bar`() { ... }

adellibovi · December 27, 2019, 9:07pm

In my opinion the backtick shouldn't be part of the actual identifier, as it is more of a syntactic aid that lets the compiler pick up a method name with spaces and punctuation. Basically, #function will just be test foo ` bar, and that may already solve the issue you are raising.

Edit: I checked, and backticks are already not part of the identifier, i.e. `x` and x are exactly the same.

allevato · December 27, 2019, 9:36pm

Thank you for bringing up the topic of identifiers with non-identifier characters! This is something I've been thinking about for a while, and the use case you describe is a great one, and probably a better starting sales pitch than what I need it for, so I'm glad you started the discussion instead of me.

The use case I'm most interested in is allowing non-identifier characters in module names. At Google (and other companies using Bazel in a monorepo), a particular app could be made up of tens (or possibly hundreds!) of Swift and Objective-C modules at different paths in the monorepo, owned by various different teams. Each build target has a label of the form //path/to/package:target_name. Since module names cannot collide anywhere in the build graph for an application, we can't rely on teams to choose their own module names because two teams could choose something common like Utility for some internal library. So, we mangle the Bazel target label to turn it into an identifier, and the Swift code has to do:

import path_to_package_target_name

This works, but the mapping is not reversible (to try to keep it as simple and obvious as possible), so there is the potential for collision in rare cases, and there's still some mental load to convert the label (which you already know, because you have to express the dependency in your build file) into the module name.
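
To make the collision risk concrete, here is a minimal sketch of such a mangling. The exact rule is an assumption made up for illustration (every character that is not an ASCII letter, digit, or underscore becomes an underscore, and leading underscores are dropped), and the function name is hypothetical; the real Bazel rules may differ.

```swift
// Hypothetical mangling, assumed for illustration only: every character that is not
// an ASCII letter, digit, or underscore becomes "_", and leading underscores are dropped.
func mangledModuleName(forLabel label: String) -> String {
    let mapped = label.map { ch -> Character in
        (ch.isASCII && (ch.isLetter || ch.isNumber)) || ch == "_" ? ch : "_"
    }
    // Strip the underscores produced by the leading "//".
    return String(String(mapped).drop(while: { $0 == "_" }))
}
```

With this rule, //path/to/package:target_name and //path/to:package_target_name both map to path_to_package_target_name, so the module name alone cannot tell you which label it came from.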

I would love to allow Bazel users to write this instead:

import `//path/to/package:target_name`

This would have a couple huge benefits:

There's no mental load to convert—the module name is the target label, period.
It's now reversible, which means we can build great tooling around this. Specifically, we can make Swift source files be the source of truth and generate the build files (i.e., the dependency lists) from them, instead of making the user manually write them in two places. We can't do that today unless we maintained a master mapping from module names back to build target labels somewhere.

Now, backticked identifiers don't solve my problem completely. We'd need an alternate way to pass these modules to the compiler, since you can't have a file named //path/to/package:target_name.swiftmodule (well, not easily), but that's a separate driver/frontend issue that I don't think needs to impact this feature.

So, huge +1 to this idea in general, and it should apply uniformly to all identifiers (modules, variables, functions, etc.). Backticks already mean "escape this reserved word that isn't a suitable identifier on its own and make it an identifier", so replacing "reserved word" with "sequence of characters" seems like the exact right thing to do.

I strongly disagree that we should have arbitrary restrictions like these (and not only because it would prevent my use case above). Many programming language features can be abused, but instead, we just trust users to make intelligent, grown-up decisions about their code. With identifiers, you can already do confusing things today:

struct A {}
let a = Α()  // error: use of unresolved identifier 'Α'

(Line 1 is Latin uppercase A; line 2 is Greek uppercase Alpha).

And that's not even touching emoji, which Swift has allowed in identifiers since day 1, and we haven't seen an epidemic of users trying to shove those into identifiers, so I think we can trust users here as well. If someone gives identifiers an unusable or confusing name, good solutions include making a lint rule for it or calling it out in a code review, but not crippling the feature arbitrarily and limiting legitimate use cases.

I think raw string literals are a better thing to emulate here, by extending the grammar for backticked identifiers, because it generalizes more nicely than backslash escaping:

func `test foo bar`() {}
func #`test foo`bar`#() {}
func ##`test foo`#bar`##() {}

But these are certainly rare scenarios.
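
For anyone who wants to play with the delimiter idea, here is a minimal sketch of how a scanner might recognize the hash-delimited form, borrowing the matching rule from raw string literals: the identifier ends at the first backtick followed by as many # characters as opened it. Nothing like this exists in the compiler; the function name and the choice to reject an empty body are assumptions.

```swift
// Hypothetical scanner for the hash-delimited form, e.g. #`test foo`bar`#.
// Returns the identifier body, or nil if `source` is not exactly one such identifier.
func escapedIdentifierBody(in source: String) -> String? {
    let chars = Array(source)
    // Count the leading "#" characters; zero hashes is the plain `...` form.
    let hashCount = chars.prefix(while: { $0 == "#" }).count
    guard hashCount < chars.count, chars[hashCount] == "`" else { return nil }
    let closer: [Character] = ["`"] + Array(repeating: "#", count: hashCount)
    let bodyStart = hashCount + 1
    var index = bodyStart
    while index + closer.count <= chars.count {
        // The first backtick followed by `hashCount` hashes ends the identifier.
        if Array(chars[index..<index + closer.count]) == closer {
            // Accept only if the closer sits at the very end and the body is non-empty.
            guard index + closer.count == chars.count, index > bodyStart else { return nil }
            return String(chars[bodyStart..<index])
        }
        index += 1
    }
    return nil
}
```

In this sketch the plain form still cannot contain a backtick (the input `test foo`bar` is rejected), while #`test foo`bar`# yields the body test foo`bar.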

Tools would most likely have to be updated anyway to handle identifiers that contain backticks or non-identifier characters properly, so I don't think we need more rules like that concatenation one. They would just add complexity with little benefit.

wadetregaskis · December 28, 2019, 5:54am

My initial reaction (well, after the “huh…”) is that this seems fine iff backticks are applied universally. i.e. the tokeniser treats them the same anywhere they appear (other than in certain obvious exceptions, like inside string literals), as a way of suspending normal rules on whitespace, or other symbols, delineating tokens.

That to me seems justifiable from an ideological perspective - the “here’s your way out of whatever awkward edge cases may arise, because naming is hard”. Not something necessarily recommended, but fairly harmless and easy to comprehend if & when you encounter it for the first time.

If this were restricted to just certain places, such as method names (and IIRC variable names already), then I feel that I’d have to scrutinise it more heavily - e.g. is it a good idea to allow writing essentially arbitrary human language in a method name; is this something that’s better handled by documentation / comments, or decorators; is this really that much better than just using underscores (which a test harness could trivially replace with spaces if prettiness of the output is the concern); etc.

I suspect it would also serve humans well - if not also the tokeniser - to not allow implicit concatenation of backticked content with non-backticked content, i.e. func test`foo bar` should not be allowed; use func `testfoo bar` instead. It’s simpler to reason about (again, by defining ` as essentially a special token delimiter).

adellibovi · December 28, 2019, 3:27pm

Thanks everyone for this first round of feedback. I do appreciate the different points of view.

I will soon update the original pitch to include some of your suggestions, mainly:

extending support to every kind of escaped identifier (methods, properties, imports, etc.)
clarifying that Swift already supports referencing and calling escaped identifiers (i.e. foo.`method`() or foo.`property`), so the proposal can keep what is already in place.

In the meanwhile... I wanted to share a sneak preview of a working prototype that is fully compatible with Xcode test runner

zoltanL · December 28, 2019, 7:07pm

And that's not even touching emoji, which Swift has allowed in identifiers since day 1, and we haven't seen an epidemic of users trying to shove those into identifiers, so I think we can trust users here as well. If someone gives identifiers an unusable or confusing name, good solutions include making a lint rule for it or calling it out in a code review, but not crippling the feature arbitrarily and limiting legitimate use cases.

I see your point, good call.

In the meanwhile... I wanted to share a sneak preview of a working prototype that is fully compatible with Xcode test runner ...

Looks good!

allevato · December 28, 2019, 7:15pm

That's great!

One thing that occurred to me after my post above was that the identifiers I wanted to use (//path/to/package:target_name) contain operator characters (and indeed, start with one). If we want to allow backticks to escape non-identifier characters in identifiers, we need to give consideration to how operator characters are handled. Some open questions and thoughts which are partly motivated by my own needs/use case:

Should operator characters be allowed in backticked identifiers? I think so; not only for my import use case, but it might be nice to write func `test +`() { ... } if I'm testing the + operator of a custom type.
Should backticks turn sequences entirely composed of operator characters into regular non-operator identifiers? For example, should `..<` or `+` be treated as separate non-operator identifiers to ..< and +? I think the answer should be no; that could lead to confusion, and there has also been some interest in using backticks around operators to reference them as type members, and I think these two features would tie nicely together.
What about backticked identifiers that contain mixed operator and non-operator characters? Any difference depending on whether the identifier starts with an operator character vs. just containing one? I think it should be fine to mix them, and I don't think the behavior should differ whether the identifier starts with an operator character or not; in both cases, it should be a regular identifier. (Selfishly, these are both important to my module use case.)

So, to summarize, IMO a backticked identifier may contain operator characters, but a backticked sequence that contains only operator characters is still an operator, not a regular identifier. In other words:

static func + (lhs: Foo, rhs: Foo) -> Foo {}  // an operator, of course
static func `+` (lhs: Foo, rhs: Foo) -> Foo {}  // equivalent, still an operator
let `-` = 5  // still not allowed, `-` is the same as -, an operator

func `test +`() {}  // a regular identifier

Now, strict application of my proposed rule could allow some weird situations:

func `+ -`() {}  // a regular identifier, because SPACE is not an operator character

We could try to add more rules, like "an identifier may not consist only of operator characters and whitespace, even when surrounded by backticks," but I don't know if adding more rules to address that would actually help or if it would just make things more complicated. I think it goes back to trusting users to not do silly things, just as we do with Unicode support today, and enforcing additional rules through style guides, linters, and code reviews. After all, we can put all the mechanical rules that we can dream of but nothing would stop a user from naming an identifier jidjfosijfsiodno, so figuring out where to draw the line is important, but also challenging.
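
To spell out the rule I'm suggesting, here is a minimal sketch of the classification. The names are hypothetical, and the operator character set is a simplified ASCII-only assumption; Swift's real grammar also admits many Unicode operator characters and restricts where "." may appear.

```swift
// Hypothetical classification of the text between backticks, for illustration only.
enum BacktickedKind { case operatorName, regularIdentifier }

func classify(backticked body: String) -> BacktickedKind {
    // Assumption: a simplified, ASCII-only operator character set.
    let operatorCharacters: Set<Character> = [
        "/", "=", "-", "+", "!", "*", "%", "<", ">", "&", "|", "^", "~", "?", ".",
    ]
    // Only a non-empty body made up entirely of operator characters stays an operator.
    let onlyOperatorCharacters = !body.isEmpty && body.allSatisfy { operatorCharacters.contains($0) }
    return onlyOperatorCharacters ? .operatorName : .regularIdentifier
}
```

Under this sketch `+` and `..<` are still operators, while `test +`, `+ -`, and `//path/to/package:target_name` are regular identifiers because each contains at least one non-operator character.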

benrimmington · December 29, 2019, 6:14am

Backticks might also be required (by the compiler) or optional (for readability) if the SE-0111 regression is fixed.

One of the further suggestions was for multi-line compound names (with insignificant whitespace).

If I had to choose, I'd prefer to reserve backticks for closures with compound names.

adellibovi · December 29, 2019, 9:19am

Thanks for bringing up this edge case.

I extended the grammar to give it a try and see how the compiler behaves.

func `test +`() { ... } looks legit and already works in the prototype.

Defining static func + (lhs: Self, rhs: Self) or static func `+` (lhs: Self, rhs: Self) is exactly the same thing, so in this case it will be considered an operator. And, as a side effect, that means we can already reference it as Self.`+` for free! Which IMHO is a nice thing. (cc: @dan-zheng, maybe you are interested in the findings.)

Unfortunately, the "starts with an operator" case is where things get tricky, and with my current knowledge I am not sure where or how this is handled. For your use case, though, we could just do path://path/to/package:target_name instead and it will work. So maybe we could exclude this edge case for the sake of implementation simplicity. What do you think?

allevato · December 29, 2019, 5:51pm

It's great to see that these two things fell out naturally already!

To clarify, I wasn't concerned about backticked identifiers starting with an operator, just with an operator character, so that should significantly simplify the problem (we can't know at lexing time whether a specific sequence of characters is defined as an operator somewhere, only whether it could be an operator). From a quick glance at your implementation, it looks like your isValidIdentifierEscaped{Continuation,Start}CodePoint functions would already treat something like `//path/to/package:target_name` as an identifier, correct?

For what it's worth, dropping the leading double-slash from the label would be a reasonable compromise if I had to for my use case, but I think it's possibly cleaner if there are fewer restrictions.

adellibovi · December 29, 2019, 6:16pm

Correct, in the current implementation those are considered valid identifiers, but they currently conflict with Sema, which believes (from my understanding, I could be wrong) that those identifiers are operators (for func, not actually for imports). I will keep investigating.