SE-0451: Raw identifiers

Joe_Groff · October 24, 2024, 7:28pm

Hi everybody. The review of SE-0451: Raw identifiers begins now and runs through November 7th, 2024.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager. When contacting the review manager directly, please keep the proposal link at the top of the message.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at https://github.com/apple/swift-evolution/blob/main/process.md.

Thank you for contributing to Swift's evolution!

Joe Groff
Review Manager

Paul_Cantrell · October 24, 2024, 8:36pm

A quick mini-review, without deep consideration:

The rationale makes sense, both parts (descriptive idents and algorithmically generated idents). This is useful enough to modify the language.

The general approach is sensible, and the syntax is reasonable.

I say +1! (Readers may choose to interpret that exclamation point as either emphasis or factorial.)

One concern: part of the rationale is to allow programmatic synthesis of identifiers with arbitrary contents. This proposal as it stands does not fully accomplish that goal; I'm slightly concerned about accepting it:

without escape characters, and
with the prohibition on operator-char-only identifiers.

For example, the proposal mentions turning raw filenames / paths into identifiers. Filenames can contain newlines, and +* is a valid filename. Both are unlikely to occur in practice, but both are possible.

This means that code that transforms filenames (or other arbitrary strings) into Swift identifiers would still need to dance around Swift’s syntactic rules and invent some ad hoc, nonstandard escaping scheme (or otherwise police allowed characters in its emitted identifiers). While this proposal would reduce the number of characters that require ad hoc escaping, it would not eliminate the need for such ad hoc escaping altogether. The proposal therefore does not significantly reduce burden for code that synthesizes identifiers.

It seems to me that raw identifiers ought to offer the simple, uniform promise that:

any string is allowed, given a standard escaping scheme that is part of Swift’s syntax and not part of the identifier itself, and
the specific contents of the raw identifier will never create semantic differences.

Those two principles imply that:

`foo\nbar`() or some equivalent should be allowed, and
Int.`+` should refer to a regular identifier named +, just as Int.`foo` refers to a regular identifier named foo.

I understand the desire not to complicate the proposal with this esoteric problem, and accepting it as an incremental step makes sense; thus my +1. However, the proposal does seem incomplete as it stands if synthesizing arbitrary identifiers is in fact an ultimate goal.

Jnosh · October 25, 2024, 10:00am

I am in favor of this proposal.

I have hit many of the pain points mentioned in the proposal before:

Having written Kotlin before, writing test names using raw identifiers was really nice. It makes test names more readable and easier to name, removing the burden of having to force a test description into a valid identifier.
I have also run into enums with cases that should naturally start with e.g. a number. The option of using raw identifiers would be welcome here.
I have a package plugin that generates code to embed some resources. Being able to generate identifiers much more freely without having to try to transform input into valid identifiers would also be very welcome here.

The test name use case in particular makes this feature worth including in the language for me. The other applications are a nice bonus on top.

While this feature may slightly increase the burden on language and tooling developers, many more developers would profit from it and the proposal makes a good case for why the burden on tooling would hopefully be manageable.

To @Paul_Cantrell's point, could some of the remaining escaping pain for code generation maybe be alleviated by a small utility package in that area?
I'm thinking of something that could be queried for the set of disallowed characters, used to test whether a given string is a valid raw identifier, and maybe even offer one or more basic transformations to turn an arbitrary string into a valid raw identifier.
The proposal could discuss such a package as a future direction or call on the community to develop such functionality.
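A minimal sketch of what such a utility might look like. The names `RawIdentifierRules`, `isValid`, and `sanitized` are hypothetical. The disallowed set used here (backtick, backslash, and ASCII control characters such as newline), the rejection of empty names, and the small ASCII-only operator set are assumptions for illustration; the proposal text remains the authority on the exact rules.

```swift
/// Hypothetical helper; not part of the proposal or of any existing package.
enum RawIdentifierRules {
    /// Assumed ASCII operator characters. Swift's real operator grammar also
    /// includes many Unicode characters, which this sketch ignores.
    static let asciiOperatorCharacters: Set<Character> = [
        "/", "=", "-", "+", "!", "*", "%", "<", ">", "&", "|", "^", "~", "?"
    ]

    /// Returns true if the scalar is assumed to be disallowed between backticks.
    static func isDisallowed(_ scalar: Unicode.Scalar) -> Bool {
        switch scalar.value {
        case 0x60, 0x5C:        // backtick, backslash
            return true
        case 0x00...0x1F, 0x7F: // ASCII control characters, including \n and \r
            return true
        default:
            return false
        }
    }

    /// Tests whether `name` could be written between backticks under the
    /// assumed rules above.
    static func isValid(_ name: String) -> Bool {
        guard !name.isEmpty else { return false }
        if name.unicodeScalars.contains(where: isDisallowed) { return false }
        // Names made up only of operator characters are rejected.
        if name.allSatisfy({ asciiOperatorCharacters.contains($0) }) { return false }
        return true
    }

    /// A lossy, ad hoc transformation: disallowed scalars become "_", and a
    /// leading "_" is added if the result is still not a valid name.
    static func sanitized(_ name: String) -> String {
        var scalars = String.UnicodeScalarView()
        for scalar in name.unicodeScalars {
            scalars.append(isDisallowed(scalar) ? "_" : scalar)
        }
        let result = String(scalars)
        return isValid(result) ? result : "_" + result
    }
}

print(RawIdentifierRules.isValid("square returns x * x")) // true
print(RawIdentifierRules.isValid("+*"))                   // false
print(RawIdentifierRules.sanitized("foo\nbar"))           // foo_bar
```

Note that `sanitized` is not injective: `foo\nbar` and `foo_bar` map to the same name. That is exactly the kind of nonstandard escaping decision a generator would still have to make for itself.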

grynspan · October 25, 2024, 6:15pm

Since Swift Testing is one of the motivators for this proposal, I think it's important to note that we will not be able to provide support for this feature immediately. The testing team is in favour of adding this feature, but adoption/support won't be automatic.

I discussed the proposal with @allevato off-forum and we determined that, for Swift Testing to support it, we may also need runtime changes so that we can correctly construct fully-qualified type names and/or function names that use this feature. We don't know at this point if those runtime changes will be backwards-compatible with older versions of Apple's platforms.

If escape sequences are supported, we will need to make sure that swift-syntax correctly handles them, and that they do not accidentally allow for the use of a backtick in an identifier (which would make parsing a fully-qualified type name impossible.)

grynspan · October 25, 2024, 7:03pm

Just chatted with @Joe_Groff as well. The demangling problem is a tough nut to crack especially when you consider the need to remain backwards-compatible with older Apple OS releases.

One possible solution would be to encode the leading and trailing backticks in these names when mangling them so that they would simply automatically be present on demangling regardless of the OS version you're using. We could solve the problem of ambiguous string representations at the same time.

Interior backticks still pose a problem, but so long as round-tripping the demangled names isn't necessary, we could insert some invisible combining Unicode character next to the backtick code point so that, when decoded to a Swift string, they would present as distinct Characters from the leading/trailing backticks.

We would need to ensure that these encoding changes only occur for symbol names that would be invalid otherwise. If I write:

struct `MyStruct` {}

That's valid today without raw identifiers, so it should still mangle as simply MyStruct. I think.

Alejandro · October 25, 2024, 7:25pm

Older runtimes already can't handle names with symbols/whitespace in them, so I don't think backwards compatibility is possible here.

allevato · October 25, 2024, 7:25pm

That's an interesting idea! I can explore that in the implementation.

This would be a minor change in principle to the proposal, which currently states

In both cases, the backticks are not considered part of the identifier; they only delimit the identifier from surrounding tokens.

I think it would be the case that if the identifier cannot be spelled in source code without the backticks, and if the mangling of the symbol also includes the backticks (and thus round-tripping it preserves the backticks), then as far as both the user and the runtime are concerned, the backticks are part of the identifier. I don't think that's problematic, though, since it's solving a real issue.

This seems like something we can punt on for the time being, since the proposal as written forbids interior backticks in an identifier. (But if it were to be accepted with that as a change, we'd need to handle it.)

This should also be straightforward; the implementation has a function that checks whether an identifier must be escaped in any context (as opposed to escaped keywords, which only need to be escaped in certain unqualified contexts), so the mangler could use that to check whether it needs to affix literal backticks to the string before mangling it.
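As a rough illustration of that idea (not the compiler's actual code): the sketch below wraps a name in literal backticks only when it could not be spelled without them. `mustAlwaysBeEscaped` and `nameForMangling` are hypothetical names, and the check is an ASCII-only approximation; real Swift identifiers also allow many Unicode characters.

```swift
/// Hypothetical stand-in for the implementation's "must be escaped in any
/// context" check. Assumption: only ASCII letters, digits, and "_" are
/// treated as identifier characters here.
func mustAlwaysBeEscaped(_ name: String) -> Bool {
    func isHead(_ s: Unicode.Scalar) -> Bool {
        switch s.value {
        case 0x41...0x5A, 0x61...0x7A, 0x5F: // A-Z, a-z, _
            return true
        default:
            return false
        }
    }
    func isTail(_ s: Unicode.Scalar) -> Bool {
        return isHead(s) || (0x30...0x39).contains(s.value) // also 0-9
    }
    // An empty name can never be a plain identifier.
    guard let first = name.unicodeScalars.first else { return true }
    let restIsPlain = name.unicodeScalars.dropFirst().allSatisfy(isTail)
    return !(isHead(first) && restIsPlain)
}

/// Returns the string that would be handed to the mangler: backticks are
/// affixed only for names that are invalid without them.
func nameForMangling(_ name: String) -> String {
    return mustAlwaysBeEscaped(name) ? "`\(name)`" : name
}

print(nameForMangling("MyStruct"))  // MyStruct
print(nameForMangling("foo space")) // `foo space`
print(nameForMangling("4k"))        // `4k`
print(nameForMangling("for"))       // for
```

Keywords such as `for` pass through unchanged in this sketch, matching the distinction drawn above between raw identifiers and escaped keywords.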

Joe_Groff · October 25, 2024, 7:28pm

I suspect that it's not so much that they can't, more that they don't, and will just render whatever identifier the demangler gives them verbatim. So if the verbatim demangling includes enough delimiting/escaping of raw identifiers to make them distinguishable from existing identifiers, that should be enough for old runtimes to do close enough to the right thing for new code running against an old runtime to be able to cope.

Alejandro · October 25, 2024, 7:30pm

I mean, it seems from the demangler's perspective it just fails at demangling these things:

$ swift demangle --tree-only
$s1A3A CV
<<NULL>> CV

grynspan · October 25, 2024, 7:32pm

That's unfortunate and would mean Swift Testing will not be able to adopt this feature unless it is gated on a minimum deployment target (i.e. the compiler won't accept a raw identifier on an Apple platform without a high-enough minimum target version.)

allevato · October 25, 2024, 7:35pm

That particular example isn't a valid mangled identifier even under the new rules. Literal spaces would never appear in a mangled identifier; since they aren't identifier-safe ASCII characters, they would get punycoded. There are some examples in the implementation PR's tests:

$ xcrun swift-demangle '_$s4test0014foospace_ntJBbyyF'
_$s4test0014foospace_ntJBbyyF ---> test.foo space() -> ()

The version of swift-demangle that I'm running here is from Xcode, so it doesn't have (or need) support for raw identifiers. If we inserted the punycode-encoding of literal backticks into the mangling, they would be printed back out as well.

Alejandro · October 25, 2024, 7:37pm

I thought that maybe we could use punycode here, but I thought we could only do that for non-ASCII. If we can pull that off for everything then

allevato · October 25, 2024, 7:46pm

What I found out while implementing the proposal was that the mangler was already set up to punycode anything that isn't identifier-safe ASCII. So that part is already done, even before this proposal touched anything. The only case I had to adjust it for was identifiers starting with a digit, which were round-tripping incorrectly. (But if we add backticks to these identifiers, that will even solve the round-tripping issue for identifiers with leading digits in a backwards-compatible way.)

Karl · October 25, 2024, 7:54pm

It's a -1 from me.

The swift-testing examples in particular are not at all compelling, and IMO actually the opposite.

I think that swift-testing has made some strange design decisions. Test suites can become very large and complex, and adding some descriptive prose is clearly a good thing - just as we do in our library and application code as those codebases grow.

But the thing is, we already have a system designed specifically for this kind of prose -- documentation comments. Our documentation infrastructure is excellent these days, so why does swift-testing seem to encourage the use of every other mechanism?

First it's string parameters to the macro:

@Test("square returns x * x")
func squareIsXTimesX() {
  #expect(square(4) == 4 * 4)
}

Now it's suggested that we change it to this:

@Test func `square returns x * x`() {
  #expect(square(4) == 4 * 4)
}

Honestly, both of these look bad to me, and the second one looks even worse. Choosing function names that contain spaces and special characters confusable with executable code is such an obvious bad practice that I would never use this myself, and would ban it from every codebase where I have the authority to do so.

Sorry, but it's just not readable and could easily be actively harmful.

As an alternative:

/// Tests that ``square`` returns `x * x`.
///
@Test func testSquare() {
  #expect(square(4) == 4 * 4)
}

That is way more readable, easily scales to multiple lines, allows complex formatting, tables, etc, not to mention links to other functions and types in the project. The way documentation comments work, the first line is already considered a "summary" which can be used in test output or the Xcode sidebar.

I can't understand why we go through these contortions for a not-excellent result, when we have a truly excellent and class-leading documentation engine just sitting there.

I have sympathy for the Bazel use-case, but I'd rather that be addressed by a more targeted proposal. The other motivating use-cases (non-alphabetic identifiers) are not, by themselves, significant enough to be worth addressing IMO.

grynspan · October 25, 2024, 7:54pm

Oddly enough, I ran into this exact issue as well.

MPLewis · October 25, 2024, 9:09pm

Personally, I agree with you that the testing use-case is ugly and I’d much prefer documentation comments and normally-named functions. But that’s just a matter of style and the Bazel and enum cases are compelling enough to me that I don’t think we should prevent people from using raw identifiers if they so choose, assuming that the implementation doesn’t cause issues in other places (which it doesn’t seem like it does).

michelf · October 26, 2024, 12:31pm

Personally, the only example that feels useful to me is the one where you can make identifiers (especially enum cases) that start with a digit.

But then the proposal says you have to always quote them even after a dot, and to me this makes it worse than current workarounds using a letter or underscore prefix. The proposal explains this is to avoid ambiguity in a tuple, but most types aren't tuples and it'd be easy to disallow digit-only names in a tuple.
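For comparison, here is a small sketch of the underscore-prefix workaround that compiles today. The `Resolution` enum and its cases are made up for illustration.

```swift
// Hypothetical enum using the underscore-prefix workaround available today.
enum Resolution: String {
    case _720p = "720p"
    case _1080p = "1080p"
    case _4k = "4k"
}

let chosen = Resolution._4k
print(chosen.rawValue) // 4k

// Under the proposal as described above, the case would instead be declared
// with backticks around 4k, and the backticks would also be required at the
// point of use, even after the dot.
```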

About tuples… Everywhere in the language it's always been the case that y.a is the same as y.`a`. So it comes as a big surprise that y.0 could be something different from y.`0`. Please don't allow this, we don't need this confusion. Example from proposal:

let y = (5, `0`: 10)
let b = y.0    // b <- 5
let c = y.`0`  // c <- 10

CharlesS · October 26, 2024, 6:33pm

I think I'm also a -1 on this, for the same reasons that @Karl and @michelf laid out. Every proposed use for this seems ugly compared to alternatives using currently allowed syntax.

allevato · October 26, 2024, 7:04pm

michelf:

About tuples… Everywhere in the language it's always been the case that y.a is the same as y.`a`. So it comes as a big surprise that y.0 could be something different from y.`0`. Please don't allow this, we don't need this confusion. Example from proposal:
let y = (5, `0`: 10)
let b = y.0    // b <- 5
let c = y.`0`  // c <- 10

I'm sympathetic to this, but there's a fundamental distinction here. The reason that y.a and y.`a` are treated identically is specifically because a is a valid identifier. Requiring the use of backticks around 0 in this situation is an important signal that conveys to the reader "the 0 here does not mean what it normally means".

The opposite argument could just as easily be made that it's a big surprise that x.for and f(for: x) are allowed when for.x and f(x: for) are not. We simply don't find it confusing now because we've grown accustomed to those rules.
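The existing keyword rules mentioned here can be checked in today's Swift. In this small sketch, the `Loop` type and the function `f` are made up for illustration.

```swift
struct Loop {
    var `for`: Int // declaring a keyword-named member requires backticks
}

// Keywords are accepted as argument labels without backticks.
func f(for value: Int) -> Int {
    return value
}

let x = Loop(for: 3)    // the memberwise initializer uses the label `for`
print(x.for == x.`for`) // true: both spellings name the same member
print(f(for: x.for))    // 3

// let for = 1          // would not compile: a bare `for` is a keyword here
```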

It's also quite unlikely that anyone would write code like the one in that specific example, which only serves to show how the rules are chosen to avoid collisions. When discussing changes like this to parsing/lookup, we must consider extreme edge cases like this and how they fit into the larger system. If we chose the rule that a.0 could mean the same as a.`0`, then we'd still have a different set of rules that we need to define (which do we choose if there are two possible meanings?).

But the existence of such examples is not an encouragement for folks to write code that looks like that, just as the fact that Swift allows emoji and other obscure Unicode characters is not an encouragement to have those permeate a codebase. History has shown that even with those tools, most developers are able to use them wisely and only when necessary, and I have every reason to expect that the same would hold true here.

michelf · October 26, 2024, 8:27pm

The problem is I've always seen the 0 in a.0 as an identifier. Just like you can have a.default where default becomes an identifier, with a.0 the 0 also becomes an identifier for the field. Perhaps this is not how the compiler parses things internally, but that's how I model it in my head when reading such code.

Allowing unquoted identifiers that start with a digit (when after a dot) would make enum cases that start with a digit actually convenient at the point of use. It would also allow a struct to mimic the API of a tuple, which could be useful when refactoring. Whereas a.0 and a.`0` meaning different things is much less useful, in addition to being confusing. Maybe it makes things more consistent in some interpretation of the language, but not in mine.