SE-0451: Raw identifiers

YOCKOW · October 30, 2024, 1:11am

I'm not against your point, but just in case, let me say Swift already allows some isolated combining characters in identifiers.

Quote from "Identifiers" section in TSPL:

Grammar of an identifier

identifier → identifier-head identifier-characters ?
...
identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or U+3040–U+D7FF
...

U+3040–U+D7FF contains U+3099(COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) which is a combining character.

For example, "\u{304B}\u{3099}" == "\u{304C}" is true in Swift.
Note that U+3099 is different from U+309B(KATAKANA-HIRAGANA VOICED SOUND MARK) which is not a combining character.
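
A small sketch of the comparison above, using only standard-library `String` APIs. The variable names are illustrative. It shows that `String` equality treats the decomposed and precomposed spellings as equal while their scalar sequences still differ, and that the non-combining U+309B does not compare equal:

```swift
let decomposed = "\u{304B}\u{3099}"   // U+304B followed by the combining voiced sound mark
let precomposed = "\u{304C}"          // the same syllable as a single scalar
let spacing = "\u{304B}\u{309B}"      // U+304B followed by the non-combining mark

// String equality compares by canonical equivalence, not by scalar sequence.
let equalAsStrings = decomposed == precomposed
// The underlying scalar sequences are still different (two scalars vs. one).
let equalAsScalars = decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)
// U+309B is not canonically equivalent to U+3099, so this pair is not equal.
let spacingEqual = spacing == precomposed

print(equalAsStrings, equalAsScalars, spacingEqual)  // true false false
```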

Running the following command in your terminal (zsh)

% echo "let \u3099 = 0; print(\u3099)" | swift repl

prints 0.

Swift also does not seem to normalize identifiers:

% echo "let \u304B\u3099 = 0; let \u304C = 1; print(\u304B\u3099 == \u304C)" | swift repl

runs successfully and prints "false".

CharlesS · October 30, 2024, 1:38am

Regardless, that comment was wonderfully illustrative.

taylorswift · October 30, 2024, 2:34am

the way i see it, this proposal was crafted to support two realistic use cases:

1. defining identifiers (usually enum cases) that start with a digit

enum CompressionEngine
{
    case `7z`
}
enum HTTPStatus
{
    case `2xx`
    case `3xx`
    case `4xx`
    case `5xx`
}

2. embedding natural language strings in function names for test identification

@Test func `square returns x * x` () 
{
}

there are a few other problems - like preserving spellings from other languages (e.g. box-shadow), or encoding file paths (e.g. myapp/extensions/widget/common/utils) - that feel conceptually adjacent and are roped in as additional motivations, but wouldn’t actually be solved by this proposal, as it is not possible to completely eliminate the need to transform input strings into Swift identifiers. there are also a few other things this proposal would enable, such as including unusual punctuation in identifiers (e.g. 16:9), but to me that just isn’t anywhere near as motivating as the leading digit use case.

for #2, i feel that working this into the macro system, where the macro could mangle a name based on a string input, makes more sense than enabling all special characters in all identifiers. for #1, leading digits, we don’t need a new delimiter, because it is already unambiguous.

Karl · October 30, 2024, 8:14am

Interesting. We should revisit which code-points are allowed in identifiers at some point (it's possible our grammar definitions predate a lot of this advice). It would technically be source-breaking, but actual breakage should be limited to questionable identifiers (such as you demonstrated, single combining characters and the like) that probably qualify as "actively harmful" to continue supporting.

Ideally, no identifiers (raw or standard) should interact typographically with their surrounding text.

Rust is ahead of us here. It does not allow U+3099 to be used as an identifier:

// Rust:

fn main() {
    let ゙ = 42;
    println!("Hello, {}",  ゙);
}

-------------

error: unknown start of token: \u{3099}
 --> src/main.rs:2:9
  |
2 |     let ゙ = 42;
  |         ^

error: unknown start of token: \u{3099}
 --> src/main.rs:3:28
  |
3 |     println!("Hello, {}",  ゙);
  |                            ^

error: expected pattern, found `=`
 --> src/main.rs:2:11
  |
2 |     let ゙ = 42;
  |          ^ expected pattern

What's more, the Rust compiler even includes confusable detection:

// Swift:

func visitScope() {}

// Cyrillic small letter 'o'
func visitScоpe() {}

// No warnings or errors

// Rust:

fn visitScope() {}
fn visitScоpe() {}

-------------

warning: found both `visitScope` and `visitScоpe` as identifiers, which look alike
 --> src/lib.rs:2:4
  |
1 | fn visitScope() {}
  |    ---------- other identifier used here
2 | fn visitScоpe() {}
  |    ^^^^^^^^^^ this identifier can be confused with `visitScope`
  |
  = note: `#[warn(confusable_idents)]` on by default

warning: the usage of Script Group `Cyrillic` in this crate consists solely of mixed script confusables
 --> src/lib.rs:2:4
  |
2 | fn visitScоpe() {}
  |    ^^^^^^^^^^
  |
  = note: the usage includes 'о' (U+043E)
  = note: please recheck to make sure their usages are indeed what you want
  = note: `#[warn(mixed_script_confusables)]` on by default

Confusable detection is something we can get from ICU, although it is sometimes helpful to augment it with custom rules. I made an example package a while ago which uses ICU and some custom rules from Chromium to spoof-check Unicode domain names from Swift.
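
To make the Swift half of the comparison concrete: the two spellings differ by a single scalar, and Swift treats them as unrelated names. The snippet below is only a crude illustration that lists the non-ASCII scalars in a name. `nonASCIIScalars(in:)` is a hypothetical helper, and this is not confusable detection, which would need ICU's spoof checker or equivalent data:

```swift
let latinName = "visitScope"
let mixedName = "visitSc\u{043E}pe"   // U+043E CYRILLIC SMALL LETTER O

// Hypothetical helper, for illustration only: returns the non-ASCII
// scalars of a name, formatted as "U+XXXX".
func nonASCIIScalars(in name: String) -> [String] {
    return name.unicodeScalars
        .filter { !$0.isASCII }
        .map { scalar -> String in
            let hex = String(scalar.value, radix: 16, uppercase: true)
            // Pad to at least four hex digits, the usual U+ notation.
            let padding = String(repeating: "0", count: max(0, 4 - hex.count))
            return "U+" + padding + hex
        }
}

print(latinName == mixedName)              // false
print(nonASCIIScalars(in: mixedName))      // ["U+043E"]
```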

nukka123 · October 30, 2024, 2:14pm

typo? 0451-escaped-identifiers.md

case ColorVariant {

allevato · October 30, 2024, 2:19pm

Fixed. Thank you!

allevato · October 30, 2024, 2:41pm

There definitely are situations where it's ambiguous. Since enum cases are static members, they can be referenced without any leading qualification or punctuation in static contexts:

enum E {
  case a

  static let oldNameForA = a  // ok
  static var oldNameForA2: E { a }  // ok
}

So, if we swap those out for numeric identifiers, we'd have:

enum E {
  case `123`  // or case 123??

  static let oldNameForA = 123  // ok, but incorrectly inferred to be Int
  static var oldNameForA2: E { 123 }  // error: type mismatch, Int != E
}

Certainly, changing the language's type checking rules to allow bare integer literals to be treated as identifiers is not a path we want to go down, so there do still exist situations where we either need to (1) escape these identifiers or (2) qualify them with a type. I'm of the opinion that forcing only a specific subset of identifiers to be qualified based on their spelling is a worse landing point than requiring them to be consistently escaped wherever they're used. (Even if we were to decide that we could elide the backticks for something like E.123, we could still permit E.`123` just as we permit both E.for and E.`for`.)
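
For reference, a short sketch of the two behaviors that already exist in Swift today and that the argument above relies on: a keyword-named case can be referenced with or without backticks after a dot, and a bare integer literal in a static context is inferred as `Int`. The enum and member names are placeholders:

```swift
enum E {
    case `for`                    // keyword used as a case name, escaped at the declaration
    static let literal = 123      // a bare integer literal stays an Int
}

let unescaped = E.for             // after a dot, the backticks may be omitted
let escaped = E.`for`             // the escaped spelling is still accepted
print(unescaped == escaped)       // true
print(type(of: E.literal))        // Int
```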

scanon · October 30, 2024, 2:56pm

I might be reading too much into the comment, but I think Taylor is proposing something like "allow identifiers that start with a digit, so long as they contain at least one non-digit," which I think removes the ambiguity that you pointed to.

allevato · October 30, 2024, 2:57pm

That would still be conflicting with octal, binary, and hexadecimal literals.

ksluder · October 30, 2024, 2:57pm

What about 1_000 and 0xFF?

taylorswift · October 30, 2024, 3:01pm

the list of numeric literal patterns has been pretty stable, and i don’t anticipate that we would add more bases, so the rule could just be “anything that doesn’t already look like an integer literal”

allevato · October 30, 2024, 3:07pm

Just to make sure I'm clear on your suggestion regarding identifiers/numbers then, which of these is it?

1. An identifier can start with a digit if it doesn't look like a numeric literal, and identifiers that look like numeric literals are forbidden because there is no way to spell them
2. An identifier can start with a digit if it doesn't look like a numeric literal, but backticks can be used to escape identifiers that would otherwise look like a numeric literal; e.g., case `0x0`
3. Something else?

taylorswift · October 30, 2024, 3:14pm

for me, #1 would cover most of my envisioned use cases, but i suspect if it went through evolution people might ask for a way to escape the names in #2, so that’s probably where we would end up.

the commitment that we’re making is we won’t be able to add support for more bases in the future (e.g. 0zXYZ) and regex syntax highlighters won’t handle these very well, but those seem like relatively minor costs.
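
A rough sketch of what the "anything that doesn't already look like an integer literal" rule could look like as a check. `looksLikeIntegerLiteral` is a hypothetical helper, not anything in the compiler. It assumes the integer-literal grammar from TSPL (decimal, `0b`, `0o`, `0x`, with `_` separators) and deliberately ignores floating-point literals such as `1e5`, which a real rule would also have to exclude:

```swift
/// Hypothetical helper: true if `token` matches Swift's integer-literal grammar.
/// Floating-point literals are not handled in this sketch.
func looksLikeIntegerLiteral(_ token: String) -> Bool {
    // The first character must be a digit of the base; later ones may also be "_".
    func matches(_ body: Substring, digits: String) -> Bool {
        guard let first = body.first, digits.contains(first) else { return false }
        return body.dropFirst().allSatisfy { digits.contains($0) || $0 == "_" }
    }
    if token.hasPrefix("0b") { return matches(token.dropFirst(2), digits: "01") }
    if token.hasPrefix("0o") { return matches(token.dropFirst(2), digits: "01234567") }
    if token.hasPrefix("0x") {
        return matches(token.dropFirst(2), digits: "0123456789abcdefABCDEF")
    }
    return matches(token[...], digits: "0123456789")
}

for token in ["7z", "2xx", "123", "1_000", "0xFF", "0x0", "0zXYZ"] {
    print(token, looksLikeIntegerLiteral(token) ? "literal" : "identifier under the rule")
}
```

Under this sketch `0zXYZ` is classified as an identifier, which is the lock-in being described: once such a spelling is a valid bare identifier, it can no longer become a literal in a new base.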

macguru · November 1, 2024, 3:09pm

tl;dr Overall I'm neutral on the proposal, but the swift-testing example is a bad idea.

I know this is not about adding this to swift-testing, but as a heavy swift-testing adopter, I'd like to add my perspective to the general idea proposed here. I think sentence-named test cases are not a good idea (and thus not a good example for this proposal), even in the current swift-testing.

Function names are easily recognizable and, for tests at least, often unique identifiers. We use them all the time and everywhere for navigation, searching, analysis, and more. My favorite feature: you can double-click them to select them quickly. They are common and understood across system boundaries. While Xcode might make the use of sentences nice, I have my doubts for everything else.

To illustrate: our CI runs on Gitlab. When a test run fails, I view the log, where it says something like:

Failing tests:
	MathTests.square2EvaluatesTo4()

I select the name, command-tab to Xcode, hit Command-O and paste the name. How would this work with the sentence names? In the "okay" scenario, the log would contain the sentence (with backticks?):

Failing tests:
    MathTests.`2 * 2 evaluates to 4`()

That would be fiddly to select, I cannot double-click, but might just be fine. What I fear is that we'll see mangled names there. Maybe not in the log, but everywhere else in the interface. Something like (the mangled name is just something copied from above):

Failing tests:
    MathTests._$s4test0014foospace_ntJBbyyF()

Does Xcode's Quick Open support mangled names? Can anybody actually recognize or even read them? I don't.

In a log or a trace, identifiers stick out, sentences blend in. Camel-cased identifiers are easy to spot, many small words look like debug output.

Let's not forget backtraces from swift-testing are already completely incomprehensible at times and more mangling certainly won't make it better.

I recognize this is purely a practical and ergonomic argument, but ergonomics are important to me and make me faster and more efficient at what I do. And that's why I think sentence identifiers are a bad idea in general (and we're not using them in our code).

John_McCall · November 1, 2024, 4:04pm

Many, many languages have evolved additional numeric literals over the course of their history. I would strongly oppose adding bare identifiers that would lock out that option.

Nobody1707 · November 1, 2024, 11:01pm

The only identifiers starting with a digit that I'd want are exactly the .N forms that tuple indexes use. And I wouldn't even mind if they were restricted to only be generated by macros. I just don't want to have to backtick them on every use.

lorentey · November 4, 2024, 8:31pm

I am personally in favor of this; I think our profession should move beyond parsing/editing text syntax, and this feels like a tiny baby step towards that direction.

A quick reading of the proposal text has raised some nitpicking questions for me, mostly related to Unicode versioning:

The proposal allows any "valid Unicode character" (with a handful of explicitly enumerated exceptions). What precisely does that mean? Most importantly, will Unicode scalars with unassigned code points be considered valid? (Under what version of Unicode?)
The proposal disallows identifiers that solely consist of characters with the White_Space property. What version of Unicode is going to be used to make that determination? Is there a path to upgrading to newer versions of Unicode, or are we going to get forever stuck on a legacy one?
When do two raw identifiers resolve to the same identity? Do they need to be composed of precisely the same scalars, or is there a normalization algorithm?

E.g., is this going to be an error, or will the two names below be considered distinct?
```
let `foo;bar`: Int // U+003B SEMICOLON
let `foo;bar`: Int // U+037E GREEK QUESTION MARK
```
If there is a normalization scheme, what is it? (Does it follow the Unicode spec? Which version? Is there an upgrade path?)
I think it's a shame that we can't use \u{37e} escapes in this context (yet).

Philippe_Hausler · November 4, 2024, 8:33pm

Some of the rules proposed would help with MMIO interface generation for embedded - some registers will start with leading numeric values (or for that matter be purely numeric). As others have stated, it is a bit unfortunate that this proposal does not let those be written without backticks.

allevato · November 5, 2024, 8:46pm

The version of Unicode that would be used would be the tables compiled into the standard library, since the parser is now written in Swift. Disallowing unassigned code points seems like the most obvious choice; it wouldn't allow a code point that we might need to forbid later for some reason. The only real risk I can see is that code using an assigned-in-the-future code point would not be able to be compiled with an older parser, which seems like a reasonable restriction (similar to how code using other new language features obviously can't be compiled with an older compiler).

This is an interesting question, because Unicode also defines the Pattern_White_Space property which is guaranteed to be stable/immutable and contains a subset of White_Space (plus LTR/RTL markers). Given that, it would also be worth considering the following formulation:

The only whitespace characters permitted in a raw identifier are characters with the Pattern_White_Space property, except for those that are line/paragraph separators.
A raw identifier may not consist of only whitespace characters as it is defined above.

I'm somewhat hesitant to carve out arbitrary exceptions like this if it makes the rules harder to understand (and to implement, for things like regex-based tooling that may want to match these), but since Pattern_White_Space is immutable and we'd only be subtracting from it, we're left with an extremely small and manageable set: \u{0020}, \u{200E}, and \u{200F}. So maybe that would be OK.
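
A minimal sketch of how that formulation could be checked, using the standard library's `Unicode.Scalar.Properties`. `isAcceptableRawIdentifierBody` is a hypothetical name, and the permitted set is taken at face value from the three scalars listed above. This is an illustration of the two rules, not the parser's implementation:

```swift
// Assumption: the permitted whitespace set is exactly the one listed above.
let permittedWhitespace: Set<Unicode.Scalar> = ["\u{0020}", "\u{200E}", "\u{200F}"]

/// Hypothetical helper: applies the two whitespace rules to the text
/// between the backticks of a raw identifier.
func isAcceptableRawIdentifierBody(_ body: String) -> Bool {
    let scalars = body.unicodeScalars
    // Rule 1: any scalar with White_Space or Pattern_White_Space
    // must belong to the permitted set.
    for scalar in scalars
    where scalar.properties.isWhitespace || scalar.properties.isPatternWhitespace {
        if !permittedWhitespace.contains(scalar) { return false }
    }
    // Rule 2: the body may not consist only of whitespace.
    // (An empty body is also rejected, because allSatisfy is true for it.)
    return !scalars.allSatisfy { permittedWhitespace.contains($0) }
}

print(isAcceptableRawIdentifierBody("square returns x * x"))  // true
print(isAcceptableRawIdentifierBody("   "))                   // false
print(isAcceptableRawIdentifierBody("a\u{00A0}b"))            // false
```

The no-break space in the last case has White_Space but not Pattern_White_Space, so it falls outside the permitted set under this reading.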

When do two raw identifiers resolve to the same identity? Do they need to be composed of precisely the same scalars, or is there a normalization algorithm?
E.g., is this going to be an error, or will the two names below be considered distinct?
let `foo;bar`: Int // U+003B SEMICOLON
let `foo;bar`: Int // U+037E GREEK QUESTION MARK
If there is a normalization scheme, what is it? (Does it follow the Unicode spec? Which version? Is there an upgrade path?)

I agree that confusables and normalization are important considerations, but it's worth noting that we have these issues today with regular identifiers. Latin Capital A and Greek Capital Alpha are both valid but separate identifiers in Swift, as would be something like Ä written as a single code point vs. with combining diacritics. I think we should address them holistically, but not necessarily as part of this proposal.

davedelong · November 5, 2024, 11:22pm

Overall, I'm not in favor of this change.

As others have said, I find the test naming motivation anti-compelling; its inclusion, more than anything, leads me to believe we should not accept this. If Swift Testing really wants to have fancy English-like test names, it should be able to do that as part of the Test macro. We don't need to extend the language for that.

In my mind, the real motivation for this is to allow properties, cases, and methods to start with numbers, so you can do things like .480p, HTTPStatus.200, and so on.

This proposal does not allow for that. Instead, it's allowing for these declarations to begin with backticks, which is not the same thing as allowing them to begin with numbers.

Therefore, I do not want this. Either let me start the declarations with numbers, or I'll just keep working around the limitation in other ways. I don't want more exceptions to the grammar.