I don’t think this is true. The new lexer doesn’t use String
and I don’t think it uses any of the stdlib’s Unicode tables. Instead, it has its own logic for classifying Unicode characters (see `Sources/SwiftParser/Lexer/UnicodeScalarExtensions.swift` and `Sources/SwiftParser/CharacterInfo.swift` in the swiftlang/swift-syntax repository on GitHub).
Is there any reason not to access `Unicode.Scalar.Properties` and similar APIs from the lexer? It seems like nothing would prevent us from asking whether a scalar is assigned, unless that's purely a policy decision?
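For illustration, here's a minimal sketch of the kind of query I have in mind, using only stdlib APIs that exist today (whether the lexer *should* call them is exactly the policy question):

```swift
// Ask the stdlib's Unicode tables about a scalar.
let scalar: Unicode.Scalar = "\u{0301}" // COMBINING ACUTE ACCENT

// A scalar is "assigned" if its general category is anything other
// than Cn (unassigned).
let isAssigned = scalar.properties.generalCategory != .unassigned

print(isAssigned)                        // true
print(scalar.properties.generalCategory) // nonspacingMark
print(scalar.properties.name ?? "?")     // COMBINING ACUTE ACCENT
```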
Indeed, generated resource identifiers in Xcode preserve the decomposed representation from the Apple file system, so that `Ä` is generated as `A` + `¨` (a combining diaeresis).
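To make the distinction concrete (my own example, not from the proposal): the two forms compare equal as Swift `String`s, but they contain different scalar sequences, which is exactly what a scalar-based identifier comparison would see:

```swift
let precomposed = "\u{00C4}"  // "Ä" as a single scalar
let decomposed  = "A\u{0308}" // "A" followed by a combining diaeresis

// Swift Strings compare by canonical equivalence...
print(precomposed == decomposed) // true

// ...but the underlying scalar sequences differ.
print(precomposed.unicodeScalars.count) // 1
print(decomposed.unicodeScalars.count)  // 2
```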
The Swift grammar does not reference any Unicode properties; it is specified as a list of specific code points.
I suggested above that we should look at revisiting this and bringing it into alignment with the latest advice from Unicode. This would use scalar properties:
> The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.
This advice allows for profiles with targeted exceptions, e.g. allowing identifiers to start with underscores, or allowing them to start with digits (that seems like it would be a popular change):
> The formulations allow for extensions, also known as profiles. That is, the particular set of code points or sequences of code points for each category used by the syntax can be customized according to the requirements of the environment. Profiles are described as additions to or removals from the categories used by the syntax.
Concretely, I'm suggesting that we reformulate the existing grammar as one of these "profiles", and take the opportunity to check that our definitions are not too broad (e.g. identifiers shouldn't be allowed to start with a combining character).
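To sketch what such a profile could look like (the function names and the leading-underscore exception here are placeholders for discussion, not a worked-out design), note that the stdlib already exposes the relevant UAX #31 properties as `isXIDStart` and `isXIDContinue`:

```swift
// UAX #31 default identifier rules, plus one profile-style exception:
// allow a leading underscore. Combining characters are XID_Continue
// but not XID_Start, so they can't begin an identifier here.
func isIdentifierStart(_ s: Unicode.Scalar) -> Bool {
    s == "_" || s.properties.isXIDStart
}

func isIdentifierContinue(_ s: Unicode.Scalar) -> Bool {
    s.properties.isXIDContinue
}

func isValidIdentifier(_ name: String) -> Bool {
    var scalars = name.unicodeScalars[...]
    guard let first = scalars.popFirst(), isIdentifierStart(first) else {
        return false
    }
    return scalars.allSatisfy(isIdentifierContinue)
}

print(isValidIdentifier("_café"))      // true
print(isValidIdentifier("\u{0301}x"))  // false: starts with a combining mark
```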
Then, I would suggest that instead of allowing any Unicode characters, "raw" identifiers could use an expanded profile, for instance allowing whitespace and other specific characters typically discouraged in identifiers.
-1 for exactly the same reasons @davedelong mentioned above. I specifically want to reiterate:
And it feels like a miss to not solve this problem.
I personally don't have an issue using ``import `foo/bar/module` ``, but I can't imagine ever wanting to use an API that makes pervasive use of these identifiers.
I hope no one ever would do this, but an API like `` func `plus 1`(x: Int) -> Int `` is horribly obnoxious in my mind.
Additionally, I think the meta-effects on the ecosystem will not be great.
- It's very annoying to have to use double backticks in Markdown to monospace code that uses these identifiers.
- I also foresee that many random parsers around the web will struggle to handle these identifiers, and the overall quality of Swift visual styling will suffer.
In my opinion, the main reason is that we want these tables to be stable and not dependent on the Unicode tables of the Swift compiler that was used to build the new Swift compiler.
That's a fair point now, but is it a sustainable position long-term? If we do ever want to add normalization/confusables support for identifiers at some point in the future, the parser (and the rest of the compiler) will eventually need to use Unicode properties or other Unicode APIs in some fashion.
I think that’s a discussion we can have but I don’t think this review thread is the right place to have that conversation.
That feels like a choice we might regret, as it splits the language into multiple parallel dialects. On Apple's platforms, the Swift Standard Library (and the Unicode tables therein) ship as part of the operating system. If the parser uses the same Unicode implementation as the stdlib, then the same compiler binaries will exhibit different behavior based on the version of the OS they're running on.
That sounds like the right thing to do!
That sounds like a good plan.
Caveat: If we let the stdlib dictate what code points are considered assigned, then the set of accepted characters will be decided by the version of the host OS that builds the code, not the version of Swift implemented by the compiler. That feels bad to me.
Yes, that would make the rules eminently clear.
"Regex-based tooling" raises some alarm bells for me, for the same reason as above. I think it would be a mistake to allow Swift code to fail to build (or to let their behavior change) because somebody upgrades their build environment's OS.
This was a clarifying question. I did not mean to hint that the two `foo;bar` spellings above should be the same (or different); I do think the behavior should be explicitly stated, though, as it has direct ABI stability implications.
Not doing any normalization whatsoever avoids issues with Unicode versioning, which I believe is a clear benefit.
If Swift itself will not set rules, the rational thing to do is probably for each project to set up strict policies about which characters they are willing to accept within their source code, and which subset of those they consider acceptable to expose on their ABI surface.
This is a good point. The first solution that comes to mind would be to effectively "freeze" the current ranges of unassigned code points by hardcoding them into the parser, and anything added to the Unicode standard later on wouldn't be available until the compiler was explicitly updated to support it. The number of contiguous ranges of unassigned code points is quite large, but other than that I can't think of a technical reason it wouldn't work.
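As a rough illustration of that approach (the table below is made up for the example; a real one would be generated from a pinned UCD version when the compiler itself is built):

```swift
// Hypothetical frozen table: sorted, non-overlapping ranges of scalars
// the parser accepts, baked into the source at compiler-build time.
let frozenAssignedRanges: [ClosedRange<UInt32>] = [
    0x0041...0x005A, // illustrative excerpt only, not real UCD data
    0x0061...0x007A,
    0x00C0...0x00D6,
]

// Binary search: the real table would be large, so O(log n) lookup
// keeps the per-scalar cost trivial.
func isFrozenAssigned(_ scalar: Unicode.Scalar) -> Bool {
    var lo = 0
    var hi = frozenAssignedRanges.count
    while lo < hi {
        let mid = (lo + hi) / 2
        let range = frozenAssignedRanges[mid]
        if scalar.value < range.lowerBound {
            hi = mid
        } else if scalar.value > range.upperBound {
            lo = mid + 1
        } else {
            return true
        }
    }
    return false
}

print(isFrozenAssigned("A")) // true
print(isFrozenAssigned("ß")) // false in this tiny excerpt
```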
I suspect that we're still going to have a reckoning with this kind of problem in the future if we want to support things like normalization, unless the answer to that (and maybe it is) is also that the parser must maintain its own copies of the Unicode data that it needs to perform those operations.
To clarify, what I meant here was that the reduced set of valid whitespace characters would simplify tools such as regex-based syntax highlighters; nothing specifically related to building the code itself.