Unicode is Swifty...
Swift claims to support Unicode source. Many languages aim to be encoding agnostic by only using ASCII in the grammar and not attempting to interpret anything beyond that in comments and string literals. But Swift has gone much further and defined its grammar in terms of Unicode. The formal definitions of identifiers and operators are each a screen full of Unicode scalar references and demonstrate clear and deliberate support for advanced Unicode concepts like combining characters. Swift’s documentation even goes so far as to encourage use of Unicode. Unicode characters appear in identifiers in The Swift Programming Language already on page 3, “The Basics” under “Naming Constants and Variables”, and recur in examples throughout the book.
...but Swift has dropped the .
Swift proudly touts support for Unicode equivalence across the documentation for its String type, but the compiler does not actually consider it.
As @michelf noted recently:
In addition to allowing clashes, it fails to resolve references the developer expects to match:
let café = "Queen’s Lane Coffee House" // é entered as e + U+0301 (decomposed)
print(café) // é entered as U+00E9 (precomposed). Error: “Use of unresolved identifier 'café'”
In addition to being surprising, this means that wherever Unicode actually gets used, two developers with different input methods may have difficulty using each other’s code.
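The gap between what String promises and what identifier lookup does can be reproduced at runtime. The following is only a sketch: the two spellings are written with escapes so that no editor or web form can silently renormalize them, and the scalar-level comparison stands in for the compiler behaviour described above (an assumption about how the lookup works internally, not something I have checked in the compiler source).

```swift
// Two spellings of the same text.
let decomposed = "cafe\u{301}" // e followed by U+0301 COMBINING ACUTE ACCENT
let composed = "caf\u{E9}"     // precomposed U+00E9

// String compares by canonical equivalence...
let equalAsStrings = decomposed == composed

// ...but the underlying scalar sequences differ.
let equalAsScalars = decomposed.unicodeScalars.elementsEqual(composed.unicodeScalars)

print(equalAsStrings, equalAsScalars) // true false
print(decomposed.unicodeScalars.count, composed.unicodeScalars.count) // 5 4
```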
Due to quirks of tooling, a single developer may write functions he later has trouble calling himself. A developer might write most of his code locally with a decomposition-based input method (such as most Korean keyboards), and he writes it in Xcode, which is normalization agnostic. But during review on GitHub, he detects a typo in a function name and quickly fixes it there. Since GitHub’s web interface commits everything in NFC, he now (unknowingly) has a composed-form function name. After a while he pulls the change to his local machine. When he types the function name with his keyboard in order to call it somewhere, the compiler tells him it does not exist. Confused, he tries it in several places and eventually concludes that the declaration must be buggy, so he erases and retypes it, only to be puzzled by the fact that it now magically works. (Alternatively he gives up, reverts the commit and starts over, refactoring the code to avoid what he perceives as a compiler glitch.)
Known tools that are not really compatible with Swift’s current behaviour include:
- GitHub UI: Commits made through the web interface are in NFC; but commits made by the command line are unaltered.
- Swift Forums: Copied and pasted text becomes NFC, but text entered directly is unaltered.
Possible Solutions
Essentially the Swift compiler needs to respect Unicode equivalence when it compares tokens. Several concrete routes to accomplishing this exist:
- A) Have the compiler test tokens for equality the same way as the Swift String type does.
  - Advantages:
    - By keeping the on-disk representation around the whole time, error messages and diagnostics will have the same representation as the file. This means the token will look the same on platforms/fonts that handle equivalence poorly in display.
    - Specific Unicode scalar sequences can be selected with direct string-like literals at the developer’s own risk. (Compare option C.)
  - Disadvantages:
    - Each comparison is slower during the compilation process.
    - A few corner cases around string-like literals are still surprising. (Compare option C.)
- B) Have the compiler normalize each token after tokenization, with the exception of string-like literals.
  - Advantages:
    - Fast repeated comparison after the cost of a single conversion.
    - Specific Unicode scalar sequences can be selected with direct string-like literals at the developer’s own risk. (Compare option C.)
  - Disadvantages:
    - Diagnostics may look odd on some platforms. (Compare option A.)
    - A few corner cases around string-like literals are still surprising. (Compare option C.)
- C) Have the compiler parse a normalized copy of the entire file.
  - Advantages:
    - Fast repeated comparison after the cost of a single conversion.
    - No more corner cases or delayed surprises around string literals. (See here and here.) If external normalization happens to the source code it will make no difference to the resulting program. Every string-like literal will either always generate the same compiled scalar sequence or always fail to compile.
    - Cleanly prevents any similar problems from ever appearing in the future, no matter how syntax evolves. There would be no need to think about which new syntax elements need to behave like syntax tokens and which need to behave like user strings. All behaviour would be consistent and not require evolution consideration.
  - Disadvantages:
    - Diagnostics may look odd on some platforms. (Compare option A.)
    - Non-normal sequences will not be representable as direct literals and each scalar will need to be expressly requested by its hex value in order to evade the normalization: let ångström = "\u{212B}". While this is already the only way to make source resilient, the requirement might come as an initial surprise to some developers. However, as @Michael_Ilseman has pointed out, Swift has never actually promised to preserve scalar sequences in a String even during runtime, so even let ångström = "\u{212B}" technically has no guarantees.
- D) Have the compiler normalize the actual on-disk file.
  - Advantages:
    - Shares advantages of C.
    - Encourages writers of tooling to think about Unicode equivalence, because if they produce non-normal source they would trigger diffs that jump back and forth.
  - Disadvantages:
    - Shares disadvantages of C.
    - May seem too intrusive.
    - Opposite normalization to Swift-agnostic text tools could cause certain tool combinations to become annoying.
(I personally favour option C, because it is future proof, consistent, and predictable. A and B are also good options with only minimal surprises.)
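To make the difference between the routes more concrete, here is a rough runtime sketch. Nothing in it is compiler code: the dictionaries are hypothetical stand-ins for a symbol table, normalizedScalars is a made-up helper, and NFC is an arbitrary choice of normalization form. Option A leans on String equality and hashing; options B and C normalize once and then compare raw scalars. The last lines show what normalizing would do to U+212B if it were typed as a direct literal instead of an escape.

```swift
import Foundation

// Hypothetical token spellings as they might appear on disk.
let declared = "cafe\u{301}" // decomposed
let referenced = "caf\u{E9}" // precomposed

// Option A, sketched: key the table by String, whose equality and
// hashing follow canonical equivalence.
let symbolsA: [String: Int] = [declared: 1]
let foundA = symbolsA[referenced] != nil

// Options B and C, sketched: normalize once, then compare scalars.
// (Hypothetical helper; NFC is an assumption, NFD would work as well.)
func normalizedScalars(_ token: String) -> [Unicode.Scalar] {
    return Array(token.precomposedStringWithCanonicalMapping.unicodeScalars)
}
let symbolsB: [[Unicode.Scalar]: Int] = [normalizedScalars(declared): 1]
let foundB = symbolsB[normalizedScalars(referenced)] != nil

// U+212B ANGSTROM SIGN is not in normal form; NFC turns it into U+00C5.
let angstromSign = "\u{212B}"
let afterNormalization = angstromSign.precomposedStringWithCanonicalMapping
let survivesNormalization = afterNormalization.unicodeScalars.first!.value == 0x212B

print(foundA, foundB, survivesNormalization) // true true false
```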
Source Compatibility
Properly handling Unicode equivalence will break source that contains equivalent identifiers that the developer intended to be distinct, such as @michelf’s example near the top. An important question is whether that sort of thing is a bug that should ever have compiled in the first place, which is how I would qualify it.
Option C would also change the semantics of combinations like let x = "écrire".unicodeScalars.first!. Those oft-assumed semantics were never actually promised, so it falls in a similar category to SE-0241 about encodedOffset. It only breaks misuse, but there is a lot of misuse. So at the very least, migration aids should be considered.
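As a small illustration of the kind of code whose result depends on how the literal happened to be typed, consider the sketch below. Escapes are used here only to pin down the two spellings; in real source the difference would be invisible, which is the problem.

```swift
// The same word, spelled two canonically equivalent ways.
let decomposedLiteral = "e\u{301}crire" // e + U+0301
let composedLiteral = "\u{E9}crire"     // U+00E9

// As String values they are indistinguishable...
let sameString = decomposedLiteral == composedLiteral

// ...yet the first scalar differs, so code relying on it changes
// behaviour whenever the source file gets renormalized.
let firstOfDecomposed = decomposedLiteral.unicodeScalars.first!
let firstOfComposed = composedLiteral.unicodeScalars.first!

print(sameString)                                 // true
print(String(firstOfDecomposed.value, radix: 16)) // 65
print(String(firstOfComposed.value, radix: 16))   // e9
```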
ABI Stability
...uh oh... Can someone more knowledgeable tell me whether name mangling and such already handles equivalence or if this whole pitch goes right here and Swift in Unicode is doomed for ever?
CC: @Michael_Ilseman
Immediate Mitigation on a Per-Project Basis
Projects which extensively use Unicode can defend against this to some degree by scripting normalization of all project files before build and at check in.† Then at least your project is internally consistent and the risk is much lower. But you will still run into issues if you try to import two packages which use opposite normalization forms. And SourceKit will still throw compiler errors at you while you type if your input method assumes the opposite normalization form to the rest of your package, though liberal use of autocomplete can reduce that somewhat.
† For anyone interested, a tool I wrote, Workspace, can concisely normalize packages: $ workspace normalize, though that feature is largely unadvertised because I never recognized its importance.
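For projects that would rather script this by hand, a minimal sketch of such a normalization pass might look like the following. The function names are hypothetical, NFC is an arbitrary choice (what matters is that the whole project agrees), and only files with a .swift extension are touched. Note that this rewrites string literals as well, so any literal that depends on a non-normal sequence should use \u{...} escapes first.

```swift
import Foundation

/// Rewrites one file in NFC if needed; returns whether it changed.
func normalizeFile(at url: URL) throws -> Bool {
    let original = try String(contentsOf: url, encoding: .utf8)
    let normalized = original.precomposedStringWithCanonicalMapping
    // String == treats equivalent text as equal, so compare scalars instead.
    if original.unicodeScalars.elementsEqual(normalized.unicodeScalars) {
        return false
    }
    try normalized.write(to: url, atomically: true, encoding: .utf8)
    return true
}

/// Normalizes every .swift file under `root`; returns how many changed.
func normalizeSwiftSources(under root: URL) throws -> Int {
    guard let files = FileManager.default.enumerator(
        at: root, includingPropertiesForKeys: nil) else { return 0 }
    var changed = 0
    for case let url as URL in files where url.pathExtension == "swift" {
        if try normalizeFile(at: url) { changed += 1 }
    }
    return changed
}

// Possible use from a pre-build or pre-commit script:
// let count = try normalizeSwiftSources(
//     under: URL(fileURLWithPath: FileManager.default.currentDirectoryPath))
```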