Your post caused me to consider something else, which I think is quite important.
The reason the Swift grammar limits characters to the set it does is not solely to make parsing easier - it is also for security. Even the Unicode Consortium does not recommend that languages accept all characters everywhere; see UTS#55, Source Code Handling:
Source code, that is, plain text meant to be interpreted as a computer language, poses special security and usability issues that are absent from ordinary plain text. The reader (who may be the author or a reviewer) should be able to ascertain some properties of the underlying representation of the text by visual inspection, such as:
- the extent of lexical elements within the text;
- the nature of a lexical element (comment, string, or executable text);
- the order in memory of lexical elements;
- the equivalence or inequivalence of identifiers.
The potential presence in source code of characters from many writing systems, including ones whose writing direction is right-to-left, can make it difficult to ensure these properties are visually recognizable. Further, the reader may not be aware of these sources of confusion. These issues should be remedied at multiple levels: as part of computer language design, by ensuring that editors and review tools display source code in an appropriate manner, and by providing diagnostics that call out likely issues.
Accordingly, this document provides guidance for multiple levels in the ecosystem of tools and specifications surrounding a computer language. Section 3, Computer Language Specifications, is aimed at language designers; it provides recommendations on the lexical structure, syntax, and semantics of computer languages.
This is explored in more depth in UAX#31 Identifiers and Syntax.
The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.
The formulations allow for extensions, also known as profiles. That is, the particular set of code points or sequences of code points for each category used by the syntax can be customized according to the requirements of the environment. Profiles are described as additions to or removals from the categories used by the syntax. They can thus be combined, provided that there are no conflicts (whereby one profile adds a character and another removes it), or that the resolution of such conflicts is specified.
And of course, it wouldn't be Unicode if it didn't come with a table of which characters are allowed in which positions. These are the `ID_*` and `XID_*` families of properties, which we even make available in `Unicode.Scalar.Properties` (for example `isIDStart`; see the Apple Developer Documentation).
They document some standard profiles, such as for allowing mathematical symbols and emojis. In theory we could extend it with our own profiles, but any extra characters we allow should be considered carefully.
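As a rough illustration of what those properties give us, here is a minimal sketch of the default UAX#31 identifier shape (`XID_Start` followed by any number of `XID_Continue`) built on `Unicode.Scalar.Properties`. The helper name `isDefaultIdentifier` is made up for this example, and the sketch ignores normalization and any profile, so it should not be read as what the Swift lexer actually does.

```swift
// Sketch of the UAX#31 default identifier shape: XID_Start XID_Continue*.
// `isDefaultIdentifier` is a hypothetical helper, not a standard library API.
func isDefaultIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()

    // The first scalar must be allowed at the start of an identifier.
    guard let first = scalars.next(), first.properties.isXIDStart else {
        return false
    }

    // Every following scalar must be allowed in the continuation position.
    while let scalar = scalars.next() {
        guard scalar.properties.isXIDContinue else { return false }
    }
    return true
}
```

Note that `_` is `XID_Continue` but not `XID_Start`, so this check rejects `_foo`; allowing it as a start character is the kind of addition a profile would make.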
Anyway, compare that with the set of characters allowed by the proposal:
A raw identifier may contain any valid Unicode characters except for the following:
- The backtick (`` ` ``) itself, which terminates the identifier.
- The backslash (`\`), which is reserved for potential future escape sequences.
- Carriage return (`U+000D`) or newline (`U+000A`); identifiers must be written on a single line.
- The NUL character (`U+0000`), which already emits a warning if present in Swift source but would be disallowed completely in a raw identifier.
- All other non-printable ASCII code units that are also forbidden in single-line Swift string literals (`U+0001...U+001F`, `U+007F`).
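To make it easier to probe that list, here is a small sketch that encodes it as a predicate over Unicode scalars. The function name is hypothetical, and it takes the quoted ranges literally (so every C0 control, including tab, is rejected); the actual implementation of the proposal may differ.

```swift
// Hypothetical predicate mirroring the exclusion list quoted above.
// Everything not explicitly excluded is treated as allowed.
func isAllowedInRawIdentifier(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x60:          // backtick: terminates the identifier
        return false
    case 0x5C:          // backslash: reserved for future escape sequences
        return false
    case 0x00...0x1F:   // NUL, CR, LF and the other C0 control characters
        return false
    case 0x7F:          // DEL
        return false
    default:
        return true
    }
}
```

Under this reading, both `isAllowedInRawIdentifier("\u{2028}")` (line separator) and `isAllowedInRawIdentifier("\u{0301}")` (combining acute accent) return `true`.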
This seems extremely broad. In particular, it seems to allow isolated combining characters (which may combine with the surrounding text when rendered in an editor) and the Unicode line separator (U+2028), which is explicitly called out in UTS#55 because even though compilers tend not to recognise it as a newline (including the Swift compiler), editors may render it as one:
The Unicode Standard encompasses multiple representations of the New Line Function (NLF). These are described in Section 5.8, Newline Guidelines, in [Unicode], as well as in Unicode Standard Annex #14, Line Breaking Algorithm [UAX14].
An opportunity for spoofing can occur if implementations are not consistent in the supported representations of the newline function: multiple logical lines can be displayed as a single line, or a single logical line can be displayed as multiple lines.
For instance, consider the following snippet of C11, as shown in an editor which conforms to the Unicode Line Breaking Algorithm:
// Check preconditions.
if (arg == (void*)0) return -1;
If the line terminator at the end of line 1 is U+2028 Line Separator, which is not recognized as a line terminator by the language, the compiler will interpret this as a single line consisting only of a comment; to a reviewer, the program is visually indistinguishable from one that has a null check, but that check is really absent.
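A diagnostic of the kind UTS#55 asks for could start from something as small as the following sketch, which reports scalars that an editor may render as a line break but that a compiler typically does not treat as a newline. The set of scalars is an assumption for illustration (the hard line-break characters from UAX#14 other than CR and LF), and the names are hypothetical.

```swift
// Assumed set of scalars that may be displayed as a line break without
// being a newline to the compiler. Not an exhaustive or official list.
let visualLineBreaks: Set<Unicode.Scalar> = [
    "\u{000B}", // vertical tab
    "\u{000C}", // form feed
    "\u{0085}", // next line (NEL)
    "\u{2028}", // line separator
    "\u{2029}", // paragraph separator
]

// Returns the scalar offsets at which such characters occur in `source`.
func offsetsOfVisualLineBreaks(in source: String) -> [Int] {
    source.unicodeScalars.enumerated()
        .filter { visualLineBreaks.contains($0.element) }
        .map { $0.offset }
}

// The C11 example above, with U+2028 in place of the real line terminator.
let snippet = "// Check preconditions.\u{2028}if (arg == (void*)0) return -1;"
print(offsetsOfVisualLineBreaks(in: snippet))
```

For the snippet above this reports a single hit, at the position of the U+2028 that hides the null check inside the comment.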