New to the list, but old hand at PL design. Was looking over the lexical
structure of Swift 2.2 and 3.0, and I have some questions. A number of
considerations identified in UAX31 (Unicode Identifier and Pattern Syntax)
and UTR36 (Unicode Security Considerations) aren't obviously addressed.
Here are some items that jumped out from a casual glance at the spec:
1. The specification does not appear to state any particular rules for
compatibility or normalization in identifiers. Other Unicode-aware
programming languages have adopted NFKC almost universally, and for good
reason. The current identifier-head and identifier-character grammars admit
sequences that Unicode considers malformed.
2. The specification does not appear to address any notion of Unicode
equivalent sequences.
3. The relationship between the identifiers admitted by Swift 3 and
identifiers admitted by UAX31 isn't clear. As a matter of cross-platform
compatibility it would be really good if identifiers permitted by the
default rules of UAX31 were all legal in Swift. This seems important for
cross-language interop.
Has this relationship been discussed somewhere I can catch up on?
4. Valid operators include code points that are undefined in any current or
historical Unicode standard. That seems problematic. Future revisions to
Unicode will eventually give *some* of those code points the XID_Start or
XID_Continue property (XIDS/XIDC), at which point we will have to choose
between backwards
compatibility and interop. Others will be assigned to new combining marks,
which will want to be used in identifiers. As new languages are added to
Unicode, compatibility concerns will exclude some groups from using
identifiers that are natural to them.
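To make items 1 and 2 concrete, here is a small sketch in Python (chosen only because its standard `unicodedata` module exposes the normalization forms; nothing here is Swift behavior). The sample strings are my own illustrations. It shows two spellings that differ as code point sequences but are canonically equivalent, and a pair that matches only under compatibility normalization (NFKC):

```python
import unicodedata

def nfc(s):
    # Canonical composition only.
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    # Compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", s)

# "café" spelled with precomposed U+00E9 vs. "e" + combining acute U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# "file" spelled with the U+FB01 "fi" ligature vs. the plain letters.
ligature = "\ufb01le"
plain = "file"

# Different code point sequences, same text after canonical normalization.
canonical_match = precomposed != decomposed and nfc(precomposed) == nfc(decomposed)

# NFC leaves the ligature alone; only NFKC folds it into "fi".
compat_match = nfc(ligature) != nfc(plain) and nfkc(ligature) == nfkc(plain)
```

A compiler that compares identifiers code point by code point would treat each of these pairs as two distinct names, which is the concern behind the first two items.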
In order of least-to-most difficulty, I'd like to suggest some changes to
the specification. I'm willing to implement them if agreement can be
reached:
1. Pick a Unicode version and exclude any code point that is undefined as
of that standard from both operators and identifiers. It's relatively easy
and backwards compatible to move the Unicode version number forward as the
language specification evolves.
2. Ensure that no code point with the Unicode Pattern_Syntax or
Pattern_White_Space property is included in identifier-head or
identifier-character.
3. Explicitly state that no code point in the union of XIDS and XIDC, or in
Pattern_White_Space, is legal in an operator. Consider ensuring that
everything in Pattern_Syntax *is* permitted in an operator.
4. I'd personally like to see an explicit statement of the extensions to
XIDS/XIDC that are admitted by identifier-head and identifier-character.
UAX31 refers to such extensions as a "profile", and explicitly allows them.
I'm not interested in changing the identifier space unless there is
something grossly and obviously problematic. What I'm after is enabling
developers to be cognizant of potential interop challenges.
5. Adopt NFKC for identifiers. Specify and implement a particular version of
the normalization (combining) algorithm so that forward/backward
compatibility is ensured.
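As a rough illustration of proposals 1 and 3, the following Python sketch rejects operator characters that are unassigned in a pinned Unicode version or that could appear in an identifier. The assumptions are mine: the pinned version is simply whatever database ships with the Python build, `operator_violations` is a hypothetical helper, and XID_Continue is approximated through Python's own identifier rules. The standard library does not expose Pattern_Syntax or Pattern_White_Space, so those checks are left out.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build;
# it stands in for the "picked" Unicode version.
PINNED_VERSION = unicodedata.unidata_version

def is_assigned(ch):
    # General category "Cn" covers unassigned (reserved) code points
    # and noncharacters.
    return unicodedata.category(ch) != "Cn"

def is_xid_continue(ch):
    # Approximation: Python identifiers follow XID_Start/XID_Continue,
    # so "a" + ch is an identifier when ch may continue one.
    return ("a" + ch).isidentifier()

def operator_violations(op):
    """Return the reasons `op` would be rejected under the sketched rules."""
    problems = []
    for ch in op:
        label = "U+%04X" % ord(ch)
        if not is_assigned(ch):
            problems.append(label + " is unassigned in Unicode " + PINNED_VERSION)
        elif is_xid_continue(ch):
            problems.append(label + " is an identifier character")
    return problems
```

Moving the pinned version forward only ever turns rejected code points into accepted ones, which is why the first proposal stays backwards compatible.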
The first three are pretty trivial. The fourth would take some sleuthing,
but it is straightforward. The fifth is real work. I'd be willing to sign
up to any or all of these, but for a starting point I want to learn where
things stand, what decisions have already been made, and where any current
discussion may be happening.
I very much doubt that NFKC would break existing code, if only because the
use of malformed Unicode sequences is likely to be rare. To the extent that
they exist in the field, they are almost certainly (a) unintentional, or
(b) security concerns. It seems like a good thing to catch both of those
early to the extent that we can, and to do so while the language definition
remains somewhat fluid.
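One way to test that claim against real code would be to scan the identifiers in existing sources and report any that NFKC would change. A minimal Python sketch, with a hypothetical `nfkc_changes` helper and a made-up sample list standing in for identifiers pulled from a code base:

```python
import unicodedata

def nfkc_changes(identifiers):
    """Map each identifier that is not already in NFKC to its NFKC form."""
    changes = {}
    for name in identifiers:
        normalized = unicodedata.normalize("NFKC", name)
        if normalized != name:
            changes[name] = normalized
    return changes

# Illustrative input only: plain ASCII, a ligature, a decomposed accent,
# and an already-composed accent.
sample = ["count", "\ufb01le", "cafe\u0301", "caf\u00e9"]
report = nfkc_changes(sample)
```

An empty report over a large corpus would support the expectation that adopting NFKC breaks little or nothing; any entries it does contain are the cases worth inspecting by hand.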
Thanks!
Jonathan Shapiro