As Stage 2 of Swift 4 evolution starts now, I'd like to share a revised
proposal in draft form.
It proposes a source-breaking change for *rationalizing* which characters
are permitted in identifiers and which in operators. It's justified for
this phase of Swift 4 because:
- Existing grammar, in permitting invisible characters without
security-minded restrictions, can be *actively harmful.*
- A rationalized approach is *superior* to the current approach: by
referencing Unicode standards, Swift should be able to evolve in a
backwards-compatible way alongside Unicode, and will benefit from the
significant expertise of others outside the Swift community with respect to
Unicode best practices.
- The vast majority of existing code (including all of the standard
library) should *require no migration* work at all.
*What's changed* since the last time:
- In an earlier draft, we proposed some radical changes to align with
available Unicode standards; in particular, since emoji represent a
difficult issue, and no recommendations about "operator identifiers" have
surfaced from Unicode, we proposed temporarily stripping them out.
This was *very poorly received*. This revision uses Unicode categories to
identify nearly
all emoji and classify them as identifier characters (while excluding those
that depict operators such as !), and it uses Unicode categories to
identify over 900 operators that nearly all pass the subjective test of
"operator-likeness."
What this proposal *does not attempt* to do:
- This document *does not* seek to stake out new ground as to what
characters should be *added* to the set of valid identifiers and operators.
Such additions to the grammar are properly separate discussions. This
proposal is only an attempt at systemization and rationalization. Only one
character is incidentally added to the list of valid characters (`\`), and
it is on the basis of an explicit table in Unicode Technical Report 25
regarding ASCII characters that are "mathematical."
What feedback would be *most helpful*:
- "Hey, this approach is so much more *clumsy* than my superior, more
elegant category-based approach to identifying [operators/emoji], which is
[insert here]."
- "Hey, I disagree with the detailed design because it's got a *major
security hole*, which is [insert here]."
- "Hey, your proposal would break my *real-world* Swift code, which
requires that character be an [identifier/operator]."
What would be *less helpful*:
- "Hey, let's talk about how [specific character] should be an
[identifier/operator]. We should add that character to the list of
[identifiers/operators]. In fact, let's discuss [list] characters one by
one."
Acknowledgments:
Thanks to co-authors of the previous take for their support for
resurrecting this issue. Any brilliant ideas are undoubtedly theirs, and
any botched efforts are certainly mine. Thanks also to Nevin
Brackett-Rozinsky for helpful feedback.
Link:
Rendered text:
Refining identifier and operator symbology (take 2)
- Proposal: SE-NNNN
<https://gist.github.com/xwu/NNNN-refining-identifier-and-operator-symbology.md>
- Authors: Xiaodi Wu <https://github.com/xwu>, Jacob Bandes-Storch
<https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>,
Jonathan Shapiro, João Pinheiro <https://github.com/joaopinheiro>
- Review Manager: TBD
- Status: Awaiting review
Introduction
This proposal refines and rationalizes Swift's identifier and operator
symbology. Specifically, this proposal:
- refines the set of valid identifier characters based on Unicode
recommendations, with customizations principally to accommodate emoji;
- refines the set of valid operator characters based on Unicode
categories; and
- changes rules as to where dots may appear in operators.
discussion threads and proposals
- Define backslash '\' as a operator-head in the swift grammar
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170130/031461.html>
- Refining Identifier and Operator Symbology
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20161017/028174.html>
(a precursor to this document)
- Proposal: Normalize Unicode identifiers
<https://github.com/apple/swift-evolution/pull/531>
- Lexical matters: identifiers and operators
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
- Unicode identifiers & operators
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>,
with pre-proposal
<https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59>
- Proposal: Allow Single Dollar Sign as Valid Identifier
<https://github.com/apple/swift-evolution/pull/354>
- Free the '$' Symbol!
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
- Request to add middle dot (U+00B7) as operator character?
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>
Motivation
Swift supports programmers from many languages and cultures. However, the
current identifier and operator character sets do not conform to any
Unicode standards, nor have they been rationalized in the language or
compiler documentation. These deserve a well-considered, standards-based
revision.
As Chris Lattner has written:
We need a token to be unambiguously an operator or identifier - we can have
different rules for the leading and subsequent characters though.
…our current operator space (particularly the Unicode segments covered) is
not super well considered. It would be great for someone to take a more
systematic pass over them to rationalize things.
Identifiers, which serve as *names* for various entities, are linguistic in
nature and must permit a variety of characters in order to properly serve
non–English-speaking coders. This issue has been considered by the
communities of many programming languages already, and the Unicode
Consortium has published recommendations on how to choose identifier
character sets. Swift should make an effort to conform to these
recommendations.
Operators, on the other hand, should be rare and carefully chosen because
they suffer from limited discoverability and readability. They are by
nature *symbols*, not names. This places a cognitive cost on users with
respect to recall ("What is the operator that applies the behavior I
need?") and recognition ("What does the operator in this code do?"). While
almost every non-trivial program defines new identifiers, most programs do
not define new operators.
Inconsistency
Concrete discrepancies and edge cases motivate these proposed changes. For
example:
- The Greek question mark ; is a valid identifier.
- Some *non-combining* diacritics ´ ¨ ꓻ are valid in identifiers.
- Braille patterns ⠟, which are letter-like, are operator characters.
- Other symbols such as ⚄ and ♄ are operator characters despite not
being "operator-like."
- Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers
(₪ € ₱ ₹ ฿ ...).
- are identifiers, while are operators.
- A few characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers
and operators.
distinctions
Identifiers that take advantage of Swift's Unicode support are not
normalized. This allows different representations of the same characters to
be considered distinct identifiers. For example:
let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"
Non-printing characters such as ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER
are also accepted as valid identifier characters without any restrictions.
let ab = "ab"
let ab = "a + ZERO WIDTH SPACE + b"
func xy() { print("xy") }
func xy() { print("x + ZERO WIDTH NON-JOINER + y") }
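To make these distinctions concrete, here is a small sketch (not part of the proposal itself) that spells the three forms of Å with explicit scalar escapes and normalizes them using Foundation's `precomposedStringWithCanonicalMapping`, which applies NFC. The helper names `scalarsHex` and `nfc` are illustrative assumptions. Scalars are compared rather than `String` values because Swift's `String` equality already treats canonically equivalent strings as equal.

```swift
import Foundation

// Three spellings of "Å", written with escapes so the difference is visible.
let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed = "\u{00C5}"    // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed = "A\u{030A}"    // A + COMBINING RING ABOVE

/// Illustrative helper: the scalar values of a string as uppercase hex.
func scalarsHex(_ text: String) -> [String] {
    return text.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

/// Illustrative helper: Normalization Form C via Foundation.
func nfc(_ text: String) -> String {
    return text.precomposedStringWithCanonicalMapping
}

// Before normalization the three spellings are distinct scalar sequences.
print(scalarsHex(angstromSign), scalarsHex(precomposed), scalarsHex(decomposed))
// prints ["212B"] ["C5"] ["41", "30A"]

// After NFC all three collapse to U+00C5.
print(scalarsHex(nfc(angstromSign)), scalarsHex(nfc(precomposed)), scalarsHex(nfc(decomposed)))
// prints ["C5"] ["C5"] ["C5"]

// Normalization does not remove ZERO WIDTH SPACE, so "ab" and "a<ZWSP>b"
// stay distinct unless the grammar rejects the invisible character.
let hidden = "a\u{200B}b"
print(scalarsHex(nfc(hidden)))
// prints ["61", "200B", "62"]
```

The last case is why normalization alone is not enough: the invisible character has to be excluded by the identifier grammar.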
Timeline
These matters should be considered in a near timeframe (Swift 4).
Identifier and operator character sets are fundamental parts of Swift
grammar, and changes are inevitably source-breaking.
Non-goals
The aim of this proposal is to rationalize the set of valid operator
characters and the set of valid identifier characters using Unicode
categories and specific Unicode recommendations where available. The
smallest necessary customizations are made to increase backwards
compatibility, but no attempt is made to expand Swift grammar or to
"improve" Unicode. Specifically, the following questions are potential
subjects of separate study, either within the purview of the Swift open
source project or of the Unicode Consortium:
-
Expanding the set of valid operator or identifier characters. For
example, $ is not currently a valid operator in Swift, there are no
current Unicode recommendations regarding operators in programming
languages, and $ is not enumerated among the list of "mathematical"
characters in Unicode. Although it is possible for Swift to customize its
implementation of Unicode recommendations to add $ as a valid operator,
that is an expansion of Swift grammar distinct from the task of
rationalizing Swift symbology according to Unicode standards. Therefore,
this document neither proposes nor opposes its addition. For similar
reasons, this document refines the inclusion of emoji in identifiers based
on Unicode categories, but it neither proposes nor opposes the inclusion of
non-emoji pictographic symbols to the set of valid identifier characters.
-
Rectifying Unicode shortcomings. Although it is possible to discover
shortcomings concerning particular characters in the current version of
Unicode, no attempt is made to preempt the Unicode standardization process
by "patching" such issues in the Swift grammar. For example, in the current
version of Unicode, ⁗ QUADRUPLE PRIME is not deemed to be "mathematical"
(even though ‴ TRIPLE PRIME *is* deemed to be "mathematical").
Certainly, this issue would be appropriate to report to Unicode and may
well be corrected in a future revision of the standard. However, as the
Swift community is not congruent with the community of experts that
specialize in Unicode, there is no rational basis to expect that Swift-only
determinations of what Unicode "should have done" (without vetting through
Unicode's standardization processes) are likely to result in a better
outcome than the existing Unicode standard. Therefore, no attempt is made
to augment the Unicode derived category Math with ⁗ QUADRUPLE PRIME in
this proposal. Similarly, Unicode recommends certain normalization forms
for identifiers in code, which are proposed here for adoption by Swift, but
these normalization forms do not eliminate all possible combinations of
"confusable" characters. This proposal does not attempt to invent an ad-hoc
normalization form in an attempt to "improve" Unicode recommendations.
-
Implementing additional features. Innovative ideas such as mixfix operators
are detailed below in *Future directions*. This proposal does not
attempt to introduce any such features.
in other languages
Haskell distinguishes identifiers/operators by their general category
<http://www.fileformat.info/info/unicode/category/index.htm> (for instance,
"any Unicode lowercase letter" or "any Unicode symbol or punctuation").
Identifiers can start with any lowercase letter or _, and they may contain
any letter, digit, ', or _. This includes letters like δ and Я, and digits
like ٢.
- Haskell Syntax Reference
<https://www.haskell.org/onlinereport/syntax-iso.html>
- Haskell Lexer
(ghc/Lexer.x at 714bebff44076061d0a719c4eda2cfd213b7ac3d, ghc/ghc on GitHub)
Scala similarly allows letters, numbers, $, and _ in identifiers,
distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator
characters include mathematical and other symbols (Sm and So) in addition
to certain ASCII characters.
- Scala Lexical Syntax
ECMAScript 2015 uses ID_Start and ID_Continue, as well as Other_ID_Start
and Other_ID_Continue, for identifiers.
- ECMAScript Specification: Names and Keywords
(ECMAScript 2015 Language Specification, ECMA-262 6th Edition)
Python 3 uses XID_Start and XID_Continue.
- The Python Language Reference: Identifiers and Keywords
(section 2, Lexical analysis)
- PEP 3131: Supporting Non-ASCII Identifiers
<https://www.python.org/dev/peps/pep-3131/>
Proposed solution
Identifiers. Adopt recommendations made in UAX#31 Identifier and Pattern
Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid
identifier characters from ID_Start and ID_Continue. Adopt specific
customizations principally to accommodate emoji. Consider two identifiers
equivalent when they produce the same normalized form under Normalization
Form C (NFC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31
for case-sensitive use cases.
|                            | Is an identifier | Is not an identifier |
|----------------------------|------------------|----------------------|
| Shall be an identifier     | 120,617 code points | 699 emoji |
| Shall not be an identifier | 846,137 unassigned code points; 4,929 other code points | *All other code points* |
Operators. No Unicode recommendation currently exists on the topic of
"operator identifiers," although work is ongoing as part of a future update
to UAX#31. The aim of the proposed definition presented in this document is
to identify, using Unicode categories, a reasonable set of operators that
(a) may be in current use in Swift code; and (b) are likely to be included
in future versions of UAX#31. It is not intended to be a final judgment on
all code points that should ever be valid in Swift operators, for which it
is proposed that Swift await the recommendations of the Unicode Consortium.
Therefore, adopt an approach to define the set of valid operator characters
based primarily on the Unicode categories Math and Pattern_Syntax (an
approach analogous to that which is used to define ID_Start and ID_Continue in
Unicode recommendations), informed by UTR#25 Unicode Support for Mathematics
<http://www.unicode.org/reports/tr25/>. Augment the set of valid operator
characters with a number of currently valid Swift operator characters to
increase backward compatibility. Consider two operators equivalent when
they produce the same normalized form under Normalization Form KC (NFKC)
<http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
case-insensitive use cases. Fullwidth variants such as FULLWIDTH
HYPHEN-MINUS are equivalent to their non-fullwidth counterparts after
normalization under NFKC (but not NFC).
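The difference between the two normalization forms can be seen in a short sketch, assuming Foundation's `precomposedStringWithCanonicalMapping` (NFC) and `precomposedStringWithCompatibilityMapping` (NFKC) as stand-ins for whatever the compiler would actually use. The helper name `scalarValues` is illustrative only.

```swift
import Foundation

let fullwidthMinus = "\u{FF0D}"   // FULLWIDTH HYPHEN-MINUS
let asciiMinus = "\u{002D}"       // HYPHEN-MINUS

/// Illustrative helper: the raw scalar values of a string.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// NFC leaves the fullwidth form untouched...
let underNFC = fullwidthMinus.precomposedStringWithCanonicalMapping
// ...while NFKC folds it into the ASCII hyphen-minus.
let underNFKC = fullwidthMinus.precomposedStringWithCompatibilityMapping

print(scalarValues(underNFC) == scalarValues(fullwidthMinus))  // prints true
print(scalarValues(underNFKC) == scalarValues(asciiMinus))     // prints true
```

Under the proposed rule, an operator spelled with the fullwidth character would therefore name the same operator as its ASCII counterpart.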
|                          | Is an operator | Is not an operator |
|--------------------------|----------------|--------------------|
| Shall be an operator     | 986 code points | \ |
| Shall not be an operator | 130 unassigned code points; 2,024 other code points | *All other code points* |
Dots. Adopt a rule to allow dots to appear in operators at any location,
but only in runs of two or more. (Currently, dots must be leading.)
Detailed design
Identifiers
Swift identifier characters shall conform to UAX#31 as follows:
-
UAX31-C1. The conformance described herein refers to the Unicode 9.0.0
version of UAX#31.
-
UAX31-C2. Swift shall observe the following requirements:
-
UAX31-R1. Swift shall augment the definition of "Default Identifiers"
with the following profiles:
1.
ID_Start and ID_Continue shall be used for Start and Continue,
replacing XID_Start and XID_Continue. This excludes characters in
Other_ID_Start and Other_ID_Continue.
2.
_ 005F LOW LINE shall additionally be allowed as a Start character.
3.
Certain emoji shall additionally be allowed as Start characters. A
detailed design for emoji permitted in identifiers is given below.
4.
UAX31-R1a. The join-control characters ZWJ and ZWNJ are strictly limited
to the special cases A1, A2, and B described in UAX#31.
-
UAX31-R4. Swift shall consider two identifiers equivalent when they
produce the same normalized form under Normalization Form C (NFC)
<http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
case-sensitive use cases.
Grammar changes
identifier-head → [:ID_Start:]
identifier-head → _
identifier-head → identifier-emoji
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
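As a rough illustration of these productions, the following sketch checks a candidate identifier using the `isIDStart` and `isIDContinue` properties that the Swift standard library exposes on `Unicode.Scalar.Properties`. The function names are hypothetical. The sketch deliberately leaves out identifier-emoji and the ZWJ/ZWNJ special cases, and it reflects whatever Unicode version the runtime ships rather than Unicode 9.0.0.

```swift
/// Hypothetical helper approximating identifier-head (without identifier-emoji).
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.isIDStart
}

/// Hypothetical helper approximating identifier-character.
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

/// Returns true when `text` is one head scalar followed by zero or more
/// identifier-character scalars. Keywords and backtick escaping are ignored.
func looksLikeIdentifier(_ text: String) -> Bool {
    var iterator = text.unicodeScalars.makeIterator()
    guard let first = iterator.next(), isIdentifierHead(first) else {
        return false
    }
    while let scalar = iterator.next() {
        if !isIdentifierCharacter(scalar) {
            return false
        }
    }
    return true
}

print(looksLikeIdentifier("_tmp"))          // prints true
print(looksLikeIdentifier("\u{03B4}x"))     // δx, prints true
print(looksLikeIdentifier("9lives"))        // prints false (digit cannot start)
print(looksLikeIdentifier("\u{037E}"))      // GREEK QUESTION MARK, prints false
print(looksLikeIdentifier("a\u{200B}b"))    // ZERO WIDTH SPACE inside, prints false
```

The last two cases correspond to the inconsistencies listed in the Motivation section: neither the Greek question mark nor an embedded zero-width space survives a check based on ID_Start and ID_Continue.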
Operators
Swift operator characters shall be determined as follows:
-
Valid operator characters shall consist of Pattern_Syntax code points
with a derived property Math. However, the following blocks are
excluded: Geometric Shapes, Miscellaneous Symbols, and Miscellaneous
Technical. In UnicodeSet notation:
[:Pattern_Syntax:] & [:Math:]
- [:Block=Geometric Shapes:]
- [:Block=Miscellaneous Symbols:]
- [:Block=Miscellaneous Technical:]
Math captures a fuller set of operators than is possible using Sm, and
we avoid the inclusion of characters in So that are clearly not
"operator-like" (such as Braille). Math code points in the excluded
blocks include sign parts such as ⎲ SUMMATION TOP and tenuously
"operator-like" code points such as BLACK SPADE SUIT.
-
The set of valid operator characters shall be augmented with the
following ASCII characters: !, %, &, *, -, /, ?, \, ^. These ASCII
characters are required by the Swift standard library and/or considered
"weakly mathematical" in UTR#25 <http://www.unicode.org/reports/tr25/>.
-
For increased compatibility with Swift 3, the set of valid operator
characters shall be augmented with the following Latin-1 Supplement
characters: ¡, ¦, §, °, ¶, ¿. For the same reason, augment the set of
valid operator characters with the following General Punctuation
characters: † DAGGER, ‡ DOUBLE DAGGER, • BULLET, ‰ PER MILLE SIGN, ‱ PER
TEN THOUSAND SIGN, ※ REFERENCE MARK, ‽ INTERROBANG, ⁂ ASTERISM, ⁅ LEFT
SQUARE BRACKET WITH QUILL, ⁆ RIGHT SQUARE BRACKET WITH QUILL, ⁊ TIRONIAN
SIGN ET, ⁋ REVERSED PILCROW SIGN, ⁌ BLACK LEFTWARDS BULLET, ⁍ BLACK
RIGHTWARDS BULLET, ⁎ LOW ASTERISK, ⁑ TWO ASTERISKS ALIGNED VERTICALLY.
-
Swift shall consider two operators equivalent when they produce the same
normalized form under Normalization Form KC (NFKC)
<http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
*case-insensitive* use cases. Crucially, fullwidth variants such as
FULLWIDTH HYPHEN-MINUS are equivalent to their non-fullwidth counterparts
after normalization under NFKC (but not NFC).
-
Certain strongly mathematical arrows now have an *alternative* emoji
presentation, and future versions of Unicode may add such an emoji
presentation to any Swift operator character. Some but not all
"environments" or applications (for instance, Safari but not TextWrangler)
display the alternative emoji presentation at all times, and such
discrepancies between applications are explicitly permitted by Unicode
recommendations (see discussion in *Emoji*). However, it would be highly
unusual to define the set of valid operator characters based on an
essentially arbitrary criterion as to whether an alternative emoji
presentation is retroactively assigned to a code point, and codifying how
IDEs display Unicode characters in Swift files is outside the scope of this
proposal. Therefore, valid operator characters are defined without regard
to the presence or absence of an alternative emoji presentation, and U+FE0E
VARIATION SELECTOR-15 (text presentation selector) is *optionally* permitted
to follow an operator character that has an alternative emoji presentation.
Note that variation selectors are discarded by normalization.
These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3APattern_Syntax%3A]+%26+[%3AMath%3A]
-+[%3ABlock%3DGeometric+Shapes%3A]
-+[%3ABlock%3DMiscellaneous+Symbols%3A]
-+[%3ABlock%3DMiscellaneous+Technical%3A]
[!+%25+\%26+*+\-+%2F+%3F+\\+\^+¡+¦+§+°+¶+¿+†+‡+•+‰+‱+※+‽+⁂+⁅+⁆+⁊+⁋+⁌+⁍+⁎+⁑]&g=&i=>
produce
a set of 987 code points for operator characters. Since ID_Start is derived
in part by exclusion of Pattern_Syntax code points, it is assured that
operator and identifier characters do not overlap (although this assurance
does not extend to emoji, which require additional design as detailed
below).
All current restrictions on reserved tokens and operators remain. Swift
reserves =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and
postfix !.
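The operator character rules above can be approximated in a few lines. This is only a sketch under stated assumptions: it uses the standard library's `isPatternSyntax` and `isMath` scalar properties, it hard-codes the three excluded blocks as code point ranges because the standard library does not expose block names, and it reflects the runtime's Unicode version rather than Unicode 9.0.0, so it will not reproduce the count of 987 exactly. The names `excludedOperatorBlocks`, `proposedAdditions`, and `isProposedOperatorScalar` are hypothetical.

```swift
// Blocks excluded by the proposal, expressed as code point ranges.
let excludedOperatorBlocks: [ClosedRange<UInt32>] = [
    0x2300...0x23FF,  // Miscellaneous Technical
    0x25A0...0x25FF,  // Geometric Shapes
    0x2600...0x26FF,  // Miscellaneous Symbols
]

// ASCII, Latin-1 Supplement, and General Punctuation characters that the
// proposal adds explicitly.
let proposedAdditions = Set("!%&*-/?\\^¡¦§°¶¿†‡•‰‱※‽⁂⁅⁆⁊⁋⁌⁍⁎⁑".unicodeScalars)

/// Hypothetical helper: does `scalar` fall in the proposed operator set?
/// Reserved tokens, dots, and variation selectors are not handled here.
func isProposedOperatorScalar(_ scalar: Unicode.Scalar) -> Bool {
    if proposedAdditions.contains(scalar) {
        return true
    }
    let properties = scalar.properties
    guard properties.isPatternSyntax && properties.isMath else {
        return false
    }
    let inExcludedBlock = excludedOperatorBlocks.contains(where: {
        $0.contains(scalar.value)
    })
    return !inExcludedBlock
}

print(isProposedOperatorScalar("+"))         // prints true
print(isProposedOperatorScalar("\u{2211}"))  // N-ARY SUMMATION, prints true
print(isProposedOperatorScalar("\\"))        // explicit addition, prints true
print(isProposedOperatorScalar("\u{2660}"))  // BLACK SPADE SUIT, prints false
print(isProposedOperatorScalar("\u{281F}"))  // Braille pattern, prints false
print(isProposedOperatorScalar("$"))         // prints false
```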
Dots
Swift's existing rule for dots in operators is:
If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.
This proposal modifies the rule to:
Dots may only appear in operators in sequences of two or more.
Incorporating the "two-dot rule" offers the following benefits:
-
It avoids lexical complications arising from a lone dot (.).
-
The approach is conservative, erring on the side of overly restrictive.
Dropping the rule in future (and thereby allowing single dots) may be
possible.
-
It does not require special cases for existing infix dot operators in
the standard library, ... (closed range) and ..< (half-open range). It
leaves open the possibility of adding analogous half-open and fully-open
range operators <.. and <..<.
Finally, this proposal *reserves* the .. operator for a possible "method
cascade" syntax in the future as supported by Dart
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.
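The two-dot rule is simple enough to state as a check over a candidate operator's scalars. The function below is a hypothetical sketch of that check alone; it does not test whether the other characters are valid operator characters, and it does not know that .. itself is reserved.

```swift
/// Returns true when every "." in `candidate` belongs to a run of two or
/// more consecutive dots, as the proposed rule requires.
func dotsAppearOnlyInRuns(_ candidate: String) -> Bool {
    var runLength = 0
    for scalar in candidate.unicodeScalars {
        if scalar == "." {
            runLength += 1
        } else {
            // A run that just ended with exactly one dot breaks the rule.
            if runLength == 1 {
                return false
            }
            runLength = 0
        }
    }
    // Also reject a single trailing dot.
    return runLength != 1
}

print(dotsAppearOnlyInRuns("..."))   // prints true
print(dotsAppearOnlyInRuns("..<"))   // prints true
print(dotsAppearOnlyInRuns("<..<"))  // prints true
print(dotsAppearOnlyInRuns(".+"))    // prints false
print(dotsAppearOnlyInRuns("*."))    // prints false
```

Note that `<..` and `<..<` pass even though they do not begin with a dot, which is the relaxation relative to the existing rule.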
Grammar changes
operator → operator-head operator-characters[opt]
operator-head → [[:Pattern_Syntax:] & [:Math:] - [:Emoji:] -
[:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] -
[:Block=Miscellaneous Technical:]]
operator-head → [[:Pattern_Syntax:] & [:Math:] & [:Emoji:] -
[:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] -
[:Block=Miscellaneous Technical:]] U+FE0E[opt]
operator-head → ! | % | & | * | - | / | ? | \ | ^ | ¡ | ¦ | § | ° | ¶ | ¿
operator-head → † | ‡ | • | ‰ | ‱ | ※ | ‽ | ⁂ | ⁅ | ⁆ | ⁊ | ⁋ | ⁌ | ⁍ | ⁎ | ⁑
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]
operator-dot → .
operator-dots → operator-dot operator-dots[opt]
Emoji
The inclusion of emoji among valid identifier characters, though highly
desired, presents significant challenges:
-
Emoji characters are not displayed uniformly across different platforms.
-
Whether any particular character is presented as emoji or text depends
on a matrix of considerations, including "environment" (e.g., Safari vs.
Xcode), presence or absence of a variation selector, and whether the
character itself defaults to "emoji presentation" or "text presentation."
This behavior is specifically documented in Unicode recommendations
<http://unicode.org/reports/tr51/#Presentation_Style>.
-
Some emoji not classified as Math depict operators: . A Unicode
chart <http://unicode.org/emoji/charts/emoji-ordering.html> provides
additional information by dividing emoji according to "rough categories,"
but it warns that these categories "may change at any time, and should not
be used in production."
-
Full emoji support would require allowing identifiers to contain
zero-width joiner sequences that UAX#31 would forbid. Some normalization
scheme would have to be devised to account for Unicode recommendations that
👩❤️👨 (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F468) can be displayed
as either (U+1F491) or, as a fallback, (U+1F469 U+2764 U+FE0F
U+1F468).
For maximum consistency across platforms, valid emoji in Swift identifiers
shall be determined using the following rules:
-
Emoji shall include code points with default emoji presentation (as
opposed to text presentation), minus Emoji_Defectives and ID_Continue.
Exclude Pattern_Syntax code points unless they are in the following
blocks: Miscellaneous Symbols, Miscellaneous Technical.
-
Emoji shall include Emoji code points with default text presentation *when
immediately followed by U+FE0F VARIATION SELECTOR-16 (emoji presentation
selector)*, minus Emoji_Defectives and ID_Continue. Again, exclude
Pattern_Syntax code points unless they are in the following blocks:
Miscellaneous Symbols, Miscellaneous Technical. (Note that the emoji picker
on Apple platforms--and, possibly, other platforms--automatically inserts
U+FE0F VARIATION SELECTOR-16 when a user selects such code points; for
instance, selecting ❤ inserts U+2764 U+FE0F. Therefore, it is important
that the invisible U+FE0F be permitted strictly in this use case. Note also
that variation selectors are discarded by normalization.)
-
Emoji shall include Emoji_Flag_Sequences, Emoji_Keycap_Sequences, and
(to the extent not already included) Emoji_Modifier_Sequences.
These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]]
[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]]
[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]]
[[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]]
[[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]]
[[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]]
[%3AEmoji_Flag_Sequences%3A]
[%3AEmoji_Keycap_Sequences%3A]
[%3AEmoji_Modifier_Sequences%3A]&g=&i=>
produce
a set of 1,625 code points or sequences, of which 98 are currently
categorized as operator characters.
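The first two emoji rules can be sketched for single scalars using the standard library's `isEmoji`, `isEmojiPresentation`, `isIDContinue`, and `isPatternSyntax` properties. This is an approximation under several assumptions: Emoji_Defectives is not available as a standard library property and is simply omitted, flag, keycap, and modifier sequences are not handled, the two retained blocks are hard-coded as ranges, and results follow the runtime's Unicode version. All names are hypothetical.

```swift
/// How a single scalar would be treated by this approximation.
enum EmojiIdentifierRule {
    case allowedBare        // default emoji presentation
    case allowedWithVS16    // default text presentation; needs U+FE0F after it
    case notAllowed
}

// Pattern_Syntax code points are kept only in these two blocks.
let retainedSyntaxBlocks: [ClosedRange<UInt32>] = [
    0x2300...0x23FF,  // Miscellaneous Technical
    0x2600...0x26FF,  // Miscellaneous Symbols
]

/// Hypothetical helper; does not subtract Emoji_Defectives.
func emojiIdentifierRule(for scalar: Unicode.Scalar) -> EmojiIdentifierRule {
    let properties = scalar.properties
    guard properties.isEmoji, !properties.isIDContinue else {
        return .notAllowed
    }
    if properties.isPatternSyntax {
        let retained = retainedSyntaxBlocks.contains(where: {
            $0.contains(scalar.value)
        })
        if !retained {
            return .notAllowed
        }
    }
    return properties.isEmojiPresentation ? .allowedBare : .allowedWithVS16
}

print(emojiIdentifierRule(for: "\u{1F600}"))  // GRINNING FACE
print(emojiIdentifierRule(for: "\u{2600}"))   // BLACK SUN WITH RAYS
print(emojiIdentifierRule(for: "\u{2195}"))   // UP DOWN ARROW
print(emojiIdentifierRule(for: "7"))          // digit, already ID_Continue
```

In this sketch the arrow is rejected because it is a Pattern_Syntax code point outside the two retained blocks, which keeps it available as an operator character.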
Grammar changes
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] - [:Pattern_Syntax:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous
Symbols:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous
Technical:]]
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] - [:Pattern_Syntax:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] &
[:Block=Miscellaneous Symbols:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] &
[:Block=Miscellaneous Technical:]] U+FE0F
identifier-emoji → [[:Emoji_Flag_Sequences:]
[:Emoji_Keycap_Sequences:] [:Emoji_Modifier_Sequences:]]
Source compatibility
This change is source-breaking where developers have incorporated certain
emoji in identifiers or certain non-ASCII characters in operators. This is
unlikely to be a significant breakage for the majority of Swift code.
Diagnostics for invalid characters are already produced today. We can
improve them easily if needed.
Maintaining source compatibility for Swift 3 should be easy: keep the old
parsing and identifier lookup code.
Effect on ABI stability
This proposal does not affect the ABI format itself. Normalization of
Unicode identifiers would affect the ABI of compiled modules. The standard
library will not be affected; it uses ASCII symbols with no combining
characters.
Effect on API resilience
This proposal doesn't affect API resilience.
Alternatives considered
-
Use NFKC instead of NFC for identifiers. The decision to use NFC is
based on UAX#31, which states:
Generally if the programming language has case-sensitive identifiers,
then Normalization Form C is appropriate; whereas, if the programming
language has case-insensitive identifiers, then Normalization Form KC is
more appropriate.
-
Eliminate emoji from identifiers and restrict operator characters to a
limited number of ASCII code points. This approach would be simpler, but
feedback on Swift-Evolution has been overwhelmingly against such a change.
-
Hand-pick a set of "operator-like" characters to include. The proposal
authors tried this painstaking approach and came up with a relatively
agreeable set of about 650 code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%25\%26*%2B\-%2F<%3D>%3F\^|~
\u00AC
\u00B1
\u00B7
\u00D7
\u00F7
\u2208-\u220D
\u220F-\u2211
\u22C0-\u22C3
\u2212-\u221D
\u2238
\u223A
\u2240
\u228C-\u228E
\u2293-\u22A3
\u22BA-\u22BD
\u22C4-\u22C7
\u22C9-\u22CC
\u22D2-\u22D3
\u2223-\u222A
\u2236-\u2237
\u2239
\u223B-\u223E
\u2241-\u228B
\u228F-\u2292
\u22A6-\u22B9
\u22C8
\u22CD
\u22D0-\u22D1
\u22D4-\u22FF
\u22CE-\u22CF
\u2A00-\u2AFF
\u27C2
\u27C3
\u27C4
\u27C7
\u27C8
\u27C9
\u27CA
\u27CE-\u27D7
\u27DA-\u27DF
\u27E0-\u27E5
\u29B5-\u29C3
\u29C4-\u29C9
\u29CA-\u29D0
\u29D1-\u29D7
\u29DF
\u29E1
\u29E2
\u29E3-\u29E6
\u29FA
\u29FB
\u2308-\u230B
\u2336-\u237A
\u2395]>.
Such a list can carefully avoid idiosyncrasies in the Unicode standard.
However, a character-by-character inventory is unlikely to converge on
consensus, as likely to introduce unintended Swift-specific idiosyncrasies
as it is to avoid Unicode shortcomings, and inconsistent with the Unicode
method of deriving such lists using categories.
-
Continue to allow single . in operators, perhaps even expanding the
original rule to allow them anywhere (even if the operator does not begin
with .).
This would allow a wider variety of custom operators (for some
interesting possibilities, see the operators in Haskell's Lens
<https://github.com/ekmett/lens/wiki/Operators> package). However, there
are a handful of potential complications:
-
Combining prefix or postfix operators with member access: foo*.bar would
need to be parsed as foo *. bar rather than (foo*).bar. Parentheses
could be required to disambiguate.
-
Combining infix operators with contextual members: foo*.bar would
need to be parsed as foo *. bar rather than foo * (.bar). Whitespace
or parentheses could be required to disambiguate.
-
Hypothetically, if operators were accessible as members such as
MyNumber.+, allowing operators with single .s would require escaping
operator names (perhaps with backticks, such as MyNumber.`+`).
This would also require operators of the form [!?]*\. (for example . ?.
!. !!.) to be reserved, to prevent users from defining custom operators
that conflict with member access and optional chaining.
We believe that requiring dots to appear in groups of at least two,
while in some ways more restrictive, will prevent a significant amount of
future pain, and does not require special-case considerations such as the
above.
Future directions
While not within the scope of this proposal, the following considerations
may provide useful context for the proposed changes. We encourage the
community to pick up these topics when the time is right.
-
Introduce a syntax for method cascades. The Dart language supports method
cascades
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>,
whereby multiple methods can be called on an object within one expression:
foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax
can also be used with assignments and subscripts. Such a feature might be
very useful in Swift; this proposal reserves the .. operator so that it
may be added in the future.
-
Introduce "mixfix" operator declarations. Mixfix operators are based on
pattern matching and would allow more than two operands. For example, the
ternary operator ? : can be defined as a mixfix operator with three
"holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations
such as _ [ _ ]. Some holes could be made @autoclosure, and there might
even be holes whose argument is represented as an AST, rather than a value
or thunk, supporting advanced metaprogramming (for instance, F#'s code
quotations
<https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).
Should mixfix operators become supported, it would be sensible to add
brackets to the set of valid operator characters.
-
Diminish or remove the lexical distinction between operators and
identifiers. If precedence and fixity applied to traditional identifiers
as well as operators, it would be possible to incorporate ASCII equivalents
for standard operators (e.g. and for &&, to allow A and B). If
additionally combined with mixfix operator support, this might enable
powerful DSLs (for instance, C#'s LINQ
<https://en.wikipedia.org/wiki/Language_Integrated_Query>).