[Draft] Refining identifier and operator symbology (take 2)

As Stage 2 of Swift 4 evolution starts now, I'd like to share a revised
proposal in draft form.

It proposes a source-breaking change for *rationalizing* which characters
are permitted in identifiers and which in operators. It's justified for
this phase of Swift 4 because:

- Existing grammar, in permitting invisible characters without
security-minded restrictions, can be *actively harmful.*
- A rationalized approach is *superior* to the current approach: by
referencing Unicode standards, Swift should be able to evolve in a
backwards-compatible way alongside Unicode, and will benefit from the
significant expertise of others outside the Swift community with respect to
Unicode best practices.
- The vast majority of existing code (including all of the standard
library) should *require no migration* work at all.

*What's changed* since the last time:

- In an earlier draft, we proposed some radical changes to align with
available Unicode standards; in particular, since emoji represent a
difficult issue, and no recommendations about "operator identifiers" have
surfaced from Unicode, we proposed temporarily stripping them out.
This was *very poorly received*. This revision uses Unicode categories to
identify nearly all emoji and classify them as identifier characters (while
excluding those
that depict operators such as !), and it uses Unicode categories to
identify over 900 operators that nearly all pass the subjective test of
"operator-likeness."

What this proposal *does not attempt* to do:

- This document *does not* seek to stake out new ground as to what
characters should be *added* to the set of valid identifiers and operators.
Such additions to the grammar are properly separate discussions. This
proposal is only an attempt at systemization and rationalization. Only one
character is incidentally added to the list of valid characters (`\`), and
it is on the basis of an explicit table in Unicode Technical Report 25
regarding ASCII characters that are "mathematical."

What feedback would be *most helpful*:

- "Hey, this approach is so much more *clumsy* than my superior, more
elegant category-based approach to identifying [operators/emoji], which is
[insert here]."
- "Hey, I disagree with the detailed design because it's got a *major
security hole*, which is [insert here]."
- "Hey, your proposal would break my *real-world* Swift code, which
requires that character [X] be an [identifier/operator]."

What would be *less helpful*:

- "Hey, let's talk about how [specific character] should be an
[identifier/operator]. We should add that character to the list of
[identifiers/operators]. In fact, let's discuss [list] characters one by
one."

Acknowledgments:
Thanks to co-authors of the previous take for their support for
resurrecting this issue. Any brilliant ideas are undoubtedly theirs, and
any botched efforts are certainly mine. Thanks also to Nevin
Brackett-Rozinsky for helpful feedback.

Link:
https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651

Rendered text:

Refining identifier and operator symbology (take 2)

   - Proposal: SE-NNNN
   <https://gist.github.com/xwu/NNNN-refining-identifier-and-operator-symbology.md>
   - Authors: Xiaodi Wu <https://github.com/xwu>, Jacob Bandes-Storch
   <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>,
   Jonathan Shapiro, João Pinheiro <https://github.com/joaopinheiro>
   - Review Manager: TBD
   - Status: Awaiting review

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#introduction>
Introduction

This proposal refines and rationalizes Swift's identifier and operator
symbology. Specifically, this proposal:

   - refines the set of valid identifier characters based on Unicode
   recommendations, with customizations principally to accommodate emoji;
   - refines the set of valid operator characters based on Unicode
   categories; and
   - changes rules as to where dots may appear in operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#prior-discussion-threads-and-proposals>
Prior discussion threads and proposals

   - Define backslash '\' as a operator-head in the swift grammar
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170130/031461.html>
   - Refining Identifier and Operator Symbology
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20161017/028174.html>
   (a precursor to this document)
   - Proposal: Normalize Unicode identifiers
   <https://github.com/apple/swift-evolution/pull/531>
   - Lexical matters: identifiers and operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
   - Unicode identifiers & operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>,
   with pre-proposal
   <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59>
   - Proposal: Allow Single Dollar Sign as Valid Identifier
   <https://github.com/apple/swift-evolution/pull/354>
   - Free the '$' Symbol!
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
   - Request to add middle dot (U+00B7) as operator character?
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#motivation>
Motivation

Swift supports programmers from many languages and cultures. However, the
current identifier and operator character sets do not conform to any
Unicode standards, nor have they been rationalized in the language or
compiler documentation. These deserve a well-considered, standards-based
revision.

As Chris Lattner has written:

We need a token to be unambiguously an operator or identifier - we can have
different rules for the leading and subsequent characters though.

…our current operator space (particularly the Unicode segments covered) is
not super well considered. It would be great for someone to take a more
systematic pass over them to rationalize things.

Identifiers, which serve as *names* for various entities, are linguistic in
nature and must permit a variety of characters in order to properly serve
non–English-speaking coders. This issue has been considered by the
communities of many programming languages already, and the Unicode
Consortium has published recommendations on how to choose identifier
character sets. Swift should make an effort to conform to these
recommendations.

Operators, on the other hand, should be rare and carefully chosen because
they suffer from limited discoverability and readability. They are by
nature *symbols*, not names. This places a cognitive cost on users with
respect to recall ("What is the operator that applies the behavior I
need?") and recognition ("What does the operator in this code do?"). While
almost every non-trivial program defines new identifiers, most programs do
not define new operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#inconsistency>
Inconsistency

Concrete discrepancies and edge cases motivate these proposed changes. For
example:

   - The Greek question mark ; is a valid identifier.
   - Some *non-combining* diacritics ´ ¨ ꓻ are valid in identifiers.
   - Braille patterns ⠟, which are letter-like, are operator characters.
   - Other symbols such as ⚄ and ♄ are operator characters despite not
   being "operator-like."
   - Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers
   (₪ € ₱ ₹ ฿ ...).
   - 🙂🤘▶️🛩 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
   - A few characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers
   and operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#invisible-distinctions>
Invisible distinctions

Identifiers that take advantage of Swift's Unicode support are not
normalized. This allows different representations of the same characters to
be considered distinct identifiers. For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"

Non-printing characters such as ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER
are also accepted as valid identifier characters without any restrictions.

let ab = "ab"
let a​b = "a + ZERO WIDTH SPACE + b"

func xy() { print("xy") }
func x‌y() { print("x + ZERO WIDTH NON-JOINER + y") }
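
The distinction drawn above can be made visible at the code-point level. The following is a minimal sketch, not part of the proposed design: it spells out the three representations with explicit escapes (the helper name `scalarValues` is hypothetical) and shows that their raw scalar sequences differ, even though Swift's own `String` comparison already treats them as canonically equivalent. A ZERO WIDTH SPACE, by contrast, produces a string that really is different.

```swift
// Three spellings that render alike but are distinct Unicode scalar sequences.
let spellings = [
    "\u{212B}",   // ANGSTROM SIGN
    "\u{00C5}",   // LATIN CAPITAL LETTER A WITH RING ABOVE
    "A\u{030A}",  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE
]

// Hypothetical helper: the raw code points that make up a string.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// The raw scalar sequences all differ, which is what an un-normalized
// lexer sees when these appear as identifiers...
let distinctSequences = Set(spellings.map(scalarValues))
print(distinctSequences.count)

// ...although String comparison considers them canonically equivalent.
let canonicallyEqual = spellings.allSatisfy { $0 == spellings[0] }
print(canonicallyEqual)

// An invisible ZERO WIDTH SPACE yields a genuinely different string.
let zeroWidthDiffers = "ab" != "a\u{200B}b"
print(zeroWidthDiffers)
```

The gap between the first two results is the motivation for normalizing identifiers: source text that compares equal as a string can still name different entities.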

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#timeline>
Timeline

These matters should be considered in a near timeframe (Swift 4).
Identifier and operator character sets are fundamental parts of Swift
grammar, and changes are inevitably source-breaking.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#non-goals>
Non-goals

The aim of this proposal is to rationalize the set of valid operator
characters and the set of valid identifier characters using Unicode
categories and specific Unicode recommendations where available. The
smallest necessary customizations are made to increase backwards
compatibility, but no attempt is made to expand Swift grammar or to
"improve" Unicode. Specifically, the following questions are potential
subjects of separate study, either within the purview of the Swift open
source project or of the Unicode Consortium:

-

   Expanding the set of valid operator or identifier characters. For
   example, $ is not currently a valid operator in Swift, there are no
   current Unicode recommendations regarding operators in programming
   languages, and $ is not enumerated among the list of "mathematical"
   characters in Unicode. Although it is possible for Swift to customize its
   implementation of Unicode recommendations to add $ as a valid operator,
   that is an expansion of Swift grammar distinct from the task of
   rationalizing Swift symbology according to Unicode standards. Therefore,
   this document neither proposes nor opposes its addition. For similar
   reasons, this document refines the inclusion of emoji in identifiers based
   on Unicode categories, but it neither proposes nor opposes the inclusion of
   non-emoji pictographic symbols in the set of valid identifier characters.
   -

   Rectifying Unicode shortcomings. Although it is possible to discover
   shortcomings concerning particular characters in the current version of
   Unicode, no attempt is made to preempt the Unicode standardization process
   by "patching" such issues in the Swift grammar. For example, in the current
   version of Unicode, ⁗ QUADRUPLE PRIME is not deemed to be "mathematical"
   (even though ‴ TRIPLE PRIME *is* deemed to be "mathematical").
   Certainly, this issue would be appropriate to report to Unicode and may
   well be corrected in a future revision of the standard. However, as the
   Swift community is not congruent with the community of experts that
   specialize in Unicode, there is no rational basis to expect that Swift-only
   determinations of what Unicode "should have done" (without vetting through
   Unicode's standardization processes) are likely to result in a better
   outcome than the existing Unicode standard. Therefore, no attempt is made
   to augment the Unicode derived category Math with ⁗ QUADRUPLE PRIME in
   this proposal. Similarly, Unicode recommends certain normalization forms
   for identifiers in code, which are proposed here for adoption by Swift, but
   these normalization forms do not eliminate all possible combinations of
   "confusable" characters. This proposal does not attempt to invent an ad-hoc
   normalization form in an attempt to "improve" Unicode recommendations.
   -

   Implementing additional features. Innovative ideas such as mixfix operators
   are detailed below in *Future directions*. This proposal does not
   attempt to introduce any such features.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#precedent-in-other-languages>
Precedent in other languages

Haskell distinguishes identifiers/operators by their general category
<http://www.fileformat.info/info/unicode/category/index.htm> (for instance,
"any Unicode lowercase letter" or "any Unicode symbol or punctuation").
Identifiers can start with any lowercase letter or _, and they may contain
any letter, digit, ', or _. This includes letters like δ and Я, and digits
like ٢.

   - Haskell Syntax Reference
   <https://www.haskell.org/onlinereport/syntax-iso.html>
   - Haskell Lexer
   <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

Scala similarly allows letters, numbers, $, and _ in identifiers,
distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator
characters include mathematical and other symbols (Sm and So) in addition
to certain ASCII characters.

   - Scala Lexical Syntax
   <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

ECMAScript 2015 uses ID_Start and ID_Continue, as well as Other_ID_Start
and Other_ID_Continue, for identifiers.

   - ECMAScript Specification: Names and Keywords
   <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Python 3 uses XID_Start and XID_Continue.

   - The Python Language Reference: Identifiers and Keywords
   <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
   - PEP 3131: Supporting Non-ASCII Identifiers
   <https://www.python.org/dev/peps/pep-3131/>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#proposed-solution>
Proposed solution

Identifiers. Adopt recommendations made in UAX#31 Identifier and Pattern
Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid
identifier characters from ID_Start and ID_Continue. Adopt specific
customizations principally to accommodate emoji. Consider two identifiers
equivalent when they produce the same normalized form under Normalization
Form C (NFC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31
for case-sensitive use cases.
Is an identifier | Is not an identifier
Shall be an identifier 120,617 code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] %26+[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=>
699 emoji
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]] -[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]&g=&i=>
Shall not be an identifier 846,137 unassigned code points;
4,929 other code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] -[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=>
*All other code points*

Operators. No Unicode recommendation currently exists on the topic of
"operator identifiers," although work is ongoing as part of a future update
to UAX#31. The aim of the proposed definition presented in this document is
to identify, using Unicode categories, a reasonable set of operators that
(a) may be in current use in Swift code; and (b) are likely to be included
in future versions of UAX#31. It is not intended to be a final judgment on
all code points that should ever be valid in Swift operators, for which it
is proposed that Swift await the recommendations of the Unicode Consortium.

Therefore, adopt an approach to define the set of valid operator characters
based primarily on the Unicode categories Math and Pattern_Syntax (an
approach analogous to that which is used to define ID_Start and ID_Continue in
Unicode recommendations), informed by UAX#25 Unicode Support for Mathematics
<http://www.unicode.org/reports/tr25/>. Augment the set of valid operator
characters with a number of currently valid Swift operator characters to
increase backward compatibility. Consider two operators equivalent when
they produce the same normalized form under Normalization Form KC (NFKC)
<http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
case-insensitive use cases. Fullwidth variants such as FULLWIDTH
HYPHEN-MINUS are equivalent to their non-fullwidth counterparts after
normalization under NFKC (but not NFC).
Is an operator | Is not an operator
Shall be an operator 986 code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]%26[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]]>
\
Shall not be an operator 130 unassigned code points;
2,024 other code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]-[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]]>
*All other code points*
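
The difference between NFC and NFKC noted above for fullwidth variants can be checked directly. This is a small sketch that assumes Foundation is available; `precomposedStringWithCanonicalMapping` and `precomposedStringWithCompatibilityMapping` are Foundation's names for NFC and NFKC respectively, and the helper `codePoints(of:)` is hypothetical.

```swift
import Foundation

let fullwidthHyphenMinus = "\u{FF0D}"  // FULLWIDTH HYPHEN-MINUS
let hyphenMinus = "\u{002D}"           // HYPHEN-MINUS

// Hypothetical helper: compare strings by raw code points so that
// String's own canonical-equivalence comparison does not hide differences.
func codePoints(of text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

let nfcForm = codePoints(of: fullwidthHyphenMinus.precomposedStringWithCanonicalMapping)
let nfkcForm = codePoints(of: fullwidthHyphenMinus.precomposedStringWithCompatibilityMapping)

// NFC leaves the fullwidth form alone; NFKC folds it into the ASCII form.
let equalUnderNFC = nfcForm == codePoints(of: hyphenMinus)
let equalUnderNFKC = nfkcForm == codePoints(of: hyphenMinus)
print(equalUnderNFC)   // false
print(equalUnderNFKC)  // true
```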

Dots. Adopt a rule to allow dots to appear in operators at any location,
but only in runs of two or more. (Currently, dots must be leading.)

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#detailed-design>
Detailed design

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#identifiers>
Identifiers

Swift identifier characters shall conform to UAX#31
<http://unicode.org/reports/tr31/#Conformance> as follows:

   -

   UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance
   described herein refers to the Unicode 9.0.0 version of UAX#31.
   -

   UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the
   following requirements:
   -

      UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment
      the definition of "Default Identifiers" with the following profiles:
      1.

         ID_Start and ID_Continue shall be used for Start and Continue,
         replacing XID_Start and XID_Continue. This excludes characters in
         Other_ID_Start and Other_ID_Continue.
         2.

         _ 005F LOW LINE shall additionally be allowed as a Start character.
         3.

         Certain emoji shall additionally be allowed as Start characters. A
         detailed design for emoji permitted in identifiers is given below.
         4.

         UAX31-R1a. <http://unicode.org/reports/tr31/#R1a> The join-control
         characters ZWJ and ZWNJ are strictly limited to the special cases
         A1, A2, and B described in UAX#31.
         -

      UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider
      two identifiers equivalent when they produce the same normalized form
      under Normalization Form C (NFC) <http://unicode.org/reports/tr15/>, as
      recommended in
      UAX#31 for case-sensitive use cases.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes>
Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _
identifier-head → identifier-emoji
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
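
The non-emoji part of this grammar can be approximated with the standard library's `Unicode.Scalar.Properties` (Swift 5 and later). The sketch below is illustrative only: the function names are hypothetical, the `identifier-emoji` production and the ZWJ/ZWNJ special cases are deliberately left out, no NFC normalization is applied, and the results depend on the Unicode version the runtime ships with rather than on Unicode 9.0.0 specifically.

```swift
// identifier-head → [:ID_Start:] | _
// (identifier-emoji is omitted from this sketch.)
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.isIDStart
}

// identifier-character → identifier-head | [:ID_Continue:]
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// A token is a plausible identifier when its first scalar is a head
// and every following scalar is an identifier character.
func isPlausibleIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()
    guard let first = scalars.next(), isIdentifierHead(first) else {
        return false
    }
    while let next = scalars.next() {
        if !isIdentifierCharacter(next) { return false }
    }
    return true
}

print(isPlausibleIdentifier("δx"))          // Greek letter as head
print(isPlausibleIdentifier("1abc"))        // digit cannot start an identifier
print(isPlausibleIdentifier("a\u{200B}b"))  // ZERO WIDTH SPACE inside
print(isPlausibleIdentifier("\u{037E}"))    // GREEK QUESTION MARK
```

Under these rules the two problem cases from the motivation, the Greek question mark and an embedded ZERO WIDTH SPACE, are both rejected because neither code point is in ID_Start or ID_Continue.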

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#operators>
Operators

Swift operator characters shall be determined as follows:

   -

   Valid operator characters shall consist of Pattern_Syntax code points
   with a derived property Math. However, the following blocks are
   excluded: Geometric Shapes, Miscellaneous Symbols, and Miscellaneous
   Technical. In UnicodeSet notation:

   [:Pattern_Syntax:] & [:Math:]
   - [:Block=Geometric Shapes:]
   - [:Block=Miscellaneous Symbols:]
   - [:Block=Miscellaneous Technical:]

   Math captures a fuller set of operators than is possible using Sm, and
   we avoid the inclusion of characters in So that are clearly not
   "operator-like" (such as Braille). Math code points in the excluded
   blocks include sign parts such as ⎲ SUMMATION TOP and tenuously
   "operator-like" code points such as ♠️ BLACK SPADE SUIT.
   -

   The set of valid operator characters shall be augmented with the
   following ASCII characters: !, %, &, *, -, /, ?, \, ^. These ASCII
   characters are required by the Swift standard library and/or considered
   "weakly mathematical" in UAX#25 <http://www.unicode.org/reports/tr25/>.
   -

   For increased compatibility with Swift 3, the set of valid operator
   characters shall be augmented with the following Latin-1 Supplement
   characters: ¡, ¦, §, °, ¶, ¿. For the same reason, augment the set of
   valid operator characters with the following General Punctuation
   characters: † DAGGER, ‡ DOUBLE DAGGER, • BULLET, ‰ PER MILLE SIGN, ‱ PER
   TEN THOUSAND SIGN, ※ REFERENCE MARK, ‽ INTERROBANG, ⁂ ASTERISM, ⁅ LEFT
   SQUARE BRACKET WITH QUILL, ⁆ RIGHT SQUARE BRACKET WITH QUILL, ⁊ TIRONIAN
   SIGN ET, ⁋ REVERSED PILCROW SIGN, ⁌ BLACK LEFTWARDS BULLET, ⁍ BLACK
   RIGHTWARDS BULLET, ⁎ LOW ASTERISK, ⁑ TWO ASTERISKS ALIGNED VERTICALLY.
   -

   Swift shall consider two operators equivalent when they produce the same
   normalized form under Normalization Form KC (NFKC)
   <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
   *case-insensitive* use cases. Crucially, fullwidth variants such as
   FULLWIDTH HYPHEN-MINUS are equivalent to their non-fullwidth counterparts
   after normalization under NFKC (but not NFC).
   -

   Certain strongly mathematical arrows now have an *alternative* emoji
   presentation, and future versions of Unicode may add such an emoji
   presentation to any Swift operator character. Some but not all
   "environments" or applications (for instance, Safari but not TextWrangler)
   display the alternative emoji presentation at all times, and such
   discrepancies between applications are explicitly permitted by Unicode
   recommendations (see discussion in *Emoji*). However, it would be highly
   unusual to define the set of valid operator characters based on an
   essentially arbitrary criterion as to whether an alternative emoji
   presentation is retroactively assigned to a code point, and codifying how
   IDEs display Unicode characters in Swift files is outside the scope of this
   proposal. Therefore, valid operator characters are defined without regard
   to the presence or absence of an alternative emoji presentation, and U+FE0E
   VARIATION SELECTOR-15 (text presentation selector) is *optionally* permitted
   to follow an operator character that has an alternative emoji presentation.
   Note that variation selectors are discarded by normalization.

These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3APattern_Syntax%3A]+%26+[%3AMath%3A] -+[%3ABlock%3DGeometric+Shapes%3A] -+[%3ABlock%3DMiscellaneous+Symbols%3A] -+[%3ABlock%3DMiscellaneous+Technical%3A] [!+%+\%26+*+\-+%2F+%3F+\\+\^+¡+¦+§+°+¶+¿+†+‡+•+‰+‱+※+‽+⁂+⁅+⁆+⁊+⁋+⁌+⁍+⁎+⁑]&g=&i=>
produce a set of 987 code points for operator characters. Since ID_Start is
derived
in part by exclusion of Pattern_Syntax code points, it is assured that
operator and identifier characters do not overlap (although this assurance
does not extend to emoji, which require additional design as detailed
below).
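
The operator rules above can be approximated in a few lines. This is a sketch under stated assumptions rather than the proposed implementation: it relies on `isPatternSyntax` and `isMath` from the standard library's `Unicode.Scalar.Properties` (Swift 5 and later), it hard-codes the three excluded blocks by their code point ranges because the standard library does not expose block names, and the names `excludedBlocks`, `augmentedOperatorScalars`, and `isOperatorCharacter` are hypothetical. Normalization and the optional U+FE0E selector are not handled.

```swift
// Excluded blocks, written as code point ranges.
let excludedBlocks: [ClosedRange<UInt32>] = [
    0x2300...0x23FF,  // Miscellaneous Technical
    0x25A0...0x25FF,  // Geometric Shapes
    0x2600...0x26FF,  // Miscellaneous Symbols
]

// ASCII, Latin-1 Supplement, and General Punctuation characters that the
// rules above add back explicitly.
let augmentedOperatorScalars = Set(
    "!%&*-/?\\^¡¦§°¶¿†‡•‰‱※‽⁂⁅⁆⁊⁋⁌⁍⁎⁑".unicodeScalars
)

func isOperatorCharacter(_ scalar: Unicode.Scalar) -> Bool {
    if augmentedOperatorScalars.contains(scalar) { return true }
    let properties = scalar.properties
    // [:Pattern_Syntax:] & [:Math:]
    guard properties.isPatternSyntax && properties.isMath else {
        return false
    }
    // ... minus the excluded blocks.
    return !excludedBlocks.contains { block in block.contains(scalar.value) }
}

print(isOperatorCharacter("+"))         // Math and Pattern_Syntax
print(isOperatorCharacter("\\"))        // added explicitly
print(isOperatorCharacter("\u{281F}"))  // Braille pattern: not Math
print(isOperatorCharacter("\u{2660}"))  // BLACK SPADE SUIT: excluded block
```

Expressing the rule as "property intersection, minus blocks, plus a short explicit list" keeps the hand-maintained part small, which is the point of the category-based approach.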

All current restrictions on reserved tokens and operators remain. Swift
reserves =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and
postfix !.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#dots>
Dots

Swift's existing rule for dots in operators is:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal modifies the rule to:

Dots may only appear in operators in sequences of two or more.

Incorporating the "two-dot rule" offers the following benefits:

   -

   It avoids lexical complications arising from lone dots (.).
   -

   The approach is conservative, erring on the side of overly restrictive.
   Dropping the rule in future (and thereby allowing single dots) may be
   possible.
   -

   It does not require special cases for existing infix dot operators in
   the standard library, ... (closed range) and ..< (half-open range). It
   leaves open the possibility of adding analogous half-open and fully-open
   range operators <.. and <..<.

Finally, this proposal *reserves* the .. operator for a possible "method
cascade" syntax in the future as supported by Dart
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>
.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-1>
Grammar changes

operator → operator-head operator-characters[opt]

operator-head → [[:Pattern_Syntax:] & [:Math:] - [:Emoji:] -
[:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] -
[:Block=Miscellaneous Technical:]]
operator-head → [[:Pattern_Syntax:] & [:Math:] & [:Emoji:] -
[:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] -
[:Block=Miscellaneous Technical:]] U+FE0E[opt]
operator-head → ! | % | & | * | - | / | ? | \ | ^ | ¡ | ¦ | § | ° | ¶ | ¿
operator-head → † | ‡ | • | ‰ | ‱ | ※ | ‽ | ⁂ | ⁅ | ⁆ | ⁊ | ⁋ | ⁌ | ⁍ | ⁎ | ⁑
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
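
The "two-dot rule" encoded by `operator-dot operator-dots` amounts to requiring that every maximal run of dots in an operator has length two or more. A minimal sketch of that check follows; the function name `satisfiesTwoDotRule` is hypothetical, and the check covers only dot placement, not whether the remaining characters are valid or whether the token is reserved.

```swift
// Returns true when every run of consecutive dots has length >= 2.
func satisfiesTwoDotRule(_ candidate: String) -> Bool {
    var currentRun = 0
    for scalar in candidate.unicodeScalars {
        if scalar == "." {
            currentRun += 1
        } else {
            // A run that just ended with exactly one dot violates the rule.
            if currentRun == 1 { return false }
            currentRun = 0
        }
    }
    // Also reject a single trailing dot.
    return currentRun != 1
}

print(satisfiesTwoDotRule("..."))   // true
print(satisfiesTwoDotRule("..<"))   // true
print(satisfiesTwoDotRule("<..<"))  // true
print(satisfiesTwoDotRule(".+"))    // false
print(satisfiesTwoDotRule("+.+"))   // false
```

Note that `.+` is accepted by the existing leading-dot rule but rejected here, while `<..<` is rejected by the existing rule but accepted here.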

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#emoji>
Emoji

The inclusion of emoji among valid identifier characters, though highly
desired, presents significant challenges:

   -

   Emoji characters are not displayed uniformly across different platforms.
   -

   Whether any particular character is presented as emoji or text depends
   on a matrix of considerations, including "environment" (e.g., Safari vs.
   Xcode), presence or absence of a variation selector, and whether the
   character itself defaults to "emoji presentation" or "text presentation."
   This behavior is specifically documented in Unicode recommendations
   <http://unicode.org/reports/tr51/#Presentation_Style>.
   -

   Some emoji not classified as Math depict operators: ❗❓➕➖➗✖️. A Unicode
   chart <http://unicode.org/emoji/charts/emoji-ordering.html> provides
   additional information by dividing emoji according to "rough categories,"
   but it warns that these categories "may change at any time, and should not
   be used in production."
   -

   Full emoji support would require allowing identifiers to contain
   zero-width joiner sequences that UAX#31 would forbid. Some normalization
   scheme would have to be devised to account for Unicode recommendations that
   👩‍❤️‍👨 (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F468) can be displayed
   as either 💑 (U+1F491) or, as a fallback, 👩❤️👨 (U+1F469 U+2764 U+FE0F
   U+1F468).

For maximum consistency across platforms, valid emoji in Swift identifiers
shall be determined using the following rules:

   -

   Emoji shall include code points with default emoji presentation (as
   opposed to text presentation), minus Emoji_Defectives and ID_Continue.
   Exclude Pattern_Syntax code points unless they are in the following
   blocks: Miscellaneous Symbols, Miscellaneous Technical.
   -

   Emoji shall include Emoji code points with default text presentation *when
   immediately followed by U+FE0F VARIATION SELECTOR-16 (emoji presentation
   selector)*, minus Emoji_Defectives and ID_Continue. Again, exclude
   Pattern_Syntax code points unless they are in the following blocks:
   Miscellaneous Symbols, Miscellaneous Technical. (Note that the emoji picker
   on Apple platforms--and, possibly, other platforms--automatically inserts
   U+FE0F VARIATION SELECTOR-16 when a user selects such code points; for
   instance, selecting ❤️ inserts U+2764 U+FE0F. Therefore, it is important
   that the invisible U+FE0F be permitted strictly in this use case. Note also
   that variation selectors are discarded by normalization.)
   -

   Emoji shall include Emoji_Flag_Sequences, Emoji_Keycap_Sequences, and
   (to the extent not already included) Emoji_Modifier_Sequences.

These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [%3AEmoji_Flag_Sequences%3A] [%3AEmoji_Keycap_Sequences%3A] [%3AEmoji_Modifier_Sequences%3A]&g=&i=>
produce
a set of 1,625 code points or sequences, of which 98 are currently
categorized as operator characters.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-2>Grammar
changes

identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] - [:Pattern_Syntax:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous
Symbols:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] -
[:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous
Technical:]]
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] - [:Pattern_Syntax:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] &
[:Block=Miscellaneous Symbols:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] -
[:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] &
[:Block=Miscellaneous Technical:]] U+FE0F
identifier-emoji → [[:Emoji_Flag_Sequences:]
[:Emoji_Keycap_Sequences:] [:Emoji_Modifier_Sequences:]]
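
The single-scalar productions above can be approximated with `Unicode.Scalar.Properties`. This is a hedged sketch rather than the proposed implementation: the helper names are hypothetical, the two permitted blocks are hard-coded ranges, and Emoji_Defectives (which is not exposed by the standard library) as well as flag, keycap, and modifier sequences are deliberately left out.

```swift
/// Hypothetical helper: true for scalars in Miscellaneous Symbols or
/// Miscellaneous Technical (hard-coded ranges).
func isInPermittedSymbolBlock(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x2300...0x23FF: return true  // Miscellaneous Technical
    case 0x2600...0x26FF: return true  // Miscellaneous Symbols
    default: return false
    }
}

/// Approximates the single-scalar identifier-emoji productions.
/// `followedByVS16` says whether U+FE0F immediately follows the scalar.
/// Assumption: Emoji_Defectives is not subtracted here.
func isIdentifierEmoji(_ scalar: Unicode.Scalar, followedByVS16: Bool) -> Bool {
    let properties = scalar.properties
    guard properties.isEmoji, !properties.isIDContinue else { return false }
    // Pattern_Syntax scalars are kept only inside the two permitted blocks.
    if properties.isPatternSyntax && !isInPermittedSymbolBlock(scalar) {
        return false
    }
    // Default-emoji scalars stand alone; default-text scalars need U+FE0F.
    return properties.isEmojiPresentation != followedByVS16
}

print(isIdentifierEmoji("\u{1F600}", followedByVS16: false))  // GRINNING FACE
print(isIdentifierEmoji("\u{2600}", followedByVS16: true))    // BLACK SUN WITH RAYS + U+FE0F
print(isIdentifierEmoji("\u{2795}", followedByVS16: false))   // HEAVY PLUS SIGN
```

Passing the variation selector as a flag keeps the sketch to one scalar at a time; a real lexer would need to look ahead in the scalar stream instead.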

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#source-compatibility>Source
compatibility

This change is source-breaking where developers have incorporated certain
emoji in identifiers or certain non-ASCII characters in operators. This is
unlikely to be a significant breakage for the majority of Swift code.
Diagnostics for invalid characters are already produced today. We can
improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: keep the old
parsing and identifier lookup code.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-abi-stability>Effect
on ABI stability

This proposal does not affect the ABI format itself. Normalization of
Unicode identifiers would affect the ABI of compiled modules. The standard
library will not be affected; it uses ASCII symbols with no combining
characters.
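
To make the normalization point concrete, the sketch below shows three spellings that NFC maps to the same scalar sequence, so they would name a single entity once identifiers are normalized. It assumes Foundation's `precomposedStringWithCanonicalMapping` (which produces NFC) as a stand-in for whatever the compiler would do; `nfcScalarValues` is a hypothetical helper.

```swift
import Foundation

/// Hypothetical helper: scalar values of `text` after NFC normalization.
func nfcScalarValues(_ text: String) -> [UInt32] {
    return text.precomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

let angstromSign = "\u{212B}"        // ANGSTROM SIGN
let precomposed = "\u{00C5}"         // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed = "\u{0041}\u{030A}"  // A + COMBINING RING ABOVE

// Each spelling is printed as its normalized scalar values.
print(nfcScalarValues(angstromSign))
print(nfcScalarValues(precomposed))
print(nfcScalarValues(decomposed))
```

Scalar values are compared rather than `String` values because Swift's `String` equality already treats canonically equivalent strings as equal.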
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-api-resilience>Effect
on API resilience

This proposal doesn't affect API resilience.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#alternatives-considered>Alternatives
considered

   -

   Use NFKC instead of NFC for identifiers. The decision to use NFC is
   based on UAX#31, which states:

   Generally if the programming language has case-sensitive identifiers,
   then Normalization Form C is appropriate; whereas, if the programming
   language has case-insensitive identifiers, then Normalization Form KC is
   more appropriate.

   -

   Eliminate emoji from identifiers and restrict operator characters to a
   limited number of ASCII code points. This approach would be simpler, but
   feedback on Swift-Evolution has been overwhelmingly against such a change.
   -

   Hand-pick a set of "operator-like" characters to include. The proposal
   authors tried this painstaking approach and came up with a relatively
   agreeable set of about 650 code points
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]>.
   Such a list can carefully avoid idiosyncrasies in the Unicode standard.
   However, a character-by-character inventory is unlikely to converge on
   consensus, is as likely to introduce unintended Swift-specific
   idiosyncrasies as it is to avoid Unicode shortcomings, and is inconsistent
   with the Unicode method of deriving such lists using categories.
   -

   Continue to allow single . in operators, perhaps even expanding the
   original rule to allow them anywhere (even if the operator does not begin
   with .).

   This would allow a wider variety of custom operators (for some
   interesting possibilities, see the operators in Haskell's Lens
   <https://github.com/ekmett/lens/wiki/Operators> package). However, there
   are a handful of potential complications:
   -

      Combining prefix or postfix operators with member access: foo*.bar would
      need to be parsed as foo *. bar rather than (foo*).bar. Parentheses
      could be required to disambiguate.
      -

      Combining infix operators with contextual members: foo*.bar would
      need to be parsed as foo *. bar rather than foo * (.bar). Whitespace
      or parentheses could be required to disambiguate.
      -

      Hypothetically, if operators were accessible as members such as
      MyNumber.+, allowing operators with single .s would require escaping
      operator names (perhaps with backticks, such as MyNumber.`+`).

   This would also require operators of the form [!?]*\. (for example . ?.
   !. !!.) to be reserved, to prevent users from defining custom operators
   that conflict with member access and optional chaining.

   We believe that requiring dots to appear in groups of at least two,
   while in some ways more restrictive, will prevent a significant amount of
   future pain, and does not require special-case considerations such as the
   above.
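
For the first alternative in the list above (NFKC instead of NFC), the practical difference is that NFKC also folds compatibility characters such as ligatures and superscripts. A minimal sketch, assuming Foundation's `precomposedStringWithCanonicalMapping` (NFC) and `precomposedStringWithCompatibilityMapping` (NFKC); `normalizedScalarValues` is a hypothetical helper.

```swift
import Foundation

/// Hypothetical helper: scalar values of `text` after NFC, or after NFKC
/// when `compatibility` is true.
func normalizedScalarValues(_ text: String, compatibility: Bool) -> [UInt32] {
    let normalized = compatibility
        ? text.precomposedStringWithCompatibilityMapping   // NFKC
        : text.precomposedStringWithCanonicalMapping       // NFC
    return normalized.unicodeScalars.map { $0.value }
}

let ligature = "\u{FB01}"      // LATIN SMALL LIGATURE FI
let superscript = "x\u{00B2}"  // x followed by SUPERSCRIPT TWO

for sample in [ligature, superscript] {
    print(normalizedScalarValues(sample, compatibility: false),
          normalizedScalarValues(sample, compatibility: true))
}
```

Under NFC both samples keep their original scalars, while NFKC rewrites them to plain letters and digits, so identifiers that differ only by such characters would collide under NFKC but stay distinct under NFC.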

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#future-directions>Future
directions

While not within the scope of this proposal, the following considerations
may provide useful context for the proposed changes. We encourage the
community to pick up these topics when the time is right.

   -

   Introduce a syntax for method cascades. The Dart language supports method
   cascades
   <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>,
   whereby multiple methods can be called on an object within one expression:
   foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax
   can also be used with assignments and subscripts. Such a feature might be
   very useful in Swift; this proposal reserves the .. operator so that it
   may be added in the future.
   -

   Introduce "mixfix" operator declarations. Mixfix operators are based on
   pattern matching and would allow more than two operands. For example, the
   ternary operator ? : can be defined as a mixfix operator with three
   "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations
   such as _ [ _ ]. Some holes could be made @autoclosure, and there might
   even be holes whose argument is represented as an AST, rather than a value
   or thunk, supporting advanced metaprogramming (for instance, F#'s code
   quotations
   <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).
   Should mixfix operators become supported, it would be sensible to add
   brackets to the set of valid operator characters.
   -

   Diminish or remove the lexical distinction between operators and
   identifiers. If precedence and fixity applied to traditional identifiers
   as well as operators, it would be possible to incorporate ASCII equivalents
   for standard operators (e.g. and for &&, to allow A and B). If
   additionally combined with mixfix operator support, this might enable
   powerful DSLs (for instance, C#'s LINQ
   <https://en.wikipedia.org/wiki/Language_Integrated_Query>).
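
The cascade behavior described in the first item above can be approximated in today's Swift with an ordinary function. This is only a sketch of the semantics, not of the syntax for which this proposal reserves `..`; `cascade` and `Recorder` are hypothetical names introduced for illustration.

```swift
/// Hypothetical type that records which methods were called on it.
final class Recorder {
    private(set) var calls: [String] = []
    func bar() { calls.append("bar") }
    func baz() { calls.append("baz") }
}

/// Hypothetical helper: runs `body` against `target`, then returns `target`,
/// so several calls on one object fit in a single expression.
@discardableResult
func cascade<T: AnyObject>(_ target: T, _ body: (T) -> Void) -> T {
    body(target)
    return target
}

// Roughly what foo..bar()..baz() would express with a cascade syntax.
let foo = cascade(Recorder()) { recorder in
    recorder.bar()
    recorder.baz()
}
print(foo.calls)
```

The helper is restricted to class types so that the mutations made inside the closure are visible through the returned reference.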

I like the approach taken here, and it is a much better way of concluding the characters. I don't disagree with the design and don't have any example code that will be affected, but I do have some (minor) observations about the proposal.

* The 'Dots' treatment feels like a special case in an otherwise good write-up of Unicode, seemingly to lean towards Dart's method chaining and/or cleanliness of implementation. It might be clearer to pull that out to its own proposal, either independent of or building upon the general Unicode changes?

* The grammar changes for the operator head contain a number of (what seem like) hand-picked Unicode symbols for increased compatibility with Swift 3 (e.g. dagger and friends). Maybe these could be pulled out into their own group, e.g. operator-head -> operator-head-swift3, to call out the reason for their hand-picked nature (and for later cleanup, should that be required).

* In the proposed solution tables (shall be an identifier/is an identifier), it wasn't clear to me at first what the rows and columns were. Maybe calling these out as a bulleted list would be better:

- Identifiers under Swift 3 and this proposal: 120,617 code points
- Identifiers that would be added under this proposal: 699 emoji
- Identifiers under Swift 3 that would no longer be identifiers: unassigned code points and 4,929 other code points

Similarly, for operators:

- Operators under Swift 3 and this proposal: 986 code points
- Operators that would be added under this proposal: \
- Operators under Swift 3 that would no longer be operators: unassigned code points and 2,024 other code points

You could summarise that as a pseudo-diff --stat

Identifiers
+ 699 emoji
  120,617 code points
- 4,929 code points and unassigned code points

Operators
+ 1 code point \
  986 code points
- 2,024 code points

Alternatively you could change the 'Is an identifier/operator' to 'Is a Swift 3 identifier' to make it clear that it's the Swift 3 header, but the tabular form is still not that clear to me.

Another stat that would be worth calling out: of the 2,024 code points that are no longer operators, what is the overlap with the 699 emoji that are added to the identifiers? If all 699 were among them, then only 1,325 operators would no longer be valid.

To conclude: I like the look of the proposal from the block set definition, which will be better than hand-picking the character set as the grammar currently stands.

Alex


I was one of the people leading the charge for preserving Emoji support and I really like where this proposal landed. Thank you to all the authors for the hard work!

+1

Russ


Thanks for all your hard work. I like this approach much better… +1

···

On Feb 16, 2017, at 9:50 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

As Stage 2 of Swift 4 evolution starts now, I'd like to share a revised proposal in draft form.

It proposes a source-breaking change for rationalizing which characters are permitted in identifiers and which in operators. It's justified for this phase of Swift 4 because:

- Existing grammar, in permitting invisible characters without security-minded restrictions, can be actively harmful.
- A rationalized approach is superior to the current approach: by referencing Unicode standards, Swift should be able to evolve in a backwards-compatible way alongside Unicode, and will benefit from the significant expertise of others outside the Swift community with respect to Unicode best practices.
- The vast majority of existing code (including all of the standard library) should require no migration work at all

What's changed since the last time:

- In an earlier draft, we proposed some radical changes to align with available Unicode standards; in particular, since emoji represent a difficult issue, and no recommendations about "operator identifiers" have surfaced from Unicode, we proposed temporarily stripping them out. This was very poorly received. This revision uses Unicode categories to identify nearly all emoji and classify them as identifier characters (while excluding those that depict operators such as !), and it uses Unicode categories to identify over 900 operators that nearly all pass the subjective test of "operator-likeness."

What this proposal does not attempt to do:

- This document does not seek to stake out new ground as to what characters should be added to the set of valid identifiers and operators. Such additions to the grammar are properly separate discussions. This proposal is only an attempt at systemization and rationalization. Only one character is incidentally added to the list of valid characters (`\`), and it is on the basis of an explicit table in Unicode Technical Report 25 regarding ASCII characters that are "mathematical."

What feedback would be most helpful:

- "Hey, this approach is so much more clumsy than my superior, more elegant category-based approach to identifying [operators/emoji], which is [insert here]."
- "Hey, I disagree with the detailed design because it's got a major security hole, which is [insert here]."
- "Hey, your proposal would break my real-world Swift code, which requires that character [X] be an [identifier/operator]."

What would be less helpful:

- "Hey, let's talk about how [specific character] should be an [identifier/operator]. We should add that character to the list of [identifiers/operators]. In fact, let's discuss [list] characters one by one."

Acknowledgments:
Thanks to co-authors of the previous take for their support for resurrecting this issue. Any brilliant ideas are undoubtedly theirs, and any botched efforts are certainly mine. Thanks also to Nevin Brackett-Rozinsky for helpful feedback.

Link:
https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651

Rendered text:

Refining identifier and operator symbology (take 2)

Proposal: SE-NNNN <https://gist.github.com/xwu/NNNN-refining-identifier-and-operator-symbology.md>
Authors: Xiaodi Wu <https://github.com/xwu>, Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>, Jonathan Shapiro, João Pinheiro <https://github.com/joaopinheiro>
Review Manager: TBD
Status: Awaiting review
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#introduction>Introduction

This proposal refines and rationalizes Swift's identifier and operator symbology. Specifically, this proposal:

- refines the set of valid identifier characters based on Unicode recommendations, with customizations principally to accommodate emoji;
- refines the set of valid operator characters based on Unicode categories; and
- changes rules as to where dots may appear in operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#prior-discussion-threads-and-proposals>Prior discussion threads and proposals

- Define backslash '\' as a operator-head in the swift grammar <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170130/031461.html>
- Refining Identifier and Operator Symbology <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20161017/028174.html> (a precursor to this document)
- Proposal: Normalize Unicode identifiers <https://github.com/apple/swift-evolution/pull/531>
- Lexical matters: identifiers and operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
- Unicode identifiers & operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>, with pre-proposal <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59>
- Proposal: Allow Single Dollar Sign as Valid Identifier <https://github.com/apple/swift-evolution/pull/354>
- Free the '$' Symbol! <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
- Request to add middle dot (U+00B7) as operator character? <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#motivation>Motivation

Swift supports programmers from many languages and cultures. However, the current identifier and operator character sets do not conform to any Unicode standards, nor have they been rationalized in the language or compiler documentation. These deserve a well-considered, standards-based revision.

As Chris Lattner has written:

   We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.
   …our current operator space (particularly the Unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.

Identifiers, which serve as names for various entities, are linguistic in nature and must permit a variety of characters in order to properly serve non–English-speaking coders. This issue has been considered by the communities of many programming languages already, and the Unicode Consortium has published recommendations on how to choose identifier character sets. Swift should make an effort to conform to these recommendations.

Operators, on the other hand, should be rare and carefully chosen because they suffer from limited discoverability and readability. They are by nature symbols, not names. This places a cognitive cost on users with respect to recall ("What is the operator that applies the behavior I need?") and recognition ("What does the operator in this code do?"). While almost every non-trivial program defines new identifiers, most programs do not define new operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#inconsistency>Inconsistency

Concrete discrepancies and edge cases motivate these proposed changes. For example:

The Greek question mark ; is a valid identifier.
Some non-combining diacritics ´ ¨ ꓻ are valid in identifiers.
Braille patterns ⠟, which are letter-like, are operator characters.
Other symbols such as ⚄ and ♄ are operator characters despite not being "operator-like."
Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers (₪ € ₱ ₹ ฿ ...).
🙂🤘▶️🛩 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
A few characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers and operators.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#invisible-distinctions>Invisible distinctions

Identifiers that take advantage of Swift's Unicode support are not normalized. This allows different representations of the same characters to be considered distinct identifiers. For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"
Non-printing characters such as ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER are also accepted as valid identifier characters without any restrictions.

let ab = "ab"
let a​b = "a + ZERO WIDTH SPACE + b"

func xy() { print("xy") }
func x‌y() { print("x + ZERO WIDTH NON-JOINER + y") }
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#timeline>Timeline

These matters should be considered in a near timeframe (Swift 4). Identifier and operator character sets are fundamental parts of Swift grammar, and changes are inevitably source-breaking.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#non-goals>Non-goals

The aim of this proposal is to rationalize the set of valid operator characters and the set of valid identifier characters using Unicode categories and specific Unicode recommendations where available. The smallest necessary customizations are made to increase backwards compatibility, but no attempt is made to expand Swift grammar or to "improve" Unicode. Specifically, the following questions are potential subjects of separate study, either within the purview of the Swift open source project or of the Unicode Consortium:

Expanding the set of valid operator or identifier characters. For example, $ is not currently a valid operator in Swift, there are no current Unicode recommendations regarding operators in programming languages, and $ is not enumerated among the list of "mathematical" characters in Unicode. Although it is possible for Swift to customize its implementation of Unicode recommendations to add $ as a valid operator, that is an expansion of Swift grammar distinct from the task of rationalizing Swift symbology according to Unicode standards. Therefore, this document neither proposes nor opposes its addition. For similar reasons, this document refines the inclusion of emoji in identifiers based on Unicode categories, but it neither proposes nor opposes the addition of non-emoji pictographic symbols to the set of valid identifier characters.

Rectifying Unicode shortcomings. Although it is possible to discover shortcomings concerning particular characters in the current version of Unicode, no attempt is made to preempt the Unicode standardization process by "patching" such issues in the Swift grammar. For example, in the current version of Unicode, ⁗ QUADRUPLE PRIME is not deemed to be "mathematical" (even though ‴ TRIPLE PRIME is deemed to be "mathematical"). Certainly, this issue would be appropriate to report to Unicode and may well be corrected in a future revision of the standard. However, as the Swift community is not congruent with the community of experts that specialize in Unicode, there is no rational basis to expect that Swift-only determinations of what Unicode "should have done" (without vetting through Unicode's standardization processes) are likely to result in a better outcome than the existing Unicode standard. Therefore, no attempt is made to augment the Unicode derived category Math with ⁗ QUADRUPLE PRIME in this proposal. Similarly, Unicode recommends certain normalization forms for identifiers in code, which are proposed here for adoption by Swift, but these normalization forms do not eliminate all possible combinations of "confusable" characters. This proposal does not attempt to invent an ad-hoc normalization form in an attempt to "improve" Unicode recommendations.

Implementing additional features. Innovative ideas such as mixfix operators are detailed below in Future directions. This proposal does not attempt to introduce any such features.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#precedent-in-other-languages>Precedent in other languages

Haskell distinguishes identifiers/operators by their general category <http://www.fileformat.info/info/unicode/category/index.htm> (for instance, "any Unicode lowercase letter" or "any Unicode symbol or punctuation"). Identifiers can start with any lowercase letter or _, and they may contain any letter, digit, ', or _. This includes letters like δ and Я, and digits like ٢.

- Haskell Syntax Reference <https://www.haskell.org/onlinereport/syntax-iso.html>
- Haskell Lexer <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

Scala similarly allows letters, numbers, $, and _ in identifiers, distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator characters include mathematical and other symbols (Sm and So) in addition to certain ASCII characters.

- Scala Lexical Syntax <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

ECMAScript 2015 uses ID_Start and ID_Continue, as well as Other_ID_Start and Other_ID_Continue, for identifiers.

- ECMAScript Specification: Names and Keywords <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Python 3 uses XID_Start and XID_Continue.

- The Python Language Reference: Identifiers and Keywords <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
- PEP 3131: Supporting Non-ASCII Identifiers <https://www.python.org/dev/peps/pep-3131/>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#proposed-solution>Proposed solution

Identifiers. Adopt recommendations made in UAX#31 Identifier and Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid identifier characters from ID_Start and ID_Continue. Adopt specific customizations principally to accommodate emoji. Consider two identifiers equivalent when they produce the same normalized form under Normalization Form C (NFC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for case-sensitive use cases.

Is an identifier Is not an identifier
Shall be an identifier 120,617 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] %26+[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=> 699 emoji 
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]] -[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]&g=&i=>
Shall not be an identifier 846,137 unassigned code points;
4,929 other code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] -[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=> All other code points
Operators. No Unicode recommendation currently exists on the topic of "operator identifiers," although work is ongoing as part of a future update to UAX#31. The aim of the proposed definition presented in this document is to identify, using Unicode categories, a reasonable set of operators that (a) may be in current use in Swift code; and (b) are likely to be included in future versions of UAX#31. It is not intended to be a final judgment on all code points that should ever be valid in Swift operators, for which it is proposed that Swift await the recommendations of the Unicode Consortium.

Therefore, adopt an approach to define the set of valid operator characters based primarily on the Unicode categories Math and Pattern_Syntax (an approach analogous to that which is used to define ID_Start and ID_Continue in Unicode recommendations), informed by UTR#25 Unicode Support for Mathematics <http://www.unicode.org/reports/tr25/>. Augment the set of valid operator characters with a number of currently valid Swift operator characters to increase backward compatibility. Consider two operators equivalent when they produce the same normalized form under Normalization Form KC (NFKC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for case-insensitive use cases. Fullwidth variants such as FULLWIDTH HYPHEN-MINUS are equivalent to their non-fullwidth counterparts after normalization under NFKC (but not NFC).

Is an operator Is not an operator
Shall be an operator 986 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]%26[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]]> \
Shall not be an operator 130 unassigned code points;
2,024 other code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]-[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]]> All other code points
Dots. Adopt a rule to allow dots to appear in operators at any location, but only in runs of two or more. (Currently, dots must be leading.)

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#detailed-design>Detailed design

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#identifiers>Identifiers

Swift identifier characters shall conform to UAX#31 <http://unicode.org/reports/tr31/#Conformance> as follows:

UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance described herein refers to the Unicode 9.0.0 version of UAX#31.

UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the following requirements:

UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment the definition of "Default Identifiers" with the following profiles:

ID_Start and ID_Continue shall be used for Start and Continue, replacing XID_Start and XID_Continue. This excludes characters in Other_ID_Start and Other_ID_Continue.

_ 005F LOW LINE shall additionally be allowed as a Start character.

Certain emoji shall additionally be allowed as Start characters. A detailed design for emoji permitted in identifiers is given below.

UAX31-R1a. <http://unicode.org/reports/tr31/#R1a> The join-control characters ZWJ and ZWNJ are strictly limited to the special cases A1, A2, and B described in UAX#31.

UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider two identifiers equivalent when they produce the same normalized form under Normalization Form C (NFC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for case-sensitive use cases.
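
As a rough illustration of the NFC equivalence rule above, the following sketch compares two spellings by their NFC forms, scalar by scalar. The function name isSameIdentifier is hypothetical and is not part of the proposal or the compiler; the sketch assumes Foundation's precomposedStringWithCanonicalMapping as the NFC implementation.

```swift
import Foundation

/// Hypothetical helper: true when two spellings share the same NFC form.
/// The comparison is done on Unicode scalars rather than with String's
/// own == operator, so that the normalization step is explicit.
func isSameIdentifier(_ a: String, _ b: String) -> Bool {
    let na = a.precomposedStringWithCanonicalMapping
    let nb = b.precomposedStringWithCanonicalMapping
    return na.unicodeScalars.elementsEqual(nb.unicodeScalars)
}

let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed  = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed   = "A\u{030A}"  // A + COMBINING RING ABOVE

print(isSameIdentifier(angstromSign, precomposed))  // true
print(isSameIdentifier(precomposed, decomposed))    // true
print(isSameIdentifier("a", "A"))                   // false: case is preserved
```

Under this rule, the three spellings would name a single identifier, while identifiers differing only in case remain distinct.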

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes>Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _
identifier-head → identifier-emoji
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
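
The grammar above might be approximated with the Unicode properties that the Swift standard library exposes (Swift 5 or later). This is only a sketch: the function names are hypothetical, the identifier-emoji production is deliberately left out, and the results follow whatever Unicode version the runtime ships rather than Unicode 9.0.0.

```swift
/// identifier-head, without the identifier-emoji alternative.
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.isIDStart
}

/// identifier-character: any head character, or an ID_Continue scalar.
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

/// Hypothetical whole-token check for emoji-free identifiers.
func isPlainIdentifier(_ text: String) -> Bool {
    guard let first = text.unicodeScalars.first, isIdentifierHead(first) else {
        return false
    }
    return text.unicodeScalars.dropFirst().allSatisfy(isIdentifierCharacter)
}

print(isPlainIdentifier("_foo1"))        // true
print(isPlainIdentifier("na\u{EF}ve"))   // true
print(isPlainIdentifier("1abc"))         // false: digits cannot start an identifier
print(isPlainIdentifier("a\u{200B}b"))   // false: ZERO WIDTH SPACE is not ID_Continue
```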

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#operators>Operators

Swift operator characters shall be determined as follows:

Valid operator characters shall consist of Pattern_Syntax code points with a derived property Math. However, the following blocks are excluded: Geometric Shapes, Miscellaneous Symbols, and Miscellaneous Technical. In UnicodeSet notation:

[:Pattern_Syntax:] & [:Math:]
- [:Block=Geometric Shapes:]
- [:Block=Miscellaneous Symbols:]
- [:Block=Miscellaneous Technical:]

Math captures a fuller set of operators than is possible using Sm, and we avoid the inclusion of characters in So that are clearly not "operator-like" (such as Braille). Math code points in the excluded blocks include sign parts such as ⎲ SUMMATION TOP and tenuously "operator-like" code points such as ♠ BLACK SPADE SUIT.

The set of valid operator characters shall be augmented with the following ASCII characters: !, %, &, *, -, /, ?, \, ^. These ASCII characters are required by the Swift standard library and/or considered "weakly mathematical" in UTR#25 <http://www.unicode.org/reports/tr25/>.

For increased compatibility with Swift 3, the set of valid operator characters shall be augmented with the following Latin-1 Supplement characters: ¡, ¦, §, °, ¶, ¿. For the same reason, augment the set of valid operator characters with the following General Punctuation characters: † DAGGER, ‡ DOUBLE DAGGER, • BULLET, ‰ PER MILLE SIGN, ‱ PER TEN THOUSAND SIGN, ※ REFERENCE MARK, ‽ INTERROBANG, ⁂ ASTERISM, ⁅ LEFT SQUARE BRACKET WITH QUILL, ⁆ RIGHT SQUARE BRACKET WITH QUILL, ⁊ TIRONIAN SIGN ET, ⁋ REVERSED PILCROW SIGN, ⁌ BLACK LEFTWARDS BULLET, ⁍ BLACK RIGHTWARDS BULLET, ⁎ LOW ASTERISK, ⁑ TWO ASTERISKS ALIGNED VERTICALLY.
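
The category-based definition plus the explicit augmentations could be sketched as follows. The names are hypothetical. Because the standard library does not expose the Block property, the three excluded blocks are written out as code point ranges, and the Math and Pattern_Syntax answers come from the Unicode version of the runtime, so counts may differ from the figures quoted in this document.

```swift
// Blocks excluded by the proposal, spelled out as code point ranges.
let excludedBlocks: [ClosedRange<UInt32>] = [
    0x2300...0x23FF,  // Miscellaneous Technical
    0x25A0...0x25FF,  // Geometric Shapes
    0x2600...0x26FF,  // Miscellaneous Symbols
]

// ASCII, Latin-1 Supplement, and General Punctuation augmentations.
let augmentedCharacters = Set("!%&*-/?\\^¡¦§°¶¿†‡•‰‱※‽⁂⁅⁆⁊⁋⁌⁍⁎⁑".unicodeScalars)

/// Hypothetical classifier for a single operator character.
func isOperatorCharacter(_ scalar: Unicode.Scalar) -> Bool {
    if augmentedCharacters.contains(scalar) { return true }
    let properties = scalar.properties
    guard properties.isPatternSyntax && properties.isMath else { return false }
    return !excludedBlocks.contains { $0.contains(scalar.value) }
}

print(isOperatorCharacter("+"))          // true: Pattern_Syntax and Math
print(isOperatorCharacter("\\"))         // true: explicitly augmented
print(isOperatorCharacter("\u{2660}"))   // false: BLACK SPADE SUIT, excluded block
print(isOperatorCharacter("$"))          // false: not Math
```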

Swift shall consider two operators equivalent when they produce the same normalized form under Normalization Form KC (NFKC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for case-insensitive use cases. Crucially, fullwidth variants such as FULLWIDTH HYPHEN-MINUS are equivalent to their non-fullwidth counterparts after normalization under NFKC (but not NFC).
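
The difference between NFC and NFKC for fullwidth variants can be observed directly. This sketch assumes Foundation's precomposedStringWithCanonicalMapping and precomposedStringWithCompatibilityMapping as the NFC and NFKC implementations; scalarValues is a hypothetical helper.

```swift
import Foundation

/// Hypothetical helper: the scalar values of a string, for exact comparison.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

let fullwidthHyphenMinus = "\u{FF0D}"  // FULLWIDTH HYPHEN-MINUS

let nfc  = fullwidthHyphenMinus.precomposedStringWithCanonicalMapping
let nfkc = fullwidthHyphenMinus.precomposedStringWithCompatibilityMapping

print(scalarValues(nfc) == [0x2D])   // false: NFC leaves U+FF0D unchanged
print(scalarValues(nfkc) == [0x2D])  // true: NFKC maps it to HYPHEN-MINUS
```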

Certain strongly mathematical arrows now have an alternative emoji presentation, and future versions of Unicode may add such an emoji presentation to any Swift operator character. Some but not all "environments" or applications (for instance, Safari but not TextWrangler) display the alternative emoji presentation at all times, and such discrepancies between applications are explicitly permitted by Unicode recommendations (see discussion in Emoji). However, it would be highly unusual to define the set of valid operator characters based on an essentially arbitrary criterion as to whether an alternative emoji presentation is retroactively assigned to a code point, and codifying how IDEs display Unicode characters in Swift files is outside the scope of this proposal. Therefore, valid operator characters are defined without regard to the presence or absence of an alternative emoji presentation, and U+FE0E VARIATION SELECTOR-15 (text presentation selector) is optionally permitted to follow an operator character that has an alternative emoji presentation. Note that variation selectors are discarded by normalization.

These revised rules <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3APattern_Syntax%3A]+%26+[%3AMath%3A] -+[%3ABlock%3DGeometric+Shapes%3A] -+[%3ABlock%3DMiscellaneous+Symbols%3A] -+[%3ABlock%3DMiscellaneous+Technical%3A] [!+%+\%26+*+\-+%2F+%3F+\\+\^+¡+¦+§+°+¶+¿+†+‡+•+‰+‱+※+‽+⁂+⁅+⁆+⁊+⁋+⁌+⁍+⁎+⁑]&g=&i=> produce a set of 987 code points for operator characters. Since ID_Start is derived in part by exclusion of Pattern_Syntax code points, it is assured that operator and identifier characters do not overlap (although this assurance does not extend to emoji, which require additional design as detailed below).

All current restrictions on reserved tokens and operators remain. Swift reserves =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix !.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#dots>Dots

Swift's existing rule for dots in operators is:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal modifies the rule to:

Dots may only appear in operators in sequences of two or more.

Incorporating the "two-dot rule" offers the following benefits:

It avoids lexical complications arising from a lone dot (.).

The approach is conservative, erring on the side of overly restrictive. Dropping the rule in future (and thereby allowing single dots) may be possible.

It does not require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It leaves open the possibility of adding analogous half-open and fully-open range operators <.. and <..<.

Finally, this proposal reserves the .. operator for a possible "method cascade" syntax in the future as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.
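
The "two-dot rule" is simple enough to check mechanically. The following is a minimal sketch with a hypothetical function name; it tests only the dot rule and does not check whether the remaining characters are valid operator characters or whether the token is reserved.

```swift
/// Hypothetical check: every run of dots in the token has length two or more.
func satisfiesTwoDotRule(_ token: String) -> Bool {
    var runLength = 0
    for scalar in token.unicodeScalars {
        if scalar == "." {
            runLength += 1
        } else {
            if runLength == 1 { return false }  // a lone dot just ended
            runLength = 0
        }
    }
    return runLength != 1  // reject a trailing lone dot
}

print(satisfiesTwoDotRule("..."))   // true
print(satisfiesTwoDotRule("..<"))   // true
print(satisfiesTwoDotRule("<..<"))  // true
print(satisfiesTwoDotRule(".+"))    // false: accepted by the existing rule, rejected here
print(satisfiesTwoDotRule("+.+"))   // false
```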

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → [[:Pattern_Syntax:] & [:Math:] - [:Emoji:] - [:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] - [:Block=Miscellaneous Technical:]]
operator-head → [[:Pattern_Syntax:] & [:Math:] & [:Emoji:] - [:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] - [:Block=Miscellaneous Technical:]] U+FE0E[opt]
operator-head → ! | % | & | * | - | / | ? | \ | ^ | ¡ | ¦ | § | ° | ¶ | ¿
operator-head → † | ‡ | • | ‰ | ‱ | ※ | ‽ | ⁂ | ⁅ | ⁆ | ⁊ | ⁋ | ⁌ | ⁍ | ⁎ | ⁑
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#emoji>Emoji

The inclusion of emoji among valid identifier characters, though highly desired, presents significant challenges:

Emoji characters are not displayed uniformly across different platforms.

Whether any particular character is presented as emoji or text depends on a matrix of considerations, including "environment" (e.g., Safari vs. Xcode), presence or absence of a variation selector, and whether the character itself defaults to "emoji presentation" or "text presentation." This behavior is specifically documented in Unicode recommendations <http://unicode.org/reports/tr51/#Presentation_Style>.

Some emoji not classified as Math depict operators: ❗❓➕➖➗✖️. A Unicode chart <http://unicode.org/emoji/charts/emoji-ordering.html> provides additional information by dividing emoji according to "rough categories," but it warns that these categories "may change at any time, and should not be used in production."

Full emoji support would require allowing identifiers to contain zero-width joiner sequences that UAX#31 would forbid. Some normalization scheme would have to be devised to account for Unicode recommendations that 👩‍❤️‍👨 (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F468) can be displayed as either 💑 (U+1F491) or, as a fallback, 👩❤️👨 (U+1F469 U+2764 U+FE0F U+1F468).

For maximum consistency across platforms, valid emoji in Swift identifiers shall be determined using the following rules:

Emoji shall include code points with default emoji presentation (as opposed to text presentation), minus Emoji_Defectives and ID_Continue. Exclude Pattern_Syntax code points unless they are in the following blocks: Miscellaneous Symbols, Miscellaneous Technical.

Emoji shall include Emoji code points with default text presentation when immediately followed by U+FE0F VARIATION SELECTOR-16 (emoji presentation selector), minus Emoji_Defectives and ID_Continue. Again, exclude Pattern_Syntax code points unless they are in the following blocks: Miscellaneous Symbols, Miscellaneous Technical. (Note that the emoji picker on Apple platforms--and, possibly, other platforms--automatically inserts U+FE0F VARIATION SELECTOR-16 when a user selects such code points; for instance, selecting ❤️ inserts U+2764 U+FE0F. Therefore, it is important that the invisible U+FE0F be permitted strictly in this use case. Note also that variation selectors are discarded by normalization.)

Emoji shall include Emoji_Flag_Sequences, Emoji_Keycap_Sequences, and (to the extent not already included) Emoji_Modifier_Sequences.

These revised rules <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [%3AEmoji_Flag_Sequences%3A] [%3AEmoji_Keycap_Sequences%3A] [%3AEmoji_Modifier_Sequences%3A]&g=&i=> produce a set of 1,625 code points or sequences, of which 98 are currently categorized as operator characters.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-2>Grammar changes

identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] - [:Pattern_Syntax:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Symbols:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Technical:]]
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] - [:Pattern_Syntax:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Symbols:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Technical:]] U+FE0F
identifier-emoji → [[:Emoji_Flag_Sequences:] [:Emoji_Keycap_Sequences:] [:Emoji_Modifier_Sequences:]]
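
The first six identifier-emoji productions might be approximated as below. This is a partial sketch under stated assumptions: the function names are hypothetical, the two permitted Pattern_Syntax blocks are written out as code point ranges, the Emoji_Defectives subtraction is not modeled (so, for example, a lone regional indicator would be accepted), and flag, keycap, and modifier sequences are not handled. Property values come from the runtime's Unicode version.

```swift
// Pattern_Syntax emoji are permitted only inside these two blocks.
let permittedSyntaxBlocks: [ClosedRange<UInt32>] = [
    0x2300...0x23FF,  // Miscellaneous Technical
    0x2600...0x26FF,  // Miscellaneous Symbols
]

/// Emoji, minus ID_Continue, minus Pattern_Syntax outside the two blocks.
func isPermittedEmojiBase(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    guard properties.isEmoji, !properties.isIDContinue else { return false }
    if properties.isPatternSyntax {
        return permittedSyntaxBlocks.contains { $0.contains(scalar.value) }
    }
    return true
}

/// Hypothetical check for one identifier-emoji element (sequences omitted).
func isIdentifierEmoji(_ text: String) -> Bool {
    let scalars = Array(text.unicodeScalars)
    guard let base = scalars.first, isPermittedEmojiBase(base) else { return false }
    if base.properties.isEmojiPresentation {
        return scalars.count == 1  // default emoji presentation: no selector
    }
    // Default text presentation: U+FE0F must follow immediately.
    return scalars.count == 2 && scalars[1].value == 0xFE0F
}

print(isIdentifierEmoji("\u{1F600}"))         // true: GRINNING FACE
print(isIdentifierEmoji("\u{2600}\u{FE0F}"))  // true: BLACK SUN WITH RAYS + VS-16
print(isIdentifierEmoji("\u{2600}"))          // false: selector missing
print(isIdentifierEmoji("1\u{FE0F}"))         // false: digit is ID_Continue
```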

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#source-compatibility>Source compatibility

This change is source-breaking where developers have incorporated certain emoji in identifiers or certain non-ASCII characters in operators. This is unlikely to be a significant breakage for the majority of Swift code. Diagnostics for invalid characters are already produced today. We can improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: keep the old parsing and identifier lookup code.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-abi-stability>Effect on ABI stability

This proposal does not affect the ABI format itself. Normalization of Unicode identifiers would affect the ABI of compiled modules. The standard library will not be affected; it uses ASCII symbols with no combining characters.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-api-resilience>Effect on API resilience

This proposal doesn't affect API resilience.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#alternatives-considered>Alternatives considered

Use NFKC instead of NFC for identifiers. The decision to use NFC is based on UAX#31, which states:

Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate.

Eliminate emoji from identifiers and restrict operator characters to a limited number of ASCII code points. This approach would be simpler, but feedback on Swift-Evolution has been overwhelmingly against such a change.

Hand-pick a set of "operator-like" characters to include. The proposal authors tried this painstaking approach and came up with a relatively agreeable set of about 650 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]>. Such a list can carefully avoid idiosyncrasies in the Unicode standard. However, a character-by-character inventory is unlikely to converge on consensus, as likely to introduce unintended Swift-specific idiosyncrasies as it is to avoid Unicode shortcomings, and inconsistent with the Unicode method of deriving such lists using categories.

Continue to allow single . in operators, perhaps even expanding the original rule to allow them anywhere (even if the operator does not begin with .).

This would allow a wider variety of custom operators (for some interesting possibilities, see the operators in Haskell's Lens <https://github.com/ekmett/lens/wiki/Operators> package). However, there are a handful of potential complications:

Combining prefix or postfix operators with member access: foo*.bar would need to be parsed as foo *. bar rather than (foo*).bar. Parentheses could be required to disambiguate.

Combining infix operators with contextual members: foo*.bar would need to be parsed as foo *. bar rather than foo * (.bar). Whitespace or parentheses could be required to disambiguate.

Hypothetically, if operators were accessible as members such as MyNumber.+, allowing operators with single .s would require escaping operator names (perhaps with backticks, such as MyNumber.`+`).

This would also require operators of the form [!?]*\. (for example . ?. !. !!.) to be reserved, to prevent users from defining custom operators that conflict with member access and optional chaining.
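
The reserved pattern [!?]*\. mentioned above can be expressed as a small predicate. This is only a sketch with a hypothetical function name, showing which tokens would have to be reserved under the single-dot alternative.

```swift
/// Hypothetical check for tokens matching [!?]*\. in full:
/// any number of ! or ? characters followed by exactly one final dot.
func isReservedDotForm(_ token: String) -> Bool {
    let scalars = Array(token.unicodeScalars)
    guard let last = scalars.last, last == "." else { return false }
    return scalars.dropLast().allSatisfy { $0 == "!" || $0 == "?" }
}

print(isReservedDotForm("."))    // true
print(isReservedDotForm("?."))   // true
print(isReservedDotForm("!!."))  // true
print(isReservedDotForm("+."))   // false
```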

We believe that requiring dots to appear in groups of at least two, while in some ways more restrictive, will prevent a significant amount of future pain, and does not require special-case considerations such as the above.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#future-directions>Future directions

While not within the scope of this proposal, the following considerations may provide useful context for the proposed changes. We encourage the community to pick up these topics when the time is right.

Introduce a syntax for method cascades. The Dart language supports method cascades <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>, whereby multiple methods can be called on an object within one expression: foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax can also be used with assignments and subscripts. Such a feature might be very useful in Swift; this proposal reserves the .. operator so that it may be added in the future.
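
For context, the effect of a cascade can be roughly emulated in Swift today with a helper function. Everything in this sketch (the Logger class and the cascade function) is hypothetical and purely illustrative; it is not a proposed API and says nothing about how a future .. syntax would be designed.

```swift
// A hypothetical reference type with methods that return nothing.
final class Logger {
    private(set) var lines: [String] = []
    func info(_ message: String) { lines.append("info: " + message) }
    func warn(_ message: String) { lines.append("warn: " + message) }
}

/// Hypothetical helper: run several calls against one object and return it,
/// similar in spirit to Dart's foo..bar()..baz().
@discardableResult
func cascade<T: AnyObject>(_ target: T, _ body: (T) -> Void) -> T {
    body(target)
    return target
}

let log = cascade(Logger()) {
    $0.info("start")
    $0.warn("low disk")
}
print(log.lines)  // ["info: start", "warn: low disk"]
```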

Introduce "mixfix" operator declarations. Mixfix operators are based on pattern matching and would allow more than two operands. For example, the ternary operator ? : can be defined as a mixfix operator with three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations such as _ [ _ ]. Some holes could be made @autoclosure, and there might even be holes whose argument is represented as an AST, rather than a value or thunk, supporting advanced metaprogramming (for instance, F#'s code quotations <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>). Should mixfix operators become supported, it would be sensible to add brackets to the set of valid operator characters.

Diminish or remove the lexical distinction between operators and identifiers. If precedence and fixity applied to traditional identifiers as well as operators, it would be possible to incorporate ASCII equivalents for standard operators (e.g. and for &&, to allow A and B). If additionally combined with mixfix operator support, this might enable powerful DSLs (for instance, C#'s LINQ <https://en.wikipedia.org/wiki/Language_Integrated_Query>).

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

Is this proposal still on track, or are there other plans to address the
issue of operator and identifier characters in Swift?

Nevin

On Fri, Feb 17, 2017 at 12:50 AM, Xiaodi Wu via swift-evolution < swift-evolution@swift.org> wrote:

As Stage 2 of Swift 4 evolution starts now, I'd like to share a revised
proposal in draft form.

It proposes a source-breaking change for *rationalizing* which characters
are permitted in identifiers and which in operators. It's justified for
this phase of Swift 4 because:

- Existing grammar, in permitting invisible characters without
security-minded restrictions, can be *actively harmful.*
- A rationalized approach is *superior* to the current approach: by
referencing Unicode standards, Swift should be able to evolve in a
backwards-compatible way alongside Unicode, and will benefit from the
significant expertise of others outside the Swift community with respect to
Unicode best practices.
- The vast majority of existing code (including all of the standard
library) should *require no migration* work at all

*What's changed* since the last time:

- In an earlier draft, we proposed some radical changes to align with
available Unicode standards; in particular, since emoji represent a
difficult issue, and no recommendations about "operator identifiers" have
surfaced from Unicode, we proposed temporarily stripping them out. This was *very
poorly received*. This revision uses Unicode categories to identify
nearly all emoji and classify them as identifier characters (while
excluding those that depict operators such as !), and it uses Unicode
categories to identify over 900 operators that nearly all pass the
subjective test of "operator-likeness."

What this proposal *does not attempt* to do:

- This document *does not* seek to stake out new ground as to what
characters should be *added* to the set of valid identifiers and
operators. Such additions to the grammar are properly separate discussions.
This proposal is only an attempt at systemization and rationalization. Only
one character is incidentally added to the list of valid characters (`\`),
and it is on the basis of an explicit table in Unicode Technical Report 25
regarding ASCII characters that are "mathematical."

What feedback would be *most helpful*:

- "Hey, this approach is so much more *clumsy* than my superior, more
elegant category-based approach to identifying [operators/emoji], which is
[insert here]."
- "Hey, I disagree with the detailed design because it's got a *major
security hole*, which is [insert here]."
- "Hey, your proposal would break my *real-world* Swift code, which
requires that character [X] be an [identifier/operator]."

What would be *less helpful*:

- "Hey, let's talk about how [specific character] should be an
[identifier/operator]. We should add that character to the list of
[identifiers/operators]. In fact, let's discuss [list] characters one by
one."

Acknowledgments:
Thanks to co-authors of the previous take for their support for
resurrecting this issue. Any brilliant ideas are undoubtedly theirs, and
any botched efforts are certainly mine. Thanks also to Nevin
Brackett-Rozinsky for helpful feedback.

Link:
https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651

Rendered text:

Refining identifier and operator symbology (take 2)

   - Proposal: SE-NNNN
   <https://gist.github.com/xwu/NNNN-refining-identifier-and-operator-symbology.md>
   - Authors: Xiaodi Wu <https://github.com/xwu>, Jacob Bandes-Storch
   <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>,
   Jonathan Shapiro, João Pinheiro <https://github.com/joaopinheiro>
   - Review Manager: TBD
   - Status: Awaiting review

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#introduction>
Introduction

This proposal refines and rationalizes Swift's identifier and operator
symbology. Specifically, this proposal:

   - refines the set of valid identifier characters based on Unicode
   recommendations, with customizations principally to accommodate emoji;
   - refines the set of valid operator characters based on Unicode
   categories; and
   - changes rules as to where dots may appear in operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#prior-discussion-threads-and-proposals>Prior
discussion threads and proposals

   - Define backslash '\' as a operator-head in the swift grammar
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170130/031461.html>
   - Refining Identifier and Operator Symbology
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20161017/028174.html> (a
   precursor to this document)
   - Proposal: Normalize Unicode identifiers
   <https://github.com/apple/swift-evolution/pull/531>
   - Lexical matters: identifiers and operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
   - Unicode identifiers & operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>,
   with pre-proposal
   <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59>
   - Proposal: Allow Single Dollar Sign as Valid Identifier
   <https://github.com/apple/swift-evolution/pull/354>
   - Free the '$' Symbol!
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
   - Request to add middle dot (U+00B7) as operator character?
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#motivation>
Motivation

Swift supports programmers from many languages and cultures. However, the
current identifier and operator character sets do not conform to any
Unicode standards, nor have they been rationalized in the language or
compiler documentation. These deserve a well-considered, standards-based
revision.

As Chris Lattner has written:

We need a token to be unambiguously an operator or identifier - we can
have different rules for the leading and subsequent characters though.

…our current operator space (particularly the Unicode segments covered) is
not super well considered. It would be great for someone to take a more
systematic pass over them to rationalize things.

Identifiers, which serve as *names* for various entities, are linguistic
in nature and must permit a variety of characters in order to properly
serve non–English-speaking coders. This issue has been considered by the
communities of many programming languages already, and the Unicode
Consortium has published recommendations on how to choose identifier
character sets. Swift should make an effort to conform to these
recommendations.

Operators, on the other hand, should be rare and carefully chosen because
they suffer from limited discoverability and readability. They are by
nature *symbols*, not names. This places a cognitive cost on users with
respect to recall ("What is the operator that applies the behavior I
need?") and recognition ("What does the operator in this code do?"). While
almost every non-trivial program defines new identifiers, most programs do
not define new operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#inconsistency>
Inconsistency

Concrete discrepancies and edge cases motivate these proposed changes. For
example:

   - The Greek question mark ; is a valid identifier.
   - Some *non-combining* diacritics ´ ¨ ꓻ are valid in identifiers.
   - Braille patterns ⠟, which are letter-like, are operator characters.
   - Other symbols such as ⚄ and ♄ are operator characters despite not
   being "operator-like."
   - Currency symbols are split across operators (¢ £ ¤ ¥) and
   identifiers (₪ € ₱ ₹ ฿ ...).
   - 🙂🤘▶️🛩 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
   - A few characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers
   and operators.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#invisible-distinctions>Invisible
distinctions

Identifiers that take advantage of Swift's Unicode support are not
normalized. This allows different representations of the same characters to
be considered distinct identifiers. For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"

Non-printing characters such as ZERO WIDTH SPACE and ZERO WIDTH NON-JOINER
are also accepted as valid identifier characters without any restrictions.

let ab = "ab"
let a​b = "a + ZERO WIDTH SPACE + b"

func xy() { print("xy") }
func x‌y() { print("x + ZERO WIDTH NON-JOINER + y") }
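The distinctions above can be made visible by looking at scalar values. The following is a minimal sketch, not part of the proposal; the helper name scalarValues is hypothetical. It assumes only the Swift standard library, whose String comparison uses canonical equivalence even though identifiers in source code are not normalized today.

```swift
// Three spellings that all render as "Å".
let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed  = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed   = "A\u{030A}"  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// Hypothetical helper: the raw scalar values, which is roughly what an
// un-normalized identifier comparison sees.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// The three spellings are distinct at the scalar level...
print(scalarValues(angstromSign))  // [8491]
print(scalarValues(precomposed))   // [197]
print(scalarValues(decomposed))    // [65, 778]

// ...yet canonically equivalent, so String equality treats them as equal.
print(angstromSign == precomposed, precomposed == decomposed)

// ZERO WIDTH SPACE is a separate scalar that canonical equivalence keeps.
let plain = "ab"
let withZeroWidthSpace = "a\u{200B}b"
print(plain == withZeroWidthSpace)
```

Normalization addresses the first group (the Å spellings), while the zero-width cases need explicit restrictions on which characters are allowed at all.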

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#timeline>
Timeline

These matters should be addressed in the near term (Swift 4), because
identifier and operator character sets are fundamental parts of Swift
grammar and any change to them is inevitably source-breaking.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#non-goals>
Non-goals

The aim of this proposal is to rationalize the set of valid operator
characters and the set of valid identifier characters using Unicode
categories and specific Unicode recommendations where available. The
smallest necessary customizations are made to increase backwards
compatibility, but no attempt is made to expand Swift grammar or to
"improve" Unicode. Specifically, the following questions are potential
subjects of separate study, either within the purview of the Swift open
source project or of the Unicode Consortium:

   -

   Expanding the set of valid operator or identifier characters. For
   example, $ is not currently a valid operator in Swift, there are no
   current Unicode recommendations regarding operators in programming
   languages, and $ is not enumerated among the list of "mathematical"
   characters in Unicode. Although it is possible for Swift to customize its
   implementation of Unicode recommendations to add $ as a valid
   operator, that is an expansion of Swift grammar distinct from the task of
   rationalizing Swift symbology according to Unicode standards. Therefore,
   this document neither proposes nor opposes its addition. For similar
   reasons, this document refines the inclusion of emoji in identifiers based
   on Unicode categories, but it neither proposes nor opposes the inclusion of
   non-emoji pictographic symbols to the set of valid identifier characters.
   -

   Rectifying Unicode shortcomings. Although it is possible to discover
   shortcomings concerning particular characters in the current version of
   Unicode, no attempt is made to preempt the Unicode standardization process
   by "patching" such issues in the Swift grammar. For example, in the current
   version of Unicode, ⁗ QUADRUPLE PRIME is not deemed to be "mathematical"
   (even though ‴ TRIPLE PRIME *is* deemed to be "mathematical").
   Certainly, this issue would be appropriate to report to Unicode and may
   well be corrected in a future revision of the standard. However, as the
   Swift community is not congruent with the community of experts that
   specialize in Unicode, there is no rational basis to expect that Swift-only
   determinations of what Unicode "should have done" (without vetting through
   Unicode's standardization processes) are likely to result in a better
   outcome than the existing Unicode standard. Therefore, no attempt is made
   to augment the Unicode derived category Math with ⁗ QUADRUPLE PRIME in
   this proposal. Similarly, Unicode recommends certain normalization forms
   for identifiers in code, which are proposed here for adoption by Swift, but
   these normalization forms do not eliminate all possible combinations of
   "confusable" characters. This proposal does not attempt to invent an ad-hoc
   normalization form in an attempt to "improve" Unicode recommendations.
   -

   Implementing additional features. Innovative ideas such as mixfix operators
   are detailed below in *Future directions*. This proposal does not
   attempt to introduce any such features.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#precedent-in-other-languages>Precedent
in other languages

Haskell distinguishes identifiers/operators by their general category
<http://www.fileformat.info/info/unicode/category/index.htm> (for
instance, "any Unicode lowercase letter" or "any Unicode symbol or
punctuation"). Variable identifiers can start with any lowercase letter or _, and
they may contain any letter, digit, ', or _. This includes letters like δ
and Я, and digits like ٢.

   - Haskell Syntax Reference
   <https://www.haskell.org/onlinereport/syntax-iso.html>
   - Haskell Lexer
   <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

Scala similarly allows letters, numbers, $, and _ in identifiers,
distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator
characters include mathematical and other symbols (Sm and So) in addition
to certain ASCII characters.

   - Scala Lexical Syntax
   <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

ECMAScript 2015 uses ID_Start and ID_Continue, as well as Other_ID_Start
and Other_ID_Continue, for identifiers.

   - ECMAScript Specification: Names and Keywords
   <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Python 3 uses XID_Start and XID_Continue.

   - The Python Language Reference: Identifiers and Keywords
   <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
   - PEP 3131: Supporting Non-ASCII Identifiers
   <https://www.python.org/dev/peps/pep-3131/>

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#proposed-solution>Proposed
solution

Identifiers. Adopt recommendations made in UAX#31 Identifier and Pattern
Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid
identifier characters from ID_Start and ID_Continue. Adopt specific
customizations principally to accommodate emoji. Consider two identifiers
equivalent when they produce the same normalized form under Normalization
Form C (NFC) <http://unicode.org/reports/tr15/>, as recommended in UAX#31
for case-sensitive use cases.
Is an identifier | Is not an identifier
Shall be an identifier 120,617 code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] %26+[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=> 699
emoji
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]] -[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]&g=&i=>
Shall not be an identifier 846,137 unassigned code points;
4,929 other code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]] -[[%3AID_Continue%3A] _ [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji_Flag_Sequences%3A]+[%3AEmoji_Keycap_Sequences%3A]+[%3AEmoji_Modifier_Sequences%3A]]]]&g=&i=> *All
other code points*

Operators. No Unicode recommendation currently exists on the topic of
"operator identifiers," although work is ongoing as part of a future update
to UAX#31. The aim of the proposed definition presented in this document is
to identify, using Unicode categories, a reasonable set of operators that
(a) may be in current use in Swift code; and (b) are likely to be included
in future versions of UAX#31. It is not intended to be a final judgment on
all code points that should ever be valid in Swift operators, for which it
is proposed that Swift await the recommendations of the Unicode Consortium.

Therefore, adopt an approach to define the set of valid operator
characters based primarily on the Unicode categories Math and
Pattern_Syntax (an approach analogous to that which is used to define ID_Start
and ID_Continue in Unicode recommendations), informed by UAX#25 Unicode
Support for Mathematics <http://www.unicode.org/reports/tr25/>. Augment
the set of valid operator characters with a number of currently valid Swift
operator characters to increase backward compatibility. Consider two
operators equivalent when they produce the same normalized form under Normalization
Form KC (NFKC) <http://unicode.org/reports/tr15/>, as recommended in
UAX#31 for case-insensitive use cases. Fullwidth variants such as FULLWIDTH
HYPHEN-MINUS are equivalent to their non-fullwidth counterparts after
normalization under NFKC (but not NFC).
Is an operator | Is not an operator
Shall be an operator 986 code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]%26[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]]>
\
Shall not be an operator 130 unassigned code points;
2,024 other code points
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[[ [%2F%20\-%20%2B%20!%20*%20%%20<->%20\%26%20|%20\^%20~%20%3F] U%2B00A1-U%2B00A7 U%2B00A9%20U%2B00AB U%2B00AC%20U%2B00AE U%2B00B0-U%2B00B1%20U%2B00B6%20U%2B00BB%20U%2B00BF%20U%2B00D7%20U%2B00F7 U%2B2016-U%2B2017%20U%2B2020-U%2B2027 U%2B2030-U%2B203E U%2B2041-U%2B2053 U%2B2055-U%2B205E U%2B2190-U%2B23FF U%2B2500-U%2B2775 U%2B2794-U%2B2BFF U%2B2E00-U%2B2E7F U%2B3001-U%2B3003 U%2B3008-U%2B3030 ] [ U%2B0300-U%2B036F U%2B1DC0-U%2B1DFF U%2B20D0-U%2B20FF U%2BFE00-U%2BFE0F U%2BFE20-U%2BFE2F U%2BE0100-U%2BE01EF ]]-[[%3APattern_Syntax%3A]%20%26%20[%3AMath%3A] -%20[%3ABlock%3DGeometric%20Shapes%3A] -%20[%3ABlock%3DMiscellaneous%20Symbols%3A] -%20[%3ABlock%3DMiscellaneous%20Technical%3A] [!%20%%20\%26%20*%20\-%20%2F%20%3F%20\\%20\^%20¡%20¦%20§%20°%20¶%20¿%20†%20‡%20•%20‰%20‱%20※%20‽%20⁂%20⁅%20⁆%20⁊%20⁋%20⁌%20⁍%20⁎%20⁑]]]> *All
other code points*
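The difference between NFC and NFKC for fullwidth variants can be checked with a short sketch. This assumes Foundation is available, since its precomposedStringWithCanonicalMapping and precomposedStringWithCompatibilityMapping properties produce NFC and NFKC respectively. The helper scalarValues is hypothetical and not part of the proposal.

```swift
import Foundation

let fullwidthMinus = "\u{FF0D}"  // FULLWIDTH HYPHEN-MINUS
let asciiMinus = "-"             // HYPHEN-MINUS (U+002D)

// Hypothetical helper: compare strings scalar by scalar, bypassing the
// canonical-equivalence comparison that String's == performs.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

let nfc = fullwidthMinus.precomposedStringWithCanonicalMapping       // NFC
let nfkc = fullwidthMinus.precomposedStringWithCompatibilityMapping  // NFKC

// NFC leaves the fullwidth form untouched; NFKC folds it to U+002D.
print(scalarValues(nfc) == scalarValues(asciiMinus))   // false
print(scalarValues(nfkc) == scalarValues(asciiMinus))  // true
```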

Dots. Adopt a rule to allow dots to appear in operators at any location,
but only in runs of two or more. (Currently, an operator may contain dots
only if it begins with a dot.)

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#detailed-design>Detailed
design
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#identifiers>
Identifiers

Swift identifier characters shall conform to UAX#31
<http://unicode.org/reports/tr31/#Conformance> as follows:

   -

   UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance
   described herein refers to the Unicode 9.0.0 version of UAX#31.
   -

   UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe
   the following requirements:
   -

      UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment
      the definition of "Default Identifiers" with the following profiles:
      1.

         ID_Start and ID_Continue shall be used for Start and Continue,
         replacing XID_Start and XID_Continue. This excludes characters
         in Other_ID_Start and Other_ID_Continue.
         2.

         _ 005F LOW LINE shall additionally be allowed as a Start
          character.
         3.

         Certain emoji shall additionally be allowed as Start characters.
         A detailed design for emoji permitted in identifiers is given below.
         4.

         UAX31-R1a. <http://unicode.org/reports/tr31/#R1a> The
         join-control characters ZWJ and ZWNJ are strictly limited to the special
         cases A1, A2, and B described in UAX#31.
         -

      UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall
      consider two identifiers equivalent when they produce the same normalized
      form under Normalization Form C (NFC)
      <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
      case-sensitive use cases.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes>Grammar
changes

identifier-head → [:ID_Start:]
identifier-head → _
identifier-head → identifier-emoji
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
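As a rough illustration of these productions, the following sketch approximates identifier-head and identifier-character with the isIDStart and isIDContinue properties that the Swift standard library exposes on Unicode.Scalar.Properties (an API that postdates this draft). The function names are hypothetical. The sketch deliberately omits identifier-emoji, the ZWJ/ZWNJ special cases, and NFC normalization, and its results depend on the Unicode version of the runtime rather than on Unicode 9.0.0.

```swift
// identifier-head → [:ID_Start:] | _   (emoji omitted in this sketch)
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.isIDStart
}

// identifier-character → identifier-head | [:ID_Continue:]
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// Hypothetical helper: one head scalar followed by identifier characters.
func isPlausibleIdentifier(_ text: String) -> Bool {
    let scalars = Array(text.unicodeScalars)
    guard let first = scalars.first, isIdentifierHead(first) else {
        return false
    }
    return scalars.dropFirst().allSatisfy(isIdentifierCharacter)
}

print(isPlausibleIdentifier("_private"))       // true
print(isPlausibleIdentifier("2fast"))          // false: a digit cannot start
print(isPlausibleIdentifier("a\u{200B}b"))     // false: ZERO WIDTH SPACE
```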

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#operators>
Operators

Swift operator characters shall be determined as follows:

   -

   Valid operator characters shall consist of Pattern_Syntax code points
   with a derived property Math. However, the following blocks are
   excluded: Geometric Shapes, Miscellaneous Symbols, and Miscellaneous
   Technical. In UnicodeSet notation:

   [:Pattern_Syntax:] & [:Math:]
   - [:Block=Geometric Shapes:]
   - [:Block=Miscellaneous Symbols:]
   - [:Block=Miscellaneous Technical:]

   Math captures a fuller set of operators than is possible using Sm, and
   we avoid the inclusion of characters in So that are clearly not
   "operator-like" (such as Braille). Math code points in the excluded
   blocks include sign parts such as ⎲ SUMMATION TOP and tenuously
   "operator-like" code points such as ♠️ BLACK SPADE SUIT.
   -

   The set of valid operator characters shall be augmented with the
   following ASCII characters: !, %, &, *, -, /, ?, \, ^. These ASCII
   characters are required by the Swift standard library and/or considered
   "weakly mathematical" in UAX#25 <http://www.unicode.org/reports/tr25/>.
   -

   For increased compatibility with Swift 3, the set of valid operator
   characters shall be augmented with the following Latin-1 Supplement
   characters: ¡, ¦, §, °, ¶, ¿. For the same reason, augment the set of
   valid operator characters with the following General Punctuation
   characters: † DAGGER, ‡ DOUBLE DAGGER, • BULLET, ‰ PER MILLE SIGN, ‱ PER
   TEN THOUSAND SIGN, ※ REFERENCE MARK, ‽ INTERROBANG, ⁂ ASTERISM, ⁅ LEFT
   SQUARE BRACKET WITH QUILL, ⁆ RIGHT SQUARE BRACKET WITH QUILL, ⁊ TIRONIAN
   SIGN ET, ⁋ REVERSED PILCROW SIGN, ⁌ BLACK LEFTWARDS BULLET, ⁍ BLACK
   RIGHTWARDS BULLET, ⁎ LOW ASTERISK, ⁑ TWO ASTERISKS ALIGNED VERTICALLY.
   -

   Swift shall consider two operators equivalent when they produce the
   same normalized form under Normalization Form KC (NFKC)
   <http://unicode.org/reports/tr15/>, as recommended in UAX#31 for
   *case-insensitive* use cases. Crucially, fullwidth variants such as
   FULLWIDTH HYPHEN-MINUS are equivalent to their non-fullwidth counterparts
   after normalization under NFKC (but not NFC).
   -

   Certain strongly mathematical arrows now have an *alternative* emoji
   presentation, and future versions of Unicode may add such an emoji
   presentation to any Swift operator character. Some but not all
   "environments" or applications (for instance, Safari but not TextWrangler)
   display the alternative emoji presentation at all times, and such
   discrepancies between applications are explicitly permitted by Unicode
   recommendations (see discussion in *Emoji*). However, it would be
   highly unusual to define the set of valid operator characters based on an
   essentially arbitrary criterion as to whether an alternative emoji
   presentation is retroactively assigned to a code point, and codifying how
   IDEs display Unicode characters in Swift files is outside the scope of this
   proposal. Therefore, valid operator characters are defined without regard
   to the presence or absence of an alternative emoji presentation, and U+FE0E
   VARIATION SELECTOR-15 (text presentation selector) is *optionally* permitted
   to follow an operator character that has an alternative emoji presentation.
   Note that variation selectors are discarded by normalization.

These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3APattern_Syntax%3A]+%26+[%3AMath%3A] -+[%3ABlock%3DGeometric+Shapes%3A] -+[%3ABlock%3DMiscellaneous+Symbols%3A] -+[%3ABlock%3DMiscellaneous+Technical%3A] [!+%+\%26+*+\-+%2F+%3F+\\+\^+¡+¦+§+°+¶+¿+†+‡+•+‰+‱+※+‽+⁂+⁅+⁆+⁊+⁋+⁌+⁍+⁎+⁑]&g=&i=> produce
a set of 987 code points for operator characters. Since ID_Start is
derived in part by exclusion of Pattern_Syntax code points, it is assured
that operator and identifier characters do not overlap (although this
assurance does not extend to emoji, which require additional design as
detailed below).
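The operator rules above can be approximated in a few lines. This is a sketch under stated assumptions, not the proposed implementation: it relies on the isPatternSyntax and isMath properties of Unicode.Scalar.Properties in the Swift standard library (an API that postdates this draft), it encodes the three excluded blocks as scalar-value ranges, and it ignores normalization and variation selectors. All names are hypothetical, and the exact set produced depends on the runtime's Unicode version.

```swift
// Blocks excluded by the rules above, written as scalar-value ranges.
let excludedBlocks: [ClosedRange<UInt32>] = [
    0x25A0...0x25FF,  // Geometric Shapes
    0x2600...0x26FF,  // Miscellaneous Symbols
    0x2300...0x23FF,  // Miscellaneous Technical
]

// Characters added for the standard library and Swift 3 compatibility.
let augmentedOperatorScalars = Set(
    "!%&*-/?\\^¡¦§°¶¿†‡•‰‱※‽⁂⁅⁆⁊⁋⁌⁍⁎⁑".unicodeScalars
)

// Hypothetical helper approximating the set of valid operator characters.
func isOperatorCharacter(_ scalar: Unicode.Scalar) -> Bool {
    if augmentedOperatorScalars.contains(scalar) { return true }
    let properties = scalar.properties
    guard properties.isPatternSyntax && properties.isMath else { return false }
    return !excludedBlocks.contains { $0.contains(scalar.value) }
}

print(isOperatorCharacter("+"))         // true: Pattern_Syntax and Math
print(isOperatorCharacter("-"))         // true: augmented ASCII character
print(isOperatorCharacter("\u{2660}"))  // false: BLACK SPADE SUIT
print(isOperatorCharacter("$"))         // false: not Math
```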

All current restrictions on reserved tokens and operators remain. Swift
reserves =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and
postfix !.
<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#dots>Dots

Swift's existing rule for dots in operators is:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal modifies the rule to:

Dots may only appear in operators in sequences of two or more.

Incorporating the "two-dot rule" offers the following benefits:

   -

   It avoids lexical complications arising from lone dots.
   -

   The approach is conservative, erring on the side of overly
   restrictive. Dropping the rule in future (and thereby allowing single dots)
   may be possible.
   -

   It does not require special cases for existing infix dot operators in
   the standard library, ... (closed range) and ..< (half-open range). It
   leaves open the possibility of adding analogous half-open and fully-open
   range operators <.. and <..<.

Finally, this proposal *reserves* the .. operator for a possible "method
cascade" syntax in the future, as supported by Dart
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-1>Grammar
changes

operator → operator-head operator-characters[opt]

operator-head → [[:Pattern_Syntax:] & [:Math:] - [:Emoji:] - [:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] - [:Block=Miscellaneous Technical:]]
operator-head → [[:Pattern_Syntax:] & [:Math:] & [:Emoji:] - [:Block=Geometric Shapes:] - [:Block=Miscellaneous Symbols:] - [:Block=Miscellaneous Technical:]] U+FE0E[opt]
operator-head → ! | % | & | * | - | / | ? | \ | ^ | ¡ | ¦ | § | ° | ¶ | ¿
operator-head → † | ‡ | • | ‰ | ‱ | ※ | ‽ | ⁂ | ⁅ | ⁆ | ⁊ | ⁋ | ⁌ | ⁍ | ⁎ | ⁑
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
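The effect of the operator-dot productions can be illustrated with a small checker. This is a sketch only; the function name is hypothetical and it tests just the two-dot rule, not whether the remaining characters are valid operator characters or whether the operator is reserved (as .. is under this proposal).

```swift
// Hypothetical helper: true when every maximal run of "." in the
// candidate operator has length two or more.
func satisfiesTwoDotRule(_ operatorText: String) -> Bool {
    var currentRun = 0
    for character in operatorText {
        if character == "." {
            currentRun += 1
        } else {
            // A run that just ended with exactly one dot breaks the rule.
            if currentRun == 1 { return false }
            currentRun = 0
        }
    }
    // The operator may also end in a run of dots; check that run too.
    return currentRun != 1
}

print(satisfiesTwoDotRule("..<"))   // true
print(satisfiesTwoDotRule("<..<"))  // true
print(satisfiesTwoDotRule(".+"))    // false: valid today, rejected by the new rule
print(satisfiesTwoDotRule("+.+"))   // false
```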

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#emoji>Emoji

The inclusion of emoji among valid identifier characters, though highly
desired, presents significant challenges:

   -

   Emoji characters are not displayed uniformly across different
   platforms.
   -

   Whether any particular character is presented as emoji or text depends
   on a matrix of considerations, including "environment" (e.g., Safari vs.
   Xcode), presence or absence of a variant selector, and whether the
   character itself defaults to "emoji presentation" or "text presentation."
   This behavior is specifically documented in Unicode recommendations
   <http://unicode.org/reports/tr51/#Presentation_Style>.
   -

   Some emoji not classified as Math depict operators: ❗️❓➕➖➗✖️. A
   Unicode chart <http://unicode.org/emoji/charts/emoji-ordering.html> provides
   additional information by dividing emoji according to "rough categories,"
   but it warns that these categories "may change at any time, and should not
   be used in production."
   -

   Full emoji support would require allowing identifiers to contain
   zero-width joiner sequences that UAX#31 would forbid. Some normalization
   scheme would have to be devised to account for Unicode recommendations that
   👩‍❤️‍👨 (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F468) can be
   displayed as either 💑 (U+1F491) or, as a fallback, 👩❤️👨 (U+1F469
   U+2764 U+FE0F U+1F468).

For maximum consistency across platforms, valid emoji in Swift identifiers
shall be determined using the following rules:

   -

   Emoji shall include code points with default emoji presentation (as
   opposed to text presentation), minus Emoji_Defectives and ID_Continue.
   Exclude Pattern_Syntax code points unless they are in the following
   blocks: Miscellaneous Symbols, Miscellaneous Technical.
   -

   Emoji shall include Emoji code points with default text presentation *when
   immediately followed by U+FE0F VARIATION SELECTOR-16 (emoji presentation
   selector)*, minus Emoji_Defectives and ID_Continue. Again, exclude
   Pattern_Syntax code points unless they are in the following blocks:
   Miscellaneous Symbols, Miscellaneous Technical. (Note that the emoji picker
   on Apple platforms--and, possibly, other platforms--automatically inserts
   U+FE0F VARIATION SELECTOR-16 when a user selects such code points; for
   instance, selecting ❤️ inserts U+2764 U+FE0F. Therefore, it is
   important that the invisible U+FE0F be permitted strictly in this use case.
   Note also that variation selectors are discarded by normalization.)
   -

   Emoji shall include Emoji_Flag_Sequences, Emoji_Keycap_Sequences, and
   (to the extent not already included) Emoji_Modifier_Sequences.

These revised rules
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji_Presentation%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+-+[%3APattern_Syntax%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Symbols%3A]] [[%3AEmoji%3A]+-+[%3AEmoji_Defectives%3A]+-+[%3AEmoji_Presentation%3A]+-+[%3AID_Continue%3A]+%26+[%3APattern_Syntax%3A]+%26+[%3ABlock%3DMiscellaneous+Technical%3A]] [%3AEmoji_Flag_Sequences%3A] [%3AEmoji_Keycap_Sequences%3A] [%3AEmoji_Modifier_Sequences%3A]&g=&i=> produce
a set of 1,625 code points or sequences, of which 98 are currently
categorized as operator characters.
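The first two emoji rules can be sketched as follows. This is a simplified approximation with hypothetical names, assuming the isEmoji, isEmojiPresentation, isIDContinue, and isPatternSyntax properties of Unicode.Scalar.Properties in the Swift standard library (an API that postdates this draft). It does not subtract Emoji_Defectives (so, for example, a lone regional indicator is not rejected) and it does not handle flag, keycap, or modifier sequences.

```swift
// Pattern_Syntax code points are kept only inside these two blocks.
let permittedSyntaxBlocks: [ClosedRange<UInt32>] = [
    0x2600...0x26FF,  // Miscellaneous Symbols
    0x2300...0x23FF,  // Miscellaneous Technical
]
let emojiPresentationSelector: Unicode.Scalar = "\u{FE0F}"  // VARIATION SELECTOR-16

// Shared exclusions: ID_Continue, and Pattern_Syntax outside the two blocks.
func passesExclusions(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    if properties.isIDContinue { return false }
    if properties.isPatternSyntax {
        return permittedSyntaxBlocks.contains { $0.contains(scalar.value) }
    }
    return true
}

// Hypothetical approximation of identifier-emoji for one or two scalars.
func isIdentifierEmoji(_ text: String) -> Bool {
    let scalars = Array(text.unicodeScalars)
    guard let base = scalars.first,
          base.properties.isEmoji,
          passesExclusions(base) else {
        return false
    }
    if base.properties.isEmojiPresentation {
        // Default emoji presentation: the code point stands alone.
        return scalars.count == 1
    }
    // Default text presentation: U+FE0F must follow immediately.
    return scalars.count == 2 && scalars[1] == emojiPresentationSelector
}

print(isIdentifierEmoji("\u{1F642}"))         // SLIGHTLY SMILING FACE
print(isIdentifierEmoji("\u{2639}"))          // WHITE FROWNING FACE, no selector
print(isIdentifierEmoji("\u{2639}\u{FE0F}"))  // WHITE FROWNING FACE + U+FE0F
print(isIdentifierEmoji("\u{2795}"))          // HEAVY PLUS SIGN
```

Under these rules the frowning face qualifies only when followed by U+FE0F, while HEAVY PLUS SIGN is rejected because it is a Pattern_Syntax code point outside the two permitted blocks.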

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#grammar-changes-2>Grammar
changes

identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] - [:Pattern_Syntax:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Symbols:]]
identifier-emoji → [[:Emoji_Presentation:] - [:Emoji_Defectives:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Technical:]]
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] - [:Pattern_Syntax:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Symbols:]] U+FE0F
identifier-emoji → [[:Emoji:] - [:Emoji_Defectives:] - [:Emoji_Presentation:] - [:ID_Continue:] & [:Pattern_Syntax:] & [:Block=Miscellaneous Technical:]] U+FE0F
identifier-emoji → [[:Emoji_Flag_Sequences:] [:Emoji_Keycap_Sequences:] [:Emoji_Modifier_Sequences:]]

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#source-compatibility>Source
compatibility

This change is source-breaking where developers have incorporated certain
emoji in identifiers or certain non-ASCII characters in operators. This is
unlikely to be a significant breakage for the majority of Swift code.
Diagnostics for invalid characters are already produced today. We can
improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: keep the old
parsing and identifier lookup code.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-abi-stability>Effect
on ABI stability

This proposal does not affect the ABI format itself. However, normalization
of Unicode identifiers would affect the ABI of compiled modules that use
identifiers changed by normalization. The standard library will not be
affected; it uses ASCII symbols with no combining characters.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#effect-on-api-resilience>Effect
on API resilience

This proposal doesn't affect API resilience.

<https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651#alternatives-considered>Alternatives
considered

   -

   Use NFKC instead of NFC for identifiers. The decision to use NFC is
   based on UAX#31, which states:

   Generally if the programming language has case-sensitive identifiers,
   then Normalization Form C is appropriate; whereas, if the programming
   language has case-insensitive identifiers, then Normalization Form KC is
   more appropriate.

   -

   Eliminate emoji from identifiers and restrict operator characters to a
   limited number of ASCII code points. This approach would be simpler, but
   feedback on Swift-Evolution has been overwhelmingly against such a change.
   -

   Hand-pick a set of "operator-like" characters to include. The proposal
   authors tried this painstaking approach and came up with a relatively
   agreeable set of about 650 code points
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]>.
   Such a list can carefully avoid idiosyncrasies in the Unicode standard.
   However, a character-by-character inventory is unlikely to converge on
   consensus, is as likely to introduce unintended Swift-specific idiosyncrasies
   as it is to avoid Unicode shortcomings, and is inconsistent with the Unicode
   method of deriving such lists using categories.
   -

   Continue to allow single . in operators, perhaps even expanding the
   original rule to allow them anywhere (even if the operator does not begin
   with .).

   This would allow a wider variety of custom operators (for some
   interesting possibilities, see the operators in Haskell's Lens
   <https://github.com/ekmett/lens/wiki/Operators> package). However,
   there are a handful of potential complications:

      - Combining prefix or postfix operators with member access: foo*.bar would
      need to be parsed as foo *. bar rather than (foo*).bar. Parentheses
      could be required to disambiguate.

      - Combining infix operators with contextual members: foo*.bar would
      need to be parsed as foo *. bar rather than foo * (.bar).
      Whitespace or parentheses could be required to disambiguate.

      - Hypothetically, if operators were accessible as members such as
      MyNumber.+, allowing operators with single .s would require
      escaping operator names (perhaps with backticks, such as
      MyNumber.`+`).

   This would also require operators of the form [!?]*\. (for example . ?.
    !. !!.) to be reserved, to prevent users from defining custom
   operators that conflict with member access and optional chaining.

   We believe that requiring dots to appear in groups of at least two,
   while in some ways more restrictive, will prevent a significant amount of
   future pain, and does not require special-case considerations such as the
   above.
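To make the preferred rule concrete, here is a minimal sketch of a custom operator whose dots appear as a pair. The operator `..+` and its element-wise meaning are hypothetical, chosen only for illustration, and the sketch assumes the operator declaration syntax of Swift 4 or later.

```swift
// Hypothetical operator for illustration: element-wise addition spelled "..+".
// Its dots come as a group of two, so it would satisfy the "at least two dots" rule.
infix operator ..+ : AdditionPrecedence

func ..+ (lhs: [Int], rhs: [Int]) -> [Int] {
    // Pair up the elements and add each pair; extra elements are dropped by zip.
    return zip(lhs, rhs).map { $0 + $1 }
}

// Whitespace on both sides makes this an unambiguous infix use.
let sum = [1, 2, 3] ..+ [10, 20, 30]
print(sum)
```

Because no single dot ever appears inside such an operator, an expression like foo*.bar never needs to consider an operator spelled `*.`, which is the ambiguity the bullets above describe.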

Future directions

While not within the scope of this proposal, the following considerations
may provide useful context for the proposed changes. We encourage the
community to pick up these topics when the time is right.

   - Introduce a syntax for method cascades. The Dart language supports method
   cascades
   <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>,
   whereby multiple methods can be called on an object within one expression:
   foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This
   syntax can also be used with assignments and subscripts. Such a feature
   might be very useful in Swift; this proposal reserves the .. operator
   so that it may be added in the future.

   - Introduce "mixfix" operator declarations. Mixfix operators are based
   on pattern matching and would allow more than two operands. For example,
   the ternary operator ? : can be defined as a mixfix operator with
   three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix
   declarations such as _ [ _ ]. Some holes could be made @autoclosure,
   and there might even be holes whose argument is represented as an AST,
   rather than a value or thunk, supporting advanced metaprogramming (for
   instance, F#'s code quotations
   <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).
   Should mixfix operators become supported, it would be sensible to add
   brackets to the set of valid operator characters.

   - Diminish or remove the lexical distinction between operators and
   identifiers. If precedence and fixity applied to traditional
   identifiers as well as operators, it would be possible to incorporate ASCII
   equivalents for standard operators (e.g. and for &&, to allow A and B).
   If additionally combined with mixfix operator support, this might enable
   powerful DSLs (for instance, C#'s LINQ
   <https://en.wikipedia.org/wiki/Language_Integrated_Query>).
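As a rough illustration of the method-cascade idea in the first item above, the effect of foo..bar()..baz() can be approximated in today's Swift with a small helper. This is only a sketch: the `cascade` function and the `Recorder` class are hypothetical names invented for this example, not part of Swift or of this proposal.

```swift
// Hypothetical helper: run `body` on `target`, then hand `target` back,
// approximating what a cascade expression would evaluate to.
@discardableResult
func cascade<T: AnyObject>(_ target: T, _ body: (T) -> Void) -> T {
    body(target)
    return target
}

// Hypothetical class used only to demonstrate the helper.
final class Recorder {
    var events: [String] = []
    func bar() { events.append("bar") }
    func baz() { events.append("baz") }
}

// Roughly `Recorder()..bar()..baz()`: both calls act on the same receiver,
// and the receiver itself is the value of the whole expression.
let foo = cascade(Recorder()) { r in
    r.bar()
    r.baz()
}
print(foo.events)
```

A dedicated `..` syntax would remove the need for the closure and the helper, which is why the spelling is worth keeping free.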

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

What feedback would be *most helpful*:

- "Hey, this approach is so much more *clumsy* than my superior, more
elegant category-based approach to identifying [operators/emoji], which is
[insert here]."
- "Hey, I disagree with the detailed design because it's got a *major
security hole*, which is [insert here]."
- "Hey, your proposal would break my *real-world* Swift code, which
requires that character [X] be an [identifier/operator]."

I like the approach taken here, and it is a much better way of determining
the character sets. I don't disagree with the design and don't have any example
code that will be affected, but I do have some (minor) observations about
the proposal.

Thanks Alex! I've updated the document accordingly. Here's the link:
https://github.com/xwu/swift-evolution/blob/d1643c5c451232a277fe77b22fb891cdae90dcb4/proposals/NNNN-refining-identifier-and-operator-symbology.md

* The 'Dots' treatment feels like a special case in an otherwise good
write-up of Unicode, seemingly to lean towards Dart's method chaining
and/or cleanliness of implementation. It might be clearer to pull that out
to its own proposal, either independent of or building upon the general
Unicode changes?

Excellent point. I've removed mentions of method cascades. The rationale
for revising the "dots rule" is clarified in the context of alignment to
Unicode (or more accurately here, skating to where Unicode will be).

* The grammar changes for the operator head contain a number of (what seem
like) hand-picked Unicode symbols for increased compatibility with Swift 3
(e.g. dagger and friends). Maybe these could be pulled out into their own
group e.g. operator-head -> operator-head-swift3, to call out the reason
for their hand-picked nature (and for later cleanup, should that be
required).

Done.

* In the proposed solution tables (shall be an identifier/is an identifier),
it wasn't clear to me at first what the rows and columns were. Maybe calling
these out as a bulleted list would be better:

- Identifiers under Swift 3 and this proposal: 120,617 code points
- Identifiers that would be added under this proposal: 699 emoji
- Identifiers under Swift 3 that would no longer be identifiers:
unassigned code points and 4,929 other code points

Similarly, for operators:

- Operators under Swift 3 and this proposal: 986 code points
- Operators that would be added under this proposal: \
- Operators under Swift 3 that would no longer be operators:
unassigned code points and 2,024 other code points

You could summarise that as a pseudo-diff --stat

Identifiers
+ 699 emoji
  120,617 code points
- 4,929 code points and unassigned code points

Operators
+ 1 code point \
  986 code points
- 2,024 code points

Alternatively you could change the 'Is an identifier/operator' to 'Is a
Swift 3 identifier' to make it clear that it's the Swift 3 header, but the
tabular form is still not that clear to me.

I've converted this to bulleted lists like you suggest.

Another stat that would be worth calling out: of the 2,024 code points
that are no longer operators, what is the overlap with the 699 emoji that
are added to the identifiers? If all 699 were among them, then it would only
be 1,325 operators that were no longer valid.

The answer to that is 98; the other 601 are emoji sequences that weren't
permitted previously. I've incorporated this information into the text.

···

On Mon, Feb 20, 2017 at 12:29 PM, Alex Blewitt <alblue@apple.com> wrote:

On 17 Feb 2017, at 05:50, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

To conclude: I like the look of the proposal from the block set
definition, which will be better than hand-picking the character set as the
grammar currently stands.

Alex

This looks very good Xiaodi, and I have a few thoughts about it.

First, is the intent that Swift will follow future changes to Unicode
operator recommendations, or that Swift will choose a “frozen in time” set
of Unicode recommendations to adopt? If the former, then we will likely see
source-breaking changes as Unicode evolves. And if the latter, then Swift’s
choices are apt to diverge even more from Unicode’s over time.

Second, it is well-established that programming operators do not have to be
mathematical. For example, Swift uses the punctuation marks ‘!’, ‘?’, and
‘&’ as operators in its standard library. The approach described in your
proposal does an excellent job at covering the core mathematical operator
characters in Unicode, however it does not appear to make such an effort
toward non-mathematical operators.

Of particular note, given that ‘?’, ‘¿’, and ‘‽’ are operator characters,
it seems inconsistent to omit ‘⸘’. Similarly, with ‘&’ an operator, one
would expect ‘⅋’ to be as well. I see that “expanding the set of operator
characters” is listed as a non-goal, however that does not make it an
anti-goal, and the proposal indeed expands the set by adding ‘\’. Likewise
“rectifying Unicode shortcomings” is listed as a non-goal, although the
proposal incorporates some 16 characters for Swift 3 compatibility.

Another point that may be worth considering is the two specific
characters ‘∅’ and ‘∞’ which, although strongly mathematical, are
definitely not operators. They are names for things—objects, quantities—and
thus by the principle of least surprise they should be available for use in
identifier names. Just as one might write “let π = Double.pi” at the top of
a file, so too might one write “let ∞ = Double.infinity” or “let ∅ =
Set<Int>()” for use later on:

let y = sin(π * x)
if tan(θ) == ∞ { … }
var s = ∅

Thus, for the purpose of consistency, I think it makes sense to classify
‘∅’ and ‘∞’ as identifiers, as well as ‘⸘’ and ‘⅋’ as operators.
Alternatively, ‘∞’ could be a floating-point literal, in which case it
still would not be an operator.
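The asymmetry described here can be sketched as follows, assuming the Swift 3/4 grammar under discussion. The constant names `infinity` and `isUnbounded` are hypothetical and exist only for this illustration.

```swift
import Foundation

// U+03C0 (π) and U+03B8 (θ) are letters, so they lex as identifiers.
let π = Double.pi
let θ = π / 6
let y = sin(θ)   // approximately 0.5

// Left commented out: U+221E (∞) and U+2205 (∅) fall in a range that the
// Swift 3/4 grammar treats as operator characters, so these declarations
// are rejected by the compiler.
// let ∞ = Double.infinity
// let ∅ = Set<Int>()

// A plainly named constant is the workaround available without grammar changes.
let infinity = Double.infinity
let isUnbounded = infinity > Double.greatestFiniteMagnitude
print(y, isUnbounded)
```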

I understand that you described this type of feedback (on particular
characters) as “less helpful”, however it appears that the “most helpful”
types of feedback are unnecessary: the proposal is well thought out, with a
strong core approach. It is only in the fine details that a few
improvements can be made, “lesser” though they may be.

Nevin

Real Swift code uses very very few “unicode” operators, so I would heavily
tilt the division towards making most characters identifiers. While I don’t
want to talk about specific characters, I often wish I could name variables
`∇f` or `∂u∂v`, while no sane API designer would ever use `∇` or `∂` as
operators, even though they are considered “mathematical”. I think the bar
for making a character an operator should be higher: no character should be
classified as an operator if it can appear in language as part of an
identifier.

···

On Tue, Aug 8, 2017 at 2:10 PM, Nevin Brackett-Rozinsky via swift-evolution <swift-evolution@swift.org> wrote:

Is this proposal still on track, or are there other plans to address the
issue of operator and identifier characters in Swift?

Nevin

On Fri, Feb 17, 2017 at 12:50 AM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

What would be *less helpful*:

- "Hey, let's talk about how [specific character] should be an
[identifier/operator]. We should add that character to the list of
[identifiers/operators]. In fact, let's discuss [list] characters one by
one."

Acknowledgments:
Thanks to co-authors of the previous take for their support for
resurrecting this issue. Any brilliant ideas are undoubtedly theirs, and
any botched efforts are certainly mine. Thanks also to Nevin
Brackett-Rozinsky for helpful feedback.

Link:
https://gist.github.com/xwu/d2c2bb7097b0b5a4e9985aae737a2651

This looks very good Xiaodi, and I have a few thoughts about it.

First, is the intent that Swift will follow future changes to Unicode
operator recommendations, or that Swift will choose a “frozen in time” set
of Unicode recommendations to adopt? If the former, then we will likely see
source-breaking changes as Unicode evolves. And if the latter, then Swift’s
choices are apt to diverge even more from Unicode’s over time.

Great question. I guess the text leaves the mechanics of forward
compatibility unsaid. The answer is: both.

With respect to Unicode identifiers, UAX#31 guarantees future compatibility
for ID_Start and ID_Continue. That is, anything that is currently valid in
ID_Start will be valid in ID_Start for all time. It is reasonable to expect
that the same experts will adopt that approach for their operator
recommendations in the future. Indeed, they have set themselves up fairly
well for this already: UAX#31 also guarantees that Pattern_Syntax
characters will never be moved into ID_Start or ID_Continue. Therefore, we
also have a guarantee that the approach for Swift's operators proposed here
will *never* overlap with Swift's identifier characters even as Unicode
evolves.
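For anyone who wants to check the disjointness claim empirically, here is a minimal sketch. It assumes Swift 5 or later, where the standard library exposes `Unicode.Scalar.Properties` (an API that postdates this thread), and it reports whatever Unicode version the running toolchain ships with. The variable names are hypothetical.

```swift
// Collect every scalar that is Pattern_Syntax and also ID_Start or ID_Continue.
// If the UAX#31 guarantee holds for the Unicode data in use, nothing is collected.
var overlap: [Unicode.Scalar] = []
for value in UInt32(0)...UInt32(0x10FFFF) {
    // Surrogate code points are not scalars; the failable initializer skips them.
    guard let scalar = Unicode.Scalar(value) else { continue }
    let p = scalar.properties
    if p.isPatternSyntax && (p.isIDStart || p.isIDContinue) {
        overlap.append(scalar)
    }
}
print("Pattern_Syntax scalars that are also ID_Start/ID_Continue: \(overlap.count)")

// Two sample characters from this thread, classified by the same properties.
let infinitySign: Unicode.Scalar = "\u{221E}"   // ∞
let smallPi: Unicode.Scalar = "\u{03C0}"        // π
print(infinitySign.properties.isPatternSyntax, infinitySign.properties.isMath)
print(smallPi.properties.isIDStart, smallPi.properties.isPatternSyntax)
```

The same properties could drive a lexer's character classification directly, which is the sense in which following Unicode lets the grammar evolve without hand-maintained tables.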

Second, it is well-established that programming operators do not have to be
mathematical. For example, Swift uses the punctuation marks ‘!’, ‘?’, and
‘&’ as operators in its standard library. The approach described in your
proposal does an excellent job at covering the core mathematical operator
characters in Unicode, however it does not appear to make such an effort
toward non-mathematical operators.

Of particular note, given that ‘?’, ‘¿’, and ‘‽’ are operator characters,
it seems inconsistent to omit ‘⸘’. Similarly, with ‘&’ an operator, one
would expect ‘⅋’ to be as well. I see that “expanding the set of operator
characters” is listed as a non-goal, however that does not make it an
anti-goal, and the proposal indeed expands the set by adding ‘\’. Likewise
“rectifying Unicode shortcomings” is listed as a non-goal, although the
proposal incorporates some 16 characters for Swift 3 compatibility.

Expanding the set of valid operator characters by adding `\` is not a goal
for this proposal. However, it so happens that UTR#25 explicitly mentions
`\` as an operator. In fact, UTR#25 lists every one of Swift's ASCII
operators as mathematical operators not classified as [:Math:], minus `?`
but plus `\`. Therefore, if we agree that the alignment of Swift to Unicode
recommendations as closely as possible is a desirable goal, the most
intellectually honest set of ASCII operators would include `\`. Now, if
Swift-specific implementation concerns preclude its inclusion, then I
personally wouldn't fight it.

The proposal makes no attempt to define a "non-mathematical operator"
because, again, Unicode has no such definition--yet. There is no approach
of which I'm aware to achieving consensus on that topic, short of either
(a) waiting for more expert hands over at the Unicode Consortium; or (b) a
character-by-character survey of all symbols in Unicode by non-experts (I
count myself here) on this list, which is an explicit anti-goal of this
proposal. In anticipation of Unicode completing its work, this proposal
advances a design that (as I write above) makes possible the adoption of
future Unicode recommendations in a source-compatible way. The chief
mechanism by which this is guaranteed is by not assigning non-[:Math:]
Pattern_Syntax characters (emoji excepted) to either identifiers or
operators. It addresses the most common concern of those responding to an
earlier version of this proposal, who argued against restricting operators
in the interim to only ASCII characters (which would also be a
source-compatible approach that makes room for future Unicode
recommendations) because there is a set of non-ASCII characters that have
unambiguously the characteristics of "operatorlikeness" useful to enable a
more math-like syntax. The proposal here makes no effort to expand our
understanding of what an operator is beyond what's required for the Swift
standard library plus Unicode's somewhat imperfect classification of
mathematical symbols. Indeed, the proposal makes explicit the expectation
that Unicode experts will undertake that task.

The 20 characters included for Swift 3 compatibility have as their
objective only the preservation of Swift 3 source compatibility. They
represent an educated guess (based on public code samples and messages to
this list) as to what symbols are most likely to be used in real, shipping
Swift code, absent arguments against inclusion on other grounds. They are
not intended to represent any attempt at rationalization in alignment with
some Unicode-recommended criterion. As I mentioned, I'm eager to hear
feedback to the effect that some real, shipping code would be broken by the
proposal. I'm sensitive to the dissatisfactory nature of apparent
inconsistency. However, if the omission of `⸘` is to be regarded as a grave
shortcoming on the grounds of inconsistency, then it would be more in
alignment with the stated goals to drop `‽` as a compatibility character
than to include `⸘`. There is no evidence that either is in use. Again, the
purpose of including `¿` is really as stated: it has been mentioned on this
list that people use it as an operator in existing Swift code, and thus it
is included for compatibility.

Another point that may be worth considering is the two specific
characters ‘∅’ and ‘∞’ which, although strongly mathematical, are
definitely not operators. They are names for things—objects, quantities—and
thus by the principle of least surprise they should be available for use in
identifier names. Just as one might write “let π = Double.pi” at the top of
a file, so too might one write “let ∞ = Double.infinity” or “let ∅ =
Set<Int>()” for use later on:

let y = sin(π * x)
if tan(θ) == ∞ { … }
var s = ∅

Thus, for the purpose of consistency, I think it makes sense to classify
‘∅’ and ‘∞’ as identifiers, as well as ‘⸘’ and ‘⅋’ as operators.
Alternatively, ‘∞’ could be a floating-point literal, in which case it
still would not be an operator.

There are more than just two such characters. For example, U+29DE INFINITY
NEGATED WITH VERTICAL BAR. There are also a slew of other characters
classified as "operators" by Unicode which have shades of
"identifierlikeness." See, for example, how tiny and miny (which I think
you'll agree pass the "operatorlikeness" smell test, being as they are tiny
versions of plus and minus) are used in math to denote values. As you will
see from previous discussions, this can prompt extensive
character-by-character debate: again, an anti-goal.

Now, I will grant you that however fuzzy the line between
"identifierlikeness" and "operatorlikeness," null set and infinity are
likely to fall on the "identifier" side of it. However, the fact remains
that Unicode has classified these characters as syntax characters (i.e.
Pattern_Syntax), and it is untenable for a community not made up of Unicode
experts to try to "fix" that classification character by character. There
are similar issues with identifier characters detailed in UAX#31, not to
mention likely issues not currently known to us. This proposal deliberately
omits any mention of specific characters outside the ASCII range, other
than 20 characters for source compatibility. As I mention above, I am not
convinced it is a good idea to include even those 20 absent evidence of
actual source breakage. In this particular case, since infinity and null
set are currently valid Swift 3 operators, it is their omission that would
increase source incompatibility.

···

On Sun, Feb 26, 2017 at 11:50 AM, Nevin Brackett-Rozinsky via swift-evolution <swift-evolution@swift.org> wrote:

I understand that you described this type of feedback (on particular
characters) as “less helpful”, however it appears that the “most helpful”
types of feedback are unnecessary: the proposal is well thought out, with a
strong core approach. It is only in the fine details that a few
improvements can be made, “lesser” though they may be.

Nevin

I think the most important goal is to end up with the right set of operator
and identifier characters for *Swift*. The Unicode guidelines are a useful
tool for that purpose, and get us a long way toward where we want to be.
However at the end of the day we should weigh our success by how well we
have done for Swift, not by how rigidly we adhere to Unicode
recommendations.

Our treatment of emoji is a great example: the right thing for Swift is
different from the right thing for Unicode, so we choose to do what works
best for Swift. This proposal captures that very well.

Matching what Unicode does should be a means for us, not an end. A stepping
stone we can use when it helps. Unicode’s categorizations should inform and
guide our decisions, not constrain them.

With regard to the fact that reclassifying the infinity and empty set
symbols would be a breaking change, that is all the more reason to do it
now, for Swift 4, before it is too late. Those two characters have come up
in every iteration of this discussion on Swift Evolution that I can recall,
and I have not heard anyone argue that they ought to be operators. I think
it is safe to consider them low-hanging fruit.

Nevin

I think the most important goal is to end up with the right set of
operator and identifier characters for *Swift*. The Unicode guidelines are
a useful tool for that purpose, and get us a long way toward where we want
to be. However at the end of the day we should weigh our success by how
well we have done for Swift, not by how rigidly we adhere to Unicode
recommendations.

Our treatment of emoji is a great example: the right thing for Swift is
different from the right thing for Unicode, so we choose to do what works
best for Swift. This proposal captures that very well.

In fact, I'm greatly dissatisfied with how this proposal captures emoji.
Having come up with that scheme, I suspect that it is deficient in subtle
or obvious ways that are not yet apparent to me. This is why I have asked
for feedback along those lines. Note that for emoji, too, I have
deliberately resisted the one-by-one inclusion of certain characters that
are excluded by Unicode categories, of which there are a (small) handful.
My very strong personal preference, though soundly rejected, would have
been to remove the security and forward compatibility headache of supporting
emoji altogether. It does not, in my opinion, carry its own weight.

Matching what Unicode does should be a means for us, not an end. A stepping
stone we can use when it helps. Unicode’s categorizations should inform
and guide our decisions, not constrain them.

Well, now we are talking about overarching principles. The aim of this
proposal is in fact to assert that Swift's identifiers and operators should
be rationalized in a way that is constrained by Unicode recommendations.
Just as Swift aims to provide full support for correct Unicode handling in
strings by default, this proposal aims to align the valid characters to
current and future Unicode recommendations as tightly as possible. It is
anticipated that it should break a very small amount of actual code (if
any). It permits Swift to evolve with new developments in Unicode in the
future essentially "for free." In exchange we accept imperfections in
Unicode as imperfections in Swift. I argue that we should do so because our
own imperfections in understanding international character sets will
necessarily be greater than those of Unicode experts working systematically.

With regard to the fact that reclassifying the infinity and empty set
symbols would be a breaking change, that is all the more reason to do it
now, for Swift 4, before it is too late. Those two characters have come up
in every iteration of this discussion on Swift Evolution that I can recall,
and I have not heard anyone argue that they ought to be operators. I think
it is safe to consider them low-hanging fruit.

Disagree. As mentioned in the proposal, no attempt is made to expand the
set of valid identifier characters to include non-emoji pictographs or
symbols. If we adopt your approach, infinity and empty set would be the
only non-emoji non-"human language" symbols deliberately allowed in
identifiers, an approach no more consistent than the previous proposal to
include only the cow and dog emoji. The alternative is to go through a vast
swath of symbols character-by-character to determine which is sufficiently
"noun-like" to be an identifier, as Unicode does not and (as far as I can
tell) will never expand UAX#31 to include such symbols among identifiers.

As I mentioned, it would also be inconsistent to consider excluding only
these two characters and not related characters, such as variations on the
infinity symbol, from the set of valid operators. Very quickly, the
necessity of doing a character-by-character debate balloons to encompass
the entire character set. I continue to believe that this is absolutely the
wrong approach.

···

On Mon, Feb 27, 2017 at 10:07 PM, Nevin Brackett-Rozinsky via swift-evolution <swift-evolution@swift.org> wrote:

Nevin

As I said before, I am happy with this proposal overall.

I just had a strange thought that I thought I should share before this goes through. If we make ‘π’ an operator instead of identifier, then we would be able to write things like 3π directly. For those of us with rational types, we could write (3/4)π.

Another option is that we could have it be a literal with an associated ExpressibleBy… protocol.

Just a thought I wanted to share...