[Proposal] Refining Identifier and Operator Symbology


(Jacob Bandes-Storch) #1

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of
identifiers & operators. This includes some changes to the permitted
Unicode characters.

The latest (perhaps final?) draft is available here:

https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to
the swift-evolution repo for a formal review. Full text follows below.

—Jacob Bandes-Storch, Xiaodi Wu, Erica Sadun, Jonathan Shapiro

Refining Identifier and Operator Symbology

   - Proposal: SE-NNNN
   <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
   - Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun
   <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>, Jonathan
   Shapiro
   - Review Manager: TBD
   - Status: Awaiting review

Introduction

This proposal seeks to refine and rationalize Swift's identifier and
operator symbology. Specifically, this proposal:

   - adopts the Unicode recommendation for identifier characters, with some
   minor exceptions;
   - restricts the legal operator set to the current ASCII operator
   characters;
   - changes where dots may appear in operators; and
   - disallows Emoji from identifiers and operators.

Prior discussion threads & proposals

   - Proposal: Normalize Unicode identifiers
   <https://github.com/apple/swift-evolution/pull/531>
   - Unicode identifiers & operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>,
   with pre-proposal
   <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59> (a
   precursor to this document)
   - Lexical matters: identifiers and operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
   - Proposal: Allow Single Dollar Sign as Valid Identifier
   <https://github.com/apple/swift-evolution/pull/354>
   - Free the '$' Symbol!
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
   - Request to add middle dot (U+00B7) as operator character?
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

Guiding principles

Chris Lattner has written:

…our current operator space (particularly the unicode segments covered) is
not super well considered. It would be great for someone to take a more
systematic pass over them to rationalize things.

We need a token to be unambiguously an operator or identifier - we can have
different rules for the leading and subsequent characters though.

…any proposal that breaks:

let 🐶🐮 = "moof"

will not be tolerated. :-) :-)

Motivation

By supporting custom Unicode operators and identifiers, Swift attempts to
accommodate programmers and programming styles from many languages and
cultures. It deserves a well-thought-out specification of which characters
are valid. However, Swift's current identifier and operator character sets
do not conform to any Unicode standards, nor have they been rationalized in
the language or compiler documentation.

Identifiers, which serve as *names* for various entities, are linguistic in
nature and must permit a variety of characters to properly serve
non–English-speaking coders. This issue has been considered by the
communities of many programming languages already, and the Unicode
Consortium has published recommendations on how to choose identifier
character sets — Swift should make an effort to conform to these
recommendations.

Operators, on the other hand, should be rare and carefully chosen, because
they suffer from low discoverability and difficult readability. They are by
nature *symbols*, not names. This places a cognitive cost on users with
respect to both recall ("What is the operator that applies the behavior I
need?") and recognition ("What does the operator in this code do?").
While *almost every* nontrivial program defines many new identifiers, most
programs do not define new operators.

As operators become more esoteric or customized, the cognitive cost rises.
Recognizing a function name like formUnion(with:) is simpler for many
programmers than recalling what the ∪ operator does. Swift's current
operator character set includes many characters that aren't traditional and
recognizable operators — this encourages problematic and frivolous uses in
an otherwise safe language.

Today, there are many discrepancies and edge cases motivating these changes:

   - · is an identifier, while • is an operator.
   - The Greek question mark ; is a valid identifier.
   - Braille patterns ⠟ seem letter-like, but are operator characters.
   - 🙂🤘▶️🛩🂡 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
   - Some *non-combining* diacritics ´ ¨ ꓻ are valid in identifiers.
   - Some completely non-linguistic characters, such as ۞ and ༒, are valid
   in identifiers.
   - Some symbols such as ⚄ and ♄ are operators, despite not really being
   "operator-like".
   - A small handful of characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in
   both identifiers and operators.
   - Some non-printing characters such as 2064 INVISIBLE PLUS and 200B ZERO
   WIDTH SPACE are valid identifiers.
   - Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers
   ($ ₪ € ₱ ₹ ฿ ...).

This matter should be considered in a near timeframe (Swift 3.1 or 4) as it
is both fundamental to Swift and will produce source-breaking changes.

Precedent in other languages

Haskell distinguishes identifiers/operators by their general category
<http://www.fileformat.info/info/unicode/category/index.htm> such as "any
Unicode lowercase letter", "any Unicode symbol or punctuation", and so
forth. Identifiers can start with any lowercase letter or _, and may
contain any letter/digit/'/_. This includes letters like δ and Я, and
digits like ٢.

   - Haskell Syntax Reference
   <https://www.haskell.org/onlinereport/syntax-iso.html>
   - Haskell Lexer
   <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>
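
To make "general category" concrete, here is a small Swift sketch that looks up the category of a few of the characters mentioned above. It relies on the standard library's Unicode.Scalar.Properties.generalCategory (Swift 5 or later); the helper name category(of:) is hypothetical and only for illustration.

```swift
// Hypothetical helper: look up the Unicode general category of one scalar.
func category(of scalar: Unicode.Scalar) -> Unicode.GeneralCategory {
    return scalar.properties.generalCategory
}

let delta: Unicode.Scalar = "\u{3B4}"      // δ GREEK SMALL LETTER DELTA
let ya: Unicode.Scalar = "\u{42F}"         // Я CYRILLIC CAPITAL LETTER YA
let arabicTwo: Unicode.Scalar = "\u{662}"  // ٢ ARABIC-INDIC DIGIT TWO
let union: Unicode.Scalar = "\u{222A}"     // ∪ UNION

// Letters and digits fall into "name-like" categories,
// while ∪ is classified as a mathematical symbol (Sm).
for scalar in [delta, ya, arabicTwo, union] {
    print(scalar, category(of: scalar))
}
```

A lexer in the Haskell or Scala style would branch on these categories rather than on hand-written code point ranges.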

Scala similarly allows letters, numbers, $, and _ in identifiers,
distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator
characters include mathematical and other symbols (Sm and So) in addition
to other ASCII symbol characters.

   - Scala Lexical Syntax
   <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

ECMAScript 2015 ("ES6") uses ID_Start and ID_Continue, as well as
Other_ID_Start / Other_ID_Continue, for identifiers.

   - ECMAScript Specification: Names and Keywords
   <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Python 3 uses XID_Start and XID_Continue.

   - The Python Language Reference: Identifiers and Keywords
   <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
   - PEP 3131: Supporting Non-ASCII Identifiers
   <https://www.python.org/dev/peps/pep-3131/>

Proposed solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and
Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of
valid characters from ID_Start and ID_Continue. Normalize identifiers using
Normalization Form C (NFC).
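
As a rough illustration of what NFC normalization means for identifier equivalence, the following Swift sketch compares two spellings of the same name at the Unicode scalar level. It assumes Foundation's precomposedStringWithCanonicalMapping (NFC) and precomposedStringWithCompatibilityMapping (NFKC) as stand-ins for whatever the compiler would do internally; the helper names nfcScalars and nfkcScalars are hypothetical and not part of the proposal.

```swift
import Foundation

// Hypothetical helpers: the scalars of a string after normalization.
func nfcScalars(_ text: String) -> [Unicode.Scalar] {
    return Array(text.precomposedStringWithCanonicalMapping.unicodeScalars)
}

func nfkcScalars(_ text: String) -> [Unicode.Scalar] {
    return Array(text.precomposedStringWithCompatibilityMapping.unicodeScalars)
}

let precomposed = "caf\u{E9}"    // é as a single scalar
let decomposed = "cafe\u{301}"   // e followed by COMBINING ACUTE ACCENT

// The raw scalar sequences differ in source text...
let sameRawScalars = Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)
// ...but agree once both spellings are brought into NFC.
let sameUnderNFC = nfcScalars(precomposed) == nfcScalars(decomposed)

// For contrast: NFC keeps LATIN SMALL LETTER LONG S distinct from "s",
// while the compatibility form NFKC folds the two together.
let longS = "\u{17F}"
let foldedByNFC = nfcScalars(longS) == nfcScalars("s")
let foldedByNFKC = nfkcScalars(longS) == nfkcScalars("s")

print(sameRawScalars, sameUnderNFC, foldedByNFC, foldedByNFKC)
```

The comparison is done on scalars rather than with String's == because Swift string equality already honors canonical equivalence and would hide the difference.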

(For operators, no such recommendation currently exists, although active
work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently operators.
All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic
characters. We are not attempting to prevent the addition or re-addition of
characters in the future, but by paring the set down now, we require any
future changes to pass the high bar of the Swift Evolution process.)

Detailed design

Identifiers

Swift identifier characters will conform to UAX #31
<http://unicode.org/reports/tr31/#Conformance> as follows:

   - UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance
   described herein refers to the Unicode 9.0.0 version of UAX #31 (dated
   2016-05-31 and retrieved 2016-10-09).
   - UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the
   following requirements:
      - UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment
      the definition of "Default Identifiers" with the following profiles:
         1. ID_Start and ID_Continue shall be used for Start and Continue
         (replacing XID_Start and XID_Continue). This excludes characters
         in Other_ID_Start and Other_ID_Continue.
         2. _ 005F LOW LINE shall additionally be allowed as a Start character.
         3. The emoji characters 🐶 1F436 DOG FACE and 🐮 1F42E COW FACE shall
         be allowed as Start and Continue characters.
         4. (UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The
         join-control characters ZWJ and ZWNJ are strictly limited to the
         special cases A1, A2, and B described in UAX #31. (This requirement
         is covered in the Normalize Unicode Identifiers proposal
         <https://github.com/apple/swift-evolution/pull/531>.)
      - UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider
      two identifiers equivalent when they have the same normalized form under
      NFC <http://unicode.org/reports/tr15/>. (This requirement is covered
      in the Normalize Unicode Identifiers proposal
      <https://github.com/apple/swift-evolution/pull/531>.)

These changes
<http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]>
result in the removal of some 5,500 valid code points from the identifier
characters, as well as hundreds of thousands of unassigned code points.
(Though it does not appear on this unicode.org utility, which currently
supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an
identifier character.) Adopting ID_Start and ID_Continue does not add any
new identifier characters.

Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _ 🐶 🐮
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
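
The identifier grammar above can be approximated with the Unicode properties exposed by the Swift standard library (Swift 5 or later). This is only a sketch under stated assumptions: the function names are hypothetical, keywords, NFC normalization, and the ZWJ/ZWNJ rules are not modeled, and the standard library's isIDContinue follows the Unicode derived property, which includes Other_ID_Continue characters such as · 00B7 MIDDLE DOT, so it is somewhat more permissive than the text of this proposal intends.

```swift
// Extra Start characters named by the grammar: _ plus DOG FACE and COW FACE.
let extraIdentifierScalars: Set<Unicode.Scalar> = ["_", "\u{1F436}", "\u{1F42E}"]

// identifier-head → [:ID_Start:] | _ | 🐶 | 🐮
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isIDStart || extraIdentifierScalars.contains(scalar)
}

// identifier-character → identifier-head | [:ID_Continue:]
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// Hypothetical check: one head followed by zero or more identifier characters.
func isProposedIdentifier(_ text: String) -> Bool {
    let scalars = Array(text.unicodeScalars)
    guard let first = scalars.first, isIdentifierHead(first) else {
        return false
    }
    return scalars.dropFirst().allSatisfy(isIdentifierCharacter)
}

// A few candidates, including an emoji outside the dog/cow exception.
for candidate in ["foo", "_x1", "\u{1F436}\u{1F42E}", "1abc", "\u{1F600}"] {
    print(candidate, isProposedIdentifier(candidate))
}
```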

Operators

Swift operator characters will be limited to only the following ASCII
characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =, ->,
//, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are reserved.

Dots in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also
reserve the .. operator, permitting the compiler to use .. for a "method
cascade" syntax in the future, as supported by Dart
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

   - It helps avoid future lexical complications arising from lone .s.
   - It's a conservative approach, erring towards overly restrictive.
   Dropping the rule in future (thereby allowing single dots) may be possible.
   - It doesn't require special cases for existing infix dot operators in the
   standard library, ... (closed range) and ..< (half-open range). It also
   leaves the door open for the standard library to add analogous half-open
   and fully-open range operators <.. and <..<.
   - If we fail to adopt this rule now, then future backward-compatibility
   requirements will preclude the introduction of some potentially useful
   language enhancements.

Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
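
The two-dot rule in this grammar amounts to a simple check: every character must come from the ASCII operator set, and every maximal run of dots must have length two or more. The Swift sketch below illustrates that reading. The function name isProposedOperator is hypothetical, the sketch assumes operator-characters may repeat any number of times, and it deliberately ignores the reserved tokens (=, ->, .., and so on), which a real lexer would reject separately.

```swift
// The ASCII operator characters listed in the proposal.
let asciiOperatorCharacters: Set<Character> = [
    "!", "%", "&", "*", "+", "-", ".", "/", "<", "=", ">", "?", "^", "|", "~",
]

// Hypothetical check for the proposed operator grammar (reserved tokens not handled).
func isProposedOperator(_ text: String) -> Bool {
    guard !text.isEmpty,
          text.allSatisfy({ asciiOperatorCharacters.contains($0) }) else {
        return false
    }
    // Track the length of the current run of dots; a run of exactly one is invalid.
    var dotRun = 0
    for character in text {
        if character == "." {
            dotRun += 1
        } else {
            if dotRun == 1 { return false }
            dotRun = 0
        }
    }
    // A trailing lone dot is invalid as well.
    return dotRun != 1
}

for candidate in ["..<", "...", "<..<", "<.<", "*.", "+"] {
    print(candidate, isProposedOperator(candidate))
}
```

Under this check ..< and ... pass while <.< fails, matching the examples given in the text.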

Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and
operators. Despite their novelty and utility, emoji characters introduce
significant challenges to the language:

   - Their categorization into identifiers and operators is not semantically
   motivated, and is fraught with discrepancies.
   - Emoji characters are not displayed consistently and uniformly across
   different systems and fonts. Including all Unicode emoji
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]>
   introduces characters that don't render as emoji on Apple platforms without
   a variant selector, but which also wouldn't normally be used as identifier
   characters (e.g. ⏏ ▪ ▫).
   - Some emoji nearly overlap with existing operator syntax: ❗️❓➕➖➗✖️
   - Full emoji support necessitates handling a variety of use cases for
   joining characters and variant selectors, which would not otherwise be
   useful in most cases. It would be hard to avoid permitting sequences of
   characters which aren't valid emoji, or being overly restrictive and not
   properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit 🐶 and 🐮 in
identifiers.

Source compatibility

This change is source-breaking in cases where developers have incorporated
emoji or custom non-ASCII operators, or identifiers with characters which
have been disallowed. This is unlikely to be a significant breakage for the
majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is
now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can
improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the
old parsing & identifier lookup code.

Effect on ABI stability

This proposal does not affect the ABI format itself, although the Normalize
Unicode Identifiers proposal
<https://github.com/apple/swift-evolution/pull/531> affects the ABI of
compiled modules.

The standard library will not be affected; it uses ASCII symbols with no
combining characters.

Effect on API resilience

This proposal doesn't affect API resilience.

Alternatives considered

   - Define operator characters using Unicode categories such as Sm and So
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>.
   This approach would include many "non-operator-like" characters and doesn't
   seem to provide a significant benefit aside from a simpler definition.
   - Hand-pick a set of "operator-like" characters to include. The proposal
   authors tried this painstaking approach, and came up with a relatively
   agreeable set of about 650 code points
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]>
   (although this set would require further refinement), but ultimately felt
   the motivation for including non-ASCII operators is much lower than for
   identifiers, and the harm to readers/writers of programs outweighs their
   potential utility.
   - Use Normalization Form KC (NFKC) instead of NFC. The decision to use NFC
   comes from the Normalize Unicode Identifiers proposal
   <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31
   states:

   Generally if the programming language has case-sensitive identifiers,
   then Normalization Form C is appropriate; whereas, if the programming
   language has case-insensitive identifiers, then Normalization Form KC is
   more appropriate.

   NFKC may also produce surprising results; for example, "ſ" and "s" are
   equivalent under NFKC.
   - Continue to allow single .s in operators, and perhaps even expand the
   original rule to allow them anywhere (even if the operator does not begin
   with .).

   This would allow a wider variety of custom operators (for some
   interesting possibilities, see the operators in Haskell's Lens
   <https://github.com/ekmett/lens/wiki/Operators> package). However, there
   are a handful of potential complications:
      - Combining prefix or postfix operators with member access: foo*.bar
      would need to be parsed as foo *. bar rather than (foo*).bar.
      Parentheses could be required to disambiguate.
      - Combining infix operators with contextual members: foo*.bar would
      need to be parsed as foo *. bar rather than foo * (.bar). Whitespace
      or parentheses could be required to disambiguate.
      - Hypothetically, if operators were accessible as members such as
      MyNumber.+, allowing operators with single .s would require escaping
      operator names (perhaps with backticks, such as MyNumber.`+`).

   This would also require operators of the form [!?]*\. (for example . ?.
   !. !!.) to be reserved, to prevent users from defining custom operators
   that conflict with member access and optional chaining.

   We believe that requiring dots to appear in groups of at least two,
   while in some ways more restrictive, will prevent a significant amount of
   future pain, and does not require special-case considerations such as the
   above.

Future directions

While not within the scope of this proposal, the following considerations
may provide useful context for the proposed changes. We encourage the
community to pick up these topics when the time is right.

   - Re-expand operators to allow some non-ASCII characters. There is work in
   progress to update UAX #31 with definitions for "operator identifiers";
   when this work is completed, it would be worth considering for Swift.
   - Introduce a syntax for method cascades. The Dart language supports method
   cascades
   <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>,
   whereby multiple methods can be called on an object within one expression:
   foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax
   can also be used with assignments and subscripts. Such a feature might be
   very useful in Swift; this proposal reserves the .. operator so that it
   may be added in the future.
   - Introduce "mixfix" operator declarations. Mixfix operators are based on
   pattern matching, and would allow more than two operands. For example, the
   ternary operator ? : can be defined as a mixfix operator with three
   "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations
   such as _ [ _ ]. Some holes could be made @autoclosure, and there might
   even be holes whose argument is represented as an AST, rather than a value
   or thunk, supporting advanced metaprogramming (for instance, F#'s code
   quotations
   <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).
   - Diminish or remove the lexical distinction between operators and
   identifiers. If precedence and fixity applied to traditional identifiers
   as well as operators, it would be possible to incorporate ASCII equivalents
   for standard operators (e.g. "and" for &&, to allow "A and B"). If
   additionally combined with mixfix operator support, this might enable
   powerful DSLs (for instance, C#'s LINQ
   <https://en.wikipedia.org/wiki/Language_Integrated_Query>).


(Benjamin Spratling) #2

Howdy,
Some good points about standardizing identifiers.
Some extremely short-sighted points about deleting my formal operators that are widely recognized as operators, and that I’ve spent months adding into my code. Frankly, I just couldn’t upgrade until you put them back in.

Operators

Swift operator characters will be limited to only the following ASCII characters:

! % & * + - . / < = > ? ^ | ~

A mathematician, scientist, or engineer has an easier time catching errors when the code on their screen looks more like what they write on paper. Hence the only good reason to leave sin() as a global function instead of a computed property. Obviously, we don’t have 2D layout in Swift, but finally using the right operator characters instead of the ridiculous ASCII hacks was a breath of fresh air Swift breathed into my code. The state of operators in C languages was abysmal, and its legacy is still here. Take the blinders off for a moment and realize that “repetition” isn’t a great semantic: “&&” and “===”. They're a side effect of the hardware & character encoding sets available to developers in past decades, not a goal for the future. Sure, we don’t have screens on every key so I can set up my own domain-specific operator character sets without having to scroll through a giant list of unused characters, but finally the second barrier had fallen. And at least there are prototypes and rumors of those keyboards out in the wild.

There’s just no good reason to make
≤ ≥ ≠ ±
not valid operators.

“in homage to Swift's origins, we permit 🐶 and 🐮 in identifiers.”

That’s a blatant attempt at a cheat. Wrong answer.

It’s true there are inconsistencies in the choice of whether a particular symbol is an operator or identifier, but I’d rather resolve that than blow everything away.

- - From me

-Ben


(Matthew Johnson) #3

I very much support the proposal to rationalize our handling of identifier characters.

I also support doing something similar for operator symbols. However, I agree with feedback from others that this proposal goes way too far in removing our ability to use mathematical operators.

If I’m reading the proposal and discussion properly, the group has not been able to reach consensus on the right criteria for operator symbols, but is hopeful that will be possible after the Unicode Consortium completes its work. I think it would be far better to defer the changes to valid operator symbols until that time (removing only symbols which are currently treated as operators but which the proposal suggests should be available for identifiers instead).

The argument against symbols is reasonable for *new* operators, defined by an individual programmer. But operator symbols that have been defined by mathematics for a very long time are extremely useful. Notation matters. They impose very little additional burden when learned alongside the mathematical concepts. IMO, the best argument against using Unicode symbols for operators defined by mathematics is that they are currently difficult to type. This is an argument with a limited lifespan and should not carry more weight than it deserves in the design of a language positioned to be the language for the next 20 years. I strongly believe that removing them, even temporarily, is a mistake.

···

On Oct 19, 2016, at 1:34 AM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of identifiers & operators. This includes some changes to the permitted Unicode characters.

The latest (perhaps final?) draft is available here:

    https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to the swift-evolution repo for a formal review. Full text follows below.

—Jacob Bandes-Storch, Xiaodi Wu, Erica Sadun, Jonathan Shapiro

Refining Identifier and Operator Symbology

Proposal: SE-NNNN <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>, Jonathan Shapiro
Review Manager: TBD
Status: Awaiting review
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#introduction>Introduction

This proposal seeks to refine and rationalize Swift's identifier and operator symbology. Specifically, this proposal:

adopts the Unicode recommendation for identifier characters, with some minor exceptions;
restricts the legal operator set to the current ASCII operator characters;
changes where dots may appear in operators; and
disallows Emoji from identifiers and operators.
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#prior-discussion-threads--proposals>Prior discussion threads & proposals

Proposal: Normalize Unicode identifiers <https://github.com/apple/swift-evolution/pull/531>
Unicode identifiers & operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>, with pre-proposal <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59> (a precursor to this document)
Lexical matters: identifiers and operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
Proposal: Allow Single Dollar Sign as Valid Identifier <https://github.com/apple/swift-evolution/pull/354>
Free the '$' Symbol! <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
Request to add middle dot (U+00B7) as operator character? <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#guiding-principles>Guiding principles

Chris Lattner has written:

…our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.
We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.
…any proposal that breaks:

let :dog::cow: = "moof"
will not be tolerated. :slight_smile: :slight_smile:
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#motivation>Motivation

By supporting custom Unicode operators and identifiers, Swift attempts to accomodate programmers and programming styles from many languages and cultures. It deserves a well-thought-out specification of which characters are valid. However, Swift's current identifier and operator character sets do not conform to any Unicode standards, nor have they been rationalized in the language or compiler documentation.

Identifiers, which serve as names for various entities, are linguistic in nature and must permit a variety of characters to properly serve non–English-speaking coders. This issue has been considered by the communities of many programming languages already, and the Unicode Consortium has published recommendations on how to choose identifier character sets — Swift should make an effort to conform to these recommendations.

Operators, on the other hand, should be rare and carefully chosen, because they suffer from low discoverability and difficult readability. They are by nature symbols, not names. This places a cognitive cost on users with respect to both recall ("What is the operator that applies the behavior I need?") and recognition ("What does the operator in this code do?"). While almost every nontrivial program defines many new identifiers, most programs do not define new operators.

As operators become more esoteric or customized, the cognitive cost rises. Recognizing a function name like formUnion(with:) is simpler for many programmers than recalling what the ∪ operator does. Swift's current operator character set includes many characters that aren't traditional and recognizable operators — this encourages problematic and frivolous uses in an otherwise safe language.

Today, there are many discrepancies and edge cases motivating these changes:

   - · is an identifier, while • is an operator.
   - The Greek question mark ; is a valid identifier.
   - Braille patterns ⠟ seem letter-like, but are operator characters.
   - 🙂🤘▶️🛩🂡 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
   - Some non-combining diacritics ´ ¨ ꓻ are valid in identifiers.
   - Some completely non-linguistic characters, such as ۞ and ༒, are valid in identifiers.
   - Some symbols such as ⚄ and ♄ are operators, despite not really being "operator-like".
   - A small handful of characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers and operators.
   - Some non-printing characters such as 2064 INVISIBLE PLUS and 200B ZERO WIDTH SPACE are valid identifiers.
   - Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers ($ ₪ € ₱ ₹ ฿ ...).

This matter should be considered in a near timeframe (Swift 3.1 or 4) as it is both fundamental to Swift and will produce source-breaking changes.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#precedent-in-other-languages>Precedent in other languages

   - Haskell distinguishes identifiers/operators by their general category <http://www.fileformat.info/info/unicode/category/index.htm> such as "any Unicode lowercase letter", "any Unicode symbol or punctuation", and so forth. Identifiers can start with any lowercase letter or _, and may contain any letter/digit/'/_. This includes letters like δ and Я, and digits like ٢.
      - Haskell Syntax Reference <https://www.haskell.org/onlinereport/syntax-iso.html>
      - Haskell Lexer <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

   - Scala similarly allows letters, numbers, $, and _ in identifiers, distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator characters include mathematical and other symbols (Sm and So) in addition to other ASCII symbol characters.
      - Scala Lexical Syntax <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

   - ECMAScript 2015 ("ES6") uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue, for identifiers.
      - ECMAScript Specification: Names and Keywords <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

   - Python 3 uses XID_Start and XID_Continue.
      - The Python Language Reference: Identifiers and Keywords <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
      - PEP 3131: Supporting Non-ASCII Identifiers <https://www.python.org/dev/peps/pep-3131/>
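
As a rough illustration of the category-based approach that Haskell and Scala take, the following Swift sketch sorts a scalar into a letter-like or symbol-like bucket by its Unicode general category. The names LexicalClass and classifyByGeneralCategory are made up for this example, the category lists loosely follow the Scala description above rather than either language's actual lexer, and the results depend on the Unicode data shipped with the Swift runtime (Swift 5 or later is assumed for Unicode.Scalar.Properties).

```swift
// Hypothetical buckets for this illustration only.
enum LexicalClass {
    case identifierLetter
    case operatorSymbol
    case other
}

// Classify a scalar purely by general category, loosely following the
// Scala description: Ll, Lu, Lt, Lo, Nl are letter-like; Sm, So are symbols.
func classifyByGeneralCategory(_ scalar: Unicode.Scalar) -> LexicalClass {
    switch scalar.properties.generalCategory {
    case .lowercaseLetter, .uppercaseLetter, .titlecaseLetter,
         .otherLetter, .letterNumber:
        return .identifierLetter
    case .mathSymbol, .otherSymbol:
        return .operatorSymbol
    default:
        // Digits, punctuation, currency symbols, etc. need separate rules.
        return .other
    }
}

// Print the bucket for a few of the characters mentioned in this proposal.
for scalar in "\u{03B4}\u{042F}\u{222A}\u{2684}$".unicodeScalars {
    print(scalar, classifyByGeneralCategory(scalar))
}
```

Note that a purely category-based rule puts ⚄ (a die face, category So) in the same bucket as ∪ (category Sm), which is one reason a category-only definition admits characters that are not very operator-like.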
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#proposed-solution>Proposed solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid characters from ID_Start and ID_Continue. Normalize identifiers using Normalization Form C (NFC).

(For operators, no such recommendation currently exists, although active work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently operators. All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic characters. We are not attempting to prevent the addition or re-addition of characters in the future, but by paring the set down now, we require any future changes to pass the high bar of the Swift Evolution process.)

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#detailed-design>Detailed design

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers>Identifiers

Swift identifier characters will conform to UAX #31 <http://unicode.org/reports/tr31/#Conformance> as follows:

UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance described herein refers to the Unicode 9.0.0 version of UAX #31 (dated 2016-05-31 and retrieved 2016-10-09).

UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the following requirements:

UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment the definition of "Default Identifiers" with the following profiles:

ID_Start and ID_Continue shall be used for Start and Continue (replacing XID_Start and XID_Continue). This excludes characters in Other_ID_Start and Other_ID_Continue.

_ 005F LOW LINE shall additionally be allowed as a Start character.

The emoji characters 🐶 1F436 DOG FACE and 🐮 1F42E COW FACE shall be allowed as Start and Continue characters.

(UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The join-control characters ZWJ and ZWNJ are strictly limited to the special cases A1, A2, and B described in UAX #31. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider two identifiers equivalent when they have the same normalized form under NFC <http://unicode.org/reports/tr15/>. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

These changes <http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]> result in the removal of some 5,500 valid code points from the identifier characters, as well as hundreds of thousands of unassigned code points. (Though it does not appear on this unicode.org <http://unicode.org/> utility, which currently supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an identifier character.) Adopting ID_Start and ID_Continue does not add any new identifier characters.
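
To make the NFC equivalence in UAX31-R4 concrete, here is a small Swift sketch that compares two spellings at the scalar level before and after normalization. It assumes Foundation is available and uses its precomposedStringWithCanonicalMapping (NFC) and precomposedStringWithCompatibilityMapping (NFKC) properties; the helper names are hypothetical, and this says nothing about how the compiler itself would implement normalization. The NFKC helper is included only for contrast with the compatibility form.

```swift
import Foundation

// Hypothetical helper: the scalars of a string after NFC normalization.
func nfcScalars(_ text: String) -> [Unicode.Scalar] {
    return Array(text.precomposedStringWithCanonicalMapping.unicodeScalars)
}

// Hypothetical helper: the scalars of a string after NFKC normalization.
func nfkcScalars(_ text: String) -> [Unicode.Scalar] {
    return Array(text.precomposedStringWithCompatibilityMapping.unicodeScalars)
}

// Two identifiers are treated as the same name when their NFC forms match.
func sameIdentifierUnderNFC(_ a: String, _ b: String) -> Bool {
    return nfcScalars(a) == nfcScalars(b)
}

let precomposed = "caf\u{E9}"    // e with acute as a single scalar
let decomposed = "cafe\u{301}"   // e followed by a combining acute accent

// The raw scalar sequences differ, but the NFC forms are identical.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))
print(sameIdentifierUnderNFC(precomposed, decomposed))

// Under NFC the long s stays distinct from s; under NFKC they collapse.
print(sameIdentifierUnderNFC("\u{17F}", "s"))
print(nfkcScalars("\u{17F}") == nfkcScalars("s"))
```

Swift's own String == already compares by canonical equivalence, which is why the sketch compares unicodeScalars rather than the strings themselves.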

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes>Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _ 🐶 🐮
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
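
The identifier grammar above can be approximated in a few lines of Swift. This is only a sketch under stated assumptions: the function names are hypothetical, it relies on the isIDStart and isIDContinue properties of Unicode.Scalar.Properties (Swift 5 or later) with whatever Unicode version the runtime ships rather than Unicode 9.0.0, it does not subtract Other_ID_Start / Other_ID_Continue (so · 00B7 MIDDLE DOT would still pass here), and it ignores keywords, backticks, and normalization.

```swift
// The two emoji the proposal allows as an exception.
let dogFace: Unicode.Scalar = "\u{1F436}"
let cowFace: Unicode.Scalar = "\u{1F42E}"

// identifier-head: ID_Start, plus _ and the two permitted emoji.
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isIDStart
        || scalar == "_"
        || scalar == dogFace
        || scalar == cowFace
}

// identifier-character: any identifier-head, or ID_Continue.
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// A non-empty string whose first scalar is a head and whose remaining
// scalars are identifier characters.
func isProposedIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()
    guard let first = scalars.next(), isIdentifierHead(first) else {
        return false
    }
    while let next = scalars.next() {
        if !isIdentifierCharacter(next) { return false }
    }
    return true
}

for candidate in ["_foo", "\u{03B4}x", "9lives", "\u{1F436}\u{1F42E}", "\u{1F642}"] {
    print(candidate, isProposedIdentifier(candidate))
}
```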
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#operators>Operators

Swift operator characters will be limited to only the following ASCII characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are reserved.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#dots-in-operators>Dots in operators

The current requirements for dots in operator names are:

   - If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

   - Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve the .. operator, permitting the compiler to use .. for a "method cascade" syntax in the future, as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

It helps avoid future lexical complications arising from lone .s.

It's a conservative approach, erring towards overly restrictive. Dropping the rule in future (thereby allowing single dots) may be possible.

It doesn't require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It also leaves the door open for the standard library to add analogous half-open and fully-open range operators <.. and <..<.

If we fail to adopt this rule now, then future backward-compatibility requirements will preclude the introduction of some potentially useful language enhancements.
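
The difference between the current dot rule and the proposed one can be sketched as two small Swift predicates. Both function names are hypothetical, and the predicates look only at dot placement; they do not check the rest of the operator character set or the reserved tokens.

```swift
// Current rule as described above: a dot may appear only if the
// operator begins with a dot.
func dotsAllowedByCurrentRule(_ op: String) -> Bool {
    return !op.contains(".") || op.hasPrefix(".")
}

// Proposed rule: every maximal run of dots has length two or more.
func dotsAllowedByProposedRule(_ op: String) -> Bool {
    // Splitting on every non-dot character leaves only the dot runs.
    let dotRuns = op.split(whereSeparator: { $0 != "." })
    return dotRuns.allSatisfy { $0.count >= 2 }
}

for op in ["..<", "...", ".+", "<.<", "<..<"] {
    print(op, dotsAllowedByCurrentRule(op), dotsAllowedByProposedRule(op))
}
```

In this sketch ..< and ... pass both rules, .+ passes only the current rule, <..< passes only the proposed rule, and <.< passes neither.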

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
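
One way to read the operator grammar above is as a sequence of operator-heads, where each head is either a single ASCII symbol or a run of at least two dots. The Swift sketch below splits a candidate operator into such heads. The names operatorSymbols and operatorHeads are hypothetical, and reserved tokens such as = or .. are deliberately not checked here.

```swift
// The ASCII operator characters other than the dot.
let operatorSymbols: Set<Character> = [
    "!", "%", "&", "*", "+", "-", "/", "<", "=", ">", "?", "^", "|", "~",
]

// Split `text` into operator-heads, or return nil if it does not match
// the proposed grammar (empty, foreign character, or a lone dot).
func operatorHeads(_ text: String) -> [String]? {
    var heads: [String] = []
    var index = text.startIndex
    while index < text.endIndex {
        let character = text[index]
        if operatorSymbols.contains(character) {
            // operator-head → one of the ASCII symbols
            heads.append(String(character))
            index = text.index(after: index)
        } else if character == "." {
            // operator-head → operator-dot operator-dots (two or more dots)
            var end = index
            while end < text.endIndex, text[end] == "." {
                end = text.index(after: end)
            }
            let run = String(text[index..<end])
            if run.count < 2 { return nil }
            heads.append(run)
            index = end
        } else {
            return nil
        }
    }
    return heads.isEmpty ? nil : heads
}

for candidate in ["<..<", "...", "<.<", "\u{222A}"] {
    print(candidate, operatorHeads(candidate) ?? [])
}
```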
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and operators. Despite their novelty and utility, emoji characters introduce significant challenges to the language:

Their categorization into identifiers and operators is not semantically motivated, and is fraught with discrepancies.

Emoji characters are not displayed consistently and uniformly across different systems and fonts. Including all Unicode emoji <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]> introduces characters that don't render as emoji on Apple platforms without a variant selector, but which also wouldn't normally be used as identifier characters (e.g. ⏏ ▪ ▫).

Some emoji nearly overlap with existing operator syntax: ❗️❓➕➖➗✖️

Full emoji support necessitates handling a variety of use cases for joining characters and variant selectors, which would not otherwise be useful in most cases. It would be hard to avoid permitting sequences of characters which aren't valid emoji, or being overly restrictive and not properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit 🐶 and 🐮 in identifiers.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#source-compatibility>Source compatibility

This change is source-breaking in cases where developers have incorporated emoji or custom non-ASCII operators, or identifiers with characters which have been disallowed. This is unlikely to be a significant breakage for the majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the old parsing & identifier lookup code.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-abi-stability>Effect on ABI stability

This proposal does not affect the ABI format itself, although the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531> affects the ABI of compiled modules.

The standard library will not be affected; it uses ASCII symbols with no combining characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-api-resilience>Effect on API resilience

This proposal doesn't affect API resilience.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#alternatives-considered>Alternatives considered

Define operator characters using Unicode categories such as Sm and So <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>. This approach would include many "non-operator-like" characters and doesn't seem to provide a significant benefit aside from a simpler definition.

Hand-pick a set of "operator-like" characters to include. The proposal authors tried this painstaking approach, and came up with a relatively agreeable set of about 650 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]> (although this set would require further refinement), but ultimately felt the motivation for including non-ASCII operators is much lower than for identifiers, and the harm to readers/writers of programs outweighs their potential utility.

Use Normalization Form KC (NFKC) instead of NFC. The decision to use NFC comes from the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31 states:

"Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate."

NFKC may also produce surprising results; for example, "ſ" and "s" are equivalent under NFKC.

Continue to allow single .s in operators, and perhaps even expand the original rule to allow them anywhere (even if the operator does not begin with .).

This would allow a wider variety of custom operators (for some interesting possibilities, see the operators in Haskell's Lens <https://github.com/ekmett/lens/wiki/Operators> package). However, there are a handful of potential complications:

Combining prefix or postfix operators with member access: foo*.bar would need to be parsed as foo *. bar rather than (foo*).bar. Parentheses could be required to disambiguate.

Combining infix operators with contextual members: foo*.bar would need to be parsed as foo *. bar rather than foo * (.bar). Whitespace or parentheses could be required to disambiguate.

Hypothetically, if operators were accessible as members such as MyNumber.+, allowing operators with single .s would require escaping operator names (perhaps with backticks, such as MyNumber.`+`).

This would also require operators of the form [!?]*\. (for example . ?. !. !!.) to be reserved, to prevent users from defining custom operators that conflict with member access and optional chaining.

We believe that requiring dots to appear in groups of at least two, while in some ways more restrictive, will prevent a significant amount of future pain, and does not require special-case considerations such as the above.
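
For reference, the reserved form [!?]*\. mentioned in the single-dot alternative above can be expressed as a simple Swift check. The function name is hypothetical, and this only illustrates which spellings the alternative would have to reserve; it is not part of the proposed design.

```swift
// True when `op` is zero or more ! or ? characters followed by one dot,
// i.e. it has the same shape as member access or optional chaining.
func conflictsWithMemberAccess(_ op: String) -> Bool {
    guard op.last == "." else { return false }
    return op.dropLast().allSatisfy { $0 == "!" || $0 == "?" }
}

for op in [".", "?.", "!.", "!!.", "*.", "?"] {
    print(op, conflictsWithMemberAccess(op))
}
```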

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#future-directions>Future directions

While not within the scope of this proposal, the following considerations may provide useful context for the proposed changes. We encourage the community to pick up these topics when the time is right.

Re-expand operators to allow some non-ASCII characters. There is work in progress to update UAX #31 with definitions for "operator identifiers" — when this work is completed, it would be worth considering for Swift.

Introduce a syntax for method cascades. The Dart language supports method cascades <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>, whereby multiple methods can be called on an object within one expression: foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax can also be used with assignments and subscripts. Such a feature might be very useful in Swift; this proposal reserves the .. operator so that it may be added in the future.

Introduce "mixfix" operator declarations. Mixfix operators are based on pattern matching, and would allow more than two operands. For example, the ternary operator ? : can be defined as a mixfix operator with three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations such as _ [ _ ]. Some holes could be made @autoclosure, and there might even be holes whose argument is represented as an AST, rather than a value or thunk, supporting advanced metaprogramming (for instance, F#'s code quotations <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).

Diminish or remove the lexical distinction between operators and identifiers. If precedence and fixity applied to traditional identifiers as well as operators, it would be possible to incorporate ASCII equivalents for standard operators (e.g. and for &&, to allow A and B). If additionally combined with mixfix operator support, this might enable powerful DSLs (for instance, C#'s LINQ <https://en.wikipedia.org/wiki/Language_Integrated_Query>).

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Daniel Vollmer) #4

Hi,

while I don’t really have an opinion on the proposal overall, the following

As an exception, in homage to Swift's origins, we permit 🐶 and 🐮 in identifiers.

seems pointless and complicates things for no apparent gain (other than satisfying
Chris’ requirement… ;)), so I’d remove those as well.

  Daniel.


(Joe Groff) #5

I think this is a promising direction. Getting us in line with Unicode recommendations is an important first step, and being conservative about the treatment of operator characters and emoji is a good engineering approach, though certainly unfortunate in the short term for users who've adopted custom operators or found interesting uses for emoji identifiers in Swift 3 and earlier.

In the discussion about operators, I wonder whether it makes sense to formally separate "identifier" and "operator" characters at all. My hunch is that there isn't going to be any perfect categorization; there are so many symbols and scripts out there that it's going to be difficult to definitively characterize many symbols as "obviously" an operator or identifier. Not every developer has the mathematical background to even recognize common math operators beyond the elementary arithmetic ones. Something to consider would be to change the way operators work in the language so that they can use *any* symbols (subject to canonicalization, visibility, and confusability constraints), but require their use to always be explicitly declared in a source file that uses an operator outside of the standard library. For example, you would have to say something like:

import Sets
import operator Sets.∪

to make the '∪' symbol available as an operator in the import declaration's scope. This would provide more obvious evidence in the source code of what tokens are being employed as operators, and lessen the need to have formally distinct identifier and operator character sets.

-Joe


(Jean-Denis Muys) #6

Before and above anything else, if I read the proposal correctly, we will not be able any more to use math operator signs as operators, beyond the paltry half dozen or so in the ASCII character set???

I strongly oppose such a restriction. Maths symbols (including ∪) are widely recognised in the scientific community and this change, IIUC, is very hostile to any scientific computing.

Jean-Denis

On 19 Oct 2016, at 08:34, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of identifiers & operators. This includes some changes to the permitted Unicode characters.

The latest (perhaps final?) draft is available here:

    https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to the swift-evolution repo for a formal review. Full text follows below.

—Jacob Bandes-Storch, Xiaodi Wu, Erica Sadun, Jonathan Shapiro

Refining Identifier and Operator Symbology

Proposal: SE-NNNN <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>, Jonathan Shapiro
Review Manager: TBD
Status: Awaiting review
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#introduction>Introduction

This proposal seeks to refine and rationalize Swift's identifier and operator symbology. Specifically, this proposal:

adopts the Unicode recommendation for identifier characters, with some minor exceptions;
restricts the legal operator set to the current ASCII operator characters;
changes where dots may appear in operators; and
disallows Emoji from identifiers and operators.
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#prior-discussion-threads--proposals>Prior discussion threads & proposals

Proposal: Normalize Unicode identifiers <https://github.com/apple/swift-evolution/pull/531>
Unicode identifiers & operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>, with pre-proposal <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59> (a precursor to this document)
Lexical matters: identifiers and operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
Proposal: Allow Single Dollar Sign as Valid Identifier <https://github.com/apple/swift-evolution/pull/354>
Free the '$' Symbol! <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
Request to add middle dot (U+00B7) as operator character? <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#guiding-principles>Guiding principles

Chris Lattner has written:

…our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.
We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.
…any proposal that breaks:

let 🐶🐮 = "moof"
will not be tolerated. 🙂 🙂
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#motivation>Motivation

By supporting custom Unicode operators and identifiers, Swift attempts to accommodate programmers and programming styles from many languages and cultures. It deserves a well-thought-out specification of which characters are valid. However, Swift's current identifier and operator character sets do not conform to any Unicode standards, nor have they been rationalized in the language or compiler documentation.

Identifiers, which serve as names for various entities, are linguistic in nature and must permit a variety of characters to properly serve non–English-speaking coders. This issue has been considered by the communities of many programming languages already, and the Unicode Consortium has published recommendations on how to choose identifier character sets — Swift should make an effort to conform to these recommendations.

Operators, on the other hand, should be rare and carefully chosen, because they suffer from low discoverability and difficult readability. They are by nature symbols, not names. This places a cognitive cost on users with respect to both recall ("What is the operator that applies the behavior I need?") and recognition ("What does the operator in this code do?"). While almost every nontrivial program defines many new identifiers, most programs do not define new operators.

As operators become more esoteric or customized, the cognitive cost rises. Recognizing a function name like formUnion(with:) is simpler for many programmers than recalling what the ∪ operator does. Swift's current operator character set includes many characters that aren't traditional and recognizable operators — this encourages problematic and frivolous uses in an otherwise safe language.

Today, there are many discrepancies and edge cases motivating these changes:

· is an identifier, while • is an operator.
The Greek question mark ; is a valid identifier.
Braille patterns ⠟ seem letter-like, but are operator characters.
🙂🤘▶️🛩🂡 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
Some non-combining diacritics ´ ¨ ꓻ are valid in identifiers.
Some completely non-linguistic characters, such as ۞ and ༒, are valid in identifiers.
Some symbols such as ⚄ and ♄ are operators, despite not really being "operator-like".
A small handful of characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers and operators.
Some non-printing characters such as 2064 INVISIBLE PLUS and 200B ZERO WIDTH SPACE are valid identifiers.
Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers ($ ₪ € ₱ ₹ ฿ ...).
This matter should be considered in a near timeframe (Swift 3.1 or 4) as it is both fundamental to Swift and will produce source-breaking changes.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#precedent-in-other-languages>Precedent in other languages

Haskell distinguishes identifiers/operators by their general category <http://www.fileformat.info/info/unicode/category/index.htm> such as "any Unicode lowercase letter", "any Unicode symbol or punctuation", and so forth. Identifiers can start with any lowercase letter or _, and may contain any letter/digit/'/_. This includes letters like δ and Я, and digits like ٢.

Haskell Syntax Reference <https://www.haskell.org/onlinereport/syntax-iso.html>
Haskell Lexer <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>
Scala similarly allows letters, numbers, $, and _ in identifiers, distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator characters include mathematical and other symbols (Sm and So) in addition to other ASCII symbol characters.

Scala Lexical Syntax <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>
ECMAScript 2015 ("ES6") uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue, for identifiers.

ECMAScript Specification: Names and Keywords <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>
Python 3 uses XID_Start and XID_Continue.

The Python Language Reference: Identifiers and Keywords <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
PEP 3131: Supporting Non-ASCII Identifiers <https://www.python.org/dev/peps/pep-3131/>
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#proposed-solution>Proposed solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid characters from ID_Start and ID_Continue. Normalize identifiers using Normalization Form C (NFC).

(For operators, no such recommendation currently exists, although active work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently operators. All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic characters. We are not attempting to prevent the addition or re-addition of characters in the future, but by paring the set down now, we require any future changes to pass the high bar of the Swift Evolution process.)

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#detailed-design>Detailed design

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers>Identifiers

Swift identifier characters will conform to UAX #31 <http://unicode.org/reports/tr31/#Conformance> as follows:

UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance described herein refers to the Unicode 9.0.0 version of UAX #31 (dated 2016-05-31 and retrieved 2016-10-09).

UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the following requirements:

UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment the definition of "Default Identifiers" with the following profiles:

ID_Start and ID_Continue shall be used for Start and Continue (replacing XID_Start and XID_Continue). This excludes characters in Other_ID_Start and Other_ID_Continue.

_ 005F LOW LINE shall additionally be allowed as a Start character.

The emoji characters 🐶 1F436 DOG FACE and 🐮 1F42E COW FACE shall be allowed as Start and Continue characters.

(UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The join-control characters ZWJ and ZWNJ are strictly limited to the special cases A1, A2, and B described in UAX #31. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider two identifiers equivalent when they have the same normalized form under NFC <http://unicode.org/reports/tr15/>. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

These changes <http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]> result in the removal of some 5,500 valid code points from the identifier characters, as well as hundreds of thousands of unassigned code points. (Though it does not appear on this unicode.org <http://unicode.org/> utility, which currently supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an identifier character.) Adopting ID_Start and ID_Continue does not add any new identifier characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes>Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _ 🐶 🐮
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
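
The productions above can be sketched as a small checker. This is only an illustration under stated assumptions: the function names are hypothetical, it relies on the `isIDStart` and `isIDContinue` properties that Swift 5 exposes on `Unicode.Scalar.Properties`, and those derived properties still include the Other_ID_Start and Other_ID_Continue characters, so the sketch does not model that exclusion. Keywords and backtick-escaped identifiers are ignored as well.

```swift
// Hypothetical sketch of the proposed identifier grammar (not compiler code).
let dogFace: Unicode.Scalar = "\u{1F436}"  // DOG FACE
let cowFace: Unicode.Scalar = "\u{1F42E}"  // COW FACE

// identifier-head → [:ID_Start:], or one of: _ , dog face, cow face
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isIDStart
        || scalar == "_"
        || scalar == dogFace
        || scalar == cowFace
}

// identifier-character → identifier-head, or [:ID_Continue:]
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// An identifier is one head scalar followed by zero or more identifier characters.
func isProposedIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()
    guard let head = scalars.next(), isIdentifierHead(head) else { return false }
    while let next = scalars.next() {
        if !isIdentifierCharacter(next) { return false }
    }
    return true
}
```

Under this sketch, `isProposedIdentifier("moof")` and `isProposedIdentifier("🐶🐮")` hold, while a leading digit or an emoji other than the two exceptions is rejected.
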
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#operators>Operators

Swift operator characters will be limited to only the following ASCII characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are reserved.
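
As a hedged illustration of what remains legal under the restricted set, the sketch below declares a custom operator built only from the ASCII characters listed above. The operator `|>`, the precedence group name `ForwardApplication`, and the example values are hypothetical and do not come from the proposal.

```swift
// A custom operator using only characters from the proposed ASCII set.
precedencegroup ForwardApplication {
    associativity: left
}

infix operator |> : ForwardApplication

// Applies `transform` to `value`; purely illustrative.
func |> <Input, Output>(value: Input, transform: (Input) -> Output) -> Output {
    return transform(value)
}

let start = 3
let doubledThenIncremented = start |> { $0 * 2 } |> { $0 + 1 }
```

A declaration such as `infix operator ∘` would no longer be accepted, since ∘ falls outside the listed characters.
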

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#dots-in-operators>Dots in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve the .. operator, permitting the compiler to use .. for a "method cascade" syntax in the future, as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.
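
The revised rule can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper name; it tests only the "runs of two or more" condition and does not model the separate reservation of the .. operator or the set of permitted characters.

```swift
// Hypothetical helper: true when every "." in `name` belongs to a run of two or more dots.
func dotsAppearOnlyInRuns(_ name: String) -> Bool {
    var currentRun = 0
    for character in name {
        if character == "." {
            currentRun += 1
        } else {
            // A run that just ended with exactly one dot violates the rule.
            if currentRun == 1 { return false }
            currentRun = 0
        }
    }
    // A trailing lone dot also violates the rule.
    return currentRun != 1
}
```

With this helper, `..<` and `...` pass, and `<.<` fails because its dot stands alone.
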

Motivations for incorporating the two-dot rule are:

It helps avoid future lexical complications arising from lone .s.

It's a conservative approach, erring towards overly restrictive. Dropping the rule in future (thereby allowing single dots) may be possible.

It doesn't require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It also leaves the door open for the standard library to add analogous half-open and fully-open range operators <.. and <..<.

If we fail to adopt this rule now, then future backward-compatibility requirements will preclude the introduction of some potentially useful language enhancements.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and operators. Despite their novelty and utility, emoji characters introduce significant challenges to the language:

Their categorization into identifiers and operators is not semantically motivated, and is fraught with discrepancies.

Emoji characters are not displayed consistently and uniformly across different systems and fonts. Including all Unicode emoji <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]> introduces characters that don't render as emoji on Apple platforms without a variant selector, but which also wouldn't normally be used as identifier characters (e.g. ⏏ ▪ ▫).

Some emoji nearly overlap with existing operator syntax: ❗️❓➕➖➗✖

Full emoji support necessitates handling a variety of use cases for joining characters and variant selectors, which would not otherwise be useful in most cases. It would be hard to avoid permitting sequences of characters which aren't valid emoji, or being overly restrictive and not properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit 🐶 and 🐮 in identifiers.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#source-compatibility>Source compatibility

This change is source-breaking in cases where developers have incorporated emoji or custom non-ASCII operators, or identifiers with characters which have been disallowed. This is unlikely to be a significant breakage for the majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the old parsing & identifier lookup code.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-abi-stability>Effect on ABI stability

This proposal does not affect the ABI format itself, although the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531> affects the ABI of compiled modules.

The standard library will not be affected; it uses ASCII symbols with no combining characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-api-resilience>Effect on API resilience

This proposal doesn't affect API resilience.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#alternatives-considered>Alternatives considered

Define operator characters using Unicode categories such as Sm and So <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>. This approach would include many "non-operator-like" characters and doesn't seem to provide a significant benefit aside from a simpler definition.

Hand-pick a set of "operator-like" characters to include. The proposal authors tried this painstaking approach, and came up with a relatively agreeable set of about 650 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]> (although this set would require further refinement), but ultimately felt the motivation for including non-ASCII operators is much lower than for identifiers, and the harm to readers/writers of programs outweighs their potential utility.

Use Normalization Form KC (NFKC) instead of NFC. The decision to use NFC comes from the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31 states:

Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate.

NFKC may also produce surprising results; for example, "ſ" and "s" are equivalent under NFKC.
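
The difference can be observed with Foundation's normalization properties. This is a small sketch, assuming Foundation is available; the variable and helper names are illustrative only. Scalar values are compared directly because Swift's `String` equality already treats canonically equivalent strings as equal.

```swift
import Foundation

let longS = "\u{017F}"         // "ſ" LATIN SMALL LETTER LONG S
let decomposedE = "e\u{0301}"  // "e" followed by COMBINING ACUTE ACCENT

// Returns the Unicode scalar values of a string, for exact comparison.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// NFC composes canonically equivalent sequences but leaves "ſ" alone.
let nfcLongS = longS.precomposedStringWithCanonicalMapping
let nfcE = decomposedE.precomposedStringWithCanonicalMapping

// NFKC additionally applies compatibility mappings, folding "ſ" into "s".
let nfkcLongS = longS.precomposedStringWithCompatibilityMapping
```

Here `nfcE` is the single scalar U+00E9, `nfcLongS` stays U+017F, and `nfkcLongS` becomes U+0073, which is the surprising equivalence mentioned above.
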

Continue to allow single .s in operators, and perhaps even expand the original rule to allow them anywhere (even if the operator does not begin with .).

This would allow a wider variety of custom operators (for some interesting possibilities, see the operators in Haskell's Lens <https://github.com/ekmett/lens/wiki/Operators> package). However, there are a handful of potential complications:

Combining prefix or postfix operators with member access: foo*.bar would need to be parsed as foo *. bar rather than (foo*).bar. Parentheses could be required to disambiguate.

Combining infix operators with contextual members: foo*.bar would need to be parsed as foo *. bar rather than foo * (.bar). Whitespace or parentheses could be required to disambiguate.

Hypothetically, if operators were accessible as members such as MyNumber.+, allowing operators with single .s would require escaping operator names (perhaps with backticks, such as MyNumber.`+`).

This would also require operators of the form [!?]*\. (for example . ?. !. !!.) to be reserved, to prevent users from defining custom operators that conflict with member access and optional chaining.
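
To make the reserved pattern concrete, here is a minimal sketch of a predicate for names of the form [!?]*\. as described above. The helper name is hypothetical, and the check is written by hand rather than with a regular expression so that it does not depend on any particular library.

```swift
// Hypothetical helper: true when `name` is zero or more "!" or "?" followed by a single final ".".
func matchesReservedDotPattern(_ name: String) -> Bool {
    guard name.last == "." else { return false }
    return name.dropLast().allSatisfy { $0 == "!" || $0 == "?" }
}
```

The names `.`, `?.`, `!.` and `!!.` match, whereas `*.` does not and would remain available as a custom operator under this alternative.
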

We believe that requiring dots to appear in groups of at least two, while in some ways more restrictive, will prevent a significant amount of future pain, and does not require special-case considerations such as the above.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#future-directions>Future directions

While not within the scope of this proposal, the following considerations may provide useful context for the proposed changes. We encourage the community to pick up these topics when the time is right.

Re-expand operators to allow some non-ASCII characters. There is work in progress to update UAX #31 with definitions for "operator identifiers" — when this work is completed, it would be worth considering for Swift.

Introduce a syntax for method cascades. The Dart language supports method cascades <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>, whereby multiple methods can be called on an object within one expression: foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax can also be used with assignments and subscripts. Such a feature might be very useful in Swift; this proposal reserves the .. operator so that it may be added in the future.

Introduce "mixfix" operator declarations. Mixfix operators are based on pattern matching, and would allow more than two operands. For example, the ternary operator ? : can be defined as a mixfix operator with three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations such as _ [ _ ]. Some holes could be made @autoclosure, and there might even be holes whose argument is represented as an AST, rather than a value or thunk, supporting advanced metaprogramming (for instance, F#'s code quotations <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).

Diminish or remove the lexical distinction between operators and identifiers. If precedence and fixity applied to traditional identifiers as well as operators, it would be possible to incorporate ASCII equivalents for standard operators (e.g. and for &&, to allow A and B). If additionally combined with mixfix operator support, this might enable powerful DSLs (for instance, C#'s LINQ <https://en.wikipedia.org/wiki/Language_Integrated_Query>).

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Alex Martini) #7

Dots in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve the .. operator, permitting the compiler to use .. for a "method cascade" syntax in the future, as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

It helps avoid future lexical complications arising from lone .s.

It's a conservative approach, erring towards overly restrictive. Dropping the rule in future (thereby allowing single dots) may be possible.

It doesn't require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It also leaves the door open for the standard library to add analogous half-open and fully-open range operators <.. and <..<.

If we fail to adopt this rule now, then future backward-compatibility requirements will preclude the introduction of some potentially useful language enhancements.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
I think there's a mismatch between the English rule and the grammar. For example, is +..+ allowed or not?

The English rule does allow +..+ because its dots appear in a run of two.

The grammar allows a run of one or more dots as an operator head, but never allows dots as characters appearing in the middle of an operator, regardless of how many dots appear next to each other. The grammar wouldn't allow +..+ because the dots don't come at the beginning.

Here's an alternate version of the grammar that matches the "two or more" rule. Because we no longer distinguish between which characters are allowed as the first character of an operator vs a character inside, there's no longer a need for a separate operator-head.

operator --> operator-character operator-OPT

operator-character --> ! % & * + - / < = > ? ^ | ~
operator-character --> operator-dots

operator-dots --> .. operator-additional-dots-OPT
operator-additional-dots --> . operator-additional-dots-OPT
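
As a rough check of the alternate grammar above, the following sketch recognizes strings it would accept. The function and constant names are hypothetical, and the sketch covers only this grammar; it does not model reserved tokens such as = or the proposed reservation of the .. operator.

```swift
// operator-character --> one of the ASCII operator symbols
let operatorSymbols: Set<Character> = [
    "!", "%", "&", "*", "+", "-", "/", "<", "=", ">", "?", "^", "|", "~",
]

// Hypothetical recognizer for the alternate grammar.
func matchesAlternateGrammar(_ text: String) -> Bool {
    let characters = Array(text)
    var index = 0
    var consumedAny = false
    while index < characters.count {
        if operatorSymbols.contains(characters[index]) {
            index += 1
        } else if characters[index] == "." {
            // operator-dots --> ".." followed by any number of additional dots
            var runEnd = index
            while runEnd < characters.count && characters[runEnd] == "." {
                runEnd += 1
            }
            if runEnd - index < 2 { return false }
            index = runEnd
        } else {
            return false
        }
        consumedAny = true
    }
    // The grammar requires at least one operator-character.
    return consumedAny
}
```

Under this recognizer `+..+`, `..<` and `...` are accepted, while `<.<` is rejected because its dot is not part of a run of two or more.
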


(Alex Blewitt) #8

I support this in principle, having suggested similar things in the past. I would suggest, however, that to simplify the discussion and the proposal itself, 'reserving operators at this time' and 'appeasing a specific example that Chris Lattner proposed just so that it isn't outright denied' are probably not appropriate within this document. It would be better to have a sound basis accepted, then propose specific variations on top of it at a later stage (such as 'Allow dogcow as an identifier').

Amongst other things, this proposal permits 🐶face and MK🐮 as identifiers, which is probably not intentional. It would probably be better to define:

identifier -> identifier-head identifier-characters
identifier -> 🐶🐮

which would thus prevent 🐶 on its own (or 🐮 on its own) from being used in an identifier.
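
The effect of that suggestion can be sketched as follows. This is an illustration only: the function name is hypothetical, identifier-characters is assumed to be optional as in Swift's existing grammar, and the `isIDStart` / `isIDContinue` properties from `Unicode.Scalar.Properties` (Swift 5) stand in for the proposal's character classes.

```swift
// Hypothetical sketch: the dogcow pair is an identifier only as a whole.
func isIdentifierUnderSuggestion(_ text: String) -> Bool {
    // identifier -> dog face followed by cow face, and nothing else
    if text == "\u{1F436}\u{1F42E}" { return true }

    // identifier -> identifier-head identifier-characters, with no emoji exceptions
    var scalars = text.unicodeScalars.makeIterator()
    guard let head = scalars.next(),
          head.properties.isIDStart || head == "_" else { return false }
    while let next = scalars.next() {
        if !next.properties.isIDContinue { return false }
    }
    return true
}
```

With this rule the exact pair is accepted, but neither emoji alone nor mixed names such as the two examples above are.
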

Alex

On 19 Oct 2016, at 07:34, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of identifiers & operators. This includes some changes to the permitted Unicode characters.

The latest (perhaps final?) draft is available here:

    https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to the swift-evolution repo for a formal review. Full text follows below.

—Jacob Bandes-Storch, Xiaodi Wu, Erica Sadun, Jonathan Shapiro

Refining Identifier and Operator Symbology

Proposal: SE-NNNN <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>, Jonathan Shapiro
Review Manager: TBD
Status: Awaiting review
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#introduction>Introduction

This proposal seeks to refine and rationalize Swift's identifier and operator symbology. Specifically, this proposal:

adopts the Unicode recommendation for identifier characters, with some minor exceptions;
restricts the legal operator set to the current ASCII operator characters;
changes where dots may appear in operators; and
disallows Emoji from identifiers and operators.
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#prior-discussion-threads--proposals>Prior discussion threads & proposals

Proposal: Normalize Unicode identifiers <https://github.com/apple/swift-evolution/pull/531>
Unicode identifiers & operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>, with pre-proposal <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59> (a precursor to this document)
Lexical matters: identifiers and operators <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
Proposal: Allow Single Dollar Sign as Valid Identifier <https://github.com/apple/swift-evolution/pull/354>
Free the '$' Symbol! <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
Request to add middle dot (U+00B7) as operator character? <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#guiding-principles>Guiding principles

Chris Lattner has written:

…our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.
We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.
…any proposal that breaks:

let 🐶🐮 = "moof"
will not be tolerated. :-) :-)
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#motivation>Motivation

By supporting custom Unicode operators and identifiers, Swift attempts to accommodate programmers and programming styles from many languages and cultures. It deserves a well-thought-out specification of which characters are valid. However, Swift's current identifier and operator character sets do not conform to any Unicode standards, nor have they been rationalized in the language or compiler documentation.

Identifiers, which serve as names for various entities, are linguistic in nature and must permit a variety of characters to properly serve non–English-speaking coders. This issue has been considered by the communities of many programming languages already, and the Unicode Consortium has published recommendations on how to choose identifier character sets — Swift should make an effort to conform to these recommendations.

Operators, on the other hand, should be rare and carefully chosen, because they suffer from low discoverability and difficult readability. They are by nature symbols, not names. This places a cognitive cost on users with respect to both recall ("What is the operator that applies the behavior I need?") and recognition ("What does the operator in this code do?"). While almost every nontrivial program defines many new identifiers, most programs do not define new operators.

As operators become more esoteric or customized, the cognitive cost rises. Recognizing a function name like formUnion(with:) is simpler for many programmers than recalling what the ∪ operator does. Swift's current operator character set includes many characters that aren't traditional and recognizable operators — this encourages problematic and frivolous uses in an otherwise safe language.

Today, there are many discrepancies and edge cases motivating these changes:

· is an identifier, while • is an operator.
The Greek question mark ; is a valid identifier.
Braille patterns ⠟ seem letter-like, but are operator characters.
🙂🤘▶️🛩🂡 are identifiers, while ☹️✌️🔼✈️♠️ are operators.
Some non-combining diacritics ´ ¨ ꓻ are valid in identifiers.
Some completely non-linguistic characters, such as ۞ and ༒, are valid in identifiers.
Some symbols such as ⚄ and ♄ are operators, despite not really being "operator-like".
A small handful of characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers and operators.
Some non-printing characters such as 2064 INVISIBLE PLUS and 200B ZERO WIDTH SPACE are valid identifiers.
Currency symbols are split across operators (¢ £ ¤ ¥) and identifiers ($ ₪ € ₱ ₹ ฿ ...).
This matter should be considered in a near timeframe (Swift 3.1 or 4) as it is both fundamental to Swift and will produce source-breaking changes.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#precedent-in-other-languages>Precedent in other languages

Haskell distinguishes identifiers/operators by their general category <http://www.fileformat.info/info/unicode/category/index.htm> such as "any Unicode lowercase letter", "any Unicode symbol or punctuation", and so forth. Identifiers can start with any lowercase letter or _, and may contain any letter/digit/'/_. This includes letters like δ and Я, and digits like ٢.

Haskell Syntax Reference <https://www.haskell.org/onlinereport/syntax-iso.html>
Haskell Lexer <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>
Scala similarly allows letters, numbers, $, and _ in identifiers, distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator characters include mathematical and other symbols (Sm and So) in addition to other ASCII symbol characters.

Scala Lexical Syntax <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>
ECMAScript 2015 ("ES6") uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue, for identifiers.

ECMAScript Specification: Names and Keywords <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>
Python 3 uses XID_Start and XID_Continue.

The Python Language Reference: Identifiers and Keywords <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
PEP 3131: Supporting Non-ASCII Identifiers <https://www.python.org/dev/peps/pep-3131/>
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#proposed-solution>Proposed solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid characters from ID_Start and ID_Continue. Normalize identifiers using Normalization Form C (NFC).

(For operators, no such recommendation currently exists, although active work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently operators. All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic characters. We are not attempting to prevent the addition or re-addition of characters in the future, but by paring the set down now, we require any future changes to pass the high bar of the Swift Evolution process.)

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#detailed-design>Detailed design

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers>Identifiers

Swift identifier characters will conform to UAX #31 <http://unicode.org/reports/tr31/#Conformance> as follows:

UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance described herein refers to the Unicode 9.0.0 version of UAX #31 (dated 2016-05-31 and retrieved 2016-10-09).

UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the following requirements:

UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment the definition of "Default Identifiers" with the following profiles:

ID_Start and ID_Continue shall be used for Start and Continue (replacing XID_Start and XID_Continue). This excludes characters in Other_ID_Start and Other_ID_Continue.

_ 005F LOW LINE shall additionally be allowed as a Start character.

The emoji characters 🐶 1F436 DOG FACE and 🐮 1F42E COW FACE shall be allowed as Start and Continue characters.

(UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The join-control characters ZWJ and ZWNJ are strictly limited to the special cases A1, A2, and B described in UAX #31. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider two identifiers equivalent when they have the same normalized form under NFC <http://unicode.org/reports/tr15/>. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

These changes <http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]> result in the removal of some 5,500 valid code points from the identifier characters, as well as hundreds of thousands of unassigned code points. (Though it does not appear on this unicode.org <http://unicode.org/> utility, which currently supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an identifier character.) Adopting ID_Start and ID_Continue does not add any new identifier characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes>Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _ 🐶 🐮
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#operators>Operators

Swift operator characters will be limited to only the following ASCII characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are reserved.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#dots-in-operators>Dots in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve the .. operator, permitting the compiler to use .. for a "method cascade" syntax in the future, as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

It helps avoid future lexical complications arising from lone .s.

It's a conservative approach, erring towards overly restrictive. Dropping the rule in future (thereby allowing single dots) may be possible.

It doesn't require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It also leaves the door open for the standard library to add analogous half-open and fully-open range operators <.. and <..<.

If we fail to adopt this rule now, then future backward-compatibility requirements will preclude the introduction of some potentially useful language enhancements.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and operators. Despite their novelty and utility, emoji characters introduce significant challenges to the language:

Their categorization into identifiers and operators is not semantically motivated, and is fraught with discrepancies.

Emoji characters are not displayed consistently and uniformly across different systems and fonts. Including all Unicode emoji <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]> introduces characters that don't render as emoji on Apple platforms without a variant selector, but which also wouldn't normally be used as identifier characters (e.g. ⏏ ▪ ▫).

Some emoji nearly overlap with existing operator syntax: ❗️❓➕➖➗✖

Full emoji support necessitates handling a variety of use cases for joining characters and variant selectors, which would not otherwise be useful in most cases. It would be hard to avoid permitting sequences of characters which aren't valid emoji, or being overly restrictive and not properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit 🐶 and 🐮 in identifiers.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#source-compatibility>Source compatibility

This change is source-breaking in cases where developers have incorporated emoji or custom non-ASCII operators, or identifiers with characters which have been disallowed. This is unlikely to be a significant breakage for the majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the old parsing & identifier lookup code.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-abi-stability>Effect on ABI stability

This proposal does not affect the ABI format itself, although the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531> affects the ABI of compiled modules.

The standard library will not be affected; it uses ASCII symbols with no combining characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-api-resilience>Effect on API resilience

This proposal doesn't affect API resilience.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#alternatives-considered>Alternatives considered

Define operator characters using Unicode categories such as Sm and So <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>. This approach would include many "non-operator-like" characters and doesn't seem to provide a significant benefit aside from a simpler definition.
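A quick way to see why a category-based definition pulls in "non-operator-like" characters is to query the general category of a few scalars. This is only a sketch: `isSmOrSo` is a hypothetical helper, and it relies on the standard library's `Unicode.Scalar.Properties.generalCategory`, which postdates this thread.

```swift
// Report whether a scalar falls in general category Sm (math symbol)
// or So (other symbol).
func isSmOrSo(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .mathSymbol, .otherSymbol:
        return true
    default:
        return false
    }
}

// ∪ and ≈ look like operators; ⚄ and ♄ do not, yet all four pass the test.
let samples: [Unicode.Scalar] = ["∪", "≈", "⚄", "♄", "+", "a"]
for scalar in samples {
    print(scalar, scalar.properties.generalCategory, isSmOrSo(scalar))
}
```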

Hand-pick a set of "operator-like" characters to include. The proposal authors tried this painstaking approach, and came up with a relatively agreeable set of about 650 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]> (although this set would require further refinement), but ultimately felt the motivation for including non-ASCII operators is much lower than for identifiers, and the harm to readers/writers of programs outweighs their potential utility.

Use Normalization Form KC (NFKC) instead of NFC. The decision to use NFC comes from the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31 states:

"Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate."

NFKC may also produce surprising results; for example, "ſ" and "s" are equivalent under NFKC.
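The difference between the two forms can be checked with Foundation's normalization properties. This is a minimal sketch, not how the compiler would implement normalization; the variable names are illustrative only.

```swift
import Foundation

let longS = "\u{017F}"   // ſ LATIN SMALL LETTER LONG S

// NFC (canonical composition) and NFKC (compatibility composition).
let nfc = longS.precomposedStringWithCanonicalMapping
let nfkc = longS.precomposedStringWithCompatibilityMapping

// Compare scalar by scalar so the result does not depend on
// String's own notion of equality.
let nfcScalars = Array(nfc.unicodeScalars)
let nfkcScalars = Array(nfkc.unicodeScalars)

// NFC still composes canonically equivalent sequences: e + U+0301 becomes é.
let composed = "e\u{0301}".precomposedStringWithCanonicalMapping
```

Under NFC the long s stays distinct from "s", while NFKC folds the two together, which is the surprising behaviour mentioned above.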

Continue to allow single .s in operators, and perhaps even expand the original rule to allow them anywhere (even if the operator does not begin with .).

This would allow a wider variety of custom operators (for some interesting possibilities, see the operators in Haskell's Lens <https://github.com/ekmett/lens/wiki/Operators> package). However, there are a handful of potential complications:

Combining prefix or postfix operators with member access: foo*.bar would need to be parsed as foo *. bar rather than (foo*).bar. Parentheses could be required to disambiguate.

Combining infix operators with contextual members: foo*.bar would need to be parsed as foo *. bar rather than foo * (.bar). Whitespace or parentheses could be required to disambiguate.

Hypothetically, if operators were accessible as members such as MyNumber.+, allowing operators with single .s would require escaping operator names (perhaps with backticks, such as MyNumber.`+`).

This would also require operators of the form [!?]*\. (for example . ?. !. !!.) to be reserved, to prevent users from defining custom operators that conflict with member access and optional chaining.

We believe that requiring dots to appear in groups of at least two, while in some ways more restrictive, will prevent a significant amount of future pain, and does not require special-case considerations such as the above.
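The two-dot rule is simple enough to state as a predicate over an operator's spelling. The function below is a hypothetical sketch of that check alone; it does not model the rest of the operator grammar or the reservation of `..`.

```swift
// Returns true when every "." in the operator belongs to a run of
// two or more consecutive dots.
func dotsAppearOnlyInRuns(_ op: String) -> Bool {
    var runLength = 0
    for character in op {
        if character == "." {
            runLength += 1
        } else {
            // A run that just ended with exactly one dot breaks the rule.
            if runLength == 1 { return false }
            runLength = 0
        }
    }
    // A trailing lone dot also breaks the rule.
    return runLength != 1
}
```

With this check, `..<`, `...`, and `<..<` pass, while `<.<` and a lone `.` fail.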

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#future-directions>Future directions

While not within the scope of this proposal, the following considerations may provide useful context for the proposed changes. We encourage the community to pick up these topics when the time is right.

Re-expand operators to allow some non-ASCII characters. There is work in progress to update UAX #31 with definitions for "operator identifiers" — when this work is completed, it would be worth considering for Swift.

Introduce a syntax for method cascades. The Dart language supports method cascades <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>, whereby multiple methods can be called on an object within one expression: foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax can also be used with assignments and subscripts. Such a feature might be very useful in Swift; this proposal reserves the .. operator so that it may be added in the future.
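Until such syntax exists, the effect of a cascade can be approximated with an ordinary function. The sketch below is only an emulation under assumed names (`cascade`, `Logger`, `bar`, `baz` are all hypothetical) and is not the syntax this proposal reserves `..` for.

```swift
// Hypothetical helper: run several operations on one object, then return it.
@discardableResult
func cascade<T: AnyObject>(_ object: T, _ operations: ((T) -> Void)...) -> T {
    for operation in operations {
        operation(object)
    }
    return object
}

final class Logger {
    private(set) var lines: [String] = []
    func bar() { lines.append("bar") }
    func baz() { lines.append("baz") }
}

// Roughly what foo..bar()..baz() would do: call both methods on the same
// receiver and keep the receiver as the value of the expression.
let foo = cascade(Logger(), { $0.bar() }, { $0.baz() })
```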

Introduce "mixfix" operator declarations. Mixfix operators are based on pattern matching, and would allow more than two operands. For example, the ternary operator ? : can be defined as a mixfix operator with three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations such as _ [ _ ]. Some holes could be made @autoclosure, and there might even be holes whose argument is represented as an AST, rather than a value or thunk, supporting advanced metaprogramming (for instance, F#'s code quotations <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).
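The idea of `@autoclosure` holes can already be seen in today's Swift with a plain function. The sketch below stands in for the pattern `_ ? _ : _`; `choose`, `record`, and `demo` are hypothetical names, and nothing here is a mixfix declaration.

```swift
// Hypothetical stand-in for the mixfix pattern `_ ? _ : _`.
// Both value "holes" are autoclosures, so only the chosen one is evaluated.
func choose<T>(_ condition: Bool,
               then first: @autoclosure () -> T,
               otherwise second: @autoclosure () -> T) -> T {
    return condition ? first() : second()
}

// Records which operand expressions actually ran.
func demo() -> (result: Int, evaluated: [String]) {
    var evaluated: [String] = []
    func record(_ name: String, _ value: Int) -> Int {
        evaluated.append(name)
        return value
    }
    let result = choose(true,
                        then: record("first", 1),
                        otherwise: record("second", 2))
    return (result, evaluated)
}
```

Calling `demo()` shows that the second operand is never evaluated when the condition is true, which is the laziness a mixfix declaration would need to express for the ternary operator.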

Diminish or remove the lexical distinction between operators and identifiers. If precedence and fixity applied to traditional identifiers as well as operators, it would be possible to incorporate ASCII equivalents for standard operators (e.g. "and" for &&, to allow "A and B"). If additionally combined with mixfix operator support, this might enable powerful DSLs (for instance, C#'s LINQ <https://en.wikipedia.org/wiki/Language_Integrated_Query>).



(plx) #9

+💯 on the emoji-related parts, +1 in general spirit, +1 for the identifier cleanup, -103 for being needlessly overly-restrictive for operators; net -1 overall.

Operator abuse is a social problem, and even if a technical fix is possible this isn’t that…and despite the messiness of the relevant unicode categories, this proposal goes far too far.

For operators, the reasonable thing to do at this time would be to hand-select a small subset of the mathematical characters to allow as operators—the “greatest hits” so to speak—and move on. If any grave oversights are discovered those characters can be included in subsequent major revisions; if the consortium ever finishes its recommendation it can be adopted at that time.

There’s no need to exhaustively re-do the consortium’s work and there’s no need to make a correct-for-all-time decision on each character at this time; pick the low-hanging fruit and leave the rest for later.

That not everyone will be perfectly happy with any specific subset is predictable and not interesting; not everyone is going to be perfectly happy with this proposal’s proposed subset, either.

In any case, I’d specifically hate to lose these:

- approximate equality: ≈
- set operations: ∩, ∪
- set relations: ⊂, ⊃, ⊄, ⊅, ⊆, ⊇, ⊈, ⊉, ⊊, ⊋
- set membership: ∌, ∋, ∈, ∉
- logical operators: ¬, ∧, ∨

…although there are many more that would be nice to keep available.



(Joe Groff) #10

This isn't entirely true. Swift's current identifier set derives from the C working group WG14's proposal N1518, "Recommendations for extended identifier characters for C and C++":

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html

which unfortunately isn't called out anywhere in the compiler docs except this old language reference:

https://github.com/apple/swift/blob/master/docs/archive/LangRefNew.rst#identifier-tokens

-Joe


On Oct 18, 2016, at 11:34 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

However, Swift's current identifier and operator character sets do not conform to any Unicode standards, nor have they been rationalized in the language or compiler documentation.


(Russ Bishop) #11

Strong -1 from me as currently written.

There is no reason to remove Emoji from identifiers, nor to restrict operators to ASCII only (especially since the corresponding UAX spec is still under construction). Emoji are just as much a part of modern communication as the Latin alphabet. Swift should not seek to restrict a user’s ability to express themselves.

Given the problems with operators, restricting Emoji from operators seems reasonable. Prohibiting non-printing characters also makes sense.

Russ

Refining Identifier and Operator Symbology

Proposal: SE-NNNN <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica Sadun <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>, Jonathan Shapiro
Review Manager: TBD
Status: Awaiting review
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#introduction>Introduction

This proposal seeks to refine and rationalize Swift's identifier and operator symbology. Specifically, this proposal:

   - adopts the Unicode recommendation for identifier characters, with some minor exceptions;
   - restricts the legal operator set to the current ASCII operator characters;
   - changes where dots may appear in operators; and
   - disallows Emoji from identifiers and operators.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#prior-discussion-threads--proposals>
Proposed solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of valid characters from ID_Start and ID_Continue. Normalize identifiers using Normalization Form C (NFC).

(For operators, no such recommendation currently exists, although active work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently operators. All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic characters. We are not attempting to prevent the addition or re-addition of characters in the future, but by paring the set down now, we require any future changes to pass the high bar of the Swift Evolution process.)

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#detailed-design>Detailed design

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers>Identifiers

Swift identifier characters will conform to UAX #31 <http://unicode.org/reports/tr31/#Conformance> as follows:

UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance described herein refers to the Unicode 9.0.0 version of UAX #31 (dated 2016-05-31 and retrieved 2016-10-09).

UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe the following requirements:

UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment the definition of "Default Identifiers" with the following profiles:

ID_Start and ID_Continue shall be used for Start and Continue (replacing XID_Start and XID_Continue). This excludes characters in Other_ID_Start and Other_ID_Continue.

_ 005F LOW LINE shall additionally be allowed as a Start character.

The emoji characters :dog: 1F436 DOG FACE and :cow: 1F42E COW FACE shall be allowed as Start and Continue characters.

(UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The join-control characters ZWJ and ZWNJ are strictly limited to the special cases A1, A2, and B described in UAX #31. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall consider two identifiers equivalent when they have the same normalized form under NFC <http://unicode.org/reports/tr15/>. (This requirement is covered in the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>.)

These changes <http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]> result in the removal of some 5,500 valid code points from the identifier characters, as well as hundreds of thousands of unassigned code points. (Though it does not appear on this unicode.org <http://unicode.org/> utility, which currently supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an identifier character.) Adopting ID_Start and ID_Continue does not add any new identifier characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes>Grammar changes

identifier-head → [:ID_Start:]
identifier-head → _ :dog: :cow:
identifier-character → identifier-head
identifier-character → [:ID_Continue:]
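
The identifier grammar above can be sketched in Swift as follows. This is only an illustration, not the compiler's lexer: the function names are hypothetical, and `isIDStart` / `isIDContinue` report the derived properties from whatever Unicode data the runtime ships, which may not match the Unicode 9 sets the proposal refers to. NFC normalization, the ZWJ/ZWNJ restrictions, and keywords are ignored here.

```swift
// Extra Start characters named by the proposal: LOW LINE, DOG FACE, COW FACE.
let extraIdentifierHeads: Set<Unicode.Scalar> = ["_", "\u{1F436}", "\u{1F42E}"]

// Hypothetical helper: identifier-head → [:ID_Start:] | _ | 🐶 | 🐮
func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    return extraIdentifierHeads.contains(scalar) || scalar.properties.isIDStart
}

// Hypothetical helper: identifier-character → identifier-head | [:ID_Continue:]
func isIdentifierCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return isIdentifierHead(scalar) || scalar.properties.isIDContinue
}

// True if the first scalar is a head and every later scalar is an identifier character.
func isProposedIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()
    guard let first = scalars.next(), isIdentifierHead(first) else { return false }
    while let next = scalars.next() {
        if !isIdentifierCharacter(next) { return false }
    }
    return true
}

print(isProposedIdentifier("_foo"))                // true
print(isProposedIdentifier("\u{1F436}\u{1F42E}"))  // true (the dog/cow exception)
print(isProposedIdentifier("1abc"))                // false (a digit cannot start an identifier)
print(isProposedIdentifier("\u{1F600}"))           // false (other emoji are excluded)
```
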
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#operators>Operators

Swift operator characters will be limited to only the following ASCII characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =, ->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are reserved.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#dots-in-operators>Dots in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve the .. operator, permitting the compiler to use .. for a "method cascade" syntax in the future, as supported by Dart <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

It helps avoid future lexical complications arising from lone .s.

It's a conservative approach, erring towards overly restrictive. Dropping the rule in future (thereby allowing single dots) may be possible.

It doesn't require special cases for existing infix dot operators in the standard library, ... (closed range) and ..< (half-open range). It also leaves the door open for the standard library to add analogous half-open and fully-open range operators <.. and <..<.

If we fail to adopt this rule now, then future backward-compatibility requirements will preclude the introduction of some potentially useful language enhancements.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
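
A minimal Swift sketch of the operator rules above, reading operator-characters as a sequence of one or more operator characters. The function name is hypothetical, and the check is purely lexical: it does not model the reserved tokens (=, ->, .., and so on) or prefix/postfix restrictions.

```swift
// The ASCII operator characters listed in the proposal.
let operatorCharacters: Set<Character> = [
    "!", "%", "&", "*", "+", "-", ".", "/", "<", "=", ">", "?", "^", "|", "~",
]

// Hypothetical helper: true if `text` uses only the permitted ASCII operator
// characters and every run of dots has length two or more.
func isProposedOperator(_ text: String) -> Bool {
    guard !text.isEmpty, text.allSatisfy({ operatorCharacters.contains($0) }) else {
        return false
    }
    var dotRun = 0  // length of the run of dots currently being scanned
    for character in text {
        if character == "." {
            dotRun += 1
        } else {
            if dotRun == 1 { return false }  // a lone dot just ended
            dotRun = 0
        }
    }
    return dotRun != 1  // reject a lone trailing dot
}

print(isProposedOperator("..<"))   // true
print(isProposedOperator("<..<"))  // true
print(isProposedOperator("<.<"))   // false
print(isProposedOperator(".+"))    // false (allowed today, rejected by the two-dot rule)
```
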
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and operators. Despite their novelty and utility, emoji characters introduce significant challenges to the language:

I understand removing Emoji from operators but I object to removing them from identifiers.

···

On Oct 18, 2016, at 11:34 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Their categorization into identifiers and operators is not semantically motivated, and is fraught with discrepancies.

Emoji characters are not displayed consistently and uniformly across different systems and fonts. Including all Unicode emoji <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]> introduces characters that don't render as emoji on Apple platforms without a variant selector, but which also wouldn't normally be used as identifier characters (e.g. ⏏ :black_small_square: :white_small_square:).

Some emoji nearly overlap with existing operator syntax: :exclamation:️:question::heavy_plus_sign::heavy_minus_sign::heavy_division_sign::heavy_multiplication_x:

Full emoji support necessitates handling a variety of use cases for joining characters and variant selectors, which would not otherwise be useful in most cases. It would be hard to avoid permitting sequences of characters which aren't valid emoji, or being overly restrictive and not properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit :dog: and :cow: in identifiers.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#source-compatibility>Source compatibility

This change is source-breaking in cases where developers have incorporated emoji or custom non-ASCII operators, or identifiers with characters which have been disallowed. This is unlikely to be a significant breakage for the majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the old parsing & identifier lookup code.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-abi-stability>Effect on ABI stability

This proposal does not affect the ABI format itself, although the Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531> affects the ABI of compiled modules.

The standard library will not be affected; it uses ASCII symbols with no combining characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-api-resilience>Effect on API resilience

This proposal doesn't affect API resilience.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#alternatives-considered>Alternatives considered

Define operator characters using Unicode categories such as Sm and So <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>. This approach would include many "non-operator-like" characters and doesn't seem to provide a significant benefit aside from a simpler definition.

Hand-pick a set of "operator-like" characters to include. The proposal authors tried this painstaking approach, and came up with a relatively agreeable set of about 650 code points <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]> (although this set would require further refinement), but ultimately felt the motivation for including non-ASCII operators is much lower than for identifiers, and the harm to readers/writers of programs outweighs their potential utility.

Use Normalization Form KC (NFKC) instead of NFC. The decision to use NFC comes from Normalize Unicode Identifiers proposal <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31 states:

Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate; whereas, if the programming language has case-insensitive identifiers, then Normalization Form KC is more appropriate.
NFKC may also produce surprising results; for example, "ſ" and "s" are equivalent under NFKC.
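
The difference can be seen with Foundation's normalization properties. This is a sketch under the assumption that identifier equivalence is decided by comparing normalized scalar sequences; the helper names are hypothetical and nothing here reflects how the compiler would implement it.

```swift
import Foundation

// Hypothetical helper: compare two spellings scalar by scalar after NFC.
func sameUnderNFC(_ a: String, _ b: String) -> Bool {
    return a.precomposedStringWithCanonicalMapping.unicodeScalars
        .elementsEqual(b.precomposedStringWithCanonicalMapping.unicodeScalars)
}

// Hypothetical helper: compare two spellings scalar by scalar after NFKC.
func sameUnderNFKC(_ a: String, _ b: String) -> Bool {
    return a.precomposedStringWithCompatibilityMapping.unicodeScalars
        .elementsEqual(b.precomposedStringWithCompatibilityMapping.unicodeScalars)
}

// Precomposed é versus e followed by COMBINING ACUTE ACCENT: equal under NFC.
print(sameUnderNFC("\u{00E9}", "e\u{0301}"))  // true

// LATIN SMALL LETTER LONG S versus s: distinct under NFC, merged under NFKC.
print(sameUnderNFC("\u{017F}", "s"))          // false
print(sameUnderNFKC("\u{017F}", "s"))         // true
```
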

Continue to allow single .s in operators, and perhaps even expand the original rule to allow them anywhere (even if the operator does not begin with .).

This would allow a wider variety of custom operators (for some interesting possibilities, see the operators in Haskell's Lens <https://github.com/ekmett/lens/wiki/Operators> package). However, there are a handful of potential complications:

Combining prefix or postfix operators with member access: foo*.bar would need to be parsed as foo *. bar rather than (foo*).bar. Parentheses could be required to disambiguate.

Combining infix operators with contextual members: foo*.bar would need to be parsed as foo *. bar rather than foo * (.bar). Whitespace or parentheses could be required to disambiguate.

Hypothetically, if operators were accessible as members such as MyNumber.+, allowing operators with single .s would require escaping operator names (perhaps with backticks, such as MyNumber.`+`).

This would also require operators of the form [!?]*\. (for example . ?. !. !!.) to be reserved, to prevent users from defining custom operators that conflict with member access and optional chaining.

We believe that requiring dots to appear in groups of at least two, while in some ways more restrictive, will prevent a significant amount of future pain, and does not require special-case considerations such as the above.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#future-directions>Future directions

While not within the scope of this proposal, the following considerations may provide useful context for the proposed changes. We encourage the community to pick up these topics when the time is right.

Re-expand operators to allow some non-ASCII characters. There is work in progress to update UAX #31 with definitions for "operator identifiers" — when this work is completed, it would be worth considering for Swift.

Introduce a syntax for method cascades. The Dart language supports method cascades <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>, whereby multiple methods can be called on an object within one expression: foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This syntax can also be used with assignments and subscripts. Such a feature might be very useful in Swift; this proposal reserves the .. operator so that it may be added in the future.

Introduce "mixfix" operator declarations. Mixfix operators are based on pattern matching, and would allow more than two operands. For example, the ternary operator ? : can be defined as a mixfix operator with three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix declarations such as _ [ _ ]. Some holes could be made @autoclosure, and there might even be holes whose argument is represented as an AST, rather than a value or thunk, supporting advanced metaprogramming (for instance, F#'s code quotations <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>).

Diminish or remove the lexical distinction between operators and identifiers. If precedence and fixity applied to traditional identifiers as well as operators, it would be possible to incorporate ASCII equivalents for standard operators (e.g. and for &&, to allow A and B). If additionally combined with mixfix operator support, this might enable powerful DSLs (for instance, C#'s LINQ <https://en.wikipedia.org/wiki/Language_Integrated_Query>).



(Georgios Moschovitis) #12

restricts the legal operator set to the current ASCII operator characters;

-1

disallows Emoji from identifiers and operators.

-1


(Chris Lattner) #13

I haven’t had a chance to read the entire proposal, nor the tons of great discussion down thread, but here are a few thoughts, just MHO:

- I’m loving that you’re taking a detail-oriented approach to the problem. I agree with you that our current approach is unprincipled, and we need to get this right for Swift 4.
- I think that it is perfectly fine to err on the side of conservatism: if it isn’t clear how to classify something (e.g. Braille patterns), we should just reject them in both operators and identifiers (make them be unassigned). If these unclear cases are important to someone, then we can consider (as a separate additive proposal) adding them back later.
- As to conservatism, explicitly reserving “..” (for possible future language directions) seems reasonable to me. Are there any other similar things we should consider reserving?

- I applaud the creativity keeping 🐶🐮 a valid identifier :-), but it is really missing the point. *All* of the non-symbol-like emoji should be valid in identifiers. With a quick unscientific look at Apple’s character picker, all the emojis other than a few in “Symbols” seem like they should be identifiers. It would be fine to conservatively leave all emoji “symbols” as unassigned.
- I really think we should keep symbols as operators, including much of the math symbols (e.g. ∪). In a later separate proposal, we can consider whether it makes sense for emoji symbols (like ✖️) to be usable as operators; I can see arguments both ways.

-Chris

···

On Oct 18, 2016, at 11:34 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of identifiers & operators. This includes some changes to the permitted Unicode characters.

The latest (perhaps final?) draft is available here:

    https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to the swift-evolution repo for a formal review. Full text follows below.


(Xiaodi Wu) #14

The restriction to ASCII operators need not be permanent. However, we were
unable to converge on a subset of mathematical symbols that we could
definitively consider to be operators in contradistinction to those not
included in that subset. Future Unicode recommendations on operators are
pending, and Swift can expand its operator characters accordingly in the
future.

Moreover, we do not know of any non-ASCII operators in the wild at present.
A branch of the Swift standard library tried out set algebra operators, but
that has not become the chosen API.

Finally, ASCII-only operators allow us to postpone design of more
sophisticated confusables checking to a later point. Unicode has seven or
eight varieties of forward slashes, at least several of which are plausible
and distinct operator characters, and figuring out how to deal with this
scenario would benefit from work from the Unicode Consortium that is still
pending.

···

On Wed, Oct 19, 2016 at 15:47 Jean-Denis Muys via swift-evolution <swift-evolution@swift.org> wrote:

Before and above anything else, if I read the proposal correctly, we will
not be able any more to use math operator signs as operators, beyond the
paltry half dozen or so in the ASCII character set???

I strongly oppose such a restriction. Maths symbols (including ∪) are
widely recognised in the scientific community and this change, IIUC, is
very hostile to any scientific computing.

Jean-Denis

On 19 Oct 2016, at 08:34, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of
identifiers & operators. This includes some changes to the permitted
Unicode characters.

The latest (perhaps final?) draft is available here:

https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to
the swift-evolution repo for a formal review. Full text follows below.

—Jacob Bandes-Storch, Xiaodi Wu, Erica Sadun, Jonathan Shapiro

Refining Identifier and Operator Symbology

   - Proposal: SE-NNNN
   <https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md>
   - Authors: Jacob Bandes-Storch <https://github.com/jtbandes>, Erica
   Sadun <https://github.com/erica>, Xiaodi Wu <https://github.com/xwu>,
   Jonathan Shapiro
   - Review Manager: TBD
   - Status: Awaiting review

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#introduction>
Introduction

This proposal seeks to refine and rationalize Swift's identifier and
operator symbology. Specifically, this proposal:

   - adopts the Unicode recommendation for identifier characters, with
   some minor exceptions;
   - restricts the legal operator set to the current ASCII operator
   characters;
   - changes where dots may appear in operators; and
   - disallows Emoji from identifiers and operators.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#prior-discussion-threads--proposals>Prior
discussion threads & proposals

   - Proposal: Normalize Unicode identifiers
   <https://github.com/apple/swift-evolution/pull/531>
   - Unicode identifiers & operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160912/027108.html>,
   with pre-proposal
   <https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59> (a
   precursor to this document)
   - Lexical matters: identifiers and operators
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160926/027479.html>
   - Proposal: Allow Single Dollar Sign as Valid Identifier
   <https://github.com/apple/swift-evolution/pull/354>
   - Free the '$' Symbol!
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>
   - Request to add middle dot (U+00B7) as operator character?
   <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#guiding-principles>Guiding
principles

Chris Lattner has written:

…our current operator space (particularly the unicode segments covered) is
not super well considered. It would be great for someone to take a more
systematic pass over them to rationalize things.

We need a token to be unambiguously an operator or identifier - we can
have different rules for the leading and subsequent characters though.

…any proposal that breaks:

let :dog::cow: = "moof"

will not be tolerated. :slight_smile: :slight_smile:

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#motivation>
Motivation

By supporting custom Unicode operators and identifiers, Swift attempts to
accommodate programmers and programming styles from many languages and
cultures. It deserves a well-thought-out specification of which characters
are valid. However, Swift's current identifier and operator character sets
do not conform to any Unicode standards, nor have they been rationalized in
the language or compiler documentation.

Identifiers, which serve as *names* for various entities, are linguistic
in nature and must permit a variety of characters to properly serve
non–English-speaking coders. This issue has been considered by the
communities of many programming languages already, and the Unicode
Consortium has published recommendations on how to choose identifier
character sets — Swift should make an effort to conform to these
recommendations.

Operators, on the other hand, should be rare and carefully chosen, because
they suffer from low discoverability and difficult readability. They are by
nature *symbols*, not names. This places a cognitive cost on users with
respect to both recall ("What is the operator that applies the behavior I
need?") and recognition ("What does the operator in this code do?"). While
*almost every* nontrivial program defines many new identifiers, most programs
do not define new operators.

As operators become more esoteric or customized, the cognitive cost rises.
Recognizing a function name like formUnion(with:) is simpler for many
programmers than recalling what the ∪ operator does. Swift's current
operator character set includes many characters that aren't traditional and
recognizable operators — this encourages problematic and frivolous uses in
an otherwise safe language.

Today, there are many discrepancies and edge cases motivating these
changes:

   - · is an identifier, while • is an operator.
   - The Greek question mark ; is a valid identifier.
   - Braille patterns ⠟ seem letter-like, but are operator characters.
   - :slightly_smiling_face::metal::arrow_forward:️:small_airplane:🂡 are identifiers, while :frowning:️:v:️:arrow_up_small::airplane:️:spades:️ are operators.
   - Some *non-combining* diacritics ´ ¨ ꓻ are valid in identifiers.
   - Some completely non-linguistic characters, such as ۞ and ༒, are
   valid in identifiers.
   - Some symbols such as ⚄ and ♄ are operators, despite not really being
   "operator-like".
   - A small handful of characters 〡〢〣〤〥〦〧〨〩 〪 〫 〬 〭 〮 〯 are valid in both identifiers
   and operators.
   - Some non-printing characters such as 2064 INVISIBLE PLUS and 200B
   ZERO WIDTH SPACE are valid identifiers.
   - Currency symbols are split across operators (¢ £ ¤ ¥) and
   identifiers ($ ₪ € ₱ ₹ ฿ ...).

This matter should be considered in a near timeframe (Swift 3.1 or 4) as
it is both fundamental to Swift and will produce source-breaking changes.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#precedent-in-other-languages>Precedent
in other languages

Haskell distinguishes identifiers/operators by their general category
<http://www.fileformat.info/info/unicode/category/index.htm> such as "any
Unicode lowercase letter", "any Unicode symbol or punctuation", and so
forth. Identifiers can start with any lowercase letter or _, and may
contain any letter/digit/'/_. This includes letters like δ and Я, and
digits like ٢.

   - Haskell Syntax Reference
   <https://www.haskell.org/onlinereport/syntax-iso.html>
   - Haskell Lexer
   <https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

Scala similarly allows letters, numbers, $, and _ in identifiers,
distinguishing by general categories Ll, Lu, Lt, Lo, and Nl. Operator
characters include mathematical and other symbols (Sm and So) in addition
to other ASCII symbol characters.

   - Scala Lexical Syntax
   <http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#lexical-syntax>

ECMAScript 2015 ("ES6") uses ID_Start and ID_Continue, as well as
Other_ID_Start / Other_ID_Continue, for identifiers.

   - ECMAScript Specification: Names and Keywords
   <http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Python 3 uses XID_Start and XID_Continue.

   - The Python Language Reference: Identifiers and Keywords
   <https://docs.python.org/3/reference/lexical_analysis.html#grammar-token-identifier>
   - PEP 3131: Supporting Non-ASCII Identifiers
   <https://www.python.org/dev/peps/pep-3131/>

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#proposed-solution>Proposed
solution

For identifiers, adopt the recommendations made in UAX #31 Identifier and
Pattern Syntax <http://unicode.org/reports/tr31/>, deriving the sets of
valid characters from ID_Start and ID_Continue. Normalize identifiers
using Normalization Form C (NFC).

(For operators, no such recommendation currently exists, although active
work is in progress to update UAX #31 to address "operator identifiers".)

Restrict operators to those ASCII characters which are currently
operators. All other operator characters are removed from the language.

Allow dots in operators in any location, but only in runs of two or more.

(Overall, this proposal is aggressive in its removal of problematic
characters. We are not attempting to prevent the addition or re-addition of
characters in the future, but by paring the set down now, we require any
future changes to pass the high bar of the Swift Evolution process.)

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#detailed-design>Detailed
design
<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers>
Identifiers

Swift identifier characters will conform to UAX #31
<http://unicode.org/reports/tr31/#Conformance> as follows:

   - UAX31-C1. <http://unicode.org/reports/tr31/#C1> The conformance
   described herein refers to the Unicode 9.0.0 version of UAX #31 (dated
   2016-05-31 and retrieved 2016-10-09).
   - UAX31-C2. <http://unicode.org/reports/tr31/#C2> Swift shall observe
   the following requirements:
      - UAX31-R1. <http://unicode.org/reports/tr31/#R1> Swift shall augment
      the definition of "Default Identifiers" with the following profiles:
         1. ID_Start and ID_Continue shall be used for Start and Continue
         (replacing XID_Start and XID_Continue). This excludes characters
         in Other_ID_Start and Other_ID_Continue.
         2. _ 005F LOW LINE shall additionally be allowed as a Start
         character.
         3. The emoji characters :dog: 1F436 DOG FACE and :cow: 1F42E COW FACE
         shall be allowed as Start and Continue characters.
         4. (UAX31-R1a. <http://unicode.org/reports/tr31/#R1a>) The
         join-control characters ZWJ and ZWNJ are strictly limited to the special
         cases A1, A2, and B described in UAX #31. (This requirement is covered in
         the Normalize Unicode Identifiers proposal
         <https://github.com/apple/swift-evolution/pull/531>.)
      - UAX31-R4. <http://unicode.org/reports/tr31/#R4> Swift shall
      consider two identifiers equivalent when they have the same normalized form
      under NFC <http://unicode.org/reports/tr15/>. (This requirement is
      covered in the Normalize Unicode Identifiers proposal
      <https://github.com/apple/swift-evolution/pull/531>.)

These changes
<http://unicode.org/cldr/utility/unicodeset.jsp?a=[[a-zA-Z_\u00A8\u00AA\u00AD\u00AF\u00B2-\u00B5\u00B7-\u00BA\u00BC-\u00BE\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02FF\u0370-\u167F\u1681-\u180D\u180F-\u1DBF\u1E00-\u1FFF\u200B-\u200D\u202A-\u202E\u203F-\u2040\u2054\u2060-\u206F\u2070-\u20CF\u2100-\u218F\u2460-\u24FF\u2776-\u2793\u2C00-\u2DFF\u2E80-\u2FFF\u3004-\u3007\u3021-\u302F\u3031-\u303F\u3040-\uD7FF\uF900-\uFD3D\uFD40-\uFDCF\uFDF0-\uFE1F\uFE30-\uFE44\uFE47-\uFFFD\U00010000-\U0001FFFD\U00020000-\U0002FFFD\U00030000-\U0003FFFD\U000E0000-\U000EFFFD][0-9\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]]&b=[[:ID_Continue:]\U0001F436\U0001F42E]> result
in the removal of some 5,500 valid code points from the identifier
characters, as well as hundreds of thousands of unassigned code points.
(Though it does not appear on this unicode.org utility, which currently
supports only Unicode 8 data, the · 00B7 MIDDLE DOT is no longer an
identifier character.) Adopting ID_Start and ID_Continue does not add any
new identifier characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes>Grammar
changes

identifier-head → [:ID_Start:]
identifier-head → _ :dog: :cow:
identifier-character → identifier-head
identifier-character → [:ID_Continue:]

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#operators>
Operators

Swift operator characters will be limited to only the following ASCII
characters:

! % & * + - . / < = > ? ^ | ~

The current restrictions on reserved tokens and operators will remain: =,
->, //, /*, */, ., ?, prefix <, prefix &, postfix >, and postfix ! are
reserved.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#dots-in-operators>Dots
in operators

The current requirements for dots in operator names are:

If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.

This proposal changes the rule to:

Dots may only appear in operators in runs of two or more.

Under the revised rule, ..< and ... are allowed, but <.< is not. We also reserve
the .. operator, permitting the compiler to use .. for a "method cascade"
syntax in the future, as supported by Dart
<http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>.

Motivations for incorporating the two-dot rule are:

   -

   It helps avoid future lexical complications arising from lone .s.
   -

   It's a conservative approach, erring on the side of being overly
   restrictive. Dropping the rule in the future (thereby allowing single
   dots) may be possible.
   -

   It doesn't require special cases for existing infix dot operators in
   the standard library, ... (closed range) and ..< (half-open range). It
   also leaves the door open for the standard library to add analogous
   half-open and fully-open range operators <.. and <..<.
   -

   If we fail to adopt this rule now, then future backward-compatibility
   requirements will preclude the introduction of some potentially useful
   language enhancements.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#grammar-changes-1>Grammar
changes

operator → operator-head operator-characters[opt]

operator-head → ! % & * + - / < = > ? ^ | ~
operator-head → operator-dot operator-dots
operator-character → operator-head
operator-characters → operator-character operator-characters[opt]

operator-dot → .
operator-dots → operator-dot operator-dots[opt]
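One way to read the operator rules is as a simple lexical check: every character must come from the ASCII operator set, and every run of dots must have length two or more. The sketch below follows the English rule; the names are hypothetical, and it does not model the reserved tokens (such as =, ->, or ..) or fixity-dependent restrictions.

```swift
// The proposed ASCII operator characters, including the dot.
let proposedOperatorCharacters: Set<Character> = Set("!%&*+-./<=>?^|~")

// Hypothetical check of the lexical form only; reserved tokens are not handled.
func isLexicallyValidOperator(_ text: String) -> Bool {
    guard !text.isEmpty else { return false }
    var dotRun = 0  // length of the run of dots currently being scanned
    for character in text {
        guard proposedOperatorCharacters.contains(character) else { return false }
        if character == "." {
            dotRun += 1
        } else {
            // A run that just ended must not have been a lone dot.
            if dotRun == 1 { return false }
            dotRun = 0
        }
    }
    // The operator must not end with a lone dot either.
    return dotRun != 1
}

for candidate in ["..<", "...", "<..<", "<.<", ".+"] {
    print(candidate, isLexicallyValidOperator(candidate))
}
```

With this reading, ..<, ... and <..< pass, while <.< and .+ are rejected because each contains a single dot.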

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>
Emoji

If adopted, this proposal eliminates emoji from Swift identifiers and
operators. Despite their novelty and utility, emoji characters introduce
significant challenges to the language:

   -

   Their categorization into identifiers and operators is not
   semantically motivated, and is fraught with discrepancies.
   -

   Emoji characters are not displayed consistently and uniformly across
   different systems and fonts. Including all Unicode emoji
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AEmoji%3A]> introduces
   characters that don't render as emoji on Apple platforms without a variant
   selector, but which also wouldn't normally be used as identifier characters
   (e.g. ⏏ :black_small_square: :white_small_square:).
   -

   Some emoji nearly overlap with existing operator syntax: :exclamation:️:question::heavy_plus_sign::heavy_minus_sign::heavy_division_sign::heavy_multiplication_x:
   -

   Full emoji support necessitates handling a variety of use cases for
   joining characters and variant selectors, which would not otherwise be
   useful in most cases. It would be hard to avoid permitting sequences of
   characters which aren't valid emoji, or being overly restrictive and not
   properly supporting emoji introduced in future versions of Unicode.

As an exception, in homage to Swift's origins, we permit :dog: and :cow: in
identifiers.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#source-compatibility>Source
compatibility

This change is source-breaking in cases where developers have incorporated
emoji or custom non-ASCII operators, or identifiers with characters which
have been disallowed. This is unlikely to be a significant breakage for the
majority of serious Swift code.

Code using the middle dot · in identifiers may be slightly more common. · is
now disallowed entirely.

Diagnostics for invalid characters are already produced today. We can
improve them easily if needed.

Maintaining source compatibility for Swift 3 should be easy: just keep the
old parsing & identifier lookup code.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-abi-stability>Effect
on ABI stability

This proposal does not affect the ABI format itself, although the Normalize
Unicode Identifiers proposal
<https://github.com/apple/swift-evolution/pull/531> affects the ABI of
compiled modules.

The standard library will not be affected; it uses ASCII symbols with no
combining characters.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#effect-on-api-resilience>Effect
on API resilience

This proposal doesn't affect API resilience.

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#alternatives-considered>Alternatives
considered

   -

   Define operator characters using Unicode categories such as Sm and So
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ASm%3A][%3ASo%3A]]>.
   This approach would include many "non-operator-like" characters and doesn't
   seem to provide a significant benefit aside from a simpler definition.
   -

   Hand-pick a set of "operator-like" characters to include. The proposal
   authors tried this painstaking approach, and came up with a relatively
   agreeable set of about 650 code points
   <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[!\%24%\%26*%2B\-%2F<%3D>%3F\^|~ \u00AC \u00B1 \u00B7 \u00D7 \u00F7 \u2208-\u220D \u220F-\u2211 \u22C0-\u22C3 \u2212-\u221D \u2238 \u223A \u2240 \u228C-\u228E \u2293-\u22A3 \u22BA-\u22BD \u22C4-\u22C7 \u22C9-\u22CC \u22D2-\u22D3 \u2223-\u222A \u2236-\u2237 \u2239 \u223B-\u223E \u2241-\u228B \u228F-\u2292 \u22A6-\u22B9 \u22C8 \u22CD \u22D0-\u22D1 \u22D4-\u22FF \u22CE-\u22CF \u2A00-\u2AFF \u27C2 \u27C3 \u27C4 \u27C7 \u27C8 \u27C9 \u27CA \u27CE-\u27D7 \u27DA-\u27DF \u27E0-\u27E5 \u29B5-\u29C3 \u29C4-\u29C9 \u29CA-\u29D0 \u29D1-\u29D7 \u29DF \u29E1 \u29E2 \u29E3-\u29E6 \u29FA \u29FB \u2308-\u230B \u2336-\u237A \u2395]> (although
   this set would require further refinement), but ultimately felt the
   motivation for including non-ASCII operators is much lower than for
   identifiers, and the harm to readers/writers of programs outweighs their
   potential utility.
   -

   Use Normalization Form KC (NFKC) instead of NFC. The decision to use
   NFC comes from the Normalize Unicode Identifiers proposal
   <https://github.com/apple/swift-evolution/pull/531>. Also, UAX #31
   states:

   Generally if the programming language has case-sensitive identifiers,
   then Normalization Form C is appropriate; whereas, if the programming
   language has case-insensitive identifiers, then Normalization Form KC is
   more appropriate.

   NFKC may also produce surprising results; for example, "ſ" and "s" are
   equivalent under NFKC.
   -

   Continue to allow single .s in operators, and perhaps even expand the
   original rule to allow them anywhere (even if the operator does not begin
   with .).

   This would allow a wider variety of custom operators (for some
   interesting possibilities, see the operators in Haskell's Lens
   <https://github.com/ekmett/lens/wiki/Operators> package). However,
   there are a handful of potential complications:
   -

      Combining prefix or postfix operators with member access: foo*.bar would
      need to be parsed as foo *. bar rather than (foo*).bar. Parentheses
      could be required to disambiguate.
      -

      Combining infix operators with contextual members: foo*.bar would
      need to be parsed as foo *. bar rather than foo * (.bar).
      Whitespace or parentheses could be required to disambiguate.
      -

      Hypothetically, if operators were accessible as members such as
      MyNumber.+, allowing operators with single .s would require
      escaping operator names (perhaps with backticks, such as
      MyNumber.`+`).

   This would also require operators of the form [!?]*\. (for example . ?.
    !. !!.) to be reserved, to prevent users from defining custom
   operators that conflict with member access and optional chaining.

   We believe that requiring dots to appear in groups of at least two,
   while in some ways more restrictive, will prevent a significant amount of
   future pain, and does not require special-case considerations such as the
   above.
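The NFKC alternative in the list above can be illustrated with the "ſ" versus "s" case. This is a sketch that assumes Foundation's `precomposedStringWithCanonicalMapping` (NFC) and `precomposedStringWithCompatibilityMapping` (NFKC); the helper names are hypothetical.

```swift
import Foundation

// Hypothetical helpers: compare two spellings after normalizing both sides.
func sameUnderNFC(_ lhs: String, _ rhs: String) -> Bool {
    Array(lhs.precomposedStringWithCanonicalMapping.unicodeScalars)
        == Array(rhs.precomposedStringWithCanonicalMapping.unicodeScalars)
}

func sameUnderNFKC(_ lhs: String, _ rhs: String) -> Bool {
    Array(lhs.precomposedStringWithCompatibilityMapping.unicodeScalars)
        == Array(rhs.precomposedStringWithCompatibilityMapping.unicodeScalars)
}

let longS = "\u{17F}"  // ſ, LATIN SMALL LETTER LONG S

print(sameUnderNFC(longS, "s"))   // false: NFC keeps the two letters distinct
print(sameUnderNFKC(longS, "s"))  // true: NFKC folds the long s into "s"
```

With NFKC, two identifiers that look different in source would silently name the same entity, which is the kind of surprise the NFC choice avoids.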

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#future-directions>Future
directions

While not within the scope of this proposal, the following considerations
may provide useful context for the proposed changes. We encourage the
community to pick up these topics when the time is right.

   -

   Re-expand operators to allow some non-ASCII characters. There is work
   in progress to update UAX #31 with definitions for "operator identifiers" —
   when this work is completed, it would be worth considering for Swift.
   -

   Introduce a syntax for method cascades. The Dart language supports method
   cascades
   <http://news.dartlang.org/2012/02/method-cascades-in-dart-posted-by-gilad.html>,
   whereby multiple methods can be called on an object within one expression:
   foo..bar()..baz() effectively performs foo.bar(); foo.baz(). This
   syntax can also be used with assignments and subscripts. Such a feature
   might be very useful in Swift; this proposal reserves the .. operator
   so that it may be added in the future.
   -

   Introduce "mixfix" operator declarations. Mixfix operators are based
   on pattern matching, and would allow more than two operands. For example,
   the ternary operator ? : can be defined as a mixfix operator with
   three "holes": _ ? _ : _. Subscripts might be subsumed by mixfix
   declarations such as _ [ _ ]. Some holes could be made @autoclosure,
   and there might even be holes whose argument is represented as an AST,
   rather than a value or thunk, supporting advanced metaprogramming (for
   instance, F#'s code quotations
   <https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/code-quotations>
   ).
   -

   Diminish or remove the lexical distinction between operators and
   identifiers. If precedence and fixity applied to traditional
   identifiers as well as operators, it would be possible to incorporate ASCII
   equivalents for standard operators (e.g. and for &&, to allow A and B).
   If additionally combined with mixfix operator support, this might enable
   powerful DSLs (for instance, C#'s LINQ
   <https://en.wikipedia.org/wiki/Language_Integrated_Query>).

_______________________________________________

swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution



(Jonathan S. Shapiro) #15

Benjamin:

The situation "behind the scenes" is that I've been working with Mark Davis
to add Unicode standard properties for operator start and operator continue
character sets in Unicode UAX31. That's a process whose scope needs to be
broader than just Swift, and it's something that Swift will want to be
compatible with. I think the intention would be to adopt that new part of
UAX31 as soon as practical, and I am hopeful that specification will meet
your needs and objectives. If not, I'd very much like to pick up that
conversation with you offline to see how we can improve matters in UAX31.

The UAX31 discussion seems to be converging rapidly. The proposal here is
to *temporarily* limit operator identifiers to the ASCII operator
characters. This is mainly intended to provide a bridge solution until
UAX31 changes can be published in draft form. One reason to take a
temporary step back is to ensure that we do not unintentionally specify
something now that will become incompatible as soon as the UAX31 draft
emerges.

Changes to the operator identifier space are well-localized in the compiler
implementation, and don't have any large-scale impact on later passes. They
are one of the few kinds of compiler changes that can safely be made late
in a development cycle. If this part of UAX31 converges as quickly as I
expect, I think we can get that result reflected into Swift 4, and we can
get a draft version implemented sooner.

Jonathan


On Wed, Oct 19, 2016 at 4:09 AM, Benjamin Spratling via swift-evolution < swift-evolution@swift.org> wrote:

Some extremely short-sighted points about deleting my formal operators
that are widely recognized as operators, and that I’ve spent months adding
into my code. Frankly, I just couldn’t upgrade until you put them back in.


(Jonathan S. Shapiro) #16

The intent is that this is allowed. Your alternative grammar captures the
intent correctly.


On Wed, Oct 19, 2016 at 10:12 AM, Alex Martini via swift-evolution < swift-evolution@swift.org> wrote:

<https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md#emoji>

I think there's a mismatch between the English and grammar. For example,
is +..+ allowed or not?


(Jonathan S. Shapiro) #17

If I’m reading the proposal and discussion properly, the group has not
been able to reach consensus on the right criteria for operator symbols,
but is hopeful that will be possible after the Unicode Consortium completes
its work. I think it would be far better to defer the changes to valid
operator symbols until that time (removing only symbols which are currently
treated as operators but which the proposal suggests should be
available for identifiers instead).

Beginning with Swift 4, there will be a major push to ensure that backwards
compatibility with existing code is not broken. It will be possible to
expand the operator character set, but very difficult to shrink it.

Given the current state of the discussion over in Unicode land, I think it
would probably be safe from a compatibility standpoint to admit code points
that fall into the following (Unicode-style) code point set:

[:S:] - [:Sc:] - [:xidcontinue:] - [:nfcqc=n:] & [:scx=Common:] -
pictographics - emoji

into operator characters. In English, this would be:

All symbols excluding currency symbols, provided they are not already in
regular identifiers, requiring that they are legal under NFC normalization
and also that they live in the Common script.

Explicitly exclude pictographics and emojis, not as a value judgment of
UAX31, but because different languages seem to be choosing to go different
ways about whether these are part of normal identifiers or operator
identifiers.
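A partial approximation of that set can be written with the standard library's scalar properties. This is a sketch under stated assumptions: the function name is hypothetical, the NFC quick-check and the script-extension test are not modeled (the standard library does not expose them), and "pictographics" and "emoji" are both approximated by the `isEmoji` property.

```swift
// Hypothetical approximation of: [:S:] - [:Sc:] - [:xidcontinue:] - emoji.
// Not modeled here: [:nfcqc=n:] and the [:scx=Common:] restriction.
func isCandidateOperatorScalar(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    switch properties.generalCategory {
    case .mathSymbol, .modifierSymbol, .otherSymbol:
        break          // symbols other than currency symbols
    default:
        return false   // everything else, including .currencySymbol
    }
    return !properties.isXIDContinue && !properties.isEmoji
}

for scalar in ["\u{2264}", "\u{222A}", "$", "a", "\u{1F436}"] as [Unicode.Scalar] {
    print(scalar, isCandidateOperatorScalar(scalar))
}
```

Under these assumptions ≤ and ∪ are admitted, while $ (a currency symbol), a (a letter) and the dog emoji are not.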

Similar rationale for currency symbols, though I personally believe those
should be operators rather than regular identifiers.

It's possible that other things will go into UAX31, but it's very hard to
imagine that anything in the set above will end up getting excluded. In
particular, there is some inclination to add some punctuation symbols in
UAX31, but that's going to take some work to ensure that we don't make a
mess inadvertently.

As a transitional matter, I think it would be conservatively safe to add
the code points identified above. Note that it's important to exclude ASCII
code points that are currently "punctuation reserved words". In Swift this
(at least) includes:

. (period, when it does not appear [at least] two times in sequence)
; (Semicolon)
: (Full colon)
$ (Dollar sign - used in special identifiers, which I consider a flaw)
any and all brackets (for now).

IMO, the best argument against using unicode symbols for operators defined
by mathematics is that they are currently difficult to type.

And there is no realistic hope of that changing. This issue is so
compelling that C and C++ introduced standardized ASCII text alternatives
for the punctuation operators to relieve stress on non-English keyboard
users.

This is an argument with a limited lifespan and should not carry more
weight than it deserves in the design of a language positioned to be the
language for the next 20 years. I strongly believe that removing them,
even temporarily, is a mistake.

I think it's good to be a little conservative given the fact that the issue
is more broadly "in flight". That said, I personally believe that the
current proposal has cut back too far.

Jonathan


On Wed, Oct 19, 2016 at 6:41 AM, Matthew Johnson via swift-evolution < swift-evolution@swift.org> wrote:


(Erica Sadun) #18

It's more practical to make breaking changes now and introduce the "right set" (that is, a standards-based set of mathematical operators) at a future date, than to justify keeping things as is and removing operators at a future date.

-- E


On Oct 19, 2016, at 7:41 AM, Matthew Johnson via swift-evolution <swift-evolution@swift.org> wrote:

I very much support the proposal to rationalize our handling of identifier characters.

I also support doing something similar for operator symbols. However, I agree with feedback from others that this proposal goes way too far in removing our ability to use mathematical operators.

If I’m reading the proposal and discussion properly, the group has not been able to reach consensus on the right criteria for operator symbols, but is hopeful that will be possible after the Unicode Consortium completes its work. I think it would be far better to defer the changes to valid operator symbols until that time (removing only symbols which are currently treated as operators but which the proposal suggests should be available for identifiers instead).


(Xiaodi Wu) #19

Well, this is a very valuable contribution to the discussion. What
non-ASCII operators are you currently using in Swift code? How did you
decide on those operators instead of ASCII ones? Obviously, we would want
to enable as many operators as possible to continue functioning.

There is, however, a very strong argument for restricting operator
characters to ASCII. I'm going to quote from Erica Sadun, who's put this
much better than I can:

[begin quote]

• Operators suffer from low discoverability and difficult readability. They
use symbols, not names. This places a cognitive cost on users with respect
to both recall ("What is the operator that applies the behavior I need?")
and recognition ("What does the operator in this code do?").
• This cost is obviously highest when symbols are not tied to conventional
standards like `∪` for union and `⊇` for superset. `∪` is a standard,
mathematical representation. It’s widely accepted and widely used. Even so,
recognizing `formUnion(with:)` may work better for many coders than
recalling what the `∪` (or, worse, `⊇`) operator does, even when you end up
having to create suites of specialized selectors. As operators become more
self-defined or esoteric, costs rise.

[end quote]
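As a rough illustration of the recall-versus-recognition trade-off in the quote, the sketch below declares a `∪` operator over `Set` next to the named method it wraps. The operator and its precedence group are assumptions for illustration; the standard library does not define it.

```swift
// Hypothetical custom operator; AdditionPrecedence is an arbitrary choice.
infix operator ∪ : AdditionPrecedence

func ∪ <Element: Hashable>(lhs: Set<Element>, rhs: Set<Element>) -> Set<Element> {
    lhs.union(rhs)
}

let evens: Set = [2, 4, 6]
let primes: Set = [2, 3, 5]

let viaOperator = evens ∪ primes      // symbol: compact, but must be recalled and typed
let viaMethod = evens.union(primes)   // name: longer, but discoverable via completion

print(viaOperator == viaMethod)
```

Both spellings compute the same set; the difference is only in how a reader or writer finds and recognizes the operation.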

As to your specific example, there are indeed good reasons why it is not
unreasonable to jettison support for, say, less-than-or-equal-to. For one,
even if you have a configurable keyboard, every reasonable keyboard that
could have the less-than-or-equal-to symbol will also have < and =, and <=
is the standard operator in Swift for that concept.

As for emoji, their not being included is based on the reasoning that they
are not required to support any real-world language; removal of "moof" is
not a dealbreaker.


On Wed, Oct 19, 2016 at 19:09 Benjamin Spratling via swift-evolution < swift-evolution@swift.org> wrote:

Howdy,
Some good points about standardizing identifiers.
Some extremely short-sighted points about deleting my formal operators that
are widely recognized as operators, and that I’ve spent months adding into
my code. Frankly, I just couldn’t upgrade until you put them back in.

Operators

Swift operator characters will be limited to only the following ASCII
characters:

! % & * + - . / < = > ? ^ | ~

Mathematicians, scientists, and engineers have an easier time
catching errors when the code on their screen looks more like what they
write on paper. That is the only good reason to leave sin() as a global
function instead of a computed property. Obviously, we don’t have 2D
layout in Swift, but finally using the right operator characters instead of
the ridiculous ascii hacks was a breath of fresh air Swift breathed into my
code. The state of operators in C languages was abysmal, and its legacy is
still here. Take the blinders off for a moment and realize that
“repetition” isn’t a great semantic: “&&” and “===”. They're a side effect
of the hardware & character encoding sets available to developers in past
decades, not a goal for the future. Sure, we don’t have screens on every
key so I can set up my own domain specific operator character sets without
having to scroll through a giant list of unused characters, but finally the
second barrier had fallen. And at least there are prototypes and rumors of
those keyboards out in the wild.

There’s just no good reason to make
≤ ≥ ≠ ±
not valid operators.
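For context, this is roughly what such declarations look like in Swift as it stands before the proposed restriction. The operator definitions below are hypothetical user code, not part of the standard library, and they would be rejected if operators were limited to ASCII.

```swift
// Hypothetical user-defined operators mirroring the ASCII <= and !=.
infix operator ≤ : ComparisonPrecedence
infix operator ≠ : ComparisonPrecedence

func ≤ <T: Comparable>(lhs: T, rhs: T) -> Bool {
    lhs <= rhs
}

func ≠ <T: Equatable>(lhs: T, rhs: T) -> Bool {
    lhs != rhs
}

let tolerance = 0.5
let error = 0.25
print(error ≤ tolerance)   // same result as error <= tolerance
print(error ≠ tolerance)   // same result as error != tolerance
```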

“in homage to Swift's origins, we permit :dog: and :cow: in identifiers.”

That’s a blatant attempt at a cheat. Wrong answer.

It’s true there are inconsistencies of the choice of whether a particular
symbol is an operator or identifier, but I’d rather resolve that instead of
blow everything away.

- - From me

-Ben


(Rob Mayoff) #20

I'd add ≤ ≥ ≠ to that set.


On Wed, Oct 19, 2016 at 10:47 AM, plx via swift-evolution < swift-evolution@swift.org> wrote:

In any case, I’d specifically hate to lose these:

- approximate equality: ≈
- set operations: ∩, ∪
- set relations: ⊂, ⊃, ⊄, ⊅, ⊆, ⊇, ⊈, ⊉, ⊊, ⊋
- set membership: ∌, ∋, ∈, ∉
- logical operators: ¬, ∧, ∨