Unicode identifiers & operators


(Jacob Bandes-Storch) #1

*TL;DR:*

Swift 4 Stage 1 seeks to prioritize "Source stability features". Most
source-breaking changes were made in Swift 3; however, the
categorization of Unicode characters into identifiers & operators was never
thoroughly discussed on swift-evolution. This seems like it might be our
last chance, and I think there are some big improvements to be had.

I've gathered some information+thoughts into an early-stage pitch /
pre-proposal. It doesn't really have a conclusion, so I'm hoping we can
discuss these issues and come up with good (pragmatic) solutions here. I
imagine this can morph into a proposal later.

You can read the following in nicer HTML form at
https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59

I look forward to the discussion!

-Jacob

*# Background and motivation*

To ease lexing/parsing and avoid user confusion, the names of custom
identifiers (type names, variable names, etc.) and operators in Swift can
be composed of (mostly) separate sets of characters.

Using terminology from TSPL:

`identifier-head`/`operator-head` are characters which can *begin* an
identifier or operator.

`identifier-character`/`operator-character` are characters which can appear
anywhere in an identifier or operator (these are supersets of the `-head`
sets).

<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html>

(Note also that some particular arrangements of characters are reserved;
for instance, `$` followed by digits for an implicit closure parameter, and
"If an operator doesn’t begin with a dot, it can’t contain a dot
elsewhere." There are also special characters in the language which are
neither identifiers nor operators, such as: `()[]{},:@#`)
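As a concrete illustration of the rules above, here is a minimal sketch in Swift 3-style syntax. The operator names `+++` and `.+.`, and the functions behind them, are made up for this example and are not part of the standard library.

```swift
// An operator name built only from operator characters.
infix operator +++
func +++ (lhs: Int, rhs: Int) -> Int {
    return lhs + rhs + 1
}

// An operator that begins with a dot may contain further dots.
infix operator .+.
func .+. (lhs: [Int], rhs: [Int]) -> [Int] {
    return zip(lhs, rhs).map { pair in pair.0 + pair.1 }
}

// `$` followed by digits names an implicit closure parameter.
let doubled = [1, 2, 3].map { $0 * 2 }

// Identifiers draw on the separate identifier character set,
// which includes non-ASCII letters such as δ.
let δ = 1 +++ 2
let sums = [1, 2] .+. [10, 20]
```

Because the two character sets are (mostly) disjoint, the lexer can tell from the first character whether it is reading an identifier or an operator.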

*## Prior discussion on swift-evolution*

*"Request to add middle dot (U+00B7) as operator character?"*
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

*"Free the '$' Symbol!"*
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>

*"Proposal: Allow Single Dollar Sign as Valid Identifier"*
<https://github.com/apple/swift-evolution/pull/354>

Chris Lattner has said:

"...our current operator space (particularly the unicode segments
covered) is not super well considered. It would be great for someone to
take a more systematic pass over them to rationalize things."

"We need a token to be unambiguously an operator or identifier - we can
have different rules for the leading and subsequent characters though."

*# Current state of affairs*

Swift's `identifier-head` and `identifier-character` mostly conform to the
recommendations in <
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html>
<https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489>

The allowed operator characters include "Unicode math, symbol, arrow,
dingbat, and line/box drawing chars"; however, I don't believe this aligns
with any particular spec:
<
https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121>

<https://github.com/apple/swift/commit/a2341a4>

*## Identifiers/operators elsewhere*

There is a Unicode Standard Annex, "Unicode Identifier and Pattern Syntax"
(UAX #31) <http://unicode.org/reports/tr31/>, which defines the properties
`ID_Start`/`ID_Continue`.

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AID_Continue%3A]>

*### ECMAScript 2015 "ES6"*

Uses `ID_Start` and `ID_Continue`, as well as `Other_ID_Start` /
`Other_ID_Continue`.
<http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

*### Haskell*

Distinguishes identifiers/operators by their general category (such as "any
Unicode lowercase letter", "any Unicode symbol or punctuation", etc.).
<http://www.fileformat.info/info/unicode/category/index.htm>

In particular, identifiers can start with any lowercase letter or _, and
may contain any letter/digit/'/_. This would seem to include letters like δ
and Я, and digits like ٢.

<https://www.haskell.org/onlinereport/syntax-iso.html>
<https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>
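To make the general-category approach concrete, here is a rough Swift sketch that looks up the category of a few of the characters mentioned above. It assumes Swift 5 or later, where `Unicode.Scalar.Properties` is available (that API postdates this post), and the `describe` helper is a name invented for the example.

```swift
// Assumes Swift 5+ for Unicode.Scalar.Properties.
// `describe` is a hypothetical helper, not a standard library function.
func describe(_ scalar: Unicode.Scalar) -> String {
    switch scalar.properties.generalCategory {
    case .lowercaseLetter: return "lowercase letter"
    case .uppercaseLetter: return "uppercase letter"
    case .decimalNumber: return "decimal digit"
    case .mathSymbol: return "math symbol"
    default: return "other"
    }
}

let delta = describe("δ")     // Greek small letter delta
let ya = describe("Я")        // Cyrillic capital letter ya
let arabicTwo = describe("٢") // Arabic-Indic digit two
let plus = describe("+")      // plus sign
```

A lexer built this way needs no hand-maintained range tables; the trade-off is that its behavior changes whenever the Unicode data it consults is updated.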

*# Current problems*

*## Weird identifier code points*

The current `identifier-character` set contains many characters which
wouldn't make good identifiers:

- 11 entire planes of characters (U+30000–U+3FFFD, etc.) which are
currently unassigned.
- The middle dot · which looks like an operator.
- Many non-combining "modifiers" and accent marks, such as ´ and ¨ and ꓻ
which don't really make sense on their own.
- "Tone marks" from various languages, including ˫ (similar to a
box-drawing character ├ which is an operator).
- The "Greek question mark" ;
- Symbols which are simply not linguistic, such as ۞ and ༒.

short url: <https://goo.gl/tyn0Cz>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]

*## Weird operator code points*

The current `operator-character` set has a lot of characters that are
clearly operator-esque (≈ ∈ ⊕ ⊅), but some things are not so obviously
desirable:

- Box-drawing characters
- Combining accents and other characters
- Various symbols, e.g. ⚄ and ♄ (this category also overlaps with emoji)
- Braille patterns such as ⠟ — should they not be treated as letter-like
(thus identifiers)?
- A plethora of arrows

short url: <https://goo.gl/s136Nh>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]

*## Code points which are both*

A handful of characters are accepted *both* as `identifier-head` and
`operator-head` (which seems pointless and might have been unintentional):

U+3021–U+3029, Suzhou numerals 〡〢〣〤〥〦〧〨〩 <
https://en.wikipedia.org/wiki/Suzhou_numerals>

U+302A–U+302F, ideographic & hangul tone marks 〪 〫 〬 〭 〮 〯

    let 〨 = 2
    infix operator <〨>

(Note that `infix operator 〨` doesn't work because the lexer greedily
treats this as an identifier. Also, interestingly, the corresponding
ideographic zero 〇 is only an identifier char.)

short url: <https://goo.gl/lZcMqO>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]]
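The overlap can be reproduced by intersecting the two range tables. The sketch below uses only the few ranges around U+3000 that appear in the links above, not the full sets, and the `overlap` helper is a name invented for this example.

```swift
// Excerpts of the ranges in the links above (U+3000 area only).
let identifierHeadRanges: [ClosedRange<UInt32>] = [
    0x3004...0x3007, 0x3021...0x302F, 0x3031...0x303F,
]
let operatorHeadRanges: [ClosedRange<UInt32>] = [
    0x3001...0x3003, 0x3008...0x3030,
]

// Hypothetical helper: pairwise intersection of two lists of closed ranges.
func overlap(_ a: [ClosedRange<UInt32>],
             _ b: [ClosedRange<UInt32>]) -> [ClosedRange<UInt32>] {
    var result: [ClosedRange<UInt32>] = []
    for x in a {
        for y in b {
            let lower = max(x.lowerBound, y.lowerBound)
            let upper = min(x.upperBound, y.upperBound)
            if lower <= upper {
                result.append(lower...upper)
            }
        }
    }
    return result
}

let shared = overlap(identifierHeadRanges, operatorHeadRanges)
```

With these excerpts the only shared range is U+3021–U+302F, and U+3007 (〇) falls in an identifier range only, which matches the observations above.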

In addition to the numerals and tone marks above, many (all?) *combining
marks* are accepted as `identifier-character` and `operator-character`.
These may be necessary for natural-looking words in some languages, but
they don't seem necessary for operators.

Also present in both sets are the *variation selectors* 1 through 256
(U+FE00–U+FE0F, U+E0100–U+E01EF). It seems they are of limited use for the
operator characters, unless you count the emoji: <
http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt>

short url: <https://goo.gl/VKrisf>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]

*## Code points which should be illegal*

There are several surprising non-printing characters, including:

- U+2064 INVISIBLE PLUS is currently an identifier
- U+200B ZERO WIDTH SPACE is currently an identifier

No good will come of these. Invisible characters should probably be
disallowed (although some may be necessary for properly joining/splitting
characters in some other languages).
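One possible way to flag such characters, sketched in Swift under the assumption that the "format" general category (Cf) is a reasonable first approximation of "invisible". This uses the Swift 5 `Unicode.Scalar.Properties` API, which postdates this post, and `invisibleScalars(in:)` is a hypothetical helper name.

```swift
// Assumes Swift 5+ for Unicode.Scalar.Properties.
// Returns the scalars in `name` whose general category is Cf (format),
// which covers U+200B ZERO WIDTH SPACE and U+2064 INVISIBLE PLUS.
func invisibleScalars(in name: String) -> [Unicode.Scalar] {
    return name.unicodeScalars.filter {
        $0.properties.generalCategory == .format
    }
}

let suspicious = invisibleScalars(in: "a\u{200B}b\u{2064}c")
let clean = invisibleScalars(in: "abc")
```

Note that this check would also flag the zero-width joiner and non-joiner, which is exactly the case where an outright ban may be too strict for some scripts.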

*## Categories which are split between identifiers and operators*

- Emoji and symbols: most of the newer emoji are identifiers, but many
emoji/pictographs are operators, especially those from "Miscellaneous
Symbols". The results are hilariously illogical:

  - ☹️ is an operator, but 🙂 is an identifier.
  - ✌️ is an operator, but 🤘 is an identifier.
  - 🔼 is an operator, but ▶️ is an identifier.
  - ✳️ is an operator, but 🔯 is an identifier.
  - ✈️ is an operator, but 🛩 is an identifier.
  - ♠️ is an operator, but 🂡 is an identifier. (Presumably, 🂡 = A ♠️ 🂠!)

  (But the counterintuitive examples extend outside the emoji too: + is an
operator, while ₊ and ⁺ are identifiers.)

- Currency symbols: ¢ £ ¤ ¥ are operators, but ₪ € ₱ ₹ ฿ and many others
are identifiers, and $ is allowed in an identifier.
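The splits listed above fall straight out of a range lookup. Here is a small sketch using a handful of the ranges from the links in this post; the excerpts are deliberately incomplete, and `TokenKind` and `classify` are names invented for the example rather than anything in the compiler.

```swift
enum TokenKind {
    case identifier, operatorSymbol, neither
}

// Excerpts only; the full sets are in the links in this post.
let identifierExcerpt: [ClosedRange<UInt32>] = [
    0x2070...0x20CF,   // superscripts, subscripts, most currency symbols
    0x10000...0x1FFFD, // plane 1, where most newer emoji live
]
let operatorExcerpt: [ClosedRange<UInt32>] = [
    0x002B...0x002B, // +
    0x00A1...0x00A7, // includes ¢ £ ¤ ¥
    0x2500...0x2775, // box drawing, Miscellaneous Symbols, dingbats
]

// Hypothetical classifier: first matching range table wins.
func classify(_ scalar: Unicode.Scalar) -> TokenKind {
    let value = scalar.value
    if identifierExcerpt.contains(where: { $0.contains(value) }) {
        return .identifier
    }
    if operatorExcerpt.contains(where: { $0.contains(value) }) {
        return .operatorSymbol
    }
    return .neither
}

let cent = classify("¢")       // U+00A2
let euro = classify("€")       // U+20AC
let plusSign = classify("+")   // U+002B
let subscriptPlus = classify("₊") // U+208A
let frowning = classify("\u{2639}")  // white frowning face
let smiling = classify("\u{1F642}")  // slightly smiling face
```

Nothing in the lookup knows that two characters belong to the same conceptual category, which is why visually related symbols end up on opposite sides.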

*## Missing characters*

A handful of characters are neither operators nor identifiers. This list
mostly makes sense (reserved characters and whitespace), but I wonder about
a few which seem like they could easily be operators: ⑊ ⑀ ﹅ etc.

short url: <https://goo.gl/U0GVNn>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[\u0001-\U0010FFFF]-[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF][a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]

*# Solutions*

Still up for discussion — please reply to this thread!

Adopting (X)ID_Start/Continue for identifiers, or a simpler solution like
Haskell's use of "letter" categories, might work well.
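For a feel of what an XID-based rule would accept, here is a rough sketch. It assumes Swift 5 or later, where `Unicode.Scalar.Properties` exposes `isXIDStart` and `isXIDContinue`, and it additionally allows a leading underscore, which is my assumption about what Swift would want rather than part of UAX #31. `isXIDIdentifier` is a hypothetical name.

```swift
// Assumes Swift 5+ for Unicode.Scalar.Properties.
// Accepts names of the form (XID_Start | "_") XID_Continue*.
func isXIDIdentifier(_ name: String) -> Bool {
    var scalars = name.unicodeScalars.makeIterator()
    guard let first = scalars.next() else { return false }
    guard first == "_" || first.properties.isXIDStart else { return false }
    while let next = scalars.next() {
        guard next.properties.isXIDContinue else { return false }
    }
    return true
}

let greek = isXIDIdentifier("δx1")
let underscored = isXIDIdentifier("_tmp")
let leadingDigit = isXIDIdentifier("1x")
let zeroWidth = isXIDIdentifier("a\u{200B}b")
let symbol = isXIDIdentifier("\u{2639}")
```

Under this rule the zero width space and the symbol-like characters called out earlier are rejected without any Swift-specific range table.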

(I've given up hope of finding some kind of "perfect" solution — how can it
be possible, when ᛏ is a letter, yet ↑ is not?)

Making the choice of operator characters more logical/standards-based would
be nice (not just a set of ranges). However, Haskell's approach of using
all punctuation & symbols is probably not right for Swift:

short url: <https://goo.gl/Ud4KqY>

http://unicode.org/cldr/utility/unicodeset.jsp?a=[[-%2F%3D%2B!*%<>\%26|\^~?\u00A1-\u00A7\u00A9\u00AB\u00AC\u00AE\u00B0-\u00B1\u00B6\u00BB\u00BF\u00D7\u00F7\u2016-\u2017\u2020-\u2027\u2030-\u203E\u2041-\u2053\u2055-\u205E\u2190-\u23FF\u2500-\u2775\u2794-\u2BFF\u2E00-\u2E7F\u3001-\u3003\u3008-\u3030\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE00-\uFE0F\uFE20-\uFE2F\U000E0100-\U000E01EF]]&b=[[:Currency_Symbol:][:Modifier_Symbol:][:Math_Symbol:][:Other_Symbol:][:Connector_Punctuation:][:Dash_Punctuation:][:Close_Punctuation:][:Final_Punctuation:][:Initial_Punctuation:][:Other_Punctuation:][:Open_Punctuation:]]

I'm not really sure what to do with emoji — they're a very cute novelty
feature, but I don't know what the motivation is for including these as
valid operators/identifiers.

At the least, we should try to gather them all into one of the two
categories. My inclination would be to keep them as identifiers, which
would mean moving the following out of the operator category:

short url: <https://goo.gl/CBJEKX>

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji%3A]%26[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]]

*# Concurrently-discussable topics*

There are a few relevant topics that came to mind, which I think are worth
discussing around the same time.

*## Dollar signs ($)*

$ is currently allowed in identifiers, but it can't begin an identifier
except for the magic implicit closure params ($0, $1, ...) and
LLDB/REPL-related uses.

It's arguable, but I feel that $ would be more effective as an operator
character than an identifier character. There's precedent in Haskell for
operators like `<$>`, and being able to replicate these in Swift would be
nice.
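Since `$` is not an operator character today, the closest Swift can get is a stand-in spelling. The sketch below uses `<^>` in place of Haskell's `<$>`, defined only for `Optional`; the choice of `<^>`, the precedence group, and the `increment` function are assumptions made for this example.

```swift
// Stand-in for Haskell's <$>, which cannot be spelled in Swift today.
infix operator <^>: AdditionPrecedence

// Applies `transform` to the wrapped value, if there is one.
func <^> <A, B>(transform: (A) -> B, value: A?) -> B? {
    return value.map(transform)
}

func increment(_ n: Int) -> Int {
    return n + 1
}

let mapped = increment <^> Optional(41)
let missing: Int? = nil
let stillMissing = increment <^> missing
```

If `$` were reclassified as an operator character, the same declaration could use `<$>` with no other changes.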

*## Diagnostics improvements*

Regardless of what ends up being the ultimate solution, it would be great
to improve diagnostics for cases when the wrong types of characters are
used.

`infix operator abc` produces `'abc' is considered to be an identifier, not
an operator`. That's not too bad.

`let +++ = 3` produces `expected pattern`.

`let $foo = 3` produces `expected numeric value following '$'`.

*## Security and сοnfuѕаbIе characters*

Confusable characters (e vs. е, o vs. ο, ; vs. ;) are an issue not taken
lightly in the world of web security (cf. domain names). I haven't found
much information about whether this has been considered a major security
issue in programming languages, but I would think so (one can imagine such
characters being introduced to a codebase subtly over time, hiding
malicious functionality).

It'd be pretty cool if Swift could detect whether two identifiers might be
confusable, and produce a warning.

<http://www.unicode.org/reports/tr36/#Recommendations_General>
<http://unicode.org/reports/tr39/#Confusable_Detection>
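To give a rough idea of what such a check might look like, here is a toy sketch of the "skeleton" comparison described in the second link. The mapping table is a tiny hand-picked sample, not the real confusables data, and the function names are invented for this example.

```swift
// Toy sample only: maps a few look-alike scalars to a Latin prototype.
// A real implementation would load the full Unicode confusables data.
let prototypes: [Unicode.Scalar: Unicode.Scalar] = [
    "\u{0435}": "e", // CYRILLIC SMALL LETTER IE
    "\u{03BF}": "o", // GREEK SMALL LETTER OMICRON
    "\u{0430}": "a", // CYRILLIC SMALL LETTER A
]

// Replaces each scalar with its prototype, if the table has one.
func skeleton(_ name: String) -> String {
    var out = String.UnicodeScalarView()
    for scalar in name.unicodeScalars {
        out.append(prototypes[scalar] ?? scalar)
    }
    return String(out)
}

// Two distinct names are flagged when their skeletons are identical.
func mightBeConfusable(_ a: String, _ b: String) -> Bool {
    return a != b && skeleton(a) == skeleton(b)
}

let flagged = mightBeConfusable("scope", "sc\u{03BF}pe") // Greek omicron
let unrelated = mightBeConfusable("scope", "score")
```

A compiler warning could run this comparison across the identifiers visible in one scope, which keeps the cost proportional to the size of that scope.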


(Erica Sadun) #2

Let me tl;dr'er this even more: ☹️ is an operator, but 🙂 is an identifier.

-- E, succinct, who thinks there's room for improvement

On Sep 18, 2016, at 1:33 PM, Jacob Bandes-Storch via swift-evolution swift-evolution@swift.org wrote:

TL;DR:

Swift 4 Stage 1 seeks to prioritize “Source stability features”. Most source-breaking changes were done with in Swift 3; however, the categorization of Unicode characters into identifiers & operators was never thoroughly discussed on swift-evolution. This seems like it might be our last chance, and I think there are some big improvements to be had.

I’ve gathered some information+thoughts into an early-stage pitch / pre-proposal. It doesn’t really have a conclusion, so I’m hoping we can discuss these issues and come up with good (pragmatic) solutions here. I imagine this can morph into a proposal later.

You can read the following in nicer HTML form at https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59 https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59

I look forward to the discussion!

-Jacob

Background and motivation

To ease lexing/parsing and avoid user confusion, the names of custom identifiers (type names, variable names, etc.) and operators in Swift can be composed of (mostly) separate sets of characters.

Using terminology from TSPL:

identifier-head/operator-head are characters which can begin an identifier or operator.

identifier-character/operator-character are characters which can appear anywhere in an identifier or operator (these are supersets of the -head sets).

<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html>

(Note also that some particular arrangements of characters are reserved; for instance, $ followed by digits for an implicit closure parameter, and “If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere.” There are also special characters in the language which are neither identifiers nor operators, such as: ()[]{},:@#)

Prior discussion on swift-evolution

“Request to add middle dot (U+00B7) as operator character?”
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

“Free the ‘$’ Symbol!”
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>

“Proposal: Allow Single Dollar Sign as Valid Identifier”
<https://github.com/apple/swift-evolution/pull/354 https://github.com/apple/swift-evolution/pull/354>

Chris Lattner has said:

“…our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.”

“We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.”

Current state of affairs

Swift’s identifier-head and identifier-character mostly conform to the recommendations in <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html>
<https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489 https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489>

The allowed operator characters include “Unicode math, symbol, arrow, dingbat, and line/box drawing chars”, however I don’t believe this aligns with any particular spec:
<https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121 https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121>
<https://github.com/apple/swift/commit/a2341a4 https://github.com/apple/swift/commit/a2341a4>

Identifiers/operators elsewhere

There is an Unicode Standard Annex “identifier and pattern syntax” <http://unicode.org/reports/tr31/ http://unicode.org/reports/tr31/> which defines the categories ID_Start/ID_Continue.

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AID_Continue%3A] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AID_Continue%3A]>

ECMAScript 2015 “ES6”

Uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue.
<http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Haskell

Distinguishes identifiers/operators by their general category (such as “any Unicode lowercase letter”, “any Unicode symbol or punctuation”, etc.).
<http://www.fileformat.info/info/unicode/category/index.htm http://www.fileformat.info/info/unicode/category/index.htm>

In particular, identifiers can start with any lowercase letter or , and may contain any letter/digit/’/. This would seem to include letters like δ and Я, and digits like ٢.

<https://www.haskell.org/onlinereport/syntax-iso.html https://www.haskell.org/onlinereport/syntax-iso.html>
<https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973 https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>

Current problems

Weird identifier code points

The current identifier-character set contains many characters which wouldn’t make good identifiers:

  • 11 entire planes of characters (U+20000–U+2FFFD, etc.) which are currently unassigned.
  • The middle dot · which looks like an operator.
  • Many non-combining “modifiers” and accent marks, such as ´ and ¨ and ꓻ which don’t really make sense on their own.
  • “Tone marks” from various languages, including ˫ (similar to a box-drawing character ├ which is an operator).
  • The “Greek question mark” ;
  • Symbols which are simply not linguistic, such as ۞ and ༒.

short url: <https://goo.gl/tyn0Cz https://goo.gl/tyn0Cz>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]>

Weird operator code points

The current operator-character set has a lot of characters that are clearly operator-esque (≈ ∈ ⊕ ⊅), but some things are not so obviously desirable:

  • Box-drawing characters
  • Combining accents and other characters
  • Various symbols, e.g. ⚄ and ♄ (this category also overlaps with emoji)
  • Braille patterns such as ⠟ — should they not be treated as letter-like (thus identifiers)?
  • A plethora of arrows

short url: <https://goo.gl/s136Nh https://goo.gl/s136Nh>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]>

Code points which are both

A handful of characters are accepted both as identifier-head and operator-head (which seems pointless and might have been unintentional):

U+3021–U+3029, Suzhou numerals 〡〢〣〤〥〦〧〨〩 <https://en.wikipedia.org/wiki/Suzhou_numerals https://en.wikipedia.org/wiki/Suzhou_numerals>

U+302A–U+302F, ideographic & hangul tone marks 〪 〫 〬 〭 〮 〯

let 〨 = 2
infix operator <〨>

(Note that infix operator 〨 doesn’t work because the lexer greedily treats this as an identifier. Also, interestingly, the corresponding ideographic zero 〇 is only an identifier char.)

short url: <https://goo.gl/lZcMqO https://goo.gl/lZcMqO>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB 
\u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]]>

In addition to the numerals and tone marks above, many (all?) combining marks are accepted as identifier-character and operator-character. These may be necessary for natural-looking words in some languages, but they don’t seem necessary for operators.

Also present in both sets are the variation selectors 1 through 256 (U+FE00–U+FE0F, U+E0100–U+E01EF). It seems they are of limited use for the operator characters, unless you count the emoji: <http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt>

short url: <https://goo.gl/VKrisf https://goo.gl/VKrisf>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD 
\U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]>

Code points which should be illegal

There are several surprising non-printing characters, including:

  • U+2064 INVISIBLE PLUS is currently an identifier
  • U+200B ZERO WIDTH SPACE is currently an identifier

No good will come of these. Invisible characters should probably be disallowed (although some may be necessary for properly joining/splitting characters in some other languages).
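
A minimal sketch of how such characters could be flagged, assuming the standard library's `Unicode.Scalar.Properties` is an acceptable source of truth. The helper name `invisibleScalars(in:)` is hypothetical and is not part of the compiler or any existing tool. Note that this check would also flag joiners such as U+200D, which is exactly the caveat above about other languages.

```swift
// Hypothetical helper (not part of any Swift tooling): returns the scalars in
// `text` that a reader would not see when the text is displayed.
func invisibleScalars(in text: String) -> [Unicode.Scalar] {
    return text.unicodeScalars.filter { scalar in
        // General category Cf ("format") covers U+200B and U+2064;
        // Default_Ignorable_Code_Point catches other scalars with no visible glyph.
        scalar.properties.generalCategory == .format
            || scalar.properties.isDefaultIgnorableCodePoint
    }
}

// An identifier-like string with a ZERO WIDTH SPACE hidden in the middle.
let suspicious = invisibleScalars(in: "total\u{200B}Count")
print(suspicious.map { String($0.value, radix: 16, uppercase: true) })
```

A lexer could use a check like this either to reject such identifiers outright or to emit a warning.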

Categories which are split between identifiers and operators

  • Emoji and symbols: most of the newer emoji are identifiers, but many emoji/pictographs are operators, especially those from “Miscellaneous Symbols”. The results are hilariously illogical:

    • ☹️ is an operator, but 🙂 is an identifier.
    • ✌️ is an operator, but 🤘 is an identifier.
    • ▶️ is an operator, but 🔼 is an identifier.
    • ✳️ is an operator, but 🔯 is an identifier.
    • ✈️ is an operator, but 🛩 is an identifier.
    • ♠️ is an operator, but 🂡 is an identifier. (Presumably, 🂡 = A ♠️ 🂠!)

    (But the counterintuitive examples extend outside the emoji too: + is an operator, while ₊ and ⁺ are identifiers.)

  • Currency symbols: ¢ £ ¤ ¥ are operators, but ₪ € ₱ ₹ ฿ and many others are identifiers, and $ is allowed in an identifier.
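
To make the split concrete, here is a rough sketch that classifies a scalar using a small excerpt of the ranges from the unicodeset URLs in this post. It is not the full lexer table, and the names `TokenClass`, `classify`, `operatorRangeExcerpt` and `identifierRangeExcerpt` are made up for illustration.

```swift
enum TokenClass { case identifier, operatorCharacter, neither }

// Excerpts only: a few of the operator and identifier ranges quoted in the post.
let operatorRangeExcerpt: [ClosedRange<UInt32>] = [
    0x00A1...0x00A7, 0x2190...0x23FF, 0x2500...0x2775, 0x2794...0x2BFF,
]
let identifierRangeExcerpt: [ClosedRange<UInt32>] = [
    0x2070...0x20CF, 0x2100...0x218F, 0x10000...0x1FFFD,
]
let asciiOperators: Set<Unicode.Scalar> = [
    "/", "=", "-", "+", "!", "*", "%", "<", ">", "&", "|", "^", "~", "?",
]

func classify(_ scalar: Unicode.Scalar) -> TokenClass {
    if asciiOperators.contains(scalar)
        || operatorRangeExcerpt.contains(where: { $0.contains(scalar.value) }) {
        return .operatorCharacter
    }
    if identifierRangeExcerpt.contains(where: { $0.contains(scalar.value) }) {
        return .identifier
    }
    return .neither
}

// "+" vs. SUBSCRIPT PLUS SIGN, and CENT SIGN vs. EURO SIGN, land on opposite sides.
print(classify("+"), classify("\u{208A}"))
print(classify("\u{00A2}"), classify("\u{20AC}"))
```

The classification depends only on which hard-coded range a code point happens to fall in, which is why visually related characters end up in different categories.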

Missing characters

A handful of characters are neither operators nor identifiers. This list mostly makes sense (reserved characters and whitespace), but I wonder about a few which seem like they could easily be operators: ⑊ ⑀ ﹅ etc.

short url: <https://goo.gl/U0GVNn>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[\u0001-\U0010FFFF]-[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF][a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]>

Solutions

Still up for discussion — please reply to this thread!

Adopting (X)ID_Start/Continue for identifiers, or a simpler solution like Haskell’s use of “letter” categories, might work well.

(I’ve given up hope of finding some kind of “perfect” solution — how can it be possible, when ᛏ is a letter, yet ↑ is not?)
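
For reference, a sketch of what an XID-based identifier check might look like, using the `isXIDStart` and `isXIDContinue` properties that the standard library exposes on `Unicode.Scalar.Properties`. The function name is hypothetical, and keeping `_` as a valid first character is an assumption carried over from current Swift, not something UAX #31 requires.

```swift
// Hypothetical check: does `text` form an identifier under XID_Start/XID_Continue?
func isXIDStyleIdentifier(_ text: String) -> Bool {
    let scalars = text.unicodeScalars
    guard let first = scalars.first else { return false }
    // Assumption: keep `_` as a legal first character, as Swift does today.
    guard first == "_" || first.properties.isXIDStart else { return false }
    return scalars.dropFirst().allSatisfy { $0.properties.isXIDContinue }
}

// The runic letter is accepted, the arrow is not, and a hidden
// ZERO WIDTH SPACE makes an otherwise ordinary name invalid.
print(isXIDStyleIdentifier("\u{16CF}"))      // ᛏ
print(isXIDStyleIdentifier("\u{2191}"))      // ↑
print(isXIDStyleIdentifier("a\u{200B}b"))
```

This is only the identifier half of the problem; it says nothing about which characters should be operators.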

Making the choice of operator characters more logical/standards-based would be nice (not just a set of ranges). However, Haskell’s approach of using all punctuation & symbols is probably not right for Swift:

short url: <https://goo.gl/Ud4KqY>

<http://unicode.org/cldr/utility/unicodeset.jsp?a=[[-%2F%3D%2B!*%<>\%26|\^~?\u00A1-\u00A7\u00A9\u00AB\u00AC\u00AE\u00B0-\u00B1\u00B6\u00BB\u00BF\u00D7\u00F7\u2016-\u2017\u2020-\u2027\u2030-\u203E\u2041-\u2053\u2055-\u205E\u2190-\u23FF\u2500-\u2775\u2794-\u2BFF\u2E00-\u2E7F\u3001-\u3003\u3008-\u3030\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE00-\uFE0F\uFE20-\uFE2F\U000E0100-\U000E01EF]]&b=[[:Currency_Symbol:][:Modifier_Symbol:][:Math_Symbol:][:Other_Symbol:][:Connector_Punctuation:][:Dash_Punctuation:][:Close_Punctuation:][:Final_Punctuation:][:Initial_Punctuation:][:Other_Punctuation:][:Open_Punctuation:]]>
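
As a rough illustration of the "all punctuation & symbols" rule, the sketch below tests a scalar's general category with `Unicode.Scalar.Properties.generalCategory`. The function name is hypothetical. One visible consequence is that such a rule would also sweep in characters Swift reserves, such as `(` and `,`, so a real rule would need an exclusion list on top.

```swift
// Hypothetical predicate: would a "symbols and punctuation" rule treat this
// scalar as an operator character?
func isSymbolOrPunctuation(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .mathSymbol, .currencySymbol, .modifierSymbol, .otherSymbol,
         .connectorPunctuation, .dashPunctuation,
         .openPunctuation, .closePunctuation,
         .initialPunctuation, .finalPunctuation, .otherPunctuation:
        return true
    default:
        return false
    }
}

// "+" qualifies, but so do "(", "," and "_", which Swift uses for other purposes.
for scalar in "+(,_a".unicodeScalars {
    print(scalar, isSymbolOrPunctuation(scalar))
}
```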

I’m not really sure what to do with emoji — they’re a very cute novelty feature, but I don’t know what the motivation is for including these as valid operators/identifiers.

At the least, we should try to gather them all into one of the two categories. My inclination would be to keep them as identifiers, which would mean moving the following out of the operator category:

short url: <https://goo.gl/CBJEKX>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji%3A]%26[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]]>
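
The intersection in that link can be approximated in code. This is a sketch only: `pictographOperatorRanges` is an excerpt of the operator ranges (the blocks where most of the affected emoji live), and the function name is hypothetical. It relies on the standard library's `isEmoji` property, which reflects the Unicode `Emoji` property.

```swift
// Excerpt of the operator ranges that overlap with emoji; not the full set.
let pictographOperatorRanges: [ClosedRange<UInt32>] = [
    0x2190...0x23FF, 0x2500...0x2775, 0x2794...0x2BFF,
]

// Hypothetical predicate: is this scalar an emoji that currently lexes as an operator?
func isEmojiInOperatorRange(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isEmoji
        && pictographOperatorRanges.contains(where: { $0.contains(scalar.value) })
}

print(isEmojiInOperatorRange("\u{2708}"))   // ✈ AIRPLANE
print(isEmojiInOperatorRange("\u{1F6E9}"))  // 🛩 SMALL AIRPLANE
print(isEmojiInOperatorRange("\u{2208}"))   // ∈ ELEMENT OF
```

Scalars for which this returns true are the ones that would move to the identifier side under the suggestion above.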

Concurrently-discussable topics

There are a few relevant topics that came to mind, which I think are worth discussing around the same time.

Dollar signs ($)

$ is currently allowed in identifiers, but it can’t begin an identifier except for the magic implicit closure params ($0, $1, …) and LLDB/REPL-related uses.

It’s arguable, but I feel that $ would be more effective as an operator character than an identifier character. There’s precedent in Haskell for operators like <$>, and being able to replicate these in Swift would be nice.
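
A small sketch of the status quo: `$0` works as an implicit closure parameter, while a Haskell-style `<$>` cannot be declared because `$` is not an operator character. The `<^>` operator below is a hypothetical stand-in chosen only because it lexes today; its precedence group is an arbitrary choice for the example.

```swift
// `<^>` stands in for Haskell's `<$>`, which cannot currently be spelled in Swift.
infix operator <^>: AdditionPrecedence

// Applies `transform` to every element, like `fmap` specialised to arrays.
func <^> <A, B>(transform: (A) -> B, values: [A]) -> [B] {
    return values.map(transform)
}

func double(_ x: Int) -> Int { return x * 2 }

let doubled = double <^> [1, 2, 3]

// `$0` is the "magic" implicit closure parameter mentioned above.
let incremented = [1, 2, 3].map { $0 + 1 }

print(doubled, incremented)
```

If `$` were moved to the operator set, the declaration could be written with `<$>` instead, provided the `$0`, `$1`, … forms stay reserved.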

Diagnostics improvements

Regardless of what ends up being the ultimate solution, it would be great to improve diagnostics for cases when the wrong types of characters are used.

`infix operator abc` produces `'abc' is considered to be an identifier, not an operator`. That’s not too bad.

`let +++ = 3` produces `expected pattern`.

`let $foo = 3` produces `expected numeric value following '$'`.

Security and сοnfuѕаbIе characters

Confusable characters (e vs. е, o vs. ο, ; vs. ;) are an issue not taken lightly in the world of web security (cf. domain names). I haven’t found much information about whether this has been considered a major security issue in programming languages, but I would think so (one can imagine such characters being introduced to a codebase subtly over time, hiding malicious functionality).

It’d be pretty cool if Swift could detect whether two identifiers might be confusable, and produce a warning.

<http://www.unicode.org/reports/tr36/#Recommendations_General>
<http://unicode.org/reports/tr39/#Confusable_Detection>
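
A very simplified sketch of the idea behind the confusable detection in UTS #39: map each scalar to a "prototype" and compare the results. The `prototypes` table here is a tiny hand-written excerpt for illustration, not the real `confusables.txt` data, and the function names are hypothetical. The real algorithm also normalizes the strings, which is omitted here.

```swift
// Tiny illustrative excerpt: Cyrillic and Greek look-alikes mapped to Latin letters.
let prototypes: [Unicode.Scalar: Unicode.Scalar] = [
    "\u{0435}": "e",  // CYRILLIC SMALL LETTER IE
    "\u{03BF}": "o",  // GREEK SMALL LETTER OMICRON
    "\u{0430}": "a",  // CYRILLIC SMALL LETTER A
    "\u{0441}": "c",  // CYRILLIC SMALL LETTER ES
    "\u{0455}": "s",  // CYRILLIC SMALL LETTER DZE
]

// Replaces every scalar that has a known prototype; leaves the rest untouched.
func skeleton(_ identifier: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in identifier.unicodeScalars {
        result.append(prototypes[scalar] ?? scalar)
    }
    return String(result)
}

// Two distinct identifiers are suspicious when their skeletons coincide.
func mightBeConfusable(_ a: String, _ b: String) -> Bool {
    return a != b && skeleton(a) == skeleton(b)
}

// "scope" spelled with four look-alike letters vs. plain ASCII "scope".
print(mightBeConfusable("scope", "\u{0455}\u{0441}\u{03BF}p\u{0435}"))
```

A compiler warning could be built on the same comparison, run over the identifiers visible in a scope.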


swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Robert Widmann) #3

Some thoughts

Prior discussion on swift-evolution

“Request to add middle dot (U+00B7) as operator character?”
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

“Free the ‘$’ Symbol!”
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>

“Proposal: Allow Single Dollar Sign as Valid Identifier”
<https://github.com/apple/swift-evolution/pull/354>

Chris Lattner has said:

“…our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things.”

“We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though.”

I feel a bit bad having implemented the patch that banned this - it feels like dollar was mistakenly left out of the operator character range considering how well it worked in operators up to then. Disambiguation with respect to other language constructs (anonymous parameters in closures and LLDB variables) is trivial and we already had diagnostics about it.

I definitely support having Swift’s operators use a wider range of the unicode spectrum - perhaps even a policy where instead of whitelisting ranges we blacklist reserved characters or ranges.

Current state of affairs

Swift’s identifier-head and identifier-character mostly conform to the recommendations in <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html>
<https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489>

The allowed operator characters include “Unicode math, symbol, arrow, dingbat, and line/box drawing chars”, however I don’t believe this aligns with any particular spec:
<https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121>
<https://github.com/apple/swift/commit/a2341a4>

Identifiers/operators elsewhere

There is a Unicode Standard Annex, “Identifier and Pattern Syntax” <http://unicode.org/reports/tr31/>, which defines the categories ID_Start/ID_Continue.

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AID_Continue%3A]>

ECMAScript 2015 “ES6”

Uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue.
<http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

Haskell

Distinguishes identifiers/operators by their general category (such as “any Unicode lowercase letter”, “any Unicode symbol or punctuation”, etc.).
<http://www.fileformat.info/info/unicode/category/index.htm>

In particular, identifiers can start with any lowercase letter or _, and may contain any letter/digit/’/_. This would seem to include letters like δ and Я, and digits like ٢.

<https://www.haskell.org/onlinereport/syntax-iso.html>
<https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973>
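
The Haskell rule as described above can be sketched in Swift with general categories. This follows the summary in this post (first character a lowercase letter or `_`, later characters any letter, decimal digit, `'` or `_`) rather than the exact grammar in the Haskell report, and both function names are hypothetical.

```swift
// True for any scalar whose general category is one of the letter categories.
func isLetter(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter:
        return true
    default:
        return false
    }
}

// Hypothetical check modelled on the rule summarised above.
func isHaskellStyleIdentifier(_ text: String) -> Bool {
    let scalars = text.unicodeScalars
    guard let first = scalars.first else { return false }
    guard first == "_" || first.properties.generalCategory == .lowercaseLetter else {
        return false
    }
    return scalars.dropFirst().allSatisfy { scalar in
        scalar == "_" || scalar == "'" || isLetter(scalar)
            || scalar.properties.generalCategory == .decimalNumber
    }
}

print(isHaskellStyleIdentifier("\u{03B4}"))            // δ alone
print(isHaskellStyleIdentifier("x\u{042F}\u{0662}"))   // x followed by Я and ٢
print(isHaskellStyleIdentifier("\u{042F}x"))           // starts with uppercase Я
```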

To give a language that supports the extreme case: Coq and Agda allow the full range of the Unicode spectrum (or so their implementation/docs would seem to say) in identifiers.

Current problems

Weird identifier code points

The current identifier-character set contains many characters which wouldn’t make good identifiers:

  • 11 entire planes of characters (U+20000–U+2FFFD, etc.) which are currently unassigned.
  • The middle dot · which looks like an operator.
  • Many non-combining “modifiers” and accent marks, such as ´ and ¨ and ꓻ which don’t really make sense on their own.
  • “Tone marks” from various languages, including ˫ (similar to a box-drawing character ├ which is an operator).
  • The “Greek question mark” ;
  • Symbols which are simply not linguistic, such as ۞ and ༒.

short url: <https://goo.gl/tyn0Cz https://goo.gl/tyn0Cz>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]>

Weird operator code points

The current operator-character set has a lot of characters that are clearly operator-esque (≈ ∈ ⊕ ⊅), but some things are not so obviously desirable:

  • Box-drawing characters
  • Combining accents and other characters
  • Various symbols, e.g. ⚄ and ♄ (this category also overlaps with emoji)
  • Braille patterns such as ⠟ — should they not be treated as letter-like (thus identifiers)?
  • A plethora of arrows

short url: <https://goo.gl/s136Nh https://goo.gl/s136Nh>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]>

Code points which are both

A handful of characters are accepted both as identifier-head and operator-head (which seems pointless and might have been unintentional):

U+3021–U+3029, Suzhou numerals 〡〢〣〤〥〦〧〨〩 <https://en.wikipedia.org/wiki/Suzhou_numerals https://en.wikipedia.org/wiki/Suzhou_numerals>

U+302A–U+302F, ideographic & hangul tone marks 〪 〫 〬 〭 〮 〯

let 〨 = 2
infix operator <〨>

(Note that infix operator 〨 doesn’t work because the lexer greedily treats this as an identifier. Also, interestingly, the corresponding ideographic zero 〇 is only an identifier char.)

short url: <https://goo.gl/lZcMqO https://goo.gl/lZcMqO>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB 
\u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]]>

In addition to the numerals and tone marks above, many (all?) combining marks are accepted as identifier-character and operator-character. These may be necessary for natural-looking words in some languages, but they don’t seem necessary for operators.

Also present in both sets are the variation selectors 1 through 256 (U+FE00–U+FE0F, U+E0100–U+E01EF). It seems they are of limited use for the operator characters, unless you count the emoji: <http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt>

short url: <https://goo.gl/VKrisf https://goo.gl/VKrisf>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD 
\U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]>

Code points which should be illegal

There are several surprising non-printing characters, including:

  • U+2064 INVISIBLE PLUS is currently an identifier
  • U+200B ZERO WIDTH SPACE is currently an identifier

No good will come of these. Invisible characters should probably be disallowed (although some may be necessary for properly joining/splitting characters in some other languages).

Categories which are split between identifiers and operators

  • Emoji and symbols: most of the newer emoji are identifiers, but many emoji/pictographs are operators, especially those from “Miscellaneous Symbols”. The results are hilariously illogical:

    • :frowning:️ is an operator, but :slightly_smiling_face: is an identifier.
    • :v:️ is an operator, but :metal: is an identifier.
    • :arrow_up_small: is an operator, but :arrow_forward:️ is an identifier.
    • :eight_spoked_asterisk:️ is an operator, but :six_pointed_star: is an identifier.
    • :airplane:️ is an operator, but :small_airplane: is an identifier.
    • :spades:️ is an operator, but 🂡 is an identifier. (Presumably, 🂡 = A :spades:️ 🂠!)

    (But the counterintuitive examples extend outside the emoji too: + is an operator, while ₊ and ⁺ are identifiers.)

  • Currency symbols: ¢ £ ¤ ¥ are operators, but ₪ € ₱ ₹ ฿ and many others are identifiers, and $ is allowed in an identifier.

Missing characters

A handful of characters are neither operators nor identifiers. This list mostly makes sense (reserved characters and whitespace), but I wonder about a few which seem like they could easily be operators: ⑊ ⑀ ﹅ etc.

short url: <https://goo.gl/U0GVNn https://goo.gl/U0GVNn>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[\u0001-\U0010FFFF]-[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF][a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[\u0001-\U0010FFFF]-[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF][a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF 
\u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]>

Solutions

Still up for discussion — please reply to this thread!

Adopting (X)ID_Start/Continue for identifiers, or a simpler solution like Haskell’s use of “letter” categories, might work well.

(I’ve given up hope of finding some kind of “perfect” solution — how can it be possible, when ᛏ is a letter, yet ↑ is not?)

Making the choice of operator characters more logical/standards-based would be nice (not just a set of ranges). However, Haskell’s approach of using all punctuation & symbols is probably not right for Swift:

short url: <https://goo.gl/Ud4KqY https://goo.gl/Ud4KqY>

<http://unicode.org/cldr/utility/unicodeset.jsp?a=[[-%2F%3D%2B!*%<>\%26|\^~?\u00A1-\u00A7\u00A9\u00AB\u00AC\u00AE\u00B0-\u00B1\u00B6\u00BB\u00BF\u00D7\u00F7\u2016-\u2017\u2020-\u2027\u2030-\u203E\u2041-\u2053\u2055-\u205E\u2190-\u23FF\u2500-\u2775\u2794-\u2BFF\u2E00-\u2E7F\u3001-\u3003\u3008-\u3030\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE00-\uFE0F\uFE20-\uFE2F\U000E0100-\U000E01EF]]&b=[[:Currency_Symbol:][:Modifier_Symbol:][:Math_Symbol:][:Other_Symbol:][:Connector_Punctuation:][:Dash_Punctuation:][:Close_Punctuation:][:Final_Punctuation:][:Initial_Punctuation:][:Other_Punctuation:][:Open_Punctuation:]] http://unicode.org/cldr/utility/unicodeset.jsp?a=[[-%2F%3D%2B!*%<>\%26|\^~?\u00A1-\u00A7\u00A9\u00AB\u00AC\u00AE\u00B0-\u00B1\u00B6\u00BB\u00BF\u00D7\u00F7\u2016-\u2017\u2020-\u2027\u2030-\u203E\u2041-\u2053\u2055-\u205E\u2190-\u23FF\u2500-\u2775\u2794-\u2BFF\u2E00-\u2E7F\u3001-\u3003\u3008-\u3030\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE00-\uFE0F\uFE20-\uFE2F\U000E0100-\U000E01EF]]&b=[[:Currency_Symbol:][:Modifier_Symbol:][:Math_Symbol:][:Other_Symbol:][:Connector_Punctuation:][:Dash_Punctuation:][:Close_Punctuation:][:Final_Punctuation:][:Initial_Punctuation:][:Other_Punctuation:][:Open_Punctuation:]]>

I’m not really sure what to do with emoji — they’re a very cute novelty feature, but I don’t know what the motivation is for including these as valid operators/identifiers.

At the least, we should try to gather them all into one of the two categories. My inclination would be to keep them as identifiers, which would mean moving the following out of the operator category:

short url: <https://goo.gl/CBJEKX https://goo.gl/CBJEKX>

<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji%3A]%26[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]] http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji%3A]%26[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]]>

Concurrently-discussable topics

There are a few relevant topics that came to mind, which I think are worth discussing around the same time.

Dollar signs ($)

$ is currently allowed in identifiers, but it can’t begin an identifier except for the magic implicit closure params ($0, $1, …) and LLDB/REPL-related uses.

It’s arguable, but I feel that $ would be more effective as an operator character than an identifier character. There’s precedent in Haskell for operators like `<$>`, and being able to replicate these in Swift would be nice.
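
To make the Haskell comparison concrete, here is a rough sketch of what a functor-map operator looks like in Swift today. Because `$` is not an operator character, `infix operator <$>` is rejected, so the sketch uses `<^>` as a stand-in. The operator name, the precedence group and the two overloads are illustrative choices, not anything from the standard library.

```swift
// Hypothetical stand-in for Haskell's <$>. The name `<^>` and the
// precedence group below are assumptions made for this example.
precedencegroup FunctorMapPrecedence {
    associativity: left
    higherThan: AdditionPrecedence
}

infix operator <^> : FunctorMapPrecedence

/// Applies `transform` to the wrapped value, if there is one.
func <^> <A, B>(transform: (A) -> B, value: A?) -> B? {
    return value.map(transform)
}

/// Applies `transform` to every element of the array.
func <^> <A, B>(transform: (A) -> B, values: [A]) -> [B] {
    return values.map(transform)
}

let double: (Int) -> Int = { $0 * 2 }
let lengthOf: (String) -> Int = { $0.count }

let doubled = double <^> [1, 2, 3]            // [2, 4, 6]
let length = lengthOf <^> Optional("swift")   // Optional(5)
```

If `$` were moved to the operator set, the same declarations could presumably be spelled `<$>` directly.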

Diagnostics improvements

Regardless of what ends up being the ultimate solution, it would be great to improve diagnostics for cases when the wrong types of characters are used.

`infix operator abc` produces `'abc' is considered to be an identifier, not an operator`. That’s not too bad.

`let +++ = 3` produces `expected pattern`.

`let $foo = 3` produces `expected numeric value following '$'`.

Security and сοnfuѕаbIе characters

Confusable characters (e vs. е, o vs. ο, ; vs. ;) are an issue not taken lightly in the world of web security (cf. domain names). I haven’t found much information about whether this has been considered a major security issue in programming languages, but I would think so (one can imagine such characters being introduced to a codebase subtly over time, hiding malicious functionality).

It’d be pretty cool if Swift could detect whether two identifiers might be confusable, and produce a warning.

<http://www.unicode.org/reports/tr36/#Recommendations_General>
<http://unicode.org/reports/tr39/#Confusable_Detection>
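
As a very rough idea of what such a check could look like, the sketch below maps each character to a "prototype" and compares the results, loosely in the spirit of the skeleton comparison described in UTS #39. The table is a hand-picked toy with four entries; real data would come from Unicode's confusables.txt, and the function names are made up for this example.

```swift
// Toy confusables table (an assumption for illustration only; the real
// mapping is published by Unicode as confusables.txt).
let prototypes: [Character: Character] = [
    "\u{0435}": "e",  // CYRILLIC SMALL LETTER IE
    "\u{0430}": "a",  // CYRILLIC SMALL LETTER A
    "\u{0441}": "c",  // CYRILLIC SMALL LETTER ES
    "\u{03BF}": "o",  // GREEK SMALL LETTER OMICRON
]

/// Replaces every character that has a prototype, leaving the rest alone.
func skeleton(_ identifier: String) -> String {
    return String(identifier.map { prototypes[$0] ?? $0 })
}

/// Two distinct identifiers are flagged when their skeletons match.
func mightBeConfusable(_ a: String, _ b: String) -> Bool {
    return a != b && skeleton(a) == skeleton(b)
}

print(mightBeConfusable("role", "r\u{03BF}le"))  // true
print(mightBeConfusable("role", "rule"))         // false
```

A compiler warning would presumably run something like this over the identifiers visible in a scope, but where to draw the line on what counts as confusable is the hard part.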

We have had a patch sitting in the queue for a long time now (https://github.com/apple/swift/pull/732) that adds diagnostics for confusables, if you want to take that up again.

···

On Sep 18, 2016, at 3:33 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Jacob Bandes-Storch) #4

Side question: it looks like ICU is used by the standard library on
non-Apple platforms. Would it be possible to make it a dependency of the
compiler too? If we want to explicitly detect emoji, for instance, it'd be
nice to use a canonical library that already does it.
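
For reference, this is the kind of query I have in mind. The sketch uses `Unicode.Scalar.Properties`, which is only available in the standard library of newer toolchains (Swift 5 and later), so treat it as an illustration of the check rather than something the compiler can call today. The function name is made up.

```swift
/// Returns true if any scalar in `text` has the Unicode `Emoji` property.
/// Caveat: ASCII digits, `#` and `*` also carry this property, so a
/// lexer would need a stricter rule than this one.
func containsEmojiScalar(_ text: String) -> Bool {
    return text.unicodeScalars.contains { $0.properties.isEmoji }
}

print(containsEmojiScalar("\u{1F642}"))  // SLIGHTLY SMILING FACE
print(containsEmojiScalar("\u{2639}"))   // WHITE FROWNING FACE
print(containsEmojiScalar("abc"))
```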

···

On Sun, Sep 18, 2016 at 12:33 PM, Jacob Bandes-Storch <jtbandes@gmail.com> wrote:


(Xiaodi Wu) #5

Let me tl;dr'er this even more: ☹️ is an operator, but 🙂 is an identifier.

-- E, succinct, who thinks there's room for improvement

Ha, yes. Let's see if I can be as succinct in my contribution to the
discussion:

1) Agree that current situation not ideal, for reasons above

2) The solution might best be not one but several proposals:

  2a) Unicode normalization: invisible characters, Greek tonos, etc. (cf.
previous message about previously proposed solution, which reflects Unicode
recommendations in UAX #31)--low hanging fruit: there's an established
Unicode recommendation with clear wins for security and consistency

  2b) Legal and illegal characters for identifiers *or* operators: UAX #31
makes recommendations regarding rarely used scripts; probably best to
follow the letter and spirit of these recommendations (which would probably
mean ancient Greek musical symbols and Egyptian hieroglyphics shouldn't be
identifier or operator characters)

  2c) Decisions as to which characters are identifier characters or
operator characters: for instance, emoji should probably never be operator
characters; if an emoji has a non-emoji counterpart that is an operator
(❗️❓➕➖➗✖️, etc.) it might be best simply to make these illegal rather than
operator characters

  2d) Confusables: I think the last time we had this discussion, it was
apparent that it'd be difficult to decide which confusables to allow or
disallow after some of the low-hanging fruit is taken care of by Unicode
normalization (see item 2a); the Unicode Consortium-provided list seems too
quick to call two things "confusable" for our purposes (with criteria that
might be relevant for URLs or other use cases, but casting too wide a net
perhaps for Swift identifiers)

···

On Sun, Sep 18, 2016 at 9:19 PM, Erica Sadun via swift-evolution <swift-evolution@swift.org> wrote:


(Xiaodi Wu) #6

There was a proposal written some time ago by João Pinheiro about Unicode
normalization for identifiers. Unfortunately, it couldn't make it in time
for the Swift 3 deadline, but it may be in the PR queue. Here it is again
in full:

Normalize Unicode Identifiers

Proposal: SE-NNNN
Author: João Pinheiro
Status: Awaiting review
Review manager: TBD

Introduction

This proposal aims to introduce identifier normalization in order to
prevent the unsafe and potentially abusive use of invisible or equivalent
representations of Unicode characters in identifiers.

Swift-evolution thread: Discussion thread

Motivation

Even though Swift supports the use of Unicode for identifiers, these aren't
yet normalized. This allows for different Unicode representations of the
same characters to be considered distinct identifiers.

For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"

In addition to that, default-ignorable characters like the Zero Width Space
and Zero Width Non-Joiner (exemplified below) are also currently accepted
as valid parts of identifiers without any restrictions.

let ab = "ab"
let a​b = "a + Zero Width Space + b"

func xy() { print("xy") }
func x‌y() { print("x + <Zero Width Non-Joiner> + y") }

The use of default-ignorable characters in identifiers is problematical,
first because the effects they represent are stylistic or otherwise out of
scope for identifiers, and second because the characters themselves often
have no visible display. It is also possible to misapply these characters
such that users can create strings that look the same but actually contain
different characters, which can create security problems.

Proposed solution

Normalize Swift identifiers according to the normalization form NFC
recommended for case-sensitive languages in the Unicode Standard Annexes 15
and 31 and follow the Normalization Charts.
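
As a small illustration of what NFC does to the three spellings from the Motivation section, the sketch below normalizes each one with Foundation's `precomposedStringWithCanonicalMapping` and compares code points. The helper name is made up, and this only demonstrates the normalization form; it is not how the compiler would implement it.

```swift
import Foundation

let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed  = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed   = "A\u{030A}"  // A + COMBINING RING ABOVE

/// Code points of `s` after normalization to NFC.
func nfcScalars(_ s: String) -> [UInt32] {
    return s.precomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// All three spellings collapse to the single code point U+00C5.
print(nfcScalars(angstromSign) == [0xC5])  // true
print(nfcScalars(precomposed) == [0xC5])   // true
print(nfcScalars(decomposed) == [0xC5])    // true
```

Swift's `String` equality already treats canonically equivalent strings as equal at run time; the point of the proposal is to give identifiers in source code the same treatment.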

In addition to that, prohibit the use of default-ignorable characters in
identifiers except in the special cases described in UAX31, listed below:

- Allow Zero Width Non-Joiner (U+200C) when breaking a cursive connection
- Allow Zero Width Non-Joiner (U+200C) in a conjunct context
- Allow Zero Width Joiner (U+200D) in a conjunct context
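
A first-pass filter for this rule might look like the sketch below. It relies on `Unicode.Scalar.Properties.isDefaultIgnorableCodePoint` from newer standard libraries (Swift 5 and later), and it deliberately does not implement the context checks themselves: ZWNJ and ZWJ are simply set aside for a later check. The names are hypothetical.

```swift
/// ZWNJ and ZWJ, which UAX #31 permits in specific contexts only.
let contextuallyAllowed: Set<UInt32> = [0x200C, 0x200D]

/// Returns the default-ignorable scalars in `identifier` that would be
/// rejected outright. The cursive-connection and conjunct context rules
/// are NOT checked here; ZWNJ/ZWJ are skipped and left to a later pass.
func rejectedScalars(in identifier: String) -> [Unicode.Scalar] {
    var rejected: [Unicode.Scalar] = []
    for scalar in identifier.unicodeScalars
    where scalar.properties.isDefaultIgnorableCodePoint
        && !contextuallyAllowed.contains(scalar.value) {
        rejected.append(scalar)
    }
    return rejected
}

print(rejectedScalars(in: "a\u{200B}b").count)  // 1 (ZERO WIDTH SPACE)
print(rejectedScalars(in: "x\u{200C}y").count)  // 0 (ZWNJ deferred to context check)
```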

Impact on existing code

This has potential to be a code-breaking change in cases where people may
have used distinct, but identical looking, identifiers with different
Unicode representations. The likelihood of that happening in actual code is
very small and the problem can be solved by renaming identifiers that don't
conform to the new normalized form into new non-colliding identifiers.

Alternatives considered

The option of ignoring default-ignorable characters in identifiers was also
discussed, but it was considered to be more confusing and less secure than
explicitly treating them as errors.

Unaddressed Issues

There was some discussion around the issue of Unicode confusable
characters, but it was considered to be out of scope for this proposal.
Unicode confusable characters are a complicated issue and any possible
solutions also come with significant drawbacks that would require more time
and consideration.

···

On Sun, Sep 18, 2016 at 20:35 Robert Widmann via swift-evolution <swift-evolution@swift.org> wrote:

Some thoughts

On Sep 18, 2016, at 3:33 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

*TL;DR:*

Swift 4 Stage 1 seeks to prioritize "Source stability features". Most
source-breaking changes were done with in Swift 3; however, the
categorization of Unicode characters into identifiers & operators was never
thoroughly discussed on swift-evolution. This seems like it might be our
last chance, and I think there are some big improvements to be had.

I've gathered some information+thoughts into an early-stage pitch /
pre-proposal. It doesn't really have a conclusion, so I'm hoping we can
discuss these issues and come up with good (pragmatic) solutions here. I
imagine this can morph into a proposal later.

You can read the following in nicer HTML form at
https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59

I look forward to the discussion!

-Jacob

*# Background and motivation*

To ease lexing/parsing and avoid user confusion, the names of custom
identifiers (type names, variable names, etc.) and operators in Swift can
be composed of (mostly) separate sets of characters.

Using terminology from TSPL:

`identifier-head`/`operator-head` are characters which can *begin *an
identifier or operator.

`identifier-character`/`operator-character` are characters which can
appear anywhere in an identifier or operator (these are supersets of the
`-head` sets).

<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html>

(Note also that some particular arrangements of characters are reserved;
for instance, `$` followed by digits for an implicit closure parameter, and
"If an operator doesn’t begin with a dot, it can’t contain a dot
elsewhere." There are also special characters in the language which are
neither identifiers nor operators, such as: `()[]{},:@#`)

*## Prior discussion on swift-evolution*

*"Request to add middle dot (U+00B7) as operator character?"*
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html>

*"Free the '$' Symbol!"*
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html>

*"Proposal: Allow Single Dollar Sign as Valid Identifier"*
<https://github.com/apple/swift-evolution/pull/354>

Chris Lattner has said:

> "...our current operator space (particularly the unicode segments
covered) is not super well considered. It would be great for someone to
take a more systematic pass over them to rationalize things."

> "We need a token to be unambiguously an operator or identifier - we can
have different rules for the leading and subsequent characters though."

I feel a bit bad having implemented the patch that banned this. It feels
like the dollar sign was mistakenly left out of the operator character
range, considering how well it worked in operators up to then.
Disambiguation with respect to other language constructs (anonymous
closure parameters and LLDB variables) is trivial, and we already had
diagnostics for it.

I definitely support having Swift’s operators use a wider range of the
Unicode spectrum, perhaps even a policy where, instead of whitelisting
ranges, we blacklist reserved characters or ranges.

*# Current state of affairs*

Swift's `identifier-head` and `identifier-character` mostly conform to the
recommendations in
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html>
<https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489>

The allowed operator characters include "Unicode math, symbol, arrow,
dingbat, and line/box drawing chars"; however, I don't believe this aligns
with any particular spec:
<https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121>

<https://github.com/apple/swift/commit/a2341a4>

*## Identifiers/operators elsewhere*

There is a Unicode Standard Annex, "Identifier and Pattern Syntax" <
http://unicode.org/reports/tr31/>, which defines the categories
`ID_Start`/`ID_Continue`.

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AID_Continue%3A]
>

*### ECMAScript 2015 "ES6"*

Uses `ID_Start` and `ID_Continue`, as well as `Other_ID_Start` /
`Other_ID_Continue`.
<http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords>

*### Haskell*

Distinguishes identifiers/operators by their general category (such as
"any Unicode lowercase letter", "any Unicode symbol or punctuation", etc.).

<http://www.fileformat.info/info/unicode/category/index.htm>

In particular, identifiers can start with any lowercase letter or _, and
may contain any letter/digit/'/_. This would seem to include letters like δ
and Я, and digits like ٢.

<https://www.haskell.org/onlinereport/syntax-iso.html>
<
https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973
>

To give a language that supports the extreme case: Coq and Agda allow the
full range of the Unicode spectrum (or so their implementation/docs would
seem to say) in identifiers.

*# Current problems*

*## Weird identifier code points*

The current `identifier-character` set contains many characters which
wouldn't make good identifiers:

- 11 entire planes of characters (U+20000–U+2FFFD, etc.) which are
currently unassigned.
- The middle dot · which looks like an operator.
- Many non-combining "modifiers" and accent marks, such as ´ and ¨ and ꓻ
which don't really make sense on their own.
- "Tone marks" from various languages, including ˫ (similar to a
box-drawing character ├ which is an operator).
- The "Greek question mark" ;
- Symbols which are simply not linguistic, such as ۞ and ༒.

short url: <https://goo.gl/tyn0Cz>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]
>

*## Weird operator code points*

The current `operator-character` set has a lot of characters that are
clearly operator-esque (≈ ∈ ⊕ ⊅), but some things are not so obviously
desirable:

- Box-drawing characters
- Combining accents and other characters
- Various symbols, e.g. ⚄ and ♄ (this category also overlaps with emoji)
- Braille patterns such as ⠟ — should they not be treated as letter-like
(thus identifiers)?
- A plethora of arrows

short url: <https://goo.gl/s136Nh>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]
>

*## Code points which are both*

A handful of characters are accepted *both* as `identifier-head` and
`operator-head` (which seems pointless and might have been unintentional):

U+3021–U+3029, Suzhou numerals 〡〢〣〤〥〦〧〨〩 <
https://en.wikipedia.org/wiki/Suzhou_numerals>

U+302A–U+302F, ideographic & hangul tone marks 〪 〫 〬 〭 〮 〯

    let 〨 = 2
    infix operator <〨>

(Note that `infix operator 〨` doesn't work because the lexer greedily
treats this as an identifier. Also, interestingly, the corresponding
ideographic zero 〇 is only an identifier char.)
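The overlap can be recomputed by intersecting the two head sets. This is only a sketch: the ranges are transcribed by hand from the list-unicodeset links in this post (so they are a snapshot, not an authoritative table), and `Span` and `intersections` are names made up for the example.

```swift
// (low, high) code point pairs, inclusive.
typealias Span = (UInt32, UInt32)

// identifier-head, transcribed from the links in this post.
var identifierHead: [Span] = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F), (0xA8, 0xA8), (0xAA, 0xAA),
    (0xAD, 0xAD), (0xAF, 0xAF), (0xB2, 0xB5), (0xB7, 0xBA), (0xBC, 0xBE),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF), (0x100, 0x2FF), (0x370, 0x167F),
    (0x1681, 0x180D), (0x180F, 0x1DBF), (0x1E00, 0x1FFF), (0x200B, 0x200D),
    (0x202A, 0x202E), (0x203F, 0x2040), (0x2054, 0x2054), (0x2060, 0x206F),
    (0x2070, 0x20CF), (0x2100, 0x218F), (0x2460, 0x24FF), (0x2776, 0x2793),
    (0x2C00, 0x2DFF), (0x2E80, 0x2FFF), (0x3004, 0x3007), (0x3021, 0x302F),
    (0x3031, 0x303F), (0x3040, 0xD7FF), (0xF900, 0xFD3D), (0xFD40, 0xFDCF),
    (0xFDF0, 0xFE1F), (0xFE30, 0xFE44), (0xFE47, 0xFFFD),
]
// Planes 1 through 14: U+10000-U+1FFFD ... U+E0000-U+EFFFD.
for plane: UInt32 in 1...14 {
    identifierHead.append((plane << 16, (plane << 16) + 0xFFFD))
}

// operator-head, transcribed from the links in this post.
let operatorHead: [Span] = [
    (0x21, 0x21), (0x25, 0x26), (0x2A, 0x2B), (0x2D, 0x2D), (0x2F, 0x2F),
    (0x3C, 0x3F), (0x5E, 0x5E), (0x7C, 0x7C), (0x7E, 0x7E),
    (0xA1, 0xA7), (0xA9, 0xA9), (0xAB, 0xAC), (0xAE, 0xAE), (0xB0, 0xB1),
    (0xB6, 0xB6), (0xBB, 0xBB), (0xBF, 0xBF), (0xD7, 0xD7), (0xF7, 0xF7),
    (0x2016, 0x2017), (0x2020, 0x2027), (0x2030, 0x203E), (0x2041, 0x2053),
    (0x2055, 0x205E), (0x2190, 0x23FF), (0x2500, 0x2775), (0x2794, 0x2BFF),
    (0x2E00, 0x2E7F), (0x3001, 0x3003), (0x3008, 0x3030),
]

// Pairwise intersection of two span lists, sorted by lower bound.
func intersections(_ a: [Span], _ b: [Span]) -> [ClosedRange<UInt32>] {
    var shared: [ClosedRange<UInt32>] = []
    for x in a {
        for y in b {
            let low = max(x.0, y.0)
            let high = min(x.1, y.1)
            if low <= high { shared.append(low...high) }
        }
    }
    return shared.sorted { $0.lowerBound < $1.lowerBound }
}

let sharedHeads = intersections(identifierHead, operatorHead)
for range in sharedHeads {
    print("U+" + String(range.lowerBound, radix: 16, uppercase: true)
        + "–U+" + String(range.upperBound, radix: 16, uppercase: true))
}
```

With these inputs the only shared run is U+3021–U+302F, i.e. the Suzhou numerals and tone marks listed above as one contiguous range.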

short url: <https://goo.gl/lZcMqO>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030]]
>

In addition to the numerals and tone marks above, many (all?) *combining
marks* are accepted as `identifier-character` and `operator-character`.
These may be necessary for natural-looking words in some languages, but
they don't seem necessary for operators.

Also present in both sets are the *variation selectors* 1 through 256
(U+FE00–U+FE0F, U+E0100–U+E01EF). It seems they are of limited use for the
operator characters, unless you count the emoji: <
http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt>

short url: <https://goo.gl/VKrisf>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]%26[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]
>

*## Code points which should be illegal*

There are several surprising non-printing characters, including:

- U+2064 INVISIBLE PLUS is currently an identifier
- U+200B ZERO WIDTH SPACE is currently an identifier

No good will come of these. Invisible characters should probably be
disallowed (although some may be necessary for properly joining/splitting
characters in some other languages).
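One way such characters could be flagged is by Unicode general category. The sketch below uses `Unicode.Scalar.Properties` from the Swift 5 standard library (which did not exist when this was written); `formatScalars(in:)` is a hypothetical helper, and choosing the "Format" (Cf) category as the test is an assumption, not a settled rule.

```swift
// Returns the scalars whose general category is Cf ("Format"), which covers
// U+200B ZERO WIDTH SPACE and U+2064 INVISIBLE PLUS among others.
func formatScalars(in identifier: String) -> [Unicode.Scalar] {
    return identifier.unicodeScalars.filter {
        $0.properties.generalCategory == .format
    }
}

let suspicious = "a\u{200B}b\u{2064}c"
for scalar in formatScalars(in: suspicious) {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}
```

Note that the zero-width joiner and non-joiner (U+200D, U+200C) are also in this category, so a real rule would need exceptions for the scripts that rely on them.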

*## Categories which are split between identifiers and operators*

- Emoji and symbols: most of the newer emoji are identifiers, but many
emoji/pictographs are operators, especially those from "Miscellaneous
Symbols". The results are hilariously illogical:

  - ☹️ is an operator, but 🙂 is an identifier.
  - ✌️ is an operator, but 🤘 is an identifier.
  - 🔼 is an operator, but ▶️ is an identifier.
  - ✳️ is an operator, but 🔯 is an identifier.
  - ✈️ is an operator, but 🛩 is an identifier.
  - ♠️ is an operator, but 🂡 is an identifier. (Presumably, 🂡 = A ♠️ 🂠!)

  (But the counterintuitive examples extend outside the emoji too: + is an
operator, while ₊ and ⁺ are identifiers.)

- Currency symbols: ¢ £ ¤ ¥ are operators, but ₪ € ₱ ₹ ฿ and many others
are identifiers, and $ is allowed in an identifier.
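To see why these splits fall out of the range tables, here is a rough classifier that contains only the handful of ranges needed for a few of the examples above. The ranges are copied from the sets linked in this post; `TokenKind` and `kind(of:)` are illustrative names, and everything else is lumped into `.neither` purely to keep the sketch short.

```swift
enum TokenKind { case identifier, operatorCharacter, neither }

// Only the ranges relevant to the examples; not the full tables.
func kind(of scalar: Unicode.Scalar) -> TokenKind {
    switch scalar.value {
    case 0x2B,              // +
         0xA1...0xA7,       // includes ¢ £ ¤ ¥
         0x2500...0x2775:   // box drawing through dingbats, includes ☹ ♠ ✈
        return .operatorCharacter
    case 0x0370...0x167F,   // includes ฿ (U+0E3F)
         0x2070...0x20CF,   // includes ⁺ ₊ and € ₪ ₱ ₹
         0x10000...0x1FFFD: // plane 1, includes 🙂 🂡 🛩
        return .identifier
    default:
        return .neither
    }
}

print(kind(of: "\u{2639}"))  // ☹ WHITE FROWNING FACE
print(kind(of: "\u{1F642}")) // 🙂 SLIGHTLY SMILING FACE
print(kind(of: "\u{A2}"))    // ¢ CENT SIGN
print(kind(of: "\u{20AC}"))  // € EURO SIGN
```

The split follows block boundaries rather than any notion of what the characters mean, which is why visually related symbols end up on opposite sides.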

*## Missing characters*

A handful of characters are neither operators nor identifiers. This list
mostly makes sense (reserved characters and whitespace), but I wonder about
a few which seem like they could easily be operators: ⑊ ⑀ ﹅ etc.

short url: <https://goo.gl/U0GVNn>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[\u0001-\U0010FFFF]-[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF][a-zA-Z _ \u00A8 \u00AA \u00AD \u00AF \u00B2-\u00B5 \u00B7-\u00BA \u00BC-\u00BE \u00C0-\u00D6 \u00D8-\u00F6 \u00F8-\u00FF \u0100-\u02FF \u0370-\u167F \u1681-\u180D \u180F-\u1DBF \u1E00-\u1FFF \u200B-\u200D \u202A-\u202E \u203F-\u2040 \u2054 \u2060-\u206F \u2070-\u20CF \u2100-\u218F \u2460-\u24FF \u2776-\u2793 \u2C00-\u2DFF \u2E80-\u2FFF \u3004-\u3007 \u3021-\u302F \u3031-\u303F \u3040-\uD7FF \uF900-\uFD3D \uFD40-\uFDCF \uFDF0-\uFE1F \uFE30-\uFE44 \uFE47-\uFFFD \U00010000-\U0001FFFD \U00020000-\U0002FFFD \U00030000-\U0003FFFD \U00040000-\U0004FFFD \U00050000-\U0005FFFD \U00060000-\U0006FFFD \U00070000-\U0007FFFD \U00080000-\U0008FFFD \U00090000-\U0009FFFD \U000A0000-\U000AFFFD \U000B0000-\U000BFFFD \U000C0000-\U000CFFFD \U000D0000-\U000DFFFD \U000E0000-\U000EFFFD] [0-9 \u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE20-\uFE2F]]]
>

*# Solutions*

Still up for discussion — please reply to this thread!

Adopting (X)ID_Start/Continue for identifiers, or a simpler solution like
Haskell's use of "letter" categories, might work well.

(I've given up hope of finding some kind of "perfect" solution — how can
it be possible, when ᛏ is a letter, yet ↑ is not?)
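For a feel of what an XID-based rule would accept, here is a minimal sketch using the `isXIDStart`/`isXIDContinue` properties that the Swift 5 standard library exposes on `Unicode.Scalar.Properties`. `isXIDIdentifier` is a hypothetical helper, not anything the compiler does.

```swift
// True if the first scalar is XID_Start and every later one is XID_Continue.
// Note that "_" is XID_Continue but not XID_Start, so a Swift rule built on
// this would have to add the underscore to the head set explicitly.
func isXIDIdentifier(_ text: String) -> Bool {
    guard let first = text.unicodeScalars.first,
          first.properties.isXIDStart else { return false }
    return text.unicodeScalars.dropFirst().allSatisfy {
        $0.properties.isXIDContinue
    }
}

print(isXIDIdentifier("\u{16CF}")) // ᛏ RUNIC LETTER TIWAZ TIR TYR T
print(isXIDIdentifier("\u{2191}")) // ↑ UPWARDS ARROW
```

The runic letter passes and the arrow does not, which is exactly the letter-versus-symbol distinction mentioned above.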

Making the choice of operator characters more logical/standards-based
would be nice (not just a set of ranges). However, Haskell's approach of
using all punctuation & symbols is probably not right for Swift:

short url: <https://goo.gl/Ud4KqY>

<
http://unicode.org/cldr/utility/unicodeset.jsp?a=[[-%2F%3D%2B!*%<>\%26|\^~?\u00A1-\u00A7\u00A9\u00AB\u00AC\u00AE\u00B0-\u00B1\u00B6\u00BB\u00BF\u00D7\u00F7\u2016-\u2017\u2020-\u2027\u2030-\u203E\u2041-\u2053\u2055-\u205E\u2190-\u23FF\u2500-\u2775\u2794-\u2BFF\u2E00-\u2E7F\u3001-\u3003\u3008-\u3030\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE00-\uFE0F\uFE20-\uFE2F\U000E0100-\U000E01EF]]&b=[[:Currency_Symbol:][:Modifier_Symbol:][:Math_Symbol:][:Other_Symbol:][:Connector_Punctuation:][:Dash_Punctuation:][:Close_Punctuation:][:Final_Punctuation:][:Initial_Punctuation:][:Other_Punctuation:][:Open_Punctuation:]]
>

I'm not really sure what to do with emoji — they're a very cute novelty
feature, but I don't know what the motivation is for including these as
valid operators/identifiers.

At the least, we should try to gather them all into one of the two
categories. My inclination would be to keep them as identifiers, which
would mean moving the following out of the operator category:

short url: <https://goo.gl/CBJEKX>

<
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AEmoji%3A]%26[[%2F%3D\-%2B!*%<>\%26|\^~%3F \u00A1-\u00A7 \u00A9\u00AB \u00AC \u00AE \u00B0-\u00B1 \u00B6 \u00BB \u00BF \u00D7 \u00F7 \u2016-\u2017 \u2020-\u2027 \u2030-\u203E \u2041-\u2053 \u2055-\u205E \u2190-\u23FF \u2500-\u2775 \u2794-\u2BFF \u2E00-\u2E7F \u3001-\u3003 \u3008-\u3030] [\u0300-\u036F \u1DC0-\u1DFF \u20D0-\u20FF \uFE00-\uFE0F \uFE20-\uFE2F \U000E0100-\U000E01EF]]]
>

*# Concurrently-discussable topics*

There are a few relevant topics that came to mind, which I think are worth
discussing around the same time.

*## Dollar signs ($)*

$ is currently allowed in identifiers, but it can't begin an identifier
except for the magic implicit closure params ($0, $1, ...) and
LLDB/REPL-related uses.

It's arguable, but I feel that $ would be more effective as an operator
character than an identifier character. There's precedent in Haskell for
operators like `<$>` and being able to replicate these in Swift would be
nice.
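Since `$` cannot appear in a custom operator today, the Haskell spelling `<$>` is not available. As an illustration only, the same mapping operator can be declared under a stand-in spelling such as `<^>`; the spelling, the precedence group, and the array-only signature below are all assumptions made for this sketch.

```swift
// Stand-in for Haskell's <$> (fmap), restricted to arrays for brevity.
infix operator <^>: AdditionPrecedence

func <^> <A, B>(transform: (A) -> B, values: [A]) -> [B] {
    return values.map(transform)
}

func twice(_ x: Int) -> Int { return x * 2 }

let doubled = twice <^> [1, 2, 3]
print(doubled)
```

If `$` were an operator character, the declaration could read `infix operator <$>` with no other changes.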

*## Diagnostics improvements*

Regardless of what ends up being the ultimate solution, it would be great
to improve diagnostics for cases when the wrong types of characters are
used.

`infix operator abc` produces `'abc' is considered to be an identifier,
not an operator`. That's not too bad.

`let +++ = 3` produces `expected pattern`.

`let $foo = 3` produces `expected numeric value following '$'`.

*## Security and сοnfuѕаbIе characters*

Confusable characters (e vs. е, o vs. ο, ; vs. ;) are an issue not taken
lightly in the world of web security (cf. domain names). I haven't found
much information about whether this has been considered a major security
issue in programming languages, but I would think so (one can imagine such
characters being introduced to a codebase subtly over time, hiding
malicious functionality).

It'd be pretty cool if Swift could detect whether two identifiers might be
confusable, and produce a warning.

<http://www.unicode.org/reports/tr36/#Recommendations_General>
<http://unicode.org/reports/tr39/#Confusable_Detection>
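A much-simplified sketch of the idea: map each scalar to a "prototype" and compare the results. The three-entry table below is hand-picked from the examples above and is purely hypothetical; the real UTS #39 procedure uses a far larger data file and also involves normalization, which this skips.

```swift
// Hypothetical prototype map; UTS #39's confusables data is far larger.
let prototypes: [Unicode.Scalar: Unicode.Scalar] = [
    "\u{0435}": "e", // CYRILLIC SMALL LETTER IE
    "\u{03BF}": "o", // GREEK SMALL LETTER OMICRON
    "\u{037E}": ";", // GREEK QUESTION MARK
]

// Replaces every scalar that has a prototype; leaves the rest untouched.
func skeleton(_ text: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        result.append(prototypes[scalar] ?? scalar)
    }
    return String(result)
}

// Two spellings are flagged when they differ scalar-by-scalar but share a
// skeleton. Scalars are compared directly because String equality already
// treats canonically equivalent sequences as equal.
func mightBeConfused(_ a: String, _ b: String) -> Bool {
    let identical = a.unicodeScalars.elementsEqual(b.unicodeScalars)
    return !identical && skeleton(a) == skeleton(b)
}

print(mightBeConfused("scope", "sc\u{03BF}p\u{0435}"))
print(mightBeConfused("scope", "score"))
```

A compiler warning along these lines would compare each newly declared identifier against the skeletons of the names already in scope.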

We have had a patch sitting in the queue for a long time now
<https://github.com/apple/swift/pull/732> that does diagnostics for
confusables if you want to take that up again.

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution



(Jacob Bandes-Storch) #7

But more importantly, you were also the one who first asked for it to be an
operator character :slight_smile:
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html

Did you have a formal proposal in the works for this? If so, it might be
worth reviewing separately from any other changes. $ is a more well-known
character, and probably more likely to elicit opinions than some more
obscure Unicode stuff.

···

On Sun, Sep 18, 2016 at 6:34 PM, Robert Widmann <devteam.codafi@gmail.com> wrote:

Some thoughts


I feel a bit bad having implemented the patch that banned this - it feels
like dollar was mistakenly left out of the operator character range
considering how well it worked in operators up to then. Disambiguation
with respect to other language constructs (anonymous parameters in closures
and LLDB variables) is trivial and we already had diagnostics about it.


(Chris Lattner) #8

Let me tl;dr'er this even more: ☹️ is an operator, but 🙂 is an identifier.

-- E, succinct, who thinks there's room for improvement

Ha, yes. Let's see if I can be as succinct in my contribution to the discussion:

1) Agree that current situation not ideal, for reasons above

+1, totally agreed. We really need to improve this; aiming for Swift 3.1 or Swift 4 seems like a really good idea, because the appetite for this sort of change will probably be very low after Swift 4.

2) The solution might best be not one but several proposals:

  2a) Unicode normalization: invisible characters, Greek tonos, etc. (cf. previous message about previously proposed solution, which reflects Unicode recommendations in UTR #31)--low hanging fruit: there's an established Unicode recommendation with clear wins for security and consistency

  2b) Legal and illegal characters for identifiers *or* operators: UTR #31 makes recommendations regarding rarely used scripts; probably best to follow the letter and spirit of these recommendations (which would probably mean ancient Greek musical symbols and Egyptian hieroglyphics shouldn't be identifier or operator characters)

  2c) Decisions as to which characters are identifier characters or operator characters: for instance, emoji should probably never be operator characters; if an emoji has a non-emoji counterpart that is an operator (❗️❓➕➖➗✖️, etc.) it might be best simply to make these illegal rather than operator characters

  2d) Confusables: I think the last time we had this discussion, it was apparent that it'd be difficult to decide which confusables to allow or disallow after some of the low-hanging fruit is taken care of by Unicode normalization (see item 2a); the Unicode Consortium-provided list seems too quick to call two things "confusable" for our purposes (with criteria that might be relevant for URLs or other use cases, but casting too wide a net perhaps for Swift identifiers)

These all seem like good points. I agree that we should default to following an existing Unicode standard unless there is a really good reason to deviate.

I don’t have an opinion about the specific direction of the proposal though.

-Chris

···

On Sep 18, 2016, at 6:24 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:
On Sun, Sep 18, 2016 at 9:19 PM, Erica Sadun via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:


(Dave Abrahams) #9

Confusables can IMO be entirely handled with warnings, which requires
much less discussion here than if it were a language feature.

···

on Sun Sep 18 2016, Xiaodi Wu <swift-evolution@swift.org> wrote:

  2d) Confusables: I think the last time we had this discussion, it was
apparent that it'd be difficult to decide which confusables to allow or
disallow after some of the low-hanging fruit is taken care of by Unicode
normalization (see item 2a); the Unicode Consortium-provided list seems too
quick to call two things "confusable" for our purposes (with criteria that
might be relevant for URLs or other use cases, but casting too wide a net
perhaps for Swift identifiers)

--
-Dave


(Robert Widmann) #10

In that case it was because $ was not allowed in operators. Here it’s just not allowed at all!

Nevertheless, the irony is delicious,

~Robert Widmann

···

On Sep 22, 2016, at 2:05 AM, Jacob Bandes-Storch <jtbandes@gmail.com> wrote:

On Sun, Sep 18, 2016 at 6:34 PM, Robert Widmann <devteam.codafi@gmail.com <mailto:devteam.codafi@gmail.com>> wrote:
Some thoughts


I feel a bit bad having implemented the patch that banned this - it feels like dollar was mistakenly left out of the operator character range considering how well it worked in operators up to then. Disambiguation with respect to other language constructs (anonymous parameters in closures and LLDB variables) is trivial and we already had diagnostics about it.

But more importantly, you were also the one who first asked for it to be an operator character :slight_smile: https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html

Did you have a formal proposal in the works for this? If so, it might be worth reviewing separately from any other changes. $ is a more well-known character, and probably more likely to elicit opinions than some more obscure Unicode stuff.


(Alex Blewitt) #11

It would probably make sense to define the supported characters based on their category, rather than abstract ranges of character sets. For example, using the Letter and Number categories might be sufficient for defining identifiers.

https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

In this case both of these characters are in the 'Symbol, Other' category:

🙂 http://www.fileformat.info/info/unicode/char/1f642/index.htm
☹️ http://www.fileformat.info/info/unicode/char/2639/index.htm

Having the language define which categories are used for which type means they don't have to be individually enumerated as part of the grammar:

https://developer.apple.com/library/prerelease/content/documentation/Swift/Conceptual/Swift_Programming_Language/zzSummaryOfTheGrammar.html#//apple_ref/doc/uid/TP40014097-CH38-ID458

It is possible to read and process the Unicode format to build up the character ranges programmatically; that's what ICU does to efficiently be able to answer questions like 'Is this a valid upper case letter?'. But defining the ranges as part of the grammar leads to evolutionary changes like this which can't be predicted in advance, because they're defined on a set of fixed code points.
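As a rough illustration of the category-based approach, the sketch below buckets scalars by general category using `Unicode.Scalar.Properties` from the Swift 5 standard library. The `Role` enum and the choice of which categories go in which bucket are assumptions for the example, not a proposed rule.

```swift
enum Role { case identifier, symbol, other }

// Letters and numbers are treated as identifier material; the four Symbol
// categories are grouped together; everything else is left unclassified.
func role(of scalar: Unicode.Scalar) -> Role {
    switch scalar.properties.generalCategory {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter,
         .decimalNumber, .letterNumber, .otherNumber:
        return .identifier
    case .mathSymbol, .currencySymbol, .modifierSymbol, .otherSymbol:
        return .symbol
    default:
        return .other
    }
}

print(role(of: "\u{1F642}")) // 🙂, 'Symbol, Other'
print(role(of: "\u{2639}"))  // ☹, 'Symbol, Other'
print(role(of: "\u{0662}"))  // ٢ ARABIC-INDIC DIGIT TWO, a decimal number
```

Under a rule like this the two faces land in the same bucket, whichever bucket the language then assigns to that category.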

Alex

···

On 18 Sep 2016, at 21:29, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 18, 2016, at 6:24 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Sun, Sep 18, 2016 at 9:19 PM, Erica Sadun via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:
Let me tl;dr'er this even more: ☹️ is an operator, but 🙂 is an identifier.

-- E, succinct, who thinks there's room for improvement

Ha, yes. Let's see if I can be as succinct in my contribution to the discussion:

1) Agree that current situation not ideal, for reasons above

+1, totally agreed. We really need to improve this; aiming for Swift 3.1 or Swift 4 seems like a really good idea, because the appetite for this sort of change will probably be very low after Swift 4.

2) The solution might best be not one but several proposals:

  2a) Unicode normalization: invisible characters, Greek tonos, etc. (cf. previous message about previously proposed solution, which reflects Unicode recommendations in UTR #31)--low hanging fruit: there's an established Unicode recommendation with clear wins for security and consistency

  2b) Legal and illegal characters for identifiers *or* operators: UTR #31 makes recommendations regarding rarely used scripts; probably best to follow the letter and spirit of these recommendations (which would probably mean ancient Greek musical symbols and Egyptian hieroglyphics shouldn't be identifier or operator characters)

  2c) Decisions as to which characters are identifier characters or operator characters: for instance, emoji should probably never be operator characters; if an emoji has a non-emoji counterpart that is an operator (❗️❓➕➖➗✖️, etc.) it might be best simply to make these illegal rather than operator characters

  2d) Confusables: I think the last time we had this discussion, it was apparent that it'd be difficult to decide which confusables to allow or disallow after some of the low-hanging fruit is taken care of by Unicode normalization (see item 2a); the Unicode Consortium-provided list seems too quick to call two things "confusable" for our purposes (with criteria that might be relevant for URLs or other use cases, but casting too wide a net perhaps for Swift identifiers)

These all seem like good points. I agree that we should default to following an existing Unicode standard unless there is a really good reason to deviate.

I don’t have an opinion about the specific direction of the proposal though.

-Chris
