A path forward on rationalizing unicode identifiers and operators

The core team recently met to discuss PR609 - Refining identifier and operator symbology:

The proposal correctly observes that the partitioning of unicode codepoints into identifiers and operators is a mess in some cases. It really is an outright bug for :slightly_smiling_face: to be an identifier, but :frowning: to be an operator. That said, the proposal itself is complicated and is defined in terms of a bunch of unicode classes that may evolve in the “wrong way for Swift” in the future.

The core team would really like to get this sorted out for Swift 5, and sooner is better than later :-). Because it seems that this is a really hard problem and that perfection is becoming the enemy of good <https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good>, the core team requests the creation of a new proposal with a different approach. The general observation is that there are three kinds of characters: things that are obviously identifiers, things that are obviously math operators, and things that are non-obvious. Things that are non-obvious can be made into invalid code points, and legislated later in follow-up proposals if/when someone cares to argue for them.

To make progress on this, we suggest a few separable steps:

First, please split out the changes to the ASCII characters (e.g. . and \ operator parsing rules) into their own (small) proposal, since they are unrelated to the unicode changes; that proposal can then make progress independently.

Second, someone should take a look at the concrete set of unicode identifiers that are accepted by Swift 4 and write a new proposal that splits them into the three groups: those that are clearly identifiers (which become identifiers), those that are clearly operators (which become operators), and those that are unclear or don’t matter (these become invalid code points).

I suggest that the criteria be based on utility for Swift code, not on the underlying unicode classification. For example, the discussion thread for PR609 mentions that the T character in “ xᵀ ” is defined in unicode as a latin “letter”. Despite that, its use in Swift would clearly be as a postfix operator, so we should classify it as an operator.

Other suggestions:
- Math symbols are operators excepting those primarily used as identifiers like “alpha”. If there are any characters that are used for both, this proposal should make them invalid.
- While there may be useful ranges for some identifiers (e.g. to handle European accented characters), the Emoji range should probably have each codepoint independently judged, and currently unassigned codepoints should not get a meaning defined for them.
- Unicode “faces”, “people”, “animals” etc are all identifiers.
- In order to reduce the scope of the proposal, it is a safe default to exclude characters that are unlikely to be used by Swift code today, including Braille, weird currency symbols, or any set of characters that are so broken and useless in Swift 4 that it isn’t worth worrying about.
- The proposal is likely to turn a large number of code points into rejected characters. In the discussions, some people will be tempted to argue endlessly about individual rejections. To control that, we can require that people point out an example where the character is already in use, or where it has a clear application to a domain that is known today: the discussion needs to be grounded and practical, not theoretical.

Third, if there is interest sometime in the future, we can have subsequent proposals that expand the range of accepted code points, motivated by the specific application domain that cares about them. These proposals will not be source breaking, so they can happen at any time.

Is anyone interested in helping to push this effort forward?

-Chris

I have a technical question on this:

Instead of parsing these into identifiers & operators, would it be possible to parse these into 3 categories: Identifiers, Operators, and Ambiguous?

The ambiguous category would be disallowed for the moment, as you say. But since they are rarely used, maybe we can allow a declaration (similar to how we define operators) that effectively pulls it into one of the other categories (not in terms of tokenization, but in terms of how it can be used in Swift). Trying to pull it into both would be a compilation error.

That way, Xiaodi can have a framework which lets her use superscript T as an identifier, and I can have one where I use superscript 2 to square things. The obvious/frequently used characters would not be ambiguous, so it would only slow down compilation when the rare/ambiguous characters are used.

In my mind, this would be the ideal solution, and it could be done in stages (with the ambiguous characters just being forbidden for now), but I am not sure if it is technically possible.

Thanks,
Jon

On Sep 30, 2017, at 3:59 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

The core team recently met to discuss PR609 - Refining identifier and operator symbology:
https://github.com/xwu/swift-evolution/blob/7c2c4df63b1d92a1677461f41bc638f31926c9c3/proposals/NNNN-refining-identifier-and-operator-symbology.md

I’m happy to participate in the reshaping of the proposal. It would be nice
to gather a group of people again to help drive it forward.

That said, it’s unclear to me that superscript T is clearly an operator,
any more than would be superscript H (Hermitian), superscript 2,
superscript 3, etc. But at any rate, this would be discussion for the
future workgroup.

I would strongly advocate that the things-that-are-identifiers group be
strongly tied to the existing, complete Unicode standard for such, and that
the critical parts of the previous document about normalization be retained.

After reading the original proposal and the Unicode Annex #31 document that
underlies it (https://unicode.org/reports/tr31/) I think that the existing
work as an underlying layer could help frame the discussion and push it
forward.

Although I do see the concerns about defining things too strictly in
Unicode terms, the proposal brings in some really helpful work without
really tethering Swift to Unicode categories now or in the future. If
nothing else, throwing out the Unicode framework to start from scratch was
always likely to turn the thread into a massive, detailed, unstructured
discussion of how to handle individual characters, which seems to be what
happened.

Particularly regarding identifiers, building on the existing Unicode work
seems a more promising way to make progress than starting from scratch.

My suggestion is to revive the proposal in part, keeping identifier work
and leaving operator work at the structural level:

1) Identifiers: Swift valid identifier characters will be aligned with UAX#31,
with exceptions.
2) Operators: Swift valid operator characters will be defined as an
arbitrary list, subject to community discussion within the Unicode
structure.

Identifiers:

The existing Unicode work in document UAX#31 is entirely about identifiers
and like the proposal authors, I think Swift can use this, with exceptions,
as the proposal specifies. In particular, the work in UAX#31 for
identifiers seems well thought out and worthy of inclusion, and specifies
that individual programming languages can define their syntax *relative* to
these defaults, which is probably what Swift wants to do. 1) Swift
definitely will have "_" as a valid identifier-head. 2) Swift may need
additional rules for using Emoji as identifiers.
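
As a rough illustration of defining identifiers relative to the UAX#31 defaults, here is a minimal Swift sketch. It assumes a Swift 5 or later toolchain, where `Unicode.Scalar.Properties` exposes `isXIDStart` and `isXIDContinue` (that API is newer than this thread), and the function names are hypothetical. It is not how the Swift lexer is implemented.

```swift
// Sketch only: identifier rules defined relative to the UAX#31 defaults.
// isIdentifierHead, isIdentifierContinue and isValidIdentifier are
// hypothetical names chosen for this example.

func isIdentifierHead(_ scalar: Unicode.Scalar) -> Bool {
    // Exception layered on top of the default: "_" may start an identifier.
    return scalar == "_" || scalar.properties.isXIDStart
}

func isIdentifierContinue(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isXIDContinue
}

func isValidIdentifier(_ text: String) -> Bool {
    var scalars = text.unicodeScalars.makeIterator()
    // An empty string or a bad first scalar is rejected immediately.
    guard let head = scalars.next(), isIdentifierHead(head) else {
        return false
    }
    while let next = scalars.next() {
        if !isIdentifierContinue(next) { return false }
    }
    return true
}
```

With only these defaults, `isValidIdentifier("_count1")` and `isValidIdentifier("café")` hold, while `"1abc"` and an emoji such as `"🙂"` are rejected, which is why emoji would need additional rules.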

This is described more clearly later in the proposal, under
details-identifiers (proposals/NNNN-refining-identifier-and-operator-symbology.md#identifiers
in the xwu/swift-evolution repository). It also has some guidelines on
identifier equivalence that would be worth pulling in.

Not sure if there is objection to the identifier part of the proposal. It
may have been phrased too prescriptively in terms of Unicode, but that
doesn't seem to be the intent.

Operators:

Unicode doesn't provide any guidance on operators, so what Swift includes
is arbitrary regardless. The rule in the proposal was solid but, first, too
complex: Default operator characters would be all unicode characters 1)
tagged by Unicode as 1a) "Pattern Syntax" characters and 1b) "Mathematical"
but 2) excluding characters in the blocks for 2a) "Geometric Shapes", 2b)
"Miscellaneous Symbols" and 2c) "Miscellaneous Technical". I can see why
this was considered too complex, but as an attempt for a non-arbitrary
definition it was a great start.
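
The rule as summarized above can be written as a single predicate. This is a minimal sketch, assuming Swift 5 or later (for `isPatternSyntax` and `isMath` on `Unicode.Scalar.Properties`) and assuming that conditions 1a and 1b must both hold; the summary does not say whether they were meant to be combined with "and" or "or". The function name is hypothetical.

```swift
// Sketch only: the proposal's default operator rule as summarized above.
// isDefaultOperatorCharacter is a hypothetical name.

func isDefaultOperatorCharacter(_ scalar: Unicode.Scalar) -> Bool {
    // 2) Blocks excluded wholesale, regardless of properties.
    let excludedBlocks: [ClosedRange<UInt32>] = [
        0x25A0...0x25FF, // 2a) Geometric Shapes
        0x2600...0x26FF, // 2b) Miscellaneous Symbols
        0x2300...0x23FF, // 2c) Miscellaneous Technical
    ]
    if excludedBlocks.contains(where: { $0.contains(scalar.value) }) {
        return false
    }
    // 1) Assumption: both 1a) Pattern_Syntax and 1b) Math are required.
    return scalar.properties.isPatternSyntax && scalar.properties.isMath
}
```

Under this reading, `+` qualifies, a letter such as `a` does not, and U+2639 (in Miscellaneous Symbols) is excluded by its block.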

So the Swift list of valid operator characters is arbitrary: embrace it.

Right now Swift's arbitrary list of Unicode ranges that are valid
operator-head or operator-character is unreadable and therefore mostly
useless for looking up which characters are valid
(https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html#//apple_ref/doc/uid/TP40014097-CH30-ID418).
It's introduced in text as:

Custom operators can begin with one of the ASCII characters /, =, -, +, !,
*, %, <, >, &, |, ^, ?, or ~, or one of the Unicode characters defined in
the grammar below (which include characters from the Mathematical
Operators, Miscellaneous Symbols, and Dingbats Unicode blocks, among
others). After the first character, combining Unicode characters are also
allowed.

This description is more helpful. Discussions of characters should make
reference to which code blocks they belong to, and attempt to align with
Unicode groupings or categories when it makes sense to do so.

The online "UnicodeSet" functionality is very useful for seeing what
characters are included (https://unicode.org/cldr/utility/list-unicodeset.jsp).
For the current Swift 4 Language Reference, I made a list to see what
was included, which for this tool required manually expanding some ranges
and making a few other inferences, i.e., Swift allows the "dingbats" code
block with the exception of "Dingbat circled digits."

You can see the UnicodeSet list of the existing Swift operator characters in
the Unicode Utilities: UnicodeSet tool (warning: may not be perfect).

The UnicodeSet tool doesn't scale very well for discussions. Regardless, it
would be very helpful if the final lists were accordingly grouped by
Unicode block or subhead, together with at least some descriptions /
examples:

*ASCII*
/ = - + ! * % < > & | ^ ~ ?

*Latin 1 Supplement*
Latin-1 punctuation and symbols: U+00A1 (INVERTED EXCLAMATION MARK)–U+00A7,
U+00A9, U+00AB, U+00AC, U+00AE, U+00B0–U+00B1, U+00B6, U+00BB
(RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK)
Punctuation: U+00BF (INVERTED QUESTION MARK)
Mathematical operator: U+00D7 (MULTIPLICATION SIGN), U+00F7 (DIVISION SIGN)

*General Punctuation*
General punctuation: U+2016 (DOUBLE VERTICAL LINE)-U+2057 (QUADRUPLE PRIME)
Quotation marks: U+2039, U+203A
Double punctuation for vertical text: U+203C, U+2047-U+2049
Archaic punctuation: U+2056 (THREE DOT PUNCTUATION) - U+205E (VERTICAL FOUR
DOTS)

...

The included names of characters are just to help illustrate what kind of
marks are in each block, which with the titles helps to stay oriented. If
some kind of organization like this made it into Swift documentation, it
would be hugely helpful.

Of course it does raise the question of why certain characters made it in
and others didn't--did someone affirmatively decide Ancient Greek Textual
Symbols should be valid operators? It appears operators can start with
characters from General Punctuation, Arrows, Miscellaneous Technical, Box
Drawing, Dingbats (except Dingbat Circled Digits), Supplemental Punctuation
and some CJK Symbols and Punctuation.

Some criteria on how to decide what to include would be helpful -- I would
suggest that criteria go into this proposal, and the final list be
assembled in the future, perhaps subject to regular (annual?) review.

And for the mailing list (as long as we're still on it) can we have a tag
for discussions of specific operators, something like
[operator--discussion]? Probably many people would like to follow the
overall discussion but not most discussions of specific operators. I think
the forum-based discussion will help, and my point is mainly that
discussion of the criteria or broad scope is different from the discussion
of any particular character's inclusion or exclusion.

Does this direction seem like it could preserve flexibility while taking
advantage of Unicode standards as a starting point for Unicode characters
as identifiers, and creating criteria and structure for future discussion
of operators?

Mike Sand

Superscript T’s only regular use that I’m aware of is as the transpose operator for vectors and matrices. I’m certainly not omniscient, though.

Are we going to attempt to distinguish between characters like these two?
ⁿ (SUPERSCRIPT LATIN SMALL LETTER N Unicode: U+207F, UTF-8: E2 81 BF)
n (LATIN SMALL LETTER N Unicode: U+006E, UTF-8: 6E), with a superscript format applied

- Dave Sweeris
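
To make the distinction in the question concrete: the first character is its own code point, so a lexer can tell it apart from a plain "n", whereas superscript formatting applied to U+006E does not exist in plain-text source at all. A minimal Swift sketch that checks the values quoted above (the constant and function names are hypothetical):

```swift
// Sketch only: the two characters from the question as Unicode scalars.
let superscriptN: Unicode.Scalar = "\u{207F}" // SUPERSCRIPT LATIN SMALL LETTER N
let plainN: Unicode.Scalar = "n"              // LATIN SMALL LETTER N (U+006E)

// Returns the UTF-8 encoding of a single scalar.
func utf8Bytes(of scalar: Unicode.Scalar) -> [UInt8] {
    return Array(String(scalar).utf8)
}

// The two scalars differ in value and in encoded bytes, so source text
// containing one is never confused with source text containing the other.
let areDistinct = superscriptN != plainN
```

Here `utf8Bytes(of: superscriptN)` yields `[0xE2, 0x81, 0xBF]` and `utf8Bytes(of: plainN)` yields `[0x6E]`, matching the encodings listed in the question.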

On Sep 30, 2017, at 16:13, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

I’m happy to participate in the reshaping of the proposal. It would be nice to gather a group of people again to help drive it forward.

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future workgroup.

What happens if two public operator declarations conflict?

Superscript T’s only regular use that I’m aware of is as the transpose
operator for vectors and matrices. I’m certainly not omniscient, though.

You don’t need to be omniscient: superscript T is part of the Unicode
Phonetic Extensions block—i.e., its existence is justified by use in some
phonetic spelling.

···

On Sat, Sep 30, 2017 at 18:59 David Sweeris <davesweeris@mac.com> wrote:

On Sep 30, 2017, at 16:13, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

Are we going to attempt to distinguish between characters like these two?
ⁿ (SUPERSCRIPT LATIN SMALL LETTER N Unicode: U+207F, UTF-8: E2 81 BF)
n (LATIN SMALL LETTER N Unicode: U+006E, UTF-8: 6E), with a superscript
format applied
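For reference, the two characters really are distinct code points with distinct encodings. A small sketch (the constant names are made up for illustration) that reproduces the scalar values and UTF-8 bytes quoted above:

```swift
// The two characters from the question, as individual Unicode scalars.
let superscriptN: Unicode.Scalar = "\u{207F}"   // SUPERSCRIPT LATIN SMALL LETTER N
let plainN: Unicode.Scalar = "n"                // LATIN SMALL LETTER N, U+006E

// Encode each scalar as UTF-8 to compare with the byte sequences quoted above.
let superscriptBytes = Array(String(superscriptN).utf8)
let plainBytes = Array(String(plainN).utf8)

// The scalars compare unequal, so a lexer sees two different characters.
let areSameScalar = superscriptN == plainN
```

A plain n that is merely displayed with superscript formatting is still U+006E in the source file, so only the first of the two can be told apart by a compiler.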

- Dave Sweerisn

This is commonly requested, but the third category isn’t practical.

Swift statically partitions characters between identifiers and operators to make it possible to parse a Swift source file without parsing all of its dependencies. If you could have directives that change this, it would be difficult or perhaps impossible to parse a file that used these characters without parsing/reading the transitive closure of dependent modules.

This is important for compile speed and some tooling, and is an area that C gets wrong - its grammar requires all headers to be parsed in order to distinguish between type names and normal identifiers.
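To make the "static partition" point concrete, here is a minimal sketch, assuming a made-up table. The type `CharacterClass`, the function `classify`, and every range in it are illustrative assumptions, not Swift's grammar and not anything from the proposal. The point is only that the answer depends on the code point alone, so no other file or module has to be read:

```swift
// Hypothetical three-way split of code points, as suggested in this thread.
enum CharacterClass {
    case identifier
    case operatorSymbol
    case invalid
}

// Illustrative table only: these ranges are assumptions for the sketch.
// The result is a pure function of the scalar value, so a lexer can
// classify any character without looking at declarations elsewhere.
func classify(_ scalar: Unicode.Scalar) -> CharacterClass {
    switch scalar.value {
    case 0x41...0x5A, 0x61...0x7A, 0x5F:    // A-Z, a-z, underscore
        return .identifier
    case 0x1F600...0x1F64F, 0x2639...0x263A: // emoticon faces, both smiling and frowning
        return .identifier
    case 0x2B, 0x2D, 0x2A, 0x2F:            // + - * /
        return .operatorSymbol
    case 0x2200...0x22FF:                   // Mathematical Operators block
        return .operatorSymbol
    default:                                // everything else is rejected for now
        return .invalid
    }
}
```

Under this sketch a later proposal can move a code point out of the `default` case without breaking existing source, since nothing valid today relies on it.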

-Chris

···

On Sep 30, 2017, at 7:10 PM, Jonathan Hull <jhull@gbis.com> wrote:

I have a technical question on this:

Instead of parsing these into identifiers & operators, would it be possible to parse these into 3 categories: Identifiers, Operators, and Ambiguous?

The ambiguous category would be disallowed for the moment, as you say. But since they are rarely used, maybe we can allow a declaration (similar to how we define operators) that effectively pulls it into one of the other categories (not in terms of tokenization, but in terms of how it can be used in Swift).

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future

Allowing superscripted characters to be used as operators seems like it would take the proposal down a rabbit hole. It’s getting into the realm of two-dimensional notation. It would raise questions like: should we allow overscript or underscript operators? Or should ‘inert’ superscripts and subscripts be allowed as part of identifiers, as they sometimes are in physics (for example, transformed variables are sometimes ‘primed’)?

-Matt

I’m happy to participate in the reshaping of the proposal. It would be nice to gather a group of people again to help drive it forward.

Awesome, thank you!

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future workgroup.

Yeah, a future proposal can debate that. For now, any of these that are currently accepted should get sidelined into the “unclear” bucket in order to make progress.

Just FWIW, IMO, these make sense as operators specifically because they are commonly used by math people as operations that transform the thing they are attached to. Superscript 2 is a function that squares its operand. That said, perhaps there are other uses that I’m not aware of which get in the way of the utilitarian interpretation. Such is a discussion for a future round after the active damage is fixed :)
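As a rough illustration of "an operation that transforms the thing it is attached to", here is a sketch of a postfix operator. It uses † (U+2020) purely as a stand-in, on the assumption that it is acceptable because it falls in the operator ranges of Swift's published grammar. The superscript characters discussed here cannot be declared this way today. The operator and its meaning (transpose of a real matrix stored as nested arrays) are made up for the example:

```swift
// Declare a postfix operator using a character that Swift already
// treats as an operator character. This is a stand-in for superscript T.
postfix operator †

// Transpose a rectangular matrix stored as an array of rows.
// Assumes every row has the same length as the first one.
postfix func † (matrix: [[Double]]) -> [[Double]] {
    guard let firstRow = matrix.first else { return [] }
    return firstRow.indices.map { column in
        matrix.map { row in row[column] }
    }
}

let m: [[Double]] = [[1, 2, 3], [4, 5, 6]]
let transposed = m†
```

The operator attaches directly to its operand with no whitespace, which is what lets the parser treat it as postfix.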

I would strongly advocate that the things-that-are-identifiers group be strongly tied to the existing, complete Unicode standard for such, and that the critical parts of the previous document about normalization be retained.

Makes sense if there is something that covers the right bases. We certainly don’t want to be enumerating every accented letter codepoint, and allowing people to write words in non-English languages as identifiers is important. I’m not familiar enough to know if there is any Unicode standard that includes “just the stuff we want” though.
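On why the normalization part matters for accented letters, a small sketch using ordinary strings rather than identifiers (the names are illustrative). The same visible word can be spelled with different scalar sequences. Swift's `String` comparison treats them as equal, while a scalar-by-scalar comparison, which is roughly what a lexer without normalization would do, does not:

```swift
// Two spellings of the same visible word.
let precomposed = "caf\u{00E9}"   // é as the single scalar U+00E9
let decomposed = "cafe\u{0301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

// String equality in Swift is based on canonical equivalence.
let equalAsStrings = precomposed == decomposed

// Comparing the raw scalars sees two different sequences.
let equalAsScalars = Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)

let precomposedCount = precomposed.unicodeScalars.count
let decomposedCount = decomposed.unicodeScalars.count
```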

-Chris

···

On Sep 30, 2017, at 4:12 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

Hi All.

I’d like to help as well. I have fun with operators.

There is also the issue of code security with invisible unicode characters and characters that look exactly alike. (They should make a Coding font that ensures all characters look different.) Was that ever resolved? Googling, I found this:

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160620/021446.html

Which seems to have been left at this:

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025555.html

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160919/thread.html#27229

Should we throw all of this into the same pot, and make any characters that aren’t on the approved list illegal?

-Kenny

···

On Sep 30, 2017, at 4:13 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

I’m happy to participate in the reshaping of the proposal. It would be nice to gather a group of people again to help drive it forward.

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future workgroup.

I would strongly advocate that the things-that-are-identifiers group be strongly tied to the existing, complete Unicode standard for such, and that the critical parts of the previous document about normalization be retained.

It won’t compile.

···

On Sep 30, 2017, at 7:14 PM, Taylor Swift <kelvin13ma@gmail.com> wrote:

what happens if two public operator declarations conflict?

On Sat, Sep 30, 2017 at 9:10 PM, Jonathan Hull via swift-evolution <swift-evolution@swift.org> wrote:
I have a technical question on this:

Instead of parsing these into identifiers & operators, would it be possible to parse these into 3 categories: Identifiers, Operators, and Ambiguous?

The ambiguous category would be disallowed for the moment, as you say. But since they are rarely used, maybe we can allow a declaration (similar to how we define operators) that effectively pulls it into one of the other categories (not in terms of tokenization, but in terms of how it can be used in Swift). Trying to pull it into both would be a compilation error.

That way, Xiaodi can have a framework which lets her use superscript T as an identifier, and I can have one where I use superscript 2 to square things. The obvious/frequently used characters would not be ambiguous, so it would only slow down compilation when the rare/ambiguous characters are used.

In my mind, this would be the ideal solution, and it could be done in stages (with the ambiguous characters just being forbidden for now), but I am not sure if it is technically possible.
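A toy model of the idea, only to pin down the "pulling it into both is an error" rule. Everything here (`Role`, `RoleConflict`, `AmbiguousCharacterRegistry`) is hypothetical and is not how the compiler works. It also shows the cost: a character's role depends on declarations that have to be collected first.

```swift
// Hypothetical roles an ambiguous character could be pulled into.
enum Role {
    case identifier
    case operatorSymbol
}

// Error raised when one character is declared with two different roles.
struct RoleConflict: Error, Equatable {
    let scalar: Unicode.Scalar
    let existing: Role
    let requested: Role
}

// Collects per-character declarations and rejects conflicting ones.
struct AmbiguousCharacterRegistry {
    private var roles: [Unicode.Scalar: Role] = [:]

    mutating func declare(_ scalar: Unicode.Scalar, as role: Role) throws {
        if let existing = roles[scalar], existing != role {
            throw RoleConflict(scalar: scalar, existing: existing, requested: role)
        }
        roles[scalar] = role
    }

    func role(of scalar: Unicode.Scalar) -> Role? {
        return roles[scalar]
    }
}

var registry = AmbiguousCharacterRegistry()
var sawConflict = false
do {
    try registry.declare("\u{00B2}", as: .operatorSymbol)  // superscript 2 as an operator
    try registry.declare("\u{1D40}", as: .identifier)      // superscript T as an identifier
    try registry.declare("\u{00B2}", as: .identifier)      // conflicting declaration
} catch is RoleConflict {
    sawConflict = true
} catch {
    // No other error type is thrown in this sketch.
}
```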

Thanks,
Jon

Gotcha. What if the definitions were in a special file similar to the info.plist that was read before other parsing, with one file per package?

Thanks,
Jon

···

On Oct 1, 2017, at 4:09 PM, Chris Lattner <clattner@nondot.org> wrote:

On Sep 30, 2017, at 7:10 PM, Jonathan Hull <jhull@gbis.com> wrote:

I have a technical question on this:

Instead of parsing these into identifiers & operators, would it be possible to parse these into 3 categories: Identifiers, Operators, and Ambiguous?

The ambiguous category would be disallowed for the moment, as you say. But since they are rarely used, maybe we can allow a declaration (similar to how we define operators) that effectively pulls it into one of the other categories (not in terms of tokenization, but in terms of how it can be used in Swift).

This is commonly requested, but the third category isn’t practical.

Swift statically partitions characters between identifiers and operators to make it possible to parse a Swift source file without parsing all of its dependencies. If you could have directives that change this, it would be difficult or perhaps impossible to parse a file that used these characters without parsing/reading the transitive closure of dependent modules.

This is important for compile speed and some tooling, and is an area that C gets wrong - its grammar requires all headers to be parsed in order to distinguish between type names and normal identifiers.

-Chris

Don’t worry: IIRC, that operator isn’t accepted by Swift 4, so it wouldn’t be the topic of the first rounds of the proposal.

I was simply trying to get across the idea of “design for utility”, and the utility of superscript T is really about transposing matrices. That proposal would be a follow-on after the already-accepted set of stuff is rationalized :)

-Chris

···

On Oct 1, 2017, at 4:30 PM, Matt Whiteside via swift-evolution <swift-evolution@swift.org> wrote:

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future

Allowing superscripted characters to be used as operators seems like it would take the proposal down a rabbit hole.

Hi All.

I’d like to help as well. I have fun with operators.

There is also the issue of code security with invisible unicode characters and characters that look exactly alike.

Unless there is a compelling reason to add them, I think we should ban invisible characters. What is the harm of characters that look alike?
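A sketch of what such a ban could check. The list of code points below is an assumption chosen for illustration and is certainly not exhaustive, and `containsInvisibleCharacter` is a made-up helper, not an existing API:

```swift
// Assumed, non-exhaustive list of invisible code points for illustration.
let invisibleScalars: Set<Unicode.Scalar> = [
    "\u{200B}", // ZERO WIDTH SPACE
    "\u{200C}", // ZERO WIDTH NON-JOINER
    "\u{200D}", // ZERO WIDTH JOINER
    "\u{2060}", // WORD JOINER
    "\u{FEFF}", // ZERO WIDTH NO-BREAK SPACE
]

// Returns true if any scalar in the name is on the invisible list.
func containsInvisibleCharacter(_ name: String) -> Bool {
    return name.unicodeScalars.contains { invisibleScalars.contains($0) }
}

let plainName = "count"
let sneakyName = "cou\u{200B}nt"   // renders like "count" in many fonts
```

One wrinkle a real rule would have to consider: U+200D also appears inside some emoji sequences, so a blanket ban interacts with the emoji question.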

-Chris

···

On Oct 1, 2017, at 9:26 PM, Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

(They should make a Coding font that ensures all characters look different.) Was that ever resolved? Googling, I found this:

[swift-evolution] Prohibit invisible characters in identifier names

Which seems to have been left at this:

[swift-evolution] [Proposal] Normalize Unicode Identifiers

The swift-evolution The Week Of Monday 19 September 2016 Archive by thread

Should we throw all of this into the same pot, and make any characters that aren’t on the approved list illegal?

-Kenny

Something like that is possible, but makes the language/compiler more complicated by introducing a whole new concept to the source distribution. It also doesn’t address the cases where you want to do a parse but don’t have the dependent source files, e.g. in a source browser tool like ViewVC.

-Chris

···

On Oct 1, 2017, at 4:17 PM, Jonathan Hull <jhull@gbis.com> wrote:

Gotcha. What if the definitions were in a special file similar to the info.plist that was read before other parsing, with one file per package?

Thanks,
Jon

Ok, thanks for clarifying that.

-Matt

···

On Oct 1, 2017, at 16:50, Chris Lattner <clattner@nondot.org> wrote:

On Oct 1, 2017, at 4:30 PM, Matt Whiteside via swift-evolution <swift-evolution@swift.org> wrote:

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future

Allowing superscripted characters to be used as operators seems like it would take the proposal down a rabbit hole.

Don’t worry: IIRC, that operator isn’t accepted by Swift 4, so it wouldn’t be the topic of the first rounds of the proposal.

I was simply trying to get across the idea of “design for utility”, and the utility of superscript T is really about transposing matrices. That proposal would be a follow-on after the already-accepted set of stuff is rationalized :)

-Chris

I guess theoretically you could have two variables that look alike, but are actually different values, allowing you to insert some obfuscated malicious code somehow.
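A concrete case, shown with strings rather than identifiers to keep it unambiguous on the page (the constant names are illustrative). Latin "a" and Cyrillic "а" render almost identically in most fonts but are different code points, so two names built from them would be different names:

```swift
// Two names that look the same on screen in most fonts.
let latinName = "a"             // U+0061 LATIN SMALL LETTER A
let cyrillicName = "\u{0430}"   // U+0430 CYRILLIC SMALL LETTER A

// They are not canonically equivalent, so they compare unequal.
let namesAreEqual = latinName == cyrillicName

// The underlying scalar values make the difference visible.
let latinScalars = latinName.unicodeScalars.map { $0.value }
let cyrillicScalars = cyrillicName.unicodeScalars.map { $0.value }
```

Normalization does not help here, since neither character decomposes into the other.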

-Kenny

···

On Oct 1, 2017, at 10:01 PM, Chris Lattner <clattner@nondot.org> wrote:

On Oct 1, 2017, at 9:26 PM, Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

Hi All.

I’d like to help as well. I have fun with operators.

There is also the issue of code security with invisible unicode characters and characters that look exactly alike.

Unless there is a compelling reason to add them, I think we should ban invisible characters. What is the harm of characters that look alike?

-Chris

(They should make a Coding font that ensures all characters look different.) Was that ever resolved? Googling, I found this:

[swift-evolution] Prohibit invisible characters in identifier names

Which seems to have been left at this:

[swift-evolution] [Proposal] Normalize Unicode Identifiers

The swift-evolution The Week Of Monday 19 September 2016 Archive by thread

Should we throw all of this into the same pot, and make any characters that aren’t on the approved list illegal?

-Kenny

On Sep 30, 2017, at 4:13 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

I’m happy to participate in the reshaping of the proposal. It would be nice to gather a group of people again to help drive it forward.

That said, it’s unclear to me that superscript T is clearly an operator, any more than would be superscript H (Hermitian), superscript 2, superscript 3, etc. But at any rate, this would be discussion for the future workgroup.

I would strongly advocate that the things-that-are-identifiers group be strongly tied to the existing, complete Unicode standard for such, and that the critical parts of the previous document about normalization be retained.

On Sat, Sep 30, 2017 at 17:59 Chris Lattner via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

The core team recently met to discuss PR609 - Refining identifier and operator symbology:
https://github.com/xwu/swift-evolution/blob/7c2c4df63b1d92a1677461f41bc638f31926c9c3/proposals/NNNN-refining-identifier-and-operator-symbology.md

The proposal correctly observes that the partitioning of unicode codepoints into identifiers and operators is a mess in some cases. It really is an outright bug for :slightly_smiling_face: to be an identifier, but :frowning: to be an operator. That said, the proposal itself is complicated and is defined in terms of a bunch of unicode classes that may evolve in the “wrong way for Swift” in the future.

The core team would really like to get this sorted out for Swift 5, and sooner is better than later :-). Because it seems that this is a really hard problem and that perfection is becoming the enemy of good <https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good&gt;, the core team requests the creation of a new proposal with a different approach. The general observation is that there are three kinds of characters: things that are obviously identifiers, things that are obviously math operators, and things that are non-obvious. Things that are non-obvious can be made into invalid code points, and legislated later in follow-up proposals if/when someone cares to argue for them.

To make progress on this, we suggest a few separable steps:

First, please split out the changes to the ASCII characters (e.g. . and \ operator parsing rules) to its own (small) proposal, since it is unrelated to the unicode changes, and can make progress on that proposal independently.

Second, someone should take a look at the concrete set of unicode identifiers that are accepted by Swift 4 and write a new proposal that splits them into the three groups: those that are clearly identifiers (which become identifiers), those that are clearly operators (which become operators), and those that are unclear or don’t matter (these become invalid code points).

I suggest that the criteria be based on utility for Swift code, not on the underlying unicode classification. For example, the discussion thread for PR609 mentions that the T character in “ xᔀ ” is defined in unicode as a latin “letter”. Despite that, its use is Swift would clearly be as a postfix operator, so we should classify it as an operator.

Other suggestions:
- Math symbols are operators excepting those primarily used as identifiers like “alpha”. If there are any characters that are used for both, this proposal should make them invalid.
- While there may be useful ranges for some identifiers (e.g. to handle european accented characters), the Emoji range should probably have each codepoint independently judged, and currently unassigned codepoints should not get a meaning defined for them.
- Unicode “faces”, “people”, “animals” etc are all identifiers.
- In order to reduce the scope of the proposal, it is a safe default to exclude characters that are unlikely to be used by Swift code today, including Braille, weird currency symbols, or any set of characters that are so broken and useless in Swift 4 that it isn’t worth worrying about.
- The proposal is likely to turn a large number of code points into rejected characters. In the discussions, some people will be tempted to argue endlessly about individual rejections. To control that, we can require that people point out an example where the character is already in use, or where it has a clear application to a domain that is known today: the discussion needs to be grounded and practical, not theoretical.
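
The rules in this list can be sketched as plain set operations. This is only an illustration: `Partition` and `partition(candidates:usedAsIdentifier:usedAsOperator:)` are hypothetical names, and the sample data is made up for the example rather than a claim about how any character is actually used. A candidate with no demonstrated use is rejected, and a candidate used both ways is rejected as well.

```swift
struct Partition {
    var identifiers: Set<Unicode.Scalar>
    var operators: Set<Unicode.Scalar>
    var invalid: Set<Unicode.Scalar>
}

// Split candidate code points by demonstrated use. Anything used both ways,
// or not used at all, ends up invalid.
func partition(candidates: Set<Unicode.Scalar>,
               usedAsIdentifier: Set<Unicode.Scalar>,
               usedAsOperator: Set<Unicode.Scalar>) -> Partition {
    let contested = usedAsIdentifier.intersection(usedAsOperator)
    let identifiers = candidates.intersection(usedAsIdentifier).subtracting(contested)
    let operators = candidates.intersection(usedAsOperator).subtracting(contested)
    let invalid = candidates.subtracting(identifiers).subtracting(operators)
    return Partition(identifiers: identifiers, operators: operators, invalid: invalid)
}

// Illustrative data only (hypothetical usage evidence).
let sample = partition(
    candidates: ["α", "∑", "∞", "⠿"],
    usedAsIdentifier: ["α", "∞"],
    usedAsOperator: ["∑", "∞"]
)
// sample.identifiers contains "α", sample.operators contains "∑",
// and sample.invalid contains "∞" (used both ways) and "⠿" (no known use).
```

Because rejected code points are simply absent from both accepted sets, a later proposal can move one into `identifiers` or `operators` without breaking existing source.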

Third, if there is interest sometime in the future, we can have subsequent proposals that expand the range of accepted code points, motivated by the specific application domain that cares about them. These proposals will not be source breaking, so they can happen at any time.

Is anyone interested in helping to push this effort forward?

-Chris

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org <mailto:swift-evolution@swift.org>
https://lists.swift.org/mailman/listinfo/swift-evolution

Especially if people want to use the character in question as both an identifier and an operator: We can make the character an identifier and its lookalike an operator (or the other way around).

- Dave Sweeris

···

On Oct 1, 2017, at 22:01, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Oct 1, 2017, at 9:26 PM, Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

Hi All.

I’d like to help as well. I have fun with operators.

There is also the issue of code security with invisible unicode characters and characters that look exactly alike.

Unless there is a compelling reason to add them, I think we should ban invisible characters. What is the harm of characters that look alike?
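
For the invisible-character half of this, a check is easy to sketch in Swift. The function name `invisibleScalars(in:)` is hypothetical, and treating "format category or default-ignorable" as the definition of "invisible" is an assumption made for the example, not a rule from any proposal.

```swift
// Collect code points that normally render as nothing at all.
// Assumption: "invisible" means general category Cf (format) or the
// Default_Ignorable_Code_Point property.
func invisibleScalars(in source: String) -> [Unicode.Scalar] {
    var found: [Unicode.Scalar] = []
    for scalar in source.unicodeScalars {
        let properties = scalar.properties
        if properties.generalCategory == .format || properties.isDefaultIgnorableCodePoint {
            found.append(scalar)
        }
    }
    return found
}
```

For example, `invisibleScalars(in: "a\u{200B}b")` reports the ZERO WIDTH SPACE hidden between the two letters, while plain `"ab"` yields an empty array.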

Understood.

Thanks,
Jon

···

On Oct 1, 2017, at 4:20 PM, Chris Lattner <clattner@nondot.org> wrote:

Something like that is possible, but makes the language/compiler more complicated by introducing a whole new concept to the source distribution. It also doesn’t address the cases where you want to do a parse but don’t have the dependent source files, e.g. in a source browser tool like ViewVC.

-Chris

On Oct 1, 2017, at 4:17 PM, Jonathan Hull <jhull@gbis.com> wrote:

Gotcha. What if the definitions were in a special file similar to the info.plist that was read before other parsing, with one file per package?

Thanks,
Jon

On Oct 1, 2017, at 4:09 PM, Chris Lattner <clattner@nondot.org> wrote:

On Sep 30, 2017, at 7:10 PM, Jonathan Hull <jhull@gbis.com> wrote:

I have a technical question on this:

Instead of parsing these into identifiers & operators, would it be possible to parse these into 3 categories: Identifiers, Operators, and Ambiguous?

The ambiguous category would be disallowed for the moment, as you say. But since they are rarely used, maybe we can allow a declaration (similar to how we define operators) that effectively pulls it into one of the other categories (not in terms of tokenization, but in terms of how it can be used in Swift).

This is commonly requested, but the third category isn’t practical.

Swift statically partitions characters between identifiers and operators to make it possible to parse a Swift source file without parsing all of its dependencies. If you could have directives that change this, it would be difficult or perhaps impossible to parse a file that used these characters without parsing/reading the transitive closure of dependent modules.

This is important for compile speed and some tooling, and is an area that C gets wrong - its grammar requires all headers to be parsed in order to distinguish between type names and normal identifiers.
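
To make the static partition concrete, here is a toy tokenizer in Swift. It is a sketch under stated assumptions: `Token`, `operatorScalars`, `isIdentifierScalar` and `tokenize` are hypothetical names, the operator table is a tiny subset, and real Swift lexing handles far more (literals, comments, digits, and so on). What it shows is that every decision depends only on a fixed table, never on declarations from another file or module.

```swift
enum Token: Equatable {
    case identifier(String)
    case operatorSymbol(String)
}

// Fixed table, independent of any source file (tiny hypothetical subset).
let operatorScalars: Set<Unicode.Scalar> = ["+", "-", "*", "/", "=", "<", ">", "!"]

// Assumption: underscore or any alphabetic code point may appear in an identifier.
func isIdentifierScalar(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.isAlphabetic
}

func tokenize(_ source: String) -> [Token] {
    var tokens: [Token] = []
    var current = ""
    var currentIsOperator = false

    // Emit the token accumulated so far, if any.
    func flush() {
        guard !current.isEmpty else { return }
        tokens.append(currentIsOperator ? .operatorSymbol(current) : .identifier(current))
        current = ""
    }

    for scalar in source.unicodeScalars {
        let isOp = operatorScalars.contains(scalar)
        let isIdent = isIdentifierScalar(scalar)
        // Whitespace and anything unclassified just separates tokens here.
        guard isOp || isIdent else {
            flush()
            continue
        }
        // A switch between the two classes ends the current token.
        if !current.isEmpty && isOp != currentIsOperator {
            flush()
        }
        currentIsOperator = isOp
        current.unicodeScalars.append(scalar)
    }
    flush()
    return tokens
}
```

Here `tokenize("a+b")` yields an identifier, an operator and an identifier without consulting anything outside the string. If a directive in some other module could reclassify a character, `tokenize` would need that module's contents before it could split even a single line.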

-Chris