[Proposal] Refining Identifier and Operator Symbology

Please excuse me if I booger the syntax here, but here is what this would
look like:

// Bind + as a function name in the usual way:
func +(a, b) -> { ... }

// Say we are treating + as a quasi-keyword with the given precedence.
// Requires that + be bound (before or after) as a two-argument function
// if infix, or a one-argument function if prefix/postfix:
infix operator + : TheUsualPrecedence

// Parenthesizing a quasi-keyword lets you use it as an identifier:

+ // is an operator

(+) // is a use-occurrence of the function named +
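
For comparison, Swift as it exists today already draws a similar line: a bare + between operands is parsed as an operator, while (+) refers to the function. A minimal sketch using only standard-library behavior; the names `add`, `viaFunction`, and `runningTotal` are illustrative, not from the proposal:

```swift
// `+` used as an infix operator.
let sum = 1 + 2

// `(+)` is a use-occurrence of the function named +.
// The type annotation selects the Int overload.
let add: (Int, Int) -> Int = (+)
let viaFunction = add(1, 2)

// An operator can also be passed directly where a function is expected.
let runningTotal = [1, 2, 3].reduce(0, +)
```

What current Swift does not have is the other half of the sketch: the ability to give an ordinary identifier operator status.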

I understand why collapsing the two may seem appealing, but my personal
opinion is to keep separate things separate.

Even if the eventual consensus is to collapse them, let's first discuss
them as separate things so we can understand the two things that are being
said.

Jonathan

···

On Thu, Oct 20, 2016 at 8:08 AM, Erica Sadun <erica@ericasadun.com> wrote:

I really liked Jonathan's suggestion that removed the distinction between
operators and identifiers entirely. You could mark a one-argument function
as postfix or prefix, and a two-argument function as infix and use them as
a kind of pseudo keyword.

The noun/verb distinction was clarifying for me in regards to operators. Is
there a similar human-factors distinction we can identify for emojis that
might usefully inform this part of the discussion?

Jonathan

···

On Thu, Oct 20, 2016 at 8:18 AM, Erica Sadun via swift-evolution <swift-evolution@swift.org> wrote:

I fully agree. It’s hella presumptuous to decide that I’m not allowed to
express whimsy, frustration, humor, or any other emotions in my code. Or to
tell an 8 year old using Playgrounds on the iPad that he/she can’t name a
variable :pig: purely because they find it *funny*. We don’t have to squash
the joy out of *everything*.

Russ

The problem isn't whimsy so much as it's selecting the right set. If you
can point to a standard (or create one) that provides a good set, which
does not introduce the issues described in the proposal, that would be a
great starting step for adapting the proposed approach. The same goes for
the mathematical operators.

Freeze the set of allowed emoji to whatever the current version of the Unicode spec defines...

UAX31 won't include emojis in either space, because there is no clear consensus about where they belong (identifiers or operators). Individual languages can certainly add them to one space or the other, but should take care not to cross-contaminate. So if we add them to operators, we need to exclude any that are already part of normal identifiers and vice versa. That sanity restriction is technically necessary, but it shouldn't be an inconvenience in practical terms.

My understanding (which is admittedly fuzzy) is that the distinction between operators and identifiers is only "technically necessary" because allowing characters to be both causes the parsing algorithm to lose its virtual mind, and it takes a century for it to figure out what's going on. What I don't recall being discussed before is whether that's a blanket penalty or if the compile-time increases are proportional to the amount of overlap between the two character sets. If it's the latter, it might be worth discussing whether we should allow a *small* group of characters to be legal as both identifiers and operators, while maintaining sub-century compile times.

- Dave Sweeris.

···

Sent from my iPhone

On Oct 20, 2016, at 09:03, Jonathan S. Shapiro via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Oct 20, 2016 at 12:12 AM, Austin Zheng via swift-evolution <swift-evolution@swift.org> wrote:

Said differently: A monkey with a tool is still a monkey.
I.e. Swift cannot force somebody to become a good programmer no matter what rules it imposes.
As far as limiting personal freedoms goes: everybody (kids included) should be able to use whatever pleases them - within the possibilities of the language.
But the language should not impose restrictions it does not need.
If somebody out there wants to use emoticons, or whole pages of them… so what?
Any company or programmer worth its salt has their own rules for what constitutes a good identifier or operator.

OTOH: I would not go as far as optimizing the compiler to deal with anything non-ASCII. If people want to use emoticons, and this results in sub-par performance in compilation or execution speed, so be it.

Rien.

···

On 20 Oct 2016, at 16:03, Jonathan S. Shapiro via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Oct 20, 2016 at 12:12 AM, Austin Zheng via swift-evolution <swift-evolution@swift.org> wrote:
Is there a compromise we can come up with, maybe?

So speaking just for myself, I strongly oppose emojis because every example of emoji code I have seen has been truly obfuscated. Emojis therefore present very serious and active source-level security risks that will require significant engineering investment to manage and will never be fully managed successfully.

That said, I'm very glad that some people here have pointed out the "kid use case", because I had not considered that one. I think that's actually pretty compelling.

Let me ask a question: would single-character emoji identifiers be enough, or do we need multi-character emojis? Single-character emoji identifiers would go a long way toward limiting the capacity for obfuscation, but I'm guessing it won't be enough for a bunch of people here.


Jonathan
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

I don’t think it is a goal to “prevent” or “limit” obfuscation. Operator overloading and using weird symbols is inherently going to obfuscate code for some people, and if it were a goal, we’d prevent operator overloading entirely. We can’t legislate in the language that code is readable and maintainable. We need to trust people to use the tools the language provides in a sane way, and rely on team feedback and coding standards to set the norm in their environment.

Besides that, I think we’ve all seen code where it would be most honest to name a variable or function :poop:. That’s the sometimes sad reality of software, and Swift should aim to support honest expression of these realities. :-) :-)

-Chris

···

On Oct 20, 2016, at 7:03 AM, Jonathan S. Shapiro via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Oct 20, 2016 at 12:12 AM, Austin Zheng via swift-evolution <swift-evolution@swift.org> wrote:
Is there a compromise we can come up with, maybe?

So speaking just for myself, I strongly oppose emojis because every example of emoji code I have seen has been truly obfuscated. Emojis therefore present very serious and active source-level security risks that will require significant engineering investment to manage and will never be fully managed successfully.

That said, I'm very glad that some people here have pointed out the "kid use case", because I had not considered that one. I think that's actually pretty compelling.

Let me ask a question: would single-character emoji identifiers be enough, or do we need multi-character emojis? Single-character emoji identifiers would go a long way toward limiting the capacity for obfuscation, but I'm guessing it won't be enough for a bunch of people here.

Why can't we just remove distinction between operator and identifier
symbols? I'd be fine with the following:

infix operator map
infix func map(lhs: [Int], rhs: (Int) -> Int) { ... }
[1,2,3] map {$0*2}

No explicit imports required, plus we can create nice DSLs. Of course, it's
an additive change, but it's worth considering right now, because it can
remove some headache caused by current distinction.
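
For contrast, here is roughly what the same idea looks like in Swift as it stands, where an operator name has to be built from operator characters. The operator spelling `|>` and the choice of `AdditionPrecedence` are assumptions made for illustration; neither comes from the thread:

```swift
// Declare a symbolic infix operator. AdditionPrecedence is a
// standard-library precedence group, picked here arbitrarily.
infix operator |> : AdditionPrecedence

// Apply `rhs` to every element of `lhs`.
func |> (lhs: [Int], rhs: (Int) -> Int) -> [Int] {
    return lhs.map(rhs)
}

let doubled = [1, 2, 3] |> { $0 * 2 }
```

The suggestion above amounts to letting the word `map` stand where `|>` stands in this sketch.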

Ok, but to clarify the requirement, *every* file would have to declare the operators it is using at the top of the file. It isn’t enough for them to be declared in some file within the current module. Not having this property breaks the ability to do a quick parse of a file without doing name lookup.

Yeah, that's a tradeoff. I think that requiring non-standard operator use to be explicitly declared could be a good thing, though, since I don't think that we can realistically expect users to learn or intuitively agree on what glyphs are "operator" or "identifier", no matter what character set we design.

I could get behind having to explicitly import operators:
    import CoolLib
    import operators CoolLib
or
    import CoolLib {types functions vars operators}
But having to re-declare every "non-standard" operator for every file really limits their usefulness, IMHO.

As long as { } aren't in the operator character set, we should still be able to skip function bodies without parsing, so operator use declarations could still be order-independent at the top level of declarations. (Whether it's a good idea to bury your import declarations in the middle of your other decls is another story.)

Oh, is using {} as operators on the table? There's gotta be some interesting syntax someone could make with those...

- Dave Sweeris

···

Sent from my iPhone

On Oct 25, 2016, at 10:24, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

I'm not aware of any Unicode stipulations on rendering of unassigned code
points. In any case, Swift _n_ doesn't need to be designed around specific
platforms that exist in 2016, but it'd be perfectly sensible to say that
Swift 4 should restrict valid identifiers to those characters that are
displayed consistently on widely used platforms existing in 2016.

···

On Thu, Oct 27, 2016 at 19:35 Russ Bishop <xenadu@gmail.com> wrote:

On Oct 25, 2016, at 3:19 AM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

Unfortunately, Joe is correct on this point. As I stated earlier in the
thread, there are a series of characters that can be either text or emoji
in presentation, where the default presentation differs depending on
platform, technology, use case, or context. This is also not a bug, but
explicitly contemplated by Unicode technical recommendations. You can
convince yourself of this fact by looking up the Wikipedia page on the
Unicode "dingbats" block and comparing the rendering on Safari on iOS and
Safari on macOS. You will see that they are different.

Unfortunately, you are incorrect about the behavior of missing glyphs.
Unlike, say, Chinese displayed on a machine without the necessary fonts,
there is a security concern that Unicode 9 emoji not yet supported by Apple
are non-displaying on that platform. No placeholder appears. This includes
what is according to Emojipedia the #1 most popular emoji, the shrug.
(Check out Emojipedia on a Mac.) It appears that there is no required
placeholder glyph for unsupported emoji, so any of them can legitimately
disappear on a non-supported platform. This is an issue worth serious
consideration.

IMHO I don’t think Swift needs to be designed around rendering bugs with
specific fonts on specific platforms. We can file a radar to have this
corrected. I’m not aware of anything in Unicode that says it is acceptable
to just drop unknown characters. I think some ZWJ sequences or modifiers
can be ignored; anything that can be ignored for rendering should be
ignored for uniqueness of identifiers too.

Russ

On Tue, Oct 25, 2016 at 00:41 Russ Bishop via swift-evolution <swift-evolution@swift.org> wrote:

On Oct 24, 2016, at 9:43 AM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Oct 23, 2016, at 9:41 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Oct 18, 2016, at 11:34 PM, Jacob Bandes-Storch via swift-evolution <swift-evolution@swift.org> wrote:

Dear Swift-Evolution community,

A few of us have been preparing a proposal to refine the definitions of
identifiers & operators. This includes some changes to the permitted
Unicode characters.

The latest (perhaps final?) draft is available here:

https://github.com/jtbandes/swift-evolution/blob/unicode-id-op/proposals/NNNN-refining-identifier-and-operator-symbology.md

We'd welcome your initial thoughts, and will probably submit a PR soon to
the swift-evolution repo for a formal review. Full text follows below.

I haven’t had a chance to read the entire proposal, nor the tons of great
discussion down thread, but here are a few thoughts, just MHO:

- I’m loving that you’re taking a detail oriented approach to the
problem. I agree with you that our current approach is unprincipled, and
we need to get this right for Swift 4.
- I think that it is perfectly fine to err on the side of conservatism: if
it isn’t clear how to classify something (e.g. Braille patterns), we should
just reject them in both operators and identifiers (make them be
unassigned). If these unclear cases are important to someone, then we can
consider (as a separate additive proposal) adding them back later.
- As to conservatism, explicitly reserving “..” (for possible future
language directions) seems reasonable to me. Are there any other similar
things we should consider reserving?

- I applaud the creativity keeping :dog::cow: a valid identifier :-), but it is
really missing the point. *All* of the non-symbol-like emoji should be
valid in identifiers. With a quick unscientific look at Apple’s character
picker, all the emojis other than a few in “Symbols” seem like they should
be identifiers. It would be fine to conservatively leave all emoji
“symbols” as unassigned.

The problem with this is that "emoji" is not a well-defined category by
Unicode. Whether a character is rendered as emoji or a traditional symbol
in a given font on a given platform can depend on variation selectors, and
the exact variation selectors (or lack thereof) that choose emoji or
traditional representation are non-portable, even among different text
rendering APIs on the same platform (e.g. ATSUI vs TextKit vs CoreText vs
WebKit on Darwin).

-Joe

I’m not sure that is true. Unicode gives the list:
http://unicode.org/emoji/charts/full-emoji-list.html

If a platform can’t render the ZWJ sequences it can render them as
separate Emoji, but Swift can still treat that as the same identifier.

:+1:t3: == :+1: 🏼

If you don’t have a font capable of displaying the character at all that
isn’t any different from not having a Chinese font available. You should
get the missing character glyph. The list of emoji base characters is not
unrestricted - there is a specific and limited list of valid base
characters that accept modifiers.

If we wanted to go further and say that all Emoji modifiers are preserved
and rendered if possible but not considered part of the identifier that
would be OK with me. Same for variation selectors.
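
One way to picture "preserved and rendered but not considered part of the identifier" is a comparison key computed from the name. The sketch below is only an illustration of that idea, not anything the compiler does: the function name `identifierKey` is hypothetical, and it drops only the emoji skin-tone modifiers (U+1F3FB through U+1F3FF) and the variation selectors U+FE00 through U+FE0F. ZWJ sequences are deliberately left alone.

```swift
// Hypothetical helper: build the key used to compare identifiers,
// ignoring scalars that only affect presentation.
func identifierKey(_ name: String) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in name.unicodeScalars {
        let value = scalar.value
        let isSkinToneModifier = (0x1F3FB...0x1F3FF).contains(value)
        let isVariationSelector = (0xFE00...0xFE0F).contains(value)
        if !isSkinToneModifier && !isVariationSelector {
            kept.append(scalar)
        }
    }
    return String(kept)
}

// Thumbs-up with a skin-tone modifier and plain thumbs-up share a key.
let sameThumbs = identifierKey("\u{1F44D}\u{1F3FC}") == identifierKey("\u{1F44D}")

// A heart followed by the emoji variation selector reduces to the bare heart.
let sameHeart = identifierKey("\u{2764}\u{FE0F}") == identifierKey("\u{2764}")
```

Under this scheme the source text keeps whatever the author typed; only name lookup would use the reduced key.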

Russ

- I really think we should keep symbols as operators, including much of
the math symbols (e.g. ∪). In a later separate proposal, we can consider
whether it makes sense for emoji symbols (like :heavy_multiplication_x:) to be
usable as operators, I can see arguments both ways.

-Chris


The hard requirements are:

   1. Nothing in identifier start can be in operator start or operator
   continue. [*]
   2. Nothing in operator start can be in identifier start or identifier
   continue. [*]
   3. Nothing in syntactic punctuation (period, brackets, parens, and so
   forth) can be in either type of identifier without creating a lot of
   serious hair. You can see one example of hair in the "double dots" rule.

If these requirements are not preserved, the consequence is that white
space becomes required between identifiers and operators. So, for example,
without these rules:

a+b // gets broken

a + b // works

The presence of dots in operators is actually causing a whole bunch of
constraints to get introduced that I'm going to talk about in a moment.
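
To make the consequence concrete, here is a toy maximal-munch lexer, offered as a sketch rather than a description of the Swift lexer. All names (`CharClasses`, `tokenize`, the tiny character sets) are invented for the example. With disjoint sets, a+b splits into three tokens; once + is allowed to continue an identifier, a+b is swallowed as one token and only the spaced form still works.

```swift
// Character classes for a toy lexer.
struct CharClasses {
    var identifierStart: Set<Character>
    var identifierContinue: Set<Character>
    var operatorStart: Set<Character>
    var operatorContinue: Set<Character>

    // Requirements 1 and 2: a start character of one kind never
    // appears anywhere in the other kind.
    var satisfiesHardRequirements: Bool {
        let allOperator = operatorStart.union(operatorContinue)
        let allIdentifier = identifierStart.union(identifierContinue)
        return identifierStart.isDisjoint(with: allOperator)
            && operatorStart.isDisjoint(with: allIdentifier)
    }
}

// Each token is the longest run that the class of its first character
// allows to continue. Unclassified characters (such as spaces) only
// separate tokens.
func tokenize(_ source: String, with classes: CharClasses) -> [String] {
    var tokens: [String] = []
    var current = ""
    var continueSet: Set<Character> = []
    for ch in source {
        if !current.isEmpty && continueSet.contains(ch) {
            current.append(ch)
            continue
        }
        if !current.isEmpty {
            tokens.append(current)
            current = ""
        }
        if classes.identifierStart.contains(ch) {
            current = String(ch)
            continueSet = classes.identifierContinue
        } else if classes.operatorStart.contains(ch) {
            current = String(ch)
            continueSet = classes.operatorContinue
        }
    }
    if !current.isEmpty { tokens.append(current) }
    return tokens
}

let letters = Set("abc")
let symbols = Set("+-!")
let disjoint = CharClasses(identifierStart: letters, identifierContinue: letters,
                           operatorStart: symbols, operatorContinue: symbols)

// Hypothetical rule-breaking variant: + may also continue an identifier.
var overlapping = disjoint
overlapping.identifierContinue.insert("+")

let clean = tokenize("a+b", with: disjoint)          // ["a", "+", "b"]
let broken = tokenize("a+b", with: overlapping)      // ["a+b"]
let spaced = tokenize("a + b", with: overlapping)    // ["a", "+", "b"]
```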

Jonathan

···

On Thu, Oct 20, 2016 at 7:30 AM, David Sweeris <davesweeris@mac.com> wrote:

Sent from my iPhone

On Oct 20, 2016, at 09:03, Jonathan S. Shapiro via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Oct 20, 2016 at 12:12 AM, Austin Zheng via swift-evolution <swift-evolution@swift.org> wrote:

Freeze the set of allowed emoji to whatever the current version of the
Unicode spec defines...

UAX31 won't include emojis in either space, because there is no clear
consensus about where they belong (identifiers or operators). Individual
languages can certainly add them to one space or the other, but should take
care not to cross-contaminate. So if we add them to operators, we need to
exclude any that are already part of normal identifiers and vice versa.
That sanity restriction is technically necessary, but it shouldn't be an
inconvenience in practical terms.

My understanding (which is admittedly fuzzy) is that the distinction
between operators and identifiers is only "technically necessary" because
allowing characters to be both causes the parsing algorithm to lose its
virtual mind, and it takes a century for it to figure out what's going on.
What I don't recall being discussed before is whether that's a blanket
penalty or if the compile-time increases are proportional to the amount of
overlap between the two character sets.

Operators, Nouns, and Verbs

There's an issue that I think is worth bringing out into the open for
everyone to see so that we all know it is present. Solutions are possible,
but they go beyond the scope of the identifier proposal. Here's the brief
statement of the problem:

   1. Operators are verbs. They *operate* on their arguments.
   2. Math symbols are not always verbs. ∑ is a verb (and an operator). ∞
   is usually understood to be a noun.
   3. Operator *symbols* (that is: identifiers) are just names. They are
   neither inherently verbs nor inherently nouns.

We tend (at first glance) to prefer for nouns to be treated as identifiers
and operators as verbs. Operator identifiers confuse the issue because we
are calling them *operator* identifiers. A better name might be "math
symbol identifiers", because it doesn't have the same
association. Unfortunately there is no Unicode category for "Math symbols
that are verbs". This is true, in part, because there actually isn't
general agreement about how symbols are used in math. Once you get past the
basic stuff, a symbol means whatever you define it to mean in the current
publication, and math authors grab symbols entirely for the convenience of
the authors. Hopefully, but not always, in a way that reflects or suggests
a generally recognized intuition. Often no general agreement exists.
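
As a small illustration of the mismatch in Swift as it currently classifies characters: both ∑ (U+2211) and ∞ (U+221E) fall in the operator character ranges, so the verb works and the noun does not. The names `sumOfValues` and `infinity` are placeholders for the example.

```swift
// ∑ is an operator character today, so it can name a prefix operator.
prefix operator ∑

prefix func ∑ (values: [Double]) -> Double {
    return values.reduce(0, +)
}

let sumOfValues = ∑[1.0, 2.0, 3.5]

// ∞ is classified the same way, so the noun-like binding below is
// rejected by the current grammar:
// let ∞ = Double.infinity
let infinity = Double.infinity   // a spelled-out name has to be used instead
```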

If we actually wanted to solve the noun/verb issue, we need to acknowledge
that being a noun (verb) is orthogonal to being a conventional identifier
(math symbol identifier). Here is one way to separate the concepts in Swift:

   1. Make it true that *any* identifier can be either a conventional
   identifier or a math symbol identifier. We already do this in several
   places.
   2. Make it true that *any* identifier (including a conventional
   identifier) can be treated like a reserved word (that is: like an operator)
   for parse purposes.

From a parse perspective, the thing that makes an identifier into an
operator is that (a) it has been given some status as a reserved
identifier, and (b) it has a defined precedence rule. It would be possible
to re-imagine the meaning of Swift's operator declaration syntax to mean
"this identifier is now being given reserved-word status, and should be
treated for parse purposes as an operator while this declaration is
lexically in scope". No change is required to the current language. This
re-interpretation would allow us to say (for example):

infix operator LazyAnd : *somePrecedence*

which would introduce "LazyAnd" as an operator token *even though the
identifier does not use math symbols as its characters.* Simultaneously, it
would allow us to bind ∞ and use that identifier without forcing a noun (∞)
to be a verb simply because it has symbols in the name.

I personally believe that this would resolve some of the confusion about
operators, because it would separate the "how do we tokenize?" question
from the "what behaves like an operator?" question. It would also allow us
to preserve the existing mathematical use of many math symbols that are (by
convention) nouns. From a lexer/parser perspective, the concrete change is
that we go from "it's an operator because it's made up of math symbols" to
"it's an operator because it's an identifier and it's in the list of things
that are in scope as operators" (effectively a look-up table). That's the
entire change.
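
The look-up-table view can be sketched in a few lines. Everything here is hypothetical scaffolding (the types `Fixity`, `TokenRole`, `OperatorScope` and the numeric precedences are made up); the point is only that the role of a token comes from what is declared in scope, not from which characters spell it.

```swift
enum Fixity { case prefix, infix, postfix }

enum TokenRole: Equatable {
    case identifier
    case op(Fixity, precedence: Int)
}

struct OperatorEntry {
    let fixity: Fixity
    let precedence: Int
}

// The set of operator declarations lexically in scope.
struct OperatorScope {
    var table: [String: OperatorEntry] = [:]

    mutating func declare(_ name: String, _ fixity: Fixity, precedence: Int) {
        table[name] = OperatorEntry(fixity: fixity, precedence: precedence)
    }

    // A token is an operator only if it has been declared as one.
    func role(of token: String) -> TokenRole {
        if let entry = table[token] {
            return .op(entry.fixity, precedence: entry.precedence)
        }
        return .identifier
    }
}

var scope = OperatorScope()
scope.declare("+", .infix, precedence: 140)        // illustrative precedence values
scope.declare("LazyAnd", .infix, precedence: 120)

let plusRole = scope.role(of: "+")                 // operator, by declaration
let lazyAndRole = scope.role(of: "LazyAnd")        // operator, despite being letters
let infinityRole = scope.role(of: "∞")             // plain identifier, despite being a symbol
```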

Unfortunately every change comes at a cost, and the cost of this one is
that we would once again have to be thoughtful about white space. Why?
Because:

a.! // selection of a field named "!" in object a
a.!+ // selection of a field named "!+" in object a
a.! + // selection of field named "!" in object a followed by operator (?) +

You can build comparable examples without field names:

! b // two identifiers
!+ b // two identifiers
! + b // three identifiers
a+b // three identifiers

How confusing would this become? We have some limited experience, but only
limited, in BitC. BitC allowed operator definitions to use conventional
identifiers in the way I sketched above (actually, we did full-up mixfix,
but that's another topic), and it worked very well. BitC did *not* allow
operators to be used as general-purpose identifiers, but in hindsight I
believe that we probably should have done so.

Keep in mind that this is exactly the same "think about white space" issue
that we already know from conventional identifiers.

From a "but would this be too weird?" standpoint, all of the *current*
minglings of identifiers without white space would be preserved, so "a+b"
would continue to behave like you expect. But just like

a.b__and c
a.b __and c

mean two very different things in C++, it would now be true that

a.!&&c // ident dot ident ident
a.! &&c // ident dot ident ident ident

would mean different things.

I don't know if I'm being helpful or just confusing the issue further, but
I hope this helps people think about this stuff better.

Jonathan

Hello,

If we actually wanted to solve the noun/verb issue, we need to
acknowledge that being a noun (verb) is orthogonal to being a
conventional identifier (math symbol identifier).

I really like the idea of separating tokenization of names and
detection of 'normal' identifiers vs. operators.

Unfortunately every change comes at a cost, and the cost of this one
is that we would once again have to be thoughtful about white space.

Maybe we can minimize the change and confusion by using several groups of characters
which each can form a valid identifier token.

E.g.:
  * one group based on the tokenization of identifiers (as specified in this proposal)
  * one group based on ASCII operator symbols (again as specified in the proposal)
  * all other characters stand for themselves and directly name an identifier

That is: either use letters and numbers to build a name, or use ASCII symbols to build a name.
All other characters can also be used, but they don't combine with each other and would have
to be descriptive enough to directly name an identifier.
For all the mathematical symbols which were discussed here, this should not be a problem.
They already have a meaning of their own and do not have to be combined with each other.

This way, no explicit white space would be required between identifiers and operators,
assuming that the operators either use ASCII-only operator characters or some Unicode
operator symbol.

We could keep easy tokenization and still allow almost all use-cases of operators which
were presented here.
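
A rough sketch of how such a three-group tokenizer could behave, under assumptions of my own: the names `CharGroup`, `group(of:)`, and `tokenizeByGroups` are invented, the ASCII operator set is just a plausible subset, and rules such as "identifiers cannot start with a digit" are ignored.

```swift
enum CharGroup { case word, asciiOperator, standalone, space }

func group(of ch: Character) -> CharGroup {
    if ch == " " || ch == "\t" || ch == "\n" { return .space }
    if ch.isLetter || ch.isNumber || ch == "_" { return .word }
    if "+-*/=<>!&|^~?%".contains(ch) { return .asciiOperator }
    return .standalone   // every other character names an identifier by itself
}

func tokenizeByGroups(_ source: String) -> [String] {
    var tokens: [String] = []
    var current = ""
    var currentGroup = CharGroup.space
    for ch in source {
        let g = group(of: ch)
        // Only word characters and ASCII operator characters combine,
        // and only with their own kind.
        let extends = !current.isEmpty && g == currentGroup
            && (g == .word || g == .asciiOperator)
        if extends {
            current.append(ch)
        } else {
            if !current.isEmpty { tokens.append(current) }
            current = (g == .space) ? "" : String(ch)
            currentGroup = g
        }
    }
    if !current.isEmpty { tokens.append(current) }
    return tokens
}

let ascii = tokenizeByGroups("a!=b")      // ["a", "!=", "b"]
let math = tokenizeByGroups("x∪y")        // ["x", "∪", "y"]
let repeated = tokenizeByGroups("∞∞")     // ["∞", "∞"]
```

No white space is needed in any of these inputs, which is the property the grouping is meant to keep.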

···

On 2016-10-20 17:14, Jonathan S. Shapiro via swift-evolution wrote:

--
Martin

Ah, ok. Your explanation sounds very familiar… Clearly I’ve read it before and forgot. I must be going senile in my old age (37).

Would this whitespace rule affect pre/postfix operators, or just the infix ones? Personally, I’m fine with requiring whitespace around infix operators, but pre/postfix operators would completely lose readability if they needed it, too. Have we previously discussed this on the mailing list?

Given my apparent forgetfulness, I have no doubt that we discussed it at length two weeks ago, and someone will probably reply to this, quoting some forgotten 3-page email I sent on the topic :-)

- Dave Sweeris

···

On Oct 20, 2016, at 9:37 AM, Jonathan S. Shapiro <jonathan.s.shapiro@gmail.com> wrote:

On Thu, Oct 20, 2016 at 7:30 AM, David Sweeris <davesweeris@mac.com> wrote:
Sent from my iPhone

On Oct 20, 2016, at 09:03, Jonathan S. Shapiro via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Oct 20, 2016 at 12:12 AM, Austin Zheng via swift-evolution <swift-evolution@swift.org> wrote:

Freeze the set of allowed emoji to whatever the current version of the Unicode spec defines...

UAX31 won't include emojis in either space, because there is no clear consensus about where they belong (identifiers or operators). Individual languages can certainly add them to one space or the other, but should take care not to cross-contaminate. So if we add them to operators, we need to exclude any that are already part of normal identifiers and vice versa. That sanity restriction is technically necessary, but it shouldn't be an inconvenience in practical terms.

My understanding (which is admittedly fuzzy) is that the distinction between operators and identifiers is only "technically necessary" because allowing characters to be both causes the parsing algorithm to lose its virtual mind, and it takes a century for it to figure out what's going on. What I don't recall being discussed before is whether that's a blanket penalty or if the compile-time increases are proportional to the amount of overlap between the two character sets.

The hard requirements are:

   1. Nothing in identifier start can be in operator start or operator continue. [*]
   2. Nothing in operator start can be in identifier start or identifier continue. [*]
   3. Nothing in syntactic punctuation (period, brackets, parens, and so forth) can be in either type of identifier without creating a lot of serious hair. You can see one example of hair in the "double dots" rule.

If these requirements are not preserved, the consequence is that white space becomes required between identifiers and operators. So, for example, without these rules:

a+b // gets broken
a + b // works

The presence of dots in operators is actually causing a whole bunch of constraints to get introduced that I'm going to talk about in a moment.

This has been discussed in prior threads: it is core to the behavior of the parser.

-Chris

···

On Oct 25, 2016, at 3:08 PM, Anton Zhilin via swift-evolution <swift-evolution@swift.org> wrote:

Why can't we just remove distinction between operator and identifier symbols? I'd be fine with the following:

Kotlin allows arbitrary infix functions. Nice feature. +1

···

On 26.10.2016, at 00:08, Anton Zhilin via swift-evolution <swift-evolution@swift.org> wrote:

Why can't we just remove distinction between operator and identifier symbols? I'd be fine with the following:

infix operator map
infix func map(lhs: [Int], rhs: (Int) -> Int) { ... }
[1,2,3] map {$0*2}

And there’s our confirmation of hardware emoji keyboard.