Prohibit invisible characters in identifier names

Recently there has been a screenshot going around Twitter about C++ allowing zero-width spaces in variable names. Swift also suffers from this problem, which can be abused to write ambiguous or misleading code and potentially to obfuscate nefarious code.

I would like to propose a change to prohibit the use of invisible characters in identifier names.

I'm including an example of problematic code at the bottom of this email.

Sincerely,
João Pinheiro

/* The output for this code is:
A
B
C
1
2
3
*/

func test() { print("A") }
func t​est() { print("B") }
func te​st() { print("C") }

let abc = 1
let a​bc = 2
let ab​c = 3

test()
t​est()
te​st()

print(abc)
print(a​bc)
print(ab​c)
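
A minimal sketch of why the three `test` names above are distinct identifiers even though they render identically. The helper `scalarDump` is a hypothetical name introduced only for this illustration; it spells out each Unicode scalar of a string so that a zero-width space (U+200B) becomes visible.

```swift
// Hypothetical helper (not part of Swift or of the proposal): list every
// Unicode scalar of a name as U+XXXX so invisible characters show up.
func scalarDump(_ name: String) -> String {
    return name.unicodeScalars.map { scalar -> String in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }.joined(separator: " ")
}

let plain = "test"
let sneaky = "t\u{200B}est" // ZERO WIDTH SPACE after the first letter

print(scalarDump(plain))   // U+0074 U+0065 U+0073 U+0074
print(scalarDump(sneaky))  // U+0074 U+200B U+0065 U+0073 U+0074
print(plain == sneaky)     // false
```

The strings are used here only as stand-ins for identifier spellings; the compiler compares identifiers in a comparable scalar-by-scalar fashion, which is why the example compiles as three separate declarations.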

+1

l8r
Sean

···

On Jun 20, 2016, at 12:51 PM, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:

[...]
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

IIRC, some languages require zero-width joiners (though not zero-width spaces, which are distinct) to properly encode some of their characters. I'd be very leery of having Swift land on a model where identifiers can be used with some languages and not others; that smacks of ethnocentrism.

Jordan

···

On Jun 20, 2016, at 10:51, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:

[...]

> IIRC, some languages require zero-width joiners (though not zero-width spaces, which are distinct) to properly encode some of their characters. I'd be very leery of having Swift land on a model where identifiers can be used with some languages and not others; that smacks of ethnocentrism.

None of those languages require zero-width characters between two Latin letters, or between a Latin letter and an Arabic numeral, or at the end of a word. Since standard / system APIs will (barring some radical shift) use those code points exclusively, it's justifiable to give them some special attention.

John.
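
A rough sketch of what such position-sensitive attention could look like. Everything here is an assumption made for illustration: the function names are hypothetical, "Latin letter" is narrowed to ASCII letters, and a zero-width character at the start of a name is treated as suspicious as well as one at the end.

```swift
// Assumed set of zero-width scalars for this sketch: ZWSP, ZWNJ, ZWJ.
let zeroWidth: Set<Unicode.Scalar> = ["\u{200B}", "\u{200C}", "\u{200D}"]

// Simplification: only ASCII letters and digits count as "Latin" here.
func isASCIILetterOrDigit(_ s: Unicode.Scalar) -> Bool {
    switch s.value {
    case 0x30...0x39, 0x41...0x5A, 0x61...0x7A: return true
    default: return false
    }
}

// Hypothetical rule: a zero-width scalar is suspicious when it sits at
// either end of the name or between two ASCII letters/digits.
func hasSuspiciousZeroWidth(_ identifier: String) -> Bool {
    let scalars = Array(identifier.unicodeScalars)
    for (i, s) in scalars.enumerated() where zeroWidth.contains(s) {
        if i == 0 || i == scalars.count - 1 { return true }
        if isASCIILetterOrDigit(scalars[i - 1]) && isASCIILetterOrDigit(scalars[i + 1]) {
            return true
        }
    }
    return false
}

print(hasSuspiciousZeroWidth("t\u{200B}est"))  // true
print(hasSuspiciousZeroWidth("test"))          // false
// A ZWNJ between two Arabic-script letters is left alone by this rule.
print(hasSuspiciousZeroWidth("\u{0645}\u{06CC}\u{200C}\u{062E}"))  // false
```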

···

On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <swift-evolution@swift.org> wrote:

[...]

Very interesting.

Btw, IBM Swift Sandbox shows these spaces:
https://swiftlang.ng.bluemix.net/
But my mail client does not, i.e. I saw exactly the same "test" and "abc".

Also, I have read about issues with left-to-right and right-to-left markers that change how the source text is displayed: you see one text, but when it compiles it does not work as expected. That is, the viewer/editor processes these special codes and shows you one text, but the compiler treats the text in another way.

I believe it is a potential security problem that all Unicode characters are allowed in variable/function names in Swift. IMO we definitely should limit the allowed character set for identifiers in source files.
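
As a hedged illustration of the left-to-right/right-to-left concern, the sketch below scans a piece of source text for Unicode bidirectional control characters. The function name `bidiControlOffsets` is hypothetical, and where such a check would live (editor, linter, or compiler) is left open.

```swift
// Bidirectional formatting controls defined by Unicode.
let bidiControls: Set<Unicode.Scalar> = [
    "\u{061C}",                                                // ARABIC LETTER MARK
    "\u{200E}", "\u{200F}",                                    // LRM, RLM
    "\u{202A}", "\u{202B}", "\u{202C}", "\u{202D}", "\u{202E}", // LRE, RLE, PDF, LRO, RLO
    "\u{2066}", "\u{2067}", "\u{2068}", "\u{2069}",            // LRI, RLI, FSI, PDI
]

// Hypothetical check: return the scalar offsets of every bidi control
// found in the given source text.
func bidiControlOffsets(in source: String) -> [Int] {
    return source.unicodeScalars.enumerated()
        .filter { bidiControls.contains($0.element) }
        .map { $0.offset }
}

print(bidiControlOffsets(in: "let ab\u{202E}cd = 1"))  // [6]
print(bidiControlOffsets(in: "let abcd = 1"))          // []
```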

···

On 20.06.2016 20:51, João Pinheiro via swift-evolution wrote:

[...]

> Nice feature in the IBM Swift Sandbox. Xcode doesn't display zero-width
> spaces either so the identifier names look exactly the same.
>
> The issue with left-to-right and right-to-left markers is interesting and
> has previously been exploited in email phishing attacks.
>
> It would be possible to highlight invisible characters in Xcode as a
> stopgap measure, but that doesn't solve the problem for developers using
> other editors or on other platforms. I think it would be a better idea to
> sanitise the set of allowed (or prohibited) characters for identifiers at
> the language level.

This is a potential security problem, but there's no need to try to invent
an ad-hoc solution here, particularly one as drastic as prohibiting
characters. The same security considerations apply elsewhere, and there's a
lot of existing work on Unicode security. See UTS #39: Unicode Security Mechanisms.

Unicode maintains a list of "confusable" characters. See here:
http://www.unicode.org/Public/security/latest/confusables.txt

It should be sufficient to regard confusables as the same glyph for the
purpose of identifier names; zero-width and invisible marks would then be
regarded as non-existent, so that `test` and `t[invisible glyph]est` would
refer to the same variable.
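
A small sketch of the "regarded as non-existent" idea, under stated assumptions: the set of invisible scalars below is a short hand-picked list, not the data from UTS #39, and `lookupKey(for:)` is a hypothetical name for whatever key a compiler would use to resolve identifiers.

```swift
// Assumed, incomplete list of invisible scalars, for illustration only.
let invisibleScalars: Set<Unicode.Scalar> = [
    "\u{200B}", // ZERO WIDTH SPACE
    "\u{200C}", // ZERO WIDTH NON-JOINER
    "\u{200D}", // ZERO WIDTH JOINER
    "\u{2060}", // WORD JOINER
    "\u{FEFF}", // ZERO WIDTH NO-BREAK SPACE
]

// Hypothetical lookup key: the identifier with invisible scalars dropped.
func lookupKey(for identifier: String) -> String {
    var key = ""
    for scalar in identifier.unicodeScalars where !invisibleScalars.contains(scalar) {
        key.unicodeScalars.append(scalar)
    }
    return key
}

print(lookupKey(for: "t\u{200B}est") == lookupKey(for: "test"))  // true
```

Under such a scheme the three `test` functions from the original example would share one key, so the second and third declarations would presumably be diagnosed as redeclarations rather than accepted as new names.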

···

On Mon, Jun 20, 2016 at 2:17 PM, João Pinheiro <swift-evolution@swift.org> wrote:

[...]

>> I agree that treating zero-width spaces as non-existent would be a
>> possible solution, but I think it would make more sense to consider it as
>> white space and thus not admissible in identifier names.
>
> If you treat it like whitespace, then you get interesting behaviors that I
> don't think you would want. For example, something that looks like `if
> letter...` could be parsed as conditional binding `if let ter...` if I put
> in a zero-width space in the right place.
>
>> I'm not sure of what the best way to handle left-to-right and
>> right-to-left markers would be. Does it make sense to allow mixed text
>> orientation in identifiers?
>
> How do other languages that support Unicode handle these markers in
> identifiers? I'd be interested to know.
>
>> Removing ambiguity between unicode confusables is a much more complicated
>> issue which implies defining a canonical unicode representation for
>> identifiers and a way to resolve them. It would also make it impractical to
>> use certain valid mathematical symbols as identifiers.
>
> Most interesting mathematical symbols are reserved for operators anyway. As
> a result, `x` and the multiplication symbol are not readily confusable in
> most contexts in Swift, and confusable resolution could be built in such a
> way that identifier characters are not regarded as confusable with operator
> characters.

I'm a little concerned about cases like these:

1D6CE ; 0076 ; MA # ( 𝛎 → v ) MATHEMATICAL BOLD SMALL NU → LATIN SMALL LETTER V # →ν→
1D6D2 ; 0070 ; MA # ( 𝛒 → p ) MATHEMATICAL BOLD SMALL RHO → LATIN SMALL LETTER P # →ρ→

etc. Now, one could reasonably argue that using “𝛎” and “v” to mean
different things in the same scope would be bad, but I'm not sure
we really want to accept them as aliases of one another, either.
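
One way to picture the difference between aliasing and merely detecting a clash is sketched below. The two-entry map is copied from the confusables.txt lines quoted above; the real table is far larger, and the real skeleton algorithm in UTS #39 also involves normalization, which this sketch omits. The names `skeleton(of:)` and `confusablePairs(in:)` are hypothetical.

```swift
// Two entries taken from the confusables.txt lines quoted above.
let confusableMap: [Unicode.Scalar: Unicode.Scalar] = [
    "\u{1D6CE}": "v", // MATHEMATICAL BOLD SMALL NU  -> LATIN SMALL LETTER V
    "\u{1D6D2}": "p", // MATHEMATICAL BOLD SMALL RHO -> LATIN SMALL LETTER P
]

// Simplified skeleton: map each scalar through the table, nothing else.
func skeleton(of identifier: String) -> String {
    var result = ""
    for scalar in identifier.unicodeScalars {
        result.unicodeScalars.append(confusableMap[scalar] ?? scalar)
    }
    return result
}

// Hypothetical diagnostic: report names in one scope that are spelled
// differently but share a skeleton, instead of treating them as aliases.
func confusablePairs(in names: [String]) -> [(String, String)] {
    var seen: [String: String] = [:]
    var pairs: [(String, String)] = []
    for name in names {
        let key = skeleton(of: name)
        if let earlier = seen[key], earlier != name {
            pairs.append((earlier, name))
        } else {
            seen[key] = name
        }
    }
    return pairs
}

let clashes = confusablePairs(in: ["v", "\u{1D6CE}", "p"])
print(clashes.count)  // 1
```

A diagnostic of this kind would flag “𝛎” and “v” appearing together in one scope without making either an alias of the other; whether that is the right trade-off is exactly the open question.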

···

on Mon Jun 20 2016, Xiaodi Wu <swift-evolution@swift.org> wrote:

[...]

--
Dave

> I agree that treating zero-width spaces as non-existent would be a
> possible solution, but I think it would make more sense to consider it as
> white space and thus not admissible in identifier names.

If you treat it like whitespace, then you get interesting behaviors that I
don't think you would want. For example, something that looks like `if
letter...` could be parsed as conditional binding `if let ter...` if I put
in a zero-width space in the right place.
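
A toy illustration of that hazard. The tokenizer below is a deliberately naive stand-in (real Swift lexing is far more involved), and `tokens(of:zeroWidthSpaceIsWhitespace:)` is a hypothetical name; it only shows how the token stream changes once U+200B counts as a separator.

```swift
// Naive tokenizer for illustration: splits on spaces, and optionally on
// ZERO WIDTH SPACE (U+200B) as well.
func tokens(of source: String, zeroWidthSpaceIsWhitespace: Bool) -> [String] {
    var result: [String] = []
    var current = ""
    for scalar in source.unicodeScalars {
        let isSeparator = scalar == " "
            || (zeroWidthSpaceIsWhitespace && scalar == "\u{200B}")
        if isSeparator {
            if !current.isEmpty { result.append(current) }
            current = ""
        } else {
            current.unicodeScalars.append(scalar)
        }
    }
    if !current.isEmpty { result.append(current) }
    return result
}

// Renders as "if letter = x" in an editor that hides U+200B.
let line = "if let\u{200B}ter = x"

// Treated as whitespace: the line becomes a conditional binding of `ter`.
print(tokens(of: line, zeroWidthSpaceIsWhitespace: true))
// ["if", "let", "ter", "=", "x"]

// Not treated as whitespace: a single four-token line, with the invisible
// character still embedded in the second token.
print(tokens(of: line, zeroWidthSpaceIsWhitespace: false).count)  // 4
```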

> I'm not sure of what the best way to handle left-to-right and
> right-to-left markers would be. Does it make sense to allow mixed text
> orientation in identifiers?

How do other languages that support Unicode handle these markers in
identifiers? I'd be interested to know.

> Removing ambiguity between unicode confusables is a much more complicated
> issue which implies defining a canonical unicode representation for
> identifiers and a way to resolve them. It would also make it impractical to
> use certain valid mathematical symbols as identifiers.

Most interesting mathematical symbols are reserved for operators anyway. As
a result, `x` and the multiplication symbol are not readily confusable in
most contexts in Swift, and confusable resolution could be built in such a
way that identifier characters are not regarded as confusable with operator
characters.

···

On Mon, Jun 20, 2016 at 2:42 PM, João Pinheiro <joao@joaopinheiro.org> wrote:

[...]

>>> I agree that treating zero-width spaces as non-existent would be a
>>> possible solution, but I think it would make more sense to consider it as
>>> white space and thus not admissible in identifier names.
>>
>> If you treat it like whitespace, then you get interesting behaviors that I
>> don't think you would want. For example, something that looks like `if
>> letter...` could be parsed as conditional binding `if let ter...` if I put
>> in a zero-width space in the right place.
>>
>>> I'm not sure of what the best way to handle left-to-right and
>>> right-to-left markers would be. Does it make sense to allow mixed text
>>> orientation in identifiers?
>>
>> How do other languages that support Unicode handle these markers in
>> identifiers? I'd be interested to know.
>>
>>> Removing ambiguity between unicode confusables is a much more complicated
>>> issue which implies defining a canonical unicode representation for
>>> identifiers and a way to resolve them. It would also make it impractical to
>>> use certain valid mathematical symbols as identifiers.
>>
>> Most interesting mathematical symbols are reserved for operators anyway. As
>> a result, `x` and the multiplication symbol are not readily confusable in
>> most contexts in Swift, and confusable resolution could be built in such a
>> way that identifier characters are not regarded as confusable with operator
>> characters.
>
> I'm a little concerned about cases like these:
>
> 1D6CE ; 0076 ; MA # ( 𝛎 → v ) MATHEMATICAL BOLD SMALL NU → LATIN SMALL LETTER V # →ν→
> 1D6D2 ; 0070 ; MA # ( 𝛒 → p ) MATHEMATICAL BOLD SMALL RHO → LATIN SMALL LETTER P # →ρ→
>
> etc. Now, one could reasonably argue that using “𝛎” and “v” to mean
> different things in the same scope would be bad, but I'm not sure
> we really want to accept them as aliases of one another, either.

Yes, that does give me pause. FWIW, though, Greek letters have been known
to turn into their lookalike Latin counterparts. For instance, do a Google
search for Planck's equation written as "E = hv" (that "v" is supposed to
be lowercase nu). Or consider the abbreviation "XP" for Christ,
etymologically uppercase chi and rho (the first two letters of Christ in
Greek). (Or relatedly, the erroneous claim that "Xmas" is an attempt to
take Christ out of Christmas.)

I guess what I'm saying is, if a co-worker named two distinct variables v
and nu, I would have a word or two with them... Consider an alternative
scenario. I have a Greek keyboard in my keyboard switcher, handy for
scientific uses. If I accidentally use Greek uppercase alpha in my code
instead of A, this would be essentially impossible to find by eye. Why
should the language not elide the invisible distinction?
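
To make the Greek-keyboard scenario concrete, here is a minimal sketch. The lint `mixesLatinAndGreek` is hypothetical and crude: it only looks at ASCII letters and the Greek and Coptic block (U+0370...U+03FF), and it cannot flag a name that consists of a lone Greek capital alpha.

```swift
let latinA = "A"             // U+0041 LATIN CAPITAL LETTER A
let greekAlpha = "\u{0391}"  // U+0391 GREEK CAPITAL LETTER ALPHA

// The two look alike in most fonts but are different strings.
print(latinA == greekAlpha)  // false

// Hypothetical lint: does a name mix ASCII letters with scalars from the
// Greek and Coptic block?
func mixesLatinAndGreek(_ identifier: String) -> Bool {
    var sawLatin = false
    var sawGreek = false
    for scalar in identifier.unicodeScalars {
        switch scalar.value {
        case 0x41...0x5A, 0x61...0x7A: sawLatin = true
        case 0x0370...0x03FF: sawGreek = true
        default: break
        }
    }
    return sawLatin && sawGreek
}

print(mixesLatinAndGreek("M\u{0391}X"))  // true: Greek alpha between M and X
print(mixesLatinAndGreek("MAX"))         // false
```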

···

On Mon, Jun 20, 2016 at 4:44 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

[...]

Nice feature in the IBM Swift Sandbox. Xcode doesn't display zero-width spaces either so the identifier names look exactly the same.

The issue with left-to-right and right-to-left markers is interesting and has previously been exploited in email phishing attacks.

It would be possible to highlight invisible characters in Xcode as a stopgap measure, but that doesn't solve the problem for developers using other editors or on other platforms. I think it would be a better idea to sanitise the set of allowed (or prohibited) characters for identifiers at the language level.

Sincerely,
João Pinheiro

···

On 20 Jun 2016, at 19:26, Vladimir.S <svabox@gmail.com> wrote:

[...]

I agree that treating zero-width spaces as non-existent would be a possible solution, but I think it would make more sense to consider it as white space and thus not admissible in identifier names. I'm not sure of what the best way to handle left-to-right and right-to-left markers would be. Does it make sense to allow mixed text orientation in identifiers?

Removing ambiguity between unicode confusables is a much more complicated issue which implies defining a canonical unicode representation for identifiers and a way to resolve them. It would also make it impractical to use certain valid mathematical symbols as identifiers.

João Pinheiro

···

On 20 Jun 2016, at 20:23, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Mon, Jun 20, 2016 at 2:17 PM, João Pinheiro <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:
Nice feature in the IBM Swift Sandbox. Xcode doesn't display zero-width spaces either so the identifier names look exactly the same.

The issue with left-to-right and right-to-left markers is interesting and has previously been exploited in email phishing attacks.

It would be possible to highlight invisible characters in Xcode as a stopgap measure, but that doesn't solve the problem for developers using other editors or in other platforms. I think it would be a better idea to sanitise the set of allowed (or prohibited) characters for identifiers at the language level.

This is a potential security problem, but no need try to invent an ad-hoc solution here, particularly one as drastic as prohibiting characters. The same security considerations are applicable elsewhere and there's a lot of work about Unicode security. See here: UTS #39: Unicode Security Mechanisms

Unicode maintains a list of "confusable" characters. See here: http://www.unicode.org/Public/security/latest/confusables.txt

It should be sufficient to regard confusables as the same glyph for the purpose of identifier names; zero-width and invisible marks would then be regarded as non-existent, so that `test` and `t[invisible glyph]est` would refer to the same variable.

Sincerely,
João Pinheiro

> On 20 Jun 2016, at 19:26, Vladimir.S <svabox@gmail.com <mailto:svabox@gmail.com>> wrote:
>
> Very interesting.
>
> Btw, IBM Swift Sandbox shows these spaces:
> https://swiftlang.ng.bluemix.net/
> But my mail client does not - i.e. I saw exactly the same "test"&"abc"
>
> Also, I read about some issues with left-to-right and right-to-left markers that also somehow change the actual text of source - i.e. you see one text, but when it compiles - it works not as expected. I.e. viewer/editor processes these special codes and show you one text, but compiler treats text in another way.
>
> I believe it is a potential security problem that all unicode chars are allowed for variables/func names in Swift. IMO We definitely should limit allowed charset for identifiers in sources.

IIRC, some languages *require* zero-width joiners (though not zero-width
spaces, which are distinct) to properly encode some of their characters.
I'd be very leery of having Swift land on a model where identifiers can be
used with some languages and not others; that smacks of ethnocentrism.

None of those languages require zero-width characters between two Latin
letters, or between a Latin letter and an Arabic numeral, or at the end of
a word. Since standard / system APIs will (barring some radical shift) use
those code points exclusively, it's justifiable to give them some special
attention.

Although the practical implementation may need to be more limited in scope,
the general principle doesn't need to privilege Latin letters and Arabic
numerals. If, in any context, the presence or absence of a zero-width glyph
cannot possibly be distinguished by a human reading the text, then the
compiler should also be indifferent to its presence or absence (or,
alternatively, its presence should be a compile-time error).
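
A rough sketch of the narrower rule discussed above (rejecting a zero-width character at the end of a word or between Latin letters and Arabic numerals) might look like the following. The function names and the three-scalar set are assumptions for illustration. The check is also slightly broader than the text, since it flags a zero-width scalar between two digits as well.

```swift
// Illustrative assumption: only these three scalars are considered.
let zeroWidth: Set<Unicode.Scalar> = ["\u{200B}", "\u{200C}", "\u{200D}"]

/// True for ASCII letters and digits, the repertoire that standard
/// and system APIs are expected to use.
func isASCIIAlphanumeric(_ s: Unicode.Scalar) -> Bool {
    switch s {
    case "a"..."z", "A"..."Z", "0"..."9": return true
    default: return false
    }
}

/// Hypothetical check: a zero-width scalar is suspicious when it ends
/// the identifier or sits between two ASCII letters or digits.
func hasSuspiciousZeroWidth(_ name: String) -> Bool {
    let scalars = Array(name.unicodeScalars)
    for (i, s) in scalars.enumerated() where zeroWidth.contains(s) {
        if i == scalars.count - 1 { return true }   // at the end of a word
        if i > 0,
           isASCIIAlphanumeric(scalars[i - 1]),
           isASCIIAlphanumeric(scalars[i + 1]) {
            return true                             // between ASCII letters or digits
        }
    }
    return false
}

print(hasSuspiciousZeroWidth("te\u{200B}st"))   // true
print(hasSuspiciousZeroWidth("test\u{200D}"))   // true
print(hasSuspiciousZeroWidth("test"))           // false

// A Persian word with ZWNJ between two non-Latin letters is left alone.
let persian = "\u{645}\u{6CC}\u{200C}\u{62E}\u{648}\u{627}\u{647}\u{645}"
print(hasSuspiciousZeroWidth(persian))          // false
```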

···

On Mon, Jun 20, 2016 at 8:58 PM, John McCall via swift-evolution <swift-evolution@swift.org> wrote:

On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <swift-evolution@swift.org> wrote:

John.

Jordan

On Jun 20, 2016, at 10:51, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:


I agree that treating zero-width spaces as non-existent would be a possible solution, but I think it would make more sense to consider it as white space and thus not admissible in identifier names.

If you treat it like whitespace, then you get interesting behaviors that I don't think you would want. For example, something that looks like `if letter...` could be parsed as conditional binding `if let ter...` if I put in a zero-width space in the right place.
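
The `if letter` versus `if let ter` concern can be illustrated with a toy tokenizer. This is only a sketch: the `tokens` function is hypothetical and splits on whitespace alone, which is far simpler than Swift's real lexer.

```swift
/// Hypothetical tokenizer that splits on ordinary whitespace and,
/// optionally, on U+200B ZERO WIDTH SPACE as well.
func tokens(_ source: String, zeroWidthIsWhitespace: Bool) -> [String] {
    var result: [String] = []
    var current = ""
    for scalar in source.unicodeScalars {
        let isSeparator = scalar == " " || scalar == "\t" || scalar == "\n"
            || (zeroWidthIsWhitespace && scalar == "\u{200B}")
        if isSeparator {
            // Close the token in progress, if any.
            if !current.isEmpty { result.append(current) }
            current = ""
        } else {
            current.unicodeScalars.append(scalar)
        }
    }
    if !current.isEmpty { result.append(current) }
    return result
}

let source = "if let\u{200B}ter = next()"   // renders like `if letter = next()`

let asIdentifier = tokens(source, zeroWidthIsWhitespace: false)
let asWhitespace = tokens(source, zeroWidthIsWhitespace: true)

print(asIdentifier.count)   // 4
print(asWhitespace.count)   // 5
print(asWhitespace[1])      // let
```

With the zero-width space treated as whitespace, the second token becomes the keyword `let`, so text that a reader sees as a plain condition would be lexed as the start of a conditional binding.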

I hadn't thought of that possibility. Ignoring them has the problem of creating multiple valid representations for the same identifier though. Not allowing invisible characters in identifiers sounds like the best solution to me.

I'm not sure of what the best way to handle left-to-right and right-to-left markers would be. Does it make sense to allow mixed text orientation in identifiers?

How do other languages that support Unicode handle these markers in identifiers? I'd be interested to know.

Me too.

Removing ambiguity between unicode confusables is a much more complicated issue which implies defining a canonical unicode representation for identifiers and a way to resolve them. It would also make it impractical to use certain valid mathematical symbols as identifiers.

Most interesting mathematical symbols are reserved for operators anyway. As a result, `x` and the multiplication symbol are not readily confusable in most contexts in Swift, and confusable resolution could be built in such a way that identifier characters are not regarded as confusable with operator characters.

That would require maintaining a large list of exception characters though. Just like the problem with ignoring invisible characters mentioned above, eliminating confusables has the problem of creating multiple representations for the same identifier, which could become quite confusing and result in additional problems of its own. I think it would probably be best to avoid a situation where it's necessary to resolve different representations of an identifier.

Sincerely,
João Pinheiro

···

On 20 Jun 2016, at 21:07, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Mon, Jun 20, 2016 at 2:42 PM, João Pinheiro <joao@joaopinheiro.org> wrote:

Indeed, it would be unwise to pick "𝛎" and "v" for different things within the same scope. Unicode confusables are annoying and unfortunate, but not totally unexpected. Automatic aliases for similar characters would arguably be worse since it would probably qualify as unexpected behaviour for most people.

João Pinheiro

···

On 20 Jun 2016, at 22:44, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:
I'm a little concerned about cases like these:

1D6CE ; 0076 ; MA # ( 𝛎 → v ) MATHEMATICAL BOLD SMALL NU → LATIN SMALL LETTER V # →ν→
1D6D2 ; 0070 ; MA # ( 𝛒 → p ) MATHEMATICAL BOLD SMALL RHO → LATIN SMALL LETTER P # →ρ→

etc. Now, one could reasonably argue that using “𝛎” and “v” to mean
different things in the same scope would be bad, but I'm not sure
we really want to accept them as aliases of one another, either.
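
For readers unfamiliar with the data format, the two lines quoted above come from confusables.txt, where each data line has semicolon-separated fields (source; target; type) followed by a `#` comment. A minimal parsing sketch follows; the type and function names are assumptions for illustration, and only the first two fields are interpreted.

```swift
/// One mapping from confusables.txt: a source code point and the
/// sequence it is confusable with. The type name is hypothetical.
struct ConfusableEntry {
    let source: Unicode.Scalar
    let target: [Unicode.Scalar]
}

/// Parses a hex field such as "1D6CE" or "0072 006E" into scalars.
func parseScalars(_ field: Substring) -> [Unicode.Scalar]? {
    var scalars: [Unicode.Scalar] = []
    for hex in field.split(whereSeparator: { $0 == " " || $0 == "\t" }) {
        guard let value = UInt32(hex, radix: 16),
              let scalar = Unicode.Scalar(value) else { return nil }
        scalars.append(scalar)
    }
    return scalars.isEmpty ? nil : scalars
}

/// Parses a data line of the form "source ; target ; type # comment".
/// Comment-only and malformed lines yield nil.
func parseConfusable(_ line: String) -> ConfusableEntry? {
    let fields = line.split(separator: ";")
    guard fields.count >= 3,
          let source = parseScalars(fields[0]), source.count == 1,
          let target = parseScalars(fields[1]) else { return nil }
    return ConfusableEntry(source: source[0], target: target)
}

let line = "1D6CE ; 0076 ; MA # ( 𝛎 → v ) MATHEMATICAL BOLD SMALL NU → LATIN SMALL LETTER V"
if let entry = parseConfusable(line) {
    print(String(entry.source.value, radix: 16))   // 1d6ce
    print(entry.target.map { String($0) })         // ["v"]
}
```

A table built from such entries could drive either a warning or automatic aliasing; which of the two is desirable is exactly what is being debated here.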

Perhaps stupid but: why was Swift designed to accept most Unicode characters in identifier names? Wouldn’t it be simpler to go back to a model where only standard ascii characters are accepted in identifier names?

···

On 20 Jun 2016, at 20:26, Vladimir.S via swift-evolution <swift-evolution@swift.org> wrote:


Perhaps stupid but: why was Swift designed to accept most Unicode characters in identifier names? Wouldn’t it be simpler to go back to a model where only standard ascii characters are accepted in identifier names?

I assume it has something to do with the fact that 94.6% of the world's population speak a first language which is not English. That outweighs the inconvenience for Anglo developers, IMHO.

Honestly, this seems to me like a concern for linters and security auditing tools, not for the compiler. Swift identifiers are case-sensitive; I see no reason they shouldn't be script-sensitive or zero-width-joiner-sensitive. (Though basic Unicode normalization seems like a good idea, since differently-normalized strings are `==` anyway.)
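
The parenthetical about normalization can be checked directly. Swift's `String` equality is based on canonical equivalence, so two differently composed spellings of the same text compare equal even though their scalars differ. The variable names below are just for illustration.

```swift
let precomposed = "\u{E9}"     // é as a single scalar, U+00E9
let decomposed = "e\u{301}"    // e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)             // true
print(precomposed.unicodeScalars.count)      // 1
print(decomposed.unicodeScalars.count)       // 2

// Compared scalar by scalar, the two spellings are different sequences.
let sameScalars = Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)
print(sameScalars)                           // false
```

Any name comparison that works on raw scalars or bytes rather than on normalized text would therefore see two different identifiers here.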

···

--
Brent Royal-Gordon
Architechies

> I'm a little concerned about cases like these:
>
> 1D6CE ; 0076 ; MA # ( 𝛎 → v ) MATHEMATICAL BOLD SMALL NU → LATIN SMALL LETTER V # →ν→
> 1D6D2 ; 0070 ; MA # ( 𝛒 → p ) MATHEMATICAL BOLD SMALL RHO → LATIN SMALL LETTER P # →ρ→
>
> etc. Now, one could reasonably argue that using “𝛎” and “v” to mean
> different things in the same scope would be bad, but I'm not sure
> we really want to accept them as aliases of one another, either.

Indeed, it would be unwise to pick "𝛎" and "v" for different things
within the same scope. Unicode confusables are annoying and unfortunate,
but not totally unexpected. Automatic aliases for similar characters would
arguably be worse since it would probably qualify as unexpected behaviour
for most people.

I'm not entirely sure about automatic aliasing either. But I will boldly
claim that "most people" who choose v and nu for distinct variables don't
walk into that situation with expectations of sanity.

···

On Mon, Jun 20, 2016 at 5:20 PM, João Pinheiro <swift-evolution@swift.org> wrote:

> On 20 Jun 2016, at 22:44, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

João Pinheiro

Perhaps stupid but: why was Swift designed to accept most Unicode characters in identifier names? Wouldn’t it be simpler to go back to a model where only standard ascii characters are accepted in identifier names?

I assume it has something to do with the fact that 94.6% of the world's population speak a first language which is not English. That outweighs the inconvenience for Anglo developers, IMHO.

Yes, but the SDKs (frameworks, system libraries) are all in English, including the Swift standard library. I remember a few languages attempting localized versions so that kids could learn more easily, and failing terribly because what you learned had very limited use.

When it comes to maintaining code, using localized identifier names is a bad practice since anyone outside that country coming to the code can't really use it. I personally can't imagine coming to maintain Swift code with identifiers in Chinese, Japanese, Arabic, ...

While allowing non-ASCII characters in identifiers (which Apple held up high with its emoji examples) may seem cool, I can only see it being helpful in the future, given a different keyboard layout (as someone pointed out here some time ago), for introducing one-character operators that would otherwise be impossible. But if someone came to me with code in which a variable was a dog emoji, they'd get fired on the spot.

I'd personally vote to forbid zero-width joiner characters in code outside of string literals (where they may make sense). I agree that this can easily be handled by linters, but I think this particular set of characters should be restricted by the language itself: it's easy to miss during code review, and given the upcoming package manager, it could lead to hard-to-find malware being distributed among developers who include these packages in their projects, since you usually don't run a linter on third-party code.

As for the confusables, this depends a lot on the rendering and on the font you have set. I've tried 𝛎 → v in current Xcode and they look really different, especially with a fixed-width font: such fonts usually lack the non-ASCII characters, which are then rendered in a fallback font, making the distinction easy to spot.

···

On Jun 21, 2016, at 2:23 AM, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:


IIRC, some languages require zero-width joiners (though not zero-width spaces, which are distinct) to properly encode some of their characters. I'd be very leery of having Swift land on a model where identifiers can be used with some languages and not others; that smacks of ethnocentrism.

None of those languages require zero-width characters between two Latin letters, or between a Latin letter and an Arabic numeral, or at the end of a word. Since standard / system APIs will (barring some radical shift) use those code points exclusively, it's justifiable to give them some special attention.

Although the practical implementation may need to be more limited in scope, the general principle doesn't need to privilege Latin letters and Arabic numerals. If, in any context, the presence or absence of a zero-width glyph cannot possibly be distinguished by a human reading the text, then the compiler should also be indifferent to its presence or absence (or, alternatively, its presence should be a compile-time error).

Sure, that's obvious. Jordan was observing that the simplest way to enforce that, banning such characters from identifiers completely, would still interfere with some languages, and I was pointing out that just doing enough to protect English would get most of the practical value because it would protect every use of the system and standard library. A program would then only become attackable in this specific way for its own identifiers using non-Latin characters.

All that said, I'm not convinced that this is worthwhile; the identifier-similarity problem in Unicode is much broader than just invisible characters. In fact, Swift still doesn't canonicalize identifiers, so canonically equivalent compositions of the same glyph will actually produce different names. So unless we're going to fix that and then ban all sorts of things that are known to generally be represented with a confusable glyph in a typical fixed-width font (like the mathematical alphabets), this is just a problem that will always exist in some form.
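
As one illustration of what flagging the mathematical alphabets could involve, the sketch below reports scalars from the Mathematical Alphanumeric Symbols block (U+1D400 through U+1D7FF). The function name is hypothetical, and a block-range test is only an approximation; a fuller treatment would consult the Unicode confusables data rather than a single range.

```swift
/// The Mathematical Alphanumeric Symbols block, U+1D400 through U+1D7FF.
let mathAlphanumerics: ClosedRange<UInt32> = 0x1D400...0x1D7FF

/// Hypothetical lint: returns the scalars in `name` that come from
/// the mathematical alphabets.
func mathematicalScalars(in name: String) -> [Unicode.Scalar] {
    return Array(name.unicodeScalars.filter { mathAlphanumerics.contains($0.value) })
}

let boldNu = "\u{1D6CE}"   // 𝛎 MATHEMATICAL BOLD SMALL NU
let latinV = "v"

print(mathematicalScalars(in: boldNu).count)   // 1
print(mathematicalScalars(in: latinV).count)   // 0
```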

John.

···

On Jun 20, 2016, at 7:07 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
On Mon, Jun 20, 2016 at 8:58 PM, John McCall via swift-evolution <swift-evolution@swift.org> wrote:

On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <swift-evolution@swift.org> wrote:

John.

Jordan

On Jun 20, 2016, at 10:51, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:


Any discussion about this ought to start from UAX #31, the Unicode consortium's recommendations on identifiers in programming languages:

http://unicode.org/reports/tr31/

Section 2.3 specifically calls out the situations in which ZWJ and ZWNJ need to be allowed. The document also describes a stability policy for handling new Unicode versions, other confusability issues, and many of the other problems with adopting Unicode in a programming language's syntax.

-Joe

···

On Jun 21, 2016, at 8:47 AM, John McCall via swift-evolution <swift-evolution@swift.org> wrote:
