[Proposal] Normalize Unicode Identifiers


(João Pinheiro) #1

This proposal [gist <https://gist.github.com/JoaoPinheiro/5f226f46c67d235a7039c775a4300800>] is the result of the discussions from the thread "Prohibit invisible characters in identifier names <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>". I hope it's still in time for inclusion in Swift 3.

Sincerely,
João Pinheiro

Normalize Unicode Identifiers

Proposal: SE-NNNN <https://gist.github.com/JoaoPinheiro/NNNN-normalize-identifiers.md>
Author: João Pinheiro <https://github.com/joaopinheiro>
Status: Awaiting review
Review manager: TBD

Introduction

This proposal aims to introduce identifier normalization in order to prevent the unsafe and potentially abusive use of invisible or equivalent representations of Unicode characters in identifiers.

Swift-evolution thread: Discussion thread <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>

Motivation

Although Swift supports Unicode in identifiers, those identifiers are not yet normalized. As a result, different Unicode representations of the same character are treated as distinct identifiers.

For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"

In addition, default-ignorable characters such as the Zero Width Space and Zero Width Non-Joiner (shown below) are currently accepted as valid parts of identifiers without any restriction.

let ab = "ab"
let a​b = "a + Zero Width Space + b"

func xy() { print("xy") }
func x‌y() { print("x + <Zero Width Non-Joiner> + y") }

The use of default-ignorable characters in identifiers is problematic, first because the effects they represent are stylistic or otherwise out of scope for identifiers, and second because the characters themselves often have no visible display. These characters can also be misapplied so that users create strings that look the same but actually contain different characters, which can create security problems.
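
To see why the spellings in the examples above are distinct to the compiler, it can help to print the underlying code points. The following is only an illustrative sketch: `scalarDump` is a hypothetical helper, not part of the proposal, and the names are written with explicit `\u{...}` escapes so that the invisible characters show up in source.

```swift
// Hypothetical helper (not part of the proposal): render each Unicode
// scalar of a string as "U+XXXX" so invisible characters become visible.
func scalarDump(_ s: String) -> String {
    return s.unicodeScalars
        .map { scalar -> String in
            let hex = String(scalar.value, radix: 16, uppercase: true)
            // Left-pad to at least four hex digits, as in Unicode notation.
            let padded = String(repeating: "0", count: max(0, 4 - hex.count)) + hex
            return "U+" + padded
        }
        .joined(separator: " ")
}

// The identifier spellings from the examples, written with escapes.
let names: [(label: String, text: String)] = [
    ("Angstrom sign", "\u{212B}"),
    ("A with ring above", "\u{00C5}"),
    ("A + combining ring above", "A\u{030A}"),
    ("ab", "ab"),
    ("a + Zero Width Space + b", "a\u{200B}b"),
    ("x + Zero Width Non-Joiner + y", "x\u{200C}y"),
]

for name in names {
    print(name.label, "=>", scalarDump(name.text))
}
```

Each label prints a different scalar sequence even though several of the rendered names look identical on screen.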

Proposed solution

Normalize Swift identifiers according to the normalization form NFC recommended for case-sensitive languages in the Unicode Standard Annexes 15 <https://gist.github.com/JoaoPinheiro/UAX15> and 31 <https://gist.github.com/JoaoPinheiro/UAX31> and follow the Normalization Charts <https://gist.github.com/JoaoPinheiro/NormalizationCharts>.
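
As a rough illustration of what NFC would do to the three spellings from the Motivation section, the sketch below uses Foundation's `precomposedStringWithCanonicalMapping`, which returns a string's NFC form. This is string-level code rather than the compiler change being proposed, and `sameScalars` is a hypothetical helper introduced only for the comparison.

```swift
import Foundation

// Hypothetical helper: compare two strings scalar by scalar, bypassing
// String's own equality.
func sameScalars(_ a: String, _ b: String) -> Bool {
    return a.unicodeScalars.elementsEqual(b.unicodeScalars)
}

// Angstrom sign, A with ring above, and A + combining ring above.
let spellings = ["\u{212B}", "\u{00C5}", "A\u{030A}"]

// Before normalization the spellings are different scalar sequences.
print(sameScalars(spellings[0], spellings[1]))
print(sameScalars(spellings[1], spellings[2]))

// After NFC, each spelling is compared against the single scalar U+00C5.
let normalized = spellings.map { $0.precomposedStringWithCanonicalMapping }
for form in normalized {
    print(sameScalars(form, "\u{00C5}"))
}
```

Under NFC all three spellings map to the same scalar sequence, which is why normalized identifiers would collide instead of silently coexisting.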

In addition to that, prohibit the use of default-ignorable characters in identifiers except in the special cases described in UAX31 <https://gist.github.com/JoaoPinheiro/UAX31>, listed below:

   - Allow Zero Width Non-Joiner (U+200C) when breaking a cursive connection
   - Allow Zero Width Non-Joiner (U+200C) in a conjunct context
   - Allow Zero Width Joiner (U+200D) in a conjunct context
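
The rule above could be prototyped roughly as follows. This is a simplified sketch under stated assumptions: it relies on the standard library's `Unicode.Scalar.Properties.isDefaultIgnorableCodePoint`, the names `IdentifierCheck` and `checkIdentifier` are hypothetical, and it only flags ZWNJ/ZWJ for a later context check instead of implementing the UAX31 context rules themselves.

```swift
// Scalars that would still be permitted in specific contexts:
// Zero Width Non-Joiner and Zero Width Joiner.
let contextuallyAllowed: Set<Unicode.Scalar> = ["\u{200C}", "\u{200D}"]

// Hypothetical result type for this sketch.
enum IdentifierCheck: Equatable {
    case ok
    case rejected(Unicode.Scalar)           // default-ignorable, never allowed
    case needsContextCheck(Unicode.Scalar)  // ZWNJ/ZWJ: depends on neighbours
}

// Reports the first default-ignorable scalar found in the name, if any.
// The cursive and conjunct context rules are NOT implemented here.
func checkIdentifier(_ name: String) -> IdentifierCheck {
    for scalar in name.unicodeScalars
    where scalar.properties.isDefaultIgnorableCodePoint {
        if contextuallyAllowed.contains(scalar) {
            return .needsContextCheck(scalar)
        }
        return .rejected(scalar)
    }
    return .ok
}

print(checkIdentifier("ab"))
print(checkIdentifier("a\u{200B}b"))
print(checkIdentifier("x\u{200C}y"))
```

A real implementation would have to inspect the characters on either side of a ZWNJ or ZWJ before accepting it.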

Impact on existing code

This has the potential to be a code-breaking change in cases where people have used distinct but identical-looking identifiers with different Unicode representations. The likelihood of that happening in actual code is very small, and the problem can be solved by renaming any identifiers that don't conform to the normalized form to new, non-colliding names.

Alternatives considered

The option of ignoring default-ignorable characters in identifiers was also discussed, but it was considered to be more confusing and less secure than explicitly treating them as errors.

Unaddressed Issues

There was some discussion around the issue of Unicode confusable characters, but it was considered to be out of scope for this proposal. Unicode confusable characters are a complicated issue and any possible solutions also come with significant drawbacks that would require more time and consideration.


(Chris Lattner) #2

Hi João,

Unfortunately, we’re out of time to accept new proposals. Tomorrow is the last day for *implementation* work on source breaking changes to be done. We can talk about this next week for Swift 3.x or Swift 4.

-Chris

···

On Jul 26, 2016, at 12:22 PM, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:

This proposal [gist <https://gist.github.com/JoaoPinheiro/5f226f46c67d235a7039c775a4300800>] is the result of the discussions from the thread "Prohibit invisible characters in identifier names <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>". I hope it's still on time for inclusion in Swift 3.


(Xiaodi Wu) #3

+1. Even if it's too late for Swift 3, though, I'd argue that it's highly
unlikely to be code-breaking in practice. Any existing code that would get
tripped up by this normalization is arguably broken already.

···

On Tue, Jul 26, 2016 at 2:22 PM, João Pinheiro <swift-evolution@swift.org> wrote:

This proposal [gist
<https://gist.github.com/JoaoPinheiro/5f226f46c67d235a7039c775a4300800>]
is the result of the discussions from the thread "Prohibit invisible
characters in identifier names
<http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>". I hope
it's still on time for inclusion in Swift 3.


_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(João Pinheiro) #4

I'll wait until next week, when the stress from Swift 3 has passed, to ping about this again. I'll also revise the proposal to clarify the small possibility of breaking code. As Xiaodi Wu mentioned, this change is highly unlikely to be code-breaking, and existing code that would be broken by this normalisation could arguably be considered broken already.

Sincerely,
João Pinheiro

···

On 26 Jul 2016, at 22:32, Chris Lattner <clattner@apple.com> wrote:

On Jul 26, 2016, at 12:22 PM, João Pinheiro via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

This proposal [gist <https://gist.github.com/JoaoPinheiro/5f226f46c67d235a7039c775a4300800>] is the result of the discussions from the thread "Prohibit invisible characters in identifier names <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>". I hope it's still on time for inclusion in Swift 3.

Hi João,

Unfortunately, we’re out of time to accept new proposals. Tomorrow is the last day for *implementation* work on source breaking changes to be done. We can talk about this next week for Swift 3.x or Swift 4.

-Chris


(João Pinheiro) #5

The crunch from Swift 3 has now passed and I'm bringing up this proposal again. Should I go ahead and issue a pull request for this?

Sincerely,
João Pinheiro



(Joe Groff) #6

I'm inclined to agree. To be paranoid about perfect compatibility, we could conceivably allow existing code with differently-normalized identifiers with a warning based on Swift version, but it's probably not worth it. It'd be interesting to data-mine GitHub or the iOS Swift Playgrounds app and see if this breaks any Swift 3 code in practice.
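
A data-mining pass of that kind might look roughly like the sketch below, which groups identifier spellings by their NFC form and reports groups containing more than one distinct spelling. It assumes the identifiers have already been extracted from source by some other tool; `scalars` and `normalizationCollisions` are hypothetical names, and NFC comes from Foundation's `precomposedStringWithCanonicalMapping`.

```swift
import Foundation

// Hypothetical helper: the scalar values of a string.
func scalars(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

// Groups identifiers by the scalars of their NFC form and returns the
// groups that contain more than one distinct scalar-level spelling.
func normalizationCollisions(in identifiers: [String]) -> [[String]] {
    // Keys are scalar arrays rather than Strings, because String equality
    // already treats canonically equivalent spellings as equal.
    var spellingsByNFC: [[UInt32]: [[UInt32]: String]] = [:]
    for identifier in identifiers {
        let key = scalars(identifier.precomposedStringWithCanonicalMapping)
        spellingsByNFC[key, default: [:]][scalars(identifier)] = identifier
    }
    return spellingsByNFC.values
        .filter { $0.count > 1 }
        .map { Array($0.values) }
}

// "count" appears twice with the same spelling, so it is not a collision.
let found = normalizationCollisions(in: ["\u{00C5}", "A\u{030A}", "count", "count"])
print(found.count)
```

Only code whose identifiers fall into such a group would be affected by the change.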

-Joe

···

On Jul 26, 2016, at 12:26 PM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

+1. Even if it's too late for Swift 3, though, I'd argue that it's highly unlikely to be code-breaking in practice. Any existing code that would get tripped up by this normalization is arguably broken already.


(Jacob Bandes-Storch) #7

Hi João,
I think you should definitely put up a PR for this. I'm restarting the
discussion about allowed operator/identifier characters (
https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59), and I
think your proposal is an obvious requirement for any solution to be
complete. :)

Jacob

···

On Tue, Aug 9, 2016 at 7:20 AM, João Pinheiro <swift-evolution@swift.org> wrote:

The crunch from Swift 3 has now passed and I'm bringing up this proposal
again. Should I go ahead and issue a pull request for this?

Sincerely,
João Pinheiro



(Michael Gottesman) #8

>> +1. Even if it's too late for Swift 3, though, I'd argue that it's highly unlikely to be code-breaking in practice. Any existing code that would get tripped up by this normalization is arguably broken already.
>
> I'm inclined to agree. To be paranoid about perfect compatibility, we could conceivably allow existing code with differently-normalized identifiers with a warning based on Swift version, but it's probably not worth it. It'd be interesting to data-mine Github or the iOS Swift Playgrounds app and see if this breaks any Swift 3 code in practice.

As an additional interesting point here, we could in general normalize unicode strings. This could potentially reduce the size of unicode characters or allow us to constant propagate certain unicode algorithms in the optimizer.

···

On Sep 22, 2016, at 10:50 AM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:



(João Pinheiro) #9

Hi Jacob,

I'll go ahead and submit a pull request for this later today then!

João

···

On 22 Sep 2016, at 08:00, Jacob Bandes-Storch <jtbandes@gmail.com> wrote:

Hi João,
I think you should definitely put up a PR for this. I'm restarting the discussion about allowed operator/identifier characters (https://gist.github.com/jtbandes/c0b0c072181dcd22c3147802025d0b59), and I think your proposal is an obvious requirement for any solution to be complete. :)

Jacob



(Xiaodi Wu) #10

You mean values of type String? I would want those to be exactly what I say
they are; NFC normalization is available, if I recall, as part of
Foundation, but by no means should my String values be silently changed!

···

On Thu, Sep 22, 2016 at 6:10 PM, Michael Gottesman <mgottesman@apple.com> wrote:


> As an additional interesting point here, we could in general normalize
> unicode strings. This could potentially reduce the size of unicode
> characters or allow us to constant propagate certain unicode algorithms in
> the optimizer.



(Michael Gottesman) #11

> You mean values of type String?

I was speaking solely of constant strings.

> I would want those to be exactly what I say they are; NFC normalization is available, if I recall, as part of Foundation, but by no means should my String values be silently changed!

Why.

···

On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:



(Xiaodi Wu) #12

>> You mean values of type String?
>
> I was speaking solely of constant strings.
>
>> I would want those to be exactly what I say they are; NFC normalization is
>> available, if I recall, as part of Foundation, but by no means should my
>> String values be silently changed!
>
> Why.

For one, I don't want to pay the computational cost of normalization at
runtime unless necessary. For another, I expect to be able to round-trip
user input. Normalization is not lossless and cannot be reversed. Finally,
if I want to use normalization form D (NFD), your proposal would make it
impossible, because (IIUC) serial NFC + NFD normalization can produce
different output than NFD normalization alone.

···

On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman@apple.com> wrote:



(Jacob Bandes-Storch) #13

This point seems moot to me because String's == checks for *canonical
equivalence* anyway.
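
For readers who want to check this, here is a small sketch using only the standard library. The variable names are illustrative; the two literals spell the same word with a precomposed and a decomposed "é".

```swift
let precomposed = "caf\u{00E9}"   // "é" as a single scalar
let decomposed = "cafe\u{0301}"   // "e" followed by COMBINING ACUTE ACCENT

// String equality compares by canonical equivalence.
print(precomposed == decomposed)

// The underlying scalar sequences are still different, and are preserved.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)
```

The comparison treats the two values as equal while each string keeps the scalars it was created with.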

···

On Thu, Sep 22, 2016 at 4:54 PM Michael Gottesman via swift-evolution < swift-evolution@swift.org> wrote:

> On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
>
>> You mean values of type String?
>
> I was speaking solely of constant strings.
>
>> I would want those to be exactly what I say they are; NFC normalization is
>> available, if I recall, as part of Foundation, but by no means should my
>> String values be silently changed!
>
> Why.



(João Pinheiro) #14

I've submitted the pull request for this proposal. :)

···

On 22 Sep 2016, at 14:58, João Pinheiro via swift-evolution <swift-evolution@swift.org> wrote:

Hi Jacob,

I'll go ahead and submit a pull request for this later today then!

João



(Karl) #15

No, the proposal seems to be about Unicode characters in Swift source code:

To ease lexing/parsing and avoid user confusion, the names of custom identifiers (type names, variable names, etc.) and operators in Swift can be composed of (mostly) separate sets of characters.

In that sense, it sounds more like a source-breaking bugfix to me.

···

On 23 Sep 2016, at 01:19, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:

You mean values of type String? I would want those to be exactly what I say they are; NFC normalization is available, if I recall, as part of Foundation, but by no means should my String values be silently changed!



(Michael Gottesman) #16

>>> You mean values of type String?
>>
>> I was speaking solely of constant strings.
>>
>>> I would want those to be exactly what I say they are; NFC normalization is available, if I recall, as part of Foundation, but by no means should my String values be silently changed!
>>
>> Why.
>
> For one, I don't want to pay the computational cost of normalization at runtime unless necessary.

This would only happen with strings that are known to be constant at compile time (and as such the transformation would occur at compile time). There would be no runtime cost.

> For another, I expect to be able to round-trip user input.

String checks for canonical equivalence, IIRC.

> Normalization is not lossless and cannot be reversed. Finally, if I want to use normalization form D (NFD), your proposal
> would make it impossible, because (IIUC) serial NFC + NFD normalization can produce different output than NFD normalization alone.

Why would you want to do this/care about this? I.e. what is the use case?

As an aside, I am not formally proposing this. I am just discussing potential opportunities for optimization, given that we would need (as a part of this proposal) to add knowledge of Unicode to the compiler, which would allow for compile-time transformations.

···

On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:


(Xiaodi Wu) #17

>>>> You mean values of type String?
>>>
>>> I was speaking solely of constant strings.
>>>
>>>> I would want those to be exactly what I say they are; NFC normalization
>>>> is available, if I recall, as part of Foundation, but by no means should my
>>>> String values be silently changed!
>>>
>>> Why.
>>
>> For one, I don't want to pay the computational cost of normalization at
>> runtime unless necessary.
>
> This would only happen with strings that are known to be constant at
> compile time (and as such the transformation would occur at compile time).
> There would be no runtime cost.

Yes, for constant strings only there would be no runtime cost.

>> For another, I expect to be able to round-trip user input.
>
> String checks for canonical equivalence, IIRC.

Sure, but I'm not talking about using comparison operators here. I mean
that if we have `let str = "[some non-NFC string]"`, I should be able to
write that out to a file with all the non-canonical glyphs intact.
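To make the distinction in this exchange concrete, here is a minimal Swift sketch (not from the thread; the variable names are illustrative). It shows that comparing two `String` values with `==` treats canonically equivalent spellings as equal, while the scalars actually stored in each value are left untouched, which is what round-tripping depends on.

```swift
// Two spellings of "Å": one precomposed scalar versus a base letter plus a combining mark.
let precomposed = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed = "A\u{030A}"   // "A" + COMBINING RING ABOVE

// String's == compares by canonical equivalence, so the two values compare equal.
print(precomposed == decomposed)   // true

// The stored scalars are not rewritten, so they still differ.
let lhs = Array(precomposed.unicodeScalars)
let rhs = Array(decomposed.unicodeScalars)
print(lhs == rhs)                  // false
print(lhs.count, rhs.count)        // 1 2
```

Writing either value out through its `unicodeScalars` (or its UTF-8 view) would therefore preserve whichever spelling the literal used.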

There are known issues with NFC that are acceptable for normalizing Swift
identifiers but make it unsuitable for general use. For example, the
normalized form of Greek ano teleia is middle dot, but these two glyphs are
rendered differently in many fonts, and substituting a middle dot in place
of the Greek punctuation mark is actually quite inadequate for Greek text
(ano teleia is supposed to be around x-height; middle dot is not). Even for
constant strings, it is essential that one can output ano teleia when it is
specified rather than middle dot. However, Unicode normalization algorithms
guarantee stability and will forever require swapping the former for the
latter. I understand that other such problematic characters exist.
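The ano teleia case can be checked with a short sketch, assuming Foundation is available and using its `precomposedStringWithCanonicalMapping` property for NFC. The variable names are illustrative and not part of the thread.

```swift
import Foundation

let anoTeleia = "\u{0387}"   // GREEK ANO TELEIA

// NFC via Foundation's precomposedStringWithCanonicalMapping.
let nfc = anoTeleia.precomposedStringWithCanonicalMapping

// Print each scalar of the normalized string as hex.
for scalar in nfc.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))
}
```

If the normalization behaves as described above, the loop should print `B7` (MIDDLE DOT) rather than `387`, which is the substitution being objected to for string values.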

Normalization is not lossless and cannot be reversed. Finally, if I want to
use normalization form D (NFD), your proposal would make it impossible,
because (IIUC) serial NFC + NFD normalization can produce different output
than NFD normalization alone.

Why would you want to do this/care about this? I.e. what is the use case?

Use cases for NFD include searching, where you'd find substrings considered
"compatible." For instance, the fi ligature is considered compatible with
the letters f and i, but they are not equal. If you've ever successfully
searched for a word like "finance" in a PDF document that's been typeset
with ligatures, you've benefited from NFD. Roughly speaking (IIUC), the
difference between searching NFC-normalized strings and NFD-normalized
strings is analogous to the difference between a case-sensitive and a
case-insensitive search. Therefore, given a string x, it's sometimes
important to be able to obtain NFD(x). If every string x is now
automatically NFC(x), then the best one can do is NFD(NFC(x)), which is not
guaranteed equal to NFD(x) even with canonical comparison (i.e.
NFC(NFD(NFC(x))) != NFC(NFD(x)) for all x).
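The claims in this paragraph are easy to probe with Foundation's normalization properties. The sketch below is an illustration under stated assumptions, not part of the thread: `hexScalars` is a hypothetical helper, and the sample strings are chosen only for demonstration. One detail worth checking while reading the paragraph above: ligature folding is a compatibility mapping, so in this sketch it shows up under NFKD (`decomposedStringWithCompatibilityMapping`) rather than under NFD (`decomposedStringWithCanonicalMapping`).

```swift
import Foundation

// Hypothetical helper: the scalar values of a string as hex, for inspection.
func hexScalars(_ s: String) -> [String] {
    return s.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

let ligature = "\u{FB01}nance"   // "finance" spelled with the "fi" ligature

// NFD: canonical decomposition. NFKD: compatibility decomposition.
let nfd = ligature.decomposedStringWithCanonicalMapping
let nfkd = ligature.decomposedStringWithCompatibilityMapping

print(hexScalars(nfd))    // the ligature scalar FB01 is still present
print(hexScalars(nfkd))   // the ligature is replaced by "f" and "i"

// NFD applied directly versus NFD applied after NFC, for one sample input.
let x = "A\u{030A}\u{212B}"
let direct = x.decomposedStringWithCanonicalMapping
let viaNFC = x.precomposedStringWithCanonicalMapping.decomposedStringWithCanonicalMapping
print(hexScalars(direct) == hexScalars(viaNFC))
```

For this sample input the two NFD results agree; the sketch makes no claim beyond the inputs it tests, so anyone relying on the composition argument should try their own data.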

As an aside, I am not formally proposing this. I am just discussing
potential opportunities for optimization given that we would need (as a part
of this proposal) to add knowledge of Unicode to the compiler, which would
allow for compile time transformations.

I'd be interested to know what performance gains you're envisioning with
such an optimization of constant strings at compile time.

···

On Thu, Sep 22, 2016 at 7:44 PM, Michael Gottesman <mgottesman@apple.com> wrote:

On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman@apple.com> wrote:

On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Thu, Sep 22, 2016 at 6:10 PM, Michael Gottesman <mgottesman@apple.com> wrote:



(Michael Gottesman) #18

You mean values of type String?

I was speaking solely of constant strings.

I would want those to be exactly what I say they are; NFC normalization is available, if I recall, as part of Foundation, but by no means should my String values be silently changed!

Why?

For one, I don't want to pay the computational cost of normalization at runtime unless necessary.

This would only happen with strings that are known to be constant at compile time (and as such the transformation would occur at compile time). There would be no runtime cost.

Yes, for constant strings only there would be no runtime cost.

For another, I expect to be able to round-trip user input.

String checks for canonical equivalence, IIRC.

Sure, but I'm not talking about using comparison operators here. I mean that if we have `let str = "[some non-NFC string]"`, I should be able to write that out to a file with all the non-canonical glyphs intact.

I would argue that for most people that is not an interesting distinction. Naturally there would be a way to escape such canonicalization to get the non-canonicalized String.

There are known issues with NFC that are acceptable for normalizing Swift identifiers but make it unsuitable for general use. For example, the normalized form of Greek ano teleia is middle dot, but these two glyphs are rendered differently in many fonts, and substituting a middle dot in place of the Greek punctuation mark is actually quite inadequate for Greek text (ano teleia is supposed to be around x-height; middle dot is not). Even for constant strings, it is essential that one can output ano teleia when it is specified rather than middle dot. However, Unicode normalization algorithms guarantee stability and will forever require swapping the former for the latter. I understand that other such problematic characters exist.

I would argue that that is a problem with the unicode standard and with the fonts. This is not a problem for Swift to solve.

Normalization is not lossless and cannot be reversed. Finally, if I want to use normalization form D (NFD), your proposal would make it impossible, because (IIUC) serial NFC + NFD normalization can produce different output than NFD normalization alone.

Why would you want to do this/care about this? I.e. what is the use case?

Use cases for NFD include searching, where you'd find substrings considered "compatible." For instance, the fi ligature is considered compatible with the letters f and i, but they are not equal. If you've ever successfully searched for a word like "finance" in a PDF document that's been typeset with ligatures, you've benefited from NFD. Roughly speaking (IIUC), the difference between searching NFC-normalized strings and NFD-normalized strings is analogous to the difference between a case-sensitive and a case-insensitive search. Therefore, given a string x, it's sometimes important to be able to obtain NFD(x). If every string x is now automatically NFC(x), then the best one can do is NFD(NFC(x)), which is not guaranteed equal to NFD(x) even with canonical comparison (i.e. NFC(NFD(NFC(x))) != NFC(NFD(x)) for all x).

There are issues here related to String design. For instance, one could make an argument that such searching is really only interesting for a "Text" use case which is different from a String use case. That being said, I don't want to argue about this here since we are hijacking this thread ; ).

As an aside, I am not formally proposing this. I am just discussing potential opportunities for optimization given that we would need (as a part of this proposal) to add knowledge of Unicode to the compiler, which would allow for compile-time transformations.

I'd be interested to know what performance gains you're envisioning with such an optimization of constant strings at compile time.

I would have to measure such wins to say anything concrete. Algorithmically one would be able to avoid normalization during common unicode operations when you know you are using constant strings. Even though this may provide a runtime win, the major win from teaching the compiler about unicode would be in terms of applying unicode operations such as encoding/decoding to constant strings.
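The idea of skipping normalization when both operands are already known to be normalized can be sketched as follows. This is only an illustration: `PreNormalized` is a hypothetical type, and the normalization here happens at runtime in the initializer, whereas the suggestion above is that the compiler could do the equivalent work for constant strings at compile time.

```swift
import Foundation

// Hypothetical wrapper: normalizes to NFC once, at construction.
struct PreNormalized: Equatable {
    let scalars: [Unicode.Scalar]

    init(_ string: String) {
        scalars = Array(string.precomposedStringWithCanonicalMapping.unicodeScalars)
    }
}

// ANGSTROM SIGN versus "A" + COMBINING RING ABOVE.
let a = PreNormalized("\u{212B}")
let b = PreNormalized("A\u{030A}")

// The synthesized == compares the stored scalars directly,
// with no further normalization at comparison time.
print(a == b)   // true
```

Whether this is a measurable win is exactly the open question in the reply above; the sketch only shows where the normalization cost would move.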

That being said, this is not the proposal that is being discussed here or even being proposed here. [i.e. let's stop hijacking this thread ; )]

···

On Sep 22, 2016, at 6:11 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
On Thu, Sep 22, 2016 at 7:44 PM, Michael Gottesman <mgottesman@apple.com <mailto:mgottesman@apple.com>> wrote:

On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu@gmail.com <mailto:xiaodi.wu@gmail.com>> wrote:
On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman@apple.com <mailto:mgottesman@apple.com>> wrote:

On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu@gmail.com <mailto:xiaodi.wu@gmail.com>> wrote:

On Thu, Sep 22, 2016 at 6:10 PM, Michael Gottesman <mgottesman@apple.com <mailto:mgottesman@apple.com>> wrote:



(Xiaodi Wu) #19

Agreed. Taking this off-list :)

···

On Thu, Sep 22, 2016 at 9:01 PM, Michael Gottesman <mgottesman@apple.com> wrote:

On Sep 22, 2016, at 6:11 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Thu, Sep 22, 2016 at 7:44 PM, Michael Gottesman <mgottesman@apple.com> wrote:

On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman@apple.com> wrote:

On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
