Quick pitch: Change Linux’s string comparison to match Darwin’s

On Darwin, known-ASCII strings are sorted according to the lexicographical ordering of their code units. All non-known-ASCII strings are otherwise ordered based on the UCA[1]. On Linux, however, even known-ASCII strings are ordered based on UCA. I propose to unify these by changing Linux’s string sort order to match Darwin’s in Swift 4.0.

Background

Swift’s default ordering for strings is appropriate for machine consumption (e.g. implementing sorted collections). It obeys Unicode canonical equivalence[2], that is strings compare the same modulo normalization. However, it is not meant to be sufficient for presenting a meaningful ordering to human consumers, as that requires incorporating reader-specific information (e.g. [3]).

Known-ASCII strings are a trivial case for the described sort order semantics: pure ASCII is unaffected by normalization. Thus, lexicographical ordering of code units is a valid machine ordering for ASCII strings. On Darwin, this is used to order known-ASCII strings while Linux uses UCA even for known-ASCII strings.

Long term, the plan is to switch String’s sort order to be the lexicographical ordering of normalized code units (or perhaps scalar values), as mentioned in the String Manifesto[4]. This is a more efficient ordering than that provided by UCA. However, this will not make it in time for Swift 4.0.

Changes

I propose to change Linux’s sort order for known-ASCII strings to be the same as it is on Darwin. This will be accomplished by dropping the relevant #if guards in StringCompare.swift. An example implementation can be found at [5].

In addition to unifying sort order semantics across platforms, this will also deliver significant performance boosts to pure ASCII strings on Linux.

[1] UTS #10: Unicode Collation Algorithm <http://unicode.org/reports/tr10/&gt;
[2] Canonical Equivalence in Applications <http://unicode.org/notes/tn5/&gt;
[3] UCA: Contextual Sensitivity <UTS #10: Unicode Collation Algorithm;
[4] String Manifesto: Comparing and Hashing Strings <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings&gt;
[5] Unifying Linux/Darwin ASCII sort order semantics - github <https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b&gt;

Unfortunately after some investigations and discussion, the situation seems to be more murky. This approach would break transitivity of String comparison on Linux, at least with any implementation of UCA using the normal collation weights. A < B, B < C should imply A < C. But, if both A and B are known-ASCII while C is UTF16, transitivity can be broken for any character that UCA yields a different sort order for (e.g. “#” vs “&”). On Darwin, the comparison implementation happens to preserve transitivity as the platform (in effect) relatively weights ASCII by code unit values.

While I would like to get some performance improvements in time for Linux, I don’t think this approach is viable for Swift 4.0. Unless anyone has any ideas about another minimally invasive approach, my recommendation is to do the long-term solution (lexicographical order of normalized code units) immediately after Swift 4.0.

···

On Jul 25, 2017, at 2:01 PM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

On Darwin, known-ASCII strings are sorted according to the lexicographical ordering of their code units. All non-known-ASCII strings are otherwise ordered based on the UCA[1]. On Linux, however, even known-ASCII strings are ordered based on UCA. I propose to unify these by changing Linux’s string sort order to match Darwin’s in Swift 4.0.

Background

Swift’s default ordering for strings is appropriate for machine consumption (e.g. implementing sorted collections). It obeys Unicode canonical equivalence[2], that is strings compare the same modulo normalization. However, it is not meant to be sufficient for presenting a meaningful ordering to human consumers, as that requires incorporating reader-specific information (e.g. [3]).

Known-ASCII strings are a trivial case for the described sort order semantics: pure ASCII is unaffected by normalization. Thus, lexicographical ordering of code units is a valid machine ordering for ASCII strings. On Darwin, this is used to order known-ASCII strings while Linux uses UCA even for known-ASCII strings.

Long term, the plan is to switch String’s sort order to be the lexicographical ordering of normalized code units (or perhaps scalar values), as mentioned in the String Manifesto[4]. This is a more efficient ordering than that provided by UCA. However, this will not make it in time for Swift 4.0.

Changes

I propose to change Linux’s sort order for known-ASCII strings to be the same as it is on Darwin. This will be accomplished by dropping the relevant if guards in StringCompare.swift. An example implementation can be found at [5].

In addition to unifying sort order semantics across platforms, this will also deliver significant performance boosts to pure ASCII strings on Linux.

[1] UTS #10: Unicode Collation Algorithm <http://unicode.org/reports/tr10/&gt;
[2] Canonical Equivalence in Applications <http://unicode.org/notes/tn5/&gt;
[3] UCA: Contextual Sensitivity <UTS #10: Unicode Collation Algorithm;
[4] String Manifesto: Comparing and Hashing Strings <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings&gt;
[5] Unifying Linux/Darwin ASCII sort order semantics - github <https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b&gt;\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

FWIW, here is an illustration of how much an optimization in this area
would be worth to Kitura, running on Ubuntu 14.04 (where ICU 52 is
particularly expensive when comparing ASCII strings [1]).
The workload I used here is GitHub - the-benchmarker/web-frameworks: Which is the fastest web framework?
I have compiled Kitura 1.7.6 with the latest 4.0 toolchain (07-26-a), and
again with the same toolchain modified to remove the if in StringCompare's
==.

djones6@needletail:~/which_is_the_fastest$ numactl --physcpubind=0-3,16-19
--membind=0 bin/benchmarker kitura_40 kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40 7.824673
7.682657 7.740933
swift kitura_40_memcmp 5.163788
4.811082 4.955571

The difference here is around 35%. Other Kitura workloads I've performed
this comparison on in the past (such as involving JSON serialization) have
showed a difference in the 15 - 20% region.
The difference is far smaller on Ubuntu 16.04 (around 8% for this
workload), due to improvements in the newer level of ICU:

djones6@gruffalo:~/which_is_the_fastest$ numactl --physcpubind=0-3,16-19
--membind=0 bin/benchmarker kitura_40 kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40 4.691993
4.531465 4.580086
swift kitura_40_memcmp 4.349387
4.015061 4.201105

- David.

···

---
David Jones, Swift@IBM

[1] https://github.com/apple/swift/pull/7339

On 26 July 2017 at 03:59, Michael Ilseman via swift-dev <swift-dev@swift.org > wrote:

Unfortunately after some investigations and discussion, the situation
seems to be more murky. This approach would break transitivity of String
comparison on Linux, at least with any implementation of UCA using the
normal collation weights. A < B, B < C should imply A < C. But, if both A
and B are known-ASCII while C is UTF16, transitivity can be broken for any
character that UCA yields a different sort order for (e.g. “#” vs “&”). On
Darwin, the comparison implementation happens to preserve transitivity as
the platform (in effect) relatively weights ASCII by code unit values.

While I would like to get some performance improvements in time for Linux,
I don’t think this approach is viable for Swift 4.0. Unless anyone has any
ideas about another minimally invasive approach, my recommendation is to do
the long-term solution (lexicographical order of normalized code units)
immediately after Swift 4.0.

On Jul 25, 2017, at 2:01 PM, Michael Ilseman via swift-dev < > swift-dev@swift.org> wrote:

On Darwin, known-ASCII strings are sorted according to the lexicographical
ordering of their code units. All non-known-ASCII strings are otherwise
ordered based on the UCA[1]. On Linux, however, even known-ASCII strings
are ordered based on UCA. I propose to unify these by changing Linux’s
string sort order to match Darwin’s in Swift 4.0.
Background

Swift’s default ordering for strings is appropriate for machine
consumption (e.g. implementing sorted collections). It obeys Unicode
canonical equivalence[2], that is strings compare the same modulo
normalization. However, it is not meant to be sufficient for presenting a
meaningful ordering to human consumers, as that requires incorporating
reader-specific information (e.g. [3]).

Known-ASCII strings are a trivial case for the described sort order
semantics: pure ASCII is unaffected by normalization. Thus, lexicographical
ordering of code units is a valid machine ordering for ASCII strings. On
Darwin, this is used to order known-ASCII strings while Linux uses UCA even
for known-ASCII strings.

Long term, the plan is to switch String’s sort order to be the
lexicographical ordering of normalized code units (or perhaps scalar
values), as mentioned in the String Manifesto[4]. This is a more efficient
ordering than that provided by UCA. However, this will not make it in time
for Swift 4.0.
Changes

I propose to change Linux’s sort order for known-ASCII strings to be the
same as it is on Darwin. This will be accomplished by dropping the relevant
if guards in StringCompare.swift. An example implementation can be found
at [5].

In addition to unifying sort order semantics across platforms, this will
also deliver significant performance boosts to pure ASCII strings on Linux.

[1] UTS #10: Unicode Collation Algorithm
<http://unicode.org/reports/tr10/&gt;

[2] Canonical Equivalence in Applications <http://unicode.org/notes/tn5/&gt;

[3] UCA: Contextual Sensitivity
<UTS #10: Unicode Collation Algorithm;

[4] String Manifesto: Comparing and Hashing Strings
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings&gt;

[5] Unifying Linux/Darwin ASCII sort order semantics - github
<https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b&gt;
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

The performance issues in Ubuntu 14.04 is not surprising. 14.04’s ICU version is ancient, relatively speaking (e.g. it predates both Swift and Unicode 7.0). Performance issues is not the only problem there, as Unicode semantics such as grapheme breaking has evolved dramatically since then. Out of curiosity, is there a reason why Kitura is stuck on 14.04? Can your platform bundle a newer ICU so that your users can have things like modern emoji support?

···

On Jul 27, 2017, at 7:28 AM, David Jones via swift-dev <swift-dev@swift.org> wrote:

FWIW, here is an illustration of how much an optimization in this area would be worth to Kitura, running on Ubuntu 14.04 (where ICU 52 is particularly expensive when comparing ASCII strings [1]).
The workload I used here is GitHub - the-benchmarker/web-frameworks: Which is the fastest web framework?
I have compiled Kitura 1.7.6 with the latest 4.0 toolchain (07-26-a), and again with the same toolchain modified to remove the if in StringCompare's ==.

djones6@needletail:~/which_is_the_fastest$ numactl --physcpubind=0-3,16-19 --membind=0 bin/benchmarker kitura_40 kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec] Min [sec] Ave [sec]
------------------------- ------------------------- --------------- --------------- ---------------
swift kitura_40 7.824673 7.682657 7.740933
swift kitura_40_memcmp 5.163788 4.811082 4.955571

The difference here is around 35%. Other Kitura workloads I've performed this comparison on in the past (such as involving JSON serialization) have showed a difference in the 15 - 20% region.
The difference is far smaller on Ubuntu 16.04 (around 8% for this workload), due to improvements in the newer level of ICU:

djones6@gruffalo:~/which_is_the_fastest$ numactl --physcpubind=0-3,16-19 --membind=0 bin/benchmarker kitura_40 kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec] Min [sec] Ave [sec]
------------------------- ------------------------- --------------- --------------- ---------------
swift kitura_40 4.691993 4.531465 4.580086
swift kitura_40_memcmp 4.349387 4.015061 4.201105

- David.
---
David Jones, Swift@IBM

[1] https://github.com/apple/swift/pull/7339

On 26 July 2017 at 03:59, Michael Ilseman via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:
Unfortunately after some investigations and discussion, the situation seems to be more murky. This approach would break transitivity of String comparison on Linux, at least with any implementation of UCA using the normal collation weights. A < B, B < C should imply A < C. But, if both A and B are known-ASCII while C is UTF16, transitivity can be broken for any character that UCA yields a different sort order for (e.g. “#” vs “&”). On Darwin, the comparison implementation happens to preserve transitivity as the platform (in effect) relatively weights ASCII by code unit values.

While I would like to get some performance improvements in time for Linux, I don’t think this approach is viable for Swift 4.0. Unless anyone has any ideas about another minimally invasive approach, my recommendation is to do the long-term solution (lexicographical order of normalized code units) immediately after Swift 4.0.

On Jul 25, 2017, at 2:01 PM, Michael Ilseman via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

On Darwin, known-ASCII strings are sorted according to the lexicographical ordering of their code units. All non-known-ASCII strings are otherwise ordered based on the UCA[1]. On Linux, however, even known-ASCII strings are ordered based on UCA. I propose to unify these by changing Linux’s string sort order to match Darwin’s in Swift 4.0.

Background

Swift’s default ordering for strings is appropriate for machine consumption (e.g. implementing sorted collections). It obeys Unicode canonical equivalence[2], that is strings compare the same modulo normalization. However, it is not meant to be sufficient for presenting a meaningful ordering to human consumers, as that requires incorporating reader-specific information (e.g. [3]).

Known-ASCII strings are a trivial case for the described sort order semantics: pure ASCII is unaffected by normalization. Thus, lexicographical ordering of code units is a valid machine ordering for ASCII strings. On Darwin, this is used to order known-ASCII strings while Linux uses UCA even for known-ASCII strings.

Long term, the plan is to switch String’s sort order to be the lexicographical ordering of normalized code units (or perhaps scalar values), as mentioned in the String Manifesto[4]. This is a more efficient ordering than that provided by UCA. However, this will not make it in time for Swift 4.0.

Changes

I propose to change Linux’s sort order for known-ASCII strings to be the same as it is on Darwin. This will be accomplished by dropping the relevant if guards in StringCompare.swift. An example implementation can be found at [5].

In addition to unifying sort order semantics across platforms, this will also deliver significant performance boosts to pure ASCII strings on Linux.

[1] UTS #10: Unicode Collation Algorithm <http://unicode.org/reports/tr10/&gt;
[2] Canonical Equivalence in Applications <http://unicode.org/notes/tn5/&gt;
[3] UCA: Contextual Sensitivity <UTS #10: Unicode Collation Algorithm;
[4] String Manifesto: Comparing and Hashing Strings <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings&gt;
[5] Unifying Linux/Darwin ASCII sort order semantics - github <https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b&gt;\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Kitura does not mandate any particular distribution, but 14.04 is one of
the distributions that we expect to support for some time.

I did investigate what's required to build a Swift snapshot with a newer
version of ICU on Linux. I was pleased to discover that of the support for
doing so is already there, however the Foundation build needs tweaking to
point to the newly built version. I opened a couple of PRs to fix it, but
then someone pointed out an existing (unmerged) PR that fixes it in a
simpler way:

With #722, I can build a new snapshot on Ubuntu 14.04 or 16.04 using ICU
57.1 from ICU - International Components for Unicode - Download ICU 57
I'm using the buildbot_linux preset, so I added libicu=true and
install-libicu to the preset in utils/build-presets.ini

With this build, Kitura's 14.04 performance is much closer to the memcmp
version (leaving around 8% gap, similar to 16.04, which is what I'd expect):

Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40 8.131616
7.634630 7.959684
swift kitura_40_memcmp 5.072511
4.784608 4.963862
swift kitura_40_icu 5.523411
5.297158 5.367100

···

On 31 July 2017 at 16:57, David Jones <djones6dev@gmail.com> wrote:

Kitura does not mandate any particular distribution, but 14.04 is one of
the distributions that we expect to support for some time.

I did investigate what's required to build a Swift snapshot with a newer
version of ICU on Linux. I was pleased to discover that of the support for
doing so is already there, however the Foundation build needs tweaking to
point to the newly built version. I opened a couple of PRs to fix it, but
then someone pointed out an existing (unmerged) PR that fixes it in a
simpler way:
Use PKG_CONFIG for libicu foundation.CFLAGS by spevans · Pull Request #722 · apple/swift-corelibs-foundation · GitHub

With #722, I can build a new snapshot on Ubuntu 14.04 or 16.04 using ICU
57.1 from ICU - International Components for Unicode - Download ICU 57
I'm using the buildbot_linux preset, so I added libicu=true and
install-libicu to the preset in utils/build-presets.ini

With this build, Kitura's 14.04 performance is much closer to the memcmp
version (leaving around 8% gap, similar to 16.04, which is what I'd expect):

Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40
8.131616 7.634630 7.959684
swift kitura_40_memcmp
5.072511 4.784608 4.963862
swift kitura_40_icu
5.523411 5.297158 5.367100

On 27 July 2017 at 18:48, Michael Ilseman <milseman@apple.com> wrote:

The performance issues in Ubuntu 14.04 is not surprising. 14.04’s ICU
version is ancient, relatively speaking (e.g. it predates both Swift and
Unicode 7.0). Performance issues is not the only problem there, as Unicode
semantics such as grapheme breaking has evolved dramatically since then.
Out of curiosity, is there a reason why Kitura is stuck on 14.04? Can your
platform bundle a newer ICU so that your users can have things like modern
emoji support?

On Jul 27, 2017, at 7:28 AM, David Jones via swift-dev < >> swift-dev@swift.org> wrote:

FWIW, here is an illustration of how much an optimization in this area
would be worth to Kitura, running on Ubuntu 14.04 (where ICU 52 is
particularly expensive when comparing ASCII strings [1]).
The workload I used here is https://github.com/tbrand/whic
h_is_the_fastest
I have compiled Kitura 1.7.6 with the latest 4.0 toolchain (07-26-a), and
again with the same toolchain modified to remove the if in StringCompare's
==.

djones6@needletail:~/which_is_the_fastest$ numactl
--physcpubind=0-3,16-19 --membind=0 bin/benchmarker kitura_40
kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40
7.824673 7.682657 7.740933
swift kitura_40_memcmp
5.163788 4.811082 4.955571

The difference here is around 35%. Other Kitura workloads I've performed
this comparison on in the past (such as involving JSON serialization) have
showed a difference in the 15 - 20% region.
The difference is far smaller on Ubuntu 16.04 (around 8% for this
workload), due to improvements in the newer level of ICU:

djones6@gruffalo:~/which_is_the_fastest$ numactl --physcpubind=0-3,16-19
--membind=0 bin/benchmarker kitura_40 kitura_40_memcmp
Language (Runtime) Framework (Middleware) Max [sec]
Min [sec] Ave [sec]
------------------------- ------------------------- ---------------
--------------- ---------------
swift kitura_40
4.691993 4.531465 4.580086
swift kitura_40_memcmp
4.349387 4.015061 4.201105

- David.
---
David Jones, Swift@IBM

[1] https://github.com/apple/swift/pull/7339

On 26 July 2017 at 03:59, Michael Ilseman via swift-dev < >> swift-dev@swift.org> wrote:

Unfortunately after some investigations and discussion, the situation
seems to be more murky. This approach would break transitivity of String
comparison on Linux, at least with any implementation of UCA using the
normal collation weights. A < B, B < C should imply A < C. But, if both A
and B are known-ASCII while C is UTF16, transitivity can be broken for any
character that UCA yields a different sort order for (e.g. “#” vs “&”). On
Darwin, the comparison implementation happens to preserve transitivity as
the platform (in effect) relatively weights ASCII by code unit values.

While I would like to get some performance improvements in time for
Linux, I don’t think this approach is viable for Swift 4.0. Unless anyone
has any ideas about another minimally invasive approach, my recommendation
is to do the long-term solution (lexicographical order of normalized code
units) immediately after Swift 4.0.

On Jul 25, 2017, at 2:01 PM, Michael Ilseman via swift-dev < >>> swift-dev@swift.org> wrote:

On Darwin, known-ASCII strings are sorted according to the
lexicographical ordering of their code units. All non-known-ASCII strings
are otherwise ordered based on the UCA[1]. On Linux, however, even
known-ASCII strings are ordered based on UCA. I propose to unify these by
changing Linux’s string sort order to match Darwin’s in Swift 4.0.
Background

Swift’s default ordering for strings is appropriate for machine
consumption (e.g. implementing sorted collections). It obeys Unicode
canonical equivalence[2], that is strings compare the same modulo
normalization. However, it is not meant to be sufficient for presenting a
meaningful ordering to human consumers, as that requires incorporating
reader-specific information (e.g. [3]).

Known-ASCII strings are a trivial case for the described sort order
semantics: pure ASCII is unaffected by normalization. Thus, lexicographical
ordering of code units is a valid machine ordering for ASCII strings. On
Darwin, this is used to order known-ASCII strings while Linux uses UCA even
for known-ASCII strings.

Long term, the plan is to switch String’s sort order to be the
lexicographical ordering of normalized code units (or perhaps scalar
values), as mentioned in the String Manifesto[4]. This is a more efficient
ordering than that provided by UCA. However, this will not make it in time
for Swift 4.0.
Changes

I propose to change Linux’s sort order for known-ASCII strings to be the
same as it is on Darwin. This will be accomplished by dropping the relevant
if guards in StringCompare.swift. An example implementation can be
found at [5].

In addition to unifying sort order semantics across platforms, this will
also deliver significant performance boosts to pure ASCII strings on Linux.

[1] UTS #10: Unicode Collation Algorithm
<http://unicode.org/reports/tr10/&gt;

[2] Canonical Equivalence in Applications
<http://unicode.org/notes/tn5/&gt;

[3] UCA: Contextual Sensitivity
<UTS #10: Unicode Collation Algorithm;

[4] String Manifesto: Comparing and Hashing Strings
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings&gt;

[5] Unifying Linux/Darwin ASCII sort order semantics - github
<https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b&gt;
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev