Combining Skin Tone Emoji Into Single Extended Grapheme Clusters

Hello,

I would like to fix rdar://20511834 , which is that the new skin tone and
multi-person grouping emoji introduced with iOS 8.3 and OS X 10.10.3 are
represented as multiple extended grapheme clusters by Swift.String, and I
have a few questions.

1. Is this something we want to fix at this time, considering these emoji
are part of UTR #51 but not yet part of an official Unicode standard, or
do we want to wait for these emoji groupings to be published as part of
the Unicode standard?

2. What is the best way to fix this?

3. Is this a significant enough change to warrant going through the entire
swift-evolution process, or a simple bug fix?

In terms of the first question, I am not too familiar with the Unicode
standards process. It appears to me that although UTRs aren't formal
standards, code that conforms to the Unicode standard is free to conform
to UTRs as well.

Currently, extended grapheme clusters are computed using
GraphemeBreakProperty.txt, which is supplied as part of Unicode Standard
Annex #44 (UAX #44), and which does not yet group skin-tone emoji or
emoji sequences. UTR #51 includes an emoji-data.txt file which, while
slightly outdated compared to the UTR itself, does contain enough
information to group these emoji properly.
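To make the problem concrete, here is a quick Python sketch (illustrative
only; the Swift implementation is of course not in Python) showing what one
affected sequence looks like at the scalar level:

```python
# A skin-tone emoji is a base emoji followed by a Fitzpatrick skin tone
# modifier -- two Unicode scalars that should render as a single glyph.
thumbs_up_light = "\U0001F44D\U0001F3FB"  # THUMBS UP SIGN + TYPE-1-2 MODIFIER

scalars = [hex(ord(c)) for c in thumbs_up_light]
print(scalars)  # ['0x1f44d', '0x1f3fb']

# GraphemeBreakProperty.txt as it stands gives no rule joining these two
# scalars, so they come out as two extended grapheme clusters.
```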

We could currently pull in emoji-data.txt, or some other data source, and
use it to group these emoji, but there is a chance that
GraphemeBreakProperty.txt will be updated to include these groupings in the
future (though probably not until the next version of Unicode), at which
point we'd have to reverse this work and reimplement it using
GraphemeBreakProperty.txt.

For the second question, I have put together a simple implementation of
these emoji groupings using emoji-data.txt. A diff can be found at
https://github.com/MichaelBuckley/swift/pull/1

This implementation merely adds a few new character classes to the Unicode
trie, pulled mainly from emoji-data.txt, though the Zero Width Joiner (ZWJ)
is given its own hardcoded character class.

As I see it, there are three disadvantages to this approach: The hardcoded
character class, the reliance on a second emoji data file, and the fact
that the trie bitmap had to be extended from 16 bits to 32 bits. This last
change was probably inevitable in the future, and it only increases the
trie size by 4096 bytes (from 18961 bytes to 23057 bytes).
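As a quick sanity check on those byte counts (the 2048-entry figure below is
my own inference from the arithmetic, not something taken from the Swift
sources):

```python
# Widening a bitmap entry from 16 to 32 bits adds 2 bytes per entry, so a
# 4096-byte increase corresponds to 2048 widened entries.
old_size, new_size = 18961, 23057
delta = new_size - old_size
print(delta)       # 4096
print(delta // 2)  # 2048 entries, assuming only the bitmap grew
```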

Still, it's possible that there is a much better way to implement this fix,
and I was hoping to get some feedback from the designer(s) of the current
Unicode trie code.

But probably the biggest reason for seeking an alternative implementation
is that the existing behavior is not always incorrect. It's incorrect on
the most recent versions of Apple operating systems when rendered in
contexts that support these emoji, but it's correct everywhere else. It
seems as though Swift perhaps needs to allow users to change grapheme
clustering behavior based on a user setting, and perhaps even allow users
to specify which unicode version they want to use, but that's a much larger
change, which may not be worth the costs.

Finally, for question 3, I'm on the fence as to whether this is a
significant enough change to warrant going through the swift-evolution
process. On the one hand, it seems like a simple bug fix, and there's
already a radar tracking it as such. No one would require a swift-evolution
proposal to fix a compiler crash, for example. But at the same time, it
would also change the runtime behavior or anyone relying on the existing
behavior, which is problematic because, as pointed out before, the existing
behavior is not always wrong.

Anyway, I was hoping to get some guidance on where the line is, and what
side of the line this bug fix is on. Thanks!

Hi Michael,

Hello,

I would like to fix rdar://20511834 , which is that the new skin tone and
multi-person grouping emoji introduced with iOS 8.3 and OS X 10.10.3 are
represented as multiple extended grapheme clusters by Swift.String, and I
have a few questions.

1. Is this something we want to fix at this time, considering these emoji
are part of UTR #51 but not yet part of an official Unicode standard, or
do we want to wait for these emoji groupings to be published as part of
the Unicode standard?

The issue you are describing is indeed important. There are multiple
considerations here. One of them is that we currently describe the
segmentation that Swift performs as "extended grapheme clusters", which
has a precise definition in the spec. Changing that would mean introducing
tailoring, which is allowed by the spec, but the algorithm would then be
custom.

One thing to do would be to check Apple's ICU implementation, which (I
think) implements some extra handling for UTR #51 (
http://opensource.apple.com/release/os-x-1011/) to see how it deals with
this, whether it introduces tailoring, and if so, in what way.

2. What is the best way to fix this?

My primary concern with the fix in the PR is that it seems to change the
segmentation behavior for other sequences. The grapheme cluster
segmentation algorithm is local and stateless. It only looks at two
adjacent Unicode scalars. This means that adding a rule like "ZWJ
no_boundary Emoji" will affect all sequences, even those that are not a
grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
the three scalars would be grouped).
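The overreach is easy to demonstrate. Here is a hedged Python sketch of a
purely pairwise boundary function with the naive "ZWJ no_boundary Emoji"
rule added (the `is_emoji` range is a rough stand-in for the real Emoji
property, not the actual emoji-data.txt data):

```python
ZWJ = "\u200d"

def is_emoji(cp):
    # Rough stand-in for the Emoji property from emoji-data.txt.
    return 0x1F300 <= cp <= 0x1FAFF

def boundary(a, b):
    """Decide a break by looking at only two adjacent scalars."""
    if b == ZWJ:
        return False  # ZWJ is Extend: never break before it
    if a == ZWJ and is_emoji(ord(b)):
        return False  # the naive new rule: it cannot see what preceded the ZWJ
    return True

text = "a" + ZWJ + "\U0001F415"  # Latin letter, ZWJ, DOG FACE
breaks = [boundary(x, y) for x, y in zip(text, text[1:])]
print(breaks)  # [False, False]: all three scalars fuse into one cluster
```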

This is the same issue as multiple flags pasted together (which are
represented as regional indicator characters). The current algorithm simply
does not have enough information to split them apart; it needs to look at a
wider part of the string.
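A Python sketch of the flags case in the same illustrative spirit: a
pairwise "RI x RI: no break" rule fuses any run of regional indicators, so
two adjacent flags become one cluster:

```python
def is_ri(c):
    # REGIONAL INDICATOR SYMBOL LETTER A..Z
    return 0x1F1E6 <= ord(c) <= 0x1F1FF

# FR (U+1F1EB U+1F1F7) immediately followed by DE (U+1F1E9 U+1F1EA).
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"

# Every adjacent pair is RI x RI, so a pairwise rule sees no place to break:
pairs = [is_ri(a) and is_ri(b) for a, b in zip(flags, flags[1:])]
print(pairs)  # [True, True, True] -> four scalars, one cluster, two flags lost
```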

I would be much happier with a solution that only changed the segmentation
for the cases covered by the TR, but I understand it might have performance
implications. I think we should try to add such a tailoring and benchmark
it.

3. Is this a significant enough change to warrant going through the entire

swift-evolution process, or a simple bug fix?

The change that adds the first tailoring to the algorithm might be
significant enough. But I think it would be a question of whether we want
any tailoring at all, not about a specific tailoring.

As I see it, there are three disadvantages to this approach: The hardcoded
character class, the reliance on a second emoji data file, and the fact
that the trie bitmap had to be extended from 16 bits to 32 bits. This last
change was probably inevitable in the future, and it only increases the
trie size by 4096 bytes (from 18961 bytes to 23057 bytes).

In my opinion, the biggest disadvantage is that it would change
segmentation for other sequences.

But probably the biggest reason for seeking an alternative implementation
is that the existing behavior is not always incorrect. It's incorrect on
the most recent versions of Apple operating systems when rendered in
contexts that support these emoji, but it's correct everywhere else. It
seems as though Swift perhaps needs to allow users to change grapheme
clustering behavior based on a user setting, and perhaps even allow users
to specify which unicode version they want to use, but that's a much larger
change, which may not be worth the costs.

I would prefer to avoid platform-specific differences here. Unicode is
hard enough as it is, and adding context-sensitivity to algorithms in an
unusual way (that is, not through existing established mechanisms like
locales) just invites interoperability issues.

Dmitri

···

On Thu, Dec 17, 2015 at 9:16 PM, Michael Buckley via swift-dev <swift-dev@swift.org> wrote:

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

Thanks for the response, Dmitri. My comments are inline below.

One thing to do would be to check Apple's ICU implementation, which (I
think) implements some extra handling for UTR #51 (
http://opensource.apple.com/release/os-x-1011/) to see how it deals with
this, whether it introduces tailoring, and if so, in what way.

I will look into that. I had always thought that would have been part of
Core Text, and not open sourced. It is great to know that it is
open-sourced.

My primary concern with the fix in the PR is that it seems to change the

segmentation behavior for other sequences. The grapheme cluster
segmentation algorithm is local and stateless. It only looks at two
adjacent Unicode scalars. This means that adding a rule like "ZWJ
no_boundary Emoji" will affect all sequences, even those that are not a
grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
the three scalars would be grouped).

Apologies, I forgot to mention that disadvantage. It does change the
segmentation behavior for other sequences, which was one of the reasons I
was on the fence about whether this should go through the swift-evolution
process.

This is the same issue as multiple flags pasted together (which are
represented as regional indicator characters). The current algorithm simply
does not have enough information to split them apart; it needs to look at a
wider part of the string.

I could be reading the Unicode standard incorrectly, but it appears that
this might be the intended behavior for the flag characters. I definitely
agree that it's not ideal.

I would be much happier with a solution that only changed the segmentation
for the cases covered by the TR, but I understand it might have performance
implications. I think we should try to add such a tailoring and benchmark
it.

Just so that I understand what you mean by tailoring, you mean switching to
a possibly stateful algorithm which can consider more than just two
adjacent characters when grouping, right?

The change that adds the first tailoring to the algorithm might be
significant enough. But I think it would be a question of whether we want
any tailoring at all, not about a specific tailoring.

Thanks for the clarification. Just to be sure, if this change weren't as
problematic, but still changed the behavior of Swift.String, you're saying
it would not be important enough for swift-evolution? As a concrete
example, if I were just proposing to fix the skin tone emoji, but not the
ZWJ sequences, would it be considered just a bug fix?

···

On Fri, Dec 18, 2015 at 3:29 AM, Dmitri Gribenko <gribozavr@gmail.com> wrote:

After reading through the ICU sources, if I understand them correctly, ICU
uses the Aho-Corasick algorithm to determine grapheme breaks, word breaks,
and line breaks, and then does some post-processing after matching.

This allows ICU to solve the regional indicator problem by including a
pattern that matches 3 regional indicator characters in a row and inserts a
grapheme break after the second. This does not actually modify the string
by adding a zero-width space or something.
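Here is a hedged Python sketch of that post-processing step (my
reconstruction of the idea for illustration, not ICU's actual code): split
any run of regional indicators into pairs, one cluster per flag:

```python
def is_ri(c):
    # REGIONAL INDICATOR SYMBOL LETTER A..Z
    return 0x1F1E6 <= ord(c) <= 0x1F1FF

def split_flag_runs(s):
    """Group scalars into clusters, inserting a break after every second
    regional indicator in a run, as described above."""
    clusters, run = [], ""
    for ch in s:
        if is_ri(ch):
            run += ch
            if len(run) == 2:
                clusters.append(run)
                run = ""
        else:
            if run:
                clusters.append(run)
                run = ""
            clusters.append(ch)
    if run:
        clusters.append(run)
    return clusters

fr_de = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"  # FR then DE
print(len(split_flag_runs(fr_de)))  # 2 clusters, one per flag
```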

While this approach can solve the regional indicator problem efficiently,
it cannot solve the problem with the zero-width joiner emoji sequences as
easily. This is because Aho-Corasick is linear in the length of the text
plus the number of patterns plus the number of matches, and the emoji
problem would require a pattern for every emoji sequence we want to
support.

However, after reading UTR #51 again, we may want to treat all emoji joined
by a ZWJ as a single extended grapheme cluster, whether they form a known
sequence or not. That's because UTR#51 leaves the exact sequences as
implementation-defined. It includes a list of currently-known implemented
sequences, but allows for implementers to add their own sequences.

This means that Ubuntu could, for example, support a sequence of DOG FACE
+ ZWJ + PILE OF POO, and represent it with a glyph of a dog doing its
business. We basically have two options here. We could treat Swift as an
Apple-platform-centric language and implement only the sequences that
appear on Apple platforms, or we could implement a rule that any emoji + ZWJ
+ any emoji has no break. As Dmitri pointed out, this would mean Swift
would report strings of invalid sequences as a single character, which
could be confusing. But I posit that the situation we have now, reporting
valid strings as multiple characters, is also confusing, and much more
likely. It's unlikely that anyone is going to stick a ZWJ between emoji
unless they intend to make a sequence from it.
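A hedged Python sketch of that rule (my own toy segmenter with a crude
stand-in for the Emoji property, not the proposed Swift patch): ZWJ always
extends the preceding cluster, and the scalar after it joins as well only
when both neighbors are emoji:

```python
ZWJ = "\u200d"

def is_emoji(cp):
    # Crude stand-in for the Emoji property from emoji-data.txt.
    return 0x1F300 <= cp <= 0x1FAFF

def clusters(s):
    """Segment s under the 'any emoji + ZWJ + any emoji: no break' rule."""
    out, i = [], 0
    while i < len(s):
        j = i + 1
        while j < len(s) and s[j] == ZWJ:
            j += 1  # ZWJ (Grapheme_Cluster_Break=Extend) never starts a cluster
            if j < len(s) and is_emoji(ord(s[j - 2])) and is_emoji(ord(s[j])):
                j += 1  # both neighbors carry the Emoji property: keep joining
        out.append(s[i:j])
        i = j
    return out

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # MAN ZWJ WOMAN ZWJ GIRL
print(len(clusters(family)))               # 1: the whole sequence is one cluster
print(len(clusters("a\u200d\U0001F415")))  # 2: letter + ZWJ, then DOG FACE
```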

Incidentally, this is what ICU does. You can test this yourself in TextEdit
by typing HEAVY BLACK HEART followed by ZWJ ad infinitum, then pressing the
left arrow key once and watching TextEdit treat the sequence as a single
character, causing the cursor to jump to the beginning of the string. ICU,
however, does hard-code the emoji that are currently used by Apple emoji
sequences, so you can't do the same thing with PILE OF POO. This makes
sense in an ICU context, since it's only implementing the Apple sequences,
but if we want Swift to be more platform-agnostic, we would want this
behavior for any emoji.

ICU's implementation fixes the regional indicator problem, but the
implementation is large and moderately complicated. Just throwing this out
there, but would it be possible to add ICU as a dependency to Swift and
just use its implementation? I'm sure this would be a nightmare to work out
license and logistics-wise. (It would probably necessitate that ICU
development be opened up to the same degree that other Swift dependencies
are). I also understand that adding any dependencies at all is less than
ideal. But this seems like a perfect situation for some code sharing. We
have a moderately large and complicated library that is being updated with
new Emoji support when new Emoji are added anyway. It's fast, it's already
well-used, and we'd have to duplicate a lot of what it does to solve the
same problems if we didn't use it.

As a bonus, we could link to the system-supplied libicu on OS X and iOS, so
Swift apps would automatically get the latest emoji support when users
update their OSs. We would still have to bundle it for other OSs.

I know that there are a lot of downsides to making it a dependency, but I
wanted to throw the idea out there to see if it made sense.

···

On Fri, Dec 18, 2015 at 6:22 AM, Michael Buckley <michael@buckleyisms.com> wrote:

It actually appears that Swift already links against ICU. I'll see if I can
hook Swift up to ICU's grapheme segmentation code.

···

On Sun, Dec 20, 2015 at 10:41 PM, Michael Buckley <michael@buckleyisms.com> wrote:

Hi Michael,

Thank you for the investigation. Yes, calling into ICU for this would be an
interesting direction to explore. One thing we would want to do is measure
the performance and compare it to the current implementation.

Dmitri

···

On Tue, Dec 22, 2015 at 12:10 AM, Michael Buckley via swift-dev <swift-dev@swift.org> wrote:

It actually appears that Swift already links against ICU. I'll see if I
can hook Swift up to ICU's grapheme separation code.
