Splitting a string into "natural/visual character" components?


(Jens Persson) #1

I want a function f such that:

f("abc") == ["a", "b", "c"]

f("café") == ["c", "a", "f", "é"]

f("👨‍👩‍👧‍👦👷🏾‍♀️") == ["👨‍👩‍👧‍👦", "👷🏾‍♀️"]

I'm not sure if the last example renders correctly by mail for everyone but
the input String contains these _two_ "natural/visual characters":
(1) A family emoji
(2) a construction worker (woman, with skin tone modifier) emoji.
and the result is an Array of two strings (one for each emoji).

The first two examples are easy, the third example is the tricky one.

Is there a (practical) way to do this (in Swift 3)?

/Jens

PS

It's OK if the function has to depend on e.g. a graphics context etc.
(I tried writing a function so that it extracts the glyphs, using
NSTextStorage, NSLayoutManager and the AppleColorEmoji font, but it says
that "👨‍👩‍👧‍👦👷🏾‍♀️" contains 18(!) glyphs, whereas e.g. "café" contains
4 as expected.)

If the emojis of the third example don't look like they should in this
mail, here is another way to write the exact same example using only simple
text:

let inputOfThirdExample =
"\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}\u{1F477}\u{1F3FE}\u{200D}\u{2640}\u{FE0F}"

let result = f(inputOfThirdExample)

let expectedResult =
["\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}",
"\u{1F477}\u{1F3FE}\u{200D}\u{2640}\u{FE0F}"]

print(result.elementsEqual(expectedResult)) // Should print true
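
For reference, here is a minimal sketch of how the third example's input looks at two lower levels of the string. It uses only the standard library; the names `input`, `scalarCount` and `utf16Count` are made up for this illustration. The UTF-16 length happens to match the 18 glyphs mentioned in the PS, which might explain that number, but that is only a guess.

```swift
// The third example, written with escapes so it survives any mail client.
let input = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}\u{1F477}\u{1F3FE}\u{200D}\u{2640}\u{FE0F}"

// Number of Unicode scalars (code points): 7 for the family, 5 for the worker.
let scalarCount = input.unicodeScalars.count

// Number of UTF-16 code units; scalars above U+FFFF take two units each.
let utf16Count = input.utf16.count

print(scalarCount, utf16Count) // 12 18
```

Neither count is the 2 "visual characters" asked for, which is why a grapheme-aware split is needed.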


(Jens Persson) #2

FWIW: I can conclude that the third example does not render correctly in
Gmail ...


(Martin R) #3

The enumerateSubstrings method of (NS)String has a .byComposedCharacterSequences option which causes Emoji sequences like "👨‍👩‍👧‍👦" to be treated as a single unit:

    import Foundation

    // Split a string into composed character sequences using Foundation.
    func f(_ s: String) -> [String] {
        var a: [String] = []
        s.enumerateSubstrings(in: s.startIndex..<s.endIndex, options: .byComposedCharacterSequences) {
            (c, _, _, _) in
            // c is the substring for the current composed character sequence.
            if let c = c { a.append(c) }
        }
        return a
    }

    print(f("👨‍👩‍👧‍👦👷🏾‍♀️")) // ["👨‍👩‍👧‍👦", "👷🏾‍♀️"]

As I understand it from https://oleb.net/blog/2016/12/emoji-4-0/, Emoji sequences are considered as a single grapheme cluster in Unicode 9, which means that you can simply do something like

    Array("👨‍👩‍👧‍👦👷🏾‍♀️".characters)

once Unicode 9 is adopted in Swift.
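
A minimal sketch of what that could look like, assuming a toolchain (Swift 4 or later) where String is itself a collection of Characters and the grapheme breaking rules already cover emoji ZWJ and modifier sequences. The name `splitIntoCharacters` is hypothetical, not an existing API, and the result depends on the Unicode support of the standard library in use.

```swift
// Hypothetical stand-in for f: one String per Character
// (extended grapheme cluster), with no Foundation dependency.
func splitIntoCharacters(_ s: String) -> [String] {
    // Iterating a String yields Characters; wrap each one in a String.
    return s.map { String($0) }
}

// The third example from the original question, written with escapes.
let input = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}\u{1F477}\u{1F3FE}\u{200D}\u{2640}\u{FE0F}"
let expected = ["\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}",
                "\u{1F477}\u{1F3FE}\u{200D}\u{2640}\u{FE0F}"]

let parts = splitIntoCharacters(input)
print(parts == expected)
```

On Swift 3 this approach is not available, so the enumerateSubstrings version above remains the practical choice there.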

Regards, Martin

_______________________________________________
swift-users mailing list
swift-users@swift.org
https://lists.swift.org/mailman/listinfo/swift-users


(Jens Persson) #4

Ah, thanks!
