Swift String problem with Thai language

pitiphong.p · April 1, 2017, 7:52am

Hello Swift Community,

I’ve found a problem with the Swift String API and the Thai language. Thai has 44 consonants, 32 vowels and 5 tone marks. A special attribute of Thai vowels is that they can be placed anywhere around a consonant: some are placed after the consonant (ชา), some before (แช), some above (ชี) and some below (ชุ). Since every vowel must accompany a consonant but the vowels sit in different places around it, the Unicode standard classifies some Thai vowels as Grapheme Base and others as Grapheme Extend.

Because Swift String is fully Unicode compliant, classifying some vowels as Grapheme Extend gives those vowels behavior that looks wrong from a Thai point of view. For example, the words “ชี” (a nun) and “ชา” (tea) each have one consonant (ช in both cases) and one vowel (ี and า respectively). When we ask how many characters each word has, or whether the word contains the character ช, we should get the same answers for both words (2 characters, and it contains ช). However, I found that the Swift String API gives different answers for the two words.

// You can try this code snippet in a Swift Playground
let chi = "ชี"
let cha = "ชา"

// Value of these 2 lines below should be 2
chi.characters.count
cha.characters.count

// Value of these 3 lines below should be true
chi.contains("ช")
cha.contains("ช")
chi.characters.contains("ช")

// end of code snippet
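
For reference, here is a minimal sketch of what this comparison reports for all four vowel positions mentioned above. It assumes Swift 4 or later, where `String` is itself a collection of `Character` values, so `count` and `contains` stand in for the `characters` view used in the snippet. The names `words` and `cho` are only illustrative.

```swift
// Assumption: Swift 4 or later, where String is a collection of Characters.
// `words` and `cho` are illustrative names, not part of any API.
let words = ["ชา", "แช", "ชี", "ชุ"]
let cho: Character = "ช"

for word in words {
    // `count` is the number of extended grapheme clusters in the word;
    // `contains` looks for ช as a whole Character, not as a code point.
    print(word, word.count, word.contains(cho))
}
```

The words with the vowel after or before the consonant come out as 2 characters that include ช, while the words with the vowel above or below come out as 1 character that is not equal to ช.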

I’m not sure whether the Swift team is aware of this problem or has an opinion on it. I know that Unicode is very hard, and I’m aware that the String API will be revamped in Swift 4, so I wanted to raise this for discussion before Swift 4 is released.

Thank you,
Bank (Pitiphong)

Joe_Groff · April 3, 2017, 9:33pm

It'd be better to file a bug about this. It's not really an evolution topic.

-Joe


xwu · April 3, 2017, 9:37pm

A Swift "character" models a Unicode extended grapheme cluster, which may or may not be a character with respect to a human language.

I don't speak Thai, but do you have a reference saying that the Unicode standard regards these as two separate extended grapheme clusters? Unless I'm mistaken, according to the reference (Unicode Utilities: Character Properties), the combining vowel U+0E35 extends the grapheme cluster. Therefore, Unicode rules require this to be reported as one extended grapheme cluster, which would make this one single Swift "character". The remaining behaviors appear to be consistent with the Unicode standard as well.
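
The property lookup described here can also be checked from Swift itself. The following is a small sketch, assuming Swift 5 or later, where `Unicode.Scalar` exposes a `properties` view; the constant names are illustrative.

```swift
// Assumption: Swift 5 or later, which provides Unicode.Scalar.Properties.
let saraII: Unicode.Scalar = "\u{0E35}"  // vowel written above the consonant, as in ชี
let saraAA: Unicode.Scalar = "\u{0E32}"  // vowel written after the consonant, as in ชา

// Grapheme_Extend scalars attach to the preceding base and never start a new cluster.
print(saraII.properties.isGraphemeExtend)  // true
print(saraAA.properties.isGraphemeExtend)  // false
print(saraAA.properties.isGraphemeBase)    // true
```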

It sounds like you want to query by Unicode scalar, which you can do:


chi.unicodeScalars.contains("ช") // true
cha.unicodeScalars.contains("ช") // true
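
To see why the scalar view answers differently from the character view, it can help to list the scalars inside each `Character`. A rough sketch follows, assuming Swift 4 or later; `scalarBreakdown(of:)` is a hypothetical helper written for this illustration, not a standard library function.

```swift
// Hypothetical helper: lists the code points that make up each Character.
func scalarBreakdown(of text: String) -> [[String]] {
    return text.map { character in
        character.unicodeScalars.map { scalar -> String in
            let hex = String(scalar.value, radix: 16, uppercase: true)
            // Left-pad to at least four hex digits, e.g. "E0A" -> "U+0E0A".
            return "U+" + String(repeating: "0", count: max(0, 4 - hex.count)) + hex
        }
    }
}

let chi = "\u{0E0A}\u{0E35}"  // ชี
let cha = "\u{0E0A}\u{0E32}"  // ชา

print(scalarBreakdown(of: chi))  // [["U+0E0A", "U+0E35"]]
print(scalarBreakdown(of: cha))  // [["U+0E0A"], ["U+0E32"]]
print(chi.unicodeScalars.count, cha.unicodeScalars.count)  // 2 2
```

Both words hold two scalars starting with U+0E0A (ช); only the grouping into characters differs.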

From some brief reading, it appears that there's some disagreement about how Unicode treats Thai (http://www.thai-language.com/forums/t/linguistics/writing/t14254). However, Swift strictly implements Unicode, so the Unicode Consortium would be the forum to consider any such issues.

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

pitiphong.p · April 4, 2017, 8:58am

I understand how Unicode and grapheme clusters work, and I know that Swift String is fully Unicode compliant. The problem I'm trying to raise is that although U+0E35 and U+0E32 have different grapheme properties, in the Thai language both vowels behave the same way, so they should have the same Unicode properties and return the same results in the sample code.

I understand this may be a problem in the Unicode specification rather than in Swift. I just wanted to bring it up so that the Swift team is aware of it and can consider it for discussion.
