CharacterSet vs Set<Character>


(Jean-Denis Muys) #1

I was playing with CharacterSet, and I came up with:

let vowels = CharacterSet(charactersIn: "AEIOU")

let char: Character = "E"

vowels.contains(char)

That last line doesn't compile: I get "*cannot convert value of type
'Character' to expected argument type 'UnicodeScalar'*"

The problem is, I could not find a simple way to convert from a Character
to a UnicodeScalar. The best I found is the very ugly:

vowels.contains(String(char).unicodeScalars[String(char).unicodeScalars.startIndex])

Did I miss anything? Does it have to be that horrific?

If so, I find using Set much better:

let vowelsSet: Set<Character> = Set("AEIOU".characters)

vowelsSet.contains(char)

I must have missed something. Any suggestion welcome

Jean-Denis


(Dave Abrahams) #2

I was playing with CharacterSet, and I came up with:

let vowels = CharacterSet(charactersIn: "AEIOU")

Yeah, because CharacterSet is a set of UnicodeScalars, not a set of
Character. That should probably get fixed somehow. I suggest filing a
radar against Foundation.

let char: Character = "E"

vowels.contains(char)

That last line doesn't compile: I get "*cannot convert value of type
'Character' to expected argument type 'UnicodeScalar'*"

The problem is, I could not find a simple way to convert from a character
to a unicodeScalar. The best I found is the very ugly:

vowels.contains(String(char).unicodeScalars[String(char).unicodeScalars.startIndex])

Did I miss anything?

  vowels.contains(String(char).unicodeScalars.first!)

Does it have to be that horrific?

For now, I'm afraid I don't have anything better for you. I very much
hope to improve String usability substantially for Swift 4.
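
A minimal sketch of how the conversion could be hidden behind a helper, assuming Swift 5 or later with Foundation. The name `containsCharacter` is hypothetical and not a Foundation API; it treats a Character as a member only when every one of its Unicode scalars is in the set.

```swift
import Foundation

extension CharacterSet {
    // Hypothetical helper, not part of Foundation: a Character may consist of
    // several Unicode scalars, so require all of them to be in the set.
    func containsCharacter(_ character: Character) -> Bool {
        return character.unicodeScalars.allSatisfy { self.contains($0) }
    }
}

let vowels = CharacterSet(charactersIn: "AEIOU")
let char: Character = "E"
print(vowels.containsCharacter(char))
```

Requiring all scalars is one possible design choice; checking only the first scalar, as in the one-liner above, would accept a base letter followed by any combining mark.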

···

on Sun Oct 02 2016, Jean-Denis Muys <swift-users-AT-swift.org> wrote:

If so, I find using Set much better:

let vowelsSet: Set<Character> = Set("AEIOU".characters)

vowelsSet.contains(char)

I must have missed something. Any suggestion welcome

--
-Dave


(Quinn “The Eskimo!”) #3

As is often the case with string examples, it would help if you posted more about your context. With the details we have now your code could be written like this:

let vowels = CharacterSet(charactersIn: "AEIOU")
let char: UnicodeScalar = "E"
vowels.contains(char)

but I’m pretty sure that won’t help in your real app (-: So, my questions:

* Do you plan to use a fixed character set? Or is the character set itself built at runtime?

* Do you have specific knowledge of either of the inputs? Like that they’re all ASCII? Or normalised in a certain way?

* Specifically, where do the characters you’re trying to test (`char` in your example) come from? Do they represent user input, in which case they can be arbitrary Unicode? Or something more constrained?

Share and Enjoy

···

On 2 Oct 2016, at 19:02, Jean-Denis Muys via swift-users <swift-users@swift.org> wrote:

The problem is, I could not find a simple way to convert from a character to a unicodeScalar.

--
Quinn "The Eskimo!" <http://www.apple.com/developer/>
Apple Developer Relations, Developer Technical Support, Core OS/Hardware


(Jean-Denis Muys) #4

You are perfectly right. The context is playing around really, but I was more specifically writing a function counting vowels and consonants in an arbitrary string:

func countLetters(s: String) -> (vowels: Int, consonants: Int) {
    let vowels: Set<Character> = Set("AEIOU".characters)
    let consonants: Set<Character> = Set("BCDFGHJKLMNPQRSTVWXYZ".characters)

    var v = 0, c = 0

    for char in s.uppercased().characters {
        if vowels.contains(char) {
            v += 1
        }
        if consonants.contains(char) {
            c += 1
        }
    }

    return (v, c)
}

As you can see, I opted not to use CharacterSet for this case, as it looked like too much trouble.

The current goal is for me to learn Swift. Trying to extrapolate a bit on what might happen in the real world, I would tend to answer your questions thus:

* Do you plan to use a fixed character set? Or is the character set itself built at runtime?

The character set is likely to be fixed. Does this really change anything?

* Do you have specific knowledge of either of the inputs? Like that they’re all ASCII? Or normalised in a certain way?

ASCII? Probably not. Latin? Perhaps, though not obvious. For example French accented letters would probably have to be handled somehow. Greek or Cyrillic? Perhaps. Other scripts? Unlikely, but what do I know.
Normalisation: it should probably consider all variations of “é” to be the same…
Is this opening a Unicode can of worms? Possibly. I am not knowledgeable enough, but willing to learn.
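
On the normalisation point, a small sketch (assuming Swift 4 or later) of what the standard library already does: Character and String comparison is based on canonical equivalence, so the precomposed and decomposed spellings of “é” compare equal and behave as the same element in a Set<Character>. The variable names are only for illustration.

```swift
let precomposed: Character = "\u{E9}"    // é as a single scalar
let decomposed: Character = "e\u{301}"   // e followed by COMBINING ACUTE ACCENT

// Equality uses canonical equivalence, not scalar-by-scalar comparison.
print(precomposed == decomposed)
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)

// Hashing agrees with equality, so Set lookups work for either spelling.
let accented: Set<Character> = [precomposed]
print(accented.contains(decomposed))
```

This does not make “é” equal to a plain “e”; treating those as the same letter needs an explicit diacritic-insensitive comparison.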

* Specifically, where do the characters you’re trying to test (`char` in your example) come from? Do they represent user input, in which case they can be arbitrary Unicode? Or something more constrained?

User input most probably.

I tried your suggestion:

let uchar: UnicodeScalar = "E"

but it will not work with a Character variable (as opposed to a character literal):

        let uchar: UnicodeScalar = char

(Cannot convert value of type Character to specified type UnicodeScalar)

Thanks,

Jean-Denis

···

On 3 Oct 2016, at 09:43, Quinn The Eskimo! via swift-users <swift-users@swift.org> wrote:

On 2 Oct 2016, at 19:02, Jean-Denis Muys via swift-users <swift-users@swift.org> wrote:

The problem is, I could not find a simple way to convert from a character to a unicodeScalar.



(Gerriet M. Denkmann) #5

Don’t be so Europe-centric. Other people have vowels too.
Thai for example. And here Swift characters are completely useless: กี้ is one character (for Swift) but it is really one consonant + one vowel + one tone-mark.
So you will have to use unicodeScalars.
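
A short sketch of that difference, assuming Swift 4 or later. The Thai syllable is written with escapes so the three scalars are visible in the source; the variable name is only for illustration.

```swift
// กี้ = KO KAI (U+0E01) + SARA II (U+0E35) + MAI THO (U+0E49)
let thai = "\u{0E01}\u{0E35}\u{0E49}"

print(thai.count)                 // number of Characters (grapheme clusters)
print(thai.unicodeScalars.count)  // number of Unicode scalars

// Walk the scalars individually and show their code points in hex.
for scalar in thai.unicodeScalars {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}
```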

Have fun!

Gerriet.

···

On 3 Oct 2016, at 16:28, Jean-Denis Muys via swift-users <swift-users@swift.org> wrote:

ASCII? Probably not. Latin? perhaps, though not obvious. For example French accented letters would probably have to be handled somehow. Greek or Cyrillic? Perhaps. Other scripts? Unlikely, but what do I know.


(Jean-Denis Muys) #6

You are right: I don’t know much about Asian languages.

How would you go about counting consonants, vowels (and tone-marks?) in the most general way?

I think I would need to educate myself about those things. Any pointer welcome.

JD

···

On 3 Oct 2016, at 12:52, Gerriet M. Denkmann <g@mdenkmann.de> wrote:

On 3 Oct 2016, at 16:28, Jean-Denis Muys via swift-users <swift-users@swift.org> wrote:

ASCII? Probably not. Latin? perhaps, though not obvious. For example French accented letters would probably have to be handled somehow. Greek or Cyrillic? Perhaps. Other scripts? Unlikely, but what do I know.



(Gerriet M. Denkmann) #7

Iterate over unicodeScalars (in the most general case) - Swift characters are probably ok for European languages.

For each unicodeScalar (a.k.a. code point) you can use the ICU function:
  int8_t chrTyp = u_charType(codepoint);
This returns the general category value for the code point.
This gives you something like U_OTHER_PUNCTUATION, U_MATH_SYMBOL, U_OTHER_LETTER etc.
See enum UCharCategory in <http://icu-project.org/apiref/icu4c-latest/uchar_8h.html>

In European languages ignore U_NON_SPACING_MARKs.
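
For readers on Swift 5 or later, the same general category is also available from the standard library through `Unicode.Scalar.properties.generalCategory`, without calling ICU directly. A minimal sketch; the helper name `isNonspacingMark` and the sample string are made up for illustration.

```swift
// Hypothetical helper: true for scalars whose general category is Mn,
// the category ICU reports as U_NON_SPACING_MARK.
func isNonspacingMark(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .nonspacingMark
}

// "e" + COMBINING ACUTE ACCENT, then Thai KO KAI + SARA II, then "!"
let sample = "e\u{301}\u{0E01}\u{0E35}!"
for scalar in sample.unicodeScalars {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(hex)", scalar.properties.generalCategory, isNonspacingMark(scalar))
}
```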

NSString has a compare:options: method (and Swift String probably has something similar), which can take the options NSCaseInsensitiveSearch and NSDiacriticInsensitiveSearch to find equality between ‘E’, ‘e’ and è, é, Ĕ etc.
That is: for each character (or unicodeScalar) compare to a, e, i, o, u with these options.

let str = "HaÁÅǺáXeëẽêèâàZ"

for char in str.characters
{
  let vowel = isVowel( char )
  print("\(char) is \(vowel ? "vowel" : "consonant")")
}

func isVowel( _ char: Character ) -> Bool
{
  let s1 = "\(char)"
  let s2 = s1 as NSString
  let opt: NSString.CompareOptions = [.diacriticInsensitive, .caseInsensitive]

  // no idea how to do this with Strings:
  if s2.compare("a", options: opt) == .orderedSame {return true}
  if s2.compare("e", options: opt) == .orderedSame {return true}
  …
  return false
}
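
One way this might look with plain Swift Strings, as a sketch assuming Swift 5 with Foundation: `folding(options:locale:)` strips case and diacritics in one step, and the result is compared against a fixed list. The function name `isLatinVowel` and the five-letter list are assumptions for illustration, and this only covers the Latin-script case discussed here.

```swift
import Foundation

// Hypothetical helper: fold away case and diacritics, then test against
// the five base vowels. Not suitable for Thai or other scripts.
func isLatinVowel(_ char: Character) -> Bool {
    let folded = String(char).folding(
        options: [.diacriticInsensitive, .caseInsensitive],
        locale: nil)
    return ["a", "e", "i", "o", "u"].contains(folded)
}

for char in "HaÁeëêZ" {
    print("\(char) is \(isLatinVowel(char) ? "a vowel" : "not a vowel")")
}
```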

If you really want to use Thai, then do NOT ignore U_NON_SPACING_MARKs because some vowels are classified thusly.
U+0E01 … U+0E2E are consonants, U+0E30 … U+0E39 and U+0E40 … U+0E44 are vowels.
But then: ‘อ’ is sometimes a (silent) consonant (อยาก), sometimes a vowel (บอ), sometimes part of a vowel (มือ), sometimes part of a diphthong (เบื่อ).
Similar for ย: normal consonant (ยาก), part of vowel (ไทย) or diphthong (เมีย).
In the latter case only ม is a consonant, the rest is one single diphthong and ี is a U_NON_SPACING_MARK which really is a vowel.
Oh, and don't forget the ligatures ฤ, ฤๅ, ฦ, ฦๅ. These are both a consonant and a vowel. Same for ำ: not a ligature but a vowel + consonant.
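
The code point ranges above can be sketched as a simple per-scalar classifier, assuming Swift 4 or later. The names `ThaiLetterKind` and `thaiKind(of:)` are hypothetical, and the sketch deliberately ignores the context-dependent cases (อ, ย, the ligatures, ำ) just described, so it is only a first approximation.

```swift
// Hypothetical classification based only on the ranges quoted above.
enum ThaiLetterKind {
    case consonant, vowel, other
}

func thaiKind(of scalar: Unicode.Scalar) -> ThaiLetterKind {
    switch scalar.value {
    case 0x0E01...0x0E2E:
        return .consonant
    case 0x0E30...0x0E39, 0x0E40...0x0E44:
        return .vowel
    default:
        return .other   // tone marks, digits, non-Thai scalars, ...
    }
}

// กี้ = consonant + vowel + tone mark
let syllable = "\u{0E01}\u{0E35}\u{0E49}"
let kinds = syllable.unicodeScalars.map { thaiKind(of: $0) }
print(kinds)
```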

But to talk about German:
What about diphthongs? “neu” has one consonant + one vowel sound (but 2 vowel characters).
What if some silly users don’t know how to type umlauts and write “ueber” (instead of the correct “über”)? The “ue” is really one vowel (+diaeresis).
But beware: “aktuell” is definitely not a misspelling of “aktüll” and has two vowels.

Gerriet.

···

On 3 Oct 2016, at 19:17, Jean-Denis Muys via swift-users <swift-users@swift.org> wrote:

You are right: I don’t know much about Asian languages.

How would you go about counting consonants, vowels (and tone-marks?) in the most general way?