Finding Character in CharacterSet

Are there any opinions on the best or correct way to check that a Character is in CharacterSet?

EDIT: It occurred to me after posting I should probably say that "best" is not quantifiable. So maybe I could add performant. Really I'm just looking to get some discussion about different approaches.

My confusion comes about because CharacterSet.contains takes a Unicode.Scalar instead of a CharacterSet. This is simultaneously expected and not very ergonomic.

For example, say that you are tasked with getting the sequential Characters of a String that belong to a CharacterSet. For this example get all the lowercase letters.

So far, I think this is my best answer. Though I'm not super jazzed about, what at least feels like even if it is not really, nested loops.

let str = "hellọ̀, playground"
let charset = CharacterSet.lowercaseLetters
let prefix = str.prefix(while: { $0.unicodeScalars.first(where: { charset.contains($0) }) != nil })
print(prefix) // hellọ̀

I tried this one too. Though as you can see it is not technically correct (it is missing the acute on the o).

let str = "hellọ̀, playground"
let charset = CharacterSet.lowercaseLetters
let unicodePrefix = str.unicodeScalars.prefix(while: { charset.contains($0) })
print(String(unicodePrefix)) // hellọ

Maybe it feels un-ergonomic because I'm using the wrong APIs?

I think CharacterSet is probably the wrong way to think about this problem, for the reasons you noted—it's actually a set of Unicode.Scalars, so you can't properly handle multi-scalar grapheme clusters by using it.

It would also be a bit difficult to express something like "a set of Characters that are lowercase", especially if you wanted that to be an iterable collection, since that would include many combinations of base letters followed by arbitrarily long sequences of combining modifiers. In fact, I believe such a set would be infinite, because you can stack modifiers ad infinitum (and thus, zalgo was born).

In this particular case, I think what you really want is predicates like isLowercase on Character, which would handle something like "ọ̀" correctly by implementing the test as defined by the Unicode Standard that handles the complete grapheme cluster instead of testing individual scalars. Then, you could just write str.prefix(while: { $0.isLowercase }) and be done with it.

@Michael_Ilseman and I have been discussing scalar and character properties in this thread, which may interest you.

2 Likes

Let's clarify the reason the acute is missed. You've got a lowercaseLetters CharacterSet and a unicode scalar representation of a string, the scalars of which you are checking for belonging to the lowercaseLetters CharacherSet. Note that character ọ̀ consists of two scalars: and `. First goes the , then the acute. What happens, obviously, is that passes the belonging test, since it is a lowercase letter, but the acute isn't and the prefix is hence cropped that very moment, resulting in hellọ.

So, there isn't really a way to check whether a Character, which can consist of an uncertain amount of scalars, belongs to a CharacterSet. Rather, it's senseless. It may be easier to understand if we say UnicodeScalarSet instead.

Your first example works because you iterate on the string itself (str.prefix(while..)), and when passes the belonging test in the closure, the whole character which belongs to together with the acute passes the test as well.

In other words, what you're really doing is sequentially checking whether at least one scalar of the given character belongs to the CharacterSet. This approach is fine if you are dealing with letters like in your case, since all the characters with letters will pass with all the extra 'acutes' (But you have to iterate on the string of course).

Hope this helped!

P.S. Using isLowercase on each character like @allevato suggested is a great way to get rid of the character set in your case.

Note that, despite it's name, CharacterSet cannot handle all valid Characters, as it seems like it can only work with BMP scalars. Also, making things worse, it doesn't check for invalid/unsupported input and the documentation doesn't mention anything about it. So you can easily end up getting strange behavior like this:

More detailed discussion in comments of this bug report:

3 Likes

Even at the risk of sounding stupid: What are BMP scalars?

I believe that's an abbreviation of bitmap, which seems to be a synonym of scalar in this context (a collection of pixels ~ bitmap image). A quick search shows that's most likely the case.

No, no, it's the Basic Multilingual Plane:

2 Likes

I found that as well, but it made no sense to me in the context :face_with_monocle:

Aaaah, that makes much more sense! Regular scalars, after all :slightly_smiling_face:

Thank you @Jens

Maybe the forum should contain a glossary of some kind ;)

After I reading this response and re-reading what I wrote last night I now realize that I did a poor job of explaining why I provided both code samples.

I understood why the first "worked" and the second did not. What I was trying to point out, albeit really clumsily, is that one might be tempted based on the types and docs and general "ergonomics" of the APIs to write the second code and it would at least at first glance "appear correct". That the first might also, at first glance, appear wrong and needlessly complicated. But for all the reasons you pointed out it is wrong. What is more, based on some of the discussion from reading @allevato and @Michael_Ilseman proposal and the example from @Jens all of them are wrong.

What I'm starting to feel, is that CharacterSet is deceptive, possibly to the point of dangerous. Especially if it is used for parsing user input.

Thank you for that thorough and thoughtful response though. I do appreciate it.

1 Like

I don't think CharacterSet is to be blamed though. Checking parts of characters for belonging to character sets is a rather trivial task and is done with a nested loop as a naive approach, as an analogue to your example. Thus is it O(nm) in the general case, m being the length of the most lengthy character. But I agree the documentation absolutely has to emphasize on it being a set of scalars rather than characters, and maybe point out that it isn't meant to be used as an ordered array to compare sequences of scalars to characters.

I think CharacterSet is misleading, confusing and probably dangerous (its name and its documentation together with its semantics).

Despite it's name, it doesn't correctly handle Characters, and it doesn't correctly handle UnicodeScalars either.

A BMP scalar is not a UnicodeScalar or Unicode.Scalar, as can be seen in the above link, and is demonstrated by this little program:

let str = "😀A"
for scalar in str.unicodeScalars {
    let hexValue = String(scalar.value, radix: 16)
    print("UnicodeScalar:", scalar)
    print("     hexValue:", hexValue)
}

which prints:

UnicodeScalar: 😀
     hexValue: 1f600
UnicodeScalar: A
     hexValue: 41

Note that 1f600 is not a 16-bit value, so ":grinning:" is clearly not representable as a BMP scalar (but note that it is representable as a UnicodeScalar).

The danger of CharacterSet is that it will happily accept input that it cannot handle correctly, and when doing so, it shows strange behavior like the following:

import Foundation

let cs1 = CharacterSet(charactersIn: "🙂🙁")
print(cs1.contains("🙂")) // true
print(cs1.contains("🙁")) // true

let cs2 = CharacterSet(charactersIn: "🙂🙁A")
print(cs2.contains("🙂")) // false (!)
print(cs2.contains("🙁")) // false (!)
print(cs2.contains("A")) // true

And another example:

import Foundation

var cs1 = CharacterSet()
cs1.insert("\u{1F600}")
print(cs1.contains("\u{1F600}")) // true

var cs2 = CharacterSet(charactersIn: "a")
cs2.insert("\u{1F600}")
print(cs2.contains("\u{1F600}")) // false (!)

The true semantics of CharacterSet is ... mysterious, or at least not at all according to what its name suggests, or what its current documentation says:

A CharacterSet represents a set of Unicode-compliant characters. Foundation types use CharacterSet to group characters together for searching operations, so that they can find any of a particular set of characters during a search.

This type provides “copy-on-write” behavior, and is also bridged to the Objective-C NSCharacterSet class.

2 Likes

FWIW, CharacterSet is really NSCharacterSet, and the "character" in NSCharacterSet is a character in the sense of NSString semantics, which is a UTF-16 code unit (not even a code-point, let alone a grapheme). The behavior of NSCharacterSet is controlled by that limitation.

Yes, in that sense, you have a point. What I meant is – if it wasn't for this bug or misleading documentation, whatever it is.

That might be the case, but still, it indeed is very strange that the documentation explicitly mentions:

An object representing a fixed set of Unicode character values that bridges to a CharacterSet; use NSCharacterSet when you need reference semantics or other Foundation-specific behavior.

Could the word fixed in the first sentence be the key to this?

I don't think so. It just means "predetermined".

The Apple documentation has a historical lineage. The term "character" historically describes a 8-bit C-string character, so Apple's documentation team used "Unicode character" to distinguish a 16-bit NSString character (aka UTF-16 code unit). In the past — and still, if you except Swift documentation — the word "grapheme" only ever appeared in one place.

If you read the documentation for CharacterSet, you'll see that all its APIs take Unicode.Scalar, as @rlovelett complained about. That's perhaps the best Swift translation of those NSString-related APIs, but CharacterSet is still defective in pure Swift terms. The description at the top of that documentation page:

A CharacterSet represents a set of Unicode-compliant characters. Foundation types use CharacterSet to group characters together for searching operations, so that they can find any of a particular set of characters during a search.

is pretty much nonsense unless you understand the historical trail that led to it.

In case I haven't made my opinion clear yet: Beware CharacterSet in Swift! (As @eskimo says, it's probably not the droid you're looking for.)

Based on what you said and considering this:

let cs2 = "🙂🙁A".unicodeScalars
print(cs2.contains("🙂")) // true
print(cs2.contains("🙁")) // true
print(cs2.contains("A")) // true

let cs1 = CharacterSet(charactersIn: "🙂🙁A")
print(cs1.contains("🙂")) // false
print(cs1.contains("🙁")) // false
print(cs1.contains("A")) // true

do I correctly understand (or get the correct impression) that Unicode.Scalar, unlike String.UnicodeScalarView, is a UTF-16 code unit under the hood?

Just a quick note: UnicodeScalar and Unicode.Scalar is the same thing, this line is from the std lib:

public typealias UnicodeScalar = Unicode.Scalar

And also, just to increase the confusion, the example will work if you insert the Unicode.Scalars one by one as in the following program, but only if you haven't previously created the CharacterSet with the CharacterSet(charactersIn: )-initialiser:

import Foundation

var cs = CharacterSet()
cs.insert("🙂")
cs.insert("🙁")
cs.insert("A")
print(cs.contains("🙂")) // true
print(cs.contains("🙁")) // true
print(cs.contains("A")) // true

No. CharacterSet is just broken.

7 Likes

No, I misled you slightly because I left out a step.

Unicode.Scalar is really UTF-32 code units, which correspond directly to code unitspoints. NSCharacterSet really does use UTF-16 code pointsunits. But most UTF-16 code pointsunits are also code unitspoints.

So, you can start from a Unicode.Scalar and in most cases it's either a valid UTF-16 code unit, or otherwise it's a code unitpoint that's not in any of the [positive] NSCharacterSets anyway.

If CharacterSet really was based on code unitspoints, not code pointsunits, then @rlovelett's example still wouldn't work. That's one step. But CharacterSet is really based on code pointsunits, so it's a degree more dangerous. That's the other step.

Edit: Sheesh, my brain reversed the "point" and "unit" all through this. Anyway, @Michael_Ilseman's answer is clearer: