Character range to collection

QuinceyMorris · April 1, 2023, 6:55pm

Am I missing an obvious solution here?

I'm starting from a range of Unicode characters — in the form of either Character or UnicodeScalar, I don't really care which, but using Character as my example:

"\u{007F}"..."\u{009F}"

How do I get a useful collection out of this? The use-case is to test whether certain characters are in this collection. I was thinking a Set, but it's fine to go via Array or Collection or Sequence or anything else, or a String would work too. These things look like they go nowhere:

["\u{007F}"..."\u{009F}"] // nope
String("\u{007F}"..."\u{009F}") // nope
("\u{007F}"..."\u{009F}").map { $0 } // nope

I did find this:

String((0x007F...0x009F).compactMap { UnicodeScalar($0) }.map { Character($0) })

but it's hardly fluent (and the fact that it starts from Ints is a bit unfortunate).

jrose · April 1, 2023, 7:12pm

There’s no such thing as a “character range collection” for the same reason there’s no such thing as a “string range collection”. You can’t get a collection of every string between “apple” and “zebra”, and you can’t get a collection of every Character between “a” and “b”. (There’s “á”, there’s “á́”, there’s “á́́”, and so on.)

UnicodeScalars you can do, as you discovered. The range itself isn’t automatically a Collection because UnicodeScalars are not Strideable; my guess is that’s because there are gaps in the list of valid codepoints and that makes some strides invalid.

But for the purposes of checking membership, you may not even need a Collection. For a range of UnicodeScalars, the usual ClosedRange.contains method is fine (and much more efficient). For anything else, you first need to decide if you want a range or a set (i.e. should “á” be included or not), and then use a (Closed)Range or a Set to model the check you want.
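
A minimal sketch of that membership check, assuming the valid scalars form the single range from the original question (the constant name `controls` is only illustrative):

```swift
// A closed range of Unicode scalars is not a Collection,
// but it still supports a constant-time `contains` check.
let controls: ClosedRange<Unicode.Scalar> = "\u{7F}"..."\u{9F}"

print(controls.contains("\u{85}"))  // true: U+0085 lies inside the range
print(controls.contains("a"))       // false: U+0061 is below the lower bound
```

No intermediate Array or Set is built; the test is just two comparisons of scalar values.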

QuinceyMorris · April 1, 2023, 7:47pm

Ah, thank you. I did look at the Range documentation, but my eye slid right past "contains" without seeing it.

You can, however, get a range of Characters, such as "a"..."z", but it doesn't solve my problem, since I really do need to get to a collection that includes multiple ranges.

In my scenario, I'm ultimately testing against a set of valid characters that's the union of sets of characters, each of which has sequentially increasing UnicodeScalar code points (because someone else already went through the chore of figuring out the contiguous sub-ranges, in order to write down the valid characters in a compact way).

Even if my valid characters were a simple range, ClosedRange.contains isn't quite the solution I hoped for, since the thing I'm testing against it is a Character, and in order to convert that to a UnicodeScalar for the test, I have to deal with the fact that the conversion can result in multiple UnicodeScalars.

Still, I suppose I could write a contains extension on ClosedRange that takes a Character and returns false if the character doesn't resolve to a single code point.
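
A rough sketch of what such an extension might look like. The `contains(character:)` label is a hypothetical name, chosen so that string literals do not become ambiguous between the Character and Unicode.Scalar overloads:

```swift
extension ClosedRange where Bound == Unicode.Scalar {
    /// Hypothetical helper: true only when `character` consists of exactly
    /// one Unicode scalar and that scalar lies within the range.
    func contains(character: Character) -> Bool {
        let scalars = character.unicodeScalars
        guard scalars.count == 1, let only = scalars.first else {
            return false  // multi-scalar characters never match
        }
        return contains(only)
    }
}

let lowercase: ClosedRange<Unicode.Scalar> = "a"..."z"
print(lowercase.contains(character: "e"))         // true
print(lowercase.contains(character: "e\u{301}"))  // false: two scalars
```

This treats every multi-scalar Character as outside the range, which may or may not be the policy wanted for precomposed versus decomposed forms.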

So let me throw this back out there as a revised challenge:

I have a switch statement that looks like this:

switch character {
case "a"..."z", "=", "\u{007F}"..."\u{009F}":
    accept(character)
default:
    reject(character)
}

How do I rewrite this as an if statement using a containment test? Something like:

static let acceptableCharacters = ???
…
if Self.acceptableCharacters.contains(character) {
    accept(character)
}
else {
    reject(character)
}

SDGGiesbrecht · April 1, 2023, 10:06pm

I suspect the Character range "a"..."z" contains ạ̧̨̛̭̦̱̇̉̆̃̂́̀̈̌̄̊᷇᷆̋ (a + ◌̇ + ◌̛ + ◌̉ + ◌̆ + ◌ + ◌̃ + ◌̂ + ◌́ + ◌̀ + ◌̈ + ◌̧ + ◌̨ + ◌̌ + ◌̄ + ◌̊ + ◌̣ + ◌̭ + ◌᷇ + ◌᷆ + ◌̋ + ◌̦ + ◌̱), which I doubt you intend to include, so regardless of what Character’s API can or cannot express, you probably actually want to be using Unicode.Scalar.

If you can import Foundation, CharacterSet does this out of the box:

import Foundation

let acceptableCharacters: CharacterSet
  = CharacterSet(charactersIn: "a"..."z")
    .union(CharacterSet(charactersIn: "="))
    .union(CharacterSet(charactersIn: "\u{7F}"..."\u{9F}"))
  
if acceptableCharacters.contains("α") {
  print("Accepted.")
} else {
  print("Rejected.")
}

Otherwise, you can extend Set for convenience:

extension Set where Element == Unicode.Scalar {
  init(_ range: ClosedRange<Unicode.Scalar>) {
    self.init(
      (range.lowerBound.value...range.upperBound.value)
        .compactMap({ Unicode.Scalar($0) })
    )
  }
}

let acceptableCharacters: Set<Unicode.Scalar>
  = Set("a"..."z")
    .union(Set(["="]))
    .union(Set("\u{7F}"..."\u{9F}"))
  
if acceptableCharacters.contains("α") {
  print("Accepted.")
} else {
  print("Rejected.")
}

Or, if useful, you can roll your own generic union type (like this) in order to abstract this sort of thing over a much wider range of use cases.

QuinceyMorris · April 1, 2023, 10:35pm

It seems that it does not:

let range = Character("a") ... Character("z")
print(range.contains("a")) // true
print(range.contains("ạ̧̨̛̭̦̱̇̉̆̃̂́̀̈̌̄̊᷇᷆̋")) // false

but you make a good point, since I doubt there's a reliable API contract here.

I'm reluctant to use CharacterSet, if it's still based on NSCharacterSet, because that and NSString count "characters" in UTF-16 code units, and I just don't trust NSCharacterSet in all cases.

SDGGiesbrecht · April 2, 2023, 12:28am

Somewhere between my original typing and my browser’s displaying of your post, it was normalized to NFC, resulting in the first scalar changing from plain a to ạ (U+1EA1), which is expectedly outside that range. (And < probably uses NFC anyway.) If I were to sort through the diacritics and remove those capable of merging with a for NFC, the result ought to bounce back inside the range, because it would start with a again. I am pretty sure a⃠ (a + ◌⃠) does not recompose, if you want to verify it.
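
One way to check this without depending on how a browser normalizes pasted text is to spell the scalars out as escapes. A small sketch, assuming Character comparison follows Unicode canonical ordering, as the standard library documents for String:

```swift
let range: ClosedRange<Character> = "a"..."z"

// "a" followed by U+20E0 (combining enclosing circle backslash),
// a mark with no precomposed form, so the first scalar stays "a".
print(range.contains("a\u{20E0}"))  // true

// U+1EA1 (a with dot below, precomposed) sorts after "z".
print(range.contains("\u{1EA1}"))   // false
```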

CharacterSet operates on scalars, not UTF-16 code units. It properly supports all the supplementary planes, even in Objective-C. If, via Objective-C, you were to query it about half of a surrogate pair, I do not know how it would answer; but you cannot make such a query from Swift anyway, because the Unicode.Scalar type will refuse to initialize to such a value in the first place (feeding it the corresponding underlying UInt32 will yield nil).
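
The refusal to form a scalar from a surrogate value can be seen directly with the failable initializer. A tiny sketch (the constant names are illustrative):

```swift
// Surrogate code points (U+D800...U+DFFF) are not Unicode scalar values,
// so the failable initializer returns nil for them.
let highSurrogate = Unicode.Scalar(0xD800 as UInt32)
print(highSurrogate == nil)  // true

// A supplementary-plane code point is a perfectly valid scalar.
let regionalIndicatorP = Unicode.Scalar(0x1F1F5 as UInt32)
print(regionalIndicatorP != nil)  // true
```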

Orup70 · April 2, 2023, 2:40pm

Handling “text” is surprisingly hard. From my understanding, the characters included in the range [a-z] depend on the language (and probably region and other factors).

For example, in German the character [ä] is included in the range, but in Swedish the character [ä] comes after [z].

In Swedish the alphabet is defined as [a-z] directly followed by [åäö]. But other characters like [á] and [é] are not considered to be unique letters, but rather letters with accents, and would therefore be sorted as part of the range [a-z].

I’m sure there are countless similar differences in other languages, which makes the question “what characters are included in the range [a-z]?” impossible to answer (in most cases).

QuinceyMorris · April 2, 2023, 4:58pm

I'm happy to take your word for it.

Thanks for making this point, which led me to clarify my thinking.

I can't use CharacterSet for my scenario, because I really do want to test a Character for membership in a set of Characters, and CharacterSet doesn't work for that.

The complication I face is that my set of Characters is partially given in terms of ranges of UnicodeScalars. That means I have to convert those individually to sets of Characters. Now that I understand where the pitfalls are, I can get my Character set.
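
A minimal sketch of that conversion, assuming each valid scalar should become its own single-scalar Character. The function name `characters(in:)` is hypothetical, and the ranges are the ones from the earlier switch statement:

```swift
/// Hypothetical helper: every valid scalar in `range`,
/// each wrapped as a single-scalar Character.
func characters(in range: ClosedRange<Unicode.Scalar>) -> Set<Character> {
    Set(
        (range.lowerBound.value...range.upperBound.value)
            .compactMap { Unicode.Scalar($0) }  // skips surrogate gaps
            .map { Character($0) }
    )
}

let acceptableCharacters: Set<Character> =
    characters(in: "a"..."z")
        .union(["="])
        .union(characters(in: "\u{7F}"..."\u{9F}"))

print(acceptableCharacters.contains("a"))         // true
print(acceptableCharacters.contains("="))         // true
print(acceptableCharacters.contains("e\u{301}"))  // false
```

Character equality follows canonical equivalence, so a decomposed "e\u{301}" would match a set that contains the precomposed "\u{E9}"; here neither form is in the set.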

The other complication is that, although I started asking about this as a way to replace some switch statements with set containment checks, I still have a lot of switch statements I'd prefer to keep that way. Not sure what to do about that yet.

Karl · April 2, 2023, 8:11pm

Pattern matching in Unicode text is difficult, which is why we introduced the new Regex APIs. I believe they are the recommended way to perform this kind of processing. For example, we have CharacterClass:

A character class can represent individual characters, a group of characters, the set of characters that match some set of criteria, or a set algebraic combination of all of the above.

The regex builder can create character classes for you if you supply a closed range, and you can tell the resulting regex to match at the unicode scalar level:

import RegexBuilder

let isAllowedString = Regex {
  Anchor.startOfSubject
  OneOrMore {
    ChoiceOf {
      "a"..."z"
      "="
      "\u{007F}"..."\u{009F}"
    }
  }
  Anchor.endOfSubject
}.matchingSemantics(.unicodeScalar)

func check(_ str: String) {
  if str.wholeMatch(of: isAllowedString) != nil {
    print(str, "allowed")
  } else {
    print(str, "not allowed")
  }
}

check("a")       // allowed
check("hell=o")  // allowed

check("å")       // not allowed
check("á")       // not allowed
check("α")       // not allowed
check("hEll=o")  // not allowed
check("A")       // not allowed
check("9")       // not allowed

Alternatively, you can express your pattern using higher-level text characteristics as defined by Unicode. For example, if you want to allow the lowercase letter a plus any combining characters, you can use the .lowercaseLetter general category:

let isAllowedString = Regex {
  Anchor.startOfSubject
  OneOrMore {
    ChoiceOf {
      CharacterClass.generalCategory(.lowercaseLetter)  // <-----
      "="
      "\u{007F}"..."\u{009F}"
    }
  }
  Anchor.endOfSubject
}.matchingSemantics(.unicodeScalar)

check("a")       // allowed
check("hell=o")  // allowed

check("å")       // allowed  <---
check("á")       // allowed  <---
check("α")       // allowed  <---
check("hEll=o")  // not allowed
check("A")       // not allowed
check("9")       // not allowed

This creates an interesting issue - let's check our old friend, é, and whether both precomposed and decomposed forms are accepted:

check("\u{00E9}")   // precomposed - allowed
check("e\u{0301}")  // decomposed - not allowed (!)

They are not! Because we've applied scalar semantics to the entire pattern.

No matter, we can fix this - by composing scalar-level patterns with character-level patterns:

let isAllowedString = Regex {
  Anchor.startOfSubject
  OneOrMore {
    ChoiceOf {
      // Grapheme cluster semantics.
      CharacterClass.generalCategory(.lowercaseLetter)
      "="
      // Additional character classes using scalar semantics.
      Regex {
        ChoiceOf {
          "\u{007F}"..."\u{009F}"
        }
      }.matchingSemantics(.unicodeScalar)
    }
  }
  Anchor.endOfSubject
}

check("\u{00E9}")   // precomposed - allowed
check("e\u{0301}")  // decomposed - allowed  <---

In your particular example, U+007F-U+009F are control characters, so I'm pretty sure they never compose with anything, and matching them at scalar or grapheme cluster level doesn't matter. But what I'm trying to show is that the new Regex APIs offer some powerful tools for pattern matching in Unicode text, and that they compose so you can express even complex patterns.

SE-0363: Unicode for String Processing has more information about CharacterClass, including some of the nuances when expressing character classes using ranges.

QuinceyMorris · April 3, 2023, 5:26am

Regex doesn't help me here, because I'm testing for a single Character containment, not a pattern. Well, I suppose one character is a pattern, and a Regex CharacterClass might be a solution for a single character match, but construction of CharacterClass values appears to be as problematic as CharacterSet.

To recap, the only valid solution for testing a Character for containment in a set of Characters is Set<Character>.contains — or a custom type's implementation of the same behavior.

That's because aggregating Characters by UnicodeScalar isn't a safe way to proceed.

For example, the character "🇵🇷" is a Character, but so are both of its single-UnicodeScalar components: 🇵 (scalar 0x1F1F5) and 🇷 (scalar 0x1F1F7). Using a String as a collection of particular characters doesn't work: adjacent characters can collapse into a single character.
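
The collapse is easy to reproduce. A short sketch using the two regional indicator scalars U+1F1F5 and U+1F1F7 (the variable names are illustrative):

```swift
let p: Character = "\u{1F1F5}"  // regional indicator P
let r: Character = "\u{1F1F7}"  // regional indicator R

// Building a String from the two Characters merges them
// into a single grapheme cluster (a flag).
let joined = String([p, r])

print(joined.count)                 // 1
print(joined.unicodeScalars.count)  // 2
print(joined.contains(p))           // false: the only element is the flag
```

A `Set<Character>` holding `p` and `r` keeps them as two distinct elements, which is why it behaves differently from a String here.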

However, crucially, CharacterSet doesn't work either, because it's actually a set of UnicodeScalar, not a set of Character. That's why computing values of CharacterSet is so dangerous. It's attractively easy to use, and it will work most of the time in many writing systems, but not always.

The documentation says that CharacterClass is a collection of characters, but I don't believe it, really. The way that CharacterClass is created and manipulated suggests that it, too, is at best a collection limited to single-UnicodeScalar characters.

Karl · April 3, 2023, 7:17am

We can test it using the example you have given; we can create a CharacterClass containing only the character "🇵🇷", and then check if it contains each of those scalars.

If it truly contains characters rather than single scalars, it should report that the combination, "\u{1F1F5}\u{1F1F7}" is present, but each scalar tested individually should be reported as not present. Indeed, that's what I see:

import RegexBuilder

let allowedCharacters = CharacterClass.anyOf("🇵🇷")

func check(_ str: String) {
  if str.wholeMatch(of: allowedCharacters) != nil {
    print(str, "allowed")
  } else {
    print(str, "not allowed")
  }
}

check("🇵🇷")                  // 🇵🇷 allowed
check("\u{1F1F5}\u{1F1F7}")  // 🇵🇷 allowed
check("\u{1F1F5}")           // 🇵 not allowed
check("\u{1F1F7}")           // 🇷 not allowed

As you point out, each of those scalars can also be a character. When you write them that way, the CharacterClass contains each of those characters separately and does not consider the combination to match:

let allowedCharacters = CharacterClass(
  .anyOf("\u{1F1F5}"),
  .anyOf("\u{1F1F7}")
)

func check(_ str: String) {
  if str.wholeMatch(of: OneOrMore(allowedCharacters)) != nil {
    print(str, "allowed")
  } else {
    print(str, "not allowed")
  }
}

check("🇵🇷")                  // 🇵🇷 not allowed
check("\u{1F1F5}\u{1F1F7}")  // 🇵🇷 not allowed
check("\u{1F1F5}")           // 🇵 allowed
check("\u{1F1F7}")           // 🇷 allowed

Again, this suggests that it is able to match full characters and is not limited to considering individual scalars.

Are you seeing something different? Do you have a result which contradicts the documentation?

tera · April 3, 2023, 11:29am

QuinceyMorris:

I have a switch statement that looks like this:
switch character {
case "a"..."z", "=", "\u{007F}"..."\u{009F}":
    accept(character)
default:
    reject(character)
}
How do I rewrite this as an if statement using a containment test?

If you need to replicate the above switch exactly – first you need to know what exactly it is doing. My uneducated guess would be it is doing a series of comparisons:

(c >= "a" && c <= "z") || c == "=" || (c >= "\u{007F}" && c <= "\u{009F}")

Worth double checking that assumption. Could be generalised to:

extension Character {
    // True if any of the given ranges contains this character;
    // iteration stops at the first matching range.
    func contained(in ranges: [any RangeExpression<Character>]) -> Bool {
        ranges.contains { $0.contains(self) }
    }
}

xwu · April 3, 2023, 1:50pm

What sort of change in the documentation would help you believe it?

cc @Alex_Martini

QuinceyMorris · April 3, 2023, 3:46pm

Sounds great. Really good to hear.

Answering in the most literal sense, it would help if the page (CharacterClass | Apple Developer Documentation) mentioned the type name Character even once, or mentioned grapheme clusters, instead of just "characters". After all, Unicode doesn't use "character" in any formalized way any more, right? Swift's formalism is specifically Character.

What set off alarms for me was the inverted property. What does this mean? If it's the set of Character known to the compiler (in any given compiler version) outside the character class, then it's a little slippery because that set changes over time (and compiler versions). If it's a "stored" inversion operation which can be composed with union, intersection, etc, so that CharacterClass values are more like functions than sets, then OK. Or can I not compose inversions with unions and intersections? The document as it stands doesn't provide much guidance.

What would really help would be to deprecate CharacterSet and reintroduce it under a correct type name such as UnicodeScalarSet. I find the API name parallels between CharacterSet and CharacterClass to be more disturbing than helpful.

Even better, if CharacterClass is doing the right thing with Character values, would be to pull it out of Regex, and set it loose in the standard library with non-Regex API such as contains(_ character: Character), as well as String APIs such as trimming and splitting into components (and non-Regex String classification).