Another thought: We discussed earlier that CharacterSet is inadequate because its definition of lowercaseCharacters and uppercaseCharacters is based on general categories instead of derived properties.
But as shown above, there are still scalars (like feminine/masculine ordinals ª/º) where the property value is inconsistent with the result of the case detection function.
If, in the future, we want a Unicode.ScalarSet type that works as one would expect, I think users would expect the following to be true:
∀ (s ∈ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == true
∀ (s ∉ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == false
...which means we cannot implement that set in terms of the Lowercase Unicode property alone. Likely, we would need two APIs, to match the proposed pair of APIs in the previous post:
Unicode.ScalarSet.lowercaseScalars is defined as the set of scalars for which s.isLowercase == true
Unicode.ScalarSet(havingProperty: .lowercase) is defined as the set of scalars for which s.hasProperty(.lowercase) == true
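To make the difference between the two definitions concrete, here is a rough sketch using only today's standard library. Unicode.ScalarSet does not exist, so plain arrays stand in for it. The function detectsAsLowercase is a hypothetical stand-in for the case detection function; I'm assuming for illustration that it behaves like "changes when uppercased", and the real definition may well differ. hasLowercaseProperty reads the Lowercase derived property through the existing Unicode.Scalar.Properties API.

```swift
// Hypothetical stand-in for the proposed s.isLowercase case detection
// function. ASSUMPTION: it behaves like Changes_When_Uppercased; the
// definition settled on in the thread may differ.
func detectsAsLowercase(_ s: Unicode.Scalar) -> Bool {
    return s.properties.changesWhenUppercased
}

// Membership by the Lowercase derived property, i.e. what
// s.hasProperty(.lowercase) would report.
func hasLowercaseProperty(_ s: Unicode.Scalar) -> Bool {
    return s.properties.isLowercase
}

// A small sample, including the feminine/masculine ordinals
// (U+00AA and U+00BA).
let samples: [Unicode.Scalar] = ["a", "A", "\u{AA}", "\u{BA}", "1"]

// Arrays standing in for the two sets.
let lowercaseScalars = samples.filter(detectsAsLowercase)
let havingLowercaseProperty = samples.filter(hasLowercaseProperty)

// Scalars on which the two definitions disagree.
let disagreements = samples.filter {
    detectsAsLowercase($0) != hasLowercaseProperty($0)
}
```

Under that assumption the ordinals end up in the property-based set but not in the detection-based one, which is exactly why the two sets can't share a single definition.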
The second one can be built directly on top of ICU uset_* APIs. The harder question is how we implement the first in a way that's both efficient and safe with respect to future changes to the Unicode data.