The documentation of Character.lowercased() says: "Because case conversion can result in multiple characters, the result of lowercased() is a string."
I am looking for an example where this actually happens.
I know that converting the “German eszett” to upper case results in two characters:
let c: Character = "ß"
let s = c.uppercased()
print(s, s.count)
// SS 2
but I have not been able to find a similar example for the conversion to lower case.
There are characters (consisting of a single Unicode scalar) where the conversion to lower case results in two Unicode scalars, but in all examples that I found so far, these still are a single Character (extended grapheme cluster):
let c : Character = "İ" // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
let s = c.lowercased()
print(s, s.unicodeScalars.count, s.count)
// i̇ 2 1
Therefore my question:
Can Character.lowercased() return multiple characters?
The question was about lowercased(), not uppercased(), and I’m not currently aware of any such example for lowercased().
However, as the API documentation states, it can happen (that is, Unicode is free to add such an extended grapheme cluster in a future version, and Swift would accordingly return a multi-character result for lowercased() if that comes to pass).
If you’re asking for an actual example so you can actually exercise this possibility in, say, a test—I don’t know of a good answer.
Did a brute force search, didn't find any results.
for i in 0...UInt32.max {
    if i.isMultiple(of: 10_000_000) {
        print("Checked \(i) (\(Double(i) / Double(UInt32.max) * 100)%)")
    }
    guard let scalar = UnicodeScalar(i) else { continue }
    let c = Character(scalar)
    let lower = c.lowercased()
    if 1 < lower.count {
        print("Found a case: \(i) U+\(String(i, radix: 16))")
    }
}
Found 79 cases of the opposite though (single character lowercase becomes multi-character uppercase, like ß).
Thanks. But there still could be a Character (an extended grapheme cluster consisting of two or more Unicode scalar values) whose lowercased version is not a single Character.
(Side note: If I remember correctly, the maximal possible value of a Unicode scalar is 0x10FFFF, so there is no need to check up to UInt32.max.)
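For what it's worth, here is a minimal sketch of the same brute-force idea restricted to the valid scalar range. The helper name `multiCharacterResults(in:transform:)` is made up for illustration and is not from the original post. Like the original loop, it only looks at Characters built from a single Unicode scalar, so it cannot rule out multi-scalar grapheme clusters.

```swift
// Hypothetical helper (not from the thread): returns the scalar values in `range`
// whose single-scalar Character turns into more than one Character under `transform`.
func multiCharacterResults(
    in range: ClosedRange<UInt32>,
    transform: (Character) -> String
) -> [UInt32] {
    var hits: [UInt32] = []
    for value in range {
        // Surrogate code points are not Unicode scalars, so the initializer
        // returns nil for them and we skip them.
        guard let scalar = Unicode.Scalar(value) else { continue }
        if transform(Character(scalar)).count > 1 {
            hits.append(value)
        }
    }
    return hits
}

// Unicode scalar values end at U+10FFFF, so this range covers all of them.
let allScalars: ClosedRange<UInt32> = 0...0x10FFFF
let lowerHits = multiCharacterResults(in: allScalars, transform: { $0.lowercased() })
print("lowercased(): \(lowerHits.count) multi-character results")
```

Passing `{ $0.uppercased() }` as the transform instead should reproduce the "opposite" search that finds ß.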
At the moment, this is defined to never be the case; from the ICU Case Mappings documentation:
The CaseFolding.txt file in the Unicode Character Database is used for performing locale-independent case folding.
<snip>
Unicode case folding is not context-sensitive. It is also not language-sensitive, although there is a flag for whether to apply special mappings for use with Turkic (Turkish/Azerbaijani) text data.
The latest CaseFolding.txt file is available, and you can confirm that case folding is done on a scalar-by-scalar basis.
More specifically, "Unicode case folding is not context-sensitive" means that, at the moment, the case folding rules for a specific scalar don't depend on any scalars that precede or succeed it, so you can't end up with a grapheme cluster that ends up breaking up into multiple clusters.
So putting that together with my dumb little brute-force search, I originally concluded that there are no such cases today. (Edit: apparently that conclusion doesn't follow! See xwu's reply below.)
Of course, they can always be added in the future, and the API's String return type is flexible enough to allow for that possibility.
Not so—one cannot make that conclusion. Unicode scalar-by-scalar lowercasing doesn't guarantee that one extended grapheme cluster doesn't end up as two when lowercased—even if no Unicode scalar by itself ends up as two or more extended grapheme clusters when lowercased.
Segmentation rules prohibit grapheme cluster boundaries between certain Unicode scalars based on their properties. If the relevant Unicode scalar changes when lowercased and the lowercase scalar doesn't have the relevant property, then a boundary is inserted. For example, rule GB9 says not to break before a scalar with the Grapheme_Extend property; if there existed such a scalar that lowercases to another scalar without the Grapheme_Extend property, then there would be a case today of the behavior described.
You'd have to go down the list of rules in the Unicode text segmentation specification and, for each one, ask whether there exist combinations of scalars to which the rule applies but which change when lowercased so that the rule no longer applies.
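To make the GB9 point concrete, here is a small sketch using the standard library's `Unicode.Scalar.Properties`. It only illustrates how Grapheme_Extend keeps a scalar attached to the preceding one; it is not a search for a counterexample. The variable names are arbitrary.

```swift
// U+0301 COMBINING ACUTE ACCENT has the Grapheme_Extend property,
// so rule GB9 forbids a grapheme cluster boundary before it.
let accent: Unicode.Scalar = "\u{301}"
print(accent.properties.isGraphemeExtend)  // true

// Two scalars, but one Character, because of that property.
let clustered = "e\u{301}"
print(clustered.unicodeScalars.count, clustered.count)  // 2 1

// The same mechanism explains the İ example: the lowercased string ends in
// U+0307 COMBINING DOT ABOVE, which is also Grapheme_Extend, so the result
// stays a single Character.
let lowered = "\u{130}".lowercased()
if let last = lowered.unicodeScalars.last {
    print(last.properties.isGraphemeExtend, lowered.count)  // true 1
}
```

A counterexample would need a scalar like these whose lowercase form loses the property, which is what has to be checked rule by rule.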
That's a good point: you're right that this search is necessary but not sufficient. With a little help from ICU, though, it's easy to see that it's still the case today that there aren't any scalars which have this behavior.
For all characters which have a lowercase form (UAX #44 Changes_When_Lowercased), we can see which characters have an interesting grapheme cluster break class (which would prevent a break between grapheme clusters), and also see whether that property changes under lowercasing:
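#include <stdio.h>
#include <unicode/uchar.h>

int main(void)
{
    for (UChar32 c = 0; c <= 0x10FFFF; c += 1) {
        /* Only consider code points that change when lowercased. */
        if (!u_hasBinaryProperty(c, UCHAR_CHANGES_WHEN_LOWERCASED)) {
            continue;
        }
        /* Report the original code point if its break class is not "Other". */
        UGraphemeClusterBreak breakProp =
            (UGraphemeClusterBreak)u_getIntPropertyValue(c, UCHAR_GRAPHEME_CLUSTER_BREAK);
        if (breakProp != U_GCB_OTHER) {
            printf("0x%X: %d\n", (unsigned)c, (int)breakProp);
        }
        /* Do the same for its (simple) lowercase mapping. */
        UChar32 lc = u_tolower(c);
        breakProp = (UGraphemeClusterBreak)u_getIntPropertyValue(lc, UCHAR_GRAPHEME_CLUSTER_BREAK);
        if (breakProp != U_GCB_OTHER) {
            printf("0x%X: %d\n", (unsigned)lc, (int)breakProp);
        }
    }
    return 0;
}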
If all goes well, this program should produce no output: changing under lowercasing and affecting grapheme cluster breaking rules are currently mutually exclusive properties.
This is not the case, by the way, for uppercasing — a search for UCHAR_CHANGES_WHEN_UPPERCASED brings up U+0345 Combining Greek Ypogegrammeni, which belongs to the Extend Grapheme_Cluster_Break class.
For completeness, it looks like that ICU property doesn't include the Control and Extend property values (or the Hangul syllable types) mentioned in UAX#29 table 2, but it's not hard to check—we can also do these checks using UnicodeScalar APIs in Swift—and indeed it makes no difference to the conclusion.
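As a rough sketch of what such a check could look like in Swift (my own illustration, not the code that was actually run): the standard library exposes `isGraphemeExtend`, `changesWhenLowercased` and `changesWhenUppercased` on `Unicode.Scalar.Properties`, but as far as I know not the Grapheme_Cluster_Break class itself, so this only covers the Extend case. The helper name `graphemeExtendScalars(in:where:)` is hypothetical.

```swift
// Hypothetical helper: collects the scalars in `range` that have the
// Grapheme_Extend property and also satisfy the `changes` predicate.
func graphemeExtendScalars(
    in range: ClosedRange<UInt32>,
    where changes: (Unicode.Scalar.Properties) -> Bool
) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for value in range {
        // Skips surrogate code points, which are not Unicode scalars.
        guard let scalar = Unicode.Scalar(value) else { continue }
        let properties = scalar.properties
        if properties.isGraphemeExtend && changes(properties) {
            result.append(scalar)
        }
    }
    return result
}

let everyScalar: ClosedRange<UInt32> = 0...0x10FFFF
let extendLower = graphemeExtendScalars(in: everyScalar, where: { $0.changesWhenLowercased })
let extendUpper = graphemeExtendScalars(in: everyScalar, where: { $0.changesWhenUppercased })

print("Extend scalars that change when lowercased:", extendLower.count)
print("Extend scalars that change when uppercased:",
      extendUpper.map { "U+" + String($0.value, radix: 16, uppercase: true) })
```

The uppercase list should include U+0345, matching the ICU result mentioned above. The Control and Hangul syllable classes would need separate checks, for example via `generalCategory`.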
Good catch — I meant to link to the source code, which does actually have the full list of matching properties. I'm not sure why the documentation is incomplete.
BTW, in 2008 LATIN CAPITAL LETTER SHARP S (U+1E9E) was added to Unicode as the uppercase letter for LATIN SMALL LETTER SHARP S, and since 2024 this is the recommended uppercase form. In Switzerland and Liechtenstein neither is used, but this is a separate issue. So actually there should be an update on this.