Can Character.lowercased() return multiple characters?

Martin · November 2, 2024, 2:36pm

The documentation of Character.lowercased() states:

Because case conversion can result in multiple characters, the result of lowercased() is a string.

and I am looking for an example where this actually happens.

I know that converting the “German eszett” to upper case results in two characters

let c: Character = "ß"
let s = c.uppercased()
print(s, s.count)
// SS 2

but I have not been able to find a similar example for the conversion to lower case.

There are characters (consisting of a single Unicode scalar) where the conversion to lower case results in two Unicode scalars, but in all examples that I found so far, these still are a single Character (extended grapheme cluster):

let c : Character = "İ" // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
let s = c.lowercased()
print(s, s.unicodeScalars.count, s.count)
// i̇ 2 1

Therefore my question:

Can Character.lowercased() return multiple characters?

Martin · November 2, 2024, 4:34pm

What I am looking for is the other way around: an example where the conversion to lower case increases the character count.

xwu · November 2, 2024, 4:35pm

The question was about lowercased(), not uppercased(), and I’m not currently aware of any.

However, as the API documentation states, it can happen (that is, Unicode is free to add such an extended grapheme cluster in a future version, and Swift would accordingly return a multi-character result for lowercased() if that comes to pass).

If you’re asking for an actual example so you can actually exercise this possibility in, say, a test—I don’t know of a good answer.

AlexanderM · November 2, 2024, 4:37pm

Ah, I jumped the gun and misread it. I was going off the latter example and thought you were asking for the opposite. My bad.

AlexanderM · November 2, 2024, 4:51pm

Did a brute force search, didn't find any results.

for i in 0...UInt32.max {
	if i.isMultiple(of: 10_000_000) {
		print("Checked \(i) (\(Double(i) / Double(UInt32.max) * 100)%)")
	}
	
	guard let scalar = UnicodeScalar(i) else { continue }
	let c = Character(scalar)
	let lower = c.lowercased()
	
	if 1 < lower.count {
		print("Found a case: \(i) U+\(String(i, radix: 16))")
	}
}

Found 79 cases of the opposite though (single character lowercase becomes multi-character uppercase, like ß):

U+DF "ß" => "SS"
U+149 "ŉ" => "ʼN"
U+587 "և" => "ԵՒ"
U+1E9A "ẚ" => "Aʾ"
U+1F80 "ᾀ" => "ἈΙ"
U+1F81 "ᾁ" => "ἉΙ"
U+1F82 "ᾂ" => "ἊΙ"
U+1F83 "ᾃ" => "ἋΙ"
U+1F84 "ᾄ" => "ἌΙ"
U+1F85 "ᾅ" => "ἍΙ"
U+1F86 "ᾆ" => "ἎΙ"
U+1F87 "ᾇ" => "ἏΙ"
U+1F88 "ᾈ" => "ἈΙ"
U+1F89 "ᾉ" => "ἉΙ"
U+1F8A "ᾊ" => "ἊΙ"
U+1F8B "ᾋ" => "ἋΙ"
U+1F8C "ᾌ" => "ἌΙ"
U+1F8D "ᾍ" => "ἍΙ"
U+1F8E "ᾎ" => "ἎΙ"
U+1F8F "ᾏ" => "ἏΙ"
U+1F90 "ᾐ" => "ἨΙ"
U+1F91 "ᾑ" => "ἩΙ"
U+1F92 "ᾒ" => "ἪΙ"
U+1F93 "ᾓ" => "ἫΙ"
U+1F94 "ᾔ" => "ἬΙ"
U+1F95 "ᾕ" => "ἭΙ"
U+1F96 "ᾖ" => "ἮΙ"
U+1F97 "ᾗ" => "ἯΙ"
U+1F98 "ᾘ" => "ἨΙ"
U+1F99 "ᾙ" => "ἩΙ"
U+1F9A "ᾚ" => "ἪΙ"
U+1F9B "ᾛ" => "ἫΙ"
U+1F9C "ᾜ" => "ἬΙ"
U+1F9D "ᾝ" => "ἭΙ"
U+1F9E "ᾞ" => "ἮΙ"
U+1F9F "ᾟ" => "ἯΙ"
U+1FA0 "ᾠ" => "ὨΙ"
U+1FA1 "ᾡ" => "ὩΙ"
U+1FA2 "ᾢ" => "ὪΙ"
U+1FA3 "ᾣ" => "ὫΙ"
U+1FA4 "ᾤ" => "ὬΙ"
U+1FA5 "ᾥ" => "ὭΙ"
U+1FA6 "ᾦ" => "ὮΙ"
U+1FA7 "ᾧ" => "ὯΙ"
U+1FA8 "ᾨ" => "ὨΙ"
U+1FA9 "ᾩ" => "ὩΙ"
U+1FAA "ᾪ" => "ὪΙ"
U+1FAB "ᾫ" => "ὫΙ"
U+1FAC "ᾬ" => "ὬΙ"
U+1FAD "ᾭ" => "ὭΙ"
U+1FAE "ᾮ" => "ὮΙ"
U+1FAF "ᾯ" => "ὯΙ"
U+1FB2 "ᾲ" => "ᾺΙ"
U+1FB3 "ᾳ" => "ΑΙ"
U+1FB4 "ᾴ" => "ΆΙ"
U+1FB7 "ᾷ" => "Α͂Ι"
U+1FBC "ᾼ" => "ΑΙ"
U+1FC2 "ῂ" => "ῊΙ"
U+1FC3 "ῃ" => "ΗΙ"
U+1FC4 "ῄ" => "ΉΙ"
U+1FC7 "ῇ" => "Η͂Ι"
U+1FCC "ῌ" => "ΗΙ"
U+1FF2 "ῲ" => "ῺΙ"
U+1FF3 "ῳ" => "ΩΙ"
U+1FF4 "ῴ" => "ΏΙ"
U+1FF7 "ῷ" => "Ω͂Ι"
U+1FFC "ῼ" => "ΩΙ"
U+FB00 "ﬀ" => "FF"
U+FB01 "ﬁ" => "FI"
U+FB02 "ﬂ" => "FL"
U+FB03 "ﬃ" => "FFI"
U+FB04 "ﬄ" => "FFL"
U+FB05 "ﬅ" => "ST"
U+FB06 "ﬆ" => "ST"
U+FB13 "ﬓ" => "ՄՆ"
U+FB14 "ﬔ" => "ՄԵ"
U+FB15 "ﬕ" => "ՄԻ"
U+FB16 "ﬖ" => "ՎՆ"
U+FB17 "ﬗ" => "ՄԽ"

Martin · November 2, 2024, 4:58pm

Thanks. But there still could be a Character (an extended grapheme cluster consisting of two or more Unicode scalar values) whose lowercased version is not a single Character.

(Side note: If I remember correctly, the maximal possible value of a Unicode scalar is 0x10FFFF, so there is no need to check up to UInt32.max.)
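
To make the larger search space concrete, here is a minimal sketch that covers only one thin slice of it: every scalar that changes when lowercased, followed by a single combining mark. The helper name `multiScalarCandidates(combining:)` and the choice of U+0301 are assumptions for illustration; an empty result says nothing about the other possible combinations.

```swift
// Hypothetical helper: pairs every scalar that changes when lowercased with
// one combining mark and collects any pair that is a single Character but
// lowercases to more than one Character.
func multiScalarCandidates(combining mark: Unicode.Scalar) -> [String] {
    var found: [String] = []
    for value in UInt32(0)...0x10FFFF {
        // The failable initializer rejects surrogate code points.
        guard let base = Unicode.Scalar(value),
              base.properties.changesWhenLowercased else { continue }

        var scalars = String.UnicodeScalarView()
        scalars.append(base)
        scalars.append(mark)
        let cluster = String(scalars)

        // Only consider pairs that start out as one extended grapheme cluster.
        guard cluster.count == 1 else { continue }
        if cluster.lowercased().count > 1 {
            found.append(cluster)
        }
    }
    return found
}

let e = Character("E\u{301}")   // two Unicode scalars, one Character
print(e.lowercased().count)     // 1
print(multiScalarCandidates(combining: "\u{301}"))
```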

AlexanderM · November 2, 2024, 5:03pm

True! I suppose that's a less feasible search space to brute-force.

It's embarrassingly parallel, we could throw it on some GPUs in the off chance that someone wrote a GPU-based port of the ICU algorithms.

Perhaps the best bet here is to inspect the ICU library's source.

(Side note: If I remember correctly, the maximal possible value of a Unicode scalar is 0x10FFFF, so there is no need to check up to UInt32.max.)

Heh I know, it just ran so fast that it was faster to type UInt32.max than to lookup the correct constant.

ksluder · November 2, 2024, 5:06pm

Does Character.lowercased() respect locale? If so, you might be able to get “ss” by lowercasing “ẞ” in the de-CH locale.

itaiferber · November 2, 2024, 5:08pm

At the moment, this is defined to never be the case; from the ICU Case Mappings documentation:

The CaseFolding.txt file in the Unicode Character Database is used for performing locale-independent case folding.
<snip>
Unicode case folding is not context-sensitive. It is also not language-sensitive, although there is a flag for whether to apply special mappings for use with Turkic (Turkish/Azerbaijani) text data.

The latest CaseFolding.txt file is available, and you can confirm that case folding is done on a scalar-by-scalar basis.

More specifically, "Unicode case folding is not context-sensitive" means that, at the moment, the case folding rules for a specific scalar don't depend on any scalars that precede or succeed it, so you can't end up with a grapheme cluster that ends up breaking up into multiple clusters.

itaiferber · November 2, 2024, 5:11pm

Both Character.lowercased() and String.lowercased() are locale-insensitive, and the current implementation of String.lowercased() case maps one Unicode.Scalar at a time. (Character.lowercased() also just forwards to String.lowercased())

I vaguely remember this being documented somewhere, but can't find it at the moment.
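
As a small illustration of the locale-insensitive behavior, applied to the capital sharp s from the question above: the standard library uses the default Unicode mapping, so the result is a single "ß" rather than "ss".

```swift
let capitalSharpS: Character = "ẞ"   // U+1E9E LATIN CAPITAL LETTER SHARP S
let lower = capitalSharpS.lowercased()
print(lower, lower.count)
// ß 1
```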

ksluder · November 2, 2024, 5:13pm

Thanks. I did check the documentation for Character.lowercased() before posting, but like you said it doesn’t elaborate.

itaiferber · November 2, 2024, 5:15pm

A quick search for "insensitive" in the codebase shows it comes up for the general Swift String docs, which is what I'd remembered:

Strings in Swift are Unicode correct and locale insensitive

The stdlib has no concept of locales, so the only way to get localized results is to go through Foundation.
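
A minimal sketch of the difference, assuming Foundation is available and using Turkish as the example locale (the classic case where the lowercase of "I" is the dotless "ı"):

```swift
import Foundation

let turkish = Locale(identifier: "tr")

// Standard library: locale-insensitive.
let plain = "I".lowercased()
// Foundation: applies the locale's special casing rules.
let localized = "I".lowercased(with: turkish)

print(plain)       // i
print(localized)   // ı (U+0131 LATIN SMALL LETTER DOTLESS I)
```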

AlexanderM · November 2, 2024, 5:17pm

More specifically, "Unicode case folding is not context-sensitive" means that, at the moment, the case folding rules for a specific scalar don't depend on any scalars that precede or succeed it, so you can't end up with a grapheme cluster that ends up breaking up into multiple clusters.

So putting that together with my dumb little brute search, ~~we can conclude with certainty that: No, there are no such cases today.~~ Apparently not! See xwu's reply below.

Of course, they can always be added in the future, and the API's String return type is flexible enough to allow for that possibility.

itaiferber · November 2, 2024, 5:18pm

Indeed! It's certainly possible that this could be the case at some point in the future, and the current design allows for that.

xwu · November 2, 2024, 5:30pm

Not so—one cannot make that conclusion. Unicode scalar-by-scalar lowercasing doesn't guarantee that one extended grapheme cluster doesn't end up as two when lowercased—even if no Unicode scalar by itself ends up as two or more extended grapheme clusters when lowercased.

Segmentation rules prohibit grapheme cluster boundaries between certain Unicode scalars based on their properties: if the relevant Unicode scalar changes when lowercased and that lowercase scalar in turn doesn't have the relevant property, then a boundary is inserted. For example: given a Unicode scalar with property Grapheme_Extend, rule GB9 says not to break before that scalar; but if there exists such a Unicode scalar that changes when lowercased to another scalar without the Grapheme_Extend property, then there would exist a case today of the behavior described.

You'd have to go down the list of rules in the Unicode text segmentation specification and, for each one, check whether there exist combinations of scalars to which the rule applies but which change when lowercased so that the rule no longer applies.
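
The Grapheme_Extend example can be checked with the standard library's `Unicode.Scalar.Properties`. The following is a sketch that covers only that one property (rule GB9), not the other segmentation rules; the helper name `losesGraphemeExtendWhenLowercased` is made up for illustration.

```swift
// Hypothetical helper: true if `scalar` has Grapheme_Extend, changes when
// lowercased, and its lowercase mapping contains a scalar without
// Grapheme_Extend (so rule GB9 would stop applying to it).
func losesGraphemeExtendWhenLowercased(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    guard properties.changesWhenLowercased, properties.isGraphemeExtend else {
        return false
    }
    return properties.lowercaseMapping.unicodeScalars.contains {
        !$0.properties.isGraphemeExtend
    }
}

var offenders: [Unicode.Scalar] = []
for value in UInt32(0)...0x10FFFF {
    // The failable initializer rejects surrogate code points.
    guard let scalar = Unicode.Scalar(value) else { continue }
    if losesGraphemeExtendWhenLowercased(scalar) {
        offenders.append(scalar)
    }
}
print(offenders.map { "U+" + String($0.value, radix: 16, uppercase: true) })
```

Any scalar printed here would be a candidate for building a Character whose lowercased form splits in two.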

itaiferber · November 2, 2024, 8:12pm

That's a good point: you're right that this search is necessary but not sufficient. With a little help from ICU, though, it's easy to see that it's still the case today that there aren't any scalars which have this behavior.

UAX #29 table 2 defines the Grapheme_Cluster_Break properties that go into applying the grapheme cluster breaking rules, and ICU lists these under the UGraphemeClusterBreak property.

For all characters which have a lowercase form (UAX #44 Changes_When_Lowercased), we can see which characters have an interesting grapheme cluster break class (which would prevent a break between grapheme clusters), and also see whether that property changes under lowercasing:

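#include <stdio.h>
#include <unicode/uchar.h>

int main(void)
{
    // Visit every code point in the Unicode range.
    for (UChar32 c = 0; c <= 0x10FFFF; c += 1) {
        // Skip code points that lowercasing leaves unchanged.
        if (!u_hasBinaryProperty(c, UCHAR_CHANGES_WHEN_LOWERCASED)) {
            continue;
        }

        // Report the original if its break class is anything other than Other.
        UGraphemeClusterBreak breakProp = (UGraphemeClusterBreak)u_getIntPropertyValue(c, UCHAR_GRAPHEME_CLUSTER_BREAK);
        if (breakProp != U_GCB_OTHER) {
            printf("0x%X: %d\n", c, breakProp);
        }

        // Apply the same check to the simple lowercase mapping.
        UChar32 lc = u_tolower(c);
        breakProp = (UGraphemeClusterBreak)u_getIntPropertyValue(lc, UCHAR_GRAPHEME_CLUSTER_BREAK);
        if (breakProp != U_GCB_OTHER) {
            printf("0x%X: %d\n", lc, breakProp);
        }
    }
    return 0;
}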

If all goes well, this program should produce no output: changing under lowercasing and affecting grapheme cluster breaking rules are currently mutually exclusive properties.

This is not the case, by the way, for uppercasing — a search for UCHAR_CHANGES_WHEN_UPPERCASED brings up U+0345 Combining Greek Ypogegrammeni, which belongs to the Extend Grapheme_Cluster_Break class.
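
The uppercasing case can be reproduced from Swift. This sketch attaches U+0345 to a Greek alpha (my choice of base letter, for illustration) and shows a single Character uppercasing to two:

```swift
let ypogegrammeni: Unicode.Scalar = "\u{345}"   // COMBINING GREEK YPOGEGRAMMENI
print(ypogegrammeni.properties.isGraphemeExtend)        // true
print(ypogegrammeni.properties.changesWhenUppercased)   // true

// One extended grapheme cluster: alpha followed by the combining mark.
let alphaWithIota = Character("α\u{345}")
let upper = alphaWithIota.uppercased()
print(upper, upper.count)
// ΑΙ 2
```

U+0345 uppercases to U+0399 GREEK CAPITAL LETTER IOTA, which is not Grapheme_Extend, so a cluster boundary appears before it.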

xwu · November 2, 2024, 9:21pm

For completeness, it looks like that ICU property doesn't include the Control and Extend property values (or the Hangul syllable types) mentioned in UAX#29 table 2, but it's not hard to check—we can also do these checks using UnicodeScalar APIs in Swift—and indeed it makes no difference to the conclusion.

But yes, Unicode is complicated

itaiferber · November 2, 2024, 9:25pm

Good catch — I meant to link to the source code, which does actually have the full list of matching properties. I'm not sure why the documentation is incomplete.

sspringer · November 4, 2024, 5:03am

BTW, in 2008 LATIN CAPITAL LETTER SHARP S (ẞ) was added to Unicode as the uppercase form of LATIN SMALL LETTER SHARP S (ß), and since 2024 it has been the recommended uppercase form. In Switzerland and Liechtenstein neither letter is used, but this is a separate issue. So actually there should be an update on this.