Is this a bug in Swift Stdlib Character isUppercase()?

rnantes · March 7, 2020, 7:32pm

I was looking at the tests for JSONEncoder, specifically testEncodingKeyStrategySnake at line 612 [https://github.com/apple/swift/blob/master/test/stdlib/TestJSONEncoder.swift#L612]. While comparing it against my own custom implementation, I realized that the Unicode character from the test (a Latin capital letter L with combining diacritics) was considered lowercase, as latinL.isUppercase returned false. This looks like a bug to me. Could someone confirm?

allevato · March 7, 2020, 8:16pm

This is a really weird case but isUppercase is working correctly. The problem is specifically with one of the combining characters. Let's take a closer look at them:

print(Array("L̥̖͎͓̪̫ͅ".unicodeScalars))

// ["L", "\u{0325}", "\u{0316}", "\u{034E}",
//  "\u{0353}", "\u{032A}", "\u{032B}", "\u{0345}"]

That last one, U+0345, is "COMBINING GREEK YPOGEGRAMMENI". It's a lowercase iota subscript.

If a Character is composed of multiple scalars (as this one is), then for isUppercase to be true, isCased must be true and the character must be unchanged when uppercase mappings are applied. Let's check both:

print(Character("L̥̖͎͓̪̫ͅ").isCased)  // true

print(Array("L̥̖͎͓̪̫ͅ".uppercased().unicodeScalars))

// ["L", "\u{0325}", "\u{0316}", "\u{034E}",
//  "\u{0353}", "\u{032A}", "\u{032B}", "\u{0399}"]

Note that the last scalar is not the same as before! The lowercase iota subscript was transformed to U+0399, "GREEK CAPITAL LETTER IOTA". So that particular character is not considered uppercase, and that's why isUppercase is evaluating to false for that character.
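As a rough illustration of the rule described above, here is a minimal sketch. It is not the standard library's actual source; `looksUppercase` is a hypothetical helper name, and the cluster is written out as explicit scalars so the trailing U+0345 is easy to see.

```swift
// Hypothetical helper that mirrors the rule described in this post:
// the character must be cased and unchanged by the uppercase mapping.
func looksUppercase(_ character: Character) -> Bool {
    let original = String(character)
    return character.isCased && original.uppercased() == original
}

// The cluster from the test, spelled out scalar by scalar.
let withIota = Character("L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}\u{0345}")
// The same cluster without the trailing U+0345.
let withoutIota = Character("L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}")

print(looksUppercase(withIota), withIota.isUppercase)       // false false
print(looksUppercase(withoutIota), withoutIota.isUppercase) // true true
```

Dropping the last scalar is enough to flip the result, which suggests the other six combining marks play no part in the case decision.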

rnantes · March 7, 2020, 8:25pm

@allevato Interesting. Then it looks like there may be a discrepancy between Character's isUppercase and CharacterSet.uppercaseLetters, which is used in JSONEncoder's _convertToSnakeCase().

allevato · March 7, 2020, 8:29pm

CharacterSet is unfortunately named in Swift, because it's really a set of UnicodeScalars, so it cannot make decisions about Characters that are composed of multiple scalars like the one above.
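To make the difference concrete, here is a small sketch comparing a scalar-level check with a Character-level one. It only illustrates the two APIs side by side; it makes no claim about how _convertToSnakeCase() walks the string.

```swift
import Foundation

// The cluster from the test, spelled out scalar by scalar.
let cluster = Character("L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}\u{0345}")

// CharacterSet can only answer questions about individual scalars.
let firstScalar = String(cluster).unicodeScalars.first!
let scalarLevel = CharacterSet.uppercaseLetters.contains(firstScalar)

// Character looks at the whole grapheme cluster.
let characterLevel = cluster.isUppercase

print(scalarLevel)    // true: "L" alone is an uppercase letter
print(characterLevel) // false: the cluster changes when uppercased
```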

I'm not really sure what the expectation should be for case transformation of a character like this one—linguistically it's completely nonsensical, because it combines a Latin character with a Greek diacritic. I'd argue that the behavior should be undefined, so maybe the test should be tweaked to be a bit more stable?

Is there even a rigorous specification that defines camel vs. snake case in the presence of full Unicode support, anyway? Should JSONEncoder be using Character properties instead of CharacterSet matching?

cc @itaiferber, since he's mentioned in the comment for that test.

SDGGiesbrecht · March 7, 2020, 8:58pm

Quoting rnantes: "... letter l with combining diacritics from the test was considered lowercase as the function latinL.isUppercase() returned false. This looks like a bug to me. Could someone confirm?"

The first six diacritics aren’t letters and have no case, which is why they do not change anything about how L (or any of the other Latin letters there) is considered.
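One way to check this per scalar is to look at Unicode.Scalar.Properties. The sketch below prints isCased and the uppercase mapping for each scalar in the cluster; the variable names are only for illustration.

```swift
// The cluster from the test, spelled out scalar by scalar.
let cluster = "L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}\u{0345}"

// One Bool per scalar: does Unicode consider it cased?
let casedFlags = cluster.unicodeScalars.map { $0.properties.isCased }

for scalar in cluster.unicodeScalars {
    // Format the code point as U+XXXX, padded to four hex digits.
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padded = String(repeating: "0", count: max(0, 4 - hex.count)) + hex

    // List the scalars produced by the uppercase mapping.
    let mapped = scalar.properties.uppercaseMapping.unicodeScalars
        .map { String($0.value, radix: 16, uppercase: true) }
        .joined(separator: " ")

    print("U+\(padded) cased: \(scalar.properties.isCased) uppercases to: \(mapped)")
}
```

Only the first and last scalars should report as cased, and only the last one should map to a different scalar.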

But the last “diacritic” is actually a letter, even though it is printed underneath and is part of the same “grapheme cluster” as far as Unicode is concerned when it is in sentence case:

sentence case	ᾅδης	03B1 02BB 0301 0345 • 03B4 • 03B7 • 03C2
title case	Αἵδης	0391 • 03B9 02BB 0301 • 03B4 • 03B7 • 03C2
uppercase “font”	ΑΙΔΗΣ	0391 • 0399 • 0394 • 0397 • 03A3

Logically, it means the test is invalid. It is (at the human level) equivalent to requiring the snake case of myGreatURLiRequest to be my_great_urli_request instead of my_great_ur_li_request.

However, as you can see from the chart, Unicode’s encoding of the letter is a huge mess for legacy reasons. I doubt a machine could do the “right” thing here no matter how hard you tried.

It would probably be wisest to pull that last scalar off of the test. But I don’t know if it is worth also trying to “fix” the implementation to match real human expectations (which your own re‐implementation happens to be closer to). We’re dealing with a letter that was officially abolished in 1982 and sees no use in any living language. No other Unicode characters work like this one either. Archeologists may use it in their papers, but the chances of it being used in a JSON key are basically 0 even in Greece.

Quoting rnantes: "then it looks like there may be a discrepancy between Character's isUppercase() and CharacterSet.uppercaseLetters"

These are not the same thing. To avoid repeating myself, please see here:

itaiferber · March 8, 2020, 1:18am

Regarding the origin of that string: from what I remember, the test case was generated by running the base string through a "Zalgo text" generator (e.g. https://www.zalgogenerator.com/) for some randomness. There was no particular intention behind any of the combining codepoints, and as far as I am concerned, @SDGGiesbrecht summarized my feelings better than I could: I think the scalar can safely be removed from the test.

rnantes · March 8, 2020, 2:20am

I've made a PR to remove the scalar from the above-mentioned strings in the tests: https://github.com/apple/swift/pull/30282