I was looking at the tests for JSONEncoder, specifically testEncodingKeyStrategySnake at line 612 [https://github.com/apple/swift/blob/master/test/stdlib/TestJSONEncoder.swift#L612]. I was comparing it against my own custom implementation and realized that the Unicode character for Latin capital letter L with combining diacritics from the test was considered lowercase, as the function latinL.isUppercase() returned false. This looks like a bug to me. Could someone confirm?
This is a really weird case, but isUppercase is working correctly. The problem is specifically with one of the combining characters. Let's take a closer look at them:
print(Array("L̥̖͎͓̪̫ͅ".unicodeScalars))
// ["L", "\u{0325}", "\u{0316}", "\u{034E}",
// "\u{0353}", "\u{032A}", "\u{032B}", "\u{0345}"]
That last one, U+0345, is "COMBINING GREEK YPOGEGRAMMENI". It's a lowercase iota subscript.
If a Character is composed of multiple scalars (as this one is), then for isUppercase to be true, isCased must be true and the character must be unchanged when uppercase mappings are applied. Let's check these:
print(Character("L̥̖͎͓̪̫ͅ").isCased) // true
print(Array("L̥̖͎͓̪̫ͅ".uppercased().unicodeScalars))
// ["L", "\u{0325}", "\u{0316}", "\u{034E}",
// "\u{0353}", "\u{032A}", "\u{032B}", "\u{0399}"]
Note that the last scalar is not the same as before! The lowercase iota subscript was transformed to U+0399, "GREEK CAPITAL LETTER IOTA". So that particular character is not considered uppercase, and that's why isUppercase is evaluating to false for that character.
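The rule described above can be sketched by hand. This is only an illustration of that rule using the standard library's String.uppercased() and String.lowercased(); the helper name sketchIsUppercase is hypothetical, and this is not claimed to be the actual standard library source. The character is written with scalar escapes so it is unambiguous.

```swift
// The character from the test, spelled out scalar by scalar.
let zalgoL: Character = "L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}\u{0345}"
// The same character without the trailing U+0345 (iota subscript).
let trimmedL: Character = "L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}"

// Hypothetical helper: "cased, and unchanged by uppercase mapping".
func sketchIsUppercase(_ c: Character) -> Bool {
    let s = String(c)
    // Treat a character as cased if either case mapping changes it.
    let isCased = s != s.uppercased() || s != s.lowercased()
    return isCased && s == s.uppercased()
}

print(sketchIsUppercase(zalgoL), zalgoL.isUppercase)
print(sketchIsUppercase(trimmedL), trimmedL.isUppercase)
```

The first line should report false for both checks, because uppercasing rewrites U+0345 to U+0399; without that scalar, nothing in the cluster changes under uppercasing.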
@allevato Interesting. Then it looks like there may be a discrepancy between Character's isUppercase() and CharacterSet.uppercaseLetters, which is used in JSONEncoder's _convertToSnakeCase().
CharacterSet is unfortunately named in Swift, because it's really a set of UnicodeScalars, so it cannot make decisions about Characters that are composed of multiple scalars like the one above.
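A small sketch of that difference, assuming Foundation is available: CharacterSet.contains(_:) takes a single Unicode.Scalar, so the set can only be asked about each scalar of the cluster in turn, while Character.isUppercase answers for the cluster as a whole. The variable names are illustrative only.

```swift
import Foundation

// The character from the test, spelled out scalar by scalar.
let zalgoL: Character = "L\u{0325}\u{0316}\u{034E}\u{0353}\u{032A}\u{032B}\u{0345}"

// CharacterSet works per scalar: ask about each scalar separately.
let scalarAnswers = zalgoL.unicodeScalars.map {
    CharacterSet.uppercaseLetters.contains($0)
}
print(scalarAnswers)

// Character works per grapheme cluster: one answer for the whole thing.
print(zalgoL.isUppercase)
```

The base scalar "L" is in the uppercase set on its own, which is how a scalar-based check and a Character-based check can disagree about the same text.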
I'm not really sure what the expectation should be for case transformation of a character like this one; linguistically it's completely nonsensical, because it combines a Latin character with a Greek diacritic. I'd argue that the behavior should be undefined, so maybe the test should be tweaked to be a bit more stable?
Is there even a rigorous specification that defines camel vs. snake case in the presence of full Unicode support, anyway? Should JSONEncoder be using Character properties instead of CharacterSet matching?
cc @itaiferber, since he's mentioned in the comment for that test.
letter l with combining diacritics from the test was considered lowercase as the function latinL.isUppercase() returned false. This looks like a bug to me. Could someone confirm?
The first six diacritics aren’t letters and have no case, which is why they do not change anything about how L (or any of the other Latin letters there) is considered.
But the last “diacritic” is actually a letter, even though it is printed underneath and is part of the same “grapheme cluster” as far as Unicode is concerned when it is in sentence case:
sentence case | ᾅδης | 03B1 02BB 0301 0345 • 03B4 • 03B7 • 03C2 |
title case | Αἵδης | 0391 • 03B9 02BB 0301 • 03B4 • 03B7 • 03C2 |
uppercase “font” | ΑΙΔΗΣ | 0391 • 0399 • 0394 • 0397 • 03A3 |
Logically, it means the test is invalid. It is (at the human level) equivalent to requiring the snake case of myGreatURLiRequest to be my_great_urli_request instead of my_great_ur_li_request.
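To make the analogy concrete, here is a minimal sketch of a Character-based snake-case conversion. It is a hypothetical helper (sketchSnakeCase), not JSONEncoder's actual _convertToSnakeCase, and it assumes one simple rule: start a new word before an uppercase letter that follows a lowercase letter, or that ends a run of uppercase letters and is followed by a lowercase letter.

```swift
// Hypothetical helper, not JSONEncoder's implementation.
func sketchSnakeCase(_ key: String) -> String {
    let chars = Array(key)
    var result = ""
    for (i, c) in chars.enumerated() {
        if c.isUppercase && i > 0 {
            let prev = chars[i - 1]
            let nextIsLower = i + 1 < chars.count && chars[i + 1].isLowercase
            // Word boundary: lower-to-upper, or the last capital of an
            // acronym when a lowercase letter follows it.
            if prev.isLowercase || (prev.isUppercase && nextIsLower) {
                result.append("_")
            }
        }
        result.append(contentsOf: c.lowercased())
    }
    return result
}

print(sketchSnakeCase("myGreatURLiRequest"))
// prints "my_great_ur_li_request"
```

Under this rule the "L" starts a new word because a lowercase letter follows it, which is the split a human reader would expect from the spelling.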
However, as you can see from the chart, Unicode’s encoding of the letter is a huge mess for legacy reasons. I doubt a machine could do the “right” thing here no matter how hard you tried.
It would probably be wisest to pull that last scalar off of the test. But I don’t know if it is worth also trying to “fix” the implementation to match real human expectations (which your own re-implementation happens to be closer to). We’re dealing with a letter that was officially abolished in 1982 and sees no use in any living language. No other Unicode characters work like this one either. Archeologists may use it in their papers, but the chances of it being used in a JSON key are basically 0 even in Greece.
then it looks like there may be a discrepancy between Character's isUppercase() and CharacterSet.uppercaseLetters
These are not the same thing. To avoid repeating myself, please see here:
Regarding the origin of that string: from what I remember, the test case was generated by running the base string through a "Zalgo text" generator (e.g. https://www.zalgogenerator.com/) for some randomness. There was no particular intention behind any of the combining codepoints, and as far as I am concerned, @SDGGiesbrecht summarized my feelings better than I could: I think the scalar can safely be removed from the test.
I've made a PR to remove the scalar from the above-mentioned strings in the tests: https://github.com/apple/swift/pull/30282