String(data:encoding:) doesn't seem to work with "advanced" encodings

tsolomko · October 27, 2021, 5:32pm

Hi,

The String.init?(data:encoding:) initializer seems to not work properly (or at all) with what I call "advanced" encodings. By "advanced" encodings I mean encodings that can be (only?) accessed via the String.Encoding.init(rawValue:) initializer.

Let's discuss two examples. The first example is the one-character string "☺" (U+263A Unicode scalar). According to this Wikipedia article, this string can be encoded as one byte 0x01 using CP437 aka DOS Latin-US encoding. As I understand from the output of String.availableStringEncodings.map { String.localizedName(of: $0) } this encoding is present in Swift as String.Encoding(rawValue: 0x8000_0400) (assuming that "Latin-US (DOS)", as reported by localizedName(of:), is indeed the same thing as CP437). So by doing the following:

let cp437 = String.Encoding(rawValue: 0x80000400) 
print(String(data: Data([0x01]), encoding: cp437)!)

I expect to get my smiley face, but instead I get "\u{1}", which is the exact same output you would get if you instead tried to use .utf8 encoding.

The second example is the following: "А" (Cyrillic letter, U+0410 Unicode scalar). Using the CP866 aka Cyrillic DOS encoding, it is encoded as one byte 0x80. In Swift this encoding can be found as String.Encoding(rawValue: 0x8000_0413). Doing a similar exercise:

let cp866 = String.Encoding(rawValue: 0x8000_0413)
print(String(data: Data([0x80]), encoding: cp866)!)

I get a mysterious "ђ" output, instead of my simple letter А.

So, I guess, my question is: am I missing something or is this a bug?

Jon_Shier · October 27, 2021, 5:46pm

String.Encoding is a bridged representation of the various NS<*>StringEncoding values from Foundation. It does not support any encoding that isn't already represented as a static value. Its init(rawValue:) is only exposed due to its public conformance to RawRepresentable. Most likely that initializer defaults to a particular encoding when you pass it a value it doesn't support. There are a bunch of CFStringEncodings that aren't exposed through String.Encoding, including .dosLatinUS, but I'll leave that conversion to you.

tsolomko · October 27, 2021, 6:12pm

Well, that means that String.availableStringEncodings is at best misleading, doesn't it?

Regardless, I am aware of CFStringEncodings. I've been using them in the past for this exact purpose, but since, as I understand, using CoreFoundation is discouraged on non-Darwin (and maybe non-Linux) platforms, I am trying other ways to access "advanced" encodings.

Anyhow, I tried to use CFStringEncodings again right now:

let CfCp437 = UInt32(truncatingIfNeeded: CFStringEncodings.dosLatinUS.rawValue)
print(CFStringIsEncodingAvailable(CfCp437)) // prints "true"
let cfstring = CFStringCreateWithCString(nil, [0x01, 0x00], CfCp437)
print(cfstring) // prints "Optional()"

let convertedToNsCfCp437 = CFStringConvertEncodingToNSStringEncoding(CfCp437)
print(convertedToNsCfCp437) // prints "2147484672" which is exactly 0x8000_0400
print(NSString(data: Data([0x01]), encoding: convertedToNsCfCp437)) // prints "Optional()"

Note that CoreFoundation's CP437 encoding seemingly converts into the exact same encoding that I've been trying to use in my original post. But more importantly, going through CFStringEncodings doesn't seem to work either.

Jon_Shier · October 27, 2021, 6:33pm

You're right, it does appear to be an available value, and the instance created from the raw value does seem to be the correct one. In my testing, applying the encoding does produce different results.

let data = Data("☺️".utf8)
let encoding = String.Encoding(rawValue: 0x80000400)
print(String(data: data, encoding: encoding))
print(String(data: data, encoding: .utf8))
print(NSString(data: data, encoding: encoding.rawValue))
print(NSString(data: data, encoding: String.Encoding.utf8.rawValue))

results in

Optional("Γÿ║∩╕Å")
Optional("☺️")
Optional(Γÿ║∩╕Å)
Optional(☺️)

Whether any of that is correct I don't know.

tsolomko · October 27, 2021, 7:17pm

Well, technically, this is correct, but I was trying to perform the reverse operation: decoding the byte representation of the smiley face in this encoding back into a string. Basically, converting 0x01 into ☺.

I've also noticed that the encoding works correctly for some bytes but not for others. For example,

print(String(data: Data([0xe2, 0x98, 0xba, 0x7f, 0x15]), encoding: cp437)!)

prints "Γÿ║ " instead of the expected "Γÿ║⌂§" (I had to use quotation marks when typing these to show the whitespace in the first result). Perhaps the issue is with those bytes that are normal characters in CP437 but control characters in UTF-8/ASCII (in this example, the last two bytes are DEL and NAK, respectively, in UTF-8/ASCII). So it looks like it mixes two encodings, but, honestly, I have no idea what is happening and why.

Martin · October 27, 2021, 8:14pm

I do not know if this table is the correct source, but it maps 0x01 to U+0001 and seems to be consistent with your other results:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT

0x01	0x0001	#START OF HEADING
0x15	0x0015	#NEGATIVE ACKNOWLEDGE
0x7f	0x007f	#DELETE
0x98	0x00ff	#LATIN SMALL LETTER Y WITH DIAERESIS
0xba	0x2551	#BOX DRAWINGS DOUBLE VERTICAL
0xe2	0x0393	#GREEK CAPITAL LETTER GAMMA
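
As a quick consistency check, the six rows above can be put into a small lookup table and applied to the bytes from the earlier example. This is only a sketch: `excerpt` and `decode` are hypothetical names, and the table covers just these six bytes, not the full code page.

```swift
// The six rows quoted above from CP437.TXT: byte -> Unicode scalar value.
let excerpt: [UInt8: UInt32] = [
    0x01: 0x0001, // START OF HEADING
    0x15: 0x0015, // NEGATIVE ACKNOWLEDGE
    0x7F: 0x007F, // DELETE
    0x98: 0x00FF, // LATIN SMALL LETTER Y WITH DIAERESIS
    0xBA: 0x2551, // BOX DRAWINGS DOUBLE VERTICAL
    0xE2: 0x0393, // GREEK CAPITAL LETTER GAMMA
]

// Hypothetical helper: decodes bytes using only the excerpt and
// returns nil as soon as a byte is not covered by it.
func decode(_ bytes: [UInt8]) -> String? {
    var scalars = String.UnicodeScalarView()
    for byte in bytes {
        guard let value = excerpt[byte], let scalar = Unicode.Scalar(value) else {
            return nil
        }
        scalars.append(scalar)
    }
    return String(scalars)
}

// Same bytes as in the earlier example.
let decoded = decode([0xE2, 0x98, 0xBA, 0x7F, 0x15])
print(decoded == "\u{0393}\u{00FF}\u{2551}\u{7F}\u{15}") // true
```

Under this mapping, 0x7F and 0x15 come out as the control characters DEL and NAK rather than ⌂ and §, which matches what String(data:encoding:) returned.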

tsolomko · October 27, 2021, 8:29pm

Hmm, the file from Unicode seems to be a more authoritative source than Wikipedia by itself, which would make Wikipedia wrong.

However, Wikipedia cites specifications from IBM, the authors of this encoding:

ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00437.pdf
ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00437.txt

These files corroborate the table from Wikipedia, and contradict the file from Unicode.

So, I don't know who is right in this case.

tera · October 27, 2021, 9:46pm

why are you doing this to begin with? is it to be compatible with documents created on pre-unicode systems or to save a few bytes or what?

tsolomko · October 28, 2021, 4:33pm

The former. In particular, ZIP file format's default encoding for string fields is CP437, and while it has facilities to support UTF-8, by nature of being a format used for archiving purposes, when parsing ZIP archives one should always expect to encounter a file created a long time ago, before UTF-8 support was clarified in the spec.

Anyhow, by looking further on the Internet, it seems CP437 is special in the sense that some of its ranges can be interpreted as either control characters or normal characters "depending on context". This last part is a bit vague, but I guess in light of this one can say that String(data:encoding:) is allowed to interpret them however it wants (i.e. not as normal characters).
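
If the graphical interpretation is the one that is wanted (as it might be for old ZIP entry names), one possible workaround is to post-process the decoded string and swap the control characters for the glyphs from the IBM/Wikipedia tables. The following is a minimal sketch under that assumption; `cp437Graphics`, `house`, and `applyingCP437Graphics(to:)` are hypothetical names, not anything Foundation provides.

```swift
// Glyphs that the IBM/Wikipedia CP437 tables assign to bytes 0x00...0x1F,
// indexed by byte value. Index 0 (NUL) is kept as-is.
let cp437Graphics: [Unicode.Scalar] = [
    "\u{0000}", "\u{263A}", "\u{263B}", "\u{2665}", "\u{2666}", "\u{2663}", "\u{2660}", "\u{2022}",
    "\u{25D8}", "\u{25CB}", "\u{25D9}", "\u{2642}", "\u{2640}", "\u{266A}", "\u{266B}", "\u{263C}",
    "\u{25BA}", "\u{25C4}", "\u{2195}", "\u{203C}", "\u{00B6}", "\u{00A7}", "\u{25AC}", "\u{21A8}",
    "\u{2191}", "\u{2193}", "\u{2192}", "\u{2190}", "\u{221F}", "\u{2194}", "\u{25B2}", "\u{25BC}",
]

// Glyph for byte 0x7F (HOUSE).
let house: Unicode.Scalar = "\u{2302}"

// Replaces U+0001...U+001F and U+007F in an already decoded string
// with the corresponding CP437 glyphs; everything else is left untouched.
func applyingCP437Graphics(to decoded: String) -> String {
    var scalars = String.UnicodeScalarView()
    for scalar in decoded.unicodeScalars {
        switch scalar.value {
        case 0x01...0x1F:
            scalars.append(cp437Graphics[Int(scalar.value)])
        case 0x7F:
            scalars.append(house)
        default:
            scalars.append(scalar)
        }
    }
    return String(scalars)
}

// The string that the earlier five-byte example decoded to.
print(applyingCP437Graphics(to: "\u{0393}\u{00FF}\u{2551}\u{7F}\u{15}")) // prints "Γÿ║⌂§"
```

Note that this also rewrites genuine tabs and line breaks, so it only makes sense for fields where control characters are not expected.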

With regard to the second example from my original post, apparently it is actually the CP855 encoding, not CP866, that is hiding behind the name "Cyrillic (DOS)", and it works as expected:

let cp855 = String.Encoding(rawValue: 0x8000_0413)
print(String.localizedName(of: cp855)) // prints "Cyrillic (DOS)"
print(String(data: Data([0xDD, 0xE1, 0xB7, 0xEB, 0xA8, 0xE5]), encoding: cp855)!) // prints "Привет"
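
For CP866 itself, a similar raw value can presumably be built from the corresponding CoreFoundation constant. The sketch below rests on two assumptions that I have not verified on every platform: that CFStringEncodings.dosRussian (which should be CP866) has the value 0x0419, and that setting the high bit turns it into an NSStringEncoding value, the same way 0x0400 became 0x8000_0400 earlier in the thread.

```swift
import Foundation

// Assumption: 0x0419 is the value of CFStringEncodings.dosRussian (CP866).
let cfDOSRussian: UInt = 0x0419

// Assumption: non-builtin CF encodings map to NSStringEncoding values
// by setting the high bit, as 0x0400 -> 0x8000_0400 did above.
let cp866 = String.Encoding(rawValue: 0x8000_0000 | cfDOSRussian)

// Shows the name Foundation reports for this raw value, if it knows it.
print(String.localizedName(of: cp866))

// 0x80 should be "А" (U+0410) in CP866; decoding may fail where the
// encoding is unavailable, so the optional is handled explicitly.
if let decoded = String(data: Data([0x80]), encoding: cp866) {
    print(decoded)
} else {
    print("CP866 is not available here")
}
```

Checking the name printed by localizedName(of:) first is a cheap way to avoid repeating the CP855/CP866 mix-up.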