How does JSONDecoder.decode() determine data's encoding (e.g. utf-8, utf-16, etc.)?

The second argument of JSONDecoder.decode() is of type Data. Since the data is read from a file or the network, it can be in different encodings, but the method doesn't provide an argument to specify the encoding. I wonder how it determines that? SE-0167 didn't discuss this. Could it be that the Data type (or NSData) can provide encoding information? But I can't find such APIs in their docs.

I did an experiment with a few random encodings in the code below. I found that .utf8, .utf16, and .windowsCP1250 worked fine, but .utf32 caused a Swift.DecodingError.dataCorrupted error.

import Foundation

struct GroceryProduct: Codable {
    var name: String
    var points: Int
    var description: String?
}

let json = """
{
    "name": "Durian",
    "points": 600,
    "description": "A fruit with a distinctive scent."
}
""".data(using: .utf8)!
// Experiment: change .utf8 to .utf16, .utf32, and .windowsCP1250

let decoder = JSONDecoder()
let product = try decoder.decode(GroceryProduct.self, from: json)

print(product.name)

Also, JSONEncoder.encode() doesn't provide an option for the output encoding. Does that mean it's supposed to generate UTF-8 output only? Not that this is an issue; I'm just trying to understand it.
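Here is a small sketch of how I could check the output side, assuming JSONEncoder always emits UTF-8 (which is exactly the assumption I'd like to confirm):

// If JSONEncoder always produces UTF-8, this conversion should never
// return nil, even for non-ASCII content.
let encoded = try JSONEncoder().encode(
    GroceryProduct(name: "Dürian", points: 600, description: nil)
)
print(String(data: encoded, encoding: .utf8) ?? "not valid UTF-8")
// prints something like {"name":"Dürian","points":600} (key order may vary)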

By definition JSON is only UTF-8. Any other encoding is invalid if it doesn’t happen to align with it.

3 Likes

The current JSON standard RFC 8259 (from 2017) requires that

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

The older RFC 7159 (from 2014) and RFC 7158 (from 2013) only stated that

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).

I had tested this a while ago, and it seems that (at least on Apple platforms, where JSONSerialization from the Foundation library is used internally) JSONDecoder correctly detects UTF-8, and also UTF-16 and UTF-32 with a byte order mark, e.g. .utf32BigEndian.

But I did not find that documented. It may also be different on non-Apple platforms.

I think that .windowsCP1250 worked in your example only by chance, because it is identical to UTF-8 for plain ASCII characters. If you change the string to "Düriän" then it will fail.
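For example (a sketch reusing the GroceryProduct type and imports from your post; I would expect a dataCorrupted error here):

// The non-ASCII characters become single bytes that are not valid UTF-8,
// and there is no BOM, so decoding should fail.
let cp1250JSON = """
{
    "name": "Düriän",
    "points": 600
}
""".data(using: .windowsCP1250)!

do {
    let product = try JSONDecoder().decode(GroceryProduct.self, from: cp1250JSON)
    print(product.name)
} catch {
    print(error) // expected: DecodingError.dataCorrupted
}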

4 Likes

Thanks @Martin and @Jon_Shier. I also searched the RFCs, but used an outdated version :sweat_smile:

Yes, indeed!

I also thought the implementation might try to detect the encoding, and that's the reason why I asked. I know that's possible because there are a few commands on Linux (e.g. enca and file) that do this, and they work quite well.

What I said above is not entirely correct. First, .utf16 and .utf32 prepend a byte order mark (BOM), whereas .utf16BigEndian and friends do not.
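Dumping the leading bytes shows the difference (a small sketch; hexPrefix is just a helper made up here, and the byte values shown assume a little-endian machine):

import Foundation

// Helper: hex-dump the first bytes of "{}" in a given encoding.
func hexPrefix(_ encoding: String.Encoding) -> String {
    "{}".data(using: encoding)!.prefix(8)
        .map { String(format: "%02X", $0) }
        .joined(separator: " ")
}

print(hexPrefix(.utf16))           // FF FE 7B 00 7D 00          (BOM first)
print(hexPrefix(.utf16BigEndian))  // 00 7B 00 7D                (no BOM)
print(hexPrefix(.utf32))           // FF FE 00 00 7B 00 00 00    (BOM first)
print(hexPrefix(.utf32BigEndian))  // 00 00 00 7B 00 00 00 7D    (no BOM)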

Second, from the source code at swift-corelibs-foundation/JSONSerialization.swift at main · apple/swift-corelibs-foundation · GitHub one can see that JSONSerialization is supposed to detect not only UTF-8, but also UTF-16 and UTF-32 with and without BOM.

This does in fact work in my tests (on macOS and Ubuntu), with the exception of .utf32 (UTF-32 with BOM). For some reason, data starting with FF FE 00 00 seems not to be detected as little-endian UTF-32.
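For completeness, here is a rough sketch of the kind of BOM / zero-byte detection I mean. This is only an illustration (detectJSONEncoding is made up for this post, not the actual Foundation code), and the comment about FF FE is just my guess, not something I verified against the source:

import Foundation

// Guess the encoding of a JSON document from its first bytes: look for a
// BOM first, then fall back to the zero-byte pattern of the first four
// bytes (JSON starts with an ASCII character, so the positions of the NUL
// bytes reveal UTF-16/UTF-32 and their endianness, as in RFC 4627, §3).
func detectJSONEncoding(_ data: Data) -> String.Encoding {
    let b = [UInt8](data.prefix(4))

    // BOM checks. Note that FF FE 00 00 (UTF-32LE BOM) also begins with
    // FF FE (UTF-16LE BOM), so the longer UTF-32 pattern must be tested
    // first; I don't know whether that ambiguity is related to the
    // behaviour described above.
    if b.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if b.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if b.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if b.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }

    // No BOM: infer the encoding from where the zero bytes are.
    if b.count == 4 {
        switch (b[0], b[1], b[2], b[3]) {
        case (0, 0, 0, _): return .utf32BigEndian
        case (_, 0, 0, 0): return .utf32LittleEndian
        case (0, _, 0, _): return .utf16BigEndian
        case (_, 0, _, 0): return .utf16LittleEndian
        default: break
        }
    }
    return .utf8
}

print(detectJSONEncoding("{}".data(using: .utf16BigEndian)!) == .utf16BigEndian) // true
print(detectJSONEncoding("{}".data(using: .utf8)!) == .utf8)                     // true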

2 Likes