JSON and "\u{FEFF}": unexpected behavior

kts · October 22, 2023, 4:02pm

I have a JSON file consisting of these eight bytes (including quotes): "\ufeff". Reading it with JSONSerialization results in an empty string (program below). Note that this isn't a byte-order mark at the start of the file, but the U+FEFF Unicode code point within a string. The same thing happens if this is the key in an Object (it loads as an empty string).

Is this a bug? Python gives what I'd expect, a length-one string: json.load(open(path)) == "\ufeff" is True.

import Foundation

let path = "/tmp/sample.json"

let data = try! Data(contentsOf:URL(fileURLWithPath:path))
print("data size: \(data.count)") //8
let obj  = try! JSONSerialization.jsonObject(with:data,options:[.allowFragments])
let val = obj as! String
print(val=="") //true
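For anyone who wants to reproduce this, here is a minimal sketch that writes the eight-byte file. The helper name sampleJSONBytes is made up for illustration, and the path is just the one used above.

```swift
import Foundation

/// Returns the eight bytes of the sample file: quote, backslash, u, f, e, f, f, quote.
/// (Hypothetical helper name, used only for this illustration.)
func sampleJSONBytes() -> [UInt8] {
    // A raw string literal keeps the backslash as-is instead of treating it as a Swift escape.
    return Array(#""\ufeff""#.utf8)
}

let samplePath = "/tmp/sample.json"
try Data(sampleJSONBytes()).write(to: URL(fileURLWithPath: samplePath))
print("wrote \(sampleJSONBytes().count) bytes to \(samplePath)")
```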

tera · October 22, 2023, 6:37pm

Looks like a bug indeed:

func testJSONSerialization() {
    let data = #""\ufeff""#.data(using: .utf8)!
    let obj  = try! JSONSerialization.jsonObject(with: data, options: [.allowFragments])
    let val = obj as! String
    print(val == "") // true
}

func testJSONDecoder() {
    let data = #""\ufeff""#.data(using: .utf8)!
    let val = try! JSONDecoder().decode(String.self, from: data)
    print(val == "") // true
}

itaiferber · October 22, 2023, 6:51pm

The behavior here is due to the fact that JSONSerialization comes from Objective-C, and parses NSStrings from the input.

NSString (whose underlying encoding is either ASCII or UTF-16) strips out leading U+FEFF as a BOM marker:

import Foundation

print("\u{FEFF}".count) // => 1
print(("\u{FEFF}" as NSString).length) // => 0

When parsing individual strings from the JSON, each leading U+FEFF will be stripped from those strings. Because JSON considers escaped characters to be identical to their underlying bytes, it appears that \u{FEFF} and \\uFEFF (in Swift notation) are stripped out the same (presumably because \\uFEFF is normalized to \u{FEFF} during parsing).
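A rough way to compare the two spellings on a given platform is to dump the scalar values of the parsed string. This is only a sketch: the helper name scalars(inJSONFragment:) is hypothetical, and what it prints depends on which Foundation is in use, so I'm not listing expected output.

```swift
import Foundation

/// Parses a JSON string fragment and returns its Unicode scalar values,
/// so an invisible U+FEFF is easy to spot. (Hypothetical helper name.)
func scalars(inJSONFragment data: Data) throws -> [UInt32] {
    let obj = try JSONSerialization.jsonObject(with: data, options: [.allowFragments])
    guard let string = obj as? String else { return [] }
    return string.unicodeScalars.map { $0.value }
}

let escaped = Data(#""\ufeffA""#.utf8)     // U+FEFF written as a JSON escape
let literal = Data("\"\u{FEFF}A\"".utf8)   // U+FEFF written as raw UTF-8 (EF BB BF)

for (label, data) in [("escaped", escaped), ("literal", literal)] {
    let values = try scalars(inJSONFragment: data)
    // Print each scalar in hex; a stripped U+FEFF shows up as a missing first entry.
    print(label, values.map { String($0, radix: 16, uppercase: true) })
}
```

If the leading U+FEFF is stripped, both lines should show only the scalar for "A".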

wadetregaskis · October 22, 2023, 6:53pm

I suspect this bug only occurs in private Foundation (the one used on Apple's platforms, as opposed to Linux & Windows), as the version in non-Darwin Foundation looks fine in this regard.

Which may not help much, but at least you know it's a bug in Apple's code (and in violation of the JSON spec).

itaiferber · October 22, 2023, 7:06pm

While not very helpful behavior, I'm not sure this is a spec violation exactly — the byte sequence is actually being parsed correctly, but happens to be stripped by NSString, best I can tell. I don't think the spec places any restrictions on what implementations do with the data after it's parsed.

Worthy of Feedback, to be sure. Improvements may be coming in the upcoming swift-foundation, but I haven't checked.

tera · October 24, 2023, 8:44am

Interpreting BOMs inside JSON substrings doesn't sound right to me. For example, this wouldn't be possible anyway:

// "A"
let data = Data([0xFE, 0xFF, 0x00, 0x22, 0xFF, 0xFE, 0x41, 0x00, 0x00, 0x22])
//               BOM         "           BOM         A           " 
let val = try! JSONDecoder().decode(String.self, from: data)

That is, you'd start with UTF-16 in one byte order and switch to the other within a string, or start with UTF-8 and switch to UTF-16 (or vice versa) within a string.

Interestingly both this:

let data = Data([0x00, 0x22, 0x00, 0x41, 0x00, 0x22])

and this:

let data = Data([0x22, 0x00, 0x41, 0x00, 0x22, 0x00])

are parsed correctly, as there's some automatic built-in endianness detection that doesn't require a BOM.

kts · October 24, 2023, 6:53pm

Interesting that those work! A little googling led me to this code, which shows that yes, it does try to guess the encoding. In these two cases it uses those 0x00 bytes to assume UTF-16 BE and LE. An extension to JSONSerialization has a method detectEncoding:

github.com

apple/swift-corelibs-foundation/blob/8a9b69b5041b5069360239cd18304fd8e2eb91d9/Sources/Foundation/JSONSerialization.swift#L313


      
                  } while stream.hasBytesAvailable
                  return try jsonObject(with: data, options: opt)
              }
          #endif
          }
          
          //MARK: - Encoding Detection
          
          private extension JSONSerialization {
              /// Detect the encoding format of the NSData contents
              static func detectEncoding(_ bytes: UnsafeRawBufferPointer) -> (String.Encoding, Int) {
                  // According to RFC8259, the text encoding in JSON must be UTF8 in nonclosed systems
                  // https://tools.ietf.org/html/rfc8259#section-8.1
                  // However, since Darwin Foundation supports utf16 and utf32, so should Swift Foundation.
                  
                  // First let's check if we can determine the encoding based on a leading Byte Ordering Mark
                  // (BOM).
                  if bytes.count >= 4 {
                      if bytes.starts(with: Self.utf8BOM) {
                          return (.utf8, 3)
                      }

(despite the current JSON spec saying JSON must be utf8 and can't have BOM)
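The BOM-less guess can be sketched roughly like this, using the zero-byte patterns described in RFC 4627 section 3. To be clear, this is not the swift-corelibs-foundation code: guessJSONEncoding is a made-up name, and falling back to UTF-8 for inputs shorter than four bytes is my own simplification.

```swift
import Foundation

/// Guesses the encoding of BOM-less JSON from which of its first four bytes are zero.
/// Relies on JSON text starting with ASCII characters. (Hypothetical helper, not Foundation API.)
func guessJSONEncoding(_ bytes: [UInt8]) -> String.Encoding {
    // Simplification: with fewer than four bytes, just assume UTF-8.
    guard bytes.count >= 4 else { return .utf8 }
    let zero = bytes.prefix(4).map { $0 == 0 }
    switch (zero[0], zero[1], zero[2], zero[3]) {
    case (true, true, true, false):  return .utf32BigEndian     // 00 00 00 xx
    case (false, true, true, true):  return .utf32LittleEndian  // xx 00 00 00
    case (true, false, true, false): return .utf16BigEndian     // 00 xx 00 xx
    case (false, true, false, true): return .utf16LittleEndian  // xx 00 xx 00
    default:                         return .utf8               // xx xx xx xx
    }
}

// The two BOM-less samples from earlier in the thread:
print(guessJSONEncoding([0x00, 0x22, 0x00, 0x41, 0x00, 0x22]) == .utf16BigEndian)
print(guessJSONEncoding([0x22, 0x00, 0x41, 0x00, 0x22, 0x00]) == .utf16LittleEndian)
```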

itaiferber · October 24, 2023, 7:15pm

In case you're curious, JSONSerialization was originally written against the ECMA-404 1st edition spec, which predated RFC 8259 (and the preceding RFC 7159), and does not make any assertions about encoding. (The latest ECMA-404 2nd edition spec that JSON.org points to still omits any encoding considerations, like the original RFC 4627.)

It does still support UTF-16 (with and without BOM) for backwards compatibility, though IIRC it's never produced anything but BOM-free UTF-8 data.

kts · October 25, 2023, 3:20am

Thanks for the insightful replies. I just wrote a bug report. I realize that, due to the behavior of NSString, it is not a simple bug to fix and could possibly be considered an implementation choice ("it is valid JSON and the decoder accepts it, we just choose to strip U+FEFF from the start of strings").

As I mention there, it would seem to be a bug if the round-trip:

String => [json encoder] => bytes => [json decoder] => String

doesn't give you back the original, since String and JSON strings are both meant to store any sequence of code points.
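A minimal sketch of that round-trip check, assuming JSONEncoder and JSONDecoder as the encoder and decoder. The helper name roundTrips is made up, and the string is wrapped in an array only so the top-level JSON value isn't a bare fragment.

```swift
import Foundation

/// Encodes a string to JSON, decodes it again, and reports whether the exact
/// sequence of Unicode scalars survived. (Hypothetical helper name.)
func roundTrips(_ original: String) throws -> Bool {
    let data = try JSONEncoder().encode([original])
    let decoded = try JSONDecoder().decode([String].self, from: data)
    guard let result = decoded.first else { return false }
    // Compare scalars rather than using ==, so nothing is hidden by string equivalence rules.
    return result.unicodeScalars.elementsEqual(original.unicodeScalars)
}

print(try roundTrips("hello"))
print(try roundTrips("\u{FEFF}hello"))  // the case in question; the result depends on the Foundation in use
```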

tera · October 25, 2023, 11:19am

It may not even be possible:

func testString() {
    let data: [UInt8] = [0xFE, 0xFF, 0xFE, 0xFF, 0x00, 0x41, 0x00, 0x42]
    // original:         FEFF        FEFF        0041        0042
    let string = NSString(bytes: data, length: data.count, encoding: NSUTF16StringEncoding)!
    print(string.length)
    for i in 0 ..< string.length {
        let ch = string.character(at: i)
        print(String(format: "%04X ", ch), terminator: " ")
    }
    print()
    // result: FEFF  0041  0042
}

IMHO this behaviour needs to be optional and ideally "opt-in" (or at the very least there should be a way to "opt-out").
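For comparison, here is a small sketch that decodes the same code units with native Swift String instead of NSString. It assumes you already have UTF-16 code units in host order, and decodeUTF16 is a made-up helper name. String(decoding:as:) treats U+FEFF as an ordinary scalar, so nothing is stripped.

```swift
import Foundation

/// Decodes UTF-16 code units into a Swift String without any BOM interpretation.
/// (Hypothetical helper name.)
func decodeUTF16(_ units: [UInt16]) -> String {
    return String(decoding: units, as: UTF16.self)
}

let units: [UInt16] = [0xFEFF, 0xFEFF, 0x0041, 0x0042]
let decoded = decodeUTF16(units)

// Every code unit, including both U+FEFF, is still present in the result.
print(decoded.utf16.count)
print(decoded.utf16.map { String($0, radix: 16, uppercase: true) })
```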