Towards swift-json 0.3: obsoleting json dictionary abstractions

hi all, as swift-json nears its third minor release, i just wanted to highlight an important API change coming in version 0.3:

obsoleting json dictionary abstractions

the early versions of swift-json modeled objects (things written in curly braces {}) as dictionaries of String keys and JSON values:

case array([Self])
case object([String: Self])

this is the most obvious representation for a JSON value, it’s convenient for 99% of use cases, and it dovetails nicely with the JSON.array(_:) enumeration case.

motivation

as i’ve used swift-json in more projects, and gotten more feedback from others, it’s becoming apparent this representation has some major downsides:

  1. it cannot handle duplicate keys in JSON input. JSON objects with duplicate keys are actually valid JSON, though they’re not very well-supported.

    right now, swift-json uses the value associated with the last occurrence of the key, or something that satisfies String.==(_:_:) with it.

    that last part is the source of a subtle class of bugs in highly dynamic JSON APIs, because JSON is defined in terms of UTF-8, but String.==(_:_:) compares strings by unicode canonical equivalence. so if an object vends separate keys for "\u{E9}" ('é') and "\u{65}\u{301}" (also 'é', perhaps because the JSON is being used to bootstrap a unicode table), one of the values will be dropped.

  2. it doesn’t preserve key-value ordering. usually we don’t care about key-value ordering when decoding, but problems crop up when we combine it with serialization. regenerating the same JSON can yield different text even though it encodes the same data, which can cause VCS spam if the datafile is version-controlled.

    applications that regularly read and write JSON persistence data are often affected by this issue.

  3. it is not efficient. many JSON API clients can be optimized to use no hash table lookups at all. some even implement fast-paths that attempt to decode key-value pairs at constant offsets if the source of the JSON emits it deterministically.

    this happens quite often in fintech applications that have to parse “firehose” JSON. i’m sure there are many other kinds of applications that are (or could be) doing things with JSON that simply aren’t currently feasible with the overhead of Decodable, or even [String: JSON].

    being a denser data structure, [(key:String, value:JSON)] also experiences slightly less heap fragmentation than [String: JSON].
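
to make problem #1 concrete, here’s a minimal, self-contained sketch. it uses Int values in place of JSON values, and the names (fields, dictionary) are made up for illustration; it is not code from swift-json.

```swift
// two spellings of 'é': precomposed (U+00E9) and decomposed (U+0065 U+0301)
let precomposed:String = "\u{E9}"
let decomposed:String = "\u{65}\u{301}"

// the two keys are different byte sequences in UTF-8...
let differentBytes:Bool = Array(precomposed.utf8) != Array(decomposed.utf8)
// ...but `String.==` considers them equal
let equalStrings:Bool = precomposed == decomposed

// the fields of an imaginary JSON object, in source order
let fields:[(key:String, value:Int)] = 
[
    (key: precomposed, value: 1),
    (key: decomposed, value: 2),
]

// a `String`-keyed dictionary has room for only one of the two values. 
// here we keep the last occurrence.
let dictionary:[String: Int] = Dictionary(fields.map { ($0.key, $0.value) }, 
    uniquingKeysWith: { _, last in last })

// the array of tuples keeps both fields, in the order they were written
let keys:[String] = fields.map { $0.key }
```

after this runs, dictionary has a single entry, while fields (and keys) still have two elements.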

proposed solution

starting with version 0.3, we’re going to change the payload of JSON.object(_:) from [String: JSON] to [(key:String, value:JSON)]. this matches the (key:value:) tuple shape of Dictionary’s own Element type.

case object([(key:String, value:Self)])

note that for performance reasons, JSON is @frozen, but swift-json makes no binary stability guarantees (yet).
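
as a rough illustration of what the array payload allows, here’s a sketch of the kind of constant-offset fast-path mentioned in problem #3. MiniJSON is a hypothetical, cut-down stand-in for the library’s JSON type, and price(of:) and the "price" field layout are assumptions made up for this example.

```swift
// hypothetical stand-in for the library's `JSON` enum, reduced to three cases
enum MiniJSON
{
    case number(Double)
    case string(String)
    case object([(key:String, value:Self)])
}

// reads the "price" field of an object. the fast path assumes the producer 
// emits "price" as the second field, and falls back to a linear scan if 
// that assumption does not hold.
func price(of json:MiniJSON) -> Double?
{
    guard case .object(let fields) = json 
    else 
    {
        return nil 
    }
    // fast path: no hashing, just one index and one string comparison
    if  fields.count > 1, fields[1].key == "price", 
        case .number(let value) = fields[1].value
    {
        return value
    }
    // slow path: linear scan, first numeric match wins
    for field in fields where field.key == "price"
    {
        if case .number(let value) = field.value 
        {
            return value 
        }
    }
    return nil
}
```

whether a fast-path like this is actually a win depends on how predictable the producer is, so treat it as a sketch rather than a recommendation.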

we’re going to deprecate [String: JSON]’s callAsFunction(as:) typecasting overload, but you can still use it for now, and its behavior is still the same.

@available(*, deprecated, message: 
    """
    handle duplicate keys explicitly with 
    `callAsFunction(as:uniquingKeysWith:)`
    """)
func callAsFunction(as _:[String: Self].Type) -> [String: Self]? 

in its place we’ll get an overload that returns [(key:String, value:JSON)]?, and one that returns a [String: JSON], but takes an explicit merging closure.

func callAsFunction(as _:[(key:String, value:Self)].Type) 
    -> [(key:String, value:Self)]? 

func callAsFunction(as _:[String: Self].Type, 
    uniquingKeysWith combine:(Self, Self) throws -> Self) rethrows
    -> [String: Self]? 
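
to show how these two overloads might behave, here’s a minimal sketch written against a hypothetical MiniJSON enum. the bodies are my assumption of a straightforward implementation built on Dictionary.init(_:uniquingKeysWith:), not the library’s actual source, and DuplicateKeyError is a made-up error type.

```swift
// hypothetical stand-in for the library's `JSON` enum
enum MiniJSON
{
    case number(Int)
    case object([(key:String, value:Self)])
}
extension MiniJSON
{
    // ordered view: returns the fields exactly as stored, duplicates included
    func callAsFunction(as _:[(key:String, value:Self)].Type) 
        -> [(key:String, value:Self)]?
    {
        guard case .object(let fields) = self 
        else 
        {
            return nil 
        }
        return fields
    }
    // dictionary view: the caller decides what happens to duplicate keys
    func callAsFunction(as _:[String: Self].Type, 
        uniquingKeysWith combine:(Self, Self) throws -> Self) rethrows
        -> [String: Self]?
    {
        guard case .object(let fields) = self 
        else 
        {
            return nil 
        }
        return try Dictionary(fields.map { ($0.key, $0.value) }, 
            uniquingKeysWith: combine)
    }
}

// made-up error type for callers that want to reject duplicate keys
struct DuplicateKeyError:Error 
{
}

let json:MiniJSON = .object(
[
    (key: "a", value: .number(1)),
    (key: "a", value: .number(2)),
])

// all fields, in order
let ordered:[(key:String, value:MiniJSON)]? = 
    json(as: [(key:String, value:MiniJSON)].self)
// keep the last occurrence of each key
let lastWins:[String: MiniJSON]? = json(as: [String: MiniJSON].self, 
    uniquingKeysWith: { _, new in new })
// treat duplicate keys as an error; `try?` turns the thrown error into `nil`
let strict:[String: MiniJSON]? = try? json(as: [String: MiniJSON].self, 
    uniquingKeysWith: { _, _ in throw DuplicateKeyError() })
```

the point of the explicit closure is that dropping a value becomes a decision made at the call site, instead of something that happens silently.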

impact on library users

anyone who is case-switching on .object(_) will experience source-breakage. swift-json is still experimental, so we are only going to be bumping the minor release number.
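
for example, code that used to subscript the dictionary payload could be migrated along these lines. this is only a sketch: MiniJSON is a hypothetical stand-in for the library’s JSON type, and name(of:) is a made-up helper. it uses last(where:) to keep the old last-occurrence-wins behavior.

```swift
// hypothetical stand-in for the library's `JSON` enum
enum MiniJSON
{
    case string(String)
    case object([(key:String, value:Self)])
}

func name(of json:MiniJSON) -> String?
{
    switch json 
    {
    case .object(let fields):
        // with a `[String: JSON]` payload, this would have been a subscript. 
        // with the array payload, we search the fields instead.
        if case .string(let name)? = 
            fields.last(where: { $0.key == "name" })?.value
        {
            return name
        }
        return nil
    
    case .string:
        return nil
    }
}
```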

people using swift-json through its Decoder/Decodable interface won’t see any changes, so this mostly affects users with high-performance, high-throughput requirements who are writing their own decoding logic. however, in the long term, these changes should benefit this use case via reduced parser overhead.

alternatives considered

to partially address the unicode aspects of problem #1, we could switch the key representation to [UInt8]. but we would probably lose more than we gain from this, and it wouldn’t do anything about problems #2 or #3.

to solve problem #2, we could escalate the payload to a data structure like OrderedDictionary from swift-collections. however, the overhead imposed by OrderedDictionary would be unacceptable to performance-sensitive users, and it wouldn’t do anything about problem #1 either. also, people might not want to depend on swift-collections just to use swift-json.

Possible alternative that supports duplicate keys

unfortunately, KeyValuePairs is not a good payload type, because it does not support a lot of useful RangeReplaceableCollection APIs, like append(_:). this isn’t that important for decoding, but it can be a hindrance when serializing JSON, since you would have to build an array and then copy it into KeyValuePairs storage.
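
here’s a small sketch of the difference, using Int values in place of JSON values. the variable names are made up for illustration.

```swift
// `KeyValuePairs` preserves order and duplicate keys, but it is written as 
// a literal and has no `append(_:)`
let literal:KeyValuePairs<String, Int> = ["a": 1, "b": 2, "a": 3]

// an array of key-value tuples can be grown one field at a time, which is 
// what a serializer usually wants
var fields:[(key:String, value:Int)] = []
fields.reserveCapacity(literal.count)
for (key, value) in literal 
{
    fields.append((key: key, value: value))
}
```

both collections end up holding the same three fields in the same order, but only the array could have been built incrementally.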