The future of serialization & deserialization APIs

I'll add a vote for something like @CodingDefault(.unknown). :hand_with_fingers_splayed:

Missing keys (and setting a default value) may be the single most common reason for needing to hand-implement Codable functions in network data structures in the projects I've worked on.
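
For a sense of the boilerplate being voted against, here's a minimal sketch of what handling a missing key with a default looks like today (Player and Team are hypothetical types invented for illustration):

struct Player: Codable {
    var name: String
    var team: Team

    enum Team: String, Codable { case red, blue, unknown }

    init(from decoder: any Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        name = try container.decode(String.self, forKey: .name)
        // decodeIfPresent returns nil when the key is absent, so a missing
        // "team" falls back to .unknown instead of throwing.
        team = try container.decodeIfPresent(Team.self, forKey: .team) ?? .unknown
    }
}

With something like @CodingDefault(.unknown) on team, that entire hand-written initializer could disappear.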

The future-proofing case (encountering an unexpected key) is also important, but I would rank it below handling missing keys.

3 Likes

Thank you so much for working on this

  • Another +1 for an architecture that can play nicely with more than one serialization format. When writing data exporters and importers exposed to users, it would be handy to have something that makes it easy to swap serialization formats at run time.

  • I would also +1 having a builder as part of this. I'm picturing a Serializable/SerializationRepresentation along the lines of Transferable/TransferRepresentation (sketched below for reference).
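
For anyone unfamiliar with the analogy, here is the existing Transferable pattern being referenced (Note is a made-up example type): a result-builder property lists one or more representations, and the system picks the appropriate one at run time.

import CoreTransferable
import UniformTypeIdentifiers

struct Note: Transferable, Codable {
    var text: String

    // Two representations; the most appropriate one is chosen at run time.
    static var transferRepresentation: some TransferRepresentation {
        CodableRepresentation(contentType: .json)   // structured export
        ProxyRepresentation(exporting: \.text)      // plain-text fallback
    }
}

A Serializable counterpart could presumably list multiple serialization formats the same way and defer the choice to run time.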

1 Like

Going back to the visitor example in the OP, it's possible to implement this as an extension to the existing interface, with Decoder implementations providing their own if they wish. I've included the complete implementation in a details block below the example.

Note that the example relies on the value being explicitly passed along with the key (via a SingleValueDecodingContainer), rather than relying on the decoding call being implicitly tied to the current key.

extension Decoder {
    public func keyedSequence<Key>(keyedBy type: Key.Type) throws -> any Sequence<(Key, any SingleValueDecodingContainer)> where Key : CodingKey
}
struct Person: Decodable {
    let name: String
    let age: Int

    init(from decoder: any Decoder) throws {
        var name: String?
        var age: Int?
        for (key, container) in try decoder.keyedSequence(keyedBy: CodingKeys.self) {
            switch key {
            case .name: name = try container.decode(String.self)
            case .age: age = try container.decode(Int.self)
            }
        }
        guard let name else { throw ValueNotSetError() }
        guard let age else { throw ValueNotSetError() }

        self.name = name
        self.age = age
    }
}
Complete Implementation
import Foundation

extension Decoder {
    public func keyedSequence<Key>(keyedBy type: Key.Type) throws -> any Sequence<(Key, any SingleValueDecodingContainer)> where Key : CodingKey {
        IteratorSequence(KeyedSingleValueDecodingIterator(keyedContainer: try container(keyedBy: Key.self)))
    }
}

struct KeyedSingleValueDecodingIterator<Key>: IteratorProtocol where Key: CodingKey {
    mutating func next() -> (Key, any SingleValueDecodingContainer)? {
        guard let key = keyIterator.next() else { return nil }
        return (key, KeyedSingleValueDecodingContainer(keyedContainer: keyedContainer, key: key))
    }

    var keyIterator: IndexingIterator<[Key]>
    var keyedContainer: KeyedDecodingContainer<Key>

    init(keyedContainer: KeyedDecodingContainer<Key>) {
        self.keyIterator = keyedContainer.allKeys.makeIterator()
        self.keyedContainer = keyedContainer
    }
}

struct KeyedSingleValueDecodingContainer<Key>: SingleValueDecodingContainer where Key : CodingKey {
    var codingPath: [any CodingKey] {
        keyedContainer.codingPath + [key]
    }

    var keyedContainer: KeyedDecodingContainer<Key>
    let key: Key

    func decodeNil() -> Bool {
        // The SingleValueDecodingContainer requirement is non-throwing, while the
        // keyed variant throws. `try!` is safe here because `key` came from
        // `allKeys`, so it is guaranteed to be present.
        try! keyedContainer.decodeNil(forKey: key)
    }

    func decode(_ type: Bool.Type) throws -> Bool {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: String.Type) throws -> String {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Double.Type) throws -> Double {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Float.Type) throws -> Float {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Int.Type) throws -> Int {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Int8.Type) throws -> Int8 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Int16.Type) throws -> Int16 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Int32.Type) throws -> Int32 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: Int64.Type) throws -> Int64 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: UInt.Type) throws -> UInt {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: UInt8.Type) throws -> UInt8 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: UInt16.Type) throws -> UInt16 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: UInt32.Type) throws -> UInt32 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode(_ type: UInt64.Type) throws -> UInt64 {
        try keyedContainer.decode(type, forKey: key)
    }

    func decode<T>(_ type: T.Type) throws -> T where T : Decodable {
        try keyedContainer.decode(type, forKey: key)
    }
}

struct ValueNotSetError: Error { }

struct Person: Decodable {
    let name: String
    let age: Int

    enum CodingKeys: CodingKey {
        case name
        case age
    }

    init(from decoder: any Decoder) throws {
        var name: String?
        var age: Int?
        for (key, container) in try decoder.keyedSequence(keyedBy: CodingKeys.self) {
            switch key {
            case .name: name = try container.decode(String.self)
            case .age: age = try container.decode(Int.self)
            }
        }
        guard let name else { throw ValueNotSetError() }
        guard let age else { throw ValueNotSetError() }

        self.name = name
        self.age = age
    }
}

let json = """
{
    "name": "Martha",
    "age": 6
}
"""

let data = Data(json.utf8)

let person = try JSONDecoder().decode(Person.self, from: data)

print(person)
2 Likes

Reworking of serialization is long overdue; thanks for raising this topic. My only input: the current design is so flawed that no redesign should be considered without a comprehensive look at issues that have been raised. For example, this old thread exposes an important gap that I don't think is addressed here.

5 Likes

Hello Dave! Thanks for your reply and input!

Which aspect of the mentioned thread are you hoping to see addressed? Built-in support in the API surface for deduplication / references, or support for stateful encoders/decoders (upon which deduplication could be built manually where desired)?

For the former, I'm genuinely curious about people's opinion: is this not better handled as a serialization format choice? For example, binary plist has limited support for this in the deduplication of leaf types. (Technically the format supports array/dictionary deduplication as well, but for various reasons, the built-in encoders don't do that today.) I'm not super well versed in these kinds of serialization formats, but I can only imagine there are many others that have first class support for internal value references.

In contrast, building this feature on top of, say, JSON, would lock you into a specific "dialect" of JSON that only your system is likely to be able to use, at which point you might be better served with a format with native support.

Happy to be further educated about real world scenarios here!

One absolute requirement here is that this behavior would need to be zero overhead unless it's actually used by a client.

My immediate need was generalized deduplication (independent of the use of classes, which I never want), but I'm pretty sure there are uses for generalized state in encoders, and I suspect generalizing deduplication probably implies the latter anyhow.

For the latter, I'm genuinely curious about people's opinion: is this not better handled as a serialization format choice? For example, binary plist has limited support for this in the deduplication of leaf types.

Now I'm confused because “the latter” is not about deduplication but about coder state.

That said, I don't see how deduplication is at all related to serialization format. I would want the same deduplication whether I was serializing as text or as binary or as something else. And I want to be able to easily switch between text and binary formats, for debugging purposes if nothing else.

I disagree, strictly speaking, about zero overhead, although it's almost certainly achievable. IMO a minimal O(1) cost should be totally acceptable. How you measure overhead is also a real issue: within a resilience domain, many things can be zero overhead that will have a cost when there's a resilience boundary.

My larger point was that some research is needed into serialization limitations that have been discussed in the past. I only knew about that thread because I participated, but there are probably others.

1 Like

(Sorry, I edited my response a bit before I sent it, and didn't change "latter" to "former", hah)

It's undeniable that some serialization formats have native support for deduplication where others do not, hence the binary plist example I gave. JSON has no native support for this; building deduplication on top of it requires building (or reusing) a scheme for doing so. An equivalent data structure encoded with and without deduplication/reference support would be two entirely different things, and the deduplicated one would not be readable unless the caller knows the deduplication scheme. In contrast, equivalent binary plists encoded with and without deduplication are readable by anyone.

Furthermore, suppose you do build/reuse a deduplication scheme on top of JSON, then decide you want to switch over straight to binary plist. You've now built custom deduplication on top of a format that already supports native deduplication! (Foundation keyed archives encoded in binary plist are a manifestation of this.)

I'm not saying the concept of text-based reference support isn't useful (Apple platforms have used and continue to use XML plist keyed archives), but I think it's generally more optimal to reach for formats that natively support this need, if possible.

Anyway, I don't want to dwell on this too much here, because I know it's not your main point, but I think we'd be better off exploring options in which features like de-duplication are composed on top of the core encoders/decoders rather than built into them, so that things like decoding "normal" JSON aren't burdened with irrelevant complexity. Whether that's through additional macro support or composable types, I can't say right now. (I'm vaguely thinking: build the deduplicated structure in memory, then pass that to the JSON encoder.)

My larger point was that some research is needed into serialization limitations that have been discussed in the past. I only knew about that thread because I participated, but there are probably others.

I've certainly done a fair bit of this research, but I will always welcome someone bringing up a concern they think isn't being properly addressed. Calling those up is one of the primary points of this thread.

2 Likes

Indeed. How could I express this in JSON?

[
	[1, 2, 3, 4, 5],
	[1, 2, 3, 4, 5]
]

I'd like to avoid the second copy, but how?

Or should it be possible at the Codable level, where formats that support it would use it, while others (like JSON) would either emit an error or fall back to a different representation?

I wasn't aware of binary plist's support for deduplication. But when you say the format supports deduplication, at some level all you're really saying is that the syntax of the format has a place to store whatever registry is necessary for that purpose; encoders/decoders for that format still have to solve all the same problems I would have to solve to do deduplication for text or CBOR or whatever other base format I'm interested in. The current system has no support for that, AFAICT, and if binary plists are coded without something like the horrors I had to implement for general deduplication, then it seems to me they must be using private APIs not accessible to the rest of us. It's crucial that the new system have a way to allow new encoders to be built on top of existing ones to support this functionality.

And I disagree that a format with built-in support for deduplication is important; the choice of formats with explicit deduplication support must be quite limited. There's no reason to assume the designer of binary plist would do a better job of it than I would, and no reason to assume I want to accept the other tradeoffs of that format.

I also disagree that “being locked into a specific dialect” is a problem, at least not for all applications. Anybody can read any JSON file without another layer of schema, but that doesn't mean the data will be useful to them. In general, every new data structure you serialize creates a dialect. Not every format even allows the minimal parsing of the structure that you get from using JSON as a base, in part because you may not want to pay the cost in archive size or en-/de-coding speed.

1 Like

If you want to do that, you need a schema on top of JSON that your coders interpret as meaning deduplication, e.g.,

{
    "shared": {"id-0": [1, 2, 3, 4, 5]},
    "content": ["id-0", "id-0"]
}

(That could be more compact by using numbers instead of strings for IDs; I used strings for clarity.)
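
As a rough Swift sketch of the encoding side of that idea (DedupedArchive and the "id-N" convention are invented for illustration, not proposed API):

import Foundation

struct DedupedArchive: Encodable {
    var shared: [String: [Int]] = [:]   // id -> shared payload
    var content: [String] = []          // references into `shared`

    mutating func append(_ value: [Int]) {
        // Reuse an existing id when an equal payload is already stored.
        if let existing = shared.first(where: { $0.value == value }) {
            content.append(existing.key)
        } else {
            let id = "id-\(shared.count)"
            shared[id] = value
            content.append(id)
        }
    }
}

var archive = DedupedArchive()
archive.append([1, 2, 3, 4, 5])
archive.append([1, 2, 3, 4, 5])   // stored as a reference, not a second copy
let data = try JSONEncoder().encode(archive)
print(String(decoding: data, as: UTF8.self))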

4 Likes

I would suggest that if your use of JSON requires compactness and deduping is important, piping the output through a streaming compression algorithm will probably give you statistically better results than anything you could hand-craft anyway, and it requires far less (new) infrastructural work.

I'm making some assumptions here of course, so if I'm wildly missing the point, ignore me. :smiley:
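
To make that concrete, a quick sketch on Apple platforms (assuming Foundation's NSData.compressed(using:) with zlib; any streaming compressor would do):

import Foundation

// A payload full of repeated arrays, like the example above.
let payload = [[Int]](repeating: [1, 2, 3, 4, 5], count: 1_000)
let json = try JSONEncoder().encode(payload)
let compressed = try (json as NSData).compressed(using: .zlib) as Data
// The repetition compresses away without any custom dedup schema.
print("raw: \(json.count) bytes, compressed: \(compressed.count) bytes")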

8 Likes

Compressing JSON only makes the encoded representation smaller; once it is decoded, everything is duplicated again in memory. With first-class deduplication you could also use Swift's copy-on-write (CoW) infrastructure to deduplicate in memory, which you get almost for free if you decode the shared array only once and copy that value around.

I think this will become a rather important feature if we want to embrace value types for serialization. Otherwise, just the act of encoding + decoding could increase the memory footprint quite drastically compared to the previous in-memory representation of CoW containers like Array, Set, Dictionary, and almost all other container types in the ecosystem. Swift value semantics + CoW make this optimization really safe to implement.

Additionally, this wouldn't just be a memory reduction. Array and other containers have a fast path that can reduce the time complexity of == from O(n) to O(1) when both operands share the same underlying storage. As Array and other container types are usually generic, and the element type could in turn be another container type, this can have even bigger effects, both in terms of memory savings and CPU time.
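
Illustrating that fast path with plain values (imagining a decoder that materializes the shared payload once and hands out CoW copies):

let shared = [1, 2, 3, 4, 5]   // decoded exactly once
let a = shared                 // no element copy; same underlying buffer
let b = shared
// Array's == checks storage identity first, so this comparison can return
// true without walking the elements.
print(a == b)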

8 Likes

@kperryua, can you maybe share some update on the progress of this effort?

Yes! It's been a crazy summer with lots of time away from this work, but I am planning a series of dev journal / update posts about important design decisions to show some of the progress I've made and provide opportunities for feedback along the way.

In general, I've been working a lot on the bleeding edge of ~Escapable & ~Copyable types and lifetime dependence, including use of Span and OutputSpan types. It's been a little bit of a bumpy ride, but the work has resulted in a handful of compiler fixes and some novel design ideas that should be generally beneficial.

Thank you for your interest, and please stay tuned!

29 Likes

Very cool, looking forward to it! By the way, I think a really cool improvement would be support for collecting “unknown fields” when decoding, for example, JSON. Ideally, these “unknown fields” would also be preserved when encoding the value back to JSON (and the order of keys would be maintained as well…).

// instead of
case .unknown: try decoder.skipValue()

// something like this
case .unknown(let key): additionalProperties.append(field: key, value: decoder.loadValue())

Of course, that would mean we'd have some kind of “JSONValue” type that captures the JSON value in a “raw” form.
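
A minimal sketch of such a type (the name JSONValue is assumed; note that a real implementation would need an order-preserving object representation, which the [String: JSONValue] dictionary below does not provide):

enum JSONValue: Codable, Equatable {
    case null
    case bool(Bool)
    case number(Double)
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])

    init(from decoder: any Decoder) throws {
        let container = try decoder.singleValueContainer()
        if container.decodeNil() { self = .null }
        else if let b = try? container.decode(Bool.self) { self = .bool(b) }
        else if let n = try? container.decode(Double.self) { self = .number(n) }
        else if let s = try? container.decode(String.self) { self = .string(s) }
        else if let a = try? container.decode([JSONValue].self) { self = .array(a) }
        else { self = .object(try container.decode([String: JSONValue].self)) }
    }

    func encode(to encoder: any Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .null: try container.encodeNil()
        case .bool(let b): try container.encode(b)
        case .number(let n): try container.encode(n)
        case .string(let s): try container.encode(s)
        case .array(let a): try container.encode(a)
        case .object(let o): try container.encode(o)
        }
    }
}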

1 Like

Yes! I mentioned this somewhat obliquely in the section about Codable compatibility:

To enable this, I definitely intend to support decoding unknown fields for self-describing formats like JSON or property list. This doesn't really work for non-self-describing formats, however, so keep that in mind!

2 Likes

Very cool!!!

Hi everyone, obviously this is not the equivalent, but if anyone is interested, you can check this out:

1 Like

I'm curious whether this includes the ability to decode ~Escapable & ~Copyable values into regular types. I'm working with a SQLite abstraction that vends rows as non-escapable/non-copyable values, and I would love to be able to use them with this new serialization API.

1 Like

I’m afraid I don’t have a clear picture of your use case.

Generalizing a little bit, is the core of your question about producing ~Escapable & ~Copyable types that reference the decoder’s input without copying it?

I do hope to eventually be able to allow decoded types to be / contain ~Escapable in certain cases, though I don’t have a clear vision yet of whether the language will easily allow that with the existing @_lifetime annotations. I definitely haven’t given much thought yet to what it would mean for ~Copyable to be specified on either inputs or outputs.

FWIW, my explorations with these concepts have been limited to making intermediate structures that are used during a decode ~Escapable and/or ~Copyable. For instance, when parsing a JSON object key (specifically one without any escapes), it’d be nice to give the client some variety of Span to identify the key and decode the corresponding value, instead of always instantiating a String.