The future of serialization & deserialization APIs

Developers might be willing to update their code to a new API, but updating the encoding scheme itself will in many cases be out of the question: there will be a tonne of serialized data already out there, and for web APIs there will be programs in other languages that already support that scheme and aren't going to be rewritten to accommodate changes in Swift.

Before finalising the design, perhaps you could create a repo for public submissions of existing JSON encoding schemes - similar to the Swift compatibility suite - that this system will need to be able to continue to support?

Also (related) some kind of built-in support for schema updates and migrations (similar to CoreData/SwiftData) would be a great feature, as this is another pain point in Codable.

Even just a way to specify a default value for new non-optional properties would reduce a lot of the need for adding manual decoder implementations to apps in post-1.0 releases.
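For context, here's what that boilerplate looks like with today's Codable: the entire init(from:) has to be written by hand just to supply one default (the type and key names here are purely illustrative).

```swift
import Foundation

struct Settings: Codable {
    var theme: String
    var fontSize: Int   // added post-1.0; old payloads lack this key

    private enum CodingKeys: String, CodingKey { case theme, fontSize }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        theme = try c.decode(String.self, forKey: .theme)
        // decodeIfPresent lets a pre-1.1 payload fall back to a default,
        // but it drags the whole initializer into hand-written code.
        fontSize = try c.decodeIfPresent(Int.self, forKey: .fontSize) ?? 14
    }
}

let old = Data(#"{"theme":"dark"}"#.utf8)
let decoded = try! JSONDecoder().decode(Settings.self, from: old)
print(decoded.fontSize) // 14
```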


Seconded.

The point about the format not changing is critical. This new approach really has to accept and work with the reality that, a large percentage of the time, we're consuming data generated by a server-side API, and it is not going to change to meet the needs of a new Swift library (used by only one of the server's clients).

Likewise, cyclic references and polymorphic fields are not unheard of.

The other points are good as well.

I hope the "foundation" tag indicates that it'll be an open source implementation like swift-foundation so that more developers can help find bugs and fix them. Though the idea that it be a separate package seems worthy of consideration.


In my eyes, any new solution should support Swift's Embedded mode. I don't see any blockers to supporting that with this approach, but feel it's worth mentioning.

While trying to create very similar macros previously, I've also found it hard to emit the right compiler errors/warnings into the source code. In particular, if you have nested types, those would (likely) need to conform to your protocols, yet a macro currently has no way to know whether members conform to certain protocols. That makes debugging these macros in user apps more painful.


So just touching on zero-copy representations for a minute (e.g. FlatBuffers/Cap'n Proto) - was that considered at all?

E.g. we are currently using an in-code Swift definition of our data models, which we use as a basis for generating FlatBuffers representations (the schema), glued back together with codegen. That gives us an abstract representation that can either be a zero-copy object (from e.g. an mmap() of a read-only memory page, or directly from the network) or an actual Swift type (struct) when just used locally. FlatBuffers really buys us zero-copy and evolution capability.

We do want to support zero-copy and we do want to support evolution of the serialised data format without breaking things - it seems this new support will give us the latter, but not the former?

Also, what level of flexibility is considered in scope? Would parsing/generating arbitrary formats (with proper macro annotations), similar to what e.g. Kaitai Struct supports, be something the design would try to support? E.g. if you have a wire format that is defined externally, say a fixed layout, or key/value pairs (but not JSON), is this something that should work?

Overall, I'm just trying to understand how wide a range of ser/des tasks is expected to be covered.


As happens with Codable today, we need the stdlib to be able to conform to the format-agnostic protocols, which means it isn't possible for that to live in a package.

As for where the JSONCodable protocols live, that is still TBD. I definitely see the opportunity here to lower this below Foundation to allow broader adoption. On the other hand, there may be an argument to continue to allow clients to set encoder-/decoder-wide encoding strategies for Foundation-only Data and Date types which can't be expressed outside of Foundation.

Perhaps there's some kind of "layered" approach where the core Data/Date-ignorant encoder/decoder live in a package, and another one in Foundation wraps it to add this support for Data/Date.

Excellent question. I'm trying to figure out if there's some way to express in a macro that an annotation should only apply if the type is not directly supported by the encoder being used. So for instance, if the same struct happens to be used for both JSONCodable and PropertyListCodable, only JSON would use the CodingFormat(.iso8601) annotation.
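Sketching what that might look like at the use site (purely hypothetical syntax; none of these names are settled):

```swift
// Hypothetical: one struct adopting two format-specialized macros.
@JSONCodable
@PropertyListCodable
struct Event {
    var name: String
    // Plists encode Date natively; ideally the annotation would apply
    // only to coders (like JSON) lacking direct Date support.
    @CodingFormat(.iso8601)
    var timestamp: Date
}
```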

I'm still figuring out the full extent of capabilities of macros, so I'd love to hear ideas in this vein.

Your design already wants to have format specializations. I imagine that right now it looks like this:

extension Date: JSONCodable {
	init(container: JSONContainer)
	func encode(to container: inout JSONContainer) throws
}

Instead you could implement specializations as their own type, still specific to both a data type and a container type:

struct ISO8601DateCoder: JSONCodingSpecialization { // "YYYY-MM-DD"
	func encode(_ date: Date, into container: inout JSONContainer) throws
	func decode(from container: JSONContainer) throws -> Date
}

struct UnitedStatesDateCoder: JSONCodingSpecialization { // "MM-DD-YYYY"
	func encode(_ date: Date, into container: inout JSONContainer) throws
	func decode(from container: JSONContainer) throws -> Date
}

and then @CodingFormat could take in a specialization.
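To make that concrete, usage might look like this (all names hypothetical):

```swift
// Hypothetical: the property-level macro selects a specialization type,
// which owns the encode/decode logic for that one property.
@JSONCodable
struct Receipt {
    @CodingFormat(ISO8601DateCoder.self)
    var purchased: Date
}
```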

Yes, supporting embedded mode is a desirable goal here. As we go through this process of designing the APIs, I will make sure it builds properly in embedded mode and will request feedback from experts in that area to avoid any potential pitfalls.

The design is currently free from existentials, and my understanding is that the limited use of metatype parameters as type hints is indeed compatible. This case is one reason why it will be very important to have a design that avoids the need for dynamic type casting like JSON and property list coders in Foundation currently do.


I would encourage considering the design for use with encoding/decoding CBOR and in particular how COSE extends that for subobject (partial object) encryption.

CBOR features heavily in forthcoming IETF standards and it would be great if Swift were a best fit for modern internet protocols.

There are two modes of "streaming" support to consider:

The first is what both Rust Serde and this design sans async will support: pulling bytes from something like a file descriptor into a buffer on demand through synchronous APIs like read().

The second is full blown async support that can pull additional bytes from anywhere, including asynchronous sources.
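A rough sketch of that second mode, assuming a hypothetical incremental JSONStreamDecoder (both it and process(_:) are invented names):

```swift
// Hypothetical API: feed(_:) buffers bytes and returns a decoded
// value whenever one becomes complete.
func decodeMessages(from bytes: some AsyncSequence<UInt8, any Error>) async throws {
    var decoder = JSONStreamDecoder() // hypothetical incremental decoder
    for try await byte in bytes {
        if let message = try decoder.feed(byte) {
            process(message) // hypothetical downstream handler
        }
    }
}
```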

I'm open to the idea, but I hesitate a little because I presume there will be some inevitable overhead imposed by suspension points at essentially every function call boundary, and the involvement of the Concurrency library. Happy to be proven wrong though—it shouldn't be too hard to sprinkle async throughout the prototype and make some measurements.

This should work just fine, as the expansions of @JSONCodable and @FooCodable should be completely separate and parallel extensions. One of the only intersection points to consider is the macro annotations on properties—ideally we'll be able to establish a common "vocabulary" of generic annotations that any format-specialized macro can pick up, like default-specifying macros or key name altering macros.

Perhaps it's poorly communicated, but this is one of the core tenets of the proposal: "format specialized" protocols. JSONCodable, PlistCodable, etc. should have full freedom to craft their interfaces around each format's individual needs and specialties.

At one stage, the "format specialized" protocols were the entirety of the design. However, while looking at adoption scenarios, I realized that this design presented a problem with "currency" types that are owned by frameworks/libraries but used by application-level serializable types.

The concrete scenario that stuck out to me was Range (to get specific, let's say Range<UInt64>). It's perfectly reasonable for a client's JSONCodable-compliant struct to want to include a Range<UInt64> as one of its serializable properties. However, Range lives in the standard library—it cannot conform to JSONCodable within the standard library. Well, then maybe the JSON package provides that conformance? It certainly could since the package is dependent on the stdlib. But that is neither a sustainable, nor generally applicable strategy. Suppose the stdlib adds another currency type that clients want to encode? Or suppose a client wants to encode a CGRect—the JSON package can't provide that conformance, and neither can CoreGraphics.
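For comparison, this is exactly the case today's format-agnostic Codable handles, because the stdlib owns the Range conformance:

```swift
import Foundation

struct Span: Codable {
    var bytes: Range<UInt64>
}

// No glue code needed: the stdlib's Range: Codable conformance
// describes its bounds abstractly, and JSONEncoder interprets them.
let data = try! JSONEncoder().encode(Span(bytes: 0..<1024))
let back = try! JSONDecoder().decode(Span.self, from: data)
print(back.bytes) // 0..<1024
```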

Hence the introduction of the format-agnostic protocols in parallel with the format-specialized ones. Range and CGRect can, in similar fashion to Codable, describe their serializable members abstractly, allowing a specific encoder/decoder to interpret those instructions. The difference from Codable being that we avoid all the OTHER downsides of Codable the OP describes.

A new JSONEncoder and PropertyListEncoder would have no problem taking values/types conforming to this format-agnostic protocol (I've been referring to it personally as CommonCodable, but this is very much a placeholder). Some formats (XML? CSV?) might be able to do the same with some compromises. Other formats might not be able to handle it at all. And that's OK. A specific format's encoder/decoder is allowed to omit CommonCodable support if that makes sense, but it means that clients of that format may need to do some extra work to make the types they want to serialize compatible.

I'm confused how this suggestion would fit into the overall design. Delegating directly to types the job of converting themselves to and from bytes seems like the opposite of what modern high-level, format-agnostic serialization APIs are trying to achieve.


Apologies for confusion with the Foundation tag.

The intent is to have the "format agnostic" protocols live in the stdlib—not a package—as stdlib types will want to adopt these protocols themselves.

However, it's certainly possible, and even likely, that PropertyListCodable ends up defined in swift-foundation, given how it defines the primitive types of Date and Data.

"Easy for third parties to write their own encoders and decoders" is certainly an important goal here. The macro reliance is a bit of a hurdle to deal with, but I'm hoping in common cases we'll be able to find ways to mitigate that.

It's only mentioned briefly in the OP, but I did reference something similar here:

I am developing generic Encoder, Decoder, and Container types that operate on format-specific primitive values, e.g. JSONPrimitive or PropertyListPrimitive.

This implies that, under this design, formats are encouraged to provide their own JSONValue or JSONPrimitive-esque types, one use of which is to support easier Codable compatibility. But they would certainly be usable in the scenario you describe here.

The catch is that using one of these in your type kicks you firmly out of "format-agnostic" mode, and further ties you to a single specific format. And that's probably exactly what you expect and want in the case you're describing. This would be indicated by your type conforming to JSONDecodable instead of (placeholder name!) CommonDecodable, which unlocks your ability to use whatever JSON-specific features the JSON package provides—something that couldn't be done easily in the forcibly format-agnostic Codable world.

This is a great suggestion. Would you mind sharing one example of something you'd expect to be submitted to such a suite?

It's worth noting that for JSON specifically (as well as other formats, like plist) we can and should certainly guarantee structural equivalence between both Encodable and JSONEncodable on identical structs. However, it's not really tenable to ensure byte-level equivalence where key order comes into play. (The upside is that JSONEncodable should guarantee predictable key ordering, where Codable + present day JSONEncoder does not.)
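To illustrate the ordering caveat with today's tools: the only determinism JSONEncoder offers is opt-in alphabetical sorting.

```swift
import Foundation

struct Point: Codable { var x: Int; var y: Int }

// Without .sortedKeys, key order in the output is unspecified;
// with it, the bytes become predictable (alphabetical), though not
// necessarily in declaration order.
let encoder = JSONEncoder()
encoder.outputFormatting = .sortedKeys
let data = try! encoder.encode(Point(x: 1, y: 2))
print(String(data: data, encoding: .utf8)!) // {"x":1,"y":2}
```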


Sure, I'll describe this a bit more.

There are two main purposes for serializing values:

  1. Serializing a value so another system can operate on the serialized form later
  2. Serializing a value so I can operate on the serialized form later

JSON, CSV, XML, ProtoBuf, etc all tend to be the first form. They're "currency formats" that are commonly used across a wide array of applications, so they serve as nicely-introspectable ways to communicate some structured information back and forth.

The other main scenario is more like "I'm going to hand this value to a framework so that it can hand it back to me later". In this case, the framework wants an opaque blob, because both the provider of the blob and the receiver of the blob are, if not the same application, at least both well known to each other and the framework is operating as a transport or storage layer.

NSKeyedArchiver was a nice way to get #2, because we could hand it values and say "encode this please" and we'd get an NSData out the other side that we could store and save and deserialize later. We couldn't do a whole lot with it, but that was also the point. If we got something that was NSCoding, we knew that it could be transported or persisted, and we could use protocols to extract semantic information from it.
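That opaque-blob round trip, for anyone who hasn't used it, looks like this (a minimal example archiving a single string):

```swift
import Foundation

// Hand a value to NSKeyedArchiver, get Data back, restore it later.
// The blob is opaque to us in between, and that's the point.
let blob = try! NSKeyedArchiver.archivedData(
    withRootObject: "hello" as NSString,
    requiringSecureCoding: true)
let restored = try! NSKeyedUnarchiver.unarchivedObject(
    ofClass: NSString.self, from: blob)
if let restored {
    print(restored) // hello
}
```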

I bring this up because Codable tried to be both. It tried to be an API for serializing to common formats (and while mostly successful, it had a myriad of sharp edges) as well as the way to describe an arbitrarily-serializable type.

In my experience, I've found that conflating these two kinds of serialization tends to make things overly complicated. If I'm writing a framework, I either want to receive values that I turn into blobs for storage, or I extract information from them via protocols. If I'm writing an app, I'll mostly care about the transport format of my types because the server on the other end of the wire wants a particular format.

Trying to shoehorn the two goals into the same API has always been pretty complicated from a client's side. Either I'm adopting a super abstract API when I really care about a specific format, or I'm trying to twist a specific API into supporting all kinds of abstract use-cases. That's why I'm suggesting that we split the API to support the cases separately. We have one API that can be very general and support the whole "A type can be serialized to an opaque format" use-case, and then packages to support particular formats and all of their respective idiosyncrasies. I think we'd be repeating past mistakes to try and make those two use cases be the same API again.

This is an excellent illustration of why the standard library shouldn't be trying to solve this abstractly. What if we're dealing with a format that, instead of storing an upper bound and a lower bound, stores a lower bound and a length? Or what if they're encoded at separate layers because, for that format, they mean different things? Or what if it's XML where these are values that will end up as named attributes inside a tag? Who's providing the name?
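To put the lower-bound-plus-length variant in today's terms: expressing it already requires a hand-written wrapper around Range, and every such format quirk multiplies this boilerplate (the wrapper and key names here are illustrative).

```swift
import Foundation

// A wire format that stores {start, length} rather than two bounds.
struct LengthEncodedRange: Codable {
    var range: Range<UInt64>

    private enum CodingKeys: String, CodingKey { case start, length }

    init(range: Range<UInt64>) { self.range = range }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        let start = try c.decode(UInt64.self, forKey: .start)
        let length = try c.decode(UInt64.self, forKey: .length)
        range = start ..< start + length
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: CodingKeys.self)
        try c.encode(range.lowerBound, forKey: .start)
        try c.encode(range.upperBound - range.lowerBound, forKey: .length)
    }
}

let data = try! JSONEncoder().encode(LengthEncodedRange(range: 10..<20))
let back = try! JSONDecoder().decode(LengthEncodedRange.self, from: data)
print(back.range) // 10..<20
```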