Serialization in Swift

XML has a similar problem: keyed containers have no concept of key order. Given some XML like this:

<book id="123">
    <id>123</id>
    <category>Kids</category>
    <title>Cat in the Hat</title>
    <category>Wildlife</category>
</book>

calling container.decode(String.self, forKey: .category) will always yield the first category in many implementations I've seen on GitHub, and I think that is mainly a failure of the Codable APIs.


Is there one that correctly supports reference types without making them proliferate?

I think that Codable prevents a library from implementing support for object graphs because they require two-pass decoding. Please see this thread.

This. Mantle uses NSValueTransformer to accomplish this sort of thing. It would look something like this in Swift:

struct Person: Decodable {
    let age: Int
    let name: String

    // Transforms whatever value was found under the age key into an Int.
    static var decodeAge: (Any) throws -> Int = { value in
        if let ageString = value as? String, let age = Int(ageString) {
            return age
        }

        // Throw an error or return a default value, etc.
        throw DecodingError.dataCorrupted(
            .init(codingPath: [], debugDescription: "age was not a numeric string")
        )
    }
}

Mirroring this behavior, I would like the runtime to see that a transformer for age exists on the type Person and use it to transform the value keyed by age before assigning it. More generally, this could use some new Transformer type that works similarly to NSValueTransformer and provides two-way transformation, rather than the one-way transform above. The standard library could (and should) provide default transformers for common transformations, such as converting between String and Int, allowing for something like this:

struct Person: Codable {
    let age: Int
    let name: String

    static var ageTransformer = Transformer.stringToInt
    // Or, ideally...
    // static var ageTransformer = Transformer<String, Int>()
}

These kinds of transformations are all too common in JSON APIs. Plenty of APIs provide numbers or strings in place of booleans (e.g. 0/1 or "true"/"false" instead of true/false), and right behind that is the most common problem of all: sending everything as a string, so that numbers have to be transformed from Strings before they can be used as numbers.
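For illustration, here is a rough sketch of the boilerplate such an API forces on you today (the Post type and its upvoted field are made up): a hand-written init(from:) that accepts a Bool, a number, or a string for a single flag is exactly the kind of code a transformer would eliminate.

struct Post: Decodable {
    let upvoted: Bool

    enum CodingKeys: String, CodingKey { case upvoted }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        if let flag = try? container.decode(Bool.self, forKey: .upvoted) {
            upvoted = flag                  // true/false
        } else if let number = try? container.decode(Int.self, forKey: .upvoted) {
            upvoted = number != 0           // 0/1
        } else {
            // "true"/"false" as a string; anything else fails with a decoding error.
            upvoted = try container.decode(String.self, forKey: .upvoted) == "true"
        }
    }
}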

Discourse's own API is extremely guilty of this and similar crimes, which made me give up on using Codable to write a Swift Forums app. To name one example, Discourse posts have a bookmarked field on them. IIRC, this field is null if you've never bookmarked that post, becomes true when you bookmark it, and becomes false when you un-bookmark it. With that in mind, whatever transformer solution we come up with should handle transforming not only from T to U, but also from T? to U. This API could look like any of these:

static var ageTransformer = Transformer.stringToInt.allowingNil
static var ageTransformer = Transformer<String?, Int>()
static var upvotedTransformer = Transformer<String, Bool>()
static var bookmarkedTransformer = Transformer<Bool?, Bool>()

Someone is probably going to say that property wrappers would be good for this sort of thing. They probably would be, but I have a feeling they would make debugging more difficult. The replies to this comment share my thoughts. So does this:

I would like to see a dedicated transformer type, unless someone has a compelling reason why we should take a different approach. That said, my approach is somewhat verbose. I'm curious to see what ideas others can come up with here. Reply to me with your ideas!


The only other thing I have to add is that every suggestion here should be taken seriously. As an example:

This may seem like a minor thing to the core team, but this feature is essential to minimizing boilerplate around a common task. I think everyone's biggest problem with Codable right now comes down to boilerplate. Codable seems like it was designed with the idea that the programmer controls the backend. When you do not control the backend, Codable quickly becomes a pain to use.


Edit: I forgot one thing.

Classes!

Codable has awful support for classes. This may also be a good opportunity to add synthesized initializers to subclasses.
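To make the pain concrete, here is a minimal sketch (the types are illustrative) of what a Codable subclass requires today: the subclass gets no synthesis at all, so init(from:) and encode(to:) have to be written by hand and must remember to call through to super.

class Animal: Codable {
    let name: String
    init(name: String) { self.name = name }
}

class Dog: Animal {
    let breed: String

    private enum CodingKeys: String, CodingKey { case breed }

    required init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        breed = try container.decode(String.self, forKey: .breed)
        // The superclass's synthesized init(from:) decodes its own properties.
        try super.init(from: decoder)
    }

    override func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(breed, forKey: .breed)
        try super.encode(to: encoder)
    }
}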


The thread you are referencing is talking about resolving cycles in object graphs, which is different from your initial example, which only talked about deduplication of references. The deduplication is possible today, even if there may not be any implementations that actually support it. The thread explains in a lot of detail why breaking cycles in NSCoding is possible and why (for good reasons) it is not as simple in Swift. Given a library with support for references, you can (as shown in the other thread) write init(from:) and encode(to:) implementations that break reference cycles when encoding and properly rebuild the object graph when decoding. What is not possible today is to do this automatically, but it is something that we can definitely consider as part of this effort.


In JSON, this is solved using an informal protocol of encoding a reference ID somewhere in the object graph. For example:

{
    "my_mother" : {
        "__id": 1,
        "name": "🚺"
    },
    "my_fave_person": {
        "__id": 1
    }
}

This isn't formalized by a spec, so various programs use different structures, different names for the id variable, some encode a class name, etc., but Codable doesn't easily support this in any form.

Fundamentally, the issue is that the Codable APIs don't give you a way to intercept object instantiation from the class of the object itself. Even if you did manage your own lookup table of object IDs mapped to existing object instances, you would have no way to use it. Unlike Objective-C's initializers, you can't return a different object from init(from: Decoder) in Swift. By the time that's been called, Codable has already instantiated a new instance for you, and there's nothing you can do about that.

Notice that the "deduplication" is a concern of the Person objects. You can work around this by writing custom init(from: Decoder) and encode(to: Encoder) functions, but they'll be scattered all over the place, because they don't live in the Person class, they have to live in every class that contains a Person. And since you can't customize the behaviour of Array<Person> or Dictionary<String, Person>, you also have to write custom encoders/decoders wherever those collections are stored, too.


Give me a third format capable of storing reference graphs and I'll change my judgment.

YAML.
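For the record, YAML's anchors and aliases can express shared references directly; the earlier example could be written as:

my_mother: &mother
    name: "🚺"
my_fave_person: *mother

where the alias *mother refers back to the node marked with the &mother anchor.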


Yes, it can be encoded in JSON. My point was that there is no actual support for it in the JSON standard (as you also mentioned).

The deduplication could happen in a library alone and does not require special support. When container.decode(Person.self, forKey: .x) is called, the container can check for the id and return either a cached reference, or call init(from:) to decode the object. If you only want specific references to be deduplicated, you can use property wrappers to achieve that. In fact, you should be able to use property wrappers + userInfo to achieve that without special support from the encoder/decoder.
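To sketch what that could look like (all names here are hypothetical, and this assumes duplicate occurrences carry only an "__id" key, as in the JSON example above):

final class ReferenceCache {
    var objectsByID: [Int: AnyObject] = [:]
}

extension CodingUserInfoKey {
    static let referenceCache = CodingUserInfoKey(rawValue: "referenceCache")!
}

@propertyWrapper
struct Deduplicated<Value: AnyObject & Decodable>: Decodable {
    var wrappedValue: Value

    private enum IDKeys: String, CodingKey { case id = "__id" }

    init(from decoder: Decoder) throws {
        let cache = decoder.userInfo[.referenceCache] as? ReferenceCache
        let id = try decoder.container(keyedBy: IDKeys.self).decode(Int.self, forKey: .id)
        if let cached = cache?.objectsByID[id] as? Value {
            wrappedValue = cached                   // Seen before: reuse the instance.
        } else {
            wrappedValue = try Value(from: decoder) // First occurrence: decode fully.
            cache?.objectsByID[id] = wrappedValue
        }
    }
}

The caller seeds the cache once per decode, e.g. decoder.userInfo[.referenceCache] = ReferenceCache(), and annotates the properties it wants deduplicated with @Deduplicated.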

Is there an intention to create a custom serialization format that would allow arbitrary objects to be transferred over the network, akin to how Java can serialize its bytecode? A feature like this would be reasonable, given that actors are being introduced, and Microsoft Orleans has a similar thing.

Ideally, I would imagine it being able to transfer arbitrary functions, including generic ones, so that the target can utilize hardware that is absent on the source. For example, I write a function that uses parallelization to process images. I send it over the network to another participant who has a powerful GPU, and on that computer it processes images faster than on mine.

That's not what Orleans does; it serializes messages that are delivered to remote actors -- so plain old value types that today's Codable could handle as well. It is not able to serialize actors (no actor runtime is), but it can pull some tricks to make it seem as if it were -- by storing and restoring state to some database, etc.

Regarding Java's serializing and loading classes at runtime -- this is widely seen as a huge security vulnerability and not something we should be aiming to replicate without a lot of signing and other guarantees about the trustworthiness of such code. This is very much outside the scope of normal serialization work, which is what we're focused on here.


I want to double down on this. Object serialisation is a famous source of exploits, and the risk is inherent in the data format. Any serialisation format that can be turned directly into code drives a giant, mile-wide threat surface into the center of any program that uses it. It also provides extremely useful gadgets for exploitation, even in cases where the remote code ingest is appropriately defended. It cannot be secured and we should not provide it.


Some things I've wished for. This repeats quite a few ideas from up-thread, but is hopefully a bit useful nonetheless. The last one is not a wish I've seen elsewhere, which makes it either interesting or too niche to bother with.

  • Compiler assistance for ensuring encode/decode round-trips. If I write a custom init(from:) I'm reminded to set all properties, which is great, but my custom encode(to:) won't remind me to write them all. This is a big reason to lean heavily on the compiler-provided implementations.
  • Per-property configuration (so I don't have to write that custom encode(to:)) of
    • encoding strategies (currently possible with property wrappers, but with the downside that these are exposed as API)
    • default values (not currently possible with property wrappers in the desirable form @Decode(default: "some default value"); a sketch of today's workaround follows after this list)
    • key names (it's not a huge deal, but it would be nice if we didn't have to write a full set of CodingKeys when only one key needs to be adjusted; this is much less important, though, because the compiler error when the CodingKeys get out of sync with the properties keeps us safe)
  • The possibility to provide multiple serializations for a single type (i.e. to be able to say the equivalent of "FlatUser and DeepUser each provide a Codable conformance for User, which we'll use in different contexts" without too much boilerplate and copying-things-around) -- obviously this wouldn't literally be implemented as multiple conformances, and I'm not at all sure how it would look, but I guess at least the use case is fairly clear.
  • A container-details-preserving Swift-native representation format with encoder and decoder, that's also convenient for structural transformations.
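Regarding the default-values bullet, here is a sketch of today's workaround (names are made up): because the synthesized init(from:) can't pass arguments to a property wrapper, the default has to be baked into the wrapper type rather than written inline, and a KeyedDecodingContainer overload is needed so a missing key also falls back to the default instead of throwing.

@propertyWrapper
struct DefaultUnknown: Decodable {
    var wrappedValue: String

    init(wrappedValue: String = "unknown") { self.wrappedValue = wrappedValue }

    init(from decoder: Decoder) throws {
        // A null or mistyped value falls back to the default.
        wrappedValue = (try? decoder.singleValueContainer().decode(String.self)) ?? "unknown"
    }
}

extension KeyedDecodingContainer {
    func decode(_ type: DefaultUnknown.Type, forKey key: Key) throws -> DefaultUnknown {
        try decodeIfPresent(type, forKey: key) ?? DefaultUnknown()
    }
}

struct User: Decodable {
    @DefaultUnknown var displayName: String
}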

Let me say a bit more about that last bullet. Suppose I get JSON formatted as a list of named objects [{"name": "property1", "value": {...}}, {"name": "property2", "value": {...}}] and I want to decode it into a Swift type with properties property1 and property2, etc. It's possible with current Codable, but it's not very comfortable; my current ad-hoc solution involves re-encoding as JSON after collecting the property:value dictionary, because there's no ready-made intermediate format but JSONDecoder is easy and available. To encode the Swift object back into the list-of-properties format I made a custom Encoder; again this is possible but not comfortable (in particular you can run into difficulties depending on how the property values are supposed to be serialised). For this particular (fairly odd) use case it would be great to have an encoder/decoder pair serializing to a mutable Swift-native format that preserves all the naming information, so that the transformation "list-of-named-objects to/from dictionary" isn't spread out across multiple encoders, decoders, and Codable conformances.
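Concretely, the ad-hoc approach looks roughly like this (the function name is made up, and it assumes the values are plain JSON objects the target type can already decode): collapse the list into an ordinary keyed JSON object with JSONSerialization, then hand the result to JSONDecoder a second time.

import Foundation

func decodeNamedObjectList<T: Decodable>(_ type: T.Type, from data: Data) throws -> T {
    let list = try JSONSerialization.jsonObject(with: data) as? [[String: Any]] ?? []
    var keyed: [String: Any] = [:]
    for entry in list {
        if let name = entry["name"] as? String, let value = entry["value"] {
            keyed[name] = value
        }
    }
    let rekeyed = try JSONSerialization.data(withJSONObject: keyed)
    return try JSONDecoder().decode(T.self, from: rekeyed)
}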

(I wonder how this relates to reflection, also: if you squint enough, the details you would want in that intermediate format look quite a lot like what Mirror provides.)

Is this thread focused narrowly on the Codable APIs, or are we also discussing the JSONEncoder and PropertyListEncoder that likely account for the vast majority of Codable usage?

The deduplication could happen in a library alone and does not require special support.

It could, but in the case of the JSONEncoder we ship out of the box, it doesn't.

This thread is focused on the entirety of serialization in Swift, so everything is fair game.

My understanding is that it is about the higher level serialization support and not about details of specific implementations. The goal is to improve the overall serialization support in Swift, so that library developers have an easier time implementing specific formats and users have ways to customize their format, without having to sacrifice the comfort of code synthesis.

Maybe @tomerd can clarify.

I would argue that it is questionable whether JSONEncoder/Decoder should offer non-standard functionality like this. As I mentioned in my previous post, it can be implemented using property wrappers and userInfo, if you want to have this functionality.


Same problem without using (directly) reference types at all.

import Foundation

struct DataAnalizer : Codable {
	let data		: [Double]
	let criteria	: Int

	init( data : [Double], criteria : Int ) {
		self.data		= data
		self.criteria	= criteria
	}
	
	func doAnalisys() -> Double {
		/* I do some analysis on the data according
		to the criterion and return the result */
		return 0.0
	}
}

struct FullAnalisys  : Codable {
	let analizers		: [DataAnalizer]

	init( data:[Double], rangeOfCriteria:Range<Int> ) {
		/* I prepare a set of different analyses of the
		same _SHARED_ data. Even with a very large array this is
		not a problem because the data is not duplicated */
		analizers	= rangeOfCriteria.map { DataAnalizer(data: data, criteria:$0) }
	}
	
	func doAnalisys() -> [Double] {
		/* I perform all the analyses in the set */
		return analizers.map { $0.doAnalisys() }
	}
}

let veryLargeArrayOfData = [Double](repeating: 0.0, count: 1_000_000 )
let fullAnalisys	= FullAnalisys(data: veryLargeArrayOfData, rangeOfCriteria: 0..<100 )

/* Now suppose we want to save the set of analyses
using any of the Foundation coders */

// The same data array will be saved 100 times...
let data	= try! JSONEncoder().encode( fullAnalisys )
print( "\(data.count) bytes written!" )	// 200002505 bytes written!

/* And when I read the data I find myself with
a perfectly functional app whose data takes up
100 times more memory because the data array is
no longer shared */
let decodedAnalisys	= try! JSONDecoder().decode( FullAnalisys.self, from: data )

Should Foundation provide a coder that doesn't have this problem?

It is not clear to me how it could under the current design or even most proposed designs. This would require deep introspection of your types to understand that two units of data are CoW types and so can alias the same storage, and that doing so would provide a meaningful improvement.

Isn't the data of the array contained in an inner reference type that is shared by all arrays pointing to the same data, and duplicated only on mutation thanks to CoW?
If the coder had the ability to save the same reference type no more than once (on subsequent encounters it would save only an identifier, for example its address), wouldn't this also work for the storage contained in an array/set/dictionary/etc.?

Sorry for my english ...

Yes, but the problem is "all arrays that point to the same data". When decoding a JSON object, the decode is done incrementally, one element at a time. Reducing the memory usage would require scanning all previously decoded elements after each element is decoded, identifying elements that are equal, and aliasing them together. This can work, but the other criterion would normally be that those objects must not be reference types: if they're reference types then there is now unexpected aliasing in the object graph that can produce nasty action-at-a-distance bugs.

Additionally, this "after each decode scan all decoded objects" is the very definition of a quadratic-performance algorithm. It will be very slow for large object graphs. You could do it after only the final decode, but that provides less help than you'd think because peak memory usage is still high. And, of course, you could just write such code yourself!


Writing:
I use a set to hold the id (i.e. the address, 8 bytes) of each reference type I encounter.
Whenever I need to save a reference type: if its id is already in the set, I save only the id; otherwise I save the id and all its fields and add the id to the set.

Reading:
I use a dictionary that maps each id (i.e. the address) I read to the reference type built from it.
The first time I encounter a reference type I have all its fields available to build it, and I store it in the dictionary under the id I read.
All subsequent times I encounter only the id, but that is the key for retrieving the reference type from the dictionary.

O(1) per object.

I'm not inventing anything: NSCoder works like this.

Note: I am not saying that the JSON Coder must work this way, I am saying that Foundation should provide at least one Coder capable of doing this, choosing an appropriate format.
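For concreteness, here is a minimal sketch of the encoder-side bookkeeping just described (the types and names are illustrative, not NSCoder's):

final class EncodedReferenceTable {
    private var idsByObject: [ObjectIdentifier: Int] = [:]
    private var nextID = 0

    /// Returns the id to write for `object` and whether its fields still need encoding.
    func register(_ object: AnyObject) -> (id: Int, needsEncoding: Bool) {
        let key = ObjectIdentifier(object)
        if let existing = idsByObject[key] {
            return (existing, false)   // Seen before: write only the id.
        }
        let id = nextID
        nextID += 1
        idsByObject[key] = id
        return (id, true)              // First encounter: write the id plus all fields.
    }
}

The decoding side mirrors this with a [Int: AnyObject] table: the first time an id appears its fields are present, so the object is built and stored; every later occurrence of the id just looks up the existing instance.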

XML is not a data serialization format. Property Lists define (one of many) ways to serialize and deserialize data via XML, by placing several restrictions on the open-endedness of XML. That is why there is an included Property List implementation, and not one for XML. The semantics of "XML" really have to be defined by an actual document format built on top of it.