Codable != Archivable

QuinceyMorris · March 27, 2018, 3:56am

In SE-0167 we find this statement about NSKeyedArchiver and NSKeyedUnarchiver:

Although our primary objectives for this new API revolve around Swift, we would like to make it easy for current consumers to make the transition to Codable where appropriate. As part of this, we would like to bridge compatibility between new Codable types (or newly-Codable-adopting types) and existing NSCoding types.

To do this, we want to introduce changes to NSKeyedArchiver and NSKeyedUnarchiver in Swift that allow archival of Codable types intermixed with NSCoding types

(Yes, I’m squarely in Cocoa-land, because that’s one prominent scenario where the problem shows up. However, this is definitely a Swift issue, not a Cocoa one.)

The problem is that this stated goal isn’t achievable by encodeEncodable and decodeDecodable as currently implemented. As a practical exercise, I took a very ordinary, simple data model that was previously being archived for NSDocument, and changed all the NSCoding conformances to Codable. When I tried to save the document, my app crashed — crashed big, with infinite recursion leading to a stack overflow.

It turns out that archiving/unarchiving via Codable through keyed archivers/unarchivers doesn’t respect the normal archiver convention of object reference identity. That is, reference type instances in NSCoding are unique within the archive as a whole. Codable, on the other hand, archives or unarchives a new instance at every reference encountered in the object graph. It crashes because typical data models in Cocoa apps have circular chains of references. (For example, there is a circular chain between a parent object with an owning reference to a child object that has a [weak] back reference to the parent.) These are unproblematic in NSCoding, but fatal in Codable.
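The identity loss described above can be reproduced without a cycle. This is a minimal sketch, assuming Foundation's JSONEncoder and JSONDecoder rather than the keyed-archiver path from the post; the `Node` and `Pair` types and the function name are invented for illustration.

```swift
import Foundation

// Hypothetical types, not taken from the original post.
final class Node: Codable {
    var name: String
    init(name: String) { self.name = name }
}

struct Pair: Codable {
    var first: Node
    var second: Node
}

/// Encodes a `Pair` whose two properties reference one shared `Node`,
/// decodes it again, and reports whether the sharing survived.
func sharingSurvivesRoundTrip() throws -> Bool {
    let shared = Node(name: "shared")
    let original = Pair(first: shared, second: shared)

    let data = try JSONEncoder().encode(original)
    let decoded = try JSONDecoder().decode(Pair.self, from: data)

    // `===` compares object identity, not contents.
    return decoded.first === decoded.second
}
```

The function should return false: the shared node is written out once per reference and read back as two separate instances. Replace the shared leaf with a parent/child cycle and the same per-reference encoding never terminates, which matches the stack overflow reported above.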

This particular crash isn’t very difficult to solve, at least in principle. It’s possible to write archive encoders and decoders that check for reference types, and maintain reference identity in the archive.
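One way such an encoder might track identity is a table that assigns each object an integer ID the first time it is seen. The sketch below is only that bookkeeping piece, under the assumption that the encoder writes the full object on first sight and just the ID afterwards; `ReferenceTable` and `register(_:)` are hypothetical names, not part of Foundation.

```swift
/// Hypothetical helper: hands out a stable integer ID per object instance.
final class ReferenceTable {
    private var ids: [ObjectIdentifier: Int] = [:]

    // Strong references keep every registered object alive, so an
    // ObjectIdentifier cannot be reused by a new object mid-encode.
    private var objects: [AnyObject] = []

    /// Returns the object's ID and whether this is the first time it was seen.
    /// An encoder would write the object's contents only when `isNew` is true.
    func register(_ object: AnyObject) -> (id: Int, isNew: Bool) {
        let key = ObjectIdentifier(object)
        if let id = ids[key] {
            return (id, false)
        }
        let id = objects.count
        ids[key] = id
        objects.append(object)
        return (id, true)
    }
}
```

Because a cyclic reference hits an already-registered object, the encoder would emit an ID instead of recursing, which is what breaks the infinite loop.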

Unfortunately, there is a second problem that’s not so easy to deal with. Reference identity is fine for encoding/archiving, but it isn't sufficient for decoding/unarchiving, because of Swift’s initializer rules. An object being decoded, A, might need to store a reference to another object, B, but, if there is a chain of references from B back to A, B cannot be initialized because references to A are not available until A returns from its initializer — which it can’t generally do without the reference to B.

(Obj-C doesn’t have this problem, because it happily passes around references to partially constructed instances.)

I thought at first this was insoluble, but I realized that the only way for a circular chain of references to exist is if one of them (at least) is an optional type. AFAICT, a circular chain of non-optional references cannot be created in Swift at all. Necessarily, one of them must be of optional type.

That opens up the possibility for a two-pass unarchiving process. In the first pass, objects would be allowed only to set stored properties (typically, non-optional references, along with non-optional value types that contain non-optional references), but not to use any stored properties (since they're not fully initialized yet). In the second pass, objects would set the rest of the stored properties (including optional references), and finish any other calculations and setup normally done in an initializer.
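The two passes can be sketched by hand, without any compiler support. The example below assumes a flat, ID-based archive (the `Archive` struct) instead of a real Decoder, and the `Parent`, `Child`, and `unarchive` names are made up for illustration. Note that it needs a `link(parent:)` method that any caller can use, which a real design would want to avoid.

```swift
final class Child {
    let label: String
    private(set) weak var parent: Parent?   // optional: filled in during pass 2
    init(label: String) { self.label = label }
    func link(parent: Parent) { self.parent = parent }
}

final class Parent {
    let child: Child                         // non-optional: required in pass 1
    init(child: Child) { self.child = child }
}

/// Hypothetical flat archive in which every reference is stored as an ID.
struct Archive {
    var children: [Int: String]      // child ID -> label
    var parents: [Int: Int]          // parent ID -> child ID
    var backReferences: [Int: Int]   // child ID -> parent ID
}

func unarchive(_ archive: Archive) -> [Int: Parent] {
    // Pass 1: construct every object, supplying only non-optional references.
    // Children have none, so they are created before the parents.
    var children: [Int: Child] = [:]
    for (id, label) in archive.children {
        children[id] = Child(label: label)
    }
    var parents: [Int: Parent] = [:]
    for (id, childID) in archive.parents {
        guard let child = children[childID] else { continue }
        parents[id] = Parent(child: child)
    }

    // Pass 2: every object now exists, so optional references can be set.
    for (childID, parentID) in archive.backReferences {
        if let child = children[childID], let parent = parents[parentID] {
            child.link(parent: parent)
        }
    }
    return parents
}
```

Between the two passes each `Child` is fully initialized as far as Swift is concerned, but its `parent` is still nil, so nothing should rely on the back reference until pass 2 has finished.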

Clearly, this would require compiler support, but before venturing into that territory I’d like to ask if anyone has looked into this problem before, and whether there are any easier solutions. I’ve tried searching the forum for existing threads on the subject, but haven’t found any.

I’ve also spent some time on an experimental implementation of two-pass decoding (using init to simulate the first pass, and a decode(from:) method for the second pass), but I got stuck at the point where I found I couldn't retroactively impose two-pass decoding on types that currently conform to Decodable.

So, anyone got any thoughts on this matter?

Nevin · March 27, 2018, 4:20am

Without commenting on the rest of your post, I believe this point is incorrect. Here is an example where class instances foo and bar both have strong non-optional references to each other:

class SuperFoo {}
class Bar { var foo = SuperFoo() }
class Foo : SuperFoo { var bar = Bar() }

let foo = Foo()
let bar = Bar()
foo.bar = bar
bar.foo = foo

(Also, other enums besides Optional can have associated values, but that’s morally equivalent to Optional for these purposes.)

QuinceyMorris · March 27, 2018, 5:54am

Yes, I think you're right. That leads to one of two conclusions:

The problem is even harder than I thought. Or…
The distinction based on optional references is just a rule of thumb, that works in most non-pathological cases. In some cases, the choice of making some reference "pass 2 only", to break the circular chain in pass 1, might need to be explicit.

What's exceptional about your example is that the properties are var, which means they can be set later without breaking init rules. If the properties were let, or otherwise set only before the init completes, I don't think the strong circular chain can be constructed.

The goal here is to come up with an analysis that the compiler can use to synthesize Decodable conformance safely (or for it to know when it can't).

Thanks for the insightful response.

Tino · March 27, 2018, 6:05am

That works in the example, but not in general:
If Bar does not have an init without parameters, what values should those have when constructing the temporary instances?

itaiferber · March 27, 2018, 4:44pm

Thanks for bringing this topic up! Some notes:

QuinceyMorris:

The problem is that this stated goal isn’t achievable by encodeEncodable and decodeDecodable as currently implemented. As a practical exercise, I took a very ordinary, simple data model that was previously being archived for NSDocument, and changed all the NSCoding conformances to Codable. When I tried to save the document, my app crashed — crashed big, with infinite recursion leading to a stack overflow.

It turns out that archiving/unarchiving via Codable through keyed archivers/unarchivers doesn’t respect the normal archiver convention of object reference identity. That is, reference type instances in NSCoding are unique within the archive as a whole. Codable, on the other hand, archives or unarchives a new instance at every reference encountered in the object graph. It crashes because typical data models in Cocoa apps have circular chains of references. (For example, there is a circular chain between a parent object with an owning reference to a child object that has a [weak] back reference to the parent.) These are unproblematic in NSCoding, but fatal in Codable.

You can consider this a bug. Although not explicitly called out in the Codable proposals, we planned on supporting and maintaining reference semantics both through our own encoders (JSONEncoder, PropertyListEncoder) and through NSKeyedArchiver support. However, we didn't manage to finish this aspect of the feature in time, and have not yet had the bandwidth to follow through, unfortunately. I'm planning on incorporating this explicitly in the next update to the Codable feature.

Yes, that's correct. Because Objective-C has two separate initialization steps (+alloc, -init...), it's possible to break reference cycles by returning allocated but uninitialized objects. I don't know if this is necessarily a model to emulate, though — it's terribly unsafe to do this, because it might seem completely reasonable to be able to depend on objects being initialized once you -decodeObjectOfClass:forKey:. If you do have a reference cycle, whether the object you get back is valid or not depends on where you are in decoding and in what order the objects were encoded in the first place.

This goes both ways:

I had a discussion on Twitter a few weeks back in which some folks were disappointed in the Swift model because it's difficult to achieve this in Swift
A week later, I got an unrelated Radar from a surprised developer that ran into this (quite painfully) on their own

Not necessarily. Nevin had a good example of this, but there are other examples which are less surprising. Consider the following:

class Post {
    let author: Author
    let content: String
}

class Author {
    var posts: [Post]
}

Every Post must have an Author, but every Author owns its posts. Since a Post's author shouldn't be optional, how can we design an API around this cycle? The following, for example, would not work:

class Post {
    // ...
    init(author: Author, content: String) {
        self.author = author
        self.content = content
    }
}

class Author {
    let name: String
    private(set) var posts: [Post]
    init(name: String, posts: [Post]) { ... }
}

You wouldn't be able to create an Author who has Posts because those Posts require an existing Author. You can, however, do this:

class Author {
    let name: String
    private(set) var posts: [Post] = []

    init(name: String) { ... }

    func add(post: Post) {
        posts.append(post)
    }
}

You can create an Author with no Posts, then add more later on. This is better, but requires some amount of checking to ensure that when you add(post:), the post.author == self. Alternatively, this design might be better approached as such:
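The check mentioned above might look like the following. This is a sketch under two assumptions: `add(post:)` reports failure with a Bool rather than trapping, and identity (`===`) stands in for the `==` in the prose, since `Author` is not Equatable here.

```swift
final class Author {
    let name: String
    private(set) var posts: [Post] = []

    init(name: String) { self.name = name }

    /// Appends the post only if it already names this author.
    /// Returns false and leaves `posts` unchanged otherwise.
    @discardableResult
    func add(post: Post) -> Bool {
        guard post.author === self else { return false }
        posts.append(post)
        return true
    }
}

final class Post {
    let author: Author   // strong reference: forms a cycle once the post is added
    let content: String

    init(author: Author, content: String) {
        self.author = author
        self.content = content
    }
}
```

A caller can still forget to call `add(post:)` at all, which is the gap that the next design closes by doing the registration inside the Post initializer.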

class Post {
    init(author: Author, content: String) {
        self.author = author
        self.content = content
        self.author.posts.append(self)
    }
}

class Author {
    let name: String
    fileprivate(set) var posts: [Post] = []
    init(name: String) { ... }
}

In this case, the initializer of Post encapsulates this runtime linking of the object graph. No optionals are necessary because [Post] acts somewhat like Optional in that it can start off empty and get added to later. (Of course, you'd actually need a weak in here somewhere to later break up the reference cycle, but this'll let you build it up.)

This all goes to say that I don't actually see this initialization problem as any different than building the reference cycle as above — there's a general way to do this in Swift, and I don't see much of a difference between doing this in init(author:content:) and init(from:).

The "linking" of this object graph on decode could do the same thing (assuming that the decoder supported the reference semantics that we're looking to embody):

class Post : Decodable {
    required init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        self.author = try container.decode(Author.self, forKey: .author) // assuming reference here
        self.content = try container.decode(String.self, forKey: .content)
        self.author.posts.append(self)
    }
}

class Author : Decodable {
    required init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        self.name = try container.decode(String.self, forKey: .name)

        // _Don't_ decode Posts because every post on decode will add itself here
        self.posts = []
    }
}

As in all reference cycles, in order for the cycle to make sense, one object must be the true "owner" of the other, and in this case, Posts own their Authors and not the other way around. You could switch the relationship around by making Post.author be Optional, but this might be slightly more ergonomic depending on how you construct Posts and Authors.
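The decode-time linking can be tried today with JSONDecoder, with one caveat: JSONDecoder does not preserve reference identity, so in this sketch every Post decodes its own Author and two posts would never share one. The property declarations, CodingKeys, the JSON shape, and the `decodePost(fromJSON:)` helper are all assumptions added to make the fragment self-contained.

```swift
import Foundation

final class Author: Decodable {
    let name: String
    fileprivate(set) var posts: [Post] = []

    private enum CodingKeys: String, CodingKey { case name }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        name = try container.decode(String.self, forKey: .name)
        // `posts` is deliberately not decoded; each Post registers itself.
    }
}

final class Post: Decodable {
    let author: Author   // strong; a real model would need to break this cycle
    let content: String

    private enum CodingKeys: String, CodingKey { case author, content }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        author = try container.decode(Author.self, forKey: .author)
        content = try container.decode(String.self, forKey: .content)
        // All stored properties are set, so `self` may now be used.
        author.posts.append(self)
    }
}

/// Hypothetical convenience wrapper around JSONDecoder.
func decodePost(fromJSON json: String) throws -> Post {
    try JSONDecoder().decode(Post.self, from: Data(json.utf8))
}
```

Decoding `{"author":{"name":"Ann"},"content":"Hello"}` yields a Post whose author's `posts` array contains that same Post instance, so the back link is rebuilt without any optional property.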

With all of this, I don't know how much the compiler could help recognize that "hey, you've got a reference cycle here, better break that up" and break the cycle for you. It's possible theoretically to break up the process into a two-phase one like in Objective-C, but that would require either:

Breaking Swift's strong initialization requirements and allowing us to pass around uninitialized objects like in Objective-C, or
Breaking up the Decodable requirements into init() and decode(from:)

Besides the backwards incompatibility of option 2 at this point, we considered it during the design of Codable. Unfortunately, not all objects can be default-initialized, and making that a requirement would be a non-starter for a lot of types.

jrose · March 27, 2018, 5:41pm

Fundamentally I think of Codable as encoding trees and NSCoder as encoding graphs. You can encode a graph using a tree structure of some kind, but it does require a bit of extra manual effort.

I'm not sure how important it is for Codable-the-protocol or even the common Foundation encoders to support this. NSCoder isn't going away, and it may still be a good choice when you really do have an object graph.

QuinceyMorris · March 27, 2018, 5:47pm

There is more to think about in your response, but a couple of initial comments:

In an example like your Author/Post scenario, you basically have a parent/child relationship, with a one-to-many relationship in the parent, and a one-to-one relationship in the child. Under normal circumstances, you have to make the one-to-one relationship weak, because making the other one "weak" requires heroic measures, such as wrapping each reference in a struct with a weak member.
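For reference, the "heroic measures" amount to something like the wrapper below. It is a sketch only: `Weak` is a hypothetical helper rather than a standard library type, and the `Author` shown here keeps its posts weakly purely to illustrate the wrapper.

```swift
/// Hypothetical wrapper: holds its object weakly, so an array of these
/// does not keep the wrapped objects alive.
struct Weak<Object: AnyObject> {
    weak var object: Object?
    init(_ object: Object) { self.object = object }
}

final class Post {
    let content: String
    init(content: String) { self.content = content }
}

final class Author {
    private var weakPosts: [Weak<Post>] = []

    func add(post: Post) {
        weakPosts.append(Weak(post))
    }

    /// Live posts only; entries whose post has been deallocated are skipped.
    var posts: [Post] {
        weakPosts.compactMap { $0.object }
    }
}
```

The cost is visible in the sketch: dead entries accumulate in `weakPosts` until something prunes them, and every read has to filter, which is why the weak side usually ends up on the single back reference instead.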

In practice, therefore, you have a weak and therefore optional property in the child, which simplifies the problem a little bit. That offers more hope of a good synthesized solution in "normal" code.

I think there's one important difference. Creating such chains of references initially may well use state (e.g. intermediate objects) that no longer exists at the time of archiving or unarchiving. Manually-written unarchiving code might be able to reconstruct the original initialization process, but it's a very subtle thing, especially if the chains of references are dispersed across a lot of types. In addition, creating these cases may put the objects into temporary states that are not valid or safe. A partially-constructed object graph can often be kept isolated when initially created, but doing so during unarchiving might be harder.

QuinceyMorris · March 27, 2018, 5:58pm

There is a 3rd option: Introduce a new Archivable protocol, which extends Decodable to two passes, or something else that works for the general object graph case.

Regarding #1, I would point out that there is already a kinda model for "reference to an object that you can store but not dereference", namely unowned(safe). This is AFAICT a valid pointer to an unusable object.

I've been wondering if something like this could be done in reverse. If, during part of the init method, stored properties were implicitly treated as unowned, it might be possible to set them to references that become fully valid later on. This might involve adding a third init phase to the current two.

Obviously, this is not a fully-developed idea, but it would be interesting to think about what additions to the language might produce a clean solution — separately from the question of whether such a change might have a chance of being implemented.

QuinceyMorris · March 27, 2018, 6:10pm

There's a pretty big "Yes, but…" here:

Conforming to NSCoding turns your objects into NSObject subclasses. This is unpalatable for reference types, and untenable for value types. Even if possible, that tends to have unwanted follow-on effects, not to mention the fact that it leaves you stuck with a ~~Cocoa~~Apple-world dependency.

As I said originally, but forgot to follow up on: this is really about Swift, not Cocoa. It's rather unfortunate that pure Swift is very good at representing object graphs, but incapable of archiving and unarchiving them natively.

Karl · March 27, 2018, 6:32pm

You can encode a Codable tree into an NSCoding graph relatively easily, though, using a wrapper. Graphs are more flexible than trees because they have this cross-linkage.

I tried writing an NSCoding wrapper once. You can, but you get a warning because Swift class names aren't stable yet.

QuinceyMorris · March 27, 2018, 8:44pm

"Yes, but…"

If you're talking about (say) wrapping value types in an NSCoding-conforming object, you're implying (potentially) a lot of extra object allocations. In that case, you're better off using classes from the get-go.

IAC, even if you use wrappers, you still can't use Codable for the wrapped thing, since that's the problem you started with. You have to do the encoding/decoding entirely in the wrapper, at least in general.

I'd call both of those "unwanted follow-on effects".

itaiferber · March 28, 2018, 7:41pm

You're right; fair enough. The alternative is to have an array type which allows you to store weak objects, but this isn't easy in Swift at the moment.

Do you have something specific in mind here? I'm having trouble reasoning about cases where

Constructing the object graph requires state which is unavailable at unarchival time. If you can't reconstruct your object graph, then it's not really unarchivable, is it? (This is to say that if your object graph requires state to construct, it will require that state whether you're initializing manually, or unarchiving. If the state is necessary, it should be part of the archive; unarchival is not special here.)
Keeping a partially-initialized object graph isolated would be more difficult during unarchival. Why would this be the case, and how would it be different from when you perform the same initialization manually?

In other words, is there a specific case you have in mind which is easy to express when initializing manually, but difficult when unarchiving?

What do you see this Archivable protocol looking like? How would it fit within the current Codable model?

You might be able to use unowned(safe) to prevent accessing the object, but Swift does not currently make it easy (if at all possible) to create the object without initializing it in the first place.

Basically, what I'm trying to get at—assuming we get reference semantics to work as intended—is:

Are there object graphs which are representable when created manually, but unrepresentable when archived and unarchived (and if so, do we have an example)?
What, if anything, does the two-step initialization model in Objective-C give us that we don't already have in Swift, besides being able to more easily work with objects which have been allocated but not yet initialized?

My current answers to the above are

No: if you can construct a cyclic object graph manually with initializers in Swift, you can use the same mechanism in init(from:)
It doesn't: we have the tools we need already to solve these problems in Swift without needing access to uninitialized objects

If you have counterexamples to this, I'd love to be proven wrong! But at the moment, I've yet to be convinced that the initialization problems here are different from any other initialization problems that we might encounter in Swift.

QuinceyMorris · March 28, 2018, 9:19pm

Here's a simple example that, I believe, can't be decoded currently, reference identity aside:

class X: Codable {
	private(set) var y: Y
	init (y: Y) {
		self.y = y
	}
}
	
class Y: Codable {
	private(set) weak var x: X?
	static func createXWithY () -> X {
		let y = Y ()
		let x = X (y: y)
		y.x = x
		return x
	}
}

Initial creation of the chain works because the X and Y are new, and their references are local to the function that creates them. I believe there's no possible one-pass decoding (because each must be initialized before the other can finish initializing).

In this scenario, I believe there's no way to "fix" the graph afterwards via an ad-hoc second pass, because both X and Y intentionally prevent setting of their properties from outside the classes. If Y provided a method to change the optional x from nil to the correct unarchived reference, that defeats the purpose of private(set).

It's doable with two decoding passes, because a second decode(from:Decoder) pass in Y doesn't expose its private(set) property settably to the outside world. But in that case, it's not safe for Y's init to dereference anything, in general, since in a more realistic example there may be other objects that are in an intermediate state, just like the Y is at that point of decoding. (And by extension, it's not safe to dereference anything in any init during decoding.)

There's no reason why the object can't have gone through an initialization. The point is that the object pointer cannot be allowed to be dereferenced until it's been through a second pass (or whatever).

I've been playing around with this:

public protocol Archivable: Encodable
{
	init (from decoder: Decoder) throws
	func decode (from: Decoder) throws
}

(combining encodability and decodability, since they probably go hand-in-hand for archiving scenarios), where decoding containers get new generic methods like this:

public mutating func decode<T>(_ type: T.Type) throws -> T where T : Archivable

alongside the ones that conform to Decodable.

I'm happy to go on kicking the larger topic around, but I'm inclined to think it'd be better to wait and see what you come up with for the reference identity problem, since I think you're going to have trouble synthesizing Decodable conformance even in easy cases.

Jeroen537 · January 30, 2019, 11:19am

Please forgive the following newbie question. I have been wrestling with the very subject of this discussion.

I want to save an object graph in Swift, to be able to load it later; in my case, in the same app, although that should not matter. The graph contains shared class instances, which is essential for the functioning of the app. (De)serializing through Codable evidently does not work, as it gives back a tree instead of a graph. This is all explained above.

Using NSCoder might be an alternative, however as explained above one needs all classes to be subclasses of NSObject. This can complicate matters, and I would like not to go there.

The alternative I am exploring is using Codable, but to code my app in such a way that there is only one reference to a shared instance, that is accessed by client instances through an intermediate key/value storage facility. Preferably, this would be a generic construct that could work generally and could be reused in other apps. There would be some kind of API that should make it as transparent as possible to use, but I do not think it could be quite invisible to client objects. I think it will work, but at the expense of adopting different coding patterns. If necessary, I will go there but would prefer not to.
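A minimal sketch of that key/value indirection, assuming integer IDs and JSON as the storage format; `SharedItem`, `Client`, `Store`, and `roundTrip` are hypothetical names chosen for illustration.

```swift
import Foundation

final class SharedItem: Codable {
    var value: Int
    init(value: Int) { self.value = value }
}

/// Clients store an ID, never a direct reference to the shared instance.
struct Client: Codable {
    var itemID: Int
}

/// The single owner of every shared instance, keyed by ID.
struct Store: Codable {
    var items: [Int: SharedItem] = [:]
    var clients: [Client] = []

    /// Resolves a client's ID to the one instance held by the store.
    func item(for client: Client) -> SharedItem? {
        items[client.itemID]
    }
}

/// Encodes and decodes the whole store with plain Codable.
func roundTrip(_ store: Store) throws -> Store {
    let data = try JSONEncoder().encode(store)
    return try JSONDecoder().decode(Store.self, from: data)
}
```

Because each shared instance is encoded exactly once (inside `items`), the archive is a tree and Codable handles it as is; two clients with the same `itemID` resolve to the same object after decoding. The price is the one named above: client code has to go through the lookup instead of holding the reference directly.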

Further background: I am coming from Python where you have the pickle/unpickle function pair which can store and retrieve any object, simple or complex, to file without any extra coding needed.

This seems such a useful feature to me that I am wondering why Swift does not have it.

Is there anything in the language that resists it? I would think that such a function even might be written in Swift itself, at least for instances conforming to Codable, provided that there would be a way to persist ObjectIdentifier instances (to be able to identify object identity at restore time). But there may be issues that I don't know about that might need more introspection than is available.

Anyway, even if it could not be written in Swift, how difficult would it be to define a storage format derived from the internal representation? Of course there is the issue of restoring to different versions of the language, but those seem possible to overcome, for instance, by versioning the storage format. With such an implementation there would not even be any protocol conformance dependency.

In summary: Why is there not a native storage format for Swift object graphs, and functions to save and restore to that format, without any protocol requirement? Is it naive to think that it would be possible? Is it against the spirit of Swift? Has it been discussed, and then rejected? Or is such a thing perhaps on the roadmap?

Rod_Brown · January 30, 2019, 4:06pm

I believe technically on all Darwin-based platforms, all classes are already subclasses of NSObject, just not directly. From memory, there’s a hidden base class which classes that don’t subclass anything automatically inherit from, and that base class is a subclass of NSObject for reference counting behaviours etc.

With this in mind, why exactly can’t you subclass NSObject? To build a graph that’s not a tree you need reference semantics, for cross references, so it has to be a class anyway. The only reason I can think of is if the class hierarchy is not in your control, eg from a framework.

If this isn’t the case, then it sounds to me like you’re avoiding Obj-C features purely from an interest in having an arbitrary “pure” Swift-based solution. Even the Swift language itself doesn’t do that.

Jeroen537 · January 30, 2019, 4:11pm

Thanks for replying.

Correct me if I am wrong, but if I subclass my Swift class from NSObject, I will not be able to subclass it from another Swift class, will I?

If so, that would limit my freedom of modeling my data as I see fit, and I would be looking for a better solution. If not, thanks for the heads up!

Rod_Brown · January 30, 2019, 4:17pm

Definitely true, Swift does not have multiple inheritance of classes (thank God!).

But every inheritance hierarchy has to finish with a base class. Assuming these are not in a library outside your control, which I touched upon, then you control ensuring that the base class itself subclasses NSObject, which it privately will anyway. Whether you state it does or not is immaterial and won’t affect your performance.

Jeroen537 · January 30, 2019, 4:27pm

Good point, thanks. I can take it from there.

But, I will have to write quite a bit of code to make it happen. And this leaves the question that really baffles me: Why does Swift, now almost in its 5th generation, not have this feature natively? Especially where, it seems to me, just under the surface all the information is available to completely automate the process. Of course, I may be wrong.

More a philosophical question than a technical one, I realize. But I am curious to know how the Swift community looks upon this.

Rod_Brown · January 30, 2019, 4:32pm

I completely agree it’s a clear gap, considering how awesome Codable is for similar but different things.

I think the answer is there have been bigger issues for the core team with ABI stability which even now is rearing its head as a big issue as fear of the “ABI Lockdown” hits home.

Hopefully the post-ABI-stability world of Swift 5 will free up resources, shifting effort from polishing existing standard library types to adding features like this.

Jeroen537 · January 30, 2019, 4:35pm

Thanks, good to know that I as a newbie am asking meaningful questions! I'll be patient...