Serialization in Swift

As much as people keep talking about metaprogramming, I think it's reasonable to think that there's a fair chance we'll eventually get it. Assuming that MP would, in fact, provide for a complete serialization solution, do we want to spend the time & effort to build another interim solution? What about backwards compatibility? Can we keep calling the protocols Codable and have everything work, or will we need to be maintaining multiple "official Swift" serialization schemes?

[representing my own opinion, not the core team position] One interesting avenue to explore here is how Codable synthesis can be extracted from the compiler and rebuilt on top of an MP solution. IOW, can we design an MP solution that is flexible enough to move Codable synthesis out of the compiler and into a library, as well as support the development of additional serialization solutions side by side with Codable?

23 Likes

I think probably "yes", as long as the code that's currently synthesized has the same visibility as regular source code. That is to say, the only part of the compiler that "knows" it was synthesized is the part that does the synthesis. Although I suppose it also comes up if you ask your IDE where in your source code the implementations are, but I don't think an error there would prevent compilation or linking, would it?

Speaking of generating schemas, PerfectCRUD is using Codable to generate SQL schemas, although it comes with some bizarre limitations (max two Bools per table) probably due to how it interacts with Codable.

2 Likes

I wrote an experimental Swift encode / decode package (similar to Codable at interface level) that does not treat reference types as second-class citizens.

I called it GraphCodable. You can check it here.

GraphCodable:

  • encodes type information;
  • supports reference types inheritance;
  • never duplicates the same object (as defined by ObjectIdentifier);
  • in other words, it preserves the structure and the types of your model unaltered during encoding and decoding;
  • is fully type checked at compile time;
  • supports keyed and unkeyed coding, also usable simultaneously;
  • supports conditional encoding;
  • implements a userInfo dictionary;
  • implements a type version system;
  • implements a type substitution system during decode;

The package comes with fairly complete documentation, including many examples, a table summarizing the coding rules, and a description of the data format.

I think you might find the two examples concerning the encoding and decoding of a DAG (Directed Acyclic Graph) and a DCG (Directed Cyclic Graph) interesting.

I can confirm that deferDecode(...) is only required to decode a weak variable used to break ARC strong reference cycles. In any other situation, decode(...) must be used. If you don't believe me, check it out.

The main difficulty, in my opinion, is that you have to keep the types repository (simplifying, a singleton "typeName: GCodable type" dictionary) updated during the development of an application. The absence of a single 'GCodable' type from the types repository makes it impossible to decode a data file containing it.

I see two ways of solving this problem:

  • compiler magic that auto-registers all GCodable types encountered during compilation (is this possible with generic types?);
  • better yet, a way to serialize and deserialize Any.Type, if possible.

If there is a third one, I would be happy to learn it.
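
To make the registration problem concrete, here is a minimal sketch of a "typeName: type" repository. It is not GraphCodable's actual API: `Registrable`, `TypeRegistry`, `Circle` and `Square` are hypothetical names, and a plain instance stands in for the singleton mentioned above.

```swift
// Hypothetical stand-in for a decodable type; not GraphCodable's GCodable.
protocol Registrable {
    init()
}

// Simplified "typeName: type" repository (an assumption made for illustration).
struct TypeRegistry {
    private var types: [String: Registrable.Type] = [:]

    // Registers a type under its fully qualified name.
    mutating func register(_ type: Registrable.Type) {
        types[String(reflecting: type)] = type
    }

    // Returns a fresh instance, or nil if the archived name was never registered.
    func makeInstance(named name: String) -> Registrable? {
        types[name]?.init()
    }
}

struct Circle: Registrable { init() {} }
struct Square: Registrable { init() {} }

var registry = TypeRegistry()
registry.register(Circle.self)          // Square is deliberately forgotten

let found = registry.makeInstance(named: String(reflecting: Circle.self))
let missing = registry.makeInstance(named: String(reflecting: Square.self))
// `found` is a Circle; `missing` is nil, so an archive containing a Square cannot be decoded.
```

The forgotten `Square` is exactly the failure mode described above: nothing at compile time points out that a type is absent from the repository.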

9 Likes

The team at JetBrains just released v1.20 of their Kotlin Serialization framework. It looks impressive.

3 Likes

Oh, I haven't heard of HOCON before. It would be nice to have Swift support for that! :heart_eyes:

1 Like

I was at Typesafe when we were standardizing around “Typesafe config” which is what brought us HOCON, maintained and worked with it for quite a bit — it is very nice and a Swift parser for it would be very nice :blush: :+1:

It is not really a general purpose serialization thing — it is focused on configuration and overlaying multiple configuration files and values onto one another.

It would be very nice if someone wants to make a parser for it in Swift :slight_smile:

But nowadays, for configuration languages, I myself am more excited about languages with more powerful validation mechanisms… like dhall or similar ones. That would be very exciting as well.

That is somewhat outside the topic, which is just serialization, though :slight_smile:

2 Likes

Hey, I think this is a great initiative and wanted to give my perspective.

There are a few things that I feel are either impossible or very hacky with Codable. The examples here are specific to JSON, but this applies to other formats as well.

Example 1: Define an unknown fields property

When implementing a client library, we usually define structs that represent the responses we get from the server, e.g.

{
  "id": "1",
  "name": "Tobias"
}

struct Person: Codable {
   let id: String
   let name: String
}

When the server starts sending a new version of the object schema, it might add a new field.

{
  "id": "1",
  "name": "Tobias",
  "age": 35
}

All good, we can still deserialize this object into Person because JSONDecoder will just ignore the unknown field.

Now we might have a method,

func api.put(person: Person) { ... }

This method takes a Person struct, converts it into JSON, and sends it to the server, which saves it as-is. Now, if we take the object we got from the server and just put it back, we would remove the age field. Of course, this assumes that there is no other validation and that null is a valid value for the age field. The details don't matter so much; you get the point. It would be great if we could preserve the unknown field. Something like:

struct Person: Codable {
   let id: String
   let name: String
   let unknownFields: [String: AnyCodable] // would be flattened into the same container as `Person`
}

At the moment there is no good representation of AnyCodable. There are some implementations, but I feel these are more of a hack (with lots of casting) and make a lot of assumptions about how the different Encoders/Decoders are implemented.
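
For comparison, this is roughly what a hand-written workaround looks like today. It is only a sketch under assumptions: `JSONScalar` and `DynamicKey` are hypothetical helper types, only scalar unknown values are handled, and the type-probing in `JSONScalar` relies on how JSONDecoder reports type mismatches, which is exactly the kind of assumption criticised above.

```swift
import Foundation

// Hypothetical value type for leftover fields; handles scalars only.
enum JSONScalar: Codable, Equatable {
    case null, bool(Bool), number(Double), string(String)

    init(from decoder: Decoder) throws {
        let c = try decoder.singleValueContainer()
        if c.decodeNil() { self = .null }
        else if let b = try? c.decode(Bool.self) { self = .bool(b) }
        else if let n = try? c.decode(Double.self) { self = .number(n) }
        else { self = .string(try c.decode(String.self)) }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.singleValueContainer()
        switch self {
        case .null: try c.encodeNil()
        case .bool(let b): try c.encode(b)
        case .number(let n): try c.encode(n)
        case .string(let s): try c.encode(s)
        }
    }
}

// Coding key that accepts any string, so unknown keys can be enumerated.
struct DynamicKey: CodingKey {
    var stringValue: String
    var intValue: Int? { nil }
    init(_ name: String) { self.stringValue = name }
    init?(stringValue: String) { self.stringValue = stringValue }
    init?(intValue: Int) { return nil }
}

struct Person: Codable {
    let id: String
    let name: String
    let unknownFields: [String: JSONScalar]

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: DynamicKey.self)
        id = try c.decode(String.self, forKey: DynamicKey("id"))
        name = try c.decode(String.self, forKey: DynamicKey("name"))
        // Everything that is not a known property is kept verbatim.
        var rest: [String: JSONScalar] = [:]
        for key in c.allKeys where key.stringValue != "id" && key.stringValue != "name" {
            rest[key.stringValue] = try c.decode(JSONScalar.self, forKey: key)
        }
        unknownFields = rest
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: DynamicKey.self)
        try c.encode(id, forKey: DynamicKey("id"))
        try c.encode(name, forKey: DynamicKey("name"))
        // Unknown fields are written back into the same container as the known ones.
        for (key, value) in unknownFields {
            try c.encode(value, forKey: DynamicKey(key))
        }
    }
}

let input = Data(#"{"id":"1","name":"Tobias","age":35}"#.utf8)
let person = try? JSONDecoder().decode(Person.self, from: input)
let output = person.flatMap { try? JSONEncoder().encode($0) }
let roundTripped = output.flatMap { try? JSONDecoder().decode(Person.self, from: $0) }
```

The age field survives the round trip, but only because all synthesized conformance was given up and every known key is repeated as a string.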

This brings me to the second example:

Example 2: Ability to interact with the serialisation directly

In web APIs there is the notion of PATCH requests, which send only the changes you want to make to an object; see RFC 7386, JSON Merge Patch.

With the current state of JSONEncoder/JSONDecoder from Foundation there is no good way to implement this.


let person = api.get() // Person(id: "1", name: "Tobias", age: 35)
var modifiedPerson = person
modifiedPerson.age = nil

api.patch(JSONMergePatch.makePatch(from: person, to: modifiedPerson))
/*
PATCH /person/1 HTTP/1.1
Host: example.org
Content-Type: application/merge-patch+json

{
  "age": null
}
*/

Today, in order to implement this makePatch(from:to:) function, you would need to first encode both Person structs into JSON Data, then use JSONSerialization to parse this back into an Any JSON object, cast it to [String: Any], and recursively compare the dictionary entries, casting the Any values into the NSObject types used by JSONSerialization in order to compare them. Finally, you would encode the JSON object [String: Any] of the patch document back into Data and send it to the server. Here, it would be great if the standard library shipped with a pure Swift implementation akin to XJSONEncoder/XJSONDecoder from swift-extras-json. This library allows converting any Codable to and from a type-safe JSONValue, which can then be used, for example, to compute (or apply) a JSON Merge Patch document in a nice, performant and type-safe manner.
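
To illustrate how small the diffing step becomes once a type-safe value is available, here is a sketch of computing an RFC 7386 merge patch. The `JSONValue` enum below is a hypothetical, simplified type written for this example; it is not the one from swift-extras-json, and `makePatch(from:to:)` is only one possible shape for such a function.

```swift
// Hypothetical, simplified JSON value type (not the swift-extras-json JSONValue).
enum JSONValue: Equatable {
    case null
    case bool(Bool)
    case number(String)      // numbers kept in their textual form
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])
}

/// Computes a merge patch that turns `source` into `target`.
func makePatch(from source: JSONValue, to target: JSONValue) -> JSONValue {
    // Only two objects can be diffed member by member; anything else is replaced wholesale.
    guard case .object(let old) = source, case .object(let new) = target else {
        return target
    }
    var patch: [String: JSONValue] = [:]
    // Removed members are expressed as explicit nulls.
    for key in old.keys where new[key] == nil {
        patch[key] = JSONValue.null
    }
    // Added members carry their new value; changed members are diffed recursively.
    for (key, newValue) in new {
        if let oldValue = old[key] {
            if oldValue != newValue {
                patch[key] = makePatch(from: oldValue, to: newValue)
            }
        } else {
            patch[key] = newValue
        }
    }
    return .object(patch)
}

let before = JSONValue.object(["id": .string("1"), "name": .string("Tobias"), "age": .number("35")])
let after = JSONValue.object(["id": .string("1"), "name": .string("Tobias")])
let patch = makePatch(from: before, to: after)
// patch == .object(["age": .null])
```

No casting through Any or NSObject is needed, because equality and pattern matching come with the enum.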

Example 3: Full precision numbers in JSON

This example is actually obsolete, after posting I realised I was using the wrong type: This actually works correctly with Foundation JSONEncoder / JSONDecoder , when using Decimal.

Wrong example with `NSDecimalNumber`

This is a bit special for JSON, and maybe I am missing something: in JSON, numbers are encoded as a string, e.g.

{
  "amount": 99.99,
  "currency": "EUR"
}

Here 99.99 is not a Double or Float, in the binary representation it is just a String. But when we want to convert this into a Swift struct, I don't think there is a way to do it like this:

import Foundation

var json = Data("""
    {
        "amount": 99.99,
        "currency": "EUR"
    }
    """.utf8)


struct Money: Decodable {
    enum CodingKeys: String, CodingKey {
        case amount
        case currency
    }
    
    let amount: NSDecimalNumber
    let currency: String
    
    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        self.amount = NSDecimalNumber(string: try container.decode(String.self, forKey: .amount))
        self.currency = try container.decode(String.self, forKey: .currency)
    }
}


let decoder = JSONDecoder()
let money = try decoder.decode(Money.self, from: json)
/*
▿ DecodingError
  ▿ typeMismatch : 2 elements
    - .0 : Swift.String
    ▿ .1 : Context
      ▿ codingPath : 1 element
        - 0 : CodingKeys(stringValue: "amount", intValue: nil)
      - debugDescription : "Expected to decode String but found a number instead."
      - underlyingError : nil
*/

Again, JSONValue from swift-extras-json models numbers correctly as .number(String). But even here, there is no way to get to the "raw" representation of the value through the Decoder interface.

For reference, in Java world with Lombok and Jackson, this works just fine and will retain the full precision of the BigDecimal when encoding/decoding JSON:

@Value
class Person {
  final BigDecimal amount;
  final String currency;
}

This is how it works

import Foundation

var json = Data("""
    {
        "amount": 99.99,
        "currency": "EUR"
    }
    """.utf8)


struct Money: Codable {
    let amount: Decimal
    let currency: String
}


let decoder = JSONDecoder()
let money = try decoder.decode(Money.self, from: json)
print(money)
// Money(amount: 99.99, currency: "EUR")

let encoder = JSONEncoder()
encoder.outputFormatting = [ .prettyPrinted ]
let encodedMoney = try encoder.encode(money)
print(String(decoding: encodedMoney, as: UTF8.self))
/*
 {
   "amount" : 99.99,
   "currency" : "EUR"
 }
*/

Summary

I feel that what all these examples have in common is that the rather "rigid" interface of Codable makes it very difficult for framework developers to build solutions that can handle special cases nicely without having to reinvent the entire serialisation / marshalling infrastructure. This has been mentioned multiple times in the thread: it would be nice if we could move the magic out into library code.

I already abuse Codable for unrelated MP purposes as much as it lets me, so yes, to me this would be ideal!

4 Likes

I'd like to see a way to serialize JSON partially. Codable is nice but slow.

A thing I don't seem to see in the simpler serialization libraries I have used is error checking or correction. Many programs act as if the serialized file is always perfect and crash horribly if something is wrong, and the libraries don't particularly help with this, simply throwing a file format exception and giving up. Worse are the times when the data looks great but secretly isn't, as when bit rot happens or when some foolish person edits a file by hand and mistypes, or tries out-of-range things.

Some of this gets into a related idea that I've had (I assume someone must have thought about this long before me?): limited-range types, such as Integer(10...92) or Color([.gray, .red, .green]), which give the compiler more of a clue as to how correct a value is.

The existence of these would be useful for such a serialization algorithm.

So, to summarize, I want serialization which can tell if the file is borked, fix it if possible, explain the problem to the user, keep going when possible, and know, with fine understanding, what acceptable values are.
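
Part of this can be approximated in library code today. The sketch below is an assumption-laden illustration, not an existing API: `IntBounds`, `TenToNinetyTwo`, `Bounded` and `Settings` are hypothetical names. It checks the range while decoding and reports a descriptive error instead of accepting the value silently; it does not attempt any repair.

```swift
import Foundation

// Hypothetical way to attach a range to a type, since ranges cannot be generic parameters.
protocol IntBounds {
    static var range: ClosedRange<Int> { get }
}

enum TenToNinetyTwo: IntBounds {
    static let range = 10...92
}

// An Int that refuses to decode when it falls outside the bounds of `B`.
struct Bounded<B: IntBounds>: Decodable {
    let value: Int

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let raw = try container.decode(Int.self)
        guard B.range.contains(raw) else {
            // The error carries the coding path, so the user can be told where the file is wrong.
            throw DecodingError.dataCorruptedError(
                in: container,
                debugDescription: "\(raw) is outside the allowed range \(B.range)")
        }
        value = raw
    }
}

struct Settings: Decodable {
    let volume: Bounded<TenToNinetyTwo>
}

let good = try? JSONDecoder().decode(Settings.self, from: Data(#"{"volume": 42}"#.utf8))
let bad = try? JSONDecoder().decode(Settings.self, from: Data(#"{"volume": 7}"#.utf8))
// `good` holds 42; `bad` is nil because 7 is rejected during decoding.
```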

2 Likes

  • an easy way to encode/decode to/from a dictionary (or array) representation (in addition to Data). Also consider adding encoding/decoding directly to/from a string.
  • opt-ins for not including default values during JSON encoding, and a symmetrical opt-in for allowing absent values for variables that have default values during JSON decoding. example implementation
  • optionally emitting explicit nils in the JSON encoder. link
  • a way to decode JSON from Data that contains something else after the JSON (it can be another JSON value, a separator, or something totally different):
    {"foo":"bar"} {"baz" : "qux"} ...
    This API would return how many bytes were consumed, so I can continue parsing the Data from that offset, e.g. by calling the decoder another time.
  • allowsJSON5 for JSONEncoder (it's already supported for JSONDecoder).
  • easier customization, e.g.:
    // straw man syntax:
    struct Foo: Codable {
        var id: Int                        // normal case
        @json(excluded) var bar: Int = 42 // not in json
        @json(excluded, default: 42) var bar: Int // or this if easier
        @json(renamed: "hello") var baz: Int // renamed
    }
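
For the bullet about trailing data, here is a sketch of what can be done by hand in the meantime. `lengthOfFirstJSONContainer(in:)` is a hypothetical helper, not a Foundation API; it assumes the first value is an object or array and only tracks nesting depth and string literals, without validating the JSON.

```swift
import Foundation

/// Returns how many bytes the first top-level JSON object or array occupies
/// (including any leading whitespace), or nil if no complete one is found.
func lengthOfFirstJSONContainer(in data: Data) -> Int? {
    var depth = 0
    var inString = false
    var escaped = false
    for (offset, byte) in data.enumerated() {
        if inString {
            // Inside a string, only an unescaped quote ends it; brackets are ignored.
            if escaped { escaped = false }
            else if byte == UInt8(ascii: "\\") { escaped = true }
            else if byte == UInt8(ascii: "\"") { inString = false }
            continue
        }
        switch byte {
        case UInt8(ascii: "\""):
            inString = true
        case UInt8(ascii: "{"), UInt8(ascii: "["):
            depth += 1
        case UInt8(ascii: "}"), UInt8(ascii: "]"):
            depth -= 1
            if depth == 0 { return offset + 1 }
            if depth < 0 { return nil }
        default:
            break
        }
    }
    return nil
}

let stream = Data(#"{"foo":"bar"} {"baz" : "qux"}"#.utf8)
let firstLength = lengthOfFirstJSONContainer(in: stream) ?? 0
let first = try? JSONDecoder().decode([String: String].self, from: Data(stream.prefix(firstLength)))
// Continue from the reported offset with a second decoder call.
let second = try? JSONDecoder().decode([String: String].self, from: Data(stream.dropFirst(firstLength)))
```

This scans the bytes twice (once here, once in the decoder), which is the overhead a decoder-provided byte count would avoid.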

I think that with additional runtime support it should be possible to decode weak references without a second pass, and to avoid two-pass initialisation in other cases, in particular when subscribing to events from child objects.

A weak reference is a pointer to the side table (not true for ObjC, but let's ignore that for now). Normally the side table is created after the object itself, but if we allow side tables to be created first, then we can initialise a weak reference to point to a side table that is not yet linked to an object. When the complete object is constructed, the un-archiver can link it with the side table.

If we serialised the following object graph starting from a:

class A {
    weak var b: B?
}

class B {
    weak var a: A?
}

let a = A()
let b = B()
a.b = b
b.a = a

De-serialization will look like this:

  1. We encounter id for object A.
  2. We mark id as being constructed
  3. We enter A's decoding initialiser.
  4. Inside A's decoding initialiser, we encounter weak reference to 'b'.
  5. We allocate side table for b.
  6. Since object id for b is not marked as being constructed, we enter B initialiser.
  7. Inside B's initialiser, we encounter weak reference to a.
  8. The object id encoding the instance of a is marked as being constructed, but we don't have a side table for it yet. We create a side table instance and associate it with the object id. The side table is not linked to an instance yet.
  9. Return unlinked side table pointer to 'a'.
  10. Complete initialisation of b and store strong reference inside decoder.
  11. Return pointer to b's side table to A's initialiser.
  12. Complete initialisation of a and link side table with an object instance.
  13. The decoder is destroyed, releasing the only strong reference to b, because our example is small, but stupid.

This introduces a new state for weak references: a reference to an incompletely initialised object. Reading a strong reference from a weak reference in such a state would return nil. But after the object finishes initialisation, the strong reference becomes accessible.
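
The runtime change itself cannot be shown in today's Swift, but the observable behaviour can be emulated in user code. In the sketch below, `WeakSlot` is a hypothetical stand-in for the side table: it can exist before its object does and reads as nil until it is linked. The sequence loosely follows the step list above; it is an illustration of the idea, not the proposed implementation.

```swift
// Hypothetical userland stand-in for a side table that can be created before its object.
final class WeakSlot<Object: AnyObject> {
    private(set) weak var object: Object?   // nil while the target is still under construction
    func link(to object: Object) { self.object = object }
}

final class A {
    let b: WeakSlot<B>
    init(b: WeakSlot<B>) { self.b = b }
}

final class B {
    let a: WeakSlot<A>
    init(a: WeakSlot<A>) { self.a = a }
}

// The "decoder" hands out slots first and links them as objects complete.
let slotA = WeakSlot<A>()
let slotB = WeakSlot<B>()

let b = B(a: slotA)                        // B is built while A is still being constructed
slotB.link(to: b)
let seenBeforeLink = b.a.object == nil     // A is not reachable through the slot yet

let a = A(b: slotB)
slotA.link(to: a)                          // from here on b.a.object returns a
```

The difference from the proposal is that client code has to go through `slot.object` explicitly, whereas a runtime-level side table would let an ordinary `weak var` behave this way.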

This also allows weak references to be created from partially initialised objects in normal initialisers:

class Parent: Base {
    let child: Child
    init() {
        child = Child(foo: 42) { [weak self] in // weak self is ok, strong self is still an error
            // Will be nil, until Parent is fully initialised 
            self?.childDidChange()
        }
        super.init()
    }
}

Well, I could start by commenting that your system is effectively 2-pass. It takes two steps to initialize each object, and each object is in a different internal state in each step.

However, this doesn't actually solve the problem:

  • Forcing client code to rearchitect itself to use weak references instead of strong references is still problematic, because there's nothing obvious in the language that guides developers towards that.

  • Design considerations aside, it still doesn't work in any sort of general way. A class A that has a reference to a B? is broken if a consistency requirement between that reference and A's other properties is violated.

For example, A's initializer may need to set up properties that depend on properties of a B instance that aren't available when the B instance is in this incomplete state.

You can argue against this semi-convincingly with very simple examples involving just 2 classes, but in a real-world object graph there are likely to be lurking dependencies that aren't easily predictable or solvable.

At the very least, there are real-world object graphs where reference cycles are not an error, and are extremely hard to design around solely for the purpose of archiving.

That's why I think:

  1. Designing a solution based on non-optionals is a necessary discipline.

  2. There needs to be some kind of mechanism where partially-constructed objects (like the ones you just described) are integrated into new language rules about what it's safe to do when in initializers.

After all, initializers already kind of contain 2 behaviors internally — what happens before all the properties are set, and what happens after they're all set — but that's a semantic thing with no real syntax support.

For unarchiving to work, I still think we need 3 kinds of behaviors, with something like either a pre-init or a post-init that allow references to be fixed up. @itaiferber and I discussed this endlessly a few years ago, and never converged on an answer. (To be clear: AFAICR @itaiferber is pretty strongly in the "you can do it with optionals" camp, and has a very deep understanding of the subject, but still I remain unconvinced by that line of argument.)

1 Like

I think this, specifically, is the crux of any system we'd design to solve weak referencing/circular dependencies. Fundamentally, when you have a dependency chain like this, it cannot be resolved in a single pass because of reentrancy: if B decodes A, and A requires decoding B and accessing its properties, there's nothing A can do but return, wait for B to finish initializing, and then do what it needs to.


NSCoding and Obj-C implement a single-pass mechanism which is effectively the unsafe version of what @Nickolas_Pohilets is proposing: when B decodes A, and A decodes B, NSKeyedUnarchiver (or similar) can happily give A a reference to B because Obj-C separates allocation from initialization; so A gets a reference to the allocated-but-uninitialized B to hold on to. This solves the weak referencing problem, but is horribly unsafe because if A tries to do anything with B, well, it's working with a partially-initialized object. And worse, A has no way of knowing what state B is in. In effect, in an -initWithCoder:, in the general case, it's not safe to do anything but assign values to properties. (Even -awakeAfterUsingCoder: doesn't quite help, because that's called on the object returned from -initWithCoder: immediately upon return, not after all of decoding is done, so it's not quite a two-phase mechanism.)

@Nickolas_Pohilets's suggestion brings this idea closer to what Swift might allow, but it still requires a major departure from the inseparable allocation-is-initialization model that Swift has right now, by adding an allocated-but-explicitly-uninitialized object type. I'd presume that Swift wouldn't allow you to do anything with such an object except hold on to it, but it's not clear how the compiler would know at what point the initialization happens in order to allow you to do something with the object. (I can imagine a scheme where the object can be assumed to be fully initialized after init has completed, but that's not entirely true.)


To sum up my current thinking:

  • When you attempt to decode an object that is your parent in a circular dependency chain, you can either get an object back, or not
    1. If you get an object back, then it must be not-fully-initialized, in which case there's nothing you can do with it
    2. If you don't get an object back (because handing you a reference would be unsafe), then there's nothing you can do with it
  • In either case, there's nothing you can do with the object safely until a second pass takes place after all objects are initialized
  • During a second pass, you would either need to:
    1. Set up references / perform additional validation with the object references you had (if you got an unsafe reference), or
    2. Assign a value to the nil property you had if you got back nil, then perform (1)
  • Model (1) requires significant changes to the language, but model (2) is possible to implement today (though I fully agree — it's more verbose)
  • In either case, you need a second pass

I think there's absolutely room for improvement here, but I think there's a balance to strike between safety and verbosity, and I'm not sure where that is. And to be clear: I'd be delighted to see an improved model, or have my mind changed about the possibilities here!

Either way, we're lacking the main component of the whole concept: a structured way to get a second pass to fix up references (because right now, you're on your own).
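
As a rough picture of what model (2) with a structured second pass could look like, here is a sketch. Everything in it is hypothetical: `FixupQueue`, `Node` and `Record` are made-up names, and the "archive" is a plain array rather than a real decoder. The first pass creates every object with its back-reference left nil; the second pass runs once all objects exist.

```swift
// Hypothetical second-pass helper; nothing like this is part of Codable today.
final class FixupQueue {
    private var pending: [() -> Void] = []
    func schedule(_ fixup: @escaping () -> Void) { pending.append(fixup) }
    func runAll() {
        pending.forEach { $0() }
        pending.removeAll()
    }
}

final class Node {
    let name: String
    weak var parent: Node?        // left nil during the first pass
    var children: [Node] = []
    init(name: String) { self.name = name }
}

// Flattened stand-in for archived data. The child comes first on purpose,
// so its parent does not exist yet when the child is created.
struct Record { let id: Int; let name: String; let parentID: Int? }
let archive = [
    Record(id: 1, name: "child", parentID: 0),
    Record(id: 0, name: "root", parentID: nil),
]

var objects: [Int: Node] = [:]
let fixups = FixupQueue()

// First pass: create objects and record which references still need resolving.
for record in archive {
    let node = Node(name: record.name)
    objects[record.id] = node
    if let parentID = record.parentID {
        fixups.schedule {
            guard let parent = objects[parentID] else { return }
            node.parent = parent
            parent.children.append(node)
        }
    }
}

// Second pass: every object exists now, so the references can be fixed up.
fixups.runAll()
```

The verbosity is visible here: the reference has to be optional, and any validation that depends on it has to move into the fix-up closure as well.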

2 Likes

I don't think any system supporting loops in the graph can work in effectively 1 pass. My point was more about an opportunity to offload the second pass to the runtime/decoder, and remove it from user code.

I agree that strong reference cycles are a valid use case, but IMO still a pretty exotic one. For me this falls into the "hard things should be possible" category, not the "simple things should be easy" one.

It is not possible to have a reference cycle without any optionals involved in some shape or form. Unrelated to serialisation, you cannot instantiate such an object graph programmatically in the first place. You can probably craft an archive that represents such a graph, but an attempt to deserialize it should be a runtime error.

class A {
    var b: B
    init(b: B) { self.b = b }
}
class B {
    var a: A
    init(a: A) { self.a = a }
}

let a = A(b: B(a: ???))

My suggestion is to allow weak references to partially constructed objects (and even objects which are not yet allocated). Attempting to read such a reference would produce nil until the object is fully constructed. The definition of "fully constructed" is debatable: it can be the point where all properties are initialised, or the point where the outermost initialiser finishes.

2 Likes

That's true only in regard to creating such a reference cycle purely via initializers. (That is, after all, the exact problem we're trying to solve for unarchiving.)

It's trivial to create a reference cycle without optionals, if there's no limitation to initializers:

protocol P { }

class A {
    var p: P
    init(p: P) { self.p = p }
}

class B: P {
    var a: A
    init(a: A) { self.a = a }
}

class C: P { }

let a = A(p: C())
let b = B(a: a)
a.p = b

Although it's unlikely to occur in this simple form, this is not in any way an unusual construct. When an object graph is built up over (run-)time, it's easy to get structures like this.

It seems to me that there's nothing bad in this graph structure as such, aside from the impossibility of unarchiving it (currently).

Is this final assignment to a.p not isomorphic to a second pass?

It's not isomorphic to a second initialization pass. There's no reason why that assignment has to be made immediately after a or b is created. It might happen for unrelated reasons much later, and it may happen conditionally.

Also, in general, a client of a class something like A might not be aware that a class something like B has a back-reference to A. The mutual references are obvious here, but they're not necessarily obvious in real code.