Serialization in Swift

lukasa · April 1, 2021, 9:13am

I see. I was responding to the idea that you wanted JSON to work this way, which it cannot easily do. If you're asking for an alternative format that does this, sure, that can be done, but so far as I can see it has nothing to do with Codable. Codable is entirely capable of producing such an encoder/decoder today; there is no particular practical limitation preventing it.

dwaite · April 1, 2021, 9:20am

Article quoting the Java chief architect: Removing Serialization from Java Is a 'Long-Term Goal' at Oracle -- ADTmag

"[Serialization] was a horrible mistake in 1997," he said. "Some of us tried to fight it, but it went in, and there it is. ...We like to call serialization 'the gift that keeps on giving,' and the type of gift it keeps on giving is security vulnerabilities.... Probably a third of all Java vulnerabilities have involved serialization; it could be over half. It is an astonishingly fecund source of vulnerabilities, not to mention instabilities."

Java has three problems here, which have a multiplicative effect on one another:

• the ability to convert strings of bytes into executable classes
• support for deserializing arbitrary graphs by specifying individual class names
• by default, the ability to deserialize any type visible in the context that is marked serializable.

It should be noted that Objective-C has a subset of these problems, which is why NSCoding has been phased out in favor of NSSecureCoding.

sighoya · April 1, 2021, 11:21am

A somewhat longer document from software consultant Brian Goetz about the problems of serialization and how it might be improved in Java.

sighoya · April 1, 2021, 11:27am

Yes, but most of the time you don't use the native serialization framework.
You just write public getters (and setters/constructors) and use a reflection-based framework like Jackson or Gson to (de)serialize them (in the case of JSON/XML).

Most of these frameworks in the Java ecosystem require annotating fields only for non-default (de)serialization, and use reflection to read field names and any field-name rewrites.
I think this is already a very declarative approach to serialization.

QuinceyMorris · April 1, 2021, 3:20pm

It's not capable of that today.

No, you don't. If the object (A) contains a reference to another object (B), you cannot initialize A without a fully initialized reference to B. If B contains a reference to A, you can't initialize B without a fully initialized reference to A.

Or it could be a chain of references: A -> B -> C -> D -> A. This is an extremely common design pattern: in general, we have object graphs, not object trees.

Swift has no mechanism to initialize two objects with mutual references, and there's no workaround that doesn't expose implementation details that classes may want to keep private.

It works in Obj-C because Obj-C has no rules preventing references to partially initialized objects from being passed around. By design, Swift doesn't let you do that.
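A minimal sketch of that constraint, using hypothetical Owner and Pet classes that are not from this thread. One side of the cycle has to be "optional" so that it can be filled in after both initializers have returned; the other side can stay non-optional because a fully initialized object already exists by then.

```swift
// Hypothetical types, for illustration only.
final class Owner {
    let name: String
    // Has to be Optional (or a collection): no Pet exists yet while Owner.init runs.
    var pet: Pet?

    init(name: String) {
        self.name = name
    }
}

final class Pet {
    let name: String
    // Can be non-optional, because a fully initialized Owner is passed in.
    unowned let owner: Owner

    init(name: String, owner: Owner) {
        self.name = name
        self.owner = owner
    }
}

let owner = Owner(name: "Ann")
let pet = Pet(name: "Rex", owner: owner)
owner.pet = pet   // the cycle is closed only after both initializers have returned
```

A decoder that rebuilds such a graph has to perform that last assignment somewhere outside the two initializers, which is the step Codable has no hook for.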

Loooop · April 1, 2021, 4:12pm

--- deleted ---

Loooop · April 1, 2021, 5:03pm

QuinceyMorris, I was describing how NSCoder avoids duplicating the same reference type.

Now, your chain of references must contain a weak reference to avoid retain cycles.
So it can be A -> B -> C -> D >> A, where >> is a weak var.

Then, in a very inelegant way, I can write:

class A : Codable {
	private (set) var array = [B]()
	
	/* ... */
	
	private enum CodingKeys: CodingKey  {
		case array
	}
	
	func encode(to encoder: Encoder) throws {
		var container = encoder.container(keyedBy: CodingKeys.self)
		
		try container.encode( array, forKey: .array )
	}
	
	required init(from decoder: Decoder) throws {
		let values	= try decoder.container(keyedBy: CodingKeys.self)
		
		self.array	= try values.decode( [B].self, forKey: .array )
		
		// I can survive with this hack
		self.array.forEach{ $0.finishDecoding( self ) }
	}
}


class B : Codable {
	private (set) var array = [C]()

	/* ... */

	func finishDecoding( _ a:A ) {
		self.array.forEach{ $0.finishDecoding( a ) }
	}
}

class C : Codable {
	private (set) var array = [D]()

	/* ... */

	func finishDecoding( _ a:A ) {
		self.array.forEach{ $0.finishDecoding( a ) }
	}
}

class D : Codable {
	private (set) weak var a : A?
	let name : String
	
	init( name : String ) {
		self.name	= name
	}

	private enum CodingKeys: CodingKey  {
		case name  // do not encode a
	}

	func finishDecoding( _ a:A ) {
		self.a	= a
	}
}

Yes, I am exposing details of the implementation that I would like to keep private.
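For anyone who wants to try it, here is a condensed two-level version of the same hack that should run as-is with JSONEncoder/JSONDecoder. The Parent and Child names are only for this sketch; the back-reference is not encoded and is restored in a second step once the parent's stored properties are set.

```swift
import Foundation

final class Parent: Codable {
    private(set) var children: [Child]

    init(children: [Child]) {
        self.children = children
        children.forEach { $0.finishDecoding(self) }
    }

    private enum CodingKeys: CodingKey { case children }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(children, forKey: .children)
    }

    init(from decoder: Decoder) throws {
        let values = try decoder.container(keyedBy: CodingKeys.self)
        children = try values.decode([Child].self, forKey: .children)
        // Second step: all stored properties are set, so `self` may be handed out now.
        children.forEach { $0.finishDecoding(self) }
    }
}

final class Child: Codable {
    let name: String
    private(set) weak var parent: Parent?

    init(name: String) { self.name = name }

    // `parent` is left out, so the synthesized code skips it and it decodes as nil.
    private enum CodingKeys: CodingKey { case name }

    // Cannot be private, which is the implementation detail that leaks.
    func finishDecoding(_ parent: Parent) { self.parent = parent }
}

let original = Parent(children: [Child(name: "x"), Child(name: "y")])
let data = try JSONEncoder().encode(original)
let decoded = try JSONDecoder().decode(Parent.self, from: data)
```

Between the moment a Child is decoded and the moment finishDecoding runs, its parent is nil, so anything that observes the child in that window sees an incomplete object.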

QuinceyMorris · April 1, 2021, 5:42pm

It's nothing to do with strong/weak references. The problem exists whether decoding strong or weak or unowned references.

It does have something to do with "optional" references. (This is "optional" in a special sense. An Optional is "optional", of course, but so is a Collection of references, since it can be empty and get references added to it later.)

You can't ever create a cycle of non-"optional" references in Swift. However, it's a common pattern to have a parent object with an array of child objects, and a weak ("optional") reference from the child to the parent.

It's also common to have long chains of references through objects of multiple classes, but it's not obvious locally in source code what can be (indirectly) connected to what.

Sure, but we cannot reasonably require hacks like this, in order to prevent Codable from exploding in people's faces!

Currently, reference type initialization is a two-phase procedure in Swift (first up the inheritance chain, then "down" it). For archiving/serialization to work, as your example demonstrates, you need a 3rd phase.

Karl · April 1, 2021, 6:53pm

Hmm... that’s not usually a good sign

rockbruno · April 1, 2021, 7:38pm

My only problem with Codable is how picky it is about types. Something like "value": 0 has to be decoded as an integer type, even though you might want to use it as a String. I think John Sundell's Unbox (JohnSundell/Unbox on GitHub, now deprecated) was much better because it could handle cases like this. It was also much easier to define custom field names, since everything was handled on a per-property basis rather than through Codable's custom key enums.
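For comparison, this is roughly what it takes to get that leniency out of Codable today. LenientString and Payload are hypothetical names for this sketch; the wrapper accepts either a JSON string or a JSON integer and exposes the result as a String.

```swift
import Foundation

// Hypothetical wrapper: decodes a JSON string or a JSON integer into a String.
struct LenientString: Decodable {
    let value: String

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        if let string = try? container.decode(String.self) {
            value = string
        } else {
            // Fall back to an integer; this throws if the value is neither.
            value = String(try container.decode(Int.self))
        }
    }
}

struct Payload: Decodable {
    let value: LenientString
}

let fromNumber = try JSONDecoder().decode(Payload.self, from: Data(#"{"value": 0}"#.utf8))
let fromString = try JSONDecoder().decode(Payload.self, from: Data(#"{"value": "abc"}"#.utf8))
```

The cost is a wrapper type per kind of leniency, and every use site has to go through `.value`.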

tomerd · April 1, 2021, 10:52pm

@drexin sums it up nicely. This thread is about the general Serialization APIs and underlying technology and less about the codecs that ship with Foundation. Of course it could be useful to use these to make a point or give an example, but discussion of how to improve the implementation of a specific codec would be best done in a separate thread.

Terje · April 2, 2021, 9:08am

A while ago I tried to make my own (en/de)coder using Codable. Starting from the Encodable protocol seemed a reasonable place to start. I couldn’t figure it out on my own and had to look for a tutorial. Those were like “yeah well, just look at the open source implementation of the JSON coder and copy that”. Seriously?

Everything felt long, verbose and convoluted. All those containers, and functions for every possible type. But hey, those (i.e. you) guys have to consider everything and I just want to encode like 2 types or whatever.

Sometimes I try to recreate a framework (badly) from scratch to improve my skills and understanding (e.g. CoreData, libDispatch). So why not, let’s recreate this for my narrow use case. Use reflection to get a list of all the variables and ... well ... that might work for encoding, but afaict there is no way to create an instance this way.
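A small sketch of the reflection half of that experiment, with a hypothetical Point type and describeFields helper. Mirror lists the stored properties and their values at runtime, which is enough for a naive encoder, but it offers no way to write a property or construct an instance, so the decoding half has nothing to build on.

```swift
// Hypothetical type, for illustration only.
struct Point {
    var x: Int
    var y: Int
    var label: String
}

// Collects (property name, value description) pairs using runtime reflection.
func describeFields(of value: Any) -> [(name: String, value: String)] {
    Mirror(reflecting: value).children.compactMap { child -> (name: String, value: String)? in
        guard let name = child.label else { return nil }
        return (name: name, value: String(describing: child.value))
    }
}

let fields = describeFields(of: Point(x: 1, y: 2, label: "origin"))
for field in fields {
    print("\(field.name) = \(field.value)")
}
```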

Using PropertyWrappers to encode things like a custom key seemed a good (or naive?) idea but afaict that information is not exposed via reflection.

Or perhaps something, something with keypaths like Codable, but extracting the type out of the keypath type is not possible. So then how does Codable do it? Compiler magic. Ah, great.

While doing all this I went also through the previous related proposals and came across this post by Chris Lattner. It’s about user defined attributes. (That’s kinda meta-programming? Enlighten me if you have spare time.)

It’s a two-year-old post, and while some are probably aware of it, others might have missed it or forgotten about it.

My point, I guess, is: shouldn’t effort and the limited available resources be spent on improving reflection etc. first, before improving Codable? Improvements to the former will lead to improvements to the latter, won’t they? Or at least they will/should make improving Codable easier, from what I understand.

disclaimer: just a user. Limited compiler knowledge.

lukasa · April 2, 2021, 9:51am

The difference between Codable and reflection is the difference between compile-time and runtime. As written today, Codable is a form of compile-time dynamic programming (albeit a limited one).

A replacement for Codable could be implemented on top of reflection, but such a replacement would incur a bunch of problems that are not entirely necessary. In particular, the pre-existing performance problems with Codable are unlikely to be solved by doing runtime type introspection to discover the various fields of an object.

Codable as it stands is more like compile-time metaprogramming, and an interesting way to improve Codable would be to generalise this approach. This has been alluded to above with references to things like Rust’s serde, which builds heavily on Rust’s hygienic macros for compile-time metaprogramming. If Swift had a more general facility for doing this, that would enable Codable to move out into library space instead of being in the compiler.
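To make the "compile-time" part concrete, here is a hand-written conformance for a hypothetical User type that is roughly equivalent to what the compiler derives when CodingKeys, init(from:) and encode(to:) are omitted. It is a sketch of the shape of the synthesized code, not a copy of the compiler's actual output.

```swift
import Foundation

struct User: Codable {
    var id: Int
    var name: String

    init(id: Int, name: String) {
        self.id = id
        self.name = name
    }

    // One key per stored property, known statically at compile time.
    enum CodingKeys: String, CodingKey {
        case id
        case name
    }

    // Each property is decoded with its static type; no runtime introspection is involved.
    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        id = try container.decode(Int.self, forKey: .id)
        name = try container.decode(String.self, forKey: .name)
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(id, forKey: .id)
        try container.encode(name, forKey: .name)
    }
}

let userData = try JSONEncoder().encode(User(id: 7, name: "Ada"))
let user = try JSONDecoder().decode(User.self, from: userData)
```

A general macro or metaprogramming facility would let a library generate this boilerplate instead of the compiler.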

I hesitate to suggest this, though, as the Swift team has made it clear that compile-time metaprogramming is a very large effort that is likely to take quite a while.

sighoya · April 2, 2021, 12:39pm

Yes, it isn't, but they are trying to improve on this. Java's native serialization framework is an anti-pattern, as several experts have noted.
But it's interesting how they want to solve this; maybe there is something in it for Swift.
Or maybe just look at the abundance of third-party serialization frameworks in the Java ecosystem; there seems to be some consensus on the desired features:

• More declarative (de)serialization, usually implemented by annotation + reflection
• More control over serialization by providing custom overriding mechanisms
• Better performance by providing stream-like rather than node-like types

saagarjha · April 3, 2021, 8:09am

Speaking of deserialization safety, I just tracked down and fixed a bug in my supposedly "safe" Swift decoder stemming from an out-of-bounds read (within the outer bounds of the data I needed to decode, of course, but running into an adjacent type) leading to type confusion. Part of this is surely that deserialization is difficult, but I feel like another part of it is that the design of the API gives a decoder a "small window" to look through ("decode this one type with no context") and this makes catching invalid input hard. It is possible to work around this, but the methods to do so are slow and convoluted, mostly requiring passing around some pseudo-global state from the topmost decoder. The fact that some of the decoder methods are not allowed to be mutating or throw errors (SingleValueDecodingContainer.decodeNil(), decode(_:) on both SingleValueDecodingContainer and KeyedDecodingContainer) compounds this difficulty. It seems like decoders are generally supposed to just read all the data at once into some internal in-memory representation and then gradually vend it out using the container methods on demand, but this forces streaming decoders to bend over backwards to cater to the APIs and have subpar error detection.
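One shape the "pseudo-global state" workaround can take from the side of the types being decoded is the userInfo dictionary, sketched below with JSONDecoder. DecodingContext, the decodingContext key and the item limit are all hypothetical; the point is only that every nested init(from:) has to reach back to shared state supplied at the top, because the API gives it no other context.

```swift
import Foundation

// Hypothetical shared state. A class, so every nested decode sees the same instance.
// Marked @unchecked Sendable because some Foundation versions require Sendable
// userInfo values; this sketch is single-threaded.
final class DecodingContext: @unchecked Sendable {
    var decodedItemCount = 0
    let maximumItemCount: Int
    init(maximumItemCount: Int) { self.maximumItemCount = maximumItemCount }
}

extension CodingUserInfoKey {
    // Hypothetical key name.
    static let decodingContext = CodingUserInfoKey(rawValue: "decodingContext")!
}

struct TooManyItems: Error {}

struct Item: Decodable {
    let id: Int

    private enum CodingKeys: CodingKey { case id }

    init(from decoder: Decoder) throws {
        // Look up the context supplied by the topmost decoder, if any.
        if let context = decoder.userInfo[.decodingContext] as? DecodingContext {
            context.decodedItemCount += 1
            guard context.decodedItemCount <= context.maximumItemCount else {
                throw TooManyItems()
            }
        }
        id = try decoder.container(keyedBy: CodingKeys.self).decode(Int.self, forKey: .id)
    }
}

let context = DecodingContext(maximumItemCount: 2)
let decoder = JSONDecoder()
decoder.userInfo[.decodingContext] = context
let items = try decoder.decode([Item].self, from: Data(#"[{"id": 1}, {"id": 2}]"#.utf8))
```

The dictionary lookup and cast happen on every nested decode, which is part of why this approach is slow as well as awkward.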

Loooop · April 7, 2021, 2:43pm

Over the past few days I wrote some serialization code that does away with Codable completely and resembles NSCoder as closely as possible. It already works relatively well, although I'm still working on improving it.

I call it Coding.

Here is a code example:

	class Duck : Coding {
		private var	name	: String

		init( name: String ) {
			self.name = name
		}

		func archive(to archiver: Archiver) throws {
			try archiver.archive(name, forKey: "name")
		}
		
		required init(from unarchiver: Unarchiver) throws {
			name = try unarchiver.unarchive(forKey: "name")
		}
				
		class var codingTypeName: String {
			return defaultCodingTypeName
		}
	}
	
	class Parent : Duck {
		private var childs : [Child]
		
		init( name: String, childs: Child... ) {
			self.childs	= childs
			super.init(name: name)
			
			childs.forEach { $0.parent = self }
		}
		
		override func archive(to archiver: Archiver) throws {
			try archiver.archive(childs, forKey: "childs")
			try super.archive(to: archiver)
		}
		
		required init(from unarchiver: Unarchiver) throws {
			childs = try unarchiver.unarchive(forKey: "childs")
			try super.init(from: unarchiver)
		}
	}

	class Child : Duck, TwoPassCoding { // <--- TwoPassCoding solves circular references
		weak var	parent	: Parent?
		private var	age		: Int
		
		init( name: String, age:Int ) {
			self.age = age
			super.init(name: name)
		}
		
		override func archive(to archiver: Archiver) throws {
			try archiver.archive(parent, forKey: "parent")
			try archiver.archive(age, forKey: "age")
			try super.archive(to: archiver)
		}
		
		required init(from unarchiver: Unarchiver) throws {
			// don't unarchive parent here!
			age = try unarchiver.unarchive(forKey: "age")
			try super.init(from: unarchiver)
		}

		// TwoPassCoding protocol require this method
		func unarchive(from unarchiver: Unarchiver) throws {
			// unarchive parent here!
			parent = try unarchiver.unarchive(forKey: "parent")
		}
	}
	
	func donaldDuckFamily() -> Duck {
		let huey	= Child( name:"Huey", age:5 )
		let dewey	= Child( name:"Dewey", age:6 )
		let louie	= Child( name:"Louie", age:7 )
		
		// just to check duplicates!
		return Parent(name: "Donald Duck", childs: huey, dewey, louie, dewey, louie, dewey, louie, dewey, louie, dewey, louie, huey )
	}
	
	let donaldDuck	= donaldDuckFamily()

I can archive donaldDuck as a root and unarchive it like this:

	let data			= try! AFArchiver().archiveRoot( donaldDuck )
	let outDonaldDuck	= try! AFUnarchiver().unarchiveRoot( Duck.self, data: data )

donaldDuck and outDonaldDuck are identical!

This is the file written and read in readable format:

As you can see:
• Duplicate references are saved only once, using a unique identifier (ID).
• Real types are preserved (outDonaldDuck is a Parent).
• Circular references can be archived and unarchived with a special protocol (TwoPassCoding)!
Note that Parent contains an array of childs and each child points to its parent. In this case we use a special protocol (TwoPassCoding) which refines Coding. After initialization, classes that conform to TwoPassCoding receive a call that allows them to set the missing variables (parent):

func unarchive(from unarchiver: Unarchiver) throws

Coding also supports collections of heterogeneous types. For example:

	let root = [ "A", 1.5, 3, [ "a" : 12, "b" : 31 ], ["c",nil]	] as [Any]

is archived/unarchived in the same way, obtaining:

You must register every type that will need to be unarchived, because Swift does not allow you to instantiate a type from a character string.
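The registration idea can be sketched on its own like this. Restorable, TypeRegistry and the string-keyed fields are hypothetical names for this illustration and are not the actual Coding implementation; the registry just maps a type name to a metatype and calls its required initializer.

```swift
// Hypothetical protocol and registry, not taken from the Coding library.
protocol Restorable {
    static var typeName: String { get }
    init(fields: [String: String]) throws
}

struct MissingField: Error { let key: String }
struct UnknownType: Error { let name: String }

struct TypeRegistry {
    private var types: [String: Restorable.Type] = [:]

    mutating func register(_ type: Restorable.Type) {
        types[type.typeName] = type
    }

    // Looks up the metatype by name, then calls its required initializer.
    func make(typeName: String, fields: [String: String]) throws -> Restorable {
        guard let type = types[typeName] else { throw UnknownType(name: typeName) }
        return try type.init(fields: fields)
    }
}

struct Duck: Restorable {
    static let typeName = "Duck"
    let name: String

    init(fields: [String: String]) throws {
        guard let name = fields["name"] else { throw MissingField(key: "name") }
        self.name = name
    }
}

var registry = TypeRegistry()
registry.register(Duck.self)
let restored = try registry.make(typeName: "Duck", fields: ["name": "Donald"])
```

A name that was never registered fails with an error instead of instantiating an arbitrary type, so the registry also acts as an allow-list.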

Currently the biggest flaw is that Swift doesn't have a unique way to identify a type.

String(describing: type(of:...))

does not take into account the module and:

String(reflecting: type(of:...))

does not appear to be stable. To overcome this problem, Coding requires each type to define:

		class var codingTypeName: String {  // or static var for structs
			return defaultCodingTypeName
		}

optionally returning a name other than defaultCodingTypeName ( = String(describing:self) ).
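A quick way to see why the default name is not unique, using two hypothetical nested types that share a short name. The exact text produced by String(reflecting:) depends on the module name and on where the type is declared, so the sketch only compares the results and does not assume a particular string.

```swift
// Two unrelated types that share the same short name.
enum First {
    struct Item {}
}

enum Second {
    struct Item {}
}

// The short description drops the module and the enclosing type, so these collide.
let firstShort = String(describing: First.Item.self)
let secondShort = String(describing: Second.Item.self)

// The reflected description is qualified, so these differ, but its exact form
// depends on how and where the code is built.
let firstQualified = String(reflecting: First.Item.self)
let secondQualified = String(reflecting: Second.Item.self)

print(firstShort == secondShort)
print(firstQualified == secondQualified)
```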

Finally, these are my basic protocols:

//	Ancestor protocol
protocol CodingBase {
	// 'override' to register in CodingRegister with a 'typeName' other than the default
	static var	codingTypeName: String { get }
}

extension CodingBase {
	var	codingTypeName: String {
		return type(of: self).codingTypeName
	}
	static var defaultCodingTypeName: String {
		return String(describing:self)
		//	return String(reflecting:self)
	}
}

//	Simple Native Types (bool, integers, floats, strings)
protocol CodingNative : CodingBase {
	init( codingRepresentation: String ) throws
	var codingRepresentation : String { get }	// archivingFormat
	var displayRepresentation : String { get }	// for readable print
}

extension CodingNative {
	var displayRepresentation: String {
		return "\(self)"
	}	
	static var	codingTypeName: String {
		return defaultCodingTypeName
	}
}

//	Archiving Composite types

protocol Archivable : CodingBase {
	func archive(to archiver: Archiver) throws
}

// Unarchivable
protocol Unarchivable : CodingBase {
	init(from unarchiver: Unarchiver) throws
	static func register()
}

extension Unarchivable {
	static func register() {
		CodingRegister.shared.register(type: self)
	}
}


// Coding
typealias Coding	= Archivable & Unarchivable

// TwoPassCoding
protocol TwoPassCoding : AnyObject, Coding {
	func unarchive(from unarchiver: Unarchiver) throws
}

// -----------------------------------------------------------------

// Archiver
protocol Archiver {
	func archive<T>(_ value: T, forKey key: String ) throws
	func archive<T>(_ value: T ) throws
}

// Unarchiver:
// Warning! unarchiver.unarchive( ... ) consumes the unarchiver:
// a value, once unarchived from the unarchiver, is no longer present
protocol Unarchiver {
	// keyed --- key shared with superclasses
	func contains( key:String ) throws -> Bool
	func unarchive<T>( forKey key: String ) throws -> T

	// unkeyed --- use with caution (shared in the class hierarchy)
	func count() throws -> Int
	func unarchive<T>() throws -> T
}

Sorry for my bad English.

QuinceyMorris · April 7, 2021, 4:16pm

Unfortunately, this can't be a general solution. There are several problems:

1. The unarchive(from:) method is a major security hole in your archiving system. Because it's a protocol requirement, it's going to have to be exposed (public or internal, depending on the implementer) in a way that allows malicious code to call it. There's no real solution to this, except to turn it into a closure that the archived type passes directly to the unarchiver, but that would prevent your Coding protocol from being synthesized.
2. This requires every instance variable that contains an object reference that needs to be set in unarchive(from:) to be an optional type. Making all instance variables of reference type Optional or IUO (implicitly unwrapped optional) is not likely to be acceptable. If only some of them are made optional, then it requires a global strategy across all your code to decide which ones. Developers aren't going to like that.
3. A 2-pass solution without compiler support leaves a value that has been initialized in an inconsistent state, if it contains any references that have not yet been set in the second pass. Other references to this partially initialized object can be used to access its properties and invoke methods, but it is not safe to do so until after the second pass.

Problem #3 is the real show-stopper here. It makes Swift unsafe in a fairly dramatic way.

Loooop · April 7, 2021, 4:23pm

2 - No: only those that must be weak to avoid memory leaks regardless of Coding. All the other reference variables, and they are the vast majority, can be set in the initializer. Like any child in the array in the example.

QuinceyMorris · April 7, 2021, 7:28pm

As I said upthread, strong vs. weak references (and reference cycles that cause leaks) are not relevant here.

You would have the same problem to solve, for example, if you wanted to archive and unarchive objects A and B, each of which had a weak reference to the other. Those references would need to be restored at unarchiving time, and the fact that they form a circular chain of references means that they'd need your 2nd pass to set them correctly.

They can't both be set "in the initializer" because:

during A's initializer, no valid references to A exist (outside A) yet,
so B cannot have set its reference to A in its initializer,
so B cannot have completed initialization yet,
so no valid references to B exist (outside B) yet,
so A cannot set its reference to B in its initializer.

So, at least one of A or B must set its weak reference in the 2nd pass.

Loooop · April 8, 2021, 5:58am

QuinceyMorris, as far as I know, this code:

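class A {
	var b : B?  // strong reference (Optional: 'var b = B()' together with 'var a = A()' would recurse forever)
}

class B {
	var a : A?  // strong reference
}

let a = A()
let b = B()

a.b	= b  // ok
b.a	= a  // now we have a retain cycle, so the memory leaks...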

leaks memory in Swift.

You must write:

class A {
	weak var b : B?
}

or:

class B {
	weak var a : A?
}

to avoid leaking memory.

Coding requires no more weak references than Swift requires to avoid memory leaks.
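The difference can be checked with a small experiment. All of the names below are only for this sketch: each function builds a two-object cycle and returns a weak handle to one of the objects, so we can see afterwards whether it was deallocated.

```swift
// Holds a weak reference so we can observe whether an object is still alive.
struct WeakRef<T: AnyObject> {
    weak var object: T?
}

final class StrongA { var b: StrongB? }
final class StrongB { var a: StrongA? }

final class WeakA { var b: WeakB? }
final class WeakB { weak var a: WeakA? }

func makeStrongCycle() -> WeakRef<StrongA> {
    let a = StrongA()
    let b = StrongB()
    a.b = b
    b.a = a   // strong in both directions: the pair keeps itself alive
    return WeakRef(object: a)
}

func makeWeakCycle() -> WeakRef<WeakA> {
    let a = WeakA()
    let b = WeakB()
    a.b = b
    b.a = a   // the back-reference is weak, so nothing keeps `a` alive after return
    return WeakRef(object: a)
}

let leaked = makeStrongCycle()
let released = makeWeakCycle()
print(leaked.object != nil)     // true: the strong cycle was never deallocated
print(released.object == nil)   // true: the weak back-reference let both objects go
```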