Serialization in Swift

If GraphCodable is flexible enough, you should be able to use the address of a non-empty array's first element as a deduplication key, via withUnsafeBufferPointer.
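
A minimal sketch of that idea (my own illustration with an assumed helper name, not anything from GraphCodable): since Swift arrays are copy-on-write, two arrays that share storage expose the same base address, so a non-empty array's storage address can act as a cheap identity key.

```swift
// Sketch only: uses the array's storage address as an identity key.
// Valid only while neither array is mutated (mutation may reallocate storage).
func storageKey<T>(of array: [T]) -> UInt? {
    array.withUnsafeBufferPointer { buffer in
        buffer.baseAddress.map { UInt(bitPattern: $0) }  // nil for an empty array
    }
}

let a = [1, 2, 3, 4, 5]
let b = a                 // shares storage with `a` (copy-on-write)
let c = [1, 2, 3, 4, 5]   // equal contents, but separate storage

assert(storageKey(of: a) == storageKey(of: b))
assert(storageKey(of: a) != storageKey(of: c))
```

Note the caveat: the key is only stable while the arrays are left untouched, which is fine for the duration of an encoding pass.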

Shouldn't the reference check be based on "identifier"?

Example implementation:

protocol Node: Identifiable {
    var name: String { get }
    var children: [Self] { get }
    
    init(name: String, children: [Self])
    func dump(level: Int, seen: inout Set<ID>)
}

extension Node {
    // Missing from the original post: `dump` relies on this indentation helper.
    func indent(_ level: Int) -> String {
        String(repeating: "    ", count: level)
    }
    
    init(name: String) {
        self.init(name: name, children: [])
    }
    
    func dump(level: Int = 0, seen: inout Set<ID>) {
        print(indent(level), "\(id): \(name) ", terminator: "")
        let beenSeen = seen.contains(id)
        seen.insert(id)
        
        if beenSeen {
            print("already seen")
        } else if !children.isEmpty {
            print("[")
            children.forEach { child in
                child.dump(level: level + 1, seen: &seen)
            }
            print(indent(level), "]")
        } else {
            print()
        }
    }
}

struct NodeS: Node {
    var id: Int
    var name: String
    var children: [NodeS]
    static var nextId = 1
    
    init(name: String, children: [NodeS] = []) {
        self.id = Self.nextId
        Self.nextId += 1
        self.name = name
        self.children = children
    }
}

final class NodeC: Node {
    var id: ObjectIdentifier { ObjectIdentifier(self) }
    var name: String
    var children: [NodeC]
    
    init(name: String, children: [NodeC]) {
        self.name = name
        self.children = children
    }
}

func nodeTest<NodeType: Node>(_ nodeType: NodeType.Type) {
    print(type(of: nodeType))
    let a = NodeType(name: "a")
    let b = NodeType(name: "b")
    let c = NodeType(name: "c", children: [a, b])
    let d = NodeType(name: "d", children: [c, a])
    
    var seen: Set<NodeType.ID> = []
    d.dump(seen: &seen)
    print()
}

func testX() {
    nodeTest(NodeS.self)
    nodeTest(NodeC.self)
}

Outputs:

NodeS.Type
 4: d [
     3: c [
         1: a 
         2: b 
     ]
     1: a already seen
 ]

NodeC.Type
 ObjectIdentifier(0x00006000026172d0): d [
     ObjectIdentifier(0x0000600002617270): c [
         ObjectIdentifier(0x00006000026171e0): a 
         ObjectIdentifier(0x0000600002617210): b 
     ]
     ObjectIdentifier(0x00006000026171e0): a already seen
 ]

Edit: better structured code example.

In fact I was thinking about something like this, but I don't know what to do with sets and dictionaries.

public protocol GCodableIdentifier {
	var codableIdentifier: ObjectIdentifier { get }
}

extension Array : GCodableIdentifier {
	public var codableIdentifier: ObjectIdentifier {
		withUnsafeBytes { unsafeBitCast( $0.baseAddress!, to: ObjectIdentifier.self )}
	}
}

extension ContiguousArray : GCodableIdentifier {
	public var codableIdentifier: ObjectIdentifier {
		withUnsafeBytes { unsafeBitCast( $0.baseAddress!, to: ObjectIdentifier.self )}
	}
}

extension Set : GCodableIdentifier {
	public var codableIdentifier: ObjectIdentifier {
		// what to do here?
	}
}

extension Dictionary : GCodableIdentifier {
	public var codableIdentifier: ObjectIdentifier {
		// what to do here?
	}
}

Hi Tera,
what you write is absolutely correct, but regarding reference types the GraphCodable library, despite being practically an experiment, not only:

  1. avoids their duplication (using ObjectIdentifier as the key)

but also:

  2. allows conditional archiving;
  3. supports inheritance (and thus saves the reference type name on archiving in order to "reify" it on dearchiving);
  4. rearranges the dearchiving sequence of reference types to ensure that decoding always succeeds if the references form an acyclic graph;
  5. offers the deferDecode method for dearchiving cyclic graphs;
  6. supports versioning (a reference can check the encoded version and act accordingly on decoding);
  7. supports type replacements on decoding.

Regarding value types, the GraphCodable library currently does exactly what Codable does.

The moment you assign an identity to some value type, you can get 1 and possibly implement 2 and 6, but 3, 4, 5 and 7 don't make sense.
And so I can't trivially use the same code; I have to update it ad hoc (as soon as I have some free time) to specifically support value types with identities.

Finally, I'm not sure that using the Identifiable protocol is the right choice: the identity required during archiving/unarchiving may be different from the one needed during execution. But of course this isn't a problem; just define an analogous protocol, even if handling heterogeneous types that implement Identifiable, instead of always the same type (ObjectIdentifier), complicates things a bit.
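
For context on the deferDecode point above, here is the shape of the problem in plain Swift, with no GraphCodable API assumed: when two objects reference each other, neither can be fully constructed before the other, so a decoder needs a deferred second pass to patch the back reference.

```swift
// Two instances that reference each other form a cycle.
final class Person {
    let name: String
    weak var friend: Person?   // weak to avoid a retain cycle in this demo
    init(name: String) { self.name = name }
}

let alice = Person(name: "Alice")
let bob = Person(name: "Bob")

// The back references can only be set after BOTH objects exist,
// which is exactly why decoding a cyclic graph needs a deferred step.
alice.friend = bob
bob.friend = alice

assert(alice.friend === bob && bob.friend === alice)
```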

This is my take: for value types you either have an explicit stored "id" field:

struct NodeS: Node { // Identifiable
    var id: Int
}

(Which BTW can be of arbitrary type so long as it is Hashable.)

Or, if you don't have one, you can use the value type itself as its identity (again, so long as it is Hashable). Example:

struct NodeS: Node, Hashable { // Identifiable
    var id: Self { self }
    var name: String
    var children: [NodeS]
    
    init(name: String, children: [NodeS] = []) {
        self.name = name
        self.children = children
    }
}

For this test:

    let a = NodeType(name: "a")
    let b = NodeType(name: "b")
    let c = NodeType(name: "c", children: [a, b])
    let d = NodeType(name: "d", children: [c, a, c])

Compared to the previous output (with short IDs):

NodeS.Type
 4: d [
     3: c [
         1: a 
         2: b 
     ]
     1: a already seen
     3: c already seen
 ]

Now outputs this:

NodeS.Type
 NodeS(name: "d", children: [RefTest.NodeS(name: "c", children: [RefTest.NodeS(name: "a", children: []), RefTest.NodeS(name: "b", children: [])]), RefTest.NodeS(name: "a", children: []), RefTest.NodeS(name: "c", children: [RefTest.NodeS(name: "a", children: []), RefTest.NodeS(name: "b", children: [])])]): d [
     NodeS(name: "c", children: [RefTest.NodeS(name: "a", children: []), RefTest.NodeS(name: "b", children: [])]): c [
         NodeS(name: "a", children: []): a 
         NodeS(name: "b", children: []): b 
     ]
     NodeS(name: "a", children: []): a already seen
     NodeS(name: "c", children: [RefTest.NodeS(name: "a", children: []), RefTest.NodeS(name: "b", children: [])]): c already seen
 ]

Quite long output in this case, I admit, but it works great. If there's no identity, the value itself is its own identity.

Nobody wants to hash a huge array just so you can discover it can't be deduplicated :-)

Hi Tera, your first take is the good one.

Also note that:

  • Not duplicating value types makes sense when they contain large amounts of data.
  • Not duplicating reference types always makes sense because duplication destroys the graph connecting them: you store one thing and get another.

If you use Codable to archive reference types connected like this:

/// Wikipedia 'Tred-G.svg' graph:
///                        ╭───╮
///          ┌────────────▶︎│ c │─────┐
///          │             ╰───╯     │
///          │               │       │
///          │               ▼       ▼
///        ╭───╮   ╭───╮   ╭───╮   ╭───╮
/// root = │ a │──▶︎│ b │──▶︎│ d │──▶︎│ e │
///        ╰───╯   ╰───╯   ╰───╯   ╰───╯
///         │ │              ▲       ▲
///         │ │              │       │
///         │ └──────────────┘       │
///         └────────────────────────┘

after dearchiving you get a different thing:

///                ╭───╮   ╭───╮
///                │ d │──▶︎│ e │
///                ╰───╯   ╰───╯
///                  ▲
///                  │
///                ╭───╮   ╭───╮
///          ┌────▶︎│ c │──▶︎│ e │
///          │     ╰───╯   ╰───╯
///          │
///        ╭───╮   ╭───╮   ╭───╮   ╭───╮
/// root = │ a │──▶︎│ b │──▶︎│ d │──▶︎│ e │
///        ╰───╯   ╰───╯   ╰───╯   ╰───╯
///         │ │
///         │ │    ╭───╮   ╭───╮
///         │ └───▶︎│ d │──▶︎│ e │
///         │      ╰───╯   ╰───╯
///         │      ╭───╮
///         └─────▶︎│ e │
///                ╰───╯

BTW, are you storing ObjectIdentifiers in archives?! I have a very strange feeling about storing memory addresses in files... I'd use Int / UUID for a node's id.

In this context, storing ObjectIdentifiers is the correct thing to do, since the ObjectIdentifier is what tells us that object references are the same instance.

The ObjectIdentifier isn't a memory address. It's an abstract value whose bit pattern might (as an implementation detail) match the bit pattern of some memory address chosen by the compiler.

Edit: Well, a correct thing to do. I suppose if ObjectIdentifier turned out to be 10K bits long, it'd probably be worth mapping it into an Int or UUID for storage purposes, but the stored value would need the same uniqueness/equality characteristics as the original ObjectIdentifier.
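
The distinction is easy to demonstrate: ObjectIdentifier compares equal exactly when two references point at the same instance, regardless of how its bits happen to be chosen.

```swift
final class Thing {}

let t1 = Thing()
let t2 = t1       // a second reference to the same instance
let t3 = Thing()  // a distinct instance

assert(ObjectIdentifier(t1) == ObjectIdentifier(t2))  // same instance, same identity
assert(ObjectIdentifier(t1) != ObjectIdentifier(t3))  // different instance, different identity
```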

It feels strange to me because of this scenario: I have a graph with 100 elements, store it to disk, quit the app, launch it again, add a new element to the graph, and save it to disk again. The only new ID I want to see is for the new node; all the other IDs should remain the same. (It would be much nicer that way, compared to the case when they all change "randomly".) E.g. I can log those elements and analyse the log, and so long as I don't change the graph much, the IDs in the log would still be valid across the app relaunch. On top of that, if IDs are always incrementing I'd even know the relative "age" of each node. Plus, if IDs are small numbers, such a log is much easier to analyse. YMMV.

No. I archive a progressive objID (UInt32) for each unique ObjectIdentifier I encounter during archiving.

You can see the relevant part of the code here:

private func encodeAnyValue(_ anyValue: Any, forKey key: String?, conditional:Bool ) throws {
	// transform into a single-level Optional<Any>:
	let value	= Optional(fullUnwrapping: anyValue)
	let keyID	= try createKeyID( key: key )
	
	guard let value = value else {
		encodedData.append( .nilValue(keyID: keyID) )
		return
	}
	// now value is not nil!

	//	let valueIdentifier	= (value as? GUniqueValue)?.uniqueID
	
	if let binaryValue = value as? NativeType {
		encodedData.append( .inBinType(keyID: keyID, value: binaryValue ) )
	} else if fullBinaryEncode, let binaryValue = value as? BinaryIOType {
		encodedData.append( .inBinType(keyID: keyID, value: binaryValue ) )
	} else if let object = value as? GCodable & AnyObject {	// reference type
		let identifier = ObjectIdentifier( object )
		if let objID = referenceID.strongID( identifier ) {
			// the object has already been stored, a pointer is enough
			if conditional {
				encodedData.append( .objectWPtr(keyID: keyID, objID: objID) )
			} else {
				encodedData.append( .objectSPtr(keyID: keyID, objID: objID) )
			}
		} else if conditional {
			// Conditional encoding: we have the description but don't want to use it,
			// because it will be needed only if a strong reference arrives later
			
			// In any case, check that the object can be reified
			try ClassData.throwIfNotConstructible( type: type(of:object) )
			
			let objID	= referenceID.createWeakID( identifier )
			encodedData.append( .objectWPtr(keyID: keyID, objID: objID) )
		} else {
			//	store the object
			let typeID	= try encodedData.createTypeIDIfNeeded(type: type(of:object))
			let objID	= referenceID.createStrongID( identifier )
			
			encodedData.append( .objectType(keyID: keyID, typeID: typeID, objID: objID) )
			try encodeValue( object )
			encodedData.append( .end )
		}
	} else if let value = value as? GCodable {
		// value type
		encodedData.append( .valueType( keyID: keyID ) )
		try encodeValue( value )
		encodedData.append( .end )
	} else {
		throw GCodableError.internalInconsistency(
			Self.self, GCodableError.Context(
				debugDescription: "Not GCodable value \(value)."
			)
		)
	}
}
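
The `referenceID` table is opaque in this excerpt. Here is a hypothetical sketch of the strong-ID bookkeeping it performs (the weak-ID side used for conditional encoding is omitted; the names mirror the snippet above, but the implementation is mine, not GraphCodable's):

```swift
// Sketch: hands out progressive UInt32 objIDs, one per unique ObjectIdentifier.
struct ReferenceIDMap {
    private var strongTable: [ObjectIdentifier: UInt32] = [:]
    private var nextID: UInt32 = 0

    // Returns the already-assigned objID, or nil if this object is new.
    func strongID(_ identifier: ObjectIdentifier) -> UInt32? {
        strongTable[identifier]
    }

    // Assigns and returns a fresh progressive objID for the object.
    mutating func createStrongID(_ identifier: ObjectIdentifier) -> UInt32 {
        let objID = nextID
        nextID += 1
        strongTable[identifier] = objID
        return objID
    }
}

final class Obj {}
var map = ReferenceIDMap()
let x = Obj(), y = Obj()

let xID = map.createStrongID(ObjectIdentifier(x))
let yID = map.createStrongID(ObjectIdentifier(y))
assert(xID != yID)
assert(map.strongID(ObjectIdentifier(x)) == xID)  // second encounter: emit a pointer instead
```

Only the small progressive IDs ever reach the archive; the ObjectIdentifiers themselves stay in memory.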

Note: Encoding is the easy thing. Decoding is much more complex.

I've tried it myself and am forced to concur with your analysis. If you've overlooked something then I've overlooked it too!

Here's a sketch of a solution that might work for you. It approaches the problem from a different angle:

import Foundation
import Combine // TopLevelEncoder / TopLevelDecoder are defined in Combine

protocol DeduplicatingTopLevelEncoder: TopLevelEncoder where Output == Data {
    // MARK: encode with deduplication
    // MARK: only dictionary at the top level is supported for now
    // MARK: only Data is supported as output type for now
    // MARK: must always be called with deduplicating = true
    // MARK: keys are currently not deduplicated, only values
    func encode<T>(_ value: T, deduplicating: Bool) throws -> Data where T : Encodable
}

protocol DeduplicatingTopLevelDecoder: TopLevelDecoder where Input == Data {
    // MARK: decode with deduplication
    // MARK: only dictionary at the top level is supported for now
    // MARK: only Data is supported as input type for now
    // MARK: must always be called with deduplicating = true
    // MARK: keys are currently not deduplicated, only values
    func decode<T>(_ type: T.Type, from: Data, deduplicating: Bool) throws -> T where T : Decodable
}

extension JSONEncoder: DeduplicatingTopLevelEncoder {}
extension JSONDecoder: DeduplicatingTopLevelDecoder {}

private let deduplicatingPrefix = ".#"

extension DeduplicatingTopLevelEncoder {
    func encode<T>(_ originalValue: T, deduplicating: Bool) throws -> Data where T : Encodable {
        precondition(deduplicating)
        let originalData = try encode(originalValue)
        let originalObject = try JSONSerialization.jsonObject(with: originalData)
        let originalDict = try (originalObject as? [String: Any]).get()
        
        var stringList: [String] = []
        var stringVocabulary: [String: Int] = [:]
        
        let convertedObject = converted(originalDict) { string in
            if let index = stringVocabulary[string] {
                return deduplicatingPrefix + String(index)
            }
            stringList.append(string)
            let index = stringList.count - 1
            stringVocabulary[string] = index
            return deduplicatingPrefix + String(index)
        }
        var convertedDict = try (convertedObject as? [String: Any]).get()
        convertedDict[deduplicatingPrefix] = stringList
        let options = (self as? JSONEncoder)?.outputFormatting.writingOptions ?? []
        let convertedData = try JSONSerialization.data(withJSONObject: convertedDict, options: options)
        return convertedData
    }
}

extension DeduplicatingTopLevelDecoder {
    func decode<T>(_ type: T.Type, from convertedData: Data, deduplicating: Bool) throws -> T where T : Decodable {
        precondition(deduplicating)
        let convertedObject = try JSONSerialization.jsonObject(with: convertedData)
        var convertedDict = try (convertedObject as? [String: Any]).get()
        let stringList = try (convertedDict.removeValue(forKey: deduplicatingPrefix) as? [String]).get()
        let unconvertedObject = converted(convertedDict) { string in
            let comps = string.components(separatedBy: deduplicatingPrefix)
            precondition(comps.count == 2)
            let index = Int(comps.last!)!
            return stringList[index]
        }
        let unconvertedData = try JSONSerialization.data(withJSONObject: unconvertedObject)
        let unconvertedValue = try decode(type, from: unconvertedData)
        return unconvertedValue
    }
}

private func converted(_ obj: Any, convertedString: (String) -> String) -> Any {
    if let array = obj as? [Any] {
        return array.map { converted($0, convertedString: convertedString) }
    } else if let dict = obj as? [String: Any] {
        return dict.mapValues { converted($0, convertedString: convertedString) }
    } else if let string = obj as? String {
        return convertedString(string)
    } else {
        return obj
    }
}

extension JSONEncoder.OutputFormatting {
    var writingOptions: JSONSerialization.WritingOptions {
        var options: JSONSerialization.WritingOptions = []
        if contains(.prettyPrinted) {
            options.insert(.prettyPrinted)
        }
        if contains(.sortedKeys) {
            options.insert(.sortedKeys)
        }
        if contains(.withoutEscapingSlashes) {
            options.insert(.withoutEscapingSlashes)
        }
        return options
    }
}

enum OptionalError: Error {
    case nullValue
}

extension Optional {
    func get() throws -> Wrapped {
        switch self {
        case .none: throw OptionalError.nullValue
        case .some(let wrapped): return wrapped
        }
    }
}

For this example data structure:

private let hello = "Hello Very Long String Here"
private let new = "New Very Long String Here"
private let world = "World Very Long String Here"

struct TestDataStructure: Codable, Equatable {
    var a: String
    var b: String
    var c: String
    var d: Int
    var stringArray: [String]
    var intArray: [Int]
    var stringDictionary: [String: String]
    var intDictionary: [String: Int]
    var st: Set<String>
}

This test code:

func testDeduplicator() {
    
    let originalValue = TestDataStructure(
        a: hello,
        b: new,
        c: world,
        d: 123,
        stringArray: [hello, new, new, new, world],
        intArray: [1, 2, 3],
        stringDictionary: ["x": hello, "y" : new, "z" : world, "t" : world],
        intDictionary: ["x" : 1],
        st: [hello, new, world]
    )

    let encoder = JSONEncoder()
    // encoder.outputFormatting = [.sortedKeys, .prettyPrinted]
    
    let normalData = try! encoder.encode(originalValue)
    let deduplicatedData = try! encoder.encode(originalValue, deduplicating: true)
    let uncodedValue = try! JSONDecoder().decode(TestDataStructure.self, from: normalData)
    let reduplicatedValue = try! JSONDecoder().decode(TestDataStructure.self, from: deduplicatedData, deduplicating: true)

    print("original value: ", originalValue, "\n")
    print("recreated value: ", reduplicatedValue, "\n")
    precondition(originalValue == reduplicatedValue)
    precondition(originalValue == uncodedValue)
    print("json without deduplication: ", String(data: normalData, encoding: .utf8)!, "\n")
    print("deduplicated json: ", String(data: deduplicatedData, encoding: .utf8)!, "\n")
    print("normal json size: \(normalData.count), deduplicated json size: \(deduplicatedData.count)")
    print()
}

This is a standard-looking JSON encoding / decoding example, the only difference being the "deduplicating: true" parameter in the encode / decode calls. It outputs this:

...
deduplicated json:  {"a":".#0","st":[".#1",".#2",".#0"],"stringArray":[".#0",".#1",".#1",".#1",".#2"],"c":".#2","b":".#1","d":123,".#":["Hello Very Long String Here","New Very Long String Here","World Very Long String Here"],"stringDictionary":{"x":".#0","t":".#2","z":".#2","y":".#1"},"intArray":[1,2,3],"intDictionary":{"x":1}} 

normal json size: 562, deduplicated json size: 309

So all the string values are stored in a separate vocabulary and referenced by funny-looking keys like ".#1"; the vocabulary itself is stored at the top level under the ".#" key.

There are some limitations in this version (only dictionaries at the top level, keys are not deduplicated, only values, and perhaps a few others). Obviously it goes through a secondary serialisation / deserialisation step, so it's not the quickest approach possible, but it would still be interesting to know whether it works fast enough for your case (hopefully it's easy to give it a try).

I've got another idea for something faster using a different approach but that's for another day.

A naive question from me: isn't de-duplicating JSON the role of gzip or other compression algos? I realise that doesn't meet all possible use cases, but in terms of human readability and conformance to a defined spec, what do we gain from implementing a custom dedupe algorithm here?

The GraphCodable encoder now works with value types that conform to an identity protocol.

This code:

public protocol GCodableValueID {
	var	codableValueID : ObjectIdentifier? { get }
}

extension Array : GCodableValueID where Element:GCodable {
	public var codableValueID: ObjectIdentifier? {
		withUnsafeBytes { unsafeBitCast( $0.baseAddress, to: ObjectIdentifier?.self) }
	}
}

do {
	let a 		= [1,2,3,4,5]
	let b 		= [a,a]
	let c 		= [b,b,b]
	let root	= [c,c]
	print("root = \(root)\n")
	print("• Encoded Root Dump ••••••••••••")
	print( try GraphEncoder().dump( root ) )
}

prints:

root = [[[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]], [[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]]

• Encoded Root Dump ••••••••••••
== BODY ==========================================================
- VAL1000
	- VAL1001
		- VAL1002
			- VAL1003
				- 1
				- 2
				- 3
				- 4
				- 5
			.
			- VID1003
		.
		- VID1002
		- VID1002
	.
	- VID1001
.
==================================================================
1 Like

Very valid question. I don't know if David was after "JSON" specifically or used another encoder; if it was JSON indeed, then the difference would be in these areas:

  1. speed difference (and it is not immediately obvious which method is faster; it may well be that gzip is faster :slight_smile: need to check);
  2. compression ratio difference (compression-based coders are expected to give more compact results);
  3. it may be important to have the result in JSON format for some reason (except for the converted strings), e.g. to do some further processing on it;
  4. coding object graphs vs coding value trees.

If these properties are not important, or only (3) and (4) are not important and the encoding / decoding speed is in fact faster with a compressing encoder (some real-world testing can answer that question), here's a version that uses "ZipEncoder" / "ZipDecoder". These are "normal" top-level encoder / decoder types that take a "sub encoder / decoder" parameter: one that knows how to do the actual values ←→ Data serialisation, e.g. JSONEncoder / JSONDecoder. You may also customise the actual compression algorithm (I'm using LZ4 by default).

import Foundation
import Combine // TopLevelEncoder / TopLevelDecoder are defined in Combine
import Compression

enum ZipDecoderError: Error {
    case cantDecode
}

// MARK: zipping encoder
struct ZipEncoder<SubEncoder: TopLevelEncoder> : TopLevelEncoder where SubEncoder.Output == Data {
    let subEncoder: SubEncoder
    var algorithm: compression_algorithm = COMPRESSION_LZ4

    func encode<T>(_ value: T) throws -> Data where T : Encodable {
        var srcData = try subEncoder.encode(value)
        let size = srcData.count
        var dstData = Data(count: size)
        
        let resultSize = srcData.withUnsafeBytes { src in
            dstData.withUnsafeMutableBytes { dst in
                compression_encode_buffer(dst.baseAddress!, size, src.baseAddress!, size, nil, algorithm)
            }
        }
        if resultSize == 0 {
            var byte: UInt8 = 0x20 // not compressed
            srcData.append(&byte, count: 1)
            return srcData
        }
        dstData.count = resultSize
        var byte: UInt8 = 0x09 // compressed
        dstData.append(&byte, count: 1)
        
        // TODO: better encode size in the compressed data. Example format:
        // TODO: <compressed bytes> <8 bytes: uncompressed size> <1 byte: compressed marker>
        // TODO: <uncompressed bytes> <1 byte: uncompressed marker>

        return dstData
    }
}

// MARK: zipping decoder
struct ZipDecoder<SubDecoder: TopLevelDecoder>: TopLevelDecoder where SubDecoder.Input == Data {
    let subDecoder: SubDecoder
    var algorithm: compression_algorithm = COMPRESSION_LZ4

    func decode<T>(_ type: T.Type, from srcData: Data) throws -> T where T : Decodable {
        let size = srcData.count - 1
        let byte = srcData[size]
        var subData = srcData
        
        if byte == 0x09 { // compressed
            let dstSize = size * 1000 // TODO: ?!? see the above TODO how to do this better
            var dstData = Data(count: dstSize)
            let resultSize = srcData.withUnsafeBytes { src in
                dstData.withUnsafeMutableBytes { dst in
                    compression_decode_buffer(dst.baseAddress!, dstSize, src.baseAddress!, size, nil, algorithm)
                }
            }
            if resultSize == 0 {
                throw ZipDecoderError.cantDecode
            }
            dstData.count = resultSize
            subData = dstData
        }
        let result = try subDecoder.decode(type, from: subData)
        return result
    }
}
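
The TODOs above suggest a cleaner framing: append the uncompressed size to the stream so the decoder doesn't have to guess with `size * 1000`. Here is a hypothetical pair of helpers implementing the trailer layout sketched in the TODO comments (the helper names are mine):

```swift
import Foundation

// Trailer layout, as per the TODOs:
//   <compressed bytes>   <8 bytes: uncompressed size, big endian> <1 byte: 0x09>
//   <uncompressed bytes>                                          <1 byte: 0x20>
func appendTrailer(to data: inout Data, uncompressedSize: Int, marker: UInt8) {
    withUnsafeBytes(of: UInt64(uncompressedSize).bigEndian) { data.append(contentsOf: $0) }
    data.append(marker)
}

func splitTrailer(_ data: Data) -> (payload: Data, uncompressedSize: Int, marker: UInt8)? {
    guard data.count >= 9 else { return nil }
    let marker = data.last!
    let sizeBytes = data.dropLast(1).suffix(8)
    let size = sizeBytes.reduce(UInt64(0)) { ($0 << 8) | UInt64($1) }  // big-endian fold
    return (data.dropLast(9), Int(size), marker)
}

// Round trip
var stream = Data([1, 2, 3])
appendTrailer(to: &stream, uncompressedSize: 3, marker: 0x09)
let parsed = splitTrailer(stream)!
assert(parsed.payload == Data([1, 2, 3]))
assert(parsed.uncompressedSize == 3)
assert(parsed.marker == 0x09)
```

With the size in hand, the decoder can allocate an exactly-sized destination buffer before calling `compression_decode_buffer`.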

The same test as above:

// MARK: Example Area

func zipEncoderExample() {
    // MARK: for sub encoder / decoder use any that works with Data
    let subEncoder = JSONEncoder()
    let subDecoder = JSONDecoder()
    
    let zipEncoder = ZipEncoder(subEncoder: subEncoder /*, algorithm: COMPRESSION_LZ4 */)
    let zipDecoder = ZipDecoder(subDecoder: subDecoder)
    
    let originalValue = TestDataStructure()
    let codedData = try! subEncoder.encode(originalValue)
    let zippedData = try! zipEncoder.encode(originalValue)
    let uncodedValue = try! subDecoder.decode(TestDataStructure.self, from: codedData)
    let unzippedValue = try! zipDecoder.decode(TestDataStructure.self, from: zippedData)
    precondition(uncodedValue == unzippedValue)
    print("json size: \(codedData.count), zipped json size: \(zippedData.count)")
    // json size: 562, zipped json size: 307
    print()
}

private let hello = "Hello Very Long String Here"
private let new = "New Very Long String Here"
private let world = "World Very Long String Here"

private struct TestDataStructure: Codable, Equatable {
    var a: String = hello
    var b: String = new
    var c: String = world
    var d: Int = 123
    var stringArray: [String] = [hello, new, new, new, world]
    var intArray: [Int] = [1, 2, 3]
    var stringDictionary: [String: String] = ["x": hello, "y" : new, "z" : world, "t" : world]
    var intDictionary: [String: Int] = ["x" : 1]
    var st: Set<String> = [hello, new, world]
}

outputs:

json size: 562, zipped json size: 307

Coincidentally, almost the same size as with the manual method above: 309!

That's still something else for another day.

If you have N equal arrays of M elements sharing the same internal storage, and therefore occupying the memory needed for 1×M elements, then after archiving and dearchiving your N equal arrays will occupy memory for N×M elements, regardless of how much the file has been compressed by zip.

In light of the discussions here, I think the problem is best described the other way around: how do we stop one object from being encoded multiple times, causing the inevitable decode errors noted by @Loooop?

That is to say, JSONEncoder et al. could be considered broken for certain object graphs, even simple acyclic graphs one might expect to work.

Arguably it's not "broken" as JSON itself can't handle that :wink:

† unless you get creative and do some #ref workarounds, but from the JSON file's perspective that would be a tree, not a graph.

Text compression relies on matching patterns to previously encountered sequences. If the input format encodes repeated data in closer proximity to other duplicate data, or transforms it such that data which previously did not appear duplicated to the compression algorithm now appears duplicated, that will make the compression much more effective.
