NSKeyedArchiver, CoreData and other storage solutions

Want to double-check whether NSKeyedArchiver / NSKeyedUnarchiver is appropriate for the following task (I have my suspicions it's not):

  1. open / save archives to a given file location on a local file system. Expected file sizes are up to a few megabytes in the majority of cases, maybe tens of megabytes in some less typical cases, and perhaps a few hundred megabytes at a stretch (such massive archives can be allowed to work a bit slower), but less than a gigabyte. If it helps, archive sizes can be limited to, say, 1/4 of total device RAM.
  2. open / save should be "quick" (see 3)
  3. changing a small portion of a data graph should change a small number of pages on disk to minimise I/O.
  4. should support Codable types
  5. should support value types (like dictionaries or arrays of Ints or Strings or other arrays / dictionaries, possibly custom Codable "structs" with those types in them).
  6. secure coding would be nice to have out of the box, although this is not a show stopper.
  7. nice to have (although not a show stopper) if this storage is fault tolerant / atomic (e.g. an app crash during a storage update would leave the storage undamaged, perhaps with the most recent updates absent). Or, at a minimum, opening a damaged archive should not crash the app, just report that it is damaged, in which case it will be deleted and a new one created instead.

I have a growing suspicion it might be inappropriate, on the following grounds:

  1. archiveRootObject(_:toFile:) / unarchiveObject(withFile:) are deprecated in favour of methods that involve Data:
    @available(iOS, introduced: 2.0, deprecated: 12.0, message: "Use +archivedDataWithRootObject:requiringSecureCoding:error: and -writeToURL:options:error: instead")
    open class func archiveRootObject(_ rootObject: Any, toFile path: String) -> Bool

    @available(iOS, introduced: 2.0, deprecated: 12.0, message: "Use +unarchivedObjectOfClass:fromData:error: instead")
    open class func unarchiveObject(withFile path: String) -> Any?
  1. (and 3) Similarly, I can't see how an incremental write would be possible, as the suggested writeToURL:options:error: won't be able to write only the "changed data portion"; it will write the whole (potentially multi-megabyte) Data. Ditto for quick reading. Unless I use the "alwaysMapped" option, but that's only for reading (hmm) and sounds like a hack.

  2. While I can see encodeEncodable and decodeDecodable, those seem to be on the periphery, and it doesn't look like Codable (but not NSCoding) values are supported with "archiveRootObject" / "archivedDataWithRootObject" or other methods.

  3. The majority (or all?) of NSKeyedArchiver / NSKeyedUnarchiver examples I saw use reference types.

Is my suspicion correct, and should I be looking elsewhere? In which case, what would you recommend? I tend to favour Apple-provided solutions, but if there's no better option I would consider a third-party one. I quickly glanced over CoreData's value transformers: is that the proper tool for this job, or would it result in a massive amount of transformer boilerplate with poor performance characteristics?


Edit: in this task I'm only interested in tree-shaped data structures (not general graphs) and value types (not reference types). Binary / format compatibility is not a concern: if the file is damaged or incompatible, the app should not crash and should just create a new file from scratch. The "one file" restriction can be lifted to "a folder". The requirement of rewriting as little as possible and practical (e.g. on the order of a "page size") is firm: I am very uncomfortable with the idea of constantly rewriting a big, say 10 – 100MB, file just to change a few bytes in it – it drains battery and reduces disk lifetime.

On the face of it, it doesn't sound like NSKeyed{A,Una}rchiver is going to be buying you much over Codable options which exist already, though what solution to go with specifically is going to depend on the rigidity of your requirements.

Primarily, you stand to benefit from NSKeyedArchiver if you're encoding object graphs and not just trees of objects: once you have circular references in encoding and decoding, NSKeyedArchiver offers tools which are easier to work with than Codable does currently. The caveat is that in order to participate in those tools, you have to adopt NSCoding and not just use the Codable support layer, so take that as you will.


In terms of your requirements:

    1. open / save archives to a given file location on a local file system <snip>
    2. open / save should be "quick" (see 3)

    Both archival solutions offer the features you need here, but you'll need to benchmark to see whether either meets your performance requirements. With Codable, the specific encoder and decoder you use may swing the results in different directions; with NSKeyedArchiver you just have the one implementation. Either way, it's worth a test.

    3. changing a small portion of a data graph should change a small number of pages on disk to minimise I/O.

    Depending on what you mean specifically here, you're likely out of luck for all currently-presented solutions, since none of them really support updating in-place in a meaningful way; more on this below.

    4. should support Codable types
    5. should support value types (like dictionaries or arrays of Ints or Strings or other arrays / dictionaries, possibly custom Codable "structs" with those types in them).
    6. secure coding would be nice to have out of the box, although this is not a show stopper.

    Both options support the types you need, and Codable obviates the need for NSSecureCoding, so you're covered there too.
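As a quick illustration of that point, a tree of plain value types round-trips through Codable with no NSSecureCoding ceremony at all, because the decoded type is named at compile time (the `Document` type here is made up for illustration):

```swift
import Foundation

// Hypothetical tree-shaped value type: structs, arrays, dictionaries only.
struct Document: Codable, Equatable {
    var title: String
    var counts: [String: Int]
    var children: [Document]
}

let doc = Document(
    title: "root",
    counts: ["a": 1, "b": 2],
    children: [Document(title: "leaf", counts: [:], children: [])]
)

// Round-trip through any Codable coder; the decode side names the concrete
// type up front, which is the guarantee NSSecureCoding exists to provide
// for NSCoding classes.
let data = try JSONEncoder().encode(doc)
let restored = try JSONDecoder().decode(Document.self, from: data)
print(restored == doc)  // true
```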

    7. nice to have (although not a show stopper) if this storage is fault tolerant / atomic <snip>

    On all filesystems and OSes I'm aware of, this would be in conflict with (3). If you're looking to optimize writes to the disk by doing something like memory-mapping the file and updating only portions of it in-place, you'd have to sacrifice atomicity: there's nothing you can do if the user pulls the power cord after you've started writing to memory but before you're done (leading to data corruption that may not be detectable or recoverable).

    On all systems I'm aware of, atomic writes are done by writing to a temp file, and having the filesystem atomically replace the existing file with the new one. When doing this, you're pretty much always writing to the temp file from scratch, which means writing out all of the data from start to finish — which precludes pretty much any I/O wins.
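That temp-file-plus-rename dance is what Foundation's `.atomic` write option already does for you; a minimal sketch:

```swift
import Foundation

let url = URL(fileURLWithPath: NSTemporaryDirectory())
    .appendingPathComponent("archive.bin")
let payload = Data("imagine a multi-megabyte archive here".utf8)

// .atomic writes to a temporary file first and then renames it over the
// destination, so a crash mid-write leaves either the old file or the
// new one on disk, never a half-written mix.
try payload.write(to: url, options: [.atomic])

let readBack = try Data(contentsOf: url)
print(readBack == payload)  // true
```

Note that this is exactly the whole-file rewrite the question is trying to avoid; it buys atomicity, not reduced I/O.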

(3) is the real sticking point here, depending on what exactly you mean by "change a small number of pages on disk".

  • If you're looking to do something like mmap the file into memory and update portions of it, none of the mentioned solutions here are going to work well for you: NSKeyedArchiver (and CoreData, based on it) only supports writing the entire archive at once; Codable could theoretically support this with an Encoder explicitly written to do this, but I'm not quite sure it could work.

    You'd likely need something written with this use-case in mind: something that keeps a reference to the existing file open, reading it while encoding so as to write out only the pages of data which have changed... which seems pretty niche.

  • You also need to keep in mind that it's pretty difficult to keep archives binary-stable enough to benefit from any optimization that avoids writing pages by matching them up:

    1. The data produced by NSKeyedArchiver (and all Encoders I know of) is not guaranteed to be stable — e.g., reading an archive into memory and writing it back out can produce a different binary blob, because dictionaries are ordered based on how keys are hashed, which can change at runtime.
    2. Inserting data anywhere in the archive necessarily "pushes" all further data out, which requires shifting everything in the file, defeating any I/O optimizations.

    So, you'd need an archiver of some sort which is guaranteed to be binary-stable and also append-only for changes, if you wanted to truly benefit from such a scheme (if I'm understanding your desires as written correctly).
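To make the "only write changed pages" idea concrete, here is a hypothetical sketch of such a niche component (`writeChangedPages` is made up for illustration): it compares the new blob against the bytes already on disk in 4 KB chunks and rewrites only the chunks that differ. It still reads the whole old file, and it still gains nothing when an insertion shifts all subsequent bytes, which is exactly the binary-stability problem:

```swift
import Foundation

let pageSize = 4096

/// Hypothetical sketch: overwrite `url` with `newData`, touching only
/// the 4 KB pages whose contents actually changed.
func writeChangedPages(_ newData: Data, to url: URL) throws {
    guard let old = try? Data(contentsOf: url) else {
        try newData.write(to: url)  // no existing file: write everything
        return
    }
    let handle = try FileHandle(forWritingTo: url)
    defer { try? handle.close() }
    var offset = 0
    while offset < newData.count {
        let end = min(offset + pageSize, newData.count)
        let newPage = newData[offset..<end]
        // The matching slice of the old file (may be empty if the file grew).
        let oldPage = offset < old.count ? old[offset..<min(end, old.count)] : Data()
        if newPage != oldPage {
            try handle.seek(toOffset: UInt64(offset))
            try handle.write(contentsOf: newPage)
        }
        offset = end
    }
    // Handle the file shrinking (or stay at the same size).
    try handle.truncate(atOffset: UInt64(newData.count))
}

// Usage sketch: flip one byte in a 10 KB file; only one 4 KB page is rewritten.
let url = URL(fileURLWithPath: NSTemporaryDirectory()).appendingPathComponent("pagediff.bin")
let before = Data(repeating: 0xAB, count: 10_000)
try before.write(to: url)
var after = before
after[5_000] = 0xCD
try writeChangedPages(after, to: url)
print(try Data(contentsOf: url) == after)  // true
```

Note this sacrifices atomicity, per the earlier point: a crash mid-update leaves a mix of old and new pages.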

Given this, it seems unlikely that this is truly a hard requirement, in which case "fast enough to not be noticeable" may be good enough... and if so, benchmark, benchmark, benchmark. You may discover that existing Encoders are fast enough to do the job you want, in which case you don't even need to think about any of this at all. (And if it's not the case, you can figure out what the bottlenecks are, and go from there. Maybe explore something like protobuf as a next step to see if it has the tools and performance characteristics you're looking for.)
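A minimal timing harness for that benchmarking might look like the following (the payload shape and sizes are made up; substitute your real data and whichever coders you're considering):

```swift
import Foundation

// Hypothetical payload: 100k integers standing in for "a few MB" of data.
struct Payload: Codable, Equatable {
    var values: [Int]
}
let payload = Payload(values: Array(0..<100_000))

// Tiny timing helper; use a proper benchmarking harness for real numbers.
func timed<T>(_ label: String, _ body: () throws -> T) rethrows -> T {
    let start = Date()
    let result = try body()
    print("\(label): \(Date().timeIntervalSince(start)) s")
    return result
}

// Same Codable value, two different coders.
let json = try timed("JSONEncoder encode") { try JSONEncoder().encode(payload) }
let plist = try timed("PropertyListEncoder encode") { try PropertyListEncoder().encode(payload) }

// Sanity check: both round-trip to the original value.
let fromJSON = try timed("JSONDecoder decode") { try JSONDecoder().decode(Payload.self, from: json) }
let fromPlist = try timed("PropertyListDecoder decode") { try PropertyListDecoder().decode(Payload.self, from: plist) }
print(fromJSON == payload && fromPlist == payload)  // true
```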


Thank you for your reply. I edited the original post to clarify the requirements I have. I'm surprised there are no readily available solutions. Will look into protobuf, etc.

IIUC, on the APFS file system copying a file is instantaneous, as it is just a reference-count increase. I also believe that if you copy a 100MB file and change a few bytes in the middle of the copy, the COW machinery kicks in and only one or two 4K blocks of the file would be written (good for battery and disk longevity), and you'll end up with two files, most blocks of which are shared between the two, with just two to four blocks separate.

I thought of mentioning APFS in my post, but held off because I don't want to misspeak out of ignorance; I don't know enough about the specifics of how APFS works to be able to state anything with confidence.

I'm not sure how APFS would treat the approach I mentioned (copy the file, overwrite it, atomically copy back); my main thought would be: if you overwrite a cloned file with almost identical data, does APFS only write out blocks which have changed, or does any write to a block necessarily copy it? From the resources I've found, I can't pinpoint anything authoritative that confirms whether APFS CoW diffs writes to blocks in a cloned file or not:

  • The clonefile(2) man page only states that

    Subsequent writes to either the original or cloned file are private to the file being modified (copy-on-write).

    but doesn't indicate anything further about individual block writes

  • The About Apple File System page gets slightly more specific, but only says that

    Modifications to the data are written elsewhere, and both files continue to share the unmodified blocks. You can use this behavior, for example, to reduce storage space required for document revisions and copies. The figure below shows a file named “My file” and its copy “My file copy” that have two blocks in common and one block that varies between them.

    but this also doesn't cover writes which are identical

I'm having trouble searching for WWDC content prior to 2019, so there may be some information in some older content out there, but I haven't found it. The behavior here should be possible to test, though! (Create a file large enough to confirm that it affects disk space, clone it, and overwrite it with almost identical contents of the same length, and see how available disk space is affected.)
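In that spirit, here's a sketch of such a test. On APFS, `FileManager.copyItem` uses clonefile(2) under the hood; on other filesystems it degrades to a plain copy, so only the "writes are private to the modified file" part is portable, and the disk-space observation is the APFS-specific bit you'd check by hand:

```swift
import Foundation

let fm = FileManager.default
let dir = URL(fileURLWithPath: NSTemporaryDirectory())
let original = dir.appendingPathComponent("big.bin")
let clone = dir.appendingPathComponent("big-clone.bin")

try Data(repeating: 0xFF, count: 1_000_000).write(to: original)
try? fm.removeItem(at: clone)
// On APFS this is a clonefile(2) clone: near-instant, initially sharing
// all blocks with the original.
try fm.copyItem(at: original, to: clone)

// Flip one byte in the middle of the clone, keeping the length fixed.
let handle = try FileHandle(forWritingTo: clone)
try handle.seek(toOffset: 500_000)
try handle.write(contentsOf: Data([0x00]))
try handle.close()

// Writes are private to the modified file...
let originalData = try Data(contentsOf: original)
let cloneData = try Data(contentsOf: clone)
print(originalData[500_000], cloneData[500_000])  // prints: 255 0
// ...and on APFS you'd additionally watch the volume's free space here
// to see whether only one or two blocks were actually copied.
```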

If APFS doesn't support this, theoretically you could meet in the middle by cloning the file, mmapping the clone, updating specific blocks in the clone, then atomically overwriting the original with the clone. (That does still require you to find a scheme where you can selectively update encoded content in the middle of a data blob, which I don't know readily exists.)


As a bit of a framing challenge, since I don't know much about the actual data you're trying to store: instead of managing an archive file (or files) yourself, would using a database make sense for the shape of your data? e.g. SQLite can give you fantastic performance for even huge amounts of data, fault tolerance + atomicity, and managed/limited I/O in a single file — presuming that the shape of your data makes sense to store in a DB. Is this something worth consideration?

(Tools like @gwendal.roue's GRDB can also make it much easier to interface with, so you don't necessarily need to sacrifice nice tooling.)


Did that test just now, and I believe APFS does true COW at the block level. As a test I created a 1GB disk image and put a 100MB file on it. Finder info helpfully shows Used space to byte precision. I changed one byte at the beginning of the file, making sure the length was not changed: the used disk size increased by exactly 16K. Then I scrolled to the middle of the file and changed one byte there: same increase in used disk space. Then I removed a single byte from the middle of the file; in this case the save operation took longer and the used disk size jumped up by 50MB, as expected.

Yes! I need to try all those solutions. My initial fear is that (from not so deep past experience) such tools typically require something more than just Codable conformance... e.g. reference types, or value transformers, or prefixing variables with fancy solution-specific @property wrappers, or NSObject / NSManagedObject subclassing, and so on... while I'd like to keep the project as "vendor lock"-free as possible. (Although this project is for Apple platforms only, so platform-specific optimisations like using the APFS COW machinery are fine.) Another potential (hopefully minor) issue is that those tools are heavy machinery doing way more than needed.

Someone has already been there so I'm looking for pointers in the right direction.

You won't find any of this in GRDB:

struct Player: Codable {
    var id: String
    var name: String
    var score: Int
}

#if canImport(GRDB)
import GRDB
// Fetching and persistence methods
extension Player: FetchableRecord, PersistableRecord { }
#endif