Implementing URLCache


I'm interested in giving an implementation of URLCache a shot. I think I have a reasonable overview of how it works:

  • In ..someTempDirectory/Caches/BundleID, store information about what the (disk) cache contains.
  • The cached files themselves are in a subdirectory called fsCachedData. The filenames are randomly generated UUIDs without file extensions.
  • The information about what these anonymous files contain is stored (on Darwin) in an SQLite database. Presumably this includes the total cache size and serialised individual CachedURLResponse objects.

The last point is the one I'm particularly interested in: is there strong community interest in trying to keep the Corelibs and Darwin implementations equivalent, or should the storage method be considered an implementation detail?

Given that we don't have SQLite available to us (do we?), my first thought is to store the cache details as JSON via an array of Codable CachedURLResponse objects. This is (very) unlikely to be the most efficient implementation imaginable, but it'd be easy to reason about and implement.
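To make the idea concrete, here's a minimal sketch of what such a JSON-backed index could look like. `CacheEntry` is a hypothetical Codable wrapper (CachedURLResponse itself isn't Codable), and the field names are assumptions, not the actual corelibs design:

```swift
import Foundation

// Hypothetical descriptor for one cached response. CachedURLResponse is not
// Codable, so a small wrapper like this would stand in for it on disk.
struct CacheEntry: Codable {
    let url: URL
    let fileName: String   // UUID-named blob in fsCachedData
    let lastAccess: Date
    let cost: Int          // size of the stored data in bytes
}

// The entire on-disk index is just an array of entries encoded as JSON.
func encodeIndex(_ entries: [CacheEntry]) throws -> Data {
    try JSONEncoder().encode(entries)
}

func decodeIndex(_ data: Data) throws -> [CacheEntry] {
    try JSONDecoder().decode([CacheEntry].self, from: data)
}
```

The whole index would be rewritten on each flush, which is fine at the sizes discussed below but clearly not optimal for very large caches.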

With that in mind, it'd be great to get some feedback about the performance expectations of this class. Our use case generally involves accessing about 30 URLs at once, which I imagine is typical. Each of our cached data files is about 20 KB, which is probably at the smaller end of the spectrum, potentially leaving us with thousands of CachedURLResponses to decode from the cache descriptor at the default cache sizes. The last time I checked (a few years ago), even slow machines could handle decoding/encoding that amount of JSON in well under 1 ms.

Of course, the more interesting question is whether there's a more appropriate data structure we could use on disk that would allow FIFO behaviour (or "oldest access, first out") with less abstraction than JSON would allow. Any ideas?
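For reference, "oldest access, first out" eviction is simple enough to express over an in-memory index regardless of the on-disk format. A sketch, using a hypothetical `Entry` type (not any real corelibs API):

```swift
import Foundation

// Hypothetical in-memory cache descriptor.
struct Entry {
    let fileName: String
    var lastAccess: Date
    let cost: Int   // bytes on disk
}

// Returns the file names of the least recently accessed entries that must be
// removed so the remaining total cost fits within `limit`.
func entriesToEvict(from index: [Entry], limit: Int) -> [String] {
    var total = index.reduce(0) { $0 + $1.cost }
    guard total > limit else { return [] }
    var evicted: [String] = []
    for entry in index.sorted(by: { $0.lastAccess < $1.lastAccess }) {
        evicted.append(entry.fileName)
        total -= entry.cost
        if total <= limit { break }
    }
    return evicted
}
```

The sort makes this O(n log n) per eviction pass; a priority queue or a list kept sorted by access time would avoid that if it ever mattered.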

I'm also not really sure how the on-disk and in-memory caches interact on Darwin. Is the "in memory" part just a subset of what's on disk? Presumably the size limits are about the stored data itself rather than the metadata. Is it acceptable to just keep the lazily loaded metadata in memory?

Any feedback and suggestions on the matter are very welcome! If I don't hear anything by mid next week I'll assume nobody's too concerned about keeping the implementations compatible and just go ahead with the outline here.


Hi @Geordie_J,

The most important thing for us is that the API remains compatible -- the implementation can be different on each platform.

I'd like to avoid adding more dependencies to swift-corelibs-foundation (at least at the start), so coming up with a solution that uses the stuff we already have would be preferable in my mind. Another option besides JSON is simple binary property lists.
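For what it's worth, the binary-plist option needs nothing beyond Foundation: `PropertyListEncoder` can emit the binary format directly. A sketch, with `IndexRecord` as a made-up placeholder type:

```swift
import Foundation

// Hypothetical record type; any Codable index structure would work the same way.
struct IndexRecord: Codable, Equatable {
    let fileName: String
    let originalURL: URL
}

// Persist the index as a binary property list (bplist) rather than JSON.
func writeBinaryPlist(_ records: [IndexRecord], to url: URL) throws {
    let encoder = PropertyListEncoder()
    encoder.outputFormat = .binary   // binary plist, not XML
    try encoder.encode(records).write(to: url, options: .atomic)
}

func readBinaryPlist(from url: URL) throws -> [IndexRecord] {
    try PropertyListDecoder().decode([IndexRecord].self,
                                     from: Data(contentsOf: url))
}
```

Binary plists avoid the text parse/serialise overhead of JSON while staying inside Foundation, which seems to be the constraint here.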

I'm going to cc a few more people that I think may have an opinion here: @Pushkar_N_Kulkarni @IanPartridge.


I think @Mamatha_Busi has been looking at URLCache recently. What were your thoughts, Mamatha?


I agree with @Tony_Parker. Storage/persistence is an implementation detail and we should try to do it using something from Foundation, to avoid external dependencies. I think the use of NSKeyedArchiver/NSKeyedUnarchiver was being explored, but I'm not sure about the feasibility.

Hi @Geordie_J

I am looking into the implementation of URLCache. With regards to persisting the files, I am using the XDG Base Directory Specification's environment variable XDG_CACHE_HOME.

As @Pushkar_N_Kulkarni mentioned, I'm exploring the APIs of NSKeyedArchiver/NSKeyedUnarchiver to persist the objects mapping each URL request to its CachedURLResponse. However, while doing a behavioural study on Darwin I did notice the use of Cache.db files, which contain the info in SQLite format, and was wondering if we have to follow the same implementation. As @Tony_Parker mentioned that simple binary property lists are another option, I would like to give that a try, since we use the same approach for storing HTTPCookies on disk. Please let me know your thoughts on the same.
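To illustrate the NSKeyedArchiver route with plain Foundation types: the sketch below archives a mapping from request URL to the on-disk blob name. Whether CachedURLResponse itself supports NSSecureCoding in corelibs would need checking, so NSDictionary/NSString stand in here:

```swift
import Foundation

// Sketch: persist a [request URL string : blob file name] mapping with
// NSKeyedArchiver. The mapping shape is an assumption for illustration.
func archiveMapping(_ mapping: [String: String]) throws -> Data {
    try NSKeyedArchiver.archivedData(withRootObject: mapping as NSDictionary,
                                     requiringSecureCoding: true)
}

func unarchiveMapping(_ data: Data) throws -> [String: String] {
    // Both the dictionary and its string contents must be in the allowed set
    // for secure unarchiving to succeed.
    let obj = try NSKeyedUnarchiver.unarchivedObject(
        ofClasses: [NSDictionary.self, NSString.self], from: data)
    return obj as? [String: String] ?? [:]
}
```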

If we store stuff on disk, how do we deal with concurrent access to that database from multiple threads/processes? From what I understand, SQLite tries to use POSIX advisory locks, so it should work fine unless the database file lives on NFS. JSON/plists/etc., however, have no locking of their own, so that wouldn't work.

Or am I misunderstanding this and the cache is only per thread?
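One way to get the same advisory-lock protection for a flat JSON/plist index would be to take the lock ourselves around every read/write. A sketch, assuming `flock(2)` semantics (so it inherits the same NFS caveat as SQLite):

```swift
#if canImport(Glibc)
import Glibc
#elseif canImport(Darwin)
import Darwin
#endif
import Foundation

// Sketch: serialise cross-process access to the index file with a POSIX
// advisory lock. Callers wrap their read/modify/write of the file in `body`.
func withFileLock<T>(on path: String, _ body: () throws -> T) throws -> T {
    let fd = open(path, O_CREAT | O_RDWR, 0o644)
    guard fd >= 0 else { throw POSIXError(.EACCES) }
    defer { close(fd) }
    flock(fd, LOCK_EX)            // blocks until the exclusive lock is granted
    defer { flock(fd, LOCK_UN) }
    return try body()
}
```

Like all advisory locking, this only protects against other cooperating users of the same lock, not against arbitrary writers.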

Hi Mamatha,

I have been working on this for a few days. I also got as far as the persistence part. I'm not too familiar with NSCoding, so it'd be great to work on that in parallel. I'll push my version somewhere accessible over the next 24h and maybe we can meet halfway?


I guess this is why I was asking about the performance expectations. The way I’ve conceptualised it so far is that the to-disk synchronisation of this would be write-only, as URLCache instances live in memory (just not the entirety of their associated Data objects, which are stored as independent blobs on disk). Updates would then be queued asynchronously via a serial write queue (with later operations cancelling earlier ones if necessary as an optimisation) and the canonical version can just live in memory.
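The write-behind scheme described above can be sketched with a serial queue and cancellable work items. Everything here (the type, its name, the single-writer assumption) is illustrative, not the actual implementation:

```swift
import Foundation
import Dispatch

// Sketch: write-behind persistence of the index. A newer flush cancels any
// still-pending older one, so at most one write of the latest snapshot is
// ever outstanding. Assumes scheduleFlush is called from a single thread.
final class IndexWriter {
    private let queue = DispatchQueue(label: "urlcache.index.writes")
    private var pending: DispatchWorkItem?

    func scheduleFlush(of snapshot: Data, to url: URL) {
        pending?.cancel()                 // supersede the older queued write
        let item = DispatchWorkItem {
            try? snapshot.write(to: url, options: .atomic)
        }
        pending = item
        queue.async(execute: item)
    }
}
```

Since the canonical state lives in memory, a cancelled write loses nothing: the next flush carries the newer snapshot anyway.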

Sure, Geordie_J. Based on the work that you have done, I will add on if there is something I can contribute.

Hi Mamatha,

Sorry for my lack of activity on this, I have a deadline in a couple of weeks on another project that has taken precedence. If this is a blocking issue for you I’ll push my current state, if not I’ll be able to have another look once the other project is done.


Hello @Geordie_J

Sorry, I got involved in other stuff and could not respond on this. Are you now in a position to proceed with this? Please let me know your plan for getting this completed. If you are still held up, please share the current state and maybe I could add on to it.

Thank you.