Advice on high-level unified file API

theBitThatBytes · May 7, 2024, 12:59pm

Hello, I am seeking advice for a (niche) project.

In our research group, we work with environmental (weather) data daily, and I noticed that a lot of time is spent curating data that comes from different providers in different file types. Let's say you want to assess thermal stress for a certain domain and period. You will have weather data from the national weather service, which will mostly come in some sort of weird CSV structure, and you will have data in NetCDF, HDF or binary format. So far, so good, but here's the crux: one data set will have a 5-min temporal resolution, the other 30-min, and if you are really unlucky, one set has a file for each day while the other has a file for each month or for the whole period. Aligning the data sets is a repetitive and time-consuming procedure.

I think most of this can easily be abstracted and automated. There are tons of well-maintained and documented packages to interact with all kinds of file formats, but each of them requires wrapping your head and logic around its API design. So why not abstract file handling? In the end, I would rather not care whether it's NetCDF, HDF, CSV or JSON; I just want to read, write or modify the file's content. I searched the web but haven't found a library that provides a generic, unified file API. I'm not the first person to face this problem, and these days there are libraries for all sorts of things, which leads me to wonder why there is no unified file API, and whether I've missed something that makes this concept useless or impractical.

Well, all concerns aside, I played around and eventually ended up with this design:

// a kind of lightweight database that wraps the files, facilitates reading and writing to different files and provides an even nicer interaction layer
public final class VirtualDataSource<D: Driver> { 
    private var routerTable: [D] 
    public func register(id: String, source: [DataSource]) throws
    public func register(id: String, source: DataSource...) throws
    public subscript(_ key: String) -> D
    ...
}

// an access modifier is not allowed on an extension that declares a protocol conformance
extension NetCDFDocument: DataSource { ... }
extension CSVDocument: DataSource { ... }
extension HDFDocument: DataSource { ... }

public protocol DataSource {
    func read<T: NetcdfConvertible>(`var`: String) throws -> [T]
    func write<T: NetcdfConvertible>(`var`: String, content: [T], toDimensions: [String]) throws
    // maybe require support for stream reading/writing
    ...
}

enum Meteo: String, Convention {
    case ta   = "AirTemperature"
    case hum  = "Humidity"
    case bgt  = "BGT"
    case pet  = "PET"
    
    var error: some FloatingNumber { -9999.0 }
    
    var unit: any MeteoUnit {
        switch self {
            case .ta:  return TemperaturUnit.degree_celcius
            case .hum: return FractionUnit.percentage
            case .bgt: return TemperaturUnit.degree_celcius
            case .pet: return TemperaturUnit.degree_celcius
        }
    }
}

let dataSource = VirtualDataSource(convention: Meteo.self)
try dataSource.register(id: "FRHERD", source: try NetCDFDocument(url: URL.documentsDirectory.appending(path: "WSN_T1_YEAR/NetCDF/Stations/FRHERD_year.nc")))
try dataSource.register(id: "FRTEST", source: try files.map({ try NetCDFDocument(url: $0) }))
try dataSource.register(id: "FRGUNT", source: try CSVDocument(url: URL.documentsDirectory.appending(path: "WSN_T1_YEAR/CSV/Stations/FRGUNT_year.csv")))

let ta: [Double] = try dataSource["FRTEST"][.ta]
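
To make the routing idea concrete, here is a stripped-down, self-contained sketch of what the lookup might do internally. Everything in it (MiniDataSource, InMemorySource, MiniMeteo, MiniVirtualSource) is a hypothetical stand-in for the types above, with a dictionary playing the role of a file, so it shows only the concept and not the actual implementation.

```swift
// Hypothetical stand-ins for the types in the post; not taken from any real library.
protocol MiniDataSource {
    /// Returns all values stored under `name`, or throws if the variable is missing.
    func read(variable name: String) throws -> [Double]
}

struct MissingVariable: Error { let name: String }
struct UnknownSource: Error { let id: String }

/// A toy backend that keeps its "file" content in a dictionary.
struct InMemorySource: MiniDataSource {
    let variables: [String: [Double]]

    func read(variable name: String) throws -> [Double] {
        guard let values = variables[name] else { throw MissingVariable(name: name) }
        return values
    }
}

/// The convention maps short keys to the variable names used inside the files.
enum MiniMeteo: String {
    case ta  = "AirTemperature"
    case hum = "Humidity"
}

/// Routes an id to one or more sources and concatenates their content
/// in registration order, so several files look like a single series.
final class MiniVirtualSource {
    private var routes: [String: [any MiniDataSource]] = [:]

    func register(id: String, sources: [any MiniDataSource]) {
        routes[id, default: []].append(contentsOf: sources)
    }

    func read(_ id: String, _ key: MiniMeteo) throws -> [Double] {
        guard let sources = routes[id] else { throw UnknownSource(id: id) }
        return try sources.flatMap { try $0.read(variable: key.rawValue) }
    }
}

let store = MiniVirtualSource()
store.register(id: "FRTEST", sources: [
    InMemorySource(variables: ["AirTemperature": [11.5, 12.0]]),
    InMemorySource(variables: ["AirTemperature": [12.5]]),
])
let miniTa = try store.read("FRTEST", .ta)
print(miniTa) // [11.5, 12.0, 12.5]
```

The caller only sees the id and the convention key; whether "FRTEST" is backed by one file or by many is hidden behind the router.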

So far, this allows me to extend the unified API with functions like aligning the time axis, and I get the same functionality no matter the file type. In addition, I can reuse all the code and only have to swap the file driver if I need to switch to another file format.
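
As an illustration of the kind of time-axis alignment meant here, the following is a minimal sketch that brings a 5-min series down to 30-min resolution by block averaging. The function name downsample and its parameters are made up for this example, and it assumes regularly spaced samples without gaps that start on a common timestamp; real data would need alignment based on the actual timestamps. The -9999.0 default mirrors the error value of the Meteo convention.

```swift
/// Averages consecutive blocks of `factor` samples, e.g. a factor of 6
/// turns 5-min data into 30-min data.
/// Samples equal to `missing` are skipped; a block without any valid
/// sample yields `missing`. A shorter trailing block is averaged over
/// the samples it contains.
func downsample(_ values: [Double], factor: Int, missing: Double = -9999.0) -> [Double] {
    precondition(factor > 0, "factor must be positive")
    var result: [Double] = []
    result.reserveCapacity((values.count + factor - 1) / factor)
    for start in stride(from: 0, to: values.count, by: factor) {
        let end = min(start + factor, values.count)
        let valid = values[start..<end].filter { $0 != missing }
        result.append(valid.isEmpty ? missing : valid.reduce(0, +) / Double(valid.count))
    }
    return result
}

// Twelve 5-min samples (one hour), two of them flagged as missing.
let fiveMinute: [Double] = [10, 11, 12, 13, 14, 15, 20, -9999, 22, -9999, 24, 26]
let halfHourly = downsample(fiveMinute, factor: 6)
print(halfHourly) // [12.5, 23.0]
```

Because it works on plain arrays, a function like this can sit on top of the unified API and behave the same for every file driver.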

It's quite painful to wrap my head around all the limitations of Swift's generics and to keep the resulting API as generic and convenient as possible. Before I continue to tackle all the remaining bugs and flaws (there are plenty), I wanted to seek a second opinion on this project.

Happy for any suggestion or reason to improve or stop it :)