[Proposal] Compression Library

This is a proposal for a canonical and multi-platform compression library for the Swift ecosystem.

Motivation

In the Swift ecosystem we’re currently missing a central place for compression, a place that’s agnostic of algorithm and type and that can run on any platform where Swift runs. There are several libraries that do provide compression, such as:

and there are many more, but none of them covers the matrix of platforms, types and algorithms that the ecosystem needs. This proposal aims to fill that gap with a library that the majority of Swift users can reach for, independently of their platform or stack.

Goals

  • Single, cross-platform compression package
  • Canonical, ecosystem-wide package for vending C-based compression libraries
  • Provide modern Swift APIs for compression
  • Offer pluggable algorithms for extensibility
  • Minimal dependencies, so users only pay for the algorithms they need
  • Safe APIs and strict memory checking under the hood

High Level Approach

Package Structure

The package would be split into several targets:

  • CompressionCore: the core module. Defines the top level APIs and protocols that allow algorithms to be used with said APIs. Provides Span and [UInt8] compression APIs.
  • Czlib, Czstd, Cbrotli, CLZ4: C shims that vendor the respective libraries.
  • CompressionZlib, CompressionZstd, CompressionBrotli, CompressionLZ4: targets that pull in their respective C shim and provide APIs conforming to the protocols in CompressionCore. Single targets allow conditional compilation: the user won’t pay for what they don’t use.
  • Compression: umbrella target that pulls in all available algorithms.
  • CompressionNIO: adds ByteBuffer support. Allows for conditional compilation of NIO.
  • CompressionFoundation: adds Data support. Allows for conditional compilation of Foundation.

This allows for a lot of customisation as users can choose whether to prioritise simplicity, binary size and/or platform support.
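
To make the layout more concrete, here is a minimal sketch of what a manifest for this structure might look like. The target names come from the list above; the package name, the choice of products and the dependency wiring are assumptions for illustration, and only the zlib slice is shown.

```swift
// swift-tools-version: 6.0
import PackageDescription

// Hypothetical manifest: only the target names are taken from the proposal.
let package = Package(
    name: "swift-compression", // assumed package name
    products: [
        // Users who want a single algorithm depend on its product only.
        .library(name: "CompressionCore", targets: ["CompressionCore"]),
        .library(name: "CompressionZlib", targets: ["CompressionZlib"]),
        // Umbrella product that pulls in every available algorithm.
        .library(name: "Compression", targets: ["Compression"]),
    ],
    targets: [
        .target(name: "CompressionCore"),
        // C shim that vendors the zlib sources.
        .target(name: "Czlib"),
        .target(name: "CompressionZlib", dependencies: ["CompressionCore", "Czlib"]),
        .target(name: "Compression", dependencies: ["CompressionZlib"]),
    ]
)
```

With a layout like this, a package that depends only on the CompressionZlib product would never build the other C shims.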

API Ideas

The following are rough sketches of the API; they’re not final by any means.

Core

The core could look something like this:

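public protocol CompressionAlgorithm {
    associatedtype Configuration: CompressionConfiguration

    // Module-qualified so the protocols are not confused with the associated types of the same name.
    associatedtype Compressor: CompressionCore.Compressor where Compressor.Configuration == Configuration
    associatedtype Decompressor: CompressionCore.Decompressor where Decompressor.Configuration == Configuration
}

public protocol CompressionConfiguration {
    // Lets generic call sites write `.default`.
    static var `default`: Self { get }
}

public protocol Compressor: Sendable {
    associatedtype Configuration: CompressionConfiguration
    var configuration: Configuration { get }

    func compress(_ input: some CompressibleInput) throws(CompressionError) -> [UInt8]
}

// Assumed to mirror Compressor.
public protocol Decompressor: Sendable {
    associatedtype Configuration: CompressionConfiguration
    var configuration: Configuration { get }

    func decompress(_ input: some CompressibleInput) throws(CompressionError) -> [UInt8]
}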

And, for each algorithm we want to implement, we could provide conformances for these in the respective target:

import CompressionCore
import Czlib

public enum Deflate: CompressionAlgorithm {}

extension Deflate {
    public struct Configuration: CompressionConfiguration {
        public let level: Level
        public static var `default`: Configuration { .init(level: .speed) }
    }
}

extension Deflate {
    public struct Compressor: CompressionCore.Compressor {
        public let configuration: Configuration

        public func compress(_ input: some CompressibleInput) throws(CompressionError) -> [UInt8] {
            try input.withSpan { span in
                try compress(span) // here we'd be calling Czlib.deflate(...)
            }
        }
    }
}

The associated Decompressor would be similar.
This would allow usage as follows:

let compressor = Deflate.Compressor(configuration: .default)
let input: [UInt8] = [0x42]
try compressor.compress(input)
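
To illustrate how a pluggable algorithm could slot into this shape, here is a small self-contained sketch. It does not use the proposed protocols: ToyAlgorithm, ToyCompressor, ToyDecompressor and RunLength are all hypothetical names, the input is a plain [UInt8] rather than a CompressibleInput, and run-length encoding stands in for a real algorithm such as deflate.

```swift
// Simplified, hypothetical stand-ins for the proposed protocols.
enum ToyCompressionError: Error { case oddLength }

protocol ToyCompressor {
    func compress(_ input: [UInt8]) -> [UInt8]
}

protocol ToyDecompressor {
    func decompress(_ input: [UInt8]) throws -> [UInt8]
}

protocol ToyAlgorithm {
    associatedtype Compressor: ToyCompressor
    associatedtype Decompressor: ToyDecompressor
    static func makeCompressor() -> Compressor
    static func makeDecompressor() -> Decompressor
}

/// Toy algorithm: run-length encoding as (count, byte) pairs.
enum RunLength: ToyAlgorithm {
    struct Compressor: ToyCompressor {
        func compress(_ input: [UInt8]) -> [UInt8] {
            var output: [UInt8] = []
            var index = 0
            while index < input.count {
                let byte = input[index]
                var run = 1
                // Extend the run, capped at 255 so the count fits in one byte.
                while index + run < input.count, input[index + run] == byte, run < 255 {
                    run += 1
                }
                output.append(UInt8(run))
                output.append(byte)
                index += run
            }
            return output
        }
    }

    struct Decompressor: ToyDecompressor {
        func decompress(_ input: [UInt8]) throws -> [UInt8] {
            guard input.count % 2 == 0 else { throw ToyCompressionError.oddLength }
            var output: [UInt8] = []
            for offset in stride(from: 0, to: input.count, by: 2) {
                output.append(contentsOf: repeatElement(input[offset + 1], count: Int(input[offset])))
            }
            return output
        }
    }

    static func makeCompressor() -> Compressor { Compressor() }
    static func makeDecompressor() -> Decompressor { Decompressor() }
}

/// Generic code only sees the protocols, so any conforming algorithm can be plugged in.
func roundTrip<A: ToyAlgorithm>(_ input: [UInt8], using algorithm: A.Type) throws -> [UInt8] {
    try A.makeDecompressor().decompress(A.makeCompressor().compress(input))
}
```

A call such as roundTrip(bytes, using: RunLength.self) never names the concrete compressor type, which is the property the CompressionAlgorithm protocol is after.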

Type Extensibility

The compression functions would accept a CompressibleInput that could look something like this:

public protocol CompressibleInput {
    func withSpan<R>(_ body: (Span<UInt8>) throws -> R) rethrows -> R
}

then, in the CompressionFoundation target, this would allow for:

import CompressionCore

#if canImport(FoundationEssentials)
    import FoundationEssentials
#else
    import Foundation
#endif

extension Data: CompressionCore.CompressibleInput {
    public func withSpan<R>(_ body: (Span<UInt8>) throws -> R) rethrows -> R {
        try body(self.span)
    }
}

This way Data, or really any other type that has Span support, can be compressed or decompressed.

Streaming

As for streaming, I think there’s more than one approach we can take.

  • One possibility is to make the one-shot Compressor a convenience that internally compresses data via a stream and then buffers the output. This is convenient but it might leave some performance on the table.
  • Another approach could be to create completely separate Streaming{De}Compressor types to attach to a CompressionAlgorithm. This would allow for more fine-grained customisation, though it does imply some code duplication.
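
As a rough illustration of the first option, the sketch below builds a one-shot convenience on top of a streaming interface. Everything here is hypothetical: ToyStreamingCompressor, compressAll and StreamingRunLength are made-up names, and run-length encoding stands in for a real codec.

```swift
/// Hypothetical streaming interface: feed chunks, then finish to flush buffered state.
protocol ToyStreamingCompressor {
    mutating func process(_ chunk: [UInt8]) -> [UInt8]
    mutating func finish() -> [UInt8]
}

extension ToyStreamingCompressor {
    /// One-shot convenience: push the whole input through the stream and buffer the output.
    mutating func compressAll(_ input: [UInt8], chunkSize: Int = 4) -> [UInt8] {
        precondition(chunkSize > 0, "chunkSize must be positive")
        var output: [UInt8] = []
        var start = 0
        while start < input.count {
            let end = min(start + chunkSize, input.count)
            // Copying each chunk is part of the overhead this approach accepts.
            output += process(Array(input[start..<end]))
            start = end
        }
        output += finish()
        return output
    }
}

/// Toy streaming run-length encoder that carries an open run across chunk boundaries.
struct StreamingRunLength: ToyStreamingCompressor {
    private var pending: (byte: UInt8, count: UInt8)?

    init() {}

    mutating func process(_ chunk: [UInt8]) -> [UInt8] {
        var output: [UInt8] = []
        for byte in chunk {
            if let run = pending, run.byte == byte, run.count < 255 {
                // Same byte as the open run: extend it.
                pending = (byte: byte, count: run.count + 1)
            } else {
                // Different byte (or full run): emit the open run and start a new one.
                if let run = pending { output += [run.count, run.byte] }
                pending = (byte: byte, count: 1)
            }
        }
        return output
    }

    mutating func finish() -> [UInt8] {
        defer { pending = nil }
        guard let run = pending else { return [] }
        return [run.count, run.byte]
    }
}
```

The extra chunk copies and the intermediate output buffer are where the one-shot path could lose performance compared to a dedicated implementation.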

I’m personally leaning towards the latter, since we’re creating a package for hiding complexity; however, I’m open to suggestions.
In either case, the core could provide something like:

public struct CompressionAsyncSequence<
    BackingSequence: AsyncSequence,
    Algorithm: CompressionAlgorithm
>: AsyncSequence where BackingSequence.Element: CompressibleInput {
    let backingSequence: BackingSequence
    let compressor: Algorithm.StreamingCompressor // Or Algorithm.Compressor
	
	// ...
}

extension AsyncSequence where Element: CompressibleInput {
    public func compressed<Algorithm: CompressionAlgorithm>(
        using algorithm: Algorithm.Type,
        configuration: Algorithm.Configuration = .default
    ) -> some AsyncSequence<[UInt8], Error> {
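        // An opaque return type gives `.init` nothing to infer from, so the type is named.
        CompressionAsyncSequence<Self, Algorithm>(backingSequence: self, configuration: configuration)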
    }
}

This would allow us to stream-compress like this:

let stream: some AsyncSequence<[UInt8], Error>
let compressedStream = stream.compressed(using: Deflate.self, configuration: .default)

for try await chunk in compressedStream {
	// do something with the chunk
}

Streaming would definitely be designed with back-pressure in mind.
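
One way to get back-pressure more or less for free is to keep the wrapper pull-based. The following is a minimal sketch under that assumption: TransformedChunks and chunksTransformed are hypothetical names, and the per-chunk closure merely stands in for a streaming compressor's processing step (a real codec would also need to flush its final block when the upstream ends).

```swift
/// Hypothetical wrapper that transforms each upstream chunk on demand.
struct TransformedChunks<Base: AsyncSequence>: AsyncSequence where Base.Element == [UInt8] {
    typealias Element = [UInt8]

    let base: Base
    let transform: @Sendable ([UInt8]) -> [UInt8]

    struct AsyncIterator: AsyncIteratorProtocol {
        var baseIterator: Base.AsyncIterator
        let transform: @Sendable ([UInt8]) -> [UInt8]

        mutating func next() async throws -> [UInt8]? {
            // The upstream is only asked for a chunk when the consumer asks for one.
            guard let chunk = try await baseIterator.next() else { return nil }
            return transform(chunk)
        }
    }

    func makeAsyncIterator() -> AsyncIterator {
        AsyncIterator(baseIterator: base.makeAsyncIterator(), transform: transform)
    }
}

extension AsyncSequence where Element == [UInt8] {
    /// Wraps the sequence so that every chunk passes through `transform` lazily.
    func chunksTransformed(
        _ transform: @escaping @Sendable ([UInt8]) -> [UInt8]
    ) -> TransformedChunks<Self> {
        TransformedChunks(base: self, transform: transform)
    }
}
```

Because the wrapper never reads ahead, a slow consumer slows down how often the upstream iterator is polled, without any extra buffering in the wrapper itself.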

Algorithms

Initially, the library would ship with support for

  • gzip, zlib and deflate - HTTP legacy compatibility, ubiquitous zlib support
  • zstd - modern, performance-focused general purpose compression
  • brotli - HTTP content encoding
  • lz4 - exceptionally high performance use cases

This is not intended to be an exhaustive list - the extensible design should make it straightforward to add further algorithms over time - but these cover the majority of real-world use cases in the Swift ecosystem today.

Prior Art

The Rust ecosystem provides various crates for compression, each one with a different algorithm, and there’s no unified package for them. flate2 is one of the major players there and provides different backends based on needs. The Rust rewrite of zlib even outperforms C. flate2 also provides both buffered and streaming APIs.

Java has built-in support for compression in java.util.zip, providing zlib functionality via Deflater/Inflater and also on ByteArrayOutputStream, all in the standard library. Java’s approach is stream-oriented and the one-shot API is actually the less natural one, which is interesting.
There’s also Apache Commons Compress which is closer to what is being proposed here, providing APIs for a battery of different algorithms and archives.

Future Directions

While these things are not part of the initial proposal, they would definitely be interesting to keep in mind for future effort.

Underlying Swift Implementation

As the library matures it could be interesting to explore rewriting the C compression algorithms in pure Swift. The vendored C libraries are a good starting point but in the long term we might want to seek the advantages a Swift codebase brings, like better debuggability, type safety and no C interop overhead. Of course this isn’t to be taken lightly: the C compression libraries are battle-tested from a security standpoint, so we’d have to undergo a proper security review.

C Compatible APIs

Afterwards, if and when we reach or exceed C performance, we can expand and try vending a C compatible API with an underlying Swift implementation. This is essentially what zlib-rs has done. Obviously this would be a big undertaking and is therefore not part of the current proposal.

I'd add in compnerd/dft (Debug Fetch Tool) on GitHub, which has a few compression things in it that are important for Windows (zip, mz-zip, zlib, etc.). It doesn't carry any C library dependency nor any dependency on NIO.

Thanks for opening up a discussion around this. Over the past couple of years, this topic of compression has come up again and again in the Swift Server space. As you correctly point out, many packages across the ecosystem vend their own copy of one of the C libraries. I personally think that the general problem of compression is something that we need to address in the Swift ecosystem and there is a lot of value in having one canonical solution. I will bring this to the @ecosystem-steering-group to discuss in one of our next meetings.

I think that such a library is long overdue! (De-)Compression and (un-)archiving functionality are so common in many applications across the spectrum from mobile to server.

And if I might take the opportunity to promote my own nascent project, swift-archive provides a Swift overlay atop the popular C libarchive library, which handles multiple compression and archive formats. It is similar to how Yams imports the C libyaml project, except it is implemented as a fork of libarchive, and so it is easy to keep in sync with the upstream (which continues to be very active after 23 years of existence).

It supports common libraries like zlib universally and uses package traits to handle optionally-available libraries (LZMASupport, ZstdSupport) that might not be present on all platforms. It builds and tests against macOS, Linux, Windows, iOS, and Android. The shape of the API can best be perused in ArchiveTests.swift.

I'd be happy to donate it, or contribute my experience in any way that might be helpful.

there are, i think, two different problems being asked about in this post, the first being:

  1. tight coupling with specific ecosystems like NIO ByteBuffers or Foundation Data, which is sort of also being blamed for lack of portability across platforms, and
  2. specialization of libraries around one single algorithm instead of providing a bundled product

i don’t find #2 to be motivating, there’s nothing wrong with single-algorithm libraries, and anyone who wants to ship a bundle can easily do so by creating a package that depends on the single-algorithm libraries

#1 i think is a real problem that has plagued many Swift ecosystems for years and is something that hurts Swift adoption, and therefore a good compression library has to be ArraySlice<UInt8> native and offer the Foundation and NIO compatibility in an overlay module.

by the way, i also have a zlib implementation, which lives in a module that is part of swift-png. it is pure Swift and has no Foundation or NIO dependency, but the main reason to not use it is because it is bundled with swift-png. it is not the only pure Swift compression library mentioned in this thread that grew alongside a larger project that needed it, and it would be interesting to think about graduating those modules from different authors to a combined compression package

I'd like to follow on with a huge +1 for the concept here. In particular, a library that can either directly read and write zip files, or be used to build enough to write a standard Zip file, would fill a significant gap that exists today, and could help resolve a long-standing issue with DocC support for Windows.

With DocC being part of the built-in toolchain, it's extremely limited in what dependencies it can take on, and the mechanisms it uses unfortunately include ":" in the filenames it outputs in a DocC archive, a long-standing limitation and known issue. Having something like zlib, or a more complete Zip-like library, available in the standard library or as part of the Foundation layering above it could significantly alleviate that earlier, fundamental flaw and provide a path to generating and viewing Swift documentation on Windows.

(I'm not excusing the flaw or suggesting that this can't be solved in other ways, just highlighting that this could help provide a path out of the legacy setup that we've stepped into which unfortunately didn't support the Windows platforms as robustly as alternative designs might have)

I haven't done sufficient analysis, nor do I have the background, to speak to the API layer and what that could or should look like, but I definitely want to reiterate the need and desire for something explicitly cross-platform to build on in this space.

It would be nice to have an official, Apple-supported native Swift compression library that could replace things like marmelroy/Zip (Swift framework for zipping and unzipping files) and zlib-ng/minizip-ng (fork of the popular zip manipulation library found in the zlib distribution), both on GitHub.

A comprehensive compression library would be fantastic. It’s a prerequisite for building libraries for formats like Apache Arrow, Apache Parquet, and cloud-optimised GeoTIFFs. I think Snappy compression would also be necessary for these use cases too.

As a side note, there’s a pure Swift LZ4 parser example in Apple's swift-binary-parsing. It might be interesting to benchmark against the C version. Given swift-binary-parsing uses features like lifetimes, it has a chance of being competitive.

I’d be very happy to lend a hand to this effort; it’s something that’s sorely needed for my own work.

A standard, universal compression library would be nice, but I worry about such a fundamental library being an official swiftlang library. This brings several downsides that have yet to be addressed with regard to maintainers, capabilities, and deployments. So create a library; it just shouldn't be hosted there.

It would be nice to have something that is officially reviewed and maintained by Apple so we don't get another Jia Tan XZ Utils situation.

One Apple engineer reviewing PRs, vs. a group of open source reviewers doing reviews, I know which I'd trust. Really the only thing having Apple own it would help with is that releases would be so rare, any attempt to sneak in malware would have months to get caught! :laughing:

Ideally there would be both. The code would be fully open source so the community would review it but ideally Apple would also use the code internally so for versions that they use they would also have an incentive to review it.

We need to get away from this mindset that anything from Swift is from Apple. It’s perfectly acceptable (and wanted) to have a library living in swiftlang, maintained by the Swift community with no link to Apple

Could not agree more.

I think there are compelling reasons to build a pure-Swift compression library.

For illustrative purposes, I’ve built an experimental pure-Swift decompression library for block-encoded LZ4 and Snappy. The LZ4 implementation was lifted from Swift Binary Parsing.

https://github.com/willtemperley/swift-compression

I’ve benchmarked this using LZ4 from Apple’s compression framework and the only available Snappy shim on Swift Package Index, swift-snappy [1]. Note the description of swift-snappy is a little misleading; this is a C wrapper, not pure Swift.

The results are interesting - pure Swift LZ4 benchmarks at around 1.7x slower than Apple’s version. I’m actually seeing faster decompression in pure Swift than swift-snappy, but the astronomical memory usage of swift-snappy might be skewing the results. (Edit: I fixed the memory leak in swift-snappy and now it’s ~ 2.2x faster than my pure Swift version).

Apart from the utility of having a pure Swift compression library (e.g. smaller binary sizes, simpler concurrency, better portability, etc.), such a library could have a positive effect on the Swift ecosystem. For example it could act as a testbed for features like the upcoming OutputSpan append. Incidentally, this feature may help bridge the performance gap seen in this experiment. Compression performance could be a very good benchmark for Swift, given how well optimised for speed the C libraries are.

I’d argue that such a library could also reduce the likelihood of supply chain attacks. C code is hard to audit and is inherently more vulnerable. Very few people have the requisite skills to perform a security audit of C compression libraries, and there’s no guarantee that there aren’t extant vulnerabilities in the libraries currently shipping.

[1] https://swiftpackageindex.com/lovetodream/swift-snappy

i want to caution people against blindly placing too much faith in “open source review”, although incentives are different, and i think, better, than incentives in large corporations, that does not necessarily translate to “better” security.

most popular open source maintainers are just ordinary people, stretched extremely thin, policing a staggeringly vast surface area of public code the best we can. as public figures, we get targeted by phishing, social engineering attacks all the time, and it is dangerous complacency to assume we are simply too smart to get phished or socially-engineered into approving a malicious contribution.

yes, it’s a shared responsibility to be vigilant and scrutinize every contribution, but the truth is if we ratcheted up the paranoia to a level that, let’s be honest, the sophistication of attackers commands, then virtually no external contributions would ever get accepted, and vast swaths of the library ecosystem would fall into disrepair - a state which itself invites security vulnerabilities.

people want to

  • externalize the costs of securing open-source code
  • benefit from an active, up-to-date package ecosystem
  • be able to trust code imported through multiple levels of dependencies

but these three things cannot all simultaneously be true.

You're right, of course, my point was rather that having an Apple-employed reviewer isn't somehow more reliable than the typical open source community process, especially when those reviews aren't their primary job at Apple.

For anyone who wants to follow along, I've pushed an initial version of the library with zlib support to brokenhandsio/compression on GitHub (a Swift library providing APIs for a number of common compression algorithms). Happy to use this post as a discussion around APIs and improvements as well!

For now I'm using SuppressedAssociatedTypesWithDefaults, so this won't compile without a toolchain from main; it's not final yet, however.

Awesome work Paul! Looks like a great start, especially with the Span support

The proposed API looks very similar to my compression library, especially its refactor. I really need a flexible, yet abstract, compression library exactly like the one you're proposing for my standalone networking library to cut down binary sizes while improving compilation & runtime performance so I can move away from Vapor/Hummingbird.


Will it support package traits, inline arrays, typed throws or be usable in embedded? What will the minimum required Swift version be (I see you require the main branch right now, which is very restrictive)?