I wrote a Swift 6 Lambda that zips 15 GB of S3 objects in one invocation, streaming end to end on 512 MB arm64. It was a challenge posted by Jérémie Rodon (RustyServerless), who built the Rust reference implementation.
The Swift contender lands at 1.04x Rust on median (219 s vs 211 s). Three-stage pipeline with TaskGroups, actors, and AsyncStreams. Pure Swift CRC32 (slicing-by-8), hand-rolled ZIP64 encoder, Soto for the S3 layer.
A few findings relevant to this community:
- Soto vs aws-sdk-swift: same app code, same pipeline. Soto finishes in 250 s. The official SDK hits the 600 s timeout. The gap is uploadPart latency: 200 ms (AsyncHTTPClient) vs 680 ms (aws-crt-swift) per 10 MiB part.
- ByteBuffer vs Data: switching the upload path from Data to ByteBuffer eliminated ~15 GB of copies per run. Data's per-append reallocation and the SDK's internal flattening are the culprits.
- Pure-Swift CRC32 vs ARM __crc32d: the intrinsic has a serial dependency chain. Slicing-by-8 issues 8 parallel table lookups. On a 0.29 vCPU allocation, memory-level parallelism wins: 4 ms vs 76 ms per 5 MB file.
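For anyone who has not seen slicing-by-8 before, here is a minimal sketch of the technique. It is not the code from the repository: the names `crcTables`, `loadLE32`, and `crc32` are made up for illustration, and it assumes a plain `[UInt8]` input rather than a ByteBuffer. It uses the reflected CRC-32 polynomial 0xEDB88320, which is the one ZIP requires.

```swift
// Eight 256-entry tables. Table 0 is the classic byte-at-a-time table;
// table t gives the CRC contribution of a byte that sits t positions
// further from the end of the 8-byte block.
let crcTables: [[UInt32]] = {
    var tables = [[UInt32]](repeating: [UInt32](repeating: 0, count: 256), count: 8)
    for i in 0..<256 {
        var c = UInt32(i)
        for _ in 0..<8 {
            if (c & 1) != 0 { c = (c >> 1) ^ 0xEDB8_8320 } else { c >>= 1 }
        }
        tables[0][i] = c
    }
    for i in 0..<256 {
        var c = tables[0][i]
        for t in 1..<8 {
            c = tables[0][Int(c & 0xFF)] ^ (c >> 8)
            tables[t][i] = c
        }
    }
    return tables
}()

// Reads four bytes starting at index i as a little-endian UInt32.
@inline(__always)
func loadLE32(_ bytes: [UInt8], _ i: Int) -> UInt32 {
    let b0 = UInt32(bytes[i])
    let b1 = UInt32(bytes[i + 1]) << 8
    let b2 = UInt32(bytes[i + 2]) << 16
    let b3 = UInt32(bytes[i + 3]) << 24
    return b0 | b1 | b2 | b3
}

func crc32(_ bytes: [UInt8]) -> UInt32 {
    let t = crcTables
    var crc: UInt32 = 0xFFFF_FFFF
    var i = 0
    let n = bytes.count

    // Main loop: consume 8 bytes per iteration. The eight lookups do not
    // depend on each other, only on `lo` and `hi`.
    while n - i >= 8 {
        let lo = crc ^ loadLE32(bytes, i)
        let hi = loadLE32(bytes, i + 4)
        let a = t[7][Int(lo & 0xFF)] ^ t[6][Int((lo >> 8) & 0xFF)]
        let b = t[5][Int((lo >> 16) & 0xFF)] ^ t[4][Int(lo >> 24)]
        let c = t[3][Int(hi & 0xFF)] ^ t[2][Int((hi >> 8) & 0xFF)]
        let d = t[1][Int((hi >> 16) & 0xFF)] ^ t[0][Int(hi >> 24)]
        crc = a ^ b ^ c ^ d
        i += 8
    }

    // Tail: fall back to one byte at a time for the last 0 to 7 bytes.
    while i < n {
        crc = t[0][Int((crc ^ UInt32(bytes[i])) & 0xFF)] ^ (crc >> 8)
        i += 1
    }
    return ~crc
}
```

A quick sanity check is the standard CRC-32 check value: `crc32(Array("123456789".utf8))` should equal `0xCBF43926`. A streaming version would also need to carry the running CRC between chunks, which this sketch leaves out.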
Code: demo-s3-archiving/contenders/swift at main · RustyServerless/demo-s3-archiving · GitHub
Blog post with full context: Can Swift Match Rust on a Lambda Micro-Benchmark? Almost. | Seb in the ☁️
I'm calling on the Swift community here: I'm sure I missed something. The code is about 1,200 lines of Swift 6 strict concurrency. If you spot inefficiencies in how I use TaskGroups, AsyncStreams, actor isolation, or ByteBuffer management, I'd genuinely appreciate the feedback. Same if you see a way to reduce the cold-start variance (9 s sigma vs Rust's 0.5 s at 512 MB).
A few specific questions I still have:
- Is there a better pattern for the producer/consumer handoff than an actor with manual slot counting?
- Could the download stage benefit from a custom AsyncSequence instead of collecting into a pre-sized ByteBuffer?
- Any known pitfalls with AsyncHTTPClient connection pooling on very low vCPU (0.29)?
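For readers who have not seen the pattern in the first question, a minimal sketch of an actor that hands out a fixed number of slots might look like the following. It is an illustration only, not the repository's code: `SlotLimiter`, `acquire`, `release`, and `runDemo` are hypothetical names, and the "work" is a trivial multiplication standing in for a download or upload.

```swift
// An actor that tracks a fixed number of free slots by hand.
// Producers call acquire() before starting work and release() when done.
actor SlotLimiter {
    private var free: Int
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(slots: Int) {
        self.free = slots
    }

    // Takes a slot if one is free, otherwise suspends until release() hands one over.
    func acquire() async {
        if free > 0 {
            free -= 1
            return
        }
        await withCheckedContinuation { continuation in
            waiters.append(continuation)
        }
    }

    // Passes the slot straight to the oldest waiter, or returns it to the pool.
    func release() {
        if waiters.isEmpty {
            free += 1
        } else {
            waiters.removeFirst().resume()
        }
    }
}

// Hypothetical driver: at most two child tasks hold a slot at any time.
func runDemo() async -> Int {
    let limiter = SlotLimiter(slots: 2)
    return await withTaskGroup(of: Int.self) { group in
        for i in 0..<8 {
            await limiter.acquire()   // backpressure: blocks the producer loop
            group.addTask {
                let result = i * 2    // stand-in for real work
                await limiter.release()
                return result
            }
        }
        var total = 0
        for await value in group {
            total += value
        }
        return total
    }
}
```

The sketch ignores cancellation and error paths, which is exactly where the manual bookkeeping tends to get fiddly.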
The STATS=1 instrumentation is left in the code so anyone investigating can get per-stage timings from a single run. PRs and issues welcome.