I assume you are using reasonably sized Data chunks: not 10- or 100-byte blocks, but at least a few KB each. Otherwise the overhead would be significant.
Some suggestions for you:
- check with "-enforce-exclusivity=none"
- check on Intel. Last time I checked, loops were not automatically unrolled on ARM.
- unroll loops manually to see if there's a difference.
- double-check that the speed of that C code is "on par" with the libz implementation (if not, you may need a better starting point for your Swift port).
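
To illustrate the manual unrolling suggestion, here is a minimal sketch. I don't know what your inner loop looks like, so the kernel is a made-up stand-in (a wrapping byte sum), and the names `sumSimple` and `sumUnrolled4` are hypothetical. The idea is to keep both versions side by side, confirm they agree, and then time each one on your real data:

```swift
// Hypothetical stand-in kernel: a wrapping byte sum, NOT your actual algorithm.
// Straightforward version: one byte per iteration.
func sumSimple(_ bytes: UnsafeBufferPointer<UInt8>) -> UInt32 {
    var sum: UInt32 = 0
    for byte in bytes {
        sum &+= UInt32(byte)
    }
    return sum
}

// Manually unrolled version: four bytes per iteration, then a tail loop
// for the 0...3 bytes that are left over.
func sumUnrolled4(_ bytes: UnsafeBufferPointer<UInt8>) -> UInt32 {
    var sum: UInt32 = 0
    var i = 0
    let unrolledEnd = bytes.count - bytes.count % 4
    while i < unrolledEnd {
        sum &+= UInt32(bytes[i])
        sum &+= UInt32(bytes[i + 1])
        sum &+= UInt32(bytes[i + 2])
        sum &+= UInt32(bytes[i + 3])
        i += 4
    }
    while i < bytes.count {
        sum &+= UInt32(bytes[i])
        i += 1
    }
    return sum
}

// A few KB of input, with a length that is deliberately not a multiple of 4
// so the tail loop is exercised.
let input = [UInt8](repeating: 0xAB, count: 64 * 1024 + 3)
let simple = input.withUnsafeBufferPointer { sumSimple($0) }
let unrolled = input.withUnsafeBufferPointer { sumUnrolled4($0) }
precondition(simple == unrolled, "unrolled version must match the simple one")
```

Whether the unrolled version is actually faster depends on the compiler, the optimization level and the target, so measure both in a release build rather than assuming.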