Hi everyone,
Some of you may remember my TaylorTorch post from a while back. I've been continuing to explore the intersection of Swift and deep learning, and today I'd like to share a new project: MetalHLO.
The motivation
StableHLO has become the lingua franca for ML compilers. JAX, PyTorch, and TensorFlow all emit StableHLO IR, and it's the standard interchange format in the OpenXLA ecosystem. Yet on Apple Silicon, there hasn't been a native way to execute StableHLO programs while fully leveraging what these chips offer – the GPU, Metal Performance Shaders, and the Neural Engine.
MetalHLO fills that gap. It's a standalone library, written entirely in Swift, that compiles and executes StableHLO MLIR programs on Apple Silicon.
What's inside
- 88% StableHLO operation coverage (92 of 105 ops) – enough for production ML workloads (CNNs, Transformers, RNNs, FFT, quantized models)
- Three execution backends:
  - MPSGraph – broad compatibility, leverages Apple's optimized graph runtime
  - Custom Metal kernels – with O0–O3 optimization levels (algebraic simplification, pattern-based fusion for attention, layer norm, GELU, etc.)
  - Heterogeneous GPU+ANE – a cost model automatically partitions each operation to whichever accelerator (GPU or Neural Engine) will complete it faster, and both run concurrently. Because Apple Silicon uses unified memory, there is zero data-transfer cost between them.
- Triple API surface:
  - Native Swift API
  - C API for cross-language integration
  - PJRT plugin – the standard OpenXLA plugin interface, so you can `import jax` and run JAX programs on Apple GPUs without code changes
- Full training support – forward and backward passes
A quick taste
```swift
import MetalHLO

let client = try Client.create()

// Enable heterogeneous GPU + Neural Engine execution
let config = CompilationConfig(
    optimizationLevel: .O3,
    devicePolicy: .auto  // Cost model decides GPU vs ANE per operation
)

let executable = try client.compile(mlir, config: config)
let outputs = try executable.execute(inputs)
```
Or, if you're coming from JAX with the PJRT plugin:
```python
import jax

# Just point JAX at the MetalHLO plugin – no code changes needed
result = jax.numpy.dot(a, b)
```
GPU+ANE – why it matters
Every Apple Silicon chip has a Neural Engine sitting alongside the GPU, and most ML frameworks ignore it entirely. MetalHLO's heterogeneous executor uses a cost model to route operations: large matmuls and convolutions go to the GPU, while element-wise ops and batch normalization go to the ANE. Both execute their queues concurrently over shared unified memory. On element-wise-heavy workloads, this gives a 3–4x speedup over MPSGraph alone.
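For intuition, a routing decision like the one described above can be sketched in a few lines. Everything here is hypothetical – these names and thresholds are not MetalHLO's API, just an illustration of the idea:

```swift
// Hypothetical device and op categories, for illustration only.
enum Device { case gpu, ane }

enum OpKind { case matmul, conv, elementwise, batchNorm }

/// Toy cost model: compute-dense ops favor the GPU; bandwidth-bound ops
/// favor the ANE, unless the tensor is too small to be worth dispatching.
func route(_ op: OpKind, elementCount: Int) -> Device {
    switch op {
    case .matmul, .conv:
        return .gpu                               // large dense compute
    case .elementwise, .batchNorm:
        // Tiny tensors stay on the GPU to avoid per-dispatch overhead.
        return elementCount > 1_024 ? .ane : .gpu
    }
}
```

A production cost model would estimate actual runtimes per device (and ideally refine them with profiling feedback, as mentioned under "What's next"), but the partitioning structure is the same: score each op per accelerator, then enqueue both device queues concurrently.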
Where it stands
This is still an experimental, alpha-stage project – a passion project born from wanting Swift to be a first-class citizen for ML. The StableHLO conformance suite passes 191 of 277 tests (the remaining 86 are skipped due to MPS/Metal limitations in specific edge cases). There is a comprehensive benchmark comparison across all four backends in the README.
What's next
- Expanding Metal kernel coverage for the remaining ops
- Improving the ANE cost model with profiling feedback
- Better JAX integration (more op coverage through PJRT)
- Exploring tighter integration with Swift's differentiable programming: the goal is to connect StableHLO compilation with Swift's native autodiff, enabling end-to-end differentiable ML pipelines entirely in Swift (see, for example, Magma)
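On the autodiff point: Swift's native differentiation is usable today on toolchains that ship the experimental `_Differentiation` module. A minimal sketch of the building block MetalHLO could eventually connect to:

```swift
import _Differentiation

// f(x) = x^2 + 3x, so f'(x) = 2x + 3, and f'(2) = 7.
let df = gradient(at: 2.0) { x in x * x + 3 * x }
// df == 7.0
```

The hope is that a differentiable Swift function could lower to StableHLO and run on the GPU/ANE backends described above, rather than being interpreted on the CPU.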
I'd love feedback, ideas, and contributions. The repo is at github.com/pedronahum/MetalHLO.
I also want to thank maderix for their incredible work reverse-engineering Apple's private ANE APIs and demonstrating training on the Neural Engine. Their project was a key inspiration for MetalHLO's GPU+ANE heterogeneous execution – it showed that the ANE is far more capable than most frameworks give it credit for.
If you've been thinking about Swift's role in the ML compiler stack – or if you just want to run StableHLO on your Mac without leaving the Apple ecosystem – I'd love to hear your thoughts. All my benchmarking has been on my humble M1 with its 8GB of RAM (it's not much, but it's honest work), so if you have a more powerful machine (M4, M5, or anything with more memory), I'd especially love to hear how the library performs on your hardware!