Hi everyone,
Some of you may remember my TaylorTorch post from a while back. I've been continuing to explore the intersection of Swift and deep learning, and today I'd like to share a new project: MetalHLO.
The motivation
StableHLO has become the lingua franca for ML compilers. JAX, PyTorch, and TensorFlow all emit StableHLO IR, and it's the standard interchange format in the OpenXLA ecosystem. Yet on Apple Silicon, there hasn't been a native way to execute StableHLO programs while fully leveraging what these chips offer – the GPU, Metal Performance Shaders, and the Neural Engine.
MetalHLO fills that gap. It's a standalone library, written entirely in Swift, that compiles and executes StableHLO MLIR programs on Apple Silicon.
What's inside
- 88% StableHLO operation coverage (92 of 105 ops) – enough for production ML workloads (CNNs, Transformers, RNNs, FFT, quantized models)
- Three execution backends:
  - MPSGraph – broad compatibility, leverages Apple's optimized graph runtime
  - Custom Metal kernels – with O0–O3 optimization levels (algebraic simplification, pattern-based fusion for attention, layer norm, GELU, etc.)
  - Heterogeneous GPU+ANE – a cost model automatically partitions each operation to whichever accelerator (GPU or Neural Engine) will complete it faster, and both run concurrently. Because Apple Silicon uses unified memory, there is zero data-transfer cost between them.
- Triple API surface:
  - Native Swift API
  - C API for cross-language integration
  - PJRT plugin – the standard OpenXLA plugin interface, so you can `import jax` and run JAX programs on Apple GPUs without code changes
- Full training support – forward and backward passes
A quick taste
```swift
import MetalHLO

let client = try Client.create()

// Enable heterogeneous GPU + Neural Engine execution
let config = CompilationConfig(
    optimizationLevel: .O3,
    devicePolicy: .auto  // Cost model decides GPU vs ANE per operation
)

let executable = try client.compile(mlir, config: config)
let outputs = try executable.execute(inputs)
```
Or, if you're coming from JAX with the PJRT plugin:
```python
import jax

# Just point JAX at the MetalHLO plugin – no code changes needed
result = jax.numpy.dot(a, b)
```
GPU+ANE – why it matters
Every Apple Silicon chip has a Neural Engine sitting alongside the GPU, and most ML frameworks ignore it entirely. MetalHLO's heterogeneous executor uses a cost model to route operations: large matmuls and convolutions go to the GPU, while element-wise ops and batch normalization go to the ANE. Both execute their queues concurrently over shared unified memory. On element-wise-heavy workloads, this gives a 3–4x speedup over MPSGraph alone.
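For intuition, a routing decision like the one described above can be sketched in a few lines. Everything here is hypothetical – these names and thresholds are not MetalHLO's API, just an illustration of the idea:

```swift
// Hypothetical device and op categories, for illustration only.
enum Device { case gpu, ane }

enum OpKind { case matmul, conv, elementwise, batchNorm }

/// Toy cost model: compute-dense ops favor the GPU; bandwidth-bound ops
/// favor the ANE, unless the tensor is too small to be worth dispatching.
func route(_ op: OpKind, elementCount: Int) -> Device {
    switch op {
    case .matmul, .conv:
        return .gpu                               // large dense compute
    case .elementwise, .batchNorm:
        // Tiny tensors stay on the GPU to avoid per-dispatch overhead.
        return elementCount > 1_024 ? .ane : .gpu
    }
}
```

A production cost model would estimate actual runtimes per device (and ideally refine them with profiling feedback, as mentioned under "What's next"), but the partitioning structure is the same: score each op per accelerator, then enqueue both device queues concurrently.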
Where it stands
This is still an experimental, alpha-stage project – a passion project born from wanting Swift to be a first-class citizen for ML. The StableHLO conformance suite passes 191 of 277 tests (the remaining 86 are skipped due to MPS/Metal limitations in specific edge cases). There is a comprehensive benchmark comparison across all four backends in the README.
What's next
- Expanding Metal kernel coverage for the remaining ops
- Improving the ANE cost model with profiling feedback
- Better JAX integration (more op coverage through PJRT)
- Exploring tighter integration with Swift's differentiable programming: the goal is to connect StableHLO compilation with Swift's native autodiff, enabling end-to-end differentiable ML pipelines entirely in Swift (see, for example, Magma)
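On the autodiff point: Swift's native differentiation is usable today on toolchains that ship the experimental `_Differentiation` module. A minimal sketch of the building block MetalHLO could eventually connect to:

```swift
import _Differentiation

// f(x) = x^2 + 3x, so f'(x) = 2x + 3, and f'(2) = 7.
let df = gradient(at: 2.0) { x in x * x + 3 * x }
// df == 7.0
```

The hope is that a differentiable Swift function could lower to StableHLO and run on the GPU/ANE backends described above, rather than being interpreted on the CPU.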
I'd love feedback, ideas, and contributions. The repo is at github.com/pedronahum/MetalHLO.
I also want to thank maderix for their incredible work reverse-engineering Apple's private ANE APIs and demonstrating training on the Neural Engine. Their project was a key inspiration for MetalHLO's GPU+ANE heterogeneous execution – it showed that the ANE is far more capable than most frameworks give it credit for.
If you've been thinking about Swift's role in the ML compiler stack – or if you just want to run StableHLO on your Mac without leaving the Apple ecosystem – I'd love to hear your thoughts. All my benchmarking has been on my humble M1 with its 8GB of RAM (it's not much, but it's honest work), so if you have a more powerful machine (M4, M5, or anything with more memory), I'd especially love to hear how the library performs on your hardware!