I started to work on this a few days ago. Now the model runs on my computer* and produces images that match Stable Diffusion's output: GitHub - liuliu/swift-diffusion
At the moment, it is a bit of a hassle to get it running. The immediate next step is to move the tokenizer off Python, which will make the whole thing "Swift-only". After that come various performance optimizations, memory usage optimizations, and CPU / Metal work to finally make this usable on mobile. I estimate at least a month of work ahead to make it mobile-friendly.
For some fun, here is the logo for this project, generated with the prompt: "a logo for swift diffusion, with a carton animal and a text diffusion underneath". It seems to have trouble understanding exactly what "diffusion" is, though.
Python tokenizer is removed. Now it is "Swift-only".
This is exciting!
I realize this is extremely early, but in its current state, how does performance compare to something like DiffusionBee on an M1?
There is no comparison. swift-diffusion in its current form only supports Linux + CUDA. CPU support will come later, and at that point it can run on Mac / iOS. But to run efficiently, some ops need to leverage the hardware, either as Metal compute kernels or via the neural engine: tinygrad/accel/ane at master · geohot/tinygrad · GitHub. DiffusionBee currently uses the MPS backend implemented in PyTorch to run efficiently on M1. I haven't looked too deeply into how the MPS backend is implemented, but I would imagine there are some Metal kernels plus ANE in there.
@liuliu note that on the M1 Max at least, the GPU F16 processing power is higher than the ANE. Also, the GPU is more programmable and can be accessed with lower latency. Make sure to use MPS and MPSGraph, otherwise try the simdgroup_matrix in Metal Shading Language. This will provide the highest matrix mul performance.
Apple went this route with MetalFX temporal upscaling. Most people suspect that it runs on the ANE, but it actually runs entirely on the GPU. It's also restricted to only M1/Pro/Max and doesn't run on A14/A15, probably because it needs sufficient GPU F16 TFLOPS.
MPSGraph should be pleasant to use for making neural networks, so I advise trying exclusively MPSGraph at first. But measure the CPU-side overhead, which is massive with MPSGraph, before shipping the final product.
Also if you're struggling with performance, don't hesitate to ping me in a GitHub issue and ask for advice. I know the ins and outs of Metal :)
Can this work in Xcode? SwiftUI specifically.
I've been updating the repo over the past few days. Now img2img should work, as well as inpainting (or you can call it outpainting; it really depends on where the mask is). Both require a text prompt to work (which is weird for inpainting, but I haven't figured out a way to avoid that).
As of now, it doesn't work with Apple hardware yet (requires CUDA, therefore, Linux).
What about M2 in the MacBook Air? Can it run there?
@liuliu Are you going to change it to use MPSGraph instead of CUDA, as @philipturner recommended? It's of no use for people in this forum as we are all using Apple's M-series processors.
Thanks for your work! So where are you running this Swift port? On what hardware platform, and to do what? Just curious.
Yeah, the plan is to support macOS / iOS with enough work. Starting with CUDA is easy as it is proven and I know where to look. Once I get it running on CPU, the work will move to enabling it with MPS (such that I can compare results on macOS).
I am running with Swift 5.6.3 on Ubuntu 20.04 with CUDA 11.7 (any CUDA after 10.2 should be fine), on an RTX 2080 Ti (it should be compatible with other RTX cards as long as they have more than 8GiB of memory).
I am setting up now to validate that the CPU version works on my macOS machine. I need to do some scaffolding so we can port one MPSGraph op over at a time. If you have a Mac with an M1 / M2 chip, that would certainly be helpful when we port MPSGraph over. I am still running an Intel Mac, so any MPS work needs to be validated on an iDevice. I will let you know when the scaffolding is done and we are in op-porting mode.
I have an M2 MacBook Air (10 core GPU).
I am very curious to know how the M2 CPU and GPU perform compared to the M1 and M1 Max!
Are you going to use the GPU or ANE? Or both to see which is better?
Please let me know how I can help!
I have experimental MPS support now and have updated the repo. To run on macOS, it still requires a bit of setup, though:
- Install Bazel: Installing Bazel
- Clone the repo and modify WORKSPACE; specifically, adjust ccv_setting to the following:
    name = "local_config_ccv",
    have_accelerate_framework = True,
    have_pthread = True,
- Create a .bazel.local file, with one line:
- Download weights from http://static.libccv.org/sd-v1.4.ckpt
After the above steps, you should be able to run:
bazel run examples:txt2img --compilation_mode=opt -- /Users/liu/workspace/swift-diffusion "a photograph of an astronaut riding a horse"
You need to adjust the path to where you checked out swift-diffusion repo and it should be the same place you put
If it is slow, try to monitor the progress by adding some prints in this for loop: swift-diffusion/main.swift at main · liuliu/swift-diffusion · GitHub
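A minimal sketch of what such prints could look like. The loop below is a stand-in, not the actual loop from main.swift, and `progressLine` is a hypothetical helper invented for this illustration:

```swift
import Foundation

/// Hypothetical helper: builds one progress line for a sampling loop.
/// `step` is 1-based; `elapsed` is the time spent so far, in seconds.
func progressLine(step: Int, totalSteps: Int, elapsed: TimeInterval) -> String {
  let perStep = elapsed / Double(step)
  let remaining = perStep * Double(totalSteps - step)
  // Round to one decimal place without relying on printf-style formatting.
  let elapsedRounded = (elapsed * 10).rounded() / 10
  let remainingRounded = (remaining * 10).rounded() / 10
  return "step \(step)/\(totalSteps), \(elapsedRounded)s elapsed, ~\(remainingRounded)s left"
}

let totalSteps = 50
let start = Date()
for step in 1...totalSteps {
  // ... one denoising step would run here (placeholder) ...
  print(progressLine(step: step, totalSteps: totalSteps, elapsed: Date().timeIntervalSince(start)))
}
```

The remaining-time estimate is a simple average over the steps so far, so it will be off for the first few iterations.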
My Mac Mini is an Intel one, so this took about half an hour to finish. Your mileage may vary. Also, because my Mac Mini is Intel, it doesn't support Float16; on hardware that does, you can switch to Float16 by changing this line: swift-diffusion/main.swift at main · liuliu/swift-diffusion · GitHub
There could be a lot of perf work left on the table (for example, PyTorch caches its MPSGraph and mine doesn't yet; my way of using MTLBuffer could be inefficient too). That will be next week.
(BTW, somehow MPSGraph's RandomNormal can produce NaN; that took me half a day to debug ...)
I rented an M1 from MacStadium and, with some minor tweaking (my main fuck-up was assuming PAGE_SIZE is always 4KiB, which is wrong on M1, where it is 16KiB), it works on a Mac Mini M1 with 16GiB. Amazingly, Float16 also works out of the box. With Float16, peak memory usage seems to be about 6GiB, still more than I assumed (it should be somewhere around 3.8GiB); I need some debug capability into MPSGraph memory usage.
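For illustration, here is a small sketch of avoiding the hard-coded page size by querying it at runtime. This is not the code from the repo; `roundUpToPage` is a hypothetical helper, and the idea that page-aligned sizes are what the allocator needs here is an assumption:

```swift
import Foundation

// Query the page size at runtime instead of assuming 4096.
// The value differs by platform (for example, 16KiB on M1, as noted above).
let pageSize = Int(getpagesize())

/// Hypothetical helper: rounds `byteCount` up to the next multiple of `pageSize`.
/// `pageSize` must be a power of two, which holds for real page sizes.
func roundUpToPage(_ byteCount: Int, pageSize: Int) -> Int {
  precondition(pageSize > 0 && pageSize & (pageSize - 1) == 0, "page size must be a power of two")
  return (byteCount + pageSize - 1) & ~(pageSize - 1)
}

print("page size:", pageSize)
print("1000 bytes rounds up to", roundUpToPage(1000, pageSize: pageSize))
```

With a 4KiB page, a 5000-byte request rounds up to 8192 bytes; with a 16KiB page, the same request rounds up to 16384 bytes, which is why a hard-coded 4096 silently produces misaligned sizes on M1.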
Once the memory usage is understood, I should switch to implementing the LLM.int8 trick (also known as bitsandbytes: GitHub - TimDettmers/bitsandbytes: 8-bit CUDA functions for PyTorch). I think this is required to run on 6GiB devices (iPhone). On iPad, amazingly, I would assume f16 is enough (they should have 8GiB?).
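To give a rough idea of what 8-bit storage buys, here is a simplified sketch of absmax quantization, which is the basic building block behind this kind of trick. It is not the bitsandbytes implementation: the real LLM.int8 method also keeps outlier features in 16-bit, which is omitted here, and the function names are made up for this example:

```swift
/// Simplified absmax quantization of one vector to Int8 (illustrative only).
/// Returns the quantized values plus the scale needed to restore them.
/// Assumes the input contains only finite values.
func quantizeAbsMax(_ values: [Float]) -> (quantized: [Int8], scale: Float) {
  let absMax = values.map { abs($0) }.max() ?? 0
  // All-zero (or empty) input: any scale works, so use 1.
  guard absMax > 0 else {
    return ([Int8](repeating: 0, count: values.count), 1)
  }
  let scale = absMax / 127
  let quantized = values.map { Int8(($0 / scale).rounded()) }
  return (quantized, scale)
}

/// Restores approximate Float values from the Int8 representation.
func dequantize(_ quantized: [Int8], scale: Float) -> [Float] {
  quantized.map { Float($0) * scale }
}

let weights: [Float] = [1.0, -0.5, 0.25, 0.0]
let (q, scale) = quantizeAbsMax(weights)
print(q, scale, dequantize(q, scale: scale))
```

Each weight takes one byte instead of two for f16, at the cost of a rounding error of at most about half the scale per element.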
On Mac Mini M1, it took about 145s to generate one image in f16 mode (50 steps). This should be comparable to PyTorch's MPS backend (132s).
It seems pretty straightforward to push the performance under 2min. Just encoding ops asynchronously (such that the Metal command queue is always full) would do the trick. There are some memory usage spikes related to that (the CommandBuffer holds on to resources, so if you push the whole graph at once, intermediate allocations are held the entire time, and these are significant). With some tweaks, I can get reasonable memory consumption (f16 around 3.8GiB) with reasonable time (95s) on the Mac Mini M1. These numbers are at 512x512 with 50 steps. It is easy to imagine that at lower resolution and with fewer steps, it would be much faster.
I was able to run this on an iPhone 14 Pro recently. It uses around 2GiB, well under the max memory limit on such a device. Notable changes:
- Switched to NHWC layout, which is a speed improvement;
- Switched softmax from MPSGraph to MPSMatrixSoftMax, which saves about 0.5GiB as it avoids an extra allocation in MPSGraph;
- Switched some GEMMs from MPSGraph to MPSMatrixMultiplication, which saved another 0.5GiB by avoiding an extra allocation.
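For readers unfamiliar with the layout names in the first item, here is a plain-Swift sketch of how the same tensor is indexed in NCHW versus NHWC. It only illustrates the memory ordering; the function names are hypothetical and nothing here reflects how the repo actually performs the conversion:

```swift
/// Flat offset of element (n, c, h, w) when channels are the second-outermost axis.
func offsetNCHW(n: Int, c: Int, h: Int, w: Int, C: Int, H: Int, W: Int) -> Int {
  ((n * C + c) * H + h) * W + w
}

/// Flat offset of the same element when channels are the innermost axis.
func offsetNHWC(n: Int, c: Int, h: Int, w: Int, C: Int, H: Int, W: Int) -> Int {
  ((n * H + h) * W + w) * C + c
}

/// Reorders an NCHW buffer into NHWC (naive loop, for illustration only).
func nchwToNHWC(_ input: [Float], N: Int, C: Int, H: Int, W: Int) -> [Float] {
  precondition(input.count == N * C * H * W, "buffer size must match the shape")
  var output = [Float](repeating: 0, count: input.count)
  for n in 0..<N {
    for c in 0..<C {
      for h in 0..<H {
        for w in 0..<W {
          output[offsetNHWC(n: n, c: c, h: h, w: w, C: C, H: H, W: W)] =
            input[offsetNCHW(n: n, c: c, h: h, w: w, C: C, H: H, W: W)]
        }
      }
    }
  }
  return output
}

// Two channels of a 1x3 image: channel 0 holds 0, 1, 2 and channel 1 holds 10, 11, 12.
let nchw: [Float] = [0, 1, 2, 10, 11, 12]
print(nchwToNHWC(nchw, N: 1, C: 2, H: 1, W: 3))
```

In NHWC, all channel values for one pixel sit next to each other in memory, whereas NCHW stores each channel as a separate plane.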
Right now, it runs on M1 for 50 steps in around 80s. On iPhone 14 Pro, it takes about 46s for 20 steps. I suspect you could cut the time by at least 50% with careful tuning. But I am going to switch gears to other priorities.
Getting back to this thread. I launched an app on the App Store based on the work in swift-diffusion: https://draw.nnc.ai. I plan to port the features in the app over and make swift-diffusion a complete CLI tool (the app is easier, as I only need to deal with Apple platforms, while the CLI requires CUDA as well).
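Since the runs in this thread use different step counts, a quick per-step calculation makes them easier to compare. This is just arithmetic on the figures quoted above; the `Run` type is made up for the example, and dividing total time by steps ignores fixed costs such as text encoding and image decoding:

```swift
/// One timing figure quoted in the thread (hypothetical type for this example).
struct Run {
  let name: String
  let seconds: Double
  let steps: Int
  /// Rough average: total time divided by the number of sampling steps.
  var secondsPerStep: Double { seconds / Double(steps) }
}

let runs = [
  Run(name: "Mac Mini M1, first f16 run", seconds: 145, steps: 50),
  Run(name: "Mac Mini M1, after async encoding tweaks", seconds: 95, steps: 50),
  Run(name: "M1, after layout and allocation changes", seconds: 80, steps: 50),
  Run(name: "iPhone 14 Pro", seconds: 46, steps: 20),
]

for run in runs {
  // Round to one decimal place for display.
  let perStep = (run.secondsPerStep * 10).rounded() / 10
  print("\(run.name): \(perStep) s/step")
}
```

This works out to roughly 2.9, 1.9, 1.6, and 2.3 seconds per step respectively, so the iPhone 14 Pro run is slower per step than the latest M1 figure even though its total time is shorter.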