Nothing I did to the swiftc arguments could convince it to vectorise this, even though it feels like it should be vectorisable.
So I took a crude stab at manually vectorising it for NEON, but the result is 25% slower than the scalar version. I suspect that's because it isn't utilising NEON well: the generated assembly looks pretty horrific in general, and I couldn't figure out any way to get a vector gather instruction generated.
I posit that this'd work substantially better with SVE2, but unless I really missed something, none of Apple's chips support SVE at all. 
let MAX: UInt32 = 440_000_000
// cache[d] holds d^d for each decimal digit d (with 0^0 taken as 0).
let cache: [UInt32] = [0, 1, 4, 27, 256, 3_125, 46_656, 823_543, 16_777_216, 387_420_489]

cache.withUnsafeBufferPointer { fastCache in
    func is_munchausen(numbers: SIMD4<UInt32>) -> SIMDMask<SIMD4<UInt32>.MaskStorage> {
        var n = numbers
        var totals = SIMD4<UInt32>(repeating: 0)

        // Keep going while any lane still has digits left and has not yet overshot its target.
        while any(n .> 0 .& totals .<= numbers) {
            let remainders = n % 10
            n /= 10
            // Manual "gather": one scalar table lookup per lane.
            totals &+= SIMD4(fastCache[Int(remainders.x)],
                             fastCache[Int(remainders.y)],
                             fastCache[Int(remainders.z)],
                             fastCache[Int(remainders.w)])
        }

        return totals .== numbers
    }

    // The loop below steps four candidates at a time.
    assert(0 == MAX % 4)

    for k in stride(from: 0, to: MAX, by: 4) {
        let numbers = SIMD4(k, k + 1, k + 2, k + 3)
        let matches = is_munchausen(numbers: numbers)

        for i in matches.indices {
            if matches[i] {
                print(numbers[i])
            }
        }
    }
}
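For comparison, here is a minimal sketch of what a scalar equivalent of the loop above might look like. It is not the actual scalar version the timings refer to; the names `powers` and `isMunchausenScalar` and the small demonstration limit are made up for illustration.

```swift
// Hypothetical scalar baseline (illustrative names, not the code that was benchmarked).
// powers[d] holds d^d for each decimal digit d, with 0^0 taken as 0.
let powers: [UInt32] = [0, 1, 4, 27, 256, 3_125, 46_656, 823_543, 16_777_216, 387_420_489]

func isMunchausenScalar(_ number: UInt32) -> Bool {
    var n = number
    var total: UInt32 = 0
    // Stop once the digits run out or the running total exceeds the candidate.
    while n > 0 && total <= number {
        total &+= powers[Int(n % 10)]
        n /= 10
    }
    return total == number
}

// Small demonstration range; the listing above scans up to 440_000_000.
let limit: UInt32 = 10_000
let found = (0..<limit).filter { isMunchausenScalar($0) }
print(found)  // prints "[0, 1, 3435]"
```

The per-candidate logic is the same as in the four-wide version: the only structural difference is that the table lookup is a single indexed load here, whereas the SIMD version needs one lookup per lane.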
I was really surprised by how much I had to baby the compiler when using the SIMDn types. It feels like the optimiser takes a holiday as soon as it sees one. For example, I had to manually rearrange the vectorised conditionals in the inner loop to avoid a hefty performance loss. I suspect part of this is the inherent nature of SIMD (it's more sensitive than scalar code, because the instructions are intrinsically heavier), but still, it felt like the compiler wasn't entirely pulling its weight.
P.S. I also made an eight-wide version, but it performed nearly twice as badly. Surprisingly, I can't seem to find any tech specs for Apple's M2 indicating what it supports regarding NEON, vector widths, and so on. (Is NEON fixed at 128 bits? It looks that way from glancing at Arm's AArch64 docs, but that seems really surprising to me.)
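On the width question: as far as Arm's documentation describes it, the AArch64 Advanced SIMD (NEON) vector registers are 128 bits wide. A quick way to see how the standard library types compare with that is to print their sizes. The sketch below only reports type sizes; it says nothing about how the compiler actually lowers the eight-wide code, and the constant names are made up for illustration.

```swift
// Sizes of the standard library SIMD types, in bytes.
let fourWideBytes = MemoryLayout<SIMD4<UInt32>>.size   // 4 lanes x 4 bytes
let eightWideBytes = MemoryLayout<SIMD8<UInt32>>.size  // 8 lanes x 4 bytes

// Convert to bits for comparison with a 128-bit vector register.
print(fourWideBytes * 8, eightWideBytes * 8)  // prints "128 256"
```

So a `SIMD4<UInt32>` fits in one 128-bit register, while a `SIMD8<UInt32>` presumably has to be split across two, which may be part of why the eight-wide version did not help.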