Swift and Apple Silicon symbiosis?

I recently re-read C Is Not a Low-level Language, in which the author makes the point that modern hardware isn't the abstract machine we think of when writing C code, and implies that the industry would design hardware differently if they didn't have to ensure that C code runs fast on it.

As Apple famously controls both software and hardware, they'd be in a position to break this cycle. The article predates Apple's M-series chips, which makes me wonder:

  • Is there anything in Apple's CPUs that optimizes Swift-specific things (ARC atomics, objc_msgSend, …)?
  • Is there anything in the Swift compiler that optimizes for Apple Silicon chips specifically?
6 Likes

While you'll hopefully get more direct answers from the compiler team members that visit these forums, particularly regarding specific examples (e.g. improved latency for cache line sharing re. reference counts), a few broader or tangential thoughts:

Old man yells at cloud

I wouldn't get too excited about that ACM article - it's a little misguided with its complaints and conclusions. Mostly it's just lamenting that there exists real-world work which is inherently serial and/or branchy.

The architectural debate it's fighting (brainiacs vs speed-demons, and the related brawny vs wimpy) is decades old and largely settled by now, particularly because Apple's ARM core designs bucked convention by going heavily "brainiac" in exactly the space where conventional wisdom said that was insane ("mobile"), and frankly embarrassed the rest of the industry with their resulting real-world performance and efficiency.

It also points out that C is an unsafe language (in Swift's sense of the word), which is a surprise to nobody ever. It sounds like the author just really doesn't like C, for their own personal reasons.

Compiler codegen optimisation

A lot of codegen optimisations are implemented by the CPU vendors - or by the architecture vendor, in ARM's case. So a lot of performance improvements for Swift are inherited generically through improvements in LLVM, many of which come from Arm (the company) or even other ARM licensees (e.g. Ampere). For example, Swift's ARC is ultimately just atomic compare-and-swaps (or equivalent read-modify-write operations), a more fundamental pattern that is performance-significant to a lot of code, so it gets plenty of focus in both compilers and hardware design irrespective of Swift's influence.
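To make that concrete, here's a toy sketch of the pattern ARC leans on, written against the swift-atomics package. The ToyRefCount/retain/release names are mine and the real runtime's refcounting (inline counts, side tables, swift_retain/swift_release fast paths) is far more sophisticated; the point is just that what the hardware sees is an uncontended atomic read-modify-write on a word next to the object:

```swift
import Atomics

// Toy model of the memory operations an ARC retain/release pair boils down to.
// Illustrative only; not how the Swift runtime is actually implemented.
final class ToyRefCount {
    // Strong count starts at 1 for the initial reference.
    private let strong = ManagedAtomic<Int>(1)

    func retain() {
        // Conceptually what swift_retain does: an atomic increment.
        strong.wrappingIncrement(by: 1, ordering: .relaxed)
    }

    func release() -> Bool {
        // Atomic decrement; the thread that drops the count to zero
        // would be responsible for running deinit and freeing the object.
        let old = strong.loadThenWrappingDecrement(by: 1, ordering: .acquiringAndReleasing)
        return old == 1
    }
}
```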

Sadly (for us on Apple platforms), Arm seem to still be putting a lot of effort into optimising ARM codegen in gcc (e.g. New performance features and improvements in GCC 12), which is essentially a waste of time for us. You can't really fault them for that - they have to follow the world's compiler choices, and gcc is still widely used (especially in certain domains that Arm has been heavily courting in recent years, like HPC).

Nonetheless, Arm have been steadily transitioning their focus to LLVM over the last decade, driven in no small part by Apple and some notable hyperscalers (e.g. Google). See for example What is new in LLVM 15 and What is new in LLVM 16. Their official compiler toolchains, for example, are now based on Clang+LLVM rather than gcc, although that is pretty recent and even some people at Arm haven't caught up with that news. (They still maintain a GCC toolchain too, for whatever reason.)

So we can look forward to increasing focus on LLVM from Arm, which will benefit Swift.

ARM is bigger than Swift

One of the dominant points of the ARM architectures (if not the dominant point) is that they are enforced across all implementations. Arm certainly won't provide you any reference implementations that aren't fully compliant with the relevant ARM architecture, nor can you even buy an ARM architecture license and then make a non-compliant implementation (or at least, not publicly - what you do in the privacy of your own datacentre is your business, I suppose). Compliance means implementing the ISA as-is, without random implementation-specific instructions or the like.

This is (IMO) a big blessing, since it means portable binaries, reusable codegen in compilers, and even more broadly predictable semantics up in the higher-level languages themselves.

But it does mean you can't do some hardware optimisations that you can with e.g. RISC-V, which distinguishes itself from ARM by saying "go nuts" re. architecture and [lack of concern for] compatibility. Arm wouldn't allow an "objc_msgSend" instruction to be added to any implementation, for example (unless they chose to incorporate that into their actual architecture, which they won't - I think Jazelle burnt them too badly, among other reasons). If, for argument's sake, Apple were using RISC-V (or a fully proprietary architecture of their own), then they could.

It's interesting to think about truly Swift-customised CPUs, and there certainly is some precedent - the aforementioned Jazelle, among several other attempts at "Java hardware" - but it seems highly unlikely. Though more targeted architecture additions are possible - see for example the ARMv8.3 'enhancement' that added instructions specifically for dealing with JavaScript's horrific numerics system.
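For a sense of what that ARMv8.3 addition (FJCVTZS) is accelerating, here's my rough approximation of JavaScript's ToInt32 rule written in plain Swift - truncate toward zero, wrap modulo 2^32, map NaN and infinities to zero. This helper is purely illustrative; neither Swift nor the architecture defines it:

```swift
// Approximation of ECMAScript's ToInt32, the conversion FJCVTZS does in hardware:
// truncate toward zero, wrap modulo 2^32, reinterpret as signed;
// NaN and ±infinity become 0.
func jsToInt32(_ x: Double) -> Int32 {
    guard x.isFinite else { return 0 }
    let truncated = x.rounded(.towardZero)
    // Reduce into [0, 2^32); truncatingRemainder is exact for doubles.
    var wrapped = truncated.truncatingRemainder(dividingBy: 4294967296.0)
    if wrapped < 0 { wrapped += 4294967296.0 }
    return Int32(bitPattern: UInt32(wrapped))
}

// jsToInt32(4294967296.0) == 0, jsToInt32(-1.5) == -1, jsToInt32(2147483648.0) == Int32.min
```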

15 Likes

A short note from me, who is not (!) an expert in these things; maybe someone from the compiler team can clarify?

To my understanding, Swift first relied a lot on what was already possible as optimizations in LLVM (Chris Lattner once said that Swift is syntactic sugar for LLVM, which might be a bit of an exaggeration). Many of the optimizations for Swift are specific to the platform (OS + architecture), and at least some of them are newly implemented as part of LLVM; LLVM now even contains optimizations specific to Swift. The goal for Swift is that even complex code should be optimizable quite far, and this is why many changes to the language are far from trivial.

So I would say it is more the reverse: the development of Swift itself is guided by what is possible with the given platforms. But yes, it would be interesting to know if the development of Apple Silicon is now somehow affected by the development of Swift (I haven't heard of anything like that yet).

I've worked with David for many years and generally respect his opinion. His central thesis seems to be that we've gotten too preoccupied with scalar code performance when we ought to be focusing on vector programs for vector processors. I can't quite imagine how to apply that to the programming world I know, so it seems impolite to respond.

Swift doesn't rely on any special low-level operations that you wouldn't see in a lot of other programs. Neither objc_msgSend nor ARC are Swift-specific. Swift and Objective-C both do a lot of atomic reference-counting, so they both benefit from processors that optimize uncontended atomics. However, I'd say there's growing recognition among all processor designers that uncontended atomics are important; it's not Apple-specific at all. There's nothing really surprising here.

This is a broader question than you might think. On a (somewhat) high level, porting Swift to run on AArch64 included designing a calling convention for it, and that is in some sense "optimizing for Apple Silicon". On a low level, LLVM can tune instruction scheduling for the capabilities and timings of specific CPUs, which of course is microarchitecture-specific. But if you mean, does Swift ever do things drastically differently when it knows it's running on Apple Silicon? No, I can't think of anything like that.

19 Likes

I would not say that is his central thesis. I had never read that article till now, but I would say this is his thesis:

"A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success."

Given Wade's point above about the constraints of the ARMv8/9 ISA, maybe that is not possible for Apple Silicon - I don't know - but Apple is certainly slowly removing the C constraint with Swift. I would hope someone would experiment with new hardware like this with these new languages, given Chris's talk about what is possible these days.

1 Like

I've occasionally joked that the indirect branch predictor is actually the "ObjC unit" (because objc_msgSend) and the reorder buffer and regular branch predictor are the "Swift units" (because bounds and overflow checking), so in that sense every modern CPU optimizes for these two languages :wink:
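To put a tiny bit of code behind the joke: both the subscript and the addition below carry safety checks that Swift emits by default, and each check is just a compare plus a conditional branch to a trap block - exactly the almost-never-taken branches that modern predictors and deep reorder buffers absorb essentially for free. (A toy example of my own, obviously.)

```swift
// Each iteration performs two default safety checks:
//   xs[i]         -> bounds check: trap if i is outside xs.indices
//   total += ...  -> overflow check: trap if the Int addition overflows
// Both lower to a compare and a conditional branch.
func sum(_ xs: [Int]) -> Int {
    var total = 0
    for i in 0..<xs.count {
        total += xs[i]
    }
    return total
}
```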

12 Likes

It would be interesting to see what high-level abstractions could be introduced to the language that make it harder to write code that is suboptimal from a hardware perspective.
For example, provide a guarantee that actors are always allocated in a separate memory page.

The ARM ISA is quite modular - e.g. you don't have to include SVE, or SME, or authenticated pointers, or memcpy/memset instructions, etc. - the same as any evolving architecture, like x86. It just enforces more of a baseline than RISC-V (or a fully custom architecture) does. E.g. with ARMv8+ you basically have to include NEON, irrespective of whether you want SIMD (or an FPU at all).

One can build for a wide range of applications - from relatively tiny embedded control processors up to vector supercomputer processors (e.g. Fujitsu's A64FX (technical details)).

So Apple do have a lot of room in which to play (not to mention that they have a direct line to all of Arm's architects and leadership, so Apple can pretty easily get whatever they want if they really want it).

But keep in mind that Swift is still very close to C, in the big picture. It's still fundamentally procedural (functional sugar notwithstanding), and a far cry even from Haskell or Curry, let alone Datalog or VHDL. So dramatic hardware architecture changes may be unwarranted, for the CPU microarchitecture at least. As Apple (among others) has demonstrated, thinking outside the CPU core seems more fruitful, e.g. on-package DRAM, a unified xPU memory space, etc.

Of course the problem is then what are you going to do with the other 16,368 bytes of the page, for your actor with just one member variable? :slightly_smiling_face:

It's not memory page sharing you care about anyway; it's cache line sharing. So 64-byte granularity. But it's a good question - does Swift take any pains re. memory layout for mutable state, in actors or more generally? Does it reorder member variables to group mutable ones, let alone ensure they get their own lines?

From my tests so far - albeit with structs, not actors - Swift never rearranges or varies the padding of data structures (SROA aside).

So I'm guessing not [today], because it seems like the conservative default is to minimise memory footprint, which is at odds with the sometimes substantial padding required for cache line (let alone memory page) alignment…?
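For the curious, the declaration-order layout is easy to observe with MemoryLayout: reordering fields by hand changes the padding, and Swift won't do it for you. The sizes in the comments are what I see on 64-bit Apple platforms; layout of non-@frozen types is not a language-level guarantee:

```swift
// Field order determines padding: Swift does not reorder stored properties.
struct Padded {          // a @ 0, 7 bytes padding, b @ 8, c @ 16
    var a: UInt8
    var b: UInt64
    var c: UInt8
}

struct Packed {          // b @ 0, a @ 8, c @ 9
    var b: UInt64
    var a: UInt8
    var c: UInt8
}

print(MemoryLayout<Padded>.size, MemoryLayout<Padded>.stride)   // 17 24
print(MemoryLayout<Packed>.size, MemoryLayout<Packed>.stride)   // 10 16
```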

or 128B (pretty common for outer caches) or 256B (Fujitsu's ARM64 implementation) or 32B (Cortex-A7 and A9, IIRC, and maybe some arm64 designs as well).

64B is the norm on x86, but it's by no means universal outside of x86, and cacheline size is not architectural on most platforms, so trying to do this sort of layout "right" is pretty subtle.

1 Like

Indeed. I said 64 not because it's universal but because it's the cache line size on all of Apple's Ax and Mx designs since at least the A7 (according to LLVM). Combined with 64 bytes being the line size on all relevant Intel & AMD CPUs (as far as I'm aware), for Swift's purposes 64 bytes is an excellent rule of thumb. The compiler would of course use the actual cache line size of the target microarchitecture, in any case. Larger line sizes are thankfully rare - the A64FX's 256 bytes is a real outlier.

Unless it changed in recent years, the smallest possible malloc allocation on macOS is 16 bytes (and all allocations are at least 16-byte aligned), so - tagged pointers aside - you can't have more than four heap-allocated objects sharing a cache line on Apple or x86 platforms. Since Swift never strips unused fields from types (as far as my testing has shown), you can "strongly encourage" unique cache lines by padding your sensitive type(s) to at least 49 bytes (including object header).
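A hedged sketch of that trick, under the assumptions above (16-byte object header, 16-byte malloc buckets, 64-byte lines). The HotCounter type is mine, and class_getInstanceSize (from the ObjC runtime, so Apple platforms only) is just a way to check the result:

```swift
import ObjectiveC   // for class_getInstanceSize (Apple platforms)

// 16-byte object header + 8-byte value + 40 bytes of deliberate padding = 64 bytes,
// so the allocation fills a whole 64-byte bucket by itself instead of potentially
// sharing a cache line with up to three other small objects.
final class HotCounter {
    var value: Int = 0
    private var pad: (Int64, Int64, Int64, Int64, Int64) = (0, 0, 0, 0, 0)
}

print(class_getInstanceSize(HotCounter.self))   // expect 64 on arm64/x86_64
```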

I'd love to see actual data on whether any of this matters, though. I vaguely recall seeing some real-world examples of false sharing over fifteen years ago, but even then I'm pretty sure it was exceedingly rare. The nature of performance bottlenecks changes over time, reflecting the increased overheads and inefficiencies of newer languages, libraries, and practices. So I suspect 'micro-optimisations' like this pale in relevance for Swift, compared to the impact of major frameworks (e.g. SwiftUI) and patterns (e.g. eager evaluation of map et al).

A bunch of those designs have 128B outer cachelines. Whether you want to use inner or outer cacheline size to avoid destructive interference (or guarantee constructive interference) is itself somewhat subtle. In any event, Swift doesn’t let you specify type alignment higher than 16B, so making these guarantees requires manual allocation. It’s definitely worth doing for some use cases, however.
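For those use cases, a minimal sketch of the manual-allocation approach, assuming a 64-byte line (the rule of thumb debated above) and assuming the runtime honours alignments above 16 in allocate(byteCount:alignment:) - the final print is there to verify that, and posix_memalign is the fallback if it doesn't:

```swift
// Give a hot, contended value its own cache line by over-aligning the
// allocation and leaving the rest of the line as deliberate padding.
let lineSize = 64   // rule-of-thumb line size, not architectural

let line = UnsafeMutableRawPointer.allocate(byteCount: lineSize, alignment: lineSize)
defer { line.deallocate() }

let hotCounter = line.initializeMemory(as: Int.self, repeating: 0, count: 1)
hotCounter.pointee += 1   // no unrelated allocation can fully share this line
print(Int(bitPattern: line) % lineSize)   // 0 if the alignment request was honoured
```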

It can get really important for highly-contended atomics. In such cases it also helps if the contending threads get assigned to the same processor group so they can share the physical cache instead of synchronizing.

I don't think not being a fully functional PL really matters, as non-functional languages have implemented features that enable parallelism: functional purity, immutable sharing of data, message passing, actor models, and so on. Instead, what we're discussing here is whether evolving hardware-software codesign can lead to much greater performance, while still being much safer than existing systems, by tying these language primitives better into new hardware primitives - hardware having changed a lot since C rose up, e.g. the dominance of multi-core over the last couple of decades.

I suspect we could do an order of magnitude or two better with such a design shift, which is why it's something I'd like to dig into myself someday, but that is just a hunch right now, as I have not looked into all the details yet.

1 Like