NOTE This is related to the compiler and not the runtime.
As we start to look at compiler performance on Windows, one thing that comes up is the idea of a minimum development environment. Beyond the OS-level requirements, requiring a baseline CPU would be helpful. In particular, assuming a baseline of the Haswell uarch would be useful: it is pretty close to the baseline support required for Swift on Windows and generally matches the Windows 11 requirements, so it doesn't seem too restrictive.
However, I am also tempted to suggest that we require AVX2. This is potentially a bit more controversial: the requirement goes a bit beyond the Windows 11 requirements and excludes some CPUs that Windows 11 supports (e.g. the Celeron N6210).
The benefit of raising this requirement is that we could enable additional optimizations for the compiler, e.g. additional SIMD extensions.
I would like to hear whether this restriction would be too limiting for users of the toolchain on Windows.
Haswell is over a decade old, and I'd personally be fine with requiring it as a baseline—I can't imagine trying to compile Swift code on an 11-year-old CPU is going to be very fun.
The Celeron N6210, on the other hand, was released in 2021. I have no idea how many units Intel sold, but there are lots of people out there with 4-year-old machines that can otherwise run Windows 11 and the Swift toolchain without issue.
Therein lies the issue: there is a small set of CPUs that Windows 11 supports but that do not support AVX2, despite being somewhat recent. Granted, I'd suggest that they are all underpowered.
Not every developer can afford a high-end workstation. Some folks only have the budget for an entry-level machine, and I would hate to exclude them from our community just because of an oddball cost-cutting measure on Intel's part (one that the average developer might not even know about!).
I have the same reservations. To be clear, I am not suggesting that we change the requirements to have a high-end workstation. I am suggesting that we require CPUs that are not ultra-low end. Intel had a series of IoT edge-compute CPUs and some ultra-low-end CPU SKUs that retroactively removed features that have been shipping since 2013.
AVX2 has been available since Haswell (2013). The problem is that in 2021 Intel released a set of Pentium Gold, Pentium Silver, and Celeron CPUs that are based on a newer uarch but are missing ISA features (including AVX2).
Supporting these CPUs is going to come at a cost of:
increased complexity in the build (incurred by the Swift project)
possibly increased build times (incurred by the Swift project)
increased complexity in the distribution (incurred by the user and the Swift project)
Given that there have been multiple complaints about Windows toolchain performance, one option might be to be more transparent and state that the Windows toolchains are optimized for maximal compatibility rather than for compiler performance. At that point, perhaps it makes sense to have two separate builds, one for compatibility and one for performance, that users can select between?
To be more explicit about the suggested changes: what I was exploring was building mimalloc via CMake rather than MSBuild/Visual Studio.
The switch to CMake would enable Microsoft's recommended optimizations for X86 targets:
Assume a uarch baseline of Haswell
Enable AVX2
However, doing that also allows us to reduce the complexity in the build system. More importantly, we then gain control over enabling some other behavioural features:
fixed TLS slot (avoids a TLS lookup)
enabling additional SIMD usage
enabling a mode where the program termination might speed up (this was one of the tricks that mold uses for perceived performance)
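For reference, a sketch of what such a configure step could look like. The option names are taken from recent mimalloc releases (`MI_OPT_ARCH` for the x64 uarch/AVX2 baseline, `MI_WIN_USE_FIXED_TLS` for the fixed TLS slot) and may differ in the version actually vendored by the toolchain, so treat this as illustrative rather than exact:

```shell
# Hypothetical CMake configure step for mimalloc; option names follow
# recent mimalloc releases and should be verified against the version in use.
# MI_OPT_ARCH: on x64, assume a Haswell baseline and enable AVX2 code paths.
# MI_WIN_USE_FIXED_TLS: use a fixed TLS slot on Windows, avoiding a TLS lookup.
cmake -S mimalloc -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DMI_OPT_ARCH=ON \
  -DMI_WIN_USE_FIXED_TLS=ON
cmake --build build
```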
With the current version of the mimalloc changes, we saw a ~4% throughput improvement on Windows builds. I have a feeling that the additional changes here could yield another 0.5%-1% improvement.
@hjyamauchi already has identified another ~1.5% improvements to the compiler, so while the overall improvements here may seem small, collectively, they will add up.
For me personally, this is not really an issue. Windows is so bloated that I wouldn't recommend any student use it for development on old machines.
Just last month, I refurbished a few old Windows PCs with 4th and 6th generation Intel CPUs, and while they were pretty much unusable with Windows 10, they still work fine with a current version of Ubuntu.
For what it's worth, if we adopted such a change on Windows, I don't see why we wouldn't also adopt it on Linux—which would preclude using those older chips on Linux as well.
Haswell is probably okay, but AVX2 I'm less keen on. AVX2 instructions are often not present in VMs, for whatever reason. With any of these newer instructions, I'd be happier seeing the performance improvement numbers. How much faster is an AVX2 enabled compiler?
Oh, this is interesting! I would've expected AVX2 to be available, less so AVX512. AVX2 has been around since 2013, and the one instruction that comes up often is the faster bit scanning operations.
Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions. If you include these newer instructions in your code, execute them only after verifying that they are available. For example, to determine if AVX512 vector instructions are available, use the sysctlbyname function to check the hw.optional.avx512f attribute.
(Although the Internet tells me more recent versions of Rosetta 2 do support it. Still looking for any official source on that.)
Rosetta 2 and VMs not supporting AVX2 are a good argument against enabling it currently, I think, even if it were to give a reasonable speedup. Having a more optimized version is still interesting, but I don't know how to support that without a significant cost in build times, download times, and complexity in testing/maintaining the toolchain.
I spoke with a colleague and they let me know that as of macOS Sequoia, Rosetta 2 does support AVX and AVX2, but not AVX512. I've let our documentation team know to update that article (143910888).
I think people running low-power, low-cost, often old Linux servers at home is way more common. For example, I have one with a J4105, and I also make slight modifications to Docker images from time to time and rebuild them. Whether a compile takes 10 seconds or 5 minutes doesn't really matter if it only happens occasionally.
It's just a data point. Data points are good to have. We unfortunately don't have a lot of them to tell us how many developers or prospective developers might be impacted by the proposed change.
If there are CI systems out there using virtualization or emulation that does not support AVX, that would be a problem. @BlueSparrow pointed this out and I shared the (tangentially) relevant data I had available.
It is a valid point of comparison. However, it can be a concern: what if we encounter an x64-specific issue and the developer needs to run the toolchain under Rosetta because that is the hardware/environment on hand?
Until quite recently, the ARM64 toolchain was broken again due to insufficient testing. While the underlying testing issue hasn't been resolved, the x64 toolchain did provide a temporary stopgap.
Just to reiterate, the point of this conversation was to collect data points, as @grynspan points out. We are trying to judge whether this tradeoff is justified based on the data that is available.
Breaking CI systems is a real concern that @BlueSparrow pointed out - one that had been missed. This type of feedback is valuable and helpful in making decisions like this which have a broad impact.
Most of the core compiler contributors would be in that situation when trying to reproduce an x86-specific bug report, since most of their primary day-to-day work machines are Apple silicon.
(Happily Rosetta supports AVX2 now, as mentioned upthread)