I was very excited to hear about the Static Linux SDK, particularly how it enables building Swift applications that run on any Linux machine. This matters to me because I'm currently forced to use Python for my PhD work: the (super)computers I have access to are RHEL 8-based, Swift doesn't seem to be supported on that distro, and I don't think I have permission to install new software without going through heaps of bureaucracy. Now that the Static Linux SDK is available, I can finally use Swift for my work, but there seems to be a performance penalty with musl: heavily multithreaded programs perform quite poorly. In my case, which I unfortunately can't share, the slowdown seems to be ~10x. The programs I run are embarrassingly parallel, so the cost isn't in synchronizing between threads. It turns out that the culprit is musl's allocator.
Consider the following (rather contrived, modified from the original Rust version) example:
import Foundation
import Dispatch
import Synchronization

func versionOne() -> (Duration, Duration) {
    let count = 5_000_000
    // Dictionary keyed by 32-element arrays, so every lookup below
    // has to build (and allocate) a fresh key.
    let map: [[Int]: Int] = {
        var _map: [[Int]: Int] = [:]
        for i in 0..<count {
            let key = [Int](repeating: i, count: 32)
            _map[key] = i
        }
        return _map
    }()

    var singleThreadedSum = 0
    let singleThreadedTime = ContinuousClock().measure {
        for i in 0..<count {
            let key = [Int](repeating: i, count: 32)
            singleThreadedSum += map[key] ?? 0
        }
    }

    let threads = ProcessInfo.processInfo.activeProcessorCount
    let multiThreadedSum = Atomic<Int>(0)
    let multiThreadedTime = ContinuousClock().measure {
        DispatchQueue.concurrentPerform(iterations: threads) { index in
            // Each worker handles its own slice of 0..<count
            // (any remainder of count / threads is skipped).
            let chunk = count / threads
            let subrange = (index * chunk)..<((index + 1) * chunk)
            var partialSum = 0
            for i in subrange {
                let key = [Int](repeating: i, count: 32)
                partialSum += map[key] ?? 0
            }
            multiThreadedSum.add(partialSum, ordering: .relaxed)
        }
    }
    return (singleThreadedTime, multiThreadedTime)
}

// Average the timings over a few runs.
var singleThreadedAverageTime = Duration.zero
var multiThreadedAverageTime = Duration.zero
let iterations = 5
for _ in 0..<iterations {
    let (singleThreadedTime, multiThreadedTime) = versionOne()
    singleThreadedAverageTime += singleThreadedTime / iterations
    multiThreadedAverageTime += multiThreadedTime / iterations
}
print("Version 1: Single threaded sum with:", singleThreadedAverageTime)
print("Version 1: Multi threaded sum with:", multiThreadedAverageTime)
This gives me the following output when running natively on the distro (Ubuntu 22.04, 16-core machine) with swift run -c release:
Version 1: Single threaded sum with: 3.107581717 seconds
Version 1: Multi threaded sum with: 0.344443681 seconds
and when compiled using the Static Linux SDK (swift build -c release -Xlinker -strip-all --swift-sdk x86_64-swift-linux-musl):
Version 1: Single threaded sum with: 3.2390356336 seconds
Version 1: Multi threaded sum with: 2.8119309522 seconds
As can be seen, with musl the multithreaded version is barely faster than the single-threaded one (about 1.15x), whereas the native build gets a roughly 9x speedup on the same 16-core machine. Since the native build shows that it is possible to run faster, I don't think this result is just an artifact of the contrived example. If I change it to the following (as they did in the Rust issue)
func versionTwo() -> (Duration, Duration) {
    let count = 5_000_000
    // Same benchmark with plain Int keys: no key array is built per lookup.
    let map: [Int: Int] = {
        var _map: [Int: Int] = [:]
        for i in 0..<count {
            _map[i] = i
        }
        return _map
    }()

    var singleThreadedSum = 0
    let singleThreadedTime = ContinuousClock().measure {
        for i in 0..<count {
            singleThreadedSum += map[i] ?? 0
        }
    }

    let threads = ProcessInfo.processInfo.activeProcessorCount
    let multiThreadedSum = Atomic<Int>(0)
    let multiThreadedTime = ContinuousClock().measure {
        DispatchQueue.concurrentPerform(iterations: threads) { index in
            // Each worker handles its own slice of 0..<count
            // (any remainder of count / threads is skipped).
            let chunk = count / threads
            let subrange = (index * chunk)..<((index + 1) * chunk)
            var partialSum = 0
            for i in subrange {
                partialSum += map[i] ?? 0
            }
            multiThreadedSum.add(partialSum, ordering: .relaxed)
        }
    }
    return (singleThreadedTime, multiThreadedTime)
}

// Reset the accumulators declared for version 1 instead of redeclaring them.
singleThreadedAverageTime = .zero
multiThreadedAverageTime = .zero
for _ in 0..<iterations {
    let (singleThreadedTime, multiThreadedTime) = versionTwo()
    singleThreadedAverageTime += singleThreadedTime / iterations
    multiThreadedAverageTime += multiThreadedTime / iterations
}
print("Version 2: Single threaded sum with:", singleThreadedAverageTime)
print("Version 2: Multi threaded sum with:", multiThreadedAverageTime)
This now prints when run natively
Version 2: Single threaded sum with: 0.5063501642 seconds
Version 2: Multi threaded sum with: 0.0600251544 seconds
and with musl
Version 2: Single threaded sum with: 0.6159802106 seconds
Version 2: Multi threaded sum with: 0.059309451 seconds
The difference between versionOne() and versionTwo() is that the latter makes far fewer allocations (calls to malloc), which leads me to the same conclusion: the musl allocator is the bottleneck here.
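versionTwo() changes the key type as well as the number of allocations, so hashing gets cheaper too. One way to separate the two effects would be to keep the [Int] keys but let each worker allocate its key array once and overwrite it in place. The following is only a sketch under assumptions, not something I have timed: lookupSumReusingKey and its parameters are hypothetical names, and it assumes count is divisible by workers.

```swift
import Dispatch
import Synchronization

/// Hypothetical helper (not part of the benchmark above): performs the same
/// lookups as versionOne(), but each worker allocates its 32-element key once
/// and overwrites it in place instead of building a new array per lookup.
func lookupSumReusingKey(map: [[Int]: Int], count: Int, workers: Int) -> Int {
    let total = Atomic<Int>(0)
    let chunk = count / workers  // assumption: count is divisible by workers
    DispatchQueue.concurrentPerform(iterations: workers) { index in
        var key = [Int](repeating: 0, count: 32)  // one allocation per worker
        var partialSum = 0
        for i in (index * chunk)..<((index + 1) * chunk) {
            // The array is uniquely referenced, so this mutates it in place.
            for slot in key.indices { key[slot] = i }
            partialSum += map[key] ?? 0
        }
        total.add(partialSum, ordering: .relaxed)
    }
    return total.load(ordering: .relaxed)
}

// Small usage example: keys [i, i, ..., i] map to i, as in versionOne().
var smallMap: [[Int]: Int] = [:]
for i in 0..<8 { smallMap[[Int](repeating: i, count: 32)] = i }
let sum = lookupSumReusingKey(map: smallMap, count: 8, workers: 2)
print(sum)  // 0 + 1 + ... + 7 = 28
```

If the allocator really is the bottleneck, I would expect a variant like this to scale under musl much like versionTwo() does, even though the keys are still arrays.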
Now, since I can't leave this conclusion unproven, and I'm determined to use Swift on those damn supercomputers, I found an article (for Rust again) where they did some static library stitching and massaging to patch musl so that it uses mimalloc instead of its own allocator. I followed those steps, with some necessary changes, to replace the musl allocator in the Static Linux SDK with mimalloc.
With the patched SDK, the musl build now prints
Version 1: Single threaded sum with: 2.6381146422 seconds
Version 1: Multi threaded sum with: 0.2622337284 seconds
Version 2: Single threaded sum with: 0.6310474136 seconds
Version 2: Multi threaded sum with: 0.0542707918 seconds
So I think this shows that the allocator indeed is the cause of the slowdown.
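To make the comparison concrete, the multithreaded speedup for version 1 can be computed directly from the timings printed above. This is plain arithmetic on the reported numbers; the names timings and speedup are mine and purely illustrative.

```swift
// Version 1 timings in seconds (single-threaded, multithreaded),
// copied from the outputs above.
let timings: [(build: String, single: Double, multi: Double)] = [
    ("native", 3.107581717, 0.344443681),
    ("musl", 3.2390356336, 2.8119309522),
    ("musl + mimalloc", 2.6381146422, 0.2622337284),
]

/// Speedup of the multithreaded run over the single-threaded run.
func speedup(single: Double, multi: Double) -> Double {
    single / multi
}

for t in timings {
    let s = speedup(single: t.single, multi: t.multi)
    // Round to one decimal place for printing.
    print(t.build, (s * 10).rounded() / 10)
}
```

That works out to a speedup of roughly 9x natively, about 1.15x with the stock musl allocator, and roughly 10x once mimalloc is swapped in.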
Now after all this, my questions are:
- Is there a more standard and less error-prone way of switching from the musl allocator to a different one, instead of this arcane static-library-object-file-removal-and-insertion-surgery-magic?
- Could the Swift project consider doing something similar for the allocator in the SDK itself, so that users wouldn't need to replace the default one themselves?
For those interested, this is the script I used to replace the allocator.
patch_musl.sh
#!/bin/bash
swiftStaticSDKArchiveFile=$1
swiftStaticSDKArchiveFolder="${swiftStaticSDKArchiveFile%.tar.gz}"
echo "Unarchiving SDK tar ball"
tar -xzf "$swiftStaticSDKArchiveFile"
LIBC_PATH=$(find "./$swiftStaticSDKArchiveFolder" -type f -name libc.a -path "*/x86_64/*" -exec realpath {} \;)
echo "Building mimalloc static library"
git clone https://github.com/microsoft/mimalloc
cd mimalloc
git checkout tags/v2.1.7
git apply ../mimalloc_v2.1.7.diff
cmake -Bout -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DMI_BUILD_SHARED=OFF -DMI_BUILD_OBJECT=OFF -DMI_BUILD_TESTS=OFF .
cmake --build out
cd out
mkdir libs
cd libs
cp ../libmimalloc.a libmimalloc.a
cp "$LIBC_PATH" libc.a
echo "Patching musl libc.a"
llvm-ar -d libc.a aligned_alloc.lo
llvm-ar -d libc.a calloc.lo
llvm-ar -d libc.a donate.lo
# musl's libc.a contains two members named free.lo, so it has to be deleted twice
llvm-ar -d libc.a free.lo
llvm-ar -d libc.a free.lo
llvm-ar -d libc.a libc_calloc.lo
llvm-ar -d libc.a lite_malloc.lo
llvm-ar -d libc.a malloc.lo
llvm-ar -d libc.a malloc_usable_size.lo
llvm-ar -d libc.a memalign.lo
llvm-ar -d libc.a posix_memalign.lo
# musl's libc.a contains two members named realloc.lo, so it has to be deleted twice
llvm-ar -d libc.a realloc.lo
llvm-ar -d libc.a realloc.lo
llvm-ar -d libc.a reallocarray.lo
llvm-ar -d libc.a valloc.lo
llvm-ar -d libc.a strdup.lo
llvm-ar -d libc.a strndup.lo
# Check that the object files were actually removed (each grep should print nothing)
llvm-ar -t libc.a | grep aligned_alloc.lo
llvm-ar -t libc.a | grep calloc.lo
llvm-ar -t libc.a | grep donate.lo
llvm-ar -t libc.a | grep free.lo
llvm-ar -t libc.a | grep libc_calloc.lo
llvm-ar -t libc.a | grep lite_malloc.lo
llvm-ar -t libc.a | grep malloc.lo
llvm-ar -t libc.a | grep malloc_usable_size.lo
llvm-ar -t libc.a | grep memalign.lo
llvm-ar -t libc.a | grep posix_memalign.lo
llvm-ar -t libc.a | grep realloc.lo
llvm-ar -t libc.a | grep reallocarray.lo
llvm-ar -t libc.a | grep valloc.lo
llvm-ar -t libc.a | grep strdup.lo
llvm-ar -t libc.a | grep strndup.lo
llvm-ar -x libmimalloc.a
llvm-ar -r libc.a *.o
cp libc.a "$LIBC_PATH"
cd ..
cd ..
cd ..
rm -fr mimalloc
echo "Archiving new tarball into ./new_sdk_archive"
mkdir new_sdk_archive
tar -czf new_archive.tar.gz "$swiftStaticSDKArchiveFolder"
mv new_archive.tar.gz new_sdk_archive
cd new_sdk_archive
mv new_archive.tar.gz "$swiftStaticSDKArchiveFile"
cd ..
echo "Archived!"
rm -fr "$swiftStaticSDKArchiveFolder"
echo "Done!"
mimalloc_v2.1.7.diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index bcfe91d8..a5473c69 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -481,7 +481,6 @@ endif()
# static library
if (MI_BUILD_STATIC)
add_library(mimalloc-static STATIC ${mi_sources})
- set_property(TARGET mimalloc-static PROPERTY POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(mimalloc-static PRIVATE ${mi_defines} MI_STATIC_LIB)
target_compile_options(mimalloc-static PRIVATE ${mi_cflags} ${mi_cflags_static})
target_link_libraries(mimalloc-static PRIVATE ${mi_libraries})
diff --git a/src/alloc-override.c b/src/alloc-override.c
index 12837cdd..e5bcda22 100644
--- a/src/alloc-override.c
+++ b/src/alloc-override.c
@@ -191,7 +191,7 @@ typedef void* mi_nothrow_t;
void* operator new[](std::size_t n, std::align_val_t al, const std::nothrow_t&) noexcept { return mi_new_aligned_nothrow(n, static_cast<size_t>(al)); }
#endif
-#elif (defined(__GNUC__) || defined(__clang__))
+#elif (defined(__GNUC__) || defined(__clang__) || defined(do_we_need_this))
// ------------------------------------------------------
// Override by defining the mangled C++ names of the operators (as
// used by GCC and CLang).
@@ -289,7 +289,7 @@ mi_decl_weak int reallocarr(void* p, size_t count, size_t size) { return mi_r
void __libc_free(void* p) MI_FORWARD0(mi_free, p)
void* __libc_memalign(size_t alignment, size_t size) { return mi_memalign(alignment, size); }
-#elif defined(__GLIBC__) && defined(__linux__)
+#elif defined(__linux__) //defined(__GLIBC__) && defined(__linux__)
// forward __libc interface (needed for glibc-based Linux distributions)
void* __libc_malloc(size_t size) MI_FORWARD1(mi_malloc,size)
void* __libc_calloc(size_t count, size_t size) MI_FORWARD2(mi_calloc,count,size)
For example, let's say you have downloaded the latest SDK archive
swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz
To apply the patch, with patch_musl.sh and mimalloc_v2.1.7.diff in the same directory as the archive, you would run
./patch_musl.sh swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz
It will create a new archive at ./new_sdk_archive/swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz. Then you would install it the usual way
swift sdk install ./new_sdk_archive/swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz
Note that this script only works on a Linux machine. Please also note that I don't have a deep understanding of what I'm doing (the script is very basic), so you should not use this in production.