Performance penalty from the Static Linux SDK

I was very excited to hear about the Static Linux SDK, particularly how it enables building Swift applications that run on any Linux machine. This is especially important to me because I'm currently forced to use Python for my PhD work: the (super)computers I have access to are RHEL 8-based, Swift doesn't seem to be supported on that distro, and I don't have permission to install new software without going through heaps of bureaucracy. Now that the Static SDK is available I can finally use Swift for my work, but there seems to be a performance penalty in using musl: it exhibits quite poor performance in heavily multithreaded programs. In my case, which I unfortunately can't share, the slowdown is roughly 10x. The programs I run are embarrassingly parallel, so the cost isn't in synchronizing between threads. It turns out the culprit is musl's allocator.

Consider the following (rather contrived) example, adapted from the original Rust version:

import Foundation
import Dispatch
import Synchronization

func versionOne() -> (Duration, Duration) {
    let count = 5_000_000
    let map: [[Int]: Int] = {
        var _map: [[Int]: Int] = [:]
        for i in 0..<count {
            let key = [Int](repeating: i, count: 32)
            _map[key] = i
        }
        return _map
    }()
    var singleThreadedSum = 0
    let singleThreadedTime = ContinuousClock().measure {
        for i in 0..<count {
            let key = [Int](repeating: i, count: 32)
            singleThreadedSum += map[key] ?? 0
        }
    }
    
    let threads = ProcessInfo.processInfo.activeProcessorCount
    let multiThreadedSum = Atomic<Int>(0)
    let multiThreadedTime = ContinuousClock().measure {
        DispatchQueue.concurrentPerform(iterations: threads) { index in 
            // Give each worker its own slice of the key range.
            let subrange = (index * (count / threads))..<((index + 1) * (count / threads))
            var partialSum = 0
            for i in subrange {
                let key = [Int](repeating: i, count: 32)
                partialSum += map[key] ?? 0
            }
            multiThreadedSum.add(partialSum, ordering: .relaxed)
        }
    }
    return (singleThreadedTime, multiThreadedTime)
}

var singleThreadedAverageTime = Duration.zero
var multiThreadedAverageTime = Duration.zero

let iterations = 5

for _ in 0..<iterations {
    let (singleThreadedTime, multiThreadedTime) = versionOne()
    singleThreadedAverageTime += singleThreadedTime / iterations
    multiThreadedAverageTime += multiThreadedTime / iterations
}

print("Version 1: Single threaded sum with:", singleThreadedAverageTime)
print("Version 1: Multi threaded sum with:", multiThreadedAverageTime)

This gives me the following output when running natively on the distro (Ubuntu 22.04, 16-core machine) with swift run -c release:

Version 1: Single threaded sum with: 3.107581717 seconds
Version 1: Multi threaded sum with: 0.344443681 seconds

and when compiled using the Static SDK (swift build -c release -Xlinker -strip-all --swift-sdk x86_64-swift-linux-musl):

Version 1: Single threaded sum with: 3.2390356336 seconds
Version 1: Multi threaded sum with: 2.8119309522 seconds

As can be seen, the multi-threaded run is barely faster than the single-threaded one under musl, whereas the native build gets roughly a 9x speedup from the same code. Since the native run shows the machine can go much faster, I don't think this result is just an artifact of an odd example. If I change the code to the following (as they did in the Rust issue):

func versionTwo() -> (Duration, Duration) {
    let count = 5_000_000
    let map: [Int: Int] = {
        var _map: [Int: Int] = [:]
        for i in 0..<count {
            _map[i] = i
        }
        return _map
    }()
    var singleThreadedSum = 0
    let singleThreadedTime = ContinuousClock().measure {
        for i in 0..<count {
            singleThreadedSum += map[i] ?? 0
        }
    }
    
    let threads = ProcessInfo.processInfo.activeProcessorCount
    let multiThreadedSum = Atomic<Int>(0)
    let multiThreadedTime = ContinuousClock().measure {
        DispatchQueue.concurrentPerform(iterations: threads) { index in 
            // Give each worker its own slice of the key range.
            let subrange = (index * (count / threads))..<((index + 1) * (count / threads))
            var partialSum = 0
            for i in subrange {
                partialSum += map[i] ?? 0
            }
            multiThreadedSum.add(partialSum, ordering: .relaxed)
        }
    }
    return (singleThreadedTime, multiThreadedTime)
}

// Reset the accumulators for the second benchmark.
singleThreadedAverageTime = .zero
multiThreadedAverageTime = .zero
for _ in 0..<iterations {
    let (singleThreadedTime, multiThreadedTime) = versionTwo()
    singleThreadedAverageTime += singleThreadedTime / iterations
    multiThreadedAverageTime += multiThreadedTime / iterations
}
print("Version 2: Single threaded sum with:", singleThreadedAverageTime)
print("Version 2: Multi threaded sum with:", multiThreadedAverageTime)

This now prints when run natively

Version 2: Single threaded sum with: 0.5063501642 seconds
Version 2: Multi threaded sum with: 0.0600251544 seconds

and with musl

Version 2: Single threaded sum with: 0.6159802106 seconds
Version 2: Multi threaded sum with: 0.059309451 seconds

The difference between versionOne() and versionTwo() is that the latter performs far fewer allocations (calls to malloc), which brings me to the same conclusion: musl's allocator is the bottleneck here.
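
To make that concrete, here is a minimal sketch (mine, not part of the benchmark above) of the two lookup patterns: the array key forces a fresh heap allocation on every lookup, while the Int key does not.

// Sketch only: illustrates why versionOne() hits malloc on every lookup.
func lookupWithArrayKey(_ map: [[Int]: Int], _ i: Int) -> Int {
    let key = [Int](repeating: i, count: 32) // allocates a fresh 32-element buffer -> malloc
    return map[key] ?? 0
}

func lookupWithIntKey(_ map: [Int: Int], _ i: Int) -> Int {
    map[i] ?? 0 // no allocation; the Int is hashed directly
}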


Now, since I can't let this conclusion go unproven, and I'm determined to use Swift on those damn supercomputers, I found an article (again for Rust) where they did some static-library stitching and massaging to patch out the musl allocator and use mimalloc instead. I followed those steps and made the necessary changes to replace the musl allocator with mimalloc in the Static SDK as well.

With that patch in place, the musl build now prints:

Version 1: Single threaded sum with: 2.6381146422 seconds
Version 1: Multi threaded sum with: 0.2622337284 seconds

Version 2: Single threaded sum with: 0.6310474136 seconds
Version 2: Multi threaded sum with: 0.0542707918 seconds

So I think this shows that the allocator is indeed the cause of the slowdown.


Now after all this, my questions are:

  • Is there a more standard and less error-prone way of switching from the musl allocator to a different one, instead of this arcane static-library-object-file-removal-and-insertion-surgery-magic?
  • Can the Swift project consider doing something similar for the allocator such that the users wouldn't need to switch from the default one themselves?

For those interested, this is the script I used to replace the allocator.

patch_musl.sh
#!/bin/bash
swiftStaticSDKArchiveFile=$1
swiftStaticSDKArchiveFolder="${swiftStaticSDKArchiveFile%.tar.gz}"
echo "Unarchiving SDK tar ball"
tar -xzf $swiftStaticSDKArchiveFile
LIBC_PATH=$(find ./$swiftStaticSDKArchiveFolder -type f -name libc.a -path "*/x86_64/*" -exec realpath {} \;)
echo "Building mimalloc static library"
git clone https://github.com/microsoft/mimalloc
cd mimalloc
git checkout tags/v2.1.7
git apply ../mimalloc_v2.1.7.diff
cmake -Bout -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DMI_BUILD_SHARED=OFF -DMI_BUILD_OBJECT=OFF -DMI_BUILD_TESTS=OFF .
cmake --build out
cd out
mkdir libs
cd libs
cp ../libmimalloc.a libmimalloc.a
cp $LIBC_PATH libc.a

echo "Patching musl libc.a"
llvm-ar -d libc.a aligned_alloc.lo 
llvm-ar -d libc.a calloc.lo
llvm-ar -d libc.a donate.lo
# Musl libc.a contains two free.lo sections, so we need to remove it twice
llvm-ar -d libc.a free.lo
llvm-ar -d libc.a free.lo
llvm-ar -d libc.a libc_calloc.lo
llvm-ar -d libc.a lite_malloc.lo
llvm-ar -d libc.a malloc.lo
llvm-ar -d libc.a malloc_usable_size.lo
llvm-ar -d libc.a memalign.lo
llvm-ar -d libc.a posix_memalign.lo
# Musl libc.a contains two realloc.lo sections, so we need to remove it twice
llvm-ar -d libc.a realloc.lo
llvm-ar -d libc.a realloc.lo
llvm-ar -d libc.a reallocarray.lo
llvm-ar -d libc.a valloc.lo
llvm-ar -d libc.a strdup.lo
llvm-ar -d libc.a strndup.lo

# Check that the object files were actually removed (these greps should produce no output)
llvm-ar -t libc.a | grep aligned_alloc.lo 
llvm-ar -t libc.a | grep calloc.lo
llvm-ar -t libc.a | grep donate.lo
llvm-ar -t libc.a | grep free.lo
llvm-ar -t libc.a | grep libc_calloc.lo
llvm-ar -t libc.a | grep lite_malloc.lo
llvm-ar -t libc.a | grep malloc.lo
llvm-ar -t libc.a | grep malloc_usable_size.lo
llvm-ar -t libc.a | grep memalign.lo
llvm-ar -t libc.a | grep posix_memalign.lo
llvm-ar -t libc.a | grep realloc.lo
llvm-ar -t libc.a | grep reallocarray.lo
llvm-ar -t libc.a | grep valloc.lo
llvm-ar -t libc.a | grep strdup.lo
llvm-ar -t libc.a | grep strndup.lo

llvm-ar -x libmimalloc.a
llvm-ar -r libc.a *.o
cp libc.a $LIBC_PATH 
cd ..
cd ..
cd ..
rm -fr mimalloc
echo "Archiving to new tar ball to directory ./new_archive"
mkdir new_sdk_archive
tar -czf new_archive.tar.gz $swiftStaticSDKArchiveFolder
mv new_archive.tar.gz new_sdk_archive
cd new_sdk_archive
mv new_archive.tar.gz $swiftStaticSDKArchiveFile
cd ..
echo "Archived!"
rm -fr $swiftStaticSDKArchiveFolder

echo "Done!"
mimalloc_v2.1.7.diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index bcfe91d8..a5473c69 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -481,7 +481,6 @@ endif()
 # static library
 if (MI_BUILD_STATIC)
   add_library(mimalloc-static STATIC ${mi_sources})
-  set_property(TARGET mimalloc-static PROPERTY POSITION_INDEPENDENT_CODE ON)
   target_compile_definitions(mimalloc-static PRIVATE ${mi_defines} MI_STATIC_LIB)
   target_compile_options(mimalloc-static PRIVATE ${mi_cflags} ${mi_cflags_static})
   target_link_libraries(mimalloc-static PRIVATE ${mi_libraries})
diff --git a/src/alloc-override.c b/src/alloc-override.c
index 12837cdd..e5bcda22 100644
--- a/src/alloc-override.c
+++ b/src/alloc-override.c
@@ -191,7 +191,7 @@ typedef void* mi_nothrow_t;
   void* operator new[](std::size_t n, std::align_val_t al, const std::nothrow_t&) noexcept { return mi_new_aligned_nothrow(n, static_cast<size_t>(al)); }
   #endif
 
-#elif (defined(__GNUC__) || defined(__clang__))
+#elif (defined(__GNUC__) || defined(__clang__) || defined(do_we_need_this))
   // ------------------------------------------------------
   // Override by defining the mangled C++ names of the operators (as
   // used by GCC and CLang).
@@ -289,7 +289,7 @@ mi_decl_weak int reallocarr(void* p, size_t count, size_t size)    { return mi_r
   void  __libc_free(void* p)                            MI_FORWARD0(mi_free, p)
   void* __libc_memalign(size_t alignment, size_t size)  { return mi_memalign(alignment, size); }
 
-#elif defined(__GLIBC__) && defined(__linux__)
+#elif defined(__linux__) //defined(__GLIBC__) && defined(__linux__)
   // forward __libc interface (needed for glibc-based Linux distributions)
   void* __libc_malloc(size_t size)                      MI_FORWARD1(mi_malloc,size)
   void* __libc_calloc(size_t count, size_t size)        MI_FORWARD2(mi_calloc,count,size)

For example, let's say you have downloaded the latest sdk archive

swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz

To apply the patch, you would run

./patch_musl.sh swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz

It will create a new archive at ./new_archive/swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz. Then you install it the usual way:

swift sdk install ./new_archive/swift-6.0.1-RELEASE_static-linux-0.0.1.artifactbundle.tar.gz

Note that this script only works on a Linux machine. Also, please note that I don't have a deep understanding of what I'm doing (the script is very basic), so you should not use this in production.


Very detailed message. Looks like you are spot on in your analysis. No idea how to resolve this better.

That's what you're silently doing anyway :sweat_smile:

Oh, the script runs on my own machine (as do the benchmarks in the post).

Do you have access to EPEL 9 packages in your runtime environment? Ron has Swift 6.0 in there.

Good question, @hjyamauchi, have you guys looked into bringing your mimalloc work to non-Windows platforms?

Hmm, I don't think I have access to EPEL 9. I'd guess we have access to EPEL 8, but I'll have to check. The environment is very restricted since many people share the computers; i.e., even if RHEL 8 were a supported platform on swift.org, I wouldn't be able to run yum install ... to install the necessary dependencies for the toolchain if they aren't already present on the system.

I think that work is more about enabling mimalloc for the compiler if I'm not mistaken, but it would be awesome if it could be extended for this use case as well.

Right, it's about enabling mimalloc for the compiler. We haven't tried it for other use cases, but it will likely be applicable, though the way it's linked would be different on Linux.

A much simpler solution:

Build mimalloc into a single object file, either via CMake or via src/static.c in its repository. Then place it in the root directory of your Swift package (next to Package.swift) and add linkerSettings: [.unsafeFlags(["mimalloc.o"])] to your executable target.
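
For reference, a minimal Package.swift sketch of that setup, assuming an executable target named MyTool (a placeholder) and a prebuilt mimalloc.o sitting next to the manifest:

// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyTool", // placeholder package/target name
    targets: [
        .executableTarget(
            name: "MyTool",
            linkerSettings: [
                // mimalloc.o is assumed to be built beforehand (e.g. from mimalloc's
                // src/static.c) and placed next to Package.swift. Object files are
                // handed to the linker ahead of static libraries, so its malloc/free
                // definitions take precedence over the ones in musl's libc.a.
                .unsafeFlags(["mimalloc.o"])
            ]
        )
    ]
)

Note that a target using unsafeFlags can't be consumed as a dependency of another package, but that's not an issue for a root executable like this.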

This works for me, but it would be great for the Swift core team to comment on this method. LLVM linker documentation states that object files are linked before library files, in this case, mimalloc.o before libc.a. The microsoft/mimalloc repository also prescribes this method for static overriding.


Without commenting on the larger issue, the specific concern I'd have is whether that overrides uses of malloc inside musl or not. For instance, what happens when you call free(strdup("test")) from Swift?
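
To spell out the pattern in question (a sketch of mine, not from the thread): strdup allocates its copy inside libc, and the free that runs is whichever definition the linker resolved, so the two had better come from the same allocator.

#if canImport(Musl)
import Musl    // musl-based Static Linux SDK build
#elseif canImport(Glibc)
import Glibc   // native glibc build
#elseif canImport(Darwin)
import Darwin  // macOS, for completeness
#endif

// strdup makes its copy via libc's internal malloc; free is whichever symbol
// the linker resolved. If musl's internal malloc were left in place while free
// was overridden by mimalloc, this would mix allocators on the same pointer.
let copy = strdup("test")
free(copy)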

If you run:

ar -x libc.a strdup.lo
objdump -t strdup.lo

You find strdup as a text symbol, and malloc as an undefined symbol.

https://lld.llvm.org/NewLLD.html#important-data-structures

This article describes the LLVM linker symbol types and conflict resolution rules. Symbols from mimalloc.o are exclusively "Defined", so it follows that if there were no errors during a link operation, then the various malloc text symbols in libc.a were correctly overridden.