FB13800558: listen(2) man page should explain why the backlog is limited to 128
FB13800566: Connection errors are reported with the wrong errno (ECONNRESET or EBADF, rather than ECONNREFUSED).
FB13800571: kevent(2) erroneously reports unconnected sockets as writable (and not closed)
Hi Wade,
As I have simply used the aptitude package up until now, this was the first time I compiled the wrk application myself, with your patch. I do not have much experience with compiling things under Linux, so I simply ran ‘make’ after applying your second patch; no optimisation flags, etc.
I see now that the wrk I used before was version 4.1.0, whereas 4.2.0 came out 3 years ago; 4.1.0 is apparently the default release of wrk under Ubuntu 22.04.
I assume your patch applies to the latest commit on master (from February 2021)?
I will increase net.core.somaxconn and ulimit -n, but as I'm running both server and client on Linux, they were already a lot higher than on macOS.
Question
Can you recommend a good way to automatically measure the real memory usage of the web application? I have used htop in the past, but that was a manual process.
Is the RSS field of ps good enough to sample the process' memory usage?
And:
Are you going to create a PR with your work for wrk, even though there are already 55 PRs open from the last 5 years?
Or are your efforts destined for the bit bucket (and my fork)?
It's probably worth noting this with some prominence in your post. While it's not necessarily wrong to raise these limits so that they don't bottleneck the benchmark (especially the file descriptor limit), there are reasons that somaxconn exists (to help defend against SYN flood attacks, for example). So arguably it is detrimental, in contrast to the other frameworks tested, for Vapor to be so sensitive to low configuration values.
(though it's also arguable that 128 genuinely is too low for a modern system, probably having been established decades ago when RAM capacities were an order of magnitude lower)
ps or similar running at a high-enough sample rate should work, in theory. But "high-enough" is fuzzy… I'd say at least 10 Hz, but that's a somewhat arbitrary guess.
I seem to vaguely recall that there are more precise (and easier) ways to get peak memory usage. e.g. maybe getrusage (ru_maxrss specifically)? There are probably pre-existing scripts which use it to wrap a subcommand, like time does. Even if not, it's trivial to write a little wrapper.
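For reference, such a wrapper is only a few lines with Foundation's Process plus getrusage. This is a rough sketch, not a tested tool: peakChildRSS is a made-up name, and /bin/ls stands in for the real server binary.

```swift
import Foundation
#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

/// Runs a command to completion and returns the peak resident set size of
/// the child, as reported by getrusage(RUSAGE_CHILDREN).
/// Caveat: ru_maxrss is in kilobytes on Linux but in bytes on macOS.
func peakChildRSS(launching path: String, arguments: [String] = []) throws -> Int {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: path)
    process.arguments = arguments
    process.standardOutput = FileHandle.nullDevice
    try process.run()
    process.waitUntilExit()

    var usage = rusage()
    getrusage(RUSAGE_CHILDREN, &usage)
    return Int(usage.ru_maxrss)
}

let peak = try peakChildRSS(launching: "/bin/ls")  // substitute the benchmarked app
print("peak RSS of children: \(peak)")             // KB on Linux, bytes on macOS
```

Note this only reports the peak after the child exits; for sampling over time you would still poll ps/RSS as discussed above.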
I've unenthusiastically considered it, but it seems like wrk is abandoned. I could push up my own fork and just leave it, but then it too would be abandoned.
I'd be interested to see what difference this makes to the benchmarks.
Note that this should work for the app being benchmarked here, but we're still discovering a few places where it may crash due to the use of .wait() in various packages, which we have no way of finding without manual review or people reporting crashes.
It essentially reduces the latency by roughly 20 microseconds every time you need to switch between the I/O threads and the business logic threads. The regular Swift Concurrency threads can't host I/O so we need I/O->business logic->I/O. When installing NIO as the Concurrency pool those switches aren't thread switches, just work item enqueues.
Where it helps the most are proxies (little work, lots of I/O) particularly streaming proxies (as the I/O operations don't necessarily go to the kernel but are from in-memory buffers from stuff that has been read ahead). It'll also help with HTTP servers and simple routes or anything where adding 20 microseconds is noticeable.
In @axello's benchmark it wouldn't make a difference either way, because it spends 5 milliseconds per request computing; wasting an extra 20 microseconds isn't a problem there (almost a three-orders-of-magnitude difference).
Given that @axello's benchmark has long enough CPU bursts that one could argue the work should be offloaded, it might even be detrimental to use NIO as the executor, because having the I/O and business logic in different pools can be seen as an implicit offload. I personally don't buy that argument though: offloads should be explicit, IMHO. Outside of benchmarks it's very unlikely that I would want all requests to be handled on a different pool; surely there are some heavier and some more lightweight requests. But for this benchmark, where every request has a 5 ms CPU requirement, we may as well implicitly offload them for no downside (the 20 microseconds lost are well within the noise).
Ubuntu on an Intel Pentium G4560. The numbers suggest that @lorentey's machine is ~1.6x faster.
@axello/@lorentey code. Full repository here. Not optimal: no numeric tricks, tons of allocations (no inout), etc. This is intended, as it is what users will write.
Node (2nd column) is used as the baseline -> values in parentheses are the relative improvement over Node. For example, GMP 1000000 is 3.4x faster than Node.
each test was repeated 50 times. The value in each cell is the average after removing the min/max.
Violet does not provide a Swift package; to work around this, the source code of all of the Swift implementations (Attaswift, Numberick, Violet) was copied into the test package.
the maximum relative standard deviation (for the last row) is 3.31%, for Python. Other implementations are <1.5%. Results are deemed stable and repeatable.
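For clarity, the per-cell statistic is just a min/max-trimmed mean. A sketch (trimmedMean is my name for it, not something from the benchmark harness):

```swift
/// Average of the samples after dropping the single smallest and largest
/// value, as used for each table cell (50 samples per cell).
func trimmedMean(_ samples: [Double]) -> Double {
    precondition(samples.count > 2, "need at least 3 samples")
    let trimmed = samples.sorted().dropFirst().dropLast()
    return trimmed.reduce(0, +) / Double(trimmed.count)
}
```

This makes the cells robust against a single outlier run in either direction.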
Let's check whether the results are similar to @lorentey's by comparing the Attaswift/Node ratio (the last column in their data):
negligible overhead from the interpreter and method dispatch (via numeric nb_add, not the full operator resolution).
reference counting for garbage collection as int can't have cycles.
Looking at the Python code one could ask: why is Node so slow? It should easily beat Python. Well… IDK. Perf tools give us nothing (I assume that they count the BigInt work as Shared libraries -> node):
[Summary]:
   ticks  total  nonlib   name
     295   0.1%    7.9%   JavaScript
    3416   1.0%   92.1%   C++
   22975   6.8%  619.1%   GC
  333291  98.9%           Shared libraries

[Shared libraries]:
   ticks  total  nonlib   name
  330151  98.0%           <PATH_REDACTED>/node/v19.0.0/bin/node
    2982   0.9%           /usr/lib/x86_64-linux-gnu/libc.so.6
     116   0.0%           [vdso]
      42   0.0%           /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
The JavaScript part does not matter (0.1%). In the logs we see that fib_loop was not optimized by Maglev/TurboFan, which means we ran on the bytecode. Just like in Python, only the inner C/C++ implementation matters. Anyway, neither Node nor Python uses assembly, so a good Swift implementation should at least match their results.
mpz_add is defined in aors.h imported with OPERATION_add.
When the signs are the same it calls mpn_add, which is basically __GMPN_AORS with FUNCTION being __gmpn_add_n: __gmpn_add_n(__gmp_wp, __gmp_xp, __gmp_yp, __gmp_i) (w is the result, x/y are the operands and i is the min length).
Btw. there is a theoretical "nails" representation where we leave the highest bits of every digit/word at 0 (similar to Python) and use vector operations. Fairly interesting, but c'mon, we can't expect GitHub libraries written by unpaid volunteers to do this kind of stuff.
this seems to be a new kid on the block, so I do not know much about it.
UIntXL is BigUInt. I don't know if BigInt is available. We are only dealing with positive numbers, so sign overhead should be negligible anyway.
their stack traces are so long that my Instruments broke. I just couldn't see the inner loop, so I don't know what is happening inside.
Violet is an implementation written by me. It was tuned on the same machine that produced the results above, which may explain why it is so fast. The implementation used is Violet-BigInt-XsProMax, not the plain Violet one. Differences:
Violet - under the hood it is a union (via tagged pointer) of Int32 (called Smi, after V8) and a heap allocation (magnitude + sign representation) with ARC for garbage collection. This means no heap allocations in Int32 range, but it assumes 64 bit pointers. More information in the documentation.
Violet-BigInt-XsProMax - no Smi, only heap. Slightly faster than Violet, because it does not have to check which representation it is dealing with.
The Int32 handling does not matter in this test, so I went with Violet-BigInt-XsProMax (Apple naming scheme, don't blame me). I will say more about this a bit later in this post.
Attaswift - this is the original BigInt (the one used by OP), so let's focus on it. Its layout boils down to:
// https://github.com/attaswift/BigInt
// LICENSE: https://github.com/attaswift/BigInt/blob/master/LICENSE.md
struct BigInt: SignedInteger {
    var magnitude: BigUInt
    var sign: Sign
}

struct BigUInt: UnsignedInteger {
    typealias Word = UInt

    enum Kind {
        case inline(Word, Word)
        case slice(from: Int, to: Int)
        case array
    }

    var kind: Kind // Internal for testing only
    var storage: [Word] // Internal for testing only; stored separately to prevent COW copies

    subscript(_ index: Int) -> Word {
        get {
            // @LiarPrincess: Simplified for readability
            switch kind {
            case .inline: …
            case .slice: …
            case .array: …
            }
        }
        set(word) { … }
    }
}
If we run fibonacci(500000) and print the result, their inner loop for addition will take 87.6%:
It handles 2 digits per iteration. Remember the "reorder lines -> 20% slower" comment in the Violet code above? This is exactly that. If you paste this code into godbolt.org and reorder those lines you will see that add rcx, 2 becomes inc rsi and only a single digit is handled per iteration. The calculation itself is done by lea followed by add; setb, or by add; adc rax, 0 for the carry; how much it matters depends on the CPU.
At the same time GMP calculates 4 digits in a single iteration with 4x adc one after the other. I can't paste the code here because of the license, but here is the link.
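For readers who want the portable (non-assembly) shape of that loop, here is a sketch of limb-by-limb addition over little-endian magnitudes using the standard library's addingReportingOverflow. addMagnitudes is a made-up name, and real libraries avoid the per-iteration append; this just shows the carry chain:

```swift
/// Adds two little-endian magnitudes (least significant word first).
func addMagnitudes(_ x: [UInt], _ y: [UInt]) -> [UInt] {
    let (long, short) = x.count >= y.count ? (x, y) : (y, x)
    var result: [UInt] = []
    result.reserveCapacity(long.count + 1)
    var carry: UInt = 0
    for i in 0..<long.count {
        let b = i < short.count ? short[i] : 0
        let (partial, overflow1) = long[i].addingReportingOverflow(b)
        let (sum, overflow2) = partial.addingReportingOverflow(carry)
        // At most one of the two additions can overflow, so carry stays 0 or 1.
        carry = (overflow1 ? 1 : 0) | (overflow2 ? 1 : 0)
        result.append(sum)
    }
    if carry != 0 { result.append(carry) }
    return result
}
```

The adc instructions in GMP's unrolled loop are doing exactly this, four limbs at a time, with the carry kept in the flags register.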
That said, focusing on addition is a little bit pointless, as mul/div are the real heavy hitters in most workloads, and even a few adds will not outweigh a single mul. Attaswift vs Violet in mul/div:
Even if Attaswift used a simple Array/ContiguousArray it would still have some problems. For example, the isUniquelyReferenced check (or whatever Array does) can be merged with the append/reserveCapacity for the carry. This could potentially save us a copy.
Also, setting an index in an Array does another isUniquelyReferenced check which the compiler will not hoist out of the loop. In Violet, COW is done before we even enter the hot functions. There is a system where calling storage.guaranteeUniqueBufferReference (with an optional capacity argument) gives us a token which is required to call any mutating operation. This way the compiler forces us to do the COW checks up front. This token should be move-only, but this code was written before we had move-only types in Swift (borrow = mutation, consume = storage reallocation + a new token is issued).
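A stripped-down sketch of what such a token system looks like. The names follow the post, but this is my simplification, not Violet's actual code; a real implementation would also write through an unsafe buffer pointer so the per-element CoW check really disappears:

```swift
/// Proof that the caller ensured unique storage before entering the hot loop.
struct UniqueBufferToken {
    fileprivate init() {}
}

struct Magnitude {
    private var storage: ContiguousArray<UInt>

    init(_ words: [UInt]) { self.storage = ContiguousArray(words) }
    var words: [UInt] { Array(storage) }

    /// Any mutating call copies the buffer if it is shared, so after this
    /// returns the storage is uniquely referenced.
    mutating func guaranteeUniqueBufferReference(capacity: Int = 0) -> UniqueBufferToken {
        storage.reserveCapacity(max(capacity, storage.count))
        return UniqueBufferToken()
    }

    /// Mutations require the token, so the compiler forces callers to do the
    /// uniqueness check up front, outside the loop.
    mutating func set(_ value: UInt, at index: Int, token: UniqueBufferToken) {
        storage[index] = value
    }
}
```

Because UniqueBufferToken has no public initializer, the only way to obtain one is to have already paid for the uniqueness check.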
Also, as already pointed out in this thread, BigInt operations should strive to be inout to avoid buffer allocations. The whole BigInt has to be written with this in mind.
Attaswift - weird; smaller counts are faster, but bigger ones are slower. I have no idea why. I expected no change, though the variance is quite big on the smaller sets, so that may be it.
Numberick - ~5% improvement in the last 2 rows.
Violet - ~30% improvement in the last 2 rows. The whole implementation was written with inout in mind. The original code (the 1st table in this post) was using the library incorrectly. It is unfortunate that BigInt has Int in its name, because people will assume that they are as "cheap" as any other Int.
GMP also allows you to alias the lhs/result pointers, but remember that if you are using the lower-level mpn functions then the ranges must not overlap. Or just use the safe mpz layer.
Moral of the story: use inout for BigInt, even if the library does not optimize this (yet…).
Btw. there is a tiny gotcha in the inout method if you have enum-based storage like Attaswift's BigUInt.Kind (inline | slice | array): the switch can increase the reference count, making inout always work on +2, which will force a COW. Violet has the same problem with the smi|heap representation (Smi is an inlined Int32 instead of the pointer). To solve it I temporarily assign Int32.zero:
public static func += (lhs: inout BigInt, rhs: BigInt) {
    // lhs is on +1
    switch (lhs.value, rhs.value) {
    case let (.smi(lhsSmi), .smi(rhs)): …
    case (.smi(let lhsSmi), .heap(var rhs)): …
    case (.heap(var lhsHeap), .smi(let rhs)):
        // lhs is +2
        Self.releaseBufferToPreventCOW(&lhs)
        // lhs is +1
        …
    case (.heap(var lhsHeap), .heap(let rhs)):
        // lhs is +2
        Self.releaseBufferToPreventCOW(&lhs)
        // lhs is +1
        …
    }
}

private static func releaseBufferToPreventCOW(_ int: inout BigInt) {
    let zero = Smi(Smi.Storage.zero)
    int.value = .smi(zero)
}
(Disclaimer: I changed bullet points to numbers in their post, for easier reference).
Check the assembly; the Swift compiler will usually leave the uniqueness check inside the loop. Violet has an over-engineered solution which gives you control over this.
(Btw. have you noticed something about the Numberick and Violet results?)
When it comes to the implementation: the simplest case is when the radix is a power of 2. In that situation we can use a linear algorithm where we treat the whole BigInt as a bit buffer and group those bits according to the radix. For radix 16 we group by 4 bits, etc. The tiny edge case is when the radix bit count is not a multiple of the digit size. For example:
digit is 64bit
radix is 8 -> group 3 bits together
64/3 = 21 with 1 bit left over; that remaining bit has to be grouped with bits from the next digit
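A sketch of that linear grouping for radix 8 over little-endian 64-bit words, including the carry of leftover bits across word boundaries. toOctal is a made-up helper and real implementations work on raw buffers, but the bit bookkeeping is the interesting part:

```swift
/// Linear power-of-two radix conversion: peel off 3 bits per octal digit,
/// carrying leftover bits across the 64-bit word boundary (64 % 3 != 0).
func toOctal(words: [UInt64]) -> String {
    var digits: [Character] = []   // least significant digit first
    var acc: UInt64 = 0            // leftover bits from the previous word
    var accBits = 0
    for word in words {
        var w = word
        var remaining = 64
        if accBits > 0 {
            // Complete the partial group with bits from this word.
            let need = 3 - accBits
            let mask: UInt64 = (1 << need) - 1
            acc |= (w & mask) << accBits
            digits.append(Character(String(acc, radix: 8)))
            w >>= need
            remaining -= need
            acc = 0
            accBits = 0
        }
        while remaining >= 3 {
            digits.append(Character(String(w & 0b111, radix: 8)))
            w >>= 3
            remaining -= 3
        }
        acc = w
        accBits = remaining
    }
    if accBits > 0 && acc != 0 {
        digits.append(Character(String(acc, radix: 8)))
    }
    while digits.count > 1 && digits.last == "0" {  // strip leading zeros
        digits.removeLast()
    }
    return String(digits.reversed())
}
```

Note there is no division anywhere; that is what makes the power-of-two case linear.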
But this is not our case, because we print in radix 10, which is not a power of 2. Side note: printing BigInt is not that common. It is mostly used for debugging, and even then it is not like humans can read 100 digits. Most of the time BigInts are there "in the background" and we never print them.
Anyway, Node beats GMP in some cases. Links to implementation:
One of the ways to make division cheaper is to divide by a constant. This way the compiler can reduce the division into something that gives the same result but is much cheaper, for example a multiplication by a modular multiplicative inverse. This is what DivideByMagic does.
If you want to see this in action: some time ago I had to write an algorithm to count the trailing zeros of an Int64. This can be done via binary search, dividing by powers of 10. Unfortunately divisions are expensive. But if you unroll the whole algorithm and use constants then no division instruction will be emitted. You can paste this code into godbolt.org to check.
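The shape of that trick, sketched below; trailingDecimalZeros is my name for it. The divisors are all compile-time constants, so in a release build the optimizer emits multiplications and shifts instead of div instructions:

```swift
/// Counts trailing decimal zeros of a non-zero Int64 via unrolled binary
/// search over powers of 10. Every division below is by a constant.
func trailingDecimalZeros(_ value: Int64) -> Int {
    precondition(value != 0)
    var n = value
    var count = 0
    if n % 10_000_000_000_000_000 == 0 { n /= 10_000_000_000_000_000; count += 16 }
    if n % 100_000_000 == 0 { n /= 100_000_000; count += 8 }
    if n % 10_000 == 0 { n /= 10_000; count += 4 }
    if n % 100 == 0 { n /= 100; count += 2 }
    if n % 10 == 0 { count += 1 }
    return count
}
```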
This definitely makes things faster, but the actual difference is the algorithm.
The classic algorithm (used by Attaswift, Numberick, Violet) performs n steps where each step does quotientAndRemainder to extract the rightmost digit. Obviously a "digit" here is a whole UInt64, not a decimal digit. Example implementation here. In total we have:
n steps
each step performs up to n divisions with UInt/UInt operands
The optimized algorithm (used by Node and GMP) is divide and conquer. Node has fantastic documentation here. Basically you take the whole BigInt and split it in half by dividing by a big enough power of the radix, and then you work on those halves.
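The recursion is easiest to see on a machine word, so here is a sketch under that simplification. A real BigInt version replaces the UInt64 arithmetic with big-number divmod and only recurses down to word-sized chunks; the helper names are mine:

```swift
/// 10^k for small k (enough for a UInt64).
func pow10(_ k: Int) -> UInt64 {
    (0..<k).reduce(1) { p, _ in p * 10 }
}

/// Divide-and-conquer decimal printing: split the value by 10^(digits/2)
/// and recurse on both halves, zero-padding the low half.
func decimalString(_ n: UInt64, digits: Int) -> String {
    if digits <= 4 {
        // Base case: small enough to format directly (zero-padded).
        let s = String(n)
        return String(repeating: "0", count: digits - s.count) + s
    }
    let half = digits / 2
    let (hi, lo) = (n / pow10(half), n % pow10(half))  // split by 10^half
    return decimalString(hi, digits: digits - half) + decimalString(lo, digits: half)
}

func toDecimal(_ n: UInt64) -> String {
    var digits = 1
    var p: UInt64 = 10
    while n >= p {
        digits += 1
        if p > UInt64.max / 10 { break }
        p *= 10
    }
    return decimalString(n, digits: digits)
}
```

The payoff for real BigInts is that each level of recursion halves the operand size, instead of the classic algorithm's n full-length divisions.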
There are multiple reasons. I believe that once you reach a certain size divide and conquer is the biggest factor.
I think (though I have not looked carefully) that this is the classic algorithm mentioned above. The only difference is that Attaswift uses a temporary parts: [String].
Violet/Numberick also use the same algorithm, and even without those allocations they are 100x slower than divide and conquer.
I don't know what Swift does, but in Python:
ma = "ma"
print(f"ala {ma} kota")
This gives us the following bytecode:
0 0 RESUME 0
2 2 LOAD_CONST 0 ('ma')
4 STORE_NAME 0 (ma)
3 6 PUSH_NULL
8 LOAD_NAME 1 (print)
10 LOAD_CONST 1 ('ala ') # Put 'ala ' on stack
12 LOAD_NAME 0 (ma) # Put variable 'ma' on stack
14 FORMAT_VALUE 0 # Convert variable 'ma' to string
16 LOAD_CONST 2 (' kota') # Put ' kota' on stack
18 BUILD_STRING 3 # Concatenate 3 values from the stack
20 CALL 1 # Call 'print'
28 POP_TOP
30 RETURN_CONST 3 (None)
Even Python optimizes this, so the above code will not copy the string.
The compiler cannot assume that += and + are related. It may optimize things in the simplest cases, but not with logic as complicated as BigInt's. The whole BigInt has to be designed to support an efficient +=.
What to use? (spoiler: GMP)
Ok, let's say that we have to choose some BigInt library. Which one?
First of all, there is no "one BigInt library to rule them all". It all depends. As already mentioned, Violet has 2 versions:
In most of the tests "Violet-BigInt-XsProMax" is slightly faster because it does not have to check which representation it is dealing with. But in swift-numerics/pull/120 @xwu proposed a π calculation test.
In this test "Violet" is much faster than "Violet-BigInt-XsProMax", getting results similar to the much more optimized Python (the numbers we are dealing with are huge, and Violet's naive div is slow). The reason is the input distribution for multiplication. For example, for test_pi_5000 (row/column is operand length, value is operation count; the full distribution for all operations is available here):
lhs \ rhs      0
0           8067
200         6795
400         6470
600         6234
800         6069
and so on…
The rhs is always a single Word, most probably in the Int32 range. This is exactly the case that Violet was optimized for, because then it does not have to allocate memory.
Bottom line is: know your number distribution.
BigInt
Assumes that most of the code will use built-in Swift integer types. BigInt will be used, but only in places that can overflow.
Focuses on performance for big numbers, since small ones will be handled using Swift types.
Is more of an auxiliary type than a thing on its own.
Unlimited/general-purpose integer (I made up this name, so don't Google it):
Assumes that this is the only integer type available in the system.
Will probably optimise for small numbers (think: 0, 1, 256 etc.) rather than big ones (think: 123_123_123_123_123_123).
Fibonacci is a BigInt use case, but if we were implementing a math library and wanted to give our users a nice Int type then a "general purpose integer" could be better. Interestingly, Python's int type is unlimited - there is no BigInt library because the default int is a BigInt.
they only have BigUInt? I wanted to run the Violet tests on it, but they require BigInt
for mul they use Karatsuba above 20 words - this is around the expected value (usually between 20 and 40, depending on the implementation details). Attaswift's limit is 1024, which is uncommon.
Attaswift
Violet tests were merged, but some of them were commented out
(Yes, all of the results in the table are correct.)
I think that Attaswift uses a sign + magnitude representation. If it were 2's complement then everything would be trivial, but it is not, so sometimes you need an adjustment: Swift uses what would be GMP_DIV_FLOOR mode in GMP, which means that if we are negative and any of the removed bits is 1 then we have to round down.
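A small sketch of that adjustment for a right shift by k bits. floorShiftRight is a made-up helper; two's complement >> already floors, which is why it needs no fix-up, while a shifted magnitude truncates toward zero:

```swift
/// Floor-division adjustment for a sign + magnitude layout: a negative
/// value whose dropped bits were non-zero must be rounded one further down.
func floorShiftRight(magnitude: UInt, isNegative: Bool, by k: Int) -> (magnitude: UInt, isNegative: Bool) {
    var shifted = magnitude >> k
    let mask: UInt = (1 << k) - 1
    if isNegative && magnitude & mask != 0 {
        shifted += 1  // round toward negative infinity
    }
    return (shifted, isNegative && shifted != 0)
}
```

For example -5 >> 1 must be -3 (floor of -2.5), but naively shifting the magnitude 5 gives 2, i.e. -2 (truncation toward zero).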
Anyway, if my goal was to use 100% Swift then I would go with Attaswift.
That said, I don't think anyone's goal is to use 100% Swift; the goal is to use the best possible implementation. Because of that I would recommend using GMP wrapped in a class. It is easily the fastest and the most battle-tested option out there. Just check that it is supported on your platform.
If you need BigInt, but you do not need a full BigInt, then you may look at wider fixed-width integers:
the stdlib will get Int128/UInt128 soon - I tried 4 or 5 toolchain builds (as recent as 2 weeks ago), and all of them crashed
DoubleWidth solutions, for example:
swift-numerics - a few months ago I opened an issue about crashes and incorrect results, which uncovered even more issues (see the connected PR). Their test suite is lacking. Just be careful and write tests.
Numberick also has one. At some point I opened an issue there, but it was solved quite quickly.
A quick indicator of quality is the div function: if you see any >> 1 (which indicates bit-by-bit division) then run away.
Do not go too crazy with fixed-width integers; Int128/Int256 are fine, but above that YMMV. I have a Python script that generates UInt128 (as 2x UInt64) and UInt256 (as 4x UInt64) for me. I have never had to use anything bigger.
Fin
With all that it is clear that the community should drop everything, cancel WWDC and start implementing a good BigInt.
Please don't.
At least I don't think this is a good idea. I have never used BigInt in a business scenario, and a lot of the programs that use it are artificial Fibonacci-like benchmarks. There are some real-world use cases, but they can be solved with GMP.
the decimal thingie.
Support for IEEE 754 decimal sucks across the board, not only in Swift.
The Intel library should work as long as you are on a supported platform and can FFI to C
(as you pointed out in another thread) Mongo supports it as Decimal128 via BID.
Anyway, for my needs I implemented it myself, and now I diligently ignore 70% of the library, because only +-*/, truncatingRemainder and quantize are useful.
I pondered that. I think it can, because if the results don't match then the implementation is incoherent anyway. I don't think even "rounding error" should be a permitted excuse: this isn't a fused multiply-add or whatever; this is logically the exact same single arithmetic operation either way, and it should only differ in which memory address the result is written to.
But I've yet to see any situation where Swift actually switches from one to the other. I suppose it'd have to somehow know which implementation is better, which should be += in principle but in practice sometimes isn't.
Even without that optimisation, it seems like the difference is small for any decent implementation, because of course + is basically always going to be:
@_transparent
public static func + (a: T, b: T) -> T {
    var result = a
    result += b
    return result
}
I suppose it's possible with tail allocation to do it differently in order to achieve better performance, but otherwise the compiler does a pretty good job with the above (in release builds).
…and that it's legal to use. It's GPLv3, so it's unusable for most software.
It seems like it wouldn't be that hard to essentially replicate (if not better) GMP's optimisations in Swift - these BigInt types aren't that complicated. Maybe not in pure Swift (no assembly intrinsics), but mostly Swift with some ASM.
The compiler is not licensed to do this transformation in general; even if we had a way to promise that the result is the same, one or both of the operators might have observable side effects (e.g. printing something to a log for debugging). It's interesting to think about how you would design a language that lets you tell the compiler that it can do this transformation, but Swift has no mechanism to allow it today.
It's not hard to make common cases as fast as GMP. When you get further off into the algorithmic weeds (huge unbalanced division, enormous multiplications, more obscure operations) it's important to keep in mind that GMP represents a couple decades worth of active development by a small but quite-skilled team. A lot of that effort has gone into micro-optimization for architectures that don't exist any longer, but a lot of it is fundamental algorithmic work that would take some real effort to match.
Thank you for your extremely detailed and technical post. Most of it goes over my head, but it is very interesting from an algorithmic point of view. Of course, the other technologies in my humble benchmarks could do the same kinds of optimisations.
And of course, if I displayed the Fibonacci result in hexadecimal or binary format, a lot of conversions would not be necessary. But as these conversions are also necessary in the other frameworks, the comparison should still be fair, if optimised correctly.
Hello everyone,
I've patched wrk with @wadetregaskis' patches, enhanced the memory measurements and ran all the benchmarks again with Vapor upgraded to 4.96 and the Numberick library. (Sorry @0xTim, not 4.100 yet: I will do that after my holidays.)
Things are looking very much brighter for Swift with these changes. You can see that best in this memory graph.