'Standard' vapor website drops 1.5% of requests, even at concurrency of 100!

FB13800558: listen (2) man page should explain why the backlog is limited to 128
FB13800566: Connection errors are reported with the wrong errno (ECONNRESET or EBADF, rather than ECONNREFUSED).
FB13800571: kevents erroneously reports unconnected sockets as writable (and not closed)

13 Likes

Hi Wade,
As I have simply used the aptitude package up until now, I compiled the wrk application for the first time with your patch. I do not have much experience with compiling stuff under linux, so I simply ran ‘make’ after applying your second patch. No optimisations etc.

I see now that the wrk I used before was version 4.1.0, whereas 4.2.0 came out 3 years ago. 4.1.0 is apparently the default release for wrk under Ubuntu 22.04
I assume your patch apples to the latest commit on master (from February 2021)?

I will increase net.core.somaxconn and ulimit -n, but as I'm running both server and client on linux, they were already a lot higher than on macos.

Question
Can you recommend a good way to automatically measure the real memory usage of the webapplication? I have used htop in the past, but that was a manual process.
Is the RSS field of ps good enough to sample the process' memory usage?

And:
Are you going to create a PR with your work for wrk, even though there are already 55 PRs open from the last 5 years?
Or are your efforts ready for the bitbucket (and my fork)?

1 Like

Should do - I just cloned the repo at HEAD.

It's probably worth noting this with some prominence in your post, as while it's not necessarily wrong to raise these until they don't impact the benchmark (especially the file descriptor limit), there ostensibly are reasons that somaxconn exists (to help defend against SYN flood attacks, for example). So arguably it is detrimental, in contrast to the other frameworks tested, for Vapor to be so sensitive to low configuration values.

(though it's also arguable that 128 genuinely is too low for a modern system, probably having been established decades ago when RAM capacities were an order of magnitude lower)

ps or similar running at a high-enough sample rate should work, in theory. But "high-enough" is fuzzy… I'd say at least 10 Hz, but that's a somewhat arbitrary guess.

I seem to vaguely recall that there are more precise (and easier) ways to get peak memory usage. e.g. maybe getrusage (ru_maxrss specifically)? There's probably pre-existing scripts which use it to wrap over a subcommand, like time does. Even if not, it's trival to write a little wrapper.

I've unenthusiastically considered it, but it seems like wrk is abandoned. I could push up my own fork and just leave it, but then it too would be abandoned. :man_shrugging:

We've released Vapor 4.100.0 that contains the fixes necessary for allowing you to install NIO as the Swift Concurrency global executor. See the template for an example - template-bare/Sources/App/entrypoint.swift at main · vapor/template-bare · GitHub

I'd be interested to see what difference this makes to the benchmarks.

Note that this should work for the app being benchmarked in question but we're still discovering a few places where it may crash due to the use of .wait() in various packages that we have no way of discovering without manual review or people reporting crashes

3 Likes

If I understand correctly, it's not expected to help use-cases like @axello's benchmark which are entirely CPU usage?

It essentially reduces the latency by roughly 20 microseconds every time you need to switch between the I/O threads and the business logic threads. The regular Swift Concurrency threads can't host I/O so we need I/O->business logic->I/O. When installing NIO as the Concurrency pool those switches aren't thread switches, just work item enqueues.

Where it helps the most are proxies (little work, lots of I/O) particularly streaming proxies (as the I/O operations don't necessarily go to the kernel but are from in-memory buffers from stuff that has been read ahead). It'll also help with HTTP servers and simple routes or anything where adding 20 microseconds is noticeable.

In @axello's benchmark it wouldn't make a difference either way because it spends 5 milli seconds per request computing, wasting an extra 20 micro seconds isn't a problem here (almost a three orders of magnitude difference).
Given that @axello's benchmark has long-enough CPU bursts that one could argue it should be offloaded it might even be detrimental to use NIO as the executor. Because having the I/O and business logic in different pools can be seen as an implicit offload. I personally don't buy into that argument though: offloads should be explicit IMHO. Outside from benchmarks it's very unlikely that I would want all requests to be handled on a different pool. Surely there are some heavier and some more lightweight requests... but for the benchmark where every request is a 5ms CPU requirement we may as well implicitly offload them for no downside (the 20 microseconds lost are well in the noise).

3 Likes

(This post is quite long, it is also available on github which may provide better reading experience.)

Quick list of the previous BigInt posts:

Fibonacci

We may as well add some actually optimized libraries, to check how much performance we leave on the table.

Methodology:

  • Ubuntu on Intel Pentium G4560. Numbers suggest that @lorentey machine is ~1.6x faster.
  • @axello/@lorentey code. Full repository here. Not optimal: no numeric tricks, tons of allocations (no inout) etc. Intended, as this is what users will write.
  • Node (2nd column) is used as a base -> values in parenthesis mean relative improvement over Node. For example GMP 1000000 is 3.4x faster than Node.
  • each test was repeated 50 times. Value in the cell is average after removing min/max.
  • Violet does not provide a Swift package, to solve this the source code of all of the Swift implementations (Attaswift, Numberick, Violet) was copied inside the test package.
  • maximum relative standard deviation for the last row is 3.31% (for Python). Other implementations are <1.5%. Results are deemed stable and repeatable.

Lets check if the results are similar to @lorentey by comparing the Attaswift/Node ratio (last column in their data):

Count @lorentey Table above
10000 2.78x 0.013141 / 0.0028444 = 4.619954999
30000 3.76x 0.081476 / 0.015971 = 5.101496462
100000 5.02x 0.6063 / 0.12162 = 4.985199803
300000 5.52x 5.1031 / 0.86641 = 5.889936635
1000000 5.78x 54.488 / 9.7038 = 5.61511985

I would say that they are comparable especially above the 100000, where their results reach 5.02x.


Node sits right in the middle. Their implementation calls AddSigned which then goes to Add. This is our reference point.

Python is faster than Node:

  • Python 3.12 compiled on the machine that ran those tests; Python 3.13 was just released with JIT compiler, it should not matter.
  • implementation is fairly simple.
  • digits are stored as flexible array member after the PyObject_HEAD (in older Python versions they actually used PyObject_VAR_HEAD, things changed during january 2023 re-work). The whole int object is just 1 allocation (no pointer chasing).
  • they store 30 bits per digit in uint32_t (PYLONG_BITS_IN_DIGIT == 30 is the default).
  • negligible overhead from the interpreter and method dispatch (via numeric nb_add, not the full operator resolution).
  • reference counting for garbage collection as int can't have cycles.

Looking at the Python code one could ask: why Node is so slow? They should easily beat Python. Well… IDK. Perf tools give us nothing (I assume that they count BigInt as Shared libraries -> node):

 [Summary]:
   ticks  total  nonlib   name
    295    0.1%    7.9%  JavaScript
   3416    1.0%   92.1%  C++
  22975    6.8%  619.1%  GC
  333291   98.9%          Shared libraries

 [Shared libraries]:
   ticks  total  nonlib   name
  330151   98.0%         <PATH_REDACTED>/node/v19.0.0/bin/node
   2982    0.9%          /usr/lib/x86_64-linux-gnu/libc.so.6
    116    0.0%          [vdso]
     42    0.0%          /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30

The JavaScript part does not matter (0.1%). In logs we see that fib_loop was not optimized by Maglev/TurboFan, which means that we ran on the bytecode. Just like in Python only the inner C/C++ implementation matters. Anyway, both Node and Python do not use assembly, so a good Swift implementation should at least match their results.

GMP obviously wins:

  1. mpz_add is defined in aors.h imported with OPERATION_add.
  2. When signs are the same it calls mpn_add which is basically __GMPN_AORS with FUNCTION being __gmpn_add_n: __gmpn_add_n(__gmp_wp, __gmp_xp, __gmp_yp, __gmp_i) (w is result, xy are operands and i is min length).
  3. mpn_add_n is implemented in assembly:
    • 4 digits are handled in a single iteration
    • L(lt4) - less than 4 - deal with the trailing 1/2/3 digits
    • L(2) - deal with the trailing 2/3 digits
    • L(3) - deal with the trailing 3 digits
    • L(top)/L(mid)/L(end) - loop

Just for completeness my GMP configuration is:

  • Version: GNU MP 6.3.0
  • Host type: kabylakenoavx-pc-linux-gnu (AVX does not matter)
  • ABI: 64
  • Compiler: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • tuneup was not used, with the fallback to Skylake configuration (Kaby Lake and Skylake are similar)
  • mini-gmp was not tested

Btw. there is a theoretical "nails" representation where we leave the highest bits of every digit/word 0 (similar to Python) and use vector operations. Fairly interesting, but c'mon, we can't expect github libraries written by unpaid volunteers to do this kind of stuff.

Swift Numerics -> biginteger branch was not tested as there is a high probability that their result would not be correct. It is ~10% faster than Attaswift and much slower than Violet, which makes me believe that we would not get any new information.

Numberick by @oscbyspro:

  • this seems to be a new kid on the block, so I do not know much about them.
  • UIntXL is BigUInt. I don't know if BigInt is available. We are only dealing with positive numbers, so sign overhead should be negligible anyway.
  • their stack traces are so long that my Instruments broke. I just could't see the inner loop, so I don't know what is happening inside.

Violet is an implementation written by me. It is written against the same machine that produced the results above. This may explain why it is so fast. The implementation used is Violet-BigInt-XsProMax, not the Violet one. Differences:

  • Violet - under the hood it is a union (via tagged pointer) of Int32 (called Smi, after V8) and a heap allocation (magnitude + sign representation) with ARC for garbage collection. This means no heap allocations in Int32 range, but it assumes 64 bit pointers. More information in the documentation.

  • Violet-BigInt-XsProMax - no Smi, only heap. Slightly faster than Violet, because it does not have to check which representation we are dealing with.

The Int32 stuff do not matter in this test, so I went with Violet-BigInt-XsProMax (Apple naming scheme, don't blame me). I will talk more about this a bit later in this post.

Attaswift - this is the original BigInt (the one used by OP), so lets focus on this. Their layout boils down to:

// https://github.com/attaswift/BigInt
// LICENSE: https://github.com/attaswift/BigInt/blob/master/LICENSE.md
struct BigInt: SignedInteger {
  var magnitude: BigUInt
  var sign: Sign
}

struct BigUInt: UnsignedInteger {
  typealias Word = UInt

  enum Kind {
    case inline(Word, Word)
    case slice(from: Int, to: Int)
    case array
  }

  var kind: Kind // Internal for testing only
  var storage: [Word] // Internal for testing only; stored separately to prevent COW copies

  subscript(_ index: Int) -> Word {
    get {
      // @LiarPrincess: Simplified for readability
      switch kind {
      case .inline: …
      case .slice: …
      case .array: …
      }
    }
    set(word) { … }
  }
}

If we run fibonacci(500000) and print the result, their inner loop for addition will take 87.6%:

Those subscripts are heavy! The same thing in Violet would spend 62.6% in print and 36.6% for fibonacci(500000):

It is a little bit pointless to talk about the code generated by Attaswift, but Violet gives us:

.LBB1_6:
mov     r9, qword ptr [rdx + 8*rcx]
mov     r10, qword ptr [rdi + 8*rcx]
mov     r11, qword ptr [rdi + 8*rcx + 8]
lea     rbx, [r10 + r9]
xor     r14d, r14d
add     rax, rbx # Carry from the previous iteration
setb    r14b
mov     qword ptr [rdi + 8*rcx], rax # lhs[i] = …
mov     rax, qword ptr [rdx + 8*rcx + 8]
lea     rbx, [r11 + rax]
add     r10, r9
adc     r14, rbx
setb    r9b
add     r11, rax
movzx   eax, r9b
adc     rax, 0   # Carry for the next iteration
mov     qword ptr [rdi + 8*rcx + 8], r14 # lhs[i+1] = …
add     rcx, 2   # Increment index
cmp     rsi, rcx # Loop condition
jne     .LBB1_6

It handles 2 digits per iteration. Remember the "reorder lines -> 20% slower" comment in the Violet code above? This is exactly this. If you paste this code in godbolt.org and reorder those lines you will see that add rcx, 2 became inc rsi and only a single digit was handled during a single iteration. Calculation itself is done by lea followed by add; setb or add; adc rax, 0 for carry, how much it matters depends on the CPU.

At the same time GMP calculates 4 digits in a single iteration with 4x adc one after the other. I can't paste the code here because of the license, but here is the link.

That said, focusing on addition is a little bit pointless, as mul/div are the real heavy hitters in most of the workloads, and even a few adds will not out-weight a single mul. Attaswift vs Violet in mul/div:

That said, even if Attaswift used a simple Array/ContiguousArray is would still have some problems. For example isUniquelyReferenced (or whatever Array does) check can be merged with append/reserveCapacity for carry. This could potentially save us a copy.

Also, setting an index in an Array does another isUniquelyReferenced check which the complier will not move out of the loop. In Violet COW is done before we even enter the hot functions. There is a system where calling storage.guaranteeUniqueBufferReference (with the optional capacity argument) gives us a token which is required to call a mutating operation. This way the compiler forces us to do COW checks. This token should be move-only, but this code was written before we had them in Swift (borrow = mutation, consume = storage reallocation + new token is issued).

Also, as already pointed out in this thread, BigInt operations should strive to be inout to avoid buffer allocations. The whole BigInt has to be written with this in mind.

This gives us:

  • Attaswift - weird; smaller counts are faster, but bigger are slower. I have no idea. I expected no changes, though the variance is quite big on the smaller sets, so that may be it.
  • Numberick - ~5% improvement in 2 last rows.
  • Violet - ~30% improvement in 2 last rows. The whole implementation was written with inout in mind. The original code (the 1st table in this post) was using the library incorrectly. It is unfortunate that BigInt has Int in its name, because people will assume that they are as "cheap" as any other Int.

GMP also allows you to alias the lhs/result pointers, but remember that if you are using the inner mpn functions then the ranges should not overlap. Or just use the safe mpz.

Moral of the story: use inout for BigInt, even if the library does not optimize this (yet…).

Btw. there is a tiny gotcha in the inout method if you have enum based storage like Attaswift has: BigUInt.Kind: inline | slice | array, as switch can increase the reference count making inout always work on +2 which will force COW. Violet has the same problem with the smi|heap representation (Smi is an inlined Int32 instead of the pointer). To solve it I temporary assign Int32.zero:

public static func += (lhs: inout BigInt, rhs: BigInt) {
  // lhs is on +1
  switch (lhs.value, rhs.value) {
  case let (.smi(lhsSmi), .smi(rhs)): …
  case (.smi(let lhsSmi), .heap(var rhs)): …

  case (.heap(var lhsHeap), .smi(let rhs)):
    // lhs is +2
    Self.releaseBufferToPreventCOW(&lhs)
    // lhs is +1
    …

  case (.heap(var lhsHeap), .heap(let rhs)):
    // lhs is +2
    Self.releaseBufferToPreventCOW(&lhs)
    // lhs is +1
    …
  }
}

private static func releaseBufferToPreventCOW(_ int: inout BigInt) {
  let zero = Smi(Smi.Storage.zero)
  int.value = .smi(zero)
}

(Disclaimer: I changed bullet points to numbers in their post, for easier reference).

  1. No, but GMP does
  2. No
  3. No SIMD
  4. Check assembly, Swift compiler will usually leave unique check inside the loop. Violet has an over-engineered solution which gives you control over this.
9 Likes

String + print

(Btw. have you noticed something about the Numberick and Violet results?)

When it comes to the implementation: the simplest case is when the radix is a power of 2. In such situation we can use the linear algorithm where we treat the whole BigInt as a bit buffer and group those bits according to radix. For radix is 16 we group by 4 bits etc. The tiny edge case is when the radix bit count is not a multiply of the digit size. For example:

  • digit is 64bit
  • radix is 8 -> group 3 bits together
  • 64/3=21, the remaining bit has to be grouped with bits from the next digit

With this Violet is 3210x faster than Swift Numerics for radix 16:

Following implementations have linear power of 2 algorithms (link goes to the implementation):

  • Violet - you need to use the BigInt overload: String(_ value: BigInt, radix: Int = 10, uppercase: Bool = false)
  • v8
  • GMP

But this is not our case, because we print in radix 10 which is not a power of 2. Side-note: printing BigInt is not that common. It is mostly used for debug, but even then it is not like humans can read 100 digits. Most of the time BigInts are there "in the background" and we never print them.

Anyway, Node beats GMP in some cases. Links to implementation:

I will use Node as the base of explanation.

First of all they have template <digit_t radix> DivideByMagic(RWDigits rest, Digits input, char* output). Division is important for printing and unfortunately is is very expensive. In Violet 59% of the whole fibonacci program is spend in division intrinsic:

One of the ways to make division cheaper is to divide by a constant. This way the compiler can reduce the division into something that gives the same result but is much cheaper, for example modular multiplicative inverse. This is what DivideByMagic does.

If you want to see this in action then: some time ago I had to write an algorithm to count the trailing zeros in Int64. This can be done via binary search by dividing by powers of 10. Unfortunately divisions are expensive. But if you unroll the whole algorithm and use constants then no division will be performed. You can paste this code in godbolt.org to check.


This definitely makes things faster, but the actual difference is the algorithm.

Classic algorithm (used by Attaswift, Numberick, Violet) performs n steps where each step does quotientAndRemainder to extract rightmost digit. Obviously a "digit" is a whole UInt64 not a decimal digit. Example implementation here. In total we have:

  • n steps
  • each steps performs up to n divisions with UInt/UInt operands

The optimized algorithm (used by Node and GMP) is divide and conquer. Node has a fantastic documentation here. Basically you take the whole BigInt and split it in half by dividing by a big enough power of radix. And then you work on those halfs.

There are multiple reasons. I believe that once you reach a certain size divide and conquer is the biggest factor.

I think (but I have not looked carefully) that this is the classic algorithm mentioned above. The only difference is that Attaswift uses a temporary parts: [String].

Violet/Numberick also use the same algorithm, and even without those allocations they are 100x slower than divide and conquer.

I don't know what Swift does, but in Python:

ma = "ma"
print(f"ala {ma} kota")

Gives us following bytecode:

0           0 RESUME                   0

2           2 LOAD_CONST               0 ('ma')
            4 STORE_NAME               0 (ma)

3           6 PUSH_NULL
            8 LOAD_NAME                 1 (print)
            10 LOAD_CONST               1 ('ala ')  # Put 'ala ' on stack
            12 LOAD_NAME                0 (ma)      # Put variable 'ma' on stack
            14 FORMAT_VALUE             0           # Convert variable 'ma' to string
            16 LOAD_CONST               2 (' kota') # Put ' kota' on stack
            18 BUILD_STRING             3           # Concatenate 3 values from the stack
            20 CALL                     1           # Call 'print'
            28 POP_TOP
            30 RETURN_CONST             3 (None)

At the same time:

ma = "ma"
print(f"{ma}")

Gives us:

0           0 RESUME                   0

2           2 LOAD_CONST               0 ('ma')
            4 STORE_NAME               0 (ma)

3           6 PUSH_NULL
            8 LOAD_NAME                 1 (print)
            10 LOAD_NAME                0 (ma)    # Put variable 'ma' on stack
            12 FORMAT_VALUE             0         # Convert variable 'ma' to string
            14 CALL                     1
            22 POP_TOP
            24 RETURN_CONST             1 (None)

Even Python optimizes this, so the above code will not copy the string.

Compiler cannot assume that += and + are related. It may optimize things in the most simplest cases, but not with logic as complicated as BigInt. The whole BigInt has to be designed to support efficient +=.

What to use? (spoiler: GMP)

Ok, let's say that we have to choose some BigInt library. Which one?

First of all there is no "one BigInt library to rule them all". It all depends. As already said Violet has 2 versions:

In most of the tests "Violet-BigInt-XsProMax" is slightly faster because it does not have to check with which representation we are dealing with. But in swift-numerics/pull/120 @xwu proposed a π calculation test.

In this test "Violet" is much faster than "Violet-BigInt-XsProMax", getting similar results as much more optimized Python (the numbers that we are dealing with are huge, and Violet naive div is slow). The reason is the input distribution for multiplication. For example for test_pi_5000 (row/column is operand length, value is operation count; full distribution for all operations is available here):

lhs/rhs 0
0 8067
200 6795
400 6470
600 6234
800 6069
and so on…

The rhs is always a single Word, most probably in Int32 range. This is exactly the case that Violet was optimized for because it does not have to allocate memory.

Bottom line is: know your number distribution.

  • BigInt
    • Assumes that most of the code will use built-in Swift integer types. BigInt will be used, but only in places that can overflow.
    • Focuses on performance for big numbers, since small ones will be handled using Swift types.
    • Is more of a auxiliary type, than a thing on its own.
  • Unlimited/general purpose integer (I made-up this name, so don't Google it):
    • Assumes that this is the only integer type available in the system.
    • Will probably optimise for small numbers (think: 0, 1, 256 etc.) rather than big ones (think: 123_123_123_123_123_123).

Fibonacci is BigInt, but if we were implementing a math library and wanted to give our users a nice Int type then "general purpose integer" could be better. Interestingly Python int type is unlimited - there is no BigInt library because the default int is a BigInt.

As far as currently available Swift libraries go:

  • :red_circle: Swift numerics/biginteger - please don't

  • :yellow_circle: Numberick by @oscbyspro

    • I don't know anything about their implementation
    • they only have BigUInt? I wanted to run Violet tests on it and they require BigInt
    • for mul they use Karatsuba above 20 words - this is around the expected value (usually between 20 and 40 depending on the implementation details). Attaswift limit is 1024, which is uncommon.
  • :green_circle: Attaswift

    • Violet tests were merged, but some of them were commented

    • #115 bitWidth is wrong for -(power-of-two) by @wadetregaskis is related to those commented tests from Violet

    • #99 [Violet] Node tests says that xor may sometimes be incorrect (PR contains detailed description):

      // Attaswift returns '0'
      self.xorTest(lhs: "-1", rhs: "18446744073709551615", expecting: "-18446744073709551616")
      
    • #99 [Violet] Node tests says that shift right uses different rounding than Swift (PR contains detailed description):

      // Attaswift returns '60397977'.
      self.shiftRightTest(value: "-1932735284", count: 5, expecting: "-60397978")
      

      Attaswift is slightly inconsistent with the rest of the Swift:

      Engine Result
      attaswift/BigInt -60397977
      Wolfram Alpha -60397977
      Node v17.5.0 -60397978
      Python 3.7.4 -60397978
      Swift 5.3.2 -60397978
      Violet -60397978

      (Yes, all of the results in the table are correct.)

      I think that attaswift uses sign + magnitude representation. If it was 2 complement then everything would be trivial, but it is not, so sometimes you need an adjustment: Swift uses what would be GMP_DIV_FLOOR mode in GMP. Which means that if we are negative and any of the removed bits is 1 then we have to round down.

Anyway, if my goal was to use 100% Swift then I would go with Attaswift.

That said, I don't think anyone goal is to use 100% Swift, but to use the best possible implementation. Because of that I would recommend using GMP wrapped in a class. It is easily the fastest and the most battle-tested option out there. Just check if it is supported on your platform.

If you need BigInt, but you do not need a full BigInt, then you may look at the wider fixed width integers:

  • stdlib will get Int128/UI128 soon - I tried 4 or 5 builds (as recent as from 2 weeks ago), and all of them crashed
  • DoubleWidth solutions, for example:
    • swift-numerics - few months ago I opened an issue about crashes and incorrect results. This uncovered even more issues (see the connected PR). And their test suite is lacking. Just be careful and write tests.
    • Numberick also has one. At some point I opened an issue there, but it was solved quite quickly.
  • DoubleWidth may be slow as soon as you reach multiplication on UInt256, so it may be better to just use 4x UInt64 and schedule 16 (independent!) multiplications one after the other.

A quick indicator of the quality is the div function, if you see any >> 1 (which indicates bit-by-bit division) then run away.

Do not go too crazy with fixed width integers, Int128/Int256 are fine, but above that YMMW. I have a Python script that will generate UInt128 (as 2x UInt64) and UInt256 (as 4x UInt64) for me. I never had to use anything bigger.

Fin

With all that it is clear that the community should drop everything, cancel WWDC and start implementing a good BigInt.

Please don't.

At least I don't think this a good idea. I have never used BigInt in a business scenario, and a lot of the programs that use it are artificial Fibonacci-like benchmarks. There are some real-world use cases, but they can be solved with GMP.

:point_up::arrow_up::up::outbox_tray::small_red_triangle::point_up_2: the decimal thingie.

Support for IEEE 754 decimal sucks across the board, not only in Swift.

  • Intel library should work as long as you are on supported platform and can FFI to C
  • C folks are working on it
  • (as you pointed out in another thread) Mongo supports it as Decimal128 via BID.

Anyway, for my needs I implemented it myself, and now I diligently ignore 70% of the library, because only +-*/, truncatingRemainder and quantize are useful.

7 Likes

I pondered that. I think it can, because if the results don't match then the implementation is incoherent anyway. I don't think even "rounding error" should be a permitted excuse - this isn't a fused multiply-add or whatever, this is logically the exact same single arithmetic operation either way, and it should only differ in what memory address the result is written to.

But I'm yet to see any situation where Swift does change to use one instead of the other. I suppose it'd have to somehow know which implementation is better, which should be += in principle but in practice sometimes isn't.

Even without that optimisation, it seems like the difference is small for any decent implementation, because of course + is basically always going to be:

@_transparent
public static func + (a: T, b: T) {
    var result = a
    a += b
    return a
}

I suppose it's possible with tail allocation to do it differently in order to achieve better performance, but otherwise the compiler does a pretty good job with the above (in release builds).

…and that it's legal to use. It's GPLv3, so it's unusable for most software.

It seems like it wouldn't be that hard to essentially replicate (if not better) GMP's optimisations in Swift - these BigInt types aren't that complicated. Maybe not pure Swift (no assembly intrinsics) but mostly Swift with some ASM.

Or Universal Paperclips clones - that's how I got pulled into the BigInt world. :laughing:

(UInt128 still isn't big enough for that use - you need UInt256, as you need to count up to 3e55 but UInt128.max is ~3.4e38)

The compiler is not licensed to do this transformation in general; even if we had a way to promise that the result is the same, one or both of the operators might have observable side-effects (e.g. a printing something to a log for debugging). It's interesting to think about how you would design a language that lets you tell the compiler that it can do this transformation, but Swift has no mechanism to allow it today.

4 Likes

It's not hard to make common cases as fast as GMP. When you get further off into the algorithmic weeds (huge unbalanced division, enormous multiplications, more obscure operations) it's important to keep in mind that GMP represents a couple decades worth of active development by a small but quite-skilled team. A lot of that effort has gone into micro-optimization for architectures that don't exist any longer, but a lot of it is fundamental algorithmic work that would take some real effort to match.

1 Like

Thank you for your extremely detailed and technical post. Most of it goes over my head but is very interesting from a algorithmic point of view. Of course, the other technologies in my humble benchmarks could do the same kinds of optimisations.

And of course, if I would display the Fibonacci result in hexadecimal or binary format, a lot of conversions would not be necessary. But as these conversions are also necessary in the other frameworks, the comparison should be similar. If optimised correctly.

Thanks for joining and enhancing the discussion!

2 Likes

Hello everyone,
I've patched wrk with @wadetregaskis patches, enhanced the memory measurements and ran all the benchmarks again with the upgraded Vapor to 4.96 and the Numbericks library. Sorry @0xTim , not 4.100 yet: I will do that after my holidays)

Things are looking very much brighter for swift with these changes. You can see that the best in this memory graph.

14 Likes

Minor typo in your article too, this should be 4.96. The newer version is 4.100 (now 4.101).

1 Like

Thanks!

1 Like