A 'standard' Vapor website drops 1.5% of requests, even at a concurrency of 100!

I quickly cooked up a minimal BigInt implementation from scratch (addition only); here's what I got:

Count       Time
10,000      1.834 ms
30,000      10.102 ms
100,000     94.939 ms
300,000     806.729 ms
1,000,000   8.699 s

It's not bad, but not quite as good as the Node.js timings indicated above. In my implementation I've manually unrolled the loop, effectively working with "8×64"-bit ints in the outermost loop (a simplified sketch of the limb loop is shown after the list below). Possible reasons why this implementation is slower than the Node.js one:

  • they might be using asm.
  • they could be using 128-bit ints, while my implementation is using 64-bit ints as the base type.
  • they could be using SIMD tricks.
  • there's some ARC traffic in my implementation.
  • the difference could simply come down to hardware specs (I'm on a 2021 MacBook Pro, which is an M1 Pro @ 3GHz if I'm not mistaken).
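
For context, a stripped-down version of the limb-by-limb addition, without the 8× unrolling or any other micro-optimisations, looks roughly like this (the limb layout and names here are just for illustration, not my exact code):

```swift
// Little-endian array of 64-bit limbs; limbs[0] is the least significant.
// This is the naive one-limb-per-iteration loop; the unrolled version
// processes eight limbs per pass of the outer loop.
func addLimbs(_ a: [UInt64], _ b: [UInt64]) -> [UInt64] {
    let (longer, shorter) = a.count >= b.count ? (a, b) : (b, a)
    var result: [UInt64] = []
    result.reserveCapacity(longer.count + 1)
    var carry: UInt64 = 0
    for i in 0 ..< longer.count {
        let x = longer[i]
        let y = i < shorter.count ? shorter[i] : 0
        // Two overflow-reporting adds: x + y, then + carry from the previous limb.
        let (partial, overflow1) = x.addingReportingOverflow(y)
        let (sum, overflow2) = partial.addingReportingOverflow(carry)
        carry = (overflow1 ? 1 : 0) &+ (overflow2 ? 1 : 0)
        result.append(sum)
    }
    if carry != 0 { result.append(carry) }
    return result
}
```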

BTW, are there GPU-based BigInt implementations in the wild, and how do they compare to CPU-based implementations?