Why are Swift Loops slow?

uint64_t nbrBytes = 4e8;
uint64_t count = 0;
for( uint64_t byteIndex = 0; byteIndex < nbrBytes; byteIndex++ )
{
  count += byteIndex;
  if ( ( byteIndex & 0xffffffff ) == 0 ) { count += 1.3; }   // (AAA)
}

Takes 260 msec.

Btw.: Without the (AAA) line the whole loop is done in 10 μsec. A really clever compiler!
And with “count += 1” instead of “count += 1.3” it takes 410 msec. Very strange.
But this is beside the point here.

Now Swift:
let nbrBytes = 400_000_000
var count = 0
for byteIndex in 0 ..< nbrBytes
{
  count += byteIndex
  if ( ( byteIndex & 0xffffffff ) == 0 ) { count += Int(1.3) }
}

takes 390 msec, about 50% more.

Release build with default options.
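(A minimal sketch of one way such a measurement could be taken in Swift, assuming Foundation's Date for timing; the original post does not show its harness:)

import Foundation

let start = Date()
var total = 0
for byteIndex in 0 ..< 400_000_000
{
  total += byteIndex
  if ( byteIndex & 0xffffffff ) == 0 { total += Int(1.3) }
}
let elapsed = Date().timeIntervalSince(start)
print(total, "in", elapsed * 1000, "msec")   // printing the result keeps the loop's work observable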

Gerriet.

You'll need to read the generated assembly code if you want to analyze performance of this sort of small arithmetic loop. Performance of this kind of code can be greatly affected by small optimization changes.

clang's `count += 1` code is vectorized, and the `count += 1.3` code is not. For whatever reason the vectorization doesn't pay off, and the vectorized loop runs slower (at least on your machine and on my machine). Optimization is hard.

The Swift loop runs slower because Swift performs arithmetic overflow checks that C does not, and in this case swiftc was unable to optimize them all away.

If you use &+ instead of +, or compile with -Ounchecked, then Swift won't perform the overflow checks. Unfortunately in this case you then get the same slower vectorized code from swiftc as you did from clang's `count += 1` case; presumably both clang and swiftc get this pessimization from LLVM. I couldn't find a way to disable LLVM vectorization from swiftc.
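For illustration, a sketch of the loop with the wrapping-addition substitution (the same loop as in the original post, with &+ in place of +; not the exact test code):

var count = 0
for byteIndex in 0 ..< 400_000_000
{
  count = count &+ byteIndex                                   // &+ wraps on overflow instead of trapping
  if ( byteIndex & 0xffffffff ) == 0 { count = count &+ Int(1.3) }
}

The -Ounchecked route needs no source changes: passing that flag to swiftc (e.g. `swiftc -Ounchecked <file>.swift`) removes the checks module-wide.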


--
Greg Parker <gparker@apple.com>   Runtime Wrangler

This is a useless benchmark because the loop does no observable work in either language. The C version, if you look at the generated assembly, in fact optimizes away to nothing:

~/src/s/swift$ clang ~/butt.c -O3 -S -o -
  .section __TEXT,__text,regular,pure_instructions
  .macosx_version_min 10, 9
  .globl _main
  .p2align 4, 0x90
_main: ## @main
  .cfi_startproc
## BB#0: ## %entry
  pushq %rbp
Ltmp0:
  .cfi_def_cfa_offset 16
Ltmp1:
  .cfi_offset %rbp, -16
  movq %rsp, %rbp
Ltmp2:
  .cfi_def_cfa_register %rbp
  xorl %eax, %eax
  popq %rbp
  retq
  .cfi_endproc

It looks like LLVM does not recognize the overflow traps Swift emits on arithmetic operations as dead code, so that prevents it from completely eliminating the Swift loop. That's a bug worth fixing in Swift, but unlikely to make a major difference in real, non-dead code. However, if we make the work useful by factoring the loop into a function in both languages, the perf difference is unmeasurable. Try comparing:

#include <stdint.h>

__attribute__((noinline))
uint64_t getCount(uint64_t nbrBytes) {
  uint64_t count = 0;
  for( uint64_t byteIndex = 0; byteIndex < nbrBytes; byteIndex++ )
  {
          count += byteIndex;
          if ( ( byteIndex & 0xffffffff ) == 0 ) { count += 1.3; }
  };
  return count;
}

int main() {
  uint64_t nbrBytes = 4e8;
  return getCount(nbrBytes);
}

with:

import Darwin

@inline(never)
func getCount(nbrBytes: Int) -> Int {
  var count = 0
  for byteIndex in 0 ..< nbrBytes
  {
          count += byteIndex
          if ( ( byteIndex & 0xffffffff ) == 0 ) {count += Int(1.3);}
  }
  return count
}

exit(Int32(truncatingBitPattern: getCount(nbrBytes: 400_000_000)))

-Joe


The traps aren’t dead if they are reachable. Eventually we will teach LLVM about less-strict trap semantics so they can be speculated and reordered. But for now we can do that sort of thing in SIL in the most important cases.

-Andy


I quite agree that this is useless as a benchmark.
It is an abstraction of some real code, whose common feature is that nothing lengthy happens inside the loop.

Following some very helpful remarks from Greg Parker I was able to improve the Swift code by a factor of 3, making it on par with ObjC (see the sketch after this list):
1. using a signed loop counter (factor of 2)
2. using “count = count &+ something” instead of “count += something” (factor of 1.5).
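For concreteness, a rough before/after sketch of those two changes (a reconstruction, assuming the real code used a UInt64 counter; the function names are made up for the example):

// Before (reconstruction): UInt64 counter, overflow-checked addition
func countBefore(_ nbrBytes: UInt64) -> UInt64 {
    var count: UInt64 = 0
    for byteIndex in 0 ..< nbrBytes {
        count += byteIndex                                // checked add: traps on overflow
        if ( byteIndex & 0xffffffff ) == 0 { count += 1 }
    }
    return count
}

// After: Int counter, wrapping addition
func countAfter(_ nbrBytes: Int) -> Int {
    var count = 0
    for byteIndex in 0 ..< nbrBytes {
        count = count &+ byteIndex                        // wrapping add: no overflow check
        if ( byteIndex & 0xffffffff ) == 0 { count = count &+ 1 }
    }
    return count
}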

So this exercise gave me some very valuable insights into using Swift (or into some shortcomings of the current Swift compiler).

Thanks again to Greg Parker for his comments and suggestions!

Kind regards,

Gerriet.
