Difference in generated code - Swift and C

Hi all,

I wanted to check the generated code of a loop over UInt32 in Swift vs what was generated in the equivalent C function. I was surprised at the complexity of the generated Swift code. What am I doing wrong?

Swift code: https://godbolt.org/g/eoxthS

C code: https://godbolt.org/g/FHfS5h

(Note: I know the Swift code – in particular, the loop – is not the “right” way to do it, but I wanted to get it as close to the C code as possible.)

Try passing -O to the Swift version.

it’s still substantially longer than the C version

Thanks. Here’s what happens when I optimize both:

Swift -O : https://godbolt.org/g/pA9xR7
C -O : https://godbolt.org/g/rUjLNQ

The relevant part is of comparable length:

function signature specialization <Arg[0] = Owned To Guaranteed> of output.sumit(n: [Swift.UInt32], len: Swift.Int) -> Swift.UInt32:
  push rbp
  mov rbp, rsp
  test rsi, rsi
  js .LBB2_8
  test rsi, rsi
  je .LBB2_2
  mov rax, qword ptr [rip + _T0s27_ContiguousArrayStorageBaseC16countAndCapacitys01_B4BodyVvpWvd@GOTPCREL]
  mov rax, qword ptr [rax]
  dec rsi
  cmp rsi, qword ptr [rdi + rax]
  jae .LBB2_9
  mov eax, dword ptr [rdi + 32]
  add rdi, 36
.LBB2_7:
  test rsi, rsi
  je .LBB2_3
  dec rsi
  add eax, dword ptr [rdi]
  lea rdi, [rdi + 4]
  jae .LBB2_7
  ud2
.LBB2_2:
  xor eax, eax
.LBB2_3:
  pop rbp
  ret
.LBB2_8:
  ud2
.LBB2_9:
  ud2

As of Swift 4.2 “guaranteed” will be the default argument-passing convention, so the non-“function signature specialization” entry point for sumit will no longer be needed. The optimized Swift code has some additional branches due to array bounds checks, but looks reasonable. If you use a regular for x in array loop over the array then the bounds checks ought to get optimized out, since it’s known that a for loop will never go out of bounds.

if you use the actual equivalent C code

func square(numbers:UnsafePointer<UInt32>, count:Int) -> UInt32 {
    var sum:UInt32 = 0
    
    for i in 0 ..< count
    {
        sum += numbers[i]
    }

    return sum
}

you actually get code that’s about the same length as the C code (in fact it’s exactly the same number of instructions i think). Swift Arrays are nowhere near zero-cost abstractions apparently.

indeed: https://godbolt.org/g/oGZM5Q . Thanks.

Of course: https://godbolt.org/g/BNyRrx. Thank you.

While there’s some inherent overhead in getting to the array buffer compared to a raw pointer, the inner loop code is pretty close to what you’d get with C. Also, note that in many situations using higher-level abstractions on Array and similar constructs can end up producing better code than low-level iteration, because they can make higher-level safety assumptions about things like array bounds that can’t necessarily be made about arbitrary integer indexing.

This is ok, actually. I am new to Swift (in case that wasn’t already obvious) and am trying to implement some code that can (and should) be optimized really well. I want to make sure I’m giving Swift a fair shake as I compare it to C++ code that’s been through the optimization wringer.

For reference, I did the same thing with Go, and it resulted in this: https://medium.com/@egonelbre/a-tale-of-bfs-4ea1b8ab5eeb

2 Likes

Cool. If you do find places where you expect Swift code to get well-optimized but your expectations aren’t met, we’d appreciate getting bugs reported to bugs.swift.org, or test cases submitted to our compiler microbenchmark suite at https://github.com/apple/swift/tree/master/benchmark.

5 Likes

fair enough

i’m just trying to understand this assembly here:

movq _T0s27_ContiguousArrayStorageBaseC16countAndCapacitys01_B4BodyVvpWvd@GOTPCREL(%rip), %rax

why are we loading things from the instruction pointer?

  movq (%rax), %rax
  movq (%rdi,%rax), %rcx
  testq %rcx, %rcx
  je .LBB2_1
  movl 32(%rdi), %eax
  addq $36, %rdi
  decq %rcx

I assume this has something to do with arrays keeping their count in a header attached to their backing storage,, or?

.LBB2_4:
  testq %rcx, %rcx
  je .LBB2_2
  decq %rcx
  addl (%rdi), %eax
  leaq 4(%rdi), %rdi
  jae .LBB2_4
  ud2

why did the compiler insert an invalid instruction right after the end of the loop? it never gets executed,, but why is it there?

That’s the normal pattern for loading the address of symbols in position-independent code. %rip-relative addressing is no more expensive than absolute addressing. If you demangle that symbol name, you’ll see it’s direct field offset for Swift._ContiguousArrayStorageBase.countAndCapacity; we load the field offset from a global variable so that the internal layout of the storage type can be changed in the future. (We probably don’t need this flexibility for Array’s buffer type, though; cc @Slava_Pestov). As you surmised, this leads us to where an array keeps track of the size of its backing store.

That’s an overflow check. if addl (%rdi), %eax causes the UInt32 in %eax to overflow, it’ll set the carry flag. jae is really jnc in disguise—it jumps if the carry flag is not set.

why is this not done for the 32 and $36? And why does it use addq to increment the array index for the first element and leaq for the rest? In fact why is the first element treated differently at all?

That initial computation is used to find the offset of the buffer itself. We use some builtins that allow for tail-allocation of the buffer that probably assume the root class instance is fixed-size (all the more reason to stop pretending the offset of countAndCapacity might change). lea was likely used in the inner loop because it doesn’t disturb the flags register, so the generated code can advance the pointer through the array buffer and still do the overflow check on the preceding add, combining the happy path of the overflow check with the back edge repeating the loop.

okay so let me see if i got this right

  movq _T0s27_ContiguousArrayStorageBaseC16countAndCapacitys01_B4BodyVvpWvd@GOTPCREL(%rip), %rax

This puts the address of a global variable containing offsetof(_ContiguousArrayStorageBase.countAndCapacity) into rax

  movq (%rax), %rax

This dereferences the variable putting the value of offsetof(_ContiguousArrayStorageBase.countAndCapacity) into rax

  movq (%rdi,%rax), %rcx

This extracts countAndCapacity from the given array and puts… the count in rcx? (what happened to the capacity?)

  testq %rcx, %rcx
  je .LBB2_1
  movl 32(%rdi), %eax
  addq $36, %rdi

This gets the first element from the array which lives at offset +32 in the array storage and sets the element pointer to point to the second element.

So, does this mean Swift doesn’t know exactly where count and _capacity live inside the array storage, but it knows that they must be within the first 32 bytes since everything after that is always the elements?

That’s correct. We only need the capacity (which is the full size of the allocation) when appending new data to the array; to iterate through the existing value, we only need the count (which is the size of the part that has valid data).

This should get you to the first element of the array.

In general we want people to be able to change the layout of their classes without breaking ABI, like you can in Objective-C. We use those same code paths for properties in the standard library, even though we have no plans to really change the layout of array buffers in practice.

How does it make sense to allow the layout to vary, but not the size? I thought the point of changing layouts was to make structs more compact…

This is kind of off topic but why are the count and capacity combined into countAndCapacity? Does the capacity live at 8(%rdi, %rax) then? I tried looking through the sources for the Array core types and stuff but wasn’t able to make sense of them

It doesn’t. AFAIK we haven’t implemented the “fixed layout” variant for classes, though it would totally make sense for these to be.

@Ben_Cohen or @moiseev might be able to answer this.

so, you can still break things by adding more members to a class?