I wanted to check the generated code of a loop over UInt32 in Swift vs what was generated in the equivalent C function. I was surprised at the complexity of the generated Swift code. What am I doing wrong?
```
function signature specialization <Arg = Owned To Guaranteed> of output.sumit(n: [Swift.UInt32], len: Swift.Int) -> Swift.UInt32:
        mov     rbp, rsp
        test    rsi, rsi
        test    rsi, rsi
        mov     rax, qword ptr [rip + _T0s27_ContiguousArrayStorageBaseC16countAndCapacitys01_B4BodyVvpWvd@GOTPCREL]
        mov     rax, qword ptr [rax]
        cmp     rsi, qword ptr [rdi + rax]
        mov     eax, dword ptr [rdi + 32]
        add     rdi, 36
        test    rsi, rsi
        add     eax, dword ptr [rdi]
        lea     rdi, [rdi + 4]
        xor     eax, eax
```
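For reference, the mangled name above corresponds to `sumit(n: [UInt32], len: Int) -> UInt32`. The "equivalent C function" being compared was presumably something like the following (a hypothetical reconstruction; only the signature is known from the thread):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical C counterpart of the Swift sumit(n:len:) above:
 * sum the first `len` 32-bit elements of the buffer. */
uint32_t sumit(const uint32_t *n, ptrdiff_t len)
{
    uint32_t sum = 0;
    for (ptrdiff_t i = 0; i < len; i++)
        sum += n[i];  /* unsigned add wraps silently in C; Swift traps on overflow */
    return sum;
}
```

At `-O2` a C compiler typically reduces this to the same tight add-and-advance inner loop, minus the bounds and overflow checks that Swift inserts.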
As of Swift 4.2, “guaranteed” will be the default argument-passing convention, so the non-“function signature specialization” entry point for `sumit` will no longer be needed. The optimized Swift code has some additional branches due to array bounds checks, but otherwise looks reasonable. If you use a regular `for x in array` loop over the array, the bounds checks ought to get optimized out, since it’s known that a for loop will never go out of bounds.
While there’s some inherent overhead in getting to the array buffer compared to a raw pointer, the inner loop code is pretty close to what you’d get with C. Also, note that in many situations using higher-level abstractions on Array and similar constructs can end up producing better code than low-level iteration, because they can make higher-level safety assumptions about things like array bounds that can’t necessarily be made about arbitrary integer indexing.
This is ok, actually. I am new to Swift (in case that wasn’t already obvious) and am trying to implement some code that can (and should) be optimized really well. I want to make sure I’m giving Swift a fair shake as I compare it to C++ code that’s been through the optimization wringer.
That’s the normal pattern for loading the address of symbols in position-independent code; `%rip`-relative addressing is no more expensive than absolute addressing. If you demangle that symbol name, you’ll see it’s the direct field offset for `Swift._ContiguousArrayStorageBase.countAndCapacity`; we load the field offset from a global variable so that the internal layout of the storage type can be changed in the future. (We probably don’t need this flexibility for Array’s buffer type, though; cc @Slava_Pestov.) As you surmised, this leads us to where an array keeps track of the size of its backing store.
That’s an overflow check. If `addl (%rdi), %eax` causes the `UInt32` in `%eax` to overflow, it sets the carry flag. `jae` is really `jnc` in disguise: it jumps if the carry flag is not set.
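To illustrate the check being described (this is not code from the thread): unsigned addition in C wraps silently, but the GCC/Clang builtin `__builtin_add_overflow` exposes the same carry information that the Swift-generated code branches on with `jae`/`jnc`:

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if a + b overflows a 32-bit unsigned value, i.e. the
 * addition would set the x86 carry flag that jae/jnc tests.
 * The wrapped sum is written to *sum either way. */
bool add_would_carry(uint32_t a, uint32_t b, uint32_t *sum)
{
    return __builtin_add_overflow(a, b, sum);
}
```

Swift’s `+` on `UInt32` traps when this returns true; the wrapping operator `&+` is the one that behaves like plain C addition.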
That initial computation is used to find the offset of the buffer itself. We use some builtins that allow for tail allocation of the buffer, which probably assume the root class instance is fixed-size (all the more reason to stop pretending the offset of `countAndCapacity` might change). `lea` was likely used in the inner loop because it doesn’t disturb the flags register, so the generated code can advance the pointer through the array buffer and still do the overflow check on the preceding `add`, combining the happy path of the overflow check with the back edge repeating the loop.
This gets the first element from the array, which lives at offset +32 in the array storage, and sets the element pointer to point to the second element.
So, does this mean Swift doesn’t know exactly where `count` and `_capacity` live inside the array storage, but it knows that they must be within the first 32 bytes, since everything after that is always the elements?
That’s correct. We only need the capacity (which is the full size of the allocation) when appending new data to the array; to iterate through the existing value, we only need the count (which is the size of the part that has valid data).
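A rough C sketch of the tail-allocated layout being described (illustrative only; the field names and the 32-byte header breakdown are assumptions inferred from the offsets in the disassembly, not Swift’s actual declarations):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model of a tail-allocated array buffer: a fixed-size
 * header (32 bytes on a 64-bit target, matching the +32 element
 * offset seen in the disassembly) followed directly by the elements. */
struct buffer_header {
    void     *object_header[2];  /* stand-in for class metadata + refcount */
    intptr_t  count;             /* number of elements with valid data     */
    intptr_t  capacity;          /* full size of the allocation            */
};

struct buffer {
    struct buffer_header header; /* 32 bytes on a 64-bit target */
    uint32_t elements[];         /* tail-allocated element storage */
};
```

Only `count` matters when iterating over the existing value; `capacity` is consulted when appending.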
This should get you to the first element of the array.
In general we want people to be able to change the layout of their classes without breaking ABI, like you can in Objective-C. We use those same code paths for properties in the standard library, even though we have no plans to really change the layout of array buffers in practice.
How does it make sense to allow the layout to vary, but not the size? I thought the point of changing layouts was to make structs more compact…
This is kind of off topic, but why are the count and capacity combined into `countAndCapacity`? Does the capacity live at `8(%rdi, %rax)` then? I tried looking through the sources for the Array core types and related code but wasn’t able to make sense of them.