Outlined copy is a good example of the optimiser actually working. The "outliner" is the part of the compiler that spots repeated code sequences and, instead of inlining them everywhere, moves them into a single function that each site calls, producing a smaller binary. These smaller binaries are often a perf win, not least because the cost of an x86 function call itself is very low, especially when the call is direct. Function calls are not a huge performance cost in themselves, except where they act as optimisation boundaries.
I understand the feeling, but that's not quite an accurate representation. There are a few reasons why.
First, checks. You have a few checked operations here: `self.idx += 2` inserts a branch for overflow checking, and `self.idx + 1` (which appears twice) introduces another such branch. These branches are extremely cheap, as they will never be taken in real code, but they do increase the complexity of the generated code.
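To make the checked/unchecked distinction concrete, here is a small sketch (the variable names are illustrative, not taken from your code): Swift's ordinary arithmetic operators trap on overflow, which is what generates those branches, while the `&`-prefixed overflow operators wrap instead and compile to a plain add.

```swift
// Swift's `+=` and `+` are overflow-checked: each compiles to an add
// followed by a branch that traps if the result overflowed.
var idx: Int = 0
idx += 2              // checked: inserts an overflow branch
let next = idx + 1    // checked: another overflow branch

// The `&`-prefixed operators wrap around instead, omitting the check.
let wrapped = UInt8.max &+ 1   // wraps to 0 rather than trapping
```

In hot code where the compiler cannot prove overflow is impossible, these branches are the price of Swift's default safety; they are almost never worth removing by hand.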
The bigger cost is the subscript work, which looks like this:
```
        callq   outlined copy of Foundation.Data._Representation
        movq    %r15, %rdi
        movq    %rbx, %rsi
        movq    %r14, %rdx
        callq   Foundation.Data.subscript.getter : (Swift.Int) -> Swift.UInt8
        movl    %eax, %r15d
        movq    %rbx, %rdi
        movq    %r14, %rsi
        callq   outlined consume of Foundation.Data._Representation
```
The dance here is that we take a copy of the backing representation of `Data`, then call the subscript to get a `UInt8`, and then consume the copy. This has to happen for two reasons. First, `Data`'s backing storage is reference counted, and if the compiler cannot prove that this operation does not need to copy-on-write, it has to manage the reference counting appropriately (that's what the `copy`/`consume` operations are doing).

The second is the `Data` subscript itself. While this subscript is `inlinable`, I suspect it exceeds the inlining threshold and doesn't get inlined. This is because `Data` is a very complex data type with a wide range of possible representations, so the performance benefit of inlining the code would almost certainly be outweighed by the increase in code size and complexity.
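To see why the compiler emits that copy/consume bookkeeping, here is a minimal copy-on-write sketch (this is not `Data`'s real implementation, just the general shape of any CoW value type backed by a reference-counted class):

```swift
// A reference-counted backing store, standing in for Data's storage.
final class Storage {
    var bytes: [UInt8]
    init(_ bytes: [UInt8]) { self.bytes = bytes }
}

// A hypothetical CoW wrapper illustrating the pattern.
struct MiniData {
    private var storage: Storage
    init(_ bytes: [UInt8]) { storage = Storage(bytes) }

    subscript(i: Int) -> UInt8 {
        // A read is just a retain/release pair around the access --
        // the "copy" and "consume" of the representation.
        get { storage.bytes[i] }
        set {
            // A write must first check whether the storage is shared,
            // and copy it if so, to preserve value semantics.
            if !isKnownUniquelyReferenced(&storage) {
                storage = Storage(storage.bytes)
            }
            storage.bytes[i] = newValue
        }
    }
}
```

Unless the optimiser can prove the storage is uniquely referenced across the whole operation, it has to keep the retain/release traffic you see in the assembly.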
You can see this by considering this Compiler Explorer project that replaces your use of `Data` with `Array<UInt8>`, a vastly less complex data type. Here we get the following generated assembly:
```
output.Buffer.get() -> Swift.UInt16:
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13]
        mov     rcx, qword ptr [r13 + 8]
        mov     rdx, qword ptr [rax + 16]
        cmp     byte ptr [r13 + 16], 1
        jne     .LBB14_4
        cmp     rcx, rdx
        jae     .LBB14_10
        lea     rsi, [rcx + 1]
        cmp     rsi, rdx
        jae     .LBB14_11
        movzx   eax, word ptr [rax + rcx + 32]
        rol     ax, 8
        jmp     .LBB14_7
.LBB14_4:
        cmp     rcx, rdx
        jae     .LBB14_8
        lea     rsi, [rcx + 1]
        cmp     rsi, rdx
        jae     .LBB14_9
        movzx   eax, word ptr [rax + rcx + 32]
.LBB14_7:
        add     rcx, 2
        mov     qword ptr [r13 + 8], rcx
        pop     rbp
        ret
.LBB14_10:
        ud2
.LBB14_11:
        ud2
.LBB14_8:
        ud2
.LBB14_9:
        ud2
```
This is much closer to where you wanted to end up. We have a few extra `ud2` instructions because the subscripts are now inlined and so need their own bounds checks, but otherwise you get exactly what you'd expect to see: bounds checks, followed by loads and (in the big-endian case) a `rol`.
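Since your original `Buffer` code isn't reproduced in this answer, here is a hypothetical reconstruction consistent with that assembly (the names, the endianness flag, and the storage layout are all assumptions): a 16-bit read built from two byte loads, with the byte swap in the big-endian case corresponding to the `rol ax, 8`.

```swift
// Hypothetical reconstruction of the kind of reader the assembly suggests.
struct Buffer {
    var data: [UInt8]
    var idx: Int = 0
    var bigEndian: Bool

    mutating func get() -> UInt16 {
        // Two bounds-checked byte loads (the jae/ud2 pairs in the assembly).
        let b0 = UInt16(data[idx])
        let b1 = UInt16(data[idx + 1])
        idx += 2   // the `add rcx, 2` before the return
        // A little-endian 16-bit load followed by `rol ax, 8` is the same
        // as swapping the two bytes, i.e. the big-endian interpretation.
        return bigEndian ? (b0 << 8) | b1 : (b1 << 8) | b0
    }
}
```

With `Array<UInt8>` the whole thing inlines down to the handful of instructions shown above, which is exactly the contrast the Compiler Explorer comparison is making.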