Outlined copy is a good example of the optimiser actually working. The "outliner" is a piece of the compiler that spots repeated code invocation that can lead to smaller binary size if, instead of being inlined everywhere, it can be moved into a function that is called. These smaller binaries are often a perf win, not least because the actual cost of calling an x86 function itself is very low, especially when the call is direct. Function calls themselves are not a huge performance cost, except where they provide optimisation boundaries.
I understand the feeling, but that's not quite an accurate representation. There are a few reasons why.
First, checks. You have a few checked operations here: self.idx += 2 inserts a branch for overflow checking, and self.idx + 1 (repeated twice) also introduces such a branch. These branches are again extremely cheap, as they will never be taken in real code, but they increase the complexity of the generated code.
The thing is the subscript work, which looks like this:
The dance here is that we take a copy of the backing representation of Data, then call the subscript to get a UInt8, and then consume the copy. The reason this has to be done is twofold. Firstly, Data's backing storage is reference counted, and if the compiler cannot prove this operation does not need to CoW it needs to appropriately manage the refcounting (that's what the copy/consume operations will be doing).
The other one is the Data subscript. While this subscript isinlinable, I suspect it exceeds the inlining threshold and doesn't get inlined. This is because Data is a very complex data type with a wide range of possible representations, so the performance benefit of inlining the code would almost certainly be outweighed by the code size and complexity increases.
You can see this by considering this compiler explorer project that replaces your use of Data with Array<UInt8>, a vastly less complex data type. Here we get the following generated assembly:
This is much closer to where you wanted to end up. We have a few extra ud2 instructions due to the subscripts getting inlined, as they need bounds checking now, but otherwise you get exactly what you'd expect to see: bounds checks, followed by loads and (in the big-endian case) a rol.
It'd be cool if I could avoid the bounds check, too (from an optimization exercise PoV), since I can ensure the bounds are valid at a higher level. There's no way to write a non-bounds-checked subscript is there? Something like &, in the spirit of the overflow operators, or some such.
You can drop down to UnsafeBufferPointer if it's important to you to remove bounds checks (it checks in debug builds, but not in optimised ones). If you do that and use wrapping arithmetic (you'd have to ensure idx is never equal to Int.max), I think you get the assembly you expected.
• You might get more speed using the SWIFT_DISABLE_SAFETY_CHECKS flag.
• If you want this to work with different size integers you might look into pre-specializing it with @_specialize, which is documented a bit on the web.
• You might want to try -Osize and see if that’s better than -O.
And, as usual, the hill I will die on:
• I always tell people to try NOT to optimize code they haven’t used in sutu. It’s usually a huge waste of your time, since you don’t actually know how much time this function will take relative to the rest of your program until you actually have the program written. You could spend days worrying about this and find out it’s 0.0001% of the total time to read a file, or that files read in 0.00001 seconds even ‘unoptimized.’ I know it can be a fun game to see how small your code can become (Woz was famous for that, and I learned it from him) but nowadays that’s the opposite of ‘shipping.’ But, if you’re just doing this for fun, go for it!