Response to ""very high performance" rebuttal"
My improvement uses 180 less instructions (~5%)
180? I haven't checked but that sounds suspiciously too high considering godbolt shows essentially the same number of assembly lines.
at least on my machine
Well, that's the keyword. You need to prove the significance of your machine. Does it use a CPU with similar properties compared to a typical server? Or ...? I know swift-endpoint's CI benchmarking machine does.
which improves throughput by 2 samples using the project's throughput benchmark (30 million more operations in the same time-frame on my machine).
I had already tried your changes, yesterday. It didn't make a visible difference in CI: Try Swift Forums suggestion by MahdiBM · Pull Request #21 · swift-dns/swift-endpoint · GitHub
Some months ago, I tried fully inlining that loop altogether. It didn't make a visible difference either, so that's why the loop is still there. Not because I didn't think it's faster, but because I didn't want such a weird-looking code for such low perf gains (to be clear, weird-looking is not true for your code). So the thing is, it needs to be visible in the benchmarks. Otherwise the change isn't sustainable and the minor improvements can be accidentally reverted in the future.
I'm assuming the reason for "less instructions" in your code is likely because of the byteSwapped, which means you don't need to reverse and index each time. If the inlined-loop was there, the byteSwapped would have only been some additional instructions and wouldn't have helped.
Also so we're clear, swift-endpoint has a custom UInt128 implementation (backed by 2xUInt64 to be able to dodge Swift's UInt128 macOS 15 requirements for IPv6's storage), so you can't be thinking that I didn't know of existence of byteSwapped or such.
I give it to you that your idea was good. But I'm still not convinced that it's worth a change.
Also you already had a pretty decent code, if I may say so myself, to start with. You didn't have to work it out from the beginning, which could have possibly resulted in you going a totally different/worse way. So even if you're proud of your code there, I'm inclined to take part of the credit :)
After all, there is also a compiler in between the code that we write. At any point it could have taken a decision to inline the whole loop. Or maybe it does in a future version. Then the current code will be looking better since it skips one byteSwapped. You're also using bitshifts compared to simply loading the values from the memory which the current code does. Now the compiler is likely smart enough to know what's going on in here, but I doubt the bitshifts have anything to do with a performance improvement, considering what the current code does is to directly load from memory.
When I say very high performance, I am talking about reducing total instructions executed
less instructions executed does not mean better performance
The point here is that we can argue about this all-day long. I think there has been a misunderstanding about what actually I meant by "very high performance". In these kinds of situations, I always take at least part of the blame for not being clear enough. For the most part, it was referring to the fact that the implementations beat the counterpart C implementations.
Can they be improved? yes. But how far should I go? Do I go for inlining assembly in Swift files and call it "Swift"? Or how far do I go for looking into what the compiler likes to produce given some code? I have gone pretty deep into this, but still, there is a point that I need to stop.
Realistically too, I'm also under real-world restrictions. For example about how to benchmark the code. The CI machine already costs $18-20/month out of pocket. I can't simply just add all possible benchmark types to the CI, that'd take too long, and will also take more time to maintain. I had already considered adding instructions count to the CI since it's a fine unit of measurement. I can't exactly recall why that didn't happen, but probably it was to save some CI time. I remember at some point I was struggling with benchmark times and that's partially why swift-idna was moved to its dedicated repository.