Sure. You can write out the arithmetic directly or use intrinsics. In optimized builds, either will produce reasonable code for operations that fit in a single register (I used SIMD8 for this reason). The optimizer has a somewhat harder time when the vectors don't fit in one register, so you may want to use intrinsics for those cases.
(edit: I wrote this for signed, but it works the same for unsigned; just change the types or use the _mm_mulhi_epu16 intrinsic instead.)
If you're curious, the casts to __m128i are needed because Intel's intrinsics are weakly typed. On ARM you don't need them.
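For the "write out the arithmetic directly" route, here is a minimal sketch for the signed case. The function name `mulhi` is made up for illustration, and this is just one way to spell it: widen each lane to 32 bits, multiply, and keep the top 16 bits of each product.

```swift
// Hypothetical helper (the name `mulhi` is illustrative): returns the high
// 16 bits of each lane's full 32-bit signed product.
func mulhi(_ a: SIMD8<Int16>, _ b: SIMD8<Int16>) -> SIMD8<Int16> {
    // Sign-extend each lane to 32 bits so the product cannot overflow.
    let wideA = SIMD8<Int32>(truncatingIfNeeded: a)
    let wideB = SIMD8<Int32>(truncatingIfNeeded: b)
    // Full product per lane; the arithmetic shift moves the high half down.
    let product = wideA &* wideB
    return SIMD8<Int16>(truncatingIfNeeded: product &>> 16)
}
```

For unsigned, the same shape should work with `UInt16` and `UInt32` in place of `Int16` and `Int32`. Whether this compiles down to a single multiply-high instruction depends on the target and optimization level, so check the generated code if it matters.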
I never would’ve found “_Builtin_intrinsics.intel” on my own. Is this documented anywhere?
I’m working on something that you and Dan Lemire may find interesting, and which I believe is novel. Currently I’m prototyping different implementation strategies, one of which uses 16xUInt16.
If you look in the modulemap file for the compiler-provided headers in clang, you'll find all sorts of gems. I've posted about it once or twice in the past here as well.
On a Mac system, it will be somewhere like /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/include/module.modulemap. Not sure where it lives on other platforms.
It's installed by clang, so it either lives in the llvm/clang repo, or gets autogenerated by the clang build. But I've never had reason to care about how it's installed; the contents of the module are the relevant thing.
That is…quite the invocation. I definitely would not have figured it out on my own.
How does one apply that in Xcode?
Also, are there other such compiler settings I might want for this sort of thing?
…and I suppose I ought to ask, if for some reason I didn’t want to use AVX2, is it feasible to call the 128-bit version of mulhi twice, to handle the first 8 elements and then the last 8 elements, of a SIMD16<UInt16>, and if so how?
You can add per-file Swift flags under the "Compile Sources" build phase, IIRC; if you want to enable it globally, add it to the normal Swift flags project/target setting.
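On handling a SIMD16<UInt16> as two 8-element halves: a rough sketch of the splitting, assuming an 8-lane routine is available. The names `mulhi8` and `mulhi16` are hypothetical, and the 8-lane body below uses plain arithmetic as a stand-in; swapping in the 128-bit intrinsic (with the appropriate casts) is not shown or tested here.

```swift
// Hypothetical 8-lane helper: high 16 bits of each unsigned 16x16 product.
// Plain arithmetic stands in for the 128-bit intrinsic in this sketch.
func mulhi8(_ a: SIMD8<UInt16>, _ b: SIMD8<UInt16>) -> SIMD8<UInt16> {
    // Zero-extend to 32 bits so the full product fits in each lane.
    let wideA = SIMD8<UInt32>(truncatingIfNeeded: a)
    let wideB = SIMD8<UInt32>(truncatingIfNeeded: b)
    let product = wideA &* wideB
    return SIMD8<UInt16>(truncatingIfNeeded: product &>> 16)
}

// Hypothetical 16-lane wrapper: run the 8-lane version on each half,
// then reassemble the two results into one vector.
func mulhi16(_ a: SIMD16<UInt16>, _ b: SIMD16<UInt16>) -> SIMD16<UInt16> {
    SIMD16(lowHalf: mulhi8(a.lowHalf, b.lowHalf),
           highHalf: mulhi8(a.highHalf, b.highHalf))
}
```

`lowHalf` covers the first 8 elements and `highHalf` the last 8, so the split matches the "first 8, then last 8" order in the question.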
I feel like I’m hijacking this thread a little bit with the question but it’s kind of on topic: what is the low part / high part of the product? Is it about integer overflow (or wraparound or however we want to describe it)? As in, what’s left over after overflow vs potentially maxing out at UInt16.max?
It's probably easiest to explain with a base ten example. If I multiply two single-digit numbers, the result has up to two digits. E.g.:
      8
    x 7
    ---
     56
The low-order digit (6 in my example) is the "low part" of the product, and the high-order digit (5) is the "high part".
When we talk about UInt16 multiplication, it's exactly the same thing, except now the "digits" are 16-bit unsigned values (hence in the range 0 ..< 65536 instead of 0 ..< 10). So if we multiply:
         0x812b
    x    0xba90
    -----------
    0x5e21_e630
then 0xe630 is the "low part" of the product (what's produced by &*), and 0x5e21 is the "high part".
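If you want to see both parts for scalars, here is a small sketch using the standard library's `multipliedFullWidth(by:)` alongside `&*`, with the same operands as the example (the variable names are just for illustration):

```swift
let a: UInt16 = 0x812b
let b: UInt16 = 0xba90

// Wrapping multiplication keeps only the low 16 bits of the product.
let low = a &* b

// multipliedFullWidth(by:) returns both halves of the 32-bit product.
let full = a.multipliedFullWidth(by: b)

print(String(low, radix: 16))        // e630
print(String(full.high, radix: 16))  // 5e21
print(String(full.low, radix: 16))   // e630
```

So nothing saturates at UInt16.max: `&*` simply discards the high part, and a "mulhi" operation returns the high part instead.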