Sure. You can write out the arithmetic directly or use intrinsics. Either will, when optimized, produce reasonable code for operations that fit in a single register (I used SIMD8 for this reason). The optimizer has a somewhat harder time when they don't fit, so you may want to use intrinsics for those cases.
(edit: I wrote this for signed, but it works the same for unsigned; just change the types or use the _mm_mulhi_epu16 intrinsic instead.)
If you're curious, the casts to __m128i are needed because Intel's intrinsics are weakly typed. On ARM you don't need them.
On a Mac system, it will be somewhere like /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/include/module.modulemap. Not sure where it lives on other platforms.
It's installed by clang, so it either lives in the llvm/clang repo, or gets autogenerated by the clang build. But I've never had reason to care about how it's installed; the contents of the module are the relevant thing.
That is…quite the invocation. I definitely would not have figured it out on my own.
How does one apply that in Xcode?
Also, are there other such compiler settings I might want for this sort of thing?
…and I suppose I ought to ask: if for some reason I didn’t want to use AVX2, is it feasible to call the 128-bit version of mulhi twice, to handle the first 8 elements and then the last 8 elements of a SIMD16&lt;UInt16&gt;, and if so, how?
I feel like I’m hijacking this thread a little bit with this question, but it’s kind of on topic: what is the low part / high part of the product? Is it about integer overflow (or wraparound, or however we want to describe it)? As in, what’s left over after overflow vs. potentially maxing out at UInt16.max?