Can SIMD types compute the high part of a product?

Say I have two SIMD16<UInt16> vectors x and y. I can multiply their elements by writing x &* y, which computes the low part of each product.

However, suppose I actually want the high part of each product instead. Is there a way to compute it efficiently?

Sure. You can write out the arithmetic directly or use intrinsics. Either will produce reasonable code when optimized, for operations that fit in a single register (I used SIMD8 for this reason). The optimizer has a somewhat harder time when they don't fit, so you may want to use intrinsics in those cases.
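
The write-it-out version might look like the following sketch for the signed case (the name `mulHi` and the exact formulation are mine): widen each lane to 32 bits, multiply, and keep the top 16 bits.

```swift
func mulHi(_ x: SIMD8<Int16>, _ y: SIMD8<Int16>) -> SIMD8<Int16> {
    // Sign-extend each lane to 32 bits, multiply, then take the top
    // 16 bits of each 32-bit product via an arithmetic shift.
    let wide = SIMD8<Int32>(truncatingIfNeeded: x) &* SIMD8<Int32>(truncatingIfNeeded: y)
    return SIMD8<Int16>(truncatingIfNeeded: wide &>> 16)
}
```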

(edit: I wrote this for signed, but it works the same for unsigned; just change the types or use the _mm_mulhi_epu16 intrinsic instead.)

If you're curious, the casts to __m128i are needed because Intel's intrinsics are weakly typed. On ARM you don't need them.

Thanks!

I never would’ve found “_Builtin_intrinsics.intel” on my own. Is this documented anywhere?

I’m working on something that you and Dan Lemire may find interesting, and which I believe is novel. Currently I’m prototyping different implementation strategies, one of which uses SIMD16<UInt16>.

If you look in the modulemap file for the compiler-provided headers in clang, you'll find all sorts of gems. I've posted about it once or twice in the past here as well.

Where would one find such a thing?

On a Mac system, it will be somewhere like /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/include/module.modulemap. Not sure where it lives on other platforms.

Oh, it’s not something hosted in a public repo?

It's installed by clang, so it either lives in the llvm/clang repo, or gets autogenerated by the clang build. But I've never had reason to care about how it's installed; the contents of the module are the relevant thing.

Well, I attempted to use intrinsics for SIMD16<UInt16>, but it’s not compiling. I assume I’ve made a mistake somewhere:

import _Builtin_intrinsics.intel

func mulHi3(_ x: SIMD16<UInt16>, _ y: SIMD16<UInt16>) -> SIMD16<UInt16> {
  let sseX = unsafeBitCast(x, to: __m256i.self)
  let sseY = unsafeBitCast(y, to: __m256i.self)
  let sseR = _mm256_mulhi_epu16(sseX, sseY)
  return unsafeBitCast(sseR, to: SIMD16<UInt16>.self)
}

You have to tell the compiler to use AVX2 instructions if you want it to generate them (or accept the intrinsics): Compiler Explorer

Thanks again!

-Xcc -Xclang -Xcc -target-feature -Xcc -Xclang -Xcc +avx2 -O

That is…quite the invocation. I definitely would not have figured it out on my own.

How does one apply that in Xcode?

Also, are there other such compiler settings I might want for this sort of thing?

…and I suppose I ought to ask: if for some reason I didn’t want to use AVX2, is it feasible to call the 128-bit version of mulhi twice, once for the first 8 elements of a SIMD16<UInt16> and once for the last 8? If so, how?

Yes, I wrote up some notes on future directions here: [SR-11660] Umbrella: function multiversioning and dispatch on CPU features · Issue #54069 · apple/swift · GitHub

You can add per-file swift flags under the "compile sources" build phase, IIRC; if you want to enable it globally you add it to the normal swift flags project/target setting.
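
For reference, the half-at-a-time approach from the earlier question can be sketched in pure Swift, with no intrinsics or compiler flags; this is my own formulation (the names `mulHi16` and `halfHi` are mine), using the widen/narrow trick on each 8-element half via `lowHalf` / `highHalf`:

```swift
func mulHi16(_ x: SIMD16<UInt16>, _ y: SIMD16<UInt16>) -> SIMD16<UInt16> {
    // Handle each 8-element half separately: widen to 32 bits, multiply,
    // and keep the top 16 bits of each 32-bit product.
    func halfHi(_ a: SIMD8<UInt16>, _ b: SIMD8<UInt16>) -> SIMD8<UInt16> {
        let wide = SIMD8<UInt32>(truncatingIfNeeded: a) &* SIMD8<UInt32>(truncatingIfNeeded: b)
        return SIMD8<UInt16>(truncatingIfNeeded: wide &>> 16)
    }
    var result = SIMD16<UInt16>()
    result.lowHalf = halfHi(x.lowHalf, y.lowHalf)
    result.highHalf = halfHi(x.highHalf, y.highHalf)
    return result
}
```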

Oh, sweet!

The lowHalf / highHalf part makes sense, and it also works with your original widen / narrow approach for multiplying the halves.

That will make my prototype implementation feel a lot more Swifty, since I won’t have to drop down to intrinsics or mess with compiler flags.

I really appreciate all the help.

I feel like I’m hijacking this thread a little bit with this question, but it’s kind of on topic: what are the low part and high part of a product? Is it about integer overflow (or wraparound, or however we want to describe it)? As in, what’s left over after overflow vs. potentially maxing out at UInt16.max?

It's probably easiest to explain with a base ten example. If I multiply two single-digit numbers, the result is a two digit number. E.g.:

  8
x 7
---
 56

The low-order digit (6 in my example) is the "low part" of the product, and the high-order digit (5) is the "high part".

When we talk about UInt16 multiplication, it's exactly the same thing, except now the "digits" are 16-bit unsigned integers (hence in the range 0 ..< 65536 instead of 0 ..< 10). So if we multiply:

     0x812b
   x 0xba90
----------
0x5e21_e630

then 0xe630 is the "low part" of the product (what's produced by &*), and 0x5e21 is the "high part".
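
In scalar Swift you can see both halves at once with `multipliedFullWidth`, e.g. for the numbers above:

```swift
// multipliedFullWidth returns the full 32-bit product of two UInt16s
// as a (high, low) pair of UInt16s.
let (high, low) = (0x812b as UInt16).multipliedFullWidth(by: 0xba90)
// high == 0x5e21 (the "high part"); low == 0xe630 (what &* gives you)
```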

Amazing explanation, thanks and much respect @scanon

Is there a good way to perform mulHi on SIMD16<UInt64>, or should I just iterate and call multipliedFullWidth on the elements and store the high part?

No SIMD architecture has that instruction; you either need to build it out of 32-bit operations, or use the scalar operation.
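
For what it's worth, building it out of 32-bit operations amounts to the schoolbook decomposition below, shown here for a single scalar lane (the name `mulHi64` and the formulation are mine; a SIMD version would apply the same steps lanewise):

```swift
func mulHi64(_ a: UInt64, _ b: UInt64) -> UInt64 {
    // Split each operand into 32-bit halves and form four partial products.
    let aLo = a & 0xFFFF_FFFF, aHi = a >> 32
    let bLo = b & 0xFFFF_FFFF, bHi = b >> 32
    let ll = aLo * bLo
    let lh = aLo * bHi
    let hl = aHi * bLo
    let hh = aHi * bHi
    // Propagate carries through the middle column; none of these
    // sums can overflow 64 bits, since each operand half is < 2^32.
    let mid  = lh + (ll >> 32)
    let mid2 = hl + (mid & 0xFFFF_FFFF)
    return hh + (mid >> 32) + (mid2 >> 32)
}
```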
