Are you guys actually doing accumulation in BFloat16? Intel's extension does accumulation in Float, and so do all the other public proposals that I've seen so far.
Edit: for the curious, here's Intel's white paper. They define only three operations: an FMA that accumulates a bfloat16 x bfloat16 product into a float32, and conversions between bfloat16 and float32. So all arithmetic is done in float32, same as in the other public proposals I mentioned.
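To show why the accumulator width matters, here's a rough pure-Python sketch. It is not Intel's instruction semantics, just an emulation under my own assumptions: `to_bf16`, `f32`, `dot_f32_acc` and `dot_bf16_acc` are names I made up, bfloat16 is modeled as float32 rounded to its top 16 bits with round-to-nearest-even, and NaN/infinity handling is ignored.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even. Assumption: finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_f32_acc(a, b):
    """bfloat16 inputs, float32 accumulator (the pattern described above)."""
    acc = 0.0
    for x, y in zip(a, b):
        # A bfloat16 x bfloat16 product has at most 16 significant bits,
        # so it is exact here; only the accumulation step rounds.
        acc = f32(acc + to_bf16(x) * to_bf16(y))
    return acc

def dot_bf16_acc(a, b):
    """Same inputs, but the running sum is rounded back to bfloat16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_bf16(acc + to_bf16(x) * to_bf16(y))
    return acc

ones = [1.0] * 1000
print(dot_f32_acc(ones, ones))   # 1000.0
print(dot_bf16_acc(ones, ones))  # 256.0
```

With only 8 bits of significand, the bfloat16 accumulator gets stuck at 256: 256 + 1 = 257 sits exactly halfway between 256 and 258, and the tie rounds back to 256 every time. The float32 accumulator has no trouble with the same sum.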