Towards Robust Performance Measurement

Could you please elaborate? How did this issue manifest? I'd like to investigate it on the dataset from my report and on the new measurements taken after PR #18318 landed (a month ago).