Performance overhead for protocols

Hi,

I tried to implement a Decimal64 struct which uses decimal floating-point arithmetic and conforms to the FloatingPoint protocol.

To benchmark it, I wrote two different functions. One accepting the new type and a generic one:
func templTest<T>(start: T) -> String where T: FloatingPoint, T: ExpressibleByFloatLiteral

The generic version was over 30% slower for the built-in Double and 8% slower for my own implementation.
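For readers without the repository open, the comparison could be sketched roughly as below. This is only an assumption about the shape of the benchmark: the names `concreteTest` and `genericTest`, the 0.25 tax rate, and the function bodies are hypothetical, and the real code lives in the linked repository.

```swift
// Hypothetical non-generic version: the compiler knows the value is a Double.
func concreteTest(start: Double) -> String {
    let net = start
    let tax = net * 0.25          // illustrative tax rate, not from the thread
    let gross = net + tax
    return "net: \(net), tax: \(tax), gross: \(gross)"
}

// Hypothetical generic version with the same constraints as templTest.
func genericTest<T>(start: T) -> String where T: FloatingPoint, T: ExpressibleByFloatLiteral {
    let net = start
    let tax = net * 0.25          // the literal is built via ExpressibleByFloatLiteral
    let gross = net + tax
    return "net: \(net), tax: \(tax), gross: \(gross)"
}

print(concreteTest(start: 8.0))
print(genericTest(start: 8.0))
```

Both functions produce the same text; the question is only about how long each takes to do so.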

Why is there such a big performance penalty?

Regards,
Dirk
P.S.: Source is available here: GitHub - dirkschreib/Decimal64

So just some general Swift performance info.

Make sure all your test code is outside of global scope. The Swift compiler gets pretty conservative with globals, and can't do many optimizations on them.
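A minimal sketch of what that advice might look like in practice, assuming a simple timing loop. The function names `runBenchmark` and `runAll` and the workload are made up for illustration.

```swift
import Foundation

// Hypothetical workload: all state is local to the function,
// so nothing here is a global variable.
func runBenchmark(iterations: Int) -> Double {
    var total = 0.0
    for i in 0..<iterations {
        total += Double(i) * 0.5
    }
    return total
}

// The timing code also lives in a function instead of at top level.
func runAll() {
    let start = Date()
    let result = runBenchmark(iterations: 1_000_000)
    let elapsed = Date().timeIntervalSince(start)
    print("result: \(result), time: \(elapsed)")
}

runAll()
```

The only statement left at global scope is the single call to `runAll()`.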

Generic functions across module boundaries aren't going to get generic specialization unless they're marked @inlinable. You can read more about it here. If generic functions aren't specialized, every call goes through some indirection, which for numeric code can really expose performance issues.
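As a rough illustration, a generic function exported from a library module might be annotated like this. The function `grossAmount` is a hypothetical example, not something from the Decimal64 project.

```swift
// Imagine this declaration lives in a library module that the benchmark imports.
// @inlinable exposes the body to client modules, which makes it possible
// for the compiler to specialize the function for a concrete type there.
@inlinable
public func grossAmount<T: FloatingPoint>(net: T, taxRate: T) -> T {
    return net + net * taxRate
}

// A caller using Double; T is inferred from the arguments.
let gross = grossAmount(net: 100.0, taxRate: 0.25)
print(gross)
```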

Thanks for the suggestion, but it is not applicable in this case. I marked the generic function as @inlinable and this changed nothing. That is no surprise to me, because the calling and called functions are in the same module anyway.

Another helpful tool for simple cases is to use godbolt.org to see what is going on.

Note that the test bundle is normally built separately from the main executable, so it is a different module from Swift's point of view. However, it looks like there is still some overhead even if everything is in the same module and optimization is enabled. Would you be able to file a bug?

It looks like the performance difference might come down to us picking a slower string interpolation path in the generic case. If I change all of the test* functions to this:

        ret = "\(s as Any), net: \(net as Any), tax: \(tax as Any), gross: \(gross as Any)"

which forces it to always pick the most general dynamic string interpolation implementation, then I get pretty much identical timings for the generic and non-generic implementations:

Double  time:  -1.4328429698944092
Decimal time:  -4.693014979362488
DecFP64 time:  -1.6121209859848022
Dec64   time:  -1.5860040187835693
TDouble time:  -1.400465965270996
TDec64FPtime:  -1.6180580854415894
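A small sketch of the difference between the two interpolation forms, using a plain Double. The variable names are illustrative. The resulting text is the same either way; what changes is which interpolation entry point the compiler selects.

```swift
let net = 2.5

// The static type Double is visible, so a more specific overload can be chosen.
let staticPath = "net: \(net)"

// Casting to Any hides the static type and forces the most general entry point.
let dynamicPath = "net: \(net as Any)"

print(staticPath == dynamicPath)
```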

cc @beccadax and @Michael_Ilseman. I would guess that we have at least a specific overload in StringInterpolation for interpolating Double, and that in the generic case, we fall into the most generic entry point, since in templTest we wouldn't be able to see the specific entry points. Maybe a tailored optimization in the specializer to re-specialize string interpolation calls would help.

Another experiment I tried was making it so that templTest required T: CustomStringConvertible in addition to FloatingPoint, enabling string interpolation to find the conformance statically instead of by dynamic lookup. This also brings Dec64 and TDec64 in line (though Double still apparently benefits from the Double-specific printing overload only in the static case):

Double  time:  -1.0079069137573242
Decimal time:  -3.8584940433502197
DecFP64 time:  -0.9931479692459106
Dec64   time:  -0.982342004776001
TDouble time:  -1.5873949527740479
TDec64FPtime:  -0.9742140769958496
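The extra constraint used in that experiment could look roughly like this. The function name `describe` and its body are hypothetical; only the constraint list reflects the change described above.

```swift
// With CustomStringConvertible in the where clause, string interpolation
// can see the conformance statically inside the generic function.
func describe<T>(_ value: T) -> String
    where T: FloatingPoint, T: ExpressibleByFloatLiteral, T: CustomStringConvertible
{
    let gross = value * 1.5       // illustrative arithmetic only
    return "net: \(value), gross: \(gross)"
}

print(describe(2.0))
```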

And finally, if I remove the interpolation entirely, and change the functions to all return structs, then the generic and non-generic cases also fall into line:

Double  time:  -0.000970005989074707
Decimal time:  -1.1530550718307495
DecFP64 time:  -0.08175003528594971
Dec64   time:  -0.09572494029998779
TDouble time:  -0.0009540319442749023
TDec64FPtime:  -0.08098399639129639

suggesting that, at least, the numeric part of the code is not hitting any optimization barriers.
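A sketch of the struct-returning variant, assuming the same net/tax/gross calculation. The type `Amounts`, the function `compute`, and the 0.25 rate are hypothetical stand-ins for the modified test functions.

```swift
// Plain value type carrying the results, so no string formatting is involved.
struct Amounts<T: FloatingPoint> {
    var net: T
    var tax: T
    var gross: T
}

func compute<T>(start: T) -> Amounts<T> where T: FloatingPoint, T: ExpressibleByFloatLiteral {
    let net = start
    let tax = net * 0.25          // illustrative tax rate
    return Amounts(net: net, tax: tax, gross: net + tax)
}

let amounts = compute(start: 8.0)
print(amounts.gross)
```

Timing a function like this isolates the arithmetic from the cost of producing text.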

Joe's answer is correct except for one detail: the fast path isn't completely Double-specific—it's TextOutputStreamable. Arbitrary floating-point numbers often have too many digits to fit into a small string, so passing them in a temporary string can be expensive. Instead, the built-in floating-point types conform to TextOutputStreamable and use TextOutputStream._writeASCII(_:) to basically dump the digits directly into the string's backing storage.
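For a custom type, a conformance to the public TextOutputStreamable protocol might be sketched as follows. The `Cents` type is a hypothetical stand-in, not the Decimal64 struct from the thread, and it uses only the public `write(_:)` method rather than the underscored `_writeASCII(_:)` mentioned above.

```swift
// Hypothetical fixed-point amount stored as hundredths.
struct Cents: TextOutputStreamable {
    var value: Int

    // Writes the digits piece by piece into the target stream
    // instead of first building one complete temporary String.
    func write<Target: TextOutputStream>(to target: inout Target) {
        let magnitude = value.magnitude
        if value < 0 { target.write("-") }
        target.write(String(magnitude / 100))
        target.write(".")
        let fraction = magnitude % 100
        if fraction < 10 { target.write("0") }   // pad to two fractional digits
        target.write(String(fraction))
    }
}

let price = Cents(value: 1234)
let text = "price: \(price)"
print(text)
```

String interpolation has an overload for TextOutputStreamable values, so `"\(price)"` ends up calling `write(to:)`.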

Thanks Brent. That would explain why passing through CustomStringConvertible didn't avoid the penalty in the Double case, then.

I filed SR-11158, "Re-specialize interpolation segment invocations on String after specializing" (issue #53555 in apple/swift on GitHub), to improve this in the optimizer.

Thank you very much. I really appreciate the fast response.
I will enhance my Decimal64 test structs with conformance to TextOutputStreamable in the meantime.

Thanks Brent!
TextOutputStreamable, which I have never heard of before, is great!
Both of my Decimal64 structs are now faster than Double by a good margin.
(This applies only to this specific benchmark, which includes text conversion. You can't beat Double for numeric operations without hardware support. But in the days of JSON and XML, converting to and from strings is quite a common task.)
