My adaptation of some C++ code is approximately 100 times slower

BlueSparrow · October 2, 2024, 6:26am

Hello!
In my spare time I've been writing things in Swift, trying to get a feel for the language, which I've been enjoying quite a bit.

Last week I had some time off, so I started going through Dmitry Sokolov's TinyRenderer series using Swift. Like I said, I'm new to Swift so my code is almost certainly sub-optimal.

My code (please excuse the mess -- I hacked it to pieces trying to improve performance).

The C++ code at about the same point.

When I pass -b into my code (run the command with the arguments -b obj/african_head.obj), it renders the model 100 times before outputting it to disk. On my machine (M2 Mac Mini) it takes about 1m4s to complete this. The C++ code (changed to run 100 times) runs in 0m0.700s or so.

On both Windows and Mac, it's spending 99% of its time in drawBarycentricTriangle, around the call to (the inlined) barycentric function.

What have I done to make the Swift code so much slower? I wrote the code loosely following the articles, rather than making a 1:1 port of the C++ code, so I was expecting it to be slower. But not that slow!

Cyberbeni · October 2, 2024, 7:23am

swift run defaults to the debug configuration without optimization. You should pass --configuration release to the command. Similarly, the C++ version should also be compiled with optimization.

Jon_Shier · October 2, 2024, 3:42pm

There's also a bit of overhead in running things through swift run, so I'd suggest building and then running the executable directly. swift build -c release should build the release executable somewhere in the .build directory.

BlueSparrow · October 2, 2024, 5:41pm

To be clear, this is with optimisation enabled. (-c release) and running the executable directly from the .build/release directory. I was mainly trying to tell people how to invoke the program for it to work.

David_Smith · October 2, 2024, 5:53pm

It looks like you have an Instruments trace handy already, would it be possible to make that downloadable from somewhere for convenience? :)

scanon · October 2, 2024, 6:04pm

If you're doing it 100 times and it's about 100 times slower, the very first thing I would check, just out of an abundance of caution, is that the C++ code is really doing it 100 times as well. How long does it take to do one time?

BlueSparrow · October 2, 2024, 6:38pm

C++, 1 time:

$ time ./tinyrenderer 
# v# 1258 f# 2492

real	0m0.037s
user	0m0.031s
sys	    0m0.005s

Swift, 1 time:

$ time ./.build/release/tinyrenderer obj/african_head.obj 
# #v 1258 #f 2492

real	0m0.708s
user	0m0.695s
sys	    0m0.009s

BlueSparrow · October 2, 2024, 6:45pm

Sure. I uploaded it to the GitHub repo: SwiftTinyRenderer/trace/SwiftTinyRenderer.trace.zip at master · bhelyer/SwiftTinyRenderer · GitHub

Jon_Shier · October 2, 2024, 7:01pm

My uneducated guess.

51% of the work is in barycentric:

func barycentric(_ a: Vec3r, _ b: Vec3r, _ c: Vec3r, _ p: Vec3r) -> Vec3r {
    var s0 = Vec3r()
    var s1 = Vec3r()
    s1.x = c.y - a.y
    s1.y = b.y - a.y
    s1.z = a.y - p.y
    s0.x = c.x - a.x
    s0.y = b.x - a.x
    s0.z = a.x - p.x
    let u = s0 ^ s1
    if abs(u.z) > 1e-2 {
        return Vec3r(x: 1.0 - (u.x + u.y) / u.z, y: u.y / u.z, z: u.x / u.z)
    } else {
        return Vec3r(x: -1, y: 1, z: 1)
    }
}

Looking at it it seems like an easy candidate for vectorization, so I'm going to say the C++ autovectorizer picked that up (or perhaps the implementation was already vectorized) and the Swift version didn't.

David_Smith · October 2, 2024, 7:28pm

I suppose I should have realized that it being fully inlined would mean I would need symbols and disassembly to get anything useful from the trace. Ah well, it was worth a shot

David_Smith · October 2, 2024, 7:29pm

If this is the case, one easy candidate for what could be blocking it is the overflow checks on the arithmetic. If that is the case, then using &+ and so on would be an easy thing to try.

Jon_Shier · October 2, 2024, 7:40pm

Godbolt says it is properly vectorized, at least on x86, so probably not the issue unless the vectorizer just didn't work for Apple Silicon.

BlueSparrow · October 2, 2024, 7:41pm

I thought that could be the cause too, but all those calculations are done on Floats, which shouldn't have overflow checks IIUC.

But your observation that barycentric is entirely inlined is a good one. It's partly why I made this post -- it's a black box that I find difficult to penetrate, and I thought that someone would be able to point out a good way of seeing what's going on.

I suppose the next stop is looking at the output assembly. I'm not great at reading it, ARM64 even less so than AMD64, but maybe a copy or allocation or check will jump out at me.

BlueSparrow · October 2, 2024, 7:44pm

It's probably worth noting that the relative performance is identical on Windows and macOS. On Windows I compiled the C++ version with cl.exe rather than clang++.

j-f1 · October 2, 2024, 11:21pm

If you open the Source Viewer in Instruments, there’s a menu button in the top-right corner that lets you switch between looking at the original Swift source code, the raw assembly, and a couple of modes that mix the two. That may help you narrow down which instructions are particularly problematic.

Jon_Shier · October 3, 2024, 1:19am

Maybe something changed in Xcode 16, but I seem to remember Instruments being able to show inline percentages in the code view, so show the heaviest paths. Now all it does is highlight the functions I double click on. But I think the assembly view is saying this is the inlined barycentric function:

+0x468	ldp                 s0, s15, [x21, #0x20]
+0x46c	ldp                 s1, s2, [x21, #0x2c]
+0x470	ldp                 s3, s4, [x21, #0x38]
+0x474	fsub                s4, s4, s15
+0x478	fsub                s2, s2, s15
+0x47c	fsub                s13, s3, s0
+0x480	fsub                s8, s1, s0
+0x484	fsub                s0, s0, s10
+0x488	fmul                s18, s0, s2
+0x48c	fmul                s19, s0, s4
+0x490	fmul                s0, s2, s13
+0x494	fmul                s1, s8, s4
+0x498	fsub                s9, s0, s1
+0x49c	fabd                s14, s0, s1
+0x4a0	fmov                w8, s10
+0x4a4	and                 w27, w8, #0x7f800000
+0x4a8	fcvtzs              x28, s10
+0x4ac	fmov                s11, s5
+0x4b0	b                   "Renderer.drawBarycentricTriangle(_:_:)+0x4c0"
+0x4b4	fadd                s11, s11, s12
+0x4b8	fcmp                s11, s7
+0x4bc	b.hi                "Renderer.drawBarycentricTriangle(_:_:)+0x454"
+0x4c0	fmov                s0, w24
+0x4c4	fcmp                s14, s0
+0x4c8	b.le                "Renderer.drawBarycentricTriangle(_:_:)+0x4b4"
+0x4cc	fsub                s0, s15, s11
+0x4d0	fmul                s1, s13, s0
+0x4d4	fsub                s1, s19, s1
+0x4d8	fmul                s0, s8, s0
+0x4dc	fsub                s2, s0, s18
+0x4e0	fadd                s0, s2, s1
+0x4e4	fdiv                s0, s0, s9
+0x4e8	fsub                s0, s12, s0

BlueSparrow · October 3, 2024, 7:05am

C++ Barycentric Function

Vec3f barycentric(Vec3f A, Vec3f B, Vec3f C, Vec3f P) {
    Vec3f s[2];
    for (int i=2; i--; ) {
        s[i][0] = C[i]-A[i];
        s[i][1] = B[i]-A[i];
        s[i][2] = A[i]-P[i];
    }
    Vec3f u = cross(s[0], s[1]);
    if (std::abs(u[2])>1e-2) // dont forget that u[2] is integer. If it is zero then triangle ABC is degenerate
        return Vec3f(1.f-(u.x+u.y)/u.z, u.y/u.z, u.x/u.z);
    return Vec3f(-1,1,1); // in this case generate negative coordinates, it will be thrown away by the rasterizator
}

C++ Assembly

__Z11barycentric3vecILm3EfES0_S0_S0_:   ; @_Z11barycentric3vecILm3EfES0_S0_S0_
	.cfi_startproc
; %bb.0:
	ldp	s5, s2, [sp]
	fsub	s5, s5, s0
	fsub	s3, s3, s0
	fsub	s6, s2, s1
	fsub	s4, s4, s1
	fnmul	s2, s3, s6
	fmadd	s2, s5, s4, s2
	fabs	s7, s2
	fcvt	d7, s7
	mov	x8, #5243                       ; =0x147b
	movk	x8, #18350, lsl #16
	movk	x8, #31457, lsl #32
	movk	x8, #16260, lsl #48
	fmov	d16, x8
	fcmp	d7, d16
	b.le	LBB1_2
; %bb.1:
	ldp	s16, s7, [sp, #12]
	fsub	s1, s1, s7
	fsub	s0, s0, s16
	fnmul	s5, s5, s1
	fmadd	s5, s0, s6, s5
	fnmul	s0, s0, s4
	fmadd	s3, s3, s1, s0
	fadd	s0, s3, s5
	fdiv	s0, s0, s2
	fmov	s1, #1.00000000
	fsub	s0, s1, s0
	fdiv	s1, s5, s2
	fdiv	s2, s3, s2
	ret
LBB1_2:
	fmov	s1, #1.00000000
	fmov	s0, #-1.00000000
	fmov	s2, #1.00000000
	ret
	.cfi_endproc

This is compared to this:

Swift Barycentric Function

func barycentric(_ a: Vec3r, _ b: Vec3r, _ c: Vec3r, _ p: Vec3r) -> Vec3r {
    var s0 = Vec3r()
    var s1 = Vec3r()
    s1.x = c.y - a.y
    s1.y = b.y - a.y
    s1.z = a.y - p.y
    s0.x = c.x - a.x
    s0.y = b.x - a.x
    s0.z = a.x - p.x
    let u = s0 ^ s1
    if abs(u.z) > 1e-2 {
        return Vec3r(x: 1.0 - (u.x + u.y) / u.z, y: u.y / u.z, z: u.x / u.z)
    } else {
        return Vec3r(x: -1, y: 1, z: 1)
    }
}

Swift Assembly

_$s4main11barycentricyAA5Vec3rVAD_A3DtF:
	.cfi_startproc
	stp	d15, d14, [sp, #-80]!
	stp	d13, d12, [sp, #16]
	stp	d11, d10, [sp, #32]
	stp	d9, d8, [sp, #48]
	stp	x29, x30, [sp, #64]
	add	x29, sp, #64
	.cfi_def_cfa w29, 16
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	.cfi_offset b8, -24
	.cfi_offset b9, -32
	.cfi_offset b10, -40
	.cfi_offset b11, -48
	.cfi_offset b12, -56
	.cfi_offset b13, -64
	.cfi_offset b14, -72
	.cfi_offset b15, -80
	fmov	s8, s7
	fmov	s9, s6
	fmov	s10, s4
	fmov	s11, s3
	fmov	s12, s1
	fmov	s13, s0
	ldp	s14, s15, [x29, #20]
	bl	_$s4main5Vec3rVACycfC
	bl	_$s4main5Vec3rVACycfC
	fsub	s3, s8, s12
	fsub	s4, s10, s12
	fsub	s5, s12, s15
	fsub	s0, s9, s13
	fsub	s1, s11, s13
	fsub	s2, s13, s14
	bl	_$s4main5Vec3rV1xoiyA2C_ACtFZ
	fabs	s3, s2
	mov	w8, #55050
	movk	w8, #15395, lsl #16
	fmov	s4, w8
	fcmp	s3, s4
	b.le	LBB0_2
	fadd	s3, s0, s1
	fdiv	s3, s3, s2
	fmov	s4, #1.00000000
	fsub	s3, s4, s3
	fdiv	s1, s1, s2
	fdiv	s2, s0, s2
	fmov	s0, s3
	b	LBB0_3
LBB0_2:
	fmov	s0, #-1.00000000
	fmov	s1, #1.00000000
	fmov	s2, #1.00000000
LBB0_3:
	bl	_$s4main5Vec3rV1x1y1zACSf_S2ftcfC
	ldp	x29, x30, [sp, #64]
	ldp	d9, d8, [sp, #48]
	ldp	d11, d10, [sp, #32]
	ldp	d13, d12, [sp, #16]
	ldp	d15, d14, [sp], #80
	ret
	.cfi_endproc

The C++ one seems to use less instructions. But probably looking at the uninlined function is not relevant to my case.

At the moment I'm assuming the issue must be in the calling function, and the way barycentric is inlined. I'm no good at reading assembly, but this sticks out to me. Here's the callsite to the inlined barycentric (I think) in the C++ version:

	ldp	s1, s2, [x21]
	ldp	s0, s6, [x21, #12]
	ldp	s3, s5, [x21, #24]
	fsub	s4, s3, s1
	fsub	s3, s0, s1
	fsub	s5, s5, s2
	fsub	s6, s6, s2
	fnmul	s0, s3, s5
	fmadd	s0, s4, s6, s0
	fabs	s7, s0

And here's that call site in the Swift version:

+0x468	ldp                 s0, s15, [x21, #0x20]
+0x46c	ldp                 s1, s2, [x21, #0x2c]
+0x470	ldp                 s3, s4, [x21, #0x38]
+0x474	fsub                s4, s4, s15
+0x478	fsub                s2, s2, s15
+0x47c	fsub                s13, s3, s0
+0x480	fsub                s8, s1, s0
+0x484	fsub                s0, s0, s10
+0x488	fmul                s18, s0, s2
+0x48c	fmul                s19, s0, s4
+0x490	fmul                s0, s2, s13
+0x494	fmul                s1, s8, s4
+0x498	fsub                s9, s0, s1
+0x49c	fabd                s14, s0, s1

I'm assuming those muls are from the cross product. And that the fabs and fabd instructions are the call to abs(). The codegen is quite different for what (to me) seems like fairly similar code, but I'm not smart enough to say if it's slower.

ST1 · October 3, 2024, 12:27pm

The counters in these functions are very different (сounters are global).
Swift does more work than C++

C++ (1 time)
Counter 1: 52 015
Counter 2: 1 572 244


void triangle(Vec3f *pts, float *zbuffer, TGAImage &image, TGAColor color) {
 ...
 for (P.x=bboxmin.x; P.x<=bboxmax.x; P.x++) {
        count1++;
        for (P.y=bboxmin.y; P.y<=bboxmax.y; P.y++) {
            count2++;

Swift (1 time)
Counter 1: 1 021 611
Counter 2: 387 875 174

 public func drawBarycentricTriangle(_ pts: [Vec3r], _ colour: TGAColour) {
  ...
  while p.x <= bboxmax.x {
            //for y in bboxmin.y...bboxmax.y {
            p.y = bboxmin.y
            count1 += 1
            while p.y <= bboxmax.y {
                count2 += 1

nsc · October 3, 2024, 1:23pm

Good catch. It seems the calculation of the maximum values of the bounding box is wrong.
The code says:

bboxmax.x = max(clamp.x, min(bboxmax.x, pts[i].x))
bboxmax.y = max(clamp.y, min(bboxmax.y, pts[i].y))

but it should be:

bboxmax.x = min(clamp.x, max(bboxmax.x, pts[i].x))
bboxmax.y = min(clamp.y, max(bboxmax.y, pts[i].y))

The numbers look much more sane with this change.

BlueSparrow · October 4, 2024, 4:41am

Oh man, that's such a clever check. So simple, but a very quick way of asking if the two versions are doing the same number of iterations. Annoyed at myself for not thinking of it, but stowing that away for the next time I'm staring at a tight inner loop. Thank you!

And you're exactly right. I fat fingered converting from the C++'s use of subscripts. The correct code, for those of you playing along at home:

        for i in 0..<3 {
            bboxmin.x = max(0, min(bboxmin.x, pts[i].x))
            bboxmax.x = min(clamp.x, max(bboxmax.x, pts[i].x))
            bboxmin.y = max(0, min(bboxmin.y, pts[i].y))
            bboxmax.y = min(clamp.y, max(bboxmax.y, pts[i].y))
        }

With that, the runtime for 100 iterations drops to a very reasonable ~700 ms (exactly the same as the C++ one).

Thanks to everyone who had a look at this, I really appreciate it. Now I can move on from "screwing up bounding box calculation" to "screwing up texture mapping"! :')