How to parallelize across CPU cores?

AlexanderM · January 30, 2025, 3:50am

I'll elaborate on David's point:

There are several reasons why you'd want to avoid this.

First, as your main topic emphasizes, you wouldn't want to have your program have to worry about the specific core count. It would be nice to asbtract that away.

More importantly, matching the chunk count to the core count leads to a performance cliff if the system has any other programs competing for CPU.

For a graphical demonstration, picture this timeline. We have 100 units of work (each =) split into 4 chunk of 25, with an ideal allocation among on 4 cores:

Core 1: [=========================] D
Core 2: [=========================] O
Core 3: [=========================] N
Core 4: [=========================] E

Look what happens if even a single core is busy with something else:

Core 1: [xxxxxxxxxx][=============================] D
Core 2: [=========================][ ... idle ... ] O
Core 3: [=========================][ ... idle ... ] N
Core 4: [=========================][ ... idle ... ] E

Some other program x was using core 1 for some time. Your program gets to use it eventually, but that chunk is behind. The other 3 cores will finish earlier, but your overall time will have to be as slow as the most delayed task's completion. Worse yet, cores 2-4 are idle while they wait fore core 1.

You might consider the other extreme, which is to split 100 work items into 100 single-item chunks. This prevents work-starvation like in the previous examples:

Core 1: [xxxxxxxxxx][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=] D
Core 2: [=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=] O
Core 3: [=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=] N
Core 4: [=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=][=] E

Now all the cores are maximally saturated, and they all finish at the same time, even with the disruption caused by x! You're never stuck waiting on for a straggler to finish while the other cores are all empty, but now there's a different downside: higher overhead. The repeated [ ] brackets are an apt visual metaphor for scheduling and context-switching overhead. The more fine-grained your chunks are, the more overhead you have in proportion to "real work".

So in practice, you'll want to right-size the chunks that's fine-grained enough to schedule work effectively, but not so fine grained that you pay too much overhead. It's a trade off.

I'll also shout out this great point by Jed fox: