I replied here with why I think this is an artificial and ultimately meaningless distinction.
What if it turns out those two lines in the top fork are actually switching on and off very rapidly, just so rapidly you can't see it (technically they are, because computer displays use subpixels for red, green and blue that can't literally be on top of each other)? Would it matter if the tiny slivers of red happened to line up vertically instead of being staggered, at a resolution so small you'd never be able to see it?
"Parallel" are "concurrent" are literally synonyms in English (if you don't believe me, Google their definitions). If the software industry has introduced a distinction between the two, it is highly suspicious because apparently there wasn't a better word to identify the two supposedly different concepts. Maybe that's because there is no distinction after all.
When and how could it matter for code that it was created and tested on single-core hardware, but is now being prepared for use on multi-core hardware? The answer is "it doesn't"... except for one small part: the synchronization primitives you use to ensure forked threads meet up again at an agreed-upon point now have to be implemented at least partially in hardware instead of being purely software constructs. On a single-core machine, a Darwin lock could just be a boolean flag the OS kernel stores on a thread context; when its scheduling loop picks the next thread to run for a time slice, it skips any thread that is waiting on a lock it doesn't own. On multiple cores, the parallelism is no longer implemented (just) in the OS kernel but below it, in the hardware. Therefore even the kernel needs to protect its own "shared state" with locks, and those have to be supplied by the CPU itself in the form of atomic instructions.
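To make that concrete, here's a minimal sketch of a lock built directly on such an atomic instruction. It assumes the swift-atomics package for `ManagedAtomic` and is purely illustrative; it's not how Darwin's locks are actually implemented:

```swift
import Atomics  // https://github.com/apple/swift-atomics, assumed available

// Illustrative only: a spinlock whose entire "shared state" is one flag,
// kept consistent across cores by an atomic compare-and-exchange.
final class SpinLock {
    private let locked = ManagedAtomic<Bool>(false)

    func lock() {
        // compareExchange maps to a single atomic CPU instruction; that's
        // what keeps this correct when multiple cores race for the flag.
        while !locked.compareExchange(expected: false,
                                      desired: true,
                                      ordering: .acquiring).exchanged {
            // busy-wait until whoever holds the lock releases it
        }
    }

    func unlock() {
        locked.store(false, ordering: .releasing)
    }
}
```

On a single-core machine that same flag could be an ordinary boolean the scheduler inspects; on multiple cores, the atomic compare-and-exchange is what prevents two cores from both reading `false` and both believing they acquired the lock.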
If you're not an OS kernel developer, this should be largely irrelevant to you... except if you're writing performance-critical code and want to make sure your locks are implemented as atomics instead of mutexes, because the former is much faster (but also more limited: it can only synchronize a single memory access, not an arbitrary block of instructions).
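As a rough sketch of that trade-off (again assuming the swift-atomics package for the atomic side; the lock is Foundation's `NSLock`):

```swift
import Foundation
import Atomics  // assumed dependency, as above

let hits = ManagedAtomic<Int>(0)
hits.wrappingIncrement(ordering: .relaxed)   // one atomic instruction: fast,
                                             // but only a single memory access

var stats: [String: Int] = [:]
let statsLock = NSLock()
statsLock.lock()
stats["hits", default: 0] += 1               // a lock can guard an arbitrary
stats["total", default: 0] += 1              // block of work, at a higher cost
statsLock.unlock()
```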
If that's not your concern, and you've noticed your multithreaded code doesn't work once you start running it on multi-core hardware, all that's happened is that your code has race-condition or re-entrancy bugs (it relied on a relative order of execution where none was guaranteed), and the probability of encountering those bugs jumped from something like 0.001% on a single-core machine (note: not 0%) to 1%, and you finally won that lottery.
That bug didn't become a bug by supporting multi-core hardware. It was always there, it just had a low enough reproducibility rate you never noticed.
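Here's a tiny, hypothetical example of the kind of bug being described (Dispatch-based; Swift 6's strict concurrency checking would reject shared mutable state like this outright, which is rather the point):

```swift
import Foundation

var results: [Int] = []                  // shared, unsynchronized state
let queue = DispatchQueue(label: "work", attributes: .concurrent)
let group = DispatchGroup()

for i in 0..<1_000 {
    queue.async(group: group) {
        results.append(i)                // data race: concurrent mutation
    }
}
group.wait()
print(results.count)   // may "work" for years on one core by sheer luck;
                       // on many cores it drops elements or crashes
```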
My point here is that you should stop thinking about hardware. That's not what you're coding to (how hardware actually executes your code is insanely complicated and not at all what we probably picture: it's slicing it up, reordering it, staggering it across superscalar cores with multiple execution units, executing ahead with branch prediction, doing all sorts of super complex caching and guessing about where you're going to read from memory next, etc.). You're coding to a virtual machine that presents a logical execution environment for your code. When you introduce a `Thread` or a `Task`, you are introducing parallelism/concurrency into this logical execution. That is all that matters. Once you introduce concurrency, you have asked for all guarantees of in-order execution (between the instructions in two different threads/tasks) to be removed.
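For instance, in this hypothetical snippet nothing orders the two prints relative to each other; the only guaranteed meeting points are the `await`s:

```swift
func demo() async {
    // Two independent Tasks: the runtime may run them in either order,
    // interleaved, or truly in parallel. The awaits below are the only
    // points where their execution is ordered relative to ours.
    let a = Task { print("from task A") }
    let b = Task { print("from task B") }
    _ = await a.value
    _ = await b.value
}
```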
If you're trying to rely on a difference in execution between single-core and multi-core environments, you're just hoping that race conditions will, by accident, never be encountered.