Might be a long shot:
-
Is there a limit to the number of threads in your case? For example the number will be limited to a number of CPU cores (which is a small number like 8 or 32) – which would make sense if the task at hand is CPU bound – there would be no win to use, say, 1000 threads or any number bigger than the total number of CPU cores.
-
Can you afford using N times as much memory for the counters in question (where N is the above small number)?
-
Subsequently, do you need to read the "final" values from a single thread after all threads are done with changing the tree? And can tolerate that individual threads will not have the "current" counter information?
In that case you could sidetrack the whole atomicity issue by collecting the individual counters per thread and aggregating them in a single value at the post processing stage. No locks, no atomics, but yes, N times as much memory for the counters.
A side consideration: if the edges array can only hold up to a small number of edges (like 10 or 100) I'd seriously consider using a less heavy data structure than the Array.