On 7 dec. 2015, at 23:10, Pierre Habouzit <firstname.lastname@example.org<mailto:email@example.com>> wrote:
Hi Joakim, Kevin,
[ Full disclosure, I made that reply in rdar://problem/16436943 and your use case was slightly different IIRC but you’re right it’s a close enough problem ]
Dispatch internally has a notion of something that does almost that, called _dispatch_barrier_trysync_f. However, it is used internally to serialize state changes on sources and queues such as setting the target queue or event handlers.
The problem is that this call bypasses the target queue hierarchy in its fastpath, which while it’s correct when changing the state of a given source or queue, is generally the wrong thing to do. Let’s consider this code assuming the dispatch_barrier_trysync()
dispatch_queue_t outer = dispatch_queue_create("outer", NULL);
dispatch_queue_t inner = dispatch_queue_create("inner", NULL);
// write global state protected by inner
// write global state protected by inner
Then if it works like the internal version we have today, the code above has a data race, which we’ll all agree is bad.
Or we do an API version that when the queue you do the trysync on is not targetted at a global root queue always go through async, and that is weird, because the performance characteristics would completely depend on the target queue hierarchy, which when layering and frameworks start to be at play, is a bad characteristic for a good API.
Yes, we could currently assume that we only targeted a root queue for our use case, so our implementation has this limitation (so it is not a valid general solution as you say).
It would perhaps be a bit strange to have different performance characteristics depending on the target queue hierarchy as you say, but there are already some performance differences in actual behavior if using e.g. an overcommit queue vs a non, so perhaps another option would be to have this as an optional queue attribute instead of an additional generic API (queue attribute ’steal calling thread for inline processing of requests if the queue was empty when dispatching’) …?
Or we don’t give up right away when the hierarchy is deep, but then that means that dispatch_trysync would need to be able to unwind all the locks it took, and then you have ordering issues because enqueuing that block that couldn’t run synchronously may end up being after another one and break the FIFO ordering of queues. Respecting this which is a desired property of our API and getting an efficient implementation are somehow at odds.
Yes, agree it is a desirable property of the API to retain the ordering.
The other argument against trysync that way, is that during testing trysync would almost always go through the non contended codepath, and lead developers to not realize that they should have taken copies of variables and the like (this is less of a problem on Darwin with obj-c and ARC), but trysync running on the same thread will hide that. except that once it starts being contended in production, it’ll bite you hard with memory corruption everywhere.
Less of an issue for us as we depend on the _f interfaces throughout due to portability concerns, but fair point.
Technically what you’re after is that bringing up a new thread is very costly and that you’d rather use the one that’s asyncing the request because it will soon give up control. The wake up of a queue isn’t that expensive, in the sense that the overhead of dispatch_sync() in terms of memory barriers and locking is more or less comparable. What’s expensive is creating a thread to satisfy this enqueue.
Yes, in fact, bringing up a new thread is so costly that we keep a pool around in the libpwq implementation. Unfortunately we would often see double-digit microsecond latency incurred by this, which is unacceptable for us, so we had to (for some configurations/special deployments) have a dedicated spin thread that will grab the next queue to work on (that cut down the latency with a factor of 10 or so) and the next thread woken from the thread pool would take over a spinner…
In my opinion, to get the win you’re after, you’d rather want an async() version that if it wakes up the target queue hierarchy up to the root then you want to have more resistance in bringing up a new thread to satisfy that request. Fortunately, the overcommit property of queues could be used by a thread pool to decide to apply that resistance. There are various parts of the thread pool handling (especially without kernel work queues support) that could get some love to get these exact benefits without changing the API.
That would indeed be a very interesting idea, the problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread (as is commonly done now) with mutex/condition signaling is way too slow for the use case we have (thus the very ugly workaround with a spin thread for some deployments).
Essentially, for these kind of operations we really want to avoid all context switches as long as we can keep up with the rate of inbound data, and in general such dynamics would be a nice property to have - if the thread performing the async call was known to always return to the global pwq thread pool, it would be nicely solved by applying resistance as you suggest, the problem is what to do when it gets blocked and you thus get stuck.
Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases, perhaps the queue attribute suggested above could be another way of expressing it without introducing new dispatch API.
This e-mail is confidential and may contain legally privileged information. It is intended only for the addressees. If you have received this e-mail in error, kindly notify us immediately by telephone or e-mail and delete the message from your system.