libdispatch roadmap and api addition proposal

Joakim_Hassila · December 7, 2015, 12:55pm

Hi,

I think (and hope) that this is the proper forum for a few questions regarding libdispatch; otherwise any pointers are appreciated.

We are currently using libdispatch extensively on Linux (and Solaris for a while longer…) based on the previous version available from Mac OS forge (with later additions merged from opensource.apple.com) over time.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

Secondly, we have extended the public libdispatch API internally with one more flavor of dispatching, let’s call it ‘dispatch_async_inline’ - the semantics being ‘perform the processing of the work synchronously if we wouldn’t block the calling thread, if we would block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep our internal diffs indefinitely? Just to understand if it is worth the effort with a nicely packaged pull request or not...

The rationale for the API is that we are quite latency sensitive and want to use inline processing up until the point where we can’t keep up with the available work, at which point we would switch to asynchronous processing seamlessly (we have multiple producers). This means that the thread calling this API can be stolen for a significant amount of time (emptying the queue it was assigned to), but when the system is under ‘light' load, we don’t need to incur the wakeup penalty for a completely asynchronous dispatch.
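
A minimal sketch of the semantics described above, assuming a simplified serial queue built on a pthread mutex rather than on libdispatch itself. All names here (`inline_queue`, `work_item`, `async_inline`, `count_call`) are hypothetical and are not the internal patch. If the queue is idle, the caller runs the work on its own thread and then keeps draining whatever other producers enqueue in the meantime; if the queue is busy, the work is appended and the call returns immediately.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*work_fn)(void *ctx);

typedef struct work_item {
    work_fn fn;
    void *ctx;
    struct work_item *next;
} work_item;

typedef struct {
    pthread_mutex_t lock;    /* protects head, tail and busy */
    work_item *head, *tail;  /* FIFO of items waiting to run */
    bool busy;               /* true while some thread is draining */
} inline_queue;

#define INLINE_QUEUE_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false }

/* Example work function: counts how often it was called. */
static void count_call(void *ctx) { ++*(int *)ctx; }

/* Returns true if `item` ran on the calling thread, false if it was
 * handed over to the thread that is already draining the queue.
 * The caller owns `item` and must keep it alive until it has run. */
static bool async_inline(inline_queue *q, work_item *item)
{
    assert(item != NULL && item->fn != NULL);
    item->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->busy) {
        /* Running inline would block: enqueue and return at once. */
        if (q->tail) q->tail->next = item; else q->head = item;
        q->tail = item;
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    q->busy = true;  /* idle implies empty, so FIFO order is kept */
    pthread_mutex_unlock(&q->lock);

    for (;;) {
        item->fn(item->ctx);
        /* The calling thread is "stolen" until the queue is empty. */
        pthread_mutex_lock(&q->lock);
        item = q->head;
        if (item == NULL) {
            q->busy = false;
            pthread_mutex_unlock(&q->lock);
            return true;
        }
        q->head = item->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
    }
}
```

In this sketch the "asynchronous" fallback is executed by whichever thread currently drains the queue, not by a thread pool, and there is no target queue hierarchy at all, so it only illustrates the latency trade-off.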

Cheers,

Joakim

PS Big kudos to whoever at Apple is responsible for driving fundamentals like this out as OSS...

···

________________________________

This e-mail is confidential and may contain legally privileged information. It is intended only for the addressees. If you have received this e-mail in error, kindly notify us immediately by telephone or e-mail and delete the message from your system.

Tony_Parker · December 7, 2015, 7:37pm

Hi Joakim,

Hi,

I think (and hope) that this is the proper forum for a few questions wrt to libdispatch, otherwise any pointers are appreciated.

Yup, you’re in the right place.

We are currently using libdispatch extensively on Linux (and Solaris for a while longer…) based on the previous version available from Mac OS forge (with later additions merged from opensource.apple.com) over time.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

I think it’s reasonable to continue to depend on libkqueue here where we can. We may have to have some kind of config option to use pure userspace stuff on certain platforms. I’m also open to the idea of getting something with fewer dependencies and lower performance done as an early first step, so we can unblock all of the API above libdispatch that wants to just use queues.

Secondly, we have extended the public libdispatch API internally with one more flavor of dispatching, let’s call it ‘dispatch_async_inline’ - the semantics being ‘perform the processing of the work synchronously if we wouldn’t block the calling thread, if we would block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep our internal diffs indefinitely? Just to understand if it is worth the effort with a nicely packaged pull request or not…

The rationale for the API is that we are quite latency sensitive and want to use inline processing up until the point where we can’t keep up with the available work, at which point we would switch to asynchronous processing seamlessly (we have multiple producers). This means that the thread calling this API can be stolen for a significant amount of time (emptying the queue it was assigned to), but when the system is under ‘light' load, we don’t need to incur the wakeup penalty for a completely asynchronous dispatch.

Our most important goal for year one is to get the core library implementations up to date with where we are on Darwin on platforms like Linux. API changes are not out of the question but we have to make sure they align with that goal. This is the right place to discuss them, though. We’ll be in a much better place to evaluate it when we get dispatch building & running.

Cheers,

Joakim

PS Big kudos to whoever at Apple is responsible for driving fundamentals like this out as OSS…

Thanks for your interest in the project!

- Tony

···

On Dec 7, 2015, at 4:55 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

_______________________________________________
swift-corelibs-dev mailing list
swift-corelibs-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-corelibs-dev

Lily_Ballard · December 7, 2015, 8:30pm

Secondly, we have extended the public libdispatch API internally with
one more flavor of dispatching, let’s call it ‘dispatch_async_inline’
- the semantics being ‘perform the processing of the work
synchronously if we wouldn’t block the calling thread, if we would
block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep
our internal diffs indefinitely? Just to understand if it is worth the
effort with a nicely packaged pull request or not...

The rationale for the API is that we are quite latency sensitive and
want to use inline processing up until the point where we can’t keep
up with the available work, at which point we would switch to
asynchronous processing seamlessly (we have multiple producers). This
means that the thread calling this API can be stolen for a significant
amount of time (emptying the queue it was assigned to), but when the
system is under ‘light' load, we don’t need to incur the wakeup
penalty for a completely asynchronous dispatch.

I actually have an outstanding radar asking for this exact
functionality. My proposal called it `dispatch_try_sync()`, which didn't
actually call the dispatch_async() automatically but simply returned a
boolean value telling you if it ran the code. My use-case here wasn't
actually that I wanted to run the code async, but that I needed to do
two operations on a realtime thread in any order, one of which needed to
be on a queue, so I wanted to do something like

    BOOL done = dispatch_try_sync(queue, ^{ ... });
    do_other_work();
    if (!done) {
        dispatch_sync(queue, ^{ ... });
    }

My radar is still open (rdar://problem/16436943), but it got a response
as follows:

I think the best way to "emulate" this is to use a DATA_OR source, and
not semaphores or other things like that.

Most of the issues that I've seen with trylock() tend to be uses
looking like this:

    again:
      if (trylock()) {
        do {
          clear_marker();
          do_job();
        } while (has_marker());
        unlock();
      } else if (!has_marker()) {
        set_marker();
        goto again;
      }

and all unlockers check for the marker to do the said job before
unlock basically.

The thing is, most of the people use that wrongly and don't loop
properly making those coalescing checks racy, that's what dispatch
DATA_OR sources are for.

Many other uses can also be replaced with a dispatch_async()

and it's very clear that the reporter can do exactly what he wants
with a DATA_OR source. We should have a way to make sources act as
barriers (which I have a patch for), else we only provide half the
required primitives.

I don't see a compelling use case that can't be solved elegantly with
data sources today.

Using a DISPATCH_SOURCE_DATA_OR with a latch is a good alternative to
what you are doing.

We are continuing to work on this issue, and will follow up with
you again.
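
For readers unfamiliar with the coalescing that the response refers to, here is a minimal sketch of the idea in plain C11 atomics. It is not libdispatch's implementation, and the names `or_source`, `or_source_merge` and `or_source_take` are hypothetical. In libdispatch itself the corresponding calls are `dispatch_source_create()` with `DISPATCH_SOURCE_TYPE_DATA_OR`, `dispatch_source_merge_data()` on the producer side and `dispatch_source_get_data()` in the event handler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_ulong pending;  /* bits merged since the handler last ran */
} or_source;

/* Producer side: OR `bits` into the pending mask.
 * Returns true only for the producer that moved the mask from zero
 * to non-zero; that producer schedules the handler exactly once.
 * Every other producer is coalesced into the already pending run. */
static bool or_source_merge(or_source *s, unsigned long bits)
{
    assert(bits != 0);
    return atomic_fetch_or(&s->pending, bits) == 0;
}

/* Handler side: atomically take everything merged so far and reset
 * the mask, so a merge racing with the handler is never lost. */
static unsigned long or_source_take(or_source *s)
{
    return atomic_exchange(&s->pending, 0);
}
```

Because the "is work pending" check and the update are a single atomic operation on each side, there is no window in which a marker can be set after the worker has decided to stop, which is the race in the hand-rolled trylock loop.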

-Kevin Ballard

···

On Mon, Dec 7, 2015, at 04:55 AM, Joakim Hassila via swift-corelibs-dev wrote:

Pierre_Habouzit · December 7, 2015, 10:10pm

Hi Joakim, Kevin,

[ Full disclosure, I made that reply in rdar://problem/16436943 and your use case was slightly different IIRC but you’re right it’s a close enough problem ]

Dispatch internally has a notion of something that does almost that, called _dispatch_barrier_trysync_f[1]. However, it is used internally to serialize state changes on sources and queues such as setting the target queue or event handlers.

The problem is that this call bypasses the target queue hierarchy in its fastpath, which, while correct when changing the state of a given source or queue, is generally the wrong thing to do. Consider this code, assuming a public dispatch_barrier_trysync() existed:

    dispatch_queue_t outer = dispatch_queue_create("outer", NULL);
    dispatch_queue_t inner = dispatch_queue_create("inner", NULL);
    dispatch_set_target_queue(outer, inner);

    dispatch_async(inner, ^{
        // write global state protected by inner
    });
    dispatch_barrier_trysync(outer, ^{
        // write global state protected by inner
    });

Then if it works like the internal version we have today, the code above has a data race, which we’ll all agree is bad.
Or we do an API version that always goes through async when the queue you do the trysync on is not targeted at a global root queue, and that is weird, because the performance characteristics would completely depend on the target queue hierarchy, which, when layering and frameworks start to be at play, is a bad characteristic for a good API.

Or we don’t give up right away when the hierarchy is deep, but then that means that dispatch_trysync would need to be able to unwind all the locks it took, and then you have ordering issues because enqueuing that block that couldn’t run synchronously may end up being after another one and break the FIFO ordering of queues. Respecting this which is a desired property of our API and getting an efficient implementation are somehow at odds.
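
The difference between the options above can be sketched with plain pthread mutexes standing in for queues. This is only an analogy under stated assumptions: `lock_queue`, `trysync_bypass`, `trysync_hierarchy` and `note_ran` are hypothetical names, and real dispatch queues are not implemented as a chain of mutexes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct lock_queue {
    pthread_mutex_t lock;       /* held while a block runs on this queue */
    struct lock_queue *target;  /* NULL once the root is reached */
} lock_queue;

/* Example work function: counts how often it ran. */
static void note_ran(void *ctx) { ++*(int *)ctx; }

/* Fastpath that looks only at `q`, like the internal call described
 * above: it can run `fn` while the target queue is busy. */
static bool trysync_bypass(lock_queue *q, void (*fn)(void *), void *ctx)
{
    if (pthread_mutex_trylock(&q->lock) != 0) return false;
    fn(ctx);
    pthread_mutex_unlock(&q->lock);
    return true;
}

/* Variant that respects the hierarchy: it must acquire every queue up
 * to the root and unwind everything it took as soon as one is busy. */
static bool trysync_hierarchy(lock_queue *q, void (*fn)(void *), void *ctx)
{
    assert(q != NULL);
    for (lock_queue *cur = q; cur != NULL; cur = cur->target) {
        if (pthread_mutex_trylock(&cur->lock) != 0) {
            for (lock_queue *u = q; u != cur; u = u->target)
                pthread_mutex_unlock(&u->lock);
            return false;  /* caller would fall back to async here */
        }
    }
    fn(ctx);
    for (lock_queue *cur = q; cur != NULL; cur = cur->target)
        pthread_mutex_unlock(&cur->lock);
    return true;
}
```

With `outer` targeting `inner` and `inner` currently busy, `trysync_bypass(&outer, ...)` still runs the work, which is the data race, while `trysync_hierarchy(&outer, ...)` fails and has to unwind. The sketch does not address the FIFO ordering problem of the fallback enqueue.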

The other argument against trysync is that during testing it would almost always go through the non-contended codepath, so developers would not realize that they should have taken copies of variables and the like (this is less of a problem on Darwin with obj-c and ARC); trysync running on the same thread will hide that. Once it starts being contended in production, it’ll bite you hard with memory corruption everywhere.

Technically what you’re after is that bringing up a new thread is very costly and that you’d rather use the one that’s asyncing the request because it will soon give up control. The wake up of a queue isn’t that expensive, in the sense that the overhead of dispatch_sync() in terms of memory barriers and locking is more or less comparable. What’s expensive is creating a thread to satisfy this enqueue.

In my opinion, to get the win you’re after, you’d rather want an async() version that, when it wakes up the target queue hierarchy all the way up to the root, applies more resistance to bringing up a new thread to satisfy that request. Fortunately, the overcommit property of queues could be used by a thread pool to decide to apply that resistance. There are various parts of the thread pool handling (especially without kernel work queues support) that could get some love to get these exact benefits without changing the API.
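
One way to picture such "resistance" is as a small policy function inside a userspace thread pool. The sketch below is purely illustrative: `pool_state`, `pool_action` and `pool_on_enqueue` are hypothetical names, and the rules are assumptions, not how libdispatch or libpthread_workqueue decide today.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int idle_workers;         /* parked threads that can simply be woken */
    int total_workers;        /* threads currently owned by the pool */
    int max_workers;          /* hard cap on pool size */
    bool enqueuer_is_worker;  /* caller is a pool thread about to return */
} pool_state;

typedef enum { POOL_WAKE_IDLE, POOL_SPAWN, POOL_DEFER } pool_action;

/* Decide what to do when an enqueue wakes a queue up to the root. */
static pool_action pool_on_enqueue(const pool_state *s, bool overcommit)
{
    assert(s->idle_workers >= 0 && s->total_workers <= s->max_workers);

    if (s->idle_workers > 0)
        return POOL_WAKE_IDLE;  /* waking is cheaper than creating */
    if (s->total_workers >= s->max_workers)
        return POOL_DEFER;      /* cannot grow any further */
    if (overcommit)
        return POOL_SPAWN;      /* overcommit queues never wait */
    if (s->enqueuer_is_worker)
        return POOL_DEFER;      /* resistance: the caller will soon pick
                                   the work up itself */
    return POOL_SPAWN;
}
```

The interesting case is the last `POOL_DEFER`: a non-overcommit enqueue from a thread that is about to return to the pool does not create a thread, which gives behaviour close to inline processing without any new API.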

[1] https://github.com/apple/swift-corelibs-libdispatch/blob/394d9a1c8be525cde8d9dd9fb8cef8308089b9c5/src/queue.c#L3089

-Pierre

das · December 7, 2015, 10:11pm

Hi Joakim,

Hi,

I think (and hope) that this is the proper forum for a few questions wrt to libdispatch, otherwise any pointers are appreciated.

We are currently using libdispatch extensively on Linux (and Solaris for a while longer…) based on the previous version available from Mac OS forge (with later additions merged from opensource.apple.com) over time.

FWIW I’ve updated the macosforge svn repo trunk to match with github swift-corelibs-libdispatch trunk (sans the PRs, except for my buildsystem one), but going forward we are likely going to retire the macosforge repository in favor of the github one.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

yes, staying with libpthread_workqueue is the focus of the current Linux porting effort, but it may make sense to move to something more native over time, e.g. like on FreeBSD where a version of the kernel workqueue was implemented natively.

Secondly, we have extended the public libdispatch API internally with one more flavor of dispatching, let’s call it ‘dispatch_async_inline’ - the semantics being ‘perform the processing of the work synchronously if we wouldn’t block the calling thread, if we would block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep our internal diffs indefinitely? Just to understand if it is worth the effort with a nicely packaged pull request or not...

The rationale for the API is that we are quite latency sensitive and want to use inline processing up until the point where we can’t keep up with the available work, at which point we would switch to asynchronous processing seamlessly (we have multiple producers). This means that the thread calling this API can be stolen for a significant amount of time (emptying the queue it was assigned to), but when the system is under ‘light' load, we don’t need to incur the wakeup penalty for a completely asynchronous dispatch.

sounds familiar, have we talked about this in the past somewhere?

we actually have something quite similar internal to the library already: _dispatch_barrier_trysync_f

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/queue.c#L3089

but it currently (intentionally) ignores anything about the target queue hierarchy of the queue passed in (e.g. it will allow the sync even if the target queue is busy or suspended), so is not suitable as a general facility.

There are various technical reasons why we don’t believe this primitive in all generality is a good idea, Pierre is writing up an email about that so I won’t go into details here.

Daniel

···

On Dec 7, 2015, at 4:55, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

Cheers,

Joakim

PS Big kudos to whoever at Apple is responsible for driving fundamentals like this out as OSS…

Joakim_Hassila · December 8, 2015, 2:16pm

Hi Tony,

I think it’s reasonable to continue to depend on libkqueue here where we can. We may have to have some kind of config option to use pure userspace stuff on certain platforms. I’m also open to the idea of getting something with fewer dependencies and lower performance done as an early first step, so we can unblock all of the API above libdispatch that wants to just use queues.

That would make sense - just want to point out that libkqueue and libpwq are separate - the same question posed for libpwq in a separate mail would be relevant for libkqueue as well - would closer integration make sense, or just keep things separate?

Our most important goal for year one is to get the core library implementations up to date with where we are on Darwin on platforms like Linux. API changes are not out of the question but we have to make sure they align with that goal. This is the right place to discuss them, though. We’ll be in a much better place to evaluate it when we get dispatch building & running.

Thanks, completing the bringup of the core libraries first makes sense; we’d just want to understand how to approach (the very few) API additions we’ve made, and it’s good to know this is the right forum.

Cheers,

Joakim

···

On 7 dec. 2015, at 20:37, Tony Parker <anthony.parker@apple.com> wrote:

Joakim_Hassila · December 8, 2015, 2:11pm

Hi Daniel,

FWIW I’ve updated the macosforge svn repo trunk to match with github swift-corelibs-libdispatch trunk (sans the PRs, except for my buildsystem one), but going forward we are likely going to retire the macosforge repository in favor of the github one.

That seems very reasonable and would make sense I think, there doesn’t seem to be much rationale for overlap.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

yes, staying with libpthread_workqueue is the focus of the current Linux porting effort, but it may make sense to move to something more native over time, e.g. like on FreeBSD where a version of the kernel workqueue was implemented natively.

Ok, that’s great. Previously there was a discussion about integrating libpthread_workqueue directly into the libdispatch project, to reduce the number of dependencies needed to get a reasonably working libdispatch running. Mark Heily has also put it up on GitHub (mheily/libpwq: libpthread_workqueue, a POSIX threads workqueue library). It has been quite dormant for the last few years, but I think that is largely due to things working reasonably well.

So would such closer integration be desirable to make things build more out of the box, or would you prefer to only use it if found during build time on the current host? (I would probably prefer the first option, as it essentially just provides support for functionality that the underlying platform lacks - the current libpwq supports a few platforms…).

Secondly, we have extended the public libdispatch API internally with one more flavor of dispatching, let’s call it ‘dispatch_async_inline’ - the semantics being ‘perform the processing of the work synchronously if we wouldn’t block the calling thread, if we would block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep our internal diffs indefinitely? Just to understand if it is worth the effort with a nicely packaged pull request or not...

The rationale for the API is that we are quite latency sensitive and want to use inline processing up until the point where we can’t keep up with the available work, at which point we would switch to asynchronous processing seamlessly (we have multiple producers). This means that the thread calling this API can be stolen for a significant amount of time (emptying the queue it was assigned to), but when the system is under ‘light' load, we don’t need to incur the wakeup penalty for a completely asynchronous dispatch.

sounds familiar, have we talked about this in the past somewhere?

Well, it could well be that we touched upon it on the old libdispatch mailing list a few years ago (I did change my surname from Johansson -> Hassila as well as the company mail address, so it might have thrown things off for you :-). I did primarily spend some time in helping clean things up for usage on Solaris at that time though.

we actually have something quite similar internal to the library already: _dispatch_barrier_trysync_f

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/queue.c#L3089

but it currently (intentionally) ignores anything about the target queue hierarchy of the queue passed in (e.g. it will allow the sync even if the target queue is busy or suspended), so is not suitable as a general facility.

There are various technical reasons why we don’t believe this primitive in all generality is a good idea, Pierre is writing up an email about that so I won’t go into details here.

Thanks! I will reply to that separately.

Cheers,

Joakim

···

On 7 dec. 2015, at 23:11, Daniel A. Steffen <das@apple.com> wrote:

Joakim_Hassila · December 8, 2015, 3:34pm

Hi Pierre,

Thanks for the good explanation, will try to respond inline below:

···

On 7 dec. 2015, at 23:10, Pierre Habouzit <phabouzit@apple.com> wrote:

Hi Joakim, Kevin,

[ Full disclosure, I made that reply in rdar://problem/16436943 and your use case was slightly different IIRC but you’re right it’s a close enough problem ]

Dispatch internally has a notion of something that does almost that, called _dispatch_barrier_trysync_f[1]. However, it is used internally to serialize state changes on sources and queues such as setting the target queue or event handlers.

The problem is that this call bypasses the target queue hierarchy in its fastpath, which, while correct when changing the state of a given source or queue, is generally the wrong thing to do. Consider this code, assuming a public dispatch_barrier_trysync() existed:

    dispatch_queue_t outer = dispatch_queue_create("outer", NULL);
    dispatch_queue_t inner = dispatch_queue_create("inner", NULL);
    dispatch_set_target_queue(outer, inner);

    dispatch_async(inner, ^{
        // write global state protected by inner
    });
    dispatch_barrier_trysync(outer, ^{
        // write global state protected by inner
    });

Then if it works like the internal version we have today, the code above has a data race, which we’ll all agree is bad.
Or we do an API version that always goes through async when the queue you do the trysync on is not targeted at a global root queue, and that is weird, because the performance characteristics would completely depend on the target queue hierarchy, which, when layering and frameworks start to be at play, is a bad characteristic for a good API.

Yes, we could currently assume that we only targeted a root queue for our use case, so our implementation has this limitation (so it is not a valid general solution as you say).

It would perhaps be a bit strange to have different performance characteristics depending on the target queue hierarchy as you say, but there are already some performance differences in actual behavior if using e.g. an overcommit queue vs. a non-overcommit one, so perhaps another option would be to have this as an optional queue attribute instead of an additional generic API (queue attribute ‘steal calling thread for inline processing of requests if the queue was empty when dispatching’) …?

Or we don’t give up right away when the hierarchy is deep, but then that means that dispatch_trysync would need to be able to unwind all the locks it took, and then you have ordering issues because enqueuing that block that couldn’t run synchronously may end up being after another one and break the FIFO ordering of queues. Respecting this which is a desired property of our API and getting an efficient implementation are somehow at odds.

Yes, agree it is a desirable property of the API to retain the ordering.

The other argument against trysync is that during testing it would almost always go through the non-contended codepath, so developers would not realize that they should have taken copies of variables and the like (this is less of a problem on Darwin with obj-c and ARC); trysync running on the same thread will hide that. Once it starts being contended in production, it’ll bite you hard with memory corruption everywhere.

Less of an issue for us as we depend on the _f interfaces throughout due to portability concerns, but fair point.

Technically what you’re after is that bringing up a new thread is very costly and that you’d rather use the one that’s asyncing the request because it will soon give up control. The wake up of a queue isn’t that expensive, in the sense that the overhead of dispatch_sync() in terms of memory barriers and locking is more or less comparable. What’s expensive is creating a thread to satisfy this enqueue.

Yes, in fact, bringing up a new thread is so costly that we keep a pool around in the libpwq implementation. Unfortunately we would often see double-digit microsecond latency incurred by waking a pooled thread, which is unacceptable for us, so we had to (for some configurations/special deployments) have a dedicated spin thread that will grab the next queue to work on (that cut down the latency by a factor of 10 or so), and the next thread woken from the thread pool would take over as spinner…
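
To make the trade-off concrete, here is a minimal sketch of a waiter that spins for a bounded number of iterations before falling back to mutex/condition signaling. It is not the libpwq code and only approximates the dedicated spin thread described above; `wake_flag`, `wake_post` and `wake_wait` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool ready;     /* set by the producer, consumed by the waiter */
    pthread_mutex_t lock;  /* used only on the slow, blocking path */
    pthread_cond_t cond;
} wake_flag;

#define WAKE_FLAG_INIT \
    { false, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }

/* Producer: publish work and wake a waiter that may be blocked. */
static void wake_post(wake_flag *w)
{
    pthread_mutex_lock(&w->lock);
    atomic_store(&w->ready, true);
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Waiter: returns true if the work was picked up while spinning (no
 * context switch), false if it had to go through the blocking path. */
static bool wake_wait(wake_flag *w, int spins)
{
    assert(spins >= 0);
    for (int i = 0; i < spins; i++) {
        if (atomic_exchange(&w->ready, false))
            return true;
    }
    pthread_mutex_lock(&w->lock);
    while (!atomic_exchange(&w->ready, false))
        pthread_cond_wait(&w->cond, &w->lock);
    pthread_mutex_unlock(&w->lock);
    return false;
}
```

The spin path burns a core to avoid the wakeup latency, which is why it only makes sense for special deployments.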

In my opinion, to get the win you’re after, you’d rather want an async() version that, when it wakes up the target queue hierarchy all the way up to the root, applies more resistance to bringing up a new thread to satisfy that request. Fortunately, the overcommit property of queues could be used by a thread pool to decide to apply that resistance. There are various parts of the thread pool handling (especially without kernel work queues support) that could get some love to get these exact benefits without changing the API.

That would indeed be a very interesting idea, the problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread (as is commonly done now) with mutex/condition signaling is way too slow for the use case we have (thus the very ugly workaround with a spin thread for some deployments).

Essentially, for these kinds of operations we really want to avoid all context switches as long as we can keep up with the rate of inbound data, and in general such dynamics would be a nice property to have. If the thread performing the async call was known to always return to the global pwq thread pool, it would be nicely solved by applying resistance as you suggest; the problem is what to do when it gets blocked and you thus get stuck.

Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases, perhaps the queue attribute suggested above could be another way of expressing it without introducing new dispatch API.

Cheers,

Joakim

Pierre_Habouzit1 · December 8, 2015, 3:56pm

Hi,

FWIW, this is my personal, let’s call it enlightened, opinion, based on my knowledge of dispatch and my past extensive system programming experience with Linux before I joined Apple.

I think that long term, the best way to maintain a Linux libdispatch port is to move away from libkqueue, which tries to emulate kqueue fully, whereas dispatch only needs a small subset of the surface of kqueue. Given how source.c is written today, this is not a very small undertaking, but eventually dispatch sources map to epoll_ctl(EPOLLONESHOT) very very well.
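
As a rough illustration of that mapping (Linux only), the sketch below registers a file descriptor with `EPOLLONESHOT`, so it fires once and then stays disarmed until it is explicitly re-enabled with `EPOLL_CTL_MOD`, much like a source that is resumed after its handler has run. The `oneshot_*` helper names are hypothetical and this is not code from source.c.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Arm (or re-arm) `fd` for a single readability event. */
static int oneshot_ctl(int ep, int op, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                              .data = { .fd = fd } };
    return epoll_ctl(ep, op, fd, &ev);
}

static int oneshot_add(int ep, int fd)
{
    return oneshot_ctl(ep, EPOLL_CTL_ADD, fd);
}

static int oneshot_rearm(int ep, int fd)
{
    return oneshot_ctl(ep, EPOLL_CTL_MOD, fd);
}

/* Wait for one event; returns the ready fd, or -1 on timeout/error. */
static int oneshot_wait(int ep, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(ep, &ev, 1, timeout_ms);
    return n == 1 ? ev.data.fd : -1;
}

/* "Event handler": consume data, then re-enable the source. While the
 * handler runs the fd is disarmed, so no second event can be delivered
 * for it concurrently. */
static ssize_t oneshot_handle(int ep, int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (oneshot_rearm(ep, fd) != 0)
        return -1;
    return n;
}
```

After `oneshot_wait()` reports a descriptor, further waits will not report it again, even if data is still pending, until `oneshot_handle()` (or `oneshot_rearm()`) has been called.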

Given our experience with the work queue subsystem in Darwin, I think that it would make sense to integrate both projects together, as work queues are not that useful if you don’t have dispatch with them, and having it separate gives you all the woes of a stable interface, which you don’t really care for in the first place. It’s probably much better to integrate it, not care about backward and forward compatibility, and make it a private library of dispatch on Linux, without being tied to a given interface at all.

I also think that minimal kernel support for thread pool management isn’t that hard to write as a kernel module. I had started to work on this a very long time ago, using the KVM scheduling hooks that let you know when a thread blocks and/or becomes runnable[1]. Threads would declare to that interface that they are work queue threads, and get load information that the thread pool can use to regulate. It’s old code, maybe (probably?) not the right way to do it, but that’s an example of things you can do if you move away from the contrived interface that libpthread_workqueue exposes. My idea required a Linux adjustment that I posted to the LKML at the time (http://lkml.iu.edu/hypermail/linux/kernel/1112.2/00235.html); not sure if it ever made it to mainline (looks like it didn’t).

[1] git.madism.org Git - ~madcoder/pwqr.git/blob - kernel/pwqr.c

-Pierre

···

On Dec 8, 2015, at 6:11 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

Hi Daniel,

On 7 dec. 2015, at 23:11, Daniel A. Steffen <das@apple.com> wrote:

FWIW I’ve updated the macosforge svn repo trunk to match with github swift-corelibs-libdispatch trunk (sans the PRs, except for my buildsystem one), but going forward we are likely going to retire the macosforge repository in favor of the github one.

That seems very reasonable and would make sense I think, there doesn’t seem to be much rationale for overlap.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

yes, staying with libpthread_workqueue is the focus of the current Linux porting effort, but it may make sense to move to something more native over time, e.g. like on FreeBSD where a version of the kernel workqueue was implemented natively.

Ok, that’s great - previously there was a discussion to actually integrate libpthread_workqueue at least directly into the libdispatch project to reduce the number of dependencies to get a reasonably working libdispatch running - currently Mark Heily put it up on GitHub as well at GitHub - mheily/libpwq: libpthread_workqueue - a POSIX threads workqueue library - it has been quite dormant for the last few years, but I think that is largely due to things working reasonably well.

So would such more close integration be desirable to make things build more out of the box, or would you prefer to only use it if found during build time on the current host? (I would probably prefer the first option, as it essentially just provides support for functionality that the underlying platform lacks - the current libpwq supports a few platforms…).

Pierre_Habouzit · December 8, 2015, 4:07pm

-Pierre

Hi Pierre,

Thanks for the good explanation, will try to respond inline below:

Hi Joakim, Kevin,

[ Full disclosure, I made that reply in rdar://problem/16436943 and your use case was slightly different IIRC but you’re right it’s a close enough problem ]

Dispatch internally has a notion of something that does almost that, called _dispatch_barrier_trysync_f[1]. However, it is used internally to serialize state changes on sources and queues such as setting the target queue or event handlers.

The problem is that this call bypasses the target queue hierarchy in its fastpath, which, while it’s correct when changing the state of a given source or queue, is generally the wrong thing to do. Let’s consider this code, assuming a public dispatch_barrier_trysync() existed:

    dispatch_queue_t outer = dispatch_queue_create("outer", NULL);
    dispatch_queue_t inner = dispatch_queue_create("inner", NULL);
    // outer now drains through inner, so inner serializes both queues
    dispatch_set_target_queue(outer, inner);

    dispatch_async(inner, ^{
        // write global state protected by inner
    });
    // hypothetical API: would run inline whenever outer itself is idle,
    // without checking whether inner is busy
    dispatch_barrier_trysync(outer, ^{
        // write global state protected by inner
    });

Then if it works like the internal version we have today, the code above has a data race, which we’ll all agree is bad.
Or we do an API version that, when the queue you do the trysync on is not targeted at a global root queue, always goes through async, and that is weird, because the performance characteristics would completely depend on the target queue hierarchy, which, when layering and frameworks start to be at play, is a bad characteristic for a good API.

Yes, we could currently assume that we only targeted a root queue for our use case, so our implementation has this limitation (so it is not a valid general solution as you say).

It would perhaps be a bit strange to have different performance characteristics depending on the target queue hierarchy as you say, but there are already some performance differences in actual behavior if using e.g. an overcommit queue vs a non, so perhaps another option would be to have this as an optional queue attribute instead of an additional generic API (queue attribute ’steal calling thread for inline processing of requests if the queue was empty when dispatching’) …?

My point is, adding API to dispatch is not something we do lightly. I’m not keen on an interface that only works for base queues. In Mac OS and iOS code, where dispatchy code is pervasive, queue hierarchies more than 2 queues deep are very common.

Or we don’t give up right away when the hierarchy is deep, but then that means that dispatch_trysync would need to be able to unwind all the locks it took, and then you have ordering issues, because the block that couldn’t run synchronously may end up being enqueued after another one and break the FIFO ordering of queues. Respecting this ordering, which is a desired property of our API, and getting an efficient implementation are somewhat at odds.

Yes, agree it is a desirable property of the API to retain the ordering.

The other argument against trysync done that way is that during testing trysync would almost always go through the non-contended codepath, and lead developers to not realize that they should have taken copies of variables and the like (this is less of a problem on Darwin with obj-c and ARC), because trysync running on the same thread will hide that. Except that once it starts being contended in production, it’ll bite you hard with memory corruption everywhere.

Less of an issue for us as we depend on the _f interfaces throughout due to portability concerns, but fair point.

Technically what you’re after is that bringing up a new thread is very costly and that you’d rather use the one that’s asyncing the request because it will soon give up control. The wake up of a queue isn’t that expensive, in the sense that the overhead of dispatch_sync() in terms of memory barriers and locking is more or less comparable. What’s expensive is creating a thread to satisfy this enqueue.

Yes, in fact, bringing up a new thread is so costly that we keep a pool around in the libpwq implementation. Unfortunately we would often see double-digit microsecond latency incurred by this, which is unacceptable for us, so we had to (for some configurations/special deployments) have a dedicated spin thread that will grab the next queue to work on (that cut down the latency by a factor of 10 or so) and the next thread woken from the thread pool would take over a spinner…

In my opinion, to get the win you’re after, you’d rather want an async() version that if it wakes up the target queue hierarchy up to the root then you want to have more resistance in bringing up a new thread to satisfy that request. Fortunately, the overcommit property of queues could be used by a thread pool to decide to apply that resistance. There are various parts of the thread pool handling (especially without kernel work queues support) that could get some love to get these exact benefits without changing the API.

That would indeed be a very interesting idea, the problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread (as is commonly done now) with mutex/condition signaling is way too slow for the use case we have (thus the very ugly workaround with a spin thread for some deployments).

Essentially, for these kinds of operations we really want to avoid all context switches as long as we can keep up with the rate of inbound data, and in general such dynamics would be a nice property to have. If the thread performing the async call was known to always return to the global pwq thread pool, it would be nicely solved by applying resistance as you suggest; the problem is what to do when it gets blocked and you thus get stuck.

Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases, perhaps the queue attribute suggested above could be another way of expressing it without introducing new dispatch API.

I completely agree with you, but I think that the way to address this is by making the thread pool smarter, not having the developer sprinkle his code with dispatch_barrier_trysync() where he feels like it. Using it properly requires a deep understanding of the implementation of dispatch he’s using, and that changes with each platform / version combination. That’s not really the kind of interface we want to build.

“overcommit” is exactly the hint you’re after as far as the queue is concerned. It means “if I’m woken up, bring up a new thread provided it doesn’t blow up the system, no matter what”. So make your queue non overcommit by targeting it manually to dispatch_get_global_queue(0, 0) (that one isn’t overcommit), and make the thread pool smarter. That’s the right way to go and the design-compatible way to do it.

If your thread blocks in read() then I would argue that it should use a READ dispatch source instead; that way, the source would get enqueued *after* your async and you can ping pong. Doing blocking read()s is not dispatchy at all and will cause you all sorts of problems like that one, because re-async doesn’t work for you.

-Pierre

···

On Dec 8, 2015, at 7:34 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

On 7 dec. 2015, at 23:10, Pierre Habouzit <phabouzit@apple.com> wrote:

das · December 8, 2015, 5:05pm

Hi Daniel,

FWIW I’ve updated the macosforge svn repo trunk to match with github swift-corelibs-libdispatch trunk (sans the PRs, except for my buildsystem one), but going forward we are likely going to retire the macosforge repository in favor of the github one.

That seems very reasonable and would make sense I think, there doesn’t seem to be much rationale for overlap.

I have a few questions on how (particularly Apple folks) view this going forward:

First, the previous port to Linux/Solaris of libdispatch was dependent on libkqueue and more importantly on libpthread_workqueue (to have some heuristics for managing the number of threads when lacking kernel support).

How do you view this, would you consider integrating support for libpthread_workqueue, or would you have another preference for how to manage this on other platforms (Linux for starters, but essentially any lacking the pthread_workqueue interface)?

yes, staying with libpthread_workqueue is the focus of the current Linux porting effort, but it may make sense to move to something more native over time, e.g. like on FreeBSD where a version of the kernel workqueue was implemented natively.

Ok, that’s great - previously there was a discussion to actually integrate libpthread_workqueue at least directly into the libdispatch project to reduce the number of dependencies to get a reasonably working libdispatch running - currently Mark Heily put it up on GitHub as well at GitHub - mheily/libpwq: libpthread_workqueue - a POSIX threads workqueue library - it has been quite dormant for the last few years, but I think that is largely due to things working reasonably well.

So would such more close integration be desirable to make things build more out of the box, or would you prefer to only use it if found during build time on the current host? (I would probably prefer the first option, as it essentially just provides support for functionality that the underlying platform lacks - the current libpwq supports a few platforms…).

That seems like a good idea in principle, I agree that it makes good technical sense given libdispatch is presumably the only client of this library, but short term continuing to keep it separate will likely be easiest (for boring non-technical reasons)

In particular I’ll have to figure out what the situation would be with us continuing to take changes internally from the github repo after importing a whole contributed project into it (as opposed to incremental patches to the existing sourcebase), ideally I would really prefer to not significantly diverge from our internal repo to make that process as straightforward as possible (essentially a git merge…)

···

On Dec 8, 2015, at 6:11, Joakim Hassila <Joakim.Hassila@orc-group.com> wrote:

On 7 dec. 2015, at 23:11, Daniel A. Steffen <das@apple.com> wrote:

Secondly, we have extended the public libdispatch API internally with one more flavor of dispatching, let’s call it ‘dispatch_async_inline’ - the semantics being ‘perform the processing of the work synchronously if we wouldn’t block the calling thread, if we would block, instead perform the work as a normal dispatch_async’.

Would such a change be considered to be integrated, or should we keep our internal diffs indefinitely? Just to understand if it is worth the effort with a nicely packaged pull request or not...

The rationale for the API is that we are quite latency sensitive and want to use inline processing up until the point where we can’t keep up with the available work, at which point we would switch to asynchronous processing seamlessly (we have multiple producers). This means that the thread calling this API can be stolen for a significant amount of time (emptying the queue it was assigned to), but when the system is under ‘light' load, we don’t need to incur the wakeup penalty for a completely asynchronous dispatch.

sounds familiar, have we talked about this in the past somewhere?

Well, it could well be that we touched upon it on the old libdispatch mailing list a few years ago (I did change my surname from Johansson -> Hassila as well as the company mail address, so it might have thrown things off for you :-). I did primarily spend some time in helping clean things up for usage on Solaris at that time though.

we actually have something quite similar internal to the library already: _dispatch_barrier_trysync_f

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/queue.c#L3089

but it currently (intentionally) ignores anything about the target queue hierarchy of the queue passed in (e.g. it will allow the sync even if the target queue is busy or suspended), so is not suitable as a general facility.
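For readers following along, the "run inline if the queue is idle, otherwise enqueue" semantics under discussion can be sketched in plain C with pthreads. This is only an illustration: struct inline_queue and inline_queue_submit are hypothetical names, not libdispatch API and not the internal _dispatch_barrier_trysync_f. Like the internal primitive, the sketch knows nothing about a target queue hierarchy, which is exactly the limitation described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*work_fn)(void *ctx);

struct work_item { work_fn fn; void *ctx; struct work_item *next; };

/* Hypothetical stand-in for a serial queue; not a libdispatch type. */
struct inline_queue {
    pthread_mutex_t lock;
    bool busy;                      /* some thread is currently draining */
    struct work_item *head, *tail;  /* FIFO of deferred work */
};

void inline_queue_init(struct inline_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->busy = false;
    q->head = q->tail = NULL;
}

/* Returns true if fn ran on the calling thread, false if it was deferred. */
bool inline_queue_submit(struct inline_queue *q, work_fn fn, void *ctx)
{
    pthread_mutex_lock(&q->lock);
    if (q->busy) {
        /* Someone else owns the queue: enqueue in FIFO order and return. */
        struct work_item *item = malloc(sizeof *item);
        if (item == NULL)
            abort();
        item->fn = fn;
        item->ctx = ctx;
        item->next = NULL;
        if (q->tail) q->tail->next = item; else q->head = item;
        q->tail = item;
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    q->busy = true;                 /* idle implies empty, so run inline */
    pthread_mutex_unlock(&q->lock);
    fn(ctx);

    /* The caller is now "stolen" until the queue is empty again. */
    for (;;) {
        pthread_mutex_lock(&q->lock);
        struct work_item *item = q->head;
        if (item == NULL) {
            q->busy = false;
            pthread_mutex_unlock(&q->lock);
            return true;
        }
        q->head = item->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        item->fn(item->ctx);
        free(item);
    }
}

/* Single-threaded demo: work submitted while the queue is busy is deferred. */
static struct inline_queue demo_q;
static int order[3], order_len;
static bool nested_ran_inline;

static void record(void *ctx) { order[order_len++] = (int)(intptr_t)ctx; }

static void outer(void *ctx)
{
    record(ctx);
    nested_ran_inline = inline_queue_submit(&demo_q, record, (void *)(intptr_t)2);
    record((void *)(intptr_t)3);
}

int inline_queue_demo(void)
{
    inline_queue_init(&demo_q);
    order_len = 0;
    bool ran_inline = inline_queue_submit(&demo_q, outer, (void *)(intptr_t)1);
    assert(order_len == 3);
    if (!ran_inline || nested_ran_inline)
        return -1;
    return order[0] * 100 + order[1] * 10 + order[2];
}
```

In the demo the outer item runs inline, the nested submission is deferred because the queue is busy, and the recorded order is 1, 3, 2, so inline_queue_demo() returns 132.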

There are various technical reasons why we don’t believe this primitive in all generality is a good idea, Pierre is writing up an email about that so I won’t go into details here.

Thanks! I will reply to that separately.

Cheers,

Joakim


Pierre_Habouzit1 · December 8, 2015, 5:32pm

That is a good point.

Merging the codebases doesn’t necessarily require that they live in the same source repository though. I’m just arguing that if the workqueue code/emulation/layer is meant to only have dispatch as a client, it allows for something more flexible.

-Pierre


Joakim_Hassila · December 10, 2015, 8:36am

Ok, that’s great - previously there was a discussion to actually integrate libpthread_workqueue at least directly into the libdispatch project to reduce the number of dependencies to get a reasonably working libdispatch running - currently Mark Heily put it up on GitHub as well at GitHub - mheily/libpwq: libpthread_workqueue - a POSIX threads workqueue library - it has been quite dormant for the last few years, but I think that is largely due to things working reasonably well.

So would such more close integration be desirable to make things build more out of the box, or would you prefer to only use it if found during build time on the current host? (I would probably prefer the first option, as it essentially just provides support for functionality that the underlying platform lacks - the current libpwq supports a few platforms…).

That seems like a good idea in principle, I agree that it makes good technical sense given libdispatch is presumably the only client of this library, but short term continuing to keep it separate will likely be easiest (for boring non-technical reasons)

Ok.

In particular I’ll have to figure out what the situation would be with us continuing to take changes internally from the github repo after importing a whole contributed project into it (as opposed to incremental patches to the existing sourcebase), ideally I would really prefer to not significantly diverge from our internal repo to make that process as straightforward as possible (essentially a git merge…)

Right - would perhaps be good to try to have some guidelines on how such integration should be done in general (hopefully support for additional platforms will be added over time, so it would be good to have a systematic approach to how to do the platform specific changes without breaking your merge process completely - I think that is in everyone’s interest). Would you make a suggestion on what would work well in practice for the long term?

Out of curiosity on a more philosophical note - do you view the libdispatch internal repo or the GitHub one to be ‘upstream’? In the Ars interview, Craig Federighi said “The Swift team will be developing completely in the open on GitHub” which implied the GitHub version being the ‘upstream’ one - how you view that for libdispatch would possibly impact the ‘divergence’ aspect… I understand the answer can well be different for various reasons...

Cheers,

Joakim

···

On 8 dec. 2015, at 18:05, Daniel A. Steffen <das@apple.com> wrote:

On Dec 8, 2015, at 6:11, Joakim Hassila <Joakim.Hassila@orc-group.com> wrote:


Joakim_Hassila · December 10, 2015, 8:42am

Hi,

FWIW, this is my personal, let’s call it enlightened, opinion, based on my knowledge of dispatch and my past extensive system programming experience with Linux before I joined Apple.

I think that long term, the best way to maintain a Linux libdispatch port is to move away from libkqueue, which tries to emulate kqueue fully, whereas dispatch only needs a small subset of the surface of kqueue. Given how source.c is written today, this is not a very small undertaking, but eventually dispatch sources map to epoll_ctl(EPOLLONESHOT) very, very well.

That makes sense, could simplify the implementation (and keep things cleaner). Then the follow up question is of course how to split/manage source.c (as Daniel pointed out there is the merging issue).

Given our experience with the work queue subsystem in Darwin, I think that it would make sense to integrate both projects together, as work queues are not that useful if you don’t have dispatch with them, and having it separate gives you all the woes of a stable interface, which you don’t really care for in the first place. It’s probably much better to integrate it, not care about backward and forward compatibility, and make it a private library of dispatch on Linux, without being tied to a given interface at all.

Agree, I don’t see much use for pwq except in this support role, so there would be a large degree of freedom.

I also think that minimal kernel support for thread pool management isn’t that hard to write as a kernel module. I had started to work on this a very long time ago, using the KVM scheduling hooks that let you know when a thread blocks and/or becomes runnable[1]. Threads would declare to that interface that they are work queue threads, and get load information that the thread pool can use to regulate. It’s old code, maybe (probably?) not the right way to do it, but that’s an example of things you can do if you move away from the contrived interface that libpthread_workqueue exposes. My idea required a Linux adjustment that I posted to the LKML at the time (http://lkml.iu.edu/hypermail/linux/kernel/1112.2/00235.html); not sure if it ever made it to mainline (looks like it didn’t).

[1] git.madism.org Git - ~madcoder/pwqr.git/blob - kernel/pwqr.c

That would actually be very nice to be able to regulate on a system level just as on Darwin.

On a conceptual level, it would probably make sense to consider also non-work queue threads for regulation purposes (it is what is being done on the user level right now in pwq: simply doing thread introspection of /proc when needed) - then it plays more nicely in a mixed environment where there are legacy threads doing work.
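As a rough illustration of what user-level /proc introspection can look like on Linux, the sketch below counts the threads of the current process that are in the running state by reading /proc/self/task/<tid>/stat. This is not the actual pwq code; the function names are made up for the example, and it only inspects the calling process. The one detail worth noting is that the state character is the field right after the last ')' on the line, since the command name may itself contain spaces and parentheses.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the state field ('R', 'S', 'D', ...) from one stat line,
 * or return '\0' if the line does not look like "pid (comm) state ...". */
char task_state_from_stat(const char *stat_line)
{
    const char *close = strrchr(stat_line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Count threads of the calling process currently in state 'R'.
 * Returns -1 if /proc is not available. */
int count_running_threads(void)
{
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL)
        return -1;

    int running = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        char path[288], line[512];
        if (entry->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        snprintf(path, sizeof path, "/proc/self/task/%s/stat", entry->d_name);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;                       /* thread exited meanwhile */
        if (fgets(line, sizeof line, f) != NULL &&
            task_state_from_stat(line) == 'R')
            running++;
        fclose(f);
    }
    closedir(dir);
    return running;
}

/* Small self-check of the parser on hand-written sample lines. */
void task_state_selfcheck(void)
{
    assert(task_state_from_stat("42 (sh) S 1 42 42 0") == 'S');
    assert(task_state_from_stat("7 (odd (name) x) R 1 7 7 0") == 'R');
    assert(task_state_from_stat("garbage") == '\0');
}
```

A regulator that also wants to account for legacy threads in other processes would have to walk /proc/<pid>/task for those processes as well, which this sketch does not attempt.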

Joakim

···

On 8 dec. 2015, at 16:56, Pierre Habouzit <pierre@habouzit.net> wrote:


Joakim_Hassila · December 10, 2015, 9:06am

My point is, adding API to dispatch is not something we do lightly. I’m not keen on an interface that only works for base queues. In Mac OS and iOS code, where dispatchy code is pervasive, queue hierarchies more than 2 queues deep are very common.

Gotcha.

That would indeed be a very interesting idea, the problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread (as is commonly done now) with mutex/condition signaling is way too slow for the use case we have (thus the very ugly workaround with a spin thread for some deployments).
<snip>
Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases, perhaps the queue attribute suggested above could be another way of expressing it without introducing new dispatch API.

I completely agree with you, but I think that the way to address this is by making the thread pool smarter, not having the developer sprinkle his code with dispatch_barrier_trysync() where he feels like it. Using it properly requires a deep understanding of the implementation of dispatch he’s using, and that changes with each platform / version combination. That’s not really the kind of interface we want to build.

“overcommit” is exactly the hint you’re after as far as the queue is concerned. It means “if I’m woken up, bring up a new thread provided it doesn’t blow up the system, no matter what”. So make your queue non overcommit by targeting it manually to dispatch_get_global_queue(0, 0) (that one isn’t overcommit), and make the thread pool smarter. That’s the right way to go and the design-compatible way to do it.

Ok, this is interesting (and probably just points out my misunderstanding of the intended semantics of the overcommit flag) - the part with “doesn’t blow up the system” is actually not clear from the header files, it says:

···

On 8 dec. 2015, at 17:07, Pierre Habouzit <phabouzit@apple.com> wrote:

++++++
* @constant DISPATCH_QUEUE_OVERCOMMIT
* The queue will create a new thread for invoking blocks, regardless of how
* busy the computer is.
++++++

The ‘regardless of how busy the computer is’ was also the implementation of overcommit queues in pwq, so we essentially banned the usage of them here internally as the semantics were not very usable for us, as a new thread was always created (and we do use a fairly large number of dispatch queues).

If we would have semantics like:
“overcommit” - “if I’m woken up, bring up a new thread provided it doesn’t blow up the system, no matter what”, where “blowing up the system” is defined as having more concurrently running threads than there are active cores
“non overcommit” - “only bring up a new thread provided that enough ‘pressure’ is applied” (to allow for essentially the ping pong you suggest)
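The two proposed interpretations can be written down as a small decision function for a user-level thread pool. This is a sketch under stated assumptions rather than how pwq or Darwin actually behave: should_spawn_thread and PRESSURE_THRESHOLD are hypothetical, "pressure" is modelled simply as the number of pending work items, and applying the core cap to non overcommit queues as well is my own assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tuning knob: queued items needed before a non overcommit
 * queue counts as being under "pressure". */
#define PRESSURE_THRESHOLD 4u

/* Decide whether the pool should bring up one more thread for a queue. */
bool should_spawn_thread(bool overcommit, unsigned active_cores,
                         unsigned running_threads, unsigned pending_items)
{
    /* "Blowing up the system": more running threads than active cores. */
    if (running_threads >= active_cores)
        return false;
    /* Overcommit: always bring up a thread while below the cap. */
    if (overcommit)
        return true;
    /* Non overcommit: resist until enough work has piled up. */
    return pending_items >= PRESSURE_THRESHOLD;
}

/* Small self-check showing the intended behavior on a 4-core machine. */
void policy_selfcheck(void)
{
    assert(should_spawn_thread(true, 4, 1, 1));    /* overcommit, room left */
    assert(!should_spawn_thread(false, 4, 1, 1));  /* resist: low pressure */
    assert(should_spawn_thread(false, 4, 1, 4));   /* pressure reached */
    assert(!should_spawn_thread(true, 4, 4, 10));  /* cap reached */
}
```

The sketch deliberately says nothing about who runs a pending item when no thread is spawned, so it does not solve the lost wakeup concern for the non overcommit case.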

Then I think we can work with making the thread pool smarter and get desired behavior - will think a bit more about it (I think that some care would be required to not have essentially lost wakeups for the non overcommit variant in that case), but it feels like a possibly better way forward, thanks.

Would such interpretation of the overcommit attribute semantics be reasonable? (we wouldn’t want to have a completely different view of it to keep the API behavior robust across platforms)

If your thread blocks in read() then I would argue that it should use a READ dispatch source instead; that way, the source would get enqueued *after* your async and you can ping pong. Doing blocking read()s is not dispatchy at all and will cause you all sorts of problems like that one, because re-async doesn’t work for you.

Here we are a bit living with legacy considerations, but perhaps one possible way we are discussing is if such threads could use the pwq API (at the tail end) to facilitate the behavior you suggest.

I.e. pseudocode:

f() // legacy non-dispatchy code
{
  read() // get some new work
  dispatch_async() // dispatch the work on a non-overcommit queue
  // repeat f(), switching between overcommit/non-overcommit target work
  // queues as needed/desired; this thread could thus be stolen to process
  // the above dispatch_async
  pthread_workqueue_additem_np(q, f, ...)
}

Thanks!

Joakim


Pierre_Habouzit1 · December 10, 2015, 5:52pm

-Pierre

Hi,

FWIW, this is my personal, let’s call it enlightened, opinion, based on my knowledge of dispatch and my past extensive system programming experience with Linux before I joined Apple.

I think that long term, the best way to maintain a Linux libdispatch port is to move away from libkqueue, which tries to emulate kqueue fully, whereas dispatch only needs a small subset of the surface of kqueue. Given how source.c is written today, this is not a very small undertaking, but eventually dispatch sources map to epoll_ctl(EPOLLONESHOT) very, very well.

That makes sense, could simplify the implementation (and keep things cleaner). Then the follow up question is of course how to split/manage source.c (as Daniel pointed out there is the merging issue).

we can decide when/if someone tries to tackle it. I humbly recognize that I have no great idea of how to do so.

···

On Dec 10, 2015, at 12:42 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

On 8 dec. 2015, at 16:56, Pierre Habouzit <pierre@habouzit.net> wrote:

Given our experience with the work queue subsystem in Darwin, I think that it would make sense to integrate both projects together, as work queue are not that useful if you don’t have dispatch with it, and having it separate gives you all the woes of a stable interface, which you don’t really care for in the first place. It’s probably much better to integrate it and not care about backward and forward compatibility and make it a private library of dispatch on linux. And to not be tied to a given interface at all.

Agree, I don’t see much use for pwq except in this support role, so there would be a large degree of freedom.

I also think that minimal kernel support for thread pool management isn’t that hard to write as a kernel module. I had started to work on this a very long time ago, using the KVM scheduling hooks that let you know when a thread blocks and/or becomes runnable[1]. Threads would declare to that interface that they are work queue threads, and get load information that the thread pool can use to regulate itself. It’s old code, maybe (probably?) not the right way to do it, but it’s an example of things you can do if you move away from the contrived interface that libpthread_workqueue exposes. My idea required a Linux adjustment that I posted to the LKML at the time (http://lkml.iu.edu/hypermail/linux/kernel/1112.2/00235.html); not sure if it ever made it to mainline (looks like it didn’t).

[1] git.madism.org Git - ~madcoder/pwqr.git/blob - kernel/pwqr.c

That would actually be very nice to be able to regulate on a system level just as on Darwin.

On a conceptual level, it would probably make sense to consider also non-work queue threads for regulation purposes (it is what is being done on the user level right now in pwq: simply doing thread introspection of /proc when needed) - then it plays more nicely in a mixed environment where there are legacy threads doing work.
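As a rough illustration of what user-level thread introspection via /proc can look like on Linux, the sketch below counts how many threads of the current process are in the running state, whether or not they belong to a work queue. It is not the pwq implementation; the function name count_running_threads is hypothetical, and a real regulator would probably cache or rate-limit this scan.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count threads of this process whose scheduler state
 * is 'R' (running or runnable), by scanning /proc/self/task/<tid>/stat.
 * Returns -1 if /proc is not available. */
static int count_running_threads(void)
{
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL)
        return -1;

    int running = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                       /* skip "." and ".." */

        char path[320];
        snprintf(path, sizeof path, "/proc/self/task/%s/stat", ent->d_name);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;                       /* thread exited during the scan */

        char line[512];
        if (fgets(line, sizeof line, f) != NULL) {
            /* Format is "tid (comm) S ..."; comm may contain spaces or
             * parentheses, so locate the state after the last ')'. */
            char *p = strrchr(line, ')');
            if (p != NULL && p[1] == ' ' && p[2] == 'R')
                running++;
        }
        fclose(f);
    }
    closedir(dir);
    return running;
}
```

A thread pool could compare this number with the number of online cores before deciding whether to start another worker, which is one way legacy threads end up being taken into account.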

Joakim


Pierre_Habouzit · December 10, 2015, 6:01pm

My point is, adding API to dispatch is not something we do lightly. I’m not keen on an interface that only works for base queues. In Mac OS and iOS code, where dispatchy code is pervasive, queue hierarchies more than two queues deep are very common.

Gotcha.

That would indeed be a very interesting idea, the problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread (as is commonly done now) with mutex/condition signaling is way too slow for the use case we have (thus the very ugly workaround with a spin thread for some deployments).
<snip>
Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases, perhaps the queue attribute suggested above could be another way of expressing it without introducing new dispatch API.

I completely agree with you, but I think that the way to address this is by making the thread pool smarter, not by having the developer sprinkle his code with dispatch_barrier_trysync() wherever he feels like it. Using it properly requires a deep understanding of the implementation of dispatch he’s using, and that changes with each platform / version combination. That’s not really the kind of interface we want to build.

“overcommit” is exactly the hint you’re after as far as the queue is concerned. It means “if I’m woken up, bring up a new thread provided it doesn’t blow up the system, no matter what”. So make your queue non overcommit by targeting it manually to dispatch_get_global_queue(0, 0) (that one isn’t overcommit), and make the thread pool smarter. That’s the right way to go and the design-compatible way to do it.

Ok, this is interesting (and probably just points out my misunderstanding of the intended semantics of the overcommit flag) - the part with “doesn’t blow up the system” is actually not clear from the header files, it says:

On OS X, the “doesn’t blow up the system” part means that you still have fewer than 512 threads and that enough threads are allowed globally on that host; it’s not really a very strong limit.

++++++
* @constant DISPATCH_QUEUE_OVERCOMMIT
* The queue will create a new thread for invoking blocks, regardless of how
* busy the computer is.
++++++

The ‘regardless of how busy the computer is’ was also the implementation of overcommit queues in pwq, so we essentially banned the usage of them here internally as the semantics were not very usable for us, as a new thread was always created (and we do use a fairly large number of dispatch queues).

If we would have semantics like:
“overcommit” - “if I’m woken up, bring up a new thread provided it doesn’t blow up the system, no matter what” and the definition of “blowing up the system” is to not have more concurrent running threads than there are active cores
“non overcommit” - “only bring up a new thread provided that enough ‘pressure’ is applied” (to allow for essentially the ping pong you suggest)

Then I think we can work with making the thread pool smarter and get desired behavior - will think a bit more about it (I think that some care would be required to not have essentially lost wakeups for the non overcommit variant in that case), but it feels like a possibly better way forward, thanks.

Would such interpretation of the overcommit attribute semantics be reasonable? (we wouldn’t want to have a completely different view of it to keep the API behavior robust across platforms)

The overcommit intent was “if I’m on a single core machine and I wake up that queue but I myself keep a thread busy, would the program livelock because that queue I just woke up wouldn’t wake up a thread”. IOW, could I risk being blocked forever if that queue didn’t run right away.

That’s the problem to keep in mind with overcommit; that’s the intent. I don’t think the current wq implementation on OS X is very smart about it, probably not as smart as it should be.

As long as you fix that issue (guarantee that an overcommit queue will get a thread in a relatively timely fashion, provided you don’t have 109238192038 of them competing for your cores) then I think it’s ok if platform-specific semantics vary a bit. It’s already the case with the pthread thread pool vs wq anyway.

If your thread blocks in read() then I would argue that it should use a READ dispatch source instead; that way, the source would get enqueued *after* your async and you can ping pong. Doing blocking read()s is not dispatchy at all and will cause you all sorts of problems like that one, because re-async doesn’t work for you.

Here we are living a bit with legacy considerations, but one possibility we are discussing is whether such threads could use the pwq API (at the tail end) to facilitate the behavior you suggest.

I.e. pseudocode:

f() // legacy non-dispatchy code
{
  read() // get some new work
  dispatch_async() // dispatch the work on a non-overcommit queue
  pthread_workqueue_additem_np(q, f, ...) // repeat f(), switch between overcommit/nonovercommit target work queues as needed/desired, this thread could thus be stolen to process the above dispatch_async
}

Oh, you’re not on a dispatchy thread; then yeah, well, you’re more or less on your own. If you need that, I would tweak the thread pool so that it always has one thread ready for you (IOW it never kills them all), so that the wakeup is less expensive. But that’s clearly not a pattern we will try to optimize for, because it would have a bad effect on the library API surface, for something that we would prefer people embrace fully over time.

Here if you’re simulating dispatch this way though, you could have a source for the read still:

source = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, fd, 0, NULL);
dispatch_source_set_event_handler_f(source, …, f);
dispatch_resume(source);

If your code really looks like what you’re describing, that should naturally work, no?

-Pierre

···

On Dec 10, 2015, at 1:06 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

On 8 dec. 2015, at 17:07, Pierre Habouzit <phabouzit@apple.com> wrote: