[Concurrency] async/await + actors


#1

Hi all,

As Ted mentioned in his email, it is great to finally kick off

discussions for what concurrency should look like in Swift. This will
surely be an epic multi-year journey, but it is more important to find the
right design than to get there fast.

I’ve been advocating for a specific model involving async/await and

actors for many years now. Handwaving only goes so far, so some folks asked
me to write them down to make the discussion more helpful and concrete.
While I hope these ideas help push the discussion on concurrency forward,
this isn’t in any way meant to cut off other directions: in fact I hope it
helps give proponents of other designs a model to follow: a discussion
giving extensive rationale, combined with the long term story arc to show
that the features fit together.

Anyway, here is the document, I hope it is useful, and I’d love to hear

comments and suggestions for improvement:

https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782

-Chris

Hi chris,

is a actor guarantee always process the messages in one by one?
so, can it assume that never being multiple threads try to modify the state
at the same time?

P.S. i have implemented similar idea before:

https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/Thread/Thread.swift
https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/SDTriggerNode.swift


(Chris Lattner) #2

Hi chris,

is a actor guarantee always process the messages in one by one?
so, can it assume that never being multiple threads try to modify the state at the same time?

Yep, that’s the idea.

P.S. i have implemented similar idea before:

https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/Thread/Thread.swift
https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/SDTriggerNode.swift

Cool. That’s one of the other interesting things about the actor model. We can prototype and build it as a completely library feature to get experience with the runtime model, then move to language support (providing the additional safety) when things seem to work well in practice.

-Chris

···

On Aug 19, 2017, at 2:02 AM, Susan Cheng <susan.doggie@gmail.com> wrote:


(Pierre Habouzit) #3

I fail at Finding the initial mail and am quite late to the party of commenters, but there are parts I don't undertsand or have questions about.

Scalable Runtime

[...]

The one problem I anticipate with GCD is that it doesn't scale well enough: server developers in particular will want to instantiate hundreds of thousands of actors in their application, at least one for every incoming network connection. The programming model is substantially harmed when you have to be afraid of creating too many actors: you have to start aggregating logically distinct stuff together to reduce # queues, which leads to complexity and loses some of the advantages of data isolation.

What do you mean by this? queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint. The former (queues) represent where the code runs physically, gives you some level of scheduling, possibly prioritization, and the context is the entity that is known to the kernel so that when you need synchronization between two execution context (because despite your best intentions there is global mutable state on the system that Swift uses all the time whether it's through frameworks, malloc or simply any syscall), it can resolve priority inversions and do smart things to schedule these contexts.

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context), which is data that is not shared horizontally (across several concurrent execution contexts) but vertically (across all the hierarchy of actors/work items/... that you schedule on these execution contexts, hence require no locks and are "good" for the system).

GCD is trying to be a very efficient way to communicate and message between execution contexts that you know and represent your software architecture in your product/app/server/.... Using queues for anything else will indeed scale poorly.

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Then this is fine, this is something where Swift could enqueue its own "schedule Swift closures on this context" at the very least, and for the ones that have native integration do smarter things (I'd expect runloops or libdispatch to be such better integrated citizens given that they're part of the same umbrella ;p). If you layer the runtime this way, then I don't see how GCD can be a hindrance, it's just one of the several execution contexts that can host Actors.

While mentioning this, I've seen many people complain that dispatch_get_current_queue() is deprecated. It is so for tons of valid reasons, it's too sharp an API to use for developers, but as part of integrating with the swift runtime, having a "please give me a reference on the current execution context" is trivially implementable when we know what the Swift runtime will do with it and has a reasonable use.

Design sketch for interprocess and distributed compute

[...]

One of these principles is the concept of progressive disclosure of complexity <https://en.wikipedia.org/wiki/Progressive_disclosure>: a Swift developer shouldn't have to worry about IPC or distributed compute if they don't care about it.

While I agree with the sentiment, I don't think that anything useful can be done without "distributed" computation. I like the loadResourceFromTheWeb example, as we have something like this on our platform, which is the NSURLSession APIs, or the CloudKit API Surface, that are about fetching some resource from a server (URL or CloudKit database records). However, they don't have a single result, they have:

- progress notification callbacks
- broken down notifications for the results (e.g. headers first and body second, or per-record for CloudKit operations)
- various levels of error reporting.

I expect most developers will have to use such a construct, and for these, having a single async pivot in your code that essentially fully serializes your state machine on getting a full result from the previous step to be lacking. Similarly, for the 3 categories I listed above, it's very likely that you want these notifications to be seen as various consequences of the initiator of the download, and they are typically sent to very specific execution contexts:
- progress usually goes to the main thread / UI thread because it's about reporting stuff to the user
- notification go to some validation logic that will assemble frames and reconstruct the whole payload which is likely some utility context on the side, until the full download is done and then you want to resume handling the result of the operation on some subsystem's context that was interested in this download.

Delivering all these notifications on the context of the initiator would be quite inefficient as clearly there are in my example above two very different contexts, and having to hop through one to reach the other would make this really terrible for the operating system. I also don't understand how such operations would be modeled in the async/await world to be completely honest.

As a former framework developer, I liked to be able to reason about contexts and silos of execution, and now as a runtime (as in operating system runtime, not language ;p) developer I like people who think that way best because this is how an operating system works, and any organization that is too remote from these silos has huge impedance mismatches and execute very poorly without a lot of manual fiddling with your language runtime knobs: the JVM is really infamous for this, but from where I stand I've seen go follow a similar trend where there are knobs that will affect the performance of the code that you run dramatically, and where no single setup is good for everyone. I actually strongly believe that it's impossible for the runtime to figure out these things, and it's best to create an environment that helps you think that way and organize your software architecture this way.

In other terms, in all this proposal, if Actors are people the developer can play with, we need to allow the developer to create housing too.

-Pierre

···

On Aug 19, 2017, at 4:06 PM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Aug 19, 2017, at 2:02 AM, Susan Cheng <susan.doggie@gmail.com> wrote:

Hi chris,

is a actor guarantee always process the messages in one by one?
so, can it assume that never being multiple threads try to modify the state at the same time?

Yep, that’s the idea.

P.S. i have implemented similar idea before:

https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/Thread/Thread.swift
https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/SDTriggerNode.swift

Cool. That’s one of the other interesting things about the actor model. We can prototype and build it as a completely library feature to get experience with the runtime model, then move to language support (providing the additional safety) when things seem to work well in practice.

-Chris

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Chris Lattner) #4

I fail at Finding the initial mail and am quite late to the party of commenters, but there are parts I don't undertsand or have questions about.

Scalable Runtime

[...]

The one problem I anticipate with GCD is that it doesn't scale well enough: server developers in particular will want to instantiate hundreds of thousands of actors in their application, at least one for every incoming network connection. The programming model is substantially harmed when you have to be afraid of creating too many actors: you have to start aggregating logically distinct stuff together to reduce # queues, which leads to complexity and loses some of the advantages of data isolation.

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint.

AFAICT, the “one queue per actor” model is the only one that makes sense. It doesn’t have to be FIFO, but it needs to be some sort of queue. If you allow servicing multiple requests within the actor at a time, then you lose the advantages of “no shared mutable state”.

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context),

Just MHO, but I don’t think you’d need or want the concept of “actor local data” in the sense of TLS (e.g. __thread). All actor methods have a ‘self’ already, and having something like TLS strongly encourages breaking the model. To me, the motivation for TLS is to provide an easier way to migrate single-threaded global variables, when introducing threading into legacy code.

This is not a problem we need or want to solve, given programmers would be rewriting their algorithm anyway to get it into the actor model.

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Generalizing the approach is completely possible, but it is also possible to introduce a language abstraction that is “underneath” the high level event loops. That’s what I’m proposing.

Design sketch for interprocess and distributed compute

[...]

One of these principles is the concept of progressive disclosure of complexity <https://en.wikipedia.org/wiki/Progressive_disclosure>: a Swift developer shouldn't have to worry about IPC or distributed compute if they don't care about it.

While I agree with the sentiment, I don't think that anything useful can be done without "distributed" computation. I like the loadResourceFromTheWeb example, as we have something like this on our platform, which is the NSURLSession APIs, or the CloudKit API Surface, that are about fetching some resource from a server (URL or CloudKit database records). However, they don't have a single result, they have:

- progress notification callbacks
- broken down notifications for the results (e.g. headers first and body second, or per-record for CloudKit operations)
- various levels of error reporting.

I don’t understand the concern about this. If you want low level control like this, it is quite easy to express that. However, it is also quite common to just want to say “load a URL with this name”, which is super easy and awesome with async/await.

I expect most developers will have to use such a construct, and for these, having a single async pivot in your code that essentially fully serializes your state machine on getting a full result from the previous step to be lacking.

Agreed, the examples are not trying to show that. It is perfectly fine to pass in additional callbacks (or delegates, etc) to async methods, which would be a natural way to express this… just like the current APIs do.

Delivering all these notifications on the context of the initiator would be quite inefficient as clearly there are in my example above two very different contexts, and having to hop through one to reach the other would make this really terrible for the operating system. I also don't understand how such operations would be modeled in the async/await world to be completely honest.

The proposal isn’t trying to address this problem, because Swift already has ways to do it.

-Chris

···

On Aug 31, 2017, at 7:24 PM, Pierre Habouzit <pierre@habouzit.net> wrote:


(Pierre Habouzit) #5

I fail at Finding the initial mail and am quite late to the party of commenters, but there are parts I don't undertsand or have questions about.

Scalable Runtime

[...]

The one problem I anticipate with GCD is that it doesn't scale well enough: server developers in particular will want to instantiate hundreds of thousands of actors in their application, at least one for every incoming network connection. The programming model is substantially harmed when you have to be afraid of creating too many actors: you have to start aggregating logically distinct stuff together to reduce # queues, which leads to complexity and loses some of the advantages of data isolation.

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.
The place where GCD "fails" at is that if you target your individual serial queues to the global concurrent queues (a.k.a. root queues) which means "please pool, do your job", then yes it doesn't scale, because we take these individual serial queues as proxies for OS threads.

If however you target these queues to either:
- new serial queues to segregate your actors per subsystem yourself
- or some more constrained pool than what the current GCD runtime offers (where we don't create threads to run your work nearly as eagerly)

Then I don't see why the current implementation of GCD wouldn't scale.

queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint.

AFAICT, the “one queue per actor” model is the only one that makes sense. It doesn’t have to be FIFO, but it needs to be some sort of queue. If you allow servicing multiple requests within the actor at a time, then you lose the advantages of “no shared mutable state”.

I agree, I don't quite care about how the actor is implemented here, what I care about is where it runs onto. my wording was poor, what I really meant is:

queues at the bottom of a queue hierarchy are serial/exclusive execution contexts, and if you're not modeling actors as being such fully independent serial queues, then these two concepts are just disjoint.

In GCD there's a very big difference between the one queue at the root of your graph (just above the thread pool) and any other that is within. The number that doesn't scale is the number of the former contexts, not the latter.

The pushback I have here is that today Runloops and dispatch queues on iOS/macOS are already systems that have huge impedance mismatches, and do not share the resources either (in terms of OS physical threads). I would hate for us to bring on ourselves the pain of creating a third completely different system that is using another way to use threads. When these 3 worlds would interoperate this would cause significant amount of context switches just to move across the boundaries.

"GCD Doesn't scale so let's build something new" will only create pain, we need a way for actors to inherently run on a thread pool that is shared with dispatch and that dispatch can reason about and vice versa, and where the Swift runtime gives enough information for GCD for it to execute the right work at the right time.

I'd like to dive and debunk this "GCD doesn't scale" point, that I'd almost call a myth (and I'm relatively unhappy to see these words in your proposal TBH because they send the wrong message).

Way before I started working on it, probably to ease adoption, the decision was made that it was ok to write code such as this and have it run without problems (FSVO without problems):

dispatch_queue_t q = ...;
dispatch_semaphore_t sema = dispatch_semaphore_create(0);
dispatch_async(q, ^{ dispatch_semaphore_signal(sema); });
dispatch_semaphore_wait(sema, DISPATCH_TIME_FOREVER);

To accommodate for this we when the caller of this code blocks a worker thread, then the kernel will notice your level of concurrency dropped and will bring up a new thread for you. This thread will likely be the one that picks up `q` that got woken up by this async, and will unblock the caller.

If you never write such horrible code, then GCD scales *just fine*. The real problem is that if you go async you need to be async all the way. Node.js and other similar projects have understood that a very long time ago. If you express dependencies between asynchronous execution context with a blocking relationship such as above, then you're just committing performance suicide. GCD handles this by adding more threads and overcommitting the system, my understanding is that your proposal is to instead livelock.

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

I think this aligns with your idea, in the sense that if you exhaust the Swift Actor Thread Pool, then you're screwed forever. But given that the pattern above can be hidden inside framework code that the developer has *no control over*, it is fairly easy to write actors that eventually through the said framework, would result in this synchronization pattern happening. Even if we can build the amazing debugging tools that make these immediately obvious to the developer (as in understanding what is happening), I don't know how the developer can do *anything* to work around these. The only solution is to fix the frameworks. However the experience of the last few years of maintaining GCD shows that the patterns above are not widely perceived as a dramatic design issue, let alone a bug. It will be a very long road before most framework code there is out there is Swift Actor async/await safe.

What is your proposal to address this? that we annotate functions that are unsafe? And then, assuming we succeed at this Herculean task, what can developers do anyway about it if the only way to do a thing is async/await unsafe ?

Note that synchronously waiting is not necessarily all bad. Any waiting that is waiting because something else is already happening and can make forward progress on its own (transitively so through any such blocking relationship) is 100% fine. Typically to take an example that is widely misunderstood, sync IPC is not bad, because it is about making another security domain perform work on your behalf, this other process is using the resources you're leaving free as a result of your blocking and this is fine. The problematic blockings are the ones for which the system cannot identify the work that you're waiting on and that may require allocation of constrained extra resources (such as a thread in the pool) to be able to be run.

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context),

Just MHO, but I don’t think you’d need or want the concept of “actor local data” in the sense of TLS (e.g. __thread). All actor methods have a ‘self’ already, and having something like TLS strongly encourages breaking the model. To me, the motivation for TLS is to provide an easier way to migrate single-threaded global variables, when introducing threading into legacy code.

I disagree, if you have an execution context that is "my database subsystem", it probably has an object that knows about all your database handles. Or do you suggest that your database subsystem is an actor too? I don't quite see the database subsystem as an actor in the sense that it's a strongly exclusive execution context (if your database is SQLite) and will really receive actors to execute, providing them a home.

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Do I make sense?

This is not a problem we need or want to solve, given programmers would be rewriting their algorithm anyway to get it into the actor model.

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Generalizing the approach is completely possible, but it is also possible to introduce a language abstraction that is “underneath” the high level event loops. That’s what I’m proposing.

I'm not sure I understand what you mean here, can you elaborate?

···

On Sep 2, 2017, at 11:15 AM, Chris Lattner <clattner@nondot.org> wrote:
On Aug 31, 2017, at 7:24 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:

Design sketch for interprocess and distributed compute

[...]

One of these principles is the concept of progressive disclosure of complexity <https://en.wikipedia.org/wiki/Progressive_disclosure>: a Swift developer shouldn't have to worry about IPC or distributed compute if they don't care about it.

While I agree with the sentiment, I don't think that anything useful can be done without "distributed" computation. I like the loadResourceFromTheWeb example, as we have something like this on our platform, which is the NSURLSession APIs, or the CloudKit API Surface, that are about fetching some resource from a server (URL or CloudKit database records). However, they don't have a single result, they have:

- progress notification callbacks
- broken down notifications for the results (e.g. headers first and body second, or per-record for CloudKit operations)
- various levels of error reporting.

I don’t understand the concern about this. If you want low level control like this, it is quite easy to express that. However, it is also quite common to just want to say “load a URL with this name”, which is super easy and awesome with async/await.

I expect most developers will have to use such a construct, and for these, having a single async pivot in your code that essentially fully serializes your state machine on getting a full result from the previous step to be lacking.

Agreed, the examples are not trying to show that. It is perfectly fine to pass in additional callbacks (or delegates, etc) to async methods, which would be a natural way to express this… just like the current APIs do.

Delivering all these notifications on the context of the initiator would be quite inefficient as clearly there are in my example above two very different contexts, and having to hop through one to reach the other would make this really terrible for the operating system. I also don't understand how such operations would be modeled in the async/await world to be completely honest.

The proposal isn’t trying to address this problem, because Swift already has ways to do it.

-Chris


(David Zarzycki) #6

Hi Chris!

[As a preface, I’ve only read a few of these concurrency related emails on swift-evolution, so please forgive me if I missed something.]

When it comes to GCD scalability, the short answer is that millions of of tiny heap allocations are cheap, be they queues or closures. And GCD has fairly linear performance so long as the millions of closures/queues are non-blocking.

The real world is far messier though. In practice, real world code blocks all of the time. In the case of GCD tasks, this is often tolerable for most apps, because their CPU usage is bursty and any accidental “thread explosion” that is created is super temporary. That being said, programs that create thousands of queues/closures that block on I/O will naturally get thousands of threads. GCD is efficient but not magic.

As an aside, there are things that future versions of GCD could do to minimize the “thread explosion” problem. For example, if GCD interposed the system call layer, it would gain visibility into *why* threads are stalled and therefore GCD could 1) be more conservative about when to fire up more worker threads and 2) defer resuming threads that are at “safe” stopping points if all of the CPUs are busy.

That being done though, the complaining would just shift. Instead of an “explosion of threads”, people would complain about an “explosion of stacks" that consume memory and address space. While I and others have argued in the past that solving this means that frameworks must embrace callback API design patterns, I personally am no longer of this opinion. As I see it, I don’t think the complexity (and bugs) of heavy async/callback/coroutine designs are worth the memory savings. Said differently, the stack is simple and efficient. Why fight it?

I think the real problem is that programmers cannot pretend that resources are infinite. For example, if one implements a photo library browsing app, it would be naive to try and load every image at launch (async or otherwise). That just won’t scale and that isn’t the operating system's fault.

Dave

···

On Sep 2, 2017, at 14:15, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.


(Pierre Habouzit) #7

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

Hi Chris!

[As a preface, I’ve only read a few of these concurrency related emails on swift-evolution, so please forgive me if I missed something.]

When it comes to GCD scalability, the short answer is that millions of of tiny heap allocations are cheap, be they queues or closures. And GCD has fairly linear performance so long as the millions of closures/queues are non-blocking.

The real world is far messier though. In practice, real world code blocks all of the time. In the case of GCD tasks, this is often tolerable for most apps, because their CPU usage is bursty and any accidental “thread explosion” that is created is super temporary. That being said, programs that create thousands of queues/closures that block on I/O will naturally get thousands of threads. GCD is efficient but not magic.

As an aside, there are things that future versions of GCD could do to minimize the “thread explosion” problem. For example, if GCD interposed the system call layer, it would gain visibility into *why* threads are stalled and therefore GCD could 1) be more conservative about when to fire up more worker threads and 2) defer resuming threads that are at “safe” stopping points if all of the CPUs are busy.

That being done though, the complaining would just shift. Instead of an “explosion of threads”, people would complain about an “explosion of stacks" that consume memory and address space. While I and others have argued in the past that solving this means that frameworks must embrace callback API design patterns, I personally am no longer of this opinion. As I see it, I don’t think the complexity (and bugs) of heavy async/callback/coroutine designs are worth the memory savings. Said differently, the stack is simple and efficient. Why fight it?

I think the real problem is that programmers cannot pretend that resources are infinite. For example, if one implements a photo library browsing app, it would be naive to try and load every image at launch (async or otherwise). That just won’t scale and that isn’t the operating system's fault.

Problems like thread explosion can be solved using higher-level constructs, though. For example, (NS)OperationQueue has a .maxConcurrentOperationCount property. If you make a global OperationQueue, set the maximum to whatever you want it to be, and run all your “primitive” operations through the queue, you can manage the thread count rather effectively.

I have a few custom Operation subclasses that easily wrap arbitrary asynchronous operations as Operation objects; once the new async/await API comes out, I plan to adapt my subclass to support it, and I’d be happy to submit the code to swift-evolution if people are interested.

NSOperation has several implementation issues, and using it to encapsulate asynchronous work means that you don't get the correct priorities (I don't say it cant' be fixed, I honnestly don't know, I just know from the mouth of the maintainer that NSOperation makes only guarantees if you do all your work from -[NSOperation main]).

Second, what Dave is saying is exactly the opposite of what you just wrote. If you use NSOQ's maximum concurrency bit, and you throw an infinite amount of work at it, *sure* the thread explosion will be fixed but:
- cancelation is a problem
- scalability is a problem
- memory growth is a problem.

The better design is to have a system that works like this:

(1) have a scheduler that knows how many operations are in flight and admits two levels "low" and "high".
(2) when you submit work to the scheduler, it tells you if it could take it or not, if not, then it's up to you to serialize it somewhere "for later" or propagate the error to your client
(3) the scheduler admits up to "high" work items, and as they finish, if you reach "low", then you use some notification mechanism to feed it again (and possibly get it from the database).

This is how any OS construct works for resource reasons (network sockets, file descriptors, ...) where the notification mechanism that writing to these is available again is select/poll/epoll/kevent/... name it.

By doing it this way, you actually can write smarter policies on what the next work is, because computing what you should do next is usually relatively expensive, especially if work comes in all the time and that your decision can go stale quickly or that priorities are live. by batching it because the scheduler will only call you when you're at "low" which lets you send "hihg - low" items at once, you benefit from a batching effect and possibility to reorder work. And you work in constant memory.

NSOperationQueue is not that, it is just a turnstile at the door that doesn't let too many people enter but will happily have the waiting queue overflow in the streets and cause a car accident because the pavement is not large enough to hold all of it.

Dave said it very well so I'll not try to reword it:

I think the real problem is that programmers cannot pretend that resources are infinite.

From there, it stems that you need to reason about your resources, and the execution contexts and amount of Actors in flight are such resources. It's fine to ignore these problems when the working set you have is small enough, but in my experience, there are *lots* of cases when the working set is actually still bounded, but too large in a way that makes the developer think he's fine. The most typical example which we actually tried to express in the WWDC GCD Talk my team did this year is the case when your working set is scaling in the number of clients/connections/... of your process: this is usually not a good property for a real system, even in many core systems. For these cases you need to make sure that all related work happens on the same exeution context to benefit from cache effects at all the levels (from CPU to actually the software stack), and have a better affinity for the locking domains you use, etc, etc, etc...

And then that bring us back to the reply I made to Chris earlier today: nesting actors is like nesting locks. When you have a clear ordering it's all fine, but when you don't, then you're screwed.

-Pierre

···

On Sep 2, 2017, at 2:19 PM, Charles Srstka via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 4:05 PM, David Zarzycki via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Sep 2, 2017, at 14:15, Chris Lattner via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:


(John McCall) #8

More importantly, the basic interface of GCD doesn't seem to prevent an implementation from scaling to fill the resource constraints of a machine. The interface to dispatch queues does not imply any substantial persistent state besides the task queue itself, and tasks have pretty minimal quiescent storage requirements. Queue-hopping is an unfortunate overhead, but a constant-time overhead doesn't really damage scalability and can be addressed without a major overhaul of the basic runtime interface. OS threads can be blocked by tasks, but that's not a Dispatch-specific problem, and any solution that would fix it in other runtimes would equally fix it in Dispatch.

Now, an arbitrarily-scaling concurrent system is probably a system that's destined to eventually become distributed, and there's a strong argument that unbounded queues are an architectural mistake in a distributed system: instead, every channel of communication should have an opportunity to refuse further work, and the entire system should be architected to handle such failures gracefully. But I think that can be implemented reasonably on top of a runtime where most local queues are still unbounded and "optimistic".

John.

···

On Sep 2, 2017, at 3:19 PM, Pierre Habouzit via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 11:15 AM, Chris Lattner <clattner@nondot.org <mailto:clattner@nondot.org>> wrote:

On Aug 31, 2017, at 7:24 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:

I fail at Finding the initial mail and am quite late to the party of commenters, but there are parts I don't undertsand or have questions about.

Scalable Runtime

[...]

The one problem I anticipate with GCD is that it doesn't scale well enough: server developers in particular will want to instantiate hundreds of thousands of actors in their application, at least one for every incoming network connection. The programming model is substantially harmed when you have to be afraid of creating too many actors: you have to start aggregating logically distinct stuff together to reduce # queues, which leads to complexity and loses some of the advantages of data isolation.

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.
The place where GCD "fails" at is that if you target your individual serial queues to the global concurrent queues (a.k.a. root queues) which means "please pool, do your job", then yes it doesn't scale, because we take these individual serial queues as proxies for OS threads.

If however you target these queues to either:
- new serial queues to segregate your actors per subsystem yourself
- or some more constrained pool than what the current GCD runtime offers (where we don't create threads to run your work nearly as eagerly)

Then I don't see why the current implementation of GCD wouldn't scale.


(Charles Srstka) #9

Problems like thread explosion can be solved using higher-level constructs, though. For example, (NS)OperationQueue has a .maxConcurrentOperationCount property. If you make a global OperationQueue, set the maximum to whatever you want it to be, and run all your “primitive” operations through the queue, you can manage the thread count rather effectively.

I have a few custom Operation subclasses that easily wrap arbitrary asynchronous operations as Operation objects; once the new async/await API comes out, I plan to adapt my subclass to support it, and I’d be happy to submit the code to swift-evolution if people are interested.

Charles

···

On Sep 2, 2017, at 4:05 PM, David Zarzycki via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 14:15, Chris Lattner via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

Hi Chris!

[As a preface, I’ve only read a few of these concurrency related emails on swift-evolution, so please forgive me if I missed something.]

When it comes to GCD scalability, the short answer is that millions of of tiny heap allocations are cheap, be they queues or closures. And GCD has fairly linear performance so long as the millions of closures/queues are non-blocking.

The real world is far messier though. In practice, real world code blocks all of the time. In the case of GCD tasks, this is often tolerable for most apps, because their CPU usage is bursty and any accidental “thread explosion” that is created is super temporary. That being said, programs that create thousands of queues/closures that block on I/O will naturally get thousands of threads. GCD is efficient but not magic.

As an aside, there are things that future versions of GCD could do to minimize the “thread explosion” problem. For example, if GCD interposed the system call layer, it would gain visibility into *why* threads are stalled and therefore GCD could 1) be more conservative about when to fire up more worker threads and 2) defer resuming threads that are at “safe” stopping points if all of the CPUs are busy.

That being done though, the complaining would just shift. Instead of an “explosion of threads”, people would complain about an “explosion of stacks" that consume memory and address space. While I and others have argued in the past that solving this means that frameworks must embrace callback API design patterns, I personally am no longer of this opinion. As I see it, I don’t think the complexity (and bugs) of heavy async/callback/coroutine designs are worth the memory savings. Said differently, the stack is simple and efficient. Why fight it?

I think the real problem is that programmers cannot pretend that resources are infinite. For example, if one implements a photo library browsing app, it would be naive to try and load every image at launch (async or otherwise). That just won’t scale and that isn’t the operating system's fault.


(Chris Lattner) #10

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.

Ok, I stand corrected. My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.

queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint.

AFAICT, the “one queue per actor” model is the only one that makes sense. It doesn’t have to be FIFO, but it needs to be some sort of queue. If you allow servicing multiple requests within the actor at a time, then you lose the advantages of “no shared mutable state”.

I agree, I don't quite care about how the actor is implemented here, what I care about is where it runs onto. my wording was poor, what I really meant is:

queues at the bottom of a queue hierarchy are serial/exclusive execution contexts, and if you're not modeling actors as being such fully independent serial queues, then these two concepts are just disjoint.

In GCD there's a very big difference between the one queue at the root of your graph (just above the thread pool) and any other that is within. The number that doesn't scale is the number of the former contexts, not the latter.

I’m sorry, but I still don’t understand what you’re getting at here.

The pushback I have here is that today Runloops and dispatch queues on iOS/macOS are already systems that have huge impedance mismatches, and do not share the resources either (in terms of OS physical threads). I would hate for us to bring on ourselves the pain of creating a third completely different system that is using another way to use threads. When these 3 worlds would interoperate this would cause significant amount of context switches just to move across the boundaries.

Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues. In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.

I'd like to dive and debunk this "GCD doesn't scale" point, that I'd almost call a myth (and I'm relatively unhappy to see these words in your proposal TBH because they send the wrong message).

I’m happy to revise the proposal, please let me know what you think makes sense.

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

I don’t see a problem with disallowing actors on the global concurrent queues in general, but I do think it makes sense to be able to provide an abstraction for homing code on the main thread/queue/actor somehow.

I think this aligns with your idea, in the sense that if you exhaust the Swift Actor Thread Pool, then you're screwed forever. But given that the pattern above can be hidden inside framework code that the developer has *no control over*, it is fairly easy to write actors that eventually through the said framework, would result in this synchronization pattern happening. Even if we can build the amazing debugging tools that make these immediately obvious to the developer (as in understanding what is happening), I don't know how the developer can do *anything* to work around these. The only solution is to fix the frameworks. However the experience of the last few years of maintaining GCD shows that the patterns above are not widely perceived as a dramatic design issue, let alone a bug. It will be a very long road before most framework code there is out there is Swift Actor async/await safe.

What is your proposal to address this? that we annotate functions that are unsafe? And then, assuming we succeed at this Herculean task, what can developers do anyway about it if the only way to do a thing is async/await unsafe ?

I don’t think that annotations are the right way to go. It should be an end-goal for the system to be almost completely actor safe, so the parity of the annotation would have to be “this code is unsafe”. Given that, I don’t see how we could audit the entire world, and I don’t think that an annotation explosion would be acceptable this late in Swift’s lifetime. It would be like IUO-everywhere problem in the Swift 1 betas.

My preferred solution is three-fold:
- Make frameworks incrementally actor safe over time. Ensure that new APIs are done right, and make sure that no existing APIs ever go from “actor safe” to “actor unsafe”.
- Provide a mechanism that developers can use to address problematic APIs that they encounter in practice. It should be something akin to “wrap your calls in a closure and pass it to a special GCD function”, or something else of similar complexity.
- Continue to improve perf and debugger tools to help identify problematic cases that occur in practice.

This would ensure that developers can always “get their job done”, but also provide a path where we can incrementally improve the world over the course of years (if necessary).

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context),

Just MHO, but I don’t think you’d need or want the concept of “actor local data” in the sense of TLS (e.g. __thread). All actor methods have a ‘self’ already, and having something like TLS strongly encourages breaking the model. To me, the motivation for TLS is to provide an easier way to migrate single-threaded global variables, when introducing threading into legacy code.

I disagree, if you have an execution context that is "my database subsystem", it probably has an object that knows about all your database handles. Or do you suggest that your database subsystem is an actor too? I don't quite see the database subsystem as an actor in the sense that it's a strongly exclusive execution context (if your database is SQLite) and will really receive actors to execute, providing them a home.

I haven’t spent a lot of thought on this, but I tend to think that “database as an actor” fits with the programming model that I’m advocating for. The problem (as I think you’re getting at) is that you don’t want serialization at the API level of the database, you want to allow the database itself to have multiple threads running around in its process.

This is similar in some ways to your NIC example from before, where I think you want one instance of a NIC actor for each piece of hardware you have… but you want multithreaded access.

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think that something like this is attractive because it provides key things that I value highly in a concurrency model:

- The programmer has a natural way to model things: “a instance of a database is a thing", and therefore should be modeled as an instance of an actor. The fact that the actor can handle multiple concurrent requests is an implementation detail the clients shouldn’t have to be rewritten to understand.

- Making this non-default would provide proper progressive disclosure of complexity.

- You’d still get improved safety and isolation of the system as a whole, even if individual actors are “optimized” in this way.

- When incrementally migrating code to the actor model, this would make it much easier to provide actor wrappers for existing subsystems built on shared mutable state.

- Something like this would also probably be the right abstraction for imported RPC services that allow for multiple concurrent synchronous requests.

I’m curious to know what you and others think about this concept.

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it. Ideally your actors are in the shape of a DAG. Cycles would be properly modeled with (e.g.) weak references, but I think that synchronous/awaited cyclic calls should fail fast at runtime. To be more explicit, I mean something like:

1. Actor A has a reference to actor B. Actor B has a weak backref to actor A.
2. Actor A does an await on an async actor method on B. As such, it’s queue is blocked: no messages can be run on its queue until the B method returns.
3. Actor B’s method turns around and calls a method on A, awaiting the result. Because there is a cyclic wait, we have a deadlock, one which ideally fails fast with a trap.

The solution for this is to change the callback to A to be a call an async (void returning) actor method. Such a call would simply enqueue the request on A’s queue, and get serviced after the original A call returns.

I wrote this out to make it clear what problem I think you’re talking about. If this isn’t what you’re trying to get at, please let me know :slight_smile:

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Generalizing the approach is completely possible, but it is also possible to introduce a language abstraction that is “underneath” the high level event loops. That’s what I’m proposing.

I'm not sure I understand what you mean here, can you elaborate?

The concept of an event loop is IMO a library abstraction that could be built on top of the actor model. The actor would represent the “context” for the concurrency, the event loop API would represent the other stuff.

-Chris

···

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net> wrote:


(Pierre Habouzit) #11

-Pierre

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.

Ok, I stand corrected. My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.

a queue on 64bit systems is 128 bytes (nowadays). Provided you have that amount of VM available to you (1M queues is 128M after all) then you're good.
If a large amount of them fragments the VM beyond this is a malloc/VM bug on 64bit systems that are supposed to have enough address space.

queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint.

AFAICT, the “one queue per actor” model is the only one that makes sense. It doesn’t have to be FIFO, but it needs to be some sort of queue. If you allow servicing multiple requests within the actor at a time, then you lose the advantages of “no shared mutable state”.

I agree, I don't quite care about how the actor is implemented here, what I care about is where it runs onto. my wording was poor, what I really meant is:

queues at the bottom of a queue hierarchy are serial/exclusive execution contexts, and if you're not modeling actors as being such fully independent serial queues, then these two concepts are just disjoint.

In GCD there's a very big difference between the one queue at the root of your graph (just above the thread pool) and any other that is within. The number that doesn't scale is the number of the former contexts, not the latter.

I’m sorry, but I still don’t understand what you’re getting at here.

What doesn't scale is asking for threads, not having queues.

The pushback I have here is that today Runloops and dispatch queues on iOS/macOS are already systems that have huge impedance mismatches, and do not share the resources either (in terms of OS physical threads). I would hate for us to bring on ourselves the pain of creating a third completely different system that is using another way to use threads. When these 3 worlds would interoperate this would cause significant amount of context switches just to move across the boundaries.

Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues. In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.

Good :slight_smile:

I'd like to dive and debunk this "GCD doesn't scale" point, that I'd almost call a myth (and I'm relatively unhappy to see these words in your proposal TBH because they send the wrong message).

I’m happy to revise the proposal, please let me know what you think makes sense.

What doesn't scale is the way GCD asks for threads, which is what the global concurrent queues abstract.
The way it works (or rather limp along) is what we should not reproduce for Swift.

What you can write in your proposal and is true is "GCD current relationship with the system threads doesn't scale". It's below the queues that the scalability has issues.
Dave Z. explained it in a mail earlier today in very good words.

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

I don’t see a problem with disallowing actors on the global concurrent queues in general, but I do think it makes sense to be able to provide an abstraction for homing code on the main thread/queue/actor somehow.

I think this aligns with your idea, in the sense that if you exhaust the Swift Actor Thread Pool, then you're screwed forever. But given that the pattern above can be hidden inside framework code that the developer has *no control over*, it is fairly easy to write actors that eventually through the said framework, would result in this synchronization pattern happening. Even if we can build the amazing debugging tools that make these immediately obvious to the developer (as in understanding what is happening), I don't know how the developer can do *anything* to work around these. The only solution is to fix the frameworks. However the experience of the last few years of maintaining GCD shows that the patterns above are not widely perceived as a dramatic design issue, let alone a bug. It will be a very long road before most framework code there is out there is Swift Actor async/await safe.

What is your proposal to address this? that we annotate functions that are unsafe? And then, assuming we succeed at this Herculean task, what can developers do anyway about it if the only way to do a thing is async/await unsafe ?

I don’t think that annotations are the right way to go. It should be an end-goal for the system to be almost completely actor safe, so the parity of the annotation would have to be “this code is unsafe”. Given that, I don’t see how we could audit the entire world, and I don’t think that an annotation explosion would be acceptable this late in Swift’s lifetime. It would be like IUO-everywhere problem in the Swift 1 betas.

My preferred solution is three-fold:
- Make frameworks incrementally actor safe over time. Ensure that new APIs are done right, and make sure that no existing APIs ever go from “actor safe” to “actor unsafe”.
- Provide a mechanism that developers can use to address problematic APIs that they encounter in practice. It should be something akin to “wrap your calls in a closure and pass it to a special GCD function”, or something else of similar complexity.
- Continue to improve perf and debugger tools to help identify problematic cases that occur in practice.

This would ensure that developers can always “get their job done”, but also provide a path where we can incrementally improve the world over the course of years (if necessary).

fair enough.

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context),

Just MHO, but I don’t think you’d need or want the concept of “actor local data” in the sense of TLS (e.g. __thread). All actor methods have a ‘self’ already, and having something like TLS strongly encourages breaking the model. To me, the motivation for TLS is to provide an easier way to migrate single-threaded global variables, when introducing threading into legacy code.

I disagree, if you have an execution context that is "my database subsystem", it probably has an object that knows about all your database handles. Or do you suggest that your database subsystem is an actor too? I don't quite see the database subsystem as an actor in the sense that it's a strongly exclusive execution context (if your database is SQLite) and will really receive actors to execute, providing them a home.

I haven’t spent a lot of thought on this, but I tend to think that “database as an actor” fits with the programming model that I’m advocating for. The problem (as I think you’re getting at) is that you don’t want serialization at the API level of the database, you want to allow the database itself to have multiple threads running around in its process.

This is similar in some ways to your NIC example from before, where I think you want one instance of a NIC actor for each piece of hardware you have… but you want multithreaded access.

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think that something like this is attractive because it provides key things that I value highly in a concurrency model:

- The programmer has a natural way to model things: “a instance of a database is a thing", and therefore should be modeled as an instance of an actor. The fact that the actor can handle multiple concurrent requests is an implementation detail the clients shouldn’t have to be rewritten to understand.

- Making this non-default would provide proper progressive disclosure of complexity.

- You’d still get improved safety and isolation of the system as a whole, even if individual actors are “optimized” in this way.

- When incrementally migrating code to the actor model, this would make it much easier to provide actor wrappers for existing subsystems built on shared mutable state.

- Something like this would also probably be the right abstraction for imported RPC services that allow for multiple concurrent synchronous requests.

I’m curious to know what you and others think about this concept.

I think what you said made sense. But it wasn't what I meant. I was really thinking at sqlite where the database is strongly serial (you can't use it in a multi-threaded way well, or rather you can but it has a big lock inside). It is much better to interact with that dude on the same exclusion context all the time. What I meant is really having some actors that have a "strong affinity" with a given execution context which eases the task of the actor scheduler.

Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:

- the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up

- if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.

It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.

Declaring that an actor targets a given existing serial context also means that if that actor needs to make urgent progress the context in question has to be rushed, and its priority elevated. It's really hard to do the same on an anonymous global context (the way dispatch does it still is to actually enqueue stealer work that try to steal the "actor" at a higher priority. this approach is terribly wasteful).

I know that I have a hard time getting my point across mostly because I'm not a language design guy, I'm a system design guy, and I clearly don't understand Actors enough to get how to integrate them with the OS. But to me, to go back to the database example, the Database Actor, or the NIC Actor from earlier are different from say a given SQL query or a given HTTP request: the formers are the things the OS needs to know about *in the kernel*, whereas the SQL Query or HTTP Request are merely actors enqueued on the formers. IOW these top-level actors are different, because they are top-level, right atop the kernel/low-level runtime, and this is the thing the kernel has to be able to reason about. This makes them different.

In dispatch, there are 2 kind of queues at the API level in that regard:
- the global queues, which aren't queues like the other and really is just an abstraction on top of the thread pool
- all the other queues that you can target on each other the way you want.

It is clear today that it was a mistake and that there should have been 3 kind of queues:
- the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
- the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
- any other "inner" queue, which the kernel couldn't care less about

In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.

I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it.

I don't see why you'd need this with dispatch queues either today, however my radar queue disagrees strongly with this statement. People want this all the time, mostly because the outer actor has a state machine and the inner actor wants it to make progress before it continues working.

In CFRunloop terms it's like running the runloop from your work by calling CFRunloopRun() yourself again until you can observe some event happened.

It's not great and problematic for tons of reasons. If actors are nested, we need a way to make sure people don't have to ever do something like that.

Just as a data point, my reactions to this thread yielded a private discussion off list *exactly* about someone wanting us to provide something like this for dispatch (or rather in this instance having the inner actor be able to suspend the outer one, but it's just a corollary / similar layering violation where the inner actor wants to affect the outer one in a way the outer one).

Ideally your actors are in the shape of a DAG. Cycles would be properly modeled with (e.g.) weak references, but I think that synchronous/awaited cyclic calls should fail fast at runtime. To be more explicit, I mean something like:

1. Actor A has a reference to actor B. Actor B has a weak backref to actor A.
2. Actor A does an await on an async actor method on B. As such, it’s queue is blocked: no messages can be run on its queue until the B method returns.
3. Actor B’s method turns around and calls a method on A, awaiting the result. Because there is a cyclic wait, we have a deadlock, one which ideally fails fast with a trap.

The solution for this is to change the callback to A to be a call an async (void returning) actor method. Such a call would simply enqueue the request on A’s queue, and get serviced after the original A call returns.

I wrote this out to make it clear what problem I think you’re talking about. If this isn’t what you’re trying to get at, please let me know :slight_smile:

It is definitely the family of problems I'm worried about. I want to make sure we have an holistic approach here because I think that recursive mutexes, recursive CFRunloop runs and similar ideas are flawed and dangerous. I want to make sure we understand

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Generalizing the approach is completely possible, but it is also possible to introduce a language abstraction that is “underneath” the high level event loops. That’s what I’m proposing.

I'm not sure I understand what you mean here, can you elaborate?

The concept of an event loop is IMO a library abstraction that could be built on top of the actor model. The actor would represent the “context” for the concurrency, the event loop API would represent the other stuff.

I think by reading your reply I start to grasp a bit better how you see Actors working, indeed. This makes sense to me. Like I said above then, we need to see top-level Actors (that can be equivalent to an OS thread, or whatever concurrent execution context the OS provides for these) being different from other ones. Like Dave Z. was saying earlier, building a language that pretends your resources are infinite is a mistake. We have to force people to think about the cost of their architecture.

I don't see actors that would be running on a "global pool" kind of thing to be such top-level actors though. But these top-level actors probably need to live outside of the pool: the NIC example is very good for this: if you don't schedule it as soon as it has events, then your networking bandwidth will suffer as things like TCP or SCTP are very sensitive to various timers and delays, so you can't be dependent on a constrained shared resource. (this is exactly why GCD has a notion of overcommit queues which bypass the NCPU limits, sadly in GCD way too many things are overcommit, but this is another discussion to have entirely).

I think that what I'm getting at is that we can't build Actors without defining how they interact with the operating system and what guarantees of latency and liveness they have.

Is this making more sense?

-Pierre

···

On Sep 2, 2017, at 9:59 PM, Chris Lattner <clattner@nondot.org> wrote:
On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:


(Pierre Habouzit) #12

[Sorry I hit send too fast, let me fix two spots I didn't correct]

-Pierre

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.

Ok, I stand corrected. My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.

a queue on 64bit systems is 128 bytes (nowadays). Provided you have that amount of VM available to you (1M queues is 128M after all) then you're good.
If a large amount of them fragments the VM beyond this is a malloc/VM bug on 64bit systems that are supposed to have enough address space.

queues are serial/exclusive execution contexts, and if you're not modeling actors as being serial queues, then these two concepts are just disjoint.

AFAICT, the “one queue per actor” model is the only one that makes sense. It doesn’t have to be FIFO, but it needs to be some sort of queue. If you allow servicing multiple requests within the actor at a time, then you lose the advantages of “no shared mutable state”.

I agree, I don't quite care about how the actor is implemented here, what I care about is where it runs onto. my wording was poor, what I really meant is:

queues at the bottom of a queue hierarchy are serial/exclusive execution contexts, and if you're not modeling actors as being such fully independent serial queues, then these two concepts are just disjoint.

In GCD there's a very big difference between the one queue at the root of your graph (just above the thread pool) and any other that is within. The number that doesn't scale is the number of the former contexts, not the latter.

I’m sorry, but I still don’t understand what you’re getting at here.

What doesn't scale is asking for threads, not having queues.

The pushback I have here is that today Runloops and dispatch queues on iOS/macOS are already systems that have huge impedance mismatches, and do not share the resources either (in terms of OS physical threads). I would hate for us to bring on ourselves the pain of creating a third completely different system that is using another way to use threads. When these 3 worlds would interoperate this would cause significant amount of context switches just to move across the boundaries.

Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues. In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.

Good :slight_smile:

I'd like to dive and debunk this "GCD doesn't scale" point, that I'd almost call a myth (and I'm relatively unhappy to see these words in your proposal TBH because they send the wrong message).

I’m happy to revise the proposal, please let me know what you think makes sense.

What doesn't scale is the way GCD asks for threads, which is what the global concurrent queues abstract.
The way it works (or rather limp along) is what we should not reproduce for Swift.

What you can write in your proposal and is true is "GCD current relationship with the system threads doesn't scale". It's below the queues that the scalability has issues.
Dave Z. explained it in a mail earlier today in very good words.

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

I don’t see a problem with disallowing actors on the global concurrent queues in general, but I do think it makes sense to be able to provide an abstraction for homing code on the main thread/queue/actor somehow.

I think this aligns with your idea, in the sense that if you exhaust the Swift Actor Thread Pool, then you're screwed forever. But given that the pattern above can be hidden inside framework code that the developer has *no control over*, it is fairly easy to write actors that eventually through the said framework, would result in this synchronization pattern happening. Even if we can build the amazing debugging tools that make these immediately obvious to the developer (as in understanding what is happening), I don't know how the developer can do *anything* to work around these. The only solution is to fix the frameworks. However the experience of the last few years of maintaining GCD shows that the patterns above are not widely perceived as a dramatic design issue, let alone a bug. It will be a very long road before most framework code there is out there is Swift Actor async/await safe.

What is your proposal to address this? that we annotate functions that are unsafe? And then, assuming we succeed at this Herculean task, what can developers do anyway about it if the only way to do a thing is async/await unsafe ?

I don’t think that annotations are the right way to go. It should be an end-goal for the system to be almost completely actor safe, so the parity of the annotation would have to be “this code is unsafe”. Given that, I don’t see how we could audit the entire world, and I don’t think that an annotation explosion would be acceptable this late in Swift’s lifetime. It would be like IUO-everywhere problem in the Swift 1 betas.

My preferred solution is three-fold:
- Make frameworks incrementally actor safe over time. Ensure that new APIs are done right, and make sure that no existing APIs ever go from “actor safe” to “actor unsafe”.
- Provide a mechanism that developers can use to address problematic APIs that they encounter in practice. It should be something akin to “wrap your calls in a closure and pass it to a special GCD function”, or something else of similar complexity.
- Continue to improve perf and debugger tools to help identify problematic cases that occur in practice.

This would ensure that developers can always “get their job done”, but also provide a path where we can incrementally improve the world over the course of years (if necessary).

fair enough.

Actors are the way you present the various tasks/operations/activities that you schedule. These contexts are a way for the developer to explain which things are related in a consistent system, and give them access to state which is local to this context (whether it's TSD for threads, or queue specific data, or any similar context),

Just MHO, but I don’t think you’d need or want the concept of “actor local data” in the sense of TLS (e.g. __thread). All actor methods have a ‘self’ already, and having something like TLS strongly encourages breaking the model. To me, the motivation for TLS is to provide an easier way to migrate single-threaded global variables, when introducing threading into legacy code.

I disagree, if you have an execution context that is "my database subsystem", it probably has an object that knows about all your database handles. Or do you suggest that your database subsystem is an actor too? I don't quite see the database subsystem as an actor in the sense that it's a strongly exclusive execution context (if your database is SQLite) and will really receive actors to execute, providing them a home.

I haven’t spent a lot of thought on this, but I tend to think that “database as an actor” fits with the programming model that I’m advocating for. The problem (as I think you’re getting at) is that you don’t want serialization at the API level of the database, you want to allow the database itself to have multiple threads running around in its process.

This is similar in some ways to your NIC example from before, where I think you want one instance of a NIC actor for each piece of hardware you have… but you want multithreaded access.

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think that something like this is attractive because it provides key things that I value highly in a concurrency model:

- The programmer has a natural way to model things: “a instance of a database is a thing", and therefore should be modeled as an instance of an actor. The fact that the actor can handle multiple concurrent requests is an implementation detail the clients shouldn’t have to be rewritten to understand.

- Making this non-default would provide proper progressive disclosure of complexity.

- You’d still get improved safety and isolation of the system as a whole, even if individual actors are “optimized” in this way.

- When incrementally migrating code to the actor model, this would make it much easier to provide actor wrappers for existing subsystems built on shared mutable state.

- Something like this would also probably be the right abstraction for imported RPC services that allow for multiple concurrent synchronous requests.

I’m curious to know what you and others think about this concept.

I think what you said made sense. But it wasn't what I meant. I was really thinking at sqlite where the database is strongly serial (you can't use it in a multi-threaded way well, or rather you can but it has a big lock inside). It is much better to interact with that dude on the same exclusion context all the time. What I meant is really having some actors that have a "strong affinity" with a given execution context which eases the task of the actor scheduler.

Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:

- the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up

- if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.

It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.

Declaring that an actor targets a given existing serial context also means that if that actor needs to make urgent progress the context in question has to be rushed, and its priority elevated. It's really hard to do the same on an anonymous global context (the way dispatch does it still is to actually enqueue stealer work that try to steal the "actor" at a higher priority. this approach is terribly wasteful).

I know that I have a hard time getting my point across mostly because I'm not a language design guy, I'm a system design guy, and I clearly don't understand Actors enough to get how to integrate them with the OS. But to me, to go back to the database example, the Database Actor, or the NIC Actor from earlier are different from say a given SQL query or a given HTTP request: the formers are the things the OS needs to know about *in the kernel*, whereas the SQL Query or HTTP Request are merely actors enqueued on the formers. IOW these top-level actors are different, because they are top-level, right atop the kernel/low-level runtime, and this is the thing the kernel has to be able to reason about. This makes them different.

In dispatch, there are 2 kind of queues at the API level in that regard:
- the global queues, which aren't queues like the other and really is just an abstraction on top of the thread pool
- all the other queues that you can target on each other the way you want.

It is clear today that it was a mistake and that there should have been 3 kind of queues:
- the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
- the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
- any other "inner" queue, which the kernel couldn't care less about

In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.

I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it.

I don't see why you'd need this with dispatch queues either today, however my radar queue disagrees strongly with this statement. People want this all the time, mostly because the outer actor has a state machine and the inner actor wants it to make progress before it continues working.

In CFRunloop terms it's like running the runloop from your work by calling CFRunloopRun() yourself again until you can observe some event happened.

It's not great and problematic for tons of reasons. If actors are nested, we need a way to make sure people don't have to ever do something like that.

Just as a data point, my reactions to this thread yielded a private discussion off list *exactly* about someone wanting us to provide something like this for dispatch (or rather in this instance having the inner actor be able to suspend the outer one, but it's just a corollary / similar layering violation where the inner actor wants to affect the outer one in a way the outer one).

-> in a way the outer one didn't expect

Ideally your actors are in the shape of a DAG. Cycles would be properly modeled with (e.g.) weak references, but I think that synchronous/awaited cyclic calls should fail fast at runtime. To be more explicit, I mean something like:

1. Actor A has a reference to actor B. Actor B has a weak backref to actor A.
2. Actor A does an await on an async actor method on B. As such, it’s queue is blocked: no messages can be run on its queue until the B method returns.
3. Actor B’s method turns around and calls a method on A, awaiting the result. Because there is a cyclic wait, we have a deadlock, one which ideally fails fast with a trap.

The solution for this is to change the callback to A to be a call an async (void returning) actor method. Such a call would simply enqueue the request on A’s queue, and get serviced after the original A call returns.

I wrote this out to make it clear what problem I think you’re talking about. If this isn’t what you’re trying to get at, please let me know :slight_smile:

It is definitely the family of problems I'm worried about. I want to make sure we have an holistic approach here because I think that recursive mutexes, recursive CFRunloop runs and similar ideas are flawed and dangerous. I want to make sure we understand

I want to make sure we understand which limitations we want to impose here, and by limitations I really mean layering/architecture rules, and that we document them upfront and explain how to work with them.

IMO, Swift as a runtime should define what an execution context is, and be relatively oblivious of which context it is exactly as long it presents a few common capabilities:
- possibility to schedule work (async)
- have a name
- be an exclusion context
- is an entity the kernel can reason about (if you want to be serious about any integration on a real operating system with priority inheritance and complex issues like this, which it is the OS responsibility to handle and not the language)
- ...

In that sense, whether your execution context is:
- a dispatch serial queue
- a CFRunloop
- a libev/libevent/... event loop
- your own hand rolled event loop

Generalizing the approach is completely possible, but it is also possible to introduce a language abstraction that is “underneath” the high level event loops. That’s what I’m proposing.

I'm not sure I understand what you mean here, can you elaborate?

The concept of an event loop is IMO a library abstraction that could be built on top of the actor model. The actor would represent the “context” for the concurrency, the event loop API would represent the other stuff.

I think by reading your reply I start to grasp a bit better how you see Actors working, indeed. This makes sense to me. Like I said above then, we need to see top-level Actors (that can be equivalent to an OS thread, or whatever concurrent execution context the OS provides for these) being different from other ones. Like Dave Z. was saying earlier, building a language that pretends your resources are infinite is a mistake. We have to force people to think about the cost of their architecture.

I don't see actors that would be running on a "global pool" kind of thing to be such top-level actors though. But these top-level actors probably need to live outside of the pool: the NIC example is very good for this: if you don't schedule it as soon as it has events, then your networking bandwidth will suffer as things like TCP or SCTP are very sensitive to various timers and delays, so you can't be dependent on a constrained shared resource. (this is exactly why GCD has a notion of overcommit queues which bypass the NCPU limits, sadly in GCD way too many things are overcommit, but this is another discussion to have entirely).

What I realized after send for this is that in your way of reasoning about this, the "global pool" Is likely an Actor itself, that allows for concurrency and whose interface is about executing other actors that are otherwise not related to it.

···

On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> wrote:

On Sep 2, 2017, at 9:59 PM, Chris Lattner <clattner@nondot.org <mailto:clattner@nondot.org>> wrote:
On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:

I think that what I'm getting at is that we can't build Actors without defining how they interact with the operating system and what guarantees of latency and liveness they have.

Is this making more sense?

-Pierre
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org <mailto:swift-evolution@swift.org>
https://lists.swift.org/mailman/listinfo/swift-evolution


(Chris Lattner) #13

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.

Ok, I stand corrected. My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.

a queue on 64bit systems is 128 bytes (nowadays). Provided you have that amount of VM available to you (1M queues is 128M after all) then you're good.
If a large amount of them fragments the VM beyond this is a malloc/VM bug on 64bit systems that are supposed to have enough address space.

Right, I was referring to the fragmentation you get by having a large number of 2M stacks allocated for each kernel thread. I recognize the queues themselves are small.

What doesn't scale is asking for threads, not having queues.

Right, agreed.

Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues. In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.

Good :slight_smile:

I meant to be pretty clear about that all along, but perhaps I missed the mark. In any case, I’ve significantly revised the “scalable runtime” section of the doc to reflect some of this discussion, please let me know what you think:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#scalable-runtime

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think what you said made sense.

Ok, I captured this in yet-another speculative section:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency

But it wasn't what I meant. I was really thinking at sqlite where the database is strongly serial (you can't use it in a multi-threaded way well, or rather you can but it has a big lock inside). It is much better to interact with that dude on the same exclusion context all the time. What I meant is really having some actors that have a "strong affinity" with a given execution context which eases the task of the actor scheduler.

Ah ok. Yes, I think that wrapping a “subsystem with a big lock” in an actor is a very natural thing to do, just as much as it makes sense to wrap a non-threadsafe API in an actor. Any internal locking would be subsumed by the outer actor queue, but that’s ok - all the lock acquires would be uncontended and fast :slight_smile:

Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:

- the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up

- if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.

It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.

I don’t have anything intelligent to say here, but it sounds like you understand the issues well :slight_smile: I agree that giving up 5 years of progress is not appealing.

Declaring that an actor targets a given existing serial context also means that if that actor needs to make urgent progress the context in question has to be rushed, and its priority elevated. It's really hard to do the same on an anonymous global context (the way dispatch does it still is to actually enqueue stealer work that try to steal the "actor" at a higher priority. this approach is terribly wasteful).

Ok.

It is clear today that it was a mistake and that there should have been 3 kind of queues:
- the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
- the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
- any other "inner" queue, which the kernel couldn't care less about

In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.

Well this is an opportunity to encourage the right thing to happen. IMO it seems that the default should strongly be for an “everyday actor” to be the third case, which is what developers will reach for when building out their app abstractions. We can provide some API or other system that allows sufficiently advanced developers to tie an actor into the second bucket, and we can disallow the first entirely if that is desirable (e.g. by a runtime trap if nothing else).

I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.

The thing I still don’t understand is where #2 comes from: are these system defined queues that the developer interacts with, or are these things that developers define in their code? How does the kernel know about them if the developer defines new ones?

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it.

I don't see why you'd need this with dispatch queues either today, however my radar queue disagrees strongly with this statement. People want this all the time, mostly because the outer actor has a state machine and the inner actor wants it to make progress before it continues working.

In CFRunloop terms it's like running the runloop from your work by calling CFRunloopRun() yourself again until you can observe some event happened.

It's not great and problematic for tons of reasons. If actors are nested, we need a way to make sure people don't have to ever do something like that.

Just as a data point, my reactions to this thread yielded a private discussion off list *exactly* about someone wanting us to provide something like this for dispatch (or rather in this instance having the inner actor be able to suspend the outer one, but it's just a corollary / similar layering violation where the inner actor wants to affect the outer one in a way the outer one didn't expect).

I get that this comes up now and then, but I also don’t understand what we can do about it. Allowing “recursive lock” style reentrancy into an actor breaks fundamental aspects of the design of actors: that you know at the start of any actor message that the actor invariants are intact.

It would be possible to relax this and give up on that (perhaps rather theoretical) win, but I’d rather not. I think that any recursive case can be expressed through a refactoring of the code, e.g. to use async sending instead of sync calls. Also, I think we should strongly encourage pure async “fire and forget” actor methods anyway - IOW, we should encourage push, not pull - since they provide much stronger guarantees in general.

It is definitely the family of problems I'm worried about. I want to make sure we have an holistic approach here because I think that recursive mutexes, recursive CFRunloop runs and similar ideas are flawed and dangerous. I want to make sure we understand which limitations we want to impose here, and by limitations I really mean layering/architecture rules, and that we document them upfront and explain how to work with them.

+1

I think that what I'm getting at is that we can't build Actors without defining how they interact with the operating system and what guarantees of latency and liveness they have.

Agreed, thanks Pierre!

-Chris

···

On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> wrote:

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:


(Pierre Habouzit) #14

What do you mean by this?

My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.

It completely does provided these 1M queues / tasks are organized on several well known independent contexts.

Ok, I stand corrected. My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.

a queue on 64bit systems is 128 bytes (nowadays). Provided you have that amount of VM available to you (1M queues is 128M after all) then you're good.
If a large amount of them fragments the VM beyond this is a malloc/VM bug on 64bit systems that are supposed to have enough address space.

Right, I was referring to the fragmentation you get by having a large number of 2M stacks allocated for each kernel thread. I recognize the queues themselves are small.

Yes, this is still poor, if every single queue needs to make a thread at the same time, we've lost. Whatever abstraction you pick on top it'll be true: physical OS threads (pthreads or workqueue threads) are inherently expensive and don't scale well.

What doesn't scale is asking for threads, not having queues.

Right, agreed.

Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues. In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.

Good :slight_smile:

I meant to be pretty clear about that all along, but perhaps I missed the mark. In any case, I’ve significantly revised the “scalable runtime” section of the doc to reflect some of this discussion, please let me know what you think:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#scalable-runtime

I'm much happier with this new wording which is IMO a very fair assessment. thanks for rewording it.

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I think I need to read more on actors, because the same way you're not an OS runtime expert, I'm not (or rather no longer, I started down that path a lifetime ago) a language expert at all, and I feel like I need to understand your world better to try to explain this part better to you.

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think what you said made sense.

Ok, I captured this in yet-another speculative section:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency

Great. BTW I agree 100% with:

That said, this is definitely a power-user feature, and we should understand, build, and get experience using the basic system before considering adding something like this.

Private concurrent queues are not a success in dispatch and cause several issues, these queues are second class citizens in GCD in terms of feature they support, and building something with concurrency *within* is hard. I would keep it as "that's where we'll go some day" but not try to attempt it until we've build the simpler (or rather less hard) purely serial case first.

But it wasn't what I meant. I was really thinking at sqlite where the database is strongly serial (you can't use it in a multi-threaded way well, or rather you can but it has a big lock inside). It is much better to interact with that dude on the same exclusion context all the time. What I meant is really having some actors that have a "strong affinity" with a given execution context which eases the task of the actor scheduler.

Ah ok. Yes, I think that wrapping a “subsystem with a big lock” in an actor is a very natural thing to do, just as much as it makes sense to wrap a non-threadsafe API in an actor. Any internal locking would be subsumed by the outer actor queue, but that’s ok - all the lock acquires would be uncontended and fast :slight_smile:

Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:

- the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up

- if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.

It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.

I don’t have anything intelligent to say here, but it sounds like you understand the issues well :slight_smile: I agree that giving up 5 years of progress is not appealing.

TBH our team has to explain into more depth how eventing works on Darwin, the same way we (or maybe it's just I, I don't want to disparage my colleagues here :P) need to understand Actors and what they mean better, I think the swift core team (and whoever works on concurrency) needs to be able to understand what I explained above.

Declaring that an actor targets a given existing serial context also means that if that actor needs to make urgent progress the context in question has to be rushed, and its priority elevated. It's really hard to do the same on an anonymous global context (the way dispatch does it still is to actually enqueue stealer work that try to steal the "actor" at a higher priority. this approach is terribly wasteful).

Ok.

It is clear today that it was a mistake and that there should have been 3 kind of queues:
- the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
- the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
- any other "inner" queue, which the kernel couldn't care less about

In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.

Well this is an opportunity to encourage the right thing to happen. IMO it seems that the default should strongly be for an “everyday actor” to be the third case, which is what developers will reach for when building out their app abstractions.

Correct.

We can provide some API or other system that allows sufficiently advanced developers to tie an actor into the second bucket, and we can disallow the first entirely if that is desirable (e.g. by a runtime trap if nothing else).

Yes, I would be pleased with something like that, this is exactly what I'm advocating.

I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.

The thing I still don’t understand is where #2 comes from: are these system defined queues that the developer interacts with, or are these things that developers define in their code? How does the kernel know about them if the developer defines new ones?

This is, sir, a very good question, and I think anyone working on Actors will need to understand most of this :wink: You "just" need to understand most of the kevent and workq integration to answer this. It ties into what I was saying about kernel issued events above. But to try to give you a sense quickly, there are two cases, pure queues, and queues with kernel events registrations:

1) for the former case, the kernel knows just in time when the queue is woken up (the term GCD uses to describe the idle -> non empty transition for a queue), because the kernel never needs to know before that point. The GCD runtime will evaluate that the queue is rooted at "the bottom", and will make it know to the kernel then.

2) for the second case, queue hierarchies that have event monitoring setup (dispatch sources, xpc connections) are registered upfront with the kernel. The evaluation of which is the bottom queue an event is handled on is done by the GCD runtime when the first registration happens. This allows for both kernel and user spaces to the thread requests on behalf of this hierarchy and for these requests to coalesce in kernel. (this is btw awesome technology that has been very difficult in the making FWIW, but I digress).

(2) is what scares me: the kernel has stuff to deliver, and kernels don't like to hold on data on behalf of userspace forever, because this burns wired memory. This means that this doesn't quite play nice with a global anonymous pool that can be starved at any time. Especially if we're talking XPC Connections in a daemon, where super high priority clients such as the frontmost app (or even worse, SpringBoard) can ask you questions that you'd better answer as fast as you can.

The solution we recommend to solve this to developers (1st and 3rd parties), is for all your XPC connections for your clients are rooted at the same bottom queue that represents your "communication" subsystem, so you can imagine it be the "Incoming request multiplexer" Actor or something, this one is in the 2nd category and is known to the kernel so that the kernel can instantiate the Actor itself without asking permission to userspace and directly make the execution context, and rely on the scheduler to get the priorities right.

The alternative solution is for userspace to have a construct similar to a kernel scheduler in every process and have a strong tie with the kernel so that the context can be created within the shared pool, and then the userspace scheduler would take over to try to schedule this, and this is where all the discussion about phsyical threads starts: it starts to be very tempting to have "lightweight" (I really believe that for sufficiently complex code, with C interleaved, these actually are merely slightly less heavyweight) users-pace stack switching and basically make every process replicate what a kernel does. This is where I'd like for us not to go because it means solving the same problems in two places (kernel and usperspace), which requires synchronization between states and doubles your bug surface. It also requires a large amount of diagnostic tools to be able to represent these things in meaningful ways, and last, doesn't work at all with TSDs (which Swift uses extensively it its runtime, and that obj-c & C frameworks use all over the place) and most of our locking primitives (which when you take them make the locking primitive tied to the thread ID / port that took the lock, so if you have a fluid association between userspace threads and physical ones, you're in for a very bad time). All these reasons is why we took the direction with the OS of making the kernel more aware of the runtime rather than having the runtime be more like a kernel.

This is why our belief (in my larger K(ernel) & R(untime) team) is that instead of trying to paper over the fact that there's a real OS running your stuff, we need to embrace it and explain to people that everywhere in a traditional POSIX world they would have used a real pthread_create()d thread to perform the work of a given subsystem, they create one such category #2 bottom queue that represents this thread (and you make this subsystem an Actor), and that the same way in POSIX you can't quite expect a process to have thousands of threads, you can't really have thousands of such top level actors either. It doesn't mean that there can't be some anonymous pool to run stuff on it for the stuff that is less your raison d'être, it just means that the things your app that are really what it was built to do should be organized in a resource aware way to take maximum advantage of the system. Throwing a million actors to the system without more upfront organization and initial thinking by the developers is not optimizable.

FWIW for other people wanting to participate in this discussion, what I'm saying feels very Darwin centric because I use XPC as an example, but on Linux replace XPC with D-BUS and all my arguments apply the same. OS Design meets the same class of problems everywhere, I'm just taking concrete example that people who've worked at Apple understand. I don't know Windows at all, but I'd be very surprised if they don't have exactly something like this too.

You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.

Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it.

I don't see why you'd need this with dispatch queues either today, however my radar queue disagrees strongly with this statement. People want this all the time, mostly because the outer actor has a state machine and the inner actor wants it to make progress before it continues working.

In CFRunloop terms it's like running the runloop from your work by calling CFRunloopRun() yourself again until you can observe some event happened.

It's not great and problematic for tons of reasons. If actors are nested, we need a way to make sure people don't have to ever do something like that.

Just as a data point, my reactions to this thread yielded a private discussion off list *exactly* about someone wanting us to provide something like this for dispatch (or rather in this instance having the inner actor be able to suspend the outer one, but it's just a corollary / similar layering violation where the inner actor wants to affect the outer one in a way the outer one didn't expect).

I get that this comes up now and then, but I also don’t understand what we can do about it. Allowing “recursive lock” style reentrancy into an actor breaks fundamental aspects of the design of actors: that you know at the start of any actor message that the actor invariants are intact.

Exactly.

It would be possible to relax this and give up on that (perhaps rather theoretical) win, but I’d rather not.

Me neither, I hate this, and all the radars I was alluding to are either NTBFed with an explanation to the reporter on how to do the same without using recursive patterns, or sit in our queue with us being perplexed. It's just that we have to know these questions will happen.

I think that any recursive case can be expressed through a refactoring of the code, e.g. to use async sending instead of sync calls.

This is true, but I would argue that if you reached the point where you need a refactor, then you screwed up during design first. I think what I mean here is that we should explain upfront how to not design things that lead to this problem with simple example so that people can use them to build complex systems that never fall in this trap.

Also, I think we should strongly encourage pure async “fire and forget” actor methods anyway - IOW, we should encourage push, not pull

I almost agree. We should strongly encourage the `pure async "account for, fire and forget" actor methods`. The `account for` is really backpressure, where you actually don't fire if the remote queue is full and instead rely on some kind of reactive pattern to pull from you. (but I know you wrote that on your proposal and you're aware of it).

The other problem is the following. I used to work on iCloud Drive, and the design of the daemon uses a pure GCD design (including its interaction with the SQLite metadata database(s) it has, which will probably make you understand why I took this example earlier) only which makes it relevant to this discussion. Everytime you propose an idea, I'm really trying to ask myself "what would we do in bird".

One problem that is interesting and that I don't know how to handle in a pure fire-and-forget world is "iCloud Account Logout". This is something that today is implemented in a very synchronous fashion, because getting this wrong means the worst thing you can do for privacy: if you have a very quick succession of login/logout/login/logout and you screw this up, then you can have cross-polination between accounts, and we take that very seriously as you'd imagine.

That being said, the solution that was chosen is that the part of the code performing the logout sends tons of cancelations of all the things in flight, starts deleting your files on disk, and waits *synchronously* on all the subsystems for them to return that they've forgotten everything about the previous account. For debugging purposes, this is the right model, because when one subsystem has a bug and doesn't complete, you can see by inspecting the blocking relationship which one it is, which is trivial with regular operating introspection tools. I don't believe that writing the same code in a purely async way would be as easy to debug.

- since they provide much stronger guarantees in general.

It depends which guarantees you're talking about. I don't think this statement is true. Async work has good and strong properties when you write code in the "normal" priority ranges, what we refer as to "in the QoS world" on Darwin (from background up to UI work).

However, there are tons of snowflakes on any platform that can't be in that world:
- media rendering (video/audio)
- HID (touch, gesture recognition, keyboard, mouses, trackpads, ...)
- some use cases of networking (bluetooth is a very good example, you hate when your audio drops with your bluetooth headset don't you?)
- ...

And these use cases are many, and run in otherwise regular processes all the time.

Today, the only way you don't end up with massive priority inversions when one of these subsystems has to perform something that interacts with generic code (such as loading a user pref, do an IPC, etc...) is that they do everything synchronously, because blocking is how the kernel can resolve these inversions locally, without priority overhangs. Asynchronous propagation has a ton of overhang (the term we use to describe the fact that your current effective priority is tainted by a higher priority piece of work that happened in the past). Overhang outside of the QoS world is not an option: if you run at the priority of gesture recognition by accident and use a lot of CPU, you just forfeited the touch interface of your phone, you can see how people would be upset with this.

However it is a fact of life that these subsystems, have to interact with generic subsystems sometimes, and that mean they need to be able to synchronously wait on an actor, so that this actor's priority is elevated. And you can't waive this off, there are tons of legitimate reasons for very-high priorities subsystems to have to interact and wait on regular priority work.

I'll take one compelling example: logging, when your HID stack wants to report a runtime fault of some sort, it has to interact with the logging subsystem which is mostly utility work. If the HID stack logs all the time, this is wrong, obviously, but if it has to report a critical issue, then it's fair for this subsystem to have to log, and asyncing the log is not a good option as this breaks the stacks and many things you want for a report. However you really want once this log is done, to reset your HID stack and move on with your life.

I 100% agree with you that if *everything* was asynchronous and written this way, our lives would be great. I don't however think it's possible on real life operating system to write all your code this way. And this is exactly where things start to be *very* messy.

However, if you're not in one of these "extra difficult cases" you should use pure async. Which means that we still need to provide synchronous waiting, but explain very clearly when it's okay to use. I think e.g. that if you're running on the anonymous pool, you should not be allowed to synchronously wait on another actor, for all the examples I took (whether the iCloud Drive logout case, or the high priority people), the code that waits on actors is only running on a non anonymous context, and it's "fine" to block this context, because you made it as a developer in the first place, so you understand what you do to that unique serial resource. If you block the shared pool, then you can't possibly understand since you don't know what else runs there by definition.

···

On Sep 3, 2017, at 10:26 AM, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:
On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com <mailto:phabouzit@apple.com>> wrote:

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net <mailto:pierre@habouzit.net>> wrote:

It is definitely the family of problems I'm worried about. I want to make sure we have an holistic approach here because I think that recursive mutexes, recursive CFRunloop runs and similar ideas are flawed and dangerous. I want to make sure we understand which limitations we want to impose here, and by limitations I really mean layering/architecture rules, and that we document them upfront and explain how to work with them.

+1

I think that what I'm getting at is that we can't build Actors without defining how they interact with the operating system and what guarantees of latency and liveness they have.

Agreed, thanks Pierre!

-Chris

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Daniel Vollmer) #15

Hello,

first off, I’m following this discussion with great interest, even though my background (simulation software on HPC) has a different focus than the “usual” paradigms Swift seeks to (primarily) address.

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I’m confused, but that may just be me misunderstanding things again. I’d assume each actor has its own (serial) queue that is used to serialize its messages, so the queue above refers to the queue used to actually process the messages the actor receives, correct?

Sometimes, I’d probably make sense (or even be required to fix this to a certain queue (in the thread(-pool?) sense), but at others it may just make sense to execute the messages in-place by the sender if they don’t block so no context switch is incurred.

One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently. The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.

I think what you said made sense.

Ok, I captured this in yet-another speculative section:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency

This seems like an interesting extension (where the actor-internal serial queue is not used / bypassed).

  Daniel.

···

On 3. Sep 2017, at 19:26, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> wrote:

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net> wrote:


(Chris Lattner) #16

Hello,

first off, I’m following this discussion with great interest, even though my background (simulation software on HPC) has a different focus than the “usual” paradigms Swift seeks to (primarily) address.

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I’m confused, but that may just be me misunderstanding things again. I’d assume each actor has its own (serial) queue that is used to serialize its messages, so the queue above refers to the queue used to actually process the messages the actor receives, correct?

Right.

Sometimes, I’d probably make sense (or even be required to fix this to a certain queue (in the thread(-pool?) sense), but at others it may just make sense to execute the messages in-place by the sender if they don’t block so no context switch is incurred.

Do you mean kernel context switch? With well behaved actors, the runtime should be able to run work items from many different queues on the same kernel thread. The “queue switch cost” is designed to be very very low. The key thing is that the runtime needs to know when work on a queue gets blocked so the kernel thread can move on to servicing some other queues work.

-Chris

···

On Sep 4, 2017, at 4:19 AM, Daniel Vollmer <lists@maven.de> wrote:

On 3. Sep 2017, at 19:26, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> wrote:

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net> wrote:


(Chris Lattner) #17

My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
- this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
- we disallow Actors to be directly targeted to GCD global concurrent queues ever
- for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I think I need to read more on actors, because the same way you're not an OS runtime expert, I'm not (or rather no longer, I started down that path a lifetime ago) a language expert at all, and I feel like I need to understand your world better to try to explain this part better to you.

No worries. Actually, after thinking about it a bit, I don’t think that switching underlying queues at runtime is scary.

The important semantic invariant which must be maintained is that there is only one thread executing within an actor context at a time. Switching around underlying queues (or even having multiple actors on the same queue) shouldn’t be a problem.

OTOH, you don’t want an actor “listening” to two unrelated queues, because there is nothing to synchronize between the queues, and you could have multiple actor methods invoked at the same time: you lose the protection of a single serial queue.

The only concern I’d have with an actor switching queues at runtime is that you don’t want a race condition where an item on QueueA goes to the actor, then it switches to QueueB, then another item from QueueB runs while the actor is already doing something for QueueA.

I think what you said made sense.

Ok, I captured this in yet-another speculative section:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency

Great. BTW I agree 100% with:

That said, this is definitely a power-user feature, and we should understand, build, and get experience using the basic system before considering adding something like this.

Private concurrent queues are not a success in dispatch and cause several issues, these queues are second class citizens in GCD in terms of feature they support, and building something with concurrency *within* is hard. I would keep it as "that's where we'll go some day" but not try to attempt it until we've build the simpler (or rather less hard) purely serial case first.

Right, I agree this is not important for the short term. To clarify though, I meant to indicate that these actors would be implemented completely independently of dispatch, not that they’d build on private concurrent queues.

Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:

- the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up

- if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.

It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.

I don’t have anything intelligent to say here, but it sounds like you understand the issues well :slight_smile: I agree that giving up 5 years of progress is not appealing.

TBH our team has to explain into more depth how eventing works on Darwin, the same way we (or maybe it's just I, I don't want to disparage my colleagues here :P) need to understand Actors and what they mean better, I think the swift core team (and whoever works on concurrency) needs to be able to understand what I explained above.

Makes sense. In Swift 6 or 7 or whenever actors are a plausible release goal, a lot of smart people will need to come together to scrutinize all the details. Iteration and improvement over the course of a release cycle is also sensible.

The point of the document is to provide a long term vision of where things could go, primarily to unblock progress on async/await in the short term. For a very long time now, the objection to doing anything with async/await has been that people don’t feel that they know how and whether async/await would fit into the long term model for Swift concurrency. Indeed, the discussions around getting “context” into the async/await design are a concrete example of how considering the long term direction is important to help shape the immediate steps.

That said, we’re still a ways off from actually implementing an actor model, so we have some time to sort it out.

(2) is what scares me: the kernel has stuff to deliver, and kernels don't like to hold on data on behalf of userspace forever, because this burns wired memory. This means that this doesn't quite play nice with a global anonymous pool that can be starved at any time. Especially if we're talking XPC Connections in a daemon, where super high priority clients such as the frontmost app (or even worse, SpringBoard) can ask you questions that you'd better answer as fast as you can.

The solution we recommend to solve this to developers (1st and 3rd parties), is for all your XPC connections for your clients are rooted at the same bottom queue that represents your "communication" subsystem, so you can imagine it be the "Incoming request multiplexer" Actor or something, this one is in the 2nd category and is known to the kernel so that the kernel can instantiate the Actor itself without asking permission to userspace and directly make the execution context, and rely on the scheduler to get the priorities right.

Thanks for the explanation, I understand a lot better what you’re talking about now. To me, this sounds like the concern of a user space framework author (e.g. the folks writing Kitura), not the users of the frameworks. As such, I have no problem with making it “more difficult” to set up the right #2 abstractions.

we need to embrace it and explain to people that everywhere in a traditional POSIX world they would have used a real pthread_create()d thread to perform the work of a given subsystem, they create one such category #2 bottom queue that represents this thread (and you make this subsystem an Actor),

Makes sense. This sounds like a great opportunity for actors to push the world even farther towards sensible designs, rather than cargo culting the old threads+channels model.

Also, I think we should strongly encourage pure async “fire and forget” actor methods anyway - IOW, we should encourage push, not pull

I almost agree. We should strongly encourage the `pure async "account for, fire and forget" actor methods`. The `account for` is really backpressure, where you actually don't fire if the remote queue is full and instead rely on some kind of reactive pattern to pull from you. (but I know you wrote that on your proposal and you're aware of it).

Yep, I was trying to get across the developer mindset of “push, not pull” when it comes to decomposing problems and setting up the actor graph.

I think that - done right - the remote queue API can be done in a way where it looks like you’re writing naturally “push” code, but that the API takes care of making the right thing happen.

- since they provide much stronger guarantees in general.

It depends which guarantees you're talking about. I don't think this statement is true. Async work has good and strong properties when you write code in the "normal" priority ranges, what we refer as to "in the QoS world" on Darwin (from background up to UI work).

"stronger guarantees” is probably not the right way to express this. I’m talking about things like “if you don’t wait, it is much harder to create deadlocks”. Many problems are event-driven or streaming, which are naturally push. I can’t explain why I think this, but it seems to me that push designs encourage more functional approaches, but pull designs tend to be more imperative/stateful. The later feels familiar, but encourages the classical bugs we’re all used to :slight_smile:

However, there are tons of snowflakes on any platform that can't be in that world:
- media rendering (video/audio)
- HID (touch, gesture recognition, keyboard, mouses, trackpads, ...)
- some use cases of networking (bluetooth is a very good example, you hate when your audio drops with your bluetooth headset don't you?)
- ...

And these use cases are many, and run in otherwise regular processes all the time.

I think there is some misunderstanding here. I’m not saying that sync is bad, I’m only talking about the default abstraction and design patterns that people should reach for first.

The general design I’m shooting for here is to provide a default abstractions that work 80%+ of the time, allowing developers to have a natural first step to reach for when they build their code. However, any single abstraction will have limitations and problems in some use cases, and some of those snowflakes are SO important (media is a great example) that it isn’t acceptable to take any hit. This is why I think it is just as important to have an escape hatch. The biggest escape hatches we’ve talked about are multithreaded actors, but folks could also simply “not use actors” if they aren’t solving problems for them.

Swift aims to be pragmatic, not dogmatic. If you’re implementing a media decoder, write the thing in assembly if you want. My feelings won’t be hurt :slight_smile:

However it is a fact of life that these subsystems, have to interact with generic subsystems sometimes, and that mean they need to be able to synchronously wait on an actor, so that this actor's priority is elevated. And you can't waive this off, there are tons of legitimate reasons for very-high priorities subsystems to have to interact and wait on regular priority work.

I understand completely, which is why synchronous waiting is part of the model. Despite what I say above, I really don’t want people to avoid actors or write their code in assembly. :slight_smile:

My point about pull model is that it seems like the right *default* for people to reach for, not that it should be the only or exclusive mechanic proposed. This is one reason that I think it is important to introduce async/await before actors - so we have the right mechanic to build this waiting on top of.

I 100% agree with you that if *everything* was asynchronous and written this way, our lives would be great. I don't however think it's possible on real life operating system to write all your code this way. And this is exactly where things start to be *very* messy.

+1, again, this pragmatism is exactly why the proposal describes actor methods returning values, even though it is not part of the standard actor calculus that academia discusses:
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#extending-the-model-through-await

If you have some suggestion for how I can clarify the writing to address the apparent confusion here, please let me know and I’ll be happy to fix it.

-Chris

···

On Sep 3, 2017, at 12:44 PM, Pierre Habouzit <phabouzit@apple.com> wrote:


(Wallacy) #18

Hello,

I have a little question about the actors.

On WWDC 2012 Session 712 one of the most important tips (for me at least)
was: Improve Performance with Reader-Writer Access

Basically:
• Use concurrent subsystem queue: DISPATCH_QUEUE_CONCURRENT
• Use synchronous concurrent “reads”: dispatch_sync()
• Use asynchronous serialized “writes”: dispatch_barrier_async()

Example:

// ...
   _someManagerQueue = dispatch_queue_create("SomeManager",
DISPATCH_QUEUE_CONCURRENT);// ...

And then:

- (id) getSomeArrayItem:(NSUInteger) index {
    id importantObj = NULL;
    dispatch_sync(_someManagerQueue,^{
        id importantObj = [_importantArray objectAtIndex:index];
     });
   return importantObj;
}- (void) removeSomeArrayItem:(id) object {
     dispatch_barrier_async(_someManagerQueue,^{
         [_importantArray removeObject:object];
     });
}- (void) addSomeArrayItem:(id) object {
     dispatch_barrier_async(_someManagerQueue,^{
         [_importantArray addObject:object];
     });
}

That way you ensure that whenever you read an information (eg an array) all
the "changes" have been made ​​or are "waiting" . And every time you write
an information, your program will not be blocked waiting for the operation
to be completed.

That way, if you use several threads, none will have to wait another to get
any value unless one of them is "writing", which is right thing to do.

With this will it be composed using actors? I see a lot of discussion about
using serial queues, and I also have not seen any mechanism similar to
dispatch_barrier_async being discussed here or in other threads.

···

Em seg, 4 de set de 2017 às 08:20, Daniel Vollmer via swift-evolution < swift-evolution@swift.org> escreveu:

Hello,

first off, I’m following this discussion with great interest, even though
my background (simulation software on HPC) has a different focus than the
“usual” paradigms Swift seeks to (primarily) address.

> On 3. Sep 2017, at 19:26, Chris Lattner via swift-evolution < > swift-evolution@swift.org> wrote:
>> On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> > wrote:
>>> On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net> > wrote:
>>>
>>> Is there a specific important use case for being able to target an
actor to an existing queue? Are you looking for advanced patterns where
multiple actors (each providing disjoint mutable state) share an underlying
queue? Would this be for performance reasons, for compatibility with
existing code, or something else?
>>
>> Mostly for interaction with current designs where being on a given
bottom serial queue gives you the locking context for resources naturally
attached to it.
>
> Ok. I don’t understand the use-case well enough to know how we should
model this. For example, is it important for an actor to be able to change
its queue dynamically as it goes (something that sounds really scary to me)
or can the “queue to use” be specified at actor initialization time?

I’m confused, but that may just be me misunderstanding things again. I’d
assume each actor has its own (serial) queue that is used to serialize its
messages, so the queue above refers to the queue used to actually process
the messages the actor receives, correct?

Sometimes, I’d probably make sense (or even be required to fix this to a
certain queue (in the thread(-pool?) sense), but at others it may just make
sense to execute the messages in-place by the sender if they don’t block so
no context switch is incurred.

> One plausible way to model this is to say that it is a “multithreaded
actor” of some sort, where the innards of the actor allow arbitrary number
of client threads to call into it concurrently. The onus would be on the
implementor of the NIC or database to implement the proper synchronization
on the mutable state within the actor.
>>
>> I think what you said made sense.
>
> Ok, I captured this in yet-another speculative section:
>
https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency

This seems like an interesting extension (where the actor-internal serial
queue is not used / bypassed).

        Daniel.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Jean-Daniel) #19

My understanding is that a kernel thread can’t move on servicing a different queue while a block is executing on it. The runtime already know when a queue is blocked, and the only way it has to mitigate the problem is to spawn an other kernel thread to server the other queues. This is what cause the kernel thread explosion.

···

Le 4 sept. 2017 à 17:54, Chris Lattner via swift-evolution <swift-evolution@swift.org> a écrit :

On Sep 4, 2017, at 4:19 AM, Daniel Vollmer <lists@maven.de <mailto:lists@maven.de>> wrote:

Hello,

first off, I’m following this discussion with great interest, even though my background (simulation software on HPC) has a different focus than the “usual” paradigms Swift seeks to (primarily) address.

On 3. Sep 2017, at 19:26, Chris Lattner via swift-evolution <swift-evolution@swift.org> wrote:

On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit@apple.com> wrote:

On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre@habouzit.net> wrote:

Is there a specific important use case for being able to target an actor to an existing queue? Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?

Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.

Ok. I don’t understand the use-case well enough to know how we should model this. For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I’m confused, but that may just be me misunderstanding things again. I’d assume each actor has its own (serial) queue that is used to serialize its messages, so the queue above refers to the queue used to actually process the messages the actor receives, correct?

Right.

Sometimes, I’d probably make sense (or even be required to fix this to a certain queue (in the thread(-pool?) sense), but at others it may just make sense to execute the messages in-place by the sender if they don’t block so no context switch is incurred.

Do you mean kernel context switch? With well behaved actors, the runtime should be able to run work items from many different queues on the same kernel thread. The “queue switch cost” is designed to be very very low. The key thing is that the runtime needs to know when work on a queue gets blocked so the kernel thread can move on to servicing some other queues work.


(Chris Lattner) #20

I’m not sure what you mean by “executing on it”. A work item that currently has a kernel thread can be doing one of two things: “executing work” (like number crunching) or “being blocked in the kernel on something that GCD doesn’t know about”.

However, the whole point is that work items shouldn’t do this: as you say it causes thread explosions. It is better for them to yield control back to GCD, which allows GCD to use the kernel thread for other queues, even though the original *queue* is blocked.

-Chris

···

On Sep 4, 2017, at 9:05 AM, Jean-Daniel <mailing@xenonium.com> wrote:

Sometimes, I’d probably make sense (or even be required to fix this to a certain queue (in the thread(-pool?) sense), but at others it may just make sense to execute the messages in-place by the sender if they don’t block so no context switch is incurred.

Do you mean kernel context switch? With well behaved actors, the runtime should be able to run work items from many different queues on the same kernel thread. The “queue switch cost” is designed to be very very low. The key thing is that the runtime needs to know when work on a queue gets blocked so the kernel thread can move on to servicing some other queues work.

My understanding is that a kernel thread can’t move on servicing a different queue while a block is executing on it. The runtime already know when a queue is blocked, and the only way it has to mitigate the problem is to spawn an other kernel thread to server the other queues. This is what cause the kernel thread explosion.