Preventing Data Races in the Swift Concurrency Model

Hi all,

I've made a couple references to a more comprehensive view of preventing data races that covers closures, global variables, and such. Since the actual features and language restrictions needed to prevent data races got a bit scattered across the various proposals, I wrote up a document on Preventing Data Races in the Swift Concurrency Model that talks through the approach.

At a very high level, within the set of Swift types that are "shareable":

  • let stored properties are safe
  • Global and static stored properties need to be protected by global actors
  • Actor instance properties are protected by the actor class itself
  • Local var stored properties rely on analysis of captures

The "shareable" notion is how I've been referring to types for which a copy of a value and the original value can be concurrently accessed without breaking things. It is roughly the same notion as the ActorSendable protocol from the protocol-based actor isolation pitch.

Questions and comments would be greatly appreciated. I suspect there is much to do to make the ideas in this document clearer, but it's a start.

Doug

15 Likes

“ A local function is always assumed to execute concurrently, because they are always assumed to escape.”

I did not know this. I would love to understand more about the notion of escapability of closures/local functions / stored closures/. Specially since there are other case like optional closures where they are implicitly escaping.

Is a move-only type shareable? By the definition in the document I say yes but the ownership manifesto defines shared differently. (Naming is hard)

On the same note, An inout of shareable instance itself shareable? (I don’t fully understand inout in this new concurrent model).

In phase one: will all new async code and actors be subject to phase two data race checks? If so does this include the escape analysis mentioned?

Thank you for the great write up! Cleared up lots of thoughts from me.

Are state races also in scope? I’m thinking about the kind of state races that are introduce with actor reentrancy. If so then it would be great if actors had the ability to distinguish between mutating (read-write) and non-mutating (read-only) methods like we do for structs/enums.

This would allow us to default to reentrancy in non-mutating methods and default to non-reentrancy in mutating methods. This would compose nicely if all stored properties in an actor are enforced to be shareable.

Any type that provides unique ownership semantics, like move-only types do, would be considered "shareable" for the purposes of these checks, because the uniqueness checking would separately ensure that they aren't, in fact, shared.

inout isn't a type in the type system, so we cannot really ask whether they are shareably. There are some restrictions on inout arguments described in the actors proposal that I should probably pull in here and generalize. (Thanks!)

Yes, that's the intent.

We certainly do need to figure out the re-entrancy issue. I hadn't thought about separating out mutating from non-mutating for actors... I think that you're going to need to write up some use cases to justify it. My first reaction is that it doesn't actually help, because non-mutating methods are just as likely to be surprised if they get suspended and the underlying data changes.

Doug

1 Like

Thanks for writing this up @Douglas_Gregor! I've got a question regarding

To completely eliminate data races within the safe subset of Swift, we need to enforce a few rules on a Swift program:

  • [...]
  • Captured local var stored properties are not accessed from any function or closure that may concurrently execute with their declaring context

You note that

A lot of existing Swift code would be ill-formed under the rules above, even if it has been carefully created to be free of data races, because the tools to communicate the concurrency behavior don’t exist yet.

Right, that would make a lot of code "ill-formed" and so I'd need to understand a little better if the new tools provided will really be enough. If I take SwiftNIO as an example, async/await really fits really well: SwiftNIO is an non-blocking networking framework and therefore to the user of a library that's written with SwiftNIO, almost everything appears asynchronous, and therefore most APIs today return futures which inform the user about the result of operations. The migration to async/await will be wonderful for library developers and users because they can change all the func someOperationInvolvingNetworking() -> EventLoopFuture<Result> to func someOperationInvolvingNetworking() async throws -> Result.

So essentially, this means in the public APIs of libraries that are implemented using SwiftNIO, I'd expect that over time (almost?) all API that return futures today will be expressed in a much nicer way with async/await. The extra benefit is that less of SwiftNIO "leaks out" into the API of a library that just happens to be implemented with SwiftNIO.

On the inside of a library that's implemented using SwiftNIO, things are looking quite different. A networking library usually implements one (or more) networking protocols which are usually built in layer or (sub-)protocols. So for example https://... is really (say) HTTP/1.1 over TLS over TCP/IP and in SwiftNIO, you'd represent each sub-protocol as a ChannelHandler. This is a little simplified by the gist is: You'd implement a network protocol as a series of ChannelHandlers, some of which are provided in the SwiftNIO repository, some in other repos, and some of which you may write yourself.

The ChannelHandler API itself however is synchronous which is quite important for performance reasons and also makes a lot of sense in the implementation. Many ChannelHandlers are just transforming data. So if we want to send a HTTP request to a server, we would start with a Swift data type that represents the HTTP request. That then gets transformed into bytes by a ChannelHandler (into GET / HTTP/1.1\r\n\r\n....) and immediately gets handed on to the next ChannelHandler which may be the TLS handler which takes the plain text bytes and transforms them into encrypted bytes. And after that these bytes get sent over the network.

So for the most part, with regards to ChannelHandlers, async/await isn't actually in the picture because the ChannelHandler API is synchronous. The transformation of events that is done through this series of ChannelHandlers (which we call the ChannelPipeline) however either originates from or leads to events on the actual network. So the overall operation that's being implemented is of course asynchronous and if they need to, the ChannelHandlers can be informed about (or even influence) the result of the overall operation.

For example, if you wanted to account for all the bytes sent and received over a certain network connection, in SwiftNIO you would place a handler right up front in the ChannelPipeline that wouldn't do any transformations but would just count the bytes that come from and go to the network. Counting the incoming bytes is easy, they have arrived by the time you get told about them but the outgoing side is a little harder: You don't want to count immediately because the network connection may break before they've actually left to the kernel. So you want to count them once they've been actually sent. Today in SwiftNIO, this isn't very hard because through the promises/futures that get passed through the ChannelPipeline you can arrange to be told when the actual write has happened.

That all sounds benign but to the Swift compiler this would look like a "Captured local var stored properties are not accessed from any function or closure that may concurrently execute with their declaring context". Despite looking like it may execute concurrently with the declaring context one at the first glance, this is actually not the case. And they don't lead to data races in SwiftNIO for the following reasons:

  • Each network connection (Channel) is bound to exactly one EventLoop and it never changes its associated EventLoop. One EventLoop usually owns one thread (in NIOTransportServices it owns a DispatchQueue instead).
  • All futures/promises that are associated with Channel/ChannelPipeline/ChannelHandler operations always run their callbacks on the EventLoop associated with the Channel.

In a usual SwiftNIO program, there will be a small and constant number of thread, each of which runs one EventLoop. The Channels (think network connection) are bound to one EventLoop by assigning one in round-robin fashion when the Channel first gets created.

Here's some example code of a handler that would simply pass through the data but count the bytes both ways and print everything once the channel goes inactive (ie. closes).

class CountBytesHandler: ChannelDuplexHandler {
    // We transform from bytes to bytes in both directions
    typealias InboundIn = ByteBuffer
    typealias InboundOut = ByteBuffer
    typealias OutboundIn = ByteBuffer
    typealias OutboundOut = ByteBuffer

    // we start off with 0 bytes each way
    private var bytesRead = 0
    private var bytesSent = 0

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        // [Always called on `context.channel.eventLoop` like all other ChannelHandler methods]
        // When we get told that bytes have been read of the channel, we look at the number of bytes, add them up, and ...
        self.bytesRead += self.unwrapInboundIn(data).readableBytes
        // ... forward them as is to the next ChannelHandler.
        context.fireChannelRead(data)
    }

    func write(context: ChannelHandlerContext, data: NIOAny, promise: EventLoopPromise<Void>?) {
        // [Always called on `context.channel.eventLoop` like all other ChannelHandler methods]
        // When we get told that a write has to happen, we ask the next handler to do the write (`context.write(data)`), ...
        context.write(data).map {
            // [this callback will also always be called on context.channel.eventLoop because inside our Channel, all operations yield futures that are always bound to the Channel's EventLoop]
            // ... and once the write has succeeded successfully, we add up the bytes
            self.bytesSent += self.unwrapOutboundIn(data).readableBytes
            // [^^^^^^^^^^]
        }.cascade(to: promise) // and forward the result into the result of the overall operation.
    }

    func channelInactive(context: ChannelHandlerContext) {
        // [Always called on `context.channel.eventLoop` like all other ChannelHandler methods]
        print("Channel \(self) received \(self.bytesRead) bytes and sent \(self.bytesSent)")
    }
}

So for SwitNIO one option that would definitely work really well is to have two sides:

  • The API of a library implemented with NIO should over time exclusively use the async/await features which makes everything feel consistent and has a great developer UX.
  • The implementation of the networking protocols in SwiftNIO would use the tried and tested ChannelHandler API (which is synchronous) and can reach out to the promises/futures if they need to, just like today.

So making essentially the "outside" of any NIO program support async/await is actually rather straightforward. This should all work but from what I understand, code like the following (from the above example) would fall foul of your proposed prevention of data races in the concurrency model:

context.write(data).map {
    self.bytesSent += self.unwrapOutboundIn(data).readableBytes
    // ^^^^^^^^^^^^^^
}

The closures passed to the combinators of the futures such as map are escaping because the write will likely not complete on the same stack that has triggered the write.

Are there any tools I may have missed in the concurrency model that would allow us to communicate with the compiler that this safe programming model is actually safe?

Sorry about this wall of text. I wanted to get it more concise but then I thought a real-world example which is an important use case for us is probably better. Any questions, please ask!

8 Likes

Any actor from which you read often and write rarely would benefit from a "read-write" system where reads are concurrent and writes are serialized. For instance: a notification center, a user defaults database, a route dispatching table in a HTTP server, or pretty much any kind of shared cache.

You still have the reentrancy issue to deal with, but to me it looks pretty much orthogonal to this.

Phase 2 introduces the full set of restrictions everywhere: global and static variables must be annotated, all captured locals are subject to data-race checks, and so on. This phase must be associated with a major language version bump, because it will break a significant amount of existing code.

I have a couple questions about phase 2.

  1. How do you handle global variables from imported C libraries? Are they all imported as @actorIndependent?
  2. Will there be a way to check access for non-actor classes? Each class instance probably should belong to an actor instance, right? Will there be a runtime check in debug mode, like exclusivity checks?
  3. Is there a way to test those restrictions on a smaller scale earlier, for instance by adding an attribute to a function to enable the stricter mode? That'd make it easier to test if your code is compatible with Phase 2 in advance.

It's possible there's no answer to this yet.

From the callee's perspective, the exclusivity requirement ought to mean we can consider the parameter to be "move-in, move-out", so as long as a let binding containing the initial value going in and the final value going out are both sharable, the inout should also be. So I think the inout-specific restrictions would primarily concern what arguments the caller can pass down without violating the exclusivity invariant; in particular, as Doug has already noted in other threads, inout async functions can't accept actor state as an argument (if actors remain reentrant), because other tasks running on the same actor may try to access the state at the same time while the callee is suspended.

4 Likes

A few comments:

  • Rather than talk about lets specifically, I think the properties you ascribe need to be discussed as applying to values more generally, since they also apply to temporaries, arguments, returns, and such. They only apply to lets insomuch as a let binding only ever has one value once it's been bound.
  • You say that there are "several general classes of such types", but I think that's an overcomplication. There's one specific property we're looking for in a value to say that it's sharable—does it not contain references to unsynchronized shared mutable state, transitively? The "general classes" you list afterwards can then be thought of as specific common cases of avoiding unsynchronized shared mutable state—immutable types do so by not having mutable state at all; value types check for uniqueness before mutating, or copy objects when they're shared; and synchronized objects guarantee their own race safety by other means.
  • There's also a choice of where to enforce the "no transitive references to unsynchronized mutable state" property. In its full generality, it can be seen as a property of individual values, at a finer granularity than their types. For instance, the one unique reference to any object is theoretically safe to share by moving between tasks, and even sharing multiple references from multiple tasks could be safe if the tasks all agree not to mutate the object. Other languages like Pony use reference capabilities as a way of enforcing these rules at the granularity of individual references. We're making an explicit choice here to instead enforce the property at data type granularity instead.
  • As a type trait, I'm not sure it makes sense to refer to ActorSendable/Shar[e]able as a protocol—it has no formal requirements, and you can't really retroactively conform a type to it, and we wouldn't want becoming sharable to induce a witness table or other ABI changes that normal protocol conformances do. It's more akin to AnyObject, or some of the other type traits like Trivial we'd like to expose in the future—although you can use it as a generic constraint very much like a protocol, the way types "witness" the requirement is through their inherent properties, not by providing any method implementations.
2 Likes

What is this Trivial type - and is it something that has been discussed in these forums?

Trivial would refer to types that don't require anything other than memcpy to make copies, like the POD or trivial concept from C++.

2 Likes

For what it's worth, NSUserDefaults used to use a reader-writer lock to protect its state, but I switched it to fine-grained regular mutexes later. Reader-writer locks aren't without their uses, but they have far far more downsides than most people would expect. In particular:

  • Doing priority donation from high priority pending writers to low priority readers requires tracking O(Readers) state, which few if any implementations are willing to do
  • They degrade to mostly serial alarmingly quickly when faced with even relatively low numbers of writers (because if there's a pending writer, it also blocks all future reader-reader concurrency until it can go)
  • Upgrading readers to writers is fraught with peril for reasons I don't recall off the top of my head

I'm not going to go as far as saying we should exclude them from consideration, but they aren't the slam dunk they appear to be.

2 Likes

Perhaps it was a bad example. I'm not too surprised read-write locks aren't a big gain for user defaults now that I think about it. I guess all you have to do under the lock for a read operation is a dictionary lookup, copying the value pointer, and then incrementing its reference count. All in all a pretty quick operation unlikely to create much contention.

The catch is that if two readers are waiting for an upgrade, one of the two has to give in otherwise it's a deadlock. I'd advise against doing upgrades in general: either take a writer lock in advance if you might need one or cancel what you were doing with the read lock and do the whole thing anew with a writer lock when you end up needing to write. Upgrading is basically the later but with a fast path in case the upgrade succeeds.

You indeed need to have a very high reader to writer ratio for this to be worth it. And those readers also need to access the lock simultaneously often enough that it'd create contention otherwise.

Right. I might be a bit overenthousiatic about read-write.

And I'm having this silly thought right now: you don't necessarily have to decide about the type of lock in advance. You could have an executor pick at runtime to the most appropriate strategy based on contention statistics. Assuming the compiler can tag those partial tasks well enough as reads or writes.

1 Like

This would definitely be a neat trick :slight_smile:

Yes, I think you're right about that. I end up talking about parameters/results/variables/etc. when it's really about the values. Then let vs. var is about whether the value at that memory location can change.

Yes, that's a good generalization!

Right, and maybe-some-time-in-the-future we get some kinds of types that are uniquely-referenced so we know they aren't ever actually shared.

I brought this up over in the protocol-based actor isolation pitch. I think we want a special kind of marker protocol that has no runtime footprint, and have an idea of how it might work.

Doug

1 Like

Beyond being a neat trick, I wonder if there is a genuine performance need to vary locking characteristics at runtime. On machines with low core counts, like iOS devices, it’s hard to beat a fast unfair lock as a default locking primitive (IOW don’t bet against os_unfair_lock). On big iron that often has more cores than threads you may need some fairness at least some of the time to avoid starvation.

1 Like

Yeah that’s very likely. Scaling Swift to very wide machines like that is going to present a whole new world of optimization requirements in general (eg refcounting becomes much trickier to do quickly as parallelism increases).

1 Like

The important part is you can pick a strategy easily without having to rewrite your code. If the compiler can determine what is a write and what is a read, then changing strategy can be either a runtime switch (with some overhead) or a one-liner somewhere in the code.

My first idea was actually to have the executor produce statistics and offer recommendations about the strategy to use. Perhaps this could also be part of profile guided optimization.

I'm sure the iPhone and the Mac Pro run a lot of the same code. If you know the target hardware and can pick the right strategy, that's great. But often you don't know the target hardware in advance.

I think most of the code that is worth tuning for the runtime architecture is code from the standard library and the runtime (AFAIK, ref counting is performed by runtime functions, not by code generated by the compiler). As long as you can optimise this code without breaking the ABI, you should be able to run the same apps on iOS and on a Mac Pro without too much overhead.

1 Like

Catching up after the holidays:

Awesome, I'm super thrilled to see more focus being put on making the concurrency model safer out of the box for Swift programs. Defining away major classes of concurrency problems is going to be a huge boon.

I think that this is a great step forward, I have several strong +1s and several strong concerns, including those that are explained in the Actor Isolation for Global State document that I wrote back in November. I realize that the discussion is spread across many different threads and places, so I'll try to summarize my position and the concerns I have in one place here.


Naming bikeshed for ActorSendable: +1

The protocol-based actor isolation pitch proposes the use of a protocol ( ActorSendable ) to describe types that are safe to transfer across actor boundaries. It is a promising model; we're using the term "shareable" here because the problem isn't limited to crossing actor boundaries.

I'm really happy to see embracing something like the ActorSendable idea. I agree with your premise that we should find a better name than ActorSendable for this concept, given that it should be applies to structured concurrency as well. I'm not sure that Sharable is descriptive enough though. Would you be open to a longer name like ConcurrentlySharable or something else to scope this and make it more obvious that this is a specific kind of sharing?


Actor properties being implicitly async from outside: +1

I just wanted to +1 and send love to:

Actors... protects its stored instance properties. An actor class is a synchronized type , and it maintains that synchronization by serializing access to its stored instance properties: any use of an actor instance member must either be on self or must be performed asynchronously.

It makes perfect sense for references to actor properties to be subject to access control, and if they are cross-actor, they are made implicitly async.

The non- self accesses will be scheduled using a serial executor, which ensures that only one thread will execute code on a given actor instance at any one time. Therefore, the model prevents data races on actor instance properties.

Just to check my understanding, this just means that it is scheduled on the same executor queue thingy that the actor is using, ensuring the accesses are safe right?


Non-shareable types + Closures: +1

Agreed on this section, just one nit pick:

In general, the problem here is one of escape analysis : we need to determine whether a given non-shareable value can "escape" out of the current context and into a context where they might be used concurrently. The suggestions above likely to not account for all of the ways in which a value of non-shareable type can escape.

This isn't really about escape analysis. This is just applying the ActorSendable type restrictions to the capture list of the closure -- the compiler already forms the capture list, so it isn't a new sort of escape analysis. As mentioned in my comments below, you only apply these restrictions to concurrent closures, so compatibility with existing code shouldn't be affected.


Massive source breakage: -100

The proposal has several aspects that make it massively source breaking, and alternate designs have been proposed that don't do that. The biggest problem is the requirement that all global variables (and static stored properties) be marked with a global actor. This requires annotating existing code (breaking any of them) and then requires any use of them to be async or ascribed to the same global actor - breaking all the clients. The detailed problems with this proposal are explained in this section of a previous whitepaper, quoting from it:

Beyond being source incompatible, changing a global to be ascribed to a global actor won't be an ABI compatible move for a package or framework.

  • All global and static var stored properties must be protected by a global actor

Again, this is not the only model, there are multiple alternate models in the main proposal and in the alternatives considered section.

A lot of existing Swift code would be ill-formed under the rules above, even if it has been carefully created to be free of data races, because the tools to communicate the concurrency behavior don’t exist yet. The Swift Concurrency effort introduces those tools, but we can’t refactor the world overnight.

I remain very concerned about the idea of a two phase rollout here. All of the carrot is in the first stage, and the stick is in the second stage. People will adopt the first and never adopt the second, and the world will have the idea that Swift's concurrency story is memory unsafe.

Beyond the phasing issues, there is no reason for the source compatibility in the first place - we should just roll out a safe and mostly source compatible model in one round.


Ascribing special behavior to let declarations: -1

The writeup above ascribes special behavior to let declarations that var declarations don't have, making them implicitly sharable in situations where vars are not. This is problematic for several reasons:

  1. They break our API evolution and resilience model: it is important to Swift evolution that we can turn stored properties into computed properties (and visa versa) without affecting API. However, Swift doesn’t have computed let properties. This is a major reason why Swift doesn't have other "let-only" behavior in its design - we want everything to tie nicely into the resilience model.
  2. As described in your writeup, the actual model is that "only certain kinds of lets" are safe to share because of the ActorSendable story. This leads to a model that is difficult to explain.
  3. The mainline actor proposal (at least the last draft I read) proposed allowing let declarations be accessed across actor boundaries without doing an await. While this can be convenient for some (very) narrow case of "initialization only data", this breaks the simple model of "all communication with an actor is async", and isn't very useful.

In short, I'm not sure why initialization-only data is worth special casing like this: it's not very expressive to only allow cross-actor/task sharing of init-only data, and there are alternate (much more general ways) to allow actors to provide sync requirements, including vars and function requirements. I can elaborate if you'd like.

If your concern is about local variables + structured concurrency, then the right solution (IMO) is for concurrent closures to default to capturing values by-copy instead of by-reference.


Escaping closures being implicit concurrent: -1

I'm very concerned about this:

We propose a specific heuristic to determine whether a given closure will execute concurrently, which is based on an observation: To call a closure on another thread, one must escape the value of the closure to another thread. ... Therefore, we consider a closure to execute concurrently if it is an escaping closure.

While it is true that only escaping closures can introduce unstructured concurrency (and thus affect captured values) it isn't true that all escaping closures are concurrent. This proposal is similar in my mind to conflating async with error handling, two orthogonal things that happen to go together often.

I see the situation either as:

  1. a tower, where you have non-escaping, then escaping, then escaping+concurrent closures, or
  2. a matrix where escapingness is orthogonal to concurrency, allowing "non-escaping concurrent closures" like you'd want in a parallel map.

This is why I think we should have an explicit @concurrent attribute to model this behavior, either as the top of the tower, or the orthogonal axis from @escaping.

Rationale: the proposed design is going to be significantly source breaking if you start imposing new ActorSendable restrictions on captured values in existing escaping closures. Using a new attribute only imposes those restrictions on new code. Furthermore, concurrent closures really don't want by-ref captures, they want by-value captures by default. This is a pretty huge behavior change, which should be opt-in.

A local function is always assumed to execute concurrently, because they are always assumed to escape.

Why?


Global actor markers on global functions

A global or static stored property annotated with a global actor is protected by the global actor, so it will always be accessed from only a single thread at a time.

Ok, makes sense, but I wasn't aware that you were advocating that global functions could also be marked as being ascribed to a global actor:

This is a major part of the "global actor" design that I haven't seen explained fully. While such a model composes, I'm very concerned about two things (on top of the compatibility/progressive disclosure problems described above):

  1. Attribute explosion: I'm very concerned that this will cause an explosion of attributes on a lot of declarations, similar to "const propagation" problems in C++.
  2. Type compatibility: "main actor global functions" can't be safely compatible with their apparent function type. The closest safe type for bumpGlobalCounter above is () async -> () or @sharable () -> (), but not () -> (). This incompatibility will restrict adoption of global actor annotations, since existing higher order functions won't be async aware. This exacterbates the source compatibility issues.

While there is a space for "truly global" state, it should be rare in practice.


Playgrounds and Scripts and progressive disclosure of complexity: -1

The proposed design for global annotation marking breaks all playgrounds and scripts (which pervasively use globals). While you could default all globals and functions in a script/playground to a global actor, such a change will be surprising and incompatible. It will be incompatible because the type of a function/var on a global actor is different than normal functions, so you won't be able to pass them to higher order functions defined in existing packages/frameworks.


Overall, this is a very promising step forward in the concurrency design. I'm really thrilled to see it, but I still think is more progress needed.

-Chris

10 Likes

I have the same feeling about this. We bring in lots of new attributes to already explosive attrs list to swift.

I suggest using

  • actor(main) => @mainActor
  • actor(global) => @globalActor
  • actor(local) <==> actor
  • actor(<name>) => @someGlobalActor

local/global/main are reserved built in keyword for actor scenarios.

1 Like