Proposal for Swift Actors and high-performance concurrency futures

Alright, so if you are already on the actor's queue, the second step is to not use the queue's async primitive; that should be reserved for calls arriving from outside the actor. Within the actor, the allocator can dodge the global "need to take a lock or go atomic" ARC machinery and use a much higher-performance allocator, similar to stack allocation but backed by an actor-local heap.

Can you explain the allocator in more detail? I'm not sure what you're comparing and contrasting. Thanks!

Allocation is expensive when it's handing out memory from a global heap, because it has to acquire a "lock", and on modern multicore systems that's especially expensive. The idea is for the Swift core team to give us the ability to declare an operation object (an Actor would be one example, but there might be others) where objects referenced only inside that scope could use local allocation tied to that operation. There would then be a way to promote a few things to the default global space, or to another space. ARC is really expensive because of its entanglement with global allocation; the vast spraying of retain/release across your code automatically is, in itself, not particularly expensive.
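To make that concrete, here is a rough sketch of the kind of operation-local allocator I mean. Nothing here is an existing Swift API; it's just the shape of the idea: one trip to the global allocator up front, then lock-free pointer bumping inside the operation.

// Hypothetical operation-local heap: one global allocation up front, then plain
// pointer bumping with no locks or atomics, because only the owning
// actor/operation ever touches it.
final class LocalHeap {
    private let base: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0

    init(capacity: Int) {
        self.capacity = capacity
        // The only trip to the global allocator.
        self.base = UnsafeMutableRawPointer.allocate(byteCount: capacity, alignment: 16)
    }

    deinit { base.deallocate() }

    /// Bump-allocate raw storage for one value of `type`. No lock, no atomics.
    func allocate<T>(_ type: T.Type) -> UnsafeMutablePointer<T>? {
        let alignment = MemoryLayout<T>.alignment
        let aligned = (offset + alignment - 1) & ~(alignment - 1)
        guard aligned + MemoryLayout<T>.stride <= capacity else { return nil }
        offset = aligned + MemoryLayout<T>.stride
        return (base + aligned).bindMemory(to: T.self, capacity: 1)
    }

    /// Forget everything at the end of the operation; the buffer is reused next time.
    func reset() { offset = 0 }
}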

1 Like

We believe putting a second, invisible pointer in the ABI, giving access to that operation, would let Swift do great things automatically here, but you could do it with API too. Without the core team giving us more flexibility around ARC, though, we have to resort to "Unsafe", which blows a hole through the whole ARC metaphor. The question to ask is: what level of performance makes it worth doing so? I've worked with so many people who think it couldn't possibly be more than a 2-3x difference. But what if it's 10x or more faster without ARC? 20x? Performance wins like that are equivalent to 3-5 generations of iPhone hardware. Apple spends billions of dollars to create new chips. Why not spend a bit more energy making the language most Apple software is written in these days faster?

1 Like

Would it be possible to group multiple subsequent paragraphs from a single author in a single post here for easier readability? Each paragraph posted separately at irregular intervals also sends a lot of notifications to people who are subscribed to this thread. Thanks!

16 Likes

Ok, so it seems you're trying to avoid ARC on the critical path, and the "Unsafe" you're resorting to is the Unsafe*Pointer family of APIs. Is that correct?

I'm trying to get a correct picture, since I first thought you were using something even less formal, like manually editing the value witness table, and wanted to formalize/Swiftify it.

1 Like

Yeah, correct. Unsafe lets us allocate stuff ourselves, and we use unowned as well, so we don't get retain/release traffic. So we set up a heap and use Unsafe to grab stuff from it, and when the operation is over, the heap is no longer needed and can be marked as unused and reused next time, or just tossed.

Amortizing even a big heap alloc across a big operation that itself runs for 30 ms is "free enough". The key is to not go in and out of locks for every sub-operation within that big operation.
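Assuming a heap along the lines of the LocalHeap sketch above, the per-operation lifecycle looks roughly like this: the one big allocation is paid once and reused, and every sub-allocation inside the 30 ms operation is just a pointer bump.

// Hypothetical usage: the heap's one big allocation is amortized across an
// operation that runs for tens of milliseconds; the fine-grained sub-allocations
// inside it never touch a lock.
let heap = LocalHeap(capacity: 1 << 20)         // created once, reused per operation

func runOperation() {
    defer { heap.reset() }                      // buffer kept around for the next operation
    for _ in 0..<10_000 {
        if let slot = heap.allocate(Double.self) {
            slot.initialize(to: 0)              // sub-allocation: just a pointer bump
        }
    }
}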

So we are asking the Swift core team to make it clean, at the API level, to make an "Operation" class, with Actors being one kind of Operation, where it has a lockless subheap and a queue associated with it, plus API sugar so you can simply allocate from that heap. This would reduce allocation cost a ton. We think it should perhaps be a second invisible pointer in the ABI, alongside self, so that it's easy to get hold of.

We feel like our job has been to extract maximum performance from Swift through any means necessary, and then say, "hey, we got this perf; is there a way to take the idiom into the fancy part of the language, so everyone can use it?" The operation-tied lockless subheap is the most important of those in terms of giving everyone a lot more perf for compute-bound work. There's more around actors and pipelining big ops and such, but the Swift community seems fragile to discussion, so I figure we focus on the memory stuff first.
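For the shape of it, here is roughly what we duct-tape together today. OperationContext and its members are made-up names, not a proposed official API.

import Dispatch

// Hypothetical "Operation" context: a serial queue plus a private, lockless heap.
// An Actor would be one kind of this; anything enqueued on `queue` may use `heap`
// without locking, because only this queue ever touches it.
final class OperationContext {
    let queue: DispatchQueue
    let heap: LocalHeap

    init(label: String, heapCapacity: Int) {
        self.queue = DispatchQueue(label: label)      // serial: one task at a time
        self.heap = LocalHeap(capacity: heapCapacity)
    }

    /// External entry point: hop onto the operation's queue.
    func perform(_ work: @escaping (LocalHeap) -> Void) {
        let heap = self.heap
        queue.async {
            work(heap)        // inside, allocate from `heap` with no locks
            heap.reset()
        }
    }
}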

The idea of being able to restrict a class type so that it can't escape a particular thread/actor is definitely something we're interested in, and it's part of our current long-term vision on concurrency. I'm not sure I understand what you're doing with actors well enough to say anything about whether/how it fits in with what we're thinking.

4 Likes

Paraphrasing to be sure I understand:

  1. When you create an Operation/Actor (an object of that class type), you also get a buffer of memory on the heap that belongs to the object. (I guess if the object is on the stack you could keep the memory inside the object, on the stack.) It would be fixed-size and created when the object is instantiated.

  2. The object has some operations that allocate and initialize Swift objects (value types, reference types, etc.) inside that buffer and manually retain and release them (?), or just create them once and never deallocate them until the Operation/Actor deallocates the buffer and they go away.

  3. The contents of this buffer would not require any locking because nothing but the enclosing object would have access to that buffer of memory (unless it leaked it.) Any operation that the Actor does would be single-threaded on the things in the buffer.

Does that match up with what you're describing?

A couple of questions based on what I said above, which might be wrong if I misunderstood:

  1. This sort of implies that you have a new flavor of Swift object that lives in the Actor's special heap. Should those work like regular Swift objects once they're allocated? Or are you allocating things like arrays of doubles in the local storage that you can access with one of the Unsafe*Pointer types?

  2. If you have normal Swift objects in the special heap, and you're bypassing ARC, do you still manually retain and release the objects? Or do you just set the retain count to 1 when you init the object in the special heap and then never touch it again, so the instance stays alive as long as its Actor stays alive? Is that right?

  3. I'm writing the above assuming something like the Objective-C flow, where there is an alloc that gets memory and then an init that fills in the stuff to make the memory into a real object. In this model you'd replace the alloc with something that gets memory from the special heap and also bypasses ARC, then init would fill in the pieces. Swift makes some of that implicit, and I don't think the ABI exactly guarantees it, but I could be wrong.

  4. Is it accurate to say that you see allocations as being slow because ARC is enforcing some unnecessary locking? I think this is really what I'm confused about--if ARC, or the allocator, is causing some problem, it would make sense to look at those subsystems and see if there is some problem that can be fixed. If ARC or the allocator is not central, and this is more about creating a fast-path lock-free allocation system, then talking about ARC is distracting.

  5. Would this look like an annotation like @Action(1024K) on the type, which would then give you 1024K of fast-path storage and something like a fast_allocate<T>() -> T? that would do the equivalent of T.init on memory inside the buffer, if it's available? I'm trying to make this concrete.

I hope that makes sense, I'm trying to understand what this is like concretely. I hope I did not misrepresent anything. Thanks!

Great response! I would see this as looking like normal ARC, but the allocator associated with the Actor would just point ARC at its subheap, and not require locking as a result. Let's say in debug builds it does lock, and it has a heapID associated with the alloc, so if the object is sent off to a land where it's not supposed to be and gets accessed, it explodes in a huge fireball.
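The debug check could be as dumb as this (hypothetical names, just to illustrate the heapID idea):

// Hypothetical debug-only safety net: each allocation records the ID of the heap
// it came from; touching it from the wrong heap traps immediately.
struct ZonedBox<T> {
    let heapID: UInt32
    var value: T

    func access(from currentHeapID: UInt32) -> T {
        #if DEBUG
        precondition(heapID == currentHeapID,
                     "object from heap \(heapID) accessed from heap \(currentHeapID)")
        #endif
        return value
    }
}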

The idea here is to allow the API to let us express, with an Actor, that we have local memory and local access, and don't need global allocation or locking. It's not that there is anything wrong with the current global allocator; global allocation and global retain/release are just EXPENSIVE. The idea is simply to not use the global machinery when we can hint to the memory subsystem that we don't need it.

Because of McCall's awesome push in this area, we have a real chance to get something great here. To be clear: Brighten has synthesized the performance of this with a duct-tape actor + heap API that is manual but gives us the "same" performance. We are not claiming a super pretty, sugared-up language idiom; that's best left to the people on the core team, and honestly, we are tackling voice AI, a valid set of engineering challenges on its own :-).

Agreed on your alloc/init point; that's actually how we recycle stuff when we have internal pools. For some of our really high-frequency allocations, we have big piles of objects set up to be rapidly reused. We re-init previously allocated objects when we recycle them.
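The recycling pattern looks roughly like this; Pool and its reinit hook are made-up names for illustration. Confined to one operation/actor, a pool like this needs no locking.

// Hypothetical high-frequency object pool: pre-built instances are handed out,
// re-initialized in place, and pushed back instead of being freed.
final class Pool<T> {
    private var free: [T] = []
    private let make: () -> T
    private let reinit: (inout T) -> Void

    init(initialCount: Int, make: @escaping () -> T, reinit: @escaping (inout T) -> Void) {
        self.make = make
        self.reinit = reinit
        free = (0..<initialCount).map { _ in make() }
    }

    func take() -> T {
        var object = free.popLast() ?? make()
        reinit(&object)                       // "re-init previously alloc'd stuff"
        return object
    }

    func give(_ object: T) { free.append(object) }
}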

On annotations and such, your ideas are much better than mine. Despite our fancy-looking Actor and Knowledge Graph DSLs, in this area we are mostly pleading for the API to let us all smoothly dodge global alloc and global retain/release, and in debug builds to have it merrily lock and assert so we can all build stuff safely.

I think you've totally got what we are asking for, and why. If this goes forward, we would all have a responsibility to try this stuff out, profile and study our idioms, and report back on ARC in our profiles with a "why?". Some of you have probably noticed that ARC shows up in amazingly non-obvious ways, usually findable only by commenting out code and looking for diffs.

I assume this requires making a distinction between objects that will always stay within the actor and objects that could be given away?

And I can't really tell, are you finding the majority of time spent in ARC, the global allocator, or both?

1 Like

Yeah, we promote them with an API call, because we are not inside the Swift language like the core team is, but you could imagine some kind of handoff if the compiler has enough context. Part of the answer to that is how McCall and the team handle what can be sent between Actors: you could auto-graduate them, or send them to a pipe mini-heap and regurgitate them on the other end. Who knows.

I see a lot of time in both ARC and the allocator, but think for a second: they are expensive because they are globally visible, so they have to do the right thing and keep us safe from exploding our programs by using locks, etc. I'm simply advocating for a way to have private areas with local heaps, and the much lower costs that come from not messing with globals. It's the semantic change in how we use our objects that lets us reduce the costs: we are no longer managing global state, so it's OK to be quicker. So, to finish answering you, Tellow, you could imagine ARC being cheaper for local stuff too.

Again, we are synthesizing the behavior we imagine, with API and manual graduation out of the infant heap, plus specially designed structs and caches. I'm just asking whether there's a way we could get a bit more of this from the language itself. One thing is true: if you just made collections and strings use one of those local heaps for their backing instead of the malloc'd global heap, the cost of high-frequency use of collections and strings would shrink. On the Quartz team at Apple years ago, we had a local heap that would simply revert to the malloc heap if it ran out of space, so there are answers to these things.
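The Quartz-style fallback could look something like this, building on the LocalHeap sketch from earlier (again, made-up names; note that anything that falls back to the global allocator has to be freed individually):

// Hypothetical fallback allocator: fast, lock-free local allocation when there's
// room, reverting to the ordinary global allocator when the local heap is full.
final class FallbackAllocator {
    private let local: LocalHeap

    init(localCapacity: Int) {
        self.local = LocalHeap(capacity: localCapacity)
    }

    func allocate<T>(_ type: T.Type) -> UnsafeMutablePointer<T> {
        if let fast = local.allocate(type) {
            return fast                                        // common case: pointer bump
        }
        // Rare case: global malloc. These pointers must be deallocated individually,
        // unlike the local ones, which vanish when the heap is reset.
        return UnsafeMutablePointer<T>.allocate(capacity: 1)
    }
}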

Another thing we have is concurrency-safe logging. Because our heavy-duty recognition goes as wide as there are cores available, we needed to be able to log concurrently as well. The logger goes with that same operation context and is passed around with the work-item streams pumping concurrently. We have an on/off switch, state-based triggers, and levels for it, because logging can be really expensive.
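Roughly the shape of that logger, with made-up names; the @autoclosure keeps message construction free when logging is off.

// Hypothetical per-operation logger: carried with the operation context so each
// concurrent work item logs independently, with an on/off switch and levels.
struct OperationLogger {
    enum Level: Int, Comparable {
        case debug = 0, info, warning, error
        static func < (lhs: Level, rhs: Level) -> Bool { lhs.rawValue < rhs.rawValue }
    }

    var isEnabled: Bool
    var minimumLevel: Level
    let operationID: Int

    func log(_ level: Level, _ message: @autoclosure () -> String) {
        guard isEnabled, level >= minimumLevel else { return }   // cheap early exit
        print("[op \(operationID)] \(message())")
    }
}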

Thanks for discussing this. Swift is my favorite language so far, having written millions of lines of both C/C++ and Java previously. I like its cleanness, its thoughtful design, and its progress. I guess that's why we are here advocating.

This sounds much closer to region-based memory management (which I'm not exactly a fan of, tbh). If it doesn't require anything beyond what standard Swift can do (including using Unsafe*), maybe we can incubate it first as a normal library.

Interface-wise, maybe we can leverage property wrappers and dynamic member lookup:

protocol Zone {
  func allocate<T>(_: T.Type) -> UnsafeMutablePointer<T>
  func deallocate<T>(_: UnsafeMutablePointer<T>)
}

extension Zone {
  // Default arguments aren't allowed in protocol requirements, so the T.self
  // convenience lives in an extension.
  func allocate<T>() -> UnsafeMutablePointer<T> { allocate(T.self) }
}

@propertyWrapper
@dynamicMemberLookup
struct UnsafeZoned<T> {
  private var storage: UnsafeMutablePointer<T>
  private var zone: Zone

  init(wrappedValue: T, zone: Zone) { ... }

  var wrappedValue: T {
    get { ... }
    set { ... }
  }

  func destroy() { zone.deallocate(storage) }
  subscript(dynamicMember keyPath: ...) -> ... { ... }
}

So then we can use the property wrapper where available, and dynamic member lookup otherwise:

@UnsafeZoned(zone: .defaultZone) var data: ...
var data: UnsafeZoned = ...
2 Likes

I think two things about this... OK, three. First: nice to see code! Second: it's a great way for people to say "I can dodge ARC and get higher performance", though it would be interesting to see whether the mother object's connection to it causes ARC traffic anyway. And bigger picture, I think it's important we advocate for simplicity here: a way to use ARC ideas without the global-ness (allocation from a global, cross-thread heap, and the locks that come with it) while preserving the "GC"-ness of the way it's used. That way codebases can try out subheaps by making Agents or other subclasses of the mythical Operation class we are advocating, and see what it feels like perf-wise under load, without abandoning the "GC" safety rails altogether.

Apple pays people full time to think about these things, so I'm hoping those people care enough about compute performance under load, and about scaling codebases to handle more than one thing at once, to give serious thought to these kinds of things. Because Swift is now the default language for Apple, there isn't a lot of market pressure on the language anymore, so the pressure has to come from more core virtues. Having said that, with ARM ungating multicore, you would think scalable async programming would be more front and center. ;-)

I think you've already handled a very similar situation. You seem to be avoiding ARC by using unowned references extensively, which should also be applicable here. If that's not enough, we can also separate the allocator from the allocated data, and have deallocation be supplied with the allocator, i.e.,

$x.deallocate(allocatedBy: .defaultAllocator)

If the feedback is that it's good but the deallocation at the end is easy to miss, then we're probably on the right track with the syntax. I say this because it's not the only possible syntax and granularity. It could be a scope declaration, a declaration annotation, or even a type annotation:

local(zone) {
  ...
}
local(zone) var x = ...

local(Zone) class { ... }

Starting with something a little unsafe would be a good testing ground for that.

Even then, I think the safe annotation would at best be an optimization hint (albeit an error if the value is not actually local). The compiler stands to gain if the variable is local, so it would already be trying to prove its locality anyway.

If the async chain is treated (and optimized) as one long local context, then we'd also be able to use what we've learnt in a single-threaded environment. So perhaps they're on to something with async SIL.

Ok, so if this "exists" and so does the Actor stuff, it would be nice to know: when you call another actor's function with one of these as a parameter, or, in the even more likely case, when your fancy new Actor and this machinery live in a Swift 5 codebase and you want to send "results" back to normal ARC land, how would we promote the object to safe, global, slow-land ARC?

It would be nice if the compiler had enough information present to switch ARC spaces and do the handoff for us, including for things like Strings and Collections that in many cases pretend to be stack-y, struct-based values but are actually partying on the heap.

If we can do those exchanges, then we are pretty close to a new high-performance place. And with adoption of the "queue-bound lockless" heaps, the runtime and library teams can start experimenting both with creating them and with algorithms for right-sizing them. As I said before, though, the key is to amortize the cost of the global malloc across many fine-grained, lockless, queue-safe allocations, not to get rid of the global allocator altogether. TANSTAAFL, but almost.
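The manual "promotion" we do today is essentially a copy out of the local heap into ordinary ARC-managed storage. A rough sketch (promote and the names in the usage comment are made up):

// Hypothetical manual promotion: values built in the operation-local heap are
// copied into ordinary ARC-managed storage before crossing to "slow-land".
func promote<T>(_ local: UnsafeMutablePointer<T>) -> T {
    // For value types this is a plain copy; any references the value contains are
    // retained as part of the copy, so the result lives happily under global ARC.
    return local.pointee
}

// Usage sketch: build results in the local heap, promote only what escapes.
// let resultPtr = heap.allocate(Result.self)   // ... fill it in ...
// sendToOtherActor(promote(resultPtr))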

With wrappedValue being refCount +0, I think we can expose a +1 access via projected value:

extension UnsafeZoned {
  func takeRetained() -> T { ... }
}

With Unsafe APIs giving us a lot of refCount freedom while storing/loading data, it should be possible to implement using normal (unsafe) Swift (I haven't thought about its details, so I could be wrong). The semantic would be that every UnsafeZoned storage shares a reference, which will be manually decreased at the end of its lifetime. Since I don't know if we have optimization/criteria for local-only storage, this would be the closest I could get performance-wise. Or if it needs to be reallocated, we'd require T to be clonable, at which point we're pretty close to NSZone.

So how about an API where you could declare an Operation, the Zone would live in it, and all objects would take the Operation as the first parameter in their init, hold on to it, and use the Operation to allocate everything they need, including themselves. Would this work with the ideas you are forming? I would stick concurrency-safe logging in the Operation too, and it's probably a good place for other things as well.

Actors would be a "subclass"; we definitely want Actors not to use global ARC, otherwise they generate massive amounts of ARC sludge as they "message" each other at high frequency. Then we want API over the GCD concurrency stuff that lets us go wide with these and make N Operations, one per concurrent queue, with the idea that we get the GCD-happy one Operation per physical core, and each gets its own private heap to party on during the op's execution. Longer-lived Actors would be "streaming operations", in effect, along with the assorted goodies they are cooking up around all that auto-async stuff.
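Going wide with those would look roughly like this, reusing the OperationContext sketch from earlier in the thread; the names are made up and the round-robin is just for illustration.

import Foundation
import Dispatch

// Hypothetical fan-out: one OperationContext per physical core, each with its own
// private heap, fed from a stream of work items.
let coreCount = ProcessInfo.processInfo.activeProcessorCount
let contexts = (0..<coreCount).map {
    OperationContext(label: "operation.\($0)", heapCapacity: 1 << 20)
}

func submit(workItems: [Int]) {
    for (index, item) in workItems.enumerated() {
        // Round-robin items across the per-core operations.
        contexts[index % contexts.count].perform { heap in
            _ = heap.allocate(Int.self)     // all allocation stays in that op's heap
            _ = item
        }
    }
}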

We want to do whole hierarchies: whole class societies.