Swift equivalent of C++ coroutines

Lua · December 10, 2024, 6:38pm

I've been using C++ a bit lately, and one feature I found very useful is coroutines.

The general idea is something like this:

auto Chunk::generate() -> Task {
    using core::rng::Xoshiro256StarStar;
    using core::random;

    (...)

    Float height = base;
    for (Int iy = 0; iy < chunkHeight; iy += 1) {
        const auto fiy = static_cast<Float>(iy);
        if (fiy <= height) this->blocks[ix][iy][iz] = create<Stone>();
        if (fiy > height and fiy < 90.0) this->blocks[ix][iy][iz] = create<Water>();

        co_await std::suspend_always {};
    }

    (...)
}

It's a suspending function that generates terrain but suspends every once in a while, letting the main loop choose how long generation takes and avoiding blocking rendering.

My question is: how could I do something like this in Swift without significant overhead or any parallelism? I just want to suspend every once in a while.

Can Swift async be used for this?

I don't want to actually run things in the background, as this is more than enough and lets me avoid any synchronization.

wes1 · December 10, 2024, 7:50pm

Task.yield()? (assuming your function and the functions it calls are async)

Lua · December 10, 2024, 8:05pm

I'm not sure

Since it is a game, and it has to update every frame anyway, it is easiest to just resume whatever Chunk is not generated and closest to the camera until we run out of time for that frame.
Any dependency between Chunks, such as structures that generate across, are handled by one chunk awaiting another until it reaches a stage where it can safely generate itself.

There isn't any executor; it just inherently works.

If I understand correctly, using Swift's Task I would lose control over where and when it is executed, which seems like an issue.

I just want to call resume on a coroutine in a loop, but a Task seems to be handed off somewhere, and all I can seemingly do is cancel it.

Task is implemented somehow, so is there perhaps a way to use the Builtin module to advance coroutines? If I remember correctly Swift async is using LLVM coroutines under the hood.

Lua · December 10, 2024, 8:35pm

Honestly, I was never able to use Swift's concurrency for anything useful so far.

Whenever I look into it, I feel like it's very inflexible, and that I would have to write a very elaborate solution to conform my problem to Swift's idea of concurrency. Is there really no way to get around this?

Serika_PHB · December 11, 2024, 2:19am

You mentioned "without any parallelism". Do you mean to have everything running on the main thread? Then you can just annotate the closure of the Task with @MainActor. If you really want control over where the task executes, try the withTaskExecutorPreference function.
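For illustration, a minimal sketch of the @MainActor suggestion. `Chunk`, `filledRows`, and `startGenerating` are hypothetical names, not taken from the original project. Everything here runs on the main actor, so there is no parallelism, but the runtime (not the caller) decides when the task continues after each `Task.yield()`.

```swift
@MainActor
final class Chunk {
    private(set) var filledRows = 0

    /// Hypothetical generator: one row per step, with a suspension point after each.
    func generate(rows: Int) async {
        for _ in 0..<rows {
            filledRows += 1       // stand-in for the real block writes
            await Task.yield()    // other main-actor work may run here
        }
    }
}

@MainActor
func startGenerating(_ chunk: Chunk, rows: Int) -> Task<Void, Never> {
    // The task inherits main-actor isolation from this function,
    // so access to `chunk` needs no extra synchronization.
    Task { await chunk.generate(rows: rows) }
}
```

This assumes something is servicing the main actor (an app run loop, or top-level async code). It does not offer a way to resume the task by hand.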

I'm not sure what you mean by "when the task is executed". A Task is submitted once created and executed once a thread / executor is available. If there is no free thread / executor available, it cannot be executed. So I'm not sure how this can be possible, even in other languages.

Lua · December 11, 2024, 3:11am

In the C++ case it's really simple. Every time I suspend, control is returned back, and when I call resume it keeps going until it suspends again — C++ just handles preserving this state, I control the actual execution of the coroutine.

All I have to do is call resume in a loop for as long as I have spare time left. It is unrelated to whether or not I decide to run these using threads.

Task in Swift seems to hand it off to some runtime functions, and I don't believe I can control it with that much precision.

Serika_PHB · December 11, 2024, 3:53am

I'm not a C++ expert so correct me if I'm wrong.

It seems that you have some function or code segment A running and a coroutine B. A calls resume to continue running code in B, and B calls suspend to return to A. Because suspension is done manually, what B does each time should be predictable.

Actually if you don't really care that much about the order of execution (the order does not have to be A -> B -> A -> B and things like A -> A -> B -> A -> B -> B are also acceptable), then using Task.yield is indeed an option. But it won't work if order matters.

So maybe we don't really need those fancy techniques, and can simply implement B as a member method of a class with some fields holding the state of execution? That way, every time you call that method from A, it can continue from the state it was left in last time.
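A minimal sketch of what that could look like, assuming a hypothetical `ColumnGenerator` with two stages. None of these names come from the original project, and the integer block values are stand-ins for real block types.

```swift
/// Hypothetical two-pass generator for one column of blocks.
/// 0 = air, 1 = stone, 2 = grass (stand-ins for real block types).
final class ColumnGenerator {
    private enum Stage { case height, layer, done }

    private(set) var blocks: [Int]
    private let surface: Int
    private var stage = Stage.height
    private var iy = 0   // loop counter a coroutine would otherwise keep for us

    init(height: Int, surface: Int) {
        precondition(height > 0 && (0..<height).contains(surface))
        blocks = Array(repeating: 0, count: height)
        self.surface = surface
    }

    var isDone: Bool { stage == .done }

    /// Does one small unit of work, then returns to the caller (the "suspend").
    func resume() {
        switch stage {
        case .height:
            if iy <= surface { blocks[iy] = 1 }
            iy += 1
            if iy == blocks.count { stage = .layer }
        case .layer:
            blocks[surface] = 2   // the top stone block becomes grass
            stage = .done
        case .done:
            break
        }
    }
}

// The caller (A) decides exactly when each step runs.
let column = ColumnGenerator(height: 6, surface: 2)
var resumes = 0
while !column.isDone {
    column.resume()
    resumes += 1
}
```

The caller keeps full control over ordering, at the cost of writing the state machine by hand.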

mlienert · December 11, 2024, 10:25am

Edit: I mentioned AsyncSequence, but it seems there is nothing async in the work.

Can this be "simulated" with a Sequence that does some work on each next() call?

The main loop can call next() several times until it decides to go do something else (like render the next frame).
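A rough sketch of that idea, assuming a hypothetical `ColumnWork` type that processes one (ix, iz) column per `next()` call and a hypothetical `runFrame` helper. The step limit is a stand-in for a real time check.

```swift
/// Hypothetical work sequence: each `next()` processes one column.
struct ColumnWork: Sequence, IteratorProtocol {
    let side: Int
    private var index = 0

    init(side: Int) { self.side = side }

    /// Returns the column just processed, or nil once every column is done.
    mutating func next() -> (ix: Int, iz: Int)? {
        guard index < side * side else { return nil }
        defer { index += 1 }
        // The real terrain work for this column would go here.
        return (ix: index / side, iz: index % side)
    }
}

/// One "frame": performs at most `maxSteps` units of work, then returns
/// so the caller can go render. Returns the number of steps performed.
func runFrame(_ work: inout ColumnWork, maxSteps: Int) -> Int {
    var steps = 0
    while steps < maxSteps, work.next() != nil {
        steps += 1
    }
    return steps
}

var work = ColumnWork(side: 4)                 // 16 columns in total
let firstFrame = runFrame(&work, maxSteps: 5)  // the rest waits for later frames
```

Note that the nested `ix`/`iz` loops had to be flattened into a single stored index, which is the kind of manual bookkeeping a coroutine avoids.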

Lua · December 11, 2024, 10:41am

That would be technically possible, but for a function containing just a few nested loops like that for various terrain features, I would have to maintain the loop counters inside some separate allocation. That's very error prone and unreadable.
The advantage of a coroutine here is that it can be written and read like a normal function, the state machine is generated by the compiler.

The order matters between tasks themselves — there is one B per non generated terrain chunk, and they are resumed sorted by distance from the camera.

Additionally, calling resume on one task could actually be forwarding to resume another, as fully completing one chunk requires partially generating its neighbors, trees for example could overlap more than one at a time.

ibex10 · December 11, 2024, 1:06pm

Unfortunately, that is the case, and it is going to remain in that state until the whole thing is properly documented, with plenty of examples demonstrating how to solve real-world problems.

j-f1 · December 11, 2024, 2:33pm

How about an AsyncChannel? You could either have actual values passed into the channel that you consume from your main loop, or just have a channel of Void if you don’t need to pass any values between the terrain generator and the main task/loop.

Lua · December 11, 2024, 3:38pm

From what I understand, I could only do it in two ways:

Manually store the stages of the now imaginary coroutine within the chunk, and call generate repeatedly with a switch inside that dispatches to the correct stage, also making sure each one preserves the state correctly
Dispatch a Task which captures the chunk and let Swift handle the magic, but now I have to make Chunk an actor and synchronize access to the chunk

In the C++ (or implemented in Swift by hand) case I do not need to worry about synchronization because I can guarantee nothing will ever try to use the chunk while it's being mutated.
I could isolate both to the MainActor but I believe I would no longer be able to ensure that when time runs out control is immediately handed back.

The idea is not to run multiple tasks concurrently, but break up a long task so that it never takes enough time in one go to drop even a single frame.

ksluder · December 11, 2024, 4:13pm

Unfortunately, I don’t think your approach can ever actually meet your goal. The frequency of your suspensions is fixed in your code, scaled by the speed of the CPU it’s running on. This frequency is entirely uncoupled from your refresh rate, leading to unpredictable frame pacing. Except by sheer luck, these frequencies will eventually form a “beat” where your suspension comes too late to prepare the frame containing the data you just generated, and you will drop a frame.

It’s also a poor use of the computer’s resources. Terrain generation is very frequently implemented using parallel algorithms. Trying to force all your computation onto a single core ignores the multi-core architecture that all platforms have evolved into over the past decade+.

If you really never want to drop a frame, you’ll need to run your renderer on a dedicated thread and use some form of synchronization against your terrain data. But the terrain generation itself should be easily parallelizable with minimal to no manual synchronization.

Lua · December 11, 2024, 5:40pm

I'm not sure I understand, why wouldn't it work? As long as a single resume never lasts longer than a fraction of a millisecond, checking back every 0.5ms or so if I can still resume should be fine.

Currently though it's implemented in a very naive way:

for (const auto chunk : this->chunkCache | std::views::reverse) {
    if (not chunk->isGenerated()) {
        while (start.elapsed() < 4) {
            if (chunk->generator.done()) break;
            chunk->generator.resume();
        }
    }
    if (start.elapsed() >= 4) break;
}

This is really bad code in many ways, like calling clock_gettime in a loop which accounts for a large chunk of time on the profiler - but since it is limited to 4ms, this should only further limit the speed of world generation, not drop frames.
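One possible way to reduce the clock overhead, sketched in Swift since that is the porting target: read the clock once per batch of resumes rather than on every iteration. `StepGenerator`, `runGenerators`, and the batch size are hypothetical, and the budget can be overshot by up to one batch.

```swift
import Dispatch

/// Hypothetical stand-in for a chunk generator that can be resumed step by step.
final class StepGenerator {
    private(set) var remaining: Int
    init(steps: Int) { remaining = steps }
    var isDone: Bool { remaining == 0 }
    func resume() { if remaining > 0 { remaining -= 1 } }
}

/// Resumes generators in order until the time budget is spent,
/// checking the monotonic clock only once per `batch` resumes.
func runGenerators(_ generators: [StepGenerator],
                   budgetNanoseconds: UInt64,
                   batch: Int = 64) {
    let start = DispatchTime.now().uptimeNanoseconds
    for generator in generators {
        while !generator.isDone {
            for _ in 0..<batch {
                if generator.isDone { break }
                generator.resume()
            }
            let elapsed = DispatchTime.now().uptimeNanoseconds - start
            if elapsed >= budgetNanoseconds { return }
        }
    }
}
```

Usage would be along the lines of `runGenerators(sortedByDistance, budgetNanoseconds: 4_000_000)` once per frame, where `sortedByDistance` is whatever ordering the caller wants.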

This is what one of the spikes looks like:

I don't think this has anything to do with how I resume the coroutine, rather it's

calling clock_gettime repeatedly
a very suspicious isOpaqueExcludingEmptyAt method. It does not do much, something is very wrong
inefficiently sending all world vertices to OpenGL

It is not obvious from the graph but after generating it immediately remeshes its 4 direct neighbors, so it blocks to do 4 times the work it did for itself while suspending.

But do let me know if I'm missing something, I am not yet very familiar with graphics APIs

Certainly, but if at some point it becomes the bottleneck, I can then add complexity to solve it, but doing so now would make it harder to reason about more critical issues. It would also make them harder to profile. I see no reason to design around a problem that may or may not exist.

ksluder · December 11, 2024, 6:12pm

You are making the assumption that your simulation and encoding will always fit in the frame budget (1/refresh rate). Perhaps that’s true in your case. In most games whose performance traces I am exposed to, poor frame pacing inevitably leads to dropped frames because these games are always pushing the limits of their frame budgets.

You’ve implemented a “push model” render loop where you are generating frames as quickly as possible and pushing them to the GPU as they are completed. This leads to inconsistent frame pacing unless the time it takes to simulate and encode each frame is a precise integer ratio of the framerate. That’s essentially impossible to guarantee, so you wind up capturing the world state at varying times relative to when the display refreshes. Sometimes the display will show a frame that’s 2ms old, sometimes it will show a frame that’s 10ms old. This makes input feel awful and imprecise.

This is why we recommend using CAMetalDisplayLink to drive your render loop on a dedicated thread. That way your renderer is always latching world state at a fixed latency relative to VBL. You provide an estimate of how long your most expensive frame will take to encode, but even when cheap frames finish early, all frames are sampling the simulation at a consistent interval.

Lua · December 11, 2024, 6:25pm

Odd, I am very sensitive to frame rates, but I can't feel anything wrong, besides the awful spikes (or smaller spikes sorting transparency), regardless of how vsync is configured.

Does it just not show yet when it takes the CPU 0.02ms to handle a frame?

I'll keep that in mind for when I run into inexplicably bad frame pacing. I am not using metal though, how would I do that with OpenGL?

Also, my main issue at the moment is porting the code to Swift, not optimizing it

ksluder · December 11, 2024, 6:57pm

If your frames are that cheap, then it probably won’t matter because your latency will vary in the 0.1ms range, which I’m pretty sure is well below most humans’ ability to detect. (The AR and VR researchers would know more.)

OpenGL is deprecated on Apple platforms. I strongly encourage you to target Metal instead.

I would feel bad if you had to redo the fundamental architecture of your entire application to “optimize” it.

Lua · December 11, 2024, 7:41pm

If I eventually release a game I will of course port it to use Metal. However, I would rather avoid having to modify multiple versions of the same shader for all platforms whenever I make a change. I can just port a finished game at the end; it would not be hard, since there is only 1 function that touches the graphics API during gameplay.

I don't think I have to change anything, only resume on a background thread. In fact, I only had to add a mutex to let the thread search through visible chunks, and make the enum informing the environment of the generation stage of the chunk atomic. It is now very smooth. I can even hear the electricity in the GPU now. Who needs a profiler when the sound already communicates when the game slows down?

Never mind, I missed something, there is a very low chance of the screen turning into a mess of colors. I would much rather do this in a language that can catch whatever it is that I forgot about

Lua · December 16, 2024, 9:10pm

Okay, so I am trying to rewrite this using async and I guess I'll just have to improvise when I want to cross compile and can't use _Concurrency

But I have another issue. A Chunk now owns a Task which generates it, but I want to express the following properties of my C++ coroutine:

Chunk tasks should be generated on the same actor, and the task being resumed should always be the one belonging to a chunk closest to the camera
Chunks generate in stages and depend on each other, because structures like trees can generate across chunks. A given chunk as its final stage of generation awaits its neighbors until before said stage so that it can know the list of blocks which spill over from those chunks — but doesn't await until the end since that would make the neighbors try to resolve their own neighbors and so on, causing an infinite chain of awaiting. Entering that final stage should only be possible for the primary chunk being generated.

Code so far

public func generate() async {
    var shuffled = self.position.x ^ self.position.z << 1 ^ Int64(bitPattern: self.world.seed) << 2
    
    // The RNG will do nothing if the seed is 0, take care not to accidentally shuffle one.
    for i in Int64(0)... { if shuffled != 0 { break }
        // Just to be safe and avoid taking *1 entire second* to type check this expression, split into steps.
        let step1 = self.position.x * i
        let step2 = self.position.z * i
        let step3 = Int64(bitPattern: self.world.seed)
        shuffled = step1 ^ step2 << 1 ^ step3 << 2 ^ i << 3
    }

    var rng = Xoshiro256StarStar(from: UInt64(bitPattern: Int64(shuffled)))

    for ix in 0..<Self.side {
        for iz in 0..<Self.side {
            let fix = Float(ix)
            let fiz = Float(iz)
            let fx = Float(self.position.x)
            let fz = Float(self.position.z)

            let posX = fix + Float(Self.side) * fx
            let posZ = fiz + Float(Self.side) * fz

            func octave(_ frequency: Float, _ amplitude: Float) -> Float {
                //perlin(posX * frequency, posZ * frequency, 0, Int32(this->world->seed())) * amplitude
            }

            func octaved(base frequency: Float) -> Float {
                octave(frequency, 0) + octave(frequency * 2, 0.5) + octave(frequency * 4, 0.25)
            };

            let continentalness = octaved(base: 0.005)
            let erosion = octaved(base: 0.01)
            let peaks = octaved(base: 0.05)

            // Terrain
            let base: Float = switch continentalness {
                case ..<0.3:    continentalness.normalized(from: -1...0.3, to: 50...100)
                case 0.3..<0.4: continentalness.normalized(from: 0.3...0.4, to: 100...150)
                case _:         150
            }

            // MARK: - Height pass -------------------------------------------------------------------------------------
            let height = base

            for iy in 0..<Self.height {
                let fiy = Float(iy)
                
                if fiy <= height { self[ix, iy, iz] = Stone.shared }
                if fiy > height && fiy < 90.0 { self[ix, iy, iz] = Water.shared }

                await Task.yield()
            }

            // MARK: - Layer pass --------------------------------------------------------------------------------------
            let maxThickness = 3 // TODO(!): This should be variable slightly but still determined by seed.

            for (currentThickness, iy) in (0..<Self.height).reversed().enumerated() where currentThickness <= maxThickness {
                if self[ix, iy, iz].tag == .stone {
                    self[ix, iy, iz] =
                        base < 92
                            ? Sand.shared
                            : currentThickness == 0 ? Grass.shared : Dirt.shared
                }

                await Task.yield()
            }

            // TODO(!): This is a duplicate loop but it was accidentally left in
            //       the C++ version so for now keep it to get identical foliage and tree generation.
            for iy in (0..<Self.height).reversed() {
                if
                    iy + 1 < Self.height
                    && self[ix, iy, iz].tag == .grass
                    && self[ix, iy + 1, iz].tag == .air
                    && Float.random(in: 1...100, using: &rng) < 25
                {
                    self[ix, iy + 1, iz] = TallGrass.shared
                    break
                }

                await Task.yield()
            }

            // TODO(!): This could probably be a `where` loop.
            for iy in (0..<Self.height).reversed() {
                if
                    iy + 1 < Self.height
                    && self[ix, iy, iz].tag == .grass
                    && self[ix, iy + 1, iz].tag == .air
                    && Float.random(in: 0..<100, using: &rng) < 25
                {
                    self[ix, iy + 1, iz] = TallGrass.shared
                    break
                } else if
                    iy + 1 < Self.height
                    && self[ix, iy, iz].tag == .grass
                    && self[ix, iy + 1, iz].tag == .air
                    && Float.random(in: 0..<100, using: &rng) < 1
                {
                    self[ix, iy + 1, iz] = Rose.shared
                    break
                }

                await Task.yield()
            }
        }
    }

    // Tree pass
    for ix in 0..<Self.side {
        mainLoop:
        for iz in 0..<Self.side {
            if Float.random(in: 0..<100, using: &rng) > 1 { continue }

            let height = Int.random(in: 4...7, using: &rng)

            var start: Int? = nil
            for iy in (0..<Self.height).reversed() {
                if (self[ix, iy, iz].tag == .grass) { start = iy + 1; break }
                if (!self[ix, iy, iz].softGeneration) { continue mainLoop }
            }

            if let start { // TODO(!): This isn't easy to translate into Swift, ignore for now.
                //for (Int iy = *start; iy < *start + height + 1 and iy < chunkHeight; iy += 1) {
                //    if (iy < *start + height) self.safeSoftSetBlockAt(ix, iy, iz, Log.shared)

                //    if iy > start + height - 4 {
                //        let radius = iy > start + height - 2 ? 2 : 3

                //        for (Int tix = -radius + 1; tix < radius; tix += 1) {
                //            for (Int tiz = -radius + 1; tiz < radius; tiz += 1) {
                //                this.safeSpillingSoftSetBlockAt(ix + tix, iy, iz + tiz, Leaves.shared)
                //            }
                //        }
                //    }
                //}
            }

            await Task.yield()
        }
    }

    // COMPLETION STAGE - This is the point where generating further creates a dependency on our neighbors.
    // We can now inform chunks awaiting on us that we are ready to be referenced for structure generation.
    self.stage = .completion
    await Task.yield()

    let positions = [
        Position(x: self.position.x - 1, z: self.position.z - 1),
        Position(x: self.position.x - 1, z: self.position.z    ),
        Position(x: self.position.x - 1, z: self.position.z + 1),

        Position(x: self.position.x,     z: self.position.z - 1),
        Position(x: self.position.x,     z: self.position.z + 1),

        Position(x: self.position.x + 1, z: self.position.z - 1),
        Position(x: self.position.x + 1, z: self.position.z    ),
        Position(x: self.position.x + 1, z: self.position.z + 1)
    ]

    // Parenthesized closure: a trailing closure here would be parsed as the loop body.
    for neighbor in positions.map({ self.world.demandChunkAt($0) }) {
        while (neighbor.stage == .terrain) {
            await Task.yield() // ??????????
        }

        // for block in neighbor.spill where block.position == self.position
        for (position, x, y, z, block) in neighbor.spill where position == self.position {
            self.safeSoftSetBlockAt(x, y, z, block)
        }
    }

    await self.relight()
    await self.remesh()

    // GENERATION END - The chunk is now fully generated and ready for use.
    self.stage = .generated

    // Remesh neighbors
    self.world.remeshNeighbors(self.position)
}

Source code of the Swift version: TeamPuzel/BlockGameSwift