That's a great, well-written article, but it's clear that they haven't actually settled on an implementation. Their prototype includes at least two different designs, probably a function-global frame allocation vs. a narrower continuation allocation. Whatever they do, it's going to have serious trade-offs that I haven't seen any balanced evaluations of yet. That's fair; it's still fairly early days.
You cannot have truly lightweight threads while still managing stacks in a traditional way. Traditional stacks require a substantial reservation of virtual address space plus at least one page of actual memory; even with 4KB pages (and ARM64 on Apple platforms uses 16KB), that level of overhead means you'll struggle with hundreds of thousands of threads, much less millions. Instead, local state must either be allocated separately from the traditional stack (which means a lot of extra allocator traffic) or be migratable away from it (which makes thread-switching quite expensive, and so runs counter to the overall goals of lightweight threads). Since the JVM already has a great allocator and GC, I assume they're doing the former, but that's going to introduce a lot of new GC pressure, which I doubt most heavy JVM users will be fans of.
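To put rough numbers on the page overhead: the sketch below computes only the lower bound from "at least one page of actual memory" per thread. The function name is made up for illustration, and it deliberately ignores the virtual-address reservation, guard pages, and kernel bookkeeping, all of which would only add to the total.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: lower bound on resident memory if every thread
   touches at least one page of a conventional stack. */
uint64_t min_resident_bytes(uint64_t thread_count, uint64_t page_size)
{
    assert(page_size > 0);
    return thread_count * page_size;
}
```

For a million threads that comes to about 4.1 GB with 4KB pages and about 16.4 GB with 16KB pages, before any thread has done useful work.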
If you don't have "colored" async functions, you have to allocate local state separately like that for every function that isn't completely trivial. That allocated context then has to be threaded through calls and returns the same way the traditional stack pointer is. Since Swift doesn't already do that for ordinary functions, and we've declared our ABI stable on Apple platforms, this really just isn't an option, even if there were no other downsides.
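As a rough illustration of what "threaded through calls and returns" means, here is a minimal sketch in C. Everything in it is hypothetical (`task_context`, `frame_push`, `frame_pop`, `caller`, `leaf`); it is not how Loom or Swift does anything, just the general shape of a calling convention where locals live in a separately allocated context instead of on the traditional stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-task context: a bump allocator standing in for the stack. */
typedef struct task_context {
    unsigned char *base;
    size_t used;
    size_t capacity;
} task_context;

task_context *task_context_create(size_t capacity)
{
    task_context *ctx = malloc(sizeof *ctx);
    if (ctx == NULL) return NULL;
    ctx->base = malloc(capacity);
    if (ctx->base == NULL) { free(ctx); return NULL; }
    ctx->used = 0;
    ctx->capacity = capacity;
    return ctx;
}

void task_context_destroy(task_context *ctx)
{
    if (ctx != NULL) { free(ctx->base); free(ctx); }
}

/* Round a frame size up so every frame stays suitably aligned. */
static size_t frame_round(size_t size)
{
    size_t a = _Alignof(max_align_t);
    return (size + a - 1) / a * a;
}

static void *frame_push(task_context *ctx, size_t size)
{
    size = frame_round(size);
    assert(ctx->capacity - ctx->used >= size); /* a real design would grow here */
    void *frame = ctx->base + ctx->used;
    ctx->used += size;
    return frame;
}

static void frame_pop(task_context *ctx, size_t size)
{
    ctx->used -= frame_round(size);
}

/* Every non-trivial function takes the context as an extra argument
   and keeps its locals in a frame carved out of it. */
int leaf(task_context *ctx, int x)
{
    int *local = frame_push(ctx, sizeof *local);
    *local = x * 2;
    int result = *local;
    frame_pop(ctx, sizeof *local);
    return result;
}

int caller(task_context *ctx, int x)
{
    int *local = frame_push(ctx, sizeof *local);
    *local = x + 1;
    int result = leaf(ctx, *local); /* the context is passed on explicitly */
    frame_pop(ctx, sizeof *local);
    return result;
}
```

The extra `ctx` parameter is the point: it changes the signature of every function that participates, which is why retrofitting this onto an already-stable ABI is not realistic.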