I want to fork this conversation off slightly because it's important: backpressure is a vital component in defending against DoS. To get a feel for how, consider backpressure not as limiting the number of bytes in flight, but instead as limiting the rate at which you accept new work.
Both a network and a NIO application can be thought of as a pipeline of logical units joined by communication channels. At a very high level, each network appliance is a logical unit, and the network connections between them are the links.
At a lower level, each of these appliances is itself a series of logical units joined by links. Your webapp is exactly this: a series of logical types and tasks joined by async sequences, or channel handlers joined by a pipeline.
In this model, we don't want to think of what is sent along the links as bytes, but instead think of it as work. Each packet is a unit of work that must be processed, each HTTP message a unit of work, each item enqueued into an async stream a unit of work. These units of work are not all equal: some cost more RAM, others more CPU, others more network bandwidth. But they are all work.
The goal of backpressure propagation is to ensure that this entire distributed system moves at the speed of the slowest participant. That is, whichever unit is processing this work most slowly forms the bottleneck, and backpressure propagation should slow all work producers down until they only produce work at a rate that the slowest worker can handle.
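A minimal sketch of what "slowing the producers down" means in code. This is illustrative, with invented names (`BoundedBuffer`, `send`, `receive`), not how NIO implements it: the key idea is that a producer suspends whenever the buffer ahead of the slow worker is full, so it can never run faster than the consumer by more than the buffer capacity.

```swift
// Backpressure via suspension: `send` parks the producer while the buffer
// is at capacity, and `receive` wakes one parked producer per item removed.
actor BoundedBuffer<Work> {
    private var items: [Work] = []
    private var waiters: [CheckedContinuation<Void, Never>] = []
    private let capacity: Int

    init(capacity: Int) { self.capacity = capacity }

    // Producer side: suspends (does not drop) when the buffer is full.
    func send(_ work: Work) async {
        while items.count >= capacity {
            await withCheckedContinuation { waiters.append($0) }
        }
        items.append(work)
    }

    // Consumer side: removing an item frees a slot, so resume one producer.
    func receive() -> Work? {
        guard !items.isEmpty else { return nil }
        let work = items.removeFirst()
        if !waiters.isEmpty { waiters.removeFirst().resume() }
        return work
    }
}
```

Because the producer suspends rather than queueing unboundedly, the slowdown naturally propagates upstream: whatever was feeding that producer stalls in turn.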
If backpressure propagation fails, then the producers do not slow down and work "piles up" in front of the slower workers. The effect is that each work item experiences greater latency than the one before, which eventually manifests as timeouts and service unavailability (because in a distributed system an unavailable node is indistinguishable from one with arbitrarily large latency). This manifests as a complete denial of service: no user is served within reasonable timeouts.
Proper backpressure propagation limits the damage here. You approach a maximum latency, which is the product of buffer depth and the time taken to complete each work item. Beyond that point, work is dropped. This leads to partial DoS: some users cannot access the service, but others can (assuming your buffer depths are reasonable). The system is degraded, but not unavailable.
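The "bounded buffer plus drop" policy can be sketched like this (again with invented names, not a real NIO API): once the buffer is full, new work is refused up front rather than queued, which caps worst-case latency at roughly capacity × per-item service time and sheds the excess instead of letting every request time out.

```swift
// Load shedding: a fixed-capacity buffer that rejects work when full,
// bounding queueing latency at the cost of dropping some requests.
struct DroppingBuffer<Work> {
    private var items: [Work] = []
    let capacity: Int

    init(capacity: Int) { self.capacity = capacity }

    // Returns false if the item was dropped because the buffer is full.
    mutating func offer(_ work: Work) -> Bool {
        guard items.count < capacity else { return false }
        items.append(work)
        return true
    }

    mutating func poll() -> Work? {
        items.isEmpty ? nil : items.removeFirst()
    }
}
```

The users whose work is dropped see an error, but everyone whose work is accepted is served within the latency bound. That is the "degraded, but not unavailable" outcome.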
NIO contains some protections here above and beyond backpressure propagation. By design, it bounds the maximum amount of work it will do on one connection in response to I/O using `maxMessagesPerRead`: once we have issued that many reads, we stop serving that channel and let the others operate. This increases their latency, but does not starve them entirely.
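For reference, this is a channel option an application can tune when bootstrapping. A hedged sketch (the value `4` here is purely illustrative, not a recommendation):

```swift
import NIOCore
import NIOPosix

// Bound the work NIO does per connection per event-loop wakeup: after this
// many reads on one channel, the event loop moves on to the other channels.
let group = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount)
let bootstrap = ServerBootstrap(group: group)
    .childChannelOption(ChannelOptions.maxMessagesPerRead, value: 4)
    .childChannelInitializer { channel in
        // Application pipeline setup goes here.
        channel.eventLoop.makeSucceededVoidFuture()
    }
```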
The result is that when we experimented with the H2 DoS linked above we found NIO handled it fairly well. We were degraded, for sure, but still live. We shipped protections to reduce our exposure, but in general it was acceptable.
This is why I'd like to see a performance trace here, as well as the metrics @FranzBusch asked for. It's likely that somewhere there is a hole in the protection. Given the CPU and network graph you showed, it looks more likely to be a backpressure issue, because this manifests as a "bubble". But it may not be, and we can tell based on what the system is doing while it's under attack.