Web Crawler using Async/Await

Hi everyone,

We were working on a small web crawler (just a toy implementation to explore the new concurrency system). Our design is roughly like this: we have a single job queue containing all the URLs that still need to be crawled. We have N workers, created using N child tasks. Each worker has a loop in which it tries to fetch the next URL from the queue (e.g. using queue.dequeue()). However, it might be the case that the queue has fewer items than there are workers (for example, initially there is just one URL in the queue). In this case, we want to suspend the worker.

We've implemented this suspension using withCheckedContinuation. Inside dequeue, when the queue is empty, we create a continuation with withCheckedContinuation and store it; the worker awaits it and, once resumed, calls dequeue again recursively. When items are added to the queue, we resume all stored continuations.
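For concreteness, here is a minimal sketch of that pattern (the names `JobQueue`, `enqueue`, and `waiters` are my own, not the original code). One simplification compared to the description above: instead of resuming all stored continuations and re-dequeuing recursively, enqueue hands a new element directly to a single suspended waiter, which avoids the recursive call entirely.

```swift
import Foundation

// An actor-backed queue that suspends callers of dequeue() while empty,
// by storing checked continuations and resuming one per enqueued element.
actor JobQueue<Element> {
    private var storage: [Element] = []
    private var waiters: [CheckedContinuation<Element, Never>] = []

    func enqueue(_ element: Element) {
        // If a worker is suspended waiting, hand it the element directly.
        if let continuation = waiters.popLast() {
            continuation.resume(returning: element)
        } else {
            storage.append(element)
        }
    }

    func dequeue() async -> Element {
        if !storage.isEmpty {
            return storage.removeFirst()
        }
        // Queue is empty: suspend until enqueue() resumes us.
        return await withCheckedContinuation { continuation in
            waiters.append(continuation)
        }
    }
}
```

Each of the N workers would then simply loop over `await queue.dequeue()`. Because the continuation storage lives inside an actor, enqueue and dequeue can't race on the `waiters` array.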

By the way, we can't model our queue as an async stream, because we want multiple workers to get the next element (async stream only supports a single task as the consumer).

It feels a bit "wrong" to have to use withCheckedContinuation to implement this suspension behavior, is there a simpler way? I think the ideas from Communicating between two concurrent tasks don't apply as we have multiple workers in different tasks.

Here's the full code. Please run it against a local URL, as it doesn't wait between fetching pages or back off when something goes wrong.

Why are you explicitly crawling on the main thread?

That sounds to me like a Combine publisher of some description, which is unfortunate, as Combine isn't open source or even available on all Swift platforms.

One thing it took me a bit to grasp is that the concurrency system isn’t actually meant to be used for parallelism. That is, performing work simultaneously isn’t the goal. Rather, async/await is designed to eliminate the potential to block while there remains work to do. If you want one worker starting more work while it waits for a download, that’s fine. If you want multiple workers doing that, you need something more.

Parallelism will come in the future, and the groundwork has already been laid for it. Custom executors for actors are one example of that.

Separately from the concurrency system, I’m hoping we might even get a limited form of automatic parallelization for code that can be proven referentially transparent someday. If the compiler could be certain that a series of operations are completely pure, and can therefore be executed at the same time, it could simply make that happen based on optimization settings as an implementation detail.

Thanks for your reply. I marked the crawler as @MainActor so we can observe it from another framework. However, you're right: it's an oversight that crawl(url:numberOfWorkers:) is on the main thread, it shouldn't be. This code isn't meant to be parallel per se, but it is meant to be concurrent: the N workers should crawl concurrently.

I know this could be solved with different techniques (Combine, RxSwift, GCD, etc.). My question was specifically about the new concurrency system: is there a simpler way to solve this? Or maybe phrased the other way around: is it okay to use with[...]Continuation as a general mechanism to suspend and resume tasks manually? Beyond the simplest cases (wrapping completion handlers) I found it tricky to get exactly right.

That's how I've been treating it: for me, the continuation API is a low-level API for interfacing with the concurrency runtime. The simple case is wrapping a callback-based API, but as other examples show (my own async channel, or AsyncStream), keeping the continuation around unlocks many more possibilities.

But others, on other threads, seem to reach for AsyncStream for a lot of things I use continuations for, so maybe that's the better option ("always use the highest-level abstraction", I guess).

I stumbled upon Asynchrone/SharedAsyncSequence.swift at main · reddavis/Asynchrone · GitHub, which might be a way to solve your issue with a Combine/Rx-like approach.

You should be able to use just one worker that keeps trying to dequeue and spawns tasks when there's work to do.
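A sketch of that single-worker idea, under my own assumptions (the names `crawl` and `fetchLinks` are placeholders, and `fetchLinks` is stubbed out rather than doing a real download): one coordinator loop dequeues URLs and spawns a child task per URL inside a task group, so only the coordinator ever touches the queue and no worker needs to suspend on an empty queue.

```swift
import Foundation

// Stub: a real implementation would download the page and parse its links.
func fetchLinks(from url: URL) async throws -> [URL] {
    return []
}

// Single coordinator loop: dequeue work, fan out child tasks, bounded by
// maxConcurrent, and feed newly discovered links back into the pending list.
func crawl(start: URL, maxConcurrent: Int = 4) async -> Set<URL> {
    var pending: [URL] = [start]
    var seen: Set<URL> = [start]

    await withTaskGroup(of: [URL].self) { group in
        var running = 0
        while !pending.isEmpty || running > 0 {
            // Top up the group while there is work and spare capacity.
            while running < maxConcurrent, let url = pending.popLast() {
                running += 1
                group.addTask {
                    (try? await fetchLinks(from: url)) ?? []
                }
            }
            // Wait for one child to finish; collect any newly found links.
            if let links = await group.next() {
                running -= 1
                for link in links where !seen.contains(link) {
                    seen.insert(link)
                    pending.append(link)
                }
            }
        }
    }
    return seen
}
```

The trade-off is that each fetch becomes a short-lived child task in structured concurrency, rather than one of N long-lived workers, which sidesteps the manual suspend/resume problem entirely.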

@tclementdev yes, that could work, but then you're back in "unstructured concurrency" land. My problem isn't building a web crawler, that's just an example. I was mainly wondering whether it's okay to use with[...]Continuation as a general-purpose manual suspend/resume mechanism.
