dataTask() in a loop

DispatchQueue is meant for sequential execution of synchronous work items, not asynchronous ones. DispatchGroup allows you to execute a completion handler when several asynchronous work items are done (or to block while waiting), but it gives you no additional control over when or how they run.

How do sequential downloads work with dataTask?

That's the part I didn't quite understand in the docs.
You can easily save the Data to storage with try data!.write(to: localurl, options: .atomic), so what's the purpose of creating a temporary file that you then have to move to its permanent spot? Is this about trying to minimize the time the data is stored in memory/passed around?

That would be the recursive approach you suggested.

Oh, that might explain why group.wait() seems to completely stop execution and doesn't even start the task if the thread I'm currently on (it's not the main thread but the thread the previous task used) is also the thread the new task wanted to use. I guess there's no (easy-ish) way to make sure that the thread that might get blocked isn't the thread that's supposed to do the work, is there?

What is a good point to start async from a non-async function? Is that the point where you use Task as in the code you quoted?
My app basically does:

  1. Main class/view controller calls public function A (with completion handler) in the download class.
  2. Function A does a bunch of checks, then calls private function B (another completion handler) with a list of file names
  3. Function B loops through that list (inside a Task to deal with async trickling up) and calls private function C that actually downloads the data with await (as @tera suggested, with the changes they suggested to make it more task-like) and yet another completion handler.

I already read the concurrency docs but there are too many different paths it could take in my case to simply do "do something - await - do something with result - await - ...".

Just noticed I mixed those two up, I was talking about DispatchGroup of course.
Its "completion handler" is simply group.notify, it's the group.wait() that's causing the issue.

Assuming the "Queue" was just a typo and you mean DispatchGroup (as @tclementdev explained the difference):

Basically yes, such a group is indeed not really meant to sequentially perform work.
The text you quote should be interpreted like so:
"attach multiple work items to a group" means that you enter() for each item you then "schedule [them] for asynchronous execution", which in turn means that you do something that will execute asynchronously.
In your case this translates to resume()ing the task, which is an asynchronous operation that eventually calls a completion handler (on a different queue, btw). As explained, that is when you leave() again and once all such scheduled items have done that, the group invokes the closure passed to notify.

The part about "also wait[ing] synchronously for all tasks in the group to finish executing" means that as an alternative to using notify, you can wait() until all scheduled work is done. I now see the documentation for wait is a little unfortunate, as it says "Waits synchronously for the previously submitted work to finish." It should probably say "... for all previously ..." as it does not just wait for the last thing you scheduled to invoke leave(), but until all leave() calls have happened.
The difference from notify is then that wait blocks the queue it is called on; in your case that's probably the main queue/thread. I can't immediately see why this ends up in a complete deadlock, but the general approach is still not a good fit for what you're trying to achieve anyway. Basically what you're trying to do is "enter the group once, let the one scheduled work item finish and leave while you wait for that, then enter the group for the next item again" and so on. You're instead supposed to first enter for ALL the items and then react once they have ALL finished, regardless of whether you do so with notify or wait.

You can imagine the group as a counter: every time you schedule something (i.e. enter() the group) it increments, and every time something finishes (i.e. leave()s the group) it decrements.
Once it reaches 0, it fires. notify is usually the way to go; wait works if you're already doing something on a background queue and can simply afford to block it (assuming the actual work items scheduled do not run on that queue).
Generally speaking, using notify is probably better/safer, especially if you don't actually know what queues might be involved in your asynchronous work. In your case it might be that before the task invokes your completion handler (which runs on a different queue from the main queue) it still has to do some stuff on the main queue, which you just blocked by calling wait.
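For onlookers, the enter/leave counter model described above can be sketched like this. The "work" is simulated with asyncAfter so the example is self-contained; in the original question it would be resume()ing a dataTask and calling leave() in its completion handler. The queue labels and delays are illustrative, not from the thread.

```swift
import Foundation
import Dispatch

let group = DispatchGroup()
var results: [Int] = []
let resultsQueue = DispatchQueue(label: "results")  // serializes access to `results`

for i in 1...3 {
    group.enter()  // counter += 1: one scheduled work item
    DispatchQueue.global().asyncAfter(deadline: .now() + .milliseconds(10 * i)) {
        resultsQueue.sync { results.append(i) }
        group.leave()  // counter -= 1: this item is done
    }
}

// Blocks the current queue until the counter reaches 0 again.
// Safe here only because the work runs on a different (global) queue.
group.wait()
print(results.sorted())  // prints [1, 2, 3]
```

Note that all three enter() calls happen before the single wait(), matching the "enter for ALL the items first" point above; group.notify(queue:) would be the non-blocking alternative to the wait() call.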

All this assumes you really wish to sequentially download those files into memory, as others have pointed out, and you want to avoid structured concurrency at all costs. Which is not a bad thing if part of your goal is to understand how structured concurrency actually works in such a context.
Basically, using the asynchronous data method does the same thing you would manually do with a completion handler; it just saves you thinking about hopping queues if needed and provides much nicer error handling (as it incorporates the throwing mechanism).

It's worth mentioning it in the docs somewhere. Do you mean that with HTTP/2 httpMaximumConnectionsPerHost is effectively 1 or do you mean that the effective count could be bigger than what I set it to?

Just by starting a new dataTask (or downloadTask) in the completion handler of the previous dataTask (downloadTask).
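That chaining could be sketched as a recursive helper like the one below. The base URL, file names, and function name are placeholders for illustration, not from the original posts, and the disk-writing step is elided.

```swift
import Foundation

// Sequential downloads: each dataTask starts the next one from its
// own completion handler, stopping the whole chain on the first error.
func downloadSequentially(_ names: [String],
                          session: URLSession = .shared,
                          completion: @escaping (Error?) -> Void) {
    guard let name = names.first else {
        completion(nil)  // all files done
        return
    }
    let url = URL(string: "https://example.com/files/")!.appendingPathComponent(name)
    session.dataTask(with: url) { data, response, error in
        if let error = error {
            completion(error)  // first failure aborts the rest
            return
        }
        // ... write `data` to disk here ...
        // Recurse for the remaining files. Note: this runs on the
        // session's delegate queue, not the queue you started on.
        downloadSequentially(Array(names.dropFirst()),
                             session: session,
                             completion: completion)
    }.resume()
}
```

The recursion replaces the loop; there is never more than one task in flight, which is exactly the sequential behaviour the question asked about.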

With downloadTask the system has already prepared the temporary file for you; it's just a matter of moving it to a proper place (typically an O(1) operation). When writing that temporary file the system (most likely) has used a small memory block to write the file piecewise. Compare this with the dataTask API: the system must create a large memory block in the form of Data and pass it to you. Even if the file is 100MB or 1GB, that's high memory pressure. It could even happen that part of that memory block travels to VM disk storage and back, maybe a few times, before you finally use it; this is yet another source of slowdown. Then you just write this big data block to a file and drop it away — you see, creating the large memory block was quite resource-heavy overall.
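A minimal sketch of that downloadTask pattern, assuming a caller-supplied destination URL (the function name and parameters are illustrative):

```swift
import Foundation

// The system streams the response to a temporary file; we only move it
// into place, which is cheap (a rename on the same volume).
func download(_ url: URL, to destination: URL,
              session: URLSession = .shared,
              completion: @escaping (Error?) -> Void) {
    session.downloadTask(with: url) { tempURL, response, error in
        if let error = error {
            completion(error)
            return
        }
        guard let tempURL = tempURL else {
            completion(URLError(.unknown))
            return
        }
        do {
            // Move (don't copy) before this handler returns: the system
            // deletes the temporary file once the handler is done.
            try FileManager.default.moveItem(at: tempURL, to: destination)
            completion(nil)
        } catch {
            completion(error)
        }
    }.resume()
}
```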

@Gero Thanks for the long explanation, I wish this was in the docs because the current description is confusing and apparently not even 100% correct.

Yes, I mixed those two up.

No, it's not. I do some work with another dataTask before I call the download function (with the loop, ...) and I never switch back to the main thread afterwards, so it's all still running on the same thread the earlier dataTask switched to. I didn't think this would cause any problems, as dataTask switches automatically, but if the download task happens to want to use the same thread group.wait() is called on (as @bjhomer hinted at), then that might explain why nothing seems to happen afterwards (the completion handler is never called).

Oh, so it is kind of like a semaphore (which people advise against using).

Not into memory. When the download for a single file is successful, I save it to storage with data!.write(to: localurl, options: .atomic) before continuing. I eventually also want to delete all downloaded files if there's an error, but that's currently still on my todo list.

Ah, so kind of the recursive approach again.

Thanks for the explanation!
The files are probably anything between 10 KB and 10 MB each, and usually there are (should be!) more small files than large ones.
Does downloadTask download into the temporary file directly or does it still save some of the data in memory temporarily?

It used a memory block as a buffer, but when writing a 10 MB file it could have used, say, a 100 KB block to write the file piecewise.

Ah, I see, that means you probably start the first task from the URLSession's delegate queue (that is what the completion handler is called on; the docs also say so). Then I see how your code deadlocks: you wait on that queue, so the completion handler can never be executed (it is scheduled after your call to wait, since the session/system obviously has to load the data before it can put the handler on the delegate queue).
For reasons like this I always tended to eventually jump back to the original queue (mostly the main queue, using DispatchQueue.main.async { /* call my own completion handler or something */ }). These days I try to use structured concurrency, as that makes this easier (I don't have to "remember" what queue I came from). It made it conceptually easier to keep track of what happens where (especially for others working with the code).

Yeah, kind of. I assume deep down it is more clever and perhaps does things a little differently, but conceptually it works like that/is comparable.


Yeah, I assumed so (especially since you mentioned "files") and would agree with @tera that there might be some issues in regards to memory. If your files are small enough, that could be not a big deal, but it's sure cleaner to use downloadTask. That's orthogonal to the general concept of what gets queued where and when, etc.
Also I'd challenge your original approach to load the files sequentially. If they're not somehow depending on one another, this might be a good fit for either using DispatchGroup as I described (basically loading them in parallel) or use structured concurrency's with[Throwing]TaskGroup. I've used this in the recent past and it worked pretty nicely (i.e. was easy to write and read). If you have many, many files, though, it might be smart to additionally "chunk" them into smaller sets to download in parallel.
That, however, depends on your use case and it's also a question of what to tackle first. Since I get the impression you also want to familiarize yourself with the various options you have, it might be smarter to tackle one after the other. :smiley:

I see, thanks!

My app used to use FTP for downloads and I always started the public functions with DispatchQueue.global(qos: .background).async and did DispatchQueue.main.async { completion(......) } at the end. Now I'm changing the code to also work with HTTP, and because dataTask already does the switching automatically, I thought it would be nicer to just keep reusing the background threads to avoid the constant switching and hopefully make it a bit faster too. The main thread is only used to display a loading screen (with the active file's name) and there's nothing users can do while it's doing its thing. Would you still recommend doing the manual switching at the start (-> background) and at the end (-> main), or at least switching back to a "regular" background thread in the completion handler to leave the "delegate queue"?

Probably 10kb to 10mb but usually on the smaller end. I don't know in advance how many there will be, there could be just 2-3 or 10 in that list.

No, the app only works if all files are there but I already download the most important file first (with dataTask but it's just a single file, so there's no problem) and the order of the other ones doesn't matter. If one file finishes downloading successfully, then the others most likely will too (unless internet drops out,..., which is always a possibility) but I still prefer to have the control over stopping everything instantly if one fails, instead of letting the kids run wild, so to speak.

You are correct. I want to understand why something works/doesn't work, no point just blindly picking something.

Well, it does switch to some internal queue to do the loading somehow, but it's important to remember that it does not switch back to whatever queue you called it on. That's actually not possible outside of structured concurrency, afaik, so if you need to switch back (for example to display some loaded data in the UI) you always have to do that manually with DispatchQueue.<some_queue_most_likely_main>.async.

In your case that seems not necessary indeed (at least not at this place, I assume at some later point you do need some UI stuff to show the loading is done, etc.), at least if you leave out the deadlock from wait.
What I personally advise against is having too many different paths for your "entry points" to be called. ("Entry points" used loosely here; I mean the points at which you start an asynchronous operation that switches queues.) For me that's a visibility thing: if at some point in your code you set off a dataTask from, let's say, an IBAction callback (i.e. the main queue) and then later you do the same thing from a different queue, that's not easy to see ("local reasoning" is otherwise a strength of Swift, imo; just here it fails). This even applies if it's not the exact same method, simply because it looks the same. The worst confusion can arise if you call an entry point recursively, the first time from one queue, but then from within the session's delegate queue. It's not per se wrong, and if you strive for performance it's perhaps even better, but it sure as hell is harder to correctly parse for any reader.
So unless performance is measurably impacted, I myself usually jump back to a specific queue (often the main queue) in my completion handlers for a dataTask, even if from further calls in the handler I start yet another dataTask. I know this means a thread hop and all, but in my stuff that has not been a performance issue so far (and I obvs. check that with Instruments). I also document this accordingly, just like dataTask says "this completion handler is called on the so-and-so queue". So I'd hesitantly recommend "manual switching at the end". :smiley:
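That "hop back at the end" convention could look something like the sketch below. The function and the Result-based handler are illustrative, not code from the thread; the point is only where the DispatchQueue.main.async lands.

```swift
import Foundation

// Wrapper around dataTask whose own completion handler is always
// invoked on the main queue, regardless of where the call started.
func fetch(_ url: URL,
           session: URLSession = .shared,
           completion: @escaping (Result<Data, Error>) -> Void) {
    session.dataTask(with: url) { data, _, error in
        // We are on the session's delegate queue here; do any heavy
        // processing now, while still off the main thread...
        let result: Result<Data, Error>
        if let error = error {
            result = .failure(error)
        } else {
            result = .success(data ?? Data())
        }
        // ...then hop back so callers can rely on being on main.
        DispatchQueue.main.async {
            completion(result)
        }
    }.resume()
}
```

Documenting the guarantee ("completion is called on the main queue") next to the function keeps the local reasoning intact for later readers.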

That really sounds like a perfect fit for withThrowingTaskGroup, I've used it for that exact purpose more or less recently (except I wasn't GETting files, but POSTing json payloads, potato, potato). It starts a bunch of Tasks (in which you can then easily use the async data variant to load) in parallel, but cancels any outstanding Tasks as soon as it encounters the first failure. I can elaborate more if you want, but I'd suggest using DMs for this to not needlessly ping everyone else here. :smiley: I can do so in a few hours when work finishes. We can do the same if you have further questions should you prefer to follow a different approach as this thread becomes more and more specific and probably less interesting for any "onlookers", hehe.
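For anyone following along, the fail-fast parallel download described there might look roughly like this (the function shape and the dictionary result are my own framing, not code from the thread):

```swift
import Foundation

// Downloads all URLs in parallel; the first thrown error cancels the
// remaining child tasks and is rethrown to the caller.
func downloadAll(_ urls: [URL],
                 session: URLSession = .shared) async throws -> [URL: Data] {
    try await withThrowingTaskGroup(of: (URL, Data).self) { group in
        for url in urls {
            group.addTask {
                // The async data variant; throws on network errors.
                let (data, _) = try await session.data(from: url)
                return (url, data)
            }
        }
        var results: [URL: Data] = [:]
        // Iterating the group rethrows the first failure; outstanding
        // child tasks are then cancelled automatically.
        for try await (url, data) in group {
            results[url] = data
        }
        return results
    }
}
```

To limit parallelism for a very long list, you could addTask for only the first N URLs and add the next one each time a result arrives, which is the "chunking" idea mentioned above.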

I update the UI at the start of each loop iteration (a simple DispatchQueue.main.async { self.dialog!.message = "file name here" } in a small extra function) but at the moment I only fully switch back to the main thread once everything's done, just before I load the next view controller.

I hope I understood you correctly: That's why I've got an extra class that handles everything related to server stuff (currently only GET, POST is still on my todo list). There are only a couple of public functions (e.g. "listFolderContent" and "downloadFiles") that decide what else (= private functions) has to be called exactly, and these are the only functions that also contain the thread switching code at the start and end in my FTP version. The visibility for stuff my regular VC classes have permission to access is pretty limited (and pretty good imo) like that, but it can get a bit complicated inside the HTTP class with all the completion handlers (and now also async).

So just a quick DispatchQueue.global(qos:.background).async { completion(....) }?

Did you happen to do any tests with parallel downloads with a bad internet connection?
I think I'll just try to get the current version working for now (hopefully won't forget to post the code) and then check out the task group. If necessary, I'll just ask a new question, that might be easier and not as off-topic for this thread.

Thanks for the help/input!


It's worth mentioning it in the docs somewhere.

Yeah, I filed a bug about that last year (r. 98788661).

Do you mean that with HTTP/2 httpMaximumConnectionsPerHost is
effectively 1 … ?

Yes.

This isn’t universally true — the system may end up opening more connections for various reasons, most notably mTLS — but it’s not going to start a second connection solely to improve parallelism. It’s that HTTP/1.1 behaviour that’s controlled by httpMaximumConnectionsPerHost.
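For reference, that setting lives on the session's configuration; a minimal sketch (the value 1 just mirrors the sequential behaviour discussed in this thread):

```swift
import Foundation

// Caps the number of simultaneous HTTP/1.1 connections per host.
// With HTTP/2 a single connection multiplexes streams regardless
// of this value, as explained above.
let config = URLSessionConfiguration.default
config.httpMaximumConnectionsPerHost = 1
let session = URLSession(configuration: config)
```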

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

Interesting. And if I make a few requests to the same host on several different URLSession() objects will I end up with a single connection still?

That httpMaximumConnectionsPerHost is effectively 1 for HTTP/2 (regardless of its value) means that when I set it to 1 – it is 1 in all cases (HTTP/1 and HTTP/2), which means that issuing all data/download tasks at once will always execute them sequentially be it HTTP/1 or HTTP/2 :wink:

And if I make a few requests to the same host on several different
URLSession objects will I end up with a single connection still?

No. Connections are never shared between sessions. This is one of the reasons why starting a session for each task is such an anti-pattern.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

Or I am totally wrong:

But what if I do need 3 independent "connections" (or "data streams", call it as you wish) to the same host? Will HTTP/2 do this automatically for me? How do I control the number if I can?

Speaking of session "etiquette":
The downloads described in my first post all happen shortly after the app starts (multiple dataTasks and the async part share the session), but there's also an upload that might happen after 30 minutes, maybe longer, maybe shorter, maybe even not at all. If I pass the previous session from the first to the second view controller, can this cause any connection/performance problems in the future if it's not in use for a while? Is it better to just close it in that case and open a new one when needed?

There's no need to manage sessions like that. While URLSession keeps the connection alive while it can, allowing future requests to start slightly faster, there shouldn't be any issue even if the server closes the connection. The recommendation is always to use the fewest URLSessions possible. I usually recommend one per host, especially if each host requires different authorization or standard headers.

Will HTTP/2 do this automatically for me?

Yes. That’s one of the key features of HTTP/2.

How do I control the number if I can?

URLSession gives you no control over that behaviour.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple


It seems reasonable for a setting like httpMaximumConnectionsPerHost, or some new, better-named httpMaximumStreamsPerHost, to control the SETTINGS_MAX_CONCURRENT_STREAMS setting in the case of HTTP/2; from an API user's point of view, whether it's a multitude of connections per host or a multitude of streams over a single connection, the end result is the same.


So just session=nil when I'm done with it and open a new one if there's an upload?

It's best to keep the session around unless you really need to reclaim the resources. In nearly all cases that's unnecessary. If you do need to get rid of the session, you'll want to call finishTasksAndInvalidate() or invalidateAndCancel() before niling out the reference, depending on which behavior you want.
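That teardown could look like this minimal sketch (the variable is illustrative; which invalidation call you pick depends on whether in-flight tasks should finish):

```swift
import Foundation

// A session should be invalidated before the last reference is dropped;
// among other things, a session holds a strong reference to its delegate
// until it is invalidated.
var session: URLSession? = URLSession(configuration: .default)

// ... use the session ...

session?.finishTasksAndInvalidate()  // lets in-flight tasks complete first
// or: session?.invalidateAndCancel()  // cancels everything immediately
session = nil
```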