Need help to debug hanging URLSessionDataTask on Linux with FoundationNetworking

Hello,
Before I describe my problem in detail, I would like to summarize the situation I am in, because I think it is important for understanding the problem. I hope someone can give me an idea that helps me troubleshoot. At the moment I am running out of ideas.

  • I have an elaborate project with 240 test cases
  • All tests run smoothly with multiple (4) workers on the Mac and also in a Linux environment
  • All tests run smoothly with 1 worker on the Mac.
  • One test fails if a single worker runs the entire suite on Linux
  • Executing the failed test alone works and the test turns green
  • If I rename the test so that it runs in a different order, it works and turns green
  • If I leave the test where it is and run the entire suite, it fails.

Let's take a look at the interesting parts of the call chain.

The test:

func testGetExampleCallback() async throws {
   let cbi = ScriptProvider()
   let expect = expectation(description: "userValidate response")

   cbi.start(class: .userLogin, arguments: userOK) { result in
      do {
         let bodies = try result.get()
         XCTAssertEqual(bodies.count, 1)
         [...]
      } catch {
        if let err = error as? ScriptProvider.ScriptError {
           switch err {
              case .timeout:
                 XCTFail("Call timed out")
              [...]

cbi is an instance of my ScriptProvider class, which I start with .userLogin and some arguments. The details are irrelevant for the test, except that the arguments contain a script that calls a fetch method. userOK implements a call to http://example.com:

fetch("http://example.com", {
 method: "get"
})

So far so good. cbi has a DispatchGroup:

private let group = DispatchGroup()

In start(class:arguments:), the script is evaluated under this group, and the group is notified when the script has finished. As long as SCRIPT_TIMEOUT is not reached, the script can do what it has to do (in our case: fetch):

_ = group.wait(timeout: DispatchTime.now() + DispatchTimeInterval.seconds(Constants.PROVIDER.SCRIPT_TIMEOUT))

I wait until the script has finished executing:

    /// Is called when `commit` is called from within the script context
    ///
    /// - Parameter data: Array of strings that are committed from within the script context
    ///
    func valueDidCommitted(data: [String?]) {
        // set result to class variable
        committedResults = data

        // leave the DispatchGroup that is opened in `run()`
        group.leave()
    }
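For readers following along, the wait-and-leave pattern described above can be sketched roughly as follows. This is only an illustration and not the actual ScriptProvider code: the names WaitOutcome and waitForScript are hypothetical, and the sketch assumes the group is entered once before the script is started. Unlike the `_ =` in the snippet above, it inspects the DispatchTimeoutResult so that a timeout is visible to the caller.

```swift
import Foundation
import Dispatch

// Hypothetical outcome type, not part of the original project.
enum WaitOutcome {
    case finished
    case timedOut
}

/// Starts `work` on a background queue and blocks the calling thread until
/// `work` calls its `done` closure or `timeout` elapses.
/// Assumption: `done` is called exactly once, mirroring the single
/// `group.leave()` in `valueDidCommitted(data:)`.
func waitForScript(timeout: DispatchTimeInterval,
                   work: @escaping (@escaping () -> Void) -> Void) -> WaitOutcome {
    let group = DispatchGroup()
    group.enter()
    DispatchQueue.global().async {
        work { group.leave() }
    }
    // wait(timeout:) reports whether the group emptied in time.
    switch group.wait(timeout: .now() + timeout) {
    case .success:
        return .finished
    case .timedOut:
        return .timedOut
    }
}
```

Note that group.wait(timeout:) blocks the calling thread for the whole wait. One thing that might be worth ruling out is whether that blocked thread or queue is the same one that is needed to deliver the URLSession callback in the failing configuration.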

That should give you all the details of the context around the method where the problem occurs. I don't think the context is the problem here, but after several days of debugging I want to show everything to anyone who can help.
I'm pretty sure I've overlooked something. As far as I understand, the wait is not the problem, because the script evaluates and calls the fetch method. That method executes, but the URLSessionTask inside it does not. (But only on Linux, only when I run the whole test suite, and only with a single test worker.)

Let's take a look at the implementation of the "fetch" method.
First I get a URLSession with let session = URLSession.shared (I also tried a standard session with different configurations, without success).
I set some headers and an optional body (for POST) and prepare the request before I create my task:

let task = session.dataTask(with: request) { data, response, error in
    Log.info("REQUEST A")
    [...]
}
Log.info("REQUEST 0")
print("1. Task state: \(task.state)")
task.resume()
print("2. Task state: \(task.state)")
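As a debugging aid for a snippet like the one above, one option is a throwaway session with short timeouts and its own callback queue, so that a stuck request should end in the completion handler with an error instead of running into the script timeout. This is a minimal sketch under assumptions: makeDiagnosticSession, the queue name, and the timeout values are made up for illustration, and it may overlap with configurations already tried.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on Linux
#endif

/// Builds a session intended for debugging only (hypothetical helper):
/// no shared cache or cookies, short timeouts, dedicated callback queue.
func makeDiagnosticSession() -> URLSession {
    let config = URLSessionConfiguration.ephemeral
    // Assumed values; pick something well below SCRIPT_TIMEOUT.
    config.timeoutIntervalForRequest = 5
    config.timeoutIntervalForResource = 10

    // Completion handlers are delivered on this queue rather than on a
    // queue the session creates itself.
    let callbackQueue = OperationQueue()
    callbackQueue.name = "fetch.diagnostic.callbacks"
    callbackQueue.maxConcurrentOperationCount = 1

    return URLSession(configuration: config,
                      delegate: nil,
                      delegateQueue: callbackQueue)
}
```

If the completion handler fires with a timeout error under this session, the task itself is stalled in the transfer. If it still never fires, the callback delivery is the more likely suspect.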

This test works in every situation except on Linux with a single test worker.
But that is exactly the setup I need in order to support testing in a dockerized VS Code environment.

If I run this test as part of the whole suite in a Linux Docker container, ALL OTHER tests run fine (even those that use the same fetch method), but this one logs REQUEST 0, then 1. Task state: suspended, then 2. Task state: running, and then times out.
I never get into the task callback and never see the REQUEST A log line.

Since I see REQUEST 0, I'm pretty sure there is no other lock. It must be the task blocking or waiting for something. I have raised the network connection limit and the maximum number of open files on Linux, and I keep looking for any clue, but so far I have found nothing.

How can I debug and fix this problem? It only occurs in this situation and only in this particular order (the test runs at the very end of my suite).
I think another process/task/whatever is blocking the task. But what irritates me is that this does not happen on the Mac. And using multiple workers also changes the result.
I have no idea how to find the origin of the problem.

I've tried instantiating a new URLSession every time I call the method, I've tried using a global session, I've also tried resetting and flushing it.
I did my best in the debugger, but after the third night I need some new ideas from you.

Thanks for any guesses, thoughts or support to help me solve the problem,
Kris

Details: Swift 5.9.2, Linux eaada6d1d3ec 6.5.0-15-generic #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan 9 22:39:36 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux, Colima docker environment.

I hate to be 'that guy' but is there a particular reason you're using URLSession instead of AHC?

URLSession on Linux is pretty broken and we (Vapor) have been trying to push people away from it for a while. It constantly causes issues, has missing APIs and missing annotations.

AsyncHTTPClient, meanwhile, has a modern API, is well tested, and works on Linux.

Not only Linux, you can use it on Darwin platforms as well. I don't think there's a particular downside to adopting it everywhere, unless you also need to support Windows at the same time, which isn't supported by AHC yet.

Thank you very much.
I will rewrite the request section as soon as I can and compare the test results. Hopefully I can manage it this week; I will update the thread with details as soon as I have them.

Given it pulls in the entirety of NIO, it's a pretty hefty dependency on macOS. Of course, Foundation is hefty on Linux, but you may already be paying part of that anyway, depending on what you're doing.

Would really appreciate URLSession working on other platforms.

This is spot on. Exactly what I wanted to say.

I refactored my code in the fetch method and replaced URLSession with AsyncHTTPClient.
All problems are solved!

Thank you very much.
