XCTests stalling in Ubuntu GitHub Actions job

skiingcows · June 14, 2024, 7:25pm

I am working on a PR to add Linux support to opentelemetry-swift. I have everything working locally in an Ubuntu VM and am trying to add a GitHub Actions job to run the tests on ubuntu-latest.

Unfortunately, the Linux tests stall after some number of tests have run. I started with the swift:5.10 Docker image, then switched to installing Swift with swiftly and running the tests that way, in case the Docker image was somehow causing the problem (even though the image also worked locally). I also tried removing the overrides of some XCTest methods in case they were breaking things, but that made no difference either.

I let one of the jobs time out yesterday. Interestingly, the raw logs show that nothing was printed from the moment the job appeared to get stuck until the job was cancelled, at which point some XCTest output was printed. That seems somewhat suspicious, but I'm not sure what to do with the information. raw logs - actions summary

So far I've only seen it stall after AggregationsTests.testDropAggregation passes, or somewhere in the Base2ExponentialHistogramAggregationTests test case, which seems very odd. Neither of those test cases is doing anything remotely interesting, though.

The tests aren't being run in parallel, so there really shouldn't be significant differences between running them locally and running them in GitHub Actions. Obviously something is very consistently breaking in one environment and not the other, though.

Any thoughts on what might be happening here?

skiingcows · June 15, 2024, 12:54am

I used mxschmitt/action-tmate@v3 to get an SSH session on the runner. Fortunately the swift Docker image includes lldb, so debugging was straightforward. The only catch was needing to run settings set target.disable-aslr false in lldb before running the test binary [1].

This revealed what appeared to be a deadlock on an NSCondition. Some test functions set the delay on the condition variable's wait operation to 0, so a background worker thread was essentially thrashing the lock. I set the delay to a small but non-zero value and the problem went away. Oddly, neither of the places where the test run appeared to be stuck according to the logs was related to the cases where this could happen. My best (but still bad) guess is that the background threads were effectively spin-waiting and starving the main thread of CPU time.

The background threads also never exited, because they never checked for thread cancellation. That issue is basically irrelevant in normal usage of the library, but it may have compounded the problem in the tests, where multiple workers were created.
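For anyone hitting something similar, here is a minimal sketch of the two changes described above: a non-zero wait on the condition variable and a cancellation check in the worker loop. This is not the actual opentelemetry-swift code. PollingWorker, its members, and the 0.01 second floor are hypothetical names and values chosen for illustration.

```swift
import Foundation

// Hypothetical worker for illustration only; not the library's implementation.
final class PollingWorker: @unchecked Sendable {
    private let condition = NSCondition()
    private var thread: Thread?

    // Effective wait between iterations, in seconds.
    let interval: TimeInterval

    init(interval: TimeInterval) {
        // Assumed floor of 10 ms: a caller passing 0 would otherwise make
        // wait(until:) return immediately and turn the loop into a busy spin.
        self.interval = max(interval, 0.01)
    }

    func start() {
        let worker = Thread { [weak self] in
            // Re-check cancellation on every pass so the thread can exit.
            while !Thread.current.isCancelled {
                guard let self = self else { return }
                self.condition.lock()
                // Blocks until signalled or until the deadline passes.
                _ = self.condition.wait(until: Date().addingTimeInterval(self.interval))
                self.condition.unlock()
                // Periodic work (for example, exporting a batch) would go here.
            }
        }
        thread = worker
        worker.start()
    }

    func stop() {
        thread?.cancel()
        // Wake the worker so it notices the cancellation without waiting
        // for the full interval.
        condition.lock()
        condition.signal()
        condition.unlock()
    }
}

// Usage: even with a requested delay of 0, the worker sleeps between passes.
let worker = PollingWorker(interval: 0)
worker.start()
Thread.sleep(forTimeInterval: 0.1)
worker.stop()
```

If stop() lands just before the worker begins waiting, the signal is missed and the thread exits after at most one more interval, which seems acceptable for a test helper.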

The main fix commit is here, with a minor follow-up for the cancellation logic here.

[1] The test binary is the .xctest file in the .build folder, built with swift build --build-tests.