Stability of Swift on Windows?

I'm sad to report that instability of Swift on Windows appears to have caused my project to drop support for Windows, despite having gotten direct support from @compnerd. In particular, about 1 in 3 CI runs on GitHub are failing, which creates too much development friction. Of course it's possible that this is a GitHub problem unrelated to Swift, but of course non-Swift Windows CI works fine on GitHub. I wonder if there's anyone out there other than @compnerd that's invested in supporting Swift on Windows? As excellent a developer as @compnerd is, we can't really call Swift on Windows supported if only one person is working on it.

12 Likes

sadly, this has been my experience and conclusion as well.

a related issue for me has been the unavailability of swift-atomics on Windows. pretty much anything that involves concurrency requires atomics, which implies that anything that involves concurrency cannot support Windows.

since i have a testing framework that supports concurrency, and depends on swift-atomics, that means i cannot run any tests on Windows.

For reference, we have not seen much in the way of instability in Windows CI for the Swift project itself, and nowadays we're running Windows CI on all pull requests alongside macOS and Linux. However, the Swift project CI is not based on GitHub's CI; it's separate Windows machines coordinated by Jenkins. So it's possible that there's something different in the GitHub CI environment for Windows that's causing problems.

It's also possible that something in your code base is exposing existing bugs in Swift's support for Windows. If you're making heavy use of async, it could be issue #57976, which can manifest as a stack overflow. (We're making progress on that one, but it's tricky because Microsoft's unwinder cannot support any calling convention with mandatory tail calls.)

In the issues you linked, I don't see much investigation into what's actually happening. Has anyone managed to catch the crash so we can see what's going on? Platform support is hard, especially for Windows because it's so different from Unix-like OSes, and the environment can matter in surprising ways.

Doug

7 Likes

Have you tried switching to the bash shell on Windows?

defaults:
  run:
    shell: bash

I guess part of this progress is this pull request? So there is hope I guess… Thank you for your efforts.

Yeah, we think we have a path forward now. We've been in touch with Microsoft, who confirmed our suspicions that we can't create a calling convention that has both mandatory tail-call optimization and structured exception handling (due to unwinder limitations on that platform), so we're trying to thread the needle here to address the bug without foreclosing on future improvements to backtrace facilities.

Doug

14 Likes

However, the Swift project CI is not based on GitHub's CI; it's separate Windows machines coordinated by Jenkins. So it's possible that there's something different in the GitHub CI environment for Windows that's causing problems.

In addition to the issues you mention, I suspect part of the problem is that GitHub CI makes only two CPU cores available in the free tier that most OSS projects use, whereas Apple is likely using much beefier Windows servers for your own CI. I always suspected that had a role to play in earlier reported Windows flakiness on GitHub CI.

@dabrahams and @taylorswift, have you tried paying for the larger runners on GitHub CI to see if that brings the error rate down? Obviously this would just be a workaround for any current performance issues, but it may tide you over until those Windows Swift bottlenecks are optimized to work on just two cores.

I have also seen lots of spurious build failures on Windows in the past, but I've found it has improved quite a lot with recent releases.

There are also a couple of other things you might try, short of dropping Windows support entirely.

  1. Use a stable release.

    I see this in the linked GitHub workflow:

    swift: [
              { version: "5.7", windows-branch: "development", windows-tag: "DEVELOPMENT-SNAPSHOT-2023-02-14-a" }
           ]
    
  2. Set the continue-on-error flag.

    If you have a job which runs tests on a matrix of different OSes, a build failure on Windows would mean the entire job fails, which can slow down development. By setting continue-on-error when building for Windows, the job will still run tests on other OSes even if the Windows run encounters a spurious failure.

    It's a bit of a band-aid, but I've found it significantly reduces the friction to running Windows tests.

This is what I do in my test workflow:

    strategy:
      matrix:
        swift-version: [5.3.3, 5.4.3, 5.5.3, 5.6.3, 5.7.2]
        host-os: [ubuntu-20.04, windows-latest]
        exclude:
          - host-os: windows-latest
            swift-version: 5.3.3
          - host-os: windows-latest
            swift-version: 5.4.3
          - host-os: windows-latest
            swift-version: 5.5.3
    # Job-level key; compare against the full matrix value so it is true on Windows.
    continue-on-error: ${{ matrix.host-os == 'windows-latest' }}

Notice that some Swift versions are not tested on Windows - I found things got a lot better with 5.6+.
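For the first suggestion above (using a stable release), the quoted matrix entry could point at a release toolchain instead of a development snapshot. This is only a sketch: it reuses the keys from the quoted workflow, and the branch and tag values are assumptions based on the usual swift.org release naming, so check them against the actual download listing before relying on them.

```yaml
# Hypothetical replacement for the snapshot entry in the quoted workflow.
# "swift-5.7.3-release" / "5.7.3-RELEASE" are assumed names, not verified here.
swift: [
  { version: "5.7", windows-branch: "swift-5.7.3-release", windows-tag: "5.7.3-RELEASE" }
]
```

If the flakiness disappears with a release toolchain, that would point at the snapshot rather than at the CI environment.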

1 Like

I'm sorry that the experience has been less than stellar. I was unable to actually reproduce the failure locally on a test machine (which, granted, is likely a bit better than the GHA builders, but not by much).

There is at least one change (Fix JSONMessageStreamingParser error message formatting by tristanlabelle · Pull Request #398 · apple/swift-tools-support-core · GitHub) that might be helpful here. A logic issue in swift-tools-support-core would sometimes result in diagnostics being discarded rather than rendered. A reproducible case of a failure (Failed to produce diagnostic with optionals and generics (minimal repro) · Issue #64238 · apple/swift · GitHub) gave the ability to track down the failure.

Two items that I think would possibly help here are:

  1. Telemetry collection
  2. Publicly available symbol servers or packaged symbols

I have been working on trying to see if we can collect some amount of debug information for the builds, which should at least help symbolicate crashes like these. However, an issue that I don't have any idea how to work around is the lack of access to the host. While we might be able to generate a minidump, collecting the minidump is still going to be an issue. Were we able to collect it, though, we should be able to get some insight into the failures and hopefully resolve them.
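One possible way to get a minidump off a hosted runner is to have Windows Error Reporting write local dumps and then upload them as a workflow artifact when the job fails. The following is an untested sketch, not something confirmed in this thread: it assumes the runner honors the WER `LocalDumps` registry settings, that the step runs with administrator rights, and that the crashing process is one WER actually captures. The step names, the `dumps` folder, and the `swift test` step are placeholders.

```yaml
steps:
  # Assumption: WER LocalDumps is honored on the hosted Windows runner.
  - name: Enable local crash dumps
    if: runner.os == 'Windows'
    shell: pwsh
    run: |
      $key = 'HKLM:\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps'
      $dumps = "$env:RUNNER_TEMP\dumps"
      New-Item -ItemType Directory -Path $dumps -Force | Out-Null
      New-Item -Path $key -Force | Out-Null
      # DumpType 1 = minidump, 2 = full dump
      New-ItemProperty -Path $key -Name DumpFolder -Value $dumps -PropertyType ExpandString -Force | Out-Null
      New-ItemProperty -Path $key -Name DumpType -Value 1 -PropertyType DWord -Force | Out-Null

  # Placeholder for the project's real build and test steps.
  - name: Build and test
    run: swift test

  # Runs only when an earlier step failed; uploads whatever dumps were written.
  - name: Upload crash dumps
    if: failure() && runner.os == 'Windows'
    uses: actions/upload-artifact@v4
    with:
      name: windows-crash-dumps
      path: ${{ runner.temp }}/dumps
      if-no-files-found: ignore
```

Without matching symbols the dumps would still be hard to read, which is why the symbol-server item in the list above matters as well.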

7 Likes

i choose to invest time in adding windows support to my open source libraries because i remember the swift 3.x days when virtually no libraries supported linux out of the box, and how obnoxious that was since most of the libraries could have been made linux compatible with minimal tweaks.

i am willing to make a reasonable amount of effort to support windows as a service to the community even though the platform has zero relevance to our business interests. but i receive no external backing for our projects (nor do i ask for any) and in my view, expecting library publishers to pay for windows CI we do not use is not a realistic proposal.

9 Likes

+1

I think the best thing the rest of us can do to help Swift on Windows, is to ensure that libraries are available for people who want to use it, and that we make a best effort to ensure everything works.

That's why I suggest using continue-on-error for the Windows build if it's being flaky. You'll see if those builds fail and can investigate the failure to decide whether you've actually broken support for that platform, but flaky tasks won't block other CI tasks from executing, so it won't get in the way of other platforms.

Obviously it is less than ideal, but it's better than dropping support and saying you don't even want to know about those build failures.

3 Likes

Absolutely, having libraries that are usable on Windows would be a huge boon, and a requirement almost in my mind.

Personally, I'd also like to request that people file issues with as much information as possible (which @dabrahams actually did!).

In order to improve the state of Windows, we need to get more systematic about the polish, and to do that, it really does help to have a concrete list of items so that we can classify the types of problems and work through them.

As a concrete example, we are finally in the final stages of removing the alterations to Visual Studio. This means that repairing the Swift toolchain after updates to Visual Studio will no longer be needed. Furthermore, we no longer have ordering dependencies either. This work required a lot of threading of information through the build itself and has forced a few more changes to actually take place. The problem is that staging these changes is often a time-consuming process that needs to be done carefully, so it has taken a while before we could address it.

Identifying the top pain points is important because it helps focus on the items that would improve things the most. I know that the debugging story is still painful, and the stability of concurrency is a problem, which is currently a priority item. The other piece that remains an issue, and is a priority, is some amount of work to improve SPM-based builds. Once the current set of build regressions is resolved, I am hopeful that the other pain points can start being ameliorated.

20 Likes

Thanks for your reply, Doug!

That's consistent with the fact that @compnerd can't reproduce the problem on his own machine.

If you're making heavy use of async.

We're not using async at all, unless something in the Windows implementations of the runtime or Foundation is using it. Our codebase is 100% supposedly-portable non-async Swift code, with nothing of significance in an #if os(...) block AFAICT.

In the issues you linked, I don't see much investigation into what's actually happening. Has anyone managed to catch the crash so we can see what's going on?

If by "catch," you mean, "observe on a local machine," then unfortunately not. It doesn't seem to be just one thing; sometimes it crashes during the build phase, sometimes during the test phase. That might suggest that it's an SPM issue(?)

Platform support is hard, especially for Windows because it's so different from Unix-like OSes, and the environment can matter in surprising ways.

Sure, and no shade thrown on those trying. I'm calling attention to the issue because it would be sad if Swift got a reputation of not-really-supporting Windows because of how things end up playing out in the very common GitHub CI scenario.

I don't think we've tried switching to the bash shell. Do you have any reason to believe that might be the key?

I have not tried the larger runners. Do you have any reason to believe that would make a difference?

We're using the version recommended by @compnerd, who wrote our Windows CI actions. I figured if the release version was a better bet, he'd have used it. As for continue-on-error, Windows failures have not prevented our tests from completing on other platforms.

We very much appreciate your efforts! That said, I don't think that change can possibly address the problem. In all the cases we're concerned about, the job completes OK if you just re-run it (enough times).

an issue that I don't have any idea of how to workaround is the lack of access to the host

The only idea that occurs to me (and I'm just guessing here) is that maybe you could somehow run a virtualized Windows instance, to which you would have complete access, on the (probably already virtualized) host. I'm sure that's nontrivial, if it's even possible, though.

Sorry, I didn't mean to imply that it would solve the issue, but rather that it may help us understand what the failure is. I think the struggle with the GHA builder so far has been gaining an understanding of the failure. Were these local, we would have minidumps (akin to core dumps on Unix), which would allow us to inspect what occurred so that we could address the issue.

I intend to spend some time thinking about how to collect telemetry so that we can better analyze and repair issues that we encounter as Swift starts to gain broader usage on Windows.

i am using async. the very few things i am working on that do not use async still use swift-atomics, and i have not been able to get either of those two things working on Windows.

(for those keeping score, swift-nio depends on swift-atomics, so no atomics means no networking either!)

i don’t mean to distract from the very valuable efforts to get Windows CI working for swift projects. in particular Fix JSONMessageStreamingParser error message formatting by tristanlabelle · Pull Request #398 · apple/swift-tools-support-core · GitHub is very encouraging to me as i have seen that exact CI failure many times. i hope that PR gets merged soon.

i just mean this as a reminder that there needs to be proportional effort from the swift project leadership towards supporting concurrency and atomics on Windows, because fixing the CI problems will have limited impact until concurrency and atomics become available as well.

Are you suggesting it's likely that it's trying to emit a diagnostic in these flaky cases—even though the code itself shouldn't generate one—and then crashing? Consider again that the crashes sometimes show up during testing.

I wonder if we'd get reliability by forcing single-threaded operation? That might be worth an experiment, because it would probably indicate a race condition in the implementation of something used by SPM.
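One way to try the single-threaded experiment is to cap SwiftPM at one build job and run the tests serially. This is a minimal sketch that assumes the project builds with SwiftPM directly; the step names are placeholders, and `-j 1` only limits SwiftPM's own build jobs, so it may not remove every source of concurrency in the toolchain.

```yaml
steps:
  # Build the package and its tests with a single SwiftPM job.
  - name: Build (one job)
    run: swift build --build-tests -j 1

  # swift test runs tests serially unless --parallel is passed.
  - name: Test (serial)
    run: swift test --skip-build
```

If the failure rate drops noticeably with this configuration, that would support the race-condition theory; if it stays the same, the cause is more likely elsewhere.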

I don't have any experience of my own with it, but you can find statements like “After switching from PowerShell to bash … we have not had any of the random failures on Windows we were seeing before”.

Edit: Note that the default shell for GitHub Windows builds is PowerShell, maybe this is why nobody can see a problem when using it on a local machine (using cmd)? Maybe changing to cmd is enough to resolve the random crashes?

Interesting. @compnerd, maybe you should try that too.

I don't find that obnoxious; it makes sense for them not to support platforms they don't use. What I find obnoxious is when somebody submits a pull request with those small tweaks and they don't respond (obviously, larger tweaks are a different matter and are completely up to them to merge or reject).

I think you mean "windows CI for an OS we ourselves do not use," as you would be using the CI. I don't think anybody is expecting it either, it was a suggestion to try that and see if it made a difference.

Since you later note that this is likely a race condition, more cores are likely to lead to less contention and a lower failure rate. Do you disagree? It is worth trying to see if it makes much of a difference, after which you can decide if it's worth maintaining.