Illegal instruction… but where?

In non-interactive contexts, would it be too noisy to have the backtracer first run without attempting any symbolication, emitting the full raw backtrace with only addresses, and then do the work of generating the pretty, human-readable backtrace?

We can't really do that, sadly. On Linux, at least, we need the symbols so that we can tell whether or not we're in an async frame.

Or, put another way, we could do that, but we'd then give a totally different backtrace the second time around. We already know this is confusing, because fatalError() already does something like this in some cases and people complain that it's failing to look up symbols.

1 Like

Are the pthread TLS APIs well-behaved enough in a broken process to check for a set task pointer in its TLS slot without first looking at symbols?

When I try it here, I absolutely can abort the program with ^C during the backtracing process. Based on some of the previous comments, I had thought that wouldn't work (I assumed I'd left the signal mask in the wrong state), but it does seem to work.

1 Like

pthreads are definitely not async-signal-safe (that's fairly explicit in the POSIX spec), so even if a pthread call happens to work today, there's no reason that Glibc or Musl couldn't turn around and break us.

A tangent question: we don't have a built-in diagnostic ability to log every function call with its parameter and result values, or do we?

Given this example:

func foo(_ a: Int, b: String) -> Double {
    // ideally no syntax "overhead" at all here
    // if that's not possible, a one liner like `@trace`
    ...
    bar(a)
    ...
    return 3
}

func main() {
    // ditto
    ...
    foo(1, b: "2")
}

To get this result:

2023-11-06T10:20:30+726351 MainThread    main() entered
2023-11-06T10:20:30+726373 MainThread        foo(1, b: "2") entered
2023-11-06T10:20:30+726391 thread:42     something else here
2023-11-06T10:20:30+726440 MainThread            bar(1) entered
...
2023-11-06T10:20:30+726471 MainThread        foo(...) returns 3
2023-11-06T10:20:30+726374 thread:24                           something else here
2023-11-06T10:20:30+726382 MainThread    main() returns

I've done this manually on many occasions, but it's quite error-prone to add and a lot of boilerplate to maintain. Is it possible to automate it somehow? Ideally without introducing any boilerplate into the source code itself (though I guess that would require a significant change to the compiler?)
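Short of compiler support, a small helper can at least shrink this to one line per function. A sketch (`traced` is a hypothetical name I made up for illustration, not anything built in; it logs entry immediately and hands back a closure to run via `defer` on exit):

```swift
// Hypothetical `traced` helper: logs entry now, returns a closure that
// logs the exit, so each function pays exactly one line of boilerplate.
func traced(_ args: Any..., function: String = #function) -> () -> Void {
    let rendered = args.map { "\($0)" }.joined(separator: ", ")
    print("\(function) entered (\(rendered))")
    return { print("\(function) returns") }
}

func bar(_ a: Int) {
    let exit = traced(a); defer { exit() }
    _ = a &* 2  // stand-in for real work
}

func foo(_ a: Int, b: String) -> Double {
    let exit = traced(a, b); defer { exit() }
    bar(a)
    return 3
}

_ = foo(1, b: "2")
```

Timestamps, thread names, and return values would need more plumbing, and this still touches every function body, which is exactly the boilerplate being complained about here.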

okay, so i spent some time trying to figure out what is different between a docker environment and the server environment, so i used a small test program and ran it inside a docker container.

func outer()
{
    inner()
}

func inner()
{
    1 + { Int.max }()
}

outer()

it generates the expected backtrace.

$ ./crash 

💣 Program crashed: Illegal instruction at 0x000055f7680b1a25

Thread 0 "crash" crashed:

0 0x000055f7680b1a25 inner() + 21 in crash
1 0x000055f7680b1a09 outer() + 8 in crash
2 0x000055f7680b19f9 main + 8 in crash

Illegal instruction (core dumped)

then i copied it to a real EC2 instance, and tried running it there. no backtrace.

[ec2-user@ip-172-31-90-47 ~]$ bin/crash 
Illegal instruction (core dumped)

so i tried to brainstorm possible reasons why it works in a docker container but not in a real cloud instance.

  1. the docker container runs Amazon Linux 2, but the real EC2 machines today use AL2023, which is not officially supported by swift.

  2. i install the swift runtime manually on the EC2 machines by sudo-copying the usr/lib directory of the toolchain to /usr.

i had always assumed that was sufficient to run swift programs on a server, but then i took a second look at a modern (5.9) toolchain distribution and then it dawned on me:

$ ls toolchain/
usr
$ ls toolchain/usr
bin  include  lib  libexec  local  share
$ ls toolchain/usr/libexec/
swift
$ ls toolchain/usr/libexec/swift/
linux
$ ls toolchain/usr/libexec/swift/linux/
swift-backtrace

everything makes sense now! (famous last words)

$ sudo cp -r toolchain/usr/libexec/swift /usr/libexec
$ bin/crash 

💣 Program crashed: Illegal instruction at 0x000055556ccf2a25

Thread 0 "crash" crashed:

0 0x000055556ccf2a25 inner() + 21 in crash
1 0x000055556ccf2a09 outer() + 8 in crash
2 0x000055556ccf29f9 main + 8 in crash


then i packaged the test program into a daemon and activated the daemon to see if anything is different. the daemon also gets backtraces now, but for some reason, they look slightly different.

*** Program crashed: Illegal instruction at 0x00005581d997ba25 ***
Thread 0 "crash" crashed:
0      0x00005581d997ba25 inner() + 21 in crash
1 [ra] 0x00005581d997ba09 outer() + 8 in crash
2 [ra] 0x00005581d997b9f9 main + 8 in crash
Registers:
rax 0x8000000000000001  9223372036854775809
rdx 0x0000000000000000  0
rcx 0x0000000000000000  0
rbx 0x00007ffc84f648c8  c2 4e f6 84 fc 7f 00 00 d2 4e f6 84 fc 7f 00 00  ÂNö·ü···ÒNö·ü···
rsi 0x0000000000000001  1
rdi 0x00007ffc84f648b8  a9 4e f6 84 fc 7f 00 00 00 00 00 00 00 00 00 00  ©Nö·ü···········
rbp 0x00007ffc84f64780  90 47 f6 84 fc 7f 00 00 09 ba 97 d9 81 55 00 00  ·Gö·ü····º·Ù·U··
rsp 0x00007ffc84f64780  90 47 f6 84 fc 7f 00 00 09 ba 97 d9 81 55 00 00  ·Gö·ü····º·Ù·U··
 r8 0x0000000000000000  0
 r9 0x0000000000000000  0
r10 0x00007fd43a0101a8  3f 17 00 00 12 03 0b 00 90 b7 3f 00 00 00 00 00  ?·········?·····
r11 0x00007fd43a3fb790  41 56 53 50 49 89 fe 48 89 3c 24 48 8b 5f 08 48  AVSPI·þH·<$H·_·H
r12 0x00007ffc84f648b8  a9 4e f6 84 fc 7f 00 00 00 00 00 00 00 00 00 00  ©Nö·ü···········
r13 0x00005581d997b9f0  55 48 89 e5 e8 07 00 00 00 31 c0 5d c3 0f 1f 00  UH·åè····1À]÷··
r14 0x0000000000000000  0
r15 0x00007fd43a882000  20 32 88 3a d4 7f 00 00 12 00 00 00 00 00 00 00   2·:Ô···········
rip 0x00005581d997ba25  0f 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48  ··f········UH·åH
rflags 0x0000000000010202
cs 0x0033  fs 0x0000  gs 0x0000
Images (17 omitted):
0x00005581d997b000–0x00005581d997bca8 <no build ID> crash /home/ec2-user/bin/crash

so far, so good.

next i tried triggering the restart endpoint i implemented in the real swiftinit server. no backtrace.

Nov 07 01:58:51 ip-172-31-90-47.ec2.internal launch-server[760366]: UnidocServer/Server.Endpoint.Admin.swift:63: Fatal error: Restarting server...
Nov 07 01:58:51 ip-172-31-90-47.ec2.internal launch-server[760365]: /home/ec2-user/bin/launch-server: line 6: 760366 Illegal instruction     /home/ec2-user/bin/UnidocServer --mongo localhos>

i tried this multiple times (YOLO) in case the process needed to rediscover the /usr/libexec/swift/linux/swift-backtrace binary, but still did not get any backtraces.

so it seems that installing the missing swift-backtrace binary was necessary, but not sufficient, to obtain backtraces.

one thing i noticed is that systemctl does not wait the usual few seconds it takes to collect a backtrace. so i thought this might be related to how the binary is being invoked. but launching the server directly, without any intermediate scripting layers, didn’t produce backtraces either; it also prints

UnidocServer/Server.Endpoint.Admin.swift:63: Fatal error: Restarting server...
Illegal instruction

and exits immediately.

the more things i try, the less i understand about this…

2 Likes

The reason for the different appearance is that the backtracer changes its behaviour (by default) depending on whether or not various things are attached to a terminal. In particular, it will turn off Unicode and ANSI escape sequences and will default to more verbose output when it's executed in a pipe or with its output redirected. All of this is controllable via the SWIFT_BACKTRACE environment variable, should you decide that you don't like the automatic choices.
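For example, assuming the option names from the toolchain's backtracing documentation (the variable takes comma-separated key=value pairs, and the exact option set may vary by Swift version), you could force the compact coloured output even when piping:

```
$ SWIFT_BACKTRACE=enable=yes,color=yes ./crash
```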

As regards it not working with your server, did you link the server dynamically or statically? swift-backtrace is located by finding the directory containing the runtime and looking relative to that, but when statically linked we instead use the directory containing the executable, so we may end up looking in a different set of locations, depending on where you chose to install your server executable.

i never fully understood how static linking works in swift, because there is a lot of contradictory information out there. according to SE-0342, the default SPM compilation mode uses static linking, and i am using the default SPM compilation mode, so one could assume the server binary is statically linked.

but this doesn’t make too much sense, because the binary never worked unless i installed the runtime in /usr/lib, and when i inspect the binary with ldd, it shows a lot of dynamically-linked runtime libraries.

$ ldd bin/UnidocServer 
	linux-vdso.so.1 (0x00007ffd40d99000)
	libswiftCore.so => /usr/lib/swift/linux/libswiftCore.so (0x00007fbe02400000)
	libswift_Concurrency.so => /usr/lib/swift/linux/libswift_Concurrency.so (0x00007fbe04040000)
	libswift_StringProcessing.so => /usr/lib/swift/linux/libswift_StringProcessing.so (0x00007fbe03f7d000)
	libswift_RegexParser.so => /usr/lib/swift/linux/libswift_RegexParser.so (0x00007fbe02ae6000)
	libswiftGlibc.so => /usr/lib/swift/linux/libswiftGlibc.so (0x00007fbe03f6a000)
	libBlocksRuntime.so => /usr/lib/swift/linux/libBlocksRuntime.so (0x00007fbe02000000)
	libdispatch.so => /usr/lib/swift/linux/libdispatch.so (0x00007fbe01c00000)
	libswiftDispatch.so => /usr/lib/swift/linux/libswiftDispatch.so (0x00007fbe03f37000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fbe02325000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fbe03f2e000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007fbe03f29000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fbe03f22000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fbe01800000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fbe02acc000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fbe01400000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fbe040ae000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fbe03f1d000)

this suggests to me that the docs are wrong and SPM still uses dynamic linking by default, even on linux.

i have also tried installing the rest[1] of the toolchain files, including the stuff in /usr/bin, /usr/include, /usr/local, and /usr/share. that didn’t change anything either.

$ /home/ec2-user/bin/UnidocServer --mongo localhost --certificates /home/ec2-user/certbot/live/swiftinit.org/ --port 8443
debug: bound to :::8443
UnidocServer/Server.Endpoint.Admin.swift:63: Fatal error: Restarting server...
Illegal instruction


[1] by the way, how do you uninstall them?

[1] by the way, how do you uninstall them?

You can just delete the extra files (annoying to do, for sure). As you found, those aren't the problem here.

It looks from your ldd output like it's dynamically linked, so it should be finding the backtracer. It may be that you've found a bug in the backtracer here… if that crashes, you currently won't see any output from it at all (I'm working on a PR that resolves that so that at least you get the address at which you crashed). If you can give me some instructions to get this set up the same way you have, I can take a look at what's going on.

finally, i have figured out what is causing the server to fail to print backtraces in the production environment.

during the deployment process, there is a post-build action that runs

$ sudo setcap CAP_NET_BIND_SERVICE=+eip /path/to/unidoc-server

this is needed in order for the server binary to successfully bind to port 443, which is a privileged port.

for reasons i do not understand, setcap interferes with backtrace collection; if i don’t perform the setcap step and launch a test server on an unprivileged port, i get backtraces as expected. if i perform the setcap step and do the same thing, i do not get backtraces.
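for anyone trying to reproduce this, the file capability can be inspected and removed with the standard libcap tools (the path here is a placeholder):

```
$ getcap /path/to/unidoc-server          # prints cap_net_bind_service if the capability is set
$ sudo setcap -r /path/to/unidoc-server  # removes all file capabilities
```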

i don’t know if setcap itself is to blame, or if the problem is occurring through some indirect mechanism triggered by setcap. but hopefully this helps.


edit: here is a stackoverflow question that suggests setcap interferes with LD_LIBRARY_PATH.

but it’s not clear to me how to transfer the solutions offered to an SPM project.

4 Likes

This is incorrect. That proposal was accepted, but is not marked as implemented. Ergo, as of Swift 5.9 SwiftPM produces binaries that link core libraries dynamically by default.

Interesting. I’ll take a look and see if I can work out why that would upset things. Thanks for persisting with this; hopefully it’ll be obvious why this breaks things and what to do about it.

Update: the problem seems to be as follows: the backtracer is deliberately designed not to operate from privileged executables, to avoid creating a security hole. On Linux, it checks whether it's running in a privileged process by means of getauxval(AT_SECURE). The most common way to trip this check is of course a setuid binary, but it turns out that executables that have capabilities set on them will also trigger it.
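For the curious, the flag is visible from user code too. A minimal Linux sketch that reads the auxiliary vector directly, rather than assuming getauxval is importable through the Glibc module (AT_SECURE's value, 23, comes from <elf.h>):

```swift
#if os(Linux)
import Glibc

// The kernel sets AT_SECURE for setuid/setgid executables and for
// executables carrying file capabilities; the runtime declines to spawn
// swift-backtrace when the flag is set.
let AT_SECURE: UInt = 23  // from <elf.h>
var secure: UInt = 0
if let auxv = fopen("/proc/self/auxv", "rb") {
    defer { fclose(auxv) }
    // The auxiliary vector is a sequence of (key, value) machine words.
    var entry: (key: UInt, value: UInt) = (0, 0)
    while fread(&entry, MemoryLayout.size(ofValue: entry), 1, auxv) == 1 {
        if entry.key == AT_SECURE { secure = entry.value }
    }
}
print(secure != 0 ? "secure execution: backtracer disabled" : "normal process")
#else
print("normal process")  // the auxiliary vector is Linux-specific
#endif
```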

I have to say that I wouldn't be inclined to use CAP_NET_BIND_SERVICE on the executable like this. I'd favour the more traditional route of using inetd or systemd to start my service, and either have that bind the socket for me, or for systemd have it add CAP_NET_BIND_SERVICE to the ambient capability set. Other alternatives include turning off the low-numbered port restriction using sysctl (typically this is fine on a server — it's only on multi-user boxes that take user logins where this restriction makes some kind of sense), running a reverse proxy in front of your server or using iptables to do port forwarding.
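As a sketch of the systemd route (AmbientCapabilities= is a real systemd directive; the ExecStart path is the one from this thread, used as a placeholder):

```
[Service]
ExecStart=/home/ec2-user/bin/UnidocServer --port 443
AmbientCapabilities=CAP_NET_BIND_SERVICE
```

With this, systemd grants the capability at spawn time, the binary itself carries no file capabilities, and AT_SECURE stays clear. The sysctl alternative, on reasonably recent kernels, is net.ipv4.ip_unprivileged_port_start, e.g. `sudo sysctl net.ipv4.ip_unprivileged_port_start=443`.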

2 Likes

🤯

Could we have the backtracer print a message if it won't perform its work for security reasons?

4 Likes

What is the reasoning here?

It actually already does, but only if you try to explicitly force it on with SWIFT_BACKTRACE=enable=yes.

The backtracer is an external program, which gets exec()d by the crashing process, and it has access to the memory of the crashing process. A privileged process may have sensitive information in its memory footprint, and it may also inadvertently pass on some privileges to child processes.

This is particularly a problem for the interactive mode, but even doing a normal backtrace might conceivably leak security-relevant data (key bits and so on) and make it easy for an attacker to extract data from a privileged process by crashing it at an appropriate point.

1 Like

I see. Does the (presumably external) attacker have access to the crash logs in the discussed case?

Yes, the assumption is that an attacker might somehow gain access to the crash log output and at the same time have a way to trigger crashes. Data can leak not only in any memory dump that happens (the new backtracer can display the contents of memory pointed to by CPU registers, so that's a concern) but even in the backtrace itself if you could convince the program to write interesting data over a return address.

That applies to any process.

In this case (privileged ports), why should some random HTTPS web server get this protection when, e.g., Remote Desktop (port 3283) or VNC (port 5900 et al.) doesn't? You might accidentally dump the connection's symmetric key in a backtrace. Gaining access to screen control is far more concerning than whatever you might find in a web server.

2 Likes

To be clear, the privileged process check was put in place to cope with setuid programs. The fact that it also triggers for programs that have been setcapped is not so obvious, but it's a consequence of the fact that Linux doesn't have issetugid(), and the call it does have, getauxval(AT_SECURE), also triggers if the executable has capabilities set on it.

This actually has very little to do with privileged ports. It happened to cause a problem in this instance because one way to allow a user program to bind privileged ports is to setcap its executable with CAP_NET_BIND_SERVICE, but doing that trips the privileged process behaviour in the Swift runtime. (It also has the effect of disabling LD_LIBRARY_PATH, LD_PRELOAD et al, and changing various other bits of behaviour in the C runtime as well.)

2 Likes