Dynamically sampling a server process in production?

taylorswift · January 23, 2024, 6:21pm

i have a server application that is periodically spiking to 99% CPU usage and becoming unresponsive. this phenomenon does not appear to be associated with high memory usage, so i currently suspect an infinite loop condition that is being triggered by some user action.

since the application isn’t actually crashing, i can’t think of any ways to investigate this issue except to dynamically sample the application (perhaps at 1s intervals) and view what dominates the stack while this is occurring. is there any way to accomplish this today?

for completeness, it has also been suggested to me offline that this may not be an infinite loop issue, but rather swift-backtrace getting hung while tracing a normal application crash. however when i observe the event as it unfolds, i see the name of the application in top with 99% CPU usage, not swift-backtrace. but then again, i don’t know how swift-backtrace interfaces with the application or if backtracing could still cause high CPU usage to be attributed to the application instead. some clarification would be appreciated.

cc @al45tair

al45tair · January 23, 2024, 6:23pm

If it was the backtracer, you'd see swift-backtrace using a lot of CPU, not the crashing process. The bit that is currently taking lots of time is all done by the swift-backtrace process, not by your process (not least because it involves doing all kinds of things that just wouldn't be safe in a crashed process).

al45tair · January 23, 2024, 6:26pm

As for sampling the application, there are a number of ways to do it. If this runs on and can be triggered on macOS, that's the easiest because you can use Instruments (which is excellent). If you're on Linux, perf is one way to go, or if you can get the program into this state you could just attach a debugger and see where it is.

taylorswift · January 23, 2024, 6:33pm

the issue cannot be reproduced locally with a mock database, as it happens when performing a global database rebuild on a multi-node architecture. based on what you have told me, it is most likely not related to backtracing, but an edge case in the database driver (which is written in swift) that only occurs when multiple cluster nodes are under heavy load.

it’s possible to trigger it in production by doing the global rebuild, but obviously this is extremely disruptive to our users. so i am interested in sampling the process in production, at a significantly lower frequency than one might use when profiling with perf. i imagine this would involve buffering and exporting the perf samples at some interval.

wadetregaskis · January 23, 2024, 6:44pm

perf's sampling frequency is configurable, via the -F argument. The units are hertz. Sounds like -F 1 is what you're looking for.

al45tair · January 23, 2024, 6:49pm

I was going to suggest /proc/<pid>/stack but realised that's the kernel stack. You could try pstack if you have it installed.

hassila · January 23, 2024, 6:52pm

I’d suggest you try this tool from a previous life: