Hello!
On Sun, Dec 1, 2013 at 10:00 AM, leaf corcoran wrote:
> I've been having an issue where workers get stuck at 100% CPU usage and stop
> taking requests. I've actually had this issue for a few months, typically it
>
[...]
> Sadly I don't have a way to reproduce the issue so I don't know when I'll be
> able to test debugging it.
>
Fortunately 100% CPU is relatively easy to debug with the right tools
when it is happening :)
There are usually two common possibilities (based on our experience
with our own openresty/ngx_lua servers in production):
1. Some special Lua code path enters an infinite loop.
2. Some (PCRE) regex backtracks catastrophically on a large input string.
For the first case, you can use the ngx-lua-bt tool in my Nginx
Systemtap Toolkit to get the current Lua backtrace in your nginx
worker process that is spinning at 100% CPU usage:
https://github.com/agentzh/nginx-systemtap-toolkit#ngx-lua-bt
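For reference, a typical invocation looks roughly like this (the
worker PID 12345 below is hypothetical; find the spinning worker's
actual PID with top or ps first, and use --lua51 instead of
--luajit20 if your ngx_lua is built against the standard Lua 5.1
interpreter):

```shell
# Find the nginx worker pegged at 100% CPU, e.g. via `top` or:
ps -o pid,pcpu,comm -C nginx

# Then dump its current Lua backtrace (requires systemtap and root;
# PID 12345 is a placeholder for the spinning worker's PID).
./ngx-lua-bt -p 12345 --luajit20 > lua.bt
cat lua.bt
```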
Alternatively, you can use gdb and the following gdb extension script
to obtain the current Lua backtrace from gdb:
https://github.com/agentzh/nginx-gdb-utils
Basically, just do this:
(gdb) source luajit20.gdb
(gdb) lbt <value-for-L>
You can get the value for the L argument (i.e., <value-for-L>) from
one of the top-most frames in the backtrace (shown by the gdb command
"bt").
For the second case, it should be easy to confirm by getting the
C-land backtrace with tools like gdb or pstack, as long as PCRE JIT
is not enabled.
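To see why a backtracking regex engine can pin a CPU core, here is a
small self-contained sketch. It is written in Python purely for
illustration (ngx_lua's ngx.re API uses PCRE, whose backtracking
matcher blows up the same way on this kind of pattern):

```python
import re
import time

def failed_match_time(n):
    """Time a match attempt that is forced to fail and backtrack."""
    s = "a" * n + "b"  # the trailing "b" guarantees failure at "$"
    start = time.perf_counter()
    re.match(r"(a+)+$", s)  # nested quantifiers: exponential backtracking
    return time.perf_counter() - start

# Each extra "a" roughly doubles the number of ways the engine can
# split the string between the inner and outer "+", so the time to
# fail explodes; on a large input the worker simply spins at 100% CPU.
print(failed_match_time(22) > 100 * failed_match_time(10))
```

With a pattern like this, going from a 10-byte input to a 30-byte
input turns a microsecond match into minutes of spinning.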
An on-CPU flame graph can also be very useful here (though we cannot
get complete backtraces for JIT-compiled code yet):
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt
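A typical sampling session looks roughly like the following sketch
(the PID and the 10-second sampling window are placeholders; the
stackcollapse-stap.pl and flamegraph.pl scripts come from Brendan
Gregg's FlameGraph repository):

```shell
# Sample user-space backtraces of the spinning worker for 10 seconds
# (PID 12345 is hypothetical; requires systemtap and root privileges).
./sample-bt -p 12345 -t 10 -u > a.bt

# Fold the stacks and render the SVG flame graph with the scripts
# from https://github.com/brendangregg/FlameGraph
./stackcollapse-stap.pl a.bt > a.cbt
./flamegraph.pl a.cbt > a.svg
```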
But the most important thing is to catch an nginx worker while it is
actually spinning at 100% CPU. The systemtap-based tools can be used
directly in production.
Hope these help :)
Best regards,
-agentzh