Hello!
On Sun, Dec 1, 2013 at 10:00 AM, leaf corcoran wrote:
> I've been having an issue where workers get stuck at 100% CPU usage and stop
> taking requests. I've actually had this issue for a few months, typically it
>
[...]
> Sadly I don't have a way to reproduce the issue so I don't know when I'll be
> able to test debugging it.
>
Fortunately 100% CPU is relatively easy to debug with the right tools
when it is happening :)
There are usually two common possibilities (based on our experience
with our own openresty/ngx_lua servers in production):
1. Some special Lua code path enters an infinite loop.
2. Some (PCRE) regex backtracks catastrophically on a large input string.
For the first case, you can use the ngx-lua-bt tool in my Nginx
Systemtap Toolkit to get the current Lua backtrace in your nginx
worker process that is spinning at 100% CPU usage:
https://github.com/agentzh/nginx-systemtap-toolkit#ngx-lua-bt
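For reference, a typical invocation looks roughly like this (the
worker PID 12345 below is hypothetical; find the spinning worker's
actual PID with top or ps first, and use --lua51 instead of
--luajit20 if your ngx_lua is built against the standard Lua 5.1
interpreter):

```shell
# Find the nginx worker pegged at 100% CPU, e.g. via `top` or:
ps -o pid,pcpu,comm -C nginx

# Then dump its current Lua backtrace (requires systemtap and root;
# PID 12345 is a placeholder for the spinning worker's PID).
./ngx-lua-bt -p 12345 --luajit20 > lua.bt
cat lua.bt
```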
Alternatively, you can use gdb and the following gdb extension script
to obtain the current Lua backtrace from gdb:
https://github.com/agentzh/nginx-gdb-utils
Basically, just do this:
(gdb) source luajit20.gdb
(gdb) lbt <value-for-L>
You can get the value for the L argument (i.e., <value-for-L>) from
one of the top-most frames in the backtrace (shown by the gdb command
"bt").
For the second case, it should be easy to confirm by getting the
C-land backtrace with tools like gdb or pstack, as long as PCRE JIT
is not enabled.
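To see why a backtracking regex engine can pin a CPU core, here is a
small self-contained sketch. It is written in Python purely for
illustration (ngx_lua's ngx.re API uses PCRE, whose backtracking
matcher blows up the same way on this kind of pattern):

```python
import re
import time

def failed_match_time(n):
    """Time a match attempt that is forced to fail and backtrack."""
    s = "a" * n + "b"  # the trailing "b" guarantees failure at "$"
    start = time.perf_counter()
    re.match(r"(a+)+$", s)  # nested quantifiers: exponential backtracking
    return time.perf_counter() - start

# Each extra "a" roughly doubles the number of ways the engine can
# split the string between the inner and outer "+", so the time to
# fail explodes; on a large input the worker simply spins at 100% CPU.
print(failed_match_time(22) > 100 * failed_match_time(10))
```

With a pattern like this, going from a 10-byte input to a 30-byte
input turns a microsecond match into minutes of spinning.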
An on-CPU flame graph can also be very useful here (though we cannot
get complete backtraces for JIT-compiled code yet):
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt
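A typical sampling session looks roughly like the following sketch
(the PID and the 10-second sampling window are placeholders; the
stackcollapse-stap.pl and flamegraph.pl scripts come from Brendan
Gregg's FlameGraph repository):

```shell
# Sample user-space backtraces of the spinning worker for 10 seconds
# (PID 12345 is hypothetical; requires systemtap and root privileges).
./sample-bt -p 12345 -t 10 -u > a.bt

# Fold the stacks and render the SVG flame graph with the scripts
# from https://github.com/brendangregg/FlameGraph
./stackcollapse-stap.pl a.bt > a.cbt
./flamegraph.pl a.cbt > a.svg
```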
But the most important thing is to catch an nginx worker while it is
actually spinning at 100% CPU. The systemtap-based tools can be used
directly in production.
Hope these help :)
Best regards,
-agentzh