Hello!
2012/11/28 Goujiang:
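> I tried gdb and got the output below; the stack trace looks incomplete.
> I was testing with four or five parallel requests from the browser (the problem rarely shows up with a single request). Could that be why the trace is incomplete?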
>
Okay, today I launched a Linux i386 system on Amazon EC2 and finally reproduced the crash from your report there, using the following minimal test case:
    lua_check_client_abort on;

    location = /main {
        echo_location_async /proxy;
        echo_location_async /proxy;
        echo_location_async /proxy;
        echo_location_async /proxy;
        echo_location_async /proxy;
        echo_location_async /proxy;
    }

    location = /proxy {
        proxy_send_timeout 6s;
        proxy_read_timeout 6s;
        proxy_pass http://127.0.0.1:$server_port/t;
    }

    location = /t {
        content_by_lua '
            local redis = require "resty.redis"

            local function my_cleanup()
                ngx.exit(499)
            end

            local ok, err = ngx.on_abort(my_cleanup)
            if not ok then
                ngx.log(ngx.ERR, "failed to register the on_abort callback: ", err)
                ngx.exit(500)
            end

            local red = redis:new()
            red:set_timeout(60000) -- 60 sec

            local ok, err = red:connect("127.0.0.1", 6379)
            if not ok then
                ngx.log(ngx.ERR, "failed to connect: ", err)
                ngx.exit(500)
            end

            local res, err = red:blpop("not_exists", 40)
            ngx.say("ok")
        ';
    }
Accessing location /main makes the nginx process crash. gdb gives the following output:
Program received signal SIGSEGV, Segmentation fault.
0xb76b3a45 in lj_str_new (L=0xb70d23a0, str=0xb771702f "cannot resume non-suspended coroutine", lenx=37) at lj_str.c:123
123 o = gcref(g->strhash[h & g->strmask]);
(gdb) bt
#0 0xb76b3a45 in lj_str_new (L=0xb70d23a0, str=0xb771702f "cannot resume non-suspended coroutine", lenx=37) at lj_str.c:123
#1 0xb76b2a8c in lj_err_str (L=0xb70d23a0, em=LJ_ERR_COSUSP) at lj_err.c:480
#2 0xb76c08f1 in lua_resume (L=0xb70d23a0, nargs=0) at lj_api.c:1136
#3 0x080e9e7d in ngx_http_lua_run_thread (L=0xb70bd1c0, r=0x8711378, ctx=0x8712070, nret=0)
    at /home/ec2-user/git/lua-nginx-module/src/ngx_http_lua_util.c:1019
#4 0x080eb14e in ngx_http_lua_on_abort_resume (r=0x8711378)
    at /home/ec2-user/git/lua-nginx-module/src/ngx_http_lua_util.c:3197
#5 0x080ebf95 in ngx_http_lua_content_wev_handler (r=0x8711378)
    at /home/ec2-user/git/lua-nginx-module/src/ngx_http_lua_contentby.c:128
#6 0x080eb4ac in ngx_http_lua_rd_check_broken_connection (r=0x8711378)
    at /home/ec2-user/git/lua-nginx-module/src/ngx_http_lua_util.c:3167
#7 0x0808bc0c in ngx_http_request_handler (ev=0x86f47c0) at src/http/ngx_http_request.c:1873
#8 0x08079bed in ngx_epoll_process_events (cycle=0x86c36e8, timer=53996, flags=1) at src/event/modules/ngx_epoll_module.c:683
#9 0x0806f9bd in ngx_process_events_and_timers (cycle=0x86c36e8)
    at src/event/ngx_event.c:247
#10 0x080781d8 in ngx_single_process_cycle (cycle=0x86c36e8) at src/os/unix/ngx_process_cycle.c:316
#11 0x0805a6cb in main (argc=5, argv=0xbffecc74) at src/core/nginx.c:407
When nginx is run under valgrind memcheck, the first error reported is at the same location:
==9744== Invalid read of size 4
==9744== at 0x4053A45: lj_str_new (lj_str.c:123)
==9744== by 0x4052A8B: lj_err_str (lj_err.c:480)
==9744== by 0x40608F0: lua_resume (lj_api.c:1136)
==9744== by 0x80E9E7C: ngx_http_lua_run_thread (ngx_http_lua_util.c:1019)
==9744== by 0x80EB14D: ngx_http_lua_on_abort_resume (ngx_http_lua_util.c:3197)
==9744== by 0x80EBF94: ngx_http_lua_content_wev_handler (ngx_http_lua_contentby.c:128)
==9744== by 0x80EB4AB: ngx_http_lua_rd_check_broken_connection (ngx_http_lua_util.c:3167)
==9744== by 0x808BC0B: ngx_http_request_handler (ngx_http_request.c:1873)
==9744== by 0x8079BEC: ngx_epoll_process_events (ngx_epoll_module.c:683)
==9744== by 0x806F9BC: ngx_process_events_and_timers (ngx_event.c:247)
==9744== by 0x80781D7: ngx_single_process_cycle (ngx_process_cycle.c:316)
==9744== by 0x4372CE5: (below main) (in /lib/libc-2.12.so)
==9744== Address 0x424 is not stack'd, malloc'd or (recently) free'd
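Interestingly, the same test case runs without any problem on Linux x86_64.
Now that I can reproduce the problem reliably, the fix should come fairly quickly. Thanks a lot for the report!
If I manage to fix it later today, I will ask you to try the new version :)
It is worth noting that to get a complete stack trace you need to enable debug symbols for LuaJIT. The simplest way is to rebuild openresty with --with-debug (this option is not recommended for production, though, because it carries a significant performance cost).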
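As a rough sketch of that rebuild, assuming you build from an unpacked openresty source tree and install with the usual configure/make steps (the directory below is only a placeholder, and whatever configure options you normally pass should be kept alongside --with-debug):

```shell
# Placeholder path: replace with your own unpacked openresty source directory.
cd /path/to/your/openresty-source

# --with-debug is the option mentioned above; append your usual configure options here.
./configure --with-debug
make
sudo make install
```

After restarting nginx with the rebuilt binary, rerun the test and take the backtrace with "bt" in gdb again.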
Best regards,
-agentzh