Hi, I'm having a weird issue with coroutines, and am struggling to debug it further so I thought I'd explain what I know in case anyone can help.
I'm using coroutines to stream http responses in
http://github.com/pintsized/lua-resty-http, which under normal circumstances works fine. However, if you have several upstream http requests, sometimes a coroutine will enter the dead state without ever having run. This appears to be quite random, but the more upstream requests you make within a given request, the more likely you are to see it (say with 10 upstream requests to the same resource, 3 or 4 of them may be dead on arrival).
Additionally, the ones which are dead change every time... It's rarely any of the first 4 requests, but otherwise it could be any combination of the remaining requests, including occasionally none.
I'd suspected there was just something silly wrong with my code, but it's definitely weirder than that. The coroutine is created (successfully, I get a thread value back), and then on the very next line I call coroutine.status(co) and it returns "dead", which as I understand it should be impossible?
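To make the expectation concrete, here is a minimal sketch in plain Lua 5.1 / LuaJIT, with no ngx_lua or cosockets involved. The `reader` function and the variable names are made up for illustration and are not taken from lua-resty-http; the point is only that a freshly created coroutine should report "suspended" until it has been resumed to completion.

```lua
-- Hypothetical stand-in for a body reader: yields one chunk, then finishes.
local function reader()
    coroutine.yield("chunk")
end

local co = coroutine.create(reader)

-- Immediately after creation, nothing has run yet.
local status_before = coroutine.status(co)

-- First resume runs the body up to the yield.
local ok, chunk = coroutine.resume(co)
local status_mid = coroutine.status(co)

-- Second resume lets the body return, which is the normal way to reach "dead".
coroutine.resume(co)
local status_after = coroutine.status(co)

print(status_before, status_mid, status_after)
```

In stock Lua this prints suspended, suspended, dead. What I'm seeing inside ngx_lua is the first status already coming back as "dead".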
I can't seem to recreate this with simpler code though; I have to use lua-resty-http and create connections in a loop. I can't help wondering if it's something to do with the fact that my coroutines operate on open cosockets. That is, if I take that factor out of the equation I can't replicate it.
But just to reiterate, when a given request fails the coroutine doesn't execute a single line (it can't be resumed to do so), and the socket that the coroutine would have read from is still perfectly fine and usable afterwards. That is, the error is definitely that I've created a coroutine and it has changed to "dead" immediately, rather than that the coroutine tries to run and then fails in some other way.
I went through the ngx_lua source and placed debug statements everywhere the status is set to NGX_HTTP_LUA_CO_DEAD, and it would appear that none of them are responsible! The coroutine is created as suspended, but immediately afterwards the status has become dead, apparently without being set by the C code at any point.
I'm clutching at straws, but could the threads be getting garbage collected by accident or something?
I've created a Gist with example code to recreate the problem, along with example output. I became aware of this because of this bug report regarding excessive memory usage:
https://github.com/pintsized/lua-resty-http/issues/4 - this happens because if a coroutine dies when using coroutine.wrap, we blindly try to resume it and get a string error back, causing an infinite loop. In my Gist you can see how I've reimplemented coroutine.wrap to stop this.
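For anyone who doesn't want to open the Gist, the guard is roughly along these lines. This is a simplified sketch in plain Lua rather than the exact code from the Gist; `safe_wrap` and `next_chunk` are names invented for this example, and it assumes the wrapped function yields a single value per step.

```lua
-- Hypothetical helper: behaves like coroutine.wrap for the happy path, but
-- checks the status first and returns nil plus a message instead of
-- resuming a coroutine that is not resumable.
local function safe_wrap(fn)
    local co = coroutine.create(fn)
    return function(...)
        local status = coroutine.status(co)
        if status ~= "suspended" then
            return nil, "cannot resume " .. status .. " coroutine"
        end
        local ok, value = coroutine.resume(co, ...)
        if not ok then
            -- value holds the error raised inside the coroutine
            return nil, value
        end
        return value
    end
end

-- Usage: read until the iterator stops producing chunks.
local next_chunk = safe_wrap(function()
    coroutine.yield("a")
    coroutine.yield("b")
end)

local chunks = {}
repeat
    local chunk = next_chunk()
    if chunk then
        chunks[#chunks + 1] = chunk
    end
until not chunk
```

Because a dead coroutine now produces nil rather than a truthy error string, a read loop like the one above terminates instead of spinning forever. It does not explain why the coroutine is dead in the first place, of course.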
I also tried removing wrap from the equation completely, and using create/resume directly, but the effect is the same.
Any ideas on what I can try to narrow this down further?
Thanks.