Hello!
On Wed, Jan 29, 2014 at 1:46 AM, Vincent MADY wrote:
> I still have some connection errors (ressource
> temporarily unavailable)
It should be the EAGAIN error returned by connect() on the nginx side.
This happens when the backlog queue (or listening queue) on the Redis
side overflows.
Assuming you're on Linux: for stream-typed unix domain sockets, the
connect() operation is asynchronous and almost always succeeds
immediately, even when the other side has not yet done a
corresponding accept(). So when the other side is not fast enough at
calling accept(), Redis's backlog queue can easily fill up, yielding
the EAGAIN error on the client side (here, the Nginx side).
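Just to illustrate the failure mode, here is a minimal C sketch (the
/tmp/redis.sock path is only an assumption for your setup):

    /* nonblocking connect() to a unix stream socket: when the
     * listener's backlog is full, Linux fails immediately with
     * EAGAIN ("Resource temporarily unavailable"). */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
        int                 fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un  sa;

        fcntl(fd, F_SETFL, O_NONBLOCK);

        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strcpy(sa.sun_path, "/tmp/redis.sock");  /* assumed path */

        if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) == -1
            && errno == EAGAIN)
        {
            perror("connect");  /* the error nginx is reporting */
        }

        return 0;
    }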
This can especially happen in your setup because
1. Your redis process has only 1 thread while your nginx has many
workers (i.e., many OS processes), so your redis will get fewer CPU
cycles from the kernel scheduler when redis has the same process
priority as the nginx workers and a fair scheduler is used.
2. The redis server hard-codes a really small backlog limit in its
source, that is, 511:

    agentzh@w530 ~/work/redis-2.4.17 $ grep 511 -r src/
    src/anet.c:    if (listen(s, 511) == -1) { /* the magic 511 constant is from nginx */
For 1), if your redis server has not reached 100% CPU usage yet, then
you can consider increasing the process priority of your redis-server
process.
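For example (assuming the redis-server pid is 1234; the nice value
here is just for illustration):

    # give redis-server a higher scheduling priority (needs root)
    renice -n -5 -p 1234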
For 2), you can edit the C source of your redis server and increase
the 511 magic number in the listen() call to something bigger, like
2048.
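That is, the line in src/anet.c would become something like:

    if (listen(s, 2048) == -1) {

Note that the kernel also caps the effective backlog with the
net.core.somaxconn sysctl, so you may need to raise that limit as
well.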
If your redis server is already at (or close to) 100% CPU usage under
load, then you should really scale your redis backend to multiple
instances.
BTW, what is the exact CPU usage of your redis-server and nginx processes?
Also, you can try disabling the access_log on the nginx side to see if
it makes a difference.
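That is, something like this in the right http {} or server {} block
of your nginx.conf:

    access_log off;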
And...what exact versions of ngx_lua and LuaJIT are you using? (or
just the version of ngx_openresty if you're using ngx_openresty).
If the Nginx processes are the bottleneck, you can sample an on-CPU
and an off-CPU flamegraph for a typical nginx worker under load:
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt-off-cpu
By looking at the graphs, we will no longer have to guess where to optimize.
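From memory, the sampling workflow looks something like below (check
the toolkit's README for the exact options; the pid and timing here
are made up):

    # sample user-space backtraces of an nginx worker for 5 seconds
    ./sample-bt -p 8736 -t 5 -u > a.bt
    # render the flame graph with Brendan Gregg's FlameGraph tools
    stackcollapse-stap.pl a.bt > a.cbt
    flamegraph.pl a.cbt > a.svg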
> Do you have an idea about the best settings for
> set_timeout and set_keepalive for this kind of high frequency architecture
> (my response should be 10ms max)
>
The EAGAIN error in connect() has nothing to do with these
configuration parameters.
Best regards,
-agentzh