Hello!
On Wed, Jan 29, 2014 at 1:46 AM, Vincent MADY wrote:
> I still have some connection errors (ressource
> temporarily unavailable)
It should be the EAGAIN error returned by connect() on the nginx side.
This happens when the backlog queue (or listening queue) on the Redis
side overflows.
Assuming you're on Linux: for stream-typed unix domain sockets, the
connect() operation is asynchronous and almost always succeeds
immediately, even when the other side has not yet done a
corresponding accept(). So when the other side is not fast enough at
calling accept(), Redis's backlog queue can easily fill up, yielding
the EAGAIN error on the client side (here, the Nginx side).
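Just to illustrate the failure mode, here is a minimal C sketch (the
/tmp/redis.sock path is only an assumption for your setup):

    /* nonblocking connect() to a unix stream socket: when the
     * listener's backlog is full, Linux fails immediately with
     * EAGAIN ("Resource temporarily unavailable"). */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
        int                 fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un  sa;

        fcntl(fd, F_SETFL, O_NONBLOCK);

        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strcpy(sa.sun_path, "/tmp/redis.sock");  /* assumed path */

        if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) == -1
            && errno == EAGAIN)
        {
            perror("connect");  /* the error nginx is reporting */
        }

        return 0;
    }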
This can especially happen in your setup because
1. Your redis process has only 1 thread while your nginx has many
workers (i.e., many OS processes), so your redis will get fewer CPU
cycles from the kernel scheduler when redis has the same process
priority as the nginx workers and a fair scheduler is used.
2. The redis server hard-codes a really small backlog limit in its
source, that is, 511:

    agentzh@w530 ~/work/redis-2.4.17 $ grep 511 -r src/
    src/anet.c:    if (listen(s, 511) == -1) { /* the magic 511 constant is from nginx */
For 1), if your redis server has not reached 100% CPU usage yet, then
you can consider increasing the process priority of your redis-server
process.
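For example (assuming the redis-server pid is 1234; the nice value
here is just for illustration):

    # give redis-server a higher scheduling priority (needs root)
    renice -n -5 -p 1234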
For 2), you can edit the C source of your redis server and increase
the 511 magic number in the listen() call to something bigger, like
2048.
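That is, the line in src/anet.c would become something like:

    if (listen(s, 2048) == -1) {

Note that the kernel also caps the effective backlog with the
net.core.somaxconn sysctl, so you may need to raise that limit as
well.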
If your redis server is already at (or close to) 100% CPU usage under
load, then you should really scale your redis backend to multiple
instances.
BTW, what is the exact CPU usage of your redis-server and nginx processes?
Also, you can try disabling the access_log on the nginx side to see if
it makes a difference.
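That is, something like this in the right http {} or server {} block
of your nginx.conf:

    access_log off;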
And...what exact versions of ngx_lua and LuaJIT are you using? (or
just the version of ngx_openresty if you're using ngx_openresty).
If the Nginx processes are the bottleneck, you can sample an on-CPU
and an off-CPU flamegraph for a typical nginx worker under load:
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt
https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt-off-cpu
By looking at the graphs, we will no longer have to guess where to optimize.
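From memory, the sampling workflow looks something like below (check
the toolkit's README for the exact options; the pid and timing here
are made up):

    # sample user-space backtraces of an nginx worker for 5 seconds
    ./sample-bt -p 8736 -t 5 -u > a.bt
    # render the flame graph with Brendan Gregg's FlameGraph tools
    stackcollapse-stap.pl a.bt > a.cbt
    flamegraph.pl a.cbt > a.svg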
> Do you have an idea about the best settings for
> set_timeout and set_keepalive for this kind of high frequency architecture
> (my response should be 10ms max)
>
The EAGAIN error in connect() has nothing to do with these
configuration parameters.
Best regards,
-agentzh