Hello!
On Tue, Aug 12, 2014 at 11:27 PM, Ajay Bodhe wrote:
> My CPU bound code would be consuming 200-400 MS.
>
Let's say it's 300ms of CPU time per request; then each of your CPU
cores can only handle about 3.3 requests/sec! If your machine has N
cores, then it's just a little above 3 * N req/sec in theory. Yes, no
matter what approach you use, because you cannot *invent* new CPU
resources in your software. You can only try to waste less.
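The arithmetic above can be sketched as a back-of-envelope calculation (my own illustration, not code from this thread; the 300ms figure comes from the question and the core count of 8 is an assumed example):

```python
# Throughput ceiling for a CPU-bound service.
# Assumed numbers: 0.3 s of CPU time per request, N = 8 cores.

cpu_seconds_per_request = 0.3
cores = 8

per_core_rps = 1 / cpu_seconds_per_request   # ~3.33 req/sec per core
total_rps = cores * per_core_rps             # ~26.7 req/sec for the whole box

print(f"{per_core_rps:.2f} req/sec per core")
print(f"{total_rps:.1f} req/sec total")
```

No software trick raises this ceiling; only cutting the CPU cost per request (or adding cores) does.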
> If I run this computation code inside nginx worker it will be blocked for
> further requests.
Because your current CPU core (bound to your current nginx worker
process) is already 100% busy with the current request, there is no
point in accepting new requests in this extremely busy nginx worker
process; that would only make the situation even worse :)
If I were you, I'd try to optimize this CPU hog to death. CPU time is
a precious hardware resource anyway. Failing that, I'd just run a
machine farm of nginx servers, possibly behind a front nginx simply
doing reverse proxying. And in each "backend" nginx server, I'd
configure the standard ngx_limit_req module to limit the incoming
request rate to 3 * N req/sec (where N is the number of logical CPU
cores available on the current backend machine).
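Such a per-backend cap might look like the following minimal sketch (my own example, not a tested config; N = 8 cores is assumed, hence the 24 req/sec rate, and the zone name, memory size, and burst value are made-up placeholders):

```nginx
http {
    # A zone keyed by a constant ($server_name) so the limit applies to
    # the whole backend rather than per client IP.
    limit_req_zone $server_name zone=cpu_cap:1m rate=24r/s;

    server {
        location / {
            # Allow a small burst to absorb jitter; excess requests are
            # rejected rather than queued indefinitely.
            limit_req zone=cpu_cap burst=8;
            ...
        }
    }
}
```

Requests over the limit are rejected with an error status by default, which is exactly what you want: shedding load early beats piling it onto saturated CPUs.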
Please always keep in mind:
1. Trying to exceed the throughput limit of your server by launching
more OS threads or increasing the client concurrency level will never
do you any good; it will only make the latency and throughput worse
and worse.
2. Hardware resources like CPU time are constant. You cannot invent
new hardware resources in your software; you can only reduce waste.
3. The throughput limit is everything. See 1) above.
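Point 1 can be illustrated with a toy model (my own sketch, under the simplifying assumption that in-flight requests time-share the CPUs fairly): past the core count, extra concurrency leaves the throughput ceiling untouched and only stretches per-request latency.

```python
# Toy model of a purely CPU-bound server.

def capacity_rps(cores, service_s):
    # The throughput ceiling: fixed by hardware and per-request CPU cost.
    return cores / service_s

def latency_s(concurrency, cores, service_s):
    # With more in-flight requests than cores, each request's wall-clock
    # time stretches in proportion to the oversubscription.
    return service_s * max(1.0, concurrency / cores)

CORES, SERVICE = 4, 0.3                   # 4 cores, 300 ms CPU per request
print(capacity_rps(CORES, SERVICE))       # ceiling stays ~13.3 req/sec
print(latency_s(4, CORES, SERVICE))       # ~0.3 s at concurrency == cores
print(latency_s(40, CORES, SERVICE))      # ~3 s at 10x the concurrency
```

Raising the concurrency tenfold changed nothing about the ceiling; it only made every request ten times slower.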
> I can write another processing stage(threadpool may be) to which I handle
> over this task along with Nginx-worker callback.
That doesn't really solve any problem here :)
> But then it will again cause more overhead in terms of switching, right?
>
Of course. That is just a way of wasting even more hardware resources :)
> I was looking at solutions build around ASIO. It has the concept of
> IO_SERVICE which handle aync NW/IO calls & threadpool which handles such CPU
> tasks & gives a call to IO_SERVICE for any NW/IO thing.
>
You need to make a clear conceptual distinction between I/O-bound and
CPU-bound workloads. Solutions that improve one will not really help
the other. Even the use of nonblocking I/O is meant to reduce the
context-switching overhead (in CPU time!) of having many OS threads
doing blocking I/O. Again, it is about reducing the waste of hardware
resources.
> Also in terms of Nginx what works better writing server logic as part of
> Nginx or moving it to another application layer which communicates with
> Nginx over FastCGI?
>
Another FastCGI layer introduces extra socket communication overhead
(including expensive system calls). Again, you may end up wasting more
CPU time, i.e., more hardware resources :)
Best regards,
-agentzh