Thank you all for your insights.
Just to clarify:
1. Aggregated data - we decided to compute these with Lua and Redis. For
now, all we will do is store each record in Redis as soon as it is
received from the hosts. With some well-thought-out data structures in
Redis and a bit of Lua (maybe in Redis, but better in OpenResty) we
should be able to have the data computed in real time; see the sketch
after this list. But the aggregated-data discussion is out of the scope
of this question (maybe we will start a new thread on this mailing
list :) )
2. Raw data - we need to store the raw data (the actual strings received
from the hosts) in log files, or some other data files, for off-line
analysis. Because the volume of data will be quite large, storing it in
Redis is out of the question. We also need to avoid relational databases
and other software that needs maintenance and administration. I think
the best way would be to use (CSV) text files into which we "dump" the
records as they are, as soon as they are received.
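As a concrete illustration of point 1, here is a minimal sketch of the
kind of real-time aggregation we have in mind, done on the OpenResty
side with lua-resty-redis (the endpoint, the "metric"/"value" arguments,
and the key layout are all hypothetical):

    -- content_by_lua handler for a hypothetical ingest endpoint (sketch).
    -- Each incoming record updates per-minute sum/count aggregates in Redis.
    local redis = require "resty.redis"

    local red = redis:new()
    red:set_timeout(1000)  -- 1 second

    local ok, err = red:connect("127.0.0.1", 6379)
    if not ok then
        ngx.log(ngx.ERR, "failed to connect to redis: ", err)
        return ngx.exit(500)
    end

    -- "metric" and "value" are hypothetical query arguments
    local args = ngx.req.get_uri_args()
    local metric, value = args.metric, tonumber(args.value)
    if metric and value then
        local key = "agg:" .. metric .. ":" .. os.date("!%Y%m%d%H%M")
        red:init_pipeline()
        red:incrbyfloat(key .. ":sum", value)  -- running sum for this minute
        red:incr(key .. ":count")              -- sample count for this minute
        red:expire(key .. ":sum", 86400)       -- keep aggregates for one day
        red:expire(key .. ":count", 86400)
        local results, err = red:commit_pipeline()
        if not results then
            ngx.log(ngx.ERR, "redis pipeline failed: ", err)
        end
    end

    red:set_keepalive(10000, 100)  -- put the connection back into the pool

But back to the raw data: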
The straightforward solution is to synchronously write the records to
files. But, as agentzh said, "Avoid *heavy* file IO whenever possible in
the context of nginx".
I understand that one suggestion would be to store these records in a
temporary buffer (with lua_shared_dict, or maybe in Redis) and to empty
this buffer from a separate background light thread launched with
ngx.timer.at. To synchronize these threads we would need to use
lua-resty-lock. I will need to read more about this method, as it's
still unclear to me whether init_worker_by_lua + ngx.timer.at starts
more than one light thread (in fact, one per worker, right?). We only
need one light thread doing the work, so the other ones would need to be
"synced out". Is this true?
The other suggestion was to send the data directly "out over socket"
(although I think this was about aggregated data, not raw data). We
thought about a socket server for storing the raw records on disk, but I
don't know what software we should use as a "storage daemon" (as I said,
we want to avoid SQL-based engines and other too-complicated pieces of
software). Maybe syslog-ng? Any other alternatives?
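If we go the socket route, I imagine the OpenResty side would be as
simple as something like this (a sketch; 127.0.0.1:5140 is a
hypothetical listener, e.g. a syslog-ng TCP source):

    -- content_by_lua (sketch): forward one raw record to a storage daemon
    -- over a nonblocking cosocket instead of writing to disk ourselves.
    local sock = ngx.socket.tcp()
    sock:settimeout(1000)  -- 1 second

    local ok, err = sock:connect("127.0.0.1", 5140)
    if not ok then
        ngx.log(ngx.ERR, "cannot reach storage daemon: ", err)
        return
    end

    ngx.req.read_body()
    local record = ngx.req.get_body_data()  -- the raw CSV line from the host
    if record then
        local bytes, err = sock:send(record .. "\n")
        if not bytes then
            ngx.log(ngx.ERR, "send to storage daemon failed: ", err)
        end
    end

    sock:setkeepalive(10000, 100)  -- keep the connection pooled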
Thank you again.
Bogdan
Tuesday, August 26, 2014 8:42 AM
Thanks for taking a look. Some comments:

1. Original raw data. We plan to store the original raw data, collected
from hosts or sensors, on disk for some time. Some pointers about raw
data:
http://www.systemdatarecorder.org/recording/raw.html
Within our analytic software we will have another way to explore the
data, using these flat files.

2. Aggregated data. I understood that we need to process and make all
the aggregations within Redis; Redis has a Lua interpreter inside which
can be used. Are you suggesting we should process and aggregate within
OpenResty instead? Wouldn't that be a problem between the NGINX workers?
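To make the question concrete, aggregating "within Redis" would mean
something like the following hypothetical script, executed atomically
via EVAL (the key layout is just an example):

    -- Hypothetical aggregation script run inside Redis via EVAL.
    -- KEYS[1] = aggregate hash, ARGV[1] = new sample value.
    local sum   = redis.call("HINCRBYFLOAT", KEYS[1], "sum", ARGV[1])
    local count = redis.call("HINCRBY", KEYS[1], "count", 1)
    return { sum, count }

    -- invoked from OpenResty with lua-resty-redis, e.g.:
    --   local res, err = red:eval(script, 1, "agg:cpu:201408260842", 0.42)

Since the script runs atomically inside Redis, concurrent NGINX workers
cannot interleave on the aggregate; each worker just fires the EVAL.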
So our first main problem will be the flat files containing the raw
data. We are probably good to go using lua_shared_dict, but we need to
test. The second part will be to clarify whether we will do all the
calculations in Redis or in OpenResty.

Thanks again,
Tuesday, August 26, 2014 1:14 AM
Hello!
I've just had a closer look at your thread on the redis-db mailing list.
For your use case, I suggest you send the aggregated data (stored in
lua_shared_dict, for example) out over socket every few minutes or
seconds directly, without touching the disk. We're using this approach
for data analytics at CloudFlare. The lua-resty-logger-socket library is
such an example:
https://github.com/cloudflare/lua-resty-logger-socket
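Typical usage looks roughly like this (the host and port are
placeholders; see the README for the full option list):

    -- log_by_lua (adapted from the library's README)
    local logger = require "resty.logger.socket"
    if not logger.initted() then
        local ok, err = logger.init{
            host        = "127.0.0.1",  -- placeholder
            port        = 5141,         -- placeholder
            flush_limit = 4096,         -- buffer this many bytes before sending
            drop_limit  = 1048576,      -- drop data beyond this to cap memory
        }
        if not ok then
            ngx.log(ngx.ERR, "failed to initialize the logger: ", err)
            return
        end
    end

    local bytes, err = logger.log(ngx.var.request_uri .. "\n")
    if err then
        ngx.log(ngx.ERR, "failed to log message: ", err)
    end

Regards, -agentzh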
Tuesday, August 26, 2014 1:04 AM
I see. Thanks a lot. We need to experiment, and, as you said, we will
use lua-resty-lock to handle this part. Very curious to see how things
will go. We will come back with some results later.
Thanks,
Tuesday, August 26, 2014 12:59 AM
Hello!
For ngx_lua, you can just use init_worker_by_lua to initiate a recurring
timer via ngx.timer.at(), in whose handler you can update your files
directly in Lua. To prevent multiple workers doing the same job, you can
guard your timer's operation with lua-resty-lock so that only one worker
is active.
Regards, -agentzh
Monday, August 25, 2014 11:07 PM
Some more clarifications (I work with Bogdan as well):

> Seems like you really need a real database-like service to do this.
> Like Redis, MySQL, or PostgreSQL. They are good at what they're
> supposed to do, talking to them over sockets is very efficient and
> 100% nonblocking. Otherwise we'll have to reinvent most of the
> heavy lifting ourselves in nginx, which may not be that efficient.
We want to build an appliance-like system which does not have any RDBMS
at all; we are going the NoSQL route. See here:
https://groups.google.com/forum/#!topic/redis-db/kINPN4jiLgI
It seems Josiah advised us to pre-process all the raw data outside of
Redis/Lua, since the Lua embedded in Redis is less powerful and more
restricted in scope than a full LuaJIT environment, for example.
So the interesting thing to find out here is whether or not we can
pre-process all the raw data using the lua-resty-lock approach. Meaning
we will have something like:
host1: runs 4 recorders + transporter
    sysrec  -> sysrec.sdrd (flat CSV file)
    cpurec  -> cpurec.sdrd
    diskrec -> diskrec.sdrd
    nicrec  -> nicrec.sdrd
    sender  -> transports minute-based updates via HTTP | HTTPS

host2: runs 4 recorders + transporter
    sysrec  -> sysrec.sdrd (flat CSV file)
    cpurec  -> cpurec.sdrd
    diskrec -> diskrec.sdrd
    nicrec  -> nicrec.sdrd
    sender  -> transports minute-based updates via HTTP | HTTPS
Every minute these files will get new CSV records. We want to transport
these records to our backend and store them there (that sysrec.sdrd
file, for example).
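On the backend, the receiving half could look roughly like this (a
sketch; the location name, the shared dict, and the key scheme are all
hypothetical):

    # nginx.conf (sketch): buffer incoming records in a shared dict,
    # to be flushed to sysrec.sdrd by a background timer later.
    lua_shared_dict rawbuf 10m;

    server {
        location = /ingest/sysrec {
            content_by_lua '
                ngx.req.read_body()
                local record = ngx.req.get_body_data()
                if record then
                    -- unique-ish key so records do not overwrite each other
                    local key = ngx.now() .. ":" .. math.random(1e9)
                    local ok, err = ngx.shared.rawbuf:safe_add(key, record)
                    if not ok then
                        ngx.log(ngx.ERR, "buffer add failed: ", err)
                    end
                end
                ngx.exit(204)
            ';
        }
    }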
> Avoid *heavy* file IO whenever possible in the context of nginx, which
> can never be efficient because there is no such thing called
> "nonblocking file IO" and workarounds like AIO and OS thread pooling
> have their own limitations, complexity, and/or overhead.
Yes, we have already moved our user and subscription management into
Redis, and we are happy with that. But we still need to figure out how
to update some flat files on disk every N minutes, consistently, from
OpenResty or Redis.

Thanks a lot for the comments,