Thank you all for your insights.
Just to clarify:
1. Aggregated data - we decided to compute these with Lua and Redis. For
now, all we will do is store each record in Redis as soon as it is
received from the hosts. With some well-thought-out data structures in
Redis and a bit of Lua (maybe in Redis, but better in OpenResty) we
should be able to have the data computed in real time; see the sketch
after this list. But the aggregated-data discussion is out of the scope
of this question (maybe we will start a new thread on this mailing
list :) )
2. Raw data - we need to store the raw data (the actual strings received
from the hosts) in log files, or some other data files, for off-line
analysis. Because the volume of data will be quite large, storing it in
Redis is out of the question. We also need to avoid relational databases
and other software that needs maintenance and administration. I think
the best way would be to use (CSV) text files into which we "dump" the
records as they are, as soon as they are received.
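As a concrete illustration of point 1, here is a minimal sketch of the
kind of real-time aggregation we have in mind, done on the OpenResty
side with lua-resty-redis (the endpoint, the "metric"/"value" arguments,
and the key layout are all hypothetical):

    -- content_by_lua handler for a hypothetical ingest endpoint (sketch).
    -- Each incoming record updates per-minute sum/count aggregates in Redis.
    local redis = require "resty.redis"

    local red = redis:new()
    red:set_timeout(1000)  -- 1 second

    local ok, err = red:connect("127.0.0.1", 6379)
    if not ok then
        ngx.log(ngx.ERR, "failed to connect to redis: ", err)
        return ngx.exit(500)
    end

    -- "metric" and "value" are hypothetical query arguments
    local args = ngx.req.get_uri_args()
    local metric, value = args.metric, tonumber(args.value)
    if metric and value then
        local key = "agg:" .. metric .. ":" .. os.date("!%Y%m%d%H%M")
        red:init_pipeline()
        red:incrbyfloat(key .. ":sum", value)  -- running sum for this minute
        red:incr(key .. ":count")              -- sample count for this minute
        red:expire(key .. ":sum", 86400)       -- keep aggregates for one day
        red:expire(key .. ":count", 86400)
        local results, err = red:commit_pipeline()
        if not results then
            ngx.log(ngx.ERR, "redis pipeline failed: ", err)
        end
    end

    red:set_keepalive(10000, 100)  -- put the connection back into the pool

But back to the raw data: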
The straightforward solution is to synchronously write the records to
files. But, as agentzh said, "Avoid *heavy* file IO whenever possible in
the context of nginx".
I understand that one suggestion would be to store these records in a
temporary buffer (with lua_shared_dict, or maybe in Redis) and to empty
this buffer from a separate background light thread launched with
ngx.timer.at. To synchronize these threads we would need to use
lua-resty-lock. I will need to read more about this method, as it's
still unclear to me whether init_worker_by_lua + ngx.timer.at starts
more than one light thread (in fact, one per worker, right?). We only
need one light thread doing the work, so the other ones would need to be
"synced out". Is this true?
The other suggestion was to send the data directly "out over socket"
(although I think this was about aggregated data, not raw data). We
thought about a socket server for storing the raw records on disk, but I
don't know what software we should use as a "storage daemon" (as I said,
we want to avoid SQL-based engines and other too-complicated pieces of
software). Maybe syslog-ng? Any other alternatives?
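If we go the socket route, I imagine the OpenResty side would be as
simple as something like this (a sketch; 127.0.0.1:5140 is a
hypothetical listener, e.g. a syslog-ng TCP source):

    -- content_by_lua (sketch): forward one raw record to a storage daemon
    -- over a nonblocking cosocket instead of writing to disk ourselves.
    local sock = ngx.socket.tcp()
    sock:settimeout(1000)  -- 1 second

    local ok, err = sock:connect("127.0.0.1", 5140)
    if not ok then
        ngx.log(ngx.ERR, "cannot reach storage daemon: ", err)
        return
    end

    ngx.req.read_body()
    local record = ngx.req.get_body_data()  -- the raw CSV line from the host
    if record then
        local bytes, err = sock:send(record .. "\n")
        if not bytes then
            ngx.log(ngx.ERR, "send to storage daemon failed: ", err)
        end
    end

    sock:setkeepalive(10000, 100)  -- keep the connection pooled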
Thank you again.
Bogdan
Tuesday, August 26, 2014 8:42 AM
Thanks for taking a look. Some comments:

1. Original raw data. We plan to store the original raw data, collected
from hosts or sensors, on disk for some time. Some pointers about raw
data:
http://www.systemdatarecorder.org/recording/raw.html
Within our analytic software we will have another way to explore the
data, using these flat files.

2. Aggregated data. I understood that we need to process and make all
the aggregations within Redis; Redis has a Lua interpreter inside which
can be used. Are you suggesting we should process and aggregate within
OpenResty instead? Wouldn't that be a problem between the NGINX workers?
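To make the question concrete, aggregating "within Redis" would mean
something like the following hypothetical script, executed atomically
via EVAL (the key layout is just an example):

    -- Hypothetical aggregation script run inside Redis via EVAL.
    -- KEYS[1] = aggregate hash, ARGV[1] = new sample value.
    local sum   = redis.call("HINCRBYFLOAT", KEYS[1], "sum", ARGV[1])
    local count = redis.call("HINCRBY", KEYS[1], "count", 1)
    return { sum, count }

    -- invoked from OpenResty with lua-resty-redis, e.g.:
    --   local res, err = red:eval(script, 1, "agg:cpu:201408260842", 0.42)

Since the script runs atomically inside Redis, concurrent NGINX workers
cannot interleave on the aggregate; each worker just fires the EVAL.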
So our first main problem will be the flat files containing the raw
data. We are probably good to go using lua_shared_dict, but we need to
test. The second part will be to clarify whether we will do all the
calculations in Redis or in OpenResty.

Thanks again,
Tuesday, August 26, 2014 1:14 AM
Hello!
I've just had a closer look at your thread on the redis-db mailing list.
For your use case, I suggest you send the aggregated data (stored in
lua_shared_dict, for example) out over socket every few minutes or
seconds directly, without touching the disk. We're using this approach
for data analytics at CloudFlare. The lua-resty-logger-socket library is
such an example:
https://github.com/cloudflare/lua-resty-logger-socket
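Typical usage looks roughly like this (the host and port are
placeholders; see the README for the full option list):

    -- log_by_lua (adapted from the library's README)
    local logger = require "resty.logger.socket"
    if not logger.initted() then
        local ok, err = logger.init{
            host        = "127.0.0.1",  -- placeholder
            port        = 5141,         -- placeholder
            flush_limit = 4096,         -- buffer this many bytes before sending
            drop_limit  = 1048576,      -- drop data beyond this to cap memory
        }
        if not ok then
            ngx.log(ngx.ERR, "failed to initialize the logger: ", err)
            return
        end
    end

    local bytes, err = logger.log(ngx.var.request_uri .. "\n")
    if err then
        ngx.log(ngx.ERR, "failed to log message: ", err)
    end

Regards, -agentzh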
Tuesday, August 26, 2014 1:04 AM
I see. Thanks a lot. We need to experiment, and, as you said, we will
use lua-resty-lock to handle this part. Very curious to see how things
will go. We will come back with some results later.
Thanks,
Tuesday, August 26, 2014 12:59 AM
Hello!
For ngx_lua, you can just use init_worker_by_lua to initiate a recurring
timer via ngx.timer.at(), in whose handler you can update your files
directly in Lua. To prevent multiple workers doing the same job, you can
guard your timer's operation with lua-resty-lock so that only one worker
is active.
Regards, -agentzh
Monday, August 25, 2014 11:07 PM
Some more clarifications (I work with Bogdan as well):

> Seems like you really need a real database-like service to do this.
> Like Redis, MySQL, or PostgreSQL. They are good at what they're
> supposed to do, talking to them over sockets is very efficient and
> 100% nonblocking. Otherwise we'll have to reinvent most of the
> heavy lifting ourselves in nginx, which may not be that efficient.
We want to build an appliance-like system which does not have any RDBMS
at all; we are going the NoSQL route. See here:
https://groups.google.com/forum/#!topic/redis-db/kINPN4jiLgI
It seems Josiah advised us to pre-process all the raw data outside of
Redis/Lua, since the Lua embedded in Redis is less powerful and more
restricted in scope than a full LuaJIT environment, for example.
So the interesting thing to find out here is whether or not we can
pre-process all the raw data using the lua-resty-lock approach. Meaning
we will have something like:
host1: runs 4 recorders + transporter
    sysrec  -> sysrec.sdrd (flat CSV file)
    cpurec  -> cpurec.sdrd
    diskrec -> diskrec.sdrd
    nicrec  -> nicrec.sdrd
    sender  -> transports minute-based updates via HTTP | HTTPS

host2: runs 4 recorders + transporter
    sysrec  -> sysrec.sdrd (flat CSV file)
    cpurec  -> cpurec.sdrd
    diskrec -> diskrec.sdrd
    nicrec  -> nicrec.sdrd
    sender  -> transports minute-based updates via HTTP | HTTPS
Every minute these files will get new CSV records. We want to transport
these records to our backend and store them there (that sysrec.sdrd
file, for example).
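On the backend, the receiving half could look roughly like this (a
sketch; the location name, the shared dict, and the key scheme are all
hypothetical):

    # nginx.conf (sketch): buffer incoming records in a shared dict,
    # to be flushed to sysrec.sdrd by a background timer later.
    lua_shared_dict rawbuf 10m;

    server {
        location = /ingest/sysrec {
            content_by_lua '
                ngx.req.read_body()
                local record = ngx.req.get_body_data()
                if record then
                    -- unique-ish key so records do not overwrite each other
                    local key = ngx.now() .. ":" .. math.random(1e9)
                    local ok, err = ngx.shared.rawbuf:safe_add(key, record)
                    if not ok then
                        ngx.log(ngx.ERR, "buffer add failed: ", err)
                    end
                end
                ngx.exit(204)
            ';
        }
    }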
> Avoid *heavy* file IO whenever possible in the context of nginx, which
> can never be efficient because there is no such thing called
> "nonblocking file IO" and workarounds like AIO and OS thread pooling
> have their own limitations, complexity, and/or overhead.
Yes, we have already moved our user and subscription management into
Redis, and we are happy with that. But we still need to figure out how
to update some flat files on disk every N minutes, consistently, from
OpenResty or Redis.

Thanks a lot for the comments,