Hi,
We are considering using Redis for a future analytics project, where we need to process time-series data, for example
computer performance data. We plan to record data every minute from many hosts and send it over HTTP or HTTPS
to our analytics platform. All raw data is in a colon-separated CSV format, like:
timestamp (seconds since Epoch):value:value:value ...
1. Per-CPU data:
1408658702:0:2.06:0.65:2.25:0.10:94.95:5.05
1408658702:1:2.04:0.52:2.12:0.16:95.16:4.84
1408658702:2:2.49:0.56:2.03:0.05:94.87:5.13
1408658702:3:2.48:0.55:1.99:0.02:94.97:5.03
1408658703:0:0.00:0.00:0.79:0.00:99.21:0.79
1408658703:1:2.36:0.00:1.57:0.79:95.28:4.72
1408658703:2:0.00:0.00:2.33:0.00:97.67:2.33
1408658703:3:0.00:0.00:0.79:0.00:99.21:0.79
2. Overall System data:
1408659072:4.99:19.94:380.06:2.27:0.56:2.08:0.08:95.01:0.00:0.00:0.00:0.00:125.00:353164.00:860568.00:871112.00:29916.00:3848060.00:4738544.00:0.00:0.00:0.00:0.00:0.00:0.00:0.00:0.00:0.00:0.30:0.30:0.28
1408659073:0.59:2.35:397.65:0.00:0.00:0.59:0.00:99.41:0.00:0.00:0.00:0.00:125.00:353168.00:860568.00:871192.00:29916.00:3847976.00:4738460.00:0.00:0.00:0.00:154.48:648.83:0.02:0.12:0.00:3.20:0.30:0.30:0.28
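For illustration, each record above splits into a timestamp plus a list of numeric fields. A minimal Python sketch (the function name and float parsing are my assumptions about how one might consume these rows):

```python
def parse_record(line):
    """Split a colon-separated record into (timestamp, values).

    The first field is seconds since the Epoch; the remaining fields are
    numeric metric values (their meaning depends on the record type).
    """
    fields = line.strip().split(":")
    return int(fields[0]), [float(f) for f in fields[1:]]

# Example using a per-CPU row from above:
ts, values = parse_record("1408658702:0:2.06:0.65:2.25:0.10:94.95:5.05")
```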
We are using OpenResty (NGINX + Lua) and RRDtool at the moment. We are thinking of processing all raw data in Redis and
simplifying our pipeline by dropping RRDtool.
Can anyone comment on the following:
1. Is it possible to maintain sliding time windows (last 3 hrs, last 12 hrs, last 24 hrs, etc.) within Redis and
present the associated statistics? We would like AVG, MIN, MAX and LAST. Do we need to
calculate each statistic ourselves?
2. We will basically not be able to keep a lot of raw data inside Redis if we want to retain 1 or 2 years of data. We
want to somehow move the raw data to flat files after some period of time has passed. How can we do that?
How could one extract or move parts of the raw data to flat files?
3. Do we need to process all raw data before displaying the stats? Normalization, like RRDtool does?
4. If we need to offer different types of statistical functions, where and how would we calculate them?
Directly in Redis? In Lua?
I am currently reading about Redis and find it very interesting for our project. Sorry if these questions have been asked before.
Thanks a lot,
--
Stefan
You received this message because you are subscribed to the Google Groups "Redis DB" group.
Many thanks. See my answers below:
> You will always need to calculate average, the other 3 aggregates (MIN,
> MAX, LAST) might be computable with Redis internal commands, depending on
> the data representation.
Right, I need to read up and understand which data structures we will really need in order to get useful statistics.
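As a sketch of what those four aggregates involve, assuming the window's samples are available as (timestamp, value) pairs, whether computed client-side or inside a Redis Lua script, here it is in Python:

```python
def window_stats(samples):
    """Compute AVG, MIN, MAX and LAST over a window of (timestamp, value)
    pairs, e.g. the last 3 hours of one metric. A stand-in for what a
    Redis-side Lua script or the client would compute."""
    values = [v for _, v in samples]
    last = max(samples)[1]  # value at the highest timestamp
    return {
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "last": last,
    }

stats = window_stats([(1408658702, 2.06), (1408658703, 0.00), (1408658704, 1.50)])
```

Only AVG genuinely has to be derived; MIN, MAX and LAST can fall out of how the data is stored (e.g. a sorted set), as the quoted reply notes.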
> Well, there are 1440 minutes/day. I estimate CPU rows to be about 55 bytes
> long on the upper end. That's 29 megs/year per CPU. System rows look to be
> about 250 bytes on the upper end, so 131.5 megs/year per system. That's not
> a lot if you don't have many machines, but I'm guessing you've got more
> than a few machines to record.
We could have from 500 up to 5000 hosts to monitor. For the large
configuration we will probably need more than 64GB of RAM. But we probably
need to get rid of the raw (unprocessed) data almost as soon as it arrives,
moving it to a flat file on disk (we want to keep the CSV records for future
archiving ...)
> How to pull data out will depend on how the data is stored in Redis itself.
> I doubt you will actually be storing the raw rows in Redis (it may make
> sense to pass through a Lua script for processing/aggregation, but it
> doesn't make sense to store the non-processed data in Redis for much longer
> than it takes to write the data to disk), so I would suggest just sending
> the data into Redis while at the same time appending to a flat file on
> disk. You can periodically rotate the flat file, backing up the old file
> anywhere you want.
Exactly, that's what I was also thinking: we could store the aggregated data in Redis
for up to 6 months or whatever else, but only the aggregated data. And right, the
aggregation part could be done 100% in Lua, within Redis.
As I understand it, Redis embeds a Lua interpreter, version 5.1, right? So we could
process all raw data within Redis/Lua and append every record to a flat file on disk.
Then we could keep the aggregated values in Redis for our dashboards
and off-load the rest to disk.
> If you want to keep local filesystems out of the loop, you can have an
> analytics Lua script analyze your rows and add them to a "pending disk
> write" LIST after.
Interesting, but that will probably increase memory usage if the list grows.
The first approach is probably simpler: as each raw data record arrives and is processed,
it is sent to the raw-data flat file on disk. Is there any guarantee regarding blocking /
non-blocking file I/O? I suppose that while the Lua function executes, it blocks other
activity until the record is flushed to disk? Or how does this happen within Redis?
> I would suggest that you process lines as they come in, then your dashboard
> that displays the data basically just performs a few commands to fetch the
> data, possibly calculating the average, and displaying it.
Right.
> That depends on whether you are primarily processing your data outside
> Redis (using the typical API to update in-Redis stats) or inside Redis
> using Lua. Both have benefits and drawbacks, but generally I'd suggest
> sticking with using Lua inside Redis for actually processing your data.
OK, I was thinking the same: we could use Lua within Redis. I don't yet understand
how Redis will behave if data arrives from many hosts at the same time.
Are the requests processed one by one? Redis is single-threaded and single-process,
so there is no concurrency within it, or is there?
> In terms of an analytics system design; how you would store your data
> depends on the API you want for reading the data, how precise you need your
> sliding window (1-hour granularity is easy, 1-minute granularity is less
> easy), and a few other things. In section 5.2 of Redis in Action[1], I
> cover basic statistics and how you can do min, max, average, and standard
> deviation in Redis. It's primarily focused on small numbers of counters, so
> won't work as well for many CPU/system counters like what you are looking
> to solve.
We want to present data for the last 3 hrs, 6 hrs, 12 hrs, 24 hrs, 3 days, 7 days, 30 days
and 90 days, for example. RRDtool built its own types of archives (RRAs) which kept these
stats. We could vary the granularity based on the type of archive we want to keep or
present:
5 min granularity for 3 hr stats
15 min granularity for 6 hr stats
30 min granularity for 12 hr stats
1 hr granularity for 24 hr stats
3 hr granularity for 3 day stats
...
Not sure how easy it would be to build something like this in Redis/Lua.
Super, we have already ordered the book; I am waiting for it and need to read that part.
Thanks for the pointer.
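The RRA-style ladder above could be modeled as a (window, bucket size) table plus a function that rounds a timestamp down to its bucket start; the resulting string could serve as a Redis key, though the key naming scheme here is purely an assumption for illustration:

```python
# (window, bucket size) in seconds, mirroring the RRA-style ladder above
ARCHIVES = [
    (3 * 3600, 5 * 60),     # last 3 hrs at 5 min resolution
    (6 * 3600, 15 * 60),    # last 6 hrs at 15 min
    (12 * 3600, 30 * 60),   # last 12 hrs at 30 min
    (24 * 3600, 3600),      # last 24 hrs at 1 hr
    (3 * 86400, 3 * 3600),  # last 3 days at 3 hrs
]

def bucket_key(metric, ts, bucket):
    """Round ts down to the start of its bucket and build a key of the form
    stats:<metric>:<bucket>:<bucket start> (naming is an assumption)."""
    start = ts - ts % bucket
    return "stats:%s:%d:%d" % (metric, bucket, start)

key = bucket_key("cpu.user", 1408658702, 300)
```

Every incoming sample would then be folded into one counter hash per archive resolution, and expired keys dropped as the window slides.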
> But ultimately, how to store data, compute on the data, etc., will depend
> on the access patterns you expect to have. For what it's worth, I've built
> real-time analytics systems using Redis 3 times now, one of which could
> ingest 40k rows/second of logs on a single Redis server. I don't see a
> reason why you couldn't build a similar system for your CPU and System
> analytics... Though it brings up an interesting question: why not just use
> Graphite?
I see. It sounds like a bit of work, but worth doing.
We used RRDtool / Perl for a long time, with no big troubles. But lately we have been
discovering OpenResty and Lua, and the performance has been fantastic. We wanted to move
away from RRDtool plotting to a JavaScript, JSON-based library. So we were thinking about
how we could do all these things easily, without many components around, and perform as
much as we can within Lua.
Then we started to read about in-memory databases, and I was thinking we could calculate
some dashboard numbers and keep them in memory to speed up access.
We want to minimize the number of trips from our authentication layer to the storage
and processing layer, and perform as much as possible within Lua.
So I was thinking to perform all numerical processing in memory, without
accessing RRDtool via Lua etc., and pass a JSON record to the JavaScript plotting
library. We have never used Graphite.
Thanks again for the explanations.
--
Stefan Parvu <spa...@systemdatarecorder.org>
Cheers. On Monday I should get your book, and I need to start reading and digging into
all your detailed answers.
I need to clarify on my side:
- what the dashboard would be. We plan to make an analytics product for
weather and climate data and for computer performance: 2 different areas.
There will be differences in metrics, granularity, many things.
- how many metrics I really want to show on the dashboard
- how much RAM I will need to keep the counters for both
- we plan, for light installations (< 5 hosts), to build the solution around
Raspberry Pi hardware, so we will have to plan the metrics and the
dashboard carefully.
About Lua and Redis: most likely I will need to pre-process the raw
data on disk/flat files within OpenResty and then send each raw data record to Redis for
counter updates.
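A sketch of that counter-update step, with a plain Python dict standing in for a Redis hash so the logic is testable without a server (in Redis the count/sum updates would map to HINCRBY/HINCRBYFLOAT, with a small Lua script handling min/max/last atomically):

```python
def update_counters(store, key, value):
    """Fold one sample into a per-bucket counter hash. `store` is a plain
    dict standing in for Redis; the dashboard can later derive the average
    as sum/count and read min/max/last directly."""
    h = store.setdefault(key, {"count": 0, "sum": 0.0,
                               "min": value, "max": value, "last": value})
    h["count"] += 1
    h["sum"] += value
    h["min"] = min(h["min"], value)
    h["max"] = max(h["max"], value)
    h["last"] = value
    return h
```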
Thanks again for all the good advice. I will start working on this and will post my
progress with Redis and OpenResty later.
Sorry for the delay. I finally got your book. It is well written; I like it very much.
I just want to give a slightly different response, for perspective:
Redis is awesome - I love me some Redis, but that doesn't mean it is the best tool for every job. In the case of long running time-series operations, I *personally* would give serious consideration to things like Cassandra. Not because Redis *can't* be made to do it - but because it is *perhaps* a more natural fit for something like Cassandra, rather than forcing a square peg into a round hole. This should in no way be seen as a criticism of Redis, and is simply a "pick tools for the jobs you need to do, not jobs for the tools you already have" thing...
Just my tuppence.
Marc