After a month of writing events to Cube, the instance we're running
has hit about 70 GB of data. I wasn't sure what to expect on this
front, so I took a wait-and-see approach to find out how fast it would
actually grow. At this rate we'd hit 840 GB in a year - let's just
round up and call it 1 TB for ease. It's possible for me to secure
that kind of hardware, but I'm also exploring the idea of pruning data
from Cube so it can run on something with less disk space. I spent
some time today looking at how Cube is using Mongo and saw,
unsurprisingly, that the raw events collections are by far the
largest. The metrics collections are tiny in comparison. The one
collection I didn't know would be there was cube_request_events; it
stuck out because it was as large as our largest events collection.
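For reference, this is roughly how I compared collection sizes - a
quick pymongo sketch, assuming a local mongod and a database named
"cube"; adjust the host, port, and database name to match however your
collector is configured:

    # Compare the size of each collection in Cube's database.
    # Assumes pymongo and a database named "cube" (adjust as needed).
    from pymongo import MongoClient

    db = MongoClient("localhost", 27017)["cube"]

    for name in db.list_collection_names():
        stats = db.command("collstats", name)
        size_gb = stats["size"] / 1024.0 ** 3
        print(f"{name:40s} {size_gb:8.2f} GB")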
According to
https://github.com/square/cube/wiki/Evaluator#wiki-metric_get,
the five supported intervals are 10s, 1m, 5m, 1h, and 1d. In thinking
about how to keep this from growing without bound, it seems to me that
as long as I consistently query Cube for the intervals I'm interested
in, Cube will keep adding to the capped metric collections, so that
after a day at most I have the aggregate data I need and no longer
need the raw event data lying around. That means I can just script the
deletion with a cron job or whatever.
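To be concrete about "consistently querying", I mean something like
the sketch below. I'm assuming the evaluator is on its default port
(1081), that the step values for the five intervals are 1e4, 6e4, 3e5,
36e5, and 864e5 ms, and sum(request) is just a placeholder for
whatever expressions we actually care about:

    # Sketch of the "keep the metric caches warm" step.
    # Assumptions: evaluator on localhost:1081 (the default, I think),
    # step values of 1e4/6e4/3e5/36e5/864e5 ms for the five intervals,
    # and sum(request) standing in for the expressions we really query.
    from datetime import datetime, timedelta, timezone
    from urllib.parse import urlencode
    from urllib.request import urlopen

    EVALUATOR = "http://localhost:1081/1.0/metric"
    EXPRESSIONS = ["sum(request)"]           # placeholder expressions
    STEPS_MS = [1e4, 6e4, 3e5, 36e5, 864e5]  # 10s, 1m, 5m, 1h, 1d

    stop = datetime.now(timezone.utc)
    start = stop - timedelta(days=1)

    for expression in EXPRESSIONS:
        for step in STEPS_MS:
            params = urlencode({
                "expression": expression,
                "start": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "stop": stop.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "step": int(step),
            })
            # Issuing the query is what prompts Cube to compute the
            # metric and cache it in the capped <type>_metrics
            # collection.
            urlopen(f"{EVALUATOR}?{params}").read()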
If my goal is to keep disk usage from growing forever, does that
approach seem reasonable? My understanding is that if I need the data
long term, I'd have to make sure Cube has had the chance to aggregate
and write to the capped collections by issuing every query I care
about before pruning the events.
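The pruning step itself would then be something like the following,
run nightly from cron after the warm-up queries above. This is just a
sketch - it assumes the raw event documents keep their timestamp in
the "t" field and that a week of slack is plenty:

    # Sketch of the nightly prune I have in mind. Assumes raw event
    # documents carry their timestamp in the "t" field and that the
    # database is named "cube" (adjust to your setup).
    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    KEEP_DAYS = 7  # arbitrary safety margin; the metrics only need a day

    db = MongoClient("localhost", 27017)["cube"]
    cutoff = datetime.now(timezone.utc) - timedelta(days=KEEP_DAYS)

    for name in db.list_collection_names():
        # Note: this also catches cube_request_events, which is part of
        # what I'm asking about below.
        if name.endswith("_events"):
            result = db[name].delete_many({"t": {"$lt": cutoff}})
            print(f"{name}: removed {result.deleted_count} events "
                  f"older than {cutoff}")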
Also, I'd like to be able to prune old data from the
cube_request_events collection, since it is so large. Will doing that
cause Cube any problems? Is that data used for anything long term?