You received this message because you are subscribed to the Google Groups "hraven-user" group.
Hi Shrijeet,
The hdfs stats service can give a timeseries of how a particular directory's stats changed over time. It also tells you at a point in time what the stats were of that cluster. The NN fsimage can give you only the snapshot of the time when the fsimage was taken.
Yes, nn audit logs are another source of information which populates the access counts for that hour for that directory.
The tmpFileCount and tmpSpaceConsumed are accounted for as a special case. Under /tmp on hdfs, there are files and directories like "/tmp/temp1426738839". These directory names don't mean much if stored as /tmp/<name>. Hence we get the owner of these directories and account the stats under tmpFileCount and tmpSpaceConsumed for path '/user/<owner>'.
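The /tmp special case described above could be sketched roughly as follows. This is only an illustration: the function name `account_tmp_usage` and the `(path, owner, file_count, space_consumed)` tuple layout are assumptions made for the example, not the actual hdfs stats code.

```python
from collections import defaultdict


def account_tmp_usage(tmp_entries):
    """Roll up entries under /tmp into per-owner totals keyed by '/user/<owner>'.

    tmp_entries: iterable of (path, owner, file_count, space_consumed) tuples.
    The tuple layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(lambda: {"tmpFileCount": 0, "tmpSpaceConsumed": 0})
    for path, owner, file_count, space_consumed in tmp_entries:
        if not path.startswith("/tmp/"):
            continue  # only entries under /tmp get the special-case treatment
        key = "/user/" + owner  # charge the usage to the owner's home path
        totals[key]["tmpFileCount"] += file_count
        totals[key]["tmpSpaceConsumed"] += space_consumed
    return dict(totals)
```

With this shape, two /tmp directories owned by the same user end up as a single '/user/<owner>' entry instead of two opaque /tmp/<name> rows.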
Some more processing details can be found here:
https://confluence.twitter.biz/display/CORESTORAGE/Processing+hdfs+usage+stats#Processinghdfsusagestats-How/tmpisaccountedfor
Let me know if you have any more questions or suggestions!
thanks
Vrushali
On Fri, May 30, 2014 at 12:52 PM, Shrijeet Paliwal <shrijeet...@gmail.com> wrote:
Thanks! So, presently we collect stats by querying the NN itself each hour.
We put the timestamp in the leading part of the row key because we always query for a particular timestamp (or range of timestamps), but we do not know in advance which directory path to query for. Given a cluster, fetching all the hdfs dirs as of a particular time is easier to do with hbase scans/gets when cluster and timestamp occur in the leading part of the row key, followed by the actual path, than vice versa. Insertion happens only once per hour, so it's not really that much of a "hot" region problem while collecting, but it makes querying much faster.
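To illustrate why that key order helps, here is a small sketch that models the table as a sorted list of string keys. The "!" separator, the zero-padded timestamp, and the helper names `row_key` and `prefix_scan` are all assumptions for the example; the real key encoding is not spelled out in this thread.

```python
import bisect

SEP = "!"  # hypothetical separator, not necessarily the real encoding


def row_key(cluster, hour_ts, path):
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{cluster}{SEP}{hour_ts:010d}{SEP}{path}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix from a lexicographically sorted list.

    Keys sharing a prefix are contiguous in sorted order, so one seek plus a
    forward read is enough. This mimics a prefix scan over sorted row keys.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits
```

Scanning with the prefix `row_key("c1", 1401480000, "")` returns every path recorded for cluster c1 at that hour in one contiguous read. With the path leading the key instead, the same question would need a full scan, since the paths are not known up front.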
thanks
Vrushali
On Fri, May 30, 2014 at 1:19 PM, Shrijeet Paliwal <shrijeet...@gmail.com> wrote:
That was quick, thanks Vrushali.
Yes, fsimage is a snapshot. What I meant to ask was: do you take the fsimage snapshot every hour, parse it (along with audit logs) and do inserts into the HBase table, or do you have a different way?
I don't have access to the wiki page (must be internal), but I understood your explanation of the tmp columns; it makes sense now.
A comment on the table schema: with the timestamp in the leading position of the row key, we would always insert data into one region. Given that we care about time series, the options are limited (bucketed timestamps etc., or a fixed-length cluster name in the leading part). I am guessing that, given the limited ingestion and query rate, the choice must have been driven by simplicity.
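One possible reading of the "bucketed" option is to put a small bucket prefix in front of the key, so that writes for the same hour spread across several regions. The sketch below assumes that reading; the bucket count, the crc32-based bucket choice, and the key layout are all hypothetical and not something either schema in this thread is known to use.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count


def bucketed_row_key(cluster, hour_ts, path, num_buckets=NUM_BUCKETS):
    # Derive a stable bucket from the path so one path always maps to one bucket.
    bucket = zlib.crc32(path.encode("utf-8")) % num_buckets
    return f"{bucket:02d}!{cluster}!{hour_ts:010d}!{path}"


def scan_prefixes(cluster, hour_ts, num_buckets=NUM_BUCKETS):
    # Reading one hour now needs one prefix scan per bucket.
    return [f"{b:02d}!{cluster}!{hour_ts:010d}!" for b in range(num_buckets)]
```

The trade-off is visible in `scan_prefixes`: writes no longer land in a single region, but every time-based read fans out into `num_buckets` scans, which is the extra complexity the simpler schema avoids.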
--
Shrijeet
On Fri, May 30, 2014 at 12:58 PM, Vrushali Channapattan <vrus...@twitter.com> wrote:
Hi Shrijeet,
The hdfs stats service can give a timeseries of how a particular directory's stats changed over time. It also tells you at a point in time what the stats were of that cluster. The NN fsimage can give you only the snapshot of the time when the fsimage was taken.
Yes, nn audit logs are another source of information which populates the access counts for that hour for that directory.
The tmpFileCount and tmpSpaceConsumed are accounted for as a special case. Under /tmp on hdfs, there are files and directories like "/tmp/temp1426738839". These directory names don't mean much if stored as /tmp/<name>. Hence we get the owner of these directories and account the stats under tmpFileCount and tmpSpaceConsumed for path '/user/<owner>'.
Let me know if you have any more questions or suggestions!
thanks
Vrushali