Few questions regarding HdfsStatsService

Shrijeet Paliwal

May 30, 2014, 3:52:42 PM

to hrave...@googlegroups.com

Hey Guys,

I have a few questions about how data is populated for the HDFS stats service. Most of the fields for a row in the HBase stats table can be extracted from an offline NN image; what are the other sources of information? Is the namenode audit log one of the sources? Finally, what are 'tmpFileCount' and 'tmpSpaceConsumed', and how are they populated?

-Shrijeet

Vrushali Channapattan

May 30, 2014, 4:03:47 PM

to Shrijeet Paliwal, hrave...@googlegroups.com

Hi Shrijeet,

Yes, NN audit logs are another source of information; they populate the access counts for a given directory for a given hour.

The HDFS stats service can give a time series of how a particular directory's stats changed over time. It can also tell you what the stats of a cluster were at a given point in time. The NN fsimage gives you only a snapshot as of the time the fsimage was taken.

The tmpFileCount and tmpSpaceConsumed columns are accounted for as a special case. Under /tmp on HDFS, there are files and directories like '/tmp/temp1426738839'. These directory names don't mean much if stored as /tmp/<name>. Hence we get the owner of each of these directories and account the stats under tmpFileCount and tmpSpaceConsumed for the path '/user/<owner>'.
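
A minimal sketch of the /tmp accounting described above, assuming each entry under /tmp is available as a record with its path, owner, file count, and space consumed. The `TmpEntry` record, the `account_tmp_usage` function, and the sample values are hypothetical and are not taken from hRaven's code; only the column names tmpFileCount and tmpSpaceConsumed and the '/user/<owner>' path come from the explanation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TmpEntry:
    """Hypothetical record for one file or directory directly under /tmp."""
    path: str
    owner: str
    file_count: int
    space_consumed: int  # bytes


def account_tmp_usage(entries):
    """Roll /tmp usage up to '/user/<owner>' instead of '/tmp/<name>'."""
    stats = defaultdict(lambda: {"tmpFileCount": 0, "tmpSpaceConsumed": 0})
    for entry in entries:
        row = stats["/user/" + entry.owner]
        row["tmpFileCount"] += entry.file_count
        row["tmpSpaceConsumed"] += entry.space_consumed
    return dict(stats)


# Made-up sample data: two /tmp directories owned by alice, one by bob.
entries = [
    TmpEntry("/tmp/temp1426738839", "alice", 12, 4096),
    TmpEntry("/tmp/temp1426738840", "alice", 3, 1024),
    TmpEntry("/tmp/temp1426738841", "bob", 1, 512),
]
usage = account_tmp_usage(entries)
```

With this grouping, the opaque /tmp/<name> entries never appear as paths of their own; their usage shows up as two extra counters on the owner's '/user/<owner>' row.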

Let me know if you have any more questions or suggestions!

thanks

Vrushali

Shrijeet Paliwal

May 30, 2014, 4:20:19 PM

to Vrushali Channapattan, hrave...@googlegroups.com

That was quick, thanks Vrushali.

Yes, the fsimage is a snapshot. What I meant to ask was: do you take an fsimage snapshot every hour, parse it (along with the audit logs), and insert into the HBase table, or do you have a different way?

I don't have access to the wiki page (it must be internal), but I understood your explanation of the tmp columns; it makes sense now.

A comment on the table schema: with the timestamp in the leading position of the row key, we would always insert data into one region. Given that we care about time series, the options are limited (bucketed timestamps etc., or a fixed-length cluster name in the leading part). I am guessing that, given the limited ingestion and query rate, the choice was driven by simplicity.
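
For illustration, one of the alternatives mentioned above (a bucketed, or salted, key prefix) might look like the sketch below. The bucket count, the separator, the field order, and the function names are all assumptions made for the example; none of this is the hRaven schema.

```python
import zlib

NUM_BUCKETS = 8  # assumed bucket count
SEP = "!"        # assumed field separator


def bucketed_row_key(cluster, hour_ts, path):
    """Prefix the key with a stable bucket derived from the path.

    Rows written for the same hour then spread over NUM_BUCKETS key ranges
    instead of landing next to each other.
    """
    bucket = zlib.crc32(path.encode("utf-8")) % NUM_BUCKETS
    return SEP.join([f"{bucket:02d}", str(hour_ts), cluster, path])


def scan_prefixes(hour_ts):
    """A time-based read now has to fan out to one prefix per bucket."""
    return [SEP.join([f"{b:02d}", str(hour_ts)]) + SEP for b in range(NUM_BUCKETS)]


# Hypothetical cluster name, hourly epoch timestamp, and path.
key = bucketed_row_key("cluster1", 1401480000, "/user/alice")
prefixes = scan_prefixes(1401480000)
```

The trade-off is visible in `scan_prefixes`: writes are spread out, but every time-range query becomes NUM_BUCKETS scans whose results must be merged.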

--
Shrijeet

On Fri, May 30, 2014 at 12:58 PM, Vrushali Channapattan <vrus...@twitter.com> wrote:

Some more processing details can be found here:
https://confluence.twitter.biz/display/CORESTORAGE/Processing+hdfs+usage+stats#Processinghdfsusagestats-How/tmpisaccountedfor


Vrushali Channapattan

May 30, 2014, 5:05:21 PM

to Shrijeet Paliwal, hrave...@googlegroups.com

Thanks! So, presently we collect stats by querying the NN itself each hour.

Having the timestamp in the leading part of the row key was a design choice based on the expectation that we would always query for a particular timestamp (or range of timestamps) but would not know the directory path to query for. So "given a cluster, fetch all the HDFS dirs as of a particular time" is easier to do with HBase scans/gets when the cluster and timestamp occur in the leading part of the row key, followed by the actual path, than vice versa. Insertion happens only once per hour, so it's not really much of a "hot" region problem while collecting, but it makes querying much faster.
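
A small sketch of why this ordering helps, using a sorted Python list as a stand-in for HBase's lexicographically sorted row keys. The separator, the zero-padded timestamp encoding, the cluster names, and the helper names are assumptions for illustration; the actual hRaven key encoding may differ.

```python
import bisect

SEP = "!"  # assumed field separator


def row_key(cluster, hour_ts, path):
    # Zero-pad the timestamp so that string order matches numeric order.
    return SEP.join([cluster, f"{hour_ts:010d}", path])


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with prefix as one contiguous range."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


# Hypothetical rows: two clusters, two hourly timestamps, two paths.
keys = sorted([
    row_key("cluster1", 1401476400, "/user/alice"),
    row_key("cluster1", 1401480000, "/user/alice"),
    row_key("cluster1", 1401480000, "/user/bob"),
    row_key("cluster2", 1401480000, "/user/alice"),
])

# "Given a cluster, fetch all dirs as of a particular time" is one prefix.
hits = prefix_scan(keys, SEP.join(["cluster1", "1401480000"]) + SEP)
```

Because all rows for one cluster and hour sit next to each other, the query needs no knowledge of the paths. With the path in the leading position, the matching rows would be scattered across the key space and could not be read as a single range.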

thanks

Vrushali

Shrijeet Paliwal

May 30, 2014, 5:24:01 PM

to Vrushali Channapattan, hrave...@googlegroups.com

That's what I thought. Thanks for explaining, Vrushali.
