Few questions regarding HdfsStatsService

Shrijeet Paliwal

May 30, 2014, 3:52:42 PM
to hrave...@googlegroups.com
Hey Guys, 

I have a few questions about how data is populated for the HDFS stats service. Most of the fields for a row in the HBase stats table can be extracted from an offline NN image; what are the other sources of information? Is the NameNode audit log one of them? Finally, what are 'tmpFileCount' and 'tmpSpaceConsumed', and how are they populated?

-Shrijeet

Vrushali Channapattan

May 30, 2014, 4:03:47 PM
to Shrijeet Paliwal, hrave...@googlegroups.com
Hi Shrijeet,

Yes, NN audit logs are another source of information; they populate the access counts for each directory for that hour.
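
A minimal sketch of deriving hourly per-directory access counts from audit log lines. It assumes the usual audit line shape (timestamp first, then key=value fields including src=). The function name and the choice to count each access against the parent directory of src are illustrative assumptions, not hRaven's actual implementation.

```python
import re
from collections import Counter
from posixpath import dirname

# Captures "YYYY-MM-DD HH" from the leading timestamp and the src= path.
# Assumed line shape: "<date> <time>,<ms> INFO FSNamesystem.audit: ... src=<path> ..."
AUDIT_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}.*?\bsrc=(\S+)")


def hourly_access_counts(lines):
    """Count accesses per (hour, directory) from an iterable of audit lines."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match is None:
            continue  # skip lines that are not audit entries
        hour, src = match.groups()
        # Attribute the access to the directory containing the accessed path.
        counts[(hour, dirname(src))] += 1
    return counts
```

Each key in the returned counter is an (hour, directory) pair, which lines up with "access counts for that hour for that directory".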

The HDFS stats service can give a time series of how a particular directory's stats changed over time. It also tells you what the stats of that cluster were at a point in time. The NN fsimage gives you only a snapshot as of the time the fsimage was taken.

tmpFileCount and tmpSpaceConsumed are accounted for as a special case. Under /tmp on HDFS there are files and directories such as "/tmp/temp1426738839". These directory names don't mean much if stored as /tmp/<name>. Hence we look up the owner of each such directory and account its stats under tmpFileCount and tmpSpaceConsumed for the path '/user/<owner>'.
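
A minimal sketch of that roll-up, assuming each /tmp entry is available as a (path, owner, file_count, space_consumed) tuple. The function name and the input shape are hypothetical; only the column names and the '/user/<owner>' target path come from the description above.

```python
from collections import defaultdict


def account_tmp_stats(tmp_entries):
    """Roll up stats of entries under /tmp into per-owner totals.

    tmp_entries: iterable of (path, owner, file_count, space_consumed).
    Returns {"/user/<owner>": {"tmpFileCount": int, "tmpSpaceConsumed": int}}.
    """
    totals = defaultdict(lambda: {"tmpFileCount": 0, "tmpSpaceConsumed": 0})
    for path, owner, file_count, space_consumed in tmp_entries:
        if not path.startswith("/tmp/"):
            continue  # only entries under /tmp get the special-case accounting
        row = totals["/user/" + owner]
        row["tmpFileCount"] += file_count
        row["tmpSpaceConsumed"] += space_consumed
    return dict(totals)
```

With this shape, two temp directories owned by the same user end up as a single pair of counters on that user's '/user/<owner>' row.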

Let me know if you have any more questions or suggestions!

thanks
Vrushali

Shrijeet Paliwal

May 30, 2014, 4:20:19 PM
to Vrushali Channapattan, hrave...@googlegroups.com
That was quick, thanks Vrushali. 

Yes, the fsimage is a snapshot. What I meant to ask was: do you take an fsimage snapshot every hour, parse it (along with the audit logs), and insert into the HBase table, or do you have a different way?

I don't have access to the wiki page (it must be internal), but I understood your explanation of the tmp columns; it makes sense now.

A comment on the table schema: with the timestamp in the leading position of the row key, we would always insert data into one region. Given that we care about time series, the options are limited (bucketed timestamps etc., or a fixed-length cluster name in the leading part). I am guessing that, given the limited ingestion and query rate, the choice was driven by simplicity.
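
For reference, a minimal sketch of the bucketed (salted) alternative mentioned above. The bucket count, the "!" separator, the zero-padded field widths, and both function names are hypothetical; this is not the hRaven schema. A stable hash of the path picks a bucket prefix so that one hour's writes spread across several key ranges, at the cost of one scan per bucket when reading.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; would be tuned to the number of regions


def salted_row_key(timestamp, cluster, path):
    """Prefix the key with a bucket derived from a stable hash of the path."""
    bucket = zlib.crc32(path.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}!{timestamp:010d}!{cluster}!{path}"


def scan_prefixes(timestamp, cluster):
    """All prefixes a reader must scan to fetch one cluster at one time."""
    return [f"{b:02d}!{timestamp:010d}!{cluster}!" for b in range(NUM_BUCKETS)]
```

The trade-off is visible in scan_prefixes: a "cluster at time T" query becomes NUM_BUCKETS scans instead of one.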

--
Shrijeet



Vrushali Channapattan

May 30, 2014, 5:05:21 PM
to Shrijeet Paliwal, hrave...@googlegroups.com
Thanks! So, presently we collect stats by querying the NN itself each hour.

Having the timestamp in the leading part of the row key was a design choice: we would always query for a particular timestamp (or range of timestamps), but we would not know the directory path to query for. So "given a cluster, fetch all the HDFS dirs as of a particular time" is easier to do with HBase scans/gets when the cluster and timestamp occur in the leading part of the row key, followed by the actual path, than vice versa. Insertion happens only once per hour, so it's not really much of a "hot" region problem while collecting, but it makes querying much faster.
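
A minimal sketch of why that ordering helps, using a sorted Python list as a stand-in for HBase's lexicographically sorted rows. The exact key layout (hour-truncated timestamp, then cluster, then path, joined by "!") and both function names are assumptions for illustration; the thread does not spell out the byte-level format.

```python
import bisect

SEP = "!"  # hypothetical separator between key components


def make_row_key(timestamp, cluster, path):
    """Build a key with timestamp and cluster leading, path last."""
    hour = timestamp - (timestamp % 3600)  # truncate to the collection hour
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{hour:010d}{SEP}{cluster}{SEP}{path}"


def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix, like a start-row prefix scan."""
    start = bisect.bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        hits.append(key)
    return hits
```

Calling prefix_scan(keys, make_row_key(ts, "cluster1", "")) returns all paths for that cluster and hour in one contiguous range, without knowing any path in advance. With the path leading instead, the same query would have to touch the whole table.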

thanks
Vrushali


Shrijeet Paliwal

May 30, 2014, 5:24:01 PM
to Vrushali Channapattan, hrave...@googlegroups.com
That's what I thought. Thanks for explaining, Vrushali.
 


