In a Hive table, how to decide the number of buckets?


Chandra Reddy

May 31, 2013, 5:18:30 PM
to chenn...@googlegroups.com
Hi Folks,
Please help me with the query below:

How do we decide the number of buckets for a Hive table when clustering?
Is it based on the size of each bucket (and/or the Hadoop block size)?
Do we need to consider the number of data nodes available?
Do we need to consider the number of map and/or reduce tasks available?

Any thoughts? Your inputs are much appreciated. Thanks a lot.

Example of Hive bucketing:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS;

--
Thanks,
-Chandra.

Senthil Kumar

Jun 2, 2013, 6:42:12 AM
to chenn...@googlegroups.com
Buckets should be chosen based on the queries that will be run in Hive.

Buckets are used to reduce the number of input files processed by your query.

Senthil

sudhakar kurakula

Jun 2, 2013, 2:42:32 PM
to chenn...@googlegroups.com
Senthil, you are wrong.

Buckets are used to store data; they specify how the data is arranged at a further level (after partitioning).




Senthil Kumar

Jun 2, 2013, 8:48:17 PM
to chenn...@googlegroups.com

Sudhakar,

Let me put it in other words.

When you bucket your data on a particular column (userid), each record is stored in a bucket file chosen by hash(key) modulo the number of buckets.

For example, suppose you have 10 records in the following format (original data):
1,senthil
2,kumar
3,siva
4,senthil2
5,adhi
6,vignesh
7,peter
8,stefen
9,doug
10,alan


If you create 4 buckets on the above data, your (partition) directory in HDFS will have four files, one per bucket.
bucket-0 will contain:
4,senthil2
8,stefen

bucket-1 will contain:
1,senthil
5,adhi
9,doug

Records are assigned based on the remainder of userid divided by the number of buckets.
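
To actually get data laid out this way, a rough sketch (the table and column names here are hypothetical) would be to create the bucketed table and enforce bucketing while inserting:

CREATE TABLE users (userid INT, name STRING)
CLUSTERED BY (userid) INTO 4 BUCKETS;

-- without this, Hive (0.x) will not split INSERT output into bucket files
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE users
SELECT userid, name FROM users_staging;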

Coming to the query: SELECT * FROM table1 WHERE userid = 4;

How many files need to be processed? Only one, i.e. the bucket-0 file.
In turn, we reduce the number of files the MapReduce job has to read when using Hive.
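
If you want to read a single bucket explicitly, Hive's TABLESAMPLE clause can address buckets directly (bucket numbering there is 1-based, so the bucket-0 file is BUCKET 1). A sketch against the hypothetical table1:

-- scans only the first bucket file instead of the whole table
SELECT * FROM table1 TABLESAMPLE (BUCKET 1 OUT OF 4 ON userid);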

We can also bucket on more columns, chosen by how frequently those columns appear in the WHERE clauses of your queries; see the sketch below.
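
For instance (hypothetical table and column names), bucketing on two columns looks like this; the hash is then computed over both columns together:

CREATE TABLE page_views_bucketed (userid BIGINT, country STRING, viewTime INT)
CLUSTERED BY (userid, country) INTO 32 BUCKETS;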

Note: I used just 10 records for the explanation.
Buckets can be used even without partitioning.


Senthil

Chandra Reddy

Jun 3, 2013, 12:12:54 AM
to chenn...@googlegroups.com
Thanks for the response, Senthil.
Yes, you are right: with buckets, instead of processing the entire data set we can process only a specific part of it (a specific bucket's data).
Do you mean that the number of times a specific query executes is the basis for deciding the number of buckets? Could you please clarify?

thanks
Chandra


sudhakar kurakula

Jun 3, 2013, 12:20:07 AM
to chenn...@googlegroups.com
Excellent explanation. Thanks a lot.

Senthil Kumar

Jun 3, 2013, 12:47:27 AM
to chenn...@googlegroups.com
Consider all the (reporting) queries run on the table: which columns repeat in most of them?
How can we classify the data in those columns?
For example, if one column is related to geographic location (country), you can have a number of buckets equal to the number of countries in the data; see the sketch below.
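
As a rough sketch (table and column names are hypothetical, and 200 is just an assumed country count):

CREATE TABLE user_events (userid BIGINT, country STRING, eventTime INT)
CLUSTERED BY (country) INTO 200 BUCKETS;

Rows for a given country always land in the same bucket file, so queries filtering on country touch only a fraction of the files (though two countries can still hash to the same bucket).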

Chandra Reddy

Jun 3, 2013, 12:54:56 PM
to chenn...@googlegroups.com
Thanks for the response, Senthil.