Sudhakar,
Let me put it in other words.
When you bucket your data on a particular column (userid), each record is stored in the bucket file given by the hash of the key modulo the number of buckets.
For example, suppose you have 10 records in the following format (original data):
1,senthil
2,kumar
3,siva
4,senthil2
5,adhi
6,vignesh
7,peter
8,stefen
9,doug
10,alan
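If you bucket the above data into 4 buckets, your (partition) dir in HDFS will have four files, one for each bucket.
bucket-0 will contain the records with userid 4 and 8
bucket-1 will contain the records with userid 1, 5 and 9
and so on. The bucket is chosen by the remainder of your userid divided by the number of buckets.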
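The remainder rule above can be sketched in a few lines of Python. This is only an illustration, not Hive's code: it assumes the hash of an integer key is the integer itself, and the names bucket_for and assign_buckets are made up for this example.

```python
from collections import defaultdict

NUM_BUCKETS = 4

# The 10 sample records from this mail: (userid, name)
records = [
    (1, "senthil"), (2, "kumar"), (3, "siva"), (4, "senthil2"), (5, "adhi"),
    (6, "vignesh"), (7, "peter"), (8, "stefen"), (9, "doug"), (10, "alan"),
]

def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Return the bucket number for a userid.

    Assumption: the hash of an integer key is the integer itself,
    so the bucket is simply the remainder after division.
    """
    return userid % num_buckets

def assign_buckets(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket number, the way the bucket files would be filled."""
    buckets = defaultdict(list)
    for userid, name in rows:
        buckets[bucket_for(userid, num_buckets)].append((userid, name))
    return dict(buckets)

buckets = assign_buckets(records)
for number in sorted(buckets):
    print(f"bucket-{number}:", buckets[number])
```

With these 10 records, userid 4 and 8 land in bucket-0 and userid 1, 5 and 9 land in bucket-1.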
Coming to the query: "Select * from table1 where userid = 4"
How many files will be processed? Only one, i.e. the bucket-0 file.
In turn, we reduce the number of files the MR job has to read in Hive.
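To make the saving concrete, here is a small Python sketch of which bucket files a query would touch. It assumes the engine skips buckets for an equality filter on the bucketing column; the file names follow the bucket-N naming used in this mail, and files_to_scan is a hypothetical helper, not a Hive function.

```python
NUM_BUCKETS = 4

# Illustrative file names, following the bucket-N naming used in this mail
bucket_files = [f"bucket-{i}" for i in range(NUM_BUCKETS)]

def files_to_scan(userid=None):
    """Return the bucket files a query needs to read.

    With no filter on userid, every bucket file is read.
    With "userid = <value>", only the file for value % NUM_BUCKETS is read
    (assumption: an integer key hashes to itself).
    """
    if userid is None:
        return list(bucket_files)
    return [bucket_files[userid % NUM_BUCKETS]]

print(files_to_scan(4))  # ['bucket-0']
print(files_to_scan())   # all four bucket files
```

So the filtered query reads one file out of four instead of all of them.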
We can bucket on more than one column, choosing the columns by how often they appear in the where clause of your queries.
Note: I used 10 records for explanation only.
Buckets can be used even without partition.
Senthil