While running some comparisons between Hive on MR3 and Hadoop, I ran
into a problem with a moderately large aggregation query. I believe
I've narrowed the problem down to an issue with collect_set(). Here
is the smallest set of queries I've come up with to reproduce the
problem:

    drop table if exists t;
    create table t (k string, b binary, s string);
    insert into table t values
      ('0', unhex('00'), '0'),
      ('0', unhex('0001'), '0a'),
      ('1', unhex('01'), '1');
    select k, collect_set(b) as bset, collect_set(s) as sset
    from t group by k;

Attached are the resulting logs from running these right after a
restart. This is on Kubernetes using the
mr3project/hive:4.0.0.mr3.2.0 image.
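
One possible way to narrow this further would be to run each
collect_set() on its own against the same table t. That should show
whether the binary column, the string column, or only the combination
of the two triggers the problem. This is only a sketch of a follow-up
and is not part of the run described above:

    -- binary column only
    select k, collect_set(b) as bset
    from t group by k;

    -- string column only
    select k, collect_set(s) as sset
    from t group by k;
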
David
--
David Engel
da...@istwok.net