SetUnion does not handle large inputs gracefully

Josh Rosenberg

Jan 2, 2013, 11:59:29 PM
to dat...@googlegroups.com
I'm looking at the code for SetUnion from DataFu 0.0.5 (http://grepcode.com/file/repo1.maven.org/maven2/com.linkedin.datafu/datafu/0.0.5/datafu/pig/bags/sets/SetUnion.java/), and I noticed that it's implemented with a HashSet that holds every unique entry in memory, with no spill support. That works fine for the union of small sets, but without spilling, large bags can cause an OutOfMemoryError.

Is there any reason this doesn't build on the default bag factory's newDistinctBag() functionality? That seems to be exactly the purpose of a DistinctBag. As far as I can tell, you could simplify the code down to creating a DistinctBag and calling addAll for each input bag. As a side benefit, it would degrade gracefully when it exceeds available memory, rather than crashing.
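To make the "degrade gracefully" point concrete, here is a minimal stdlib-only sketch of the general idea behind a spillable distinct bag. It is not Pig's DistinctBag and not the DataFu code: the class SpillingDistinctSketch, its count-based maxInMemory threshold, and the one-string-per-line file format are all hypothetical, chosen only for illustration. In a real Pig UDF the equivalent would presumably be BagFactory.getInstance().newDistinctBag() followed by addAll on each input bag.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.function.Consumer;

// Hypothetical stand-in for a spillable distinct bag; NOT Pig's DistinctBag.
// Values are plain strings and must not contain line breaks.
class SpillingDistinctSketch {
    private final int maxInMemory;                      // assumed spill threshold (element count)
    private final TreeSet<String> memory = new TreeSet<>();
    private final List<Path> runs = new ArrayList<>();  // sorted, duplicate-free runs on disk

    SpillingDistinctSketch(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    // Add every value of one input; spill whenever the in-memory set fills up.
    void addAll(Iterable<String> values) {
        for (String value : values) {
            memory.add(value);
            if (memory.size() >= maxInMemory) {
                spill();
            }
        }
    }

    int spilledRuns() { return runs.size(); }

    // Write the in-memory set to a temp file as one sorted run, then free the memory.
    private void spill() {
        try {
            Path run = Files.createTempFile("distinct-run", ".txt");
            run.toFile().deleteOnExit();
            Files.write(run, memory);  // TreeSet iterates in sorted order
            runs.add(run);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stream the union in sorted order: k-way merge of all runs plus the
    // in-memory set, skipping values equal to the one emitted just before.
    void forEachDistinct(Consumer<String> sink) {
        List<BufferedReader> readers = new ArrayList<>();
        try {
            Comparator<Map.Entry<String, Iterator<String>>> byValue = Map.Entry.comparingByKey();
            PriorityQueue<Map.Entry<String, Iterator<String>>> heap = new PriorityQueue<>(byValue);
            push(heap, memory.iterator());
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run);
                readers.add(reader);
                push(heap, reader.lines().iterator());
            }
            String last = null;
            while (!heap.isEmpty()) {
                Map.Entry<String, Iterator<String>> head = heap.poll();
                if (!head.getKey().equals(last)) {
                    last = head.getKey();
                    sink.accept(last);
                }
                push(heap, head.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (BufferedReader reader : readers) {
                try { reader.close(); } catch (IOException ignored) { }
            }
        }
    }

    // Queue the next value of a source, if it has one.
    private static void push(PriorityQueue<Map.Entry<String, Iterator<String>>> heap,
                             Iterator<String> source) {
        if (source.hasNext()) {
            heap.add(new AbstractMap.SimpleEntry<>(source.next(), source));
        }
    }

    public static void main(String[] args) {
        SpillingDistinctSketch bag = new SpillingDistinctSketch(2);
        bag.addAll(Arrays.asList("b", "a", "c"));
        bag.addAll(Arrays.asList("c", "d"));
        bag.addAll(Arrays.asList("a", "e"));
        bag.forEachDistinct(System.out::println);  // prints a, b, c, d, e (one per line)
    }
}
```

The element-count threshold is only a stand-in for real memory accounting. The trade-off is the one asked about here: a plain HashSet is cheaper while everything fits in memory, whereas the spilling version pays for disk I/O and a merge pass but keeps working once the input outgrows the heap.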

Am I missing something obvious here? Is the cost of using a DistinctBag much greater than hand-implementing with a HashSet?

Sam Shah

Jan 7, 2013, 5:47:32 PM
to dat...@googlegroups.com
Josh, you are absolutely correct. We've merged in a patch:
https://github.com/linkedin/datafu/pull/21