I'm looking at the code for SetUnion from DataFu 0.0.5 (http://grepcode.com/file/repo1.maven.org/maven2/com.linkedin.datafu/datafu/0.0.5/datafu/pig/bags/sets/SetUnion.java/), and I noticed that it's implemented with a HashSet that stores all unique entries, with no direct spill support. While this works fine for taking the union of small sets, the lack of spill support means large bags can cause OutOfMemory errors.
Is there any reason this doesn't build on the default bag factory's newDistinctBag() functionality? That seems to be exactly the purpose of a DistinctBag. As far as I can tell, you could simplify the code down to creating a DistinctBag and calling addAll for each input bag. As a side benefit, it would degrade gracefully by spilling to disk when it exceeds available memory, rather than crashing.
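To make the suggestion concrete, here is a minimal sketch of what I have in mind. It is not the DataFu implementation: the class name SpillableSetUnion is made up for illustration, and I'm assuming every argument to the UDF is a bag (null arguments are skipped). It needs Pig on the classpath.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical UDF name, used only for this sketch; not part of DataFu.
public class SpillableSetUnion extends EvalFunc<DataBag> {
    private static final BagFactory BAG_FACTORY = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // A distinct bag de-duplicates its tuples and is registered with
        // Pig's spill machinery, unlike a plain in-memory HashSet.
        DataBag union = BAG_FACTORY.newDistinctBag();
        if (input == null) {
            return union;
        }
        for (int i = 0; i < input.size(); i++) {
            Object field = input.get(i);
            if (field == null) {
                continue; // assumption: treat a null argument as an empty bag
            }
            if (!(field instanceof DataBag)) {
                throw new IOException("Expected a bag at position " + i
                        + " but got " + field.getClass().getName());
            }
            // Copy every tuple from this input bag into the distinct bag.
            union.addAll((DataBag) field);
        }
        return union;
    }
}
```

In a Pig script this would be called the same way as the existing UDF, e.g. SpillableSetUnion(bagA, bagB), after a DEFINE for the class.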
Am I missing something obvious here? Is the cost of using a DistinctBag much greater than hand-implementing with a HashSet?