In one of our jobs we read 4.5 TB of Snappy-compressed Avro data and apply a filter to it. We expect the filtered output to be much smaller than the input, but Scoobi writes out an intermediate sequence file of 8 TB to the /tmp/scoobi-user/ directory.
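The job is essentially equivalent to the following sketch (the record type, predicate, and paths are placeholders, not our actual code):

```scala
import com.nicta.scoobi.Scoobi._

object FilterJob extends ScoobiApp {
  def run() {
    // Read the Snappy-compressed Avro input as a distributed list.
    val records: DList[MyRecord] = fromAvroFile[MyRecord]("hdfs://.../input")

    // Apply the filter; we expect this to drop most records.
    val filtered: DList[MyRecord] = records.filter(r => keep(r))

    // Persist the filtered output.
    persist(toAvroFile(filtered, "hdfs://.../output"))
  }
}
```

The 8 TB file appears under /tmp/scoobi-user/ while this pipeline is running, before the final output is written.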
Why is the intermediate data nearly double the input size in spite of the filter being applied? We understand the input is Snappy-compressed, so some expansion on deserialization is expected, but 2x after filtering surprised us. Is this expected behavior from Scoobi?