At Bitcasa we built our garbage collection in Go on Hadoop (though we got burned in so many places using Hadoop that any rewrite would just spin up EC2 clusters directly). The process copy-verify-deletes more than 2 PB, plus frequent re-evaluation of about 10 TB of data. This was for Amazon-stored data, so we used their hosted Hadoop. Since Amazon's directory listings are sorted and our filenames were easily sharded, we batched the keys into 4096 subgroups, then processed each subgroup as a stream in Go. Top-level operations have set-notation semantics, but each piece has input and output channels (and every channel must be fully consumed for the system to work).
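Because the inputs arrive pre-sorted, the set operations reduce to a streaming merge over channels. As a rough illustration (a hypothetical sketch, not the actual library; `subtract` and `fromSlice` are names I'm inventing here), a set-difference over two sorted string channels looks like:

```go
package main

import "fmt"

// subtract streams the set difference a \ b, assuming both inputs
// yield deduplicated strings in ascending order. Hypothetical
// sketch, not the actual Bitcasa library.
func subtract(a, b <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		bv, bok := <-b
		for av := range a {
			// advance b until it catches up with a
			for bok && bv < av {
				bv, bok = <-b
			}
			if !bok || av < bv {
				out <- av // present in a only
			}
			// if av == bv, drop it (present in both)
		}
		// fully consume b so its producer isn't left blocked
		for bok {
			bv, bok = <-b
		}
	}()
	return out
}

// fromSlice adapts a slice into a sorted-stream channel for testing.
func fromSlice(xs []string) <-chan string {
	ch := make(chan string)
	go func() {
		for _, x := range xs {
			ch <- x
		}
		close(ch)
	}()
	return ch
}

func main() {
	a := fromSlice([]string{"a", "b", "c", "d"})
	b := fromSlice([]string{"b", "d"})
	for s := range subtract(a, b) {
		fmt.Println(s)
	}
}
```

Note the drain loop at the end: it's one concrete instance of the "every channel must be digested" rule above, since an unconsumed input channel leaves its upstream producer blocked forever.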
The GoLang streams worked great; the sharding was mostly driven by the per-instance network cap on EC2 and by wanting to iterate quickly.
One gotcha for anyone still considering Hadoop: we had to use much larger instances than we wanted because Hadoop's bucket/group step (the shuffle between map and reduce) appears to require huge amounts of memory once the data gets really big; otherwise we hit OOMs inside Hadoop's own code. It should have used on-disk append operations instead.
Relatedly, I will be open-sourcing the sorted-chan-of-strings-as-a-set library I built for this: Venn, Subtract, Merge, CountConsume, Save, Tee. Tee was the hardest to keep deadlock-free, since one consumer can block on something that is itself blocked on the other consumer.
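The Tee deadlock, and one way out of it, can be sketched like this (hypothetical code, not the actual library; `tee` and `buffered` are invented names). A naive tee that writes both outputs from one goroutine hangs as soon as consumer A blocks on something downstream of consumer B; decoupling the two outputs with unbounded per-output buffers avoids that, at the cost of memory proportional to the skew between consumers:

```go
package main

import "fmt"

type pipe struct {
	in  chan<- string
	out <-chan string
}

// buffered returns a channel pair with an unbounded queue between
// them, so a send into the pair never blocks on a slow reader.
func buffered() pipe {
	in := make(chan string)
	out := make(chan string)
	go func() {
		defer close(out)
		var q []string
		src := in
		for src != nil || len(q) > 0 {
			var send chan string
			var head string
			if len(q) > 0 {
				send = out // enable the send case only when we have data
				head = q[0]
			}
			select {
			case v, ok := <-src:
				if !ok {
					src = nil // input closed; drain the queue, then exit
					continue
				}
				q = append(q, v)
			case send <- head:
				q = q[1:]
			}
		}
	}()
	return pipe{in: in, out: out}
}

// tee duplicates one stream to two consumers without deadlocking,
// even if one consumer is drained completely before the other.
func tee(in <-chan string) (<-chan string, <-chan string) {
	o1, o2 := buffered(), buffered()
	go func() {
		for v := range in {
			o1.in <- v
			o2.in <- v
		}
		close(o1.in)
		close(o2.in)
	}()
	return o1.out, o2.out
}

func main() {
	src := make(chan string)
	go func() {
		for _, s := range []string{"x", "y", "z"} {
			src <- s
		}
		close(src)
	}()
	a, b := tee(src)
	// Drain b completely before touching a: a naive tee would
	// deadlock here, but the buffers absorb the skew.
	for s := range b {
		fmt.Println("b:", s)
	}
	for s := range a {
		fmt.Println("a:", s)
	}
}
```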