There's some built-in anonymization stuff in the admin console, of course, but that only does things like scrub IP addresses and (ha-ha) AIM usernames, if I'm understanding right. The bigger problem with anonymizing is that, inevitably, people communicate personal information when they're trying to get account help and the like. In those cases, any kind of anonymizing is going to have to be done by hand, which means that some human is going to have to look at the data. I don't know if there are best practices for this kind of thing, but I wonder if a reasonable approach would be to have someone dedicated to anonymizing, and then just have that person not do any coding and the like.
As for downloading and storing, you basically get what LH3 gives you, which is lots of nested files and folders. For doing text analysis we like to get a random sample, but the nested directory structure makes this non-trivial. Somewhere I've got a Java file that'll automatically crawl a directory for .txt files and copy some percentage of them to a destination directory. Let me know if you're interested.
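For anyone who just wants the general idea, here is a rough sketch of what a sampler like that could look like. This is not the actual file mentioned above. The class name SampleTxtFiles, the method names, the seed parameter, and the choice to flatten nested paths into underscore-joined file names are all assumptions made for illustration. It also assumes the destination directory sits outside the directory being crawled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: crawl a nested directory for .txt files and copy a
// random fraction of them into one flat destination directory.
class SampleTxtFiles {

    // Recursively collect every regular file under root whose name ends in .txt.
    static List<Path> findTxtFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".txt"))
                .sorted() // stable order, so the same seed picks the same sample
                .collect(Collectors.toCollection(ArrayList::new));
        }
    }

    // Copy round(fraction * total) randomly chosen .txt files from root into dest.
    // Assumption: dest is not inside root, otherwise earlier copies get re-crawled.
    static List<Path> copySample(Path root, Path dest, double fraction, long seed)
            throws IOException {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        List<Path> files = findTxtFiles(root);
        Collections.shuffle(files, new Random(seed));
        int sampleSize = (int) Math.round(files.size() * fraction);

        Files.createDirectories(dest);
        List<Path> copied = new ArrayList<>();
        for (Path src : files.subList(0, sampleSize)) {
            // Flatten "a/b/c.txt" to "a_b_c.txt" so files with the same name in
            // different folders do not overwrite each other. Names that already
            // contain underscores could still collide; good enough for a sketch.
            String flatName = root.relativize(src).toString()
                .replace(src.getFileSystem().getSeparator(), "_");
            Path target = dest.resolve(flatName);
            Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
            copied.add(target);
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: java SampleTxtFiles <sourceDir> <destDir> <fraction>");
            return;
        }
        double fraction = Double.parseDouble(args[2]);
        List<Path> copied = copySample(
            Paths.get(args[0]), Paths.get(args[1]), fraction, System.nanoTime());
        System.out.println("Copied " + copied.size() + " files to " + args[1]);
    }
}
```

Running it with a fraction of 0.1 would copy roughly ten percent of the .txt files. Passing a fixed seed to copySample instead of System.nanoTime() makes the sample repeatable, which is handy if more than one person needs to work from the same subset.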
-dre.