Ahad,
Thanks for the reply. I just confirmed that what I observed was an artifact of viewing the .arc file with "less", which chooses to render UTF-8 that way. I verified this with Emacs's hexl-mode.
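For anyone who wants to do the same check without Emacs, here is a minimal sketch of a hex dump in Python. The helper name hex_dump and the sample string are my own, purely for illustration; they are not part of the commoncrawl code. It shows the raw bytes next to a printable-ASCII column, so a multi-byte UTF-8 sequence is visible as bytes rather than as whatever the pager decides to draw.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a hexl-style dump: offset, hex bytes, printable ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as "." in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Hypothetical sample: "café" encoded as UTF-8; the "é" becomes the two bytes c3 a9.
sample = "caf\u00e9".encode("utf-8")
print(hex_dump(sample))
```

To look at a real file, read a slice of it in binary mode (open(path, "rb").read(256)) and pass that to hex_dump.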
Also, looking at the code on GitHub ( https://github.com/commoncrawl/commoncrawl ), there is no special handling for angle brackets. There's just an assumption that parts of the .arc file are UTF-8 encoded and parts are ASCII encoded.
I was trying to figure out whether I need to write my own .arc file parser, mainly because I need to integrate it with something else. But it looks like I can just use what's in org.commoncrawl.util.shared.ArcFileReader. The only catch is that it has a dependency on Thrift. That's probably OK; it's just one more dependency to integrate.
Thanks for your help.
Hsiao