non-English text in common crawl


Hsiao Su

May 29, 2012, 8:00:38 PM
to common...@googlegroups.com

I noticed that non-English text is encoded as hexadecimal values enclosed in a pair of angle brackets. This is in the sample file that I downloaded. I just want to confirm whether this is true throughout the entire corpus.

Also, does the parsing code here:


handle this correctly?

Hsiao

Ahad Rana

May 29, 2012, 8:26:49 PM
to common...@googlegroups.com
Hi, can you give a more specific example of the problem? That should not be the case. Data in the ARC files is stored in its native encoding, and the new JSON metadata is UTF-8 encoded.

Ahad.



Hsiao Su

May 30, 2012, 2:42:35 PM
to common...@googlegroups.com
Ahad,

Thanks for the reply. I just confirmed that what I observed was an artifact of viewing the contents of the .arc file with "less", which renders UTF-8 bytes that way. I verified this using Emacs's hexl-mode.
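For anyone who hits the same thing, here is a minimal Python sketch of how to turn a snippet copied out of "less" back into readable text. It assumes the display looked like <E4><BD><A0>, i.e. one two-digit hex value per raw byte, and that the underlying bytes are UTF-8. The function name less_hex_to_text is made up for this illustration and is not part of the Common Crawl code.

```python
import re

# Matches one escaped byte as assumed above, e.g. "<E4>".
HEX_BYTE = re.compile(r"<([0-9A-Fa-f]{2})>")

def less_hex_to_text(rendered: str, encoding: str = "utf-8") -> str:
    """Rebuild the original bytes from <XX> escapes, then decode them.

    Caveat: literal markup such as "<dd>" also looks like a hex escape,
    so this is only meant for inspecting short pasted snippets.
    """
    raw = bytearray()
    pos = 0
    for match in HEX_BYTE.finditer(rendered):
        # Text between escapes was displayed as-is; keep it unchanged.
        raw += rendered[pos:match.start()].encode(encoding)
        # The escape itself stands for exactly one byte.
        raw.append(int(match.group(1), 16))
        pos = match.end()
    raw += rendered[pos:].encode(encoding)
    return raw.decode(encoding)

sample = "<E4><BD><A0><E5><A5><BD> world"
print(less_hex_to_text(sample))  # 你好 world
```

So the angle brackets are purely a display artifact; the bytes in the file are ordinary UTF-8.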

Also, looking at the code on GitHub (https://github.com/commoncrawl/commoncrawl), there is no special handling for angle brackets. There's just an assumption that parts of the .arc file are UTF-8 encoded and parts are ASCII-encoded.
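To illustrate that split, here is a rough Python sketch (not the ArcFileReader logic) that separates one uncompressed record into its header line and its body. It assumes the ARC v1 layout of "URL IP-address date content-type length" on the first line, followed by that many bytes of content. The header is decoded as UTF-8, and the body is left as raw bytes because it is stored in the page's native encoding. The function name and the sample record are invented for this example.

```python
def split_arc_record(record: bytes):
    """Split one uncompressed ARC-style record into (header dict, body bytes)."""
    header_line, _, rest = record.partition(b"\n")
    fields = header_line.decode("utf-8").split(" ")
    if len(fields) < 5:
        raise ValueError("expected at least 5 space-separated header fields")
    length = int(fields[-1])
    header = {
        "url": fields[0],
        "ip": fields[1],
        "date": fields[2],
        # Rejoin in case the content type itself contains a space.
        "content_type": " ".join(fields[3:-1]),
        "length": length,
    }
    # The body stays as bytes: its encoding is whatever the server sent.
    return header, rest[:length]

# Made-up record: the body is "你好" encoded as UTF-8 (6 bytes).
record = b"http://example.com/ 127.0.0.1 20120529200038 text/html 6\n" \
         + "你好".encode("utf-8")
header, body = split_arc_record(record)
print(header["url"], header["length"], body.decode("utf-8"))
```

Decoding the body is a separate step that needs the page's own charset, which is why no angle-bracket handling is needed anywhere.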

I was trying to figure out whether I need to write my own .arc file parser, mainly because I need to integrate it with something else. But it looks like I can just use what's in org.commoncrawl.util.shared.ArcFileReader. The only catch is that it has a dependency on Thrift. That's probably OK; it's just one more dependency to integrate.

Thanks for your help.

Hsiao