Re: .wet file encoding

Sebastian Nagel

Nov 18, 2016, 1:31:15 PM
to common...@googlegroups.com
Hi,

the encoding of the WET file is UTF-8, but what went wrong is that the WAT/WET extractor
did not use the proper source encoding (here Shift_JIS) when converting the content to
UTF-8. That's why mostly `�' (U+FFFD, the replacement character) shows up in the text
for the cited record / URL.
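
To illustrate the effect, here is a minimal Python sketch (not the extractor's Java code). The sample string is made up, and decoding with errors="replace" is only assumed to be roughly analogous to what the extractor does:

```python
# Hypothetical sample text; the content of the cited page is not reproduced here.
original = "日本語のテキスト"

# Bytes as a Shift_JIS-encoded page would deliver them.
raw = original.encode("shift_jis")

# Assumed failure mode: the bytes are treated as if they were already UTF-8.
# Invalid sequences turn into U+FFFD, and the original text is lost for good.
wrong = raw.decode("utf-8", errors="replace")

# Intended behaviour: decode with the source charset, then write UTF-8.
right = raw.decode("shift_jis")
utf8_output = right.encode("utf-8")

print(repr(wrong))
print(right)
```

The output of the wrong path is still valid UTF-8, which is why the WET file as a whole parses as UTF-8 even though the record is junk.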

But thanks for reporting this. I actually had a look into the source code at
https://github.com/commoncrawl/ia-hadoop-tools/
https://github.com/commoncrawl/ia-web-commons/

WAT and WET files are generated from the WARC files by

https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java

The reason seems to be simply that in

https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/resource/html/HTMLResourceFactory.java#L25
no effort is made to guess the charset, or even to use the one given in the HTML or HTTP header.
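
As a rough idea of what using the declared charset could look like, here is a sketch in Python rather than Java. It is not the code in ia-web-commons; pick_charset, to_utf8 and the regular expressions are hypothetical, and the fallback to UTF-8 stands in for a real charset detector:

```python
import codecs
import re

# Hypothetical patterns: look for "charset=..." in the HTTP Content-Type
# header value and in an HTML <meta> tag near the top of the document.
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def known(name):
    """Return the normalized codec name, or None if Python does not know it."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def pick_charset(content_type, body, default="utf-8"):
    """Prefer the HTTP header, then the HTML meta tag, then a default."""
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match and known(match.group(1)):
            return known(match.group(1))
    # Only scan the first few KB; charset declarations sit in the <head>.
    match = META_CHARSET.search(body[:4096])
    if match and known(match.group(1).decode("ascii")):
        return known(match.group(1).decode("ascii"))
    # Assumption: a real fix would call a charset detector here instead.
    return default


def to_utf8(content_type, body):
    """Decode with the chosen charset and re-encode as UTF-8."""
    charset = pick_charset(content_type, body)
    return body.decode(charset, errors="replace").encode("utf-8")


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=Shift_JIS"></head>'
        '<body>日本語</body></html>').encode("shift_jis")
print(pick_charset(None, page))
print(to_utf8(None, page).decode("utf-8"))
```

Declared charsets can be wrong or missing, so this alone would not cover every page; it only shows the order in which the available hints could be consulted.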

I always thought that Mozilla's juniversalchardet package, which is a dependency in the pom.xml,
was actually used. But this assumption seems to be wrong.

The code for charset detection is there, so maybe it's a quick fix to just use it:

https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/format/text/charset/CharsetDetector.java

But I cannot promise anything. In the long term, we may eventually need to replace the
WAT/WET generation code.

Please open an issue at
https://github.com/commoncrawl/ia-web-commons/issues

Help in fixing this is, of course, always welcome!

Thanks,
Sebastian


On 11/17/2016 01:09 AM, ccl...@dieselpoint.com wrote:
> I'm getting errors trying to parse .wet files. The encoding does not appear to be entirely UTF-8.
> For example, when I try to convert the record for URI "http://1-2-3.chu.jp/" in
> "CC-MAIN-20160924173739-00000-ip-10-143-35-109.ec2.internal.warc.wet" I get junk.
>
> Is the encoding for each page in the .wet file just the original binary?
>
> If so, is there a way to get the proper encoding for the page?

Sebastian Nagel

Nov 23, 2016, 11:53:36 AM
to common...@googlegroups.com
Hi,

thanks to all users who persevered in reporting this issue.
It turns out that indeed no encodings other than UTF-8 and ASCII are
properly converted to Unicode / UTF-8. The issue is now tracked and
should get fixed for the next crawl, see
https://github.com/commoncrawl/ia-web-commons/issues/4

Thanks,
Sebastian