Hi,
the encoding of the WET file is UTF-8, but what went wrong is that the WAT/WET extractor
did not use the proper source encoding (here Shift_JIS) when converting the content to
UTF-8. That's why mostly `�' (the replacement character) shows up in the text for the cited
record / URL.
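To illustrate the effect, here is a minimal Python sketch (not the extractor's actual code, which is Java). The sample string is made up for the example; it only shows that Shift_JIS bytes decoded as UTF-8 degrade into replacement characters, while decoding with the source encoding first gives clean text:

```python
# Illustrative sample text (an assumption, not taken from the cited page).
original = "日本語のページ"

# The bytes as a Shift_JIS encoded page would deliver them.
raw = original.encode("shift_jis")

# Wrong: treat the bytes as UTF-8; undecodable bytes become U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Right: decode with the source encoding, then the text can be written as UTF-8.
right = raw.decode("shift_jis")
utf8_bytes = right.encode("utf-8")

print(repr(wrong))
print(right)
```

The output of the first decode is still valid UTF-8, which is why the WET file as a whole parses as UTF-8 even though the record content is junk.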
But thanks for reporting this. I actually had a look into the source code at
https://github.com/commoncrawl/ia-hadoop-tools/
https://github.com/commoncrawl/ia-web-commons/
WAT and WET files are generated from the WARC files by
https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java
The reason seems to be simply that in
https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/resource/html/HTMLResourceFactory.java#L25
no effort is made to guess the charset or even to use the one given in the HTML or HTTP header.
I always thought that Mozilla's juniversalchardet package, which is a dependency in the pom.xml,
was also used. But this assumption seems to be wrong.
The code for charset detection is there, so maybe it's a quick fix to just use it:
https://github.com/commoncrawl/ia-web-commons/blob/master/src/main/java/org/archive/format/text/charset/CharsetDetector.java
But I cannot promise anything. In the long term, we may eventually need to replace the
WAT/WET generation code.
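As a rough idea of what "use the charset given in the HTTP or HTML header" could look like, here is a small Python sketch. It is not the CharsetDetector code linked above; the function names (`pick_charset`, `to_utf8`) and the 2048-byte scan window are assumptions made for illustration, and statistical guessing (what juniversalchardet does) is left out and would be a further fallback:

```python
import codecs
import re

# Hypothetical patterns for illustration; real-world HTML needs more care.
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def known(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def pick_charset(content_type, body, default="utf-8"):
    """Choose a charset: HTTP Content-Type header, then HTML <meta>, then default."""
    if content_type:
        m = HEADER_CHARSET.search(content_type)
        if m and known(m.group(1)):
            return m.group(1)
    # Only scan the start of the document for a <meta> declaration.
    m = META_CHARSET.search(body[:2048])
    if m:
        name = m.group(1).decode("ascii")
        if known(name):
            return name
    return default


def to_utf8(content_type, body):
    """Decode with the chosen charset and re-encode as UTF-8."""
    text = body.decode(pick_charset(content_type, body), errors="replace")
    return text.encode("utf-8")
```

For a page that declares `charset=Shift_JIS` in a `<meta>` tag and has no charset in the HTTP header, `pick_charset(None, body)` returns `"Shift_JIS"`, and `to_utf8` then yields readable UTF-8 instead of replacement characters.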
Please open an issue at
https://github.com/commoncrawl/ia-web-commons/issues
Help in fixing this is, of course, always welcome!
Thanks,
Sebastian
On 11/17/2016 01:09 AM, ccl...@dieselpoint.com wrote:
> I'm getting errors trying to parse .wet files. The encoding does not appear to be entirely UTF-8.
> For example, when I try to convert the record for URI "http://1-2-3.chu.jp/" in
> "CC-MAIN-20160924173739-00000-ip-10-143-35-109.ec2.internal.warc.wet" I get junk.
>
> Is the encoding for each page in the .wet file just the original binary?
>
> If so, is there a way to get the proper encoding for the page?