Segments with missing WAT dir


Julien Nioche

April 17, 2014, 06:35:28
To: common...@googlegroups.com
Hi, 

I found that there are quite a few segments from CC-MAIN-2013-48 where the WAT content is missing (see full list below). 

For instance:

hadoop fs -ls s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/  
               
Found 1 items
drwxrwxrwx   -          0 1970-01-01 00:00 /common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/warc

Is there a reason why these segments contain only the warc content? If not, would it be possible to generate the missing subdirectories?

Julien

-----------------------------------

CC-MAIN-2013-48/segments/1386163051986
CC-MAIN-2013-48/segments/1386163051992
CC-MAIN-2013-48/segments/1386163052034
CC-MAIN-2013-48/segments/1386163052107
CC-MAIN-2013-48/segments/1386163052204
CC-MAIN-2013-48/segments/1386163052216
CC-MAIN-2013-48/segments/1386163052275
CC-MAIN-2013-48/segments/1386163052286
CC-MAIN-2013-48/segments/1386163052338
CC-MAIN-2013-48/segments/1386163052343
CC-MAIN-2013-48/segments/1386163052462
CC-MAIN-2013-48/segments/1386163052469
CC-MAIN-2013-48/segments/1386163052713
CC-MAIN-2013-48/segments/1386163052727
CC-MAIN-2013-48/segments/1386163052810
CC-MAIN-2013-48/segments/1386163052909
CC-MAIN-2013-48/segments/1386163052912
CC-MAIN-2013-48/segments/1386163052949
CC-MAIN-2013-48/segments/1386163053578
CC-MAIN-2013-48/segments/1386163053669
CC-MAIN-2013-48/segments/1386163053831
CC-MAIN-2013-48/segments/1386163053843
CC-MAIN-2013-48/segments/1386163053865
CC-MAIN-2013-48/segments/1386163053883
CC-MAIN-2013-48/segments/1386163053894
CC-MAIN-2013-48/segments/1386163053921
CC-MAIN-2013-48/segments/1386163053923
CC-MAIN-2013-48/segments/1386163054000
CC-MAIN-2013-48/segments/1386163054096
CC-MAIN-2013-48/segments/1386163054352
CC-MAIN-2013-48/segments/1386163054353
CC-MAIN-2013-48/segments/1386163054424
CC-MAIN-2013-48/segments/1386163054457
CC-MAIN-2013-48/segments/1386163848048
CC-MAIN-2013-48/segments/1386163857457
CC-MAIN-2013-48/segments/1386163857566
CC-MAIN-2013-48/segments/1386163860676
CC-MAIN-2013-48/segments/1386163870408
CC-MAIN-2013-48/segments/1386163901500
CC-MAIN-2013-48/segments/1386163915534
CC-MAIN-2013-48/segments/1386163922753
CC-MAIN-2013-48/segments/1386163930735
CC-MAIN-2013-48/segments/1386163932627
CC-MAIN-2013-48/segments/1386163933724
CC-MAIN-2013-48/segments/1386163936569
CC-MAIN-2013-48/segments/1386163944066
CC-MAIN-2013-48/segments/1386163949658
CC-MAIN-2013-48/segments/1386163955638
CC-MAIN-2013-48/segments/1386163956743
CC-MAIN-2013-48/segments/1386163964642
CC-MAIN-2013-48/segments/1386163966854
CC-MAIN-2013-48/segments/1386163968717
CC-MAIN-2013-48/segments/1386163973624
CC-MAIN-2013-48/segments/1386163982738
CC-MAIN-2013-48/segments/1386163988740
CC-MAIN-2013-48/segments/1386163990831
CC-MAIN-2013-48/segments/1386163990989
CC-MAIN-2013-48/segments/1386163992191
CC-MAIN-2013-48/segments/1386163992799
CC-MAIN-2013-48/segments/1386163994706
CC-MAIN-2013-48/segments/1386163994768
CC-MAIN-2013-48/segments/1386163995757
CC-MAIN-2013-48/segments/1386163996785
CC-MAIN-2013-48/segments/1386163996875
CC-MAIN-2013-48/segments/1386163998951
CC-MAIN-2013-48/segments/1386164000828
CC-MAIN-2013-48/segments/1386164033950
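
A list like the one above could be derived from a recursive directory listing. The following is only a minimal sketch, not the method actually used in this thread: it assumes each listing line ends with a path of the form `.../CC-MAIN-YYYY-WW/segments/<id>/<subdir>`, as in the `hadoop fs -ls` output shown earlier. The function name `segments_missing` and the second sample segment ID are hypothetical, introduced purely for illustration.

```python
import re
from collections import defaultdict

# Matches "CC-MAIN-<year>-<week>/segments/<id>/<subdir>" anywhere in a line.
SEGMENT_RE = re.compile(r"(CC-MAIN-\d{4}-\d{2}/segments/\d+)/(\w+)")


def segments_missing(listing_lines, required="wat"):
    """Return the sorted segments that have no `required` subdirectory.

    Hypothetical helper: segments with no subdirectory at all never
    appear in the listing and are therefore not reported.
    """
    subdirs = defaultdict(set)
    for line in listing_lines:
        match = SEGMENT_RE.search(line)
        if match:
            segment, subdir = match.groups()
            subdirs[segment].add(subdir)
    return sorted(seg for seg, found in subdirs.items() if required not in found)


if __name__ == "__main__":
    sample = [
        # Line taken from the listing quoted in this post.
        "drwxrwxrwx   -          0 1970-01-01 00:00 "
        "/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/warc",
        # Hypothetical segment ID, used only to show a complete segment.
        "drwxrwxrwx   -          0 1970-01-01 00:00 "
        "/common-crawl/crawl-data/CC-MAIN-2013-48/segments/0000000000001/warc",
        "drwxrwxrwx   -          0 1970-01-01 00:00 "
        "/common-crawl/crawl-data/CC-MAIN-2013-48/segments/0000000000001/wat",
    ]
    print(segments_missing(sample))
    # prints ['CC-MAIN-2013-48/segments/1386163998951']
```

Working on the listing text rather than calling S3 directly keeps the check independent of any particular client; the lines could come from `hadoop fs -ls` on each segment directory or from any other tool that prints full paths.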


--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

jor...@commoncrawl.org

April 18, 2014, 15:39:02
To: common...@googlegroups.com
Oof. I think the WAT generator hit some parse problems on those segments, which I fixed in a later version. That's an awful lot of missing WAT files, though. After my current crawl finishes (it should be done in a couple of days), I'll go back and regenerate them.


Jordan