Hi,
I found quite a few segments in CC-MAIN-2013-48 where the WAT content is missing (see full list below).
For instance:
hadoop fs -ls s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/
Found 1 items
drwxrwxrwx - 0 1970-01-01 00:00 /common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/warc
Is there any reason why these segments contain only the warc content? If not, would it be possible to generate the missing subdirectories?
Julien
-----------------------------------
CC-MAIN-2013-48/segments/1386163051986
CC-MAIN-2013-48/segments/1386163051992
CC-MAIN-2013-48/segments/1386163052034
CC-MAIN-2013-48/segments/1386163052107
CC-MAIN-2013-48/segments/1386163052204
CC-MAIN-2013-48/segments/1386163052216
CC-MAIN-2013-48/segments/1386163052275
CC-MAIN-2013-48/segments/1386163052286
CC-MAIN-2013-48/segments/1386163052338
CC-MAIN-2013-48/segments/1386163052343
CC-MAIN-2013-48/segments/1386163052462
CC-MAIN-2013-48/segments/1386163052469
CC-MAIN-2013-48/segments/1386163052713
CC-MAIN-2013-48/segments/1386163052727
CC-MAIN-2013-48/segments/1386163052810
CC-MAIN-2013-48/segments/1386163052909
CC-MAIN-2013-48/segments/1386163052912
CC-MAIN-2013-48/segments/1386163052949
CC-MAIN-2013-48/segments/1386163053578
CC-MAIN-2013-48/segments/1386163053669
CC-MAIN-2013-48/segments/1386163053831
CC-MAIN-2013-48/segments/1386163053843
CC-MAIN-2013-48/segments/1386163053865
CC-MAIN-2013-48/segments/1386163053883
CC-MAIN-2013-48/segments/1386163053894
CC-MAIN-2013-48/segments/1386163053921
CC-MAIN-2013-48/segments/1386163053923
CC-MAIN-2013-48/segments/1386163054000
CC-MAIN-2013-48/segments/1386163054096
CC-MAIN-2013-48/segments/1386163054352
CC-MAIN-2013-48/segments/1386163054353
CC-MAIN-2013-48/segments/1386163054424
CC-MAIN-2013-48/segments/1386163054457
CC-MAIN-2013-48/segments/1386163848048
CC-MAIN-2013-48/segments/1386163857457
CC-MAIN-2013-48/segments/1386163857566
CC-MAIN-2013-48/segments/1386163860676
CC-MAIN-2013-48/segments/1386163870408
CC-MAIN-2013-48/segments/1386163901500
CC-MAIN-2013-48/segments/1386163915534
CC-MAIN-2013-48/segments/1386163922753
CC-MAIN-2013-48/segments/1386163930735
CC-MAIN-2013-48/segments/1386163932627
CC-MAIN-2013-48/segments/1386163933724
CC-MAIN-2013-48/segments/1386163936569
CC-MAIN-2013-48/segments/1386163944066
CC-MAIN-2013-48/segments/1386163949658
CC-MAIN-2013-48/segments/1386163955638
CC-MAIN-2013-48/segments/1386163956743
CC-MAIN-2013-48/segments/1386163964642
CC-MAIN-2013-48/segments/1386163966854
CC-MAIN-2013-48/segments/1386163968717
CC-MAIN-2013-48/segments/1386163973624
CC-MAIN-2013-48/segments/1386163982738
CC-MAIN-2013-48/segments/1386163988740
CC-MAIN-2013-48/segments/1386163990831
CC-MAIN-2013-48/segments/1386163990989
CC-MAIN-2013-48/segments/1386163992191
CC-MAIN-2013-48/segments/1386163992799
CC-MAIN-2013-48/segments/1386163994706
CC-MAIN-2013-48/segments/1386163994768
CC-MAIN-2013-48/segments/1386163995757
CC-MAIN-2013-48/segments/1386163996785
CC-MAIN-2013-48/segments/1386163996875
CC-MAIN-2013-48/segments/1386163998951
CC-MAIN-2013-48/segments/1386164000828
CC-MAIN-2013-48/segments/1386164033950
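
In case anyone wants to reproduce a list like the one above, here is a minimal sketch of the check in Python. It assumes you have already saved the lines of a recursive hadoop fs listing of the segments directory. The function name segments_missing, the expected tuple and the second segment id in the sample are made up for illustration; only the first sample line comes from the listing shown earlier. A segment with no subdirectories at all would not show up, since it never matches the pattern.

```python
import re
from collections import defaultdict

# Matches ".../segments/<segment id>/<subdir>" anywhere in a listing line.
SEGMENT_RE = re.compile(r"segments/(\d+)/([A-Za-z]+)")


def segments_missing(listing_lines, expected=("warc", "wat")):
    """Return {segment_id: sorted list of missing subdirs} for incomplete segments.

    `expected` is an assumption: extend it if other subdirs should be present.
    """
    found = defaultdict(set)
    for line in listing_lines:
        match = SEGMENT_RE.search(line)
        if match:
            segment_id, subdir = match.groups()
            found[segment_id].add(subdir)
    wanted = set(expected)
    return {
        segment_id: sorted(wanted - subdirs)
        for segment_id, subdirs in sorted(found.items())
        if wanted - subdirs
    }


sample = [
    # Taken from the listing in this message.
    "drwxrwxrwx - 0 1970-01-01 00:00 /common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163998951/warc",
    # Hypothetical complete segment, included only for contrast.
    "drwxrwxrwx - 0 1970-01-01 00:00 /common-crawl/crawl-data/CC-MAIN-2013-48/segments/0000000000001/warc",
    "drwxrwxrwx - 0 1970-01-01 00:00 /common-crawl/crawl-data/CC-MAIN-2013-48/segments/0000000000001/wat",
]

result = segments_missing(sample)
print(result)
```

Feeding it the real listing instead of sample should give one entry per incomplete segment, keyed by segment id.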
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble