I have downloaded a single warc.gz file, but it has only metadata information and not the content


Bhavana

Jul 18, 2016, 5:04:13 PM
to Web Data Commons
Hello,

I have downloaded a single warc.gz file, but it contains only metadata information. I want the page title or header name, but I don't see it in the file. I want to use Elasticsearch to retrieve URLs by searching on title names, so I am looking for title names in the data.

The data looks like this:

WARC/1.0
WARC-Type: metadata
WARC-Date: 2016-05-24T06:19:32Z
WARC-Record-ID: <urn:uuid:63cb79b8-5916-4a92-adb4-a67446db63aa>
WARC-Refers-To: <urn:uuid:08e9fbe3-c842-418a-b0f4-e20718591155>
Content-Type: application/json
Content-Length: 1101

{"Envelope":{"Format":"WARC",
"WARC-Header-Length":"423",
"Block-Digest":"sha1:33WFHKNHJ64W55DRWPWJFJ35T4AN54QF",
"Actual-Content-Length":"20",
"WARC-Header-Metadata":{"WARC-Type":"metadata",
"WARC-Date":"2016-05-24T06:19:32Z",
"WARC-Warcinfo-ID":"<urn:uuid:50214873-4101-4984-8452-db7b5475da62>",
"Content-Length":"20",
"WARC-Record-ID":"<urn:uuid:08e9fbe3-c842-418a-b0f4-e20718591155>",
"WARC-Concurrent-To":"<urn:uuid:c03dfd08-562b-44bf-8c09-dae24d7f67c7>",
"Content-Type":"application/warc-fields"},
"Payload-Metadata":{"Trailing-Slop-Length":"4",
"WARC-Metadata-Metadata":{"Trailing-Slop-Length":"0",
"Metadata-Records":[{"Name":"fetchTimeMs",
"Value":"284"}],
"Actual-Content-Length":"20"},
"Actual-Content-Type":"application/metadata-fields"}},
"Container":{"Compressed":true,
"Gzip-Metadata":{"Footer-Length":"8",
"Deflate-Length":"328",
"Header-Length":"10",
"Inflated-CRC":"-4818014",
"Inflated-Length":"447"},
"Offset":"1038238145",
"Filename":"CC-MAIN-20160524002110-00017-ip-10-185-217-139.ec2.internal.warc.gz"}}


Could you please help me find where I can get the header name or title? As mentioned, I want to use Elasticsearch to retrieve URLs by searching on title names.


Thanks,
Bhavana.

Robert Meusel

Jul 19, 2016, 2:15:45 AM
to Web Data Commons
Bhavana,

Can you please give an example? What kind of title or header do you want: the one from the page, or some internal data?

Thanks,
Robert

seba...@commoncrawl.org

Jul 19, 2016, 8:44:42 AM
to Web Data Commons
Hi Bhavana,

This question is about the Common Crawl data, not Web Data Commons;
see https://groups.google.com/d/msg/common-crawl/TP9Cpr6Vw1k/qsvI5Xm6AQAJ

Sebastian
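
For anyone with the same question, here is a minimal sketch rather than a definitive answer. The record quoted in the question is a JSON envelope describing a WARC "metadata" record (it only carries fetchTimeMs), so it has no page title. The sketch assumes that envelopes describing "response" records keep the URL under WARC-Header-Metadata / WARC-Target-URI and the title under Payload-Metadata / HTTP-Response-Metadata / HTML-Metadata / Head / Title. Those key paths, the function name title_and_url, and the example.com payload are assumptions for illustration, so check them against your own file.

```python
import json

def title_and_url(payload):
    """Return (url, title) for one JSON envelope, or None if it has no title.

    The key paths below are assumed, not taken from the thread.
    """
    envelope = json.loads(payload).get("Envelope", {})
    header = envelope.get("WARC-Header-Metadata", {})
    # Records describing "metadata" or "request" entries have no HTML head.
    if header.get("WARC-Type") != "response":
        return None
    head = (envelope.get("Payload-Metadata", {})
                    .get("HTTP-Response-Metadata", {})
                    .get("HTML-Metadata", {})
                    .get("Head", {}))
    title = head.get("Title")
    if title is None:
        return None
    return header.get("WARC-Target-URI"), title.strip()

# Trimmed version of the record posted in the question (WARC-Type: metadata).
metadata_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Type": "metadata"},
        "Payload-Metadata": {"Actual-Content-Type": "application/metadata-fields"},
    }
})

# Hypothetical envelope for a response record; the values are made up.
response_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Type": "response",
            "WARC-Target-URI": "http://example.com/",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {"Head": {"Title": " Example Domain "}}
            }
        },
    }
})

print(title_and_url(metadata_payload))
print(title_and_url(response_payload))
```

Each (url, title) pair returned this way could then be indexed as one Elasticsearch document, so that a search on the title field returns the URL.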