I downloaded a single warc.gz file, but it contains only metadata, not the content

Bhavana

Jul 18, 2016, 4:55:00 PM
to Common Crawl
Hello,

I have downloaded a single warc.gz file, but it contains only metadata. I want the title or header name, but I don't see it in the file. The data looks like this:

WARC/1.0
WARC-Type: metadata
WARC-Date: 2016-05-24T06:19:32Z
WARC-Record-ID: <urn:uuid:63cb79b8-5916-4a92-adb4-a67446db63aa>
WARC-Refers-To: <urn:uuid:08e9fbe3-c842-418a-b0f4-e20718591155>
Content-Type: application/json
Content-Length: 1101

{"Envelope":{"Format":"WARC",
"WARC-Header-Length":"423",
"Block-Digest":"sha1:33WFHKNHJ64W55DRWPWJFJ35T4AN54QF",
"Actual-Content-Length":"20",
"WARC-Header-Metadata":{"WARC-Type":"metadata",
"WARC-Date":"2016-05-24T06:19:32Z",
"WARC-Warcinfo-ID":"<urn:uuid:50214873-4101-4984-8452-db7b5475da62>",
"Content-Length":"20",
"WARC-Record-ID":"<urn:uuid:08e9fbe3-c842-418a-b0f4-e20718591155>",
"WARC-Concurrent-To":"<urn:uuid:c03dfd08-562b-44bf-8c09-dae24d7f67c7>",
"Content-Type":"application/warc-fields"},
"Payload-Metadata":{"Trailing-Slop-Length":"4",
"WARC-Metadata-Metadata":{"Trailing-Slop-Length":"0",
"Metadata-Records":[{"Name":"fetchTimeMs",
"Value":"284"}],
"Actual-Content-Length":"20"},
"Actual-Content-Type":"application/metadata-fields"}},
"Container":{"Compressed":true,
"Gzip-Metadata":{"Footer-Length":"8",
"Deflate-Length":"328",
"Header-Length":"10",
"Inflated-CRC":"-4818014",
"Inflated-Length":"447"},
"Offset":"1038238145",
"Filename":"CC-MAIN-20160524002110-00017-ip-10-185-217-139.ec2.internal.warc.gz"}}


Could you please tell me where I can find the header name or title?
I want to use Elasticsearch to retrieve the URLs when I search by title, so I am looking for the titles in the data.


Thanks,
Bhavana.

Sebastian Nagel

Jul 18, 2016, 6:15:10 PM
to common...@googlegroups.com
Hi Bhavana,

the title is also contained in the WAT file
(s3://commoncrawl/crawl-data/CC-MAIN-2016-22/segments/1464049270134.8/warc/CC-MAIN-20160524002110-00017-ip-10-185-217-139.ec2.internal.warc.gz)

It's just a couple of lines above in a record marked as "response":
"WARC-Type":"response" ...
"Title":" Wireless | Major Opportunities at the University of Houston-Downtown "

There are also records for crawling-related metadata.
Just skip all records of "WARC-Type":"metadata", since only the title and URL are relevant to you.
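
The skipping step can be sketched in Python roughly as follows. This is only an illustration, not an official tool: it assumes the JSON body of each WAT record has already been read as a string, that the URL is stored under "WARC-Target-URI" in "WARC-Header-Metadata", and that the title is stored under "Payload-Metadata" > "HTTP-Response-Metadata" > "HTML-Metadata" > "Head" > "Title". The function name title_and_url is made up for this example.

```python
import json


def title_and_url(wat_json):
    """Return (url, title) for a WAT record describing a response, else None."""
    envelope = json.loads(wat_json).get("Envelope", {})
    header = envelope.get("WARC-Header-Metadata", {})

    # Skip crawl-related records ("metadata", "request", "warcinfo", ...).
    if header.get("WARC-Type") != "response":
        return None

    # Assumed location of the title inside the WAT JSON (see the note above).
    head = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Head", {})
    )
    title = head.get("Title")
    if title is None:
        return None

    return header.get("WARC-Target-URI"), title.strip()
```

Records for which the function returns None can simply be ignored; the remaining (url, title) pairs are what would go into the search index.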

Best,
Sebastian


Bhavana

Jul 19, 2016, 2:31:14 PM
to Common Crawl
Hello Sebastian,

Thanks for the response.

I thought I would need only the URL and the title, but I have since learned that I need the title, description, content and URL. I didn't find the content in the WAT files. I have looked at the WARC files, but they are not in JSON format. I don't need the whole WARC file, only the title, description, content and URL. I would like to load these into Elasticsearch so that a search on content, description or title returns the matching URLs. Please let me know how to proceed.


Thanks,
Bhavana.

Sebastian Nagel

Jul 19, 2016, 4:15:15 PM
to common...@googlegroups.com
Hi Bhavana,


> I didn't find the content in the WAT files
Correct. The WAT files do not contain the textual content (or body).

There are two possible ways to go:
(a) the WET files contain only the extracted text content; you could process them in addition to the WAT files
(b) take the WARC files, parse the HTML content of every page and extract the fields which
     should go into the Elasticsearch index

(b) is certainly more complex, but it is the more reliable way, as you have full control over the HTML parser.
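
To give an idea of what the extraction step in (b) might look like, here is a minimal sketch using only Python's standard html.parser module. It assumes the HTML of a page has already been read from a WARC response record as a string (reading the records themselves is not shown), and the names PageFields and extract_fields are made up for this example. Real-world HTML is messy, so a production setup would likely need a more robust parser.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect the title, meta description and visible text of one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_fields(url, html):
    """Return a dict with the four fields that would go into the index."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description,
        "content": " ".join(parser.text_parts),
    }
```

Each returned dict holds the url, title, description and content of one page and could be indexed as one document.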

Sebastian

Bhavana

Jul 26, 2016, 9:30:54 AM
to Common Crawl
Hello Sebastian,

Thanks for the information.

I am now using the WARC files to extract the content.

Could you please tell me how many web pages a single WARC file can contain? I have seen a few pages that are repeated.

Thanks,
Bhavana.

Sebastian Nagel

Jul 26, 2016, 10:01:04 AM
to common...@googlegroups.com
Hi Bhavana,

WARC files are about 1 GB in size each (after gzip compression). Theoretically, there is no limit on
the number of documents in a WARC file. In practice, the number should vary only slightly from file to file.

But you can calculate the average number of documents/pages per WARC file:
First, get the number of WARC files for a monthly crawl:
 % aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2016-26/warc.paths.gz - | gzip -cd | wc -l
 20200

Divide the number of pages by the number of WARCs (1.23 billion pages for the June crawl):
 1230000000 / 20200 = 60891

That's around 60,000 pages per WARC file.
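
The same calculation can be written as a small Python sketch. It assumes the gzipped path listing (warc.paths.gz) has already been downloaded and is available as bytes; the function names count_paths and pages_per_warc are made up for this example.

```python
import gzip


def count_paths(paths_gz):
    """Count the non-empty lines (one WARC path each) in a gzipped listing."""
    lines = gzip.decompress(paths_gz).decode("utf-8").splitlines()
    return sum(1 for line in lines if line.strip())


def pages_per_warc(total_pages, warc_files):
    """Average number of pages per WARC file, rounded down."""
    return total_pages // warc_files
```

With the numbers above, pages_per_warc(1_230_000_000, 20200) gives 60891.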

Best,
Sebastian
