WET files don't seem to contain target URL's data


Josh

Mar 6, 2021, 3:01:50 PM
to Common Crawl
Hello everyone, 

I'm new to Common Crawl and am trying to obtain the WET file contents for particular pages. For reference, I'm using "https://www.reddit.com/r/dataisbeautiful/*" just for testing purposes.

When I run a GET request against the index, I get back the following:
{'url': 'https://www.reddit.com/r/dataisbeautiful/comments/1fgz8q/structure_of_romantic_and_sexual_relations_at/', 'filename': 'crawl-data/CC-MAIN-2021-04/segments/1610703538431.77/warc/CC-MAIN-20210123191721-20210123221721-00066.warc.gz', 'offset': '941486961', 'length': '68465'}
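(For anyone following along, a minimal sketch of that kind of index lookup, assuming the CDX index server at index.commoncrawl.org and the CC-MAIN-2021-04 crawl rather than whatever was actually used here:)

import json
import requests

# Query the CC-MAIN-2021-04 CDX index for captures matching the URL pattern.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2021-04-index"
resp = requests.get(INDEX, params={"url": "https://www.reddit.com/r/dataisbeautiful/*",
                                   "output": "json"})
resp.raise_for_status()

# The response is newline-delimited JSON; each line is one capture with
# fields such as url, filename, offset, and length.
captures = [json.loads(line) for line in resp.text.splitlines() if line.strip()]
print(captures[0]["url"], captures[0]["filename"])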

Based on the filename field in the response, I altered the URL by swapping "/warc/" for "/wet/" and ".warc.gz" for ".warc.wet.gz" and pulled from the following link:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-04/segments/1610703538431.77/wet/CC-MAIN-20210123191721-20210123221721-00066.warc.wet.gz

Yet when I do another GET request on that new URL, none of the pages in the resulting file are from reddit, let alone the specific URL I pasted above. There are plenty of other websites in there, as expected (my understanding is that I can't request a byte range into the WET file the way I can with a WARC file). Am I missing something conceptually here? I expected to find a reddit URL somewhere in the resulting WET file, but there's nothing. I've tried to replicate this with several other websites and still had no luck, so I don't believe it's a reddit-specific issue. I can post my code (Python) if necessary, but I've checked it at every point and it behaves as I expect.
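(A rough sketch of those two steps, not the poster's actual code: the WET path is derived by string replacement, and the single WARC record can be fetched with an HTTP Range request built from offset/length, whereas the WET file has no per-URL offsets and has to be scanned whole. The S3 prefix and field names are taken from the index response above.)

import requests

capture = {
    "filename": "crawl-data/CC-MAIN-2021-04/segments/1610703538431.77/warc/"
                "CC-MAIN-20210123191721-20210123221721-00066.warc.gz",
    "offset": "941486961",
    "length": "68465",
}
PREFIX = "https://commoncrawl.s3.amazonaws.com/"

# Derive the WET path from the WARC path: swap /warc/ -> /wet/ and
# .warc.gz -> .warc.wet.gz (the same rewrite described above).
wet_url = PREFIX + capture["filename"].replace("/warc/", "/wet/") \
                                      .replace(".warc.gz", ".warc.wet.gz")

# The WARC record itself can be pulled directly with a byte-range request,
# since offset/length point at one gzip member inside the large WARC file.
start = int(capture["offset"])
end = start + int(capture["length"]) - 1
warc_record_gz = requests.get(PREFIX + capture["filename"],
                              headers={"Range": f"bytes={start}-{end}"}).content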

Any help is much appreciated, thank you! 

Colin Dellow

Mar 6, 2021, 3:16:03 PM
to Common Crawl
Welcome to the mailing list!

I think your code may be the root cause -- can you share it as well?

Based on what you've written, I think you're saying that a capture exists for a given URL in the WARC file but not in the WET file. When I tried to reproduce this with a quick-and-dirty command-line test, I found the capture in both.

First, I confirmed that the WARC file did include the capture of the dataisbeautiful URL:

WARC/1.0
WARC-Type: response
WARC-Date: 2021-01-23T21:15:21Z
WARC-Record-ID: <urn:uuid:a96087d4-8c01-48d3-9dea-edd004ceac61>
Content-Length: 723491
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:a5e9448b-9868-47f4-9d9f-b13f2a12b11a>
WARC-Concurrent-To: <urn:uuid:ffb47486-8dfc-4de6-9ea8-a54271c4312c>
WARC-IP-Address: 151.101.117.140

Then I downloaded the WET file and searched it for the URL:

$ wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-04/segments/1610703538431.77/wet/CC-MAIN-20210123191721-20210123221721-00066.warc.wet.gz
$ zgrep --line-number --text /r/dataisbeautiful CC-MAIN-20210123191721-20210123221721-00066.warc.wet.gz 
7309689:There is a thread on reddit (http://www.reddit.com/r/dataisbeautiful/comments/1fgz8q/structure_of_romantic_and_sexual_relations_at/) which cites the chains of affection as an example of 'data is beautiful'.
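(The same check can be done in Python instead of zgrep; a rough sketch, assuming the warcio package, that walks every record in the downloaded WET file and matches on the WARC-Target-URI header:)

from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-20210123191721-20210123221721-00066.warc.wet.gz", "rb") as fh:
    for record in ArchiveIterator(fh):
        # WET text records have type "conversion"; the original page URL is
        # carried in the WARC-Target-URI header.
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri and "/r/dataisbeautiful/" in uri:
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(uri, len(text))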


Tom Morris

Mar 6, 2021, 3:20:32 PM
to common...@googlegroups.com
As Colin mentioned, you're close and that WET file does contain the entry that you're looking for, although I'm not sure how useful it'll be since it's polluted with a TON of boilerplate:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.reddit.com/r/dataisbeautiful/comments/1fgz8q/structure_of_romantic_and_sexual_relations_at/
WARC-Date: 2021-01-23T21:15:21Z
WARC-Record-ID: <urn:uuid:15fe0c20-bf0b-45c8-86c4-77b0a2b529e4>
WARC-Refers-To: <urn:uuid:a96087d4-8c01-48d3-9dea-edd004ceac61>
WARC-Block-Digest: sha1:V6THSFEB34KIFOL4CHKKCAKVHYJFEOPZ
WARC-Identified-Content-Language: eng
Content-Type: text/plain
Content-Length: 62720

Structure of romantic and sexual relations at "Jefferson High School" : dataisbeautiful
jump to content
my subreddits
edit subscriptions
popular
-all
-random
-users
|
AskReddit
-funny
...

Josh

Mar 6, 2021, 4:00:54 PM
to Common Crawl
Wow, thanks guys, you were definitely right about it being my code. As I was cleaning it up and commenting it for readability here, I decided to experiment and realized the guide I had been following was, for some reason, only pulling every other record, not each individual one. Why that is I have no idea, but I was only getting every odd entry in the WET file, and I guess I got unlucky and every link I tried was an even entry.

While I'm here though, can I ask what you mean by "there's a ton of boilerplate"? Do you just mean the record info atop the content? I was able to parse all of that out and take just the actual text from the webpage, but I'm not sure if there's something else I'm missing. This is for a project where I just need the raw text from each page, and I was wondering whether, even though I have this method working now, there was a better/smarter way to do it. My script currently automates the entire process: it fetches all of the S3 URLs for the list of pages I need scraped, then pulls their WET files one by one and generates a .txt file with the raw page text for each page I needed.
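(Not the script described above, just a minimal sketch of that last step under the same assumptions: warcio again, with a hypothetical set of wanted URLs and output directory. Iterating with ArchiveIterator visits every record, so none are skipped, and each matching page is written to its own .txt file:)

from pathlib import Path
from urllib.parse import quote
from warcio.archiveiterator import ArchiveIterator

# Hypothetical inputs: the pages to extract and a downloaded WET file.
wanted = {"https://www.reddit.com/r/dataisbeautiful/comments/1fgz8q/structure_of_romantic_and_sexual_relations_at/"}
wet_path = "CC-MAIN-20210123191721-20210123221721-00066.warc.wet.gz"
out_dir = Path("wet_text")
out_dir.mkdir(exist_ok=True)

with open(wet_path, "rb") as fh:
    for record in ArchiveIterator(fh):  # yields every record, one by one
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri in wanted:
            text = record.content_stream().read().decode("utf-8", errors="replace")
            # One .txt per page, named after the (percent-encoded) URL.
            (out_dir / (quote(uri, safe="") + ".txt")).write_text(text, encoding="utf-8")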

Tom Morris

Mar 6, 2021, 5:03:38 PM
to common...@googlegroups.com
On Sat, Mar 6, 2021 at 4:00 PM Josh <omidmo...@gmail.com> wrote:

> While I'm here though, can I ask what you mean by "there's a ton of boilerplate"?
> Do you just mean the record info atop the content? I was able to parse all of that out
> and take just the actual text from the webpage, but I'm not sure if there's something
> else I'm missing.

There is also boilerplate sprinkled throughout the page, not just at the top.
For example, these two comments span more than a dozen lines but contain only a couple of sentences of actual text:

[–]MarioisKewl 47 points48 points49 points 7 years ago (1 child)
Huh, I definitely missed out on that. I'm even a band kid in college
and still missing out on it.
permalink
embed
save
parent
give award
[–]Fapplesauced 15 points16 points17 points 7 years ago (0 children)
You should have gone to band camp.
permalink
embed
save
parent
give award
load more comments (1 reply)

There are heuristics people have used to remove short lines or phrases without sentence punctuation from text-only extracts of HTML pages; I tend to think having the HTML structure available makes this easier.
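
(A toy version of that kind of heuristic, just a sketch with arbitrary thresholds rather than a recommendation of any particular tool: keep only lines that are reasonably long and end with sentence punctuation.)

def strip_boilerplate(text: str, min_words: int = 5) -> str:
    """Drop short lines and lines that don't end like a sentence."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) >= min_words and line.endswith((".", "!", "?")):
            kept.append(line)
    return "\n".join(kept)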

If the algorithms that you are using are producing the type of output you want,
you should stick with them.

Tom

Josh

Mar 6, 2021, 8:19:03 PM
to Common Crawl
Ah, makes sense. My partner on this wants the raw WET file, so I assume they'll (hopefully) have that handled, but I'll bring it up just as a precaution. Thanks again to both of you for your help on this!