Unusual search question

TheBean InABox

Oct 3, 2016, 1:34:41 PM
to Common Crawl
Hello,

Sebastian Nagel was kind enough to reply to my email (thank you, Sebastian!) and suggested I post this question here.

I run a small Boston-based business.

I'm hoping you might help me solve an unusual search question.

I'd like to see where some of the pockets of passionate thinkers who aren't driven by commerce are on the web (nothing against commerce; I'd just like to see what's really out there).

To do that, I'd like to search for websites that are at least 100 pages in size, are still online today, but haven't had any pages added or updated since 2014.

I'm just a little tech-savvy, not at all afraid to learn something new, and can follow the instructions from Steve Salevan here:


But beyond that I'm lost and have zero idea how to proceed.

Can this be done?

ANY assistance, advice, or even a few (mild) insults would be greatly appreciated!

Thank you!

Best,
Rich

Sebastian Nagel

Oct 4, 2016, 2:58:02 AM
to common...@googlegroups.com
Hi Rich,

Here is just a draft of one way to come close to the desired result...
I don't know of a ready-to-use solution for this problem; some programming is needed. :)
And maybe someone has a better and smarter solution for this interesting problem!

> To do that I'd like to search to find websites that are of at least 100 pages in size, are still
> online today, but that haven't had any pages added or updated since 2014.

The Common Crawl index contains a document checksum/hash for each crawled page, computed on the
"raw" HTML content. That content includes navigation elements and boilerplate, and may also contain
elements which change frequently: date or time, number of visitors, etc.

Indexes are available back to 2014. One possible way to get the desired set of websites:

1. prepare a list of websites of sufficient size. This is possible with Common Crawl data, but keep
in mind that only a sample is crawled and there is no guarantee that a website is crawled
exhaustively.

2. take at least two indexes, one from 2014 (or early 2015) and a recent one, and compare the
checksums for the pages of your candidate websites. If the checksums differ, the website the page
belongs to can be excluded.

Maybe it's more efficient to start with step 2, since the set of pages unchanged since 2014 is
probably quite small (especially if the raw HTML is compared).
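
[Editor's note: a minimal sketch of both steps, assuming the Python requests library and the CDX index API at https://index.commoncrawl.org/. The two crawl IDs are examples only; check the index server's front page for the collections that actually exist, and note that very large sites would need the API's pagination, which this sketch skips.]

import json
import requests

OLD_INDEX = "https://index.commoncrawl.org/CC-MAIN-2014-10-index"   # an early-2014 crawl
NEW_INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-40-index"   # a recent crawl

def cdx_lines(index_url, url_pattern):
    """Query one crawl index and return the parsed JSON result lines."""
    resp = requests.get(index_url, params={"url": url_pattern, "output": "json"})
    if resp.status_code != 200:                # nothing captured for this pattern
        return []
    return [json.loads(line) for line in resp.text.splitlines()]

def crawled_page_count(index_url, domain):
    """Step 1 (roughly): how many distinct pages of a site were captured."""
    return len({row["url"] for row in cdx_lines(index_url, domain + "/*")})

def unchanged_since_2014(page_url):
    """Step 2: compare the content digests of one page between the two crawls."""
    old = {row["digest"] for row in cdx_lines(OLD_INDEX, page_url)}
    new = {row["digest"] for row in cdx_lines(NEW_INDEX, page_url)}
    return bool(old) and bool(new) and bool(old & new)

if __name__ == "__main__":
    print(crawled_page_count(NEW_INDEX, "example.com"))
    print(unchanged_since_2014("example.com/"))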

The result would be a list of candidates, not a final result set, because:
- [at least 100 pages in size]: the number of crawled pages of a site is probably lower than the
real number of pages
- sites may be missing from the Common Crawl snapshots entirely
- page additions are also hard to detect with a snapshot crawl

Best,
Sebastian

Christian Lund

Oct 4, 2016, 3:16:35 AM
to Common Crawl
Hi Rich,

We use the Common Crawl WAT files, which are condensed versions of the WARC files containing only structured metadata from the crawled pages. For each domain we build a link profile of internal and external links; for your project you would only need to look at internal links.

Once you have an internal link profile (or sitemap) for each domain, you can start comparing this data to historical data.
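
[Editor's note: a minimal sketch of that link-profile step, assuming the Python warcio library and a WAT file that has already been downloaded locally; the JSON field names follow the WAT layout documented on commoncrawl.org (Envelope / Payload-Metadata / HTTP-Response-Metadata / HTML-Metadata / Links).]

import json
from collections import defaultdict
from urllib.parse import urljoin, urlparse

from warcio.archiveiterator import ArchiveIterator

def internal_links_per_host(wat_path):
    """Build host -> set of internal link targets from one local WAT file."""
    profile = defaultdict(set)
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":       # WAT entries are metadata records
                continue
            page_url = record.rec_headers.get_header("WARC-Target-URI")
            if not page_url:
                continue
            data = json.loads(record.content_stream().read())
            links = (data.get("Envelope", {})
                         .get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Links", []))
            host = urlparse(page_url).netloc
            for link in links:
                if not link.get("url"):
                    continue
                target = urljoin(page_url, link["url"])
                if urlparse(target).netloc == host:  # keep internal links only
                    profile[host].add(target)
    return profile

[Comparing those per-host URL sets between an old and a recent crawl would then approximate the "no pages added" check.]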

You'll need considerable computing power to dig through all the data (both now and then), not least for the comparison phase, so it might be a good idea to limit your scope to certain top-level domains or content languages.

Best regards,
Christian

TheBean InABox

Oct 24, 2016, 9:55:48 AM
to Common Crawl
Hi Sebastian and Christian,

Thank you for your replies!

Not surprisingly, this is far more complex than I had realized. :)

So I'm taking another tack to try to pare this process down.

Would it be possible to search through website footer data for the "copyright" symbol with the year 2013 within the 2016 corpus? Or maybe even just the year 2013 in the footer?

In an attempt to reduce processing demands, I'd like to restrict the data set to .com domains (ideally not subdomains) and sites written in English, if that's possible and/or helps at all.

As always, ANY thoughts would be greatly appreciated!!!

Thank you!

Best,
Rich

Christian Lund

Nov 1, 2016, 4:12:09 AM
to Common Crawl
Hi Rich,

It's certainly a good idea to narrow the scope of your search results, but regardless you will need to dig through all the data first in order to get your processing list.

You'll probably want to use the WET files, since these are plain text and smaller than the WARC files, which will save you some bandwidth. 

Each crawled page includes metadata, including the target URI, so you can easily filter on .com domains and exclude subdomains if you like (though keep in mind that a lot of blogs use subdomains).
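
[Editor's note: for illustration, a filter along these lines is probably enough for the .com / no-subdomain restriction; the special handling of "www." is an assumption about what should still count as the main site.]

from urllib.parse import urlparse

def keep_url(target_uri):
    """True for pages on a bare .com domain (www. treated as the bare domain)."""
    host = urlparse(target_uri).netloc.lower().split(":")[0]
    if host.startswith("www."):
        host = host[4:]
    return host.endswith(".com") and host.count(".") == 1

# keep_url("http://example.com/page")      -> True
# keep_url("http://blog.example.com/post") -> False (a subdomain)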

Websites often include a language meta tag, which you can filter on, but beware that these are not always accurate. I would only recommend using them as a loose reference.


For processing: the first thing you would need to do is research how the copyright snippet is used on web pages. There is no standard for this, and webmasters use it differently, so you need to sample a good portion of websites and derive some rules to detect the copyright snippet.

Once you have your detection rules in place, you could add another check to determine whether the copyright mention is in the footer or not.
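
[Editor's note: as one possible starting point for those rules, a regular expression along these lines catches the common notices where 2013 is the last year mentioned. Real footers vary a lot (year ranges, missing symbols, JavaScript-generated years), so treat it as a first filter to refine against a sample rather than a finished rule.]

import re

COPYRIGHT_2013 = re.compile(
    r"(?:©|\(c\)|copyright)\s*"        # the symbol or the word
    r"(?:\d{4}\s*[-–]\s*)?"            # optional start of a range, e.g. "2009-2013"
    r"2013\b(?!\s*[-–]\s*\d)",         # 2013, but not the start of "2013-2016"
    re.IGNORECASE,
)

def copyright_ends_2013(page_text):
    # Footers usually sit at the end of the extracted text, so check the tail.
    return bool(COPYRIGHT_2013.search(page_text[-2000:]))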

There are roughly 30,000 WET files in a monthly archive, so you need to find a way to process these files efficiently and then generate a subset you can work on. Perhaps three stages: one extracting data, one validating your subsets, and one comparing data.
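
[Editor's note: a rough sketch of the extraction stage, assuming the Python warcio and requests libraries. The https://data.commoncrawl.org/ prefix and the CC-MAIN-2016-40 crawl ID are assumptions; check the current documentation on commoncrawl.org for the paths that apply. It streams one WET file and yields the URLs of records whose plain text satisfies a caller-supplied predicate, which is where the .com and copyright filters sketched above would plug in; in practice you would fan this out over many workers, one per WET file, rather than loop serially.]

import gzip
import io

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"

def wet_paths(crawl="CC-MAIN-2016-40"):
    """Download the list of WET file paths for one monthly crawl."""
    resp = requests.get(BASE + "crawl-data/" + crawl + "/wet.paths.gz")
    with gzip.open(io.BytesIO(resp.content), mode="rt") as fh:
        return [line.strip() for line in fh if line.strip()]

def matching_urls(wet_path, predicate):
    """Yield target URIs of WET records whose extracted text satisfies predicate(url, text)."""
    with requests.get(BASE + wet_path, stream=True) as resp:
        for record in ArchiveIterator(resp.raw):
            if record.rec_type != "conversion":     # WET text records are "conversion"
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            if url and predicate(url, text):
                yield url

# Example with a crude stand-in predicate; swap in the real detection rules:
# for url in matching_urls(wet_paths()[0], lambda u, t: "2013" in t[-2000:]):
#     print(url)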




