Exploring Common Crawl with a keyword


Paul Tolmer

6 Feb 2015, 07:41:16
to common...@googlegroups.com
Hello everyone,

I am a grad student working on a thesis about big data analysis. I am trying to analyse the way the term "big data" is used and understood online; I am hoping to find evidence that a large portion of articles on the subject misuse and/or misunderstand the term.

My idea is to obtain a statistically significant sample of online articles (text from webpages), to check correlations between the term "big data" and other meaningful terms (either at the sentence or at the article level), and to do some statistical analyses of said correlations (for starters, some simple correspondence analysis, and see where it takes me).

My problem is that my computer science background is very limited. I have an engineering background and studied some Java and HTML, but only a little and a long time ago. Every time I try to look in depth at the tools I would need to master for this, I get a bit lost.

I am trying to understand how difficult it would be to extract relevant webpages (just the text) from the Common Crawl data, so that I can then analyse them locally. I am just looking to build a relatively simple database of articles selected with a relevance criterion (I am thinking of articles having "big data" in their title, for instance). I do not need a TB worth of articles, just enough to run some relevant statistical analysis on (I am not sure how much I would need, but I am thinking a random sample of articles about big data of under 1 GB would probably be plenty!)

Any help would be greatly appreciated in the following:
- estimating feasibility and difficulty for an 'almost-layman', and maybe estimating cost as well?
- examples of similar projects
- anything else that seems helpful!! ;-)

Thanks in advance!

Kevin Fink

6 Feb 2015, 09:37:07
to common...@googlegroups.com
In this particular case, I probably wouldn't use the Common Crawl data. Since there is no keyword index available, you would need to iterate over the entire data set searching for your phrase. That's a not-insignificant amount of work, especially if you're not comfortable writing MapReduce jobs, and probably a couple hundred dollars' worth of compute time on AWS.

Instead, I'd use something like the Bing Web Search API. It offers a really simple API, and the first 5,000 queries per month are free, which would be more than enough to build a good corpus of articles. Then you'd need to download each one and parse out the text. Not difficult, but there may be some ramp-up time given that you're not immersed in coding. Probably an hour or two all told for someone familiar with the space (I did a similar project last week; it took me about 3 hours, but that included a web interface and some processing of the crawled text afterwards).
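To give a feel for the download-and-parse step, here's a minimal sketch (Python, using requests and BeautifulSoup; the URL list is a placeholder for whatever the search API returns):

    import requests
    from bs4 import BeautifulSoup

    # placeholder; in practice these come from the search API's results
    urls = ["http://example.com/some-big-data-article"]

    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(resp.text, "html.parser")
        # drop scripts and styles so only the visible text remains
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = soup.get_text(separator=" ", strip=True)
        print(url, "->", len(text.split()), "words")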


Kevin


Mat Kelcey

6 Feb 2015, 23:00:10
to common...@googlegroups.com

depending on what you're trying to analyse, you probably _don't_ want to filter on 'big data' first, since you're introducing a pretty big selection bias; i.e. the correlation between "big data" and some other meaningful term, given that the doc already contains "big data", is going to be very different from the same correlation measured over completely random pages.

this is one of the nice things about common crawl: it's easy to randomly sample a small percentage of the data and have some faith that it's representative of the internet... content duplication aside :) note that it's a good idea to do 2 or 3 separate samples too, so you can check things like sample variance.
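a rough sketch of that sampling (python; assumes you've fetched and gunzipped the crawl's wet.paths file listing, so the filename is an assumption):

    import random

    # wet.paths: the list of WET file paths published alongside each crawl
    # (filename assumed; grab it from the crawl announcement page)
    with open("wet.paths") as f:
        paths = [line.strip() for line in f]

    # three independent 20-file samples, so you can compare statistics
    # across samples and eyeball the sample variance
    samples = [random.Random(seed).sample(paths, 20) for seed in (1, 2, 3)]
    for i, sample in enumerate(samples, 1):
        print("sample", i, "first path:", sample[0])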

Is there any form of a visible text dump of the common crawl available these days?

Mat

Mat Kelcey

7 Feb 2015, 00:15:45
to common...@googlegroups.com
On Fri Feb 06 2015 at 8:00:08 PM Mat Kelcey <matthew...@gmail.com> wrote:

Is there any form of a visible text dump of the common crawl available these days?


oh, WET files of course!
(i need to update my years-old ARC examples :)

this should be pretty easy to script up as a proof of concept, Paul.

you'll want to run as much of it as possible on AWS to avoid that costly download, since each file doesn't yield a whole lot of matches...
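something like this as a minimal proof of concept (python; the WET path below is a placeholder you'd swap for a real entry from wet.paths, and the base URL is an assumption, so check the docs for the current prefix):

    import gzip
    import requests

    # placeholder; substitute a real path from the crawl's wet.paths listing
    wet_url = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2014-52/"
               "segments/.../wet/....warc.wet.gz")

    resp = requests.get(wet_url, stream=True)
    resp.raise_for_status()

    # WET files are gzip-compressed plain text, so a line-by-line scan
    # is enough for a proof of concept
    count = 0
    with gzip.GzipFile(fileobj=resp.raw) as f:
        for raw_line in f:
            count += raw_line.decode("utf-8", errors="replace").lower().count("big data")
    print("occurrences of 'big data':", count)

run it from an EC2 instance in the crawl's region and you avoid the data transfer cost entirely.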

Stephen Merity

9 Feb 2015, 20:42:59
to common...@googlegroups.com
Hi Paul,

To follow up on the two good responses you've already received, I just thought I'd add some extra details.

As Kevin said, you might be able to get away with using off-the-shelf search APIs, depending on exactly how much data you want. One thing I will note, though, is that a MapReduce job over the relevant text data of Common Crawl won't cost hundreds of dollars; it would more likely cost somewhere between $30 and $60, possibly far less if you run particularly optimized code.

Mat's code is a great starting point and shows how you can get the result you're interested in without using MapReduce. Best of all, you can use his method on repeat until you feel you have enough data to go forward. Whilst his example stops short of pulling out the URLs that contain the original text, that isn't too difficult either.

His example pulled out 434 occurrences of "big data" from a single WET file, and there are over 43,000 WET (text extract) files in the December crawl, so we're looking at (guesstimate) tens of millions of references to big data across the full dataset. Pulling the "big data" text out of the 43k files one at a time, until you've had your fill of references, would likely work quite well.
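For completeness, here's a sketch of recovering the source URLs while scanning (Python; it relies on the WET record layout, where each record carries a WARC-Target-URI header followed by the extracted plain text, and the helper name is just illustrative):

    import gzip
    import requests

    def big_data_urls(wet_url):
        """Yield the WARC-Target-URI of each WET record mentioning 'big data'."""
        resp = requests.get(wet_url, stream=True)
        resp.raise_for_status()
        current_uri, matched = None, False
        with gzip.GzipFile(fileobj=resp.raw) as f:
            for raw_line in f:
                line = raw_line.decode("utf-8", errors="replace")
                if line.startswith("WARC-Target-URI:"):
                    # a new record begins; emit the previous one if it matched
                    if matched and current_uri:
                        yield current_uri
                    current_uri, matched = line.split(":", 1)[1].strip(), False
                elif "big data" in line.lower():
                    matched = True
        if matched and current_uri:
            yield current_uri

Looping that over one sampled WET file at a time, and stopping once you have your fill, is the "on repeat" approach above.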

Mat's point on selection bias is an important one too -- if you use the search technique, you are strongly biased by what the search engine's algorithm has decided is relevant, however it defines "relevant".

Good luck and I'd love to see what you make of it! =]




--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
Message has been deleted
Message has been deleted
Message has been deleted