Project ideas using common-crawl dataset


Pramod Bharadwaj C

Jun 14, 2015, 6:41:32 PM
to common...@googlegroups.com
We are a team of 4 looking to work with the Common Crawl dataset. We are looking for a project idea that involves answering a data science problem. Can you please suggest some ideas?


Thanks,
Pramod Bharadwaj Chandrashekar

OneSpeedFast

Jun 14, 2015, 6:52:19 PM
to common...@googlegroups.com
Consider a relational database that stores only the unique pieces of data for each page. I have been building something like this for some time, and have managed to eliminate a whole lot of duplicate data.
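The idea above can be sketched with plain content hashing: digest each page body and keep only the first copy seen. This is a minimal illustrative sketch (the class and method names are mine, not from any actual implementation), assuming deduplication on exact body content:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: deduplicate page bodies by a SHA-256 content hash,
// keeping one stored copy per distinct body. In a real system the hash would
// be the key into a relational table of unique page content.
public class PageDeduper {
    private final Map<String, String> bodyByHash = new HashMap<>();

    // Returns true if this body was new, false if an identical body was seen before.
    public boolean add(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // putIfAbsent returns null only when the hash was not present yet.
            return bodyByHash.putIfAbsent(hex.toString(), body) == null;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public int uniqueCount() {
        return bodyByHash.size();
    }
}
```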


Laura Dietz

Jun 15, 2015, 5:57:21 AM
to common...@googlegroups.com
Hi Pramod,

Are you looking for a 'cool' project or for a 'useful' project?

A 'useful' project would be to build an inverted index over the most informative 1-2 million words.
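A toy in-memory version of that inverted index could look like the following sketch (class and method names are illustrative; a crawl-scale build would shard the postings across machines):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch: map each term to the sorted set of document ids
// that contain it, so term lookups become a single map access.
public class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        // Naive tokenization on non-word characters; real pipelines would
        // normalize, stem, and filter to the informative vocabulary.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    public SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }
}
```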

Cheers,
Laura

Dominik Stadler

Jun 15, 2015, 7:54:16 AM
to common...@googlegroups.com
Hi,

I and others are looking for certain types of files in the crawls,
e.g. document types as test data for Apache Tika and Apache POI, or
images for some image-processing functionality. So a reverse index
by filetype/mimetype would be beneficial for such efforts.
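The index described above amounts to grouping crawl URLs by MIME type. A minimal sketch, assuming (url, mimeType) pairs are already extracted from crawl metadata (names here are mine, not Common Crawl's actual schema):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: group crawl URLs by MIME type so that hunts for
// test data of a given format become a single lookup.
public class MimeTypeIndex {
    private final Map<String, List<String>> urlsByType = new HashMap<>();

    public void record(String url, String mimeType) {
        urlsByType.computeIfAbsent(mimeType, t -> new ArrayList<>()).add(url);
    }

    public List<String> urlsFor(String mimeType) {
        return urlsByType.getOrDefault(mimeType, Collections.emptyList());
    }
}
```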

Dominik.

Eugene Nana Opoku

Jun 18, 2015, 9:25:04 AM
to common...@googlegroups.com
Well, I'm working on a project that analyzes the characteristics of spam and ham pages within webgraph data. However, I've had some difficulty extracting these nodes and connections from the WDC Hyperlink 2012 dataset. Any help is welcome.

Thanks
Eugene

Mat Kelcey

Jun 18, 2015, 9:50:22 AM
to common...@googlegroups.com

Have a read through http://www-nlp.stanford.edu/IR-book/. Pretty much anything in there, applied at Common Crawl scale, would be a non-trivial project.



Eugene Nana Opoku

Jun 18, 2015, 10:47:14 AM
to common...@googlegroups.com
Actually, webgraph as proposed by Boldi and Vigna has been applied to many compressed web page/URL datasets. Since the WDC and webspam-uk datasets are both compressed in BV format, the retrieval methods should be applicable to any webgraph dataset. Using Maven on the command line has not been successful because of missing dependencies. Creating the offset file with "java it.unimi.dsi.webgraph.BVGraph -O uk-2006-05-nat" gives: Error: could not find or load main class it.unimi.dsi.webgraph.BVGraph -O uk-2006-05-nat.

Robert Meusel

Jun 19, 2015, 3:40:13 AM
to common...@googlegroups.com
Eugene, maybe I can help. What is the problem with the WDC graph? What are you trying to do?

In WDC we mostly use the webgraph library within Maven projects, with the following set of dependencies:

<dependency>
    <groupId>it.unimi.dsi</groupId>
    <artifactId>webgraph-big</artifactId>
    <version>3.3.5</version>
</dependency>
<dependency>
    <groupId>it.unimi.dsi</groupId>
    <artifactId>dsiutils</artifactId>
    <version>2.1.4</version>
</dependency>
<dependency>
    <groupId>it.unimi.dsi</groupId>
    <artifactId>fastutil</artifactId>
    <version>6.5.9</version>
</dependency>

within the pom.xml.

Then you can simply read the webgraph files using:

BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());

where baseName is the common name of the three graph files without the file extension.

To initially compress a graph (arc list) you can use:

ProgressLogger pl = new ProgressLogger();
BVGraph.store(ArcListASCIIGraph.loadOnce(new FileInputStream("yourArcFile"), 0), "outputPath", pl);
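For reference, the arc list consumed by ArcListASCIIGraph.loadOnce is, to my understanding, plain text with one tab-separated source/target node-id pair per line. A tiny illustrative writer (the file name "yourArcFile" just matches the placeholder in the snippet above):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Illustrative only: emit arcs as "source<TAB>target" id pairs, one per line,
// the plain-text form that ArcListASCIIGraph can read before BV compression.
public class ArcListWriter {
    public static String toArcList(List<int[]> arcs) {
        StringBuilder sb = new StringBuilder();
        for (int[] arc : arcs) {
            sb.append(arc[0]).append('\t').append(arc[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String arcs = toArcList(Arrays.asList(
                new int[]{0, 1}, new int[]{0, 2}, new int[]{1, 2}));
        Files.write(Paths.get("yourArcFile"), arcs.getBytes(StandardCharsets.UTF_8));
    }
}
```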