ANN: Web Data Commons Extraction Framework released

Robert Meusel

Aug 27, 2014, 7:37:10 AM
to web-data...@googlegroups.com

Hi all,

We are happy to announce the release of the WDC Extraction Framework, which is used by the Web Data Commons project to extract Microdata, Microformats, and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon cloud services. It is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extractions can be found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework and also to contribute their own improvements.

Best Regards,

Robert, Hannes, Oliver, Petar and Chris

Tie hky

Sep 26, 2014, 9:48:37 PM
to web-data...@googlegroups.com
Hi Robert,

I am really interested in the extraction framework and have spent some time reading the code. It looks like the framework generates one result file for each ARC/WARC file. Since there are hundreds of thousands of ARC/WARC files, that would mean hundreds of thousands of result files, if I am not mistaken. Was that an issue for WDC when you collected the hyperlink and RDF data?

A file can be divided across several WARC records. Who is responsible for combining those records to recover the original file?

Does the Java class org.archive.io.ArchiveReader do this? If not, should developers take care of it themselves?

Tie hky

Sep 27, 2014, 2:26:34 AM
to web-data...@googlegroups.com
Reading the source code again, it looks like the results for all ARC/WARC files are stored in SimpleDB.

Robert, can you help answer the following questions? Thanks.

A file can be divided across several WARC records. Who is responsible for combining those records to recover the original file?

Does the Java class org.archive.io.ArchiveReader do this? If not, should developers take care of it themselves?



Tie hky

Sep 27, 2014, 10:07:03 PM
to web-data...@googlegroups.com
The following code snippet is in the DataThread that is used to implement the "retrievedata" command:
String line;
while ((line = retrievedDataReader.readLine()) != null) {
    Line l = parseLine(line);
    if (l == null) {
        continue;
    }
    if (!RDFExtractor.EXTRACTORS.contains(l.extractor)) {
        log.warn(l.quad + "/" + l.extractor + " is strange...");
        continue;
    }

It looks like this code processes RDF data only. If I am wrong, please correct me.

Tie hky

Sep 28, 2014, 10:52:33 PM
to web-data...@googlegroups.com

ArcProcessor does NOT implement the org.webdatacommons.framework.processor.FileProcessor interface, so I am not sure how it is called by the framework.

public class ArcProcessor {
 private static Logger log = Logger.getLogger(ArcProcessor.class);


ArcProcessor starts a thread to load the ARC file asynchronously:
/**
 * This thread asynchronously copies the data from the gzipped ARC file
 * into the input buffer of the ARC file reader. The reader will block
 * once its buffers are full, and enables access to the ARC file entries.
 */
(new Thread() {
    public void run() {
        try {
            while (true) {
                ByteBuffer buffer = ByteBuffer.allocate(BLOCKSIZE);
                int bytesRead = gzippedArcFileBC.read(buffer);
                if (bytesRead > 0) {
                    buffer.flip();
                    reader.available(buffer);
                } else if (bytesRead == -1) {
                    reader.finished();
                    return;
                }
            }
        } catch (IOException e) {
            log.warn("Unable to pipe input to reader");
            reader.finished();
            return;
        }
    }
}).start();

It is completely different from the WarcProcessor.
Would you please give some info on why the ARC and WARC processors are so different? Thanks.

Robert Meusel

Sep 29, 2014, 2:48:40 AM
to web-data...@googlegroups.com
Hi Tie hky,

As the framework is completely parallelized, it will generate one file for each processed ARC/WARC/WAT file, so there is a large number of result files (depending on the crawl you use). After processing the files (meaning extracting the needed information), you have to collect the results and combine them as needed for your task.
In the case of the RDF data, we used the DataThread (in Master.java) to collect the files and write them into combined files. In the case of the hyperlink graph, we did not need this, as we processed the files further using Pig running on Hadoop.

Within one file (WARC, ARC, or WET) there are several records. The ArchiveReader is a library/class which takes care of the reading, so you can read each "chunk" one by one using an iterator. For more insights on the different formats, have a look at the blog post by Stephen from CC: http://commoncrawl.org/navigating-the-warc-file-format/
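
For illustration, iterating over the records of one crawl file could look roughly like this (a minimal sketch assuming the webarchive-commons ArchiveReader API; the file name is just a placeholder):

import java.io.File;
import java.io.IOException;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

public class RecordIterationSketch {
    public static void main(String[] args) throws IOException {
        // The factory detects ARC vs. WARC (gzipped or not) from the file.
        ArchiveReader reader = ArchiveReaderFactory.get(new File("CC-MAIN-example.warc.gz"));
        for (ArchiveRecord record : reader) {
            // Each record is one "chunk": a response, request, or metadata entry.
            System.out.println(record.getHeader().getUrl()
                    + " (" + record.getHeader().getLength() + " bytes)");
        }
        reader.close();
    }
}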

Robert Meusel

Sep 29, 2014, 2:49:17 AM
to web-data...@googlegroups.com
Hi,

As the comment above states, this is only used to collect the RDF data. In case you want something different, you have to reimplement the "run" method of the thread.
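
As a rough illustration only (not the actual DataThread code), a customized collection loop could simply download every result object and append it to one local file. The bucket name and prefix below mirror the snippet quoted later in this thread, and the JetS3t-style calls are an assumption about how the storage is accessed:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.jets3t.service.S3Service;
import org.jets3t.service.model.S3Object;

public class CollectResultsSketch {
    // Hypothetical replacement for the collection logic: concatenate all result
    // files below the "data/" prefix into one local gzip file (concatenated
    // gzip members still form a valid gzip stream).
    public static void collect(S3Service s3, String resultBucket) throws Exception {
        S3Object[] objects = s3.listObjects(resultBucket, "data/", null);
        OutputStream out = new FileOutputStream("combined-results.gz");
        byte[] buf = new byte[8192];
        for (S3Object object : objects) {
            // listObjects returns metadata only; fetch the actual content here.
            InputStream in = s3.getObject(resultBucket, object.getKey()).getDataInputStream();
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            in.close();
        }
        out.close();
    }
}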

Robert Meusel

Sep 29, 2014, 2:50:33 AM
to web-data...@googlegroups.com
Hi,

You are right. I need to move the ArcProcessor to the new structure, which I simply forgot. It is the way it is because we used the framework in a more static way in the beginning and did not make it so easy to adapt. In case you need examples of how to write your own code, have a look at the WarcProcessor or the WatProcessor.
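
Just for orientation, a custom processor could be sketched as below. Note that the process(...) signature here is an assumption modeled on how the processors are described in this thread, not the verified interface; check the WarcProcessor in the repository for the real FileProcessor contract before relying on it:

import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.HashMap;
import java.util.Map;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;
import org.webdatacommons.framework.processor.FileProcessor;

public class MyWarcProcessor implements FileProcessor {
    // Assumed signature - the actual interface may differ; see WarcProcessor.
    public Map<String, String> process(ReadableByteChannel fileChannel, String inputFileKey)
            throws Exception {
        long records = 0;
        InputStream in = Channels.newInputStream(fileChannel);
        ArchiveReader reader = ArchiveReaderFactory.get(inputFileKey, in, true);
        for (ArchiveRecord record : reader) {
            records++; // replace with your own per-record extraction logic
        }
        reader.close();
        Map<String, String> stats = new HashMap<String, String>();
        stats.put("records", Long.toString(records));
        return stats;
    }
}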

Cheers

Tie hky

Sep 29, 2014, 10:41:13 AM
to web-data...@googlegroups.com
Thanks Robert.

The ArchiveReader is a library/class which takes care of the reading so you can read each "chunk" one by one using an iterator.
Does this mean the developers should combine chunks to get a whole file that has been divided?

Tie hky

Sep 29, 2014, 11:08:28 AM
to web-data...@googlegroups.com
In the case of the RDF data, we used the DataThread (in Master.java) to collect the files and write them into combined files.

It looks like only one spot instance is used to collect the files. How long did it take to combine all the RDF files? How many RDF files were there to combine?

Tie hky

Sep 29, 2014, 11:15:29 AM
to web-data...@googlegroups.com
The following code snippet is used to collect all RDF files:

    S3Object[] objects = getStorage().listObjects(resultBucket,
            "data/", null);
    int i = 0;
    for (S3Object object : objects) {

If there are thousands and thousands of RDF files, is it possible that collecting all of them fails, and what are the requirements for memory and CPU?

Robert Meusel

Sep 29, 2014, 5:57:18 PM
to web-data...@googlegroups.com
No, not really. Of course, if it crashes it will restart, but you can implement a tracker to keep an eye on the current status.

You mainly need hard disk space. This stage processes just one file per CPU, so memory should not be a problem, but the data is stored locally, which means you should make sure you have enough space.

Robert Meusel

Sep 29, 2014, 5:58:01 PM
to web-data...@googlegroups.com


On Monday, September 29, 2014 at 5:15:29 PM UTC+2, Tie hky wrote:
The following code snippet is used to collect all RDF files:

    S3Object[] objects = getStorage().listObjects(resultBucket,
            "data/", null);
    int i = 0;
    for (S3Object object : objects) {

If there are thousands and thousands of RDF files, is it possible that collecting all of them fails, and what are the requirements for memory and CPU?

Last time we needed around 2 days to aggregate the data using this code. This strongly depends on the amount of data you gathered.

Tie hky

Sep 30, 2014, 12:02:41 AM
to web-data...@googlegroups.com
So how much did it cost?

Robert Meusel

Sep 30, 2014, 8:18:29 AM
to web-data...@googlegroups.com
In case you mean just the collection:

2 x 24h x $0.14 --> $6.72 (traffic from S3 to EC2 is free; we used something like an m3.large instance)

If you mean the extraction as a whole, it was less than $400 for the RDFa, MD, and MF extraction: http://webdatacommons.org/structureddata/index.html#toc4

Tie hky

Oct 1, 2014, 9:29:23 AM
to web-data...@googlegroups.com
Thanks Robert.


$400 was only for a single CC corpus. There are several corpora now, so it will cost thousands of dollars.

What I want to do is find some patterns in all JavaScript files. Do I need to deal with all the corpora? I feel that the only way of doing this is browsing the archive records in the ARC/WARC files; WAT and WET files do not help in any way. Is that correct?
Do you have any suggestions? Thanks.

Robert Meusel

Oct 1, 2014, 10:29:03 AM
to web-data...@googlegroups.com
Hi Tie hky,

I am not sure what exactly you are looking for, but as you want to see/inspect/parse the real source of a URL, you will need the WARC files, since WAT mostly includes the meta-data with links and WET is solely the textual representation. The JS files will either be included as "stand-alone" files (meaning ".js" files, if the crawler crawls them - this I do not know and you should ask CC), or the JS is embedded in the HTML, where you can use a regex to prefilter the pages and make sure you are not parsing HTML pages which do not even contain a javascript tag.
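
For example, a cheap prefilter (a sketch, not the framework's code) could just test for an opening script tag before handing the page to a parser:

import java.util.regex.Pattern;

public class ScriptPrefilter {
    // Case-insensitive test for an opening <script ...> tag. This is only a
    // coarse prefilter, not an HTML parser, so it may also match tags inside
    // comments or CDATA sections.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    public static boolean mightContainJavascript(String html) {
        return SCRIPT_TAG.matcher(html).find();
    }
}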

In addition, why do you want to crawl ALL corpora? They overlap, are from different times, and were crawled with different crawling strategies. In my eyes you should focus on the newest corpus (AUG 2014).

Cheers,
Robert

Tie hky

Oct 1, 2014, 5:31:13 PM
to web-data...@googlegroups.com
Thanks Robert.

Do you think it would be helpful if I got the links to all .js files from the metadata files? (I remember that the metadata contains the name of the ARC file that contains the URL, plus the offset and size, at least in the older corpus data. I am not sure whether the WAT files contain this info.)
After we get all the needed info, including the URL, the name of the corresponding ARC file, the offset, and the size, we can just read that part of the ARC file.

This is what I propose, but I am not sure whether it works, because I am not sure whether it is possible to read just a part of an ARC file.
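
In case it helps, reading a single record at a known offset could be sketched like this - assuming a local copy of the file, that each record is stored as its own gzip member (as in the Common Crawl files), and that the offset comes from the metadata; the names are placeholders:

import java.io.FileInputStream;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

public class OffsetReadSketch {
    // Read the single record that starts at byte 'offset' of a gzipped ARC/WARC file.
    public static void readRecordAt(String path, long offset) throws Exception {
        FileInputStream in = new FileInputStream(path);
        long skipped = 0;
        while (skipped < offset) {
            long n = in.skip(offset - skipped); // skip() may skip fewer bytes than asked
            if (n <= 0) {
                break;
            }
            skipped += n;
        }
        // atFirstRecord=true tells the reader that the stream starts at a record boundary.
        ArchiveReader reader = ArchiveReaderFactory.get(path, in, true);
        ArchiveRecord record = reader.iterator().next();
        System.out.println(record.getHeader().getUrl());
        reader.close();
    }
}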


In addition, why do you want to crawl ALL corpora? They overlap, are from different times, and were crawled with different crawling strategies. In my eyes you should focus on the newest corpus (AUG 2014).
I found this discussion thread in the CommonCrawl group; it looks like we need all the corpora?

Robert Meusel

Oct 2, 2014, 2:06:07 AM
to web-data...@googlegroups.com
Hi,

I am not sure why you want to have all the crawls. I mean, there is an overlap, and I cannot think of a use case right now where you would want to have "all" of them, even from different times.

You can give this a try - first find the meta-data within the WAT files and then parse the corresponding ARC files. I have no clue about the density of the .js files in the corpus. Have you tried to find out and make a small test? Is it 5%? 1%? 0.1%? Depending on this outcome you should select your strategy.
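
Such a small test could be as simple as counting, in one randomly picked file, how many records look like JavaScript (a sketch assuming the webarchive-commons ArchiveReader; note that for WARC response records the header mime type is the HTTP envelope type, so the URL suffix check does most of the work there):

import java.io.File;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

public class JsDensityProbe {
    public static void main(String[] args) throws Exception {
        long total = 0;
        long js = 0;
        ArchiveReader reader = ArchiveReaderFactory.get(new File(args[0]));
        for (ArchiveRecord record : reader) {
            total++;
            String mime = record.getHeader().getMimetype();
            String url = record.getHeader().getUrl();
            // Loose heuristic: count records served as javascript or fetched from a .js URL.
            if ((mime != null && mime.toLowerCase().contains("javascript"))
                    || (url != null && url.toLowerCase().endsWith(".js"))) {
                js++;
            }
        }
        reader.close();
        System.out.println(js + " of " + total + " records look like JavaScript");
    }
}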

Tie hky

Oct 2, 2014, 10:16:04 AM
to web-data...@googlegroups.com
Stephen from CC said:
Each of our crawls will hit a few billion pages, the majority of pages new but some subset may retrieve previously covered URLs,
This is the reason why I thought I should deal with all the crawls. Probably I misunderstood. Do you think it is enough to process the latest crawl only?

I think .js files should be a small percentage of a crawl. I will give it a try.

Even at a small percentage, we can expect that .js files are scattered across all the crawl files, so I would still need to read all of the ARC/WARC files. Do you know of any example that reads a random part of an ARC/WARC file? Thanks


Robert Meusel

Oct 2, 2014, 11:34:47 AM
to web-data...@googlegroups.com


On Thursday, October 2, 2014 at 4:16:04 PM UTC+2, Tie hky wrote:
Stephen from CC said:
Each of our crawls will hit a few billion pages, the majority of pages new but some subset may retrieve previously covered URLs,
This is the reason why I thought I should deal with all the crawls. Probably I misunderstood. Do you think it is enough to process the latest crawl only?
Depends on your goal. Extracting .js files is fine - but what do you want to do with it?

I think .js files should be a small percentage of a crawl. I will give it a try.

Even at a small percentage, we can expect that .js files are scattered across all the crawl files, so I would still need to read all of the ARC/WARC files. Do you know of any example that reads a random part of an ARC/WARC file? Thanks
No sorry. Just select one file randomly and have a look. 