Hi all,
We are happy to announce the release of the WDC Extraction Framework, which is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon cloud services. It is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks.
More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extractions can be found at
http://webdatacommons.org/framework
We encourage all interested parties to make use of the framework and also to contribute their own improvements.
Best Regards,
Robert, Hannes, Oliver, Petar and Chris

I am really interested in the extraction framework. I spent some time reading the code. It looks like the framework generates one result file for each ARC/WARC file. As we know, there are hundreds of thousands of ARC/WARC files, so if I don't misunderstand, hundreds of thousands of result files will be generated. Was this an issue for WDC when you collected the hyperlink and RDF data?
A file can be split across several WARC records. Who is responsible for combining those records to reconstruct the original file?
Does the Java class org.archive.io.ArchiveReader do this? If not, should developers take care of it themselves?
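For context, this is roughly how I iterate over the records at the moment (a minimal sketch, assuming the webarchive-commons ArchiveReaderFactory; the file name is just an example):

import java.io.File;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

public class RecordDump {
    public static void main(String[] args) throws Exception {
        // Open an ARC or WARC file; the factory picks the right reader.
        ArchiveReader reader = ArchiveReaderFactory.get(new File("example.warc.gz"));
        for (ArchiveRecord record : reader) {
            // Each iteration returns a single record; as far as I can tell,
            // the reader does not merge records that belong to the same original file.
            System.out.println(record.getHeader().getUrl()
                    + " (" + record.getHeader().getLength() + " bytes)");
        }
        reader.close();
    }
}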
Robert, can you help answer the following questions? Thanks.
A file can be split across several WARC records. Who is responsible for combining those records to reconstruct the original file?
Does the Java class org.archive.io.ArchiveReader do this? If not, should developers take care of it themselves?
I use the following code snippet to collect all RDF files:

S3Object[] objects = getStorage().listObjects(resultBucket, "data/", null);
int i = 0;
for (S3Object object : objects) {
    ...
}

If there are many thousands of RDF files, is it possible that collecting all of them fails, and what are the memory and CPU requirements?
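For what it is worth, my current plan is to stream each result object one at a time rather than keeping the contents in memory; a rough sketch, assuming the JetS3t API that the snippet above appears to use (bucket name and credentials are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

public class CollectRdfFiles {
    public static void main(String[] args) throws Exception {
        RestS3Service s3 = new RestS3Service(
                new AWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        S3Bucket bucket = new S3Bucket("my-result-bucket");

        // The listing itself only holds object metadata (keys, sizes),
        // so memory usage should stay modest even for many thousands of files.
        S3Object[] objects = s3.listObjects(bucket, "data/", null);

        for (S3Object summary : objects) {
            // Fetch and stream one object at a time instead of buffering all of them.
            S3Object full = s3.getObject(bucket, summary.getKey());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(full.getDataInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process one line of RDF output at a time
                }
            }
        }
    }
}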
Stephen from CC said: "Each of our crawls will hit a few billion pages, the majority of pages new, but some subset may retrieve previously covered URLs." This is the reason why I thought I should deal with all crawls. Probably I misunderstood. Do you think it is enough to process only the latest crawl?
I think .JS files should be a small percentage of a crawl. I will give it a try. But even at a small percentage, we can expect .JS files to be scattered across all crawl files, so I still need to read all of the ARC/WARC files. Do you know of any example that reads a random part of an ARC/WARC file? Thanks
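To be clear, by "random part" I mean opening a file at a known record offset rather than scanning it from the beginning. Something along these lines is what I am hoping for (a minimal sketch, assuming the webarchive-commons ArchiveReaderFactory; the file name and offset are made up):

import java.io.File;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

public class RandomAccessRead {
    public static void main(String[] args) throws Exception {
        // Open the (W)ARC file positioned at a known record offset,
        // e.g. one taken from an index, instead of at the start.
        long offset = 123456L; // made-up offset, for illustration only
        ArchiveReader reader = ArchiveReaderFactory.get(
                new File("example.warc.gz"), offset);

        // Read just the record that starts at that offset.
        ArchiveRecord record = reader.iterator().next();
        System.out.println(record.getHeader().getUrl());
        reader.close();
    }
}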