Finding set of URLs in Common Crawl Metadata

Sambit Tripathy

Apr 16, 2014, 3:08:21 AM
to common...@googlegroups.com
Hi All,


What is the best way to check whether Common Crawl has already crawled a set of URLs that I have?

To simplify: I have a set of URLs and I want to check whether the Common Crawl corpus contains them. One option is to write a MapReduce job and run it on the cluster that holds the metadata, where the key is a URL and the value is a JSON object of metadata attributes. But this is time-consuming, because the job has to scan the entire corpus.
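A minimal sketch of the map-side filter described above, assuming the metadata can be read as (URL, JSON string) pairs. The function name, the sample records, and the `http_status` attribute are hypothetical and only illustrate the idea; they are not the actual Common Crawl metadata schema or reader API.

```python
import json


def find_crawled_urls(records, wanted_urls):
    """Yield (url, metadata) for each record whose URL is in wanted_urls.

    `records` is any iterable of (url, metadata_json) pairs, i.e. the
    key/value shape described above. How the pairs are read from the
    corpus is left out on purpose.
    """
    wanted = set(wanted_urls)  # set membership is O(1) per record
    for url, metadata_json in records:
        if url in wanted:
            yield url, json.loads(metadata_json)


# Hypothetical sample data standing in for the real metadata files.
sample_records = [
    ("http://example.com/", '{"http_status": 200}'),
    ("http://example.org/page", '{"http_status": 404}'),
]
my_urls = ["http://example.com/", "http://example.net/missing"]

found = dict(find_crawled_urls(sample_records, my_urls))
missing = [u for u in my_urls if u not in found]
```

The filter itself is cheap; the cost is the full pass over the corpus, which is why an index lookup is attractive.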

Has anyone already done some work on this, or is there a way to do a simple text lookup?

Lisa Green

Apr 17, 2014, 9:42:53 PM
to common...@googlegroups.com
Hi

I think that you will find the URL Search tool very useful: http://urlsearch.commoncrawl.org/ The index and web app were created by Common Crawl volunteer Scott Robinson. You can read Scott's post about it here: http://commoncrawl.org/common-crawl-url-index/ Common Crawl volunteer Aparup Banerjee is working on indices for the most recent crawls, and the web app should be updated with them soon.

On a related note, I am currently working on an index of the metadata that will allow one to find pages based on their tags. 

Lisa

Sambit Tripathy

Apr 18, 2014, 1:28:10 AM
to common...@googlegroups.com
Hi Lisa,

I must say you guys are doing a great job.

It makes sense if you can search the corpus.

Good luck with your work; it will be awesome to have tag-based search as well. I think the Yahoo guys have done some work on Glimmer for RDF data.




Regards,
Sambit.

John Wiseman

Apr 18, 2014, 1:22:05 PM
to common...@googlegroups.com
FYI I think there's still a pretty serious issue with the index. See https://github.com/trivio/common_crawl_index/issues/13 and my comment at http://commoncrawl.org/url-search-tool/

The symptom is that a search for "en.wikipedia.org" returns

[...]

While a search for "en.wikipedia.org/wiki" returns
[...]

A search for "en.wikipedia.org/wiki/19" returns no results at all but the index actually contains many URLs with that prefix, which you can see by doing a search for "en.wikipedia.org/wiki/1".
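To make the expected behavior concrete, here is a minimal sketch of prefix search over a sorted list of keys. It is not the index's actual implementation; the helper name and the sample URLs are hypothetical. The point is the invariant: every result for a longer prefix must also appear in the results for any shorter prefix of it.

```python
from bisect import bisect_left


def prefix_search(sorted_urls, prefix):
    """Return every entry in sorted_urls that starts with prefix.

    In a sorted list, all strings sharing a prefix are contiguous and
    begin at the first entry that is >= the prefix.
    """
    start = bisect_left(sorted_urls, prefix)
    end = start
    while end < len(sorted_urls) and sorted_urls[end].startswith(prefix):
        end += 1
    return sorted_urls[start:end]


# Hypothetical keys, for illustration only.
urls = sorted([
    "en.wikipedia.org/wiki/1",
    "en.wikipedia.org/wiki/1900",
    "en.wikipedia.org/wiki/1984",
    "en.wikipedia.org/wiki/2001",
])

short_hits = prefix_search(urls, "en.wikipedia.org/wiki/1")
long_hits = prefix_search(urls, "en.wikipedia.org/wiki/19")

# The invariant that the symptom above violates:
assert set(long_hits) <= set(short_hits)
```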




--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

Sambit Tripathy

Apr 18, 2014, 1:29:32 PM
to common...@googlegroups.com

It works for me as long as the domains are matching.

Well, yeah, it could be an issue if the URLs were not crawled and CC doesn't have the data for them.

Regards
Sambit


John Wiseman

Apr 18, 2014, 1:42:58 PM
to common...@googlegroups.com
The issue is that the index contains the data, but queries that should return it don't.

Sambit Tripathy

Apr 18, 2014, 2:02:49 PM
to common...@googlegroups.com

Are you sure about it?

Is this happening for specific domains or in general?

Regards
Sambit

John Wiseman

Apr 18, 2014, 2:19:21 PM
to common...@googlegroups.com
Sambit, please try the examples in my first message and read https://github.com/trivio/common_crawl_index/issues/13

If you think I missed something, let me know.  Otherwise, I believe there's a serious bug in the indexing code that hasn't been addressed and will corrupt the new index that Aparup Banerjee is working on.

jor...@commoncrawl.org

Apr 18, 2014, 3:36:28 PM
to common...@googlegroups.com
John

Thanks for bringing this to our attention. I am looking into it and will report back soon.

Jordan

Tom Morris

Apr 18, 2014, 4:25:48 PM
to common...@googlegroups.com
Independent of the index corruption, isn't it true that the URL index isn't comprehensive?  I thought it only covered a subset of the crawl (or am I misremembering?)

Tom