Finding set of URLs in Common Crawl Metadata

Sambit Tripathy

Apr 16, 2014, 3:08:21 AM

to common...@googlegroups.com

Hi All,

What is the best way to check whether Common Crawl has already crawled a set of URLs that I have?

Put simply, I have a set of URLs and want to check whether the Common Crawl corpus contains them. One option is to write a MapReduce job and run it on the cluster that holds the metadata, where the key is a URL and the value is a JSON object with metadata attributes. But this is time-consuming, as I have to scan the entire corpus.

Has anyone already done some work on this, or could I possibly do a text lookup?
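For illustration, the filtering step of the job described above could be sketched as follows. This is only a minimal, single-machine sketch: it assumes the metadata can be read as (URL, JSON string) pairs, and the names `find_crawled` and `normalize`, the trailing-slash normalization, and the sample records are all hypothetical, not part of any Common Crawl tooling.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def normalize(url: str) -> str:
    """Hypothetical normalization: trim whitespace and one trailing slash."""
    url = url.strip()
    return url[:-1] if url.endswith("/") else url


def find_crawled(
    records: Iterable[Tuple[str, str]], wanted: Iterable[str]
) -> Iterator[Tuple[str, Dict]]:
    """Yield (url, metadata) for every record whose URL is in the wanted set.

    `records` stands in for the (key, value) pairs a mapper would receive:
    the key is a URL and the value is a JSON string of metadata attributes.
    """
    wanted_set = {normalize(u) for u in wanted}
    for url, metadata_json in records:
        if normalize(url) in wanted_set:
            yield url, json.loads(metadata_json)


# Made-up sample data, only to show the shape of the input and output.
sample_records = [
    ("http://example.com/a", '{"content_type": "text/html"}'),
    ("http://example.com/b/", '{"content_type": "text/plain"}'),
]
my_urls = ["http://example.com/b", "http://example.com/missing"]
hits = dict(find_crawled(sample_records, my_urls))
```

Holding the wanted URLs in an in-memory set keeps each record check constant-time, but the scan over all records is still the full pass over the corpus that makes this approach slow.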

Lisa Green

Apr 17, 2014, 9:42:53 PM

to common...@googlegroups.com

Hi

I think that you will find the URL Search tool very useful: http://urlsearch.commoncrawl.org/ The index and web app were created by Common Crawl volunteer Scott Robinson. You can read Scott's post about it here: http://commoncrawl.org/common-crawl-url-index/ Common Crawl volunteer Aparup Banerjee is working on indices for the most recent crawls, and the web app should be updated with them soon.

On a related note, I am currently working on an index of the metadata that will allow one to find pages based on their tags.

Lisa

Sambit Tripathy

Apr 18, 2014, 1:28:10 AM

to common...@googlegroups.com

Hi Lisa,

I must say you guys are doing a great job.

It makes sense if you can search the corpus.

Good luck with your work; it will be awesome to have tag-based search as well. I think the Yahoo guys have done some work on Glimmer for RDF data.

Regards,
Sambit.

John Wiseman

Apr 18, 2014, 1:22:05 PM

to common...@googlegroups.com

FYI I think there's still a pretty serious issue with the index. See https://github.com/trivio/common_crawl_index/issues/13 and my comment at http://commoncrawl.org/url-search-tool/

The symptom is that a search for "en.wikipedia.org" returns

http://en.wikipedia.org/wiki/1525

http://en.wikipedia.org/wiki/1525_BC

[...]

While a search for "en.wikipedia.org/wiki" returns

http://en.wikipedia.org/wiki/1647_in_literature

http://en.wikipedia.org/wiki/1647_in_music

http://en.wikipedia.org/wiki/1647_in_science

[...]

A search for "en.wikipedia.org/wiki/19" returns no results at all, even though the index actually contains many URLs with that prefix, as you can see by searching for "en.wikipedia.org/wiki/1".
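To make the expected behavior concrete, here is a minimal sketch of a prefix lookup over sorted keys. It says nothing about how the actual index code works; it assumes plain string keys held in an in-memory sorted list, the function name `prefix_search` is made up, and the "1900" key is a hypothetical stand-in for the URLs that the "wiki/19" query should return.

```python
from bisect import bisect_left
from typing import List


def prefix_search(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return every key that starts with `prefix`.

    In a sorted list, keys sharing a prefix form one contiguous run that
    begins at the first key >= prefix, so a binary search finds the start.
    """
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted([
    "en.wikipedia.org/wiki/1525",
    "en.wikipedia.org/wiki/1525_BC",
    "en.wikipedia.org/wiki/1647_in_literature",
    "en.wikipedia.org/wiki/1647_in_music",
    "en.wikipedia.org/wiki/1647_in_science",
    "en.wikipedia.org/wiki/1900",  # hypothetical key, not taken from the index
])

broad = prefix_search(keys, "en.wikipedia.org/wiki/1")
narrow = prefix_search(keys, "en.wikipedia.org/wiki/19")
```

The property to check is that the results for a longer prefix are always a subset of the results for any shorter prefix of it. In the searches above, "wiki/1" returns URLs that begin with "wiki/19" while "wiki/19" returns nothing, which breaks that property.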

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.

Sambit Tripathy

Apr 18, 2014, 1:29:32 PM

to common...@googlegroups.com

It works for me as long as the domains are matching.

Well, yes, it could be an issue if the URLs were not crawled and CC doesn't have the data for them.

Regards
Sambit


John Wiseman

Apr 18, 2014, 1:42:58 PM

to common...@googlegroups.com

The issue is that the index contains the data, but queries that should return it don't.

Sambit Tripathy

Apr 18, 2014, 2:02:49 PM

to common...@googlegroups.com

Are you sure about it?

Is this happening for specific domains or in general?

Regards
Sambit

John Wiseman

Apr 18, 2014, 2:19:21 PM

to common...@googlegroups.com

Sambit, please try the examples in my first message and read https://github.com/trivio/common_crawl_index/issues/13

If you think I missed something, let me know. Otherwise, I believe there's a serious bug in the indexing code that hasn't been addressed and will corrupt the new index that Aparup Banerjee is working on.

jor...@commoncrawl.org

Apr 18, 2014, 3:36:28 PM

to common...@googlegroups.com

John

Thanks for bringing this to our attention. I am looking into it and will report back soon.

Jordan

Tom Morris

Apr 18, 2014, 4:25:48 PM

to common...@googlegroups.com

Independent of the index corruption, isn't it true that the URL index isn't comprehensive? I thought it only covered a subset of the crawl (or am I misremembering?)

Tom
