Duplicate URLs with similar content


jayash...@gmail.com

Mar 11, 2016, 5:19:50 AM
to Common Crawl
Hi,

I am looking for web pages that have similar content but are accessed through different URLs.
Is there a way to extract such pages from Common Crawl?

Thanks.
Jayashri

Tom Morris

Mar 11, 2016, 11:21:07 AM
to common...@googlegroups.com
On Fri, Mar 11, 2016 at 5:19 AM, <jayash...@gmail.com> wrote:

I am looking for web pages that have similar content but are accessed through different URLs.
Is there a way to extract such pages from Common Crawl?

Sure. There are a variety of ways of doing this, but the C4Corpus tools which were mentioned recently implement one scheme:

https://github.com/dkpro/dkpro-c4corpus

They use it the other way around, to eliminate duplicates and near-duplicates, but the hard part is finding the clusters in the first place.

Tom

Greg Lindahl

Mar 11, 2016, 12:32:52 PM
to common...@googlegroups.com
On Fri, Mar 11, 2016 at 11:21:04AM -0500, Tom Morris wrote:
> Sure. There are a variety of ways of doing this, but the C4Corpus tools
> which were mentioned recently implement one scheme.
>
> https://github.com/dkpro/dkpro-c4corpus
>
> They use it the other way around, to eliminate duplicates and
> near-duplicates, but the hard part is finding the clusters in the first
> place.

Another interesting case is exact duplicates, which can be inexpensively determined by examining the CDX index checksum. Sylvain Zimmer of CommonSearch suggested to me that exact duplicates might be a good way to figure out which CGI arguments don't affect content (and are probably just there for analytics purposes).

I tried this out on a day of NYT articles and it worked great; here's a list of identical article groups. I'm sure this won't work for all sites, but it's a nice start!

com,nytimes)/2016/01/01/us/banished-words-lake-superior-state-university.html
com,nytimes)/2016/01/01/us/banished-words-lake-superior-state-university.html?ref=education

com,nytimes)/2016/01/01/arts/music/pop-rock-cabaret-listings-for-jan-1-7.html?ref=arts
com,nytimes)/2016/01/01/arts/music/pop-rock-cabaret-listings-for-jan-1-7.html

com,nytimes)/2016/01/01/technology/microsoft-to-notify-users-of-government-hackings.html
com,nytimes)/2016/01/01/technology/microsoft-to-notify-users-of-government-hackings.html?src=me

com,nytimes)/2016/01/01/opinion/no-more-statutes-of-limitations-for-rape.html
com,nytimes)/2016/01/01/opinion/no-more-statutes-of-limitations-for-rape.html?_r=0&emc=edit_th_20160101&nl=todaysheadlines&nlid=58599836

com,nytimes)/2016/01/01/business/media/bbc-websites-said-to-be-target-of-online-attack.html
com,nytimes)/2016/01/01/business/media/bbc-websites-said-to-be-target-of-online-attack.html?ref=international

com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=mv
com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=me

com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html
com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html?ref=international

com,nytimes)/2016/01/01/opinion/privilege-pathology-and-power.html?src=me
com,nytimes)/2016/01/01/opinion/privilege-pathology-and-power.html?src=mv
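
A rough Java sketch of this kind of grouping (illustrative only, not the exact script behind the list above; it assumes a locally downloaded, uncompressed CDX file whose name is just a placeholder, and a naive regex for the digest field):

// Sketch: group CDX entries by content digest to find exact duplicates.
// The file path is a placeholder; each CDX line is assumed to look like
// "com,nytimes)/... 20160105123456 {..."digest": "SHA1HASH"...}".
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CdxExactDuplicates {
    private static final Pattern DIGEST = Pattern.compile("\"digest\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws IOException {
        String cdxPath = args.length > 0 ? args[0] : "cdx-sample.cdx"; // placeholder file name
        Map<String, List<String>> urlsByDigest = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(cdxPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String urlKey = line.split(" ", 3)[0]; // SURT-form URL key
                Matcher m = DIGEST.matcher(line);
                if (m.find()) {
                    urlsByDigest.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(urlKey);
                }
            }
        }
        // Print only the groups where one checksum maps to more than one URL.
        for (List<String> urls : urlsByDigest.values()) {
            if (urls.size() > 1) {
                System.out.println(String.join("\n", urls) + "\n");
            }
        }
    }
}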

Ivan Habernal

Mar 11, 2016, 4:51:49 PM
to Common Crawl
Hi Jayashri,

As Tom mentioned, this is exactly what we do in Phase 2 of the C4Corpus preprocessing of Common Crawl. In the first phase, we compute the SimHash for each document from Common Crawl and store it in the metadata. In the second phase, we use the SimHash as the reducer key and retain only one occurrence of each WARC record. You can easily modify the reducer to give you the exactly matching pages; have a look here:

https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-hadoop/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/hadoop/full/Phase2ExactMatchDeDuplication.java
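
As an illustration only (this is not the actual Hadoop reducer; the record IDs and the in-memory map below are stand-ins for the SimHash metadata computed in Phase 1), the grouping idea boils down to something like:

// Simplified local sketch of the Phase 2 idea: records sharing a SimHash key
// form an exact-duplicate group; the real job keeps one WARC record per group.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExactMatchGroups {

    // Group record IDs by their SimHash value.
    public static Map<Long, List<String>> groupBySimHash(Map<String, Long> simHashByRecordId) {
        Map<Long, List<String>> groups = new HashMap<>();
        simHashByRecordId.forEach((recordId, simHash) ->
                groups.computeIfAbsent(simHash, k -> new ArrayList<>()).add(recordId));
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical record IDs and SimHash values standing in for Phase 1 output.
        Map<String, Long> simHashByRecordId = new HashMap<>();
        simHashByRecordId.put("warc-record-1", 0x1234abcdL);
        simHashByRecordId.put("warc-record-2", 0x1234abcdL); // exact duplicate of record 1
        simHashByRecordId.put("warc-record-3", 0x9999ffffL);

        groupBySimHash(simHashByRecordId).forEach((simHash, records) -> {
            if (records.size() > 1) {
                System.out.println("Exact-duplicate group " + Long.toHexString(simHash) + ": " + records);
            }
        });
    }
}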

Best,

Ivan

On Friday, March 11, 2016 at 18:32:52 UTC+1, Greg Lindahl wrote:

jayash...@gmail.com

Mar 14, 2016, 1:42:08 AM
to Common Crawl
Thanks, Tom. But I was assuming that Common Crawl must contain duplicate content. Doesn't it?
Any idea where we can find such clusters?

Thanks,
Jayashri

jayash...@gmail.com

Mar 14, 2016, 2:14:54 AM
to Common Crawl
Thanks, Ivan. Looks good. I shall give it a try. Can dkpro-c4corpus-deduplication work on a single WARC file downloaded locally?

Jayashri

Ivan Habernal

Mar 14, 2016, 4:55:31 AM
to Common Crawl
Hi Jayashri,

Yes, all the functionality is implemented so that it can run independently of Hadoop. However, we split the de-duplication into two parts. In the first one, only _exact_ duplicates are removed (those entries with matching SimHash); for this you only need to compute the SimHash for a given text (public static long getSimHash(String text) in ParallelDocumentDeDuplication) and do the removal yourself.

For near duplicates, we first collect candidate pairs according to the Hamming distance of their SimHash values and then perform de-duplication in local "clusters" (this is an approximation of global near-duplicate removal, which is NP-hard if you want to stick to certain criteria, e.g., keeping only the larger document from a near-duplicate pair). This part can also be run without Hadoop, but we split it into several phases with intermediate outputs. I guess all you need is to extract the functionality from Phase3Step1ExtractNearDupInfo, Phase3Step2DistinctDataJob, Phase3Step3NearDupTuplesCreation, and Phase3Step4LocalDeDuplication.
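
A minimal sketch of what that could look like outside Hadoop (the getSimHash signature is the one quoted above; the import path and the Hamming-distance threshold of 3 are assumptions, so please check the repository for the actual package and the value used in Phase 3):

// Sketch only: exact vs. near-duplicate check using the SimHash utility.
// NOTE: the package below is a guess; adjust it to wherever
// ParallelDocumentDeDuplication lives in dkpro-c4corpus-deduplication.
import de.tudarmstadt.ukp.dkpro.c4corpus.deduplication.impl.ParallelDocumentDeDuplication;

public class LocalDeDupSketch {

    // Hamming distance between two 64-bit SimHash values.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long h1 = ParallelDocumentDeDuplication.getSimHash("First document text ...");
        long h2 = ParallelDocumentDeDuplication.getSimHash("First document text, slightly edited ...");

        if (h1 == h2) {
            System.out.println("Exact duplicates (identical SimHash)");
        } else if (hammingDistance(h1, h2) <= 3) { // assumed threshold
            System.out.println("Near-duplicate candidates");
        } else {
            System.out.println("Distinct documents");
        }
    }
}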

Let me know if it works.

Best,

Ivan