On Fri, Mar 11, 2016 at 11:21:04AM -0500, Tom Morris wrote:
> Sure. There are a variety of ways of doing this, but the C4Corpus tools
> which were mentioned recently implement one scheme.
>
> https://github.com/dkpro/dkpro-c4corpus
>
> They use it the other way around, to eliminate duplicates and
> near-duplicates, but the hard part is finding the clusters in the first
> place.

Another interesting case is exact duplicates, which can be found
cheaply by comparing the content checksum (digest) recorded in the
CDX index. Sylvain Zimmer of CommonSearch suggested to me that exact
duplicates might be a good way to figure out which CGI arguments
don't affect content (and are probably just for analytics purposes).

I tried this out on a day of NYT articles and it worked great. Here's
a list of groups of URLs that returned identical content. I'm sure
this won't work for all sites, but it's a nice start!

com,nytimes)/2016/01/01/us/banished-words-lake-superior-state-university.html
com,nytimes)/2016/01/01/us/banished-words-lake-superior-state-university.html?ref=education
com,nytimes)/2016/01/01/arts/music/pop-rock-cabaret-listings-for-jan-1-7.html?ref=arts
com,nytimes)/2016/01/01/arts/music/pop-rock-cabaret-listings-for-jan-1-7.html
com,nytimes)/2016/01/01/technology/microsoft-to-notify-users-of-government-hackings.html
com,nytimes)/2016/01/01/technology/microsoft-to-notify-users-of-government-hackings.html?src=me
com,nytimes)/2016/01/01/opinion/no-more-statutes-of-limitations-for-rape.html
com,nytimes)/2016/01/01/opinion/no-more-statutes-of-limitations-for-rape.html?_r=0&emc=edit_th_20160101&nl=todaysheadlines&nlid=58599836
com,nytimes)/2016/01/01/business/media/bbc-websites-said-to-be-target-of-online-attack.html
com,nytimes)/2016/01/01/business/media/bbc-websites-said-to-be-target-of-online-attack.html?ref=international
com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=mv
com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=me
com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html
com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html?ref=international
com,nytimes)/2016/01/01/opinion/privilege-pathology-and-power.html?src=me
com,nytimes)/2016/01/01/opinion/privilege-pathology-and-power.html?src=mv
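
For anyone who wants to try the same thing, here is a minimal sketch of the
grouping described above. It is not the script used to produce the list. It
assumes the CDX records have already been reduced to (urlkey, digest) pairs;
the function names and the digest values in the example are made up for
illustration. It only compares URLs that share the same path, and a parameter
it reports is just a candidate, not proof that the parameter never matters.

```python
from collections import defaultdict
from urllib.parse import parse_qsl


def group_by_digest(records):
    """Group URL keys by content digest, keeping groups with 2+ distinct URLs.

    `records` is an iterable of (urlkey, digest) pairs (an assumed input shape).
    """
    groups = defaultdict(set)
    for urlkey, digest in records:
        groups[digest].add(urlkey)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def candidate_params(group):
    """Return query parameter names that vary within one duplicate group.

    Only URLs with the same path are compared. A parameter counts as varying
    if it is missing from some URLs or takes different values.
    """
    by_path = defaultdict(list)
    for urlkey in group:
        path, _, query = urlkey.partition("?")
        by_path[path].append(dict(parse_qsl(query, keep_blank_values=True)))

    candidates = set()
    for param_dicts in by_path.values():
        names = set().union(*param_dicts)
        for name in names:
            # None stands for "parameter absent from this URL".
            if len({params.get(name) for params in param_dicts}) > 1:
                candidates.add(name)
    return candidates


# Example input: URL keys from the list above, with placeholder digests.
records = [
    ("com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html", "DIGEST-A"),
    ("com,nytimes)/2016/01/01/opinion/girls-in-japans-war-brothels.html?ref=international", "DIGEST-A"),
    ("com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=mv", "DIGEST-B"),
    ("com,nytimes)/2016/01/01/arts/television/downton-abbey-season-6-crawleys-review.html?src=me", "DIGEST-B"),
    ("com,nytimes)/2016/01/01/opinion/privilege-pathology-and-power.html?src=me", "DIGEST-C"),
]

groups = group_by_digest(records)
found = set()
for group in groups:
    found |= candidate_params(group)
print(sorted(found))  # prints ['ref', 'src']
```

A sensible next step would be to count how often each candidate shows up
across many groups before stripping it during URL normalization.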