Thanks very much Monika! I’ll try it out.
Cheers,
Anthony
From: Monika C. Mevenkamp [mailto:mon...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspac...@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Anthony
After a bit of investigation, it turns out that a significant portion of our item stats come from spiders. Any thoughts on the best way to go about removing them from Solr retroactively? There’s nothing that I can see in the code that will do this by domain or agent, only by IP. We’re not excited at the prospect of pulling out the IPs of all the spiders in order to run “stats-util -i” effectively.
Cheers,
Anthony
From: Monika C. Mevenkamp [mailto:mon...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspac...@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Anthony
[…] run the [dspace]/bin/dspace stats-util command on a regular basis. You definitely need to run it to prune marked usage events after you configure […]

Hi again,
Unfortunately, the documentation for the stats-util command is incorrect. Specifically this line:
-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS name, or Agent name. Will prune out all records that match spider identification patterns.
Running “stats-util -i” does not actually remove spiders by DNS name or Agent name. Here are the relevant sections of the code, from StatisticsClient.java and SolrLogger.java:
(…)
else if (line.hasOption('i'))
{
    SolrLogger.deleteRobotsByIP();
}

public static void deleteRobotsByIP()
{
    for (String ip : SpiderDetector.getSpiderIpAddresses()) {
        deleteIP(ip);
    }
}
What this means is that, if a spider is in your Solr stats, there’s no way to remove it other than manually adding its IP to [dspace]/config/spiders; adding its DNS name or Agent name to the configs will not expunge it. Updating the spider files with “stats-util -u” does little to help because the IP lists it pulls from are out of date.
An example is the spider from the Bing search engine: bingbot. As of DSpace 4.3, it’s not in the list of spiders by DNS name or Agent name, nor is it in the list of spider IP addresses. So anyone running DSpace 4.3 likely has usage stats inflated by visits from this spider. The only way to remove it is to specify all the IPs for bingbot. Multiply that by all the other “new” spiders and we’re talking about a lot of work.
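For what it’s worth, the IP-based workaround looks roughly like the following. This is only a sketch: the file name and the address prefixes are illustrative assumptions (I use a temp directory so the example is self-contained; in a real install the files live in [dspace]/config/spiders), and as far as I can tell DSpace treats a partial IP on a line as a range.

```shell
# Sketch: add a local spider list alongside the iplists.com-style files.
# A temp dir stands in for [dspace]/config/spiders so this is runnable
# anywhere; the prefixes below are illustrative, not a vetted bingbot list.
SPIDERS=$(mktemp -d)
# One IP (or partial-IP range) per line, matching the iplists.com format
cat >> "$SPIDERS/local-bingbot.txt" <<'EOF'
157.55.39
207.46.13
EOF
cat "$SPIDERS/local-bingbot.txt"
# Then re-run the pruning pass against the real config dir:
#   [dspace]/bin/dspace stats-util -i
```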
I tried briefly to modify the code to take domains/agents into account when marking or deleting spiders, but I wasn’t able to figure out how to query Solr with regex patterns. It’s easier to do with IPs because each IP or IP range is transformed into a String and used as a standard query parameter.
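For the record, Solr (4.0 and later) does support regular-expression term queries on string fields with the /pattern/ syntax, so in principle a delete-by-query could target agents or DNS names directly. A sketch of what such a request might look like follows; the commands are printed rather than executed, and the core URL and the userAgent field name are my assumptions about the statistics schema, so verify both (and back up the core) before deleting anything.

```shell
# Sketch: build, and print rather than send, a delete-by-query that
# would remove hits whose user agent matches a regex. Core URL and the
# userAgent field name are assumptions; check your statistics schema.
SOLR="http://localhost:8080/solr/statistics"
QUERY='userAgent:/.*[Bb]ingbot.*/'
PAYLOAD="<delete><query>${QUERY}</query></delete>"
echo curl "$SOLR/update?commit=true" \
     -H "Content-Type: text/xml" --data-binary "$PAYLOAD"
```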
Thanks for the info, Hardy. I just discovered that Jira issue yesterday. I’ll probably use your approach for our own stats, but I’m sure other sites would benefit from domain/agent handling when running “stats-util -m” or “stats-util -i” (as described in the issue).
Best,
Anthony
From: Pottinger, Hardy J. [mailto:Potti...@missouri.edu]
Sent: Friday, May 15, 2015 10:19 AM
To: Anthony Petryk; Monika C. Mevenkamp; dspac...@lists.sourceforge.net
Subject: RE: [Dspace-tech] spider ip recognition
Hi, you've run into a known issue, and one I very recently wrestled with myself:
https://jira.duraspace.org/browse/DS-2431
See my last comment on that ticket: I found a way around the issue by simply deleting the spider docs from the stats index via a query in the Solr admin interface.
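(For anyone following along, the same delete can be done over plain HTTP instead of the admin UI. A sketch, with the commands printed rather than executed: the core URL and the isBot field name are assumptions about the statistics schema, and it’s prudent to preview the match count before deleting.)

```shell
# Sketch: preview, then delete, docs already flagged as bots.
# Core URL and the isBot field name are assumptions; commands are
# printed so nothing touches a live index.
SOLR="http://localhost:8080/solr/statistics"
Q='isBot:true'
# 1) Preview how many docs the query matches
echo curl "$SOLR/select?q=$Q&rows=0&wt=json"
# 2) Delete them (committing immediately)
echo curl "$SOLR/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>$Q</query></delete>"
```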
--Hardy
Hi Susan,
The one that DSpace uses is http://iplists.com. It was last updated two years ago. I haven’t come across another one myself, at least not in such an easy-to-use format. We’ve taken to periodically removing the main offenders by hand (facet your Solr query by IP; the top results will likely be bots). A more up-to-date list would be welcome indeed!
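That faceting step looks something like the following (printed as a sketch; the core URL and the ip field name are assumptions about the statistics schema):

```shell
# Sketch: facet the statistics core by IP to surface the heaviest
# clients; the top buckets are usually crawlers. Printed, not executed.
SOLR="http://localhost:8080/solr/statistics"
PARAMS="q=*:*&rows=0&facet=true&facet.field=ip&facet.limit=25&wt=json"
echo curl "$SOLR/select?$PARAMS"
```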
Anthony