Marking robots by user agent with `stats-util -m` is not implemented

Alan Orth

Nov 4, 2019, 4:10:40 PM
to DSpace Technical Support
Dear list,

The DSpace 5.x (and presumably 6.x) documentation[0] suggests that it is possible to mark existing Solr statistics records as being bots or spiders using the following command:

$ dspace stats-util -m

After testing this for a while with an updated list of user agents[1], I realized that the feature is only implemented for IPs. As it stands, the code in StatisticsClient.java marks robots based only on their IPs, not on their user agents or domains:

else if (line.hasOption('m'))
{
    SolrLogger.markRobotsByIP();
}

Strangely enough, SolrLogger has a markRobotByUserAgent() function that is never called anywhere in the Java code base. It also appears to be only partially implemented, as it does not iterate over agents.

Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.

Regards,

[1] https://github.com/atmire/COUNTER-Robots
--
Alan Orth
alan...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche

Mark H. Wood

Nov 5, 2019, 9:02:15 AM
to DSpace Technical Support
On Mon, Nov 04, 2019 at 11:10:25PM +0200, Alan Orth wrote:
> Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.

https://jira.duraspace.org/browse/DS-2431

There are several Issues related to completing the work on extended
spider marking and filtering.

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Alan Orth

Nov 7, 2019, 8:55:51 AM
to DSpace Technical Support
Thank you, Mark. For now I'll just settle for an updated list of spider agents from COUNTER-Robots¹ (dropping the text file into dspace/config/spiders/agents seems to work).

Regards,


--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-tech/20191105140039.GA30402%40IUPUI.Edu.

Alan Orth

Nov 10, 2019, 11:12:23 AM
to DSpace Technical Support
Dear list,

I ended up writing a little bash script¹ to read known spider user agents from a file such as DSpace's `example` pattern file and check for matching documents in the Solr statistics core (or yearly statistics shards). It can optionally purge the matched records, but this is disabled by default. In our case, I purged 2 MILLION hits from our statistics core, which has data going back nine years. It feels nice to know that our usage statistics are more accurate now, though the repository managers will be depressed because their content wasn't as popular as they thought. :)
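
As an illustration of the kind of check described above (not the script's actual code, which is written in bash), here is a minimal Python sketch that builds a Solr query URL to count statistics documents for one user agent. The core name `statistics`, the field name `userAgent`, the JSON response format, and the helper names `build_count_url` and `num_found` are all assumptions made for this example; agents containing Solr query syntax characters would need escaping, which is not handled here.

```python
import json
from urllib.parse import urlencode


def build_count_url(solr_base, agent, core="statistics", field="userAgent"):
    """Build a Solr select URL that counts documents whose user agent contains `agent`.

    The core and field names are assumptions for this sketch.
    """
    params = {
        "q": f"{field}:*{agent}*",  # wildcard match anywhere in the field
        "rows": 0,                   # only the hit count is needed, not the documents
        "wt": "json",                # ask Solr for a JSON response
    }
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"


def num_found(response_text):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]
```

Requesting `rows=0` keeps the response small, since only the `numFound` value matters for a dry run.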

To use the script you need to be able to access your DSpace's Solr instance directly, either by running the script on the same machine or by making the port available via an SSH tunnel:

$ ssh -L 8080:localhost:8080 dspace.example.edu

Then you can run the script, specifying the location of the Solr instance and the location of the patterns file:

$ ./check-spider-hits.sh -u http://localhost:8080/solr -f ~/dspace/config/spiders/agents/example

Read the script source or check its help text with `-h` to see more options. One implementation detail is interesting: DSpace uses the spider agents file from the COUNTER-Robots project², which contains some plaintext names as well as regular expressions. Unfortunately, Solr 4.x, as used in current DSpace 5 and 6, only has basic support for regular expressions. For example, all patterns are anchored with ^ and $ by default, and you need to use [0-9] instead of \d. As such, my script does some basic filtering of the input pattern file to remove user agents that use regular expression characters. I imagine this is part of the reason why DSpace's mark spider feature was never completed for user agents: the example agents file used by SpiderDetector.java cannot be used as-is when searching Solr later to mark spiders.
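
The filtering step described above could look roughly like the following Python sketch. It is not taken from the script; the set of characters treated as regular expression syntax and the function name `plain_agents` are assumptions chosen for illustration.

```python
import re

# Characters treated here as regular expression syntax (an assumed set,
# not necessarily the one the bash script uses).
REGEX_CHARS = re.compile(r"[\\^$.|?*+()\[\]{}]")


def plain_agents(lines):
    """Return only the patterns that contain no regular expression characters."""
    agents = []
    for line in lines:
        pattern = line.strip()
        if not pattern:
            continue  # skip blank lines
        if REGEX_CHARS.search(pattern):
            continue  # skip anything that looks like a regular expression
        agents.append(pattern)
    return agents
```

Plain names that survive this filter can be used directly in a wildcard query, at the cost of missing every agent that the pattern file only describes with a regular expression.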

I hope this is helpful for someone. Thanks to the contributors of the COUNTER-Robots project for curating this list.

Regards,

Paul Münch

Nov 11, 2019, 3:45:24 AM
to dspac...@googlegroups.com
Hello everybody,

Is there something like an OAI log where we can check at what time or on
what date our DSpace was harvested? I looked into the DSpace logs and
the '[dspace-dir]/var/oai/request' directory but didn't find what we are
looking for.

Thank you in advance and kind regards,

Paul Münch


Alan Orth

Nov 12, 2019, 7:11:26 AM
to Fabricio Costa, DSpace Technical Support
Dear Fabricio,

Thank you for trying the script! It sounds like something is wrong with the Solr query parameters, causing the XML result to be malformed (it is parsed with xmllint in the script). I've added some additional logic checks and a debug option to the script. Please get a new copy of the script¹ and try again with the "-d" option to see if you can narrow the issue down:

$ ./check-spider-hits.sh -d -u http://localhost:8080/solr -f ~/dspace/config/spiders/agents/example
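
For readers who want to see what such a check might involve, here is a hedged Python sketch (the script itself uses bash and xmllint, so this is only an analogy). It assumes Solr's standard XML response layout, with a `result` element carrying a `numFound` attribute, and returns `None` instead of failing when the body is empty or malformed. The function name `parse_num_found` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_num_found(xml_text):
    """Return numFound from a Solr XML response, or None if it cannot be read."""
    if not xml_text or not xml_text.strip():
        return None  # empty body: nothing to parse, the request probably failed
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # malformed XML
    result = root.find("result")
    if result is None or result.get("numFound") is None:
        return None  # well-formed XML, but not the expected Solr response
    return int(result.get("numFound"))
```

An empty response body is consistent with the "Document is empty" parser error quoted below, so checking for it before parsing makes the failure easier to diagnose.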

Regards,


On Tue, Nov 12, 2019 at 6:26 AM Fabricio Costa <briz...@gmail.com> wrote:
Hello, Alan.

I tried the bash script and received the following message (several times).

-:1: parser error : Document is empty


Sean Carte

Nov 12, 2019, 7:50:14 AM
to Alan Orth, DSpace Technical Support
Thanks, Alan!

Total number of bot hits purged: 575004

One thing I found curious is that I first ran it with -pno -d, then -pyes and got a different result each time:

dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pno -d
(DEBUG) Using spiders pattern file: /dspacecris-dut/config/spiders/agents/example
(DEBUG) Checking for hits from spider: AllenTrack
(DEBUG) Checking for hits from spider: Arachmo
(DEBUG) Checking for hits from spider: ContentSmartz
(DEBUG) Checking for hits from spider: DSurf
(DEBUG) Checking for hits from spider: EmailSiphon
(DEBUG) Checking for hits from spider: EmailWolf
(DEBUG) Checking for hits from spider: GetRight
(DEBUG) Checking for hits from spider: Googlebot
Found 325498 hits from Googlebot in statistics
(DEBUG) Checking for hits from spider: HTTrack
Found 1366 hits from HTTrack in statistics
(DEBUG) Checking for hits from spider: LOCKSS
(DEBUG) Checking for hits from spider: MSNBot
(DEBUG) Checking for hits from spider: Milbot
(DEBUG) Checking for hits from spider: MuscatFerre
(DEBUG) Checking for hits from spider: NABOT
(DEBUG) Checking for hits from spider: NaverBot
(DEBUG) Checking for hits from spider: OurBrowser
(DEBUG) Checking for hits from spider: Readpaper
(DEBUG) Checking for hits from spider: Strider
Found 1 hits from Strider in statistics
(DEBUG) Checking for hits from spider: Teoma
Found 2 hits from Teoma in statistics
(DEBUG) Checking for hits from spider: Wanadoo
Found 7 hits from Wanadoo in statistics
(DEBUG) Checking for hits from spider: WebCloner
(DEBUG) Checking for hits from spider: WebCopier
(DEBUG) Checking for hits from spider: WebReaper
(DEBUG) Checking for hits from spider: WebStripper
(DEBUG) Checking for hits from spider: WebZIP
(DEBUG) Checking for hits from spider: Webinator
(DEBUG) Checking for hits from spider: Webmetrics
(DEBUG) Checking for hits from spider: Wget
Found 170 hits from Wget in statistics
(DEBUG) Checking for hits from spider: alexa
Found 238 hits from alexa in statistics
(DEBUG) Checking for hits from spider: almaden
(DEBUG) Checking for hits from spider: appie
(DEBUG) Checking for hits from spider: architext
(DEBUG) Checking for hits from spider: arks
Found 18 hits from arks in statistics
(DEBUG) Checking for hits from spider: asterias
(DEBUG) Checking for hits from spider: atomz
(DEBUG) Checking for hits from spider: autoemailspider
(DEBUG) Checking for hits from spider: awbot
(DEBUG) Checking for hits from spider: baiduspider
(DEBUG) Checking for hits from spider: bbot
(DEBUG) Checking for hits from spider: biadu
(DEBUG) Checking for hits from spider: biglotron
(DEBUG) Checking for hits from spider: bjaaland
(DEBUG) Checking for hits from spider: bloglines
(DEBUG) Checking for hits from spider: blogpulse
(DEBUG) Checking for hits from spider: bot
Found 520514 hits from bot in statistics
(DEBUG) Checking for hits from spider: bspider
Found 72 hits from bspider in statistics
(DEBUG) Checking for hits from spider: bwh3_user_agent
(DEBUG) Checking for hits from spider: celestial
(DEBUG) Checking for hits from spider: cfnetwork|checkbot
(DEBUG) Solr query returned HTTP 400, skipping cfnetwork|checkbot.
(DEBUG) Checking for hits from spider: combine
(DEBUG) Checking for hits from spider: contentmatch
(DEBUG) Checking for hits from spider: core
(DEBUG) Checking for hits from spider: crawl
Found 15205 hits from crawl in statistics
(DEBUG) Checking for hits from spider: crawler
Found 15191 hits from crawler in statistics
(DEBUG) Checking for hits from spider: cursor
(DEBUG) Checking for hits from spider: custo
Found 4 hits from custo in statistics
(DEBUG) Checking for hits from spider: daumoa
(DEBUG) Checking for hits from spider: docomo
(DEBUG) Checking for hits from spider: dtSearchSpider
(DEBUG) Checking for hits from spider: dumbot
(DEBUG) Checking for hits from spider: easydl
(DEBUG) Checking for hits from spider: exabot
Found 133 hits from exabot in statistics
(DEBUG) Checking for hits from spider: fast-webcrawler
(DEBUG) Checking for hits from spider: favorg
(DEBUG) Checking for hits from spider: feedburner
(DEBUG) Checking for hits from spider: ferret
(DEBUG) Checking for hits from spider: findlinks
Found 10626 hits from findlinks in statistics
(DEBUG) Checking for hits from spider: gaisbot
(DEBUG) Checking for hits from spider: geturl
(DEBUG) Checking for hits from spider: gigabot
(DEBUG) Checking for hits from spider: girafabot
(DEBUG) Checking for hits from spider: gnodspider
(DEBUG) Checking for hits from spider: google
Found 327642 hits from google in statistics
(DEBUG) Checking for hits from spider: grub
(DEBUG) Checking for hits from spider: gulliver
(DEBUG) Checking for hits from spider: harvest
(DEBUG) Checking for hits from spider: heritrix
Found 765 hits from heritrix in statistics
(DEBUG) Checking for hits from spider: hl_ftien_spider
(DEBUG) Checking for hits from spider: holmes
(DEBUG) Checking for hits from spider: htdig
(DEBUG) Checking for hits from spider: htmlparser
(DEBUG) Checking for hits from spider: httrack
(DEBUG) Checking for hits from spider: iSiloX
(DEBUG) Checking for hits from spider: ia_archiver
Found 243 hits from ia_archiver in statistics
(DEBUG) Checking for hits from spider: ichiro
Found 1153 hits from ichiro in statistics
(DEBUG) Checking for hits from spider: iktomi
(DEBUG) Checking for hits from spider: ilse
(DEBUG) Checking for hits from spider: internetseer
(DEBUG) Checking for hits from spider: intute
(DEBUG) Checking for hits from spider: java
Found 2 hits from java in statistics
(DEBUG) Checking for hits from spider: jeeves
(DEBUG) Checking for hits from spider: jobo
(DEBUG) Checking for hits from spider: kyluka
(DEBUG) Checking for hits from spider: larbin
(DEBUG) Checking for hits from spider: libwww
Found 113 hits from libwww in statistics
(DEBUG) Checking for hits from spider: lilina
(DEBUG) Checking for hits from spider: linkbot
(DEBUG) Checking for hits from spider: linkcheck
(DEBUG) Checking for hits from spider: linkchecker
(DEBUG) Checking for hits from spider: linkscan
(DEBUG) Checking for hits from spider: linkwalker
(DEBUG) Checking for hits from spider: lmspider
(DEBUG) Checking for hits from spider: lwp
(DEBUG) Checking for hits from spider: megite
(DEBUG) Checking for hits from spider: milbot
(DEBUG) Checking for hits from spider: mimas
(DEBUG) Checking for hits from spider: mj12bot
(DEBUG) Checking for hits from spider: mnogosearch
(DEBUG) Checking for hits from spider: moget
(DEBUG) Checking for hits from spider: mojeekbot
(DEBUG) Checking for hits from spider: momspider
(DEBUG) Checking for hits from spider: motor
Found 8 hits from motor in statistics
(DEBUG) Checking for hits from spider: msiecrawler
(DEBUG) Checking for hits from spider: msnbot
Found 8993 hits from msnbot in statistics
(DEBUG) Checking for hits from spider: myweb
(DEBUG) Checking for hits from spider: nagios
(DEBUG) Checking for hits from spider: netcraft
(DEBUG) Checking for hits from spider: netluchs
(DEBUG) Checking for hits from spider: no_user_agent
(DEBUG) Checking for hits from spider: nomad
(DEBUG) Checking for hits from spider: nutch
Found 68 hits from nutch in statistics
(DEBUG) Checking for hits from spider: ocelli
(DEBUG) Checking for hits from spider: onetszukaj
(DEBUG) Checking for hits from spider: perman
(DEBUG) Checking for hits from spider: pioneer
(DEBUG) Checking for hits from spider: powermarks
(DEBUG) Checking for hits from spider: psbot
Found 3 hits from psbot in statistics
(DEBUG) Checking for hits from spider: python
Found 1 hits from python in statistics
(DEBUG) Checking for hits from spider: qihoobot
(DEBUG) Checking for hits from spider: rambler
(DEBUG) Checking for hits from spider: redalert|robozilla
(DEBUG) Solr query returned HTTP 400, skipping redalert|robozilla.
(DEBUG) Checking for hits from spider: robot
Found 56183 hits from robot in statistics
(DEBUG) Checking for hits from spider: robots
Found 43145 hits from robots in statistics
(DEBUG) Checking for hits from spider: rss
(DEBUG) Checking for hits from spider: scan4mail
(DEBUG) Checking for hits from spider: scientificcommons
(DEBUG) Checking for hits from spider: scirus
(DEBUG) Checking for hits from spider: scooter
(DEBUG) Checking for hits from spider: seekbot
(DEBUG) Checking for hits from spider: seznambot
(DEBUG) Checking for hits from spider: shoutcast
(DEBUG) Checking for hits from spider: slurp
Found 104 hits from slurp in statistics
(DEBUG) Checking for hits from spider: sogou
Found 2178 hits from sogou in statistics
(DEBUG) Checking for hits from spider: speedy
Found 139 hits from speedy in statistics
(DEBUG) Checking for hits from spider: spider
Found 23341 hits from spider in statistics
(DEBUG) Checking for hits from spider: spiderman
(DEBUG) Checking for hits from spider: spiderview
(DEBUG) Checking for hits from spider: sunrise
(DEBUG) Checking for hits from spider: superbot
(DEBUG) Checking for hits from spider: surveybot
(DEBUG) Checking for hits from spider: tailrank
(DEBUG) Checking for hits from spider: technoratibot
(DEBUG) Checking for hits from spider: titan
(DEBUG) Checking for hits from spider: turnitinbot
(DEBUG) Checking for hits from spider: twiceler
(DEBUG) Checking for hits from spider: ucsd
(DEBUG) Checking for hits from spider: ultraseek
(DEBUG) Checking for hits from spider: urlaliasbuilder
(DEBUG) Checking for hits from spider: urllib
Found 66 hits from urllib in statistics
(DEBUG) Checking for hits from spider: voila
(DEBUG) Checking for hits from spider: webcollage
(DEBUG) Checking for hits from spider: weblayers
(DEBUG) Checking for hits from spider: webmirror
(DEBUG) Checking for hits from spider: webreaper
(DEBUG) Checking for hits from spider: wordpress
(DEBUG) Checking for hits from spider: worm
(DEBUG) Checking for hits from spider: xenu
(DEBUG) Checking for hits from spider: yacy
Found 2 hits from yacy in statistics
(DEBUG) Checking for hits from spider: yahoo
Found 153 hits from yahoo in statistics
(DEBUG) Checking for hits from spider: yahoofeedseeker
(DEBUG) Checking for hits from spider: yahooseeker
(DEBUG) Checking for hits from spider: yandex
Found 8591 hits from yandex in statistics
(DEBUG) Checking for hits from spider: yodaobot
(DEBUG) Checking for hits from spider: zealbot
(DEBUG) Checking for hits from spider: zeus
(DEBUG) Checking for hits from spider: zyborg
(DEBUG) Checking for hits from spider: parsijoo
Found 38 hits from parsijoo in statistics
(DEBUG) Checking for hits from spider: validator

Total number of hits from bots: 1361976
dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pyes
Purging 325498 hits from Googlebot in statistics
Purging 1366 hits from HTTrack in statistics
Purging 1 hits from Strider in statistics
Purging 2 hits from Teoma in statistics
Purging 7 hits from Wanadoo in statistics
Purging 170 hits from Wget in statistics
Purging 238 hits from alexa in statistics
Purging 18 hits from arks in statistics
Purging 195014 hits from bot in statistics
Purging 72 hits from bspider in statistics
Purging 14714 hits from crawl in statistics
Purging 4 hits from custo in statistics
Purging 10626 hits from findlinks in statistics
Purging 2271 hits from google in statistics
Purging 765 hits from heritrix in statistics
Purging 5 hits from ia_archiver in statistics
Purging 598 hits from ichiro in statistics
Purging 2 hits from java in statistics
Purging 113 hits from libwww in statistics
Purging 8 hits from motor in statistics
Purging 1 hits from python in statistics
Purging 103 hits from slurp in statistics
Purging 2178 hits from sogou in statistics
Purging 139 hits from speedy in statistics
Purging 20938 hits from spider in statistics
Purging 66 hits from urllib in statistics
Purging 49 hits from yahoo in statistics
Purging 38 hits from parsijoo in statistics

Total number of bot hits purged: 575004





Alan Orth

Nov 12, 2019, 10:18:50 AM
to Sean Carte, DSpace Technical Support
Dear Sean,

That's great! I'm glad you found it useful. I hope your manager isn't too depressed to see the numbers go down. ;)

Regarding the difference between runs, it looks like it has to do with the order of the user agent patterns in the file. For example, there are 325498 hits from "Googlebot" which get purged first, then there's a later user agent "bot" which matches 520514 requests, but 325498 of those would have already been purged by the "Googlebot" match. There are also about 100,000 matches for "robot" and "robots", both of which overlap with the "bot" pattern and with each other. Maybe I should add a note to the output of the total to say it's not a reliable number. The most accurate number would be the hits actually purged.
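
The overlap can be reproduced with a small Python sketch. This is a toy model, not the script: it uses plain case-sensitive substring matching on an in-memory list, and the function name `count_and_purge` is made up for the example.

```python
def count_and_purge(user_agents, patterns):
    """Compare dry-run match counts with the number actually purged in order."""
    remaining = list(user_agents)
    found = {}
    purged = {}
    for pattern in patterns:
        # Dry run: every pattern is counted against the full, unmodified data,
        # so a record matching two patterns is counted twice.
        found[pattern] = sum(pattern in ua for ua in user_agents)
        # Purge run: earlier patterns have already removed some records,
        # so each record is counted at most once.
        kept = [ua for ua in remaining if pattern not in ua]
        purged[pattern] = len(remaining) - len(kept)
        remaining = kept
    return found, purged
```

With two "Googlebot" records, one "bingbot" record, and the patterns "Googlebot" then "bot", the dry run reports 2 + 3 = 5 hits while the purge removes 2 + 1 = 3 records. The dry-run total overcounts because "bot" also matches the "Googlebot" records.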

Also, I think I'm going to change the purge option to just be "-p" without an argument like the debug flag... to be consistent and require less typing...

Cheers,

Sean Kalynuk

Nov 12, 2019, 10:45:49 AM
to Paul Münch, dspac...@googlegroups.com
Hi Paul,

If you have Tomcat access logging enabled, then you can search those logs for access to the /oai/request path.
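
A minimal Python sketch of such a search follows. It assumes the access log uses Tomcat's "common" pattern (`%h %l %u %t "%r" %s %b`) with English month abbreviations; other patterns would need a different regular expression. The function name `oai_requests` is hypothetical.

```python
import re
from datetime import datetime

# Assumes Tomcat's "common" access log pattern: %h %l %u %t "%r" %s %b
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')


def oai_requests(lines, path_prefix="/oai/request"):
    """Yield (timestamp, client, request path) for each OAI request in the log."""
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # line does not follow the assumed pattern
        client, stamp, _method, path, _status = match.groups()
        if path.startswith(path_prefix):
            when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
            yield when, client, path
```

Each yielded timestamp shows when a harvester made a request, and the path's query string shows which OAI verb was used.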

--
Sean

Paul Münch

Nov 14, 2019, 2:32:17 AM
to Sean Kalynuk, dspac...@googlegroups.com
Hi Sean,

Thank you for the hint. I should have come up with that myself. :D

Kind regards,

Paul Münch


Alan Orth

Nov 15, 2019, 3:13:10 PM
to Sean Carte, DSpace Technical Support
Dear Sean, Fabricio, and others,

I've made a handful of improvements to the script. Notably it can now read regular expressions from the patterns file, which greatly improves the number of hits matched¹. In my repository's case, with statistics from 2010 to 2019, I identified and purged 1.4 million more hits (in addition to the 2 million from before). Please test the script again to see if there are any more bot hits matched on your statistics. Find the latest version below. Check the help options and make sure to run without the purge option (-p) to see if things look OK.


Regards,

¹ Parsing these from the patterns file is tricky in bash and, even so, the regular expression syntax used in the patterns file differs from that used in Solr. Where possible, I've tried to convert them to a compatible format on the fly, and where not possible I've ignored them (for example patterns that use + or % are really tricky to handle).
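
To make the idea of on-the-fly conversion concrete, here is a rough Python sketch. It is not the script's conversion logic. It only applies the points mentioned in this thread (Solr anchors patterns at both ends, `\d` must be written as `[0-9]`, and patterns containing `+` or `%` are skipped), and the function name `to_solr_regex` is hypothetical. Escaped anchors such as `\$` are not handled.

```python
def to_solr_regex(pattern):
    """Convert a pattern to a Solr-friendly form, or return None to skip it."""
    if "+" in pattern or "%" in pattern:
        return None  # described in this thread as too tricky to handle
    converted = pattern.replace(r"\d", "[0-9]")
    # Solr anchors regex queries at both ends, so emulate "match anywhere"
    # by padding any side that the original pattern left unanchored.
    if converted.startswith("^"):
        converted = converted[1:]
    else:
        converted = ".*" + converted
    if converted.endswith("$"):
        converted = converted[:-1]
    else:
        converted = converted + ".*"
    return converted
```

Returning `None` for unsupported patterns lets the caller skip them explicitly rather than send Solr a query it may reject.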