apache httpd config to avoid web crawlers getting lost in facets


sye...@gmail.com

May 1, 2024, 4:56:18 PM
to DSpace Community
In the last couple of weeks we've had an issue with web crawlers getting lost in facets, crawling literally millions of URLs in the faceted Solr index. This is mainly a problem because some of these faceted queries are quite expensive for Solr to run (CPU and memory consumption of the Solr component rises).

We've deployed the following fix:

        #added to redirect long solr queries back to the homepage
        RewriteEngine On
        RewriteCond "%{QUERY_STRING}" "filter_3"
        RewriteRule .  https://ir.wgtn.ac.nz/ [R]

Matching on "filter_3" means that users and crawlers are allowed two facets deep before being redirected back to the homepage.
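
One thing to watch with a rule like this: by default mod_rewrite copies the original query string onto a redirect target that has none of its own, so the redirected URL may still contain "filter_3". A possible variant is sketched below. It assumes Apache httpd 2.4 or later (the QSD flag is not available in 2.2), and repository.example.org is a placeholder hostname, not a real site. It has not been tested against a DSpace instance.

```apache
# Sketch only: assumes Apache httpd 2.4+ and a placeholder hostname.
RewriteEngine On
# Match any request whose query string contains "filter_3"
RewriteCond "%{QUERY_STRING}" "filter_3"
# R=302 sends a temporary redirect to the homepage,
# QSD discards the original query string so the facet parameters are dropped,
# L stops further rewrite processing for this request.
RewriteRule "." "https://repository.example.org/" [R=302,QSD,L]
```

Dropping the query string keeps the redirect target from matching the same condition again.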

We're redirecting to our own homepage; others will probably want to redirect to their own homepages (and/or bot tarpits).

cheers
stuart



DSpace Community

May 1, 2024, 5:13:01 PM
to DSpace Community
Hi Stuart & all,

I wanted to briefly mention that this also sounds similar to this bug ticket about crawlers getting "stuck" in facets of entities: https://github.com/DSpace/dspace-angular/issues/2709

There's a fix we've applied which will be in the 8.0 and 7.6.2 releases (once each is finished): https://github.com/DSpace/dspace-angular/pull/2710  (This approach has been approved by Google Scholar)

This may not be the same thing that Stuart noticed, but it's definitely related.  So, this is another way to lessen the crawler activity if you are seeing it in your DSpace 7 instance.

Tim

Stuart A. Yeates

May 1, 2024, 5:50:58 PM
to DSpace Community
Related notes:

1) Our fix is for DSpace 6.3. Sorry, I should have said this in the first email.
2) Some of the issues we're seeing appear to be from non-Google crawlers (based on reverse IP lookup and user agent string analysis).
3) Our fix works for web crawlers which do not follow robots.txt.
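
Because the rule is enforced by httpd itself, it does not depend on the crawler's cooperation. For anyone who would rather refuse such requests than redirect them, a minimal sketch of an alternative (not what we deployed, and untested against DSpace) uses the F flag to return 403 Forbidden:

```apache
# Sketch only: an alternative to redirecting, not the deployed fix.
RewriteEngine On
# Match any request whose query string contains "filter_3"
RewriteCond "%{QUERY_STRING}" "filter_3"
# "-" means no substitution; F returns 403 Forbidden and stops processing
RewriteRule "." "-" [F]
```

Note that this also returns 403 to human visitors who go more than two facets deep, so it is a blunter option than the redirect.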

cheers
stuart
--
...let us be heard from red core to black sky



Andrew K

May 13, 2024, 2:43:08 PM
to DSpace Community
Hi,
Actually the situation is a lot better even in 7.6.1.
Half a year after switching from v5, the total number of scanned pages has decreased drastically, from 1.3M to 190K, as you can see (and it's probably not final).
The number of indexed pages also decreased, which doesn't seem to affect the number of views (because those extra pages were never viewed).
WBR,
Andrew
[Attachment: 2024-05-13_213034.png]

On Thursday, May 2, 2024 at 00:13:01 UTC+3, DSpace Community wrote: