Query on Apache rewrite rules

Eunice Soh

Oct 12, 2021, 12:50:39 AM
to Dataverse Users Community
Hi all,

Sorry for double posting. Thought it would be good to have a specific post for the question below.

The documentation at https://guides.dataverse.org/en/latest/installation/config.html says:
"If you are having trouble with the site being overloaded with what looks like heavy automated crawling, you may have to resort to blocking this traffic by other means - for example, via rewrite rules in Apache" 

Could anyone give pointers (e.g. which Apache module) or implementation details on this? 

Thanks,
Eunice

leo...@g.harvard.edu

Oct 12, 2021, 5:00:30 PM
to Dataverse Users Community
The Apache module in question is mod_rewrite; if you are running Dataverse behind Apache, you are most likely using this module already.
That sentence in the guide simply means that if you are seeing a lot of traffic that you don't recognize as normal user activity, you can add simple rewrite rules to block such requests by whatever they have in common.
For example, in our own production we have a section that blocks, by User-Agent name, a number of apparent crawlers that ignored the rules in our robots.txt; something along the lines of:

RewriteCond %{HTTP_USER_AGENT}  '^.*AnnoyingBot1.*'           [OR]
RewriteCond %{HTTP_USER_AGENT}  '^.*AnnoyingBot2.*'           [OR]
...
RewriteCond %{HTTP_USER_AGENT}  '^AnnoyingBotN.*'
RewriteRule ^/(.*)              /var/www/html/robots.txt


Note that in the example above, instead of dropping the connection, we actually serve robots.txt in response to ANY request from these clients, just to make a point.

But also note that the above is not going to stop somebody who *really* wants to keep crawling your site, as the User-Agent header can be easily changed on the client side; a large search engine company, though, is unlikely to bother.
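
For comparison with the robots.txt rule above, here is a minimal sketch that refuses the matching requests outright with a 403 response. The bot names are the same placeholders as in the example above, and the [NC] (case-insensitive) flag is an assumption about what you would want; this is not copied from our production config.

```apache
# Rewrite rules are only evaluated when the engine is switched on.
RewriteEngine On

# Match either placeholder bot name anywhere in the User-Agent header.
# [NC] makes the match case-insensitive; [OR] chains to the next condition.
RewriteCond %{HTTP_USER_AGENT}  'AnnoyingBot1'  [NC,OR]
RewriteCond %{HTTP_USER_AGENT}  'AnnoyingBot2'  [NC]

# "-" means no substitution; [F] answers with 403 Forbidden.
RewriteRule ^ - [F]
```

Whether to answer with 403 or serve robots.txt is a matter of taste; either way the request never reaches the application.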

A rewrite condition to block by the client's IP address, or a block of addresses: 

RewriteCond %{REMOTE_ADDR} '^123\.124\.125\..*'

Or, if your Apache server is behind a proxy that's hiding the client's real IP address, you may need to obtain that address from another header. For example, our production server sits behind an AWS load balancer, so our rules look like:

RewriteCond %{HTTP:X-Forwarded-For}  '^123\.124\.125\.126' 
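
Put together as a complete rule, the two conditions above might look like the sketch below. The addresses are the same made-up ones, the [F] (403) response is an assumption, and the X-Forwarded-For pattern assumes the header can carry a comma-separated list of addresses, so it matches the address at any position in that list.

```apache
RewriteEngine On

# Direct connections: block the whole 123.124.125.x range.
RewriteCond %{REMOTE_ADDR}           '^123\.124\.125\.'  [OR]
# Behind a proxy or load balancer: match one address as a whole entry
# of the comma-separated X-Forwarded-For list.
RewriteCond %{HTTP:X-Forwarded-For}  '(^|,\s*)123\.124\.125\.126(\s*,|$)'

# "-" means no substitution; [F] answers with 403 Forbidden.
RewriteRule ^ - [F]
```

Keep in mind that X-Forwarded-For is partly client-supplied, so a determined client can pad it with arbitrary addresses; only the entries added by your own proxy are trustworthy.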


Eunice Soh

Oct 13, 2021, 3:48:23 AM
to Dataverse Users Community
Thanks so much for the example on your production system, using mod_rewrite to serve robots.txt to specific named bots and (group of) IP addresses.

Just to add, the case study we're looking at is a collection of bots/servers, with dynamic IPs from one geographical location, hitting a backend endpoint such as /api/access/datafile.

Given that, defining the rewrite rules based on IP could be challenging.

We may have to consider other approaches, e.g. blocking based on the number of requests per minute using mod_evasive or mod_security, as suggested in this Stack Overflow post: https://stackoverflow.com/questions/19631981

leo...@g.harvard.edu

Oct 13, 2021, 12:34:50 PM
to Dataverse Users Community
We should of course update that paragraph in the guide to include a couple of RewriteCond examples, like the ones I posted yesterday.
But we should also add some text making it clear that these are very quick-and-dirty, band-aid solutions. On our own production server we've used these rewrite rules mostly as temporary fixes to address some unusual activity in the access logs. They require manual work, of course, and do not work as a long-term solution.

We have also discussed doing something more sophisticated: a throttling mechanism that would automatically block repeated requests from the same user and/or the same IP address once they start exceeding some defined limits. But we haven't progressed to actually implementing any such solution.

I believe you are on the right track experimenting with mod_evasive and mod_security. Some limiting mechanism like that could be implemented on the application level, inside Dataverse, too. But we were also thinking that there should be some already implemented technology that would be easier to apply on the Apache side. 
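
As a starting point for that kind of experiment, a mod_evasive configuration might look like the sketch below. We have not run this in production; the thresholds are illustrative guesses that would need tuning against real access logs, and the directives go wherever your distribution loads the module.

```apache
# Size of the hash table used to track clients.
DOSHashTableSize    3097

# Block a client that requests the same URI more than
# DOSPageCount times within DOSPageInterval seconds.
DOSPageCount        10
DOSPageInterval     1

# Block a client that makes more than DOSSiteCount requests
# to the whole site within DOSSiteInterval seconds.
DOSSiteCount        100
DOSSiteInterval     1

# How long, in seconds, a blocked client keeps getting 403 responses.
DOSBlockingPeriod   60
```

Two caveats: mod_evasive counts requests per client IP address, so it may not help much against a crawler spread across many dynamic addresses; and behind a load balancer Apache may see the balancer's address rather than the client's, in which case per-IP counting would lump all visitors together unless the real address is restored first.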

Eunice Soh

Oct 13, 2021, 10:03:46 PM
to Dataverse Users Community
It would be great to hear about any progress on, or implementation of, a throttling/rate-limiting mechanism, be it at the server or the application level. Thanks again!