Seeing new connections fail, but lots of activity, "ajp_ilink_receive failed" in Apache logs


Nick Lauland

Oct 23, 2018, 1:40:47 PM
to Dataverse Users Community
Hey guys, wondering if you've seen anything like this...

We started having this odd problem a couple of weeks ago, where our server stops responding and we see:

"Connection reset by peer: ajp_ilink_receive() can't receive header" in /var/log/httpd/error_log
"ajp_read_header: ajp_ilink_receive failed" in /var/log/httpd/ssl_error_log

But looking at ssl_access_log and ssl_request_log, I can see lots of successful activity from multiple clients.
There also aren't any real errors in Dataverse's logs; it looks like it is still happily functioning.
When this happens, it looks like most of the (successful) activity is coming from search engines such as "pipl.com" and "vultr.com".
After restarting Glassfish, Dataverse seems to be fine for another 24 hours or so.

My theory, after searching for reports of something like this, is that the spiders connect but somehow slowly consume all of the connections, crowding out any new ones. The spiders then continue to happily crawl, but nobody else is able to get in.

From some research, a mismatch between the Apache and Glassfish thread limits sounded possible... but after making sure both the Apache max and the Glassfish pool max were 255, nothing changed.
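
For reference, this is roughly what we lined up (just a sketch; it assumes the Apache 2.2 prefork MPM and Glassfish's default http-thread-pool, so the exact pool/listener names may differ on your install):

    # Apache 2.2 prefork MPM (httpd.conf) -- cap on concurrent Apache workers
    <IfModule prefork.c>
        ServerLimit  255
        MaxClients   255
    </IfModule>

    # Glassfish -- cap on the thread pool serving the proxied requests
    # (pool name may differ, e.g. if the AJP listener uses its own pool)
    asadmin set server-config.thread-pools.thread-pool.http-thread-pool.max-thread-pool-size=255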

I can't help thinking this is an Apache AJP thing and not a Dataverse thing at all.
We're on Amazon AWS, with Apache 2.2 (I know) and Dataverse v4.8.6.

Thanks!

Nick Lauland
System Admin
Texas Digital Library

Nick Lauland

Oct 29, 2018, 3:55:43 PM
to Dataverse Users Community
FYI, it was definitely a bot problem!

I blocked all bots, and immediately things became stable. I then allowed Googlebot back in, and still saw no issues at all over the weekend.

I'm adding other known "good" bots and watching activity. Of course I'd rather let everyone in and only block the troublemakers!

When I get to that point, and I can positively identify the troublemakers, I'll post here, along with whatever I've been able to figure out about why.
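
For anyone curious, the general shape of the rules is roughly this (a sketch, not our exact file; and of course it only matters for bots that actually honor robots.txt):

    # Allow Google's crawler everywhere...
    User-agent: Googlebot
    Disallow:

    # ...and tell every other bot to crawl nothing.
    User-agent: *
    Disallow: /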

Philip Durbin

Oct 29, 2018, 4:44:29 PM
to dataverse...@googlegroups.com
Phew! Thanks for letting us know, Nick! There had been a little chatter on this topic in IRC...


But I didn't have anything substantial to add.

Thanks again,

Phil


Lars Kaczmirek

Oct 30, 2018, 5:47:42 AM
to Dataverse Users Community
Hi Nick,
thanks for informing us and posting the solution. Was it enough to change the robots.txt or did you take more serious measures to block out the bots?
Best regards
Lars

Lars Kaczmirek

Oct 30, 2018, 10:17:59 AM
to Dataverse Users Community
Quick (probably temporary) fix: for now we restarted the Glassfish service, which solved the problem for the moment. Restarting Apache did not do the trick; we tried that first. Thanks again to Nick.
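
For completeness, the restart itself is just the usual service restart (service names depend on how Glassfish and Apache were installed; these systemd commands are only an example):

    # Restarting the app server cleared the wedged AJP connections for us:
    sudo systemctl restart glassfish

    # Restarting only Apache was not enough in our case:
    # sudo systemctl restart httpd
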
Best
Lars

Lars Kaczmirek

Dec 7, 2018, 1:33:25 PM
to Dataverse Users Community
The problem has not recurred in the last 37 days.

Philip Durbin

Dec 7, 2018, 2:56:31 PM
to dataverse...@googlegroups.com
Good! Thanks for letting us know!


Nick Lauland

Dec 13, 2018, 10:31:48 AM
to Dataverse Users Community

We've had no problems since the robots.txt was put in place!

Eunice Soh

Oct 7, 2021, 5:11:27 AM
to Dataverse Users Community

Hi,

I’m reviving this thread…

Our use case: a single endpoint (e.g. /api/access/datafile/$id) is being crawled by a collection of bots with dynamic IPs. Much like a DDoS, this can slow down Dataverse services or result in 50x errors.

I'm consolidating some of the mitigation strategies mentioned here and in the GitHub issues. If anyone has interest/expertise in these, I'd like to ask a couple of questions. I also welcome any other mitigation strategies, and notes on how they might be implemented.

On this thread

(1) robots.txt

Q: How do we add an explicit rule telling bots not to crawl /api/access/datafile/$id, in addition to what is already in the production robots.txt file?

Q: How and where do you deploy the robots.txt file? During deployment, at /payara/glassfish/domain/domain1/dataverse-*/robots.txt? Or is it served from the Apache proxy in front?

Q: Has it worked in your experience? Is it correct to say that it is only a code of conduct, which bots may choose not to respect?

(2) Restarting the Apache proxy

Q: How does this help with the bot problem?

Other ways

(3) Rate limiting

This seems particularly useful for our case, because the traffic is hitting a single endpoint. However, rate limiting does not appear to be implemented at the application level: https://github.com/IQSS/dataverse/issues/1339

Q: Is this in the pipeline for implementation? If so, when?
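
(Until then, one stopgap we are considering is approximating it at the Apache layer, e.g. with mod_evasive. A rough, untested sketch with placeholder thresholds, assuming the module is installed and loaded:)

    <IfModule mod_evasive20.c>
        # Temporarily block (for 600 s) any IP that requests the same URI
        # more than 10 times per second, or more than 100 URIs per second
        # site-wide.
        DOSPageCount        10
        DOSPageInterval     1
        DOSSiteCount        100
        DOSSiteInterval     1
        DOSBlockingPeriod   600
    </IfModule>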

 

(4) Web application firewall

Q: If it is a botnet (i.e. a collection of bots with different IP addresses), with no definite pattern and dynamic IPs, is this a feasible solution?

Q: Has anyone implemented this?

Thanks in advance,

Eunice

Philip Durbin

Oct 7, 2021, 4:28:36 PM
to dataverse...@googlegroups.com
Hi Eunice,

I don't think I'm going to be able to answer all your questions but I'll do what I can.

Yes, you have the path right for where robots.txt should go. The guides provide the following example at https://guides.dataverse.org/en/5.6/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines

"For example, for Dataverse Software 4.6.1 the path to robots.txt may be /usr/local/payara5/glassfish/domains/domain1/applications/dataverse-4.6.1/robots.txt with the version number 4.6.1 as part of the path."

Yes, I'd say robots.txt is only a code of conduct, and bots that ignore it are merely violating that code. You shouldn't be surprised if you have to take more drastic measures, such as blocking IP addresses.
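
As for an explicit rule for the datafile endpoint, a robots.txt group along these lines should cover it for bots that honor the file (a sketch; merge it into whatever rules your production robots.txt already has rather than replacing them):

    User-agent: *
    # Keep crawlers away from direct file downloads.
    Disallow: /api/access/datafile/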

As for where issue #1339 (rate limiting) is in the pipeline, it was recently (8 days ago) moved from "Needs Discussion/Definition" to "Up Next" on our project board at https://github.com/orgs/IQSS/projects/2 and is currently in the #4 spot. This issue has a fairly long history. Starts and stops. Many comments but more comments are welcome if you'd like to give some opinions! In terms of implementing rate limiting within Dataverse itself or recommending a tool, I don't think we know yet. Over the years we've certainly talked about both. This is definitely the kind of thing you could leave a comment about in the issue. Or perhaps a new thread. Or perhaps it could be a topic for a community call.

I don't have any experience defending against botnets, but a long time ago I was on a team that used fail2ban to block abusive IP addresses. I'm sure there are plenty of newer tools by now. :)
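
Very roughly, a fail2ban setup for this kind of traffic might look like the following (hypothetical filter/jail names and an untested failregex; tune the log path, thresholds, and ban time to your install):

    # /etc/fail2ban/filter.d/dataverse-datafile.conf
    [Definition]
    failregex = ^<HOST> .* "GET /api/access/datafile/

    # /etc/fail2ban/jail.local
    [dataverse-datafile]
    enabled   = true
    port      = http,https
    filter    = dataverse-datafile
    logpath   = /var/log/httpd/ssl_access_log
    findtime  = 60
    maxretry  = 100
    bantime   = 3600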

I hope this helps!

Phil


Eunice Soh

Oct 7, 2021, 9:53:21 PM
to Dataverse Users Community
Thanks, Phil, for your input/suggestions on robots.txt, rate limiting, and fail2ban! We'll look into fail2ban for the dynamic IPs.

By the way, I saw this on https://guides.dataverse.org/en/latest/installation/config.html: "If you are having trouble with the site being overloaded with what looks like heavy automated crawling, you may have to resort to blocking this traffic by other means - for example, via rewrite rules in Apache". Could anyone give implementation details on this?
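
Something like the following is what I'm picturing (only my guess; the user-agent names are placeholders), but I'd appreciate confirmation:

    # In the Apache virtual host that proxies to the app server:
    RewriteEngine On
    # Return 403 Forbidden to selected crawlers hitting the datafile API.
    RewriteCond %{HTTP_USER_AGENT} (BadBotOne|BadBotTwo) [NC]
    RewriteRule ^/api/access/datafile/ - [F,L]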

Another potential solution that has been mooted: a captcha. However, a captcha would only stop browser-based users/bots, whereas our specific use case is direct hits on the endpoint (e.g. /api/access/datafile/$id). Is that a correct understanding?

Kind regards,
Eunice

Philip Durbin

Oct 12, 2021, 4:17:18 PM
to dataverse...@googlegroups.com
Hi Eunice,

Thanks for creating a dedicated thread about Apache rewrite rules over at https://groups.google.com/g/dataverse-community/c/LPCbGJvB2io/m/yVVyPWOLAQAJ

This original thread is getting a bit long and sprawling (and old, from years ago), so we appreciate it. Also, I talked to my colleague who put that line about Apache rewrite rules into the guides, and he's hoping to reply soon.

As to your question about captchas, yes, I think they're really only for browsers. They aren't going to help block or slow down API calls. As discussed earlier, https://github.com/IQSS/dataverse/issues/1339 is the issue we have open for rate limiting (including the API). You're welcome to leave comments there; there hasn't been any discussion on that issue lately.

I hope this helps,

Phil

Eunice Soh

Oct 13, 2021, 4:05:53 AM
to Dataverse Users Community
Thanks Phil. I'll put a note of interest on the rate-limiting thread and describe how it might be useful!