Dictionary database server issues


Jim Breen

May 30, 2025, 8:26:14 AM
to edict-...@googlegroups.com
We are having quite an issue with the server at edrdg.org. It is being hit with masses of requests, probably from scrapers, and from time to time gets slow and even stops working.

We are working on it but it may take a while. Please be patient if you find it unresponsive.

Jim

Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/      https://www.edrdg.org/~jwb/

Arfrever Frehtes Taifersar Arahesis

May 31, 2025, 4:19:36 AM
to edict-...@googlegroups.com
I have seen that many sites have recently started using the Anubis
software for scraper defense:

https://anubis.techaro.lol/
https://anubis.techaro.lol/docs/user/known-instances
https://github.com/TecharoHQ/anubis

Maybe consider using this on the EDRDG site.

Jim Breen

May 31, 2025, 9:00:41 AM
to edict-...@googlegroups.com
On Fri, 30 May 2025 at 22:26, Jim Breen <jimb...@gmail.com> wrote:
> We are having quite an issue with the server at edrdg.org. It is being hit with masses of requests, probably from scrapers, and from time to time gets slow and even stops working.

As an interim measure, the documentation Wiki has been closed. It was
the target of many of the requests, which were overloading the web
server.

Jim


Jim Breen

Jun 3, 2025, 11:56:26 PM
to edict-...@googlegroups.com
The documentation can be viewed via the copy on the Wayback Machine.

We're still working on the issue of the bots pounding away at the wiki. Despite it being disabled, its (old) address is getting over 60k hits a day.

Jim


Jim Breen

Jun 4, 2025, 12:21:05 AM
to edict-...@googlegroups.com

Jim


Jim Breen

Jun 5, 2025, 3:53:02 AM
to edict-...@googlegroups.com
The Wiki is now available again.

What we have done is block a range of IP addresses from accessing our server. Virtually all the trolling accesses to the server were originating from those addresses.

If anyone has lost access to the edrdg.org server, please contact me with both your host and IP address, and I will attempt to restore access.
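
For readers who want to check whether their own address would fall inside a blocked range, here is a minimal Python sketch of that kind of test. The ranges below are hypothetical placeholders taken from the reserved documentation blocks; the actual ranges blocked on edrdg.org are not given in this thread, and the real block is presumably applied in the web server or firewall rather than in Python.

```python
import ipaddress

# Hypothetical example ranges (reserved documentation blocks), NOT the
# ranges actually blocked on edrdg.org.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network versions differ,
    # so IPv6 clients are simply not matched by these IPv4 ranges.
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A range block like this is coarse: it also shuts out any legitimate user whose provider assigns addresses from the same range, which is why reporting both host and IP address helps.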

Cheers 

Jim



Stuart McGraw

Jun 5, 2025, 11:39:45 AM
to edict-...@googlegroups.com
I just want to point out, in case it wasn't clear, that the problems affected the entire edrdg.org site and had nothing to do with the "Dictionary database server" specifically, despite the subject line.

-- Stuart



Thomas Buick

Jun 10, 2025, 7:54:24 AM
to jimb...@gmail.com, edict-...@googlegroups.com
Hi Jim,

It's likely that most of the bots are coming from recent LLM scraping. Most of them are pretending to be old versions of Chrome. Some web admins I've talked to have had success by responding with 503 for user agents matching:
- "(Chrome/[2-9])"
- "(Chrome/1[012])"
- "(Firefox/3)"
- "(MSIE)"

Specifically responding with 503 yields better success than responding with 429 or other 4xx errors, since the bots are not really respecting http codes, but are used to "flattening" target servers.
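
As a rough illustration of the check described above, here is a minimal Python sketch that applies the four patterns to a `User-Agent` string and picks a status code. The function name `status_for_user_agent` is made up for this example; in practice the rule would more likely live in the web server or proxy configuration than in application code.

```python
import re

# The four patterns quoted in the post above, combined into one regex.
STALE_UA_PATTERNS = [
    r"(Chrome/[2-9])",
    r"(Chrome/1[012])",
    r"(Firefox/3)",
    r"(MSIE)",
]
STALE_UA_RE = re.compile("|".join(STALE_UA_PATTERNS))


def status_for_user_agent(user_agent: str) -> int:
    """Return 503 if the user agent matches a stale-browser pattern, else 200.

    Hypothetical helper for illustration only.
    """
    return 503 if STALE_UA_RE.search(user_agent or "") else 200


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/49.0.2623.112 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/136.0.0.0 Safari/537.36",
    ]
    for ua in samples:
        print(status_for_user_agent(ua), ua)
```

Note that the patterns are unanchored, so `Chrome/1[012]` matches Chrome 10 to 12 and also Chrome 100 to 129, and `Firefox/3` matches Firefox 30 to 39 as well as 3.x. Anyone adopting them should check their logs for real visitors on those versions first.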

Are the servers exposed with an apache configuration, or is there some type of proxy like nginx in between?

Tom




Jim Breen

Jun 17, 2025, 1:10:35 AM
to edict-...@googlegroups.com
On Tue, 10 Jun 2025 at 21:54, Thomas Buick <thomas...@gmail.com> wrote:
> It's likely that most of the bots are coming from recent LLM scraping,

That was our conclusion too. There were literally hundreds of IP
addresses being used, with the traffic spread fairly evenly over them.
The requests were aimed at our MediaWiki system, which we use for
documentation. They were targeting known entry points into parts of
MediaWiki (which we don't use) and were triggering unsuccessful
database accesses, which put our system under stress (up to 100
simultaneous database processes). Even after we moved the Wiki to a
different address they kept hammering the old address.

> Specifically responding with 503 yields better success than responding with 429 or other 4xx errors, since the bots are not really respecting http codes, but are used to "flattening" target servers.

I looked at tweaking the MediaWiki code to intercept these scraping
requests, but I'm not a PHP guru and the code is rather opaque. In the
end it was easier to geoblock the offending sites.

> Are the servers exposed with an apache configuration, or is there some type of proxy like nginx in between?

At present, Apache is "exposed", but using nginx or something like
that will be worth considering if the problem recurs.

Cheers

Jim
