On Tue, 10 Jun 2025 at 21:54, Thomas Buick <thomas...@gmail.com> wrote:
> It's likely that most of the bots are coming from recent LLM scraping,
That was our conclusion too. There were literally hundreds of IP
addresses involved, with the traffic spread fairly evenly across them.
The requests were aimed at our MediaWiki system, which we use for
documentation. They were targeting known MediaWiki entry points
(features we don't use) and triggering unsuccessful database accesses,
which put real stress on our system (up to 100 simultaneous database
processes). Even after we moved the Wiki to a different address they
kept hammering the old one.
> Specifically responding with 503 yields better success than responding with 429 or other 4xx errors, since the bots are not really respecting http codes, but are used to "flattening" target servers.
I looked at tweaking the MediaWiki code to intercept these scraping
requests, but I'm not a PHP guru and the code is rather opaque. In the
end it was easier to geoblock the offending sites.
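For anyone wanting to do something similar without touching the PHP, the
block can be done at the Apache level instead. A minimal sketch, assuming
Apache 2.4 with mod_rewrite and mod_alias enabled; the paths are
illustrative only and should be adjusted to whatever endpoints the bots
are actually hitting:

```apache
# Sketch only: answer unused MediaWiki entry points with 503 before
# MediaWiki (and the database) ever sees the request. The path pattern
# below is an example, not our actual layout.
RedirectMatch 503 "^/w/(api|load|rest)\.php"

# Equivalent mod_rewrite form, useful if you want to add further
# conditions (User-Agent, query string, etc.):
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/w/(api|load|rest)\.php
RewriteRule .* - [R=503,L]
```

Since these rules run before the request reaches PHP, the failed
database lookups never happen, which was the main source of load.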
> Are the servers exposed with an apache configuration, or is there some type of proxy like nginx in between?
At present, Apache is "exposed", but putting nginx or a similar proxy
in front of it would be worth considering if the problem recurs.
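If it comes to that, a front proxy also makes the 503 trick cheap to
apply before Apache sees anything. A hypothetical nginx sketch (the
hostname, backend port, and rate values are placeholders, not our
actual setup):

```nginx
# Sketch only: nginx in front of Apache, rate-limiting by client IP
# and answering excess requests with 503.
limit_req_zone $binary_remote_addr zone=wiki:10m rate=2r/s;

server {
    listen 80;
    server_name wiki.example.org;             # hypothetical name

    location / {
        limit_req zone=wiki burst=5 nodelay;
        limit_req_status 503;                 # bots reportedly tolerate 503
        proxy_pass http://127.0.0.1:8080;     # Apache listening behind
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

With traffic spread fairly evenly over hundreds of addresses, per-IP
limits alone may not catch much, but the same proxy layer is a natural
place to hang geoblocks and path-based 503s as well.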
Cheers
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/ https://www.edrdg.org/~jwb/