Database scrapers


Jim Breen

Jun 7, 2021, 9:16:27 PM
to edict-...@googlegroups.com
Via the log files on the edrdg.org server I keep an eye on the overall
WWW usage. I have several "cron" jobs regularly sniffing for unusual
traffic patterns. The other day one site made several requests *per
second* to the online database over the course of a couple of hours (I
blocked the IP address.)

Today's summary report of the last 24 hours usage included:
19607 GET /jmdictdb/cgi-bin/entr.py

Now that is a helluva lot of requests for entries from the database.
Usually the daily total is in the hundreds. I immediately suspected it
was one site pounding away, but when I pulled out the IP addresses
which were the source of those requests, I found the following counts
and addresses:
2432 103.107.199.197
2288 217.138.219.180
2031 162.253.71.25
2021 86.106.90.106
1908 185.242.6.3
1722 66.222.59.200
1599 185.107.95.212
1318 185.128.27.100
1268 46.165.233.56

All quite fishy. I can sort-of understand someone wanting to get a
copy of the database and (foolishly) deciding to get it one entry at a
time, but to spread it around 10+ sites? Starts to smell like a looming
DDOS attack, but in fact this sort of thing is popping up every few
weeks.

For the time being I am keeping an eye on it. If any site fires in
around 10k requests in a day I'm quite prepared to block the IP
address.
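
For anyone wanting to produce the same kind of per-address tally, here is a minimal sketch. It is not the script used on the edrdg.org server; it assumes the access log is in Apache's common/combined format, and the names count_by_client and heavy_clients are made up for illustration. The 10,000 default mirrors the "around 10k requests in a day" figure above.

```python
import re
from collections import Counter

# Assumption: Apache "common"/"combined" log format, i.e. the client
# address comes first and the request line is the first quoted field.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)')


def count_by_client(lines, path="/jmdictdb/cgi-bin/entr.py"):
    """Count GET requests for `path` per client address."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        client, method, target = match.groups()
        # Ignore the query string (?q=... or ?e=...) when comparing paths.
        if method == "GET" and target.split("?", 1)[0] == path:
            counts[client] += 1
    return counts


def heavy_clients(counts, threshold=10_000):
    """Return (address, count) pairs at or above the threshold, busiest first."""
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

Feeding it one day's worth of log lines (for example an open file object) and printing counts.most_common(10) would give a list like the one above.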

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

Alexandru Pojoga

Jun 8, 2021, 5:24:58 PM
to edict-...@googlegroups.com
Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...

--
You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/CABHGxq7Hr%3DS27fy%2BNd-F-U4RP6OQM8tvGHkFav%2BhrYcmHSH8mw%40mail.gmail.com.

Jim Breen

Jun 9, 2021, 8:20:47 PM
to edict-...@googlegroups.com
On Wed, 9 Jun 2021 at 07:24, Alexandru Pojoga <apo...@gmail.com> wrote:
>
> Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...

It's a bit hard to talk to a script firing in requests. Yes, they're
fetching JMdict database entries.

Today's logs revealed:
34168 GET /jmdictdb/cgi-bin/entr.py

Digging into that I see about 20 sites retrieving around 2k entries
each. The pattern seems to be they retrieve a sequence of entries
until they get a "403" response, then they stop. After a while,
another site starts up with another sequence. And so on.

It doesn't appear to be doing any damage - it's not slowing down the
server too much and the bandwidth usage is OK. It's just creepy and,
of course, quite useless.

Jim

René Malenfant

Jun 9, 2021, 8:26:14 PM
to edict-...@googlegroups.com
Is there any way to feed a few junk entries like 'mountweazel' to one of those IP addresses?  Might be interesting to see where it ends up.


Alexandre Courbot

Jun 10, 2021, 9:54:20 AM
to edict-...@googlegroups.com
On Wed, Jun 9, 2021 at 6:25 AM Alexandru Pojoga <apo...@gmail.com> wrote:
>
> Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...

I can see a reason why someone would do that, and that's to scrape
information that is not in EDICT/Kanjidic. For instance, deleted
entries:

http://www.edrdg.org/jmdictdb/JMdict_deletedentries

I am myself in a situation where I am considering asking permission to
do something similar. My software uses the JMdict entry ID as a key
for users to mark entries of interest (i.e. for studying). As JMdict
evolves, some of these entries are getting removed. I just don't want
to silently delete user data whenever a deleted entry is referenced -
instead I'd like to display what the entry was looking like so the
user can find a replacement or just decide to drop it. To that end I
need the deleted entries' data and so far the only way I have found to
do this is to query all these entries one by one from the JMdict
database. If there is a better way to do it, I'd love to hear about
it. If not, Jim would you find it acceptable if I scraped all the
deleted entries once (and only once) over the course of several
days/weeks?

(and to answer a predictable question, I have nothing to do with the
scraping attempts mentioned above, nor have I attempted any of my
own so far :P)

Cheers,
Alex.

Stuart McGraw

Jun 10, 2021, 12:00:29 PM
to edict-...@googlegroups.com, Alexandre Courbot
Hi Alexandre,

You probably know this but when an entry is edited in the database
the updated entry is always added with a new entry id number. When
that edited entry is eventually approved, the old entry is deleted.
So if you are referencing the old entry by id number, then you will
see that entry disappear any time it is edited and the edit approved.

You can get the entry by sequence number with a URL like:

https://edrdg.org/jmdictdb/cgi-bin/entr.py?q=1202270

Does that help?

(Unfortunately it was a bad choice on my part to make the links on
the JMdictDB web pages use id numbers (e=...) rather than sequence
numbers (q=...) since it encourages bookmarking links that are
transient. I hope to change that in the future sometime.)

If you are aware of all that and I am not understanding the problem,
then my apologies. There may be a way to provide the info you need
without bulk screen scraping (although if screen scraping works for
you and Jim, that's fine too.)

-- Stuart

Ben Bullock

Jun 10, 2021, 6:37:19 PM
to edict-...@googlegroups.com
On Thu, 10 Jun 2021 at 09:21, Jim Breen <jimb...@gmail.com> wrote:
> On Wed, 9 Jun 2021 at 07:24, Alexandru Pojoga <apo...@gmail.com> wrote:
> >
> > Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...
>
> It's a bit hard to talk to a script firing in requests. Yes, they're
> fetching JMdict database entries.
>
> Today's logs revealed:
> 34168   GET /jmdictdb/cgi-bin/entr.py
>
> Digging into that I see about 20 sites retrieving around 2k entries
> each. The pattern seems to be they retrieve a sequence of entries
> until they get a "403" response, then they stop. After a while,
> another site starts up with another sequence. And so on.

The trick here is not to send the 403 response. Just send a page consisting of some kind of random but plausible text if you think that someone is scraping the database. This is assuming you put a link "please email me" on the page so people have a way to contact you in case this catches humans. 

As long as the random response doesn't absorb processing cycles these requests then stop being a problem. You could probably automate it so that the random response occurs after 100 or 200 requests from some source or another. Other things you can do are to have a captcha test, a minimum time between requests, use cookies or javascript, and check the user agent or the "referer" field of the requests.
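
To make two of those ideas concrete (a request budget per source and a minimum time between requests), here is a rough sketch. It is not taken from any real server setup: the class name RequestGuard, the decision labels and the one-second spacing are all assumptions for illustration, and the 200-request budget simply echoes the "100 or 200 requests" figure above. Counts never expire here; a real version would presumably reset them per day or per time window.

```python
import time


class RequestGuard:
    """Tracks per-client request counts and spacing (illustrative only)."""

    def __init__(self, max_requests=200, min_interval=1.0, clock=time.monotonic):
        self.max_requests = max_requests  # budget before a decoy page is served
        self.min_interval = min_interval  # assumed minimum seconds between requests
        self.clock = clock                # injectable so the logic can be tested
        self._count = {}
        self._last = {}

    def decide(self, client):
        """Return "serve", "too_fast" or "decoy" for this request."""
        now = self.clock()
        last = self._last.get(client)
        self._last[client] = now
        self._count[client] = self._count.get(client, 0) + 1
        if self._count[client] > self.max_requests:
            return "decoy"     # budget exhausted: send plausible junk, not a 403
        if last is not None and now - last < self.min_interval:
            return "too_fast"  # arrived sooner than the minimum spacing
        return "serve"
```

The request handler would call decide() with the client address and pick the real page, a "slow down" page, or the decoy page accordingly.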
 
> It doesn't appear to be doing any damage - it's not slowing down the
> server too much and the bandwidth usage is OK. It's just creepy and,
> of course, quite useless.

I haven't looked at the database pages so I cannot imagine what the motives are, especially if the data can be downloaded, but there are a lot of people who do very odd things on the internet. I've recently had huge problems with someone who added links to my site to various Google Docs documents, causing absolutely gigantic use of resources (it cost me more than $100 in server costs in 2020). I've written to Google repeatedly about this, but it's very difficult to get a human response from them. The big problem is that the accesses all come from Google servers, so blocking them means blocking Google's robots too.
 

Jim Breen

Jun 11, 2021, 1:37:26 AM
to edict-...@googlegroups.com
On Thu, 10 Jun 2021 at 23:54, Alexandre Courbot <gnu...@gmail.com> wrote:
> On Wed, Jun 9, 2021 at 6:25 AM Alexandru Pojoga <apo...@gmail.com> wrote:

> > Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...
>
> I can see a reason why someone would do that, and that's to scrape
> information that is not in EDICT/Kanjidic. For instance, deleted
> entries:
>
> http://www.edrdg.org/jmdictdb/JMdict_deletedentries

I could understand if that's what they're doing, but in fact the
current scraping, which is going on as I am typing, is getting current
entries.

> I am myself in a situation where I am considering asking permission to
> do something similar. My software uses the JMdict entry ID as a key
> for users to mark entries of interest (i.e. for studying). As JMdict
> evolves, some of these entries are getting removed. I just don't want
> to silently delete user data whenever a deleted entry is referenced -
> instead I'd like to display what the entry was looking like so the
> user can find a replacement or just decide to drop it. To that end I
> need the deleted entries' data and so far the only way I have found to
> do this is to query all these entries one by one from the JMdict
> database. If there is a better way to do it, I'd love to hear about
> it. If not, Jim would you find it acceptable if I scraped all the
> deleted entries once (and only once) over the course of several
> days/weeks?

I have no problem with that. A heads-up email in advance would be a
good idea. I'm looking at the options for automated detection and
blocking of this sort of scraping.

Cheers

Jim

Stuart McGraw

Jun 11, 2021, 9:38:26 AM
to edict-...@googlegroups.com, Jim Breen
Some significant information that is only available in the
web pages, not in the XML files, is the history records.
I've personally found that information both useful and
educational. Perhaps that is the motivation?

-- Stuart

Jim Breen

Jun 12, 2021, 3:35:02 AM
to edict-...@googlegroups.com
I mentioned that the scraping pauses when they get a 403 error
response. They then switch to another system/IP address and pick up
where they left off.
I'd assumed the 403 was because they'd asked for a missing entry, but
I just realised that couldn't be the cause, and indeed it isn't. A 403
is the server saying "The server understood the request, but is
refusing to authorize it." (RFC 7231). I looked in the error logs and
I see for these 403s entries such as:
[Sat Jun 12 02:10:46.893225 2021] [:error] [pid 29517:tid
139887529412352] [client 185.230.126.12:43836] client denied by server
configuration: /usr/local/apache2/jmdictdb/cgi-bin/entr.py
My guess is that at that point Apache had run out of resources or
reached a process limit or something like that, so it rejected the
request. In fact the script probably only needed to pause briefly and
it could have continued with the same system/IP address.
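
A small sketch of how the refused clients could be pulled out of an error log in bulk. It assumes entries shaped like the one quoted above and only handles IPv4 addresses; the function name denied_clients is made up for illustration.

```python
import re
from collections import Counter

# Matches the "[client ADDRESS:PORT] client denied by server configuration"
# part of an Apache error-log entry. IPv4 only, as in the quoted example.
DENIED = re.compile(
    r"\[client (\d{1,3}(?:\.\d{1,3}){3}):\d+\] "
    r"client denied by server configuration"
)


def denied_clients(log_text):
    """Count 'client denied' entries per client address."""
    # Collapse all whitespace first so entries that were wrapped across
    # several lines (as in the email above) still match.
    flat = " ".join(log_text.split())
    return Counter(DENIED.findall(flat))
```

Comparing those addresses and timestamps against the access log would show whether each scraping run really does stop at its first 403.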

BTW, that example I just quoted has an IP address of 185.230.126.12.
That belongs to the domain mail.experiencedadvertising.com. They are
an SEO outfit.

Jim

Ben Bullock

Jun 12, 2021, 4:56:11 AM
to edict-...@googlegroups.com
On Sat, 12 Jun 2021 at 16:35, Jim Breen <jimb...@gmail.com> wrote:
> BTW, that example I just quoted has an IP address of 185.230.126.12.
> That belongs to the domain mail.experiencedadvertising.com. They are
> an SEO outfit.

I wouldn't jump to conclusions. If the people are using a lot of IP addresses, it's very likely to be from a compromised machine. Just to give an example, for quite some time there has been a person who runs scripts against


for some reason containing the word "jfaf" plus some repetitive nonsense, like "jfaf mrchildren jfaf" or "jfaf yoasobi jfaf" over and over again from different IP addresses across the globe. Choosing a few from the recent log he has IP addresses belonging to, for example, "The Calyx Institute", "China Telecom", "Emerald Onion", "31173 Services Denmark", "IP Volume Norway" and "Digital Ocean". (I've just copied the names from "whois" without checking what these various things are.)




Alexandre Courbot

Jun 12, 2021, 7:41:21 AM
to Stuart McGraw, edict-...@googlegroups.com
On Fri, Jun 11, 2021 at 1:00 AM Stuart McGraw <smc...@mtneva.com> wrote:
>
> Hi Alexandre,
>
> You probably know this but when an entry is edited in the database
> the updated entry is always added with a new entry id number. When
> that edited entry is eventually approved, the old entry is deleted.
> So if you are referencing the old entry by id number, then you will
> see that entry disappear any time it is edited and the edit approved.
>
> You can get the entry by sequence number with a URL like:
>
> https://edrdg.org/jmdictdb/cgi-bin/entr.py?q=1202270
>
> Does that help?

Sorry for being imprecise, I was referring to the entry sequence
number indeed. Unless I am mistaken that one remains constant across
changes, and is used in
http://www.edrdg.org/jmdictdb/JMdict_deletedentries so my use case
should be covered.

Cheers,
Alex.

Alexandre Courbot

Jun 12, 2021, 7:45:08 AM
to edict-...@googlegroups.com
On Fri, Jun 11, 2021 at 2:37 PM Jim Breen <jimb...@gmail.com> wrote:
>
> On Thu, 10 Jun 2021 at 23:54, Alexandre Courbot <gnu...@gmail.com> wrote:
> > On Wed, Jun 9, 2021 at 6:25 AM Alexandru Pojoga <apo...@gmail.com> wrote:
>
> > > Were they scraping the contents of EDICT/Kanjidic?? Someone should tell them those are freely downloadable...
> >
> > I can see a reason why someone would do that, and that's to scrape
> > information that is not in EDICT/Kanjidic. For instance, deleted
> > entries:
> >
> > http://www.edrdg.org/jmdictdb/JMdict_deletedentries
>
> I could understand if that's what they're doing, but in fact the
> current scraping, which is going on as I am typing, is getting current
> entries

In that case that behavior is pretty puzzling indeed.

>
> > I am myself in a situation where I am considering asking permission to
> > do something similar. My software uses the JMdict entry ID as a key
> > for users to mark entries of interest (i.e. for studying). As JMdict
> > evolves, some of these entries are getting removed. I just don't want
> > to silently delete user data whenever a deleted entry is referenced -
> > instead I'd like to display what the entry was looking like so the
> > user can find a replacement or just decide to drop it. To that end I
> > need the deleted entries' data and so far the only way I have found to
> > do this is to query all these entries one by one from the JMdict
> > database. If there is a better way to do it, I'd love to hear about
> > it. If not, Jim would you find it acceptable if I scraped all the
> > deleted entries once (and only once) over the course of several
> > days/weeks?
>
> I have no problem with that . A heads-up email in advance would be a
> good idea. I'm looking at the options for automated detection and
> blocking of this sort of scraping.

Thanks, that's much appreciated. I will try to tune my script and will
then let you know when I plan to run the initial download and from
which IP address (I also don't plan the rate to be higher than 1
request/second). After that I only plan to get the entries that get
added to http://www.edrdg.org/jmdictdb/JMdict_deletedentries (so one
every other day IIUC).
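
For the record, the pacing I have in mind looks roughly like the sketch below. It is not the actual script: the names http_fetch and fetch_entries, the 60-second back-off after a 403 and the retry limit are placeholders I picked for illustration, and only the q=... URL form and the one-request-per-second ceiling come from this thread. Network errors other than HTTP error statuses are not handled.

```python
import time
import urllib.error
import urllib.request

# URL form given earlier in the thread (lookup by sequence number).
BASE_URL = "https://edrdg.org/jmdictdb/cgi-bin/entr.py?q={seq}"


def http_fetch(seq):
    """Fetch one entry page and return (status, body)."""
    try:
        with urllib.request.urlopen(BASE_URL.format(seq=seq), timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def fetch_entries(seqs, fetch=http_fetch, delay=1.0, backoff=60.0,
                  max_retries=3, sleep=time.sleep):
    """Fetch entries one at a time, never faster than one request per `delay`."""
    pages = {}
    for seq in seqs:
        attempt = 0
        while True:
            status, body = fetch(seq)
            sleep(delay)  # spacing after every request, successful or not
            if status == 200:
                pages[seq] = body
            # Only a 403 is retried; anything else moves on to the next entry.
            if status != 403 or attempt == max_retries:
                break
            attempt += 1
            sleep(backoff * attempt)  # wait longer after each refusal
    return pages
```

The fetch and sleep parameters are only there so the loop can be tried out against a fake server before anything touches edrdg.org.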

Cheers,
Alex.

Jim Breen

Jun 12, 2021, 8:27:25 AM
to edict-...@googlegroups.com
That was just a frinstance. The sites he/she/it is using are all over
the place, and yes, they are probably compromised servers.

So far today they've scraped about 26,000 entries.

Jim