How can Web2py prevent web scraping?

Alex Glaros

May 9, 2013, 2:59:00 PM

What techniques can be used in a Web2py site to prevent data mining by harvester bots?

In my day job, if the Oracle database slows down, I go to the Unix OS, see if the same IP address is doing a-lot-faster-than-a-human-could-type queries, and then block that IP address in the firewall.

Are there any ideas that I could use with a Web2py website?

Thanks,

Alex Glaros

Niphlod

May 9, 2013, 4:09:11 PM

to web...@googlegroups.com

For "kind" agents, a robots.txt suffices.
For inconsiderate harvesters, that kind of work is usually handled by either the firewall or the web server.

In web2py, you can either evaluate the user agent on every request in models and cut unwanted ones off with an "HTTP(500)", or implement your own rate limiting, which needs to be as fast as possible because it runs on every request.
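A minimal sketch of the user-agent check, assuming a hypothetical blocklist (`BLOCKED_AGENT_TOKENS`) and a hypothetical helper (`is_blocked_agent`); neither is part of web2py. The trailing comment shows roughly how a model file might use it with the framework's `request` and `HTTP` objects:

```python
# Hypothetical blocklist; the tokens are illustrative, not a vetted list.
BLOCKED_AGENT_TOKENS = ("scrapy", "httrack", "python-requests")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent is missing or contains a blocked token."""
    if not user_agent:
        # Treating a missing header as suspicious is an assumption of this sketch.
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


# In a web2py model, where `request` and `HTTP` are provided by the framework:
#     if is_blocked_agent(request.env.http_user_agent):
#         raise HTTP(500)
```

Keep in mind that a user agent is trivially spoofed, so this only stops harvesters that do not bother to disguise themselves.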

Derek

May 9, 2013, 7:36:22 PM

to web...@googlegroups.com

I've read an idea about using a 'ticket' system... each session gets X # of tickets. Tickets regenerate at a fixed rate. Normal users would never run out of tickets.

Each query operation would have a fixed cost in tickets. Inserts would cost twice as much as selects...

You don't have to calculate regeneration continuously. Just before an operation is performed, look up the time of the last request and the number of tickets left at that point, calculate how many tickets have regenerated since then, and if they have enough tickets, do the request. If not, return a 503 error, or perhaps a friendly message saying "swiper no swiping" (to quote Dora).
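A rough sketch of the ticket idea, regenerating lazily at request time as described. The class name `TicketBucket`, the capacity, the regeneration rate, and the cost constants are all assumptions made for illustration:

```python
import time


class TicketBucket:
    """Ticket counter that regenerates lazily (names and defaults are assumptions)."""

    def __init__(self, capacity=20.0, regen_per_sec=1.0, now=None):
        self.capacity = capacity
        self.regen_per_sec = regen_per_sec
        self.tickets = capacity  # every session starts with a full bucket
        self.last_seen = time.time() if now is None else now

    def try_spend(self, cost, now=None):
        """Credit tickets regenerated since the last request, then try to pay `cost`."""
        now = time.time() if now is None else now
        elapsed = max(0.0, now - self.last_seen)
        self.tickets = min(self.capacity,
                           self.tickets + elapsed * self.regen_per_sec)
        self.last_seen = now
        if self.tickets >= cost:
            self.tickets -= cost
            return True
        return False


SELECT_COST = 1.0
INSERT_COST = 2.0  # inserts cost twice as much as selects
```

In a web2py app, the two pieces of state (`tickets` and `last_seen`) could presumably live in `session`, with a `False` result from `try_spend` answered by `raise HTTP(503)`.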

LightDot

May 10, 2013, 7:43:57 AM

to web...@googlegroups.com

Some of those rogue robots / harvesters are really damaging. I've seen them bring entire sites down. We are using mod_evasive and mod_bw for Apache, and we also permanently ban certain IP ranges. But as you can see, none of this is web2py specific.

Regards,
Ales