How can Web2py prevent web scraping?


Alex Glaros

May 9, 2013, 2:59:00 PM

What techniques can be used in a Web2py site to prevent data mining by harvester bots?

In my day job, if the Oracle database slows down, I go to the Unix OS, see if the same IP address is doing a-lot-faster-than-a-human-could-type queries, and then block that IP address in the firewall.

Are there any ideas that I could use with a Web2py website?

Thanks,

Alex Glaros

Niphlod

May 9, 2013, 4:09:11 PM
to web...@googlegroups.com
For "kind" agents, a robots.txt suffices.
For inconsiderate harvesters, that kind of work is usually handled by either the firewall or the web server.

In web2py, either you evaluate the user agent for every request and cut it short with an "HTTP(500)" in models, or you implement your own rate limiting, which needs to be as fast as possible.
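
The user-agent check described above could look something like the sketch below. The pattern list and the `is_blocked_agent` helper are made up for illustration and are not part of web2py. Many harvesters spoof a browser user agent, so this only stops the ones that identify themselves.

```python
import re

# Hypothetical blocklist, for illustration only. Bots that spoof a
# browser user agent will pass straight through this check.
BLOCKED_AGENT_PATTERNS = [
    re.compile(r"scrapy", re.IGNORECASE),
    re.compile(r"python-requests", re.IGNORECASE),
    re.compile(r"curl/", re.IGNORECASE),
]


def is_blocked_agent(user_agent):
    """Return True if the user agent is missing or matches a blocked pattern."""
    if not user_agent:
        # Assumption: a request with no User-Agent header is treated as a bot.
        return True
    return any(p.search(user_agent) for p in BLOCKED_AGENT_PATTERNS)


# In a web2py model file this might be wired up roughly like this
# (untested sketch; request and HTTP are provided by web2py there):
#     if is_blocked_agent(request.env.http_user_agent):
#         raise HTTP(500)
```

Because model files run on every request, the check is kept to a few precompiled regular expressions so it stays cheap.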

Derek

May 9, 2013, 7:36:22 PM
to web...@googlegroups.com
I've read about an idea using a 'ticket' system: each session gets X tickets, and tickets regenerate at a fixed rate. Normal users would never run out of tickets.
Each query operation would have a fixed ticket cost. Inserts would cost double what selects cost...
You don't have to calculate regeneration continuously. Just before an operation is performed, you check the time of the last request and the number of tickets left at that point, calculate the regeneration since then, and if they have enough tickets, you do the request. If not, you return a 503 error, or perhaps a friendly message saying "swiper no swiping" (to quote Dora).
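
A minimal sketch of this ticket scheme, with regeneration computed lazily at request time, might look like the following. The class name, the capacity, the regeneration rate, and the two cost constants are all assumptions chosen for illustration; only "inserts cost double selects" comes from the description above.

```python
import time

SELECT_COST = 1  # assumed cost of a select
INSERT_COST = 2  # inserts cost double selects


class TicketBucket(object):
    """Per-client ticket balance that regenerates at a fixed rate."""

    def __init__(self, capacity=20.0, regen_per_second=1.0, now=None):
        self.capacity = float(capacity)
        self.regen_per_second = float(regen_per_second)
        self.tickets = float(capacity)  # start with a full balance
        self.last_seen = time.time() if now is None else now

    def try_spend(self, cost, now=None):
        """Regenerate tickets for the elapsed time, then try to pay `cost`."""
        now = time.time() if now is None else now
        elapsed = max(0.0, now - self.last_seen)
        # Lazy regeneration: only computed when a request arrives,
        # and never above the configured capacity.
        self.tickets = min(self.capacity,
                           self.tickets + elapsed * self.regen_per_second)
        self.last_seen = now
        if self.tickets >= cost:
            self.tickets -= cost
            return True
        return False
```

Only two numbers per client need to be stored (the balance and the last request time). In web2py they could live in `session`, but a bot that discards cookies gets a fresh session on every request, so keying the bucket by client IP may work better in practice.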

LightDot

May 10, 2013, 7:43:57 AM
to web...@googlegroups.com
Some of those rogue robots / harvesters are really damaging. I've seen them bring entire sites down. We are using mod_evasive and mod_bw for Apache and also permanently ban certain IP ranges. But as you can see, none of this is web2py specific.

Regards,
Ales

Richard Penman

Aug 26, 2014, 8:37:43 AM
to web...@googlegroups.com
I created a simpler system that just counts requests and blocks a client once it exceeds a maximum within a time frame:
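
The code from this post is not preserved in the thread text. As a rough illustration of the same idea (not Richard's implementation), a fixed-window request counter keyed by client could be sketched like this. The class name, the limits, and the in-memory dict are all assumptions.

```python
import time


class RequestCounter(object):
    """Count requests per client and block once a per-window maximum is hit."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._windows = {}  # client key -> (window_start, count)

    def allow(self, client_key, now=None):
        """Record one request and return True if it is within the limit."""
        now = time.time() if now is None else now
        start, count = self._windows.get(client_key, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a new one.
            start, count = now, 0
        count += 1
        self._windows[client_key] = (start, count)
        return count <= self.max_requests
```

A plain dict like this is only shared within one process. With several web2py worker processes, the counts would need to live in a shared store instead.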

Derek

Aug 26, 2014, 4:26:45 PM
to web...@googlegroups.com
You should use proper HTTP status codes.
429 is the appropriate response code.

While 503 did have wording suggesting it can be used for rate limiting, that has been removed.


503 and 403 were both used for this purpose in the past, but 429 is now the most appropriate response.


Thanks!
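
For reference, 429 (Too Many Requests) is defined in RFC 6585, which also allows a `Retry-After` header telling the client how long to wait. A small sketch of building such a response is below; the `too_many_requests` helper is hypothetical and framework independent.

```python
import math


def too_many_requests(retry_after_seconds):
    """Build the status, headers, and body for a 429 response."""
    status = 429
    # Retry-After takes whole seconds, so round up to be safe.
    headers = {"Retry-After": str(int(math.ceil(retry_after_seconds)))}
    body = "Too Many Requests"
    return status, headers, body


# In web2py this could plausibly be raised as follows (untested sketch,
# assuming HTTP accepts extra headers as keyword arguments):
#     status, headers, body = too_many_requests(30)
#     raise HTTP(status, body, **headers)
```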

Richard Baron Penman

Aug 27, 2014, 1:45:43 AM
to web...@googlegroups.com
Thanks for the info.