Possible to capture outgoing and incoming header information?


doridori Jo

Jul 23, 2009, 11:14:31 PM
to scrapy...@googlegroups.com
Hi folks,

I ran into this problem when scraping sites that make heavy use of JavaScript to obfuscate their data.

For example,
<a href="javascript:void(0)" onClick="grabData(23)"> VIEW DETAILS </a>
The href attribute reveals no information about the actual URL; you'd have to manually examine the grabData() JavaScript function to get a clue.

OR

The old-school way is to manually open the Live HTTP Headers add-on for Firefox and monitor the POST parameters, which reveals the actual URL being POSTed.

So I'm wondering: is there a way for Scrapy to monitor the outgoing and incoming POST parameters, as Live HTTP Headers does? This would make even the most JavaScript-obfuscated web pages easily scrapable.

Cheers

Aníbal Pacheco

Jul 24, 2009, 11:48:21 AM
to scrapy...@googlegroups.com
Hello doridori,

Hacking through the JS to figure out the links and making the right requests is a
common web-scraping task; the Net panel of the Firebug[0] add-on for
Firefox is a nice tool for monitoring that.

Another good FF add-on is Tamper Data[1], which lets you change the
data being submitted to the server on the fly; this is sometimes
useful for discovering heavily obfuscated JS links and testing their
behavior with different data sets.

Cheers!

[0] http://getfirebug.com
[1] http://tamperdata.mozdev.org/

doridori Jo

Jul 25, 2009, 12:24:27 AM
to scrapy...@googlegroups.com
Yes, but I'm looking to see if this can be accomplished via Scrapy: just capturing the outgoing POST parameters.

doridori Jo

Jul 25, 2009, 1:53:09 AM
to scrapy...@googlegroups.com
I guess, simply put, my question is:

How do I make the spider dump the HTTP headers, and parse out the URL parameter information?

Pablo Hoffman

Jul 25, 2009, 5:56:24 PM
to scrapy...@googlegroups.com
Doridori,

You can use the request_httprepr() function (from scrapy.utils.request) to
print the raw HTTP representation of requests. It used to be a method of the
Request objects, but there's no need for it to be a method, so I moved it
to a utils function; make sure you update your Scrapy code first.

Then you can write a simple downloader middleware to print it:

from scrapy.utils.request import request_httprepr

class DumpRawRequestsMiddleware(object):
    def process_request(self, request, spider):
        # print the raw HTTP representation of every outgoing request
        print request_httprepr(request)

You'd want to put that middleware as close to the downloader as possible, since
some downloader middlewares (like the User-Agent one) modify the requests. So,
for example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DumpRawRequestsMiddleware': 999,
}

I don't think I've understood the "parse out the URL parameter" part.

Pablo.

On Fri, Jul 24, 2009 at 10:53:09PM -0700, doridori Jo wrote:
> I guess simply put, my question is
>
> How to make the spider's *dump HTTP header*, and parse out the URL parameter
> information ?
>
> On Fri, Jul 24, 2009 at 9:24 PM, doridori Jo <dori...@gmail.com> wrote:
>
> > yes, but im looking to see if this can be accomplished via scrapy. just
> > capturing the outgoing POST parameters.
> >
> > 2009/7/24 Aníbal Pacheco <apach...@gmail.com>

doridori Jo

Jul 26, 2009, 1:37:12 PM
to scrapy...@googlegroups.com
By the URL parameters I mean what's sent when your browser makes a POST request, e.g. when you submit a form. Surely the HTTP headers contain the POST parameters?

Pablo Hoffman

Jul 26, 2009, 1:51:24 PM
to scrapy...@googlegroups.com
No, the POST parameters are contained in the request body.
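You can capture them with the same kind of middleware, though; a quick sketch
(untested, class name made up):

class DumpPostParamsMiddleware(object):
    def process_request(self, request, spider):
        if request.method == 'POST':
            # urlencoded POST parameters travel in the request body
            print request.url, request.body

(request_httprepr() from the earlier example should print the body as well.)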

Pablo.
