How to crawl this website via scrapy?


lzhshen

Sep 21, 2010, 10:28:03 PM
to scrapy-users
Hi,

I want to extract JDs (job descriptions) from the official websites of
top-500 companies and store them on my website. With Scrapy's help, I
was able to extract JDs from a few websites, such as IBM and Morgan.
However, I ran into problems when I tried to do the same for the Intel
website. After I send the request via the POST method, it just returns
a simple page that is almost empty. By the way, with httpFox (a Firefox
add-on), I can see that the browser sends a lot of POST data when I
submit the request, which is different from the other sites.

Could someone give me some hints on how to crawl this website? Here
is the URL, which can be opened in a browser:
http://www.intel.com/jobs/jobsearch/index_ne.htm?Location=200000008

Here is my spider's skeleton:
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class IntelSpider(BaseSpider):
    name = "intel.com"
    allowed_domains = ["taleo.net"]

    def start_requests(self):
        # POST the search form data captured with httpFox.
        req_china = FormRequest(
            "https://intel.taleo.net/careersection/750PRD.18.6.3.3.0/html/ajax.htm",
            formdata={
                'iframemode': '1',
                'ftlpageid': 'reqListAdvancedPage',
                'ftlinterfaceid': 'advancedSearchFooterInterface',
                'ftlcompid': 'SEARCH',
                # ... there is a lot more data here ...
                'location1L2': '-1',
                'dropListSize': '25',
                'dropSortBy': '10'},
            callback=self.test)

        return [req_china]

    def test(self, response):
        # Dump the raw response body for inspection.
        print response.body
        return

Here is the result of print:
shen@shen-laptop:~/Dropbox/app/jdspider$ scrapy crawl intel.com
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' lang='en' xml:lang='en'>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
<title>Ajax FTL Proxy</title>
<script type='text/javascript'>
function pushDataToParent()
{
var responseContainer = document.getElementById('response');
var response = responseContainer.innerHTML;
if (response == '')
{
response = null;
}
try
{
if (response == null)
{
//tt117496: PATCH for support the back-Button of browser
//tt130201: PATCH for support the back-Button of browser for all ftl List
if (parent.ftlUtil_getInitialHistoryData)
{
//tt117496: PATCH for support the back-Button of browser
var repTmp = parent.ftlUtil_getInitialHistoryData();
if (repTmp != undefined && repTmp != null && repTmp != '')
{
response = repTmp;
responseContainer.innerHTML = response;
}
}
}

//tt117496: PATCH for support the back-Button of browser
if (response != null)
{
parent.ftlUtil_ajaxResponseReady(response);
}
}
catch(e)
{
// Nothing to do
}
}
</script>
</head>
<body onload='javascript:pushDataToParent();'>
<!-- Section 508 Compliance for Cynthia Says Tool -->
Empty frame. Ignore.
<table><tr><td id='response'></td></tr></table>
<form action='' method='post'>
</form>
</body>
</html>


Could someone help me with this? Thanks in advance!

Leo

Pablo Hoffman

Sep 27, 2010, 11:39:04 AM
to scrapy...@googlegroups.com
Hi,

Check out the FormRequest.from_response() class method; it could be of
help.

See:

http://doc.scrapy.org/topics/request-response.html#scrapy.http.FormRequest.from_response
http://doc.scrapy.org/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login

Pablo.


lzhshen

Oct 24, 2010, 6:02:02 AM
to scrapy-users
Thanks for your reply. However, it does not help.

The problem is that I can't construct a correct FormRequest that gets
the expected reply content.


Steven Almeroth

Oct 25, 2010, 7:31:46 AM
to scrapy-users
Scrapy will not execute the JavaScript that builds the <iframe>. You
will have to make a separate Request to the value of the URL variable,
or use a JS engine like SpiderMonkey to run the script first.

<script language="javascript" type="text/javascript">
document.write('<iframe src="' + URL + '" name="main_content" '
    + 'frameborder="no" marginwidth="10" marginheight="5" scrolling="yes" '
    + 'width="920" height="610"></iframe>');
</script>
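
A rough sketch of the first option (requesting the frame address directly), using only the Python standard library. It assumes the page assigns the address with a statement of the form var URL = "..."; that line is not shown in this thread, so the pattern is a guess that has to be adapted to the real page. The function name extract_iframe_url is made up for illustration.

```python
import re
from urllib.parse import urljoin

# Assumption: the page contains an assignment such as
#   var URL = "https://...";
# The thread does not show that line, so adjust the pattern to the real page.
URL_ASSIGNMENT = re.compile(r"""\bURL\s*=\s*(["'])(?P<url>.+?)\1""")


def extract_iframe_url(page_html, page_url):
    """Return the absolute URL assigned to the JavaScript variable URL, or None."""
    match = URL_ASSIGNMENT.search(page_html)
    if match is None:
        return None
    # Resolve relative addresses against the page that contained the script.
    return urljoin(page_url, match.group("url"))
```

In a spider callback, the returned address could then be passed to a new Request so that the frame content is downloaded like any other page.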



lzhshen

Oct 28, 2010, 10:33:38 AM
to scrapy-users
Thanks for your reply.
I captured the POST request sent by the JavaScript with a Firefox plugin
(httpFox) and simulated that request with Scrapy. Aren't they the same?
Why should Scrapy care about what JavaScript does on the client side?
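
One way to check whether the two requests really are the same is to compare the POST body captured in the browser with the body the spider sends. The helper below is only a sketch using the standard library; diff_form_fields is a made-up name, and it ignores repeated field names (the last value wins). Cookies and headers would have to be compared separately.

```python
from urllib.parse import parse_qsl


def diff_form_fields(browser_body, spider_body):
    """Compare two urlencoded POST bodies field by field."""
    # keep_blank_values=True so that empty fields still count as present.
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    spider = dict(parse_qsl(spider_body, keep_blank_values=True))
    return {
        "missing_in_spider": sorted(browser.keys() - spider.keys()),
        "extra_in_spider": sorted(spider.keys() - browser.keys()),
        "different_values": sorted(
            key for key in browser.keys() & spider.keys()
            if browser[key] != spider[key]
        ),
    }
```

The browser body can be copied from httpFox, and the spider body is available as the body attribute of the FormRequest.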

Steven Almeroth

Oct 29, 2010, 6:37:03 AM
to scrapy-users
On Oct 28, 4:33 pm, lzhshen <lzhs...@gmail.com> wrote:
> Thanks for your reply.
> I catch the POST request sent by the javascript via firefox plugin
> (httpFox) and I simulate the request via scrapy. Aren't they the same?

Sure, they're the same, but there's a lot of script running in these
pages, so you have to be able not only to catch the POST but also to
process it, for example <body onload=...>.

> Why should scrapy care about what javascript does in client side?

Scrapy doesn't, but you do. I'm not sure what you are asking. It
appears that there are 141 job descriptions listed at:
https://intel.taleo.net/careersection/10020/jobsearch.ftl?lang=en&location=200000008

I don't think you need to worry about:
https://intel.taleo.net/careersection/750PRD.18.6.3.3.0/html/ajax.htm

lzhshen

Oct 31, 2010, 2:49:35 AM
to scrapy-users


On Oct 29, 6:37 pm, Steven Almeroth <srot...@gmail.com> wrote:
> On Oct 28, 4:33 pm, lzhshen <lzhs...@gmail.com> wrote:
>
> > Thanks for your reply.
> > I catch the POST request sent by the javascript via firefox plugin
> > (httpFox) and I simulate the request via scrapy. Aren't they the same?
>
> sure they're the same, but there's a lot of script going on in these
> pages so you have to be able to not only catch the post but also
> process it, for example <body onload=...>.

The web server does not send back the response that I expect. Instead,
it tells me to create a new session. Here is the error message:

"
A system error has occurred.

Please exit the application and open a new session. Please contact
your internal company help desk if your company has one. Otherwise,
for any problem with the Taleo application, please log an incident
through Web Support. In your incident, please describe the problem in
detail (circumstances and actions leading up to the error). If you do
not have access to Web Support, please contact your Taleo system
administrator."
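
One plausible reading of this message, not confirmed anywhere in this thread, is that the POST arrives without the session state a browser sets up by loading the search page first. Whatever the cause, a spider callback can at least recognise this page instead of treating it as a result list. The sketch below uses two phrases from the message above as markers; the function name is made up.

```python
# Phrases taken from the error page quoted above.
SESSION_ERROR_MARKERS = (
    "A system error has occurred",
    "open a new session",
)


def is_session_error_page(body_text):
    """Return True when the text looks like the Taleo error page quoted above."""
    return any(marker in body_text for marker in SESSION_ERROR_MARKERS)
```

A callback could call this on the decoded response text and, when it returns True, log the failure or start again from the search page.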

>
> > Why should scrapy care about what javascript does in client side?
>
> Scrapy doesn't, but you do. I'm not sure what you are asking. it
> appears that there are 141 job descriptions listed in: https://intel.taleo.net/careersection/10020/jobsearch.ftl?lang=en&loc...

lzhshen

Oct 31, 2010, 2:55:08 AM
to scrapy-users
Here is my code skeleton:

class IntelSpider(BaseSpider):
    name = "autodesk.com"
    allowed_domains = ["taleo.net"]

    def start_requests(self):
        req_china = FormRequest(
            'https://autodesk.taleo.net/careersection/adsk_gen/jobsearch.ajax',
            # Headers copied from the browser request.
            headers={
                'Host': 'autodesk.taleo.net',
                'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.10) Gecko/20100915 Ubuntu/9.10 (karmic) Firefox/3.6.10 GTB7.1',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en-us,en;q=0.5',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
                'Keep-Alive': '115',
                'Connection': 'keep-alive',
                'Referer': 'http://autodesk.taleo.net/careersection/adsk_gen/jobsearch.ftl'
            },
            formdata={
                # ... lots of POST data here; they are hidden "input" fields

lzhshen

Nov 1, 2010, 9:57:42 AM
to scrapy-users
Steven Almeroth,

Could you give it a try, if it would not take too much of your time?
Here is the link:
http://autodesk.taleo.net/careersection/adsk_gen/jobsearch.ftl

Thanks in advance!

lzhshen