Hi,
I want to extract job descriptions (JDs) from the official websites of
the top 500 companies and store them on my website. With Scrapy's help,
I was able to extract JDs from a few websites, such as IBM and Morgan.
However, I ran into a problem when trying to do the same for the Intel
website: after I send the request via the POST method, it just returns
a simple page that is almost empty. By the way, with HttpFox (a Firefox
add-on), I can see that a lot of POST data is sent when I submit the
request from my browser, which is different from the other sites.
Could someone give me some hints on how to crawl this website? Here
is the URL as accessed from a browser:
http://www.intel.com/jobs/jobsearch/index_ne.htm?Location=200000008
Here is my spider's skeleton:
from scrapy.http import FormRequest
# Assumed import path; older Scrapy releases provide BaseSpider here.
from scrapy.spider import BaseSpider


class IntelSpider(BaseSpider):
    name = "intel.com"
    allowed_domains = ["taleo.net"]

    def start_requests(self):
        # POST the advanced-search form to the Taleo Ajax endpoint.
        req_china = FormRequest(
            "https://intel.taleo.net/careersection/750PRD.18.6.3.3.0/html/ajax.htm",
            formdata={
                'iframemode': '1',
                'ftlpageid': 'reqListAdvancedPage',
                'ftlinterfaceid': 'advancedSearchFooterInterface',
                'ftlcompid': 'SEARCH',
                # ... a lot more form data is omitted here ...
                'location1L2': '-1',
                'dropListSize': '25',
                'dropSortBy': '10'},
            callback=self.test)
        return [req_china]

    def test(self, response):
        # Dump the raw response body for inspection.
        print(response.body)
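Since the form carries a lot of fields, one way to avoid retyping them
is to paste the raw POST body captured in HttpFox and convert it into
the dict that formdata expects. This is only a minimal sketch, written
for Python 3 and assuming the captured body is URL-encoded; the helper
name post_body_to_formdata is made up for illustration, and the sample
string reuses only field names from the skeleton above.

```python
from urllib.parse import parse_qsl


def post_body_to_formdata(raw_body):
    """Turn a URL-encoded POST body into a dict of form fields.

    keep_blank_values=True keeps fields that are sent with an empty
    value. Note that a plain dict keeps only the last value when a
    field name is repeated in the body.
    """
    return dict(parse_qsl(raw_body, keep_blank_values=True))


# Sample body built only from fields shown in the spider skeleton.
sample = ("iframemode=1&ftlpageid=reqListAdvancedPage"
          "&location1L2=-1&dropListSize=25&dropSortBy=10")
formdata = post_body_to_formdata(sample)
```

The resulting dict could then be passed as the formdata argument of
FormRequest, so that the spider sends the same fields as the browser.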
Here is the printed output:
shen@shen-laptop:~/Dropbox/app/jdspider$ scrapy crawl intel.com
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' lang='en' xml:lang='en'>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
<title>Ajax FTL Proxy</title>
<script type='text/javascript'>
function pushDataToParent()
{
var responseContainer = document.getElementById('response');
var response = responseContainer.innerHTML;
if (response == '')
{
response = null;
}
try
{
if (response == null)
{
//tt117496: PATCH for support the back-Button of browser
//tt130201: PATCH for support the back-Button of browser for all ftl List
if (parent.ftlUtil_getInitialHistoryData)
{
//tt117496: PATCH for support the back-Button of browser
var repTmp = parent.ftlUtil_getInitialHistoryData();
if (repTmp != undefined && repTmp != null && repTmp != '')
{
response = repTmp;
responseContainer.innerHTML = response;
}
}
}
//tt117496: PATCH for support the back-Button of browser
if (response != null)
{
parent.ftlUtil_ajaxResponseReady(response);
}
}
catch(e)
{
// Nothing to do
}
}
</script>
</head>
<body onload='javascript:pushDataToParent();'>
<!-- Section 508 Compliance for Cynthia Says Tool -->
Empty frame. Ignore.
<table><tr><td id='response'></td></tr></table>
<form action='' method='post'>
</form>
</body>
</html>
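In the page printed above, the cell with id 'response' is empty, and
the script only hands that cell's content to the parent frame, so this
looks like the bare Ajax proxy frame with no search results in it. A
quick way to detect this case from a callback is sketched below with
the Python 3 standard library only. The names ResponseCellParser and
response_payload are hypothetical, and the sketch assumes well-formed
XHTML (every opened tag is closed), as the DOCTYPE above declares.

```python
from html.parser import HTMLParser


class ResponseCellParser(HTMLParser):
    """Collect the text inside the element whose id is 'response'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []  # text fragments found inside the target element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag inside the target element.
            self.depth += 1
        elif dict(attrs).get("id") == "response":
            # Entering the target element itself.
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def response_payload(html_text):
    """Return the stripped text of the 'response' cell ('' if empty)."""
    parser = ResponseCellParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()
```

An empty string from response_payload means the server sent back the
empty frame rather than a job list, which would suggest that the request
is missing something the browser sends.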
Could someone help me with this? Thanks in advance!
Leo