On Tue, May 22, 2012 at 7:49 AM, Deepak Shenoy <deepak...@gmail.com> wrote:
> On Mon, May 21, 2012 at 8:46 PM, Venkata Pingali <pin...@gmail.com> wrote:
>> It is an asp.net site. You will not be able to scrape anything
>> behind a form using python/html parsing (been there! done that!).
>
> The one way it can be done is using Windows and automating IE instead.
> Have done that using C#, but yes, it's a painful exercise and NREGA
> more so because they constantly validate with VIEWSTATE and other
> hidden parameters.
>
Right. That's the problem. The server side is very sensitive to these values,
and not being an ASP developer, I don't know the logic by which they are
generated or validated.
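
For illustration, here is a minimal Python sketch of what those hidden
parameters look like from the client side. It only collects the hidden
inputs from a page so they could be echoed back in the next POST. The
sample page, the field values, and the helper names (HiddenFieldParser,
hidden_fields) are made up for this example and are not taken from the
NREGA site. Echoing the values back may well not be enough on its own,
since the server-side validation logic is unknown.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample page; the values are invented for illustration.
SAMPLE_PAGE = """
<form method="post" action="search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf==" />
  <input type="text" name="txtDistrict" />
</form>
"""

if __name__ == "__main__":
    for name, value in hidden_fields(SAMPLE_PAGE).items():
        print(name, "=", value)
```
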
The approach will eventually look like browser automation of some kind:
IE or another framework. PhantomJS uses WebKit (the Chrome/Safari engine).
I looked at the available test automation frameworks, including Selenium.
I needed something to run off AWS servers (read: no display; I found
Xvfb inefficient and hard to use), so I chose to go with PhantomJS. Another
advantage I found is that all the code is in JavaScript, so it is
cross-platform, compact, and potentially convertible into a
Chrome extension, which I am working on.
> Having said that, the end result is a bunch of HTML files (for job
> cards) which I assume can be obtained using some test framework tools?
That's what is going on here. Once we get the HTML, we can use whatever
mechanism suits. I used Python.
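
As a rough idea of the Python side, here is a minimal sketch that pulls
table rows out of a saved HTML file using only the standard library. The
table layout, the column names, and the helper names (TableRowParser,
table_rows) are assumptions for illustration; the real job card pages
will differ and may need more careful handling.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample; the columns and values are invented.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Days worked</th></tr>
  <tr><td> Worker   A </td><td>12</td></tr>
  <tr><td>Worker B</td><td>7</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in table_rows(SAMPLE_TABLE):
        print(row)
```
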