How to submit a form


SkyScraper

Jan 2, 2015, 5:39:54 AM
to scrap...@googlegroups.com
I am trying to extract data from a public database that unfortunately does not accept wildcards in its search inputs. I can enter dates manually, but I thought it would be cool to write a script to do this for me.

The form on the website requires three inputs. Two of them will always be the same in my search, but the third (the date) needs to change whenever submitting the form yields no results. For example, if input 1 and input 2 on 01-01-2001 yield no result, the next search should be input 1 and input 2 on 02-01-2001, and so on until a search yields a result, which should then be logged.

I have never written code before and am trying to use Python. So far I am having no luck, not even with the basic submission of the form (without automatically changing the date when there is no result). Any tips or links to useful tutorials for this kind of form-submitting scraper would be much appreciated. I am using Firefox on a Windows machine, but can use a Mac with Safari if that is easier. I am also willing to switch to Ruby if that is easier. Thanks!

Here is the relevant website source code:


<form action="/HuwelijksgoederenRegister/Zoeken" method="get">
<div class="form" id="zoekenform">
    <fieldset id="zoekcriteriahuwelijk" >
        <legend>Zoekcriteria</legend>
        <dl class="mandatoryprepared">
            <dt class="editor-label">
                <img src="/SharedWebResources.axd?images/icon_mandatoryfield.gif" alt="dit veld is verplicht" title="dit veld is verplicht"/>
                <label for="HuwelijkZoekArgumenten_AchternaamPartner1">Partner achternaam</label>
            </dt>
            <dd class="editor-field">
                <input id="HuwelijkZoekArgumenten_AchternaamPartner1" maxlength="50" name="HuwelijkZoekArgumenten.AchternaamPartner1" type="text" value="" />
                <label>(geen voorvoegsel)</label>
                
            </dd>

            <dt id="partner2" class="editor-label">
                <img src="/SharedWebResources.axd?images/icon_mandatoryfield.gif" alt="dit veld is verplicht" title="dit veld is verplicht"/>
                <label for="HuwelijkZoekArgumenten_AchternaamPartner2">Partner achternaam</label>
            </dt>
            <dd class="editor-field">
                <input id="HuwelijkZoekArgumenten_AchternaamPartner2" maxlength="50" name="HuwelijkZoekArgumenten.AchternaamPartner2" type="text" value="" />
                <label>(geen voorvoegsel)</label>
                
            </dd>

            <dt class="editor-label">
                <img src="/SharedWebResources.axd?images/icon_mandatoryfield.gif" alt="dit veld is verplicht" title="dit veld is verplicht"/>
                <label for="HuwelijkZoekArgumenten_DatumVerbintenis">Datum gehuwd / geregistreerd partnerschap</label>
            </dt>
            <dd class="editor-field">
                <input class="date" id="HuwelijkZoekArgumenten_DatumVerbintenis" maxlength="10" name="HuwelijkZoekArgumenten.DatumVerbintenis" type="text" value="" />
                (dd-mm-jjjj)
                
            </dd>
        </dl>

        <div class="clear">
        </div>

        <div class="buttoncontainer">
            <input type="submit" name="zoekBtn" value="zoek" title="" class="button" id="defaultbutton" />
            <input type="submit" name="wisBtn" value="wis" title="" class="button" />
        </div>

        <div class="mandatoryinfo"><img src="/SharedWebResources.axd?images/icon_mandatoryfield.gif" alt="dit veld is verplicht" title="dit veld is verplicht"/></div><p><em>Deze velden zijn verplicht.</em></p>
    </fieldset>
</div>

<div class="clear">
</div></form>
<div id="resultaten">
</div>



Here is the code I cobbled together:

#!/usr/bin/env python

from mechanize import Browser

br = Browser()

# Ignore robots.txt
br.set_handle_robots(False)
# Send a browser-like User-Agent header instead of mechanize's default
br.addheaders = [('User-agent', 'Firefox')]

br.open("http://hgr.rechtspraak.nl/")
br.select_form(nr=0)
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
# Text inputs take plain strings; list values are only for select/checkbox controls.
br["HuwelijkZoekArgumenten.AchternaamPartner1"] = "NAME1"  # (the method here is __setitem__)
br["HuwelijkZoekArgumenten.AchternaamPartner2"] = "NAME2"
br["HuwelijkZoekArgumenten.DatumVerbintenis"] = "DD-MM-YYYY"
response = br.submit()  # submit current form

# Show the HTML of the page that came back
print(response.read())

'Dragon' Dave McKee

Jan 5, 2015, 6:53:42 AM
to scrap...@googlegroups.com
Mechanize is pretty difficult to use: I avoid it whenever I can.

Instead, I'd look at the URL that the search generates when you click through.

Copy-pasting that directly into my code seems to work:

import requests
# url holds the full search URL copied from the browser's address bar
print(requests.get(url).content)

so I can think about writing this a bit more nicely; either with explicit parameter passing:

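import requests
# Assumed: the form's action path on the site opened in the original post
baseurl = "http://hgr.rechtspraak.nl/HuwelijksgoederenRegister/Zoeken"
params = {"HuwelijkZoekArgumenten.AchternaamPartner1": "o",
          "HuwelijkZoekArgumenten.AchternaamPartner2": "t",
          "HuwelijkZoekArgumenten.DatumVerbintenis": "01-01-2010",
          "zoekBtn": "zoek"}
print(requests.get(baseurl, params=params).content)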

or by using string formatting to fill in the values: each occurrence of {} in the URL is replaced, in order, by one of the arguments passed to format().
(I often use this when there's one option I want to tweak, but there are dozens of parameters in the URL I don't want to change.)

import requests
# format_string_url is the copied search URL with {} in place of each value to vary
print(requests.get(format_string_url.format('o', 't', '01-01-2010')).content)
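To see that the two approaches end up requesting the same address, here is a rough sketch that builds the URL both ways without touching the network. The base address is an assumption (the form's action path joined to the host from the original post), the function names are made up for illustration, and the standard library's urlencode stands in for roughly what requests does with params=.

```python
from urllib.parse import urlencode

# Assumption: the form's action path joined to the host opened in the original post.
BASE_URL = "http://hgr.rechtspraak.nl/HuwelijksgoederenRegister/Zoeken"


def url_from_params(name1, name2, date_text):
    """Approach 1: name every parameter and let urlencode build the query string."""
    params = {
        "HuwelijkZoekArgumenten.AchternaamPartner1": name1,
        "HuwelijkZoekArgumenten.AchternaamPartner2": name2,
        "HuwelijkZoekArgumenten.DatumVerbintenis": date_text,
        "zoekBtn": "zoek",
    }
    return BASE_URL + "?" + urlencode(params)


# Approach 2: a URL template with {} wherever a value should be dropped in.
FORMAT_STRING_URL = (
    BASE_URL
    + "?HuwelijkZoekArgumenten.AchternaamPartner1={}"
    + "&HuwelijkZoekArgumenten.AchternaamPartner2={}"
    + "&HuwelijkZoekArgumenten.DatumVerbintenis={}"
    + "&zoekBtn=zoek"
)


def url_from_template(name1, name2, date_text):
    """Fill the three placeholders in order; values are inserted as-is, unescaped."""
    return FORMAT_STRING_URL.format(name1, name2, date_text)


print(url_from_params("o", "t", "01-01-2010"))
print(url_from_params("o", "t", "01-01-2010") == url_from_template("o", "t", "01-01-2010"))
```

One difference to keep in mind: the template inserts values exactly as given, while urlencode escapes them, so a surname containing a space would come out differently in the two versions.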

Also worth noting: NAME1 isn't a valid name on the website; it raises an error.
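The other half of the original question, stepping the date forward one day at a time until a search returns something, could be sketched roughly like this. The function names, the fetch and has_results callables, and the 366-day cap are all made up for illustration; how to recognise a results page is not shown anywhere in this thread, so that check is left for the caller to supply.

```python
from datetime import date, timedelta


def candidate_dates(start, max_days):
    """Yield dd-mm-yyyy strings for max_days consecutive days, beginning at start."""
    for offset in range(max_days):
        yield (start + timedelta(days=offset)).strftime("%d-%m-%Y")


def first_date_with_results(fetch, has_results, start, max_days=366):
    """Try each date in turn.

    fetch(date_text) should return the page for that date, and
    has_results(page) should say whether the page contains a hit.
    Returns (date_text, page) for the first hit, or None if none is found.
    """
    for date_text in candidate_dates(start, max_days):
        page = fetch(date_text)
        if has_results(page):
            return date_text, page
    return None


# Offline demonstration: a dictionary stands in for the real website.
fake_pages = {"03-01-2001": "FOUND"}
hit = first_date_with_results(
    fetch=lambda date_text: fake_pages.get(date_text, ""),
    has_results=lambda page: page == "FOUND",
    start=date(2001, 1, 1),
)
print(hit)
```

Against the real site, fetch could wrap the requests.get call with explicit parameters shown earlier, swapping in the date each time, and has_results would need to look for whatever the results page actually contains. A short pause between requests (time.sleep) is a sensible courtesy to a public server.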

Best of luck getting this to work!

Dragon.

SkyScraper

Jan 5, 2015, 9:22:42 AM
to scrap...@googlegroups.com
Thanks Dragon, much appreciated. Happy new year to you.
