
Parsing/Crawler Questions..


bruce

Mar 4, 2009, 4:44:30 PM
to pytho...@python.org
Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information. The required information is based upon an XPath analysis of
the DOM for the given pages that I'm parsing.

Now that I have a "basic" app that works, my issue is figuring out how to
guarantee that I'm correctly crawling the site. How do I know when I've got
an error at a given node/branch, so that the app knows it's not going to
fetch the underlying branches/nodes of the tree..

When running the app, I can get 5000 classes on one run, 4700 on another,
etc... So I need some method of determining when I get a "complete" tree...

How do I know when I have a complete "tree"?

I'm looking for someone, or some group/prof, that I can talk to about these
issues. I've been searching Google, LinkedIn, etc.. for someone to bounce
thoughts off of..!

Any pointers, or people, or papers, etc... would be helpful.

Thanks


MRAB

Mar 4, 2009, 5:19:20 PM
to pytho...@python.org
bruce wrote:
> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information. The required information is based upon an XPath analysis of
> the DOM for the given pages that I'm parsing.
>
> Now that I have a "basic" app that works, my issue is figuring out how to
> guarantee that I'm correctly crawling the site. How do I know when I've got
> an error at a given node/branch, so that the app knows it's not going to
> fetch the underlying branches/nodes of the tree..
>
[snip]
If you were crawling the site yourself, how would _you_ know when you
had an error at a given node/branch?

Philip Semanchuk

Mar 4, 2009, 9:14:30 PM
to python-list (General)

On Mar 4, 2009, at 4:44 PM, bruce wrote:

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in Python to crawl
> the site. I crawl at the top level, and work my way down to getting the
> required course/class schedule. The app works. I can consistently run it
> and extract the information. The required information is based upon an
> XPath analysis of the DOM for the given pages that I'm parsing.
>
> Now that I have a "basic" app that works, my issue is figuring out how to
> guarantee that I'm correctly crawling the site. How do I know when I've
> got an error at a given node/branch, so that the app knows it's not going
> to fetch the underlying branches/nodes of the tree..
>
> When running the app, I can get 5000 classes on one run, 4700 on another,
> etc... So I need some method of determining when I get a "complete"
> tree...
>
> How do I know when I have a complete "tree"?


hi Bruce,
To put this another way, you're trying to convince yourself that your
program is correct, yes? For instance, you're worried that you might
be doing something like discovering a URL on a site but failing to
pursue that URL, yes?

The standard way of testing any program is to test known input and
look for expected output. Repeat as necessary. In your case that would
mean crawling a site where you know all of the URLs and seeing if your
program finds them all. And that, of course, isn't proof of
correctness, it just means that that particular site didn't trigger
any error conditions that would cause your program to misbehave.

I think every modern OS makes it easy to run a Web server on your
local machine. You might want to set up a suite of test sites on your
machine and point your program at localhost. That way you can build a
site to test your application in areas you fear it may be weak.

I'm unclear on what you're using to parse the pages, but (X)HTML is
very often invalid in the strict sense of validity. If the tools
you're using expect/insist on well-formed XML or valid HTML, they'll
be disappointed on most sites and you'll probably be missing URLs. The
canonical solution for parsing real-world Web pages with Python is
BeautifulSoup.
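
For what it's worth, here's a minimal sketch of that approach, written
against current Python 3 and the beautifulsoup4 package; the localhost URL
is just a hypothetical test site of the kind suggested above:

    # A sketch, not production crawler code. Assumes beautifulsoup4 is
    # installed (pip install beautifulsoup4).
    import urllib.request
    from bs4 import BeautifulSoup

    def extract_links(url):
        """Fetch a page and return every href found, even in sloppy HTML."""
        html = urllib.request.urlopen(url, timeout=30).read()
        soup = BeautifulSoup(html, "html.parser")  # lenient, not strict XML
        return [a["href"] for a in soup.find_all("a", href=True)]

    for link in extract_links("http://localhost:8000/"):  # hypothetical test site
        print(link)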

HTH
Philip


bruce

Mar 4, 2009, 11:07:54 PM
to Philip Semanchuk, python-list (General)
hi Philip...

thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parser
for usc (southern cal) and it extracts the course list/class schedule... my
issue was that i realized that multiple runs of the app were giving different
results... in my case, the class schedule isn't static.. (actually, none of
the class/course lists need be static.. they could easily change).

so i don't have a priori knowledge of what the actual class/course list site
would look like, unless i physically examined the site each time i run the
app...

i'm inclined to think i might need to run the parser a number of times
within a given time frame, and then take a union/join of the output of the
different runs.. this would, in theory, give me a high probability that i'd
get 100% of the class list...
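
a rough sketch of that union/join idea (crawl_courses here is a hypothetical
stand-in for the real parser, assumed to return one set of (dept, course)
tuples per run):

    # A sketch of the union-of-runs idea, not working crawler code.
    def union_of_runs(crawl_courses, runs=5):
        """Crawl several times and merge the results into one set."""
        seen = set()
        for _ in range(runs):
            seen |= crawl_courses()  # union keeps anything any run found
        return seen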

most crawlers, and most of the research that i've seen, focus on the
indexing or crawling function/architecture.. haven't really seen any
articles/research/pointers dealing with this kind of process...

thoughts/comments are welcome..

thanks




John Nagle

Mar 5, 2009, 1:22:42 AM
to
bruce wrote:
> hi Philip...
>
> thanks for taking a sec to reply...
>
> i'm solid on the test app i've created.. but as an example.. i have a parser
> for usc (southern cal) and it extracts the course list/class schedule... my
> issue was that i realized that multiple runs of the app were giving different
> results... in my case, the class schedule isn't static.. (actually, none of
> the class/course lists need be static.. they could easily change).
>
> so i don't have a priori knowledge of what the actual class/course list site
> would look like, unless i physically examined the site each time i run the
> app...
>
> i'm inclined to think i might need to run the parser a number of times
> within a given time frame, and then take a union/join of the output of the
> different runs.. this would, in theory, give me a high probability that i'd
> get 100% of the class list...

I think I see the problem. I took a look at the USC class list, and
it's been made "Web 2.0". When you read the page, you don't get the
class list; you get a Javascript thing that builds a class list on
demand, using JSON, no less.

See "http://web-app.usc.edu/soc/term_20091.html".

I'm not sure how you're handling this. The Javascript actually
has to be run before you get anything.

John Nagle

bruce

Mar 5, 2009, 9:59:52 AM
to John Nagle, pytho...@python.org
hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers that point to a given classlist site.. the
scripts work.

the issue that one faces is that you never "know" if you've gotten all of
the items/links that you're looking for based on the XPath functions. This
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses etc...)

So I'm trying to figure out an approach to handling these issues...

As far as I can tell... An approach might be to run the parser script across
the target site X number of times within a narrow timeframe (a few minutes).
Based on the results of this process, you might be able to develop an
overall "tree" of what the actual class/course links/list should be. But you
don't know from hour to hour, day to day if this list is stable, as it could
change..

The only way you know for certain is to physically examine a site. You can't
do this if you're going to develop an automated system for 5-10 sites, or
for 500-1000...

These are the issues that I'm grappling with.. not how to write the XPath
parsing functions...

Thanks..


-----Original Message-----
From: python-list-bounces+bedouglas=earthl...@python.org
[mailto:python-list-bounces+bedouglas=earthl...@python.org]On Behalf

See "http://web-app.usc.edu/soc/term_20091.html".

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

John Nagle

Mar 5, 2009, 11:37:48 AM
to
bruce wrote:
> hi john..
>
> You're missing the issue, so a little clarification...
>
> I've got a number of test parsers that point to a given classlist site.. the
> scripts work.
>
> the issue that one faces is that you never "know" if you've gotten all of
> the items/links that you're looking for based on the XPath functions. This
> could be due to an error in the parsing, or it could be due to an admin
> changing the site (removing/adding courses etc...)

What URLs are you looking at?

John Nagle

bruce

Mar 5, 2009, 12:31:23 PM
to John Nagle, pytho...@python.org
hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.

i think an approach will be to fire up a number of parsing attempts, and to
track the returned depts/classes/etc... in theory (hopefully) i should be
able to create a process to build a kind of statistical representation of
what the site looks like (names of depts, names/numbers of classes for given
depts, etc..) if i'm correct, this would provide a complete
"list/understanding" of what the course list looks like....

i could then run the parsing process a number of times, examining the actual
values/results for each query, and taking the highest/oldest values for the
given query.. the idea being that the app will return correct results for
most of the queries, most of the time.. so on a statistical basis.. i can
take the results that are returned with the highest frequency...
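
a rough sketch of that highest-frequency idea (run_query here is a
hypothetical stand-in that returns one hashable snapshot of results per
crawl, e.g. a frozenset):

    # A sketch of the majority-vote idea, not production code.
    from collections import Counter

    def most_frequent_result(run_query, runs=7):
        """Run the same query several times; keep the answer seen most often."""
        tally = Counter(run_query() for _ in range(runs))
        snapshot, count = tally.most_common(1)[0]
        return snapshot  # the result returned by the plurality of runs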

so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...


thoughts...

thanks

-----Original Message-----
From: python-list-bounces+bedouglas=earthl...@python.org
[mailto:python-list-bounces+bedouglas=earthl...@python.org]On Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: pytho...@python.org
Subject: Re: Parsing/Crawler Questions..

John Nagle

Philip Semanchuk

Mar 5, 2009, 12:36:05 PM
to python-list (General)

On Mar 5, 2009, at 12:31 PM, bruce wrote:

> hi..
>
> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
> this time.

Not if we're to understand the situation you're trying to describe.
From what I can tell, you're saying that the target site displays
different results each time your crawler visits it. It's as if e.g.
the site knows about 100 courses but only displays 80 randomly chosen
ones to each visitor. If that's the case, then it is truly bizarre.


John Nagle

Mar 5, 2009, 1:54:07 PM
to
Philip Semanchuk wrote:
> On Mar 5, 2009, at 12:31 PM, bruce wrote:
>
>> hi..
>>
>> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
>> this time.
>
> Not if we're to understand the situation you're trying to describe. From
> what I can tell, you're saying that the target site displays different
> results each time your crawler visits it. It's as if e.g. the site knows
> about 100 courses but only displays 80 randomly chosen ones to each
> visitor. If that's the case, then it is truly bizarre.

Agreed. The course list isn't changing that rapidly.

I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess. Is that right?

I've had to deal with that in Javascript. My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers. There, I have to watch for page-change events
and update the annotations I'm adding to ads.

But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format. See

http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1. Start here: "http://web-app.usc.edu/soc/term_20091.html"
2. Examine all the department pages under that page.
3. On each page, look for the value of "coursesrc", like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4. For each "coursesrc" value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5. Read that URL. This will return the department's course list in
JSON format.
6. From the JSON tree, pull out CourseData items, which look like this:

CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the
environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},

Parsing the JSON is left as an exercise for the student. (There's
a Python module for that.)
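
A rough sketch of those six steps, written against current Python 3 with
only the standard library (step 2's department-page discovery is elided,
the regex is a simplification, and the USC URLs may of course change):

    # A sketch of the six steps above, not a robust crawler.
    import json
    import re
    import urllib.request

    BASE = "http://web-app.usc.edu"

    def fetch(url):
        return urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")

    def department_courses(dept_page_url):
        # Step 3: find  var coursesrc = '/ws/soc/api/classes/aest/20091'
        m = re.search(r"var\s+coursesrc\s*=\s*'([^']+)'", fetch(dept_page_url))
        if m is None:
            return None
        # Steps 4-5: build the API URL and read the JSON course list.
        # Step 6, pulling out the CourseData items, is left to the caller.
        return json.loads(fetch(BASE + m.group(1)))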

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.

John Nagle

bruce

Mar 5, 2009, 1:52:09 PM
to John Nagle, pytho...@python.org
john...

again.... the problem i'm facing really has nothing to do with a specific
url... the app i have for the usc site works...

but for any number of reasons... you might get different results when
running the app..
-the server could be screwed up..
-data might be cached
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption that most people appear to make is that if you create a
parser, and run and test it once.. then if it gets you the data, it's
working.. when you run the same app.. 100s of times, and you're slamming the
webserver... then you realize that that's a vastly different animal than
simply running a single query a few times...

so.. nope, i'm not running the app and getting data from a dynamic page that
hasn't finished loading/creating the content..

but what my analysis is showing, not only for the usc site, but for others as
well.. is that there might be differences in what gets returned...

which is where a smoothing algorithmic approach appears to be workable..

i've been starting to test this approach, and it actually might have a
chance of working...

so.. as i've stated a number of times.. focusing on a specific url isn't the
issue.. the larger issue is how you can
programmatically/algorithmically/automatically, be reasonably sure that
what you have is exactly what's on the site...

ain't screen scraping fun!!!

-----Original Message-----
From: python-list-bounces+bedouglas=earthl...@python.org
[mailto:python-list-bounces+bedouglas=earthl...@python.org]On Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 10:54 AM
To: pytho...@python.org

http://web-app.usc.edu/soc/dev/scripts/soc.js

John Nagle

bruce

Mar 5, 2009, 5:50:37 PM
to John Nagle, pytho...@python.org
hi john...


update...

further investigation has revealed that apparently, for some urls/sites, the
server serves up pages that take a while to be fetched... this appears to be
a potential problem, in that it appears that the parse script never gets
anything from the python mech/urllib read function.

the curious issue is that i can run a single test script, pointing to the
url, and after a bit of time.. the resulting content is fetched/downloaded
correctly. by the way, i can get the same results in my test browsing
environment, if i start it with only a subset of the urls that i've been
using to test the app.

hmm... might be a resource issue, a timing issue,.. or something else...
hmmm...
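
one quick way to separate a slow-server/timing issue from everything else is
an explicit timeout plus retries.. a rough sketch (plain urllib, not the
mech-based code):

    # A sketch of timeout-plus-retry fetching, to tell slow servers apart
    # from genuinely missing content.
    import time
    import urllib.request

    def fetch_with_retries(url, tries=3, timeout=60, backoff=5):
        """Return page bytes, retrying slow/failed reads with a pause."""
        for attempt in range(1, tries + 1):
            try:
                return urllib.request.urlopen(url, timeout=timeout).read()
            except OSError:  # URLError and socket timeouts both subclass OSError
                if attempt == tries:
                    raise
                time.sleep(backoff * attempt)  # wait longer after each failure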

thanks

http://web-app.usc.edu/soc/dev/scripts/soc.js

John Nagle

rounde...@gmail.com

Mar 6, 2009, 7:19:10 PM
to
So, it sounds like your update means that it is related to a specific
url.

I'm curious about this issue myself. I've often wondered how one
could properly crawl an AJAX-ish site when you're not sure how quickly
the data will be returned after the page has been loaded.

John, your advice has really helped me. Bruce / anyone else, have you
had any further experience with this type of parsing / crawling?

> From: python-list-bounces+bedouglas=earthlink....@python.org
>
> [mailto:python-list-bounces+bedouglas=earthlink....@python.org]On Behalf


> Of John Nagle
> Sent: Thursday, March 05, 2009 10:54 AM

> To: python-l...@python.org

Lie Ryan

Mar 7, 2009, 4:14:43 AM
to
bruce wrote:
> john...
>
> again.... the problem i'm facing really has nothing to do with a specific
> url... the app i have for the usc site works...
>
> but for any number of reasons... you might get different results when
> running the app..
> -the server could be screwed up..
> -data might be cached
> -data might be changed, and not updated..
> -actual app problems...
> -networking issues...
> -memory corruption issues...
> -process constraint issues..
> -web server overload..
> -etc...
>
> the assumption that most people appear to make is that if you create a
> parser, and run and test it once.. then if it gets you the data, it's
> working.. when you run the same app.. 100s of times, and you're slamming the
> webserver... then you realize that that's a vastly different animal than
> simply running a single query a few times...

The assumption is that most websites edit and remove data from time to time,
so using the union of data collected throughout several runs might
populate your program with redundant (but slightly different) or
outdated data. The assumption is that this redundant or outdated data is
not useful to most people.
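
One way to keep the union while shedding outdated entries is to remember
when each item was last seen and expire anything no recent run has reported;
a small sketch (crawl() is a hypothetical stand-in for the parser):

    # A sketch of union-with-expiry: keep items any recent run reported,
    # drop items no run has seen lately.
    import time

    def refresh(last_seen, crawl, max_age=24 * 3600):
        """Update last-seen times from one crawl, then drop stale items."""
        now = time.time()
        for item in crawl():
            last_seen[item] = now
        return {k: v for k, v in last_seen.items() if now - v <= max_age}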

lkcl

Mar 7, 2009, 5:34:17 AM
to
On Mar 7, 12:19 am, rounderwe...@gmail.com wrote:
> So, it sounds like your update means that it is related to a specific
> url.
>
> I'm curious about this issue myself. I've often wondered how one
> could properly crawl an AJAX-ish site when you're not sure how quickly
> the data will be returned after the page has been loaded.

you want to look at the webkit engine - no, not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no, not the
"original" version, the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite it being based on the GTK
"port" as they like to call it in webkit (which just means that it
links with GTK, not QT4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.

dummy functions for "mouse", "keyboard", "console errors" are given as
examples and are left as an exercise for the application writer to
fill-in-the-blanks.

combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE, NOT AS A GUI APP; then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
expected.

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.


http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit, from pyjd.py.

pyjd.py is based on pywebkitgtk's demobrowser.py

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of python-
bindings-to-glib-bindings-to-webkit.


l.

bruce

Mar 7, 2009, 4:56:09 PM
to lkcl, pytho...@python.org
....

and this solution will somehow allow a user to create a web parsing/scraping
app for parsing links, and javascript, from a web page?


-----Original Message-----
From: python-list-bounces+bedouglas=earthl...@python.org
[mailto:python-list-bounces+bedouglas=earthl...@python.org]On Behalf

Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: pytho...@python.org
Subject: Re: Parsing/Crawler Questions - solution


http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540


l.

lkcl

Mar 8, 2009, 5:22:48 AM
to
On Mar 7, 9:56 pm, "bruce" <bedoug...@earthlink.net> wrote:
> ....
>
> and this solution will somehow allow a user to create a web parsing/scraping
> app for parsing links, and javascript, from a web page?


not just parsing the links and the "static" javascript, but:

* actually executing the javascript, giving the "page" a
chance to actually _look_ like it would if it was being viewed in a
"real" web browser.

so any XMLHTTPRequests will _actually_ get executed, and _actually_
result in _actually_ having the content of the web page _properly_
modified.

so, e.g. instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password), because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
accessed thereafter.


* giving the user the opportunity to call DOM methods such as
getElementsByTagName and the opportunity to access properties such as
document.anchors.

in webkit-glib "gdom" bindings, that would be:

* anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

or

* g_object_get(doc, "anchors", &anchor_list, NULL);

which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:

* doc.get_elements_by_tag_name("a")

or

* doc.props.anchors

which in pyjamas-desktop, a high-level abstraction on top of _that_,
turns into:

* from pyjamas import DOM
anchor_list = DOM.getElementsByTagName(doc, "a")

or

* from pyjamas import DOM
anchor_list = DOM.getAttribute(doc, "anchors")

answer: yes.

l.

> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink....@python.org
>
> [mailto:python-list-bounces+bedouglas=earthlink....@python.org]On Behalf

> Of lkcl


> Sent: Saturday, March 07, 2009 2:34 AM

> To: python-l...@python.org

> http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3c...
