Project: Question about _index_document function


Jeong

Nov 30, 2012, 8:05:44 PM
to csc-32...@googlegroups.com

I am modifying the _visit_a function in the crawler, and the hint says that I have to add title/alt/text to the index for dest_url. So I looked at the _index_document function, but I am confused about how to pass values to it, since it takes 'soup' as an argument. Also, I don't fully understand what the _index_document function is supposed to do. Any advice, please?

Wesley May

Nov 30, 2012, 8:21:56 PM
to csc-32...@googlegroups.com
"soup" is a BeautifulSoup object (defined at line 316)

_index_document goes through the document, recursively visiting each tag. The relevant "_enter" function is called for each tag (note that _enter is a dictionary that maps tag names to *functions*). I don't think you need to modify this function in any way.
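To make the dispatch idea concrete, here is a minimal sketch of that pattern. It is not the assignment's code: it uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup so it runs without extra installs, and the names enter, visit_title, visit_a, and index_document are hypothetical.

```python
import xml.etree.ElementTree as ET


def visit_title(elem, seen):
    # Handler for <title>: record its text.
    seen.append(("title", (elem.text or "").strip()))


def visit_a(elem, seen):
    # Handler for <a>: record where the link points.
    seen.append(("a", elem.get("href")))


# Hypothetical dispatch table: tag name -> handler *function* (not a call).
enter = {"title": visit_title, "a": visit_a}


def index_document(elem, seen):
    """Recursively visit every tag, calling the handler registered for its name."""
    handler = enter.get(elem.tag)
    if handler is not None:
        handler(elem, seen)
    for child in elem:
        index_document(child, seen)


html = (
    "<html><head><title>Home</title></head>"
    "<body><p><a href='/about'>About</a></p></body></html>"
)
seen = []
index_document(ET.fromstring(html), seen)
print(seen)
```

Tags with no entry in the table (html, head, body, p here) are simply walked through, which is why adding behaviour for a new tag only means registering another function.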

The _visit_a function is called whenever the crawler runs across an <a> tag, which is almost invariably a link to another webpage.

I feel like I'm missing your exact question though. Tell me if that's the case.

Jeong

Nov 30, 2012, 8:36:28 PM
to csc-32...@googlegroups.com

Sorry, bad question haha. The question I had was about calling the _index_document function. The TODO says 'add title/alt/text to index for destination url', and I am assuming that the _index_document function has to be called. However, I do not know how to do that, since the function takes 'soup' as an argument. Or am I on totally the wrong track, and I don't have to call _index_document at all? If so, I don't get what that line wants me to do. Sorry if I don't make sense this time either. lol

Wesley May

Nov 30, 2012, 10:35:38 PM
to csc-32...@googlegroups.com
No, I don't think this has anything to do with the _index_document function. It just means: look at the "title" and "alt" attributes of the <a> tag and store them somewhere, and also get the text from that tag. See the code that's commented out in lines 196 to 199? That's how you grab the info.
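Roughly, the idea looks like the sketch below. This is not the commented-out code from the assignment (those lines aren't reproduced here); it uses the standard library's xml.etree.ElementTree in place of BeautifulSoup, and describe_link is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def describe_link(a_elem):
    """Collect the title/alt attributes and the visible text of one <a> element."""
    return {
        "title": a_elem.get("title"),  # None when the attribute is absent
        "alt": a_elem.get("alt"),
        # itertext() also picks up text inside nested tags such as <b>.
        "text": "".join(a_elem.itertext()).strip(),
    }


a_elem = ET.fromstring(
    '<a href="/docs" title="Documentation">Read the <b>docs</b></a>'
)
info = describe_link(a_elem)
print(info)
```

Whatever you collect this way would then be stored against the destination URL rather than the page currently being crawled, which is what the TODO seems to be asking for.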