Re: Can't find tags in XML

Leonard Richardson

Nov 2, 2012, 3:35:49 PM
to beautifulsoup
Julie,

You're feeding the BeautifulSoup object the literal string
"index.rss". You want to open the file called index.rss and use that
instead:

xml_soup = BeautifulSoup(open('index.rss'), 'xml', from_encoding = "iso-8859-1")
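To illustrate the difference, here is a minimal sketch, assuming Python 3 and the bs4 package. The sample feed and temporary file are made up for the illustration, and it uses the built-in 'html.parser' backend rather than the 'xml' (lxml) backend from the call above, so that it runs without lxml installed:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical two-item feed standing in for the real index.rss.
FEED = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/1"><title>First</title></item>
  <item rdf:about="http://example.org/2"><title>Second</title></item>
</rdf:RDF>
"""

# A plain string is treated as markup, not as a path to open,
# so the "document" here is just the nine characters of the file name.
from_name = BeautifulSoup("index.rss", "html.parser")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.rss"
    path.write_bytes(FEED)
    # An open file object is read by the constructor and its contents parsed.
    with open(path, "rb") as handle:
        from_file = BeautifulSoup(handle, "html.parser")

print(len(from_name.find_all("item")))  # no <item> tags in the file name itself
print(len(from_file.find_all("item")))  # the two <item> tags from the file
```

Opening the file in binary mode leaves the decoding to Beautiful Soup, which is where an argument such as from_encoding comes into play.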

Leonard

On Fri, Nov 2, 2012 at 3:30 PM, Julie Swoope <swo...@gmail.com> wrote:
> Hi everyone,
>
> I'm new to BeautifulSoup and I've been trying to get it to print some simple
> xml items. I have lxml installed, and I have the xml in a file called
> 'index.rss'. I'm just trying to print some simple data from it to test out
> beautifulsoup, but the code below doesn't work.
>
> I have tried it with and without the encoding argument to the BeautifulSoup
> constructor, and I've tried passing in a file with an .xml file extension,
> and I've tried converting the data in index.rss to a string, but none of it
> is working. I've also tried using findAll('item') and find_all('item').
> When the 'print items' line executes, I just get an empty list.
>
> Any advice? Thanks!
>
>
>
> xml_soup = BeautifulSoup('index.rss', 'xml', from_encoding = "iso-8859-1")
> items = xml_soup('item')
> print items
>
>
> #
> #sample of what is in index.rss (it's craigslist data)
> #
> <?xml version="1.0" encoding="iso-8859-1"?>
>
> <rdf:RDF
> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> xmlns="http://purl.org/rss/1.0/"
> xmlns:ev="http://purl.org/rss/1.0/modules/event/"
> xmlns:content="http://purl.org/rss/1.0/modules/content/"
> xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
> xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
> xmlns:dcterms="http://purl.org/dc/terms/"
> xmlns:admin="http://webns.net/mvcb/"
>>
>
> <channel rdf:about="http://newyork.craigslist.org/brk/aap/index.rss">
> <title>craigslist | all apartments in brooklyn</title>
> <link>http://newyork.craigslist.org/brk/aap/</link>
> <description></description>
> <dc:language>en-us</dc:language>
> <dc:rights>Copyright &#x26;copy; 2012 craigslist, inc.</dc:rights>
> <dc:publisher>ro...@craigslist.org</dc:publisher>
> <dc:creator>ro...@craigslist.org</dc:creator>
> <dc:source>http://newyork.craigslist.org/brk/aap/index.rss</dc:source>
> <dc:title>craigslist | all apartments in brooklyn</dc:title>
> <dc:type>Collection</dc:type>
> <syn:updateBase>2012-11-02T14:08:19-04:00</syn:updateBase>
> <syn:updateFrequency>4</syn:updateFrequency>
> <syn:updatePeriod>hourly</syn:updatePeriod>
> <items>
> <rdf:Seq>
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381915303.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381915357.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381914720.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381914138.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3340953908.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381912292.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381912185.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381911670.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381911216.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381910810.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381910650.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381910613.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381910485.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381909630.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3365129345.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381909421.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381908899.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3370117661.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381908131.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381908198.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3326966342.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3365045005.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3341354104.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381907552.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381907065.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381907091.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3370559196.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381906851.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381906735.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3362179464.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3329916137.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381906460.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381906190.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3375349027.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381906045.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3334187580.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/nfb/3381905572.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381905413.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3339904275.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3329576424.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3318517908.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381904246.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3339965674.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381903994.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381903596.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3376873331.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3340002209.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/nfb/3381902922.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3354573159.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3362100298.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381902125.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381902153.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381901542.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3350107308.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381900265.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381899791.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/nfb/3381899679.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3358779904.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381898539.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3327246304.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381896539.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381896670.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/nfb/3381896112.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381895869.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381895311.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381894513.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/nfb/3381894621.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3316456732.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381894075.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3377062584.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381893787.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381892574.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381892377.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3318512559.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381891646.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381891305.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381891455.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381891273.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3316386240.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381889883.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381889386.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381889169.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3360124329.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381888063.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381888052.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3368678821.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381884138.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381882248.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3338138553.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3337423964.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3338076745.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3329311040.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3340270122.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3331499560.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3315865208.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/fee/3381879676.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3345633915.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3332220119.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3381879142.html" />
> <rdf:li
> rdf:resource="http://newyork.craigslist.org/brk/abo/3327215566.html" />
> </rdf:Seq>
> </items>
> </channel>
> <item rdf:about="http://newyork.craigslist.org/brk/abo/3381915303.html">
> <title><![CDATA[Sweet Awesome 4 Bedroom Apartment No fee (Crown Heights)
> $2599 4bd]]></title>
> <link>http://newyork.craigslist.org/brk/abo/3381915303.html</link>
> <description><![CDATA[Call Avner At 917 553 3934 ~~~~ great cafes ~~~ bars
> ~~~ restaurants ~~~ endless conviences ~~~ only moments away ~~~~~ live by
> at head move under no the city would no keep him just set if go over may
> any]]></description>
> <dc:date>2012-11-02T14:04:56-04:00</dc:date>
> <dc:language>en-us</dc:language>
> <dc:rights>Copyright &#x26;copy; 2012 craigslist, inc.</dc:rights>
> <dc:source>http://newyork.craigslist.org/brk/abo/3381915303.html</dc:source>
> <dc:title><![CDATA[Sweet Awesome 4 Bedroom Apartment No fee (Crown Heights)
> $2599 4bd]]></dc:title>
> <dc:type>text</dc:type>
> <dcterms:issued>2012-11-02T14:04:56-04:00</dcterms:issued>
> </item>
> <item rdf:about="http://newyork.craigslist.org/brk/fee/3381915357.html">
> <title><![CDATA[DAZZLING DESIGNER DUPLEX! NEW RENO 3BR/2BTH PRIV DECK W/D
> CLS TO ALL! (Boerum Hill/Cobble Hill/ Carroll Gardens) $4800 3bd]]></title>
> <link>http://newyork.craigslist.org/brk/fee/3381915357.html</link>
> <description><![CDATA[3br, Boerum Hill, Brooklyn, $4800
> EUROPEAN ELEGANCE, ZEN SIMPLICITY &amp; CHIC DESIGN ARE YOURS WITH THIS
> STUNNING NEWLY RENOVATED TOWNHOUSE 3 BEDROOM &amp; 2 BATH DUPLEX LOCATED IN
> THE HEART OF HISTORIC BOERUM HILL!
> Featuring Hi Ceilings, New Red [...]]]></description>
> <dc:date>2012-11-02T14:04:55-04:00</dc:date>
> <dc:language>en-us</dc:language>
> <dc:rights>Copyright &#x26;copy; 2012 craigslist, inc.</dc:rights>
> <dc:source>http://newyork.craigslist.org/brk/fee/3381915357.html</dc:source>
> <dc:title><![CDATA[DAZZLING DESIGNER DUPLEX! NEW RENO 3BR/2BTH PRIV DECK W/D
> CLS TO ALL! (Boerum Hill/Cobble Hill/ Carroll Gardens) $4800
> 3bd]]></dc:title>
> <dc:type>text</dc:type>
> <dcterms:issued>2012-11-02T14:04:55-04:00</dcterms:issued>
> </item>

Julie Swoope

Nov 2, 2012, 3:47:55 PM
to beauti...@googlegroups.com, leon...@segfault.org
Hi Leonard,

Thanks so much for the reply.  I'm still getting an empty list with this code:

xml_soup = BeautifulSoup(open('index.rss'), 'xml', from_encoding = "iso-8859-1")
items = xml_soup.findAll('item')
print items

I also get an empty list with this code, which I tried because I thought I might have to pass in a string object:

open_stream = open('index.rss', 'rt')
self.data = open_stream.read()
str_data = str(self.data)
xml_soup = BeautifulSoup(str_data, 'xml', from_encoding = "iso-8859-1")
items = xml_soup.findAll('item')
print items

If I add a print for just self.data (before it goes into the soup), it prints all the xml data just fine.  For some reason I'm not getting the soup constructor and/or searching to work.

Thanks again for the help.

Julie

Leonard Richardson

Nov 2, 2012, 4:03:48 PM
to beautifulsoup
Julie,

It looks like you've encountered bug 1034883:

https://bugs.launchpad.net/beautifulsoup/+bug/1034883

This is caused by a bug in lxml:

https://bugs.launchpad.net/lxml/+bug/963936

The bug has been fixed in lxml version 2.3.6 and lxml 3.0 alpha 2.

You have several options:

1. Remove 'encoding="iso-8859-1"' from the beginning of the XML file
before you parse it.
2. Parse the feed using the Universal Feed Parser
instead of Beautiful Soup.
3. Use Beautiful Soup, but parse the feed using one of the HTML
parsers instead of parsing it as XML.
4. Upgrade your version of lxml.
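
Option 1 can be done in code instead of by editing the file by hand. The following is a rough sketch using only the Python 3 standard library; the helper name strip_declared_encoding and the sample bytes are made up for the illustration, and the regular expression only targets a declaration at the very start of the data:

```python
import re

# Matches an encoding="..." pseudo-attribute inside a leading XML declaration.
_DECLARED_ENCODING = re.compile(
    rb'^(\s*<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2',
    re.IGNORECASE,
)


def strip_declared_encoding(raw: bytes) -> bytes:
    """Return raw with the encoding removed from its XML declaration, if any."""
    return _DECLARED_ENCODING.sub(rb"\1", raw, count=1)


original = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<feed/>'
cleaned = strip_declared_encoding(original)
print(cleaned.decode("ascii"))
```

The cleaned bytes can then be passed to the BeautifulSoup constructor in place of the file object. Since the declaration no longer names the encoding, keeping from_encoding="iso-8859-1" in that call is probably still needed for non-ASCII characters to decode correctly.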

Leonard

Julie Swoope

Nov 2, 2012, 4:21:28 PM
to beauti...@googlegroups.com, leon...@segfault.org
Thanks so much, Leonard.  I'll just delete the encoding declaration at the top, and it seems to work fine.  I think I already have the latest version of lxml, so the bug may persist in the later versions?  I'll see if there is anything I can do to help.
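
One way to confirm which lxml is actually installed, rather than going from memory, is to ask the packaging metadata. This is a small sketch assuming Python 3.8 or later; the helper name installed_version is made up for the illustration, and the package names are the ones used on PyPI:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("lxml", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

The reported lxml version can then be compared against the releases named in the thread as containing the fix (2.3.6 and 3.0 alpha 2).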

Again, thank you.  Saved me several hours here.

Julie