Re: Help with extracting mp3 link


Link Swanson

Jul 12, 2012, 11:04:17 AM
to beauti...@googlegroups.com
try this:

soup = BeautifulSoup(your_html)

print soup.find('a', 'play3')['playlist']

This returns the value of the playlist="" attribute for the first anchor tag whose CSS class includes "play3".

If the links you are looking for always have the CSS class play3, then you are golden.


On Thu, Jul 12, 2012 at 8:10 AM, Chris Lewis <chriskw...@gmail.com> wrote:
Until recently, I hadn't programmed in over 4 years, and this is also my first time using Python. After extensive Google searches for regex in Python, I came across BeautifulSoup. The documentation was extremely long, and a lot of it either didn't seem to help me or didn't work when I copied, pasted, and modified the code. So I'm here hoping to get some help.

I'm trying to parse a webpage to get the FIRST mp3 link ONLY. The code I will be parsing will look like this:
<a href="#" playlist="http://dn-naverdic.ktics.co.kr/naverdic/f759cdac78d6e201e5dfd928acc70e2a/4ffec2f7/naverdic/endic/sound/clear/us/007/007582.mp3" class="play3 N=a:wrd.listencom,r:3,i:85c05904f36749e6aa9f6fd3f461f63c">

I've tried the find_all function with 'a' as the parameter and tried getting it to find 'playlist', but I could not get it to work. Could I get some help extracting the URL for the mp3, please? Thank you

--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To view this discussion on the web visit https://groups.google.com/d/msg/beautifulsoup/-/1TcB-ZYbTcIJ.
To post to this group, send email to beauti...@googlegroups.com.
To unsubscribe from this group, send email to beautifulsou...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/beautifulsoup?hl=en.



--
Link Swanson
Must Build Digital


Chris Lewis

Jul 12, 2012, 11:15:39 AM
to beauti...@googlegroups.com
Thank you for your reply. I tried this:
soup = BeautifulSoup(html)
print soup.find('a', 'play3')['playlist']

and got the following error:
    print soup.find('a', 'play3')['playlist']
TypeError: 'NoneType' object has no attribute '__getitem__'




Link Swanson

Jul 12, 2012, 11:31:30 AM
to beauti...@googlegroups.com, chriskw...@gmail.com
Ok, looks like you are not getting the page parsed properly into BeautifulSoup. 

I pasted your HTML into a sample file, then created a little script that opens the html.html file, parses with bs4, extracts the URL you are looking for into a variable, and prints that variable. Both files are attached. Try running the mp3soup.py script (make sure html.html is in the same folder with the script) and see if it returns what you are looking for. That script should show you how to get rolling parsing pages and finding data. 
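The attachments are not reproduced in this archive. As a rough sketch of the approach described above (this is not the actual mp3soup.py), assuming Python 3 with bs4 installed and with the sample HTML inlined instead of read from html.html, something like the following should work. The helper name first_playlist_url is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the original question, inlined so the sketch is
# self-contained. The closing </a> and link text are added for completeness.
SAMPLE_HTML = (
    '<a href="#" playlist="http://dn-naverdic.ktics.co.kr/naverdic/'
    'f759cdac78d6e201e5dfd928acc70e2a/4ffec2f7/naverdic/endic/sound/'
    'clear/us/007/007582.mp3" '
    'class="play3 N=a:wrd.listencom,r:3,i:85c05904f36749e6aa9f6fd3f461f63c">'
    'listen</a>'
)


def first_playlist_url(html):
    """Return the playlist attribute of the first <a class="play3">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="play3")
    # find() returns None when nothing matches; indexing that None is what
    # raises the TypeError quoted earlier in the thread.
    if link is None:
        return None
    return link.get("playlist")


print(first_playlist_url(SAMPLE_HTML))
print(first_playlist_url("<p>no links here</p>"))
```

Checking for None before indexing turns "the page did not contain what I expected" into something you can test for, instead of a traceback.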

Good luck, 

Lunk

html.html
mp3soup.py

Chris Lewis

Jul 12, 2012, 11:58:03 AM
to beauti...@googlegroups.com
Thank you for your reply. I did try to modify your code to:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
mp3_url = soup.find('a', 'play3')['playlist']
print mp3_url

but it didn't work. Is that because open is only for files?

I did however find some other code that does work:

import urllib2
from bs4 import BeautifulSoup
 
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
 
playlists = soup.find_all("a", {"playlist": True})
print playlists
 
print playlists[0].get("playlist")
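Since only the first link is wanted, find() with the same attribute filter may be enough. Here is a small sketch, assuming Python 3 with bs4 installed; the two example.com anchors are made up and stand in for a fetched page.

```python
from bs4 import BeautifulSoup

# Two made-up anchors standing in for a fetched page; only the first is wanted.
HTML = (
    '<a href="#" playlist="http://example.com/first.mp3">one</a>'
    '<a href="#" playlist="http://example.com/second.mp3">two</a>'
)

soup = BeautifulSoup(HTML, "html.parser")

# find() stops at the first <a> that has a playlist attribute at all.
first = soup.find("a", attrs={"playlist": True})
first_url = first.get("playlist") if first is not None else None

# find_all() collects every match; indexing [0] on an empty result
# would raise IndexError.
all_urls = [a["playlist"] for a in soup.find_all("a", attrs={"playlist": True})]

print(first_url)
print(all_urls)
```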




The only problem now is when I insert it into my main program, it doesn't work!! Ahhhhhhhhhhhh

Link Swanson

Jul 12, 2012, 11:59:50 AM
to beauti...@googlegroups.com
Yeah, great that you saw you needed to open the URL for remote pages ...

Now you have the thing broken down into a small chunk that works. Makes it easier. Good luck!


Chris Lewis

Jul 13, 2012, 11:19:16 AM
to beauti...@googlegroups.com

OK, I believe I've fixed the problem now, and I'm also starting to get a feel for Python. One more quick question about this program I'm writing (though a little bit off topic), if you don't mind.
I'm trying to store words I read from a file into a list, but the list is storing individual characters instead. For example a text file called words might look like this:

Word
Word2
Word3
Word4


I tried this:

words = []       #store our individual words from the textfile
mp3_urls = []  #array to store our urls

#Read from our text file of words, store our words and URLs
for line in file(filename):
    words += line   #Store the original word
    mp3_urls += "http://endic.naver.com/search.nhn?isOnlyViewEE=N&query=" + word   #Our url for where we are going to find the file

It's also saving new lines and other things within it, is there a way to remove the new lines so my url doesn't get a \n attached to it? Thank you

Paul Walker

Jul 13, 2012, 6:42:34 PM
to beauti...@googlegroups.com
Formatting is a bit random, but...

> storing individual characters instead. For example a text file called words
> might look like this:
>
> Word
> Word2
> Word3
> Word4
>
> words = [] #store our individual words from the textfile

So "words" is now an empty list (Python's name for an array).

> mp3_urls = [] #array to store our urls
>
> for line in file(filename):
> words += line #Store the original word
> mp3_urls +=

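That doesn't do what you think it does. :) Using += on a list extends it with each element of the right-hand side, and a string's elements are its individual characters. Try words.append(line)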
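A quick way to see the difference, as a small Python 3 sketch (the variable names are only for illustration):

```python
chars = []
chars += "Word\n"        # a string is iterated character by character
print(chars)             # ['W', 'o', 'r', 'd', '\n']

words = []
words.append("Word\n")   # append adds the whole string as one item
print(words)             # ['Word\n']
```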

> It's also saving new lines and other things within it, is there a way to
> remove the new lines so my url doesn't get a \n attached to it? Thank you

Try the "strip" function from the string library. So (rewriting somewhat):

import string

f = open(filename)
lines = f.readlines()
f.close()
for line in lines:
    stripped_line = string.strip(line)
    words.append(stripped_line)

If you're feeling more Pythonic, that can also be written as:

for line in open(filename).readlines():
    words.append(string.strip(line))

Don't forget your try/except clauses. :-)
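Putting those pieces together, here is one possible sketch for Python 3, where the string-module strip function no longer exists and the str.strip() method is used instead. The helper name load_words and the temporary file are made up for the demo; the base URL is the one from the question.

```python
import os
import tempfile

# Search URL prefix taken from the question.
BASE_URL = "http://endic.naver.com/search.nhn?isOnlyViewEE=N&query="


def load_words(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    try:
        # The with-block closes the file even if reading fails part-way.
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except OSError as exc:
        print("could not read %s: %s" % (path, exc))
        return []


# Demo: write a throwaway word file, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Word\nWord2\nWord3\nWord4\n")

words = load_words(tmp.name)
mp3_urls = [BASE_URL + word for word in words]
os.remove(tmp.name)

print(words)
print(mp3_urls[0])
```

Stripping each line before building the URL keeps the trailing \n out of the query string.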

Can I also recommend bpython? It's an interactive Python shell which
you might find useful.

Hope that helps,

--
Paul

Chris Lewis

Jul 13, 2012, 10:13:16 PM
to beauti...@googlegroups.com
Great, I'll give that a try. Thank you so much for your help.

I got the idea of using mp3_urls += instead of append from this program I found online: http://slacy.com/blog/2010/09/script-to-automatically-download-tracks-from-pitchforks-best-new-tracks-feed/

He does this:

mp3_urls = []
for p in posts:
    pstr = str(p)
    pstr = pstr.replace('&lt;', '<')
    pstr = pstr.replace('&gt;', '>')
    desc_soup = BeautifulSoup(pstr)
    links = desc_soup.findAll('a', href=re.compile('.*pitchforkmedia.*mp3$'))
    mp3_urls += [l['href'] for l in links ] 

Now I'm a bit confused how he got that to work. Thanks again

Paul Walker

Jul 16, 2012, 7:34:16 AM
to beauti...@googlegroups.com
> mp3_urls += [l['href'] for l in links ]

That's adding a list to a list (see "list comprehensions" in the
Python manual), not an item to a list.
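To illustrate, a small Python 3 sketch; the dictionaries are made-up stand-ins for the BeautifulSoup tags in that script:

```python
# Made-up stand-ins for the <a> tags found by findAll().
links = [{"href": "a.mp3"}, {"href": "b.mp3"}]

mp3_urls = ["old.mp3"]
mp3_urls += [l["href"] for l in links]      # extends with each item of the new list
print(mp3_urls)                             # ['old.mp3', 'a.mp3', 'b.mp3']

nested = ["old.mp3"]
nested.append([l["href"] for l in links])   # append adds the list itself as one item
print(nested)                               # ['old.mp3', ['a.mp3', 'b.mp3']]
```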

We're probably off-topic for the list now though. ;-)