BeautifulSoup and WebScraping

tmm...@gmail.com

May 19, 2016, 12:59:27 PM
to beautifulsoup
Hi,
I am using BeautifulSoup to do web scraping.
I need to save each URL's content (plain text) in a CSV file after removing stop words, punctuation, HTML tags, JavaScript, CSS, etc.
Below is my code snippet to parse the URL.
For some of the URLs, I also get JavaScript and CSS text in the parsed result. Could anyone please let me know how to get only the text, without any tags, scripts, or CSS, in the parsed result? I appreciate your help.
Ex:
import urllib  # Python 2; on Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

r = urllib.urlopen('http://microsoft.com').read()
soup = BeautifulSoup(r, "html.parser")
content = soup.get_text()

Thanks Much!
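
One possible way to get only the visible text is to delete the script and style elements from the parsed tree before calling get_text(), then clean the words and write them out with the csv module. The following is a minimal sketch, not a definitive solution. It assumes Python 3 (the snippet in the post uses the Python 2 urllib API) and works on an inline HTML string instead of a live URL. The helper names visible_text, clean_words, and write_row, the tiny STOP_WORDS set, and the example.com URL are all hypothetical and exist only for illustration.

```python
import csv
import io
import string

from bs4 import BeautifulSoup

# Tiny illustrative stop-word list (an assumption; swap in a fuller list as needed).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in"}


def visible_text(html):
    """Return the page text with script, style, and noscript contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove these elements (and everything inside them) before extracting text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


def clean_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w not in stop_words]


def write_row(out, url, words):
    """Write one (url, cleaned text) row to an open CSV file object."""
    csv.writer(out).writerow([url, " ".join(words)])


sample_html = """
<html><head><style>body { color: red; }</style>
<script>var x = 1;</script></head>
<body><h1>Hello, world!</h1><p>This is the text.</p></body></html>
"""
text = visible_text(sample_html)
words = clean_words(text)
buffer = io.StringIO()  # stand-in for open("out.csv", "w", newline="")
write_row(buffer, "http://example.com", words)
```

To use this on a real page, the HTML string would come from the download step in the post, and buffer would be replaced by a file opened with newline="" as the csv module recommends. Text that JavaScript injects into the page at runtime will not be present, since BeautifulSoup only parses the HTML it is given.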