BeautifulSoup and WebScraping

tmm...@gmail.com

May 19, 2016, 12:59:27 PM

to beautifulsoup

Hi,

I am using BeautifulSoup for web scraping.

I need to save the URL content (plain text) in a CSV file after removing stop words, punctuation, HTML tags, JavaScript, CSS, etc.

Below is my code snippet to parse the URL.

For some of the URLs, I also get JavaScript and CSS text in the parsed result. Could anyone please let me know how to get only the text, without any tags, scripts, or CSS, in the parsed content? I appreciate your help.

Ex:

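from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

r = urlopen('http://microsoft.com').read()
soup = BeautifulSoup(r, "html.parser")
content = soup.get_text()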
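
One possible way to keep script and CSS text out of the result is to remove the `<script>` and `<style>` elements from the tree before calling `get_text()`. The following is only a minimal sketch: the helper name `visible_text` and the choice to also drop `<noscript>` are assumptions for illustration, not something from the snippet above.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without script or style content.

    `visible_text` is a hypothetical helper name used for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are code or styling rather than page text.
    # Including <noscript> is an assumption; drop it from the list if unwanted.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Join the remaining text nodes with single spaces and trim whitespace.
    return soup.get_text(separator=" ", strip=True)
```

It could be called as `content = visible_text(r)` with the downloaded page. Text that a page builds at runtime with JavaScript would still be missing, since only the static HTML is parsed.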

Thanks Much!
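
For the remaining steps mentioned in the question (removing punctuation and stop words, then saving to a CSV file), a rough sketch using only the standard library might look like this. The names `STOP_WORDS`, `clean_text`, and `save_rows`, the two-column CSV layout, and the tiny stop-word list are all assumptions made for illustration; a real stop-word list would be much longer.

```python
import csv
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(w for w in words if w not in stop_words)


def save_rows(path, rows):
    """Write (url, text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "text"])  # assumed column layout
        writer.writerows(rows)
```

For example, `save_rows("pages.csv", [(url, clean_text(content))])` would write one row per page, where `pages.csv` is a placeholder file name. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and similar characters would pass through unchanged.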
