getting all words in html, not only text


Csaba Szunyog

May 15, 2021, 9:10:36 AM
to beautifulsoup
Hi All

I am writing an app that scrapes web pages. I wrote it on my desktop PC running Windows, but when I run it on my Raspberry Pi I get completely different results: it returns not only the text, but everything within the HTML code. I used the following:

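import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer

# tokenizer that keeps runs of word characters
reg_tokenizer = RegexpTokenizer(r'\w+')

# request each page (webpages is a list of URLs defined elsewhere)
for webpage in webpages:
    r = requests.get(webpage)
    c = r.content
    soup = BeautifulSoup(c, "lxml")

    # get the text from the html and tokenize it into words
    all_text = soup.get_text()
    word_tokens = reg_tokenizer.tokenize(all_text)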


lxml is installed in both environments.

Thanks

stephen lukacs

May 15, 2021, 9:17:22 AM
to beautifulsoup
I have had similar results on different platforms, and lxml is a much faster parser than html.parser.  I have also found that making sure bs4 is the same version on both platforms is most important.  You can check the installed version with:

from bs4 import __version__ as BS_version, BeautifulSoup
version = BS_version

The latest version, 4.9.3, seems bug-free and consistent.  lucas
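
To compare the two machines side by side, a small sketch like the one below could be run on each of them. The function name `environment_report` is made up for illustration; it only reads version attributes that bs4 and lxml expose, and reports `None` for a library that is not installed.

```python
import platform
import sys


def environment_report():
    """Collect the versions that matter when parsing differs between machines."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "bs4": None,
        "lxml": None,
        "libxml2": None,
    }
    try:
        import bs4
        report["bs4"] = bs4.__version__
    except ImportError:
        pass  # bs4 is not installed in this environment
    try:
        from lxml import etree
        # Both attributes are tuples of integers, e.g. (4, 6, 3, 0)
        report["lxml"] = ".".join(str(n) for n in etree.LXML_VERSION)
        report["libxml2"] = ".".join(str(n) for n in etree.LIBXML_VERSION)
    except ImportError:
        pass  # lxml is not installed in this environment
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

If the bs4 or lxml lines differ between Windows and the Raspberry Pi, that difference is the first thing to rule out before looking at the scraping code itself.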

Csaba Szunyog

May 16, 2021, 8:19:31 AM
to beauti...@googlegroups.com
Hi

Thanks for the advice, but I didn't manage to downgrade BS 4.9.3 to 4.7.1 on Windows, and I didn't manage to upgrade 4.7.1 on the Raspberry Pi either. According to the website 4.7.1 is unstable, so I will use Ubuntu Server instead and hope it works.




stephen lukacs

May 16, 2021, 11:13:27 AM
to beautifulsoup
What?  I didn't understand that.  You can upgrade on the Raspberry Pi or other platforms with "pip3 install --upgrade bs4".  That should do it.

Csaba Szunyog

May 16, 2021, 12:06:04 PM
to beauti...@googlegroups.com
I mean that I wasn't able to upgrade to 4.9 on the Raspberry Pi, and for some reason I wasn't able to downgrade to 4.7.1 on Windows.

4.7 was the version that pip3 gave me on Linux; maybe I didn't do something correctly...
I had the impression that I can't go above 4.7 on Raspbian, so I decided to set up an Ubuntu server in the meantime.

The Ubuntu server seems to be OK now, although I haven't checked bs4 on it yet... If it doesn't work, I'll try Raspbian again.

Thanks for the help!!
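
For anyone hitting the same symptom while the versions cannot be matched, one possible workaround is to remove the script and style elements explicitly before calling get_text(), so the result depends less on how a particular bs4 release treats their contents. This is only a sketch, and it assumes the extra "words" come from script and style content, which the thread does not confirm. The helper name `visible_words` is hypothetical, "html.parser" is used so the sketch runs without lxml, and `re.findall(r"\w+", ...)` stands in for the RegexpTokenizer from the original post.

```python
import re

from bs4 import BeautifulSoup


def visible_words(html):
    """Return the word tokens of a page, ignoring script and style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code rather than readable text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # A separator keeps words from adjacent elements from running together
    text = soup.get_text(separator=" ")
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x=1;</script></head>"
        "<body><p>Hello world</p></body></html>"
    )
    print(visible_words(sample))
```

In the original loop, the same two decompose lines could go between creating the soup and calling soup.get_text().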
