How to create a web crawler that would collect the following info


Daily Chinese

Unread,
Jan 14, 2021, 8:22:32 PM
To: beautifulsoup
  1. Store the data in an Excel file, including the click number, reply number, title, author, and publish time from
    1. http://guba.sina.com.cn/?s=bar&name=sh000001
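A rough sketch of the whole task, assuming the post list lives in an HTML table; the sample HTML, the `tit_tr` class name, and the column names are invented stand-ins for the real page, and writing the `.xlsx` file needs the `openpyxl` package:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample standing in for the page at guba.sina.com.cn
html = """
<table>
  <tr class="tit_tr"><th>click</th><th>reply</th><th>title</th><th>author</th><th>time</th></tr>
  <tr><td>76</td><td>3</td><td>Example post</td><td>alice</td><td>01-14 20:22</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Header cells come from the <th> row; data cells from every row that has <td>s
headers = [th.get_text(strip=True)
           for th in soup.find("tr", {"class": "tit_tr"}).find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr") if tr.find_all("td")]

df = pd.DataFrame(rows, columns=headers)
# df.to_excel("posts.xlsx", index=False)  # requires openpyxl
```

The same two list comprehensions apply unchanged once `html` is replaced by the text of the fetched page.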

FMAPR

Unread,
Jan 14, 2021, 8:28:23 PM
To: beauti...@googlegroups.com
Is there a specific question? Did you try giving it a go?

On Friday, 15/01/2021, 01:22, Daily Chinese <sofiia...@gmail.com> wrote:
  1. Store the data in an Excel file, including the click number, reply number, title, author, and publish time from
    1. http://guba.sina.com.cn/?s=bar&name=sh000001

--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/cffc0671-fa14-4f9d-bbbf-c8fc0594bd7bn%40googlegroups.com.

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beautifulsoup
I have already written code that works, but now I am wondering how to write the info that I collected in the form of a table.



import requests
from bs4 import BeautifulSoup

r = requests.get('http://guba.sina.com.cn/?s=bar&name=sh000001')
soup = BeautifulSoup(r.text, "html.parser")  # html.parser suits this HTML page better than "xml"

li = soup.find_all('tr', {'class': 'tit_tr'})
for x in li:
    # a <tr> cannot contain another <tr>, so look for the <th> cells instead
    children = x.findChildren("th", recursive=False)
    for child in children:
        print(child.text)

r = soup.find_all('th')
for i in r:
    print(i.text)
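If pandas feels heavy, the printed cells can also go straight into a CSV file with the standard csv module. A sketch using an in-memory buffer and an invented two-column snippet in place of the real page:

```python
import csv
import io
from bs4 import BeautifulSoup

# Tiny stand-in for the real page's header row
html = "<table><tr class='tit_tr'><th>title</th><th>author</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")
header = [th.get_text(strip=True) for th in soup.find_all("th")]

# Swap io.StringIO() for open("posts.csv", "w", newline="", encoding="utf-8")
# to write an actual file that Excel can open.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
```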

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com
Yeah, I posted what I have done on Stack Overflow: https://stackoverflow.com/questions/65729188/how-to-create-a-web-scrawler. I tried different codes, but I got errors each time; only one worked properly, but it outputs the wrong format (maybe because it is Chinese).

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com
This is the correct version:


import requests
from bs4 import BeautifulSoup

page = requests.get('http://guba.sina.com.cn/?s=bar&name=sh000001')
soup = BeautifulSoup(page.text, "html.parser")  # html.parser suits this HTML page better than "xml"
r = soup.find_all('tr', {'class': 'tit_tr'})

for i in r:
    children = i.findChildren("th", recursive=False)
    for child in children:
        print(child.text)

On Fri, Jan 15, 2021 at 7:42 AM Daily Chinese <sofiia...@gmail.com> wrote:
I tried this:


import requests
from bs4 import BeautifulSoup
import pandas

url='http://guba.sina.com.cn/?s=bar&name=sh000001'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]

#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]

#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)

print(df)

but it raises an error:
 tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
AttributeError: 'NoneType' object has no attribute 'find_all'
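That AttributeError means `soup.find` returned `None`: no `<table id="curr_table">` exists in the HTML that requests fetched (the id was a guess). A defensive sketch against an invented page without that table:

```python
from bs4 import BeautifulSoup

# Invented page with no matching table, reproducing the failure mode
soup = BeautifulSoup("<html><body><p>no table here</p></body></html>", "html.parser")
table = soup.find("table", {"id": "curr_table"})

if table is None:
    rows = []  # bail out instead of calling .find_all on None
else:
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")[1:]]
```

Printing `soup.prettify()` (or viewing the page source in a browser) is the quickest way to find which id or class the table actually uses.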

and this:
from urllib import request
url = "http://guba.sina.com.cn/?s=bar&name=sh000001"
html = request.urlopen(url).read().decode('ISO-8859-1')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')

print(title) # Prints the tag
print(title.string) # Prints the tag string content


but it outputs text in the wrong encoding:


<title>ÉÏÖ¤Ö¸Êýsh000001_ÐÂÀ˹ÉÊлã_²Æ¾­_ÐÂÀËÍø</title>
ÉÏÖ¤Ö¸Êýsh000001_ÐÂÀ˹ÉÊлã_²Æ¾­_ÐÂÀËÍø
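That garbled title is GBK-encoded bytes decoded as ISO-8859-1. A small round-trip sketch of the mismatch, assuming the page is served as GBK (common for Chinese sites); with requests, setting `r.encoding = r.apparent_encoding` before reading `r.text` usually picks the right codec:

```python
# "上证指数" (Shanghai Composite Index) as a GBK-encoded page would send it
raw = "上证指数".encode("gbk")

garbled = raw.decode("iso-8859-1")  # wrong codec -> mojibake like the output above
fixed = raw.decode("gbk")           # correct codec recovers the Chinese text
```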

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com