How to create a web crawler that would collect the following info


Daily Chinese

Unread,
Jan 14, 2021, 8:22:32 PM
To: beautifulsoup
  1. Store the data in an Excel file, including the click number, reply number, title, author, and publish time from
    1. http://guba.sina.com.cn/?s=bar&name=sh000001
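A rough sketch of the whole task, assuming the post list lives in an HTML table; the sample HTML, the `tit_tr` class name, and the column names are invented stand-ins for the real page, and writing the `.xlsx` file needs the `openpyxl` package:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample standing in for the page at guba.sina.com.cn
html = """
<table>
  <tr class="tit_tr"><th>click</th><th>reply</th><th>title</th><th>author</th><th>time</th></tr>
  <tr><td>76</td><td>3</td><td>Example post</td><td>alice</td><td>01-14 20:22</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Header cells come from the <th> row; data cells from every row that has <td>s
headers = [th.get_text(strip=True)
           for th in soup.find("tr", {"class": "tit_tr"}).find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr") if tr.find_all("td")]

df = pd.DataFrame(rows, columns=headers)
# df.to_excel("posts.xlsx", index=False)  # requires openpyxl
```

The same two list comprehensions apply unchanged once `html` is replaced by the text of the fetched page.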

FMAPR

Unread,
Jan 14, 2021, 8:28:23 PM
To: beauti...@googlegroups.com
Is there a specific question? Did you try giving it a go?

On Friday, 15/01/2021, 01:22, Daily Chinese <sofiia...@gmail.com> wrote:
  1. Store the data in an Excel file, including the click number, reply number, title, author, and publish time from
    1. http://guba.sina.com.cn/?s=bar&name=sh000001

--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/cffc0671-fa14-4f9d-bbbf-c8fc0594bd7bn%40googlegroups.com.

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beautifulsoup
I have already written code that works, but now I am wondering how to write the info that I collected in the form of a table.



import requests
from bs4 import BeautifulSoup

r = requests.get('http://guba.sina.com.cn/?s=bar&name=sh000001')
soup = BeautifulSoup(r.text, "html.parser")  # html.parser suits this HTML page better than "xml"

li = soup.find_all('tr', {'class': 'tit_tr'})
for x in li:
    # a <tr> cannot contain another <tr>, so look for the <th> cells instead
    children = x.findChildren("th", recursive=False)
    for child in children:
        print(child.text)

r = soup.find_all('th')
for i in r:
    print(i.text)
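If pandas feels heavy, the printed cells can also go straight into a CSV file with the standard csv module. A sketch using an in-memory buffer and an invented two-column snippet in place of the real page:

```python
import csv
import io
from bs4 import BeautifulSoup

# Tiny stand-in for the real page's header row
html = "<table><tr class='tit_tr'><th>title</th><th>author</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")
header = [th.get_text(strip=True) for th in soup.find_all("th")]

# Swap io.StringIO() for open("posts.csv", "w", newline="", encoding="utf-8")
# to write an actual file that Excel can open.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
```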

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com
Yeah, I posted what I have done on Stack Overflow: https://stackoverflow.com/questions/65729188/how-to-create-a-web-scrawler. I tried different codes, but I got errors each time; only one worked properly, but it outputs the wrong format (maybe because it is Chinese).

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com
This is the correct version:


import requests
from bs4 import BeautifulSoup

page = requests.get('http://guba.sina.com.cn/?s=bar&name=sh000001')
soup = BeautifulSoup(page.text, "html.parser")  # html.parser suits this HTML page better than "xml"
r = soup.find_all('tr', {'class': 'tit_tr'})

for i in r:
    children = i.findChildren("th", recursive=False)
    for child in children:
        print(child.text)

On Fri, Jan 15, 2021 at 7:42 AM Daily Chinese <sofiia...@gmail.com> wrote:
I tried this:


import requests
from bs4 import BeautifulSoup
import pandas

url='http://guba.sina.com.cn/?s=bar&name=sh000001'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]

#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]

#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)

print(df)

but it raises an error:
 tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
AttributeError: 'NoneType' object has no attribute 'find_all'
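That AttributeError means `soup.find` returned `None`: no `<table id="curr_table">` exists in the HTML that requests fetched (the id was a guess). A defensive sketch against an invented page without that table:

```python
from bs4 import BeautifulSoup

# Invented page with no matching table, reproducing the failure mode
soup = BeautifulSoup("<html><body><p>no table here</p></body></html>", "html.parser")
table = soup.find("table", {"id": "curr_table"})

if table is None:
    rows = []  # bail out instead of calling .find_all on None
else:
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")[1:]]
```

Printing `soup.prettify()` (or viewing the page source in a browser) is the quickest way to find which id or class the table actually uses.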

and this:
from urllib import request
url = "http://guba.sina.com.cn/?s=bar&name=sh000001"
html = request.urlopen(url).read().decode('ISO-8859-1')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')

print(title) # Prints the tag
print(title.string) # Prints the tag string content


but it outputs text in the wrong encoding:


<title>ÉÏÖ¤Ö¸Êýsh000001_ÐÂÀ˹ÉÊлã_²Æ¾­_ÐÂÀËÍø</title>
ÉÏÖ¤Ö¸Êýsh000001_ÐÂÀ˹ÉÊлã_²Æ¾­_ÐÂÀËÍø
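That garbled title is GBK-encoded bytes decoded as ISO-8859-1. A small round-trip sketch of the mismatch, assuming the page is served as GBK (common for Chinese sites); with requests, setting `r.encoding = r.apparent_encoding` before reading `r.text` usually picks the right codec:

```python
# "上证指数" (Shanghai Composite Index) as a GBK-encoded page would send it
raw = "上证指数".encode("gbk")

garbled = raw.decode("iso-8859-1")  # wrong codec -> mojibake like the output above
fixed = raw.decode("gbk")           # correct codec recovers the Chinese text
```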

Daily Chinese

Unread,
Jan 15, 2021, 8:55:28 AM
To: beauti...@googlegroups.com