Browsermob, .ts files, webscraper in python

Magdalena Anopsy

Feb 7, 2023, 6:40:22 AM
to BrowserMob Proxy
Hi there!
I'm working on a small tool that should help me retrieve .ts links and use them to GET chunks of a stream (I only need samples of the stream, so it doesn't have to scrape all of them, or in the right order).
After many tries with scraping performance logs, which didn't work, I found a solution with BrowserMob, and it works. But when I run the script, it usually scrapes and downloads ONLY 1 link. Once I got lucky and it scraped 5 links. (If I do it manually, copying the link from the Network tab and cURLing it, I'm able to get dozens of chunks within one minute, but I really need this tool for later.)
The .ts links appear in the Network tab every 1 or 2 seconds, so the script should be able to fetch at least 30 of them per minute.

I think I have 2 problems:
1. the number of .ts links that my script is scraping (maybe because of that for loop going through the whole HAR file only once)
2. the fact that the script exits after 1-3 minutes (I can probably change BrowserMob options to extend that, but I couldn't find out how)

How can I make BrowserMob retrieve more links and keep it alive longer, so it actually has a chance to find more .ts links per minute?
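Here's the kind of polling loop I'm imagining (an untested sketch; `poll_har` and `new_ts_urls` are my own hypothetical names, and it assumes each access to `proxy.har` re-fetches the current HAR from the proxy): instead of reading `proxy.har` once right after the page loads, re-read it every couple of seconds so new .ts entries are picked up as they appear:

```python
import time


def new_ts_urls(har, seen):
    """Return .ts request URLs from a HAR dict that are not in `seen`."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if url.endswith(".ts") and url not in seen and url not in urls:
            urls.append(url)
    return urls


def poll_har(proxy, run_seconds=120, interval=2.0):
    """Poll proxy.har for `run_seconds`, collecting fresh .ts links."""
    seen = set()
    deadline = time.time() + run_seconds
    while time.time() < deadline:
        for url in new_ts_urls(proxy.har, seen):
            seen.add(url)
            print(url)  # or download the chunk here
        time.sleep(interval)
    return seen
```

Would something like this be the right direction for keeping the session alive while the HAR keeps growing?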

My code:
https://github.com/anopsy/stream-scraper.git
If you want a quick look, I've pasted it below.
Thanks in advance
Magdalena

import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from browsermobproxy import Server


# start the proxy server
server = Server("/home/anopsy/stream-scraper/proxy/browsermob-proxy-2.1.4/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Selenium arguments
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server={0}".format(proxy.proxy))

caps = DesiredCapabilities.CHROME.copy()
caps['acceptSslCerts'] = True
caps['acceptInsecureCerts'] = True

# start recording network traffic into a new HAR
proxy.new_har(ref=None, options={'captureHeaders': True,
                                 'captureContent': True,
                                 'captureBinaryContent': True})
driver = webdriver.Chrome('chromedriver', options=options, desired_capabilities=caps)
driver.get(url)  # `url` must be set to the stream page beforehand


fetched = []
i = 0
for ent in proxy.har['log']['entries']:
    i += 1
    _url = ent['request']['url']
    _response = ent['response']
    # make sure we haven't already downloaded this piece
    if _url in fetched:
        continue
    if _url.endswith('.ts'):
        # check that this URL had a valid response; if not, ignore it
        if 'text' not in _response['content'] or not _response['content']['text']:
            continue
        print(_url + '\n')
        r1 = requests.get(_url, stream=True)
        if r1.status_code in (200, 206):
            # write this segment to its own output file
            with open("/home/anopsy/data/autodata/{}".format(i), "wb") as f:
                for chunk in r1.iter_content(chunk_size=1024):
                    if chunk:
                        f.write(chunk)
            fetched.append(_url)
        else:
            print("Received unexpected status code {}".format(r1.status_code))