Browsermob, .ts files, webscraper in python

Magdalena Anopsy

Feb 7, 2023, 6:40:22 AM
to BrowserMob Proxy
Hi there!
I'm working on a small tool that should help me retrieve .ts links and use them to GET chunks of a stream (I only need samples of the stream, so it doesn't have to scrape all of them, or in the right order).
After many tries with scraping performance logs, which didn't work, I found a solution with BrowserMob, and it works. But when I run the script, it usually scrapes and downloads ONLY 1 link. Once I got lucky and it scraped 5 links. (If I do it manually, copying the link from the Network tab and cURLing it, I'm able to get dozens of chunks within one minute, but I really need this tool for later.)
The .ts links appear in the Network tab every 1 or 2 seconds, so the script should be able to fetch at least 30 of them per minute.

I think I have 2 problems:
1. the number of .ts links that my script is scraping (maybe because of that for loop going through the whole HAR file only once)
2. the fact that the script exits after 1-3 minutes (I can probably change BrowserMob options to extend that, but I couldn't find out how)

How can I make BrowserMob retrieve more links and keep it alive longer, so it actually has a chance to find more .ts links per minute?
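Here's the kind of polling loop I'm imagining (an untested sketch; `poll_har` and `new_ts_urls` are my own hypothetical names, and it assumes each access to `proxy.har` re-fetches the current HAR from the proxy): instead of reading `proxy.har` once right after the page loads, re-read it every couple of seconds so new .ts entries are picked up as they appear:

```python
import time


def new_ts_urls(har, seen):
    """Return .ts request URLs from a HAR dict that are not in `seen`."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if url.endswith(".ts") and url not in seen and url not in urls:
            urls.append(url)
    return urls


def poll_har(proxy, run_seconds=120, interval=2.0):
    """Poll proxy.har for `run_seconds`, collecting fresh .ts links."""
    seen = set()
    deadline = time.time() + run_seconds
    while time.time() < deadline:
        for url in new_ts_urls(proxy.har, seen):
            seen.add(url)
            print(url)  # or download the chunk here
        time.sleep(interval)
    return seen
```

Would something like this be the right direction for keeping the session alive while the HAR keeps growing?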

My code:
https://github.com/anopsy/stream-scraper.git
If you want a quick look, I've pasted it below.
Thanks in advance
Magdalena

import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from browsermobproxy import Server


# start the proxy server
server = Server("/home/anopsy/stream-scraper/proxy/browsermob-proxy-2.1.4/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Selenium arguments
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server={0}".format(proxy.proxy))

caps = DesiredCapabilities.CHROME.copy()
caps['acceptSslCerts'] = True
caps['acceptInsecureCerts'] = True

# start recording network traffic into a new HAR
proxy.new_har(ref=None, options={'captureHeaders': True,
                                 'captureContent': True,
                                 'captureBinaryContent': True})
driver = webdriver.Chrome('chromedriver', options=options, desired_capabilities=caps)
driver.get(url)  # `url` must be set to the stream page beforehand


fetched = []
i = 0
for ent in proxy.har['log']['entries']:
    i += 1
    _url = ent['request']['url']
    _response = ent['response']
    # make sure we haven't already downloaded this piece
    if _url in fetched:
        continue
    if _url.endswith('.ts'):
        # check that this URL had a valid response; if not, ignore it
        if 'text' not in _response['content'] or not _response['content']['text']:
            continue
        print(_url + '\n')
        r1 = requests.get(_url, stream=True)
        if r1.status_code in (200, 206):
            # write this segment to its own output file
            with open("/home/anopsy/data/autodata/{}".format(i), "wb") as f:
                for chunk in r1.iter_content(chunk_size=1024):
                    if chunk:
                        f.write(chunk)
            fetched.append(_url)
        else:
            print("Received unexpected status code {}".format(r1.status_code))