Download the source of a big HTML file


Jabba Laci

Dec 11, 2012, 1:18:16 PM
to moz...@googlegroups.com
Hi,

I'm new to the list. I want to use MozRepl to access AJAX-generated HTML sources, and I want to control the whole process from a Python script (see below). It works fine for small HTML sources, but it becomes very slow for pages that are several MB long. I tried it on a page with a 5 MB source and it froze Firefox for minutes.

What I do is request the HTML source and copy the output into a string until the prompt reappears. You can find my Python source below. The second part ("Death") blocks my browser for minutes.

How should I use MozRepl in cases like this? It would be nice if I could tell the browser to save the content of a string (the HTML source) to a local file. I tried to use FileUtils.openSafeFileOutputStream but I got an exception. If you have a solution, please let me know.

Thanks,

Laszlo

============

#!/usr/bin/env python

import re
from time import sleep
import telnetlib

HOST = 'localhost'
PORT = 4242

prompt = [r'repl\d*> ']    # list of regular expressions

def get_page(url, wait=3):
    """Load `url` in Firefox via MozRepl and return the rendered
    <body> HTML as a list of lines."""
    tn = telnetlib.Telnet(HOST, PORT)
    tn.expect(prompt)
    cmd = "content.location.href = '{url}'".format(url=url)
    tn.write(cmd + "\n")
    tn.expect(prompt)
    if wait:
        print '# waiting {X} seconds...'.format(X=wait)
        sleep(wait)    # give the page's AJAX calls time to finish
        print '# continue'
    #
    tn.write('content.document.body.innerHTML\n')
    html = tn.expect(prompt)[2].split('\n')
    # strip the opening quote, the trailing prompt and the closing
    # quote that MozRepl wraps around string output
    if html[0].strip() == '"':
        html = html[1:]
    if re.search(prompt[0], html[-1]):
        html = html[:-1]
    if html[-1].strip() == '"':
        html = html[:-1]
    tn.write("repl.quit()\n")
    return html

##################################

if __name__ == "__main__":
    print 'OK'
    html = get_page('http://simile.mit.edu/crowbar/test.html')
    for line in html:
        print line
    print '================'
    print 'Death'
    url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
    html = get_page(url, wait=30)
    for line in html:
        print line
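P.S. For reference, the trimming at the end of get_page() — dropping the opening quote, the trailing prompt line, and the closing quote that MozRepl wraps around string output — is equivalent to this standalone sketch (trim_repl_output is a hypothetical name, not part of the script above):

```python
import re

PROMPT_RE = re.compile(r'repl\d*> ')    # numbered prompts like "repl3> "

def trim_repl_output(lines):
    # Strip the quote marks and the trailing prompt that MozRepl
    # wraps around a string result, mirroring the checks in get_page().
    if lines and lines[0].strip() == '"':
        lines = lines[1:]
    if lines and PROMPT_RE.search(lines[-1]):
        lines = lines[:-1]
    if lines and lines[-1].strip() == '"':
        lines = lines[:-1]
    return lines
```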
