The solution is this:
wget -m http://www.example.com/
where "example.com" is your Django website. There was a small amount of
messing around to do -- wget didn't deduce that some of the images on
the page were backgrounds specified in the style-sheet for example, so
I had to copy those images by hand.
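Something like this rough sketch could automate that last step. It
assumes the mirror lives in ./www.example.com and that the stylesheets
reference images with plain url(...) paths:

import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

BASE = "http://www.example.com/"   # the site being mirrored
MIRROR = "www.example.com"         # the directory wget -m created

# Match url(...) references inside the downloaded stylesheets.
URL_RE = re.compile(r"url\(['\"]?([^)'\"]+)['\"]?\)")

for root, dirs, files in os.walk(MIRROR):
    for name in files:
        if not name.endswith(".css"):
            continue
        css_path = os.path.join(root, name)
        with open(css_path) as f:
            css = f.read()
        # url() paths are relative to the stylesheet, not the site root.
        css_rel = os.path.relpath(css_path, MIRROR).replace(os.sep, "/")
        css_url = urljoin(BASE, css_rel)
        for ref in URL_RE.findall(css):
            if ref.startswith(("http:", "https:", "data:")):
                continue
            remote = urljoin(css_url, ref)
            local = os.path.join(MIRROR, urlparse(remote).path.lstrip("/"))
            os.makedirs(os.path.dirname(local), exist_ok=True)
            urllib.request.urlretrieve(remote, local)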
But overall, this means I can now develop websites in my preferred
environment (Django) without worrying unduly about the client changing
his/her mind about it. In fact, I think this is now my preferred way
to develop websites, given that templates etc. make it much easier to
respond to changing requirements from clients.
HTH :)
--
James
Yeah, wget rocks ;)
When I was freelancing I used to use that trick all the time; I'd
build a site on my local computer using whatever dynamic/templating
tools I wanted, and then if the client wanted static files I'd pull
them off using wget.
--
"May the forces of evil become confused on the way to your house."
-- George Carlin
I was pondering the same thing just the other day - trying to get flat
pages from a Django site. At first I thought using *wget* would
suffice, but I also needed to do other things with the files
(archiving, uploading to FTP), so I needed some way of interacting with
the static pages after download.
The obvious thing might have been to find a Python wget module. No such
luck. Checking wget further
(http://en.wikipedia.org/wiki/Wget#Criticisms_of_Wget), I also found a
number of things that may get in the way of extracting complete pages.
For example, it only supports HTTP 1.0, so data referenced from JS and
CSS might not be extracted.
So I looked for an alternative. cURL comes to mind
(http://en.wikipedia.org/wiki/CURL), and libcurl is exposed to Python
through pycurl (http://pycurl.sourceforge.net/). Using this approach I
can not only download pages with greater flexibility but also script
the downloads in Python instead of relying on a shell call to wget.
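As a minimal sketch of what that scripting looks like (the URL is just
a placeholder and error handling is left out), a single page can be
pulled into memory with pycurl like this, ready to be archived or
pushed to FTP afterwards:

from io import BytesIO
import pycurl

def fetch(url):
    # Collect the response body in memory rather than writing to disk.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.perform()
    c.close()
    return buf.getvalue()

html = fetch("http://localhost:8000/")  # e.g. the local Django dev server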
Can you share the scripts you've written? I wonder if there's a generic
need for this kind of thing, but maybe what you've done is specific to
your setup... I'm quite taken with the idea of using Django as a
web-site factory regardless of the final deployment environment, so
maybe a write-up of the alternatives would find some readers.
--
James
wget -E --load-cookies /path/to/firefox/profiles/cookies.txt -r -k -l inf -N -p
-E = save HTML files with an .html extension
-r = recurse
-k = convert links for local viewing
-l inf = infinite recursion depth
-N = turn on timestamping (don't re-retrieve files unless newer than the local copy)
-p = get page requisites (images, CSS, etc. needed to display the page)
Other than that, I also created an "all objects" page that gets all my
objects that map to pages, calls get_absolute_url() on each one, and
prints out a list. I call wget on that URL to make sure it hits every
page in the database, just in case some pages aren't linked to directly
(e.g. when we have a JS redirect to a page depending on some user
interaction).
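A minimal sketch of what such an "all objects" view can look like (the
Article and Page models here are just stand-ins for whatever maps to
pages in your project):

from django.http import HttpResponse
from myapp.models import Article, Page  # stand-ins for your own models

def all_object_urls(request):
    # Collect get_absolute_url() for every object that maps to a page.
    urls = []
    for model in (Article, Page):
        urls.extend(obj.get_absolute_url() for obj in model.objects.all())
    items = "".join('<li><a href="%s">%s</a></li>' % (u, u) for u in urls)
    return HttpResponse("<ul>%s</ul>" % items)

Wire that up to a URL and point wget at it to pick up everything in the
list.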
-Rob
I missed this with pycurl and have yet to find an example that supports
it :( Then I scanned the curl FAQ and found 3.15 [0]:
> 3.15 Can I do recursive fetches with curl?
> http://curl.mirrors.cyberservers.net/docs/faq.html#3.15
This means that to use pycurl you need a list of URLs, which is awkward
because you have to parse each returned page yourself. I would have
thought curl could support this. Obviously not. One obvious way is to
point pycurl at a base URL, then for each returned page (assuming HTML)
parse it with HTMLParser [1], build a list of the URLs it contains, and
extract the pages that way.
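A rough skeleton of that idea, using the same pycurl fetch pattern as
above plus an HTMLParser subclass to collect links (writing pages to
disk, error handling and non-HTML content are all left out, and the
base URL is a placeholder):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from io import BytesIO
import pycurl

class LinkCollector(HTMLParser):
    # Gather href values from every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    # Same pycurl pattern as the earlier sketch.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue().decode("utf-8", "replace")

def crawl(base):
    # Fetch each page once, queueing any same-site links it contains.
    seen, queue, pages = set(), [base], {}
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute, _ = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc == urlparse(base).netloc:
                queue.append(absolute)
    return pages

pages = crawl("http://localhost:8000/")  # placeholder base URL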
Not quite what you asked. If you want a quick hack solution, do what
Rob Hudson suggests and use wget (which you probably did anyway), or
else try an old favourite of mine and use websucker [2]. The reason I
looked at pycurl was the obvious Django/pycurl integration (i.e. a
Django app that mirrors sites).
References
----------------
[0] curl FAQ, 'Can I do recursive fetches with curl? No.'
http://curl.mirrors.cyberservers.net/docs/faq.html#3.15
[Accessed Saturday, 6 January 2007]
[1] Python HTMLParser module, 'Parses text files in the format of HTML
and XHTML'
http://docs.python.org/lib/module-HTMLParser.html
[Accessed Saturday, 6 January 2007]
[2] Python websucker, creates a 'mirror copy of a remote site'
http://svn.python.org/view/python/trunk/Tools/webchecker/
[Accessed Saturday, 6 January 2007]