The solution is this:
wget -m http://www.example.com/
where "example.com" is your Django website. There was a small amount of
messing around to do -- wget didn't deduce that some of the images on
the page were backgrounds specified in the style-sheet for example, so
I had to copy those images by hand.
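Something like this rough sketch could automate that last step. It
assumes the mirror lives in ./www.example.com and that the stylesheets
reference images with plain url(...) paths:

import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

BASE = "http://www.example.com/"   # the site being mirrored
MIRROR = "www.example.com"         # the directory wget -m created

# Match url(...) references inside the downloaded stylesheets.
URL_RE = re.compile(r"url\(['\"]?([^)'\"]+)['\"]?\)")

for root, dirs, files in os.walk(MIRROR):
    for name in files:
        if not name.endswith(".css"):
            continue
        css_path = os.path.join(root, name)
        with open(css_path) as f:
            css = f.read()
        # url() paths are relative to the stylesheet, not the site root.
        css_rel = os.path.relpath(css_path, MIRROR).replace(os.sep, "/")
        css_url = urljoin(BASE, css_rel)
        for ref in URL_RE.findall(css):
            if ref.startswith(("http:", "https:", "data:")):
                continue
            remote = urljoin(css_url, ref)
            local = os.path.join(MIRROR, urlparse(remote).path.lstrip("/"))
            os.makedirs(os.path.dirname(local), exist_ok=True)
            urllib.request.urlretrieve(remote, local)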
But overall, this means I can now develop websites in my preferred
environment (Django) without worrying unduly about the client changing
his/her mind about it. In fact, I think this is now my preferred way
to develop websites, given that templates etc. make it much easier to
respond to changing requirements from clients.
HTH :)
--
James
Yeah, wget rocks ;)
When I was freelancing I used to use that trick all the time; I'd
build a site on my local computer using whatever dynamic/templating
tools I wanted, and then if the client wanted static files I'd pull
them off using wget.
--
"May the forces of evil become confused on the way to your house."
-- George Carlin
I was pondering the same thing just the other day - trying to get flat
pages from a Django site. At first I thought using *wget* would
suffice, but I also needed to do other things with the files
(archiving, uploading to FTP), so I needed some way of interacting with
the static pages after download.
The obvious thing might have been to find a Python wget module. No such
luck. Checking wget further
(http://en.wikipedia.org/wiki/Wget#Criticisms_of_Wget), I also found a
number of things that may get in the way of extracting complete pages.
For example, it only supports HTTP 1.0, so data referenced from JS and
CSS might not be extracted.
So I looked for an alternative. cURL comes to mind
(http://en.wikipedia.org/wiki/CURL), and libcurl is exposed to Python
through pycurl (http://pycurl.sourceforge.net/). Using this approach I
can not only download pages with greater flexibility but also script
the downloads in Python instead of relying on a shell call to wget.
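As a minimal sketch of what that scripting looks like (the URL is just
a placeholder and error handling is left out), a single page can be
pulled into memory with pycurl like this, ready to be archived or
pushed to FTP afterwards:

from io import BytesIO
import pycurl

def fetch(url):
    # Collect the response body in memory rather than writing to disk.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.perform()
    c.close()
    return buf.getvalue()

html = fetch("http://localhost:8000/")  # e.g. the local Django dev server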
Can you share the scripts you've written? I wonder if there's a generic
need for this kind of thing, but maybe what you've done is specific to
your setup... I'm quite taken with the idea of using Django as a
web-site factory regardless of the final deployment environment, so
maybe a write-up of the alternatives would find some readers.
--
James
wget -E --load-cookies /path/to/firefox/profiles/cookies.txt -r -k -l inf -N -p
-E = save HTML files with an .html extension
-r = recurse
-k = convert links for local viewing
-l inf = infinite recursion depth
-N = turn on timestamping (don't re-retrieve files unless newer than the local copy)
-p = get page requisites (images, CSS, etc. needed to display the page)
Other than that, I also created an "all objects" page that gets all my
objects that map to pages, calls get_absolute_url() on each one, and
prints out a list. I call wget on that URL to make sure it hits every
page in the database, just in case some pages aren't linked to directly
(e.g. when we have a JS redirect to a page depending on some user
interaction).
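A minimal sketch of what such an "all objects" view can look like (the
Article and Page models here are just stand-ins for whatever maps to
pages in your project):

from django.http import HttpResponse
from myapp.models import Article, Page  # stand-ins for your own models

def all_object_urls(request):
    # Collect get_absolute_url() for every object that maps to a page.
    urls = []
    for model in (Article, Page):
        urls.extend(obj.get_absolute_url() for obj in model.objects.all())
    items = "".join('<li><a href="%s">%s</a></li>' % (u, u) for u in urls)
    return HttpResponse("<ul>%s</ul>" % items)

Wire that up to a URL and point wget at it to pick up everything in the
list.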
-Rob
I missed this with pycurl and have yet to find an example that supports
it :( Then I scanned the curl FAQ and found 3.15 [0]:
> 3.15 Can I do recursive fetches with curl?
> http://curl.mirrors.cyberservers.net/docs/faq.html#3.15
This means that to use pycurl you need a list of URLs, which is awkward
because you have to parse each returned page yourself. I would have
thought curl could support this. Obviously not. One obvious way is to
point pycurl at a base URL, then for each returned page (assuming HTML)
parse it with HTMLParser [1], build a list of the URLs it contains, and
extract the pages that way.
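A rough skeleton of that idea, using the same pycurl fetch pattern as
above plus an HTMLParser subclass to collect links (writing pages to
disk, error handling and non-HTML content are all left out, and the
base URL is a placeholder):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from io import BytesIO
import pycurl

class LinkCollector(HTMLParser):
    # Gather href values from every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    # Same pycurl pattern as the earlier sketch.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue().decode("utf-8", "replace")

def crawl(base):
    # Fetch each page once, queueing any same-site links it contains.
    seen, queue, pages = set(), [base], {}
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute, _ = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc == urlparse(base).netloc:
                queue.append(absolute)
    return pages

pages = crawl("http://localhost:8000/")  # placeholder base URL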
Not quite what you asked. If you want a quick hack solution, do what
Rob Hudson suggests and use wget (which you probably did anyway), or
else try an old favourite of mine and use websucker [2]. The reason I
looked at pycurl was the obvious Django/pycurl integration (i.e. a
Django app that mirrors sites).
References
----------------
[0] curl FAQ, 'Can I do recursive fetches with curl? No.'
http://curl.mirrors.cyberservers.net/docs/faq.html#3.15
[Accessed Saturday, 6 January 2007]
[1] Python HTMLParser module, 'Parses text files in the format of HTML
and XHTML'
http://docs.python.org/lib/module-HTMLParser.html
[Accessed Saturday, 6 January 2007]
[2] Python websucker, creates a 'mirror copy of a remote site'
http://svn.python.org/view/python/trunk/Tools/webchecker/
[Accessed Saturday, 6 January 2007]