You gotta love a 2-line python solution

DFS

May 1, 2016, 11:39:33 PM

To save a webpage to a file:
-------------------------------------
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html","D:\file.html")
-------------------------------------

That's it!

Coming from VB/A background, some of the stuff you can do with python -
with ease - is amazing.

VBScript version
------------------------------------------------------
1. Option Explicit
2. Dim xmlHTTP, fso, fOut
3. Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
4. xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
5. xmlHTTP.Send
6. Set fso = CreateObject("Scripting.FileSystemObject")
7. Set fOut = fso.CreateTextFile("D:\file.html", True)
8. fOut.WriteLine xmlHTTP.ResponseText
9. fOut.Close
10. Set fOut = Nothing
11. Set fso = Nothing
12. Set xmlHTTP = Nothing
------------------------------------------------------

Technically, that VBS will run with just lines 3-9, but that's still 7
lines of code vs 2 for python.

Stephen Hansen

May 2, 2016, 12:31:57 AM

On Sun, May 1, 2016, at 08:39 PM, DFS wrote:
> To save a webpage to a file:
> -------------------------------------
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
> /ex/001.html","D:\file.html")
> -------------------------------------

Note, for paths on Windows you really want to use a raw string, i.e.,
r"D:\file.html".

--
Stephen Hansen
m e @ i x o k a i . i o

DFS

May 2, 2016, 12:51:38 AM

On 5/2/2016 12:31 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 08:39 PM, DFS wrote:
>> To save a webpage to a file:
>> -------------------------------------
>> 1. import urllib
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
>> /ex/001.html","D:\file.html")
>> -------------------------------------
>
> Note, for paths on windows you really want to use a rawstring. Ie,
> r"D:\file.html".

Thanks.

I actually use "D:\\file.html" in my code.

Stephen Hansen

May 2, 2016, 1:02:17 AM

Or you can do that. But the whole point of raw strings is not having to
escape backslashes :)

DFS

May 2, 2016, 1:08:36 AM

Nice. Where/how else is 'r' used?

I'm new to python, but I learned that one the hard way.

I was using "D\testfile.txt" for something, and my code kept failing.
Took me a while to figure it out. I tried various letters after the
slash. I finally stumbled across the escape sequences in the docs somewhere.

Stephen Hansen

May 2, 2016, 1:21:34 AM

On Sun, May 1, 2016, at 10:08 PM, DFS wrote:
> On 5/2/2016 1:02 AM, Stephen Hansen wrote:
> >> I actually use "D:\\file.html" in my code.
> >
> > Or you can do that. But the whole point of raw strings is not having to
> > escape slashes :)
>
>
> Nice. Where/how else is 'r' used?

Raw strings are primarily used A) for Windows paths, and more
universally, B) for regular expressions.

But in theory they're useful anywhere you have static/literal data that
might include backslashes where you don't actually intend to use any
escape characters.
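
To make the difference concrete, here is a minimal Python 3 sketch (the thread itself mixes Python 2 and 3, so the version is an assumption). It compares the path from the first post written as a normal literal, with a doubled backslash, and as a raw literal, and then shows the same effect for a regular expression. The variable names are only for illustration.

```python
import re

# In a normal literal "\f" is the form-feed escape, so the backslash and the
# "f" collapse into a single control character.
plain = "D:\file.html"
escaped = "D:\\file.html"   # doubled backslash: one literal backslash
raw = r"D:\file.html"       # raw literal: the backslash is kept as typed

print(repr(plain))            # 'D:\x0cile.html'
print(escaped == raw)         # True
print(len(plain), len(raw))   # 11 12

# Same idea for regular expressions: "\b" is a backspace character in a
# normal literal, but a word-boundary marker when the regex engine sees it.
print(re.search("\bfile\b", "D:/file.html"))            # None
print(re.search(r"\bfile\b", "D:/file.html").group())   # file
```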

DFS

May 2, 2016, 1:23:58 AM

Trying the rawstring thing (say it fast 3x):

webpage = "http://econpy.pythonanywhere.com/ex/001.html"

----------------------------------------------------
webfile = "D:\\econpy001.html"
urllib.urlretrieve(webpage,webfile) WORKS
----------------------------------------------------
webfile = "rD:\econpy001.html"
urllib.urlretrieve(webpage,webfile) FAILS
----------------------------------------------------
webfile = "D:\econpy001.html"
urllib.urlretrieve(webpage,"r" + webfile) FAILS
----------------------------------------------------
webfile = "D:\econpy001.html"
urllib.urlretrieve(webpage,"r" + "" + webfile + "") FAILS
----------------------------------------------------

The FAILs throw:

Traceback (most recent call last):
  File "webscraper.py", line 54, in <module>
    urllib.urlretrieve(webpage,webfile)
  File "D:\development\python\python_2.7.11\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "D:\development\python\python_2.7.11\lib\urllib.py", line 249, in retrieve
    tfp = open(filename, 'wb')
IOError: [Errno 22] invalid mode ('wb') or filename: 'rD:\\econpy001.html'
----------------------------------------------------

What am I doing wrong?

Stephen Hansen

May 2, 2016, 1:37:19 AM

On Sun, May 1, 2016, at 10:23 PM, DFS wrote:
> Trying the rawstring thing (say it fast 3x):
>
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
>
> ----------------------------------------------------
> webfile = "D:\\econpy001.html"
> urllib.urlretrieve(webpage,webfile) WORKS
> ----------------------------------------------------
> webfile = "rD:\econpy001.html"

The r is *outside* the string.

It's: r"D:\econpy001.html"
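
A short Python 3 sketch of why the FAIL cases above fail: the r prefix is part of the literal's syntax, so putting it inside the quotes or concatenating "r" at runtime just adds the letter r to the filename, which matches the 'rD:\\econpy001.html' in the traceback. The names good, same and bad are illustrative only.

```python
# The r goes in front of the opening quote; it is not a character in the string.
good = r"D:\econpy001.html"
same = "D:\\econpy001.html"   # the escaped spelling of the same path

# Prepending "r" at runtime just builds a different string.
bad = "r" + same

print(good == same)          # True
print(bad)                   # rD:\econpy001.html
print(bad.startswith("r"))   # True
```

The unprefixed "D:\econpy001.html" only appeared to work because \e is not a recognized escape sequence, so the backslash was left in place; that is not something to rely on.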

Steven D'Aprano

May 2, 2016, 1:52:09 AM

On Monday 02 May 2016 15:21, Stephen Hansen wrote:

> On Sun, May 1, 2016, at 10:08 PM, DFS wrote:
>> On 5/2/2016 1:02 AM, Stephen Hansen wrote:
>> >> I actually use "D:\\file.html" in my code.
>> >
>> > Or you can do that. But the whole point of raw strings is not having to
>> > escape slashes :)
>>
>>
>> Nice. Where/how else is 'r' used?
>
> Raw strings are primarily used A) for windows paths, and more
> universally, B) for regular expressions.

Raw strings are designed for regular expressions. They can be used for
Windows paths, except for one minor gotcha: you can't end a raw string with
an odd number of backslashes. So this doesn't work:

directory = r'D:\some\path\dir\'

So it's more of a half-cooked string than a raw string.

--
Steve
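
The trailing-backslash gotcha can be checked without breaking the file the check lives in. This Python 3 sketch passes the offending line to compile() as text and then shows one possible workaround (appending the final backslash from a normal literal). The variable names are illustrative.

```python
# The text of the failing statement: directory = r'D:\some\path\dir\'
source = "directory = r'D:\\some\\path\\dir\\'"

try:
    compile(source, "<example>", "exec")
    compiled = True
except SyntaxError:
    # Even in a raw literal, a backslash before the closing quote keeps the
    # quote from terminating the string, so the literal never ends.
    compiled = False

print(compiled)   # False

# One workaround: keep the raw part and add the last backslash separately.
directory = r'D:\some\path\dir' + '\\'
print(directory)  # D:\some\path\dir\
```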

DFS

May 2, 2016, 2:13:25 AM

Got it. Thanks.

Terry Reedy

May 2, 2016, 2:46:59 AM

On 5/2/2016 12:31 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 08:39 PM, DFS wrote:

>> To save a webpage to a file:
>> -------------------------------------
>> 1. import urllib
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
>> /ex/001.html","D:\file.html")
>> -------------------------------------
>

> Note, for paths on windows you really want to use a rawstring. Ie,
> r"D:\file.html".

Or use forward slashes "D:/file.html" and avoid the issue. I don't know
of anywhere this does not work for file names sent from python directly
to Windows.

--
Terry Jan Reedy
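
As a rough illustration of the forward-slash suggestion, pathlib's PureWindowsPath (Python 3 standard library) treats both spellings as the same Windows path. It only manipulates the path text, so the sketch runs on any OS; it does not prove anything about a particular Windows API.

```python
from pathlib import PureWindowsPath

forward = PureWindowsPath("D:/file.html")    # no escaping needed
backward = PureWindowsPath(r"D:\file.html")  # raw literal with a backslash

print(forward == backward)   # True
print(str(forward))          # D:\file.html
```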

BartC

May 2, 2016, 5:26:32 AM

It seems Python provides a higher level solution compared with VBS.
Python presumably also has to do those Opens and Sends, but they are
hidden away inside urllib.urlretrieve.

You can do the same with VB just by wrapping up these lines in a
subroutine. As you would if this had to be executed in a dozen different
places for example. Then you could just write:

getfile("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)

(I don't know VBS; I assume it does /have/ subroutines? What I haven't
factored in here is error handling which might yet require more coding
in VBS compared with Python)

--
Bartc
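
For comparison, a Python 3 sketch of the same "wrap it in a subroutine" idea with basic error handling. The name getfile comes from the post above; the return-a-boolean convention and the printed message are assumptions made for the example, not anything urllib prescribes.

```python
from urllib.request import urlretrieve


def getfile(url, filename):
    """Save url to filename; return True on success, False on failure."""
    try:
        urlretrieve(url, filename)
    except OSError as exc:
        # urllib.error.URLError and HTTPError are subclasses of OSError,
        # as are local file errors such as a bad destination path.
        print("download failed:", exc)
        return False
    return True


# Example call (needs network access, so it is left commented out):
# getfile("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html")
```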

Marko Rauhamaa

May 2, 2016, 6:12:19 AM

BartC <b...@freeuk.com>:

> On 02/05/2016 04:39, DFS wrote:
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
>> /ex/001.html","D:\file.html")

> [...]

>
> It seems Python provides a higher level solution compared with VBS.
> Python presumably also has to do those Opens and Sends, but they are
> hidden away inside urllib.urlretrieve.

Relevant questions include:

* Is a solution available?

* Is the solution well thought out?

Python does have a lot of great stuff available, which is nice.
Unfortunately, many of the handy facilities are lacking in the
well-thought-out department.

For example, the urlretrieve() function above blocks. You can't use it
with the asyncio or select modules. You are left with:

<URL: https://docs.python.org/3/library/asyncio-stream.html#get-http-headers>

Database facilities are notorious offenders. Also, json.load and
json.loads don't allow you to decode JSON in chunks.

If asyncio breaks through, I expect all blocking stdlib function calls
to be adapted for it over the coming years. I'm not overly fond of the
asyncio programming model, but it does sport two new killer features:

* any blocking operation can be interrupted

* events can be multiplexed

Marko
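
One common workaround, sketched here for Python 3.7+, is to hand the blocking urlretrieve call to a worker thread with loop.run_in_executor so the event loop can keep multiplexing other tasks. The helper names fetch and fetch_many are hypothetical. This does not make the download itself interruptible; the call still blocks inside its thread.

```python
import asyncio
from urllib.request import urlretrieve


async def fetch(url, filename):
    """Run the blocking urlretrieve in the default thread pool."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    path, _headers = await loop.run_in_executor(None, urlretrieve, url, filename)
    return path


async def fetch_many(pairs):
    """Download several (url, filename) pairs concurrently."""
    return await asyncio.gather(*(fetch(url, name) for url, name in pairs))


# Example call (needs network access, so it is left commented out):
# asyncio.run(fetch_many([
#     ("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html"),
# ]))
```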

Steven D'Aprano

May 2, 2016, 8:10:34 AM

On Mon, 2 May 2016 08:12 pm, Marko Rauhamaa wrote:

> For example, the urlretrieve() function above blocks. You can't use it
> with the asyncio or select modules.

The urlretrieve function is one of the oldest functions in the std library.
It literally only exists because Guido was working on a computer somewhere,
found that he didn't have wget, and decided it would be faster to write his
own in Python than download and install wget.

And because this was very early in Python's history, the barrier to getting
into the std lib was much less, especially for stuff Guido wrote himself,
so there it is. These days, I doubt it would be included. It would probably
be a recipe in the docs.

Compared to a full-featured tool like wget or curl, urlretrieve is missing a
lot of stuff which is considered essential, like limiting/configuring the
rate, support for cookies and authentication, retrying on error, etc.

--
Steven
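
As an illustration of how much is left to the caller, here is a Python 3 sketch of retrying on error around urlretrieve. The function name, the attempts and delay parameters, the linear backoff, and the injectable fetch argument are all assumptions made for the example; none of them is part of urllib.

```python
import time
from urllib.request import urlretrieve


def retrieve_with_retries(url, filename, attempts=3, delay=1.0, fetch=urlretrieve):
    """Call fetch(url, filename), retrying on OSError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url, filename)
        except OSError:
            # URLError and HTTPError are OSError subclasses.
            if attempt == attempts:
                raise  # out of retries: let the caller see the last error
            time.sleep(delay * attempt)  # wait a little longer each time


# Example call (needs network access, so it is left commented out):
# retrieve_with_retries("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html")
```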

DFS

May 2, 2016, 11:15:22 AM

Of course. Taken to its extreme, I could eventually replace you with
one line of code :)

But python does it for me. That would save me 8 lines...

> (I don't know VBS; I assume it does /have/ subroutines? What I haven't
> factored in here is error handling which might yet require more coding
> in VBS compared with Python)

Yeah, VBS has subs and functions. And strange, limited error handling.
And a single data type, called Variant. But it's installed with Windows
so it's easy to get going with.

Larry Martell

May 2, 2016, 11:25:38 AM

On Mon, May 2, 2016 at 11:15 AM, DFS <nos...@dfs.com> wrote:
> Of course. Taken to its extreme, I could eventually replace you with one
> line of code :)

That reminds me of something I heard many years ago.

Every non-trivial program can be simplified by at least one line of code.
Every non-trivial program has at least one bug.

Therefore every non-trivial program can be reduced to one line of code
with a bug.

Manolo Martínez

May 2, 2016, 11:39:41 AM

On 05/02/16 at 11:24am, Larry Martell wrote:
> That reminds me of something I heard many years ago.
>
> Every non-trivial program can be simplified by at least one line of code.
> Every non trivial program has at least one bug.
>
> Therefore every non-trivial program can be reduced to one line of code
> with a bug.

Well, not really. Every non-trivial program can be reduced to one line
of code, but then the resulting program is not non-trivial (as it cannot
be further reduced), and therefore there are no guarantees that it will
have a bug.

M

jf...@ms4.hinet.net

May 2, 2016, 8:45:27 PM

DFS at 2016/5/2 UTC+8 11:39:33AM wrote:
> To save a webpage to a file:
> -------------------------------------
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
> /ex/001.html","D:\file.html")
> -------------------------------------
>
> That's it!

Why can't my system do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlretrieve
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'urlretrieve'

DFS

May 2, 2016, 9:12:24 PM

try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3

I'm running python 2.7.11 (32-bit)
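
If the same script has to run on both interpreters, one common pattern (a sketch, not the only option) is to try the Python 3 location first and fall back to the Python 2 one:

```python
try:
    # Python 3: urlretrieve lives in the urllib.request submodule.
    from urllib.request import urlretrieve
except ImportError:
    # Python 2: it is a function of the top-level urllib module.
    from urllib import urlretrieve

print(urlretrieve.__name__)   # urlretrieve
```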

jf...@ms4.hinet.net

May 2, 2016, 11:28:01 PM

DFS at 2016/5/3 9:12:24AM wrote:
> try
>
> from urllib.request import urlretrieve
>
> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
>
>
> I'm running python 2.7.11 (32-bit)

All right, it works... in a way.

I tried to get a zip file. It works, and the file can be unzipped correctly.

>>> from urllib.request import urlretrieve
>>> urlretrieve("http://www.caprilion.com.tw/fed.zip", "d:\\temp\\temp.zip")
('d:\\temp\\temp.zip', <http.client.HTTPMessage object at 0x03102C50>)
>>>

But when I try to get this forum page, I do get an HTML file, but it can't be viewed normally.

>>> urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', <http.client.HTTPMessage object at 0x03102A90>)
>>>

I suppose HTML is a much more complex situation, where more processing needs to be done before it can be opened by a web browser :-)

Stephen Hansen

May 2, 2016, 11:49:22 PM

On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> But when I try to get this forum page, it does get a html file but can't
> be viewed normally.

What does that mean?

DFS

May 2, 2016, 11:57:07 PM

Who knows what Google has done... it won't open in Opera. The tab title
shows up, but after 20-30 seconds the screen just stays blank and the
cursor quits loading.

It's a mess - try running it thru BeautifulSoup.prettify() and it looks
better.

------------------------------------------------------------
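# Python 2.7 with BeautifulSoup 3 (on Python 3, import urlretrieve from urllib.request instead)
import urllib
import BeautifulSoup
webfile = "D:\\afile.html"
# download the page to a local file
urllib.urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A",webfile)
# parse the saved file and print it re-indented
f = open(webfile)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()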
------------------------------------------------------------

jf...@ms4.hinet.net

May 2, 2016, 11:57:33 PM

The page we are looking at:-)
https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A

Stephen Hansen

May 3, 2016, 12:09:56 PM

Try scraping gmane. Google Groups is one big javascript application.

Steven D'Aprano

May 3, 2016, 9:20:45 PM

On Tue, 3 May 2016 01:56 pm, DFS wrote:

> On 5/2/2016 11:27 PM, jf...@ms4.hinet.net wrote:
>> DFS at 2016/5/3 9:12:24AM wrote:
>>> try
>>>
>>> from urllib.request import urlretrieve
>>>
>>> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
>>>
>>>
>>> I'm running python 2.7.11 (32-bit)
>>
>> Alright, it works...someway.
>>
>> I try to get a zip file. It works, the file can be unzipped correctly.
>>
>>>>> from urllib.request import urlretrieve
>>>>> urlretrieve("http://www.caprilion.com.tw/fed.zip",
>>>>> "d:\\temp\\temp.zip")
>> ('d:\\temp\\temp.zip', <http.client.HTTPMessage object at 0x03102C50>)
>>>>>
>>
>> But when I try to get this forum page, it does get a html file but can't
>> be viewed normally.
>>
>>>>> urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", "d:\\temp\\temp.html")
>> ('d:\\temp\\temp.html', <http.client.HTTPMessage object at 0x03102A90>)
>>>>>
>>
>> I suppose the html is a much complex situation where more processes need
>> to be done before it can be opened by a web browser:-)
>
>
> Who knows what Google has done... it won't open in Opera. The tab title
> shows up, but after 20-30 seconds the screen just stays blank and the
> cursor quits loading.

Dennis has given the answer to this, but since he has X-No-Archive=Yes, his
useful and well-written answer will be lost forever.

So I've taken the liberty of copying his answer here:

Dennis Lee Bieber says:

There's practically no HTML in that page -- just miles of Javascript.
The one obvious item is:

-=-=-=-=-=-
<script type="text/javascript" language="javascript"
src="/forum/C53652DA8B67255A46256B72F0D65A40.cache.js">

</script>
-=-=-=-=-=-

which is a RELATIVE path. If you copied the file to your machine and then
load it in a browser, it will be looking for

/forum/C53652DA8B67255A46256B72F0D65A40.cache.js

to be on your machine in a subdirectory of where you saved the main file.

You'd have to recreate most of the Google environment and fetch
anything that was referenced through a relative path first, to get the
content to display. Of course, you may find, for example, that the
Javascript at some point is doing a database lookup -- and you'd maybe have
to now duplicate the database...

--
Steven
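
Two details of that URL can be checked with the standard library; this is a small Python 3 sketch using urllib.parse. Everything after the # is a fragment, which HTTP clients do not send to the server, so the request is only for the /forum/ application page. The script's src can also be resolved against the page URL to see where a browser would fetch it from when the page is loaded from Google rather than from a local copy.

```python
from urllib.parse import urldefrag, urljoin

page = "https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A"
script_src = "/forum/C53652DA8B67255A46256B72F0D65A40.cache.js"

# Split off the fragment: only the first part goes over the wire.
requested, fragment = urldefrag(page)
print(requested)   # https://groups.google.com/forum/
print(fragment)    # !topic/comp.lang.python/jFl3GJbmR7A

# Resolve the script reference against the original page URL.
print(urljoin(page, script_src))
# https://groups.google.com/forum/C53652DA8B67255A46256B72F0D65A40.cache.js
```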