
Any python scripts to do parallel downloading?

Frank Potter

Jan 31, 2007, 11:23:57 AM
I want to find a multithreaded downloading lib in python,
can someone recommend one for me, please?
Thanks~

Jean-Paul Calderone

Jan 31, 2007, 11:37:05 AM
to pytho...@python.org

There are no threads, but perhaps http://jcalderone.livejournal.com/24285.html
would be interesting to you.

Jean-Paul

Michele Simionato

Jan 31, 2007, 2:17:04 PM

Why do you want to use threads for that? Twisted is the
obvious solution for your problem, but you may use any
asynchronous framework, for instance the good ol'
Tkinter:

"""
Example of asynchronous programming with Tkinter. Download 10 times
the same URL.
"""

import sys, urllib, itertools, Tkinter

URL = 'http://docs.python.org/dev/lib/module-urllib.html'

class Downloader(object):
    chunk = 1024

    def __init__(self, urls, frame):
        self.urls = urls
        self.downloads = [self.download(i) for i in range(len(urls))]
        self.tkvars = []
        self.tklabels = []
        for url in urls:
            var = Tkinter.StringVar(frame)
            lbl = Tkinter.Label(frame, textvar=var)
            lbl.pack()
            self.tkvars.append(var)
            self.tklabels.append(lbl)
        frame.pack()

    def download(self, i):
        src = urllib.urlopen(self.urls[i])
        size = int(src.info()['Content-Length'])
        for block in itertools.count():
            chunk = src.read(self.chunk)
            if not chunk: break
            percent = block * self.chunk * 100/size
            msg = '%s: downloaded %2d%% of %s K' % (
                self.urls[i], percent, size/1024)
            self.tkvars[i].set(msg)
            yield None
        self.tkvars[i].set('Downloaded %s' % self.urls[i])

if __name__ == '__main__':
    root = Tkinter.Tk()
    frame = Tkinter.Frame(root)
    downloader = Downloader([URL] * 10, frame)
    def next(cycle):
        try:
            cycle.next().next()
        except StopIteration:
            pass
        root.after(50, next, cycle)
    root.after(0, next, itertools.cycle(downloader.downloads))
    root.mainloop()


Michele Simionato

Carl J. Van Arsdall

Jan 31, 2007, 2:31:41 PM
to pytho...@python.org
Michele Simionato wrote:
> On Jan 31, 5:23 pm, "Frank Potter" <could....@gmail.com> wrote:
>
>> I want to find a multithreaded downloading lib in python,
>> can someone recommend one for me, please?
>> Thanks~
>>
>
> Why do you want to use threads for that? Twisted is the
> obvious solution for your problem, but you may use any
> asynchronous framework, as for instance the good ol
>
Well, since it will be io based, why not use threads? They are easy to
use and it would do the job just fine. Then leverage some other
technology on top of that.

You could go as far as using wget via os.system() in a thread, if the
app is simple enough.

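import os
import threading

# placeholder list; the original post did not define websiteList
websiteList = ['http://www.python.org/']

def getSite(site):
    # interpolate the URL into the command (the original passed it
    # to os.system as a second argument, which raises TypeError)
    os.system('wget %s' % site)

threadList = []
for site in websiteList:
    threadList.append(threading.Thread(target=getSite, args=(site,)))

for thread in threadList:
    thread.start()

for thread in threadList:
    thread.join()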


--

Carl J. Van Arsdall
cvana...@mvista.com
Build and Release
MontaVista Software

Carl Banks

Jan 31, 2007, 3:24:21 PM
Michele Simionato wrote:
> On Jan 31, 5:23 pm, "Frank Potter" <could....@gmail.com> wrote:
> > I want to find a multithreaded downloading lib in python,
> > can someone recommend one for me, please?
> > Thanks~
>
> Why do you want to use threads for that? Twisted is the
> obvious solution for your problem,

Overkill? Just to download a few web pages? You've got to be
kidding.

> but you may use any
> asynchronous framework, as for instance the good ol
> Tkinter:

Well, of all the things you can use threads for, this is probably the
simplest, so I don't see any reason to prefer an asynchronous method
unless you're used to it. One Queue for dispatching should be enough
to synchronize everything; maybe a Queue or simple lock at the end as well
depending on the need.

The OP might not even care whether it's threaded or asynchronous.
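
For reference, the Queue-based design described above could look roughly like this in present-day Python 3. This is only a sketch under stated assumptions: `fetch_one`, `download_all`, and the `num_workers` and `fetcher` parameters are hypothetical names, not from any library, and error handling is reduced to storing the exception.

```python
import queue
import threading
import urllib.request


def fetch_one(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, num_workers=4, fetcher=fetch_one):
    """Download every URL with a fixed pool of worker threads.

    Returns a dict mapping each URL to its bytes, or to the exception
    that was raised while fetching it.
    """
    jobs = queue.Queue()             # the single dispatch queue
    results = {}
    results_lock = threading.Lock()  # the "simple lock at the end"

    def worker():
        while True:
            url = jobs.get()
            if url is None:          # sentinel: no more work for this thread
                break
            try:
                outcome = fetcher(url)
            except Exception as exc:
                outcome = exc
            with results_lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)               # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Calling `download_all(list_of_urls)` blocks until every worker has finished. Duplicate URLs collapse into one dictionary entry, which is acceptable for a throwaway script but worth knowing.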


Carl Banks

Jean-Paul Calderone

Jan 31, 2007, 3:37:58 PM
to pytho...@python.org
On 31 Jan 2007 12:24:21 -0800, Carl Banks <pavlove...@gmail.com> wrote:
>Michele Simionato wrote:
>> On Jan 31, 5:23 pm, "Frank Potter" <could....@gmail.com> wrote:
>> > I want to find a multithreaded downloading lib in python,
>> > can someone recommend one for me, please?
>> > Thanks~
>>
>> Why do you want to use threads for that? Twisted is the
>> obvious solution for your problem,
>
>Overkill? Just to download a few web pages? You've got to be
>kidding.

Better "overkill" (whatever that is) than wasting time re-implementing
the same boring thing over and over for no reason.

Jean-Paul

Carl J. Van Arsdall

Jan 31, 2007, 4:52:35 PM
to pytho...@python.org

How is that a waste of time? I wrote the script to do it in 10 lines.
What is a waste of time is learning a whole new technology/framework to
do a simple task that can be scripted in 4 minutes.

-c

Jean-Paul Calderone

Jan 31, 2007, 5:20:16 PM
to pytho...@python.org

You're right. Learning new things is bad. My mistake.

Jean-Paul

Carl J. Van Arsdall

Jan 31, 2007, 6:13:59 PM
to pytho...@python.org
Jean-Paul Calderone wrote:
> [snip]

>>
>>
>
> You're right. Learning new things is bad. My mistake.
>
> Jean-Paul
>
That isn't what I said at all. You have to look at it from a
cost/benefit relationship. It's a waste of time/money to learn something
complex to do something simple. For the simple things, use a simple
solution. KISS. When he has an application that would require
something more complex, it would be at that point he should consider
using it for a project. If the OP has a desire to learn this
technology, then more power to him. I, however, do not believe that
would be the best approach for a simple problem.

Knowing the appropriate tool for the job is a trait of a good engineer.

--

Carl J. Van Arsdall
cvana...@mvista.com

Jean-Paul Calderone

Jan 31, 2007, 8:10:29 PM
to pytho...@python.org
On Wed, 31 Jan 2007 15:13:59 -0800, "Carl J. Van Arsdall" <cvana...@mvista.com> wrote:
>Jean-Paul Calderone wrote:
>> [snip]
>>>
>>>
>>
>> You're right. Learning new things is bad. My mistake.
>>
>> Jean-Paul
>>
>That isn't what I said at all. You have to look at it from a
>cost/benefit relationship. It's a waste of time/money to learn something
>complex to do something simple. For the simple things, use a simple
>solution. KISS. When he has an application that would require
>something more complex, it would be at that point he should consider
>using it for a project. If the OP has a desire to learn this
>technology, then more power to him. I, however, do not believe that
>would be the best approach for a simple problem.
>
>Knowing the appropriate tool for the job is a trait of a good engineer.
>

You are assuming that he already knows how to use threads, and so there
is no investment required for a threaded solution. In my experience, it's
much safer to assume the opposite. _Even_ (often _especially_) when a
threaded solution is explicitly requested.

Jean-Paul

Carl J. Van Arsdall

Jan 31, 2007, 8:19:07 PM
to pytho...@python.org
I have a bit more confidence in python threads, but that takes us back
to the age old debate on this list. So we agree to disagree.

-c

--

Carl J. Van Arsdall
cvana...@mvista.com

Jean-Paul Calderone

Jan 31, 2007, 10:09:19 PM
to pytho...@python.org
On Wed, 31 Jan 2007 17:19:07 -0800, "Carl J. Van Arsdall" <cvana...@mvista.com> wrote:
>Jean-Paul Calderone wrote:
>> On Wed, 31 Jan 2007 15:13:59 -0800, "Carl J. Van Arsdall" <cvana...@mvista.com> wrote:
>>
>>> Jean-Paul Calderone wrote:
>>>
>>>> [snip]
>>>>
>>>>>
>>>> You're right. Learning new things is bad. My mistake.
>>>>
>>>> Jean-Paul
>>>>
>>>>
>>> That isn't what I said at all. You have to look at it from a
>>> cost/benefit relationship. It's a waste of time/money to learn something
>>> complex to do something simple. For the simple things, use a simple
>>> solution. KISS. When he has an application that would require
>>> something more complex, it would be at that point he should consider
>>> using it for a project. If the OP has a desire to learn this
>>> technology, then more power to him. I, however, do not believe that
>>> would be the best approach for a simple problem.
>>>
>>> Knowing the appropriate tool for the job is a trait of a good engineer.
>>>
>>>
>>
>> You are assuming that he already knows how to use threads, and so there
>> is no investment required for a threaded solution. In my experience, it's
>> much safer to assume the opposite. _Even_ (often _especially_) when a
>> threaded solution is explicitly requested.
>>
>I have a bit more confidence in python threads, but that takes us back
>to the age old debate on this list. So we agree to disagree.
>

You misunderstand. I wasn't expressing a lack of confidence in Python
threads, but in the facility with which they can be used by programmers.

Jean-Paul

Aahz

Jan 31, 2007, 10:12:59 PM
In article <mailman.3386.1170299...@python.org>,

Jean-Paul Calderone <exa...@divmod.com> wrote:
>
>You misunderstand. I wasn't expressing a lack of confidence in Python
>threads, but in the facility with which they can be used by programmers.

Based on my admittedly limited experience, I say the same about Twisted.
Although I was able to bring up a Twisted 1.1 web server in a hurry under
extreme pressure (15 minutes before a PyCon presentation), I have never
been able to even get Twisted 2.0 installed.

Software is hard.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"I disrespectfully agree." --SJM

Jean-Paul Calderone

Jan 31, 2007, 10:21:00 PM
to pytho...@python.org
On 31 Jan 2007 19:12:59 -0800, Aahz <aa...@pythoncraft.com> wrote:
>In article <mailman.3386.1170299...@python.org>,
>Jean-Paul Calderone <exa...@divmod.com> wrote:
>>
>>You misunderstand. I wasn't expressing a lack of confidence in Python
>>threads, but in the facility with which they can be used by programmers.
>
>Based on my admittedly limited experience, I say the same about Twisted.
>Although I was able to bring up a Twisted 1.1 web server in a hurry under
>extreme pressure (15 minutes before a PyCon presentation), I have never
>been able to even get Twisted 2.0 installed.

FWIW, this is probably even easier than when you last tried it:

exarkun@kunai:~$ twistd -n web --port 8080 --path /tmp
2007-01-31 22:19:34-0500 [-] Log opened.
2007-01-31 22:19:34-0500 [-] twistd 2.5.0+r19505 (/usr/bin/python 2.4.3) starting up
2007-01-31 22:19:34-0500 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2007-01-31 22:19:34-0500 [-] twisted.web.server.Site starting on 8080
2007-01-31 22:19:34-0500 [-] Starting factory <twisted.web.server.Site instance at 0xb79c646c>

>Software is hard.

But I absolutely agree with this point, anyway :) Software is _crazy_
hard. I merely dispute the claim that threads are somehow _easier_. :)

Jean-Paul

Michele Simionato

Feb 1, 2007, 12:40:13 AM
On Jan 31, 9:24 pm, "Carl Banks" <pavlovevide...@gmail.com> wrote:
> Well, of all the things you can use threads for, this is probably the
> simplest, so I don't see any reason to prefer an asynchronous method
> unless you're used to it.

Well, actually there is a reason why I prefer the asynchronous approach
even for the simplest things: I can stop my program at any time with
CTRL-C. When developing a threaded program, either I implement a
mechanism for stopping the threads (which should be safe enough to
survive the bugs introduced while I develop, BTW), or I have to resort
to kill -9, and I *hate* that. Especially since kill -9 does not honor
try .. finally statements.

In short, I prefer to avoid threads, *especially* for the simplest
things. I use threads only when I am forced to, typically when I am
using a multithreaded framework interacting with a database.
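
To make the "mechanism for stopping the threads" concrete, here is a minimal Python 3 sketch of one common approach, a shared `threading.Event` that the worker checks between items. The function name `run_until_interrupted` and its parameters are hypothetical, and the sketch assumes each work item is short enough that checking the flag between items is responsive.

```python
import threading


def run_until_interrupted(task, items, stop=None, poll_interval=0.1):
    """Apply task to each item in a worker thread, stopping early on Ctrl-C.

    Returns the list of results produced before the worker stopped.
    """
    if stop is None:
        stop = threading.Event()
    done = []

    def worker():
        for item in items:
            if stop.is_set():        # cooperative check between items
                return
            done.append(task(item))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    try:
        # Joining with a short timeout keeps the main thread in Python
        # code often enough to notice a KeyboardInterrupt promptly.
        while thread.is_alive():
            thread.join(poll_interval)
    except KeyboardInterrupt:
        stop.set()                   # ask the worker to finish up
        thread.join()                # wait for its current item to end
    return done
```

Because the worker exits on its own, any try .. finally blocks inside `task` still run, which is exactly what kill -9 would not give you.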

Michele Simionato

Michele Simionato

Feb 1, 2007, 1:02:36 AM
On Jan 31, 8:31 pm, "Carl J. Van Arsdall" <cvanarsd...@mvista.com>
wrote:

>
> Well, since it will be io based, why not use threads? They are easy to
> use and it would do the job just fine. Then leverage some other
> technology on top of that.
>
> You could go as far as using wget via os.system() in a thread, if the
> app is simple enough.

Calling os.system in a thread looks really perverse to me: you would
lose CTRL-C without any benefit.
Why not use subprocess.Popen instead?
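
A minimal sketch of the subprocess.Popen alternative, written for Python 3. The helper name `run_parallel` is hypothetical; it assumes each command is given as an argument list and that the external program (wget in this thread) is installed.

```python
import subprocess


def run_parallel(commands):
    """Start every command at once, wait for all, return their exit codes."""
    # Popen returns immediately, so all children run concurrently
    # without any threads in the Python process.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        return [proc.wait() for proc in procs]
    except KeyboardInterrupt:
        # CTRL-C reaches the main thread as usual; clean up the children.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
        for proc in procs:
            proc.wait()
        raise
```

For the downloads discussed here, the call would be along the lines of `run_parallel([["wget", url] for url in urls])`. Passing an argument list instead of a shell string also avoids quoting problems with unusual URLs.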

I am unhappy with the current situation in Python. Whereas for most
things Python is such that the simplest things look simple, this is not
the case for threads. Unfortunately we have a threading module in the
standard library, but not a "Twisted for pedestrians" module, so people
overlook the simplest solution in favor of the complex one.

Another thing I miss is a facility to run an iterator in the Tkinter
mainloop: since Tkinter is not thread-safe, writing a multiple-download
progress bar in Tkinter using threads is definitely less obvious than
running an iterator in the main loop, as I discovered the hard way.
Writing a facility to run iterators in Twisted is a three-liner, but it
is not already there, nor standard :-(

Michele Simionato

Jan Danielsson

Feb 1, 2007, 3:33:36 AM
Jean-Paul Calderone wrote:
[---]

>> Software is hard.
>
> But I absolutely agree with this point, anyway :) Software is _crazy_
> hard. I merely dispute the claim that threads are somehow _easier_. :)

Threads aren't easier. Nor are they harder. They are just different.
I used to be heavily into OS/2 programming. In OS/2, you use threads
heavily - almost by tradition. Its relatively low context switch latency
and its nice set of IPC routines (almost all APIs are thread safe and
reentrant) make developing multithreaded applications quite natural.

Guess what happened when I started programming on NetBSD and Windows.
I struggled to write singlethreaded applications(!). I was so used to
kicking off a worker thread as soon as I needed to do something that I
knew could just as well be done in the background. And I *constantly*
thought in terms of "How could I make full use of an SMP system?".

I would never claim that multithreading is *easier* than
singlethreading. It's merely a different way of thinking.

OTOH, multithreading does have a steeper learning curve. But once you
get past that, there's really not a lot of difference, IMHO. YMMV.

--
Kind regards,
Jan Danielsson

Jean-Paul Calderone

Feb 1, 2007, 7:43:20 AM
to pytho...@python.org

Have you seen the recently introduced twisted.internet.task.coiterate()?
It sounds like it might be what you're after.

Jean-Paul

Michele Simionato

Feb 1, 2007, 8:14:18 AM
On Feb 1, 1:43 pm, Jean-Paul Calderone <exar...@divmod.com> wrote:

> On 31 Jan 2007 22:02:36 -0800, Michele Simionato <michele.simion...@gmail.com> wrote:
> >Another thing I miss is a facility to run an iterator in the Tkinter
> >mainloop: since Tkinter is not thread-safe,
> >writing a multiple-download progress bar in Tkinter using threads is
> >definitely less obvious than running
> >an iterator in the main loop, as I discovered the hard way. Writing a
> >facility to run iterators in Twisted
> >is a three-liner, but it is not already there, nor standard :-(
>
> Have you seen the recently introduced twisted.internet.task.coiterate()?
> It sounds like it might be what you're after.

Oops! There is a misprint here, I meant "writing a facility to run
iterators in TKINTER", not in Twisted. Twisted already has everything,
even too much. I would like to have better support for asynchronous
programming in the standard library, for people not needing the full
power of Twisted. I also like to keep my dependencies at a minimum.
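
As an illustration of how small such a facility for Tkinter can be, here is a sketch in Python 3. `run_iterator` and `demo` are hypothetical names, not part of tkinter; the helper relies only on the widget's standard `after(ms, callback)` method.

```python
def run_iterator(widget, iterator, delay_ms=50):
    """Advance iterator one step per event-loop tick until it is exhausted."""
    def step():
        try:
            next(iterator)
        except StopIteration:
            return                       # finished: do not reschedule
        widget.after(delay_ms, step)     # give control back to the main loop
    widget.after(0, step)


def demo():
    """Small GUI demo (needs a display); not called automatically."""
    import tkinter
    root = tkinter.Tk()
    var = tkinter.StringVar(root)
    tkinter.Label(root, textvariable=var).pack()

    def count_up():
        for n in range(1, 11):
            var.set("step %d of 10" % n)
            yield                        # pause here until the next tick

    run_iterator(root, count_up(), delay_ms=200)
    root.mainloop()
```

Each generator does one slice of work per `yield`, so the GUI stays responsive and CTRL-C keeps working, with no threads involved.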

Michele Simionato

Carl Banks

Feb 1, 2007, 9:14:40 AM
On Jan 31, 3:37 pm, Jean-Paul Calderone <exar...@divmod.com> wrote:

"I need to download some web pages in parallel."

"Here's a tremendously large and complex framework. Download, install,
and learn this large and complex framework. Then you can write your
very simple throwaway script with ease."

Is the twisted solution even shorter? Doing this with threads I'm
thinking would be on the order of 20 lines of code.


Carl Banks

Jean-Paul Calderone

Feb 1, 2007, 9:20:37 AM
to pytho...@python.org

The /already written/ solution I linked to in my original response was five
lines shorter than that.

Hmm.

Jean-Paul

Carl Banks

Feb 1, 2007, 9:41:56 AM
On Feb 1, 9:20 am, Jean-Paul Calderone <exar...@divmod.com> wrote:

And I suppose "re-implementing the same boring thing over and over" is
ok if it's 15 lines but is too much to bear if it's 20 (irrespective
of the additional large framework the former requires).


Carl Banks

Jean-Paul Calderone

Feb 1, 2007, 10:01:54 AM
to pytho...@python.org

It's written. Copy it and use it. There's no re-implementation to do.
And if you don't want to _limit_ the number of concurrent connections,
then you don't even need those 15 lines, you need four, half of which
are imports. I could complain about what a waste of time it is to always
have to import things, but that'd be silly. :)

Jean-Paul

Carl Banks

Feb 1, 2007, 10:06:40 AM
On Feb 1, 12:40 am, "Michele Simionato" <michele.simion...@gmail.com>
wrote:

Fair enough.

I'm just saying that just because something is good for funded,
important, enterprise tasks, it doesn't mean very simple stuff
automatically has to use it as well. For Pete's sake, even Perl works
for simple scripts.


Carl Banks
