where $jobs is the maximum number of wget processes you want to allow to run concurrently (setting -n to 1 gives one wget invocation per line in urls.txt). Without -j/-P, parallel will run as many jobs at a time as there are CPU cores (which doesn't necessarily make sense for wget, which is bound by network I/O), and xargs will run one job at a time.
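Concretely, the two invocations would look roughly like this (a sketch, assuming urls.txt holds one URL per line; the -q flag just quietens wget):

  jobs=8                                          # max concurrent downloads
  cat urls.txt | parallel -j "$jobs" wget -q {}   # GNU parallel
  xargs -P "$jobs" -n 1 wget -q < urls.txt        # xargs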
wget is used to get the JSON for the search query. jq is then used to extract the URLs of the collections. parallel then calls wget to get each collection, which is passed to jq to extract the URLs of all images. grep filters out the large images, and parallel finally uses wget to fetch the images.
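A sketch of that pipeline, with a hypothetical API endpoint, jq filters, and "large" pattern standing in for the real ones:

  wget -q -O - 'https://api.example.com/search?q=term' |
    jq -r '.collections[].url' |                            # collection URLs
    parallel -j4 'wget -q -O - {} | jq -r ".images[].url"' |
    grep -v large |                                         # drop the large images
    parallel -j8 wget -q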
The script below will crawl and mirror a URL in parallel. It first downloads the pages that are 1 click down, then 2 clicks down, then 3, instead of the normal depth-first order, where the first link on each page is fetched first.
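The full script is not included in this excerpt; a heavily simplified sketch of the breadth-first idea (it only follows absolute links and does not build a proper mirror; URL and DEPTH are placeholders) could look like this:

  URL=http://example.com/
  DEPTH=3
  echo "$URL" > level0.txt
  for level in $(seq 1 "$DEPTH"); do
    prev=$((level - 1))
    # fetch every page of the previous level in parallel and collect
    # the links found on them as the next level's work list
    cat "level$prev.txt" |
      parallel -j10 'wget -q -O - {} | grep -oE "https?://[^\"<> ]+"' |
      sort -u > "level$level.txt"
  done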
If you do not need GNU parallel to have control over each job (so no need for --retries or --joblog or similar), then it can be even faster if you can generate the command lines and pipe those to a shell. So if you can do this:
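  # a sketch: mygenerator is a placeholder for any command that prints
  # one complete shell command per line
  mygenerator | sh

then that stream of command lines can be spread over several shells with --pipe (again a sketch):

  mygenerator | parallel --pipe --block 10M sh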
When running jobs that output data, you often do not want the output of multiple jobs to run together. GNU parallel defaults to grouping the output of each job, so the output is printed when the job finishes. If you want full lines to be printed while the job is running you can use --line-buffer. If you want output to be printed as soon as possible you can use -u.
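For example (a small illustration, not taken from the original text):

  # default: output is grouped per job and printed when the job finishes
  parallel 'echo start {}; sleep {}; echo done {}' ::: 3 1 2
  # print complete lines as soon as they are produced
  parallel --line-buffer 'echo start {}; sleep {}; echo done {}' ::: 3 1 2
  # print output immediately; half-lines from different jobs may mix
  parallel -u 'echo start {}; sleep {}; echo done {}' ::: 3 1 2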
A slightly more complex example is downloading a huge file in chunks in parallel: some internet connections will deliver more data if you download files in parallel. For downloading separate files in parallel see "EXAMPLE: Download 10 images for each of the past 30 days"; but if you are downloading one big file, you can download it in chunks in parallel.
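A sketch of chunked downloading (assuming the server honours HTTP range requests; the URL and the split of the first 1 GB into 100 chunks are placeholders):

  seq 0 99 | parallel -k curl -s -r {}0000000-{}9999999 \
    http://example.com/the/big/file > file

Here -k keeps the chunks in order, so concatenating the outputs reconstructs the file.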
The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexps.txt it may be faster to use --block 10m instead of -L1000.
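The command being described is along these lines (a reconstruction; regexps.txt holds one regular expression per line):

  cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \
    grep -f - -n bigfile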
This will split bigfile into blocks of 1 MB and pass them to gzip -9 in parallel. One gzip will be run per CPU. The output of gzip -9 will be kept in order and saved to bigfile.gz.
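The command being described is along these lines (a reconstruction):

  cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz

--recend '' lets the blocks be cut at arbitrary byte boundaries, and -k keeps the compressed blocks in order; concatenated gzip streams decompress as a single file.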
gzip works fine if the output is appended, but some processing does not work like that - for example sorting. For this GNU parallel can put the output of each command into a file. This will sort a big file in parallel:
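A sketch of the command (a reconstruction of the example being described):

  cat bigfile | parallel --pipe --files sort |
    parallel -Xj1 sort -m {} ';' rm {} > bigfile.sort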
Here bigfile is split into blocks of around 1MB, each block ending in '\n' (which is the default for --recend). Each block is passed to sort and the output from sort is saved into files. These files are passed to the second parallel that runs sort -m on the files before it removes the files. The output is saved to bigfile.sort.
GNU parallel's --pipe maxes out at around 100 MB/s because every byte has to be copied through GNU parallel. But if bigfile is a real (seekable) file GNU parallel can by-pass the copying and send the parts directly to the program:
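A sketch of the --pipepart variant (a reconstruction; the 100 MB block size is illustrative):

  parallel --pipepart --block 100m -a bigfile --files sort |
    parallel -Xj1 sort -m {} ';' rm {} > bigfile.sort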
If you need to run a massive amount of jobs in parallel, then you will likely hit the filehandle limit, which often allows around 250 jobs. If you are the superuser you can raise the limit in /etc/security/limits.conf, but you can also use this workaround. The filehandle limit is per process: if you just spawn more GNU parallels, each of them can run 250 jobs. This will spawn up to 2500 jobs:
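A sketch of the nested invocation (myinput and your_prg are placeholders): ten outer job slots, each running an inner GNU parallel that runs up to 250 jobs, for 10 x 250 = 2500 jobs in total:

  cat myinput |
    parallel --pipe -N 2500 --round-robin -j10 parallel -j250 your_prg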
--open-tty will make the pings receive SIGINT (from CTRL-C). CTRL-C will not kill GNU parallel, so it will only exit after the pings are done.
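For example (my own illustration, assuming a GNU parallel recent enough to have --open-tty; the hosts are placeholders):

  parallel -j0 --open-tty ping -c 5 {} ::: 8.8.8.8 1.1.1.1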
GNU parallel can work as a simple job queue system or batch manager. The idea is to put the jobs into a file and have GNU parallel read from that continuously. As GNU parallel will stop at end of file we use tail to continue reading:
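A sketch of the queue (the command appended to jobqueue is just an example):

  true > jobqueue; tail -n+0 -f jobqueue | parallel

  # submit work by appending complete command lines to the queue file:
  echo my_command my_arg >> jobqueue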
If you keep this running for a long time, jobqueue will grow. A way of removing the jobs already run is by making GNU parallel stop when it hits a special value and then restart. To use --eof to make GNU parallel exit, tail also needs to be forced to exit:
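A sketch of the shutdown, using StOpHeRe as an arbitrary --eof marker:

  true > jobqueue
  tail -n+0 -f jobqueue | parallel -E StOpHeRe &

  # submit jobs as before, then stop the queue:
  echo StOpHeRe >> jobqueue   # GNU parallel ignores everything after this
  echo >> jobqueue            # the extra write makes tail hit the now-closed
                              # pipe and exit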
This small shell snippet will download the files one after the other, in a strictly serial fashion: the second file will start downloading only after the first one has finished. The utilization of your connection will probably be less than optimal. Indeed, while fetching large files from a remote server, my DSL connection at home could fetch only about 78 Kbytes/sec when I was running one wget instance at a time:
Figure 1. Download speed with 1 wget job at a time.

One of the ways to achieve better download speeds for multiple files is to use multiple parallel connections. This is precisely the idea behind download managers: programs that can be fed a list of URLs and fetch them in parallel.
This is a very small wrapper around wget, but the difference it makes in download speed is quite dramatic. I tried running 8 parallel wget processes at the same time, by setting maxjobs=8 in the source of the script itself, and downloaded a set of relatively large files by typing:
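The exact script and command are not reproduced in this excerpt; a hypothetical sketch of such a job-limited wrapper (the name pwget.sh and reading URLs from stdin are my assumptions) might look like:

  #!/bin/bash
  # pwget.sh (hypothetical name): read URLs from stdin and download them,
  # keeping at most $maxjobs wget processes running at any time
  maxjobs=8
  while read -r url; do
    while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
      sleep 1            # all job slots busy; wait for one to free up
    done
    wget -q "$url" &     # start this download in the background
  done
  wait                   # wait for the remaining downloads to finish

which would then be invoked along the lines of ./pwget.sh < urls.txt.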
Yep. The -i option would work fine for serialized multi-file downloads. I was just curious to see if I could parallelize things a bit without having to install a GUI-based download manager. It seems to have worked nicely enough :-)
Somehow when I use this script it simply downloads all the URLs in parallel, without any restriction on the number of jobs. I re-indented the code, but maybe I made a mistake in the tab levels.
Yeah, I seldom look into the man pages of commands I think I already know... afterwards I realize how naive I was.
I made a script to download my favorite newspaper.
Well, the newspaper that I want comes only as individual pages, so I had to download the pages in a loop with wget -i and then combine them with pdftk.
I was looking for something to parallelize the downloads, and this served the purpose.
Thanks, John.
The last few days I have had the following problem on a fully updated x86_64 system. DNS lookups take 5-6 seconds when running network clients such as ssh, telnet, or wget; the client hangs while resolving the hostname. If I use the IP address or put the host in /etc/hosts, there is no issue. The really weird thing is that DNS utilities like dig, host, and nslookup always resolve immediately (dig reports query times of about 10 ms).
I guess it's the same in your case. You said above that you have the problem with OpenDNS and your ISP's own DNS servers, so the problem is somewhere in your own setup. Most probably your router cannot deal well with the parallel DNS lookups, but most others can. That's what U. Drepper was saying in his post: some users will have problems, but most won't. It just feels strange when YOU are the one affected and everyone else seems to be doing fine. But as long as there's an easy workaround, it's all fine.
Looks like the resolver sends parallel requests, fails to see the IPv6 response, waits 5 sec and sends sequential requests because it thinks the nameserver is broken. Any DNS gurus out there who can explain what is happening?
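The easy workaround referred to above is not quoted in this thread; the fix commonly used for this glibc resolver behaviour (an assumption on my part, not something stated in the posts) is to make the A and AAAA queries sequential in /etc/resolv.conf:

  # /etc/resolv.conf
  options single-request      # send A and AAAA queries one after the other
  nameserver 192.0.2.1        # your usual DNS server (placeholder address)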
The installation goes smoothly, independent of the system. I tried it on CentOS 7 (which already has a parallel yum package), CentOS 6, and RHEL 7, without errors. First, of course, you may need to install wget. Note that once parallel is installed, if you try to run it without giving it a command, you get the following amusing output:
Normally this is not really an issue, but for downloading large files like Linux distribution images it would be an advantage to have a single fast download. Due to its nature, BitTorrent (and other P2P apps) can make full use of the bandwidth of all connections, but for HTTP this is not (necessarily) the case. Browsers are currently limited to a single download stream, as are common CLI tools like curl and wget.
There are "download manager" utilies available that can download simultaneously with multiple WANs, but I prefer using command line (CLI) applications for downloads. The GNU wget utility is getting support for multiple streams with the upcoming release of wget2, but there are already tools available.
Note that certain programs/utilities bundled with Quantum Espresso might not work correctly with a parallel build, so we may need a serial build for those, via the ./configure --disable-parallel option, in case parallel support is automatically detected.