download big files from NCBI GEO database parallelly : Error 404 not found

kishorered...@gmail.com

30 Jan 2018, 15:29:01
Hello Everyone,

I am trying to download about 64 files from the NIH database.
I am using GNU parallel 20160922.
The command I am using is:


seq -w 1 64 | parallel -j 23 "wget http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR5259335_E18_20160930_Neurons_Sample_{}.bam"


I am running this on our HPC cluster. The command does not download my files and returns a 404 error.
When I wget the files one by one it works. I want to submit a job so that they are downloaded in parallel.
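
A quick check before blaming the server is to print the URLs the pipeline would request. `seq -w` pads all numbers to the same width, so the first nine come out as 01 to 09. Whether the server names the files that way is not stated in this thread, so this is something to verify, not a confirmed cause. The sketch below only prints URLs; `sample_url` is a hypothetical helper name. GNU parallel also has a `--dry-run` option that prints the commands instead of running them.

```shell
#!/bin/sh
# Base URL copied from the command in the post above.
BASE=http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR5259335_E18_20160930_Neurons_Sample_

# sample_url is a hypothetical helper: it builds the URL for one sample number.
sample_url() {
    printf '%s%s.bam\n' "$BASE" "$1"
}

# First three URLs with zero padding (what "seq -w 1 64" feeds to parallel).
for n in $(seq -w 1 64 | head -n 3); do sample_url "$n"; done

# First three URLs without padding, for comparison.
for n in $(seq 1 64 | head -n 3); do sample_url "$n"; done
```

Comparing these against a URL that works with a single manual wget shows whether the generated names match.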

Helmut Waitzmann

31 Jan 2018, 03:43:15
kishorered...@gmail.com:
With GNU xargs rather than GNU parallel:

seq -w 1 64 |
xargs -E '' --max-procs 23 -n 1 -- sh -c -- \
'wget \
http://sra-download.ncbi.nlm.nih.gov/srapub_files/\
SRR5259335_E18_20160930_Neurons_Sample_"${1?}".bam' \
sh
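
How the number reaches wget in the command above may not be obvious, so here is a reduced dry run of the same pattern. It echoes each wget command instead of executing it, and drops the `-E ''` and `--max-procs` options to keep the output order predictable. `show_commands` is a hypothetical wrapper name used only for this illustration.

```shell
#!/bin/sh
# Dry run of the xargs pattern: echo the wget command instead of running it.
# - "-n 1" hands one number from seq to each inner shell.
# - The trailing "sh" becomes $0 of the inner shell, so the number lands in $1.
# - "${1?}" makes the inner shell stop with an error if $1 is missing.
show_commands() {
    seq -w 1 "$1" |
    xargs -n 1 sh -c 'echo wget "http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR5259335_E18_20160930_Neurons_Sample_${1?}.bam"' sh
}

# Print the commands for the first three samples.
show_commands 3
```

Passing the number as a positional parameter, instead of pasting it into the script text, keeps the quoting simple.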

Jorgen Grahn

1 Feb 2018, 10:31:48
I don't see why the original wouldn't work, though, unless parallel(1)
is somehow broken.

Do people use parallel(1)? I never heard of it before. Downloaded it
and had a look. The banner/disclaimer looked odd, and the number of
command-line options seemed unreasonably high at first glance.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Clyde

1 Feb 2018, 11:55:22
On Thursday, February 1, 2018 at 7:31:48 AM UTC-8, Jorgen Grahn wrote:
> On Wed, 2018-01-31, Helmut Waitzmann wrote:
> > kishorereddyanekalla at gmail.com:
Please note that NCBI *hates* this sort of thing (tens of thousands of researchers doing the same thing kills their bandwidth) and will ban your IP address for a while, which is probably why the download fails.

We've had that problem at work when a couple of hundred scientists all hit their site at the same time, and we go over quota.

The original string should work properly.

I use parallel *a lot*; it's great and performs much better than xargs.

kishorered...@gmail.com

1 Feb 2018, 13:21:48
Thanks. I made a simple for loop for now and am waiting for the download to finish.

kishorered...@gmail.com

1 Feb 2018, 13:22:54
Thank you, I understand, but I need to get the data one way or another. I was planning to get it faster, but ended up using a for loop :)
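
The loop itself was not posted, so the following is only a guess at what such a sequential fallback could look like. `FETCH` and `download_all` are hypothetical names; `FETCH` is set to `echo` here so that the sketch previews the requests without touching the server.

```shell
#!/bin/sh
# Base URL copied from the original command in this thread.
BASE=http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR5259335_E18_20160930_Neurons_Sample_

# FETCH is a hypothetical hook: "echo" previews, "wget" downloads.
FETCH=echo

# download_all N: fetch samples 1..N one at a time, reporting any failure.
download_all() {
    for n in $(seq -w 1 "$1"); do
        "$FETCH" "${BASE}${n}.bam" || echo "failed: sample $n" >&2
    done
}

# Preview the first two requests.
download_all 2
```

Setting `FETCH=wget` and calling `download_all 64` would perform the real downloads, one file at a time.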

Helmut Waitzmann

1 Feb 2018, 15:00:13
Jorgen Grahn <grahn...@snipabacken.se>:
> On Wed, 2018-01-31, Helmut Waitzmann wrote:

>> With GNU xargs rather than GNU parallel:
>>
>> seq -w 1 64 |
>> xargs -E '' --max-procs 23 -n 1 -- sh -c -- \
>> 'wget \
>> http://sra-download.ncbi.nlm.nih.gov/srapub_files/\
>> SRR5259335_E18_20160930_Neurons_Sample_"${1?}".bam' \
>> sh
>
> I don't see why the original wouldn't work, though, unless parallel(1)
> is somehow broken.

I've to beg your pardon. The parallel(1) manual page I read
described a very old version, which didn't read standard input to
collect arguments.

Ian Zimmerman

2 Feb 2018, 15:39:13
On 2018-02-01 20:59, Helmut Waitzmann wrote:

> I've to beg your pardon. The parallel(1) manual page I read described
> a very old version, which didn't read standard input to collect
> arguments.

Note: there are __two__ programs named parallel(1). While they
(obviously) have broadly the same purpose, their command line interfaces
are completely incompatible.

There's the GNU one, and then there's the one in moreutils package.

For some background on the reasons why both exist, see here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=597050

Myself, I have started avoiding both of them. In situations where I need
parallel shell jobs, I just write a simple makefile and then use make -j.

--
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
To reply privately _only_ on Usenet, fetch the TXT record for the domain.

Helmut Waitzmann

3 Feb 2018, 12:50:22
Ian Zimmerman <i...@no-use.mooo.com>:
> On 2018-02-01 20:59, Helmut Waitzmann wrote:
>
>> I've to beg your pardon. The parallel(1) manual page I read described
>> a very old version, which didn't read standard input to collect
>> arguments.
>
> Note: there are __two__ programs named parallel(1). While they
> (obviously) have broadly the same purpose, their command line interfaces
> are completely incompatible.
>
> There's the GNU one, and then there's the one in moreutils package.

Ah. Thank you. I didn't know that. The parallel(1) manual page
I read was the one for moreutils' parallel.

Jorgen Grahn

3 Feb 2018, 17:14:50
On Thu, 2018-02-01, Clyde wrote:
> On Thursday, February 1, 2018 at 7:31:48 AM UTC-8, Jorgen Grahn wrote:
>> On Wed, 2018-01-31, Helmut Waitzmann wrote:
>> > kishorereddyanekalla at gmail.com:
>> >> Hello Everyone,
>> >>
>> >> I am trying to download about 64 files from NIH database.
>> >> I am using gnu parallel/20160922
>> >> Command that I am using is
...
>
> Please note that NCBI *hates* this sort of thing (tens of thousands
> of researchers doing the same thing kills their bandwidth) and will
> ban your IP address for a while, which is probably why the download
> fails.
>
> We've had that problem at work when a couple of hundred scientists
> all hit their site at the same time, and we go over quota.

Seems like a good reason to make it available over Bittorrent
instead. Or Git or something.

But yes, when they don't, the polite thing to do is not to try to
bypass their rate limiting.

Clyde

5 Feb 2018, 00:24:18
That already exists as genetorrent ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4178372/ ) and git *sucks* for extremely large files like that.

Politely asking didn't work, so NCBI enforces with an off switch.

But for the OP, you can use something like -j 2 (lowercase; in GNU parallel, uppercase -J selects a profile) to limit it to a couple of parallel downloads (faster than single, but not breaking the rules).
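
As a rough illustration of capping concurrency at two jobs, the sketch below swaps in `xargs -P 2` for GNU parallel so that it runs on systems without GNU parallel installed; the GNU parallel form is shown in a comment. `FETCH` and `throttled_fetch` are hypothetical names, and `FETCH` is set to `echo` so that nothing is downloaded.

```shell
#!/bin/sh
# Base URL copied from the original command in this thread.
BASE=http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR5259335_E18_20160930_Neurons_Sample_

# GNU parallel form of the suggestion (not run here):
#   seq -w 1 64 | parallel -j 2 wget "${BASE}{}.bam"

# FETCH is a hypothetical hook: "echo" previews, "wget" downloads.
FETCH=echo

# throttled_fetch N: process samples 1..N with at most two jobs at a time.
# Inside the inner shell: $1 = fetch command, $2 = base URL, $3 = sample number.
throttled_fetch() {
    seq -w 1 "$1" |
    xargs -P 2 -n 1 sh -c '"$1" "$2$3.bam"' sh "$FETCH" "$BASE"
}

# Preview four requests; with -P 2 the output order may vary between runs.
throttled_fetch 4
```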

C