What is the best way to download all the Bacteria faa files?

Zhao Xu

Apr 25, 2015, 1:16:09 AM
to img-use...@lbl.gov
Hi All,
  Currently I am working on phylogenetic studies of Bacteria. From this link we can see there are 25302 bacteria in JGI: https://img.jgi.doe.gov/cgi-bin/er/main.cgi
  I've read a few posts and I understand that I should use Globus to download the files since the number is huge. However, when I try Globus, there is no easy way to download only the faa files, and there is not even a way to download all the files:
1. If I choose By_Data_Source_And_Type, then 'Files', to download the Taxonomy/Bacteria folder, I only get some of the organisms, about 7605 in total. I noticed that many Bacteria have an IMG_Data folder that cannot be downloaded this way. I could not figure out which data source and type I should use to download this IMG_Data folder.
2. If I choose Taxonomy from the root folder and download the Bacteria directory, I end up downloading a large volume of raw data (raw reads) that I don't need. These files are too big to store on my local hard drive.

  In fact, the ideal solution for me is to find a way to download only the faa files (only the proteins). Could anyone help me with this?

Thanks very much!
Zhao

Matthew

Apr 1, 2016, 2:08:33 PM
to IMG User Forum
I also would like to do something similar to this, and Zhao's question has not been answered.
It seems like it might be impossible?
The faa files are often only available bundled in a zipped archive, which probably cannot be opened without downloading a local copy.
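
For archives that have already been downloaded, the protein files can at least be pulled out without unpacking everything else. A minimal sketch, assuming the bundles are tar archives (gzip-compressed or not) and that protein files end in `.faa`; the function name is hypothetical and the real bundle layout may differ:

```python
import tarfile
from pathlib import Path


def extract_faa(archive_path, out_dir):
    """Copy only the .faa members of a local tar archive into out_dir.

    Assumes a tar or tar.gz bundle. Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" lets tarfile detect the compression on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".faa")):
                continue
            src = tar.extractfile(member)
            if src is None:
                continue
            # Flatten the archive path so nothing is written outside out_dir.
            target = out_dir / Path(member.name).name
            target.write_bytes(src.read())
            written.append(target)
    return written
```

This does not avoid the download itself; it only keeps the nucleotide and raw-read files from being unpacked locally.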

The issue of the raw reads is frustrating to me as well.
Could this problem be circumvented using scripting with the API? If so, could someone provide some examples or point us somewhere to learn how to do this kind of API scripting?

I wish there were more data-access solutions for "intermediate users" (i.e., those of us who are biologists, and comfortable with but new to coding).

I would love to be able to upload a list of img_taxon oids and retrieve ALL protein-coding genes for these genomes. In the web interface, one can get those genomes into the genome cart, but cannot get the sequences into the genome cart. I would be okay if I had to do this ten genomes at a time (since there are limits to the gene cart), but perhaps the API supports actions like this?
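
Whatever mechanism ends up doing the retrieval, the batching part is simple to script. A sketch of splitting a list of taxon OIDs into groups of ten; the retrieval step itself is left out because no public interface for it is described in this thread, and the helper names are hypothetical:

```python
def parse_oids(text):
    """Return unique OIDs from text with one OID per line, in input order."""
    seen = set()
    oids = []
    for line in text.splitlines():
        oid = line.strip()
        if oid and oid not in seen:
            seen.add(oid)
            oids.append(oid)
    return oids


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be handled by hand in the web interface or passed to a script, for example `for group in batches(parse_oids(open("oids.txt").read()), 10): ...`, where `oids.txt` is a placeholder file name.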

There is hardly any documentation on how to use the API!

Manoj Pillay

Apr 2, 2016, 2:03:18 AM
to IMG User Forum
Matthew and Zhao,

IMG is primarily a web-based analytical resource. The data within IMG is multi-dimensional but organized at the unit level as taxons or analysis projects. The vast majority of our consumers are interested in compartmentalized analysis of their own data. For this reason, our data is not organized to be downloaded on a functional global level. The Portal Group at JGI, who manage download interfaces to IMG as well as other JGI data, offer a convenient alternative to downloading individual datasets, via Globus. There is no native API available for public consumption. In general, it is not feasible to balance the high-concurrency usage that a public API would demand without enforcing usage limits via paid tiered memberships. IMG, being a publicly funded free resource, needs to maintain neutrality and equality with respect to all our users and cannot offer tiered membership plans. The limits enforced on the amount of data, such as the maximum number of genomes permitted by an IMG application, are often limits of our hardware and the architecture offered by NERSC. Although I haven't used it personally, I am aware of the Globus API, which you can possibly program against to selectively download data from endpoints. Hopefully, you can make the best use of this and obtain data tailored to the dimensions of your analysis.
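
For readers who want to try the selective approach, the filtering logic can be kept separate from the transfer client. The sketch below walks a remote directory tree through a `list_dir` callable and collects only paths ending in `.faa` or `.faa.gz`. `list_dir` is a hypothetical stand-in for whatever directory-listing call the client provides, assumed here to return `(name, is_dir)` pairs; it is not a Globus function.

```python
from collections import deque


def find_faa(list_dir, root, suffixes=(".faa", ".faa.gz")):
    """Breadth-first walk returning paths whose names end in `suffixes`.

    list_dir(path) is a caller-supplied function (hypothetical here) that
    returns an iterable of (name, is_dir) pairs for one remote directory.
    `suffixes` must be a string or a tuple of strings.
    """
    found = []
    queue = deque([root.rstrip("/")])
    while queue:
        current = queue.popleft()
        for name, is_dir in list_dir(current):
            path = f"{current}/{name}"
            if is_dir:
                queue.append(path)
            elif name.endswith(suffixes):
                found.append(path)
    return found
```

The resulting list could then be handed to a transfer request so that raw-read files are never queued. Whether the listing calls are fast enough for thousands of genome folders is something that would need to be tested.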

Best,
Manoj Pillay
Database Systems Group
Integrated Microbial Genomes.

Matthew

Apr 4, 2016, 5:15:09 PM
to IMG User Forum
Thank you for responding, Manoj,

Yes, I understand the compartmentalization and multidimensionality of IMG.  But the reason I want all the sequences from certain genomes (clearly a compartmentalization approach) is that IMG lacks certain tools, such as scoring against custom-built HMMs.  (For example, there are some gene families for which no pfam/COG/TIGRFAM/KO exists.)
A huge problem comes from some of the built-in limitations of IMG. I understand why these limitations have been built in, and why tiered memberships are beyond IMG's mission as a free resource.  However, there is simply no way to do certain analyses completely within the IMG environment.

A commonly occurring example that frustrates my colleagues and me is the way that the in-house BLAST algorithm defaults to "ALL GENOMES" when more than 100 are in the genome cart.
If I want to find all of the cyanobacterial homologs of an uncharacterized sequence (thus lacking a functional tag such as a pfam), I am forced to retrieve homologs from all bacteria.
If the uncharacterized gene family is widespread, then I sometimes cannot even reach the "twilight zone" of distant homologs. 

That is why I would greatly like to have access to all of the coding sequences from cyanobacteria, so that I can run these analyses locally on my own system (e.g., HMMER)...
I can do such things with a local version of uniprot or nr, but it is really sad to me that there is no way to do this with IMG. 
Worst of all, there is no way for me to map gene IDs back to IMG's gene_oid system.
I love the gene_oid & taxon_oid system that IMG uses. 
But if I were to use HMMER against uniprot to identify members of an uncharacterized gene family, I cannot find a way to map these sequences back to IMG's oid system.
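
One workaround that does not depend on any cross-reference table is to match on the protein sequence itself. A sketch, assuming the IMG protein FASTA headers begin with the gene_oid (an assumption worth checking against a real export) and that only exact sequence matches count; all function names are illustrative:

```python
def read_fasta(lines):
    """Yield (first_header_token, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else ""
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize(seq):
    """Upper-case and drop a trailing stop symbol so equal proteins compare equal."""
    return seq.upper().rstrip("*")


def map_to_gene_oids(img_lines, hit_lines):
    """Map each hit ID to the IMG gene_oids that have an identical sequence.

    Assumes the first header token in the IMG file is the gene_oid.
    """
    by_seq = {}
    for gene_oid, seq in read_fasta(img_lines):
        by_seq.setdefault(normalize(seq), []).append(gene_oid)
    return {hit_id: by_seq.get(normalize(seq), [])
            for hit_id, seq in read_fasta(hit_lines)}
```

Both arguments can be open file handles. Sequences that differ by even one residue are missed, so this only recovers hits where both databases hold the same protein; anything looser would need an alignment step.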

As far as I'm concerned, the Globus FTP GUI is no faster than clicking on individual genome download pages on the JGI Download Portal, and the API provided for downloads is essentially undocumented.
I'm tempted to dump all the genes from a genome into my gene cart, one genome at a time for the 381 cyanobacterial genomes, in batches of < 20,000 genes each, and then compile the exported files on my own computer.
It would probably take me a day of tedious clicking, which is neither sane nor impossible.
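
The compiling step, at least, does not need to be manual. A sketch that concatenates exported FASTA files into one file and skips records whose ID has already been seen; it assumes the exports are plain FASTA, and the function name is hypothetical:

```python
def merge_fasta(paths, out_path):
    """Concatenate FASTA files, keeping the first record for each ID.

    The ID is taken to be the first whitespace-delimited token of the header.
    Returns the number of records written.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            keep = False  # ignore any stray lines before the first header
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        record_id = fields[0] if fields else ""
                        keep = record_id not in seen
                        if keep:
                            seen.add(record_id)
                            written += 1
                    if keep and line.strip():
                        # Guarantee a newline so records never run together.
                        out.write(line if line.endswith("\n") else line + "\n")
    return written
```

Calling it on a sorted list of export files gives a single protein file that local tools such as HMMER can search.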

Please help; I'm almost ready to stop using IMG because of these problems, even though IMG has many features superior to the other available databases.
Thanks,
Matt Melnicki.




Again, if I could put a whole genome into the genome cart in small batches (say 10 or 20 genomes at a time

I am interested in cyanobacteria.