Import SRA file for analysis in QIIME


Tim

Jan 24, 2011, 10:29:11 AM
to qiime...@googlegroups.com
Hi all,

Could anyone tell me if it is possible to import an SRA file for analysis through QIIME? In the 1.2.0-dev trunk I found the following file, which seems related: sra_spreadsheet_to_map_files.py. However, I can't find any related documentation, other than the note that the trunk is under active development and it's hard to keep the docs up to date.

What I'd like to do is download an SRA file, decompose it into input files for QIIME, and run the analysis on that to compare it to our own results.

Best regards,
Tim

Greg Caporaso

Jan 24, 2011, 11:47:31 AM
to qiime...@googlegroups.com
Hi Tim,
We don't have an automated way to do this file format conversion, but
if you can get the fasta and mapping files into the formats described
at the links below (respectively) you can run the samples through
QIIME:

http://qiime.org/documentation/file_formats.html#metadata-mapping-files
http://qiime.org/documentation/file_formats.html#demultiplexed-sequences
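For illustration, a minimal mapping file looks something like this (tab-separated; the sample IDs, barcodes, and primer below are made up, not from any real study):

```
#SampleID	BarcodeSequence	LinkerPrimerSequence	Treatment	Description
Sample.1	AGCACGAGCCTA	CATGCTGCCTCCCGTAGGAGT	Control	control_mouse_1
Sample.2	AACTCGTCGATG	CATGCTGCCTCCCGTAGGAGT	Fasted	fasted_mouse_1
```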

Greg

Jesse Stombaugh

Jan 24, 2011, 11:49:30 AM
to qiime...@googlegroups.com
Hello Tim,

There is not currently a script to import from SRA.  What you need to do is download the FASTA files for each sample and concatenate them.  You will also have to generate a QIIME-formatted mapping file which contains all of your samples.  Once you have the FASTA files and the mapping file, you can do all the processing post-split_libraries.
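As a rough sketch of that concatenation step (the sample IDs and sequences here are invented, and post-split_libraries labels are assumed to look like ">SampleID_<n>"):

```python
# Sketch: merge per-sample FASTA files into one demultiplexed file with
# QIIME-style ">SampleID_<n>" labels (the post-split_libraries format).
# The sample-to-sequences mapping below is a made-up example.

def merge_samples(sample_fastas):
    """sample_fastas: {sample_id: list of FASTA lines}. Returns merged lines."""
    merged, n = [], 0
    for sample_id, lines in sorted(sample_fastas.items()):
        for line in lines:
            if line.startswith(">"):
                n += 1
                merged.append(">%s_%d %s" % (sample_id, n, line[1:].strip()))
            else:
                merged.append(line.strip())
    return merged

samples = {
    "Control.1": [">SRR001 length=4", "ACGT"],
    "Fasted.1": [">SRR002 length=4", "TTGA"],
}
print("\n".join(merge_samples(samples)))
```

In practice you would read each per-sample FASTA from disk instead of a dict, but the relabeling is the important part.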

-Jesse
--
Jesse Stombaugh, Ph.D.
Research Associate
University of Colorado, Boulder

Tim

Jan 25, 2011, 10:24:39 AM
to qiime...@googlegroups.com
Thanks. I've looked into it a bit further, on how to implement this myself:
  • Extracting .fasta & .qual files from a .SRA file should not be a problem:
    • Using the NCBI SRA Toolkit's 'fastq-dump' I can convert the SRA contents to FASTQ,
    • which is then easily converted into both FASTA & quality files using Biopython's 'SeqIO.convert(infile, "fastq", outfile, "fasta")' (and likewise for "qual").
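For illustration, the FASTQ-to-FASTA/QUAL step could be sketched with just the standard library (Phred+33 quality encoding assumed; Biopython's SeqIO.convert is the cleaner way to do this in practice):

```python
# Minimal FASTQ -> FASTA/QUAL conversion, standard library only.
# Assumes 4-line FASTQ records with Phred+33 quality characters.

def fastq_to_fasta_and_qual(fastq_lines):
    """Yield (fasta_record, qual_record) string pairs per FASTQ record."""
    for i in range(0, len(fastq_lines), 4):
        header = fastq_lines[i].lstrip("@").strip()
        seq = fastq_lines[i + 1].strip()
        # Decode Phred+33: quality score = ASCII code - 33
        quals = [str(ord(c) - 33) for c in fastq_lines[i + 3].strip()]
        yield (">%s\n%s" % (header, seq),
               ">%s\n%s" % (header, " ".join(quals)))

fastq = ["@read1", "ACGT", "+", "IIII"]
for fasta_rec, qual_rec in fastq_to_fasta_and_qual(fastq):
    print(fasta_rec)   # >read1 / ACGT
    print(qual_rec)    # >read1 / 40 40 40 40
```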
However, I'm having trouble finding a source to use for the mapping file, either because I'm looking in the wrong places, or because I'm too unfamiliar with the SRA file format.
Should this information also be contained in the SRA files, or is there a different source available to download this information from the Sequence Read Archive?
And is the sra_spreadsheet_to_map_files.py script I linked to initially somehow connected to this?

Best regards,
Tim

Jesse Stombaugh

Jan 25, 2011, 12:11:02 PM
to qiime...@googlegroups.com
Hello Tim,

The SRA scripts in QIIME are for submitting to SRA, not re-capturing the data. You will need to manually create a mapping file, using the information found on the SRA's website for the given study you are trying to extract.

-Jesse

JQL

Feb 7, 2011, 5:44:12 PM
to Qiime Forum
It seems related, so I'll just hijack Tim's thread.

After exploring the tutorial for a while, I think I am ready to play
with the full data set -- just to get a feel for it. I am looking into
the data set by Crawford et al. 2009, SRA008712. It seems there are 38
samples? The supplemental information states that there were four
groups of mice: control, fasted, high fat/low carb, and swimming, with
n=4-5/group. How did they come up with 38 samples?
http://www.pnas.org/content/suppl/2009/06/22/0902366106.DCSupplemental/0902366106SI.pdf

I am also trying to figure out the SRA toolkit for converting SRA data
to FASTA...

If the Crawford data set is not good for practice, any suggestions?

thanks
John


Tim

Feb 8, 2011, 6:33:16 AM
to qiime...@googlegroups.com
Hi John (and all),

I've since set up a script to convert from SRA to QIIME input mapping, FASTA & quality files. I've made it publicly available here, should anyone be interested in this:

It currently handles quite a few edge cases, but is also built upon a number of assumptions, which might not hold for every study you throw at it. Since the script runs on experiment XML files, the most notable of these assumptions is the presence of barcodes, primer(s) and multiple samples within a single Experiment package set XML file (e.g. SRX012345).
The Crawford et al. 2009 SRA008712 dataset does not fit these criteria, as each sample has been submitted as a separate Experiment. It should, however, be possible to extend the script to include such submissions, but probably at the loss of barcodes & primers, which would require outputting the following two input file formats instead:
I've not yet implemented this, pending discussion about its necessity with the clients I wrote the script for, but I could look into it if there's a substantial need.

Feel free to contact me about changes, suggestions or extensions, or check out, extend, patch or fork as you see fit.

Best regards,
Tim

Greg Caporaso

Feb 8, 2011, 8:51:32 AM
to qiime...@googlegroups.com
Hi Tim,
Thanks for the contribution!

John, Another option would be to use the data that we presented in the
initial QIIME Nature Methods paper. That is available here:

http://bmf.colorado.edu/QIIME/QIIME_NM_2010.tgz

Greg

JQL

Feb 8, 2011, 10:31:24 AM
to Qiime Forum
Hi Greg and Tim,

Thanks for the information. I called Roche this morning and they said
one can request the software Newbler (it has sffinfo, etc.) for data
analysis purposes. They have since relaxed its distribution.
I sent in the request online and they said someone will be in touch
soon.

Thanks for the NM dataset, I am still downloading it. Will take a
look. I will feel more confident after running the full data set.

John

Greg Caporaso

Feb 8, 2011, 10:40:40 AM
to qiime...@googlegroups.com
> I called Roche this morning and they said
> one can request the software Newbler (it has sffinfo etc.) for data
> analysis purpose.

Interesting, I hadn't heard that. Will you follow up and let us know
how it goes (i.e., if it's easy to get)?

Greg

JQL

Feb 8, 2011, 10:46:32 AM
to Qiime Forum
Absolutely. I was wondering myself after I talked to her. Now I'm
wondering even more :)

JQL

Feb 8, 2011, 11:04:54 AM
to Qiime Forum
Hi Greg,

What file type is that? I downloaded the file as .GZ and unzipped it,
but couldn't figure it out.
Does it require some conversion first?

thanks
John

Greg Caporaso

Feb 8, 2011, 11:07:54 AM
to qiime...@googlegroups.com
Should be a gzipped tar file, so unpack with:

tar -xvzf QIIME_NM_2010.tgz
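If the tar command gives you trouble, Python's standard tarfile module can do the same unpacking. Here is a self-contained sketch (it builds a tiny demo archive first so it can run anywhere; substitute your real .tgz filename):

```python
import os
import tarfile

# Create a tiny demo archive so the example is self-contained, then
# unpack it the same way you would unpack QIIME_NM_2010.tgz.
with open("demo.txt", "w") as f:
    f.write("hello")
with tarfile.open("demo.tgz", "w:gz") as archive:
    archive.add("demo.txt")

# "r:gz" = read a gzipped tar; equivalent to "tar -xvzf demo.tgz"
with tarfile.open("demo.tgz", "r:gz") as archive:
    archive.extractall("unpacked")

print(os.path.exists("unpacked/demo.txt"))  # -> True
```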

Let us know if that doesn't work.

Greg

JQL

Feb 8, 2011, 11:40:00 AM
to Qiime Forum
When double-clicked, it directs you to:
"The previous page is sending you to http://bmf.colorado.edu/QIIME/QIIME_NM_2010.tgz.
If you do not want to visit that page, you can return to the previous
page."
So I double-click that new link, and it prompts to save, but the file
name changes to QIIME_NM_2010.GZ.

So the file downloaded is a .GZ file.

I tried to use wget, but we have firewall issues here.
John

Greg Caporaso

Feb 8, 2011, 11:42:52 AM
to qiime...@googlegroups.com
You can try renaming the file locally to end with tgz and then run the
command I sent.

Greg

JQL

Feb 8, 2011, 12:11:41 PM
to Qiime Forum
ok, that trick worked.

It seems split_libraries.py was run. Very good, thank you Greg!

JQL

Feb 8, 2011, 2:58:30 PM
to Qiime Forum
Hi Greg,

I did receive an email from Roche with download information. It seems
that is true: they send the software out for free. I told them I was
the data analyst and that we do not have the machine.

So one can go to Roche's website, go to the software request form and
fill it out. Linux only.

John

Greg Caporaso

Feb 8, 2011, 4:51:57 PM
to qiime...@googlegroups.com
> So one can go to Roche's website, go to the software request form and
> fill it out. Linux only.

Great, that's good to know. Thanks!

JQL

Feb 14, 2011, 6:41:27 PM
to Qiime Forum
Hi Greg,

I have a question partially related to your QIIME paper. I am looking at
the commands: when you do OTU picking, you did that with
pick_otus_blast.py, which I think is equivalent to "pick_otus.py -m
blast"?

The reference sequence set: on the greengenes website
http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/ I
saw a file called "current_GREENGENES_unaligned.fasta.gz"; is this a
good reference set? There is another one from NCBI,
"current_NCBI_unaligned.fasta.gz", and one from RDP as well, but the
file sizes are quite different. Any preference?
In your paper, you used a set called
"greengenes_unaligned.fasta-OTUs_at_0.01.fasta". Which one did you use
for reference?

thanks,
John

Greg Caporaso

Feb 14, 2011, 9:42:09 PM
to qiime...@googlegroups.com
Hi John,

> pick_otus_blast.py, which I think is equivalent to "pick_otus.py -m
> blast"?

Yes, that's correct. I'd recommend using:

pick_otus.py -m uclust_ref

rather than BLAST for reference-based OTU picking. Reference-based
uclust computes pairwise identity over the full length of the read,
while BLAST computes pairwise identity over the alignable region (with
a minimum alignment length), so the former results in better OTUs (in
my opinion) in that all sequences in the OTU are a minimum of 97%
identical to the seed over their full length.
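A toy example of that difference (ungapped matching assumed for simplicity; real aligners allow gaps): identity over only the alignable region can be perfect while identity over the full read is much lower:

```python
# Toy illustration: identity over the full read length (uclust-style)
# is stricter than identity over only the aligned region (BLAST-style)
# when a read has an unalignable tail. Ungapped matching assumed.

def identity(read, ref, region=None):
    """Fraction of matching positions between read and ref over
    region=(start, end); defaults to the full read length."""
    start, end = region if region else (0, len(read))
    matches = sum(1 for a, b in zip(read[start:end], ref[start:end]) if a == b)
    return matches / float(end - start)

read = "ACGTACGTTTTT"   # last 4 bases don't match the reference
ref  = "ACGTACGTGGGG"

print(identity(read, ref, region=(0, 8)))  # BLAST-like, alignable region only -> 1.0
print(identity(read, ref))                 # uclust-like, full length: about 0.67
```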


The most recent versions of our greengenes reference OTUs are here:

http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs_29nov2010/gg_otus_29nov2010.zip

Check the notes.txt file in that zip for information on which
sequences should be used for what -- basically you'll want the
unaligned rep_set sequences for OTU picking.

Greg

JQL

Feb 14, 2011, 11:35:55 PM
to Qiime Forum
Hi Greg,

Let me see if I get this.
It looks like the otu_map in the otus folder was actually generated
by "pick_otus.py -m uclust -i /rep_set/gg_97_otus_29nov2010.fasta"?
Are you suggesting that the gg_97_otus_29nov2010.fasta file can be
used as the reference sequence set for OTU picking by uclust_ref?
If so, my question is: why don't we eliminate the sequences that map
to the same OTU? For instance, for OTU 154 we only need 107891 in the
reference set, and can remove 389548 and 449142.

Or I totally missed your point.
thanks
John

153 26566
154 107891 389548 449142
155 214536
156 75534




Greg Caporaso

Feb 15, 2011, 7:27:01 AM
to qiime...@googlegroups.com
Hi John,

> It looked like the otu_map from the folder otus was actually generated
> by "pick_otus.py -m uclust -i /rep_set/gg_97_otus_29nov2010.fasta"?

No, OTUs were picked against a larger input set. In this case a
previous version of our greengenes OTUs (6oct2010.fasta). This is all
in the notes.txt file.

> Are you suggesting that the gg_97_otus_29nov2010.fasta file can be
> used as reference sequence set for OTU picking by uclust_ref?

Yes. Of course there are limitations, as with any reference set, on
what taxa are present, so you may lose those with reference-only OTU
picking (pick_otus.py -m uclust_ref -C).

> if so, my question is, why don't we eliminate those sequences that map
> to the same OTU? for instance, OTU 154, we only need 107891 in the ref
> set, and can remove 389548 and 449142.

Only 107891 is in the ref set:

caporaso@saguaro gg_otus_29nov2010> egrep 107891
rep_set/gg_97_otus_29nov2010.fasta
>107891
caporaso@saguaro gg_otus_29nov2010> egrep 389548
rep_set/gg_97_otus_29nov2010.fasta
caporaso@saguaro gg_otus_29nov2010> egrep 449142
rep_set/gg_97_otus_29nov2010.fasta

Do you get a different result running the above commands?

JQL

Feb 15, 2011, 9:32:45 AM
to Qiime Forum
OK, I got it. thanks.

JQL

Feb 20, 2011, 2:52:11 PM
to Qiime Forum
Hi there,

I ran the OTU picking using -m uclust_ref with -C. The reference file
I used is gg_97_otus_29nov2010.fasta, as recommended.

I got a bit less than 700 OTUs (I have ~500,000 sequences). In the log
file, the number of failures was ~450,000. I wonder if this is normal
with uclust_ref.

I looked at your QIIME paper: you started with over 3.8 million seqs
and ended up with 1700 OTUs, but with 0 failures. I understand you
used the blast method.

My concern is that my failure rate is over 80% vs. your 0%. Should I
be concerned or not?
Thanks
John

Greg Caporaso

Feb 22, 2011, 4:08:06 PM
to qiime...@googlegroups.com, JQL
Hi John,
This sounds high, but I have seen the percent of failures be that high
before. The uclust_ref OTU picker is much stricter in OTU assignment
than the BLAST OTU picker that was used in that paper (as well as the
current version which has some minor differences).

You might try allowing for new clusters -- this isn't possible with
the parallel version of uclust_ref, but is the default for the serial
version. In this case you won't get any failures because the sequences
that don't hit reference sequences will form new clusters.

Greg

Steve

Apr 5, 2011, 11:14:54 AM
to Qiime Forum
Hi John,

You can run "export http_proxy=proxy_url" from the command line to set
up a proxy environment variable that will circumvent firewall issues
in most cases!

Cheers,

Steve


JQL

Apr 7, 2011, 5:22:43 PM
to Qiime Forum
I tried this:
export http_proxy=proxy_http://www.google.com
then
wget http://www.google.com
failed.