removing set of sequences


William Orsi

Mar 21, 2012, 3:45:32 PM
to Qiime Forum
Hello,

I have clustered my data, and I would like to remove several OTUs from
the original .fna file so that I can recluster without those
sequences. Is there a command in QIIME that allows one to do this by
providing a list of OTUs to remove? I need to remove ~20 OTUs (and
all representative sequences) from the original .fna file.

Thanks

Bill

Greg Caporaso

Mar 21, 2012, 4:50:59 PM
to qiime...@googlegroups.com
Hi Bill,
The filter_fasta.py script might get you what you need. You can pass it
a text file listing the IDs of the sequences that you'd like to remove,
one per line, along with the fasta file. The output will be a new fasta
file that does not include those sequences.
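
For anyone who wants to see the idea outside of QIIME, here is a minimal Python sketch of filtering a fasta file by a list of IDs. It is not the filter_fasta.py implementation. The helper names (read_fasta, drop_listed_ids) and the example IDs are made up for illustration, and it assumes the sequence ID is the first whitespace-separated word of each header line.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_listed_ids(fasta_handle, ids_to_remove):
    """Return the records whose ID is NOT in ids_to_remove.

    Assumption: the ID is the first word of the header line.
    Blank entries in the ID list are ignored.
    """
    unwanted = {i.strip() for i in ids_to_remove if i.strip()}
    return [(h, s) for h, s in read_fasta(fasta_handle)
            if (h.split() or [""])[0] not in unwanted]

# Hypothetical example data, not taken from the thread.
fasta_text = ">s1_1 sample1\nACGT\n>s1_2 sample1\nGGCC\n>s2_1 sample2\nTTAA\n"
kept = drop_listed_ids(io.StringIO(fasta_text), ["s1_2", ""])
for header, seq in kept:
    print(">%s\n%s" % (header, seq))
```

With real files you would open the fasta file and the ID list with open() instead of using io.StringIO.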

Greg

Tony Walters

Mar 21, 2012, 4:53:58 PM
to qiime...@googlegroups.com
Another possibility would be to create a copy of your OTU mapping file (output from pick_otus.py) and remove the OTU IDs in question, along with the sequences mapped to them. Then pass this file with the -m option to the filter_fasta.py script that Greg mentioned (http://qiime.org/scripts/filter_fasta.html).
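
If editing the OTU map by hand is error-prone, the same step can be scripted. The sketch below is only an illustration and assumes the OTU map is tab-delimited with the OTU ID in the first column and the member sequence IDs in the remaining columns. The function name remove_otus and the example IDs are hypothetical.

```python
def remove_otus(otu_map_lines, otus_to_remove):
    """Return OTU map lines whose OTU ID (first column) is not listed."""
    unwanted = set(otus_to_remove)
    kept = []
    for line in otu_map_lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        if fields[0] not in unwanted:
            kept.append("\t".join(fields))
    return kept

# Hypothetical OTU map: OTU ID, then the sequences assigned to it.
otu_map = ["0\ts1_1\ts2_5", "1\ts1_2", "2\ts2_1\ts2_2\ts1_9"]
filtered = remove_otus(otu_map, ["1"])
for line in filtered:
    print(line)
```

Writing the kept lines to a new file, one per line with plain "\n" line endings, gives a filtered map that can be passed with -m as described above.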

William Orsi

Mar 22, 2012, 11:00:53 AM
to Qiime Forum
Hi,

Thanks for the tip. But I run into a problem when I try to do this.
Here is what happens:

qiime@linux:~/Desktop/Shared_Folder/Bill/CDEBI/90% clustering/neg_otu_removed_analysis$ filter_fasta.py -f denoised_seqs.fna -o denoised_seqs_neg_97_otu_removed.txt -m denoised_97_otu_map_neg_otu_removed.txt
Traceback (most recent call last):
  File "/software/qiime-1.4.0-release/bin/filter_fasta.py", line 137, in <module>
    main()
  File "/software/qiime-1.4.0-release/bin/filter_fasta.py", line 134, in main
    negate)
  File "/software/qiime-1.4.0-release/bin/filter_fasta.py", line 66, in filter_fasta_fp
    return filter_fasta(input_seqs,output_f,seqs_to_keep,negate)
  File "/software/qiime-1.4.0-release/lib/qiime/filter.py", line 35, in filter_fasta
    for seq_id in seqs_to_keep])
IndexError: list index out of range

Any ideas about what the "list index out of range" means?

Thanks

Bill

William Orsi

Mar 22, 2012, 11:23:07 AM
to Qiime Forum
Actually, when I look at the newly saved OTU map (with the OTUs/sequences
that I want to remove deleted) in a text editor, the file looks
completely different from the original OTU map. There are large spaces
and returns, and some OTUs are missing from the OTU map even though they
are there when I open the txt file in Excel. I think it is a problem
with saving the OTU map to a new txt file from Excel, but I am not sure
how to rectify this...

Tony Walters

Mar 22, 2012, 11:45:09 AM
to qiime...@googlegroups.com
Hello Bill,

You want to open the OTU mapping file in a plain text editor (alternatively, you could use vim at the command line: http://linuxconfig.org/Vim_Tutorial) to delete the OTUs that you want to filter out. Excel will put in unwanted characters/text.
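
If the file has already been saved from Excel, it may be possible to clean it up rather than start over. The Python sketch below assumes the damage is of the usual kind: carriage-return line endings, extra tabs padding the shorter rows, and quotes around some cells. That has not been confirmed for this particular file, and the function name clean_otu_map_text is made up. The last few lines show that taking the first field of an empty entry raises the same kind of error as in the traceback. Whether that is exactly what happens inside filter.py is a guess.

```python
def clean_otu_map_text(raw_text):
    """Normalize line endings and strip padding/quotes from an OTU map."""
    # Treat "\r\n" and bare "\r" as ordinary line breaks.
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for line in text.split("\n"):
        fields = [f.strip().strip('"') for f in line.split("\t")]
        fields = [f for f in fields if f]  # drop empty padding cells
        if fields:  # drop lines that are blank after cleaning
            cleaned.append("\t".join(fields))
    return cleaned

# Hypothetical example of an Excel-damaged OTU map.
raw = '0\ts1_1\ts2_5\t\r1\ts1_2\t\t\r\r"2"\ts2_1\ts2_2\r'
cleaned = clean_otu_map_text(raw)
for line in cleaned:
    print(line)

# Taking the first field of an empty entry fails with IndexError.
try:
    "".split()[0]
except IndexError as err:
    message = str(err)
    print("IndexError: " + message)
```

Checking the cleaned lines against the original OTU map before rerunning filter_fasta.py is still a good idea, since this only repairs formatting and cannot restore OTUs that were lost.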


-Tony

Eric

Jun 27, 2012, 4:44:50 PM
to qiime...@googlegroups.com
I have three sets of fasta files: fasta1, fasta2, fasta3.  

fasta2 contains both fasta1 and fasta3 sequences, and I would like to remove this "contamination." 

Is there a way to just use fasta1 and fasta3 as references to remove those sequences from fasta2 and keep the ones that are unique to fasta2?

Tony Walters

Jun 27, 2012, 4:52:00 PM
to qiime...@googlegroups.com
Hello Eric,

You may be able to use filter_fasta.py ( http://qiime.org/scripts/filter_fasta.html ) with a combined fasta file of fasta1 and fasta3. You can build it with: cat XXX YYY > combined_fasta13.fna, where XXX is the path to fasta1 and YYY is the path to fasta3. Pass the combined file with the -a parameter and the fasta2 file as the input file with -f.

-Tony

Eric

Jun 27, 2012, 5:25:28 PM
to qiime...@googlegroups.com
The script seems to cross-check the sequence IDs in each fasta file rather than the sequences themselves (i.e., <#####_fasta1 and <#####_fasta3 against <#####_fasta2), so it either removes everything or, if negated, keeps everything.
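
For comparison, matching on the sequences themselves instead of the IDs could look like the Python sketch below. This is a different technique from what filter_fasta.py does: it removes only exact, case-insensitive sequence matches, so reads that differ by even one base are kept. The helper names and example records are hypothetical.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def remove_shared_sequences(target, references):
    """Keep target records whose sequence is absent from every reference."""
    seen = set()
    for ref in references:
        for _, seq in read_fasta(ref):
            seen.add(seq.upper())
    return [(h, s) for h, s in read_fasta(target) if s.upper() not in seen]

# Hypothetical stand-ins for fasta1, fasta3 and fasta2.
fasta1 = io.StringIO(">a_fasta1\nACGTACGT\n")
fasta3 = io.StringIO(">b_fasta3\nTTTTGGGG\n")
fasta2 = io.StringIO(
    ">x_fasta2\nACGTACGT\n>y_fasta2\nCCCCAAAA\n>z_fasta2\nttttgggg\n")

unique = remove_shared_sequences(fasta2, [fasta1, fasta3])
for header, seq in unique:
    print(">%s\n%s" % (header, seq))
```

Because sequencing errors make exact matches rare in real amplicon data, an OTU-based comparison is probably the more robust route.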

I think using OTUs might be better, using the filter_otus_from_otu_table.py script.



Tony Walters

Jun 27, 2012, 5:28:24 PM
to qiime...@googlegroups.com
You might be able to use the approach outlined in this tutorial, if the OTUs can be related to a particular category: http://www.qiime.org/tutorials/filtering_contamination_otus.html

-Tony

Eric

Jun 27, 2012, 5:48:04 PM
to qiime...@googlegroups.com
Are those scripts newly added in QIIME 1.5? My installation is 1.4. Is there a way to get access to those scripts without upgrading to the latest QIIME installation? (We have Ubuntu 10.04, which doesn't seem to like the newest QIIME installation.)

Tony Walters

Jun 27, 2012, 7:08:22 PM
to qiime...@googlegroups.com
Hello Eric,

It's going to be a bit trickier with 1.4.0.  So you do have an OTU table already made from the combination of all of your data ("good" data and contamination data)?  And there are sample IDs that are associated with the fasta1 and fasta3, and another set of sample IDs from fasta2?

You might be able to take this approach if you do have these data:
Create a set of filtered OTU tables using filter_otus_by_sample.py: one "good" OTU table and one with the contamination samples. Unfortunately, you will then have to do some manual filtering to find the OTUs that are part of the contamination so they can be deleted from the "good" OTU table.

I would suggest trying the MATCH function in Excel for this. Get a row of OTUs from the "good" table and another row from the contamination table, then create another line with =match(X, $Y:$Z, 0), where X is the first OTU number from the good table, Y and Z are the first and last cells of the contamination data, and 0 tells it to search for exact matches. Cells that return a position instead of #N/A are OTUs that also occur in the contamination table.
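
The same lookup can be done without Excel. The Python sketch below assumes both OTU tables are tab-delimited text with the OTU ID in the first column and header lines starting with "#". The table contents and the helper name otu_ids are made up for illustration.

```python
def otu_ids(table_lines):
    """Return the OTU IDs from the first column of a tab-delimited table."""
    ids = []
    for line in table_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank and header/comment lines
        ids.append(line.split("\t")[0])
    return ids

# Hypothetical "good" and contamination OTU tables.
good_table = ["#OTU ID\tS1\tS2", "0\t12\t3", "4\t0\t7", "9\t5\t5"]
contam_table = ["#OTU ID\tNeg1", "4\t20", "7\t2"]

contaminants = set(otu_ids(contam_table))

# OTUs from the good table that also occur in the contamination table.
flagged = [otu for otu in otu_ids(good_table) if otu in contaminants]

# The good table with those OTUs removed (header lines are kept).
clean_rows = [line for line in good_table
              if line.startswith("#")
              or line.split("\t")[0] not in contaminants]

print(flagged)
for row in clean_rows:
    print(row)
```

The flagged list plays the role of the MATCH hits, and clean_rows is the "good" table after the contamination OTUs have been deleted.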

Hope this helps,
Tony Walters

Eric

Jun 28, 2012, 3:19:43 PM
to qiime...@googlegroups.com
Great! Thanks Tony, that worked.
