usearch61 otu picking - discrepancy between otus.txt mapping file and denovo_abundance_sorted.fna

Kendra Turk-Kubo

Feb 5, 2016, 4:05:16 PM
to Qiime 1 Forum
Hello - 

While running pick_otus.py with -m usearch61, we are finding discrepancies between the otus.txt mapping file and the information in the read name generated in denovo_abundance_sorted.fna.  

For example, from the read name in the denovo_abundance_sorted.fna file:

>DimBioDivNE8_58;size=6802

but when we find this sequence in the otus.txt mapping file it is part of a cluster with over 100,000 sequences.

We had assumed that the denovo_abundance_sorted.fna was a fasta with the representative sequences, but now we are unclear which file is intended for use in downstream analysis, and why they don't match up.

Any advice would be much appreciated.

Kendra

Colin Brislawn

Feb 5, 2016, 5:29:37 PM
to Qiime 1 Forum
Hello Kendra,

Great question! I'm glad you are taking the time to track your reads through the OTU picking process.

As you probably already know, demultiplexing adds unique labels to every one of your quality filtered reads. OTU picking starts by dereplicating these reads, then clustering them starting with the most abundant reads. Based on the read name you gave me, I'm guessing that the file denovo_abundance_sorted.fna was the output of dereplication, but had not yet been through clustering.


You mentioned that you had assumed "denovo_abundance_sorted.fna was a fasta with the representative sequences."
Like I said, I'm guessing that entry is a dereplicated read, so while it appears 6802 times, I'm not surprised that it ended up in a cluster containing thousands more reads.
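
To illustrate what I mean by dereplication, here is a rough Python sketch. It is not what usearch61 does internally; the function name, the example reads, and the exact ";size=N" label layout are assumptions made for illustration only.

```python
from collections import Counter


def dereplicate(records):
    """Collapse identical sequences and label each with its copy count.

    records: iterable of (read_id, sequence) pairs.
    Returns (label, sequence) pairs, most abundant first. The label is the
    first read ID seen for that sequence plus ';size=N' (assumed format).
    """
    counts = Counter()
    first_id = {}
    for read_id, seq in records:
        counts[seq] += 1
        first_id.setdefault(seq, read_id)  # remember the first read seen
    # Sort by descending copy count; sorted() is stable, so ties keep input order.
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return [(f"{first_id[seq]};size={n}", seq) for seq, n in ordered]


# Hypothetical reads: three exact copies of one sequence plus one variant.
reads = [
    ("s1_1", "ACGTACGT"),
    ("s1_2", "ACGTACGT"),
    ("s1_3", "ACGTACGA"),
    ("s1_4", "ACGTACGT"),
]
for label, seq in dereplicate(reads):
    print(f">{label}\n{seq}")
```

In this sketch size=3 counts exact duplicates only. A later clustering step can add near-identical reads to the same OTU, so the OTU's line in the mapping file can list far more reads than the size= value suggests.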

Try looking for your OTU centroids inside a file with a name like rep_set.fna.

Colin Brislawn

Kendra Turk-Kubo

Feb 5, 2016, 5:44:30 PM
to Qiime 1 Forum
Thanks for confirming what we expected, Colin.  We have a work-around, but for some reason, the output does not contain a rep_set.fna.

The script I used was:

$ pick_otus.py -i NE9_chimera_usearch61/non_chimeras.fasta -m usearch61 -s 0.99 --de_novo_chimera_detection -k -x -o NE9_pickOTU99/


The files that were generated were only:

abundance_sorted.log          denovo_smallmem_clustered.uc  smallmem_clustered.log
denovo_abundance_sorted.fna   non_chimeras_otus.log
denovo_abundance_sorted.uc    non_chimeras_otus.txt


But the analysis runs to completion as far as I can tell. Any insight?


K


Colin Brislawn

Feb 5, 2016, 7:12:16 PM
to Qiime 1 Forum
Uh... I'm not really sure if this is normal output or not. 

Can you post the first few lines of non_chimeras_otus.log and also non_chimeras_otus.txt?
I'm thinking that the .txt file may actually be a .fna file. 

Another idea! You could run ls -alht
to list all these files along with their modification times, newest first. This will let us see in which order the files were written.

Thanks for working through this one with me.
I'm sorry I don't know the answer directly.
Colin

Kendra Turk-Kubo

Feb 5, 2016, 7:54:42 PM
to Qiime 1 Forum
Hi - The denovo_abundance_sorted.fna is the only fasta file among them, which is why it took me a little while to figure out that the .fna file was not the representative sequences.  Here are the first few lines of the .txt and .log files, and the file listing sorted by time:

$ head non_chimeras_otus.txt
denovo236824 DimBioDivNE9_860622
denovo236825 DimBioDivNE9_860624 DimBioDivNE9_859747
denovo236826 DimBioDivNE9_860609
denovo236827 DimBioDivNE9_860617
denovo236820 DimBioDivNE9_860595
denovo236821 DimBioDivNE9_860588
denovo236822 DimBioDivNE9_860577
denovo236823 DimBioDivNE9_860621
denovo15142 DimBioDivNE9_1601741 DimBioDivNE9_1597207 DimBioDivNE9_1618319 DimBioDivNE9_1625072 DimBioDivNE9_1302067 DimBioDivNE9_1270107 DimBioDivNE9_1268528


[kturk@thalassa NE9_pickOTU99]$ head non_chimeras_otus.log
Usearch610DeNovoOtuPicker parameters:
Application:usearch61
minlen:64
output_dir:NE9_pickOTU99/
percent_id:0.99
remove_usearch_logs:False
rev:False
save_intermediate_files:True
sizeorder:False
threads:1



drwxrwxr-x 6 kturk kturk 4.0K Jan 25 19:32 ..
-rw-rw-r-- 1 kturk kturk  49M Jan 25 16:41 non_chimeras_otus.txt
drwxrwxr-x 2 kturk kturk 4.0K Jan 25 16:41 .
-rw-rw-r-- 1 kturk kturk  351 Jan 25 16:41 non_chimeras_otus.log
-rw-rw-r-- 1 kturk kturk 3.0K Jan 25 16:41 smallmem_clustered.log
-rw-rw-r-- 1 kturk kturk  75M Jan 25 16:41 denovo_smallmem_clustered.uc
-rw-rw-r-- 1 kturk kturk 1.7K Jan 25 16:16 abundance_sorted.log
-rw-rw-r-- 1 kturk kturk 160M Jan 25 16:16 denovo_abundance_sorted.uc
-rw-rw-r-- 1 kturk kturk 362M Jan 25 16:15 denovo_abundance_sorted.fna
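
For anyone who wants to check how many reads each OTU in a map like non_chimeras_otus.txt contains, something along these lines should work. It is only a sketch: the function name is made up, and it assumes each line holds an OTU ID followed by whitespace-separated read IDs, as in the head output above.

```python
def otu_sizes(otu_map_lines):
    """Map each OTU ID to the number of read IDs listed after it."""
    sizes = {}
    for line in otu_map_lines:
        fields = line.split()  # handles tabs as well as spaces
        if not fields:
            continue  # skip blank lines
        otu_id, read_ids = fields[0], fields[1:]
        sizes[otu_id] = len(read_ids)
    return sizes


# Two lines taken from the head output in this thread.
example = [
    "denovo236824\tDimBioDivNE9_860622",
    "denovo236825\tDimBioDivNE9_860624\tDimBioDivNE9_859747",
]
print(otu_sizes(example))  # {'denovo236824': 1, 'denovo236825': 2}
```

On the real file you could pass an open file handle instead of the list, e.g. otu_sizes(open("non_chimeras_otus.txt")), and then compare the count for a given OTU against the size= value in the .fna header.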



Colin Brislawn

Feb 6, 2016, 4:15:29 PM
to Qiime 1 Forum, Kyle Bittinger
Hello Kendra,

I'm still at a loss here. Maybe one of the qiime devs can comment more...

Colin

TonyWalters

Feb 6, 2016, 5:04:45 PM
to Qiime 1 Forum, kylebi...@gmail.com
To get a representative sequence file, use the OTU mapping file (non_chimeras_otus.txt) together with the fasta file that was used as input for pick_otus.py (NE9_chimera_usearch61/non_chimeras.fasta) as inputs to pick_rep_set.py, e.g.:
pick_rep_set.py -f NE9_chimera_usearch61/non_chimeras.fasta -i non_chimeras_otus.txt -o rep_set.fna

When you use the workflow scripts (e.g. pick_open_reference_otus.py), the individual scripts are called in order automatically (pick_otus.py -> pick_rep_set.py -> others...), but if you call pick_otus.py directly, you have to run the other scripts yourself.
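
Conceptually, the rep-set step looks up one read per OTU in the original fasta. Here is a minimal Python sketch of that idea, assuming the first read listed for each OTU is taken as its representative. This is not the actual pick_rep_set.py code; the function names, read IDs, and sequences are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pick_first_rep_set(otu_map_lines, fasta_lines):
    """Return (otu_id, read_id, sequence) using the first read of each OTU."""
    # Index sequences by read ID (first whitespace-delimited token of the header).
    seqs = {header.split()[0]: seq for header, seq in parse_fasta(fasta_lines)}
    reps = []
    for line in otu_map_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and OTUs with no reads
        otu_id, first_read = fields[0], fields[1]
        reps.append((otu_id, first_read, seqs[first_read]))
    return reps


# Hypothetical inputs, for illustration only.
otu_map = ["denovo0\tsampleA_1\tsampleA_3", "denovo1\tsampleA_2"]
fasta = [">sampleA_1", "ACGT", "AC", ">sampleA_2", "GGCC", ">sampleA_3", "ACGA"]
for otu_id, read_id, seq in pick_first_rep_set(otu_map, fasta):
    print(f">{otu_id} {read_id}\n{seq}")
```

Note that the sketch holds every sequence in memory, which may not be practical for an input of several hundred megabytes; for real data the script itself is the better choice.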

Kendra Turk-Kubo

Feb 8, 2016, 2:30:43 PM
to Qiime 1 Forum, kylebi...@gmail.com
Thanks for the clarification Tony - it's clear now that I didn't understand that I needed the additional script.  All is good, and we are on the right path now.