reference green genes database not being used to pick rep set

131 views
Skip to first unread message

Constantino Schillebeeckx

unread,
May 5, 2016, 9:57:30 AM5/5/16
to Qiime 1 Forum
I'm picking open reference otus with a modified green genes database:

parallel_pick_otus_uclust_ref.py -i /home/data_repo/pre_processing/jobs/66/stage_2/master_seqs.fna -o /home/data_repo/pre_processing/jobs/66/stage_2/otupick/step1_otus -r /home/data_repo/pre_processing/otu_support_files/denovo_green_genes/97/rep_set/gg_13_5_pynast_left_2264_right_4052_rep_set.fasta -T --jobs_to_start 10 --max_rejects 0 --similarity 0.97 --max_accepts 50

I would have expected that since I'm passing the reference database, that this would also be used when picking reference sequences; however when I check the log (full log is also attached) I see:

pick_rep_set.py -i /home/data_repo/pre_processing/jobs/66/stage_2/otupick/step1_otus/master_seqs_otus.txt -o /home/data_repo/pre_processing/jobs/66/stage_2/otupick/step1_otus/step1_rep_set.fna -f /home/data_repo/pre_processing/jobs/66/stage_2/master_seqs.fna




Finally, in a somewhat related vein, it's not clear to me how the final rep_set.fna file is generated.  The only thing I see related to it is the comment: 

# Write non-singleton otus representative sequences from step 2 and step 4 to the final representative set and the new reference set (/home/data_repo/pre_processing/jobs/66/stage_2/otupick/rep_set.fna and /home/data_repo/pre_processing/jobs/66/stage_2/otupick/new_refseqs.fna respectively)

However I don't see any commands being executed.


Thanks for the insight!


log_20160418220124.txt

Kyle Bittinger

unread,
May 7, 2016, 6:56:25 AM5/7/16
to Qiime 1 Forum
I understand your confusion with the pick_rep_set.py command, and I'll try to have the authors of this code answer the question directly.

To answer your second question, this step happens in python; a command is not issued to an external program.  Here is the code that's writing the rep-_set.fna file: https://github.com/biocore/qiime/blob/master/qiime/workflow/pick_open_reference_otus.py#L915


Constantino Schillebeeckx

unread,
May 9, 2016, 12:20:20 PM5/9/16
to Qiime 1 Forum
Thanks for pointing me to the source code, looks like the generation of the final rep_set is pretty straightforward.

It looks like the pick_rep_set.py call never has an option of using a reference database (see here).  I suppose this would make sense if one assumes that the seed sequence used in UCLUST is always the same as that defined by (in my case) green genes.  However my results show that this is just not the case. (e.g. the rep_seq assigned to an OTU does not match that defined by green genes).

I supposed I could add an extra pick_rep_set in my pipeline but I'm curious to know why the authors chose not to integreate the -r option when doing closed reference OTU picking.

Constantino Schillebeeckx

unread,
May 9, 2016, 6:08:19 PM5/9/16
to Qiime 1 Forum
A follow up on the discrepancy I'm seeing; after doing open reference OTU picking with a modified green genes database...

cat rep_set.fna | grep '>10647' -A1

gives me the read (let's call it read-A):
TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGGTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATGGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAGGTGCGGGTATCGAACAGG

cat step1_otus/step1_rep_set.fna | grep '>10647' -A1

 gives me the read (let's call it read-B):
TACGGAAGGTCCGGCCGTTATCCAGATTTATTGGGGTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGTTTCCTAGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATNGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAGG

Aligning them I see 98% identity (instead of the 100% I was expecting)

 
When I check the green genes DB I used as a reference for UCLUST I get the following read:
cat gg_13_5_pynast_left_2264_right_4052_rep_set.fasta | grep '>10647' -A1
TACGGAAGGTCCGGCCGTTATCCAGATTTATTGGGGTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGTTTCCTAGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATNGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAGG
 
Which is 100% identical to read-B. As Kyle pointed out, it looks like the rep_set reads from step1 should be the same as those found in the final rep_set ... where am I going wrong? 

Greg Caporaso

unread,
May 10, 2016, 1:03:18 PM5/10/16
to Qiime 1 Forum
Hi Constantino,
You're correct - the reference sequences are not used as the representative sequences by this workflow. The reason for that is that the reference sequences are generally going to be full-length (~1500 bases for 16S), while the sequences being clustered are typically much shorter (150-300 bases, usually). This causes problems downstream when building the phylogenetic tree, as FastTree (the default used by this workflow) doesn't handle bimodal distributions of sequence lengths well, so we use the actual sequences obtained by sequencing as the representatives in this case. 

I could help you with a work-around to generate a custom representative set that would use the reference sequences if you'd like (basically I think we'd filter the new_refseqs.fna file that is created by that workflow to contain only the OTUs that you observed). 

Greg

Constantino Schillebeeckx

unread,
May 10, 2016, 4:22:55 PM5/10/16
to Qiime 1 Forum
Hi Greg,

Thanks a million for the response, that makes sense to me.  In fact, we're using a modified green genes database that's been trimmed to the V4 region.

Filtering down the new_refseqs.fna file is a good idea; I'll simply adjust the way we're doing things to leverage data from within this file.

Constantino

Greg Caporaso

unread,
May 10, 2016, 5:13:40 PM5/10/16
to Qiime 1 Forum
Great, good luck! 

Best,
Greg
Reply all
Reply to author
Forward
0 new messages