Compare short and long reads 16S

AnnaC

Apr 16, 2017, 5:54:50 AM

to Qiime 1 Forum

Hello everyone,

I am having trouble comparing V1-V2 fragments with V1-V8 and V1-V9 long reads. I want to compare reads from the same samples sequenced with two different technologies (Ion Torrent and Oxford Nanopore MinION), and I thought a closed-reference analysis would give me some results. However, all the long sequences from the V1-V8 and V1-V9 regions end up in the OTU failures file...

Has anyone had a similar problem? Any suggestions on how to compare and analyze these different data sets from the same samples together?

Thanks in advance!

Anna

Colin Brislawn

Apr 16, 2017, 1:53:22 PM

to qiime...@googlegroups.com

Hello Anna,

This is a cool question! I'm not sure if there is a perfect way to do this, but this method came to mind:

What if you did prefix dereplication on the long MinION reads, then used them as the database for open-reference OTU picking? Then the short reads that match would be assigned to one of the long reads, and any novel diversity would be captured as new OTUs.

In this method, a reference database like Greengenes would only be used later on for taxonomy assignment.
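
To make the first step concrete, here is a minimal Python sketch of exact prefix dereplication. It is only an illustration, not how any particular tool implements it: the function name `prefix_dereplicate` and the toy sequences are made up for this example, and real tools also keep track of abundances. It assumes the reads are plain strings in the same orientation, and it drops any read that is identical to, or an exact prefix of, another read.

```python
def prefix_dereplicate(seqs):
    """Keep only reads that are not an exact prefix of (or identical to) another read.

    After lexicographic sorting, every read that starts with a given read
    sits directly after it, so checking the immediate successor is enough.
    """
    ordered = sorted(seqs)
    kept = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        if following is None or not following.startswith(current):
            kept.append(current)
    return kept


# Toy reads (hypothetical): "ACGT" is a prefix of "ACGTAC", which appears twice.
toy_reads = ["ACGT", "ACGTAC", "ACGTAC", "ACGA", "TTG"]
unique_reads = prefix_dereplicate(toy_reads)
print(unique_reads)
```

Because the comparison is exact, a single sequencing error is enough to stop two reads from collapsing into one.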

This method is not perfect. For example, if a pair of long reads differ in V8 but are identical in V1-V2, all the short reads from V1-V2 could be a 100% match to both of these long reads, and would simply be assigned to whichever one was found first. This is a standing problem whenever comparing reads of different lengths. See this discussion of why global trimming to the same length is good, and local clustering is bad:

http://drive5.com/usearch/manual/global_trimming.html

http://drive5.com/usearch/manual/local_clustering.html
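
The ambiguity with reads of different lengths can be shown with a small Python sketch. Everything here is hypothetical (the helper `matching_references`, the read names, and the toy sequences); it only illustrates that a short read covering a shared region matches both long reads equally well.

```python
def matching_references(short_read, references):
    """Return the names of all reference reads that start with short_read."""
    return [name for name, seq in references.items() if seq.startswith(short_read)]


# Two toy long reads: identical at the start, different further downstream.
long_reads = {
    "long_read_A": "ACGTACGTAC" + "GGGTTT",
    "long_read_B": "ACGTACGTAC" + "GGCTTT",
}

# A toy short read that only covers the shared region.
short_read = "ACGTACGTAC"

hits = matching_references(short_read, long_reads)
print(hits)  # both long reads are exact matches, so the assignment is arbitrary
```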

I hope this is helpful. You are on the 'cutting edge' out here, so there are no 'conventional' methods yet.

Let us know what you find!

Colin

TonyWalters

Apr 16, 2017, 2:02:40 PM

to Qiime 1 Forum

Hello Anna,

To add to what Colin said, you might BLAST a few of the long reads against NCBI nr and see if there are regions (usually at the beginning or end) that don't hit a reference gene. If you do have overhangs like this, there could be non-16S adapter or sequencing constructs in your reads that will interfere with clustering against Greengenes. Issues in the middle of the reads may be a bit more difficult to troubleshoot. I haven't looked directly at Nanopore data yet to see what kind of challenges there are with clustering against 16S databases.

-Tony

AnnaC

Apr 17, 2017, 9:33:45 AM

to Qiime 1 Forum

Hello Colin,

thanks for your fast answer! :)

I think the approach you're suggesting is very interesting... Of course it has some weaknesses, but at least I will be able to get some first insights. What I have noticed is that MinION reads have many errors... So when I pick OTUs with my short reads, I have to use a lower similarity threshold if I want them to match the custom database.
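
A rough Python sketch of why errors in the long reads force a lower threshold. This is not what any OTU picker actually computes: the helper names, the toy sequences, and the two thresholds are assumptions for illustration, and identity is approximated as one minus the edit distance between the short read and the best-matching prefix of the long read, divided by the short read length.

```python
def prefix_edit_distance(query, reference):
    """Smallest edit distance between query and any prefix of reference."""
    # Row for the empty query: turning it into reference[:j] takes j insertions.
    previous = list(range(len(reference) + 1))
    for i, q in enumerate(query, start=1):
        current = [i]  # matching query[:i] against an empty prefix costs i deletions
        for j, r in enumerate(reference, start=1):
            current.append(min(
                previous[j] + 1,             # base in query missing from reference
                current[j - 1] + 1,          # extra base in reference
                previous[j - 1] + (q != r),  # match (0) or mismatch (1)
            ))
        previous = current
    # The reference may extend past the query, so any prefix end is allowed.
    return min(previous)


def approx_identity(query, reference):
    """Approximate identity of query against the start of reference."""
    return 1.0 - prefix_edit_distance(query, reference) / len(query)


short_read = "ACGTACGTAC"
clean_long_read = "ACGTACGTACGGTTAA"  # toy long read without errors
noisy_long_read = "ACGTCGTACGGTTAA"   # same read with one base deleted

for label, long_read in [("clean", clean_long_read), ("noisy", noisy_long_read)]:
    identity = approx_identity(short_read, long_read)
    # 0.97 and 0.85 are example thresholds only.
    print(label, identity, identity >= 0.97, identity >= 0.85)
```

In this toy case a single deletion in a 10-base overlap is already enough to fail the stricter of the two example thresholds.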

I also have some trouble assigning taxonomy to the custom database... RDP and BLAST with SILVA have been running for many hours without giving me any output. I will try with the Greengenes database, too. Those databases contain full 16S rRNA gene sequences, don't they? Maybe I just have to wait longer for the job to finish, but it looks like it is stuck.

I will let you know what I get!

Anna

AnnaC

Apr 17, 2017, 9:38:53 AM

to Qiime 1 Forum

Hello Tony,

You are absolutely right, the raw sequences contain adapters, barcodes and universal tags... So we have demultiplexed the reads, removed those sequences, and re-oriented the reads.

I don't think there are many issues in the middle of the reads, just occasional chimeric sequences (which we have also removed). I think that the main problem is the error rate...

Thanks for your suggestion!

Anna

Colin Brislawn

Apr 17, 2017, 12:47:47 PM

to Qiime 1 Forum

Hello Anna,

Oh, of course the error rate is high! I totally forgot about that.

I am so used to working with the limitations of Illumina reads (very short, very high quality after joining) that I'm not used to balancing the challenges of other sequencers (very long, very low quality).

I could not find any software for 16S data from the Nanopore, but I did find these two open source libraries. Maybe they could help with error correction.

https://github.com/arq5x/poretools

https://github.com/tszalay/poreseq

I appreciate you bringing this discussion to the forums. The nanopore technology is exciting, so I'm glad to learn more about it.

Colin

AnnaC

Apr 19, 2017, 2:24:53 AM

to Qiime 1 Forum

Hello Colin,

thanks for the suggestions. I've already used poretools, but I think I have found a more suitable tool to correct the reads: https://github.com/TGAC/NanoOK

Once I have my custom database (long reads) corrected, I will try the approach you suggested and see how the short reads hit this database :)