How to analyze several samples at a time?

Peter Kos

Nov 21, 2017, 9:52:33 AM
to Qiime 1 Forum
Sorry, I am a bit confused.

I am using qiime virtual box 1.9.1

I have several datasets of two demultiplexed, quality trimmed, tag-removed fastq files each. I also have fastq files (one per sample) containing both directions and I used these to generate the fasta file for input. (The read names make it possible for some programs to identify the pairs, according to MiSeq convention)

I would like to eventually obtain a big OTU table (in any format) that contains all my samples as columns and OTUs as rows (or vice versa).
I used pick_otus.py, which takes a single input file, so I now have dozens of single OTU tables that cannot be compared since the representative OTUs are different and unrelated. I cannot find a script that would sort and relate/link the OTUs in the individual OTU tables and make a combined OTU table of the different samples, so this seems to be a dead end.

Should I rather restart from the beginning, merge the two directions (I have seen some discussions about it but could not find the way to do that in qiime) and collect the file-pairs in a directory and use pick_open_reference_otus.py that can take a directory as input? How will the program know which file belongs to which sample? Does it care about the filenames? Or will the whole set be melted together as one sample with zillions of files?

I saw scripts such as pick_otus_through_taxonomy.py and pick_otus_through_otu_table.py in the documentation and also in the Werner-lab tutorial, but my virtualbox does not know of such scripts.

In short: How can I investigate the whole dataset of several paired-end MiSeq samples?

All hints would be highly appreciated.

Colin Brislawn

Nov 21, 2017, 3:35:52 PM
to Qiime 1 Forum
Hello Peter,

Combining data sets can be tricky, depending on how you want to process your data. Basically, you want to merge your data set, without merging different samples and different OTUs!

There are lots of ways to demultiplex with qiime, but all of them end up with a seqs.fna file with valid qiime labels. This file will be the input to OTU picking. If you have multiple seqs.fna files from different data sets, and all sample names are unique, you can simply concatenate all these files together using the Linux cat command:
cat project1/seqs.fna project2/seqs.fna project3/seqs.fna > projects123_seqs.fna
If two samples from different projects have the same name, they will be combined! In order to avoid that, repeat demultiplexing using unique names.
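
As a rough illustration of that last warning, here is a minimal Python sketch (not part of QIIME) that reports sample IDs occurring in more than one seqs.fna file before you concatenate them. It assumes the usual QIIME 1 label layout, `>SampleID_N ...`, and the function names and example labels are made up for the illustration.

```python
from collections import defaultdict


def sample_id(header):
    """Return the sample ID from a QIIME-style label such as '>SampleA_12 extra'."""
    label = header.lstrip(">").split()[0]
    # Everything before the last underscore is the sample ID.
    return label.rsplit("_", 1)[0]


def samples_per_file(fasta_by_name):
    """Map each sample ID to the set of input files in which it occurs.

    fasta_by_name: dict of {file name: iterable of FASTA lines}.
    """
    seen = defaultdict(set)
    for name, lines in fasta_by_name.items():
        for line in lines:
            if line.startswith(">"):
                seen[sample_id(line)].add(name)
    return seen


def colliding_samples(fasta_by_name):
    """Return the sorted sample IDs that appear in more than one input file."""
    seen = samples_per_file(fasta_by_name)
    return sorted(s for s, files in seen.items() if len(files) > 1)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for project1/seqs.fna and project2/seqs.fna
    # (hypothetical labels and sequences).
    inputs = {
        "project1/seqs.fna": [">soil.1_0 readA", "ACGT", ">soil.2_1 readB", "ACGT"],
        "project2/seqs.fna": [">soil.1_0 readC", "TTGA", ">water.1_1 readD", "TTGA"],
    }
    print(colliding_samples(inputs))
```

With real files you would pass `open(path)` for each seqs.fna instead of the in-memory lists. An empty result means the cat command is safe as far as sample names are concerned.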

The benefit of combining your data before OTU picking is that you can use any algorithm you want. If you want to combine your data after OTU picking, you must use closed-ref OTU picking. If you don't, the de novo OTUs made from each data set will not match. Using closed-ref picking will avoid this problem because all OTUs will match your database.

"dozens of single OTU tables that can not be compared since the representative OTUs are different and unrelated"
Closed-ref OTU picking would solve this problem. :-)

Once you have closed-ref OTU tables, merging them is easy:  
Like before, if two samples have the same name, they will be merged. So you may have to repeat closed-ref picking using unique names. 
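
To see why closed-ref tables combine cleanly, here is a small Python sketch of what merging amounts to. It is not QIIME's implementation, and the dict layout, the function name `merge_tables`, and the OTU and sample IDs are invented for the example. Because every table uses the reference database's OTU IDs, counts can simply be summed per OTU and per sample.

```python
from collections import defaultdict


def merge_tables(*tables):
    """Merge OTU tables given as {otu_id: {sample_id: count}} by summing counts."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for otu_id, counts in table.items():
            for sample, count in counts.items():
                merged[otu_id][sample] += count
    # Convert the nested defaultdicts back to plain dicts.
    return {otu: dict(counts) for otu, counts in merged.items()}


if __name__ == "__main__":
    # Hypothetical closed-ref results: OTU IDs come from the shared reference.
    table1 = {"ref1": {"A": 5}, "ref2": {"A": 1}}
    table2 = {"ref1": {"A": 2, "B": 3}}
    print(merge_tables(table1, table2))
```

Sample A appears in both tables, so its counts are summed, which is the merging of same-named samples described above.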

I hope that helps,
Colin


Peter Kos

Nov 22, 2017, 3:07:17 AM
to Qiime 1 Forum
Thanks a lot, Colin.

If I now concatenate the files, then I'll get back to the state prior to demultiplexing with concomitant loss of the information of which read (on the chip) belonged to which sample.
So I first need to rename the reads in the files according to the sample identity, I guess.
Now my files look like:
>M02372:57:000000000-B5DTG:1:1101:12480:1727_1:N:0:GTAGAGGA+AAGGAGTA
GGATAGCCAAGGTCAGGT...
>M02372:57:000000000-B5DTG:1:1101:12480:1727_2:N:0:GTAGAGGA+AAGGAGTA
GACGCTGGAATGTAACAA...
>M02372:57:000000000-B5DTG:1:1101:15580:2018_1:N:0:GTAGAGGA+AAGGAGTA
GGATAGCCAAGGTCAGGT...

Should I keep the direction info ("_1" and "_2")? I can rename it like
>thissample_1_1
GGAAT...
>thissample_1_2
GGAAT...
>thissample_2_1
GGAAT...
>thissample_2_2

or omitting the directional part, like 
>thissample_1
GGAAT...
>thissample_2
GGAAT...
>thissample_3
GGAAT...
>thissample_4

Will qiime use the paired information?
All the best
Peter

Colin Brislawn

Nov 22, 2017, 11:18:04 AM
to Qiime 1 Forum
Hello Peter,

While you could concatenate before demultiplexing and lose the qiime labels, I recommend concatenating after demultiplexing so that your qiime labels will remain in the output file. Does that make sense? Basically, you concatenate directly before OTU picking.

"Will qiime use the paired information?"
Not by default. You can run join_paired_ends.py to join the paired ends of Illumina reads, then demultiplex with that.


Let me give an example
  • For each project (1 and 2)
    • join paired ends
    • split libraries fastq
  • After running all of these, combine them with
    • cat project1/seqs.fna project2/seqs.fna > project12_seqs.fna
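
The steps above can be sketched as a small Python helper that only builds and prints the commands, so you can check them before running anything. It assumes one forward and one reverse fastq file per sample that is already demultiplexed, which is why `--sample_ids` and `--barcode_type not-barcoded` are used. The file names, the `work` directory layout, and the `fastqjoin.join.fastq` output name (what the default fastq-join method is expected to write) are assumptions to verify on your own system.

```python
def per_sample_commands(sample, fwd_fastq, rev_fastq, out_root="work"):
    """Build the join and split-libraries command lines for one sample."""
    join_dir = "{0}/{1}/joined".format(out_root, sample)
    slout_dir = "{0}/{1}/slout".format(out_root, sample)
    join = ["join_paired_ends.py", "-f", fwd_fastq, "-r", rev_fastq,
            "-o", join_dir]
    split = ["split_libraries_fastq.py",
             # Assumed name of the joined-reads file written by fastq-join.
             "-i", join_dir + "/fastqjoin.join.fastq",
             "--sample_ids", sample,
             "-o", slout_dir,
             "--barcode_type", "not-barcoded"]
    return [join, split]


def workflow_commands(samples, out_root="work", combined="combined_seqs.fna"):
    """Return shell command lines: per-sample steps, then one final cat.

    samples: dict of {sample ID: (forward fastq, reverse fastq)}.
    """
    lines = []
    seqs_files = []
    for sample, (fwd, rev) in sorted(samples.items()):
        for cmd in per_sample_commands(sample, fwd, rev, out_root):
            lines.append(" ".join(cmd))
        seqs_files.append("{0}/{1}/slout/seqs.fna".format(out_root, sample))
    lines.append("cat {0} > {1}".format(" ".join(seqs_files), combined))
    return lines


if __name__ == "__main__":
    # Hypothetical sample IDs and file names.
    samples = {
        "S1": ("S1_R1.fastq", "S1_R2.fastq"),
        "S2": ("S2_R1.fastq", "S2_R2.fastq"),
    }
    for line in workflow_commands(samples):
        print(line)
```

Printing rather than executing keeps the sketch harmless; once the lines look right they can be pasted into the terminal of the virtual box. The sample IDs given here become the prefix of every sequence label, so they must be unique across all projects.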

Colin

Peter Kos

Nov 24, 2017, 11:30:37 AM
to Qiime 1 Forum
Thanks Colin, it ALMOST works. The method you suggested is the one I needed. Thanks a lot.

My problem now is that join_paired_ends.py does not join the majority of the read pairs, even if I decrease the required overlap to 1 base (the minimum with fastq-join) or to 0.1 % (0.4 base, so effectively none; -n 0.001 with SeqPrep).
It is possible that in these bacteria the amplified region is a bit longer than the 2x230 bases that the MiSeq reads, but we can still be certain that the reads (f+r) belong to each other; therefore they should be joined (linked) even if by putting an NNN in between them.
Is there a way to do so with the non-joining reads? It would be unacceptable to discard the majority of the otherwise good-quality reads, both for economic reasons and because it may introduce taxonomic bias.
 
If there is no way to do so, then I'll need to give up on joining. If I then use the two directions "separately", it may result in a higher number (perhaps double) of OTUs (the forward OTUs and the reverse OTUs). These would perhaps eventually fall back to the same set of taxa (if both ends point to the same taxon), but those investigations that rely on OTUs rather than taxa would have this intrinsic error (or bias).
(This bias may only be that everything is there twice, but if two taxa are close to each other, then one side may differ and give two OTUs while the other side does not differ enough and hence combines into one (not two) OTU of summed abundance. It will later be impossible to decide which other side this merged OTU belongs to, so at LCA or whatever taxon determination is used it will fall to another level of the tree.)

So all in all I guess it would be important to combine the two directions, and it would lead to a deeper and more exact taxonomy as well as an unbiased OTU set.

Does it make sense?

Peter

Colin Brislawn

Nov 24, 2017, 3:57:45 PM
to Qiime 1 Forum
Hello Peter,

I'm sorry to hear that your reads are not joining well. How long is the region you targeted? If it's not expected to overlap, then there is not much you can do... I agree that discarding most of your reads is not a good solution.
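
Whether the reads can overlap at all is simple arithmetic, sketched below in Python. The read length of 230 comes from the 2x230 mentioned earlier in the thread; the amplicon lengths are made-up values for illustration, and the sketch assumes both reads are full length after trimming.

```python
def expected_overlap(read_length, amplicon_length):
    """Bases covered by both reads of a pair.

    A positive result is the overlap available for joining;
    zero or a negative result means the reads leave a gap of that size.
    """
    return 2 * read_length - amplicon_length


if __name__ == "__main__":
    for amplicon in (420, 460, 500):  # hypothetical amplicon lengths
        print(amplicon, expected_overlap(230, amplicon))
```

If the result for your target region is zero or negative, no joining settings will help, because there is nothing for the two reads to align on.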

You could play with the settings more to get them to join. While the very short overlap did not help, increasing the number of allowed mismatches may be helpful. Joining reads dramatically increases quality in the area of overlap, so we want to get this working if at all possible.

If joining does not work, the standard other method is to only use the forward read. You just take the forward read, demultiplex and quality filter, combine, then run OTU picking like normal. Using both then combining them later is intractable for all the reasons you mentioned, so people join as the first step, or not at all. 
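
If the two directions are kept in separate fastq files, using only the forward read just means passing the forward file on its own. For a FASTA file that mixes both directions, like the one shown earlier in the thread, a minimal Python sketch for keeping the forward reads could look like this. It assumes the direction appears as `_1` or `_2` directly before `:N:0:` in the header, as in that example; the function name, the relabelling to `sample_N`, and the short sequences are illustrative only, and it does no quality filtering.

```python
import re

# Matches the read direction in headers such as '...:1727_1:N:0:GTAGAGGA+AAGGAGTA'.
DIRECTION = re.compile(r"_([12]):[YN]:\d+:")


def forward_only(lines, sample):
    """Yield FASTA lines of forward reads, relabelled '>sample_N original_id'."""
    keep = False
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = DIRECTION.search(line)
            keep = bool(match) and match.group(1) == "1"
            if keep:
                yield ">{0}_{1} {2}".format(sample, count, line[1:])
                count += 1
        elif keep:
            # Sequence line belonging to a forward read.
            yield line


if __name__ == "__main__":
    fasta = [
        ">M02372:57:000000000-B5DTG:1:1101:12480:1727_1:N:0:GTAGAGGA+AAGGAGTA",
        "AAAA",  # placeholder sequence
        ">M02372:57:000000000-B5DTG:1:1101:12480:1727_2:N:0:GTAGAGGA+AAGGAGTA",
        "CCCC",  # placeholder sequence
    ]
    for out_line in forward_only(fasta, "thissample"):
        print(out_line)
```

The original read ID is kept after the new label, so the sample prefix identifies the sample while the read can still be traced back to the run.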

Let me know what you find. I hope you can get these merged, but using just the forward read is OK too.

Colin
