Merging already demultiplexed samples


Niccolo`

May 14, 2013, 8:40:06 AM5/14/13
to qiime...@googlegroups.com
Hi everybody,

I had a MiSeq PE 2x250 run on my 16S data. I submitted 6 samples, and the sequencing center sent back the sequences already divided by sample according to the index used for each one. So in the end I have 6 files (actually 12 files if we count R1 and R2).

Now each sample has no SampleID or barcode sequences. I have already done some preprocessing on each file (stitching R1 and R2, trimming primers, quality checking, and chimera removal), and for each sample I now have a fasta file that I have to run through one of the OTU picking pipelines. Here is what the reads in one of the fasta files look like:

>M01168_27_000000000-A2Y50_1_1101_14172_1431
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>M01168_27_000000000-A2Y50_1_1101_16756_1431
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

My questions are:

1) When is the right moment to merge the 6 fasta files: before starting the OTU picking procedure (maybe using the UNIX "cat" command?) or after it (for example, merging the OTU tables with merge_otu_tables.py)?
2) How should I merge the files? Do I first need to assign a SampleID to each file using add_qiime_labels.py?
3) This question arises from the need to compare the samples with each other at the end, for example to compute beta diversity. To do that, do I really need to merge all the files and process them as a single file in the OTU picking step, or is there a way to process each file separately and then compare them in the beta diversity analysis? I would guess that processing each file separately instead of one "big" file would lighten the OTU picking process...

Thanks a lot for any help or suggestion!

Cheers

Niccolo`


Tony Walters

May 14, 2013, 9:50:45 AM5/14/13
to qiime...@googlegroups.com
Hello Niccolo,

1.  You would want to merge the files before OTU picking. If you were to pick OTUs at separate times, you would have to use the "closed reference" approach (discarding sequences that do not match the reference database); otherwise you would not be able to compare the samples. With de novo OTUs picked at different times, there is no way to match up OTUs.

2.  Yes, you would want to use add_qiime_labels.py on your 6 separate fasta files to create the merged fasta file with QIIME-compatible labels/enumeration before doing any OTU picking. You also want to create a formal QIIME mapping file with the corrected labels at this point, so that the resulting OTU table will have matching IDs.

3.  Mostly addressed in question 1, but if you wanted to try it separately, you could run add_qiime_labels.py on each of the fasta files individually so you have six separate files (you'd want to set the count start value to be different in each case, using -n with add_qiime_labels.py), and use -m uclust_ref -C with pick_otus.py for the closed-reference approach. You would want to check the output after doing this once to see how many sequences failed to cluster; if a large number did, it would probably be better to go with the combined de novo approach to retain as many sequences as possible.
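To make the labeling concrete, here is a minimal Python sketch of what the relabeling produces: each read gets a ">SampleID_count" header with a running count, which is how QIIME tracks which sample every sequence came from. The sample names and sequences below are made up, and the real add_qiime_labels.py also handles file I/O and mapping-file validation that this sketch skips.

```python
def merge_with_qiime_labels(sample_to_reads, count_start=0):
    """Merge per-sample reads into one list with QIIME-style labels.

    sample_to_reads: dict mapping SampleID -> list of (header, sequence).
    Returns a list of (new_header, sequence) where each new header is
    ">SampleID_count original_header", with count running across samples.
    """
    merged = []
    count = count_start
    for sample_id, reads in sample_to_reads.items():
        for header, seq in reads:
            merged.append((">%s_%d %s" % (sample_id, count, header), seq))
            count += 1
    return merged


# Two tiny hypothetical samples:
reads = {
    "Sample1": [("M01168_27_000000000-A2Y50_1_1101_14172_1431", "ACGT")],
    "Sample2": [("M01168_27_000000000-A2Y50_1_1101_16756_1431", "TTGA")],
}
for header, seq in merge_with_qiime_labels(reads):
    print(header)
    print(seq)
```

This is also why a distinct -n count start per file matters in the separate-files route: it keeps the numeric parts of the labels from colliding when outputs are later combined.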

-Tony




Niccolo`

May 14, 2013, 11:00:39 AM5/14/13
to qiime...@googlegroups.com
Hello Tony,

thanks for the explanation! I think I will merge the files before picking OTUs as you suggested.

Cheers

Niccolo`

Husen Zhang

Jun 28, 2013, 11:47:58 AM6/28/13
to qiime...@googlegroups.com
Tony,
I'd like to jump in and ask a similar question. I too have demultiplexed fastq files, 16 of them from a single study. The individual fastq files do not contain barcodes and look like this (output from "head -8"):

@HWI-M00720:12:000000000-A3342:1:1101:14045:1354 3:N:0:
CCTGTTTGATCCCCACGCTTTCGCACATCAGCGTCAGTTACAGACCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCATATCTCTGCGCATTTCACCGCTACACATGGAAT
+
?????9?BB<BBBBBBCEEFFFEHHHECEFFHECEE@CDGEGFGH,-AACEGHHHHHEEGACECCC?=C-5>C-A-CC---5DDD.D=AEE577@7DE@6A)5DDDD;ED,
@HWI-M00720:12:000000000-A3342:1:1101:16395:1367 3:N:0:

My question is how to combine the 16 fastq files into one, with the mapping information added. The add_qiime_labels.py command seems to only work with fasta files, not fastq?

Husen

Tony Walters

Jun 28, 2013, 12:16:18 PM6/28/13
to qiime...@googlegroups.com
Hello Husen,

If each of the fastq files can be linked to a particular SampleID, you can convert them to fasta/qual format using convert_fastaqual_fastq.py. Then you should be able to use add_qiime_labels.py to make a combined, QIIME-compatible fasta file.
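For intuition, the conversion just splits each 4-line fastq record into a fasta entry plus a matching qual entry, with the quality scores decoded from the ASCII string. Here is a minimal Python sketch (not the actual QIIME implementation), assuming standard 4-line records and a Phred+33 offset:

```python
def fastq_to_fasta_qual(fastq_lines):
    """Yield (header, sequence, quality_scores) per 4-line fastq record.

    Quality scores are decoded from ASCII assuming a Phred+33 offset.
    """
    for i in range(0, len(fastq_lines), 4):
        header = fastq_lines[i][1:].strip()   # drop the leading '@'
        seq = fastq_lines[i + 1].strip()
        quals = [ord(c) - 33 for c in fastq_lines[i + 3].strip()]
        yield header, seq, quals


# A tiny hypothetical record:
fastq = [
    "@HWI-M00720:12:000000000-A3342:1:1101:14045:1354 3:N:0:",
    "CCTG",
    "+",
    "?9?B",
]
for header, seq, quals in fastq_to_fasta_qual(fastq):
    print(">" + header)                 # line for the .fna file
    print(seq)
    print(" ".join(map(str, quals)))    # line for the .qual file
```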

-Tony

Husen Zhang

Jun 28, 2013, 1:34:40 PM6/28/13
to qiime...@googlegroups.com
Tony,
Thank you very much for your help. It worked as you suggested with
the -c option.
Husen


francesca

Sep 26, 2013, 2:58:29 PM9/26/13
to qiime...@googlegroups.com
Hi Tony! 
I have a question about this issue:
If I convert the already demultiplexed fastqs into fna/qual files and then add the QIIME labels, how can I then filter them by length, quality score, etc., as the split libraries script does?
Is there a way to do this with demultiplexed fastqs?
Thanks!

Tony Walters

Sep 26, 2013, 3:06:50 PM9/26/13
to qiime...@googlegroups.com
Hello Francesca,

At this time, the only way to do the quality filtering in the manner you are asking is to create a mapping file for each individual fasta file, each containing a single SampleID and an empty BarcodeSequence field (the header would remain), and then call split_libraries.py with the -b 0 option on each file.
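For example, a minimal single-sample mapping file can be generated like this (the SampleID "Sample1" and file names are placeholders); the commented-out split_libraries.py call shows where it would be used:

```shell
# Write a minimal tab-separated QIIME mapping file for one sample;
# BarcodeSequence and LinkerPrimerSequence are present but empty.
printf '#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\nSample1\t\t\tSample1_reads\n' > map_Sample1.txt

cat map_Sample1.txt

# Then (not run here) quality-filter with a zero-length barcode:
#   split_libraries.py -m map_Sample1.txt -f Sample1.fna -q Sample1.qual -b 0 -o sl_Sample1/
```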


colin.averill

Oct 29, 2014, 6:53:48 PM10/29/14
to qiime...@googlegroups.com
Hi Tony-

I just wanted to follow up. It's one year later and I am in the same boat as Francesca. I have 92 fastq files (ends already paired) with no barcodes that I would like to merge. I would also like to retain the quality files so I can filter by length and quality score using the split_libraries.py command. Is the solution still the same: convert to separate fasta and qual files, make 92 mapping files, run each file through split_libraries.py independently, and then use add_qiime_labels.py to merge them into one file after quality filtering?

Tony Walters

Oct 29, 2014, 7:02:29 PM10/29/14
to qiime...@googlegroups.com
Colin, you could take the paired reads, join them with join_paired_ends.py, and then pass the resulting stitched reads and SampleIDs, comma-separated, to split_libraries_fastq.py (see the last example command, sans the -q parameter, in the split_libraries_fastq.py script documentation).
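Assuming placeholder file names, the two steps would look roughly like this. fastqjoin.join.fastq is the default joined-read output name from join_paired_ends.py, and --barcode_type 'not-barcoded' tells split_libraries_fastq.py that the reads carry no barcodes (see split_libraries_fastq.py -h for the quality-filtering options):

```shell
# Join each sample's read pairs (repeat, or loop, over all 92 samples)
join_paired_ends.py -f Sample1_R1.fastq -r Sample1_R2.fastq -o joined_Sample1/
join_paired_ends.py -f Sample2_R1.fastq -r Sample2_R2.fastq -o joined_Sample2/

# One call with comma-separated inputs and SampleIDs produces a single
# labeled seqs.fna
split_libraries_fastq.py \
  -i joined_Sample1/fastqjoin.join.fastq,joined_Sample2/fastqjoin.join.fastq \
  --sample_ids Sample1,Sample2 \
  -m map.txt --barcode_type 'not-barcoded' -o slout/
```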



colin.averill

Oct 29, 2014, 7:35:14 PM10/29/14
to qiime...@googlegroups.com
Thanks, Tony! This looks like it might be much easier! However, I don't have any metadata mapping files to feed to split_libraries_fastq.py. Is there a way I can just feed it the same uninformative mapping file for each sample, since I will be adding labels downstream using add_qiime_labels.py?

Tony Walters

Oct 29, 2014, 7:41:16 PM10/29/14
to qiime...@googlegroups.com
Yes, you can just pass a single mapping file with one SampleID in it via the -m option (no comma-separated files).

Tony Walters

Oct 29, 2014, 7:41:53 PM10/29/14
to qiime...@googlegroups.com
You won't need the add_qiime_labels.py command either; the comma-separated SampleIDs will put the IDs in the output fasta file when you call split_libraries_fastq.py.

colin.averill

Oct 29, 2014, 7:45:31 PM10/29/14
to qiime...@googlegroups.com
Gotcha. So I'll still need to make 92 separate mapping files for the 92 individual .fastq files, each with its own unique SampleID. It seems to me I would still need to run add_qiime_labels.py to merge them all into one .fasta file, rather than ending up with 92 separate files.

Tony Walters

Oct 29, 2014, 7:47:01 PM10/29/14
to qiime...@googlegroups.com
No, the output of split_libraries_fastq.py will be a single fasta file with the labels already added; the comma-separated input fastq files/SampleID names tell the script which sequences go with which SampleID.

colin.averill

Oct 29, 2014, 9:51:04 PM10/29/14
to qiime...@googlegroups.com
This worked! Thank you! I tried passing a .csv file with my file names and another with my sample IDs to make things more streamlined, but of course that did not work. It would be even cooler if there were a way to run this with all files in a given folder as inputs and auto-increment the SampleID. Maybe there is, and someone more familiar with UNIX could handle this without much trouble. Either way, thanks again.