Input files: .sff .fna .fastq .qual

Sébastien Matamoros

Jul 22, 2013, 2:32:41 PM

to qiime...@googlegroups.com

Hello QIIME users,

I am a new QIIME user; I started just a week ago. I have read most of the documentation and gone through the tutorial.

Here is my problem. My data come from an outside contractor, as demultiplexed files in .sff or .fastq format. I managed to convert the .sff files into .fna and .qual files, but I have one file per sample. I combined the .fna files using the add_qiime_labels.py script, so I now have one big .fna file with all my sequences. However, I can't process it with the pick_de_novo_otus.py script, because it contains short sequences and hasn't been quality-checked. And I can't use the split_libraries.py script to check quality, because I don't have one big .qual file to match the combined .fna file.

I'm probably missing something very obvious here, but I could use some help with this.

Thank you,
Seb

Matthew Stoll

Jul 22, 2013, 4:22:53 PM

to qiime...@googlegroups.com

I also get individual fastq files (one per sample). I have found the FASTX-Toolkit very useful for quality control; it can also convert fastq to fasta.
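For what it's worth, the fastq-to-fasta step on its own is simple enough to sketch in Python. This is only a rough illustration and not how the FASTX-Toolkit does it: it assumes plain 4-line FASTQ records (no wrapped sequences), does no quality filtering at all, and the function name fastq_to_fasta is made up for the example.

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Write the sequences of a 4-line-per-record FASTQ file as FASTA.

    Quality strings are ignored. Returns the number of records written.
    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    """
    count = 0
    with open(fastq_path) as fastq, open(fasta_path, "w") as fasta:
        while True:
            header = fastq.readline().rstrip("\n")
            if not header:
                break  # end of file (or a trailing blank line)
            sequence = fastq.readline().rstrip("\n")
            fastq.readline()  # '+' separator line
            fastq.readline()  # quality string, not needed for FASTA
            if not header.startswith("@"):
                raise ValueError("unexpected FASTQ header: " + header)
            fasta.write(">" + header[1:] + "\n" + sequence + "\n")
            count += 1
    return count
```

Something like fastq_to_fasta("sample1.fastq", "sample1.fna") would then give one .fna file per sample (the file names here are placeholders), but the reads would still need a separate quality check.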

Matt

Tony Walters

Jul 22, 2013, 4:29:41 PM

to qiime...@googlegroups.com

Hello Seb,

How short are these sequences? Can you post a small sample of your combined sequences, as well as the output of print_qiime_config.py and the exact command you used for OTU picking (and the resulting error)?

-Tony

Sébastien Matamoros

Jul 23, 2013, 3:32:54 AM

to qiime...@googlegroups.com

Hello,

When I look at the .fna file containing all the sequences, some of them are 100 bp, and some go down to something like 30 bp. There are more than 500k sequences in the file, so obviously I didn't look at all of them. The attached file is a short excerpt.

I don't get errors as such, because I don't know where to begin in the first place. I can probably run the pick_de_novo_otus.py script on the combined_seqs.fna file, but I'm not sure I should do that without checking quality first. And I don't know how to check quality without a corresponding .qual file: I don't have a combined .qual file, just one .qual file per sample. I feel like I'm running in circles here...

As for the FASTX-Toolkit: can I use it to run the quality check on all the demultiplexed files? Can I then combine the resulting files to get the big combined_seqs.fna file I need for the pick_de_novo_otus.py script?

Thanks a lot!
Seb

Attachment: combined_seqs2.fna

Matthew Stoll

Jul 23, 2013, 9:01:19 AM

to qiime...@googlegroups.com

Hi Seb,

Yes, the FASTX-Toolkit can help you perform quality checking on all of the files. I find it more efficient to combine the quality files into one file first, though. I don't think QIIME has a script that does that; with the help of the bioinformatics folks at my university, I wrote a Perl script that combines all of the individual quality files into one large quality file, where the header of each sample is basically the original name of the demultiplexed file. I then put that qual file through FASTX and converted the resulting output into a fasta file (which FASTX can do). The fasta file should then be usable for OTU picking and all other QIIME analyses.
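The combining step could look roughly like the Python sketch below. It is not the Perl script mentioned above, just an illustration of the same idea under a few assumptions: the sample name is taken from the file name, records are relabeled as SAMPLE_N followed by the original header, and underscores in file names are swapped for periods on the assumption that sample IDs should not contain them. The function name combine_with_labels is made up.

```python
import os


def combine_with_labels(paths, out_path):
    """Concatenate FASTA-style files (.fna or .qual), relabeling each record.

    Each header becomes '>SAMPLE_N original_header', where SAMPLE comes from
    the file name and N counts records within that file.
    Returns the total number of records written.
    """
    total = 0
    with open(out_path, "w") as out:
        for path in sorted(paths):
            # Sample name = file name without its extension; '_' is replaced
            # by '.' (assumption: sample IDs should not contain underscores).
            sample = os.path.splitext(os.path.basename(path))[0].replace("_", ".")
            n = 0
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        out.write(">{0}_{1} {2}\n".format(sample, n, line[1:].strip()))
                        n += 1
                    else:
                        # Re-add the newline so a file without a trailing
                        # newline does not run into the next header.
                        out.write(line.rstrip("\n") + "\n")
            total += n
    return total
```

Running the same function once over the per-sample .qual files and once over the matching .fna files should give two combined files whose labels line up, as long as each sample's two files share a base name and hold their records in the same order.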

Regards,

Matt

Sébastien Matamoros

Jul 23, 2013, 9:17:39 AM

to qiime...@googlegroups.com, Matthew Stoll

Hi Matt,

Thank you, this looks interesting. I run QIIME in the Ubuntu VirtualBox image; do you know whether the FASTX-Toolkit works in that environment?

Seb

Matthew Stoll

Jul 23, 2013, 11:16:14 AM

to qiime...@googlegroups.com

I'm afraid I don't know. I use my university's cluster environment, where the bioinformatics folks were able to install QIIME.

Tony Walters

Jul 23, 2013, 12:05:56 PM

to qiime...@googlegroups.com

Hello Sebastien,

Another option, which may be a bit tedious, is to go back to your original qual and fna files (the ones that are split up according to sample). For each of these, create a mapping file that has a single SampleID in it, and have no barcodes and no primers (unless there is a primer at the beginning of your reads). You will still leave the empty data fields (all white spaces are tabs in the example below, and there should be two between the sample.1 and s.1 strings):

#SampleID BarcodeSequence LinkerPrimerSequence Description

sample.1 s.1
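If typing the tabs by hand is error-prone, a mapping file like this can also be written from Python. This is a minimal sketch with a made-up function name: it assumes the four-column header shown above and puts one tab between each column, so the empty BarcodeSequence and LinkerPrimerSequence fields show up as consecutive tabs. It is worth validating the result with your QIIME version before relying on it.

```python
def write_single_sample_mapping(sample_id, map_path, description=None):
    """Write a one-sample mapping file with empty barcode and primer fields.

    Assumption: four tab-separated columns, as in the example header above.
    """
    header = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"]
    # Empty strings keep the barcode and primer columns present but blank.
    row = [sample_id, "", "", description or sample_id]
    with open(map_path, "w") as out:
        out.write("\t".join(header) + "\n")
        out.write("\t".join(row) + "\n")
```

For the example above, the call would be write_single_sample_mapping("sample.1", "mapping_sample1.txt", description="s.1").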

Then run split_libraries.py on each of the individual fasta, qual, and mapping files, adding the -n X parameter to each call (start X at 1000000 for the first call, 2000000 for the second, and so on). Also select a different output directory with -o for each call. Turn off barcodes and primers (if applicable) with -b 0 and -p. As your sequences seem to be of short length, you may have to lower the minimum sequence length with the -l parameter (e.g. -l 100). Example command:

split_libraries.py -f fasta_filepath -q qual_filepath -m mapping_sample1.txt -b 0 -p -n 1000000 -o sample1_output/ -l 100

You should check the log file after doing the first one to make sure you're not losing most of the sequences (the log file will tell you counts of discarded sequences).
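As a quick cross-check next to the log, you can also count the records going in and coming out. The sketch below does not parse the log itself (its layout is not assumed here); it just counts header lines in the input fasta and in the seqs.fna that split_libraries.py writes. Both function names are made up for the example.

```python
def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def retained_fraction(input_fasta, output_fasta):
    """Fraction of input sequences still present in the output file."""
    before = count_fasta_records(input_fasta)
    after = count_fasta_records(output_fasta)
    # Guard against an empty input file to avoid dividing by zero.
    return (float(after) / before) if before else 0.0
```

A value close to 1.0 for one sample's input fasta and its sample1_output/seqs.fna means few reads were discarded; a very low value is a hint to revisit the -l setting or the quality thresholds.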

Then once this is done, you can combine your sequences using cat, for example:

cat sample1_output/seqs.fna sample2_output/seqs.fna (and so on for all of your output) > combined_seqs.fna

Then you can do OTU picking with combined_seqs.fna.
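With many output directories, typing every path into cat gets unwieldy. The same concatenation can be sketched in Python with a glob pattern. The function name is made up, and the sketch assumes the output directories follow a common naming pattern such as the sample1_output/ style above; the combined file must not itself match the pattern.

```python
import glob
import shutil


def concatenate_seqs(pattern, out_path):
    """Concatenate every file matching a glob pattern into out_path.

    Returns the sorted list of files that were combined.
    Assumption: out_path does not match the pattern.
    """
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                # Copy the bytes through unchanged, like cat would.
                shutil.copyfileobj(handle, out)
    return paths
```

For example, concatenate_seqs("*_output/seqs.fna", "combined_seqs.fna") would gather all per-sample results from the current directory; checking the returned list is an easy way to confirm no sample was missed.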

Hope this helps,

Tony Walters

Sébastien Matamoros

Jul 24, 2013, 3:08:08 AM

to qiime...@googlegroups.com

Hi Tony,

Yes, I was afraid something like that was coming.

However, the sequence length is not that bad. According to the QA report, most of the sequences are between 450 and 500 bp long. I assume that is before trimming, but it is still good. I just need to remove the few short ones that are still there.

I see that I can run the split_libraries.py script on multiple files at once, using a comma to separate them. This looks like what you are suggesting, except that I should be able to process the files in batch mode.

Thanks a lot!

Sébastien

Tony Walters

Jul 24, 2013, 7:50:10 AM

to qiime...@googlegroups.com

Hello Sebastien,

You don't want to call split_libraries.py with comma-separated fasta and qual files in this case. You have to call it separately for each fasta/qual/mapping file set: you can only pass it a single mapping file, and even if you could pass several, it wouldn't be able to tell which sequence went with which sample, as there are no barcodes to distinguish them.

-Tony

Sébastien Matamoros

Jul 24, 2013, 11:37:55 AM

to qiime...@googlegroups.com

Hi Tony,

At last it works. Thanks a lot. One last question (for the moment): is there a way to process these files in "batch mode" or something similar? I have something like 150 files to process, and many more in the future. Is there a way to avoid filling in 150 mapping files by hand and launching the script 150 times?

Thanks a lot!
Seb

Tony Walters

Jul 24, 2013, 12:46:05 PM

to qiime...@googlegroups.com

Hello Seb,

I think you'll have to do a bit of scripting to pull that off. If you're comfortable with Python, you could use the glob function to get the fasta/qual filenames (which hopefully are named by sample), write a mapping file for each (again based on the sample name), and write a split_libraries.py command for each fasta/qual/mapping set to a text file. Then execute that text file (example: sh split_library_commands.txt).
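A rough sketch of that idea is below. It is untested against real data and rests on several assumptions: each sample has NAME.fna and NAME.qual in the same directory, file paths contain no spaces, the sample ID is the file name with underscores swapped for periods, and the Description column simply repeats the sample ID. The function names and the mapping_NAME.txt and NAME_output naming are made up; the split_libraries.py options are the ones from the example command earlier in the thread.

```python
import glob
import os


def build_split_library_commands(input_dir, work_dir, min_length=100):
    """Write one mapping file per sample and return the matching commands."""
    commands = []
    fasta_paths = sorted(glob.glob(os.path.join(input_dir, "*.fna")))
    for index, fasta in enumerate(fasta_paths, start=1):
        base = os.path.splitext(fasta)[0]
        qual = base + ".qual"
        if not os.path.exists(qual):
            continue  # skip fasta files that have no matching .qual file
        # Sample ID from the file name (assumption: '_' is not allowed in IDs).
        sample = os.path.basename(base).replace("_", ".")
        mapping = os.path.join(work_dir, "mapping_{0}.txt".format(sample))
        with open(mapping, "w") as out:
            out.write("#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n")
            out.write("{0}\t\t\t{0}\n".format(sample))
        # -n gets a distinct starting number per sample: 1000000, 2000000, ...
        commands.append(
            "split_libraries.py -f {0} -q {1} -m {2} -b 0 -p -n {3} -o {4} -l {5}".format(
                fasta, qual, mapping, index * 1000000,
                os.path.join(work_dir, sample + "_output"), min_length))
    return commands


def write_command_file(commands, path="split_library_commands.txt"):
    """Write the commands one per line so they can be run with sh."""
    with open(path, "w") as out:
        out.write("\n".join(commands) + "\n")
```

Calling write_command_file(build_split_library_commands("raw_files", "work")) (both directory names are placeholders, and the work directory must already exist) would produce the text file to run with sh. It is probably wise to run the first command alone and check its log before launching all of them.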

Hope this helps,

Tony

Sébastien Matamoros

Jul 25, 2013, 3:33:25 AM

to qiime...@googlegroups.com

Hi Tony,

I'm not familiar at all with Python. I'm gonna try anyway.

Thanks a lot for the help.

Seb
