Ion Torrent SFF files and process_sff.py


sp

May 16, 2012, 1:10:18 AM5/16/12
to Qiime Forum
Hi

Ion Torrent machines output a .sff file. I tried to use the
process_sff.py command to parse it, and I get an error:

process_sff.py -f -i sff/ -o output_dir/
Traceback (most recent call last):
File "/macqiime/QIIME/bin/process_sff.py", line 71, in <module>
main()
File "/macqiime/QIIME/bin/process_sff.py", line 67, in main
use_sfftools=opts.use_sfftools,
File "/macqiime/lib/python2.7/site-packages/qiime/process_sff.py",
line 253, in prep_sffs_in_dir
make_fna(sff_fp, base_output_fp + '.fna', use_sfftools)
File "/macqiime/lib/python2.7/site-packages/qiime/process_sff.py",
line 203, in make_fna
raise IOError("Could not parse SFF %s" % sff_fp)
IOError: Could not parse SFF sff/ion_final.sff


Config file is attached below and any help is appreciated.

Thanks


Config file

System information
==================
Platform: darwin
Python version: 2.7.1 (r271:86832, Dec 15 2011, 08:41:37) [GCC
4.0.1 (Apple Inc. build 5493)]
Python executable: /macqiime/bin/python

Dependency versions
===================
PyCogent version: 1.5.1
NumPy version: 1.5.1
matplotlib version: 1.1.0
QIIME library version: 1.4.0
QIIME script version: 1.4.0
PyNAST version (if installed): 1.1
RDP Classifier version (if installed): rdp_classifier-2.2.jar

QIIME config values
===================
blastmat_dir: None
topiaryexplorer_project_dir: None
pynast_template_alignment_fp: /macqiime/greengenes/
core_set_aligned.fasta.imputed
cluster_jobs_fp: /macqiime/QIIME/bin/
start_parallel_jobs.py
pynast_template_alignment_blastdb: None
assign_taxonomy_reference_seqs_fp: None
torque_queue: friendlyq
template_alignment_lanemask_fp: /macqiime/greengenes/
lanemask_in_1s_and_0s
jobs_to_start: 1
cloud_environment: False
qiime_scripts_dir: /macqiime/QIIME/bin/
denoiser_min_per_core: 50
working_dir: None
python_exe_fp: /macqiime/bin/python
temp_dir: /tmp/
blastall_fp: blastall
seconds_to_sleep: 60

sp

May 16, 2012, 3:52:40 PM5/16/12
to Qiime Forum
Anybody?

Tony Walters

May 16, 2012, 4:19:40 PM5/16/12
to qiime...@googlegroups.com
Hello,

I've seen different output formats from IonTorrent so far (SFF in your case, FASTQ in another). It seems that the .sff file generated is not quite the same format as the .sff files generated by 454 sequencer software. Did the sequencer come with any other software for processing/converting the data?

If you have access to the sffinfo tool from Roche (comes with 454 sequencers) you could try running that directly on the data as well (if it fails, this would also suggest invalid/incompatible formatting).  As far as I know, sffinfo is not publicly available at the moment, although I've read rumors that it will be at some point, so this probably won't be an option unless you already have access to a 454 machine.
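As a quick sanity check without sffinfo, you could also verify the file's common header against the published SFF spec: a valid file starts with the magic number 0x2E736666 (the bytes ".sff") followed by the 4-byte version 0001. The helper below is a hypothetical sketch (not part of QIIME) that just inspects those first 8 bytes:

```python
import struct

def looks_like_sff(path):
    """Check the 8-byte SFF common header: magic number 0x2E736666
    ('.sff') followed by the version bytes \\x00\\x00\\x00\\x01."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8:
        return False
    magic, version = struct.unpack(">I4s", header)
    return magic == 0x2E736666 and version == b"\x00\x00\x00\x01"
```

If this returns True but process_sff.py still fails, the incompatibility lies deeper in the file (index or read layout) rather than in the header itself.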

-Tony

sp

May 16, 2012, 5:05:22 PM5/16/12
to Qiime Forum
Thanks Tony. I will check it out.

GeorgeWatts

May 16, 2012, 7:03:54 PM5/16/12
to Qiime Forum
Hi Tony,

Ion Torrent PGM user here.

I have 16s sequence data and would like to try analysis using Qiime.

From the threads here, and the QIIME tutorial, it seems QIIME is 454
data-specific.

The PGM outputs data in .SFF, .FASTA, and .BAM formats. So, the question
is: is there a way to "squeeze" one of these formats into the .SFF file
that QIIME expects? The obvious answer would be to convert the
PGM .SFF file to a 454 .SFF file. Looking at the two file types (the
PGM one from my data, the 454 one from the QIIME tutorial), it's clear
the two platforms use totally different codes to convey base quality
scores. If this were the only difference between the formats, one could
perhaps convert PGM .SFF files to 454 .SFF files with a script and then
proceed with using QIIME.

Can you shed any light on this? I am obviously ignorant of 454 file
types and outputs.

Thanks in advance,

George

Tony Walters

May 16, 2012, 7:19:04 PM5/16/12
to qiime...@googlegroups.com
Hello George,

We don't have any conversion scripts at the moment for "normalizing" the IonTorrent SFF files to make them compliant with QIIME (or anything else that uses SFF files, for that matter). Even if one were able to convert the .sff files for the purpose of denoising, it probably would not be legitimate to use the flowgrams from IonTorrent without an error model for IonTorrent data (i.e., using 454 models would most likely not give you accurate results). You could put in a request for an IonTorrent .sff processing tool with QIIME (http://sourceforge.net/tracker/?group_id=272178&atid=1157167), but if there isn't a standard format that most sequencing facilities are using for IonTorrent data, it will be difficult to create a general-purpose processing tool that will be useful for most users.

The fasta format file would allow you to use QIIME (these files could be directly demultiplexed with split_libraries.py and used in downstream steps).  Do you get .qual files generated at the same time as the .fasta data?  These could be used in the demultiplexing step as well.

Hope this helps,
Tony Walters

GeorgeWatts

Aug 20, 2012, 8:09:41 PM8/20/12
to qiime...@googlegroups.com
Tony,

Thanks for the thoughtful reply; I am sorry it's taken me so long to respond. I am still very interested in using QIIME with my PGM data, so to continue this thread:

Ok, let's drop the .SFF file approach and focus on joining the QIIME analysis process midstream. The Ion Torrent PGM provides a FASTQ file so the answer to your question about quality information is yes, we have it. The PGM automatically trims the sequencing adaptor from the start of the reads in the FASTQ file. Thus, the sequences in the FASTQ file have the following "structure":

barcode----16s-specific-forward-primer----16s-sequence-of-interest-----16s-specific-reverse-primer----sequencing-adaptor-P1

I have other software that I can use to do any or all of the following:

1) Take the FASTQ file containing the barcoded reads and convert it into individual FASTA files for each barcode.
2) Use the quality information in the FASTQ file to trim and quality filter the reads that make it into the barcode separated FASTA files.
3) Remove the sequencing adaptor(s) and the 16s-specific primer sequence used to generate the amplicons.

Thus, I can generate FASTA files (one for each barcode used) containing sequence that is quality filtered. From reading the tutorial it would seem that these files are ready for generating OTUs using pick_otus_through_otu_table.py, which requires only a .fna file for input. Looking at the Fasting_Example.fna file provided in the QIIME tutorial, it appears that a .fna file is simply a FASTA file in which the header has a specific format consisting of 5 pieces of information.

Example header from Fasting_Example.fna: 

>FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_

I think:
1) FLP3FBN01ELBSX is the unique read identifier.
2) length is the length of the read.
3) xy is a unique location specific to the region of the 454 PTP plate.
4) region is the region of the PTP plate.
5) run is a unique identifier for the 454 run.

I am guessing that pick_otus_through_otu_table.py really only needs the first part of the header (the unique sequence identifier) and the associated sequence to work, is this correct? If so, all I need to do is generate my FASTA files using my other software and then feed them into pick_otus_through_otu_table.py.

Thanks in advance for your reply. 


Regards,

George

Antonio González Peña

Aug 21, 2012, 2:58:35 AM8/21/12
to qiime...@googlegroups.com
Hi George,

The issue is that the unique identifier after split_libraries.py has
to be the sample name, an underscore, and a number; the number is what
makes each read unique. More info:
http://qiime.org/documentation/file_formats.html#demultiplexed-sequences

Do you know if it is possible to ask the PGM not to autotrim the
sequences? Another option is to use the software to create one
fasta/fastq file for the barcodes and another for the sequences; can
it do this?

Cheers



--
Antonio González Peña
Research Assistant, Knight Lab
University of Colorado at Boulder
https://chem.colorado.edu/knightgroup/

Tony Walters

Aug 21, 2012, 12:13:12 PM8/21/12
to qiime...@googlegroups.com
Hello George,

There is also a script (in the development version of QIIME, http://qiime.org/svn_documentation/index.html#qiime-development), convert_fastaqual_fastq.py, that will allow you to convert your fastq files to fasta/qual files (and vice versa). With the fastq files in the structure you listed above, you should be able to convert directly to fasta/qual and use those as inputs to split_libraries.py. Since the reverse primers/sequencing adapters may still be in the reads, if they are long enough, you would want to use the reverse primer removal option (-z truncate_only) with split_libraries.py.
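If you don't have the development version handy, the fastq-to-fasta/qual split itself is simple enough to sketch by hand. Here is a minimal version, assuming Sanger-style Phred+33 quality encoding and 4-line FASTQ records (check what your PGM software actually emits before relying on the offset):

```python
def fastq_to_fasta_qual(fastq_path, fasta_path, qual_path, offset=33):
    """Split a 4-line-record FASTQ file into a FASTA file and a .qual
    file of space-separated Phred scores (assumes Phred+offset ASCII)."""
    with open(fastq_path) as fq, \
         open(fasta_path, "w") as fa, \
         open(qual_path, "w") as ql:
        while True:
            header = fq.readline().rstrip()
            if not header:
                break
            seq = fq.readline().rstrip()
            fq.readline()  # skip the '+' separator line
            quals = fq.readline().rstrip()
            label = header[1:]  # drop the leading '@'
            fa.write(">%s\n%s\n" % (label, seq))
            scores = " ".join(str(ord(c) - offset) for c in quals)
            ql.write(">%s\n%s\n" % (label, scores))
```

The resulting pair of files would then be passed to split_libraries.py as -f (fasta) and -q (qual).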

Hope this helps,
Tony Walters


GeorgeWatts

Aug 21, 2012, 1:29:16 PM8/21/12
to qiime...@googlegroups.com
Antonio and Tony,

Thanks for your replies. Because the quality scores in the PGM's FASTQ file are encoded differently from 454's, I am going to perform the quality filtering and trimming steps using other software and enter the QIIME pipeline at the pick_otus.py or pick_otus_through_otu_table.py step. From the info here: http://qiime.org/documentation/file_formats.html#demultiplexed-sequences, rather than passing a separate FASTA for each barcode, I am going to pass a single FASTA file with the barcode info encoded into the header. The example header has the following structure:

>PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0

But, importantly, only the first part of the header (PC.634_1 in the example above) is actually used by pick_otus.py or pick_otus_through_otu_table.py.

PC.634 is a unique sample identifier of my choosing and the number after the underscore is a read number starting at 1 and counting up until all reads for a particular sample have a unique number.

Optionally, I can include my unique read identifier from the PGM in place of the 454 unique read identifier (FLP3FBN01ELBSX in the example above).

orig_bc, new_bc, and bc_diffs are not necessary, as they are not used by pick_otus.py or pick_otus_through_otu_table.py.

Lastly, the sequence that follows each header should be fully trimmed of barcode, adaptors, and primer sequence on both ends (in addition to quality trimmed and filtered).
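In code, the relabeling I'm describing might look something like the sketch below. This is a hypothetical helper, not a QIIME script; sample_files maps each sample ID of my choosing to its trimmed per-barcode FASTA file, and the counter runs across all samples (per-sample numbering would work too, as long as every label stays unique):

```python
def merge_per_sample_fastas(sample_files, output_path):
    """Combine per-sample FASTA files into one QIIME-style seqs file,
    relabeling each record as >SampleID_N (original read ID kept after
    a space). sample_files: dict mapping sample ID -> FASTA file path."""
    count = 0
    with open(output_path, "w") as out:
        for sample_id, path in sorted(sample_files.items()):
            with open(path) as fasta:
                seq_lines, orig_id = [], None
                for line in fasta:
                    line = line.rstrip()
                    if line.startswith(">"):
                        if orig_id is not None:  # flush previous record
                            count += 1
                            out.write(">%s_%d %s\n%s\n" % (
                                sample_id, count, orig_id, "".join(seq_lines)))
                        orig_id = line[1:].split()[0]
                        seq_lines = []
                    elif line:
                        seq_lines.append(line)
                if orig_id is not None:  # flush last record in the file
                    count += 1
                    out.write(">%s_%d %s\n%s\n" % (
                        sample_id, count, orig_id, "".join(seq_lines)))
```

The output file would then go straight into pick_otus.py or pick_otus_through_otu_table.py.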

I will give this a try.

Antonio González Peña

Aug 21, 2012, 2:13:43 PM8/21/12
to qiime...@googlegroups.com
Let us know how it goes and the commands you use to make it work. Thanks.