Extracting sequence information from a fastq file by sample ID

env

May 27, 2015, 9:40:54 PM
to qiime...@googlegroups.com
Hello everyone,


I am trying to extract a subset of fastq sequences based on sample IDs.
When the input file is in fasta format, "extract_seqs_by_sample_id.py" works.
Is there a way to do the same kind of processing with fastq input?

The QIIME version is 1.9.0 on VirtualBox.

***Background information***
The original data was in sff format (454 data) and includes reads from multiple samples.
I want to submit only the reads that belong to certain sample IDs to a database.
The database I am submitting to prefers data in fastq format.
I converted the original sff file into a fastq file by running the commands below:
  #process_sff.py
  #convert_fastaqual_fastq.py
  #split_libraries_fastq.py
****************************


Thanks,
Mana

Jai Ram Rideout

May 28, 2015, 7:54:17 PM
to qiime...@googlegroups.com
Hi Mana,

If you're processing 454 data in SFF format, I recommend using the following workflow:

1. process_sff.py on your SFF file to produce FASTA/QUAL
2. split_libraries.py on your FASTA/QUAL files. Pass -d/--record_qual_scores to have the script output a filtered .qual file.
3. extract_seqs_by_sample_id.py on your demultiplexed sequences file (seqs.fna).
4. extract_seqs_by_sample_id.py on your filtered .qual file (seqs_filtered.qual).
5. convert_fastaqual_fastq.py to convert your FASTA/QUAL subset to FASTQ format.

Hope this helps,
Jai

--

---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

env

May 29, 2015, 4:24:22 AM
to qiime...@googlegroups.com
Hi Jai,

Thank you for the kind reply.

I tried the workflow you recommended.
However, I ran into a problem at step 5.
This is what I actually did:

1. I ran process_sff.py and produced FASTA/QUAL files from my data.

2. I ran the command below and got the seqs.fna and seqs_filtered.qual files.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
split_libraries.py -m Fasting_Map.txt -f sffs/20121015-2.fna -q sffs/20121015-2.qual -o split_library_output2/ -b 6 -d --record_qual_scores
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

3. and 4. I ran the commands below and got extracted_seqs.fna and extracted_seqs.qual
(sequences from sample ID 78 were extracted):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
extract_seqs_by_sample_id.py -i split_library_output2/seqs.fna -o extracted_seqs.fna -s 78
extract_seqs_by_sample_id.py -i split_library_output2/seqs_filtered.qual -o extracted_seqs.qual -s 78
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

5. I ran the command below:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
convert_fastaqual_fastq.py -f extracted_seqs.fna -q extracted_seqs.qual -o extracted_seqs.fastq
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and got this error message:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/convert_fastaqual_fastq.py", line 113, in <module>
    main()
  File "/usr/local/bin/convert_fastaqual_fastq.py", line 110, in main
    full_fasta_headers)
  File "/usr/local/lib/python2.7/dist-packages/qiime/convert_fastaqual_fastq.py", line 43, in convert_fastaqual_fastq
    full_fastq, full_fasta_headers)
  File "/usr/local/lib/python2.7/dist-packages/qiime/convert_fastaqual_fastq.py", line 118, in convert_fastq
    "label (%s)") % label)
KeyError: 'Sequence length does not match QUAL length for label (78_2)'
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Does anyone have an idea why this kind of problem occurs?

Thank you for the help,
Mana


On Friday, May 29, 2015 at 8:54:17 AM UTC+9, Jai Ram Rideout wrote:

Jai Ram Rideout

May 30, 2015, 6:06:40 PM
to qiime...@googlegroups.com
Hi Mana,

That's odd -- can you please send me extracted_seqs.fna and extracted_seqs.qual so I can take a look?

Best,
Jai

Jai Ram Rideout

Jun 1, 2015, 4:33:33 PM
to qiime...@googlegroups.com
Hi Mana,

Thanks for sharing your extracted seqs and qual files. I'm able to reproduce the issue locally -- it looks like the first sequence in the file (78_2) has 268 characters but its corresponding quality score "sequence" only has 264 quality scores.
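For anyone who wants to check their own files for this kind of mismatch, here is a minimal Python sketch. It is not part of QIIME; the function names are made up for illustration, and it assumes standard FASTA/QUAL text where the label is the first token after ">" and QUAL scores are whitespace-separated integers that may wrap across lines.

```python
def read_records(lines):
    """Yield (label, body_lines) for FASTA- or QUAL-style text lines."""
    label, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, body
            # Assumption: the label is the first token after '>'.
            tokens = line[1:].split()
            label, body = (tokens[0] if tokens else ""), []
        elif label is not None:
            body.append(line)
    if label is not None:
        yield label, body


def find_length_mismatches(fasta_lines, qual_lines):
    """Return [(label, n_bases, n_scores)] for records whose lengths differ."""
    seq_lengths = {}
    for label, body in read_records(fasta_lines):
        seq_lengths[label] = len("".join(body))
    mismatches = []
    for label, body in read_records(qual_lines):
        # QUAL scores may wrap over several lines, so join before counting.
        n_scores = len(" ".join(body).split())
        n_bases = seq_lengths.get(label)
        if n_bases is not None and n_bases != n_scores:
            mismatches.append((label, n_bases, n_scores))
    return mismatches


# Tiny in-memory demo with made-up records.
demo_fasta = [">78_2 example", "ACGTAC"]
demo_qual = [">78_2 example", "40 40 40 40"]
print(find_length_mismatches(demo_fasta, demo_qual))
```

To run it on real files, pass open file handles instead of the lists, for example the handles for extracted_seqs.fna and extracted_seqs.qual.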

I'm not sure why/where this discrepancy is being introduced. Is it possible for you to share split_library_output2/seqs.fna and split_library_output2/seqs_filtered.qual with me?

Thanks,
Jai

Jai Ram Rideout

Jun 2, 2015, 12:14:35 PM
to qiime...@googlegroups.com
Hi Mana,

It looks like extract_seqs_by_sample_id.py doesn't work with QUAL files, only FASTA. This was the step where your quality scores were getting truncated.
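If you still need a per-sample QUAL file, a rough way to filter one by hand is sketched below in Python. This is only an illustration, not QIIME code: filter_by_sample_id is a hypothetical helper, and it assumes the labels look like "<SampleID>_<number>" (as in 78_2) with no underscore inside the sample ID itself.

```python
def filter_by_sample_id(lines, sample_ids):
    """Yield the lines of every record whose label belongs to a wanted sample.

    Works on FASTA or QUAL text because it only inspects the '>' header lines
    and passes the body lines through unchanged.
    """
    wanted = set(sample_ids)
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            label = tokens[0] if tokens else ""
            # Assumption: everything before the last underscore is the sample ID.
            keep = label.rsplit("_", 1)[0] in wanted
        if keep:
            yield line


# Tiny in-memory demo with made-up records.
demo_qual = [
    ">78_2 example\n", "40 40 38\n", "37 36\n",
    ">79_1 example\n", "40 40\n",
]
print("".join(filter_by_sample_id(demo_qual, ["78"])), end="")
```

Because the body lines are copied as they are, wrapped QUAL lines stay intact and the number of scores per record does not change.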

I found a different way to accomplish this: pass the output of split_libraries.py to convert_fastaqual_fastq.py and supply the -m/--multiple_output_files option to have a FASTQ file created for each of your samples:

convert_fastaqual_fastq.py -f seqs.fna -q seqs_filtered.qual -o per_sample_fastq -m

The FASTQ file for sample "78" will be per_sample_fastq/seqs_78.fastq.
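If it helps to see what the conversion amounts to, here is a small Python sketch that pairs FASTA and QUAL records for one sample and writes FASTQ text. It is not the QIIME implementation, only an illustration under these assumptions: labels look like "<SampleID>_<number>", quality scores are integers, and the output uses Phred+33 ASCII encoding. The function names are made up.

```python
def read_records(lines):
    """Return {label: list of body lines} for FASTA- or QUAL-style lines."""
    records, label = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            label = tokens[0] if tokens else ""
            records[label] = []
        elif line and label is not None:
            records[label].append(line)
    return records


def to_fastq(fasta_lines, qual_lines, sample_id):
    """Build FASTQ text (Phred+33) for the reads of one sample."""
    seqs = read_records(fasta_lines)
    quals = read_records(qual_lines)
    out = []
    for label, body in seqs.items():
        # Assumption: everything before the last underscore is the sample ID.
        if label.rsplit("_", 1)[0] != sample_id:
            continue
        seq = "".join(body)
        scores = [int(s) for s in " ".join(quals.get(label, [])).split()]
        if len(scores) != len(seq):
            raise ValueError(
                "Sequence length does not match QUAL length for label (%s)" % label
            )
        encoded = "".join(chr(s + 33) for s in scores)
        out.append("@%s\n%s\n+\n%s\n" % (label, seq, encoded))
    return "".join(out)


# Tiny in-memory demo with made-up records.
demo_fasta = [">78_2 example", "ACGT", ">79_1 example", "GG"]
demo_qual = [">78_2 example", "40 40", "30 0", ">79_1 example", "40 40"]
print(to_fastq(demo_fasta, demo_qual, "78"), end="")
```

The length check mirrors the error reported earlier in this thread, so a truncated QUAL record is caught before any FASTQ is written.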

Best,
Jai

env

Jun 3, 2015, 6:15:54 AM
to qiime...@googlegroups.com
Hi Jai,

It seems to have worked without any error messages.

I really appreciate your help, and thank you for your time.

Kind regards,
Mana


On Wednesday, June 3, 2015 at 1:14:35 AM UTC+9, Jai Ram Rideout wrote: