error in split_libraries_fastq "an incorrect value for phred_offset"

Nilusha Malmuthuge

Jun 13, 2017, 12:15:50 PM
to Qiime 1 Forum

I have 16S sequences (27F-519R) obtained from a MiSeq 250 bp run. After joining the paired ends with join_paired_ends.py, I ran split_libraries_fastq.py, but I got the following error.

split_libraries_fastq.py -i ~/Documents/URTmicrobiota/sequence/6B/first_seq.fastq -o ~/Documents/URTmicrobiota/sequence/6B/split_library_output -m ~/Documents/URTmicrobiota/sequence/map_B.txt --barcode_type 'not-barcoded' --sample_id 6B_ -r 1 -q 19

Traceback (most recent call last):
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 239, in process_fastq_single_end_read_file_no_barcode
    phred_offset=phred_offset):
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/macqiime/anaconda/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: GGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGGTAGCAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGCTTGGGAATCTGGCTTATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCGTAATCTCTACGGAGTAAAGGGTGGGACCTTTTGGCCACCTGCCATAAGATGAGCCCAAGTGGGATTAGGTAGTTGGTGAGGTAAAGGCTCACCAAGCCGACGATCGCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGGAACCCTGATGCAGCCATGCCGCGTGAATGAAGAAGGCCGTCGGGGTGTAAAGTTCTTTCGGTGATGAGGAAGGAGTGAAGTTTAATAGACTTCATTATTGACGTTAGTCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATTC. This may be because you passed an incorrect value for phred_offset.

 

 

After going through a few existing forum threads, I added --phred_offset 33 to the command and still got the same error:

split_libraries_fastq.py -i ~/Documents/URTmicrobiota/sequence/6B/first_seq.fastq -o ~/Documents/URTmicrobiota/sequence/6B/split_library_output -m ~/Documents/URTmicrobiota/sequence/map_B.txt --barcode_type 'not-barcoded' --sample_id 6B_ -r 1 -q 19 --phred_offset 33

Traceback (most recent call last):
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 239, in process_fastq_single_end_read_file_no_barcode
    phred_offset=phred_offset):
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/macqiime/anaconda/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: GGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGGTAGCAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGCTTGGGAATCTGGCTTATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCGTAATCTCTACGGAGTAAAGGGTGGGACCTTTTGGCCACCTGCCATAAGATGAGCCCAAGTGGGATTAGGTAGTTGGTGAGGTAAAGGCTCACCAAGCCGACGATCGCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGGAACCCTGATGCAGCCATGCCGCGTGAATGAAGAAGGCCGTCGGGGTGTAAAGTTCTTTCGGTGATGAGGAAGGAGTGAAGTTTAATAGACTTCATTATTGACGTTAGTCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATTC. This may be because you passed an incorrect value for phred_offset.
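As a quick sanity check on the offset itself, the sketch below decodes a fragment of a quality line taken from one of the reads posted in this thread and tests which of the two common offsets (33 and 64) gives non-negative scores. The function names are made up for illustration and are not part of QIIME or scikit-bio. If 33 is the only plausible offset, the flag was probably not the real problem.

```python
def phred_scores(quality, offset):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def plausible_offsets(quality, candidates=(33, 64)):
    """Return the candidate offsets for which no decoded score is negative."""
    return [off for off in candidates if min(phred_scores(quality, off)) >= 0]


# Fragment of a quality line from one of the reads posted in this thread.
fragment = "E)GGGF?>DCGGC<AFFF*A:FFCF?FGGF5;?FFEFGFC;;@FDFDFDFGGGGG"

print(plausible_offsets(fragment))      # [33]
print(min(phred_scores(fragment, 33)))  # 8, from the ')' character
print(max(phred_scores(fragment, 33)))  # 38, from 'G'
```

With offset 64 the ')' character would decode to a negative score, so this fragment only makes sense as Phred+33.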

 

When I created a small data set to send, I found that there is an additional + line after the sequence name, starting from the 835th sequence. Once this appears in the join.fastq file, I can't run split_libraries_fastq.py; I get the same error message as above. So I assume this extra + is the cause of the problem, and I am looking for a way to work around it.

I have attached my map file and FASTQ files (first_seq_OK.fastq – works fine; first_seq.fastq – gives the error message).

split_libraries_fastq.py -i ~/Documents/URTmicrobiota/sequence/6B/first_seq.fastq -o ~/Documents/URTmicrobiota/sequence/6B/split_library_output_first_1 -m ~/Documents/URTmicrobiota/sequence/map_B.txt --barcode_type 'not-barcoded' --sample_ids 6B_ -r 1 -q 19

Traceback (most recent call last):
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 239, in process_fastq_single_end_read_file_no_barcode
    phred_offset=phred_offset):
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/macqiime/anaconda/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: GGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGGTAGCAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGCTTGGGAATCTGGCTTATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCGTAATCTCTACGGAGTAAAGGGTGGGACCTTTTGGCCACCTGCCATAAGATGAGCCCAAGTGGGATTAGGTAGTTGGTGAGGTAAAGGCTCACCAAGCCGACGATCGCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGGAACCCTGATGCAGCCATGCCGCGTGAATGAAGAAGGCCGTCGGGGTGTAAAGTTCTTTCGGTGATGAGGAAGGAGTGAAGTTTAATAGACTTCATTATTGACGTTAGTCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATTC. This may be because you passed an incorrect value for phred_offset.    
first_seq_OK.fastq
first_seq.fastq
map_B.txt
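Note that the error reports a DNA sequence as the "seq id", which suggests the parser is reading the file out of frame, as an extra + line would cause. A minimal sketch for finding where that starts, assuming each record should be exactly four lines (header, bases, +, quality) and that the reads contain only A, C, G, T and N; the function name is hypothetical:

```python
VALID_BASES = set("ACGTN")  # assumption: no other IUPAC codes in the reads


def first_out_of_frame_record(lines):
    """Return the 1-based number of the first 4-line record that does not look
    like header / bases / '+' / quality, or None if every record passes."""
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        looks_ok = (
            len(record) == 4
            and record[0].startswith("@")
            and len(record[1]) > 0
            and set(record[1]) <= VALID_BASES
            and record[2].startswith("+")
            and len(record[3]) == len(record[1])
        )
        if not looks_ok:
            return start // 4 + 1
    return None


# Toy input: the second record has a stray "+" right after its header.
example = ["@read1", "ACGT", "+", "GGGG", "@read2", "+", "ACGT", "+", "GGGG"]
print(first_out_of_frame_record(example))  # 2
```

For a real file, the list could be built with `open(path).read().splitlines()`, where `path` points at the joined FASTQ file.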

justink

Jun 14, 2017, 12:09:16 AM
to Qiime 1 Forum
Hmm, I hope something didn't happen to your sequences. But if there's just a few lines at the end of the file, here's how to command-line them away:

'tail seqs.fastq'
will show you the end of the file.

'wc -l seqs.fastq' will count lines in the file

'head -n 100 seqs.fastq > newseqs.fastq' will copy the first 100 lines of the file.
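Because a FASTQ record is four lines, `head -n 100` keeps 25 records, and a line count that is not a multiple of 4 hints at stray or missing lines. If the first 834 records of the small attached file are intact, as described above, they would occupy the first 834 × 4 = 3336 lines. A rough Python equivalent of the counting and truncating steps, with made-up helper names and an in-memory stand-in for the file:

```python
import io
from itertools import islice


def count_lines(handle):
    """Count lines in an open text file, like `wc -l`."""
    return sum(1 for _ in handle)


def first_records(handle, n_records):
    """Return the lines of the first n_records FASTQ records (4 lines each)."""
    return list(islice(handle, 4 * n_records))


# In-memory stand-in for a small FASTQ file whose third record is broken.
demo = io.StringIO("@r1\nACGT\n+\nGGGG\n@r2\nTTGA\n+\nFFFF\n@r3\n+\n")

n = count_lines(demo)
print(n, n % 4)  # 10 2, so the line count is not a multiple of 4

demo.seek(0)
print("".join(first_records(demo, 2)), end="")  # the two intact records
```

This only helps when the damage is confined to the end of the file; it will not fix stray lines scattered throughout.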

Nilusha Malmuthuge

Jun 14, 2017, 10:34:22 AM
to Qiime 1 Forum
Thanks for the reply. 
The FASTQ file I posted has only a few sequences. I have ~90K sequences in each sample (72 samples in total). When I go through the files, I can see this + sign in most of the sequences. Strangely, it appears randomly.

Here are a few of those weirdly formatted sequences:

@M00833:558:000000000-B5H6B:1:2106:14051:10005 1:N:0:TGCTACATCA
+
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATACGCCCTACGGGGGAAAGGAGGGGATCTTCGGACCTTTCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTCTTCTAGTTAATACCTAGGATGAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATAC
+
CCCCCGGGGGGFGGFGFGGGGGGGDCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGGGGGGGGGGEGCEGGGFFEGGGGGGGGGGDDEGGGGGGGCGGGGGGGGGGGGDFGEEGGGGGGGGFCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGE)GGGF?>DCGGC<AFFF*A:FFCF?FGGF5;?FFEFGFC;;@FDFDFDFGGGGGFFFE7ED;5FGFCCGAFDD9>FAFGFFGGGGGGFCFFGGGGGGF8DFFGGGGCFGGGF@FCEGFEGFGGFCGGGGGGFGGGGGDGGDGGFFGGGGGGGGGECGGGGGGGGGFGGEFFECDFFGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGFCGGGGGGEGGFCGGGGGFFE8GGGGGGFGGGFFGGGGGGGCCCCC
@M00833:558:000000000-B5H6B:1:2106:19337:10018 1:N:0:TGCTACATCA
+
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATACGCCCTACGGGGGAAAGGAGGGGATCTTCGGACCTTTCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTCTTCTAGTTAATACCTAGGATGAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCCGCCGCGGTAATTC
+

@M00833:558:000000000-B5H6B:1:2106:17296:10023 1:N:0:TGCTACATCA
+

justink

Jun 14, 2017, 1:56:18 PM
to Qiime 1 Forum
Well, that's strange: FASTQ records with a sequence identifier line, no sequence, then the quality identifier line (+), then a quality line that might sometimes be the sequence.

Sigh. I'd remove those and feel a little more worried in general. It looks like it's just the final few in the files you attached.

btw, the regex to find such strange sequences is:

@.*\n\+\n

if that helps at all. I used Sublime Text to find and delete them.
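For files too large to clean by hand, a record-aware pass is one alternative to an editor. The sketch below is only an illustration under stated assumptions: a stray + directly after a header is skipped, a record is kept only if its bases are A/C/G/T/N and its quality line has the same length, and anything else (for example a record with an empty quality line) is dropped. The function name is hypothetical.

```python
VALID_BASES = set("ACGTN")  # assumption: reads contain only these characters


def salvage_records(lines):
    """Return (records, n_repaired); each record is (header, bases, quality)."""
    records = []
    n_repaired = 0
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            i += 1  # not a header: skip until the next candidate
            continue
        j = i + 1
        stray_plus = j < len(lines) and lines[j] == "+"
        if stray_plus:
            j += 1  # step over the extra "+" after the header
        if j + 2 < len(lines):
            bases, plus, quality = lines[j], lines[j + 1], lines[j + 2]
            if (bases and set(bases) <= VALID_BASES
                    and plus.startswith("+")
                    and len(quality) == len(bases)):
                records.append((lines[i], bases, quality))
                n_repaired += int(stray_plus)
                i = j + 3
                continue
        i += 1  # unusable record: move on and resynchronise
    return records, n_repaired


lines = [
    "@r1", "ACGT", "+", "GGGG",       # well-formed
    "@r2", "+", "TTGA", "+", "FFFF",  # stray "+" after the header
    "@r3", "+", "ACGT", "+", "",      # quality missing: dropped
    "@r4", "+", "",                   # nothing usable: dropped
]
records, n_repaired = salvage_records(lines)
print(records)     # [('@r1', 'ACGT', 'GGGG'), ('@r2', 'TTGA', 'FFFF')]
print(n_repaired)  # 1
```

Each kept record can be written back as four lines (header, bases, +, quality). Comparing the number of kept records with the number of headers in the input shows how many reads were lost.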

Nilusha Malmuthuge

Jun 14, 2017, 3:04:32 PM
to Qiime 1 Forum
Thanks again for the prompt answer. I was removing the + signs manually and it is taking forever. It looks like your method is faster than mine. Please excuse my terrible computer skills.
So I downloaded Sublime Text and opened a join.fastq file in it, but I am not sure how to use the regex you gave me.

Many thanks again

Nilusha Malmuthuge

Jun 14, 2017, 4:10:42 PM
to Qiime 1 Forum
PS:
My exact question is: how can I differentiate the + between the base-pair sequence and the quality scores (AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATACGCCCTACGGGGGAAAGGAGGGGATCTTCGGACCTTTCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTCTTCTAGTTAATACCTAGGATGAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATAC
+
CCCCCGGGGGGFGGFGFGGGGGGGDCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGGGGGGGGGGEGCEGGGFFEGGGGGGGGGGDDEGGGGGGGCGGGGGGGGGGGGDFGEEGGGGGGGGFCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGE)GGGF?>DCGGC<AFFF*A:FFCF?FGGF5;?FFEFGFC;;@FDFDFDFGGGGGFFFE7ED;5FGFCCGAFDD9>FAFGFFGGGGGGFCFFGGGGGGF8DFFGGGGCFGGGF@FCEGFEGFGGFCGGGGGGFGGGGGDGGDGGFFGGGGGGGGGECGGGGGGGGGFGGEFFECDFFGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGFCGGGGGGEGGFCGGGGGFFE8GGGGGGFGGGFFGGGGGGGCCCCC) 

from the extra + that appears right after the sequence ID (@M00833:558:000000000-B5H6B:1:2106:14051:10005 1:N:0:TGCTACATCA
+
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATACGCCCTACGGGGGAAAGGAGGGGATCTTCGGACCTTTCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTCTTCTAGTTAATACCTAGGATGAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATAC) 

I ask because this regex selects both of them, so I cannot do a simple find and replace.

Thanks

justink

Jun 16, 2017, 12:23:51 AM
to Qiime 1 Forum
ooh, try this regex: ^@.*\n^\+\n
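To see what the `^` anchors change, here is a small Python sketch using the standard `re` module. The toy strings are constructed for illustration and are not taken from the attached files. With `re.MULTILINE`, `^` matches at the start of every line, so the anchored pattern can only begin at an `@` in the first column, while the unanchored one can also begin at an `@` inside a quality string. Adding a capture group around the header (my addition, not part of the regex above) lets a replace keep the header and drop only the stray + line.

```python
import re

UNANCHORED = re.compile(r"@.*\n\+\n")
ANCHORED = re.compile(r"^@.*\n^\+\n", re.MULTILINE)

# Constructed example: a stray "+" after the header, then a quality-like line
# with "@" in the middle, then another "+".
odd = "@read1\n+\nFF@FF\n+\n"
print(UNANCHORED.findall(odd))  # ['@read1\n+\n', '@FF\n+\n']
print(ANCHORED.findall(odd))    # ['@read1\n+\n']

# Capture the header so that only the stray "+" line is removed.
STRAY_PLUS = re.compile(r"^(@.*\n)\+\n", re.MULTILINE)

broken = "@read1\n+\nACGT\n+\nGGGG\n@read2\nTTGA\n+\nFF@F\n"
fixed = STRAY_PLUS.sub(r"\1", broken)
print(fixed, end="")
```

The + between the bases and the quality line is never touched, because the line before it does not start with `@`. Records whose quality line is missing stay broken after this substitution and still need to be removed separately.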

also, stackoverflow loves this stuff if you want a better answer, maybe even one you can do from the command line.