join_paired_ends output fastq is broken


Masha T

Jan 10, 2017, 10:40:33 AM
to Qiime 1 Forum
Hello,

I have PE250 MiSeq data (primers 505/806) that has been demultiplexed, with the barcodes in the header. When I run join_paired_ends.py, fastq.join.fastq is the largest output file, so I thought it had worked. Looking more closely, though, the file seems broken: the beginning looks correctly formatted, but by the end there are extra "+" lines separating the sequences and the quality scores.

This is the command I ran:

MacQIIME Mashas-MacBook-Pro:290_joined $ join_paired_ends.py -f MI.M03555_0163.001.FLD0290.SCI017690_SLE16_R1.fastq.gz -r MI.M03555_0163.001.FLD0290.SCI017690_SLE16_R2.fastq.gz -o 290_joined


This is the beginning of fastq.join.fastq:

+

ABBB@BBFFBBBGGGGGGGGGGHHEGGGGAGHHHEGGGGGEHHHGGFGGHHHHHHHFFGGFFGGGGGGGGGGGGFFHGGFGHHGHHHGGHHHHHHHHGHHHHHGGGGGGHHHHHHHHHGHGHHHHHHHHHHHHHHHHHHHH

HHHHGHHHHGHHGHGGGGHHHHHGGGHHHHHGGGGGGHHHHHHFHFFHHHGHGHGHGHHGHHFGGGFHGGGGHHHHHHHHHHHHHHHGGHHFHHHHHHHHGHHHHHGGFFAHHHHGEGGHHHHHFHHHHHHGGGDGGGGFE

BBFDFFFBAA>3

@M03555:163:000000000-ATTLR:1:2105:16042:1617 1:N:0:TCTAGCGTGG

CTAGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCT

TGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGAT

ACCCCAGTAGTCCA

+

>AAAAFFFFFBFGGGGGGGGGGFBGCF?EEGGFFHCEEAEG?GBGGG1EFHFFFGFFFFEFFFHHGH1EFEFFBFGGHGBBFFGFFFGEGGHHHHHGBGAGFHHEGGCGG<FFFHGDGFGGHGGHHFGFG1?GH1?GF>DH

GGHFGH1GHFFHHHGHGGDFG?GHECCEGGD1GGGEGHGDHHHHGFHHBGFGFFB0EHHFHHFEF>E9EGGEGHHHHGFCGGHHHFBFFHHGFDFGDHGHFHGHHGFACGAEEHHHEFCHHHHCHHHHFBBGHGFG3G1FE

?GAFFFFCDA>>1>

@M03555:163:000000000-ATTLR:1:2105:13480:1620 1:N:0:TCTAGCGTGG

ACGTGCCAGCCGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGGTTGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAATTGATACTGGCAGTCTT

GAGTACAGTTGAGGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTAACCTGTAACTGACATTGATGCTCGAAAGTGTGGGTATCAAACAGGATTAGATA

CCCCAGTAGTCA



But the end of the fastq file looks like this:

@M03555:163:000000000-ATTLR:1:1109:20044:28413 1:N:0:TCTAGCGTGG

+

GTGCCAGCAGCCGCGGTAATCCGGAGGCTCCGAGCGTTATCCGGATTTAGTGGGTTTAAAGGGAGCGTAGATGGATTTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGTATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGAGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATCCCCTTGTAGTCC

+

^@A?ABFBAFFFBGCGCECGGGGE2AEG2AFF2E0ECGGAGHHGFFEEH55EGFFEFGHHEHHGC>EEGFG;?BFHFGHFHHHDHHHHFGEGFFGBGFFGH2</<??3BBD>FCFHHHGBGGHFGHH11<<G1FFHHHHFFFGHFHHHHHGGGFFEGABD>HF<CEGGGHFE/E>CFDFHHGG@FB22FFGFGFFB2HHGFBF>BBFG1DDB1F@@GDBFBDFBBFGF0AEGBBB;DGFFD;C9HHGCF/EGA1BFEEGGG3GEFBHGGBHGAB13GA1A11>CDFB1AA>11

@M03555:163:000000000-ATTLR:1:1109:20050:28431 1:N:0:TCTAGCGTGG

+

GTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTTGGAGTCC

+

^@AABBFFBFFFBGGGGGCFG5FGGGGG2BFFEEFGGGGGHGHGFEFGHFGGHHHFGGHHHHHGFGGGGGGAFFHHE;FF?BGHHHFHFHHHGHGFHHB?GA<0GBAFGFFGEG?FF@GHHGCGHHHDGHHHHHHGHHHFFHHHFHGHHHHHGHGGGG?DGCCEHFEFHGEEGFGG2HBHG>1CFF1CF1FFGF0;CG@FBHEFFFFD>/DE/B;1AFEAEFEFFFFF01BFFFD;9;FGFFBBGBABC0AFB1GGECFGF3FDGFGFF1A31313GEF11A1B@B111>111



I believe this is causing the problems I am having in split_libraries.py, so any help would be appreciated. Thanks!

Masha
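
A quick way to locate where a joined file first departs from the four-line FASTQ layout is to walk it record by record. The following is only a minimal sketch, assuming unwrapped four-line records (header, sequence, "+", quality); the function names `open_text` and `first_bad_fastq_record` are hypothetical helpers, not part of QIIME or fastq-join.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file as text, keeping line endings."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii", errors="replace", newline="")
    return open(path, "r", encoding="ascii", errors="replace", newline="")


def first_bad_fastq_record(path):
    """Return (line_number, reason) for the first malformed record, else None.

    Assumes each record occupies exactly four lines.
    """
    with open_text(path) as handle:
        line_no = 0
        while True:
            record = [handle.readline() for _ in range(4)]
            if record[0] == "":
                return None  # reached end of file without finding a problem
            start = line_no + 1
            line_no += 4
            header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
            if not header.startswith("@"):
                return start, "header does not start with '@'"
            if not seq or seq.startswith("+"):
                return start + 1, "missing sequence line"
            if not plus.startswith("+"):
                return start + 2, "separator line does not start with '+'"
            if len(seq) != len(qual):
                return start + 3, "sequence and quality lengths differ"
```

Calling `first_bad_fastq_record("fastq.join.fastq")` on a file like the one above would be expected to report the first record whose second line is a stray "+", which narrows down where the output starts to break.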

Stefan Janssen

Jan 10, 2017, 12:02:00 PM
to Qiime 1 Forum
Hi Masha,
It is hard to check formatting issues without having the files in hand. How big are they? Would you mind sending them to me, e.g. via Dropbox? If that's OK, could you mail the link to sjan...@ucsd.edu?

Masha T

Jan 10, 2017, 2:16:46 PM
to Qiime 1 Forum
Shared it with you via Dropbox, thanks!

The Dropbox folder contains the original R1 and R2 files and my output folder from join_paired_ends.py, along with the split_libraries output, in case you're interested in what the downstream error looked like.

Thanks again!

Stefan Janssen

Jan 10, 2017, 4:51:55 PM
to Qiime 1 Forum
Hi Masha,

I downloaded your files and re-ran join_paired_ends.py. My results differ from what you uploaded: they do not contain those additional lines with "+" symbols. Could you please double-check which version of fastq-join you are using? You can see it with: fastq-join --help
I have version 1.3.1.

I uploaded my results to the Dropbox folder. I cannot test split_libraries.py myself, since I don't have the metadata file containing the barcodes. Please try running split_libraries.py on my results and see whether the error is reproducible.

Best,
Stefan

Masha T

Jan 14, 2017, 2:34:24 PM
to Qiime 1 Forum
Hi Stefan,

Thanks so much for your help, but unfortunately it's still a problem. I had fastq-join version 1.2.1, so I updated it to 1.3.1 and re-ran the command, but I still get the additional lines with the pluses. I tried extracting barcodes and running split libraries on your joined file, and both worked. I'm at a loss as to what could be causing this. I have the latest version of QIIME, 1.9.1. Here's what I get from print_qiime_config.py:

System information

==================

         Platform: darwin

   Python version: 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)  [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

Python executable: /Users/mashataguer/miniconda2/envs/qiime-github/bin/python


QIIME default reference information

===================================

For details on what files are used as QIIME's default references, see here:

 https://github.com/biocore/qiime-default-reference/releases/tag/0.1.3


Dependency versions

===================

          QIIME library version: 1.9.1-dev

           QIIME script version: 1.9.1-dev

qiime-default-reference version: 0.1.3

                  NumPy version: 1.11.3

                  SciPy version: 0.16.0

                 pandas version: 0.16.2

             matplotlib version: 1.4.3

            biom-format version: 2.1.5

                   h5py version: Not installed.

                   qcli version: 0.1.1

                   pyqi version: 0.3.2

             scikit-bio version: 0.2.3

                 PyNAST version: 1.2.2

                Emperor version: 0.9.60

                burrito version: 0.9.1

       burrito-fillings version: 0.1.1

              sortmerna version: SortMeRNA version 2.0, 29/11/2014

              sumaclust version: Not installed.

                  swarm version: Swarm 1.2.19 [Jan 14 2017 14:02:16]

                          gdata: Installed.


QIIME config values

===================

For definitions of these settings and to learn how to configure QIIME, see here:

 http://qiime.org/install/qiime_config.html

 http://qiime.org/tutorials/parallel_qiime.html


                     blastmat_dir: None

      pick_otus_reference_seqs_fp: /Users/mashataguer/miniconda2/envs/qiime-github/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

                         sc_queue: all.q

      topiaryexplorer_project_dir: None

     pynast_template_alignment_fp: /Users/mashataguer/miniconda2/envs/qiime-github/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta

                  cluster_jobs_fp: start_parallel_jobs.py

pynast_template_alignment_blastdb: None

assign_taxonomy_reference_seqs_fp: /Users/mashataguer/miniconda2/envs/qiime-github/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

                     torque_queue: friendlyq

                    jobs_to_start: 1

                       slurm_time: None

            denoiser_min_per_core: 50

assign_taxonomy_id_to_taxonomy_fp: /Users/mashataguer/miniconda2/envs/qiime-github/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt

                         temp_dir: /var/folders/x6/sl6dpg0156q050yy_wm4rjb00000gn/T/

                     slurm_memory: None

                      slurm_queue: None

                      blastall_fp: blastall

                 seconds_to_sleep: 1


And here's the command I ran: join_paired_ends.py -f MI.M03555_0163.001.FLD0290.SCI017690_SLE16_R1.fastq.gz -r MI.M03555_0163.001.FLD0290.SCI017690_SLE16_R2.fastq.gz -o 290

Any other suggestions? Thanks!

Stefan Janssen

Jan 15, 2017, 12:30:22 PM
to Qiime 1 Forum
Transferring files between Windows and Mac/Linux often causes problems because of different line endings (Windows uses \r\n, Mac/Linux only \n). Is there a Windows machine involved in your pipeline?
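
To check whether Windows line endings are present, the lines of each input file can be counted by their terminator. This is a rough sketch, not a QIIME utility; `count_line_endings` is a hypothetical helper name, and it assumes files ending in .gz are gzip-compressed.

```python
import gzip


def count_line_endings(path):
    """Count lines ending in CRLF versus bare LF in a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    crlf = 0
    lf = 0
    with opener(path, "rb") as handle:
        for line in handle:  # binary iteration splits on b"\n"
            if line.endswith(b"\r\n"):
                crlf += 1
            elif line.endswith(b"\n"):
                lf += 1
    return {"crlf": crlf, "lf": lf}
```

A nonzero "crlf" count for either the R1 or the R2 file would suggest that it passed through a Windows tool at some point.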

Masha T

Jan 22, 2017, 8:51:46 PM
to Qiime 1 Forum
Hi, just wanted to follow up on what I've found since.

We found a bunch of null characters (ASCII 0) in the 'broken' output files. I also tried SeqPrep, which gave me a segfault. Then I saw this issue on GitHub, which made me think the problem lies somewhere in MacQIIME or in running on OS X: "Some tests are failing when building on OS X versus Redhat 6.6. OS X gives "nan" while Redhat gives "-nan"."

I'm on OS X 10.11.6, and these problems were reproducible on another Mac.

So I tried this on a virtual machine, and success!
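
For anyone who wants to check their own output for the null characters mentioned above, the file can be scanned in binary mode. This is a minimal sketch under the assumption that the joined file is uncompressed; `find_nul_bytes` is a hypothetical helper name.

```python
def find_nul_bytes(path, chunk_size=1 << 20):
    """Return (count, first_offset) of NUL bytes in a file.

    first_offset is the byte position of the first NUL, or None if there is none.
    """
    count = 0
    first = None
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            found = chunk.count(b"\x00")
            if found and first is None:
                first = offset + chunk.index(b"\x00")
            count += found
            offset += len(chunk)
    return count, first
```

A result other than `(0, None)` for fastq.join.fastq would point to the same kind of corruption described in this post.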

Stefan Janssen

Jan 23, 2017, 12:18:56 PM
to Qiime 1 Forum
Hi Masha. That is a really hard bug to track down. Thanks a lot for sharing your findings!