issue getting started with fastq input

Michael Crossley

Feb 11, 2015, 3:51:06 PM
to tas...@googlegroups.com
Hi, I am using GBS to obtain SNPs for population genetic analyses of the Colorado potato beetle and wanted to try TASSEL to get started (working with the Windows GUI for now). I attempted to import data in the "Data" tab (FASTQ files obtained directly from Illumina sequencing of a 96-well plate), but got an error message to the effect that the taxa were already imported and duplicates are not allowed - yet I do not see anything indicating the program has done anything with my FASTQ file.
I then attempted to convert FASTQ to TBT in the "GBS" tab. Something that looked like a progress bar appeared, but nothing happened and the progress bar eventually just disappeared.
What might I be doing wrong?
Sorry for such a newbie question,
-Mike

Lynn Carol Johnson

Feb 15, 2015, 3:22:44 PM
to tas...@googlegroups.com
Hi Michael -

Which version of TASSEL are you running?

TASSEL does not support importing FASTQ files from the Data->import option.

Using the GBS tab on the GUI:  click Help/Logging in the upper right corner. This opens a logging window.  Check the debug level check box.  This should display additional information that you wouldn’t see otherwise.    Adding the “-debug” flag on the command line will also get you more information.

One potential error when running the GBS/FastqToTagCount plugin is that the file name is not in the correct format. The error message looks like this:

[Thread-8] ERROR net.maizegenetics.analysis.gbs.FastqToTagCountPlugin -    The filename does not contain either 3, 4, or 5 underscore-delimited values.

[Thread-8] ERROR net.maizegenetics.analysis.gbs.FastqToTagCountPlugin -    Expect: flowcell_lane_fastq.txt.gz OR flowcell_s_lane_fastq.txt.gz OR code_flowcell_s_lane_fastq.txt.gz
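
For readers hitting this error, here is a rough Python sketch of the check the message describes. It is not TASSEL's actual code: the function name is hypothetical, the accepted suffixes are an assumption based on the file names quoted in this thread, and the positions of flowcell and lane are simply read off the three patterns in the message.

```python
# Assumption: names must end in _fastq.txt.gz or _fastq.gz (the "txt" is optional).
ACCEPTED_SUFFIXES = ("_fastq.txt.gz", "_fastq.gz")


def parse_fastq_name(filename):
    """Return (flowcell, lane) if the name fits an expected pattern, else None."""
    if not filename.endswith(ACCEPTED_SUFFIXES):
        return None
    parts = filename.split("_")
    if len(parts) == 3:   # flowcell_lane_fastq
        return parts[0], parts[1]
    if len(parts) == 4:   # flowcell_s_lane_fastq
        return parts[0], parts[2]
    if len(parts) == 5:   # code_flowcell_s_lane_fastq
        return parts[1], parts[3]
    return None           # too few or too many underscore-delimited values
```

Under these assumptions, D0D7RACXX_1_fastq.gz parses to flowcell D0D7RACXX and lane 1, while a name such as PotatoBeetlePlate1_NoIndex_L007_R1_001.fastq.gz is rejected because it does not end in _fastq.gz.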


Or there could be a memory issue.

[Thread-10] INFO net.maizegenetics.analysis.gbs.FastqToTagCountPlugin - Total barcodes found in lane:96

[Thread-10] ERROR net.maizegenetics.analysis.gbs.FastqToTagCountPlugin - Your system doesn't have enough memory to store the number of sequencesyou specified.  Try using a smaller value for the minimum number of reads.

 

I am running this with TASSEL 5.2.3

Lynn

--
You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tassel/c86ba7ea-ce9a-4c3b-8e8e-42655063cbeb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Crossley

Feb 16, 2015, 5:12:07 PM
to tas...@googlegroups.com, lc...@cornell.edu
Thanks for your input.

To the best of my knowledge, I am using Tassel 5.0.

I did not see any helpful information in the Help/Logging tab after attempting to convert fastq to tag count - there actually wasn't evidence of any activity.

I noticed that my fastq files are just fastq.gz files, not fastq.txt.gz. Is it important for files to have the .txt extension? If so, how do I convert a fastq file to a fastq.txt file? I see plenty of txt --> fastq converters but not the reverse.

Thanks,

Michael Crossley

Feb 16, 2015, 5:12:56 PM
to tas...@googlegroups.com, lc...@cornell.edu
Correction - I am also using version 5.2.3

Lynn Carol Johnson

Feb 17, 2015, 11:18:20 AM
to Michael Crossley, tas...@googlegroups.com
Michael -

You don’t need the .txt extension.  I am able to run with a file of this name:  D0D7RACXX_1_fastq.gz

What values are you using for Max Good Reads/Min Tag Count?  The defaults are these:

 


I find that if I use the default Max Good Reads of 300000000 with a Min Tag Count of 1, I run out of memory. When I lower Max Good Reads or increase the Min Tag Count value, the operation completes successfully.

I’m not sure why error messages are not showing up in the logger.  I’ll look into that.   Meanwhile, try changing one (or both) of the 2 default parameters and see if that makes a difference.

Michael Crossley

Feb 17, 2015, 1:49:52 PM
to tas...@googlegroups.com, mcros...@gmail.com, lc...@cornell.edu
Thanks for the suggestion. I tried it, but did not get a different result:

Just to be clear: there is no error message at all when I attempt Fastq to Tag counts.

I wonder if something is wrong with my input directory or keyfile format. When I browse to set my input directory, I cannot see any of the fastq files in the folder they reside in (left image), but I can see them when I browse to find my keyfile (right image):

Here is a snapshot of my keyfile - the only thing I am in doubt of is my "Flowcell" designation - does this need to match a particular aspect of the fastq file?

Thanks for your time & attention,

Lynn Carol Johnson

Feb 17, 2015, 3:06:36 PM
to tas...@googlegroups.com, mcros...@gmail.com
The problem is the name of your file and the name of your flow cells in the key file. The flow cell name should NOT have underscores. The TASSEL code uses the underscores to identify the flow cell and the lane, then maps them to the key file. Can you remove the underscores from your flow cell name in both your key file and the file name? TASSEL expects one of the following formats (you don’t need the “txt”):

flowcell_lane_fastq.txt.gz OR flowcell_s_lane_fastq.txt.gz OR code_flowcell_s_lane_fastq.txt.gz

Thanks - Lynn
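
A quick way to test the advice above before launching TASSEL is to scan the key file yourself. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes a tab-delimited key file whose header includes "Flowcell" and "Lane" columns (the "Lane" column name is an assumption, it is not spelled out in this thread).

```python
import csv
import io


def check_key_file(key_text, flowcell, lane):
    """Return a list of problems found; an empty list means none were detected."""
    rows = csv.DictReader(io.StringIO(key_text), delimiter="\t")
    # Collect every (flowcell, lane) pair listed in the key file.
    pairs = {
        ((row.get("Flowcell") or "").strip(), (row.get("Lane") or "").strip())
        for row in rows
    }
    problems = []
    for name in sorted({fc for fc, _ in pairs}):
        if "_" in name:
            problems.append(f"Flowcell name contains an underscore: {name}")
    if (flowcell, str(lane)) not in pairs:
        problems.append(f"No key file rows for flowcell {flowcell}, lane {lane}")
    return problems
```

Pass in the key file contents as text together with the flowcell and lane taken from the renamed FASTQ file, for example check_key_file(open("key.txt").read(), "C5TA4ACXX", 7), where key.txt is a placeholder path.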


Michael Crossley

Feb 18, 2015, 12:24:24 PM
to tas...@googlegroups.com, mcros...@gmail.com, lc...@cornell.edu
I tried renaming the keyfile and values in the flowcell column of the key file and still get no response from TASSEL.

Is it important for the "flowcell_lane_" in the file name to match something specific in the sequence fastq files - or does it just need to match the values in the flowcell column of the key file?

I am really wondering if there is something wrong with my fastq files that is preventing TASSEL from recognizing them in my input directory. Is it normal for them not to appear when selecting the input directory?

Jeff Glaubitz

Feb 18, 2015, 12:50:49 PM
to tas...@googlegroups.com

“Or does it just need to match the values in the flowcell column of the key file?”

Yes.  You can rename the fastq file to conform.  The “lane” part should be a number (the corresponding lane number on the flowcell = 7 for your case).

 

The “flowcell” part doesn’t HAVE to match anything inside the file.  But it wouldn’t hurt if it matched the Illumina ID for the flowcell, which you should be able to see like this:

$ gunzip -c PotatoBeetlePlate1_NoIndex_L007_R1_001.fastq.gz | head -12
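
On Windows, where gunzip and head may not be at hand, the same lookup can be sketched in Python. This assumes the reads use the Illumina CASAVA 1.8+ header layout (@instrument:run:flowcell:lane:tile:x:y followed by a space and the read fields); the function name is hypothetical, and older header styles would need different parsing.

```python
import gzip


def flowcell_and_lane_from_header(fastq_gz_path):
    """Read the first FASTQ header line and return (flowcell_id, lane) as strings."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        header = handle.readline().strip()
    if not header.startswith("@"):
        raise ValueError(f"Not a FASTQ header line: {header!r}")
    # Drop the leading '@', keep the part before the first space, split on ':'.
    fields = header[1:].split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"Header does not look like CASAVA 1.8+ style: {header!r}")
    return fields[2], fields[3]
```

Calling it on PotatoBeetlePlate1_NoIndex_L007_R1_001.fastq.gz should return the flowcell ID to use in the new file name and in the key file, provided the header really follows that layout.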

 

The TASSEL-GBS pipeline is not set up for paired-end data (which it looks like you have). Try putting the R1 file (PotatoBeetlePlate1_NoIndex_L007_R1_001.fastq.gz) in a folder by itself, and then rename it to <FLOWCELL>_7_fastq.gz.
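
The copy-and-rename step can also be scripted, which avoids file-manager surprises with hidden extensions. This is a minimal sketch, not part of TASSEL: the function name and the output folder are hypothetical, and it copies rather than moves so the original download stays untouched.

```python
import shutil
from pathlib import Path


def stage_r1_file(r1_path, flowcell, lane, out_dir):
    """Copy the R1 file into out_dir under the name <flowcell>_<lane>_fastq.gz."""
    if "_" in flowcell:
        raise ValueError("Flowcell name must not contain underscores")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # a folder holding only this file
    target = out_dir / f"{flowcell}_{lane}_fastq.gz"
    shutil.copy2(r1_path, target)
    return target
```

For example, stage_r1_file("PotatoBeetlePlate1_NoIndex_L007_R1_001.fastq.gz", "C5TA4ACXX", 7, "r1_only") would create r1_only/C5TA4ACXX_7_fastq.gz, and r1_only can then be chosen as the input directory.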

 

Best,

 

Jeff

 

--
Jeff Glaubitz
Project Manager
Biology of Rare Alleles in Maize and its Wild Relatives
National Science Foundation award IOS-1238014
http://www.panzea.org
Institute for Genomic Diversity
Cornell University
175 Biotechnology Bldg
Ithaca, NY 14853
Phone: 607-255-1386
jcg...@cornell.edu

 

Michael Crossley

Feb 18, 2015, 1:32:04 PM
to tas...@googlegroups.com
Making progress - I managed to get an error message.
Here is a snapshot of my fastq


Here are the properties of the fastq file, showing the correct flowcell and lane in the name, and that the file is a .gz file.


When I tried fastq to tag count with this file, I got no response.
But when I added .gz to the file name (so that it read C5TA4ACXX_7_fastq.gz) I got the following error message.

But my file should end with _fastq.gz
Do you know what is wrong now?

Jeff Glaubitz

Feb 18, 2015, 3:26:50 PM
to tas...@googlegroups.com

Can you show me what

$ ls -Fl 'GBS/fastq/CPB R1 fastq/'

outputs? 

 

Please paste the output, instead of a screen shot.

 

Best,

 

Jeff


Michael Crossley

Feb 18, 2015, 3:48:30 PM
to tas...@googlegroups.com
I am working from the Windows GUI, so I don't think I can do that. (The screenshot of the file was taken from an Ubuntu terminal, where I am storing copies of my files for other analyses, but I'm not running TASSEL from there.)

Or did you mean for me to run this from the Windows command prompt?

Jeff Glaubitz

Feb 18, 2015, 4:01:12 PM
to tas...@googlegroups.com

I am thinking that the “Hide extensions of known file types” default in Windows is preventing you from seeing the actual name of the file.  Perhaps it is “C5TA4ACXX_7_fastq.gz.gz” or something.

 

Either turn off that option in Windows, or try listing the directory in a DOS window (Google the analogous DOS commands for cd and ls).

 

Jeff
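
If neither option is convenient, a few lines of Python will print the file names exactly as stored on disk, regardless of the Explorer display setting. This is just a sketch with hypothetical function names; the doubled-extension test only flags names whose last two extensions are identical, such as ".gz.gz".

```python
from pathlib import Path


def list_real_names(directory):
    """Return the full names of all files in a directory, sorted."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_file())


def doubled_extensions(names):
    """Return the names whose last two extensions are the same (e.g. '.gz.gz')."""
    flagged = []
    for name in names:
        suffixes = Path(name).suffixes
        if len(suffixes) >= 2 and suffixes[-1] == suffixes[-2]:
            flagged.append(name)
    return flagged
```

Running doubled_extensions(list_real_names(folder)) on the folder that holds the FASTQ file would reveal a name like C5TA4ACXX_7_fastq.gz.gz if that is what happened.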

 

 


Michael Crossley

Feb 18, 2015, 4:47:58 PM
to tas...@googlegroups.com
I was able to get fastq to tag counts to work after some putzing - ultimately, though, all of your suggestions came into play. I got it to work after:
-deleting the paired-end fastq files and redownloading them
-isolating one of the paired-end fastq files (C5TA4ACXX_7_fastq.gz) in its own directory
-adding a LibraryPrepID column to the keyfile (after getting an error message indicating this was necessary. I thought this column was optional?)
-decreasing the maximum number of good reads from 300000000 to 10000000 (after the program closed before starting the fastq to tag counts process)
-increasing the minimum number of tags to 3
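
Since a missing LibraryPrepID column was one of the stumbling blocks, a header check like the one below could catch it before a run. It is a rough sketch under stated assumptions: the function name is hypothetical, the key file is assumed to be tab-delimited, and the default list of required columns contains only the two column names mentioned in this thread, so it should be extended to whatever your TASSEL version actually requires.

```python
def missing_key_columns(header_line, required=("Flowcell", "LibraryPrepID")):
    """Return the required column names that are absent from the key file header."""
    present = [name.strip() for name in header_line.rstrip("\r\n").split("\t")]
    return [name for name in required if name not in present]
```

Feeding it the first line of the key file returns an empty list when every required column is present, and the list of missing names otherwise.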

Thanks for all your help,