Error with setupBarcodeFiles


vin6...@gmail.com

May 26, 2013, 3:59:59 PM
to tas...@googlegroups.com
Hello,
I'm learning about the TASSEL pipeline and testing the first step, FastqToTagCountPlugin (run through run_pipeline.pl). I'm getting this error:

Error with setupBarcodeFiles: java.lang.ArrayIndexOutOfBoundsException: 0
Total barcodes found in lane:0
Total barcodes found in lane:0
No barcodes found.  Skipping this flowcell lane.

I've confirmed that my barcode file is formatted appropriately (pasted below), and that my fastq file name is in one of the accepted formats (first few seq records pasted below), and even renamed it to be similar to the example in the documentation (C10RAACXX_4_fastq.txt).  Since this is my first attempt to run this script, I'm not sure what else to check here.

Any help to get me going on this first step will be much appreciated!

COMMAND:
/local/cluster/hts/gbs/tassel3.0_standalone/run_pipeline.pl -fork1 -FastqToTagCountPlugin -i /usda_ars_data/Bassil_Lab/FC1094/lane2/fastq -k /usda_ars_data/Bassil_Lab/FC1094/lane2/FRA_bar_1st6.txt -e ApeKI -o tagCounts -endPlugin -runfork1


BARCODE FILE:
Flowcell        Lane    Barcode Sample  Plate   Row     Column
AC1JDAACXX_1094 2       CTCC    UK_1    Plate1  A       1
AC1JDAACXX_1094 2       TGCA    UK_16   Plate1  A       2
AC1JDAACXX_1094 2       ACTA    UK_32   Plate1  A       3
AC1JDAACXX_1094 2       CAGA    UK_55   Plate1  A       4
AC1JDAACXX_1094 2       AACT    UK_67   Plate1  A       5
AC1JDAACXX_1094 2       GCGT    UK_92   Plate1  A       6

FASTQ FILE:
@DB775P1:229:C1JDAACXX:2:1101:1445:1923 1:N:0:
NCTGTGACTGCCCGCATTGTTGACGACATTGTTGAATCCCTTGTCCACGGAAGGGTTGCTGTTGGTCGCACCTCCGCAGGCACTGCTCGTATCTCTGTTCA
+
#11AD?AAHHAFDEGBGGG?DBAFHCGG<DDBDEBDGDB?BF>D9?CGII'55;B(=B9?@?@CC=;/9'88(<235-0<<ABC9@@48)298((:4@A@3
@DB775P1:229:C1JDAACXX:2:1101:1306:1923 1:N:0:
NTAACAGCAACCGGAGCCTTCGCCTGAAGAGGANGCCCTCAAGAGGAACACCGACTGCGTCTACTTCCTCGCCTCCCCCGTCTCCTGCAAAAAGGTTCCGC
+
#1=DFFFFHHHHHJJJJJJJJGJJJJJJFHHIJ#0?FHIJJIJGIJIIIJGHFFFCDED@DDBDEDDDDDBDDDDABDD@@DBDD@ACCC>AB@DCDDDBB
@DB775P1:229:C1JDAACXX:2:1101:1473:1929 1:N:0:
NTCAGACAGCATTTCGAGTAACTCCTCAACCTGGAGTTCCGCCTGAGGAAGCAGGGGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
+
#1:BDDDBDBDBDBEEEE:C:A<FEDFD@FEEII@CEDFD@DDDDIDIAB=8=CCDIAAACCADD>@@@??>(3=2'5>A>A<<>?AAAA?##########
@DB775P1:229:C1JDAACXX:2:1101:1418:1930 1:N:0:
NAGATACTGCGCCGACTCTCCGCGGAGCTGGACGATCTTGGAGTCGTCGCGGATGACGGCGATGGGGCCGCCGAAGGGGGCGCAGGCGACCTTGTTGCGGC
+
#1=DDFFFGHHGHIJJIJJJJGGIGHHICHIIJJDFHFFFFFD@CCABB@DDB@DACBDDD<7BBBBB@BDB>5<<@BDB599<>B9>B>B<@CA@BCBB@



Jeff Glaubitz

May 28, 2013, 11:21:59 AM
to tas...@googlegroups.com

Hi vin69110,

 

The Flowcell column in your barcode key file needs to match the flowcell part of your file name, and cannot contain an underscore.  So, try changing “AC1JDAACXX_1094” to “C1JDAACXX” (or “C10RAACXX”).
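
A quick way to check this before running the pipeline is sketched below. It is only an illustration, not TASSEL code: it assumes the fastq file name follows the FLOWCELL_LANE_fastq.txt pattern of the example mentioned in this thread (C10RAACXX_4_fastq.txt), and it assumes the Lane column has to match the lane in the file name as well. The function name matching_key_rows and the file name C1JDAACXX_2_fastq.txt are made up for the example.

```python
import re

# Assumed file-name pattern, modelled on the example in this thread
# (C10RAACXX_4_fastq.txt): FLOWCELL_LANE_fastq.txt, optionally gzipped.
FASTQ_NAME = re.compile(r"^(?P<flowcell>[^_]+)_(?P<lane>\d+)_fastq\.txt(\.gz)?$")


def matching_key_rows(fastq_name, key_lines):
    """Return the key-file rows whose Flowcell and Lane match the fastq file name."""
    m = FASTQ_NAME.match(fastq_name)
    if m is None:
        raise ValueError(f"unexpected fastq file name: {fastq_name}")
    rows = []
    for line in key_lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed row
        flowcell, lane = fields[0], fields[1]
        # Assumption: both flowcell and lane must agree with the file name.
        if flowcell == m.group("flowcell") and lane == m.group("lane"):
            rows.append(fields)
    return rows


header = "Flowcell\tLane\tBarcode\tSample\tPlate\tRow\tColumn"
original = [header, "AC1JDAACXX_1094\t2\tCTCC\tUK_1\tPlate1\tA\t1"]
corrected = [header, "C1JDAACXX\t2\tCTCC\tUK_1\tPlate1\tA\t1"]

print(len(matching_key_rows("C1JDAACXX_2_fastq.txt", original)))   # 0
print(len(matching_key_rows("C1JDAACXX_2_fastq.txt", corrected)))  # 1
```

Zero matching rows would correspond to the "Total barcodes found in lane:0" message above.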

 

Best,

Jeff

--
Jeff Glaubitz
Project Manager
Genetic Architecture of Maize and Teosinte
National Science Foundation award 0820619
http://www.panzea.org
Institute for Genomic Diversity
Cornell University
175 Biotechnology Bldg
Ithaca, NY 14853
Phone: 607-255-1386
jcg...@cornell.edu

 
 

vin6...@gmail.com

May 28, 2013, 11:15:27 PM
to tas...@googlegroups.com
Hi Jeff,
Thanks for your reply. I changed the Flowcell column in my barcode file as you suggested and made it consistent with the name of my fastq file, and that appears to have done the trick. Now I'm trying to understand the various error messages thrown by the Java plugins. The out-of-memory message is fine; I can increase the requested RAM. Is the 'String index out of range' error related to that, or does it indicate that something is still amiss in my barcode file? I checked line 32, and nothing is different on that line. I'm interpreting the error to mean that a string of length 32 is the problem.

Your help is much appreciated!

ERROR MESSAGE:
Your system doesn't have enough memory to store the number of sequencesyou specified.  Try using a smaller value for the minimum number of reads.
Catch testBasicPipeline c=0 e=java.lang.StringIndexOutOfBoundsException: String index out of range: 32
java.lang.StringIndexOutOfBoundsException: String index out of range: 32
        at java.lang.String.substring(Unknown Source)
        at net.maizegenetics.gbs.homology.ParseBarcodeRead.findBestBarcode(ParseBarcodeRead.java:259)
        at net.maizegenetics.gbs.homology.ParseBarcodeRead.parseReadIntoTagAndTaxa(ParseBarcodeRead.java:386)

Fastq file: C1JDAACXX_4_fastq.txt

New barcode file head:

Flowcell        Lane    Barcode Sample  Plate   Row     Column
C1JDAACXX       2       CTCC    UK_1    Plate1  A       1
C1JDAACXX       2       TGCA    UK_16   Plate1  A       2
C1JDAACXX       2       ACTA    UK_32   Plate1  A       3
C1JDAACXX       2       CAGA    UK_55   Plate1  A       4
C1JDAACXX       2       AACT    UK_67   Plate1  A       5
C1JDAACXX       2       GCGT    UK_92   Plate1  A       6
C1JDAACXX       2       TGCGA   UK_133  Plate1  A       7
C1JDAACXX       2       CGAT    H_2485  Plate1  A       8
C1JDAACXX       2       CGCTT   H_2322  Plate1  A       9
C1JDAACXX       2       TCACC   H_2552  Plate1  A       10
C1JDAACXX       2       CTAGC   Climax  Plate1  A       11
C1JDAACXX       2       ACAAA   Aberdeen        Plate1  A       12
C1JDAACXX       2       TTCTC   UK_2    Plate1  B       1
C1JDAACXX       2       AGCCC   UK_17   Plate1  B       2
C1JDAACXX       2       GTATT   UK_34   Plate1  B       3

Jeff Glaubitz

May 30, 2013, 6:11:40 PM
to tas...@googlegroups.com

Hi vin69110,

 

I don’t think that the memory error message is related, but you could try increasing the memory and see if the StringIndexOutOfBoundsException goes away.

 

Looking at where the index out of range error is being thrown in the code, I suspect that your fastq file is not being read properly for some reason.  We divide each sequence read into 2 chunks of 32 bases, storing each chunk as 64 bits (two bits for each nucleotide, where 00=A, 01=C, 10=G, and 11=T).  That is where the 32 in “String index out of range: 32” comes from.  If it throws this error, then it is probably reading something other than the sequence read from the file, a string that is less than 32 bases long, so String.substring(0,32) fails.  Your sequences are 100 bases long, so it is not because they are too short.  This part of the code is trying to figure out the barcode, so it only uses the first 32 bases.
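
The packing described above can be illustrated with a short Python sketch. This is not the TASSEL source: the 2-bit codes are the ones given in this post, but the bit order (first base in the highest bits) and the names pack_chunk and pack_read are assumptions made for the example. The IndexError stands in for what Java's String.substring(0,32) does when the string is shorter than 32 characters.

```python
# Two bits per base, as stated in the post: 00=A, 01=C, 10=G, 11=T.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CHUNK = 32  # bases per 64-bit chunk


def pack_chunk(seq):
    """Pack the first 32 bases of seq into one 64-bit integer.

    Assumption: the first base ends up in the highest two bits.
    """
    if len(seq) < CHUNK:
        # Analogous to substring(0, 32) failing on a string shorter than 32.
        raise IndexError(f"String index out of range: {CHUNK}")
    value = 0
    for base in seq[:CHUNK]:
        # A KeyError here means the chunk contains N or a stray character.
        value = (value << 2) | BASE_BITS[base]
    return value


def pack_read(seq):
    """Pack the first 64 bases of a read as two 32-base chunks."""
    return pack_chunk(seq[:CHUNK]), pack_chunk(seq[CHUNK:2 * CHUNK])


print(pack_chunk("A" * 31 + "T"))         # 3
print(pack_chunk("T" * 32) == 2**64 - 1)  # True
```

A line that is not a full-length read (a stray header, a truncated record, leftover characters from another tool) would fail the length check in the same way.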

 

Maybe the fastq file has Linux line endings and you are trying to run the pipeline on a Mac (or vice versa), or perhaps the file is actually gzipped and should be named *.txt.gz instead of *.txt?  Try:

wc -l  /usda_ars_data/Bassil_Lab/FC1094/lane2/fastq/C10RAACXX_4_fastq.txt

 

And see what you get.  This should tell you how many lines are in the file (=4x the number of reads).
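
If the line count looks off, a slightly more detailed check can help narrow things down. The sketch below is a hypothetical helper (check_fastq is not part of TASSEL) and assumes the usual 4-line fastq record layout with sequences made up of A, C, G, T and N only. It reports carriage returns, malformed header or separator lines, unexpected characters in sequence lines, and a line count that is not a multiple of 4.

```python
import gzip

VALID_BASES = set(b"ACGTN")  # byte values allowed in a sequence line


def check_fastq(path):
    """Return (number_of_complete_records, problems) for a fastq file.

    problems is a list of (line_number, description) tuples.
    """
    opener = gzip.open if path.endswith(".gz") else open
    problems = []
    n_lines = 0
    with opener(path, "rb") as handle:
        for n_lines, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\n")
            if line.endswith(b"\r"):
                problems.append((n_lines, "carriage return at end of line"))
                line = line[:-1]
            position = (n_lines - 1) % 4  # 0=header, 1=sequence, 2=separator, 3=quality
            if position == 0 and not line.startswith(b"@"):
                problems.append((n_lines, "header does not start with @"))
            elif position == 1 and not set(line) <= VALID_BASES:
                problems.append((n_lines, "unexpected character in sequence"))
            elif position == 2 and not line.startswith(b"+"):
                problems.append((n_lines, "separator does not start with +"))
    if n_lines % 4 != 0:
        problems.append((n_lines, "line count is not a multiple of 4"))
    return n_lines // 4, problems


# Usage (path is a placeholder):
# records, problems = check_fastq("C10RAACXX_4_fastq.txt")
# print(records, problems[:10])
```

On a file with the wrong line endings throughout, the problems list will have one entry per line, so printing only the first few entries is usually enough.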

 

Best,

Jeff

vin6...@gmail.com

Jun 3, 2013, 10:45:31 AM
to tas...@googlegroups.com
Hi Jeff,
Thanks for your helpful suggestions and explanations. I did increase the memory parameter and that resolved the memory error. I also found some characters in the fastq file, left over from a filtering operation, that were the likely source of other errors. 

I made a fastq subset with just 10 sequence records, and a barcodes file that only contains the barcodes from the sequences in the subset fastq file. The script runs fine with those inputs.

However, I am still getting errors when I try to run larger subsets of data, and also when I run the small fastq subset with the barcode file that contains all of the barcodes. I noticed that some of my sequences have single 5' Ns. Does this throw off the barcode matching, or is it a fuzzy match? 

Again, your help is much appreciated!

Jeff Glaubitz

Jun 3, 2013, 11:44:44 AM
to tas...@googlegroups.com

Reads with N’s in either the barcode or the subsequent 64 bases are not used by the pipeline.  Barcodes must match exactly.
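
As a rough illustration of that rule (a simplified sketch, not the pipeline's actual code): the function below keeps a read only if it starts with one of the barcodes exactly and the following 64 bases contain no N. The name find_barcode and the longest-barcode-first order are assumptions for the example, and any other checks the pipeline performs after the barcode are left out.

```python
def find_barcode(read, barcodes, tag_length=64):
    """Return (barcode, tag) for an exact barcode match, or None if the read is skipped."""
    # Assumption: try longer barcodes first so a 5-base barcode is not
    # shadowed by a shorter one with the same prefix.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            tag = read[len(barcode):len(barcode) + tag_length]
            if len(tag) < tag_length or "N" in tag:
                return None  # too short, or N in the 64 bases after the barcode
            return barcode, tag
    return None  # no exact match, e.g. an N inside the barcode


barcodes = ["CTCC", "TGCA", "ACTA", "CAGA", "AACT", "GCGT", "TGCGA"]

# A read with a leading N, like the ones posted earlier in this thread.
print(find_barcode("N" + "CTCC" + "ACGT" * 16, barcodes))  # None

# A made-up read that starts with the barcode CTCC and has no N.
print(find_barcode("CTCC" + "ACGT" * 16, barcodes) is not None)  # True
```

Under this rule a read whose first base is N can never match, because the N falls inside the barcode.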

vin6...@gmail.com

Jun 3, 2013, 1:17:51 PM
to tas...@googlegroups.com
Thanks Jeff. If you are seeing this twice, it's because I initially chose reply-to-sender rather than post here. In that case, my apologies.

Given that ~10% of my reads have these single 5' Ns, I think it would be a waste of data to throw them out. Is it true that these cause the script to throw errors? That is apparently what I am seeing. Is there an available script to correct these using the barcodes file, since sequence reads with 5' Ns are not uncommon in Illumina data?

Vin

Jeff Glaubitz

Jun 3, 2013, 4:01:58 PM
to tas...@googlegroups.com

Hi Vin,

 

No, reads with N do not cause the code to throw errors.  It merely skips them (they are counted toward the total number of reads).  They are common at the beginning and end of the fastq files (edge effect on the flowcell).

 

There is no script available to correct single 5’ Ns.
