Error with setupBarcodeFiles

vin6...@gmail.com

May 26, 2013, 3:59:59 PM

to tas...@googlegroups.com

Hello,
I'm learning about the TASSEL pipeline and testing the first step, FastqToTagCountPlugin. I'm getting this error:

Error with setupBarcodeFiles: java.lang.ArrayIndexOutOfBoundsException: 0
Total barcodes found in lane:0
Total barcodes found in lane:0
No barcodes found. Skipping this flowcell lane.

I've confirmed that my barcode file is formatted appropriately (pasted below) and that my fastq file name is in one of the accepted formats (the first few sequence records are pasted below). I even renamed the file to resemble the example in the documentation (C10RAACXX_4_fastq.txt). Since this is my first attempt to run this script, I'm not sure what else to check.

Any help to get me going on this first step will be much appreciated!

COMMAND:
/local/cluster/hts/gbs/tassel3.0_standalone/run_pipeline.pl -fork1 -FastqToTagCountPlugin -i /usda_ars_data/Bassil_Lab/FC1094/lane2/fastq -k /usda_ars_data/Bassil_Lab/FC1094/lane2/FRA_bar_1st6.txt -e ApeKI -o tagCounts -endPlugin -runfork1

BARCODE FILE:
Flowcell        Lane    Barcode Sample Plate   Row     Column
AC1JDAACXX_1094 2       CTCC    UK_1    Plate1 A       1
AC1JDAACXX_1094 2       TGCA    UK_16   Plate1 A       2
AC1JDAACXX_1094 2       ACTA    UK_32   Plate1 A       3
AC1JDAACXX_1094 2       CAGA    UK_55   Plate1 A       4
AC1JDAACXX_1094 2       AACT    UK_67   Plate1 A       5
AC1JDAACXX_1094 2       GCGT    UK_92   Plate1 A       6

FASTQ FILE:
@DB775P1:229:C1JDAACXX:2:1101:1445:1923 1:N:0:
NCTGTGACTGCCCGCATTGTTGACGACATTGTTGAATCCCTTGTCCACGGAAGGGTTGCTGTTGGTCGCACCTCCGCAGGCACTGCTCGTATCTCTGTTCA
+
#11AD?AAHHAFDEGBGGG?DBAFHCGG<DDBDEBDGDB?BF>D9?CGII'55;B(=B9?@?@CC=;/9'88(<235-0<<ABC9@@48)298((:4@A@3
@DB775P1:229:C1JDAACXX:2:1101:1306:1923 1:N:0:
NTAACAGCAACCGGAGCCTTCGCCTGAAGAGGANGCCCTCAAGAGGAACACCGACTGCGTCTACTTCCTCGCCTCCCCCGTCTCCTGCAAAAAGGTTCCGC
+
#1=DFFFFHHHHHJJJJJJJJGJJJJJJFHHIJ#0?FHIJJIJGIJIIIJGHFFFCDED@DDBDEDDDDDBDDDDABDD@@DBDD@ACCC>AB@DCDDDBB
@DB775P1:229:C1JDAACXX:2:1101:1473:1929 1:N:0:
NTCAGACAGCATTTCGAGTAACTCCTCAACCTGGAGTTCCGCCTGAGGAAGCAGGGGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
+
#1:BDDDBDBDBDBEEEE:C:A<FEDFD@FEEII@CEDFD@DDDDIDIAB=8=CCDIAAACCADD>@@@??>(3=2'5>A>A<<>?AAAA?##########
@DB775P1:229:C1JDAACXX:2:1101:1418:1930 1:N:0:
NAGATACTGCGCCGACTCTCCGCGGAGCTGGACGATCTTGGAGTCGTCGCGGATGACGGCGATGGGGCCGCCGAAGGGGGCGCAGGCGACCTTGTTGCGGC
+
#1=DDFFFGHHGHIJJIJJJJGGIGHHICHIIJJDFHFFFFFD@CCABB@DDB@DACBDDD<7BBBBB@BDB>5<<@BDB599<>B9>B>B<@CA@BCBB@

Jeff Glaubitz

May 28, 2013, 11:21:59 AM

to tas...@googlegroups.com

Hi vin69110,

The Flowcell column in your barcode key file needs to match the flowcell part of your file name, and cannot contain an underscore. So, try changing “AC1JDAACXX_1094” to “C1JDAACXX” (or “C10RAACXX”).
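The check described above can be sketched in a few lines of Python before running the plugin. This is only an illustration, not part of TASSEL: it assumes the FASTQ name follows the FLOWCELL_LANE_fastq.txt pattern of the example in this thread, that the key file is whitespace-delimited with Flowcell as the first column, and the function name `check_key_flowcells` is hypothetical.

```python
import re

# Assumed file-name pattern, based on the example C10RAACXX_4_fastq.txt
# mentioned in this thread; other accepted name formats are not covered.
FASTQ_NAME = re.compile(r"^(?P<flowcell>[^_]+)_(?P<lane>[^_]+)_fastq\.txt(\.gz)?$")


def check_key_flowcells(key_lines, fastq_name):
    """Return a list of problems found in the Flowcell column of a key file.

    key_lines: iterable of lines from the barcode key file (header first).
    fastq_name: base name of the FASTQ file, e.g. 'C1JDAACXX_4_fastq.txt'.
    """
    match = FASTQ_NAME.match(fastq_name)
    if match is None:
        return [f"file name {fastq_name!r} does not fit FLOWCELL_LANE_fastq.txt"]
    expected = match.group("flowcell")

    problems = []
    rows = [line.split() for line in key_lines if line.strip()]
    for row_number, fields in enumerate(rows[1:], start=2):  # skip the header
        flowcell = fields[0]
        if "_" in flowcell:
            problems.append(f"row {row_number}: flowcell {flowcell!r} contains an underscore")
        if flowcell != expected:
            problems.append(f"row {row_number}: flowcell {flowcell!r} != {expected!r} from file name")
    return problems


if __name__ == "__main__":
    key = [
        "Flowcell Lane Barcode Sample Plate Row Column",
        "AC1JDAACXX_1094 2 CTCC UK_1 Plate1 A 1",
    ]
    for problem in check_key_flowcells(key, "C1JDAACXX_4_fastq.txt"):
        print(problem)
```

With the original key file row, the sketch reports both the underscore and the mismatch with the flowcell taken from the file name.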

Best,

Jeff

--

Jeff Glaubitz

Project Manager

Genetic Architecture of Maize and Teosinte

National Science Foundation award 0820619

http://www.panzea.org

Institute for Genomic Diversity

Cornell University

175 Biotechnology Bldg

Ithaca, NY 14853

Phone: 607-255-1386

jcg...@cornell.edu

--
You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tassel/74c6b360-11ee-45df-8633-e21e4d094728%40googlegroups.com?hl=en-US.
For more options, visit https://groups.google.com/groups/opt_out.

vin6...@gmail.com

May 28, 2013, 11:15:27 PM

to tas...@googlegroups.com

Hi Jeff,

Thanks for your reply. I changed the Flowcell column in my barcode file as you suggested, making it consistent with the name of my fastq file, and that looks like it did the trick. Now I'm trying to understand the various error messages thrown by the Java plugins. The out-of-memory message is fine; I can increase the requested RAM. Is the 'String index out of range' error related to that, or does it indicate that something is still amiss in my barcode file? I checked line 32, and nothing looks different there. I'm interpreting the error to mean that the problem is a string of length 32.

Your help is much appreciated!

ERROR MESSAGE:

Your system doesn't have enough memory to store the number of sequencesyou specified. Try using a smaller value for the minimum number of reads.

Catch testBasicPipeline c=0 e=java.lang.StringIndexOutOfBoundsException: String index out of range: 32
java.lang.StringIndexOutOfBoundsException: String index out of range: 32
    at java.lang.String.substring(Unknown Source)
    at net.maizegenetics.gbs.homology.ParseBarcodeRead.findBestBarcode(ParseBarcodeRead.java:259)
    at net.maizegenetics.gbs.homology.ParseBarcodeRead.parseReadIntoTagAndTaxa(ParseBarcodeRead.java:386)

Fastq file: C1JDAACXX_4_fastq.txt

New barcode file head:

Flowcell   Lane  Barcode  Sample    Plate   Row  Column
C1JDAACXX  2     CTCC     UK_1      Plate1  A    1
C1JDAACXX  2     TGCA     UK_16     Plate1  A    2
C1JDAACXX  2     ACTA     UK_32     Plate1  A    3
C1JDAACXX  2     CAGA     UK_55     Plate1  A    4
C1JDAACXX  2     AACT     UK_67     Plate1  A    5
C1JDAACXX  2     GCGT     UK_92     Plate1  A    6
C1JDAACXX  2     TGCGA    UK_133    Plate1  A    7
C1JDAACXX  2     CGAT     H_2485    Plate1  A    8
C1JDAACXX  2     CGCTT    H_2322    Plate1  A    9
C1JDAACXX  2     TCACC    H_2552    Plate1  A    10
C1JDAACXX  2     CTAGC    Climax    Plate1  A    11
C1JDAACXX  2     ACAAA    Aberdeen  Plate1  A    12
C1JDAACXX  2     TTCTC    UK_2      Plate1  B    1
C1JDAACXX  2     AGCCC    UK_17     Plate1  B    2
C1JDAACXX  2     GTATT    UK_34     Plate1  B    3

Jeff Glaubitz

May 30, 2013, 6:11:40 PM

to tas...@googlegroups.com

Hi vin69110,

I don’t think that the memory error message is related, but you could try increasing the memory and see if the StringIndexOutOfBoundsException goes away.

Looking at where the index out of range error is being thrown in the code, I suspect that your fastq file is not being read properly for some reason. We divide each sequence read into 2 chunks of 32 bases, storing each chunk as 64 bits (two bits for each nucleotide, where 00=A, 01=C, 10=G, and 11=T). That is where the 32 in “String index out of range: 32” comes from. If it throws this error, then it is probably reading something other than the sequence read from the file, a string that is less than 32 bases long, so String.substring(0,32) fails. Your sequences are 100 bases long, so it is not because they are too short. This part of the code is trying to figure out the barcode, so it only uses the first 32 bases.
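The 2-bit packing described above can be illustrated with a short Python sketch. It is not the TASSEL code: the names `pack_chunk` and `pack_read` are hypothetical, barcode removal is ignored, and an explicit length check stands in for the exception that Java's `String.substring` raises, because Python slicing silently returns a shorter string.

```python
# 2-bit codes as described in the post: 00=A, 01=C, 10=G, 11=T.
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CHUNK = 32  # bases per 64-bit chunk


def pack_chunk(bases):
    """Pack exactly 32 bases into one unsigned 64-bit integer.

    Raises ValueError for a wrong length or a non-ACGT character (e.g. N).
    """
    if len(bases) != CHUNK:
        raise ValueError(f"expected {CHUNK} bases, got {len(bases)}")
    value = 0
    for base in bases:
        if base not in BASE_CODES:
            raise ValueError(f"cannot encode {base!r}")
        value = (value << 2) | BASE_CODES[base]
    return value


def pack_read(read):
    """Pack the first 64 bases of a read as two 32-base chunks."""
    return pack_chunk(read[:CHUNK]), pack_chunk(read[CHUNK:2 * CHUNK])
```

A line shorter than 32 characters (a stray header fragment or leftover junk, for instance) fails the length check here in the same way that it would trigger the out-of-range error in the pipeline.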

Maybe the fastq file has Linux line endings and you are trying to run the pipeline on a Mac (or vice versa), or perhaps it should be named *.txt.gz instead of *.txt? Try:

wc -l /usda_ars_data/Bassil_Lab/FC1094/lane2/fastq/C10RAACXX_4_fastq.txt

And see what you get. This should tell you how many lines are in the file (=4x the number of reads).
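Beyond the line count, a slightly more thorough check could look like the Python sketch below. It is a rough illustration under stated assumptions: four lines per record, sequences limited to A, C, G, T and N, and a minimum length of 32 taken from the error message. The function name `scan_fastq_lines` is hypothetical.

```python
def scan_fastq_lines(lines, min_len=32):
    """Scan FASTQ lines and report records that could break parsing.

    lines: iterable of raw lines (str), line endings included.
    Returns (line_count, problems); problems is a list of (line_number, message).
    """
    problems = []
    count = 0
    for count, raw in enumerate(lines, start=1):
        if raw.endswith("\r\n") or raw.endswith("\r"):
            problems.append((count, "carriage-return line ending"))
        text = raw.rstrip("\r\n")
        position = (count - 1) % 4  # 0=header, 1=sequence, 2='+', 3=quality
        if position == 0 and not text.startswith("@"):
            problems.append((count, "header line does not start with '@'"))
        elif position == 1:
            if len(text) < min_len:
                problems.append((count, f"sequence shorter than {min_len} bases"))
            if set(text) - set("ACGTN"):
                problems.append((count, "unexpected characters in sequence"))
        elif position == 2 and not text.startswith("+"):
            problems.append((count, "separator line does not start with '+'"))
    if count % 4 != 0:
        problems.append((count, "line count is not a multiple of 4"))
    return count, problems
```

When reading from disk, opening the file with `open(path, newline="")` keeps the original line endings so that carriage returns are not translated away before the check sees them.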

Best,

Jeff

vin6...@gmail.com

Jun 3, 2013, 10:45:31 AM

to tas...@googlegroups.com

Hi Jeff,

Thanks for your helpful suggestions and explanations. I did increase the memory parameter and that resolved the memory error. I also found some characters in the fastq file, left over from a filtering operation, that were the likely source of other errors.

I made a fastq subset with just 10 sequence records, and a barcodes file that contains only the barcodes from the sequences in the subset fastq file. The script runs fine with those inputs.

However, I am still getting errors when I try to run larger subsets of data, and also when I run the small fastq subset with the barcode file that contains all of the barcodes. I noticed that some of my sequences have single 5' Ns. Does this throw off the barcode matching, or is it a fuzzy match?

Again, your help is much appreciated!

Jeff Glaubitz

Jun 3, 2013, 11:44:44 AM

to tas...@googlegroups.com

Reads with N’s in either the barcode or the subsequent 64 bases are not used by the pipeline. Barcodes must match exactly.
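To estimate how many reads this rule would drop, the behavior can be approximated in Python as below. This is a simplified sketch, not the pipeline's matching code: it ignores any enzyme cut-site check the real code may perform, trying longer barcodes first is a choice made here for illustration, and the name `match_barcode` is hypothetical.

```python
TAG_LENGTH = 64  # bases after the barcode, per the reply above


def match_barcode(read, barcodes):
    """Return the barcode that exactly prefixes `read`, or None if unusable.

    A read is treated as unusable when no barcode matches exactly, or when
    the barcode plus the following 64 bases contain an N or are incomplete.
    """
    # Try longer barcodes first so that e.g. TGCGA wins over a shorter TGCG.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            window = read[: len(barcode) + TAG_LENGTH]
            if len(window) < len(barcode) + TAG_LENGTH or "N" in window:
                return None
            return barcode
    return None
```

A read that begins with N can never match, since no barcode in the key file starts with N; counting the `None` results over a sample of reads gives a rough idea of the fraction that would be skipped.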

vin6...@gmail.com

Jun 3, 2013, 1:17:51 PM

to tas...@googlegroups.com

Thanks Jeff. If you are seeing this twice, it's because I initially chose reply-to-sender rather than post here. In that case, my apologies.

Given that ~10% of my reads have these single 5' Ns, I think it would be a waste of data to throw them out. Is it true that these cause the script to throw errors? That is apparently what I am seeing. Is there an available script to correct these using the barcodes file, since sequence reads with 5' Ns are not uncommon in Illumina data?

Vin

Jeff Glaubitz

Jun 3, 2013, 4:01:58 PM

to tas...@googlegroups.com

Hi Vin,

No, reads with N do not cause the code to throw errors. It merely skips them (they are counted toward the total number of reads). They are common at the beginning and end of the fastq files (edge effect on the flowcell).

There is no script available to correct single 5’ Ns.
