Help with GBS pipeline Tassel 4.0 (-TagsToSNPByAlignmentPlugin)

574 views
Skip to first unread message

afko...@gmail.com

unread,
Feb 12, 2013, 2:39:42 PM2/12/13
to tas...@googlegroups.com
Hello,

I'm attempting to do some genotyping on a set of samples.  The samples were set up on 20 96-well plates, with each barcode appearing in two different wells of each plate.  Half of the plates were sequenced in one illumina lane (D17N9ACXX:1), and the other half were in another (C0T7VACXX:2).  I'm running Tassel 4.0 on a 64 bit Linux server.

I've successfully (I think) taken the data through each step of the pipeline up through TagsToSNPByAlignmentPlugin.
I've been following the documentation as closely as possible, including keeping my file and directory names the same as those in http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf

However, when I got to MergeDuplicateSNPsPlugin I got an error (see below).

Note: My reference genome is a bunch of scaffolds, so I used those as the chromosomes. There aren't 4942 scaffolds, there are 1507, but the numbers range from 1 to 4942 (hence -s 1 -e 4942), so some of the input files contain only a header line and nothing else.  However, I tried the plugin on a single file that does have data in it and got the same error.

I called the plugin as follows:
 ~/Software/tassel4.0_standalone/run_pipeline.pl -Xmx24g -fork1 -MergeDuplicateSNPsPlugin -hmp hapmap/unfilt/mergedTBT.c+.LocusLog.txt -o hapmap/mergedSNPs/myStudy.mergedSNPs.c+.hmp.txt -s 1 -e 4942 -endPlugin -runfork1

...and got a series of these

java.lang.IllegalStateException: Problem with line in file: 304
        at net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.java:105)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: NucleotideAlignmentConstants: getNucleotideDiploidByte: unknown allele value: 1
        at net.maizegenetics.pal.alignment.NucleotideAlignmentConstants.getNucleotideDiploidByte(NucleotideAlignmentConstants.java:239)
        at net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.java:101)
        ... 3 more
java.lang.IllegalStateException: Problem with line in file: 303
        at net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.java:105)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: NucleotideAlignmentConstants: getNucleotideDiploidByte: unknown allele value: 1
        at net.maizegenetics.pal.alignment.NucleotideAlignmentConstants.getNucleotideDiploidByte(NucleotideAlignmentConstants.java:239)
        at net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.java:101)
        ... 3 more

I think the problem may be something wrong in my outputs from TagsToSNPByAlignmentPlugin.  In each file, the 'status' field is always either "invariant" or "tooFewTaxa".  I gather this is because the "nTaxaCovered" field is always much lower "minTaxaCovered".  In fact, it maxes out at exactly half of minTaxaCovered, which makes me think that my key file might be set up incorrectly.  I've attached it in case it might be helpful to have a look. 

I called the TagsToSNPByAlignmentPlugin as follows:

~/Software/tassel4.0_standalone/run_pipeline.pl -fork1 -TagsToSNPByAlignmentPlugin -i mergedTBT/myStudy.tbt.bin -m topm/myMasterTags.topm.bin -o hapmap/unfilt -s 1 -e 4942 -mnF 0.9  -endPlugin -runfork1

The outputs appeared in hapmap/unfilt as a series of files all named  mergedTBT.c(1 to 4942).LocusLog.txt

I realize it might be hard to diagnose the problem with this information, so if there's anything else I can send along that will help please let me know.
Any help or advice would be much appreciated.

Thanks,

Alex
Key_file.txt

Robert Elshire

unread,
Feb 12, 2013, 2:46:02 PM2/12/13
to tas...@googlegroups.com
Hello Alex,

We are in the process of migrating the GBS code to TASSEL 4. This is not done
yet. Once it is we will put an announcement on the TASSEL page on
maizegenetics.net. Until then please use TASSEL 3 for the GBS Pipeline.

Best regards,

Rob Elshire
> 08) at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.IllegalArgumentException:
> NucleotideAlignmentConstants: getNucleotideDiploidByte: unknown allele
> value: 1
> at
> net.maizegenetics.pal.alignment.NucleotideAlignmentConstants.getNucleotideDi
> ploidByte(NucleotideAlignmentConstants.java:239) at
> net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.
> java:101) ... 3 more
> java.lang.IllegalStateException: Problem with line in file: 303
> at
> net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.
> java:105) at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
> va:886) at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
> 08) at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.IllegalArgumentException:
> NucleotideAlignmentConstants: getNucleotideDiploidByte: unknown allele
> value: 1
> at
> net.maizegenetics.pal.alignment.NucleotideAlignmentConstants.getNucleotideDi
> ploidByte(NucleotideAlignmentConstants.java:239) at
> net.maizegenetics.pal.alignment.ProcessLineOfHapmap.run(ProcessLineOfHapmap.

afko...@gmail.com

unread,
Feb 13, 2013, 9:30:20 AM2/13/13
to tas...@googlegroups.com
Thanks Rob,

Unfortunately, switching to TASSEL 3 doesn't seem to have solved the problem.
I went back through the pipeline, using TASSEL 3 instead of TASSEL 4 for each step.  Now, however, I'm getting no output at all from the TagsToSNPByAlignmentPlugin.

I called the plugin exactly as before (see original post) with the exception of calling tassel 3.0 instead of 4.0.  

I used the  BinaryToTextPlugin to confirm that I have output for previous steps of the pipeline, but hapmap/unfilt is empty.

In the case of each chromosome, I get this text as the TagsToSNPByAlignmentPlugin is running :

[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin -

Processing chromosome 4941...
[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin - Creating Mutable Alignment to hold genotypes for chr4941 (maximum number of sites = 200000)
[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin - minTaxaWithLocus:192 MinF:0.900000 MinMAF:0.0100000 MinMAC:10

[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin - includeRare:false includeGaps:false

[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin - Number of marker sites recorded for chr4941: 0
[Thread-0] INFO net.maizegenetics.gbs.pipeline.TagsToSNPByAlignmentPlugin - Finished processing chromosome 4941

So I think the problem is that it's not recording any marker sites for any of the chromosomes, but I'm not sure why that would be the case.

In case it helps:
Contents of : /topm/myMasterTags.topm.bin.log
Total 13266047 tags
        6259816 were aligned to unique postions
        2375016 were aligned to multiple postions
        4631215 could not be aligned.

nPositions  nTags
0           4631215
1           6259816
99          2375016

First ten rows of the first 6 columns of my tbt file (tbt/C0T7VACXX_2.tbt.byte.txt):
8817537 2       960
imswc221:C0T7VACXX:2:Plate 3A:E04       imswc221:C0T7VACXX:2:Plate 3A:E10       imswc269:C0T7VACXX:2:Plate 3B:E04       imswc269:C0T7VACXX:2:Plate 3B:E10       imswc413:C0T7VACXX:2:Plate 5A:E04       imswc413:C0T7VACXX:2:Plate 5A:E10
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAGGTGTGAAATTATAAAAAAAAACCCCCTTATA        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAATTATGAGTGACTTTTTGAAAATATTATTATAA        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAATTTGGGGATTT        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAGGAAAAAAAAACAAATAAAAA        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAATCTTTTTTTGTATGGGGTAAA        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAGAAAAAAAATTCATCATCATTC        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAGACCCCCCCCAGCCCCCCGGGC        64      0       0       0       0
TAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAGGAAAATGAAAAAAAATAGGAA        64      0       0       0       0

Sorry if any of that is not useful information.  Is there other information I could provide that might help?

Thanks again,

Alex

Robert Elshire

unread,
Feb 13, 2013, 10:36:26 AM2/13/13
to tas...@googlegroups.com, afko...@gmail.com
Hi Alex,

I think the issue could, in fact, be the key file. The column that you have
labelled "PlateID" is now used by us for a unique sample prep id. I suspect
that this is the cause of your problems. There are many samples with the same
PlateID. Try either removing that column or making each entry in that column
unique.

I have an additional suggestion for you and others working with incomplete
assemblies. This is idea comes from Katie Hyma who does a lot of these types
of analyses. Concatonate down to a smaller number of files and pad the junction
with 100 Ns. In your case you might want to have 10 files with ~50 contigs in
each one. This reduces overhead and speeds up the process.

Please let me know how it goes.

Best,

Rob

Jeff Glaubitz

unread,
Feb 13, 2013, 10:48:32 AM2/13/13
to tas...@googlegroups.com, afko...@gmail.com
>Try either removing that column or making each entry in that column unique.

Alternatively, bump over the column to the right by one. Replace it with a "filler" column that does not contain integers. If our code finds integers in that particular column, it interprets them as "libraryPrepIDs" (=sample prep id) which it then uses as part of the sample name instead of the plate name and well. Sorry, this is not documented yet.

Best,

Jeff

--
Jeff Glaubitz
Project Manager
Genetic Architecture of Maize and Teosinte
National Science Foundation award 0820619
http://www.panzea.org
Institute for Genomic Diversity
Cornell University
175 Biotechnology Bldg
Ithaca, NY 14853
Phone: 607-255-1386
jcg...@cornell.edu
--
You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.




afko...@gmail.com

unread,
Feb 13, 2013, 10:58:21 AM2/13/13
to tas...@googlegroups.com, afko...@gmail.com
Rob and Jeff,

Thanks so much for your advice. That 8th column wasn't strictly necessary for me, so I just went ahead and cut it out entirely.  I'm re-running the pipeline now with the new key file.  I'll post again when it's finished.

Thanks,

Alex

afko...@gmail.com

unread,
Feb 13, 2013, 12:58:56 PM2/13/13
to tas...@googlegroups.com
Hello again,

My run with the extra column removed is still going, but it's reached he TagsToSNPByAlignmentPlugin stage, and as before it seems to be giving me no output.  Looking into some of the outputs of the earlier stages a bit more closely however, I think the problem is occurring before that step.

I looked at the log files from FastqToTBTPlugin
C0T7VACXX_2.tbt.byte.log and
D17N9ACXX_1.tbt.byte.log

For some reason, in each log file, I appear to have results for only one of the plates.  In D17N9ACXX_1.tbt.byte.log it's Plate 8B, and in C0T7VACXX_2.tbt.byte.log it's plate 9B (example of the first few lines below).  I can't think of anything special about those particular plates.

Total reads: 173020275
Accepted reads (with barcode and cut site): 110080839(0.6362309 of total)
Accepted reads found in TOPM: 110080839(0.6362309 of total)
name    read count      fraction of total       mapped read count       fraction mapped of total
imswc616:C0T7VACXX:2:Plate 7A:H05       0       0.0     0       0.0
imswcF1:C0T7VACXX:2:Plate 5A:D11        0       0.0     0       0.0
imswc826:C0T7VACXX:2:Plate 9B:B02       0       0.0     0       0.0
imswc438:C0T7VACXX:2:Plate 5B:F01       0       0.0     0       0.0
imswc826:C0T7VACXX:2:Plate 9B:B08       1102791 0.0063737677    1102791 0.0063737677
imswc501:C0T7VACXX:2:Plate 6A:E09       0       0.0     0       0.0
imswc595:C0T7VACXX:2:Plate 7A:C03       0       0.0     0       0.0
imswc208:C0T7VACXX:2:Plate 3A:H08       0       0.0     0       0.0
imswc596:C0T7VACXX:2:Plate 7A:D03       0       0.0     0       0.0
imswc650:C0T7VACXX:2:Plate 7B:B10       0       0.0     0       0.0
imswc817:C0T7VACXX:2:Plate 9B:A07       587262  0.0033941802    587262  0.0033941802
imswc538:C0T7VACXX:2:Plate 6B:B02       0       0.0     0       0.0
imswc463:C0T7VACXX:2:Plate 5B:G04       0       0.0     0       0.0
imswc497:C0T7VACXX:2:Plate 6A:A03       0       0.0     0       0.0
imswc596:C0T7VACXX:2:Plate 7A:D09       0       0.0     0       0.0
imswc566:C0T7VACXX:2:Plate 6B:F05       0       0.0     0       0.0
imswc612:C0T7VACXX:2:Plate 7A:D05       0       0.0     0       0.0
imswc216:C0T7VACXX:2:Plate 3A:H09       0       0.0     0       0.0
imswc829:C0T7VACXX:2:Plate 9B:E02       0       0.0     0       0.0
imswc538:C0T7VACXX:2:Plate 6B:B08       0       0.0     0       0.0
imswc829:C0T7VACXX:2:Plate 9B:E08       1926501 0.011134539     1926501 0.011134539

Is this because the same barcodes are used across plates?  If I recall correctly, the documentation says to start one input file per flowcell lane.  Do I actually need to start with one fastq file per plate, such that within each input fastq file one barcode corresponds to only one sample?  Or might the problem be that my key file is set up incorrectly in some way?

Robert Elshire

unread,
Feb 13, 2013, 1:11:20 PM2/13/13
to tas...@googlegroups.com, afko...@gmail.com
Hi Alex,

Are those spaces that I see in the plate name?

Plate 7A for example?

If it does, this is your problem. I just looked up this issue in the docs. See
page 24. To be fair it may not be immediately clear that your plate column is
part of the name, but it is. There should never be spaces in any of your text
in a key file.

Best,

Rob

afko...@gmail.com

unread,
Feb 13, 2013, 2:11:25 PM2/13/13
to tas...@googlegroups.com
Gotcha.  I recall the documentation saying that the sample column shouldn't have spaces.  I guess I should have realized that that would apply to the other columns as well.  Giving it another go with a space-free key file.  Thanks again for all of your help.  I'll let you know how this run turns out.

Alex

Jeff Glaubitz

unread,
Feb 13, 2013, 3:54:34 PM2/13/13
to tas...@googlegroups.com, afko...@gmail.com

Hi Alex,

 

The spaces in the plate names are probably an issue (in need of clarification in the documentation), especially once you have HapMap genotype files.  However, there might be a bigger issue:

 

> Do I actually need to start with one fastq file per plate, such that within each input fastq file one barcode corresponds to only one sample?

 

Yes!!  How else could the software know which sample was which?  You can have multiple plates within a lane/fastq file, but the barcodes have to be unique for each sample.  For example, you can combine four 96 well plates (384 unique samples in total, including blanks) into a single well, but you need 384 distinct bar codes to distinguish them.

 

Best,

 

Jeff

 

--

Jeff Glaubitz

Project Manager

Genetic Architecture of Maize and Teosinte

National Science Foundation award 0820619

http://www.panzea.org

Institute for Genomic Diversity

Cornell University

175 Biotechnology Bldg

Ithaca, NY 14853

Phone: 607-255-1386

jcg...@cornell.edu

 

-----Original Message-----
From: tas...@googlegroups.com [mailto:tas...@googlegroups.com] On Behalf Of Robert Elshire
Sent: Wednesday, February 13, 2013 1:11 PM
To: tas...@googlegroups.com
Cc: afko...@gmail.com

--

afko...@gmail.com

unread,
Feb 13, 2013, 4:18:45 PM2/13/13
to tas...@googlegroups.com, afko...@gmail.com
That makes sense.  After I said it I realized it must be the case, so in my latest run (with the spaces removed from the key file), I split things up so that instead of 2 fastq files (one for each lane):

D17N9ACXX_1_fastq.txt
C0T7VACXX_2_fastq.txt

...to 20 fastq files (one for each plate).  I had deliberately combined them before, (so that I'd have one file for each lane) which in retrospect, was.... not smart. 

C0T7VACXX_Plate3A_2_fastq.txt  C0T7VACXX_Plate6A_2_fastq.txt  C0T7VACXX_Plate9A_2_fastq.txt   D17N9ACXX_Plate1A_1_fastq.txt  D17N9ACXX_Plate4A_1_fastq.txt
C0T7VACXX_Plate3B_2_fastq.txt  C0T7VACXX_Plate6B_2_fastq.txt  C0T7VACXX_Plate9B_2_fastq.txt   D17N9ACXX_Plate1B_1_fastq.txt  D17N9ACXX_Plate4B_1_fastq.txt
C0T7VACXX_Plate5A_2_fastq.txt  C0T7VACXX_Plate7A_2_fastq.txt  D17N9ACXX_Plate10A_1_fastq.txt  D17N9ACXX_Plate2A_1_fastq.txt  D17N9ACXX_Plate8A_1_fastq.txt
C0T7VACXX_Plate5B_2_fastq.txt  C0T7VACXX_Plate7B_2_fastq.txt  D17N9ACXX_Plate10B_1_fastq.txt  D17N9ACXX_Plate2B_1_fastq.txt  D17N9ACXX_Plate8B_1_fastq.txt

Come to think of it though, I'm not sure if that will be enough?    
Each of my 48 barcodes appears in two different wells of each plate so that each sample is replicated twice, (for example, sample imswc001, with barcode GCTCGG is in both wells A01 and A07 of Plate1A ).  
My key file therefore has two rows for that sample:

D17N9ACXX       1       GCTCGG  imswc001        Plate1A A       01
D17N9ACXX       1       GCTCGG  imswc001        Plate1A A       07

Do I need to split my key file in half and run the pipeline twice?

afko...@gmail.com

unread,
Feb 13, 2013, 5:24:25 PM2/13/13
to tas...@googlegroups.com
I think I'm still missing something important.  
I tried the run described in the previous post, with the input files split so that there was one fastq file per plate (and no spaces in the key file), but I got pretty much the same result as before (i.e. the tbt log file for Plate4B only had non-zero entries for Plate 8B).  

How does the software know which files correspond to which entries in the key file if the barcodes match?  Is it only the flowcell and the lane?  If so, how does one handle this situation where I have ten plates in each lane, each of which re-uses the same barcodes?  I thought maybe the _s_ in the filename would do the trick if s corresponded to the plate name in the key file but it seems not.  

Again my apologies if the answer should be obvious.

Jeff Glaubitz

unread,
Feb 13, 2013, 5:43:10 PM2/13/13
to tas...@googlegroups.com

Hi Alex,

 

>How does the software know which files correspond to which entries in the key file if the barcodes match?  Is it only the flowcell and the lane?

Each sample is identified as a unique combination of flowcell, lane, and barcode.  You can’t distinguish two or more samples with the same barcode in the same flowcell lane.

 

>If so, how does one handle this situation where I have ten plates in each lane, each of which re-uses the same barcodes?

This situation cannot be handled, unless the ten plates have identical layouts in terms of the samples & barcodes (i.e., they are replicas of the same plate).  If they are ten different plates, then a sequence within a fastq file from a particular flowcell lane with a given barcode could potentially come from 10 different samples, and we have no way of knowing which one it is.

 

Best,

 

Jeff

 

 

From: tas...@googlegroups.com [mailto:tas...@googlegroups.com] On Behalf Of afko...@gmail.com
Sent: Wednesday, February 13, 2013 5:24 PM
To: tas...@googlegroups.com
Subject: [TASSEL-Group] Re: Help with GBS pipeline Tassel 4.0 (-TagsToSNPByAlignmentPlugin)

 

I think I'm still missing something important.  

--

You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msg/tassel/-/eGOvz21mZcIJ.

Robert Elshire

unread,
Feb 13, 2013, 5:29:47 PM2/13/13
to tas...@googlegroups.com
Alex,

From what I can tell from your emails, you had 2 lanes of sequence data. Is this right? You put more than one plate in each lane. Is this right? How many plates did you put in one lane? For each bar code in a lane how many samples had that bar code?

Rob

Sent: Wednesday, February 13, 2013 5:24 PM
To: tas...@googlegroups.com
Subject: [TASSEL-Group] Re: Help with GBS pipeline Tassel 4.0 (-TagsToSNPByAlignmentPlugin)
--

afko...@gmail.com

unread,
Feb 14, 2013, 9:52:56 AM2/14/13
to tas...@googlegroups.com
Hi Rob and Jeff,

There were two lanes of sequencing, with 10 plates in each lane.  The files I received were already split up by plate, so at some point there must have been barcodes that identified the plates, in addition to the barcodes I'm using which just identify the samples within each plate.

I'm in communication with the person whose data this is to see if I can get the original files with the full barcodes.  If I can, then I can just do a new key file with the full length barcodes, and then every flowcell/lane/barcode combination should be unique to one sample.  

Sorry for the confusion.  I'll update again once I've sorted this out.
Thanks for all of your help, and for your patience.

Alex

afko...@gmail.com

unread,
Feb 20, 2013, 10:42:34 AM2/20/13
to tas...@googlegroups.com
I think everything is straightened out now with my inputs.  I was able to re-tag my reads with the plate-specific barcodes, such that every sample in each lane now has a unique barcode.  The whole pipeline appears to have run smoothly, and I have filtered hapmap files for each chromosome.

Thanks so much for all of your help,

Alex
Reply all
Reply to author
Forward
0 new messages