Stampy aligner and Tassel 5

256 views
Skip to first unread message

Peri Bolton

unread,
Jun 28, 2015, 9:04:07 PM6/28/15
to tas...@googlegroups.com
Dear all,

I am using Tassel 5 to align my GBS reads from a non-model organism to a reference genome of a closely related individual. When I used the bwa aligner on its own I was only able to align 46% of my SNPs to the genome. The reference genome is about 2-5% divergent from my organism.
It was suggested that I attempt to use stampy to align the remainder of my reads, and it vastly improved the number of reads that were aligned to the reference. However, Tassel seems to be unable to read the .sam alignment file made by stampy. The error message suggests that it thinks that it's not a proper SAM file, but it looks like one to me. 

Has any one else encountered this problem, and found a solution? What is it about the .sam file that's different from the one generated from bwa?

Does any one have any other tips for aligning to a divergent genome for use in Tassel?

Cheers,

Peri

Peri Bolton

unread,
Jul 8, 2015, 7:43:47 AM7/8/15
to tas...@googlegroups.com
I have some more information about the errors that occur when I try to run this. 
If I use TASSEL3.0 and the SAMConverterPlugin then I get the following output:

Found 3101552 tags in SAM file.  Assuming BWA file format.


Catch in reading SAM alignment file at tag 0:
        length=64count=14       0       chr1633 705669  67      13M13I38M       *       0       0       TGCAGAAAAAAAAAAAAAAAAAACAACAAAAATCTTCCTTCTCCCAGCAAGCTCATTCCGCAGAffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff        UQ:i:236        NM:i:16
Error: java.lang.NullPointerException


java.lang.NullPointerException
        at net.maizegenetics.gbs.maps.TagsOnPhysicalMap.parseSAMAlignment(TagsOnPhysicalMap.java:666)
        at net.maizegenetics.gbs.maps.TagsOnPhysicalMap.readSAMFile(TagsOnPhysicalMap.java:609)
        at net.maizegenetics.gbs.maps.SAMConverterPlugin.performFunction(SAMConverterPlugin.java:55)
        at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:201)
        at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)


However, when I use Tassel 5 SAMtoGBSdbPlugin, it just gives me thousands of warnings all reading that there is no quality score and will default to zero. However, it then allows me to proceed to the next step the discovery SNPcaller... but I don't think it is actually doing anything. 

[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.DiscoverySNPCallerPluginV2 - Finding SNPs in PATH/t5pl.db.
[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.DiscoverySNPCallerPluginV2 - StartChr:1 EndChr:37095

size of all tags in tag table=1385147
size of all tags in mappingApproach table=1
size of all taxa in taxa table=288
size of all positions in cutPosition table=0
positionTagTaxaMap = 0
[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.DiscoverySNPCallerPluginV2 - Finished processing chromosome 1


size of all positions in snpPosition table=0
positionTagTaxaMap = 0
[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.DiscoverySNPCallerPluginV2 - Finished processing chromosome 2


please help!

Jeff Glaubitz

unread,
Jul 8, 2015, 11:56:47 AM7/8/15
to tas...@googlegroups.com

Hi Peri,

 

The Tassel-GBS pipeline is only tested with SAM output from BWA and bowtie, so may not work with SAM output from other aligners such as stampy.

 

Is each field in the output separated by a tab character (\t)?  Is there a newline (\n) after “chr1633”?  (There shouldn’t be).

 

Best,

 

Jeff

 

--
Jeff Glaubitz
Project Manager
Biology of Rare Alleles in Maize and its Wild Relatives
National Science Foundation award IOS-1238014
http://www.panzea.org
Institute for Genomic Diversity
Cornell University
175 Biotechnology Bldg
Ithaca, NY 14853
Phone: 607-255-1386
jcg...@cornell.edu

--
You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tassel/adfaab4f-ed6b-4311-bc54-073fa397afe4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Peri Bolton

unread,
Jul 8, 2015, 3:32:23 PM7/8/15
to tas...@googlegroups.com
Hi Jeff,

Each field is separated by a tab, and there isn't a new line after the chromosome number. When viewed in Notepad++ each tag is on one line. As far as I can tell, the file matches what a .sam file should look like in every way.


Cheers,

Peri

Lynn Carol Johnson

unread,
Jul 13, 2015, 2:14:25 PM7/13/15
to tas...@googlegroups.com
A couple of thoughts:

For TASSEL-3 GBS, it appears there is an expectation the .sam file will contain the “XO” and “NM” optional fields.  The “NM” (edit distance) field exists in Peri’s example below, but the “XO” field does not.  Perhaps this is the cause of the null pointer exception from Tassel 3. (I don’t see an XO required in bowTie either, just BWA).

For GBSv2:  The error message you are seeing relating to quality score is due to the missing “AS” field.  The messages are annoying (though informative) but should not effect processing.    If you use the debug option with the latest TASSEL-5 build  then send your commands and all the output from SAMToGBSdbPlugin and DiscoverySNPCallerPluginV2 we might be able to give you more help.  

See this link for reporting tassel issues – Thanks.





Lynn Carol Johnson

unread,
Jul 14, 2015, 7:34:40 AM7/14/15
to tas...@googlegroups.com
I have removed the warning messages related to the missing AS field in the GBSv2 pipeline.  This fix will appear in the next build.

Peri Bolton

unread,
Jul 14, 2015, 10:43:11 AM7/14/15
to tas...@googlegroups.com, lc...@cornell.edu
Hi Lynn,

I have run it in the debug mode, but I don't really see what is different about the output... here is what I have 

I tried to call it this way run_pipeline.pl -debug -Xmx40g -fork1 -PLUGIN FILES etc... It didn't like it if I put the -debug option after I specified the plugin.

using SAMToGBSdbPlugin:

/home/../opt/tassel5.0_standalone/lib/xmlParserAPIs.jar:/home/../opt/tassel5.0_standalone/lib/batik-util.jar:/home/../opt/tassel5.0_standalone/lib/commons-codec-1.10.jar:/home/../opt/tassel5.0_standalone/lib/guava-14.0.1.jar:/home/../opt/tassel5.0_standalone/lib/biojava-core-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/batik-awt-util.jar:/home/../opt/tassel5.0_standalone/lib/mail-1.4.jar:/home/../opt/tassel5.0_standalone/lib/batik-gui-util.jar:/home/../opt/tassel5.0_standalone/lib/geronimo-spec-activation-1.0.2-rc4.jar:/home/../opt/tassel5.0_standalone/lib/jfreechart-1.0.3.jar:/home/../opt/tassel5.0_standalone/lib/biojava-alignment-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/slf4j-simple-1.7.10.jar:/home/../opt/tassel5.0_standalone/lib/cisd-jhdf5-batteries_included_lin_win_mac.jar:/home/../opt/tassel5.0_standalone/lib/slf4j-api-1.7.10.jar:/home/../opt/tassel5.0_standalone/lib/junit-4.10.jar:/home/../opt/tassel5.0_standalone/lib/batik-svg-dom.jar:/home/../opt/tassel5.0_standalone/lib/batik-gvt.jar:/home/../opt/tassel5.0_standalone/lib/snappy-java-1.1.1.6.jar:/home/../opt/tassel5.0_standalone/lib/json-simple-1.1.1.jar:/home/../opt/tassel5.0_standalone/lib/commons-math3-3.4.1.jar:/home/../opt/tassel5.0_standalone/lib/colt.jar:/home/../opt/tassel5.0_standalone/lib/log4j-1.2.13.jar:/home/../opt/tassel5.0_standalone/lib/forester.jar:/home/../opt/tassel5.0_standalone/lib/batik-xml.jar:/home/../opt/tassel5.0_standalone/lib/LiuExt.jar:/home/../opt/tassel5.0_standalone/lib/xercesImpl.jar:/home/../opt/tassel5.0_standalone/lib/ejml-0.23.jar:/home/../opt/tassel5.0_standalone/lib/itextpdf-5.1.0.jar:/home/../opt/tassel5.0_standalone/lib/batik-dom.jar:/home/../opt/tassel5.0_standalone/lib/poi-3.0.1-FINAL-20070705.jar:/home/../opt/tassel5.0_standalone/lib/xml.jar:/home/../opt/tassel5.0_standalone/lib/batik-svggen.jar:/home/../opt/tassel5.0_standalone/lib/biojava-phylo-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/sqlite-jdbc-3.8.4.3-SNAPSHOT.jar:/home/../opt/tassel5.0_standalone/lib/batik-css.jar:/home/../opt/tassel5.0_standalone/lib/trove-3.0.3.jar:/home/../opt/tassel5.0_standalone/lib/jcommon-1.0.6.jar:/home/../opt/tassel5.0_standalone/lib/batik-parser.jar:/home/../opt/tassel5.0_standalone/lib/batik-ext.jar:/home/../opt/tassel5.0_standalone/sTASSEL.jar
Memory Settings: -Xms512m -Xmx40g
Tassel Pipeline Arguments: -debug -fork1 -SAMToGBSdbPlugin -i /home/rollins/gouldian_gbs/Ref_genome/T5/align/aligned_bwaStampy.sam -o /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db -endPlugin -runfork1
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.5  Date: March 5, 2015
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 36409 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 1.8.0_31
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -SAMToGBSdbPlugin, -i, /home/rollins/gouldian_gbs/Ref_genome/T5/align/aligned_bwaStampy.sam, -o, /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db, -endPlugin, -runfork1]
net.maizegenetics.analysis.gbs.v2.SAMToGBSdbPlugin
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
SAMToGBSdbPlugin Parameters
i: /home/rollins/gouldian_gbs/Ref_genome/T5/align/aligned_bwaStampy.sam
o: /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db
aProp: 0.0
aLen: 0

size of all tags in tag table=1385147
size of all tags in mappingApproach table=1
size of all taxa in taxa table=288
[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.SAMToGBSdbPlugin - SAMToGBSDbPluginV2: warning: alignmentScore not present in Sam File, defaulting to 0

Unobserved tags were found in the SAM file count= 107780
[pool-1-thread-1] INFO net.maizegenetics.analysis.gbs.v2.SAMToGBSdbPlugin - Finished reading SAM file.  No Tags added to DB as 107780 unobserved tags were found.
Please ensure all tags in the SAM file already exist in the DB.

Closing SQLDB
[pool-1-thread-1] INFO net.maizegenetics.pipeline.TasselPipeline - net.maizegenetics.analysis.gbs.v2.SAMToGBSdbPlugin: progress: 100%



so actually it looks like it's not even reading in the tags...

Peri Bolton

unread,
Jul 14, 2015, 10:45:28 AM7/14/15
to tas...@googlegroups.com, lc...@cornell.edu
What is weird is that there are the same number of tags in each file so I don't know why it says it can't find the tags...

Cheers,

P

Lynn Carol Johnson

unread,
Jul 14, 2015, 11:44:27 AM7/14/15
to Peri Bolton, tas...@googlegroups.com
HI Peri -

The expectation is that you’d run GBSSeqToTagDBPlugin to initially populate the database.  Then run TagExportToFastqPlugin to get a list of unique tags  from the db to be used for your aligner.  The messages below indicate many tags from the .sam file were not found in the database.  The code expects tags to be in the database to ensure we are aligning/processing/ searching for SNPs from a consistent set of tags.  Because there are so many tags not found in the DB, SAMToGBSdbPlugin does not add the alignments to the db tables, and DiscoverySNPCallerPluginV2 has nothing with which to work.

Note: if you are using a database created from an old version of this pipeline, you should re-run the initial pipeline steps, at least from TagExportToFastqPlugin.  TagExportToFastqPlugin was modified in the spring to use the original tag sequence as the seqname field when creating the fastq file.  This allows a tag modified by the aligner (via hard-clipping or other means) to be identified in the database.   The seqname from the fastq input file becomes the QNAME in the .sam output.  Your .sam input file should show a QNAME with a value e.g. "tagSeq=CAGCAAGGTAATGCTACAGAAGCATTGCGTCTGTTTGAGGAGATGAGGAGTGCCAAGGTCATGC "   vs  the old “ length=64count=14”.

The documentation for the GBSv2 pipeline is here:  https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline

Let me know if you have other  questions.

Thanks - Lynn

Peri Bolton

unread,
Jul 16, 2015, 5:36:45 PM7/16/15
to tas...@googlegroups.com, lc...@cornell.edu
Hi Lynn,

But that is odd, because I've run the whole pipeline as one chunk, there are definitely no missing tags between the master tags list (output by TagExportToFastqPlugin) and the number of entries in the .sam file, so why would the database think differently? I can't seem to make sense of the number of tags that it throws, because it's not just the number of unaligned tags, or the number of aligned tags, or even the number of tags that stampy was able to add to the alignment after bwa.

I have version 5.2.5 of Tassel, and so my initial commands made the fastq names length=64count=14. Could it be that because stampy is editing the reads and then it cannot be added back into the database because it can't find that match? I'm not setting stampy to cut anything? but do bwa and stampy do that anyway?

Cheers 
P

Lynn Carol Johnson

unread,
Jul 17, 2015, 7:03:01 AM7/17/15
to Peri Bolton, tas...@googlegroups.com
Hi Peri -

Please try running with the latest TASSEL-5.  BWA often uses hard-clipping when presenting the tag sequences.  This removes part of the sequence resulting in the tag not being found in the DB.  The changes made to TagExportToFastqPlugin and SAMToGBSdbPlugin described below make the original sequence available which enables finding it in the db when the aligner makes changes.  
Reply all
Reply to author
Forward
0 new messages