Discovery and Production SNP caller

Peri Bolton

Nov 2, 2015, 10:33:36 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi TASSEL developers and users,

I have been using the TASSEL 5 GBS v2 pipeline with my GBS reads aligned to the Zebra finch genome. The genome has 70 "chromosomes", only some of which have purely numeric names. I know that the new TASSEL pipeline is technically able to read character strings now, but my final .h5 or .vcf file only has positions for numeric chromosomes 1-28. It expresses the position names as S1_position... However, according to the alignment summary I should have some SNPs on the Unknown chromosome and on other chromosomes with string names.

My final steps to the .h5 file were: 

/run_pipeline.pl  -fork1 -DiscoverySNPCallerPluginV2 -db /T5/datab/t5pl.db -mnMAF 0.05 -sC 1 -eC 70 -endPlugin -runfork1 

/run_pipeline.pl -fork1 -SNPQualityProfilerPlugin -db /T5/datab/t5pl.db -statFile SNPqualstats.txt -endPlugin -runfork1 

/run_pipeline.pl  -fork1 -UpdateSNPPositionQualityPlugin -db /T5/datab/t5pl.db -endPlugin -runfork1 

/run_pipeline.pl  -fork1 -ProductionSNPCallerPluginV2 -i /fastq/Peri_raw_data/ -db /T5/datab/t5pl.db -e PstI -k /fastq/keyfile.txt -eR 0.03 -d 1 -o /T5/SNPs/rawsnps.h5 -endPlugin -runfork1 


However, the final file clearly contains NO calls on the chromosomes with string names, e.g. the Unknown chromosome.

E.g.
NumberSNPS  NumberTags  ChromosomeName
21529       72558
13989       48137
20615       65307
16676       60637
15026       49106
0           166         LG2
0           6           LG5
0           0           MT
0           160909      Un
0           36782
0           3956        4_random
0           8723        8_random
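
A quick way to spot the affected chromosomes in a summary like the one above is to flag every row that has tags but zero SNPs. The following is only a sketch: the function name `uncalled` and the embedded sample text are hypothetical, and it assumes the whitespace-separated "NumberSNPS NumberTags ChromosomeName" layout shown in this post, where some rows have no name.

```python
# Sample rows copied from the summary in this post (names are missing on some rows).
summary = """\
NumberSNPS NumberTags ChromosomeName
21529 72558
0 166 LG2
0 6 LG5
0 0 MT
0 160909 Un
0 3956 4_random
0 8723 8_random
"""


def uncalled(text):
    """Return (name, tags) for rows that have tags but zero SNPs."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (first field is not a number).
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        snps, tags = int(fields[0]), int(fields[1])
        name = fields[2] if len(fields) > 2 else "(name missing)"
        if snps == 0 and tags > 0:
            flagged.append((name, tags))
    return flagged


flagged = uncalled(summary)
for name, tags in flagged:
    print(f"{name}: {tags} tags, no SNPs called")
```

MT is not flagged because it has no tags at all, so there is nothing to call on it.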


The only thing I can think of that would be causing this is:

Could this result be because I am specifying -eC 70 in the Discovery SNP caller, and it doesn't like that I have only 28 numeric and the rest of the 70 are string chromosome names?

Has anyone else had this problem?



Thanks for the help.

Peri

Lynn Carol Johnson

Nov 3, 2015, 6:57:59 AM
to tas...@googlegroups.com
Hi Peri -

Try running the command without giving a start/end chromosome. Those parameters are not required and, when they are absent, the code should default to processing all chromosomes.

Lynn

Lynn Carol Johnson

Nov 3, 2015, 7:05:30 AM
to tas...@googlegroups.com
More detail: The chromosomes are ordered alphabetically, with numbers preceding letters. When start/end chromosomes are specified, the code only processes chromosomes that fall within that range. When you give it -sC 1 -eC 70, it doesn't process anything that sorts after 70, which is why you end up with no data for your non-numeric chromosomes.

Examples of ordering can be seen on the wiki for this plugin. Please let me know if the wiki isn’t clear and I will update it.
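
The range behaviour described above can be sketched in a few lines of Python. This is an illustration only, not TASSEL's actual comparator: it assumes that purely numeric names sort first by value and that every other name sorts afterwards as a plain string, which is one reading of "numbers preceding letters" that is consistent with the zero-SNP rows reported earlier in the thread. The helper names `chrom_key` and `in_range` are hypothetical.

```python
def chrom_key(name):
    """Sort key (assumed rule): numeric names first by value, then other names as strings."""
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def in_range(name, start, end):
    """True if name sorts between start and end (inclusive) under chrom_key."""
    return chrom_key(start) <= chrom_key(name) <= chrom_key(end)


# A subset of the chromosome names mentioned in this thread.
names = ["Un", "10", "LG2", "2", "4_random", "1", "MT", "28", "8_random", "LG5"]

ordered = sorted(names, key=chrom_key)
kept = [n for n in ordered if in_range(n, "1", "70")]
skipped = [n for n in ordered if not in_range(n, "1", "70")]

print("order:  ", ordered)
print("kept:   ", kept)
print("skipped:", skipped)
```

Under this assumed ordering, a range of 1 to 70 keeps only the numeric names, and everything with letters in it falls after the end of the range. Leaving out the start/end chromosomes avoids the filter altogether.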


Thanks - Lynn

Peri Bolton

Nov 3, 2015, 9:17:29 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi Lynn,

Thanks for the recommendation. I've tried doing that in three different ways:
- not including the -sC and -eC flags at all
- only including the -sC 1 flag.
- including -sC and -eC but not specifying a number.
All three throw an error (example below) stating that the -sC and -eC flags are required. I thought I was running the most up-to-date version of TASSEL (5.2.12). I am not including a reference genome because I don't want SNPs to be called relative to the reference. Is that what the explanation on the wiki actually means? That is how I understood it.

As for the instructions on the wiki, yes, I think they could be a little clearer. From them I understood that the pipeline can handle characters in the chromosome names, but not how to implement that.

Cheers,

P


/home/../opt/tassel5.0_standalone/lib/batik-awt-util.jar:/home/../opt/tassel5.0_standalone/lib/guava-14.0.1.jar:/home/../opt/tassel5.0_standalone/lib/snappy-java-1.1.1.6.jar:/home/../opt/tassel5.0_standalone/lib/trove-3.0.3.jar:/home/../opt/tassel5.0_standalone/lib/batik-util.jar:/home/../opt/tassel5.0_standalone/lib/batik-gui-util.jar:/home/../opt/tassel5.0_standalone/lib/biojava-alignment-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/colt.jar:/home/../opt/tassel5.0_standalone/lib/commons-math3-3.4.1.jar:/home/../opt/tassel5.0_standalone/lib/slf4j-api-1.7.10.jar:/home/../opt/tassel5.0_standalone/lib/batik-gvt.jar:/home/../opt/tassel5.0_standalone/lib/jfreechart-1.0.3.jar:/home/../opt/tassel5.0_standalone/lib/mail-1.4.jar:/home/../opt/tassel5.0_standalone/lib/poi-3.0.1-FINAL-20070705.jar:/home/../opt/tassel5.0_standalone/lib/geronimo-spec-activation-1.0.2-rc4.jar:/home/../opt/tassel5.0_standalone/lib/jcommon-1.0.6.jar:/home/../opt/tassel5.0_standalone/lib/junit-4.10.jar:/home/../opt/tassel5.0_standalone/lib/sqlite-jdbc-3.8.5-pre1.jar:/home/../opt/tassel5.0_standalone/lib/json-simple-1.1.1.jar:/home/../opt/tassel5.0_standalone/lib/slf4j-simple-1.7.10.jar:/home/../opt/tassel5.0_standalone/lib/postgresql-9.4-1201.jdbc41.jar:/home/../opt/tassel5.0_standalone/lib/itextpdf-5.1.0.jar:/home/../opt/tassel5.0_standalone/lib/biojava-core-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/log4j-1.2.13.jar:/home/../opt/tassel5.0_standalone/lib/ejml-0.23.jar:/home/../opt/tassel5.0_standalone/lib/batik-dom.jar:/home/../opt/tassel5.0_standalone/lib/cisd-jhdf5-batteries_included_lin_win_mac.jar:/home/../opt/tassel5.0_standalone/lib/batik-xml.jar:/home/../opt/tassel5.0_standalone/lib/commons-codec-1.10.jar:/home/../opt/tassel5.0_standalone/lib/javax.json-1.0.4.jar:/home/../opt/tassel5.0_standalone/lib/xmlParserAPIs.jar:/home/../opt/tassel5.0_standalone/lib/xercesImpl.jar:/home/../opt/tassel5.0_standalone/lib/batik-css.jar:/home/../opt/tassel5.0_standalone/lib/batik-svggen.jar:/home/..
/opt/tassel5.0_standalone/lib/forester.jar:/home/../opt/tassel5.0_standalone/lib/batik-parser.jar:/home/../opt/tassel5.0_standalone/lib/xml.jar:/home/../opt/tassel5.0_standalone/lib/batik-svg-dom.jar:/home/../opt/tassel5.0_standalone/lib/biojava-phylo-4.0.0.jar:/home/../opt/tassel5.0_standalone/lib/batik-ext.jar:/home/../opt/tassel5.0_standalone/sTASSEL.jar
Memory Settings: -Xms512m -Xmx40g
Tassel Pipeline Arguments: -fork1 -DiscoverySNPCallerPluginV2 -db /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db -sC -eC -mnMAF 0.05 -endPlugin -runfork1
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.12  Date: June 25, 2015
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 36409 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 1.8.0_31
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -DiscoverySNPCallerPluginV2, -db, /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db, -sC, -eC, -mnMAF, 0.05, -endPlugin, -runfork1]
[main] ERROR net.maizegenetics.plugindef.AbstractPlugin - Parameter requires a value: -sC
[main] INFO net.maizegenetics.plugindef.AbstractPlugin - 
Usage:
DiscoverySNPCallerPluginV2 <options>
-db <Input GBS Database> : Input Database file if using SQLite (required)
-mnMAF <Min Minor Allele Freq> : Minimum minor allele frequency (Default: 0.01)
-mnLCov <Min Locus Coverage> : Minimum locus coverage (proportion of Taxa with a genotype) (Default: 0.1)
-ref <Reference Genome File> : Path to reference genome in fasta format. Ensures that a tag from the reference genome is always included when the tags at a locus are aligned against each other to call SNPs. The reference allele for each site is then provided in the output HapMap files, under the taxon name "REFERENCE_GENOME" (first taxon). DEFAULT: Don't use reference genome.
-sC <Start Chromosome> : Start Chromosome (required)
-eC <End Chromosome> : End Chromosome (required)
-inclRare <true | false> : Include the rare alleles at site (3 or 4th states) (Default: false)
-inclGaps <true | false> : Include sites where major or minor allele is a GAP (Default: false)
-callBiSNPsWGap <true | false> : Include sites where the third allele is a GAP (mutually exclusive with inclGaps) (Default: false)
-gapAlignRatio <Gap Alignment Threshold> : Gap alignment threshold ratio of indel contrasts to non indel contrasts: IC/(IC + NC). Any loci with a tag alignment value above this threshold will be excluded from the pool. (Default: 1.0)
-maxTagsCutSite <Max Number of Cut Sites> : Maximum number of tags per cut site (Default: 64)
-deleteOldData <true | false> : Delete existing SNP data from tables (Default: false)

Terry Casstevens

Nov 3, 2015, 10:06:11 PM
to Tassel User Group
The usage says -sC -eC are required.

And a chromosome must follow the flags.

-sC 1 -eC 10

Lynn Carol Johnson

Nov 4, 2015, 6:40:02 AM
to tas...@googlegroups.com
Hi Peri -

The latest version of TASSEL 5 is 5.2.16. I just ran Discovery without the -sC/-eC parameters and had no problems. Please update to the latest version and let me know if you're still seeing problems.

What additional information would you like to see about characters in chromosome names? The 3 examples on the wiki show the ordering of chromosomes where names are just numeric, where names combine numbers and letters, and where some names are strictly numeric while others are strictly alphabetic. The code doesn't care how you name the chromosomes. It takes what is there and sorts them lexicographically. When you give it a start/end range, it only processes those names that fall within the specified range.

Lynn

Lynn Carol Johnson

Nov 4, 2015, 6:45:39 AM
to tas...@googlegroups.com
Hi Terry -

That shouldn't be true in the latest loads. The start/end chromosomes
have "required" set to "false". This change was delivered in commit
191fe00 on July 17, 2015.

Lynn


Peri Bolton

Nov 5, 2015, 10:46:48 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi Lynn,

Thanks so much for your help! I really appreciate it. It seems to be working now for calling SNPs on all the chromosomes.

In terms of the manual, it would be helpful to have an explicit explanation (like the one I've got here) of what to do on the command line if you have character strings in the chromosome names. This could simply be an example command that matches one of the pre-existing examples showing how TASSEL sorts them.


So it all seems to be working now, but I am getting an error when generating the SNP quality output (below). Do I need to delete the old SNP calls, which I made with the slightly older version of TASSEL? Would I do that by running the Discovery SNP caller again with the flag -deleteOldData TRUE?

Tassel Pipeline Arguments: -fork1 -SNPQualityProfilerPlugin -db /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db -statFile SNPqualstats_151104.txt -endPlugin -runfork1
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.16  Date: October 15, 2015
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 36409 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 1.8.0_31
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
[main] INFO net.maizegenetics.tassel.TasselLogging - Number of Processors: 12
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -SNPQualityProfilerPlugin, -db, /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db, -statFile, SNPqualstats_151104.txt, -endPlugin, -runfork1]
net.maizegenetics.analysis.gbs.v2.SNPQualityProfilerPlugin
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.analysis.gbs.v2.SNPQualityProfilerPlugin: time: Nov 6, 2015 14:29:35
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
SNPQualityProfilerPlugin Parameters
taxa: null
db: /home/rollins/gouldian_gbs/Ref_genome/T5/datab/t5pl.db
tname: null
statFile: SNPqualstats_151104.txt
deleteOldData: false

size of all tags in tag table=1385147
size of all tags in mappingApproach table=2
size of all taxa in taxa table=288
sublist
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287]
size of all positions in snpPosition table=346972
Processing Positions between 0 and 10,000.java.sql.SQLException: UNIQUE constraint failed: snpQuality.snpid, snpQuality.taxasubset
at org.sqlite.core.DB.throwex(DB.java:859)
at org.sqlite.core.DB.executeBatch(DB.java:760)
at org.sqlite.core.CorePreparedStatement.executeBatch(CorePreparedStatement.java:77)
at net.maizegenetics.dna.tag.TagDataSQLite.putSNPQualityProfile(TagDataSQLite.java:535)
at net.maizegenetics.analysis.gbs.v2.SNPQualityProfilerPlugin.processData(SNPQualityProfilerPlugin.java:204)
at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:108)
at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:1585)
at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Error processing request.  Quality data may already exist for taxa name ALL
 UNIQUE constraint failed: snpQuality.snpid, snpQuality.taxasubset
Closing SQLDB

Lynn Carol Johnson

Nov 6, 2015, 7:03:22 AM
to tas...@googlegroups.com
Hi Peri -

The "UNIQUE constraint failed" error in your log indicates you already have data in the snpQuality table corresponding to the positions you are trying to insert. You can re-run the SNPQualityProfilerPlugin with the "-deleteOldData true" flag so that the new values can be stored.
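
For readers who have not met this SQLite error before, here is a minimal sketch that reproduces it with Python's built-in sqlite3 module. It is not TASSEL's schema or code: only the table name and the two constrained columns (snpid, taxasubset) come from the error message in the log, while the `avgDepth` column and the values are made up for illustration. It shows that a second insert for the same (snpid, taxasubset) pair fails, and that it succeeds once the old rows have been removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the snpQuality table; avgDepth is an invented column.
conn.execute(
    "CREATE TABLE snpQuality ("
    "snpid INTEGER, taxasubset TEXT, avgDepth REAL, "
    "UNIQUE (snpid, taxasubset))"
)

# First profiling run: the row is stored without complaint.
conn.execute("INSERT INTO snpQuality VALUES (1, 'ALL', 3.5)")

# Second run over the same SNP and taxa subset: the UNIQUE constraint rejects it.
second_insert_failed = False
message = ""
try:
    conn.execute("INSERT INTO snpQuality VALUES (1, 'ALL', 4.0)")
except sqlite3.IntegrityError as err:
    second_insert_failed = True
    message = str(err)
print(message)

# Removing the old rows first lets the new values be stored.
conn.execute("DELETE FROM snpQuality WHERE taxasubset = 'ALL'")
conn.execute("INSERT INTO snpQuality VALUES (1, 'ALL', 4.0)")
rows = conn.execute("SELECT snpid, taxasubset, avgDepth FROM snpQuality").fetchall()
print(rows)
```

The explicit DELETE here only stands in for whatever clean-up the plugin performs when old data is discarded; the thread does not describe those internals.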

I’ll add another example to the wiki of calling the pipeline with string chromosomes.

Thanks for your feedback - Lynn
