Re: Extremely low number of SNPs in TASSEL GBS discovery pipeline

491 views
Skip to first unread message
Message has been deleted

Polymerase Writer

unread,
Feb 3, 2016, 5:56:41 AM2/3/16
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Have you tried using TASSEL 5 gbsv2 pipeline?

On Wednesday, February 3, 2016 at 8:37:11 AM UTC, Neu B wrote:
Hi there,

I have been working with the GBS pipeline of TASSEL3 to analyze GBS data (restriction enzyme based reduced representation libraries).

The pipeline finished succesfully now, but in the end result I am getting very limited SNPs (~500). I have analyzed the same data with different procedures and I get many more SNPs, so I am confident there are much more SNPs in the data.

Below are the commands that I am using. I was hoping there was some filtering which I could turn off, but I haven't found any yet.
Does anyone have any advice on what could be the problem or what kind of checks I can make to troubleshoot this?

Thanks very much in advance!


echo "COMMAND: FastqToTagCountPlugin"
perl run_gbs_pipeline.pl -fork1 -FastqToTagCountPlugin -i $work_folder/00-fastq -k $work_folder/00-key/key.txt -e ApeKI -o $work_folder/01-tagCounts -s 3000000 -endPlugin -runfork1
## Usage is as follows:
##  -i  Input directory containing FASTQ files in text or gzipped text.
##      NOTE: Directory will be searched recursively and should
##      be written WITHOUT a slash after its name.
##
##  -k  Key file listing barcodes distinguishing the samples
##  -e  Enzyme used to create the GBS library, if it differs from the one listed in the key file.
##  -s  Max good reads per lane. (Optional. Default is 200,000,000).
##  -c  Minimum tag count (default is 1).
##  -o  Output directory to contain .cnt files (one per FASTQ file, defaults to input directory).

echo "COMMAND: MergeMultipleTagCountPlugin"
perl run_gbs_pipeline.pl -fork1 -MergeMultipleTagCountPlugin -i $work_folder/01-tagCounts/ -o $work_folder/02-mergedTagCounts/myMasterGBSTags.cnt -endPlugin -runfork1
## Usage is as follows:
## -i  Input directory containing .cnt files
## -o  Output file name
## -c Minimum count of reads to be output (default 1)
## -t Specifies that reads should be output in FASTQ text format.

echo "COMMAND: TagCountToFastqPlugin"
perl run_gbs_pipeline.pl -fork1 -TagCountToFastqPlugin -i $work_folder/02-mergedTagCounts/myMasterGBSTags.cnt -o $work_folder/02-mergedTagCounts/myMasterGBSTags.fastq -endPlugin -runfork1
## The options for the TagCountToFastqPlugin are:
## -i Input binary tag count (*.cnt) file
## -o Output fastq file to use as input for BWA or bowtie2
## -c Minimum count of reads for a tag to be output (default: 1)

echo "COMMAND: Indexing reference for mapping with BWA MEM"
bwa index -a bwtsw $reference
echo "COMMAND: mapping with BWA MEM"
bwa aln -q 20 -t 10 -k 1 $reference $work_folder/02-mergedTagCounts/myMasterGBSTags.fastq > $work_folder/03-mapping/Mapping.sai
bwa samse $reference $work_folder/03-mapping/Mapping.sai $work_folder/02-mergedTagCounts/myMasterGBSTags.fastq > $work_folder/03-mapping/Mapping.sam

echo "COMMAND: SAMConverterPlugin"
perl run_gbs_pipeline.pl -fork1 -SAMConverterPlugin -i $work_folder/03-mapping/Mapping.sam -o $work_folder/04-topm/myMasterTags.topm -endPlugin -runfork1
## Usage is as follows:
## -i  Name of input file in SAM text format (required)
## -o  Name of output file (default output.topm.bin)
## -t  Specifies text output format
## -l  tag length in integer multiples of 32 bases (default=2)

echo "COMMAND: FastqToTBTPlugin"
perl run_gbs_pipeline.pl -fork1 -FastqToTBTPlugin -i $work_folder/00-fastq -k $work_folder/00-key/key.txt -e ApeKI -o $work_folder/05-tbt -y -t $work_folder/02-mergedTagCounts/myMasterGBSTags.cnt -endPlugin -runfork1
## Usage is as follows:
## -i  Input directory containing .fastq files
## -k  Barcode key file
## -e  Enzyme used to create the GBS library, if it differs from the one listed in the key file.
## -o  Output directory
## -s  Max good reads per lane. (Optional. Default is 200,000,000).
## -c  Minimum taxa count within a fastq file for a tag to be output (default 1)
## -y  Output to tagsByTaxaByte (tag counts per taxon from 0 to 127) instead of tagsByTaxaBit (0 or 1)
## One of either:
##     -t  Tag count file, OR A
##     -m  Physical map file containing alignments

echo "COMMAND: MergeTagsByTaxaFilesPlugin"
perl run_gbs_pipeline.pl -fork1 -MergeTagsByTaxaFilesPlugin -i $work_folder/05-tbt -o $work_folder/06-mergedTBT/myStudy.tbt.byte -s 10000000 -endPlugin -runfork1
## Usage is as follows:
## -i  Input directory containing .tbt.bin or .tbt.byte files
## -o  Output file name
## -s  Maximum number of tags the TBT can hold while merging (default: 200000000)
## -x  Merge tag counts of taxa with identical names (default: false)
## -h  Call snps in output and write to HapMap file with the provided name

echo "COMMAND: TagsToSNPByAlignment"
perl run_gbs_pipeline.pl -fork1 -TagsToSNPByAlignmentPlugin -i $work_folder/06-mergedTBT/myStudy.tbt.byte -y \
-m $work_folder/04-topm/myMasterTags.topm -o $work_folder/07-hapmap/raw -s 0 -e 370646 -endPlugin -runfork1
## The available options for the TagsToSNPByAlignmentPlugin are as follows:
## -i       Input .tbt file
## -y       Use byte-formatted TBT file (*.tbt.byte)
## -m       TagsOnPhysicalMap file containing genomic position of tags
## -mUpd    Update TagsOnPhysicalMap file with allele calls, saved to specified file
## -o       Output directory (default current directory)
## -mxSites Maximum number of sites (SNPs) output per chromosome (default: 200000)
## -mnF     Minimum F (inbreeding coefficient) (default: -2.0  = no filter)
## -p       Pedigree file containing full sample names (or expected names after merging) & expected inbreeding
##          coefficient (F) for each.  Only taxa with expected F >= mnF used to calculate F = 1-Ho/He.
##          (default: use ALL taxa to calculate F)
## -mnMAF   Minimum minor allele frequency (default: 0.01)
## -mnMAC   Minimum minor allele count (default: 10)
## -mnLCov  Minimum locus coverage (proportion of Taxa with a genotype) (default: 0.1)
## -errRate Average sequencing error rate per base (used to decide between heterozygous and homozygous calls) (default: 0.01)
## -ref     Path to reference genome in fasta format. Ensures that a tag from the reference genome is always included
##          when the tags at a locus are aligned against each other to call SNPs. The reference allele for each site
##          is then provided in the output HapMap files, under the taxon name "REFERENCE_GENOME" (first taxon).
##          DEFAULT: Don't use reference genome.
## -inclRare  Include the rare alleles at site (3 or 4th states) (default: false)
## -inclGaps  Include sites where major or minor allele is a GAP (default: false)
## -callBiSNPsWGap  Include sites where the third allele is a GAP (default: false)
## -s       Start chromosome
## -e       End chromosome

Neu B

unread,
Feb 3, 2016, 8:21:29 AM2/3/16
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi Polymerase Writer,

I am doing reference based GBS analysis (using the UNEAK pipeline). When I take the manual page from TASSEL 5 (https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual ) and follow the link to the UNEAK pipeline, it sends me to a manual from TASSEL 3.. Is there another version of the UNEAK pipeline in TASSEL 5?





Op woensdag 3 februari 2016 11:56:41 UTC+1 schreef Polymerase Writer:

Lynn Carol Johnson

unread,
Feb 3, 2016, 9:09:25 AM2/3/16
to tas...@googlegroups.com
The UNEAK pipeline is not supported in TASSEL 5.  You can run the DiscoverySNPCallerV2 plugin with or without a reference.  When no reference genome file is specified, the most common tag is marked as the reference. 

--
You received this message because you are subscribed to the Google Groups "TASSEL - Trait Analysis by Association, Evolution and Linkage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tassel+un...@googlegroups.com.
To post to this group, send email to tas...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tassel/491f5729-5b5c-494f-83a9-8e040a22bb26%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Message has been deleted

Neu B

unread,
Mar 3, 2016, 7:31:06 AM3/3/16
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Lynn, thanks for your reply. I see have caused some confusion here; UNEAK is for analysis WITHOUT reference sequence, I have been running the analysis WITH a reference using the commands I showed in my first post above.

Using these commands, I get a very low number of SNPs. With other approaches (regular mapping and SNP calling) I get a very large number of SNPs ( > 300k).
There are not many parameters to set, but maybe I am filtering them out somehow? I hope someone can help!

Thanks in advance!


Op woensdag 3 februari 2016 15:09:25 UTC+1 schreef Lynn Johnson:
Reply all
Reply to author
Forward
Message has been deleted
0 new messages