Regarding reordering error

192 views
Skip to first unread message

shalu jhanwar

unread,
Mar 10, 2014, 3:17:52 PM3/10/14
to bissn...@googlegroups.com
Hi,

I want to use BisSnp for calling variants in WGBS of mouse data. Attached is the procedure I followed to prepare .bam and reference file to use BisSNP. I tried my best to follow all the steps correctly. Still my run got killed with the following error:

 ##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.5-3-gbb2c10b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR   reads contigs = [chr7, chr14, chrY, chr19, chr8, chr1, chr11, chr6, chr17, chr16, chr18, chr3, chr12, chr15, chrX, chr4, chrM, chr2, chr9, chr13, chr10, chr5]
##### ERROR   reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chrX, chrY, chrM]
##### ERROR ------------------------------------------------------------------------------------------

This error is related to the ordering of chromosomes in the files. But I already use ReorderSam (see attached file). Please let me know how can I resolve this error.

Many Thanks!

Shalu
BisSnp_error_report.txt

Yaping Liu

unread,
Mar 10, 2014, 3:33:11 PM3/10/14
to bissn...@googlegroups.com
Hi Shalu,
I guess it because the vcf file order should also be changed:

/users/GD/resource/mouse/mm9/annotation/variation/mouse-20111102-snps-all.annotated.mm9.vcf

This vcf file contig order is ch1, chr2, chr3, .....

You can use the perl script in Utils to sort your vcf files by your own reference.fa.fai file: sortByRefAndCor.pl 

Thanks for your interests!

Yaping

---
Yaping Liu

PhD candidate 
in
USC Epigenome Center

University of Southern California






--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.
<BisSnp_error_report.txt>

shalu jhanwar

unread,
Mar 11, 2014, 6:57:50 AM3/11/14
to bissn...@googlegroups.com
Hi,

Thank you for the reply.
There is a step in preparing .bam files i.e. AddReadGroup. I did it on one .bam of a sample in default mode, so it added RG:Z:1 to all the reads. But if I have multiple samples(bam files), then should I use different read groups for every sample  or I can use the default mode, which adds RG:Z:1 to all the samples?

Please explain.

Shalu

shalu jhanwar

unread,
Mar 11, 2014, 7:17:30 AM3/11/14
to bissn...@googlegroups.com
Hi,

i) Can I perform with BisSnp multisample and single sample calling same as GATK?
ii) If I want to run 2 samples with a single run of BisSnp, then will it report SNPs statistics for both the samples individually?

Please explain.

Thanks!

Shalu

On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

Yaping Liu

unread,
Mar 11, 2014, 8:00:56 PM3/11/14
to bissn...@googlegroups.com
Hi Shalu,
1) yes, you can for BisulfiteGenotyper
2) for cpg.vcf and snp.vcf, they will be output into the one single VCF file.

I will suggest you to do it separately since it is quicker to get the result. you can merge them later by VCFtools later anyway..

Thanks,

Yaping

---
Yaping Liu

PhD candidate 
in
USC Epigenome Center

University of Southern California






shalu jhanwar

unread,
Mar 17, 2014, 6:56:28 AM3/17/14
to bissn...@googlegroups.com
Hi Liu,

I add different readGroups to the different samples, then merge all the bam files in one. Then is it possible that BisSNP performs SNP calling by using all the reads from different samples at once. Something related to the follwoing example:

Ref        ..........G..........
Sam1     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam3     ..........T..........
Sam3     ..........T..........
Sam3     ..........T..........
Sam4     ..........T..........
Sam4     ..........T..........
Sam4     ..........T..........


Now the SNP is G/A/T at a particular position.



 

Yaping Liu

unread,
Mar 17, 2014, 1:24:13 PM3/17/14
to bissn...@googlegroups.com
Hi Shalu,
If you specify SM tag (sample names) to be different in the bam file header, then Bis-SNP will treat them separately, 
e.g.

@RG     ID:sample_id_1  PL:illumina Hiseq       PU:your_pu  LB:your_lib   SM:sam1 CN:USC EPIGENOME CENTER
@RG     ID:sample_id_2   PL:illumina Hiseq       PU:your_pu  LB:your_lib   SM:sam2 CN:USC EPIGENOME CENTER

 in VCF output, it will be still output into one vcf file, but different sample have different genotyping result.

Thanks,

Yaping

---
Yaping Liu

PhD candidate 
in
USC Epigenome Center

University of Southern California






shalu jhanwar

unread,
Mar 18, 2014, 9:25:49 AM3/18/14
to bissn...@googlegroups.com
Thanks for your reply.

i) Continuation of my previous question, do I need to specify something like "UnifiedGenotyper" or multisample calling like GATK? r I can just call  BisulfiteGenotyper (same in single sample and multisample calling)
ii) Do I need to merge all the bam files or I can provide more than one bam file in the input (as in multisample calling of GATK)?

S.


On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

Yaping Liu

unread,
Mar 18, 2014, 8:08:59 PM3/18/14
to bissn...@googlegroups.com
Hi Shalu,
1) You can just use BisulfiteGenotyper directly.
2) no, you can use the similar way as GATK: -I 1.bam -I 2.bam -I 3.bam ....    but more bam file will make the program slower and need more memory.

Yaping

---
Yaping Liu

PhD candidate 
in
USC Epigenome Center

University of Southern California






shalu jhanwar

unread,
Mar 30, 2014, 12:25:31 PM3/30/14
to bissn...@googlegroups.com
Hi Liu,

Thank you for your kind replies.
Regarding the interpretation of the Bayesian inference model Pr(G|D)=π(G)Pr(D|G) for SNP detection, please let me know the following:
- i) At each position, there are 10 possible genotypes.
ii)  For each genotype (out of 10), calculate π(G) from the dbSNP database
iii) Pr(D|G): i.e. the probability of the observing bisulfite data, given a particular genotype (out of the 10 possible).  Is this term calculated separately for each genotype? If it is the case, then in the end I have 10 values of the Pr(D|G) term, each for one genotype. Please explain.
iv) If we have 10 values for Pr(D|G) terms, then similarly 10 different values for  Pr(G|D) is calculated for each position and the genotype  with the highest probability is considered in the end and reported in the vcf file. 

Please correct me in case i am wrong.
Thanks!

shalu jhanwar

unread,
Apr 1, 2014, 7:11:49 AM4/1/14
to bissn...@googlegroups.com
Hi Liu,

Thank you for your kind replies.
Regarding the interpretation of the Bayesian inference model Pr(G|D)=π(G)Pr(D|G) for SNP detection, please let me know the following:
- i) At each position, there are 10 possible genotypes.
ii)  For each genotype (out of 10), calculate π(G) from the dbSNP database
iii) Pr(D|G): i.e. the probability of the observing bisulfite data, given a particular genotype (out of the 10 possible).  Is this term calculated separately for each genotype? If it is the case, then in the end I have 10 values of the Pr(D|G) term, each for one genotype. Please explain.
iv) If we have 10 values for Pr(D|G) terms, then similarly 10 different values for  Pr(G|D) is calculated for each position and the genotype  with the highest probability is considered in the end and reported in the vcf file. 

Please correct me in case i am wrong.
Thanks!


On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

shalu jhanwar

unread,
Apr 3, 2014, 3:03:46 PM4/3/14
to bissn...@googlegroups.com
Hi Liu,

Please reply. As I am performing the analysis of my data. I run it on a sample like this:


I saw one entry in the snp.vcf file:
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TGEE
chr11   7000834 .       C       T       29.33   PASS    CS=+;Context=YH;DP=18;MQ0=0;NS=1;REF=CH;SB=-0.0235      GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS     0/1:29.0,29.5:0,13,0,3,2,0:0:YH:13:18:0,3,13,2:29,0,90:29.33:5
Here, in BRC6 column: I have these values
0: no. of C on cytosine strand
13: no. fo T on cytosine strand
0: no of A/G/N in cytosine strand
3: no. of G in the Guanine strand
2: no of A in the Guanine strand
0: no of C/T/N in the Guanine strand

But when I looked in igv. I found the calculation as (A:0, C:3, T:15, G:0, N:0). Please see the attachment(Red=negative strand, blue=positive strand). Even in other variants, the value of DP is not the same as the total depth.
Please explain me why the values reposrted in the vcf fluctuates from the actual one?

Below is my command I used to generate the vcf file:

BisSNP-0.82.2 Program Args= -R /users/GD/resource/mouse/mm9/full/mm9_NCBI73.fasta -I /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.bam -D /users/GD/resource/mouse/mm9/annotation/variation/mouse-20111102-snps-all.annotated.mm9.vcf -T BisulfiteGenotyper -vfn1 /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/test/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.cpg.raw.vcf -vfn2 /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/test/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.snp.raw.vcf -C CG,1 -C CH,1 -out_modes DEFAULT_FOR_TCGA -stand_call_conf 20 -stand_emit_conf 0 -nt 12 -minConv 1 -vcfCache 1000000 -mmq 30 -mbq 5 -L chr11:7000000-7100000

Thanking you in anticipation of your quick reply.

Shalu

On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:
snp_view.png
snp_view1.png

Yaping Liu

unread,
Apr 4, 2014, 9:03:16 PM4/4/14
to bissn...@googlegroups.com, shalu jhanwar, bbe...@usc.edu
Hi Shalu,
I am sorry that we were out of town recently. The sequence composition showed at IGV browser is not designed for bisulfite-seq. If you use samtools pileup/mpileup command to look at the sequence composition at that position, you will find the real composition should be:  cccTTTTTTTTTTTTTtt (the character order may vary, but the number of c, t and T should be the same as I showed)
lower case represent the bases at reverse strand, while upper case represent the base at forward strand(reference genome strand).  The stats from IGV just ignore the strand specificity, which is fine and good for non-bisulfite seq genotyping, but it will not be good for bisulfite-seq genotyping and methylation calling. 
I will suggest you to use samtools tview to look at the real sequence composition if you want to do some sanity check on the raw data.

Thanks,


Yaping

 
--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.
<snp_view.png><snp_view1.png>

shalu jhanwar

unread,
Apr 5, 2014, 11:58:22 AM4/5/14
to bissn...@googlegroups.com
Hi Liu,

Thanks for the reply. I saw the position in tview as suggested by you. But I didn't find the composition as suggested by you.


 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TGEE
chr11   7000834 .       C       T       29.33   PASS    CS=+;Context=YH;DP=18;MQ0=0;
NS=1;REF=CH;SB=-0.0235      GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS     0/1:29.0,29.5:0,13,0,3,2,0:0:YH:13:18:0,3,13,2:29,0,90:29.33:5
Here, in BRC6 column: I have these values
0: no. of C on cytosine strand
13: no. fo T on cytosine strand
0: no of A/G/N in cytosine strand
3: no. of G in the Guanine strand
2: no of A in the Guanine strand
0: no of C/T/N in the Guanine strand

In IGV, the real composition of the following variant is like this (see the very first position in the screen shot attached)
T:4
t:11
C:3 (it is shown as dot(.) and comma(,))
Please let me know where is the problem? Additionally, it gave context=YH, what does it mean? I wonder that even in some entries, the value of DP is not excatly the same as the total no. of reads align?

Thanks in anticipation of your quick reply.

Shalu


On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:
Screenshot.png

shalu jhanwar

unread,
Apr 11, 2014, 12:43:46 PM4/11/14
to bissn...@googlegroups.com
Hi Liu,

I would highly appriciate if Can you please reply. I am still waiting for your replies. Continuing my previous question, I have viewed one result in tview and attached it to you. Please resolve my query.

Thanks!


Shalu

On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

Yaping Liu

unread,
Apr 11, 2014, 11:42:11 PM4/11/14
to bissn...@googlegroups.com, shalu jhanwar, bbe...@usc.edu
Hi Shalu,
I am sorry that we are still out of town and can't reply the email frequently. If you insist to know the answer right now, i can give you some quick and dirty solutions.


I will answer the simple question firstly:
For "YH" mean, i will refer you to read this webpage first about the IUPAC code:

Y here represent it could be C/T, H means it could be A/C/T

" the value of DP is not excatly the same as the total no. of reads align"

Bis-SNP by default filter out some bad reads. Here is the default filter in our software which is also described in our google group, you can adjust it by the parameters:

Some bad reads aligned there can not be used for genotyping and methylation calling.

Finally, here is the explanation about how we calculate the genotyping in this example position (sorry in the last email, i forgot to explain it clearly when it is in paired-end condition):
I suggest you to read the top post in our bis-snp group very carefully about the paired-end bisulfite-seq principle. 

You need to combine the information of strand (positive/negative), reads end (first end/ second end).  IGV and tview are not designed for the bisulfite-seq representation purpose. But you can infer it by the reads information by combining both of browsers.

The basic principle is:
when you see a C in positive strand and 1st end, it represent C in cytosine strand (used into methylation calculation).
when you see a C in positive strand and 2nd end, it represent G in guanine strand (complementary to reference genome genotype).
when you see a C in negative strand and 1st end, it represent G in guanine strand (complementary to reference genome).
when you see a C in negative strand and 2nd end, it represent C in cytosine strand (used into methylation calculation).


In your tview mode, you can see there are 4 Ts:
 The first and second "T" are from positive strand and second end, so it will be counted as "A" in the guanine strand (you can infer from the surrounding base composition, like a lot of GA mismatches there. or you can check IGV browser), 
The third and fourth "T" are from positive strand and 1st end, it will be counted as "T" in cytosine strand

Then let's look at 3 "C":
The 1st C (also annotated as ",") is from negative strand and first end. So the C actually represent G. It will be counted as "G" in the guanine strand.
The 2nd and 3rd "C" (annotated as ".") is from positive strand and second end. So the C actually represent G. It will be counted as "G" in the guanine strand.

For the other 11 "t", you can infer it by your self...


I did this reply very quickly, let me know if you see some unreasonable explanations….


Yaping

--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.
<Screenshot.png>

shalu jhanwar

unread,
Apr 12, 2014, 9:39:29 PM4/12/14
to bissn...@googlegroups.com
Hi Liu,

Many thanks for your kind replies and information and sorry for bugging you. Your replies are really needed to me to help in understanding of the software. Please let me know the following:
- Always we call +ve strand as 5' to 3' direction and negative to 3' to 5' direction?
-Cytosine strand is the original 5' to 3' strand and Guanine strand is the original 3' to 5' strand? 

If the above concept is correct, then I cannot visualise "The 1st C (also annotated as ",") is from negative strand and first end. So the C actually represent G. It will be counted as "G" in the guanine strand".
I found it as I should be C on Guanine strand. please explain. I would highly appreciate if you can please explain me with one more position.
 
Looking at the position (attachment), 

i)chr11 7006701 . T C 26.79 PASS CS=+;Context=YH;DP=25;MQ0=0;NS=1;SB=-3.2054 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:31.4,31.5:0,16,0,2,7,0:0:YH:16:25:16,7,0,2:27,0,222:26.79:5

0: no. of C on cytosine strand
16: no. fo T on cytosine strand

0: no of A/G/N in cytosine strand 
2: no. of G in the Guanine strand
7: no of A in the Guanine strand

0: no of C/T/N in the Guanine strand

please find my try for all the types of the reads  (excluding similar explanation) mapped. There are 32 reads but DP is 25. 
- read no. 1,3,8,16,19,20,21,23: T on +ve strand and first end: represents T on cytosine strand
- read no. 2,6,7,10,11,14: T on +ve strand and second end: represents A on Guanine strand
-read no. 12,18,25,31: T on -ve strand and first end: represents T on Guanine strand
-read no. 15,17,22,24,26,27,28,29,30,32: T on -ve strand and second read: represents A on Cytosine strand
-read no. 9: C on +ve and second end: represents G on Guanine strand
-read no 13: c on -ve and first end: represents G on Guanine (according to your explanation, I am unable to understand this)

I am wrong somewhere but I am unable to understand. Please Liu, explain using this example.

-Here as no. of total reads (32) mapped and DP (25) is very different. How do I know which reads are counted here?

- My sequencing is directional, then BisSNP will give me genotypic information as well?

Please reply.

I am unable to attach the attachment here.  So please find the attachment on your gmail. Sorry for the inconvinience. 

Shalu
 


On Monday, 10 March 2014 20:17:52 UTC+1, shalu jhanwar wrote:

shalu jhanwar

unread,
Apr 12, 2014, 9:40:05 PM4/12/14
to Yaping Liu, bissn...@googlegroups.com, bbe...@usc.edu
Hi Liu,

Many thanks for your kind replies and information and sorry for bugging you. Your replies are really needed to me to help in understanding of the software. Please let me know the following:
- Always we call +ve strand as 5' to 3' direction and negative to 3' to 5' direction?
-Cytosine strand is the original 5' to 3' strand and Guanine strand is the original 3' to 5' strand? 

If the above concept is correct, then I cannot visualise "The 1st C (also annotated as ",") is from negative strand and first end. So the C actually represent G. It will be counted as "G" in the guanine strand".
I found it as I should be C on Guanine strand. please explain. I would highly appreciate if you can please explain me with one more position.
 
Looking at the position (attachment), 

i)chr11 7006701 . T C 26.79 PASS CS=+;Context=YH;DP=25;MQ0=0;NS=1;SB=-3.2054 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:31.4,31.5:0,16,0,2,7,0:0:YH:16:25:16,7,0,2:27,0,222:26.79:5

0: no. of C on cytosine strand
16: no. fo T on cytosine strand

0: no of A/G/N in cytosine strand 
2: no. of G in the Guanine strand
7: no of A in the Guanine strand

0: no of C/T/N in the Guanine strand

please find my try for all the types of the reads  (excluding similar explanation) mapped. There are 32 reads but DP is 25. 
- read no. 1,3,8,16,19,20,21,23: T on +ve strand and first end: represents T on cytosine strand
- read no. 2,6,7,10,11,14: T on +ve strand and second end: represents A on Guanine strand
-read no. 12,18,25,31: T on -ve strand and first end: represents T on Guanine strand
-read no. 15,17,22,24,26,27,28,29,30,32: T on -ve strand and second read: represents A on Cytosine strand
-read no. 9: C on +ve and second end: represents G on Guanine strand
-read no 13: c on -ve and first end: represents G on Guanine (according to your explanation, I am unable to understand this)

I am wrong somewhere but I am unable to understand. Please Liu, explain using this example.

-Here as no. of total reads (32) mapped and DP (25) is very different. How do I know which reads are counted here?

- My sequencing is directional, then BisSNP will give me genotypic information as well?

Please reply.

I am unable to attach the attachment here.  So please find the attachment on your gmail. Sorry for the inconvinience. 

Shalu
example.png

shalu jhanwar

unread,
Apr 29, 2014, 11:25:35 AM4/29/14
to bissn...@googlegroups.com
Hi Liu,

Are you still unavailable or can you please reply?

Thanks!


Shalu

On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

Lyping1986

unread,
Apr 29, 2014, 3:02:51 PM4/29/14
to shalu jhanwar, bissn...@googlegroups.com, bbe...@usc.edu
i am coming back today and give u the feed back tonight.  Sorry. 

Sent from my iPhone
<example.png>

shalu jhanwar

unread,
May 1, 2014, 8:28:10 AM5/1/14
to bissn...@googlegroups.com
Hi Liu,

Could you please reply to my previous posts? Your replies only can make user to use your softwares successfully.

Thanks!

Shalu

On Monday, 10 March 2014 20:17:52 UTC+1, shalu jhanwar wrote:

shalu jhanwar

unread,
May 2, 2014, 12:21:39 PM5/2/14
to bissn...@googlegroups.com
Hi Liu and other BisSNP users,

I have performed BisSNP using easyrun.pl.

Please see the attached script containing all the steps I performed for a run using BisSNP. Please varify this, in case I am doing anything wrong.
Can you please let me know the following:
i) what are these columns in the cpg and snp summary.txt (attachment) file?
ii) Can you please let me know which filtering criteria's(parameters) should I select to filter high confident SNPs further?
iii) Is it possible just to call SNPs and not the methylation (as I have aligned reads using Bismark and I already have methylation levels for C's)?
iv) In the vcf file, there are some entries where genotype is 0/0? Why does software report such entries when there is no SNP?
v) I have given options for indel calculations as well, but I didn't find any entries for indels. Can you suggest me somethuing?

I would highly appreciate your quick reply.

Thanks!

Shalu


On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:
SNP.sh
TGEE.deduplicated.Sorted.desired.realign.mdups.recal.cpg.filtered.sort.vcf.cpgSummary.txt
TGEE.deduplicated.Sorted.desired.realign.mdups.recal.snp.filtered.sort.vcf.cpgSummary.txt

shalu jhanwar

unread,
May 6, 2014, 8:30:06 AM5/6/14
to bissn...@googlegroups.com
Hi Liu,

I tried to message from my email. Now BisSNP is running fine on the system. I am on my way to analyse the data.

Thank you for all the responses.

Shalu

On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

shalu jhanwar

unread,
May 7, 2014, 3:07:56 PM5/7/14
to Lyping1986, bissn...@googlegroups.com, bbe...@usc.edu
Hi Liu,

I am looking at the summary files I got after the completion of a BiSSNP run. There are actually 2371058 entries in the TGEE.deduplicated.Sorted.desired.realign.mdups.recal.snp.filtered.sort.vcf file
 grep -v -c "^#" TGEE.deduplicated.Sorted.desired.realign.mdups.recal.snp.filtered.sort.vcf
But in the summary file when I summed all the statistics, then it came up to be 2206223 variants /entries only. Can you please suggest something for the difference in numbers?
Also can you please let me know that where can I find the explanation of the summary files. It is not mentioned in the user guide. Please find the summary file attached in case you want to have a look.

Thanking you in anticipation,

Shalu

snp_summary_file.txt

Yaping Liu

unread,
May 12, 2014, 2:31:25 AM5/12/14
to shalu jhanwar, bissn...@googlegroups.com
Hi Shalu,
I am sorry that this summary file has not been finished in the script. so please do not use this file. That is why i did not mention it in the user guide at all. Thanks!

yaping

<snp_summary_file.txt>

shalu jhanwar

unread,
May 13, 2014, 7:20:03 AM5/13/14
to bissn...@googlegroups.com
Hi Liu,

As suggested by you, I am not looking at the summary file and trying to make statistics. Please let me know the following:
i) Homogyzous conditions could be : 1/1, 0/0 ( 2/2 or 3/3 or so, if more than two alleles are present)
ii) Heterogyzous condition: 0/1, 1/2 etc.
Is that correct?

I saw one entry like below:
chr1    26835522        .       G       T,C     66.06   PASS    CS=-;DP=25;MQ0=0;NS=1;REF=CH    GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS     1/1:30.0,NaN:0,22,0,0,3,0:.:.:.:25:0,0,22,3:0,66,109:66.06:5

Here there are two alternative variants, but the genotype is 1/1, it means all the reads have T allele only. But if none of the read is mapped to C, then how is it reported as variant?

Is there any way (already available), so that I can count heterogyzous, homogyzous, synonymous, non-synonymous variants?

Many thanks in anticipation of your kind reply,


Shalu


On Monday, 10 March 2014 15:17:52 UTC-4, shalu jhanwar wrote:

ping

unread,
May 13, 2014, 1:49:28 PM5/13/14
to bissn...@googlegroups.com
Hi Shalu,
i) yes.
ii) yes.
the genotype is 1/1, does not mean there is no other reads. It just means that by the base quality of all reads, the best genotype is T/T. The reference genome type is G/G. That is why it was reported as variants.This is called homozygous variants. I suggest you to look at some normal genomic sequencing result first.. 

From the explanation above, i guess you can find out what is heterozygous, homozygous, synonymous, non-synonymous variants..

best,

yaping


--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.



--
Yaping Liu

PhD candidate
USC Epigenome Center
Genetics Molecular and Cellular Biology Program
University of Southern California
USA

Phone: +1 213 400 2164
Email:lypin...@gmail.com

shalu jhanwar

unread,
May 16, 2014, 9:26:40 AM5/16/14
to bissn...@googlegroups.com
Hi Liu,

I have called SNPs successfully using BiSSNP. Please let me know the following:

i) I performed calling using the default parameters with few changes only using bissnp_easy_usage.pl. I have given the indel calling options as well. Please see the attached shell script which contains all the commands fired by bissnp_easy_usage.pl. But in the output file, it didn't report any indels. Can you please explain (Also find the screenshot of all the files given by BisSNP as output)?
ii) I calculated homogyzous, heterogyzous, transition, transversion rate of the SNPs on the filtered file like following:
   

cat Desired_TGEE_EGCG_rep1_rep2.deduplicated.Sort.realign.mdups.recal.snp.filtered.sort.vcf | java -jar /users/GD/tools/SnpEff/SnpSift.jar filter "(countHet() >= 1)" | grep -v -c "^#"          =    2056857

cat Desired_TGEE_EGCG_rep1_rep2.deduplicated.Sort.realign.mdups.recal.snp.filtered.sort.vcf | java -jar /users/GD/tools/SnpEff/SnpSift.jar filter "(countHom() >= 1)" | grep -v -c "^#"         =12645   (total: 2056857 + 12645 = 2069502)

java -jar /users/GD/tools/SnpEff/SnpSift.jar tstv het Desired_TGEE_EGCG_rep1_rep2.deduplicated.Sort.realign.mdups.recal.snp.filtered.sort.vcf   = Transition= 1228035, transversion=828951  (total: 1228035 + 828951 = 2056986

So is that possible that the total no. of Ts + Tv is more than the total no. of variants in the file? Please expalin.

Similarly, Homo + Hetero is not equal to the total no. of the variants in the file. Can you please suggest some reason behind it?

I could be wrong. So I would highly appreciate if you could please look into the filtered file (attached) on which I calculated these statistics.

Many thanks!
Screenshot-3.png
SNP.sh

Yaping Liu

unread,
May 20, 2014, 3:24:07 PM5/20/14
to bissn...@googlegroups.com
Hi Shalu,
i) the index realign just try to find potential indels (from the indel.vcf file you provide.), then realign the reads around these potential indels since a lot of bisulfite aligners do not provide gapped alignment result (like BSMAP), it is still not accurate to call indels yet, so i did not output that. 

ii) I am not familiar with SnpEff tool. Maybe when they count Ts/Tv, they have some other filter option? Have you tried to apply SnpEff on any other genomic sequencing  whole genome VCF file (maybe you can download from 1000 genome project ftp site)? You should contact them for the suggestion…

Thanks,

Yaping




--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.
<Screenshot-3.png><SNP.sh>

Reply all
Reply to author
Forward
0 new messages