Regarding reordering error

shalu jhanwar

Mar 10, 2014, 3:17:52 PM

to bissn...@googlegroups.com

Hi,

I want to use Bis-SNP to call variants in mouse WGBS data. The attached file describes the procedure I followed to prepare the .bam and reference files for Bis-SNP. I tried my best to follow all the steps correctly, but my run was still killed with the following error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.5-3-gbb2c10b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR reads contigs = [chr7, chr14, chrY, chr19, chr8, chr1, chr11, chr6, chr17, chr16, chr18, chr3, chr12, chr15, chrX, chr4, chrM, chr2, chr9, chr13, chr10, chr5]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chrX, chrY, chrM]
##### ERROR ------------------------------------------------------------------------------------------

This error is related to the ordering of chromosomes in the files, but I already used ReorderSam (see the attached file). Please let me know how I can resolve this error.

Many Thanks!

Shalu

BisSnp_error_report.txt

Yaping Liu

Mar 10, 2014, 3:33:11 PM

to bissn...@googlegroups.com

Hi Shalu,

I guess it is because the contig order of the VCF file also needs to be changed:

/users/GD/resource/mouse/mm9/annotation/variation/mouse-20111102-snps-all.annotated.mm9.vcf

The contig order in this VCF file is chr1, chr2, chr3, ...

You can use the Perl script sortByRefAndCor.pl in Utils to sort your VCF file according to your own reference.fa.fai file:

http://epigenome.usc.edu/publicationdata/bissnp2011/utilies.html
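
For readers who cannot reach the script, here is a minimal Python sketch of the same idea: reorder VCF records to follow the contig order of a reference .fai index. It is not sortByRefAndCor.pl itself. The function names are made up for illustration, the .fai and VCF lines are assumed to be tab-separated, and any ##contig header lines are left untouched.

```python
def read_fai_order(fai_lines):
    """Return {contig: rank} from the lines of a samtools .fai index."""
    order = {}
    for line in fai_lines:
        if line.strip():
            # The first tab-separated column of a .fai line is the contig name.
            order[line.split("\t")[0]] = len(order)
    return order


def sort_vcf_lines(vcf_lines, contig_rank):
    """Keep header lines first, then records sorted by (contig rank, position).

    Records on contigs that are absent from the reference are dropped.
    """
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
        else:
            chrom, pos = line.split("\t", 2)[:2]
            if chrom in contig_rank:
                records.append((contig_rank[chrom], int(pos), line))
    records.sort(key=lambda r: (r[0], r[1]))
    return header + [r[2] for r in records]


# Toy input: two contigs in the reference, records out of order in the VCF.
fai = ["chr1\t100\t6\t60\t61\n", "chr2\t100\t120\t60\t61\n"]
vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr2\t5\t.\tA\tG\n",
    "chr1\t9\t.\tC\tT\n",
    "chr1\t3\t.\tG\tA\n",
]
sorted_lines = sort_vcf_lines(vcf, read_fai_order(fai))
```

On real files the two inputs would be the open .fai and .vcf file handles, and the result would be written back out one line at a time.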

Thanks for your interest!

Yaping

---

Yaping Liu

PhD candidate

in

USC Epigenome Center

University of Southern California

lypin...@gmail.com

--
You received this message because you are subscribed to the Google Groups "bissnp-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bissnp-help...@googlegroups.com.
To post to this group, send email to bissn...@googlegroups.com.
Visit this group at http://groups.google.com/group/bissnp-help.
For more options, visit https://groups.google.com/d/optout.

shalu jhanwar

Mar 11, 2014, 6:57:50 AM

to bissn...@googlegroups.com

Hi,

Thank you for the reply.
One step in preparing the .bam files is AddReadGroup. I ran it in default mode on the .bam of one sample, so it added RG:Z:1 to all the reads. If I have multiple samples (bam files), should I use a different read group for every sample, or can I use the default mode, which adds RG:Z:1 to all the samples?

Please explain.

Shalu

shalu jhanwar

Mar 11, 2014, 7:17:30 AM

to bissn...@googlegroups.com

Hi,

i) Can I perform both multi-sample and single-sample calling with Bis-SNP, as with GATK?
ii) If I run 2 samples in a single Bis-SNP run, will it report SNP statistics for each sample individually?

Please explain.

Thanks!

Shalu

Yaping Liu

Mar 11, 2014, 8:00:56 PM

to bissn...@googlegroups.com

Hi Shalu,

1) Yes, you can with BisulfiteGenotyper.

2) For both cpg.vcf and snp.vcf, the samples will be written into one single VCF file.

I would suggest running the samples separately, since you get the results more quickly. You can merge them afterwards with VCFtools anyway.

Thanks,

Yaping

shalu jhanwar

Mar 17, 2014, 6:56:28 AM

to bissn...@googlegroups.com

Hi Liu,

I added different read groups to the different samples and then merged all the bam files into one. Is it then possible for Bis-SNP to perform SNP calling using all the reads from the different samples at once? Something like the following example:

Ref        ..........G..........
Sam1     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam2     ..........A..........
Sam3     ..........T..........
Sam3     ..........T..........
Sam3     ..........T..........
Sam4     ..........T..........
Sam4     ..........T..........
Sam4     ..........T..........

Now the SNP is G/A/T at a particular position.

Yaping Liu

Mar 17, 2014, 1:24:13 PM

to bissn...@googlegroups.com

Hi Shalu,

If you give the samples different SM tags (sample names) in the bam file header, Bis-SNP will treat them separately,

e.g.

@RG ID:sample_id_1 PL:illumina Hiseq PU:your_pu LB:your_lib SM:sam1 CN:USC EPIGENOME CENTER

@RG ID:sample_id_2 PL:illumina Hiseq PU:your_pu LB:your_lib SM:sam2 CN:USC EPIGENOME CENTER

In the VCF output, they will still be written to one VCF file, but each sample will have its own genotyping result.
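
To check which sample each read group belongs to before running Bis-SNP, the @RG lines of the bam header can be parsed. The Python sketch below is only an illustration: the function name is made up, and it assumes the header lines are tab-separated, as they are in the output of `samtools view -H`.

```python
def read_group_samples(header_lines):
    """Map each read-group ID to its SM (sample) tag from SAM header lines."""
    samples = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # Each field after "@RG" looks like TAG:value; split on the first colon only.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        samples[fields["ID"]] = fields.get("SM")
    return samples


# Toy header with two read groups that carry different sample names.
header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:sample_id_1\tPL:illumina\tLB:your_lib\tSM:sam1",
    "@RG\tID:sample_id_2\tPL:illumina\tLB:your_lib\tSM:sam2",
]
rg_to_sample = read_group_samples(header)
print(rg_to_sample)  # {'sample_id_1': 'sam1', 'sample_id_2': 'sam2'}
```

If two read groups map to the same SM value, their reads would be expected to be genotyped together as one sample.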

Thanks,

Yaping

shalu jhanwar

Mar 18, 2014, 9:25:49 AM

to bissn...@googlegroups.com

Thanks for your reply.

i) Continuing my previous question, do I need to specify something like "UnifiedGenotyper" for multi-sample calling, as in GATK? Or can I just call BisulfiteGenotyper (the same for single-sample and multi-sample calling)?
ii) Do I need to merge all the bam files, or can I provide more than one bam file as input (as in multi-sample calling with GATK)?

S.

Yaping Liu

Mar 18, 2014, 8:08:59 PM

to bissn...@googlegroups.com

Hi Shalu,

1) You can just use BisulfiteGenotyper directly.

2) No, you can do it the same way as in GATK: -I 1.bam -I 2.bam -I 3.bam .... But more bam files will make the program slower and require more memory.
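
As a small illustration of the repeated -I option, the Python sketch below assembles an argument list with one -I per bam file. The helper name and file names are hypothetical. Only options that appear elsewhere in this thread are used (-T, -R, -I, -vfn1, -vfn2), and the Java launcher and jar path are deliberately left out because they depend on the installation.

```python
def bisulfite_genotyper_args(reference, bams, cpg_vcf, snp_vcf):
    """Build the BisulfiteGenotyper argument list with one -I per bam file."""
    args = ["-T", "BisulfiteGenotyper", "-R", reference]
    for bam in bams:
        args += ["-I", bam]
    args += ["-vfn1", cpg_vcf, "-vfn2", snp_vcf]
    return args


# Hypothetical file names, for illustration only.
args = bisulfite_genotyper_args(
    "reference.fa", ["1.bam", "2.bam", "3.bam"], "cpg.raw.vcf", "snp.raw.vcf"
)
print(" ".join(args))
```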

Yaping

shalu jhanwar

Mar 30, 2014, 12:25:31 PM

to bissn...@googlegroups.com

Hi Liu,

Thank you for your kind replies.

Regarding the interpretation of the Bayesian inference model Pr(G|D)=π(G)Pr(D|G) for SNP detection, please let me know whether the following is correct:

i) At each position, there are 10 possible genotypes.

ii) For each genotype (out of 10), π(G) is calculated from the dbSNP database.

iii) Pr(D|G) is the probability of observing the bisulfite data given a particular genotype (out of the 10 possible). Is this term calculated separately for each genotype? If so, then in the end I have 10 values of Pr(D|G), one for each genotype. Please explain.

iv) If we have 10 values of Pr(D|G), then 10 different values of Pr(G|D) are likewise calculated for each position, and the genotype with the highest probability is taken in the end and reported in the vcf file.

Please correct me in case I am wrong.
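
The reading in i) to iv) can be sketched in Python as follows. This is a toy model only: the likelihood is a plain base-error model that ignores bisulfite conversion and strand, the prior defaults to uniform instead of coming from dbSNP, and all names are invented for illustration. It is not Bis-SNP's actual computation.

```python
from itertools import combinations_with_replacement

# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = list(combinations_with_replacement("ACGT", 2))


def base_likelihood(base, genotype, error=0.01):
    """Pr(observed base | genotype) under a toy, non-bisulfite error model."""
    def given_allele(allele):
        return 1.0 - error if base == allele else error / 3.0
    # Each allele of the genotype is assumed to be sampled with probability 1/2.
    return 0.5 * given_allele(genotype[0]) + 0.5 * given_allele(genotype[1])


def genotype_posteriors(bases, prior=None, error=0.01):
    """Return Pr(G|D) for all 10 genotypes, normalised to sum to 1."""
    if prior is None:
        prior = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}  # assumed uniform
    unnormalised = {}
    for g in GENOTYPES:
        likelihood = 1.0
        for b in bases:
            likelihood *= base_likelihood(b, g, error)  # Pr(D|G), one per genotype
        unnormalised[g] = prior[g] * likelihood          # pi(G) * Pr(D|G)
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


posteriors = genotype_posteriors("AAAAGGGG")
best = max(posteriors, key=posteriors.get)
print(best)
```

For deep coverage the product of likelihoods should be accumulated in log space to avoid underflow; the plain product is kept here for readability.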

Thanks!

shalu jhanwar

Apr 3, 2014, 3:03:46 PM

to bissn...@googlegroups.com

Hi Liu,

Please reply, as I am currently analysing my data. I ran Bis-SNP on one sample (the command is given below).

I saw this entry in the snp.vcf file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TGEE
chr11 7000834 . C T 29.33 PASS CS=+;Context=YH;DP=18;MQ0=0;NS=1;REF=CH;SB=-0.0235 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:29.0,29.5:0,13,0,3,2,0:0:YH:13:18:0,3,13,2:29,0,90:29.33:5
Here, the BRC6 field has these values:
0: no. of C on the cytosine strand
13: no. of T on the cytosine strand
0: no. of A/G/N on the cytosine strand
3: no. of G on the guanine strand
2: no. of A on the guanine strand
0: no. of C/T/N on the guanine strand

But when I looked in IGV, I found the counts to be (A:0, C:3, T:15, G:0, N:0). Please see the attachment (red = negative strand, blue = positive strand). Even for other variants, the value of DP is not the same as the total depth.
Please explain why the values reported in the vcf differ from the actual ones.
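
For reference, the per-sample fields of the record quoted above can be split programmatically. The Python sketch below pairs the FORMAT keys with the sample values and labels the six BRC6 counts using the meanings listed in this post; the labels are taken from that list and are not checked against the Bis-SNP documentation, and the function names are illustrative.

```python
BRC6_LABELS = [
    "C on cytosine strand",
    "T on cytosine strand",
    "A/G/N on cytosine strand",
    "G on guanine strand",
    "A on guanine strand",
    "C/T/N on guanine strand",
]


def parse_sample(format_col, sample_col):
    """Pair the colon-separated FORMAT keys with the sample's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def brc6_counts(sample):
    """Label the six comma-separated BRC6 counts."""
    values = [int(v) for v in sample["BRC6"].split(",")]
    return dict(zip(BRC6_LABELS, values))


# FORMAT and sample columns copied from the chr11:7000834 record above.
sample = parse_sample(
    "GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS",
    "0/1:29.0,29.5:0,13,0,3,2,0:0:YH:13:18:0,3,13,2:29,0,90:29.33:5",
)
counts = brc6_counts(sample)
print(counts, "sum =", sum(counts.values()), "DP =", sample["DP"])
```

In this record the six BRC6 counts add up to the DP value of 18.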

Below is the command I used to generate the vcf file:

BisSNP-0.82.2 Program Args= -R /users/GD/resource/mouse/mm9/full/mm9_NCBI73.fasta -I /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.bam -D /users/GD/resource/mouse/mm9/annotation/variation/mouse-20111102-snps-all.annotated.mm9.vcf -T BisulfiteGenotyper -vfn1 /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/test/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.cpg.raw.vcf -vfn2 /no_backup/so/sjhanwar/WGBS_mouse/Analysis/Rep1/SNP_analysis/test/TGEE.deduplicated.Sorted.desired.realign.mdups.recal.snp.raw.vcf -C CG,1 -C CH,1 -out_modes DEFAULT_FOR_TCGA -stand_call_conf 20 -stand_emit_conf 0 -nt 12 -minConv 1 -vcfCache 1000000 -mmq 30 -mbq 5 -L chr11:7000000-7100000

Thanking you in anticipation of your quick reply.

Shalu

snp_view.png

snp_view1.png

Yaping Liu

Apr 4, 2014, 9:03:16 PM

to bissn...@googlegroups.com, shalu jhanwar, bbe...@usc.edu

Hi Shalu,

I am sorry, we have been out of town recently. The sequence composition shown in the IGV browser is not designed for bisulfite-seq. If you use the samtools pileup/mpileup command to look at the sequence composition at that position, you will find that the real composition should be: cccTTTTTTTTTTTTTtt (the character order may vary, but the numbers of c, t and T should be the same as shown here).

Lower case represents bases on the reverse strand, while upper case represents bases on the forward strand (the reference genome strand). The stats from IGV simply ignore strand specificity, which is fine for non-bisulfite-seq genotyping, but not for bisulfite-seq genotyping and methylation calling.

I would suggest using samtools tview to look at the real sequence composition if you want to do a sanity check on the raw data.
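
The upper/lower case convention can be tallied with a few lines of Python. This sketch assumes a cleaned base string containing only base letters; a raw mpileup column also carries markers such as ".", ",", "^", "$" and indel tokens, which it does not handle. The counts in the example (3 c, 13 T, 2 t) are assumed from the BRC6 values discussed in this thread.

```python
from collections import Counter


def count_by_strand(pileup_bases):
    """Count bases separately for forward (upper case) and reverse (lower case) reads."""
    forward, reverse = Counter(), Counter()
    for ch in pileup_bases:
        if ch.isalpha():
            target = forward if ch.isupper() else reverse
            target[ch.upper()] += 1
    return forward, reverse


# Composition assumed for the example position: 3 c, 13 T, 2 t.
bases = "c" * 3 + "T" * 13 + "t" * 2
forward, reverse = count_by_strand(bases)
print(dict(forward), dict(reverse))
```

A strand-blind count of the same string gives C:3 and T:15, which matches what IGV reported.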

Thanks,

Yaping


shalu jhanwar

Apr 5, 2014, 11:58:22 AM

to bissn...@googlegroups.com

Hi Liu,

Thanks for the reply. I looked at the position in tview as you suggested, but I didn't find the composition you described.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TGEE
chr11 7000834 . C T 29.33 PASS CS=+;Context=YH;DP=18;MQ0=0;NS=1;REF=CH;SB=-0.0235 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:29.0,29.5:0,13,0,3,2,0:0:YH:13:18:0,3,13,2:29,0,90:29.33:5
Here, the BRC6 field has these values:
0: no. of C on the cytosine strand
13: no. of T on the cytosine strand
0: no. of A/G/N on the cytosine strand
3: no. of G on the guanine strand
2: no. of A on the guanine strand
0: no. of C/T/N on the guanine strand

In IGV, the real composition of this variant is as follows (see the very first position in the attached screenshot):
T:4
t:11
C:3 (shown as dot (.) and comma (,))
Please let me know where the problem is. Additionally, it gave Context=YH; what does that mean? I also wonder why, in some entries, the value of DP is not exactly the same as the total no. of reads aligned.

Thanks in anticipation of your quick reply.

Shalu

Screenshot.png

shalu jhanwar

Apr 11, 2014, 12:43:46 PM

to bissn...@googlegroups.com

Hi Liu,

I would highly appreciate it if you could reply; I am still waiting for your answers. Continuing my previous question, I have viewed one result in tview and attached it. Please help me resolve my query.

Thanks!

Shalu

Yaping Liu

Apr 11, 2014, 11:42:11 PM

to bissn...@googlegroups.com, shalu jhanwar, bbe...@usc.edu

Hi Shalu,

I am sorry, we are still out of town and can't reply to email frequently. If you need the answer right now, I can give you a quick and dirty explanation.

I will answer the simple questions first:

For the meaning of "YH", please first read this webpage about the IUPAC codes:

http://www.bioinformatics.org/sms/iupac.html

Here Y means the base could be C/T, and H means it could be A/C/T.
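
As a quick lookup, the standard IUPAC nucleotide codes can be kept in a small Python table. The function name is made up for illustration; the code table itself is the standard one.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_context(context):
    """Expand a context string such as 'YH' into the bases allowed at each position."""
    return [IUPAC[code] for code in context.upper()]


print(expand_context("YH"))  # ['CT', 'ACT']
```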

"the value of DP is not exactly the same as the total no. of reads aligned"

Bis-SNP filters out some bad reads by default. The default filters in our software are described in our Google group, and you can adjust them with the parameters:

https://groups.google.com/forum/#!topic/bissnp-help/XdJZR20aSe8

Some bad reads aligned there cannot be used for genotyping and methylation calling.

Finally, here is the explanation of how we calculate the genotype at this example position (sorry, in the last email I forgot to explain clearly how it works for paired-end data):

I suggest you read the top post in our Bis-SNP group about the principle of paired-end bisulfite-seq very carefully.

https://groups.google.com/forum/#!topic/bissnp-help/lGPDWBx7dN4

You need to combine the strand information (positive/negative) with the read end (first end/second end). IGV and tview are not designed to represent bisulfite-seq data, but you can infer it from the read information by combining both browsers.

The basic principle is:

When you see a C on the positive strand and 1st end, it represents C on the cytosine strand (used in the methylation calculation).

When you see a C on the positive strand and 2nd end, it represents G on the guanine strand (complementary to the reference genome genotype).

When you see a C on the negative strand and 1st end, it represents G on the guanine strand (complementary to the reference genome).

When you see a C on the negative strand and 2nd end, it represents C on the cytosine strand (used in the methylation calculation).
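
The four rules can be written as one small Python function. This is a sketch of the rule as stated in this reply, not code from Bis-SNP; the function names are made up. It assumes the base is given as displayed by the browser (relative to the forward reference strand), generalises the rule from C to the other bases by complementing, and uses the standard SAM flag bits 0x10 (read on reverse strand) and 0x80 (second end of the pair).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def bisulfite_view(base, is_reverse, is_second_end):
    """Return (strand, base) as the rule above counts a displayed base.

    positive + 1st end, or negative + 2nd end -> cytosine strand, base unchanged
    positive + 2nd end, or negative + 1st end -> guanine strand, base complemented
    """
    if is_reverse == is_second_end:
        return "cytosine", base
    return "guanine", COMPLEMENT[base]


def from_sam_flag(base, flag):
    """Same rule, reading strand and read end from a SAM flag."""
    return bisulfite_view(base, bool(flag & 0x10), bool(flag & 0x80))


print(bisulfite_view("C", is_reverse=False, is_second_end=False))
print(bisulfite_view("C", is_reverse=False, is_second_end=True))
print(bisulfite_view("C", is_reverse=True, is_second_end=False))
print(bisulfite_view("C", is_reverse=True, is_second_end=True))
```

For example, a T shown on a positive-strand, second-end read comes out as A on the guanine strand.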

In your tview view, you can see there are 4 Ts:

The first and second "T" are from the positive strand and second end, so they will be counted as "A" on the guanine strand (you can infer this from the surrounding base composition, such as the many G/A mismatches there, or you can check the IGV browser).

The third and fourth "T" are from the positive strand and 1st end, so they will be counted as "T" on the cytosine strand.

Then let's look at the 3 "C"s:

The 1st C (also annotated as ",") is from the negative strand and first end, so the C actually represents G. It will be counted as "G" on the guanine strand.

The 2nd and 3rd "C" (annotated as ".") are from the positive strand and second end, so the C actually represents G. They will be counted as "G" on the guanine strand.

For the other 11 "t", you can infer it yourself...

I wrote this reply very quickly, so let me know if you see any unreasonable explanations.

Yaping


shalu jhanwar

Apr 12, 2014, 9:39:29 PM

to bissn...@googlegroups.com

Hi Liu,

Many thanks for your kind replies and information, and sorry for bugging you. I really need your replies to help me understand the software. Please let me know the following:

- Do we always call the 5' to 3' direction the +ve strand and the 3' to 5' direction the negative strand?

- Is the cytosine strand the original 5' to 3' strand and the guanine strand the original 3' to 5' strand?

If the above concept is correct, then I cannot visualise "The 1st C (also annotated as ",") is from negative strand and first end. So the C actually represent G. It will be counted as "G" in the guanine strand".

I think it should be C on the guanine strand. Please explain. I would highly appreciate it if you could explain with one more position.

Looking at the position (attachment),

i)chr11 7006701 . T C 26.79 PASS CS=+;Context=YH;DP=25;MQ0=0;NS=1;SB=-3.2054 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 0/1:31.4,31.5:0,16,0,2,7,0:0:YH:16:25:16,7,0,2:27,0,222:26.79:5

0: no. of C on the cytosine strand

16: no. of T on the cytosine strand

0: no. of A/G/N on the cytosine strand

2: no. of G on the guanine strand
7: no. of A on the guanine strand

0: no. of C/T/N on the guanine strand

Please find below my attempt at all the types of reads mapped (similar explanations are not repeated). There are 32 reads but DP is 25.

- read no. 1,3,8,16,19,20,21,23: T on +ve strand and first end: represents T on the cytosine strand

- read no. 2,6,7,10,11,14: T on +ve strand and second end: represents A on the guanine strand

- read no. 12,18,25,31: T on -ve strand and first end: represents T on the guanine strand

- read no. 15,17,22,24,26,27,28,29,30,32: T on -ve strand and second end: represents A on the cytosine strand

- read no. 9: C on +ve strand and second end: represents G on the guanine strand

- read no. 13: c on -ve strand and first end: represents G on the guanine strand (according to your explanation, which I am unable to understand)

I am going wrong somewhere, but I cannot see where. Please, Liu, explain using this example.

- The total no. of reads mapped (32) and DP (25) are very different here. How do I know which reads are counted?

- My sequencing is directional; will Bis-SNP give me genotype information as well?

Please reply.

I am unable to attach the attachment here, so please find it in your Gmail. Sorry for the inconvenience.

Shalu
