How to understand rgt-hint differential results


zq Miao

Jan 20, 2022, 12:25:27 AM
to RGT Users

I am analysing ATAC-seq data. rgt-hint differential ran successfully, but I am confused about the result.

As I understand it from the official example, motifs whose line plots differ clearly between samples should be selected as differential motifs.

WX20220120-132218.png

But look at my top differential motifs.

WX20220120-121429.png

WX20220120-121358.png

Why do my top motifs show no difference in the line plot even though they have the most significant p-values?

Other motifs do not have significant p-values, yet their plots show some difference.

WX20220120-131045.png

I ran the following commands. Can anyone help with this problem?

Did I make a mistake, or is there some other issue?

rgt-hint footprinting --atac-seq --paired-end --organism=mm10 --output-location=./ \
    --output-prefix=WT.merge.footprinting WT.merge.sort.modified.bam WT.Genrich.mod

rgt-hint footprinting --atac-seq --paired-end --organism=mm10 --output-location=./ \
    --output-prefix=DKO.merge.footprinting DKO.merge.sort.modified.bam DKO.Genrich.mod

rgt-motifanalysis matching --organism=mm10 --input-files WT.merge.footprinting.bed DKO.merge.footprinting.bed

rgt-hint differential --organism=mm10 --bc --nc 24 \
    --mpbs-files=./match/WT.merge.footprinting_mpbs.bed,./match/DKO.merge.footprinting_mpbs.bed \
    --reads-files=WT.merge.sort.modified.bam,DKO.merge.sort.modified.bam \
    --conditions=WT,DKO \
    --output-location=WTvsDKO5





Zhijian Li

Jan 20, 2022, 8:43:57 AM
to zq Miao, RGT Users
Hi,

Your results look good and there is no issue with running RGT-HINT.

The point is how to interpret the results properly. The scatter plot you showed summarises the line plots of all TFs, which helps us pick out the TFs that differ most between conditions. However, as you observed, the scatter plot is also affected by the data quality of each TF. For example, TFs with very few binding sites can appear very different, but this is likely caused by noise. We usually only look at TFs with more than 1,000 binding sites; the number of sites is shown in the top-left corner of each line plot.

So my suggestion is that you can sort the TFs by their activity score differences, and then check the line plots to see if they also make sense.
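
The filtering and sorting suggested above can be sketched in a few lines of Python. This is only an illustration: the column names (Motif, Num, TF_Activity, P_values) are assumptions about the statistics table written by rgt-hint differential and should be checked against your own output file, and the values in the toy table are invented.

```python
import csv
import io

# Toy statistics table. Column names are ASSUMED and the numbers are made up
# for illustration; check the header of your own output file before reusing this.
TOY_STATS = """Motif\tNum\tTF_Activity\tP_values
TF_A\t9\t0.90\t0.0001
TF_B\t5200\t0.35\t0.01
TF_C\t1500\t-0.40\t0.008
TF_D\t800\t0.50\t0.002
TF_E\t12000\t0.02\t0.9
"""


def rank_tfs(handle, min_sites=1000):
    """Keep TFs with more than `min_sites` binding sites and sort them by
    the absolute activity difference, largest first."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        num = int(row["Num"])
        if num > min_sites:
            rows.append({
                "motif": row["Motif"],
                "num": num,
                "activity": float(row["TF_Activity"]),
                "p_value": float(row["P_values"]),
            })
    rows.sort(key=lambda r: abs(r["activity"]), reverse=True)
    return rows


ranked = rank_tfs(io.StringIO(TOY_STATS))
for r in ranked:
    print(f"{r['motif']}\tn={r['num']}\tactivity={r['activity']:+.2f}\tp={r['p_value']}")
```

In the toy table, TF_A has the smallest p-value but only 9 sites, so it is dropped before ranking. To use the sketch on real output, pass an opened file instead of the `io.StringIO` object, then inspect the line plots of the top-ranked motifs.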

Let me know if you have any questions.

Best,
Li

______________________________
Zhijian Li
Institute for Computational Genomics
RWTH Aachen University 
Pauwelsstrasse 19
52074 Aachen, Germany





Débora Parrine

Jun 22, 2022, 10:34:59 AM
to RGT Users
Hi Zhijian,

I also have a few questions about the same subject. In the line plot, why does the number of binding sites ("n=") not match the number of sequences in the genome listed in the mpbs file? For example, I have a line plot for LBX2 that shows n=9 (is this the number of sites where HINT found LBX2 binding?). However, in the .txt file generated by the footprint analysis, I see more than 800 genomic ranges with scores around 10-12. Shouldn't those numbers be the same?

So, going back to Miao's question: the line plots account for the statistical value, but the scatter plots do not?

Zhijian Li

Jun 22, 2022, 11:14:47 AM
to Débora Parrine, RGT Users
Hi,

> I also have a few questions about the same subject. In the line plot, why does the number of binding sites ("n=") not match the number of sequences in the genome listed in the mpbs file? For example, I have a line plot for LBX2 that shows n=9 (is this the number of sites where HINT found LBX2 binding?).

The number of binding sites (n = ) is the total across all compared conditions after duplicate binding sites have been removed. For example, say we are comparing two conditions, A and B. We first predict the TF binding sites for each condition, which gives A_mpbs.bed and B_mpbs.bed.

However, it's possible that some TFs share a lot of binding sites between A and B. In this case, rgt-hint will remove the duplicates.
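
As a rough illustration of why "n=" can differ from the row count of either MPBS file, the sketch below merges two toy BED tables and counts unique sites per motif. How rgt-hint identifies duplicates internally is not stated in this thread; treating a site as the tuple (chromosome, start, end, motif name, strand) is an assumption, and the coordinates are invented.

```python
import io
from collections import Counter

# Toy MPBS-like BED content: chrom, start, end, motif name, score, strand.
# All values are invented for illustration.
A_MPBS = """chr1\t100\t110\tTF_A\t11.2\t+
chr1\t500\t510\tTF_A\t10.4\t-
chr2\t300\t312\tTF_B\t12.0\t+
"""
B_MPBS = """chr1\t100\t110\tTF_A\t11.2\t+
chr3\t700\t710\tTF_A\t10.9\t+
chr2\t300\t312\tTF_B\t12.0\t+
"""


def read_sites(handle):
    """Return the set of sites in one BED-like file, ignoring the score column."""
    sites = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, start, end, name, _score, strand = fields[:6]
        sites.add((chrom, int(start), int(end), name, strand))
    return sites


def sites_per_motif(*handles):
    """Merge sites from all conditions, drop duplicates, count per motif."""
    merged = set()
    for handle in handles:
        merged |= read_sites(handle)
    return Counter(site[3] for site in merged)


counts = sites_per_motif(io.StringIO(A_MPBS), io.StringIO(B_MPBS))
for motif, n in sorted(counts.items()):
    print(f"{motif}\tn={n}")
```

Here TF_A has two sites in each condition but one is shared, so the merged count is 3 rather than 4.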
 
> However, in the .txt file generated by the footprint analysis, I see more than 800 genomic ranges with scores around 10-12. Shouldn't those numbers be the same?
Do you mean the MPBS file that contains the prediction of the binding site?

> So, going back to Miao's question: the line plots account for the statistical value, but the scatter plots do not?

Actually, it's the opposite. The line plots are just signals summarised over all predicted binding sites of a TF.
The open chromatin profile at a single site is usually too sparse to give a clear line plot, so we aggregate the data from all binding sites, and that aggregate is what the line plot shows.

Then to compare the TF activity, we estimate a single value for each TF and condition based on the line plot. 
Since we have a lot of TFs, we can perform the z-score transformation and obtain a p-value for each TF, as shown in the scatter plot.
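
The z-score step can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the exact procedure used by rgt-hint: it standardises the per-TF activity differences with the sample standard deviation and derives a two-sided p-value from the standard normal distribution. The input values are invented.

```python
import math
import statistics


def zscore_pvalues(activity_diff):
    """Map each TF to (z-score, two-sided normal p-value).

    `activity_diff` is a dict of TF name -> activity difference between the
    two conditions. The two-sided normal test is an ASSUMPTION of this sketch.
    """
    names = list(activity_diff)
    values = [activity_diff[name] for name in names]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    result = {}
    for name, value in zip(names, values):
        z = (value - mean) / sd
        p = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for Z ~ N(0, 1)
        result[name] = (z, p)
    return result


# Invented activity differences: most TFs barely change, TF_D stands out.
toy = {"TF_A": 0.02, "TF_B": -0.03, "TF_C": 0.01,
       "TF_D": 0.45, "TF_E": -0.01, "TF_F": 0.00}
scores = zscore_pvalues(toy)
for name, (z, p) in sorted(scores.items(), key=lambda kv: kv[1][1]):
    print(f"{name}\tz={z:+.2f}\tp={p:.3g}")
```

Because each TF is scored relative to the distribution over all TFs, a p-value computed this way says nothing about how many binding sites support the TF, which is why a motif with very few sites can still rank as highly significant.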

rgt-hint computes this automatically when two conditions are compared.
When more than two conditions are used, rgt-hint only estimates the TF activity and writes it to a txt file.
We provided here a tutorial on how to select condition-specific TFs in this case:

Best,
Li

