CITS.pl generates truncation sites > 1nt


Andreas Pittroff

Jul 7, 2021, 6:20:28 AM
to CTK User Group

Dear Zhang Lab members,
I am currently working on a miCLIP dataset using your tools.

I used the CITS.pl utility and then tried to generate an overview of the base composition at truncation sites. That is when I noticed that several of the called truncation sites were longer than 1 nt (see below). Initially I used the parameter "--gap 25", so I thought this might be caused by truncation sites being clustered into stretches of multiple nucleotides.

I then turned off clustering (or so I thought) by setting "--gap -1", but the results got even worse.

Is there an explanation or workaround for this? Would it be viable to simply split the n-mers into n separate sites, or do the detection algorithm, p-value, etc. depend on the n-mer record?

Output Truncation sites with --gap 25:
   6922 A
     85 AA
      1 AAA
      5 AC
      1 ACT
      6 AG
     27 AT
    326 C
      1 CA
      1 CC
      2 CT
    998 G
     55 GA
      2 GAT
      1 GG
      5 GT
   4834 T
     26 TA
      1 TAT
      4 TC
      1 TGT
     59 TT
      2 TTT
Output Truncation sites with --gap -1:
  23741 A
    383 AA
      2 AAA
      2 AAC
      9 AAT
     49 AC
      1 ACG
      2 ACT
     17 AG
    239 AT
      2 ATA
      1 ATC
      7 ATT
   3910 C
     55 CA
      4 CAT
      8 CC
      5 CG
     67 CT
      2 CTA
      5 CTT
   6170 G
    238 GA
      5 GAT
      5 GC
      9 GG
     41 GT
      2 GTA
      1 GTT
  27099 T
    205 TA
      4 TAA
      1 TACTT
      6 TAT
     47 TC
      1 TCA
      1 TCT
     24 TG
      1 TGA
      1 TGT
    474 TT
      1 TTA
      2 TTC
      2 TTG
     14 TTT
      1 TTTC
      1 TTTG
      1 TTTT

I would appreciate any kind of help on this.

Best regards,
Andreas

Chaolin Zhang

Jul 7, 2021, 1:28:31 PM
to Andreas Pittroff, CTK User Group
Hi Andreas,

I do not recommend clustering truncation sites (run the command without the --gap option, or use --gap "-1", which is the default).

Chaolin




Andreas Pittroff

Jul 8, 2021, 2:58:49 AM
to CTK User Group
Dear Prof. Zhang,
thank you for your quick response.

In fact, I had already tried that, but the output actually got worse: the number of n-mers went up and even 4-mers showed up (please see the bottom of this post for the truncation-site base composition with the --gap parameter set to -1).
 
Additionally, I checked that there was nothing wrong with the method I used to extract the bases. Here is one example of a truncation record taken straight from the output BED file, which shows that truncation sites are actually reported as n-mers:
dd_Smes_g4_203    498338 498341     CITS_26552[gene=dd_Smes_g4_203_f_c2][PH=10][PH0=0.29][P=1.37e-04]    10    +
The base composition for this example record was "TTT". I could also provide a genome viewer screenshot of the underlying reads for the above record, if that would help you figure this out.


Best,
Andreas

Base Composition:

Chaolin Zhang

Jul 8, 2021, 8:26:08 AM
to Andreas Pittroff, CTK User Group
Hi Andreas,

I checked the program. When consecutive positions have the same number of tags starting at them, they are clustered together during peak finding, regardless of whether you apply a merge step afterwards. The reason is that in this case we cannot be sure which position is the precise truncation site, and we probably do not have single-nucleotide resolution for these sites.
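To make the behavior described above concrete, here is a toy sketch in awk. It is not taken from CITS.pl: the function name `merge_equal_runs` and the two-column input (position, number of tags starting there) are assumptions made purely for illustration, and the real program also performs peak finding and significance testing that are not modeled here.

```shell
# Toy illustration only (NOT CITS.pl code; name and input format are hypothetical).
# Input : "position count" per line, sorted by position (0-based).
# Output: start, end (half-open), count; adjacent positions with equal
#         counts are merged into one interval.
merge_equal_runs() {
  awk '
    # Same count as the previous, directly adjacent position: extend the run.
    NR > 1 && $1 == prev + 1 && $2 == cnt { prev = $1; next }
    # Otherwise flush the finished run before starting a new one.
    NR > 1 { print start "\t" (prev + 1) "\t" cnt }
    { start = $1; prev = $1; cnt = $2 }
    END { if (NR > 0) print start "\t" (prev + 1) "\t" cnt }
  '
}

# Made-up counts: positions 101-103 each have 10 tag starts.
printf '100 3\n101 10\n102 10\n103 10\n104 2\n' | merge_equal_runs
```

With this made-up input, the three adjacent positions that each have 10 tag starts come out as a single 3-nt interval (101 to 104), while the other positions stay 1 nt wide.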

Below is the breakdown of CITS based on size for a representative CLIP experiment:

awk '{print $3-$2}' HepG2.RBFOX2.R2.tag.uniq.rgb.clean.CITS.s30.bed | sort | uniq -c
  14802 1
     86 2
      2 3

You seem to get more doublets and triplets. I still recommend that you do not use --gap to merge neighboring CITS, since merging reduces the resolution. That option was included for historical reasons.

For most downstream analyses, we focus on the singletons.

Chaolin
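Restricting downstream analysis to singletons, as suggested above, could be done with a small awk filter in the same style as the size breakdown. This is only a sketch, assuming the CITS output is a BED file with the 0-based start in column 2 and the end in column 3; the function name `keep_singletons` and the output file name in the usage note are hypothetical.

```shell
# Keep only 1-nt records (end - start == 1) from BED input on stdin.
# Lines are printed unchanged, so the original delimiters are preserved.
keep_singletons() {
  awk '($3 - $2) == 1'
}

# Small made-up example: only the first record is a singleton.
printf 'chr1\t10\t11\tsiteA\t5\t+\nchr1\t20\t22\tsiteB\t7\t-\nchr1\t30\t33\tsiteC\t9\t+\n' \
  | keep_singletons
```

On a real file this might be run as `keep_singletons < HepG2.RBFOX2.R2.tag.uniq.rgb.clean.CITS.s30.bed > CITS.singleton.bed`.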



Andreas Pittroff

Jul 8, 2021, 10:57:41 AM
to CTK User Group
Dear Prof Zhang,
if the merging happens during peak finding, does that mean it happens before statistical testing?

So it would not be viable to simply split the doublets and triplets into separate sites, because the p-values are calculated for the doublet and triplet sites as a whole? Would it therefore be better to omit these sites?

Best,
Andreas

Chaolin Zhang

Jul 8, 2021, 11:04:39 AM
to Andreas Pittroff, CTK User Group
The statistical testing is done for individual positions. If you want, you can split them, but it is worth double-checking whether those doublet/triplet sites provide the same resolution as singletons. In most of our analyses, we chose to omit them since their number is quite small.

Chaolin
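Splitting doublets and triplets into single-nucleotide records, as discussed above, could be sketched as follows. This assumes a 6-column BED file; the function name `split_cits` is hypothetical, and the name, score, and strand columns are simply copied onto every split record, so site names are no longer unique afterwards.

```shell
# Split every n-nt BED record on stdin into n consecutive 1-nt records.
# Output is tab-delimited; columns 4 and onward are copied unchanged.
split_cits() {
  awk 'BEGIN { OFS = "\t" }
  {
    s = $2 + 0; e = $3 + 0        # force numeric start and end
    for (p = s; p < e; p++) {
      $2 = p; $3 = p + 1          # rewrite coordinates for this position
      print
    }
  }'
}

# The triplet record quoted earlier in this thread becomes three 1-nt records.
printf 'dd_Smes_g4_203\t498338\t498341\tCITS_26552[gene=dd_Smes_g4_203_f_c2][PH=10][PH0=0.29][P=1.37e-04]\t10\t+\n' \
  | split_cits
```

Running the split and the singleton-only filter on the same input gives the two site sets to compare.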



Andreas Pittroff

Jul 9, 2021, 2:15:01 AM
to CTK User Group
I see, I guess I will try both - splitting and omitting - and compare the results. Thanks a lot for your help on this.

Best,
Andreas