Using FIMO on the transcriptome and getting questionable results


Wanja Kassuhn

Feb 6, 2015, 8:33:16 AM
to meme-...@googlegroups.com
Hello guys,

I'm trying to do a motif prediction for PUM2 on the transcriptome (downloaded from Ensembl), with sequences in this format:

>ENSG00000003137:72129238:72148038:chr2:ENST00000001146:72129238:72132619:72133023:72133307:72143989:72144213:72147631:72148038:72134761:72134916:72135144:72135419
AGGCAATTTTTTTCCTCCCTCTCTCCGCTCCCCTCGCAGCCTCCACTCCCTTTCCCTTGG
CCCCTTCCTCCTTCTCTGTTTCGGCTGGAGGTGCCAGGACCCCCGGCCGCAGCCTCCCCT
...

Now, the prediction I get with this command:

$ fimo --oc /test --max-stored-scores 200000 --verbosity 4 --parse-genomic-coord --motif-pseudo 0.01 --norc /data/PUM2.txt /data/transcripts.fasta

looks like this:

#pattern name    sequence name    start    stop    strand    score    p-value    q-value    matched sequence
PUM2    ENSG00000004799    3256    3263    +    13.6364    2.76e-05    0.854    TGTAAATA
PUM2    ENSG00000005073    2272    2279    +    13.6364    2.76e-05    0.854    TGTAAATA
PUM2    ENSG00000075790    1514    1521    +    13.6364    2.76e-05    0.854    TGTAAATA

As you can see, the q-values are quite high, higher than I would like. So I tried running a FIMO search with --qv-thresh, which returns not a single predicted motif, even though the predicted motifs in my output are clearly all perfect matches for the PUM2 motif. How would I go about getting predictions with q-values of at most 0.1 or so?

Question 2: I used a program that generates a genomic index for each motif (the FIMO output is on a transcriptomic index), taking splice sites and so on into account. For the most part it works, but there are cases where motifs from the prediction overlap the end of the gene, which shouldn't be possible, and other cases where the coordinates of the prediction are higher than the gene is long, so the motifs aren't even on the gene. Why does this happen, and should I just ignore those hits?

I will edit this question if something is unclear.

Thanks for the help.


CharlesEGrant

Feb 6, 2015, 6:03:51 PM
to meme-...@googlegroups.com
Hi Wanja,

For your first question, the problem is that your motif is relatively short. That means that if your sequence database is large enough, even a perfect match will not be statistically significant. The only way to work around this is to reduce the size of your input sequence database. The discussion in this older post may be helpful:

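To make the effect of database size concrete, here is a rough sketch in Python. It uses a Benjamini-Hochberg-style approximation (q ≈ p × N / rank); FIMO's own q-value calculation may differ in detail, so treat the numbers as order-of-magnitude only. The p-value 2.76e-05 is taken from the output above; the numbers of scored positions and of perfect matches are hypothetical values chosen purely for illustration.

```python
def bh_qvalue_estimate(p_value: float, n_positions: int, n_hits: int) -> float:
    """Rough q-value estimate: q ~= p * N / rank, capped at 1.

    n_positions: number of sequence positions scored (assumed, not from FIMO).
    n_hits: number of hits with a p-value at or below p_value (the rank).
    """
    if n_positions <= 0 or n_hits <= 0:
        raise ValueError("n_positions and n_hits must be positive")
    return min(1.0, p_value * n_positions / n_hits)


def max_positions_for_q(p_value: float, n_hits: int, q_target: float) -> int:
    """Largest number of scored positions at which n_hits hits with this
    p-value could still reach q_target under the same approximation."""
    return int(q_target * n_hits / p_value)


if __name__ == "__main__":
    p = 2.76e-05  # p-value of a perfect TGTAAATA match in the output above

    # Hypothetical database sizes and hit count, for illustration only.
    for n_positions in (1_000_000, 10_000_000, 100_000_000):
        q = bh_qvalue_estimate(p, n_positions, n_hits=1000)
        print(f"{n_positions:>11,d} positions -> estimated q = {q:.4f}")

    print("positions allowed for q <= 0.1:",
          max_positions_for_q(p, n_hits=1000, q_target=0.1))
```

Under this approximation the q-value grows linearly with the number of scored positions, which is why restricting the search to a smaller set of sequences (for example only the regions of interest) is the practical lever for a short motif.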

For your second question:

It looks like the headers in your sequence database don't follow the format needed for '--parse-genomic-coord' to work.

The FIMO documentation notes that the headers are expected to be in the UCSC format:

>sequence name:starting position-ending position

 If FIMO can't match the header to the required format, it falls back to just numbering each sequence starting at 1. You'll need to write a script that transforms the current headers into the UCSC format.
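
A minimal sketch of such a script in Python follows. It assumes that the first three colon-separated fields of each header are the gene ID, the genomic start, and the genomic end, as the example header in the question suggests; that field order is a guess and should be checked against how the file was generated. The function names are hypothetical.

```python
import sys


def to_ucsc_header(header: str) -> str:
    """Turn '>NAME:START:END:...' into '>NAME:START-END'.

    Assumes (unverified) that fields 1-3 are name, start, end.
    Any further fields (chromosome, transcript ID, exon bounds) are dropped.
    """
    fields = header.lstrip(">").strip().split(":")
    if len(fields) < 3:
        raise ValueError(f"unexpected header format: {header!r}")
    name = fields[0]
    start, end = int(fields[1]), int(fields[2])  # fails loudly if not numeric
    return f">{name}:{start}-{end}"


def convert_fasta(lines):
    """Yield FASTA lines with every header rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield to_ucsc_header(line) if line.startswith(">") else line


if __name__ == "__main__":
    # Usage: python convert_headers.py < transcripts.fasta > transcripts.ucsc.fasta
    for out_line in convert_fasta(sys.stdin):
        print(out_line)
```

One caveat: with this header format FIMO adds a single start offset to each match position. If the sequences are spliced transcripts, positions downstream of the first exon junction will still not correspond to true genomic coordinates, so an exon-aware mapping step would remain necessary for those hits.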

