Difficulty with GOMo


Konrad Taube

Oct 12, 2024, 4:47:45 PM
to MEME Suite Q&A
Hello,

I am looking for some guidance in troubleshooting my results from MEME. I am able to find motifs just fine with the meme command, but gomo is not returning any results. I am hoping I am just making a silly mistake.

I am working with ATACseq data on a non-model organism (P. fluviatilis), and have peak reads from eye samples. The .fa file I am working with is accepted just fine, and looks something like this:

tail macs2.peaks.fa
>VHII01000302.1:5963-6017
ACAGAGGGCCGGGTGATGACGAGGCGGTGGacgagccaatcacagagctGCGTA
>VHII01000302.1:6483-6552
GCCTCTGTTGTCCTAATTAAGGCTGGGTATCCAGGCCTCTGTTGTCCTAATTAAGGCTGGGTATCCAGG
>VHII01000302.1:6599-6652
AATTAAGGCTGGGTATCCAGGCCACTGTTGTCCTAATTAAGGCTGGGTATCCA
>VHII01000303.1:952-1052
tacacacacacacacagacacacacacacacactctaacacatacacatacatacacacacacacacatacacacacacacatacacacacacacacaca


So, now to the issue. First I did:

meme macs2.peaks.fa -o ./meme-output -dna -mod zoops -nmotifs 25 -minw 4 -maxw 20

And I got the output meme.txt. I then ran fasta-get-markov:

fasta-get-markov  macs2.peaks.fa > macs2.eye.model

The .model file looks like this:

# 0-order Markov frequencies from file  macs2.peaks.fa
# seqs: 142955    min: 41    max: 2019    avg: 129.4    sum: 18495495    alph: DNA
# order 0
A 2.453e-01
C 2.547e-01
G 2.547e-01
T 2.453e-01


Next, I ran ama on meme.txt to produce CisML output:

ama meme.txt  macs2.peaks.fa macs2.eye.model --pvalues --o macs2_eye_ama_output

Which gives me my ama.xml file. Lastly, I took the ama.xml file and ran it through gomo against the zebrafish or human GO-term databases. This is where I am not getting results. Here is the command I used:

gomo fish_danio_rerio_1000_199.na.csv ./macs2_eye_ama_output/ama.xml

For fun I also tried human:

gomo mammal_homo_sapiens_1000_199.na.csv ./macs2_eye_ama_output/ama.xml

And here is the gomo.tsv file contents:

Motif_Identifier GO_Term_Identifier GOMo_Score p-value q-value

# GOMo (Gene Ontology for Motifs): Version 5.5.7 compiled on Sep 19 2024 at 15:25:23
# The format of this file is described at https://meme-suite.org/meme/doc/gomo-output-format.html.
# gomo mammal_homo_sapiens_1000_199.na.csv ./macs2_eye_ama_output/ama.xml


So as you can see, no results. I am a bit confused as to what went wrong. I clearly have motifs, yet I am not finding any results for GOMo? I am feeling very stupid right now, so any tips/fixes/advice would be really appreciated!

Thanks for your time,




cegrant

Oct 14, 2024, 2:40:13 PM
to MEME Suite Q&A
Hi Konrad,

You need to have a GO-term database corresponding to the sequences you are working with. Using a GO-term database for another organism (like mammal_homo_sapiens_1000_199.na.csv in your example) won't work. 
The GO term database uses the sequence names as the link between the sequences in your FASTA file and the GO terms they have been annotated with. For example the first line of mammal_homo_sapiens_1000_199.na.csv is:

GO:0032768 P01308 P01375 P27037 P02649 P17813-2 O15528 P78540 P50052 Q9Y314 P39905 P06734 P01584 P01116 P31749 Q99684 P06280 P19838 Q8WTV0 P30793 P11473 P01579 P37840 Q16665 P00533 P56539

which says that the sequences P01308, P01375, P27037, P02649, .... have been annotated with the GO term GO:0032768. These sequences are the upstream regions for the known human genes.
The sequence names in your starting FASTA file have to match the sequence names in your GO term database.

This means you need the GO annotations for your sequences. This is not something the MEME Suite provides. We simply put a few existing annotations for a handful of the most common model organisms on our website as a convenience.
To use GOMo you'll need to come up with your own GO-term database.
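
A quick way to see the mismatch is to compare the sequence names in the FASTA file with the names listed in the GO-term database. The following is a minimal Python sketch, not part of the MEME Suite; the function names are hypothetical, and it assumes the database fields are separated by whitespace, as in the line shown above.

```python
def fasta_names(lines):
    """Collect sequence names (the first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def go_map_names(lines):
    """Collect every sequence name listed after a GO term in the database."""
    names = set()
    for line in lines:
        fields = line.split()
        # Assumption: first field is the GO term, the rest are sequence names.
        if len(fields) > 1 and fields[0].startswith("GO:"):
            names.update(fields[1:])
    return names


def shared_names(fasta_lines, go_map_lines):
    """Names present in both files; GOMo can only use these."""
    return fasta_names(fasta_lines) & go_map_names(go_map_lines)


# Example using names quoted in this thread.
fasta = [">VHII01000302.1:5963-6017", "ACAGAGGGCCGG", ">VHII01000303.1:952-1052", "tacacacaca"]
go_map = ["GO:0032768 P01308 P01375 P27037"]
print(shared_names(fasta, go_map))  # prints set(): no names in common
```

If the intersection is empty for the real files, GOMo has nothing to link motif scores to GO terms, which is consistent with the empty gomo.tsv above.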

Konrad Taube

Oct 16, 2024, 6:53:24 AM
to MEME Suite Q&A
Hi,

Thanks for the clarification. I've made some changes to my analysis and am hoping to find some troubleshooting assistance!

I downloaded the perch annotation file from NCBI (.gtf file) and used that instead of the csv file for gomo. I also took my macs2 peak data and only considered regions up to 2000bp from the target site. However, looking back at the initial meme command, it's resulting in some less-than-ideal (to my uneducated brain, anyway) results. Here is the command I used:

meme filtered_sequences_filtered.fa -o ./meme-output -dna -mod zoops -nmotifs 25 -minw 4 -maxw 20

Looking at this output, the motifs tend to look something like:

ACACACASACACACACACAC

GAGAGAGAGAGAGAG

ACACACACASASACACACAC

TGTGTGTGTGTGTGTGTGTG

And so on. I get a feeling that this is not ideal. More specifically, it looks something like this:

********************************************************************************
MOTIF TGTGTGTGTGTGTGTGTGTG MEME-1 width =  20  sites = 30258  llr = 649286  E-value = 4.0e-460
********************************************************************************
--------------------------------------------------------------------------------
Motif TGTGTGTGTGTGTGTGTGTG MEME-1 Description
--------------------------------------------------------------------------------
Simplified        A  ::::::::::::::::::::
pos.-specific     C  :2:2:2:2:2:2:2:2:2:1
probability       G  :8:8:8:8:8:8:8:8:8:8
matrix            T  a:a:a:a:a:a:a:a:a:a:

         bits    2.0     * *   *   *    
                 1.8 * * * * * * * * * *
                 1.6 * * * * * * * * * *
                 1.4 * * * * * * * * * *
Relative         1.2 ******* * **********
Entropy          1.0 ********************
(31.0 bits)      0.8 ********************
                 0.6 ********************
                 0.4 ********************
                 0.2 ********************
                 0.0 --------------------

Multilevel           TGTGTGTGTGTGTGTGTGTG
consensus                     C          
sequence                                 



I can attach more of the results if needed. Perhaps I need to check my initial parameters? Perhaps -maxw needs to be lowered to 10, or maybe 8? Any insights would be appreciated :-)

Cheers,

cegrant

Oct 16, 2024, 3:13:15 PM
to MEME Suite Q&A
Hi Konrad, 

It looks like your sequences are full of tandem repeats and low complexity regions. MEME and STREME work by identifying short, similar sequences that appear more often than would be expected by chance. They have no way of distinguishing long runs of tandem repeats from more biologically interesting motifs. You should be masking out repeats and low complexity regions from your sequences using a tool like Dust or RepeatMasker before analyzing them with MEME or STREME. Dust is not available on the public website, but is included with the command line version of the MEME Suite. The source for the command line version of the MEME Suite is available here. The installation guide is here.
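
As a stopgap before running Dust or RepeatMasker, note that the FASTA excerpt earlier in this thread contains lowercase stretches. The sketch below assumes those lowercase letters are soft-masked repeats carried over from the genome assembly, which this thread does not confirm. Under that assumption it replaces them with N so that they do not contribute to motifs. The function names are hypothetical, and this is not a replacement for proper masking.

```python
def hard_mask(seq):
    """Replace lowercase (assumed soft-masked) bases with 'N'."""
    return "".join("N" if base.islower() else base for base in seq)


def masked_fraction(seq):
    """Fraction of a sequence that is lowercase; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


def mask_fasta(lines):
    """Yield FASTA lines with headers unchanged and sequence lines hard-masked."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else hard_mask(line)


# Example with a shortened, made-up record in the same style as the peaks file.
for out in mask_fasta([">VHII01000302.1:5963-6017", "ACAGAGGacgagccaGCGTA"]):
    print(out)
```

Records where masked_fraction is close to 1.0, like the last sequence in the excerpt, are probably better dropped than masked.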

> I downloaded the perch annotation file from NCBI (.gtf file) and used that instead of the csv file for gomo.

That is not going to work. There are many types of annotations. GTF annotations list the locations of key features like genes, but don't include GO tags. GOMo specifically requires GO annotations and they have to be in a CSV format, linking the GO annotation ID to the sequence names. GO annotations are usually provided on NCBI using the GAF format. For example, the GO annotation for Perca_flavescens is available on NCBI as


Note, however, that you are going to have to build your own CSV file using the sequence names from your FASTA file and the GAF file. The GAF itself is not directly usable by GOMo.
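
A minimal Python sketch of that conversion, under stated assumptions: the GAF is read as tab-separated text with "!" comment lines, the gene symbol in column 3, the qualifier in column 4 and the GO ID in column 5; gene_to_seqs is a hypothetical lookup, which you would have to build yourself, from gene symbol to the sequence names used in your FASTA file; and the output uses tabs between fields, which should be checked against the GO-term databases distributed on the MEME Suite website.

```python
from collections import defaultdict


def build_go_map(gaf_lines, gene_to_seqs):
    """Group FASTA sequence names by GO term using GAF annotation lines."""
    go_to_seqs = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a full annotation row
        symbol, qualifier, go_id = cols[2], cols[3], cols[4]
        if "NOT" in qualifier.split("|"):
            continue  # negated annotation: the gene is NOT associated with the term
        for seq_name in gene_to_seqs.get(symbol, ()):
            go_to_seqs[go_id].add(seq_name)
    return {go: sorted(seqs) for go, seqs in go_to_seqs.items()}


def format_go_map(go_map, sep="\t"):
    """One line per GO term: the term followed by its sequence names."""
    return [sep.join([go] + seqs) for go, seqs in sorted(go_map.items())]


# Example with made-up gene symbols, GO IDs and sequence names.
gaf = [
    "!gaf-version: 2.2",
    "DB\tID1\tgeneA\tinvolved_in\tGO:0000001\tREF:1\tIEA\t\tP",
    "DB\tID2\tgeneB\tNOT|involved_in\tGO:0000001\tREF:1\tIEA\t\tP",
    "DB\tID2\tgeneB\tenables\tGO:0000002\tREF:1\tIEA\t\tF",
]
gene_to_seqs = {"geneA": ["seq1", "seq2"], "geneB": ["seq3"]}
for row in format_go_map(build_go_map(gaf, gene_to_seqs)):
    print(row)
```

The hard part is gene_to_seqs: peak names such as VHII01000302.1:5963-6017 are coordinates, so each sequence first has to be assigned to a gene, for example by using the upstream region of each annotated gene as the sequence set.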

We are happy to try to answer questions specifically about using GOMo or other tools in the MEME Suite; however, we don't have the resources to answer general bioinformatics questions or to help you design your analysis. If you are unfamiliar with GO annotations, you should consult with your local mentor. That is outside the scope of the assistance we can provide.

cegrant

Oct 16, 2024, 3:28:02 PM
to MEME Suite Q&A
Konrad,

One further thought. If you are not actually interested in the association between motifs and GO terms, you would be better off using enrichment analysis tools like SEA or CentriMo.
