name of fasta sequence in output file, and fimo pipeline


alexjf...@gmail.com

Jan 8, 2017, 2:35:42 PM
to MEME Suite Q&A
Hi everyone,

I have two questions.

I have a file with hundreds of motifs and a file with hundreds of fasta sequences. Is there a way to get the name of the fasta sequence next to the recognized motif?

The name of each fasta file is in UCSC style genomic coordinates so I am able to parse genomic coordinates. However, I would like to retain the UCSC style name somewhere in the output file.

Right now my fimo command looks like this:

fimo --text --parse-genomic-coord --bgfile bg.txt copy_motifs/ABI5_col_v3h.txt fasta_peaks/ABI5_col_v3h.fasta > ABI5_col_v3h_m1_fimo.txt

And my output looks like so:

#pattern name    sequence name    start    stop    strand    score    p-value    q-value    matched sequence
bZIP_tnt.ABI5_col_v3h_m1    chr2    12108264    12108281    +    9.09524    1.86e-05        ATGATGCCACGTGTACTT
bZIP_tnt.ABI5_col_v3h_m1    chr2    12108266    12108283    -    17.3651    9.31e-07        TGAAGTACACGTGGCATC
bZIP_tnt.ABI5_col_v3h_m1    chr2    12108288    12108305    -    22.7937    9.32e-09        AGTTGCTGACGTGGCACT

Is there an option to put the full name of the fasta sequence (i.e. chr2:12108195-12108396) in the output file, so that I can see which sequence each match came from?

Also, I was wondering if anyone has experience running fimo on multiple motifs and databases. Right now I have one folder with 100 motifs and another folder with 100 sequence databases. In both folders the corresponding files I wish to use have the same name, with different suffixes (*.motifs, *.fasta, where appropriate). Is there a way to automate this? I'm thinking along the lines of for f in /copy peaks... but I don't know how to incorporate two files.
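
One way to pair the two folders is to loop over the motif files and derive the matching fasta name from the shared base name. The following is only a minimal sketch: the function name run_fimo_pairs, the output naming (<base>_fimo.txt), and the FIMO and BGFILE override variables are hypothetical, and the motif suffix defaults to .txt because that is what the command above uses. The fimo options are copied from that command.

```shell
# run_fimo_pairs MOTIF_DIR FASTA_DIR OUT_DIR [MOTIF_SUFFIX]
# Runs fimo once per motif file, using the fasta file with the same base name.
# Hypothetical helper; FIMO and BGFILE can be overridden from the environment.
run_fimo_pairs() {
  motif_dir=$1
  fasta_dir=$2
  out_dir=$3
  suffix=${4:-.txt}
  mkdir -p "$out_dir"
  for motif in "$motif_dir"/*"$suffix"; do
    [ -e "$motif" ] || continue                  # the glob matched nothing
    base=$(basename "$motif" "$suffix")          # e.g. ABI5_col_v3h
    fasta="$fasta_dir/$base.fasta"
    if [ ! -f "$fasta" ]; then
      echo "no fasta file for $base, skipping" >&2
      continue
    fi
    # Same options as the single-file command in this post.
    "${FIMO:-fimo}" --text --parse-genomic-coord --bgfile "${BGFILE:-bg.txt}" \
      "$motif" "$fasta" > "$out_dir/${base}_fimo.txt"
  done
}
```

With the folder names from the command above this would be called as run_fimo_pairs copy_motifs fasta_peaks fimo_out (fimo_out being a made-up output folder). Motif files without a matching fasta file are reported on stderr and skipped.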

Thanks!

Alex


CharlesEGrant

Jan 9, 2017, 2:57:59 PM
to MEME Suite Q&A
The name of each fasta file is in UCSC style genomic coordinates so I am able to parse genomic coordinates. However, I would like to retain the UCSC style name somewhere in the output file. 

Do you mean the name of each fasta sequence, rather than each fasta file, is in UCSC style genomic coordinates? Your example command line seems to indicate the former.

There aren't any FIMO options to modify the sequence name. If the '--parse-genomic-coord' option is used, the sequence name is taken to be the string following '>' up to the ':' character.

You'll have to write your own script to generate the sequence names you want. It should be a one-liner in awk.
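
As a rough illustration of such a script (a sketch, not a FIMO feature), the awk below reads the fasta headers first, then appends to each FIMO hit the full header whose range contains it. The function name annotate_fimo and the added "source sequence" column are made up for this example. It assumes the --text output is tab-delimited, that headers have the form chrom:start-end, and that peaks on the same chromosome do not overlap (the first containing range wins; hits with no containing range get NA).

```shell
# annotate_fimo FASTA FIMO_TXT
# Prints the FIMO table with one extra column holding the full fasta header
# (e.g. chr2:12108195-12108396) that each hit falls inside.
annotate_fimo() {
  awk -F'\t' -v OFS='\t' '
    # Pass 1 (fasta file): record each header that looks like chrom:start-end.
    NR == FNR {
      if ($0 ~ /^>/) {
        name = substr($0, 2)
        sub(/[ \t].*$/, "", name)            # keep only the first word
        if (split(name, a, ":") == 2 && split(a[2], b, "-") == 2) {
          n++
          id[n] = name; chrom[n] = a[1]; lo[n] = b[1] + 0; hi[n] = b[2] + 0
        }
      }
      next
    }
    # Pass 2 (FIMO output): extend the header line, then annotate each hit.
    /^#/ { print $0, "source sequence"; next }
    {
      src = "NA"
      for (i = 1; i <= n; i++)
        if ($2 == chrom[i] && $3 + 0 >= lo[i] && $4 + 0 <= hi[i]) { src = id[i]; break }
      print $0, src
    }
  ' "$1" "$2"
}
```

For the files in the original command this would be run as annotate_fimo fasta_peaks/ABI5_col_v3h.fasta ABI5_col_v3h_m1_fimo.txt, redirecting the result to a new file.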