How to weight sequences in dreme?


Robert Leach

Feb 28, 2018, 6:54:27 PM
to MEME Suite Q&A
Hi,

I have randomized oligos that underwent a selection step and were sequenced. The number of times a specific oligo occurs should relate to how well a 3-6 nt motif in it bound during the selection, so the reads with many identical copies should be enriched for the motif I want to find. However, how will the dreme motif-finding algorithm be affected when some reads are represented thousands or tens of thousands of times in the dataset? For example, here's the head of my collapsed sequence file, with abundances indicated as "size":

>lib_1;size=20518;
GGGAAAGAATGGTGAGGTTG
>lib_2;size=2033;
AGGGTGCAGTGGAGCAATAT
>lib_3;size=1566;
ATGGGAACCTCACACTCGAT
>lib_4;size=1243;
GGGTTCTGTCTGCTGGGCAC
>lib_5;size=1086;
GCGGGGGCAGATGGACAGGT


Will the first sequences overwhelm dreme?  Should I somehow normalize the numbers of duplicates to get an effective weighting?  Or would dreme work better if I just ran it on the top N reads?  Any advice would be appreciated.

Thanks,
Rob

cegrant

Mar 5, 2018, 3:59:46 PM
to MEME Suite Q&A
I'm afraid any motif that involves an inexact match is going to be washed out in your data. With so many exactly duplicated sequences, DREME is going to end up reporting motifs that are the longest allowed exact matches to portions of the input sequences, more or less in order of the frequency of the matching parent sequence. Running DREME on the top N reads might not be a bad idea for spotting motifs common to all the sequences. However, the estimates of statistical significance reported by DREME won't be meaningful, since you are skewing the sequence data. You might also try running multiple iterations of DREME, setting the -mink and -maxk options to each width from 6 down to 3 in turn. For example:

dreme -p big.fa -mink 6 -maxk 6
dreme -p big.fa -mink 5 -maxk 5
dreme -p big.fa -mink 4 -maxk 4
dreme -p big.fa -mink 3 -maxk 3
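
If you would rather generate the four invocations above than type them, here is a minimal Python sketch. The helper name `dreme_commands` is hypothetical, and the per-width `-oc` output directory is my own addition (an assumption, so that the runs do not overwrite one another's results); drop it if you prefer the default output location.

```python
import shlex


def dreme_commands(fasta_path, widths=(6, 5, 4, 3)):
    """Return one DREME argument list per fixed motif width.

    Setting -mink and -maxk to the same value restricts a run to a
    single width. The -oc directory name is an assumption added here
    to keep the results of each run separate.
    """
    return [
        ["dreme", "-p", fasta_path,
         "-mink", str(k), "-maxk", str(k),
         "-oc", f"dreme_k{k}"]
        for k in widths
    ]


# Print the commands so they can be inspected before running them.
for cmd in dreme_commands("big.fa"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you are happy with the printed commands.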


The -mink and -maxk options control the minimum and maximum width of motifs that will be considered. The results will still be dominated by exact matches to the sequences, but it might help you spot exact matches common to more than one of the families of sequences.
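
For the "top N reads" idea, one possible way to build the input file is sketched below in Python. It assumes the collapsed FASTA format shown in the question, with the abundance stored as `size=<count>` in the header; all function names are hypothetical. Each selected sequence is written once, so the abundance is used only for ranking, and the caveat about statistical significance still applies.

```python
import re

SIZE_RE = re.compile(r"size=(\d+)")


def _record(header, seq_parts):
    """Build a (header, size, sequence) tuple; size defaults to 1."""
    match = SIZE_RE.search(header)
    size = int(match.group(1)) if match else 1
    return header, size, "".join(seq_parts)


def read_collapsed_fasta(lines):
    """Yield (header, size, sequence) from collapsed FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _record(header, seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield _record(header, seq)


def top_n_unique(lines, n):
    """Return the n most abundant records, each appearing once."""
    records = sorted(read_collapsed_fasta(lines),
                     key=lambda rec: rec[1], reverse=True)
    return records[:n]


def to_fasta(records):
    """Format records as FASTA text, one copy per sequence."""
    return "".join(f">{h}\n{s}\n" for h, _, s in records)


example = """\
>lib_1;size=20518;
GGGAAAGAATGGTGAGGTTG
>lib_2;size=2033;
AGGGTGCAGTGGAGCAATAT
>lib_3;size=1566;
ATGGGAACCTCACACTCGAT
""".splitlines()

top = top_n_unique(example, 2)
print(to_fasta(top), end="")
```

To use it on a real file, pass an open file handle instead of `example`, write the result of `to_fasta` to a new file, and give that file to `dreme -p`.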



