Tuning Minimum Substantive Abundance Parameters for MED


Gene Blanchard

Jun 30, 2016, 11:53:54 AM
to Oligotyping and MED
Meren, 

I am wondering if you have any insight into tuning the min-substantive-abundance parameter. I am processing a run of stool samples with about 10.7 million reads, and the min-substantive-abundance is set to 2149 by the MED algorithm. That is much higher than for a lot of the other data I have tried to process (my other inputs were only around 1 million reads). As a result, the number of sequences removed due to the min-substantive-abundance is really high: the default of 2149 removes 7.4 million reads, or ~70% of my total reads.

I stepped the min-substantive-abundance in increments of 50 from 100 to 2700 and compiled the results into a TSV; I have attached it in case you want to look at it.
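The sweep amounts to roughly the following (a sketch only; the `decompose` flag names, file names, and how the per-run read counts are pulled out of each run's output are assumptions that may need adjusting for your MED version):

#!/usr/bin/env python
# Sketch of the -M sweep: run MED's `decompose` once per candidate
# minimum substantive abundance, each run in its own output directory.
import subprocess

INPUT_FASTA = "stool_samples.fa"   # hypothetical input file name

for m in range(100, 2701, 50):
    outdir = "MED-M%04d" % m
    subprocess.run(
        ["decompose", INPUT_FASTA,
         "--min-substantive-abundance", str(m),
         "--output-directory", outdir],
        check=True,
    )
    # The retained/removed read counts for each run can then be collected
    # from that run's output (e.g. its log/run-info files) and written to
    # a single TSV for comparison.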

Do you have a rule of thumb you go by for setting the min-substantive-abundance? Is there a value you would consider far too low or too high?
When setting the value, does the environment the samples came from make a big difference, e.g. vaginal (low diversity) vs. stool (high diversity)?

Thanks,
Gene Blanchard

Attachment: min_sub_abund_stats.txt

A. Murat Eren

Jun 30, 2016, 2:41:23 PM
to Oligotyping and MED
Hi Gene,

You must be frustrated. Sorry about that. The heuristic that sets the default M value does not work well for large datasets, and it becomes much more reasonable to set it manually.

The M heuristic removes a lot of noise, but also rare organisms, in large datasets from environments that are not very diverse. When you have a million copies of the same tag sequence, the frequency of its error-driven descendants becomes far too high compared to low-abundance but true reads. MED deals with this in an inefficient way. There are non-heuristic ways to improve the algorithm, but I don't have a timeline for implementing them yet. I suggest staying around M = 100 at most. I would also suggest taking a look at DADA2 from Ben Callahan and Susan Holmes. I think it is a very decent algorithm and should perform better than MED.
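To make the intuition concrete, here is a back-of-the-envelope illustration (the error rate and abundances are made-up, purely illustrative numbers, not measurements from any dataset):

# Why an error-driven "descendant" of a very abundant sequence can
# outnumber a genuinely present rare organism (illustrative numbers only).
dominant_copies = 1_000_000   # reads of the single most abundant sequence
per_base_error = 0.001        # assumed per-position error rate

# Expected reads carrying an error at one specific position of that sequence:
specific_error_variant = dominant_copies * per_base_error   # ~1,000 reads

rare_true_organism = 500      # a real but low-abundance member (made up)

# An M high enough to remove the error variant (> 1,000) also discards the
# genuine organism present at only 500 reads; a low M keeps both.
print(specific_error_variant > rare_true_organism)   # True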


Best,
--

A. Murat Eren (meren)
http://merenlab.org :: gpg


Abish Stephen

Mar 13, 2017, 8:05:25 AM
to Oligotyping and MED
A bit late to the party, but I thought I'd add my experience with the MED pipeline in case it helps someone out there searching. I'm working with oral samples, and I found that using a low-abundance "reference" species helps to set the -M parameter to the right target manually. I'm also working with a big dataset (~9.4 million reads in total), and beyond a certain -M threshold, MED no longer picks up this particular oligotype/species in my samples. I do several runs of MED, as Gene has done, until I find the threshold after which this species disappears, and I use that value as -M. So far, I'm happy with the balance between the reads discarded to reduce noise and the resolution of the nodes.
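A sketch of that calibration loop might look like the following (the `decompose` flags, the NODE-REPRESENTATIVES.fasta file name, and the exact-substring check are assumptions; adapt them to your MED version, and use a proper aligner if the reference sequence may differ slightly from the node representative):

# Calibrate -M against a low-abundance "reference" species: run MED at
# increasing -M values and find the first value at which the reference
# sequence no longer appears among the node representatives.
import subprocess

REFERENCE_SEQ = "ACGTACGT"        # placeholder; use the actual reference fragment
INPUT_FASTA = "oral_samples.fa"   # hypothetical input file name

def reference_survives(outdir):
    """Return True if REFERENCE_SEQ matches any node representative sequence."""
    seqs, current = [], []
    with open(outdir + "/NODE-REPRESENTATIVES.fasta") as f:
        for line in f:
            if line.startswith(">"):
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(line.strip())
    if current:
        seqs.append("".join(current))
    return any(REFERENCE_SEQ in s for s in seqs)

for m in range(100, 2001, 100):
    outdir = "MED-M%04d" % m
    subprocess.run(["decompose", INPUT_FASTA, "-M", str(m),
                    "--output-directory", outdir], check=True)
    if not reference_survives(outdir):
        print("Reference drops out at -M =", m, "-- use the previous value")
        break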