1) Dereplication:
a. Is it possible there are two options controlling this?
I ran with derepMin at its default of 1, but sequences were still filtered out based on 'min unique read abundance=2'.
A: Try using the more extensive format, e.g. 2:1;5:4, and check whether the error still occurs. This might be a bug. Also note that you can set specific derep conditions for each sample in the mapping file (header "derepMin").
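To make the extended format concrete, here is a small sketch of how a multi-condition spec like "2:1;5:4" could be evaluated. This is an illustration, not LotuS's actual code, and it assumes the semantics "keep a unique sequence if, for ANY condition a:s, it reaches abundance >= a in at least s samples"; the function names are hypothetical.

```python
# Hypothetical sketch of evaluating a derepMin spec like "2:1;5:4".
# Assumption: a sequence passes if ANY condition "a:s" is met, i.e. it has
# abundance >= a in at least s samples.

def parse_derep_min(spec):
    """Parse '2:1;5:4' into a list of (min_abundance, min_samples) pairs."""
    conds = []
    for part in spec.split(";"):
        a, s = part.split(":")
        conds.append((int(a), int(s)))
    return conds

def passes_derep(per_sample_counts, conds):
    """per_sample_counts: dict sample -> abundance of one unique sequence."""
    for min_abund, min_samples in conds:
        n_samples = sum(1 for c in per_sample_counts.values() if c >= min_abund)
        if n_samples >= min_samples:
            return True
    return False

conds = parse_derep_min("2:1;5:4")
print(passes_derep({"S1": 3}, conds))           # True: abundance >= 2 in 1 sample
print(passes_derep({"S1": 1, "S2": 1}, conds))  # False: never reaches abundance 2
```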
b. What does 'Singletons among these' in the demulti.log file refer to?
E.g. I have 14,354 and 21,893 singletons, but 114,468 sequences not passing derep conditions. So I assume singletons are not the sequences appearing only once (since there were 114,468 sequences with abundance 1)? Or how should I interpret this?
A: These are reads without a second read pair, because the mate read was removed.
c. What are 'Bad Reads recovered with dereplication' (~4,500 in my run) in the demulti.log file?
I assume reads are quality filtered first, then dereplicated (that may be wrong?).
- Are they 'sequences not passing derep conditions' (from the 114,468) because they were unique? If so, why are they recovered for clustering? Or are only 4,500 of the 114,468 recovered as low-quality reads for mapping after seed extension?
- Are they actual quality-filtered reads that were somehow recovered?
A: These are sequences that did not pass the quality filters (any of them), but had a 100% identity match to a read that did pass, and are therefore assumed to be OK.
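The recovery step described above can be sketched in a few lines. This is a minimal illustration of the idea (rescue a quality-failed read if its sequence is identical to one that passed), not the pipeline's actual implementation; all names here are hypothetical.

```python
# Sketch of "bad read recovery": a read that failed quality filtering is
# rescued if its sequence exactly matches a read that passed.

def recover_bad_reads(passed, failed):
    """passed/failed: lists of (read_id, sequence). Returns rescued read ids."""
    good_seqs = {seq for _, seq in passed}
    return [rid for rid, seq in failed if seq in good_seqs]

passed = [("r1", "ACGTACGT"), ("r2", "TTGGAACC")]
failed = [("r3", "ACGTACGT"),   # identical to r1 -> recovered
          ("r4", "ACGTACGA")]   # one mismatch -> stays filtered
print(recover_bad_reads(passed, failed))  # ['r3']
```

Note that a 100% identity requirement makes this safe: the rescued read carries no sequence information beyond what a quality-passing read already contributed.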
2) On the clustering / seed extension:
a. Clustering is performed with the high-quality, dereplicated FWD reads AFTER they are trimmed to a fixed length (still using 170 bp for the moment).
AFTER clustering you check each OTU for the longest high-quality reads: do you stick to the 170 bp here, or do you trace back to the original sequence length?
A: No, this traces back to the original read length AND checks whether a second read is present at high quality. It also tries to optimize the %id sequence similarity of the best recovered read to the cluster seed (170 bp in your case).
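As a simplified sketch of what "tracing back" could look like: for each OTU, the fixed-length seed is replaced by the member read whose original (untrimmed) sequence is longest among high-quality candidates. This ignores the paired-read check and the %id optimization mentioned in the answer, and the function name is hypothetical.

```python
# Simplified seed-extension sketch: pick the longest original-length,
# high-quality read within an OTU to serve as the extended seed.

def extend_seed(otu_members):
    """otu_members: list of (read_id, original_seq, is_high_quality).
    Returns the (read_id, seq) of the longest high-quality member, or None."""
    candidates = [(rid, seq) for rid, seq, hq in otu_members if hq]
    if not candidates:
        return None
    return max(candidates, key=lambda m: len(m[1]))

members = [("a", "ACGT" * 40, True),    # 160 bp, high quality
           ("b", "ACGT" * 60, True),    # 240 bp, high quality -> chosen
           ("c", "ACGT" * 70, False)]   # longest, but low quality -> skipped
print(extend_seed(members)[0])  # 'b'
```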
b. I am currently using a protocol (for organisational reasons) in which my 16S amplicons are treated with the Nextera transposase, which cuts amplicons to random lengths. In short, my library prep consists of fragments of different lengths, and I have reads between 50 and 250 bp.
This means that a lot of reads are initially filtered out before clustering (which might be OK if I sequence deeply enough).
I was wondering if you would recommend setting the mid-quality filter *minSeqLength to e.g. 50 bp, so that these shorter reads are at least included in the abundance estimation?
A: This is a very specific problem; it sounds like a real hassle to work with these kinds of sequences. I would not recommend going below 90 bp. Try to find a balance between recovering a lot of reads and keeping enough read length for the analysis. The log files should also help here; one of them contains a histogram of read lengths in the input files.
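If it helps to choose a cutoff, the read-length distribution is also easy to inspect directly from the input FASTQ, independently of the pipeline's log files. A minimal sketch (the function name is made up; it assumes standard 4-line FASTQ records):

```python
# Compute a read-length histogram from FASTQ lines, e.g. to pick a
# minSeqLength cutoff. Assumes standard 4-line-per-record FASTQ.
from collections import Counter

def length_histogram(fastq_lines):
    """fastq_lines: iterable of FASTQ lines; the sequence is every 4th line."""
    hist = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:                  # sequence line of each record
            hist[len(line.strip())] += 1
    return hist

fastq = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII",
         "@r2", "ACGTT", "+", "IIIII"]
print(sorted(length_histogram(fastq).items()))  # [(5, 1), (10, 1)]
```

For a real file, pass `open("reads.fastq")` (or a gzip-opened handle) as the iterable; the cumulative counts above a candidate cutoff tell you how many reads you would keep.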
Thanks a lot for your time (and making the platform of course!)
Best,
Jolien