Hi Stephan,
The DER is very closely related to the speaker error rate, but not quite the same. DER is the amount of incorrectly labeled speech, normalized by the total amount of speech. When we have oracle SAD and throw out overlap, the sliding-window ivectors that correspond to non-overlapping speech should basically account for the total amount of speech. We do not throw out ivectors extracted on overlap, however; those time marks are just omitted from scoring. Also, I'm not 100% sure of the specifics, but I believe that, depending on the proportion of the speakers within the recording and on how you set up the speaker ID evaluation, the two metrics can become less correlated.
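To make the scoring point concrete, here is a rough Python sketch (not the actual scoring tool). It assumes oracle SAD, equal-length windows so that counting windows approximates counting time, and a hypothetical `scored` mask marking which windows fall outside overlap. The function name and the brute-force cluster-to-speaker mapping are my own choices for illustration; real scorers work on time marks and use a proper assignment algorithm.

```python
from itertools import permutations

def speaker_error_rate(ref, hyp, scored=None):
    """Fraction of scored windows whose mapped cluster label disagrees with the reference.

    ref, hyp: per-window reference speaker and hypothesized cluster labels.
    scored:   per-window booleans; False means the window is omitted from scoring
              (e.g. overlap), even though it was still clustered.
    """
    if scored is None:
        scored = [True] * len(ref)
    pairs = [(r, h) for r, h, s in zip(ref, hyp, scored) if s]
    if not pairs:
        return 0.0
    ref_ids = sorted({r for r, _ in pairs})
    hyp_ids = sorted({h for _, h in pairs})
    # Pad with None so surplus clusters can map to "no speaker".
    padded_refs = ref_ids + [None] * max(0, len(hyp_ids) - len(ref_ids))
    best_correct = 0
    # Brute force over one-to-one mappings; only practical for a few speakers.
    for perm in permutations(padded_refs, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(1 for r, h in pairs if mapping[h] == r)
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(pairs)

ref = ["A", "A", "A", "B", "B", "B"]
hyp = [0, 0, 1, 1, 1, 0]
no_overlap = [True, True, False, True, True, True]  # third window treated as overlap

err_all = speaker_error_rate(ref, hyp)
err_masked = speaker_error_rate(ref, hyp, no_overlap)
```

The masked window still received a cluster label; it simply does not count toward the numerator or the denominator, which is why the two calls give different rates.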
In addition, the clustering algorithm should affect outliers in a way that pure speaker ID evaluation would not. Since the distance between clusters is the average pairwise distance between all points across the clusters, there can be individual pairs that would otherwise be scored incorrectly but are not, because of the individual points' similarity to their respective speaker clusters.
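A toy Python sketch of that averaging effect, under loud assumptions: 2-D points stand in for ivectors, Euclidean distance stands in for the real PLDA or cosine score, and the threshold value is made up for the example.

```python
import math

def average_linkage(cluster_a, cluster_b):
    """Average pairwise distance between all points across two clusters."""
    total = sum(math.dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical data: (4, 0) is an outlier that belongs to speaker 1
# but sits close to speaker 2.
speaker_1 = [(0, 0), (0, 1), (1, 0), (4, 0)]
speaker_2 = [(6, 0), (6, 1), (7, 0)]
threshold = 3.0  # illustrative "same speaker" distance threshold

# Scored as an isolated pair, the outlier looks like the same speaker as (6, 0).
pair_distance = math.dist((4, 0), (6, 0))
pair_says_same = pair_distance < threshold

# Scored at the cluster level, the average distance keeps the clusters apart.
cluster_distance = average_linkage(speaker_1, speaker_2)
clusters_say_same = cluster_distance < threshold
```

Here the single pair falls under the threshold, but the cluster-level average does not, so the outlier stays with its own speaker.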
As for the PLDA vs. cosine scoring question, I think that's a fair assessment. The effect of PLDA will vary depending on the dataset, but I expect the gains to be significant.
—Matt