1. First of all, I don't know how you computed the FRR results, so I cannot tell you why one is worse than the other. If you compute the threshold for the FRR on the development set and apply it to both dev and eval, the threshold might not be well suited to the eval set, and hence the algorithm might appear worse than it actually is.
OK, this command evaluates the FRR at FAR=0.001. As mentioned above, on small datasets the low-FAR region is not very stable. Hence the threshold might fluctuate a lot, leading to quite different FRR values for different algorithms, especially on the eval set. Otherwise the approach is correct. You might want to use a different FAR value (this is a parameter of evaluate.py).
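In case it helps, here is a minimal NumPy sketch of that kind of evaluation, assuming similarity scores (higher = more genuine) and hypothetical score arrays; evaluate.py does the equivalent internally, so this is purely for illustration:

    import numpy as np

    def far_threshold(negatives, far=0.001):
        # Lowest threshold at which at most a fraction ``far`` of the
        # impostor (negative) scores is still accepted; ties between
        # equal scores are ignored for simplicity.
        negatives = np.sort(negatives)
        k = min(int(np.ceil(len(negatives) * (1.0 - far))), len(negatives) - 1)
        return negatives[k]

    def frr(positives, threshold):
        # Fraction of genuine (positive) scores falsely rejected.
        return float(np.mean(np.asarray(positives) < threshold))

    # hypothetical scores, just to make the sketch runnable
    rng = np.random.default_rng(0)
    dev_negatives = rng.normal(0.0, 1.0, 5000)
    dev_positives = rng.normal(3.0, 1.0, 500)
    eval_positives = rng.normal(2.8, 1.0, 500)

    # the threshold is fixed on the development set ...
    thr = far_threshold(dev_negatives, far=0.001)
    # ... and applied unchanged to the eval scores
    print("dev  FRR @ FAR=0.001:", frr(dev_positives, thr))
    print("eval FRR @ FAR=0.001:", frr(eval_positives, thr))

Note that with only 5000 impostor scores, FAR=0.001 means the threshold is determined by the 5 highest negative scores alone, which is exactly why this region is so unstable.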
How to select a threshold depends strongly on the application at hand. One common choice is to select the threshold at the equal error rate (EER) on the development set.
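For example, reusing the helpers and score arrays from the sketch above, a simple (again hypothetical) EER-based threshold selection could look like this:

    def eer_threshold(negatives, positives):
        # Scan all observed scores and return the threshold where FAR
        # and FRR on this (development) set are closest to each other.
        candidates = np.sort(np.concatenate([negatives, positives]))
        gaps = [abs(float(np.mean(negatives >= t)) - float(np.mean(positives < t)))
                for t in candidates]
        return candidates[int(np.argmin(gaps))]

    thr = eer_threshold(dev_negatives, dev_positives)
    print("eval FRR at dev EER threshold:", frr(eval_positives, thr))

The EER threshold is usually much more stable than a low-FAR threshold, since it lies in a region where both score distributions still contain plenty of samples.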
Manuel