From my point of view, the training data set used should have the same characteristics as the data set to be annotated. So if you have such a data set for training the system, or if you think you could use a portion of your own data set for training and testing, this would be a good thing to try. In contrast, if you want (or have) to use one of the models distributed with the EOP, you have to consider that they might have been built on data sets whose characteristics differ from those of your data set; for example, the RTE-3 data set on which the EOP was trained is balanced (half of the examples are positive examples).
As regards how to interpret the results and select the best answer, I would suggest the following simple approach: in the simplest case of 1 ENTAILMENT and 3 NonEntailment decisions, there is no problem. An issue could instead arise if, for a question T and its hypotheses H1, H2, H3, H4, the system reports more than one ENTAILMENT (e.g. ENTAILMENT for both T-H1 and T-H2), or if it reports NonEntailment for all of T-H1, T-H2, T-H3, T-H4; in both cases we know this is not possible, since exactly one hypothesis should be entailed.
In the first case I would select, among the pairs annotated with ENTAILMENT, the one for which the system is most confident, and annotate it as the ENTAILMENT pair (the 3 remaining pairs will be annotated with NonEntailment). As regards the second case (NonEntailment for all the generated T-H pairs), I would annotate with ENTAILMENT the pair for which the system is least confident (e.g. given T-H1 NonEntailment 0.8, T-H2 NonEntailment 0.3, T-H3 NonEntailment 0.1, T-H4 NonEntailment 0.5, I would consider T-H3 to be the entailed pair).
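To make the rule concrete, here is a minimal Python sketch of the selection logic. The function name and the (label, confidence) representation are my own assumptions for illustration, not part of the EOP API; in practice you would fill `results` from the engine's output.

```python
def select_entailed_pair(results):
    """Pick the index of the single T-H pair to annotate as ENTAILMENT.

    `results` is a list of (label, confidence) tuples, one per hypothesis,
    e.g. [("NonEntailment", 0.8), ("ENTAILMENT", 0.6), ...].
    (Hypothetical representation of the engine's decisions.)
    """
    entailed = [(conf, i) for i, (label, conf) in enumerate(results)
                if label == "ENTAILMENT"]
    if entailed:
        # One or more ENTAILMENT decisions: keep the most confident one.
        return max(entailed)[1]
    # No ENTAILMENT at all: pick the pair whose NonEntailment decision
    # is the least confident.
    return min((conf, i) for i, (_, conf) in enumerate(results))[1]


# Example from the text above: T-H3 (index 2) is selected,
# since its NonEntailment confidence (0.1) is the lowest.
results = [("NonEntailment", 0.8), ("NonEntailment", 0.3),
           ("NonEntailment", 0.1), ("NonEntailment", 0.5)]
print(select_entailed_pair(results))  # -> 2
```

All the remaining pairs are then annotated with NonEntailment, exactly as in the single-ENTAILMENT case.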