Batch 1: label species A strains as #1 and species B as #0
Batch 2: label species B strains as #1 and species A as #0
The analysis you outlined looked ok. yes, you need to do this in 2 batches, labelling one clade as the foreground each time.
you didn't specify the null model for the branch-site test correctly. the two models are as follows
H0: model = 2, NSsites = 2, fix_omega = 1, omega = 1
Ha: model = 2, NSsites = 2, fix_omega = 0
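
A minimal sketch of how the two settings above could be written out as control-file lines, one fragment per model. The helper name `render_ctl_fragment` is hypothetical, and a real codeml control file needs further entries (sequence file, tree file, output file and so on) that are not shown here.

```python
# Settings for the two branch-site models, taken from the H0/Ha lines above.
# "NSsites" is the spelling codeml uses for the site-model variable.
H0 = {"model": 2, "NSsites": 2, "fix_omega": 1, "omega": 1}
HA = {"model": 2, "NSsites": 2, "fix_omega": 0}

def render_ctl_fragment(settings):
    """Return 'name = value' lines for the given settings (hypothetical helper)."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())

print("H0 fragment:")
print(render_ctl_fragment(H0))
print("Ha fragment:")
print(render_ctl_fragment(HA))
```

Each batch (species A as foreground, then species B as foreground) would need both fragments, so four runs per gene in total.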
if you have an outgroup that is not too distant, it should be useful to include it.
otherwise you need to be careful about branch labelling, as the tree should be unrooted.
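
As an illustration of the labelling for the two batches, here is a rough sketch that appends #1 to the named strains in a Newick string. The tree, the strain names and the function `label_foreground` are all made up for the example. It assumes a topology-only tree (no branch lengths) written in unrooted form with a trifurcation at the base, and it marks only the terminal branches of the listed strains. Branches left unlabelled are treated by codeml as #0.

```python
import re

def label_foreground(newick, foreground, label="#1"):
    """Append a branch label after each tip whose name is in `foreground`.

    Assumes a topology-only Newick string (hypothetical helper, not part of PAML).
    """
    fg = set(foreground)

    def tag(match):
        name = match.group(0)
        return f"{name} {label}" if name in fg else name

    # Tip names: runs of characters that are not Newick punctuation, spaces or '#'.
    return re.sub(r"[^\s(),:;#]+", tag, newick)

# Hypothetical unrooted tree: two strains per species plus an outgroup.
tree = "((A1,A2),(B1,B2),Out);"

batch1 = label_foreground(tree, ["A1", "A2"])  # species A as foreground
batch2 = label_foreground(tree, ["B1", "B2"])  # species B as foreground
print(batch1)
print(batch2)
```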
if the dn/ds ratio is 999, the upper limit set by the program, the estimate is unreliable. the LRT is still fine despite the difficulty of estimating the dn/ds ratio.
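
For reference, the LRT can be computed from the two log-likelihoods roughly as follows. This is a sketch under the assumption that the statistic is compared with a chi-square distribution with 1 degree of freedom (H0 fixes one parameter, omega = 1, that Ha leaves free). The log-likelihood values are invented for the example, and `lrt_pvalue` is a hypothetical helper.

```python
import math

def lrt_pvalue(lnl_null, lnl_alt):
    """Return (2*delta_lnL, p-value) using a chi-square with 1 d.f. (assumption)."""
    stat = 2.0 * (lnl_alt - lnl_null)
    # Small negative values can arise from incomplete optimisation; treat as 0.
    stat = max(stat, 0.0)
    # For 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Invented log-likelihoods for H0 and Ha from one batch.
stat, p = lrt_pvalue(lnl_null=-1002.0, lnl_alt=-1000.0)
print(f"2*delta_lnL = {stat:.2f}, p = {p:.4f}")
```

The test uses only the log-likelihoods of the two fits, not the dn/ds estimate itself, which is why an estimate stuck at 999 does not invalidate it.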
i suspect that the reliability of the analysis is affected by the level of sequence divergence. i assume that the two species are not too distant (<30% sequence divergence, say), but what about the strains within each species? if the strains are nearly identical, the sequences won't contain much information, and the test will then fail to reach significance simply because the data lack information.
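
A quick way to get a feel for the divergence levels is to compute the proportion of differing sites between pairs of aligned sequences. The sketch below uses the uncorrected p-distance as a stand-in measure, which is my assumption rather than a prescribed method. The sequences and the function name `p_distance` are invented for the example.

```python
def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned nucleotide sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

# Invented aligned sequences: two strains of one species and one of the other.
strain_a1 = "ATGGCCAAGCTGACC"
strain_a2 = "ATGGCCAAGCTGACC"
strain_b1 = "ATGGCTAAGTTGACA"

print(p_distance(strain_a1, strain_a2))  # within species: identical here
print(p_distance(strain_a1, strain_b1))  # between species
```

Within-species distances at or near zero, as in the first comparison, would be the low-information situation described above.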
best wishes,
ziheng