Thought I would follow up on this in case people were interested in the results.
Yes, running multiple instances worked fine. I wrote a small Java class that sequentially runs a selection of robots in battles via RobocodeEngine and reports the results. My Python script launches several of these in parallel (using asyncio), each with a different 'chunk' of robots, and waits for them all to finish. Each process was surprisingly consistent in how long it took.
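For anyone who wants to try the same thing, here is a rough sketch of what the Python side can look like. It is not my exact script: the helper names (chunked, run_chunk, run_all) are made up for illustration, and it assumes the Java runner takes robot names as command-line arguments and prints its results to stdout.

```python
import asyncio
import sys


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, roughly equal chunks."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


async def run_chunk(base_cmd, robots):
    """Start one runner process for a chunk of robots and return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        *base_cmd, *robots,
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"runner exited with code {proc.returncode}")
    return out.decode()


async def run_all(base_cmd, robots, n_procs):
    """Run one process per chunk concurrently; results come back in chunk order."""
    tasks = [run_chunk(base_cmd, chunk) for chunk in chunked(robots, n_procs)]
    return await asyncio.gather(*tasks)


# Example call (hypothetical command line, adjust the classpath and class name):
#   cmd = ["java", "-cp", "libs/*:.", "BattleRunner"]
#   results = asyncio.run(run_all(cmd, robot_names, n_procs=4))
```

asyncio.gather keeps the outputs in the same order as the chunks, so it is easy to match results back to robots. The classpath separator in the commented example is the Unix one (use ';' on Windows).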
Doing it this way still incurs an overhead every time a Robocode engine spins up, so there were diminishing returns from adding more parallel processes (assuming a fixed gene pool/chunk size). Writing the whole thing in Java and holding onto the Robocode instances is probably the way to go for maximum robots tested per hour.
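To make the diminishing returns concrete, here is a back-of-the-envelope model. It is a simplification I am assuming for illustration, not something I measured: each process pays a fixed startup cost once, robots are split evenly, and there is a free core for every process. The function and parameter names are hypothetical.

```python
def wall_time(n_robots, n_procs, startup_s, per_robot_s):
    """Estimated wall-clock seconds when every process pays startup_s once."""
    robots_per_proc = -(-n_robots // n_procs)  # ceiling division
    return startup_s + robots_per_proc * per_robot_s


def speedup(n_robots, n_procs, startup_s, per_robot_s):
    """Estimated speedup of n_procs processes over a single process."""
    single = wall_time(n_robots, 1, startup_s, per_robot_s)
    return single / wall_time(n_robots, n_procs, startup_s, per_robot_s)
```

Since the startup term does not shrink as processes are added, the speedup in this model is always below the process count, and the gap widens the smaller each chunk gets. Keeping the engine alive between generations removes that term after the first run, which is why the all-Java approach should come out ahead.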
Still, I got massive performance improvements from this, so I would recommend it if you're making a genetic algorithm or something similar.
Once again, thanks for the help!