Dear Clive,
I think this might be a case of your data being "spread too thinly",
given the parameterisations you have specified. Put simply, lots of
parameters require lots of information to get reliable estimates.
Given the relatively small information content of your short
sequences, I suggest starting with a simpler analysis: an HKY+G model
and a strict clock, with no partitioning and no invariant-sites
parameter.
I would imagine that will immediately improve the ESS. Then, you will
need to work out whether increasing the parameterisation is justified
to remove any systematic biases from the "simple" approach.
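
In case it helps to see what the ESS is actually measuring, here is a
rough Python sketch that estimates it from a single logged parameter
trace. It is only an illustration and not the estimator your software
uses: the function names are my own, and stopping the sum at the first
non-positive autocorrelation is a simplifying assumption.

```python
import random


def autocorrelation(xs, lag):
    """Sample autocorrelation of the trace xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return 0.0  # a constant trace carries no autocorrelation signal
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(xs):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations).

    Assumption: the sum is truncated at the first non-positive
    autocorrelation, which is a common but crude rule of thumb.
    """
    n = len(xs)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        rho = autocorrelation(xs, lag)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical traces: one independent, one strongly autocorrelated.
    independent = [rng.gauss(0, 1) for _ in range(2000)]
    sticky = [0.0]
    for _ in range(1999):
        sticky.append(0.9 * sticky[-1] + rng.gauss(0, 1))
    print("independent trace ESS:", round(effective_sample_size(independent)))
    print("autocorrelated trace ESS:", round(effective_sample_size(sticky)))
```

The point is that a chain which mixes poorly, as over-parameterised
models on short sequences tend to, yields far fewer effectively
independent samples than the number of logged states.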
Generally, providing an ML starting tree generated from the same data
is considered "data dredging", and is therefore not ideal. Two
separate runs from random starting trees, combined afterwards, would
be better.
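
By "combined" I mean something like the following Python sketch:
discard the burn-in from each run separately, then pool what is left.
The function names and the 10% burn-in are assumptions for
illustration only; in practice you would use whatever log-combining
tool ships with your software, and only after checking that the two
runs have converged on the same distribution.

```python
def drop_burnin(samples, burnin_fraction=0.1):
    """Return the samples with the leading burn-in portion removed."""
    if not 0.0 <= burnin_fraction < 1.0:
        raise ValueError("burnin_fraction must be in [0, 1)")
    start = int(len(samples) * burnin_fraction)
    return samples[start:]


def combine_runs(runs, burnin_fraction=0.1):
    """Pool several independent runs after trimming burn-in from each."""
    combined = []
    for run in runs:
        combined.extend(drop_burnin(run, burnin_fraction))
    return combined


if __name__ == "__main__":
    # Hypothetical parameter traces from two independent runs.
    run_a = [5.0, 3.1, 2.2, 2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 2.0]
    run_b = [0.2, 1.1, 1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 2.1, 1.9]
    pooled = combine_runs([run_a, run_b], burnin_fraction=0.2)
    print(len(pooled), "samples retained after burn-in")
```

Trimming each run before pooling matters because every run has its
own burn-in period from its own random starting tree.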
Hope this helps.