I don’t know if this helps, but my research group at the University of Milan - Bicocca and I have written a paper on exactly this topic:
"Giabelli, A., Malandri, L., Mercorio, F., Mezzanzanica, M., & Nobani, N. (2022). Embeddings Evaluation Using a Novel Measure of Semantic Similarity. Cognitive Computation, 14(2), 749-763."
The paper is published in Cognitive Computation, and there is also a Python tool to create the benchmark:
https://pypi.org/project/TaxoSS/
If you need help, feel free to contact me!
Kind regards,
Lorenzo Malandri