Launching threads is expensive: each launch costs resources, which is why people use thread pools. High-performance code usually starts its threads once and then reuses them.
OpenBLAS offers parallelization, and that covers matrix multiplication. Matrix multiplication is actually an easy thing to parallelize, and I can imagine how it does that. My question is: how are these threads launched? Are they launched once, or again on every call?
I have a program that needs to multiply matrices very frequently (it's a quantum mechanics time-evolution simulation), so I'm wondering: does OpenBLAS start new threads for every call, or does it keep some kind of thread pool of its own to do this efficiently? I couldn't find the answer to this question anywhere.
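To make the two strategies concrete, here is a minimal sketch of what I mean. This is not OpenBLAS code and says nothing about how OpenBLAS actually works; the function names (`multiply_rows`, `matmul_spawn_per_call`, `matmul_with_pool`) are made up for illustration, and the plain-Python multiplication only shows thread lifetime, not performance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, out, rows):
    """Fill the given rows of out = a * b (naive matmul, illustration only)."""
    inner, cols = len(b), len(b[0])
    for i in rows:
        out[i] = [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]


def split_rows(n_rows, n_workers):
    """Assign row indices to workers in an interleaved fashion."""
    return [range(w, n_rows, n_workers) for w in range(n_workers)]


def matmul_spawn_per_call(a, b, n_workers=2):
    """Strategy 1: create fresh threads on every call and join them at the end."""
    out = [None] * len(a)
    threads = [
        threading.Thread(target=multiply_rows, args=(a, b, out, rows))
        for rows in split_rows(len(a), n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


def matmul_with_pool(pool, a, b, n_workers=2):
    """Strategy 2: hand the work to threads that were started once and are reused."""
    out = [None] * len(a)
    futures = [
        pool.submit(multiply_rows, a, b, out, rows)
        for rows in split_rows(len(a), n_workers)
    ]
    for f in futures:
        f.result()  # wait for completion and re-raise any worker exception
    return out
```

In my simulation the multiplication sits inside the time-stepping loop, so with the first strategy the thread start-up cost would be paid at every step, while with the second the pool would be created once before the loop and reused.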
Please ask for more details if you require them.