Can you elaborate on what you mean by not "very robust"?
It may help to think of a PV model as transforming the curve of the irradiance data into the curve of the measured power. A metric quantifying agreement between simulated and measured power is therefore likely to behave much as it would when comparing measured irradiance with measured power. On a clear-sky day, I would expect close agreement as measured by either R2 or RMSE. But on a day with variable irradiance conditions, at 1-minute intervals the time offsets between shadows on the irradiance sensor and shadows on the arrays, along with differences in how much of the array is shaded, can result in a low R2 or a large RMSE. That's not a problem with the metrics, but rather a consequence of comparing two signals with unaligned shadow features at a 1-minute time scale.
If that's the case with your data, you can try smoothing it by time-averaging to e.g. 15 minutes.
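To illustrate the averaging idea, here is a rough sketch using pandas. Everything in it is made up for illustration: the data are synthetic (a sine-shaped clear-sky curve with a repeating cloud pattern), the 3-minute offset between the sensor and the array is an assumption, and R2 is taken to be the squared Pearson correlation, which may not be the definition you are using. Substitute your own simulated and measured series.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute data for a 12-hour day (illustrative only, not measurements).
index = pd.date_range("2024-06-01 06:00", periods=720, freq="min")
minutes = np.arange(720)
clear_sky = np.sin(np.pi * minutes / 720)  # normalized clear-sky shape


def transmittance(t):
    # Hypothetical cloud pattern: 4 shaded minutes out of every 10.
    return np.where(t % 10 < 4, 0.3, 1.0)


# Simulated power follows the shadows seen by the irradiance sensor; measured
# power sees the same shadows 3 minutes later (assumed sensor-to-array offset).
simulated = pd.Series(clear_sky * transmittance(minutes), index=index)
measured = pd.Series(clear_sky * transmittance(minutes - 3), index=index)


def rmse(a, b):
    # Root-mean-square error between two aligned series.
    return float(np.sqrt(((a - b) ** 2).mean()))


def r2(a, b):
    # Squared Pearson correlation, one common reading of "R2".
    return float(a.corr(b) ** 2)


def metrics_at(rule):
    # Time-average both series to the given interval, then compare them.
    a = simulated.resample(rule).mean()
    b = measured.resample(rule).mean()
    return r2(a, b), rmse(a, b)


r2_1min, rmse_1min = r2(simulated, measured), rmse(simulated, measured)
r2_15min, rmse_15min = metrics_at("15min")
print(f"1-min:  R2={r2_1min:.2f}  RMSE={rmse_1min:.3f}")
print(f"15-min: R2={r2_15min:.2f}  RMSE={rmse_15min:.3f}")
```

With this toy pattern, the 15-minute averages agree much more closely than the 1-minute values, because the averaging window is longer than the shadow offset. How much averaging real data needs will depend on the site and the cloud conditions.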
Cliff