The H200 starts at around 2.55 min/million states, rises to about 3 after roughly 5 minutes, and holds there. The B200 starts at around 3.6 min/million states and also settles at about 3 after roughly 5 minutes.
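For anyone who prefers to think in throughput, here is a rough conversion of the figures above into states per second. It is only a sketch: the function name is made up for illustration, and "3ish" is treated as exactly 3.0, which is a rounding assumption rather than a measured value.

```python
def states_per_second(minutes_per_million: float) -> float:
    """Convert a rate in minutes per million states to states per second."""
    return 1_000_000 / (minutes_per_million * 60)


# Rates quoted in the post; 3.0 stands in for "about 3" (assumption).
rates = [
    ("H200 at start", 2.55),
    ("B200 at start", 3.6),
    ("both after ~5 min", 3.0),
]

for label, rate in rates:
    print(f"{label}: {rate} min/M states = {states_per_second(rate):,.0f} states/s")
```

On those assumptions the gap at the start is roughly 6,500 versus 4,600 states/s, and both cards end up near 5,600 states/s once they settle.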
I don't know how heavily my test data exercises double precision, whether different datasets vary in that respect, or even how much it is used in the various steps of the code, but I found it interesting that the additional FP64 compute performance didn't do anything for me. I also assume from my results that the B200 being two chips presented as one actually hampers performance.
The overall conclusion I'm drawing here is that a well-matched combination of memory bandwidth and shaders matters most. I think this is illustrated by comparing my numbers for the H200 and the RTX Pro.
I will post more numbers here when I can continue testing.