Well, I mean clean, solid data. Data can be noisy and full of cycle slips or losses of lock; you want long stretches of continuous data from many satellites, with good overlap and good geometry in the sky. The post-processing software needs to identify and repair cycle slips, resolve the integer ambiguities, and frankly recognize when it is dealing with poor data. On really clean data I can get results to converge to 3 mm RMS; on noisier receivers/antennas 5-7 mm is typical, and at 9 mm I know I can do better. At 30-35 km I'd expect to see cm-level numbers, because I'm basically fighting the horizon, the curvature of the Earth, and a general lack of commonality in the observations.
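To make the "identify cycle slips" part concrete, here is a minimal sketch of one common screening technique, the time-differenced geometry-free (L1 minus L2) combination. This is illustrative only, not what any particular package does, and the 5 cm default threshold is an assumption you would tune to your receiver noise and sample interval:

```python
import numpy as np

def flag_cycle_slips(phi_l1_m, phi_l2_m, threshold_m=0.05):
    """Flag likely cycle slips using the geometry-free (L1 - L2) combination.

    phi_l1_m, phi_l2_m: carrier phase per epoch, already scaled to metres.
    The geometry-free combination cancels geometry and clocks, leaving the
    slowly varying ionospheric delay, so a jump between epochs larger than
    threshold_m suggests a slip on one of the frequencies.
    Returns the indices of the suspect epochs.
    """
    gf = np.asarray(phi_l1_m, float) - np.asarray(phi_l2_m, float)
    jumps = np.abs(np.diff(gf))
    return np.where(jumps > threshold_m)[0] + 1
```

Real software does much more than this (it also has to repair the slip, i.e. estimate the integer jump, and typically cross-checks with other combinations), but this is the basic idea behind the detection step.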
2 km should be a good distance. I mentioned 5-7 km as the kind of maximum I'd want to deal with; I'm a tad over 7 km (7.1) from my nearest station. They tore out the old choke-ring antennas a few years back, replaced them with some poorer-performing GNSS antennas, and pushed the sample interval to 10 seconds. If you get too far away, the atmospheric paths become too dissimilar. I have a much better NOAA CORS about 5 km from my office, and my neighbouring county has a network of stations where you are never more than 5-7 km from one, as the circles of coverage overlap.
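The distance dependence is often summarized as a constant part plus a part proportional to baseline length (datasheets quote it as "a mm + b ppm"). The coefficients below are illustrative assumptions, not from any particular receiver, but they show why 2-7 km stays at a few mm while 30-35 km drifts toward cm level:

```python
def expected_error_mm(baseline_km, a_mm=3.0, b_ppm=0.5):
    # 1 ppm of baseline is 1 mm per km, so b_ppm * baseline_km is already in mm.
    return a_mm + b_ppm * baseline_km

for d_km in (2.0, 7.1, 35.0):
    print(f"{d_km:5.1f} km -> ~{expected_error_mm(d_km):.1f} mm")
```

With those assumed coefficients you get roughly 4 mm at 2 km, 7 mm at 7.1 km, and 2 cm at 35 km, which lines up with the numbers above.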
I like static tests as an initial validation; if those don't go well, nothing else will. Dynamic tests are very hard to reproduce accurately. Tethered circles would be the easiest to run and validate (see the sketch below). One could run a second receiver in parallel, but then you are comparing two solutions at similar levels of accuracy, and each is going to have its own relative motion due to architecture/clocks/noise.
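For the tethered-circle case, validation can be as simple as a least-squares circle fit to the logged positions plus an RMS of the radial residuals. The sketch below uses the standard Kåsa fit and assumes you have already projected the fixes into a local east/north frame:

```python
import numpy as np

def fit_circle(e, n):
    """Kasa least-squares circle fit: returns (center_e, center_n, radius).

    Linearizes x^2 + y^2 = 2*a*x + 2*b*y + c, solves for (a, b, c),
    then recovers the radius as sqrt(c + a^2 + b^2).
    """
    e, n = np.asarray(e, float), np.asarray(n, float)
    A = np.column_stack([2 * e, 2 * n, np.ones_like(e)])
    rhs = e**2 + n**2
    (ce, cn, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return ce, cn, np.sqrt(c + ce**2 + cn**2)

def radial_rms(e, n):
    """RMS distance of each fix from the fitted circle: a one-number
    score for horizontal accuracy during the dynamic test."""
    ce, cn, r = fit_circle(e, n)
    d = np.hypot(np.asarray(e, float) - ce, np.asarray(n, float) - cn)
    return np.sqrt(np.mean((d - r) ** 2))
```

The fitted radius should also come out close to the physical tether length, which gives you an independent scale check on top of the residual RMS.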
Not a big fan of RTKLIB; I'm waiting for a new version, and I've had about zero success with it in GPS+GLONASS mode.
I do hope Swift hurries up; this has been going on for far too long.