Hello,
It does look a bit like a salad :)
The answer is that you have to be careful and understand what you are doing.
If possible, try to apply at least similar front-end optimizations in each configuration you compare.
Using the compiler's default optimization settings, without any extra flags, would definitely give you valid comparison points.
As for running on top of an OS: CoreMark ought to fit in the cache of processors that run Windows CE, so as long as you take care with what you use for timing and set up the iterations to run long enough (I would suggest 2 minutes), you should be fine.
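To make the "run long enough" part concrete, here is a rough sketch of how you could pick the iteration count from a short calibration run. The function name and the sample numbers are made up for illustration and are not part of CoreMark; it simply assumes the iteration rate stays about the same over a longer run.

```python
import math


def iterations_for_duration(calib_iterations, calib_seconds, target_seconds=120.0):
    """Estimate how many iterations are needed to run for about target_seconds.

    calib_iterations / calib_seconds come from a short trial run.
    Hypothetical helper for illustration; not part of CoreMark.
    """
    if calib_iterations <= 0 or calib_seconds <= 0:
        raise ValueError("calibration values must be positive")
    rate = calib_iterations / calib_seconds  # iterations per second
    # Round up so the real run is not shorter than the target.
    return math.ceil(rate * target_seconds)


if __name__ == "__main__":
    # Made-up calibration: 10000 iterations took 4 seconds.
    print(iterations_for_duration(10000, 4.0))
```

The result is only an estimate, so it is still worth checking the measured run time afterwards.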
I would suggest running a few times (5 runs * 2 minutes is only 10 minutes) and taking the median, or the best, as there are valid arguments for each. That way you avoid the headache of relying on a single run that happened just when the OS decided it needed to clean something up.
Depending on what the actual goal of this comparison is, you may want to run many more times and collect stats as another useful data point...
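If it helps, something along these lines could be used to summarize the repeated runs. This is just a minimal sketch: the function name and the scores are invented for illustration, and it assumes you have already collected one score (for example iterations per second) per run.

```python
import statistics


def summarize_runs(scores):
    """Summarize repeated benchmark scores, where higher is better.

    Hypothetical helper for illustration; not part of CoreMark.
    """
    if not scores:
        raise ValueError("need at least one run")
    return {
        "runs": len(scores),
        "median": statistics.median(scores),
        "best": max(scores),
        "worst": min(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Invented scores from five runs; the fourth stands in for a run
    # that was disturbed by OS background activity.
    example = [2510.2, 2508.9, 2511.4, 2391.7, 2509.8]
    print(summarize_runs(example))
```

With scores like these, the disturbed run barely moves the median, while the spread between best and worst tells you how noisy the setup is.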
Hope this helps.
Thanks,
- Shay