What a great question!
I think ideally you would compare the CLS value of the page to the current Web Vitals CLS threshold (0.1), and also compare each of the candidate normalized LS scores to its own respective threshold. That would give you a label marking the experience as "Good", "Needs Improvement", or "Poor" for each score. Then you could compare those labels to see which strategy most often matches expectations. Based on our own results, I'd expect the labels to agree 95%+ of the time, so it's really only in the extreme cases that it gets interesting.
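To make that concrete, here's a minimal sketch of the label-agreement idea in TypeScript. The CLS thresholds (0.1 / 0.25) are the documented Core Web Vitals guidance; the candidate thresholds and the sample data are placeholders, since (as noted below) we don't have real guidance for the experimental metrics yet:

```ts
type Rating = 'Good' | 'Needs Improvement' | 'Poor';

interface Thresholds {
  good: number; // scores <= good rate as "Good"
  poor: number; // scores > poor rate as "Poor"
}

// Map a raw score to a rating label using the metric's own thresholds.
function rate(score: number, t: Thresholds): Rating {
  if (score <= t.good) return 'Good';
  if (score <= t.poor) return 'Needs Improvement';
  return 'Poor';
}

// Fraction of experiences where the candidate strategy's label matches
// the CLS label. Assumes the two arrays are paired per experience.
function agreementRate(
  clsScores: number[],
  candidateScores: number[],
  clsThresholds: Thresholds,
  candidateThresholds: Thresholds,
): number {
  let matches = 0;
  for (let i = 0; i < clsScores.length; i++) {
    if (
      rate(clsScores[i], clsThresholds) ===
      rate(candidateScores[i], candidateThresholds)
    ) {
      matches++;
    }
  }
  return matches / clsScores.length;
}

// Placeholder data and candidate thresholds, purely for illustration.
const clsScores = [0.05, 0.12, 0.3];
const candidateScores = [0.08, 0.2, 0.5];
const agreement = agreementRate(
  clsScores,
  candidateScores,
  { good: 0.1, poor: 0.25 },  // documented CLS thresholds
  { good: 0.15, poor: 0.35 }, // hypothetical candidate thresholds
);
console.log(`Labels agree on ${(agreement * 100).toFixed(1)}% of experiences`);
```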
Unfortunately, we do not yet have guidance on expected thresholds for the experimental metrics. That makes it tougher to compare strategies, since it's not reasonable to just compare the scores directly for a single experience (e.g. we of course expect max-sliding-1000 to be higher than max-sliding-300 in the aggregate).
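For intuition on why the raw scores aren't directly comparable, here's a sketch of a max-sliding-window aggregation, assuming each layout shift is a (timestamp, score) event and that the metric is the maximum summed score over any window of the given duration (my reading of the strategy names, not an official definition):

```ts
interface Shift {
  time: number;  // timestamp in ms
  score: number; // layout shift score for this event
}

// Maximum total shift score inside any window of `windowMs`.
// Assumes `shifts` is sorted by time; the optimal window can always be
// aligned to start at some shift, so we only try those placements.
function maxSlidingWindow(shifts: Shift[], windowMs: number): number {
  let best = 0;
  for (let start = 0; start < shifts.length; start++) {
    let sum = 0;
    for (let i = start; i < shifts.length; i++) {
      if (shifts[i].time - shifts[start].time > windowMs) break;
      sum += shifts[i].score;
    }
    best = Math.max(best, sum);
  }
  return best;
}
```

Under that definition, every set of shifts captured by a 300ms window also fits inside some 1000ms window, so `maxSlidingWindow(shifts, 1000) >= maxSlidingWindow(shifts, 300)` for every session, which is exactly why each strategy needs its own threshold rather than a shared one.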
We've been focusing on analyzing a larger corpus of data, but we'll brainstorm how best to evaluate individual experiences with what we have now, and I'll get back to you. Thanks for raising the question :)