Test doesn't pick up a problem - what am I doing wrong?


Paul Moore

Jan 17, 2024, 8:52:46 AM
to Hypothesis users
I have what feels to me like a perfect case for hypothesis. I'm writing a function to "simplify" a Python version specifier, so I have a function that takes a specifier set, does some work on it, and spits out a specifier set that's expected to be equivalent. So I should (I think!) be able to test that for any specifier, it matches the same versions before and after I process it.

So my test is basically as follows:

from hypothesis import given, strategies as st
from packaging.specifiers import SpecifierSet

# version() is a custom strategy that generates version strings
@given(version(), st.lists(version()))
def test_no_discrepancies(v, candidates):
    spec = SpecifierSet(f"<={v}, <{v}")
    simp = simplify(spec)
    assert (list(spec.filter(candidates)) ==
            list(simp.filter(candidates))), \
        f"Spec {spec} simplifies to {simp} - wrong for {candidates}"

This test is basically checking that, given a version, if I simplify the specifier "<=v, <v", then no matter what list of versions I supply, I always get the same set of matches for the original and simplified specifiers.

However, when I run it, the test passes. And yet, I know (from manual testing) that my code has a bug and it simplifies "<=v,<v" to "<=v".

There's a very simple example that fails - v = "0" and candidates = ["0"]. And yet, Hypothesis doesn't find it :-(

Is there something I'm doing wrong here? This example seems like something that should be very simple to find, and yet it's not getting identified. If I add an explicit @example decorator, the test fails just fine - but I don't need Hypothesis to check if a known failing example fails... :-(
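
Concretely, that check looks something like the sketch below, where @example pins the known failing input so it is always tried alongside whatever Hypothesis generates:

from hypothesis import example, given, strategies as st

@example("0", ["0"])  # pin the known failing case
@given(version(), st.lists(version()))
def test_no_discrepancies(v, candidates):
    ...  # same body as the test above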

My assumption is that I'm using Hypothesis incorrectly here, and not that there's some sort of bug that I'm triggering. But I don't really know what I might be doing wrong, or where to look for an explanation for what might be going wrong. So any advice or pointers would be welcome. I can post my full test if it would help, but it really isn't much more complicated than the above.

Paul

Paul Zuradzki

Jan 17, 2024, 12:51:38 PM
to Hypothesis users
Some possibilities that come to mind:
  • One thing to examine is whether the version() strategy that you specified will ever produce those examples. It looks like a user-defined strategy, so you may need to adjust it to generate simpler examples, or at least to make the failing examples possible.
  • You can also try increasing `max_examples`:
@given(version(), st.lists(version()))
@settings(max_examples=500)
def test_no_discrepancies(v, candidates):
    ...

- Paul Z

Zac Hatfield Dodds

Jan 17, 2024, 9:23:34 PM
to Paul Zuradzki, Paul Moore, Hypothesis users
Seeing the strategy definition as well would definitely be helpful - one perspective on this is "it seems that our versions() strategy doesn't generate "0", or perhaps other edge cases, often enough to find certain bugs".  

That said, bugs which can only be triggered by a single exact value are inherently hard for randomized testing tools like Hypothesis - we upweight a lot of special cases and use various heuristics, but ultimately the 'right way' to find this kind of bug is with an SMT solver or similar.  (which is why we're working with the maintainers of CrossHair to support Z3 as a Hypothesis backend!)

Specific things I'd try here:
  • Try to build knowledge of your edge cases into your strategies.  Restricting the search space somewhat, e.g. major versions in [0..3] and minor + patch versions in [0..15], makes it impossible to find bugs which trigger only outside that range but can make it more likely to find things inside it - make the case-by-case tradeoff in cases where you know the interaction is what matters.  (but not otherwise; most missed bugs I see are because of too-narrow strategies)
  • Pick the version-to-compare-to from the list of versions (see the sketch after this list).  This enormously upweights the chance of having some kind of collision or otherwise hitting comparison edge cases.
  • Pick an arbitrary (set of?) comparisons.  This would make it less likely to find your known bug, but increases the surface area of others you could find.  If it's a short list, you could parametrize over it instead of using Hypothesis; alternatively turn up the max_examples.
  • Use https://pypi.org/project/hypofuzz/ for coverage-feedback-guided search.  Leaving this running for a few minutes, or overnight, routinely finds bugs that I hadn't otherwise found at all.
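
As a rough sketch of that second suggestion (assuming your version() strategy and your simplify() function), you could draw the comparison version from the candidates themselves via st.data():

from hypothesis import given, strategies as st
from packaging.specifiers import SpecifierSet

@given(st.lists(version(), min_size=1), st.data())
def test_no_discrepancies(candidates, data):
    # Draw the version to compare against from the candidate list itself,
    # so the specifier boundary always coincides with at least one candidate.
    v = data.draw(st.sampled_from(candidates))
    spec = SpecifierSet(f"<={v}, <{v}")
    simp = simplify(spec)
    assert list(spec.filter(candidates)) == list(simp.filter(candidates))
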
Hope that helps,
Zac


Paul Moore

Jan 18, 2024, 9:00:14 AM
to Hypothesis users
On Thu, 18 Jan 2024 at 04:51, Paul Zuradzki <paulzu...@gmail.com> wrote:
Some possibilities that come to mind:
  • One thing to examine is whether the version() strategy that you specified will ever produce those examples. It looks like a user-defined strategy, so you may need to adjust it to generate simpler examples, or at least to make the failing examples possible.
  • You can also try increasing `max_examples`.

Thanks. I'm not sure how I check what the version strategy generates, other than by inspecting the code. Is there a way that I've not found to say "does this strategy ever generate this value"?

Having said that, increasing max_examples did trigger the failure, so (a) thanks for that suggestion, and (b) I guess that confirms that the strategy *does* generate the failing examples ;-)
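
For what it's worth, hypothesis.find can probe that kind of question directly: it searches a strategy for a value satisfying a predicate and raises NoSuchExample if it gives up, although not finding a value within its budget isn't proof that the strategy can never produce it. A rough sketch:

from hypothesis import find
from hypothesis.errors import NoSuchExample

try:
    found = find(version(), lambda s: s == "0")  # search the strategy for exactly "0"
    print("version() can generate", found)
except NoSuchExample:
    print("'0' not found within the search budget")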

On Thursday 18 January 2024 at 02:23:34 UTC Zac Hatfield-Dodds wrote:
Seeing the strategy definition as well would definitely be helpful - one perspective on this is "it seems that our versions() strategy doesn't generate "0", or perhaps other edge cases, often enough to find certain bugs".

It's fairly simple (if a bit longwinded):

from hypothesis import strategies as st

# Non-negative integers, used for every numeric component
N = st.integers(min_value=0)

@st.composite
def version(draw):
    epoch = ""
    if draw(st.booleans()):
        # Include an epoch
        epoch = f"{draw(N)}!"

    ver = ".".join(map(str, draw(st.lists(N, min_size=1))))

    pre = ""
    if draw(st.booleans()):
        # Include a pre-release segment
        phase = draw(st.sampled_from(["a", "b", "rc"]))
        pre = f"{phase}{draw(N)}"

    post = ""
    if draw(st.booleans()):
        # Include a post-release segment
        post = f".post{draw(N)}"

    dev = ""
    if draw(st.booleans()):
        # Include a dev release segment
        dev = f".dev{draw(N)}"

    return f"{epoch}{ver}{pre}{post}{dev}"


That said, bugs which can only be triggered by a single exact value are inherently hard for randomized testing tools like Hypothesis - we upweight a lot of special cases and use various heuristics, but ultimately the 'right way' to find this kind of bug is with an SMT solver or similar.  (which is why we're working with the maintainers of CrossHair to support Z3 as a Hypothesis backend!)

I guess my main problem here is likely to be that I'm simply not thinking about hypothesis in the right way. When I see the term "property-based testing", I'm thinking of it as "checking whether a particular property that should hold for my code actually does hold". In this case, I had a very obvious property - the simplified specifier should accept the same versions as the original one did, so I wrote a test thinking it would tell me whether that was true.

In reality, though, I think there were two flaws in my logic:

1. A successful test doesn't "tell me if the property is true". It's probabilistic, so it can't do that. So I shouldn't be thinking that if the test passes, there are no bugs in my code relating to this property.
2. I should think of the test as looking for counterexamples. So more like fuzz testing - throw a bunch of values at the code and see if anything breaks. But don't think of the test as anything like exhaustive - if I have cases that I think are important to test, I should test them explicitly (whether by using @example, or by writing a conventional test, I guess doesn't matter much).
 
Specific things I'd try here:
  • Try to build knowledge of your edge cases into your strategies.  Restricting the search space somewhat, e.g. major versions in [0..3] and minor + patch versions in [0..15], makes it impossible to find bugs which trigger only outside that range but can make it more likely to find things inside it - make the case-by-case tradeoff in cases where you know the interaction is what matters.  (but not otherwise; most missed bugs I see are because of too-narrow strategies)
That's interesting. I'd gone in the other direction - make the strategy general, so it checks the extreme cases I wouldn't otherwise think of. Versions with 20+ components, versions with the incredibly obscure epoch component, etc... But I think I had a naive view, that the test would "cover" the space of possibilities, whereas in fact, because it's only generating a set number of examples, it will give *less* complete coverage the bigger the space is.
  • Pick the version-to-compare-to from the list of versions.  This enormously upweights the chance of having some kind of collision or otherwise hitting comparison edge cases.
That's something I considered, but backed off from on the basis that "it doesn't test the case where the version to compare to *isn't* in that list". Again, I think my problem is thinking in terms of a property-based test "proving an invariant" rather than "checking for counterexamples".
  • Pick an arbitrary (set of?) comparisons.  This would make it less likely to find your known bug, but increases the surface area of others you could find.  If it's a short list, you could parametrize over it instead of using Hypothesis; alternatively turn up the max_examples.
Turning up the value of max_examples did find the problem, as I noted above. I chose 500 examples, and it didn't take long to run. Also, this comment makes me realise I'd be better off thinking of hypothesis as a kind of parametrised test where I don't have to pick the test values myself - I can let the library pick some "reasonable" values to test.
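
For example (just a sketch, with an illustrative operator list), pytest's parametrize stacks cleanly with @given, so the comparison operator can be parametrised while Hypothesis supplies the versions:

import pytest
from hypothesis import given, strategies as st
from packaging.specifiers import SpecifierSet

@pytest.mark.parametrize("op", ["<", "<=", ">", ">=", "==", "!="])  # ~= omitted: it needs two release components
@given(version(), st.lists(version()))
def test_single_specifier(op, v, candidates):
    spec = SpecifierSet(f"{op}{v}")
    simp = simplify(spec)
    assert list(spec.filter(candidates)) == list(simp.filter(candidates))
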
  • Use https://pypi.org/project/hypofuzz/ for coverage-feedback-guided search.  Leaving this running for a few minutes, or overnight, routinely finds bugs that I hadn't otherwise found at all.
Thanks, that's something I'd not heard of!

So I think that the overall message here is that I need to think differently about hypothesis tests, and not try to make them into something they aren't. I still need to explicitly design tests to check edge cases and to cover the core functionality of my code. Hypothesis-based tests should *support* those tests, not *replace* them. And I should think of hypothesis as more of a way of picking better (and more) examples for checking invariants, but *not* as a way of "proving" that those invariants hold.

What I'm not 100% sure of is where that leaves me in terms of "why should I use hypothesis?" I can write my own tests that check the normal paths through my code, and which (with the help of coverage checking) exercise all (or nearly all) of my code. For edge cases that I think of, I can add tailored tests. Where I was thinking hypothesis would fit in was in finding edge cases that I'd missed. But my experience here is that I shouldn't be too reliant on that - it's quite possible for hypothesis to miss an edge case, just the same as I might not have spotted it by reviewing the code (that's precisely what happened here - the case I'm talking about wasn't picked up by my hypothesis-based tests, and it was only when I was doing some manual experimenting with the API that I triggered the bug and realised my tests hadn't covered it).

I guess the trick is to remember that these are tests, not correctness proofs. And see hypothesis as a way of generating more (and weirder) examples than I might try myself, but nothing more than that.

Is that reasonable?

Paul

Paul Moore

Jan 19, 2024, 10:05:10 AM
to Hypothesis users
Update: With max_examples set to 1000, and actually generalising my test even further, Hypothesis found a bunch of other edge cases for me. So that's a good answer to the "why should I use hypothesis?" question :-) I just need to (a) have more reasonable expectations, and (b) consider not using the default number of examples (or use hypofuzz) if I *do* want to try to establish "does this invariant hold".

Paul

Paul

Jan 19, 2024, 11:38:50 AM
to Paul Moore, Hypothesis users
🙌 Glad your troubleshooting worked! Thanks for sharing the results and what you tried.

- Paul (Z)


Anne Archibald

Feb 29, 2024, 4:23:45 AM
to Hypothesis users
On Thursday 18 January 2024 at 14:00:14 UTC Paul Moore wrote:
Specific things I'd try here:
  • Try to build knowledge of your edge cases into your strategies.  Restricting the search space somewhat, e.g. major versions in [0..3] and minor + patch versions in [0..15], makes it impossible to find bugs which trigger only outside that range but can make it more likely to find things inside it - make the case-by-case tradeoff in cases where you know the interaction is what matters.  (but not otherwise; most missed bugs I see are because of too-narrow strategies)
That's interesting. I'd gone in the other direction - make the strategy general, so it checks the extreme cases I wouldn't otherwise think of. Versions with 20+ components, versions with the incredibly obscure epoch component, etc... But I think I had a naive view, that the test would "cover" the space of possibilities, whereas in fact, because it's only generating a set number of examples, it will give *less* complete coverage the bigger the space is.

There is an effective way to ensure that edge cases get covered without restricting your search space: the one_of strategy draws from one of several strategies, simplifying towards the first. So you can use one_of(known simple cases, general cases, known edge cases), and hypothesis will explore the edge cases as well as the space in general. It will also try to simplify failures down so that they aren't edge cases, if that still triggers the failure.

For example, if you have some precision time-based computations, you can generate your times with one_of(days without leap seconds, days, days with leap seconds) - hypothesis wouldn't have known which days have leap seconds to check them, but you can tell it, without losing generality in your exploration.
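
Applied to the version-specifier case from earlier in this thread, that might look something like the following sketch (the particular edge cases listed are purely illustrative):

from hypothesis import strategies as st

versions_with_edge_cases = st.one_of(
    st.sampled_from(["0", "1", "1.0"]),             # known simple cases; failures shrink towards these
    version(),                                       # the fully general strategy from earlier in the thread
    st.sampled_from(["0!0", "1.0rc0.post0.dev0"]),   # known awkward cases, now far more likely to be drawn
)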

Anne 