| Implementation cross-pollination | Scott Vokes | 06/09/17 09:13 | Hello quiet mailing list! I've been very actively working on [theft][t] again lately, and was thinking it would be good to get a thread going sharing ideas for implementation choices with other library implementors. Conversations I've had with David (Hypothesis) and Reid (test.check) along the way have influenced theft a lot, even if the ideas aren't so recognizable in C. 1. theft uses a bloom filter to track which parts of the state space have already been tried, to avoid redundant trials and shrinking paths -- it stores a hash based on the combination of arguments. In theft's case, this has historically been a neat solution to several potential issues, but it doesn't seem like any other libraries are using that approach. While other libraries may generate sufficiently differently that it's a non-issue, it still seems useful to detect when the generator starts producing excessive duplicates: it may be a sign of an overly narrow generator, for example. theft's implementation currently uses a dynamic blocked bloom filter, based on the description in Tim Kaler's paper ["Cache Efficient Bloom Filters for Shared Memory Machines"][bloom]. [bloom]: http://tfk.mit.edu/pdf/bloom.pdf 2. When applying theft tests to a codebase after the fact, I've often run into cases where it will shrink from one kind of failure to another, and get hung up on the failure produced by the smallest input. I added a feature (called ["failure tagging"][tag]) where, before returning a failure or around a call to a function that may crash, the property function can set a particular failure ID. The test runner can be configured to only shrink to the same failure tag, when set, or to only shrink to failure tags with numerically higher IDs. This came up recently in a [Hypothesis PR][pr], as well. It seems like a fundamental design issue. 
Only returning the simplest failure isn't all that helpful; sometimes it's better to be able to flag something as a known issue and move on. It also seems useful to prioritize issues that aren't yet tagged -- they might be obscure/novel failures found by a long-running CI process, and discarding a minimal counter-example that reproduces them in favor of another known issue could be a huge loss. For single-worker shrinking, the [implementation][imp] was quite small; I'm working on restructuring the overall test runner for concurrent search and shrinking, but that too should be a very tiny change. Scott / @silentbicycle |
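For readers unfamiliar with the bloom filter idea above, here is a minimal Python sketch of the general technique -- hash the combination of arguments, then check membership before running the trial. This is an illustration only, not theft's actual blocked bloom filter, and all names (`TrialFilter`, `seen_before`) are made up:

```python
import hashlib

class TrialFilter:
    """Bloom filter over hashes of argument combinations.

    False positives are possible (a genuinely new trial may be reported
    as seen), but false negatives are not -- the right trade-off for
    skipping redundant trials and shrinking paths.
    """
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _indexes(self, args):
        # Derive k bit indexes from a single hash of the combined arguments.
        digest = hashlib.sha256(repr(args).encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def seen_before(self, args):
        """Check whether args were (probably) already tried, then record them."""
        idxs = list(self._indexes(args))
        hit = all(self.bits[i // 8] & (1 << (i % 8)) for i in idxs)
        for i in idxs:
            self.bits[i // 8] |= 1 << (i % 8)
        return hit
```

Counting how often `seen_before` returns True also gives the duplicate-rate signal mentioned above: a high hit rate suggests an overly narrow generator.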
| Re: Implementation cross-pollination | David MacIver | 08/09/17 03:07 | On 6 September 2017 at 17:13, Scott Vokes <vok...@gmail.com> wrote:
Yeah, I'm keen to get this list working and implementation cross-pollination is definitely good. :-)
I've always been rather jealous of theft's bloom filter. I'd like to use something that compact, but haven't been able to figure out a way to make it satisfy all my requirements. The problem I have in Hypothesis is that I need just a bit more information than I can get out of the bloom filter. In Hypothesis I have a (currently very inefficiently represented) tree where each child node is keyed off bytes. A node can be one of five types: 1. General (all 256 bytes are possible as children here) 2. Masked (only the low k bits count) 3. Forced (only a single byte is valid as a child here) 4. Complete (corresponds to a previously executed test run) 5. Unexplored (we haven't tried this part of the tree yet) Masked and forced nodes correspond to various places where Hypothesis writes its bytestream into a canonical form (which is useful for improving shrinking and reducing duplicates). This gets used in three ways (the first two of which are essentially variants on how you use the bloom filter): 1. To avoid duplicates in generation we mark tree nodes as dead when we've exhaustively explored below them, and avoid generating examples that would go into a dead node by doing a small amount of rewriting on the fly. 2. To avoid duplication in shrinking we have a "prescreening" process which uses the tree to determine if (after suitable rewriting) the buffer we want to try is either a prefix or an extension of an existing example (the languages Hypothesis generates are intrinsically prefix-free) 3. To guide the shrinking process we sometimes need to remember past examples. To prescreen an example (which is a buffer of bytes), we walk the tree, with each byte in the buffer possibly being rewritten and causing a transition to a child node. Elaborating on the latter two: When shrinking we're basically just trying to minimize the byte array (first lengthwise, then lexicographically) based on information we have from test execution. 
This results in us repeatedly trying shrunk byte arrays. The prescreening process works as follows: We use the byte array as a path through the tree, always walking to the corresponding child. When we encounter a forced node, we rewrite the byte to the forced byte. When we encounter a masked node, we mask off the high bits. If we ever hit a Complete node we reject the example (it corresponds to something we've already seen). If we ever hit an Unexplored node we accept the example and actually run it (it goes somewhere we don't know about), and otherwise we reject the example (it's a prefix of a previous example). (There's actually a bit more to the prescreening than that - in some cases we can tell that even though it hits an unexplored node it's still too short to succeed - but that's the main idea) The guided shrinking: I implemented something relatively recently (https://github.com/HypothesisWorks/hypothesis-python/pull/778) that improves Hypothesis's shrink capability a lot. I keep being surprised by just how much - it's a much bigger improvement than I thought. The motivating example is that people often write something that looks like the following: integers(0, 10).flatmap(lambda n: lists(something, min_size=n, max_size=n)) i.e. they draw a length parameter and then draw that many elements. This is a perfectly reasonable thing to do, but it's something that Hypothesis has previously done a bad job of shrinking because of the underlying model - it required lexicographically minimizing (to reduce the length parameter) at the same time as deleting. For example, if you ended up drawing [0, 0, problem] (likely given the way shrinking proceeds, but it could also happen by chance), you couldn't shrink the list further because you can't delete anywhere except at the last elements. What Hypothesis does now is that when it lexicographically minimizes part of the example it looks to see if the example it ran was valid but boring (i.e. 
satisfied all of the preconditions but did not exhibit a bug). If so it checks if this caused a reduction in the size of the example: If, say, when we shrunk this block the total size went from 80 bytes to 70 bytes, then that means there's 10 bytes worth of deletion we can do. It tries a judicious selection of 10 byte blocks to delete (the 10 bytes immediately after the shrink, followed by any of the intervals that it would normally try deleting which are <= 10 bytes and start after the shrink point). This has worked really well, and automatically adapts nicely to the shape of the problem, but it puts a fairly strong requirement on the example caching. The problem comes when the lexicographically minimized but not deleted example is one that we've seen before (this happens fairly naturally in the shrinking process). We need to not only be able to say if the rewritten form of the example is a prefix/extension of an existing example, we also need to be able to get a fair bit of information about the previous example back (the length and the validity). So alas I think bloom filters are out for Hypothesis. I definitely need to do some more work on more compact storage though. I'm hoping to start rewriting the core of Hypothesis in Rust at some point in the not too distant future, which will be a good opportunity for doing that.
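A toy version of the prescreening walk described above might look like the following. This assumes a simplified tree representation, not Hypothesis's actual internals; the node kinds follow the five types listed earlier, and all names are illustrative:

```python
# Node kinds, matching the five types described in the thread.
GENERAL, MASKED, FORCED, COMPLETE, UNEXPLORED = range(5)

class Node:
    def __init__(self, kind, value=0, children=None):
        self.kind = kind
        self.value = value            # forced byte or bit mask, per kind
        self.children = children or {}

def prescreen(root, buf):
    """Walk the tree along buf, rewriting bytes into canonical form.

    Returns (accept, rewritten): accept is True only if the (rewritten)
    buffer reaches unexplored territory and is worth actually running.
    """
    node, rewritten = root, bytearray()
    for b in buf:
        if node.kind == UNEXPLORED:
            return True, bytes(rewritten)   # goes somewhere new: run it
        if node.kind == COMPLETE:
            return False, bytes(rewritten)  # already saw this example
        if node.kind == FORCED:
            b = node.value                  # canonicalize to the forced byte
        elif node.kind == MASKED:
            b &= node.value                 # keep only the low bits
        rewritten.append(b)
        node = node.children.get(b, Node(UNEXPLORED))
    # Buffer exhausted: accept only if we ended in unexplored territory;
    # otherwise it's a prefix of a previous example.
    return node.kind == UNEXPLORED, bytes(rewritten)
```

The key point is that the rewriting (forcing and masking) happens during the walk, so two buffers that canonicalize to the same example are deduplicated before any test executes.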
I have more thoughts on this but need to go soon and would like to send this email before I do so as to not have a long radio silence. :-) Brief initial thoughts: One big variable here is where the testing library fits into the software testing flow. I get the sense that most property-based testing is slotting into the fast tests on CI part, while theft is often run much more like a fuzzer (continuously running in the background for a long period of time searching for bugs). Hypothesis currently piggybacks on the existing Python test runners a lot, which are not well set up for this continuous workflow. I'd like to get better at that, but it probably requires doing a lot more of our own testing UI. In the part of the space we fit into, this has resulted in a bunch of discussion on the linked PR and Zac has more or less convinced me that it's probably not acceptable to prevent slippage in this context, but instead what we need to do is minimize and show all the bugs found in the course of a shrink.
I'm interested in concurrent shrinking and would like to talk more about this. I'm actually a little surprised that you think it should be a tiny change - one of the problems I've had trying to figure out how this would work is that the order of what examples to try in a shrink is very path-dependent for me, with what you try next depending a lot on what you've tried previously and whether it worked. I can see how you might parallelise some of the shrink steps in Hypothesis, but it would require a fairly robust invalidation mechanism (try these N things in parallel and, as soon as one of them works, finish the currently running jobs and then cancel all of the others) in order to really work. For a simple example, suppose you were trying to delete single bytes from the example (this isn't actually something Hypothesis does, but it does similar things). The current Hypothesis loop is roughly: `i = 0; while i < len(data): if not try_delete_byte(i): i += 1`. So we do a single pass through all of the indices, and don't restart from the beginning. I can imagine enqueuing all of the byte deletions at once, but as soon as one of them succeeds I want that to be the new current example, and so all of the existing examples become less interesting (though any that are still running are probably useful enough to finish and compare and see if they're better than the current best before continuing). Regards, David |
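Fleshed out as runnable Python, with a toy `still_fails` predicate standing in for actually running the test, that single-pass deletion loop might look like this (an illustration of the loop shape, not Hypothesis's actual code):

```python
def shrink_by_deletion(data, still_fails):
    """Single greedy pass: try deleting each byte once, keeping any
    deletion that preserves the failure. After a successful deletion
    the index stays put (the next byte slides into position i); after
    a failed one we advance. We never restart from the beginning."""
    data = bytearray(data)
    i = 0
    while i < len(data):
        candidate = data[:i] + data[i + 1:]
        if still_fails(bytes(candidate)):
            data = candidate        # deletion worked: keep it, same index
        else:
            i += 1                  # deletion broke the failure: move on
    return bytes(data)
```

The path dependence David describes is visible here: whether index `i` advances depends on the outcome of the previous attempt, which is what makes naive parallel enqueueing of all deletions awkward.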
| Re: Implementation cross-pollination | Scott Vokes | 17/09/17 13:49 | > Bloom Filter stuff theft doesn't track shrinking with a tree node like that; it currently has a snapshot of the entropy pool that generated an argument ("current"), and applies a random set of drop/shift/mask/subtract/swap operations to a copy ("candidate"). The operations are all monotonic, i.e., swapping only swaps when the result is lexicographically smaller, and the operations are randomly chosen based on weights from a statistical model of what has been effective thus far. I'm still working on improving that, but it's pretty good about (say) abandoning dropping when it's no longer effective, switching to subtracting small numbers when close to a final result and masking or shifting tend to overshoot, etc. This means that: - Shrinking is probabilistic, rather than exhaustive. It tries a variety of mutations, but is focused more on what tends to effectively reduce entropy rather than exhaustive searching. In particular, I'm trying to avoid backtracking, which can make shrinking take several orders of magnitude longer. - It randomly interleaves dropping bits (reducing the size of the entropy pool) with operations that preserve the length but monotonically reduce the values for individual requests. Each can get the other unstuck. - The canonical internal representation for the current shrinking state is just the flat bit array, paired with an array of request sizes. The bit array can be hashed and checked against the bloom filter pretty quickly. - Potentially useful structural information is getting discarded, in exchange for a form that can be manipulated with very fast bit operations in C. Thus far, this seems like a good trade-off; the weighting helps guide the mutation choice instead. (The flat structure is also better from a memory standpoint, but that's never been a problem.) 
> Bind points I'm not quite following your details here, but it sounds related to something I've found with theft lately: if a generator draws a count, and then has N iterations of drawing whatever bits to construct N instances of something, it tends to take the shrinker more time to shrink that effectively. If, for each entry, I also add a draw of a single bit ("should I use this?") that ignores the generated entry, it gives the shrinker a lot more footholds to ignore irrelevant entries, without generally-useless exploration of what happens when part of an entry is dropped and all subsequent entries' input is misaligned. Adding another request per entry that skips all following entries in that collection with 1-in-M odds (say, 1:64) also helps to prune the state space fast. This makes sense to me -- it's adding localized choice points for the shrinker. (This may be because I'm mentally coming at this stuff from more of a Prolog-ish mindset than a monadic / mathematical one.) I don't currently have any structural annotations for this sort of thing; the generator API could register that it's about to start building the next thing in an array of things, and shrinking could use that information, but I'd really rather infer it. This comes up a lot for me, because I frequently write model-checker-style property tests -- the input to the property is a randomly generated specification for a constructed thing, a random sequence of operations against a stateful API, etc. I'm wondering if a structural inference algorithm like Sequitur would be good enough at detecting where to put these choice points, so the generators could be written without thinking about giving the shrinker any clues. Either way, the heart of the issue is that there's a distant coupling between generating the collection count and the entries in it, and the shrinker needs some sort of hint about ways to effectively shrink away the structure when the odds of hitting the length are fairly low.
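The per-entry "should I use this?" foothold can be sketched in Python like this. `draw` stands in for pulling bits from the entropy pool, and `make_draw` is a toy deterministic pool for illustration; none of this is theft's actual generator API:

```python
def make_draw(bits):
    """Toy entropy pool: draw(n) reads n bits sequentially (0 when exhausted)."""
    it = iter(bits)
    def draw(n):
        v = 0
        for _ in range(n):
            v = (v << 1) | next(it, 0)
        return v
    return draw

def generate_list(draw, max_len=8):
    """List generator with one keep/skip bit per entry.

    The entry's bits are always consumed, even when skipped, so a
    shrinker flipping the keep bit to 0 drops that entry without
    misaligning the input for every subsequent entry."""
    items = []
    n = draw(4) % (max_len + 1)          # draw a length
    for _ in range(n):
        keep = draw(1)                   # the per-entry foothold bit
        value = draw(8)                  # consumed either way: alignment
        if keep:
            items.append(value)
    return items
```

Since zeroing a keep bit is a monotonic single-bit change, the shrinker gets a localized choice point for discarding each entry independently of the distant length draw.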
I tend to use it both ways. The test runner I tend to use (greatest) has a flag to skip tests with a particular string in their name, and I mark some tests with `_slow`, `_fuzzing`, or something like that. I'd prefer to avoid implementation details that rule out using it for either. (Incidentally, long-term fuzzing apparently isn't great for laptop batteries.)
I wasn't saying adding concurrent searching & shrinking will be a tiny change! :) The tiny change I was referring to was making failure tagging (AKA controlling slippage while shrinking) also work with concurrency, once concurrency is already otherwise implemented. For that, I've settled on the following architecture: - Currently, the control flow for a given batch of trials ("run") is managed by procedural execution -- it's a for loop with a bunch of nested function calls. Instead, I'm extracting the overall behavior into a separate module ("theft_planner") that has calls to notify it of various conditions (e.g. "ran trial X, it passed", "generated argument #2 for trial X, here's a snapshot of the bit pool used (for potentially shrinking it later)", etc.), and a function to get the next step for an available worker. Whether there's one worker in the same process, one worker in a forked child process, or a pool of worker processes on a server with dozens of cores shouldn't matter -- the planner will decide where to focus efforts when multiple test failures or shrinks are reported from concurrent workers. - The process management (which is currently just various conditional branches in `theft_call.c`) will split into a supervisor module, which forks worker processes, manages async messaging between them, calls waitpid, buffers logging, and relays information between workers and the planner. The messaging should be quite simple; it mostly just exists because the worker processes have distinct memory due to copy-on-write (which is also a feature, for isolation purposes). - The worker processes generate instances, run the tests, and handle anything else that takes a nontrivial amount of CPU time; they do blocking IO and are otherwise fairly simple. This puts a very intentional seam between the planner and the supervisor, so that I can test each independently. 
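That planner/supervisor seam might be sketched like this: the planner is pure bookkeeping (no processes, no I/O), handing out steps and absorbing results, so it can be tested without any of the Unix machinery. This mirrors the design described above in spirit only; the class and method names are invented, not theft's actual theft_planner API:

```python
from collections import deque

class Planner:
    """Pure-logic scheduler: decides the next step for any free worker."""
    def __init__(self, num_trials):
        self.pending = deque(range(num_trials))
        self.failures = []            # (trial_id, pool_snapshot) pairs

    def next_step(self):
        """Called by the supervisor whenever a worker becomes available."""
        if self.failures:
            return ("shrink", self.failures[0])   # focus on a known failure
        if self.pending:
            return ("run_trial", self.pending.popleft())
        return ("done", None)

    def notify_trial_result(self, trial_id, passed, pool_snapshot=None):
        """Called when a worker reports back; the snapshot of the bit
        pool is what makes later shrinking of that failure possible."""
        if not passed:
            self.failures.append((trial_id, pool_snapshot))
```

Because the planner never touches a socket or a pid, unit tests can drive it through arbitrary interleavings of worker reports, which is exactly the property the seam is meant to buy.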
I tried adding concurrency to an earlier version of theft a few years ago, without a full design in mind, and some complex, unanticipated interactions with shrinking led to quite a lot of extra work. Mixing up testing Unix syscall stuff with testing the logical behavior was also a huge pain. I eventually got really burned out on it and never integrated any of it. This design also limits the number of concurrency-related bugs I need to handle: there isn't any synchronization of shared data across threads; all concurrency is via task parallelism. Since all CPU-intensive stuff happens on the worker processes, the supervisor hopefully shouldn't be a bottleneck with <50ish workers (a SUBSTANTIAL speed-up). My main open questions here have to do with the progress callbacks; previously, they all ran on the supervisor (or there was only one process), but now data saved to a struct inside a callback may not be visible when running the next step on another process. (This also made me decide to eliminate custom shrinkers in theft; they add quite a bit more surface area to support across multiple address spaces. It also means that hashing can be handled completely internally, simplifying the API further.)
I have several options for what to use as a tie-breaker when multiple workers report useful shrinks. I need to experiment further, but currently I'm favoring a mix of failing faster (in the wall-clock-time sense) and reducing the entropy pool further (fewer bits set in the pool). I'm trying to avoid needing "a fairly robust invalidation mechanism", beyond kill-9-ing worker processes that are still running old shrink candidates long after a new candidate is chosen. It's also not a problem I've run into much, though: the stuff I'm testing is usually very latency-sensitive anyway, so wide variations in test runtime are rare aside from failure by infinite loop. Scott |
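One way to express that tie-breaking is a simple comparator over the candidates each worker reports. This is a sketch of the idea only; whether wall-clock time or popcount should be the primary criterion (here, time first) is an illustrative guess, not a settled choice from the thread:

```python
def popcount(pool):
    """Number of bits set in an entropy pool (bytes)."""
    return sum(bin(b).count("1") for b in pool)

def pick_winner(candidates):
    """candidates: list of (pool_bytes, wall_clock_seconds) reported by
    workers whose shrink attempts still failed the property.

    Prefer the candidate that failed fastest, breaking ties by the
    pool with the fewest bits set (i.e. the most-reduced entropy)."""
    return min(candidates, key=lambda c: (c[1], popcount(c[0])))
```

The winner's pool then becomes the new current shrink state, and workers still chewing on stale candidates can simply be killed.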