Inefficiency in CSV generation

Michael De La Rue

Mar 24, 2023, 10:43:15 AM
to Hypothesis users
I'm using hypothesis-csv (actually a fixed version, from https://github.com/jeffbr13/hypothesis-csv).

I want to generate a set of matching files for testing CSV merge sorting.

from hypothesis import given
from hypothesis_csv.strategies import csv

@given(
    csv(header=5),
    csv(header=5),
    csv(header=5),
    csv(header=5),
    csv(header=5),
    csv(header=5),
)
def test_diff_should_sort_with_the_same_setup_as_sort(csv_1, csv_2, csv_3, csv_4, csv_5, csv_6):
    ...

When I do that I get terrible statistics for the generated data:

tests/test_inventory_diff.py::test_diff_should_sort_with_the_same_setup_as_sort:

  - during reuse phase (0.08 seconds):
    - Typical runtimes: ~ 76ms, ~ 85% in data generation
    - 1 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (116.37 seconds):
    - Typical runtimes: 75-166 ms, ~ 100% in data generation
    - 4 passing examples, 0 failing examples, 995 invalid examples


This level of invalid examples is obviously a problem. I tried running the test cases inside the hypothesis-csv package itself, and the one which seems relevant also shows a suspiciously high level of invalid examples:

tests/test_strategies.py::test_data_rows_fixed_column_num:

  - during reuse phase (0.01 seconds):
    - Typical runtimes: 0-6 ms, ~ 79% in data generation
    - 3 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (2.98 seconds):
    - Typical runtimes: 1-41 ms, ~ 90% in data generation
    - 97 passing examples, 0 failing examples, 205 invalid examples

  - Stopped because settings.max_examples=100


At that stage I'm a bit stuck, because I can't see where Hypothesis is marking the examples invalid.

Could anyone suggest how to go about improving the efficiency of this, please?
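One avenue might be Hypothesis's verbosity setting, which prints each draw as it happens; a minimal sketch of turning it on (the text() strategy here is just a stand-in for csv()):

from hypothesis import given, settings, Verbosity, strategies as st

@settings(verbosity=Verbosity.debug)  # logs every draw, retry, and shrink step
@given(st.text(min_size=1))  # stand-in strategy; csv(...) in my case
def test_with_debug_output(s):
    ...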

Thanks very much in advance for any effort or help with this.
Michael




Trent Savage

Mar 31, 2023, 6:55:15 PM
to Hypothesis users
I'm having a similar problem, except with hypothesis-jsonschema. 

I got slightly more descriptive error logs than you, though. I used the command: pytest --hypothesis-show-statistics my_test.py

- during generate phase (385.67 seconds):
    - Typical runtimes: 23-516 ms, ~ 100% in data generation
    - 2 passing examples, 0 failing examples, 2997 invalid examples
    - Events:
      * 37.21%, Aborted test because unable to satisfy none().filter(lambda obj: all(v(obj) for v in validators))
      * 36.95%, Aborted test because unable to satisfy text(min_size=1).filter(lambda obj: all(v(obj) for v in validators))
      * 36.95%, Retried draw from text(min_size=1).filter(lambda obj: all(v(obj) for v in validators)) to satisfy filter
      * 23.21%, Retried draw from text().filter(lambda s: s not in out) to satisfy filter
      * 20.31%, Retried draw from text().filter(not_yet_in_unique_list) to satisfy filter
      * 1.80%, Aborted test because unable to satisfy text().filter(lambda s: s not in out)
      * 0.13%, Retried draw from sampled_from([***, ***]).filter(lambda s: s not in out) to satisfy filter
      * 0.10%, Aborted test because unable to satisfy sampled_from([***, ***]).filter(lambda s: s not in out)

From the error message I got, I can see the problem: somehow one of the nodes of my JSON Schema was compiled to the strategy none().filter(lambda obj: all(v(obj) for v in validators))

That fails generation 37% of the time, and I wonder why it isn't 100%: shouldn't none() always fail the filter?
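(My guess for the 37%: the none() strategy is presumably just one alternative in a larger one_of() built from the schema, so only the draws that land on that branch get aborted.) A minimal sketch of the failure mode in isolation, independent of hypothesis-jsonschema:

from hypothesis import given, strategies as st

# none() only ever produces None, so this filter can never be satisfied;
# Hypothesis retries a few times and then aborts the example as invalid.
impossible = st.none().filter(lambda obj: obj is not None)

@given(impossible)
def test_never_generates(obj):
    ...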

All that's left to do is either handcraft a strategy (a sketch of that below), examine hypothesis-jsonschema in a debugger, or substitute a lesser library like Faker for my data-generation needs.
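For the handcrafted option, a minimal sketch using Hypothesis's built-in strategies directly instead of the schema translation (the field names here are made up for illustration):

from hypothesis import strategies as st

# Hypothetical hand-built replacement for the schema-derived strategy:
# constructing the object directly means no draw depends on an
# unsatisfiable filter.
record = st.fixed_dictionaries({
    "name": st.text(min_size=1),
    "value": st.none() | st.integers(),
})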

Michael De La Rue

Apr 3, 2023, 11:21:34 AM
to Hypothesis users
Right now all I have done is move from a five-column to a three-column CSV and generate four of them rather than six (see the sketch below). That's a little slow, but acceptably so, and as a test it works well enough; it has definitely found me a few missing unit tests.
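A sketch of the reduced setup (assuming header=3 gives three generated columns, the same way header=5 gave five above; the test body is elided):

from hypothesis import given
from hypothesis_csv.strategies import csv

@given(
    csv(header=3),
    csv(header=3),
    csv(header=3),
    csv(header=3),
)
def test_diff_should_sort_with_the_same_setup_as_sort(csv_1, csv_2, csv_3, csv_4):
    ...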

I also filed this as a bug on Hypothesis, since it's clear there's something missing in the reporting -


That's been closed as not easy to work on without a reproducer. Since I solved my own problem and am fine, I'm not currently working on it, but I will get back to this and try to put together some clear open-source reproducers for the various problems, which can then be watched under debuggers and so on, to work out what's actually going on and what might be missing.

Thanks for the comment anyway. At least I'm not alone.

All the best,
Michael