Preventing regressions ?


Simon Michael

Jun 11, 2021, 1:34:35 PM
to hledger
https://github.com/simonmichael/hledger/issues/1568 (a particular mode of the register report showing misleading numbers) is the latest of many breakages we didn't notice until much later.

It is not surprising given our large UI surface area and highly intertwingled functionality. Such regressions have seemed acceptably rare much of the time.. but they keep happening. hledger is supposed to be an ultra-dependable tool, so this damages our reputation, not to mention being inconvenient for users and maintainers, and unsatisfying.

Our current strategy for preventing this is to continually run about a thousand tests, and when problems are found in the field, add more tests, and hope that eventually it stops happening. Will this work eventually ? I'm not so sure.

What else could we try, so that regressions become a thing that "never" happens, even as we are continually refactoring and improving the codebase ? Any thoughts ?


morgan.s...@gmail.com

Jun 11, 2021, 10:46:48 PM
to hledger
If we could find some way to include property-based testing, that might help. The problem is that it can be quite difficult to integrate such tests into the current architecture, and it can be quite tedious to give the input and output in ADT form, rather than the plainer textual input/output of the functests.

Simon Michael

Jun 11, 2021, 11:16:35 PM
to hledger
Indeed ! I would love to have more property-based / randomised / fuzz testing, but it's a bit hard to see how to apply it where we most need it, ie for the high-level user-visible issues.

If we consider one report [mode] at a time (single-period balance report, ...)

and try to identify some correctness properties for it (all the top-level numbers add up to the total, no items are empty, no commodities other than the ones selected by query appear, ...)

and some useful dimensions to randomise (various kinds of data, various kinds of query, various output options, ...)

could we achieve something useful ? It sounds.. kind of a big job.

Simon Michael

Jun 11, 2021, 11:21:29 PM
to hledger
I'll try to tag more "regression" issues with that label. Reviewing them, we might get some ideas of the most common kinds of breakage, a smaller set to focus on. 



-- 
You received this message because you are subscribed to the Google Groups "hledger" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hledger+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hledger/E72AED2E-CA27-400F-8791-0109E2E6B3D0%40joyful.com.

morgan.s...@gmail.com

Jun 11, 2021, 11:30:31 PM
to hledger
There have been a few instances where I thought some property-based testing might be useful, but I haven't added any yet due to the lack of an existing place to put them. If we could decide where and how these sorts of tests should be added, we could start putting things in place.

Stephen Morgan

Jun 11, 2021, 11:46:28 PM
to hledger
This wouldn't catch the really long-lived ones, but you could also try having a formal release candidate before full releases. I generally try to keep something relatively close to master as my day-to-day hledger, but a formal ‘everybody try using this for a month and report any issues’ period might catch some of the more egregious issues.

morgan.s...@gmail.com

Jun 11, 2021, 11:50:14 PM
to hledger
Another option: a few regressions haven't been caught because the functests run on test cases that are too simple. #1569 is a good example of that.

Because we want targeted tests, and also because we're lazy, we tend to use the simplest input that actually tests the case we have in mind. But these really finely targeted cases might be better served by a unit test, doctest, or property test. It might be worthwhile coming up with a standard functest journal which exercises the main things we want. Enhancing sample.journal may be a good place to start. We would need to change the output of a lot of test cases, but it may be helpful.

To-do would be coming up with a checklist of features we would need in the sample journal.
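One way to seed such a checklist would be to sketch the journal itself. Everything below is hypothetical (made-up accounts and dates, not the real sample.journal), but it shows the kinds of features worth covering: deep subaccounts, multiple commodities, prices, and a periodic rule for --forecast:

```journal
; hypothetical richer sample journal (a sketch, not the real sample.journal)

; a periodic rule, so --forecast has something to expand
~ monthly from 2021-01-01
    expenses:rent                   500.00 USD
    assets:bank:checking

2021-01-01 opening balances
    assets:bank:checking           1000.00 USD
    equity:opening

2021-01-15 groceries               ; deep subaccounts (a, a:b, a:b:c style)
    expenses:food:groceries:veg      10.00 USD
    assets:bank:checking

2021-02-01 buy stock               ; a second commodity, with a price
    assets:brokerage                  1 AAPL @ 130.00 USD
    assets:bank:checking
```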

Simon Michael

Jun 11, 2021, 11:52:35 PM
to hledger
I tagged issues back to #1430 or so - now visible at http://regressions.hledger.org :) (not https, yet)

Even for the small/internal things, the benefit/cost ratio usually seems low to me, but there must be exceptions. Applying it to low level systems could be a good first step. We might even have one or two property tests in there somewhere.




Simon Michael

Jun 11, 2021, 11:56:00 PM
to hledger


On Jun 11, 2021, at 5:27 PM, Stephen Morgan <morgan....@gmail.com> wrote:
This wouldn't catch the really long-lived ones, but you could also try having a formal release candidate before full releases. I generally try to keep something relatively close to master as my day-to-day hledger, but a formal ‘everybody try using this for a month and report any issues’ period might catch some of the more egregious issues.

I think this doesn't work for hledger. People won't do it, and even if they do they won't find the issues - they don't use that report mode, or their data doesn't trigger it. 

Simon Michael

Jun 12, 2021, 12:04:28 AM
to hledger
Our tests have certainly grown organically in many cases, and could do with being more systematic. 

I don't think we want to over-complexify individual tests, usually. Clear, understandable tests are good. But perhaps add more ?

We could do with reviewing the existing tests, infrastructure and policies, figuring out some ways to improve them, and some examples of this.

Simon Michael

Jun 12, 2021, 12:08:49 AM
to hledger

On Jun 11, 2021, at 5:52 PM, Simon Michael <si...@joyful.com> wrote:
I tagged issues back to #1430 or so - now visible at http://regressions.hledger.org :) (not https, yet)

Simon Michael

Jun 12, 2021, 12:11:06 AM
to hledger
On Jun 11, 2021, at 5:50 PM, morgan.s...@gmail.com <morgan.s...@gmail.com> wrote:
Enhancing sample.journal may be a good place to start.


PS yes, I have looong wanted to do this! But didn't dare, since it's so entrenched - I would almost say don't change it, but start a new one instead. examples/bcexample.hledger has been my go-to "realistic" example, but it's a bit too much.

Simon Michael

Jun 12, 2021, 12:12:27 AM
to hledger

niels...@gmail.com

Jun 12, 2021, 8:20:31 AM
to hledger
How would the following help, as an automated testing option for the command line? 
  • Start with a robust data file (more on this later).
  • Generate a series of reports, one series generated by one version of hledger and a second, identical series generated by a second version. The reports are saved to text files. The names of the text files identify the hledger version and the command used. For example: version1-19_reg_-M_--forecast=-2020/02/01.txt.
  • Compare the files generated by the first version of hledger to those generated by the second version from the same command, to see if they have exactly the same content. If the files match exactly, there is no regression.
However, if there is a change, is it because of some change to a number (bad) or some text in the report (perhaps not a problem)?
  • Therefore, where there are differences between the two reports, run a second test. This time take the two text files and remove anything that isn't a digit, negative sign, thousands separator, or decimal point. (Not sure if we would want exactly this, but that would be the general idea.)
For example, if the output of a file were:

Balance changes in 2020-01:

       ||       Jan
=======++===========
 a     || 31.00 USD
 a:b   || 30.00 USD
 a:b:c || 30.00 USD
-------++-----------
       || 91.00 USD

Then the shortened version would be:

2020-0131.0030.0030.0091.00

  • We compare the two shortened versions (i.e., same hledger command, but run by the different versions of hledger) to see if they are identical. If they are not, then we have likely identified a value that has changed. If they are identical, it's likely a change to something other than a number (maybe no problem, but a human could take a look at the two files to decide).
The test would identify which pairs of reports differed, and then if they differed additionally in terms of numbers. At that point you would need someone to make an inspection.
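For illustration, the reduction step can be sketched in shell (a sketch, not hledger tooling). One wrinkle: the rule lines made only of dashes must be dropped first, or their dashes would survive as "minus signs":

```shell
# Reduce a report to its numeric skeleton, using the sample report above.
report='Balance changes in 2020-01:

       ||       Jan
=======++===========
 a     || 31.00 USD
 a:b   || 30.00 USD
 a:b:c || 30.00 USD
-------++-----------
       || 91.00 USD'

# Drop blank/rule lines (only -, =, + and spaces), then keep only digits,
# minus, comma, and dot. tr -cd also deletes newlines, concatenating the rest.
skeleton=$(printf '%s\n' "$report" | grep -v '^[-=+ ]*$' | tr -cd '0-9.,-')
echo "$skeleton"   # → 2020-0131.0030.0030.0091.00
```

This reproduces the shortened version shown above; a real script would loop the same pipeline over each pair of differing report files.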

Problems with the above approach: 
  • Is one data file going to be robust enough to catch all errors? What about errors that come up only when running a command for two hledger files?
  • There are potentially an infinite number of possible hledger command options. How will you choose which ones to run?
  • When checking two versions of hledger, if the error is in both versions, the above approach won't pick up the error.
  • I can see how the above approach would work for the command-line version of hledger, but I have no idea if it would work with the other versions.
I would be willing to create a working version of the above test if you thought that it would be worth looking at.

Rob


On Saturday, June 12, 2021 at 12:12:27 AM UTC-4 Simon Michael (sm) wrote:

Simon Michael

Jun 12, 2021, 2:43:40 PM
to hledger
I apologise for asking for ideas then immediately shooting them down. :) What seemed true at certain times may not be true always.



On Jun 11, 2021, at 6:12 PM, Simon Michael <si...@joyful.com> wrote:

There's ~13 of this kind of regression since January. That's one every two weeks. That sounds really bad, but hopefully it indicates an increased rate of discovery, not creation, due to eg: more users, more usage, more adventurous usage of features, more confidence in bug reporting, ...
I like it!


I have been thinking along similar lines. It's possible right now to run our functional tests with a different hledger version (shelltest -w hledger-1.20 ...). Some would be expected to fail with the old version due to UI changes, but most should behave identically. We could have a way of marking which are expected to pass on both versions. As you say sometimes the old version is broken too (with better regression testing this should become rare, but it's still possible). We could have a way of marking which even-older hledger version was correct (the last known good version).

The above relies on our existing functional test suite, and would use the same human-curated expected outputs. For regression testing specifically, we could create a new kind of test, where we specify only the command (and data files), but not any of the expected output; the test runner would verify that the command produces identical outputs with old and new hledger versions. So the tests would be basically a list of commands, or close to it - much easier to define CLI tests quickly, and to see their coverage.
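A rough sketch of such a runner, in bash (the function name and binary names here are invented, not existing hledger tooling): each test is just a command line, and the runner diffs the two versions' outputs:

```shell
# Dataless regression check: run one command under two executables
# and report whether their outputs (stdout + stderr) match.
compare_cmd () {
  old_bin=$1; new_bin=$2; shift 2
  if diff <("$old_bin" "$@" 2>&1) <("$new_bin" "$@" 2>&1) >/dev/null 2>&1; then
    echo "OK:   $*"
  else
    echo "DIFF: $*"
  fi
}

# The test file would then be just a list of argument strings, e.g.:
# while read -r args; do compare_cmd hledger-1.20 hledger $args; done < cmds.txt
```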



Simon Michael

Jun 13, 2021, 12:23:47 AM
to hledger

On Jun 12, 2021, at 8:43 AM, Simon Michael <si...@joyful.com> wrote:
I would be willing to create a working version of the above test if you thought that it would be worth looking at.

PS, please do! We should explore all of these and see what works.

Simon Michael

Jun 13, 2021, 1:12:41 PM
to hledger
By coding and/or discussing. It seems to me that your idea, and saving outputs generally,

- is the most efficient way to run regression tests on a large scale, eg in CI. Results are saved for reuse, so you don't have to execute the old hledger version every time you run regression tests.

- provides the best transparency and historical data. It's easy to see what version X produced for data set Y and command Z, or if we also saved performance data, how that has been changing.

- leads to a pretty big data management project. I could be overengineering here, but say you keep (and commit ?) all artifacts (and once generated, it would seem a pity to throw them away) - assuming file storage, imagine a schema like:

regression/
 DATASET/        # the data files to be used by all tests - since these will evolve
  TESTHOST/      # the OS/environment[/architecture/machine specs] tests were run on - for platform [& performance] comparisons
   COMMAND/      # the hledger command including data files, encoded as directory name (will have cross platform problems)
    HLEDGERVER/  # the hledger versions tested with this configuration
     OUTPUTS     # the outputs - stdout, stderr, exit code [, various measurements of time and space usage..]

The first three outputs (out, err, exit), at least, could be recorded more compactly in shelltestrunner's file format, but probably less usefully; separate files sound simpler. I could see all this generating a thousand files per year. (Then there needs to be a test runner that can compare current hledger with the most appropriate past version, for some set of commands.)
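To make the encoding problem concrete, here is a sketch (helper names invented, not part of hledger) of flattening a command line into a directory-safe name and filing one run's output under the proposed hierarchy:

```shell
# Encode a command line as a directory name (spaces and slashes become _).
# This is where the cross-platform problems mentioned above would live.
encode_cmd () { printf '%s' "$1" | tr ' /' '__'; }

# Save one run's stdout under regression/DATASET/TESTHOST/COMMAND/VERSION/.
# Usage: some_command | save_run DATASET TESTHOST "COMMAND" VERSION
save_run () {
  dir="regression/$1/$2/$(encode_cmd "$3")/$4"
  mkdir -p "$dir"
  cat > "$dir/stdout"
}

# e.g.: hledger bal -M | save_run sample.journal linux-x86_64 "hledger bal -M" 1.21
```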

We might want to keep such data out of the main repo; which would make it inaccessible for casual contributors. But we'd like all devs to be running regression tests to check their work. So we might need another, simpler regression test system - perhaps like the "dataless" one I proposed - that could live in the main repo.

niels...@gmail.com

Jun 13, 2021, 5:21:46 PM
to hledger
I have something working, and it's probably best to get some feedback before continuing.

I started with the balance command. That is, I generated a very large subset of combinations for the balance command. You can see them in this file.

Notice that the number of commands is 129,600. I left out some options, such as:
  • -p interval
  • all the conversions (such as …)
So we could have had an even larger number of variations just for the balance command!

I calculated that it would take about 6 hours on my laptop to generate the 129,600 file outputs, so I settled for the first 100. That is, with the first 100 lines of the above file in a separate file (let's call it short.txt), I ran the following command:

bash short.txt

Once the 100 files were created, I did a search and replace to change ver1 in the file to ver2. This simulates running the commands in a separate version of hledger. Then I had bash process the short.txt file again.

Next, the following bash script will compare the two versions to see if there are any differences:

for i in {000001..000100}
# The start number is zero-padded to the same width as the largest file
# number; the end number equals the total number of files generated by
# one version of hledger (here, the 100-file subset).
do
   FILE1=~/Documents/R/temp/ver1_${i}.txt
   FILE2=~/Documents/R/temp/ver2_${i}.txt
   cmp "$FILE1" "$FILE2"
done

Of course, since it's the same version of hledger that generated both version 1 and version 2, it isn't going to show any differences. So I went into a couple of the files and added one letter and saved the files. This time, the cmp command noted where the files differed between version 1 and version 2.

Again, there is much more that can be done. This is only one command, and not even all the variations for that command. But it's time for some feedback.

By the way, I decided that using numbers as part of the filename wasn't a bad idea since it's relatively easy to look up which command generated which file. Also, it was easier for me to do the coding. This can be changed, of course.

Rob

niels...@gmail.com

Jun 13, 2021, 6:59:27 PM
to hledger
Also, here is the R script that generated the variations of the balance command. This is useful to see what options I included and which ones were left out.

How to read: For the first set of options (opt1), there are 3 possibilities: nothing, -l, and -t.

Option 7 (opt7) was the most complicated one, at least for me, to figure out, as there are multiple combinations.

-----------------start script ---------------------
library(stringr)  # provides str_pad; this import was missing
filenumber <- 1
opt1 <- list("", "-l", "-t")
opt2 <- list("", "-1", "-2", "-3", "-4", "-5", "-6", "-7", "-8", "-9")
opt3 <- list("", "-S")
opt4 <- list("", "--budget", "-V", "--valuechange")
opt5 <- list("", "-D", "-W", "-M", "-Q", "-Y")
opt6 <- list("", "--cumulative", "-H")
opt7 <- list("", "-T", "-A", "-%", "--invert", "-TA", "-T%", "-T --invert", "-A%", "-A --invert", "-% --cumulative", "-TA%", "-TA --invert", "-A% --invert", "-TA% --invert")
opt8 <- list("", "--transpose") # missing --pivot and --format
sink(file = "~/Documents/bal.txt")
for (i in opt1) {
  for (j in opt2) {
    for (k in opt3) {
      for (l in opt4) {
        for (m in opt5) {
          for (n in opt6) {
            for (o in opt7) {
              for (p in opt8) {
                padded_number <- str_pad(filenumber, 6, side = "left", pad = "0")
                filename <- paste0(" > ", "~/Documents/R/temp/", padded_number, ".txt")
                cat("hledger -f 2021.ledger balance ", i, j, k, l, m, n, o, p, filename, "\n")
                filenumber <- filenumber + 1
              }
            }
          }
        }
      }
    }
  }
}
sink()
----------------end script -------------
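For comparison, the same kind of cross product can be sketched in shell. Only two of the eight option groups are enumerated below (a sketch, not a replacement for the R script), but the product of the group sizes confirms the 129,600 figure:

```shell
# Two of the eight option groups, as arrays ("" means the option is omitted).
opt1=("" "-l" "-t")
opt3=("" "-S")
for i in "${opt1[@]}"; do
  for k in "${opt3[@]}"; do
    # extra spaces when an option is empty are harmless, as in the R script
    echo "hledger -f 2021.ledger balance $i $k"
  done
done

# Group sizes from the R script: 3, 10, 2, 4, 6, 3, 15, 2.
total=$((3*10*2*4*6*3*15*2))
echo "$total"   # → 129600
```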

Rob

niels...@gmail.com

Jun 13, 2021, 7:11:49 PM
to hledger
And there is an error in opt7. There is a --cumulative where there should be an --invert.

Anything else that needs correction?

Rob

morgan.s...@gmail.com

Jun 13, 2021, 8:08:56 PM
to hledger
I think "-V" should be removed from opt4, and two new option groups should be introduced: list("", "--cost") and list("", "--value=end", "--value=then", "--value=2020-01-01"). (Obviously any specific date can be used there. I've omitted "--value=now" since its output will depend on when it's run.)

morgan.s...@gmail.com

Jun 13, 2021, 9:35:57 PM
to hledger
I think there also need to be some options that include different start/end dates for the report. Without them, there's no difference between --cumulative and --historical.

niels...@gmail.com

Jun 14, 2021, 7:24:33 AM
to hledger
Excellent. Thank you, and I will come up with a next version to include those. It may take a day or two.

Rob

Simon Michael

Jun 14, 2021, 3:45:44 PM
to hledger
Hi Rob,

Very cool experimentation!

I think you have shown that exhaustively testing all possible combinations of options isn't going to happen. Still, simply finding good ways to list such combinations is valuable. We can inspect this list and think about how to expand it, make it more realistic, select a smaller set of more bug-prone/more realistic combinations, or maybe pick random samples from the big list..
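Picking such a sample mechanically is cheap. For instance, assuming the generated list has one command per line, a systematic 1-in-1296 sample of the 129,600 commands yields exactly 100 of them; "shuf -n 100 bal.txt" would give a random sample instead. A sketch, with seq standing in for the real bal.txt:

```shell
# Take every 1296th "command" from the generated list.
# seq stands in for the real bal.txt here.
sample=$(seq 1 129600 | awk 'NR % 1296 == 0')
printf '%s\n' "$sample" | head -n 3   # → 1296, 2592, 3888 (one per line)
```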

In regard to actual testing, if you want to experiment with maintaining a test suite of your own (even a small set of tests) and running it periodically against master - I know that would flush out a lot of practical questions and issues. I would encourage anyone to do this. 

In fact I'm thinking this would be a great use of our project funds on opencollective: give bounties for new regressions found in master. I will pay $50 for the next regression found (as judged by me, see https://github.com/simonmichael/hledger/issues?q=label:regression%21 for some examples).

Simon Michael

Jun 14, 2021, 3:59:42 PM
to hledger
Bounty and discussion are now also discoverable at https://github.com/simonmichael/hledger/issues/1570. I expect there'll be a series of these bounties, I'm just starting with the first.


Simon Michael

Jun 14, 2021, 4:18:50 PM
to hledger
Just for your interest: some discussion of fuzzing from the Matrix project, with links to some of the big tools - might give us some inspiration:

https://matrix.org/blog/2021/06/14/adventures-in-fuzzing-libolm

niels...@gmail.com

Jun 15, 2021, 7:59:10 AM
to hledger
Simon,

Yes, when you start adding specific dates as options, the number of possible reports is, for practical purposes, infinite. Your idea of a curated subset of reports sounds like the way to go.

For the bounty, I assume that you mean finding a regression that involves something significant like a number, a sign, or currency. This would not apply, for example, if two reports differed by the number of blank lines. (And that's why you specified that you would be the judge of what would qualify as a regression.)

I am glad to continue helping with this. Speaking of which, I came up with the following to remove the insignificant parts of a report (i.e., one saved as a .txt file):

   sed -i.bak 's/[a-zA-Z&:/() ]//g' *.txt

As mentioned previously, this is for when you have reports that differ, and next want to compare if two reports, reduced to their more essential parts, still differ. 

Rob

Simon Michael

Jun 15, 2021, 12:57:17 PM
to hledger


> On Jun 15, 2021, at 1:59 AM, niels...@gmail.com <niels...@gmail.com> wrote:
> For the bounty, I assume that you mean finding a regression that involves something significant like a number, a sign, or currency. This would not apply, for example, if two reports differed by the number of blank lines. (And that's why you specified that you would be the judge of what would qualify as a regression.)

Yes - I don't have a definition better than "an unplanned not-good change", "I know it when I see it", and the past examples. (I want to be generous, but our funds are finite and I'm not sure how fast we'll find these. No new reports yet. I do know of one which I don't have a clean repro for; it might involve currency filtering..)


>
> I am glad to continue helping with this. Speaking of which, I came up with the following to remove the insignificant parts of a report (i.e., one saved as a .txt file):
>
> sed -i.bak 's/[a-zA-Z&:/() ]//g' *.txt
>
> As mentioned previously, this is for when you have reports that differ, and next want to compare if two reports, reduced to their more essential parts, still differ.

I like this technique!

niels...@gmail.com

Jun 19, 2021, 2:21:17 PM
to hledger
Still at work on this! Making progress and learning a lot at the same time. 

Rob

niels...@gmail.com

Jun 22, 2021, 6:13:54 PM
to hledger
I thought it would be useful to analyze the list of regressions. That is, is there something in common that would give clues for how to search for other, as yet unknown, errors? Also, how would the approach of comparing how different versions of hledger (CLI) output the same report have worked?

By my count, there were 12 issues. Five of them were not applicable to command-line generated text reports. Note that by text reports, I am not including CSV reports. CSV reports may be very worthwhile testing, but my first step was to look at the vanilla text output. Three of the five were related to CSV, and the other two to hledger-ui.

Of the seven remaining regressions, five dealt in one way or another with specialized data. What I mean by specialized data are cases that seemed to be not common. For example, two of the regressions were caused when numbers with 16 (?) or more decimal places were included. Another involved a forecast where there were forecasts both for parent and child accounts (think a, a:b, a:b:c). Finally, one regression was related to a timeclock issue, and one to sub accounts with aregister. 

My point about the above five is that it would take a remarkably robust set of data to uncover these errors with the approach I mentioned.

On the positive side, issue 1450, the one where aregister ate the last newline, would have been found by comparing the text versions of command line generated reports.

Where to go next?

I am thinking of two things:
  • start comparing CSV output. As mentioned above, there were a couple of issues related to CSV output, so let's see what happens when we compare CSV output over different versions of hledger.
  • rethink the approach of testing such a large number of combinations. When I looked at the list of regressions, I didn't see any that seemed to be related to combinations of command options. For example, I don't think there were problems when a command was used with four options. 
Rob



Simon Michael

Jun 22, 2021, 6:41:13 PM
to hle...@googlegroups.com
Very interesting analysis Rob. Yes, it's just a huge space for bugs to hide in.

I'm still hoping to try, or help someone try, a lightweight list-of-commands type of test - like bench.sh, but with commands that run quickly, and more of them. I still don't know how we can proactively detect everything, but at least if tests are fast to create and to run we can probably do some good.


On Jun 22, 2021, at 12:13, niels...@gmail.com <niels...@gmail.com> wrote:



Stephen Morgan

Jun 22, 2021, 7:40:16 PM
to hle...@googlegroups.com
The textual analysis could also be used to test hledger-ui, correct? This is a big hole in our test suite, and covering it would really improve things.


Simon Michael

Jun 22, 2021, 8:10:18 PM
to hle...@googlegroups.com
I got pretty close to automating this with horrible hacks, but had to give up. I think the right fix would be to add a little support for it in the upstream brick library (something you can call to dump the current screen state to stdout and exit).

On Jun 22, 2021, at 13:40, Stephen Morgan <morgan.s...@gmail.com> wrote:



niels...@gmail.com

Jun 24, 2021, 12:23:12 PM
to hledger
Regarding textual analysis to test hledger-ui output, the question that I (as someone very unfamiliar with hledger-ui) would have is: is there an alternative to pointing and clicking to generate, let's say, 100 reports?

Rob
