Thanks! I like the structure of artificially generated data of given sizes plus a standard set of operations. Were the operations motivated by any particular use case?
As a first target, I sped up ToNumber floating-point conversion by about 6X. I picked it somewhat at random because it's where my debugger was pointed when I paused a long-running operation on a performance test case I was looking at, but I think we should have a standard benchmark suite that we run at least every release, if not more frequently (monthly? weekly? nightly?). I expect the benchmark suite to be too heavyweight to be practical as part of the standard build.
I think the benchmark suite should include a combination of microbenchmarks for common functions like toNumber, replace, split, etc., bigger multi-row functional benchmarks like importing a 100K-row x 5-column CSV with/without automatic datatype detection, and system-level benchmarks like Owen's. In addition to CPU performance, tracking RAM requirements would also be useful. Getting something in place, no matter how minimal, that we can run every release to track performance over time will give us valuable data about how performance is evolving and, hopefully, help us catch any critical performance regressions before we inflict them on users.
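For the microbenchmark layer, something JMH-shaped is probably the path of least resistance. A minimal sketch of the kind of thing I have in mind (the class name, the @Param sizes, and the hand-rolled Double.parseDouble loop are just placeholders, not the actual OpenRefine code path):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Hypothetical microbenchmark for a toNumber-style string-to-double conversion.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ToNumberBench {

    @Param({"100000", "1000000"})
    int rows;

    String[] values;

    @Setup
    public void setup() {
        Random rng = new Random(42);          // fixed seed so runs are comparable
        values = new String[rows];
        for (int i = 0; i < rows; i++) {
            values[i] = Double.toString(rng.nextDouble() * 1000);
        }
    }

    @Benchmark
    public double parseAll() {
        double sum = 0;
        for (String v : values) {
            sum += Double.parseDouble(v);     // stand-in for the real conversion path
        }
        return sum;                           // returned so the JIT can't eliminate the loop
    }
}
```

The same pattern should work for replace, split, etc.; the functional and system-level benchmarks would live outside JMH.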
It also seems like at least the gross-level system tests would provide a critical metric for the scalability work, so that we can say we met the goal of XX% performance improvement for projects of size Y, or that maximum project size increased from N to M. Without tests, there's no way of telling.
I think there are two sets of things to decide on for the initial setup: 1) project size, and 2) operations.
1. Project Size - Owen's benchmark uses a CSV of four string columns by 25K rows (increasing by 25K rows until failure), and the operations are import CSV, create column based on regex (x3), delete column, and move column, with the sum of all operations timed. The file I was playing with was the old GeoIP worldcities CSV, which, at ~3M rows, is a little too big to be practical with today's OpenRefine (and probably unnecessary, except to measure RAM performance or catch O(N^2) issues), but I like its mix of datatypes: CountryCode,City,AccentCity,RegionCode,Population,Latitude,Longitude. Intuitively, 5-10 columns with a mix of datatypes feels like a reasonable sweet spot. For row count, I'm thinking something like 100K, 250K, 500K, 1M, but it could be a sparse matrix (e.g. we only test import performance at the full set of sizes). A rough sketch of a generator for files along these lines follows after this list.
2. Operations - import CSV with/without datatype guessing, export to CSV, export to XLSX, toNumber, toDate, toString, split, replace, string concatenation, Add Column, Remove Column, any facets? no record mode? anything else basic? (A rough table of GREL expressions for the per-cell operations also follows below.)
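Here's the kind of generator I'm picturing for the project-size axis: a fixed column mix (borrowed from worldcities), a variable row count, and a fixed seed so every run produces the same file. Column names and value ranges are purely illustrative:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

// Illustrative synthetic-data generator: 7 columns mixing strings, ints, and floats.
public class SyntheticCsv {

    public static void generate(String file, int rows) throws IOException {
        Random rng = new Random(7);  // fixed seed for reproducible files
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(Paths.get(file)))) {
            w.println("CountryCode,City,AccentCity,RegionCode,Population,Latitude,Longitude");
            for (int i = 0; i < rows; i++) {
                w.printf("%s,city%d,Çity%d,%02d,%d,%.6f,%.6f%n",
                        "ABCDEFGH".charAt(rng.nextInt(8)) + "X",  // fake 2-letter country code
                        i, i,                                      // plain and accented city names
                        rng.nextInt(99),                           // region code
                        rng.nextInt(10_000_000),                   // population
                        rng.nextDouble() * 180 - 90,               // latitude
                        rng.nextDouble() * 360 - 180);             // longitude
            }
        }
    }

    public static void main(String[] args) throws IOException {
        for (int rows : new int[] {100_000, 250_000, 500_000, 1_000_000}) {
            generate("bench-" + rows + ".csv", rows);
        }
    }
}
```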
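And most of the per-cell operations could probably be driven from a simple table of operation name to GREL expression, something like the following. The exact expression set is a first guess, and the column names refer to the synthetic file above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// First-guess mapping from benchmark operation name to the GREL expression that drives it.
public class BenchOperations {

    public static Map<String, String> grelOperations() {
        Map<String, String> ops = new LinkedHashMap<>();
        ops.put("toNumber", "toNumber(value)");
        ops.put("toDate",   "toDate(value)");
        ops.put("toString", "toString(value)");
        ops.put("split",    "split(value, \",\")");
        ops.put("replace",  "replace(value, \"a\", \"b\")");
        ops.put("concat",   "cells[\"City\"].value + \", \" + cells[\"CountryCode\"].value");
        return ops;
    }
}
```

Import/export, Add Column, Remove Column, and faceting would need to be scripted at the project level rather than as expressions.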
The important thing is to start generating numbers on an ongoing basis. In fact, if we had an automated suite that wasn't too onerous or time-consuming to run, it might even be worthwhile to generate a few historical data points for 2.6, 3.0, and 3.3 to establish a baseline. Does anyone have an intuitive feel for whether performance has changed much (or at all)?
Antonin asked a few relevant questions on the PR. I think I've covered most of them, but I've included explicit inline answers below:
> I wonder what is the expected workflow around this. Do you intend to remove the old implementation when we merge this PR? In that case, the benchmark module will only run the new implementation, so we will have lost the comparison point - is it still useful to have the benchmark itself in that case?
That was a development stopgap. At steady state, we'll have a set of historical numbers that we can compare against. Implicit in this is that we archive the performance profile data so that we can compare across time.
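By archive I don't mean anything fancy; even a flat file with one row per benchmark run that we append to would do. Roughly this shape (the field names are just a first guess):

```java
// One possible shape for an archived benchmark result row.
public class BenchResult {
    public final String orVersion;   // e.g. "3.3", or a git SHA
    public final String benchmark;   // e.g. "toNumber", "import-csv"
    public final int rows;           // dataset size the run used
    public final double meanMillis;  // mean wall-clock time per iteration
    public final long peakHeapBytes; // for the RAM-tracking side
    public final String runAt;       // ISO-8601 timestamp of the run

    public BenchResult(String orVersion, String benchmark, int rows,
                       double meanMillis, long peakHeapBytes, String runAt) {
        this.orVersion = orVersion;
        this.benchmark = benchmark;
        this.rows = rows;
        this.meanMillis = meanMillis;
        this.peakHeapBytes = peakHeapBytes;
        this.runAt = runAt;
    }
}
```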
> When should these benchmarks be run? (Is that something you would add to the CI as a test, to make sure we do not degrade the performance of a particular part of the code?)
I expect them to be too heavyweight to run as part of CI, although perhaps we can come up with a stripped-down version to check for key regressions in CPU or RAM performance.
> Also, I am curious how you came to work on this specific part of the code - is it something that was a bottleneck for a certain OR workflow, or was this flagged by a code analysis tool perhaps?
Semi-random. It's where my debugger stopped when I paused a long-running operation (toNumber x 3 million rows), but intuitively it's a core function and, by inspection, it looked like it could be made more efficient. The optimization was the easy part; setting up the benchmarking harness to prove that it worked, and to give us something we can use on an ongoing basis, was where all the work came in.
If you (all) had to pick one project size and one set of, say, 5-10 operations to test as a starting point, what would you choose as most representative?
Tom