Performance test suite?


Tom Morris

Jun 26, 2020, 3:01:34 PM
to openref...@googlegroups.com
Do we have any performance tests or scalability metrics associated with the 4.x work that we could use with 3.x? Owen had some graphs that he pointed Allana to. Is the tool that generated those available? If we don't have tests, do we have specific scalability/performance goals? If not, should we write some down?

Tom

ow...@ostephens.com

Jun 26, 2020, 3:09:35 PM
to OpenRefine Development
The code I wrote is available at https://github.com/ostephens/openrefine-timer

It depends on the refine-ruby gem, which doesn't work with OpenRefine >= 3.3 (I'm guessing because it doesn't know to request a CSRF token - and if that's the case it shouldn't be too hard to fix).
It wouldn't be hard to write something better though! It's only a few lines of code and doesn't do anything clever :)
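For anyone who wants to patch the gem, the missing piece is probably just one extra round trip. A rough sketch (in Java rather than Ruby, purely to show the shape; the /command/core/get-csrf-token endpoint and csrf_token parameter are my assumption about what 3.3 expects, and I haven't tested this):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CsrfTokenExample {
    public static void main(String[] args) throws Exception {
        String base = "http://127.0.0.1:3333";
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: ask OpenRefine for a CSRF token (assumed endpoint added in 3.3)
        HttpRequest tokenRequest = HttpRequest.newBuilder(
                URI.create(base + "/command/core/get-csrf-token")).GET().build();
        String json = client.send(tokenRequest, HttpResponse.BodyHandlers.ofString()).body();
        // Response should be a small JSON object like {"token":"..."}; crude extraction for the sketch
        String token = json.replaceAll(".*\"token\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // Step 2: pass the token back as csrf_token on any state-changing command
        HttpRequest command = HttpRequest.newBuilder(
                URI.create(base + "/command/core/delete-project?csrf_token=" + token))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("project=1234567890"))
                .build();
        System.out.println(client.send(command, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}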

Owen

Tom Morris

Jul 2, 2020, 12:57:02 PM
to openref...@googlegroups.com
On Fri, Jun 26, 2020 at 3:09 PM ow...@ostephens.com <ow...@ostephens.com> wrote:
The code I wrote is available at https://github.com/ostephens/openrefine-timer

It depends on the refine-ruby gem, which doesn't work with OpenRefine >= 3.3 (I'm guessing because it doesn't know to request a CSRF token - and if that's the case it shouldn't be too hard to fix).
It wouldn't be hard to write something better though! It's only a few lines of code and doesn't do anything clever :)

Thanks! I like the structure of artificially generated data of given sizes and a standard set of operations. Were the operations motivated by any particular use case?

As a complementary thing, I've started implementing a microbenchmarking framework here: https://github.com/OpenRefine/OpenRefine/pull/2859

As a first target, I sped up toNumber floating-point conversion about 6x. I picked it somewhat at random because it's where my debugger was pointed when I paused a long-running operation on a performance test case that I was looking at, but I think we should have a standard benchmark suite that we run at least every release, if not more frequently (monthly? weekly? nightly?). I expect the benchmark suite to be too heavyweight for it to be practical to run as part of the standard build.
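For anyone who hasn't looked at the PR, a microbenchmark is only a handful of annotations on a small class. Roughly this shape (a simplified JMH-style sketch, not the actual code in the PR; the two methods just illustrate comparing an old and a new conversion strategy):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ToNumberBenchmark {

    // A mix of integer and floating-point inputs, since the fast path only helps integers
    @Param({"42", "3.14159", "6.02e23"})
    public String input;

    @Benchmark
    public Object alwaysParseAsDouble() {
        return Double.valueOf(input); // baseline: every value pays the double-parsing cost
    }

    @Benchmark
    public Object integerFastPath() {
        try {
            return Long.valueOf(input); // cheap fast path for plain integers
        } catch (NumberFormatException e) {
            return Double.valueOf(input); // fall back to floating point
        }
    }
}

Run it with the usual JMH runner or Maven plugin; returning the parsed value keeps the JIT from dead-code-eliminating the work being measured.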

I think the benchmark suite should include a combination of microbenchmarks for common functions like toNumber, replace, split, etc.; bigger multi-row functional benchmarks like importing a 100K row x 5 column CSV with/without automatic datatype detection; and system-level benchmarks like Owen's. In addition to CPU performance, tracking RAM requirements would also be useful. Getting something in place, no matter how minimal, that we can run every release to track performance over time will give us valuable data about how performance is evolving and, hopefully, help us find any critical performance regressions before we inflict them on users.

It also seems like at least the gross-level system tests would be a critical metric for the scalability work, so that we can say we met a goal of XX% performance improvement for projects of size Y, or that the maximum project size increased from N to M. Without tests, there's no way of telling.

I think there are two sets of things to decide on for the initial setup: 1) project size, and 2) operations. 

1. Project Size - Owen's benchmark uses a CSV of four string columns by 25K rows (increasing the size by 25K rows until failure), and the operations are import CSV, create column based on regex (x3), delete column, and move column, with the sum of all operations timed. The file that I was playing with was the old GeoIP worldcities CSV, which has ~3M rows and is a little too big to be practical with today's OpenRefine (and probably unnecessary, except to measure RAM performance or catch O(N^2) issues), but I like its mix of datatypes: CountryCode,City,AccentCity,RegionCode,Population,Latitude,Longitude. Intuitively 5-10 columns with a mix of datatypes feels like a reasonable sweet spot, and generating such files is only a few lines of code (see the sketch after this list). For row count, I'm thinking something like 100K, 250K, 500K, 1M, but it could be a sparse matrix (e.g. we only test import performance at the full set of sizes).

2. Operations - import CSV with/without datatype guessing, export to CSV, export to XLSX, toNumber, toDate, toString, split, replace, string concatenation, Add Column, Remove Column, any facets?, no record mode? anything else basic?
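As a starting point for the data side, the generator doesn't need to be anything fancier than this (sketch only; column mix loosely modelled on the worldcities file, with a fixed seed so every run sees identical data):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Random;

public class BenchmarkDataGenerator {
    public static void main(String[] args) throws IOException {
        int rows = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        Random random = new Random(42); // fixed seed: identical data on every run
        Path out = Path.of("benchmark-" + rows + ".csv");
        try (PrintWriter writer = new PrintWriter(Files.newBufferedWriter(out))) {
            writer.println("CountryCode,City,Population,Latitude,Longitude");
            for (int i = 0; i < rows; i++) {
                // string, string, integer, float, float - a small mix of datatypes per row
                writer.printf(Locale.ROOT, "%s,City %d,%d,%.4f,%.4f%n",
                        i % 2 == 0 ? "us" : "gb",
                        i,
                        random.nextInt(10_000_000),
                        random.nextDouble() * 180 - 90,
                        random.nextDouble() * 360 - 180);
            }
        }
        System.out.println("Wrote " + rows + " rows to " + out);
    }
}

Run it once per row count (100K, 250K, 500K, 1M) and point the same operation script at each file.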

The important thing is to start generating numbers on an ongoing basis. As a matter of fact, if we had an automated suite that wasn't too onerous or time-consuming to run, it might even be worthwhile to generate a few historical data points for 2.6, 3.0, and 3.3 to establish a baseline. Does anyone have an intuitive feel for whether performance has changed much (or at all)?

Antonin asked a few relevant questions on the PR. I think I've covered most of them, but I've included explicit inline answers below:

I wonder what is the expected workflow around this. Do you intend to remove the old implementation when we merge this PR? In that case, the benchmark module will only run the new implementation, so we will have lost the comparison point - is it still useful to have the benchmark itself in that case?

That was a development stopgap. At steady state, we'll have a set of historical numbers that we can compare against. Implicit in this is that we archive the performance profile data so that we can compare across time.
 
When should these benchmarks be run? (Is that something you would add to the CI as a test, to make sure we do not degrade the performance of a particular part of the code?)

I expect them to be too heavyweight to be able to run as part of CI, although perhaps we can come up with some stripped down version to check for key regressions in CPU or RAM performance.
 
Also, I am curious how you came to work on this specific part of the code - is it something that was a bottleneck for a certain OR workflow, or was this flagged by a code analysis tool perhaps?

Semi-random. It's where my debugger stopped when I paused a long-running operation (toNumber x 3 million rows), but intuitively it's a core function and, by inspection, it could be made more efficient. The optimization was the easy part. Setting up the benchmarking harness to prove that it worked and to give us something that we could use on an ongoing basis was where all the work came in.

If you (all) had to pick one size project and one set of, say 5-10, operations to test as a starting point, what would you choose as being most representative?

Tom


 

Thad Guidry

Jul 2, 2020, 1:44:14 PM
to openref...@googlegroups.com
My intuition says that:
1. XML importing has improved! (maybe fixes to serialization and marshalling, or improvements in Jackson itself, helped)
2. I did notice that toDate() on a column is a bit slower, depending on the values inside (original string format), since 2.8 alpha, maybe 2.6? I always wondered if it was because of the Calendar changes and the OffsetDateTime introduction - dunno, just guessing there. So I'd like to see what happened to that function's performance; perhaps there's not much we can do about it, but still.

Just looking at my own History in GREL...
The operations I feel would be useful to benchmark are some of the common ones that I use nearly weekly:
Create New column based on value.escape("html")
Create New column based on value.partition(",")[0]
value.toNumber()
value.toDate()

Some others I find useful because, working with messy data, I always seem to run into sub-data in cells that needs to be replaced or split out:
Split multi-valued cells with regex "\r" - creating new columns (making my records so that I can export to SQL tables or HTML tables)
Join 2 columns with "," while replacing nulls with blank strings (really useful that we added the UI for this) - creating single line mailing addresses all the time.

That's what I'm doing with OpenRefine nearly 90% of the time, week in and week out.




Thad Guidry

Jul 2, 2020, 1:46:53 PM
to openref...@googlegroups.com
correction:  Split multi-valued cells with regex "\r" - creating new record rows


Tom Morris

Jul 5, 2020, 9:09:32 PM
to openref...@googlegroups.com
On Thu, Jul 2, 2020 at 1:44 PM Thad Guidry <thadg...@gmail.com> wrote:
My intuition says that:
1. XML importing has improved! (maybe fixes to serialization and marshalling, or improvements in Jackson itself, helped)
2. I did notice that toDate() on a column is a bit slower, depending on the values inside (original string format), since 2.8 alpha, maybe 2.6? I always wondered if it was because of the Calendar changes and the OffsetDateTime introduction - dunno, just guessing there. So I'd like to see what happened to that function's performance; perhaps there's not much we can do about it, but still.

Do you have a link to the issue for when the performance regression was first reported?

It definitely seems like there was a giant disturbance in The Force surrounding date handling in ~April 2018, so I could easily see a performance regression tied to that. As a matter of fact, glancing at the code, I could believe that ISO 8601 date conversion is running at half the speed it was before for strings which don't have a timezone offset.
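The pattern I have in mind looks roughly like this (a sketch of the general try-with-offset-then-fall-back approach, not the actual OpenRefine code): every string without an offset fails the first parse, and pays for a thrown exception plus a second parse.

import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

public class IsoParseCost {
    static OffsetDateTime parse(String s) {
        try {
            // succeeds only for strings with an explicit offset, e.g. 2018-04-30T10:15:30+01:00
            return OffsetDateTime.parse(s);
        } catch (DateTimeParseException e) {
            // offset-less strings land here after an exception, then get parsed a second time
            return LocalDateTime.parse(s).atOffset(ZoneOffset.UTC);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2018-04-30T10:15:30+01:00")); // one parse
        System.out.println(parse("2018-04-30T10:15:30"));       // two parses plus an exception
    }
}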

Tom

Thad Guidry

Jul 5, 2020, 9:26:15 PM
to openref...@googlegroups.com
Hi Tom,

Yes, that might be plausible. I just know that 2.8 was working OK and after that things were a bit slower, and I recalled some differences with date handling, as others mentioned with Excel in our issues. I cannot really say for sure because I lost a RAID array and completely upgraded the PC since then... so I don't have those OpenRefine project files that would have the timestamps and tell me exactly which OpenRefine version it was.

Looking at the timeline of the changes, 3.0 Beta seems to be about when we introduced the change you are perhaps thinking of:

* Unify the internal date type

Do you have the ability to investigate and clean up the date problems within a day, or is it super messed up and would take a week?




Tom Morris

Jul 6, 2020, 3:38:57 PM
to openref...@googlegroups.com
What was the issue #?

On Sun, Jul 5, 2020 at 9:26 PM Thad Guidry <thadg...@gmail.com> wrote:

Do you have the ability to investigate and clean up the date problems within a day, or is it super messed up and would take a week?

I've already done the investigation and it would probably only take a few minutes to fix, but it doesn't make any sense to fix it without performance tests that cover it.

Since we profess to be concerned with scalability, my first goal is to get at least some basic performance tests in place. Without them we're just playing guessing games.

Tom

Thad Guidry

Jul 6, 2020, 4:00:06 PM
to openref...@googlegroups.com
I never made an issue for the performance degradation, since I was not keenly aware of it at the time, and just got on with my other work.
You've already put your thoughts into the Excel date issues.

Sorry, I'm confused by your wording... Are you asking for some help with putting together a test file, or...?




Tom Morris

Jul 6, 2020, 4:07:41 PM
to openref...@googlegroups.com
On Mon, Jul 6, 2020 at 4:00 PM Thad Guidry <thadg...@gmail.com> wrote:
I never made an issue for the performance degradation, since I was not keenly aware of it at the time, and just got on with my other work.

If you think it's real, please make a ticket for it.
 
Sorry, I'm confused by your wording... Are you asking for some help with putting together a test file, or...?

If you have a test file, that would be great, but the performance tests that I'm talking about are not specific to date handling. I'm talking about creating a benchmark suite so that we know whether performance is going up, down, or sideways, as discussed on the performance testing thread. The only relevance to this particular problem is the ordering and priority of the tasks.

Tom

 