hash table storage for unit testing

Jacob G

Jul 1, 2016, 7:46:03 AM
to jOOQ User Group
For unit testing, I'd like to be able to use in-memory hash tables for data rather than a database. Rather than re-implement memory-based DAO classes for every jOOQ-based DAO class, I'd rather have jOOQ execute itself against hash tables. Conceptually, it seems very doable, since jOOQ's powerful metadata model lends itself to being rendered and executed in ways other than SQL.

Has anyone thought about this, or even worked on or implemented such a solution?

If I were to go about it, what approach would you suggest, and are there existing hooks I can use to redirect jOOQ to a different hash table implementation? I see that there is a hook for connections once the SQL is rendered, but I'd like to hook in before that, while the query is still an object. Is that what AbstractDelegatingQuery is for? Any other concerns?

Lukas Eder

Jul 1, 2016, 8:04:37 AM
to jooq...@googlegroups.com
Yes, that exists, and it has a name! H2 in-memory database. :) (or HSQLDB or Derby)

Hope this helps.
Lukas

Jacob G

Jul 1, 2016, 9:09:39 AM
to jOOQ User Group
Thanks. I'll take a look at it, particularly to evaluate in-process mode, performance, and the complexity of configuration and of schema setup and teardown between unit test methods. The nice thing about hash tables is that there is no schema setup or teardown.
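If it works out, I'd expect per-test setup to look something like this minimal sketch (the table and names here are just illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.UUID;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

// Each unique "mem:" name is an independent H2 database that disappears when
// its last connection closes, so teardown between test methods is automatic.
try (Connection conn = DriverManager.getConnection(
        "jdbc:h2:mem:test_" + UUID.randomUUID())) {
    DSLContext dsl = DSL.using(conn, SQLDialect.H2);
    // Illustrative schema; real DDL would mirror the production schema.
    dsl.execute("CREATE TABLE author (id BIGINT PRIMARY KEY, name VARCHAR(255))");
    // ... exercise the DAO under test against dsl ...
}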

Lukas Eder

Jul 1, 2016, 9:33:07 AM
to jooq...@googlegroups.com
First, you'll add a bunch of values in 1-2 hashmaps. Then you refactor. Then you figure out that very often, these hashmaps are rather similar. Then you notice that minimal transactionality is nice. And perhaps locking. And suddenly, you have implemented a full-scale RDBMS built on hashmaps that can be queried using a DSL that looks like SQL... Why go through all that hassle? :)

The nice thing about H2 is that in principle, you could just keep a copy of your .db file somewhere and restore that at the end of a test that includes writing...

Cheers,
Lukas

Jacob G

Jul 1, 2016, 10:06:20 AM
to jOOQ User Group
I see what you mean. We unit test our DAO layer against the real Postgres db. But at the service layer, I'm working on re-designing our test classes to use "memory DAOs". I started with the approach of writing memory DAO versions of every DAO class, essentially replacing jOOQ queries with hash table stream filter/map operations, etc. It's extremely fast: test classes that took 10 seconds to execute now take less than 100 ms. But it's a lot of effort, and it forces us to make sure our real DAO classes are properly unit tested in all cases.

Hopefully, H2 will give us similar speed and configuration simplicity, without all the development effort!

FYI, for the hash table implementation I started, I use a class that maintains all the hash tables to support join emulation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import javax.inject.Singleton;

import org.jooq.UpdatableRecord;

//
// The purpose of this class is to enable memory DAL classes to emulate the following SQL
// features by providing a central registry of all repositories by record type:
// 1) SQL joins
// 2) long-based primary key sequence generation
//
@Singleton
public class MemoryRepositoryService implements IMemoryRepositoryService {

    // ConcurrentHashMaps so the unsynchronized fast path in repository() is safe
    private final Map<Class<? extends UpdatableRecord<?>>,
                      Map<?, ? extends UpdatableRecord<?>>> _repositories;
    private final Map<Class<? extends UpdatableRecord<?>>, Class<?>> _idTypes;

    private final AtomicLong _sequenceGenerator;

    public MemoryRepositoryService() {
        _repositories = new ConcurrentHashMap<>();
        _idTypes = new ConcurrentHashMap<>();
        _sequenceGenerator = new AtomicLong();
    }

    @Override
    @SuppressWarnings("unchecked")
    public <R extends UpdatableRecord<R>, T> Map<T, R> repository(Class<R> valueClass,
                                                                  Class<T> keyClass)
    {
        if (_repositories.containsKey(valueClass)) {
            validateKeyClass(valueClass, keyClass);
            return (Map<T, R>) _repositories.get(valueClass);
        } else {
            return syncRepository(valueClass, keyClass);
        }
    }

    @SuppressWarnings("unchecked")
    private synchronized <R extends UpdatableRecord<R>, T>
            Map<T, R> syncRepository(Class<R> valueClass, Class<T> keyClass)
    {
        Map<T, R> repository;

        // Double-checked lock: another thread may have registered this
        // repository between the unsynchronized check and this point.
        if (_repositories.containsKey(valueClass)) {
            validateKeyClass(valueClass, keyClass);
            repository = (Map<T, R>) _repositories.get(valueClass);
        } else {
            repository = new ConcurrentHashMap<>();
            _idTypes.put(valueClass, keyClass);
            _repositories.put(valueClass, repository);
        }

        return repository;
    }

    // Key classes are registered per record type, so the lookup is by valueClass.
    private void validateKeyClass(Class<? extends UpdatableRecord<?>> valueClass,
                                  Class<?> keyClass) {
        if (!_idTypes.get(valueClass).equals(keyClass)) {
            throw new IllegalArgumentException(
                "Key type does not match existing registered type");
        }
    }

    @Override
    public synchronized void clear() {
        _repositories.values().forEach(Map::clear);
    }

    @Override
    public long nextLong() {
        return _sequenceGenerator.incrementAndGet();
    }
}
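For illustration, a memory DAO built on this service emulates a join with streams roughly like this (PersonRecord, AddressRecord, and PersonWithAddress are hypothetical placeholders):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: emulate
//   SELECT * FROM person p JOIN address a ON a.id = p.address_id
// with stream map/filter over two repositories ("service" is the injected
// IMemoryRepositoryService from above).
Map<Long, PersonRecord> persons = service.repository(PersonRecord.class, Long.class);
Map<Long, AddressRecord> addresses = service.repository(AddressRecord.class, Long.class);

List<PersonWithAddress> joined = persons.values().stream()
    .map(p -> new PersonWithAddress(p, addresses.get(p.getAddressId())))
    .filter(pa -> pa.getAddress() != null)   // inner-join semantics: drop unmatched rows
    .collect(Collectors.toList());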

ad...@honestbuildings.com

Jul 3, 2016, 1:51:08 AM
to jOOQ User Group
Having spent much time on many parts of this, hopefully I can save you some time, Jacob.

(1) If your tests require Postgres running, they are not unit tests - they are integration tests.  In the limit, these types of tests produce a test suite that is too slow to be useful, with likely too much code to economically fix, and the likelihood that you will have to throw away the system is very real.

(2) The DAO is a useful abstraction, and DAOs make testing DAO-dependent code really easy - just mock the DAO!  But how do you test the DAOs themselves?  The job of the DAO is to present a simple interface to client code and reliably perform some type of database manipulation.  Clearly the best way to test this is to run a DAO against a database and see if the desired manipulation results from calling a DAO method.  (Worth noting that jOOQ allows you to mock out DSL.using(jooq).* but this is not usable at a level that invites concise and reliable DAO testing)
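For reference, jOOQ's mocking SPI looks roughly like this minimal sketch - the provider only ever sees the rendered SQL and bind values, not the query object, which is why faking real query semantics on top of it is hard:

import java.sql.Connection;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;
import org.jooq.tools.jdbc.MockConnection;
import org.jooq.tools.jdbc.MockDataProvider;
import org.jooq.tools.jdbc.MockResult;

// Return an empty result for every statement; a real fake would have to
// dispatch on ctx.sql() and ctx.bindings() and build Results by hand.
MockDataProvider provider = ctx ->
    new MockResult[] { new MockResult(0, DSL.using(SQLDialect.POSTGRES).newResult()) };

Connection connection = new MockConnection(provider);
DSLContext dsl = DSL.using(connection, SQLDialect.POSTGRES);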

(3) If you have to run a database (or something like a database), then you need to be able to do joins, etc.  FTR I think it is short-sighted to try to do this using hash tables - H2 is designed just for this purpose!

(4) ... and this is the punch line: jOOQ captures a lot of meta-data when you run jOOQ's code generation, and it can generate the DDL for you in H2!

DSLContext dsl = DSL.using(yourJooqH2Config);

// The TABLE, UNIQUE and PRIMARY_KEY flags tell jOOQ which DDL meta-data to copy
// from your original schema into the H2 in-mem db. FOREIGN_KEY can be problematic.
// queries should contain only one query (the CREATE TABLE statement).
Query[] queries = dsl.ddl(tableToCreateInH2, TABLE, UNIQUE, PRIMARY_KEY).queries();
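From there, applying the generated DDL before each test is just a loop (assuming yourJooqH2Config wraps a live H2 connection, so the generated queries are already attached to it):

// Run each generated CREATE statement against the in-memory H2 database.
for (Query q : queries) {
    q.execute();
}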

Jacob G

Jul 4, 2016, 12:28:45 PM
to jOOQ User Group, ad...@honestbuildings.com
Thank you Adam. I appreciate the tips. I went ahead and tried to integrate H2, with the goal of comparing performance against both real Postgres and hash tables. But I quickly ran into Postgres compatibility issues:
  1. Postgres adds ::regclass to the end of nextval function calls on table columns. H2 doesn't like that. Fortunately, I was able to work around that by simply removing ::regclass from the DDL (see the one-liner after this list). :-)
  2. We extensively use jsonb columns and jsonb queries, and H2 appears not to support them. I suspect it may also not support the network address column types and functions, which we use a lot.
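The workaround, for reference - a sketch where ddl is whatever string holds the generated statement:

// Strip Postgres' ::regclass casts so H2 will accept the DDL.
String h2Ddl = ddl.replace("::regclass", "");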
If you have any suggestions for working around that, or a different approach, I'd appreciate it. I'm not sure we'll do an extensive re-"design for testability" for those things.

Adam Zell

Jul 4, 2016, 6:22:27 PM
to jOOQ User Group, ad...@honestbuildings.com

Jacob Gur

Jul 4, 2016, 8:55:42 PM
to jooq...@googlegroups.com, ad...@honestbuildings.com
Does that make tests run faster?

Samir Faci

Jul 5, 2016, 12:17:51 PM
to jooq...@googlegroups.com, ad...@honestbuildings.com

This is more of an integration test than a unit test - then again, if you're writing to a database server, it's an integration test in my book.  When we tried to execute our tests, we also ran into incompatibility issues between H2 and our code paths, which is why we ended up using a Postgres Docker container instead.

We use DbUnit to load and unload the data set, and the tests expect a Docker container running on localhost. Embedded Postgres could work as well.  I haven't looked into embedded Postgres, but you won't get 100% compatibility with any generic in-memory database.

My preference is to run a Postgres server (in memory or in Docker) to validate my code path.  As for the tests, I don't think either approach will really impact the speed of the implementation.

Jacob G

Jul 6, 2016, 7:33:40 AM
to jOOQ User Group, ad...@honestbuildings.com
I'm certainly no expert, and I don't have all the answers. But I have read thought leaders on unit testing write that unit testing a DAO layer class means testing against the real database. More generally, the idea is that any class which directly integrates with a 3rd-party component needs to be unit tested against that 3rd-party component.

When unit testing service layer classes, from trying different approaches, I'm finding the following guidelines to work well:
  • replace production DAO classes with test implementations that use hash tables (SO much faster)
  • mock other service classes (a minimal wiring sketch follows this list)
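Concretely, the test wiring ends up looking something like this (all type names are placeholders, and memoryRepositoryService is the registry class I pasted earlier):

import org.mockito.Mockito;

// Hypothetical service-layer test wiring: a hash-table-backed fake DAO plus
// a Mockito mock for a collaborating service.
IPersonDao personDao = new MemoryPersonDao(memoryRepositoryService); // fake, not mock
IEmailService emailService = Mockito.mock(IEmailService.class);      // mocked collaborator
PersonService service = new PersonService(personDao, emailService);  // class under test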
I am not finding that mocking DAO classes to verify function calls and/or stub indirect inputs/outputs is effective for me, because:
  • it makes unit test writing much more time-consuming, tedious, and difficult to understand
  • it makes the unit tests overly aware of, and dependent upon, the internal implementation of the class being tested
  • it makes tests more fragile
My goals for the unit tests are that they be fast, easy to read, and stable, and that they allow me to stay rigorously focused on the logic in the class being tested. The approach I am taking meets those goals. It's not a perfect solution. I don't think there is one. Everything is a trade-off, but I find this approach makes the trade-offs I prefer most.

Regarding replacing the database with a "fake database" using in-memory hash tables, I am pasting below an excerpt I read recently in the well-known xUnit Test Patterns book. It is an old book, but I find it still relevant:

Faster Tests Without Shared Fixtures

The first reaction to Slow Tests (page 253) is often to switch to a Shared Fixture (page 317) approach. Several other solutions are available, however. This sidebar describes some experiences on several projects.

Fake Database

On one of our early XP projects, we wrote a lot of tests that accessed the database. At first we used a Shared Fixture. When we encountered Interacting Tests (see Erratic Test on page 228) and later Test Run Wars (see Erratic Test), however, we changed to a Fresh Fixture (page 311) approach. Because these tests needed a fair bit of reference data, they were taking a long time to run. On average, for every read or write the SUT did to or from the database, each test did several more. It was taking 15 minutes to run the full test suite of several hundred tests, which greatly impeded our ability to integrate our work quickly and often.

At the time, we were using a data access layer to keep the SQL out of our code. We soon discovered that it allowed us to replace the real database with a functionally equivalent Fake Database (see Fake Object on page 551). We started out by using simple HashTables to store the objects against a key. This approach allowed us to run many of our simpler tests "in memory" rather than against the database. And that bought us a significant drop in test execution time.

Our persistence framework supported an object query interface. We were able to build an interpreter of the object queries that ran against our HashTable database implementation and that allowed the majority of our tests to work entirely in memory. On average, our tests ran about 50 times faster in memory than with the database. For example, a test suite that took 10 minutes to run with the database took 10 seconds to run in memory.

This approach was so successful that we have reused the same testing infrastructure on many of our subsequent projects. Using the faked-out persistence framework also means we don't have to bother with building a "real database" until our object models stabilize, which can be several months into the project.

Incremental Speedups

Ted O'Grady and Joseph King are agile team leads on a large (50-plus developers, subject matter experts, and testers) eXtreme Programming project. Like many project teams building database-centric applications, they suffered from Slow Tests. But they found a way around this problem: As of late 2005, their check-in test suite ran in less than 8 minutes compared to 8 hours for a full test run against the database. That is a pretty impressive speed difference. Here is their story:

Currently we have about 6,700 tests that we run on a regular basis. We've actually tried a few things to speed up the tests and they've evolved over time.
In January 2004, we were running our tests directly against a database via Toplink.
In June 2004, we modified the application so we could run tests against an in-memory, in-process Java database (HSQL). This cut the time to run in half.
In August 2004, we created a test-only framework that allowed Toplink to work without a database at all. That cut the time to run all the tests by a factor of 10.
In July 2005, we built a shared "check-in" test execution server that allowed us to run tests remotely. This didn't save any time at first but it has proven to be quite useful nonetheless.
In July 2005, we also started using a clustering framework that allowed us to run tests distributed across a network. This cut the time to run the tests in half.
In August 2005, we removed the GUI and Master Data (reference data crud) tests from the "check-in suite" and ran them only from Cruise Control. This cut the time to run by approximately 15% to 20%.

Since May 2004, we have also had Cruise Control run all the tests against the database at regular intervals. The time it takes Cruise Control to complete [the build and run the tests] has grown with the number of tests from an hour to nearly 8 hours now.
When a threshold has been met that prevents the developers from (a) running [the tests] frequently when developing and (b) creating long check-in queues as people wait for the token to check in, we have adapted by experimenting with new techniques. As a rule we try to keep the running of the tests under 5 minutes, with anything over 8 minutes being a trigger to try something new.
We have resisted thus far the temptation to run only a subset of the tests and instead focused on ways to speed up running all the tests—although as you can see, we have begun removing the tests developers must run continuously (e.g., Master Data and GUI test suites are not required to check in, as they are run by Cruise Control and are areas that change infrequently).
Two of the most interesting solutions recently (aside from the in-memory framework) are the test server and the clustering framework.

The test server (named the "check-in" box here) is actually quite useful and has proven to be reliable and robust. We bought an Opteron box that is roughly twice as fast as the development boxes (really, the fastest box we could find). The server has an account set up for each development machine in the pit. Using the UNIX tool rsync, the Eclipse workspace is synchronized with the user's corresponding server account file system. A series of shell scripts then recreates the database on the server for the remote account and runs all the development tests. When the tests have completed, a list of times to run each test is dumped to the console, along with a MyTestSuite.java class containing all the test failures, which the developer can use to run locally to fix any tests that have broken. The biggest advantage the remote server has provided is that it makes running a large number of tests feel fast again, because the developer can continue working while he or she waits for the results of the test server to come back.
The clustering framework (based on Condor) was quite fast but had the defect that it had to ship the entire workspace (11MB) to all the nodes on the network (×20), which had a significant cost, especially when a dozen pairs are using it. In comparison, the test server uses rsync, which copies only the files that are new or different in the developer's workspace. The clustering framework also proved to be less reliable than the server solution, frequently not returning any status of the test run. There were also some tests that would not run reliably on the framework. Since it gave us roughly the same performance as the "check-in" test server, we have put this solution on the back burner.

Further Reading

A more detailed description of the first experience can be found at http://FasterTestsPaper.gerardmeszaros.com.
