Hi,
I'm trying to use MapDB in my project. Concretely, I want to use MapDB to store on disk data loaded from an origin datasource, for those cases in which the size of the datasource exceeds main memory. So I read each value from the datasource and put it into MapDB. Finally, the values stored in MapDB can be passed to a target datasource. Both the origin and the target datasource can be a database (for instance, MySQL) or a file (for instance, in CSV format).
I use the following configuration:
DB db = DBMaker.newFileDB(new File(filename))
        .transactionDisable()      // no write-ahead log
        .asyncWriteEnable()        // writes go through a background thread
        .mmapFileEnablePartial()   // partial mmap, safe on 32-bit JVMs
        .deleteFilesAfterClose()
        .closeOnJvmShutdown()
        .make();
ConcurrentNavigableMap<Integer, Object> map =
        db.createTreeMap(sName)
            .counterEnable()       // maintain a fast size() counter
            .makeOrGet();
NOTE: I use mmapFileEnablePartial rather than mmapFileEnable because my project is a library that could be used by other developers on both 32-bit and 64-bit JVMs. I've read several posts that mention problems using mmapFileEnable on 32-bit architectures, so I avoid mmapFileEnable. Is that right?
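If it helps, the choice could also be made at runtime. Here is a minimal sketch, with the caveat that the sun.arch.data.model property is HotSpot-specific (the os.arch fallback is standard), and that newer MapDB versions may offer a helper that picks the mmap mode for you:

```java
public class MmapSupport {
    /** Returns true on a 64-bit JVM, where full mmap is generally considered safe. */
    public static boolean is64Bit() {
        String model = System.getProperty("sun.arch.data.model");  // HotSpot-specific, may be null
        if (model != null) {
            return model.contains("64");
        }
        // Fallback: os.arch is defined on all JVMs
        return System.getProperty("os.arch", "").contains("64");
    }

    public static void main(String[] args) {
        System.out.println("64-bit JVM: " + is64Bit());
    }
}
```

With that flag you could call mmapFileEnable() on 64-bit JVMs and fall back to mmapFileEnablePartial() otherwise.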
I've done some tests using 1000 blocks and 250,000 values per block:
- Using MapDB outside my project: the add and get operations take 601 and 197 seconds respectively.
- Using MapDB inside my project, with the db and map configuration above: they take 3942 and 3423 seconds respectively.
As you can see, this is a significant difference. I believe it happens because the on-disk data are read and written in a non-sequential way.
When data are loaded from the origin datasource:
1. Read a value from the origin datasource.
2. Write the read value into the MapDB map.
3. Repeat steps 1 and 2 until all values have been passed to disk (MapDB).
When data loaded in my project (in MapDB maps) are passed to the target datasource:
1. Read a value from disk (MapDB).
2. Write the read value to the target datasource.
3. Repeat steps 1 and 2 until all values have been passed to the target.
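The two loops above amount to something like the following sketch. For illustration, the origin and target datasources are in-memory lists and a ConcurrentSkipListMap stands in for the on-disk MapDB TreeMap; in the real project the map would be the ConcurrentNavigableMap returned by db.createTreeMap(...).makeOrGet():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class CopyThroughMap {
    public static void main(String[] args) {
        // Origin datasource (stand-in for MySQL / a CSV file)
        List<String> origin = List.of("a", "b", "c");

        // Stand-in for the MapDB TreeMap
        Map<Integer, String> map = new ConcurrentSkipListMap<>();

        // Phase 1: read each value from the origin and write it into the map
        int key = 0;
        for (String value : origin) {
            map.put(key++, value);
        }

        // Phase 2: read each value back (in ascending key order)
        // and write it to the target datasource
        List<String> target = new ArrayList<>();
        for (String value : map.values()) {
            target.add(value);
        }

        System.out.println(target);   // [a, b, c]
    }
}
```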
I've tried to improve performance by increasing the cacheSize (until now, I used the default size of 32768). I increased it to 131,072 values (32768 * 4, i.e. 128K); with that, it takes 2323 and 2172 seconds respectively to add and get all values.
Although performance improves, it isn't enough.
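For reference, the cache-size change described above would look like this in the DBMaker chain (a sketch against the MapDB 1.x API used elsewhere in this thread; note that cacheSize counts cached entries, not bytes):

```java
DB db = DBMaker.newFileDB(new File(filename))
        .transactionDisable()
        .asyncWriteEnable()
        .mmapFileEnablePartial()
        .cacheSize(131072)   // 4x the default of 32768 entries
        .make();
```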
I'm wondering whether I'm using MapDB incorrectly or whether a better configuration could improve performance.
Thanks in advance!!
HashMap is better suited for large keys; TreeMap is suited for smaller keys. TreeMap may also be a good choice for larger keys which can be delta-packed (strings, tuples).
"""
Also with treemap you can try:
"""
DB.BTreeMapMaker | valuesOutsideNodesEnable() by default values are stored inside BTree Nodes. |
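A minimal sketch of that option, assuming the MapDB 1.x builder API used elsewhere in this thread:

```java
// Values are stored in separate records, so BTree nodes stay small;
// this helps when values are large relative to keys.
ConcurrentNavigableMap<Integer, Object> map =
        db.createTreeMap("data")
            .valuesOutsideNodesEnable()
            .makeOrGet();
```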
That's because my project has three steps:
1. Store the dataset into MapDB from an origin datasource.
2. Execute one or more algorithms on the stored dataset (optional step).
3. Store the dataset into a target datasource.
I use MapDB to store data on disk when they don't fit in memory.
Right now, I'm just performing some tests to check performance in steps 1 and 2.
So I think I'm mirroring the real use case of my project.
Hi,
1) Use the data pump to create the TreeMap:
Map<String,Integer> map = db.createTreeMap("map")
.pumpSource(source,valueExtractor)
//.pumpPresort(100000) // for presorting data we could also use this method
.keySerializer(keySerializer)
.make();
https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Huge_Insert.java
2) If values are large, enable the .valuesOutsideNodesEnable() option on the TreeMap. That should improve performance.
This problem is improved a lot in 2.0
Jan
--
You received this message because you are subscribed to the Google Groups "MapDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mapdb+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.