Hi,
I'm trying to use MapDB in my project. Concretely, I want to use MapDB to store on disk data loaded from an origin datasource, for those cases in which the size of the datasource exceeds main memory. So I read each value from the datasource and put it into MapDB. Finally, the values stored in MapDB can be passed to a target datasource. Both the origin and the target datasource can be a database (for instance, MySQL) or a file (for instance, in CSV format).
I use the following configuration:
DB db = DBMaker.newFileDB(new File(filename))
        .transactionDisable()      // no write-ahead log
        .asyncWriteEnable()        // writes go through a background thread
        .mmapFileEnablePartial()   // partial mmap, safe on 32-bit JVMs
        .deleteFilesAfterClose()
        .closeOnJvmShutdown()
        .make();
ConcurrentNavigableMap<Integer, Object> map =
        db.createTreeMap(sName)
            .counterEnable()       // maintain a fast size() counter
            .makeOrGet();
NOTE: I use mmapFileEnablePartial rather than mmapFileEnable because my project is a library that could be used by other developers on both 32-bit and 64-bit JVMs. I've read several posts that mention problems using mmapFileEnable on 32-bit architectures, so I avoid mmapFileEnable. Is that right?
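If it helps, the choice could also be made at runtime. Here is a minimal sketch, with the caveat that the sun.arch.data.model property is HotSpot-specific (the os.arch fallback is standard), and that newer MapDB versions may offer a helper that picks the mmap mode for you:

```java
public class MmapSupport {
    /** Returns true on a 64-bit JVM, where full mmap is generally considered safe. */
    public static boolean is64Bit() {
        String model = System.getProperty("sun.arch.data.model");  // HotSpot-specific, may be null
        if (model != null) {
            return model.contains("64");
        }
        // Fallback: os.arch is defined on all JVMs
        return System.getProperty("os.arch", "").contains("64");
    }

    public static void main(String[] args) {
        System.out.println("64-bit JVM: " + is64Bit());
    }
}
```

With that flag you could call mmapFileEnable() on 64-bit JVMs and fall back to mmapFileEnablePartial() otherwise.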
I've done some tests using 1000 blocks and 250,000 values per block:
- Using MapDB outside my project: the add and get operations take 601 and 197 seconds respectively.
- Using MapDB inside my project, with the db and map configuration above: they take 3942 and 3423 seconds respectively.
As you can see, this is a significant difference. I believe it happens because the on-disk data are read and written in a non-sequential way.
When data are loaded from the origin datasource:
1. Read a value from the origin datasource.
2. Write the read value into the MapDB map.
3. Repeat steps 1 and 2 until all values have been passed to disk (MapDB).
When data loaded in my project (in MapDB maps) are passed to the target datasource:
1. Read a value from disk (MapDB).
2. Write the read value to the target datasource.
3. Repeat steps 1 and 2 until all values have been passed to the target.
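The two loops above amount to something like the following sketch. For illustration, the origin and target datasources are in-memory lists and a ConcurrentSkipListMap stands in for the on-disk MapDB TreeMap; in the real project the map would be the ConcurrentNavigableMap returned by db.createTreeMap(...).makeOrGet():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class CopyThroughMap {
    public static void main(String[] args) {
        // Origin datasource (stand-in for MySQL / a CSV file)
        List<String> origin = List.of("a", "b", "c");

        // Stand-in for the MapDB TreeMap
        Map<Integer, String> map = new ConcurrentSkipListMap<>();

        // Phase 1: read each value from the origin and write it into the map
        int key = 0;
        for (String value : origin) {
            map.put(key++, value);
        }

        // Phase 2: read each value back (in ascending key order)
        // and write it to the target datasource
        List<String> target = new ArrayList<>();
        for (String value : map.values()) {
            target.add(value);
        }

        System.out.println(target);   // [a, b, c]
    }
}
```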
I've tried to improve performance by increasing the cacheSize (until now, I used the default size of 32768). I increased it to 131,072 values (32768 * 4, i.e. 128K); with that, it takes 2323 and 2172 seconds respectively to add and get all values.
Although performance improves, it isn't enough.
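For reference, the cache-size change described above would look like this in the DBMaker chain (a sketch against the MapDB 1.x API used elsewhere in this thread; note that cacheSize counts cached entries, not bytes):

```java
DB db = DBMaker.newFileDB(new File(filename))
        .transactionDisable()
        .asyncWriteEnable()
        .mmapFileEnablePartial()
        .cacheSize(131072)   // 4x the default of 32768 entries
        .make();
```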
I'm wondering whether I'm using MapDB incorrectly or whether a better configuration could improve performance.
Thanks in advance!!
HashMap is better suited for large keys; TreeMap is suited for smaller keys. TreeMap may also be a good choice for larger keys which can be delta-packed (strings, tuples).
"""
Also with treemap you can try:
"""
DB.BTreeMapMaker | valuesOutsideNodesEnable() by default values are stored inside BTree Nodes. |
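A minimal sketch of that option, assuming the MapDB 1.x builder API used elsewhere in this thread:

```java
// Values are stored in separate records, so BTree nodes stay small;
// this helps when values are large relative to keys.
ConcurrentNavigableMap<Integer, Object> map =
        db.createTreeMap("data")
            .valuesOutsideNodesEnable()
            .makeOrGet();
```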
That's because my project has three steps:
1. Store the dataset into MapDB from an origin datasource.
2. Execute one or more algorithms on the stored dataset (optional step).
3. Store the dataset into a target datasource.
I use MapDB to store data on disk when they don't fit in memory.
Right now, I'm just performing some tests to check performance in steps 1 and 2.
So I think I'm mirroring the real use case of my project.
Hi,
1) Use the data pump to create the TreeMap:
Map<String,Integer> map = db.createTreeMap("map")
.pumpSource(source,valueExtractor)
//.pumpPresort(100000) // for presorting data we could also use this method
.keySerializer(keySerializer)
.make();
https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Huge_Insert.java
2) If values are large, enable the .valuesOutsideNodesEnable() option on the TreeMap. That should improve performance.
This problem is improved a lot in 2.0
Jan
--
You received this message because you are subscribed to the Google Groups "MapDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mapdb+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.