MapDB Pump: performance issue loading huge data into a map


Krishna Kumar Karanam

Feb 20, 2015, 4:03:40 PM
to ma...@googlegroups.com
Hi All,
  We have been using MapDB for in-memory operations and would now like to store the data on disk. It is taking more than a minute to load one million records into a tree map.
Following is the sample code:

 DB db = DBMaker.newFileDB(new File("C:\\BigMem\\aggr1"))
                .compressionEnable()
                .asyncWriteEnable()
                .deleteFilesAfterClose()
                .transactionDisable()
                .closeOnJvmShutdown()
                .make();
              
                final int max = 1000000;
                
                Iterator<Pair<String, Map<String,Object>>> entriesSourceNonComp = new Iterator<Pair<String,Map<String,Object>>>() {
                        int count = 0;
                        @Override
                        public void remove() { throw new UnsupportedOperationException(); }

                        @Override
                        public Pair<String, Map<String,Object>> next() {
                                count++;

                                String key ="SAME KEY"+count;
                                String value="value"+count;
                                Map<String,Object> record = new ConcurrentHashMap<String, Object>();
                                record.put(key, value);

                                Pair<String,Map<String,Object>> ret = new Pair<String,Map<String,Object>>(key,record);
                                return ret;
                        }

                        @Override
                        public boolean hasNext() {
                                return count<max;
                        }
                };

                Comparator<Pair<String, Map<String,Object>>> comp = new Comparator<Pair<String, Map<String,Object>>>() {
                        @Override
                        public int compare(Pair<String, Map<String, Object>> paramT1,
                                           Pair<String, Map<String, Object>> paramT2) {
                                return paramT1.a.compareToIgnoreCase(paramT2.a);
                        }
                };

                entriesSourceNonComp = Pump.sort(entriesSourceNonComp,
                                true, 100000,
                                Collections.reverseOrder(comp), // reverse-order comparator
                                db.getDefaultSerializer()
                                );
                System.out.println("start time " + new Date());
               Map<String,Map<String,Object>>    map2 = db.createTreeMap("non comparable values")
                                .pumpSource(entriesSourceNonComp).valuesOutsideNodesEnable()
                                .pumpIgnoreDuplicates()
                                .counterEnable()
                                .make();
                System.out.println("end time " + new Date());


Output times:


start time Fri Feb 20 16:00:48 EST 2015
end time Fri Feb 20 16:02:03 EST 2015

We are going to load 300 to 400 million records into a disk-based cache. Please let me know how to improve the performance.


Regards,
Krishna



Michael Charnoky

Feb 23, 2015, 10:19:03 AM
to ma...@googlegroups.com

Krishna Kumar Karanam

Feb 24, 2015, 9:20:20 AM
to ma...@googlegroups.com

Michael,
Thank you very much for the quick reply. I have tried all the options mentioned in the post; however, there is no improvement in performance.

I found the issue and tried the options below to improve performance:


1. .valuesOutsideNodesEnable() makes loading the data into the tree map take longer; disabling it improves performance.
2. The actual issue is with the Java serializer, which performs poorly while loading the entries. I converted each map to a byte[] before loading it into the BTreeMap (see the Converter sketch after the code below), so performance improves by using Serializer.BYTE_ARRAY.
3. Pump.sort() takes most of the time, sorting the elements before loading; another issue is that we get an OutOfMemoryError while sorting the elements with less memory.


        /** max number of elements to import */
        final long max = (long) 1e6;

        /**
         * Open database in temporary directory
         */
        File dbFile =new File("c:\\temp_sq\\temp.tab");
        DB db = DBMaker
                .newFileDB(dbFile)
                //.cacheSize(100000)
                .transactionDisable()
                .deleteFilesAfterClose()
                .mmapFileEnableIfSupported()
                .asyncWriteEnable()
                .make();
      
      
        long time = System.currentTimeMillis();
        
        Iterator<Tuple2<String,byte[]>> entriesSourceNonComp = new Iterator<Tuple2<String,byte[]>>() {
            int count = 0;
            @Override
            public void remove() { throw new UnsupportedOperationException(); }

            @Override
            public Tuple2<String, byte[]> next() {
                    count++;

                    String key ="SAME KEY"+count;
                    String value="value"+count;
                    String key1 ="SAME KEY1"+count;
                    Date value1=new Date(System.currentTimeMillis());
                    
                    Map<String,Object> record = new ConcurrentHashMap<String, Object>();
                    record.put(key, value);
                    record.put(key1,value1);

                    Tuple2<String,byte[]> ret = new Tuple2<String,byte[]>(key,Converter.getBytes(record));
                    return ret;
            }

            @Override
            public boolean hasNext() {
                    return count<max;
            }
    };

        Comparator<Tuple2<String,byte[]>> comp = new Comparator<Tuple2<String,byte[]>>() {
                @Override
                public int compare(Tuple2<String,byte[]> paramT1,
                                   Tuple2<String,byte[]> paramT2) {
                        return paramT1.a.compareToIgnoreCase(paramT2.a);
                }
        };

        entriesSourceNonComp = Pump.sort(entriesSourceNonComp,
                true, 100000,
                Collections.reverseOrder(comp), // reverse-order comparator
                Serializer.JAVA
                );
 System.out.println("Sorting; total time: "+(System.currentTimeMillis()-time)/1000);
  time = System.currentTimeMillis();
 
       Map<String,byte[]>    map2 = db.createTreeMap("mymap")
                        .pumpSource(entriesSourceNonComp)
                        .valueSerializer(Serializer.BYTE_ARRAY)
                       // .keySerializer(BTreeKeySerializer.STRING)
                      //  .pumpPresort(100000)
                        .make();
       
       System.out.println("Finished; total time: "+(System.currentTimeMillis()-time)/1000+"s; there are "+map2.size()+" items in map");
       db.close();


Now I have reduced the time to 22 seconds (19 seconds for sorting + 3 seconds for loading) by using the byte array serializer.
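
(Converter.getBytes is just our own helper, not part of MapDB. A minimal sketch using plain Java serialization would look roughly like this; our real implementation may differ.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public final class Converter {

        // Turn any Serializable object into a byte[] so values can be stored
        // with Serializer.BYTE_ARRAY instead of the slow Serializer.JAVA.
        // Note: this sketch is an assumption about the helper, not its real code.
        public static byte[] getBytes(Object obj) {
                try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
                     ObjectOutputStream out = new ObjectOutputStream(bos)) {
                        out.writeObject(obj);
                        out.flush();
                        return bos.toByteArray();
                } catch (IOException e) {
                        throw new RuntimeException("Serialization failed", e);
                }
        }
}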

Regards,
Krishna

Jan Kotek

Mar 1, 2015, 3:40:31 PM
to ma...@googlegroups.com

Hi,


> 3. Pump.sort() takes most of the time, sorting the elements before loading; another issue is that we get an OutOfMemoryError while sorting the elements with less memory.


Sorting is done in chunks which should fit into memory; if you use large values, change this parameter:


.pumpPresort(100000)
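
For context, pumpPresort goes on the tree map builder, so you can hand the pump an unsorted iterator and skip the manual Pump.sort call. A rough sketch (unsortedEntries stands for your Tuple2 iterator from the earlier post; the chunk size is just an example, smaller chunks need less heap):

Map<String, byte[]> map = db.createTreeMap("mymap")
        .pumpSource(unsortedEntries)        // iterator does not need pre-sorting
        .pumpPresort(10000)                 // presort in chunks of 10k entries
        .valueSerializer(Serializer.BYTE_ARRAY)
        .make();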


I am afraid there is no easy fix for this case. I think the major problem here is the large values. The data pump in 1.0 is a bit crude, so it stores everything in temp folders in serialized form. It also uses slow default serialization.


This should be fixed in 2.0 by some extra functionality, but that is not finished yet.


So far the only option I can think of is to reverse-sort the keys yourself and avoid the sorter in the pump.
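
Roughly like this (only a sketch; I am assuming the keys can be generated directly in descending order, and Converter.getBytes is the helper from your earlier post; note the zero-padding so string order matches numeric order):

// Emit entries already in reverse (descending) key order, which is what
// the 1.0 pump expects, so no sort pass is needed at all.
Iterator<Fun.Tuple2<String, byte[]>> presorted = new Iterator<Fun.Tuple2<String, byte[]>>() {
        int count = (int) max;

        @Override
        public boolean hasNext() { return count > 0; }

        @Override
        public Fun.Tuple2<String, byte[]> next() {
                int c = count--;
                // zero-pad so lexicographic order matches numeric order
                String key = String.format("SAME KEY%09d", c);
                Map<String, Object> record = new ConcurrentHashMap<String, Object>();
                record.put(key, "value" + c);
                return new Fun.Tuple2<String, byte[]>(key, Converter.getBytes(record));
        }

        @Override
        public void remove() { throw new UnsupportedOperationException(); }
};

Map<String, byte[]> map2 = db.createTreeMap("mymap")
        .pumpSource(presorted)              // no Pump.sort, no pumpPresort
        .valueSerializer(Serializer.BYTE_ARRAY)
        .make();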


Jan



