Please suggest how to improve performance using MapDBBlockingDatabase

Venkatesh T

Jul 3, 2017, 10:50:13 AM

to duke

Dear Lars,

First of all, thank you very much for this wonderful framework. I am trying my best to understand it and use it for my requirement. I have around 150 million customer records in my database table, and I am trying to deduplicate them. I did a POC with 1 million records, and the deduplication ran well and succeeded. But when I ran the same program on all 150 million records, it failed to finish. It initially processed around 2.8 million records per day, but slowed down considerably as time progressed and eventually made almost no progress. I would appreciate your help on how I should approach this.

The attributes we are using are Name, Address, and Country. Please note that I ran both the POC and the final program with MapDBBlockingDatabase, using newDirectMemoryDB as the option.

The key strategies used are as follows:

Key 1: First 5 alpha chars of Name + last token of Name + first 5 alpha chars of Address + Country Name

Key 2: First 3 alpha chars of Address + penultimate token of Address + Country.
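For reference, the two key strategies could be sketched roughly as below. This is only an illustration in plain Java, not the actual key functions used and not Duke API: the class name `BlockingKeys`, the helper names, the `|` separator, the upper-casing, and the handling of missing tokens (empty string) are all assumptions.

```java
import java.util.Locale;

// Hypothetical helper class; names and normalisation rules are assumptions.
public class BlockingKeys {

    // First n letters of s (upper-cased), skipping digits, spaces, and punctuation.
    static String firstAlpha(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length() && sb.length() < n; i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toUpperCase(c));
            }
        }
        return sb.toString();
    }

    // Whitespace-separated token counted from the end: 1 = last, 2 = penultimate.
    // Returns an empty string when the value has fewer than k tokens (assumption).
    static String tokenFromEnd(String s, int k) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        return tokens.length >= k
                ? tokens[tokens.length - k].toUpperCase(Locale.ROOT)
                : "";
    }

    // Key 1: first 5 alpha chars of Name + last token of Name
    //        + first 5 alpha chars of Address + Country.
    static String key1(String name, String address, String country) {
        return firstAlpha(name, 5) + "|" + tokenFromEnd(name, 1) + "|"
                + firstAlpha(address, 5) + "|" + country.trim().toUpperCase(Locale.ROOT);
    }

    // Key 2: first 3 alpha chars of Address + penultimate token of Address + Country.
    static String key2(String address, String country) {
        return firstAlpha(address, 3) + "|" + tokenFromEnd(address, 2) + "|"
                + country.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        System.out.println(key1("John A. Smith", "12 Baker Street London", "UK"));
        // prints JOHNA|SMITH|BAKER|UK
        System.out.println(key2("12 Baker Street London", "UK"));
        // prints BAK|STREET|UK
    }
}
```

Records that produce the same key value land in the same block and get compared with each other, so a key that many records share (for example a very common street prefix within one country) produces large blocks and many comparisons.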

It took nearly 1 hour to process 0.1 million records on a server with 51 GB of RAM and 24 CPU cores. I used 20 threads and set MaxDirectMemorySize=30G.
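To put these rates in perspective, here is a back-of-the-envelope sketch of the total run time. It assumes the throughput stays constant, which is optimistic given the slowdown observed, so the results are lower bounds. The class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper; assumes a constant processing rate.
public class ThroughputEstimate {

    // Days needed to process totalRecords at a fixed rate of recordsPerDay.
    static double daysNeeded(long totalRecords, double recordsPerDay) {
        return totalRecords / recordsPerDay;
    }

    public static void main(String[] args) {
        long total = 150_000_000L;

        // Initial rate reported for the full run: about 2.8 million records per day.
        System.out.println(String.format(Locale.ROOT, "%.1f days",
                daysNeeded(total, 2_800_000)));       // prints 53.6 days

        // Rate of 0.1 million records per hour, i.e. 2.4 million per day.
        System.out.println(String.format(Locale.ROOT, "%.1f days",
                daysNeeded(total, 100_000 * 24.0)));  // prints 62.5 days
    }
}
```

Even without any slowdown, a single pass at these rates would take roughly two months.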

I have now modified the config file and fallen back to using a memory-mapped file. But I would like to know what I can do to make the deduplication work on 150 million records.

Please find below the config file I am using.

<duke>

</property>

<comparator>no.priv.garshol.duke.comparators.PersonNameComparator</comparator>

</property>

<name>ADDRESS</name>

<comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>

</property>

<name>COUNTRY</name>

<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>

</property>

</schema>

</database>

<jdbc>

</jdbc>

</duke>

I also tried experimenting with the databasestatistics Java file, but I am not able to understand the numbers it prints. If possible, kindly guide me on how to use it.
