Please suggest how to improve performance with MapDBBlockingDatabase


Venkatesh T

Jul 3, 2017, 10:50:13 AM
to duke

Dear Lars,

First of all, thank you very much for this wonderful framework. I am doing my best to understand it and apply it to my requirement. I have around 150 million customer records in a database table that I am trying to deduplicate. I ran a proof of concept (POC) on 1 million records, and the deduplication completed successfully. When I ran the same program on all 150 million records, however, it did not succeed: it started at around 2.8 million records per day, slowed down considerably as time went on, and eventually was progressing extremely slowly. I would appreciate your advice on how to approach this.

The attributes we compare are Name, Address, and Country. Please note that both the POC and the full run used MapDBBlockingDatabase with the newDirectMemoryDB option.

The blocking key strategies are as follows:

Key 1: first 5 alpha chars of Name + last token of Name + first 5 alpha chars of Address + Country name

Key 2: first 3 alpha chars of Address + penultimate token of Address + Country
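
For concreteness, here is a minimal Java sketch of how the two keys above could be computed. It is an illustration only, not the actual key function code used in the run: the class name, the helper names, and the "|" separator between key parts are hypothetical, and it deliberately does not use the Duke API.

```java
import java.util.Locale;

public class BlockingKeySketch {

    // First n alphabetic characters of s, lowercased (fewer if s has fewer letters).
    static String firstAlpha(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length() && sb.length() < n; i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Whitespace-separated token counted from the end (1 = last, 2 = penultimate).
    // Returns "" when the string has too few tokens.
    static String tokenFromEnd(String s, int fromEnd) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        int idx = tokens.length - fromEnd;
        return idx >= 0 ? tokens[idx] : "";
    }

    // Key 1: first 5 letters of name + last name token + first 5 letters of address + country.
    static String key1(String name, String address, String country) {
        return firstAlpha(name, 5) + "|" + tokenFromEnd(name, 1) + "|"
                + firstAlpha(address, 5) + "|" + country.trim().toLowerCase(Locale.ROOT);
    }

    // Key 2: first 3 letters of address + penultimate address token + country.
    static String key2(String address, String country) {
        return firstAlpha(address, 3) + "|" + tokenFromEnd(address, 2) + "|"
                + country.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Made-up sample record, not taken from the real data.
        System.out.println(key1("John A. Smith", "12 Baker Street London", "UK"));
        System.out.println(key2("12 Baker Street London", "UK"));
    }
}
```

Records that share either key end up in the same block and are compared with each other, so the number of records per key value determines how much comparison work each new record triggers.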

 

Processing 0.1 million records took nearly 1 hour on a server with 51 GB of RAM and 24 CPU cores. I used 20 threads and set MaxDirectMemorySize=30G.
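
As a rough back-of-the-envelope check, the sketch below extrapolates the two rates quoted in this post to the full 150 million records. It assumes, optimistically, that throughput stays constant, which is not what I observed. The class and method names are made up for illustration.

```java
import java.util.Locale;

public class ThroughputEstimate {

    // Days needed to process totalRecords at a constant rate of recordsPerHour.
    static double daysToFinish(long totalRecords, double recordsPerHour) {
        return totalRecords / recordsPerHour / 24.0;
    }

    public static void main(String[] args) {
        long total = 150_000_000L;

        // Rate from the small run: 0.1 million records in about 1 hour.
        double perHourSmallRun = 100_000.0;

        // Initial rate of the full run: about 2.8 million records per day.
        double perHourFullRun = 2_800_000.0 / 24.0;

        System.out.printf(Locale.ROOT, "At 0.1 million/hour: %.1f days%n",
                daysToFinish(total, perHourSmallRun));
        System.out.printf(Locale.ROOT, "At 2.8 million/day:  %.1f days%n",
                daysToFinish(total, perHourFullRun));
    }
}
```

Even without the slowdown, both rates work out to roughly two months of run time for the full table.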

 

I have now modified the config file to fall back to a memory-mapped file. What I would like to know is what I can do to make the deduplication work on 150 million records.

The config file I am using is below.

<duke>
  <schema>
    <threshold>0.85</threshold>

    <property type="id">
      <name>ID</name>
    </property>

    <property>
      <name>NAME</name>
      <comparator>no.priv.garshol.duke.comparators.PersonNameComparator</comparator>
      <low>0.4</low>
      <high>0.81</high>
    </property>

    <property>
      <name>ADDRESS</name>
      <comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
      <low>0.4</low>
      <high>0.81</high>
    </property>

    <property>
      <name>COUNTRY</name>
      <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
      <low>0.4</low>
      <high>0.6</high>
    </property>
  </schema>

  <database class="no.priv.garshol.duke.databases.MapDBBlockingDatabase">
    <param name="notxn" value="true"/>
    <param name="async" value="true"/>
    <param name="compression" value="true"/>
    <param name="snapshot" value="true"/>
    <param name="file" value="/home/usrtest/xxxxxxxx/dedupemapdb/mapdbfile.map"/>
    <param name="mmap" value="true"/>
  </database>

  <jdbc>
    <param name="driver-class" value="org.postgresql.Driver"/>
    <param name="connection-string" value="jdbc:postgresql://10.xxx.xx.xx:xxxx/testdb"/>
    <param name="user-name" value="postgres"/>
    <param name="password" value="postgres"/>
    <param name="query" value="SELECT sequence_id,customername,customer_address,country from testtable where customer_address is not null and country is not null order by sequence_id"/>

    <column name="sequence_id" property="ID"/>
    <column name="customername" property="NAME" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    <column name="customer_address" property="ADDRESS" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    <column name="country" property="COUNTRY" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
  </jdbc>
</duke>

I also tried experimenting with the databasestatistics Java file, but I could not understand the numbers it prints. If possible, please guide me on how to use it.
