Dear Lars,
First of all, thank you very much for this wonderful framework. I am doing my best to understand it and apply it to my requirement. I have around 150 million customer records in a database table and am trying to deduplicate them. I did a POC with 1 million records, and the deduplication ran successfully. But when I ran the same program on all 150 million records, it did not get through them: it initially processed around 2.8 million records per day, then slowed down considerably as time went on, and is now progressing only very slowly. I would appreciate your advice on how I should approach this.
The attributes we are using are Name, Address, and Country. Please note that I ran both the POC and the full program with MapDBBlockingDatabase, using the newDirectMemoryDB option.
The key strategies used are as follows:
Key 1: First 5 alpha chars of Name + last token of Name + first 5 alpha chars of Address + Country Name.
Key 2: First 3 alpha chars of Address + penultimate token of Address + Country.
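As a minimal illustration of the two key strategies above, the sketch below builds both keys from plain strings. The class name BlockingKeys, the helper names, the "|" separator, the lowercasing, and the sample values are all assumptions made for illustration; they are not taken from the actual program or from Duke's API.

```java
import java.util.Locale;

// Hypothetical helper class; not part of Duke.
final class BlockingKeys {

    // Returns the first n ASCII letters of s, lowercased (fewer if s has fewer letters).
    static String firstAlpha(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length() && sb.length() < n; i++) {
            char c = s.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Returns the k-th whitespace-separated token counted from the end
    // (1 = last, 2 = penultimate), lowercased; "" if there is no such token.
    static String tokenFromEnd(String s, int k) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < k) {
            return "";
        }
        return tokens[tokens.length - k].toLowerCase(Locale.ROOT);
    }

    // Key 1: first 5 letters of name + last token of name
    //        + first 5 letters of address + country.
    static String key1(String name, String address, String country) {
        return firstAlpha(name, 5) + "|" + tokenFromEnd(name, 1) + "|"
                + firstAlpha(address, 5) + "|" + country.trim().toLowerCase(Locale.ROOT);
    }

    // Key 2: first 3 letters of address + penultimate token of address + country.
    static String key2(String address, String country) {
        return firstAlpha(address, 3) + "|" + tokenFromEnd(address, 2) + "|"
                + country.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        System.out.println(key1("John A. Smith", "12 Baker Street London", "UK"));
        System.out.println(key2("12 Baker Street London", "UK"));
    }
}
```

With these assumptions, the sample record gets the keys johna|smith|baker|uk and bak|street|uk. Records are only compared when they share at least one key, so the number of records that end up under the same key determines how many comparisons each record triggers.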
It took nearly 1 hour to process 0.1 million records on a server with 51 GB of RAM and 24 CPU cores. I used 20 threads and set MaxDirectMemorySize=30G.
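To put these rates in perspective, here is a rough back-of-the-envelope sketch of how long 150 million records would take if throughput stayed constant. That is an optimistic assumption, since the observed rate drops over time. The class and method names are hypothetical.

```java
// Hypothetical helper for a constant-throughput estimate; not part of Duke.
final class ThroughputEstimate {

    // Days needed to process totalRecords at a fixed rate of recordsPerDay.
    static double daysNeeded(double totalRecords, double recordsPerDay) {
        return totalRecords / recordsPerDay;
    }

    public static void main(String[] args) {
        double total = 150_000_000;
        // Rate observed at the start of the full run: about 2.8 million records per day.
        System.out.println("days at 2.8M/day:  " + daysNeeded(total, 2_800_000));
        // Rate observed in the smaller test: about 0.1 million records per hour.
        System.out.println("days at 0.1M/hour: " + daysNeeded(total, 100_000 * 24.0));
    }
}
```

Even at the initial rates, the full run would need roughly 54 to 63 days, so the slowdown only makes an already long run longer.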
I have now modified the config file and fallen back to using a memory-mapped file. But I would like to know what I can do to make the deduplication work on 150 million records.
Please find below the config file I am using.
<duke>
<schema>
<threshold>0.85</threshold>
<property type="id">
<name>ID</name>
</property>
<property>
<name>NAME</name>
<comparator>no.priv.garshol.duke.comparators.PersonNameComparator</comparator>
<low>0.4</low>
<high>0.81</high>
</property>
<property>
<name>ADDRESS</name>
<comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
<low>0.4</low>
<high>0.81</high>
</property>
<property>
<name>COUNTRY</name>
<comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
<low>0.4</low>
<high>0.6</high>
</property>
</schema>
<database class="no.priv.garshol.duke.databases.MapDBBlockingDatabase">
<param name="notxn" value="true"/>
<param name="async" value="true"/>
<param name="compression" value="true"/>
<param name="snapshot" value="true"/>
<param name="file" value="/home/usrtest/xxxxxxxx/dedupemapdb/mapdbfile.map"/>
<param name="mmap" value="true"/>
</database>
<jdbc>
<param name="driver-class" value="org.postgresql.Driver"/>
<param name="connection-string" value="jdbc:postgresql://10.xxx.xx.xx:xxxx/testdb"/>
<param name="user-name" value="postgres"/>
<param name="password" value="postgres"/>
<param name="query" value="SELECT sequence_id,customername,customer_address,country from testtable where customer_address is not null and country is not null order by sequence_id"/>
<column name="sequence_id" property="ID"/>
<column name="customername" property="NAME" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="customer_address" property="ADDRESS" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
<column name="country" property="COUNTRY" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
</jdbc>
</duke>
I also experimented with the databasestatistics Java file, but I am not able to understand the numbers it outputs. If possible, kindly guide me on how to use it.