I have a CakePHP CRM application with 30,000+ contact records. Each
Contact hasMany Address records. One feature of my app is to allow my
client to search for duplicate entries within the contacts table
before importing new contacts. In short, I am querying the entire
Contact table and LEFT JOINing with the Address table. I store the
result set into an array and then use PHP's array functions to filter
potential duplicates from a "pending contacts" table.
What I want to do is somehow run this entire process (the database
query and array comparison) in the background (perhaps with a PHP
exec() call) and show some type of "Loading, please wait..." interface
to the application user. I'm thinking something similar to Expedia's
"Please wait while we search for your flight" interface.
The problem is, how would I do this? Any ideas or suggestions are
welcome! Thanks in advance...
> I have a CakePHP CRM application with 30,000+ contact records. Each
> Contact hasMany Address records. One feature of my app is to allow my
> client to search for duplicate entries within the contacts table
> before importing new contacts. In short, I am querying the entire
> Contact table and LEFT JOINing with the Address table. I store the
> result set into an array and then use PHP's array functions to filter
> potential duplicates from a "pending contacts" table.
> What I want to do is somehow run this entire process (the database
> query and array comparison) in the background (perhaps with a PHP
> exec() call) and show some type of "Loading, please wait..." interface
> to the application user. I'm thinking something similar to Expedia's
> "Please wait while we search for your flight" interface.
> The problem is, how would I do this? Any ideas or suggestions are
> welcome! Thanks in advance...
> I have a CakePHP CRM application with 30,000+ contact records. Each
> Contact hasMany Address records. One feature of my app is to allow my
> client to search for duplicate entries within the contacts table
> before importing new contacts. In short, I am querying the entire
> Contact table and LEFT JOINing with the Address table. I store the
> result set into an array and then use PHP's array functions to filter
> potential duplicates from a "pending contacts" table.
What exactly are you doing? It can probably be handled pretty quickly
by the database. You shouldn't have to manually fish out dupes.
I always thought the please wait while we search messages were either
a [poor] attempt to seem extra high-tech or a sign that they have a
really inefficient system. Google searches the whole web--not just a
day's commercial flights--and comes back in a flash.
I have used a PHP app with a valid reason for it, it was a mailing
list program and it accomplished sending large amounts of email out by
batching (the page would keep reloading with incrementing GET
variables, such as start=0&num=100). It wouldn't work in your case.
What I have is an ImportContact model which is simply a table
containing basic information about a person (name, email, address,
etc). This table gets populated either from a standard website
"Contact Us" form or by my client manually importing a CSV file. Now,
once this ImportContact model has new records in it, I want to display
a list of *potential* duplicates by various criteria. The "master"
contacts model is simply called Contact and contains 30,000+ records.
The Contact model does NOT include addresses, there is a separate
Address model for this (again, with 30,000+ records).
The reason I can't have the database do the searching for me is
because the criteria I want to search against to find potential
duplicates doesn't have a direct one-to-one correlation to the
database table fields.
For example this "duplicate search" criteria works great using Cake:
- Show potential duplicates based on field `email1`
- Cake will generate a SELECT...IN query, listing ALL email addresses
from the ImportContact model and returning all records from the
Contact model containing anyone of the listed emails
However, this example does NOT work:
- Show potential duplicates based on fields `fn` and `ln` (a person's
full name)
- Since I have to CONCAT(`fn`, `ln`) AS `full_name`, the SELECT...IN
syntax no longer works
So what I did was query ALL records from both ImportContact and
Contact models and store each of these result sets in their own
arrays. Then I perform the search using PHP array functions and store
the potential duplicates in a third array. This is working great, but
it's just on the slow side (sometimes 3+ minutes). It's not just the
array searching that's taking a while either, it's actually the
database query retrieving 30,000+ records that's slowing things down
as well.
I'm just giving a brief overview here and can provide specific code if
you like. Thanks for the quick replies, maybe I can get this thing to
run more efficiently but I'm not quite sure how :-/
- Kevin
On Mar 26, 4:14 pm, jonknee <m...@jongales.com> wrote:
> > I have a CakePHP CRM application with 30,000+ contact records. Each
> > Contact hasMany Address records. One feature of my app is to allow my
> > client to search for duplicate entries within the contacts table
> > before importing new contacts. In short, I am querying the entire
> > Contact table and LEFT JOINing with the Address table. I store the
> > result set into an array and then use PHP's array functions to filter
> > potential duplicates from a "pending contacts" table.
> What exactly are you doing? It can probably be handled pretty quickly
> by the database. You shouldn't have to manually fish out dupes.
> I always thought the please wait while we search messages were either
> a [poor] attempt to seem extra high-tech or a sign that they have a
> really inefficient system. Google searches the whole web--not just a
> day's commercial flights--and comes back in a flash.
> I have used a PHP app with a valid reason for it, it was a mailing
> list program and it accomplished sending large amounts of email out by
> batching (the page would keep reloading with incrementing GET
> variables, such as start=0&num=100). It wouldn't work in your case.
> The reason I can't have the database do the searching for me is
> because the criteria I want to search against to find potential
> duplicates doesn't have a direct one-to-one correlation to the
> database table fields.
Hmm. Is there a way you could make them directly correlate? It's a
pretty fast DB query if you can get them to line up. If I'm
understanding it correctly, you just have a full name in the import
model and broken out first and last names in your contact model. This
can work, though you'll probably need some manual SQL. Instead of
CONCAT you want to use CONTACT_WS(' ', 'fn', 'ls) which will result in
a space between the first and last names.
SELECT * FROM contactImport WHERE full_name IN (SELECT CONCAT_WS(' ',
first_name, last_name) FROM contact)
That will match a record in contactImport with a full_name of "Joe
Blow" when there is a record in contact with a first name of "Joe" and
a last name of "Blow".
The beauty of doing the heavy lifting in SQL, even if you can't do it
in the ORM, is that you only have to bring the matches across the
wire. So instead of a minimum of 30,000 records coming out of the DB
(and each getting messed around with Cake's ORM) you get only the
dupes.
> If I'm understanding it correctly, you just have a full name in the import
> model and broken out first and last names in your contact model.
I may have not explained the setup properly. I don't actually have a
"full_name" field in either table (import_contacts and contacts are
the actual MySQL table names). Each of these tables have "fn" and "ln"
fields, however. So the problem is that I need to somehow find all
records in the contacts table which have the same values in the "fn"
and "ln" fields as the import_contacts table.
> Hmm. Is there a way you could make them directly correlate? It's a
> pretty fast DB query if you can get them to line up.
The only idea I have at the moment is to create a field in both tables
which simply stores the concatenated value of a record's "fn" and "ln"
fields. This would require me to write a script to retro-fit the
existing 30k contact records, as well as update some logic in my "add
contact" controller. Not to mention this would also create a redundant
field in the table and still doesn't ultimately solve the problem
because what if in the future I want to find duplicates matching "fn",
"ln" and "mi" (middle initial)? I would have to create yet another
"dummy" field in my table.
> The beauty of doing the heavy lifting in SQL, even if you can't do it
> in the ORM, is that you only have to bring the matches across the
> wire. So instead of a minimum of 30,000 records coming out of the DB
> (and each getting messed around with Cake's ORM) you get only the
> dupes.
Oh I hear you on this one :) Once I realized I *couldn't* do it with
SQL I was quite bummed! Of course, maybe there's still a way...
On Thu, Mar 27, 2008 at 12:59 PM, kdecapite <decap...@creationsite.com> wrote:
> I may have not explained the setup properly. I don't actually have a > "full_name" field in either table (import_contacts and contacts are > the actual MySQL table names). Each of these tables have "fn" and "ln" > fields, however. So the problem is that I need to somehow find all > records in the contacts table which have the same values in the "fn" > and "ln" fields as the import_contacts table.
I could be misinterpreting your needs, but ...
SELECT c.id, c.fn, c.ln FROM contacts AS c INNER JOIN import_contacts AS ic ON c.fn = ic.fn AND c.ln = ic.ln;
This will select the id, fn, and ln from contacts where it also exists in import_contacts.
> The only idea I have at the moment is to create a field in both tables > which simply stores the concatenated value of a record's "fn" and "ln" > fields. This would require me to write a script to retro-fit the > existing 30k contact records, as well as update some logic in my "add > contact" controller. Not to mention this would also create a redundant > field in the table and still doesn't ultimately solve the problem > because what if in the future I want to find duplicates matching "fn", > "ln" and "mi" (middle initial)? I would have to create yet another > "dummy" field in my table.
> I may have not explained the setup properly. I don't actually have a
> "full_name" field in either table (import_contacts and contacts are
> the actual MySQL table names). Each of these tables have "fn" and "ln"
> fields, however. So the problem is that I need to somehow find all
> records in the contacts table which have the same values in the "fn"
> and "ln" fields as the import_contacts table.
That should still be able to be handled pretty easily with SQL. You
got me interested so I created a test DB formatted like you said
(first_name, last_name) and this query works fine. Only spits out the
dupes (when first and last names are the same, if one of the two is
different it doesn't show up):
SELECT * FROM contactImport WHERE first_name IN (SELECT first_name
FROM contact) AND last_name in (SELECT last_name FROM contact)
b logica's way would work too, so you could just check whatever is
faster. I don't have 30,000 records to test that out on :P.
Thanks to everyone who has contributed so far, it's much appreciated!
Now, here's where we stand...
jonknee's query took about 25-27 seconds and returned 154 records.
b logica's query took about 6 seconds and returned 24 records
My original query takes 18 seconds and returns ALL records and is
exactly this:
SELECT CONCAT(`Contact`.`ln`, `Contact`.`fn`) AS `full_name`,
`Contact`.`id`, `Contact`.`created`, `Contact`.`modified`,
`Contact`.`fn`, `Contact`.`ln`, `Contact`.`co`, `Contact`.`email1`,
`Contact`.`phone_home`, `Address`.`line1`, `Address`.`line2`,
`Address`.`city`, `Address`.`st`, `Address`.`zip` FROM `contacts` AS
`Contact` LEFT JOIN `addresses` AS `Address` ON `Contact`.`id` =
`Address`.`contact_id` WHERE `Contact`.`ln` <> '' AND `Contact`.`fn`
<> '' AND `Contact`.`ln` IS NOT NULL AND `Contact`.`fn` IS NOT NULL
ORDER BY `Contact`.`ln` ASC, `Contact`.`fn` ASC
Then I store the results into an array and run this query:
SELECT * FROM `import_contacts` AS `ImportContact` ORDER BY `ln` ASC,
`fn` ASC
I store those results into an array, too, then I loop through the
ImportContact array looking for a matching key (PHP's array_key_exists
function). I can do this because I format the two result set arrays
using a concatenated string as the key of the array. For instance, a
record with a first name of "Mark" and last name of "Smith" gets
stored as: