Problem with mysql_example: No records have been blocked together


Matthew Gross

unread,
Feb 19, 2018, 2:52:57 PM2/19/18
to open source deduplication
I am attempting to run the mysql_example code on a Windows machine. I experienced a few problems and tried to limit the sample of the input text file so that I could more effectively troubleshoot my problems. After selecting a sample of 100,000 observations and running the mysql_example_init file, I tried the main mysql_example file. Unfortunately, I am getting an error message at the cluster stage that says, "No records have been blocked together. Is the data you are trying to match like the data you trained on?" I am not sure why I am getting this message since my code is lifted directly from the example code. Does anyone have a suggestion for a possible fix? As you can maybe tell, I am both a python and mysql beginner.

I copied and pasted the output for my run at the following pastebin link: https://pastebin.com/jWuwXea7

bespin

unread,
Jun 29, 2018, 4:54:21 PM6/29/18
to open source deduplication

Did you get this resolved? I'm having the same issue on the full sample.

Josh Wieder

unread,
Jul 12, 2018, 4:15:33 PM7/12/18
to open source deduplication
I can confirm that I am encountering the same issue with the default mysql example. Initially I had an `_mysql_exceptions.OperationalError: (1267, "Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation '='")` failure immediately after creating the blocking table. I resolved this by specifying utf8_unicode_ci collation when creating the `contributions` database:

CREATE DATABASE contributions CHARACTER SET utf8 COLLATE utf8_unicode_ci;
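If anyone wants to check which collations their tables actually ended up with, a query along these lines (assuming the example's `contributions` database) will list them per table:

```sql
-- Diagnostic sketch: list the collation of each table in the
-- contributions database, to spot utf8_general_ci / utf8_unicode_ci
-- mismatches that trigger error 1267.
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'contributions';
```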

However, after this I now receive a `dedupe.core.BlockingError` *during* the creation of the blocking table:

Traceback (most recent call last):
  File "mysql_example.py", line 371, in <module>
    clustered_dupes = list(clustered_dupes)
  File "/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/dedupe/api.py", line 122, in matchBlocks
    threshold=0)
  File "/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/dedupe/core.py", line 215, in scoreDuplicates
    raise BlockingError("No records have been blocked together. "
dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?

I've been actively troubleshooting this all day. I will update here if I find any solutions.
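One check that has at least narrowed things down for me (assuming the `blocking_map` table the example builds): see whether any block key actually covers more than one record, since the BlockingError fires when none do:

```sql
-- If this returns no rows, no two records share a block key,
-- which is exactly the condition behind the BlockingError.
SELECT block_key, COUNT(*) AS records
FROM blocking_map
GROUP BY block_key
HAVING COUNT(*) > 1
LIMIT 10;
```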

Cheers.
 

Josh Wieder

unread,
Jul 12, 2018, 4:27:19 PM7/12/18
to open source deduplication
I should clarify that I'm using MariaDB 10.2 on CentOS 7 with Python 3.5 (I've had earlier failures using Python 2.7 & 3.6). MariaDB 10.1+ makes at least one call pointless, e.g.:
    c.execute("SET group_concat_max_len = 10192")
This should be commented out, as MariaDB 10.2+'s group_concat_max_len defaults to 1MB, roughly 100x this value (the call was clearly intended to increase the old MySQL default of 1KB, not to set a lower value).
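You can verify the server default yourself before deciding whether the override is needed:

```sql
-- On MariaDB 10.2+ this should report 1048576 (1MB) by default,
-- far above the 10192 the example sets.
SHOW VARIABLES LIKE 'group_concat_max_len';
```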

Currently I am re-running the example after making a (slightly different) change specified here: https://github.com/dedupeio/dedupe-examples/issues/75

Specifically, I've changed the default plural_key create table line in mysql_example.py into two queries & changed the syntax around. This change has already gotten me past the previous failure point. Will update here with results. If it works I'm going to push an update to GitHub.

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(CC_id ORDER BY CC_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")


mza...@clarityinsights.com

unread,
Jul 12, 2018, 4:44:31 PM7/12/18
to open source deduplication
The mysql_example works for me... I am using Python 3.6.5 and MariaDB 10.3.8.  

@Josh - you're on the right path. Code adjustments are likely necessary depending on your version of Python, OS, and MySQL.

I did have to make some code adjustments:
  • the mysql_init script and the underlying data needed tweaking to fit into a table using STRICT MySQL settings, due to typecasting and some NULL/empty values.
  • mysql_example required some adjustments to be fully compliant with Python 3.6.5, including things like parentheses for the print() function, etc. If you are getting a BlockingError, it could be due to bad data loaded into MySQL. I have had quite a few problems obfuscated in that error.
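To see whether strict mode is what's biting you during the load, you can inspect (and, for a quick test only, relax) the session sql_mode:

```sql
-- Check the current mode; STRICT_TRANS_TABLES is on by default
-- in recent MySQL/MariaDB builds.
SELECT @@SESSION.sql_mode;

-- For troubleshooting only: relax it for this session so the
-- mysql_init load surfaces warnings instead of hard errors.
SET SESSION sql_mode = '';
```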

Josh Wieder

unread,
Jul 12, 2018, 5:17:52 PM7/12/18
to open source deduplication
OK guys, I got a successful run (finally). I'm going to push my current code to the mysql_examples repo as soon as time permits (I'm behind the eight ball on a deadline ATM).
 
I made another change to the db query posted above:

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

Sweet success (I re-ran this with an existing training json file):

[root@webhost mysql_example]# python mysql_example.py
reading from  mysql_example_settings
blocking...
creating blocking_map database
creating inverted index
writing blocking map
prepare blocking table. this will probably take a while ...
clustering...
10000 blocks
1340.320717573166 seconds
20000 blocks
1344.446471452713 seconds
30000 blocks
1347.2672460079193 seconds
40000 blocks
1350.3440387248993 seconds
50000 blocks
1352.8005783557892 seconds
60000 blocks
1355.9533243179321 seconds
70000 blocks
1358.5162761211395 seconds
80000 blocks
1360.8267829418182 seconds
90000 blocks
1362.9271190166473 seconds
100000 blocks
1364.996277332306 seconds
110000 blocks
1367.5683472156525 seconds
120000 blocks
1369.7726373672485 seconds
130000 blocks
1371.6814708709717 seconds
140000 blocks
1373.544719696045 seconds
150000 blocks
1375.3536491394043 seconds
160000 blocks
1377.5848760604858 seconds
170000 blocks
1378.9647209644318 seconds
180000 blocks
1380.7578570842743 seconds
190000 blocks
1382.1378812789917 seconds
200000 blocks
1384.0062987804413 seconds
210000 blocks
1386.5011110305786 seconds
creating entity_map database
/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/MySQLdb/cursors.py:168: Warning: (1051, "Unknown table 'contributions.entity_map'")
  self.fetchall()
# duplicate sets
118540
Top Donors (deduped)
      $30,700,720.52: Democratic Party of Illinois
      $11,747,097.77: Republican State Senate Campaign Committee
      $10,806,911.01: Illinois Republican Party
       $9,912,263.57: Illinois Senate Democratic Fund (The)
       $9,590,682.54: Republican Governors Association
       $9,040,913.46: Madigan Michael Friends of
       $8,037,705.78: Citizens to Elect Tom Cross
       $7,897,829.31: RGA Illinois 2010 PAC
       $6,864,234.45: Illinois Federation of Teachers
       $6,814,350.20: Chicago Teachers Union IFT Local 1
Top Donors (raw)
      $14,319,194.47: Democratic Party of Illinois
      $13,020,132.76: Democratic Party of Illinois
       $9,027,432.54: Republican Governors Association
       $7,897,829.31: RGA Illinois 2010 PAC
       $6,675,000.00: Madigan Michael Friends of
       $6,008,841.69: Scott Cohen
       $5,570,839.00: Ronald Gidwitz,
       $5,562,800.00: Citizens for Emil Jones
       $5,324,649.63: Paul Wood,
       $5,132,563.83: SEIU Healthcare IL IN
ran in 1494.4037194252014 seconds



Josh Wieder

unread,
Jul 12, 2018, 5:50:29 PM7/12/18
to open source deduplication
@mza ->

Thanks for your quick feedback. I ended up abandoning 3.6.5 after I found a post somewhere on GitHub saying that the devs had only confirmed with 3.5. I'm using a software collection of Python 3.5.1 to keep things somewhat separated in my existing environment.

Both of your pointers are solid.

Your strict mode setting is a good catch. I did not even realize this, but MariaDB began enabling InnoDB strict mode by default after version 10.2.2. It's not been an issue for me up to this point, but I'd rather resolve this type of thing in the application than by disabling it (I have other stuff on the DB I'm using for this).

And the bad DB data point is on target as well. AFAICT there was a casting issue taking place during the population of the plural_key table. The default syntax for creation of the table looks like this:

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "(SELECT MIN(block_key) FROM "
          " (SELECT block_key, "
          " GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          " FROM blocking_map "
          " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

The query I used explicitly specifies the second INSERT statement that is made implicitly there, like so (I keep making typos copying stuff in here ... sorry ... this is from the just-working application):

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

In addition to breaking the query in two, I throw in an explicit type cast (it really shouldn't matter, since the table gets generated that way by default in my DB) & specify "as block_key". This was all specified previously in https://github.com/dedupeio/dedupe-examples/issues/75 ... however, that fix used a now-invalid column name "CC_id" in the GROUP_CONCAT() call. Fixing this resolves things.

As I mentioned in my previous post I managed to get this to work. And it is worth pointing out that whatever its limitations, I have never come across an open-source library with the kind of functionality offered here. This sort of big data sexiness is typically kept under lock & key where it can be used for important stuff like government spying & advertising. Unleashing this sort of technology on public records as demonstrated in the mysql example could provide enormous social value & the devs should be applauded for this.

I should also clarify that I'm pretty green with Python (I focus more on PHP & C), but based on my experience, what you mentioned, & the GitHub issues, I'm fairly certain that these examples do not function as-published in most "vanilla" environments of more common Python + MySQL combinations (both in terms of code & provisioning instructions). For example, the default codebase for this example creates three database tables without specifying unicode collation. This would be fine if the instructions specified that the 'contributions' db should use a default utf8_unicode collation, but they don't; whether or not it's best practice at this point, utf8_general_ci is still the default for rpm-based DB installs on the Linux flavors I use.
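If you'd rather fix it per-table than rely on the database default, appending the collation to each CREATE TABLE works; e.g. for the blocking table (a sketch based on the example's schema, column names assumed):

```sql
-- Same shape as the example's blocking_map, but with the
-- collation pinned so it can't inherit utf8_general_ci.
CREATE TABLE blocking_map
  (block_key VARCHAR(200), donor_id INTEGER)
  CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```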

In most cases, this library has a steep learning curve, & working examples are needed to make sense of it for those of us still learning. Anyway, I'm posting this here not to burn the devs, but to encourage those who have put a lot of work into hacking these examples to post their results, make things easier for newer users, & push to GitHub.

Josh W

Forest Gregg

unread,
Jul 12, 2018, 5:55:44 PM7/12/18
to open-source-...@googlegroups.com
The example code did work on the most common Python + MySQL combinations when I wrote it four or five years ago. For folks who have gotten it working with newer versions of MySQL, pull requests are very welcome.

Best,

Forest

--

---
You received this message because you are subscribed to the Google Groups "open source deduplication" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-source-dedupl...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mza...@clarityinsights.com

unread,
Jul 12, 2018, 7:12:53 PM7/12/18
to open source deduplication
@Josh - nice job slogging through. A bit overkill, I think, but it will help your understanding.

@Forest - You're still taking pull requests and contributions on this? Nice! I'll get my fixes pretty and submit.

Deepesh Chaudhari

unread,
Mar 4, 2020, 12:57:51 PM3/4/20
to open source deduplication
Hi, I have a similar error, with a line that implies I'm not using the data that I trained on? Is it important to always have the data you labelled as part of your dataset?

