Problem with mysql_example: No records have been blocked together


Matthew Gross

unread,
Feb 19, 2018, 2:52:57 PM2/19/18
to open source deduplication
I am attempting to run the mysql_example code on a Windows machine. I experienced a few problems and tried to limit the sample of the input text file so that I could more effectively troubleshoot my problems. After selecting a sample of 100,000 observations and running the mysql_example_init file, I tried the main mysql_example file. Unfortunately, I am getting an error message at the cluster stage that says, "No records have been blocked together. Is the data you are trying to match like the data you trained on?" I am not sure why I am getting this message since my code is lifted directly from the example code. Does anyone have a suggestion for a possible fix? As you can maybe tell, I am both a python and mysql beginner.

I copied and pasted the output for my run at the following pastebin link: https://pastebin.com/jWuwXea7

bespin

unread,
Jun 29, 2018, 4:54:21 PM6/29/18
to open source deduplication

Did you get this resolved? I'm having the same issue on the full sample.

Josh Wieder

unread,
Jul 12, 2018, 4:15:33 PM7/12/18
to open source deduplication
I can confirm that I am encountering the same issue with the default mysql example. Initially I had an `_mysql_exceptions.OperationalError: (1267, "Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation '='")` failure immediately after creating the blocking table. I resolved this by specifying utf8_unicode_ci collation when creating the `contributions` database:

CREATE DATABASE contributions CHARACTER SET utf8 COLLATE utf8_unicode_ci;
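If anyone wants to check which collations their tables actually ended up with, a query along these lines (assuming the example's `contributions` database) will list them per table:

```sql
-- Diagnostic sketch: list the collation of each table in the
-- contributions database, to spot utf8_general_ci / utf8_unicode_ci
-- mismatches that trigger error 1267.
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'contributions';
```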

However, after this I now receive a `dedupe.core.BlockingError` *during* the creation of the blocking table:

Traceback (most recent call last):
  File "mysql_example.py", line 371, in <module>
    clustered_dupes = list(clustered_dupes)
  File "/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/dedupe/api.py", line 122, in matchBlocks
    threshold=0)
  File "/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/dedupe/core.py", line 215, in scoreDuplicates
    raise BlockingError("No records have been blocked together. "
dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?

I've been actively troubleshooting this all day. I will update here if I find any solutions.
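One check that has at least narrowed things down for me (assuming the `blocking_map` table the example builds): see whether any block key actually covers more than one record, since the BlockingError fires when none do:

```sql
-- If this returns no rows, no two records share a block key,
-- which is exactly the condition behind the BlockingError.
SELECT block_key, COUNT(*) AS records
FROM blocking_map
GROUP BY block_key
HAVING COUNT(*) > 1
LIMIT 10;
```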

Cheers.
 

Josh Wieder

unread,
Jul 12, 2018, 4:27:19 PM7/12/18
to open source deduplication
I should clarify that I'm using MariaDB 10.2 on CentOS 7 with Python 3.5 (I've had earlier failures using Python 2.7 & 3.6). MariaDB 10.1+ makes at least one call pointless, e.g.:
    c.execute("SET group_concat_max_len = 10192")
This should be commented out, as MariaDB 10.2+'s group_concat_max_len defaults to 1MB, roughly 100x this value (the call was clearly intended to increase the old MySQL default of 1KB, not to set a lower value).
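You can verify the server default yourself before deciding whether the override is needed:

```sql
-- On MariaDB 10.2+ this should report 1048576 (1MB) by default,
-- far above the 10192 the example sets.
SHOW VARIABLES LIKE 'group_concat_max_len';
```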

Currently I am re-running the example after making a (slightly different) change specified here: https://github.com/dedupeio/dedupe-examples/issues/75

Specifically, I've changed the default plural_key create table line in mysql_example.py into two queries & changed the syntax around. This change has already gotten me past the previous failure point. Will update here with results. If it works I'm going to push an update to GitHub.

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(CC_id ORDER BY CC_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")


mza...@clarityinsights.com

unread,
Jul 12, 2018, 4:44:31 PM7/12/18
to open source deduplication
The mysql_example works for me... I am using Python 3.6.5 and MariaDB 10.3.8.  

@Josh - you're on the right path. Code adjustments are likely necessary depending on your version of Python, OS, and MySQL.

I did have to make some code adjustments:
  • the mysql_init script and the underlying data needed tweaking to fit into a table using STRICT MySQL settings, due to typecasting and some NULL/empty values.
  • mysql_example required some adjustments to be fully compliant with Python 3.6.5, including things like parentheses for the print() function, etc. If you are getting a BlockingError, it could be due to bad data loaded into MySQL. I have had quite a few problems obfuscated in that error.
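To see whether strict mode is what's biting you during the load, you can inspect (and, for a quick test only, relax) the session sql_mode:

```sql
-- Check the current mode; STRICT_TRANS_TABLES is on by default
-- in recent MySQL/MariaDB builds.
SELECT @@SESSION.sql_mode;

-- For troubleshooting only: relax it for this session so the
-- mysql_init load surfaces warnings instead of hard errors.
SET SESSION sql_mode = '';
```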

Josh Wieder

unread,
Jul 12, 2018, 5:17:52 PM7/12/18
to open source deduplication
OK guys, I got a successful run (finally). I'm going to push my current code to the mysql_examples repo as soon as time permits (I'm behind the eight ball on a deadline ATM).
 
I made another change to the db query posted above:

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

Sweet success (I re-ran this with an existing training json file):

[root@webhost mysql_example]# python mysql_example.py
reading from  mysql_example_settings
blocking...
creating blocking_map database
creating inverted index
writing blocking map
prepare blocking table. this will probably take a while ...
clustering...
10000 blocks
1340.320717573166 seconds
20000 blocks
1344.446471452713 seconds
30000 blocks
1347.2672460079193 seconds
40000 blocks
1350.3440387248993 seconds
50000 blocks
1352.8005783557892 seconds
60000 blocks
1355.9533243179321 seconds
70000 blocks
1358.5162761211395 seconds
80000 blocks
1360.8267829418182 seconds
90000 blocks
1362.9271190166473 seconds
100000 blocks
1364.996277332306 seconds
110000 blocks
1367.5683472156525 seconds
120000 blocks
1369.7726373672485 seconds
130000 blocks
1371.6814708709717 seconds
140000 blocks
1373.544719696045 seconds
150000 blocks
1375.3536491394043 seconds
160000 blocks
1377.5848760604858 seconds
170000 blocks
1378.9647209644318 seconds
180000 blocks
1380.7578570842743 seconds
190000 blocks
1382.1378812789917 seconds
200000 blocks
1384.0062987804413 seconds
210000 blocks
1386.5011110305786 seconds
creating entity_map database
/opt/rh/rh-python35/root/usr/lib64/python3.5/site-packages/MySQLdb/cursors.py:168: Warning: (1051, "Unknown table 'contributions.entity_map'")
  self.fetchall()
# duplicate sets
118540
Top Donors (deduped)
      $30,700,720.52: Democratic Party of Illinois
      $11,747,097.77: Republican State Senate Campaign Committee
      $10,806,911.01: Illinois Republican Party
       $9,912,263.57: Illinois Senate Democratic Fund (The)
       $9,590,682.54: Republican Governors Association
       $9,040,913.46: Madigan Michael Friends of
       $8,037,705.78: Citizens to Elect Tom Cross
       $7,897,829.31: RGA Illinois 2010 PAC
       $6,864,234.45: Illinois Federation of Teachers
       $6,814,350.20: Chicago Teachers Union IFT Local 1
Top Donors (raw)
      $14,319,194.47: Democratic Party of Illinois
      $13,020,132.76: Democratic Party of Illinois
       $9,027,432.54: Republican Governors Association
       $7,897,829.31: RGA Illinois 2010 PAC
       $6,675,000.00: Madigan Michael Friends of
       $6,008,841.69: Scott Cohen
       $5,570,839.00: Ronald Gidwitz,
       $5,562,800.00: Citizens for Emil Jones
       $5,324,649.63: Paul Wood,
       $5,132,563.83: SEIU Healthcare IL IN
ran in 1494.4037194252014 seconds



Josh Wieder

unread,
Jul 12, 2018, 5:50:29 PM7/12/18
to open source deduplication
@mza ->

Thanks for your quick feedback. I ended up abandoning 3.6.5 after I found a post somewhere on GitHub saying that the devs had only confirmed with 3.5. I'm using a software collection of Python 3.5.1 to keep things somewhat separated in my existing environment.

Both of your pointers are solid.

Your strict mode setting is a good catch. I did not even realize this, but MariaDB began enabling InnoDB strict mode by default after version 10.2.2. It's not been an issue for me up to this point, but I'd rather resolve this type of thing in the application than by disabling it (I have other stuff on the DB I'm using for this).

And the bad DB data point is on target as well. AFAICT there was a casting issue taking place during the population of the plural_key table. The default syntax for creation of the table looks like this:

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "(SELECT MIN(block_key) FROM "
          " (SELECT block_key, "
          " GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          " FROM blocking_map "
          " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

The query I used explicitly specifies the second INSERT statement that is made implicitly there, like so (I keep making typos copying stuff in here ... sorry ... this is from the just-working application):

c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
          " PRIMARY KEY (block_id)) "
          "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
c.execute("INSERT INTO plural_key "
          "(block_key) "
          "(SELECT MIN(block_key) AS block_key FROM "
          " (SELECT block_key, "
          "  GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
          "  FROM blocking_map "
          "  GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
          " GROUP BY block)")

In addition to breaking the query in two, I throw in an explicit type cast (it really shouldn't matter, since the table gets generated that way by default in my DB) & specify "as block_key". This was all specified previously in https://github.com/dedupeio/dedupe-examples/issues/75 ... however, that fix used a now-invalid column name "CC_id" in the GROUP_CONCAT() call. Fixing this resolves things.

As I mentioned in my previous post I managed to get this to work. And it is worth pointing out that whatever its limitations, I have never come across an open-source library with the kind of functionality offered here. This sort of big data sexiness is typically kept under lock & key where it can be used for important stuff like government spying & advertising. Unleashing this sort of technology on public records as demonstrated in the mysql example could provide enormous social value & the devs should be applauded for this.

I should also clarify that I'm pretty green with Python (I focus more on PHP & C), but based on my experience, what you mentioned, & the GitHub issues, I'm fairly certain that these examples do not function as-published in most "vanilla" environments of more common Python + MySQL combinations (both in terms of code & provisioning instructions). For example, the default codebase for this example creates three database tables without specifying unicode collation. This would be fine if the instructions specified that the 'contributions' db should use a default utf8_unicode collation, but they don't; whether or not it's best practice at this point, utf8_general_ci is still the default for rpm-based DB installs on the Linux flavors I use.
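If you'd rather fix it per-table than rely on the database default, appending the collation to each CREATE TABLE works; e.g. for the blocking table (a sketch based on the example's schema, column names assumed):

```sql
-- Same shape as the example's blocking_map, but with the
-- collation pinned so it can't inherit utf8_general_ci.
CREATE TABLE blocking_map
  (block_key VARCHAR(200), donor_id INTEGER)
  CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```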

In most cases, this library has a steep learning curve, & working examples are needed to make sense of it for those of us still learning. Anyway, I'm posting this here not to burn the devs, but to encourage those who have put a lot of work into hacking these examples to post their results, make things easier for newer users, & push to GitHub.

Josh W

Forest Gregg

unread,
Jul 12, 2018, 5:55:44 PM7/12/18
to open-source-...@googlegroups.com
The example code did work on the most common Python + MySQL combinations when I wrote it four or five years ago. For folks who have gotten it working with newer versions of MySQL, pull requests are very welcome.

Best,

Forest

--

---
You received this message because you are subscribed to the Google Groups "open source deduplication" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-source-dedupl...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mza...@clarityinsights.com

unread,
Jul 12, 2018, 7:12:53 PM7/12/18
to open source deduplication
@Josh - nice job slogging through. A bit overkill, I think, but it will help your understanding.

@Forest - You're still taking pull requests and contributions on this? Nice! I'll get my fixes pretty and submit.

Deepesh Chaudhari

unread,
Mar 4, 2020, 12:57:51 PM3/4/20
to open source deduplication
Hi, I have a similar error, with a line that implies I'm not using the data that I trained on? Is it important to always have the data you labelled as part of your dataset?

