need help finding duplicate records

1,455 views
Skip to first unread message

Al J

unread,
Jan 26, 2016, 12:35:43 AM1/26/16
to OpenRefine
I am having trouble understanding how to do even the simplest of tasks with OR 

I have a set of records containing name, address, email/phone data & time of registration

I want to find sets of records that show multiple registrations from the same address, or even the same people

I'd love to be able to do things like sort those clusters by the amount of time elapsed between registration, for example

And yet, I can't even get it to show me duplicates

After reading a bunch pf posts and docs I used a duplicates facet on the email column and then excluded the false rows -- and the remaining set of rows are not duplicates, I know this from checking the data via hand built sql

I love the idea of OR and want to get better at using it to answer questions about my data

But I've got to have some success soon, or I'm going to go back to coding it by hand -- I know I can do that, but it won't have the ad-hoc power 

arg-bargle-flut !!

help!

Owen Stephens

unread,
Jan 26, 2016, 4:37:04 AM1/26/16
to OpenRefine
Hi Al J

There might be different approaches to identifying duplicates in your data set depending on what you want to do once you have identified the duplicates and to some extent how big the data set is.

If you think the email data is the most likely to show duplicate rows, and if the data set isn't huge (lets say its a few thousand lines) then the first thing I'd do is to try a simple Text Facet on the email address column (click drop down menu at the top of the column choose Facet->Text Facet). This should give you a facet listing all the email addresses. You can then sort them by the 'count' by clicking 'Count' in the facet. This should give you all the repeated email addresses at the top of the list with the number of times they are repeated shown in the facet.

If you have a larger data set, then the duplicates facet may be better. The first thing I'd do in this case is Sort the dataset by the relevant column (e.g. email) before doing the duplicates facet - that will ensure that when you view the duplicates you see the duplicated lines adjacent to each other, rather than scattered through the result set.

Something to watch out for generally is that you are in 'Rows' mode (this is set towards the top left of the grid display in OR). The other mode ('Records') can be powerful but unless you are dealing with Records deliberately, you'll want to work in Rows mode.

Hope that helps a little. If you are able to share some data, or dummy data, that shows us what you are trying to do in more detail and demonstrates issues that you are seeing (e.g. if you could share some data that shows how the duplicates facet isn't working correctly for you) then that may help us offer more specific answers.

In case it helps, there are some great 'getting started' tutorials for OR out there. To mention a few:


Best wishes,

Owen
Message has been deleted

Al J

unread,
Jan 26, 2016, 9:28:46 AM1/26/16
to OpenRefine
  Thanks for those pointers -- they enabled me to get a start.  I've clustered the names (first,last) with trim&lowercase and I've clustered the address (line1&2) with t&l -- then adding duplicates facets to both columns gave me the 91 duplicate records I was looking for, sorting on the clustered name makes it easy to follow.

  I can see from skimming the data (without duplicates) that there is one person who registered twi, once w/just the 1st letter of his first name -- that's not caught in the duplicate facet, I think that the Cluster&Edit ngram-fingerprint may help there (not that I know what it does, but it's name and it's numeric param look promising)

  Now I need to figure out how to look at the registration timestamps to see how much time elapsed between the duplicates -- but that's row-against-row comparison and everything I've seen so far is within rows...

  I just wish there were what I consider actual docs for this product !  For example: I found this in an obscure tutorial: Blank Down: This will transform the column so instead of the number being repeated it will only appear once, with all subsequent repetitions underneath transformed into blank entries

  Where is the reference where each thing you can choose from a menu is defined?  I miss the days of LRM's -- now everything is explained by examples - whatta pita for my intp brain...

Owen Stephens

unread,
Jan 26, 2016, 10:26:23 AM1/26/16
to OpenRefine
Hi,

The main place to look for documentation is the OpenRefine Wiki https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users - this page has a Reference section at the bottom which gives the detail on all the different functions. Any suggestions on how we could improve the documentation would be very much welcome.

In terms of the duplicates - the duplicate checking in OpenRefine is looking for two cells which contain the same value. So a duplicate facet on the Name column wouldn't pick up that a row containing Joe Bloggs, and another row containing just J are duplicates. Presumably you can tell due to some other duplicate information in the row? You may need to use some of those other columns as a way of tracing possible duplicates (from your description you've already done some of this).

In terms of working out time differences between rows - you are right OpenRefine isn't so good on comparing data between rows. You can force this by using the 'Records' mode, which allows you to group rows together and to some extent work across all the rows in a record, but to be honest this is going to be clunky way of getting to what you need.

Instead I'd suggest it might be easier to use OpenRefine to get to the potential duplicates, then export them to a Spreadsheet to do that sort of calculation.

Owen

Al J

unread,
Jan 26, 2016, 10:54:57 AM1/26/16
to OpenRefine
Ok, I'll look at those docs and make suggestions in a PR

Joe Bloggs -- that's hilarious (to me) !! I worked on the Princeton Review's SAT Prep Multimedia CD-Rom a long time ago - and there was a little animated guy named Joe Bloggs who would pop up from time to time with helpful information for the student.  We all recorded his one line: The name is Bloggs, Joe Bloggs.  And they picked mine, HA!

I agree, once I get it down to good clusters where time is the only differentiator - then I'm going to eyeball it.  If it gets too hairy I'll export it and write some code.  

I will look into Records tho' -- I assume I'd have to identify the rows as belonging to the records, I'll post back here if I can't figure out how to do that.  

I'll be asked to repeat this analysis each month - can I plug in new data and re-run automatically ?

thanks
al;

Al J

unread,
Jan 26, 2016, 6:45:37 PM1/26/16
to OpenRefine
I added the: RefinePro Knowledge Base for OpenRefine -- http://kb.refinepro.com/ to the documentation wiki 

So far that is the best possible additional reference I have found for answers





Owen Stephens

unread,
Jan 27, 2016, 4:27:32 AM1/27/16
to OpenRefine
Great.

Once you have converted the rows into Records, then you can use the ability to address the individual rows in the Record using record.index or record.cells (see the bottom of https://github.com/OpenRefine/OpenRefine/wiki/Variables) to compare values across rows (e.g. use record.cells to get an array of values then iterate through with forEach )

Owen

Al J

unread,
Jan 27, 2016, 7:04:05 AM1/27/16
to OpenRefine
yep, my current search is for date math: date2epoch, epoch2date & seconds2duration 

if I end up writing those, it looks like my options for contribution is only extensions, is that right?

seems there's no macro language and I can't load a clojure library

am I missing something ?

Owen Stephens

unread,
Jan 27, 2016, 8:11:07 AM1/27/16
to OpenRefine
Hi Al,

Can you say a bit more about what you are trying to do?
The date functions in OpenRefine GREL may be enough - https://github.com/OpenRefine/OpenRefine/wiki/GREL-Date-Functions:
* toDate
* diff
* inc
* datePart

for example, if you have the value as a string such as: "2015-08-01T00:00:00Z" you could use the following GREL to get the epoch time (milliseconds since Unix epoch):

value.toDate().datePart("time")

If you are starting from epoch time, you'll probably need to switch to using Jython or Clojure. For example using Jython with the epoch time in seconds:
import time;
epochlong = int(float(value));
datetimestamp = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.localtime(epochlong));
return datetimestamp


In terms of extending OpenRefine functionality that might be better posed to the Developer list (https://groups.google.com/forum/#!forum/openrefine-dev), but yes Extensions are the standard way of extending the functionality of OR.

I'm afraid I don't know how/if additional Clojure libraries can be imported. For Jython the documentation indicates that you can load additional Python libraries by dropping them somewhere in the Path, but it isn't something that I've done - more information at https://github.com/OpenRefine/OpenRefine/wiki/Jython and https://groups.google.com/forum/#!topic/openrefine/OvphcgDLNTs.

Owen

Al J

unread,
Jan 27, 2016, 8:58:21 AM1/27/16
to OpenRefine
I can say that you've given me all I need there!  

The crucial bit was the date part "time" -- I didn't see that in the SimpleDateFormat Docs -- I hadn't thought to look for the GREL ref -- it would be nice if that help section in the dialog where you enter functions had the link to the GREL as well as the SimpleDateFormat (I've opened an issue suggesting that.)

Now that I've cleaned up the registrations, and assembled them into records, I'm doing date  math on the registration dates to try to determine why so many people are registered twice: <5hrs between registrations, program error or enhancement needed, otherwise possible lost card --- more than one extra card, may be something else

I had done all this with SQL (except for the beautiful clustering) but I've grown tired of hacking away at SQL queries to get what I want, so when I saw a reference to OR - I was sold before I got past the splash page

I'm thinking of starting a Glossary Of Terms section to the wiki -- I'm feeling like that would be a good thing for newbies (like me.)

Thanks!
al;

Owen Stephens

unread,
Jan 27, 2016, 9:16:06 AM1/27/16
to OpenRefine

Al J

unread,
Jan 27, 2016, 9:25:52 AM1/27/16
to OpenRefine
I could do a PR, but it would be in bad form not to have built and tested the change -- and I'm not up for doing the entire java thing again right now

That's why I went for an issue

Thanks
al;
Message has been deleted

Tom Morris

unread,
Jan 27, 2016, 2:49:02 PM1/27/16
to openr...@googlegroups.com
Wow, this thread really wanders all over the place!  It sounds like most of the stuff has been resolved, but to address the last question and address a few comments:

- The expression used to create a column is available both in the History tab of the expression preview dialog and in the Undo/Redo panel for the project
- The blog that you found was already referenced in the wiki, it's just been renamed.  The article you found had the URL  http://googlerefine.blogspot.ca/2012/06/create-records-in-google-refine.html for years before everything there was redirected to the current commercial site
- I'm happy accept untested pull requests for simple textual substitutions, but there's a tension between making each help item full and robust vs lengthening an already too long wall-o-text in the help tab
- Although OpenRefine doesn't do calculations across rows, a set of values in a clustered record can be collapsed into a single cell using "Join multivalued cells" and then operated on there using split, min, max, etc
- Since there was discussion about epochs, this earlier answer from the archives may help: https://groups.google.com/forum/#!msg/openrefine/p5gfbsvJzBo/JY8hAY_KkscJ
- Clojure - not many people use it in Refine, so there's a limited amount of knowledge available

Tom

On Wed, Jan 27, 2016 at 10:04 AM, Al J <allen....@gmail.com> wrote:
Last Known Question

:)

Am I correct in thinking that there is no way to retrieve the Expression used to create a column ?

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Al J

unread,
Jan 27, 2016, 4:04:48 PM1/27/16
to OpenRefine
Hi Tom, 

  yes, I copied and pasted that wall-o-text into an external file for easier access -- it'd be nice if there were a button to open it into a separate window that you could just keep open on the side

  for calculations across rows I did just as you said, eventually I figured out how to do a reduce() on an array so I was able to get what I needed which was time elapsed between 1st & 2nd registrations, and between 2nd & 3rd, and so on

  then I did a facet on blanks to get back to a result row

  I only mentioned clojure because it's closer to my beloved javascript than python :)  I don't know clojure -- keep thinking I should learn it

  so I got the data I wanted - and I could have just used sql and nodejs to do it - but this was a one time thing and I had more fun this way -- and I'm learning a tool that I can spread around at the office

  Al;
Reply all
Reply to author
Forward
0 new messages