Lennon Day-Reynolds
Mar 3, 2008, 3:09:04 PM
to PDX Tech Calendar
Having worked on some other calendaring systems, as well as tools that
synchronize scheduling data across different domain models, I wanted
to throw a few ideas out for de-duping heuristics, and see what people
thought. (If there's sufficient interest, I also might be able to
spend some working hours on some trial implementations, since there
are a number of scheduling-related issues coming up at Reed these days
that might benefit from a good calendar aggregator.)
The first and most important tool for duplicate-detection is good
normalization of input data, followed by time and location parsing. If
you can show that two events happen in the same time and place,
chances are you've found a duplicate. (The exception to this of course
being extremely vague locations like, say, "PSU Campus", which could
play host to many events at once.)
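To make that a little more concrete, here's a rough Python sketch of the normalize-then-compare step; the field names ("venue", "start") and the normalization rules are just placeholders I made up, not any real schema:

    import re
    from datetime import datetime, timedelta

    def normalize_venue(name):
        """Lowercase, drop punctuation, and collapse whitespace."""
        name = re.sub(r"[^\w\s]", "", name.lower())
        return re.sub(r"\s+", " ", name).strip()

    def probably_duplicate(a, b, time_slop=timedelta(minutes=30)):
        """Two events at (roughly) the same time and the same normalized
        venue get flagged as likely duplicates."""
        same_place = normalize_venue(a["venue"]) == normalize_venue(b["venue"])
        same_time = abs(a["start"] - b["start"]) <= time_slop
        return same_place and same_time

    scraped = {"venue": "CubeSpace, Portland", "start": datetime(2008, 3, 4, 19, 0)}
    posted  = {"venue": "cubespace portland",  "start": datetime(2008, 3, 4, 19, 0)}
    print(probably_duplicate(scraped, posted))  # True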
At that point, some sort of effective rule engine needs to kick in to
flag possible duplicates. Those rules could be explicitly provided by
users (i.e., folks adding a new calendar could then browse the scraped
events and manually flag duplicate venues or event occurrences,
thereby creating rewriting rules for future imports) or inferred via
some statistical or machine-learning process. The former would likely
be much simpler to implement from an algorithmic POV, but obviously
has the downside of requiring much more user interaction.
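A user-provided rewrite rule could be as simple as a lookup table that gets consulted on every future import. Something like this toy Python (the helper names and the in-memory dict are just for illustration; a real version would persist the rules):

    venue_rewrites = {}

    def flag_duplicate_venue(scraped_name, canonical_name):
        """Record a rewrite rule when a user marks a scraped venue as a dupe."""
        venue_rewrites[scraped_name.strip().lower()] = canonical_name

    def canonical_venue(scraped_name):
        """Apply any recorded rewrite rule to an incoming venue string."""
        return venue_rewrites.get(scraped_name.strip().lower(), scraped_name)

    flag_duplicate_venue("CubeSpace (622 SE Grand Ave)", "CubeSpace")
    print(canonical_venue("  cubespace (622 se grand ave)"))  # -> CubeSpace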
Finally, it's worth thinking about modeling events as being related
either in generic clusters, or as hierarchical structures. That way,
duplicate events don't have to be deleted: they can simply be added to
a generic "event cluster" with other events, or marked as having a
"child" or "dupe" relationship, thereby making it easier to mine those
relationships for future de-duping steps. (Think of these clusters as
the training sets for a machine learning algorithm being applied to
new incoming events...)
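In code, the "don't delete, relate" model might look something like this (Python, with hypothetical field names):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Event:
        id: int
        title: str
        source_calendar: str
        cluster_id: Optional[int] = None    # membership flag instead of deletion
        duplicate_of: Optional[int] = None  # or an explicit parent/dupe link

    @dataclass
    class EventCluster:
        id: int
        member_ids: List[int] = field(default_factory=list)

        def add(self, event: Event):
            event.cluster_id = self.id
            self.member_ids.append(event.id)

    original = Event(1, "Ignite Portland", "upcoming.org")
    dupe     = Event(2, "Ignite Portland!", "scraped-html")
    cluster  = EventCluster(id=1)
    cluster.add(original)
    cluster.add(dupe)
    dupe.duplicate_of = original.id  # both records survive for future mining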
When displaying events, you simply pick the earliest or most
trustworthy instance from the cluster, but can still maintain the
other records in order to link back to those calendars containing the
event. Similarly, keeping clusters of venue identifiers could allow
you to more easily recognize similar variations in the future.
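The display step could then just be a pick-one-per-cluster function; the per-source trust ranking and record fields below are invented purely for illustration:

    from datetime import datetime

    # Lower rank = more trustworthy; the source names are made up.
    SOURCE_RANK = {"venue-official": 0, "upcoming.org": 1, "scraped-html": 2}

    def representative(cluster):
        """Pick the most trustworthy (then earliest-imported) record in a
        cluster, but keep links back to every calendar that carried it."""
        best = min(cluster, key=lambda e: (SOURCE_RANK.get(e["source"], 99),
                                           e["imported_at"]))
        links = [e["source_url"] for e in cluster]
        return best, links

    cluster = [
        {"source": "scraped-html", "imported_at": datetime(2008, 3, 1),
         "source_url": "http://example.com/calendar-a"},
        {"source": "upcoming.org", "imported_at": datetime(2008, 3, 2),
         "source_url": "http://example.com/calendar-b"},
    ]
    best, links = representative(cluster)
    print(best["source"], links)  # upcoming.org wins, but both links survive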
If you're going the bloom/bayes/neural net/etc. route, training can
also happen in the background via a periodic process, and trained
filters can be retroactively applied to the existing data.
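Here's a toy version of that retrain-and-backfill loop. It just learns a title-similarity threshold from user-confirmed dupes rather than doing anything as fancy as a Bayes filter or neural net, but the shape (periodic training, then re-scoring existing data) is the same:

    from difflib import SequenceMatcher
    from itertools import combinations

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def learn_threshold(confirmed_duplicate_title_pairs):
        """Use the least-similar user-confirmed pair as the bar for future flags."""
        return min(similarity(a, b) for a, b in confirmed_duplicate_title_pairs)

    def backfill_flags(existing_titles, threshold):
        """Retroactively flag existing events whose titles clear the learned bar."""
        return [(a, b) for a, b in combinations(existing_titles, 2)
                if similarity(a, b) >= threshold]

    threshold = learn_threshold([("pdx.rb meeting", "PDX Ruby Brigade meeting")])
    print(backfill_flags(["Ignite Portland", "Ignite Portland 2", "BarCamp"],
                         threshold))  # [('Ignite Portland', 'Ignite Portland 2')]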
I know this all sounds pretty abstract, but the basic ideas are really
simple: aggressively normalize input data, parse it into well-structured
data containers (for locations, timestamps, etc.), and represent
duplication as a relationship or membership flag rather than simply
deleting duplicates.
Anyway, I'd be interested to hear what other people are thinking about
this...
-Lennon