Lennon Day-Reynolds
Mar 3, 2008, 3:09:04 PM
to PDX Tech Calendar
Having worked on some other calendaring systems, as well as tools that
synchronize scheduling data across different domain models, I wanted
to throw a few ideas out for de-duping heuristics, and see what people
thought. (If there's sufficient interest, I also might be able to
spend some working hours on some trial implementations, since there
are a number of scheduling-related issues coming up at Reed these days
that might benefit from a good calendar aggregator.)
The first and most important tool for duplicate-detection is good
normalization of input data, followed by time and location parsing. If
you can show that two events happen in the same time and place,
chances are you've found a duplicate. (The exception to this of course
being extremely vague locations like, say, "PSU Campus", which could
play host to many events at once.)
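To make that a little more concrete, here's a rough Python sketch of the normalize-then-compare step; the field names ("venue", "start") and the normalization rules are just placeholders I made up, not any real schema:

    import re
    from datetime import datetime, timedelta

    def normalize_venue(name):
        """Lowercase, drop punctuation, and collapse whitespace."""
        name = re.sub(r"[^\w\s]", "", name.lower())
        return re.sub(r"\s+", " ", name).strip()

    def probably_duplicate(a, b, time_slop=timedelta(minutes=30)):
        """Two events at (roughly) the same time and the same normalized
        venue get flagged as likely duplicates."""
        same_place = normalize_venue(a["venue"]) == normalize_venue(b["venue"])
        same_time = abs(a["start"] - b["start"]) <= time_slop
        return same_place and same_time

    scraped = {"venue": "CubeSpace, Portland", "start": datetime(2008, 3, 4, 19, 0)}
    posted  = {"venue": "cubespace portland",  "start": datetime(2008, 3, 4, 19, 0)}
    print(probably_duplicate(scraped, posted))  # True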
At that point, some sort of effective rule engine needs to kick in to
flag possible duplicates. Those rules could be explicitly provided by
users (i.e., folks adding a new calendar could then browse the scraped
events and manually flag duplicate venues or event occurrences,
thereby creating rewriting rules for future imports) or inferred via
some statistical or machine-learning process. The former would likely
be much simpler to implement from an algorithmic POV, but obviously
has the downside of requiring much more user interaction.
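A user-provided rewrite rule could be as simple as a lookup table that gets consulted on every future import. Something like this toy Python (the helper names and the in-memory dict are just for illustration; a real version would persist the rules):

    venue_rewrites = {}

    def flag_duplicate_venue(scraped_name, canonical_name):
        """Record a rewrite rule when a user marks a scraped venue as a dupe."""
        venue_rewrites[scraped_name.strip().lower()] = canonical_name

    def canonical_venue(scraped_name):
        """Apply any recorded rewrite rule to an incoming venue string."""
        return venue_rewrites.get(scraped_name.strip().lower(), scraped_name)

    flag_duplicate_venue("CubeSpace (622 SE Grand Ave)", "CubeSpace")
    print(canonical_venue("  cubespace (622 se grand ave)"))  # -> CubeSpace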
Finally, it's worth thinking about modeling events as being related
either in generic clusters, or as hierarchical structures. That way,
duplicate events don't have to be deleted: they can simply be added to
a generic "event cluster" with other events, or marked as having a
"child" or "dupe" relationship, thereby making it easier to mine those
relationships for future de-duping steps. (Think of these clusters as
the training sets for a machine learning algorithm being applied to
new incoming events...)
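In code, the "don't delete, relate" model might look something like this (Python, with hypothetical field names):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Event:
        id: int
        title: str
        source_calendar: str
        cluster_id: Optional[int] = None    # membership flag instead of deletion
        duplicate_of: Optional[int] = None  # or an explicit parent/dupe link

    @dataclass
    class EventCluster:
        id: int
        member_ids: List[int] = field(default_factory=list)

        def add(self, event: Event):
            event.cluster_id = self.id
            self.member_ids.append(event.id)

    original = Event(1, "Ignite Portland", "upcoming.org")
    dupe     = Event(2, "Ignite Portland!", "scraped-html")
    cluster  = EventCluster(id=1)
    cluster.add(original)
    cluster.add(dupe)
    dupe.duplicate_of = original.id  # both records survive for future mining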
When displaying events, you simply pick the earliest or most
trustworthy instance from the cluster, but can still maintain the
other records in order to link back to those calendars containing the
event. Similarly, keeping clusters of venue identifiers could allow
you to more easily recognize similar variations in the future.
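The display step could then just be a pick-one-per-cluster function; the per-source trust ranking and record fields below are invented purely for illustration:

    from datetime import datetime

    # Lower rank = more trustworthy; the source names are made up.
    SOURCE_RANK = {"venue-official": 0, "upcoming.org": 1, "scraped-html": 2}

    def representative(cluster):
        """Pick the most trustworthy (then earliest-imported) record in a
        cluster, but keep links back to every calendar that carried it."""
        best = min(cluster, key=lambda e: (SOURCE_RANK.get(e["source"], 99),
                                           e["imported_at"]))
        links = [e["source_url"] for e in cluster]
        return best, links

    cluster = [
        {"source": "scraped-html", "imported_at": datetime(2008, 3, 1),
         "source_url": "http://example.com/calendar-a"},
        {"source": "upcoming.org", "imported_at": datetime(2008, 3, 2),
         "source_url": "http://example.com/calendar-b"},
    ]
    best, links = representative(cluster)
    print(best["source"], links)  # upcoming.org wins, but both links survive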
If you're going the bloom/bayes/neural net/etc. route, training can
also happen in the background via a periodic process, and trained
filters can be retroactively applied to the existing data.
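Here's a toy version of that retrain-and-backfill loop. It just learns a title-similarity threshold from user-confirmed dupes rather than doing anything as fancy as a Bayes filter or neural net, but the shape (periodic training, then re-scoring existing data) is the same:

    from difflib import SequenceMatcher
    from itertools import combinations

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def learn_threshold(confirmed_duplicate_title_pairs):
        """Use the least-similar user-confirmed pair as the bar for future flags."""
        return min(similarity(a, b) for a, b in confirmed_duplicate_title_pairs)

    def backfill_flags(existing_titles, threshold):
        """Retroactively flag existing events whose titles clear the learned bar."""
        return [(a, b) for a, b in combinations(existing_titles, 2)
                if similarity(a, b) >= threshold]

    threshold = learn_threshold([("pdx.rb meeting", "PDX Ruby Brigade meeting")])
    print(backfill_flags(["Ignite Portland", "Ignite Portland 2", "BarCamp"],
                         threshold))  # [('Ignite Portland', 'Ignite Portland 2')]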
I know this all sounds pretty abstract, but the basic ideas are really
simple: aggressively normalize input data, parse it into well-structured
data containers (for locations, timestamps, etc.), and represent
duplication as a relationship or membership flag rather than simply
deleting duplicates.
Anyway, I'd be interested to hear what other people are thinking about
this...
-Lennon