Most data around us can be thought of as "things co-occurring with other things in certain contexts": products co-occurring with other products in retail market baskets, words occurring before or after other words in unstructured text, tags co-occurring with other tags in social tagging systems, people co-occurring with other people in various social networking scenarios, or objects occurring in various 2-D geometrical juxtapositions with other objects in images.
While each research community (retail, text, social networking, vision, and others) has worked in its own silo on "its" data, there has been no unifying framework to tame such a wide variety of co-occurrence data systematically. That is the theme of this talk.
We will present a simple, intuitive, yet powerful co-occurrence analytics framework for dealing with a wide variety of data of the form "things co-occurring with other things in some context". After describing the framework, we will demonstrate how to adapt and apply its core principles to a variety of large real-world datasets to find novel and actionable insights, even in the presence of significant noise in the data.
Specifically, we will describe how to use this framework to find (a) product bundles in retail point-of-sale data, (b) communities in tag networks, and (c) meaningful phrases in text data.
What makes this approach attractive is that it is:
Unsupervised: No cost of getting labeled data. Just point it to the data and crunch.
Unbiased: No prior assumptions about data distributions, etc.
High Precision: Generates very high quality insights.
High Recall: Generates exhaustively many insights.
Parameter Poor: Very few parameters to play with.
Scalable: Highly parallelizable in the MapReduce sense.
Universal: Can be applied to a wide variety of domains and applications.
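
To make the notion of "things co-occurring with other things in some context" concrete, here is a minimal Python sketch. It is not the framework presented in the talk, whose internals this abstract does not specify. It only counts how often pairs of items share a context (here, toy market baskets) and scores each pair with pointwise mutual information, a standard association measure chosen purely for illustration. The function names and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_counts(contexts):
    """Count the contexts containing each item and each unordered item pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for context in contexts:
        # De-duplicate within a context and sort so each pair has one canonical order.
        items = sorted(set(context))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def pmi(pair, item_counts, pair_counts, n_contexts):
    """Pointwise mutual information of a pair (illustrative measure, an assumption)."""
    a, b = pair
    if pair_counts[pair] == 0:
        return float("-inf")  # the pair never co-occurs
    p_ab = pair_counts[pair] / n_contexts
    p_a = item_counts[a] / n_contexts
    p_b = item_counts[b] / n_contexts
    return math.log(p_ab / (p_a * p_b))


# Hypothetical toy data: each inner list is one context (a market basket).
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "milk"],
]

item_counts, pair_counts = cooccurrence_counts(baskets)
scores = {
    pair: pmi(pair, item_counts, pair_counts, len(baskets))
    for pair in pair_counts
}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

The same counting step applies unchanged if each context is instead a sentence (items are words) or a tagged resource (items are tags), which is the sense in which such data share a common form.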