[GSoC Weekly] Using Cassovary to analyze Wikipedia (week 1)

Szymon Matejczyk
May 26, 2014, 11:41:20 AM
to twitter-...@googlegroups.com
Hi guys,

I'm Szymon Matejczyk (@szymonmatejczyk) and I'll be working on Cassovary as a GSoC student this summer.

The goal of my project is to use Cassovary to analyze the Wikipedia page graph and to use that graph for entity resolution in short texts.

Last week I worked on a few minor changes needed to load a Wikipedia dump into Cassovary:
  1. https://github.com/twitter/cassovary/pull/69 -- allow renumbering Long node ids, or numbering String node ids, so that such nodes can be used in Cassovary
  2. https://github.com/twitter/cassovary/pull/71 -- read graphs that have Long or String nodes into Cassovary (builds on the previous PR)
  3. https://github.com/twitter/cassovary/pull/72 -- improve parallel reading of graphs by moving to Futures
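
To illustrate the renumbering idea behind the first PR, here is a minimal sketch. It is written in Python rather than Scala and is not the code from the PR: `NodeRenumberer` and `renumber_edges` are made-up names, and the only assumption is that external ids (Longs or Strings) are hashable. Each new external id gets the next free dense integer, so the graph can then be stored with Int node ids.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


class NodeRenumberer:
    """Hypothetical helper: maps arbitrary hashable ids to dense ints 0, 1, 2, ...

    Ids are assigned in order of first appearance.
    """

    def __init__(self) -> None:
        self._to_internal: Dict[Hashable, int] = {}
        self._to_external: List[Hashable] = []

    def internal_id(self, external: Hashable) -> int:
        """Return the dense int for `external`, assigning a new one if unseen."""
        idx = self._to_internal.get(external)
        if idx is None:
            idx = len(self._to_external)
            self._to_internal[external] = idx
            self._to_external.append(external)
        return idx

    def external_id(self, internal: int) -> Hashable:
        """Reverse lookup: dense int back to the original id."""
        return self._to_external[internal]

    def __len__(self) -> int:
        return len(self._to_external)


def renumber_edges(
    edges: Iterable[Tuple[Hashable, Hashable]], renumberer: NodeRenumberer
) -> List[Tuple[int, int]]:
    """Translate (source, destination) pairs into pairs of dense ints."""
    return [(renumberer.internal_id(s), renumberer.internal_id(d)) for s, d in edges]
```

With String ids such as page titles, `renumber_edges([("Warsaw", "Poland"), ("Poland", "Europe")], NodeRenumberer())` yields Int pairs, and `external_id` recovers the titles afterwards. The reverse table costs extra memory, which matters for a graph the size of Wikipedia.
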
By the end of this week I plan to load Wikipedia dumps into Cassovary and make some minor improvements:
  1. Test my approximate triangle counting algorithm: https://github.com/twitter/cassovary/pull/62
  2. Allow generic node types in Cassovary (for now only Ints are possible) and benchmark the resulting performance decrease
  3. Enhance traversals
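
For readers unfamiliar with approximate triangle counting, here is a generic edge-sampling sketch. It is written in Python and is not the algorithm from PR 62. It assumes an undirected graph given as a list of distinct `(u, v)` pairs with comparable node ids, and `count_triangles` and `estimate_triangles` are made-up names. Each edge is kept with probability `p`, triangles are counted exactly in the sample, and the count is scaled by `1 / p**3`, since a triangle survives only if all three of its edges do.

```python
import random
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[int, int]


def count_triangles(edges: Iterable[Edge]) -> int:
    """Exact triangle count of an undirected graph (self-loops are ignored)."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])  # common neighbours close a triangle
    return total // 3  # every triangle was counted once per edge


def estimate_triangles(edges: List[Edge], p: float, seed: Optional[int] = None) -> float:
    """Estimate the triangle count by keeping each edge with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rng = random.Random(seed)
    sampled = [e for e in edges if rng.random() < p]
    return count_triangles(sampled) / p ** 3
```

Smaller `p` means less work but a noisier estimate. With `p = 1.0` the estimate equals the exact count, which makes a handy sanity check.
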
If you have any questions or suggestions, feel free to ask.
Szymon