Metaoptimize challenge

Dragon Silicon

Nov 6, 2010, 2:40:31 AM
to metaoptimize-ch...@googlegroups.com
Solution:
The key observation is that the connection matrix of potential word combinations is sparse: the actual number of real connections is limited.

Let's put down some definitions:
-A token connection is defined as the existence of at least 1 document in which both words occur.
-Semantic similarity is defined as a function of _co-occurrence with the same external words_. That is, we use third-word validation, searching for token pairs with the highest co-occurrence with the same external tokens. (External tokens are tokens that are not equal to either the source or the target token, but co-occur with both at least once.)

-We start with a divide & conquer step: take the original documents and split them into 145 chunks of about 1 MB each. This makes the later processing amenable to parallelization (a sketch of this step follows below).
Time required: about a minute or two
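A rough sketch of what this chunking step could look like; the paths corpus.txt and chunks/ are placeholders, and splitting on line boundaries (so no sentence is cut in half) is an assumption, not necessarily what the actual .py files do:

# Split a plain-text corpus into ~1 MB chunks on line boundaries,
# so each chunk can later be processed independently (and in parallel).
import os

CHUNK_SIZE = 1 * 1024 * 1024  # ~1 MB per chunk

def chunk_corpus(corpus_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    buf, size, idx = [], 0, 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            size += len(line.encode("utf-8"))
            if size >= CHUNK_SIZE:
                write_chunk(buf, out_dir, idx)
                buf, size, idx = [], 0, idx + 1
    if buf:
        write_chunk(buf, out_dir, idx)

def write_chunk(lines, out_dir, idx):
    # write one chunk file, e.g. chunks/chunk_000.txt
    with open(os.path.join(out_dir, "chunk_%03d.txt" % idx), "w", encoding="utf-8") as out:
        out.writelines(lines)

# chunk_corpus("corpus.txt", "chunks/")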

-Then each chunk is processed with map/reduce to generate connection tables (see the sketch below).
A connection table has three columns: source word, target word, and co-occurrence count.
The map task iterates through each sentence and emits word-word pairs; the reduce task is defined as taking the sum over identical word-word pairs.
5 seconds / chunk, 145 chunks ~= 10 min
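A sketch of the per-chunk map/reduce pass. The assumptions here: a "sentence" is one line of the chunk, a pair is counted once per unordered word pair per sentence, and the tokenizer is a crude lowercase one; the real solution may differ on all three points.

from collections import Counter
from itertools import combinations
import re

def tokenize(sentence):
    # crude lowercase tokenizer; placeholder for whatever the real code does
    return re.findall(r"[a-z']+", sentence.lower())

def map_sentence(sentence):
    # map: emit each unordered word-word pair once per sentence
    tokens = sorted(set(tokenize(sentence)))
    for pair in combinations(tokens, 2):
        yield pair

def build_connection_table(chunk_path):
    # reduce: sum the emitted pairs into co-occurrence counts
    table = Counter()
    with open(chunk_path, encoding="utf-8") as f:
        for sentence in f:
            table.update(map_sentence(sentence))
    return table  # {(source_word, target_word): co-occurrence count}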

-Merging the connection tables:
After the separate chunks have been processed, the result tables are reduced into a single token-token-count table, which gives the final tally of connections over the whole dataset.
10 seconds / chunk, 145 chunks ~= 20 min

The merging operation here is functionally equivalent to the reduce task of the chunk processing described in step 2.
To reduce memory requirements, instead of storing plain token-token pairs in working memory, tokens are mapped into a global dictionary and the connection map references token ids, making the memory footprint slightly smaller (sketch below).
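A sketch of the merge with the token-id dictionary; the names are illustrative, and the reduce is the same summation as in step 2, only over (id, id) keys instead of raw string pairs:

from collections import Counter

token_ids = {}  # word -> integer id (the global dictionary)

def intern(word):
    # assign a new id the first time a word is seen
    return token_ids.setdefault(word, len(token_ids))

def merge_tables(chunk_tables):
    merged = Counter()
    for table in chunk_tables:
        for (a, b), count in table.items():
            merged[(intern(a), intern(b))] += count  # same reduce as step 2
    return merged  # {(source_id, target_id): total co-occurrence count}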


Scaling properties: from the ~140 MB plain corpus, the resulting (unpacked, raw) merged wordlist weighs about ~240 MB. This will be of special interest once the larger datasets are released, for storage and memory footprint characteristics.


-Searching for semantic similarity:
Iterating over the wordlist, for every word we take the N most strongly connected proxy words, and then the top-N third words connected to each proxy. The semantic similarity is calculated as the product of the 1st-2nd token connection count and the 2nd-3rd token connection count. The resulting (token, score) pairs are sorted by similarity score, and the top 10 are written out for every word (see the sketch below).
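A sketch of that scoring step, assuming the merged table has been re-indexed into a symmetric adjacency map word -> {neighbour: count}. Summing the products over all proxies is one possible reading of the description; the actual code may keep only a single proxy path per third word. All names and default parameters are illustrative.

from collections import defaultdict
import heapq

def top_similar(adjacency, word, top_n=20, keep=10):
    # adjacency: dict word -> {neighbour: co-occurrence count}
    scores = defaultdict(int)
    proxies = heapq.nlargest(top_n, adjacency[word].items(), key=lambda kv: kv[1])
    for proxy, c1 in proxies:
        thirds = heapq.nlargest(top_n, adjacency[proxy].items(), key=lambda kv: kv[1])
        for third, c2 in thirds:
            if third != word and third != proxy:  # external tokens only
                scores[third] += c1 * c2          # product of 1st-2nd and 2nd-3rd counts
    # keep the 10 highest-scoring (third word, score) pairs
    return heapq.nlargest(keep, scores.items(), key=lambda kv: kv[1])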

In conclusion, we reconstructed the sparse matrix of connections, along with their co-occurrence counters, and used it to evaluate semantic similarity, all well under a total runtime of an hour. Although the current implementation is single-threaded, almost all of the time-intensive steps can run in parallel, further lowering the runtime.
Once the connection table is built, there are a number of other ways in which "semantic similarity" could be calculated.

Note: the stuff above was done between 2 am and 5 am; I'm pretty sure there are massive opportunities for optimization in both time and space requirements, as well as output quality.

The output of the current solution can be downloaded from 


Does the output make sense semantically? :)
-SDr

Dragon Silicon

Nov 6, 2010, 2:41:59 AM
to metaoptimize-ch...@googlegroups.com
Sources of the solution:
http://88.151.103.8/similarity/   (both .py files)