Converting matrix market files to mysql and back again


Stephen Marquard

May 4, 2011, 2:11:35 PM
to gen...@googlegroups.com
Hi all,

I recently had a need to efficiently extract subsets of a matrix, so I wrote two scripts to import an MM file into MySQL and then extract selected rows into a new MM file.

I have written up a description of this on my blog: http://trulymadlywordly.blogspot.com/2011/05/matrix-market-to-mysql-and-back-again.html

with links to the scripts, in case this is useful to anyone else.

Thanks to everyone who created gensim. It is a great piece of software, and I especially appreciated the clear documentation, tutorials and examples.

The scripts above are part of a prototype for a Wikipedia "similarity crawler" for creating topic-specific language models for use in speech recognition applications. I will describe this work and its use of gensim in more detail on my blog in the near future.

Regards
Stephen Marquard

Centre for Educational Technology,
University of Cape Town

Radim

May 4, 2011, 3:56:37 PM
to gensim
Hello Stephen,

thanks for the tip! I'm thinking of adding an SQL serializer myself
(for easier incremental/decremental updates to a corpus). It's good to
see it will be useful to other people too.

Now in your case, instead of going mm->sql->mm, why not just use the
`corpus[docno]` random-access notation? Seems easier.

And if you contribute support for slicing, `corpus[startno : endno]`,
it will be even easier :)

Best,
Radim


On May 4, 8:11 pm, Stephen Marquard <smarqu...@gmail.com> wrote:
> Hi all,
>
> I recently had a need to efficiently extract subsets of a matrix, so wrote 2
> scripts to import an MM file into mysql, and then extract selected rows into
> a new MM file.
>
> I have written up a description of this on my blog:http://trulymadlywordly.blogspot.com/2011/05/matrix-market-to-mysql-a...

Stephen Marquard

May 5, 2011, 8:45:35 AM
to gensim
On May 4, 9:56 pm, Radim <radimrehu...@seznam.cz> wrote:

> Hello Stephen,
>
> thanks for the tip! I'm thinking of adding an SQL serializer myself
> (for easier incremental/decremental updates to a corpus). It's good to
> see it will be useful to other people too.
>
> Now in your case, instead of going mm->sql->mm, why not just use the
> `corpus[docno]` random-access notation? Seems easier.

Ah, that is indeed easier! I should have read more of the
documentation before starting out.

> And if you contribute support for slicing, `corpus[startno : endno]`,
> it will be even easier :)

In my case I'm selecting on a set of non-contiguous row numbers, but I
wrote this little helper class which makes it very easy to use a
subset of a corpus:

from gensim.corpora import IndexedCorpus

class SubCorpus(IndexedCorpus):
    """
    A corpus which returns a subset of rows from a larger, indexed
    corpus.
    """
    def __init__(self, indexedCorpus, docIdList):
        self.bigcorpus = indexedCorpus
        self.idList = docIdList

    def __iter__(self):
        """
        Return one document at a time from the larger corpus.
        """
        for docId in self.idList:
            yield self.bigcorpus[int(docId)]

so my code now does:

mm = SubCorpus(bigcorpus, artIdList)

MmCorpus.serialize('/tmp/gensim_sub.mm', mm)
mm = MmCorpus('/tmp/gensim_sub.mm')

index = similarities.MatrixSimilarity(lsi[mm], numFeatures = 400)

vec = mm[0]
vec_lsi = lsi[vec] # convert the query to LSI space
sims = index[vec_lsi] # perform a similarity query against the corpus

where artIdList is the list of row ids.

Regards
Stephen

Radim

May 5, 2011, 11:54:02 AM
to gensim
Hello,

On May 5, 2:45 pm, Stephen Marquard <smarqu...@gmail.com> wrote:
>
> In my case I'm selecting on a set of non-contiguous row numbers, but I
> wrote this little helper class which makes it very easy to use a
> subset of a corpus:

Lovely! So clear and clean that I'm tempted to include it in core
gensim. But on the other hand, it's so clear and clean that it doesn't
need to be included in core gensim :) What a dilemma.

I'm thinking of adding a "best practices" section to gensim
documentation, with little tips and helper code snippets. Yours will
be a perfect candidate.

Cheers,
Radim

Stephen Marquard

May 6, 2011, 7:36:48 AM
to gensim
On May 5, 5:54 pm, Radim <radimrehu...@seznam.cz> wrote:

> Lovely! So clear and clean that I'm tempted to include it in core
> gensim. But on the other hand, it's so clear and clean that it doesn't
> need to be included in core gensim :) What a dilemma.
>
> I'm thinking of adding a "best practices" section to gensim
> documentation, with little tips and helper code snippets. Yours will
> be a perfect candidate.

I realised it's possible to make this even simpler by supporting
indexed access to the indirect corpus, which removes the need for the
save and load. It made no real difference to performance (in fact it
improved slightly, perhaps from eliminating the save/load overhead; I
was testing with a few hundred rows at a time).

So the class file is:

from gensim.corpora import IndexedCorpus

class SubCorpus(IndexedCorpus):
    """
    A corpus which returns a subset of rows from a larger,
    indexed corpus.
    """
    def __init__(self, indexedCorpus, docIdList):
        self.bigcorpus = indexedCorpus
        self.idList = docIdList

    def __iter__(self):
        """
        Return one document at a time.
        """
        for docId in self.idList:
            yield self.bigcorpus[int(docId)]

    def __len__(self):
        """
        Return corpus length as number of row ids.
        """
        return len(self.idList)

    def __getitem__(self, docno):
        return self.bigcorpus[int(self.idList[docno])]

and the main code is:

# create a smaller matrix from the larger one
mm = SubCorpus(bigcorpus, artIdList)

# transform corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[mm], numFeatures = 400)
vec = mm[0]
vec_lsi = lsi[vec] # convert the query to LSI space

# perform a similarity query against the corpus
sims = index[vec_lsi]
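As an aside, the same indirection could also cover the `corpus[startno : endno]` slicing Radim suggested. A hedged sketch, using a plain list in place of a real indexed corpus so it runs standalone (a real version would keep inheriting from IndexedCorpus as above):

```python
# Sketch: a __getitem__ that accepts slices as well as single row
# numbers. The "big corpus" here is just a list of bag-of-words
# documents, standing in for an IndexedCorpus.
class SlicableSubCorpus:
    def __init__(self, bigcorpus, id_list):
        self.bigcorpus = bigcorpus
        self.id_list = id_list

    def __getitem__(self, docno):
        if isinstance(docno, slice):
            # a slice of row ids maps to a list of documents
            return [self.bigcorpus[int(i)] for i in self.id_list[docno]]
        return self.bigcorpus[int(self.id_list[docno])]

big = [[(0, 1.0)], [(1, 2.0)], [(2, 3.0)], [(3, 4.0)]]
sub = SlicableSubCorpus(big, [3, 1, 0])  # non-contiguous row ids
```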


Regards
Stephen