simserver for plone - integrate simserver into a CMS


Christian Ledermann

Jan 16, 2012, 11:46:20 AM
to gensim
Hello all,

To give you an idea what I am working on:

I want to integrate the web content management system Plone
(http://plone.org) with simserver, so that related items can be retrieved
automatically. Related items are a powerful feature of Plone, but content
managers mostly fail to maintain them, so simserver comes to the rescue and
does it automatically (well, not yet; it still has to be invoked manually ;)

In the future I also envision automatic tagging based on existing
tagged items in the database.


https://github.com/cleder/collective.simserver.core
Simserver Integration for plone

https://github.com/cleder/collective.simserver.related
Collective Simserver Related items for plone


The first tests are very encouraging (actually I was extremely impressed):
the corpus consisted of 2900 documents, and the similar documents identified
by simserver are really highly related to the original document.

awesome work Radim 8-)

the above is a first alpha, but I think the pace will be fast, so stay tuned ;)

--
Best Regards,

Christian Ledermann

Nairobi - Kenya
Mobile : +254 702978914

<*)))>{

If you save the living environment, the biodiversity that we have left,
you will also automatically save the physical environment, too. But If
you only save the physical environment, you will ultimately lose both.

1) Don’t drive species to extinction

2) Don’t destroy a habitat that species rely on.

3) Don’t change the climate in ways that will result in the above.

}<(((*>

Radim

Jan 18, 2012, 8:21:43 AM
to gensim
Whoa that's pretty cool! A CMS system/a library is a great fit for
gensim. Something like that was actually my idea behind creating
simserver :)

Christian, let me know if you come across any issues. I mean both
inadequacies on API level and downright bugs, the simserver is still
an infant. I'll try to help as much as I can, I like your project. Btw
what is your involvement in it, why are you doing it? (send me a
private email if that suits you better :)

Best,
Radim


On Jan 16, 5:46 pm, Christian Ledermann
<christian.lederm...@gmail.com> wrote:
> Hello all,
>
> To give you an idea what I am working on:
>
> I want to integrate the web content management system plone http://plone.org with simserver, so that related items can be automatically

Christian Ledermann

Jan 18, 2012, 10:13:51 AM
to gen...@googlegroups.com
On Wed, Jan 18, 2012 at 4:21 PM, Radim <radimr...@seznam.cz> wrote:
> Whoa that's pretty cool! A CMS system/a library is a great fit for
> gensim. Something like that was actually my idea behind creating
> simserver :)
>
> Christian, let me know if you come across any issues. I mean both
> inadequacies on API level and downright bugs, the simserver is still
> an infant. I'll try to help as much as I can, I like your project. Btw
> what is your involvement in it, why are you doing it? (send me a
> private email if that suits you better :)

Currently I am the sole developer; it will be used on http://iwlearn.net/

The rationale behind it is to produce a list of similar documents, so that
a user gets an idea of what else might interest them (surprise :)

It can also be used for quality assurance, e.g. if another document has a
similarity above 0.99, it might be a duplicate.
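
A minimal sketch of that duplicate check, assuming similarity results arrive as (id, similarity, payload) tuples in the style of simserver's find_similar. The function name, the threshold default and the sample data are made up for illustration:

```python
def likely_duplicates(results, query_id, threshold=0.99):
    """Return ids whose similarity to the query document exceeds the threshold.

    `results` is assumed to be a list of (doc_id, similarity, payload) tuples.
    The query document itself is skipped, since it always matches itself.
    """
    return [
        doc_id
        for doc_id, similarity, _payload in results
        if doc_id != query_id and similarity > threshold
    ]


# Illustrative data only, not real simserver output.
results = [
    ("doc-1", 1.0, None),    # the query document itself
    ("doc-7", 0.995, None),  # suspiciously close: probably a duplicate
    ("doc-3", 0.62, None),   # related, but clearly a different document
]
print(likely_duplicates(results, "doc-1"))
```

The 0.99 cut-off comes from the post above; a real deployment would probably need to tune it against known duplicates.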

Another use case is to simplify the tagging of content, especially when you
introduce new tags it is a pita to find all the existing documents that
should probably be tagged with it too. The idea is to tag some of
the key documents and then see what is closely related and work your way
through the db (maybe semi automated)
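
The tagging idea could be sketched roughly as follows. This is only an illustration under assumptions: `find_similar` is any callable that returns (id, similarity, payload) tuples, and `suggest_tags`, `min_score` and the sample ids are hypothetical names, not part of simserver or Plone:

```python
def suggest_tags(seed_tags, find_similar, min_score=0.8):
    """Propose tags for untagged documents that resemble tagged seed documents.

    seed_tags:    dict mapping a document id to the set of tags it already has
    find_similar: callable taking a document id and returning a list of
                  (doc_id, similarity, payload) tuples
    Returns a dict mapping untagged document ids to suggested tags, which a
    content manager can then review (the "semi-automated" part).
    """
    suggestions = {}
    for seed_id, tags in seed_tags.items():
        for doc_id, score, _payload in find_similar(seed_id):
            if doc_id in seed_tags or score < min_score:
                continue  # skip the seeds themselves and weak matches
            suggestions.setdefault(doc_id, set()).update(tags)
    return suggestions


# Stand-in for a real similarity index; the scores are invented.
fake_index = {
    "report-a": [
        ("report-a", 1.0, None),
        ("report-b", 0.91, None),
        ("memo-c", 0.40, None),
    ],
}
seed_tags = {"report-a": {"fisheries"}}
suggestions = suggest_tags(seed_tags, lambda doc_id: fake_index.get(doc_id, []))
print(suggestions)
```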

My next step is to build a REST server for simserver, because I once
experienced (though I could not reproduce the behaviour) that simserver over
Pyro hung my client, so a loose coupling over HTTP seems desirable. I am
trying to get up to speed with Pyramid (I always wanted to learn that
framework and it seems a good fit).

Christian Ledermann

Jan 18, 2012, 10:35:27 AM
to gen...@googlegroups.com
On Wed, Jan 18, 2012 at 6:13 PM, Christian Ledermann
<christian...@gmail.com> wrote:
> My next step is to build a REST server for simserver as I had experienced
> once (I could not replicate the behaviour) that simserver over pyro
> hung my client.
> so a loose coupling over http seems desirable. So I am trying to get up to speed
> with pyramid (I always wanted to learn that framework and it seems a good fit)

Another rationale for the REST server is licensing: Plone products must be
licensed under GPL 2, and there are probably other projects too that would
profit from a loose coupling that avoids the AGPL. The REST server itself
would be AGPL though.

Possible planned features:

* multiple simserver instances per REST server, so you can run it as SaaS
  or provide multi-language capabilities

* a simpler call interface: the client only has to provide the text (as
  plain text) and the id of a document to be indexed etc., so tokenization
  need not be done by the client

* offline corpus generation and indexing by uploading a zip (or tgz,
  tar.bz2) file containing the documents (in plain text)
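
To illustrate the "simpler call interface" item, here is a rough sketch of server-side tokenization. It assumes simserver takes documents as dicts with 'id' and 'tokens' keys; the regex tokenizer is a simplified stand-in (not gensim's own preprocessing), and `to_simserver_doc` with its length limits is a hypothetical helper:

```python
import re

# Simplified stand-in tokenizer: runs of word characters, lowercased.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def to_simserver_doc(doc_id, text, min_len=2, max_len=15):
    """Turn (id, plain text) from a client into an {'id', 'tokens'} dict.

    Tokens shorter than min_len or longer than max_len are dropped.
    """
    tokens = [
        token
        for token in TOKEN_RE.findall(text.lower())
        if min_len <= len(token) <= max_len
    ]
    return {"id": doc_id, "tokens": tokens}


doc = to_simserver_doc("doc-42", "Report of the Fourth Ordinary Meeting")
print(doc)
```

With a helper like this on the server, the client only ever sends an id and a plain-text body.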

Radim

Jan 20, 2012, 6:58:37 AM
to gensim
Hey, big plans, good to hear :-)

Another Christian -- Winkelmann -- also used an earlier version of
simserver for a REST service: http://groups.google.com/group/gensim/browse_thread/thread/d847e10e114cf83e

They used Flask though (not Pyramid).

>  * simpler call interface, the client only has to provide the text (as
> plain text) and
>    id of a document to be indexed etc. so tokenization needs not to be done by
>    the client.

That was the state of simserver some versions ago; I changed it to
accepting tokenized input because that flexibility was needed. More
generally, one solution is to offer several tokenizers for users to
choose from, another option to let users specify their tokenizer once
(=accept code) and then continue accepting plain texts.

Re: licensing, GPL is of course fine. The AGPL is meant to stop people
using the server commercially, without giving anything back or even
acknowledging its use. As long as "gensim the project" profits from
the use, commercial as well as free applications are welcome to use
it :-)

Best,
Radim


Christian Ledermann

Jan 20, 2012, 10:48:43 AM
to gen...@googlegroups.com
REST service:
first alpha available on https://github.com/cleder/restsims

On Fri, Jan 20, 2012 at 2:58 PM, Radim <radimr...@seznam.cz> wrote:
> Hey, big plans, good to hear :-)
>
> Another Christian -- Winkelmann -- also used an earlier version of
> simserver for a REST service: http://groups.google.com/group/gensim/browse_thread/thread/d847e10e114cf83e
>
> They used Flask though (not Pyramid).
>

@ christian: talk is cheap show me the code :)

>>  * simpler call interface, the client only has to provide the text (as
>> plain text) and
>>    id of a document to be indexed etc. so tokenization needs not to be done by
>>    the client.
>
> That was the state of simserver some versions ago; I changed it to
> accepting tokenized input because that flexibility was needed. More
> generally, one solution is to offer several tokenizers for users to
> choose from, another option to let users specify their tokenizer once
> (=accept code) and then continue accepting plain texts.

OK, restsims tokenization currently uses utils.simple_preprocess,
so there is room for improvement ;)


> Re: licensing, GPL is of course fine. The AGPL is meant to stop people
> using the server commercially, without giving anything back or even
> acknowledging its use. As long as "gensim the project" profits from
> the use, commercial as well as free applications are welcome to use
> it :-)
>
> Best,
> Radim
>
>

--

Christian Ledermann

Jan 20, 2012, 11:31:48 AM
to gen...@googlegroups.com
short description of the REST server:

When you start the server you will get a simple form which lets you
interact with the server.

Implemented:
- offline corpus generation and indexing by uploading a zip (or tgz,
  tar.bz2) file containing the documents (in plain text)

- training, indexing, querying

- querying returns either html or json

Not implemented:

- multiple simserver instances per REST server (low priority)
- authentication, so you should not use it on a public network,
  as anybody could overwrite your training or indexing!

- training and indexing do not return JSON yet
- custom tokenization on the server or client side

License: AGPL


Beware the code is alpha and ugly, but short (on the brighter side :)

> That was the state of simserver some versions ago; I changed it to
> accepting tokenized input because that flexibility was needed. More
> generally, one solution is to offer several tokenizers for users to
> choose from, another option to let users specify their tokenizer once
> (=accept code) and then continue accepting plain texts.

I think the client could either send plain text, in which case tokenization
is done by the server, or send already tokenized JSON, which the server just
passes on to simserver.
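
A small sketch of that dual input handling. It is an illustration only: switching on a content type, the function name and the regex tokenizer are all assumptions, not how restsims actually decides:

```python
import json
import re


def tokens_from_request(body, content_type):
    """Return a token list from either tokenized JSON or plain text.

    - "application/json": body must be a JSON list of strings, passed through
    - anything else: body is treated as plain text and tokenized here
    """
    if content_type == "application/json":
        tokens = json.loads(body)
        if not (isinstance(tokens, list) and all(isinstance(t, str) for t in tokens)):
            raise ValueError("expected a JSON list of strings")
        return tokens
    # Stand-in tokenizer; a real server might offer several to choose from.
    return re.findall(r"\w+", body.lower())


pre_tokenized = tokens_from_request('["lake", "tanganyika"]', "application/json")
server_tokenized = tokens_from_request("Lake Tanganyika", "text/plain")
print(pre_tokenized, server_tokenized)
```

Clients that need a custom tokenizer keep full control via the JSON path, while simple clients can stay with plain text.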

Radim

Jan 23, 2012, 5:47:42 AM
to gensim
> > Another Christian -- Winkelmann -- also used an earlier version of
> > simserver for a REST service:http://groups.google.com/group/gensim/browse_thread/thread/d847e10e11...
>
> > They used Flask though (not Pyramid).
>
> @ christian: talk is cheap show me the code :)

Haha, a developer with an attitude :) And also with a mission, going
by your email signature. That's always a plus -- development without a
passion is simply a boring job!

More importantly, good job on the server. It looks like restsims
doesn't allow training on bigger data than fits in RAM, right?

I am not familiar with Plone, but I think it would be really cool if
your http extension of simserver could be used by other CMS's, too. As
a sort of general building block for "find similar content". I guess
all CMS's must be solving a similar problem of suggesting similar
content to users (similar as in "the same general idea", not "exact
duplicate wording").

Best,
Radim

Christian Ledermann

Jan 23, 2012, 7:28:51 AM
to gen...@googlegroups.com
On Mon, Jan 23, 2012 at 1:47 PM, Radim <radimr...@seznam.cz> wrote:

> More importantly, good job on the server.

thx :)

> It looks like restsims
> doesn't allow training on bigger data than fits in RAM, right?

Yes, if you pass (preprocessed) data via the 'text' argument, because
json.loads is used.
If you pass a compressed file via 'filedata', generators are used,
so there should be no RAM restriction.
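
The difference can be sketched with a generator that reads an uploaded zip one member at a time, instead of loading everything with json.loads. This is a hedged illustration, not the restsims code: `iter_documents` and the regex tokenizer are assumptions, and only the standard library is used:

```python
import io
import re
import zipfile


def iter_documents(fileobj):
    """Yield one {'id', 'tokens'} dict per file in a zip archive.

    Only the member currently being processed is held in memory, so the
    corpus as a whole may be larger than RAM.
    """
    with zipfile.ZipFile(fileobj) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no text
            with archive.open(info) as member:
                text = member.read().decode("utf-8", errors="replace")
            yield {"id": info.filename, "tokens": re.findall(r"\w+", text.lower())}


# Build a tiny in-memory archive just to exercise the generator.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("dir1/do1.pdf", "Lake Tanganyika report")
    archive.writestr("index.html", "Welcome page")
buffer.seek(0)

docs = iter_documents(buffer)
first = next(docs)   # nothing beyond the first member has been read yet
rest = list(docs)
print(first, rest)
```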

It is an alpha; right now the focus is on getting it usable and keeping
it simple. There will be optimizations later (contributors are
always welcome ;) but right now it is a YAGNI for me.

Also, there should be quite a lot of optimizations possible, e.g. buffering
the documents better during training and indexing.

> I am not familiar with Plone, but I think it would be really cool if
> your http extension of simserver could be used by other CMS's, too. As
> a sort of general building block for "find similar content". I guess
> all CMS's must be solving a similar problem of suggesting similar
> content to users (similar as in "the same general idea", not "exact
> duplicate wording").

With restsims there is no problem: there is no Plone-specific
code in there at all. Any CMS that can provide a unique id
and the plain text (stripped of all markup) can connect.

It does not even have to be a CMS: with some preprocessing (e.g.
convert all content to be indexed into plain text, removing everything
except the pure text content), you just tar and gzip the output,

e.g.:
index.html
dir1/do1.pdf
dir1/do2.doc
dir1/do3.htm
dir2/do1.pdf

where do*.* and index.html are plain text files; the extension
is artificial, just to give the document an id.
When indexing a compressed file, the file name as stored in the
archive is taken as the id, so a find_similar("dir1/do1.pdf") could
return [("dir1/do1.pdf", 1.0, None), ("index.html", ..), ("dir2/do3.html", ..)]
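
Packing such an archive could look roughly like this. It is a sketch using only the standard library; `build_corpus_archive` is a hypothetical helper and the texts are placeholders:

```python
import io
import tarfile


def build_corpus_archive(documents):
    """Pack {doc_id: plain_text} into an in-memory .tar.gz.

    Each member name is the document id, so the name stored in the archive
    is what the server would later report back as the id.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for doc_id, text in documents.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=doc_id)
            info.size = len(data)  # size must be set before adding the member
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


payload = build_corpus_archive({
    "index.html": "plain text of the index page",
    "dir1/do1.pdf": "plain text extracted from the PDF",
})

# Read it back to confirm that the member names are the ids.
with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
    names = archive.getnames()
print(names)
```

The resulting bytes are what a client would upload; how the upload itself is done depends on the server and is not shown here.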

You could even use it with pure JavaScript and no server-side language at all.

The plone integration layer is also very slim so it can be used as
a reference.


BTW:
in gensim.utils there is some simserver-specific stuff; I think
the simserver-specific functions should be moved to
simserver.utils

Christian Ledermann

Feb 14, 2012, 10:47:49 AM
to gen...@googlegroups.com
I finally got around to updating the documentation for
restsims, a small Pyramid RESTful wrapper around simserver.

https://github.com/cleder/restsims

please let me know if you have trouble installing or running it.

Christian Ledermann

Feb 17, 2012, 1:22:48 AM
to gen...@googlegroups.com
I released some more documentation, this time for the plone WCMS
components @

http://plone.org/products/collective.simserver/

let me know your thoughts ;)

Radim

Feb 17, 2012, 4:47:45 AM
to gensim
Nice Christian, great job!

A documentation page certainly helps a lot. Did you get any feedback
from the plone community yet? What are the main obstacles to wider
adoption? Would be cool if more people use it :)

Plus can you at least mention gensim in there, I think that'd be
appropriate.

Best,
Radim



Christian Ledermann

Feb 17, 2012, 6:58:22 AM
to gen...@googlegroups.com
On Fri, Feb 17, 2012 at 12:47 PM, Radim <radimr...@seznam.cz> wrote:
> Nice Christian, great job!
>
> A documentation page certainly helps a lot. Did you get any feedback
> from the plone community yet?

No, not yet. I think most people in the community do not know what it is
good for, so I have to advertise it in a way that makes similarity search
desirable. (I think the SEO angle will be helpful.)

> What are the main obstacles to wider
> adoption? Would be cool if more people use it :)
>
> Plus can you at least mention gensim in there, I think that'd be
> appropriate.

Good point, done :)

>
> Best,
> Radim
>
>
> On Feb 17, 7:22 am, Christian Ledermann
> <christian.lederm...@gmail.com> wrote:
>> I released some more documentation, this time for the plone WCMS
>> components @
>>
>> http://plone.org/products/collective.simserver/
>>

Christian Ledermann

Apr 4, 2012, 5:40:37 AM
to gen...@googlegroups.com
just to keep you in the loop:

The simserver integration has now been in production for two months.
Apart from the (now fixed) bug that documents were treated differently
in optimized and non-optimized indexes, it was all smooth sailing.

Have a look at e.g.:

http://iwlearn.net/iw-projects/1017/workshops/report-of-the-fourth-ordinary-meeting-of-the-lake-tanganyika-authority-management-committee/view

to judge the automatic assignment of similar documents as related items.

thanks again Radim :)
