To give you an idea of what I am working on:
I want to integrate the web content management system Plone
(http://plone.org) with simserver, so that related items can be
retrieved automatically. Related items are a powerful feature of Plone, but
content managers mostly fail to maintain them, so simserver comes to the
rescue and does it automatically (well, not yet, it still has to be invoked
manually ;) In the future I also envision automatic tagging based on
existing tagged items in the database.
https://github.com/cleder/collective.simserver.core
Simserver Integration for plone
https://github.com/cleder/collective.simserver.related
Collective Simserver Related items for plone
First tests are very encouraging (actually I was extremely impressed):
the corpus consisted of 2900 documents, and the similar documents identified
by simserver are really highly related to the original document.
Awesome work, Radim 8-)
The above is a first alpha, but I think the pace will be fast, so stay tuned ;)
--
Best Regards,
Christian Ledermann
Nairobi - Kenya
Currently I am the sole developer; it will be used on http://iwlearn.net/
The rationale behind it is to produce a list of similar documents, so
a user gets an idea of what else might interest them (surprise :)
It can also be used for quality assurance, e.g. if another document
has a similarity > 0.99 it might be a duplicate.
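
A minimal sketch of that duplicate check. It assumes the similarity query returns (id, similarity, payload) tuples; the function name and the sample data are made up for illustration and are not part of simserver.

```python
def likely_duplicates(doc_id, results, threshold=0.99):
    """Return ids from `results` whose similarity exceeds `threshold`.

    `results` is assumed to be a list of (id, similarity, payload) tuples.
    The queried document itself is skipped, since it always matches itself.
    """
    return [
        other_id
        for other_id, similarity, _payload in results
        if other_id != doc_id and similarity > threshold
    ]


# Invented sample results for the document "doc-a".
results = [("doc-a", 1.0, None), ("doc-b", 0.995, None), ("doc-c", 0.42, None)]
print(likely_duplicates("doc-a", results))
```

The 0.99 threshold is just the value mentioned above; a real corpus probably needs some experimenting to find a sensible cut-off.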
Another use case is to simplify the tagging of content. Especially when you
introduce new tags, it is a pita to find all the existing documents that
should probably be tagged with them too. The idea is to tag some of
the key documents, then see what is closely related and work your way
through the db (maybe semi-automated).
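
The tagging workflow could look roughly like the sketch below. Everything in it is hypothetical: `suggest_for_tag`, the `min_score` cut-off and the toy index only stand in for a real similarity query, which is again assumed to return (id, similarity, payload) tuples.

```python
def suggest_for_tag(seed_ids, find_similar, already_tagged=(), min_score=0.5):
    """Collect untagged documents that are similar to hand-tagged seeds.

    `find_similar` is any callable taking a document id and returning
    (id, similarity, payload) tuples. Returns (id, best score) pairs,
    most similar first, so an editor can review them in that order.
    """
    skip = set(seed_ids) | set(already_tagged)
    best = {}
    for seed in seed_ids:
        for doc_id, score, _payload in find_similar(seed):
            if doc_id in skip or score < min_score:
                continue
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for a similarity service, for illustration only.
fake_index = {
    "report-1": [("report-1", 1.0, None), ("report-2", 0.8, None), ("memo-7", 0.3, None)],
    "report-3": [("report-3", 1.0, None), ("report-2", 0.6, None), ("brief-4", 0.7, None)],
}
candidates = suggest_for_tag(["report-1", "report-3"], fake_index.__getitem__)
print(candidates)
```

Newly confirmed documents can be fed back in as seeds, which is the "work your way through the db" part.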
My next step is to build a REST server for simserver, as I once experienced
simserver over Pyro hanging my client (I could not replicate the behaviour).
So a loose coupling over HTTP seems desirable, and I am trying to get up to
speed with Pyramid (I always wanted to learn that framework and it seems a
good fit).
Another rationale for the REST server is licensing: Plone products must be
licensed under GPL 2, and there are probably other projects too that would
profit from a loose coupling, thus avoiding the AGPL. The REST server itself
would be AGPL, though.
Possible planned features:
* multiple simserver instances per REST server, so you can run it as SaaS or
  provide multi-language capabilities
* simpler call interface: the client only has to provide the text (as plain
  text) and the id of a document to be indexed etc., so tokenization need not
  be done by the client
* offline corpus generation and indexing by uploading a zip (or tgz, tar.bz2)
  file containing the documents (in plain text)
On Fri, Jan 20, 2012 at 2:58 PM, Radim <radimr...@seznam.cz> wrote:
> Hey, big plans, good to hear :-)
>
> Another Christian -- Winkelmann -- also used an earlier version of
> simserver for a REST service: http://groups.google.com/group/gensim/browse_thread/thread/d847e10e114cf83e
>
> They used Flask though (not Pyramid).
>
@ christian: talk is cheap show me the code :)
>> * simpler call interface, the client only has to provide the text (as
>> plain text) and
>> id of a document to be indexed etc. so tokenization needs not to be done by
>> the client.
>
> That was the state of simserver some versions ago; I changed it to
> accepting tokenized input because that flexibility was needed. More
> generally, one solution is to offer several tokenizers for users to
> choose from, another option to let users specify their tokenizer once
> (=accept code) and then continue accepting plain texts.
OK, tokenization in restsims currently uses gensim's utils.simple_preprocess,
so there is room for improvement ;)
> Re: licensing, GPL is of course fine. The AGPL is meant to stop people
> using the server commercially, without giving anything back or even
> acknowledging its use. As long as "gensim the project" profits from
> the use, commercial as well as free applications are welcome to use
> it :-)
>
> Best,
> Radim
>
>
--
When you start the server you will get a simple form which lets you
interact with the server.
Implemented:
- offline corpus generation and indexing by uploading a zip (or tgz,
tar.bz2) file containing the documents (in plain text)
- training, indexing, querying
- querying returns either html or json
Not implemented:
- multiple simserver instances per REST server (low priority)
- authentication, so you do not want to use it on a public network,
as anybody could overwrite your training or indexing!
- JSON responses for training and indexing (only querying returns JSON so far)
- custom tokenization on the server or client side
License: AGPL
Beware the code is alpha and ugly, but short (on the brighter side :)
> That was the state of simserver some versions ago; I changed it to
> accepting tokenized input because that flexibility was needed. More
> generally, one solution is to offer several tokenizers for users to
> choose from, another option to let users specify their tokenizer once
> (=accept code) and then continue accepting plain texts.
I think the client could either send plain text, in which case tokenization
is done by the server, or send already tokenized JSON, which the server just
passes on to simserver.
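
A rough sketch of that either/or handling, not the actual restsims code. The function names are invented, the regex tokenizer is only a crude stand-in for utils.simple_preprocess, and the {'id', 'tokens'} dict layout is my understanding of what simserver expects, so treat it as an assumption.

```python
import re


def simple_tokenize(text):
    """Crude stand-in tokenizer: lowercase runs of letters, nothing else."""
    return re.findall(r"[^\W\d_]+", text.lower())


def to_document(doc_id, text=None, tokens=None):
    """Build an id/tokens dict from plain text OR from ready-made tokens.

    Exactly one of `text` and `tokens` must be given: plain text is
    tokenized here, while client-side tokens are passed through untouched.
    """
    if (text is None) == (tokens is None):
        raise ValueError("provide exactly one of text or tokens")
    if tokens is None:
        tokens = simple_tokenize(text)
    return {"id": doc_id, "tokens": list(tokens)}


print(to_document("doc-1", text="Hello, World 42!"))
print(to_document("doc-2", tokens=["already", "tokenized"]))
```

Keeping the two inputs as separate arguments avoids guessing whether a string is plain text or encoded JSON.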
> More importantly, good job on the server.
thx :)
> It looks like restsims
> doesn't allow training on bigger data than fits in RAM, right?
Yes, if you pass (preprocessed) data via the 'text' argument, because
json.loads is used.
If you pass a compressed file via 'filedata', generators are used,
so there should be no RAM restriction.
It is an alpha; right now the focus is on getting it usable and keeping
it simple, and there will be optimizations later (contributors are
always welcome ;) Right now it is YAGNI for me.
Also, there should be quite a lot of optimizations possible, e.g. buffering
the documents better on training and indexing.
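
One possible shape for such buffering, purely as a sketch: group a lazy stream of documents into fixed-size batches, so memory use stays bounded while the server is still fed more than one document at a time. The helper name and the batch size are made up and nothing like this is in restsims yet.

```python
from itertools import islice


def chunked(documents, size):
    """Yield lists of at most `size` items from any iterable, lazily."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# A generator of toy documents; only one batch is held in memory at a time.
docs = ({"id": f"doc-{n}", "tokens": ["example"]} for n in range(5))
for batch in chunked(docs, 2):
    print([doc["id"] for doc in batch])
```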
> I am not familiar with Plone, but I think it would be really cool if
> your http extension of simserver could be used by other CMS's, too. As
> a sort of general building block for "find similar content". I guess
> all CMS's must be solving a similar problem of suggesting similar
> content to users (similar as in "the same general idea", not "exact
> duplicate wording").
With restsims there is no problem: there is no Plone-specific
code in there at all. Any CMS that can provide a unique id
and the plain text (stripped of all markup) can connect.
It does not even have to be a CMS: preprocess the content (e.g.
convert everything to be indexed into plain text, remove everything
except the pure text content, etc.), then tar and gzip the output,
e.g.:
index.html
dir1/do1.pdf
dir1/do2.doc
dir1/do3.htm
dir2/do1.pdf
where do*.* and index.html are plain text files; the extension
is artificial and only there to give the document an id.
When indexing a compressed file, the file name as stored in the
archive is taken as the id, so a find_similar("dir1/do1.pdf") could
return [("dir1/do1.pdf", 1.0, None), ("index.html",..), ("dir2/do3.html",..)]
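
To illustrate the "archive name becomes the id" idea, here is a small stdlib-only sketch that walks a tar archive lazily and yields (name, text) pairs. It is not the restsims implementation: the function names are hypothetical, UTF-8 is assumed for the plain text files, and zip uploads would need the zipfile module instead.

```python
import io
import tarfile


def iter_archive_documents(fileobj):
    """Lazily yield (member name, text) pairs from a .tar/.tgz/.tar.bz2 file."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            # The name as stored in the archive serves as the document id.
            yield member.name, handle.read().decode("utf-8", errors="replace")


def build_demo_archive(files):
    """Create an in-memory .tgz from a {name: text} dict (demo helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


demo = build_demo_archive(
    {"index.html": "front page text", "dir1/do1.pdf": "plain text of a report"}
)
for doc_id, text in iter_archive_documents(demo):
    print(doc_id, "->", text)
```

Because this is a generator, only one document is read into memory at a time, however large the archive is.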
You could even use it with pure JavaScript and no server-side language at all.
The Plone integration layer is also very slim, so it can be used as
a reference.
BTW:
in gensim.utils there is some simserver-specific stuff; I think
the simserver-specific functions should be moved to
simserver.utils
https://github.com/cleder/restsims
please let me know if you have trouble installing or running it.
http://plone.org/products/collective.simserver/
let me know your thoughts ;)
No, not yet. I think most people in the community do not know what it is
good for, so I have to advertise it in a way that makes similarity search
desirable to have. (I think the SEO angle will be helpful.)
> What are the main obstacles to wider
> adoption? Would be cool if more people use it :)
>
> Plus can you at least mention gensim in there, I think that'd be
> appropriate.
Good point, done :)
>
> Best,
> Radim
>
>
> On Feb 17, 7:22 am, Christian Ledermann
> <christian.lederm...@gmail.com> wrote:
>> I released some more documentation, this time for the plone WCMS
>> components @
>>
>> http://plone.org/products/collective.simserver/
>>
The simserver integration has now been in production for two months.
Apart from the (now fixed) bug that documents were treated differently
in optimized and non-optimized indexes, it was all smooth sailing.
Have a look at e.g.:
to judge the automatic assignment of similar documents as related items.
thanks again Radim :)