Bounty Offered for Xapian (full-text index lib) module


Liam

Oct 1, 2011, 7:21:45 PM
to nodejs
I need a lightweight search engine for my Node app; Xapian seems like
the right stuff.

I have only a little experience creating C++ bindings, so I'm hoping
folks will join me in offering a bounty to produce a decent open
source Xapian module in a reasonable time frame.

I'll ante up $200 to start off the kitty. Who's in?

Jeroen Janssen

Oct 2, 2011, 5:04:48 AM
to nodejs
Note that Xapian is licensed under the GPL.
I don't know if that is the right license for your Node app if you are
embedding it (through C++ bindings).

Liam

Oct 2, 2011, 7:36:21 AM
to nodejs
My app is GPL licensed.

Hmm, although the Xapian binding would need to be GPL, would an app
that uses the binding need to as well?

Dean Landolt

Oct 2, 2011, 9:12:18 AM
to nod...@googlegroups.com
On Sun, Oct 2, 2011 at 7:36 AM, Liam <networ...@gmail.com> wrote:
> My app is GPL licensed.
>
> Hmm, although the Xapian binding would need to be GPL, would an app
> that uses the binding need to as well?

If you distribute it, yes. Anything that links to the binding would be subject to the GPL. You could avoid linking by creating a small shim app exposing the binding functionality over IPC -- the GPL would not apply to anything communicating with your binding shim.
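
To make the shim idea concrete, here is a rough sketch of the message-handling half of such a shim. It assumes a newline-delimited JSON protocol, which is purely my assumption; nothing in this thread specifies one. `SearchBackend`, `ShimRequest` and `handleRequest` are hypothetical names, the real binding would sit behind `SearchBackend`, and the transport itself (pipe, Unix socket, etc.) is left out.

```typescript
// Hypothetical stand-in for whatever the binding would expose.
interface SearchBackend {
  search(query: string, limit: number): string[];
}

// Request shape for the assumed newline-delimited JSON protocol.
interface ShimRequest {
  id: number;
  query: string;
  limit?: number;
}

// Turn one request line into one response line. The caller owns the
// transport, so only serialized messages cross the process boundary.
function handleRequest(line: string, backend: SearchBackend): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return JSON.stringify({ id: null, ok: false, error: "invalid JSON" });
  }

  // Treat anything that is not an object with the right fields as malformed.
  const { id, query, limit } = (parsed ?? {}) as Partial<ShimRequest>;
  if (typeof id !== "number" || typeof query !== "string") {
    return JSON.stringify({ id: null, ok: false, error: "malformed request" });
  }

  // Default result limit of 10 is an arbitrary choice for this sketch.
  const hits = backend.search(query, typeof limit === "number" ? limit : 10);
  return JSON.stringify({ id, ok: true, hits });
}
```

The client side would write one JSON object per line to the shim process and read one line back per request, matching responses to requests by `id`.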

Liam

Oct 3, 2011, 1:12:01 PM
to nodejs

Richard Marr

Oct 4, 2011, 4:50:13 AM
to nod...@googlegroups.com


How do you define "lightweight": easy to implement/manage, or low memory footprint?

Xapian is more of a search library than a search engine. Like Lucene et al., it expects you to be competent with search technology and to manage your own strategies for replication, updates, backups, etc.

If you want a search engine that's simpler to use (and obviously this will depend on your own requirements), I'd recommend a stand-alone search application that provides those features already. There are a few to choose from: Omega, Solr, Elasticsearch, Compass, Sphinx... I've been using Elasticsearch with Node recently because it speaks JSON over HTTP and does its own cluster management with zero configuration.

Apologies if you already know all of this; seeing Xapian called a search engine just raised a warning flag for me. I may just be being pedantic.








--
Richard Marr

Liam

Oct 4, 2011, 5:18:31 AM
to nodejs
This is running on a single-core ARM device with mostly single-user
access, so yes, a C library is probably a good choice.


Richard Marr

Oct 4, 2011, 7:02:27 AM
to nod...@googlegroups.com
There's me wearing my web-app-centric hat again  :o)

Elliot

Oct 4, 2011, 9:20:52 PM
to nod...@googlegroups.com
Have you looked at node-clucene
(https://github.com/erictj/node-clucene) yet? I don't see a license
on node-clucene, but CLucene is Apache/LGPL licensed.

If that fails, consider using node-sqlite with SQLite's
full-text-search indexes: http://www.sqlite.org/fts3.html

Those are the two things I'm planning on looking at for a similar situation.

Liam

Oct 5, 2011, 4:09:06 PM
to nodejs
Thanks for the pointer. When I looked at the alternatives a while
back, I considered CLucene and concluded that Xapian was the better
choice, though I no longer recall exactly why -- more mature? more
sophisticated? more actively developed? more battle-tested?

Would love to hear users' comments on CLucene tho...

As for SQLite FTS, I need to index files (which are being served via
Samba).


Elliot

Oct 5, 2011, 4:13:46 PM
to nod...@googlegroups.com
SQLite should be able to index file contents just as well as anything
else, and it's something you can try now without waiting for another
library. If you build it right, you should be able to drop in a
replacement later on by wrapping the search index in a module. You'd
need to do that anyway if you're thinking of using something like
Xapian or CLucene.
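
As a sketch of what "wrapping the search index in a module" could look like: the app codes against a small interface, and the backend behind it can be swapped later. All names here (`SearchIndex`, `InMemoryIndex`, `createIndex`) are hypothetical, and the in-memory inverted index is only a placeholder backend for illustration, not SQLite FTS or Xapian.

```typescript
// Hypothetical interface the rest of the app would code against.
interface SearchIndex {
  add(path: string, text: string): void;
  search(query: string): string[];
}

// Split text into lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Placeholder backend: a naive in-memory inverted index (term -> paths).
class InMemoryIndex implements SearchIndex {
  private postings = new Map<string, Set<string>>();

  add(path: string, text: string): void {
    for (const term of tokenize(text)) {
      let paths = this.postings.get(term);
      if (paths === undefined) {
        paths = new Set<string>();
        this.postings.set(term, paths);
      }
      paths.add(path);
    }
  }

  // Return the paths containing every query term (AND semantics), sorted.
  search(query: string): string[] {
    const terms = tokenize(query);
    let result: string[] | null = null;
    for (const term of terms) {
      const paths = this.postings.get(term) ?? new Set<string>();
      result = result === null
        ? Array.from(paths)
        : result.filter((p) => paths.has(p));
    }
    return (result ?? []).sort();
  }
}

// The one place that knows which backend is in use.
function createIndex(): SearchIndex {
  return new InMemoryIndex();
}
```

Swapping in a different backend later would mean writing another class that implements `SearchIndex` and changing only `createIndex`.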

Liam

Oct 5, 2011, 5:08:35 PM
to nodejs
From my reading of the SQLite FTS docs, the files to index must be
inserted into a database -- I don't see a way to omit the raw data.

I store the files in the filesystem, where they can be modified in
place, so I need an indexer that can read filesystem objects.



Elliot

Oct 5, 2011, 5:13:53 PM
to nod...@googlegroups.com
Please let me know if you're not interested in continuing this
conversation, but you could just dump the file contents into the DB if
they're plain text, or otherwise filter the files through 'strings' or
some such.

If you're talking about extracting content from Word documents, then
I'm not familiar with that aspect of Lucene/Xapian.
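
For illustration, the 'strings' idea boils down to keeping only runs of printable characters. The sketch below is not the `strings` tool itself, just the same basic approach; `extractStrings` is a hypothetical helper, and the minimum run length of 4 is an assumed default. It only recognizes ASCII, so non-ASCII text in a file would be dropped.

```typescript
// Collect runs of printable ASCII (plus tab) that are at least
// minLength bytes long; everything else is treated as a separator.
function extractStrings(bytes: Uint8Array, minLength = 4): string[] {
  const runs: string[] = [];
  let current = "";

  bytes.forEach((b) => {
    const printable = b === 0x09 || (b >= 0x20 && b <= 0x7e);
    if (printable) {
      current += String.fromCharCode(b);
    } else {
      // A non-printable byte ends the current run.
      if (current.length >= minLength) runs.push(current);
      current = "";
    }
  });

  // Flush a run that extends to the end of the input.
  if (current.length >= minLength) runs.push(current);
  return runs;
}
```

The resulting runs could be joined with newlines and inserted into the index as the file's text.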

You mentioned that your app is GPL; is the source available somewhere?

Elliot

Oct 5, 2011, 5:16:05 PM
to nod...@googlegroups.com
Ah, now I see why you specifically want Xapian (from
http://xapian.org/features):

"The indexer supplied can index HTML, PHP, PDF, PostScript,
OpenOffice/StarOffice, OpenDocument, Microsoft
Word/Excel/Powerpoint/Works, Word Perfect, AbiWord, RTF, DVI, Perl POD
documentation, CSV, SVG, RPM packages, Debian packages, and plain
text. Adding support for indexing other formats is easy where
conversion filters are available. This indexer works using the filing
system, but we also provide a script to allow the htdig web crawler to
be hooked in, allowing remote sites to be searched using Omega."

Fair enough.

Liam

Oct 5, 2011, 5:45:52 PM
to nodejs
Exactly :-)


Liam

Oct 21, 2011, 7:14:39 PM
to nodejs
I've started work on a Xapian module for Node.

If interested, join the discussion on xapian-discuss:
http://lists.xapian.org/pipermail/xapian-discuss/2011-October/thread.html