
Text Search Engine that works with Python


Doug Farrell

Mar 3, 2002, 6:26:24 PM
Hi all,

I'm wondering if anyone knows of a text search engine that works with
Python. Specifically, I'm looking for something that will compress the
text and still allow searches and retrievals that can be exact matches
or proximity based. The text I want to compress and search is huge (70
megs) and should compress down to half, not including any index files
that might be required by the search engine. Anyone know of anything
like this, or have any ideas?

Thanks,
Doug Farrell


William Park

Mar 3, 2002, 7:27:23 PM

Perhaps you can illustrate your problem with some concrete examples.
Otherwise, you'll get "use Linux" or "use gzip/bzip2" answers, which
wouldn't be very useful to you (judging by the fact that you had to
ask in the first place).

--
William Park, Open Geometry Consulting, <openge...@yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin

Ron Johnson

Mar 3, 2002, 10:52:58 PM
On 04 Mar 2002, 00:27:23, William Park wrote:
> Doug Farrell <writ...@earthlink.net> wrote:
> > Hi all,
> >
> > I'm wondering if anyone knows of a text search engine that works with
> > Python? What I'm looking for specifically is something that will
> > compress the text and still allow searches and retrievals that can be
> > exact matches or proximity based. The text I want to compress and
> > search is huge (70 megs) and should compress down to half, not
> > including any index files that might be required by the search engine.
> > Anyone know of anything like this or any ideas?
> >
> > Thanks, Doug Farrell
>
> Perhaps, you can illustrate your problem with some concrete examples.
> Otherwise, you'll be getting "use Linux" or "use gzip/bzip2" answers
> which wouldn't be too useful for you (judging by the fact that you had
> to ask in the first place).
>

Maybe he's talking about "zgrep".

http://www.delorie.com/gnu/docs/gzip/zgrep.1.html

However, the effort involved in decompressing the file might outweigh
the benefit of saving 35 MB of disk space.

Unless you have an _old_ disk, it may be best to leave the text
uncompressed.
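
For what it's worth, the same idea is only a few lines of pure
standard-library Python. A rough sketch (the file name is made up),
which also shows the cost: the whole file gets decompressed on every
search.

import gzip
import re

def zgrep(pattern, path):
    # Decompress on the fly and print matching lines, like zgrep does.
    regex = re.compile(pattern)
    with gzip.open(path, 'rt', encoding='latin-1') as f:
        for lineno, line in enumerate(f, 1):
            if regex.search(line):
                print('%d:%s' % (lineno, line.rstrip()))

zgrep(r'proximity', 'book.txt.gz')  # 'book.txt.gz' is a made-up example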

--
+------------------------------------------------------------+
| Ron Johnson, Jr. Home: ron.l....@cox.net |
| Jefferson, LA USA http://ronandheather.dhs.org:81 |
| |
| 484,246 sq mi are needed for 6 billion people to live, 4 |
| persons per lot, in lots that are 60'x150'. |
| That is ~ California, Texas and Missouri. |
| Alternatively, France, Spain and The United Kingdom. |
+------------------------------------------------------------+

Ype Kingma

Mar 4, 2002, 1:55:20 PM

In case you can use Jython as your Python implementation, have a look
at Lucene: http://jakarta.apache.org/lucene/docs/index.html

You'll have to do the compression yourself, but you can store any field
with a document, including one that is filtered through a zip output
stream from the standard Java libraries. You might consider storing only
a reference to a file containing the compressed text of your documents.
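
For example, here is an untested Jython sketch against the Lucene 1.x
API of the time; the class and method names (IndexWriter, Field.Text,
Field.UnIndexed) are assumptions from that API, and all paths are made
up:

from java.io import FileOutputStream
from java.lang import String
from java.util.zip import GZIPOutputStream

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter

def index_docs(index_dir, docs):
    # docs is a list of (doc_id, text) pairs.
    writer = IndexWriter(index_dir, StandardAnalyzer(), 1)  # 1 = create
    for doc_id, text in docs:
        # Compress the full text to a side file ourselves...
        zpath = doc_id + '.gz'
        out = GZIPOutputStream(FileOutputStream(zpath))
        out.write(String(text).getBytes('UTF-8'))
        out.close()
        # ...and store only a reference to it in the Lucene index.
        doc = Document()
        doc.add(Field.Text('contents', text))       # indexed for search
        doc.add(Field.UnIndexed('zipfile', zpath))  # stored, not indexed
        writer.addDocument(doc)
    writer.close()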

Lucene searches very fast. For 500 MB of indexes in 15 Lucene databases,
typical query time is less than a second for all databases together on a
400 MHz machine. Typical index size is around one third of the original
text. The 15 databases are my own choice; Lucene could easily handle
everything in a single database.

Apart from exact matches and proximity, you can also use prefix terms
and required terms. Lucene is optimized to retrieve only the best matches
to a query, but you can also use its API in boolean mode.
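
The query syntax covers those features; for instance (again Jython,
with QueryParser.parse as the static method from the Lucene 1.x API,
so treat the names as assumptions):

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.queryParser import QueryParser

analyzer = StandardAnalyzer()
for q in ['"text search engine"',  # exact phrase
          '"compress search"~10',  # proximity: words within 10 positions
          'compres*',              # prefix term
          '+python +search']:      # required terms
    query = QueryParser.parse(q, 'contents', analyzer)
    print(query.toString('contents'))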

Recommended, especially together with the lucene-users list.

Ype

David Mertz, Ph.D.

Mar 4, 2002, 1:50:49 PM
|...text search engine that works with Python? What I'm looking for

|specifically is something that will compress the text and still allow
|searches and retrievals that can be exact matches or proximity based.
|The text I want to compress and search is huge (70 megs) and should
|compress down to half, not including any index files that might be
|required by the search engine.

My indexer.py module does this (mostly). I wrote an article
discussing the module at:

http://gnosis.cx/publish/programming/charming_python_15.txt

I have now incorporated it into a package at:

http://gnosis.cx/download/Gnosis_XML_Utils-0.9.tar.gz

The indexer is sort of an ugly duckling in there, since it doesn't have
anything to do with XML, per se. But xml_indexer.py uses indexer.py for
support, so I bundled things this way.

Anyway, indexer does not allow proximity searches, but it does allow
searches for multiple words that occur in the same documents. The
indexes are quite reasonably sized, and the indexer will operate on
gzip'd files happily (it wouldn't be difficult to add support for zip,
bzip2, etc.). The module itself doesn't perform compression, but that's
what 'gzip' is for.
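
A hypothetical usage sketch follows; SlicedZPickleIndexer, load_index,
save_index, add_files and find are names from the Charming Python
write-up and may differ in the packaged version, and the directory is
made up:

import indexer

index = indexer.SlicedZPickleIndexer()   # one of several storage back ends
index.load_index()                       # open or create the index file
index.add_files(dir='/path/to/texts')    # index every file under the dir
hits = index.find(['python', 'search'])  # documents containing all words
print(hits)
index.save_index()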

--
mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis _/_/ Postmodern Enterprises _/_/ s r
.cx _/_/ MAKERS OF CHAOS.... _/_/ i u
_/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s


damien morton

Mar 4, 2002, 4:56:18 PM
"Doug Farrell" <writ...@earthlink.net> wrote in message news:<Aiyg8.32103$ZC3.2...@newsread2.prod.itd.earthlink.net>...

Have you looked at mg?

It's written in C, but usable from the Unix command line.

http://www.mds.rmit.edu.au/mg/

William Park

Mar 4, 2002, 7:09:59 PM
damien morton <mor...@dennisinter.com> wrote:
> Have you looked at mg.
> Its written in C, but useable from the unix command line.
> http://www.mds.rmit.edu.au/mg/

Thanks Damien!

Dave Kuhlman

Mar 5, 2002, 2:30:24 PM

Here is a related question -- is there a search program for
structured text files, in particular something that searches XML
files?

I know about sgrep, and I've made Python wrappers for it. Are there
any others?

You can find sgrep at http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html.
My Python wrappers for sgrep are at http://www.rexx.com/~dkuhlman.
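
As a taste of the wrapper idea, here is a hypothetical minimal version
(not my actual interface; the region-expression example follows the
style of the sgrep documentation, and 'book.xml' is made up):

import subprocess

def sgrep(expression, filename):
    # Run the sgrep binary and return whatever it prints.
    result = subprocess.run(['sgrep', expression, filename],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Extract <title> regions from an XML file.
print(sgrep('"<title>" .. "</title>"', 'book.xml'))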

- Dave

--
Dave Kuhlman
dkuh...@rexx.com
http://www.rexx.com/~dkuhlman



David Mertz, Ph.D.

Mar 5, 2002, 3:33:51 PM
|Here is a related question -- Is there a search program for
|structured text files, in particular something that searches XML
|files.

You might like my xml_indexer program. There is a writeup on the design
at:

http://gnosis.cx/publish/programming/xml_matters_10.txt

As with indexer, from which xml_indexer is derived, the module has been
aggregated into a package found at:

http://gnosis.cx/download/Gnosis_XML_Utils-0.9.tar.gz

Yours, David...

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
.cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------


Doug Farrell

Mar 7, 2002, 9:41:45 PM
To everyone,

Thank you for all the feedback; I really appreciate it. Here is some
more detail about what I'm looking for and how it should work, which
may or may not be helpful <g>. One of the things my company works on
is a large reference title that is sold on CD-ROM. The current
uncompressed text is 70 megs. The reason we currently compress it is
all the other media that goes on the CD. Our current search engine is
a piece of junk, but it works in the C++ environment of our current
application, which is Windows-only. I am considering an alternative in
Python for a couple of reasons. It is easier to write than C++. The
app is not speed critical, and I think Python would be more than fast
enough anyway; I'm considering wxPython as the GUI, so most of the
window calls are implemented in C anyway. Writing the app in Python
would also possibly allow us to market the CD for Windows, Mac, Linux
and Unix systems.

So the requirements of the search engine are that it compress the text
(or have index files so small that compression is unnecessary) and that
the retrieval engine have an API accessible from Python, not be
implemented as a command line tool. I want to incorporate the search
engine into a larger application that links the text together with
media. I know, I know, big demands, but that's why it's in the
conceptual stage with me right now.
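
To make the shape of that API concrete, here is a toy sketch of what I
have in mind (zlib is in the standard library; everything else here is
made up for illustration):

import re
import zlib

class CompressedTextSearch:
    def __init__(self):
        self._docs = {}   # doc_id -> zlib-compressed text
        self._index = {}  # word -> set of doc_ids containing it

    def add(self, doc_id, text):
        self._docs[doc_id] = zlib.compress(text.encode('utf-8'))
        for word in set(re.findall(r'\w+', text.lower())):
            self._index.setdefault(word, set()).add(doc_id)

    def search(self, *words):
        # Documents containing every one of the given words.
        sets = [self._index.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()

    def text(self, doc_id):
        return zlib.decompress(self._docs[doc_id]).decode('utf-8')

A real engine would need proximity data and disk-based indexes, but
that is the kind of interface I'm after.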

Anyway, hope that adds more information to the thread.

Thanks again,
Doug


Fernando Rodríguez

Mar 8, 2002, 5:09:13 AM
On 7 Mar 2002 18:41:45 -0800, writ...@earthlink.net (Doug Farrell) wrote:


>is because of all the other media that goes on the CD. Our current
>search engine is a piece of junk, but works in the C++ environment of
>our current application, which is only for Windows. I am considering

What are you using?

Decent and inexpensive IR systems aren't very common. Maybe MySQL is
enough (it now has full-text indexing; see the sketch after this list).
You may consider:

a) http://www.lextek.com/onix/ (never used it)
b) http://200.6.42.16:2001/p_isis_dll.html (in Portuguese, though the
manuals are in English). It used to be very unreliable, but apparently
it has improved.
c) AltaVista's SDK.
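
On the MySQL option, a hypothetical sketch (the MySQLdb module is
real, but the database, table and column names are all assumed):

import MySQLdb

conn = MySQLdb.connect(db='reference')
cur = conn.cursor()
# Table assumed created as:
#   CREATE TABLE docs (id INT, body TEXT, FULLTEXT (body));
cur.execute("SELECT id FROM docs WHERE MATCH (body) AGAINST (%s)",
            ('text search engine',))
print(cur.fetchall())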

Whatever you use, you will probably have to write the Python interface.
If I were in your situation, I would write a Python interface to the mg
system that someone already recommended to you...

| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Fernando Rodríguez frr at EasyJob.NET
| http://www.EasyJob.NET/
| Expert resume and cover letter creation system.
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Stuart Bishop

Mar 12, 2002, 6:48:23 PM
On Friday, March 8, 2002, at 01:41 PM, Doug Farrell wrote:

> So the requirements of the search engine are that it compress the text
> (or have index files so small that compression is un-necessary) and
> the retrieval engine have an API accessible to Python, not implemented
> as a command line tool. I want to incorporate the search engine into a
> larger application that links the text together with media. I know, I
> know, big demands, but that's why it's in the conceptual stage with me
> right now.

You may want to download a copy of Zope and see if the ZCatalog
product meets all your requirements. I don't think the indexes are
compressed, but they may be small enough for your needs anyway.
If this is an issue, I believe there is a ZODB storage implementation
that stores objects in a compressed format (but they will still chew
up RAM). You can talk to the Zope database directly from Python without
the need to run the Zope application server, so it should interface
nicely with your GUI. (The standalone ZODB, and just ZCatalog from the
Zope distribution, might be more suitable for your product in the long
run, since you won't get all the cruft you don't care about, but using
the Zope distribution initially gives you everything out of the box
with no hassles.)
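
Something like this is the shape of the standalone route (untested and
from memory: the ZODB idioms follow the ZODB programming guide, while
ZCatalog, addIndex, catalog_object, searchResults and the 'TextIndex'
type are Zope 2-era names that should be treated as unverified
assumptions, as is the file name):

from ZODB import FileStorage, DB
from Products.ZCatalog.ZCatalog import ZCatalog

class Doc:
    def __init__(self, text):
        self.text = text  # the attribute the 'text' index looks at

storage = FileStorage.FileStorage('reference.fs')
root = DB(storage).open().root()

catalog = root.get('catalog')
if catalog is None:
    catalog = ZCatalog('catalog')
    catalog.addIndex('text', 'TextIndex')
    root['catalog'] = catalog

catalog.catalog_object(Doc('some searchable text'), '/doc1')
results = catalog.searchResults(text='searchable')
get_transaction().commit()  # ZODB-era builtin for committing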

--
Stuart Bishop <z...@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/


Tim Bell

Mar 18, 2002, 5:22:46 PM
Fernando Rodríguez <f...@ThouShallNotSpam.EasyJob.NET> wrote in message news:<mt2h8us6sl06q3v47...@4ax.com>...

> On 7 Mar 2002 18:41:45 -0800, writ...@earthlink.net (Doug Farrell) wrote:
>
> >is because of all the other media that goes on the CD. Our current
> >search engine is a piece of junk, but works in the C++ environment of
> >our current application, which is only for Windows. I am considering

> Whatever you use, you will probably have to write the Python interface.

> If I was in your situation I would write a Python interface to the mg system
> that someone already recommended to you...

MG has already been adapted for use on multimedia CD-ROMs as the Greenstone
Digital Library software <http://www.greenstone.org/>. Most of the work was
done at the University of Waikato in Hamilton, NZ, and I'm sure it would be
worth your while investigating their approach. (They don't use Python
anywhere, to my knowledge.)

While the Greenstone system is cross-platform (including even Windows 3.1),
there is considerable effort required in getting it to work on each platform.
Clearly a language such as Python (coupled with a suitable GUI toolkit) would
be very useful here.

Also, if you start getting into implementing a new text indexing and
compression system, or even just building a Python API for MG, I strongly
recommend you read Managing Gigabytes, 2nd Edition, by Witten, Moffat and
Bell; Morgan Kaufmann, 1999. (My name is similar to the third author's,
but we're different people.) There's more info about the book here:
<http://www.cs.mu.oz.au/mg/>. Much of the work they describe has been
implemented in the MG system.

Tim.
--
Tim Bell - bh...@cs.mu.oz.au - Dept of Comp Sci & SE - Uni of Melbourne, Aust.
