Search-Tools first BETA release


voidptrptr

Dec 1, 2008, 8:15:55 AM
to search-tools
Hi All,

Summary
========
The main goal of Search Tools is to promote adoption of the WARC file
format for storing web archives among mainstream web developers. It
provides an open source software library, a set of command line tools,
web server plug-ins and technical documentation for full-text and
metadata search of web archive (WARC) files.


Features
=======

* Command line tools to index WARC material
* Default plugins to index "HTML", "DOC", "PDF" and plain "TEXT"
documents
* Default plugins to index any metadata from HTML, PDF, PS, OLE2
(DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC,
MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV,
EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream
Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL,
RIFF (AVI), MPEG, QT and ASF
* Command line tools to search in a WARC index
* Easy to test Ruby on Rails search interface
* Easy to deploy Ruby on Rails search interface for production
(Mongrel and Lighttpd).



Usage
======

First of all, get a fresh version from Subversion:

$ svn checkout http://search-tools.googlecode.com/svn/trunk/ search-tools-read-only
$ cd search-tools-read-only
$ ./build.sh

Read the "doc/install" documentation on full-text search (section
"Search-tools"), and make sure to install all the dependencies it
describes (e.g. Rails, Hpricot, libextractor ...).

Then, type:

$ cd warc-tools-read-only
$ make && make ruby

Index the WARCs you want:

$ cd app/ruby && ./warc2index.rb
$ ./warc2index.sh -s warc_directory -d index_directory -a base_config.wsc

Note: adapt the default configuration in "base_config.wsc" to fit your
needs (e.g. a language other than English, stemmers, stop lists...)
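
The indexing step above can be wrapped in a small script that checks its inputs first. This is only a sketch: the function name `index_warcs` and the `DRY_RUN` variable are made up for illustration, and only the `-s`, `-d` and `-a` options come from the command shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the indexing command shown above.
# Assumes it is run from the directory that contains warc2index.sh.

index_warcs() {
    src=$1      # directory containing the WARC files
    dest=$2     # directory where the index will be written
    conf=$3     # configuration file, e.g. base_config.wsc

    if [ ! -d "$src" ]; then
        echo "error: WARC directory '$src' not found" >&2
        return 1
    fi
    if [ ! -f "$conf" ]; then
        echo "error: configuration file '$conf' not found" >&2
        return 1
    fi

    # DRY_RUN=1 (an illustrative variable) prints the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "./warc2index.sh -s $src -d $dest -a $conf"
    else
        mkdir -p "$dest" || return 1
        ./warc2index.sh -s "$src" -d "$dest" -a "$conf"
    fi
}
```

Calling `DRY_RUN=1 index_warcs warc_directory index_directory base_config.wsc` prints the command that would be run, which is a cheap way to check paths before starting a long indexing job.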

For convenience, use the Web user interface for search:

$ cd search-tools-read-only/rails

Change the index path in the file "config/index-path.pat" to the index
directory you passed to the "-d index_directory" option.

Then, automatically build the Rails application (default name for it
is "wwwoh"):

$ ./build.sh

Read the output of this command, as it shows how to deploy the Rails
application in a development or production environment. For quick
testing, type:

$ cd wwwoh && ruby script/server


Note: when indexing huge volumes of WARC data, prefer Mongrel or
Lighttpd to Rails's built-in web server.


We'd appreciate all your comments, bug reports, and feedback.


Regards
Younès

searchtools

Dec 19, 2008, 3:24:33 PM
to search-tools
As a librarian, I think this is definitely good news. A standard
format makes it really possible to archive without being dependent on
vendor file formats. I will go ahead and write up a blurb about it on
my site, and may let my Rails guys play with it.

I'd like to try and convince you to index everything rather than
exclude stopwords or limit indexing to linguistic or morphemic stems.
I've done a fair amount of research on this and think it has a biggish
influence on search success.
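
To illustrate the concern, here is a minimal sketch of what stopword removal does to a query. The stop list and the function name `strip_stopwords` are invented for the example and are not taken from "base_config.wsc" or from Search Tools itself.

```shell
#!/bin/sh
# Illustrative stop list; an assumption, not the Search Tools default.
STOPWORDS="a an and be not of or the to"

# Print the words of "$1" that are not in STOPWORDS, lowercased.
strip_stopwords() {
    kept=""
    for word in $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]'); do
        case " $STOPWORDS " in
            *" $word "*) ;;                      # stopword: dropped
            *) kept="${kept:+$kept }$word" ;;    # anything else is kept
        esac
    done
    printf '%s\n' "$kept"
}

strip_stopwords "To be or not to be"        # prints an empty line
strip_stopwords "The Who live in concert"   # prints "who live in concert"
```

With this list, the first query loses every term and the second loses the word that makes it a band name, so neither can be matched as the searcher intended.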

Could you also link to my site, www.searchtools.com, in case people end
up here but they're looking for me?

Thanks,

Avi
Search Tools Consulting (since 1998)