Hi All,
Summary
========
The main goal of Search Tools is to promote adoption of the WARC file
format for storing web archives among mainstream web developers. To
that end, it provides an open source software library, a set of
command line tools, web server plug-ins and technical documentation
for full-text and metadata search of web archive (WARC) files.
Features
=======
* Command line tools to index WARC material
* Default plugins to index "HTML", "DOC", "PDF" and plain "TEXT"
documents
* Default plugins to index any metadata from HTML, PDF, PS, OLE2
(DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC,
MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV,
EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream
Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL,
RIFF (AVI), MPEG, QT and ASF
* Command line tools to search in a WARC index
* Easy to test Ruby on Rails search interface
* Easy to deploy Ruby on Rails search interface for production
(Mongrel and Lighttpd).
Usage
======
First of all, get a fresh version from subversion:
$ svn checkout http://search-tools.googlecode.com/svn/trunk/ search-tools-read-only
$ cd search-tools-read-only
$ ./build.sh
Read the "doc/install" documentation related to "full text
search" (section "Search-tools").
Make sure to install all the dependencies described in the doc
(e.g. Rails, Hpricot, libextractor ...).
Then, type:
$ cd warc-tools-read-only
$ make && make ruby
Index the WARCs you want:
$ cd app/ruby && ./warc2index.rb
$ ./warc2index.sh -s warc_directory -d index_directory -a base_config.wsc
Note: adapt the default configuration in "base_config.wsc" to fit your
needs (e.g. a language other than English, stemmers, stop lists...)
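As a rough sketch, the indexing step above could be wrapped in a small shell function that checks its inputs before calling the indexer. The function name "index_warcs" and the DRY_RUN variable are hypothetical and not part of Search Tools; only the "-s", "-d" and "-a" options come from the command shown above. It also assumes the index directory may need to exist before indexing starts.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Search Tools): validates the paths,
# then runs the indexing command quoted in this announcement.
# Set DRY_RUN=1 to print the command instead of executing it.

index_warcs() {
    src_dir=$1      # directory holding the WARC files
    index_dir=$2    # directory where the index is written
    config=$3       # configuration file, e.g. base_config.wsc

    if [ ! -d "$src_dir" ]; then
        echo "error: WARC directory not found: $src_dir" >&2
        return 1
    fi
    if [ ! -f "$config" ]; then
        echo "error: configuration file not found: $config" >&2
        return 1
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run; nothing is created or indexed.
        echo "./warc2index.sh -s $src_dir -d $index_dir -a $config"
        return 0
    fi

    # Assumption: creating the index directory up front is harmless.
    mkdir -p "$index_dir" || return 1
    ./warc2index.sh -s "$src_dir" -d "$index_dir" -a "$config"
}
```

Run from "app/ruby", a call such as `DRY_RUN=1 index_warcs warc_directory index_directory base_config.wsc` prints the command without executing it, which is handy for checking the paths first.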
For convenience, use the Web user interface for search:
$ cd search-tools-read-only/rails
Set the index path in the file "config/index-path.pat" to the same
index directory previously passed with the option "-d index_directory".
Then, automatically build the Rails application (default name for it
is "wwwoh"):
$ ./build.sh
Read the output of this command, as it shows how to deploy the Rails
application in a development or production environment. For quick
testing, type:
$ cd wwwoh && ruby script/server
Note: when indexing large volumes of WARC data, prefer "Mongrel" or
"Lighttpd" to Rails's built-in web server.
We'd appreciate all your comments, bug reports, and feedback.
Regards
Younès