Search-Tools project: first BETA release


voidptrptr

Dec 1, 2008, 8:22:53 AM
to warc-tools
Hi All,

As discussed at the Aarhus IIPC meeting in September, the first BETA
release of the WARC full-text search project is ready.


Project summary
=============

The main goal of Search Tools is to facilitate and promote the
adoption of the WARC file format for storing web archives by the
mainstream web development community by providing an open source
software library, a set of command line tools, web server plug-ins and
technical documentation for full-text and metadata search of web
archive files, or WARC files.

Project page: http://code.google.com/p/search-tools/



Features
=======

* Command line tools to index WARC material
* Default plugins to index "HTML", "DOC", "PDF" and pure "TEXT"
documents
* Default plugins to index any metadata from HTML, PDF, PS, OLE2
(DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC,
MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV,
EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream
Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL,
RIFF (AVI), MPEG, QT and ASF
* Command line tools to search in a WARC index
* Easy to test Ruby on Rails search interface
* Easy to deploy Ruby on Rails search interface for production
(Mongrel and Lighttpd).



Usage
======

First of all, get a fresh version from subversion:

$ svn checkout http://search-tools.googlecode.com/svn/trunk/ search-tools-read-only
$ cd search-tools-read-only
$ ./build.sh

Read the "doc/install" documentation on full-text search (section
"Search-tools").
Make sure to install all the dependencies described there (e.g. Rails,
hpricot, libextractor ...).

Then, type:

$ cd warc-tools-read-only
$ make && make ruby

Index the WARCs you want:

$ cd app/ruby && ./warc2index.rb
$ ./warc2index.sh -s warc_directory -d index_directory -a base_config.wsc

Note: adapt the default configuration in "base_config.wsc" to fit your
needs (e.g. a language other than English, stemmers, stop lists...)

For convenience, use the Web user interface for search:

$ cd search-tools-read-only/rails

Change the index path in file "config/index-path.pat" to the same
index directory
previously used with option "-d index_directory".

Then, automatically build the Rails application (default name for it
is "wwwoh"):

$ ./build.sh

Read the output of this command, as it shows how to deploy Rails for a
development or production environment. For quick testing, type:

$ cd wwwoh && ruby script/server


Note: when indexing a huge volume of WARC data, prefer "Mongrel" or
"Lighttpd" to Rails's built-in web server.


We'd appreciate all your comments, bug reports, and feedback.


Regards
Younès

Erik Hetzner

Jan 7, 2009, 5:51:51 PM
to warc-...@googlegroups.com
At Mon, 1 Dec 2008 05:22:53 -0800 (PST),

voidptrptr <voidp...@gmail.com> wrote:
> Hi All,
>
> As discussed at Aarhus IIPC meeting in september, the first BETA
> release of WARC fulltext seach project is ready.
>
> […]

Hi Younès -

I am attempting to build this project, but am unable to build
warc-tools-read-only due to a missing file extractor.h:

egh@gales:~/software/search-tools-read-only$ uname -a
Linux gales.cdlib.org 2.6.27-9-generic #1 SMP Thu Nov 20 21:57:00 UTC 2008 i686 GNU/Linux

aka Ubuntu Intrepid

egh@gales:~/software/search-tools-read-only/warc-tools-read-only$ make ruby
make[1]: Entering directory `/home/egh/software/search-tools-read-only/warc-tools-read-only'
swig -ruby -outdir lib/private/plugin/ruby lib/private/plugin/ruby/warctools.i
gcc -I. -Ilib/private -Ilib/public -Ilib/private/plugin/gzip -Ilib/private/plugin/cunit -Ilib/private/plugin/tiger -Ilib/private/plugin/event -Ilib/private/plugin/event/compat -Ilib/private/plugin/regex -Ilib/private/plugin/python -Ilib/private/plugin/ruby -Ilib/private/os -Ilib/private/plugin/event/os/linux -Ilib/private/plugin/cunit/os/linux -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -fno-strict-aliasing -g -g -O2 -fPIC -I. -I/usr/lib/ruby/1.8/i486-linux -c lib/private/plugin/ruby/warctools_wrap.c -o lib/private/plugin/ruby/warctools_wrap.o
In file included from lib/private/plugin/ruby/warctools_wrap.c:2090:
lib/private/plugin/ruby/wextract.h:30:23: error: extractor.h: No such file or directory
make[1]: *** [lib/private/plugin/ruby/warctools_wrap.o] Error 1
make[1]: Leaving directory `/home/egh/software/search-tools-read-only/warc-tools-read-only'
make: *** [ruby] Error 2

Thanks for any help you can give. Let me know if you need any more
system information.

best,
Erik Hetzner

WARC

Jan 7, 2009, 7:28:39 PM
to warc-...@googlegroups.com
Hi Erik,

First of all, thanks for your interest in "Search-Tools".


> At Mon, 1 Dec 2008 05:22:53 -0800 (PST),
> voidptrptr <voidp...@gmail.com> wrote:
> > Hi All,
> >
> > As discussed at Aarhus IIPC meeting in september, the first BETA
> > release of WARC fulltext seach project is ready.
> >
> > […]
>
> Hi Younès -

As you may know, before compiling the library, you have to add Search-Tools capabilities
to "warc-tools" by running:
$ ./build.sh

Then change directory:
$ cd warc-tools-read-only

Now, install the required dependencies.

> I am attempting to build this project, but am unable to build
> warc-tools-read-only due to a missing file extractor.h:

"wextract.h" tries to include "extractor.h", which is an external header shipped with "libextractor".

You can add this header to your system by installing "libextractor" as explained in the doc.
Please refer to section "Search-tools" in "doc/install". Under Fedora, "yum" is your friend!

However, I advise you to build "libextractor" from source by following these instructions:

$ tar xf libextractor-0.5.21.tar.gz && cd libextractor-0.5.21
$ CFLAGS="-liconv" ./configure --prefix=/usr --enable-all --enable-xpdf --enable-vorbis
$ make
# make install

Note: If the previous "./configure ..." command fails due to "libiconv", rerun it like this:
$ ./configure --prefix=/usr --enable-all --enable-xpdf --enable-vorbis 

Everything is in the documentation, of course.
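
Before rerunning "make ruby", it can help to confirm that the compiler actually sees the header. The following is only a minimal sketch: the helper name "have_header" is made up for this illustration and is not part of search-tools, and it assumes a C compiler reachable as "cc" (or through $CC) that reads source from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of search-tools): succeed if the C
# preprocessor can find the given header on its default include path.
have_header() {
    printf '#include <%s>\n' "$1" | ${CC:-cc} -E -x c - >/dev/null 2>&1
}

if have_header extractor.h; then
    echo "extractor.h found"
else
    echo "extractor.h not found: install libextractor and its headers first"
fi
```

If the header lives outside the default include path, the check reports it as missing even though it is installed, so treat a negative result as a hint rather than proof.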

> egh@gales:~/software/search-tools-read-only$ uname -a
> Linux gales.cdlib.org 2.6.27-9-generic #1 SMP Thu Nov 20 21:57:00 UTC 2008 i686 GNU/Linux
>
> aka Ubuntu Intrepid
>
> egh@gales:~/software/search-tools-read-only/warc-tools-read-only$ make ruby
> make[1]: Entering directory `/home/egh/software/search-tools-read-only/warc-tools-read-only'
> swig -ruby -outdir lib/private/plugin/ruby lib/private/plugin/ruby/warctools.i
> gcc -I. -Ilib/private -Ilib/public -Ilib/private/plugin/gzip -Ilib/private/plugin/cunit -Ilib/private/plugin/tiger -Ilib/private/plugin/event -Ilib/private/plugin/event/compat -Ilib/private/plugin/regex -Ilib/private/plugin/python -Ilib/private/plugin/ruby -Ilib/private/os -Ilib/private/plugin/event/os/linux -Ilib/private/plugin/cunit/os/linux -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -fno-strict-aliasing -g -g -O2  -fPIC   -I. -I/usr/lib/ruby/1.8/i486-linux -c lib/private/plugin/ruby/warctools_wrap.c -o lib/private/plugin/ruby/warctools_wrap.o
> In file included from lib/private/plugin/ruby/warctools_wrap.c:2090:
> lib/private/plugin/ruby/wextract.h:30:23: error: extractor.h: No such file or directory
> make[1]: *** [lib/private/plugin/ruby/warctools_wrap.o] Error 1
> make[1]: Leaving directory `/home/egh/software/search-tools-read-only/warc-tools-read-only'
> make: *** [ruby] Error 2
>
> Thanks for any help you can give. Let me know if you need any more
> system information.


You're welcome, Erik. We would like more people to take an interest in this library, to help make it more robust.

> best,
> Erik Hetzner
> ;; Erik Hetzner, California Digital Library
> ;; gnupg key id: 1024D/01DB07E3


Regards
Younès

Erik Hetzner

Jan 7, 2009, 8:21:29 PM
to warc-...@googlegroups.com
At Thu, 8 Jan 2009 01:28:39 +0100,

WARC <voidp...@gmail.com> wrote:
>
> Hi Erik,
>
> First of all, thanks for your interest to "Search-Tools" Erik.
>
> […]

Hi Younès -

And first of all, from me, apologies for not reading the install doc!

Could you consider moving the install doc to
search-tools-read-only/INSTALL? This is the location recommended by
Karl Fogel in his book Producing Open Source Software, and it is where
I tend to look for it [1].

Things seem to work once I installed libextractor-dev, with the
exception of a problem like one encountered previously:

gcc -I. -Ilib/private -Ilib/public -Ilib/private/plugin/gzip -Ilib/private/plugin/cunit -Ilib/private/plugin/tiger -Ilib/private/plugin/event -Ilib/private/plugin/event/compat -Ilib/private/plugin/regex -Ilib/private/plugin/python -Ilib/private/plugin/ruby -Ilib/private/os -Ilib/private/plugin/event/os/linux -Ilib/private/plugin/cunit/os/linux -I/usr/include -L/usr/lib -lextractor -Wall -W -Wunused -ansi -Werror -Wno-long-long -Wunused-function -std=gnu89 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -g -pedantic-errors -Wextra -fno-strict-aliasing -g -g -O2 -fPIC -DSTDC_HEADERS=1 -DHAVE_STRING_H=1 -DHAVE_ALLOCA_H=1 -DEBUG=1 \
-I. -I/usr/lib/ruby/1.8/i486-linux -c lib/private/plugin/ruby/wextract.c -o lib/private/plugin/ruby/wextract.o
In file included from lib/private/plugin/ruby/wextract.c:26:
/usr/include/extractor.h:190: error: comma at end of enumerator list
make[1]: *** [lib/private/plugin/ruby/wextract.o] Error 1


make[1]: Leaving directory `/home/egh/software/search-tools-read-only/warc-tools-read-only'
make: *** [ruby] Error 2

I was able to fix this by removing -pedantic and -pedantic-errors from
the CFLAGS. These flags were triggering the error about a comma at
the end of an enumerator list. There is some information about this
problem here [2].

After this, and after installing the hpricot, ferret, and rubyzip Ruby
gems and adding the
search-tools-read-only/warc-tools-read-only/lib/private/plugin/ruby
directory to RUBYLIB, I was able to get the warc2index.rb script to
run (though I have not created an index yet). Thanks.
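
For anyone repeating the RUBYLIB step, here is a rough sketch of it. It assumes the checkout lives under $HOME/software, as in the logs in this thread; the variable names CHECKOUT and PLUGIN_DIR are only illustrative.

```shell
#!/bin/sh
# Assumption: the checkout is under $HOME/software, as in the logs
# in this thread. Adjust CHECKOUT to your own location.
CHECKOUT="$HOME/software/search-tools-read-only"
PLUGIN_DIR="$CHECKOUT/warc-tools-read-only/lib/private/plugin/ruby"

# The gems can be installed with RubyGems (not run here):
#   gem install hpricot ferret rubyzip

# Prepend the plugin directory, keeping any existing RUBYLIB entries.
RUBYLIB="$PLUGIN_DIR${RUBYLIB:+:$RUBYLIB}"
export RUBYLIB

echo "RUBYLIB=$RUBYLIB"
```

The export only lasts for the current shell session, so it has to be repeated (or added to a shell profile) before running warc2index.rb in a new terminal.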

best,
Erik Hetzner

1. <http://producingoss.com/en/packaging.html#packaging-name-and-layout>
2. <http://www.cpptalk.net/extra-comma-in-enum-is-valid-vt20540.html>

WARC

Jan 7, 2009, 8:43:38 PM
to warc-...@googlegroups.com
Hi Erik,

> Hi Younès -
>
> And first of all, from me, apologies for not reading the install doc!
>
> Could you consider moving the install doc to
> search-tools-read-only/INSTALL? This is the location recommended by
> Karl Fogel in his book Producing open source software and where I tend
> to look for it [1].

OK, in the next release!

> Things seem to work once I install the libextractor-dev, with the
> exception of a problem like one encountered previously:
>
> gcc -I. -Ilib/private -Ilib/public -Ilib/private/plugin/gzip -Ilib/
> private/plugin/cunit -Ilib/private/plugin/tiger -Ilib/private/plugin/
> event -Ilib/private/plugin/event/compat -Ilib/private/plugin/regex -
> Ilib/private/plugin/python -Ilib/private/plugin/ruby -Ilib/private/
> os -Ilib/private/plugin/event/os/linux -Ilib/private/plugin/cunit/os/
> linux -I/usr/include -L/usr/lib -lextractor -Wall -W -Wunused -ansi -
> Werror -Wno-long-long -Wunused-function -std=gnu89 -
> D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES -g -
> pedantic-errors -Wextra -fno-strict-aliasing -g -g -O2 -fPIC -
> DSTDC_HEADERS=1 -DHAVE_STRING_H=1 -DHAVE_ALLOCA_H=1 -DEBUG=1 \
> -I. -I/usr/lib/ruby/1.8/i486-linux -c lib/private/plugin/ruby/
> wextract.c -o lib/private/plugin/ruby/wextract.o
> In file included from lib/private/plugin/ruby/wextract.c:26:
> /usr/include/extractor.h:190: error: comma at end of enumerator list
> make[1]: *** [lib/private/plugin/ruby/wextract.o] Error 1
> make[1]: Leaving directory `/home/egh/software/search-tools-read-
> only/warc-tools-read-only'
> make: *** [ruby] Error 2
>
> I was able to fix this by removing -pedantic and -pendatic-errors from
> the CFLAGS.

These two GCC flags ensure that the code is checked strictly. Another
solution is to leave them in place and fix the error by removing the
comma manually.

> This was what was triggering the problems with commas at
> the end of an enumerator list. There is some information about this
> problem here [2].

The newer C standard (C99 and later) tolerates this kind of extra comma
at the end of an enumerator list, but the old one (ANSI C, 1989) does not.
This is why we use these flags: we are trying to stay backward
compatible with old compilers.
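
The difference between the two standards can be reproduced without the warc-tools sources. This is only a sketch: the enum and the helper name "accepts_trailing_comma" are made up for the illustration, and it assumes a compiler reachable as "cc" (or through $CC) that understands -std=, -pedantic-errors and -fsyntax-only, as GCC does.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the compiler accepts an enum that ends
# with a trailing comma under the given -std= value and -pedantic-errors.
accepts_trailing_comma() {
    printf 'enum color { RED, GREEN, BLUE, };\nint main(void) { return (int) RED; }\n' |
        ${CC:-cc} -std="$1" -pedantic-errors -fsyntax-only -x c - 2>/dev/null
}

# Check the same source under the 1989 and 1999 standards.
for std in c89 c99; do
    if accepts_trailing_comma "$std"; then
        echo "$std: trailing comma accepted"
    else
        echo "$std: trailing comma rejected"
    fi
done
```

With GCC, the c89 run should be the one that is rejected, with the same "comma at end of enumerator list" diagnostic as in the build log.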

> After this, and install the hpricot, ferret, & rubyzip ruby gems, and
> adding the
> search-tools-read-only/warc-tools-read-only/lib/private/plugin/ruby
> directory to RUBYLIB, I was able to get the warc2index.rb script to
> run (though I have not created an index yet). Thanks.

Cool. The fun part is close.
