Re: Language

Patrick Collison

Feb 6, 2008, 5:34:32 AM
to wikipedi...@googlegroups.com, Oskar_ Vikholm
First, you need to download the source distribution. The code will
only work on Linux/OS X/some UNIX (though I guess it _may_ work in
Cygwin) -- what OS are you using?

Once you have the source anyway, things are explained in the README file:

==Get started==

cd c
./configure
make
cd ..

Get a Wikipedia XML dump (e.g. enwiki-20071018-pages-articles.xml.bz2), and
place it in the root (wp) directory.

cd sh
./process ../<dump>
cd ..

The processing stage will take several hours (8 on 2.16GHz MBP). (If someone
wants to speed it up, implement xmlprocess.rb in C.) Once this is done, you
can delete the original dump. If you get sick of waiting, use a dump of
the Simple English Wikipedia, which is several orders of magnitude smaller
than the standard English dumps.

--snip--

To the above, I'll add that you need to run "sh bootstrap.sh" before
running configure in the wp/c directory.
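
The README steps, together with the bootstrap.sh note, can be collected into one ordered sequence. What follows is only a minimal sketch in Python: `plan_steps` and `run_steps` are hypothetical helper names (they are not part of the wp distribution), and the `c/` and `sh/` layout under the wp root is taken from the instructions above.

```python
import subprocess
from pathlib import Path


def plan_steps(wp_root, dump_name):
    """Return the build/process steps as (working directory, command) pairs.

    The order follows the README, with bootstrap.sh run before configure.
    """
    root = Path(wp_root)
    return [
        (root / "c", ["sh", "bootstrap.sh"]),
        (root / "c", ["./configure"]),
        (root / "c", ["make"]),
        # The dump is expected in the wp root, one level above sh/.
        (root / "sh", ["./process", "../" + dump_name]),
    ]


def run_steps(steps):
    """Run each step in its directory, stopping at the first failure."""
    for cwd, command in steps:
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_steps(plan_steps("wp", "enwiki-20071018-pages-articles.xml.bz2"))` would then run the whole sequence, including the multi-hour processing stage.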

Try following those instructions, and let me know where you have issues.

On 06/02/2008, Oskar_ Vikholm <snobb...@hotmail.com> wrote:
>
> Okay... But I have no idea how to do it... so it would be great if you
> could give me some instructions and explain how you did it with the
> English and German dumps!
>
> Thanks man
>
> > Date: Tue, 5 Feb 2008 17:59:00 -0800
> > From: pat...@collison.ie
> > To: snobb...@hotmail.com
> > Subject: Re: Language
>
> >
> > I can't -- but I'm happy to assist with you doing so :).
> >
> > Post to the group --
> > http://groups.google.com/group/wikipedia-iphone/topics --
> > with any questions, and we'll help out.
> >
> > On 05/02/2008, Oskar_ Vikholm <snobb...@hotmail.com> wrote:
> > >
> > > Hello. Could you make a dump for Swedish? (http://sv.wikipedia.org)
> > > Would be great man! Thanks for the English dump!

Sanford

Feb 6, 2008, 10:32:35 PM
to Wikipedia iPhone
I gave up. I have a Linux box running CentOS 5, and couldn't get the
process script to finish running...

First, in the make process, indexer.c fails with an "invalid lvalue"
error, which I believe is because my gcc is version 4 rather than the
version 3 you are using..

Then, after creating zhwiki-20080114-pages-articles.xml.bz2.processed, I
realized I did not have RubyGems, so I went over to RubyForge to download
and install it.

After that there's this error, "in `gem_original_require': no such file
to load -- inline (LoadError)", which I cannot resolve..

And mklocatedb is a Mac OS script... I need to find a replacement for
Linux..


Can someone make the zhwiki dump for me?


Patrick Collison

Feb 6, 2008, 10:39:59 PM
to wikipedi...@googlegroups.com
On 06/02/2008, Sanford <sanfor...@gmail.com> wrote:
>
> I gave up. I have a Linux box running CentOS 5, and couldn't get the
> process script to finish running...
>
> First, in the make process, indexer.c fails with an "invalid lvalue"
> error, which I believe is because my gcc is version 4 rather than the
> version 3 you are using..

This is easily worked around -- edit Makefile.am, and on the first
line, remove "searcher", and then run bootstrap.sh, configure and make
again.

> Then, after creating zhwiki-20080114-pages-articles.xml.bz2.processed, I
> realized I did not have RubyGems, so I went over to RubyForge to download
> and install it.
>
> After that there's this error, "in `gem_original_require': no such file
> to load -- inline (LoadError)", which I cannot resolve..

Run "gem install RubyInline".

> And mklocatedb is a Mac OS script... I need to find a replacement for
> Linux..

Hmm. I'm not exactly sure how to work around this one. If you can get
things working up to this, though, am happy to help figure out some
sort of workaround.

Would be great if you can get this working on Linux -- will then
integrate the changes into the main distribution.

Sanford

Feb 6, 2008, 10:57:58 PM
to Wikipedia iPhone
Some results: bootstrap.sh:
configure.ac: installing `./install-sh'
configure.ac: installing `./missing'
Makefile.am: installing `./compile'
Makefile.am: installing `./depcomp'
Makefile.am:19: variable `searcher_SOURCES' is defined but no program or
Makefile.am:19: library has `searcher' as canonic name (possible typo)

Same result in make:
if gcc -DPACKAGE_NAME=\"wp\" -DPACKAGE_TARNAME=\"wp\"
-DPACKAGE_VERSION=\"0.1\" -DPACKAGE_STRING=\"wp\ 0.1\"
-DPACKAGE_BUGREPORT=\"\" -DPACKAGE=\"wp\" -DVERSION=\"0.1\"
-DHAVE_LIBBZ2=1 -DHAVE_LIBNCURSES=1 -I. -I. -fpack-struct -g -O2
-MT indexer-indexer.o -MD -MP -MF ".deps/indexer-indexer.Tpo" -c -o
indexer-indexer.o `test -f 'indexer.c' || echo './'`indexer.c; \
then mv -f ".deps/indexer-indexer.Tpo" ".deps/indexer-indexer.Po";
else rm -f ".deps/indexer-indexer.Tpo"; exit 1; fi
In file included from indexer.c:1:
ternary.h:46:1: warning: "tolower" redefined
In file included from ternary.h:9,
                 from indexer.c:1:
/usr/include/ctype.h:204:1: warning: this is the location of the previous definition
indexer.c: In function 'insert':
indexer.c:30: error: invalid lvalue in assignment
make: *** [indexer-indexer.o] Error 1

==========

After installing RubyInline, things started to work.. so I am leaving
it to run for a while...
So I guess I will need to find the mklocatedb alternative.



Patrick Collison

Feb 6, 2008, 11:10:55 PM
to wikipedi...@googlegroups.com
On 06/02/2008, Sanford <sanfor...@gmail.com> wrote:
>

You can safely take indexer out of Makefile.am, too. (I looked into
this, and it turns out that using ternary operators as an lvalue was
disallowed in GCC 4, as you suspected. Although I'm using GCC 4, I'm
using Apple's version, which has kept support.)

> After installing RubyInline, things started to work.. so I am leaving
> it to run for a while...
> So I guess I will need to find the mklocatedb alternative.

Yeah. Mklocatedb is just a BSD licensed shell script, so it's possible
that it could be coaxed to function on Linux without too much work. We
could probably add a modified version to the source distribution.

Sanford

Feb 7, 2008, 2:07:57 AM
to Wikipedia iPhone
Using the mklocatedb.sh I found at
http://opengrok.creo.hu/dragonfly/xref/src/usr.bin/locate/locate/mklocatedb.sh
and extracting the code and bigram executables from the findutils rpm
at
http://rpm2html.osmirror.nl/redhat-archive/updates/5.0/en/os/i386/findutils-4.1-24.i386.html
http://rpm.pbone.net/index.php3?stat=3&search=findutils-locate&srodzaj=3&dist[]=46

I can get mklocatedb.sh to create _mklocatedb[randomtext].list, but
_mklocatedb[randomtext].bigram is empty, resulting in an empty
locatedb...
I've tried both the x86_64 and i386 binaries, with the same result..

Sanford

Feb 7, 2008, 2:09:39 AM
to Wikipedia iPhone
Furthermore, even if I take indexer out of Makefile.am completely
and run bootstrap and configure again, the compilation still halts
with an error.. In the end I commented out the line with the forbidden
lvalue assignment, and it finally let me go on.

Patrick Collison

Feb 7, 2008, 2:15:11 AM
to wikipedi...@googlegroups.com
On 06/02/2008, Sanford <sanfor...@gmail.com> wrote:
>
> Furthermore, even if I take indexer out of Makefile.am completely
> and run bootstrap and configure again, the compilation still halts
> with an error.. In the end I commented out the line with the forbidden
> lvalue assignment, and it finally let me go on.

Strange. As it happens, those source files aren't actually used any
more, so I'll just remove them in the next release.

Patrick Collison

Feb 7, 2008, 2:20:17 AM
to wikipedi...@googlegroups.com
On 06/02/2008, Sanford <sanfor...@gmail.com> wrote:
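> using the mklocatedb.sh I found at
> http://opengrok.creo.hu/dragonfly/xref/src/usr.bin/locate/locate/mklocatedb.sh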

Cool, this is the same version as mine.

> and extracting the code and bigram executables from the findutils rpm
> at
> http://rpm2html.osmirror.nl/redhat-archive/updates/5.0/en/os/i386/findutils-4.1-24.i386.html
> http://rpm.pbone.net/index.php3?stat=3&search=findutils-locate&srodzaj=3&dist[]=46
>
> I can get mklocatedb.sh to create _mklocatedb[randomtext].list, but
> _mklocatedb[randomtext].bigram is empty, resulting in an empty
> locatedb...
> I've tried both the x86_64 and i386 binaries, with the same result..

Can you use the built-in locate commands? In mklocatedb.sh I see:

: ${bigram:=locate.bigram}
: ${code:=locate.code}
: ${sort:=sort}

What happens if you point locate.bigram at /usr/lib/locate/bigram, and
locate.code at /usr/lib/locate/code? (I believe that's the default
location for those binaries on Debian-based systems.)
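
The `: ${name:=default}` lines only assign the default when the variable is unset or empty, so a value already present in the environment takes precedence. As a rough sketch of that override (the helper name `mklocatedb_env` is hypothetical, and /usr/lib/locate is just the directory suggested above, which may not exist on every system):

```python
import os


def mklocatedb_env(libexec_dir, base_env=None):
    """Build an environment that overrides mklocatedb.sh's helper programs.

    The script's `: ${name:=default}` lines keep any value that is already
    set, so `bigram` and `code` supplied here replace locate.bigram and
    locate.code.
    """
    env = dict(os.environ if base_env is None else base_env)
    directory = libexec_dir.rstrip("/")
    env["bigram"] = directory + "/bigram"
    env["code"] = directory + "/code"
    return env
```

Passing the result as `env=` to `subprocess.run` when invoking mklocatedb.sh would apply the override without editing the script.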

Patrick

Sanford

Feb 7, 2008, 2:54:17 AM
to Wikipedia iPhone
I do not know about Debian installations, but on my CentOS
installation the built-in findutils does not provide bigram and code,
and I believe the locate binary uses a slightly different locate
database format, so I had to get the locate, code and bigram commands
from the internet.

And it works now, after some modifications:
1. Put bigram, code and locate from the i386 rpm in some directory,
and modify LIBEXECDIR in mklocatedb.sh to point to it.
2. The -presort option in the script was causing me lots of problems,
so I removed that whole chunk of code and kept only the "else" portion.
3. Instead of awk, use perl. Change line 87 from
    awk '{if (/^[ ]*[0-9]+[ ]+..$/) {printf("%s",$2)} else {exit 1}}' > $bigrams || exit 1
to
    perl -ne '/^\s*[0-9]+\s(..)$/ && print $1 || exit 1' > $bigrams || exit 1
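
For anyone without a working awk or perl for this step, the same filter can be sketched in Python. This is only an illustration of what the perl one-liner above does (take the two-character bigram from each "count bigram" line and fail on anything else); the function name `extract_bigrams` is made up for the example, and unlike the one-liner it produces no partial output before failing.

```python
import re

# Same shape as the perl pattern: optional leading whitespace, a count,
# one whitespace character, then exactly two characters (the bigram).
BIGRAM_LINE = re.compile(r"^\s*[0-9]+\s(..)$")


def extract_bigrams(lines):
    """Concatenate the bigram from each "count bigram" line.

    Raises ValueError on the first malformed line, where the shell
    version would `exit 1`.
    """
    out = []
    for line in lines:
        match = BIGRAM_LINE.match(line.rstrip("\n"))
        if match is None:
            raise ValueError("unexpected line: %r" % line)
        out.append(match.group(1))
    return "".join(out)
```

Feeding it the lines that the script pipes into awk and writing the returned string to $bigrams would stand in for line 87.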

So now I have this... do I tar it up, SCP it to my iPhone and untar it
into the wp folder?
-rw-rw-r-- 1 gaarder gaarder 163027914 Feb 7 10:46 zhwiki-20080114-pages-articles.xml.bz2.processed
-rw-rw-r-- 1 gaarder gaarder   5661220 Feb 7 11:58 zhwiki-20080114-pages-articles.xml.bz2.index.txt
-rw-rw-r-- 1 gaarder gaarder       508 Feb 7 15:41 zhwiki-20080114-pages-articles.xml.bz2.locate.prefixdb
-rw-rw-r-- 1 gaarder gaarder   3602067 Feb 7 15:41 zhwiki-20080114-pages-articles.xml.bz2.locate.db
-rw-rw-r-- 1 gaarder gaarder      4372 Feb 7 15:41 zhwiki-20080114-pages-articles.xml.bz2.blocks.db


Patrick Collison

Feb 7, 2008, 2:59:56 AM
to wikipedi...@googlegroups.com
On 06/02/2008, Sanford <sanfor...@gmail.com> wrote:
>

Awesome! So, the last step is to remove the original dump name from all
of those files. E.g., zhwiki-20080114-pages-articles.xml.bz2.processed
becomes just "processed". The folder structure on the iPhone should
look like this (there's no need to tar it; you can just scp -r the
directory):

wp/
    processed
    locate.db
    locate.prefixdb
    blocks.db
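
The renaming could be scripted rather than done by hand. A minimal sketch, assuming all generated files sit in one directory and share the "<dump name>." prefix; `strip_dump_prefix` is a hypothetical helper, not something shipped with the source.

```python
from pathlib import Path


def strip_dump_prefix(directory, dump_name):
    """Rename "<dump_name>.<suffix>" files in `directory` to "<suffix>".

    Returns the new names, sorted. Files without the prefix are left alone.
    """
    prefix = dump_name + "."
    renamed = []
    # sorted() takes a snapshot, so renaming while looping is safe.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            new_name = path.name[len(prefix):]
            path.rename(path.with_name(new_name))
            renamed.append(new_name)
    return sorted(renamed)
```

Note that this would also rename the index.txt file, which is not part of the folder structure listed above, so that one probably does not need to be copied to the device.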

Although scp works fine, I normally set up a local webserver, and curl
the processed file directly -- it's about 2x as fast on my iPhone,
since there's no unnecessary encryption overhead.
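
Any web server will do for this. As one possible sketch, Python's standard library can serve the directory over plain HTTP; the function name `serve_directory` and the port number are arbitrary choices for the example, not anything the app requires.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8000, host="0.0.0.0"):
    """Serve `directory` over plain HTTP in a background thread.

    Returns the server; call .shutdown() and .server_close() when done.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With the server running on the desktop machine, something like `curl -O http://<desktop-address>:8000/processed` on the iPhone would fetch the file, where `<desktop-address>` is a placeholder for that machine's address on the local network.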

Would you mind documenting the steps you undertook somewhere, so that
I can add it to the source distribution? Also, I'm happy to integrate
whatever suggestions you have to make it easier for future Linux users
-- e.g. including the mklocatedb.sh shell script, etc. (A linux
version of sh/process would be nice to have; the other can then become
sh/process.osx or something.)

Patrick

Sanford

Feb 7, 2008, 3:32:01 AM
to Wikipedia iPhone
It worked...! Well, kind of...
The zh wiki is the Chinese Wikipedia, and most of the entries have
titles in multi-byte Unicode characters. It seems that your program,
in the capitalizing procedure, destroys the first Unicode character
in the title.
For example, navigating to the page "美鐵" (redirected from the page
Amtrak) results in the program accessing the page "é鐵".


2008-02-07 16:16:49.423 Wikipedia[465:d03] key: WebActionButtonKey, value: 1
2008-02-07 16:16:49.426 Wikipedia[465:d03] key: WebActionModifierFlagsKey, value: 0
2008-02-07 16:16:49.429 Wikipedia[465:d03] key: WebActionNavigationTypeKey, value: 0
2008-02-07 16:16:49.432 Wikipedia[465:d03] key: WebActionOriginalURLKey, value: wp://localhost/%E7%BE%8E%E9%90%B5
2008-02-07 16:16:49.434 Wikipedia[465:d03] loading article with url wp://localhost/%E7%BE%8E%E9%90%B5
2008-02-07 16:16:49.436 Wikipedia[465:d03] url path: /美鐵
2008-02-07 16:16:49.438 Wikipedia[465:d03] cap: é鐵
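
The app's capitalization code isn't shown in this thread, so the following is only a guess at a mechanism that happens to reproduce the "cap:" line above. It assumes the first character (美, U+7F8E) is squeezed into a single 8-bit char, keeping only the low byte 0x8E, and that this byte is later read back as Mac OS Roman text, where 0x8E is "é". Both function names are hypothetical, and `safe_capitalize` is just one way to avoid the truncation.

```python
def broken_capitalize(title):
    """Capitalize the first character as if it were a single 8-bit char.

    Assumption (not confirmed by the source): the first code point is
    truncated to one byte, and that byte is read back as Mac OS Roman.
    """
    first_byte = ord(title[0]) & 0xFF          # U+7F8E -> 0x8E
    if ord("a") <= first_byte <= ord("z"):     # ASCII-only uppercasing
        first_byte -= 32
    return bytes([first_byte]).decode("mac_roman") + title[1:]


def safe_capitalize(title):
    """Uppercase the first whole character and leave the rest untouched."""
    return title[:1].upper() + title[1:]
```

Under these assumptions `broken_capitalize("美鐵")` gives "é鐵", matching the log, while `safe_capitalize("美鐵")` leaves the title unchanged; both turn "amtrak" into "Amtrak".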

Secondly, the search box in the application does not match Unicode
characters.

Thirdly, the rendered Unicode text is garbled (but the links to other
pages are fine), as you can see in this screenshot:
http://img114.imageshack.us/img114/8131/img0029da6.jpg

I think this program has lots of potential! With this framework we can
even try to import dictionary files into the system (e.g. stardict)
and the result would be much better than weDict!
Keep up the good work!

I will create a wiki page on my personal site to document what I have
done to convert the database in a short while. Or would you prefer
creating a new Google Code / SourceForge project so we can have a
centralized repository for related information?

Patrick Collison

Feb 7, 2008, 3:45:28 AM
to wikipedi...@googlegroups.com
On 07/02/2008, Sanford <sanfor...@gmail.com> wrote:
>
> It worked...! well, kind of...

Sweet!

> The zh wiki is the Chinese Wikipedia, and most of the entries have
> titles in multi-byte Unicode characters. It seems that your program,
> in the capitalizing procedure, destroys the first Unicode character
> in the title.
> For example, navigating to the page "美鐵" (redirected from the page
> Amtrak) results in the program accessing the page "é鐵".
>
>
> 2008-02-07 16:16:49.423 Wikipedia[465:d03] key: WebActionButtonKey, value: 1
> 2008-02-07 16:16:49.426 Wikipedia[465:d03] key: WebActionModifierFlagsKey, value: 0
> 2008-02-07 16:16:49.429 Wikipedia[465:d03] key: WebActionNavigationTypeKey, value: 0
> 2008-02-07 16:16:49.432 Wikipedia[465:d03] key: WebActionOriginalURLKey, value: wp://localhost/%E7%BE%8E%E9%90%B5
> 2008-02-07 16:16:49.434 Wikipedia[465:d03] loading article with url wp://localhost/%E7%BE%8E%E9%90%B5
> 2008-02-07 16:16:49.436 Wikipedia[465:d03] url path: /美鐵
> 2008-02-07 16:16:49.438 Wikipedia[465:d03] cap: é鐵
>
> Secondly, the search box in the application does not match Unicode
> characters.
>
> Thirdly, the rendered Unicode text is garbled (but the links to other
> pages are fine), as you can see in this screenshot:
> http://img114.imageshack.us/img114/8131/img0029da6.jpg

Yes, Unicode is broken (many German users have also complained...).
I'd like to fix it, but unfortunately have very little time to work on
this at the moment...

> I think this program has lots of potential! With this framework we can
> even try to import dictionary files into the system (e.g. stardict)
> and the result would be much better than weDict!
> Keep up the good work!
>
> I will create a wiki page on my personal site to document what I have
> done to convert the database in a short while. Or would you prefer
> creating a new Google Code / SourceForge project so we can have a
> centralized repository for related information?

I'm creating a Google Code project right now to centralise this
info... will ping again when it's set up.

Patrick Collison

Feb 7, 2008, 3:49:49 AM
to wikipedi...@googlegroups.com

Ok, done. I've also committed the current source code to the SVN repo.
We're good to go!
