Create a dictionary in slob format


ALI MS

Oct 24, 2015, 10:50:48 AM
to aarddict
Hi,
Please explain step by step how I can create a slob dictionary for my language.

mhbraun

Oct 24, 2015, 1:01:42 PM
to aarddict
Valid request, but there is too little information to answer it.
It would be good to have the directions in one place, including how to set up the CouchDB database, of course.

What operating system are you using? There is just a Linux version available.
What is your language?
What kind of dictionary are you interested in?

ALI MS

Oct 25, 2015, 11:33:30 AM
to aarddict
Thanks for the answer.
I use Windows, but I have installed Linux in a virtual machine.
My language is Persian.
I want Wikipedia for Aard 2 in slob format.

itkach

Oct 25, 2015, 11:47:11 AM
to aarddict
Persian Wikipedia is available here: https://github.com/itkach/slob/wiki/Dictionaries#persian

If you want to create a newer version yourself, you need to install mwscrape (from https://github.com/itkach/mwscrape) and then convert the Wikipedia data it downloads to .slob using mwscrape2slob (from https://github.com/itkach/mwscrape2slob).
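
A minimal sketch of how the two tools connect, under stated assumptions: that mwscrape stores a site in a CouchDB database named after the host with dots replaced by hyphens, and that mwscrape2slob takes the URL of that database as its argument. The helper names `couch_db_name` and `couch_db_url` are made up for illustration, so check both READMEs before relying on this.

```shell
# Derive the CouchDB database name for a scraped site.
# Assumption: the database is named after the host, with "." replaced by "-".
couch_db_name() {
    printf '%s\n' "$1" | tr '.' '-'
}

# Build the database URL to pass to the converter.
# Assumption: CouchDB listens on its default local address and port.
couch_db_url() {
    printf 'http://localhost:5984/%s\n' "$(couch_db_name "$1")"
}
```

Once the scrape has finished, the conversion would then look something like `mwscrape2slob "$(couch_db_url fa.m.wikipedia.org)"`, after installing the converter with `pip install git+https://github.com/itkach/mwscrape2slob`.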

 

Markus Braun

Oct 25, 2015, 1:05:59 PM
to aard...@googlegroups.com

That is the same setup I use.
Make sure your VM is 64-bit and that its disk space exceeds three times your database size, because you will need to compact the database. For example, the database is about 50 GB for EN and 15 GB for DE (times three, of course).
I am not sure how large Persian will be.
2 GB of RAM should be fine for smaller dictionaries.
Tweak CouchDB for higher compression to save storage space.
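
A rough sketch of the sizing rule and the compaction step, not a definitive recipe. `required_disk_gb` and `compact_db` are hypothetical helper names; the `_compact` endpoint is part of CouchDB's standard HTTP API, while the localhost address and the database name are assumptions about your setup (a CouchDB with admin users configured would also need credentials).

```shell
# Rule of thumb from this thread: reserve three times the database size,
# since compaction needs extra room while it runs.
# Usage: required_disk_gb SIZE_IN_GB
required_disk_gb() {
    echo $(( $1 * 3 ))
}

# Ask CouchDB to compact one database (needs curl and a running CouchDB).
# Usage: compact_db DATABASE_NAME
compact_db() {
    curl -s -H 'Content-Type: application/json' \
        -X POST "http://localhost:5984/$1/_compact"
}
```

For example, `required_disk_gb 50` prints 150, and `compact_db fa-m-wikipedia-org` would start compaction of a database with that name. For compression, CouchDB 1.2 and later read a `file_compression` setting in the `[couchdb]` section of `local.ini` (for example `deflate_9`); check the documentation for your CouchDB version before changing it.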

Take your time... it is a hobby.


ALI MS

Oct 30, 2015, 5:07:29 AM
to aarddict
In the terminal I used these commands:

sudo apt-get install python2.7 python-virtualenv couchdb git
sudo service couchdb start
virtualenv env-mwscrape -p python2.7
source env-mwscrape/bin/activate
pip install git+https://github.com/itkach/mwscrape

and finally, for Persian Wikipedia, I used this command:

mwscrape fa.m.wikipedia.org

It started OK without problems,
but sometimes my internet connection drops (in Iran you sometimes get disconnected from the net and reconnect after about a minute).
Now everything has stopped and is not running, and I must start again from the beginning!
Is there any way around this problem?
 
http://i.imgur.com/KZcBUpA.png

Igor Tkach

Oct 30, 2015, 8:30:38 AM
to aard...@googlegroups.com
On Fri, Oct 30, 2015 at 5:07 AM, ALI MS <ahoor...@gmail.com> wrote:
> In the terminal I used these commands
> but sometimes my internet connection drops (in Iran you sometimes get disconnected from the net and reconnect after about a minute)
>
> Now everything has stopped and is not running

If I kill my connection (for example, by disabling Wi-Fi) and then turn it back on, mwscrape continues just fine. Give it more time.

> and I must start again from the beginning!

However, if it is indeed stuck (which may happen even if your internet is perfectly fine) or if you need to interrupt it for any reason (power loss, reboot or whatever), you can resume with this command:


mwscrape -r

This command reads your scraping session description from your local CouchDB (the host you are scraping and the last article you got) and starts from there.
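
If the connection drops often, resuming by hand gets tedious. Here is a small sketch of a retry wrapper; `retry_until_done` is a made-up helper, and it only helps if mwscrape actually exits with an error status when it fails, which is an assumption. A process that hangs without exiting still has to be stopped by hand first.

```shell
# Re-run a command until it exits successfully, pausing between attempts.
# Usage: retry_until_done DELAY_SECONDS COMMAND [ARGS...]
retry_until_done() {
    delay=$1
    shift
    until "$@"; do
        echo "command exited with an error; retrying in ${delay}s" >&2
        sleep "$delay"
    done
}
```

After the first run has been interrupted, `retry_until_done 60 mwscrape -r` would keep resuming the session, waiting a minute between attempts.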

You should be able to browse the content of CouchDB through its admin UI at http://localhost:5984/_utils/, if you're curious. mwscrape does its bookkeeping in a CouchDB database aptly named "mwscrape".
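
The same information can be read from the terminal through CouchDB's HTTP API. This is only a sketch: `couch_url` and `couch_get` are illustrative helper names, and it assumes CouchDB runs on its default local port without authentication.

```shell
# Base address of the local CouchDB (default port, as in the admin UI URL).
COUCH=http://localhost:5984

# Print the URL of a CouchDB resource, e.g. "_all_dbs".
couch_url() {
    printf '%s/%s\n' "$COUCH" "$1"
}

# Fetch a CouchDB resource as JSON (needs curl and a running CouchDB).
couch_get() {
    curl -s "$(couch_url "$1")"
}
```

`couch_get _all_dbs` lists all databases, and `couch_get 'mwscrape/_all_docs?include_docs=true'` shows the documents in the "mwscrape" bookkeeping database.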