please update spanish wikipedia (es wiki)


☠☠☠

26 Jan 2020, 16:30:02
to aarddict
Judging by the filename eswiki-20170128.slob found here https://github.com/itkach/slob/wiki/Dictionaries#spanish, the package is three (3!!) years old. I need a more recent version of the Spanish Wikipedia, please.

I would even do it myself if there were step-by-step instructions for doing it on a recent Fedora Linux system that actually work (i.e. mwscrape with Python 3). Or is there another way besides very slow web scraping?

Thank you,
Erik

mhbraun

18 Mar 2020, 23:16:52
to aarddict
I do not know of another way of doing it.
If you want to write up setup instructions, you can figure them out here.
Any questions, please post them in this forum.
Have fun

☠☠☠

12 Jan 2022, 07:13:21
to aarddict
I am finally working on a new Spanish Wikipedia slob file. At the moment it is still scraping.

Some statistics after 50 minutes on a http://clouding.io server with 1 core, 2 GB RAM and 50 GB disk space:

• Total articles in spanish wikipedia:    anz_artikel = 1 744 503
• Articles per hour:    artikel_pro_h = 6 990 ×60/50 = 8 388
• Estimated time of arrival (in hours):    eta_h = anz_artikel/artikel_pro_h = 207.976
• Estimated time of arrival (in days):    eta_d = runden(eta_h/24) = 9
• Megabytes per hour on disk:    mb_pro_h = 81×60/50 = 97.2
• Expected final size of database when finished (in Gigabytes):    db_gb = runden(mb_pro_h×eta_h/ 1 000 ) = 20
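For anyone who wants to redo this estimate with their own numbers, here is a minimal shell sketch of the same arithmetic (integer arithmetic only; the input figures are the ones measured above):

########
#!/bin/bash
# rough reproduction of the estimate above, with the measured values hard-coded
anz_artikel=1744503                       # total articles in the Spanish Wikipedia
artikel_pro_h=$(( 6990 * 60 / 50 ))       # articles scraped per hour (6 990 articles in 50 minutes)
eta_h=$(( anz_artikel / artikel_pro_h ))  # estimated hours until the scrape is done
eta_d=$(( (eta_h + 12) / 24 ))            # rounded to whole days
mb_pro_h=$(( 81 * 60 / 50 ))              # MB written to disk per hour (81 MB in 50 minutes)
db_gb=$(( mb_pro_h * eta_h / 1000 ))      # expected final database size in GB
echo "ETA: ${eta_h} h (about ${eta_d} days), expected db size: about ${db_gb} GB"
########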

The cost of the server is 10 € per month, so after 9 days the cost will be about 3 €.

But the CPU usage is only around 15 %, so I could reduce the cost by using only half a CPU (this option is available) and only 25 GB of disk space. Such a server would cost only half as much, i.e. 5 € per month, and therefore only 1.50 € after 9 days.

☠☠☠

12 Jan 2022, 07:28:44
to aarddict
Here are all the commands needed for scraping, and for creating a slob file afterwards, when you set up a new server on http://clouding.io :

########################
# Written for CentOS 8 #
########################
export TERM=rxvt-unicode

echo "export TERM=rxvt-unicode" >> ~/.bashrc       # otherwise I had problems with my terminal and the forward and backward keys etc.


dnf install -y --nogpgcheck https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
dnf config-manager --enable powertools
dnf install -y rsync htop bmon dstat ranger screen git    # some nice tools for monitoring the server and working with files


########################
# mwscrape for aardict #
########################
dnf install -y yum-utils
yum-config-manager --add-repo https://couchdb.apache.org/repo/couchdb.repo
dnf install -y couchdb libicu-devel gcc gcc-c++ python3-virtualenv python3-lxml python3-cssselect jq
pip3 install PyICU couchdb cssutils

# couchdb setup (better compression, admin account with stupid password for couchdb)
sed -i 's%;file_compression.*%file_compression = deflate_6%' /opt/couchdb/etc/default.ini
sed -i 's%;admin\ *=.*%admin = password%' /opt/couchdb/etc/local.ini

# start the DB
systemctl start couchdb.service

# show which databases are present in couchdb (none before you use mwscrape for the first time)
sleep 5s    # starting of DB takes some time
curl http://admin:pass...@127.0.0.1:5984/_all_dbs | jq
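# (optional) verify that the file_compression setting from default.ini was picked up;
# /_node/_local/_config is CouchDB's standard config API, but double-check the path for your CouchDB version
curl http://admin:pass...@127.0.0.1:5984/_node/_local/_config/couchdb/file_compression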


# from home I can view the content of the database in my browser by opening an SSH tunnel to the clouding.io server:
ssh -L 1234:localhost:5984 root@IPADDRESS_OF_SERVER

# and then on my computer at home I open the couchdb web-interface in my browser
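# e.g. with the tunnel above: http://localhost:1234/_utils (Fauxton, CouchDB's built-in web interface)

One step the listing above leaves out is installing and running mwscrape itself. A minimal sketch, assuming it can be installed straight from its git repository and that it talks to the local CouchDB set up above (see the mwscrape README for the exact options, for example how to pass the CouchDB credentials):

########
# install mwscrape (assumption: installing directly from the git repository works on this setup)
pip3 install git+https://github.com/itkach/mwscrape.git

# start scraping the Spanish Wikipedia into the local CouchDB
mwscrape es.wikipedia.org
########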


Regards,
Erik

itkach

12 Jan 2022, 11:20:37
to aarddict
Erik,

Thank you for the write-up, very interesting. Indeed, scraping is not very CPU intensive. For the next part - dictionary compilation - you may want to provision a different machine, with more cores and a faster disk.

☠☠☠

12 Jan 2022, 13:26:25
to aarddict
Well, the statistics have changed a lot since the start. Maybe I have to pause the process and resize the disk (make it bigger). The new predictions look like this:

• Total articles in spanish wikipedia:    anz_artikel = 1 744 503
• Minutes working (6 hours, 49 minutes):    min_uptime = 6×60 + 49 = 409
• Articles per hour:    artikel_pro_h = 34 002 ×60/min_uptime = 4 988.07
• Estimated time of arrival (in hours):    eta_h = anz_artikel/artikel_pro_h = 349.735
• Estimated time of arrival (in days):    eta_d = runden(eta_h/24) = 15
• Megabytes per hour on disk:    mb_pro_h = 873×60/min_uptime = 128.068
• Expected final size of database when finished (in Gigabytes):    db_gb = runden(mb_pro_h×eta_h/ 1 000 ) = 45

So the expected file size went up significantly. And the articles per hour went down (maybe in the beginning there were only small articles).

Thank you for your great work, itkach. I use the aard2 app a lot, because it is always much faster than searching online. Startup of aard2 is super fast; typing a word in the aard2 search field immediately shows a list of results, and tapping one of the results immediately shows the article. That is so much faster than opening another app or a browser and loading a site online. I use it a lot for looking up words in the Wiktionaries and also articles in the several Wikipedias I have installed. And of course it also works offline, or with a bad internet connection. It is one of my most used apps. But the Spanish version was just too old. Hopefully in 9 to 15 days I'll have a new one. I will put the file somewhere for you to download. Maybe you can suggest something, as mega.nz only allows about 1 GB of downloads per day without paying. I hope the final slob files will again be small compared to the database (as they'll be compressed with lzma2).

☠☠☠

13 Jan 2022, 05:47:24
to aarddict
Maybe this is boring for you, but for me it is quite interesting how the values change over time. The articles per hour haven't quite reached their lowest, true value yet, but almost. At least it is still more than one article per second. 16 days for the whole Spanish Wikipedia.

But the size needed for the final database went down dramatically again, lower than the first value. Great.

    anz_artikel = 1 744 503
    min_uptime = 23×60 + 23 = 1 403
    artikel_pro_h = 106 798 ×60/min_uptime = 4 567.27
    eta_h = anz_artikel/artikel_pro_h = 381.957
    eta_d = runden(eta_h/24) = 16
    mb_pro_h = 1 100 ×60/min_uptime = 47.0421
    db_gb = runden(mb_pro_h×eta_h/ 1 000 ) = 18

mhb...@freenet.de

13 Jan 2022, 10:03:50
to aard...@googlegroups.com
Quite interesting, what you are doing here. How do you get the data?
I could run a comparison on my setup with enwiki or dewiki.
Do you collect the data over time so you can generate a graph?

Markus


☠☠☠

15 Jan 2022, 15:05:20
to aarddict
Hi Markus.

Well, originally I did it manually. But since you said statistics would be nice, I am now running a background process that gets the numbers every 15 minutes. Tomorrow I can show the first graph of it. But I think the biggest changes were in the beginning.

This is the file

########
#!/bin/bash

STATISTICS_FILE="$HOME/statistics.txt"

# number of documents in the first database (that should be the Wikipedia/Wiktionary one)
# note: match the exact key "doc_count" so that "doc_del_count" is not picked up as well
articles=$(curl -s http://admin:pass...@127.0.0.1:5984/$(curl -s http://admin:pass...@127.0.0.1:5984/_all_dbs | cut -d, -f 1  | sed "s%[\"\[]%%g") | sed "s%,%\n%g" | \grep '"doc_count"' | cut -d: -f2)
# size of the db on disk (in 1K blocks, as reported by du)
db_size=$(du -s /opt/couchdb/data/ | cut -f1)
# uptime in minutes
uptime_minutes=$(( ($(date +%s)-$(date --date="$(uptime -s)" +%s))/60 ))
# append one line: uptime articles db_size
echo $uptime_minutes $articles $db_size >> "$STATISTICS_FILE"
########

I started it with `screen`, and inside `screen` I am using `watch` to call the script every 900 seconds (= 15 minutes). This is the command to start the named screen session with the watch command:

########
screen -S create_statistics -d -m watch -n 900 write_couchdb_aard_scrape_statistics.sh
########

Try it yourself. You may have to change the path to your database for the db_size variable, and of course the credentials (username and password) of your CouchDB.
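As an alternative to the screen + watch combination, a plain cron entry would do the same scheduling; a sketch, assuming the script sits in the home directory and is executable:

########
# crontab -e: run the statistics script every 15 minutes
*/15 * * * * $HOME/write_couchdb_aard_scrape_statistics.sh
########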

Tomorrow I will show you a gnuplot script that visualizes the first stats.

Regards,
Erik

☠☠☠

17 Jan 2022, 07:23:29
to aarddict
nrtd.png
The first two values are read from the database; all others are calculated (therefore the data rate is not really changing, it only looks that way because it is derived from the database size). The jumps in database size are interesting. Could it be the compression, which is only done every few hours?
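If those drops really are CouchDB compacting the database in the background, a compaction run can also be triggered by hand through the standard _compact endpoint; a sketch (DBNAME is a placeholder, take the real database name from the _all_dbs call shown earlier):

########
curl -X POST -H "Content-Type: application/json" http://admin:pass...@127.0.0.1:5984/DBNAME/_compact
########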

This is the gnuplot script:

######
set key above
set xlabel "hours uptime"
f='statistics.txt'
set grid
tot_art=1744503

plot f u (h=$1/60):(scraped_articles=$2,  scraped_articles/1000) w l t "scraped articles [1000]", \
  f u (h=$1/60):(current_db_size=$3/1000,  current_db_size/10) w l t "current db size [10 MB]", \
  f u (h=$1/60):(art_rate=$2/$1*60,  art_rate/10) w l t "article rate [10/h]", \
  f u (h=$1/60):(tot_estimated_h=tot_art/($2/$1*60)) w l t "total estimated time [h]", \
  f u (h=$1/60):(data_rate_mb_per_h=($3/1000)/$1*60,  data_rate_mb_per_h*10) w l t "data rate [1/10 MB/h]", \
  f u (h=$1/60):(data_rate_mb_per_h=($3/1000)/$1*60, total_estimated_time=tot_art/($2/$1*60), \
    expected_db_size_gb=data_rate_mb_per_h*total_estimated_time/1000, expected_db_size_gb*10) w l t "expected db size [1/10 GB]", \
  f u (h=$1/60):(total_estimated_time=tot_art/($2/$1*60), h=$1/60, eta_h=total_estimated_time-h) w l t "ETA [h]"
######
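One way to render the script into a PNG like the attached one is to run it through gnuplot with a PNG terminal from the shell; a sketch (the script filename is an assumption, and pngcairo needs a gnuplot built with cairo support - plain "png" works as well):

########
gnuplot -e "set terminal pngcairo size 1280,800; set output 'nrtd.png'" plot_statistics.gnuplot
########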

☠☠☠

18 Jan 2022, 08:31:52
to aarddict

Still not finished.

Newest statistics and new graphs:

nrtd.png

Markus Braun

21 Jan 2022, 15:52:30
to '☠☠☠' via aarddict

This is great, Erik.

Thank you for your ideas and sharing the code. I am actually implementing it in a slightly different way, as I want to monitor the weekly updates this way.

Will keep you updated.

Markus

☠☠☠

27 Jan 2022, 07:52:03
to aarddict
Finished scraping the Spanish Wikipedia after 13 days. But in the end there were only about 1.2 million articles instead of 1.7 million (which is what the main page of the Spanish Wikipedia states) – see the violet line in the graph. Are there really that many redirects, and are they not saved as articles in the database? Or how does that come about?

Creating the slob on a 6-core computer took only 1 hour 50 minutes. The slob file is 3 985 891 517 bytes, which is getting close to too big for a FAT32 filesystem.

Important question: where can I upload the file so that everybody can use it? mega.nz is not unlimited anymore. Well, I could leave it on my private webspace. How does the link get added to the official list?

stat.png
Regards,
Erik

☠☠☠

27 Jan 2022, 08:09:53
to aarddict
stat.png
I wanted to add this statistics graph.

☠☠☠

27 Jan 2022, 08:11:28
to aarddict
Apparently this is not working. Why can't I add the graph? Sorry for spamming. I will try again:

stat.png

Markus Braun

28 Jan 2022, 11:58:06
to '☠☠☠' via aarddict

Good. Sounds strange. I have a similar (unresolved) issue with enwiktionary. Full scraping and updating does not deliver all articles. I do get all articles with dewiki and enwiki.

The creation of the slob is super fast on your machine. For double the number of articles (approx. 6.8 million) my enwiki takes about 5 days on a 4-core machine running as a VM.

Let me know how to get your files and I will host them in the library at RWTH Aachen in the Spanish section. I will give you more details later today, as I am in a hurry now.

Thank you for your update

☠☠☠

30 Jan 2022, 09:04:25
to aarddict
Would you then please add it to the list of dictionaries on GitHub?

And second: what does the author of the software say about the differing numbers of articles?

itkach

30 Jan 2022, 11:12:12
to aarddict
Hi Erik,

any GitHub user can edit the wiki page. Since @mhbraun agreed to upload to RWTH Aachen I'm sure he will update the page accordingly; just keep in mind that if you have any dictionaries to share you can add the links yourself.

Regarding the number of articles... a significant discrepancy indicates that mwscrape didn't get all the articles and the dictionary would be rather incomplete.
We've run into this before: https://groups.google.com/g/aarddict/c/-YTv_a1jIIs. @mhbraun could not get the expected number of articles in multiple runs but did not see any apparent errors on his system, and when I ran mwscrape I got all the articles. We never figured out the root cause.

The good news is that Wikipedia recently - just a few months ago - started publishing a new kind of data dump that includes rendered article HTML, and I just added support for this format in mwscrape2slob (now called mw2slob): https://github.com/itkach/mw2slob


Give it a try?

☠☠☠

30 Jan 2022, 16:16:20
to aarddict
Great! I read about those enterprise dumps a week ago, after finding a complaint in a Wikipedia discussion that this project is scraping the databases and thereby violating fair-use policies or something like that. I wanted to report it here, but I also didn't want to ask too much of you. And now you have already created a tool for it. Finally, no more time-consuming scraping. Thank you very much!

Regards,
Erik

☠☠☠

31 Jan 2022, 13:48:17
to aarddict
Today I downloaded the Spanish enterprise dump. The download rate was only 4 MB/s, so it took a long time.
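For reference, the enterprise HTML dumps are published under https://dumps.wikimedia.org/other/enterprise_html/runs/ ; a download might look roughly like this (the run date and exact filename are assumptions - browse the runs/ directory for the actual names):

########
wget https://dumps.wikimedia.org/other/enterprise_html/runs/20220120/eswiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz
########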

Then I used mw2slob to create a slob on a 16-core CPU with 32 GB RAM. It took 2 hours and 8 minutes (compared to the 1 hour 50 minutes on a 6-core machine for the scraped file). And the file was bigger than the one I had created a few days ago from the scraped data.

I ran slob info on both of them and found out that the enterprise-dump slob has 1 689 296 articles, whereas the scraped slob has only 1 272 385 articles. So I will use the new one on my phone. It is great that it now only takes a few hours to build a new version (although an update once a year is enough for me, I think).
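For reference, the article counts come from slob info; the calls look roughly like this (the filenames are placeholders):

########
slob info eswiki-scraped.slob       # slob built from the mwscrape CouchDB
slob info eswiki-enterprise.slob    # slob built by mw2slob from the enterprise dump
########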

Thank you to Wikimedia and thank you to itkach.
Erik

PS: mhbraun, do you need the new slob file I created? Is there an FTP server I could use to upload it? Or should I put it on my webspace so you can get it from there?

☠☠☠

31 Jan 2022, 13:52:13
to aarddict
I forgot to mention the file sizes:

3.985891517 GB old file (scraped)
4.916633288 GB new file (enterprise dump – that means I have to split it with slob convert --split …)

franc

1 Feb 2022, 06:10:49
to aarddict
Hello

It is good news that Wikimedia does weekly dumps of all wikis :) :) :)
I will switch to that too!
I have been scraping fully automatically on my server for a good while now, creating an up-to-date frwiki, frwiktionary and dewiktionary each month. This seems to work; the slob files get a little bigger each time and have a few more entries each time.
But it is surely not 100% reliable, and I have to admit that I never read my logfiles ;)

So I think I will change my cron jobs to run mw2slob on the official dumps. I have to switch from mwscrape2slob to mw2slob first (and also update slob to use the new mw2slob).
But it will take some time until I find the time...

Thanks for this good news and for the new mw2slob to handle the dumps!!!
frank