Please update fi.wiktionary.org


Joonatan Sjöroos

Apr 16, 2023, 5:45:53 AM
to aarddict
Hi, please update fi.wiktionary.org; it seems it hasn't been updated in two years. Thanks!

AardF...@web.de

Apr 16, 2023, 8:39:02 AM
to aarddict
Hmm, I am wondering if you checked my earlier post here.
I updated fi.wiktionary.org last month and just a couple of days ago again.
So I am not quite sure what you are referring to.
Can you elaborate?

AardF...@web.de

Apr 16, 2023, 8:41:59 AM
to aarddict
Alright, I got it; fiwiktionary is not listed in my earlier post.
However, if you check https://ftp.halifax.rwth-aachen.de/aarddict/ you will find much more...


Joonatan Sjöroos

Apr 16, 2023, 8:59:40 AM
to aarddict
Oh very nice thanks :)

franc

May 4, 2023, 12:14:11 PM
to aarddict
OK, F..K!!!
Something's wrong; the frwik* builds are crippled again.
In the log of the frwiki slob creation (mwscrape2slob) I see some errors with "IncompleteRead":

  Brigasque (race ovine)
  Brigate Giustizia e Libertà
  Brigate rosse per la costruzione del Partito comunista combattente
ERROR:mwscrape2slob:
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(1765919 bytes read)

  Brigate rosse-Partito guerriglia del proletariato metropolitano
  Brigaud

and a bit later it ends much too early:

  Briggs Islet
  Briggsidae
  Briggs Automotive Company
  Briggs Cunningham
  Briggs (cratère)

Finished adding content in 3:22:18
Finalizing...
Sorting... sorted in 0:00:20
Resolving aliases...
Sorting... sorted in 0:00:20
Resolved aliases in 0:00:20
Finalized in 0:00:57
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape2slob/bin/mwscrape2slob", line 8, in <module>
    sys.exit(main())
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 838, in main
    article_source.run()
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(1765919 bytes read)

I have no log for frwiktionary at the moment, but I guess it is the same.
The dewiktionary log has no IncompleteRead error, so I guess that one is simply not big enough to trigger it.

I have to scrutinize that and restart mwscrape2slob again. In the meantime, only the last working version is in the old directory.
Sorry.

franc

May 4, 2023, 3:58:06 PM
to aarddict
I started the frwiktionary mwscrape2slob run manually and it worked with no errors. This is the end of the output (I didn't find any of those IncompleteRead errors anymore):

ADDING: '~/images/Globe.svg'
ADDING: '~/css/shared.css'
ADDING: '~/css/mediawiki_monobook.css'
ADDING: '~/css/mediawiki_shared.css'
ADDING: '~/css/night.css'
ADDING: '~/js/jquery-2.1.3.min.js'
ADDING: '~/js/styleswitcher.js'

Finished adding content in 2:57:27
Finalizing...
Sorting... sorted in 0:02:36
Resolving aliases...
Sorting... sorted in 0:02:33
Resolved aliases in 0:02:33
Finalized in 0:05:22
All done in 3:02:50

It is in the actual download folder as always:

https://7fw.de/download/wiki/fr/

OK. Then I noticed that time has passed and mw2slob has replaced mwscrape2slob, so I switched to it (first I updated mwscrape2slob and then got an error when trying to run it: command not found).
At the moment I am running mw2slob for frwiki manually (which takes several hours, probably the whole night) and will see if that works too.
If it works, I will try the script again, but with mw2slob. If that works too, I will have to change all my scripts from mwscrape2slob to mw2slob (the command syntax is different).

I have to say that I am on old Ubuntu 18.04 (I will upgrade to 22.04 soon), so it could be that some of the packages are too old.

By the way, I updated pip to pip-21.3.1 and mw2slob to 1.1 without errors.

SKIPPING (not included): 'filters/image'

Markus Braun

May 5, 2023, 1:25:43 AM
to aard...@googlegroups.com
I guess you will not be able to create a frwiki as there is no NS0 dump as of 20230501.
Lots of files are missing.


Markus



From: franc <franc...@gmail.com>
Sent: Thursday, May 4, 2023 21:58
To: aarddict
Subject: Re: Please update fi.wiktionary.org
--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.

Frank Roehm

May 5, 2023, 1:32:22 AM
to aard...@googlegroups.com
I still do the good old mwscrape; I don't build from dumps.

franc

May 5, 2023, 3:34:29 AM
to aarddict
Nope.
The frwiki build didn't work with the current mw2slob either :(
Here is the last output:

S Cuillère à caviar (3036)
S Cuillère à dessert (5163)
S Cuillère à glace (5721)

Finished adding content in 4:33:35
Finalizing...
Sorting... sorted in 0:00:35
Resolving aliases...
Sorting... sorted in 0:00:35
Resolved aliases in 0:00:35
Finalized in 0:01:42
Traceback (most recent call last):
  File "/home/franc/aard/env-slob/bin/mw2slob", line 11, in <module>
    load_entry_point('mw2slob==1.1', 'console_scripts', 'mw2slob')()
  File "/home/franc/aard/env-slob/lib/python3.6/site-packages/mw2slob/cli.py", line 394, in main
    args.func(args)
  File "/home/franc/aard/env-slob/lib/python3.6/site-packages/mw2slob/cli.py", line 128, in cli_scrape
    run(outname, info, articles, args)
  File "/home/franc/aard/env-slob/lib/python3.6/site-packages/mw2slob/cli.py", line 78, in run
    filters=filters,
  File "/home/franc/aard/env-slob/lib/python3.6/site-packages/mw2slob/core.py", line 197, in create_slob
    run(slb, articles, filters, info.interwikimap, info.namespaces, html_encoding)
  File "/home/franc/aard/env-slob/lib/python3.6/site-packages/mw2slob/core.py", line 140, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 347, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(615657 bytes read)

Again that IncompleteRead :( :(
So unless somebody here can tell me what might be wrong, I will stop creating frwiki and frwiktionary until my system is on 22.04 and then give it a new try. I guess it is related, perhaps indirectly as a Python issue (an old version or such).
I don't know, sorry.

I put the last working wikis from March in the folder.
I had to regenerate the size and shasum files for the slobs; I might have deleted them accidentally:

stat --format=%s frwiki_2023-03-02.slob > frwiki_2023-03-02.size.txt
shasum frwiki_2023-03-02.slob | sed -r 's/(.*) .*$/\1/' > frwiki_2023-03-02.sha.txt


frank

AardF...@web.de

May 6, 2023, 7:31:09 AM
to aarddict
Frank,
you created an April version as well. It is hosted on RWTH ;)

As for your system upgrade, you could look into using the Debian 11 netinstall.
During installation, disallow any GUI installation but include the web server and SSH. For a GUI, install
sudo apt install lxde-core
then modify /etc/apt/sources.list by adding 'contrib non-free' to each line containing main,
install all your tools, and voilà: you have a sleek, fast, up-to-date system with no overhead.
I love its performance.
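The sources.list step can be sketched like this (`enable_nonfree` is a hypothetical helper, not an existing tool; it assumes GNU sed and keeps a .bak backup):

```shell
# Hypothetical helper: append "contrib non-free" to every deb/deb-src line
# in the given sources.list that currently ends in "main".
enable_nonfree() {
    local sources="$1"    # e.g. /etc/apt/sources.list
    sed -i.bak '/^deb/ { / main$/ s/$/ contrib non-free/ }' "$sources"
}
```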

franc

May 31, 2023, 4:21:50 AM
to aarddict
AardF...@web.de wrote on Saturday, May 6, 2023 at 13:31:09 UTC+2:
Frank,
you created an April version as well. It is hosted on RWTH ;)


Hello, please delete this frwiktionary_2023-04-06.slob; it is crippled!
The last working one (that I made) is from March.
Thanks!

franc

Nov 29, 2023, 7:42:25 AM
to aarddict
franc wrote on Friday, May 5, 2023 at 09:34:29 UTC+2:
...
... I have to say that I am on old Ubuntu 18.04 (will upgrade to 22.04 soon) so it could be that some of the packages are maybe too old. ...
...

So if nobody could help me here what could be wrong here, I will stop the creation of frwiki and frwiktionary until my system is on 22.04, then I would give it a new try, I guess it is related, or indirect as a Python issue (old version or such).
Dont know, sorry....
frank

Now, at long last, I have updated my OS from 18.04 to 22.04 :)
It was time! "soon" turned out to mean nearly 7 months!!!
Now let's see how long it takes until I continue to scrape the good old frwiki and frwiktionary ;)
I am on CouchDB 3.3.2 now.


AardF...@web.de

Nov 29, 2023, 11:38:49 AM
to aarddict
Congratulations
Keep on going
Let me know when you are up and running.

franc

Dec 11, 2023, 12:12:52 PM
to aarddict
OK, I finally got my server running and everything fixed after the update from Ubuntu 18.04 to 22.04 :)
It was easy, but it took me a long time to find some silly errors with dovecot (my own package sources turned out to be the cause).

OK then. In all this time I never stopped mwscraping the three children: frwiki, frwiktionary and dewiktionary. So I hope mw2slob will no longer throw errors after a short while, as it did before.

First I had to update all the tools - mwscrape2slob is history, mw2slob is now the one - and I had to set everything up anew (because in the beginning I got unclear errors in the pip stuff).
So first I did things like this:

pip install --upgrade setuptools 
python -m pip install --upgrade pip

Now I have up-to-date pip and setuptools:

pip list 
Package      Version 
-------      ------- 
pip          23.3.1
setuptools   69.0.2

Then I reinstalled slob, mw2slob and mwscrape2slob:

rm -r /home/franc/env-slob
python3 -m venv env-slob --system-site-packages
cd /home/franc/env-slob
source bin/activate

rm -r /home/franc/env-mwscrape
python3 -m venv env-mwscrape
source env-mwscrape/bin/activate
pip install https://github.com/itkach/mwscrape/tarball/master

I am not really sure whether all of this is in the correct order; I kept some documentation and copied this from it, but errors are always possible ;)
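For reference, the sequence can be wrapped so each environment is created and populated in one step. This is a sketch, not the documented procedure: `make_env` is a hypothetical helper, and the package sources in the example calls are assumptions. Installing via `<venv>/bin/pip` avoids having to activate first, which sidesteps ordering mistakes:

```shell
# Hypothetical helper: create a venv and install packages into it.
make_env() {
    local dir="$1"; shift
    python3 -m venv "$dir" --system-site-packages || return 1
    if [ "$#" -gt 0 ]; then
        "$dir/bin/pip" install --quiet "$@" || return 1
    fi
}

# Example calls (package sources are assumptions):
# make_env /home/franc/env-mwscrape https://github.com/itkach/mwscrape/tarball/master
# make_env /home/franc/env-slob mw2slob
```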

I did all this this morning, started mwscrape for dewiktionary, and this afternoon started mw2slob for frwiki and frwiktionary; so far everything is running without errors :)
So I hope - and hope dies last - that on Ubuntu 22.04 everything is back in line and that frwiki and frwiktionary, with dewiktionary as a bonus, will be back on track soon!
When ready, I will update the GitHub wiki...

Thanks.
frank

franc

Dec 12, 2023, 3:31:21 AM
to aarddict
franc wrote on Monday, December 11, 2023 at 18:12:52 UTC+1:
...
When ready, I will update the github-wiki...

OH NO!
It happened again: an IncompleteRead on frwiktionary:

...
S umherspaziere (2849)                                                                                        
S umherspazieren (6068)                                                                                      
S umherspazierend (2488)     
S umherspazierende (3607)     
ERROR:mw2slob.core:           
Traceback (most recent call last):    
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/core.py", line 140, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 451, in <genexpr>                                  
    return (item for chunk in result for item in chunk)                                                      
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next                                      
    raise value                                                      
http.client.IncompleteRead: IncompleteRead(414405 bytes read)                                                

Finished adding content in 11:08:00

Finalizing...
Sorting... sorted in 0:02:16
Resolving aliases...
Sorting... sorted in 0:02:15
Resolved aliases in 0:02:15
Finalized in 0:05:01
Traceback (most recent call last):
  File "/home/franc/aard/env-slob/bin/mw2slob", line 8, in <module>
    sys.exit(main())
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/cli.py", line 394, in main
    args.func(args)
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/cli.py", line 128, in cli_scrape
    run(outname, info, articles, args)
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/cli.py", line 67, in run
    core.create_slob(
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/core.py", line 197, in create_slob
    run(slb, articles, filters, info.interwikimap, info.namespaces, html_encoding)
  File "/home/franc/aard/env-slob/lib/python3.10/site-packages/mw2slob/core.py", line 140, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 451, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
http.client.IncompleteRead: IncompleteRead(414405 bytes read)

But the frwiki mw2slob run is still going, without this error.
So now I am at my wits' end; I don't know what that same IncompleteRead error means - I am too weak in Python coding :(
It is the same error as before, and now everything is freshly installed and I still get it, so I guess it has nothing to do with the code.
My server should be strong enough, I think. This is what Webmin says (at the moment, still mw2slob-ing frwiki):

Kernel and CPU Linux 5.15.0-91-generic on x86_64
Processor information Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, 4 cores
CPU load averages 5.63 (1 min) 5.77 (5 mins) 5.90 (15 mins)
CPU usage 96% user, 4% kernel, 0% IO, 0% idle
Real memory 4.97 GiB used, 6.3 GiB cached, 11.67 GiB total
Virtual memory 579.07 MiB used, 952.99 MiB total
Local disk space 106.72 GiB used, 306.07 GiB total

@Igor, could you give me a hint about this IncompleteRead error? It's mw2slob, not mwscrape. CouchDB is version 3.3.3.

itkach

Dec 12, 2023, 9:43:08 PM
to aarddict
The IncompleteRead error indicates that the client (mw2slob) received less data than the server (CouchDB) promised. This may happen due to a network error or due to a bug in the server implementation (CouchDB). Is CouchDB running on the same machine as mw2slob? Does the error occur consistently? If so, then it's probably not the network, although a quick search in https://github.com/apache/couchdb/issues didn't turn up anything that looked relevant.

It's interesting that the error happens when trying to load an atypically large document (at least 414405 bytes, that is ~404 KB). I tried to scrape some articles with titles starting near the title after which it failed, but didn't get the error (quite possibly I didn't scrape enough to hit the offending document). I'll scrape frwiktionary and see if I can reproduce the error. In the meantime, I'd be curious whether you reliably get the same error if you run mw2slob again.
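For what it's worth, a transient IncompleteRead can be retried at the application level. A minimal sketch; `fetch_with_retry` is an illustration, not part of mw2slob:

```python
import http.client
import time

def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch(); on http.client.IncompleteRead, wait and retry.

    Returns the fetched bytes, or the partial data from the last
    failed attempt if every retry fails.
    """
    for i in range(attempts):
        try:
            return fetch()
        except http.client.IncompleteRead as exc:
            if i == attempts - 1:
                return exc.partial  # give up; keep what was read
            time.sleep(delay)
```

If the error is a genuine server-side bug, retrying will of course not help; it only papers over transient network hiccups.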

franc

Dec 13, 2023, 7:55:10 AM
to aarddict
itkach wrote on Wednesday, December 13, 2023 at 03:43:08 UTC+1:
IncompleteRead error indicates that the client (mw2slob) received less data than the server (CouchDB) promised. This may happen due to a network error or due to a bug in the server implementation (CouchDB). Is CouchDB running on the same machine as mw2slob?

Yes, CouchDB runs locally on my server. My mw2slob command was:

mw2slob scrape http://admin:password@localhost:5984/frwiktionary -f common wikt --local-namespace 0 100

Does the error occur consistently?

This is the error (see above) that made me stop creating frwiktionary in May (while waiting for my server's upgrade, which I thought was the cause).
It seems to happen only with frwiktionary, or at least mostly with it.

If so then it's probably not the network, although quick search in https://github.com/apache/couchdb/issues didn't turn up anything that looked relevant. It's interesting that the error happens when trying to load an atypically large document (at least 414405 bytes, that is ~404kb). I tried to scrape some articles with titles starting near the title after which it failed, but didn't get the error (quite possibly didn't scrape enough to hit the offending document).

I cannot see which document exactly caused the IncompleteRead error. Shouldn't it be the one following "S umherspazierende (3607)"? That would be "umherspazierendem", according to a Mango query (... "$gt": "umherspazierende" ...) in CouchDB Fauxton.
I saved this entry to a file, but that file (JSON stuff) is only about 10 KB, nowhere near 404 KB. I don't know whether there is some huge overhead when fetching documents from CouchDB via queries, as mw2slob does. So I cannot say where these 404 KB come from; if it were a single document, it would be a huge one, as if it contained images, which is not possible. Strange enough.
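To get a feel for the numbers, one can measure how large an entry actually is once serialized as JSON (`json_size` and the sample document below are made up for illustration). Note also that if mw2slob fetches documents in batches rather than one at a time, a single failed read could span many documents; that would be one way, purely as an assumption, to reconcile a ~404 KB read with ~10 KB entries:

```python
import json

def json_size(doc) -> int:
    """Size in bytes of `doc` serialized as UTF-8 JSON."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))
```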

Anyway, I wonder why I have a German word - many German words, and also Spanish etc. - in frwiktionary. But then I stop wondering why frwiktionary is so huge and far bigger than dewiktionary ;)
 
I'll scrape frwiktionary and see if I can reproduce the error.

Thanks a lot already. I doubt that you will get the error too, because Markus hasn't gotten it either while scraping frwiki and frwiktionary on my behalf.
Maybe it is because I was mw2slob-ing frwiki at the same time; that run finished without error, by the way.

In the meantime, I'd be curious if you reliably get the same error if you run mw2slob again. 

I restarted mw2slob now (at 13:07), without mw2slob-ing any other wiki this time; maybe my server is not the strong machine I would like it to be, and the error was indeed caused by low resources. My command now (in a bash file):
mw2slob scrape http://admin:password@localhost:5984/frwiki -f common wikt --local-namespace 0 100 --ensure-ext-image-urls --no-math 2>&1 | tee /home/franc/aard/log/mwscrape2slob-frwiktionary_2023-12-13.log
With this I get the errors in a log file (and also on the console), so I can check whether there were other documents as big as 404 KB in the mw2slob run.
I will post again on error or when finished...

Thanks.
frank

itkach

Dec 13, 2023, 11:33:41 AM
to aarddict
On Wednesday, December 13, 2023 at 7:55:10 AM UTC-5 franc wrote:
I restarted the mw2slob now (at 13:07), without mw2slob'ing any other wiki now, maybe my server is not that strong machine I would like it were and the error was indeed from low ressources. My command now (in a bash file):
mw2slob scrape http://admin:password@localhost:5984/frwiki -f common wikt --local-namespace 0 100 --ensure-ext-image-urls --no-math 2>&1 | tee /home/franc/aard/log/mwscrape2slob-frwiktionary_2023-12-13.log

 The URL in your command above is http://admin:password@localhost:5984/frwiki - is that pointing to Wikipedia or Wiktionary?

franc

Dec 13, 2023, 11:40:20 AM
to aarddict
Sorry, I wrote it wrong; it's frwiktionary.
And the running mw2slob was frwiktionary too, I just checked. I had to kill it though, and restarted it again (I lost less than 5 hours).

itkach

Dec 13, 2023, 12:28:15 PM
to aarddict
looks like my scrape finished, and mw2slob compiles it without errors, but it's only ~1.57M articles, https://www.wiktionary.org/ claims it should be ~4.9M 🤔

AardFeeder

Dec 13, 2023, 12:58:18 PM
to aard...@googlegroups.com

My frwiktionary generated with mwscrape holds 4 810 708 articles in couchdb v. 3.3.2

 

From: aard...@googlegroups.com [mailto:aard...@googlegroups.com] On Behalf Of itkach
Sent: Wednesday, December 13, 2023 18:28
To: aarddict <aard...@googlegroups.com>
Subject: Re: Please update fi.wiktionary.org

 

looks like my scrape finished, and mw2slob compiles it without errors, but it's only ~1.57M articles, https://www.wiktionary.org/ claims it should be ~4.9M 🤔


franc

Dec 13, 2023, 1:04:33 PM
to aarddict
itkach wrote on Wednesday, December 13, 2023 at 18:28:15 UTC+1:
looks like my scrape finished, and mw2slob compiles it without errors, but it's only ~1.57M articles, https://www.wiktionary.org/ claims it should be ~4.9M 🤔

My Fauxton says frwiktionary has 4697975 documents, in an 11.1 GB database.
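The same count can also be read without Fauxton: a GET on /{db} against CouchDB returns a JSON object with a doc_count field (fetch it with e.g. curl http://localhost:5984/frwiktionary; host and database name are placeholders). A minimal parsing sketch, with `doc_count` as a made-up helper name:

```python
import json

def doc_count(info_json: str) -> int:
    """Extract doc_count from the JSON body returned by CouchDB's GET /{db}."""
    return json.loads(info_json)["doc_count"]
```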


Igor Tkach

Dec 13, 2023, 1:34:44 PM
to aard...@googlegroups.com
I guess I need to scrape some more :)


AardFeeder

Dec 13, 2023, 1:53:22 PM
to aard...@googlegroups.com

Wanna give it a try and sync with mine?

As far as I remember, we did that a decade ago…

 

 

From: aard...@googlegroups.com [mailto:aard...@googlegroups.com] On Behalf Of Igor Tkach
Sent: Wednesday, December 13, 2023 19:35
To: aard...@googlegroups.com
Subject: Re: Please update fi.wiktionary.org

 

I guess I need to scrape some more :)

Igor Tkach

Dec 13, 2023, 1:59:26 PM
to aard...@googlegroups.com
Sure, if you can open up your CouchDB so that I can connect.

Markus Braun

Dec 13, 2023, 2:00:47 PM
to aard...@googlegroups.com

I have the job on my screen; send the https:// address of your database.

No authentication needed from my side.

AardFeeder

Dec 13, 2023, 2:02:50 PM
to aard...@googlegroups.com

Do you want to replicate into a new database or into your existing database?
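For reference, CouchDB's one-shot replication is triggered by POSTing a body like the following to the /_replicate endpoint. The hosts, credentials and database names here are placeholders; with create_target set, the target database is created if it does not exist yet:

```json
{
  "source": "https://example.org:6984/frwiktionary",
  "target": "http://admin:password@localhost:5984/frwiktionary",
  "create_target": true
}
```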

franc

Dec 15, 2023, 4:42:07 AM
to aarddict
franc wrote on Wednesday, December 13, 2023 at 13:55:10 UTC+1:
...

I restarted the mw2slob now (at 13:07), without mw2slob'ing any other wiki now, maybe my server is not that strong machine I would like it were and the error was indeed from low ressources. My command now (in a bash file):
mw2slob scrape http://admin:password@localhost:5984/frwiki -f common wikt --local-namespace 0 100 --ensure-ext-image-urls --no-math 2>&1 | tee /home/franc/aard/log/mwscrape2slob-frwiktionary_2023-12-13.log
With this I have the errors in a log file (and also on the console), so I can look if there were other documents that big as 404 kB in the mw2slob.
Will post again, when error or finished...

OK then, it worked this time :)
The last lines of my log ( mwscrape2slob-frwiktionary_2023_12_13-17.38.16.log ) show no errors:

...
SKIPPING (not included): 'filters/image'

ADDING: '~/images/Globe.svg'
ADDING: '~/css/shared.css'
ADDING: '~/css/mediawiki_monobook.css'
ADDING: '~/css/mediawiki_shared.css'
ADDING: '~/css/night.css'
ADDING: '~/js/jquery-3.7.1.slim.min.js'

Finished adding content in 6:06:18
Finalizing...
Sorting... sorted in 0:02:31
Resolving aliases...
Sorting... sorted in 0:02:31
Resolved aliases in 0:02:31
Finalized in 0:05:33
All done in 6:11:52

I am beginning to believe that it is indeed my server's resources that are the culprit; slobbing frwiktionary at the same time as other big wikis is not good.
I will cron the slobs on different days then, so there are no collisions.
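A crontab along those lines might look like this (script paths, days and times are made up; the point is simply that the jobs cannot overlap):

```
# m h dom mon dow  command
0 2 * * 1  /home/franc/aard/slob-frwiki.sh
0 2 * * 3  /home/franc/aard/slob-frwiktionary.sh
0 2 * * 5  /home/franc/aard/slob-dewiktionary.sh
```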

Thanks.
frank

AardF...@web.de

Dec 15, 2023, 7:23:51 AM
to aard...@googlegroups.com
If you run them from a batch script, you can use a for loop; then there will be no overlap.
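Such a sequential wrapper could be sketched like this (`run_all` and the MW2SLOB override are hypothetical; the databases and flags follow the commands used earlier in the thread). Because the runs execute one after another in the loop, they can never overlap:

```shell
# Hypothetical sequential wrapper; MW2SLOB can be overridden for testing.
run_all() {
    local runner="${MW2SLOB:-mw2slob}"
    local db
    for db in frwiki frwiktionary dewiktionary; do
        "$runner" scrape "http://admin:password@localhost:5984/$db" \
            -f common wikt --local-namespace 0 100 || return 1
    done
}
```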


Markus


From: franc <franc...@gmail.com>
Sent: Friday, December 15, 2023 10:42
To: aarddict

Subject: Re: Please update fi.wiktionary.org
--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.

Igor Tkach

Dec 15, 2023, 8:53:01 AM
to aard...@googlegroups.com
On Fri, Dec 15, 2023 at 4:42 AM franc <franc...@gmail.com> wrote:
franc wrote on Wednesday, December 13, 2023 at 13:55:10 UTC+1:
...
I restarted the mw2slob now (at 13:07), without mw2slob'ing any other wiki now, maybe my server is not that strong machine I would like it were and the error was indeed from low ressources. My command now (in a bash file):
mw2slob scrape http://admin:password@localhost:5984/frwiki -f common wikt --local-namespace 0 100 --ensure-ext-image-urls --no-math 2>&1 | tee /home/franc/aard/log/mwscrape2slob-frwiktionary_2023-12-13.log
With this I have the errors in a log file (and also on the console), so I can look if there were other documents that big as 404 kB in the mw2slob.
Will post again, when error or finished...

OK then, It worked this time :)

Glad to hear it :) I just got the scrape data from Markus (thank you!): ~4.8M articles, which is close to the expected number. I also compiled fr wiktionary with no errors.

 

AardF...@web.de

Dec 15, 2023, 10:03:04 AM
to aarddict
I have a question for you guys.

Here are my findings:
This is a comparison of frwiki_2023-12-12.slob from https://7fw.de/download/wiki/fr/, which was scraped,
with frwiki-20231201.slob from https://ftp.halifax.rwth-aachen.de/aarddict/frwiki/, which was generated from a Wikimedia dump file.
The data of the scraped version is a few days newer than the data of the dump file.
Surprisingly, the dump file contains more articles than the scraped version.
It would be interesting to know the number of articles in the CouchDB of the scraped data.
The question is whether the difference of approximately 8%, or 360 000 articles, is significant.
Is there a way to compare the files and see the articles that are in one file but not in the other?
Markus

/home/markus/Downloads/frwiki_2023-12-12.slob
---------------------------------------------
         id: fe31895467164e36a98bad13f7d4366b
   encoding: utf-8
compression: lzma2
 blob count: 2590025
  ref count: 4200239


CONTENT TYPES
-------------
text/html;charset=utf-8
application/javascript
image/svg+xml
text/css


TAGS
----
version.python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
 version.pyicu: 2.8.1
   version.icu: 70.1
    created.at: 2023-12-11T14:02:58.677622+00:00
  license.name: Creative Commons Attribution-Share Alike 4.0
   license.url: https://creativecommons.org/licenses/by-sa/4.0/deed.fr
    created.by:
     copyright:
        source: http://fr.wikipedia.org
           uri: http://fr.wikipedia.org
         label: Wikipédia (fr)




/home/markus/data/tmp/Aard/slob/frwiki/frwiki20231201-slob/frwiki-20231201.slob
-------------------------------------------------------------------------------
         id: 4e242c1556be48369f22c30a1a147c9c
   encoding: utf-8
compression: lzma2
 blob count: 2678885
  ref count: 4564918


CONTENT TYPES
-------------
text/html;charset=utf-8
application/javascript
image/svg+xml
text/css


TAGS
----
version.python: 3.9.2 (default, Feb 28 2021, 17:03:44)  [GCC 10.2.1 20210110]
 version.pyicu: 2.5
   version.icu: 67.1
    created.at: 2023-12-07T04:59:52.954216+00:00
  license.name: Creative Commons Attribution-Share Alike 4.0
   license.url: https://creativecommons.org/licenses/by-sa/4.0/deed.fr
    created.by: MHBraun
     copyright:
        source: http://fr.wikipedia.org
           uri: http://fr.wikipedia.org
         label: Wikipédia (fr)




Igor Tkach

Dec 15, 2023, 11:43:21 AM
to aard...@googlegroups.com
On Fri, Dec 15, 2023 at 10:03 AM 'AardF...@web.de' via aarddict <aard...@googlegroups.com> wrote:
I have a question to you guys.

Here are my findings:
This is the comparison of the frwiki_20231212.slob from https://7fw.de/download/wiki/fr/ which was scraped
and the frwiki-20231201.slob from https://ftp.halifax.rwth-aachen.de/aarddict/frwiki/ which was generated from a dump file of wikimedia
The data of the scraped version is some days newer than the data of the dump file.
Surprisingly the dump file contains more articles than the scraped version.

Comments at https://phabricator.wikimedia.org/T305407 mentioned dumps contain duplicates. Those would carry over to .slob.
 
It would be interesting to know the numbers of articles in the couchdb of the scraped data.

It should be the same as the blob count if mw2slob finished without errors.
 
The question is if the difference of approximately 8% or 360 000 articles is significant or not.

¯\_(ツ)_/¯
 
Is there a way to compare the files and see the articles which are in one file but not in the other?

Sure, such a program can be written :) But with dumps clearly broken at the moment (https://phabricator.wikimedia.org/T305407 and https://phabricator.wikimedia.org/T345176) it hardly seems worth the effort.
There are also indications that the API used for scraping may not be completely healthy either: my attempts to scrape frwiktionary just now looked like they finished successfully, but resulted in way too few articles.
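Sketched in Python, such a comparison reduces to a set difference over article titles. `diff_titles` is a hypothetical helper; actually extracting the title lists from the two .slob files (e.g. with the slob library) is left aside here:

```python
def diff_titles(titles_a, titles_b):
    """Return (only_in_a, only_in_b): sorted lists of titles unique to each side."""
    a, b = set(titles_a), set(titles_b)
    return sorted(a - b), sorted(b - a)
```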


AardF...@web.de

Dec 16, 2023, 5:25:42 AM
to aarddict
Alright, I would rather have duplicates than lost articles.
I was not aware of this.
The data structure of Wikimedia seems to be a mess.
A little scary is the point about the inconsistent API for scraping. All we can do is monitor it and verify that the results look plausible.
Thank you for the clarification.
