Frank, you created an April version as well. It is hosted on RWTH ;)
...
... I have to say that I am on an old Ubuntu 18.04 (I will upgrade to 22.04 soon), so it could be that some of the packages are too old. ...
So if nobody here can tell me what could be wrong, I will stop the creation of frwiki and frwiktionary until my system is on 22.04, and then give it a new try. I guess it is related, perhaps indirectly, to a Python issue (an old version or such).
Don't know, sorry...
frank
...
When ready, I will update the github-wiki...
An IncompleteRead error indicates that the client (mw2slob) received less data than the server (CouchDB) promised. This may happen due to a network error or a bug in the server implementation (CouchDB). Is CouchDB running on the same machine as mw2slob?
Does the error occur consistently?
If so, then it's probably not the network, although a quick search in https://github.com/apache/couchdb/issues didn't turn up anything that looked relevant. It's interesting that the error happens when trying to load an atypically large document (at least 414405 bytes, that is ~404 kB). I tried to scrape some articles with titles starting near the title after which it failed, but didn't get the error (quite possibly I didn't scrape enough to hit the offending document).
I'll scrape frwiktionary and see if I can reproduce the error.
In the meantime, I'd be curious whether you reliably get the same error if you run mw2slob again.
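For what it's worth, Python's `http.client` raises `IncompleteRead` with the partially received bytes attached, so a client can detect the short read and retry the request. A minimal sketch of that pattern (the `fetch` callable and retry policy here are hypothetical illustrations, not mw2slob's actual code):

```python
import http.client

def fetch_with_retry(fetch, retries=3):
    """Call fetch(), retrying when the server sends fewer bytes than promised."""
    for attempt in range(retries):
        try:
            return fetch()
        except http.client.IncompleteRead as e:
            # e.partial holds the bytes received before the read was cut short
            print(f"attempt {attempt + 1}: got only {len(e.partial)} bytes, retrying")
    raise RuntimeError("document could not be fetched completely")

# Simulated flaky fetch: fails once with a short read, then succeeds.
state = {"calls": 0}

def flaky_fetch():
    state["calls"] += 1
    if state["calls"] == 1:
        # Mimic the error from this thread: ~404 kB promised, less delivered
        raise http.client.IncompleteRead(b"x" * 1000, expected=414405)
    return b"full document body"

print(fetch_with_retry(flaky_fetch))  # → b'full document body'
```

If the error is transient (network or resource pressure), a retry like this succeeds on the second attempt; if it fails consistently on the same document, that points at the server side instead.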
I restarted mw2slob now (at 13:07), without mw2slob'ing any other wiki this time; maybe my server is not as strong a machine as I would like, and the error was indeed from low resources. My command now (in a bash file):
mw2slob scrape http://admin:password@localhost:5984/frwiki -f common wikt --local-namespace 0 100 --ensure-ext-image-urls --no-math 2>&1 | tee /home/franc/aard/log/mwscrape2slob-frwiktionary_2023-12-13.log
My frwiktionary generated with mwscrape holds 4,810,708 articles in CouchDB v3.3.2.
From: aard...@googlegroups.com [mailto:aard...@googlegroups.com] On Behalf Of itkach
Sent: Wednesday, December 13, 2023 18:28
To: aarddict <aard...@googlegroups.com>
Subject: Re: Please update fi.wiktionary.org
Looks like my scrape finished, and mw2slob compiles it without errors, but it's only ~1.57M articles, while https://www.wiktionary.org/ claims it should be ~4.9M 🤔
--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aarddict/0a9e8416-5fea-48dc-8e2b-7f9bd02e1f18n%40googlegroups.com.
Wanna give it a try to sync with mine?
As far as I remember, we did that a decade ago…
From: aard...@googlegroups.com [mailto:aard...@googlegroups.com] On Behalf Of Igor Tkach
Sent: Wednesday, December 13, 2023 19:35
To: aard...@googlegroups.com
Subject: Re: Please update fi.wiktionary.org
I guess I need to scrape some more :)
I have the job on my screen; send the https:// address of your database.
No authentication needed on my side.
Do you want to replicate into a new database or into your existing database?
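Either way works with CouchDB's `/_replicate` endpoint: you POST a JSON document naming a source and a target, and setting `create_target` makes CouchDB create the target database if it doesn't exist yet. A sketch that builds such a request (the URLs and database names below are placeholders, not the actual servers from this thread):

```python
import json
import urllib.request

def replication_request(couch_url, source, target, create_target=False):
    """Build a POST request for CouchDB's /_replicate endpoint.

    Set create_target=True to replicate into a database that does not
    exist yet; leave it False to merge into an existing database.
    """
    body = {"source": source, "target": target}
    if create_target:
        body["create_target"] = True
    return urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example: pull a hypothetical remote frwiktionary into a new local database.
req = replication_request(
    "http://admin:password@localhost:5984",
    "https://example.org/frwiktionary",  # remote source (placeholder URL)
    "frwiktionary",                      # local target
    create_target=True,
)
# urllib.request.urlopen(req) would then start the replication.
```

Replicating into a fresh database is the safer choice for comparing the two datasets afterwards; merging into the existing one combines both sets of revisions.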
...
With this I have the errors in a log file (and also on the console), so I can check whether there were other documents as big as 404 kB in the mw2slob run.
Will post again when it errors out or finishes...
franc wrote on Wednesday, December 13, 2023 at 13:55:10 UTC+1: ...
OK then, it worked this time :)
I have a question for you guys. Here are my findings: this is a comparison of frwiki_20231212.slob from https://7fw.de/download/wiki/fr/, which was scraped,
and frwiki-20231201.slob from https://ftp.halifax.rwth-aachen.de/aarddict/frwiki/, which was generated from a Wikimedia dump file.
The data of the scraped version is some days newer than the data of the dump file.
Surprisingly, the dump file contains more articles than the scraped version.
It would be interesting to know the number of articles in the CouchDB of the scraped data.
The question is whether the difference of approximately 8%, or 360,000 articles, is significant.
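CouchDB reports that number itself: `GET /{db}` (e.g. `curl http://admin:password@localhost:5984/frwiki`) returns a database info document containing a `doc_count` field. A small parsing sketch (the example response body below uses the count mentioned earlier in this thread; the exact set of fields CouchDB returns varies by version):

```python
import json

def article_count(db_info_json):
    """Extract doc_count from the body of CouchDB's GET /{db} response.

    Note: doc_count is the total number of documents in the database,
    which may include design documents and other non-article documents,
    so it can differ slightly from the true article count.
    """
    return json.loads(db_info_json)["doc_count"]

# Example response body (doc_count value taken from this thread):
info = '{"db_name": "frwiktionary", "doc_count": 4810708, "doc_del_count": 0}'
print(article_count(info))  # → 4810708
```

Comparing this number against the article counts of the two .slob files would show where the ~8% gap comes from.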
Is there a way to compare the files and see the articles that are in one file but not in the other?
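I don't know of a built-in diff in the slob tools, but assuming you can dump each file's titles (one per line, however your slob reader exposes them), a plain set difference answers the question. A minimal sketch with tiny stand-in title lists:

```python
def diff_titles(titles_a, titles_b):
    """Return (only_in_a, only_in_b) as sorted lists of titles."""
    a, b = set(titles_a), set(titles_b)
    return sorted(a - b), sorted(b - a)

# Tiny example lists standing in for the title dumps of the two .slob files:
scraped = ["chat", "chien", "maison"]
dump = ["chat", "chien", "oiseau", "pomme"]

only_scraped, only_dump = diff_titles(scraped, dump)
print(only_scraped)  # → ['maison']
print(only_dump)     # → ['oiseau', 'pomme']
```

With the real title dumps (a few million lines each), the same two set operations would show exactly which ~360,000 articles the scrape is missing.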
--