Re: elwiki ans elwiktionary update request

Markus Braun

Apr 5, 2024, 5:07:47 PM
to aard...@googlegroups.com
Thank you for sharing your insights. This is very helpful as it does confirm my guess. 

Markus



From: Nikolai Yourin <n.yo...@gmail.com>
Sent: Tuesday, March 19, 2024 20:13
To: aarddict
Subject: Re: elwiki ans elwiktionary update request

I was quite a bit worried about enwiktionary-20230601.slob (and frwiktionary-20230601) being so much bigger than any of the more recent versions, but those "missing" headwords actually turned out to be a non-issue. For the most part, they fall into these categories:

    Appendix:Arabic Frequency List from Quran/*
    Appendix:Communicationssprache/*
    Appendix:English terms of Eskimo-Aleut origin/*
    Appendix:Harry Potter/*
    Appendix:Hungarian suffixes/*
    Appendix:Old Prussian/*
    Appendix:Tagalog surnames/*
    Category:Zulu phrasebook/*
    Reconstruction:Aquitanian/*
    Reconstruction:Classical Nahuatl/*
    Reconstruction:Coptic/*
    Rhymes:Catalan/*
    Rhymes:Zazaki/*
    Template:Acanthamoebidae Hypernyms/*             (I can SO live without that)
    Template:Ernest Hemingway quotation templates/*
    Template:R:hsb:LSS1866/*
    Template:RQ:Bickerstaffe Love in a Village/*
    Template:U:tl:es Spanish borrowing spelling/*
    Template:el-decl-adj-ής-ού-ίδικο-ήδικο/*
    Template:list:Sylheti calendar months/*
    Wiktionary:Vandalism in progress/*
    Wiktionary:Word Competition 2020/*

... and the like, you get the idea.


As for the remaining "missing" articles, they have actually been deleted.
There aren't many of them anyway:

    $ grep -Fcv : missing-20240315.txt
    10951
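
The same breakdown can be sketched in Python. This is only an illustration, assuming the list holds one headword per line; `summarize_missing` and the sample titles are made up for the example. Unlike `grep -Fcv :`, it skips blank lines rather than counting them as plain articles.

```python
from collections import Counter

def summarize_missing(headwords):
    """Count headwords per namespace prefix; titles without ':' are plain articles."""
    by_namespace = Counter()
    plain = 0
    for title in headwords:
        title = title.strip()
        if not title:
            continue  # grep -Fcv would count an empty line; this sketch ignores it
        if ":" in title:
            # Everything before the first colon is treated as the namespace.
            by_namespace[title.split(":", 1)[0]] += 1
        else:
            plain += 1
    return by_namespace, plain

# Hypothetical sample titles, not taken from the real list.
sample = [
    "Appendix:Harry Potter/A",
    "Appendix:Old Prussian/B",
    "Rhymes:Catalan/-a",
    "Template:R:hsb:LSS1866/doc",
    "someword",
]
namespaces, plain = summarize_missing(sample)
print(namespaces.most_common())
print("plain articles:", plain)
```

For the real file, `summarize_missing(open("missing-20240315.txt", encoding="utf-8"))` would give the per-namespace counts in one pass.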


On Friday, March 15, 2024 at 11:05:33 AM UTC+3 AardF...@web.de wrote:
I hear you. 
I have the same concerns.
That's why I tracked the content and compared the dumps and the scraping. Both seem to have glitches in providing data.
I am referring here to the dumps, as I use scraping only for the Wiktionaries.
And in the end I am not sure which count is correct. It looks to me like elwiki20230901 (255.556 items) is overstated.
According to the current statistics, the Greek Wikipedia has 232999 articles as of today, compared to the 237139 in the dump file of March 1st, 2024.
The article count in the slob files counts blobs. Some blobs are needed for internal organisation, so there are slightly more blobs than articles: in this case around 4000 blobs for internal organisation. And this is the best number we can get.
It is _very_ close to the actual number of articles.

These are the historical values for the elwikis:
elwiki20230601   703072 kB   blob count: 225960
elwiki202308*    894704 kB   blob count: 253935
elwiki202312*    718108 kB   blob count: 229878
elwiki202401*    748140 kB   blob count: 235884
elwiki202402*    734596 kB   blob count: 235524
elwiki202403*    746812 kB   blob count: 237139
which looks pretty accurate. 
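
As a quick check of the arithmetic, here is a small sketch using only the figures quoted above. Note that the live article count and the dump were taken on different dates, so the difference is only a rough estimate of the blob overhead.

```python
# Figures quoted in this message.
live_articles = 232999   # article count from the statistics "as of today"
dump_blobs = 237139      # blob count of the March 2024 slob

overhead = dump_blobs - live_articles
print(f"extra blobs: {overhead} ({overhead / dump_blobs:.1%} of all blobs)")

# Blob counts per release, as listed above.
history = [
    ("elwiki20230601", 225960),
    ("elwiki202308*", 253935),
    ("elwiki202312*", 229878),
    ("elwiki202401*", 235884),
    ("elwiki202402*", 235524),
    ("elwiki202403*", 237139),
]

# Change in blob count from one release to the next.
deltas = [cur - prev for (_, prev), (_, cur) in zip(history, history[1:])]
for (name, _), delta in zip(history[1:], deltas):
    print(f"{name}: {delta:+d} blobs")
```

The only large jump, up and then back down, is around the 202308 file; the other releases change by a few thousand blobs at most.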

I can only compile the data I get. The content is given.

But of course you can use whatever version you like,

have fun
Markus


On Tuesday, March 12, 2024 at 6:51:10 PM UTC+1 arnaud wrote:
Thanks for the great work that you do on those updates. Following the Greek Wiktionary and Wikipedia, I note that the updates of the Greek Wiktionary seem to be going fine. As for Wikipedia, elwiki 20240201 counts 235.524 items, which is more than 20231201 (229.878 items) but still fewer than elwiki20230901 (255.556 items), so I am sticking with the September 2023 update!

On Friday, March 17, 2023, Arnaud Prinstet <arnaudp...@gmail.com> wrote:
Great! Thank you for your detailed response, for the great work that you do for aard, and for publishing regular updates of the Wiktionary and Wikipedia archives in all languages!

Mar 17, 2023 11:04:43 AardFeeder <AardF...@web.de>:

Short answer: yes

Long answer:

I did not create the other elwiktionary. However I guess that the compression for that older version is standard.

As our phones have become more powerful, I ran some tests and came to the conclusion that higher compression does not impact usability. I am not using a top-notch phone but a mid-range Galaxy A52, and I can't see a difference.

And I never got a complaint that the wikis are sluggish.

So as my (new) standard I am using 1024 chunks instead of 384, which makes the files smaller.
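
Why larger chunks give smaller files can be illustrated with a rough sketch. It assumes that the chunk figure is the amount of data, in kilobytes, that gets compressed together as one unit, which the message does not state explicitly. It uses Python's `lzma` module on synthetic text and is not the actual slob compiler; `make_articles` and `compressed_size` are names invented for the example.

```python
import lzma
import random

def make_articles(count, seed=0):
    """Build synthetic 'articles' that reuse a shared pool of sentences."""
    rng = random.Random(seed)
    words = [f"word{i}" for i in range(500)]
    sentences = [" ".join(rng.choices(words, k=12)) for _ in range(300)]
    return [
        ". ".join(rng.choices(sentences, k=20)).encode("utf-8")
        for _ in range(count)
    ]

def compressed_size(articles, chunk_kb):
    """Pack articles into chunks of at least chunk_kb KB and compress each chunk separately."""
    limit = chunk_kb * 1024
    total = 0
    chunk = bytearray()
    for article in articles:
        chunk += article
        if len(chunk) >= limit:
            total += len(lzma.compress(bytes(chunk)))
            chunk.clear()
    if chunk:  # compress whatever is left over
        total += len(lzma.compress(bytes(chunk)))
    return total

articles = make_articles(2000)  # roughly 4 MB of text
small = compressed_size(articles, 384)
large = compressed_size(articles, 1024)
print(f"384 KB chunks:  {small} bytes")
print(f"1024 KB chunks: {large} bytes")
```

Each chunk is compressed on its own, so repeated text has to be stored again in every chunk; fewer, larger chunks pay that cost less often. The trade-off is that more data must be decompressed to read a single article, which is the part that was tested on the phone.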

From: aard...@googlegroups.com [mailto:aard...@googlegroups.com] On Behalf Of Arnaud Prinstet
Sent: Friday, March 17, 2023 06:20
To: aard...@googlegroups.com
Subject: Re: elwiki ans elwiktionary update request

Just one strange thing: the size of the elwiktionary (193 MB) is less than last year's version (210 MB). I don't know if that is normal?

Mar 17, 2023 07:02:00 Arnaud Prinstet <arnaudp...@gmail.com>:

This is great! Many thanks!!!

Mar 16, 2023 23:35:06 aard...@gmail.com <aard...@gmail.com>:

It is synchronized and available now. :)

Have fun

On Thursday, March 16, 2023 at 5:49:15 PM UTC+1 arnaud wrote:

Thanks a lot! For now the elwiki directory still doesn't exist, but I am waiting for it!

Mar 11, 2023 18:18:01 arnaud prinstet <arnaudp...@gmail.com>:

Please update the Greek Wiktionary and Wikipedia! Thanks in advance!

--
You received this message because you are subscribed to a topic in the Google Groups "aarddict" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/aarddict/eu_q33XbRhk/unsubscribe.

To unsubscribe from this group and all its topics, send an email to aarddict+u...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "aarddict" group.

To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
