
enwiktionary


Alhaitham Ibrahim

Aug 25, 2024, 4:16:03 PM
to aarddict
Hello

I know that you are having a problem with scraping Wiktionary.

I would like to know whether you have tried using the dumps again instead of scraping.

There was a discussion about the size of the Enterprise HTML dump.


But now it seems that the size has gone up again [13.62 GB], so it may be worth a try.


Thanks

AardF...@web.de

Aug 29, 2024, 4:55:11 AM
to aarddict
The namespaces which some people are using are not in the dumps. They need to be scraped separately. Hence the compilation does not work without couchdb.

I deleted the couchdb setup with all databases and started from scratch.
Currently I am scraping and compiling enwiktionary-20240820 as a test.
It will take a couple of days until finished.
Would you please check the file if ready?

Alhaitham Ibrahim

Aug 29, 2024, 8:10:36 AM
to aarddict
> Would you please check the file if ready?

Not sure what file and what to check

AardF...@web.de

Aug 29, 2024, 10:57:54 AM
to aard...@googlegroups.com

I am creating a slob file enwiktionary-20240820.slob and will upload it to

ftp.halifax.rwth-aachen.de/aarddict/enwiki

I will let you know here when it is available.

From there you may download the file and check if it is working fine for you.

 

Let me know if you are not interested.

 

Thank you

 

Markus


Alhaitham Ibrahim

Aug 29, 2024, 11:24:59 AM
to aarddict
> enwiktionary-20240820.slob

That is exactly what I was hoping for, and I am very much interested

Thanks

SailorVenusFan

Aug 31, 2024, 1:23:42 PM
to aarddict
The 8/20 file is not there. Do you have discord too? I want that for the development of Aard 2

AardF...@web.de

Sep 1, 2024, 4:13:21 AM
to aard...@googlegroups.com
The enwiktionary-20240820 is still in progress. It takes a couple of days to generate.
ETA could be sometime tonight...
Then it will be synced tomorrow morning.


Markus Braun


From: SailorVenusFan <fedijus...@gmail.com>
Sent: Saturday, August 31, 2024 19:23
To: aarddict
Subject: Re: enwiktionary

AardF...@web.de

Sep 1, 2024, 4:15:54 AM
to aard...@googlegroups.com
Nope, no Discord.
All communication on one platform only, please. We use this Google platform for the time being.

Markus


From: SailorVenusFan <fedijus...@gmail.com>
Sent: Saturday, August 31, 2024 19:23
To: aarddict
Subject: Re: enwiktionary

AardF...@web.de

Sep 1, 2024, 4:37:44 AM
to aarddict
it will be synced this evening

Alhaitham Ibrahim

Sep 2, 2024, 2:22:04 AM
to aarddict
Working great, thanks a lot

Screenshot_20240902_091528_itkach.aard2_1.jpg

JSToJestJaSam

Sep 2, 2024, 2:25:30 PM
to aarddict
Does this mean that fiwiktionary can finally be updated too?

Nikolai Yourin

Sep 2, 2024, 2:46:15 PM
to aarddict
Looks promising, thanks a lot Markus! Here's a brief comparison:

20230601 => 20240820: 733183 additions / 11039 deletions
20240315 => 20240820: 575397 additions / 6093 deletions

Strangely enough, this version contains a lot of inflected forms that were not included in enwiktionary-20240315,
even though most of the articles in question are not that recent:

vulcanizássedes
    (reintegrationist norm) second-person plural imperfect subjunctive of vulcanizar
    (last edited on 11 December 2023)

zoomorfizó
    third-person singular preterite indicative of zoomorfizar
    (last edited on 9 December 2023)

On Sunday, September 1, 2024 at 11:37:44 AM UTC+3 AardF...@web.de wrote:

AardF...@web.de

Sep 2, 2024, 7:49:57 PM
to aard...@googlegroups.com
Yes.
I will update the wiktionaries upon request. 
So far en and fi ;) 


Markus


From: JSToJestJaSam <joonata...@gmail.com>
Sent: Monday, September 2, 2024 20:25

AardF...@web.de

Sep 2, 2024, 7:53:17 PM
to aard...@googlegroups.com
Nikolai,
This looks great.
Yes, I included the namespaces.
The namespaces are scraped into a couchdb and then combined with the NS0 dump.

How did you generate those figures? 


Markus



From: Nikolai Yourin <n.yo...@gmail.com>
Sent: Monday, September 2, 2024 20:46

Alhaitham Ibrahim

Sep 2, 2024, 9:06:07 PM
to aarddict
Just found out that there are a lot of duplicated entries

Example:


Screenshot_20240903_034757_itkach.aard2_1.jpg
Screenshot_20240903_034802_itkach.aard2_1.jpg

AardF...@web.de

Sep 3, 2024, 4:48:07 AM
to aarddict
Interesting. Is it a duplicate article or a duplicate link to an article?

Alhaitham Ibrahim

Sep 3, 2024, 12:54:17 PM
to aarddict
Duplicate articles, as you can see, there are small differences like the beginning of Etymology

There are even triplicated articles like "farang"

Nikolai Yourin

Sep 3, 2024, 4:40:10 PM
to aarddict
For this to work, you'll need to have GoldenDict installed.
In GoldenDict, click 'Dictionary headwords' -> Export, then sort the resulting list:
    sort list1.txt >headwords1.txt

Once you have two sorted headword lists, diff them:
    diff -u headwords1.txt headwords2.txt | tail -n +3 >headwords.diff
    grep -c ^+ headwords.diff
    grep -c ^- headwords.diff
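
The same comparison can be sketched in Python without diff. This is only a rough equivalent: it compares the two lists as sets, so repeated headwords are counted once and the totals may differ slightly from the line-based diff counts. The function names are made up for illustration.

```python
def compare_headwords(old_lines, new_lines):
    """Return (additions, deletions) between two iterables of headword lines."""
    old = {line.rstrip("\n") for line in old_lines}
    new = {line.rstrip("\n") for line in new_lines}
    # Additions are headwords only in the new list, deletions only in the old one.
    return len(new - old), len(old - new)


def compare_files(old_path, new_path):
    """Same comparison, reading one headword per line from two text files."""
    with open(old_path, encoding="utf-8") as f_old, \
            open(new_path, encoding="utf-8") as f_new:
        return compare_headwords(f_old, f_new)
```

Called as compare_files("headwords1.txt", "headwords2.txt"), it returns the two counts as a tuple. Because sets are used, the input files do not need to be sorted first.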


I'd prefer to do this programmatically, but 'slob' (the Python module) won't let me:

    with slob.open(sys.argv[1]) as r:
        headwords = list(r.as_dict())

It just eats up all available memory no matter how tiny the input file is.

On Tuesday, September 3, 2024 at 2:53:17 AM UTC+3 AardF...@web.de wrote:

AardF...@web.de

Sep 3, 2024, 6:10:39 PM
to aard...@googlegroups.com
Not sure where this comes from. As these are articles, the duplicates seem to reside in the NS0 dump. 

Markus


From: Alhaitham Ibrahim <compu...@gmail.com>
Sent: Tuesday, September 3, 2024 18:54

Igor Tkach

Sep 3, 2024, 6:50:03 PM
to aard...@googlegroups.com
On Tue, Sep 3, 2024 at 4:40 PM Nikolai Yourin <n.yo...@gmail.com> wrote:
>     with slob.open(sys.argv[1]) as r:
>         headwords = list(r.as_dict())

A slob instance is already a sorted sequence, so you can just iterate over it, like so:

for item in r:
    print(item.key)

or use a list comprehension or a generator expression. This works:

with slob.open(sys.argv[1]) as r:
    # The generator is lazy: consume it while the file is still open
    headwords = (item.key for item in r)
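
A minimal sketch of the streaming approach, for anyone who wants to dump headwords to a file without building a list first. It assumes only what is shown above, namely that iterating over the reader yields items with a .key attribute. write_headwords and FakeItem are hypothetical names, and FakeItem merely stands in for real slob items so the example runs without a .slob file.

```python
import io
from collections import namedtuple


def write_headwords(items, out):
    """Write each item's key on its own line and return the number written.

    Items are consumed one at a time, so nothing is accumulated in memory.
    """
    count = 0
    for item in items:
        out.write(item.key + "\n")
        count += 1
    return count


# Hypothetical stand-in for the items a slob reader yields (only .key is used).
FakeItem = namedtuple("FakeItem", "key")

buf = io.StringIO()
written = write_headwords([FakeItem("farang"), FakeItem("zoomorfizó")], buf)

# With a real file this would be called inside the with block, for example:
#     with slob.open(sys.argv[1]) as r:
#         write_headwords(r, sys.stdout)
```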
 

Nikolai Yourin

Sep 5, 2024, 3:18:12 PM
to aarddict
Thank you Igor, it makes sense now.

Markus, I'm attaching a tiny script that can be run either like this:

    $ slob-headwords.py enwiktionary-20240820.slob >headwords.txt 2>duplicates.txt

or like this:

    $ slob-headwords.py enwiktionary-20240315.slob enwiktionary-20240820.slob
    enwiktionary-20240315.slob: 7,781,227 unique headwords + 130 duplicates
    enwiktionary-20240820.slob: 8,350,590 unique headwords + 264,274 duplicates
    575,427 additions / 6,064 deletions

As you can see, enwiktionary-20240820 does indeed contain a lot of duplicates
(the number of additions is still accurate though, since duplicate entries do not get counted towards that number).

For the sake of speed, the script makes no distinction between real headwords and built-in styles/scripts
such as '~/MathJax/MediaWiki.js' (because accessing 'item.content_type' seems to be quite costly).
slob-headwords.py
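
For readers without the attachment, the counting could look roughly like the sketch below. This is not the attached script. It assumes that "duplicates" means extra copies of a headword beyond the first, and that additions and deletions are computed on unique headwords only. The function names are made up for illustration.

```python
from collections import Counter


def headword_stats(keys):
    """Return (counts, unique, duplicates) for an iterable of headwords.

    'duplicates' counts every extra copy beyond the first occurrence
    (an assumed definition, not necessarily the attached script's).
    """
    counts = Counter(keys)
    unique = len(counts)
    duplicates = sum(counts.values()) - unique
    return counts, unique, duplicates


def compare(old_counts, new_counts):
    """Return (additions, deletions) based on unique headwords only."""
    additions = len(new_counts.keys() - old_counts.keys())
    deletions = len(old_counts.keys() - new_counts.keys())
    return additions, deletions
```

With a slob file, keys would be something like (item.key for item in r), consumed while the file is still open.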

Alhaitham Ibrahim

Sep 5, 2024, 3:52:07 PM
to aarddict
If the duplicate issue is present in the enterprise dumps then it might have been fixed in the latest one

enwiktionary-NS0-20240820-ENTERPRISE-HTML.json.tar.gz
[13.62 GB]

enwiktionary-NS0-20240901-ENTERPRISE-HTML.json.tar.gz
[13.14 GB]

It is good to know that nothing is missing from 20240820
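
One way to check whether the duplicates already exist in the dump itself is to count titles straight from the archive. The sketch below assumes the archive holds newline-delimited JSON files and that each record carries its page title in a "name" field. Both are assumptions about the dump layout that should be verified against a real file. The function names are hypothetical.

```python
import json
import tarfile
from collections import Counter


def count_dump_titles(tar_path, title_field="name"):
    """Count how often each title occurs in a .json.tar.gz dump.

    Assumes one JSON object per line and a title stored under
    `title_field` (the "name" default is an assumption).
    """
    counts = Counter()
    with tarfile.open(tar_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as fh:
                for raw in fh:
                    if raw.strip():
                        counts[json.loads(raw)[title_field]] += 1
    return counts


def duplicated_titles(counts):
    """Return only the titles that occur more than once."""
    return {title: n for title, n in counts.items() if n > 1}
```

If a title such as "farang" shows up more than once here, the duplication comes from the dump rather than from the compilation step.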

AardF...@web.de

Sep 7, 2024, 4:08:08 AM
to aarddict
Thank you, will check it with the enwiktionary-20240901.slob when it is compiled

Nikolai Yourin

Sep 8, 2024, 12:58:17 PM
to aarddict
I think this is a clear improvement over 20240820:

    $ slob-headwords.py enwiktionary-20240820.slob enwiktionary-20240901.slob

    enwiktionary-20240820.slob: 8,350,590 unique headwords + 264,274 duplicates
    enwiktionary-20240901.slob: 8,367,203 unique headwords + 12,975 duplicates
    16,813 additions / 200 deletions

No idea what to make of the duplicates list (attached, as well as a slightly improved version of the script).
There are 12,673 duplicate headwords, of which 7,086 are annex entries (e.g. 'Reconstruction:Proto-Algonquian/net-') that most people wouldn't care about, so that's a relief.
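
A breakdown like this can be approximated by grouping the duplicate headwords on whatever precedes the first colon. This is a rough sketch: it treats any text before a colon as a namespace, which will misclassify ordinary headwords that happen to contain a colon. The function name and the "(main)" label are made up for illustration.

```python
from collections import Counter


def duplicates_by_namespace(duplicate_keys):
    """Group headwords by the text before the first ':'.

    Headwords without a colon are counted under '(main)'.
    """
    groups = Counter()
    for key in duplicate_keys:
        prefix, sep, _rest = key.partition(":")
        groups[prefix if sep else "(main)"] += 1
    return groups
```

Feeding it the lines of the duplicates list gives a per-prefix count, so entries such as 'Reconstruction:...' can be separated from main-namespace headwords.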

13 extra copies of this?
https://en.wiktionary.org/wiki/%E2%9D%A8
(last edited on 2 September 2024, at 15:52)
slob-headwords.py.gz
duplicates-20240901.txt

AardF...@web.de

Sep 8, 2024, 1:51:47 PM
to aard...@googlegroups.com

Thank you for the updated code.

 

I would not do anything with the duplicates. They are residing in the dumps.

If they are in the annexes, I guess they are there for a reason and the duplication is just a side effect.

… and next month we are starting over anyway.

I am glad they reduced the duplicates significantly

Nikolai Yourin

Sep 8, 2024, 5:21:27 PM
to aarddict
Absolutely.
I'm not even sure if it's possible to do any better here.

Just for the record, enwiktionary-20230601.slob, which was also made from an enterprise dump, contains 185,384 duplicates (spread over 126,306 headwords).

JSToJestJaSam

Sep 9, 2024, 5:49:41 AM
to aarddict
Wow! Thanks a million for updating fiwikt!!! You're amazing! This helps me so much!

Alhaitham Ibrahim

Nov 14, 2024, 9:59:39 PM
to aarddict
Hello, hope you are doing great

Just letting you know the latest enwiktionary folder is empty


No need to do it again, just for future reference that it can happen

Markus Braun

Nov 15, 2024, 1:10:41 PM
to aard...@googlegroups.com
Thank you for the hint. The script did not go through. I am waiting for the wikis to finish and will update the wiktionaries then.
Both together are too heavy a load for my little machine :-)


Markus Braun


From: Alhaitham Ibrahim <compu...@gmail.com>
Sent: Friday, November 15, 2024 03:59