enwiktionary

Alhaitham Ibrahim

Aug 25, 2024, 4:16:03 PM

to aarddict

Hello

I know that you are having problems scraping Wiktionary.

I would like to know whether you have tried using the dumps again instead of scraping.

There was a discussion about the size of the Enterprise HTML dump:

https://github.com/itkach/aard2-android/issues/168#issuecomment-1756079555

But now it seems that the size has gone up again [13.62 GB], so it may be worth a try:

https://dumps.wikimedia.org/other/enterprise_html/runs/20240820/enwiktionary-NS0-20240820-ENTERPRISE-HTML.json.tar.gz

Thanks

AardF...@web.de

Aug 29, 2024, 4:55:11 AM

to aarddict

The namespaces that some people use are not in the dumps. They need to be scraped separately. Hence the compilation does not work without CouchDB.

I deleted the couchdb setup with all databases and started from scratch.

Currently I am scraping and compiling enwiktionary-20240820 as a test.

It will take a couple of days to finish.

Would you please check the file once it is ready?

Alhaitham Ibrahim

Aug 29, 2024, 8:10:36 AM

to aarddict

> Would you please check the file if ready?

I am not sure which file you mean or what to check.

AardF...@web.de

Aug 29, 2024, 10:57:54 AM

to aard...@googlegroups.com

I am creating a slob file enwiktionary-20240820.slob and will upload it to

ftp.halifax.rwth-aachen.de/aarddict/enwiki

I will let you know here when it is available.

From there you may download the file and check if it is working fine for you.

Let me know if you are not interested.

Thank you

Markus

--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aarddict/324fd023-7d3a-4aab-98db-c86990c399aen%40googlegroups.com.

Alhaitham Ibrahim

Aug 29, 2024, 11:24:59 AM

to aarddict

> enwiktionary-20240820.slob

That is exactly what I was hoping for, and I am very much interested.

Thanks

SailorVenusFan

Aug 31, 2024, 1:23:42 PM

to aarddict

The 8/20 file is not there. Do you have Discord too? I want that for the development of Aard 2.

AardF...@web.de

Sep 1, 2024, 4:13:21 AM

to aard...@googlegroups.com

The enwiktionary-20240820 file is still in progress. It takes a couple of days to generate.

The ETA could be sometime tonight...

Then it will be synced tomorrow morning.

Markus Braun


AardF...@web.de

Sep 1, 2024, 4:15:54 AM

to aard...@googlegroups.com

Nope, no Discord.

All communication on one platform only, please. We use this Google platform for the time being.

Markus


AardF...@web.de

Sep 1, 2024, 4:37:44 AM

to aarddict

It will be synced this evening.

Alhaitham Ibrahim

Sep 2, 2024, 2:22:04 AM

to aarddict

Working great, thanks a lot

Screenshot_20240902_091528_itkach.aard2_1.jpg

JSToJestJaSam

Sep 2, 2024, 2:25:30 PM

to aarddict

Does this mean that fiwiktionary can finally be updated too?

Nikolai Yourin

Sep 2, 2024, 2:46:15 PM

to aarddict

Looks promising, thanks a lot Markus! Here's a brief comparison:

20230601 => 20240820: 733183 additions / 11039 deletions
20240315 => 20240820: 575397 additions / 6093 deletions

Strangely enough, this version contains a lot of inflected forms that were not included in enwiktionary-20240315, even though most of the articles in question are not that recent:

vulcanizássedes
(reintegrationist norm) second-person plural imperfect subjunctive of vulcanizar
(last edited on 11 December 2023)

zoomorfizó
third-person singular preterite indicative of zoomorfizar
(last edited on 9 December 2023)


AardF...@web.de

Sep 2, 2024, 7:49:57 PM

to aard...@googlegroups.com

Yes.

I will update the wiktionaries upon request.

So far en and fi ;)

Markus


AardF...@web.de

Sep 2, 2024, 7:53:17 PM

to aard...@googlegroups.com

Nikolai,

This looks great.

Yes, I included the namespaces.

The namespaces are scraped into CouchDB and then combined with the NS0 dump.

How did you generate those figures?

Markus


Alhaitham Ibrahim

Sep 2, 2024, 9:06:07 PM

to aarddict

I just found out that there are a lot of duplicate entries.

Example:

Screenshot_20240903_034757_itkach.aard2_1.jpg

Screenshot_20240903_034802_itkach.aard2_1.jpg

AardF...@web.de

Sep 3, 2024, 4:48:07 AM

to aarddict

Interesting. Is it a duplicate article or a duplicate link to an article?

Alhaitham Ibrahim

Sep 3, 2024, 12:54:17 PM

to aarddict

Duplicate articles. As you can see, there are small differences between them, such as the beginning of the Etymology section.

There are even triplicate articles, like "farang".

Nikolai Yourin

Sep 3, 2024, 4:40:10 PM

to aarddict

For this to work, you'll need to have GoldenDict installed.
In GoldenDict, click 'Dictionary headwords' -> Export, then sort the resulting list:

    sort list1.txt >headwords1.txt

Once you have two sorted headword lists, diff them:

    diff -u headwords1.txt headwords2.txt | tail -n +3 >headwords.diff
    grep -c ^+ headwords.diff
    grep -c ^- headwords.diff
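
The same additions/deletions count can be sketched in Python. This is only a rough equivalent, assuming two plain-text headword lists with one headword per line; the helper names `read_headwords` and `diff_headwords` are made up for illustration. Unlike `diff -u`, it compares sets, so a headword that appears several times in a list is counted once:

```python
def read_headwords(path):
    """Return the set of non-empty lines in a UTF-8 headword list."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def diff_headwords(old, new):
    """Return (additions, deletions) between two sets of headwords."""
    return new - old, old - new


# Example use (file names are the ones from the commands above):
#   added, removed = diff_headwords(read_headwords("headwords1.txt"),
#                                   read_headwords("headwords2.txt"))
#   print(f"{len(added)} additions / {len(removed)} deletions")
```

Sets do not need sorted input, so the `sort` step is not required for this variant.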

I'd prefer to do this programmatically, but 'slob' (the Python module) won't let me:

    with slob.open(sys.argv[1]) as r:
        headwords = list(r.as_dict())

It just eats up all available memory no matter how tiny the input file is.


AardF...@web.de

Sep 3, 2024, 6:10:39 PM

to aard...@googlegroups.com

Not sure where this comes from. As these are articles, the duplicates seem to reside in the NS0 dump.

Markus


Igor Tkach

Sep 3, 2024, 6:50:03 PM

to aard...@googlegroups.com

On Tue, Sep 3, 2024 at 4:40 PM Nikolai Yourin <n.yo...@gmail.com> wrote:

> with slob.open(sys.argv[1]) as r:
>     headwords = list(r.as_dict())

A slob instance is already a sorted sequence; you can just iterate over it, like so:

    for item in r:
        print(item.key)

or use a list comprehension or a generator expression. This works:

    with slob.open(sys.argv[1]) as r:
        headwords = (item.key for item in r)
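
A minimal sketch of how the lazy iteration above could be used to dump headwords without holding them all in memory. The helpers `iter_headwords` and `write_headwords` are hypothetical names, not part of slob; they only assume that iterating over the reader yields items with a `.key` attribute, as in the snippet above:

```python
def iter_headwords(reader):
    """Yield each item's key lazily, one at a time."""
    return (item.key for item in reader)


def write_headwords(reader, out):
    """Write one headword per line to a text stream and return the count."""
    count = 0
    for key in iter_headwords(reader):
        out.write(key + "\n")
        count += 1
    return count


# Example use with a real file (requires the slob module):
#   import sys, slob
#   with slob.open(sys.argv[1]) as r:
#       write_headwords(r, sys.stdout)
```

Because a generator reads from the file only when it is consumed, it has to be consumed inside the `with` block, while the file is still open.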


Nikolai Yourin

Sep 5, 2024, 3:18:12 PM

to aarddict

Thank you Igor, it makes sense now.

Markus, I'm attaching a tiny script that can be run either like this:

$ slob-headwords.py enwiktionary-20240820.slob >headwords.txt 2>duplicates.txt

or like this:

$ slob-headwords.py enwiktionary-20240315.slob enwiktionary-20240820.slob
enwiktionary-20240315.slob: 7,781,227 unique headwords + 130 duplicates
enwiktionary-20240820.slob: 8,350,590 unique headwords + 264,274 duplicates
575,427 additions / 6,064 deletions

As you can see, enwiktionary-20240820 does indeed contain a lot of duplicates
(the number of additions is still accurate though, since duplicate entries do not get counted towards that number).

For the sake of speed, the script makes no distinction between real headwords and built-in styles/scripts such as '~/MathJax/MediaWiki.js' (because accessing 'item.content_type' seems to be quite costly).

slob-headwords.py
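
For readers without the attachment, here is a rough sketch of how such counts could be produced. It is not the attached script; `headword_stats` is a hypothetical helper, and it assumes that "duplicates" means extra copies of a headword beyond the first:

```python
from collections import Counter


def headword_stats(keys):
    """Return (unique, duplicates, duplicated_headwords) for an iterable of keys.

    'duplicates' is the number of extra copies beyond the first occurrence;
    'duplicated_headwords' is the sorted list of keys seen more than once.
    """
    counts = Counter(keys)
    unique = len(counts)
    duplicates = sum(counts.values()) - unique
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    return unique, duplicates, duplicated


# Example use with a slob file (requires the slob module):
#   import sys, slob
#   with slob.open(sys.argv[1]) as r:
#       unique, dups, _ = headword_stats(item.key for item in r)
#   print(f"{unique:,} unique headwords + {dups:,} duplicates")
```

Like the script described above, this reads only `item.key`, so built-in styles and scripts are counted along with real headwords.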

Alhaitham Ibrahim

Sep 5, 2024, 3:52:07 PM

to aarddict

If the duplicate issue is present in the enterprise dumps then it might have been fixed in the latest one

enwiktionary-NS0-20240820-ENTERPRISE-HTML.json.tar.gz [13.62 GB]

enwiktionary-NS0-20240901-ENTERPRISE-HTML.json.tar.gz [13.14 GB]

It is good to know that nothing is missing from 20240820

AardF...@web.de

Sep 7, 2024, 4:08:08 AM

to aarddict

Thank you. I will check it with enwiktionary-20240901.slob once it is compiled.

Nikolai Yourin

Sep 8, 2024, 12:58:17 PM

to aarddict

I think this is a clear improvement over 20240820:

$ slob-headwords.py enwiktionary-20240820.slob enwiktionary-20240901.slob
enwiktionary-20240820.slob: 8,350,590 unique headwords + 264,274 duplicates
enwiktionary-20240901.slob: 8,367,203 unique headwords + 12,975 duplicates
16,813 additions / 200 deletions

No idea what to make of the duplicates list (attached, as well as a slightly improved version of the script).
There are 12,673 duplicate headwords, of which 7,086 are annex entries (e.g. 'Reconstruction:Proto-Algonquian/net-') that most people wouldn't care about, so that's a relief.
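
A breakdown like this could be sketched as follows. The helper `count_by_prefix` is hypothetical, and it rests on a loud assumption: the text before the first colon is treated as a namespace-like prefix, which will misclassify ordinary headwords that happen to contain a colon:

```python
from collections import Counter


def count_by_prefix(headwords):
    """Count headwords per 'Prefix:' (the text before the first colon).

    Headwords without a usable prefix are counted under the empty string.
    """
    counts = Counter()
    for word in headwords:
        prefix, sep, rest = word.partition(":")
        # Only treat it as a prefix if there is text on both sides of the colon.
        counts[prefix if sep and prefix and rest else ""] += 1
    return counts


# Example use on a duplicates list with one headword per line:
#   with open("duplicates-20240901.txt", encoding="utf-8") as f:
#       for prefix, n in count_by_prefix(line.strip() for line in f).most_common():
#           print(n, prefix or "(no prefix)")
```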

13 extra copies of this?
https://en.wiktionary.org/wiki/%E2%9D%A8
(last edited on 2 September 2024, at 15:52)

slob-headwords.py.gz

duplicates-20240901.txt

AardF...@web.de

Sep 8, 2024, 1:51:47 PM

to aard...@googlegroups.com

Thank you for the updated code.

I would not do anything about the duplicates. They reside in the dumps.

If they are in the annexes, I guess they are there for a reason and the duplication is just a side effect.

… and next month we are starting over anyway.

I am glad they reduced the duplicates significantly


Nikolai Yourin

Sep 8, 2024, 5:21:27 PM

to aarddict

Absolutely.

I'm not even sure if it's possible to do any better here.

Just for the record, enwiktionary-20230601.slob, which was also made from an enterprise dump, contains 185,384 duplicates (spread over 126,306 headwords).

JSToJestJaSam

Sep 9, 2024, 5:49:41 AM

to aarddict

Wow! Thanks a million for updating fiwikt!!! You're amazing! This helps me so much!

Alhaitham Ibrahim

Nov 14, 2024, 9:59:39 PM

to aarddict

Hello, hope you are doing great

Just letting you know the latest enwiktionary folder is empty

https://ftp.halifax.rwth-aachen.de/aarddict/enwiki/enwiktionary20241101-slob/

No need to do it again; this is just for future reference, since it can happen.

Markus Braun

Nov 15, 2024, 1:10:41 PM

to aard...@googlegroups.com

Thank you for the hint. The script did not go through. I am waiting for the wikis to finish and will update the wiktionaries then.

Both together are too heavy a load for my little machine :-)

Markus Braun

