enterprise dumps miss a lot of closed paragraphs

☠☠☠

Feb 8, 2022, 5:04:59 PM
to aarddict
Hi.

I built some Wiktionaries and was happy with how fast everything works now with the enterprise dumps. But then I noticed an important bug:

Many of the sections that are collapsed in the mobile view simply do not exist (have no content, can’t be opened) in aarddict.

For example, the subsections that contain the conjugation or declension of verbs are missing.

Is that a fault in the dumps, or is it a problem in mw2slob?

Regards,
Erik

☠☠☠

Feb 8, 2022, 5:12:49 PM
to aarddict
Also, I am not able to play audio. Nothing happens when I click the play button, and sometimes the audio files aren’t even shown.

Can I somehow check whether this is a problem with the enterprise dumps or with the conversion to slob?

Igor Tkach

Feb 8, 2022, 6:24:32 PM
to aarddict
Which specific dumps and articles have issues? I'll take a look.

--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aarddict/4bf863d7-9224-4fcc-8075-f5d9f50cd9d7n%40googlegroups.com.

Igor Tkach

Feb 8, 2022, 6:44:44 PM
to aarddict


On Tue, Feb 8, 2022, 5:05 PM '☠☠☠' via aarddict <aard...@googlegroups.com> wrote:
> Many of the sections that are collapsed in the mobile view simply do not exist (have no content, can’t be opened) in aarddict.

Those may have been filtered out by content filters. Also, did you apply the right filters? "wikt" as in "-f common wikt" for wiktionary, not "wiki" - easy mistake to make :)




☠☠☠

Feb 9, 2022, 1:14:49 AM
to aarddict
I just had a look into the dumps. It was eswiktionary, and the word was aumentar. The Conjugación section with all its contents is in the dump.

Maybe I got the filters wrong. I’ll try again.

☠☠☠

Feb 9, 2022, 3:04:51 AM
to aarddict
In enwiktionary, the Spanish word aumentar has audio, but not in aarddict.

Would be nice if you could check that.

☠☠☠

Feb 9, 2022, 8:28:36 AM
to aarddict
OK, this seems to be a long-standing problem that does not come from the enterprise dumps. I redid a conversion from the enterprise dump, and the Conjugación section was still missing.

It is also missing in the English Wiktionary.

I took a Wiktionary from http://ftp.halifax.rwth-aachen.de/aarddict/eswiki/eswiktionary20211114-slob/ which was built by scraping, as there were no enterprise dumps. It is missing the Conjugación section as well.

But earlier versions had it (en.wy and es.wy).

The question now is: is it missing from the slob files, or is aarddict not showing it?

Regards,
Erik

☠☠☠

Feb 9, 2022, 8:33:50 AM
to aarddict
Another check, on another phone with old slob files: the es.wy doesn’t have the Conjugación section either (it is from 2017!). The en.wy from 2019 has the conjugation, but the new en.wy from 2022 doesn’t.

So in the Spanish one it was always missing, and in the English one it is missing now.

Can I help solve this somehow?

Regards,
Erik

Igor Tkach

Feb 9, 2022, 9:55:23 AM
to aarddict
augmentar has conjugations in my enwiktionary-NS0-20220120-ENTERPRISE-HTML.slob

[Attachment: image.png]

This is the HTML from the slob, viewed with Firefox dev tools via https://github.com/itkach/aard2-web

As you can see, the conjugation table is inside a div with class "NavFrame". These should be filtered out for Wikipedias, as there they are indeed for navigation only and not needed (here's the filter: https://github.com/itkach/mw2slob/blob/master/mw2slob/filters/wiki). For Wiktionaries, these should not be filtered out, and so the filters for Wiktionary are different (https://github.com/itkach/mw2slob/blob/master/mw2slob/filters/wikt). If your dictionary doesn't have conjugations, chances are it was compiled with the wrong filters. Filters are just CSS selectors, and a filter file has one selector per line. mw2slob comes with some predefined filters, but you can make your own and tweak them to your liking.

Igor Tkach

Feb 9, 2022, 10:00:48 AM
to aarddict
Audio for pronunciation is filtered out by the ".audiotable" filter: https://github.com/itkach/mw2slob/blob/master/mw2slob/filters/common#L15
I don't quite remember why this filter was added. I think audio didn't really work in WebView in early Android versions... Maybe this filter should just be removed.


uiaenrtd uiaenrtd

Feb 9, 2022, 12:08:14 PM
to aard...@googlegroups.com
I checked the en.wy, but it only uses mediaContainer (not audiotable). That is also filtered out in common, though. I will try without it and see if it works.

But much more important are the conjugations. In es.wy they use collapsible as the class, which is also filtered out in common. I will try without this filter too. Then all the Spanish Wiktionary articles would finally have their conjugations and be useful! After so many years, great.

And the English Wiktionary on halifax was probably made with the wrong filters, yes. Maybe the author can comment.

Regards,
Erik

☠☠☠

Feb 9, 2022, 5:22:57 PM
to aarddict
It worked!! Great, now I have conjugations!!

So for Wiktionaries I recommend removing these selectors from the common filter:

    .mediaContainer
    .collapsible

Haven't tried with audiotable, as I haven't seen that being used.

I will now try to build Wikipedias and see what difference leaving out those filters makes.
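For anyone who wants to repeat this, here is a minimal Python sketch of the tweak, assuming only what was said earlier in the thread: a filter file holds one CSS selector per line. The function name `drop_selectors` and the sample file content are made up for illustration; the real common filter file contains other entries.

```python
def drop_selectors(filter_text, unwanted):
    """Return filter-file text without the unwanted selectors.

    Assumes one CSS selector per line; blank lines are dropped.
    """
    unwanted = {selector.strip() for selector in unwanted}
    kept = [
        line.strip()
        for line in filter_text.splitlines()
        if line.strip() and line.strip() not in unwanted
    ]
    return "".join(selector + "\n" for selector in kept)


# Made-up excerpt standing in for the real "common" filter file.
sample_common = """\
.mediaContainer
.collapsible
.audiotable
.hypothetical-navigation-box
"""

wikt_friendly = drop_selectors(sample_common, [".mediaContainer", ".collapsible"])
print(wikt_friendly, end="")
```

The result can be saved as a custom filter file and used in place of the built-in common filter.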

Regards,
Erik

☠☠☠

Feb 13, 2022, 5:27:34 AM
to aarddict
The next problem has come up. The new slob files don’t contain pages like https://es.wikipedia.org/wiki/Anexo:Aves_de_Canarias or https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar but the old slob files did.

Does anybody know whether this is due to the enterprise dump files? Does it have something to do with the namespaces (NS0, NS14, NS6)? Has anybody found out what these namespaces mean? https://en.wikipedia.org/wiki/Wikipedia:Namespace says NS0 is all articles, NS14 is categories, NS6 is media titles. I am still not sure what that means. Maybe the NS6 dumps contain all the media used in the articles. Well, whatever.

Why are the articles mentioned above no longer included in the slob files? I can also unpack the dumps and check whether they are included, but if anybody already knows, that would be nice.

Regards,
Erik

☠☠☠

Feb 13, 2022, 10:15:37 AM
to aarddict
OK, I have grepped through the frwiktionary dump, and it seems that the article mentioned above is just missing. :-(( Probably the same is true for the Spanish Wikipedia.

Any ideas?

pk

Feb 13, 2022, 2:46:52 PM
to aard...@googlegroups.com
The meaning of a namespace is arbitrary, and the article Help:Namespace
defines it for each wiki. WMF-hosted wikis share conventions for about
10 namespaces.

Some wikis do not have all textual content intended for the general
public in ns0. This allows for filtering the search results easily.
For example, English Wiktionary has Appendix:.

These additional namespaces are listed case by case in the target
wiki's article Help:Namespace (type it untranslated in the search bar;
it redirects to the localized help page). Because this is content
intended for the general public, rather than technical support pages,
it totally makes sense to 1. ask the enterprise team to make dumps of
these namespaces and 2. tell the wiki community that they are not being
dumped (they are most likely just unaware). Ideally there should be a
cross-wiki discussion to handle this once and for all.

But for the French Wiktionary, there is no namespace named
Conjugaison: so Conjugaison:espagnol/aumentar is in ns0, for the same
reason that 'Star Trek: Enterprise' is in ns0 on English Wikipedia.
https://fr.wiktionary.org/wiki/Help:Namespace#Les_espaces_de_noms_du_Wiktionnaire


Igor Tkach

Feb 13, 2022, 2:49:28 PM
to aarddict
☠☠☠ wrote on Sunday, February 13, 2022 at 10:27:34 UTC:
> The next problem has come up. The new slob files don’t contain pages like https://es.wikipedia.org/wiki/Anexo:Aves_de_Canarias or https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar but the old slob files did.
>
> Does anybody know whether this is due to the enterprise dump files? Does it have something to do with the namespaces (NS0, NS14, NS6)? Has anybody found out what these namespaces mean? https://en.wikipedia.org/wiki/Wikipedia:Namespace says NS0 is all articles, NS14 is categories, NS6 is media titles. I am still not sure what that means. Maybe the NS6 dumps contain all the media used in the articles.

Exactly. Wikipedia content consists of the articles themselves, various auxiliary pages with additional content such as Appendix or Categories, and various special pages that enable editing and collaboration, such as talk pages. Namespaces make it possible to separate these different kinds of pages. A full page title/URL path in MediaWiki is generally <Namespace>:<Title>, with the namespace usually omitted for the Main (articles) namespace. Namespaces have a display name, a canonical name, and a numeric id.
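To make the <Namespace>:<Title> rule concrete, here is a rough Python sketch. It is not how MediaWiki or mw2slob implement it: the function `split_title` and the small name-to-id tables are illustrative assumptions, and real matching is more lenient (case, underscores, aliases). The point is only that a prefix counts as a namespace when the wiki actually defines one with that name.

```python
def split_title(full_title, namespace_ids):
    """Split '<Namespace>:<Title>' into (namespace id, title).

    namespace_ids maps namespace names to numeric ids for one wiki.
    A prefix that is not a known namespace is simply part of the title,
    and the page then belongs to the Main namespace (id 0).
    """
    prefix, colon, rest = full_title.partition(":")
    if colon and prefix in namespace_ids:
        return namespace_ids[prefix], rest
    return 0, full_title


# Tiny hand-written tables for illustration only.
eswiki_namespaces = {"Anexo": 104}
enwiki_namespaces = {"File": 6, "Category": 14}

print(split_title("Anexo:Aves de Canarias", eswiki_namespaces))
print(split_title("Star Trek: Enterprise", enwiki_namespaces))
```

The first call gives namespace 104, the second falls back to namespace 0 because "Star Trek" is not a namespace on that wiki.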

Each enterprise dump tarball includes content from one namespace as indicated by namespace's numeric id in the file name.
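As a small illustration, the namespace id can be read back from a file name like this. The sketch assumes names shaped like the enwiktionary-NS0-20220120-ENTERPRISE-HTML one mentioned earlier in the thread; the `.json.tar.gz` extension and the helper name are assumptions.

```python
import re

# Matches the "-NS<digits>-" part of an enterprise dump file name.
NS_PATTERN = re.compile(r"-NS(\d+)-")


def namespace_id_from_filename(filename):
    """Return the numeric namespace id encoded in the file name, or None."""
    match = NS_PATTERN.search(filename)
    return int(match.group(1)) if match else None


for name in (
    "enwiktionary-NS0-20220120-ENTERPRISE-HTML.json.tar.gz",
    "enwiktionary-NS14-20220120-ENTERPRISE-HTML.json.tar.gz",
):
    print(name, "->", namespace_id_from_filename(name))
```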

"Anexo" is a namespace of Spanish wikipedia with numeric id 104. How do I know this?
If you look at the siteinfo document, which you can get with

mw2slob siteinfo https://es.wikipedia.org > eswiki.si.json
 
you will find the entry for Anexo under "namespaces" key.
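If you would rather not read the JSON by eye, a lookup could be sketched in Python roughly like this. The exact layout of the saved file is an assumption here: the sketch accepts "namespaces" either at the top level or under "query", and checks the keys "*", "name" and "canonical" for the namespace name. The sample data is hand-written, not real output.

```python
import json


def find_namespace_id(siteinfo, name):
    """Return the numeric id of the namespace called `name`, or None."""
    # Accept either a raw API response ({"query": {...}}) or just its "query" part.
    namespaces = siteinfo.get("query", siteinfo).get("namespaces", {})
    for entry in namespaces.values():
        if name in (entry.get("*"), entry.get("name"), entry.get("canonical")):
            return entry["id"]
    return None


# Hand-written sample in the assumed shape, for illustration only.
sample = json.loads("""
{"namespaces": {
  "0":   {"id": 0,   "*": ""},
  "14":  {"id": 14,  "*": "Categoría", "canonical": "Category"},
  "104": {"id": 104, "*": "Anexo"}
}}
""")

print(find_namespace_id(sample, "Anexo"))
```

With a real file you would load it with `json.load` on the opened `eswiki.si.json` and pass the result in instead of `sample`.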

mwscrape has an option to download pages from a specific namespace, like so:

mwscrape https://es.wikipedia.org --namespace 104

Now you can make a separate slob with content from Anexo:

mw2slob scrape http://localhost:5984/es-wikipedia-org --article-namespace 104

Note the --article-namespace 104 argument. This tells the converter to keep links with the Anexo: path local; otherwise they would be converted to external links https://es.wikipedia.org/wiki/Anexo:...

When creating the main wikipedia dictionary with mw2slob dump be sure to also specify --article-namespace.

Now, when following an Anexo: link from the eswiki slob, it will be found in your other, Anexo-only, .slob, so even though content is split across multiple .slob files, effectively they function together as a single dictionary.

Multiple files can be specified as input for mw2slob dump, so all namespaces available as enterprise dumps can be included in a single slob; just don't forget to specify --article-namespace. If you are including both the Category (ns 14) and File (ns 6) namespaces, that would be --article-namespace 6 14

pk

Feb 13, 2022, 2:50:17 PM
to aard...@googlegroups.com
For Spanish Wikipedia, Anexo is a namespace, so you need a dump of ns104.
https://es.wikipedia.org/wiki/Help:Namespace#Tabla_de_espacios_de_nombres

☠☠☠

Feb 14, 2022, 8:44:25 AM
to aarddict
Thank you everybody for the explanations. Anexo is a namespace, great. That is easy to solve then.

But as I have already said, https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar is not part of the dump file, even though you say it should be in there like the Star Trek: Enterprise article. And if it is, why is the link in aarddict not working? And why didn’t I find it with grep in the tarball? (Maybe I did something wrong; the “:” and “/” might be escaped somehow.)

When I open the article https://fr.wiktionary.org/wiki/aumentar#Espagnol in aarddict and click on voir la conjugaison there, it should take me to the page linked above. Instead it reopens the article I am already viewing.
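On the escaping worry: one way to sidestep it is to parse the records instead of grepping the raw text. Below is a minimal Python sketch, assuming the tarball holds NDJSON files (one JSON object per line) with the page title under a "name" key. That layout, the function names and the sample records are assumptions for illustration, not something confirmed in this thread.

```python
import io
import json
import tarfile


def titles_in_dump(fileobj, wanted):
    """Return the subset of `wanted` titles present in a .tar.gz of NDJSON files."""
    wanted = set(wanted)
    found = set()
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            for raw_line in tar.extractfile(member):
                if not raw_line.strip():
                    continue
                # json.loads undoes JSON escapes such as "\/",
                # which a plain-text grep for the title would not match.
                title = json.loads(raw_line).get("name")
                if title in wanted:
                    found.add(title)
    return found


def make_sample_dump(lines):
    """Build a tiny in-memory .tar.gz so the sketch runs without a real dump."""
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo("sample_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Made-up records, not taken from a real dump. The second one escapes "/".
sample = make_sample_dump([
    r'{"name": "aumentar"}',
    r'{"name": "hypothetical\/subpage"}',
])
found = titles_in_dump(sample, ["aumentar", "hypothetical/subpage", "missing"])
print(sorted(found))
```

For a real dump, open the tarball in binary mode and pass the file object in place of `sample`.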

Thanks for your patience and help.

Regards,
Erik

pk

Feb 14, 2022, 9:34:53 AM
to aard...@googlegroups.com
This URL limits the search to ns0 (ns0=1 query parameter, "Rechercher
dans Principal" below the search bar) which confirms that its search
result, Conjugaison:espagnol/aumentar, is in ns0. I have no idea why
the link would not work.

https://fr.wiktionary.org/wiki/Special:Search?search=Conjugaison:espagnol/aumentar&profile=advanced&fulltext=1&searchengineselect=mediawiki&ns0=1


Igor Tkach

Feb 14, 2022, 10:51:59 AM
to aarddict
On Mon, Feb 14, 2022 at 8:44 AM '☠☠☠' via aarddict <aard...@googlegroups.com> wrote:
> But as I have already said, https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar is not part of the dump file, even though you say it should be in there like the Star Trek: Enterprise article.

"Conjugaison" is a namespace in French Wiktionary, id 116. It is not in the dump for namespace 0.
 
> Or if so, why is the link in aarddict not working? And why didn’t I find it with grep in the tarball (maybe I did something wrong, the “:” and “/” might be escaped somehow)
>
> When I open the article in aarddict for https://fr.wiktionary.org/wiki/aumentar#Espagnol and there click on voir la conjugaison it should show me the above mentioned link. But instead it opens the same article I am actually viewing again.


Aard for Android was losing the fragment part of the URL when handling link clicks. This is fixed in version 0.51, which was in beta for a bit (the beta is open to members of this Google group) and is now being rolled out to everyone.
 

pk

Feb 14, 2022, 2:01:34 PM
to aard...@googlegroups.com
Just added Conjugaison to the wiktionary help article.


☠☠☠

Feb 15, 2022, 5:54:30 AM
to aarddict
Oh, thank you to you both.

Well, itkach, may I ask how you found out that Conjugaison is namespace 116 when it wasn’t even mentioned on the respective wiki page (now it is)? If you tell me (and us), we will be able to find other namespaces ourselves without having to bother you again.

Thanks again for your great project which makes my smartphone always smart (even without or with bad internet connection).
Erik

Igor Tkach

Feb 15, 2022, 10:01:01 AM
to aarddict
As I explained earlier in this thread, look at siteinfo.

mw2slob siteinfo https://fr.wiktionary.org > frwiki.si.json

and then look under the "namespaces" key.

Siteinfo is metadata about a MediaWiki site that includes information about namespaces, interwiki links, content license etc. It can be obtained via the MediaWiki API as described here: https://www.mediawiki.org/wiki/API:Siteinfo

mw2slob siteinfo is a convenience command to fetch it. It builds a query that requests the specific parts of siteinfo that mw2slob dump needs, e.g. for frwiki https://fr.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=general|namespaces|interwikimap|rightsinfo&format=json
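For illustration, the same kind of URL can be put together with a few lines of Python. This is only a sketch of the query shown above, not mw2slob's actual code; the function name `siteinfo_url` and the `/w/api.php` path (taken from the example URL) are assumptions.

```python
from urllib.parse import urlencode


def siteinfo_url(site, props=("general", "namespaces", "interwikimap", "rightsinfo")):
    """Build a MediaWiki API URL that requests the given siteinfo parts as JSON."""
    query = urlencode(
        {
            "action": "query",
            "meta": "siteinfo",
            "siprop": "|".join(props),
            "format": "json",
        },
        safe="|",  # keep the "|" separators readable instead of %7C
    )
    return f"{site.rstrip('/')}/w/api.php?{query}"


print(siteinfo_url("https://fr.wiktionary.org"))
```

Opening the printed URL in a browser shows the JSON, including the "namespaces" section.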

mwscrape also fetches siteinfo first thing when a scrape starts so that mw2slob scrape can use it later (you can find and explore it in mwscrape's CouchDB via the admin interface at http://localhost:5984/_utils/)

☠☠☠

Mar 18, 2022, 8:01:14 AM
to aarddict
Hi Igor.

Thanks for all the detailed explanations. Just wanted to tell you that there is a discussion on the Wikimedia dumps mailing list which will hopefully lead to new dump files that also contain the missing namespaces (so that we can avoid the scraping): https://lists.wikimedia.org/hyperkitty/list/xmldata...@lists.wikimedia.org/thread/G557PNCFVNKMBROT3U2A5C3HJGESEPJA/

itkach

Mar 18, 2022, 1:41:05 PM
to aarddict
Interesting, thank you, Erik.

Indeed, the term "scraping" and the tool name "mwscrape" are a bit of a misnomer :) mwscrape doesn't actually scrape the websites; it gets rendered HTML via the MediaWiki REST API, which, as they say, is acceptable.

Not sure why they would bring up kiwix/zim. They are in the same position as aarddict: not a primary data source, so depending on them makes no sense. The right thing to do is to export the missing namespaces as enterprise dumps, as requested in https://phabricator.wikimedia.org/T303652