Updated Wikipedia files


mhbraun

unread,
Oct 14, 2014, 3:52:50 PM10/14/14
to aard...@googlegroups.com
Usually I announce new updates on Wikipedia *.aar files for Aard on Twitter @MarkusHBraun
And I will continue to do so. So if you want to be notified just follow me on Twitter.

It so happened that I was out of town and did not read my timeline for a couple of days, so I possibly missed some information that could have been interesting for me. And this probably happens to others as well, I thought. So I decided to list new updates on the wikis I create in this thread.

If anyone else is creating their own wikis, they are welcome to add the information here so that we have a common source for wiki updates.

Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

and as a mirror on

bit.ly/MegaEnwiki
bit.ly/MegaDewiki
 




mhbraun

unread,
Oct 14, 2014, 3:57:31 PM10/14/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia of 20141013 for #aarddict reader is online http://bit.ly/AardWikiDe
The *.aar files and the *.slob for Aard2 0.3 with external pictures.

bittachi

unread,
Oct 15, 2014, 5:25:25 PM10/15/14
to aard...@googlegroups.com
I've recently updated my dictionaries from the Italian Wikimedia minor projects using the "mwscrape" method (I know, I complained about its slowness, but - after all - it gives you excellent output and the scraping process is resumable whenever you want, so it isn't as bad as I believed at the beginning...). You can see (and download) my "creations" here: https://www.wuala.com/bittachi/aarddict/

Meanwhile, I thought about compiling an updated version of the Italian Wikipedia, but I still have some doubts:

1) I want to make the compilation, but stop and resume the process at a later time. Is it possible to do something like that? (I thought of splitting up the dictionary compilation using the "startkey" and "endkey" options in mwcouch, but I'm afraid of generating independent dictionaries instead of separate volumes gathered into a single dictionary).
2) If I decide to rescrape Wikipedia in the future, could the scraping process take into account the articles deleted from the encyclopedia after the first scraping (and remove them from the Couch database automatically) or not?

Markus Braun

unread,
Oct 15, 2014, 6:46:36 PM10/15/14
to aard...@googlegroups.com

1. If you mean starting and restarting scraping, then the answer is yes. No problem. If you mean compiling the slob from couchdb, then I do not know. I always did it in one run. It takes 3 to 4 h for dewiki.

2. As far as I understood, all changes are updated in the couchdb with a new scrape. You do not need compaction on couchdb for a correct result. However, it is very space-consuming if you do not compact the couchdbs.
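If it helps, compaction can be triggered through CouchDB's HTTP API; here is a minimal Python sketch, assuming a local CouchDB at 127.0.0.1:5984 and a scrape database named de-wikipedia-org (adjust to your own database name):

    import requests

    # Ask CouchDB to compact the scrape database in the background.
    # Depending on your setup this may require admin credentials.
    resp = requests.post(
        "http://127.0.0.1:5984/de-wikipedia-org/_compact",
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    print(resp.json())  # {'ok': True} once compaction has been scheduled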


itkach

unread,
Oct 15, 2014, 10:07:15 PM10/15/14
to aard...@googlegroups.com


On Wednesday, October 15, 2014 5:25:25 PM UTC-4, bittachi wrote:
I've recently updated my dictionaries from the Italian Wikimedia minor projects using the "mwscrape" method (I know, I complained about its slowness, but - after all - it gives you excellent output and the scraping process is resumable whenever you want, so it isn't as bad as I believed at the beginning...). You can see (and download) my "creations" here: https://www.wuala.com/bittachi/aarddict/

Would be nice if you could also compile .slob dictionaries for Aard 2 :) Or share your .couch database files.
 
Meanwhile, I thought about compiling an updated version of the Italian Wikipedia, but I still have some doubts:

1) I want to make the compilation, but stop and resume the process at a later time.
 
Is it possible to do something like that?

I don't understand your question. Getting Wikipedia articles with mwscrape is a separate process. It can be stopped and later resumed. You can run several mwscrape processes at the same time, instructing each process to start at a different point in the list of article titles returned by the Wikipedia API.
Compiling .aar or .slob from the Couch database created by mwscrape is another process and it is not resumable.
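To illustrate the "different starting point" idea: the MediaWiki web API can list article titles alphabetically from any given title, so each scrape process can be pointed at a different slice of the title list. A minimal Python sketch (hitting the public API directly; the exact mwscrape command-line option for this may differ, so check its --help):

    import requests

    # List article titles starting from a chosen title via the MediaWiki web API.
    # Each mwscrape process could be told to start from a different title like this.
    API = "https://it.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "allpages",
        "apfrom": "Roma",   # example starting point for one slice
        "apnamespace": 0,   # main/article namespace only
        "aplimit": 50,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    for page in data["query"]["allpages"]:
        print(page["title"])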
 
(I thought of splitting up the dictionary compilation using the "startkey" and "endkey" options in mwcouch, but I'm afraid of generating independent dictionaries instead of separate volumes gathered into a single dictionary).

I'm not sure what problem you are trying to solve.
 
2) If I decide to rescrape Wikipedia in the future, could the scraping process take into account the articles deleted from the encyclopedia after the first scraping (and remove them from the Couch database automatically) or not?

mwscrape requests the list of available article titles from the Wikipedia web API, and then compares the revision Wikipedia has with the revision mwscrape's CouchDB database has from previous scrapes. If it's a new article or a new revision, it gets downloaded and stored or updated. In this process, previously available and now deleted articles remain in CouchDB. Another script would need to go over what's in CouchDB and check if each article is still available; mwscrape currently doesn't implement this. However, my understanding is that in general Wikipedia articles are not deleted very often. One of the reasons for deletion would be "lack of notability" of the subject described. Vandalism (somebody inserting bogus articles) could be another. I'm not sure how big of a problem this actually is. Are you concerned with accuracy of data or is it something else?
  
 

mhbraun

unread,
Oct 16, 2014, 2:19:27 AM10/16/14
to aard...@googlegroups.com
New #German #deutsche #offline #Wikipedia dewiki-20141013.img.v4.slob for #aarddict reader in slob format is online http://bit.ly/AardWikiDe

bittachi

unread,
Oct 16, 2014, 3:43:21 AM10/16/14
to aard...@googlegroups.com


On Thursday, October 16, 2014 at 4:07:15 AM UTC+2, itkach wrote:

Would be nice if you could also compile .slob dictionaries for Aard 2 :) Or share your .couch database files.

Didn't you notice that I uploaded both .aar and .slob files for every dictionary? I'll keep the double format until you officially release Aard2. As for the .couch database files, I didn't care about them, so I've already deleted them. I'll keep them next time, I promise.
 
I don't understand your question. Getting Wikipedia articles with mwscrape is a separate process. It can be stopped and later resumed. You can run several mwscrape processes at the same time, instructing each process to start at a different point in the list of article titles returned by the Wikipedia API.
Compiling .aar or .slob from the Couch database created by mwscrape is another process and it is not resumable.

OK, so you confirm (as MHBraun said previously) that the scraping can be done in several passes, unlike the dictionary compilation from the database, which can only be done in one shot. I supposed that the compilation could be performed in several runs by selecting specific alphabetical ranges (i.e., using the "--s" and "--e" options, I first compile the articles from the beginning up to the letter 'A', then from 'B' to 'C', then from 'D' to 'E', and so on...), but I realized that doing so generates independent dictionaries. Well, I'll try to compile everything at once (the Italian Wikipedia is more or less the same size as the German one, so I hope to complete the process in the time MHBraun stated)
 
mwscrape requests the list of available article titles from the Wikipedia web API, and then compares the revision Wikipedia has with the revision mwscrape's CouchDB database has from previous scrapes. If it's a new article or a new revision, it gets downloaded and stored or updated. In this process, previously available and now deleted articles remain in CouchDB. Another script would need to go over what's in CouchDB and check if each article is still available; mwscrape currently doesn't implement this. However, my understanding is that in general Wikipedia articles are not deleted very often. One of the reasons for deletion would be "lack of notability" of the subject described. Vandalism (somebody inserting bogus articles) could be another. I'm not sure how big of a problem this actually is. Are you concerned with accuracy of data or is it something else?
 
AFAIK at least 10 articles (on average) are deleted every day on the Italian Wikipedia, so there will be more than 3000 articles deleted in a year. Maybe it isn't a big number, but IMHO I don't want to keep articles that were certified as hoaxes/promotion/vandalism/non-notable subjects in future revisions, that's all.

Markus Braun

unread,
Oct 16, 2014, 7:03:44 AM10/16/14
to aard...@googlegroups.com

Just to be precise:
Creating the couchdb for dewiki took me 4 weeks. Updating the existing couchdb takes 1 to 2 weeks with up to 10 scrapes.

So I do keep the couchdbs.

The generation of the dictionary takes 3-4 h.


itkach

unread,
Oct 16, 2014, 9:00:16 AM10/16/14
to aard...@googlegroups.com


On Thursday, October 16, 2014 3:43:21 AM UTC-4, bittachi wrote:


On Thursday, October 16, 2014 at 4:07:15 AM UTC+2, itkach wrote:

Would be nice if you could also compile .slob dictionaries for Aard 2 :) Or share your .couch database files.

Didn't you notice that I uploaded both .aar and .slob files for every dictionary?

Indeed, I didn't notice, sorry about that. Thank you.
From the timestamps it looks like these were probably compiled before you could get my change to the slob file format; if that's the case, then they are incompatible with the latest aard2 apk. Sorry about that too :) As I mentioned in another conversation, I expect this to be the last change to the binary file format.
 
I'll keep the double format until you officially release Aard2. As for the .couch database files, I didn't care about them, so I've already deleted them. I'll keep them next time, I promise.
 
I don't understand your question. Getting Wikipedia articles with mwscrape is a separate process. It can be stopped and later resumed. You can run several mwscrape processes at the same time, instructing each process to start at a different point in the list of article titles returned by the Wikipedia API.
Compiling .aar or .slob from the Couch database created by mwscrape is another process and it is not resumable.

OK, so you confirm (as MHBraun said previously) that the scraping can be done in several passes, unlike the dictionary compilation from the database, which can only be done in one shot. I supposed that the compilation could be performed in several runs by selecting specific alphabetical ranges (i.e., using the "--s" and "--e" options, I first compile the articles from the beginning up to the letter 'A', then from 'B' to 'C', then from 'D' to 'E', and so on...), but I realized that doing so generates independent dictionaries. Well, I'll try to compile everything at once (the Italian Wikipedia is more or less the same size as the German one, so I hope to complete the process in the time MHBraun stated)
 

itwiki is somewhat smaller (1 150 381 articles vs 1 766 217 in dewiki, according to http://meta.wikimedia.org/wiki/List_of_Wikipedias), so it should take less time. Similar to MHBraun, compiling dewiki from the scrape I did in March takes about 3.5 hours on my machine with a 4-core i7 CPU. I believe this is a lot less time than compiling .aar from the Wikipedia XML dump used to take.
 

> As for the .couch database files, I didn't care about them, so I've already deleted them. I'll keep them next time, I promise.

Keeping scrape databases speeds up future scrapes noticeably, at least for large Wikipedias. For smaller projects like Wikiquote and most Wiktionaries this is not as significant. It is also a must when working on content filters or dictionary CSS or some other aspect of the conversion. aar2slob/mwscrape2slob's --key, --start and --end options allow you to quickly create small dictionaries for testing.

 
mwscrape requests the list of available article titles from the Wikipedia web API, and then compares the revision Wikipedia has with the revision mwscrape's CouchDB database has from previous scrapes. If it's a new article or a new revision, it gets downloaded and stored or updated. In this process, previously available and now deleted articles remain in CouchDB. Another script would need to go over what's in CouchDB and check if each article is still available; mwscrape currently doesn't implement this. However, my understanding is that in general Wikipedia articles are not deleted very often. One of the reasons for deletion would be "lack of notability" of the subject described. Vandalism (somebody inserting bogus articles) could be another. I'm not sure how big of a problem this actually is. Are you concerned with accuracy of data or is it something else?
 
AFAIK at least 10 articles (on average) are deleted every day on the Italian Wikipedia, so there will be more than 3000 articles deleted in a year. Maybe it isn't a big number,

Compared to 1 150 381 articles it is not big at all. Also, keep in mind that, depending on how often scrapes run and how quickly such articles are discovered and removed by editors, a scrape would probably only get some of them.
 
but IMHO I don't want to keep articles that were certified as hoaxes/promotion/vandalism/non-notable subjects in future revisions, that's all.

I agree that such articles should be purged. Doing a fresh scrape from scratch every time is one way to achieve that. Adding a sweep script to delete articles from the scrape database is another; perhaps I'll implement it at some point.
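A sweep script along these lines could walk the documents in the scrape database, ask the MediaWiki API whether each title still exists, and delete the ones reported as missing. A rough Python sketch, assuming document IDs in the CouchDB scrape database are article titles (only the first batch is shown; a real sweep would page through all documents):

    import requests
    from urllib.parse import quote

    COUCH = "http://127.0.0.1:5984/it-wikipedia-org"   # assumed scrape database
    API = "https://it.wikipedia.org/w/api.php"

    rows = requests.get(COUCH + "/_all_docs", params={"limit": 100}).json()["rows"]
    for row in rows:
        title = row["id"]
        # Ask Wikipedia whether this title still exists.
        info = requests.get(API, params={
            "action": "query", "titles": title, "format": "json",
        }).json()
        pages = info["query"]["pages"]
        if all("missing" in page for page in pages.values()):
            # Deleted on Wikipedia: remove the stale document from CouchDB as well.
            rev = row["value"]["rev"]
            requests.delete(COUCH + "/" + quote(title, safe=""), params={"rev": rev})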

itkach

unread,
Oct 16, 2014, 9:12:23 AM10/16/14
to aard...@googlegroups.com
@MHBraun @bittachi by the way, GitHub public wikis are editable by any user. Feel free to edit https://github.com/itkach/slob/wiki, either my Dictionaries page (in reStructuredText), or create your own in whatever format you like, to post links to dictionaries you compiled.

bittachi

unread,
Oct 18, 2014, 11:30:41 AM10/18/14
to aard...@googlegroups.com
OK, I've regenerated my slob dictionaries (using the updated "slob" module) and tested them on Aard2 ver. 0.4: everything is OK, except for some problems with the internal links in Wikisource and Wiktionary dictionaries. Maybe it's due to a bad "translation" from the original .aar format during the conversion to .slob format? (Check over there: https://www.wuala.com/bittachi/aarddict/ - the new .slob files are those marked with "(2)" in the file name)

itkach

unread,
Oct 18, 2014, 3:28:16 PM10/18/14
to aard...@googlegroups.com
On Saturday, October 18, 2014 11:30:41 AM UTC-4, bittachi wrote:
OK, I've regenerated my slob dictionaries (using the updated "slob" module) and tested them on Aard2 ver. 0.4: everything is OK, except for some problems with the internal links in Wikisource and Wiktionary dictionaries. Maybe it's due to a bad "translation" from the original .aar format during the conversion to .slob format?

Are you sure those problems are not present in .aar? As far as I know Wikisource always had problems. I also never understood what the point of Wikisource is anyway - taking publicly available books and putting them into an awkward-to-use, weird format not suitable for books. What's wrong with text files, fb2, epub and such?

As for Wiktionary, internal links should work equally in both .aar and .slob. Do you have specific examples (which article, which link) where this is not the case?   

mhbraun

unread,
Oct 18, 2014, 9:19:30 PM10/18/14
to aard...@googlegroups.com
New #English #offline #Wikipedia simple-enwiki-20141018.slob for #aarddict reader in slob format is online http://bit.ly/AardWikiEN

bittachi

unread,
Oct 20, 2014, 11:54:10 AM10/20/14
to aard...@googlegroups.com


On Saturday, October 18, 2014 at 9:28:16 PM UTC+2, itkach wrote:
Are you sure those problems are not present in .aar? As far as I know Wikisource always had problems. I also never understood what the point of Wikisource is anyway - taking publicly available books and putting them into an awkward-to-use, weird format not suitable for books. What's wrong with text files, fb2, epub and such?

As for Wiktionary, internal links should work equally in both .aar and .slob. Do you have specific examples (which article, which link) where this is not the case?

I'm pretty sure the original .aar files aren't corrupted (you can test those I've already uploaded on my Wuala repository, loading them on the classic Aard app).
Anyway, I'll try to describe the bugs I've found using Aard 2:
- in Wikisource .slob file articles, if you tap on chapter links (i.e., I'm reading "Divina Commedia/Inferno/Canto I" and I want to move to the next chapter "Divina Commedia/Inferno/Canto II"), they point to nothing and the "Not found" message appears on the screen;
- in Wiktionary .slob file articles, the links all appear in red, as if the linked item doesn't exist, but it actually does exist (you can check any article inside the dictionary to see this bug).

itkach

unread,
Oct 20, 2014, 10:23:30 PM10/20/14
to aard...@googlegroups.com
On Monday, October 20, 2014 11:54:10 AM UTC-4, bittachi wrote:

On Saturday, October 18, 2014 at 9:28:16 PM UTC+2, itkach wrote:
Are you sure those problems are not present in .aar? As far as I know Wikisource always had problems. I also never understood what the point of Wikisource is anyway - taking publicly available books and putting them into an awkward-to-use, weird format not suitable for books. What's wrong with text files, fb2, epub and such?

As for Wiktionary, internal links should work equally in both .aar and .slob. Do you have specific examples (which article, which link) where this is not the case?

I'm pretty sure the original .aar files aren't corrupted (you can test those I've already uploaded on my Wuala repository, loading them on the classic Aard app).

To me "corrupted" means corrupted binary data, as in "missing bytes" or binary garbage instead of the expected structure. It sounds like you are talking about issues with content and navigation. Let's not call it "corrupted".
 
Anyway, I'll try to describe the bugs I've found using Aard 2:
- in Wikisource .slob file articles, if you tap on chapter links (i.e., I'm reading "Divina Commedia/Inferno/Canto I" and I want to move to the next chapter "Divina Commedia/Inferno/Canto II"), they point to nothing and the "Not found" message appears on the screen;

Indeed, links like these work in Aard, but not in Aard 2. In Aard, clicking a link is just like typing the link's target in lookup and taking the first result. In Aard 2 it's different: these are resolved as relative links on a web page. I think this can be fixed in mwscrape2slob by escaping "/" in links. I will look into it.
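The idea would be roughly the following: percent-encode the slash in the link target so the WebView looks up the whole subpage title as one key instead of resolving it as a relative path. A tiny Python sketch of that escaping step (an illustration only, not the actual mwscrape2slob code):

    from urllib.parse import quote

    def escape_internal_link(target):
        # Encode "/" (and other unsafe characters) so a subpage title such as
        # "Divina Commedia/Inferno/Canto II" survives as a single lookup key.
        return quote(target, safe="")

    print(escape_internal_link("Divina Commedia/Inferno/Canto II"))
    # -> Divina%20Commedia%2FInferno%2FCanto%20II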
 
- in Wiktionary .slob file articles, the links all appear in red, as if the linked item doesn't exist, but it actually does exist (you can check any article inside the dictionary to see this bug).

I'm checking and I see that most "red" linked items actually do not exist, both in .aar and in .slob. I did find some examples of links in red for items that exist, but it took some searching. Links are marked as pointing to non-existing articles in the rendered HTML returned by the Wikipedia API. It is possible that Wikipedia has some of them marked incorrectly, because changes do not propagate instantaneously, or perhaps links that did not exist at some point were added later and the scrape missed it because it doesn't get all the articles at the same time. In any case, this doesn't appear to be an Aard 2-specific issue and it appears to be the way data is returned by Wikipedia (and links in red seem to match the online article versions).
   

Frank Röhm

unread,
Oct 21, 2014, 2:14:15 PM10/21/14
to aard...@googlegroups.com


On October 16, 2014 at 8:19:27 AM CEST, mhbraun <mhb...@freenet.de> wrote:
>New #*German* ...
>dewiki-20141013.img.v4.slob for #*aarddict*

Do I understand correctly that the slob file for Aard2 version 0.4 in your downloads is marked with a v4 in the file name?

As I understand it now, slob files for Aard2 version 0.4 are incompatible with slob files for 0.3 and earlier, right?


Markus Braun

unread,
Oct 21, 2014, 4:40:17 PM10/21/14
to aard...@googlegroups.com

Due to changes in the slob format the v4 files are for Aard2 0.4. I did not test compatibility between the versions; I just switched to the new format immediately. So I had 2 versions, as the 0.3 version was made just before 0.4 was available. There were just a couple of hours of difference.

Future releases will simply be for the current format.

Igor indicated that the probability of more changes in the slob format is very low.

Just take the v4 or the newer file.

bittachi

unread,
Nov 12, 2014, 3:19:35 PM11/12/14
to aard...@googlegroups.com
New announcements for you:

1) I've transferred my dictionaries to my new Mega storage account, you can find them here: https://mega.co.nz/#F!MAIwRIKZ!-s_kee09DTTokK_ECE2KKw

2) I've solved on my own the critical problem of the very long compilation time on my poor notebook. In practice, I "split up" the itwiki database into 3 parts, that is, I generated 3 slob files with 3 mwscrape2slob runs as follows:

-1st: mwscrape2slob http://127.0.0.1:5984/it-wikipedia-org -e H -o itwiki_1.slob (compiles from the beginning up to the "H" articles)
-2nd: mwscrape2slob http://127.0.0.1:5984/it-wikipedia-org -s H -e Q -o itwiki_2.slob (from the "H" articles to the "Q" articles)
-3rd: mwscrape2slob http://127.0.0.1:5984/it-wikipedia-org -s Q -o itwiki_3.slob (from the "Q" articles to the end)

Each run took about 6-7 hours, which is an acceptable time for me.

Moreover, done this way, it will also be easier to upload the whole dictionary - I'll do this in the next few days, stay tuned :-)
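For anyone who wants to reproduce this, the three runs above can also be chained in a small script so each part starts as soon as the previous one finishes; a Python sketch using the same -s/-e/-o options and file names as in the commands above:

    import subprocess

    DB = "http://127.0.0.1:5984/it-wikipedia-org"
    # Same alphabetical split as above: beginning..H, H..Q, Q..end.
    parts = [
        (["-e", "H"], "itwiki_1.slob"),
        (["-s", "H", "-e", "Q"], "itwiki_2.slob"),
        (["-s", "Q"], "itwiki_3.slob"),
    ]
    for range_opts, out_file in parts:
        subprocess.run(["mwscrape2slob", DB, *range_opts, "-o", out_file], check=True)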


itkach

unread,
Nov 12, 2014, 4:30:35 PM11/12/14
to aard...@googlegroups.com
Note that having three separate dictionaries like this is not quite the same as having the same content in one dictionary. When a link is followed, the slob containing the link's originating article is searched first, then the rest of them. If you have many dictionaries, for example Italian and English Wikipedia with a matching title in both, and Italian is split like this, it's possible that following a link from an Italian article would open the article from English first. This may not be an issue for you, but there's a difference.

Markus Braun

unread,
Nov 12, 2014, 5:53:29 PM11/12/14
to aard...@googlegroups.com

Are you sure this is faster than compiling it in one shot?

It would take 18 to 21h this way...

bittachi

unread,
Nov 14, 2014, 2:04:24 AM11/14/14
to aard...@googlegroups.com


On Wednesday, November 12, 2014 at 10:30:35 PM UTC+1, itkach wrote:
Note that having three separate dictionaries like this is not quite the same as having the same content in one dictionary. When a link is followed, the slob containing the link's originating article is searched first, then the rest of them. If you have many dictionaries, for example Italian and English Wikipedia with a matching title in both, and Italian is split like this, it's possible that following a link from an Italian article would open the article from English first. This may not be an issue for you, but there's a difference.


I know the difference: I realized it when I also loaded the dictionaries from the other Italian Wikimedia projects (Wiktionary, Wikisource, etc.). Anyway, you can work around it with a swipe to the left (in order to find the right article), or by loading only the dictionary you really need. Maybe it could be annoying for someone, but it's the best solution I can manage with my humble notebook. Moreover, take into account that some users can't download a huge dictionary in one shot (above all if they have a slow Internet connection), so splitting it into several parts could be an adequate solution.

bittachi

unread,
Nov 14, 2014, 2:10:21 AM11/14/14
to aard...@googlegroups.com


On Wednesday, November 12, 2014 at 11:53:29 PM UTC+1, MHBraun wrote:

Are you sure this is faster than compiling it in one shot?

It would take 18 to 21h this way...

No, it isn't faster (it took the same total time), but at least I can "modularize" the process into smaller slices of time (I don't like to leave my computer turned on for a whole day, especially during the night - you know...)

mhbraun

unread,
Nov 14, 2014, 3:22:36 PM11/14/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia dewikipedia-20141111.slob for #aarddict reader in slob format is online http://bit.ly/AardWikiDe
 
Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

and sometimes as a mirror on

bit.ly/MegaEnwiki
bit.ly/MegaDewiki
 




Frank Röhm

unread,
Nov 15, 2014, 7:44:21 AM11/15/14
to aard...@googlegroups.com


On November 14, 2014 at 9:22:36 PM CET, mhbraun <mhb...@freenet.de> wrote:

New German
dewikipedia-20141111.slob
for aarddict reader

in slob format is online

Oh, I just downloaded the German wiki from 11.11.2014 and tested it, and there are strange errors on some pages. Example article:

Sexuelle Spannung

See Attachment.
I think many other articles are not found, like this one.
What's wrong?
Screenshot_2014-11-15-13-28-33.png

mhbraun

unread,
Nov 15, 2014, 9:22:04 AM11/15/14
to aard...@googlegroups.com
I cannot confirm this; my dewiki works correctly with Aard2 0.6.
Sexuelle Spannung is redirected to Sexuelle Erregung.
- Is md5 fine?
- Do you have more articles to test?

itkach

unread,
Nov 15, 2014, 9:48:40 AM11/15/14
to aard...@googlegroups.com
On Saturday, November 15, 2014 7:44:21 AM UTC-5, franc wrote:


On November 14, 2014 at 9:22:36 PM CET, mhbraun wrote:

New German
dewikipedia-20141111.slob
for aarddict reader
in slob format is online

Oh, I just downloaded the German wiki from 11.11.2014 and tested it, and there are strange errors on some pages. Example article:

Sexuelle Spannung

See Attachment.
I think many other articles are not found, like this one.
What's wrong?

I doubt this is specific to dewikipedia-20141111.slob, although it doesn't hurt to check that the file made it to the device intact (checksums match). There's probably an error somewhere in the logs, would be nice to get relevant output from adb logcat  
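Verifying that the checksums match is quick; a minimal Python sketch using hashlib (the file name is just an example):

    import hashlib

    def md5sum(path, chunk_size=1024 * 1024):
        # Read the file in chunks so multi-GB .slob files don't have to fit in RAM.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Compare the output against the published checksum for the downloaded file.
    print(md5sum("dewikipedia-20141111.slob"))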

Frank Röhm

unread,
Nov 15, 2014, 10:31:55 AM11/15/14
to aard...@googlegroups.com
On 15.11.2014 at 15:48, itkach wrote:
> On Saturday, November 15, 2014 7:44:21 AM UTC-5, franc wrote:
> ...
>
> Sexuelle Spannung
>
> See Attachment.
> I think many other articles are not found, like this one.
> What's wrong?
>
>
> I doubt this is specific to dewikipedia-20141111.slob, although it
> doesn't hurt to check that the file made it to the device intact
> (checksums match). There's probably an error somewhere in the logs,
> would be nice to get relevant output from adb logcat

It was my error!!!
I loaded it to the phone from a virtual machine on my laptop, which crashed, but
I found the file at full size so I thought the transfer had worked.
I did it again directly from the laptop to the SD card and now every
article works.
Sorry to frighten the horses :)



bittachi

unread,
Nov 16, 2014, 1:18:25 PM11/16/14
to aard...@googlegroups.com
New Italian Wikipedia dictionary (last update: November 3rd, 2014) is now available over there: https://mega.co.nz/#F!MAIwRIKZ!-s_kee09DTTokK_ECE2KKw

As I said previously, I deliberately generated the dictionary in 3 separate parts in order to make both the compilation itself and the file upload/download manageable. I apologize if you (maybe) consider this solution "out of the ordinary", but my goal is to allow easier distribution of huge dictionaries created from Wikipedia, also for those who (like me) can't afford a fast Internet connection and a powerful computer.

Igor, can you please update the link to my dictionaries on the Aard homepage, so I can finally erase the old copies I stored on Wuala? Thanks in advance.

Markus Braun

unread,
Nov 16, 2014, 5:54:34 PM11/16/14
to aard...@googlegroups.com

Wuala cancelled my free storage space. I moved all data to copy.com


itkach

unread,
Nov 16, 2014, 7:38:48 PM11/16/14
to aard...@googlegroups.com


On Sunday, November 16, 2014 1:18:25 PM UTC-5, bittachi wrote:
New Italian Wikipedia dictionary (last update: November 3rd, 2014) is now available over there: https://mega.co.nz/#F!MAIwRIKZ!-s_kee09DTTokK_ECE2KKw

As I said previously, I deliberately generated the dictionary in 3 separate parts in order to make both the compilation itself and the file upload/download manageable. I apologize if you (maybe) consider this solution "out of the ordinary", but my goal is to allow easier distribution of huge dictionaries created from Wikipedia, also for those who (like me) can't afford a fast Internet connection and a powerful computer.


The reasons you mentioned for splitting don't sound very convincing to me, especially the one about upload/download. One third of a big file is still a big file, so resumable downloads and uploads are a must. Personal cloud storage services like Wuala, Copy and so on are great, but they are inadequate for sharing with more than a few people. As far as I can tell, Bittorrent is the only viable option so far, unless someone is willing to pay for high quality hosting, and with Bittorrent there's no benefit to having multiple files; in fact it makes it a bit more cumbersome, for example, to create torrents with web seeds. Linking and publishing/verifying hashes is also easier when you have just one file.

I could agree that splitting into smaller files is a useful workaround for someone who wants enwiki and is stuck with FAT32 on an SD card (all other wikis still fit into 4 GB so far), but you don't mention this.

Having said that, I will make a change so that following a link from an article that happens to be in one file to an article in the same wiki but another file will work as if they were in one file (well, most of the time anyway).   
 
Igor, can you please update the link to my dictionaries on the Aard homepage, so I can finally erase the old copies I stored on Wuala? Thanks in advance.

Done 

mhbraun

unread,
Nov 23, 2014, 5:33:16 PM11/23/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia de-wikivoyage-20141123 in aar and slob for #aarddict reader is online http://bit.ly/AardWikiDe




On Tuesday, October 14, 2014 at 9:52:50 PM UTC+2, mhbraun wrote:

Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

mhbraun

unread,
Nov 30, 2014, 5:31:56 AM11/30/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia dewikiquote-20141130 in aar and slob for #aarddict reader is online http://bit.ly/AardWikiDe

mhbraun

unread,
Nov 30, 2014, 5:45:57 AM11/30/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia dewikivoyage-20141130 in aar and slob for #aarddict reader is online http://bit.ly/AardWikiDe

mhbraun

unread,
Nov 30, 2014, 10:06:44 AM11/30/14
to aard...@googlegroups.com

New #German #deutsche #offline #Wikipedia dewikibooks-20141130 in aar and slob for #aarddict reader is online http://bit.ly/AardWikiDe

mhbraun

unread,
Dec 13, 2014, 9:18:24 PM12/13/14
to aard...@googlegroups.com

Christmas release
Dewiki-20141212.slob and Dewiki-20141212.aar are online on copy.com, still uploading to Mega.
Enwiki-20141201.slob and *.aar are online on copy.com. Still uploading to Mega. Will take some days.
Dewikibooks-20141130.slob and *.aar are online on copy.com and will not be on Mega.

mhbraun

unread,
Dec 13, 2014, 9:20:28 PM12/13/14
to aard...@googlegroups.com
dewiktionary-20141130.slob and *.aar are online on copy.com and will not be updated on Mega. Too little space available.

mhbraun

unread,
Dec 22, 2014, 9:51:23 AM12/22/14
to aard...@googlegroups.com
Torrent of enwiki-20141201 created:

magnet:?xt=urn:btih:86d3e04f4e9c6685d69c4b803dfdce4a6df57e43&dn=enwiki-20141201.slob

mhbraun

unread,
Jan 7, 2015, 5:36:26 AM1/7/15
to aard...@googlegroups.com
New #German #deutsche #offline #Wikipedia of 20150103 for #aarddict reader is online http://bit.ly/AardWikiDe
The *.aar files for Aard and the *.slob for Aard2 with external pictures.
Mega mirror is still uploading.

The Torrent to dewiki-20150103.slob is:

magnet:?xt=urn:btih:7eaef7066334183562aea56671bc45338c015ac6&dn=dewiki-20150103.slob

Frank Röhm

unread,
Jan 7, 2015, 6:11:50 AM1/7/15
to aard...@googlegroups.com
On 07.01.2015 at 11:36, mhbraun wrote:
>
> The Torrent to dewiki-20150103.slob is:
>
> magnet:?xt=urn:btih:7eaef7066334183562aea56671bc45338c015ac6&dn=dewiki-20150103.slob
>
>

Hm. Same thing as last time, the magnet searches a while, then gives up:


Download
from: magnet:?xt=urn:btih:7eaef7066334183562aea56671bc45338c015ac6&&&&...
Waiting for initialization of the distributed database...
Searching...
Metadata download created
Trying alternative search service
Alternative search failed: no sources found
Found 0
Error: no sources were found for the torrent


Isn't there a way to point this magnet link directly to one known source
(e.g. mhbraun)?


mhb...@freenet.de

unread,
Jan 7, 2015, 7:56:03 AM1/7/15
to aard...@googlegroups.com
We may as well use a common tracker. That should work as well, right?
Which tracker would be recommended?
The three I use are showing at least one peer for the other magnet links.



Sent with TouchDown from my Android phone (www.nitrodesk.com)


-----Original Message-----
From: Frank Röhm [franc...@gmail.com]
Received: Wednesday, 07 Jan. 2015, 12:11
To: aard...@googlegroups.com
Subject: Re: Updated Wikipedia files

mhb...@freenet.de

unread,
Jan 7, 2015, 8:38:48 AM1/7/15
to aard...@googlegroups.com
There's probably a very widespread tracker which is a major source of information about torrents?


Sent with TouchDown from my Android phone (www.nitrodesk.com)


-----Original Message-----
From: mhb...@freenet.de
Received: Wednesday, 07 Jan. 2015, 13:56
To: aard...@googlegroups.com
Subject: RE: Updated Wikipedia files

itkach

unread,
Jan 7, 2015, 9:04:57 PM1/7/15
to aard...@googlegroups.com
I added the link in the morning, and now it's happily chugging along, more than half way through, using Transmission

itkach

unread,
Jan 7, 2015, 9:36:13 PM1/7/15
to aard...@googlegroups.com


On Wednesday, January 7, 2015 8:38:48 AM UTC-5, mhbraun wrote:
There's probably a very widespread tracker which is a major source of information about torrents?

I'm not sure what "widespread" would mean for a tracker. A tracker is a tracker. As with any other service, some are more reliable than others. I've been adding the following trackers - which seem to be decent public trackers - to the torrents I publish (as you may have noticed if you examined the magnet links): udp://tracker.publicbt.com:80, udp://tracker.openbittorrent.com:80, udp://tracker.ccc.de:80

But the magnet links Markus published so far work for me without trackers too (as they should, it just takes a little while to get going). Bittorrent clients use various mechanisms such as DHT and Peer Exchange to find content by the hashsum included in magnet links, which makes it possible to operate without trackers (kind of). Here are some interesting details of how this works: http://superuser.com/questions/592238/in-simple-terms-how-does-a-bittorrent-client-initially-discover-peers-using-dht and http://stackoverflow.com/questions/1181301/how-does-a-dht-in-a-bittorent-client-get-bootstrapped
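For completeness, adding trackers to a magnet link just means appending extra tr= parameters to the URI. A small Python sketch that assembles a link the same way as the ones posted in this thread (the info hash and name below are taken from the enwiki magnet link earlier in the thread; the trackers are two of the public ones listed above):

    from urllib.parse import quote

    def make_magnet(info_hash, name, trackers=()):
        # magnet URI: xt=urn:btih:<hash>, dn=<display name>, tr=<tracker> (repeatable)
        uri = "magnet:?xt=urn:btih:{}&dn={}".format(info_hash, quote(name, safe=""))
        for tracker in trackers:
            uri += "&tr=" + quote(tracker, safe="")
        return uri

    print(make_magnet(
        "86d3e04f4e9c6685d69c4b803dfdce4a6df57e43",   # enwiki-20141201.slob
        "enwiki-20141201.slob",
        ["udp://tracker.publicbt.com:80", "udp://tracker.openbittorrent.com:80"],
    ))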




Frank Röhm

unread,
Jan 8, 2015, 8:25:31 AM1/8/15
to aard...@googlegroups.com
On 08.01.2015 at 03:04, itkach wrote:
> On Wednesday, January 7, 2015 6:11:50 AM UTC-5, franc wrote:
>
>
> I added the link in the morning, and now it's happily chugging along,
> more than half way through, using Transmission
> <https://www.transmissionbt.com/>.
>

I downloaded the torrent again from Copy, this time not the magnet, and now it
is happily downloading :)

But I will soon install Transmission on my Buffalo NAS, where I currently have
only this crippled Buffalo torrent client.

spirit...@gmail.com

unread,
Jan 23, 2015, 1:24:44 AM1/23/15
to aard...@googlegroups.com
Hi, I have compiled a slob file for the Vietnamese Wikipedia. You can find it here. Maybe I will update the homepage's Dictionaries page later when I have time, but it would be great if you could do it for me. Thank you.

Also, future updates will be uploaded to this folder: https://www.mediafire.com/folder/50g4bb0w4g71e/Wiki

mhbraun

unread,
Feb 12, 2015, 3:49:09 PM2/12/15
to aard...@googlegroups.com
dewiki-20150211 is available as a magnet link and on copy.com, not on Mega.

However, with the new mwscrape I am recompiling a new dewiki-20150213.slob which should be available on the weekend.
The newly compiled version will be available on all three sources.

Mohsen Khanpour

unread,
Feb 13, 2015, 2:16:21 PM2/13/15
to aard...@googlegroups.com
Hi, thanks for the membership. I am a teacher and I don't know how to compile using the tools; I am clueless, actually. I wanted to ask whether you could kindly compile the English Wikiquote in slob format, because the old one is from nearly a year ago and it definitely needs updating.

Thank you again. 

On Tuesday, October 14, 2014 at 11:22:50 PM UTC+3:30, mhbraun wrote:
Usually I announce new updates on Wikipedia *.aar files for Aard on Twitter @MarkusHBraun
And I will continue to do so. So if you want to be notified just follow me on Twitter.

It so happened that I was out of town and did not read my timeline for a couple of days, so I possibly missed some information that could have been interesting for me. And this probably happens to others as well, I thought. So I decided to list new updates on the wikis I create in this thread.

If anyone else is creating their own wikis, they are welcome to add the information here so that we have a common source for wiki updates.

mhbraun

unread,
Feb 14, 2015, 4:34:25 AM2/14/15
to aard...@googlegroups.com
Mwscrape is running to build the database. This will take a couple of days. It looks very slow, actually. See bit.ly/AardWikiEN in a couple of days.
I hope this will work.

itkach

unread,
Feb 14, 2015, 12:03:33 PM2/14/15
to aard...@googlegroups.com
On Friday, February 13, 2015 at 2:16:21 PM UTC-5, Mohsen Khanpour wrote:
Hi, thanks for the membership. I am a teacher and I don't know how to compile using the tools; I am clueless, actually. I wanted to ask whether you could kindly compile the English Wikiquote in slob format, because the old one is from nearly a year ago and it definitely needs updating.

Markus Braun

unread,
Feb 14, 2015, 2:00:21 PM2/14/15
to aard...@googlegroups.com

Thanks Igor, I will stop my scrape.

Mohsen Khanpour

unread,
Feb 14, 2015, 2:10:19 PM2/14/15
to aard...@googlegroups.com
Thanks itkach and mhbraun. I got it. You are great.

mhbraun

unread,
Feb 15, 2015, 12:00:06 PM2/15/15
to aard...@googlegroups.com

Installed the improved mwscrape and mwscrape2slob, which improve some functionality with the Wikipedias.

dewiki-20150215.slob is available with

magnet:?xt=urn:btih:82f61491faee1400d389de75ca1217ea61634ae3&dn=dewiki-20150215.slob

copy.com is still uploading, but it will soon be available on
 
German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

and as a mirror on

bit.ly/MegaEnwiki
bit.ly/MegaDewiki

The copy.com or Mega download is much faster than the torrent...


mhbraun

unread,
Feb 24, 2015, 5:09:48 PM2/24/15
to aard...@googlegroups.com

New simple #English #offline #Wikipedia simpleenwiki-20150224.slob for #aarddict 2 reader is online http://bit.ly/AardWikiEN


Generally I host my wikis

Shorty66

unread,
Mar 2, 2015, 6:58:54 AM3/2/15
to aard...@googlegroups.com
I tried the newest dewiki in combination with aard2 1.6.
There are still no separate links to geo:: intents - wasn't that supposed to be there?

Thanks for your work!

Markus Braun

unread,
Mar 2, 2015, 7:35:22 AM3/2/15
to aard...@googlegroups.com

How do you test this?
If I tap on the coordinates of e.g. Nashville, the browser opens with http://tools.wmflabs.org/geohack/geohack.php?pagename=Nashville&language=de&params=36.165833333333_N_86.784444444444_W_dim:25000_region:US-TN_type:city(626681)

Tested  with 0.16 and dewiki20150215.slob

itkach

unread,
Mar 2, 2015, 8:22:18 AM3/2/15
to aard...@googlegroups.com
On Monday, March 2, 2015 at 7:35:22 AM UTC-5, MHBraun wrote:

How do you test this?
If I tap on the coordinates of e.g. Nashville, the browser opens with http://tools.wmflabs.org/geohack/geohack.php?pagename=Nashville&language=de&params=36.165833333333_N_86.784444444444_W_dim:25000_region:US-TN_type:city(626681)

There's supposed to be a small globe icon (https://en.wikipedia.org/wiki/File:Globe.svg) next to the coordinate link. The coordinates text is a regular http link (pointing to http://tools.wmflabs.org/), while the icon is a geo: URI link, which asks what to do on desktop and offers to open a map app on Android.
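To make that concrete: the globe is just an extra anchor whose href is a geo: URI built from the same coordinates, roughly like this (a sketch of the idea, not the actual mwscrape2slob code):

    def geo_link(lat, lon, label="(map)"):
        # geo: URI (RFC 5870); Android offers to open it in a map application.
        return '<a href="geo:{},{}">{}</a>'.format(lat, lon, label)

    # Nashville, using the coordinates from the geohack link above.
    print(geo_link(36.165833333333, -86.784444444444))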

Tested  with 0.16 and dewiki20150215.slob 

I tried the newest dewiki in combination with aard2 1.6.
There are still no separate links to geo:: intents - wasn't that supposed to be there?

Thanks for your work!


On Tuesday, February 24, 2015 at 11:09:48 PM UTC+1, mhbraun wrote:

New simple #English #offline #Wikipedia simpleenwiki-20150224.slob for #aarddict 2 reader is online http://bit.ly/AardWikiEN


Generally I host my wikis

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN



Markus Braun

unread,
Mar 3, 2015, 4:29:46 AM3/3/15
to aard...@googlegroups.com

This icon is not in my compilation. How do I add this function?


itkach

unread,
Mar 3, 2015, 9:19:34 AM3/3/15
to aard...@googlegroups.com
On Tuesday, March 3, 2015 at 4:29:46 AM UTC-5, MHBraun wrote:

This icon is not in my compilation. How do I add this function?

You need to compile with up-to-date mwscrape2slob (the changes to support the geo microformat were made on Feb 8 and 9, see https://github.com/itkach/mwscrape2slob/commits/master).

 

Markus Braun

unread,
Mar 3, 2015, 9:43:30 AM3/3/15
to aard...@googlegroups.com

I am confused now, as I updated mwscrape2slob after our discussion in the other thread on Feb 13.
I have to investigate what's wrong with my setup.


itkach

unread,
Mar 3, 2015, 9:47:47 AM3/3/15
to aard...@googlegroups.com
Before compiling a dictionary with all articles, it is useful to compile just a few articles first (with -k or --start/--end) and check the results on the desktop with https://github.com/itkach/slobby or https://github.com/itkach/aard2-web and then, if all looks good, on the device.
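For example, a throwaway test build with a single article (using the -k option mentioned above) can be scripted and inspected before committing to a multi-hour full run; a Python sketch, with the database name and key chosen as examples:

    import subprocess

    DB = "http://127.0.0.1:5984/de-wikipedia-org"   # assumed scrape database
    # Compile a single test article into a small .slob, then inspect it with
    # slobby or aard2-web before kicking off the full compilation.
    subprocess.run(
        ["mwscrape2slob", DB, "-k", "Nashville", "-o", "test-nashville.slob"],
        check=True,
    )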

Shorty66

unread,
Mar 4, 2015, 5:39:45 PM3/4/15
to aard...@googlegroups.com
Yeah, I was missing the small globe.
I am very happy that this feature was added, by the way - thanks Igor!

Markus Braun

unread,
Mar 4, 2015, 5:46:11 PM3/4/15
to aard...@googlegroups.com

I created a testwiki with Nashville. The globe is not visible in Aard2 0.16.

From within the environment (env-slob):
pip uninstall mwscrape2slob
pip install (git URL for mwscrape2slob)
Successful installation

Recreated the testwiki with -k Nashville.
The globe is not visible in Aard2 0.16.

Can you create a correct sub-dictionary for me so I can see exactly what we are looking for? I do not see that globe in the Nashville article, and tapping on the coordinates results in the behaviour described above.



Markus Braun

unread,
Mar 4, 2015, 5:49:14 PM3/4/15
to aard...@googlegroups.com

It is not included, is it?

dewiki-Nashville.slob

itkach

unread,
Mar 4, 2015, 8:10:29 PM3/4/15
to aard...@googlegroups.com
On Wednesday, March 4, 2015 at 5:46:11 PM UTC-5, MHBraun wrote:

I created a testwiki with Nashville. The globe is not visible in Aard2 0.16.

From within the environment (env-slob):
pip uninstall mwscrape2slob
pip install (git URL for mwscrape2slob)
Successful installation

Recreated the testwiki with -k Nashville.
The globe is not visible in Aard2 0.16.

Can you create a correct sub-dictionary for me so I can see exactly what we are looking for? I do not see that globe in the Nashville article, and tapping on the coordinates results in the behaviour described above.


I took a closer look - it appears that dewiki uses a different flavor of geo microformat. Well, it's not much of a format really since obviously people can't make up their mind about what it is. Anyway, I'll update mwscrape2slob to also handle dewiki...

itkach

unread,
Mar 4, 2015, 8:16:05 PM3/4/15
to aard...@googlegroups.com

Markus Braun

unread,
Mar 5, 2015, 3:46:44 AM3/5/15
to aard...@googlegroups.com

Thanks for your investigation. This looks like a typical Wikimedia problem of unstructured data. Hopefully there are not too many variations of the geo tag that would make the fix complex.

I am intentionally not mentioning other Wikipedia languages, which may have additional variations.



mhbraun

unread,
Mar 7, 2015, 5:05:42 PM3/7/15
to aard...@googlegroups.com
I had a look at my simple-m-wikipedia and the globe is there and working. Hence my setup is fine.
Thanks.

mhbraun

unread,
Mar 9, 2015, 1:44:07 AM3/9/15
to aard...@googlegroups.com
New English Wikipedia for Aard2 available

magnet:?xt=urn:btih:4ef897aa2903746e64b69efd5a4b5cc4615ae64e&dn=enwiki-20150308.slob&tr=http%3A%2F%2Fcoppersurfer.tk%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.openbittorrent.com%2Fannounce&tr=http%3A%2F%2Ftracker.leechers-paradise.org%3A6969%2Fannounce&tr=http%3A%2F%2Fopen.domnii.com%3A1337%2Fannounce&tr=http%3A%2F%2Fannounce.torrentsmd.com%3A6969%2Fannounce



On Tuesday, October 14, 2014 at 9:52:50 PM UTC+2, mhbraun wrote:
Usually I announce new updates on Wikipedia *.aar files for Aard on Twitter @MarkusHBraun
And I will continue to do so. So if you want to be notified just follow me on Twitter.

It so happened that I was out of town and did not read my timeline for a couple of days, so I possibly missed some information that could have been interesting for me. And this probably happens to others as well, I thought. So I decided to list new updates on the wikis I create in this thread.

If anyone else is creating their own wikis, they are welcome to add the information here so that we have a common source for wiki updates.

Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

mhbraun

unread,
Mar 24, 2015, 12:55:38 PM3/24/15
to aard...@googlegroups.com
New German Wikipedia of 20150324 available here:

magnet:?xt=urn:btih:5c924f3e25f0b2ccdbbaebac83999a381a6353ac&dn=dewiki-20150324.slob&tr=http%3A%2F%2Fcoppersurfer.tk%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.openbittorrent.com%2Fannounce&tr=http%3A%2F%2Ftracker.leechers-paradise.org%3A6969%2Fannounce&tr=http%3A%2F%2Fopen.domnii.com%3A1337%2Fannounce&tr=http%3A%2F%2Fannounce.torrentsmd.com%3A6969%2Fannounce

mhbraun

unread,
Mar 24, 2015, 4:33:57 PM3/24/15
to aard...@googlegroups.com
I did not follow up on this subject recently. Are the changes for dewiki included in mwscrape2slob?

itkach

unread,
Mar 31, 2015, 2:05:24 PM3/31/15
to aard...@googlegroups.com


On Tuesday, March 24, 2015 at 4:33:57 PM UTC-4, mhbraun wrote:
I did not follow up on this subject recently. Are the changes for dewiki included in mwscrape2slob?
I took a closer look - it appears that dewiki uses a different flavor of geo microformat. Well, it's not much of a format really since obviously people can't make up their mind about what it is. Anyway, I'll update mwscrape2slob to also handle dewiki...

mhbraun

unread,
Apr 9, 2015, 6:49:11 AM4/9/15
to aard...@googlegroups.com

I am testing a new mirror for dewiki:

bit.ly/OneDewiki

The files are hosted on OneDrive by Microsoft. I currently have no idea whether this has any hidden (traffic or other) restrictions similar to copy.com. The maximum file size is 10 GB, which will probably not allow me to upload the enwiki, as the enwiki exceeds 10 GB.

Having had no feedback about trouble with MEGA so far, I assume there is no restriction. On the other hand, the volume I have on Mega is too small to host all the dictionaries.

Any feedback about OneDrive would be appreciated.

Markus Braun

unread,
Apr 9, 2015, 6:36:41 PM4/9/15
to aard...@googlegroups.com

OneDrive refused to accept enwiki.slob at 12 GB. The limit is still 10 GB per file.


igo...@o2online.de

unread,
Apr 11, 2015, 7:12:32 AM4/11/15
to aard...@googlegroups.com
On Thursday, April 9, 2015 at 12:49:11 PM UTC+2, mhbraun wrote:

I am testing a new mirror for dewiki:

bit.ly/OneDewiki

The files are hosted on OneDrive by Microsoft. I currently have no idea whether this has any hidden (traffic or other) restrictions similar to copy.com. The maximum file size is 10 GB, which will probably not allow me to upload the enwiki, as the enwiki exceeds 10 GB.

Having had no feedback about trouble with MEGA so far, I assume there is no restriction. On the other hand, the volume I have on Mega is too small to host all the dictionaries.

Any feedback about OneDrive would be appreciated.

I could finally download dewikiquote and dewiktionary ;)
Thanks for that!

Markus Braun

unread,
Apr 11, 2015, 9:19:29 AM4/11/15
to aard...@googlegroups.com

You are welcome.
If you want to share it, put the name of the file and the download location into this thread,
so that others may access it.


mhbraun

unread,
Apr 12, 2015, 9:15:55 AM4/12/15
to aard...@googlegroups.com
New dewiki-20140412.slob and *.aar are available.


Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

and as a mirror on

bit.ly/MegaEnwiki
bit.ly/MegaDewiki
 

Markus Braun

unread,
Apr 12, 2015, 9:52:43 AM4/12/15
to aard...@googlegroups.com

...is currently uploading...
Torrent is available as well.


mhbraun

unread,
May 2, 2015, 12:15:41 PM5/2/15
to aard...@googlegroups.com
New #English #offline #Wikivoyage enwikivoyage-20150502.slob for #Aard2 reader and #Goldendict is uploading on http://bit.ly/OneEnwiki and to
http://bit.ly/MegeDewiki and http://bit.ly/AardWikiDe
 Enjoy

mhbraun

unread,
May 2, 2015, 12:19:46 PM5/2/15
to aard...@googlegroups.com
FromTwitter (@MarkusHBraun)
New #English #offline #Wikivoyage enwikivoyage-20150502.aar for #Aard reader and #Goldendict is uploading on http://bit.ly/OneEnwiki
and http://Megadewiki




On Tuesday, October 14, 2014 at 9:52:50 PM UTC+2, mhbraun wrote:
Usually I announce new updates on Wikipedia *.aar files for Aard on Twitter @MarkusHBraun
And I will continue to do so. So if you want to be notified just follow me on Twitter.

It so happened that I was out of town and did not read my timeline for a couple of days, so I possibly missed some information that could have been interesting for me. And this probably happens to others as well, I thought. So I decided to list new updates on the wikis I create in this thread.

If anyone else is creating their own wikis, they are welcome to add the information here so that we have a common source for wiki updates.

mhbraun

unread,
May 2, 2015, 12:55:32 PM5/2/15
to aard...@googlegroups.com
From Twitter:
New #English #offline #Wikipedia simpleenwiki-20150429.slob for #Aard2 reader and #Goldendict uploading on http://bit.ly/AardWikiEN and
on http://bit.lyMegaEnwiki
and  http://bit.ly/OneEnwiki

mhbraun

unread,
May 3, 2015, 3:24:45 PM5/3/15
to aard...@googlegroups.com
From Twitter:
New #German #deutsche #offline #Wikipedia dewiki-20150503.slob for #Aard2 reader and #Goldendict is uploading on http://bit.ly/MegaDewiki
This is the magnet link:

magnet:?xt=urn:btih:9dd41ad22425345a2cbaa351ac39cf2f9c1e3aa1&dn=dewiki-20150503.slob&tr=http%3A%2F%2Fannounce.torrentsmd.com%3A6969%2Fannounce&tr=http%3A%2F%2Fbigfoot1942.sektori.org%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.leechers-paradise.org%3A6969%2Fannounce

igo...@o2online.de

unread,
May 20, 2015, 11:16:33 AM5/20/15
to aard...@googlegroups.com
German Wikipedia 2015-05-03 (slob)
FileHoster: Share-Online.biz
Archive: RAR

dewiki-20150503.part1.rar (1,95 GB)
dewiki-20150503.part2.rar (1,85 GB)

MD5:
Part1: 0A74A114F8572637F4A978475A55AB18
Part2: 8050F3180ADC8AC116C57CB40A009116

mhbraun

unread,
Jun 4, 2015, 6:11:42 AM6/4/15
to aard...@googlegroups.com
New #German #deutsche #offline #Wikipedia dewiki-20150603.slob for #Aard2 reader and #Goldendict is uploading on http://bit.ly/MegaDewiki

mhbraun

unread,
Jun 4, 2015, 6:13:41 AM6/4/15
to aard...@googlegroups.com
New #German #deutsche #offline #Wikipedia dewiki-20150603.slob for #Aard2 reader and #Goldendict is uploading on http://bit.ly/AardWikiDe


AND there is a mirror on

https://onedrive.live.com/?cid=39D3E014318089FC&id=39d3e014318089fc!412&authkey=!AD-RYa4vbINwGpg

mhbraun

unread,
Jun 4, 2015, 6:14:46 AM6/4/15
to aard...@googlegroups.com
Torrent for dewiki-20150603.slob:

magnet:?xt=urn:btih:d97f85abb1eb69526c801164cb189bdb19da6554&dn=dewiki-20150603.slob&tr=http%3A%2F%2Fannounce.torrentsmd.com%3A6969%2Fannounce&tr=http%3A%2F%2Fbigfoot1942.sektori.org%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.leechers-paradise.org%3A6969%2Fannounce

Joel Korhonen

unread,
Jun 7, 2015, 2:32:53 AM6/7/15
to aard...@googlegroups.com
Hi!

I just joined this fine group :)

Is it possible to run mwscrape against a locally hosted wiki, e.g. using a Wikimedia site dump? It would avoid a lot of network traffic. I don't mind in this case that the dumps tend to be a bit old and large.

E.g., can these be used to do it, either the SQL dumps or the XML dumps?

http://dumps.wikimedia.org/enwiktionary/latest/

I earlier wrote a simple Java-based parser that converts some wikis to StarDict and then also to dictd format. It uses this as input, or the de, fi or sv one:

http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2

I first run a preparser on it, StripNamespaces, which throws away everything not in NS0 and also explicitly drops some titles such as Help:. This makes the file to be parsed a lot smaller. My parser isn't very good though. I haven't yet looked at how Aard2 parses files, but judging from the results it must be a much better approach :)

You can find some StarDict files at my site http://dictinfo.com (sorry, no https) and I just migrated the project yesterday to https://github.com/korhoj/wiktionary-convert/

Cheers,
Joel

Joel Korhonen

unread,
Jun 7, 2015, 5:20:32 AM6/7/15
to aard...@googlegroups.com
I checked that https://en.wikipedia.org/wiki/Wikipedia:Database_download actually states that one shouldn't use crawlers, and at most 2 threads in any case.

I think I tried something like WikiFilter earlier, although that seems to be Windows-only and whatever I tried wasn't just for Windows. It used XAMPP like WikiFilter does, but maybe it was a database import or something. Apparently such imports are very slow, though.

offline-wikipedia-project sounds better, but I don't know yet if it has an HTTP API. I will test it.

Cheers,
Joel

Joel Korhonen

unread,
Jun 7, 2015, 7:44:25 AM6/7/15
to aard...@googlegroups.com
offline-wikipedia-project turned out to be quite old. I also tried compiling xowa, but running it crashes my OpenJDK JVM. Maybe it would be happier with Oracle's JDK. It did try to download the xxx-latest-pages-articles.xml.bz2 and would have parsed that. But I don't think that xowa has an HTTP API either. So I suppose I'll need to try setting up XAMPP.

Markus Braun

unread,
Jun 7, 2015, 9:02:40 AM6/7/15
to aard...@googlegroups.com

Before using couchdb as the source of articles, we used the MediaWiki dumps.

However, with the ongoing changes in the dumps there was no chance to render the pages correctly anymore (missing data etc.).

The current system works quite nicely and is adapted to the storage capabilities of mobile phones.

However I am curious about your findings. Keep us posted about your approach.


On 07.06.2015 at 13:44, "Joel Korhonen" <eris...@gmail.com> wrote:
offline-wikipedia-project turned out to be quite old. I also tried compiling xowa, but running it crashes my OpenJDK JVM. Maybe it would be happier with Oracle's JDK. It did try to download the xxx-latest-pages-articles.xml.bz2 and would have parsed that. But I don't think that xowa has an HTTP API either. So I suppose I'll need to try setting up XAMPP.

--

itkach

unread,
Jun 7, 2015, 9:35:37 AM6/7/15
to aard...@googlegroups.com, eris...@gmail.com


On Sunday, June 7, 2015 at 2:32:53 AM UTC-4, Joel Korhonen wrote:
Hi!

I just joined this fine group :)


Welcome :)
 
Is it possible to run mwscrape against a locally hosted Wiki,

Yes. 
 
e.g. use a wikimedia site dump?

No.
 
You need a locally hosted Wiki site. Not a dump, an actual site with the MediaWiki web API fully enabled. mwscrape downloads rendered article HTML via the MediaWiki web API. Downloaded articles are stored in a local CouchDB for further offline processing and conversion (see https://github.com/itkach/mwscrape2slob). 
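
(To illustrate the kind of request this involves — a rough sketch only, not mwscrape's actual code — the MediaWiki API's action=parse returns an article rendered to HTML, which can then be stored as a document in a local CouchDB. The sketch assumes the requests and CouchDB-Python packages; the database name and the article title are just examples. The real tool also handles revisions, retries and resumption, so for actual scraping use mwscrape itself.)

    import requests
    import couchdb

    API = "https://en.wikipedia.org/w/api.php"  # or the api.php of a locally hosted wiki

    def fetch_rendered_html(title):
        # action=parse renders the article to HTML on the server side
        resp = requests.get(API, params={
            "action": "parse",
            "page": title,
            "prop": "text",
            "format": "json",
        })
        resp.raise_for_status()
        return resp.json()["parse"]["text"]["*"]

    server = couchdb.Server("http://localhost:5984/")
    db = server["scraped"] if "scraped" in server else server.create("scraped")

    title = "CouchDB"
    # first-time insert; updating an existing document would also need its _rev
    db.save({"_id": title, "html": fetch_rendered_html(title)})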

It would avoid a lot of network traffic.

The complexity of setting up MediaWiki and importing Wikipedia data from the SQL dumps makes downloading everything over the network more attractive, even if it is a lot of traffic (and time).
 
I don't mind in this case that the dumps tend to be a bit old and large.

E.g. can these be used to do it, either the sqls or the xmls?

http://dumps.wikimedia.org/enwiktionary/latest/


Aard 2's predecessor, Aard Dictionary, used https://github.com/aarddict/tools based on the https://github.com/pediapress/mwlib parser. This was pretty good for a while, but the introduction of Lua templates and Wikidata put an end to that. MediaWiki is pretty insane; I doubt there will ever be an alternative implementation good enough to properly render all the articles.


Joel Korhonen

unread,
Jun 7, 2015, 11:21:54 AM6/7/15
to aard...@googlegroups.com, eris...@gmail.com
Hi!

Thanks for the info.
 
Aard 2's predecessor, Aard Dictionary, used https://github.com/aarddict/tools based on the https://github.com/pediapress/mwlib parser. This was pretty good for a while, but the introduction of Lua templates and Wikidata put an end to that. MediaWiki is pretty insane; I doubt there will ever be an alternative implementation good enough to properly render all the articles.

I agree there is no other viable way. I wrote a pretty crude converter which took ages to program and still misses much of even the most basic info. It's enough for most of my personal needs, though.

I'm really happy with the Aard2 conversion quality. The only problem is that the e-readers I use don't support Aard2 as a target for looking up words. I suppose I should ask e.g. the MoonReader Pro developer to add support :) It would be faster than using the clipboard.

I gave a quick try to installing XAMPP using some fancy pants GUI installer, but it didn't even work properly... It installed 3 pieces of server software but only 1, the FTP server, started. The other two, Apache and MySQL, failed to start; their logs showed they were already running when the installer tried to start them. This happened even when I explicitly stopped them before letting the GUI attempt to start them. Well, maybe something is auto-restarting them, or maybe that fancy GUI doesn't know how to check properly for PIDs. Oh well, so much for quick attempts :) Better to do it with more time using the time-honoured command line :) The funny thing is that I once installed XAMPP on Windows and it went pretty smoothly, although I do think I had a preinstalled MySQL then.

I actually converted various Wiktionaries to a MySQL DB back then but haven't tinkered with that lately. CouchDB seems like a nice approach.

mhbraun

unread,
Jun 23, 2015, 2:24:13 PM6/23/15
to aard...@googlegroups.com

New #English #offline #Wikipedia enwiki-20150620.slob for #Aard2 reader and #Goldendict is uploading on http://bit.ly/MegaDewiki

mhbraun

unread,
Jun 23, 2015, 2:25:35 PM6/23/15
to aard...@googlegroups.com
The magnet link for torrent download of enwiki-20150620.slob is

magnet:?xt=urn:btih:27ae7d60b4c0485ed74723a298b8dd6751b7c111&dn=enwiki-20150620.slob&tr=http%3A%2F%2Fannounce.torrentsmd.com%3A6969%2Fannounce&tr=http%3A%2F%2Fbigfoot1942.sektori.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fopen.demonii.com%3A1337

mhbraun

unread,
Jul 7, 2015, 5:14:25 PM7/7/15
to aard...@googlegroups.com
New #German #deutsche #offline #Wikipedia dewiki-20150705.aar for #Aarddict reader and #Goldendict is uploading on http://bit.ly/MegaDewiki

Also available on bit.ly/OneDewiki or as a torrent at

https://github.com/itkach/slob/wiki/Dictionaries#german




mhbraun

unread,
Aug 18, 2015, 9:44:07 PM8/18/15
to aarddict
New dewiki is compiled:
The torrent is
magnet:?xt=urn:btih:OBVFTHNXP2AKNBP7KSAWFQJAZS5GX4XI&dn=dewiki-20150815.slob&tr=http%3a%2f%2fannounce.torrentsmd.com%3a6969%2fannounce&tr=http%3a%2f%2fbigfoot1942.sektori.org%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2fopen.demonii.com%3a1337

Uploads to all the other sites are still in progress.

Generally I host my

German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN

and as a mirror on

bit.ly/MegaEnwiki
bit.ly/MegaDewiki
 

Tony Olsen

unread,
Aug 19, 2015, 6:38:14 PM8/19/15
to aarddict
I've been trying to download this for over a day now, but uTorrent can't find the meta-data (and Torrent file) for this... and there don't appear to be any seeders.  I have enwiki-20150406.slob seeding however, but enwiki-20150620.slob can't be found on the torrent networks.

I'm also trying to download it via MEGA, but the first time (after many hours) it stopped at the 10 GB limit claiming an HTML5 storage limitation.  I'm trying again using the MEGA extension in Chrome, but I'm worried I may get the same issue again.

Anyone else hitting a 10GB limit downloading from MEGA?

itkach

unread,
Aug 19, 2015, 9:04:31 PM8/19/15
to aarddict


On Wednesday, August 19, 2015 at 6:38:14 PM UTC-4, Tony Olsen wrote:
I've been trying to download this for over a day now, but uTorrent can't find the meta-data (and Torrent file) for this... and there don't appear to be any seeders.  I have enwiki-20150406.slob seeding however, but enwiki-20150620.slob can't be found on the torrent networks.

I'm also trying to download it via MEGA, but the first time (after many hours) it stopped at the 10 GB limit claiming an HTML5 storage limitation.  I'm trying again using the MEGA extension in Chrome, but I'm worried I may get the same issue again.

Anyone else hitting a 10GB limit downloading from MEGA?


I think I tried it once and it stopped at 5 GB in Chrome. I spent a few minutes looking for instructions on how to lift the limit, found nothing and abandoned this method, since the torrent worked just fine for me. The first two trackers appear to be down (torrentsmd and bigfoot1942.sektori.org) but the other three seem fine, showing 2-3 seeders. Maybe it'll help if you edit the magnet link and remove the first two trackers.
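
(If it helps, here is a small illustrative Python helper — not part of any Aard tool — that drops the dead trackers' tr= parameters from a magnet link while leaving the info hash and the remaining trackers untouched. The tracker host names are taken from the links posted earlier in this thread; the magnet_link value is a placeholder.)

    DEAD_TRACKERS = ("announce.torrentsmd.com", "bigfoot1942.sektori.org")

    def strip_trackers(magnet):
        # A magnet URI is "magnet:?" followed by &-separated key=value pairs;
        # each tracker is a tr= parameter whose percent-encoded URL still
        # contains the host name literally, so a substring test is enough.
        head, query = magnet.split("?", 1)
        kept = [p for p in query.split("&")
                if not (p.startswith("tr=") and any(d in p for d in DEAD_TRACKERS))]
        return head + "?" + "&".join(kept)

    # paste the full magnet link from the post above here
    magnet_link = "magnet:?xt=urn:btih:...&dn=enwiki-20150620.slob&tr=..."
    print(strip_trackers(magnet_link))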

mhb...@freenet.de

unread,
Aug 21, 2015, 6:20:41 AM8/21/15
to aard...@googlegroups.com
Thanks for reporting this limit with MEGA. I have no problem uploading with the client, even though it takes a couple of days. Downloading worked fine for me, but I have not tested it recently. Will look into it.

Torrents are working fine. There are 3-5 seeders per wiki. It takes a couple of hours until the torrent is known. The first upload takes a LOT of time; subsequent downloads are much faster thanks to the multiple sources.






mhbraun

unread,
Aug 26, 2015, 2:19:46 AM8/26/15
to aarddict
I made some download tests with different IPs to verify the 10 GB limit on MEGA.
Apparently I can not confirm it. All downloads with Firefox (I did not use another client) went well and completed. Of course the time differed according to the connection: it took between one and twelve hours to complete the 12 GB enwiki-20150620.slob file.

The bad news: I can not replicate the error.
The good news: The download limit does not exist. And the speed is aweful if you have a good connection.

The issue with the magnet link for the torrent is not really an issue; it is just a matter of waiting until the torrent is distributed. For enwiki-20150620 I have uploaded more than 150 GB already - not counting how much the four peers, who probably have much more bandwidth than I do, have seeded.