1. If you mean starting and restarting scraping, then the answer is yes, no problem. If you mean compiling the slob from CouchDB, then I do not know; I always did it in one run. It takes 3 to 4 hours for dewiki.
2. As far as I understand, all changes are applied to the CouchDB with a new scrape. You do not need compaction on CouchDB for a correct result. However, it is very space-consuming if you do not compact the CouchDBs.
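For reference, compaction can be triggered through CouchDB's HTTP API. A minimal sketch in Python - the database name here is just an example, use the name of your own scrape database, and add credentials if your CouchDB requires them:

    # Trigger compaction of a CouchDB database over its HTTP API.
    # "de-wikipedia-org" is an example name; substitute your own database.
    import requests

    resp = requests.post(
        'http://localhost:5984/de-wikipedia-org/_compact',
        headers={'Content-Type': 'application/json'},
        # auth=('admin', 'password'),  # uncomment if your CouchDB needs admin auth
    )
    print(resp.status_code, resp.json())  # expect 202 and {'ok': True}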
I've recently updated my dictionaries from the Italian Wikimedia minor projects using the "mwscrape" method (I know, I complained about its slowness, but after all it gives you excellent output and the scraping process can be resumed whenever you want, so it isn't as bad as I believed at the beginning...). You can see (and download) my "creations" here: https://www.wuala.com/bittachi/aarddict/
Meanwhile, I thought about compiling an updated version of the Italian Wikipedia, but I still have some doubts:
1) I want to run the compilation, but be able to stop it and resume the process later.
Is it possible to do something like that?
(I thought of splitting up the dictionary compilation using the "startkey" and "endkey" options in mwcouch, but I'm afraid of generating independent dictionaries instead of separate volumes gathered into a single dictionary.)
2) If I decide to rescrape Wikipedia in the future, could the scraping process take into account the articles deleted from the encyclopedia after the first scraping (and remove them from the Couch database automatically) or not?
Would be nice if you could also compile .slob dictionaries for Aard 2 :) Or share your .couch database files.
I don't understand your question. Getting Wikipedia articles with mwscrape is a separate process. It can be stopped and later resumed. You can run several mwscrape processes at the same time, instructing each process to start at a different point in the list of article titles returned by the Wikipedia API. Compiling .aar or .slob from the Couch database created by mwscrape is another process, and it is not resumable.
mwscrape requests the list of available article titles from the Wikipedia web API, and then compares the revision Wikipedia has with the revision mwscrape's CouchDB database has from previous scrapes. If it's a new article or a new revision, it gets downloaded and stored or updated. In this process, previously available and now deleted articles remain in CouchDB. Another script would need to go over what's in CouchDB and check whether each article is still available; mwscrape currently doesn't implement this. However, my understanding is that in general Wikipedia articles are not deleted very often. One of the reasons for deletion would be "lack of notability" of the subject described. Vandalism (somebody inserting bogus articles) could be another. I'm not sure how big of a problem this actually is. Are you concerned with accuracy of data, or is it something else?
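Such a cleanup script is not part of mwscrape, but a rough sketch of the idea in Python could look like this - assuming the "couchdb" and "requests" packages, a local CouchDB with the scrape database named "it-wikipedia-org" (use your own name), and that mwscrape stores one document per article title:

    # Rough sketch of the cleanup idea described above; not part of mwscrape.
    # Assumes document ids in the scrape database are article titles.
    import couchdb
    import requests

    API = 'https://it.wikipedia.org/w/api.php'
    db = couchdb.Server('http://localhost:5984/')['it-wikipedia-org']

    for title in db:
        if title.startswith('_'):   # skip design documents
            continue
        data = requests.get(API, params={
            'action': 'query', 'titles': title, 'format': 'json'}).json()
        pages = data.get('query', {}).get('pages', {})
        if any('missing' in page for page in pages.values()):
            print('no longer on Wikipedia:', title)
            # del db[title]  # uncomment to actually remove it from CouchDB
        # (for a real run you would batch titles, e.g. 50 per request, and throttle)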
Just to be precise:
Creating the CouchDB for dewiki took me 4 weeks. Updating the existing CouchDB takes 1 to 2 weeks with up to 10 scrapes running.
So I do keep the CouchDBs.
Generating the dictionary takes 3 to 4 hours.
On Thursday, 16 October 2014 at 04:07:15 UTC+2, itkach wrote: Would be nice if you could also compile .slob dictionaries for Aard 2 :) Or share your .couch database files.
Didn't you notice that I uploaded both .aar and .slob files for every dictionary?
I'll keep the double format until you officially release Aard 2. As for the .couch database files, I didn't care about them, so I've already deleted them. I'll keep them next time, I promise.
OK, so you confirm (as MHBraun said previously) that the scraping can be done in several runs, unlike the dictionary compilation from the database, which can only be done in one go. I supposed that the compilation could be performed in several passes by selecting specific alphabetical ranges (e.g., using the "--s" and "--e" options, I first compile the articles from the beginning up to the letter 'A', then from 'B' to 'C', then from 'D' to 'E', and so on...), but I realized that doing it like that generates independent dictionaries. Well, I'll try to compile everything at once (the Italian Wikipedia is more or less the same size as the German one, so I hope to complete the process in roughly the time MHBraun stated).
AFAIK at least 10 articles (on average) are deleted every day on the Italian Wikipedia, so more than 3000 articles will be deleted in a year. Maybe it isn't a big number,
but IMHO I'd rather not keep, in future revisions, articles that were confirmed to be hoaxes, promotion, vandalism or non-notable subjects, that's all.
OK, I've regenerated my slob dictionaries (using the updated "slob" module) and tested them on Aard2 ver. 0.4: everything is OK, except for some problems with the internal links in Wikisource and Wiktionary dictionaries. Maybe it's due to a bad "translation" from the original .aar format during the conversion to .slob format?
Are you sure those problems are not present in .aar? As far as I know, Wikisource always had problems. I also never understood what the point of Wikisource is anyway - taking publicly available books and putting them into an awkward-to-use, weird format not suitable for books. What's wrong with text files, fb2, epub and such? As for Wiktionary, internal links should work equally in both .aar and .slob. Do you have specific examples (which article, which link) where this is not the case?
I'm pretty sure the original .aar files aren't corrupted (you can test the ones I've already uploaded to my Wuala repository by loading them in the classic Aard app).
Anyway, I'll try to describe the bugs I've found using Aard 2:
- in Wikisource .slob articles, if you tap on chapter links (e.g., I'm reading "Divina Commedia/Inferno/Canto I" and I want to move on to the next chapter, "Divina Commedia/Inferno/Canto II"), they point to nothing and the "Not found" message appears on the screen;
- in Wiktionary .slob articles, the links all appear in red, as if the linked item doesn't exist, but it actually does exist (you can check any article inside the dictionary to see this bug).
Due to changes in the slob format, the v4 files are for Aard2 0.4. I did not test compatibility between the versions; I just switched to the new format immediately. So I had two versions, as the 0.3 version was made just before 0.4 was available; there were just a couple of hours of time difference.
Future releases will be made only for the current format.
Igor indicated that the probability of further changes in the slob format is very low.
Just take the v4 or newer file.
Are you sure this is faster than compiling it in one shot?
It would take 18 to 21h this way...
Note that having three separate dictionaries like this is not quite the same as having the same content in one dictionary. When a link is followed, the slob containing the link's originating article is searched first, then the rest of them. If you have many dictionaries, for example Italian and English Wikipedia with matching titles in both, and Italian is split like this, it's possible that following a link from an Italian article would open the English article first. This may not be an issue for you, but there's a difference.
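Roughly, the lookup order is like this (just a sketch with made-up helper names, not Aard 2's actual code):

    # Sketch of the link resolution order described above.
    # "lookup" is a hypothetical per-dictionary search function.
    def follow_link(key, origin, dictionaries):
        # the dictionary the link came from is searched first, then the others
        ordered = [origin] + [d for d in dictionaries if d is not origin]
        for d in ordered:
            article = d.lookup(key)       # hypothetical
            if article is not None:
                return article
        return None                       # shown as "Not found"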
Generally I host my
German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN
On 14 November 2014 21:22:36 CET, mhbraun <mhb...@freenet.de> wrote:
New German
dewikipedia-20141111.slob
for aarddict reader
in slob format is online
Oh, I just downloaded the German wiki from 11.11.2014 and tested it, and there are strange errors in some pages. Example article:
Sexuelle Spannung
See attachment.
I think many other articles are not found, like this one.
What is wrong?
New Italian Wikipedia dictionary (last update: November 3rd, 2014) is now available here: https://mega.co.nz/#F!MAIwRIKZ!-s_kee09DTTokK_ECE2KKw
As I said previously, I deliberately generated the dictionary in 3 separate parts in order to keep both the compilation itself and the file upload/download manageable. I apologize if you consider this solution "out of the ordinary", but my goal is to allow easier distribution of huge dictionaries created from Wikipedia, also for those who (like me) can't afford a fast Internet connection and a powerful computer.
Igor, can you please update the link to my dictionaries on the Aard homepage, so I can finally delete the old copies I stored on Wuala? Thanks in advance.
There's probably a very widespread tracker which is a major source of information about torrents?
Usually I announce new updates on Wikipedia *.aar files for Aard on Twitter @MarkusHBraun
And I will continue to do so. So if you want to be notified just follow me on Twitter.
I was out of town recently and did not read my timeline for a couple of days, so I possibly missed some information which could have been interesting for me. And this probably happens to others as well, I thought. So I decided to list new updates on the wikis I created in this thread.
If anyone else is creating wikis, they are welcome to add the information here, so we have a common source on updates of wikis.
Hi, thanks for the membership. I am a teacher and I don't know how to compile using the code; I am clueless, actually. I wanted to ask whether it would be possible for you to kindly compile the English Wikiquote in slob format, because the old one is from nearly a year ago and it definitely needs updating.
Thanks Igor, I will stop my scrape.
German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN
and as a mirror on
bit.ly/MegaEnwiki
bit.ly/MegaDewiki
How do you test this?
If I tap on the coordinates of e.g. Nashville, the browser opens with http://tools.wmflabs.org/geohack/geohack.php?pagename=Nashville&language=de&params=36.165833333333_N_86.784444444444_W_dim:25000_region:US-TN_type:city(626681)
Tested with 0.16 and dewiki20150215.slob
I tried the newest dewiki in combination with Aard 2 0.16.
There are still no separate links to geo: intents - wasn't that supposed to be there?
Thanks for your work!
On Tuesday, 24 February 2015 at 23:09:48 UTC+1, mhbraun wrote:
New simple #English #offline #Wikipedia simpleenwiki-20150224.slob for #aarddict 2 reader is online http://bit.ly/AardWikiEN
This icon is not in my compilation. How do I add this function?
I am confused now, as I updated mwscrape2slob after our discussion in the other thread on Feb 13.
I have to investigate what's wrong with my setup.
I created a testwiki with Nashville. The globe is not visible in Aard2 0.16.
From within the environment (env-slob):
pip uninstall mwscrape2slob
pip install with the git URL for mwscrape2slob
Successful installation.
Recreated the testwiki with -k Nashville.
The globe is not visible in Aard2 0.16.
Can you create a correct sub-dictionary for me, so I can see exactly what we are looking for? I do not see that globe in the article Nashville, and tapping on the coordinates results in the behaviour described above.
Thanks for your investigation. This looks like a typical Wikimedia problem of unstructured data. Hopefully there are not too many variations of the geo tag, which would make the fix complex.
I am intentionally not mentioning other Wikipedia languages, which may have additional variations.
I have not followed up on the subject recently. Are the changes for dewiki included in mwscrape2slob?
I took a closer look - it appears that dewiki uses a different flavor of the geo microformat. Well, it's not much of a format, really, since obviously people can't make up their minds about what it is. Anyway, I'll update mwscrape2slob to also handle dewiki...
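For illustration only (this is not mwscrape2slob's actual code): the geohack URL quoted above already carries the coordinates in its "params" value, and turning that into a geo: URI that Android can hand to a map app could look roughly like this:

    # Illustration only: convert a geohack "params" value, e.g.
    # "36.165833333333_N_86.784444444444_W_dim:25000_region:US-TN_type:city(626681)",
    # into a geo: URI such as "geo:36.165833333333,-86.784444444444".
    # Real data has more variations (e.g. degree/minute/second form), which is
    # exactly the "flavors" problem discussed above.
    def geohack_to_geo_uri(params):
        parts = params.split('_')
        lat = float(parts[0]) * (1 if parts[1] == 'N' else -1)
        lon = float(parts[2]) * (1 if parts[3] == 'E' else -1)
        return 'geo:{},{}'.format(lat, lon)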
OneDrive refused to accept the 12 GB enwiki.slob. The limit is still 10 GB per file.
I am testing a new mirror for dewiki:
bit.ly/OneDewiki
The files are hosted on OneDrive by Microsoft. Actually I have no idea whether this has any hidden (traffic or other) restrictions similar to copy.com. The maximum file size is 10 GB, which will probably not allow me to upload the enwiki, as the enwiki exceeds 10 GB.
Having had no reports of trouble with MEGA so far, I assume there is no restriction there. On the other hand, the volume I have on MEGA is too small to host all the dictionaries.
Any feedback about OneDrive would be appreciated.
You are welcome.
If you want to share it, put the name of the file and the download location into this thread, so that others may access it.
Generally I host my
German language files dewiki*.* at bit.ly/AardWikiDe
English language files enwiki*.* at bit.ly/AardWikiEN
and as a mirror on
bit.ly/MegaEnwiki
bit.ly/MegaDewiki
...is currently uploading...
Torrent is available as well.
Before using CouchDB as a source for the articles, we used the MediaWiki dumps.
However, with the ongoing changes in the dumps there was no way to render the pages correctly anymore (missing data etc.).
The current system works quite nicely and is adapted to the storage capabilities of mobile phones.
However, I am curious about your findings. Keep us posted about your approach.
offline-wikipedia-project turned out to be quite old. I also tried compiling xowa, but running it crashes my OpenJDK JVM. Maybe it would be happier with Oracle's JDK. It did try to download the xxx-latest-pages-articles.xml.bz2 and would have parsed that. But I don't think that xowa has an HTTP API either. So I suppose I'll need to try setting up XAMPP.
Hi!
I just joined this fine group :)
Is it possible to run mwscrape against a locally hosted wiki,
e.g. using a Wikimedia site dump?
It would avoid a lot of network traffic.
I don't mind in this case that the dumps tend to be a bit old and large.
E.g., can these be used to do it, either the SQL or the XML files?
http://dumps.wikimedia.org/enwiktionary/latest/
Aard 2's predecessor, Aard Dictionary, used https://github.com/aarddict/tools based on the https://github.com/pediapress/mwlib parser. This was pretty good for a while, but the introduction of Lua templates and Wikidata put an end to that. MediaWiki is pretty insane; I doubt there will ever be an alternative implementation good enough to properly render all the articles.
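Note that mwscrape talks to the MediaWiki web API (as described earlier in the thread), so a raw XML/SQL dump alone is not enough. If you do import a dump into a local MediaWiki (XAMPP or otherwise), a quick way to check that its API is reachable is something like this; the URL is a guess for a default local install, adjust it to your setup:

    # Probe a local MediaWiki's api.php; mwscrape needs this API, not the raw dump.
    import requests

    r = requests.get('http://localhost/w/api.php', params={
        'action': 'query', 'meta': 'siteinfo', 'format': 'json'})
    print(r.json()['query']['general']['sitename'])

Presumably the local wiki would also need all the templates and Lua modules imported to render articles the way Wikipedia does, which is exactly the hard part mentioned above.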
I've been trying to download this for over a day now, but uTorrent can't find the meta-data (and Torrent file) for this... and there don't appear to be any seeders. I have enwiki-20150406.slob seeding however, but enwiki-20150620.slob can't be found on the torrent networks.
I'm also trying to download it via MEGA, but the first time (after many hours) it stopped at the 10 GB limit claiming an HTML5 storage limitation. I'm trying again using the MEGA extension in Chrome, but I'm worried I may get the same issue again.
Anyone else hitting a 10GB limit downloading from MEGA?