kochwiki.org as SLOB file?


Georg D

Sep 30, 2015, 9:23:02 AM
to aarddict
Did anyone create a SLOB file out of http://kochwiki.org/ (the more active fork of rezeptewiki)? This would be nice, e.g. when in doubt about ingredients while shopping in the supermarket, or when you want to mix a drink in a remote place with no WLAN/GSM coverage.

franc

Sep 30, 2015, 9:39:56 AM
to aarddict
On Wednesday, September 30, 2015 at 3:23:02 PM UTC+2, Georg D wrote:
Did anyone create a SLOB file out of http://kochwiki.org/

I can give it a try... I will tell you later whether it's doable.
frank

franc

Sep 30, 2015, 9:54:19 AM
to aarddict

Doesn't look good :(
I get weird SSL errors:
Starting session kochwiki-1443621095-767
/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 9, in <module>
    load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 293, in main
    site = mwclient.Site((scheme, host), path=args.site_path, ext=args.site_ext)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 116, in __init__
    self.site_init()
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 124, in site_init
    siprop='general|namespaces', uiprop='groups|rights', retry_on_error=False)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 204, in api
    info = self.raw_api(action, **kwargs)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 309, in raw_api
    res = self.raw_call('api', data, retry_on_error=retry_on_error)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 277, in raw_call
    stream = self.connection.post(fullurl, data=data, files=files, headers=headers)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 508, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/adapters.py", line 431, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Maybe this is something Igor can tell; I don't know what is wrong here.
The InsecurePlatformWarning appears on the normal wiki as well, but there it scrapes anyway.
Sorry.

franc

Oct 2, 2015, 7:08:01 AM
to aarddict
@Igor: do you know about this error? Could you comment quickly?

itkach

Oct 2, 2015, 10:46:18 AM
to aarddict
On Friday, October 2, 2015 at 7:08:01 AM UTC-4, franc wrote:
@Igor: do you know about this error? Could you comment quickly?

By default, if only a host name is given (without a protocol), mwscrape assumes https. It looks like kochwiki.org doesn't have https properly configured, but the API seems to be available over HTTP, so this (almost) works:


I say "almost" because some articles are downloaded, but I get a lot of 403 errors ("Forbidden"). It looks like this site limits the request rate and rejects clients that make requests too often, and even a single-threaded, one-request-at-a-time scraper is still too fast for it. I just added an option (--delay; you need to get the latest mwscrape from github) to introduce a delay, and this seems to work:

mwscrape http://kochwiki.org --delay 1

(that's a delay of 1 second before requesting each rendered article)

It's not fast, but it works, and the site appears to be fairly small, so we should be able to get it.
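What --delay does is, in effect, client-side throttling: consecutive requests are spaced at least a fixed interval apart. A minimal sketch of the idea (the helper below is a hypothetical illustration, not mwscrape's actual code):

```python
import time

def throttled(fetch, delay=1.0):
    """Wrap fetch() so that consecutive calls are spaced at least
    `delay` seconds apart, like mwscrape's --delay pause."""
    last_call = [float("-inf")]

    def wrapper(*args, **kwargs):
        wait = last_call[0] + delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return fetch(*args, **kwargs)

    return wrapper
```

For example, wrapping a requests session's get as throttled(session.get, delay=1.0) would pause roughly one second between article requests.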

Frank Röhm

Oct 2, 2015, 7:31:16 PM
to aard...@googlegroups.com


On October 2, 2015 at 4:46:17 PM CEST, itkach <itk...@gmail.com> wrote:
> I just added an option (--delay, you need to get the latest mwscrape
> from github) to introduce a delay and this seems to work:
>
> mwscrape http://kochwiki.org --delay 1
>
> (that's a delay of 1 second before requesting each rendered article)
>
> It's not fast, but it works and the site appears to be fairly small,
> so we should be able to get it.

Oh! Great! Thank you, that was a fast fix!
I started the scrape; I will see when it's finished and then upload...



franc

Oct 3, 2015, 8:17:15 AM
to aarddict
Indeed, really not big: only 10 MB.
The scrape took only an hour, mwscrape2slob then one minute.

Download under:
7fw.de/download/wiki/kochwiki.slob
I cannot provide a SHA hash at the moment.
I am only on mobile data right now and cannot test very well.
But I have the impression that the images are not downloaded automatically; they seem to be just links. Possibly this is because I have no Wi-Fi.

Frank

itkach

Oct 3, 2015, 9:29:19 AM
to aarddict
Image URLs in this wiki are relative, so either the slob file needs to contain the images or the converter needs to convert these into absolute URLs. I was able to download the images with wget, so maybe I'll be able to make a version with the images included (thumbnails).

This is going to be a bit tricky, though. Articles mostly refer to thumbnail images (smaller versions), and sometimes to full images. Thumbnails come in multiple sizes, and image tags in articles refer to all of them using the srcset attribute, so that a different one is picked on different screens. All images from kochwiki take up about 8 GB, the thumbnails 1.3 GB.

mwscrape2slob right now only has the ability to include the contents of a directory, and if I simply point it at the kochwiki image directory with thumbnails, it creates a 1.2 GB .slob with many images still missing. I need to improve mwscrape2slob to be smarter about which images it includes and how it rewrites image URLs.
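The relative-URL rewrite mentioned here can be illustrated with a small pass over the HTML; this is only a sketch (the base URL and function name are assumptions, and srcset attributes, which carry several comma-separated URLs, would need their own parser):

```python
import re

BASE_URL = "http://kochwiki.org"  # assumed base for root-relative links

def absolutize_src(html, base=BASE_URL):
    """Rewrite root-relative src="/..." attributes to absolute URLs.

    Protocol-relative (//host/...) and already-absolute URLs are left
    untouched; srcset is deliberately not handled in this sketch.
    """
    return re.sub(
        r'src="(/[^/"][^"]*)"',
        lambda m: 'src="%s%s"' % (base, m.group(1)),
        html,
    )
```

A converter along these lines would make the external-image version of the slob usable while keeping the file small.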

franc

Oct 3, 2015, 6:31:34 PM
to aarddict
On Saturday, October 3, 2015 at 3:29:19 PM UTC+2, itkach wrote:

Image URLs in this wiki are relative, so either the slob file needs to contain the images or the converter needs to convert these into absolute URLs.

Is there maybe a parameter to do this? To just handle the images the way the other wiki slob files do?


 

mhbraun

Oct 4, 2015, 5:44:00 AM
to aarddict
The links in the wiki do not refer to the wiki itself; they open the browser for the online version. This is inconvenient and should be changed.
E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.

Hinweise seems to be a category structure which doesn't get converted; it shows the source code. No problem, I guess.

The pictures would be really helpful. I like the approach of downloading the images only when the article is opened.


Frank Röhm

Oct 4, 2015, 5:49:36 AM
to aard...@googlegroups.com


On October 4, 2015 at 11:44:00 AM CEST, mhbraun <mhb...@freenet.de> wrote:
> ...
> The pictures would be really helpful. I like the structure to download
> the images only if the article is opened.
>

+1 from me :)

itkach

Oct 4, 2015, 11:07:45 AM
to aarddict


On Sunday, October 4, 2015 at 5:44:00 AM UTC-4, mhbraun wrote:
The links in the wiki do not refer to the wiki itself; they open the browser for the online version. This is inconvenient and should be changed.
E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.


Indeed. This is because most links in these articles point to things like ingredients, which for some reason are not regular articles; instead they are pages in a namespace (Zutat). Usually it makes sense to resolve namespaced page links to point to the online version - these are categories, discussions and various special pages typically not included in slob. This looks like an error in the MediaWiki configuration for this site. One way to fix it is to edit siteinfo manually and remove the namespace definitions for namespaces that should be treated as regular articles. Or maybe mwscrape2slob needs to grow an option to let the user specify which namespaces to treat as local articles.

franc

Oct 4, 2015, 11:17:45 AM
to aarddict
On Sunday, October 4, 2015 at 5:07:45 PM UTC+2, itkach wrote:


On Sunday, October 4, 2015 at 5:44:00 AM UTC-4, mhbraun wrote:
The links in the wiki do not refer to the wiki itself; they open the browser for the online version. This is inconvenient and should be changed.
E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.


Indeed. This is because most links in these articles point to things like ingredients, which for some reason are not regular articles; instead they are pages in a namespace (Zutat). Usually it makes sense to resolve namespaced page links to point to the online version - these are categories, discussions and various special pages typically not included in slob. This looks like an error in the MediaWiki configuration for this site. One way to fix it is to edit siteinfo manually and remove the namespace definitions for namespaces that should be treated as regular articles.

Where could I edit siteinfo manually?
 
Or maybe mwscrape2slob needs to grow an option to let the user specify which namespaces to treat as local articles.


I had a short look into __init__.py (from mwscrape2slob), where I think the real scrape-to-slob conversion happens, but I admit that my Python abilities are not profound (more honestly: nonexistent), so I quickly gave up on the code without seeing where to change that. I am far from being able to write such Python code or submit a pull request :(
For example: is ">>>" (three greater-than symbols) a comment marker?
And so on...

I pass.

Igor Tkach

Oct 4, 2015, 11:54:54 AM
to aard...@googlegroups.com
On Sun, Oct 4, 2015 at 11:17 AM, franc <franc...@gmail.com> wrote:
On Sunday, October 4, 2015 at 5:07:45 PM UTC+2, itkach wrote:


On Sunday, October 4, 2015 at 5:44:00 AM UTC-4, mhbraun wrote:
The links in the wiki do not refer to the wiki itself; they open the browser for the online version. This is inconvenient and should be changed.
E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.


Indeed. This is because most links in these articles point to things like ingredients, which for some reason are not regular articles; instead they are pages in a namespace (Zutat). Usually it makes sense to resolve namespaced page links to point to the online version - these are categories, discussions and various special pages typically not included in slob. This looks like an error in the MediaWiki configuration for this site. One way to fix it is to edit siteinfo manually and remove the namespace definitions for namespaces that should be treated as regular articles.

Where could I edit siteinfo manually?
 

Siteinfos are in CouchDB, in the aptly named "siteinfo" database. You can browse CouchDB in the admin UI (http://localhost:5984/_utils/) and find the kochwiki siteinfo at http://localhost:5984/_utils/document.html?siteinfo/kochwiki-org
Siteinfo is a JSON document, which can be edited directly in the CouchDB admin UI. I am not suggesting anyone actually do that, because it is error-prone, and now I see it doesn't create proper URLs anyway (the colon ":" in the link also needs to be URL-escaped, otherwise the browser interprets it as a protocol name, and of course the browser doesn't know a protocol called "zutat").

I'll tweak mwscrape2slob to handle this.
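The escaping issue pointed out above is ordinary URL encoding: in a relative link like Zutat:Aal, everything before the colon is read as a URL scheme. Percent-encoding the colon avoids that (the title below is just an illustration):

```python
from urllib.parse import quote

title = "Zutat:Aal"  # hypothetical page title in the Zutat namespace

# Unescaped, a relative href "Zutat:Aal" would be parsed as scheme
# "zutat" plus path "Aal"; escaping the colon keeps it a plain path.
href = quote(title, safe="")
print(href)  # Zutat%3AAal
```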

Frank Röhm

Oct 4, 2015, 12:05:39 PM
to aard...@googlegroups.com


On October 4, 2015 at 5:54:33 PM CEST, Igor Tkach <itk...@gmail.com> wrote:
>...
>I'll tweak mwscrape2slob to handle this.

Oh! That would be great! Thanks!

Georg D

Oct 7, 2015, 3:49:04 AM
to aarddict
Oh dear, I started quite a bunch of activity, coding and so on...

Indeed, really not big. 10 MB only

Download under:
7fw.de/download/wiki/kochwiki.slob

Frank, thank you, the texts work fine so I can read the ingredients etc offline :)

Frank Röhm

Oct 7, 2015, 4:07:16 AM
to aard...@googlegroups.com

> On October 7, 2015 at 09:49, Georg D <schos...@gmail.com> wrote:
>
>
> Frank, thank you, the texts work fine so I can read the ingredients etc offline :)
>

Once Igor has extended mwscrape2slob so that the image links can be downloaded automatically in aard2, I will rescrape/rescrape2slob the wiki again.


Uwe

Oct 10, 2015, 4:14:13 AM
to aarddict
Any chance to get a working AAR dump too? I'm limited to Android 2.1 on my eReader, so the SLOB format isn't an option.

itkach

Oct 10, 2015, 10:28:09 PM
to aarddict
Here's the updated dictionary: https://github.com/itkach/slob/wiki/Dictionaries#koch-wiki

Most links should work now. This is a version with external images. I'll see if I can make a version with image thumbnails included.

There's quite a lot of content on Koch-Wiki that's not in the regular articles namespace; for example, pages about ingredients are in the MediaWiki namespace "Zutat", so both mwscrape and mwscrape2slob needed to be modified to handle that.

mwscrape now has a new command line option, --namespace, to specify which namespace to download (it can only scrape one namespace at a time). The namespace is specified by id (see the list of namespaces and their ids in siteinfo under the namespaces key; assuming you run CouchDB locally, CouchDB's admin UI for the kochwiki siteinfo is at http://localhost:5984/_utils/document.html?siteinfo/kochwiki-org). So the command to scrape, for example, Zutat would be

mwscrape http://kochwiki.org --delay 1 --namespace 100

mwscrape2slob now has a matching option, --article-namespace, to specify which namespaces are to be treated as regular articles (links to pages in these namespaces will not be converted to external links). Another new option, --ensure-ext-image-urls, specifies that image links, on the other hand, should be converted to external links. The kochwiki slob was created with the following command:

mwscrape2slob http://localhost:5984/kochwiki-org -f common wiki --article-namespace 14 100 102 106 108 110 112 116 --ensure-ext-image-urls --no-math

So now it's possible to include non-article content such as Appendix and Category pages (several users asked for these to be added to enwiki and ruwiki).



Frank Röhm

Oct 11, 2015, 9:16:44 AM
to aard...@googlegroups.com


On October 11, 2015 at 4:28:09 AM CEST, itkach <itk...@gmail.com> wrote:
> Here's the updated dictionary:
> https://github.com/itkach/slob/wiki/Dictionaries#koch-wiki
>
> Most links should work now. This is a version with external images.
> I'll see if I can make a version with image thumbnails included.
>
> There's quite a lot of content on Koch-Wiki that's not in the regular
> articles namespace; for example, pages about ingredients are in the
> MediaWiki namespace "Zutat", so both mwscrape and mwscrape2slob needed
> to be modified to handle that.
>
> mwscrape now has a new command line option, --namespace, to specify
> which namespace to download (it can only scrape one namespace at a
> time). The namespace is specified by id (see the list of namespaces
> and their ids in siteinfo under the namespaces key; assuming you run
> CouchDB locally, CouchDB's admin UI for the kochwiki siteinfo is at
> http://localhost:5984/_utils/document.html?siteinfo/kochwiki-org). So
> the command to scrape, for example, Zutat would be
>
> mwscrape http://kochwiki.org --delay 1 --namespace 100
>
> mwscrape2slob now has a matching option, --article-namespace, to
> specify which namespaces are to be treated as regular articles (links
> to pages in these namespaces will not be converted to external links).
> Another new option, --ensure-ext-image-urls, specifies that image
> links, on the other hand, should be converted to external links. The
> kochwiki slob was created with the following command:
>
> mwscrape2slob http://localhost:5984/kochwiki-org -f common wiki
> --article-namespace 14 100 102 106 108 110 112 116
> --ensure-ext-image-urls --no-math
>
> So now it's possible to include non-article content such as Appendix
> and Category pages (several users asked for these to be added to
> enwiki and ruwiki).

Wow! Good work! A lot of work as well, it seems.
Thanks.

Ivan Zakharyaschev

Sep 1, 2019, 10:00:52 AM
to aarddict
Hello!

Thanks for the aard2 app and the dictionary files.

I'd appreciate it if the French Wiktionary slob included the conjugation tables etc., for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander

Right now, http://ftp.halifax.rwth-aachen.de/aarddict/frwiki/frwiktionary_2019-06-06.slob treats them as an online link.

I've noticed that the conjugations are in a different namespace (Annexe:) and I've come across an option of mwscrape2slob that should enable including these pages, too:
--article-namespace

The complete way to do this is described in the message about kochwiki (cited below), and, as we see, it also involves an option of mwscrape.

I could try to generate such a slob file myself, or I'd appreciate it if someone else who updates the frwiktionary files would include the conjugations.

franc

Sep 1, 2019, 10:47:46 AM
to aarddict
Hello,
I was providing the French wiki and wiktionary for a while.
In June 2019 I updated my old Ubuntu 14.04 server to Ubuntu 16.04.
Since then my fully automatic monthly frwiki update has been broken, and I haven't fixed it yet.
I plan to go to Ubuntu 18.04 soon, so any fix could be in vain or only short-lived.

Once everything is working again, I will be happy to include your changes; I don't think it's a problem, as in kochwiki, where this was easy and working.

I suppose I am off until maybe October, when I will have more time.

Thanks

Frank

MHBraun

Sep 5, 2019, 7:38:38 PM
to aarddict
Update of Kochwiki-20190903.slob on


Enjoy

franc

Sep 9, 2019, 3:49:45 AM
to aarddict
I think he wanted not the kochwiki but the frwiktionary, complete :)

Markus Braun

Sep 10, 2019, 1:03:03 AM
to aard...@googlegroups.com, franc
Yep, I got that.

However, I thought the kochwiki was a year old, and as I am using it once in a while, an update seemed appropriate :)

The frwiktionary is not that old, and it is your baby, I guess... :D

franc

May 7, 2020, 8:30:31 AM
to aarddict
On Sunday, September 1, 2019 at 4:00:52 PM UTC+2, Ivan Zakharyaschev wrote:
Hello!

Thanks for the aard2 app and the dictionary files.

I'd appreciate it if the French Wiktionary slob included the conjugation tables etc., for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander

Now, http://ftp.halifax.rwth-aachen.de/aarddict/frwiki/frwiktionary_2019-06-06.slob treats them as an online link.

I've noticed that the conjugations are in a different namespace (Annexe:) and I've come across an option of mwscrape2slob that should enable including these pages, too:
--article-namespace


Sorry for the late reply. Only a few weeks ago I finally got to the Ubuntu 18.04 update, and I still struggle with CouchDB 2.3 (unfortunately there is no longer an easily working CouchDB 1.x for Ubuntu 18.04).
So I found this post from you, Ivan, and I added the namespace 100 to my scrape. I found the namespace in the JSON at:
It seems the structure changed; the URL with .../document.html is unknown now. Anyway, this will take a while, as I now scrape without the --speed parameter and it takes weeks to finish! At the moment my frwiktionary has only 1340707 documents, I guess only half, if I remember correctly.
Then I will re-run with namespace 100, hoping it will scrape the Annexe pages...

By the way, I don't understand why namespace 0 (the main namespace with most articles) is missing in the mwscrape2slob command for kochwiki a few posts up. Shouldn't it be like this?

mwscrape2slob http://localhost:5984/kochwiki-org -f common wiki --article-namespace 0 14 100 102 106 108 110 112 116 --ensure-ext-image-urls --no-math

franc

May 7, 2020, 8:44:28 AM
to aarddict
OK, on this occasion I updated the kochwiki:

http://7fw.de/download/wiki/kochwiki_2020-05-07.slob
File size:
http://7fw.de/download/wiki/kochwiki_2020-05-07.size.txt
The SHA hash is:

I tried it, and at first look it seems OK. I also added the new namespace "Gesundheit" (id 118) to it. Tell me if it has errors, thanks.
frank

franc

May 7, 2020, 8:52:39 AM
to aarddict
On Sunday, September 1, 2019 at 4:00:52 PM UTC+2, Ivan Zakharyaschev wrote:
...

I'd appreciate it if the French Wiktionary slob included the conjugation tables etc., for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander

I just tried to start a new mwscrape with only namespace 100 (Annexe):

# mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --namespace 100

but as I have been running the main frwiktionary scrape (for a week or more), I get the message "Scrape for this host is already in progress."
So I created a new directory and installed mwscrape there too (a separate installation!) and tried it, but it's still not possible; the same message:

(env-mwscrape) [14.36.39][root@ew6:/home/franc/aard/frwiktionary-100/env-mwscrape#mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --namespace 100
Connecting http://127.0.0.1:5984 as user admin
Starting session frwiktionary-1588855004-199
Starting at None
Scrape for this host is already in progress. Use --speed option instead of starting multiple processes.
(env-mwscrape) [14.36.45][root@ew6:/home/franc/aard/frwiktionary-100/env-mwscrape#


I don't understand: does a newly started mwscrape (in a separate installation) first ask CouchDB whether there is another mwscrape at work for the same URL?
How can that be?

Thanks.
frank

Igor Tkach

May 7, 2020, 9:35:58 AM
to aarddict
On Thu, May 7, 2020 at 8:52 AM franc <franc...@gmail.com> wrote:

I don't understand: does a newly started mwscrape (in a separate installation) first ask CouchDB whether there is another mwscrape at work for the same URL?

It does!
One reason for this is that mwscrape tries to limit the stress it can put on the sites it's downloading. The --speed option is limited to a maximum of 5 for the same reason. Imagine spinning up hundreds of mwscrapes at max speed (10 parallel downloads) and pointing them all at one site: best case your IP gets blocked, worst case you crash that site first.
Also, competing mwscrape processes would download the same article multiple times, reducing the efficiency gained from running in parallel while generating high load on the scraped site (unless they are carefully configured to download different portions of the site's content with options like --titles, --start and --desc).

In the current implementation mwscrape only downloads one namespace at a time (0 - articles - being the default), mostly because mwclient, the library for interacting with the MediaWiki API, takes a single namespace as a parameter in most API calls, and I didn't want to complicate mwscrape further by trying to work around that.

I can see how these limitations can be annoying for your use case :) Maybe mwscrape should either support specifying multiple namespaces or allow multiple instances per site if they target different content namespaces... In the meantime you'd have to run scrapes for different namespaces one after another.

franc

May 8, 2020, 9:50:36 AM
to aarddict
Thanks! I understand now!
So for this namespace 100 scrape I will change fcntl.LOCK_EX to fcntl.LOCK_SH in scrape.py.
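For context: mwscrape's single-instance guard is a file lock, and the difference exploited here is the flock semantics, where shared locks (LOCK_SH) coexist while an exclusive lock (LOCK_EX) keeps any second holder out. A sketch of those semantics (not the actual scrape.py code; note that switching to LOCK_SH also disables the protection against two scrapes colliding):

```python
import fcntl
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
a = open(path)
b = open(path)

# Shared locks coexist: with LOCK_SH a second mwscrape-style process
# can take the lock too.
fcntl.flock(a, fcntl.LOCK_SH)
fcntl.flock(b, fcntl.LOCK_SH | fcntl.LOCK_NB)  # succeeds
shared_ok = True
fcntl.flock(a, fcntl.LOCK_UN)
fcntl.flock(b, fcntl.LOCK_UN)

# An exclusive lock conflicts with any other lock attempt.
fcntl.flock(a, fcntl.LOCK_EX)
try:
    fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
    exclusive_blocked = False
except BlockingIOError:
    exclusive_blocked = True
fcntl.flock(a, fcntl.LOCK_UN)
os.remove(path)
```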

Apart from this, all my scrapes are running without errors at the moment. But next time they stop I will restart them in a loop, as posted in the other thread "mwscrape - Error on Ubuntu 18.04 with CouchDB 3.0":

mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 || while true; do mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 --resume; sleep 60; done

I will try speed 3 then.

franc

May 8, 2020, 10:23:06 AM
to aarddict
On Friday, May 8, 2020 at 3:50:36 PM UTC+2, franc wrote:

mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 || while true; do mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 --resume; sleep 60; done

I experimented a bit with the --resume parameter but couldn't use it; I very often get a "Document update conflict" error:

(env-mwscrape) [16.17.59][root@ew6:/home/franc/aard/frwiktionary/env-mwscrape#mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --speed 5 --resume

Connecting http://127.0.0.1:5984 as user admin
Resuming session frwiki-1588947064-220
Traceback (most recent call last):
  File "/home/franc/aard/frwiktionary/env-mwscrape/bin/mwscrape", line 11, in <module>
    load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/mwscrape/scrape.py", line 345, in main
    sessions_db[session_id] = session_doc
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/client.py", line 447, in __setitem__
    status, headers, data = resource.put_json(body=content)
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 578, in put_json
    **params)
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 596, in _request_json
    headers=headers, **params)
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 592, in _request
    credentials=self.credentials)
  File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 425, in request
    raise ResourceConflict(error)
couchdb.http.ResourceConflict: ('conflict', 'Document update conflict.')

What is irritating here is that I am trying to start the frwiktionary scrape (to test --resume) but get the message "Resuming session frwiki-1588947064-220".
Indeed a frwiki scrape is running, but here I am starting frwiktionary!
It is even in a different folder and a different mwscrape installation!

Igor Tkach

May 8, 2020, 10:45:27 AM
to aarddict
It doesn't matter which folder or installation mwscrape is in; the information about scrapes is stored in CouchDB, in the "mwscrape" database. --resume without a parameter resumes the "current" (most recently started; see document "$current" in the "mwscrape" database) scrape. --resume <session id> resumes the specified scrape (the session id is printed at the beginning). Either way, you don't specify the site to scrape with --resume; that comes from the saved session info.

On a side note, the way I originally used it was to start a new scrape from the command line, check that it's running and that the content it downloads looks right, then kill it and resume it in a loop. I would typically run just one scrape at a time, since it's resource-intensive enough that I wouldn't want to run more, especially on a computer I'm also using for something else. So this is awkward to use if you want to fully script starting multiple new, resumable scrapes at the same time; I never got around to properly supporting such a setup.

Anyway... you can start a new scrape and kill it immediately just to take note of the session id, either from the output in the terminal or from the "mwscrape" db, and restart it with --resume <session id> in a loop; this should be able to run until it's done (which is a really long time in the case of large wikis :)).
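To make that concrete, here is roughly what a session document in the "mwscrape" database might look like and how its id feeds the resume loop. The document below is made up and the exact field set may differ (last_page_name is the field mentioned later in this thread):

```python
import json

# Hypothetical mwscrape session document from CouchDB's "mwscrape"
# database; the values are illustrative only.
doc = json.loads("""
{
  "_id": "frwiktionary-1588855004-199",
  "site": "fr.m.wiktionary.org",
  "last_page_name": "demander"
}
""")

session_id = doc["_id"]
resume_cmd = "mwscrape --resume " + session_id
```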


franc

May 9, 2020, 9:00:42 AM
to aarddict
On Friday, May 8, 2020 at 4:45:27 PM UTC+2, itkach wrote:

It doesn't matter which folder or installation mwscrape is in; the information about scrapes is stored in CouchDB, in the "mwscrape" database. --resume without a parameter resumes the "current" (most recently started; see document "$current" in the "mwscrape" database) scrape. --resume <session id> resumes the specified scrape (the session id is printed at the beginning). Either way, you don't specify the site to scrape with --resume; that comes from the saved session info.

On a side note, the way I originally used it was to start a new scrape from the command line, check that it's running and that the content it downloads looks right, then kill it and resume it in a loop. I would typically run just one scrape at a time, since it's resource-intensive enough that I wouldn't want to run more, especially on a computer I'm also using for something else. So this is awkward to use if you want to fully script starting multiple new, resumable scrapes at the same time; I never got around to properly supporting such a setup.

Anyway... you can start a new scrape and kill it immediately just to take note of the session id, either from the output in the terminal or from the "mwscrape" db, and restart it with --resume <session id> in a loop; this should be able to run until it's done (which is a really long time in the case of large wikis :)).


Ah, I understand now. I thought --resume was meant to continue a scrape at exactly the position where it stopped, e.g. when it stopped with an error or because I killed it (with CTRL+C).
I wanted to avoid mwscrape restarting the scrape from the very beginning, e.g. from Aaaa etc., if I had already scraped up to X. Right now mwscrape would re-scrape all articles from A to X; they get skipped, but it has to ask the Wikimedia server anyway. So resuming properly would reduce the requests enormously.

But maybe I should then use the --start parameter, looking first into the couch for the last scraped article. But I don't know how to look up the last entry in my CouchDB; I can only scroll very slowly through the articles. Maybe there is a way to get it on the command line, but I don't know it.

What to do if the scrape of a big wiki (e.g. frwiki) stops late is unclear to me; what is best to avoid many unnecessary requests?

By the way, I have finished the scrape of namespace 100 of frwiktionary; it was not very big.

Igor Tkach

May 9, 2020, 9:34:12 AM
to aarddict


On Sat, May 9, 2020, 09:00 franc <franc...@gmail.com> wrote:
I thought --resume was meant to continue a scrape at exactly the position where it stopped, e.g. when it stopped with an error or because I killed it (with CTRL+C).

Yes, that's what it does. If you look at the "mwscrape" database in couch, you'll see the last scraped title is in the session info; resume starts from that title.

MHBraun

May 9, 2020, 7:34:37 PM
to aarddict
Franc this is easy. If your scrape stops for whatever reason check the last file in the mwscrape database.
The document contains the id of the scrape starting with the name of the database. Copy that string and add it to your loop command with the -r parameter.

You can do this first time a scrape broke only because mwscrape is generating a new document for each scrape. If the scrape runs to the end you do not need the id anyway. But if it breaks the id give the place where to pick up. The specific place is listed in the mwscrape document as last_page_name.

I sent you my checkscript.sh, where I use this method to continue where the scrape broke. I limited the attempts to 20, which is an arbitrary choice. There are (at least) two factors influencing this:

a) End of scrape: you got the last article, and of course you do not want to restart the scrape with the last article over and over again.
b) Bad server responses (err 500 etc.): then it is better to wait a couple of hours before restarting, to get a better connection.

It is actually straightforward, and there is no need to use --start and figure out what the last article was.
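The retry idea described above can be sketched in a few lines of Python: rerun a command until it exits cleanly or a (somewhat arbitrary) attempt cap is hit. The mwscrape invocation and the -r session id in the example are placeholders, not real values — substitute your own:

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=20, pause=10):
    """Run cmd repeatedly; stop on success or after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True  # clean exit: the scrape (probably) finished
        time.sleep(pause)  # e.g. err 500: give the server a moment
    return False

# Hypothetical example:
# run_with_retries(["mwscrape", "https://fr.wikipedia.org",
#                   "-r", "frwiki-1588970000-123"])
```

The cap matters for exactly the two reasons listed above: without it, a finished scrape would be restarted over and over, and a struggling server would be hammered continuously.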

franc

unread,
May 24, 2020, 1:12:30 PM
to aarddict
On Sunday, September 1, 2019 at 16:00:52 UTC+2, Ivan Zakharyaschev wrote:
Hello!

Thanks for the aard2 app and the dictionary files.

I'd appreciate if the French wiktionary slob included the conjugation tables etc. for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander

In the currently scraped and scrape2slob'ed frwiktionary, I think this is working now. Please test it, Ivan! Thanks.
frank

Nikolai Yourin

unread,
May 25, 2020, 2:06:51 PM
to aarddict


On Sunday, May 24, 2020 at 8:12:30 PM UTC+3, franc wrote:
In the currently scraped and scrape2slob'ed frwiktionary, I think this is working now. Please test it, Ivan! Thanks.
frank

All the required data is there, stored under e.g. "Annexe:Conjugaison en français/demander",
but 'voir la conjugaison' for "demander" points back to the main article.
Would it be possible to rewrite those links?

franc

unread,
May 30, 2020, 1:02:29 PM
to aarddict
On Monday, May 25, 2020 at 20:06:51 UTC+2, Nikolai Yourin wrote:

All the required data is there, stored under e.g. "Annexe:Conjugaison en français/demander",
but 'voir la conjugaison' for "demander" points back to the main article.
Would it be possible to rewrite those links?

If you tell me what to do, I will try. I don't know how to scrape (or scrape2slob) in a way that would make these links point to the right spot :(
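I don't know of built-in support for this in scrape2slob — the following is only a sketch of the kind of post-processing that might fix it: rewrite the "voir la conjugaison" links in the article HTML so they point at the Annexe page that is already stored in the slob. The href pattern here is an assumption about the scraped markup, not verified against it:

```python
import re

def rewrite_conjugation_links(html, word):
    """Point conjugation links for `word` at its Annexe article."""
    target = "Annexe:Conjugaison en français/%s" % word
    # Replace the href of any anchor whose text is "voir la conjugaison".
    return re.sub(
        r'href="[^"]*"(?=[^>]*>voir la conjugaison)',
        'href="%s"' % target,
        html)
```

For example, `<a href="demander">voir la conjugaison</a>` would be rewritten to point at "Annexe:Conjugaison en français/demander". A real fix would also need to handle URL-encoded hrefs and other anchor texts.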

MHBraun

unread,
Jan 24, 2021, 12:46:23 PM
to aarddict
I tried to update the kochwiki today and got this error message.

mwscrape http://kochwiki.org --siteinfo-only

Traceback (most recent call last):
  File "/home/markus/env-mwscrape/bin/mwscrape", line 11, in <module>
    load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
  File "/home/markus/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 335, in main
    update_siteinfo(site, couch_server, db_name)
  File "/home/markus/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 69, in update_siteinfo
    siprop='general|interwikimap|rightsinfo|statistics|namespaces'
  File "/home/markus/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 279, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/markus/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 422, in raw_api
    raise errors.InvalidResponse(res)
mwclient.errors.InvalidResponse: Did not get a valid JSON response from the server. Check that you used the correct hostname. If you did, the server might be wrongly configured or experiencing temporary problems.

The kochwiki is visible and accessible via browser.

Would you please test if you see the same error?
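This InvalidResponse usually means api.php answered with something other than JSON — an HTML error or redirect page, for instance. A quick way to see what the server actually returns is to fetch the API URL yourself and check whether the body parses as JSON; the api.php path below is a guess, it may differ on this wiki:

```python
import json

def looks_like_json(body):
    """Return True if body is valid JSON, False otherwise."""
    try:
        json.loads(body)
        return True
    except ValueError:
        return False

# e.g. compare the bodies returned by (paths assumed, not verified):
#   curl -s 'http://kochwiki.org/api.php?action=query&meta=siteinfo&format=json'
#   curl -s 'https://www.kochwiki.org/api.php?action=query&meta=siteinfo&format=json'
```

If the bare hostname serves an HTML redirect page instead of JSON, pointing mwscrape at the canonical host (as found below) resolves the error.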

franc

unread,
Jan 24, 2021, 1:26:45 PM
to aarddict
MHBraun wrote on Sunday, January 24, 2021 at 18:46:23 UTC+1:
I tried to update the kochwiki today and got this error message.

mwscrape http://kochwiki.org --siteinfo-only


Just checked, I get the same error :(

mwclient.errors.InvalidResponse: Did not get a valid JSON response from the server. Check that you used the correct hostname. If you did, the server might be wrongly configured or experiencing temporary problems.

Sorry, frank

franc

unread,
Jan 24, 2021, 2:11:06 PM
to aarddict

OK, now with more time I checked again and changed kochwiki to www.kochwiki, and now there are no more errors with these scrapes:

mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 0
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 14
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:pass...@127.0.0.1:5984 --namespace 100
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 102
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 106
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 108
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 112
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 116
mwscrape https://www.kochwiki.org --delay 1 --db kochwiki --couch http://admin:password@127.0.0.1:5984 --namespace 118
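The nine near-identical commands above could also be generated in a loop; a sketch (the namespace list is taken from the commands above, the couch URL is a placeholder):

```python
import subprocess

NAMESPACES = [0, 14, 100, 102, 106, 108, 112, 116, 118]

def scrape_commands(couch_url, namespaces=NAMESPACES):
    """Build one mwscrape command per namespace."""
    return [
        ["mwscrape", "https://www.kochwiki.org",
         "--delay", "1", "--db", "kochwiki",
         "--couch", couch_url, "--namespace", str(ns)]
        for ns in namespaces
    ]

# for cmd in scrape_commands("http://admin:password@127.0.0.1:5984"):
#     subprocess.run(cmd, check=True)
```

A plain shell for-loop over the namespace numbers would do the same job.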

I will announce the finished kochwiki here then...
It will be in my wiki folders as usual:

Please wait...

franc

unread,
Jan 25, 2021, 12:48:41 PM
to aarddict
Now here it is:

The current KochWiki, kochwiki_2021-01-25.slob, is ready for download at:
kochwiki_2021-01-25.slob

 The file size is:
kochwiki_2021-01-25.size.txt

 The SHA hash is:

Please test as you wish :)
fr.