Did anyone create a SLOB file out of http://kochwiki.org/
Starting session kochwiki-1443621095-767
/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 9, in <module>
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 293, in main
site = mwclient.Site((scheme, host), path=args.site_path, ext=args.site_ext)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 116, in __init__
self.site_init()
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 124, in site_init
siprop='general|namespaces', uiprop='groups|rights', retry_on_error=False)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 204, in api
info = self.raw_api(action, **kwargs)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 309, in raw_api
res = self.raw_call('api', data, retry_on_error=retry_on_error)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwclient/client.py", line 277, in raw_call
stream = self.connection.post(fullurl, data=data, files=files, headers=headers)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 508, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/requests/adapters.py", line 431, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
@Igor: do you know about this error? Could you quickly say something about it?
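This looks like Python 2.7's old ssl module lacking SNI support, which is also what the InsecurePlatformWarning above points at. Not certain, but installing requests' security extras into the virtualenv often fixes exactly this kind of certificate verify failure:

# inside the env-mwscrape virtualenv (Python 2.7):
pip install 'requests[security]'
# this pulls in pyOpenSSL, ndg-httpsclient and pyasn1 so that urllib3
# gets a real SSLContext with SNI support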
Download under:
7fw.de/download/wiki/kochwiki.slob
Cannot provide sha at the moment.
I am only on mobile data at the moment and cannot test very well.
But I have the impression that the images are not downloaded automatically; they seem to be just links. Possibly this is because I have no Wi-Fi.
Frank
Image URLs in this wiki are relative, so either the slob file needs to contain the images, or the converter needs to convert them into absolute URLs.
The links in the wiki do not refer to the wiki itself. They open the browser for the online version. This is inconvenient and should be changed.
E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.
On Sunday, October 4, 2015 at 5:44:00 AM UTC-4, mhbraun wrote: The links in the wiki do not refer to the wiki itself. They open the browser for the online version. This is inconvenient and should be changed. E.g. in Apfel-Mohnkuchen, the links for Teig point to the online version.

Indeed. This is because most links in these articles point to things like ingredients, which for some reason are not regular articles; instead they are pages in a namespace (Zutat). Usually it makes sense to resolve namespaced page links to point to the online version - these are categories, discussions and various special pages typically not included in the slob. This looks like an error in the MediaWiki configuration for this site. One way to fix it is to edit the siteinfo manually and remove the namespace definitions for namespaces that should be treated as regular articles.
Or maybe mwscrape2slob needs to grow an option to let the user specify which namespaces to treat as local articles.
On Sunday, October 4, 2015 at 17:07:45 UTC+2, itkach wrote: One way to fix it is to edit the siteinfo manually and remove the namespace definitions for namespaces that should be treated as regular articles.
Where could I edit siteinfo manually?
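The thread doesn't spell this out, but assuming mwscrape keeps the siteinfo as an ordinary CouchDB document (the database and document names below are guesses, not confirmed), the edit could be done with curl or from CouchDB's web interface:

# fetch the siteinfo document (database/document names are assumptions):
curl http://127.0.0.1:5984/siteinfo/kochwiki > siteinfo.json
# delete the unwanted entries from the "namespaces" section in an editor,
# keep the _rev field intact, then write the document back:
curl -X PUT http://127.0.0.1:5984/siteinfo/kochwiki -d @siteinfo.json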
Indeed, really not big. 10 MB only
Frank, thank you, the texts work fine so I can read the ingredients etc. offline :)

Download under:
7fw.de/download/wiki/kochwiki.slob
--article-namespace
Once I am set up again with all wikis working, I will be glad to include your changes. I don't think it's a problem as you describe; in kochwiki this was easy and working.
I suppose I am off this until maybe October, when I will have more time.
Thanks
Frank
Hello! Thanks for the aard2 app and the dictionary files. I'd appreciate it if the French wiktionary slob included the conjugation tables etc. for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander

Now, http://ftp.halifax.rwth-aachen.de/aarddict/frwiki/frwiktionary_2019-06-06.slob treats them as an online link. I've noticed that the conjugations are in a different namespace (Annexe:) and I've come across an option of mwscrape2slob that should enable including these pages, too:
--article-namespace
...
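An invocation along these lines might then work (a sketch only: the positional CouchDB URL follows mwscrape2slob's usual usage, namespace 100 for Annexe is taken from the mwscrape command later in this thread, and whether the option wants the numeric id or the name is not confirmed here):

# convert the scraped database to slob, treating Annexe pages as articles:
mwscrape2slob http://127.0.0.1:5984/frwiktionary --article-namespace 100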
I'd appreciate if the French wiktionary slob included the conjugation tables etc. for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander
# mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --namespace 100
(env-mwscrape) [14.36.39][root@ew6:/home/franc/aard/frwiktionary-100/env-mwscrape#mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --namespace 100
Connecting http://127.0.0.1:5984 as user admin
Starting session frwiktionary-1588855004-199
Starting at None
Scrape for this host is already in progress. Use --speed option instead of starting multiple processes.
(env-mwscrape) [14.36.45][root@ew6:/home/franc/aard/frwiktionary-100/env-mwscrape#
I don't understand: does a newly started mwscrape (in a separate installation) first ask CouchDB whether another mwscrape is at work on the same URL?
mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 || while true; do mwscrape https://some.m.wikipedia.org --db somewikipedia --couch http://admin:12...@127.0.0.1:5984 --speed 3 --resume; sleep 60; done
(env-mwscrape) [16.17.59][root@ew6:/home/franc/aard/frwiktionary/env-mwscrape#mwscrape https://fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984 --speed 5 --resume
Resuming session frwiki-1588947064-220
Traceback (most recent call last):
File "/home/franc/aard/frwiktionary/env-mwscrape/bin/mwscrape", line 11, in <module>
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/mwscrape/scrape.py", line 345, in main
sessions_db[session_id] = session_doc
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/client.py", line 447, in __setitem__
status, headers, data = resource.put_json(body=content)
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 578, in put_json
**params)
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 596, in _request_json
headers=headers, **params)
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 592, in _request
credentials=self.credentials)
File "/home/franc/aard/frwiktionary/env-mwscrape/lib/python3.6/site-packages/couchdb/http.py", line 425, in request
raise ResourceConflict(error)
couchdb.http.ResourceConflict: ('conflict', 'Document update conflict.')
Resuming session frwiki-1588947064-220"
Indeed, a frwiki scrape is running, but here I started frwiktionary!
It is even in a different folder and a different mwscrape installation!
It doesn't matter which folder or installation mwscrape is in; the information about scrapes is stored in CouchDB, in the "mwscrape" database. --resume without a parameter resumes the "current" (most recently started; see document "$current" in the "mwscrape" database) scrape. --resume <session id> resumes the specified scrape (the session id is printed at the beginning). Either way, you don't specify the site to scrape with --resume; that comes from the saved session info.

On a side note, the way I was using it originally was to start a new scrape from the command line, check that it's running and that the content it downloads looks right, then kill it and resume it in a loop. I would typically run just one scrape at a time, since it's resource-intensive enough that I wouldn't want to run more, especially if it happens to be on a computer I'm also using for something else. So this is awkward to use if you want to fully script starting multiple new, resumable scrapes at the same time - I never got around to properly supporting such a setup.

Anyway... you can start a new scrape and kill it immediately just to take note of the session id, either from the output in the terminal or from the "mwscrape" db, and restart it with --resume <session id> in a loop - this should be able to run until it's done (which is a really long time in the case of large wikis :)).
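In other words, as a sketch reusing the --couch URL and the session id printed earlier in this thread:

# resume the most recently started scrape (the "$current" document):
mwscrape --couch http://admin:12...@127.0.0.1:5984 --resume
# or resume one specific session by the id printed at startup:
mwscrape --couch http://admin:12...@127.0.0.1:5984 --resume frwiktionary-1588855004-199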
I thought --resume was meant to continue a scrape at exactly the position where it stopped, e.g. when it stopped because of an error or because I killed it (with CTRL+C).
You can only do this the first time a scrape broke, because mwscrape generates a new document for each scrape. If the scrape runs to the end, you do not need the id anyway. But if it breaks, the id gives the place where to pick up. The specific place is listed in the mwscrape document as last_page_name.
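For example, with the default CouchDB port and the session id printed earlier in this thread, the saved session document can be inspected directly (adjust credentials and session id to your setup):

curl -s http://admin:12...@127.0.0.1:5984/mwscrape/frwiktionary-1588855004-199
# the "last_page_name" field of the returned JSON is where the scrape stopped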
I sent you my checkscript.sh, where I am using this method to continue where the scrape broke (a sketch of such a loop follows below). I limited the attempts to 20, which is arbitrary. There are (at least) two factors influencing this:
a) End of scrape. You got the last article, and of course you do not want to restart the scrape with the last article over and over again.
b) Server response is bad (err 500 etc.). Then it is better to wait a couple of hours before restarting, to get a better connection.
It is actually straightforward, and there is no need to use --start and figure out what the last article was.
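checkscript.sh itself isn't posted in this thread; a minimal sketch of the idea, assuming mwscrape exits non-zero when a scrape breaks:

#!/bin/sh
# retry a broken scrape, capped at 20 attempts so that a scrape which
# actually finished is not restarted over and over again
i=0
while [ $i -lt 20 ]; do
    mwscrape --couch http://admin:12...@127.0.0.1:5984 --resume && break
    i=$((i + 1))
    sleep 3600   # wait before retrying, e.g. after 500s from the server
done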
I'd appreciate it if the French wiktionary slob included the conjugation tables etc. for studying them offline. Like this one: https://fr.wiktionary.org/wiki/Annexe:Conjugaison_en_fran%C3%A7ais/demander
In the currently scraped and scrape2slob'ed frwiktionary I think this is working now. Please test it, Ivan! Thanks.
Frank
All the required data is there, stored under e.g. "Annexe:Conjugaison en français/demander", but 'voir la conjugaison' for "demander" points back to the main article. Would it be possible to rewrite those links?
I tried to update the kochwiki today and got this error message.

mwscrape http://kochwiki.org --siteinfo-only