Hello,
after a long time I wanted to come back to my old scrape job and tried to install mwscrape2slob on my upgraded Ubuntu 18.04 LTS server, which now runs CouchDB 3.0. I installed everything following the mwscrape README on GitHub and started my mwscrape script.
But very quickly, after scraping only a few entries, I got an error and the scrape stopped:
(11871) Norvège ==> (11871) Norge
(1172) Aneas ==> (1172) Énée
(1172) Énée is up to date (rev. 138156454), skipping
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 9, in <module>
load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 535, in main
for _result in pool.imap(process, ipages(pages)):
File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
raise value
couchdb.http.ServerError: (500, (u'unknown_error', u'undefined'))
[17.44.54][root@ew6:/home/franc/aard#
I also started frwiktionary and dewiktionary; all of them stopped very soon, always at line 673 in pool.py.
What could this kind of error mean? I am not experienced with Python and barely know the language, so I didn't even look into the code :(
On my old Ubuntu 14.04 this never happened, but there I had CouchDB 1.6, which I cannot use anymore because there is no package for it on Ubuntu 18.04.
By the way, I got warnings that Python 2.7 is deprecated and will soon no longer be supported. Could this be the reason?
Thanks for any hints!
frank
--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aarddict/84d8d222-c843-4a3b-aa88-6fd8561907e0%40googlegroups.com.
...
Just merged an update to README (sorry Markus it took me a while) with instructions to install CouchDB 2.
I have now found the time to downgrade, as Markus already posted in his question Access - Error on Ubuntu 18.04 with CouchDB 2.3.1, thanks Markus :)
But I still get these errors after a short while, at any speed (speed=1 or 5):
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 8, in <module>
sys.exit(main())
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 535, in main
for _result in pool.imap(process, ipages(pages)):
File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
raise value
httplib.BadStatusLine: No status line received - the server has closed the connection
Again and again the same errors. In the CouchDB log I can only read:
[notice] 2020-04-28T04:27:17.183702Z couchdb@127.0.0.1 <0.9501.1425> b0a68b261e 127.0.0.1:5984 127.0.0.1 admin GET /mwscrape/frwiki-1588001952-831 200 ok 5
[notice] 2020-04-28T04:27:17.233145Z couchdb@127.0.0.1 <0.9501.1425> 6f739fd066 127.0.0.1:5984 127.0.0.1 admin PUT /mwscrape/frwiki-1588001952-831 201 ok 47
[notice] 2020-04-28T04:27:17.247680Z couchdb@127.0.0.1 <0.9501.1425> 536c38a61b 127.0.0.1:5984 127.0.0.1 admin GET /frwiki/Alireza%20Dabir 404 ok 10
@Igor, could you check this error and tell me what the real reason is?
...
I can't tell from this for sure, but perhaps "server has closed the connection" comes from the Wikipedia server, not CouchDB. Maybe you were generating too many requests and they have now blacklisted your IP or throttle it more aggressively?
In any case, I updated mwscrape: it now runs on Python 3 and the mwclient deprecation warning is fixed. Please reinstall (follow the README at https://github.com/itkach/mwscrape/tree/master) and give it a try. It seems to be working fine with either CouchDB 2 or 3. Try scraping something other than frwiki just to see whether you get the same errors on different sites. And don't go too fast: try without the --speed parameter (if you still get the same error, at least we'll see the exact spot in the process() function, which appears to be obscured when you run multiple requests in parallel).
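One way to narrow down which side drops the connection is to ask CouchDB directly whether it still answers. The following is only a minimal sketch, assuming CouchDB listens on http://127.0.0.1:5984 as in the commands in this thread; the function names are made up for illustration and are not part of mwscrape.

```python
import json
import urllib.request


def parse_welcome(body):
    """Return the version string from CouchDB's root ("welcome") document."""
    info = json.loads(body.decode("utf-8"))
    if info.get("couchdb") != "Welcome":
        raise ValueError("not a CouchDB welcome response")
    return info["version"]


def couchdb_version(url="http://127.0.0.1:5984/", timeout=5):
    """GET the server root and return the CouchDB version it reports.

    The default URL is an assumption taken from the commands in this thread.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_welcome(response.read())


if __name__ == "__main__":
    print("CouchDB answers, version", couchdb_version())
```

If this keeps answering while a scrape fails with "server has closed the connection", that would point towards the wiki side rather than CouchDB, though it does not prove it.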
cd /home/franc/aard/env-mwscrape
source bin/activate
pip uninstall mwscrape
pip install git+https://github.com/itkach/mwscrape
[20.16.30][root@ew6:/home/franc/aard/env-mwscrape#source bin/activate
(env-mwscrape) [20.16.37][root@ew6:/home/franc/aard/env-mwscrape#mwscrape fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 5, in <module>
from mwscrape.scrape import main
File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 17, in <module>
import urllib.parse
ImportError: No module named parse
(env-mwscrape) [20.16.54][root@ew6:/home/franc/aard/env-mwscrape#
You have to create and use a new Python virtualenv:
python3 -m venv env-mwscrape-py3
source env-mwscrape-py3/bin/activate
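For context: "ImportError: No module named parse" is what Python 2.7 reports for "import urllib.parse", because that module only exists in Python 3. A small sketch, not taken from mwscrape, of how one could confirm that the activated environment really is a Python 3 virtualenv (the function names are illustrative only):

```python
import sys


def is_python3(version_info=sys.version_info):
    """True when the interpreter is Python 3 or newer."""
    return version_info[0] >= 3


def in_virtualenv(prefix=sys.prefix,
                  base_prefix=getattr(sys, "base_prefix", sys.prefix)):
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python 3:", is_python3())
    print("inside a virtualenv:", in_virtualenv())
```

Run with the python of the activated environment, both lines should say True before mwscrape is installed there.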
On Wednesday, 29 April 2020 20:33:44 UTC+2, itkach wrote:
You have to create and use new python virtualenv,
python3 -m venv env-mwscrape-py3
source env-mwscrape-py3/bin/activate
OK, thanks! I did this :)
But now mwscrape is not found:
(env-mwscrape-py3) [21.07.03][root@ew6:/home/franc/aard/env-mwscrape-py3#mwscrape fr.m.wikipedia.org --db
frwiki --couch http://admin:12...@127.0.0.1:5984
mwscrape: Befehl nicht gefunden.
(env-mwscrape-py3) [21.07.14][root@ew6:/home/franc/aard/env-mwscrape-py3#mwscrape-py3 fr.m.wikipedia.org --db frwiki --couch http://admin:12...@127.0.0.1:5984
mwscrape-py3: Befehl nicht gefunden.
(env-mwscrape-py3) [21.07.20][root@ew6:/home/franc/aard/env-mwscrape-py3#
("Befehl nicht gefunden" means "command not found".)
Sorry that I don't understand much of this mwscrape script yet :( I have never worked with Python, and for me it is different from e.g. PHP or JavaScript...
On Wed, Apr 29, 2020 at 3:08 PM franc <franc...@gmail.com> wrote:
OK, thanks! I did this :)
But now, mwscrape is not found:
Did you actually install it? This is all in the README at https://github.com/itkach/mwscrape
Indeed, I had not. Now it is running :)
Thank you very much!
frank
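A fresh virtualenv starts without mwscrape, so the shell cannot find the command until it has been installed there. As a rough sketch (the helper name is made up, not part of mwscrape), this is one way to check whether a command is visible on the current PATH:

```python
import shutil


def find_command(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_command("mwscrape")
    if location is None:
        print("mwscrape is not installed in the active environment")
    else:
        print("mwscrape found at", location)
```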
Awesome, now let's see how it performs :)
Still running since yesterday!
...
Running setup.py bdist_wheel for pylru ... error
Complete output from command /home/franc/aard/dewiktionary/env-mwscrape/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-glmc8a42/pylru/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpx9ydqv4wpip-wheel- --python-tag cp36:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'bdist_wheel'
----------------------------------------
Failed building wheel for pylru
Running setup.py clean for pylru
Failed to build pylru
...
The rest seems to be without errors and mwscrape seems to work fine. Can I ignore that pylru error?
Thanks.
On Thu, Apr 30, 2020, 13:24 franc <franc...@gmail.com> wrote:
The rest seems without errors and mwscrape seems to work fine. Can I ignore that pylru-error?
I think I saw the same error and it looks like we can ignore it.
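As background, "error: invalid command 'bdist_wheel'" commonly appears when the wheel package is not available in the virtualenv; pip then typically falls back to a plain setup.py install, which would fit the observation that everything else works. A small sketch, assuming nothing beyond the standard library, to see whether pylru ended up importable anyway (the helper name is illustrative):

```python
import importlib.util


def is_importable(module_name):
    """True when the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("wheel", "pylru"):
        print(name, "importable:", is_importable(name))
```

If pylru is reported as importable inside the virtualenv, the failed wheel build had no lasting effect.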
... The couch currently has 312,061 entries taking 5.1 GB. frwiktionary, which I started later, has 228,181 entries taking 0.9 GB.
So I will run it in a loop too.
Magny-sur-Tille
Magnéric de Trèves
Magné (Vienne)
Finished adding content in 8:37:58
Finalizing...
Sorting... sorted in 0:02:24
Resolving aliases...
Sorting... sorted in 0:02:10
Resolved aliases in 0:02:10
Finalized in 0:07:06
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape2slob/bin/mwscrape2slob", line 8, in <module>
sys.exit(main())
File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 838, in main
article_source.run()
File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
for title, aliases, text, error in resulti:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
http.client.IncompleteRead: IncompleteRead(2351431 bytes read)
[03.57.32][root@ew6:/home/franc/aard#
...
frwiktionary has also finished (without errors) and now has 3,834,874 entries taking 8.8 GB in CouchDB. I am now running mwscrape2slob on it too, with the newly scraped namespace 100 (--article-namespace 0 100)...
...
trouveront
trouves
ERROR:mwscrape2slob:
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
for title, aliases, text, error in resulti:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
http.client.IncompleteRead: IncompleteRead(326007 bytes read)
Finished adding content in 6:10:29
Finalizing...
Sorting... sorted in 0:03:44
Resolving aliases...
Sorting... sorted in 0:03:52
Resolved aliases in 0:03:52
Finalized in 0:07:50
Traceback (most recent call last):
File "/home/franc/aard/env-mwscrape2slob/bin/mwscrape2slob", line 8, in <module>
sys.exit(main())
File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 838, in main
article_source.run()
File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
for title, aliases, text, error in resulti:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
http.client.IncompleteRead: IncompleteRead(326007 bytes read)
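http.client.IncompleteRead is raised when the connection ends before the whole response body has arrived. The following is not how mwscrape2slob works internally; it is only a generic sketch of retrying a read that came back truncated, where "fetch" stands for any hypothetical function that performs one HTTP read.

```python
import http.client
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() and retry when the HTTP body arrives truncated."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except http.client.IncompleteRead as error:
            last_error = error
            if attempt < attempts:
                # Wait a little before asking the server again.
                time.sleep(delay)
    # Every attempt was truncated: re-raise the last error seen.
    raise last_error
```

Only IncompleteRead is retried here; any other exception passes straight through, so real failures stay visible.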