mwscrape - Error on Ubuntu 18.04 with CouchDB 3.0


franc

Apr 13, 2020, 3:43:05 PM
to aarddict
Hello,

After a long time, I wanted to get back to my old scrape job and tried to install mwscrape2slob on my upgraded Ubuntu 18.04 LTS server, which now runs CouchDB 3.0.
I installed everything following the mwscrape README on GitHub and started my mwscrape script.
But very soon, after scraping a few entries, I got an error and the scrape stopped:

(11871) Norvège ==> (11871) Norge
(1172) Aneas ==> (1172) Énée
(1172) Énée is up to date (rev. 138156454), skipping
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 9, in <module>
    load_entry_point('mwscrape==1.0', 'console_scripts', 'mwscrape')()
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 535, in main
    for _result in pool.imap(process, ipages(pages)):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
couchdb.http.ServerError: (500, (u'unknown_error', u'undefined'))
[17.44.54][root@ew6:/home/franc/aard#

I also started frwiktionary and dewiktionary; all of them stopped very soon, always at line 673 in pool.py.

What could this kind of error mean? I am not experienced with Python and hardly know the language, so I didn't even look into the code :(
On my old Ubuntu 14.04 this never happened, but there I had CouchDB 1.6, which I cannot use anymore because there is no package for it on Ubuntu 18.04.

By the way, I got warnings that Python 2.7 is deprecated and will soon no longer be supported. Could this be the reason?

Thanks for any hints!

frank

Igor Tkach

Apr 13, 2020, 3:55:05 PM
to aarddict
On Mon, Apr 13, 2020 at 3:43 PM franc <franc...@gmail.com> wrote:
I installed everything following the mwscrape README on GitHub and started my mwscrape script.

I just merged an update to the README (sorry, Markus, that it took me a while) with instructions for installing CouchDB 2.
 

--
You received this message because you are subscribed to the Google Groups "aarddict" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aarddict+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aarddict/84d8d222-c843-4a3b-aa88-6fd8561907e0%40googlegroups.com.

MHBraun

Apr 13, 2020, 7:26:05 PM
to aarddict
No worries, Igor.
There will be more updates to come.
CouchDB 2.3.1 has to be installed in "admin party" mode; otherwise it throws errors.
The new mwclient 0.10.0 does not seem to be as stable as the earlier version. It runs for 2 hours at most, then the host closes the connection. I will report in another thread as soon as I have more data.


franc

Apr 28, 2020, 5:00:15 AM
to aarddict
On Monday, April 13, 2020 at 21:55:05 UTC+2, itkach wrote:

Just merged an update to README (sorry Markus it took me a while) with instructions to install CouchDB 2.

I have now found the time to downgrade, as Markus already described in his thread "Access - Error on Ubuntu 18.04 with CouchDB 2.3.1", thanks Markus :)
But I still get these errors after a short while, at any speed (speed=1 or 5):

Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 8, in <module>
    sys.exit(main())
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 535, in main
    for _result in pool.imap(process, ipages(pages)):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
httplib.BadStatusLine: No status line received - the server has closed the connection

The same errors occur again and again. In the CouchDB log I can only see:

[notice] 2020-04-28T04:27:17.183702Z couchdb@127.0.0.1 <0.9501.1425> b0a68b261e 127.0.0.1:5984 127.0.0.1 admin GET /mwscrape/frwiki-1588001952-831 200 ok 5
[notice] 2020-04-28T04:27:17.233145Z couchdb@127.0.0.1 <0.9501.1425> 6f739fd066 127.0.0.1:5984 127.0.0.1 admin PUT /mwscrape/frwiki-1588001952-831 201 ok 47
[notice] 2020-04-28T04:27:17.247680Z couchdb@127.0.0.1 <0.9501.1425> 536c38a61b 127.0.0.1:5984 127.0.0.1 admin GET /frwiki/Alireza%20Dabir 404 ok 10

So nothing there explains the problem.
I also tried the Python 3 clone of mwscrape by TAbdiukov, which doesn't work on my system; I asked him about some errors there, but have had no answer so far.

Markus told me that he automatically restarts mwscrape after a short while to keep it running, but I would love a stable solution.
@Igor, could you look into this error and find out what the real cause is?

Thanks.
frank

Igor Tkach

Apr 28, 2020, 9:31:20 AM
to aarddict
On Tue, Apr 28, 2020 at 5:00 AM franc <franc...@gmail.com> wrote:
@Igor, couldn't you check this error, what is the real reason?

I can't tell for sure from this, but perhaps "server has closed the connection" comes from the Wikipedia server, not CouchDB. Maybe you were generating too many requests and they have now blacklisted your IP or are throttling it more aggressively?

In any case, I updated mwscrape: it now runs on Python 3 and the mwclient deprecation warning is fixed. Please reinstall (follow the README at https://github.com/itkach/mwscrape/tree/master) and give it a try. It seems to work fine with either CouchDB 2 or 3. Try scraping something other than frwiki, just to see whether you get the same errors on different sites. And don't go too fast: try without the --speed parameter (if you still get the same error, at least we'll see the exact spot in the process() function, which appears to be obscured when you run multiple requests in parallel).

franc

Apr 29, 2020, 2:22:01 PM
to aarddict
On Tuesday, April 28, 2020 at 15:31:20 UTC+2, itkach wrote:

...


Thanks a lot!

OK, I updated mwscrape:

cd /home/franc/aard/env-mwscrape
source bin/activate
pip uninstall mwscrape
pip install git+https://github.com/itkach/mwscrape

and retried, but:

[20.16.30][root@ew6:/home/franc/aard/env-mwscrape#source bin/activate
(env-mwscrape) [20.16.37][root@ew6:/home/franc/aard/env-mwscrape#mwscrape fr.m.wiktionary.org --db frwiktionary --couch http://admin:12...@127.0.0.1:5984

Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape/bin/mwscrape", line 5, in <module>
    from mwscrape.scrape import main
  File "/home/franc/aard/env-mwscrape/local/lib/python2.7/site-packages/mwscrape/scrape.py", line 17, in <module>
    import urllib.parse
ImportError: No module named parse
(env-mwscrape) [20.16.54][root@ew6:/home/franc/aard/env-mwscrape#

Do I have to uninstall Python 2.7 now?
I am sure that I have Python 3 installed (v3.6.7-1~18.04), but I guess mwscrape still uses 2.7, for which I also get deprecation warnings.

frank

Igor Tkach

Apr 29, 2020, 2:33:44 PM
to aarddict
You have to create and use a new Python virtualenv:
python3 -m venv env-mwscrape-py3
source env-mwscrape-py3/bin/activate
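
Some background on the ImportError quoted earlier in the thread: urllib.parse exists only in Python 3, so "No module named parse" indicates that the script was still being run by the Python 2.7 interpreter of the old environment. The following is a minimal sketch, not part of mwscrape, for checking which interpreter is active after activating an environment. The function name interpreter_report is made up for this illustration, and the venv test assumes an environment created with "python3 -m venv" (it compares sys.prefix with sys.base_prefix).

```python
import sys


def interpreter_report(version_info=sys.version_info,
                       prefix=sys.prefix,
                       base_prefix=getattr(sys, "base_prefix", sys.prefix)):
    """Return (is_python3, in_venv) for the given interpreter details.

    The defaults describe the interpreter running this code; explicit
    values can be passed in to try the logic out.
    """
    is_python3 = version_info[0] >= 3
    # Inside an environment created by "python3 -m venv", sys.prefix points
    # at the environment while sys.base_prefix points at the system Python.
    in_venv = prefix != base_prefix
    return is_python3, in_venv


is_python3, in_venv = interpreter_report()
print("Python 3:", is_python3, "| inside a venv:", in_venv)
```

Run with the environment activated, both values should be True before mwscrape is installed with pip.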


franc

Apr 29, 2020, 3:08:47 PM
to aarddict
On Wednesday, April 29, 2020 at 20:33:44 UTC+2, itkach wrote:
You have to create and use new python virtualenv,
python3 -m venv env-mwscrape-py3
source env-mwscrape-py3/bin/activate

 OK, thanks! I did this :)
But now, mwscrape is not found:

(env-mwscrape-py3) [21.07.03][root@ew6:/home/franc/aard/env-mwscrape-py3#mwscrape fr.m.wikipedia.org --db frwiki --couch http://admin:12...@127.0.0.1:5984
mwscrape: command not found.
(env-mwscrape-py3) [21.07.14][root@ew6:/home/franc/aard/env-mwscrape-py3#mwscrape-py3 fr.m.wikipedia.org --db frwiki --couch http://admin:12...@127.0.0.1:5984
mwscrape-py3: command not found.
(env-mwscrape-py3) [21.07.20][root@ew6:/home/franc/aard/env-mwscrape-py3#

Sorry that I still don't understand much about this mwscrape script :(
I have never worked with Python, and for me it is quite different from e.g. PHP or JavaScript...

Igor Tkach

Apr 29, 2020, 3:22:10 PM
to aarddict
On Wed, Apr 29, 2020 at 3:08 PM franc <franc...@gmail.com> wrote:
 OK, thanks! I did this :)
But now, mwscrape is not found:

Did you actually install it?

This is all in the README at https://github.com/itkach/mwscrape
 


franc

Apr 29, 2020, 3:25:31 PM
to aarddict
On Wednesday, April 29, 2020 at 21:22:10 UTC+2, itkach wrote:


Did you actually install it?:

this is all in the readme at https://github.com/itkach/mwscrape


So sorry!
Indeed, I had not. Now it is running :)

Thank you very much!
frank

Igor Tkach

Apr 29, 2020, 3:43:28 PM
to aarddict
On Wed, Apr 29, 2020 at 3:25 PM franc <franc...@gmail.com> wrote:

Indeed, that I did not. Now it is running :)

Awesome, now let's see how it performs :)

I think I saw mine stop for some reason after running for some hours (no errors, though), so maybe run it in a loop. So when starting a new scrape with, say, ru.wikipedia.org, you'd run something like this:

mwscrape -c http://admin:secret@localhost:5984 ru.wikipedia.org || while true; do mwscrape -c http://admin:secret@localhost:5984 -r; sleep 60; done

 


Igor Tkach

Apr 29, 2020, 3:48:37 PM
to aarddict
Actually, that only works if you run a single scrape. If you run several at the same time, you should note the session id printed at the beginning, kill the scrape, and then run it in a loop with -r <session id>, like so:

mwscrape -c http://admin:secret@localhost:5984 -r ru-wikipedia-org-1588189519-737
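
The restart-with-session-id idea could also be wrapped in a small script. The sketch below is only an illustration under stated assumptions: the helper names resume_command and run_until_clean_exit are invented here, only the -c and -r options quoted in this thread are used, and the loop is bounded and stops when the command exits with status 0. The thread does not confirm that mwscrape's exit status distinguishes a finished scrape from one that merely stopped, so treat that stop condition as an assumption.

```python
import subprocess
import time


def resume_command(couch_url, session_id):
    """Build the resume invocation quoted in the thread:
    mwscrape -c <couch url> -r <session id>."""
    return ["mwscrape", "-c", couch_url, "-r", session_id]


def run_until_clean_exit(command, max_attempts=10, delay_seconds=60,
                         run=subprocess.call, sleep=time.sleep):
    """Run command repeatedly until it exits with status 0 or the
    attempts are used up. Returns the number of attempts made.

    run and sleep are injectable so the loop can be tried out without
    starting a real scrape.
    """
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < max_attempts:
            # Give the remote server a pause before resuming the session.
            sleep(delay_seconds)
    return max_attempts
```

A call might look like run_until_clean_exit(resume_command("http://admin:secret@localhost:5984", "ru-wikipedia-org-1588189519-737")), with one such loop per session when several scrapes run side by side.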

franc

Apr 30, 2020, 1:44:02 AM
to aarddict
On Wednesday, April 29, 2020 at 21:43:28 UTC+2, itkach wrote:

Awesome, now let's see how it performs :)


It has been running since yesterday!
:) :) :)
I didn't use the speed parameter, so it might be slower.
I will reorganize my scrapes so that only one scrape is active at a time. Maybe then I won't even need a loop.
Thanks.

franc

Apr 30, 2020, 1:24:24 PM
to aarddict
On Thursday, April 30, 2020 at 07:44:02 UTC+2, franc wrote:
Still running since yesterday!


And it is still running (frwiki), very nice :)

One question about installing mwscrape: I get some errors from "pylru" during installation:

...
 
  Running setup.py bdist_wheel for pylru ... error
  Complete output from command /home/franc/aard/dewiktionary/env-mwscrape/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-glmc8a42/pylru/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpx9ydqv4wpip-wheel- --python-tag cp36:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for pylru
  Running setup.py clean for pylru
Failed to build pylru
...

The rest seems to install without errors, and mwscrape seems to work fine.
Can I ignore that pylru error?

Thanks.

Igor Tkach

Apr 30, 2020, 2:03:34 PM
to aarddict
On Thu, Apr 30, 2020, 13:24 franc <franc...@gmail.com> wrote:

The rest seems without errors and mwscrape seems to work fine.
Can I ignore that pylru-error?

I think I saw the same error and it looks like we can ignore it.
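
For context, "invalid command 'bdist_wheel'" is the message setuptools gives when the wheel package is not installed in the environment. pip then falls back to a plain setup.py install, which fits the observation that mwscrape works anyway. A minimal sketch for checking this from inside the activated environment follows; the function name wheel_available is made up for this illustration.

```python
import importlib.util


def wheel_available(find_spec=importlib.util.find_spec):
    """Return True if the 'wheel' package can be imported in this environment.

    find_spec is injectable so the check can be tried out without
    changing the environment.
    """
    return find_spec("wheel") is not None


print("wheel package installed:", wheel_available())
```

If this prints False, running "pip install wheel" in the environment before installing mwscrape should make the message go away, although the installed result should be the same either way.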
 


franc

May 1, 2020, 7:52:08 AM
to aarddict
On Thursday, April 30, 2020 at 20:03:34 UTC+2, itkach wrote:
I think I saw the same error and it looks like we can ignore it.


:thumbsup:

In the meantime my frwiki scrape has stopped too. But it ran for a long time, nearly a day! The CouchDB database currently has 312,061 entries and takes 5.1 GB. frwiktionary, which I started later, has 228,181 entries and 0.9 GB.
So I will run it in a loop too.

By the way: Fauxton, the web GUI for CouchDB since 2.x, is less functional. In Futon (CouchDB 1.x) it was possible to start a compaction of a database directly; this feature is gone in Fauxton :(
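
Independently of the web GUI, what CouchDB calls compaction can be started through its HTTP API with a POST to /{db}/_compact, sent with a JSON content type and admin credentials. The sketch below only builds such a request with the standard library. The function name compaction_request is invented for this illustration, and the host, database name and credentials in the usage note are placeholders rather than values confirmed by the thread.

```python
import base64
import urllib.parse
import urllib.request


def compaction_request(base_url, db_name, user, password):
    """Build (but do not send) a POST request to CouchDB's /{db}/_compact.

    base_url is the server address without credentials, for example
    http://127.0.0.1:5984; the credentials go into a Basic auth header.
    """
    url = "{}/{}/_compact".format(base_url.rstrip("/"),
                                  urllib.parse.quote(db_name, safe=""))
    token = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
```

Sending it would be urllib.request.urlopen(compaction_request("http://127.0.0.1:5984", "frwiki", "admin", "secret")). CouchDB answers as soon as the compaction has been accepted and then runs it in the background.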

franc

May 14, 2020, 8:47:26 AM
to aarddict
On Friday, May 1, 2020 at 13:52:08 UTC+2, franc wrote:
... The couch has 312.061 entries with 5.1 GB at the moment. frwiktionary, which I started later, has 228181 entries with 0.9 GB.

So I will do it in loop too.


After weeks I think it has finished, lately without any more errors or interruptions. I consider CouchDB 2.3.1 much more stable than 3.x, where the errors occur much more often.
When mwscrape had finished, frwiki took 21.3 GB in CouchDB and had 2,213,944 entries.
I then ran mwscrape2slob but again ran into some unknown errors:

  Magny-sur-Tille
  Magnéric de Trèves
  Magné (Vienne)

Finished adding content in 8:37:58
Finalizing...
Sorting... sorted in 0:02:24
Resolving aliases...
Sorting... sorted in 0:02:10
Resolved aliases in 0:02:10
Finalized in 0:07:06
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape2slob/bin/mwscrape2slob", line 8, in <module>
    sys.exit(main())
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 838, in main
    article_source.run()
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(2351431 bytes read)
[03.57.32][root@ew6:/home/franc/aard#

In fact, the resulting frwiki.lzma2.slob is only 3.7 GB, which is not the full size :(
I have now updated mwscrape2slob and restarted it; I hope it will work this time...

frwiktionary has also finished (without errors) and now takes 8.8 GB in CouchDB with 3,834,874 entries. I am running mwscrape2slob on it now as well, including the newly scraped namespace 100 (--article-namespace 0 100)...



franc

May 17, 2020, 4:17:33 AM
to aarddict
On Thursday, May 14, 2020 at 14:47:26 UTC+2, franc wrote:
...

The frwiktionary has also finished (without errors) and has now 8.8 GB in the CouchDB and 3.834.874 entries, this too I do mwscrape2slob now. With the new scraped namespace 100 (--article-namespace 0 100)...


As one can read in the other thread, frwiktionary is incomplete too!
I thought that with 3,834,874 documents in CouchDB it should be complete, but mwscrape2slob has thrown errors here as well, as I notice now:

...
  trouveront
  trouves
ERROR:mwscrape2slob:
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(326007 bytes read)

Finished adding content in 6:10:29
Finalizing...
Sorting... sorted in 0:03:44
Resolving aliases...
Sorting... sorted in 0:03:52
Resolved aliases in 0:03:52
Finalized in 0:07:50
Traceback (most recent call last):
  File "/home/franc/aard/env-mwscrape2slob/bin/mwscrape2slob", line 8, in <module>
    sys.exit(main())
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 838, in main
    article_source.run()
  File "/home/franc/aard/env-mwscrape2slob/lib/python3.6/site-packages/mwscrape2slob/__init__.py", line 294, in run
    for title, aliases, text, error in resulti:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
http.client.IncompleteRead: IncompleteRead(326007 bytes read)

So now I am worried about this; I don't understand the error. I guess it means that articles are missing ("IncompleteRead"), but how can that be?

I had already restarted the frwiki scrape (in a loop), for which CouchDB currently holds 2,214,487 documents. When this is finished I will restart the mwscrape of frwiktionary, hoping that a clean scrape will fix the errors.
Could the reason be that I didn't use the --delete-not-found parameter? I thought I wouldn't need it for the first scrape...

Thanks for any hints.
frank

Igor Tkach

May 17, 2020, 11:32:59 AM
to aarddict
It doesn't mean that. This is a low-level error indicating that data could not be fully read from the database. This could point to the computer running out of resources (perhaps one of the partitions is running out of space? Run "df -h" and see whether anything is at or close to 100% use). You should also check the CouchDB logs (probably in /var/log/couchdb/).
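
The disk space check can also be scripted, for example to run before each mwscrape2slob conversion. This is a rough sketch using Python's shutil.disk_usage; the function names and the 95% threshold are arbitrary choices for this illustration, and the percentage can differ slightly from the "Use%" column of df, which accounts for reserved blocks.

```python
import shutil


def usage_percent(path, disk_usage=shutil.disk_usage):
    """Return how full the filesystem containing path is, in percent."""
    usage = disk_usage(path)
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=95.0, disk_usage=shutil.disk_usage):
    """Return the paths whose filesystem is at or above threshold percent."""
    return [p for p in paths if usage_percent(p, disk_usage) >= threshold]
```

For example, nearly_full(["/", "/var", "/tmp"]) returns an empty list when none of those filesystems is close to full; which mount points matter depends on where CouchDB keeps its data on the machine in question.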
 
