neolurk.org scraping not starting


sklart

May 3, 2023, 6:41:23 AM
to aarddict
neolurk.org scraping not starting:

mwscrape -c http://admin:admin@localhost:5984 https://neolurk.org/ --speed 5
Connecting http://localhost:5984 as user admin
Starting session neolurk-org-1683110120-619
Traceback (most recent call last):
  File "/home/aard/env-mwscrape/bin/mwscrape", line 8, in <module>
    sys.exit(main())
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwscrape/scrape.py", line 367, in main
    site = mwclient.Site(host, path=args.site_path, ext=args.site_ext, scheme=scheme)
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 130, in __init__
    self.site_init()
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 150, in site_init
    meta = self.get('query', meta='siteinfo|userinfo',
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 234, in get
    return self.api(action, 'GET', *args, **kwargs)
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 285, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 437, in raw_api
    res = self.raw_call('api', data, retry_on_error=retry_on_error,
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/mwclient/client.py", line 409, in raw_call
    stream.raise_for_status()
  File "/home/aard/env-mwscrape/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://neolurk.org/w/api.php?meta=siteinfo%7Cuserinfo%7Cuserinfo&siprop=general%7Cnamespaces&uiprop=groups%7Crights%7Cblockinfo%7Chasmsg&continue=&action=query&format=json


Please tell me what the problem could be.

itkach

May 5, 2023, 12:49:20 PM
to aarddict
The 403 response status code from neolurk.org tells us that they don't want the client to access their API. Now, the same request works with curl or from the browser, so it looks like they check the User-Agent header and allow or deny access based on that. Either they specifically blacklist mwclient-based scripts like mwscrape, or they have a whitelist of user agents that they allow ¯\_(ツ)_/¯
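
If you want to verify this yourself, here is a quick sketch with requests (the exact User-Agent strings are just examples, and the expected status codes assume the theory above holds):

import requests

url = "https://neolurk.org/w/api.php"
params = {"action": "query", "meta": "siteinfo", "format": "json"}

for ua in ("python-requests/2.28.2", "curl/8.0.1"):
    r = requests.get(url, params=params, headers={"User-Agent": ua})
    # expect 403 for the script-like agent and 200 for the curl-like one
    print(ua, "->", r.status_code)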

I added a --user-agent option to mwscrape; the following command now works:

mwscrape https://neolurk.org --user-agent "curl/8.87.0"

It would seem that this goes against the site owner's wishes though (also notice "Disallow: /w/" in https://neolurk.org/robots.txt). Nobody can stop you if you want to try anyway, but it may be best to contact them and ask for permission first.
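
The robots.txt rules can also be checked programmatically with Python's standard library; a small sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://neolurk.org/robots.txt")
rp.read()
# "Disallow: /w/" covers the API endpoint, so this should print False
print(rp.can_fetch("*", "https://neolurk.org/w/api.php"))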

sklart

May 10, 2023, 4:22:42 AM
to aarddict
I apologize for going slightly off-topic in this thread, but I do not want to create a new topic.
When creating a snapshot from https://ru.wikihow.com (with --speed 0), the error
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://ru.wikihow.com/api.php
appears after a while. The file https://ru.wikihow.com/robots.txt says:
Our general guideline is not to crawl more than 1 page every 3 seconds.
Perhaps add a crawl timeout option to mwscrape for downloading pages?
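
For illustration, the kind of throttling I have in mind might look like this (a sketch only, not actual mwscrape code):

import time
import requests

MIN_INTERVAL = 3.0  # robots.txt asks for at most 1 page every 3 seconds
_last_request = 0.0

def polite_get(url, **kwargs):
    # a requests.get that sleeps so calls are at least MIN_INTERVAL apart
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return requests.get(url, **kwargs)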


sklart

May 10, 2023, 5:39:43 AM
to aarddict
Another oddity: I have noticed that on some sites, after a certain point, scraping progress stops being displayed in the terminal.
For example, the same neolurk.org stops at the word "Бензин".
Бендер Родригес ==> Бендер
Бендер is up to date (rev. 1266524), skipping
  18159 Бенедикт Камбербэтч
Бенедикт Камбербэтч is up to date (rev. 1264350), skipping
  18160 Бенедикт Спиноза
Бенедикт Спиноза ==> Спиноза
Спиноза is up to date (rev. 1022454), skipping
  18161 Бенезия
Бенезия is up to date (rev. 381738), skipping
  18162 Бенефис беты
  18163 Бензедрин
Бензедрин ==> Вещества
Вещества is up to date (rev. 1277923), skipping
  18164 Бензин - это взрывчатка
Бензин - это взрывчатка ==> Бензин — это взрывчатка
  18165 Бензин


When viewing this database in CouchDB, however, it is clear that scraping progress is still being made:
{
  "_id": "neolurk-org-1683701473-299",
  "_rev": "45176-ce5fae9fc78ab82bb2b824750b3f210d",
  "created_at": "2023-05-10T06:51:13.560633",
  "site": "https://neolurk.org",
  "db_name": "neolurk-org",
  "descending": false,
  "last_page_name": "Велосипед",               // constantly updated
  "updated_at": "2023-05-10T09:37:38.347564",  // constantly updated
  "new": 12547,                                // constantly updated
  "up_to_date": 9973,                          // constantly updated
  "error": 67
}


What is the reason for this behavior, and is there a way to fix it so that the scraping progress can be followed?
Thanks!

itkach

May 10, 2023, 11:26:47 AM
to aarddict
On Wednesday, May 10, 2023 at 4:22:42 AM UTC-4 sklart wrote:
I apologize for going slightly off-topic in this thread, but I do not want to create a new topic.
When creating a snapshot from https://ru.wikihow.com (with --speed 0), the error
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://ru.wikihow.com/api.php
appears after a while. The file https://ru.wikihow.com/robots.txt says:
Our general guideline is not to crawl more than 1 page every 3 seconds.
Perhaps add a crawl timeout option to mwscrape for downloading pages?

You mean delay, not timeout. It already exists: --delay
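
With the 1-page-per-3-seconds guideline above, that would be, for example:

mwscrape https://ru.wikihow.com --speed 0 --delay 3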

itkach

May 10, 2023, 11:30:45 AM
to aarddict
On Wednesday, May 10, 2023 at 5:39:43 AM UTC-4 sklart wrote:
Another oddity: I have noticed that on some sites, after a certain point, scraping progress stops being displayed in the terminal.

Could it be an issue with the terminal? Try running it in another terminal program, or play with the terminal's settings (e.g., how many lines of scrollback does it keep?).
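
In the meantime, since the session document in CouchDB keeps updating, progress can be followed from there directly. A minimal polling sketch (the database holding session documents, the credentials, and the polling interval here are assumptions; adjust them to your setup):

import time
import requests

# assumed location of the session document shown earlier in the thread
doc_url = "http://admin:admin@localhost:5984/mwscrape/neolurk-org-1683701473-299"

while True:
    doc = requests.get(doc_url).json()
    print(doc.get("last_page_name"),
          "new:", doc.get("new"),
          "up to date:", doc.get("up_to_date"),
          "errors:", doc.get("error"))
    time.sleep(30)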
 

sklart

May 10, 2023, 4:02:30 PM
to aarddict
On Wednesday, May 10, 2023 at 6:26:47 PM UTC+3, itkach wrote:
Yes, of course I meant delay; that was a bad translation on my part.
OK, thanks for this option. This parameter is not documented on the site, so I did not know about it (I once looked through the scrape.py code, but apparently forgot).

On Wednesday, May 10, 2023 at 6:30:45 PM UTC+3, itkach wrote:
Well, it is the default Ubuntu LTS terminal without any additional settings.
I will take a look later at the settings you are talking about.
By the way, after some time the terminal unfroze and, starting with the letter "Л", began showing progress again.

AardF...@web.de

May 11, 2023, 9:48:15 AM
to aarddict
Sometimes it is better to create a new topic ;)
@sklart, how did you create the ru.wikihow.com snapshot? Are there any namespaces to consider, or did you just run

mwscrape --speed 0 --delay 3 -c http://admin:secret@localhost:5984 https://ru.wikihow.com

AardF...@web.de

May 11, 2023, 9:58:49 AM
to aarddict
@sklart, the reason I am asking is that I am getting 


for
mwscrape --speed 0 --delay 3 --db dewikihow -c http://admin:secret@localhost:5984 https://de.wikihow.com

sklart

May 11, 2023, 1:36:44 PM
to aarddict


On Thursday, May 11, 2023 at 4:48:15 PM UTC+3, AardF...@web.de wrote:
Sometimes it is better to create a new topic ;)
@sklart, how did you create the ru.wikihow.com snapshot? Are there any namespaces to consider, or did you just run

mwscrape --speed 0 --delay 3 -c http://admin:secret@localhost:5984 https://ru.wikihow.com

The last command that works is
mwscrape -c http://admin:admin@localhost:5984 https://ru.wikihow.com --site-path=/ --user-agent "curl/8.87.0" --speed 0 --delay 2
(--site-path=/ is needed because wikihow serves api.php from the site root rather than from /w/, as the 429 error URL above shows)
 

AardF...@web.de

May 12, 2023, 7:18:02 AM
to aarddict
Thank you.
It looks like I have a different issue here.
