{"status": "ok", "running": [], "finished": [], "pending": []} + saving the output into a JSON file


Marco Ippolito

Jan 20, 2015, 7:23:04 AM
to scrapy...@googlegroups.com
Hi,
I've got two situations to solve.

The first: everything seems to be OK:

(SCREEN)marco@pc:~/crawlscrape/sole24ore$ scrapyd-deploy sole24ore -p sole24ore
Packing version 1421755479
Deploying to project "sole24ore" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "sole24ore", "version": "1421755479", "spiders": 1}


marco@pc:/var/lib/scrapyd/dbs$ ls -lah
totale 12K
drwxr-xr-x 2 scrapy nogroup 4,0K gen 20 13:04 .
drwxr-xr-x 5 scrapy nogroup 4,0K gen 20 06:55 ..
-rw-r--r-- 1 root   root    2,0K gen 20 13:04 sole24ore.db


marco@pc:/var/lib/scrapyd/eggs/sole24ore$ ls -lah
totale 16K
drwxr-xr-x 2 scrapy nogroup 4,0K gen 20 13:04 .
drwxr-xr-x 3 scrapy nogroup 4,0K gen 20 12:47 ..
-rw-r--r-- 1 scrapy nogroup 5,5K gen 20 13:04 1421755479.egg


But nothing is executed:

marco@pc:/var/lib/scrapyd/items/sole24ore/sole$ ls -a
.  ..

[detached from 2515.pts-4.pc]
marco@pc:~/crawlscrape/sole24ore$ curl http://localhost:6800/listjobs.json?project=sole24ore
{"status": "ok", "running": [], "finished": [], "pending": []}



The second issue concerns how to save the output into a JSON file.
What is the correct form to put into settings.py?

# Scrapy settings for sole24ore project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#

BOT_NAME = 'sole24ore'

SPIDER_MODULES = ['sole24ore.spiders']
NEWSPIDER_MODULE = 'sole24ore.spiders'

FEED_URI=file://home/marco/crawlscrape/sole24ore/output.json --set FEED_FORMAT=json


(SCREEN)marco@pc:~/crawlscrape/sole24ore$ scrapyd-deploy sole24ore -p sole24ore
Packing version 1421756389
Deploying to project "sole24ore" in http://localhost:6800/addversion.json
Server response (200):
{"status": "error", "message": "SyntaxError: invalid syntax"}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sole24ore (+http://www.yourdomain.com)'

Looking forward to your kind help.
Kind regards.
Marco

Daniel Fockler

Jan 20, 2015, 12:45:08 PM
to scrapy...@googlegroups.com
For your first problem: you've deployed the scrapyd project, but you still need to schedule a spider run using the schedule.json endpoint. Something like:

curl http://localhost:6800/schedule.json -d project=sole24ore -d spider=yourspidername
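The same request can be sketched from Python if you prefer it over curl; it's a plain POST with a form-encoded body, and `yourspidername` below is a placeholder:

```python
from urllib.parse import urlencode

# Fields that scrapyd's schedule.json endpoint expects as the POST
# form body; 'yourspidername' is a placeholder for your spider.
params = {'project': 'sole24ore', 'spider': 'yourspidername'}
body = urlencode(params)
print(body)  # project=sole24ore&spider=yourspidername

# With scrapyd running, POST this body to
# http://localhost:6800/schedule.json (e.g. via urllib.request.urlopen
# with data=body.encode()).
```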

For your second problem, your settings.py is misconfigured; your feed settings should look like:

FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'
FEED_FORMAT = 'json'
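One general caveat about the URI form (not Scrapy-specific): in a `file://` URI the segment between the second and third slash is the host, so an absolute local path really needs three slashes, `file:///home/...`. Python's standard `urllib.parse` shows how the two-slash form gets split:

```python
from urllib.parse import urlparse

# "file://home/..." puts "home" in the host position, so the path
# that remains starts at /marco/... -- probably not what you meant.
two_slash = urlparse('file://home/marco/crawlscrape/sole24ore/output.json')
print(two_slash.netloc)  # home
print(two_slash.path)    # /marco/crawlscrape/sole24ore/output.json

# With three slashes the host is empty and the full absolute path
# is preserved.
three_slash = urlparse('file:///home/marco/crawlscrape/sole24ore/output.json')
print(three_slash.netloc)  # (empty)
print(three_slash.path)    # /home/marco/crawlscrape/sole24ore/output.json
```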

Hope that helps

Marco Ippolito

Jan 21, 2015, 2:00:50 AM
to scrapy...@googlegroups.com
Hi Daniel,
thank you very much for your kind help.

After scheduling the spider run, an output is actually produced:

Opening file /var/lib/scrapyd/items/sole24ore/sole/89d644f8a13a11e4a2afc04a00090e80.jl
Read output!
This is my output:
{"url": ["http://m.bbc.co.uk", "http://www.bbc.com/news/", " .....
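(The `.jl` extension is JSON Lines: one complete JSON object per line. Reading it back is straightforward; the sample data below is made up to match the shape of the output above:)

```python
import json

# Hypothetical sample of a scrapyd .jl (JSON Lines) file:
# each line is one standalone JSON object.
sample = (
    '{"url": ["http://m.bbc.co.uk", "http://www.bbc.com/news/"]}\n'
    '{"url": ["http://example.com"]}\n'
)

items = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(items))          # 2
print(items[0]["url"][0])  # http://m.bbc.co.uk
```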

But modifying the feed settings as:
BOT_NAME = 'sole24ore'

SPIDER_MODULES = ['sole24ore.spiders']
NEWSPIDER_MODULE = 'sole24ore.spiders'

FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'

doesn't produce an output.json in /home/marco/crawlscrape/sole24ore.

Am I missing some other steps?

Marco

Daniel Fockler

Jan 21, 2015, 2:28:25 PM
to scrapy...@googlegroups.com
You'll want to make sure that the feed format is set in your settings.py, like:

FEED_FORMAT = 'json'

If it doesn't work after that, try changing the feed URI to
 
FEED_URI = 'output.json'

and Scrapy will dump it into your project root.
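Putting the two settings together, a minimal sketch of the relevant part of settings.py (note that relative feed URIs are resolved from the directory you start the crawl in):

```python
# settings.py (relevant excerpt) -- a minimal sketch

BOT_NAME = 'sole24ore'

SPIDER_MODULES = ['sole24ore.spiders']
NEWSPIDER_MODULE = 'sole24ore.spiders'

# A relative URI is resolved from the directory the crawl is started
# in; for an absolute location use 'file:///home/...' (three slashes).
FEED_URI = 'output.json'
FEED_FORMAT = 'json'
```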

Marco Ippolito

Jan 21, 2015, 2:41:49 PM
to scrapy...@googlegroups.com
Hi Daniel,
thanks again for helping.

I tried with
FEED_URI = 'file://home/marco/crawlscrape/sole24ore/output.json'
FEED_FORMAT = 'json'

and with
FEED_URI = 'output.json'
FEED_FORMAT = 'json'
In both cases there is no output and no error message.

Any hints?

Marco

Daniel Fockler

Jan 21, 2015, 4:09:05 PM
to scrapy...@googlegroups.com
If you end up with an empty output.json, or a file that just has a '[' character, that could mean that Scrapy couldn't find any items from your spider. If that is not the case, then there is another issue. Scrapyd should output logs for every spider you run, in a logs directory.
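For reference, the log location is controlled by the `logs_dir` option in scrapyd.conf, with one file per run grouped by project and spider. The values below are an assumption based on the /var/lib/scrapyd paths seen earlier in this thread:

```ini
# scrapyd.conf (excerpt) -- paths are an assumption based on the
# /var/lib/scrapyd layout shown earlier in this thread
[scrapyd]
logs_dir = /var/lib/scrapyd/logs
# individual runs are written to <logs_dir>/<project>/<spider>/<job id>.log
```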

Marco Ippolito

Jan 22, 2015, 3:16:27 AM
to scrapy...@googlegroups.com
Hi Daniel,
Not seeing any logs, I decided to create a new project from scratch, but I still have the same problems.
I attached the compressed (.tar) log files together with the
compressed (.tar) urls_listing project directory.
Feel free to have a try.

Looking forward to your kind help.
Marco
3e780d98a20711e4.tar
8d8e0dfaa20d11e4.tar
urls_listing.tar

Daniel Fockler

Jan 22, 2015, 2:16:36 PM
to scrapy...@googlegroups.com
Alright, so it looks like running your project via scrapyd-deploy is changing the output settings, which is why your item output files are going to

/var/lib/scrapyd/items/urls_listing/urls_grasping/8d8e0dfaa20d11e48c91c04a00090e80.jl

for some reason. You can try running scrapyd in your project without scrapyd-deploy; that should allow you to scrape using the correct settings. I don't have much experience with the deploy features of scrapyd-deploy, so I'm not sure I can help much with that.
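One more thing worth checking, if I remember scrapyd's configuration correctly: the `items_dir` option. When it is non-empty, scrapyd overrides the project's FEED_URI and writes item .jl files under that directory, which would explain the path above. Clearing it in scrapyd.conf should let your own feed settings apply:

```ini
# scrapyd.conf (excerpt) -- leave items_dir empty so scrapyd does
# not override the project's FEED_URI setting
[scrapyd]
items_dir =
```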

Marco Ippolito

Jan 23, 2015, 12:21:58 AM
to scrapy...@googlegroups.com
Hi Daniel,
Actually, when I run

/crawlscrape/urls_listing/urls_listing$ scrapy crawl urls_grasping -o items.json -t json

a JSON file is created:
/crawlscrape/urls_listing/urls_listing$ ls -lh items.json
-rw-rw-r-- 1 marco marco 93K gen 23 06:20 items.json

So it seems that some scrapyd-deploy settings have to be fine-tuned.
Do you know anyone I could ask?

Thank you very much for your kind help.
Marco