Can't control what scrapy logs to stdout

Hartley Brody

Sep 5, 2014, 10:57:56 AM
to scrapy...@googlegroups.com
I'm running Scrapy as a cron job, so all output sent to stdout gets emailed to me at the end of the day -- currently dozens of MB worth. Most of the log lines are INFO messages that I'm trying to suppress, but I still want WARNING, ERROR, and CRITICAL printed to stdout so that those still get emailed to me.

I know about the logging settings, and am currently using:

```
LOG_LEVEL = 'WARNING'
LOG_FILE = '/path/to/scrapy.log'
LOG_STDOUT = False
```

in my `settings.py`. These settings seem to be doing the right thing for the log *file* -- only the right messages are logged there -- but I'm still seeing everything (including INFO) printed to stdout. I've also tried running the scraper with the `scrapy crawl <spider> -L WARNING` flag, but I'm still seeing INFO messages on stdout.

Is there a setting I'm missing somewhere that controls what gets sent to stdout? I don't want to pipe the output to /dev/null, since I still want WARNING and up sent to stdout, but I don't see another way to do this.

Nicolás Alejandro Ramírez Quiros

Sep 5, 2014, 3:12:04 PM
to scrapy...@googlegroups.com
Review your code again because the settings are working fine.
https://gist.github.com/nramirezuy/e75d8c041b07a8edb44f
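A minimal sketch of that kind of test (a guess at the shape of the gist, not its actual contents): a bare spider using the pre-1.0 `scrapy.log` API. With `LOG_LEVEL = 'WARNING'` in `settings.py`, the INFO call should be suppressed while the WARNING call still shows up:

```
from scrapy import log
from scrapy.spider import Spider


class DemoSpider(Spider):
    # spider name and URL are placeholders for illustration
    name = "demo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # suppressed when LOG_LEVEL = 'WARNING'
        log.msg("Parsed {0}".format(response.url), level=log.INFO)
        # still emitted when LOG_LEVEL = 'WARNING'
        log.msg("Something looks off on {0}".format(response.url), level=log.WARNING)
```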

Hartley Brody

Sep 5, 2014, 4:45:27 PM
to scrapy...@googlegroups.com
Tried the `-s` flag, but I'm still seeing INFO log lines:


```
$> scrapy crawl detail -s LOG_LEVEL=WARNING
2014-09-05 16:40:46-0400 [scrapy] INFO: Scrapy 0.24.4 started (bot: detail)
2014-09-05 16:40:46-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-09-05 16:40:46-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawler.spiders', 'LOG_LEVEL': 'WARNING', 'SPIDER_MODULES': ['crawler.spiders'], 'BOT_NAME': 'chrome_store_crawler', 'USER_AGENT': '...', 'DOWNLOAD_DELAY': 0.3}
2014-09-05 16:40:47-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-05 16:40:48-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-05 16:40:48-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-05 16:40:48-0400 [scrapy] INFO: Enabled item pipelines: CsvExporterPipeline
2014-09-05 16:40:48-0400 [detail] INFO: Spider opened

......
```

Are there any settings that could conflict with this? I'm running Scrapy v0.24.4.

Hartley Brody

Sep 5, 2014, 4:51:32 PM
to scrapy...@googlegroups.com
Here's a line where I'm writing a message to the log:

```
log.msg("Parsing sitemap: {0}".format(response.url), level=log.INFO)
```

Hartley Brody

Sep 5, 2014, 5:05:59 PM
to scrapy...@googlegroups.com
I think I've found the issue. 

I'm logging from within a spider and had called `log.start()` in the `__init__` method of my Spider, as recommended in the docs.

When I removed that line, the logging behaved as expected and the LOG_LEVEL setting was honored.

Seems like calling `log.start()` overrides the settings. I'll file a bug on the project.

You can test for this by taking the code you included above and adding `log.start()` to the `__init__` method of your spider.
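For context, the offending pattern looked roughly like this (a simplified sketch, not the exact spider code; the class name is illustrative):

```
from scrapy import log
from scrapy.spider import Spider


class DetailSpider(Spider):
    name = "detail"

    def __init__(self, *args, **kwargs):
        super(DetailSpider, self).__init__(*args, **kwargs)
        # removing this call makes LOG_LEVEL behave as expected;
        # with it, INFO lines leak back onto stdout
        log.start()

    def parse(self, response):
        log.msg("Parsing sitemap: {0}".format(response.url), level=log.INFO)
```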