Hello Scrapy users,
we released Scrapy 1.4.0 last Thursday and we hope you will like it.
It brings a bunch of bug fixes but also a handful of new features.
response.follow: the new kid in town
Check out the new response.follow shortcut method to properly build Request objects in your callbacks.
It is the new recommended way to do that. It’s shorter to write, and more correct.
So, instead of:
for href in response.css('li.page a::attr(href)').extract():
    url = response.urljoin(href)
    yield scrapy.Request(url, self.parse, encoding=response.encoding)
you can now write this:
FTP in Python 3
Scrapy finally supports FTP in Python 3, including anonymous FTP sessions.
Just make sure you are using at least Twisted 17.1.
Link extractors
Link extractors also got some love regarding leading and trailing whitespace.
Their behavior is now much closer to what your regular desktop browser does when following hyperlinks.
Oh, and we disabled the default canonicalization of URLs for extracted links.
It was causing more trouble for users than anything.
Referrer policy
Handling of the “Referer” HTTP header is now driven by a customizable Referrer Policy, as defined by the W3C.
Check out the details and security implications in the dedicated docs section.
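As a small illustration, a project that only wants to send the header to pages on the same origin could set the policy in its settings module. This is a sketch assuming the REFERRER_POLICY setting and the standard W3C policy name "same-origin"; pick whichever policy matches your own privacy needs:

```python
# settings.py (project settings module)

# Send the Referer header only for requests to the same origin
# as the page that produced them.
REFERRER_POLICY = 'same-origin'
```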
Pretty-printing your items
Scrapy 1.4 also has a new option for pretty-printing items when you export to JSON or XML.
By default, each item is still written on its own line. But you can also get a more human-readable output by setting FEED_EXPORT_INDENT to a positive number.
To get a pretty-printed JSON with an indentation of two spaces, you run:
$ scrapy crawl yourspider -o items.json -s FEED_EXPORT_INDENT=2
We recommend that all users update Scrapy to version 1.4.0.
Pip users:
$ pip install --upgrade scrapy
Conda users:
$ conda install -c conda-forge scrapy=1.4.0