Hello all,
It has been 4 months of development, 30 authors, more than 80 issues closed, 225 commits, 177 files changed, 6740 insertions and 4134 deletions.
New and old faces have been seen over the past months reporting and fixing issues, discussing, and helping get new features into shape.
Pretty amazing work, thanks to everyone who contributed in one way or another to make Scrapy 0.24 possible!
I'd like to take this opportunity to ask for help with the scrapy.org website. Its design is old (it hasn't changed much since 2008!) and we would like to give it a proper makeover, with a fresher, modern look, maybe including a snippet of simple, self-contained code that shows the power of Scrapy. Is there anyone out there who would like to become famous for designing the new scrapy.org website? :)
Check out the Release Notes, from which I would like to highlight the now simpler top-level imports and the selector shortcuts::
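
    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def parse(self, response):
            # response.xpath() is the new shortcut on the response itself
            for href in response.xpath('//a/@href').extract():
                # request each extracted link (assumes the hrefs are absolute URLs)
                yield scrapy.Request(href)
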
Last but not least, the credits:
A.J. Welch (1):
Generalize the file pipeline log messages so they are not specific to downloading images.
Alex Cepoi (2):
improvements to scrapy check/contracts
fix contracts tests
Alexander Chekunkov (5):
test for RFPDupeFilter.request_fingerprint overriding
added note about RFPDupeFilter.request_fingerprint overriding to the settings documentation
added short RFPDupeFilter.request_fingerprint interface description
DOWNLOADER setting
DOWNLOADER setting
Alexey Bezhan (6):
Clarify MapCompose documentation
Fix some typos, whitespace and small errors in docs
Add a note about reporting security issues
Bind telnet console and webservice to 127.0.0.1 by default
Fix PEP8 warnings in project template files
Fix PEP8 warnings in spider templates
Ana Sabina Uban (1):
Fixed SgmlLinkExtractor constructor to properly handle both string and list parameters (attrs, tags, deny_extensions)
Benoit Blanchon (3):
BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag
BaseSgmlLinkExtractor: Added unit test of a link with an inner tag
BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag
Breno Colom (1):
Update scrapy command line doc with additional scrapy parse options
Cameron Lane (2):
[#744] Ensure domain is not None before building regex
[#744] Test for allowed domains including NoneTypes
Capi Etheriel (4):
fixes dynamic itemclass example usage of type()
Running lucasdemarchi/codespell to fix typos in docs
Running lucasdemarchi/codespell to fix typos in SEPs
Running lucasdemarchi/codespell to fix typos in code
Carlos Rivera (1):
grammatical issue
Cash Costello (1):
Added missing word in practices.rst
Claudio Salazar (4):
Fixed XXE flaw in sitemap reader
Fixed XML selector against XXE attacks
Added test against XXE attacks for Sitemap
Added resolve_entities to kwargs in SafeXMLParser
Daniel Graña (45):
Merge 0.22.0 release notes
bump version to 0.23
fix 0.22.0 release date
Update Ubuntu installation instructions
fix apt-get line
replace warning about updating package lists by a note on package upgrade
show ubuntu setup instructions as literal code
replace unencodeable codepoints with html entities. fixes #562 and #285
Fix wrong checks on subclassing of deprecated classes. closes #581
test inspect.stack failure
localhost666 can resolve under certain circumstances
Add 0.22.1 release notes
fix a reference to unexistent engine.slots. closes #593
Add 0.22.2 release notes
try to restore pypy tests
Run testsuite with py.test
cleanup toplevel namespace
Add basic top-level shortcuts
remove .re() shortcut
update docs
update spider templates
Remove "sel" shortcut from scrapy shell$
document shortcuts in TextResponse class
Ammend example nesting selectors
Restore and deprecate "sel" shortcut
limit Twisted support to pre-14.0.0 while #718 is fixed
fix tests after changes introduced by scrapy/w3lib#21
force installation of w3lib and queuelib for trunk env
Avoid IPython warning. thanks @bryant1410. closes #623
sort spiders in "scrapy list" cmd. closes #736
Add a LevelDB cache backend
add leveldb cache backend docs
indent parsed-literal as part of ordered list
Upload sdist and wheel packages to pypi using travis-ci deploys
Add bumpversion config
Revert "limit Twisted support to pre-14.0.0 while #718 is fixed"
hold a reference to backwards compatible _contextFactory
Restore compatibility with Settings.overrides while still deprecating it
recognize jl extension as jsonlines exporter and update docs
promote LxmlLinkExtractor as default in docs
address latest comments
No need to keep extracted links as instance attribute. fixes #763
Add 0.24.0 release notes
Bump version: 0.23.0 → 0.24.0
set 0.24.0 release date
Denys Butenko (5):
Resolved issue #546. Output format parsing from filename extension.
Added back `-t` option. If `--output-format` not defined parse from extension `--output`
Fix default value.
Add import os for crawl.
Added more verbose error message for unrecognized output format. PEP8.
Edwin O Marshall (32):
Converted sep-001 to rst format
converted sep 002 to rst
- decided that removing files would cause conflicts on merge
- readded file to prevent future merge conflicts
converted sep 3 for #629
sep 4 for #629
sep 11 for #629
- sep 15 for #629
sep 6 for #629
- sep 10 for #629
- didn't like the way blockquotes rendered
- trying to separate quote context
- changing indentation so contexts are recognized
- given that it'sa block quote, quotation marks seem redundant
- removing trac file again to see if merges play well together
- removed trac file
- removed trac file
- removed trac file
- removed track file
removed trac file
removed trac file
- removed trac file
converted sep 7 for #629
sep 12 for #629
- converted sep 18
converted sep 16
converted sep 13
converted sep 5
- convertd sep 8
converted sep 9
converted sep017
sep 14 for #629
Irhine (2):
add encoding utf-8 to the first line
support i18n by using utf-8 coding template files
Julia Medina (34):
New doc: clickdata in Formrequest.from_response
New tests: clickdata's nr in Formrequest.from_response
FormRequest doc improvements
More appropriate assert in FormRequest test
Tests for loading download handlers
Fix minor typo in DownloaderHandlers comment
Doc for disabling download handler
Minor fixes in LoadTestCase in test_downloader_handlers
Trial functionality for running tests with pytest
Add py33 environment to allowed failures in travis-ci
Support doctest and __init__.py test discover in pytest
Ignore files with import errors on pytest test discover
Change function name so it does not mess up with pytest autodiscover
Fix httpcache doctest that assumed dictionary order
Ensure spiders module reload between spider manager tests
New tox env: docs
Ignore known broken links in docs linkcheck
Fix broken links in documentation
sep#19 proposed changes
New SettingsAttribute class
Settings priorities dictionary
New set and setdict method using SettingsAttribute in Settings
Deprecate CrawlerSettings, as its functionality is replicable by Settings class
Settings and SettingsAtribute tests
Fix and extend the documentation of the new Settings api
Settings topic updated
Fix settings repr on the logs of the shell and tutorial docs topics
setmodule helper method on Settings class
Update get_crawler method in utils/test.py with new Settings interface
get_project_settings now returns a Settings instance
Change command's default_settings population in cmdline.py
Change how settings are overriden in ScrapyCommands
Fix settings usage in runspider and crawl commands
Fix settings usage across tests
Mikhail Korobov (18):
fix typos in news.rst and remove (not released yet) header
Handle cases when inspect.stack() fails
testing PIL dependency is removed because there is a new mitmproxy version
TST Improved twisted installation in tox.ini for Python 3.3
reduce code duplication in test_spidermiddleware_httperror
scrapy.utils.test.docrawl function
Fix for #612 + integration-style tests for HttpErrorMiddleware
TST fix file descriptor leak and a bad variable name in get_testlog
make scrapy.version_info a tuple of integers
remove unused import
use "import scrapy" in templates
DOC use top-level shortcuts in docs
suggest scrapy.Selector in deprecation warnings
TST fix tests that became broken after adding top-level imports and switching to py.test.
fix scrapy.version_info when SCRAPY_VERSION_FROM_GIT is set
response.selector, response.xpath(), response.css() and response.re()
DOC selectors.rst cleanup
add utf8 encoding header to spider templates
Nikita Nikishin (1):
Fixed #441.
Nikolaos-Digenis Karagiannis (5):
downloaderMW doc typo (spiderMW doc copy remnant)
SpiderMW doc typo: SWP request, response
ItemLoader doc: missing args in replace_value()
document spider.closed() shortcut
Document signal "request_scheduled"
Pablo Hoffman (11):
make 'basic' the default template spider in genspider, and added info with next steps to startproject. closes #488
add SEP-021 (Add-ons) - work in progress
remove references to deprecated scrapy-developers list
rename attribute to match conventions used for XXX_DEBUG settings (in autothrottle and cookies mw)
remove no longer used setting: MAIL_DEBUG
remove unused setting: DOWNLOADER_DEBUG
signals doc: make argument order more consistent with code (although it doesn't matter in practice)
add Julia to SEP-019 authors
crate release notes for 0.24 and #699 to it
minor change to request_scheduled signal doc
doc: use |version| substitution in ubuntu packages
Paul Brown (1):
fixed typo
Paul Tremberth (18):
Disable smart strings in lxml XPath evaluations
Make lxml smart strings functionality customizable
Add testcase to check is default Selector doesnt return smart strings
Use assertTrue/False
RegexLinkExtractor: encode URL unicode value when creating Links
Offsite: add 2 stats counters
Always enable offsite stats + refactor test to initialize crawler
Fix tests for Travis-CI build
CrawSpider: support process_links as generator
Fix HtmlParserLinkExtractor and tests after #485 merge
Docs: 4-space indent for final spider example
DupeFilter: add setting for verbose logging + stats counter for filtered requests
Remove _log_level attribute as per comments
Support case-insensitive domains in url_is_from_any_domain()
Add tests for start requests, filtered and non-filtered
Check pending start_requests before calling _spider_idle() in engine (fixes #706)
Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528)
Add doc on LxmlLinkExtractor class
Rafal Jagoda (1):
add response arg to item_dropped signal handlers #710
Rendaw (1):
Elaborated request priority value.
Rolando Espinoza (8):
Ignore None's values when using the ItemLoader.
Unused re import and PEP8 minor edits.
Expose current crawler in the scrapy shell.
PEP8 minor edits.
Updated shell docs with the crawler reference and fixed the actual shell output.
Updated the tutorial crawl output with latest output.
DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead Dbm.
DOC Use pipelines module name instead of pipieline following default project files.
Rolando Espinoza La fuente (1):
Alow to disable a downloader handler just like any other component.
Ruben Vereecken (2):
Added content-type check as per issue #193
Redefined test for #193
deed02392 (1):
Update httperror.py
ncp1113 (1):
for loops have to have a : at the end of the line
nyov (2):
better call to parent class
update a link reference
stray-leone (1):
modify the version of scrapy ubuntu package
tpeng (3):
add message when raise IngoreReques; fix item_scraped document
set the exit code to non-zero when contracts fails
print spider name even it has no contract tests when -v is specified
tracicot (1):
Correct typos