Scrapy 0.24 is out and it aims to improve the scraping experience for everyone! :)


Daniel Graña

Jun 26, 2014, 6:08:21 PM
to scrapy...@googlegroups.com
Hello all,

It has been 4 months of development, 30 authors, more than 80 issues closed, 225 commits, 177 files changed, 6740 insertions and 4134 deletions.
New and old faces have been seen over the past months reporting and fixing issues, joining discussions, and helping get new features into shape.
Pretty amazing work, thanks to everyone who contributed in one way or another to making Scrapy 0.24 possible!

I'd like to take this opportunity to ask for help with the scrapy.org website. Its design is old (it hasn't changed much since 2008!) and we would like to give it a proper makeover with a fresher, modern look, maybe including a snippet of simple, self-contained code that shows the power of Scrapy. Is there anyone out there who would like to become famous for designing the new scrapy.org website? :)

Check out the Release Notes, from which I would like to highlight the now simpler top-level imports and the selector shortcuts::

import scrapy

class MySpider(scrapy.Spider):
    # ...

    def parse(self, response):
        # response.xpath() is the new shortcut on the response object
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(href)
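
A side note on the snippet above: the extracted `href` values may be relative, and as far as I know `scrapy.Request` expects an absolute URL with a scheme. The following is a minimal standard-library sketch of resolving them against the page URL before building requests. The helper name `absolutize` and the example URLs are hypothetical, and it uses Python 3's `urllib.parse` (on Python 2 the same function lives in the `urlparse` module).

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each extracted href against the URL of the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    # Hypothetical page URL and hrefs, for illustration only.
    page = "http://example.com/docs/index.html"
    for url in absolutize(page, ["intro.html", "/about", "http://other.example/x"]):
        print(url)
```

Inside a spider, the same idea would amount to joining each `href` with `response.url` before passing it to `scrapy.Request`.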


Last but not least, the credits:

A.J. Welch (1):
      Generalize the file pipeline log messages so they are not specific to downloading images.

Alex Cepoi (2):
      improvements to scrapy check/contracts
      fix contracts tests

Alexander Chekunkov (5):
      test for RFPDupeFilter.request_fingerprint overriding
      added note about RFPDupeFilter.request_fingerprint overriding to the settings documentation
      added short RFPDupeFilter.request_fingerprint interface description
      DOWNLOADER setting
      DOWNLOADER setting

Alexey Bezhan (6):
      Clarify MapCompose documentation
      Fix some typos, whitespace and small errors in docs
      Add a note about reporting security issues
      Bind telnet console and webservice to 127.0.0.1 by default
      Fix PEP8 warnings in project template files
      Fix PEP8 warnings in spider templates

Ana Sabina Uban (1):
      Fixed SgmlLinkExtractor constructor to properly handle both string and list parameters (attrs, tags, deny_extensions)

Benoit Blanchon (3):
      BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only set current_link=None when the end tag match the opening tag
      BaseSgmlLinkExtractor: Added unit test of a link with an inner tag
      BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag

Breno Colom (1):
      Update scrapy command line doc with additional scrapy parse options

Cameron Lane (2):
      [#744] Ensure domain is not None before building regex
      [#744] Test for allowed domains including NoneTypes

Capi Etheriel (4):
      fixes dynamic itemclass example usage of type()
      Running lucasdemarchi/codespell to fix typos in docs
      Running lucasdemarchi/codespell to fix typos in SEPs
      Running lucasdemarchi/codespell to fix typos in code

Carlos Rivera (1):
      grammatical issue

Cash Costello (1):
      Added missing word in practices.rst

Claudio Salazar (4):
      Fixed XXE flaw in sitemap reader
      Fixed XML selector against XXE attacks
      Added test against XXE attacks for Sitemap
      Added resolve_entities to kwargs in SafeXMLParser

Daniel Graña (45):
      Merge 0.22.0 release notes
      bump version to 0.23
      fix 0.22.0 release date
      Update Ubuntu installation instructions
      fix apt-get line
      replace warning about updating package lists by a note on package upgrade
      show ubuntu setup instructions as literal code
      replace unencodeable codepoints with html entities. fixes #562 and #285
      Fix wrong checks on subclassing of deprecated classes. closes #581
      test inspect.stack failure
      localhost666 can resolve under certain circumstances
      Add 0.22.1 release notes
      fix a reference to unexistent engine.slots. closes #593
      Add 0.22.2 release notes
      try to restore pypy tests
      Run testsuite with py.test
      cleanup toplevel namespace
      Add basic top-level shortcuts
      remove .re() shortcut
      update docs
      update spider templates
      Remove "sel" shortcut from scrapy shell$
      document shortcuts in TextResponse class
      Ammend example nesting selectors
      Restore and deprecate "sel" shortcut
      limit Twisted support to pre-14.0.0 while #718 is fixed
      fix tests after changes introduced by scrapy/w3lib#21
      force installation of w3lib and queuelib for trunk env
      Avoid IPython warning. thanks @bryant1410. closes #623
      sort spiders in "scrapy list" cmd. closes #736
      Add a LevelDB cache backend
      add leveldb cache backend docs
      indent parsed-literal as part of ordered list
      Upload sdist and wheel packages to pypi using travis-ci deploys
      Add bumpversion config
      Revert "limit Twisted support to pre-14.0.0 while #718 is fixed"
      hold a reference to backwards compatible _contextFactory
      Restore compatibility with Settings.overrides while still deprecating it
      recognize jl extension as jsonlines exporter and update docs
      promote LxmlLinkExtractor as default in docs
      address latest comments
      No need to keep extracted links as instance attribute. fixes #763
      Add 0.24.0 release notes
      Bump version: 0.23.0 → 0.24.0
      set 0.24.0 release date

Denys Butenko (5):
      Resolved issue #546. Output format parsing from filename extension.
      Added back `-t` option. If `--output-format` not defined parse from extension `--output`
      Fix default value.
      Add import os for crawl.
      Added more verbose error message for unrecognized output format. PEP8.

Edwin O Marshall (32):
      Converted sep-001 to rst format
      converted sep 002 to rst
      - decided that removing files would cause conflicts on merge
      - readded file to prevent future merge conflicts
      converted sep 3 for #629
      sep 4 for #629
      sep 11 for #629
      - sep 15 for #629
      sep 6 for #629
      - sep 10 for #629
      - didn't like the way blockquotes rendered
      - trying to separate quote context
      - changing indentation so contexts are recognized
      - given that it'sa block quote, quotation marks seem redundant
      - removing trac file again to see if merges play well together
      - removed trac file
      - removed trac file
      - removed trac file
      - removed track file
      removed trac file
      removed trac file
      - removed trac file
      converted sep 7 for #629
      sep 12 for #629
      - converted sep 18
      converted sep 16
      converted sep 13
      converted sep 5
      - convertd sep 8
      converted sep 9
      converted sep017
      sep 14 for #629

Irhine (2):
      add encoding utf-8 to the first line
      support i18n by using utf-8 coding template files

Julia Medina (34):
      New doc: clickdata in Formrequest.from_response
      New tests: clickdata's nr in Formrequest.from_response
      FormRequest doc improvements
      More appropriate assert in FormRequest test
      Tests for loading download handlers
      Fix minor typo in DownloaderHandlers comment
      Doc for disabling download handler
      Minor fixes in LoadTestCase in test_downloader_handlers
      Trial functionality for running tests with pytest
      Add py33 environment to allowed failures in travis-ci
      Support doctest and __init__.py test discover in pytest
      Ignore files with import errors on pytest test discover
      Change function name so it does not mess up with pytest autodiscover
      Fix httpcache doctest that assumed dictionary order
      Ensure spiders module reload between spider manager tests
      New tox env: docs
      Ignore known broken links in docs linkcheck
      Fix broken links in documentation
      sep#19 proposed changes
      New SettingsAttribute class
      Settings priorities dictionary
      New set and setdict method using SettingsAttribute in Settings
      Deprecate CrawlerSettings, as its functionality is replicable by Settings class
      Settings and SettingsAtribute tests
      Fix and extend the documentation of the new Settings api
      Settings topic updated
      Fix settings repr on the logs of the shell and tutorial docs topics
      setmodule helper method on Settings class
      Update get_crawler method in utils/test.py with new Settings interface
      get_project_settings now returns a Settings instance
      Change command's default_settings population in cmdline.py
      Change how settings are overriden in ScrapyCommands
      Fix settings usage in runspider and crawl commands
      Fix settings usage across tests

Mikhail Korobov (18):
      fix typos in news.rst and remove (not released yet) header
      Handle cases when inspect.stack() fails
      testing PIL dependency is removed because there is a new mitmproxy version
      TST Improved twisted installation in tox.ini for Python 3.3
      reduce code duplication in test_spidermiddleware_httperror
      scrapy.utils.test.docrawl function
      Fix for #612 + integration-style tests for HttpErrorMiddleware
      TST fix file descriptor leak and a bad variable name in get_testlog
      make scrapy.version_info a tuple of integers
      remove unused import
      use "import scrapy" in templates
      DOC use top-level shortcuts in docs
      suggest scrapy.Selector in deprecation warnings
      TST fix tests that became broken after adding top-level imports and switching to py.test.
      fix scrapy.version_info when SCRAPY_VERSION_FROM_GIT is set
      response.selector, response.xpath(), response.css() and response.re()
      DOC selectors.rst cleanup
      add utf8 encoding header to spider templates

Nikita Nikishin (1):
      Fixed #441.

Nikolaos-Digenis Karagiannis (5):
      downloaderMW doc typo (spiderMW doc copy remnant)
      SpiderMW doc typo: SWP request, response
      ItemLoader doc: missing args in replace_value()
      document spider.closed() shortcut
      Document signal "request_scheduled"

Pablo Hoffman (11):
      make 'basic' the default template spider in genspider, and added info with next steps to startproject. closes #488
      add SEP-021 (Add-ons) - work in progress
      remove references to deprecated scrapy-developers list
      rename attribute to match conventions used for XXX_DEBUG settings (in autothrottle and cookies mw)
      remove no longer used setting: MAIL_DEBUG
      remove unused setting: DOWNLOADER_DEBUG
      signals doc: make argument order more consistent with code (although it doesn't matter in practice)
      add Julia to SEP-019 authors
      crate release notes for 0.24 and #699 to it
      minor change to request_scheduled signal doc
      doc: use |version| substitution in ubuntu packages

Paul Brown (1):
      fixed typo

Paul Tremberth (18):
      Disable smart strings in lxml XPath evaluations
      Make lxml smart strings functionality customizable
      Add testcase to check is default Selector doesnt return smart strings
      Use assertTrue/False
      RegexLinkExtractor: encode URL unicode value when creating Links
      Offsite: add 2 stats counters
      Always enable offsite stats + refactor test to initialize crawler
      Fix tests for Travis-CI build
      CrawSpider: support process_links as generator
      Fix HtmlParserLinkExtractor and tests after #485 merge
      Docs: 4-space indent for final spider example
      DupeFilter: add setting for verbose logging + stats counter for filtered requests
      Remove _log_level attribute as per comments
      Support case-insensitive domains in url_is_from_any_domain()
      Add tests for start requests, filtered and non-filtered
      Check pending start_requests before calling _spider_idle() in engine (fixes #706)
      Add LxmlLinkExtractor class similar to SgmlLinkExtractor (#528)
      Add doc on LxmlLinkExtractor class

Rafal Jagoda (1):
      add response arg to item_dropped signal handlers     #710

Rendaw (1):
      Elaborated request priority value.

Rolando Espinoza (8):
      Ignore None's values when using the ItemLoader.
      Unused re import and PEP8 minor edits.
      Expose current crawler in the scrapy shell.
      PEP8 minor edits.
      Updated shell docs with the crawler reference and fixed the actual shell output.
      Updated the tutorial crawl output with latest output.
      DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead Dbm.
      DOC Use pipelines module name instead of pipieline following default project files.

Rolando Espinoza La fuente (1):
      Alow to disable a downloader handler just like any other component.

Ruben Vereecken (2):
      Added content-type check as per issue #193
      Redefined test for #193

deed02392 (1):
      Update httperror.py

ncp1113 (1):
      for loops have to have a : at the end of the line

nyov (2):
      better call to parent class
      update a link reference

stray-leone (1):
      modify the version of scrapy ubuntu package

tpeng (3):
      add message when raise IngoreReques; fix item_scraped document
      set the exit code to non-zero when contracts fails
      print spider name even it has no contract tests when -v is specified

tracicot (1):
      Correct typos

faisal anees

Jun 27, 2014, 4:23:07 AM
to scrapy...@googlegroups.com
Hi Daniel,

I would like to work on the website!! :) :D

Daniel Graña

Jun 27, 2014, 10:03:26 AM
to scrapy...@googlegroups.com
hi Faisal,

We are open to reviewing proposals in the usual way, through GitHub pull requests against the scrapy.org source code repository at https://github.com/scrapy/scrapy.github.io

thanks!
Daniel. 

Geek Gamer

Jun 28, 2014, 7:29:57 AM
to scrapy-users
Has anyone tried to install the latest Ubuntu package?
I am not able to see version 0.24 after I changed the apt line and ran "apt-get update".


--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Daniel Graña

Jun 30, 2014, 12:50:01 PM
to scrapy...@googlegroups.com
Hi Umar,

The package is now available in the Ubuntu repositories.

thanks
Daniel.

Geek Gamer

Jul 1, 2014, 12:13:27 AM
to scrapy-users
Thanks Daniel,

It works fine now!

David Liontooth

Jul 1, 2014, 5:21:33 AM
to scrapy...@googlegroups.com

I would like to express my appreciation to the Scrapy team and the Ubuntu packagers, and to report that the Ubuntu package scrapy-0.24 installs flawlessly on a fresh Debian Wheezy installation, provided python2.6 is uninstalled first.

I suggest that the Python dependency in the Ubuntu package be updated from "Depends: python (>= 2.5)" to require 2.7, so that it works by default on Debian stable too.

Cheers,
David

# wajig remove python2.6
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following packages will be REMOVED:
  python2.6
0 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
After this operation, 8,447 kB disk space will be freed.
Do you want to continue [Y/n]?
(Reading database ... 43275 files and directories currently installed.)
Removing python2.6 ...
Processing triggers for man-db ...

# wajig remove python2.6-minimal
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following packages will be REMOVED:
  python2.6-minimal
0 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
After this operation, 4,984 kB disk space will be freed.
Do you want to continue [Y/n]?
(Reading database ... 42755 files and directories currently installed.)
Removing python2.6-minimal ...
Unlinking and removing bytecode for runtime python2.6
Processing triggers for man-db ...

# just install scrapy-0.24
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
  libxslt1.1 python-crypto python-cssselect python-lxml python-openssl python-pam python-pkg-resources python-pyasn1 python-queuelib python-serial
  python-setuptools python-six python-twisted python-twisted-bin python-twisted-conch python-twisted-core python-twisted-lore python-twisted-mail
  python-twisted-names python-twisted-news python-twisted-runner python-twisted-web python-twisted-words python-w3lib python-zope.interface
Suggested packages:
  python-crypto-dbg python-crypto-doc python-lxml-dbg python-openssl-doc python-openssl-dbg python-distribute python-distribute-doc doc-base python-wxgtk2.8
  python-wxgtk2.6 python-wxgtk python-twisted-bin-dbg python-tk python-gtk2 python-glade2 python-qt3 python-twisted-runner-dbg
The following NEW packages will be installed:
  libxslt1.1 python-crypto python-cssselect python-lxml python-openssl python-pam python-pkg-resources python-pyasn1 python-queuelib python-serial
  python-setuptools python-six python-twisted python-twisted-bin python-twisted-conch python-twisted-core python-twisted-lore python-twisted-mail
  python-twisted-names python-twisted-news python-twisted-runner python-twisted-web python-twisted-words python-w3lib python-zope.interface scrapy-0.24
0 upgraded, 26 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/6,080 kB of archives.
After this operation, 24.5 MB of additional disk space will be used.
Do you want to continue [Y/n]?
Selecting previously unselected package libxslt1.1:amd64.
(Reading database ... 42590 files and directories currently installed.)
Unpacking libxslt1.1:amd64 (from .../libxslt1.1_1.1.26-14.1_amd64.deb) ...
Selecting previously unselected package python-crypto.
Unpacking python-crypto (from .../python-crypto_2.6-4+deb7u3_amd64.deb) ...
Selecting previously unselected package python-cssselect.
Unpacking python-cssselect (from .../python-cssselect_0.9.1-b0~0a86_all.deb) ...
Selecting previously unselected package python-lxml.
Unpacking python-lxml (from .../python-lxml_2.3.2-1+deb7u1_amd64.deb) ...
Selecting previously unselected package python-openssl.
Unpacking python-openssl (from .../python-openssl_0.13-2+deb7u1_amd64.deb) ...
Selecting previously unselected package python-pam.
Unpacking python-pam (from .../python-pam_0.4.2-13_amd64.deb) ...
Selecting previously unselected package python-pkg-resources.
Unpacking python-pkg-resources (from .../python-pkg-resources_0.6.24-1_all.deb) ...
Selecting previously unselected package python-pyasn1.
Unpacking python-pyasn1 (from .../python-pyasn1_0.1.3-1_all.deb) ...
Selecting previously unselected package python-queuelib.
Unpacking python-queuelib (from .../python-queuelib_1.1.1-r13+b0+f985c30~ab44_all.deb) ...
Selecting previously unselected package python-serial.
Unpacking python-serial (from .../python-serial_2.5-2.1_all.deb) ...
Selecting previously unselected package python-setuptools.
Unpacking python-setuptools (from .../python-setuptools_0.6.24-1_all.deb) ...
Selecting previously unselected package python-six.
Unpacking python-six (from .../python-six_1.5.2-b1~9ddb_all.deb) ...
Selecting previously unselected package python-twisted-bin.
Unpacking python-twisted-bin (from .../python-twisted-bin_12.0.0-1_amd64.deb) ...
Selecting previously unselected package python-zope.interface.
Unpacking python-zope.interface (from .../python-zope.interface_3.6.1-3_amd64.deb) ...
Selecting previously unselected package python-twisted-core.
Unpacking python-twisted-core (from .../python-twisted-core_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-web.
Unpacking python-twisted-web (from .../python-twisted-web_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-lore.
Unpacking python-twisted-lore (from .../python-twisted-lore_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-mail.
Unpacking python-twisted-mail (from .../python-twisted-mail_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-names.
Unpacking python-twisted-names (from .../python-twisted-names_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-news.
Unpacking python-twisted-news (from .../python-twisted-news_12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted-runner.
Unpacking python-twisted-runner (from .../python-twisted-runner_12.0.0-1_amd64.deb) ...
Selecting previously unselected package python-twisted-words.
Unpacking python-twisted-words (from .../python-twisted-words_12.0.0-1_all.deb) ...
Selecting previously unselected package python-w3lib.
Unpacking python-w3lib (from .../python-w3lib_1.6-r79+b3+16b5560~d0a4_all.deb) ...
Selecting previously unselected package python-twisted-conch.
Unpacking python-twisted-conch (from .../python-twisted-conch_1%3a12.0.0-1_all.deb) ...
Selecting previously unselected package python-twisted.
Unpacking python-twisted (from .../python-twisted_12.0.0-1_all.deb) ...
Selecting previously unselected package scrapy-0.24.
Unpacking scrapy-0.24 (from .../scrapy-0.24_0.24.1+1404145158_all.deb) ...
Processing triggers for man-db ...
Setting up libxslt1.1:amd64 (1.1.26-14.1) ...
Setting up python-crypto (2.6-4+deb7u3) ...
Setting up python-cssselect (0.9.1-b0~0a86) ...
Setting up python-lxml (2.3.2-1+deb7u1) ...
Setting up python-openssl (0.13-2+deb7u1) ...
Setting up python-pam (0.4.2-13) ...
Setting up python-pkg-resources (0.6.24-1) ...
Setting up python-pyasn1 (0.1.3-1) ...
Setting up python-queuelib (1.1.1-r13+b0+f985c30~ab44) ...
Setting up python-serial (2.5-2.1) ...
Setting up python-setuptools (0.6.24-1) ...
Setting up python-six (1.5.2-b1~9ddb) ...
Setting up python-twisted-bin (12.0.0-1) ...
Setting up python-zope.interface (3.6.1-3) ...
Setting up python-twisted-core (12.0.0-1) ...
Setting up python-twisted-web (12.0.0-1) ...
Setting up python-twisted-lore (12.0.0-1) ...
Setting up python-twisted-mail (12.0.0-1) ...
Setting up python-twisted-names (12.0.0-1) ...
Setting up python-twisted-news (12.0.0-1) ...
Setting up python-twisted-runner (12.0.0-1) ...
Setting up python-twisted-words (12.0.0-1) ...
Setting up python-w3lib (1.6-r79+b3+16b5560~d0a4) ...
Setting up python-twisted-conch (1:12.0.0-1) ...
Processing triggers for python-twisted-core ...
Setting up python-twisted (12.0.0-1) ...
Setting up scrapy-0.24 (0.24.1+1404145158) ...
Processing triggers for python-support ...

No errors. Compiling with Python 2.6 generates a bunch of them.

Daniel Graña

Jul 1, 2014, 1:20:15 PM
to scrapy...@googlegroups.com
hello David,

I updated debian/pyversions to skip pre-2.7 versions. Can you test it? The new Debian packages are available in the usual place.

thanks
Daniel  

Vic N

Jul 15, 2014, 3:55:28 AM
to scrapy...@googlegroups.com
Hi, does it support Python 3? Could you make a tutorial on how to deploy a Scrapy project to the OpenShift PaaS?

Tibo

Sep 19, 2014, 8:09:40 AM
to scrapy...@googlegroups.com
Hello!

Thanks for this release!

I tried to update my version. I'm on a FreeBSD server, but it seems I'm stuck with 0.16.5_1.

Is there a way to upgrade on FreeBSD?

Thanks

Ma ChienLi

Feb 19, 2015, 9:24:38 PM
to scrapy...@googlegroups.com
Hi, Daniel.
I looked into the code of spider.py and I cannot find a method "closed()" in class Spider, only a method "close()".
But the documentation says it should be "closed()". Is something wrong?
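
The thread does not record an answer to this. A likely explanation, as far as I understand Scrapy's source, is that `close()` is the hook the framework calls, and it in turn looks for an optional `closed()` method that users define on their own spider. The following is a self-contained sketch of that dispatch pattern, not Scrapy's actual code. `SketchSpider` and `MySpider` are hypothetical stand-ins.

```python
class SketchSpider(object):
    """Stand-in for a spider base class; this is not Scrapy's actual code."""

    @staticmethod
    def close(spider, reason):
        # Look up an optional ``closed`` hook on the spider and call it if present.
        closed = getattr(spider, "closed", None)
        if callable(closed):
            return closed(reason)


class MySpider(SketchSpider):
    def __init__(self):
        self.reasons = []

    def closed(self, reason):
        # User-defined shortcut: record why the spider was closed.
        self.reasons.append(reason)


spider = MySpider()
SketchSpider.close(spider, "finished")
print(spider.reasons)
```

Under this reading, the base class only needs `close()`, which is why `closed()` does not appear in it. The documented `closed()` is something a subclass may add.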

Ma ChienLi

Feb 20, 2015, 3:33:24 AM
to scrapy...@googlegroups.com
Sorry for the disturbance. I get the point now :).

On Friday, February 20, 2015 at 10:24:38 AM UTC+8, Ma ChienLi wrote:

Aman Jain

Mar 14, 2015, 4:40:29 PM
to scrapy...@googlegroups.com
Hello sir, I would like to participate in the Scrapy project in your organization.