Queries regarding adding Python 3 support for scrapy.


Anuj Bansal

Mar 17, 2015, 7:04:37 AM
to scrapy...@googlegroups.com
Hi,

I'm working towards adding Python 3 support to Scrapy. I went through many blogs and projects related to adding Python 3 support and found that Twisted is also working towards a version that is source-compatible with Python 2.6, Python 2.7, and Python 3.3 [1]. There are tools like "2to3" that read Python 2.x source code and apply a series of fixers to transform it into valid Python 3.x code, although they are more helpful for those porting to Python 3 outright rather than adding support for it.
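(For illustration, these are the kinds of mechanical rewrites 2to3's fixers perform — the Python 2 originals are shown in comments and the converted Python 3 form as live code; this is a rough sketch of the transformations, not 2to3's exact output.)

```python
# Rough sketch of typical 2to3 rewrites (not its literal output).

# Python 2:  print "crawling", url          -> fix_print
print("crawling", "http://example.com")

# Python 2:  for k in d.iterkeys(): ...     -> fix_dict
d = {"a": 1, "b": 2}
for k in d.keys():
    assert k in d

# Python 2:  text = unicode(raw, "utf-8")   -> fix_unicode
raw = b"scrapy"
text = str(raw, "utf-8")
assert text == "scrapy"
```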

Currently, I'm working on a plan for how all this should be carried out and how much time each part of Scrapy would take. I'm also reading through [2] to see what changes are required.

I also had some questions:

1. Why don't we completely port Scrapy to Python 3 rather than adding support for it? Would it be too much for a GSoC project?
It would likely result in cleaner code compared to adding support.

2. Is it recommended to use tools like 2to3 to convert the code?
The Twisted page [1] says not to use the tool, whereas various projects and the website [2] recommend its use.

It would be really helpful if you could guide me on where to start and provide some useful links as well.


Regards,
Anuj Bansal
Github - ahhda

Mikhail Korobov

Mar 17, 2015, 3:24:42 PM
to scrapy...@googlegroups.com
Hi Anuj,


On Tuesday, March 17, 2015 at 16:04:37 UTC+5, Anuj Bansal wrote:
Hi,

I'm working towards adding Python 3 support to Scrapy. I went through many blogs and projects related to adding Python 3 support and found that Twisted is also working towards a version that is source-compatible with Python 2.6, Python 2.7, and Python 3.3 [1]. There are tools like "2to3" that read Python 2.x source code and apply a series of fixers to transform it into valid Python 3.x code, although they are more helpful for those porting to Python 3 outright rather than adding support for it.

Currently, I'm working on a plan for how all this should be carried out and how much time each part of Scrapy would take. I'm also reading through [2] to see what changes are required.

I also had some questions:

1. Why don't we completely port Scrapy to Python 3 rather than adding support for it? Would it be too much for a GSoC project?
It would likely result in cleaner code compared to adding support.


Making Scrapy Python 3-only would be easier than adding Python 3 support while keeping Python 2.7 support. But there are large codebases written in Python 2.x; it is not time to drop Python 2.x support yet. Maybe we'll be able to drop 2.x support ~5 years from now, if all goes well :)


2. Is it recommended to use tools like 2to3 to convert the code?
The Twisted page [1] says not to use the tool, whereas various projects and the website [2] recommend its use.

The recommended way is to use the "six" Python module. Some parts of Scrapy are already ported to Python 3 - see e.g. https://travis-ci.org/scrapy/scrapy/jobs/54761340 - 235 tests pass on Python 3.3. To get started, try cloning Scrapy and running some tests using tox (as described in the docs). You can also check the https://github.com/scrapy/scrapy/blob/master/tests/py3-ignores.txt file - try uncommenting something and run the tests again to see what's not ported. We can't rely only on tests when porting, but they are a good start.
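(For anyone new to six: its core trick is simply version-gated aliases defined once and imported everywhere. Here is a minimal hand-rolled sketch of the pattern — the real six module covers far more:)

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    string_types = (str, unicode)      # noqa: F821 - Python 2 name
    text_type = unicode                # noqa: F821
    from urlparse import urlparse      # Python 2 module layout
else:
    string_types = (str,)
    text_type = str
    from urllib.parse import urlparse  # moved in Python 3

# Code written against the aliases runs unchanged on both versions:
def is_text(value):
    return isinstance(value, string_types)

assert is_text("http://example.com")
assert urlparse("http://example.com/path").netloc == "example.com"
```

six packages exactly this kind of shim as `six.string_types`, `six.text_type`, and `six.moves`.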

By the way, the project description may be a bit misleading. It can make you feel that the main issue is Twisted, but that is not where the existing porting efforts stopped. Currently we are stopped at porting scrapy.Request, and specifically at deciding how to represent URLs. There is an existing PR (https://github.com/scrapy/scrapy/pull/837), but I think it took a wrong path (and it seems Daniel agrees). In the PR, URLs are treated as bytes.

This is not entirely unreasonable (in the end, you get bytes from the internet, and you send the URL as bytes when doing HTTP requests, and often they must be the same bytes). The problem is that such URLs are hard to work with in Python 3.x (unwanted unicode promotion from urllib, no .format method, etc.), and that you get unicode URLs when they are extracted from HTML using Scrapy selectors. Scrapy only sends ASCII-clean URLs (they are escaped using w3lib) because this is what browsers do. There is some value in allowing binary non-escaped URLs, though (see e.g. https://github.com/scrapy/scrapy/issues/833) - maybe the "new" URL handling could have a solution for that as well.

So we're thinking of using unicode URLs in Python 3.x. This could require changes to https://github.com/scrapy/w3lib because we made it work on byte URLs (but maybe not). Also, the method w3lib uses to encode URLs to ASCII is incorrect, i.e. it doesn't match what browsers do. Browsers are crazy here - it seems I lost the demo source code, but browsers can use different encodings for different parts of a URL, something like "encode GET argument values using UTF-8, but encode /path using the web page encoding".
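(To make the pain points concrete, here is a small Python 3 sketch. The quote() call only approximates what w3lib does when escaping, and it uses UTF-8 throughout, unlike the mixed-encoding browser behaviour described above:)

```python
from urllib.parse import urlsplit, quote

# str URLs work naturally with urllib in Python 3:
parts = urlsplit("http://example.com/путь?q=café")
assert parts.path == "/путь"

# bytes URLs give you bytes components back, and bytes are clumsy:
bparts = urlsplit(b"http://example.com/path")
assert bparts.netloc == b"example.com"
# b"/page/{}".format(1)   # AttributeError: bytes has no .format()

# Escaping a unicode path to an ASCII-clean form, UTF-8 throughout
# (only roughly what Scrapy does via w3lib before sending a request):
ascii_path = quote("/путь", safe="/")
assert ascii_path == "/%D0%BF%D1%83%D1%82%D1%8C"
```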

This URL encoding thing is where we stopped. Without having a solid solution we can't port scrapy.Request, and without scrapy.Request most other Scrapy components don't work.
 

Anuj Bansal

Mar 18, 2015, 2:52:19 PM
to scrapy...@googlegroups.com
Sir,

I have learned the differences between Python 2 and Python 3. I have created a Google doc (https://docs.google.com/document/d/1xf7OtuyB5b6npCOLalZ-yjPZEcoKNb19iimfElyDino/edit) in which I have written the common porting errors I could find after going through various blogs and projects, and their corresponding syntax corrections. You can add your suggestions, or anything I have missed, by going directly to the link and editing it. Do tell me if you find something wrong with the approach.

 
The recommended way is to use the "six" Python module. Some parts of Scrapy are already ported to Python 3 - see e.g. https://travis-ci.org/scrapy/scrapy/jobs/54761340 - 235 tests pass on Python 3.3. To get started, try cloning Scrapy and running some tests using tox (as described in the docs).

I got some errors while setting up Scrapy and found that I had to install libssl-dev, libffi-dev, python-dev, and libxml2-dev, as mentioned at http://stackoverflow.com/questions/17611324/error-when-installing-scrapy-on-ubuntu-13-04.
Shouldn't these be added to the Scrapy requirements? Should I create an issue about this? I'm currently working on Ubuntu 14.04.
 
You can also check the https://github.com/scrapy/scrapy/blob/master/tests/py3-ignores.txt file - try uncommenting something and run the tests again to see what's not ported. We can't rely only on tests when porting, but they are a good start.

This is great! It would really help me in planning my strategy.
 
This URL encoding thing is where we stopped. Without having a solid solution we can't port scrapy.Request, and without scrapy.Request most other Scrapy components don't work.
 
Handling binary data is the trickiest issue people face in supporting both Python 2 and Python 3. So the first thing to do would be to find the best solution for URL encoding; only then will we be able to port the other Scrapy components.
So I should first take a look at the w3lib project.


"My recommendation for the development workflow if you want to support Python 3 without using 2to3 is to run 2to3 on the code once and then fix it up until it works on Python 3. Only then introduce Python 2 support into the Python 3 code, using six where needed. Add support for Python 2.7 first, and then Python 2.6. Doing it this way can sometimes result in a very quick and painless process."

Is this the recommended method?

Mikhail Korobov

Mar 18, 2015, 3:43:46 PM
to scrapy...@googlegroups.com
Hi,

On Wednesday, March 18, 2015 at 23:52:19 UTC+5, Anuj Bansal wrote:
Sir,

I have learned the differences between Python 2 and Python 3. I have created a Google doc (https://docs.google.com/document/d/1xf7OtuyB5b6npCOLalZ-yjPZEcoKNb19iimfElyDino/edit) in which I have written the common porting errors I could find after going through various blogs and projects, and their corresponding syntax corrections. You can add your suggestions, or anything I have missed, by going directly to the link and editing it. Do tell me if you find something wrong with the approach.
 
The recommended way is to use the "six" Python module. Some parts of Scrapy are already ported to Python 3 - see e.g. https://travis-ci.org/scrapy/scrapy/jobs/54761340 - 235 tests pass on Python 3.3. To get started, try cloning Scrapy and running some tests using tox (as described in the docs).

I got some errors while setting up Scrapy and found that I had to install libssl-dev, libffi-dev, python-dev, and libxml2-dev, as mentioned at http://stackoverflow.com/questions/17611324/error-when-installing-scrapy-on-ubuntu-13-04.
Shouldn't these be added to the Scrapy requirements? Should I create an issue about this? I'm currently working on Ubuntu 14.04.

Scrapy's requirements.txt lists Python packages (not system packages). There are some install notes here: http://doc.scrapy.org/en/latest/intro/install.html
libffi-dev is a dependency of pyOpenSSL; libxml2-dev is a dependency of lxml. I'm not sure - maybe we can document all of this, though it would be documenting the requirements of our requirements.
 
 
You can also check the https://github.com/scrapy/scrapy/blob/master/tests/py3-ignores.txt file - try uncommenting something and run the tests again to see what's not ported. We can't rely only on tests when porting, but they are a good start.

This is great! It would really help me in planning my strategy.
 
This URL encoding thing is where we stopped. Without having a solid solution we can't port scrapy.Request, and without scrapy.Request most other Scrapy components don't work.
 
Handling binary data is the trickiest issue people face in supporting both Python 2 and Python 3. So the first thing to do would be to find the best solution for URL encoding; only then will we be able to port the other Scrapy components.
So I should first take a look at the w3lib project.


"My recommendation for the development workflow if you want to support Python 3 without using 2to3 is to run 2to3 on the code once and then fix it up until it works on Python 3. Only then introduce Python 2 support into the Python 3 code, using six where needed. Add support for Python 2.7 first, and then Python 2.6. Doing it this way can sometimes result in a very quick and painless process."

Is this the recommended method?

Usually I just start with the existing code and add Python 3 support to it using the "six" package and common sense :) The method from the book sounds OK, but you need to be very careful not to break the existing Python 2.x code. __future__ imports can also be helpful (2to3 doesn't add them). We don't need Python 2.6 support.
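(For reference, a sketch of the usual module preamble; on Python 3 these imports are no-ops, while on Python 2.7 they switch on the corresponding Python 3 behaviour:)

```python
# These must precede any other statements in the module (2to3 does not
# add them):
from __future__ import absolute_import, division, print_function, unicode_literals

# division: "/" is true division on both 2.7 and 3.x
assert 1 / 2 == 0.5

# unicode_literals: bare string literals are text (unicode) on both versions
assert isinstance("scrapy", type(u""))

# print_function: print is a real function, so it can be passed around
log = print
log("porting", "scrapy")
```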
 

Anuj Bansal

Mar 26, 2015, 8:16:22 AM
to scrapy...@googlegroups.com
Sir,

My exams finished just yesterday, so I can finally get back to working on Scrapy. I have submitted my GSoC proposal. I know I'm late, but I will surely make up the lost time.
I have created a blog where I will be posting about my work with Scrapy (http://ahhda.blogspot.in/).
The proposal, however, requires a link to a contribution, which I don't have as I was busy with college. I have, though, contributed to sympy (https://github.com/sympy/sympy/pull/9121), and I have given that link in the proposal. I hope this is acceptable.

I have also created a copy at https://docs.google.com/document/d/1FUg1fhdIWS5HLh8zjbPTpR6kXwsG60pRdJ6QsF4m3u0/edit. Do tell me if you find something missing or wrong in the proposal.

The results will be announced on April 27th. Until then I will continue to work on Scrapy and fix some bugs.

Looking forward to a great summer :)

Regards,
Anuj

Mikhail Korobov

Mar 26, 2015, 11:20:29 AM
to scrapy...@googlegroups.com
Hi Anuj,

I understand your situation - exams can be very stressful - but unfortunately a contribution to Scrapy or a related project (e.g. w3lib) is a hard requirement. It is the best way for us to understand how well we can work together with a student, and the best way for a student to understand whether they like working with us. We can't accept a proposal without this information.

On Thursday, March 26, 2015 at 17:16:22 UTC+5, Anuj Bansal wrote: