[Imdbpy-devel] IMDbPY revamp

Davide Alberani

Nov 1, 2017, 10:03:08 AM
to IMDbPY development, imdbp...@lists.sourceforge.net
Hi all,
as many of you know, IMDbPY is in need of a revamp. :-)

So, while I (again and again) have very little time to devote to it, I
try to slowly improve it.

Right now I've created a "codename-simplify" branch, with the intent of
reducing the amount of legacy code and some of the oddities of my
previous choices:
https://github.com/alberanid/imdbpy/tree/codename-simplify
See also issue https://github.com/alberanid/imdbpy/issues/61

My plan is more or less as follows:
* remove the "mobile" parser (done)
* remove SQLObject support (done)
* remove cutils, the utilities written in C (done; I'm not sure they
won't be useful again in the future)
* introduce support for the new data set (to be done:
https://github.com/alberanid/imdbpy/issues/60 )
* move to Python 3 (to be done: https://github.com/alberanid/imdbpy/issues/27 )

Another possible point is:
* remove the BeautifulSoup dependency (python-lxml will be required)

but on this I'm waiting for the opinion of Turgut, the main author of that code.

The rationale is to remove unneeded dependencies (like the old SQLObject).
For the moment I've set lxml as a mandatory dependency, but I can
revert it to an optional one.
It has to be said that _bsoup is shipped with our package, so maybe we
can leave it there.

After this little clean-up, I'd like to work, in this order, on:
1. the switch to Python 3
2. the new dataset, using SQLAlchemy (unless there are strong opinions
and helping hands to switch to a no-SQL db)

If you have other ideas and/or if you want to help, let us know. :-)


--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x3845A3D4AC9B61AD]
http://www.mimante.net/

H. Turgut Uyar

Nov 1, 2017, 12:49:15 PM
to imdbpy...@lists.sourceforge.net
Hi,

That's wonderful news! I'd be happy to help.

I think removing BeautifulSoup support and going with lxml is a good
idea. We did want IMDbPY to be pure Python and self-contained but not if
that's making the move to Python 3 more difficult. And once we have the
code simplified, I can try to incorporate my Piculet module to support
the pure Python option again.

In short, how can I help? I think it's better if you finish the cleanup
first before I start fiddling. But after that I can help with the Python
3 porting.

Cheers,

Turgut

Davide Alberani

Nov 2, 2017, 9:00:35 AM
to H. Turgut Uyar, IMDbPY development
On Wed, Nov 1, 2017 at 5:24 PM, H. Turgut Uyar <uy...@tekir.org> wrote:
>
> In short, how can I help? I think it's better if you finish the cleanup
> first before I start fiddling. But after that I can help with the Python
> 3 porting.

Great, thanks!

I plan to do some work to introduce Python 3.x compatibility; after
that we'll try to understand whether keeping _bsoup is feasible or
makes everything more complicated.

Another point that I'd like to add to the revamp:
* switch to python-requests for queries, so that we can have real
sessions (and support logged-in users)


I'll keep you updated!

--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x3845A3D4AC9B61AD]
http://www.mimante.net/

H. Turgut Uyar

Nov 2, 2017, 12:51:18 PM
to Davide Alberani, IMDbPY development
OK, let me know when there's something you would like to delegate.

When I looked at the code with the intention of porting, one of the
difficult issues I saw was the mixed use of strings and unicode objects.
Probably adding a future import for unicode_literals and removing the
u'' literals would be a good starting point for it. A similar thing
could be done with respect to the print function. Out of curiosity, do
you plan to use 2to3 for this, or do you plan to do it manually?
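
A minimal sketch of that starting point, assuming a hypothetical module
(the names describe and show are illustrative, not IMDbPY code). With the
two future imports, the same source should behave the same way on
Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io


def describe(title):
    """Return a text (unicode) label for a movie title (hypothetical helper)."""
    # With unicode_literals this literal is text on Python 2 as well,
    # so the u'' prefix is no longer needed.
    return "Title: " + title


def show(title, stream):
    """Write the label to a text stream (hypothetical helper)."""
    # With print_function, print() is a function on Python 2 too.
    print(describe(title), file=stream)


buffer = io.StringIO()
show("Amélie", buffer)
```

The imports only change how the module itself is compiled, so they can be
added file by file before any 2to3 run.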


--
Turgut


Davide Alberani

Nov 2, 2017, 3:46:54 PM
to H. Turgut Uyar, IMDbPY development
On Thu, Nov 2, 2017 at 5:51 PM, H. Turgut Uyar <uy...@tekir.org> wrote:
>
> A similar thing could be done with respect to the print function. Out of curiosity, do
> you plan to use 2to3 for this, or do you plan to do it manually?

Both: first round with 2to3, then some manual fixes.

I'm also comparing the results with the changes provided by others in
https://github.com/alberanid/imdbpy/pull/45 and
https://github.com/alberanid/imdbpy/pull/39

The first tests show that there's hope. ;-)


--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x3845A3D4AC9B61AD]
http://www.mimante.net/

Davide Alberani

Nov 5, 2017, 9:52:04 AM
to H. Turgut Uyar, IMDbPY development
Hi everyone,

I've completed a first round of changes in the
https://github.com/alberanid/imdbpy/tree/codename-simplify branch.

Right now:
- Python 3 is supported, for the http parser
- I've simplified the setup.py to always require lxml and only support
SQLAlchemy

What can be done:

1. I've not yet removed bsoup support, and I'm still undecided about it.
To test it, one can just uninstall lxml after installation.
I assume it's broken, since I haven't fixed anything there beyond
what 2to3 has done.

If it's simple to fix, I guess we can keep it as a fallback;
otherwise I have no problem introducing the lxml dependency.

2. tests, tests, tests.
I've just done some manual tests, and most of the base features seem ok.
If anyone finds a problem, please notify us (and/or provide a patch ;-))

3. SQL parser support for Python 3.
I'll work on this in the next weeks.

4. later, I want to see whether, using "from future import ...", it's
possible to reintroduce support for Python 2.7

H. Turgut Uyar

Nov 5, 2017, 1:16:47 PM
to Davide Alberani, IMDbPY development
Hi,

Great work, thanks! What is the minimal supported Python 3 version?

I would rather have bsoup removed at the moment and maybe added back
later. Currently the bsoup and lxml parsers require different
preprocessors because their parsers come up with different DOM trees.
When I refactored the extractors-attributes in IMDbPY into a separate
package (piculet) I went another way: first I try to normalize the HTML
code so that parsers will parse it the same way, then I apply the
extraction rules. In my tests, piculet worked all right on the IMDb
markup. Its syntax is very close to the extractors-attributes syntax in
IMDbPY, it supports both Python 2 and Python 3, and it's also cleaner and
more powerful in what it can express. I can try out incorporating
piculet into IMDbPY on another branch and we'll see if that route is
worth pursuing. Piculet will use elementtree, or lxml if available. A
possible downside is that it might be slower due to the HTML
normalization step at the beginning.
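
To illustrate the idea (this is not piculet's actual API), here is a rough
sketch of the "lxml if available, otherwise ElementTree" fallback combined
with path-based extraction rules. The RULES mapping and the extract
function are hypothetical names, and the sketch assumes the markup has
already been normalized into well-formed XML:

```python
try:
    # Prefer lxml when it is installed.
    from lxml import etree as ET
except ImportError:
    # Fall back to the pure Python standard library parser.
    import xml.etree.ElementTree as ET

# Hypothetical extraction rules: result key -> element path.
RULES = {
    "title": ".//h1",
    "year": ".//span[@class='year']",
}


def extract(markup, rules):
    """Apply each rule to the parsed tree and collect the text found."""
    root = ET.fromstring(markup)
    result = {}
    for key, path in rules.items():
        node = root.find(path)
        if node is not None and node.text:
            result[key] = node.text.strip()
    return result


page = "<html><body><h1> The Matrix </h1><span class='year'>1999</span></body></html>"
data = extract(page, RULES)
```

Both libraries accept this limited path syntax in find(), so the rules
themselves do not need to know which parser built the tree.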

Regarding tests, I have some work left over from my earlier attempts at
porting IMDbPY to Python 3. I will send them as a pull request in a few
days. Or I can make them a separate repository like imdbpy-testsuite.

Turgut

H. Turgut Uyar

Nov 6, 2017, 7:26:33 AM
to imdbpy...@lists.sourceforge.net
Hello again,

I've created a new repository which contains some tests I had written
for the HTTP movie combined page parser. Most of the 70+ tests pass for
Python 3.3 to 3.6 with and without lxml installed. Pretty good start.

https://github.com/uyar/imdbpy-tests

To run, just type "tox". This assumes that you have the python3.3,
python3.4, python3.5 and python3.6 executables in your path. If you
want, you can test only one environment by using it like "tox -e py35".

When it downloads a page from IMDb, it will cache it in the directory
tests/.cache and on subsequent runs it will not download the page again.
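
The caching behaviour can be sketched roughly as follows. This is not the
code from the repository: fetch_cached, the download callable and the
example URL are all hypothetical, and a stub stands in for the real
network request:

```python
import hashlib
import os
import tempfile


def fetch_cached(url, cache_dir, download):
    """Return the page for url, downloading it only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so that any URL maps to a safe file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    content = download(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return content


calls = []


def fake_download(url):
    """Stub downloader that records each call instead of using the network."""
    calls.append(url)
    return "<html>stub page</html>"


cache_dir = os.path.join(tempfile.mkdtemp(), ".cache")
first = fetch_cached("http://example.invalid/page", cache_dir, fake_download)
second = fetch_cached("http://example.invalid/page", cache_dir, fake_download)
```

Deleting the cache directory forces a fresh download on the next run.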

More tests are definitely welcome.

Bye,

--
Turgut

Davide Alberani

Nov 6, 2017, 7:56:55 AM
to H. Turgut Uyar, IMDbPY development
On Sun, Nov 5, 2017 at 7:16 PM, H. Turgut Uyar <uy...@tekir.org> wrote:
>
> Great work, thanks! What is the minimal supported Python 3 version?

I haven't thought too much about it; I guess 3.3 / 3.4 is ok.

> I would rather have bsoup removed at the moment and maybe added back
> later.

Ok. I'll proceed this way in the next few days.

I also totally agree that including piculet, later, would be cool!

> Regarding tests, I have some work left over from my earlier attempts at
> porting IMDbPY to Python 3.

Seen them, great work!

I'm ok with including them into IMDbPY itself: it may motivate others
to write more tests. :-)


Thanks!

Davide Alberani

Nov 6, 2017, 2:34:03 PM
to H. Turgut Uyar, IMDbPY development
On Mon, Nov 6, 2017 at 1:56 PM, Davide Alberani
<davide....@gmail.com> wrote:
>
> I'm ok with including them into IMDbPY itself: it may motivate others
> to write more tests. :-)

I've merged your test suite and your PEP8 compliance changes into the
codename-simplify branch.

If I manage to finish the SQL access and the documentation, I hope to
merge back into master this weekend (or the next one).

Davide Alberani

Nov 11, 2017, 11:45:03 AM
to IMDbPY development, imdbp...@lists.sourceforge.net
On Wed, Nov 1, 2017 at 3:02 PM, Davide Alberani
<davide....@gmail.com> wrote:
>
> as many of you know, IMDbPY is in need of a revamp. :-)

A quick update: I've just merged the many changes of the
"codename-simplify" branch back into master (the branch should now be
considered closed; I'll delete it soon).

The old version, suitable for Python 2.7, is available in the
"imdbpy-legacy" branch, and will probably receive very few updates
from now on.

Main changes:
- Python 3 support (and only Python 3: no Python 2.7 compatibility, sorry)
- removed the 'mobile' set of parsers
- removed dependencies: SQLObject, C compiler, BeautifulSoup
- introduced a testsuite, please help with it:
https://sourceforge.net/p/imdbpy/mailman/message/36107729/

I want to thank all the contributors, and especially H. Turgut Uyar
for such a huge amount of work!

I hope to be able to update the website and pypi tomorrow.
There are surely many bugs; please help by reporting them.