[Imdbpy-devel] IMDb redesign: call for help

Davide Alberani

Sep 19, 2010, 4:53:01 AM
to IMDbPY development, IMDbPY support
Hello,
you've probably noticed IMDb's latest redesign.
So far it involves mostly the main pages for movies and persons,
but I assume other pages will change in the near future.

A temporary solution is in place, and the IMDbPYweb account points
to the old version of the movies' page - but this won't last: the only
solution is to be up-to-date with their changes.
Moreover, the main page for persons is already broken.

Now, the problem: I don't really have the time to do it; sure, in
the coming weeks I can try to fix the main parsers, but there are
too many other parsers (most of them will not be that hard to fix,
but each will require a little time).
The same applies to the 'mobile' parsers.

So... is anyone out there willing to help and be in charge of
one or more parsers?

They can be found in the imdbpy/imdb/parser/http directory (don't
be scared by the main ones: most of them are short and simple).
The 'http' parsers, mostly developed by H. Turgut Uyar and me, are
quite powerful and should not be too difficult to understand
(see the DOMParserBase class in imdbpy/imdb/parser/http/utils.py),
and I'd gladly answer any question about how they work (they
are based on DOM access via XPath).
The 'mobile' parsers are more "classical", being based on
regexp and string manipulation.
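
To make the difference between the two styles concrete, here is a minimal
sketch that extracts the same cast list both ways. It is not the actual
IMDbPY code and does not use DOMParserBase: the page fragment, the function
names and the markup are all made up for illustration, and it relies on the
limited XPath subset of the standard library's xml.etree.ElementTree rather
than on the real parsers' machinery.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a movie page;
# real IMDb markup is different and far messier.
PAGE = """<html><body>
<h1 class="header">Example Movie <span>(2010)</span></h1>
<div id="cast"><a href="/name/nm0000001/">First Actor</a>
<a href="/name/nm0000002/">Second Actor</a></div>
</body></html>"""


def cast_via_xpath(page):
    """DOM style: build a tree, then select nodes with a path expression."""
    root = ET.fromstring(page)
    links = root.findall(".//div[@id='cast']/a")
    return [(a.get("href"), a.text) for a in links]


def cast_via_regexp(page):
    """'Classical' style: pattern-match directly on the raw string."""
    return re.findall(r'<a href="(/name/nm\d+/)">([^<]+)</a>', page)


if __name__ == "__main__":
    print(cast_via_xpath(PAGE))
    print(cast_via_regexp(PAGE))
```

Both functions return the same list of (link, name) pairs here. The path
expression only depends on where the links sit in the tree, while the regexp
depends on the exact text of each tag, so the two styles tend to break in
different ways when a page is redesigned.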


More info about the redesign (you need to be registered):
http://akas.imdb.com/board/bd0000040/nest/170012668?p=1
http://akas.imdb.com/board/bd0000040/nest/169469482?p=1


--
Davide Alberani <davide....@gmail.com> [GPG KeyID: 0x465BFD47]
http://www.mimante.net/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel

Davide Alberani

Sep 25, 2010, 10:55:19 AM
to IMDbPY development, IMDbPY support
On Sep 19, Davide Alberani <davide....@gmail.com> wrote:

> So... is anyone out there willing to help and be in charge of
> one or more parsers?

I forgot to mention how I arranged the development of the new parsers: the
old account (automatically used by IMDbPY) was changed to use the old
set of web pages (mostly: the ones about people still need to be fixed), so
it can't be used to develop the new parsers.

I've therefore created a new fork of IMDbPY on bitbucket, which uses a new account
set to refer to the new web pages; this repository can be cloned from here:
http://bitbucket.org/alberanid/imdbpy_parsers2010/

Once you have cloned this repository, you can install this version on your
system (or in a virtualenv) and modify it to fix the parsers.

You can test each page as you wish; there's also a more comprehensive (well,
more or less...) set of tests, specifically in the http-mobile directory of
http://bitbucket.org/alberanid/imdbpy-testsuite

The steps:
- download from http://erlug.linux.it/~da/erlugtmp/imdbpy_p.tar.gz a more-or-less
correct set of .p files (dumps of IMDbPY objects taken when the parsers were in
a good state) and untar it in the http-mobile directory.
- fetch the new .html web pages with ./test_parser.py -f
- run the tests with ./test_parser.py -t 2>&1 | less
- spot a problem (missing information or something like that), change the
parsers and re-run the tests until the problem is fixed. :-)

In the 'standalone/' directory there is a separate test for each file (the
ones labeled *lxml* are faster than the *bsoup* ones).

Keep in mind that it's normal to see errors about things like changes
in the number of votes, or new cast/company information; what really
matters is that the parser - from one run to the other - doesn't lose complete
sets of information (and that no crap ends up in the strings, movie titles and
so on). If a key is completely missing, the test_parser.py script will report
it in the lists of keys that are only in the expected or only in the received information.
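
As a rough illustration of that kind of check, here is a minimal sketch of
comparing an expected set of data against what a parser returns. This is not
the code of test_parser.py: the function names and the sample dictionaries
are invented, and plain dicts are assumed in place of the pickled IMDbPY
objects stored in the .p files.

```python
def compare_keys(expected, received):
    """Return (only_in_expected, only_in_received) as sorted key lists."""
    only_expected = sorted(set(expected) - set(received))
    only_received = sorted(set(received) - set(expected))
    return only_expected, only_received


def changed_values(expected, received):
    """Return the sorted keys present on both sides whose values differ."""
    common = set(expected) & set(received)
    return sorted(k for k in common if expected[k] != received[k])


# Hypothetical data: 'expected' plays the role of an old dump,
# 'received' the output of a parser run against the redesigned page.
expected = {"title": "Example Movie", "votes": 1200, "cast": ["First Actor"]}
received = {"title": "Example Movie", "votes": 1350, "rating": 7.1}

if __name__ == "__main__":
    missing, extra = compare_keys(expected, received)
    print("only in expected:", missing)
    print("only in received:", extra)
    print("changed values:", changed_values(expected, received))
```

In this made-up case 'cast' shows up as only in the expected data, which is
the kind of loss that points to a broken parser, while the different 'votes'
value is the harmless drift mentioned above.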

If this was not clear enough, feel free to ask me anything!

H. Turgut Uyar

Sep 26, 2010, 10:31:40 AM
to imdbpy...@lists.sourceforge.net, imdbp...@lists.sourceforge.net

> On Sep 19, Davide Alberani <davide....@gmail.com> wrote:
>
>> So... is anyone out there willing to help and be in charge of
>> one or more parsers?
>

Hi,

I'll try to help. I have quite a lot of work these days but I'll get to
the parsers as soon as I can.

--
Turgut Uyar

Davide Alberani

Sep 27, 2010, 3:06:46 PM
to H. Turgut Uyar, imdbpy...@lists.sourceforge.net, imdbp...@lists.sourceforge.net
On Sun, Sep 26, 2010 at 4:31 PM, H. Turgut Uyar <uy...@itu.edu.tr> wrote:
>

> I'll try to help. I have quite lot of work these days but I'll get to
> the parsers as soon as I can.

As usual, thank you! :-)

I hope to have time this week to check at least the main problems with the
people pages.


--
Davide Alberani <davide....@gmail.com>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
