[Imdbpy-help] imdbpy to mysql help

123 views
Skip to first unread message

D L

unread,
Feb 15, 2013, 9:01:24 PM2/15/13
to imdbp...@lists.sourceforge.net
So I began importing the data into a mysql database and everything seems to be going fine but this last step is taking abnormally long:

# TIME completeCast() : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
# TIME fushing caches... : 0min, 2sec (wall) 0min, 1sec (user) 0min, 0sec (system)
# TIME TOTAL TIME TO INSERT/WRITE DATA : 58min, 11sec (wall) 23min, 50sec (user) 0min, 7sec (system)
building database indexes (this may take a while)
# TIME createIndexes() : 30min, 59sec (wall) 0min, 0sec (user) 0min, 0sec (system)
adding foreign keys (this may take a while)

The adding foreign keys bit has been taking roughly 8 hours. Should I restart the whole process or wait it out?

I also have a few other questions that I could not find in the documentation:

Once you have the database set up, is there a simple way to update it with the imdb text files they routinely release, or would you have to rerun the script with the new files?

If I'm making a webapp which could potentially receive a lot of requests, it'd be optimal to fetch the requests from the local database instead of through the web requests am I correct? Since the web requests scrapes the imdb pages and imdb frowns on that?

Thanks for the help!
DL

Davide Alberani

unread,
Feb 16, 2013, 9:35:43 AM2/16/13
to D L, imdbp...@lists.sourceforge.net
On Sat, Feb 16, 2013 at 3:01 AM, D L <dlm...@hotmail.com> wrote:
>
> The adding foreign keys bit has been taking roughly 8 hours. Should I
> restart the whole process or wait it out?

Seems really slow.
Is the db actually doing something?

Anyway, the creation of indexes/foreign keys and the store/restore
of imdbIDs at db updates seems to be slightly broken.
Any help fixing it is welcome.

> Once you have the database set up, is there a simple way to update it with
> the imdb text files they routinely release, or would you have to rerun the
> script with the new files?

Re-run the script with the new dataset. No other way.
imdbIDs are (hopefully... se above) preserved between runs.

> If I'm making a webapp which could potentially receive a lot of requests,
> it'd be optimal to fetch the requests from the local database instead of
> through the web requests am I correct?

You're correct.

> Since the web requests scrapes the imdb pages and imdb frowns on that?

Main point, it's much slower.

Anyway, no matter what data you access (local or remote), I'm pretty sure
that the lIMDb license forbid you to use it for anything that is not
personal *and*
non-commercial.
I.e.: no money (not even *saved* money) from it, and no public redistribution
of the data.

HTH.

--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
The Go Parallel Website, sponsored by Intel - in partnership with Geeknet,
is your hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials, tech docs,
whitepapers, evaluation guides, and opinion stories. Check out the most
recent posts - join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Imdbpy-help mailing list
Imdbp...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help

D L

unread,
Feb 16, 2013, 3:21:00 PM2/16/13
to davide....@gmail.com, imdbp...@lists.sourceforge.net


> Date: Sat, 16 Feb 2013 15:35:43 +0100
> Subject: Re: [Imdbpy-help] imdbpy to mysql help
> From: davide....@gmail.com
> To: dlm...@hotmail.com
> CC: imdbp...@lists.sourceforge.net

>
> On Sat, Feb 16, 2013 at 3:01 AM, D L <dlm...@hotmail.com> wrote:
> >
> > The adding foreign keys bit has been taking roughly 8 hours. Should I
> > restart the whole process or wait it out?
>
> Seems really slow.
> Is the db actually doing something?
>
> Anyway, the creation of indexes/foreign keys and the store/restore
> of imdbIDs at db updates seems to be slightly broken.
> Any help fixing it is welcome.

Yeah the adding foreign keys is still going, and when I run top it seems like mysql is actually doing stuff with the CPU. I'm most likely going to rerun it and hope it works again in a timely fashion.


> > Once you have the database set up, is there a simple way to update it with
> > the imdb text files they routinely release, or would you have to rerun the
> > script with the new files?
>
> Re-run the script with the new dataset. No other way.
> imdbIDs are (hopefully... se above) preserved between runs.

Alright, will the script pass through the ones already in the database and be faster, or would it require the same amount of time?


> > If I'm making a webapp which could potentially receive a lot of requests,
> > it'd be optimal to fetch the requests from the local database instead of
> > through the web requests am I correct?
>
> You're correct.
>
> > Since the web requests scrapes the imdb pages and imdb frowns on that?
>
> Main point, it's much slower.
>
> Anyway, no matter what data you access (local or remote), I'm pretty sure
> that the lIMDb license forbid you to use it for anything that is not
> personal *and*
> non-commercial.
> I.e.: no money (not even *saved* money) from it, and no public redistribution
> of the data.
>
> HTH.

Do most apps that have been made with imdbpy use the local or web access for data? Also what defines public redistribution, what I had in mind was something along the lines of having someone input a request for say an actor, and I have a script that spits back out a bunch of data/graphs using the imdb info, would that be allowed? I wouldn't be making any money off of it, it'd just be a webapp tool. 

Once again thank you for the quick response. 

Davide Alberani

unread,
Feb 17, 2013, 3:57:50 AM2/17/13
to D L, imdbp...@lists.sourceforge.net
On Sat, Feb 16, 2013 at 9:21 PM, D L <dlm...@hotmail.com> wrote:
>
> Yeah the adding foreign keys is still going, and when I run top it seems
> like mysql is actually doing stuff with the CPU. I'm most likely going to
> rerun it and hope it works again in a timely fashion.

Doesn't looks good. :-(
Can you try again using the SQLAlchemy ORM?
Basically, you have to install it (if not already present on your system),
and add to the command line of imdbpy2sql.py: -o sqlalchemy

>> Re-run the script with the new dataset. No other way.
>> imdbIDs are (hopefully... se above) preserved between runs.
>
> Alright, will the script pass through the ones already in the database and
> be faster, or would it require the same amount of time?

The same time, sorry. No way to do otherwise, trust me. :-)

> Do most apps that have been made with imdbpy use the local or web access for
> data?

Hmmm.... I guess that most of the person who does some kind of
analysis/heavy use of the data, uses the SQL access.
Plugins of media centers, small scripts and so on, mostly uses
the web access.

> Also what defines public redistribution, what I had in mind was
> something along the lines of having someone input a request for say an
> actor, and I have a script that spits back out a bunch of data/graphs using
> the imdb info, would that be allowed?

I'm not a lawyer and so my opinion is worth about zero, but... :-)
I guess that if you just process the data, show the result of this processing
and so on (i.e. you do some transformation on it, not just printing it out
exactly as taken from the db), you're on the safe side.
Also, don't forget to put links to the imdb.com site and a footer which
explain the copyright of the data.


--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

D L

unread,
Feb 17, 2013, 5:45:28 PM2/17/13
to davide....@gmail.com, imdbp...@lists.sourceforge.net


> Date: Sun, 17 Feb 2013 09:57:50 +0100

> Subject: Re: [Imdbpy-help] imdbpy to mysql help
> From: davide....@gmail.com
> To: dlm...@hotmail.com
> CC: imdbp...@lists.sourceforge.net
>
> On Sat, Feb 16, 2013 at 9:21 PM, D L <dlm...@hotmail.com> wrote:
> >
> > Yeah the adding foreign keys is still going, and when I run top it seems
> > like mysql is actually doing stuff with the CPU. I'm most likely going to
> > rerun it and hope it works again in a timely fashion.
>
> Doesn't looks good. :-(
> Can you try again using the SQLAlchemy ORM?
> Basically, you have to install it (if not already present on your system),
> and add to the command line of imdbpy2sql.py: -o sqlalchemy

Yeah tried that and ran it overnight, still no luck - it gets stuck on the foreign keys part. I'm just trying this on my laptop, so I may just proceed with using the web access for the data. Once I get everything set up for a web hosting, I may try other databases such as sqlite to see if that works. 


> >> Re-run the script with the new dataset. No other way.
> >> imdbIDs are (hopefully... se above) preserved between runs.
> >
> > Alright, will the script pass through the ones already in the database and
> > be faster, or would it require the same amount of time?
>
> The same time, sorry. No way to do otherwise, trust me. :-)
>
> > Do most apps that have been made with imdbpy use the local or web access for
> > data?
>
> Hmmm.... I guess that most of the person who does some kind of
> analysis/heavy use of the data, uses the SQL access.
> Plugins of media centers, small scripts and so on, mostly uses
> the web access.
>
> > Also what defines public redistribution, what I had in mind was
> > something along the lines of having someone input a request for say an
> > actor, and I have a script that spits back out a bunch of data/graphs using
> > the imdb info, would that be allowed?
>
> I'm not a lawyer and so my opinion is worth about zero, but... :-)
> I guess that if you just process the data, show the result of this processing
> and so on (i.e. you do some transformation on it, not just printing it out
> exactly as taken from the db), you're on the safe side.
> Also, don't forget to put links to the imdb.com site and a footer which
> explain the copyright of the data.
>
Alright I guess I shouldn't have a problem with that then. 

Davide Alberani

unread,
Feb 19, 2013, 3:28:18 PM2/19/13
to D L, imdbp...@lists.sourceforge.net
On Sun, Feb 17, 2013 at 11:45 PM, D L <dlm...@hotmail.com> wrote:
>
> Yeah tried that and ran it overnight, still no luck - it gets stuck on the
> foreign keys part. I'm just trying this on my laptop, so I may just proceed
> with using the web access for the data. Once I get everything set up for a
> web hosting, I may try other databases such as sqlite to see if that works.

D'oh! :(
Versions of:
- IMDbPY
- SQLAlchemy
- SQLObject
- MySQL
- python-mysqldb
- python-migrate
?

Anyway, if you interrupt it while it's creating the foreign key, maybe
you can try to see which were already created, and add the missing
one following the scheme you can find in imdb/parser/sql/dbschema.py

Anyway, obviously I'll try to reproduce the problem, since it's not
nice at all. :-/

--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_d2d_feb

D L

unread,
Feb 22, 2013, 3:10:28 AM2/22/13
to davide....@gmail.com, imdbp...@lists.sourceforge.net
So after updating those dependencies, the MySQL still gets stuck on the foreign keys section, however sqlite actually manages to finish. But one of my concerns is that even the requests with sqlite can be slow the first time, and on occasion the web access was a lot faster than using the sqlite. For example, the search_person script is faster via the web, but if I run it twice (searching the same person) using the sql database, the 2nd time is noticeably much faster, most likely due to the data already being cached. My question is how fast does something like search_person take on MySQL (if I can eventually get it to work), since using sqlite seems like it's slower than just going the web route so far.


From: dlm...@hotmail.com
To: davide....@gmail.com
Subject: RE: [Imdbpy-help] imdbpy to mysql help
Date: Tue, 19 Feb 2013 17:54:16 -0800



> Date: Tue, 19 Feb 2013 21:28:18 +0100

> Subject: Re: [Imdbpy-help] imdbpy to mysql help
> From: davide....@gmail.com
> To: dlm...@hotmail.com
> CC: imdbp...@lists.sourceforge.net
>
> On Sun, Feb 17, 2013 at 11:45 PM, D L <dlm...@hotmail.com> wrote:
> >
> > Yeah tried that and ran it overnight, still no luck - it gets stuck on the
> > foreign keys part. I'm just trying this on my laptop, so I may just proceed
> > with using the web access for the data. Once I get everything set up for a
> > web hosting, I may try other databases such as sqlite to see if that works.
>
> D'oh! :(
> Versions of:
> - IMDbPY
> - SQLAlchemy
> - SQLObject
> - MySQL
> - python-mysqldb
> - python-migrate
> ?

IMDbPY - 5.0dev20130210
SQLAlchemy - 0.8.0b2
SQLObject - 1.3.2
MySQL - Server version: 5.5.29-0ubuntu0.12.04.1 (Ubuntu)
python-mysqldb - 1.2.3
python-migrate - 0.7.2

Both my python-mysqldb and python-migrate were older versions, which I just updated as I typed this. I tried the process with sqlite a night ago and it was stuck on the foreign keys section as well, I will try it again now that mysqldb and migrate have been updated and hopefully it will work. I also wrote a rough script for the data retrieval using the webaccess method, and you're right it does take a while.

> Anyway, if you interrupt it while it's creating the foreign key, maybe
> you can try to see which were already created, and add the missing
> one following the scheme you can find in imdb/parser/sql/dbschema.py
>
> Anyway, obviously I'll try to reproduce the problem, since it's not
> nice at all. :-/

Hopefully, the updated mysqldb and migrate would fix it, but we'll see.

D L

unread,
Feb 23, 2013, 6:32:39 PM2/23/13
to davide....@gmail.com, imdbp...@lists.sourceforge.net
Ok, well here's an update. I just let the foreign keys run for a little over a full day and it actually completed for mysql:
# TIME FINAL : 1883min, 1sec (wall) 23min, 57sec (user) 0min, 5sec (system)

One of my main questions  right now is the difference in results between the web search and the sql search. For example, if I ran a search on all the movies that Denzel Washington has acted in via the web search, it basically outputs all  the main ones, whereas if I do it via the sql search it will include a lot of random stuff like award ceremonies and random tv shows that he may have had a cameo on. How would I make the sql search more like the web search so that it excludes stuff like award ceremonies and only outputs the main movies?
And I haven't tested it that much, but it appears that sqlite and mysql have roughly the same speeds in running these queries, but I'm not completely sure yet.


Subject: RE: [Imdbpy-help] imdbpy to mysql help
Date: Fri, 22 Feb 2013 00:10:28 -0800

Davide Alberani

unread,
Feb 24, 2013, 5:52:45 AM2/24/13
to D L, imdbp...@lists.sourceforge.net
On Sun, Feb 24, 2013 at 12:32 AM, D L <dlm...@hotmail.com> wrote:
>
> Ok, well here's an update. I just let the foreign keys run for a little over
> a full day and it actually completed for mysql:
> # TIME FINAL : 1883min, 1sec (wall) 23min, 57sec (user) 0min, 5sec (system)

I see.
I've just run it with a subset of the db (1% taken from each file) and
my numbers are:
# TIME TOTAL TIME TO INSERT/WRITE DATA : 12min, 18sec (wall) 5min,
23sec (user) 0min, 43sec (system)
building database indexes (this may take a while)
# TIME createIndexes() : 1min, 25sec (wall) 0min, 0sec (user) 0min,
0sec (system)
adding foreign keys (this may take a while)
# TIME createForeignKeys() : 10min, 2sec (wall) 0min, 0sec (user)
0min, 0sec (system)
RESTORING imdbIDs values for movies... DONE! (restored 0 entries out of 0)
# TIME restore movies : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for people... DONE! (restored 0 entries out of 0)
# TIME restore people : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for characters... DONE! (restored 0 entries out of 0)
# TIME restore characters : 0min, 0sec (wall) 0min, 0sec (user) 0min,
0sec (system)
RESTORING imdbIDs values for companies... DONE! (restored 0 entries out of 0)
# TIME restore companies : 0min, 0sec (wall) 0min, 0sec (user) 0min,
0sec (system)
# TIME FINAL : 23min, 45sec (wall) 5min, 23sec (user) 0min, 43sec (system)

What kind of CPU/RAM/disk have you used?

> One of my main questions right now is the difference in results between the
> web search and the sql search. For example, if I ran a search on all the
> movies that Denzel Washington has acted in via the web search, it basically
> outputs all the main ones,

Yep, they are just grouped in a different way.
It would be not easy for us (even if it's not impossible, I guess) to identify
alle the various categories used on the web and the rules used to categorize
the movies, but...

For the moment, I think you could take the whole filmography and search
for tv series and/or movies in which an actor is playing Himself (or anything
that starts with Himself/Herself/Themselves)

> And I haven't tested it that much, but it appears that sqlite and mysql have
> roughly the same speeds in running these queries, but I'm not completely
> sure yet.

I expect them to be comparable in speed, but not to be slower than a
web search. :-/

--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

D L

unread,
Feb 25, 2013, 1:42:45 AM2/25/13
to davide....@gmail.com, imdbp...@lists.sourceforge.net


> Date: Sun, 24 Feb 2013 11:52:45 +0100

> Subject: Re: [Imdbpy-help] imdbpy to mysql help
> From: davide....@gmail.com
> To: dlm...@hotmail.com
> CC: imdbp...@lists.sourceforge.net
>
I'm doing it on a laptop that has an i3 2.53GHZ, 4GB DDR3, with about 200gb of space (not SSD). I'm planning to just try to get everything working on my laptop before I buy web hosting to put it on there.



> > One of my main questions right now is the difference in results between the
> > web search and the sql search. For example, if I ran a search on all the
> > movies that Denzel Washington has acted in via the web search, it basically
> > outputs all the main ones,
>
> Yep, they are just grouped in a different way.
> It would be not easy for us (even if it's not impossible, I guess) to identify
> alle the various categories used on the web and the rules used to categorize
> the movies, but...
>
> For the moment, I think you could take the whole filmography and search
> for tv series and/or movies in which an actor is playing Himself (or anything
> that starts with Himself/Herself/Themselves)

Yeah, but filtering that may require even more processing time..?


> > And I haven't tested it that much, but it appears that sqlite and mysql have
> > roughly the same speeds in running these queries, but I'm not completely
> > sure yet.
>
> I expect them to be comparable in speed, but not to be slower than a
> web search. :-/

They may (hopefully) be faster once I get it up on a web hosting machine instead of my laptop.

Davide Alberani

unread,
Feb 28, 2013, 2:32:35 PM2/28/13
to D L, imdbp...@lists.sourceforge.net
On Mon, Feb 25, 2013 at 7:42 AM, D L <dlm...@hotmail.com> wrote:
>
> I'm doing it on a laptop that has an i3 2.53GHZ, 4GB DDR3, with about 200gb
> of space (not SSD). I'm planning to just try to get everything working on my
> laptop before I buy web hosting to put it on there.

Should not that that long anyway. :-/

>> For the moment, I think you could take the whole filmography and search
>> for tv series and/or movies in which an actor is playing Himself (or
>> anything that starts with Himself/Herself/Themselves)
>
> Yeah, but filtering that may require even more processing time..?

A little, but should not be too much.

Let me know if you identify the source of the problem.


--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

Anshul Pandey

unread,
Apr 11, 2013, 2:24:50 PM4/11/13
to imdbp...@lists.sourceforge.net
Davide Alberani <davide.alberani@...> writes:

>

Hi

Same issue here. The adding foreign keys part has been on for last 3 hours.
Though it says,

# TIME createIndexes() : 54min, 4sec (wall) 0min, 0sec (user) 0min, 0sec (system)
adding foreign keys (this may take a while)

Did you find any fix for this? :(

Cheers
Anshul






> On Sun, Feb 17, 2013 at 11:45 PM, D L <dlman8@...> wrote:
> >
> > Yeah tried that and ran it overnight, still no luck - it gets stuck on the
> > foreign keys part. I'm just trying this on my laptop, so I may just proceed
> > with using the web access for the data. Once I get everything set up for a
> > web hosting, I may try other databases such as sqlite to see if that works.
>
> D'oh! :(
> Versions of:
> - IMDbPY
> - SQLAlchemy
> - SQLObject
> - MySQL
> - python-mysqldb
> - python-migrate
> ?
>
> Anyway, if you interrupt it while it's creating the foreign key, maybe
> you can try to see which were already created, and add the missing
> one following the scheme you can find in imdb/parser/sql/dbschema.py
>
> Anyway, obviously I'll try to reproduce the problem, since it's not
> nice at all. :-/
>
> --
> Davide Alberani <davide.alberani@...> [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_feb
>



------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter

Chris Arcand

unread,
Apr 12, 2013, 11:22:15 PM4/12/13
to imdbp...@lists.sourceforge.net
Sadly, I also have this issue, and have spent the last 4 days trying
multiple machines (OSX, Ubuntu, Debian) and flags. Was going to use this for
a databases class project but not sure if the full completion of the script
will cripple me or not. Any solution would be welcome!

Davide Alberani

unread,
Apr 14, 2013, 4:49:30 PM4/14/13
to Chris Arcand, anshu...@gmail.com, imdbp...@lists.sourceforge.net
On Sat, Apr 13, 2013 at 5:22 AM, Chris Arcand <ch...@chrisarcand.com> wrote:
>
> Sadly, I also have this issue

Unfortunately I'm unable to reproduce it, right now.

Can someone with this problem look which SQL statement
creates the problem (turning on full debug for the database
server, for example)?

You can also try to work with a directory with just a file
from the plain text data files (e.g.: movies.lists.gz) to
speed things up.
You can also reduce the content of the whole directory
using the reduce.sh script that you can find in the docs/goodies
directory.

Anyway, does someone have any hints about this problem?



--
Davide Alberani <davide....@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Reply all
Reply to author
Forward
0 new messages