Dan Stromberg <drsa...@gmail.com> writes:
> Is anyone using a module or database that gives Python 3.x access to MPAA
> ratings (EG G, PG, PG-13, etc.)?
What information would you want access to? Why would a library (rather
than, say, a short set of strings) be needed?
> I explored a few of the possibilities on Pypi, a couple of web interfaces,
> and the IMDB flat text file with ratings and reasons for those ratings, but
> I've not been really impressed yet.
You seem to be talking about some MPAA document, where is it so we can
know what specifically you're referring to?
It's available from many places, EG: http://www.filewatcher.com/m/mpaa-ratings-reasons.list.gz.203532-0.html
On 10/12/2013 08:40, Ben Finney wrote:
> Dan Stromberg <drsa...@gmail.com> writes:
>
>> Is anyone using a module or database that gives Python 3.x access
>> to MPAA ratings (EG G, PG, PG-13, etc.)?
If you are already using IMDB you should have a look at
http://imdbpy.sourceforge.net/downloads.html as well. It provides a
relatively simple Python interface to either a local or hosted IMDB
dataset and allows you to grab the MPAA rating directly from the
canonical movie name.
On Dec 10, 2013, at 6:25 AM, Dan Stromberg <drsa...@gmail.com> wrote:
> The IMDB flat text file probably came the closest, but it appears to have encoding issues; it's apparently nearly windows-1255, but not quite.
It's ISO-8859-1.
Both certificates.list.gz and mpaa-ratings-reasons.list.gz are rather straightforward to parse.
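As a rough illustration of what such a parser might look like, here is a minimal Python 3 sketch. It assumes (this is not confirmed anywhere in the thread) that each record in mpaa-ratings-reasons.list starts with an "MV: " title line followed by one or more "RE: " reason lines, and that the rating appears after the word "Rated". The names parse_mpaa and load_mpaa are made up for this example.

```python
import gzip
import re

# Assumed record layout (hypothetical, check against the real file):
#   MV: Title (Year)
#   RE: Rated PG-13 for ...
#   RE: ...continuation of the reason...
RATING_RE = re.compile(r"Rated\s+(NC-17|PG-13|PG|G|R)\b")


def _make_record(title, parts):
    """Join the reason lines and pull the rating out of them, if present."""
    reason = " ".join(parts)
    match = RATING_RE.search(reason)
    return title, (match.group(1) if match else None), reason


def parse_mpaa(lines):
    """Return a list of (title, rating, reason) tuples from text lines."""
    records = []
    title, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("MV: "):
            # A new title closes the previous record, if there was one.
            if title is not None:
                records.append(_make_record(title, parts))
            title, parts = line[4:].strip(), []
        elif line.startswith("RE: ") and title is not None:
            parts.append(line[4:].strip())
    if title is not None:
        records.append(_make_record(title, parts))
    return records


def load_mpaa(path):
    """Read the gzipped list as ISO-8859-1 text and parse it."""
    with gzip.open(path, "rt", encoding="iso-8859-1") as handle:
        return parse_mpaa(handle)
```

Lines that match neither prefix (headers, separators, blank lines) are simply skipped, so the sketch should tolerate whatever preamble the file carries, but the layout above still needs checking against the actual download.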
Michael Torrie <tor...@gmail.com> writes:
> I'm not sure whether there's actual confusion here on your part, or
> deliberate obtuseness.
Not confusion, but a desire to avoid guesses based on very vague
requirements.
On 10/12/2013 23:50, Dan Stromberg wrote:
> But I believe imdbpy is 2.7 only.
I guess it wouldn't be that difficult to run it through 2to3. Try that and see what happens?
On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:
> $ chardet mpaa-ratings-reasons.list
> mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
>
> I'm aware that chardet is playing guessing games, though one would hope
> it would guess well most of the time, and give a reasonable confidence
> rating.
What reason do you have for thinking that Windows-1255 isn't a reasonable
guess? If the bulk of the text is Latin-1 except perhaps for one or two
Hebrew characters (or what chardet thinks are Hebrew characters), it may
actually be a reasonable guess.
If it is a poor guess, perhaps you ought to report it to the chardet
maintainers as a good example of a poor guess.
By the way, this forum is a text-only newsgroup and so-called "Rich
Text" (actually HTML) posts are frowned upon because most people don't
appreciate having to read gunk like this:
> <div dir="ltr"><br><div class="gmail_extra"><div
> class="gmail_quote"> ... <br>
> <blockquote class="gmail_quote" style="margin:0px 0px 0px
> 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div
> class="im"> ... <br></div></div></div></div>
If you can, would you please turn off rich text posting when you post
here?
Thank you.
On 12/11/13 6:39 PM, Dan Stromberg wrote:
> On Wed, Dec 11, 2013 at 3:24 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:
>>> $ chardet mpaa-ratings-reasons.list
>>> mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
>>>
>>> I'm aware that chardet is playing guessing games, though one would hope
>>> it would guess well most of the time, and give a reasonable confidence
>>> rating.
>> What reason do you have for thinking that Windows-1255 isn't a reasonable
>> guess? If the bulk of the text is Latin-1 except perhaps for one or two
>> Hebrew characters (or what chardet thinks are Hebrew characters), it may
>> actually be a reasonable guess.
> I get a traceback if I try to read the file as Windows-1255. I don't
> get a traceback if I read it as ISO-8859-1.
>> If it is a poor guess, perhaps you ought to report it to the chardet
>> maintainers as a good example of a poor guess.
> I was considering that, and may do so.
>
> I've also been wondering if ISO-8859-1 is just an octet-oriented codec,
> so it'll read about anything. There are clearly non-7-bit-ASCII
> characters in the file that look like line noise in an mrxvt.
Both ISO-8859-1 and Windows-1255 are octet-oriented, so I don't see why one would raise an exception when the other didn't, unless the exception isn't on the decode but on your attempt to output the result. Can you show the full traceback you're seeing?
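
One way to narrow this down while waiting for the traceback is to ask each codec which single byte values it refuses to decode. The sketch below does that in Python 3; the helper name undecodable_bytes is made up for this example. In CPython, ISO-8859-1 assigns a character to every byte, while the cp1255 codec leaves some byte values unassigned, so a decode-time error is at least possible with the latter.

```python
def undecodable_bytes(encoding):
    """Return the byte values (0-255) that the given codec cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


if __name__ == "__main__":
    # "cp1255" is Python's codec name for Windows-1255.
    for name in ("iso-8859-1", "cp1255"):
        print(name, [hex(b) for b in undecodable_bytes(name)])
```

If the file contains any of the bytes reported for cp1255, the traceback would come from the decode itself rather than from printing the result.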