Excellent! I love to see new adapters being made. :-)
> I'm searching for a way to test it, though, to see how well it works
> and what I'm doing wrong. Not being that familiar with Python, I tried
> randomly opening couple of files to see where I could add something to
> stop the unknown site error, but couldn't really find it...
I suggest downloading the source with Hg (Tortoise Hg on Windows works
well) and using the CLI for development, if you haven't already.
Source checkout:
> http://code.google.com/p/fanficdownloader/source/checkout
As you've figured out, to add support for a new site, you need to create
a new 'adapter'. Each adapter needs to be in a Python file under
fanficdownloader/adapters with the name adapter_<sitenodots>.py.
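As a rough illustration of the naming rule above, the file name can be derived from the site's domain. The helper name adapter_filename is made up for this sketch, and dropping a leading "www." is an assumption based on the existing adapter names; none of this is part of the downloader itself.

```python
def adapter_filename(domain):
    """Return the adapter file name for a site domain (hypothetical helper)."""
    domain = domain.lower()
    # Assumption: a leading "www." is not part of the adapter name.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # <sitenodots>: the domain with its dots removed.
    return "adapter_" + domain.replace(".", "") + ".py"


print(adapter_filename("castlefans.org"))
print(adapter_filename("archiveofourown.org"))
```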
A good example to study is adapter_castlefansorg.py. You can browse it
online here:
I've tried to put comments in that one at all the important places to
change.
From what I see of archiveofourown, you'll need to pull the chapter
index page to get the chapter URLs and the first chapter to parse the
metadata.
After creating the file, you need to tell the downloader to import it by
adding an import line for it to fanficdownloader/adapters/__init__.py.
Again, browse online here:
> http://code.google.com/p/fanficdownloader/source/browse/fanficdownloader/adapters/__init__.py
> And after this I want to try adapting a few other websites as well, so
> do you have any general tips?
If any of the sites you plan to do use the eFiction package (most that
use URLs with viewstory.php), it would likely be easier to start with
one of those, since that's the most common.
Good luck!
Jim
--
Jim Miller
Retie...@gmail.com
Don't despair--I didn't know any python when I started working on this
either. :-)
> I don't know what Tortoise Hg or CLI is, and I don't know what I would
> be using it for. Do I need to make changes aside from a new adapter .py
> and adding the import line to the __init__.py?
Mercurial (or Hg) is the type of source code archive
(http://code.google.com/p/fanficdownloader/source/checkout) we're using.
TortoiseHg (http://tortoisehg.bitbucket.org/) is the Hg client I use
for Windows. It's an easy client to use, IMO.
CLI = Command Line Interface.
If you get a local copy using Hg, you'll already have everything in the
CLI, plus other stuff for the plugin and web app. But if you don't want
to mess about with Hg right off the bat, you can download the latest CLI
zip from:
http://code.google.com/p/fanficdownloader/downloads/list
You'll also need to have python 2.7 (not v3, it's not backward
compatible) installed. (http://python.org/download/)
I'll assume to start with, you're using the CLI download and have it in
a dir named fanficdownloader-4.2.1.
Open a command window in fanficdownloader-4.2.1 -- an easy way in
Windows is to hold Shift, then right click the dir name and choose 'Open
command window here'.
On Windows, Python will typically have installed itself in
c:\Python27. You should be able to run the downloader something like:
> c:\Python27\python.exe downloader.py "test1.com?sid=1"
If that works, you're ready to move on.
Put your adapter file in
fanficdownloader-4.2.1\fanficdownloader\adapters\ and edit
fanficdownloader-4.2.1\fanficdownloader\adapters\__init__.py to add an
import for it.
Then you should go ahead and try it:
> c:\Python27\python.exe downloader.py "http://ksarchive.com/viewstory.php?sid=3151"
There's a fair bit of debugging output when using the CLI, but what you
want to look for here is:
> DEBUG:__init__.py(84):trying site:www.ksarchive.com
> Unknown Site(http://ksarchive.com/viewstory.php?sid=3151). Supported sites: (.....
If it says "Unknown Site" like that, your adapter isn't loading or isn't
correctly reporting its site or something.
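To make the "Unknown Site" message a bit less mysterious: conceptually, the downloader has to map the host in your URL to an adapter that claims that host. The sketch below is not the downloader's actual code. ADAPTERS, find_adapter and the entries in the table are hypothetical, and the "also try with www." step is only a guess prompted by the "trying site:www.ksarchive.com" line in the debug output.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

# Hypothetical registry: site domain -> adapter module name.
ADAPTERS = {
    "www.castlefans.org": "adapter_castlefansorg",
    "archiveofourown.org": "adapter_archiveofourownorg",
}


def find_adapter(url):
    """Return the adapter name registered for url's host, or None."""
    host = urlparse(url).netloc.lower()
    # Try the host as given, then with "www." added (an assumption).
    for site in (host, "www." + host):
        if site in ADAPTERS:
            return ADAPTERS[site]
    return None


print(find_adapter("http://ksarchive.com/viewstory.php?sid=3151"))
```

If the site name an adapter reports matches neither form of the host, a lookup like this comes back empty, which corresponds to the "Unknown Site" case.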
> However, I did not understand what it meant by print data to see how
> the site handled adult checks.
It's pretty simple, really. Add a line to your adapter file that says
'print data' and see what it says in the debug output. :-) If it scrolls
too far off the screen, save it to file with:
> c:\Python27\python.exe downloader.py "http://ksarchive.com/viewstory.php?sid=3151" > output
Hope this helps.
That's a good sign--your file is being loaded and parsed. And the
'fanfic' path is the problem. castlefans.org uses that, but most sites
don't. Go through your adapter and remove all the 'fanfic/' parts in
URLs. You could just do a find and replace for 'fanfic/'->(nothing).
> I was able to generate an output file so that I could easily see where
> the error was. However, I was unable to insert 'print data' into my
> adapter. It gave me a message saying:
>
> print data
> NameError: name 'data' is not defined
>
> Am I supposed to put the line in a specific spot in the adapter?
After it's been set. In your adapter, look for 'data =
self._fetchUrl(url)'--that's where it's first set.
Then after that, there's 'if "Age Consent Required" in data', etc, that
look at data. Just before those is where you want to print it.
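Here is a stripped-down sketch of that placement. fetch_url is a stand-in for self._fetchUrl and returns canned text, and needs_adult_check is a made-up name; only the ordering (set data, print it, then test it) is the point. In Python 2.7 you can write 'print data' as in the adapter; print(data) works in both versions.

```python
def fetch_url(url):
    """Stand-in for self._fetchUrl(url); returns canned page text."""
    return "<html>Age Consent Required</html>"


def needs_adult_check(url):
    # Printing data up here would fail: it has not been set yet.
    data = fetch_url(url)
    # Print right after data is set and before any check that reads it.
    print(data)
    if "Age Consent Required" in data:
        return True
    return False


needs_adult_check("http://example.com/viewstory.php?sid=1")
```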
Sweet! This looks really, really good.
Roman, are you around to add Ida to the repository?
(If he's not, I'll put it in for you. I don't have permissions to add
new developers.)
> It parses ellipses weirdly into a single character - … And I don't
> think it is part of utf8, so it ends up looking like … I am not 100%
> sure how to fix it.
If you make utf8 the first encoding instead of 1252, I bet that will
take care of that issue. Look for self.decode and swap the two:
> self.decode = ["Windows-1252",
> "utf8"]
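A small standalone sketch of why the order matters. decode_with is a hypothetical helper, not the downloader's own decoding code; it simply returns the result of the first encoding that decodes without an error. UTF-8 bytes usually decode "successfully" as Windows-1252, just into the wrong characters, so Windows-1252 has to come second.

```python
def decode_with(raw, encodings):
    """Return raw decoded with the first encoding that does not fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing anything undecodable.
    return raw.decode(encodings[-1], "replace")


ellipsis_utf8 = b"\xe2\x80\xa6"  # the single ellipsis character in UTF-8

# Windows-1252 first: three wrong characters come back.
wrong = decode_with(ellipsis_utf8, ["Windows-1252", "utf8"])
# utf8 first: one ellipsis character comes back.
right = decode_with(ellipsis_utf8, ["utf8", "Windows-1252"])

print(len(wrong), len(right))
```

With utf8 first, a page that really is Windows-1252 still works, because its non-ASCII bytes normally fail UTF-8 decoding and the helper falls through to the second entry.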
> Also, it can be considered a somewhat minor issue, but the link needs
> to be exactly http://archiveofourown.org/works/story_id - if it even
> has slash at the end, let alone any other stuff like chapter id it
> crashes. Not quite sure why, though.
That's an easy fix, too. Look in getSiteURLPattern. See how the regexp
ends with r"\d+$"? Change that to:
> r"\d+(/chapters/\d+)?/?$"
Now it will match, but ignore a final / and any /chapters/<id> part.
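A quick way to check a change like this is to run both endings against a few URLs. The PREFIX below is made up for the test; only the part after it comes from the suggestion above, and the adapter's real pattern may start differently.

```python
import re

# Hypothetical prefix for illustration; the adapter's real one may differ.
PREFIX = r"^https?://archiveofourown\.org/works/"

old_pattern = re.compile(PREFIX + r"\d+$")
new_pattern = re.compile(PREFIX + r"\d+(/chapters/\d+)?/?$")

urls = [
    "http://archiveofourown.org/works/78893",
    "http://archiveofourown.org/works/78893/",
    "http://archiveofourown.org/works/78893/chapters/108914",
]
for url in urls:
    print(url, bool(old_pattern.match(url)), bool(new_pattern.match(url)))
```

This only covers what the pattern accepts. The adapter still has to pull the story id out of the URL on its own.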
I do have some suggestions:
Rather than setting default values, like "No description available" and
"Author chose not to use warnings.", I'd prefer you left them empty.
That's what the other adapters do.
I'd remove AO3's "No Archive Warnings Apply" and "Author Chose Not To
Use Archive Warnings" "warnings" rather than including them.
Category needs to be a loop--there can be more than one category tag.
I agree with putting both Fandom and Category into 'genre', but I
wouldn't add "Category: ".
I'm not necessarily sure I like putting the Relationships into
'characters', but it's an opinion call.
Nifty way you found to determine when a story is completed. Another,
more explicit way would be to look for:
"Published: 2011-02-04 Updated: 2011-05-16" vs
"Published: 2010-04-27 Completed: 2010-05-04"
You should set 'dateUpdated' to Completed date when it's there rather
than Published.
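A minimal sketch of the Published/Updated/Completed idea, assuming the dates appear in the page text in exactly the form quoted above. parse_stats and the status strings are placeholders; use whatever values the other adapters set.

```python
import re
from datetime import datetime

# Published date, optionally followed by an Updated or Completed date.
STATS_RE = re.compile(
    r"Published:\s*(\d{4}-\d{2}-\d{2})"
    r"(?:\s+(Updated|Completed):\s*(\d{4}-\d{2}-\d{2}))?"
)


def parse_stats(text):
    """Return (status, date_published, date_updated), or None if no match."""
    match = STATS_RE.search(text)
    if match is None:
        return None
    published = datetime.strptime(match.group(1), "%Y-%m-%d")
    label, second_date = match.group(2), match.group(3)
    if label is None:
        # Only a Published date was found; status is left undecided here.
        return (None, published, published)
    updated = datetime.strptime(second_date, "%Y-%m-%d")
    # Placeholder status strings (an assumption, not the project's values).
    status = "Completed" if label == "Completed" else "In-Progress"
    return (status, published, updated)


print(parse_stats("Published: 2010-04-27 Completed: 2010-05-04"))
```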
Again, this is really good work.
NP.
> The reason I've decided to include relationships with the characters
> is because I've noticed a few examples where people put that part to
> the exclusion of the characters.
Fair enough. It's a debatable point and I don't care enough to debate
it. :-)
Does make me think a little more about your suggestion of configurable
substitutions, though.
Got it. Looks great. I'll put out a new version with it tonight.
If you don't object, I'll put your name on the web site and the plugin
change log as contributing the new adapter.
I have one comment, though.
When you edited ...adapters/__init__.py, you changed the line endings to
Unix form (LF vs CR/LF).
If possible, please don't do that in future. I had to go find settings
to ignore changes in whitespace in diff--and whitespace is rather
important in python. :-)
Oops. Found a problem with it. The closing chapter notes on this chapter:
http://archiveofourown.org/works/78893/chapters/108914
..don't have a blockquote, so it errors out.
Rather than all the complexity of finding all the different kinds of
headers and footers AO3 uses, have you considered starting with <div
class="chapter"> and removing/replacing the objectionable elements?
That might be easier than building up the chapter text from pieces.
adapter_mediaminerorg.py is a good example of changing and removing
tags. mediaminer.org's HTML is an abomination. :-)
Jim
When I went to test it on the web version, I found that it wasn't
checking for the 'mature content' warning to raise AdultCheckRequired.
I added that myself rather than go back and forth another time before
putting out the new versions.
Roman, Can you add Ida to the dev group, too? Thanks.
> Jim
>
> On 1/30/2012 5:26 PM, Jim Miller wrote:
> >
> > When you edited ...adapters/__init__.py, you changed the line endings to
> > Unix form (LF vs CR/LF).
Is there a setting for this?
Bill
> Is there a setting for this?
Depends on the program you use to edit the file with.
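If your editor doesn't make it obvious, a few lines of Python can tell you which line endings a file has before you send it in. Both helpers below are hypothetical and work on the raw bytes of a file, read with open(path, "rb").

```python
def count_line_endings(data):
    """Count CR/LF and bare LF line endings in a bytes string."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf_only}


def to_crlf(data):
    """Convert every line ending to CR/LF without doubling existing ones."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


sample = b"import a\r\nimport b\nimport c\r\n"
print(count_line_endings(sample))
```

A file that reports only one kind of ending, and the same kind as the copy in the repository, should diff cleanly.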
> I've gotten the 'all audiences' rating to work (the one without
> warnings), but I'm not sure about the ratings that have warnings.
> There's two ratings that have different ratings (1 and 2). However,
> when I try to copy the url, it copies a javascript for the warning,
> like this: 'javascript:if(confirm('This story contains some material
> that is inappropriate for readers under 13. This may include sexual
> situations, violence, language, or other mature content.')) location =
> 'viewstory.php?sid=2149&warning=2'.'
Some of the other sites do something similar. None of the existing
adapters will accept the javascript as the URL. What you do is click
through it so you have the "http://ksarchive.com/viewstory.php?sid=2149"
URL showing and then copy that.
> At the moment, I seem to be getting the error: AttributeError:
> 'NoneType' object has no attribute 'string'.
>
> This happens after it looks at my adapter, 'line 189, in
> extractChapterUrlsAndMetadata self.story.setMetadata('title',
> a.string)'.
>
> It sounds like it's not getting the title, but I'm unsure why.
If you're seeing that with 'mature' stories, it's likely because the
adapter isn't calling the URL with &warning=2 (or whatever--it varies a
little site to site). It's not giving you the story page, it's giving
you the page that says: 'This story contains some material that is
inappropriate for readers under 13. This may include sexual situations,
violence, language, or other mature content.'
None of the other sites have two levels of 'warning', so the program
isn't really equipped to handle it. Have to ask 'Are you adult?' for
both, I guess.
And you can't just use warning=1 or warning=2 all the time. ksarchive
requires 1 for NC-17 and 2 for 13+. Have to try and then look to see
which warning comes out.
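One way to "try and then look" can be sketched like this. It assumes the page you get back contains a link of the form quoted earlier (viewstory.php?sid=...&warning=N), which I haven't verified against the site; url_with_warning is a made-up helper name.

```python
import re

# Matches the link inside the site's confirm() javascript, for example
# location = 'viewstory.php?sid=2149&warning=2'
WARNING_RE = re.compile(r"viewstory\.php\?sid=(\d+)&(?:amp;)?warning=(\d+)")


def url_with_warning(base_url, page_text):
    """Return base_url plus the warning level named in page_text, or None."""
    match = WARNING_RE.search(page_text)
    if match is None:
        # No warning link found; the page is probably the story itself.
        return None
    return base_url + "&warning=" + match.group(2)


page = "javascript:if(confirm('...')) location = 'viewstory.php?sid=2149&warning=2'"
print(url_with_warning("http://ksarchive.com/viewstory.php?sid=2149", page))
```

The adapter would then fetch the returned URL (after the 'Are you adult?' check) instead of hard-coding warning=1 or warning=2.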
> I'll probably look at it again the next time I get bored. As I get a
> bit further each time I look at it, I'm sure I'll finish it some day.
Progress is always good. :-)
Great, thanks.
I've put in a couple of minor fixes I found with that, made plugin 1.3.4,
and uploaded to the web app.