Coincidentally, I've been writing some setup notes this morning :-)
Not complete, but a start:
http://code.google.com/p/journa-list/source/browse/trunk/jl/install.txt
It's a little Ubuntu-centric I'm afraid... I've got some notes on
setting it up on a Mac from last week which I'll try and integrate soon.
Also, Gavin Buttimore got it all working under Windows without huge
amounts of trouble, but I don't think we kept any notes on that...
> I've checked out the source from Google Code, but there seem to be
> some files missing, for example phplib/db.php and phplib/utility.php -
> would it be possible to ensure that svn is updated with any missing
> files? From that I'm guessing svn isn't used for deployment any more?
phplib and pylib are mySociety libraries. I've tried to lay out
journalisted the same way as a mySociety site, on the grounds that I
like their sites and they've got more experience at this kind of thing
than I. Also there was some talk of them hosting it at one point, so I
wanted to make sure it could work under their
fancy-pants-multiple-server deployment system :-)
> So the first task is to make sure all files are in svn and that all
> deployment is done through svn.
Other than the mySociety libs, the deployment is still basically an svn
export, followed by installation of a couple of site-specific config files.
> Secondly, you might want to consider having a "live" branch as well as
> the main trunk. New features can be collaboratively worked on via the
> trunk and deployed to a common staging area on demand. When a feature
> is ready, it can be merged into the live branch.
Definitely, but so far it's all been small incremental changes by one or
two people. Nothing really justifying a separate branch yet, but only a
matter of time, I'm sure.
There is a staging site, but it's not really been needed so far (and it
shares the live database due to space constraints).
> Secondly, is it possible to get a database snapshot? Either a full
> dump or a slice big enough to develop with? A useful thing I did with
> geograph was write a script for generating developer dumps which
> stripped out anything sensitive like emails and passwords.
Unfortunately, I can't really open up the database - I keep full text in
there for future analysis, and it'd just be a big horrible legal problem.
It just occurred to me that I could dump out snapshots with article text
stripped... but a lot of development work is going to involve analysing
that text, so I don't know how useful it'd really be.
Also, it's getting pretty big - about 320000 articles so far (1.3GB dump).
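For what it's worth, the stripped-snapshot idea could be as simple as
blanking the text columns on the way out. This is a rough sketch only: the
column name `content` and the row layout are made up for illustration, so
check jl/db/schema.sql for the real ones.

```python
# Hypothetical column names; the real schema may differ.
TEXT_COLUMNS = ("content",)


def strip_article_text(row, text_columns=TEXT_COLUMNS):
    """Return a copy of an article row with the full-text columns blanked."""
    cleaned = dict(row)  # copy, so the original row is left untouched
    for col in text_columns:
        if col in cleaned:
            cleaned[col] = ""
    return cleaned


# Example row (invented data)
article = {"id": 1, "title": "Broccoli for all", "content": "Full article text..."}
stripped = strip_article_text(article)
```

Metadata such as titles survives, which is enough for front-end work but
not for anything that analyses the text itself.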
So, my recommendation is to run the scrapers yourself and collect your
own data...
I've dumped out a schema.sql and basedata.sql into jl/db/ which will set
up the database to a point where everything will run.
eg, to scrape the Daily Mail:
$ cd jl/scraper
$ export JL_DEBUG=2
$ ./dailymail.py
Anyway, hopefully that's enough to get you up and running!
Ben.
Looks good. Won't have a chance to try it out today though.
> phplib and pylib are mySociety libraries....
Would it not be a good idea to treat them like any 3rd party library
and check it into svn? As well as making setup easier you'll be able
to patch your local copy and control when you merge changes from the
master mySociety version.
> Unfortunately, I can't really open up the database - I keep full text in
> there for future analysis, and it'd just be a big horrible legal problem.
No problem, but isn't the fact you're storing the full text already a
big horrible legal problem ;)
> So, my recommendation is to run the scrapers yourself and collect your
> own data...
No problem, instructions look clear, many thanks for putting that together.
Might be a day or two before I get set up. Is there anything in
particular you'd like doing in the next few weeks?
>> phplib and pylib are mySociety libraries....
>
> Would it not be a good idea to treat them like any 3rd party library
> and check it into svn? As well as making setup easier you'll be able
> to patch your local copy and control when you merge changes from the
> master mySociety version.
Probably, although I'd wait until we actually had some changes
to maintain.
>> Unfortunately, I can't really open up the database - I keep full text in
>> there for future analysis, and it'd just be a big horrible legal problem.
>
> No problem, but isn't the fact you're storing the full text already a
> big horrible legal problem ;)
Well... it's a pretty grey area. The best legal advice we've got
suggests that we're OK as long as we aren't making money from it and we
never actually serve up the text.
> Might be a day or two before I get set up. Is there anything in
> particular you'd like doing in the next few weeks?
Well, this week I'm starting on collecting data about articles from
other sites - eg who's blogging about them, how many times they've been
bookmarked on Digg, del.icio.us etc... what sort of interest each one is
generating.
Then I'll be moving on to scraping other sources of journalist
information - linking to their Wikipedia page, bio pages on the Guardian's
Comment is free... that kind of thing.
Other stuff (off the top of my head):
- a proper API to let other websites use the data we've collected
- Term extraction could be improved. A lot.
- Analysing article text to extract various metrics (eg use of weasel words)
- Doing some clustering on the articles to determine which ones are related
- General improvements to handling scraper errors (eg better alarm systems
  for when things go wrong)
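As a rough illustration of the weasel-word metric in the list above, here
is a minimal sketch that counts weasel phrases per 1000 words. The phrase
list and the function name are placeholders, not anything in the codebase.

```python
import re

# Illustrative phrase list only; a real one would need curating.
WEASEL_PHRASES = (
    "some say",
    "critics claim",
    "it is believed",
    "reportedly",
    "sources say",
)


def weasel_rate(text, phrases=WEASEL_PHRASES):
    """Return weasel-phrase occurrences per 1000 words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    # Rejoin the words so punctuation and line breaks cannot split a phrase.
    normalised = " ".join(words)
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", normalised))
        for phrase in phrases
    )
    return 1000.0 * hits / len(words)
```

Normalising per 1000 words keeps long and short articles comparable.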
What sort of things are you interested in?
Ben.
Cool - that'd be great! Yeah we can stick with Google Code for now.
What we _should_ do at some point is take a set of articles and go
through them manually extracting terms, to create a set of 'perfect'
results.
Then we can run automatic systems against the same data to come up with
some sort of metric as to how good they are, and to gauge if various
tweaks actually improve things or make them worse.
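A minimal sketch of that comparison, assuming exact (case-insensitive)
matching between the extracted terms and the hand-built set. The function
name and sample terms are invented for illustration, and precision, recall
and F1 are just one reasonable choice of metric.

```python
def score_terms(extracted, gold):
    """Compare extracted terms with a hand-built 'perfect' set.

    Returns (precision, recall, f1). Matching is exact after lowercasing.
    """
    found = {term.lower() for term in extracted}
    wanted = {term.lower() for term in gold}
    correct = len(found & wanted)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(wanted) if wanted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Invented example data
gold = ["Gordon Brown", "broccoli", "Britain"]
extracted = ["Gordon Brown", "Brown", "broccoli", "meal"]
precision, recall, f1 = score_terms(extracted, gold)
```

Running the same function before and after a tweak to the extractor gives
a number to compare, rather than eyeballing the output.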
Other thoughts (which I guess I should add to the wiki now!):
- I'd love to be able to coalesce names eg:
Gordon Brown today announced compulsory broccoli portions with
every meal. Brown called his plan "Sensible measures for a less
lardy Britain!"
"Gordon Brown" and "Brown" should ideally be extracted as two
instances of the same term... I guess a lot of "proper" term extraction
approaches must address this...
- Should terms appearing in the first couple of paragraphs be given more
significance than ones appearing later? (because most stories will
mention their main focus at the start, right?)
- Are there any other quirks specific to newspaper articles that we can
exploit?
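To make the name-coalescing idea concrete, here is a naive sketch: treat a
single-word mention as the surname of the first earlier multi-word name
that ends with it. It assumes the mentions have already been extracted in
document order, and the function name is hypothetical.

```python
def coalesce_names(mentions):
    """Resolve bare surnames to an earlier full name where one exists.

    `mentions` is a list of name strings in document order. A single-word
    mention with no earlier matching full name is left unchanged.
    """
    full_by_surname = {}
    resolved = []
    for mention in mentions:
        parts = mention.split()
        if len(parts) > 1:
            # Remember the first full name seen for this surname.
            full_by_surname.setdefault(parts[-1].lower(), mention)
            resolved.append(mention)
        else:
            key = mention.strip().lower()
            resolved.append(full_by_surname.get(key, mention))
    return resolved
```

This will wrongly merge two different people who share a surname in the
same article, so it's a starting point rather than a solution.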
Ben.
> Other thoughts (which I guess I should add to the wiki now!):
>
> - I'd love to be able to coalesce names eg:
> Gordon Brown today announced compulsory broccoli portions with
> every meal. Brown called his plan "Sensible measures for a less
> lardy Britain!"
> "Gordon Brown" and "Brown" should ideally be extracted as two
> instances of the same term... I guess a lot of "proper" term extraction
> approaches must address this...
>
Yes, been doing some reading here. Will stick my links on the wiki this
evening.
> - Should terms appearing in the first couple of paragraphs be given more
> significance than ones appearing later? (because most stories will
> mention their main focus at the start, right?)
>
Almost certainly - most stories will follow the inverted pyramid style,
with successive paragraphs adding more detail. Finding noun phrases
which feature in the first couple of paragraphs, and which are then
repeated through the rest of the piece will probably yield some good
results.
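A rough sketch of that weighting, assuming candidate terms have already
been found some other way (noun-phrase extraction isn't shown). The
function name, the two-paragraph cutoff and the weight of 3.0 are arbitrary
placeholders, and matching is plain case-insensitive substring counting.

```python
def score_by_position(paragraphs, candidates, lead_paragraphs=2, lead_weight=3.0):
    """Score candidate terms, counting lead-paragraph mentions more heavily.

    Mentions in the first `lead_paragraphs` paragraphs count `lead_weight`
    each; later mentions count 1.0. Terms never mentioned are omitted.
    """
    scores = {}
    for index, paragraph in enumerate(paragraphs):
        weight = lead_weight if index < lead_paragraphs else 1.0
        text = paragraph.lower()
        for term in candidates:
            count = text.count(term.lower())  # naive substring match
            if count:
                scores[term] = scores.get(term, 0.0) + weight * count
    return scores
```

A term that appears early and keeps recurring ends up with the highest
score, which is roughly the pattern described above.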
> - Are there any other quirks specific to newspaper articles that we can
> exploit?
I think some patterns may emerge as you try to identify what the perfect
terms will be for a selection of articles.
Paul