What's new with EveryPolitician?


Steven Clift

Mar 10, 2017, 1:38:53 PM
to poplus
Any state/regional/local expansions in any countries?

What about keeping data fresh from so many countries?

Steve

Dave Whiteland

Mar 13, 2017, 1:16:48 PM
to Steven Clift, poplus
Hi Steven

http://everypolitician.org is still focussed on the national-level
legislatures for now. We'd love to expand to sub-national
jurisdictions, but at the moment that's too big a jump for the current
team. We're keen to do it eventually, not least because we know that,
from a civic tech point of view, it's local politics (whether at state
level, or even smaller) where citizen engagement can be most
effective.

But the reality is we're plenty busy working on the breadth and the
depth of national level at the moment. Incidentally, within this work
we *do* try to prioritise based on where we know people are actively
using the data (that is, specific legislatures — so if anyone has a
particular need for a national legislature that we're lagging behind
on, let us know). We're also actively working on adding the executive
branches.

Your question is timely, because we've just submitted a grant proposal
to the Wikimedia Foundation to help us move more of this work onto
Wikidata. Our proposal is basically to repoint our scrapers, which
currently notify us of changes in the upstream sources, so that they
make recommendations to Wikidata instead. Then we could enrich the
political data in Wikidata. In fact, EveryPolitician doesn't have a
database [1], so this feels like a very natural development. We
already add data to Wikidata as part of our manual process, but we
believe it would be ideal if more people were involved in that,
especially as it allows local context and expertise to be applied.

If you're interested in this, and are active on Wikimedia projects,
please do endorse our proposal!
https://meta.wikimedia.org/wiki/Grants:Project/EveryPolitician

- - -

*If* you're the kind of person who's interested in web scrapers, the
following is for you. If you're not, it's OK to stop reading now :-)

- - -

The general problem of keeping data fresh is, of course, something we
anticipated from the outset. Broadly speaking, our approach as we add
sources is to add a scraper to our (large!) farm of them, running on
morph.io. The initial hit is where most of the work is... but once a
scraper is in place, we run it (like most of them) once a day, so that
any changes it picks up from that point on are incremental (unless
there's an election, for example, when suddenly there's a big change
in the data coming in). Our bot makes pull requests and summarises the
proposed changes [2] to keep the load on the humans bearable.
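
To give a flavour of that daily cycle, here's a minimal sketch of a
morph.io-style Ruby scraper (the URL and field names are invented, not
a real source). morph.io's Ruby scrapers persist rows via the
scraperwiki gem, and save_sqlite upserts on a unique key, so a daily
re-run only changes rows whose upstream data actually changed:

    # Minimal sketch of a daily scraper run (hypothetical source URL).
    require 'scraperwiki'
    require 'nokogiri'
    require 'open-uri'

    page = Nokogiri::HTML(URI.open('http://example.parliament/members'))

    page.css('.member').each do |member|
      # save_sqlite upserts on :id, so unchanged members are a no-op
      # and only genuinely new or edited data shows up as a change.
      ScraperWiki.save_sqlite([:id], {
        id:   member['data-id'],
        name: member.css('.name').text.strip,
        area: member.css('.constituency').text.strip,
      })
    end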

Inevitably, scrapers will and do need "fixing": not because they
break, but because the sources they're hitting (such as official or
parliamentary sites) change. So our second line of defence against
this problem is to streamline how we write all those scrapers.

In practical terms this means our team has written the `scraped` gem
[3] (we're working in Ruby), which effectively allows us to define
scrapers declaratively (along with other tools around it). That is,
pretty much every scraper is solving a very similar problem, so we
encapsulate that in a library, which boils scraper-writing down to
just declaring each field we want to extract, and very little else.
We're using the excellent nokogiri gem as part of this toolkit, which
(to the programmers here) means the scraper code itself is often doing
little more than specifying the CSS or XPath selector that isolates
each named field.
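
To make that concrete, here's a hedged sketch of such a declarative
scraper (the member page and its selectors are invented; the
guernsey-2016 scraper linked under [3] below is a real example):

    require 'scraped'

    # Subclassing Scraped::HTML gives each field block a `noko`
    # accessor: a Nokogiri document for the fetched page. So a
    # scraper is mostly just named fields and their selectors.
    class MemberPage < Scraped::HTML
      field :name do
        noko.at_css('h1.member-name').text.strip
      end

      field :party do
        noko.at_css('.party').text.strip
      end
    end

    # Fetch a (hypothetical) page and read its fields as a hash:
    page = MemberPage.new(
      response: Scraped::Request.new(url: 'http://example.parliament/member/1').response
    )
    page.to_h  # => { name: "...", party: "..." }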

Of course there's more to it than that, but certainly one key aspect
of how we're coping with the scaling problem of having over a thousand
scrapers is streamlining them in this way.

There are other tricks too, including archiving the pages we're
scraping as we go along. The bot will be blogging more about these
techniques shortly :-)
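
For the curious, the archiving idea in miniature: snapshot the raw
HTML each time we fetch it, before parsing, so a scraper can later be
debugged against the exact page it saw. Here's an illustrative
hand-rolled sketch (not our actual tooling):

    require 'open-uri'
    require 'fileutils'

    # Save the raw HTML of every fetch, keyed by URL and timestamp,
    # then hand it on to the parser as usual.
    def fetch_and_archive(url, archive_dir: 'archive')
      html = URI.open(url).read
      FileUtils.mkdir_p(archive_dir)
      stamp = Time.now.utc.strftime('%Y%m%dT%H%M%SZ')
      safe_name = url.gsub(/[^A-Za-z0-9]+/, '-')
      File.write(File.join(archive_dir, "#{safe_name}-#{stamp}.html"), html)
      html
    end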

Yours
Dave

[1] https://medium.com/@everypolitician/sometimes-i-work-hard-to-produce-nothing-400762d252ff
discusses a little how we're using git rather than a database

[2] https://medium.com/@everypolitician/i-m-a-bot-who-comments-d1d93b6cab63

[3] https://github.com/everypolitician/scraped <-- documentation is
behind, I know!
Here's an example of this in action: the page is effectively declared
in terms of the (data) fields we're collecting from it:
https://github.com/everypolitician-scrapers/guernsey-2016/blob/master/lib/member_page.rb