#FlashHacks - join our latest campaign to crowdscrape 10 million datapoints in 10 days.


Hera Hussain

Jul 8, 2014, 6:28:55 AM
to opencorporat...@googlegroups.com
Hi everybody,

*drum roll*

We just launched a #FlashHacks campaign to crowdscrape 10 million datapoints in 10 days. OpenCorporates has grown to over 70 million companies with the help of the wider open data community, which is why we started the Missions platform in the first place. Despite strong commitments from the OGP and G8, governments remain slow to make company information open. That’s why scraping is at the heart of the open data movement! Where would the open data community be without bot-writers spending time deciphering formats and writing code to release data?

 We want to use #FlashHacks as a celebration of the commitment of bot-writers like you and invite others to join us in changing the world through open data.

 How you can join the crowdscraping movement 

  • Join missions.opencorporates.com and sign up!
  • Have a look at the datasets we have listed on the Campaign page as inspiration. You can either write bots for these or even choose your own!
  • Sign up to a mission! Send a tweet pledge to say you have taken on a mission.
  • Write the bot and submit it on the platform.
  • Tweet your success with the #FlashHacks tag! Don’t forget to upload the FlashHack design as your Twitter and Facebook cover photo to get more people involved.
Read more about why we started this on our blog.

Seb Bacon

Jul 8, 2014, 8:00:20 AM
to Hera Hussain, opencorporat...@googlegroups.com
Hi,

Just to add a few things.

For those of you who have already been helping, please do a "gem
install turbot" to upgrade to the newest client.

Please also have a re-read of the quickstart guide, which has been
tweaked a bit: http://turbot.opencorporates.com/docs/quickstart

And also lots more examples, including how to store state:
http://turbot.opencorporates.com/docs/examples

Thanks

Seb
> --
> You received this message because you are subscribed to the Google Groups
> "OpenCorporates Community" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to opencorporates-com...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
skype: seb.bacon
mobile: 07790 939224
land: 01531 671074

Emmanuel Okyere

Jul 8, 2014, 4:53:59 PM
to opencorporat...@googlegroups.com, hera.h...@opencorporates.com, seb....@gmail.com
Hi,

As I mentioned in another thread, this is a great idea!

I am attempting one of the missions now: http://missions.opencorporates.com/missions/588/ and it appears you need to pass a search term to retrieve a results page before scraping.
I have written my parser to accept a term as an argument to search with, as in: python parser.py term

How do I specify this so that I can validate the bot?

cheers,
Emmanuel.

Seb Bacon

Jul 9, 2014, 8:04:27 AM
to opencorporat...@googlegroups.com, hera.h...@opencorporates.com, seb....@gmail.com
Hi Emmanuel,

This is a pattern which we've been intending to document, so thanks for asking :)

We usually tackle this (quite common) problem by iterating over permutations of letters, and searching for those. I've updated the documentation here:

   http://turbot.opencorporates.com/docs/examples#incrementing
Let me know if that makes sense... or not...
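
For readers following the thread, the letter-iteration idea can be sketched roughly as below. This is only an illustration, not the code from the Turbot docs: `search_terms`, `scrape_all`, the `search` callable and the "id" key are all hypothetical names. It uses itertools.product, so repeated letters such as "aa" are included rather than strict permutations.

```python
import itertools
import string


def search_terms(length=2, alphabet=string.ascii_lowercase):
    """Yield every string of `length` characters drawn from `alphabet`."""
    for letters in itertools.product(alphabet, repeat=length):
        yield "".join(letters)


def scrape_all(search, length=2):
    """Run `search` once per term and keep one record per "id".

    `search` is a hypothetical callable: it takes a search term and
    returns a list of dicts, each with an "id" key (an assumption).
    """
    seen = {}
    for term in search_terms(length):
        for record in search(term):
            # The same company usually matches several terms, so de-duplicate.
            seen.setdefault(record["id"], record)
    return list(seen.values())
```

Two-letter terms give 26 × 26 = 676 searches; longer terms return fewer hits per search but need many more requests, so the right length depends on how the site caps its results.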

Regards,

Seb

Seb Bacon

Jul 9, 2014, 8:26:12 AM
to opencorporat...@googlegroups.com, hera.h...@opencorporates.com, seb....@gmail.com
Incidentally, I don't think you'll need to use this pattern to scrape that data.

I just noticed that there appears to be a page where you can download the entire dataset as a flat file:

    http://www4.cbs.state.or.us/ex/all/mylicsearch/index.cfm?fuseaction=search.show_download&group_id=30

Should make scraping it much easier!
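
As a rough sketch of why a flat file is easier: once the file is downloaded, the bot only has to turn rows into records. The example below assumes the download is comma-separated with a header row, which has not been checked against the actual file, and emitting one JSON object per row is likewise an assumption; `rows_to_json_lines` is a hypothetical helper name.

```python
import csv
import io
import json


def rows_to_json_lines(text, delimiter=","):
    """Yield one JSON string per data row of a delimited text file.

    Assumes the first row holds the column names (not verified for
    the real download).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip surplus cells beyond the header
        }
        yield json.dumps(cleaned, sort_keys=True)


if __name__ == "__main__":
    sample = "name,licence_number\nAcme Ltd,123\n"
    for line in rows_to_json_lines(sample):
        print(line)
```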

Seb

Andy Lulham

Jul 9, 2014, 9:14:18 AM
to Seb Bacon, opencorporat...@googlegroups.com, hera.h...@opencorporates.com
Can’t remember if I already sent this suggestion, but…

To accompany the examples, it would be useful to click through from a
dataset to view or download the scraper that grabbed and parsed that
data. For instance, I don’t think there’s a pdf example up yet, but I
can quickly find pdf datasets that have been scraped. If I could view
those scrapers, I could figure out how to do some medium challenges :)

Thanks,
Andy
--
Andy Lulham (@andylolz)
http://treadsoft.ly

Emmanuel Okyere

Jul 9, 2014, 9:29:29 AM
to opencorporat...@googlegroups.com, hera.h...@opencorporates.com, seb....@gmail.com
Great; thanks for the update, Seb

It would probably be useful if you went through the rest of the missions to make sure people land on pages where the data is already available, or that it is clear how to get to the data to be scraped.
I'll take a look at the link again later on tonight.

cheers,
Emmanuel.

Seb Bacon

Jul 10, 2014, 2:47:34 AM
to Andy Lulham, Seb Bacon, opencorporat...@googlegroups.com, Hera Hussain
Hi Andy,

Great suggestion. I'll see if we can manage to get round to this
today! We also need to add support for XLS parsing, which we propose
to do via gnumeric's built-in converters. We've had issues with all
the native libraries before, so this is probably the best option.
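
For anyone wanting to try this locally in the meantime, gnumeric ships a command-line converter called ssconvert, and a bot could shell out to it along these lines. This is a sketch only, not necessarily how the platform will implement it; `ssconvert_command` and `xls_to_csv` are hypothetical helper names, and it assumes gnumeric is installed.

```python
import shutil
import subprocess


def ssconvert_command(xls_path, csv_path):
    """Build the command line; ssconvert picks the output format from the extension."""
    return ["ssconvert", xls_path, csv_path]


def xls_to_csv(xls_path, csv_path):
    """Convert a spreadsheet to CSV with gnumeric's ssconvert and return the CSV path."""
    if shutil.which("ssconvert") is None:
        raise RuntimeError("ssconvert not found; install gnumeric first")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(ssconvert_command(xls_path, csv_path), check=True)
    return csv_path
```

The resulting CSV can then be read with an ordinary CSV parser, which avoids depending on a native XLS library.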

Seb

-------------------------------------------------------
OpenCorporates :: The Open Database of the Corporate World
http://opencorporates.com
Blog: http://blog.opencorporates.com
Twitter: http://twitter.com/OpenCorporates

OpenCorporates is published by Chrinon Ltd, a company dedicated to
improving and publishing public data under an open licence that allows
and encourages reuse, including commercially. Registered in England,
number 07444723.

Seb Bacon

Jul 10, 2014, 2:49:26 AM
to Emmanuel Okyere, opencorporat...@googlegroups.com, Hera Hussain, Seb Bacon
Hi Emmanuel,

Collecting suggested data sources is actually really time-consuming,
which is why in some cases we've missed things ourselves. I only just
noticed the full download myself; until then, we'd thought the
only endpoint was the search form. Perhaps we could add a note to the
missions along the lines of "it's worth clicking around the site to
check there's not a better place to gather the data" - what do you
think?

Thanks

Seb

Emmanuel Okyere

Jul 10, 2014, 6:57:30 AM
to Andy Lulham, Seb Bacon, opencorporat...@googlegroups.com, hera.h...@opencorporates.com
+1

Emmanuel Okyere

Jul 11, 2014, 4:57:39 AM
to opencorporat...@googlegroups.com, eok...@gmail.com, hera.h...@opencorporates.com, seb....@gmail.com
That probably helps. :)

cheers,
Emmanuel.

Andy Lulham

Jul 14, 2014, 4:41:36 AM
to Seb Bacon, Emmanuel Okyere, opencorporat...@googlegroups.com, Hera Hussain, Seb Bacon
On 10 July 2014 07:49, Seb Bacon <seb....@opencorporates.com> wrote:
> Perhaps we could add a note to the
> missions along the lines of "it's worth clicking around the site to
> check there's not a better place to gather the data" - what do you
> think?

Really good idea. In fact, I’d suggest going further, and making this
step #0 of the “Here's how we suggest you go about it” section. Along
the lines of:

0. See if you can find a better version of this dataset! Locating the
data to scrape is half the battle. If you think you’ve found a better
place to scrape from, drop us an email and let us know.

Andy