Hi


Anand Balachandran Pillai

Aug 1, 2008, 9:03:11 AM
to Watchdog Volunteers
Hi,

I joined the group today. I have been observing watchdog.net for some
time and have read Aaron's blog post about it. Also, a fellow BangPyper,
Pradeep Gowda, works as a developer on watchdog.net. I work in Bangalore.

I have been a Python coder since 2003 and have contributed some useful
code as open source. I have a few recipes on the ASPN Python Cookbook
website, and I develop HarvestMan (http://harvestmanontheweb.com), a
pure-Python web crawler, in my free time.

I am interested in what watchdog.net is trying to do, both for the
enormity of the task and for the engineering challenges involved. I
guess the coding involves a lot of text parsing, information extraction
and mapping, which is right up my alley, as I have been doing a lot of
this in the course of HarvestMan development.

I am also part of an open source project to measure e-governance
indicators for Europe, which is currently in its inception phase
(http://egovmon.no). Its aim is quite similar (keeping government and
politicians accountable), though the design and approach are entirely
different, since it uses "indicators". Still, both projects are in the
same area: accountability tracking of governments and politicians.

I am volunteering my Python coding services to watchdog.net. If there
are some problems to solve, or an unofficial guide to getting started,
let me know.

Thanks

--Anand

Pradeep Gowda

Aug 1, 2008, 2:44:20 PM
to watchdog-...@googlegroups.com
Hi Anand,
Welcome to watchdog-volunteers.

Perhaps our mailing list front page should have a link to the
volunteers' to-do page:

http://watchdog.jottit.com/volunteer

Pick the tasks that you are interested in and send a mail to the group
saying that you are working on such-and-such a feature.

If you want clarification on datasets, sources, copyright issues,
etc., you can shoot a mail to the list.

As you can see, the volunteers list has been more active in the last
few weeks than earlier.

We are glad to have your web crawling and Python hacking experience.
Thanks,
Pradeep

Pradeep Kishore Gowda
http://pradeepgowda.com
pra...@btbytes.com
+1-317-489-2272 (Mobile)

Anand Balachandran Pillai

Aug 1, 2008, 4:19:36 PM
to Watchdog Volunteers
Thanks, Pradeep, for the links.
Yes, I agree these should be put on the front page for
better visibility.

--Anand

Aaron Swartz

Sep 24, 2008, 5:46:02 PM
to Balachandran Pillai, watchdog-...@googlegroups.com
[I'm cc-ing the volunteer list; hope you don't mind.]

> I took a new look at your volunteer page. The SEC project
> looks interesting to me.

Great; that's a very important one. Basically, there's a bunch of XML
data about ownership and trading and also some largely-unstructured
HTML files (exhibit 21, I think) that list subsidiaries. The XML data
should probably come first.
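
As a rough illustration of that first step, here is a minimal sketch of
turning one ownership XML file into flat rows. The sample document and
its element names (issuerName, rptOwnerName, transaction, date, shares,
code) are made up for illustration and are assumptions, not the actual
SEC schema; the real files would need to be checked before relying on
any of these paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names are assumptions, not the real SEC schema.
SAMPLE = """<ownershipDocument>
  <issuer>
    <issuerName>Example Corp</issuerName>
  </issuer>
  <reportingOwner>
    <rptOwnerName>Jane Doe</rptOwnerName>
  </reportingOwner>
  <transaction>
    <date>2008-09-01</date>
    <shares>1000</shares>
    <code>A</code>
  </transaction>
  <transaction>
    <date>2008-09-15</date>
    <shares>250</shares>
    <code>D</code>
  </transaction>
</ownershipDocument>"""


def parse_ownership(xml_text):
    """Return one dict per transaction found in the XML text."""
    root = ET.fromstring(xml_text)
    issuer = root.findtext("issuer/issuerName")
    owner = root.findtext("reportingOwner/rptOwnerName")
    rows = []
    for tx in root.findall("transaction"):
        rows.append({
            "issuer": issuer,
            "owner": owner,
            "date": tx.findtext("date"),
            "shares": int(tx.findtext("shares")),
            # In this made-up sample, "A" means acquired and "D" disposed.
            "acquired": tx.findtext("code") == "A",
        })
    return rows


if __name__ == "__main__":
    for row in parse_ownership(SAMPLE):
        print(row)
```

Flat rows like these are easy to load into a database later; the
unstructured HTML subsidiary lists would need a separate approach.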

> Also if there is some work which requires
> analysis/extraction of data from PDF files, I can work on it, since
> I have some experience in extracting structure & content from
> PDF.

That's a very useful skill. I assume you mean text PDFs, not scan/OCR
hybrids like:

http://bulk.resource.org/irs.gov/section_527/part01/010221978-8871-01.pdf

I don't think we've had to deal with text PDFs quite yet. The main
ones I've run into are things like laws and hearings and so on, which
might be good to convert to HTML (ala undemocracy.com) but not urgent.
(Also, I'm not sure who else is working on them.)

> Apart from this, if I can use HarvestMan and customize
> it for watchdog to perform some specific crawls for solving
> specific problems, that would be quite exciting.

Sure; we often end up having to crawl various sites, so I've no doubt
it will come in useful.

> Let me know which one you would want me to look at right now.
> I can take the problem and own it.

I think the SEC one is probably best; let me know if you have
questions or want feedback.

Aaron Swartz

Sep 24, 2008, 5:46:14 PM
to Balachandran Pillai, watchdog-...@googlegroups.com
And of course, thanks again for taking this on!

Lukasz Szybalski

Sep 24, 2008, 9:40:06 PM
to watchdog-...@googlegroups.com
On Wed, Sep 24, 2008 at 4:46 PM, Aaron Swartz <m...@aaronsw.com> wrote:
>
> [I'm cc-ing the volunteer list; hope you don't mind.]
>
>> I took a new look at your volunteer page. The SEC project
>> looks interesting to me.
>
> Great; that's a very important one. Basically, there's a bunch of XML
> data about ownership and trading and also some largely-unstructured
> HTML files (exhibit 21, I think) that list subsidiaries. The XML data
> should probably come first.
>
>> Also if there is some work which requires
>> analysis/extraction of data from PDF files, I can work on it, since
>> I have some experience in extracting structure & content from
>> PDF.
>
> That's a very useful skill. I assume you mean text PDFs, not scan/OCR
> hybrids like:
>
> http://bulk.resource.org/irs.gov/section_527/part01/010221978-8871-01.pdf
Are most of these PDFs handwritten?

If you wanted to convert it to text you could use tesseract-ocr. I've
tried it but never found the time to actually get it working, as the
fax/PDF needed to be at 200x200 dpi, I think.

What is section 527?

Lucas


>
> I don't think we've had to deal with text PDFs quite yet. The main
> ones I've run into are things like laws and hearings and so on, which
> might be good to convert to HTML (ala undemocracy.com) but not urgent.
> (Also, I'm not sure who else is working on them.)
>
>> Apart from this, if I can use HarvestMan and customize
>> it for watchdog to perform some specific crawls for solving
>> specific problems, that would be quite exciting.
>
> Sure; we often end up having to crawl various sites, so I've no doubt
> it will come in useful.
>
>> Let me know which one you would want me to look at right now.
>> I can take the problem and own it.
>
> I think the SEC one is probably best; let me know if you have
> questions or want feedback.

--
Python and OpenOffice documents and templates
http://lucasmanual.com/mywiki/OpenOffice
Fast and Easy Backup solution with Bacula
http://lucasmanual.com/mywiki/Bacula

Aaron Swartz

Sep 24, 2008, 9:42:54 PM
to watchdog-...@googlegroups.com
> Are most of these PDFs handwritten?

Not sure; it seems a lot are.

> If you wanted to convert it to text you could use tesseract-ocr. I've
> tried it but never found the time to actually get it working, as the
> fax/PDF needed to be at 200x200 dpi, I think.

There does seem to be a hidden text layer.

> What is section 527?

It's a section of the tax law used by political organizations that
aren't directly connected to a candidate.
