Perhaps our mailing list front page should have a link to the
volunteers' todo page:
http://watchdog.jottit.com/volunteer
Pick the tasks that you are interested in and send a mail to the group
saying that you are working on such-and-such a feature.
If you want clarification on datasets, sources, copyright issues,
etc., you can shoot a mail to the list.
As you can see, the volunteers list has been more active in the last
few weeks than it was earlier.
We are glad to have your web crawling and Python hacking experience.
Thanks,
Pradeep
Pradeep Kishore Gowda
http://pradeepgowda.com
pra...@btbytes.com
+1-317-489-2272 (Mobile)
> I took a new look at your volunteer page. The SEC project
> looks interesting to me.
Great; that's a very important one. Basically, there's a bunch of XML
data about ownership and trading and also some largely-unstructured
HTML files (exhibit 21, I think) that list subsidiaries. The XML data
should probably come first.
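
For a rough idea of what pulling fields out of that kind of XML could
look like, here is a minimal sketch using only the Python standard
library. The element names (ownershipDocument, issuerName, ownerName,
transaction, and so on) and the sample values are assumptions made up
for illustration; they have not been checked against the real SEC
schema and would need adjusting once the actual files are inspected.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element names and values below are
# illustrative assumptions, not a verified copy of the SEC format.
SAMPLE = """\
<ownershipDocument>
  <issuer>
    <issuerName>Example Corp</issuerName>
  </issuer>
  <reportingOwner>
    <ownerName>Jane Doe</ownerName>
  </reportingOwner>
  <transaction>
    <date>2008-06-02</date>
    <shares>1500</shares>
    <code>P</code>
  </transaction>
  <transaction>
    <date>2008-06-09</date>
    <shares>400</shares>
    <code>S</code>
  </transaction>
</ownershipDocument>
"""


def parse_ownership(xml_text):
    """Return the issuer, owner, and transactions as plain Python data."""
    root = ET.fromstring(xml_text)
    return {
        "issuer": root.findtext("issuer/issuerName"),
        "owner": root.findtext("reportingOwner/ownerName"),
        "transactions": [
            {
                "date": t.findtext("date"),
                "shares": int(t.findtext("shares")),
                "code": t.findtext("code"),
            }
            for t in root.findall("transaction")
        ],
    }


record = parse_ownership(SAMPLE)
print(record["issuer"], record["owner"], len(record["transactions"]))
```

Returning plain dicts keeps the parsing step separate from whatever
storage or cross-referencing comes later.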
> Also if there is some work which requires
> analysis/extraction of data from PDF files, I can work on it, since
> I have some experience in extracting structure & content from
> PDF.
That's a very useful skill. I assume you mean text PDFs, not scan/OCR
hybrids like:
http://bulk.resource.org/irs.gov/section_527/part01/010221978-8871-01.pdf
I don't think we've had to deal with text PDFs quite yet. The main
ones I've run into are things like laws and hearings and so on, which
might be good to convert to HTML (à la undemocracy.com), but that's not urgent.
(Also, I'm not sure who else is working on them.)
> Apart from this, if I can use HarvestMan and customize
> it for watchdog to perform some specific crawls for solving
> specific problems, that would be quite exciting.
Sure; we often end up having to crawl various sites, so I've no doubt
it will come in useful.
> Let me know which one you would want me to look at right now.
> I can take the problem and own it.
I think the SEC one is probably best; let me know if you have
questions or want feedback.
If you wanted to convert it to text you could use tesseract-ocr. I've
tried it but never found the time to actually get it working, as the
fax/PDF needed to be at 200x200 dpi, I think.
What is section 527?
Lucas
--
Python and OpenOffice documents and templates
http://lucasmanual.com/mywiki/OpenOffice
Fast and Easy Backup solution with Bacula
http://lucasmanual.com/mywiki/Bacula
Not sure; it seems a lot are.
> If you wanted to convert it to text you could use tesseract-ocr. I've
> tried it but never found the time to actually get it working, as the
> fax/PDF needed to be at 200x200 dpi, I think.
There does seem to be a hidden text layer.
> What is section 527?
It's a section of the tax law used by political organizations that
aren't directly connected to a candidate.