Question about efficiency vs completeness

Dane

Jun 28, 2011, 11:42:22 PM
to Open State Project
Hello all,

Another question!

As I've looked over the MO pages I've noticed that the bare essential
information for legislators (house and senate) can be fetched from one
page (currently xls and html). But, additional information is
available for each representative on a second page (photo, email,
additional addresses, committees, etc).

Providing a more complete legislator record requires N+1 HTTP requests
(one for the 'overview' of representatives and then N additional
calls for details). I'm not sure that it takes significantly longer,
but it certainly is a lot more than 1!
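
To make the N+1 pattern concrete, here is a minimal sketch in Python. Every name in it (`scrape_legislators`, `fetch`, `parse_overview`, `parse_detail`, the `detail_url` key) is hypothetical and chosen for illustration; none of it is billy's actual API. The page fetcher is passed in as a plain callable so the request count is easy to see.

```python
import time


def scrape_legislators(fetch, overview_url, parse_overview, parse_detail,
                       delay=1.0, sleep=time.sleep):
    """Make 1 overview request, then 1 detail request per legislator (N+1 total).

    All parameters are hypothetical stand-ins:
      fetch(url)           -> raw page content
      parse_overview(page) -> iterable of dicts, each with a 'detail_url' key
      parse_detail(page)   -> dict of extra fields (photo, email, ...)
    """
    records = []
    # Request 1: the overview page listing every legislator.
    for basic in parse_overview(fetch(overview_url)):
        sleep(delay)  # throttle before each detail request
        # Requests 2..N+1: one detail page per legislator.
        detail = parse_detail(fetch(basic["detail_url"]))
        record = dict(basic)   # start from the bare-essential fields
        record.update(detail)  # layer the detail-page fields on top
        records.append(record)
    return records
```

With a real HTTP fetcher plugged in as `fetch`, the loop makes exactly one extra request per legislator, paced by `delay`.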

I'm just getting familiar with the legislator portion of the scraper,
and I'm guessing a fully complete legislator would have all the
public information I can find that fits the JSON schema in
billy/schemas/api/legislator.json. I'm guessing the bare minimum is
the fields directly listed in the Legislator class (name, term,
district, party).

Should the ideal scraper err toward completeness or speed? Or
somewhere in the middle (some sub-portion of the JSON schema)?

I'm guessing that the 'complete' end of the spectrum is ideal, with
some reasonable cap (i.e., no more than some constant number of page
requests per legislator), but I wonder what the ideal goal is...

Thanks,

Dane

Mason Simon

Jun 29, 2011, 1:07:58 AM
to fifty-sta...@googlegroups.com, Open State Project
Dude. I would not sweat efficiency here. Not unless a million legislators spring forth and legislate. Those requests will cost mere seconds, and yield fruits so sweet that angels weep.

That said, I've contributed nothing to this project but this dumb email. Maybe you should wait for a more seasoned person to weigh in.

James Turk

Jun 29, 2011, 10:28:09 AM
to fifty-sta...@googlegroups.com
Generally we prefer completeness over efficiency, with only a few minor
exceptions. As a rule, for legislators get all of the information you
can (the schema is a reasonable subset, but if you can get fields that
go beyond it, such as phone numbers, you can scrape those as well and
just assign them reasonable names).

In the case of legislators the number of requests is never going to be
a huge burden: even a state with 500 legislators would take less than
10 minutes at 1 req/sec.
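
The arithmetic behind that estimate can be checked with a short sketch. The function name `estimated_minutes` is hypothetical, and the calculation assumes the 1 req/sec throttle dominates the run time (parsing time and network latency are ignored) and that there is a single overview request.

```python
def estimated_minutes(n_legislators, req_per_sec=1.0, overview_requests=1):
    """Rough wall-clock estimate for an N+1 scrape, counting throttle time only."""
    total_requests = overview_requests + n_legislators
    return total_requests / req_per_sec / 60.0


# 500 detail pages + 1 overview page = 501 requests at 1 req/sec
print(round(estimated_minutes(500), 2))  # 8.35
```

Under those assumptions, 501 requests come to about 8.35 minutes, which is under the 10-minute figure.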

When it comes to bills we have in some cases had to make trade-offs.
Bill scrapes at their worst can take more than 12 hours to complete,
so adding 2-3 extra requests per bill isn't always worth it.
Generally, though, our preference is to initially try to get
everything we can, and if it becomes apparent we're hitting a site too
hard we can back off on some of the additional information to
lighten our impact.
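
One way to picture "get everything first, back off later" is a switch that skips the optional per-bill requests. This is only a sketch: `scrape_bill`, `extra_urls`, and `include_extras` are hypothetical names for illustration, not part of billy.

```python
def scrape_bill(fetch, bill_url, extra_urls=None, include_extras=True):
    """Fetch a bill's main page, plus optional extra pages unless backed off.

    fetch(url) is a hypothetical callable returning page content;
    extra_urls maps a label (e.g. 'votes') to an additional page URL.
    """
    bill = {"main": fetch(bill_url), "extras": {}}
    if include_extras:
        # These are the "2-3 extra requests per bill" that can be dropped
        # if the scrape turns out to hit the site too hard.
        for label, url in (extra_urls or {}).items():
            bill["extras"][label] = fetch(url)
    return bill
```

Setting `include_extras=False` reduces each bill to a single request while leaving the rest of the scraper unchanged.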

-James

Dane

Jun 29, 2011, 12:12:24 PM
to Open State Project
Great, thanks. That's helpful.
