Hack session next Saturday (5/16)

Tom Dooner

May 9, 2015, 11:04:01 PM

to ope...@googlegroups.com

Hi all,

It was good seeing many of you this morning at the CFA HQ. We made some good progress getting on the same page about goals and desired architecture. For anyone that couldn't make it, you can relive it vicariously through the minutes.

We believe we have enough of an idea of where we want to go that it's time to start building something together.

Next Saturday (May 16th) will be the first (of hopefully many) hack sessions where we start building and designing where we're going from here. If you can't make it, fear not, there will be more opportunities. The agenda for this next week is simple: start working on the backend that will ETL the data from NetFile.

Thanks to Peter and Code for SF, we will be having this hack day at the CFA HQ from 10am-2pm.

To review the architecture we're thinking about building (speak now or forever hold your peace):

- We are building a shared backend across multiple jurisdictions
  - First we will focus on pulling all data from the NetFile API
  - Later we will focus on Cal-Access
- It is important to maintain a copy of the original source data
  - This will help us audit any transformations we apply to the data (deduplication, associations, name cleanup, etc.)
  - This will also help us perform subsequent updates incrementally, which will be a helpful performance benefit
- Python is our lingua franca due to the team's general familiarity with it as well as the many available data processing tools like dedupe and the CCDC Cal-Access apps.
- We will be using a relational data store, likely PostgreSQL.
- This will be a new repo under the opencalifornia github repo.
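
To make the raw-copy and incremental-update points above a bit more concrete, here is a minimal sketch, not a settled design. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name `raw_records`, both function names, and the record field `id` are hypothetical; nothing here reflects the actual shape of NetFile responses.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a store for untouched source records (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS raw_records (
            jurisdiction TEXT NOT NULL,
            source       TEXT NOT NULL,
            source_id    TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            payload      TEXT NOT NULL,
            PRIMARY KEY (jurisdiction, source, source_id, content_hash)
        )
        """
    )
    return conn


def store_raw(conn, jurisdiction, source, records):
    """Save unmodified copies of records; return how many were new or changed.

    Assumes each record is a dict with a hypothetical "id" field.
    """
    before = conn.total_changes
    for rec in records:
        # Serialize deterministically so identical records hash identically.
        payload = json.dumps(rec, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # Already-seen versions hit the primary key and are skipped, so a
        # re-run only adds rows for records that are new or have changed.
        conn.execute(
            "INSERT OR IGNORE INTO raw_records VALUES (?, ?, ?, ?, ?)",
            (jurisdiction, source, str(rec["id"]), digest, payload),
        )
    conn.commit()
    return conn.total_changes - before
```

Because every distinct version of a record is kept, later cleanup steps can always be checked against what the source originally said, and a repeated fetch of unchanged data adds no rows.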

Again, the intent of the hack session is to make some actual progress towards this. We have a lot of inertia on the partnerships side (props to John and Asha for your hard work on that), but now it's time for the programming rubber to meet the road of campaign finance! (Okay, so that was a weird way to say it, you get the idea.)

If anyone has questions, or wants to discuss architectural details, here is a good place to do that.

Best,
Tom

John C. Osborn

May 11, 2015, 1:16:25 PM

to Tom Dooner, ope...@googlegroups.com

Thanks Tom! This is a great breakdown.

Cheers~

--
OpenCal Discourse http://104.131.98.144/
---
You received this message because you are subscribed to the Google Groups "OpenCalifornia" group.
Visit this group at http://groups.google.com/group/opencal.

--

John C. Osborn
Digital Communications Manager
EdSource
707-845-7332
www.johncosborn.com
@bayreporta
Flickr: thereporta


Ray Kiddy

May 12, 2015, 5:26:35 PM

to ope...@googlegroups.com

On Sat, 9 May 2015 20:03:40 -0700
Tom Dooner <tomd...@gmail.com> wrote:

> Hi all,
>
> It was good seeing many of you this morning at the CFA HQ. We made
> some good progress getting on the same page about goals and desired
> architecture. For anyone that couldn't make it, you can relive it
> vicariously through the minutes
> <https://docs.google.com/document/d/16iJ30sORg2hBW9mKNPXsMs8uS4XvmB9OzHD-qGx8_Rg/edit>

I was sorry to have missed the meeting.

> We believe we have enough of an idea of where we want to go that
> *it's time to start building something together.*

Defn!

> *Next Saturday (May 16th),* will be the first (of hopefully many) hack
> sessions where we start building and designing where we're going from
> here. If you can't make it, fear not, there will be more
> opportunities. The agenda for this next week is simple: start working
> on the backend that will ETL the data from NetFile.

I will look forward to being there.

> Thanks to Peter and Code for SF, we will be having this hack day at
> the CFA HQ from *10am-2pm*.
>
> To review the architecture we're thinking about building (speak now or
> forever hold your peace):
>
> - We are building a shared backend across multiple jurisdictions

Are we going to start with the CCDC as a base? If we want to generalize
that project, adding modifiers for the particular government entity,
the CCDC people may be ok with that. Then their code would be just a
single "CA" case, next to the "SJC", "SFC", "OAK" and "SCT" cases.
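
As a rough sketch of what such per-jurisdiction cases might look like, assuming the four local codes are served by NetFile and "CA" by Cal-Access (an assumption drawn from the plan quoted in this thread, not from the CCDC code), one could start from a small registry. The `Jurisdiction` class, the `source` labels, and `codes_for_source` are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Jurisdiction:
    code: str    # short code for the government entity, e.g. "OAK"
    source: str  # hypothetical label for where its filings come from


# Assumed mapping: the state is one case alongside the local ones.
JURISDICTIONS = {
    j.code: j
    for j in (
        Jurisdiction("CA", "cal-access"),
        Jurisdiction("SJC", "netfile"),
        Jurisdiction("SFC", "netfile"),
        Jurisdiction("OAK", "netfile"),
        Jurisdiction("SCT", "netfile"),
    )
}


def codes_for_source(source):
    """Return the jurisdiction codes served by one data source, sorted."""
    return sorted(code for code, j in JURISDICTIONS.items() if j.source == source)
```

Shared code would then take a jurisdiction code as a parameter instead of hard-coding the state-level case.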

>    - First we will focus on pulling all data from the NetFile API
>      <https://netfile.com/connect2>
>    - Later we will focus on Cal-Access
>      <http://www.sos.ca.gov/campaign-lobbying/cal-access-resources/raw-data-campaign-finance-and-lobbying-activity/>
> - It is important to maintain a copy of the original source data
>    - This will help us audit any transformations we apply to the data
>      (deduplication, associations, name cleanup, etc.)

If you want to scope out the de-duplication task, see this table:

http://opencalaccess.org/calaccess_raw_20150509_000246_PDT/calaccess_campaign_browser_identity.sql.gz

FYI, for anyone who wants to peruse it, the table at
http://opencalaccess.org/calaccess_raw_20150509_000246_PDT/calaccess_campaign_browser_identity_attribute.sql.gz
contains links from those names to the ID numbers assigned in the DIME
project (http://data.stanford.edu/dime). I found matches in the
Cal-Access data for 87,778 names in the DIME data.
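
To give a feel for the simplest part of the de-duplication task, here is a minimal sketch that groups names which differ only in case, punctuation, spacing, or "Last, First" ordering. It is plain standard-library Python, not the dedupe library's approach and not how the identity tables above were built; the function names and normalization rules are assumptions for illustration.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Reduce a name to a crude comparison key."""
    name = name.strip()
    if "," in name:
        # Treat "Last, First" as "First Last".
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    # Lowercase, drop punctuation, collapse runs of whitespace.
    name = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(name.split())


def group_candidates(names):
    """Return only the keys shared by more than one raw spelling."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalize_name(raw)].append(raw)
    return {key: raws for key, raws in groups.items() if len(raws) > 1}
```

Exact-key grouping like this only catches trivial variants; misspellings, nicknames, and distinct people sharing a name need fuzzier matching and human review.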

>    - This will also help us perform subsequent updates
>      incrementally, which will be a helpful performance benefit
> - Python is our *lingua franca* due to the team's general
>   familiarity with it as well as the many available data processing
>   tools like dedupe and the CCDC Cal-Access apps.
> - We will be using a relational data store, likely PostgreSQL.

Technical merits aside, MySQL probably has more presence in the
database world. How strong is the preference here?

> - This will be a new repo under the opencalifornia github repo.
>
> Again, the intent of the hack session is to make some actual progress
> towards this. We have a lot of inertia on the partnerships side

Was that "inertia"? Or "momentum"? :-) We do not need more inertia....

> (props to John and Asha for your hard work on that), but now it's
> time for the programming rubber to meet the road of campaign finance!
> (Okay, so that was a weird way to say it, you get the idea.)
>
> If anyone has questions, or wants to discuss architectural details,
> here is a good place to do that.
>
> Best,
> Tom

Look forward to doing some code!

cheers - ray
