other entity resolution projects


skyebend

Oct 2, 2009, 4:08:23 PM
to Data Commons
Hi folks,
Just an FYI, but two Sunlight grantees, LittleSis.org and the EDGAR
API project at CorpWatch (api.corpwatch.org), have been doing a fair
amount of work on entity resolution of individual and corporation
names. I think it would be great to see if there is any way we can
recycle any of the matching procedures, experience, or aliasing data
into the commons project. I believe it is something MapLight.org is
working on as well, and certainly CRP and NIMSP have years of
experience. What other projects out there are working on this? It
would be great if we could assemble some best practices from various
groups' procedures.

A few questions:

Are you planning to include ways to link together the various IDs an
individual might have over their political career in multiple
jurisdictions? Likewise, to assemble IDs of companies/orgs that merge
or split over time? Or is the idea just to group all the data into
one place where people can name-match it?

Is the idea to use the schema to represent the "raw data" from FEC/
state filings, or that this will be a cleaned/aggregated version with
standardized names? I think it would be great if it can represent
both, so that ideally there can be a framework for groups to submit
back cleaning and corrections they have made.

best,
-skye

Jeremy Carbaugh

Oct 2, 2009, 4:44:45 PM
to Data Commons
So here is a general description of the approach we are taking so far.
Almost all the data we work with can be thought of as transactional or
relationship records:

* A gives money to B
* B sponsors earmark to C
* C hires lobbyist D

All of the participants in the relationships are ultimately abstract
entities: the underlying data store itself should not care whether a
participant is an organization, committee, individual, or something
else. We are splitting this into two distinct problems: entities and
transactions.
Entities represent participants in transactions. An entity can have
any ID associated with it or even multiple IDs from the same data
source. Say we have Exxon in NIMSP data, Exxon in CRP data, and
another EXXXON (misspelling) also in CRP data. We can merge these into
one entity with ID references for all three of the original entities.
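
To make the merge concrete, here is a minimal sketch in Python. It is only an illustration, not the actual schema: the `Entity` class, the `merge` helper, and the source IDs (`n-100`, `c-200`, `c-201`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity: one internal ID plus any number of source IDs."""
    entity_id: int
    # (source, source_id) pairs; several pairs may come from the same source.
    source_refs: set = field(default_factory=set)


def merge(target, *others):
    """Fold the source ID references of `others` into `target`."""
    for other in others:
        target.source_refs |= other.source_refs
    return target


# Made-up IDs standing in for the three original records.
exxon_nimsp = Entity(1, {("NIMSP", "n-100")})
exxon_crp = Entity(2, {("CRP", "c-200")})
exxxon_crp = Entity(3, {("CRP", "c-201")})  # the misspelled EXXXON

merged = merge(exxon_nimsp, exxon_crp, exxxon_crp)
```

After the merge, the single surviving entity carries references to all three original IDs, including two from CRP.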

While we are transforming contribution data from CRP and NIMSP to fit
into one common contribution schema, we are leaving the data within
the transactional records untouched. Each record will contain the
original names of the contributors, organizations, committees, and
recipients, but we will also be adding entity reference IDs for each.
So even if we merge Exxon and Exxxon, the entity reference IDs will be
updated, but the original names will remain on the transactional
records. We are not using the raw FEC data so we will trust anything
we get from CRP and NIMSP. There are no plans to change anything about
their data except the schema and the IDs of the reference entities.
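
A rough sketch of what that means for a record when two entities are merged. The field names, the `repoint` helper, and the entity IDs are assumptions for illustration only, not the real contribution schema.

```python
def repoint(records, old_id, new_id):
    """Update entity reference IDs after a merge; leave original names alone."""
    for record in records:
        # Hypothetical reference fields; the real schema may name them differently.
        for key in ("contributor_entity", "recipient_entity"):
            if record[key] == old_id:
                record[key] = new_id
    return records


# Two made-up contribution records, one with the misspelled name.
records = [
    {"contributor_name": "Exxon", "contributor_entity": 2,
     "recipient_name": "Committee B", "recipient_entity": 9},
    {"contributor_name": "EXXXON", "contributor_entity": 3,
     "recipient_name": "Committee B", "recipient_entity": 9},
]

# Merge entity 3 (EXXXON) into entity 2 (Exxon).
repoint(records, old_id=3, new_id=2)
```

Both records now point at entity 2, but the second one still reads "EXXXON" in its name field, exactly as it came from the source.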

As for keeping track of the merging/splitting of corporations, that's
not something that is in the plans now. Each contribution record has
fields for the organization and parent organization at the time that
the contribution was made. So if A Corp. merges with B Corp, we would
not merge the entities because at some period in time they did
represent two distinct logical entities. However, future records would
show either only the parent corp or the org/parent relationship for
both corps.

The plan is to eventually have a system where anyone can suggest
merges so we will definitely be looking for help with that!

Jeremy