lob_issue import question

198 views
Skip to first unread message

Kyle

unread,
Mar 23, 2011, 10:19:43 PM3/23/11
to OpenSecrets Open Data
Hi everyone,

I'm trying to convert the bulk lobbying files into STATA format. Using
an ad-hoc combination of perl/STATA commands, I've managed to perform
the import for everything except the lob_issue.txt file.

When I open the file in a standard text editor, it appears that the
"specific issue" field is quite large and has been subdivided by line
breaks. The true line break is now shown as a double line break.

Has anyone found a solution for importing this file? Excel has also
been unhelpful.

Thanks,
Kyle

Susi Alger

unread,
Mar 25, 2011, 3:45:27 PM3/25/11
to opensecret...@googlegroups.com
Hello all,

I know that importing some of the data we are providing is a challenge.  I know because CRP had to import this data ourselves and often in a format much extended and messier than the closer-to-normalized and often cleaned and standardized version that we create for OpenData.  One of the most challenging tables we provide in OpenData is the lob_issue table because the specific issue column contains data entered by the filer and this field is not modified in any way by CRP.  Not only is this an unusually large field (currently, the largest has almost 39,000 characters in the field), but it often contains formatting characters such as line breaks, tabs and other unprintable characters.  Having a data record that extends beyond a single line will choke most pre-programmed import processes of standard data packages.  While we do not use STATA here at CRP, we do use Microsoft SQL Server, and it is also notoriously touchy about importing data.  While there are lots of scripting options available using SSIS packages, we've found that this solution is also nearly impossible to get working on a consistent basis with some of our regular import needs. 

So how do we do it?  We generally use one of two methods to solve difficult import problems.  The first is to write a program, often but not always in PERL, to process the data one line at a time and load it into the SQL Server table a record at a time.  For some data (such as the lob_issue.txt file), a record is NOT necessarily a single line.  Instead, a record is defined by the number of fields (separated by columns and text qualified by pipes).  We have intentionally put the "year" field at the end of the lob_issues table record, because the year is always present.  As a result, you can craft a program that uses that year field as a way to determine if you've read the entire record or need to get the next line.  Another way that we sometimes import a difficult text file into SQL server without programming is to simply load the entire file into one very large text field.  Each line of the text file appears as a single record in SQL Server.  We then programmatically process the file within SQL Server to parse the data into fields and records. 

What can you do if you don't have access to programming skills?  I propose that we begin providing two versions of the lob_issues.txt file -- one that contains the specific issue field and one that leaves it out entirely.  This will allow those of you without the time, skill or patience to tackle the scripting solutions needed for many packages to load the table with the specific issue information to access the general issue information we provide.  For those who need the specific issue field, I hope some of my suggestions help you see a solution using your database programs.  Let me know if this dual-file works for you folks, and I'll see what I can do to make it happen asap.

Susi

Susan Alger
IT Director
Center for Responsive Politics
www.OpenSecrets.org







--
You received this message because you are subscribed to the Google Groups "OpenSecrets Open Data" group.
To post to this group, send email to opensecret...@googlegroups.com.
To unsubscribe from this group, send email to opensecrets-open...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/opensecrets-open-data?hl=en.


Kyle Meng

unread,
Mar 25, 2011, 4:10:10 PM3/25/11
to opensecret...@googlegroups.com, Susi Alger
Dear Susi,

Thanks for the extensive email.

Your proposed solution of offering a specific issue-free version is a good one. I think that'll be useful for most researchers.

I'm debating whether to write a script according to the algorithm you suggested in PERL. If I do, I'll be sure to post it on the listserve.

best,
Kyle
--
PhD student in Sustainable Development
Columbia University
http://www.columbia.edu/~km2455
Reply all
Reply to author
Forward
0 new messages