Hello all,
I know that importing some of the data we are providing is a challenge. I know because CRP had to import this data ourselves and often in a format much extended and messier than the closer-to-normalized and often cleaned and standardized version that we create for OpenData. One of the most challenging tables we provide in OpenData is the lob_issue table because the specific issue column contains data entered by the filer and this field is not modified in any way by CRP. Not only is this an unusually large field (currently, the largest has almost 39,000 characters in the field), but it often contains formatting characters such as line breaks, tabs and other unprintable characters. Having a data record that extends beyond a single line will choke most pre-programmed import processes of standard data packages. While we do not use STATA here at CRP, we do use Microsoft SQL Server, and it is also notoriously touchy about importing data. While there are lots of scripting options available using SSIS packages, we've found that this solution is also nearly impossible to get working on a consistent basis with some of our regular import needs.
So how do we do it? We generally use one of two methods to solve difficult import problems. The first is to write a program, often but not always in PERL, to process the data one line at a time and load it into the SQL Server table a record at a time.
For some data (such as the lob_issue.txt file), a record is NOT necessarily a single line. Instead, a record is defined by the number of fields (separated by columns and text qualified by pipes). We have intentionally put the "year" field at the end of the lob_issues table record, because the year is always present. As a result, you can craft a program that uses that year field as a way to determine if you've read the entire record or need to get the next line. Another way that we sometimes import a difficult text file into SQL server without programming is to simply load the entire file into one very large text field. Each line of the text file appears as a single record in SQL Server. We then programmatically process the file within SQL Server to parse the data into fields and records.
What can you do if you don't have access to programming skills? I
propose that we begin providing two versions of the lob_issues.txt file -- one that contains the specific issue field and one that leaves it out entirely. This will allow those of you without the time, skill or patience to tackle the scripting solutions needed for many packages to load the table with the specific issue information to access the general issue information we provide. For those who need the specific issue field, I hope some of my suggestions help you see a solution using your database programs. Let me know if this dual-file works for you folks, and I'll see what I can do to make it happen asap.
Susi
Susan Alger
IT Director
Center for Responsive Politics
www.OpenSecrets.org