The pipes, the pipes are calling

748 views
Skip to first unread message

logan.jpr

unread,
Apr 17, 2013, 6:14:05 PM4/17/13
to OpenSecrets Open Data
Hello everyone,

At the risk of flossing a dead horse, has there been a solution to the
'pipes in the bulk data' issue? As in, is access to the raw
(specifically CampaignFin12!) data available in a .csv format, or
something where the '|' values have been removed or replaced by
something legible to say postgres or R? I know it isn't strictly an
issue to many opensecrets users but it is rather a pain for others of
us. Thanks.

-L

Spencer Graves

unread,
Apr 17, 2013, 7:53:52 PM4/17/13
to opensecret...@googlegroups.com, logan.jpr
Hi, Logan et al.:  


On 4/17/2013 3:14 PM, logan.jpr wrote:
Hello everyone,
| 
At the risk of flossing a dead horse, has there been a solution to the
'pipes in the bulk data' issue? As in, is access to the raw
(specifically CampaignFin12!) data available in a .csv format, or
something where the '|' values have been removed or replaced by
something legible to say postgres or R? 


      It's not as easy as just deleting all the pipes, because some candidates names include a comma or two (e.g., Kereti, Mata'Utia, Jr). 


      Ignoring that, the following R code will parse the "|,|" syntax: 


      indivs12raw <- readLines('indivs12.txt')


      indivs12. <- strsplit(indivs12raw, '|,|', TRUE)


      *** However, the number of fields per record is not constant: 


      ni <- sapply(indivs12., length)


      table(ni)
ni
      9      10      11      12      13      14      15      16      17      18
     28      16     570    1245    1268    2758   75486  522445    7507 2954465
     20
 214743


      I'm the maintainer for the "Ecdat"package on R-Forge (https://r-forge.r-project.org/projects/ecdat/).  I could add to that package functions to parse these files.  However, the fact that the number of fields per record range from 9 to 20 reduces my interest in trying to develop such functions.  If you want to, I'd be willing to help -- and make the functions available more widely via "Ecdat". 


      I've been trying to learn more about the tools that OpenSecrets.org already provides, e.g., at "www.opensecrets.org/resources/create/tools.php".  They use MS SQL.  They invite people to submit certain types of queries directly to their database, but I have not yet learned how.  I'm currently looking also at "https://www.opensecrets.org/resources/create/api_doc.php". 


      I'm currently interested in getting the contributions received by each member of the House from the finance industry to extend the analysis of the recent Ag committee vote to further deregulate derivatives (http://maplight.org/data-release/why-is-the-house-agriculture-committee-voting-to-deregulate-financial-derivatives-contr, http://maplight.org/data-release/bill-to-revise-dodd-frank-derivatives-passed-house-cmte-top-4-banks-contribute-78-times).  I have the vote record;  I need the contributions.  Suggestions welcomed. 


      Spencer


I know it isn't strictly an
issue to many opensecrets users but it is rather a pain for others of
us. Thanks.

-L



-- 
Spencer Graves, PhD
Executive Director
effectivedefense.org
751 Emerson Ct.
San José, CA 95126
ph:  408-655-4567
web:  www.effectivedefense.org

Ron Zucker

unread,
Apr 18, 2013, 5:21:51 PM4/18/13
to opensecret...@googlegroups.com
On a Mac, I used the sed command. I can send you the full instructions if you're curious.


--
Ron Zucker

“There is nothing wrong with America that cannot be cured by what is right with America." -- Bill Clinton



-L

--
You received this message because you are subscribed to the Google Groups "OpenSecrets Open Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensecrets-open...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



message...@gmail.com

unread,
Apr 22, 2013, 8:21:42 PM4/22/13
to opensecret...@googlegroups.com
I'm new to this group so pardon me if I have missed something.

I want to take advantage of the enhancements that CRP has made to the FEC data.

I'm finding that the pipe delimiters are very inconsistent and I don't understand how to get around them. For example,

In the Individual Contributions table the amount and date are within pipes AND separated by commas. How do I work around that?

In the Cmtes12 file there are pipes with three commas inside for no reason that I can discern:
|2012|,|C00000042|,|Illinois Tool Works|,,|Illinois Tool Works|,|C00000042|,|PB|,,,|M2300|,|Hoovers|,|N|,|0|,1

Why are there pipes AND commas?

Thanks. I apologize if this is a dumb set of questions.

...Bob

Robert Weyant

unread,
Jul 25, 2013, 9:53:51 PM7/25/13
to opensecret...@googlegroups.com
Yes, I'm finding the pipes very frustrating.  The problem that you are seeing with the extra commas is that not all fields are populated.  It seems if the field is blank, there are no pipes used.  The pipes are used because some of the fields have commas in them, e.g. "ENTERPRISE HOLDINGS, INC. PAC."  Unfortunately, the lack of pipes in the blank fields is a problem I'm still trying to get around.

Marjorie Roswell

unread,
Oct 6, 2013, 4:15:25 PM10/6/13
to opensecret...@googlegroups.com
Yes, those pipes and commas still add to the pain of getting this data into postgres. What's to be done?




On Wednesday, April 17, 2013 6:14:05 PM UTC-4, logan.jpr wrote:

Luke Rosiak

unread,
Oct 6, 2013, 6:24:52 PM10/6/13
to opensecret...@googlegroups.com
What's wrong with, in postgres, COPY tablename FROM 'filename' WITH
CSV QUOTE '|' DELIMITER ','; ?

I don't see what the big deal is that another character is used for
the quote character instead of a quotation mark. It's certainly not
the default, but every CSV library, including psql and R, allow you to
specify the quote character.

logan.jpr

unread,
Dec 12, 2013, 4:20:32 PM12/12/13
to opensecret...@googlegroups.com
The only problem with it had been my basic lack of knowledge. Thank you for the help!

Cezary Podkul

unread,
Dec 12, 2013, 5:04:45 PM12/12/13
to opensecret...@googlegroups.com
You can also use an import wizard like Navicat, which allows you to set the delimiters and text indicators. I did that and it didn't have any issues with the pipes being used in lieu of quotes.

Cezary


To unsubscribe from this group and stop receiving emails from it, send an email to opensecrets-open...@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.



--
Cezary Podkul | @Cezary



James Lake

unread,
Feb 10, 2014, 8:52:58 AM2/10/14
to opensecret...@googlegroups.com
As others said above, the issue with the cmtesYY.txt files is there are two delimiters: pipes "|" and commas ",". Well, the two delimiters isn't really the problem. The problem is that the comma is also used within the PACshort and UltOrg fields which means if you tell whatever program about the comma delimiter then it will split up the name field.

But there is a very simple fix using notepad (or whatever text editor you want) and excel. First, use find and replace in the text editor (control-h in notepad) to replace ",," with ",| |,". And do this twice. Then import into excel and tell excel that the only delimiter is "|". Everything is now fine in excel (importantly, the name fields haven't been split) except you will have a bunch of columns that are filled with ",". But that's no problem because you can just delete them. Actually, the last column is also messed up because it is filled with either ",0" or ",1". If this column is needed for data purposes, you can just use text to columns in excel to split it.

There is also an easy way to verify that everything is ok. You can add a row in excel at the top. Then highlight everything and add a filter to the data. In newer versions of excel, this is on the "data" tab. Then you can click on the dropdown arrow in the far right columns to see all the values in the column. For example, the second last column should only have 0s and 1s. So if it has blanks (for example) then something screwy happened. The last columns should only have ",1" or ",0". So, again, everything is ok unless you see blanks (which will appear as an entry in the drop down if there are any).

Susi Alger

unread,
Feb 10, 2014, 1:59:27 PM2/10/14
to opensecret...@googlegroups.com
Thanks for sharing your strategy, James.  We are compiling quite a list of methods that our OpenData users using various software packages.  I did want to clarify one thing about our structure and note that we do NOT have two different delimiters in these files.  There's only one character that delimits fields, and that's the comma.  The pipes surround all text fields and only text fields.  There are no pipes around numeric or date fields.  Called text qualifiers by a lot of software packages, text delimiters in others, the pipes are what lets your software know that the commas contained within the data of a text field are part of the field and are not indicating the start of a new field.  

Susi






--

James Lake

unread,
Feb 10, 2014, 2:11:33 PM2/10/14
to opensecret...@googlegroups.com
No problem Susi. And I can certainly see your point that calling something a delimiter or not would be important in certain circumstances.

It probably goes without saying but, given you talk about different people using different software packages, once the above strategy is followed you can use excel to save the file in various text formats such as CSV and tab delimited (among others) which can then be used to import into various software packages such as STATA, R, MATLAB etc...

Gary Langfelder

unread,
Feb 21, 2014, 11:55:07 AM2/21/14
to opensecret...@googlegroups.com
James,

Nice solution, and thank you for sharing it!
I was able to import the file into a Google Spreadsheet in seconds.

Regards,

Gary

Rahul Gupta

unread,
Mar 19, 2014, 9:43:47 PM3/19/14
to opensecret...@googlegroups.com
I'm having quite a bit of trouble with pipes in the context of the indivs00.txt file. There are situations where pipes are used in some cases and are not in other cases. For example, every row has |,,,|, which should actually be |,||,||,| with pipes (they are text fields). There are occasional cases of |,,| as well throughout the file, instead of |,||,|, (which may actually not be text fields) while there are situations of having ,, within an actual field. 

There also happens to be 25 rows that are broken into 2 lines (not sure if that was caused by my text editor or not). 

To your point Susi, I see the desire to differentiate between text and non-text, but I believe software/scripts are smart enough to determine that on their own. Dealing w/ pipes for certain situations and not other situations seems to make things more complicated. 

Appreciate the help!

- Rahul

Susi Alger

unread,
Mar 20, 2014, 12:59:49 PM3/20/14
to opensecret...@googlegroups.com
Hello Rahul,

I've looked at the indivs00.txt file, and I do not find some of the things you're reporting.  For example, I searched the file for the string "|,,,|,", which as I understand you're saying that you're finding on every line.  But I'm not finding it anywhere in the file.  Also, there are no records broken into two lines - you should have 1,719,688 records in the file.

Yes, most software programs can detect the type of data contained in a field, but most off the shelf software cannot tell the difference between a comma that is separating fields and a comma that is contained within a single text field.  Text fields are most often surrounded by quotes or pipes or some other character to help the importing software distinguish between a new field or the field separator inside a text string.

If you could provide me with more information, like row number, for a record you believe to be malformed, I'd be happy to take a look.  I know that these huge record sets can be difficult to work with.  We expect to release files next week aimed at providing a smaller and easier to import set of data for members of Congress and states.  These files should import with a single click into programs like Excel, and you may find them useful for your needs.

Susi

Susan Alger
IT Director
Center for Responsive Politics


--
You received this message because you are subscribed to the Google Groups "OpenSecrets Open Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensecrets-open...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rahul Gupta

unread,
Mar 21, 2014, 2:54:07 AM3/21/14
to opensecret...@googlegroups.com
Hey Susi,

I don't have the full list because I edited my copy of the list, but one example of a row broken into two is the row (and the row beneath it) with FECTransID = 0647177. I believe all the malformed rows are split on Progress Energy.
The line looks like this.
|2000|,|0647177|,|f0000032472 |,|KORPAN, RICHARD|,|N00009908|,|Florida Progress Corp|,|Progress Energy
Progress Energy|,|E1600|,10/04/1999,250,||,|OLDSMAR|,|FL|,|34877|,|RN|,|15 |,|C00335521|,||,|M|,|FLORIDA PROGRESS CORP|,|20020041828|,,,|S&P98|

As for the |,,,|, the first line has it right before source.
|2000|,|0000002|,|f0000083672 |,|HOUCK, GARY|,|C00244863|,||,||,|B3000|,01/06/1999,220,||,|HARRISBURG|,|PA|,|17112|,|PB|,|15 |,|C00244863|,||,|M|,|HOUCK SERVICES INC|,|99990012024|,,,|P/PAC|

An example of ,, within a field is the following at Weiss,, Sarota & Heifernan.
|2000|,|1379304|,|f0000060737 |,|WEISS, JAY|,|N00009930|,|Weiss,, Sarota & Helfernan|,||,|Y4000|,06/01/1999,1000,||,|FORT LAUDERDALE|,|FL|,|33331|,|3L|,|15 |,|C00349951|,||,|M|,|WEISS, SAROTA & HELFERNAN|,|99020103721|,,,||

I fully agree with you that it's pretty much impossible to use effectively use only commas as delimiters while also using them within fields. However, I believe this problem can be solved with another one character delimiter, such as a semicolon, instead of a delimiter that is determined on a field by field basis. It makes things harder on creating/your end, leading to mistakes, and on the receiving/my end, leading to more complex parsing and a fear that the data is malformed and not consistent. 

Appreciate all the help.

- Rahul

James Lake

unread,
Mar 21, 2014, 9:21:28 AM3/21/14
to opensecret...@googlegroups.com
Hi Rahul and Susi,

I have never used the indivsYY.txt files, but I just looked at it to see if the method I described earlier would work. Short answer is yes. But, there are two issues.

First, I also found that when I opened the .txt file in a text editor (I used notepad) that some Progress Energy observations were split across multiple lines. For example, the observation with an ID variable of 0174841 (I'm not sure exactly what ID this is because I haven't used the data before, but basically looks like the line number in the text file). It's not all Progress Energy observations but it seems like its only Progress Energy observations. For below, it's useful to keep in mind the next such observation is 0647177.

Second, as Rahul noted, there are instances of strings of multiple commas. You can use search and replace in the text editor (control+H in notepad) to search for ",," and replace it with ",| |,". Given there are strings with multiple commas, then this needs to be done 5 or 6 times. Then, there will be no instances of ",," left in the file. So, for the purpose of importing into Excel, you can use "|" as the only delimiter. Of course, you can then save it as various text file formats in Excel to import into whatever statistical software package you want.

Having fixed the second issue you can then import into Excel and follow the same procedure I described above (well, the newer versions of excel only have about 1.4 million rows, so you would need to split the .txt file into two files) by telling Excel that the only delimiter is "|". The only problems you should come across now are the Progress Energy observations which are split across multiple lines in the .txt file including the first two instances for observations 0174841 and 0647177. If you go back to the .txt file and fix it manually, then import it into Excel again following the same procedure then you will see that this observation has now been fixed. In the first 1.4 million or observations this only happens a handful of times so it's not a big deal to fix manually.

James

Rahul Gupta

unread,
Mar 21, 2014, 10:40:22 AM3/21/14
to opensecret...@googlegroups.com
James,

I do not think your method of replacing all instances of ",," with ",||," because there are situations where the ",," are within a field. Therefore, your technique would split a field into two fields.
For example,

|2000|,|1379304|,|f0000060737 |,|WEISS, JAY|,|N00009930|,|Weiss,, Sarota & Helfernan|,||,|Y4000|,06/01/1999,1000,||,|FORT LAUDERDALE|,|FL|,|33331|,|3L|,|15 |,|C00349951|,||,|M|,|WEISS, SAROTA & HELFERNAN|,|99020103721|,,,||

would become

|2000|,|1379304|,|f0000060737 |,|WEISS, JAY|,|N00009930|,|Weiss,||, Sarota & Helfernan|,||,|Y4000|,06/01/1999,1000,||,|FORT LAUDERDALE|,|FL|,|33331|,|3L|,|15 |,|C00349951|,||,|M|,|WEISS, SAROTA & HELFERNAN|,|99020103721|,||,||,||


- Rahul

Rahul Gupta

unread,
Mar 21, 2014, 10:43:13 AM3/21/14
to opensecret...@googlegroups.com
Just to clarify, based on the data dictionary, "Weiss,, Sarota & Helfernan" is the Orgname because the previous value is RecipID, and the value after the following blank value is the 5 digit RealCode. Therefore, "Weiss,, Sarota & Helfernan" is one field with 2 commas.

Ron Zucker

unread,
Mar 21, 2014, 10:47:39 AM3/21/14
to opensecret...@googlegroups.com
If you're uncomfortable with pipes and your database software can't handle them as the text encloser (MySQL can't, but POSTGRESQL can, and both are free), you can replace them with quotes (I had to when using MySQL for a lobbying analysis). It is not unusual for a blank field to have no quotes around it, and your database package will have no problem with it. Just keep the commas. Yes, if the data is screwy, as in your Weiss,, Sarota & Helfeman, it will look weird before you do the import, but the data ought to be correct that way. My only advice is to move the data to server so you can do a load data local infile command instead of an import from CSV.

For what it's worth, I now want to start a company, PAC or organization with two consecutive commas and a backslash followed by an R (\R) in the name, just to see how badly it screws up every database analyst on earth. 


--
Ron Zucker

“Every gun that is made, every warship launched, every rocket fired signifies in the final sense, a theft from those who hunger and are not fed, those who are cold and are not clothed. This world in arms is not spending money alone. It is spending the sweat of its laborers, the genius of its scientists, the hopes of its children. This is not a way of life at all in any true sense. Under the clouds of war, it is humanity hanging on a cross of iron.” ― Dwight D. Eisenhower


--

Rahul Gupta

unread,
Mar 21, 2014, 10:48:16 AM3/21/14
to opensecret...@googlegroups.com
The other problem with using "|" as the only delmiter (which was basically what I was doing) is that fields separated by commas, like date, amount, get viewed as only one field.

Dan Keating

unread,
Mar 21, 2014, 10:50:20 AM3/21/14
to opensecret...@googlegroups.com
Dan Keating
Washington Post

Rahul Gupta

unread,
Mar 21, 2014, 10:55:14 AM3/21/14
to opensecret...@googlegroups.com
Dan - That is hilarious.

Ron - That is a great idea replacing pipes with quotes. I was trying to avoid setting up a db, however. We'll see how long I can last!

Mike Shultz

unread,
Mar 21, 2014, 11:04:09 AM3/21/14
to opensecret...@googlegroups.com
I'd just like to point out that doing a blind replace of pipes with quotes is going to make you have a bad day.  They chose pipes on purpose as the string encapsulator because some of their string fields have quotes in them, like so: 

|"Rockefeller Financial Services|

It's a pretty straight forward concept that the likes of Excell(or the likes of SQL, something like COPY table (a, b, c, ...) FROM '/path/to/indivs12.txt' WITH CSV QUOTE AS '|';) can understand.  Just make sure when importing it, that the text delimiter is set to a pipe and not a quote.


On Fri, Mar 21, 2014 at 8:55 AM, Rahul Gupta <rahulka...@gmail.com> wrote:
Dan - That is hilarious.

Ron - That is a great idea replacing pipes with quotes. I was trying to avoid setting up a db, however. We'll see how long I can last!

--
You received this message because you are subscribed to the Google Groups "OpenSecrets Open Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensecrets-open...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Mike Shultz
IT Director
Project Vote Smart

Bob Lannon

unread,
Mar 21, 2014, 11:12:27 AM3/21/14
to opensecret...@googlegroups.com
Hey all,

just chiming in to mention a really helpful repo with scripts to import opensecrets releases to postgres:



On Friday, March 21, 2014 11:04:09 AM UTC-4, Mike Shultz wrote:
I'd just like to point out that doing a blind replace of pipes with quotes is going to make you have a bad day.  They chose pipes on purpose as the string encapsulator because some of their string fields have quotes in them, like so: 

|"Rockefeller Financial Services|

It's a pretty straight forward concept that the likes of Excell(or the likes of SQL, something like COPY table (a, b, c, ...) FROM '/path/to/indivs12.txt' WITH CSV QUOTE AS '|';) can understand.  Just make sure when importing it, that the text delimiter is set to a pipe and not a quote.
On Fri, Mar 21, 2014 at 8:55 AM, Rahul Gupta <rahulka...@gmail.com> wrote:
Dan - That is hilarious.

Ron - That is a great idea replacing pipes with quotes. I was trying to avoid setting up a db, however. We'll see how long I can last!

--
You received this message because you are subscribed to the Google Groups "OpenSecrets Open Data" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensecrets-open-data+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

James Lake

unread,
Mar 21, 2014, 12:15:56 PM3/21/14
to opensecret...@googlegroups.com
Yes, you are entirely right Rahul. I noticed this, but then forgot about it when I was writing the post. I noticed it happens here with the date and amount fields here (and perhaps more). But this is not a big problem once it's in Excel, because you can use text to columns to split it.

Rahul Gupta

unread,
Mar 21, 2014, 1:10:29 PM3/21/14
to opensecret...@googlegroups.com
Awesome, Bob! I think I might just start integrating with your work to save time.

Thanks.


On Fri, Mar 21, 2014 at 9:15 AM, James Lake <lake...@gmail.com> wrote:
Yes, you are entirely right Rahul. I noticed this, but then forgot about it when I was writing the post. I noticed it happens here with the date and amount fields here (and perhaps more). But this is not a big problem once it's in Excel, because you can use text to columns to split it.

--
You received this message because you are subscribed to a topic in the Google Groups "OpenSecrets Open Data" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/opensecrets-open-data/Isg5lTdZwx0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to opensecrets-open...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Rahul Gupta
Software Developer - Okta Inc.
University of California, Berkeley
B.S. Electrical Engineering and Computer Science |  2013 

Susi Alger

unread,
Mar 21, 2014, 1:17:36 PM3/21/14
to opensecret...@googlegroups.com
Thanks, Rahul, that made it easy to see what you were talking about and for the life of me I don't know why I didn't find it when I looked the other day.  I was so surprised not to find it that I searched several different ways. I'm currently chalking it up to March Madness.  In any case, I've fixed the file and am working my way through other files created at the same time that may have the same issue.  The new 2000 file will be posted on the site shortly, and others will follow later today and early next week.

I'm sorry for the inconvenience, Rahul, and thanks for helping make the data better for everyone!

Susi


--

Rahul Gupta

unread,
Mar 21, 2014, 10:33:41 PM3/21/14
to opensecret...@googlegroups.com
Awesome!

Thanks

- Rahul

Richard DiSalvo

unread,
Apr 27, 2015, 1:11:08 AM4/27/15
to opensecret...@googlegroups.com
Hi all,

In the expends14.txt file (candidate and committee expenditure reports bulk data) the pipe delimiting procedure seems not to work for 62 entries, since who was paid contains a pipe in their name (for example: "Cardinal Services, Inc | Workplace Solutions").

The reason I think this problem is there for 62 entries only is that's the number of entries I can't read in by simply (a) replacing every single quote " with a double quote "", (b) replacing every pipe with a quote, and then (c) reading in Stata using import delimited, ... delimiter(",") ...

Not sure what can be done to work around this, I couldn't think of anything.

Right now I'm importing using the following code in Stata running on Linux:

shell sed 's/"/""/g' <expends14.txt >expends14_version2.txt
shell sed 's/|/"/g' <expends14_version2.txt >expends14_version3.txt
import delimited using "expends14_version3.txt", delimiter(",") stringcols(_all) varnames(nonames) clear

- Richard


On Friday, March 21, 2014 at 10:33:41 PM UTC-4, Rahul Gupta wrote:
Awesome!

Thanks

- Rahul
Reply all
Reply to author
Forward
0 new messages