Pandas read_csv file importing error

1,115 views
Skip to first unread message

Manish Tripathi

unread,
Oct 20, 2013, 6:07:12 AM10/20/13
to pyd...@googlegroups.com

Hi,

I am trying to import a csv file in Pandas but it throws an error. The format of the data when opened in notepad++ is as follows with first row being column names:

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date"
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006"
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008"
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008"
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008"
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008"
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009"
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard / Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008"
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008"

Kindly note that the same file when double clicked in the csv format opens in excel with comma separated values BUT with NO quotation marks in each line as shown in notepad++.

I have used encoding as UTF-8 which gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte

Then used encoding='cp1252' first and then tried with latin1.

df=pd.read_csv(filename,encoding='cp1252') 

or 

df=pd.read_csv(filename,encoding='latin1')

With both the encodings it didn't give any error and the data got imported but as one single column and not as different columns.

Does it have to do with the "" marks present before each line in the data? I had a similar csv file with comma separated values, but that didn't have double quotation marks in each line and that got imported correctly both with cp1252 and latin1. But not for UTF-8 even though the file was saved in utf8 format in notepad++. But in this case utf8 doesnt work as usual and other two import it as single column.

Please advise.

Thanks

PS: I have asked the similar question on StackOverflow here, but heard somewhere this group might be a better place to get some help.

Paul Hobson

unread,
Oct 21, 2013, 12:04:18 AM10/21/13
to pyd...@googlegroups.com
On Sun, Oct 20, 2013 at 3:07 AM, Manish Tripathi <tr.m...@gmail.com> wrote:

Hi,

I am trying to import a csv file in Pandas but it throws an error. The format of the data when opened in notepad++ is as follows with first row being column names:

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date"
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006"
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008"
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008"
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008"
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008"
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009"
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard / Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008"
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008"

Kindly note that the same file when double clicked in the csv format opens in excel with comma separated values BUT with NO quotation marks in each line as shown in notepad++.

I have used encoding as UTF-8 which gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte

At first glance, nothing looks weird about that file. Which line is throwing the error?
-paul 

Manish Tripathi

unread,
Oct 21, 2013, 10:20:27 AM10/21/13
to pyd...@googlegroups.com
Hi Paul,

Yes that's right. There is nothing weird about the file. Its a normal Csv file where the columns are separated by commas and when opened in notepad++ you see each row within double quotes. 

The problem I think is occurring due to this double quotes in each line. I had a similar comma separated file but no line was inside double quotes and encoding='cp1252' and encoding='latin1' both worked fine in read_csv function of pandas. Again utf8 didn't work and threw the same error as the first error mentioned below. 

So my assumption is if I can somehow tell read_csv function to import the data by taking care of double quotes in each line then it should work with the above encodings. I guess since each line is inside double quotes even when encoding ='cp1252' or 'latin1' it is not able to parse the columns since sep=','  occurs inside the quotes and hence gets missed and what gets imported is a single columns instead of separate columns. 

Any idea how we can tackle such data in pandas?

Thanks

Richard Styron

unread,
Oct 21, 2013, 11:47:03 AM10/21/13
to pyd...@googlegroups.com
Manish,

Have you tried deleting the offending quote characters in a text editor before loading?  When I am dealing with ugly CSV files, I usually have some amount of preprocessing in vim before I can easily make a dataframe from the file.  If I am dealing with many files that all share the same formatting, I can make the preprocessing workflow pretty fast, and could script it if need be.

For your file, I did this in vim:
:2d                      (deletes line 2, which looked corrupted in your post)
:%s/^.\{1}/           (deletes the first character)
:%s/.\{1}$//g       (deletes last charcter)

and the file opened fine.

Richard


--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jeff Tratner

unread,
Oct 24, 2013, 5:28:10 PM10/24/13
to pyd...@googlegroups.com
Manish,

I answered your question about this on StackOverflow, did the answer there (read in, strip, the pass again to read_csv) work for you? I recognize it's a bit clunky but it ought to work.

Jeff

Reply all
Reply to author
Forward
0 new messages