
Reading 'scientific' csv using Pandas?


Martin Schöön

Nov 18, 2018, 7:43:13 AM
I am working on a project in which I am trying to get an overview of a
batch of computer-generated (finite element program) data. I have it
stored in a number of CSV files.

Reading the data into spreadsheet programs works fine but is very
labour-intensive, so I am working with Pandas in Jupyter notebooks,
which I find much more efficient.

Now I have hit a bump in the road: some of the data is not in plain
decimal notation (xxx,xx) but in 'scientific' (xx,xxxe-xx) notation.

I use read_csv; I have read its documentation and poked around for
information on this, but so far I have failed. Either I have found
the answer already but do not understand it, or I am asking the
search engines the wrong question.

My experience of Pandas is limited and I would appreciate some guidance.

/Martin

Shakti Kumar

Nov 18, 2018, 9:03:55 AM
On Sun, 18 Nov 2018 at 18:18, Martin Schöön <martin...@gmail.com> wrote:
>
> I am in this project where I try to get an overview of a bunch of
> computer generated (finite element program) data. I have it stored in a
> number of csv files.
>
> Reading the data into spreadsheet programs works fine but is very labour
> intensive so I am working with Pandas in Jupyter notebooks which I find
> much more efficient.
>
> Now I hit a bump in the road when some of the data is not in plain
> decimal notation (xxx,xx) but in 'scientific' (xx,xxxe-xx) notation.
>

Martin, I believe pandas should handle this itself while reading
the CSV file. I took an example in scientific notation and checked
this out.

my sample.csv file is,
col1,col2
1.1,0
10.24e-05,1
9.492e-10,2

and then I execute,

In [29]: a = pd.read_csv('sample.csv')
In [30]: a.values
Out[30]:
array([[1.100e+00, 0.000e+00],
       [1.024e-04, 1.000e+00],
       [9.492e-10, 2.000e+00]])
In [31]: a.values[1][0]
Out[31]: 0.0001024
As you can see, pandas has converted the scientific notation to float;
the data type of these values is even numpy.float64.

My best guess is a problem with your pandas version; there were some
relevant updates from the 0.17.x releases onwards. Maybe give upgrading
your pandas a shot with,
pip install --upgrade pandas
or, in case you're using Anaconda,
conda update pandas

> [snipped for brevity]
> /Martin
> --
> https://mail.python.org/mailman/listinfo/python-list

--
Shakti.

Martin Schöön

Nov 18, 2018, 1:22:22 PM
Den 2018-11-18 skrev Shakti Kumar <shakti.shr...@gmail.com>:
> On Sun, 18 Nov 2018 at 18:18, Martin Schöön <martin...@gmail.com> wrote:
>>
>> Now I hit a bump in the road when some of the data is not in plain
>> decimal notation (xxx,xx) but in 'scientific' (xx,xxxe-xx) notation.
>>
>
> Martin, I believe this should be done by pandas itself while reading
> the csv file,
> I took an example in scientific notation and checked this out,
>
> my sample.csv file is,
> col1,col2
> 1.1,0
> 10.24e-05,1
> 9.492e-10,2
>
That was a quick answer!

My pandas is up to date.

In your example you use the US convention: "." for decimals and ","
to separate fields. That works perfectly for me too.

However, my data files use European conventions: decimal "," and TAB
to separate data:

col1 col2
1,1 0
10,24e-05 1
9,492e-10 2

I use

EUData = pd.read_csv('file.csv', skiprows=1, sep='\t',
                     decimal=',', engine='python')

to read from such files. This works only so-so: 'common' floats (3,1415
etc.) parse just fine, but the 'scientific' values (1,6023e23) do not.

/Martin

Shakti Kumar

Nov 19, 2018, 5:33:57 AM
Hi Martin,

On Sun, 18 Nov 2018 at 23:59, Martin Schöön <martin...@gmail.com> wrote:
>
> Den 2018-11-18 skrev Shakti Kumar <shakti.shr...@gmail.com>:
> > On Sun, 18 Nov 2018 at 18:18, Martin Schöön <martin...@gmail.com> wrote:
> >>
> >> Now I hit a bump in the road when some of the data is not in plain
> >> decimal notation (xxx,xx) but in 'scientific' (xx,xxxe-xx) notation.
> >>
> >
> > Martin, I believe this should be done by pandas itself while reading
> > the csv file,
> > I took an example in scientific notation and checked this out,
> >
> > my sample.csv file is,
> > col1,col2
> > 1.1,0
> > 10.24e-05,1
> > 9.492e-10,2
> >
> That was a quick answer!
>
> My pandas is up to date.
>
> In your example you use the US convention of using "." for decimals
> and "," to separate data. This works perfect for me too.
>
> However, my data files use European conventions: decimal "," and TAB
> to separate data:
>
> col1 col2
> 1,1 0
> 10,24e-05 1
> 9,492e-10 2
>

A quick fix would be to replace all the commas in your file with stops (.).
If your file contains other stops, not only in the scientific-notation
columns, you could restrict this replacement to the columns of interest.
Meanwhile I will look for a cleaner way of loading this CSV into
pandas; I have never come across this comma notation before :)
Members of @pytho...@python.org, any better solution?
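
[Editor's note: a minimal sketch of that blanket-replacement idea. It
assumes TAB-separated data, so ',' only ever appears as a decimal mark;
the file name is made up for illustration.]

```python
import io
import pandas as pd

# Write a small sample file in the European format from the thread.
with open("file.csv", "w") as f:
    f.write("col1\tcol2\n"
            "1,1\t0\n"
            "10,24e-05\t1\n"
            "9,492e-10\t2\n")

# Blanket fix: turn every decimal comma into a stop before parsing.
# Safe here because TAB, not ',', separates the fields.
with open("file.csv") as f:
    text = f.read().replace(",", ".")

df = pd.read_csv(io.StringIO(text), sep="\t")
print(df["col1"][1])  # 0.0001024
```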

> I use
>
> EUData = pd.read_csv('file.csv', skiprows=1, sep='\t',
> decimal=',', engine='python')
>
> to read from such files. This works so so. 'Common floats' (3,1415 etc)
> works just fine but 'scientific' stuff (1,6023e23) does not work.
>
> /Martin

Peter Otten

Nov 19, 2018, 6:32:59 AM
Martin Schöön wrote:

> My pandas is up to date.
>
> In your example you use the US convention of using "." for decimals
> and "," to separate data. This works perfect for me too.
>
> However, my data files use European conventions: decimal "," and TAB
> to separate data:
>
> col1 col2
> 1,1 0
> 10,24e-05 1
> 9,492e-10 2
>
> I use
>
> EUData = pd.read_csv('file.csv', skiprows=1, sep='\t',
> decimal=',', engine='python')
>
> to read from such files. This works so so. 'Common floats' (3,1415 etc)
> works just fine but 'scientific' stuff (1,6023e23) does not work.

With

>>> with open("file.csv", "w") as f:
...     f.write("col1\tcol2\n"
...             "1,1\t0\n"
...             "10,24e-05\t1\n"
...             "9,492e-10\t2\n")
...
40

the following works on my system:

>>> pd.read_csv("file.csv", delimiter="\t", decimal=",")
           col1  col2
0  1.100000e+00     0
1  1.024000e-04     1
2  9.492000e-10     2

[3 rows x 2 columns]

The version is a bit old, though:

>>> pd.__version__
'0.13.1'

The engine="python" produces an exception over here:

"""
ValueError: The 'decimal' option is not supported with the 'python' engine
"""

Maybe you can try and omit that option?
If that doesn't work you can specify a converter:

>>> pd.read_csv("file.csv", sep="\t",
...             converters={0: lambda s: float(s.replace(",", "."))})
           col1  col2
0  1.100000e+00     0
1  1.024000e-04     1
2  9.492000e-10     2

[3 rows x 2 columns]


Martin Schöön

Nov 19, 2018, 3:41:49 PM
Den 2018-11-18 skrev Stefan Ram <r...@zedat.fu-berlin.de>:
> Martin =?UTF-8?Q?Sch=C3=B6=C3=B6n?= <martin...@gmail.com> writes:
>>to read from such files. This works so so. 'Common floats' (3,1415 etc)
>>works just fine but 'scientific' stuff (1,6023e23) does not work.
>
> main.py
>
> import sys
> import pandas
> import locale
> print( sys.version )
> print( pandas.__version__ )
> with open( 'schoon20181118232102.csv', 'w' ) as file:
>     print( 'col0\tcol1', file=file, flush=True )
>     print( '1,1\t0', file=file, flush=True )
>     print( '10,24e-05\t1', file=file, flush=True )
>     print( '9,492e-10\t2', file=file, flush=True )
> EUData = pandas.read_csv\
>     ( 'schoon20181118232102.csv', sep='\t', decimal=',', engine='python' )
> locale.setlocale( locale.LC_ALL, 'de' )
> print( 2 * locale.atof( EUData[ 'col0' ][ 1 ]))
>
> transcript
>
> 3.7.0
> 0.23.4
> 0.0002048
>
Thanks, I just tried this. The line locale.setlocale... throws an
error:

"locale.Error: unsupported locale setting"

Trying other strings instead of 'de' results in more of the same;
'' results in no errors.

The output I get is this:

3.4.2 (default, Oct 8 2014, 10:45:20)
[GCC 4.9.1]
0.22.0
0.0002048

Scratching my head and speculating: I run this in a virtualenv I have
created for Jupyter, pandas and whatever else I feel I need. Could the
locale be undefined there, and could that be what causes this?

/Martin

Martin Schöön

Nov 19, 2018, 3:44:16 PM
Too many files to go through them with an editor :-(

/Martin

Chris Angelico

Nov 19, 2018, 3:48:59 PM
On Tue, Nov 20, 2018 at 7:46 AM Martin Schöön <martin...@gmail.com> wrote:
> Thanks, I just tried this. The line locale.setlocale... throws an
> error:
>
> "locale.Error: unsupported locale setting"
>
> Trying other ideas instead of 'de' results in more of the same.
> '' results in no errors.

Haven't been reading in detail, but maybe "de_DE" will work better,
assuming you have that locale installed.

ChrisA

Martin Schöön

Nov 19, 2018, 3:59:57 PM
Den 2018-11-19 skrev Peter Otten <__pet...@web.de>:
> Martin Schöön wrote:
>
>> My pandas is up to date.
>>
>
> The engine="python" produces an exception over here:
>
> """
> ValueError: The 'decimal' option is not supported with the 'python' engine
> """
>
> Maybe you can try and omit that option?

Bingo!
No, I don't remember why I added that engine thing. It was two days ago!

> If that doesn't work you can specify a converter:
>
>>>> pd.read_csv("file.csv", sep="\t", converters={0: lambda s:
> float(s.replace(",", "."))})
> col1 col2
> 0 1.100000e+00 0
> 1 1.024000e-04 1
> 2 9.492000e-10 2
>
> [3 rows x 2 columns]
>
I will save that one for later. One never knows...

/Martin

Martin Schöön

Nov 19, 2018, 4:32:21 PM
I spoke too early. Upon closer inspection I get the first column with
decimal '.' and the rest with decimal ','. I have tried the converter
thing to no avail :-(

/Martin

MRAB

Nov 19, 2018, 6:37:02 PM
You passed {0: lambda s: float(s.replace(",", "."))} as the converters
argument, which means that it applies only to column 0.
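
[Editor's note: to cover every column, the same converter can be mapped
over all the column names. A sketch on the thread's sample data; reading
the header first with nrows=0 is one way to get the names up front.]

```python
import io
import pandas as pd

data = ("col1\tcol2\n"
        "1,1\t0\n"
        "10,24e-05\t1\n"
        "9,492e-10\t2\n")

def to_float(s):
    # The thread's converter: decimal comma -> stop, then float().
    return float(s.replace(",", "."))

# Read only the header to learn the column names...
cols = pd.read_csv(io.StringIO(data), sep="\t", nrows=0).columns
# ...then attach the converter to every column, not just column 0.
df = pd.read_csv(io.StringIO(data), sep="\t",
                 converters={c: to_float for c in cols})
```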

MRAB

Nov 19, 2018, 6:39:38 PM
On 2018-11-19 20:44, Martin Schöön wrote:
> Too many files to go through them with an editor :-(
>
If only Python could read and write files... :-)
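
[Editor's note: making the hint concrete, a sketch that batch-rewrites
the decimal commas in every matching file; the file name and glob
pattern are made up, and there is no backup step.]

```python
import glob

# Write one sample file in the European format from the thread.
with open("eu_sample.csv", "w") as f:
    f.write("col1\tcol2\n1,1\t0\n10,24e-05\t1\n9,492e-10\t2\n")

# Batch-fix: rewrite every matching file with '.' as the decimal mark.
# Only safe when ',' never appears as a field separator.
for path in glob.glob("eu_sample*.csv"):
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(",", "."))
```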

Thomas Jollans

Nov 19, 2018, 7:06:45 PM
This looks like a bug in the 'python' engine specifically. I suggest you
write a bug report at https://github.com/pandas-dev/pandas/issues

(conda:nb) /tmp
0:jollans@mn70% cat test.csv
Index Value
0 1,674
1 3,48e+3
2 8,1834e-10
3 3984,109
4 2830812370

(conda:nb) /tmp
0:jollans@mn70% ipython
Python 3.7.0 (default, Oct 9 2018, 10:31:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.read_csv('test.csv', header=[0], index_col=0, decimal=',',
   ...: sep='\t')
Out[2]:
              Value
Index
0      1.674000e+00
1      3.480000e+03
2      8.183400e-10
3      3.984109e+03
4      2.830812e+09

In [3]: pd.read_csv('test.csv', header=[0], index_col=0, decimal=',',
   ...: sep='\t', engine='python')
Out[3]:
            Value
Index
0           1.674
1         3,48e+3
2      8,1834e-10
3        3984.109
4      2830812370

In [4]: pd.__version__
Out[4]: '0.23.4'



--
Cheers,
Thomas

Martin Schöön

Nov 20, 2018, 3:07:05 PM
Problem solved!

This morning I woke up with the idea of testing whether all this fuss
might be caused by the 'objectionable' files missing some data in their
first few rows. I tested by replacing the blank fields in one file with
zeros. Bingo! For real this time!

Missing data threw read_csv off course.

Fortunately, only a few files needed a handful of zeros, so I could
add them by hand without breaking too much sweat.
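
[Editor's note: for the record, the manual step can also be scripted.
pandas reads empty fields as NaN, and fillna can supply the zeros; a
sketch on made-up sample data resembling the problem files.]

```python
import io
import pandas as pd

# Sample with a blank field in the first row, as in the problem files.
data = ("col1\tcol2\n"
        "\t0\n"
        "10,24e-05\t1\n"
        "9,492e-10\t2\n")

# The empty field becomes NaN; fillna(0) writes the zeros in.
df = pd.read_csv(io.StringIO(data), sep="\t", decimal=",")
df = df.fillna(0)
```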

Thanks for the keen interest shown.

/Martin