Potential Solution for AssertionError: invalid dtype determination in get_concat_dtype?

kyoto89

Sep 9, 2015, 5:28:26 PM
to PyData
I have a list of Pandas Dataframes that I am attempting to combine using the concatenation function. 

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys=['one', 'two', 'three'], ignore_index=True)


The full traceback that I receive when I execute this function is: 

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756
    757

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142
   4143

AssertionError: invalid dtype determination in get_concat_dtype


I believe the error arises because one of the dataframes is empty. As a temporary workaround for this rather perplexing error, I used the simple function check to find the empty dataframes and return just their headers:

def check(list_of_df):
    # Collect the column headers of every empty dataframe in the list
    headers = []
    for df in list_of_df:
        if df.empty:
            headers.append(df.columns)
    return headers



I am wondering whether this function could be used so that, when a dataframe is empty, just that dataframe's headers are returned and appended to the concatenated dataframe. The output would be a single row of headers and, where a column name repeats, just a single instance of that header (as the concatenation function already does). I have two sample data sources, df1 and df2, both non-empty data sets.

df1: https://gist.github.com/ahlusar1989/42708e6a3ca0aed9b79b 
df2: https://gist.github.com/ahlusar1989/26eb4ce1578e0844eb82 

Here is an empty dataframe. 

df3 (empty dataframe): https://gist.github.com/ahlusar1989/0721bd8b71416b54eccd 

I would like the concatenated result to go from the column headers (with their values) of df1 and df2... 

'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'

as follows: 

'A', 'AT','AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'


I welcome any feedback on how to best do this. Thank you. 

Joris Van den Bossche

Sep 9, 2015, 5:38:59 PM
to PyData
What version of pandas are you using?

With a small example, concatenating an empty frame works for me with pandas 0.16.2:

In [1]: df1 = pd.DataFrame({'a':[1,2], 'b':[3,4]})

In [3]: df2 = pd.DataFrame(columns=['a', 'b'])

In [4]: df2
Out[4]:
Empty DataFrame
Columns: [a, b]
Index: []

In [5]: pd.concat([df1, df2])
Out[5]:
   a  b
0  1  3
1  2  4




kyoto89

Sep 9, 2015, 8:30:10 PM
to PyData



I too have replicated something similar. However, for some reason the 410000 row CSV file that I am using has up to 1000 fieldnames and hence mixed dtypes. Perhaps this is the cause? Regardless I am curious to know how to write out just the headers of any empty dataframe (avoiding duplicates and appending any unique column headers - if any). Thank you for your feedback.

I should add that I am using pandas 0.16.2 and Python 3.4.3 (64-bit), with Jupyter Notebook as my IDE.

Joris Van den Bossche

Sep 10, 2015, 4:54:05 AM
to PyData
2015-09-10 2:30 GMT+02:00 kyoto89 <ahlusar....@gmail.com>:






I too have replicated something similar. However, for some reason the 410000 row CSV file that I am using has up to 1000 fieldnames and hence mixed dtypes. Perhaps this is the cause?

Using a similar small test as above, but with mixed dtypes and with different column names, also works for me.
Can you try to provide a reproducible (copy-pastable) example that reproduces the error?
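
For reference, a minimal sketch of what such a small test might look like. The frames, column names, and values below are made up for illustration and are not the data from this thread. On recent pandas versions, concatenating an empty frame may also emit a FutureWarning about empty entries.

```python
import pandas as pd

# Illustrative frames only: mixed dtypes, partly different column names,
# and one empty frame that carries headers but no rows.
df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"b": ["z"], "c": [1.5]})
df3 = pd.DataFrame(columns=["c", "d"])

result = pd.concat([df1, df2, df3], ignore_index=True)

# The result carries every column name that appears in any input frame.
print(result.columns.tolist())
print(len(result))
```

If a test like this passes while the real data fails, the difference between the two (for example duplicate column names) is a reasonable place to look next.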

 
Regardless I am curious to know how to write out just the headers of any empty dataframe (avoiding duplicates and appending any unique column headers - if any). Thank you for your feedback.

You can always manually take the union of the column names, and reindex with that.
Roughly something like this:

all_cols = df1.columns.union(df2.columns)
df1.reindex(columns=all_cols)
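
The union-and-reindex idea above can be extended to a whole list of frames. This is only a sketch under stated assumptions: the helper name `concat_keeping_headers` is hypothetical, the sample frames are made up, and empty frames are deliberately left out of the `pd.concat` call so that they contribute headers only.

```python
import pandas as pd


def concat_keeping_headers(frames):
    """Concatenate the non-empty frames, keeping the headers of every frame."""
    # Union of the column names across all frames, empty ones included.
    all_cols = pd.Index([])
    for df in frames:
        all_cols = all_cols.union(df.columns)

    # Align each non-empty frame to the full set of columns.
    non_empty = [df.reindex(columns=all_cols) for df in frames if not df.empty]
    if not non_empty:
        # Nothing but headers: return an empty frame that still carries them.
        return pd.DataFrame(columns=all_cols)
    return pd.concat(non_empty, ignore_index=True)


df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5], "c": [6]})
df3 = pd.DataFrame(columns=["c", "d"])  # empty: headers only

result = concat_keeping_headers([df1, df2, df3])
print(result)
```

Columns that exist only in the empty frame (here `d`) appear in the result filled with NaN, and a column name shared by several frames appears once.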

kyoto89

Sep 10, 2015, 9:58:09 AM
to PyData


On Thursday, September 10, 2015 at 4:54:05 AM UTC-4, Joris Van den Bossche wrote:

Using a similar small test as above, but with mixed dtypes and with different column names, also works for me.
Can you try to provide a reproducible (copy-pastable) example that reproduces the error?

Unfortunately, due to the sensitivity of this material, I cannot share the actual data. The steps leading up to what is presented in the gists are as follows:

A = data[data['RRT'] == 'A']  # Select just the rows where RRT is 'A' from the dataframe "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']
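
A small sketch of how this kind of filtering can produce an empty dataframe. The sample values are made up; only the column names `RRT` and `ANum` come from this thread. When no row matches, the result has no rows but keeps all of the original column headers.

```python
import pandas as pd

# Made-up stand-in for "data"; there is no row with RRT == 'D'.
data = pd.DataFrame({
    "RRT": ["A", "B", "A", "C"],
    "ANum": ["001", "002", "003", "004"],
})

# One filtered frame per RRT value, as in the post above.
subsets = {key: data[data["RRT"] == key] for key in ["A", "B", "C", "D"]}

for key, frame in subsets.items():
    print(key, len(frame), frame.empty, list(frame.columns))
```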


For each of the new data frames I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum','RTID', 'Description','Type','Status', 'AD', 'CD', 'OD', 'RCD']]  # get select columns from dataframe "A"
 


When I perform the bound method on an empty dataframe A:

A.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
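
Note that `A.count` without parentheses only refers to the method, which is why the output above is the method's repr rather than any counts. A minimal sketch of the difference, assuming an empty frame built by hand with a few of the column names listed above:

```python
import pandas as pd

# Hand-built empty frame; the column names are a subset of those in the post.
A = pd.DataFrame(columns=["ANum", "RTID", "Description"])

print(A.count)    # the bound method itself, not called
print(A.count())  # calling it gives the non-null count per column
print(A.empty, len(A), A.shape)  # other quick ways to check for emptiness
```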



Finally, I imported the CSV with the following:
data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)
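
A self-contained sketch of the same chunked import, using an in-memory CSV as a stand-in for 'Merged_Success2.csv' (the sample rows are made up). It assumes pandas 1.3 or later, where `on_bad_lines="skip"` replaces `error_bad_lines=False`; the older argument was removed in pandas 2.0.

```python
import io

import pandas as pd

# Made-up CSV text; the third line has one field too many and is skipped.
csv_text = (
    "RRT,ANum,Amount\n"
    "A,001,10\n"
    "B,002,20,EXTRA\n"
    "C,003,30\n"
    "A,004,40\n"
)

# Read everything as strings, two rows at a time.
reader = pd.read_csv(
    io.StringIO(csv_text), dtype=str, on_bad_lines="skip", chunksize=2
)

# Stitch the chunks back together with a fresh 0..n-1 index.
data = pd.concat(list(reader), ignore_index=True)
print(data)
```

Because of `dtype=str`, values such as `001` keep their leading zeros, and every column ends up with a string dtype regardless of its contents.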


I am not certain what else I can provide. The concatenation method works with all of the other data frames that I need to meet a requirement. I have also looked at the Pandas internals.py and the full trace. Either I have too many columns with NaN, duplicate column names, or mixed dtypes (the last being the least likely culprit). 
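
A rough sketch of how those three suspicions could be checked on each frame before concatenating. The helper name `describe_frame` and the sample frame are hypothetical, and this only reports what it finds; it does not establish which condition, if any, triggers the AssertionError.

```python
import pandas as pd


def describe_frame(df):
    """Return quick diagnostics for one frame."""
    return {
        "empty": df.empty,
        # Column names that occur more than once.
        "duplicate_columns": df.columns[df.columns.duplicated()].tolist(),
        # Columns in which every value is missing (all columns, if df is empty).
        "all_nan_columns": df.columns[df.isna().all().to_numpy()].tolist(),
        # The distinct dtypes present in the frame.
        "dtypes": sorted({str(t) for t in df.dtypes}),
    }


# Made-up frame with a repeated column name and one all-missing column.
sample = pd.DataFrame([[1, None, "x"], [2, None, "y"]], columns=["a", "b", "a"])
report = describe_frame(sample)
print(report)
```

Running this over every frame in the list narrows down which input differs from the ones that concatenate without trouble.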

Thank you again for your guidance.  


 

Ah, I see - that's an interesting use of union. Thank you. 