--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pydata/207af5fb-d852-4716-b5f2-3c377800f1f4%40googlegroups.com.
When I call store.put('x', x, format='t', data_columns=True), these are the warnings I get (with my original data):
/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/mobile/_i_table/application_data/sorted`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O. You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave. Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)
/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/sorted`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O. You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave. Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)
/shared_directory/projects/env/lib/python3.6/site-packages/tables/leaf.py:414: PerformanceWarning: The Leaf ``/sorted2`` is exceeding the maximum recommended rowsize (104857600 bytes); be ready to see PyTables asking for *lots* of memory and possibly slow I/O. You may want to reduce the rowsize by trimming the value of dimensions that are orthogonal (and preferably close) to the *main* dimension of this leave. Alternatively, in case you have specified a very small/large chunksize, you may want to increase/decrease it. PerformanceWarning)
(mobile is the table name and application_data is a column name; sorted/sorted2 are not names from my data)
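An aside that may help here: `data_columns=True` asks PyTables to build a queryable index over every column, and those `sorted`/`sorted2` leaves are the index structures it creates. Passing only the columns you actually filter on should shrink them. A minimal sketch with made-up data (the file name, table name, and column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'session_id': np.arange(1000),
                   'payload': ['x' * 50] * 1000})

with pd.HDFStore('example.h5', mode='w') as store:
    # Index only the column(s) used in where= queries, not every column.
    store.put('mobile', df, format='table', data_columns=['session_id'])
    subset = store.select('mobile', where='session_id < 10')
```

Columns not listed in `data_columns` are still stored and read back; they just can't appear in a `where=` clause.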
And on the example, I get a long list of warnings that are mostly variations on this:
/shared_directory/projects/env/lib/python3.6/site-packages/tables/path.py:157: NaturalNameWarning: object name is not a valid Python identifier: '9'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
check_attribute_name(name)
/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
check_attribute_name(name)
/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_meta'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
check_attribute_name(name)
/shared_directory/projects/env/lib/python3.6/site-packages/tables/attributeset.py:475: NaturalNameWarning: object name is not a valid Python identifier: '0_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
But it does eventually seem to finish, and to load - very, very slowly.
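The NaturalNameWarnings come from the integer column names ('9', '0_kind', ...); renaming them to valid Python identifiers before the put should silence them. A sketch with a toy frame (the prefix `col_` is an arbitrary choice):

```python
import pandas as pd

# Toy frame mirroring the Index from the traceback below: 'session_id'
# plus integer-named columns.
df = pd.DataFrame({'session_id': ['a', 'b'], 0: [1, 2], 9: [3, 4]})

# Prefix any column name that is not already a valid Python identifier.
df.columns = [c if isinstance(c, str) and c.isidentifier() else f'col_{c}'
              for c in df.columns]

print(list(df.columns))  # ['session_id', 'col_0', 'col_9']
```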
If I call store.put('engagement', df, format='t', data_columns=True), I get this:
>>> store.put('x', x, format='t', append=True)
Traceback (most recent call last):
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3612, in create_axes
info=self.info)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1952, in set_atom
errors)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1988, in set_atom_string
data_converted = _convert_string_array(data, encoding, errors)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 4606, in _convert_string_array
encoding, errors).values.reshape(data.shape)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 2507, in encode
result = str_encode(self._data, encoding, errors)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 1832, in str_encode
return _na_map(f, arr)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 150, in _na_map
return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/strings.py", line 165, in _map
result = lib.map_infer_mask(arr, f, mask.view(np.uint8), convert)
File "pandas/_libs/src/inference.pyx", line 1443, in pandas._libs.lib.map_infer_mask
File "pandas/_libs/src/inference.pyx", line 1233, in pandas._libs.lib.maybe_convert_objects
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3946, in write
**kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 3622, in create_axes
% (b.dtype.name, b_items, str(detail))
Exception: cannot find the correct atom type -> [dtype->object,items->Index(['session_id', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')]
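The chained MemoryError / atom-type failure above suggests two separable problems: object columns containing None/NaN (which the atom-type inference chokes on) and encoding every string at once. A hedged sketch (column names and sizes invented) that casts the string column explicitly and appends in chunks so the encoder never sees the whole frame:

```python
import pandas as pd

df = pd.DataFrame({'session_id': ['s1', None, 's3'],
                   'value': [1.0, float('nan'), 3.0]})

# Mixed object columns (strings plus None) trip the atom-type inference;
# fill and cast before writing.
df['session_id'] = df['session_id'].fillna('').astype(str)

chunksize = 2  # tune to available memory
with pd.HDFStore('chunked.h5', mode='w') as store:
    for start in range(0, len(df), chunksize):
        # min_itemsize reserves room for longer strings in later chunks.
        store.append('x', df.iloc[start:start + chunksize],
                     min_itemsize={'session_id': 30})
```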
Why do you need to load all the data into memory at once? Is a lazy loading approach not possible for you? Dask has a partial wrapper for pandas dataframes to automatically distribute the workload among many cores and parallelize reading the file. There are other Python libraries out there that support lazy loading to work on 2-20 GB files on machines with only 8 GB of RAM.

I also remembered the following post. tl;dr: use a one-time preprocessing step with a well-chosen intermediate on-disk representation to achieve significant speed-ups.

Best,
Max
On Mon 12. Aug 2019 at 18:15, Pietro Battiston <m...@pietrobattiston.it> wrote:
On Mon, 12 Aug 2019 at 09:08 -0700, Josh Friedlander wrote:
> Thanks Sal and Pietro for your answers. However, neither of them work
> for the example I provided.
It might be useful - for people reading the thread, but ultimately
maybe also for you - if you describe the problems you found.
Pietro
import h5py
import pandas as pd

# One h5py dtype per column; variable-length types for the string columns.
dtypes_dict = df.dtypes.to_dict()
dtypes_dict['str_col_foo'] = h5py.special_dtype(vlen=bytes)
dtypes_dict['sdr_u'] = h5py.special_dtype(vlen=str)

# Write each column as its own dataset.
with h5py.File('foo.h5', 'w') as f:
    for col in list(df):
        f.create_dataset(name=col, data=df[col].values, dtype=dtypes_dict[col])

del df

# Read everything back and rebuild the frame (note: vlen=bytes columns
# come back as bytes and may need .decode()).
col_dict = {}
with h5py.File('foo.h5', 'r') as f:
    for col in dtypes_dict:
        col_dict[col] = f[col][()]
df_ = pd.DataFrame(col_dict)