Hello Andreas,
> My goal is traceability. I often download data files (often XLS) from
> the web, in order to use the data in my research. Different XLS files
> have different formats, so the parsing always differs a bit. Also, maybe
> I need to reindex stuff according to my own standards. And the original
> data producers might not care about reproducibility and change file
> contents without changing file names.
>
> By having the original XLS file, the resulting pandas object, and the
> Python code which converted the XLS file to the pandas object, all
> within one file, I get total traceability. No matter if I reorganize my
> harddisk, or if I pass the data file on to a coworker, it's always
> totally clear where the data came from.
> In order to be able to reproduce one's results (or find a reason why
> reproducing isn't possible) I chose to go this way.
I really liked your idea and approach.
I was actually waiting for something like this:
* upload raw data to database
* add metadata on source and properties
* track name of the modifier
* track commands & scripts used to generate this data
* retrieve the data for analysis
* do analysis and save back with the respective metadata
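As a rough sketch of what the metadata in such a workflow could hold,
assuming nothing about how pyrepsci stores it: the function name
make_provenance_record and all field names below are made up for
illustration, and the URL is a placeholder.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone


def make_provenance_record(raw_bytes, source_url, script_text, modifier=None):
    """Build a JSON-serialisable record describing where a dataset came from.

    All field names are hypothetical; adapt them to your own conventions.
    """
    return {
        "source_url": source_url,
        # The checksum reveals silent changes to a file that kept its name.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "size_bytes": len(raw_bytes),
        # Fall back to the login name if no modifier is given.
        "modifier": modifier or getpass.getuser(),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        # The code that turned the raw file into the analysis object.
        "script": script_text,
    }


record = make_provenance_record(
    raw_bytes=b"example file contents",  # stand-in for the downloaded XLS
    source_url="http://example.org/data.xls",  # placeholder URL
    script_text="df = pandas.read_excel('data.xls')",
    modifier="timmie",
)
print(json.dumps(record, indent=2))
```

Such a record could be stored as a JSON string next to the raw file
and the pandas object, so that all three travel together.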
If you look at other domains, GRASS GIS uses a metadata descriptor:
http://grass.osgeo.org/grass64/manuals/r.info.html
Data Description: |
| generated by r.slope.aspect
Together with a history log, this already gets you quite far...
> Actually, I just hacked together a small module to do exactly this. I
> put it on Github:
>
> https://github.com/andreas-h/pyrepsci
>
> Maybe it can be helpful to someone out there. Any comments are welcome.
I checked it out. It's really cool.
The only thing I miss is a simple method to extract everything from
the HDF file into a directory structure.
Maybe even with an IPython log or notebook template: once started or run
in IPython, we would have the pandas dataframe ready at hand...
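To illustrate the kind of extraction I mean, here is a minimal sketch
using h5py. It is not based on pyrepsci's actual file layout: the
function name extract_hdf is made up, and I simply assume that raw
files are stored as 1-D uint8 datasets. Objects written by pandas'
HDFStore have their own layout and would need pandas to read them back.

```python
import os

import h5py
import numpy as np


def extract_hdf(hdf_path, out_dir):
    """Write every dataset in an HDF5 file to a file below out_dir.

    Groups in the HDF5 file become subdirectories. Returns the list of
    files written. (Hypothetical helper, not part of pyrepsci.)
    """
    written = []

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return  # groups are created implicitly via makedirs below
        target = os.path.join(out_dir, *name.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        data = obj[()]
        if obj.dtype == np.uint8 and data.ndim == 1:
            # Assumed convention: 1-D uint8 datasets hold raw file bytes.
            with open(target, "wb") as fh:
                fh.write(data.tobytes())
        else:
            # Everything else is dumped as a NumPy .npy file.
            target += ".npy"
            np.save(target, data)
        written.append(target)

    with h5py.File(hdf_path, "r") as f:
        f.visititems(visit)
    return written
```

Calling extract_hdf("archive.h5", "extracted") would then mirror the
group hierarchy of the file as folders on disk.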
If we manage to add this to IPython and/or Spyder so that the script
logging becomes an automatic process, it would be really easy to prove
your results.
You may also want to check out Gloo (http://tshauck.github.io/Gloo/),
which provides a general project structure that you could use to set up
your HDF5 file.
BTW, did you compress the XLS files before putting them into the HDF file?
Would this help to reduce the size?
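A quick way to find out would be to measure it on a real file. The
sketch below uses zlib from the standard library; the helper name
compression_ratio and the sample bytes are only for illustration.

```python
import zlib


def compression_ratio(data, level=6):
    """Return compressed size divided by original size (smaller is better)."""
    if not data:
        return 1.0  # nothing to compress
    return len(zlib.compress(data, level)) / len(data)


# Stand-in for open("data.xls", "rb").read(); repetitive bytes compress well.
sample = b"2013-01-01;42.0;station_a\n" * 1000
print(f"compressed to {compression_ratio(sample):.1%} of the original size")
```

Note that .xlsx files are already ZIP containers, so I would expect
little gain there, while the older binary .xls format may shrink
noticeably. If the file is written with h5py, passing
compression="gzip" to create_dataset should handle this transparently.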
Regards,
Timmie