useful mental picture for working with treants


Max Linke

Aug 26, 2016, 9:05:40 AM

to datr...@googlegroups.com

I've been using datreant again for a new study. Prompted by the text in the
recent SciPy paper, I have started to use the library differently, and I now
use the following two definitions for **Treant** and **Bundle** when I set up
my study.

- **Treant**: A single entry in the 'database' containing raw/processed data
  for ONE experiment. Here I can attach metadata to the experiment in the form
  of tags and categories.

- **Bundle**: An 'immutable' view on different treants that can be queried.
  Immutable because I can't add or remove treants from the view, but I can
  still change the treants themselves. The Bundle contains treants from one or
  more studies.

These definitions have an important implication. Datreant is **not** usable to
create your 'database' on the filesystem. Rather, it is there to help you query
the hierarchy that is already established. Most important is that a treant is
supposed to be for only one experiment/simulation and stores the processed data
for it. That allows efficient use of tags and categories, so the bundle can
later be used for filtering. One of my misconceptions when I first used
datreant was thinking that a treant could store information for several
experiments, which led me to construct one giant treant and to write special
wrappers that use the file path to extract tags and filter. Does this fit your
usage of datreant as well, David?

To actually create the database in a scripted way one can use the Tree, Leaf
and View objects. But I assume that for an existing project it might be better
to create the datreant objects by hand and then use the bundle to move
processed data into the treants.

The docs should mostly talk about Treant and Bundle. Tree, Leaf and View should
be marked as developer information for creating tools like mdsynthesis. This
information can already be found in the docs, but it is not very clear.

I think it would also help if we provided the workflow examples we use
ourselves in the docs. That should give a clearer idea of how datreant can be
used and adapted to personal needs.

# My workflow

In my work I usually produce several gigabytes, up to terabytes, of raw data on
a supercomputer simulating proteins. After processing the raw data, the
calculated observables often require only a small fraction of that disk space.
I then proceed to copy the processed data onto my laptop/workstation to do the
final analysis of a set of simulations.

For this I create a folder for the simulations and later one for the processed
data on the cluster.

```
./
+-- simulations
+-- processed-data
```

The `processed-data` folder is created by the analysis scripts. The scripts
create a new folder for each simulation as a treant and add tags and categories
to it. Later I only need to copy the `processed-data` directory to my laptop,
which is easily done with rsync. The `processed-data` folder is then easy to
query on my laptop using Bundles.
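
A minimal sketch of this layout, using only the Python standard library. This
is not datreant's API: `make_entry`, `discover`, `with_tag` and the `meta.json`
file name are hypothetical stand-ins for what a treant's metadata and a Bundle
query do, and the tag and category values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def make_entry(root, name, tags, categories):
    """Create one folder per simulation and store its metadata next to the data."""
    entry = Path(root) / name
    entry.mkdir(parents=True, exist_ok=True)
    meta = {"tags": sorted(tags), "categories": categories}
    (entry / "meta.json").write_text(json.dumps(meta))
    return entry


def discover(root):
    """Return (path, metadata) for every folder below root that has a meta.json."""
    found = []
    for meta_file in sorted(Path(root).rglob("meta.json")):
        found.append((meta_file.parent, json.loads(meta_file.read_text())))
    return found


def with_tag(entries, tag):
    """Keep only the entries that carry the given tag."""
    return [(path, meta) for path, meta in entries if tag in meta["tags"]]


# Hypothetical example data: two processed simulations in a temporary directory.
root = Path(tempfile.mkdtemp()) / "processed-data"
make_entry(root, "sim-001", {"protein", "equilibrated"}, {"temperature": 300})
make_entry(root, "sim-002", {"protein"}, {"temperature": 310})

entries = discover(root)
equilibrated = with_tag(entries, "equilibrated")
print([path.name for path, _ in equilibrated])
```

Because the metadata lives inside `processed-data`, copying that one directory
carries everything the query needs.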

*Note:* Since I mostly create the simulation setups with a script, I'm now
thinking of adding a treant to each simulation to store the parameters used to
create the simulation as tags and categories. This should help the processing
scripts to create the `processed-data` folder.

Max Linke

Aug 26, 2016, 5:33:21 PM

to datr...@googlegroups.com

On 08/26/2016 07:41 PM, Oliver Beckstein wrote:
> Hi Max,
>
> I am using datreant/mdsynthesis in the way you describe (including bringing order to a whole bunch of legacy simulations with dirty on-the-fly code to generate the set of tags and categories).

Good. I hope someone (maybe me) then has time to add these descriptions
to the manual. They are a nice way to explain to other people how the
library can be used.

> I just want to add to your excellent summary that I don't consider bundles that static because if I add new treants with matching tags/categories then I will get a bigger bundle. The main difference between bundles and groups (which I don't think are used any more) is that the bundle has no state (saved to disk). Thus, the common thing to do is re-evaluate the bundle whenever I do work and if I got more treants – hooray, more data to play with.

Only when you rerun the code, as you said yourself. Treants, though, will
pick up any changes without having to be recreated. I only meant that
bundles are static once I have created them.

> I also want to add that probably one of the neatest things is groupby() – it really helps to get very organized views of your experiments.

Yes I also like that a lot. It will help tremendously with my own
simulations.

> Finally, I should also say that any library that can be used by a PI passes an important usability test ;-) – good job David and everyone else!
>
> Oliver

> --
> Oliver Beckstein * orbe...@gmx.net
> skype: orbeckst * orbe...@gmail.com
>

David Dotson

Sep 21, 2016, 5:14:03 PM

to Max Linke, datr...@googlegroups.com

Hi Max,

I wanted to give a detailed response to this, but at the time was not in a good position to do so. Responses below.

On 08/26/2016 06:05 AM, Max Linke wrote:
> I've been using datreant again for a new study. With the text in the recent
> scipy paper I have now started to use the library differently and I'm now using the
> following two definitions for **Treant** and **Bundle** when I set up my study.
>
> - **Treant**: A single entry in the 'database' containing raw/processed data
> for ONE experiment. Here I can attach meta-data to the experiment in form of
> tags and categories.
>
> - **Bundle**: An 'immutable' view on different treants that can be queried.
> Immutable because I can't add or remove treants from the view, but treants
> themselves I can still change. The Bundle contains treants from one or more
> studies.

I think it's important to point out that a `Bundle` object is mutable, in the sense that you can add or remove members from it. It's basically an ordered set (it even has the same methods as a Python `set` for adding and removing members) with special methods for working with `Treant`s.
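
To make the "ordered set" picture concrete, here is a toy sketch in plain Python. It is not datreant's implementation; `OrderedSet` is a hypothetical class and the member strings are made-up paths. It only shows the behaviour I mean: members are unique, keep their insertion order, and can be added or removed after the collection has been created.

```python
class OrderedSet:
    """Toy ordered set: unique members, insertion order preserved."""

    def __init__(self, members=()):
        # dict keys are unique and keep insertion order (Python 3.7+)
        self._members = dict.fromkeys(members)

    def add(self, member):
        # Adding an existing member leaves its position unchanged.
        self._members[member] = None

    def remove(self, member):
        # Like set.remove, this raises KeyError for a missing member.
        del self._members[member]

    def __contains__(self, member):
        return member in self._members

    def __iter__(self):
        return iter(self._members)

    def __len__(self):
        return len(self._members)


bundle = OrderedSet(["sims/a", "sims/b"])
bundle.add("sims/c")
bundle.add("sims/a")      # already a member, so nothing is duplicated
bundle.remove("sims/b")
members = list(bundle)
print(members)
```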

>
> These definitions have an important implication. Datreant is **not** usable to
> create your 'database' on the filesystem.

Not true, actually. These days I rarely work with my data in a shell or file browser, but instead build everything from the start using `datreant` objects. You can make directories and files directly from Python using `Treant`s, `Tree`s, `Leaf`s, and `View`s as the interface of choice for doing this Pythonically.

> Rather it is there to help you query
> the hierarchy that is already established. Most important is that a treant is
> supposed to be for only one experiment/simulation and stores processed data for
> it.

I think this is a good use of Treants; they are most useful as the functional "atoms" of your study, whatever that is. And you can persistently store arbitrary information "inside" them, since they are filesystem objects.

> That allows efficient use of tags and categories to later use the bundle for
> filtering. One of my misconceptions using datreant the first time was me
> thinking that a treant could store information for several experiments which
> led to me constructing one giant treant and having to write special wrappers
> that use the filepath to extract tags and filter. Does this fit your usage of datreant as well David?

I generally only work in terms of MDSynthesis Sims, and in this case each Treant in my scheme corresponds to a single simulation's data. I avoid having to write special wrappers because I use a `Bundle` directly as my tool of choice for dealing with `Treant`s as an aggregate unit, and most functions I write, whether in a Jupyter notebook, a script, or library code, take either individual Sims or whole Bundles as input. This has the added benefit that these always work well with e.g. `distributed` when crunching results on many, perhaps hundreds, of simulations.
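
A rough sketch of that pattern: one function that works on a single simulation, and one that maps it over a whole collection in parallel. Everything here is an assumption made for illustration. It uses plain directories and the standard library's `ThreadPoolExecutor` as a stand-in for `distributed`, and `analyze`, `analyze_all` and the `data.txt`/`result.txt` file names are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyze(sim_dir):
    """Work on ONE simulation directory: here, just sum the numbers in data.txt."""
    values = [float(v) for v in (sim_dir / "data.txt").read_text().split()]
    result = sum(values)
    # Store the derived result next to the data it came from.
    (sim_dir / "result.txt").write_text(str(result))
    return result


def analyze_all(sim_dirs, max_workers=4):
    """Apply the single-simulation function to a whole collection in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(analyze, sim_dirs))


# Hypothetical toy data: three simulation directories with a small data file each.
root = Path(tempfile.mkdtemp())
sim_dirs = []
for i in range(3):
    sim_dir = root / f"sim-{i}"
    sim_dir.mkdir()
    (sim_dir / "data.txt").write_text(f"{i}\n{i + 1}\n")
    sim_dirs.append(sim_dir)

results = analyze_all(sim_dirs)
print(results)
```

Since each call touches only its own directory, the calls are independent and the executor can be swapped for any other parallel map.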

>
>
> To actually create the database in a scripted way one can use the Tree, Leaf and
> View objects. But I assume for an existing project it might be better to create
> the datreant objects by hand and then use the bundle to move processed data into
> the treant.

I disagree, but the strength of the library is that you can do as you like. I prefer to do all my science these days in Jupyter notebooks, since the context of what I did (my explanations, plots, etc.) sits alongside the code that did the work. The directory content of my Treants/Sims is created almost entirely by Python code. But you can use whatever style suits you, of course.

>
> The docs should mostly talk about Treant and Bundle. Tree, Leaf and view should
> be marked as developer information to create tools like mdsynthesis. This
> information can already be found in the docs but it is not very clear.

Trees, Leaves, and Views are not currently motivated well for the user by the docs. But they are incredibly powerful for manipulation, modification, and construction of the filesystem tree using Python. Since I've now had plenty of experience using them in this way, I'd like to have a go at improving the docs on these with more detailed examples this winter. The docs are too abstract right now.

>
> I think it would also help if we provide workflow examples used by us in the
> docs. That should give a clearer idea how datreant can be used and adapted to
> personal needs.

Exactly. :D

>
> # My workflow
>
> In my work I usually produce several gigabytes, up to terabytes, of raw data on
> a super computer simulating proteins. After processing the raw data the
> calculated observables often only require a small fraction of disk space. I then
> proceed to copy the processed data onto my laptop/workstation to do the final
> analysis of a set of simulations.
>
> For this I create a folder for the simulations and later one for the processed
> data on the cluster.
>
> ```
> ./
> +-- simulations
> +-- processed-data
> ```
>
> The `processed-data` folder is created by the analysis scripts. The scripts
> create a new folder for each simulation as a treant and add tags and categories
> to them. Later I only need to copy the `processed-data` directory to my laptop,
> this is easily done with rsync. The `processed-data` folder is then easy to
> query on my laptop using Bundles.

This is a different approach than I take, but I think that comes down to infrastructure choices and personal preference. I'm blessed with a large storage array that can hold all my simulation data in one place alongside any derived dataset, and it doesn't make sense for me to ever pull this to my laptop. So my datasets that come out of a given simulation sit alongside the raw data, with all data for a simulation in its own Sim tree. This allows me to work with whatever data I'm most interested in at any moment in the same exact way, with the same set of Sims and their metadata available for filtering, splitting, etc.

>
> *Note:* Since I'm mostly creating the simulation setups with a script I'm now
> thinking of adding a treant with each simulation to store the parameters used to
> create the simulation as tags and categories. This should help the processing
> scripts to create the `processed-data` folder.

I think this is a good idea, and I certainly do it for some parameters I use to generate simulations. It makes it easy to split simulations by parameter values via `groupby` later when I'm working with hundreds of them at once. :D
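
A small stdlib-only sketch of splitting simulations by a stored parameter. This is not datreant's `groupby`; the function below is a hypothetical stand-in that works on plain dictionaries, and the simulation names and parameter values are invented for the example.

```python
from collections import defaultdict

# Hypothetical simulations with the parameters used to generate them.
sims = [
    {"name": "sim-001", "categories": {"temperature": 300, "ligand": "none"}},
    {"name": "sim-002", "categories": {"temperature": 310, "ligand": "none"}},
    {"name": "sim-003", "categories": {"temperature": 300, "ligand": "atp"}},
    {"name": "sim-004", "categories": {"ligand": "atp"}},
]


def groupby(sims, key):
    """Map each value of the category `key` to the names of the sims that have it."""
    groups = defaultdict(list)
    for sim in sims:
        categories = sim["categories"]
        # Assumption of this sketch: sims without the key are left out.
        if key in categories:
            groups[categories[key]].append(sim["name"])
    return dict(groups)


by_temperature = groupby(sims, "temperature")
print(by_temperature)
```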

I'm still inundated with my Ph.D. work until November ("real science", or something), but I truly value this discussion. I'm really interested in how datreant can be shaped to work even better for our purposes, and I think much of this will be helped by discussions like this and improved examples in the docs.

Cheers Max!

David

Max Linke

Dec 6, 2016, 3:41:45 PM

to datr...@googlegroups.com

The original answer is hard to follow. In short, I have listed here the points
where I think we agree the docs need to be improved. This also suggests a
chapter structure for the docs. Below is an inline reply to David's remarks.

* Treants and Bundles
  Basically the SciPy paper, as it gives a good overview in which the
  connection between Treants and Bundles becomes clear.
* Best Practices
** Treants are 'atomic' units of a study
   Each treant should only contain data for ONE experiment/simulation. This way
   one can best leverage the power of bundle and groupby. It also enforces that
   analysis scripts are written to work on a single experiment at a time. This
   helps to speed up the analysis using dask/joblib. For this we should give
   some examples.
** Adding Treants to an existing FS structure
   What is the best way to add information? What are common patterns of FS
   hierarchy, and how can they benefit from datreant?
** Creating a new study from scratch
   This is the clean-slate approach. Here we can show off how one would use
   Python-only scripts to generate a FS hierarchy using only Treant objects.
* Examples
  Here we list concrete examples of how people use datreant. They can be short
  or in prose form.
** Remote MD Simulations (Max)
** Local MD Simulation (David)
* Tips and Tricks
** Working on remote data
*** Split raw and processed data
    Here I should describe my way of dealing with remote data.
*** Use the rsync limb
    I haven't used it yet, but we should also mention it here.
* Advanced usage
** FS manipulation with Tree/Leaf/View

====================================

I didn't know I could add members to a Bundle. For adding I have always run
`discover` again. I'm aware of removing items; I do that all the time using
`groupby` as a filtering tool.

>> These definitions have an important implication. Datreant is **not** usable to
>> create your 'database' on the filesystem.

> Not true, actually. These days I rarely work with my data in a shell or file
> browser, but instead build everything from the start using `datreant` objects.
> You can make directories and files directly from Python using `Treant`s,
> `Tree`s, `Leaf`s, and `View`s as the interface of choice for doing this
> Pythonically.

Interesting. I hadn't yet considered working with the Tree/Leaf/View objects
to interact with the filesystem. I now create folders by creating Treant
objects. That does make it easier to populate a folder with subfolders. Can you
give some examples of how to use the Tree/Leaf objects instead of the os or
shutil modules to interact purely with the filesystem (no Treants)?

>> Rather it is there to help you query
>> the hierarchy that is already established. Most important is that a treant is
>> supposed to be for only one experiment/simulation and stores processed data for
>> it.

> I think this is a good use of Treants; they are most useful as the functional
> "atoms" of your study, whatever that is. And you can persistently store
> arbitrary information "inside" them, since they are filesystem objects.

We should stress that point more in the examples. Only this makes it so
powerful in combination with bundle and groupby. For people who tend to put
things into the same folder, datreant might therefore not be directly usable.

Here we can also show off how such a design makes writing parallel algorithms
easier.

>> To actually create the database in a scripted way one can use the Tree, Leaf and
>> View objects. But I assume for an existing project it might be better to create
>> the datreant objects by hand and then use the bundle to move processed data into
>> the treant.

> I disagree, but the strength of the library is that you can do as you like. I
> prefer to do all my science these days in Jupyter notebooks, since the context
> of what I did (my explanations, plots, etc.) sits alongside the code that did
> the work. The directory content of my Treants/Sims is created almost entirely
> by Python code. But you can use whatever style suits you, of course.

Maybe there is a misunderstanding here. My point was that creating Treants on
an existing file system structure isn't possible in a scripted way. The content
of the Treant/Sim is of course created in Python, but one needs to have a
Treant there first.

>> The docs should mostly talk about Treant and Bundle. Tree, Leaf and View should
>> be marked as developer information to create tools like mdsynthesis. This
>> information can already be found in the docs but it is not very clear.

> Trees, Leaves, and Views are not currently motivated well for the user by the
> docs. But they are incredibly powerful for manipulation, modification, and
> construction of the filesystem tree using Python. Since I've now had plenty of
> experience using them in this way, I'd like to have a go at improving the docs
> on these with more detailed examples this winter. The docs are too abstract
> right now.

Yes, more docs would be welcome. But I would introduce Tree/Leaf/View as
advanced filesystem manipulation after explaining the Treant and Bundle
objects. So far I haven't needed to deal with them at all when using Treant
objects.

I used to have a storage setup like that as well in my old group. This is
mostly the case when your cluster is local. For supercomputers this likely
won't be the case and you have to copy data. There my approach can help,
because sometimes it isn't feasible to copy the terabytes of raw data one
accumulates during a simulation. I would add this as a section of tips for
working with remote data.

>> *Note:* Since I'm mostly creating the simulation setups with a script I'm now
>> thinking of adding a treant with each simulation to store the parameters used to
>> create the simulation as tags and categories. This should help the processing
>> scripts to create the `processed-data` folder.

> I think this is a good idea, and I certainly do it for some parameters I use to
> generate simulations. It makes it easy to split simulations by parameter values
> via `groupby` later when I'm working with hundreds of them at once. :D