The original answer is hard to follow. In short, I have listed here the
points where I think we agree the docs need to be improved. This also
gives a suggestion for the chapter structure of the docs. Below is an
inline reply to David's remarks.
* Treants and Bundles
Basically the SciPy paper, as it gives a good overview in which the
connection between Treants and Bundles becomes clear.
* Best Practices
** Treants are 'atomic' units of a study
Each Treant should only contain data for ONE experiment/simulation. This
way one can best leverage the power of Bundle and groupby. It also
enforces that analysis scripts are written to work on a single
experiment at a time. This helps to speed up the analysis using
dask/joblib. For this we should give some examples.
** Adding Treants to existing FS structure
What is the best way to add information? What are common patterns of FS
hierarchy in use, and how can they benefit from datreant?
** Creating a new study from scratch
This is the clean-slate approach. Here we can show off how one would use
Python-only scripts to generate a FS hierarchy using only Treant objects.
* Examples
Here we list concrete examples of how people use datreant. They can be
short or in prose form.
** Remote MD Simulations (Max)
** Local MD Simulation (David)
* Tips and Tricks
** Working on remote data
*** Split Raw and processed data
Here I should describe my way of dealing with remote data.
*** Use the rsync limb
I haven't used it yet but we should also mention it here.
* Advanced usage
** FS manipulation with Tree/Leaf/View
====================================
I didn't know I could add members to it. For adding I have always run
`discover` again. I'm aware of removing items. I do that all the time
using `groupby` as a filtering tool.
>> These definitions have an important implication. Datreant is **not**
>> usable to create your 'database' on the filesystem.
> Not true, actually. These days I rarely work with my data in a shell
> or file browser, but instead build everything from the start using
> `datreant` objects. You can make directories and files directly from
> Python using `Treant`s, `Tree`s, `Leaf`s, and `View`s as the interface
> of choice for doing this Pythonically.
Interesting. I hadn't yet considered working with the Tree/Leaf/View
objects to interact with the filesystem. I do now create folders by
creating Treant objects, which makes it easier to populate a folder with
subfolders. Can you give some examples of how to use the Tree/Leaf
objects instead of the os or shutil modules to interact purely with the
filesystem (no Treants)?
>> Rather it is there to help you query the hierarchy that is already
>> established. Most important is that a treant is supposed to be for
>> only one experiment/simulation and stores processed data for it.
> I think this is a good use of Treants; they are most useful as the
> functional "atoms" of your study, whatever that is. And you can store
> persistently arbitrary information "inside" them, since they are
> filesystem objects.
We should stress that point more in the examples. Only this makes it so
powerful in combination with Bundle and groupby. For people who tend to
put everything into the same folder, datreant might therefore not be
directly usable. Here we can also show off how such a design makes
writing parallel algorithms easier.
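A minimal sketch of what I mean by that, under a few assumptions: plain
directories stand in for Treants, the standard-library
`concurrent.futures` stands in for dask/joblib, and the file name
`energy.dat` and the helper names are made up for illustration. The
point is only that the analysis function sees ONE simulation at a time.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyze(sim_dir):
    """Analyze ONE simulation directory and return (name, mean energy)."""
    text = (sim_dir / "energy.dat").read_text()  # hypothetical data file
    values = [float(v) for v in text.split()]
    return sim_dir.name, sum(values) / len(values)


def analyze_all(study_dir, workers=4):
    """Run `analyze` on every simulation directory of a study in parallel."""
    sim_dirs = sorted(p for p in Path(study_dir).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call is independent, so the pool can schedule them freely.
        return dict(pool.map(analyze, sim_dirs))


def make_demo_study(root):
    """Create three hypothetical simulation directories with toy data."""
    for i in range(3):
        sim = Path(root) / f"sim_{i}"
        sim.mkdir()
        (sim / "energy.dat").write_text(f"{i}\n{i + 2}\n")


with tempfile.TemporaryDirectory() as root:
    make_demo_study(root)
    results = analyze_all(root)
```

Because `analyze` never looks outside its own directory, swapping the
thread pool for dask or joblib should not require touching the analysis
code itself.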
>> To actually create the database in a scripted way one can use the
>> Tree, Leaf and View objects. But I assume for an existing project it
>> might be better to create the datreant objects by hand and then use
>> the bundle to move processed data into the treant.
> I disagree, but the strength of the library is that you can do as you
> like. I prefer to do all my science these days in Jupyter notebooks,
> since the context of what I did (my explanations, plots, etc.) sits
> alongside the code that did the work. The directory content of my
> Treants/Sims are created almost entirely by Python code. But you can
> use whatever style suits you, of course.
Maybe there is a misunderstanding here. My point was that creating
Treants on an existing filesystem structure isn't possible in a scripted
way. The content of the Treant/Sim is of course created in Python, but
one needs to have a Treant there first.
>> The docs should mostly talk about Treant and Bundle. Tree, Leaf and
>> View should be marked as developer information for creating tools
>> like mdsynthesis. This information can already be found in the docs
>> but it is not very clear.
> Trees, Leaves, and Views are not currently motivated well for the
> user by the docs. But they are incredibly powerful for manipulation,
> modification, and construction of the filesystem tree using Python.
> Since I've now had plenty of experience using them in this way, I'd
> like to have a go at improving the docs on these with more detailed
> examples this winter. The docs are too abstract right now.
Yes, more docs would be welcome. But I would introduce Tree/Leaf/View as
advanced filesystem manipulation after explaining the Treant and Bundle
objects. So far I haven't needed to deal with them at all when using
Treant objects.
I used to have that as well in my old group. This is mostly the case
when your cluster is local. For supercomputing clusters this likely
won't be the case and you have to copy data. There my approach can help,
because sometimes it isn't feasible to copy the terabytes of raw data
one accumulates during a simulation. I would add this as a section of
tips for working with remote data.
>> *Note:* Since I'm mostly creating the simulation setups with a script
>> I'm now thinking of adding a treant with each simulation to store the
>> parameters used to create the simulation as tags and categories. This
>> should help the processing scripts to create the `processed-data`
>> folder.
> I think this is a good idea, and I certainly do it for some
> parameters I use to generate simulations. It makes it easy to split
> simulations by parameter values via `groupby` later when I'm working
> with hundreds of them at once. :D
So that can go into a best-practices section.
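To make the pattern concrete, here is a rough sketch of recording setup
parameters next to each simulation and splitting by them later. It does
not use datreant: a `params.json` file stands in for categories, a small
hand-written `group_by` stands in for `groupby`, and all names and
parameter values are invented for illustration.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def create_simulation(root, name, **params):
    """Create a simulation directory and record its setup parameters."""
    sim = Path(root) / name
    sim.mkdir(parents=True)
    # Stand-in for storing the parameters as categories on a Treant.
    (sim / "params.json").write_text(json.dumps(params))
    return sim


def group_by(root, key):
    """Map each value of `key` to the names of the simulations using it."""
    groups = defaultdict(list)
    for meta in sorted(Path(root).glob("*/params.json")):
        value = json.loads(meta.read_text()).get(key)
        groups[value].append(meta.parent.name)
    return dict(groups)


with tempfile.TemporaryDirectory() as root:
    create_simulation(root, "run_a", temperature=300, salt="NaCl")
    create_simulation(root, "run_b", temperature=300, salt="KCl")
    create_simulation(root, "run_c", temperature=350, salt="NaCl")
    by_temperature = group_by(root, "temperature")
```

The setup script writes the parameters once, and every processing script
can then select its simulations by value instead of parsing folder names.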