Ramón wrote this to the OSS Watch mailing list:
> Massiel is a Google Code project where Israel put his scripts to do
> analysis of open source communities.
>
> http://code.google.com/p/massiel
>
> Ideally, we would like to find people who are interested in this
> kind of research and join the project.
>
Does a bunch of scripts need a Google Code project? I still don't see
what you are trying to do. Israel's scripts are too problem-specific to
have a useful goal outside his next paper. If you have a broader goal
in mind, what is it?
Pablo
Yes, why not? If we put his stuff in our svn, then nobody from outside
Oxford can access it. Google Code is just a quick and easy way to get a
public repository, issue tracker and mailing list.
As Ross suggested, the project has a generic name "Massiel" instead of a
name that reflects what we intend it to do right now. This way, it can
be extended to other research projects.
In particular, Andrea has some ideas, and it'd be interesting to see
whether we can have two research projects running in parallel with this
kind of methodology.
Cheers,
.::r
--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
I just said Israel's scripts were too problem-specific. Anyway, writing
down some goals is a good thing to do. Right now the goal is to do
some research sharing specific tools but using an open approach. Good
enough. :-)
Pablo
I agree with Andrea in general, but I also think that it's important to
start doing something, and build on that if it's successful. For the
moment, my main concern was Israel leaving when he finishes his stay in
mid-August, and the code/project dying in our repository.
But it's OK to propose new things, of course. We have the issue tracker
for that. While issue trackers are usually aimed at software
development, we have added 2 "research" categories.
http://code.google.com/p/massiel/issues/list
That is, you can suggest or comment on a research question, or can
submit a "research bug" note if the software is fine but the science
behind it isn't. Other categories can be added too, e.g. stuff about data.
If you would like to try and make workflows from Israel's scripts,
that'd be great too. Or if you want to start a new project, etc.
> I would also note that this is not really a low volume list at the
> moment, so perhaps the labeling should be changed. Two dozen messages
> (approximately) before breakfast this morning is enough to make me
> change my subscription settings to receive digests.
Sorry, my fault. I sent to the list 2 threads that we had in our
internal list, so that they are publicly available and useful for future
reference, but it was a one off, it's not representative of the expected
volume in the list.
Yes. I acknowledge that the time intervals are arbitrary, and that we
should use some criteria to define those intervals. Actually, I have
seen a couple of papers at MSR 2007, and at another conference I cannot
recall right now, about how to divide the history of a project for the
kind of analysis that we are doing (generations).
I will search for those papers, and write back to the list with that info.
> Just from a quick look at the notes that Israel has posted, this looks
> like it would be a real challenge for me to put into a workflow
> entirely by myself. However, with a little discussion about how all
> the script elements work together and what happens throughout the
> analysis, it seems like it should be possible. My technical skills are
> such that I doubt I could just "read" what's there and transform it,
> but if I can get the data and understand what's being done to it, I
> should be able to produce a workflow that does the same, preferably
> reusing the already existing analysis scripts. It could be a good
> collaborative project, if Israel has the patience for my (many)
> questions. :)
Yes, I do :-).
I have tried to make a workflow with all the things that I do with the
scripts, but it is too hard for me. I feel much more comfortable
writing Python than dragging and dropping boxes ;-). I think that it
would be easier to write a "blended" workflow, that is, a workflow
that reuses some scripts. In any case, if you want to try, I will be
glad to help you.
Cheers,
Israel
I think this would be more of a Research-Question /
Research-Feature-Request than bug.
But I still have trouble with the code as it is. In fact, the
run_generations.py script doesn't work for me, and I'm a beginner in
Python, so I'm still trying to figure out the basics, and how it
interacts with the database.
> so long as it seems relevant). This is the kind of thing that I believe
> is easier to adjust with a workflow type of implementation if the
> analysis design is sufficiently modular, though, so perhaps better
> suited to exploration in a follow-up study.
I have trouble with this too. My background involves some programming,
and I do most of my research with Matlab, GNU R, etc.
Matlab has a similar tool for workflows, Simulink, but I don't use it
because it's slower than just running code, and well-documented code
(both comments in the code and reports explaining the science behind it)
is usually good enough for me.
Changing parameter values can also be done in a configuration file.
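As a rough illustration of the configuration-file approach, Python's
standard configparser module can read the parameters. The section and
option names below (analysis, interval_days, min_commits) are made up
for the example; they are not taken from Israel's scripts.

```python
import configparser

# Hypothetical configuration text; in practice this would live in a file.
# The option names are assumptions chosen for illustration only.
EXAMPLE_CONFIG = """
[analysis]
interval_days = 365
min_commits = 1
"""


def load_parameters(text):
    """Parse analysis parameters from INI-style text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["analysis"]
    return {
        "interval_days": section.getint("interval_days"),
        "min_commits": section.getint("min_commits"),
    }


params = load_parameters(EXAMPLE_CONFIG)
print(params)
```

Changing the length of the time interval then means editing one line of
the configuration, without touching the analysis code.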
There are some cases where a flow chart may be needed, and I see its
value when communicating with other people, in particular those who
don't know the programming language, but isn't it overkill to put
everything into a workflow?
> I'm about 2 weeks from courses starting up again, so I don't know if
> I'll have time in the short run. However, it's definitely a project I
> can pitch and maybe get some support for spending my research time on that.
We have some technical limitations here too, as I think that you are not
so much into Python, and Israel is not so much into workflows. (I'm
trying to catch up with both).
So basically both of you need to stop what you are doing, and try to get
to each other's level. And I see a potential collaboration problem here.
People like to use their own tools, and it's not like we have figured
out the interface between them.
> Just from a quick look at the notes that Israel has posted, this looks
> like it would be a real challenge for me to put into a workflow entirely
> by myself. However, with a little discussion about how all the script
> elements work together and what happens throughout the analysis, it
> seems like it should be possible. My technical skills are such that I
Indeed. Israel, do you think this can be done?
Cheers,
I don't know. It seems hard to me. The generations.py module makes
some queries to extract data from the databases, and handles those data
using lists, dictionaries, etc. (that is, Python data structures). I
don't know if that could be easily adapted to a workflow.
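To make the concern concrete, here is a minimal sketch of that pattern:
one query, with the rows collected into a dictionary of lists. This is
not the actual generations.py code. The table and column names are
invented, and sqlite3 is used only so that the example runs on its own;
the real database engine and schema may be quite different.

```python
import sqlite3
from collections import defaultdict


def commits_per_author(conn):
    """Query commit rows and group the commit dates by author."""
    rows = conn.execute(
        "SELECT author, commit_date FROM commits ORDER BY commit_date"
    )
    by_author = defaultdict(list)
    for author, commit_date in rows:
        by_author[author].append(commit_date)
    return dict(by_author)


# Hypothetical schema and data, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author TEXT, commit_date TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("alice", "2007-01-15"), ("bob", "2007-03-02"), ("alice", "2008-06-30")],
)

result = commits_per_author(conn)
print(result)
```

A function like this takes a connection and returns plain data, so the
open question is mostly how a workflow step would pass such structures
on to the next step.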
Cheers,
Israel
> I think this would be more of a Research-Question /
> Research-Feature-Request than bug.
Sorry, guess I don't understand how to categorize these things in an
issue tracker. :)
>> so long as it seems relevant). This is the kind of thing that I
>> believe
>> is easier to adjust with a workflow type of implementation if the
>> analysis design is sufficiently modular, though, so perhaps better
>> suited to exploration in a follow-up study.
>
>
> I have trouble with this too. My background involves some programming,
> and I do most of my research with Matlab, GNU R, etc.
>
> Matlab has a similar tool for workflows, simulink, but I don't use it
> because it's slower than just running code, and well documented code
> (both comments in the code and reports explaining the science behind
> it)
> is usually good enough for me.
>
> Changing parameter values can also be done in a configuration file.
Yes, if you know how to code, that's all well and good. Well
documented code is not good enough for some people (myself included)
because it poses an enormous barrier to entry.
> There are some cases where a flow chart may be needed, and I see its
> value when communicating with other people, in particular those who
> don't know the programming language, but isn't it an overkill to put
> everything into a work flow?
Well, I don't think so, but I couldn't even attempt some of this
analysis otherwise. I found it funny that Israel thinks workflows are
too hard, because I think straight code is too hard. The other
benefits of which I have made relatively little mention are the
transparency of intermediate inputs and outputs (great for debugging)
and the self-documenting nature (it has embedded unique identifiers
and retains a process history with all the relevant metadata details,
which can be exported as evidence of findings) plus the part where
it's portable and replicable (we can both run the exact same analysis,
no struggling with Python for anyone). In addition, I can at least
theoretically wrap Israel's scripts in SOAP to run as a web service
and reuse them for other purposes.
So whether it's overkill or not depends on what you're trying to
achieve.
>> I'm about 2 weeks from courses starting up again, so I don't know if
>> I'll have time in the short run. However, it's definitely a project I
>> can pitch and maybe get some support for spending my research time
>> on that.
>
> We have some technical limitations here too, as I think that you are
> not
> so much into python, and Israel is not so much into workflows. (I'm
> trying to catch up with both).
>
> So basically both of you need to stop what you are doing, and try to
> get
> to each other's level. And I see a potential collaboration problem
> here.
> People like to use their own tools, and it's not like we have figured
> out the interface between them.
Yes, we have been pointing out the collaboration problem since we
started working with the workflows. Everyone wants to use what is
convenient for them, no one wants to take the time to learn new tools.
I was going to have to learn Perl, then Ruby, in order to do any work
with dynamic network analysis, but instead I spent much less time
learning Taverna and plied the R skills I already had. One of the nice
things about workflows is that you can still use the code you've
already written, but like any tool it still requires some overhead to
get anywhere with it. In some respects, it's a very different way of
thinking about how to achieve an analysis.
Cheers,
Andrea
Ross Gardler wrote:
> Andrea Wiggins wrote:
>> On Aug 12, 2008, at 5:58 AM, Ramón Casero Cañas wrote:
>>> We have some technical limitations here too, as I think that you are
>>> not
>>> so much into python, and Israel is not so much into workflows. (I'm
>>> trying to catch up with both).
>>>
>>> So basically both of you need to stop what you are doing, and try to
>>> get
>>> to each other's level. And I see a potential collaboration problem
>>> here.
Don't stop. Proceed and learn by doing. Understand the strengths and
weaknesses of each approach and overcome limitations through collaboration.
>>> People like to use their own tools, and it's not like we have figured
>>> out the interface between them.
That's a technical problem, not something that should prevent progress
in small steps.
Well written code should be modular enough to allow components to be
reused in different environments. Python scripts should be able to consume
Taverna workflow output and workflows should be able to embed Python code.
Sure, this is not easy but such technical problems are easier to
overcome than expecting everyone to converge on a single tool.
I'd suggest a first question that needs answering is "Can Taverna
execute Python code?" If it can't, what can it execute?
Ross
> That's a technical problem, not something that should prevent progress
> in small steps.
>
> Well written code should be modular enough to allow components to be
> reused in different environments. Python scripts should be able to
> one
> Taverner workflow output and workflows should be able to embed
> Python code.
>
> Sure, this is not easy but such technical problems are easier to
> overcome than expecting everyone to converge on a single tool.
Yes, exactly. That sort of translation back and forth between stand-
alone code and workflow implementation might be a good test of the
quality of the work. If nothing else, it requires agreement on the
processes that must occur during the analysis.
> I'd suggest a first question that needs answering is "can Taverna
> execute Python code?" if it can't what can it execute?
As far as I know, Taverna should be able to execute Python; it can
(supposedly) execute anything that runs out of a command line. I
haven't yet tried this functionality, but I've seen mentions of using
it successfully, e.g. for running Perl scripts. I expect gnuplot
would have to be invoked the same way.
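For a script to be usable that way, it only needs a clear command-line
contract: arguments in, text on standard output. A minimal sketch of
such a script follows. The task it performs (counting dates per year) is
invented for the example, and nothing here depends on Taverna itself.

```python
import argparse
import sys
from collections import Counter


def parse_args(argv):
    """Define the command-line interface: zero or more dates as arguments."""
    parser = argparse.ArgumentParser(
        description="Count dates per year (illustrative only)."
    )
    parser.add_argument("dates", nargs="*", help="dates in YYYY-MM-DD form")
    return parser.parse_args(argv)


def count_per_year(dates):
    """Count how many dates fall in each year, keyed by the year string."""
    return Counter(d[:4] for d in dates)


def main(argv):
    """Print one tab-separated 'year count' line per year and return the lines."""
    args = parse_args(argv)
    counts = count_per_year(args.dates)
    lines = [f"{year}\t{n}" for year, n in sorted(counts.items())]
    for line in lines:
        print(line)
    return lines


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as count_years.py, it could be run
as "python count_years.py 2007-01-15 2008-06-30" by any tool that can
launch a command and capture its output.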
Cheers,
Andrea
I have made a test with a very simple workflow. It just executes
the Python script, and then executes another script that shows the
result with Gnuplot.
It seems to work. Of course, it is so simple that it is useless for
understanding how the script works. But at least it shows that it is
possible to run Python scripts and to invoke Gnuplot.
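For readers who want to try the same two-step chain outside a workflow
tool, here is a rough sketch. It is not the test workflow described
above. The file names and data are invented, and gnuplot is only invoked
if it is found on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_plot_inputs(rows, workdir):
    """Write a whitespace-separated data file and a gnuplot script for it."""
    workdir = Path(workdir)
    data_path = workdir / "data.dat"
    script_path = workdir / "plot.gp"
    data_path.write_text("".join(f"{x} {y}\n" for x, y in rows))
    script_path.write_text(
        "set terminal png\n"
        'set output "plot.png"\n'
        'plot "data.dat" using 1:2 with linespoints title "count"\n'
    )
    return data_path, script_path


def run_gnuplot(script_path):
    """Run gnuplot on the script if it is installed; return True on success."""
    if shutil.which("gnuplot") is None:
        return False
    completed = subprocess.run(
        ["gnuplot", script_path.name], cwd=script_path.parent
    )
    return completed.returncode == 0


# Step 1: the "analysis" produces data (hypothetical values).
workdir = tempfile.mkdtemp()
data_path, script_path = write_plot_inputs([(0, 2), (1, 1)], workdir)

# Step 2: a separate step turns the data into a plot, if gnuplot is available.
plotted = run_gnuplot(script_path)
print("plot written" if plotted else "gnuplot not run")
```

The only interface between the two steps is the data file, which is the
same kind of hand-off a workflow would make between two boxes.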
Cheers,
Israel
[CAVEAT - my comments come from a complete lack of understanding of both
Taverna and Python, but a very deep understanding of solving the kinds
of tool mismatches identified in this thread. This means the specific
steps I suggest will be somewhat off the mark, but the basic process is
likely to be right. I'll leave it to you guys to work out the details ;-)]
The next stage is to split the Python code into the smallest useful
components so that they may be reused in a Taverna workflow without the
need for Andrea to truly understand *how* they work. All she should
have to care about is *what* they do.
So, for example, if the Python code does (I'm guessing at a process,
this is for illustration only):
a - get content from database
b - filter out "noise methods"
c - chunk data into time slices
d - generate summary statistics for each time slice
e - generate plots
We should have 5 components that can be dropped into a Taverna workflow.
The Python scripts will no longer be monolithic, but will simply be glue
code to put the parts together.
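A minimal sketch of what that split could look like, with one function
per step and a short piece of glue code. Like the list above, this is
for illustration only: the data, the "bot" noise rule and every function
name are invented, and a real step (a) would query the database.

```python
from collections import defaultdict
from datetime import date


def get_content():
    """a - stand-in for the database query; returns made-up (author, day) records."""
    return [
        ("alice", date(2007, 1, 15)),
        ("bot", date(2007, 2, 1)),
        ("bob", date(2007, 3, 2)),
        ("alice", date(2008, 6, 30)),
    ]


def filter_noise(records, ignored_authors=("bot",)):
    """b - drop records considered noise (here: a hypothetical automated account)."""
    return [r for r in records if r[0] not in ignored_authors]


def chunk_by_period(records, start, interval_days=365):
    """c - group records into time slices of interval_days, counted from start."""
    slices = defaultdict(list)
    for author, day in records:
        slices[(day - start).days // interval_days].append((author, day))
    return dict(slices)


def summarise(slices):
    """d - compute commit and distinct-author counts for each time slice."""
    return {
        index: {"commits": len(items), "authors": len({a for a, _ in items})}
        for index, items in sorted(slices.items())
    }


def render_plot_data(summary):
    """e - emit 'slice commits authors' rows that a plotting tool could read."""
    return [f"{i} {s['commits']} {s['authors']}" for i, s in summary.items()]


def run_pipeline(start=date(2007, 1, 1), interval_days=365):
    """Glue code: chain the five components in order."""
    records = filter_noise(get_content())
    return render_plot_data(summarise(chunk_by_period(records, start, interval_days)))


print("\n".join(run_pipeline()))
```

Each function could equally be called from the Python glue code or
wrapped as a single step in a workflow, and a parameter such as
interval_days is exposed in one place.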
Once this is done we can produce the same outputs with either Taverna or
Python. The Python code will be much easier for Andrea (and other
Taverna users) to understand/run and it will be much more likely that
they will be able to help generalise the components in ways they desire
(e.g. parameterising the time chunking).
Similarly, the Taverna workflows will be more understandable to the
Python people as the process names in the workflow will correspond to
Python components.
Finally, people like me who use neither Taverna nor Python will find it
easier to integrate the processing algorithms into other tools like
Simal (which is Java based).
Ross
In practice, the relevant details that I would need for each step are:
a - data selection criteria and required data points - I may be able
to pull appropriate data from other sources.
b - data cleaning specifics - this is almost never described in much
detail in papers' methods sections, but is crucial.
c - periodization can be handled a number of ways, particularly if
it's going to be adjustable.
d - summary stats process details - there are probably multiple
options for this as well, especially if the stats are relatively
straightforward.
e - plot generation code, with inputs (including format details) and
outputs highlighted.
I suspect that if there aren't any significant peculiarities with the
data/cleaning/stats, then I probably won't need the Python code at
all, and the gnuplot code is the only thing I'd need to actually copy
and paste. There are many ways to accomplish the data handling
processes, and ordinary stats are usually pretty easy with RShell or
Beanshell scripts.
Cheers,
Andrea
Thanks!
Andrea