Project goals

Pablo Barrera González

Aug 7, 2008, 6:39:02 AM

to massiel-talk

Hi

Ramón wrote this to OSS Watch mailing list:
> Massiel is a Google Code project where Israel put his scripts to do
> analysis of open source communities.
>
> http://code.google.com/p/massiel
>
> Ideally, we would like to find people who are interested in this
> kind of research and join the project.
>

Do a bunch of scripts need a Google Code project? I still don't see
what you are trying to do. Israel's scripts are too problem-specific to
have a useful goal outside his next paper. If you have a broader goal
in mind, what is it?

Pablo

Ramón Casero Cañas

Aug 7, 2008, 6:56:10 AM

to massie...@googlegroups.com

Yes, why not? If we put his stuff in our svn, then nobody from outside
Oxford can access it. Googlecode is just a quick and easy way to get a
public repository, issue tracker and mailing list.

As Ross suggested, the project has a generic name "Massiel" instead of a
name that reflects what we intend it to do right now. This way, it can
be extended to other research projects.

In particular, Andrea has some ideas, and it'd be interesting to see
whether we can have two research projects running in parallel with this
kind of methodology.

Cheers,

.::r

--
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

Pablo Barrera González

Aug 7, 2008, 7:06:54 AM

to massie...@googlegroups.com

I just said Israel's scripts were too problem-specific. Anyway, writing
down some goals is a good thing to do. Right now the goal is to do
some research, sharing specific tools but using an open approach. Good
enough. :-)

Pablo

Andrea Wiggins

Aug 7, 2008, 7:25:27 AM

to massie...@googlegroups.com

The way I see it, this project could be very beneficial if it becomes
a repository for FLOSS research scripts and analysis. Of course, for
it to be a full-blown repository, Google Code won't really meet the
needs particularly well, but it's a reasonable starting point and
could be migrated into a repository platform in time.

If the project never collects any additional contribution besides
Israel's scripts, then it's only slightly better than posting bespoke
scripts on a project web site. If we start collecting the community's
analytic assets, then we really have something of value.

However, I think Pablo is right - goals or vision are in order to
acquire any additional contributions of research scripts/analysis.
Otherwise it's too hard for others to know what you're looking to
achieve and you'll have a tough time drawing participation, just like
any other open source community. I think there are plenty of
researchers who would be interested in participating in such a
project, but there is significant evangelism required to draw them in,
and there needs to be a plan for what happens next, i.e. moving beyond
Google Code and providing the kind of metadata and documentation to
make the scripts actually usable. The idea of using Google Code this
way is fine but I don't think it will meet the needs of a broader
audience in the long term. Ideally I'd like to see this eventually
move to a "proper" repository platform or at least have associated
metadata records, and in time I expect the larger community will
develop some kind of FLOSS portal that links out to data sources,
analysis scripts, and so on.

I would also note that this is not really a low volume list at the
moment, so perhaps the labeling should be changed. Two dozen messages
(approximately) before breakfast this morning is enough to make me
change my subscription settings to receive digests.

Cheers,

Andrea

Ramón Casero Cañas

Aug 7, 2008, 8:53:06 AM

to massie...@googlegroups.com

Andrea Wiggins wrote:
>
> audience in the long term. Ideally I'd like to see this eventually
> move to a "proper" repository platform or at least have associated
> metadata records, and in time I expect the larger community will
> develop some kind of FLOSS portal that links out to data sources,
> analysis scripts, and so on.

I agree with Andrea in general, but I also think that it's important to
start doing something, and build on that if it's successful. For the
moment, my main concern was Israel leaving when he finishes his stay in
mid-August, and the code/project dying in our repository.

But it's OK to propose new things, of course. We have the issue tracker
for that. While issue trackers are usually aimed at software
development, we have added 2 "research" categories.

http://code.google.com/p/massiel/issues/list

That is, you can suggest or comment on a research question, or can
submit a "research bug" note if the software is fine but the science
behind it isn't. Other categories can be added too, e.g. stuff about data.

If you would like to try and make workflows from Israel's scripts,
that'd be great too. Or if you want to start a new project, etc.

> I would also note that this is not really a low volume list at the
> moment, so perhaps the labeling should be changed. Two dozen messages
> (approximately) before breakfast this morning is enough to make me
> change my subscription settings to receive digests.

Sorry, my fault. I sent to the list 2 threads that we had in our
internal list, so that they are publicly available and useful for future
reference, but it was a one off, it's not representative of the expected
volume in the list.

Andrea Wiggins

Aug 7, 2008, 10:17:29 AM

to massie...@googlegroups.com

On Aug 7, 2008, at 8:53 AM, Ramón Casero Cañas wrote:

> I agree with Andrea in general, but I also think that it's important to
> start doing something, and build on that if it's successful. For the
> moment, my main concern was Israel leaving when he finishes his stay in
> mid-August, and the code/project dying in our repository.

This context helps. I also agree that you have to start somewhere, or nothing gets done. I just see that this could be the stake in the ground around which we can rally for something bigger, and I think the time for that is just around the corner. In the meantime, it seems that the tool suits the immediate purpose.

> But it's OK to propose new things, of course. We have the issue tracker
> for that. While issue trackers are usually aimed at software
> development, we have added 2 "research" categories.
>
> http://code.google.com/p/massiel/issues/list
>
> That is, you can suggest or comment on a research question, or can
> submit a "research bug" note if the software is fine but the science
> behind it isn't. Other categories can be added too, e.g. stuff about data.

Ah, yes - I did want to mention that the one immediate thing I noted about Israel's analysis was the choice of time periods. James Howison and I have had some discussions about how to make non-arbitrary time periods for time series analysis as we've done a little tinkering with data that makes us think there has to be something better. I think there are some good methodological contributions to be made there, but I hardly know where to start.

My main "research bug" (sorry, the GCode site is throwing me a server error just now, gotta love distributed services...) would be to suggest trying different methods of selecting periods. Obviously the number of periods included in the analysis will affect the granularity of the results, but who is to say that 20 equal-length time periods is the most useful way to slice and dice that data? Periods could also be selected according to key events (i.e. releases, could compare results for periods based on "major" versus "minor" releases) or accumulation of some number of interactions (i.e. every 20 CVS commits or what have you, so long as it seems relevant). This is the kind of thing that I believe is easier to adjust with a workflow type of implementation if the analysis design is sufficiently modular, though, so perhaps better suited to exploration in a follow-up study.
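
To make the three options concrete, here is a minimal sketch of equal-length, fixed-count and event-based periodization. It is not taken from Israel's scripts: the function names, the sample commit dates and the release dates are all hypothetical, and the real analysis may slice its data quite differently.

```python
import bisect
from datetime import datetime, timedelta


def equal_length_periods(timestamps, n_periods):
    """Split events into n_periods slices of equal duration.

    Assumes at least one timestamp is given.
    """
    ordered = sorted(timestamps)
    start, end = ordered[0], ordered[-1]
    span = (end - start) / n_periods
    periods = [[] for _ in range(n_periods)]
    for ts in ordered:
        # The last event sits exactly on the end boundary, so clamp its index.
        index = min(int((ts - start) / span), n_periods - 1) if span else 0
        periods[index].append(ts)
    return periods


def fixed_count_periods(timestamps, events_per_period):
    """Start a new period after every events_per_period events."""
    ordered = sorted(timestamps)
    return [ordered[i:i + events_per_period]
            for i in range(0, len(ordered), events_per_period)]


def event_based_periods(timestamps, boundaries):
    """Cut the history at the given boundary dates (for example releases)."""
    cuts = sorted(boundaries)
    periods = [[] for _ in range(len(cuts) + 1)]
    for ts in sorted(timestamps):
        # bisect_right counts how many boundaries lie at or before ts.
        periods[bisect.bisect_right(cuts, ts)].append(ts)
    return periods


# Hypothetical commit history: day offsets from an arbitrary start date.
start = datetime(2008, 1, 1)
commits = [start + timedelta(days=d)
           for d in (0, 1, 2, 3, 10, 50, 51, 52, 90, 100)]
releases = [start + timedelta(days=5), start + timedelta(days=60)]

for name, periods in (
    ("equal length", equal_length_periods(commits, 4)),
    ("every 4 commits", fixed_count_periods(commits, 4)),
    ("between releases", event_based_periods(commits, releases)),
):
    print(name, [len(p) for p in periods])
```

With bursty activity, equal-length slicing can leave some periods empty while others are crowded, which is one reason to compare the methods on the same data.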

> If you would like to try and make workflows from Israel's scripts,
> that'd be great too. Or if you want to start a new project, etc.

I'm about 2 weeks from courses starting up again, so I don't know if I'll have time in the short run. However, it's definitely a project I can pitch and maybe get some support for spending my research time on that.

Just from a quick look at the notes that Israel has posted, this looks like it would be a real challenge for me to put into a workflow entirely by myself. However, with a little discussion about how all the script elements work together and what happens throughout the analysis, it seems like it should be possible. My technical skills are such that I doubt I could just "read" what's there and transform it, but if I can get the data and understand what's being done to it, I should be able to produce a workflow that does the same, preferably reusing the already existing analysis scripts. It could be a good collaborative project, if Israel has the patience for my (many) questions. :)

>> I would also note that this is not really a low volume list at the
>> moment, so perhaps the labeling should be changed. Two dozen messages
>> (approximately) before breakfast this morning is enough to make me
>> change my subscription settings to receive digests.
>
> Sorry, my fault. I sent to the list 2 threads that we had in our
> internal list, so that they are publicly available and useful for future
> reference, but it was a one off, it's not representative of the expected
> volume in the list.

Got it, thanks for the explanation - I won't bother changing my settings for now. :)

Cheers,

Andrea

Israel Herraiz

Aug 7, 2008, 4:46:38 PM

to Massiel Mailing List

Excerpts from Andrea's message on Aug 7, 2008 about 3 PM:

> My main "research bug" (sorry, the GCode site is throwing me a server
> error just now, gotta love distributed services...) would be to
> suggest trying different methods of selecting periods. Obviously the
> number of periods included in the analysis will affect the granularity
> of the results, but who is to say that 20 equal-length time periods is
> the most useful way to slice and dice that data? Periods could also be
> selected according to key events (i.e. releases, could compare results
> for periods based on "major" versus "minor" releases) or accumulation
> of some number of interactions (i.e. every 20 CVS commits or what have
> you, so long as it seems relevant). This is the kind of thing that I
> believe is easier to adjust with a workflow type of implementation if
> the analysis design is sufficiently modular, though, so perhaps better
> suited to exploration in a follow-up study.

Yes. I acknowledge that the time intervals are arbitrary, and that we
should use some criteria to define those intervals. Actually, I have
seen a couple of papers at MSR 2007 and at another conference I cannot
recall right now about how to divide the history of a project for the
kind of analysis that we are doing (generations).

I will search for those papers, and write back to the list with that info.

> Just from a quick look at the notes that Israel has posted, this looks
> like it would be a real challenge for me to put into a workflow
> entirely by myself. However, with a little discussion about how all
> the script elements work together and what happens throughout the
> analysis, it seems like it should be possible. My technical skills are
> such that I doubt I could just "read" what's there and transform it,
> but if I can get the data and understand what's being done to it, I
> should be able to produce a workflow that does the same, preferably
> reusing the already existing analysis scripts. It could be a good
> collaborative project, if Israel has the patience for my (many)
> questions. :)

Yes, I do :-).

I have tried to make a workflow with all the things that I do with the
scripts, but it is too hard for me. I feel much more comfortable
writing Python than dragging and dropping boxes ;-). I think that it
would be easier to write a "blended" workflow, that is, a workflow
that reuses some scripts. In any case, if you want to try, I will be
glad to help you.

Cheers,
Israel

Ramón Casero Cañas

Aug 12, 2008, 5:58:59 AM

to massie...@googlegroups.com

Andrea Wiggins wrote:
>
> My main "research bug" (sorry, the GCode site is throwing me a server
> error just now, gotta love distributed services...) would be to suggest
> trying different methods of selecting periods. Obviously the number of
> periods included in the analysis will affect the granularity of the
> results, but who is to say that 20 equal-length time periods is the most
> useful way to slice and dice that data? Periods could also be selected
> according to key events (i.e. releases, could compare results for
> periods based on "major" versus "minor" releases) or accumulation of
> some number of interactions (i.e. every 20 CVS commits or what have you,

I think this would be more of a Research-Question /
Research-Feature-Request than bug.

But I still have trouble with the code as it is. In fact, the
run_generations.py script doesn't work for me, and I'm a beginner in
python, so I'm still trying to figure out the basics, and how it
interacts with the database.

> so long as it seems relevant). This is the kind of thing that I believe
> is easier to adjust with a workflow type of implementation if the
> analysis design is sufficiently modular, though, so perhaps better
> suited to exploration in a follow-up study.

I have trouble with this too. My background involves some programming,
and I do most of my research with Matlab, GNU R, etc.

Matlab has a similar tool for workflows, Simulink, but I don't use it
because it's slower than just running code, and well-documented code
(both comments in the code and reports explaining the science behind it)
is usually good enough for me.

Changing parameter values can also be done in a configuration file.
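
As an illustration of the configuration-file approach, here is a minimal sketch using Python's standard configparser module. The section name, the option names and the idea of storing the period settings this way are assumptions made for the example; the Massiel scripts may not read any such file.

```python
import configparser

# Hypothetical settings; in practice this text would live in a file such as
# "analysis.ini" next to the scripts and be loaded with parser.read().
EXAMPLE_CONFIG = """
[periods]
method = equal_length
count = 20
"""


def load_period_settings(text):
    """Parse the (hypothetical) [periods] section into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["periods"]
    return {
        # Fallbacks keep the analysis runnable when an option is left out.
        "method": section.get("method", fallback="equal_length"),
        "count": section.getint("count", fallback=20),
    }


settings = load_period_settings(EXAMPLE_CONFIG)
print(settings)
```

Changing the number of periods then means editing one line of the settings file rather than the analysis code.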

There are some cases where a flow chart may be needed, and I see its
value when communicating with other people, in particular those who
don't know the programming language, but isn't it overkill to put
everything into a workflow?

> I'm about 2 weeks from courses starting up again, so I don't know if
> I'll have time in the short run. However, it's definitely a project I
> can pitch and maybe get some support for spending my research time on that.

We have some technical limitations here too, as I think that you are not
so much into python, and Israel is not so much into workflows. (I'm
trying to catch up with both).

So basically both of you need to stop what you are doing, and try to get
to each other's level. And I see a potential collaboration problem here.
People like to use their own tools, and it's not like we have figured
out the interface between them.

> Just from a quick look at the notes that Israel has posted, this looks
> like it would be a real challenge for me to put into a workflow entirely
> by myself. However, with a little discussion about how all the script
> elements work together and what happens throughout the analysis, it
> seems like it should be possible. My technical skills are such that I

Indeed. Israel, do you think this can be done?

Cheers,

Israel Herraiz

Aug 12, 2008, 6:41:44 AM

to Massiel Mailing List

Excerpts from Ramón's message on Aug 12, 2008 about 10 AM:

> Indeed. Israel, do you think this can be done?

I don't know. It seems hard to me. The generations.py module makes
some queries to extract data from the databases, and handles those data
using lists, dictionaries, etc. (that is, Python data structures). I
don't know if that could be easily adapted to a workflow.

Cheers,
Israel

Andrea Wiggins

Aug 12, 2008, 6:48:37 AM

to massie...@googlegroups.com

On Aug 12, 2008, at 5:58 AM, Ramón Casero Cañas wrote:

> I think this would be more of a Research-Question /
> Research-Feature-Request than bug.

Sorry, guess I don't understand how to categorize these things in an
issue tracker. :)

>> so long as it seems relevant). This is the kind of thing that I
>> believe
>> is easier to adjust with a workflow type of implementation if the
>> analysis design is sufficiently modular, though, so perhaps better
>> suited to exploration in a follow-up study.
>
>
> I have trouble with this too. My background involves some programming,
> and I do most of my research with Matlab, GNU R, etc.
>
> Matlab has a similar tool for workflows, simulink, but I don't use it
> because it's slower than just running code, and well documented code
> (both comments in the code and reports explaining the science behind
> it)
> is usually good enough for me.
>
> Changing parameter values can also be done in a configuration file.

Yes, if you know how to code, that's all well and good. Well
documented code is not good enough for some people (myself included)
because it poses an enormous barrier to entry.

> There are some cases where a flow chart may be needed, and I see its
> value when communicating with other people, in particular those who
> don't know the programming language, but isn't it an overkill to put
> everything into a work flow?

Well, I don't think so, but I couldn't even attempt some of this
analysis otherwise. I found it funny that Israel thinks workflows are
too hard, because I think straight code is too hard. The other
benefits of which I have made relatively little mention are the
transparency of intermediate inputs and outputs (great for debugging)
and the self-documenting nature (it has embedded unique identifiers
and retains a process history with all the relevant metadata details,
which can be exported as evidence of findings) plus the part where
it's portable and replicable (we can both run the exact same analysis,
no struggling with Python for anyone). In addition, I can at least
theoretically wrap Israel's scripts in SOAP to run as a web service
and reuse them for other purposes.

So whether it's overkill or not depends on what you're trying to
achieve.

>> I'm about 2 weeks from courses starting up again, so I don't know if
>> I'll have time in the short run. However, it's definitely a project I
>> can pitch and maybe get some support for spending my research time
>> on that.
>
> We have some technical limitations here too, as I think that you are
> not
> so much into python, and Israel is not so much into workflows. (I'm
> trying to catch up with both).
>
> So basically both of you need to stop what you are doing, and try to
> get
> to each other's level. And I see a potential collaboration problem
> here.
> People like to use their own tools, and it's not like we have figured
> out the interface between them.

Yes, we have been pointing out the collaboration problem since we
started working with the workflows. Everyone wants to use what is
convenient for them, no one wants to take the time to learn new tools.
I was going to have to learn Perl, then Ruby, in order to do any work
with dynamic network analysis, but instead I spent much less time
learning Taverna and plied the R skills I already had. One of the nice
things about workflows is that you can still use the code you've
already written, but like any tool it still requires some overhead to
get anywhere with it. In some respects, it's a very different way of
thinking about how to achieve an analysis.

Cheers,

Andrea

Andrea Wiggins

Aug 12, 2008, 7:05:33 AM

to massie...@googlegroups.com

Querying data and using lists is generally no problem in the workflow
environment. Pretty much all the workflows I've used involve these
aspects. I'm not sure what you mean by dictionaries, though I haven't
yet looked at the scripts to try to figure it out.
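
For reference, a Python dictionary is a lookup table that maps keys to values. The sketch below shows the general pattern of querying a database and collecting the rows into a dictionary of lists. It uses an in-memory SQLite database with a made-up "commits" table and invented author names; it is not the schema or the logic of generations.py.

```python
import sqlite3

# Hypothetical data: one row per commit, with the author and the day number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("ana", 1), ("ben", 1), ("ana", 3), ("cho", 7), ("ana", 9)],
)


def commits_by_author(connection):
    """Return a dictionary mapping each author to the list of commit days."""
    grouped = {}
    query = "SELECT author, day FROM commits ORDER BY day"
    for author, day in connection.execute(query):
        # setdefault creates the empty list the first time an author appears.
        grouped.setdefault(author, []).append(day)
    return grouped


by_author = commits_by_author(conn)
print(by_author["ana"])
```

A workflow tool would need some equivalent keyed structure, or a list of (key, value) pairs, to carry the same information between steps.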

It is more likely that I would reuse only a portion of the scripts as
written, and would implement Beanshell processes to do things like
querying and manipulating data. Being able to figure out how data is
queried and handled and processed just from looking at the analysis
tool itself is another strength of workflows - even if the specifics
of the processes are not shown, it can be easier to understand/discuss
what's happening because of the modular structure. But as they say,
your mileage may vary. :)

Best,

Andrea

Ross Gardler

Aug 12, 2008, 2:09:01 PM

to massie...@googlegroups.com

Please can we keep threads to a single topic. When a thread naturally
progresses to a new topic change the subject, when a thread spawns a
new topic split the topics.

This is important to keep mailing list archives readable for those who
come later or when we need to refresh our memories.

Ross

Ross Gardler

Aug 12, 2008, 2:09:48 PM

to massie...@googlegroups.com

Ross Gardler

Aug 12, 2008, 2:29:12 PM

to massie...@googlegroups.com

Sorry, my hand slipped after changing the subject. This one contains
comments in-line.

Ross Gardler wrote:
> Andrea Wiggins wrote:
>> On Aug 12, 2008, at 5:58 AM, Ramón Casero Cañas wrote:

>>> We have some technical limitations here too, as I think that you are
>>> not
>>> so much into python, and Israel is not so much into workflows. (I'm
>>> trying to catch up with both).
>>>
>>> So basically both of you need to stop what you are doing, and try to
>>> get
>>> to each other's level. And I see a potential collaboration problem
>>> here.

Don't stop. Proceed and learn by doing. Understand the strengths and
weaknesses of each approach and overcome limitations through collaboration.

>>> People like to use their own tools, and it's not like we have figured
>>> out the interface between them.

That's a technical problem, not something that should prevent progress
in small steps.

Well-written code should be modular enough to allow components to be
reused in different environments. Python scripts should be able to use
Taverna workflow output, and workflows should be able to embed Python code.

Sure, this is not easy but such technical problems are easier to
overcome than expecting everyone to converge on a single tool.

I'd suggest a first question that needs answering is "can Taverna
execute Python code?" if it can't what can it execute?

Ross

Andrea Wiggins

Aug 12, 2008, 3:20:19 PM

to massie...@googlegroups.com

On Aug 12, 2008, at 2:29 PM, Ross Gardler wrote:

> That's a technical problem, not something that should prevent progress
> in small steps.
>
> Well-written code should be modular enough to allow components to be
> reused in different environments. Python scripts should be able to
> use Taverna workflow output, and workflows should be able to embed
> Python code.
>
> Sure, this is not easy but such technical problems are easier to
> overcome than expecting everyone to converge on a single tool.

Yes, exactly. That sort of translation back and forth between stand-
alone code and workflow implementation might be a good test of the
quality of the work. If nothing else, it requires agreement on the
processes that must occur during the analysis.

> I'd suggest a first question that needs answering is "can Taverna
> execute Python code?" if it can't what can it execute?

As far as I know, Taverna should be able to execute Python; it can
(supposedly) execute anything that runs out of a command line. I
haven't yet tried this functionality, but I've seen mentions of using
it successfully, e.g. for running Perl scripts. I expect GNU Plot
would have to be invoked the same way.

Cheers,

Andrea

Israel Herraiz

Aug 12, 2008, 3:45:13 PM

to Massiel Mailing List

Excerpts from Andrea's message on Aug 12, 2008 about 8 PM:

> As far as I know, Taverna should be able to execute Python; it can
> (supposedly) execute anything that runs out of a command line. I
> haven't yet tried this functionality, but I've seen mentions of using
> it successfully, e.g. for running Perl scripts. I expect GNU Plot
> would have to be invoked the same way.

I have made a test with a very simple workflow. It just executes
the Python script, and then executes another script that shows the
result with Gnuplot.

It seems to work. Of course it is so simple that it is useless for
understanding how the script works. But at least it shows that it is
possible to run Python scripts and to invoke Gnuplot.
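
The same two-step idea (run a script from the command line, then hand its output to Gnuplot) can be sketched outside Taverna as follows. This is not the workflow described above: the stand-in analysis code, the file names and the choice of Gnuplot's text-only "dumb" terminal are assumptions for illustration, and Gnuplot is only called if it is found on the PATH.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for an analysis script: prints "period value" pairs, one per line.
ANALYSIS_CODE = "for period in range(1, 6):\n    print(period, period * period)\n"


def run_analysis():
    """Run the stand-in script in a separate Python process, return its output."""
    result = subprocess.run(
        [sys.executable, "-c", ANALYSIS_CODE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def gnuplot_script(data_file, output_file):
    """Build Gnuplot commands that plot column 2 against column 1 as text."""
    return (
        "set terminal dumb\n"
        f"set output '{output_file}'\n"
        f"plot '{data_file}' using 1:2 with linespoints title 'value per period'\n"
    )


def main():
    workdir = Path(tempfile.mkdtemp())
    data_file = workdir / "data.dat"
    data_file.write_text(run_analysis())
    script_file = workdir / "plot.gp"
    script_file.write_text(gnuplot_script(data_file, workdir / "plot.txt"))
    gnuplot = shutil.which("gnuplot")
    if gnuplot:
        # Only attempted when Gnuplot is installed; otherwise the data and
        # the plot script are simply left in workdir for later use.
        subprocess.run([gnuplot, str(script_file)])
    return workdir


if __name__ == "__main__":
    print("files written to", main())
```

Anything that can be started this way from a command line is, in principle, a candidate for being called from a workflow step.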

Cheers,
Israel

Ross Gardler

Aug 12, 2008, 4:44:22 PM

to massie...@googlegroups.com

[CAVEAT - my comments come from a complete lack of understanding of both
Taverna and Python, but a very deep understanding of solving the kinds
of tool mismatches identified in this thread. This means the specific
steps I suggest will be somewhat off the mark, but the basic process is
likely to be right. I'll leave it to you guys to work out the details ;-)]

The next stage is to split the Python code into the smallest useful
components so that they may be reused in a Taverna workflow without the
need for Andrea to truly understand *how* they work. All she should
have to care about is *what* they do.

So, for example, if the Python code does (I'm guessing at a process,
this is for illustration only):

a - get content from database
b - filter out "noise methods"
c - chunk data into time slices
d - generate summary statistics for each time slice
e - generate plots

We should have 5 components that can be dropped into a Taverna workflow.

The Python scripts will no longer be monolithic, but will simply be glue
code to put the parts together.
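
A minimal sketch of that structure, with one small function per step (a to e) and a single glue function, might look like the following. Everything here is illustrative: the row layout, the list of "noise methods", the seven-day slices and the text-only plot are invented for the example and say nothing about what the real scripts do.

```python
from statistics import mean

# Hypothetical list of method names to ignore in step (b).
NOISE_METHODS = {"__init__", "toString"}


def get_content():
    """(a) Stand-in for the database query: (day, method, lines_changed) rows."""
    return [(1, "main", 10), (2, "__init__", 1), (3, "parse", 4),
            (8, "parse", 6), (9, "toString", 2), (12, "main", 8)]


def filter_noise(rows, noise=NOISE_METHODS):
    """(b) Drop rows whose method name is in the noise list."""
    return [row for row in rows if row[1] not in noise]


def chunk_by_time(rows, days_per_slice):
    """(c) Group rows into consecutive slices of days_per_slice days."""
    slices = {}
    for row in rows:
        slices.setdefault((row[0] - 1) // days_per_slice, []).append(row)
    return [slices[key] for key in sorted(slices)]


def summarise(slices):
    """(d) One summary record per time slice."""
    return [{"events": len(rows), "mean_lines": mean(r[2] for r in rows)}
            for rows in slices]


def plot(summaries):
    """(e) Text bar chart; a real component would call a plotting tool."""
    return "\n".join(f"slice {i}: " + "#" * s["events"]
                     for i, s in enumerate(summaries))


def run_pipeline(days_per_slice=7):
    """Glue code: the only place that knows the order of the steps."""
    rows = filter_noise(get_content())
    return plot(summarise(chunk_by_time(rows, days_per_slice)))


print(run_pipeline())
```

Because each step takes plain data in and returns plain data out, any one of them could be swapped for a workflow component without touching the others.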

Once this is done we can produce the same outputs with either Taverna or
Python. The Python code will be much easier for Andrea (and other
Taverna users) to understand/run and it will be much more likely that
they will be able to help generalise the components in ways they desire
(e.g. parameterising the time chunking).

Similarly, the Taverna workflows will be more understandable to the
Python people as the process names in the workflow will correspond to
Python components.

Finally, people like me who use neither Taverna nor Python will find it
easier to integrate the processing algorithms into other tools like
Simal (which is Java based).

Ross

Andrea Wiggins

Aug 13, 2008, 1:38:04 PM

to massie...@googlegroups.com

Ross is absolutely spot-on with this. He has outlined the major
abstract components from which I'd begin to build a functional
workflow. :)

In practice, the relevant details that I would need for each step are:

a - data selection criteria and required data points - I may be able
to pull appropriate data from other sources.
b - data cleaning specifics - this is almost never described in much
detail in papers' methods sections, but is crucial.
c - periodization can be handled a number of ways, particularly if
it's going to be adjustable.
d - summary stats process details - there are probably multiple
options for this as well, especially if the stats are relatively
straightforward.
e - plot generation code, with inputs (including format details) and
outputs highlighted.

I suspect that if there aren't any significant peculiarities with the
data/cleaning/stats, then I probably won't need the Python code at
all, and the gnuplot code is the only thing I'd need to actually copy
and paste. There are many ways to accomplish the data handling
processes, and ordinary stats are usually pretty easy with RShell or
Beanshell scripts.

Cheers,

Andrea

Andrea Wiggins

Oct 1, 2008, 9:56:07 AM

to massie...@googlegroups.com

Israel, would you mind sending me that simple workflow, or sharing it
via myExperiment? That way I wouldn't have to recreate it, which means
I'd be all the more likely to work on replicating the rest of the
analysis. :)