Generational relay, now calculated with mailing list archives

rcasero

Aug 7, 2008, 6:16:05 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 11:56
AM):

Hi all,

some days ago I sent a message about how to plot some nice graphs
about the generational relay in some of the projects of the ASF. I
have now tweaked the scripts to do the same, but using the activity in
the mailing lists instead of the Subversion commits.

The new scripts [1] need to be run with a MySQL database generated by
the Mailing List Stats tool [2]. As you have probably guessed, the
tool is quite user "unfriendly", so it might be tricky to obtain the
database for your favourite mailing list. Fortunately, I have uploaded
a database [3] (~12 MB) that contains the developers mailing list of
Forrest (yes, I know, what a coincidence ;-) . You can use any other
mailing list database obtained from FLOSSMetrics [4].

Regarding Forrest, the same conclusions that I talked about during my
presentation hold with the mailing lists. It seems that there are two
"epochs" in the project. One where there was a generational relay in
the most active group of people in the project, and one where most of
the activity has come from the same group of people for some years now.
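
One way to make the idea of a "generational relay" concrete is sketched
below in Python. This is only an illustration and not the code in the
scripts [1]: the function names, the (period, author) input format and
the 0.1 default cut-off are all assumptions. It picks the most active
group of authors in each period and then measures how much of that
group is new compared with the previous period.

```python
from collections import Counter, defaultdict


def top_group(messages, fraction=0.1):
    """Return {period: set of the most active authors in that period}.

    messages: iterable of (period, author) pairs, e.g. ("2004", "alice").
    fraction: share of a period's authors counted as the "most active
              group" (hypothetical cut-off, at least one author is kept).
    """
    per_period = defaultdict(Counter)
    for period, author in messages:
        per_period[period][author] += 1

    groups = {}
    for period, counts in per_period.items():
        size = max(1, round(len(counts) * fraction))
        groups[period] = {author for author, _ in counts.most_common(size)}
    return groups


def turnover(groups):
    """Share of each period's top group that was not in the previous one."""
    periods = sorted(groups)
    result = {}
    for previous, current in zip(periods, periods[1:]):
        newcomers = groups[current] - groups[previous]
        result[current] = len(newcomers) / len(groups[current])
    return result
```

Under these assumptions, a turnover close to 1 in successive periods
would point to a relay, while values near 0 would point to the same
core group carrying on.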

To use the scripts:

* wget http://gsyc.es/~herraiz/osswatch/ml.tar.bz2
* tar jxvf ml.tar.bz2
* wget http://gsyc.es/~herraiz/osswatch/mlstats.sql.bz2
* bunzip2 mlstats.sql.bz2
* mysqladmin -u root -pyourpassword_if_any create mlstats
* mysql -u root -p.... mlstats < mlstats.sql
* rm mlstats.sql
* python ml/run_generationsml.py

That will of course run against the Forrest database. Substitute
your favourite database if you wish. The above commands assume
that the MySQL user name is root. The data will be written to
/tmp/dataml/ and the database name must be "mlstats". You can change
the directory at the head of the run_generationsml.py file (line 6),
and the database name at the bottom (line 34).

Do not hesitate to ask if you find any problem with the scripts, or if
you want to comment on anything else.

Cheers,
Israel

[1] http://gsyc.es/~herraiz/osswatch/ml.tar.bz2
[2] http://tools.libresoft.es/mailing_list_stats
[3] http://gsyc.es/~herraiz/osswatch/mlstats.sql.bz2
[4] http://data.flossmetrics.org/

rcasero

Aug 7, 2008, 6:21:37 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 12:50
PM):

Israel Herraiz wrote:
> Regarding Forrest, the same conclusions that I talked about during my
> presentation hold with the mailing lists. It seems that there are two
> "epochs" in the project. One where there was a generational relay in
> the most active group of people in the project, and one where most of
> the activity has come from the same group of people for some years now.

Did you filter out the SVN commit mails?

Ross

rcasero

Aug 7, 2008, 6:23:35 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 01:11
PM):

Excerpts from Ross's message on Jul 28, 2008 about 12 PM:

> > Did you filter out the SVN commit mails?

Yes. The graphs contain only the developers mailing list. There are
three other lists (svn, site-svn and users) that I did not include
in this analysis.

Cheers,
Israel

rcasero

Aug 7, 2008, 6:25:51 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 04:10
PM):

Israel Herraiz wrote:
> Excerpts from Ross's message on Jul 28, 2008 about 12 PM:
>> Did you filter out the SVN commit mails?
>
> Yes. The graphs contain only the developers mailing list. There are
> three other lists (svn, site-svn and users) that I did not include
> in this analysis.

The SVN list was created two years into the project. Originally SVN
mails went to the dev list; this is fairly standard practice. Prior to
that we used CVS and these mails were sent to the dev list as well.

You should also note other automated mails to the mailing list, such
as GUMP, Forrestbot and JIRA.

I don't think you will find much of a change in the behaviour patterns
when you filter this stuff out. However, until you do there is no
way of knowing.

Ross

rcasero

Aug 7, 2008, 6:29:12 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 04:22
PM):

Here's another problem with this approach (actually it's a problem with
the next stage of this work, measuring a person's impact on a project
by the amount of discussion they generate)...

Sometimes people post in a way that stops discussion on list but
actually is highly useful to the community. For example, once an ideas
thread has run its course, the person who collates the information into
a permanent record will notify the list. This looks like the end of a
discussion and gets no "kudos"; however, the activity is actually
very valuable.

For example:
http://markmail.org/search/?q=list%3Aorg.apache.forrest.dev+gardler#query:list%3Aorg.apache.forrest.dev%20gardler%20order%3Adate-forward+page:1+mid:ayxzj363jin2upiu+state:results

I have no idea how this can be handled. It's very similar to the
single commit that solves a really sticky problem compared to the
multiple commits that reformat content.

Ross

rcasero

Aug 7, 2008, 6:32:32 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 05:16
PM):

Excerpts from Ross's message on Jul 28, 2008 about 4 PM:

> > Sometimes people post in a way that stops discussion on list but
> > actually is highly useful to the community. For example, once an ideas
> > thread has run its course, the person who collates the information into a
> > permanent record will notify the list. This looks like the end of a
> > discussion and gets no "kudos"; however, the activity is actually very
> > valuable.

Well, the methodology that I propose for the next stage is to
calculate the amount of cross correlation between two time series: one
of the series being the activity of the "motivator", and the other one
the overall activity. By activity I mean any kind of measurable
activity (number of messages, commits, bugs, number of times that the
motivator says "Simon").

If the motivator is mainly "firefighting" (that is, stopping endless
threads), and assuming that the thread has an impact on the overall
activity (a "peak" or any other similar thing, and a lack of peak
right after the motivator's intervention), there must be a cross
correlation between the two series.

Similarly, if the motivator is mainly raising questions, and that
generates more activity, there must be a cross correlation.
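
As a rough illustration of what cross correlating the two series could
look like, here is a minimal Python sketch. It assumes both series are
equally long lists of counts per time step (for example messages per
day); the function names and the choice of a plain Pearson correlation
at each lag are assumptions for illustration, not an established part
of the methodology.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is flat)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation(motivator, overall, max_lag):
    """Correlate motivator[t] with overall[t + lag] for lag = 0..max_lag.

    max_lag must be smaller than len(motivator) - 1 so that at least
    two points remain at the largest lag.
    """
    result = {}
    for lag in range(max_lag + 1):
        xs = motivator[:len(motivator) - lag]
        ys = overall[lag:]
        result[lag] = pearson(xs, ys)
    return result
```

With this sketch, a motivator whose posts are followed by more activity
would show a positive value at a small positive lag, and one whose posts
are followed by silence would show a negative value.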

I am not sure right now about the case in the middle: when someone
either firefights or raises questions. Maybe we can try the methodology
with some time window. One of the conclusions of my thesis is that
software evolution is a short memory process. When studying the time
series of a software project, the activity levels are not influenced
by events that took place long ago, but only by events that took place
very recently (say ~ 1 week).

I think this is something that we can assume. When a motivator
kills an endless thread, he/she kills it now, not in two or three
weeks. Again, when a motivator encourages participation, the
increase in activity can be measured in the following days or weeks,
but probably not much later than that.

Therefore, if we study the time windows around the points of
participation for a certain motivator, we can probably identify such
events (killing threads, fostering activity) by cross correlating the
time series in those particular windows.
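
A simpler stand-in for the windowed analysis is sketched below in
Python: instead of a full cross correlation, it compares the mean
overall activity just before and just after each time step in which the
motivator posts. The function name, the list-of-counts input and the
default window of 7 steps are assumptions made for the example.

```python
def changes_around_posts(motivator, overall, half_width=7):
    """Return [(t, mean_after - mean_before)] for each step where the motivator posts.

    motivator, overall: equally long lists of counts per time step.
    half_width: number of steps compared on each side of the post.
    """
    changes = []
    for t, posts in enumerate(motivator):
        if posts == 0:
            continue
        before = overall[max(0, t - half_width):t]
        after = overall[t + 1:t + 1 + half_width]
        if len(before) < half_width or len(after) < half_width:
            continue  # skip posts too close to either end of the series
        change = sum(after) / len(after) - sum(before) / len(before)
        changes.append((t, change))
    return changes
```

A clearly negative change would mark a post after which activity fell,
a clearly positive one a post after which it rose. The number alone
does not say why the activity changed.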

Having said this, I have to admit that I am not yet very familiar with
cross correlating time series. I have applied time series analysis in
my PhD thesis, but I did not do that, only modelling, and I have
(intentionally) forgotten my time series analysis books in
Spain. Maybe cross correlating time windows of only a couple of weeks
is not enough to obtain meaningful results. I could try to get a
book here in the library. I will try to find out how :-) . I guess
that Ramón or Pablo (who probably know much more than me about these
things) could help me with that part.

In any case, general methods are precisely that, general, and so you
can always find particular cases that cannot be detected/measured
using a general method. But I think that the idea of the "undercover"
developer, who participates very little in the community but empowers
the participation of others, is a very good research question to look
at, in spite of the particular cases that might not be detected using
the method that I propose.

In any case, comments, suggestions, and more odd cases are welcome
;-) .

Cheers,
Israel

rcasero

Aug 7, 2008, 6:34:22 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 09:37
PM):

Israel Herraiz wrote:
> Excerpts from Ross's message on Jul 28, 2008 about 4 PM:
>> Sometimes people post in a way that stops discussion on list but actually is highly useful to the community. For example, once an ideas thread has run its course, the person who collates the information into a permanent record will notify the list. This looks like the end of a discussion and gets no "kudos"; however, the activity is actually very valuable.
>
> Well, the methodology that I propose for the next stage is to
> calculate the amount of cross correlation between two time series: one
> of the series being the activity of the "motivator", and the other one
> the overall activity. By activity I mean any kind of measurable
> activity (number of messages, commits, bugs, number of times that the
> motivator says "Simon").
>
> If the motivator is mainly "firefighting" (that is, stopping endless
> threads), and assuming that the thread has an impact on the overall
> activity (a "peak" or any other similar thing, and a lack of peak
> right after the motivator's intervention), there must be a cross
> correlation between the two series.

Nope.

The thread I linked to was chosen specifically because it is an
anti-pattern to what you would expect to find unless you are intimately
familiar with the dynamics of community-led projects.

The thread in question was one of a number of early threads in which
the start-up committers were brainstorming ideas. There were, at the
time, a number of such threads because the project was finding its
initial objectives and mission. None of them were endless; they were
brainstorming, therefore anything was allowed.

The message I linked to was the culmination of all threads and
resulted in a commit to SVN of a summary of the discussion. An
extremely useful exercise as it marked out the boundaries of the
project in the early days.

So you have lots of posts, followed by a summary announcement and a
single commit. There was no ongoing activity in SVN that could be
directly attributed to that specific thread.

Your methodology as I understand it above would miss this vital
documentation activity.

So, how do you tell the difference between a firefighting post and a
conclusion post?

> Similarly, if the motivator is mainly raising questions, and that
> generates more activity, there must be a cross correlation.

How do you tell the difference between a troll and a motivator?

In a mature community this should be fairly easy as they won't feed
the trolls. But how do you know it is a mature community and how do
you adapt your model to accommodate different types of troll handling
within the community?

> I am not sure right now about the case in the middle: when someone
> either firefights or raises questions. Maybe we can try the methodology
> with some time window. One of the conclusions of my thesis is that
> software evolution is a short memory process. When studying the time
> series of a software project, the activity levels are not influenced
> by events that took place long ago, but only by events that took place
> very recently (say ~ 1 week).

I'll have to read your thesis but I really can't support that in my
(anecdotal) experience of FOSS software development within the ASF
(which does not mean *all* FOSS development). I wonder if this
finding is another manifestation of the "volunteers" misunderstanding.
It is my experience that everything done is related to past decisions.
It is for this reason that I maintain that a project memory is critical.
This allows the project to remember and learn from its past mistakes.

For evidence of this you only need to look at the recent thread on
Forrest in which a decision made around three years ago raised its
head again. This is something that surfaces on a fairly regular
pattern: see [1]

> I think this is something that we can assume. When a motivator
> kills an endless thread,

See above, this was not the killing of an endless thread - this was
the documentation of a useful thread. We need to be very careful about
making assumptions about the kinds of patterns we will find.

> Having said this, I have to admit that I am not yet very familiar with
> cross correlating time series.

You're more familiar than I am, so that's a start ;-)

> Maybe cross correlating time windows of only a couple of weeks
> is not enough to obtain meaningful results.

I don't know what it means so can make no comment. However, since I
disagree with the basic premise of your work I suspect it's better
for you to just proceed and see if it shows the results you expect.

NOTE to others reading this: Israel chose the Forrest project
precisely because my memory goes back to the early brainstorming
threads of that project. Therefore we can compare my expectations with
Israel's findings. If they do not agree then we can dig in to the
archives at specific points to find out what happened to cause the
divergence. This approach was, for me, one of the outputs of Ramón's
workshop.

> I could try to get a
> book here in the library. I will try to find out how :-) . I guess that
> Ramón or Pablo (who probably know much more than me about these
> things) could help me with that part.

If you have a staff card you can get books out of any university
library. If not then I can help out there (although to my shame I have
never been to an Oxford library).

> In any case, general methods are precisely that, general, and so you
> can always find particular cases that cannot be detected/measured
> using a general method. But I think that the idea of the "undercover"
> developer, who participates very little in the community but empowers
> the participation of others, is a very good research question to look
> at, in spite of the particular cases that might not be detected using
> the method that I propose.

Yes, I agree. I'm only trying to highlight potential issues early so
that we can think about them and look for them when they emerge. It
may well be that cases like this are so few that they will not affect
the analysis.

> In any case, comments, suggestions, and more odd cases are welcome
> ;-) .

Good - you can trust me to be a pain in the ass ;-)

Ross

[1] http://markmail.org/search/?q=list%3Aforrest+XHTML+XDoc

rcasero

Aug 7, 2008, 6:36:20 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 10:24
PM):

Excerpts from Ross's message on Jul 28, 2008 about 9 PM:

> > The message I linked to was the culmination of all threads and resulted
> > in a commit to SVN of a summary of the discussion. An extremely useful
> > exercise as it marked out the boundaries of the project in the
> > early days.

Ok. I was trying (with not too much success :-) to show how if there
are patterns in the data, like when one guy shouts everyone else shuts
up, we can find that. But if the pattern is not clear, and it is
messed up (for instance, the same guy writing many times in the same
thread), maybe the method that I propose is not valid.

> > Your methodology as I understand it above would miss this vital
> > documentation activity.

Well, yes. Actually, I do not know if I mentioned that in my
presentation, but all the data that I have shown have been obtained
considering only code commits. Therefore, documentation activity
(unless it is done in the source code -commenting and that stuff) is
not included in the data.

> > So, how do you tell the difference between a firefighting post and a
> > conclusion post?

Well, maybe I have not chosen the right terms :-) . By firefighting I
meant someone who always stops (concludes, finishes, etc) a
thread (understand *always* in a statistical sense).

> > How do you tell the difference between a troll and a motivator?

Good point. I am now "preparing the case" for the paper that I
mentioned to you, and trolls are one of the false positives of the
methodology. I have not figured out yet how to deal with that.

> > In a mature community this should be fairly easy as they won't feed the
> > trolls. But how do you know it is a mature community and how do you
> > adapt your model to accommodate different types of troll handling within
> > the community?

Well, I think that those methods are not valid for communities that
are still in early stages of the project. After all, you need enough
historical information to find out whether or not there has been a
generational relay.

> > I'll have to read your thesis but I really can't support that in my
> > (anecdotal) experience of FOSS software development within the ASF
> > (which does not mean *all* FOSS development).

I don't know if I have mentioned it, but my thesis [1] is available
under a CC Attribution ShareAlike license. Please bear in mind that it
is still under review, and some things still have a lot of room for
improvement (in particular, I am desperately looking for an English
editor, spread the word if you know someone willing to earn someone
with the painful experience of reading my thesis).

> > I wonder if this
> > finding is another manifestation of the "volunteers" misunderstanding.
> > It is my experience that everything done is related to past decisions.
> > It is for this reason that I maintain that a project memory is critical.
> > This allows the project to remember and learn from its past
> > mistakes.

I will explain it further. First of all, of course you might always
find particular cases that do not fit that finding. The
conclusion is that "for software evolution analysis, you can handle
the history of the project as weather forecasting handles
weather. Recent events have the most influence on the current history
of the project. You have to take into account the stage where your
project is. It is like weather forecasting. If it is summer and today
is sunny, it is likely that tomorrow will be sunny. If it is summer,
it is quite unlikely that tomorrow will snow. And so on...."

But that is a statistical result. From a sample of 3821 projects,
about 80% of them were driven by short memory dynamics (with a
memory of < 1 week). But some projects had very long memories.
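
To give an idea of what a "memory" of a series can mean in practice,
here is a generic Python sketch. It is not the method used in the
thesis, which is not described in this thread: it simply counts how many
consecutive lags keep an autocorrelation above a threshold. The function
names, the 0.2 threshold and the 30-step limit are assumptions chosen
for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (0.0 if the series is flat)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


def memory_length(series, threshold=0.2, max_lag=30):
    """Largest lag L such that every lag 1..L has |autocorrelation| >= threshold."""
    length = 0
    for lag in range(1, min(max_lag, len(series) - 1) + 1):
        if abs(autocorrelation(series, lag)) < threshold:
            break
        length = lag
    return length
```

For a daily activity series, a result of a few steps would correspond
to the short memory case, and a result that reaches max_lag to a long
memory one.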

And that is all. It is just a statistical property. Maybe it is just
nonsense bullshit. I don't know. At least, so far, it is useful to
build predictive models, because it tells you not to bother
considering very old events if your project is driven by short
memory dynamics. In particular, it was useful for me to win the
MSR challenge 2007, about predicting the evolution of Eclipse; the
model had a memory of only 3 days (!). And it was good enough to win.

> > For evidence of this you only need to look at the recent thread on
> > Forrest in which a decision made around three years ago raised its head
> > again. This is something that surfaces on a fairly regular pattern: see
> > [1]

Good to know. I am probably going to use Forrest as a case for the
"undercover developer" paper, and all those cases that you are
highlighting are very much appreciated :-) .

The other project I am going to use is Libresoft :-) . In our group,
Jesus will not appear as very important in the repositories. He does
not write that much in the lists, and makes commits from time to
time. But he plays by far the most important role in the group, and
fosters the participation and work of the rest of the group.

> > See above, this was not the killing of an endless thread - this was the
> > documentation of a useful thread. We need to be very careful about
> > making assumptions about the kinds of patterns we will find.

Yes. I have not chosen the proper terms. By "endless threads" I
meant very long threads, or peaks of activity (actually, not all the
activity has to be in the same thread).

> > I don't know what it means so can make no comment.

Sorry. It is difficult for me to explain that in English (it is even
difficult in Spanish). I meant that instead of using the whole history
of the motivator, we could just use portions, and analyze if there is
any pattern in those portions.

> > However, since I
> > disagree with the basic premise of your work I suspect it's better
> > for you to just proceed and see if it shows the results you expect.

I completely agree. Have a look at the quote by Richard Feynman that I
have written in one of the first pages of my thesis. We can have
endless threads ;-) about any theory of how software is developed,
but the numbers will tell us whether we are right or not (and without
disturbing the rest of the list with long discussions ;-) .

> > Good - you can trust me to be a pain in the ass ;-)

I hope that does not involve any kind of sexual harassment (sorry for
my bad jokes by the way ;-) .

Cheers,
Israel

PS: I guess we are breaking the record for the longest messages on
this list.

rcasero

Aug 7, 2008, 6:38:16 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 11:03
PM):

Israel Herraiz wrote:
> Excerpts from Ross's message on Jul 28, 2008 about 9 PM:
>> The message I linked to was the culmination of all threads and resulted in a commit to SVN of a summary of the discussion. An extremely useful exercise as it marked out the boundaries of the project in the
>> early days.
>
> Ok. I was trying (with not too much success :-) to show how if there
> are patterns in the data, like when one guy shouts everyone else shuts
> up, we can find that. But if the pattern is not clear, and it is
> messed up (for instance, the same guy writing many times in the same
> thread), maybe the method that I propose is not valid.

Nothing is perfect - but it's worth exploring and we have to start
somewhere. I'm only flagging problems as they occur to me.

>> Your methodology as I understand it above would miss this vital documentation activity.
>
> Well, yes. Actually, I do not know if I mentioned that in my
> presentation, but all the data that I have shown have been obtained
> considering only code commits. Therefore, documentation activity
> (unless it is done in the source code -commenting and that stuff) is
> not included in the data.

Actually, in the case of Forrest, all documentation is in SVN so no
problems there. We still miss the relationship between the act of
summarising the brainstorming threads and the documentation commit.
However, it would be extremely difficult to spot those in an automated
manner. We can ask Andrea to apply her team's language processing to
your statistical analysis if this works out.

>> So, how do you tell the difference between a firefighting post and a conclusion post?
>
> Well, maybe I have not chosen the right terms :-) . By firefighting I
> meant someone who always stops (concludes, finishes, etc) a
> thread (understand *always* in a statistical sense).

OK - that's a terminology thing. In English firefighting typically
means running around dealing with the most urgent issue without having
(or taking) the time to plan the overall strategy.

>> How do you tell the difference between a troll and a motivator?
>
> Good point. I am now "preparing the case" for the paper that I
> mentioned to you, and trolls are one of the false positives of the
> methodology. I have not figured out yet how to deal with that.

Again, Andrea's team's work may be useful here.

>> I'll have to read your thesis but I really can't support that in my (anecdotal) experience of FOSS software development within the ASF (which does not mean *all* FOSS development).
>
> I don't know if I have mentioned it, but my thesis [1] is available
> under a CC Attribution ShareAlike license.

Interesting. I'll try and find the time to read it.

>> I wonder if this finding is another manifestation of the "volunteers" misunderstanding. It is my experience that everything done is related to past decisions. It is for this reason that I maintain that a project memory is critical. This allows the project to remember and learn from its past
>> mistakes.
>
> I will explain it further. First of all, of course you might always
> find particular cases that do not fit that finding. The
> conclusion is that "for software evolution analysis, you can handle
> the history of the project as weather forecasting handles
> weather. Recent events have the most influence on the current history
> of the project. You have to take into account the stage where your
> project is. It is like weather forecasting. If it is summer and today
> is sunny, it is likely that tomorrow will be sunny. If it is summer,
> it is quite unlikely that tomorrow will snow. And so on...."

That sounds very reasonable, and quite different to what you
originally said - I'm relieved ;-)

The above does not exclude important long-term decisions from
influencing what is happening today - kind of like knowing we are
nearing the end of summer in the UK means the chances of two
consecutive days being sunny are reduced.

> But that is a statistical result. From a sample of 3821 projects,
> about 80% of them were driven by short memory dynamics (with a
> memory of < 1 week). But some projects had very long memories.

Now I am really interested in your thesis. This sounds really
conclusive.

>> For evidence of this you only need to look at the recent thread on Forrest in which a decision made around three years ago raised its head again. This is something that surfaces on a fairly regular pattern: see
>> [1]
>
> Good to know. I am probably going to use Forrest as a case for the
> "undercover developer" paper, and all those cases that you are
> highlighting are very much appreciated :-) .

I'm quite fascinated by the approach of statistics supported/refuted
by participants' memory.

> The other project I am going to use is Libresoft :-) . In our group,
> Jesus will not appear as very important in the repositories. He does
> not write that much in the lists, and makes commits from time to
> time. But he plays by far the most important role in the group, and
> fosters the participation and work of the rest of the group.

Good choice. I suspect you'll see something similar in Forrest,
although less pronounced. Let's see if you can successfully identify the
individual I have in mind - that would be a good test of your
theories.

>> Good - you can trust me to be a pain in the ass ;-)
>
> I hope that does not involve any kind of sexual harassment (sorry for
> my bad jokes by the way ;-) .

You should compare joke books with Rowan ;-)

> PS: I guess we are breaking the record for the longest messages on
> this list.

Actually, you're seeing me coming back to activity after a really busy
month. Every now and again I get active like this, I kick up a fuss
and then I get busy again. Most of the team have learnt to keep their
heads down, but you're new to the list so I have new fodder ;-)

Ross

rcasero

Aug 7, 2008, 6:40:40 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 04:53
PM):

Excerpts from Ross's message on Jul 28, 2008 about 4 PM:

> > The SVN list was created two years into the project. Originally SVN
> > mails went to the dev list; this is fairly standard practice. Prior to
> > that we used CVS and these mails were sent to the dev list as well.

Umm. I have missed that point :-S

> > You should also note other automated mails to the mailing list, such as
> > GUMP, Forrestbot and JIRA.

Again ummm.

> > I don't think you will find much of a change in the behaviour patterns
> > when you filter this stuff out. However, until you do there is no way
> > of knowing.

I think it should be easy to get rid of those messages in my
database. Let me try it. I will update the list once I get it done.

Cheers,
Israel

rcasero

Aug 7, 2008, 6:43:05 AM

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 06:02
PM):

Excerpts from Ross's message on Jul 28, 2008 about 4 PM:

> > You should also note other automated mails to the mailing list, such as
> > GUMP, Forrestbot and JIRA.

You are right. I have found this:

* SVN commits: 475 messages
* Replies to SVN commits: 839 messages

* CVS commits: 2739 messages
* Replies to CVS commits: 378 messages

* GUMP messages: 34 messages (subject starting with [GUMP] or
[GUMP-PATHCH] or [GUMP@)

* JIRA messages: 4189 (subject starting with [JIRA])
* Replies to JIRA: 787 messages

* Forrestbot: 603 messages (from address of the form forrestbot@)
* Replies to Forrestbot: 57 messages (all the messages that are part
of threads started by forrestbot@).
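
The criteria in the list above can be sketched as a small Python
classifier. The GUMP and JIRA subject prefixes and the forrestbot@
sender test are taken from the list; the "svn commit:" and "cvs commit:"
subject prefixes are assumptions, since the thread does not say how the
commit mails were recognised, and the function and label names are
hypothetical.

```python
import re

# Subject patterns for automated mails; matching ignores case.
SUBJECT_PATTERNS = {
    "gump": re.compile(r"^\[GUMP(\]|-PATHCH\]|@)", re.IGNORECASE),
    "jira": re.compile(r"^\[JIRA\]", re.IGNORECASE),
    "svn": re.compile(r"^svn commit:", re.IGNORECASE),  # assumed prefix
    "cvs": re.compile(r"^cvs commit:", re.IGNORECASE),  # assumed prefix
}


def classify(subject, sender):
    """Return a label for an automated message, or None for anything else."""
    if sender.lower().startswith("forrestbot@"):
        return "forrestbot"
    for label, pattern in SUBJECT_PATTERNS.items():
        if pattern.match(subject):
            return label
    return None
```

Because every pattern is anchored at the start of the subject, a reply
whose subject begins with "Re:" is not labelled as automated, so replies
can be counted separately.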

And that is all. I have removed all those messages from the
database. I have uploaded the new database.

> > I don't think you will find much of a change in the behaviour patterns
> > when you filter this stuff out. However, until you do there is no way
> > of knowing.

In spite of removing all those messages, the pattern is still the
same. Check the plot that you obtain passing the file
matrix_top_fraction_commiters-0.1-normal.plot to gnuplot (of course,
after executing the scripts).

Cheers,
Israel

rcasero

Aug 7, 2008, 6:45:04 AM

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 09:59
PM):

Israel Herraiz wrote:
> Excerpts from Ross's message on Jul 28, 2008 about 4 PM:
>> You should also note other automated mails to the mailing list, such as GUMP, Forrestbot and JIRA.
>
> You are right. I have found this:
>
> * SVN commits: 475 messages
> * Replies to SVN commits: 839 messages
>
> * CVS commits: 2739 messages
> * Replies to CVS commits: 378 messages

Actually this is important. The replies to commit messages are of
special importance. These people are doing active code review, a very
important part of the process.

In the ASF we operate commit-then-review. A reply to a commit
indicates a potential problem with the commit. Spotting such a problem
is of great importance.

> * GUMP messages: 34 messages (subject starting with [GUMP] or
> [GUMP-PATHCH] or [GUMP@)

That number can't be right. Gump sends a mail whenever a build is
broken using SVN head from any dependent apache project. Over 5 years
there must have been more than 34 breaks.

> * JIRA messages: 4189 (subject starting with [JIRA])
> * Replies to JIRA: 787 messages

If you want to treat our JIRA use as a separate topic then the
initial messages are unimportant, but replies to them show activity in
the dev list and should be included.

Furthermore, what about the 1122 from iss...@cocoondev.org that
predate the ASF JIRA instance?

> * Forrestbot: 603 messages (from address of the form forrestbot@)
> * Replies to Forrestbot: 57 messages (all the messages that are part
> of threads started by forrestbot@).

Forrestbot builds a number of test sites on a daily basis. A message
from the forrestbot indicates a problem with the build of one of those
sites. Replies to those indicate someone discussing why the build has
broken.

Finally, and off topic though, I'd say drop the forrestbot stuff but
include the replies.
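
Dropping the bot's own messages while keeping the replies could be
sketched in Python as below. The message layout (dicts with "id",
"sender" and "in_reply_to") and the function names are assumptions made
for the example; the real database schema is not shown in this thread.

```python
def thread_root(message_id, parents):
    """Follow in_reply_to links up to the first message of the thread."""
    seen = set()
    while parents.get(message_id) is not None and message_id not in seen:
        seen.add(message_id)  # guards against malformed reply loops
        message_id = parents[message_id]
    return message_id


def split_bot_threads(messages, is_bot):
    """Split message ids into (bot posts, human replies in bot threads, other).

    messages: list of dicts with "id", "sender" and "in_reply_to"
              (None for a message that starts a thread).
    is_bot:   function taking a sender address and returning True for bots.
    """
    parents = {m["id"]: m["in_reply_to"] for m in messages}
    senders = {m["id"]: m["sender"] for m in messages}
    bot_posts, human_replies, other = [], [], []
    for m in messages:
        root = thread_root(m["id"], parents)
        if is_bot(m["sender"]):
            bot_posts.append(m["id"])
        elif is_bot(senders.get(root, "")):
            human_replies.append(m["id"])
        else:
            other.append(m["id"])
    return bot_posts, human_replies, other
```

Only the first group would then be removed from the analysis. The second
group also gives a list of who is responding to the build failures.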

Of course, it would also be interesting to look at who is responding
to forrestbot complaints, since they are doing very important work,
but I guess that's a different problem.

> And that is all. I have removed all those messages from the
> database. I have uploaded the new database.

Oh - I was responding whilst reading. Up to you to decide if my
comments above are of value.

>> I don't think you will find much of a change in the behaviour patterns when you filter this stuff out, however, until you do there is no way of knowing.
>
> In spite of removing all those messages, the pattern is still the
> same. Check the plot that you obtain passing the file
> matrix_top_fraction_commiters-0.1-normal.plot to gnuplot (of course,
> after executing the scripts).

As expected then - that's good, I think.

Next up is the fact that (I assume) you are not matching people to
mail addresses. I, for example, have posted to the forrest lists with
at least 4 different mail addresses over the five years of the
project.

OFF TOPIC RAMBLING: I wonder what it would look like if you just
looked at the review messages. I consider code review to be a vital
part of a healthy project. More than one person doing reviews is a
good sign.

Ross

rcasero

Aug 7, 2008, 6:46:19 AM8/7/08

to massiel-talk

Sent by Israel Herraiz to the OSS Watch mailing list (07/28/2008 10:31
PM):

Excerpts from Ross's message on Jul 28, 2008 about 9 PM:

> > That number can't be right. Gump sends a mail whenever a build is broken
> > using SVN head from any dependent apache project. Over 5 years there
> > must have been more than 34 breaks.

Ummm. How can I identify those messages? I have used a regular
expression for the subject ("^\[GUMP\]", case insensitive)
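
As a rough sketch only (not the actual script), this is how the
pattern quoted above behaves in Python, next to a looser variant that
also accepts the "[GUMP@" and "[GUMP-" prefixes mentioned earlier in
the thread. The looser pattern and all the sample subjects are
assumptions made up for illustration.

```python
import re

# The pattern as quoted: the subject must start with "[GUMP]", any case.
STRICT = re.compile(r"^\[GUMP\]", re.IGNORECASE)

# Hypothetical looser variant: "[GUMP" followed by "]", "@" or "-".
LOOSE = re.compile(r"^\[GUMP[\]@-]", re.IGNORECASE)

# Invented subjects, only to show what each pattern catches.
samples = [
    "[GUMP] Build failure",
    "[gump@example]: project failed",
    "Re: [GUMP] Build failure",
    "Gump is broken again",
]

strict_hits = [s for s in samples if STRICT.match(s)]
loose_hits = [s for s in samples if LOOSE.match(s)]
```

Because both patterns are anchored at the start of the subject, a
reply such as "Re: [GUMP] ..." is not matched by either of them.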

> > Of course, it would also be interesting to look at who is responding to
> > forrestbot complaints. Since they are doing very important work, but I
> > guess that's a different problem.

Yes. So far I am just looking at patterns of activity. I will look
deeper into the list in case I have to resolve any controversial
finding.

> > Next up is the fact that (I assume) you are not matching people to mail
> > addresses. I, for example, have posted to the forrest lists with at
> > least 4 different mail addresses over the five years of the
> > project.

Weeeeelll. I am using email addresses as "proxies" for people. In any
case, using email addresses could only invalidate a "generational
relay" result. That is, if we find that there is a relay, but it is
just someone changing his/her address, then there is no such relay.
But if we find that in the second half of the history of Forrest most
of the activity is due to the same addresses, we do not need to go
further and match different addresses to the same person.
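
To make the argument concrete, here is a minimal sketch of what
matching addresses to people could look like if it were ever needed.
The alias table and the addresses are invented, and none of this is
part of the scripts.

```python
from collections import Counter

# Hypothetical alias table: every known address maps to one person id.
ALIASES = {
    "alice@example.org": "alice",
    "alice@example.com": "alice",
    "bob@example.org": "bob",
}

def person_for(address):
    """Map an address to a person; unknown addresses stand for themselves."""
    key = address.strip().lower()
    return ALIASES.get(key, key)

def posts_per_person(senders):
    """Count posts per person instead of per address."""
    return Counter(person_for(address) for address in senders)

# Invented list of sender addresses, one entry per message.
senders = [
    "alice@example.org",
    "Alice@Example.com",
    "bob@example.org",
    "carol@example.net",
]
counts = posts_per_person(senders)
```

Counting per address, the two alice addresses would look like two
different people, which is exactly the case where an apparent relay
would need checking.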

> > OFF TOPIC RAMBLING: I wonder what it would look like if you just looked
> > at the review messages. I consider code review to be a vital part of a
> > healthy project. More than one person doing reviews is a good sign.

Well. I am not reading the messages. I just tried to figure out how to
remove those automatic messages, and was having a quick look at random
searches. But I have not read any of the messages carefully.

Cheers,
Israel

rcasero

Aug 7, 2008, 6:47:28 AM8/7/08

to massiel-talk

Sent by Ross Gardler to the OSS Watch mailing list (07/28/2008 10:47
PM):

Israel Herraiz wrote:
> Excerpts from Ross's message on Jul 28, 2008 about 9 PM:
>> That number can't be right. Gump sends a mail whenever a build is broken using SVN head from any dependent apache project. Over 5 years there must have been more than 34 breaks.
>
> Ummm. How can I identify those messages? I have used a regular
> expression for the subject ("^\[GUMP\]", case insensitive)

I would start with searching www.markmail.org with "list:forrest gump"

...

>> Next up is the fact that (I assume) you are not matching people to mail addresses. I, for example, have posted to the forrest lists with at least 4 different mail addresses over the five years of the
>> project.
>
> Weeeeelll. I am using email addresses as "proxies" for people. In any
> case, using email addresses could only invalidate a "generational
> relay" result. That is, if we find that there is a relay, but it is
> just someone changing his/her address, then there is no such relay.
> But if we find that in the second half of the history of Forrest most
> of the activity is due to the same addresses, we do not need to go
> further and match different addresses to the same person.

OK, let's see.

>
>> OFF TOPIC RAMBLING: I wonder what it would look like if you just looked at the review messages. I consider code review to be a vital part of a healthy project. More than one person doing reviews is a good sign.
>
> Well. I am not reading the messages. I just tried to figure out how to
> remove those automatic messages, and was having a quick look at random
> searches. But I have not read any of the messages carefully.

You don't need to read them: a reply to a commit message can only come
from a human hitting return in their email client. There is no
automated system to do that. Even when replying to a commit message
from the svn list the traffic will go to the dev list.

Therefore by filtering out *all* SVN subject mails you are actually
removing important mails.
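
One way to keep the review mails, as a sketch only: treat a message as
automated when its subject starts with the commit prefix, and as a
human review when the same subject carries a reply prefix. The
"svn commit:" prefix and the sample subjects are assumptions for
illustration, not taken from the database.

```python
import re

# Assumed subject prefix of automated commit mails; real lists may differ.
COMMIT_SUBJECT = re.compile(r"^svn commit:", re.IGNORECASE)

# One or more leading "Re:" markers, as added by mail clients on reply.
REPLY_PREFIX = re.compile(r"^(re:\s*)+", re.IGNORECASE)

def classify(subject):
    """Return 'commit', 'review' or 'other' for a message subject."""
    subject = subject.strip()
    if COMMIT_SUBJECT.match(subject):
        return "commit"
    stripped = REPLY_PREFIX.sub("", subject)
    if stripped != subject and COMMIT_SUBJECT.match(stripped):
        return "review"
    return "other"

# Invented subjects, only to show the three outcomes.
subjects = [
    "svn commit: r1234 - /forrest/trunk",
    "Re: svn commit: r1234 - /forrest/trunk",
    "Plugin documentation",
]

# Drop only the automated commit mails; replies to them are kept.
kept = [s for s in subjects if classify(s) != "commit"]
```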

Sorry to be a pain, but if something is worth doing...

Ross
