CPAN-river: can graph calculation be modified?

James E Keenan

Feb 2, 2018, 10:00:02 AM

to cpan-w...@perl.org

Overall Question: How can we implement different ways of constructing
the CPAN river?

Background:

Since about this time last year I've had occasion to use the concept of
CPAN-river to derive lists of distributions to be tested against
whatever Perl 5 blead is of the moment. In particular, for the last
three months I've been creating assessments of the impact of monthly
Perl 5 development releases on the "top 1000" of the CPAN river. (See,
e.g.,
http://thenceforward.net/perl/misc/cpan-river-1000-perl-5.27-master.psv.gz)

To calculate the CPAN river, I've been using the programs developed by
David Golden found here:

https://github.com/dagolden/zzz-index-cpan-meta

... with one modification: a local branch for the second of the three
programs cited there. I use a local branch because I'm using Linux and
cannot install Ramdisk.

Problem:

As I've stared at this data over the past year I've become aware that
the order in which distros appear in the river is not necessarily the
most useful for assessing the real-world impact of changes in blead.
Put less charitably, the CPAN river can be "gamed." It is possible for
a person to release a large number of distributions which have
dependencies on other distributions by the same author. That can boost
some of those distributions high up into the CPAN river -- into, say,
the "top 1000" that I use in my monthly program.

But if that author's distributions are not depended upon by *other*
authors' distributions then they are arguably less important than those
such as Module-Build and DateTime which are depended upon by vast
numbers of distros written by people other than those distros' maintainers.

Since "testing against blead" programs take hours to run, I would like
to have that time spent focusing on what I consider to be more relevant
distros.

For the 5.29.* development cycle starting in May of this year, I would
like to be able to use a ranking of CPAN distros which goes beyond asking:

* "How many other distributions depend on this one?"

... to asking:

* "How many distributions by other authors/maintainers depend on this one?"

Would that be feasible? Has anyone attempted this already?
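
To make the difference between the two questions concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory table mapping each dist to its author and direct prerequisites; the dist names, author names, and function names are all invented for illustration, and this is not how the zzz-index-cpan-meta programs are organized.

```python
from collections import defaultdict

# Hypothetical toy data: dist -> (author, direct prerequisites).
DISTS = {
    "Core-Util": ("ANNE", []),
    "Anne-One": ("ANNE", ["Core-Util"]),
    "Anne-Two": ("ANNE", ["Anne-One"]),
    "Bob-App": ("BOB", ["Core-Util"]),
}


def reverse_deps(dists):
    """Map each dist to the set of dists that directly depend on it."""
    rev = defaultdict(set)
    for name, (_author, prereqs) in dists.items():
        for prereq in prereqs:
            rev[prereq].add(name)
    return rev


def downstream(dist, rev):
    """All dists that depend on `dist`, directly or indirectly."""
    seen, stack = set(), [dist]
    while stack:
        for dependent in rev[stack.pop()]:
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    seen.discard(dist)  # guard against dependency cycles
    return seen


def river_count(dist, dists):
    """How many other distributions depend on this one?"""
    return len(downstream(dist, reverse_deps(dists)))


def other_author_count(dist, dists):
    """How many distributions by other authors depend on this one?"""
    author = dists[dist][0]
    deps = downstream(dist, reverse_deps(dists))
    return sum(1 for d in deps if dists[d][0] != author)


print(river_count("Core-Util", DISTS), other_author_count("Core-Util", DISTS))
```

In this toy data Core-Util has three downstream dists but only one of them comes from a different author, so the two measures diverge. A real implementation would also have to decide how to treat co-maintainers, which this sketch ignores.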

Thank you very much.
Jim Keenan

James E Keenan

Feb 2, 2018, 12:00:02 PM

to cpan-w...@perl.org, Neil Bowers

On 02/02/2018 10:51 AM, Neil Bowers wrote:
>> For the 5.29.* development cycle starting in May of this year, I would
>> like to be able to use a ranking of CPAN distros which goes beyond asking:
>>
>> * "How many other distributions depend on this one?"
>>
>> ... to asking:
>>
>> * "How many distributions by other authors/maintainers depend on this
>> one?"
>>
>> Would that be feasible? Has anyone attempted this already?
>

> When we were discussing the River model at QAH, and in discussions
> afterwards, this came up. In the end we decided to keep things simple
> and go with the current common definition. There are some tools in the
> CPAN ecosystem that only count dependencies written by others.
>

Can you point us toward those tools?

> We’d need to agree which dists get ignored in this alternate scheme.

Please note that I'm not looking to replace the current definition. I'm
looking to develop supplementary definition(s) -- and their
implementations -- that can be useful in particular circumstances.

> Consider this example:
>
>
> Here MARY has released a bunch of dists, but Foo-Bar is also relied on
> by other dists written by MUNGO and MIDGE.
>
> The river count for Foo-Bar would be 2 here (ignoring the whole branch
> that contains only dists from MARY), but the Foo river count should be
> 3, I think. Foo-Bar “counts”, because it in turn is depended on by dists
> from other authors. Otherwise the river count would be 2 for both Foo
> and Foo-Bar. Basically we’re starting at the “bottom” of the dependency
> graph, and trimming sub-graphs all from one author.
>
> Also consider this example:
>
>
> What’s the river count of Plant — 0, 1, or 3? I think it should be 1, in
> this alternate measure.
>
> I.e. for sub-graphs by the same author, you only include the dist at the
> head of the sub-graph.
>
> It would be useful to have both measures available: raw-river and
> author-river.
>
> When looking at a dist there are (at least) three figures that might be
> of interest: the full river count (total number of direct and indirect
> dependencies), the author-filtered river count (as above), and the
> number of direct dependencies (which could be split in 2 as well).
>
> Neil

James E Keenan

Feb 2, 2018, 12:15:02 PM

to cpan-w...@perl.org, H.Merijn Brand

On 02/02/2018 11:08 AM, H.Merijn Brand wrote:

> On Fri, 2 Feb 2018 15:51:32 +0000, Neil Bowers
> <neil....@cogendo.com> wrote:
>
>>> For the 5.29.* development cycle starting in May of this year, I would like to be able to use a ranking of CPAN distros which goes beyond asking:
>>>
>>> * "How many other distributions depend on this one?"
>>>
>>> ... to asking:
>>>
>>> * "How many distributions by other authors/maintainers depend on this one?"
>>>
>>> Would that be feasible? Has anyone attempted this already?
>>

>> When we were discussing the River model at QAH, and in discussions afterwards, this came up. In the end we decided to keep things simple and go with the current common definition. There are some tools in the CPAN ecosystem that only count dependencies written by others.
>>

>> We’d need to agree which dists get ignored in this alternate scheme. Consider this example:

>>
>>
>>
>> Here MARY has released a bunch of dists, but Foo-Bar is also relied
>> on by other dists written by MUNGO and MIDGE.
>>
>> The river count for Foo-Bar would be 2 here (ignoring the whole
>> branch that contains only dists from MARY), but the Foo river count
>> should be 3, I think. Foo-Bar “counts”, because it in turn is
>> depended on by dists from other authors. Otherwise the river count
>> would be 2 for both Foo and Foo-Bar. Basically we’re starting at the
>> “bottom” of the dependency graph, and trimming sub-graphs all from
>> one author.
>
>
>> Also consider this example:
>>
>> What’s the river count of Plant — 0, 1, or 3? I think it should be 1,
>> in this alternate measure.
>

> 1 or 3: 1 if module chains from the same author are "compressed" to 1,
> 3 if not
>
> More interesting would be
>
> Thing - Plant - Fruit - Banana - Silver Banana - Distasteful stuff
> JOHN    PAUL    RINGO   RINGO    RINGO           GEORGE
>
> would Plant now be 1, 2, or 4?
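
For a strictly linear chain like this one, the "compressed" reading can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chain_counts` is invented, the chain is listed upstream first, and a consecutive run of dists by the same author is counted once. It does not say how branching graphs should be handled.

```python
from itertools import groupby

# The chain from the example above, upstream first: (dist, author).
CHAIN = [
    ("Thing", "JOHN"),
    ("Plant", "PAUL"),
    ("Fruit", "RINGO"),
    ("Banana", "RINGO"),
    ("Silver Banana", "RINGO"),
    ("Distasteful stuff", "GEORGE"),
]


def chain_counts(chain, dist):
    """Return (raw, squeezed) downstream counts for `dist` in a linear chain."""
    names = [name for name, _author in chain]
    below = chain[names.index(dist) + 1:]  # everything downstream of `dist`
    raw = len(below)
    # Each consecutive run of dists by one author is counted once.
    squeezed = sum(1 for _run in groupby(below, key=lambda item: item[1]))
    return raw, squeezed


print(chain_counts(CHAIN, "Plant"))  # (4, 2)
```

Under these assumptions Plant gets 4 uncompressed and 2 compressed, since the three RINGO dists collapse into one. The answer of 1 corresponds to some other rule that this sketch does not attempt to model.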

>
>> I.e. for sub-graphs by the same author, you only include the dist at
>> the head of the sub-graph.
>

> I'd suggest having an option to squeeze any unbranched chain of
> modules from the same author to 1
>

I *think* that's what I'm aiming for. Let's say I have a CPAN distro
called Gamma on which nothing else depends. I refactor code out of
Gamma into Beta, such that Gamma now depends on Beta. By the standard
definition, Beta moves up-river, Gamma down-river.

Next I refactor code out of Beta into Alpha. Alpha is now farther
up-river than both Beta and Gamma.

Suppose that Alpha now falls into the "top 1000" of the CPAN river, and
that I then switch Perl community roles and start to play the role of
"rapid BBC evaluator." A certain portion of my BBC program is now taken
up with testing Alpha. But, assuming I confine my focus to the top
1000, that means some *other* CPAN distribution -- perhaps one whose
revdeps are from different authors -- has been pushed out of the top
1000. That means the data I generate for P5P has been skewed toward
myself. That's what I'd like to avert.
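
The displacement effect can be shown on a tiny scale with a "top 1" cutoff standing in for the top 1000. This is a hedged Python sketch, not a proposal for the real tooling: Alpha, Beta and Gamma follow the scenario above, while the author IDs and the dists Widely-Used and Their-App are hypothetical, as are the function names.

```python
# dist -> (author, direct prerequisites). Alpha, Beta and Gamma follow the
# scenario above; the author IDs and the other two dists are hypothetical.
DISTS = {
    "Alpha": ("ME", []),
    "Beta": ("ME", ["Alpha"]),
    "Gamma": ("ME", ["Beta"]),
    "Widely-Used": ("US", []),
    "Their-App": ("THEM", ["Widely-Used"]),
}


def dependents(dist):
    """Dists that depend on `dist`, directly or indirectly."""
    found, frontier = set(), [dist]
    while frontier:
        current = frontier.pop()
        for name, (_author, prereqs) in DISTS.items():
            if current in prereqs and name not in found:
                found.add(name)
                frontier.append(name)
    found.discard(dist)  # guard against dependency cycles
    return found


def top(n, other_authors_only):
    """The n highest-ranked dists, ties broken alphabetically."""
    def score(dist):
        author = DISTS[dist][0]
        return sum(
            1 for d in dependents(dist)
            if not other_authors_only or DISTS[d][0] != author
        )
    return sorted(DISTS, key=lambda d: (-score(d), d))[:n]


print(top(1, other_authors_only=False))  # ['Alpha']
print(top(1, other_authors_only=True))   # ['Widely-Used']
```

With the standard count, Alpha (two downstream dists, both mine) takes the single slot. With the other-authors filter, Widely-Used takes it instead, because its one dependent comes from a different author.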

>> It would be useful to have both measures available: raw-river and
>> author-river.
>>
>> When looking at a dist there are (at least) three figures that might
>> be of interest: the full river count (total number of direct and
>> indirect dependencies), the author-filtered river count (as above),
>> and the number of direct dependencies (which could be split in 2 as
>> well).
>>
>> Neil
>
