Statistics for Modules


Spencer Krum

Sep 7, 2014, 6:57:22 PM
to puppe...@googlegroups.com
Hi Puppet-dev,

I've been working, with a lot of help from some others, on a new project at http://puppet-analytics.org. It is very much in the experimental/development phase and I'm looking for feedback and help.

The goal of this project is to give module authors and users greater visibility into module use. The architecture is modeled after Debian's popularity contest, where a program on a Debian system reports package use to a central server. This means that Puppet users can submit (through a JSON/HTTP endpoint) 'hey, I've deployed this version of stdlib!'. After a bunch of users have been reporting for a while, module maintainers can see the trends, identify which versions of the modules are being used, etc. Similarly, users can see which modules are the most popular, which versions of those modules are the most popular, etc.

There is an arbitrary tagging system built in that allows users to report that the deploy is being performed by their CI infrastructure, by a developer doing testing, or by an operator pushing code to production. This allows people viewing the data to see the 'true' numbers, unpolluted by CI systems or runaway web crawlers.

Reporting can be done with curl, or with a script. Right now there is a script and an example curl command for reporting to Puppet Analytics at https://github.com/nibalizer/puppet-analytics-client. I think everyone's infrastructure looks a little different, so writing a generic tool to report to PA would be pretty hard. I'd like puppet-analytics-client to become a place to collect scripts and tools that hit PA.
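For illustration, here is a minimal Ruby sketch of what a report could look like. The endpoint path and field names are placeholders I made up; the real curl example lives in the client repo:

    require 'json'
    require 'net/http'
    require 'uri'

    # Placeholder endpoint and field names -- see puppet-analytics-client
    # for the real example.
    uri = URI('http://puppet-analytics.org/api/deploys')

    payload = {
      'module'  => 'puppetlabs/stdlib',
      'version' => '4.3.2',
      'tags'    => ['production']   # or 'ci', 'development', ...
    }

    # Net::HTTP.new honours the http_proxy environment variable by default
    # (Ruby 2.0+).
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
    request.body = payload.to_json
    http.request(request)

The 'tags' field is where the CI/testing/production distinction described above would go.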

I'm interested in your thoughts and opinions, especially around the opt-in architecture. Would you be willing to report to PA? Do you think we would ever be able to get enough people reporting that the data would be significant? All the code is open source on GitHub (https://github.com/nibalizer/puppet-analytics). The website is hosted on DigitalOcean. I also have the mental model that people would report after every code change to their Puppet infrastructure, i.e. in a post-commit hook if using dynamic environments. Is this a model you agree with? Do you have a different idea?

We have had a lot of conversations, on this list and in person, around 'what are people doing with Puppet?' I think a tool like this could really help us figure out which modules are being used most often.

Please note that PA is not nearly done yet. Much of the empty space, I expect, will be filled in with cool visualizations of the data. It is liable to break at any time, especially with actual users. One of the cool features currently in a PR is the ability to have shields.io download badges come from PA and show up in the READMEs of our modules.

Thanks everybody,
Spencer

--
Spencer Krum
(619)-980-7820

Andy Parker

Sep 8, 2014, 2:21:51 PM
to puppe...@googlegroups.com
On Sun, Sep 7, 2014 at 3:57 PM, Spencer Krum <krum.s...@gmail.com> wrote:
> Hi Puppet-dev,
>
> I've been working, with a lot of help from some others, on a new project at http://puppet-analytics.org. It is very much in the experimental/development phase and I'm looking for feedback and help.
>
> The goal of this project is to give module authors and users greater visibility into module use. The architecture is modeled after Debian's popularity contest, where a program on a Debian system reports package use to a central server. This means that Puppet users can submit (through a JSON/HTTP endpoint) 'hey, I've deployed this version of stdlib!'. After a bunch of users have been reporting for a while, module maintainers can see the trends, identify which versions of the modules are being used, etc. Similarly, users can see which modules are the most popular, which versions of those modules are the most popular, etc.


This all looks awesome!
 
> There is an arbitrary tagging system built in that allows users to report that the deploy is being performed by their CI infrastructure, by a developer doing testing, or by an operator pushing code to production. This allows people viewing the data to see the 'true' numbers, unpolluted by CI systems or runaway web crawlers.


I'm wondering if there would be a way of saying "all of these installations are for the same 'site'". That would keep a module from looking popular simply because it is installed a lot, but only by two or three groups. Maybe that information is valuable, maybe not... I'm not sure yet.
 
> Reporting can be done with curl, or with a script. Right now there is a script and an example curl command for reporting to Puppet Analytics at https://github.com/nibalizer/puppet-analytics-client. I think everyone's infrastructure looks a little different, so writing a generic tool to report to PA would be pretty hard. I'd like puppet-analytics-client to become a place to collect scripts and tools that hit PA.

> I'm interested in your thoughts and opinions, especially around the opt-in architecture. Would you be willing to report to PA? Do you think we would ever be able to get enough people reporting that the data would be significant? All the code is open source on GitHub (https://github.com/nibalizer/puppet-analytics). The website is hosted on DigitalOcean. I also have the mental model that people would report after every code change to their Puppet infrastructure, i.e. in a post-commit hook if using dynamic environments. Is this a model you agree with? Do you have a different idea?


I think that is a great thing to shoot for. I'm personally a little cautious about making a deploy process depend on external services, but this could be fired off as a background job and it doesn't really matter too much if it works or not.
 
> We have had a lot of conversations, on this list and in person, around 'what are people doing with Puppet?' I think a tool like this could really help us figure out which modules are being used most often.


Currently I answer this by trawling through a dump of the Forge that we have available internally. However, my questions often revolve around how people are using the language rather than what modules are in use. That said, knowing which modules are heavily used would help everyone understand a lot more.
 
--
Andrew Parker
Freenode: zaphod42
Twitter: @aparker42
Software Developer

Join us at PuppetConf 2014, September 22-24 in San Francisco
Register by May 30th to take advantage of the Early Adopter discount —save $349!

Spencer Krum

Sep 9, 2014, 12:55:59 AM
to puppe...@googlegroups.com
Thanks for the positive feedback, Andy!



> I'm wondering if there would be a way of saying "all of these installations are for the same 'site'". That would keep a module from looking popular simply because it is installed a lot, but only by two or three groups. Maybe that information is valuable, maybe not... I'm not sure yet.

One of the common practices when building a system such as this is keeping the people who send you data anonymous. That makes filtering on user hard. We could potentially deal with that in two ways I can think of. We could allow users to set an anonymous=false flag in the JSON blob they deliver, or we could hash the source IP address and keep that around.
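For the second option, here is a rough sketch of what the server side could do (the salt handling is made up), so that reports from the same address can be grouped without storing the address itself:

    require 'digest/sha2'

    # Server-side salt so the hashes can't be reversed by simply hashing
    # every possible IPv4 address. How the salt is managed is left open.
    SALT = ENV.fetch('PA_IP_SALT', 'change-me')

    def site_token(remote_ip)
      Digest::SHA256.hexdigest(SALT + remote_ip)
    end

    site_token('203.0.113.17')  # same token for every report from this address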

The way I intended it to be used was for users doing CI to report that in the purpose field. That way we could see total deployments, but also per-usage deployments. I'm not sure users would be willing to differentiate how they run the script between production and CI though, since the goal of CI is to test it as close to prod as you can.



> I'm personally a little cautious about making a deploy process depend on external services, but this could be fired off as a background job and it doesn't really matter too much if it works or not.

I agree that it is a big pill to swallow. This will likely change, but right now every deploy must be reported in a single curl request, no bulk updates. It is also not possible to 'back-fill' data, so deploys are recorded when they are submitted to puppet-analytics. I could see deploys for the day being written to a file or database on the users' systems, then a nightly job running to fill in the day's deploys on puppet-analytics, but it would require some changes to the code.
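Roughly, I picture it like this: each deploy appends a JSON line to a local file, and a nightly cron job replays them one request at a time (the file path and endpoint below are made up):

    require 'net/http'
    require 'uri'

    # Hypothetical queue file that each deploy appends one JSON record to.
    QUEUE  = '/var/lib/puppet-analytics/deploys.jsonl'
    PA_URI = URI('http://puppet-analytics.org/api/deploys')  # placeholder path

    # Nightly job: submit each queued deploy as its own request, since PA
    # only accepts single-deploy submissions, then empty the queue.
    File.readlines(QUEUE).each do |line|
      request = Net::HTTP::Post.new(PA_URI.path, 'Content-Type' => 'application/json')
      request.body = line.strip
      Net::HTTP.new(PA_URI.host, PA_URI.port).request(request)
    end
    File.truncate(QUEUE, 0)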

I weighed the trade-offs of allowing arbitrary date insertion. I'm happy to be convinced otherwise, but figuring out when a deploy actually occurred, when reports come from a global user base across time zones, is very hard to get right.

Thanks again,
Spencer







--
Spencer Krum
(619)-980-7820

Henrik Lindberg

Sep 9, 2014, 12:53:03 PM
to puppe...@googlegroups.com
On 2014-09-09 6:55, Spencer Krum wrote:
> Thanks for the positive feedback Andy!
>
>
> I'm wondering if there would be a way of saying "all of these
> installations are for the same 'site'". That would remove a module
> looking popular simply because it is installed a lot, but only by
> two or three groups. Maybe that information is valuable, maybe
> not...I'm not sure yet.
>
>
> One of the common practices when building a system such as this is
> keeping the people who send you data anonymous. That makes filtering on
> user hard. We could potentially deal with that in two ways I can think
> of. We could allow users to set an anonymous=false flag in the JSON
> blob they deliver, or we could hash the source IP address and keep that
> around.
>

How about a UUID for every master? (Using, say, a version 5 SHA-1 UUID
to make it completely anonymous.) If we write this UUID to the
configuration it would provide a long-term identity for the master.
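Something along these lines, as a sketch (the namespace and the choice
of the certname as the name are just examples):

    require 'digest/sha1'

    # RFC 4122 DNS namespace UUID, used here only as an example namespace.
    DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'

    def uuid_v5(namespace, name)
      ns_bytes = [namespace.delete('-')].pack('H*')
      bytes = Digest::SHA1.digest(ns_bytes + name).bytes.first(16)
      bytes[6] = (bytes[6] & 0x0f) | 0x50   # set version to 5
      bytes[8] = (bytes[8] & 0x3f) | 0x80   # set the RFC 4122 variant
      hex = bytes.pack('C*').unpack('H*').first
      [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
    end

    # Generate once (from, say, the master's certname), write it to the
    # configuration, and send it with every report.
    uuid_v5(DNS_NAMESPACE, 'puppetmaster.example.com')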

> The way I intended it to be used was for users doing CI to report that
> in the purpose field. That way we could see total deployments, but also
> per-usage deployments. I'm not sure users would be willing to
> differentiate how they run the script between production and CI though,
> since the goal of CI is to test it as close to prod as you can.
>
>
>
> I'm personally a little cautious about making a deploy process
> depend on external services, but this could be fired off as a
> background job and it doesn't really matter too much if it works or not.
>
>
> I agree that it is a big pill to swallow. This will likely change, but
> right now every deploy must be reported in a single curl request, no
> bulk updates. It is also not possible to 'back-fill' data. So deploys
> are recorded when they are submitted to puppet-analytics. I could see
> deploys for the day being written to a file or database on the users'
> systems, then a nightly job running to fill in the day's deploys on
> puppet-analytics, but it would require some changes to the code.
>
That sounds like a very good improvement; it would increase the
willingness to submit reports.

> I weighed the trade-offs of allowing arbitrary date insertion. I'm happy
> to be convinced otherwise, but figuring out when a deploy actually
> occurred, when reports come from a global user base across time zones,
> is very hard to get right.
>

Some experience from Eclipse, which used to have a usage collection
framework, is that over time it returned less and less value and just
confirmed what everyone already knew from simpler measures like "number
of downloads". It ended up only wasting cycles and disk space.

Some "module" authors did make use of the facilities to measure in more
detail which features of their "modules" were actually used and how
frequently - this to ensure they focused on the right set of features,
and to prune old or expensive-to-maintain unused features. This
naturally required the module owners to make calls to the API. It was
only used by a few projects at Eclipse, and they were not happy when
the collection mechanism was turned off.

Just something worth considering.

- henrik


--

Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/

Gareth Rushgrove

Sep 10, 2014, 1:40:53 AM
to puppe...@googlegroups.com
I mentioned last night at the Portland Puppet User Group that, from a
module developer's point of view, I think this is really cool.

A couple of things I said may be worth repeating: rapid integration
with some of the dependency tools could net lots of data quickly, for
instance:

* librarian-puppet
* r10k
* geppetto
* vagrant

Probably some others I've forgotten.

Gareth




--
Gareth Rushgrove
@garethr

devopsweekly.com
morethanseven.net
garethrushgrove.com

Trevor Vaughan

Sep 10, 2014, 8:58:27 AM
to puppe...@googlegroups.com
If anyone does tool integration PLEASE make it opt-in.

It's always fun trying to explain why your tools are pounding on the inside of a corporate firewall.

Trevor





--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
tvau...@onyxpoint.com

-- This account not approved for unencrypted proprietary information --

Spencer Krum

Sep 10, 2014, 12:12:14 PM
to puppe...@googlegroups.com
Yes, integration would have to be opt-in. I'm sure the maintainers of those tools would want it that way anyway. And everything would have to respect the http_proxy variable, everyone's favorite variable in corporate settings.





--
Spencer Krum
(619)-980-7820

Gareth Rushgrove

Sep 11, 2014, 12:57:18 PM
to puppe...@googlegroups.com
On 10 September 2014 09:12, Spencer Krum <krum.s...@gmail.com> wrote:
> Yes integration would have to be opt-in. I'm sure the maintainers of those
> tools would want it that way anyways. And everything would have to respect
> the http_proxy variable, everyone's favorite variable in corporate settings.
>
> On Wed, Sep 10, 2014 at 5:58 AM, Trevor Vaughan <tvau...@onyxpoint.com>
> wrote:
>>
>> If anyone does tool integration PLEASE make it opt-in.
>>

The simplest route would be plugins, with installation of the plugin
acting as the opt-in. Vagrant and Geppetto support a plugin model,
while librarian-puppet and r10k are Ruby, so a gem and some monkey
patching should suffice.
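A rough sketch of the shape such a patch could take (the Deployer class
below is just a stand-in; a real plugin would prepend onto whichever
class the tool actually uses, and a reporting failure must never break
the deploy):

    require 'json'
    require 'net/http'
    require 'uri'

    class Deployer   # stand-in for the tool's own deploy class
      def deploy!
        # ... the tool's normal deploy work ...
      end
    end

    module PuppetAnalyticsReporting
      PA_URI = URI('http://puppet-analytics.org/api/deploys')  # placeholder path

      def deploy!(*args)
        result = super
        # Fire and forget: reporting errors are swallowed in a background thread.
        Thread.new do
          begin
            request = Net::HTTP::Post.new(PA_URI.path, 'Content-Type' => 'application/json')
            request.body = { 'module' => 'puppetlabs/stdlib', 'version' => '4.3.2' }.to_json
            Net::HTTP.new(PA_URI.host, PA_URI.port).request(request)
          rescue StandardError
            nil
          end
        end
        result
      end
    end

    Deployer.prepend(PuppetAnalyticsReporting)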

G