[GSoC2016] Interested in Usage Statistics Analysis Project


Payal Priyadarshini

Mar 8, 2016, 1:38:00 PM
to Jenkins Developers

Hey, 


This is Payal Priyadarshini, a final-year undergraduate student in the dual degree (B.Tech + M.Tech) programme of the CSE department at the Indian Institute of Technology Kharagpur. I have gone through the list of project ideas on offer, and I am interested in the Usage Statistics Analysis project.

I have experience in the data mining field, where I worked with data from Meetup [an online social networking portal that facilitates offline group meetings] to analyze the success of events, the popularity of groups, etc.


Regarding this project, I am going through a few links suggested by Daniel.


Link1 : http://stats.jenkins-ci.org/


- Jenkins Statistics [http://stats.jenkins-ci.org/jenkins-stats/svg/svgs.html]

  • The way the charts are generated can be improved. For example, instead of a linear axis we could use a logarithmic axis so that a chart covers a larger range, and we could plot size distributions. One thing I did not understand is what "nodes" represent.
  • A new UI could be developed that makes the usage statistics easier to view and understand. For example, to show plugin popularity over a given time period, we could show only the top 20 plugins, which keeps a pie chart or bar graph readable.
We could also maintain a user-adjustable time window and show stats for that window rather than only the monthly data. A similar idea about processing is already on the project's wiki page. I will elaborate on all of these points in my proposal.
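
The "top 20 plugins in an adjustable time window" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `top_plugins` function, the `(installation ID, report date, plugin names)` tuple layout, and the sample data are all hypothetical and are not taken from the existing stats code.

```python
from collections import Counter
from datetime import date


def top_plugins(reports, start, end, n=20):
    """Return the n plugins seen on the most distinct installations in [start, end]."""
    seen = set()       # (install_id, plugin) pairs that were already counted
    counts = Counter()
    for install_id, day, plugins in reports:
        if not (start <= day <= end):
            continue   # report falls outside the chosen window
        for plugin in plugins:
            if (install_id, plugin) not in seen:
                seen.add((install_id, plugin))
                counts[plugin] += 1
    return counts.most_common(n)


# Hypothetical sample data: (installation ID, report date, plugin names)
reports = [
    ("a", date(2016, 1, 5), ["git", "maven-plugin"]),
    ("a", date(2016, 2, 5), ["git"]),
    ("b", date(2016, 2, 9), ["git", "subversion"]),
    ("c", date(2015, 12, 1), ["subversion"]),
]
print(top_plugins(reports, date(2016, 1, 1), date(2016, 2, 29), n=2))
```

Counting each installation once per plugin inside the window keeps installations that report several times from inflating a plugin's popularity.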

- Jenkins plugin dependency graph [link]
  • I think the dependency graph can be exploited to tell which plugins are likely to be used together. Can someone please clarify what exactly a dependency denotes here? And where can I find the source code for this?
- Census data [link]
  • What are the metadata/fields of these JSON files?
- Repo for the current sources [link]
  • Which languages other than Groovy can candidates use for this project?
So, after going through all these links, what should my next step be to contribute to this project?
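
The question of which plugins tend to be used together can also be approached from the installed-plugin lists themselves. The following is a rough Python sketch, assuming each installation is available as a plain list of plugin names; the `co_occurrence` function and the sample lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(installations):
    """Count how many installations contain each unordered pair of plugins."""
    pairs = Counter()
    for plugins in installations:
        # Sort so that each pair always appears in the same order: (a, b) with a < b
        for a, b in combinations(sorted(set(plugins)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical plugin lists, one per installation
installations = [
    ["git", "git-client", "maven-plugin"],
    ["git", "git-client"],
    ["subversion", "maven-plugin"],
]
pairs = co_occurrence(installations)
print(pairs.most_common(1))
```

Pairs linked by a declared dependency will tend to co-occur trivially, so a real analysis would probably need to filter those out using the dependency graph.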

Looking forward to suggestions.

Thanks a lot.

Regards,
Payal Priyadarshini

Kohsuke Kawaguchi

Mar 8, 2016, 7:28:18 PM
to jenkin...@googlegroups.com
Thanks for reaching out, and sorry for missing you earlier today.

2016-03-08 10:37 GMT-08:00 Payal Priyadarshini <payal...@gmail.com>:

> Hey,
>
> This is Payal Priyadarshini, a final-year undergraduate student in the dual degree (B.Tech + M.Tech) programme of the CSE department at the Indian Institute of Technology Kharagpur. I have gone through the list of project ideas on offer, and I am interested in the Usage Statistics Analysis project.
>
> I have experience in the data mining field, where I worked with data from Meetup [an online social networking portal that facilitates offline group meetings] to analyze the success of events, the popularity of groups, etc.


Great, can you tell us more about that project? We also use meetup.com and maybe we can form some project idea around that.
 


> Regarding this project, I am going through a few links suggested by Daniel.
>
> Link1 : http://stats.jenkins-ci.org/
>
> - Jenkins Statistics [http://stats.jenkins-ci.org/jenkins-stats/svg/svgs.html]
>
>   • The way the charts are generated can be improved. For example, instead of a linear axis we could use a logarithmic axis so that a chart covers a larger range, and we could plot size distributions. One thing I did not understand is what "nodes" represent.

Think of a node as a member of a cluster. The chart tracks the combined total size of all Jenkins clusters across all installations.

The use cases for the stats graphs we have today pretty much come down to:
  • "I want to feel good looking at a graph that's growing up, up and up"
  • "I want to see the latest Jenkins installation counts / node counts / job counts /... How do I do that?"
  • "I want to put this chart in my slide" (so I'd rather want a CSV file because I know how to plot a graph in Excel and make it look the way I want)
The first use case is the one that we handle very well today :-), but other use cases, not so much.
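
For the third use case, a CSV export is easy to sketch. The Python below is only an illustration: the `write_monthly_csv` function, the column names, and the numbers are made up, and it assumes the monthly totals are already available as a dictionary keyed by month.

```python
import csv
import io


def write_monthly_csv(monthly_counts, out):
    """Write one row per month, sorted by month, with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["month", "installations", "nodes", "jobs"])
    for month in sorted(monthly_counts):
        row = monthly_counts[month]
        writer.writerow([month, row["installations"], row["nodes"], row["jobs"]])


# Made-up numbers, for illustration only
monthly_counts = {
    "2016-02": {"installations": 120, "nodes": 300, "jobs": 4500},
    "2016-01": {"installations": 100, "nodes": 250, "jobs": 4000},
}
buffer = io.StringIO()
write_monthly_csv(monthly_counts, buffer)
print(buffer.getvalue())
```

A file in this shape opens directly in Excel, which also covers the "latest counts" use case by reading the last row.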

>   • A new UI could be developed that makes the usage statistics easier to view and understand. For example, to show plugin popularity over a given time period, we could show only the top 20 plugins, which keeps a pie chart or bar graph readable.
If you are interested in looking at the data in other ways, which is great, then we should first think about what kinds of questions we want the stats to answer. I'm sure many people have tons of questions; some of mine are:
  • How many of our users are running Windows, and how many are running Linux? Of the Linux users, what is the percentage of the Debian family vs. the RHEL family?
  • What's the distribution of cluster size? Did it change over time? Does the age of the installation correlate with the cluster size? If so, what does that curve look like?
  • How quickly/often are people upgrading? Are there popular and unpopular releases? Can we spot downgrades? Do they correlate with the perceived quality of the releases (see community rating in here)? Can we use it to warn us if a release seems to be unpopular?
These are all harder questions to answer than what the stats page shows today, but I think they are more technically interesting for you, and I think the scope is quite adequate for GSoC.
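
As one illustration of the upgrade/downgrade question, the Python sketch below counts version transitions per installation. It rests on several assumptions that are not confirmed in this thread: reports are available as `(install_id, timestamp, version)` tuples, versions are dotted numbers, anything else (such as a snapshot build) is skipped, and a plain numeric comparison is good enough. That comparison would need refinement to handle weekly and LTS release lines properly.

```python
def parse_version(text):
    """Turn '1.567.8' into (1, 567, 8); return None for non-numeric versions."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        return None


def count_transitions(reports):
    """Count upgrades and downgrades between consecutive reports of each installation."""
    by_install = {}
    for install_id, timestamp, version in reports:
        parsed = parse_version(version)
        if parsed is not None:
            by_install.setdefault(install_id, []).append((timestamp, parsed))

    upgrades = downgrades = 0
    for history in by_install.values():
        history.sort()  # oldest report first
        for (_, before), (_, after) in zip(history, history[1:]):
            if after > before:
                upgrades += 1
            elif after < before:
                downgrades += 1
    return upgrades, downgrades


# Hypothetical reports: (installation ID, timestamp, Jenkins version)
reports = [
    ("a", 1, "1.642"), ("a", 2, "1.650"), ("a", 3, "1.642"),
    ("b", 1, "1.650"), ("b", 2, "1.651"),
    ("c", 1, "1.650-SNAPSHOT"),
]
print(count_transitions(reports))
```

Grouping the downgrades by the version that was abandoned would be one way to look for unpopular releases.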

If you haven't been a Jenkins user and find it difficult to get your head around what kind of questions we want statistics to answer, maybe the way to go is to go one level meta and find a way to make this data available for ad hoc queries, so that people with interesting questions can query the data using some generic language, à la Apache Pig.

 
> We could also maintain a user-adjustable time window and show stats for that window rather than only the monthly data. A similar idea about processing is already on the project's wiki page. I will elaborate on all of these points in my proposal.

This would be a great help.

> - Jenkins plugin dependency graph [link]
>   • I think the dependency graph can be exploited to tell which plugins are likely to be used together. Can someone please clarify what exactly a dependency denotes here? And where can I find the source code for this?
>
> - Census data [link]
>   • What are the metadata/fields of these JSON files?
Yeah, we should document this. Daniel or I will get back to you on that one. You'll also want to get a sense of the data set size.

 
> - Repo for the current sources [link]
>   • Which languages other than Groovy can candidates use for this project?
Java or Groovy would be preferred. That way, we have more people who can work after GSoC ends.
 
> So, after going through all these links, what should my next step be to contribute to this project?

I believe we need to drive toward you creating a project plan that you'll then submit to GSoC. It sounds to me like you still need to get yourself oriented in what exists, and probably learn a bit about Jenkins --- what it does, who uses it, that sort of thing. I think that'll help you think about which interesting questions we are trying to answer with data mining. If you want to hear more brainstorming from me or others, we are happy to provide that.

In parallel, we'd like to hear from you some specific space you want to take on --- "usage stat analysis" is still too big and vague. 

There are upcoming student office hours that you might be interested in, too.

> Looking forward to suggestions.
>
> Thanks a lot.
>
> Regards,
> Payal Priyadarshini

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/5550ea02-93b7-45ca-a992-2149311ec881%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Kohsuke Kawaguchi

Payal Priyadarshini

Mar 10, 2016, 10:31:44 AM
to Jenkins Developers
Hey Kohsuke & Daniel,

I fetched the fields of the census data using MongoDB and tried to document them for future use. I have a few questions about some of the fields, which are marked in red.

Thanks.
Payal Priyadarshini


install-ID: a list, where each element contains 7 fields.

  • stat
  • jobs: Contains 4 fields. Are these counts of the respective type?
      - Hudson-model-ExternalJob
      - Hudson-model-FreeStyleProject
      - Hudson-maven-MavenModuleSet
      - Hudson-matrix-MatrixProject
  • timestamp: Job execution starting time.
  • version: Version of what? Is it the installation?
  • install: Installation ID
  • Plugins [list]: Contains 2 fields
      - version
      - Name
  • Nodes [list]: List of nodes. Contains 4 fields
      - Master: true/false
      - jvm-version
      - jvm-vendor
      - executors


Christopher Orr

Mar 10, 2016, 7:59:18 PM
to jenkin...@googlegroups.com
Yes, I believe "jobs" contains the number of each job type present.

"version" should be the Jenkins version, e.g. "1.567.8"

Payal Priyadarshini

Mar 11, 2016, 3:55:38 PM
to Jenkins Developers
Thanks Christopher.

Here is the first draft of my proposal, containing most of the discussed ideas along with my proposed solutions. I have added a few comments where I had doubts.


Everyone is welcome to comment.  

Looking forward to more suggestions.

Regards,
Payal Priyadarshini

Payal Priyadarshini

Mar 16, 2016, 10:16:34 AM
to Jenkins Developers
Hello Everyone,

Here is the second draft of my proposal, reworked based on all the suggestions that Kohsuke & Daniel gave in the comments on the first draft. Again, I have added a few comments regarding my doubts.


As this project covers a very wide range of data mining problems that we plan to solve during the GSoC tenure, I would like to request everyone to give their suggestions on this proposal.

Looking forward to more suggestions on the proposal.

Regards,
Payal Priyadarshini

Payal Priyadarshini

Mar 20, 2016, 5:01:21 PM
to Jenkins Developers
Hey Everyone,

I have a query regarding the work timetable for the GSoC tenure, which I have already asked in one of the comments in the doc. Kohsuke suggested earlier that the current scope of work might be too large for a 3-month period. I am very interested in the upgrading/downgrading section and in laying the groundwork for a plugin recommendation system, as this work will be highly visible and widely used. But dropping the basic analysis part does not seem like a good idea to me. I am rather confused here, so I am looking forward to your suggestions.

I would like a heads-up on whether I am good to go with this proposal or not.

Thanks a lot.

Best Regards,
Payal Priyadarshini

Kohsuke Kawaguchi

Mar 21, 2016, 11:07:44 AM
to jenkin...@googlegroups.com
I've just casually gone through the proposal, but I think it's looking good.




--
Kohsuke Kawaguchi

Payal Priyadarshini

Apr 30, 2016, 10:00:33 PM
to Jenkins Developers

Hi all,


My proposal for the project "Jenkins Usage Statistics Analysis" has been accepted for GSoC 2016. So first of all, I'd like to thank each and every one of you for the valuable support you gave me.
So far I have become familiar with its source code. Please let me know the procedure I should follow and the things I should accomplish during the community bonding period.


Thanks and regards,


Payal Priyadarshini

suresh kumar

Apr 30, 2016, 10:35:38 PM
to Jenkins Developers
Congrats and All the Best.

Oleg Nenashev

May 1, 2016, 4:13:58 AM
to Jenkins Developers
Congrats Payal, welcome aboard!

Mentors of your project (Daniel and Kohsuke) should contact you shortly. They were very busy with the 2.0 release, so there was a slight communication delay. If there is no message from them by Tuesday, please let us know.

BR, Oleg

On Sunday, May 1, 2016 at 4:35:38 UTC+2, suresh kumar wrote:

Payal Priyadarshini

May 1, 2016, 5:17:29 AM
to jenkin...@googlegroups.com
Hey Oleg,

Thanks a lot.

I will wait until they are free. Looking forward to working with the Jenkins development team.

Regards,

Payal Priyadarshini

 

Daniel Beck

May 2, 2016, 6:06:10 AM
to jenkin...@googlegroups.com

> On 01.05.2016, at 11:17, Payal Priyadarshini <payal...@gmail.com> wrote:
>
> I will wait until they are free. Looking forward to working with the Jenkins development team.
>

Hi Payal,

My apologies for only responding now.

Welcome, and congratulations! Your project is important to us, and I am happy that we were able to accept your proposal.

First of all, when possible, you should join #jenkins and maybe #jenkins-infra on Freenode IRC. Due to time zone differences, chat may not work out, but give it a try anyway.

Learn Groovy, Git, and GitHub if you're not already familiar with them.

I recommend you set up an environment that lets you run the full stats generation locally with a smaller set of the data. Make sure you understand how it works. You wrote that you familiarized yourself with the source code for the stats already. If there's documentation missing, it may help you to write the documentation for it to confirm you understood how the process works. Try using Jenkins to automate your own stats generation pipeline, to learn the tool that sends the usage data you're analyzing.
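
One possible way to get "a smaller set of the data" for local runs is to sample by installation ID, so that all reports from a sampled installation stay together. The Python sketch below is only an assumption about how this could be done, not a description of the existing tooling; the `in_sample` and `sample_records` functions and the `"install"` key are hypothetical.

```python
import hashlib


def in_sample(install_id, percent=5):
    """Deterministically keep roughly `percent` percent of installation IDs."""
    digest = hashlib.sha256(install_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def sample_records(records, percent=5):
    """Keep every record whose installation falls into the sample."""
    return [record for record in records if in_sample(record["install"], percent)]


# Hypothetical records; only the "install" key matters for sampling
records = [{"install": "abc123"}, {"install": "def456"}, {"install": "abc123"}]
print(len(sample_records(records, percent=100)))
```

Because the decision depends only on the ID, rerunning the sampling on new months of data keeps the same installations, which matters for any analysis that follows an installation over time.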

Ideally think about and investigate what you need to adapt to make your future changes work, and if it's more than anticipated, consider doing some of that now.

AFAIK you also recently lost access to the statistics when we revoked external access, so please test that, and if you can no longer download the statistics, let us know ASAP so we can restore access for you. Tell us if you still have some locally you can work with -- if not, we'll provide that data.

Daniel

Payal Priyadarshini

May 8, 2016, 11:49:32 PM
to Jenkins Developers, m...@beckweb.net
Hi Daniel & Kohsuke,

My apologies for responding so late; I had my final semester exams. Now that they are over, let's get started and move the project forward at a good pace. Here are the updates on the pre-project work:
  • I joined #jenkins and #jenkins-infra on Freenode IRC, but due to time zone differences, chat did not work. So I would like to know at what times the jenkins-infra team is usually active.
  • I have started learning Groovy and am already familiar with Git.
  • I have lost access to the statistics, so I can no longer download them. Could you please restore my access?
As soon as I get access, I will start setting up an environment that runs the full stats generation with a smaller dataset.

Looking forward to your reply.

Daniel Beck

May 13, 2016, 5:36:48 PM
to jenkin...@googlegroups.com

> On 09.05.2016, at 05:49, Payal Priyadarshini <payal...@gmail.com> wrote:
>
> • Joined #jenkins and #jenkins-infra on Freenode IRC, but due to timezone differences, chat did not work. So, I want to know the timings around which jenkins-infra team is active.

That's the thing about the project's IRC channels -- sometimes, there's nothing going on for hours, other times, multiple conversations are happening at the same time. Assuming you have a persistent internet connection, it's a good idea for you to just be around to interact with others as needed, and allow others to quickly contact you.
>
> • I have lost access to the statistics because of which no longer able to download the statistics. So, can you please restore the access ?

This is tracked in the following two issues:

https://issues.jenkins-ci.org/browse/INFRA-678
https://issues.jenkins-ci.org/browse/INFRA-682

Tyler offered to work on them to unblock you, so this should be ready soon.

Oleg Nenashev

May 13, 2016, 6:46:54 PM
to JenkinsCI Developers
I think some snapshots from old times (e.g. from a backup) could be a good solution.

@Payal, I would also suggest writing a self-introduction letter to the INFRA team.
This team is a bit scattered, but I'm sure you will need to work with people there.

BR, Oleg

