INFRA-240 is fixed / post-mortem


Kohsuke Kawaguchi

Feb 16, 2015, 9:56:34 AM
to jenkin...@googlegroups.com
My apologies for the delay in handling INFRA-240. As the ticket now indicates, I've resolved the problem. The issue was that the LDAP daemon wasn't restarted when I installed a new certificate last week. So it continued running with the old certificate, and when that expired, Artifactory started refusing to talk to it.

Local apps on cucumber weren't affected because they were using unsecured communication. I need to figure out why JIRA and Confluence were unaffected by this. Perhaps they have the password locally cached, perhaps they have LDAP connections pooled and long-running, or perhaps they don't properly check the certificate.
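
A minimal sketch of a check that could catch this class of problem, where a daemon keeps serving an old certificate after a new one has been installed on disk. It compares the SHA-256 fingerprint of the certificate the server actually presents with the fingerprint of the installed file. Everything specific here is an assumption, not something taken from the Jenkins infra: the LDAPS port 636 (the thread does not say whether LDAPS or StartTLS is used), the function names, and the requirement that the PEM file contains exactly one certificate.

```python
import hashlib
import socket
import ssl


def fingerprint_der(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def fingerprint_pem(pem_text: str) -> str:
    """Fingerprint a PEM file's contents (assumed to hold a single certificate)."""
    return fingerprint_der(ssl.PEM_cert_to_DER_cert(pem_text))


def served_fingerprint(host: str, port: int = 636, timeout: float = 5.0) -> str:
    """Fingerprint the certificate the server presents during the TLS handshake.

    Verification is disabled on purpose: an expired certificate should still
    be retrievable so that it can be compared, not rejected.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    if der is None:
        raise RuntimeError("server did not present a certificate")
    return fingerprint_der(der)


def daemon_serves_installed_cert(host: str, pem_path: str, port: int = 636) -> bool:
    """True if the running daemon presents the certificate stored at pem_path."""
    with open(pem_path) as f:
        installed = fingerprint_pem(f.read())
    return served_fingerprint(host, port) == installed
```

Run after every certificate installation (with a hypothetical host name and path such as `daemon_serves_installed_cert("ldap.example.org", "/etc/ssl/ldap.pem")`), a `False` result would suggest that the daemon still needs a restart.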


The next thing I want to talk about is that I think this is a symptom of a deeper issue, which is that the infra ops coverage has fallen way behind. Tyler isn't spending as much time on this project as he used to, and the time I spend on Jenkins infra is not as much as it needs to be, either.

In the last 6 months or so, we've handed out infra access rights to a few more people (Daniel Beck and Oleg Nenashev, IIRC), and that was good for better time zone coverage and what not. But the problem still remains that there is a leadership vacuum, that no one sufficiently "owns" the infra, and that's difficult to solve by adding more hands alone.

So here's what I'd like to propose:
  • Formalize our ops team more by designating the lead that reports to the board. The lead shall be chosen in the discussion during the project meeting.
  • Under the new lead, accept another round of ops team members to help spread the workload. I know for example Kostasya is interested in helping.
  • Kohsuke (and Tyler if he can join) and the ops team will schedule a series of "transfer of information" sessions to bring the new ops lead and the team up to speed about how things are put together today.
  • Identify and remove single-point-of-failure in our infra. Off the top of my head:
    • I think I'm currently the only one who has the private key to sign update center root CA.
    • jenkins-ci.org domain name still appears to be registered under Tyler's personal account.

As the ops lead, I'd like the project to consider Adam Papai. He's been a long time user of Jenkins and he is a member of the CloudBees ops team. I'm sensitive to the fact that he works for CloudBees and how that can come across, but OTOH this will be a part of his day job, and I think that ensures that he can allocate necessary time to the effort.

What do people think?

Mark Waite

Feb 16, 2015, 10:10:57 AM
to jenkin...@googlegroups.com
+1

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/fca1745f-2083-48f4-b94c-414be6796d6a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks!
Mark Waite

Stephen Connolly

Feb 16, 2015, 10:37:15 AM
to jenkin...@googlegroups.com
On 16 February 2015 at 14:56, Kohsuke Kawaguchi <k...@kohsuke.org> wrote:
> • Kohsuke (and Tyler if he can join) and the ops team will schedule a series of "transfer of information" sessions to bring the new ops lead and the team up to speed about how things are put together today.
I assume you mean "knowledge transfer" sessions ;-) 

Kanstantsin Shautsou

Feb 16, 2015, 10:43:56 AM
to jenkin...@googlegroups.com
IMHO, assuming that only CloudBees people now have power over Jenkins, and that its main development and everything else is locked to CB people, then we have no choice.
As for the lead role, I know of no contributions from Adam to Jenkins development or infra. As a first step, I would be glad to see him simply as someone who resolves real issues with rtyler's/kohsuke's approval, and to decide on the lead role later.

Stephen Connolly

Feb 16, 2015, 10:55:00 AM
to jenkin...@googlegroups.com
On 16 February 2015 at 15:43, Kanstantsin Shautsou <kanstan...@gmail.com> wrote:
> IMHO, assuming that only CloudBees people now have power over Jenkins, and that its main development and everything else is locked to CB people, then we have no choice.

Oh my oh my, I am disappointed that you feel that way.

On the infra side, KK has DB and ON, neither of whom currently works for CloudBees.

On the board side, only KK works for CloudBees.

We try to ensure that we are just one voice in the community and it would sadden us greatly if the community felt that we were in charge.

I would love to hear what led you to form that viewpoint.

- FYI some of the work I do for the community is outside of M-F/9:00am-5:30pm which means that it is my own personal contributions and not related to CloudBees, Inc.
 


Daniel Beck

Feb 16, 2015, 11:23:56 AM
to jenkin...@googlegroups.com
On 16.02.2015, at 16:43, Kanstantsin Shautsou <kanstan...@gmail.com> wrote:

> IMHO, assuming that only CloudBees people now have power over Jenkins, and that its main development and everything else is locked to CB people, then we have no choice.

It shouldn't be a surprise that a company whose business is services on top of Jenkins wants to and can improve Jenkins. And anyone whose day job is work on Jenkins can often contribute more than someone doing this in their spare time. There's nothing "locked" about it. In fact, the entire Jenkins project is remarkably open and easy to get into (aside from a few areas that are more restricted for good reason, such as security and infra -- and all the available bees we pinged about INFRA-240 couldn't help us because they don't have access either!).

Kanstantsin Shautsou

Feb 16, 2015, 11:25:36 AM
to jenkin...@googlegroups.com
It's just my opinion. I see better activity on user-related issues from plugin maintainers who are not at CB yet. From CB I see only commits with fixes that lack Jenkins issue IDs and that mostly look like they come from CB customers. Of course CB developers will give priority to their internal issues.

As for infra, Daniel Beck has no power or knowledge yet; otherwise, why was the current work stoppage not fixed? Oleg, Daniel and I spent the last months trying to get any info about the infra and access to it. Hopefully they now have access, and instead of helping them you propose a lead who has never worked with this infra before. What is the difference then? Infra must be fixed and updated without any leads: just fix and resolve issues!

Board people are board people :)

The best community person is Jessy; he answers all questions.

R. Tyler Croy

Feb 16, 2015, 2:02:42 PM
to jenkin...@googlegroups.com, in...@lists.jenkins-ci.org
(replies inline)

On Mon, 16 Feb 2015, Kohsuke Kawaguchi wrote:

> In the last 6 months or so, we've handed out infra access rights to a few
> more people (Daniel Beck and Oleg Nenashev, IIRC), and that was good for
> better time zone coverage and what not. But the problem still remains that
> there is a leadership vacuum, that no one sufficiently "owns" the infra,
> and that's difficult to solve by adding more hands alone.
>
> So here's what I'd like to propose:
>
> - Formalize our ops team more by designating the lead that reports to
> the board. The lead shall be chosen in the discussion during the project
> meeting.
> - Under the new lead, accept another round of ops team members to help
> spread the workload. I know for example Kostasya is interested in helping.
> - Kohsuke (and Tyler if he can join) and the ops team will schedule a
> series of "transfer of information" sessions to bring the new ops lead and
> the team up to speed about how things are put together today.
> - Identify and remove single-point-of-failure in our infra. Off the top
> of my head:
> - I think I'm currently the only one who has the private key to sign
> update center root CA.
> - jenkins-ci.org domain name still appears to be registered under
> Tyler's personal account.
>
>
> As the ops lead, I'd like the project to consider Adam Papai
> <https://github.com/woohgit>. He's been a long time user of Jenkins and he
> is a member of the CloudBees ops team. I'm sensitive to the fact that he
> works for CloudBees and how that can come across, but OTOH this will be a
> part of his day job, and I think that ensures that he can allocate
> necessary time to the effort.



Since I've got a couple of real-world things consuming a boatload of my time, I
don't have any objections to Adam joining the infra team. I'm not sure I like
the term "ops lead" as I've never thought of there being a leadership structure
around our infrastructure so much as a steaming pile of JIRAs and not enough
people to tackle them :-P

I would suggest ramping Adam up in the following ways to mitigate some of our
current risk:

* Documenting and migrating backend crawlers into the jenkins-infra GH
organization. This is one of the places where I think we have a seriously
low bus factor
* Helping KostySha where I have failed, with feedback on this PR:
<https://github.com/jenkins-infra/jenkins-infra/pull/66>
* Drive migration of JIRA and Confluence onto the newer hardware and newer
versions we've not been able to complete due to time

There's a long tail of other smaller projects, but in terms of our current
infra health and its effect on the project's continued growth and success, I
think those are the areas of most need.


See you chaps in #jenkins-infra


Cheers
-R. Tyler Croy


Richard Bywater

Feb 16, 2015, 2:57:29 PM
to jenkin...@googlegroups.com
Sounds like a great idea - even if the term "ops lead" isn't used then it would be good for one (or two?) people to drive things a bit as otherwise it's easy for people to sit back and think that someone else is going to fix it.

I'm not sure if it's just me, but in the past I've put my hand up to help with ops work. Unfortunately, probably due to timezones (I'm in NZ so a lot of US times don't work very well) and other reasons I've never really been able to "get into it". So if there were going to be sessions transferring knowledge, I think it would be useful having recorded sessions for later watching and also "private" (if necessary) space on the Wiki to store all things ops (assuming it's not already there).

While we are on the subject, I'd be interested in knowing whether there is any truth to my understanding that KK is the only one who can release new Jenkins versions and RCs etc. (I could have completely got the wrong end of that bit of info!)  Only reason I ask is that outside of the infrastructure single points of failure, that would appear to be another rather large one.

Cheers
Richard.


Kohsuke Kawaguchi

Feb 16, 2015, 3:07:06 PM
to jenkin...@googlegroups.com
Keeping the project independent from CloudBees is actually a very important goal for me for all the reasons. We currently do have a number of people with full infra access, such as Tyler, Andrew, Daniel, Oleg to name a few. It wasn't my intention to change any of that. I was really just trying to ensure that the ops side of the house doesn't suffer from cycle starvation.

When I say "lead" I meant it as a forcing function to ensure that proper transfer of information happens and the person feels like he's empowered to do what he sees fit. In the past, we've identified several tasks that can be owned by others in the community, but I feel like we have failed to empower them properly by providing necessary context, review, and access. That makes it difficult for people to contribute, and Kostasya, I think you must have felt this pain.

The new jenkins-infra repo definitely moves the dial for community-driven distributed infra work, but we still have a lot of things that are only in Tyler's and my heads and that only we can do, and looking back I think it's pretty clear that the two of us are the bottleneck. IMHO our top priority is to change that, and that's difficult to do if we just ask more volunteers to write PRs on jenkins-infra.


That said, your point about this proposal going against meritocracy is quite valid. Tyler has a similar comment, plus he indicated offline to me that he expects to be able to spend more time on the project come April. Given all that, I'm very happy to have him ease in slowly.





--
Kohsuke Kawaguchi

Kohsuke Kawaguchi

Feb 16, 2015, 3:14:51 PM
to jenkin...@googlegroups.com
OK, I'll make sure to have them recorded.

You are right about release being another SPoF that needs addressing. A part of the problem is the need to hook up Windows & OSX slaves (the trust issue associated with it and all), and a part of the problem is the handling of code signing private keys.





--
Kohsuke Kawaguchi

Craig Rodrigues

Feb 16, 2015, 5:41:37 PM
to jenkin...@googlegroups.com
On Mon, Feb 16, 2015 at 12:07 PM, Kohsuke Kawaguchi <k...@kohsuke.org> wrote:
> Keeping the project independent from CloudBees is actually a very important goal for me for all the reasons. We currently do have a number of people with full infra access, such as Tyler, Andrew, Daniel, Oleg to name a few. It wasn't my intention to change any of that. I was really just trying to ensure that the ops side of the house doesn't suffer from cycle starvation.


I have seen in other open source projects, that when they reach a certain size and maturity,
they have found it necessary to pay people to work part-time on devops/sysadmin stuff,
often paid through some type of non-profit foundation set up to support the project. 

The Jenkins project seems to be facing some of the similar pressures.
Volunteers are great, but when volunteers have competing work/personal items for their time,
it gets really hard to do the devops/sysadmin stuff for the open source project in a timely manner.
Having someone focused on this stuff as part of their job is great.

I think that you and CloudBees have the best interest of the Jenkins project at heart,
and will "do the right thing" to bring on new people like Adam,
but work with the existing people doing Jenkins infra work for the best interest of the project.
--
Craig

Christopher Orr

Feb 16, 2015, 6:23:50 PM
to jenkin...@googlegroups.com
On 02/16/2015 07:02 PM, R. Tyler Croy wrote:
> (replies inline)
>
> On Mon, 16 Feb 2015, Kohsuke Kawaguchi wrote:
>
>> In the last 6 months or so, we've handed out infra access rights to a few
>> more people (Daniel Beck and Oleg Nenashev, IIRC), and that was good for
>> better time zone coverage and what not. But the problem still remains that
>> there is a leadership vacuum, that no one sufficiently "owns" the infra,
>> and that's difficult to solve by adding more hands alone.
>>
>> So here's what I'd like to propose:
>>
>> - Formalize our ops team more by designating the lead that reports to
>> the board. The lead shall be chosen in the discussion during the project
>> meeting.
>> - Under the new lead, accept another round of ops team members to help
>> spread the workload. I know for example Kostasya is interested in helping.
>> - Kohsuke (and Tyler if he can join) and the ops team will schedule a
>> series of "transfer of information" sessions to bring the new ops lead and
>> the team up to speed about how things are put together today.
>> - Identify and remove single-point-of-failure in our infra. Off the top
>> of my head:
>> - I think I'm currently the only one who has the private key to sign
>> update center root CA.
>> - jenkins-ci.org domain name still appears to be registered under
>> Tyler's personal account.

These kind of things sound like good INFRA tickets; can we create a new
"spof" component/tag? :)


>> As the ops lead, I'd like the project to consider Adam Papai
>> <https://github.com/woohgit>. He's been a long time user of Jenkins and he
>> is a member of the CloudBees ops team. I'm sensitive to the fact that he
>> works for CloudBees and how that can come across, but OTOH this will be a
>> part of his day job, and I think that ensures that he can allocate
>> necessary time to the effort.
>
> Since i've got a couple of real-world things consuming a boatload of my time, I
> don't have any objections to Adam joining the infra team. I'm not sure I like
> the term "ops lead" as I've never thought of there being a leadership structure
> around our infrastructure so much as a steaming pile of JIRAs and not enough
> people to tackle them :-P

I was under the same impression regarding a leadership structure, but I
guess creating this position is reasonable.

As for adding infra team members — since I've been responsible in the
past for lots of nagging, waiting for people in the US to wake up,
asking about SPoFs, and adding a bunch of tickets to the INFRA pile —
I'd be keen to help solve some of these things, and I have a fair amount
of sysadmin and Puppet experience.


Aside from the obvious infrastructure/server-access roles, we also have
accumulated various moderator roles, which (I think) a very short list
of people have, usually to varying degrees. e.g. wiki moderation, wiki
user deletion, LDAP account authorisation, account deletion, mailing
list banning.

There are probably a couple other systems like that. It would be nice
to define/document the various roles and who has them, and how people
can join that role.

I've spent a lot of time in the past deleting wiki spam (moreso when the
daily wiki email actually gets sent), and would love to help delete the
various spammers on the mailing lists, JIRA and the wiki.


> I would suggest ramping Adam up in the following ways to mitigate some of our
> current risk:
>
> * Documenting and migrating backend crawlers into the jenkins-infra GH
> organization. This is one of the places where I think we have a seriously
> low bus factor
> * Helping KostySha where I have failed, with feedback on this PR:
> <https://github.com/jenkins-infra/jenkins-infra/pull/66>
> * Drive migration of JIRA and Confluence onto the newer hardware and newer
> versions we've not been able to complete due to time
>
> There's a long tail of other smaller projects, but in terms of our current
> infra health and its effect on the project's continued growth and success, I
> think those are the areas of most need.

There is indeed always a lot to be done, but it's also worth pointing
out how well a lot of the stuff runs, and how much automation we have.
Thanks to you and Kohsuke for keeping the bulk of this stuff under
control. And of course to the other infra contributors :)

Regards,
Chris

dana...@gmail.com

Feb 17, 2015, 3:37:04 AM
to jenkin...@googlegroups.com
I can spare time to help with infrastructure problems. I know pretty much nobody knows me, but I try to be on IRC as much as I can and help other people with issues.
If you are interested in a young fellow (version 3.0 :) ) like me, just mail me and tell me what you need me to do and to have experience with. You never know, maybe I am the guy to help you ;)

Kohsuke Kawaguchi

Feb 17, 2015, 12:04:13 PM
to jenkin...@googlegroups.com
2015-02-16 15:23 GMT-08:00 Christopher Orr <ch...@orr.me.uk>:
>> - Identify and remove single-point-of-failure in our infra. Off the top
>> of my head:
>>    - I think I'm currently the only one who has the private key to sign
>>    update center root CA.
>>    - jenkins-ci.org domain name still appears to be registered under
>>    Tyler's personal account.
>
> These kind of things sound like good INFRA tickets; can we create a new "spof" component/tag? :)

Created "spof" component and filed INFRA-243 for domain names and INFRA-244 for root CA private key. INFRA-75 is another SPoF.


> As for adding infra team members (since I've been responsible in the past for lots of nagging, waiting for people in the US to wake up, asking about SPoFs, and adding a bunch of tickets to the INFRA pile), I'd be keen to help solve some of these things, and I have a fair amount of sysadmin and Puppet experience.

Great, thanks.
 
> Aside from the obvious infrastructure/server-access roles, we also have accumulated various moderator roles, which (I think) a very short list of people have, usually to varying degrees.  e.g. wiki moderation, wiki user deletion, LDAP account authorisation, account deletion, mailing list banning.
>
> There are probably a couple other systems like that.  It would be nice to define/document the various roles and who has them, and how people can join that role.

Yes, I agree. I suppose I need to revisit the permissions of https://wiki.jenkins-ci.org/display/infra/ so that it has some publicly visible parts to capture this kind of information.
 

--
Kohsuke Kawaguchi

Jerome Lacoste

Feb 18, 2015, 5:25:30 AM
to jenkin...@googlegroups.com
On Monday, February 16, 2015 at 9:14:51 PM UTC+1, Kohsuke Kawaguchi wrote:
> OK, I'll make sure to have them recorded.

+1 on the recordings.

WRT SPoF, I would gladly read an article describing the changes to tackle the SPoF on the jenkins-ci ORG ! I am sure it will be an interesting read.

Cheers,

Jerome

 
--
Kohsuke Kawaguchi