Future of the metal3 prow cluster


Russell Bryant

Feb 4, 2021, 12:25:36 PM
to Metal3 Development List
Hi, all.

I'm responsible for the Kubernetes cluster that runs prow for metal3-io.  Several things prompt me to raise this discussion:

* The Jenkins server has taken on the bulk of the functional testing, leaving prow to run simple test jobs and some GitHub automation.
* prow is now badly out of date (more than a year).  I tried upgrading once; it didn't go well, and I never revisited it.
* This is (was) a 3-node cluster, but we lost a node.  The good news is that the cluster's HA worked as planned.  The bad news is that I have let the cluster sit in this state for a couple of months now.

Based on my vastly reduced participation in metal3-io, I would like to either hand over maintenance of this infrastructure or deprecate and remove the use of prow altogether.

Here are my ideas for paths forward.

1) One or more people volunteer to take over the prow cluster.  I can work with them on getting the cluster back to a healthy state.  Through this process, I imagine I will have transferred enough knowledge to no longer be needed.

2) Deprecate the metal3-io cluster and migrate to a prow cluster maintained by another group.  The OpenShift prow cluster could be used.  I avoided this in the past because I preferred a metal3-io community-owned resource, but avoiding the extra maintenance could now be worth it.

3) Deprecate the metal3-io cluster and move to another set of tools.  GitHub Actions (as one example) could be the new home for most of the jobs.  Another solution would be needed for review/merge automation if that functionality is still desired.  I've heard of Mergify [1] as one example, but have never used it.
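
To make option 3 concrete, here is roughly what one of our simple prow jobs could look like as a GitHub Actions workflow.  This is only a sketch: the file path, Go version, and "make unit" target are assumptions for illustration, not taken from our current job configs.

# .github/workflows/unit.yaml -- hypothetical replacement for a prow unit-test job
name: unit
on:
  pull_request: {}
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      # check out the PR and set up a Go toolchain
      - uses: actions/checkout@v2
      - uses: actions/setup-go@v2
        with:
          go-version: '1.15'
      # assumes the repo has a "make unit" target, like our prow jobs invoke today
      - run: make unit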

I'm open to all of these, depending on the opinions of those willing to help.  I'll say #1 is probably my least preferred option, though.

Let me know what you think and what other ideas you may have.

Thanks,

--
Russell Bryant

Digambar Patil

Feb 9, 2021, 11:01:08 AM
to Russell Bryant, Metal3 Development List
Hi Russell,

I know it's painful to handle; you've done a great job so far maintaining
it single-handedly. I would certainly share some of my bandwidth if
you need help with this.

Thanks,
Digambar

Russell Bryant

Feb 9, 2021, 2:52:00 PM
to Digambar Patil, Metal3 Development List
Thanks for the offer to help, and thanks to the two others who also expressed interest.

As I've thought about this some more, I'd like to drop option #1.  I don't think metal3-io gets enough value out of running its own prow cluster for it to be worth continuing.

I'm willing to assist with the migration using either option #2 or #3.  I don't have a strong preference between those two paths.  I'd like to hear preferences from those contributing to metal3-io projects on a regular basis.

In the absence of a strong preference otherwise, I would propose #2.  I think that would be the least amount of work and also the least disruptive to existing workflows.

--
Russell Bryant

Stephen Benjamin

Feb 9, 2021, 3:45:24 PM
to Russell Bryant, Digambar Patil, Metal3 Development List
On Tue, Feb 9, 2021 at 2:52 PM Russell Bryant <rbr...@redhat.com> wrote:
>
> In the absence of a strong preference otherwise, I would propose #2.  I think that would be the least amount of work and also the least disruptive to existing workflows.

If OpenShift's CI team is ok with us using that Prow, option #2 sounds
good to me.

Did you have a look at what other CNCF teams do? Is there something
between #1 and #2 where we'd have more control over Prow, but not be
stuck managing a baremetal k8s cluster?

- Stephen



--

Stephen Benjamin (he/him)
Principal Software Engineer
Red Hat

Russell Bryant

Feb 9, 2021, 3:59:53 PM
to Stephen Benjamin, Digambar Patil, Metal3 Development List
On Tue, Feb 9, 2021 at 3:45 PM Stephen Benjamin <ste...@redhat.com> wrote:
> If OpenShift's CI team is ok with us using that Prow, option #2 sounds
> good to me.
>
> Did you have a look at what other CNCF teams do? Is there something
> between #1 and #2 where we'd have more control over Prow, but not be
> stuck managing a baremetal k8s cluster?

That's a good question.  It looks like hosting may be available, but I haven't seen a shared prow instance.


I saw https://www.cncf.io/community-infrastructure-lab/, but that's just bare metal server access, which Equinix Metal has already given metal3-io directly.
I'll ask, though, just in case.

Feruzjon Muyassarov

Feb 16, 2021, 5:09:09 AM
to Russell Bryant, Stephen Benjamin, Kashif Khan, Jan Tilles, Digambar Patil, Metal3 Development List
Hi Russell,

@Kashif Khan, @Jan Tilles, and I are ready to volunteer to take over
the prow configuration, or to join in setting up an alternative solution that
best satisfies our needs. We are happy to join with other interested folks to keep
prow (or whatever solution we choose) running.



Best regards,
Feruzjon Muyassarov


Russell Bryant

Feb 16, 2021, 11:15:57 AM
to Feruzjon Muyassarov, Stephen Benjamin, Kashif Khan, Jan Tilles, Digambar Patil, Metal3 Development List
Great, thanks for your interest (and thanks to everyone else who spoke up).

I checked and didn't find any suitable shared prow for CNCF projects.  I don't really want to dictate the solution here, since I'd rather it be driven by whoever wants to look after it going forward.  Here's my view of the options.  Let me know if you have a preference.  If there's no strong preference, then I would suggest option #1 (use the OpenShift prow), because I think that's the easiest overall.

1) If we move to the OpenShift prow, I can do the bulk of the work, or at least finish the migration for one or two repositories.

pros:
 - nobody has to run a cluster for this
 - I'll help do the migration and make sure people know where to go to make updates
 - I think this is the least amount of effort overall

cons:
 - we rely on a Red Hat resource for a project that isn't a Red Hat-only project

2) Move to some other CI (move all jobs into Jenkins, use GitHub Actions for the jobs that currently run in prow, ...)

pros:
 - continues a metal3-io community-owned CI setup
 - we would presumably be running one less CI system for metal3-io
 - I'll help figure out the migration for at least one repo

cons:
 - a somewhat more complex migration, since both the job configuration and the PR automation workflow will change

3) Stand up a new cluster to run prow

pros:
 - use the same CI configuration and PR automation that all repos use today

cons:
 - someone must continue maintaining a cluster for this purpose (I can grant maintainers access to the metal3-io Equinix Metal account if needed)
 - I can explain how the current system works, but if we hit problems with a new prow version, I won't be able to help debug it

--
Russell Bryant

Jan Tilles

Feb 17, 2021, 2:26:01 AM
to Russell Bryant, Feruzjon Muyassarov, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
Hi,

I am in favor of proposal 2. I think it makes the most sense and makes metal3 testing uniform across the board.

Best Regards

Jan Tilles


Feruzjon Muyassarov

Feb 17, 2021, 4:59:14 AM
to Jan Tilles, Russell Bryant, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
Just a couple of questions related to the OpenShift prow option:
  1. Can we still use the metal3-io-bot, or will we have to create/rely on another OpenShift bot?
  2. Can we keep the prow configurations within the metal3-io/project-infra GitHub repo?
  3. If there is an issue in prow in the future, will it be possible for someone outside the Red Hat organization, but part of the Metal3 community, to fix/debug it?
BR,
Feruz


Russell Bryant

Feb 17, 2021, 9:13:35 AM
to Feruzjon Muyassarov, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
On Wed, Feb 17, 2021 at 4:59 AM Feruzjon Muyassarov <feruzjon....@est.tech> wrote:
> Just a couple of questions related to the OpenShift prow option:
>   1. Can we still use the metal3-io-bot, or will we have to create/rely on another OpenShift bot?

It would be a separate bot account on GitHub, but with all the same features.

>   2. Can we keep the prow configurations within the metal3-io/project-infra GitHub repo?

No, they would all have to move to a large configuration repo under github.com/openshift/.

>   3. If there is an issue in prow in the future, will it be possible for someone outside the Red Hat organization, but part of the Metal3 community, to fix/debug it?

You'd probably have to rely on people at Red Hat to sort out details that go beyond updating the job definitions in the config repo.

This arrangement certainly isn't ideal ...

Russell Bryant

Feb 17, 2021, 9:14:57 AM
to Jan Tilles, Feruzjon Muyassarov, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
OK, I have no problems with doing option 2.

Do you have thoughts on what you envision?

 - would you move the jobs into Jenkins, or run them in GitHub Actions?
 - how about GitHub automation? Maybe try out Mergify, or something else?
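
For reference, a Mergify rule approximating prow's current approve-and-merge flow might look roughly like this.  This is only a sketch -- I haven't used Mergify, and the required job name and hold label below are made up:

# .mergify.yml -- rough, hypothetical approximation of prow's approve/merge behavior
pull_request_rules:
  - name: merge approved PRs once CI passes
    conditions:
      - "#approved-reviews-by>=2"     # roughly equivalent to requiring approvals
      - check-success=unit            # made-up required job name
      - -label=do-not-merge           # made-up hold label; "-" negates the condition
    actions:
      merge:
        method: merge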

--
Russell Bryant

Feruzjon Muyassarov

Feb 19, 2021, 10:51:53 AM
to Russell Bryant, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
I agree that that's not the ideal option 🙂

In that case, what about keeping everything as it is now, as in option 3 (stand up a new cluster to run prow)?

The only thing we would need to do is spin up a new cluster or upgrade the existing one (not sure if upgrading would be easy), right?

I'm personally not familiar with Equinix, but I assume it provides compute resources to host the cluster.
If so, we discussed internally, and I think we could also host the cluster in CityCloud (the cloud provider where we currently run the Jenkins jobs).

If you think option 3 is also fine, then before we try it out on CityCloud infra we will need to know the approximate resource requirements for hosting prow, so that we can check whether we have enough capacity in CityCloud.

Is it possible to check the CPU, RAM, and disk size of the current setup?


BR,
Feruz




Russell Bryant

Feb 19, 2021, 4:12:38 PM
to Feruzjon Muyassarov, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
On Fri, Feb 19, 2021 at 10:52 AM Feruzjon Muyassarov <feruzjon....@est.tech> wrote:
> In that case, what about keeping everything as it is now, as in option 3 (stand up a new cluster to run prow)?

I'm fine with it if someone else is interested in running it.  I was just stepping down from running a cluster and wanted to offer options that don't require anyone else to run one.

> The only thing we would need to do is spin up a new cluster or upgrade the existing one (not sure if upgrading would be easy), right?

That's right.  I'd rather not upgrade the current one.  I think it'd be better to start up a parallel one running the latest prow.  Then we can fall back to the current one if we have problems getting it going.

> I'm personally not familiar with Equinix, but I assume it provides compute resources to host the cluster.

That's right.  We have free Equinix Metal resources that can be used if desired.

> If so, we discussed internally, and I think we could also host the cluster in CityCloud (the cloud provider where we currently run the Jenkins jobs).
>
> If you think option 3 is also fine, then before we try it out on CityCloud infra we will need to know the approximate resource requirements for hosting prow, so that we can check whether we have enough capacity in CityCloud.
>
> Is it possible to check the CPU, RAM, and disk size of the current setup?

Sure.  The current servers are 3 (for HA) c1.small.x86 nodes from Equinix (a legacy type no longer offered, I think):

 - CPU: 1 x Intel E3-1240 v3
 - RAM: 32 GB
 - Disk: 2 x 120 GB SSD
I checked, and prow itself is using a trivial amount of RAM (< 1 GB right now) and negligible CPU (less than one core).  Of course, that's while it's sitting idle and not running jobs.  The bulk of the work is the jobs themselves, though our jobs are all pretty light -- unit tests and linting for the most part.

The storage needs aren't significant.  All the job artifacts (logs) get stored in Google Cloud, which is catch #1: I'm still running this against my personal Google Cloud account for object storage.  It looks like I've spent $10.25 on it in the last year.
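
For whoever takes this over: the artifact bucket is set in prow's plank decoration config, roughly like the snippet below.  This is from memory rather than the literal file, the bucket and secret names are made up, and the exact field names vary between prow versions:

plank:
  default_decoration_config:
    gcs_configuration:
      bucket: metal3-prow-artifacts        # made-up name; this is what would move off my personal account
      path_strategy: explicit
    gcs_credentials_secret: gcs-credentials  # made-up name of the secret holding the service account key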

The last time I looked at upgrading to a newer version of prow, it had a new component that required persistent storage support.  I don't have that set up on the current cluster.  It was for storing a cache of GitHub data, I believe.  So configuring a CSI driver of some sort is probably a requirement for a new cluster.
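
If memory serves, the component in question is ghproxy, the caching GitHub API proxy, and what it needs is an ordinary persistent volume claim along these lines (the name and size here are guesses for illustration, not taken from a real deployment):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ghproxy-cache          # illustrative name
spec:
  accessModes:
    - ReadWriteOnce            # the cache is read and written by a single pod
  resources:
    requests:
      storage: 10Gi            # guessed size; tune to actual cache usage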

According to the prow README, there is a #prow channel on the Kubernetes Slack that will be able to offer better advice on requirements than I can.

Feruzjon Muyassarov

Feb 23, 2021, 6:43:36 AM
to Russell Bryant, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List

Thank you for the info, and sorry for the late reply.

Based on the numbers you gave, it seems possible to host the
new cluster in CityCloud.

Jan Tilles and Kashif Khan are on vacation this week, so
would it be possible to arrange a meeting next week? I think
it would be easiest to bring all the questions we have by
then to the meeting, and you could also share as much
information as you can.

Meanwhile, this week we will read the prow configuration
docs to familiarize ourselves with it.


Best regards,
Feruz



Russell Bryant

Feb 23, 2021, 3:26:09 PM
to Feruzjon Muyassarov, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
Sure, that sounds good.

As a start, I kept notes in the project-infra repo:



Some of my notes are OpenShift-specific, but they show what I did.

--
Russell Bryant

Feruzjon Muyassarov

Feb 23, 2021, 3:40:17 PM
to Russell Bryant, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List

Thank you, I will check the docs.

What day & time would suit you best for the meeting next week?
We are based in Europe, so I think it would be somewhere in the
afternoon/evening for us, considering you are in the USA.

BR,
Feruz


Russell Bryant

Feb 23, 2021, 5:04:30 PM
to Feruzjon Muyassarov, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List

Feruzjon Muyassarov

Mar 1, 2021, 5:51:13 AM
to Russell Bryant, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
Thanks.
Based on the votes, I would suggest Thursday 9:30-10:00 AM for you, which would be 3:30-4:00 PM CET for us.

We can use the same link we use for the community meeting.
https://zoom.us/j/97255696401?pwd=ZlJMckNFLzdxMDNZN2xvTW5oa2lCZz09

BR,
Feruz


Russell Bryant

Mar 1, 2021, 12:49:31 PM
to Feruzjon Muyassarov, Jan Tilles, Stephen Benjamin, Kashif Khan, Digambar Patil, Metal3 Development List
OK, that sounds good to me.  See you then.

--
Russell Bryant
