15:00:51 <dwalt> #startmeeting airship
15:00:52 <openstack> Meeting started Tue Nov 10 15:00:51 2020 UTC and is due to finish in 60 minutes.  The chair is dwalt. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:56 <openstack> The meeting name has been set to 'airship'
15:01:07 <mattmceuen> o/
15:01:08 <dwalt> Hey everyone, we'll get started in a few minutes. The design call is still wrapping up
15:04:04 <airship-irc-bot> <ih616h> o/
15:04:09 <airship-irc-bot> <mf4716> o/
15:04:13 <airship-irc-bot> <james.gu> o/
15:04:31 <airship-irc-bot> <ak3216> o/
15:04:44 <roman_g> o/
15:04:45 <dwalt> Welcome to everyone filtering in from the design call. Here is our agenda today
15:04:47 <dwalt> #link https://etherpad.opendev.org/p/airship-team-meeting
15:06:02 <airship-irc-bot> <j_t_williams> o/
15:06:31 <dwalt> Alrighty, let's get things started
15:06:34 <dwalt> #topic Zuul gate problems
15:06:35 <airship-irc-bot> <sean.eagan> o/
15:06:56 <mattmceuen> Folks have probably noticed, we have a pair of unrelated but brutal issues with our gates at the moment
15:07:00 <jemangs> o/
15:07:10 <mattmceuen> First: infrastructure availability for our 16gb VMs
15:07:33 <mattmceuen> Of the two providers that supply them, we're back down to one working at the moment -- resulting in a lot of NODE_FAILURE type errors
15:07:48 <airship-irc-bot> <lb4368> o/
15:07:51 <airship-irc-bot> <mb551n> o/
15:07:52 <sreejithp> o/
15:08:06 <jemangs> (The same problem as earlier?)
15:08:09 <mattmceuen> Arijit has been working to get 3rd-party gating going per patchset
15:08:14 <mattmceuen> similar @jemangs
15:08:24 <mattmceuen> I think this may be a networking problem this time -- previously it was a cooling problem, or something like that
15:08:58 <mattmceuen> Currently the 3rd party gates are non-voting, and not (yet) reporting status up to patchsets -- that's a work in progress
15:09:05 <roman_g> jemangs, similar. Previously that was AC failure, now – router malfunction.
15:09:26 <mattmceuen> https://jenkins.nc.opensource.att.com
15:09:56 <mattmceuen> While we get status reporting to patchsets working, developers can manually check their job's status in that UI^
15:10:40 <mattmceuen> And we'd like to switch (when ready) to making the 3rd party deployment gate our voting gate, instead of the zuul deployment gate
15:11:02 <mattmceuen> to be clear:  all the things that can be done in "normal size" VMs -- linting, image build, etc -- would still be run via zuul
15:11:56 <mattmceuen> I would propose that for the time being, we make the zuul deployment gate *on-merge only*, and then check in jenkins for per-patch testing status
15:12:05 <mattmceuen> This will get us unblocked
15:12:12 <mattmceuen> Thoughts/concerns?
15:12:19 <dwalt> This is great, thanks mattmceuen and sreejithp
15:12:23 <airship-irc-bot> <ih616h> +1
15:12:51 <dwalt> Does that mean the job would be disabled for check, and enabled voting for gate?
15:13:06 <mattmceuen> yes, disabled for check -- leaving more of our limited capacity available for the merge job
15:13:44 <roman_g> dwalt, yes, but you can trigger the job manually by leaving a "check experimental" comment on the patch set in Gerrit
15:13:51 <airship-irc-bot> <mf4716> Did we hear any more about resolution to the router malfunction?
15:13:52 <mattmceuen> +1
15:13:53 <dwalt> That makes a lot of sense. More reliable than relying on the cores to verify that the job passed. +1 from me
15:14:10 <mattmceuen> The longer-term (but hopefully still soon) idea would be to make the 3rd party gating Voting, and make the zuul gate non-voting
15:14:42 <mattmceuen> We should pull the trigger on that once we have status reporting back to the PS in place, and are comfortable with it
15:15:00 <roman_g> mf4716 no. It's a non-commercial cloud provider (basically a home-based server farm).
15:15:01 <mattmceuen> Any other thoughts/comments before we move on to the other terrible gate issue? :)
15:15:27 <mattmceuen> yeah - really useful from a community perspective, but can be problematic when we're relying on it for our day jobs
15:15:55 <airship-irc-bot> <mf4716> any ETA roman?
15:15:57 <mattmceuen> at least, when the third party gates break, it's our job to fix them :)
15:16:32 <roman_g> mf4716 no ETA.
15:16:52 <mattmceuen> So issue #2:
15:17:02 <mattmceuen> And this one may be more challenging to work around.
15:17:23 <mattmceuen> dockerhub has instituted rate limiting on image pulls, limiting to 100 pulls per-IP per month
15:18:04 <mattmceuen> I believe we thought we'd been using some kind of image caching, but weren't, and now we've exhausted our quota for the month from some of the open infra IPs
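For reference, the remaining Docker Hub quota for an IP can be checked against the registry's rate-limit headers; a minimal sketch, assuming jq is installed and that Docker Hub still serves these headers on its ratelimitpreview test image:

    # fetch an anonymous pull token, then read the RateLimit-* headers
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
    curl -sI -H "Authorization: Bearer $TOKEN" \
      https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit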
15:18:24 <mattmceuen> A few ways we could potentially approach this:
15:18:52 <mattmceuen> 1. migrate to using the 3rd party gates as Voting ASAP, and make sure they have caching enabled -- they'll have different IPs and so won't be hurt by the quota
15:19:15 <mattmceuen> 2. migrate to a non-dockerhub mirror for all images hosted in dockerhub
15:19:32 <mattmceuen> 3. upgrade to a paid plan
15:19:55 <dwalt> Do we have an estimate on the number of images impacted?
15:21:07 <mattmceuen> This is a promising (incomplete) list:
15:21:14 <mattmceuen> [madgin@leviathan:~/airship2/airshipctl]$ grep -r "image: docker" manifests
15:21:14 <mattmceuen> manifests/function/helm-operator/deployment.yaml:        image: docker.io/fluxcd/helm-operator:1.2.0
15:21:14 <mattmceuen> [madgin@leviathan:~/airship2/airshipctl]$
15:21:55 <airship-irc-bot> <rp2723> Can we put the point in time images we care about in quay and use that instead
15:21:58 <dwalt> lol. That is indeed promising
15:22:45 <dwalt> If it's one image, migrating it to Quay seems like it would be feasible
15:22:52 <airship-irc-bot> <rp2723> It’s a single image?? The word "list" was rhetorical
15:23:01 <mattmceuen> lol I should have googled earlier, mea culpa
15:23:09 <mattmceuen> there could be other images in other projects though
15:23:15 <mattmceuen> but getting airshipctl unblocked would be great
15:23:19 <airship-irc-bot> <sean.eagan> helm-controller/source-controller are now in github container registry
15:23:31 <dwalt> \o/
15:23:35 <airship-irc-bot> <sean.eagan> (the replacements for helm operator)
15:24:12 <airship-irc-bot> <sean.eagan> https://github.com/orgs/fluxcd/packages/container/package/helm-controller
15:24:33 <mattmceuen> what about helm-operator?  We're not quite ready to switch to the helm-controller yet, are we @sean.eagan?
15:24:41 <dwalt> So is it fair to say that we can push the operator image to quay, since we are moving away from it?
15:24:47 <mattmceuen> yes
15:24:54 <airship-irc-bot> <sean.eagan> https://review.opendev.org/#/c/758615/
15:24:56 <mattmceuen> +1
15:24:58 <airship-irc-bot> <sean.eagan> ^ it's merged
15:25:21 <sreejithp> mattmceuen, we also need the go docker image "docker.io/golang:1.13.1-stretch"
15:25:32 <sreejithp> we use this for building airshipctl binary
15:26:21 <mattmceuen> ahh you're right
15:26:40 <mattmceuen> do we even still need helm-operator now that we have the helm-controller merged Sean?
15:27:21 <mattmceuen> if we simply switch the test-site to use helm-controller, -operator wouldn't get run by the tests
15:27:49 <airship-irc-bot> <sean.eagan> the function for helm-operator is still in place (for a short migration period), but it's not actually used by the default phases included in airshipctl
15:28:17 <mattmceuen> then it must only be the golang image that's hurting the gates, right?
15:28:58 <dwalt> Is the gate runner failing in the pre-run playbooks or later on?
15:29:06 <dwalt> I think that's when we run `make images`
15:29:08 <mattmceuen> (treasuremap is using the helm-operator, so we'll need to handle that)
15:29:45 <airship-irc-bot> <sean.eagan> https://review.opendev.org/#/c/761666/
15:29:52 <airship-irc-bot> <sean.eagan> ^ treasuremap migration
15:30:50 <mattmceuen> nice!
15:32:44 <dwalt> That's great sean.eagan. Between pushing those two images to quay and finishing the helm controller migration, it sounds like this is largely under control
15:33:02 <mattmceuen> I'm trying to find a failed job example and am not finding one quickly... was seeing a few yesterday, so that's interesting
15:33:29 <airship-irc-bot> <ih616h> Here's one that failed on the golang image https://zuul.opendev.org/t/openstack/build/36ce6912b90b45378d246273d193ebb6
15:33:32 <dwalt> Yeah. Mostly seeing the node failures on the front page
15:33:37 <airship-irc-bot> <ih616h> (if that's what you're looking for)
15:34:13 <mattmceuen> ahh yep, it's for the golang
15:34:30 <airship-irc-bot> <ih616h> which makes sense - that image is pulled even for the linting jobs
15:34:32 <roman_g> https://review.opendev.org/#/c/755456 Docker cache for airship/airshipctl Zuul jobs
15:34:48 <mattmceuen> so migrating to an alternate source for that is a top priority, any volunteers?
15:35:20 <dwalt> If we're just going to do a quick push, probably needs to be one of us from the wc :)
15:35:37 <dwalt> I can do it
15:35:50 <mattmceuen> oh I figured there might already be a golang image in gcr land :)
15:35:56 <mattmceuen> they share a "g"
15:36:16 <dwalt> oh that's a better idea
15:37:21 <roman_g> There are also a few images which are pulled from inside of nested VMs. Those also need to be verified to see where they get pulled from.
15:38:08 <roman_g> The PS above is a solution to enable caching of container images on OpenDev infrastructure servers.
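For context, caching like that typically means pointing the Docker daemon at a pull-through registry mirror; a minimal sketch of what a job could write to /etc/docker/daemon.json, where the mirror URL is a placeholder rather than the actual OpenDev mirror address:

    # point the daemon at a pull-through cache (mirror URL is a placeholder), then restart it
    echo '{ "registry-mirrors": ["http://mirror.example.opendev.org:8082"] }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker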
15:38:27 <mattmceuen> awesome ty roman_g
15:39:02 <mattmceuen> dwalt thank you for volunteering for that one, yeah I'd say just spend a few minutes searching for prior work before pushing to quay.io/airshipit as a Plan B
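If no existing mirror turns up, the Plan B push is mechanical; a rough sketch, assuming quay.io/airshipit is the target namespace and the tags are kept identical (the quay-side repository names are illustrative):

    # retag the Docker Hub images and push copies to quay
    docker pull docker.io/fluxcd/helm-operator:1.2.0
    docker tag  docker.io/fluxcd/helm-operator:1.2.0 quay.io/airshipit/helm-operator:1.2.0
    docker push quay.io/airshipit/helm-operator:1.2.0
    docker pull docker.io/golang:1.13.1-stretch
    docker tag  docker.io/golang:1.13.1-stretch quay.io/airshipit/golang:1.13.1-stretch
    docker push quay.io/airshipit/golang:1.13.1-stretch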
15:39:33 <dwalt> I'll post to #airshipit if I find one
15:39:45 <mattmceuen> Good news -- just learned from Arijit that the 3rd-party gates are now reporting back to gerrit!
15:40:00 <dwalt> nice!
15:40:08 <mattmceuen> Can I ask for a volunteer to put in a change to make the deployment zuul job on-merge only?
15:40:25 <mattmceuen> I'd do it but need to step away for a couple hours
15:40:41 <roman_g> mattmceuen Make airshipctl gate runner script run only on request and on Zuul gate (pre-merge) to reduce workload onto community CI
15:40:41 <roman_g> https://review.opendev.org/762136
15:41:17 <roman_g> It's all in meeting Etherpad https://etherpad.opendev.org/p/airship-team-meeting
15:41:20 <mattmceuen> That's perfect, +2 - ty Roman
15:41:38 <dwalt> thanks roman_g
15:41:43 <dwalt> Anything else we need to hash out?
15:41:56 <mattmceuen> ok dwalt that all I have for our gate shenanigans, THANKS ALL for the working session
15:42:05 <dwalt> great!
15:42:08 <dwalt> #topic release tagging
15:42:14 <roman_g> Images pulled from inside of nested VMs are not covered.
15:42:21 <dwalt> I think you're back in the spotlight mattmceuen
15:42:41 <dwalt> roman_g: do you know those atm? We may need to let the gates run to see
15:43:05 <mattmceuen> @roman_g agree we need to follow up on that; hopefully the golang change will give us a little bit of respite, esp. with switching most of the VM load to the 3rd party gates
15:43:07 <roman_g> dwalt not atm, need to grep code a bit
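A quick sweep for those could go wider than the manifests grep above, since nested-VM images tend to be referenced from scripts and playbooks rather than kustomize functions; a rough sketch, with the file globs as guesses:

    # list every Docker Hub reference in the repo, not just manifest image: fields
    grep -rn "docker.io/" --include="*.yaml" --include="*.sh" --include="Dockerfile*" .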
15:43:39 <dwalt> Sounds good. Let's go ahead and cover our last agenda item and circle back if we have some remaining time
15:44:00 <mattmceuen> Tagging:  so we'd decided when we cut our Beta release that we'd tag the quay.io image, but not the repo
15:44:35 <mattmceuen> For the life of me, I don't recall why we thought that was a good idea -- seems like we should be doing both; @sean.eagan brought up yesterday wanting to do git diffs and needing to dig for commit hashes
15:45:28 <mattmceuen> In addition, we're not actually labelling our tagged quay.io images with their git hash, so it's not easy to trace from "v2.0.0-beta.1" tag to commit hash in quay
15:45:58 <mattmceuen> Are there any concerns with additionally labelling the airshipctl repo (and others in the future) with release tags which match the built quay images?
15:47:07 <dwalt> I can't think of any, but I missed Sean's concern the other day
15:47:11 <airship-irc-bot> <sean.eagan> use case is starting to explore publishing actual github releases with release notes etc: https://github.com/airshipit/airshipctl/issues/390
15:47:19 <dwalt> Do we need to start publishing our images with the commit sha mattmceuen?
15:47:38 <mattmceuen> I think that would be a good thing to do additionally as well dwalt
15:48:03 <mattmceuen> Cool, I will tag the repo later today then unless anyone objects.
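For what that might look like, a minimal sketch, assuming the git tag mirrors the existing quay tag and the OCI revision label is the convention used to trace images back to commits (neither detail was settled here):

    # tag the repo to match the already-published quay image tag
    git tag -a v2.0.0-beta.1 -m "Airship 2.0.0 beta 1" <commit-sha>
    git push origin v2.0.0-beta.1
    # going forward, bake the commit hash into the published image
    docker build --label "org.opencontainers.image.revision=$(git rev-parse HEAD)" \
      -t quay.io/airshipit/airshipctl:v2.0.0-beta.1 .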
15:48:11 <mattmceuen> all from me dwalt :)
15:48:25 <dwalt> Sounds good. Thanks mattmceuen
15:48:43 <dwalt> I'll create an issue for the image publishing. We can discuss on the flight plan call
15:48:49 <dwalt> #topic roundtable
15:49:08 <dwalt> Okay folks, 10 mins left. Anything else we need to discuss?
15:50:53 <dwalt> #topic reviews
15:50:56 <dwalt> #link https://review.opendev.org/#/c/730777/
15:51:13 <dwalt> Just one review for today. And with that, we can adjourn. Thanks, everyone! Have a great day
15:51:16 <dwalt> #endmeeting