15:00:51 <dwalt> #startmeeting airship
15:00:52 <openstack> Meeting started Tue Nov 10 15:00:51 2020 UTC and is due to finish in 60 minutes. The chair is dwalt. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:56 <openstack> The meeting name has been set to 'airship'
15:01:07 <mattmceuen> o/
15:01:08 <dwalt> Hey everyone, we'll get started in a few minutes. The design call is still wrapping up
15:04:04 <airship-irc-bot> <ih616h> o/
15:04:09 <airship-irc-bot> <mf4716> o/
15:04:13 <airship-irc-bot> <james.gu> o/
15:04:31 <airship-irc-bot> <ak3216> o/
15:04:44 <roman_g> o/
15:04:45 <dwalt> Welcome to everyone filtering in from the design call. Here is our agenda today
15:04:47 <dwalt> #link https://etherpad.opendev.org/p/airship-team-meeting
15:06:02 <airship-irc-bot> <j_t_williams> o/
15:06:31 <dwalt> Alrighty, let's get things started
15:06:34 <dwalt> #topic Zuul gate problems
15:06:35 <airship-irc-bot> <sean.eagan> o/
15:06:56 <mattmceuen> Folks have probably noticed, we have a pair of unrelated but brutal issues with our gates at the moment
15:07:00 <jemangs> o/
15:07:10 <mattmceuen> First: infrastructure availability for our 16gb VMs
15:07:33 <mattmceuen> Of the two providers that supply them, we're back down to one working at the moment -- resulting in a lot of NODE_FAILURE type errors
15:07:48 <airship-irc-bot> <lb4368> o/
15:07:51 <airship-irc-bot> <mb551n> o/
15:07:52 <sreejithp> o/
15:08:06 <jemangs> (The same problem as earlier?)
15:08:09 <mattmceuen> Arijit has been working to get 3rd-party gating going per patchset
15:08:14 <mattmceuen> similar @jemangs
15:08:24 <mattmceuen> I think this may be a networking problem; previously it was a cooling problem, or something
15:08:58 <mattmceuen> Currently the 3rd party gates are non-voting, and not (yet) reporting status up to patchsets -- that's a work in progress
15:09:05 <roman_g> jemangs, similar. Previously that was an AC failure; now a router malfunction.
15:09:26 <mattmceuen> https://jenkins.nc.opensource.att.com
15:09:56 <mattmceuen> While we get status reporting to patchsets working, developers can manually check for their job's status in that UI^
15:10:40 <mattmceuen> And we'd like to switch (when ready) to making the 3rd party deployment gate our voting gate, instead of the zuul deployment gate
15:11:02 <mattmceuen> to be clear: all the things that can be done in "normal size" VMs -- linting, image build, etc -- would still be run via zuul
15:11:56 <mattmceuen> I would propose that for the time being, we make the zuul deployment gate *on-merge only*, and then check in jenkins for per-patch testing status
15:12:05 <mattmceuen> This will get us unblocked
15:12:12 <mattmceuen> Thoughts/concerns?
15:12:19 <dwalt> This is great, thanks mattmceuen and sreejithp
15:12:23 <airship-irc-bot> <ih616h> +1
15:12:51 <dwalt> Does that mean the job would be disabled for check, and enabled voting for gate?
15:13:06 <mattmceuen> yes, disabled for check -- leaving more of our limited capacity available for the merge job
15:13:44 <roman_g> dwalt, Yes, but you can trigger the job manually by leaving a "check experimental" comment on the patch set in Gerrit
15:13:51 <airship-irc-bot> <mf4716> Did we hear any more about resolution to the router malfunction?
15:13:52 <mattmceuen> +1
15:13:53 <dwalt> That makes a lot of sense. More reliable than relying on the cores to verify that the job passed. +1 from me
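
For reference, the "check experimental" trigger roman_g mentions is just a Gerrit comment; one way to leave it from the command line is sketched below. This is only an illustration, assuming SSH access to review.opendev.org is already set up; the change/patchset values are placeholders and the exact quoting may need adjusting:

    # post a "check experimental" comment on a patchset to trigger the experimental pipeline
    ssh -p 29418 <gerrit-username>@review.opendev.org \
      gerrit review --message '"check experimental"' <change-number>,<patchset-number>

Leaving the same comment through the Gerrit web UI has the same effect.
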
15:14:10 <mattmceuen> The longer-term (but hopefully still soon) idea would be to make the 3rd party gating Voting, and make the zuul gate non-voting
15:14:42 <mattmceuen> We should pull the trigger on that once we have status reporting back to the PS in place, and are comfortable with it
15:15:00 <roman_g> mf4716 no. It's a non-commercial cloud provider (basically a home-based server farm).
15:15:01 <mattmceuen> Any other thoughts/comments before we move on to the other terrible gate issue? :)
15:15:27 <mattmceuen> yeah - really useful from a community perspective, but can be problematic when we're relying on it for our day jobs
15:15:55 <airship-irc-bot> <mf4716> any ETA roman?
15:15:57 <mattmceuen> at least, when the third party gates break, it's our job to fix them :)
15:16:32 <roman_g> mf4716 no ETA.
15:16:52 <mattmceuen> So issue #2:
15:17:02 <mattmceuen> And this one may be more challenging to work around.
15:17:23 <mattmceuen> dockerhub has instituted rate limiting on image pulls, limiting to 100 pulls per IP per month
15:18:04 <mattmceuen> I believe we thought we'd been using some kind of image caching, but weren't, and now we've exhausted our quota for the month from some of the open infra IPs
15:18:24 <mattmceuen> A few ways we could potentially approach this:
15:18:52 <mattmceuen> 1. migrate to using the 3rd party gates as Voting ASAP, and make sure they have caching enabled -- they'll have different IPs and so won't be hurt by the quota
15:19:15 <mattmceuen> 2. migrate to a non-dockerhub mirror for all images hosted in dockerhub
15:19:32 <mattmceuen> 3. upgrade to a paid plan
15:19:55 <dwalt> Do we have an estimate on the number of images impacted?
15:21:07 <mattmceuen> This is a promising (incomplete) list:
15:21:14 <mattmceuen> [madgin@leviathan:~/airship2/airshipctl]$ grep -r "image: docker" manifests
15:21:14 <mattmceuen> manifests/function/helm-operator/deployment.yaml: image: docker.io/fluxcd/helm-operator:1.2.0
15:21:14 <mattmceuen> [madgin@leviathan:~/airship2/airshipctl]$
15:21:55 <airship-irc-bot> <rp2723> Can we put the point-in-time images we care about in quay and use that instead?
15:21:58 <dwalt> lol. That is indeed promising
15:22:45 <dwalt> If it's one image, migrating it to Quay seems like it would be feasible
15:22:52 <airship-irc-bot> <rp2723> It's a single image?? The word "list" is rhetorical
15:23:01 <mattmceuen> lol I should have googled earlier, mea culpa
15:23:09 <mattmceuen> there could be other images in other projects though
15:23:15 <mattmceuen> but getting airshipctl unblocked would be great
15:23:19 <airship-irc-bot> <sean.eagan> helm-controller/source-controller are now in github container registry
15:23:31 <dwalt> \o/
15:23:35 <airship-irc-bot> <sean.eagan> (the replacements for helm-operator)
15:24:12 <airship-irc-bot> <sean.eagan> https://github.com/orgs/fluxcd/packages/container/package/helm-controller
15:24:33 <mattmceuen> what about helm-operator? We're not quite ready to switch to the helm-controller yet, are we @sean.eagan?
15:24:41 <dwalt> So is it fair to say that we can push the operator image to quay, since we are moving away from it?
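
Mirroring that pinned image to quay only takes a few commands. A minimal sketch, assuming quay.io/airshipit is the destination organization and the repository keeps the upstream name (both assumptions, not something decided on the call):

    # pull the pinned image from Docker Hub, retag it for quay, and push
    docker pull docker.io/fluxcd/helm-operator:1.2.0
    docker tag docker.io/fluxcd/helm-operator:1.2.0 quay.io/airshipit/helm-operator:1.2.0
    docker login quay.io
    docker push quay.io/airshipit/helm-operator:1.2.0
    # then point the manifest at the mirror, e.g. in
    # manifests/function/helm-operator/deployment.yaml:
    #   image: quay.io/airshipit/helm-operator:1.2.0
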
15:24:47 <mattmceuen> yes
15:24:54 <airship-irc-bot> <sean.eagan> https://review.opendev.org/#/c/758615/
15:24:56 <mattmceuen> +1
15:24:58 <airship-irc-bot> <sean.eagan> ^ it's merged
15:25:21 <sreejithp> mattmceuen, we also need the go docker image "docker.io/golang:1.13.1-stretch"
15:25:32 <sreejithp> we use this for building the airshipctl binary
15:26:21 <mattmceuen> ahh you're right
15:26:40 <mattmceuen> do we even still need helm-operator now that we have the helm-controller merged Sean?
15:27:21 <mattmceuen> if we simply switch the test-site to use helm-controller, -operator wouldn't get run by the tests
15:27:49 <airship-irc-bot> <sean.eagan> the function for helm-operator is still in place (for a short migration period), but it's not actually used by the default phases included in airshipctl
15:28:17 <mattmceuen> then it must only be the golang image that's hurting the gates, right?
15:28:58 <dwalt> Is the gate runner failing in the pre-run playbooks or later on?
15:29:06 <dwalt> I think that's when we run `make images`
15:29:08 <mattmceuen> (treasuremap is using the helm-operator, so we'll need to handle that)
15:29:45 <airship-irc-bot> <sean.eagan> https://review.opendev.org/#/c/761666/
15:29:52 <airship-irc-bot> <sean.eagan> ^ treasuremap migration
15:30:50 <mattmceuen> nice!
15:32:44 <dwalt> That's great sean.eagan. Between pushing those two images to quay and finishing the helm-controller migration, it sounds like this is largely under control
15:33:02 <mattmceuen> I'm trying to find a failed job example and am not finding one quickly... was seeing a few yesterday, so that's interesting
15:33:29 <airship-irc-bot> <ih616h> Here's one that failed on the golang image https://zuul.opendev.org/t/openstack/build/36ce6912b90b45378d246273d193ebb6
15:33:32 <dwalt> Yeah. Mostly seeing the node failures on the front page
15:33:37 <airship-irc-bot> <ih616h> (if that's what you're looking for)
15:34:13 <mattmceuen> ahh yep, it's for the golang
15:34:30 <airship-irc-bot> <ih616h> which makes sense - that image is pulled for even the linting jobs
15:34:32 <roman_g> https://review.opendev.org/#/c/755456 Docker cache for airship/airshipctl Zuul jobs
15:34:48 <mattmceuen> so migrating to an alternate source for that is a top priority, any volunteers?
15:35:20 <dwalt> If we're just going to do a quick push, it probably needs to be one of us from the wc :)
15:35:37 <dwalt> I can do it
15:35:50 <mattmceuen> oh I figured there might already be a golang image in gcr land :)
15:35:56 <mattmceuen> they share a "g"
15:36:16 <dwalt> oh that's a better idea
15:37:21 <roman_g> There are also a few images which are pulled from inside of nested VMs. Those also need to be verified for where they get pulled from.
15:38:08 <roman_g> The PS above is a solution to enable caching of container images on OpenDev infrastructure servers.
15:38:27 <mattmceuen> awesome ty roman_g
15:39:02 <mattmceuen> dwalt thank you for volunteering for that one, yeah I'd say just spend a few minutes searching for prior work before pushing to quay.io/airshipit as a Plan B
15:39:33 <dwalt> I'll post to #airshipit if I find one
15:39:45 <mattmceuen> Good news -- just learned from Arijit that the 3rd-party gates are now reporting back to gerrit!
15:40:00 <dwalt> nice!
15:40:08 <mattmceuen> Can I ask for a volunteer to put in a change to make the deployment zuul job on-merge only?
15:40:25 <mattmceuen> I'd do it but need to step away for a couple of hours
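
To follow up on roman_g's point about verifying where every image actually gets pulled from (including the ones used by nested VMs), a quick inventory could look something like the sketch below. The directories beyond manifests/ and the root Dockerfile are a guess at where pull references might live, not a confirmed list:

    # find every Docker Hub reference in the repo, not just the manifests
    grep -rn "docker.io" manifests tools playbooks
    # and the base image used to build the airshipctl binary/image itself
    grep -n "^FROM" Dockerfile
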
15:40:41 <roman_g> mattmceuen Make airshipctl gate runner script run only on request and on Zuul gate (pre-merge) to reduce workload on community CI
15:40:41 <roman_g> https://review.opendev.org/762136
15:41:17 <roman_g> It's all in the meeting Etherpad https://etherpad.opendev.org/p/airship-team-meeting
15:41:20 <mattmceuen> That's perfect, +2 - ty Roman
15:41:38 <dwalt> thanks roman_g
15:41:43 <dwalt> Anything else we need to hash out?
15:41:56 <mattmceuen> ok dwalt that's all I have for our gate shenanigans, THANKS ALL for the working session
15:42:05 <dwalt> great!
15:42:08 <dwalt> #topic release tagging
15:42:14 <roman_g> Images pulled from inside of nested VMs are not covered.
15:42:21 <dwalt> I think you're back in the spotlight mattmceuen
15:42:41 <dwalt> roman_g: do you know those atm? We may need to let the gates run to see
15:43:05 <mattmceuen> @roman_g agree we need to follow up on that; hopefully the golang change will give us a little bit of respite, especially with switching most of the VM load to 3rd party gates
15:43:07 <roman_g> dwalt not atm, need to grep the code a bit
15:43:39 <dwalt> Sounds good. Let's go ahead and cover our last agenda item and circle back if we have some remaining time
15:44:00 <mattmceuen> Tagging: so we'd decided when we cut our Beta release that we'd tag the quay.io image, but not the repo
15:44:35 <mattmceuen> For the life of me, I don't recall why we thought that was a good idea -- seems like we should be doing both; @sean.eagan brought up yesterday wanting to do git diffs and needing to dig for commit hashes
15:45:28 <mattmceuen> In addition, we're not actually labelling our tagged quay.io images with their git hash, so it's not easy to trace from the "v2.0.0-beta.1" tag to a commit hash in quay
15:45:58 <mattmceuen> Are there any concerns with additionally labelling the airshipctl repo (and others in the future) with release tags which match the built quay images?
15:47:07 <dwalt> I can't think of any, but I missed Sean's concern the other day
15:47:11 <airship-irc-bot> <sean.eagan> use case is starting to explore publishing actual github releases with release notes etc: https://github.com/airshipit/airshipctl/issues/390
15:47:19 <dwalt> Do we need to start publishing our images with the commit sha mattmceuen?
15:47:38 <mattmceuen> I think that would be a good thing to do additionally as well dwalt
15:48:03 <mattmceuen> Cool, I will tag the repo later today then unless anyone objects.
15:48:11 <mattmceuen> all from me dwalt :)
15:48:25 <dwalt> Sounds good. Thanks mattmceuen
15:48:43 <dwalt> I'll create an issue for the image publishing. We can discuss on the flight plan call
15:48:49 <dwalt> #topic roundtable
15:49:08 <dwalt> Okay folks, 10 mins left. Anything else we need to discuss?
15:50:53 <dwalt> #topic reviews
15:50:56 <dwalt> #link https://review.opendev.org/#/c/730777/
15:51:13 <dwalt> Just one review for today. And with that, we can adjourn. Thanks, everyone! Have a great day
15:51:16 <dwalt> #endmeeting
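
As a rough illustration of the tagging approach discussed above (not an agreed procedure; the remote name, tag value, and image name are assumptions), tying the repo tag and the quay image back to the same commit could look like:

    # tag the commit the beta image was built from and publish the tag
    git tag -a v2.0.0-beta.1 <commit-sha> -m "airshipctl v2.0.0-beta.1"
    git push origin v2.0.0-beta.1
    # embed the commit hash in the image so quay tags can be traced to a commit,
    # using the standard OCI revision label
    docker build \
      --label org.opencontainers.image.revision=$(git rev-parse HEAD) \
      -t quay.io/airshipit/airshipctl:v2.0.0-beta.1 .
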