15:00:51 #startmeeting airship
15:00:52 Meeting started Tue Nov 10 15:00:51 2020 UTC and is due to finish in 60 minutes. The chair is dwalt. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:54 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:56 The meeting name has been set to 'airship'
15:01:07 o/
15:01:08 Hey everyone, we'll get started in a few minutes. The design call is still wrapping up
15:04:04 o/
15:04:09 o/
15:04:13 o/
15:04:31 o/
15:04:44 o/
15:04:45 Welcome to everyone filtering in from the design call. Here is our agenda for today
15:04:47 #link https://etherpad.opendev.org/p/airship-team-meeting
15:06:02 o/
15:06:31 Alrighty, let's get things started
15:06:34 #topic Zuul gate problems
15:06:35 o/
15:06:56 As folks have probably noticed, we have a pair of unrelated but brutal issues with our gates at the moment
15:07:00 o/
15:07:10 First: infrastructure availability for our 16GB VMs
15:07:33 Of the two providers that supply them, we're back down to one working at the moment -- resulting in a lot of NODE_FAILURE-type errors
15:07:48 o/
15:07:51 o/
15:07:52 o/
15:08:06 (The same problem as earlier?)
15:08:09 Arijit has been working to get 3rd-party gating going per patchset
15:08:14 similar @jemangs
15:08:24 I think this may be a networking problem; previously it was a cooling problem, or something like that
15:08:58 Currently the 3rd-party gates are non-voting, and not (yet) reporting status up to patchsets -- that's a work in progress
15:09:05 jemangs, similar. Previously it was an AC failure; now it's a router malfunction.
15:09:26 https://jenkins.nc.opensource.att.com
15:09:56 While we get status reporting to patchsets working, developers can manually check their job's status in that UI^
15:10:40 And we'd like to switch (when ready) to making the 3rd-party deployment gate our voting gate, instead of the zuul deployment gate
15:11:02 to be clear: all the things that can be done in "normal size" VMs -- linting, image builds, etc. -- would still be run via zuul
15:11:56 I would propose that for the time being, we make the zuul deployment gate *on-merge only*, and then check Jenkins for per-patch testing status
15:12:05 This will get us unblocked
15:12:12 Thoughts/concerns?
15:12:19 This is great, thanks mattmceuen and sreejithp
15:12:23 +1
15:12:51 Does that mean the job would be disabled for check, and enabled as voting for gate?
15:13:06 yes, disabled for check -- leaving more of our limited capacity available for the merge job
15:13:44 dwalt, yes, but you can trigger the job manually by leaving a "check experimental" comment on the patch set in Gerrit
15:13:51 Did we hear any more about a resolution to the router malfunction?
15:13:52 +1
15:13:53 That makes a lot of sense. More reliable than relying on the cores to verify that the job passed. +1 from me
15:14:10 The longer-term (but hopefully still soon) idea would be to make the 3rd-party gating voting, and make the zuul gate non-voting
15:14:42 We should pull the trigger on that once we have status reporting back to the PS in place, and are comfortable with it
15:15:00 mf4716, no. It's a non-commercial cloud provider (basically a home-based server farm).
15:15:01 Any other thoughts/comments before we move on to the other terrible gate issue? :)
15:15:27 yeah - really useful from a community perspective, but can be problematic when we're relying on it for our day jobs
15:15:55 any ETA, roman?
15:15:57 at least, when the third-party gates break, it's our job to fix them :)
15:16:32 mf4716, no ETA.
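
For reference, a minimal sketch of the pipeline change proposed above: the heavy deployment job runs only on merge (and on demand via "check experimental"), while lightweight jobs stay per-patchset. This assumes the standard Zuul project-stanza syntax; the job names below are placeholders, not the real airshipctl job definitions.

# .zuul.yaml sketch -- job names are illustrative assumptions
- project:
    check:
      jobs:
        - airship-airshipctl-lint          # lightweight jobs (lint, image build) stay per-patchset
    gate:
      jobs:
        - airship-airshipctl-lint
        - airship-airshipctl-deploy-site   # heavy 16GB-VM deployment job, on-merge only
    experimental:
      jobs:
        - airship-airshipctl-deploy-site   # triggered manually with a "check experimental" comment in Gerrit
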
15:16:52 So issue #2:
15:17:02 And this one may be more challenging to work around.
15:17:23 Docker Hub has instituted rate limiting on image pulls, limiting us to 100 pulls per IP per month
15:18:04 I believe we thought we'd been using some kind of image caching, but weren't, and now we've exhausted our quota for the month from some of the open infra IPs
15:18:24 A few ways we could potentially approach this:
15:18:52 1. migrate to using the 3rd-party gates as voting ASAP, and make sure they have caching enabled -- they'll have different IPs and so won't be hurt by the quota
15:19:15 2. migrate to a non-Docker-Hub mirror for all images currently hosted on Docker Hub
15:19:32 3. upgrade to a paid plan
15:19:55 Do we have an estimate on the number of images impacted?
15:21:07 This is a promising (incomplete) list:
15:21:14 [madgin@leviathan:~/airship2/airshipctl]$ grep -r "image: docker" manifests
15:21:14 manifests/function/helm-operator/deployment.yaml: image: docker.io/fluxcd/helm-operator:1.2.0
15:21:14 [madgin@leviathan:~/airship2/airshipctl]$
15:21:55 Can we put the point-in-time images we care about in quay and use that instead?
15:21:58 lol. That is indeed promising
15:22:45 If it's one image, migrating it to Quay seems like it would be feasible
15:22:52 It's a single image?? The word "list" is rhetorical
15:23:01 lol I should have googled earlier, mea culpa
15:23:09 there could be other images in other projects though
15:23:15 but getting airshipctl unblocked would be great
15:23:19 helm-controller/source-controller are now in GitHub Container Registry
15:23:31 \o/
15:23:35 (the replacements for helm-operator)
15:24:12 https://github.com/orgs/fluxcd/packages/container/package/helm-controller
15:24:33 what about helm-operator? We're not quite ready to switch to the helm-controller yet, are we @sean.eagan?
15:24:41 So is it fair to say that we can push the operator image to quay, since we are moving away from it?
15:24:47 yes
15:24:54 https://review.opendev.org/#/c/758615/
15:24:56 +1
15:24:58 ^ it's merged
15:25:21 mattmceuen, we also need the Go docker image "docker.io/golang:1.13.1-stretch"
15:25:32 we use this for building the airshipctl binary
15:26:21 ahh you're right
15:26:40 do we even still need helm-operator now that we have the helm-controller merged, Sean?
15:27:21 if we simply switch the test-site to use helm-controller, -operator wouldn't get run by the tests
15:27:49 the function for helm-operator is still in place (for a short migration period), but it's not actually used by the default phases included in airshipctl
15:28:17 then it must only be the golang image that's hurting the gates, right?
15:28:58 Is the gate runner failing in the pre-run playbooks or later on?
15:29:06 I think that's when we run `make images`
15:29:08 (treasuremap is using the helm-operator, so we'll need to handle that)
15:29:45 https://review.opendev.org/#/c/761666/
15:29:52 ^ treasuremap migration
15:30:50 nice!
15:32:44 That's great, sean.eagan. Between pushing those two images to quay and finishing the helm-controller migration, it sounds like this is largely under control
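
As an illustration of the "push the image to quay and point at that" option being discussed, a kustomize image override along the lines below could repoint the helm-operator reference found by the grep above. The quay.io/airshipit/helm-operator name is an assumption for the example, not a confirmed destination.

# kustomization.yaml sketch -- target registry/name is hypothetical
resources:
  - deployment.yaml
images:
  - name: docker.io/fluxcd/helm-operator      # image name as it appears in the manifest
    newName: quay.io/airshipit/helm-operator  # assumed quay.io destination
    newTag: "1.2.0"
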
15:33:02 I'm trying to find a failed job example and am not finding one quickly... I was seeing a few yesterday, so that's interesting
15:33:29 Here's one that failed on the golang image https://zuul.opendev.org/t/openstack/build/36ce6912b90b45378d246273d193ebb6
15:33:32 Yeah. Mostly seeing the node failures on the front page
15:33:37 (if that's what you're looking for)
15:34:13 ahh yep, it's for the golang image
15:34:30 which makes sense - that image is pulled for even the linting jobs
15:34:32 https://review.opendev.org/#/c/755456 Docker cache for airship/airshipctl Zuul jobs
15:34:48 so migrating to an alternate source for that is a top priority -- any volunteers?
15:35:20 If we're just going to do a quick push, it probably needs to be one of us from the wc :)
15:35:37 I can do it
15:35:50 oh I figured there might already be a golang image in GCR land :)
15:35:56 they share a "g"
15:36:16 oh that's a better idea
15:37:21 There are also a few images which are pulled from inside the nested VMs. We also need to verify where those get pulled from.
15:38:08 The PS above is a solution to enable caching of container images on OpenDev infrastructure servers.
15:38:27 awesome, ty roman_g
15:39:02 dwalt thank you for volunteering for that one, yeah I'd say just spend a few minutes searching for prior work before pushing to quay.io/airshipit as a Plan B
15:39:33 I'll post to #airshipit if I find one
15:39:45 Good news -- just learned from Arijit that the 3rd-party gates are now reporting back to Gerrit!
15:40:00 nice!
15:40:08 Can I ask for a volunteer to put in a change to make the deployment zuul job on-merge only?
15:40:25 I'd do it but need to step away for a couple of hours
15:40:41 mattmceuen: Make the airshipctl gate runner script run only on request and on Zuul gate (pre-merge), to reduce the workload on the community CI
15:40:41 https://review.opendev.org/762136
15:41:17 It's all in the meeting Etherpad https://etherpad.opendev.org/p/airship-team-meeting
15:41:20 That's perfect, +2 - ty Roman
15:41:38 thanks roman_g
15:41:43 Anything else we need to hash out?
15:41:56 ok dwalt, that's all I have for our gate shenanigans, THANKS ALL for the working session
15:42:05 great!
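
For reference, the general idea behind the Docker cache patchset linked above (755456) is to route image pulls through a pull-through mirror before the job runs. The pre-run playbook below is only a rough sketch of that idea, with a placeholder mirror URL; it is not the content of the actual change.

# Illustrative Zuul pre-run playbook: route docker.io pulls through a mirror so
# repeated golang image pulls don't count against the per-IP Docker Hub quota.
# The mirror URL is a placeholder assumption.
- hosts: all
  become: true
  tasks:
    - name: Point the Docker daemon at a pull-through registry mirror
      copy:
        dest: /etc/docker/daemon.json
        content: '{"registry-mirrors": ["https://mirror.example.opendev.org:8082"]}'

    - name: Restart Docker so the mirror configuration takes effect
      service:
        name: docker
        state: restarted
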
15:42:08 #topic release tagging
15:42:14 Images pulled from inside the nested VMs are not covered.
15:42:21 I think you're back in the spotlight, mattmceuen
15:42:41 roman_g: do you know those atm? We may need to let the gates run to see
15:43:05 @roman_g agree we need to follow up on that; hopefully the golang change will give us a little bit of respite, especially with switching most of the VM load to the 3rd-party gates
15:43:07 dwalt, not atm, I need to grep the code a bit
15:43:39 Sounds good. Let's go ahead and cover our last agenda item and circle back if we have some remaining time
15:44:00 Tagging: so we'd decided when we cut our Beta release that we'd tag the quay.io image, but not the repo
15:44:35 For the life of me, I don't recall why we thought that was a good idea -- seems like we should be doing both; @sean.eagan brought up yesterday wanting to do git diffs and needing to dig for commit hashes
15:45:28 In addition, we're not actually labelling our tagged quay.io images with their git hash, so it's not easy to trace from the "v2.0.0-beta.1" tag to a commit hash in quay
15:45:58 Are there any concerns with additionally labelling the airshipctl repo (and others in the future) with release tags which match the built quay images?
15:47:07 I can't think of any, but I missed Sean's concern the other day
15:47:11 the use case is starting to explore publishing actual GitHub releases with release notes etc.: https://github.com/airshipit/airshipctl/issues/390
15:47:19 Do we need to start publishing our images with the commit SHA, mattmceuen?
15:47:38 I think that would be a good thing to do additionally as well, dwalt
15:48:03 Cool, I will tag the repo later today, then, unless anyone objects.
15:48:11 all from me, dwalt :)
15:48:25 Sounds good. Thanks, mattmceuen
15:48:43 I'll create an issue for the image publishing. We can discuss it on the flight plan call
15:48:49 #topic roundtable
15:49:08 Okay folks, 10 mins left. Anything else we need to discuss?
15:50:53 #topic reviews
15:50:56 #link https://review.opendev.org/#/c/730777/
15:51:13 Just one review for today. And with that, we can adjourn. Thanks, everyone! Have a great day
15:51:16 #endmeeting
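
As a post-meeting reference for the release-tagging discussion above, here is a rough sketch of the agreed steps: tag the repo to match the published quay.io image, and record the commit SHA on the image so a quay tag can be traced back to a commit. It is expressed as an Ansible playbook for consistency with the other examples; the tag value, image name, and label key are illustrative assumptions, not the project's actual release tooling.

# Illustrative release playbook -- names and values are placeholders
- hosts: localhost
  vars:
    release_tag: v2.0.0-beta.1               # example tag from the discussion
  tasks:
    - name: Capture the commit SHA being released
      command: git rev-parse HEAD
      register: git_sha

    - name: Tag the repository to match the published image tag
      command: git tag -a {{ release_tag }} -m "airshipctl {{ release_tag }}"

    - name: Build the image with the commit SHA recorded as an OCI revision label
      command: >
        docker build
        --label org.opencontainers.image.revision={{ git_sha.stdout }}
        -t quay.io/airshipit/airshipctl:{{ release_tag }} .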