openstackgerrit | Ian Wienand proposed openstack/project-config master: Add centos aarch64 to labels https://review.opendev.org/720619 | 00:00 |
*** DSpider has quit IRC | 00:04 | |
ianw | why that is not trying to build has me stumped right now | 00:23 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add centos aarch64 to labels, unpause https://review.opendev.org/720619 | 00:25 |
ianw | yeah, unpausing it will help | 00:26 |
ianw | i've applied that manually and am watching an initial build | 00:26 |
*** ysandeep|away is now known as ysandeep | 01:03 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add centos aarch64 tests https://review.opendev.org/720339 | 01:18 |
mnaser | infra-root: ok, i have one issue right now, i _could_ work around it by abandon/restore but maybe useful for someone to take a look and see why? http://zuul.opendev.org/t/vexxhost/status 720595,6 has been stuck for 2h18m (and new jobs are starting inside openstack tenant so its not a lack of nodes)... | 01:19 |
clarkb | mnaser: it might be the inap issue we saw earlier today | 01:19 |
clarkb | mnaser: basically we seem to leak enough nodes there due to node deletes that report success but never complete, and that breaks quota accounting so we over-attempt to boot instances in inap | 01:20 |
clarkb | and basically it delays things | 01:20 |
mnaser | clarkb: does nodepool enforce building vms in the same provider? | 01:20 |
clarkb | mnaser: only within a job | 01:20 |
mnaser | aaaah, so maybe it keeps trying to get an inap job | 01:21 |
clarkb | I dont think its repeatedly trying | 01:21 |
clarkb | its just waiting for inap to actually delete the nodes it said it deleted so new ones can boot | 01:21 |
mnaser | clarkb: ah, so probably just best to sit and wait and if it's still around for much longer then maybe abandon/restore | 01:22 |
openstackgerrit | Merged openstack/project-config master: Add centos aarch64 to labels, unpause https://review.opendev.org/720619 | 01:23 |
clarkb | if it persists longer we should maybe have deletes poll more or disable inap or something | 01:24 |
clarkb | its a weird behavior and seems new but I spent a chunk of the morning tracing it through and pretty sure root cause is nova delete says "yes I succeeded" but then the server persists for a long time | 01:24 |
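(A rough way to spot this kind of leak, assuming shell access to the nodepool launcher and cloud credentials for the provider; the cloud name here is illustrative:)

```bash
# nodes nodepool still thinks are deleting in the affected provider
nodepool list | grep inap | grep -i deleting

# compare against what the cloud itself still reports as existing
openstack --os-cloud inap server list
```

If those two views stay out of sync for a long time, the launcher's quota accounting is working from stale data and will keep over-requesting.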
mnaser | clarkb: the queued jobs do depend on the registry that's paused, so maybe that contributes to it? | 01:25 |
clarkb | the one I looked at today was a tempest job so no docker bits | 01:25 |
mnaser | ah ok, the paused one is in our cloud right now | 01:26 |
clarkb | oh hrm do required jobs like that end up in the same cloud? | 01:27 |
clarkb | fwiw Im not at computer so cant debug directly but otherwise sounds similar to the inap thing from today | 01:28 |
mnaser | clarkb: i dont know if required jobs like that end up in the same cloud, but im curious to know. but yeah, if you saw that behaviour earlier then might be good to leave it for someone to have a look at it later | 01:29 |
corvus | mnaser: yeah, jobs that depend on paused jobs request nodes from the same provider | 02:06 |
corvus | (with a bump in priority to try to speed things up) | 02:07 |
*** ysandeep is now known as ysandeep|afk | 02:21 | |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-subunit-output test: use ensure-pip https://review.opendev.org/718225 | 02:42 |
prometheanfire | ianw: have time for https://review.opendev.org/717339 ? | 02:54 |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-tox: use ensure-pip role https://review.opendev.org/717663 | 02:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819 | 02:59 |
prometheanfire | thanks | 03:04 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: nb03 : update to arm64 to inheritance, drop pip-and-virtualenv https://review.opendev.org/720641 | 03:32 |
*** ysandeep|afk is now known as ysandeep | 04:22 | |
*** kevinz has joined #opendev | 04:42 | |
*** ysandeep is now known as ysandeep|reboot | 04:49 | |
*** ykarel|away is now known as ykarel | 04:51 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 04:51 |
*** ysandeep|reboot is now known as ysandeep | 04:53 | |
*** mnasiadka has quit IRC | 05:10 | |
*** elod has quit IRC | 05:10 | |
*** mnasiadka has joined #opendev | 05:15 | |
*** elod has joined #opendev | 05:15 | |
openstackgerrit | Merged openstack/project-config master: Add ubuntu-bionic-plain to all regions https://review.opendev.org/720316 | 05:47 |
*** ysandeep is now known as ysandeep|brb | 05:49 | |
*** ysandeep|brb is now known as ysandeep | 06:12 | |
*** Romik has joined #opendev | 06:21 | |
*** Romik has quit IRC | 06:33 | |
*** Romik has joined #opendev | 07:00 | |
*** jhesketh has quit IRC | 07:04 | |
*** rpittau|afk is now known as rpittau | 07:19 | |
*** tosky has joined #opendev | 07:30 | |
*** Romik has quit IRC | 07:35 | |
*** ralonsoh has joined #opendev | 07:38 | |
*** DSpider has joined #opendev | 07:40 | |
*** ysandeep is now known as ysandeep|lunch | 07:57 | |
openstackgerrit | Merged openstack/project-config master: nodepool: Add more plain images https://review.opendev.org/720318 | 08:25 |
*** ysandeep|lunch is now known as ysandeep | 08:25 | |
*** ykarel is now known as ykarel|lunch | 08:26 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 08:45 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 08:45 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 08:45 |
AJaeger | ianw: rebased and fixed the failure ^ | 08:45 |
*** ysandeep is now known as ysandeep|afk | 09:21 | |
*** ykarel|lunch is now known as ykarel | 09:25 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations https://review.opendev.org/720673 | 09:42 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: No longer push refs/changes to GitHub mirrors https://review.opendev.org/720679 | 10:00 |
ttx | corvus, mordred, fungi: ^ as discussed | 10:01 |
*** rpittau is now known as rpittau|bbl | 10:30 | |
*** hashar has joined #opendev | 11:43 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Remove pypy job from x/surveil https://review.opendev.org/720699 | 12:08 |
*** Romik has joined #opendev | 12:13 | |
openstackgerrit | Merged openstack/project-config master: Remove pypy job from bindep https://review.opendev.org/720543 | 12:17 |
*** rpittau|bbl is now known as rpittau | 12:19 | |
hashar | hello | 12:25 |
*** Romik has quit IRC | 12:28 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:30 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:34 |
hashar | I have an interesting use case for octopus merging a couple changes | 12:34 |
hashar | CI for the jjb/jenkins-job-builder repository is broken | 12:35 |
*** ykarel is now known as ykarel|afk | 12:35 | |
*** ysandeep|afk is now known as ysandeep | 12:35 | |
hashar | err wrong repository. I mean jjb/python-jenkins | 12:36 |
hashar | the py27 job is broken due to stestr 3.0.0 which is fixed by blacklisting it ( https://review.opendev.org/719073 ) | 12:36 |
hashar | the pypy job is broken for some reason and the job is removed by https://review.opendev.org/719366 | 12:37 |
hashar | and of course, each change has a build failure because of the other change not being around | 12:37 |
hashar | I can't depend-on one or the other since that still would cause one of the builds to fail | 12:37 |
hashar | A -> B (A fails because B fix is not there) | 12:37 |
hashar | B -> A (B fails because A fix is not there) | 12:38 |
hashar | but I could create an octopus merge of A and B to the branch which should pass just fine | 12:38 |
hashar | which I could potentially CR+2 / W+1 and get submitted by Zuul. But, I guess Gerrit is not going to merge it because the parents A and B lack the proper votes ;] | 12:39 |
*** ykarel|afk is now known as ykarel | 12:39 | |
AJaeger | hashar: merge the changes together ;) | 12:43 |
hashar | ! [remote rejected] HEAD -> refs/for/master (you are not allowed to upload merges) | 12:43 |
hashar | :( | 12:43 |
hashar | yeah I will do a single change instead | 12:44 |
hashar | thx | 12:44 |
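(For reference, one way to build that single change: a rough sketch, assuming git-review is set up for jjb/python-jenkins; the patchset suffixes on the change refs are placeholders, not values from the discussion.)

```bash
# start from master and pull in both pending fixes (patchset numbers "/1" are placeholders)
git checkout -b combined-fixes origin/master
git fetch https://review.opendev.org/jjb/python-jenkins refs/changes/73/719073/1 && git cherry-pick FETCH_HEAD
git fetch https://review.opendev.org/jjb/python-jenkins refs/changes/66/719366/1 && git cherry-pick FETCH_HEAD
git rebase -i origin/master   # squash the two cherry-picks into one commit
git review                    # upload the result as a single change
```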
openstackgerrit | Merged opendev/system-config master: Install kubectl via openshift client tools https://review.opendev.org/707412 | 12:49 |
openstackgerrit | Merged opendev/system-config master: Remove snap cleanup tasks https://review.opendev.org/709293 | 12:51 |
ttx | corvus, mordred for asynchronously getting rid of remote refs/changes, looks like the following shall do the trick (assuming all repos are listed in github.list): | 12:52 |
ttx | for i in $(cat github.list); do echo $i; git push --prune ssh://git@github.com/$i refs/changes/*:refs/changes/* 2>&1 | wc -l; done | 12:53 |
ttx | the wc -l trick in there is to roughly count the deleted refs as you go. git push --prune displays those on stderr | 12:53 |
ttx | That is what I propose to run after https://review.opendev.org/720679 | 12:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 13:13 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:13 |
mordred | ttx: cool! | 13:13 |
ttx | I mean, seriously... stderr | 13:16 |
ttx | git why do you hate unix | 13:17 |
*** ykarel is now known as ykarel|afk | 13:31 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused gerrit puppet things https://review.opendev.org/714001 | 13:33 |
mordred | fungi, frickler : if you have a sec, easy review: https://review.opendev.org/#/c/720030/ | 13:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove old etherpad.openstack.org https://review.opendev.org/717492 | 13:35 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:40 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:40 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 13:40 |
mnaser | corvus: ok cool, that adds up, thanks for the info | 13:45 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Start mirroring focal https://review.opendev.org/720718 | 13:49 |
AJaeger | merci, hashar | 13:49 |
openstackgerrit | Monty Taylor proposed openstack/project-config master: Start building focal images https://review.opendev.org/720719 | 13:53 |
hashar | AJaeger: you are welcome :] | 13:54 |
mordred | corvus: looking towards using your zk roles in the nodepool test jobs I realized I need to be able to write out the correct zookeeper hosts (will need the same in the zuul jobs) ... so I tried something in 720527 - I'm not 100% sure I like it | 13:55 |
*** mlavalle has joined #opendev | 14:00 | |
frickler | mordred: clarkb: question on the pattern matching syntax in https://review.opendev.org/#/c/720030/ | 14:03 |
mordred | frickler: I'm pretty sure it's a regex match and not a glob match | 14:05 |
mordred | frickler: there's a 'playbooks/roles/letsencrypt.*' showing on that page which should get files matching all of the roles starting with letsencrypt | 14:06 |
mordred | that said - I'm not sure why we're doing .* there and just playbooks/roles/jitsi-meet/ above | 14:07 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 14:15 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 14:15 |
*** ysandeep is now known as ysandeep|away | 14:31 | |
mnaser | would it be ok if i setup a mirroring job in the vexxhost/base-jobs repo similar to the one i setup inside opendev/project-config ? | 14:51 |
mnaser | i don't see an issue but i just wanted to get the ok given it's a trusted repo | 14:51 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720740 | 14:53 |
mordred | mnaser: I don't see any issue with that | 14:53 |
mordred | corvus: ok - I rebased the nodepool patch on top of your zk patch so that I could use the zookeeper role - let's see how many things break :) | 14:56 |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations https://review.opendev.org/720673 | 15:05 |
*** ykarel|afk is now known as ykarel | 15:06 | |
*** bwensley has joined #opendev | 15:09 | |
bwensley | Hey everyone - I notice that my gerrit review notifications seem to have stopped yesterday afternoon. | 15:10 |
bwensley | Is this a known problem? | 15:10 |
AJaeger | bwensley: it works for me... | 15:15 |
AJaeger | bwensley: so, not a known problem | 15:15 |
frickler | bwensley: assuming you are talking about emails, if you DM me your address I can check mail logs | 15:16 |
bwensley | Yes - talking about email notifications. | 15:17 |
bwensley | If it is working for everyone else maybe a problem with my spam filters at my employer. | 15:18 |
frickler | infra-root: seems we are on spamhaus PBL with 104.130.246.32. fungi: IIRC you did the unblocking chant most of the time? | 15:20 |
corvus | mordred: morning! catching up on your changes now | 15:21 |
mordred | corvus: they may be a terrible idea - they were written during first coffee | 15:22 |
prometheanfire | can I get a review on https://review.opendev.org/717339 ? | 15:23 |
prometheanfire | second one that is | 15:23 |
corvus | mordred: i don't see zk stuff in 720527? | 15:24 |
corvus | where should i be looking | 15:24 |
mordred | corvus: https://review.opendev.org/#/c/720709 https://review.opendev.org/#/c/720740 - which are now parents of https://review.opendev.org/#/c/720527 | 15:25 |
corvus | ah! | 15:25 |
mordred | corvus: (I'd totally do that python module in jinja - but I'm not sure I'm good enough with jinja) | 15:26 |
corvus | mordred: well, my first TODO today is to jinja the ipv4 addresses of the zk hosts into the config file, so i should have something you can copy/paste in a minute. | 15:26 |
corvus | mordred: (the same thing is needed in the zoo.cfg file) | 15:27 |
mordred | corvus: sweet! | 15:28 |
mordred | corvus: I think the hardest thing for the nodepool case is producing the yaml list of dicts format | 15:28 |
mordred | but I'm sure we can figure that out | 15:28 |
corvus | i think it's past time to move the connection stuff into a different config file, but oh well. :( | 15:29 |
mordred | corvus: I left a note on your change with a pointer to some vars that might be useful fwiw | 15:30 |
corvus | mordred: awesome. that's step 1 of that task :) | 15:30 |
clarkb | frickler: mordred yes I believe it is a regex, see line 1349. However maybe I need to prefix with ^ to make that clear? | 15:31 |
clarkb | frickler: mordred I'm looking up zuul docs now | 15:31 |
frickler | fungi: actually I think I did send a removal request some time ago, retrying now | 15:31 |
corvus | they're always regexes | 15:31 |
clarkb | corvus: thanks! frickler see corvus' note I think my change is correct | 15:31 |
corvus | ^ will just anchor it to the start, omitting that will let it match anywhere | 15:32 |
frickler | clarkb: hmm, then you could drop the ".*" ending to be consistent with everything else, right? | 15:32 |
frickler | would be less confusing IMHO | 15:33 |
clarkb | frickler: ya I guess I can if we allow partial matching | 15:33 |
mordred | clarkb: we do. I think there are actually several .* suffixes that can all go | 15:33 |
corvus | we call "regex.match(file)" | 15:33 |
clarkb | ok I'll push up an update and look at simplifying some of the other matches in a followon | 15:34 |
corvus | oh, match says it's always at the start of the string | 15:34 |
corvus | "If zero or more characters at the beginning of string match the regular expression pattern" | 15:34 |
corvus | i think that means both ^ and trailing .* are superfluous | 15:34 |
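(A quick illustration of that point, as an editorial sketch: Zuul applies its file matchers with Python's re.match(), which already anchors at the start of the string, so a leading ^ and a trailing .* add nothing. The pattern and path below are only examples.)

```bash
# re.match() anchors at the beginning of the string; matching a prefix is enough
python3 -c "import re; print(bool(re.match('playbooks/roles/letsencrypt', 'playbooks/roles/letsencrypt/tasks/main.yaml')))"
# -> True, with no '^' prefix and no '.*' suffix needed
```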
frickler | #status log submitted and confirmed spamhaus PBL removal request for 104.130.246.32 (review01.openstack.org) | 15:35 |
openstackstatus | frickler: finished logging | 15:35 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run jobs prod test jobs when docker images update https://review.opendev.org/720030 | 15:36 |
clarkb | corvus: yup I agree | 15:36 |
corvus | mordred: given the specific task of "modify a slurped yaml nodepool config" it probably makes sense to just keep that as a module | 15:36 |
corvus | mordred: we can get rid of it when we make a "nodepool.conf" or something in the future | 15:37 |
mordred | corvus: ++ | 15:37 |
corvus | we're going to have "zookeeper-tls" to add to "zookeeper-servers" shortly | 15:37 |
mordred | corvus: assuming, of course, I can ever get that module to run | 15:37 |
corvus | mordred: yeah, i say just keep plugging at it; i don't think my tasks are going to add anything to help | 15:38 |
mordred | kk | 15:38 |
clarkb | mordred: frickler ^ there is the updated change | 15:40 |
clarkb | working on a followon now to be consistent in that file | 15:40 |
frickler | clarkb: ack, thx. | 15:42 |
* frickler heads towards the weekend now | 15:43 | |
openstackgerrit | James E. Blair proposed opendev/system-config master: Run ZK from containers https://review.opendev.org/720498 | 15:43 |
corvus | clarkb, fungi, mordred: ^ that's ready to merge, please review and +W | 15:44 |
corvus | after it lands, we can take zk* out of emergency | 15:44 |
frickler | infra-root: there seem to be umpteen bounces to review@openstack.org in the mailq on review.o.o, not sure if that's normal or whether they are due to the PBL issue. do we usually clean these up or just let them expire? | 15:46 |
clarkb | frickler: I expect its due to the PBL listing, but fungi and corvus would know better than me | 15:46 |
corvus | i think it'd be fine to just let them expire | 15:47 |
frickler | ok | 15:49 |
corvus | ttx: https://review.opendev.org/720679 lgtm i'll give fungi a bit in case he wants to review | 15:49 |
*** dpawlik has quit IRC | 15:50 | |
corvus | mordred: comment on 720709 | 15:51 |
mordred | corvus: I have learned something | 15:51 |
mordred | corvus: well - I learned your thing - but also, the fact variables I mentioned - only exist if fact gathering has happened for the zk hosts | 15:52 |
mordred | corvus: so we can either ensure a noop task has happend on the zookeeper group ... or we could use public_v6 and public_v4 from our inventory file | 15:53 |
corvus | mordred: we cache facts on bridge | 15:53 |
mordred | corvus: nod. do we in test runs? | 15:53 |
corvus | mordred: so is this just a gate problem? | 15:53 |
corvus | i ran my jinja on bridge using the real inventory and it works | 15:54 |
mordred | might be. but if we use the same ansible.cfg we should cache facts in gate too | 15:54 |
mordred | cool | 15:54 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Simplify .zuul.yaml regexes https://review.opendev.org/720759 | 15:54 |
corvus | mordred: (and that test on bridge was with a "hosts: localhost" play) | 15:54 |
mordred | I think in the gate we might need to run the zookeeper playbook first so that we'll populate the fact cache - but we need to run that ANYWAY to make the zk hosts | 15:54 |
clarkb | mordred: frickler corvus ^ thats the followon though not stacked as it had a merge conflict with master and I didn't want to update the other change again :) | 15:54 |
corvus | mordred: yeah, that sounds reasonable to rely on that as a side effect. maybe worth a comment. | 15:55 |
mordred | ++ | 15:55 |
mordred | also - in my nodepool patch I'm preferring ipv6 if it exists - is that a bad idea? | 15:56 |
corvus | clarkb: +2; i noted one innocuous change | 15:57 |
*** ykarel is now known as ykarel|away | 15:57 | |
corvus | mordred: actually | 15:57 |
* corvus wakes up | 15:57 | |
corvus | mordred: why aren't we using hostnames in nodepool.yaml? | 15:57 |
mordred | corvus: well - we are in the normal one - but hostnames won't resolve in the gate | 15:57 |
corvus | ah | 15:58 |
mordred | corvus: unless we're writing out /etc/hosts files | 15:58 |
corvus | that's lame | 15:58 |
mordred | yeah | 15:58 |
mordred | maybe we should write out /etc/hosts files? | 15:58 |
corvus | oh no | 15:58 |
corvus | i meant writing /etc/hosts is lame | 15:58 |
mordred | yeah - it's totally lame | 15:58 |
mordred | but - overall the "test nodes won't resolve in dns" is gonna be an ongoing thing probably as we do more and more of these real world multi-node things | 15:59 |
corvus | true. in which case, write /etc/hosts or template in ip addresses are both reasonable solutions | 16:00 |
corvus | templating in ip addresses does have the advantage of potentially being the same in test and prod | 16:00 |
corvus | (eg, zoo.cfg) | 16:00 |
corvus | mordred: anyway, to your question: preferring v6 sounds reasonable | 16:01 |
corvus | we can see how that ends up performing in our various clouds | 16:01 |
mordred | kk | 16:01 |
mordred | I'll stay with ips for now - and we can swing back to /etc/hosts if needed | 16:02 |
mordred | corvus: I'm going to have to squash two of those patches - since I need to run zk so that zk hosts exist :) | 16:02 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:04 |
corvus | mordred: i was looking at this spurious failure on your change: https://zuul.opendev.org/t/openstack/build/c3d52b243c4b4af5bb5c6fd3abeeea5a/log/applytest/puppetapplytest18.final.out.FAILED#62 | 16:07 |
corvus | mordred: it looks like some kind of rsync race? i wonder if one of the recent changes to how we run stuff could be affecting that? | 16:08 |
corvus | (we could just recheck it and continue to remove puppet; but i worry if we're going to start getting more errors) | 16:09 |
ttx | corvus if you +2a the replication change it could be good to keep an eye on the replication thread to see if it gets backed up -- might be a sign that refs/changes gets deleted on the push | 16:09 |
mordred | corvus: ugh yeah | 16:09 |
ttx | It should not, since it's a push without --mirror afaict | 16:09 |
mordred | corvus: I mean - part of me wants to say "recheck and keep working to remove puppet" - but I also agree, this could be an escalating issue | 16:09 |
corvus | ttx: ack | 16:09 |
ttx | but it's not superclear looking at Gerrit plugin code | 16:09 |
ttx | or it can wait Monday :) | 16:10 |
corvus | ttx: yeah, it may depend on whether fungi is around :) | 16:10 |
corvus | (or if his internet has been swept out to sea) | 16:11 |
mordred | corvus: clarkb and I talked about doing a couple of steps to clean some things up even with puppet in place ... namely, going ahead and making service-$foo playbooks and corresponding jobs - even if those playbooks right now just run puppet on a given host ... | 16:11 |
clarkb | mordred: corvus: you've both acked https://review.opendev.org/#/c/719589/ my parental home school duties will be over in about an hour and a half. Is that a good time for you all to land that? | 16:11 |
mordred | corvus: and if we do that, I think we could decently change any puppet tests we have into testinfra tests - and then just drop the puppet-specific tests altogether | 16:11 |
mordred | clarkb: wfm | 16:11 |
corvus | mordred: yeah, that's a good idea -- running the playbook means we can drop the applytest (it's better than an apply test) | 16:12 |
mordred | corvus: because "run all of the puppet" every time we touch an ansible file is a bit of a waste | 16:12 |
mordred | ++ | 16:12 |
mordred | I think I might put that fairly soonish on my list | 16:12 |
corvus | clarkb: did we figure out about restarting services? | 16:12 |
mordred | because that would also allow us to move to the opendev tenant | 16:12 |
mordred | (since the blocker right now is the legacy base jobs in ozj - which we use in the puppet tests) | 16:13 |
corvus | mordred: which will speed everything up :) | 16:13 |
clarkb | corvus: we expect it will restart processes. Gerrit should be fine because we don't docker-compose up it during normal runs. | 16:13 |
mordred | ++ | 16:13 |
corvus | clarkb: cool, wfm | 16:13 |
clarkb | corvus: services like zuul preview, docker registry, gitea, nodepool-builder will restart | 16:13 |
mordred | and once the compose change is in - we should do a controlled restart of gerrit - because we have a change we need to pick up | 16:14 |
clarkb | gitea should be ok because we do one at a time. Though we'll want to replicate to them afterwards to avoid any missed refs | 16:14 |
clarkb | (I can do that) | 16:14 |
mordred | clarkb: didn't we land your update to safely restart gitea? | 16:14 |
mordred | (so that we do it in the right order?) | 16:14 |
clarkb | mordred: oh we did, and that might cause this to not actually restart gitea | 16:14 |
clarkb | because we check for new images and otherwise don't issue the commands | 16:15 |
clarkb | so we should manually restart things if there isn't a new image coincident with this update | 16:15 |
mordred | nod. and next time we have new images, the restart should still do the right thing | 16:15 |
clarkb | (I can also do that) | 16:15 |
mordred | yeah | 16:15 |
mordred | well - we DO have a new image we could roll out | 16:15 |
mordred | https://review.opendev.org/#/c/720202/ <-- | 16:15 |
mordred | we could land that after the docker-compose patch | 16:15 |
mordred | and that should trigger a gitea rollout | 16:16 |
clarkb | ++ lets do it that way | 16:16 |
mordred | good exercise of our machinery | 16:16 |
*** rpittau is now known as rpittau|afk | 16:17 | |
mordred | corvus: you still have -2 on your zk change - but clarkb and I both +2'd it | 16:22 |
corvus | mordred: ah thanks! :) | 16:23 |
*** mlavalle has quit IRC | 16:34 | |
mordred | corvus, clarkb: I pushed up two changes this morning unrelated to this - https://review.opendev.org/#/c/720718/ and https://review.opendev.org/#/c/720719/ - to start mirroring and building images of focal, since that's being released next week | 16:36 |
mordred | corvus: and speaking of - when we roll out new ze*.opendev.org servers after the ansible rollout - perhaps we should consider jumping straight to focal instead of bionic so that we don't have to think about them for a while | 16:39 |
corvus | mordred: ++ | 16:40 |
mordred | focal is defaulting to python 3.8 - so if we did that and then bumped to the 3.8 python-base in our image builds, we'd be on the same python across the install | 16:40 |
*** kevinz has quit IRC | 16:40 | |
corvus | hopefully afs works | 16:40 |
mordred | yeah. that'll be the first question | 16:40 |
*** mlavalle has joined #opendev | 16:43 | |
fungi | frickler: yeah, the pbl rejection messages should mention the url for more info, which will get you eventually to the delisting page, and i usually use the infra-root shared mailbox to do the verification message. i can take care of it in a minute if nobody has gotten to it yet | 16:44 |
openstackgerrit | Merged opendev/system-config master: Simplify .zuul.yaml regexes https://review.opendev.org/720759 | 16:45 |
fungi | looks like you got it though | 16:45 |
fungi | and sorry for the delay, looking over 720498 now | 16:45 |
fungi | on the replication change, did we ever disable the live replication config update "feature"? | 16:49 |
fungi | i think i had a change up some time ago to revert it | 16:49 |
fungi | looking | 16:49 |
openstackgerrit | Merged opendev/system-config master: Run ZK from containers https://review.opendev.org/720498 | 16:49 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:52 |
fungi | okay, yeah, that was https://review.opendev.org/691452 and it merged ~3 months ago | 16:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:54 |
mordred | corvus: ^^ those were basically green last time - except for one testinfra thing. I pushed up a fix for that, but then had to rebase because of the .zuul.yaml and the newer zk patch | 16:54 |
mordred | so the most recent ps is just the rebase | 16:55 |
mordred | corvus: also - check it: https://zuul.opendev.org/t/openstack/build/24f76cf23d9942ac9d015fba4d402ec2/log/nb04.opendev.org/nodepool.yaml#626-628 | 16:57 |
mordred | corvus: (the file itself now looks awful because of slurp|from_yaml|to_yaml - but I think we can live with that until we get a nodepool.conf) | 16:57 |
corvus | mordred: heh, it's readable enough :) | 17:00 |
corvus | clarkb: +3 https://review.opendev.org/720095 ? | 17:01 |
clarkb | corvus: do you also need to update the .env file? | 17:03 |
clarkb | I seem to recall that one having the etherpad url in it too | 17:03 |
clarkb | corvus: I've approved it and can update .env if necessary in a new change | 17:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 17:04 |
corvus | clarkb: yeah, it is in there, but i think it's only used to generate the config file that we manually install; maybe i'll just remove it in a followup.... | 17:05 |
corvus | er, you know what i mean by manually -- ansible installs it | 17:06 |
corvus | i'm manually running the playbook against zk01 | 17:08 |
mordred | corvus: any idea on how to do this: | 17:09 |
mordred | hosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888,{% endfor %} | 17:09 |
mordred | but without the trailing , that'll be there? | 17:09 |
corvus | mordred: yeah, there's some loop variables... 1 sec | 17:10 |
mordred | ah - found it | 17:11 |
mordred | loop.last | 17:11 |
corvus | ++ | 17:11 |
corvus | table of variables: https://jinja.palletsprojects.com/en/2.11.x/templates/#for | 17:11 |
mordred | hosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888{% if not loop.last %},{% endif %}{% endfor %} | 17:11 |
corvus | lgtm | 17:11 |
corvus | running playbook against zk02 | 17:13 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 17:15 |
mordred | corvus: that might work ^^ ... also - inverse of the nodepool ones - I pushed up a rebase that's just a rebase, then that last patch did the fixes needed | 17:15 |
clarkb | mordred: corvus re /etc/hosts I think our multinode role sets that up for you | 17:16 |
corvus | yeah, so we could use that role (or that part of that role) if we wanted to go that way | 17:16 |
mordred | does multinode do anything extra that might conflict with the things we're trying to test with system-config-run jobs? | 17:17 |
corvus | but i was thinking about it further, and it's still not a slam dunk for this use case -- we don't want to stand up a full cluster, we only want one node, so writing out the config is still desirable | 17:17 |
mordred | yeah | 17:17 |
corvus | running the playbook on 03 now | 17:18 |
mordred | although we could just join groups['zookeeper'] instead of needing to do the extra loop to find the ip address from hostvars | 17:18 |
mordred | corvus: cool | 17:18 |
corvus | ya | 17:18 |
* mordred could go either way | 17:18 | |
*** hashar has quit IRC | 17:18 | |
corvus | i'm seeing a bunch of client errors now | 17:19 |
mordred | clarkb: we could just use role multi-node-hosts-file | 17:19 |
mordred | it is nicely split out into its own role :) | 17:19 |
corvus | infra-root: heads up -- i think the zk cluster is in a bad state | 17:20 |
mordred | corvus: uhoh | 17:20 |
mordred | corvus: should we switch to opendev-meeting? | 17:20 |
fungi | at least it's friday? ;) | 17:20 |
clarkb | corvus: logs look like yseterday | 17:20 |
corvus | i'll stop zk03 | 17:21 |
corvus | that did not improve things | 17:22 |
corvus | i'll restart everything? | 17:22 |
clarkb | I think that is what helped last time? | 17:23 |
fungi | seemed like it anyway | 17:24 |
corvus | looks happier | 17:24 |
corvus | i am less than satisfied with this | 17:24 |
corvus | that should have been a straightforward rolling restart | 17:24 |
mordred | yeah | 17:24 |
mordred | corvus: should we try another rolling restart to see how it goes? | 17:25 |
corvus | maybe -- though i wonder if we need the dynamic config file | 17:25 |
clarkb | we have done rolling restarts of the ubuntu packaged zk successfully in the past (I think ianw did one in the last couple weeks too) | 17:25 |
corvus | that was 3.4.8 iirc | 17:25 |
corvus | (we do need 3.5.x for tls) | 17:25 |
mordred | corvus: are you thinking that maybe when a node leaves the cluster zk is updating the dynamicConfig? | 17:25 |
corvus | mordred: yeah | 17:26 |
corvus | i'm still fuzzy on how "optional" it is | 17:26 |
mordred | I really wish people wouldn't write server software that writes things to its config files | 17:26 |
corvus | i might be able to simulate this locally | 17:26 |
corvus | that's probably the place to start | 17:26 |
mordred | https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#sc_reconfig_file | 17:27 |
corvus | yep that thing | 17:27 |
mordred | yeah - my reading of that tells me that it's going to write server values to the file | 17:27 |
mordred | when servers come and go | 17:27 |
corvus | mordred: but what happens if you don't include the client port number at the end? | 17:28 |
corvus | see "Backward compatibility" | 17:28 |
corvus | and if we don't "invoke a reconfiguration" that "sets the client port" | 17:28 |
corvus | (i don't know whether we're inadvertently doing that or not when we restart a server) | 17:29 |
corvus | all of that to say, in my mind, there's a decision tree with at least two unresolved nodes determining whether any config files get (re-)written | 17:29 |
corvus | cluster configuration by quantum superposition | 17:29 |
mordred | well - I think the file is going to get transformed regardless of port | 17:30 |
mordred | corvus: I agree - we need to just simulate locally | 17:30 |
clarkb | have we determined if its the actual config file or not? | 17:30 |
mordred | there's no way we're going to reason through it | 17:30 |
clarkb | or if we set a separate path it will write to a separate file? | 17:30 |
clarkb | (note about now is when I'm able to monitor the docker-compose thing but will wait until we are in a happy place with zk) | 17:31 |
mordred | I believe it wants 2 files in all cases - if we put things in the single file, it will helpfully pull out the servers and put them into the second file | 17:31 |
corvus | mordred: see the text under 'example 2' for the bit about how whether a port is there or not affects whether it writes the dynamic file | 17:31 |
mordred | yeah - that's a good point | 17:32 |
corvus | mordred: i agree that there's no way we'll reason about it | 17:32 |
mordred | also - assuming that we want to implement their "recommended" way of doing things | 17:32 |
corvus | mordred: i'm not ready to endorse any conclusions... | 17:32 |
mordred | what a PITA from a config mgmt pov | 17:32 |
corvus | so far we have not seen it rewrite the main config file when we did not configure a dynamic config file path | 17:33 |
corvus | that's the only thing we know :) | 17:33 |
mordred | \o/ | 17:33 |
corvus | i think the best thing to do is for me to go into a hole and set up a 3 node local cluster and try to replicate the problem | 17:33 |
corvus | then start changing variables | 17:33 |
mordred | I mean, in their "preferred" approach - as long as all three nodes are up and running when we run ansible it should be a no-op - but doing a rolling restart at the same time ansible tries to write a config would be potentially highly yuck | 17:34 |
mordred | corvus: ++ | 17:34 |
* mordred supports a corvus hole | 17:34 | |
fungi | the discussions i linked yesterday for the zookeeper operator indicated that zookeeper wants config write access even if told to use a static config | 17:34 |
clarkb | ok should I hold off on docker-compose things or are we reasonably happy with the state here? I ask because those zk nodes are using docker-compose now and should noop but may not? | 17:34 |
clarkb | I'm like 98% confident the docker-compose upgrade will nop zk | 17:34 |
mordred | clarkb: I am fairly confident your change will noop the zk nodes | 17:34 |
mordred | yeah - because zk is already using pip -so it should be a no-op compose up | 17:35 |
corvus | clarkb: yeah, i think it's worth the risk. i would just stand by to do a full 'docker-compose down' 'docker-compose up -d' if it's not a noop | 17:35 |
mordred | ++ | 17:35 |
clarkb | ok I'm going to hit approve now then | 17:35 |
corvus | okay, i'll probably be away for a few hours; exercise and then into the debugging hole | 17:35 |
mordred | fungi: has anyone in discussions you've read complained loudly about the config writing choices? | 17:35 |
mordred | because if they haven't I might want to | 17:36 |
fungi | mordred: they seemed resigned to their unfortunate fates | 17:36 |
mordred | sigh | 17:36 |
fungi | someone probably should bring it up with the zk maintainers. though i assume multiple someones have and i've just not found record of those conversations | 17:36 |
clarkb | why have a separate dynamic config file option if the "static" one needs writing too | 17:37 |
fungi | though that one issue i linked in turn linked to the bits of the zk source where the write decision is made | 17:37 |
clarkb | (that seems like a reasonable argument to make to them if this is the case) | 17:37 |
* fungi finds again | 17:37 | |
mordred | I mean - ultimately I'm guessing that we're not going to win and will have to also resign ourselves to our unfortunate fates | 17:38 |
mordred | but it's one of those decisions that makes running a service with automation harder | 17:38 |
fungi | https://github.com/pravega/zookeeper-operator/issues/66#issuecomment-501191586 | 17:38 |
fungi | "It needs to be able to create a new dynamic configuration file and update the static configuration file to point to the latest configuration (that's for restarts of the server)." | 17:39 |
fungi | so basically the static configuration file isn't entirely static, it just contains (some) static configuration | 17:39 |
clarkb | fungi: mordred that code chunk seems like its tracking the dynamic config in the static config | 17:40 |
mordred | yeah - it seems that the one write operation they want to make is to remove the dynamic config | 17:40 |
clarkb | I wonder if the issue goes away entirely if we simply set a dynamic config path | 17:40 |
mordred | clarkb: needEraseClientInfoFromStaticConfig() | 17:40 |
mordred | I'm fairly certain if we set a dynamicConfigFile path and also remove servers from our static config that zk will not touch our static config and will update the member list in the dynamic config as needed | 17:41 |
clarkb | https://github.com/apache/zookeeper/blob/3aa922c5737c9ef0879f290181cb281261c965e0/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java#L591-L599 is that function | 17:41 |
clarkb | looks like it will simply remove the dynamicConfigFile entry | 17:42 |
clarkb | oh and then it appends dynamicConfigFile to the end | 17:43 |
mordred | yeah | 17:43 |
mordred | but only if it needs to erase stuff from the static | 17:44 |
clarkb | so if we can remove those keys and ensure dynamicConfigFile is set at the end we may avoid problems. I'm not sure we can remove clientPort though | 17:44 |
mordred | why not? we can set it on the end of each server line, no? | 17:44 |
clarkb | mordred: just because I haven't read enough docs yet | 17:44 |
mordred | yeah - there's a form that allows you to append to each line | 17:44 |
clarkb | oh but the server line is also checked | 17:44 |
clarkb | https://github.com/apache/zookeeper/blob/3aa922c5737c9ef0879f290181cb281261c965e0/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java#L640 they rewrite everything back out again there ? | 17:45 |
mordred | yeah. which is why those lines go into the dynamic file | 17:45 |
clarkb | well thats all the static file there in that function | 17:45 |
clarkb | I'm basically trying to figure out if there is a form we can write that will make zk not try and change it | 17:46 |
clarkb | dynamicConfigFile needs to be the very last key is about as far as I've gotten | 17:46 |
mordred | yeah - but it only does editStaticConfig if you had dynamic config in the static file in the first place | 17:46 |
clarkb | mordred: yes but it writes it back out again | 17:46 |
mordred | but only if it had to edit it | 17:46 |
clarkb | we can't stop the writing from happening | 17:46 |
mordred | I think we can | 17:46 |
clarkb | but if ansible and zk write the same thing its fine | 17:46 |
clarkb | "fine" | 17:46 |
mordred | I think if we don't put the dynamic info into the static file ever | 17:47 |
mordred | then ansible will not touch the static file | 17:47 |
mordred | we'll still need to write the dynamic file - and zk will also write to that | 17:47 |
clarkb | mordred: ansible is writing the static conf | 17:47 |
mordred | yes, I understand | 17:47 |
mordred | but what I'm saying is that if we restructure the file | 17:47 |
mordred | and stop putting the server list in it | 17:47 |
mordred | that zk will not desire to rewrite that file | 17:47 |
clarkb | how do we tell it what servers are in the cluster? | 17:48 |
mordred | if we only have ansible write the server list into the dynamic file | 17:48 |
mordred | and we also have ansible only write that file if it doesn't exist | 17:48 |
clarkb | ok that last bit is what I was missing | 17:48 |
mordred | because once we've written it the first time it's owned by zk - so if we try to write it out during a rolling restart, things will have sads | 17:48 |
mordred | because we'll be fighting zk - but by and large we'd only need to write to that file if we were changing the list of members - and that would be a big thing anyway | 17:49 |
mordred | in any case - corvus is going to go into a hole and verify these suppositions :) | 17:49 |
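(To make that supposition concrete: a minimal sketch, not verified against the actual roles, of the layout being discussed - keep the server membership out of the static zoo.cfg, point dynamicConfigFile at a second file, write that file only if it does not already exist, and let zookeeper own it afterwards. Hostnames, paths and ports below are illustrative.)

```bash
# static config: no server.N lines, so zk 3.5 has no reason to rewrite it
cat > /etc/zookeeper/zoo.cfg <<'EOF'
dataDir=/var/zookeeper
tickTime=2000
initLimit=10
syncLimit=5
dynamicConfigFile=/etc/zookeeper/zoo.cfg.dynamic
EOF

# dynamic config: membership plus client port, written once and then owned by zk
if [ ! -f /etc/zookeeper/zoo.cfg.dynamic ]; then
  cat > /etc/zookeeper/zoo.cfg.dynamic <<'EOF'
server.1=zk01.opendev.org:2888:3888:participant;2181
server.2=zk02.opendev.org:2888:3888:participant;2181
server.3=zk03.opendev.org:2888:3888:participant;2181
EOF
fi
```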
mordred | clarkb: https://review.opendev.org/#/c/720718/ if you're bored | 17:50 |
*** ralonsoh has quit IRC | 17:53 | |
clarkb | mordred: check comment for things | 17:53 |
fungi | i hope corvus brings a torch, we don't need him getting eaten by a grue | 17:53 |
mordred | clarkb: oh - that's a good point | 17:54 |
mordred | fungi: do you happen to know the answer to clarkb's comment on 720718 ? | 17:55 |
clarkb | mordred: I'm looking I think only the things on mirror-update.opendev.org use the new ssh'd vos release | 17:57 |
clarkb | mordred: and we've only moved the rsynced things over (since that is ansible managed and setting up reprepro is "involved") | 17:57 |
clarkb | mordred: so I think what you need to do for your change is either update mirror-update.openstack.org to use the same ssh thing, move reprepro to mirror-update.opendev.org and have it ssh, or hold the lock, run reprepro yourself without a vos release, then vos release on the afs server afterwards, then release the lock | 17:58 |
fungi | lookin' | 18:02 |
clarkb | also we removed all trusty nodes/jobs right? | 18:03 |
clarkb | I think maybe instead of bumping quota we want to delete trusty first (also should be manual due to sync cost) | 18:03 |
clarkb | AJaeger: ^ pretty sure you drove that for us and it is all complete now right? (trusty test node removal) | 18:04 |
fungi | mordred: yeah, i left a comment on 720718 just now but it basically repeats what clarkb just said | 18:04 |
mordred | nod. so yeah - trusty removal first seems like the right choice | 18:08 |
mordred | or - maybe what we want is to replace trusty with focal in the file | 18:09 |
mordred | and then do a single sync | 18:09 |
AJaeger | clarkb: yes, I think we're fine, let me double check quickly | 18:09 |
clarkb | mordred: you might have write errors if you do that since reprepro deletes after downloading iirc | 18:09 |
clarkb | mordred: could temporarily bump quota to handle that | 18:10 |
clarkb | that might be the quickest option actually since you bundle the big syncs into one sync | 18:10 |
AJaeger | yes, trusty should be gone. There's still a bit in system-config (sorry, did not read backscroll) but that's all | 18:12 |
clarkb | AJaeger: ya we have ~3 nodes on it still but we pulled out testing of it so we don't need the afs mirror anymore. Thank you for checking | 18:12 |
mordred | clarkb: yeah - so we might still want to do the reprepro config as two patches - but bundle it with a single vos release | 18:14 |
mordred | clarkb: oh - or yeah, bump quota for a minute | 18:14 |
mordred | oh wow | 18:17 |
mordred | clarkb: context switching back to puppet real quick ... | 18:18 |
mordred | clarkb: puppet-beaker-rspec-puppet-4-infra-system-config is mostly testing things that are done in ansible | 18:18 |
mordred | clarkb: so - I think it's pretty much useless at this point | 18:18 |
mordred | the only testing it's doing is the stuff that's defined in modules/openstack_project/spec/acceptance/basic_spec.rb | 18:19 |
clarkb | mordred: I want to say that may be an integration job too | 18:19 |
mordred | which is basically testing that users we set up in ansible are there | 18:19 |
clarkb | mordred: so it runs against puppet-foo rspec too ? | 18:19 |
clarkb | when we update puppet-foo | 18:19 |
clarkb | so its possible we don't need the job on system-config anymore but may not be ready to delete the job itself? | 18:19 |
clarkb | (double check me on that) | 18:19 |
mordred | clarkb: nope | 18:21 |
mordred | clarkb: or - rather - yes - we don't need the job on system-config | 18:21 |
mordred | we run puppet-beaker-rspec-puppet-4-infra on puppet-foo changes | 18:21 |
clarkb | got it | 18:22 |
mordred | so - I think we can remove puppet-beaker-rspec-puppet-4-infra-system-config now | 18:23 |
mordred | and then when I do the change to split remote_puppet_else into service-foo playbooks - that can replace the puppet apply job | 18:23 |
mordred | and similarly, each one of those jobs can be used in the puppet-foo modules as appropriate | 18:23 |
mordred | and we can get rid of all of the rspec jobs | 18:23 |
mordred | and life will be much better | 18:24 |
clarkb | ya the puppet apply job also only does a puppet noop apply | 18:27 |
clarkb | so if we can actually run puppet it will be an improvement :) | 18:27 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove puppet-beaker-rspec-puppet-4-infra-system-config https://review.opendev.org/720799 | 18:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp https://review.opendev.org/720800 | 18:29 |
mordred | clarkb: two easy-ish cleanups to prep for that ^^ | 18:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused rspec tests https://review.opendev.org/720802 | 18:30 |
mordred | and a third | 18:30 |
clarkb | mordred: oh heh your third change addresses my note in first one | 18:32 |
clarkb | mordred: the second needs work though (comment inline) | 18:32 |
mordred | cool - thanks! | 18:33 |
clarkb | change for docker-compose update is waiting on nodes. I should have plenty of time to pop out for a few minutes as a result. Back soon | 18:34 |
clarkb | (the gitea job isn't incredibly quick) | 18:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp https://review.opendev.org/720800 | 18:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused rspec tests https://review.opendev.org/720802 | 18:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Start mirroring focal, stop mirroring trusty https://review.opendev.org/720718 | 18:40 |
clarkb | 22 minutes for that change to land give or take | 18:54 |
mordred | fungi: I think we can go ahead and land https://review.opendev.org/#/c/720679/ - we need to do a gerrit restart to pick up the local replication volume anyway | 18:55 |
mordred | so it would be nice to bundle the restart and get both things | 18:55 |
mordred | (because of this: https://review.opendev.org/#/c/720225/) | 18:56 |
clarkb | mordred: that can also transition the container name for us after docker-compose lands | 18:56 |
mordred | yup | 18:56 |
mordred | so I think we land 720679 - then docker-compose lands - then when we're happy we do a docker compose restart on review | 18:57 |
mordred | and we're in pretty good shape | 18:57 |
mordred | oh - we need to land https://review.opendev.org/#/c/719051/ too | 18:57 |
mordred | clarkb: any reason to hold off on the +A for that one? | 18:57 |
mordred | or do we want to wait? | 18:58 |
clarkb | mordred: I don't think so | 18:58 |
clarkb | it was just in holding pattern on the docker-compose upgrade | 18:58 |
mordred | cool. I'm gonna go ahead and poke it | 18:58 |
fungi | mordred: sounds good to me then | 18:58 |
fungi | i mainly didn't want to inadvertently complicate anything else we've got going on | 18:58 |
fungi | trying not to cross the streams too much | 18:59 |
mordred | fungi: ++ | 19:11 |
openstackgerrit | Merged opendev/system-config master: Install docker-compose from pypi https://review.opendev.org/719589 | 19:11 |
mordred | clarkb: there we go | 19:12 |
clarkb | mordred: and now we watch the deploy jobs ya? | 19:13 |
mordred | yup | 19:13 |
clarkb | hrm you know what just occurred to me does uninstalling packaged docker-compose do something we don't want like stopping the containers too :/ | 19:15 |
clarkb | testing seemed to show that it didn't because it was the docker-compose-up that happened later that restarted the containers | 19:15 |
clarkb | I'm just being paranoid now | 19:15 |
clarkb | gitea-lb seems to have gone well | 19:15 |
mordred | clarkb: yeah - I don't think it does | 19:16 |
mordred | it's just a python program that does things with docker api | 19:16 |
clarkb | mordred: good point | 19:16 |
clarkb | so ya uninstalling docker may do that but not docker-compose | 19:16 |
clarkb | in any case opendev.org is still up and the gitea-lb.yaml log looks as I expected it | 19:17 |
clarkb | first one lgtm | 19:17 |
clarkb | service nodepool job failed. Not sure why yet | 19:19 |
clarkb | Unable to find any of pip3 to use. pip needs to be installed. | 19:20 |
clarkb | that was unexpected | 19:20 |
clarkb | on nb04 | 19:20 |
clarkb | mordred: ^ do you know why servers like gitea-lb which are bionic would have pip installed but not nb04, which is also bionic? | 19:21 |
clarkb | also this is a gap in our testing because our test images have pip and friends preinstalled | 19:21 |
clarkb | I think what we may end up seeing here is that newer hosts fail on this error and older hosts are fine | 19:22 |
clarkb | and yes I've confirmed uninstalling docker-compose does not stop containers because nb04 and etherpad are in that state | 19:23 |
prometheanfire | mordred: mind taking a look at https://review.opendev.org/717339 ? | 19:24 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 19:26 |
fungi | clarkb: yeah, odds are our server images don't have the python3-pip package installed | 19:26 |
clarkb | fungi: ya but why would gitea-lb have it ? different image maybe | 19:26 |
clarkb | in any case infra-root I think 720820 fixes this problem. Note that we currently don't have docker-compose installed on hosts where this failed. But the existing docker compose'd containers are running | 19:27 |
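(For reference, the manual equivalent of what 720820 is meant to ensure - a hedged sketch, assuming a Debian/Ubuntu host and root access; the actual fix lives in the ansible role:)

```bash
# make sure pip3 exists before pip-installing docker-compose (run as root)
apt-get update && apt-get install -y python3-pip
pip3 install docker-compose
docker-compose version   # sanity check
```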
fungi | we deployed that in vexxhost right? | 19:27 |
clarkb | fungi: oh ya good point | 19:27 |
fungi | so we probably uploaded a nodepool-built image | 19:27 |
clarkb | if we need to emergency docker compose things before the fix above lands we can reinstall the distro docker-compose | 19:27 |
mordred | clarkb: uhm. weird. | 19:27 |
mordred | clarkb: yeah - I thought pip3 was everywhere - but clearly I was wrong - and our images having that on them sure did mask this didn't it? | 19:28 |
clarkb | mordred: yup | 19:28 |
clarkb | mordred: fwiw meetpad job returned success but it didn't seem to update containers there | 19:28 |
clarkb | "no hosts matched" ok that explains that one | 19:29 |
mordred | PLAY [Configure meetpad] ******************************************************* | 19:29 |
mordred | skipping: no hosts matched | 19:29 |
mordred | yeah | 19:29 |
clarkb | zk was success and that should've nooped. Checking now | 19:29 |
mordred | clarkb: oh - is meetpad in emergency? | 19:30 |
clarkb | mordred: it must be | 19:30 |
clarkb | zk looks good | 19:30 |
mordred | yup | 19:30 |
clarkb | so far only the pip issue | 19:30 |
mordred | cool! | 19:30 |
clarkb | nb04, etherpad.opendev, docker registry, and zuul-preview all failed on the pip3 missing thing. gitea-lb succeeded as did the zookeeper hosts. I expect review, review-dev, and gitea to all succeed as they are older and/or on vexxhost | 19:33 |
openstackgerrit | Merged openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin https://review.opendev.org/720235 | 19:33 |
fungi | those ^ get applied from promote pipeline jobs now, right? | 19:33 |
clarkb | fungi: deploy pipeline | 19:34 |
fungi | oh, right! | 19:34 |
fungi | i forgot we added a separate pipeline for that | 19:34 |
clarkb | mordred: hrm does manage-projects use docker-compose in a way that may pose a problem here? | 19:34 |
clarkb | the gerrit ACLs change has queued up the manage-projects job | 19:35 |
fungi | yep, i see that. cool | 19:35 |
clarkb | ok we use docker run not docker-compose for manage projects so that should be fine | 19:36 |
clarkb | it won't try to use the wrong container name | 19:36 |
clarkb | if we did docker exec or docker-compose for manage-projects that could be different | 19:36 |
clarkb | 720820 exposes that we don't run docker role consuming jobs on docker role updates. Thats another job fix I should figure out | 19:38 |
clarkb | infra-root once gitea runs and shows gitea01 (it should be first) is happy I'm going to work on lunch while waiting for the fix to get tested and reviewed | 19:39 |
clarkb | if you need to make changes to the fix or take different direction feel free | 19:39 |
clarkb | but then because the fix is in the docker role and our jobs may not be set to trigger off that role updating we may need to run the playbooks for these services manually: | 19:39 |
mordred | clarkb: (we should add the pip3 role to things that have files depends on the install-docker role now too) | 19:40 |
clarkb | service-nodepool.yaml, service-etherpad.yaml, service-meetpad.yaml (needs to be removed from emergency or we can wait on this one), service-registry.yaml, service-zuul-preview.yaml | 19:40 |
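If the manual route were needed, running one of those playbooks from bridge would look roughly like this (a sketch only; the checkout path and exact invocation are assumptions, not taken from the log):

```shell
# hypothetical example for one of the playbooks listed above
cd /opt/system-config
ansible-playbook playbooks/service-nodepool.yaml
```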
clarkb | mordred: ++ so we need to do the docker role and the pip3 role | 19:40 |
mordred | yeah | 19:40 |
* mordred will make a patch | 19:41 | |
clarkb | thanks! | 19:41 |
clarkb | does bridge unping for anyone else? | 19:42 |
clarkb | I can't ping or ssh to it and my existing ssh connection seems to have gone away? | 19:42 |
clarkb | and now it reconnects that was weird | 19:43 |
clarkb | uptime shows it didn't reboot | 19:43 |
clarkb | and we didn't OOM | 19:43 |
clarkb | "msg": "Timeout (32s) waiting for privilege escalation prompt: " <- review-dev failed on that | 19:44 |
clarkb | possibly due to the same network connectivity issue? | 19:44 |
clarkb | https://gitea01.opendev.org:3000/zuul/zuul is running the new containers and is happy | 19:45 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 19:45 |
clarkb | so I think review-dev and meetpad were the odd ones. review-dev due to networking to bridge going away? and meetpad due to being in emergency. All the other failures need pip3 to be installed | 19:46 |
mordred | clarkb: woot | 19:46 |
clarkb | gitea, gitea-lb, review, and zk are all happy | 19:46 |
clarkb | ok I think things are stable so I'm finding lunch now. Holler if that assumption is bad :) | 19:48 |
fungi | Timeout exception waiting for the logger. Please check connectivity to [bridge.openstack.org:19885] | 19:48 |
clarkb | fungi: thats normal because we don't run the zuul log streamer on bridge | 19:49 |
fungi | seen in a infra-prod-service-gitea run | 19:49 |
fungi | got it | 19:49 |
clarkb | fungi: if you want to see the logs you need to go to bridge /var/log/ansible/service-$playbook.yaml file | 19:49 |
fungi | so those are expected | 19:49 |
clarkb | yup | 19:49 |
clarkb | service-gitea.yaml.log for gitea | 19:49 |
openstackgerrit | Merged opendev/system-config master: Use HUP to stop gerrit in docker-compose https://review.opendev.org/719051 | 19:49 |
clarkb | I was tailing it earlier when I confirmed gitea01 was done and happy | 19:49 |
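Concretely, following one of those deploy logs is just a tail on bridge, using the filename pattern described above:

```shell
# on bridge: watch the gitea service playbook's output as it runs
tail -f /var/log/ansible/service-gitea.yaml.log
```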
openstackgerrit | Merged opendev/system-config master: No longer push refs/changes to GitHub mirrors https://review.opendev.org/720679 | 19:50 |
mordred | after those run ^^ we'll be good to restart gerrit | 19:51 |
AJaeger | infra-root, this inap graph looks really odd http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=1587131505313&to=1587153105313&var-region=All&panelId=8&fullscreen | 19:51 |
clarkb | corvus: I know you are heads down in other things, but are you good for us to remove meetpad from the emergency file? | 19:51 |
clarkb | AJaeger: ya its because nova isn't deleting instances there reliably | 19:52 |
clarkb | AJaeger: if you expand it to go back 2 days you'll see it happening more often | 19:52 |
clarkb | ok really finding lunch now. Back soon :) | 19:53 |
AJaeger | thanks, clarkb - enjoy lunch! | 19:53 |
corvus | clarkb: yes can remove meetpad | 19:55 |
corvus | clarkb, mordred: should i read scrollback or skip it? | 19:56 |
corvus | clarkb, mordred, fungi: i believe i have created a reasonable local facsimile of our prod env -- same ownership and volume structure, etc. i'm seeing the same errors about dynamic config, etc. i wrote a test script to continually write data to zk to simulate the cluster continuing to handle requests when one member leaves. i have yet to see it fail when i do a rolling restart. i've done several. | 19:58 |
fungi | corvus: there was some discussion about the bits of the zk source around the function writing to the "static" config but probably no new insights | 19:58 |
mordred | corvus: well that's not thrilling | 19:58 |
mordred | corvus: yeah - I think we mostly just looked at the source and then pondered - but ultimately concluded "corvus will figure out reality" | 19:59 |
corvus | my assumption for the moment is that whatever is causing the stale session issues is not related to the dynamic config | 19:59 |
corvus | i'm starting to wonder if it's a client issue | 19:59 |
corvus | i made sure to use the same kazoo version, under py3, that we're using on nl01 | 20:00 |
corvus | but maybe i should spot check that elsewhere -- maybe it's, say, only the scheduler that's hitting that problem | 20:00 |
fungi | and i guess we ended up with newer kazoo in the containers? | 20:00 |
clarkb | corvus: we hit a speedbump on the docker compose thing. not all servers have pip installed. for the servers that did update docker-compose everything is happy | 20:01 |
corvus | fungi: at the moment, the only zuul component running in containers is nb04 | 20:01 |
clarkb | fix for pip has been approved and will retrigger jobs (or manually run playbooks) once it lands | 20:01 |
mordred | clarkb: https://review.opendev.org/#/c/720821 is the followup with the file trigger updates | 20:01 |
corvus | clarkb: drat. i'm still sad we have to install pip :( | 20:01 |
corvus | oh, speaking of nb04 -- this happens when i try to exec: | 20:02 |
corvus | root@nb04:/var/log/nodepool# docker exec -it nodepoolbuildercompose_nodepool-builder_1 /bin/sh | 20:02 |
corvus | OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "open /dev/ptmx: no such file or directory": unknown | 20:02 |
mordred | corvus: oh - that's ... what? | 20:02 |
corvus | yeah, you can imagine my delight at having a system component turn into a black box i can't access | 20:02 |
clarkb | drop the -it maybe? | 20:03 |
clarkb | cant really shell in that case | 20:04 |
corvus | yeah, it was really the interactive shell i was after | 20:04 |
mordred | corvus: https://github.com/docker/cli/issues/2067 | 20:05 |
mordred | no solution | 20:05 |
corvus | i wonder if dib mucked it up? | 20:05 |
fungi | oh, yeah, i guess kazoo hasn't changed... has the version of zk we're deploying in the containers changed? and you're theorizing that the older kazoo has issues with newer zk? | 20:05 |
corvus | fungi: i've yet to find a version of kazoo in use other than 2.7.0, but i'm still looking. we have definitely upgraded zk. | 20:06 |
fungi | got it | 20:07 |
corvus | mordred: and of course the 'workaround' in that report doesn't work for 'exec', only for 'run' | 20:08 |
corvus | 2.7.0 is the newest kazoo, so i'll just assume that's what nb04 has | 20:09 |
fungi | seems probable | 20:10 |
corvus | every zuul component is using kazoo 2.7.0 except nb03 which is using 2.6.1 | 20:11 |
mordred | corvus: I checked on nb04 - devpts is mounted in the right place, /dev/ptmx is as expected and I don't see where dib would have broken it | 20:12 |
mordred | BUT - dib does some things with devpts - so it's entirely possible dib did a bad | 20:13 |
mordred | somehow | 20:13 |
mordred | corvus: neat. I tried running a non-interactive command and got: | 20:14 |
mordred | OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "close exec fds: open /proc/self/fd: no such file or directory": unknown | 20:14 |
corvus | maybe we want to restart (or reboot) and see what it looks like when it starts | 20:15 |
corvus | that may give us a clue if it's some dib cleanup task or something | 20:15 |
clarkb | mordred: oh good the infra-prod jobs run when install docker is updated | 20:15 |
clarkb | mordred: so we won't need to manually trigger jobs once the fix lands | 20:15 |
clarkb | corvus: note that nb04 is one of the hosts without docker-compose currently installed | 20:16 |
corvus | clarkb: ack. but i'm using plain docker commands | 20:16 |
corvus | clarkb: oh, you're warning me not to restart it right now :) | 20:16 |
corvus | message received | 20:16 |
corvus | (or, at least, don't use dc to restart it) | 20:17 |
clarkb | ya | 20:17 |
corvus | i've rerun my test with zk 2.6.1 -- same results | 20:17 |
clarkb | also if you look at zuul status for deploy pipeline right now I think its doing a thing we didn't expect it to? | 20:17 |
clarkb | there are two changes in the pipeline and the second change is running jobs before the first has finished | 20:17 |
corvus | ah, yup, we seem to be sharing the mutex between the two. | 20:18 |
corvus | i wonder if we can turn this into a dependent pipeline with a window of 1 | 20:19 |
corvus | the main thing would be to look into the merge check | 20:19 |
clarkb | mordred: pip fix breaks on xenial? https://zuul.opendev.org/t/openstack/build/e979db12fcf042ed8e51ca6be4cd0545/log/job-output.txt#16953 | 20:20 |
fungi | clarkb: i saw the same a little bit ago. i thought the mutex was supposed to wind up serializing them in the item enqueue order | 20:20 |
fungi | but that doesn't appear to be the case | 20:20 |
fungi | so, yeah, window of 1 i guess will be better than possible out-of-sequence deployments | 20:21 |
corvus | maybe our mutex wakeups are random | 20:22 |
clarkb | mordred: I think maybe this isn't necessary on xenial. So we can fix pip3 too | 20:23 |
clarkb | I'm testing it locally in a xenial container and will push fix if I think it will work | 20:25 |
mordred | clarkb: I agree - I think it isn't necessary on xenial | 20:26 |
corvus | clarkb, fungi: i'm still surprised about that. we should release the semaphore before processing the queue, and the queue processing should happen in order, so i'd expect each job for the first change to get it in order, then each job for the second change. unless one of the jobs on the first change didn't specify the semaphore? | 20:26 |
mordred | corvus: the semaphore should be on the base job | 20:27 |
corvus | we don't show nearly enough job info in the web ui | 20:27 |
mordred | yeah. anything parented on infra-prod-playbook | 20:27 |
mordred | that's where we're declaring use of the semaphore | 20:28 |
mordred | oh! interesting | 20:28 |
mordred | semaphore: infra-prod-service-bridge | 20:28 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 20:28 |
mordred | we have one job that declares a non-existent sempahore | 20:28 |
mordred | that is a different semaphore | 20:28 |
corvus | mordred: which job? | 20:28 |
clarkb | mordred: corvus fungi https://review.opendev.org/720820 has been updated to handle xenial if you have a moment between thinking about all the other things :) | 20:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove semaphore from service-bridge https://review.opendev.org/720829 | 20:29 |
mordred | corvus: infra-prod-service-bridge | 20:29 |
fungi | taking a look | 20:29 |
clarkb | infra-root should we start considering making an order of changes to land? | 20:29 |
corvus | mordred: ok. i don't think that job was involved here. | 20:30 |
corvus | yeah, our problem set has exploded again | 20:30 |
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK | 20:31 |
corvus | docker-compose is uninstalled; semaphores may run out of order; something about zk is weird when rolling restart; nb04 /dev in container is hosed | 20:31 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 20:31 |
corvus | did i miss anything? :) | 20:31 |
mordred | corvus: I think that's about it | 20:31 |
mordred | corvus: also - luckily for us, 3 of those problems we don't really understand | 20:32 |
corvus | okay, we gotta find a way to avoid installing docker-compose from pip in the future -- this whole sequence of "oops we don't have pip3 on this distro" was exactly the business that we got out of... for about 10 minutes. | 20:32 |
fungi | corvus: so what i observed earlier (but was refraining from interrupting other discussion with) is that 720235,2 had a waiting infra-prod-manage-projects build, but 719051,8 which was enqueued into the deploy pipeline after it started running infra-prod-service-review (those share a semaphore, right?) | 20:32 |
fungi | after infra-prod-service-review completed for 719051,8, infra-prod-manage-projects started running for 719051,8 ahead of it | 20:33 |
fungi | er, for 720235,2 ahead of it | 20:33 |
clarkb | corvus: ya I'm not sure what the proper answer is there. One crazy idea I had was running docker-compose from docker, but I imagine that will need testing | 20:34 |
clarkb | (and generally exposing the docker command socket to docker containers seems dirty) | 20:34 |
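The "run docker-compose from docker" idea would look something like the sketch below; the image tag and mounts are illustrative, and the socket mount is exactly the part that feels dirty:

```shell
# hypothetical: use the docker/compose image in place of a pip-installed binary
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$PWD:$PWD" -w "$PWD" \
  docker/compose:1.25.5 up -d
```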
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK I've filled in the docker-compose related items and put spots for the other things if people have things in flight to track | 20:36 |
fungi | yay! etherpad is snappy again! | 20:39 |
fungi | i've heard no complaints about it after the tuning config got added back, fwiw | 20:40 |
mordred | \o/ | 20:40 |
mnaser | uh i feel bad about bothering with this, but it seems like i got a buildset stuck in the vexxhost tenant again somehow.. | 20:48 |
mnaser | http://zuul.opendev.org/t/vexxhost/status -- its been around for 3h10m -- even when i +W it to kick it straight into gate, it is still there | 20:48 |
clarkb | mnaser: I think the inap issues are persisting | 20:48 |
clarkb | let me see what that job is waiting on | 20:48 |
mnaser | will it fail to dequeue as well? | 20:49 |
clarkb | I don't think so but dequeuing won't really help necessarily | 20:49 |
mnaser | right, but if i +W it, shouldn't it remove it from check and kick it straight to gate | 20:49 |
clarkb | mnaser: depends on how your pipeline is set up | 20:50 |
openstackgerrit | Arun S A G proposed opendev/gerritlib master: Fix AttributeError when _consume method in GerritWatcher fails https://review.opendev.org/720832 | 20:50 |
mnaser | im pretty sure we're using the one similar to opendev/zuul so go-straight-to-gate | 20:50 |
clarkb | fwiw those jobs don't seem to be blocking on inap | 20:52 |
clarkb | and two of them just started | 20:52 |
clarkb | still trying to figure out what they were hung up on | 20:52 |
clarkb | looks like rax-iad-main had it | 20:53 |
clarkb | for ~3 hours | 20:54 |
clarkb | so its the same behavior we had with inap but in rax | 20:54 |
clarkb | we end up with a lot of active requests but they aren't being fulfilled quickly (due to what I think are quota accounting issues) | 20:54 |
clarkb | and check would be sorted last so that probably contributes to it, though the neutron case was in the gate | 20:55 |
clarkb | http://grafana.openstack.org/d/8wFIHcSiz/nodepool-rackspace?orgId=1 shows iad being sad | 20:55 |
clarkb | seems to be recovering now though | 20:55 |
corvus | i guess we can add that to the list of fires | 21:01 |
corvus | also, we should stop logging the "could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance." exception | 21:01 |
corvus | the builder logs are pretty unreadable | 21:01 |
clarkb | corvus: fwiw I think that may just be "normal" cloud things. Addressing that in nodepool will be complicated I think | 21:05 |
clarkb | (its hard to work around when the cloud isn't giving us accurate info) | 21:05 |
clarkb | but I can dig into that again monday and make sure there isn't something else going on | 21:05 |
corvus | clarkb: it would be good to have a clear idea of what's going on. we already expect openstack to lie to us about server deletions. if it's also lying about quotas, etc, it'd be good to know | 21:06 |
clarkb | ++ | 21:06 |
corvus | clarkb, mordred: is there a way to get at the docker logs from the previous run of a container? | 21:10 |
clarkb | corvus: if they go to systemd I think so | 21:10 |
clarkb | and I think they do by default /me looks | 21:10 |
clarkb | oh maybe it isn't default | 21:11 |
clarkb | corvus: internet says do docker logs with the container id | 21:12 |
clarkb | and i believe you can get historical container ids from dockerd logs | 21:12 |
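Roughly what that looks like in practice (the container name is the nb04 one from earlier):

```shell
# list every container, including exited ones, to find the id you need
docker ps -a --format '{{.ID}} {{.Names}} {{.Status}}'
# logs for a container that still exists
docker logs nodepoolbuildercompose_nodepool-builder_1
# once a container has been removed, fall back to the docker daemon's journal
journalctl -u docker.service
```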
clarkb | ok the distutils thing fixed the pip change | 21:13 |
clarkb | now we wait for it to gate | 21:16 |
*** hashar has joined #opendev | 21:20 | |
corvus | clarkb: ah, docker-compose down deletes the container, and once it's gone docker logs $containerid doesn't work | 21:23 |
corvus | but everything is going into the journal, so that'll do for now | 21:23 |
clarkb | oh good its in the journal anyway | 21:23 |
clarkb | corvus: how do you get it out of the journal? | 21:23 |
corvus | clarkb: i'm just using journalctl -u docker.service | 21:24 |
clarkb | thanks | 21:24 |
clarkb | (it's useful to know that bit of info) | 21:24 |
openstackgerrit | Merged opendev/system-config master: Remove semaphore from service-bridge https://review.opendev.org/720829 | 21:25 |
clarkb | mordred: ^ some progress | 21:28 |
mordred | clarkb: woot! | 21:44 |
*** hashar has quit IRC | 21:46 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 21:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:56 |
mordred | clarkb: I have verified that the docker-compose on review is the pip version and the docker compose file has both outstanding changes in it | 22:00 |
mordred | clarkb: so we should be well positioned to restart whenever we decide it's a good time to do that | 22:00 |
clarkb | mordred: cool | 22:01 |
clarkb | at this point I think my ability to debug more things is waning | 22:01 |
clarkb | wanting to wrap up the outstanding things | 22:01 |
mordred | totally | 22:02 |
mordred | I recorded that we're ready to do that whenever in the etherpad | 22:02 |
corvus | i'm digging through zk server logs and reading docs and bug reports to try to come up with a new hypothesis | 22:12 |
openstackgerrit | Merged opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 22:16 |
* mordred is going to pay attention to those | 22:16 | |
mordred | clarkb: the list of services that need the pip/compose update in the etherpad is the list of jobs that just got triggered - so that particular thing should be done once this runs | 22:18 |
clarkb | mordred: cool and I'm around paying attention too | 22:18 |
openstackgerrit | Merged opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 22:19 |
clarkb | nb04 looks happy now | 22:29 |
clarkb | also docker ps -a shows a lot of old docker containers there | 22:29 |
clarkb | I think we need to get in the habit of doing docker run --rm ? | 22:30 |
clarkb | mordred: ^ you probably have ideas on that | 22:30 |
mordred | clarkb: hrm | 22:32 |
mordred | clarkb: I wish I knew why that container was unhappy in the first place | 22:32 |
mordred | clarkb: oh - yeah - I always do --rm when I do run | 22:33 |
mordred | clarkb: think we shoudl clean those up real quick? | 22:33 |
clarkb | maybe? it could be part of corvus' debugging and we should have corvus confirm first? | 22:34 |
clarkb | but ya I think cleaning up would be a good idea | 22:34 |
mordred | clarkb: ++ - most of those look like utility images from weeks ago | 22:34 |
mordred | clarkb: docker ps -a | grep Exited | awk '{print $1}' | xargs -n1 docker rm | 22:34 |
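A shorter equivalent, if the goal is just "remove every stopped container", is docker's built-in prune (it removes them all, so only appropriate when none of them are wanted):

```shell
docker container prune -f
```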
clarkb | etherpad just restarted | 22:35 |
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK is still working for me | 22:35 |
clarkb | this is all looking good \o/ | 22:35 |
clarkb | oh I forgot to remove meetpad from emergency | 22:39 |
clarkb | mordred: thoughts on ^ should I just remove it now or wait for money? | 22:39 |
clarkb | *monday. money is nice too | 22:39 |
clarkb | docker registry looks happy now too | 22:40 |
mordred | clarkb: I think we can remove it - I don't think there were any reasons not to | 22:40 |
clarkb | mordred: I guess my only concern is if there were other changes and they weren't happy at this point | 22:41 |
clarkb | but since the service isn't in prod its probably fine | 22:41 |
clarkb | I'll remove it now so I don't forget further | 22:41 |
mordred | yeah. and corvus acked that it was ok earlier | 22:41 |
corvus | i did not do any docker runs | 22:41 |
corvus | only exec | 22:41 |
clarkb | corvus: rgr so ya we should be able to clean up all those containers mordred | 22:41 |
mordred | kk. removing | 22:41 |
clarkb | meetpad01 has been removed from emergency file | 22:42 |
clarkb | I'll put further debugging of this nodepool "slowness" high on my list for monday | 22:43 |
clarkb | since people keep noticing it so its definitely frequent and painful | 22:43 |
clarkb | zookeeper play failed on AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4' | 22:44 |
mordred | hrm. that's weird | 22:44 |
mordred | investigating | 22:44 |
corvus | we may have run in a v6 only cloud? | 22:45 |
clarkb | corvus: this was against prod | 22:45 |
clarkb | /var/log/ansible/service-zookeeper.yaml.log for the logs | 22:45 |
clarkb | * on bridge | 22:45 |
clarkb | zuul-preview seems good though I'm trying to find a change I can confirm that with via zuul dashboard | 22:47 |
* clarkb looks for zuul website change | 22:47 | |
mordred | that var shows up when I run setup and is also in the fact cache for those hosts | 22:47 |
clarkb | mordred: is it maybe the lookup path? | 22:48 |
fungi | clarkb: did we already switch the zuul website preview to using the zuul-preview service? | 22:49 |
fungi | i thought it wasn't yet (at least as of a week-ish ago) | 22:49 |
clarkb | fungi: no I thought we did but the job artifact errors with bad urls | 22:49 |
clarkb | and its because its at ovh's swift root not zp01 | 22:49 |
corvus | re zk cluster probs: i think we're looking at a server issue of some kind. it seems like when we kill the leader, that the new leader begins a new 'epoch' (which i think appears as the first character of the zxid in the logs -- that's why 0xd00000000 showed up -- epoch 0xd); my limited understanding is that should become the first zxid committed after the leader election, and then all the followers | 22:49 |
corvus | should get that. we're seeing clients connect having seen that zxid, but then the followers they connect to don't seem to have it. | 22:49 |
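One way to spot-check that from outside is ZooKeeper's four-letter-word interface, which reports each server's role and last zxid (srvr is whitelisted by default; the third hostname is an assumption, only zk01/zk02 are named in the log):

```shell
# ask each cluster member for its mode and last seen zxid
for host in zk01.openstack.org zk02.openstack.org zk03.openstack.org; do
  echo "== $host"
  echo srvr | nc "$host" 2181 | grep -E 'Mode|Zxid'
done
```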
mordred | I can reproduce the ansible_default_ipv4 issue with a simple playbook - poking at combos to see what works and doesn't | 22:51 |
mordred | sigh | 22:51 |
clarkb | mnaser: if you happen to still be around did you have any zuul preview using changes we can test with (I thought you had something) | 22:52 |
mordred | so - if I run a playbook targeting zk01.openstack.org that wants to get zk02.openstack.org's hostvars but zk02 hasn't ever done anything in the playbook, it fails | 22:52 |
clarkb | mordred: oh so we should add an explicit setup call across those hosts maybe? | 22:52 |
mordred | but if I run something, _anything_ on zk02 first - the hostvars are there | 22:52 |
mordred | we don't even need a setup call | 22:52 |
mordred | a debug call suffices | 22:53 |
corvus | clarkb: try the zuul-website gatsby wip patch? | 22:53 |
clarkb | corvus: thats what I pulled up but the url there is for ovh swift roo | 22:53 |
clarkb | *root | 22:53 |
mordred | it doesn't need to fetch new facts | 22:53 |
clarkb | let me see if there was a different url I should use | 22:53 |
clarkb | https://zuul.opendev.org/t/zuul/build/925bfe37815144d0859f260605d5fb98 is the build for that I think | 22:54 |
clarkb | note the site preview url is straight to storage.gra.cloud.ovh.net | 22:54 |
mnaser | clarkb: the zuul website changes should be good for that | 22:55 |
mnaser | or single change. I haven’t gotten around finalizing that | 22:55 |
clarkb | mnaser: https://zuul.opendev.org/t/zuul/build/925bfe37815144d0859f260605d5fb98 is what I'm looking at for that is that wrong? | 22:55 |
mnaser | clarkb: yes that’s the right one | 22:55 |
clarkb | mnaser: ok the site preview for that is straight to the ovh swift files not zp | 22:56 |
clarkb | and that doesn't work (as expected) | 22:56 |
clarkb | maybe I need to manually construct the zp url? | 22:56 |
fungi | right, like i said, i don't think the zuul-web previews are using zuul-preview (yet) | 22:56 |
mnaser | clarkb: yeah I haven’t pushed up a patch to return that as an artifact. I have to return both | 22:56 |
clarkb | mnaser: gotcha, do you know what the url format is in that case? | 22:56 |
corvus | clarkb: http://site.925bfe37815144d0859f260605d5fb98.zuul.zuul-preview.opendev.org/ | 22:57 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run a noop on all zookeeper servers first https://review.opendev.org/720847 | 22:57 |
mnaser | ^^ | 22:57 |
mordred | corvus, clarkb : ^^ that should fix the zookeeper thing | 22:57 |
clarkb | mnaser: corvus thanks! and that seems to work for me so I think zp is good | 22:58 |
mordred | (the playbook being unhappy - not the important zk thing) | 22:58 |
mnaser | artifact_type.build_id.tenant_id.zuul-preview.opendev.org is the format. Thanks corvus | 22:58 |
clarkb | mordred: should you add !disabled to that? | 22:58 |
corvus | clarkb: there's a comment explaining why not :) | 22:58 |
mordred | clarkb: no - left that off on purpose (and wrote a comment explaining) | 22:58 |
clarkb | heh I should read | 22:59 |
corvus | mordred: that's super weird that it works with --limit | 22:59 |
mordred | corvus: I agree | 22:59 |
mordred | I think it's a super weird behavior in general | 22:59 |
corvus | mordred: i guess it's some sort of "well, since it's limited, we know we're not going to update the data, so we should just start with the cache" | 22:59 |
corvus | mordred: but also, it could just be "no one understands this" | 22:59 |
mordred | yeah | 22:59 |
mordred | fwiw - /root/foo.yaml on bridge is what I used to verify | 23:00 |
corvus | i'm pretty sure the zookeeper images on dockerhub are being rebuilt with the same tags | 23:01 |
corvus | 3.6.0 is still the only 3.6, but it's 11 hours old | 23:01 |
corvus | and i know we ran a 3.6.0 longer ago than that | 23:02 |
clarkb | everything succeeded but zk in that pass and zk failed for unrelated reasons and is already running newer docker-compose | 23:02 |
* clarkb updates etherpad but things seem happy now | 23:02 |
corvus | what happened the last time we tried 3.6.0? | 23:03 |
mordred | clarkb: what's the cantrip for making a fake rsa key for test data? | 23:03 |
clarkb | mordred: ssh-keygen -p'' ? | 23:03 |
clarkb | mordred: zuul quickstart should have it for gerrit things | 23:04 |
clarkb | corvus: I don't remember being around for that, but could it have been upgrade concerns? | 23:04 |
mordred | clarkb: thanks | 23:04 |
clarkb | like maybe 3.4 -> 3.6 isn't doable in rolling fashion? | 23:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 23:06 |
corvus | clarkb: it is doable, but something was preventing a quorum from forming on 3.6 | 23:07 |
mordred | corvus: I understand the applytest race condition. I think I can live with it until I rework that job | 23:09 |
fungi | mordred: clarkb: ssh-keygen -p'' just sets the private key to not encrypted. are you looking for something like gnupg's --debug-quick-random option for creating insecure test keys? | 23:10 |
fungi | or are you really just looking for a key which doesn't require a passphrase to unlock? | 23:11 |
corvus | i think maybe tomorrow we might want to do some testing-in-prod on the zk cluster, because i can't replicate the problem locally, nor do i see any problem running 3.6.0 (at least, the latest version of that image) | 23:16 |
*** tosky has quit IRC | 23:21 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Make applytest files outside of system-config https://review.opendev.org/720848 | 23:21 |
mordred | corvus: I support that | 23:21 |
mordred | corvus: also - I decided I was too annoyed by the applytest race - so that ^^ should fix it | 23:21 |
mordred | fungi: really just needed some rsa key data to put into the testing "private" key hostvars so that the role would write something to disk in the integration test jobs | 23:22 |
mordred | corvus: I mean - assuming that runs at all - I think it should fix the race :) | 23:22 |
mordred | fungi: if you have some brainpellets - 720848 could use some eyeball powder | 23:23 |
mordred | fungi: for context - we keep seeing occasional failures like: https://zuul.opendev.org/t/openstack/build/64e6d48f114d43979502b21ca6d626ac/log/applytest/puppetapplytest21.final.out.FAILED | 23:23 |
fungi | mordred: got it, so one-time key generation, not rapid/repetitive key generation in a job | 23:23 |
mordred | fungi: yeah | 23:24 |
mordred | ssh-keygen did the trick | 23:24 |
fungi | cool | 23:24 |
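For the record, generating a throwaway unencrypted RSA key for test fixtures is a one-liner (filename and comment are arbitrary):

```shell
# -N '' means no passphrase; this key is only ever test data
ssh-keygen -t rsa -b 2048 -N '' -C 'integration-test-only' -f ./test_id_rsa
```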
fungi | mordred: yeah, first instinct on that is some sort of race on directory creation/deletion | 23:27 |
openstackgerrit | Merged opendev/system-config master: Run a noop on all zookeeper servers first https://review.opendev.org/720847 | 23:42 |