Friday, 2020-04-17

openstackgerritIan Wienand proposed openstack/project-config master: Add centos aarch64 to labels
*** DSpider has quit IRC00:04
ianwwhy that is not trying to build has me stumped right now00:23
openstackgerritIan Wienand proposed openstack/project-config master: Add centos aarch64 to labels, unpause
ianwyeah, unpausing it will help00:26
ianwi've applied that manually and am watching an initial build00:26
*** ysandeep|away is now known as ysandeep01:03
openstackgerritMerged openstack/diskimage-builder master: Add centos aarch64 tests
mnaserinfra-root: ok, i have one issue right now, i _could_ work around it by abandon/restore but maybe useful for someone to take a look and see why? 720595,6 has been stuck for 2h18m (and new jobs are starting inside openstack tenant so its not a lack of nodes)...01:19
clarkbmnaser: it might be the inap issue we saw earlier today01:19
clarkbmnaser: basically we seem to leak enough nodes there due to failed "successful" node deletes and that breaks quota accounting so we over-attempt to boot instances in inap01:20
clarkband basically it delays things01:20
mnaserclarkb: does nodepool enforce building vms in the same provider?01:20
clarkbmnaser: only within a job01:20
mnaseraaaah, so maybe it keeps trying to get an inap job01:21
clarkbI dont think its repeatedly trying01:21
clarkbits just waiting for inap to actually delete the nodes it said it deleted so new ones can boot01:21
mnaserclarkb: ah, so probably just best to sit and wait and if it's still around for much longer then maybe abandon/restore01:22
openstackgerritMerged openstack/project-config master: Add centos aarch64 to labels, unpause
clarkbif it persists longer we should maybe have deletes poll more or disable inap or something01:24
clarkbits a weird behavior and seems new but I spent a chunk of the morning tracing it through and pretty sure root cause is nova delete says "yes I succeeded" but then the server persists for a long time01:24
mnaserclarkb: the queued jobs do depend on the registry that's paused, so maybe that contributes to it?01:25
clarkbthe one I looked at today was a tempest job so no docker bits01:25
mnaserah ok, the paused one is in our cloud right now01:26
clarkboh hrm do required jobs like that end up in the same cloud?01:27
clarkbfwiw I'm not at a computer so can't debug directly but otherwise sounds similar to the inap thing from today01:28
mnaserclarkb: i don't know if required jobs like that end up in the same cloud, but im curious to know.  but yeah, if you saw that behaviour earlier then might be good to leave it for someone to have a look at it later01:29
corvusmnaser: yeah, jobs that depend on paused jobs request nodes from the same provider02:06
corvus(with a bump in priority to try to speed things up)02:07
*** ysandeep is now known as ysandeep|afk02:21
openstackgerritMerged zuul/zuul-jobs master: fetch-subunit-output test: use ensure-pip
prometheanfireianw: have time for ?02:54
openstackgerritMerged zuul/zuul-jobs master: ensure-tox: use ensure-pip role
openstackgerritIan Wienand proposed opendev/system-config master: [dnm] test with plain nodes
openstackgerritIan Wienand proposed openstack/project-config master: nb03 : update to arm64 to inheritance, drop pip-and-virtualenv
*** ysandeep|afk is now known as ysandeep04:22
*** kevinz has joined #opendev04:42
*** ysandeep is now known as ysandeep|reboot04:49
*** ykarel|away is now known as ykarel04:51
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Update Fedora to 31
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Document output variables
openstackgerritIan Wienand proposed zuul/zuul-jobs master: Python roles: misc doc updates
*** ysandeep|reboot is now known as ysandeep04:53
*** mnasiadka has quit IRC05:10
*** elod has quit IRC05:10
*** mnasiadka has joined #opendev05:15
*** elod has joined #opendev05:15
openstackgerritMerged openstack/project-config master: Add ubuntu-bionic-plain to all regions
*** ysandeep is now known as ysandeep|brb05:49
*** ysandeep|brb is now known as ysandeep06:12
*** Romik has joined #opendev06:21
*** Romik has quit IRC06:33
*** Romik has joined #opendev07:00
*** jhesketh has quit IRC07:04
*** rpittau|afk is now known as rpittau07:19
*** tosky has joined #opendev07:30
*** Romik has quit IRC07:35
*** ralonsoh has joined #opendev07:38
*** DSpider has joined #opendev07:40
*** ysandeep is now known as ysandeep|lunch07:57
openstackgerritMerged openstack/project-config master: nodepool: Add more plain images
*** ysandeep|lunch is now known as ysandeep08:25
*** ykarel is now known as ykarel|lunch08:26
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Document output variables
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: Python roles: misc doc updates
AJaegerianw: rebased and fixed the failure ^08:45
*** ysandeep is now known as ysandeep|afk09:21
*** ykarel|lunch is now known as ykarel09:25
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations
openstackgerritThierry Carrez proposed opendev/system-config master: No longer push refs/changes to GitHub mirrors
ttxcorvus, mordred, fungi: ^ as discussed10:01
*** rpittau is now known as rpittau|bbl10:30
*** hashar has joined #opendev11:43
openstackgerritAndreas Jaeger proposed openstack/project-config master: Remove pypy job from x/surveil
*** Romik has joined #opendev12:13
openstackgerritMerged openstack/project-config master: Remove pypy job from bindep
*** rpittau|bbl is now known as rpittau12:19
*** Romik has quit IRC12:28
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
hasharI have an interesting use case for octopus merging a couple changes12:34
hasharCI for the jjb/jenkins-job-builder repository  is broken12:35
*** ykarel is now known as ykarel|afk12:35
*** ysandeep|afk is now known as ysandeep12:35
hasharerr wrong repository. I mean jjb/python-jenkins12:36
hasharthe py27 job is broken due to stestr 3.0.0  which is fixed by blacklisting it ( )12:36
hasharthe pypy job is broken for some reason and the job is removed by
hasharand of course, each change has a build failure because of the other change not being around12:37
hasharI can't depend-on on one or the other since that still would cause one of the build to fail12:37
hasharA -> B  (A fails because B fix is not there)12:37
hasharB -> A  (B fails because A fix is not there)12:38
hasharbut I could create an octopus merge of A and B to the branch which should pass just fine12:38
hasharwhich I could potentially CR+2 / W+1 and get submitted by Zuul.  But, I guess Gerrit is not going to merge it because the parents A and B  lack the proper votes ;]12:39
*** ykarel|afk is now known as ykarel12:39
AJaegerhashar: merge the changes together ;)12:43
hashar  ! [remote rejected] HEAD -> refs/for/master (you are not allowed to upload merges)12:43
hasharyeah I will do a single change instead12:44
openstackgerritMerged opendev/system-config master: Install kubectl via openshift client tools
openstackgerritMerged opendev/system-config master: Remove snap cleanup tasks
ttxcorvus, mordred for asynchronously getting rid of remote refs/changes, looks like the following shall do the trick (assuming all repos are listed in github.list):12:52
ttxfor i in $(cat github.list); do echo $i; git push --prune ssh://$i refs/changes/*:refs/changes/* 2>&1 | wc -l; done12:53
ttxthe wc -l trick in there is to roughly count the deleted refs as you go. git push --prune displays those on stderr12:53
ttxThat is what I propose to run after
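ttx's stderr-counting trick can be sketched with a stand-in command (the `printf` below substitutes for `git push --prune`, which prints its pruned-ref messages on stderr; everything here is illustrative, not the actual mirror-cleanup command):

```shell
# Redirect stderr into the pipe so wc counts the "deleted ref" lines
# that git push --prune would emit; printf stands in for git here.
count=$( { printf 'deleted ref 1\ndeleted ref 2\n' >&2; } 2>&1 | wc -l )
echo "$count"
```

The `2>&1` must appear before the pipe so the pipe carries the (redirected) stderr stream, which is the whole point of ttx's complaint about git writing status output to stderr.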
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
openstackgerritMonty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml
mordredttx: cool!13:13
ttxI mean, seriously... stderr13:16
ttxgit why do you hate unix13:17
*** ykarel is now known as ykarel|afk13:31
openstackgerritMonty Taylor proposed opendev/system-config master: Remove unused gerrit puppet things
mordredfungi, frickler : if you have a sec, easy review:
openstackgerritMonty Taylor proposed opendev/system-config master: Remove old
openstackgerritMonty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml
openstackgerritMonty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
mnasercorvus: ok cool, that adds up, thanks for the info13:45
openstackgerritMonty Taylor proposed opendev/system-config master: Start mirroring focal
AJaegermerci, hashar13:49
openstackgerritMonty Taylor proposed openstack/project-config master: Start building focal images
hasharAJaeger: you are welcome :]13:54
mordredcorvus: looking towards using your zk roles in the nodepool test jobs I realized I need to be able to write out the correct zookeeper hosts (will need the same in the zuul jobs) ... so I tried something in 720527 - I'm not 100% sure I like it13:55
*** mlavalle has joined #opendev14:00
fricklermordred: clarkb: question on the pattern matching syntax in
mordredfrickler: I'm pretty sure it's a regex match and not a glob match14:05
mordredfrickler: there's a 'playbooks/roles/letsencrypt.*' showing on that page which should get files matching all of the roles starting with letsencrypt14:06
mordredthat said - I'm not sure why we're doing .* there and just playbooks/roles/jitsi-meet/ above14:07
openstackgerritMonty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
*** ysandeep is now known as ysandeep|away14:31
mnaserwould it be ok if i setup a mirroring job in the vexxhost/base-jobs repo similar to the one i setup inside opendev/project-config ?14:51
mnaseri don't see an issue but i just wanted to get the ok given it's a trusted repo14:51
openstackgerritMonty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
openstackgerritMonty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs
mordredmnaser: I don't see any issue with that14:53
mordredcorvus: ok - I rebased the nodepool patch on top of your zk patch so that I could use the zookeeper role - let's see how many things break :)14:56
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations
*** ykarel|afk is now known as ykarel15:06
*** bwensley has joined #opendev15:09
bwensleyHey everyone - I notice that my gerrit review notifications seem to have stopped yesterday afternoon.15:10
bwensleyIs this a known problem?15:10
AJaegerbwensley: it works for me...15:15
AJaegerbwensley: so, not a known problem15:15
fricklerbwensley: assuming you are talking about emails, if you DM me your address I can check mail logs15:16
bwensleyYes - talking about email notifications.15:17
bwensleyIf it is working for everyone else maybe a problem with my spam filters at my employer.15:18
fricklerinfra-root: seems we are on spamhaus PBL with fungi: IIRC you did the unblocking chant most of the time?15:20
corvusmordred: morning!  catching up on your changes now15:21
mordredcorvus: they may be a terrible idea - they were written during first coffee15:22
prometheanfirecan I get a review on ?15:23
prometheanfiresecond one that is15:23
corvusmordred: i don't see zk stuff in 720527?15:24
corvuswhere should i be looking15:24
mordredcorvus: - which are now parents of
mordredcorvus: (I'd totally do that python module in jinja - but I'm not sure I'm good enough with jinja)15:26
corvusmordred: well, my first TODO today is to jinja the ipv4 addresses of the zk hosts into the config file, so i should have something you can copy/paste in a minute.15:26
corvusmordred: (the same thing is needed in the zoo.cfg file)15:27
mordredcorvus: sweet!15:28
mordredcorvus: I think the hardest thing for the nodepool case is producing the yaml list of dicts format15:28
mordredbut I'm sure we can figure that out15:28
corvusi think it's past time to move the connection stuff into a different config file, but oh well.  :(15:29
mordredcorvus: I left a note on your change with a pointer to some vars that might be useful fwiw15:30
corvusmordred: awesome.  that's step 1 of that task :)15:30
clarkbfrickler: mordred yes I believe it is a regex, see line 1349. However maybe I need to prefix with ^ to make that clear?15:31
clarkbfrickler: mordred I'm looking up zuul docs now15:31
fricklerfungi: actually I think I did send a removal request some time ago, retrying now15:31
corvusthey're always regexes15:31
clarkbcorvus: thanks! frickler see corvus' note I think my change is correct15:31
corvus^ will just anchor it to the start, omitting that will let it match anywhere15:32
fricklerclarkb: hmm, then you could drop the ".*" ending to be consistent with everything else, right?15:32
fricklerwould be less confusing IMHO15:33
clarkbfrickler: ya I guess I can if we allow partial matching15:33
mordredclarkb: we do. I think there are actually several .* suffixes that can all go15:33
corvuswe call "regex.match(file)"15:33
clarkbok I'll push up an update and look at simplifying some of the other matches in a followon15:34
corvusoh, match says it's always at the start of the string15:34
corvus"If zero or more characters at the beginning of string match the regular expression pattern"15:34
corvusi think that means both ^ and trailing .* are superfluous15:34
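The `regex.match(file)` behavior corvus describes can be checked directly; the path below is an illustrative example, not one of the actual matchers in the change:

```python
import re

# re.match anchors at the beginning of the string, so a leading ^ and a
# trailing .* are both superfluous: "playbooks/roles/zookeeper" already
# behaves like "^playbooks/roles/zookeeper.*".
pattern = re.compile("playbooks/roles/zookeeper")
print(bool(pattern.match("playbooks/roles/zookeeper/tasks/main.yaml")))  # True
print(bool(pattern.match("doc/playbooks/roles/zookeeper")))              # False, not at start
```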
frickler#status log submitted and confirmed spamhaus PBL removal request for (
openstackstatusfrickler: finished logging15:35
openstackgerritClark Boylan proposed opendev/system-config master: Run jobs prod test jobs when docker images update
clarkbcorvus: yup I agree15:36
corvusmordred: given the specific task of "modify a slurped yaml nodepool config" it probably makes sense to just keep that as a module15:36
corvusmordred: we can get rid of it when we make a "nodepool.conf" or something in the future15:37
mordredcorvus: ++15:37
corvuswe're going to have "zookeeper-tls" to add to "zookeeper-servers" shortly15:37
mordredcorvus: assuming, of course, I can ever get that module to run15:37
corvusmordred: yeah, i say just keep plugging at it; i don't think my tasks are going to add anything to help15:38
clarkbmordred: frickler ^ there is the updated cahnge15:40
clarkbworking on a followon now to be consistent in that file15:40
fricklerclarkb: ack, thx.15:42
* frickler heads towards the weekend now15:43
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers
corvusclarkb, fungi, mordred: ^ that's ready to merge, please review and +W15:44
corvusafter it lands, we can take zk* out of emergency15:44
fricklerinfra-root: there seem to be umpteen bounces to in the mailq on review.o.o, not sure if that's normal or whether they are due to the PBL issue. do we usually clean these up or just let them expire?15:46
clarkbfrickler: I expect its due to the PBL listing, but fungi and corvus would know better than me15:46
corvusi think it'd be fine to just let them expire15:47
corvusttx: lgtm i'll give fungi a bit in case he wants to review15:49
*** dpawlik has quit IRC15:50
corvusmordred: comment on 72070915:51
mordredcorvus: I have learned something15:51
mordredcorvus: well - I learned your thing - but also, the fact variables I mentioned - only exist if fact gathering has happened for the zk hosts15:52
mordredcorvus: so we can either ensure a noop task has happened on the zookeeper group ... or we could use public_v6 and public_v4 from our inventory file15:53
corvusmordred: we cache facts on bridge15:53
mordredcorvus: nod. do we in test runs?15:53
corvusmordred: so is this just a gate problem?15:53
corvusi ran my jinja on bridge using the real inventory and it works15:54
mordredmight be. but if we use the same ansible.cfg we should cache facts in gate too15:54
openstackgerritClark Boylan proposed opendev/system-config master: Simplify .zuul.yaml regexes
corvusmordred: (and that test on bridge was with a "hosts: localhost" play)15:54
mordredI think in the gate we might need to run the zookeeper playbook first so that we'll populate the fact cache - but we need to run that ANYWAY to make the zk hosts15:54
clarkbmordred: frickler corvus ^ thats the followon though not stacked as it had a merge conflict with master and I didn't want to update the other change again :)15:54
corvusmordred: yeah, that sounds reasonable to rely on that as a side effect.  maybe worth a comment.15:55
mordredalso - in my nodepool patch I'm preferring ipv6 if it exists - is that a bad idea?15:56
corvusclarkb: +2; i noted one innocuous change15:57
*** ykarel is now known as ykarel|away15:57
corvusmordred: actually15:57
* corvus wakes up15:57
corvusmordred: why aren't we using hostnames in nodepool.yaml?15:57
mordredcorvus: well - we are in the normal one - but hostnames won't resolve in the gate15:57
mordredcorvus: unless we're writing out /etc/hosts files15:58
corvusthat's lame15:58
mordredmaybe we should write out /etc/hosts files?15:58
corvusoh no15:58
corvusi meant writing /etc/hosts is lame15:58
mordredyeah - it's totally lame15:58
mordredbut - overall the "test nodes won't resolve in dns" is gonna be an ongoing thing probably as we do more and more of these real world multi-node things15:59
corvustrue.  in which case, write /etc/hosts or template in ip addresses are both reasonable solutions16:00
corvustemplating in ip addresses does have the advantage of potentially being the same in test and prod16:00
corvus(eg, zoo.cfg)16:00
corvusmordred: anyway, to your question: preferring v6 sounds reasonable16:01
corvuswe can see how that ends up performing in our various clouds16:01
mordredI'll stay with ips for now - and we can swing back to /etc/hosts if needed16:02
mordredcorvus: I'm going to have to squash two of those patches - since I need to run zk so that zk hosts exist :)16:02
openstackgerritMonty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
corvusmordred: i was looking at this spurious failure on your change:
corvusmordred: it looks like some kind of rsync race?  i wonder if one of the recent changes to how we run stuff could be affecting that?16:08
corvus(we could just recheck it and continue to remove puppet; but i worry if we're going to start getting more errors)16:09
ttxcorvus if you+2a the replication change it could be good to keep an eye on the replication thread see if it gets backed up -- might be a sign that refs/changes gets deleted on the push16:09
mordredcorvus: ugh yeah16:09
ttxIt should not, since it's a push without --mirror afaict16:09
mordredcorvus: I mean - part of me wants to say "recheck and keep working to remove puppet" - but I also agree, this could be an escalating issue16:09
corvusttx: ack16:09
ttxbut it's not superclear looking at Gerrit plugin code16:09
ttxor it can wait Monday :)16:10
corvusttx: yeah, it may depend on whether fungi is around :)16:10
corvus(or if his internet has been swept out to sea)16:11
mordredcorvus: clarkb and I talked about doing a couple of steps to clean some things up even with puppet in place ... namely, going ahead and making service-$foo playbooks and corresponding jobs - even if those playbooks right now just run puppet on a given host ...16:11
clarkbmordred: corvus: you've both acked; my parental home school duties will be over in about an hour and a half. Is that a good time for you all to land that?16:11
mordredcorvus: and if we do that, I think we could decently change any puppet tests we have into testinfra tests - and then just drop the puppet-specific tests altogether16:11
mordredclarkb: wfm16:11
corvusmordred: yeah, that's a good idea -- running the playbook means we can drop the applytest (it's better than an apply test)16:12
mordredcorvus: because "run all of the puppet" every time we touch an ansible file is a bit of a waste16:12
mordredI think I might put that fairly soonish on my list16:12
corvusclarkb: did we figure out about restarting services?16:12
mordredbecause that would also allow us to move to the opendev tenant16:12
mordred(since the blocker right now is the legacy base jobs in ozj - which we use in the puppet tests)16:13
corvusmordred: which will speed everything up :)16:13
clarkbcorvus: we expect it will restart processes. Gerrit should be fine because we don't docker-compose up it during normal runs.16:13
corvusclarkb: cool, wfm16:13
clarkbcorvus: services like zuul preview, docker registry, gitea, nodepool-builder will restart16:13
mordredand once the compose change is in - we should do a controlled restart of gerrit - because we have a change we need to pick up16:14
clarkbgitea should be ok because we do one at a time. Though we'll want to replicate to them afterwards to avoid any missed refs16:14
clarkb(I can do that)16:14
mordredclarkb: didn't we land your update to safely restart gitea?16:14
mordred(so that we do it in the right order?)16:14
clarkbmordred: oh we did, and that might cause this to not actually restart gitea16:14
clarkbbecause we check for new images otherwise don't issue the commands16:15
clarkbso we should manually restart things if there isn't a new image coincident with this update16:15
mordrednod. and next time we have new images, the restart should still do the right thing16:15
clarkb(I can also do that)16:15
mordredwell - we DO have a new image we could roll out16:15
mordred <--16:15
mordredwe could land that after the docker-compose patch16:15
mordredand that should trigger a gitea rollout16:16
clarkb++ lets do it that way16:16
mordredgood exercise of our machinery16:16
*** rpittau is now known as rpittau|afk16:17
mordredcorvus: you still have -2 on your zk change - but clarkb and I both +2'd it16:22
corvusmordred: ah thanks! :)16:23
*** mlavalle has quit IRC16:34
mordredcorvus, clarkb: I pushed up two changes this morning unrelated to this - and - to start mirroring and building images of focal, since that's being released next week16:36
mordredcorvus: and speaking of - when we roll out new ze* servers after the ansible rollout - perhaps we should consider jumping straight to focal instead of bionic so that we don't have to think about them for a while16:39
corvusmordred: ++16:40
mordredfocal is defaulting to python 3.8 - so if we did that and then bumped to the 3.8 python-base in our image builds, we'd be on the same python across the install16:40
*** kevinz has quit IRC16:40
corvushopefully afs works16:40
mordredyeah. that'll be the first question16:40
*** mlavalle has joined #opendev16:43
fungifrickler: yeah, the pbl rejection messages should mention the url for more info, which will get you eventually to the delisting page, and i usually use the infra-root shared mailbox to do the verification message. i can take care of it in a minute if nobody has gotten to it yet16:44
openstackgerritMerged opendev/system-config master: Simplify .zuul.yaml regexes
fungilooks like you got it though16:45
fungiand sorry for the delay, looking over 720498 now16:45
fungion the replication change, did we ever disable the live replication config update "feature"?16:49
fungii think i had a change up some time ago to revert it16:49
openstackgerritMerged opendev/system-config master: Run ZK from containers
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
fungiokay, yeah, that was and it merged ~3 months ago16:52
openstackgerritMonty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
mordredcorvus: ^^ those were basically green last time - except for one testinfra thing. I pushed up a fix for that, but then had to rebase because of the .zuul.yaml and the newer zk patch16:54
mordredso the most recent ps is just the rebase16:55
mordredcorvus: also - check it:
mordredcorvus: (the file itself now looks awful because of slurp|from_yaml|to_yaml - but I think we can live with that until we get a nodepool.conf)16:57
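The slurp|from_yaml|to_yaml pattern being discussed could look roughly like the following Ansible sketch; the task names, file path, and the `zk_hosts` variable are illustrative assumptions, not the actual change:

```yaml
# Hypothetical sketch: read the existing nodepool config, inject the
# zookeeper server list, and write it back. slurp returns base64, hence
# the b64decode. to_yaml re-serializes, which is what loses the original
# hand-written formatting mordred mentions.
- name: Read the current nodepool config
  slurp:
    src: /etc/nodepool/nodepool.yaml
  register: nodepool_config_raw

- name: Write it back with the zookeeper-servers list injected
  copy:
    dest: /etc/nodepool/nodepool.yaml
    content: >-
      {{ nodepool_config_raw.content
         | b64decode
         | from_yaml
         | combine({'zookeeper-servers': zk_hosts})
         | to_yaml }}
```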
corvusmordred: heh, it's readable enough :)17:00
corvusclarkb: +3 ?17:01
clarkbcorvus: do you also need to update the .env file?17:03
clarkbI seem to recall that one having the etherpad url in it too17:03
clarkbcorvus: I've approved it and can update .env if necesary in new change17:04
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
corvusclarkb: yeah, it is in there, but i think it's only used to generate the config file that we manually install; maybe i'll just remove it in a followup....17:05
corvuser, you know what i mean by manually -- ansible installs it17:06
corvusi'm manually running the playbook against zk0117:08
mordredcorvus: any idea on how to do this:17:09
mordredhosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888,{% endfor %}17:09
mordredbut without the trailing , that'll be there?17:09
corvusmordred: yeah, there's some loop variables... 1 sec17:10
mordredah - found it17:11
corvustable of variables:
mordredhosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888{% if not loop.last %},{% endif %}{% endfor %}17:11
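mordred's later idea of using a join instead of the per-host loop could be sketched like this; it is an untested variant of the snippet above, assuming facts are gathered for every host in the `zookeeper` group:

```jinja
{# Equivalent to the loop.last version above: extract each host's IPv4
   address from hostvars, then let join supply the separators so no
   trailing comma is emitted. Sketch only. #}
hosts={{ groups['zookeeper']
         | map('extract', hostvars, ['ansible_default_ipv4', 'address'])
         | join(':2888:3888,') }}:2888:3888
```

The `join(':2888:3888,')` trick puts the port suffix between entries, so only the final suffix needs appending by hand.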
corvusrunning playbook against zk0217:13
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
mordredcorvus: that might work ^^ ... also - inverse of the nodepool ones - I pushed up a rebase that's just a rebase, then that last patch did the fixes needed17:15
clarkbmordred: corvus re /etc/hosts I think our multinode role sets that up for you17:16
corvusyeah, so we could use that role (or that part of that role) if we wanted to go that way17:16
mordreddoes multinode do anything extra that might conflict with the things we're trying to test with system-config-run jobs?17:17
corvusbut i was thinking about it further, and it's still not a slam dunk for this use case -- we don't want to stand up a full cluster, we only want one node, so writing out the config is still desirable17:17
corvusrunning the playbook on 03 now17:18
mordredalthough we could just join group['zookeeper'] instead of needing to do the extra loop to find the ip address from hostvars17:18
mordredcorvus: cool17:18
* mordred could go either way17:18
*** hashar has quit IRC17:18
corvusi'm seeing a bunch of client errors now17:19
mordredclarkb: we could just use role multi-node-hosts-file17:19
mordredit is nicely split out into its own role :)17:19
corvusinfra-root: heads up -- i think the zk cluster is in a bad state17:20
mordredcorvus: uhoh17:20
mordredcorvus: should we switch to opendev-meeting?17:20
fungiat least it's friday? ;)17:20
clarkbcorvus: logs look like yesterday17:20
corvusi'll stop zk0317:21
corvusthat did not improve things17:22
corvusi'll restart everything?17:22
clarkbI think that is what helped last time?17:23
fungiseemed like it anyway17:24
corvuslooks happier17:24
corvusi am less than satisfied with this17:24
corvusthat should have been a straightforward rolling restart17:24
mordredcorvus: should we try another rolling restart to see how it goes?17:25
corvusmaybe -- though i wonder if we need the dynamic config file17:25
clarkbwe have done rolling restarts of the ubuntu packaged zk successfully in the past (I think ianw did one in the last couple weeks too)17:25
corvusthat was 3.4.8 iirc17:25
corvus(we do need 3.5.x for tls)17:25
mordredcorvus: are you thinking that maybe when a node leaves the cluster zk is updating the dynamicConfig?17:25
corvusmordred: yeah17:26
corvusi'm still fuzzy on how "optional" it is17:26
mordredI really wish people wouldn't write server software that writes things to its config files17:26
corvusi might be able to simulate this locally17:26
corvusthat's probably the place to start17:26
corvusyep that thing17:27
mordredyeah - my reading of that tells me that it's going to write server values to the file17:27
mordredwhen servers come and go17:27
corvusmordred: but what happens if you don't include the client port number at the end?17:28
corvussee "Backward compatibility"17:28
corvusand if we don't "invoke a reconfiguration" that "sets the client port"17:28
corvus(i don't know whether we're inadvertently doing that or not when we restart a server)17:29
corvusall of that to say, in my mind, there's a decision tree with at least two unresolved nodes determining whether any config files get (re-)written17:29
corvuscluster configuration by quantum superposition17:29
mordredwell - I think the file is going to get transformed regardless of port17:30
mordredcorvus: I agree - we need to just simulate locally17:30
clarkbhave we determined if its the actual config file or not?17:30
mordredthere's no way we're going to reason through it17:30
clarkbor if we set a separate path it will write to a separate file?17:30
clarkb(note about now is when I'm able to monitor the docker-compose thing but will wait until we are in a happy place with zk)17:31
mordredI believe it wants 2 files in all cases - if we put things in the single file, it will helpfully pull out the servers and put them into the second file17:31
corvusmordred: see the text under 'example 2' for the bit about how whether a port is there or not affects whether it writes the dynamic file17:31
mordredyeah - that's a good point17:32
corvusmordred: i agree that there's no way we'll reason about it17:32
mordredalso - assuming that we want to implement their "recommended" way of doing things17:32
corvusmordred: i'm not ready to endorse any conclusions...17:32
mordredwhat a PITA from a config mgmt pov17:32
corvusso far we have not seen it rewrite the main config file when we did not configure a dynamic config file path17:33
corvusthat's the only thing we know :)17:33
corvusi think the best thing to do is for me to go into a hole and set up a 3 node local cluster and try to replicate the problem17:33
corvusthen start changing variables17:33
mordredI mean, in their "preferred" approach - as long as all three nodes are up and running when we run ansible it should be a no-op - but doing a rolling restart at the same time ansible tries to write a config would be potentially highly yuck17:34
mordredcorvus: ++17:34
* mordred supports a corvus hole17:34
fungithe discussions i linked yesterday for the zookeeper operator indicated that zookeeper wants config write access even if told to use a static config17:34
clarkbok should I hold off on docker-compose things or are we reasonably happy with the state here? I ask because those zk nodes are using docker-compose now and should noop but may not?17:34
clarkbI'm like 98% confident the docker-compose upgrade will nop zk17:34
mordredclarkb: I am fairly confident your change will noop the zk nodes17:34
mordredyeah - because zk is already using pip - so it should be a no-op compose up17:35
corvusclarkb: yeah, i think it's worth the risk.  i would just stand by to do a full 'docker-compose down' 'docker-compose up -d' if it's not a noop17:35
clarkbok I'm going to hit approve now then17:35
corvusokay, i'll probably be away for a few hours; exercise and then into the debugging hole17:35
mordredfungi: has anyone in discussions you've read complained loudly about the config writing choices?17:35
mordredbecause if they haven't I might want to17:36
fungimordred: they seemed resigned to their unfortunate fates17:36
fungisomeone probably should bring it up with the zk maintainers. though i assume multiple someones have and i've just not found record of those conversations17:36
clarkbwhy have a separate dynamic config file option if the "static" one needs writing too17:37
fungithough that one issue i linked in turn linked to the bits of the zk source where the write decision is made17:37
clarkb(that seems like a reasonable argument to make to them if this is the case)17:37
* fungi finds again17:37
mordredI mean - ultimately I'm guessing that we're not going to win and will have to also resign ourselves to our unfortunate fates17:38
mordredbut it's one of those decisions that makes running a service with automation harder17:38
fungi"It needs to be able to create a new dynamic configuration file and update the static configuration file to point to the latest configuration (that's for restarts of the server)."17:39
fungiso basically the static configuration file isn't entirely static, it just contains (some) static configuration17:39
clarkbfungi: mordred that code chunk seems like its tracking the dynamic config in the static config17:40
mordredyeah - it seems that the one write operation they want to make is to remove the dynamic config17:40
clarkbI wonder if the issue goes away entirely if we simply set a dynamic config path17:40
mordredclarkb: needEraseClientInfoFromStaticConfig()17:40
mordredI'm fairly certain if we set a dynamicConfigPath and also remove servers from our static config that zk will not touch our static config and will update the member list in the dynamic config as needed17:41
clarkb is that function17:41
clarkblooks like it will simply remove the dynamicConfigFile entry17:42
clarkboh and then it appends dynamicConfigFile to the end17:43
mordredbut only if it needs to erase stuff from the static17:44
clarkbso if we can remove those keys and ensure dynamicConfigFile is set at the end we may avoid problems. I'm not sure we can remove clientPort though17:44
mordredwhy not? we can set it on the end of each server line, no?17:44
clarkbmordred: just because I haven't read enough docs yet17:44
mordredyeah - there's a form that allows you to append to each line17:44
clarkboh but the server line is also checked17:44
clarkb they rewrite everything back out again there ?17:45
mordredyeah. which is why those lines go into the dynamic file17:45
clarkbwell thats all the static file there in that function17:45
clarkbI'm basically trying to figure out if there is a form we can write that will make zk not try and change it17:46
clarkbdynamicConfigFile needs to be the very last key is about as far as I've gotten17:46
mordredyeah - but it only does editStaticConfig if you had dynamic config in the static file in the first place17:46
clarkbmordred: yes but it writes it back out again17:46
mordredbut ony if it had to edit it17:46
clarkbwe can't stop the writing from happening17:46
mordredI think we can17:46
clarkbbut if ansible and zk write the same thing its fine17:46
mordredI think if we don't put the dynamic info into the static file ever17:47
mordredthen ansible will not touch the static file17:47
mordredwe'll still need to write the dynamic file - and zk will also write to that17:47
clarkbmordred: ansible is writing the static conf17:47
mordredyes, I understand17:47
mordredbut what I'm saying is that if we restructure the file17:47
mordredand stop putting the server list in it17:47
mordredthat zk will not desire to rewrite that file17:47
clarkbhow do we tell it what servers are in the cluster?17:48
mordredif we only have ansible write the server list into the dynamic file17:48
mordredand we also have ansible only write that file if it doesn't exist17:48
clarkbok that last bit is what I was missing17:48
mordredbecause once we've written it the first time it's owned by zk - so if we try to write it out during a rolling restart, things will have sads17:48
mordredbecause we'll be fighting zk - but by and large we'd only need to write to that file if we were changing the list of members - and that would be a big thing anyway17:49
mordredin any case - corvus is going to go into a hole and verify these suppositions :)17:49
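A sketch of the split-file layout being discussed, assuming the ZK 3.5+ dynamic reconfig syntax (hostnames, paths, and ports here are illustrative guesses, not the actual opendev config):

```
# zoo.cfg ("static" file, written by ansible) -- no server.N lines,
# with dynamicConfigFile as the last key
tickTime=2000
dataDir=/var/zookeeper/data
dynamicConfigFile=/etc/zookeeper/zoo.cfg.dynamic

# zoo.cfg.dynamic -- written by ansible only if it doesn't exist,
# then owned by zookeeper; clientPort moves onto each server line
server.1=zk01.example.org:2888:3888:participant;0.0.0.0:2181
server.2=zk02.example.org:2888:3888:participant;0.0.0.0:2181
server.3=zk03.example.org:2888:3888:participant;0.0.0.0:2181
```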
mordredclarkb: if you're bored17:50
*** ralonsoh has quit IRC17:53
clarkbmordred: check comment for things17:53
fungii hope corvus brings a torch, we don't need him getting eaten by a grue17:53
mordredclarkb: oh - that's a good point17:54
mordredfungi: do you happen to know the answer to clarkb's comment on 720718 ?17:55
clarkbmordred: I'm looking I think only the things on use the new ssh'd vos release17:57
clarkbmordred: and we've only moved the rsynced things over (since that is ansible managed and setting up reprepro is "involved")17:57
clarkbmordred: so I think what you need to do for your change is either update to use the same ssh thing, move reprepro to and have it ssh, or hold the lock, run reprepro yourself without a vos release, then vos release on the afs server afterwards, then release the lock17:58
clarkbalso we removed all trusty nodes/jobs right?18:03
clarkbI think maybe instead of bumping quota we want to delete trusty first (also should be manual due to sync cost)18:03
clarkbAJaeger: ^ pretty sure you drove that for us and it is all complete now right? (trusty test node removal)18:04
fungimordred: yeah, i left a comment on 720718 just now but it basically repeats what clarkb just said18:04
mordrednod. so yeah - trusty removal first seems like the right choice18:08
mordredor - maybe what we want is to replace trusty with focal in the file18:09
mordredand then do a single sync18:09
AJaegerclarkb: yes, I think we're fine, let me double check quickly18:09
clarkbmordred: you might have write errors if you do that since reprepro deletes after downloading iirc18:09
clarkbmordred: could temporarily bump quota to handle that18:10
clarkbthat might be the quickest option actually since you bundle the big syncs into one sync18:10
AJaegeryes, trusty should be gone. There's still a bit in system-config (sorry, did not read backscroll) but that's all18:12
clarkbAJaeger: ya we have ~3 nodes on it still but we pulled out testing of it so we don't need the afs mirror anymore. Thank you for checking18:12
mordredclarkb: yeah - so we might still want to do the reprepro config as two patches - but bundle it with a single vos release18:14
mordredclarkb: oh - or yeah, bump quota for a minute18:14
mordredoh wow18:17
mordredclarkb: context switching back to puppet real quick ...18:18
mordredclarkb: puppet-beaker-rspec-puppet-4-infra-system-config is mostly testing things that are done in ansible18:18
mordredclarkb: so - I think it's pretty much useless at this point18:18
mordredthe only testing it's doing is the stuff that's defined in modules/openstack_project/spec/acceptance/basic_spec.rb18:19
clarkbmordred: I want to say that may be an integration job too18:19
mordredwhich is basically testing that users we set up in ansible are there18:19
clarkbmordred: so it runs against puppet-foo rspec too ?18:19
clarkbwhen we update puppet-foo18:19
clarkbso its possible we don't need the job on system-config anymore but may not be ready to delete the job itself?18:19
clarkb(double check me on that)18:19
mordredclarkb: nope18:21
mordredclarkb: or - rather - yes - we don't need the job on system-config18:21
mordredwe run puppet-beaker-rspec-puppet-4-infra on puppet-foo changes18:21
clarkbgot it18:22
mordredso - I think we can remove puppet-beaker-rspec-puppet-4-infra-system-config now18:23
mordredand then when I do the change to split remote_puppet_else into service-foo playbooks - that can replace the puppet apply job18:23
mordredand similarly, each one of those jobs can be used in the puppet-foo modules as appropriate18:23
mordredand we can get rid of all of the rspec jobs18:23
mordredand life will be much better18:24
clarkbya the puppet apply job also only does a puppet noop apply18:27
clarkbso if we can actually run puppet it will be an improvement :)18:27
openstackgerritMonty Taylor proposed opendev/system-config master: Remove puppet-beaker-rspec-puppet-4-infra-system-config
openstackgerritMonty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp
mordredclarkb: two easy-ish cleanups to prep for that ^^18:29
openstackgerritMonty Taylor proposed opendev/system-config master: Remove unused rspec tests
mordredand a third18:30
clarkbmordred: oh heh your third change addresses my note in first one18:32
clarkbmordred: the second needs work though (comment inline)18:32
mordredcool - thanks!18:33
clarkbchange for docker-compose update is waiting on nodes. I should have plenty of time to pop out for a few minutes as a result. Back soon18:34
clarkb(the gitea job isn't incredibly quick)18:34
openstackgerritMonty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp
openstackgerritMonty Taylor proposed opendev/system-config master: Remove unused rspec tests
openstackgerritMonty Taylor proposed opendev/system-config master: Start mirroring focal, stop mirroring trusty
clarkb22 minutes for that change to land give or take18:54
mordredfungi: I think we can go ahead and land - we need to do a gerrit restart to pick up the local replication volume anyway18:55
mordredso it would be nice to bundle the restart and get both things18:55
mordred(because of this:
clarkbmordred: that can also transition the container name for us after docker-compose lands18:56
mordredso I think we land 720679 - then docker-compose lands - then when we're happy we do a docker compose restart on review18:57
mordredand we're in pretty good shape18:57
mordredoh - we need to land too18:57
mordredclarkb: any reason to hold off on the +A for that one?18:57
mordredor do we want to wait?18:58
clarkbmordred: I don't think so18:58
clarkbit was just in holding pattern on the docker-compose upgrade18:58
mordredcool. I'm gonna go ahead and poke it18:58
fungimordred: sounds good to me then18:58
fungii mainly didn't want to inadvertently complicate anything else we've got going on18:58
fungitrying not to cross the streams too much18:59
mordredfungi: ++19:11
openstackgerritMerged opendev/system-config master: Install docker-compose from pypi
mordredclarkb: there we go19:12
clarkbmordred: and now we watch the deploy jobs ya?19:13
clarkbhrm you know what just occurred to me does uninstalling packaged docker-compose do something we don't want like stopping the containers too :/19:15
clarkbtesting seemed to show that it didn't because it was the docker-compose-up that happened later that restarted the containers19:15
clarkbI'm just being paranoid now19:15
clarkbgitea-lb seems to have gone well19:15
mordredclarkb: yeah - I don't think it does19:16
mordredit's just a python program that does things with docker api19:16
clarkbmordred: good point19:16
clarkbso ya uninstalling docker may do that but not docker-compose19:16
clarkbin any case is still up and the gitea-lb.yaml log looks as I expected it19:17
clarkbfirst one lgtm19:17
clarkbservice nodepool job failed. Not sure why yet19:19
clarkbUnable to find any of pip3 to use.  pip needs to be installed.19:20
clarkbthat was unexpected19:20
clarkbon nb0419:20
clarkbmordred: ^ do you know why servers like gitea-lb which are bionic would have pip installed but bionic nb04 would not?19:21
clarkbalso this is a gap in our testing because our test images have pip and friends preinstalled19:21
clarkbI think what we may end up seeing here is that newer hosts fail on this error and older hosts are fine19:22
clarkband yes I've confirmed uninstalling docker-compose does not stop containers because nb04 and etherpad are in that state19:23
prometheanfiremordred: mind taking a look at ?19:24
openstackgerritClark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation
fungiclarkb: yeah, odds are our server images don't have the python3-pip package installed19:26
clarkbfungi: ya but why would gitea-lb have it ? different image maybe19:26
clarkbin any case infra-root I think 720820 fixes this problem. Note that we currently don't have docker-compose installed on hosts where this failed. But the existing docker compose'd containers are running19:27
fungiwe deployed that in vexxhost right?19:27
clarkbfungi: oh ya good point19:27
fungiso we probably uploaded a nodepool-built image19:27
clarkbif we need to emergency docker compose things before the fix above lands we can reinstall the distro docker-compose19:27
mordredclarkb: uhm. weird.19:27
mordredclarkb: yeah - I thought pip3 was everywhere - but clearly I was wrong - and our images having that on them sure did mask this didn't it?19:28
clarkbmordred: yup19:28
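A hedged guess at the shape of the fix in 720820 (task names, module choices, and role layout here are assumptions, not the actual change):

```yaml
# Hypothetical sketch: ensure python3-pip is present before the
# install-docker role runs "pip install docker-compose"
- name: Install python3-pip
  package:
    name: python3-pip
    state: present

- name: Install docker-compose from pypi
  pip:
    name: docker-compose
    executable: pip3
```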
clarkbmordred: fwiw meetpad job returned success but it didn't seem to update containers there19:28
clarkb"no hosts matched" ok that explains that one19:29
mordredPLAY [Configure meetpad] *******************************************************19:29
mordredskipping: no hosts matched19:29
clarkbzk was success and that should've nooped. Checking now19:29
mordredclarkb: oh - is meetpad in emergency?19:30
clarkbmordred: it must be19:30
clarkbzk looks good19:30
clarkbso far only the pip issue19:30
clarkbnb04, etherpad.opendev, docker registry, and zuul-preview all failed on the pip3 missing thing. gitea-lb succeeded as did the zookeeper hosts. I expect review, review-dev, and gitea to all succeed as they are older and/or on vexxhost19:33
openstackgerritMerged openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin
fungithose ^ get applied from promote pipeline jobs now, right?19:33
clarkbfungi: deploy pipeline19:34
fungioh, right!19:34
fungii forgot we added a separate pipeline for that19:34
clarkbmordred: hrm does manage-projects use docker-compose in a way that may pose a problem here?19:34
clarkbthe gerrit ACLs change has queued up the manage-projects job19:35
fungiyep, i see that. cool19:35
clarkbok we use docker run not docker-compose for manage projects so that should be fine19:36
clarkbit won't try to use the wrong container name19:36
clarkbif we did docker exec or docker-compose for manage-projects that could be different19:36
clarkb720820 exposes that we don't run docker role consuming jobs on docker role updates. That's another job fix I should figure out19:38
clarkbinfra-root once gitea runs and shows gitea01 (it should be first) is happy I'm going to work on lunch while waiting for the fix to get tested and reviewed19:39
clarkbif you need to make changes to the fix or take different direction feel free19:39
clarkbbut then because the fix is in the docker role and our jobs may not be set to trigger off that role updating we may need to run the playbooks for these services manually:19:39
mordredclarkb: (we should add the pip3 role to things that have files depends on the install-docker role now too)19:40
clarkbservice-nodepool.yaml, service-etherpad.yaml, service-meetpad.yaml (needs to be removed from emergency or we can wait on this one), service-registry.yaml, service-zuul-preview.yaml19:40
clarkbmordred: ++ so we need to do the docker role and the pip3 role19:40
* mordred will make a patch19:41
clarkbdoes bridge unping for anyone else?19:42
clarkbI can't ping or ssh to it and my existing ssh connection seems to have gone away?19:42
clarkband now it reconnects that was weird19:43
clarkbuptime shows it didn't reboot19:43
clarkband we didn't OOM19:43
clarkb "msg": "Timeout (32s) waiting for privilege escalation prompt: " <- review-dev failed on that19:44
clarkbpossibly due to the same network connectivity issue?19:44
clarkb is running the new containers and is happy19:45
openstackgerritMonty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers
clarkbso I think review-dev and meetpad were the odd ones. review-dev due to networking to bridge going away? and meetpad due to being in emergency. All the other failures need pip3 to be installed19:46
mordredclarkb: woot19:46
clarkbgitea, gitea-lb, review, and zk are all happy19:46
clarkbok I think things are stable so I'm finding lunch now. Holler if that assumption is bad :)19:48
fungiTimeout exception waiting for the logger. Please check connectivity to []19:48
clarkbfungi: thats normal because we don't run the zuul log streamer on bridge19:49
fungiseen in a infra-prod-service-gitea run19:49
fungigot it19:49
clarkbfungi: if you want to see the logs you need to go to bridge /var/log/ansible/service-$playbook.yaml file19:49
fungiso those are expected19:49
clarkbservice-gitea.yaml.log for gitea19:49
openstackgerritMerged opendev/system-config master: Use HUP to stop gerrit in docker-compose
clarkbI was tailing it earlier when I confirmed gitea01 was done and happy19:49
openstackgerritMerged opendev/system-config master: No longer push refs/changes to GitHub mirrors
mordredafter those run ^^ we'll be good to restart gerrit19:51
AJaegerinfra-root, this inap graph looks really odd
clarkbcorvus: I know you are heads down in other things, but are you good for us to remove meetpad from the emergency file?19:51
clarkbAJaeger: ya its because nova isn't deleting instances there reliably19:52
clarkbAJaeger: if you expand it to go back 2 days you'll see it happening more often19:52
clarkbok really finding lunch now. Back soon :)19:53
AJaegerthanks, clarkb - enjoy lunch!19:53
corvusclarkb: yes can remove meetpad19:55
corvusclarkb, mordred: should i read scrollback or skip it?19:56
corvusclarkb, mordred, fungi: i believe i have created a reasonable local facsimile of our prod env -- same ownership and volume structure, etc.  i'm seeing the same errors about dynamic config, etc.  i wrote a test script to continually write data to zk to simulate the cluster continuing to handle requests when one member leaves.  i have yet to see it fail when i do a rolling restart.  i've done several.19:58
fungicorvus: there was some discussion about the bits of the zk source around the function writing to the "static" config but probably no new insights19:58
mordredcorvus: well that's not thrilling18:58
mordredcorvus: yeah - I think we mostly just looked at the source and then pondered - but ultimately concluded "corvus will figure out reality"19:59
corvusmy assumption for the moment is that whatever is causing the stale session issues is not related to the dynamic config19:59
corvusi'm starting to wonder if it's a client issue19:59
corvusi made sure to use the same kazoo version, under py3, that we're using on nl0120:00
corvusbut maybe i should spot check that elsewhere -- maybe it's, say, only the scheduler that's hitting that problem20:00
fungiand i guess we ended up with newer kazoo in the containers?20:00
clarkbcorvus: we hit a speedbump on the docker compose thing. not all servers have pip installed. for the servers that did update docker-compose everything is happy20:01
corvusfungi: at the moment, the only zuul component running in containers is nb0420:01
clarkbfix for pip has been approved and will retrigger jobs (or manually run playbooks) once it lands20:01
mordredclarkb: is the followup with the file trigger updates20:01
corvusclarkb: drat.  i'm still sad we have to install pip :(20:01
corvusoh, speaking of nb04 -- this happens when i try to exec:20:02
corvusroot@nb04:/var/log/nodepool# docker exec -it nodepoolbuildercompose_nodepool-builder_1 /bin/sh20:02
corvusOCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "open /dev/ptmx: no such file or directory": unknown20:02
mordredcorvus: oh - that's ... what?20:02
corvusyeah, you can imagine my delight at having a system component turn into a black box i can't access20:02
clarkbdrop the -it maybe?20:03
clarkbcant really shell in that case20:04
corvusyeah, it was really the interactive shell i was after20:04
mordredno solution20:05
corvusi wonder if dib mucked it up?20:05
fungioh, yeah, i guess kazoo hasn't changed... has the version of zk we're deploying in the containers changed? and you're theorizing that the older kazoo has issues with newer zk?20:05
corvusfungi: i've yet to find a version of kazoo in use other than 2.7.0, but i'm still looking.  we have definitely upgraded zk.20:06
fungigot it20:07
corvusmordred: and of course the 'workaround' in that report doesn't work for 'exec', only for 'run'20:08
corvus2.7.0 is the newest kazoo, so i'll just assume that's what nb04 has20:09
fungiseems probable20:10
corvusevery zuul component is using kazoo 2.7.0 except nb03 which is using 2.6.120:11
mordredcorvus: I checked on nb04 - devpts is mounted in the right place, /dev/ptmx is as expected and I don't see where dib would have broken it20:12
mordredBUT - dib does some things with devpts - so it's entirely possible dib did a bad20:13
mordredcorvus: neat. I tried running a non-interactive command and got:20:14
mordredOCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "close exec fds: open /proc/self/fd: no such file or directory": unknown20:14
corvusmaybe we want to restart (or reboot) and see what it looks like when it starts20:15
corvusthat may give us a clue if it's some dib cleanup task or something20:15
clarkbmordred: oh good the infra-prod jobs run when install docker is updated20:15
clarkbmordred: so we won't need to manually trigger jobs once the fix lands20:15
clarkbcorvus: note that nb04 is one of the hosts without docker-compose currently installed20:16
corvusclarkb: ack.  but i'm using plain docker commands20:16
corvusclarkb: oh, you're warning me not to restart it right now :)20:16
corvusmessage received20:16
corvus(or, at least, don't use dc to restart it)20:17
corvusi've rerun my test with zk 2.6.1 -- same results20:17
clarkbalso if you look at zuul status for deploy pipeline right now I think it's doing a thing we didn't expect it to?20:17
clarkbthere are two changes in the pipeline and the second change is running jobs before the first has finished20:17
corvusah, yup, we seem to be sharing the mutex between the two.20:18
corvusi wonder if we can turn this into a dependent pipeline with a window of 120:19
corvusthe main thing would be to look into the merge check20:19
clarkbmordred: pip fix breaks on xenial?
fungiclarkb: i saw the same a little bit ago. i thought the mutex was supposed to wind up serializing them in the item enqueue order20:20
fungibut that doesn't appear to be the case20:20
fungiso, yeah, window of 1 i guess will be better than possible out-of-sequence deployments20:21
corvusmaybe our mutex wakeups are random20:22
clarkbmordred: I think maybe this isn't necessary on xenial. So we can fix pip3 too20:23
clarkbI'm testing it locally in a xenial container and will push fix if I think it will work20:25
mordredclarkb: I agree - I think it isn't necessary on xenial20:26
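The xenial special-case could plausibly be expressed as a conditional on the task (hedged sketch; the fact used and the exact guard are assumptions, and per the later discussion the eventual fix also involved a distutils tweak):

```yaml
# Hedged sketch: only install the python3-pip package where it is
# actually needed; xenial's pip3 situation differs from bionic's
- name: Install python3-pip
  package:
    name: python3-pip
    state: present
  when: ansible_distribution_release != 'xenial'
```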
corvusclarkb, fungi: i'm still surprised about that.  we should release the semaphore before processing the queue, and the queue processing should happen in order, so i'd expect each job for the first change to get it in order, then each job for the second change.  unless one of the jobs on the first change didn't specify the semaphore?20:26
mordredcorvus: the semaphore should be on the base job20:27
corvuswe don't show nearly enough job info in the web ui20:27
mordredyeah. anything parented on infra-prod-playbook20:27
mordredthat's where we're declaring use of the semaphore20:28
mordredoh! interesting20:28
mordred    semaphore: infra-prod-service-bridge20:28
openstackgerritClark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation
mordredwe have one job that declares a non-existent semaphore20:28
mordredthat is a different semaphore20:28
corvusmordred: which job?20:28
clarkbmordred: corvus fungi has been updated to handle xenial if you have a moment between thinking about all the other things :)20:29
openstackgerritMonty Taylor proposed opendev/system-config master: Remove semaphore from service-bridge
mordredcorvus: infra-prod-service-bridge20:29
fungitaking a look20:29
clarkbinfra-root should we start considering making an order of changes to land?20:29
corvusmordred: ok.  i don't think that job was involved here.20:30
corvusyeah, our problem set has exploded again20:30
corvusdocker-compose is uninstalled; semaphores may run out of order; something about zk is weird when rolling restart; nb04 /dev in container is hosed20:31
openstackgerritMonty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers
corvusdid i miss anything? :)20:31
mordredcorvus: I think that's about it20:31
mordredcorvus: also - luckily for us, 3 of those problems we don't really understand20:32
corvusokay, we gotta find a way to avoid installing docker-compose from pip in the future -- this whole sequence of "oops we don't have pip3 on this distro" was exactly the business that we got out of... for about 10 minutes.20:32
fungicorvus: so what i observed earlier (but was refraining from interrupting other discussion with) is that 720235,2 had a waiting infra-prod-manage-projects build, but 719051,8 which was enqueued into the deploy pipeline after it, started running infra-prod-service-review (those share a semaphore, right?)20:32
fungiafter infra-prod-service-review completed for 719051,8, infra-prod-manage-projects started running for 719051,8 ahead of it20:33
fungier, for 720235,2 ahead of it20:33
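corvus's window-of-1 idea could be sketched roughly like this (pipeline name and the exact option set are assumptions; the window growth settings would likely also need pinning so the window cannot expand on success):

```yaml
# Rough sketch of a strictly serialized deploy pipeline: a dependent
# manager processes items in enqueue order, and window: 1 with a
# floor of 1 keeps only one item active at a time
- pipeline:
    name: deploy
    manager: dependent
    window: 1
    window-floor: 1
```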
clarkbcorvus: ya I'm not sure what the proper answer is there. One crazy idea I had was running docker-compose from docker, but I imagine that will need testing20:34
clarkb(and generally exposing the docker command socket to docker containers seems dirty)20:34
clarkb I've filled in the docker-compose related items and put spots for the other things if people have things in flight to track20:36
fungiyay! etherpad is snappy again!20:39
fungii've heard no complaints about it after the tuning config got added back, fwiw20:40
mnaseruh i feel bad about bothering with this, but it seems like i got a buildset stuck in the vexxhost tenant again somehow..20:48
mnaser -- its been around for 3h10m -- even when i +W it to kick it straight into gate, it is still there20:48
clarkbmnaser: I think the inap issues are persisting20:48
clarkblet me see what that job is waiting on20:48
mnaserwill it fail to dequeue as well?20:49
clarkbI don't think so but dequeing won't really help necessarily20:49
mnaserright, but if i +W it, shouldn't it remove it from check and kick it straight to gate20:49
clarkbmnaser: depends on how your pipeline is set up20:50
openstackgerritArun S A G proposed opendev/gerritlib master: Fix AttributeError when  _consume method in GerritWatcher fails
mnaserim pretty sure we're using the one similar to opendev/zuul so go-straight-to-gate20:50
clarkbfwiw those jobs don't seem to be blocking on inap20:52
clarkband two of them just started20:52
clarkbstill trying to figure out what they were hung up on20:52
clarkblooks like rax-iad-main had it20:53
clarkbfor ~3 hours20:54
clarkbso its the same behavior we had with inap but in rax20:54
clarkbwe end up with a lot of active requests but they aren't being fulfilled quickly (due to what I think are quota accounting issues)20:54
clarkband check would be sorted last so that probably contributes to it, though the neutron case was in the gate20:55
clarkb shows iad being sad20:55
clarkbseems to be recovering now though20:55
corvusi guess we can add that to the list of fires21:01
corvusalso, we should stop logging the "could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance." exception21:01
corvusthe builder logs are pretty unreadable21:01
clarkbcorvus: fwiw I think that may just be "normal" cloud things. Addressing that in nodepool will be complicated I think21:05
clarkb(its hard to work around when the cloud isn't giving us accurate info)21:05
clarkbbut I can dig into that again monday and make sure there isn't something else going on21:05
corvusclarkb: it would be good to have a clear idea of what's going on.  we already expect openstack to lie to us about server deletions.  if it's also lying about quotas, etc, it'd be good to know21:06
corvusclarkb, mordred: is there a way to get at the docker logs from the previous run of a container?21:10
clarkbcorvus: if they go to systemd I think so21:10
clarkband i Think they do by default /me looks21:10
clarkboh maybe it isn't default21:11
clarkbcorvus: internet says do docker logs with the container id21:12
clarkband i believe you can get historical container ids from dockerd logs21:12
clarkbok the distutils thing fixed the pip change21:13
clarkbnow we wait for it to gate21:16
*** hashar has joined #opendev21:20
corvusclarkb: ah, docker-compose down deletes the container, and once it's gone docker logs $containerid doesn't work21:23
corvusbut everything is going into the journal, so that'll do for now21:23
clarkboh good its in the journal anyway21:23
clarkbcorvus: how do you get it out of the journal?21:23
corvusclarkb: i'm just using journalctl -u docker.service21:24
clarkb(it's useful to know that bit of info)21:24
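For the record, with the journald logging driver the per-container history survives `docker-compose down` and can be filtered directly (hedged example; selecting that driver is an assumption, not what's configured on nb04):

```
# /etc/docker/daemon.json -- opt the daemon into the journald driver
{ "log-driver": "journald" }

# then logs from removed containers remain queryable by name:
#   journalctl CONTAINER_NAME=nodepoolbuildercompose_nodepool-builder_1
```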
openstackgerritMerged opendev/system-config master: Remove semaphore from service-bridge
clarkbmordred: ^ some progress21:28
mordredclarkb: woot!21:44
*** hashar has quit IRC21:46
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
mordredclarkb: I have verified that the docker-compose on review is the pip version and the docker compose file has both outstanding changes in it22:00
mordredclarkb: so we should be well positioned to restart whenever we decide it's a good time to do that22:00
clarkbmordred: cool22:01
clarkbat this point I think my ability to debug more things is waning22:01
clarkbwanting to wrap up the outstanding things22:01
mordredI recorded that we're ready to do that whenever in the etherpad22:02
corvusi'm digging through zk server logs and reading docs and bug reports to try to come up with a new hypothesis22:12
openstackgerritMerged opendev/system-config master: Install pip3 for docker-compose installation
* mordred is going to pay attention to those22:16
mordredclarkb: the list of services that need the pip/compose update in the etherpad is the list of jobs that just got triggered - so that particular thing should be done once this runs22:18
clarkbmordred: cool and I'm around paying attention too22:18
openstackgerritMerged opendev/system-config master: Add install-docker and pip3 to files triggers
clarkbnb04 looks happy now22:29
clarkbalso docker ps -a shows a lot of old docker containers there22:29
clarkbI think we need to get in the habit of doing docker run --rm ?22:30
clarkbmordred: ^ you probably have ideas on that22:30
mordredclarkb: hrm22:32
mordredclarkb: I wish I knew why that container was unhappy in the first place22:32
mordredclarkb: oh - yeah - I always do --rm when I do run22:33
mordredclarkb: think we should clean those up real quick?22:33
clarkbmaybe? it could be part of corvus' debugging and we should have corvus confirm first?22:34
clarkbbut ya I think cleaning up would be a good idea22:34
mordredclarkb: ++ - most of those look like utility images from weeks ago22:34
mordredclarkb: docker ps -a | grep Exited | awk '{print $1}' | xargs -n1 docker rm22:34
clarkbetherpad just restarted22:35
clarkb is still working for me22:35
clarkbthis is all looking good \o/22:35
clarkboh I forgot to remove meetpad from emergency22:39
clarkbmordred: thoughts on ^ should I just remove it now or wait for money?22:39
clarkb*monday. money is nice too22:39
clarkbdocker registry looks happy now too22:40
mordredclarkb: I think we can remove it - I don't think there were any reasons not to22:40
clarkbmordred: I guess my only concern is if there were other changes and they weren't happy at this point22:41
clarkbbut since the service isn't in prod its probably fine22:41
clarkbI'll remove it now so I don't forget further22:41
mordredyeah. and corvus acked that it was ok earlier22:41
corvusi did not do any docker runs22:41
corvusonly exec22:41
clarkbcorvus: rgr so ya we should be able to clean up all those containers mordred22:41
mordredkk. removing22:41
clarkbmeetpad01 has been removed from emergency file22:42
clarkbI'll put further debugging of this nodepool "slowness" high on my list for monday22:43
clarkbsince people keep noticing it so its definitely frequent and painful22:43
clarkbzookeeper play failed on AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'22:44
mordredhrm. that's weird22:44
corvuswe may have run in a v6 only cloud?22:45
clarkbcorvus: this was against prod22:45
clarkb/var/log/ansible/service-zookeeper.yaml.log for the logs22:45
clarkb* on bridge22:45
clarkbzuul-preview seems good though I'm trying to find a change I can confirm that with via zuul dashboard22:47
* clarkb looks for zuul website change22:47
mordredthat var shows up when I run setup and is also in the fact cache for those hosts22:47
clarkbmordred: is it maybe the lookup path?22:48
fungiclarkb: did we already switch the zuul website preview to using the zuul-preview service?22:49
fungii thought it wasn't yet (at least as of a week-ish ago)22:49
clarkbfungi: no I thought we did but the job artifact errors with bad urls22:49
clarkband its because its at ovh's swift root not zp0122:49
corvusre zk cluster probs: i think we're looking at a server issue of some kind.  it seems like when we kill the leader, that the new leader begins a new 'epoch' (which i think appears as the first character of the zxid in the logs -- that's why 0xd00000000 showed up -- epoch 0xd); my limited understanding is that should become the first zxid committed after the leader election, and then all the followers22:49
corvusshould get that.  we're seeing clients connect having seen that zxid, but then the followers they connect to don't seem to have it.22:49
mordredI can reproduce the ansible_default_ipv4 issue with a simple playbook - poking at combos to see what works and doesn't22:51
clarkbmnaser: if you happen to still be around did you have any zuul preview using changes we can test with (I thought you had something)22:52
mordredso - if I run a playbook targeting a host that wants to get zk02's hostvars, but zk02 hasn't ever done anything in the playbook, it fails22:52
clarkbmordred: oh so we should add an explicit setup call across those hosts maybe?22:52
mordredbut if I run something, _anything_ on zk02 first - the hostvars are there22:52
mordredwe don't even need a setup call22:52
mordreda debug call suffices22:53
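[editor's note] What mordred describes (any task touching a host makes that host's cached facts visible in hostvars) can be sketched as a noop play; this is an illustrative guess at the pattern, not the change that merged:

```yaml
# Hypothetical sketch: run a trivial task against every zookeeper host
# so Ansible loads each host's cached facts into hostvars before a
# later play dereferences hostvars[host]['ansible_default_ipv4'].
- hosts: zookeeper
  gather_facts: false
  tasks:
    - name: Noop to populate hostvars from the fact cache
      debug:
        msg: "priming facts for {{ inventory_hostname }}"
```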
corvusclarkb: try the zuul-website gatsby wip patch?22:53
clarkbcorvus: thats what I pulled up but the url there is for ovh swift root22:53
mordredit doesn't need to fetch new facts22:53
clarkblet me see if there was a different url I should use22:53
clarkb is the build for that I think22:54
clarkbnote the site preview url is straight to
mnaserclarkb: the zuul website changes should be good for that22:55
mnaseror single change. I haven’t gotten around to finalizing that22:55
clarkbmnaser: is what I'm looking at for that is that wrong?22:55
mnaserclarkb: yes that’s the right one22:55
clarkbmnaser: ok the site preview for that is straight to the ovh swift files not zp22:56
clarkband that doesn't work (as expected)22:56
clarkbmaybe I need to manually construct the zp url?22:56
fungiright, like i said, i don't think the zuul-web previews are using zuul-preview (yet)22:56
mnaserclarkb: yeah I haven’t pushed up a patch to return that as an artifact. I have to return both22:56
clarkbmnaser: gotcha, do you know what the url format is in that case?22:56
openstackgerritMonty Taylor proposed opendev/system-config master: Run a noop on all zookeeper servers first
mordredcorvus, clarkb : ^^ that should fix the zookeeper thing22:57
clarkbmnaser: corvus thanks! and that seems to work for me so I think zp is good22:58
mordred(the playbook being unhappy - not the important zk thing)22:58
clarkb is the format. Thanks corvus22:58
clarkbmordred: should you add !disabled to that?22:58
corvusclarkb: there's a comment explaining why not :)22:58
mordredclarkb: no - left that off on purpose (and wrote a comment explaining)22:58
clarkbheh I should read22:59
corvusmordred: that's super weird that it works with --limit22:59
mordredcorvus: I agree22:59
mordredI think it's a super weird behavior in general22:59
corvusmordred: i guess it's some sort of "well, since it's limited, we know we're not going to update the data, so we should just start with the cache"22:59
corvusmordred: but also, it could just be "no one understands this"22:59
mordredfwiw - /root/foo.yaml on bridge is what I used to verify23:00
corvusi'm pretty sure the zookeeper images on dockerhub are being rebuilt with the same tags23:01
corvus3.6.0 is still the only 3.6, but it's 11 hours old23:01
corvusand i know we ran a 3.6.0 longer ago than that23:02
clarkbeverything succeeded but zk in that pass and zk failed for unrelated reasons and is already running newer docker-compose23:02
* clarkb updates etherpad but things seem happy now23:02
corvuswhat happened the last time we tried 3.6.0?23:03
mordredclarkb: what's the cantrip for making a fake rsa key for test data?23:03
clarkbmordred: ssh-keygen -p'' ?23:03
clarkbmordred: zuul quickstart should have it for gerrit things23:04
clarkbcorvus: I don't remember being around for that, but could it have been upgrade concerns?23:04
mordredclarkb: thanks23:04
clarkblike maybe 3.4 -> 3.6 isn't doable in rolling fashion?23:04
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
corvusclarkb: it is doable, but something was preventing a quorum from forming on 3.623:07
mordredcorvus: I understand the applytest race condition. I think I can live with it until I rework that job23:09
fungimordred: clarkb: ssh-keygen -p'' just sets the private key to not encrypted. are you looking for something like gnupg's --debug-quick-random option for creating insecure test keys?23:10
fungior are you really just looking for a key which doesn't require a passphrase to unlock?23:11
corvusi think maybe tomorrow we might want to do some testing-in-prod on the zk cluster, because i can't replicate the problem locally, nor do i see any problem running 3.6.0 (at least, the latest version of that image)23:16
*** tosky has quit IRC23:21
openstackgerritMonty Taylor proposed opendev/system-config master: Make applytest files outside of system-config
mordredcorvus: I support that23:21
mordredcorvus: also - I decided I was too annoyed by the applytest race - so that ^^ should fix it23:21
mordredfungi: really just needed some rsa key data to put into the testing "private" key hostvars so that the role would write something to disk in the integration test jobs23:22
mordredcorvus: I mean - assuming that runs at all - I think it should fix the race :)23:22
mordredfungi: if you have some brainpellets - 720848 could use some eyeball powder23:23
mordredfungi: for context - we keep seeing occasional failures like:
fungimordred: got it, so one-time key generation, not rapid/repetitive key generation in a job23:23
mordredfungi: yeah23:24
mordredssh-keygen did the trick23:24
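[editor's note] For the record: `-p` changes the passphrase on an existing private key, while `-N ''` creates a new key with an empty passphrase; a throwaway test key (file names here are arbitrary) can be made with:

```shell
# Generate an unencrypted RSA keypair for test fixtures only -- a key
# like this must never guard anything real. -N '' = empty passphrase.
keydir=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -C 'test-only' -f "$keydir/testkey"
ls "$keydir"
# testkey (private half) and testkey.pub (public half)
```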
fungimordred: yeah, first instinct on that is some sort of race on directory creation/deletion23:27
openstackgerritMerged opendev/system-config master: Run a noop on all zookeeper servers first

Generated by 2.15.3 by Marius Gedminas - find it at!