openstackgerrit | Ian Wienand proposed openstack/project-config master: Add centos aarch64 to labels https://review.opendev.org/720619 | 00:00 |
*** DSpider has quit IRC | 00:04 | |
ianw | why that is not trying to build has me stumped right now | 00:23 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add centos aarch64 to labels, unpause https://review.opendev.org/720619 | 00:25 |
ianw | yeah, unpausing it will help | 00:26 |
ianw | i've applied that manually and am watching an initial build | 00:26 |
*** ysandeep|away is now known as ysandeep | 01:03 | |
openstackgerrit | Merged openstack/diskimage-builder master: Add centos aarch64 tests https://review.opendev.org/720339 | 01:18 |
mnaser | infra-root: ok, i have one issue right now, i _could_ work around it by abandon/restore but maybe useful for someone to take a look and see why? http://zuul.opendev.org/t/vexxhost/status 720595,6 has been stuck for 2h18m (and new jobs are starting inside openstack tenant so its not a lack of nodes)... | 01:19 |
clarkb | mnaser: it might be the inap issue we saw earlier today | 01:19 |
clarkb | mnaser: basically we seem to leak enough nodes there due to node deletes that report success but never complete, and that breaks quota accounting so we over-attempt to boot instances in inap | 01:20 |
clarkb | and basically it delays things | 01:20 |
mnaser | clarkb: does nodepool enforce building vms in the same provider? | 01:20 |
clarkb | mnaser: only within a job | 01:20 |
mnaser | aaaah, so maybe it keeps trying to get an inap job | 01:21 |
clarkb | I dont think its repeatedly trying | 01:21 |
clarkb | its just waiting for inap to actually delete the nodes it said it deleted so new ones can boot | 01:21 |
mnaser | clarkb: ah, so probably just best to sit and wait and if it's still around for much longer then maybe abandon/restore | 01:22 |
openstackgerrit | Merged openstack/project-config master: Add centos aarch64 to labels, unpause https://review.opendev.org/720619 | 01:23 |
clarkb | if it persists longer we should maybe have deletes poll more or disable inap or something | 01:24 |
clarkb | its a weird behavior and seems new but I spent a chunk of the morning tracing it through and pretty sure root cause is nova delete says "yes I succeeded" but then the server persists for a long time | 01:24 |
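(A rough way to spot this kind of leak, assuming shell access to the nodepool launcher and cloud credentials for the provider; the cloud name here is illustrative:)

```bash
# nodes nodepool still thinks are deleting in the affected provider
nodepool list | grep inap | grep -i deleting

# compare against what the cloud itself still reports as existing
openstack --os-cloud inap server list
```

If those two views stay out of sync for a long time, the launcher's quota accounting is working from stale data and will keep over-requesting.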
mnaser | clarkb: the queued jobs do depend on the registry that's paused, so maybe that contributes to it? | 01:25 |
clarkb | the one I looked at today was a tempest job so no docker bits | 01:25 |
mnaser | ah ok, the paused one is in our cloud right now | 01:26 |
clarkb | oh hrm do required jobs like that end up in the same cloud? | 01:27 |
clarkb | fwiw Im not at computer so cant debug directly but otherwise sounds similar to the inap thing from today | 01:28 |
mnaser | clarkb: i dont know if required jobs like that end up in the same cloud, but im curious to know. but yeah, if you saw that behaviour earlier then might be good to leave it for someone to have a look at it later | 01:29 |
corvus | mnaser: yeah, jobs that depend on paused jobs request nodes from the same provider | 02:06 |
corvus | (with a bump in priority to try to speed things up) | 02:07 |
*** ysandeep is now known as ysandeep|afk | 02:21 | |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-subunit-output test: use ensure-pip https://review.opendev.org/718225 | 02:42 |
prometheanfire | ianw: have time for https://review.opendev.org/717339 ? | 02:54 |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-tox: use ensure-pip role https://review.opendev.org/717663 | 02:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [dnm] test with plain nodes https://review.opendev.org/712819 | 02:59 |
prometheanfire | thanks | 03:04 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: nb03 : update to arm64 to inheritance, drop pip-and-virtualenv https://review.opendev.org/720641 | 03:32 |
*** ysandeep|afk is now known as ysandeep | 04:22 | |
*** kevinz has joined #opendev | 04:42 | |
*** ysandeep is now known as ysandeep|reboot | 04:49 | |
*** ykarel|away is now known as ykarel | 04:51 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 04:51 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 04:51 |
*** ysandeep|reboot is now known as ysandeep | 04:53 | |
*** mnasiadka has quit IRC | 05:10 | |
*** elod has quit IRC | 05:10 | |
*** mnasiadka has joined #opendev | 05:15 | |
*** elod has joined #opendev | 05:15 | |
openstackgerrit | Merged openstack/project-config master: Add ubuntu-bionic-plain to all regions https://review.opendev.org/720316 | 05:47 |
*** ysandeep is now known as ysandeep|brb | 05:49 | |
*** ysandeep|brb is now known as ysandeep | 06:12 | |
*** Romik has joined #opendev | 06:21 | |
*** Romik has quit IRC | 06:33 | |
*** Romik has joined #opendev | 07:00 | |
*** jhesketh has quit IRC | 07:04 | |
*** rpittau|afk is now known as rpittau | 07:19 | |
*** tosky has joined #opendev | 07:30 | |
*** Romik has quit IRC | 07:35 | |
*** ralonsoh has joined #opendev | 07:38 | |
*** DSpider has joined #opendev | 07:40 | |
*** ysandeep is now known as ysandeep|lunch | 07:57 | |
openstackgerrit | Merged openstack/project-config master: nodepool: Add more plain images https://review.opendev.org/720318 | 08:25 |
*** ysandeep|lunch is now known as ysandeep | 08:25 | |
*** ykarel is now known as ykarel|lunch | 08:26 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 08:45 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 08:45 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 08:45 |
AJaeger | ianw: rebased and fixed the failure ^ | 08:45 |
*** ysandeep is now known as ysandeep|afk | 09:21 | |
*** ykarel|lunch is now known as ykarel | 09:25 | |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations https://review.opendev.org/720673 | 09:42 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: No longer push refs/changes to GitHub mirrors https://review.opendev.org/720679 | 10:00 |
ttx | corvus, mordred, fungi: ^ as discussed | 10:01 |
*** rpittau is now known as rpittau|bbl | 10:30 | |
*** hashar has joined #opendev | 11:43 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Remove pypy job from x/surveil https://review.opendev.org/720699 | 12:08 |
*** Romik has joined #opendev | 12:13 | |
openstackgerrit | Merged openstack/project-config master: Remove pypy job from bindep https://review.opendev.org/720543 | 12:17 |
*** rpittau|bbl is now known as rpittau | 12:19 | |
hashar | hello | 12:25 |
*** Romik has quit IRC | 12:28 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:30 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 12:34 |
hashar | I have an interesting use case for octopus merging a couple changes | 12:34 |
hashar | CI for the jjb/jenkins-job-builder repository is broken | 12:35 |
*** ykarel is now known as ykarel|afk | 12:35 | |
*** ysandeep|afk is now known as ysandeep | 12:35 | |
hashar | err wrong repository. I mean jjb/python-jenkins | 12:36 |
hashar | the py27 job is broken due to stestr 3.0.0 which is fixed by blacklisting it ( https://review.opendev.org/719073 ) | 12:36 |
hashar | the pypy job is broken for some reason and the job is removed by https://review.opendev.org/719366 | 12:37 |
hashar | and of course, each change has a build failure because of the other change not being around | 12:37 |
hashar | I can't depend-on one or the other since that still would cause one of the builds to fail | 12:37 |
hashar | A -> B (A fails because B fix is not there) | 12:37 |
hashar | B -> A (B fails because A fix is not there) | 12:38 |
hashar | but I could create an octopus merge of A and B to the branch which should pass just fine | 12:38 |
hashar | which I could potentially CR+2 / W+1 and get submitted by Zuul. But, I guess Gerrit is not going to merge it because the parents A and B lack the proper votes ;] | 12:39 |
*** ykarel|afk is now known as ykarel | 12:39 | |
AJaeger | hashar: merge the changes together ;) | 12:43 |
hashar | ! [remote rejected] HEAD -> refs/for/master (you are not allowed to upload merges) | 12:43 |
hashar | :( | 12:43 |
hashar | yeah I will do a single change instead | 12:44 |
hashar | thx | 12:44 |
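(For reference, one way to build that single change: a rough sketch, assuming git-review is set up for jjb/python-jenkins; the patchset suffixes on the change refs are placeholders, not values from the discussion.)

```bash
# start from master and pull in both pending fixes (patchset numbers "/1" are placeholders)
git checkout -b combined-fixes origin/master
git fetch https://review.opendev.org/jjb/python-jenkins refs/changes/73/719073/1 && git cherry-pick FETCH_HEAD
git fetch https://review.opendev.org/jjb/python-jenkins refs/changes/66/719366/1 && git cherry-pick FETCH_HEAD
git rebase -i origin/master   # squash the two cherry-picks into one commit
git review                    # upload the result as a single change
```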
openstackgerrit | Merged opendev/system-config master: Install kubectl via openshift client tools https://review.opendev.org/707412 | 12:49 |
openstackgerrit | Merged opendev/system-config master: Remove snap cleanup tasks https://review.opendev.org/709293 | 12:51 |
ttx | corvus, mordred for asynchronously getting rid of remote refs/changes, looks like the following shall do the trick (assuming all repos are listed in github.list): | 12:52 |
ttx | for i in $(cat github.list); do echo $i; git push --prune ssh://git@github.com/$i refs/changes/*:refs/changes/* 2>&1 | wc -l; done | 12:53 |
ttx | the wc -l trick in there is to roughly count the deleted refs as you go. git push --prune displays those on stderr | 12:53 |
ttx | That is what I propose to run after https://review.opendev.org/720679 | 12:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 13:13 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:13 |
mordred | ttx: cool! | 13:13 |
ttx | I mean, seriously... stderr | 13:16 |
ttx | git why do you hate unix | 13:17 |
*** ykarel is now known as ykarel|afk | 13:31 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused gerrit puppet things https://review.opendev.org/714001 | 13:33 |
mordred | fungi, frickler : if you have a sec, easy review: https://review.opendev.org/#/c/720030/ | 13:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove old etherpad.openstack.org https://review.opendev.org/717492 | 13:35 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:40 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 13:40 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 13:40 |
mnaser | corvus: ok cool, that adds up, thanks for the info | 13:45 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Start mirroring focal https://review.opendev.org/720718 | 13:49 |
AJaeger | merci, hashar | 13:49 |
openstackgerrit | Monty Taylor proposed openstack/project-config master: Start building focal images https://review.opendev.org/720719 | 13:53 |
hashar | AJaeger: you are welcome :] | 13:54 |
mordred | corvus: looking towards using your zk roles in the nodepool test jobs I realized I need to be able to write out the correct zookeeper hosts (will need the same in the zuul jobs) ... so I tried something in 720527 - I'm not 100% sure I like it | 13:55 |
*** mlavalle has joined #opendev | 14:00 | |
frickler | mordred: clarkb: question on the pattern matching syntax in https://review.opendev.org/#/c/720030/ | 14:03 |
mordred | frickler: I'm pretty sure it's a regex match and not a glob match | 14:05 |
mordred | frickler: there's a 'playbooks/roles/letsencrypt.*' showing on that page which should get files matching all of the roles starting with letsencrypt | 14:06 |
mordred | that said - I'm not sure why we're doing .* there and just playbooks/roles/jitsi-meet/ above | 14:07 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 14:15 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 14:15 |
*** ysandeep is now known as ysandeep|away | 14:31 | |
mnaser | would it be ok if i setup a mirroring job in the vexxhost/base-jobs repo similar to the one i setup inside opendev/project-config ? | 14:51 |
mnaser | i don't see an issue but i just wanted to get the ok given it's a trusted repo | 14:51 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Dynamically write zookeeper host information to nodepool.yaml https://review.opendev.org/720709 | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720740 | 14:53 |
mordred | mnaser: I don't see any issue with that | 14:53 |
mordred | corvus: ok - I rebased the nodepool patch on top of your zk patch so that I could use the zookeeper role - let's see how many things break :) | 14:56 |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Remove Babel and any signs of translations https://review.opendev.org/720673 | 15:05 |
*** ykarel|afk is now known as ykarel | 15:06 | |
*** bwensley has joined #opendev | 15:09 | |
bwensley | Hey everyone - I notice that my gerrit review notifications seem to have stopped yesterday afternoon. | 15:10 |
bwensley | Is this a known problem? | 15:10 |
AJaeger | bwensley: it works for me... | 15:15 |
AJaeger | bwensley: so, not a known problem | 15:15 |
frickler | bwensley: assuming you are talking about emails, if you DM me your address I can check mail logs | 15:16 |
bwensley | Yes - talking about email notifications. | 15:17 |
bwensley | If it is working for everyone else maybe a problem with my spam filters at my employer. | 15:18 |
frickler | infra-root: seems we are on spamhaus PBL with 104.130.246.32. fungi: IIRC you did the unblocking chant most of the time? | 15:20 |
corvus | mordred: morning! catching up on your changes now | 15:21 |
mordred | corvus: they may be a terrible idea - they were written during first coffee | 15:22 |
prometheanfire | can I get a review on https://review.opendev.org/717339 ? | 15:23 |
prometheanfire | second one that is | 15:23 |
corvus | mordred: i don't see zk stuff in 720527? | 15:24 |
corvus | where should i be looking | 15:24 |
mordred | corvus: https://review.opendev.org/#/c/720709 https://review.opendev.org/#/c/720740 - which are now parents of https://review.opendev.org/#/c/720527 | 15:25 |
corvus | ah! | 15:25 |
mordred | corvus: (I'd totally do that python module in jinja - but I'm not sure I'm good enough with jinja) | 15:26 |
corvus | mordred: well, my first TODO today is to jinja the ipv4 addresses of the zk hosts into the config file, so i should have something you can copy/paste in a minute. | 15:26 |
corvus | mordred: (the same thing is needed in the zoo.cfg file) | 15:27 |
mordred | corvus: sweet! | 15:28 |
mordred | corvus: I think the hardest thing for the nodepool case is producing the yaml list of dicts format | 15:28 |
mordred | but I'm sure we can figure that out | 15:28 |
corvus | i think it's past time to move the connection stuff into a different config file, but oh well. :( | 15:29 |
mordred | corvus: I left a note on your change with a pointer to some vars that might be useful fwiw | 15:30 |
corvus | mordred: awesome. that's step 1 of that task :) | 15:30 |
clarkb | frickler: mordred yes I believe it is a regex, see line 1349. However maybe I need to prefix with ^ to make that clear? | 15:31 |
clarkb | frickler: mordred I'm looking up zuul docs now | 15:31 |
frickler | fungi: actually I think I did send a removal request some time ago, retrying now | 15:31 |
corvus | they're always regexes | 15:31 |
clarkb | corvus: thanks! frickler see corvus' note I think my change is correct | 15:31 |
corvus | ^ will just anchor it to the start, omitting that will let it match anywhere | 15:32 |
frickler | clarkb: hmm, then you could drop the ".*" ending to be consistent with everything else, right? | 15:32 |
frickler | would be less confusing IMHO | 15:33 |
clarkb | frickler: ya I guess I can if we allow partial matching | 15:33 |
mordred | clarkb: we do. I think there are actually several .* suffixes that can all go | 15:33 |
corvus | we call "regex.match(file)" | 15:33 |
clarkb | ok I'll push up an update and look at simplifying some of the other matches in a followon | 15:34 |
corvus | oh, match says it's always at the start of the string | 15:34 |
corvus | "If zero or more characters at the beginning of string match the regular expression pattern" | 15:34 |
corvus | i think that means both ^ and trailing .* are superfluous | 15:34 |
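(A quick illustration of that point, as an editorial sketch: Zuul applies its file matchers with Python's re.match(), which already anchors at the start of the string, so a leading ^ and a trailing .* add nothing. The pattern and path below are only examples.)

```bash
# re.match() anchors at the beginning of the string; matching a prefix is enough
python3 -c "import re; print(bool(re.match('playbooks/roles/letsencrypt', 'playbooks/roles/letsencrypt/tasks/main.yaml')))"
# -> True, with no '^' prefix and no '.*' suffix needed
```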
frickler | #status log submitted and confirmed spamhaus PBL removal request for 104.130.246.32 (review01.openstack.org) | 15:35 |
openstackstatus | frickler: finished logging | 15:35 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run jobs prod test jobs when docker images update https://review.opendev.org/720030 | 15:36 |
clarkb | corvus: yup I agree | 15:36 |
corvus | mordred: given the specific task of "modify a slurped yaml nodepool config" it probably makes sense to just keep that as a module | 15:36 |
corvus | mordred: we can get rid of it when we make a "nodepool.conf" or something in the future | 15:37 |
mordred | corvus: ++ | 15:37 |
corvus | we're going to have "zookeeper-tls" to add to "zookeeper-servers" shortly | 15:37 |
mordred | corvus: assuming, of course, I can ever get that module to run | 15:37 |
corvus | mordred: yeah, i say just keep plugging at it; i don't think my tasks are going to add anything to help | 15:38 |
mordred | kk | 15:38 |
clarkb | mordred: frickler ^ there is the updated change | 15:40 |
clarkb | working on a followon now to be consistent in that file | 15:40 |
frickler | clarkb: ack, thx. | 15:42 |
* frickler heads towards the weekend now | 15:43 | |
openstackgerrit | James E. Blair proposed opendev/system-config master: Run ZK from containers https://review.opendev.org/720498 | 15:43 |
corvus | clarkb, fungi, mordred: ^ that's ready to merge, please review and +W | 15:44 |
corvus | after it lands, we can take zk* out of emergency | 15:44 |
frickler | infra-root: there seem to be umpteen bounces to review@openstack.org in the mailq on review.o.o, not sure if that's normal or whether they are due to the PBL issue. do we usually clean these up or just let them expire? | 15:46 |
clarkb | frickler: I expect its due to the PBL listing, but fungi and corvus would know better than me | 15:46 |
corvus | i think it'd be fine to just let them expire | 15:47 |
frickler | ok | 15:49 |
corvus | ttx: https://review.opendev.org/720679 lgtm i'll give fungi a bit in case he wants to review | 15:49 |
*** dpawlik has quit IRC | 15:50 | |
corvus | mordred: comment on 720709 | 15:51 |
mordred | corvus: I have learned something | 15:51 |
mordred | corvus: well - I learned your thing - but also, the fact variables I mentioned - only exist if fact gathering has happened for the zk hosts | 15:52 |
mordred | corvus: so we can either ensure a noop task has happend on the zookeeper group ... or we could use public_v6 and public_v4 from our inventory file | 15:53 |
corvus | mordred: we cache facts on bridge | 15:53 |
mordred | corvus: nod. do we in test runs? | 15:53 |
corvus | mordred: so is this just a gate problem? | 15:53 |
corvus | i ran my jinja on bridge using the real inventory and it works | 15:54 |
mordred | might be. but if we use the same ansible.cfg we should cache facts in gate too | 15:54 |
mordred | cool | 15:54 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Simplify .zuul.yaml regexes https://review.opendev.org/720759 | 15:54 |
corvus | mordred: (and that test on bridge was with a "hosts: localhost" play) | 15:54 |
mordred | I think in the gate we might need to run the zookeeper playbook first so that we'll populate the fact cache - but we need to run that ANYWAY to make the zk hosts | 15:54 |
clarkb | mordred: frickler corvus ^ thats the followon though not stacked as it had a merge conflict with master and I didn't want to update the other change again :) | 15:54 |
corvus | mordred: yeah, that sounds reasonable to rely on that as a side effect. maybe worth a comment. | 15:55 |
mordred | ++ | 15:55 |
mordred | also - in my nodepool patch I'm preferring ipv6 if it exists - is that a bad idea? | 15:56 |
corvus | clarkb: +2; i noted one innocuous change | 15:57 |
*** ykarel is now known as ykarel|away | 15:57 | |
corvus | mordred: actually | 15:57 |
* corvus wakes up | 15:57 | |
corvus | mordred: why aren't we using hostnames in nodepool.yaml? | 15:57 |
mordred | corvus: well - we are in the normal one - but hostnames won't resolve in the gate | 15:57 |
corvus | ah | 15:58 |
mordred | corvus: unless we're writing out /etc/hosts files | 15:58 |
corvus | that's lame | 15:58 |
mordred | yeah | 15:58 |
mordred | maybe we should write out /etc/hosts files? | 15:58 |
corvus | oh no | 15:58 |
corvus | i meant writing /etc/hosts is lame | 15:58 |
mordred | yeah - it's totally lame | 15:58 |
mordred | but - overall the "test nodes won't resolve in dns" is gonna be an ongoing thing probably as we do more and more of these real world multi-node things | 15:59 |
corvus | true. in which case, write /etc/hosts or template in ip addresses are both reasonable solutions | 16:00 |
corvus | templating in ip addresses does have the advantage of potentially being the same in test and prod | 16:00 |
corvus | (eg, zoo.cfg) | 16:00 |
corvus | mordred: anyway, to your question: preferring v6 sounds reasonable | 16:01 |
corvus | we can see how that ends up performing in our various clouds | 16:01 |
mordred | kk | 16:01 |
mordred | I'll stay with ips for now - and we can swing back to /etc/hosts if needed | 16:02 |
mordred | corvus: I'm going to have to squash two of those patches - since I need to run zk so that zk hosts exist :) | 16:02 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:04 |
corvus | mordred: i was looking at this spurious failure on your change: https://zuul.opendev.org/t/openstack/build/c3d52b243c4b4af5bb5c6fd3abeeea5a/log/applytest/puppetapplytest18.final.out.FAILED#62 | 16:07 |
corvus | mordred: it looks like some kind of rsync race? i wonder if one of the recent changes to how we run stuff could be affecting that? | 16:08 |
corvus | (we could just recheck it and continue to remove puppet; but i worry if we're going to start getting more errors) | 16:09 |
ttx | corvus if you +2a the replication change it could be good to keep an eye on the replication thread to see if it gets backed up -- might be a sign that refs/changes gets deleted on the push | 16:09 |
mordred | corvus: ugh yeah | 16:09 |
ttx | It should not, since it's a push without --mirror afaict | 16:09 |
mordred | corvus: I mean - part of me wants to say "recheck and keep working to remove puppet" - but I also agree, this could be an escalating issue | 16:09 |
corvus | ttx: ack | 16:09 |
ttx | but it's not superclear looking at Gerrit plugin code | 16:09 |
ttx | or it can wait Monday :) | 16:10 |
corvus | ttx: yeah, it may depend on whether fungi is around :) | 16:10 |
corvus | (or if his internet has been swept out to sea) | 16:11 |
mordred | corvus: clarkb and I talked about doing a couple of steps to clean some things up even with puppet in place ... namely, going ahead and making service-$foo playbooks and corresponding jobs - even if those playbooks right now just run puppet on a given host ... | 16:11 |
clarkb | mordred: corvus: you've both acked https://review.opendev.org/#/c/719589/ my parental home school duties will be over in about an hour and a half. Is that a good time for you all to land that? | 16:11 |
mordred | corvus: and if we do that, I think we could decently change any puppet tests we have into testinfra tests - and then just drop the puppet-specific tests altogether | 16:11 |
mordred | clarkb: wfm | 16:11 |
corvus | mordred: yeah, that's a good idea -- running the playbook means we can drop the applytest (it's better than an apply test) | 16:12 |
mordred | corvus: because "run all of the puppet" every time we touch an ansible file is a bit of a waste | 16:12 |
mordred | ++ | 16:12 |
mordred | I think I might put that fairly soonish on my list | 16:12 |
corvus | clarkb: did we figure out about restarting services? | 16:12 |
mordred | because that would also allow us to move to the opendev tenant | 16:12 |
mordred | (since the blocker right now is the legacy base jobs in ozj - which we use in the puppet tests) | 16:13 |
corvus | mordred: which will speed everything up :) | 16:13 |
clarkb | corvus: we expect it will restart processes. Gerrit should be fine because we don't docker-compose up it during normal runs. | 16:13 |
mordred | ++ | 16:13 |
corvus | clarkb: cool, wfm | 16:13 |
clarkb | corvus: services like zuul preview, docker registry, gitea, nodepool-builder will restart | 16:13 |
mordred | and once the compose change is in - we should do a controlled restart of gerrit - because we have a change we need to pick up | 16:14 |
clarkb | gitea should be ok because we do one at a time. Though we'll want to replicate to them afterwards to avoid any missed refs | 16:14 |
clarkb | (I can do that) | 16:14 |
mordred | clarkb: didn't we land your update to safely restart gitea? | 16:14 |
mordred | (so that we do it in the right order?) | 16:14 |
clarkb | mordred: oh we did, and that might cause this to not actually restart gitea | 16:14 |
clarkb | because we check for new images and otherwise don't issue the commands | 16:15 |
clarkb | so we should manually restart things if there isn't a new image coincident with this update | 16:15 |
mordred | nod. and next time we have new images, the restart should still do the right thing | 16:15 |
clarkb | (I can also do that) | 16:15 |
mordred | yeah | 16:15 |
mordred | well - we DO have a new image we could roll out | 16:15 |
mordred | https://review.opendev.org/#/c/720202/ <-- | 16:15 |
mordred | we could land that after the docker-compose patch | 16:15 |
mordred | and that should trigger a gitea rollout | 16:16 |
clarkb | ++ lets do it that way | 16:16 |
mordred | good exercise of our machinery | 16:16 |
*** rpittau is now known as rpittau|afk | 16:17 | |
mordred | corvus: you still have -2 on your zk change - but clarkb and I both +2'd it | 16:22 |
corvus | mordred: ah thanks! :) | 16:23 |
*** mlavalle has quit IRC | 16:34 | |
mordred | corvus, clarkb: I pushed up two changes this morning unrelated to this - https://review.opendev.org/#/c/720718/ and https://review.opendev.org/#/c/720719/ - to start mirroring and building images of focal, since that's being released next week | 16:36 |
mordred | corvus: and speaking of - when we roll out new ze*.opendev.org servers after the ansible rollout - perhaps we should consider jumping straight to focal instead of bionic so that we don't have to think about them for a while | 16:39 |
corvus | mordred: ++ | 16:40 |
mordred | focal is defaulting to python 3.8 - so if we did that and then bumped to the 3.8 python-base in our image builds, we'd be on the same python across the install | 16:40 |
*** kevinz has quit IRC | 16:40 | |
corvus | hopefully afs works | 16:40 |
mordred | yeah. that'll be the first question | 16:40 |
*** mlavalle has joined #opendev | 16:43 | |
fungi | frickler: yeah, the pbl rejection messages should mention the url for more info, which will get you eventually to the delisting page, and i usually use the infra-root shared mailbox to do the verification message. i can take care of it in a minute if nobody has gotten to it yet | 16:44 |
openstackgerrit | Merged opendev/system-config master: Simplify .zuul.yaml regexes https://review.opendev.org/720759 | 16:45 |
fungi | looks like you got it though | 16:45 |
fungi | and sorry for the delay, looking over 720498 now | 16:45 |
fungi | on the replication change, did we ever disable the live replication config update "feature"? | 16:49 |
fungi | i think i had a change up some time ago to revert it | 16:49 |
fungi | looking | 16:49 |
openstackgerrit | Merged opendev/system-config master: Run ZK from containers https://review.opendev.org/720498 | 16:49 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:52 |
fungi | okay, yeah, that was https://review.opendev.org/691452 and it merged ~3 months ago | 16:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:54 |
mordred | corvus: ^^ those were basically green last time - except for one testinfra thing. I pushed up a fix for that, but then had to rebase because of the .zuul.yaml and the newer zk patch | 16:54 |
mordred | so the most recent ps is just the rebase | 16:55 |
mordred | corvus: also - check it: https://zuul.opendev.org/t/openstack/build/24f76cf23d9942ac9d015fba4d402ec2/log/nb04.opendev.org/nodepool.yaml#626-628 | 16:57 |
mordred | corvus: (the file itself now looks awful because of slurp|from_yaml|to_yaml - but I think we can live with that until we get a nodepool.conf) | 16:57 |
corvus | mordred: heh, it's readable enough :) | 17:00 |
corvus | clarkb: +3 https://review.opendev.org/720095 ? | 17:01 |
clarkb | corvus: do you also need to update the .env file? | 17:03 |
clarkb | I seem to recall that one having the etherpad url in it too | 17:03 |
clarkb | corvus: I've approved it and can update .env if necessary in a new change | 17:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 17:04 |
corvus | clarkb: yeah, it is in there, but i think it's only used to generate the config file that we manually install; maybe i'll just remove it in a followup.... | 17:05 |
corvus | er, you know what i mean by manually -- ansible installs it | 17:06 |
corvus | i'm manually running the playbook against zk01 | 17:08 |
mordred | corvus: any idea on how to do this: | 17:09 |
mordred | hosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888,{% endfor %} | 17:09 |
mordred | but without the trailing , that'll be there? | 17:09 |
corvus | mordred: yeah, there's some loop variables... 1 sec | 17:10 |
mordred | ah - found it | 17:11 |
mordred | loop.last | 17:11 |
corvus | ++ | 17:11 |
corvus | table of variables: https://jinja.palletsprojects.com/en/2.11.x/templates/#for | 17:11 |
mordred | hosts={% for host in groups['zookeeper'] %}{{ (hostvars[host].ansible_default_ipv4.address) }}:2888:3888{% if not loop.last %},{% endif %}{% endfor %} | 17:11 |
corvus | lgtm | 17:11 |
corvus | running playbook against zk02 | 17:13 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 17:15 |
mordred | corvus: that might work ^^ ... also - inverse of the nodepool ones - I pushed up a rebase that's just a rebase, then that last patch did the fixes needed | 17:15 |
clarkb | mordred: corvus re /etc/hosts I think our multinode role sets that up for you | 17:16 |
corvus | yeah, so we could use that role (or that part of that role) if we wanted to go that way | 17:16 |
mordred | does multinode do anything extra that might conflict with the things we're trying to test with system-config-run jobs? | 17:17 |
corvus | but i was thinking about it further, and it's still not a slam dunk for this use case -- we don't want to stand up a full cluster, we only want one node, so writing out the config is still desirable | 17:17 |
mordred | yeah | 17:17 |
corvus | running the playbook on 03 now | 17:18 |
mordred | although we could just join groups['zookeeper'] instead of needing to do the extra loop to find the ip address from hostvars | 17:18 |
mordred | corvus: cool | 17:18 |
corvus | ya | 17:18 |
* mordred could go either way | 17:18 | |
*** hashar has quit IRC | 17:18 | |
corvus | i'm seeing a bunch of client errors now | 17:19 |
mordred | clarkb: we could just use role multi-node-hosts-file | 17:19 |
mordred | it is nicely split out into its own role :) | 17:19 |
corvus | infra-root: heads up -- i think the zk cluster is in a bad state | 17:20 |
mordred | corvus: uhoh | 17:20 |
mordred | corvus: should we switch to opendev-meeting? | 17:20 |
fungi | at least it's friday? ;) | 17:20 |
clarkb | corvus: logs look like yseterday | 17:20 |
corvus | i'll stop zk03 | 17:21 |
corvus | that did not improve things | 17:22 |
corvus | i'll restart everything? | 17:22 |
clarkb | I think that is what helped last time? | 17:23 |
fungi | seemed like it anyway | 17:24 |
corvus | looks happier | 17:24 |
corvus | i am less than satisfied with this | 17:24 |
corvus | that should have been a straightforward rolling restart | 17:24 |
mordred | yeah | 17:24 |
mordred | corvus: should we try another rolling restart to see how it goes? | 17:25 |
corvus | maybe -- though i wonder if we need the dynamic config file | 17:25 |
clarkb | we have done rolling restarts of the ubuntu packaged zk successfully in the past (I think ianw did one in the last couple weeks too) | 17:25 |
corvus | that was 3.4.8 iirc | 17:25 |
corvus | (we do need 3.5.x for tls) | 17:25 |
mordred | corvus: are you thinking that maybe when a node leaves the cluster zk is updating the dynamicConfig? | 17:25 |
corvus | mordred: yeah | 17:26 |
corvus | i'm still fuzzy on how "optional" it is | 17:26 |
mordred | I really wish people wouldn't write server software that writes things to its config files | 17:26 |
corvus | i might be able to simulate this locally | 17:26 |
corvus | that's probably the place to start | 17:26 |
mordred | https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#sc_reconfig_file | 17:27 |
corvus | yep that thing | 17:27 |
mordred | yeah - my reading of that tells me that it's going to write server values to the file | 17:27 |
mordred | when servers come and go | 17:27 |
corvus | mordred: but what happens if you don't include the client port number at the end? | 17:28 |
corvus | see "Backward compatibility" | 17:28 |
corvus | and if we don't "invoke a reconfiguration" that "sets the client port" | 17:28 |
corvus | (i don't know whether we're inadvertently doing that or not when we restart a server) | 17:29 |
corvus | all of that to say, in my mind, there's a decision tree with at least two unresolved nodes determining whether any config files get (re-)written | 17:29 |
corvus | cluster configuration by quantum superposition | 17:29 |
mordred | well - I think the file is going to get transformed regardless of port | 17:30 |
mordred | corvus: I agree - we need to just simulate locally | 17:30 |
clarkb | have we determined if its the actual config file or not? | 17:30 |
mordred | there's no way we're going to reason through it | 17:30 |
clarkb | or if we set a separate path it will write to a separate file? | 17:30 |
clarkb | (note about now is when I'm able to monitor the docker-compose thing but will wait until we are in a happy place with zk) | 17:31 |
mordred | I believe it wants 2 files in all cases - if we put things in the single file, it will helpfully pull out the servers and put them into the second file | 17:31 |
corvus | mordred: see the text under 'example 2' for the bit about how whether a port is there or not affects whether it writes the dynamic file | 17:31 |
mordred | yeah - that's a good point | 17:32 |
corvus | mordred: i agree that there's no way we'll reason about it | 17:32 |
mordred | also - assuming that we want to implement their "recommended" way of doing things | 17:32 |
corvus | mordred: i'm not ready to endorse any conclusions... | 17:32 |
mordred | what a PITA from a config mgmt pov | 17:32 |
corvus | so far we have not seen it rewrite the main config file when we did not configure a dynamic config file path | 17:33 |
corvus | that's the only thing we know :) | 17:33 |
mordred | \o/ | 17:33 |
corvus | i think the best thing to do is for me to go into a hole and set up a 3 node local cluster and try to replicate the problem | 17:33 |
corvus | then start changing variables | 17:33 |
mordred | I mean, in their "preferred" approach - as long as all three nodes are up and running when we run ansible it should be a no-op - but doing a rolling restart at the same time ansible tries to write a config would be potentially highly yuck | 17:34 |
mordred | corvus: ++ | 17:34 |
* mordred supports a corvus hole | 17:34 | |
fungi | the discussions i linked yesterday for the zookeeper operator indicated that zookeeper wants config write access even if told to use a static config | 17:34 |
clarkb | ok should I hold off on docker-compose things or are we reasonably happy with the state here? I ask because those zk nodes are using docker-compose now and should noop but may not? | 17:34 |
clarkb | I'm like 98% confident the docker-compose upgrade will nop zk | 17:34 |
mordred | clarkb: I am fairly confident your change will noop the zk nodes | 17:34 |
mordred | yeah - because zk is already using pip -so it should be a no-op compose up | 17:35 |
corvus | clarkb: yeah, i think it's worth the risk. i would just stand by to do a full 'docker-compose down' 'docker-compose up -d' if it's not a noop | 17:35 |
mordred | ++ | 17:35 |
clarkb | ok I'm going to hit approve now then | 17:35 |
corvus | okay, i'll probably be away for a few hours; exercise and then into the debugging hole | 17:35 |
mordred | fungi: has anyone in discussions you've read complained loudly about the config writing choices? | 17:35 |
mordred | because if they haven't I might want to | 17:36 |
fungi | mordred: they seemed resigned to their unfortunate fates | 17:36 |
mordred | sigh | 17:36 |
fungi | someone probably should bring it up with the zk maintainers. though i assume multiple someones have and i've just not found record of those conversations | 17:36 |
clarkb | why have a separate dynamic config file option if the "static" one needs writing too | 17:37 |
fungi | though that one issue i linked in turn linked to the bits of the zk source where the write decision is made | 17:37 |
clarkb | (that seems like a reasonable argument to make to them if this is the case) | 17:37 |
* fungi finds again | 17:37 | |
mordred | I mean - ultimately I'm guessing that we're not going to win and will have to also resign ourselves to our unfortunate fates | 17:38 |
mordred | but it's one of those decisions that makes running a service with automation harder | 17:38 |
fungi | https://github.com/pravega/zookeeper-operator/issues/66#issuecomment-501191586 | 17:38 |
fungi | "It needs to be able to create a new dynamic configuration file and update the static configuration file to point to the latest configuration (that's for restarts of the server)." | 17:39 |
fungi | so basically the static configuration file isn't entirely static, it just contains (some) static configuration | 17:39 |
clarkb | fungi: mordred that code chunk seems like its tracking the dynamic config in the static config | 17:40 |
mordred | yeah - it seems that the one write operation they want to make is to remove the dynamic config | 17:40 |
clarkb | I wonder if the issue goes away entirely if we simply set a dynamic config path | 17:40 |
mordred | clarkb: needEraseClientInfoFromStaticConfig() | 17:40 |
mordred | I'm fairly certain if we set a dynamicConfigFile path and also remove servers from our static config that zk will not touch our static config and will update the member list in the dynamic config as needed | 17:41 |
clarkb | https://github.com/apache/zookeeper/blob/3aa922c5737c9ef0879f290181cb281261c965e0/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java#L591-L599 is that function | 17:41 |
clarkb | looks like it will simply remove the dynamicConfigFile entry | 17:42 |
clarkb | oh and then it appends dynamicConfigFile to the end | 17:43 |
mordred | yeah | 17:43 |
mordred | but only if it needs to erase stuff from the static | 17:44 |
clarkb | so if we can remove those keys and ensure dynamicConfigFile is set at the end we may avoid problems. I'm not sure we can remove clientPort though | 17:44 |
mordred | why not? we can set it on the end of each server line, no? | 17:44 |
clarkb | mordred: just because I haven't read enough docs yet | 17:44 |
mordred | yeah - there's a form that allows you to append to each line | 17:44 |
clarkb | oh but the server line is also checked | 17:44 |
clarkb | https://github.com/apache/zookeeper/blob/3aa922c5737c9ef0879f290181cb281261c965e0/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java#L640 they rewrite everything back out again there ? | 17:45 |
mordred | yeah. which is why those lines go into the dynamic file | 17:45 |
clarkb | well thats all the static file there in that function | 17:45 |
clarkb | I'm basically trying to figure out if there is a form we can write that will make zk not try and change it | 17:46 |
clarkb | dynamicConfigFile needs to be the very last key is about as far as I've gotten | 17:46 |
mordred | yeah - but it only does editStaticConfig if you had dynamic config in the static file in the first place | 17:46 |
clarkb | mordred: yes but it writes it back out again | 17:46 |
mordred | but only if it had to edit it | 17:46 |
clarkb | we can't stop the writing from happening | 17:46 |
mordred | I think we can | 17:46 |
clarkb | but if ansible and zk write the same thing its fine | 17:46 |
clarkb | "fine" | 17:46 |
mordred | I think if we don't put the dynamic info into the static file ever | 17:47 |
mordred | then ansible will not touch the static file | 17:47 |
mordred | we'll still need to write the dynamic file - and zk will also write to that | 17:47 |
clarkb | mordred: ansible is writing the static conf | 17:47 |
mordred | yes, I understand | 17:47 |
mordred | but what I'm saying is that if we restructure the file | 17:47 |
mordred | and stop putting the server list in it | 17:47 |
mordred | that zk will not desire to rewrite that file | 17:47 |
clarkb | how do we tell it what servers are in the cluster? | 17:48 |
mordred | if we only have ansible write the server list into the dynamic file | 17:48 |
mordred | and we also have ansible only write that file if it doesn't exist | 17:48 |
clarkb | ok that last bit is what I was missing | 17:48 |
mordred | because once we've written it the first time it's owned by zk - so if we try to write it out during a rolling restart, things will have sads | 17:48 |
mordred | because we'll be fighting zk - but by and large we'd only need to write to that file if we were changing the list of members - and that would be a big thing anyway | 17:49 |
mordred | in any case - corvus is going to go into a hole and verify these suppositions :) | 17:49 |
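(To make that supposition concrete: a minimal sketch, not verified against the actual roles, of the layout being discussed - keep the server membership out of the static zoo.cfg, point dynamicConfigFile at a second file, write that file only if it does not already exist, and let zookeeper own it afterwards. Hostnames, paths and ports below are illustrative.)

```bash
# static config: no server.N lines, so zk 3.5 has no reason to rewrite it
cat > /etc/zookeeper/zoo.cfg <<'EOF'
dataDir=/var/zookeeper
tickTime=2000
initLimit=10
syncLimit=5
dynamicConfigFile=/etc/zookeeper/zoo.cfg.dynamic
EOF

# dynamic config: membership plus client port, written once and then owned by zk
if [ ! -f /etc/zookeeper/zoo.cfg.dynamic ]; then
  cat > /etc/zookeeper/zoo.cfg.dynamic <<'EOF'
server.1=zk01.opendev.org:2888:3888:participant;2181
server.2=zk02.opendev.org:2888:3888:participant;2181
server.3=zk03.opendev.org:2888:3888:participant;2181
EOF
fi
```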
mordred | clarkb: https://review.opendev.org/#/c/720718/ if you're bored | 17:50 |
*** ralonsoh has quit IRC | 17:53 | |
clarkb | mordred: check comment for things | 17:53 |
fungi | i hope corvus brings a torch, we don't need him getting eaten by a grue | 17:53 |
mordred | clarkb: oh - that's a good point | 17:54 |
mordred | fungi: do you happen to know the answer to clarkb's comment on 720718 ? | 17:55 |
clarkb | mordred: I'm looking I think only the things on mirror-update.opendev.org use the new ssh'd vos release | 17:57 |
clarkb | mordred: and we've only moved the rsynced things over (since that is ansible managed and setting up reprepro is "involved") | 17:57 |
clarkb | mordred: so I think what you need to do for your change is either update mirror-update.openstack.org to use the same ssh thing, move reprepro to mirror-update.opendev.org and have it ssh, or hold the lock, run reprepro yourself without a vos release, then vos release on the afs server afterwards, then release the lock | 17:58 |
fungi | lookin' | 18:02 |
clarkb | also we removed all trusty nodes/jobs right? | 18:03 |
clarkb | I think maybe instead of bumping quota we want to delete trusty first (also should be manual due to sync cost) | 18:03 |
clarkb | AJaeger: ^ pretty sure you drove that for us and it is all complete now right? (trusty test node removal) | 18:04 |
fungi | mordred: yeah, i left a comment on 720718 just now but it basically repeats what clarkb just said | 18:04 |
mordred | nod. so yeah - trusty removal first seems like the right choice | 18:08 |
mordred | or - maybe what we want is to replace trusty with focal in the file | 18:09 |
mordred | and then do a single sync | 18:09 |
AJaeger | clarkb: yes, I think we're fine, let me double check quickly | 18:09 |
clarkb | mordred: you might have write errors if you do that since reprepro deletes after downloading iirc | 18:09 |
clarkb | mordred: could temporarily bump quota to handle that | 18:10 |
clarkb | that might be the quickest option actually since you bundle the big syncs into one sync | 18:10 |
AJaeger | yes, trusty should be gone. There's still a bit in system-config (sorry, did not read backscroll) but that's all | 18:12 |
clarkb | AJaeger: ya we have ~3 nodes on it still but we pulled out testing of it so we don't need the afs mirror anymore. Thank you for checking | 18:12 |
mordred | clarkb: yeah - so we might still want to do the reprepro config as two patches - but bundle it with a single vos release | 18:14 |
mordred | clarkb: oh - or yeah, bump quota for a minute | 18:14 |
mordred | oh wow | 18:17 |
mordred | clarkb: context switching back to puppet real quick ... | 18:18 |
mordred | clarkb: puppet-beaker-rspec-puppet-4-infra-system-config is mostly testing things that are done in ansible | 18:18 |
mordred | clarkb: so - I think it's pretty much useless at this point | 18:18 |
mordred | the only testing it's doing is the stuff that's defined in modules/openstack_project/spec/acceptance/basic_spec.rb | 18:19 |
clarkb | mordred: I want to say that may be an integration job too | 18:19 |
mordred | which is basically testing that users we set up in ansible are there | 18:19 |
clarkb | mordred: so it runs against puppet-foo rspec too ? | 18:19 |
clarkb | when we update puppet-foo | 18:19 |
clarkb | so its possible we don't need the job on system-config anymore but may not be ready to delete the job itself? | 18:19 |
clarkb | (double check me on that) | 18:19 |
mordred | clarkb: nope | 18:21 |
mordred | clarkb: or - rather - yes - we don't need the job on system-config | 18:21 |
mordred | we run puppet-beaker-rspec-puppet-4-infra on puppet-foo changes | 18:21 |
clarkb | got it | 18:22 |
mordred | so - I think we can remove puppet-beaker-rspec-puppet-4-infra-system-config now | 18:23 |
mordred | and then when I do the change to split remote_puppet_else into service-foo playbooks - that can replace the puppet apply job | 18:23 |
mordred | and similarly, each one of those jobs can be used in the puppet-foo modules as appropriate | 18:23 |
mordred | and we can get rid of all of the rspec jobs | 18:23 |
mordred | and life will be much better | 18:24 |
clarkb | ya the puppet apply job also only does a puppet noop apply | 18:27 |
clarkb | so if we can actually run puppet it will be an improvement :) | 18:27 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove puppet-beaker-rspec-puppet-4-infra-system-config https://review.opendev.org/720799 | 18:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp https://review.opendev.org/720800 | 18:29 |
mordred | clarkb: two easy-ish cleanups to prep for that ^^ | 18:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused rspec tests https://review.opendev.org/720802 | 18:30 |
mordred | and a third | 18:30 |
clarkb | mordred: oh heh your third change addresses my note in first one | 18:32 |
clarkb | mordred: the second needs work though (comment inline) | 18:32 |
mordred | cool - thanks! | 18:33 |
clarkb | change for docker-compose update is waiting on nodes. I should have plenty of time to pop out for a few minutes as a result. Back soon | 18:34 |
clarkb | (the gitea job isn't incredibly quick) | 18:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove global variables from manifest/site.pp https://review.opendev.org/720800 | 18:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove unused rspec tests https://review.opendev.org/720802 | 18:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Start mirroring focal, stop mirroring trusty https://review.opendev.org/720718 | 18:40 |
clarkb | 22 minutes for that change to land give or take | 18:54 |
mordred | fungi: I think we can go ahead and land https://review.opendev.org/#/c/720679/ - we need to do a gerrit restart to pick up the local replication volume anyway | 18:55 |
mordred | so it would be nice to bundle the restart and get both things | 18:55 |
mordred | (because of this: https://review.opendev.org/#/c/720225/) | 18:56 |
clarkb | mordred: that can also transition the container name for us after docker-compose lands | 18:56 |
mordred | yup | 18:56 |
mordred | so I think we land 720679 - then docker-compose lands - then when we're happy we do a docker compose restart on review | 18:57 |
mordred | and we're in pretty good shape | 18:57 |
mordred | oh - we need to land https://review.opendev.org/#/c/719051/ too | 18:57 |
mordred | clarkb: any reason to hold off on the +A for that one? | 18:57 |
mordred | or do we want to wait? | 18:58 |
clarkb | mordred: I don't think so | 18:58 |
clarkb | it was just in holding pattern on the docker-compose upgrade | 18:58 |
mordred | cool. I'm gonna go ahead and poke it | 18:58 |
fungi | mordred: sounds good to me then | 18:58 |
fungi | i mainly didn't want to inadvertently complicate anything else we've got going on | 18:58 |
fungi | trying not to cross the streams too much | 18:59 |
mordred | fungi: ++ | 19:11 |
openstackgerrit | Merged opendev/system-config master: Install docker-compose from pypi https://review.opendev.org/719589 | 19:11 |
mordred | clarkb: there we go | 19:12 |
clarkb | mordred: and now we watch the deploy jobs ya? | 19:13 |
mordred | yup | 19:13 |
clarkb | hrm you know what just occurred to me does uninstalling packaged docker-compose do something we don't want like stopping the containers too :/ | 19:15 |
clarkb | testing seemed to show that it didn't because it was the docker-compose-up that happened later that restarted the containers | 19:15 |
clarkb | I'm just being paranoid now | 19:15 |
clarkb | gitea-lb seems to have gone well | 19:15 |
mordred | clarkb: yeah - I don't think it does | 19:16 |
mordred | it's just a python program that does things with docker api | 19:16 |
clarkb | mordred: good point | 19:16 |
clarkb | so ya uninstalling docker may do that but not docker-compose | 19:16 |
clarkb | in any case opendev.org is still up and the gitea-lb.yaml log looks as I expected it | 19:17 |
clarkb | first one lgtm | 19:17 |
clarkb | service nodepool job failed. Not sure why yet | 19:19 |
clarkb | Unable to find any of pip3 to use. pip needs to be installed. | 19:20 |
clarkb | that was unexpected | 19:20 |
clarkb | on nb04 | 19:20 |
clarkb | mordred: ^ do you know why servers like gitea-lb which are bionic would have pip installed but not nb04, which is also bionic? | 19:21 |
clarkb | also this is a gap in our testing because our test images have pip and friends preinstalled | 19:21 |
clarkb | I think what we may end up seeing here is that newer hosts fail on this error and older hosts are fine | 19:22 |
clarkb | and yes I've confirmed uninstalling docker-compose does not stop containers because nb04 and etherpad are in that state | 19:23 |
prometheanfire | mordred: mind taking a look at https://review.opendev.org/717339 ? | 19:24 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 19:26 |
fungi | clarkb: yeah, odds are our server images don't have the python3-pip package installed | 19:26 |
clarkb | fungi: ya but why would gitea-lb have it ? different image maybe | 19:26 |
clarkb | in any case infra-root I think 720820 fixes this problem. Note that we currently don't have docker-compose installed on hosts where this failed. But the existing docker compose'd containers are running | 19:27 |
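(For reference, the manual equivalent of what 720820 is meant to ensure - a hedged sketch, assuming a Debian/Ubuntu host and root access; the actual fix lives in the ansible role:)

```bash
# make sure pip3 exists before pip-installing docker-compose (run as root)
apt-get update && apt-get install -y python3-pip
pip3 install docker-compose
docker-compose version   # sanity check
```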
fungi | we deployed that in vexxhost right? | 19:27 |
clarkb | fungi: oh ya good point | 19:27 |
fungi | so we probably uploaded a nodepool-built image | 19:27 |
clarkb | if we need to emergency docker compose things before the fix above lands we can reinstall the distro docker-compose | 19:27 |
mordred | clarkb: uhm. weird. | 19:27 |
mordred | clarkb: yeah - I thought pip3 was everywhere - but clearly I was wrong - and our images having that on them sure did mask this didn't it? | 19:28 |
clarkb | mordred: yup | 19:28 |
clarkb | mordred: fwiw meetpad job returned success but it didn't seem to update containers there | 19:28 |
clarkb | "no hosts matched" ok that explains that one | 19:29 |
mordred | PLAY [Configure meetpad] ******************************************************* | 19:29 |
mordred | skipping: no hosts matched | 19:29 |
mordred | yeah | 19:29 |
clarkb | zk was success and that should've nooped. Checking now | 19:29 |
mordred | clarkb: oh - is meetpad in emergency? | 19:30 |
clarkb | mordred: it must be | 19:30 |
clarkb | zk looks good | 19:30 |
mordred | yup | 19:30 |
clarkb | so far only the pip issue | 19:30 |
mordred | cool! | 19:30 |
clarkb | nb04, etherpad.opendev, docker registry, and zuul-preview all failed on the pip3 missing thing. gitea-lb succeeded as did the zookeeper hosts. I expect review, review-dev, and gitea to all succeed as they are older and/or on vexxhost | 19:33 |
openstackgerrit | Merged openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin https://review.opendev.org/720235 | 19:33 |
fungi | those ^ get applied from promote pipeline jobs now, right? | 19:33 |
clarkb | fungi: deploy pipeline | 19:34 |
fungi | oh, right! | 19:34 |
fungi | i forgot we added a separate pipeline for that | 19:34 |
clarkb | mordred: hrm does manage-projects use docker-compose in a way that may pose a problem here? | 19:34 |
clarkb | the gerrit ACLs change has queued up the manage-projects job | 19:35 |
fungi | yep, i see that. cool | 19:35 |
clarkb | ok we use docker run not docker-compose for manage projects so that should be fine | 19:36 |
clarkb | it won't try to use the wrong container name | 19:36 |
clarkb | if we did docker exec or docker-compose for manage-projects that could be different | 19:36 |
clarkb | 720820 exposes that we don't run docker role consuming jobs on docker role updates. Thats another job fix I should figure out | 19:38 |
clarkb | infra-root once gitea runs and shows gitea01 (it should be first) is happy I'm going to work on lunch while waiting for the fix to get tested and reviewed | 19:39 |
clarkb | if you need to make changes to the fix or take different direction feel free | 19:39 |
clarkb | but then because the fix is in the docker role and our jobs may not be set to trigger off that role updating we may need to run the playbooks for these services manually: | 19:39 |
mordred | clarkb: (we should add the pip3 role to things that have files depends on the install-docker role now too) | 19:40 |
clarkb | service-nodepool.yaml, service-etherpad.yaml, service-meetpad.yaml (needs to be removed from emergency or we can wait on this one), service-registry.yaml, service-zuul-preview.yaml | 19:40 |
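If the manual route were needed, running one of those playbooks from bridge would look roughly like this (a sketch only; the checkout path and exact invocation are assumptions, not taken from the log):

```shell
# hypothetical example for one of the playbooks listed above
cd /opt/system-config
ansible-playbook playbooks/service-nodepool.yaml
```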
clarkb | mordred: ++ so we need to do the docker role and the pip3 role | 19:40 |
mordred | yeah | 19:40 |
* mordred will make a patch | 19:41 | |
clarkb | thanks! | 19:41 |
clarkb | does bridge unping for anyone else? | 19:42 |
clarkb | I can't ping or ssh to it and my existing ssh connection seems to have gone away? | 19:42 |
clarkb | and now it reconnects that was weird | 19:43 |
clarkb | uptime shows it didn't reboot | 19:43 |
clarkb | and we didn't OOM | 19:43 |
clarkb | "msg": "Timeout (32s) waiting for privilege escalation prompt: " <- review-dev failed on that | 19:44 |
clarkb | possibly due to the same network connectivity issue? | 19:44 |
clarkb | https://gitea01.opendev.org:3000/zuul/zuul is running the new containers and is happy | 19:45 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 19:45 |
clarkb | so I think review-dev and meetpad were the odd ones. review-dev due to networking to bridge going away? and meetpad due to being in emergency. All the other failures need pip3 to be installed | 19:46 |
mordred | clarkb: woot | 19:46 |
clarkb | gitea, gitea-lb, review, and zk are all happy | 19:46 |
clarkb | ok I think things are stable so I'm finding lunch now. Holler if that assumption is bad :) | 19:48 |
fungi | Timeout exception waiting for the logger. Please check connectivity to [bridge.openstack.org:19885] | 19:48 |
clarkb | fungi: thats normal because we don't run the zuul log streamer on bridge | 19:49 |
fungi | seen in a infra-prod-service-gitea run | 19:49 |
fungi | got it | 19:49 |
clarkb | fungi: if you want to see the logs you need to go to bridge /var/log/ansible/service-$playbook.yaml file | 19:49 |
fungi | so those are expected | 19:49 |
clarkb | yup | 19:49 |
clarkb | service-gitea.yaml.log for gitea | 19:49 |
openstackgerrit | Merged opendev/system-config master: Use HUP to stop gerrit in docker-compose https://review.opendev.org/719051 | 19:49 |
clarkb | I was tailing it earlier when I confirmed gitea01 was done and happy | 19:49 |
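Concretely, following one of those deploy logs is just a tail on bridge, using the filename pattern described above:

```shell
# on bridge: watch the gitea service playbook's output as it runs
tail -f /var/log/ansible/service-gitea.yaml.log
```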
openstackgerrit | Merged opendev/system-config master: No longer push refs/changes to GitHub mirrors https://review.opendev.org/720679 | 19:50 |
mordred | after those run ^^ we'll be good to restart gerrit | 19:51 |
AJaeger | infra-root, this inap graph looks really odd http://grafana.openstack.org/d/ykvSNcImk/nodepool-inap?orgId=1&from=1587131505313&to=1587153105313&var-region=All&panelId=8&fullscreen | 19:51 |
clarkb | corvus: I know you are heads down in other things, but are you good for us to remove meetpad from the emergency file? | 19:51 |
clarkb | AJaeger: ya its because nova isn't deleting instances there reliably | 19:52 |
clarkb | AJaeger: if you expand it to go back 2 days you'll see it happening more often | 19:52 |
clarkb | ok really finding lunch now. Back soon :) | 19:53 |
AJaeger | thanks, clarkb - enjoy lunch! | 19:53 |
corvus | clarkb: yes can remove meetpad | 19:55 |
corvus | clarkb, mordred: should i read scrollback or skip it? | 19:56 |
corvus | clarkb, mordred, fungi: i believe i have created a reasonable local facsimile of our prod env -- same ownership and volume structure, etc. i'm seeing the same errors about dynamic config, etc. i wrote a test script to continually write data to zk to simulate the cluster continuing to handle requests when one member leaves. i have yet to see it fail when i do a rolling restart. i've done several. | 19:58 |
fungi | corvus: there was some discussion about the bits of the zk source around the function writing to the "static" config but probably no new insights | 19:58 |
mordred | corvus: well that's not thrilling | 19:58 |
mordred | corvus: yeah - I think we mostly just looked at the source and then pondered - but ultimately concluded "corvus will figure out reality" | 19:59 |
corvus | my assumption for the moment is that whatever is causing the stale session issues is not related to the dynamic config | 19:59 |
corvus | i'm starting to wonder if it's a client issue | 19:59 |
corvus | i made sure to use the same kazoo version, under py3, that we're using on nl01 | 20:00 |
corvus | but maybe i should spot check that elsewhere -- maybe it's, say, only the scheduler that's hitting that problem | 20:00 |
fungi | and i guess we ended up with newer kazoo in the containers? | 20:00 |
clarkb | corvus: we hit a speedbump on the docker compose thing. not all servers have pip installed. for the servers that did update docker-compose everything is happy | 20:01 |
corvus | fungi: at the moment, the only zuul component running in containers is nb04 | 20:01 |
clarkb | fix for pip has been approved and will retrigger jobs (or manually run playbooks) once it lands | 20:01 |
mordred | clarkb: https://review.opendev.org/#/c/720821 is the followup with the file trigger updates | 20:01 |
corvus | clarkb: drat. i'm still sad we have to install pip :( | 20:01 |
corvus | oh, speaking of nb04 -- this happens when i try to exec: | 20:02 |
corvus | root@nb04:/var/log/nodepool# docker exec -it nodepoolbuildercompose_nodepool-builder_1 /bin/sh | 20:02 |
corvus | OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "open /dev/ptmx: no such file or directory": unknown | 20:02 |
mordred | corvus: oh - that's ... what? | 20:02 |
corvus | yeah, you can imagine my delight at having a system component turn into a black box i can't access | 20:02 |
clarkb | drop the -it maybe? | 20:03 |
clarkb | cant really shell in that case | 20:04 |
corvus | yeah, it was really the interactive shell i was after | 20:04 |
mordred | corvus: https://github.com/docker/cli/issues/2067 | 20:05 |
mordred | no solution | 20:05 |
corvus | i wonder if dib mucked it up? | 20:05 |
fungi | oh, yeah, i guess kazoo hasn't changed... has the version of zk we're deploying in the containers changed? and you're theorizing that the older kazoo has issues with newer zk? | 20:05 |
corvus | fungi: i've yet to find a version of kazoo in use other than 2.7.0, but i'm still looking. we have definitely upgraded zk. | 20:06 |
fungi | got it | 20:07 |
corvus | mordred: and of course the 'workaround' in that report doesn't work for 'exec', only for 'run' | 20:08 |
corvus | 2.7.0 is the newest kazoo, so i'll just assume that's what nb04 has | 20:09 |
fungi | seems probable | 20:10 |
corvus | every zuul component is using kazoo 2.7.0 except nb03 which is using 2.6.1 | 20:11 |
mordred | corvus: I checked on nb04 - devpts is mounted in the right place, /dev/ptmx is as expected and I don't see where dib would have broken it | 20:12 |
mordred | BUT - dib does some things with devpts - so it's entirely possible dib did a bad | 20:13 |
mordred | somehow | 20:13 |
mordred | corvus: neat. I tried running a non-interactive command and got: | 20:14 |
mordred | OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "close exec fds: open /proc/self/fd: no such file or directory": unknown | 20:14 |
corvus | maybe we want to restart (or reboot) and see what it looks like when it starts | 20:15 |
corvus | that may give us a clue if it's some dib cleanup task or something | 20:15 |
clarkb | mordred: oh good the infra-prod jobs run when install docker is updated | 20:15 |
clarkb | mordred: so we won't need to manually trigger jobs once the fix lands | 20:15 |
clarkb | corvus: note that nb04 is one of the hosts without docker-compose currently installed | 20:16 |
corvus | clarkb: ack. but i'm using plain docker commands | 20:16 |
corvus | clarkb: oh, you're warning me not to restart it right now :) | 20:16 |
corvus | message received | 20:16 |
corvus | (or, at least, don't use dc to restart it) | 20:17 |
clarkb | ya | 20:17 |
corvus | i've rerun my test with zk 2.6.1 -- same results | 20:17 |
clarkb | also if you look at zuul status for deploy pipeline right now I think its doing a thing we didn't expect it to? | 20:17 |
clarkb | there are two changes in the pipeline and the second change is running jobs before the first has finished | 20:17 |
corvus | ah, yup, we seem to be sharing the mutex between the two. | 20:18 |
corvus | i wonder if we can turn this into a dependent pipeline with a window of 1 | 20:19 |
corvus | the main thing would be to look into the merge check | 20:19 |
clarkb | mordred: pip fix breaks on xenial? https://zuul.opendev.org/t/openstack/build/e979db12fcf042ed8e51ca6be4cd0545/log/job-output.txt#16953 | 20:20 |
fungi | clarkb: i saw the same a little bit ago. i thought the mutex was supposed to wind up serializing them in the item enqueue order | 20:20 |
fungi | but that doesn't appear to be the case | 20:20 |
fungi | so, yeah, window of 1 i guess will be better than possible out-of-sequence deployments | 20:21 |
corvus | maybe our mutex wakeups are random | 20:22 |
clarkb | mordred: I think maybe this isn't necessary on xenial. So we can fix pip3 too | 20:23 |
clarkb | I'm testing it locally in a xenial container and will push fix if I think it will work | 20:25 |
mordred | clarkb: I agree - I think it isn't necessary on xenial | 20:26 |
corvus | clarkb, fungi: i'm still surprised about that. we should release the semaphore before processing the queue, and the queue processing should happen in order, so i'd expect each job for the first change to get it in order, then each job for the second change. unless one of the jobs on the first change didn't specify the semaphore? | 20:26 |
mordred | corvus: the semaphore should be on the base job | 20:27 |
corvus | we don't show nearly enough job info in the web ui | 20:27 |
mordred | yeah. anything parented on infra-prod-playbook | 20:27 |
mordred | that's where we're declaring use of the semaphore | 20:28 |
mordred | oh! interesting | 20:28 |
mordred | semaphore: infra-prod-service-bridge | 20:28 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 20:28 |
mordred | we have one job that declares a non-existent sempahore | 20:28 |
mordred | that is a different semaphore | 20:28 |
corvus | mordred: which job? | 20:28 |
clarkb | mordred: corvus fungi https://review.opendev.org/720820 has been updated to handle xenial if you have a moment between thinking about all the other things :) | 20:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove semaphore from service-bridge https://review.opendev.org/720829 | 20:29 |
mordred | corvus: infra-prod-service-bridge | 20:29 |
fungi | taking a look | 20:29 |
clarkb | infra-root should we start considering making an order of changes to land? | 20:29 |
corvus | mordred: ok. i don't think that job was involved here. | 20:30 |
corvus | yeah, our problem set has exploded again | 20:30 |
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK | 20:31 |
corvus | docker-compose is uninstalled; semaphores may run out of order; something about zk is weird when rolling restart; nb04 /dev in container is hosed | 20:31 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 20:31 |
corvus | did i miss anything? :) | 20:31 |
mordred | corvus: I think that's about it | 20:31 |
mordred | corvus: also - luckily for us, 3 of those problems we don't really understand | 20:32 |
corvus | okay, we gotta find a way to avoid installing docker-compose from pip in the future -- this whole sequence of "oops we don't have pip3 on this distro" was exactly the business that we got out of... for about 10 minutes. | 20:32 |
fungi | corvus: so what i observed earlier (but was refraining from interrupting other discussion with) is that 720235,2 had a waiting infra-prod-manage-projects build, but 719051,8 which was enqueued into the deploy pipeline after it started running infra-prod-service-review (those share a semaphore, right?) | 20:32 |
fungi | after infra-prod-service-review completed for 719051,8, infra-prod-manage-projects started running for 719051,8 ahead of it | 20:33 |
fungi | er, for 720235,2 ahead of it | 20:33 |
clarkb | corvus: ya I'm not sure what the proper answer is there. One crazy idea I had was running docker-compose from docker, but I imagine that will need testing | 20:34 |
clarkb | (and generally exposing the docker command socket to docker containers seems dirty) | 20:34 |
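The "run docker-compose from docker" idea would look something like the sketch below; the image tag and mounts are illustrative, and the socket mount is exactly the part that feels dirty:

```shell
# hypothetical: use the docker/compose image in place of a pip-installed binary
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$PWD:$PWD" -w "$PWD" \
  docker/compose:1.25.5 up -d
```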
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK I've filled in the docker-compose related items and put spots for the other things if people have things in flight to track | 20:36 |
fungi | yay! etherpad is snappy again! | 20:39 |
fungi | i've heard no complaints about it after the tuning config got added back, fwiw | 20:40 |
mordred | \o/ | 20:40 |
mnaser | uh i feel bad about bothering with this, but it seems like i got a buildset stuck in the vexxhost tenant again somehow.. | 20:48 |
mnaser | http://zuul.opendev.org/t/vexxhost/status -- its been around for 3h10m -- even when i +W it to kick it straight into gate, it is still there | 20:48 |
clarkb | mnaser: I think the inap issues are persisting | 20:48 |
clarkb | let me see what that job is waiting on | 20:48 |
mnaser | will it fail to dequeue as well? | 20:49 |
clarkb | I don't think so but dequeuing won't really help necessarily | 20:49 |
mnaser | right, but if i +W it, shouldn't it remove it from check and kick it straight to gate | 20:49 |
clarkb | mnaser: depends on how your pipeline is set up | 20:50 |
openstackgerrit | Arun S A G proposed opendev/gerritlib master: Fix AttributeError when _consume method in GerritWatcher fails https://review.opendev.org/720832 | 20:50 |
mnaser | im pretty sure we're using the one similar to opendev/zuul so go-straight-to-gate | 20:50 |
clarkb | fwiw those jobs don't seem to be blocking on inap | 20:52 |
clarkb | and two of them just started | 20:52 |
clarkb | still trying to figure out what they were hung up on | 20:52 |
clarkb | looks like rax-iad-main had it | 20:53 |
clarkb | for ~3 hours | 20:54 |
clarkb | so its the same behavior we had with inap but in rax | 20:54 |
clarkb | we end up with a lot of active requests but they aren't being fulfilled quickly (due to what I think are quota accounting issues) | 20:54 |
clarkb | and check would be sorted last so that probably contributes to it, though the neutron case was in the gate | 20:55 |
clarkb | http://grafana.openstack.org/d/8wFIHcSiz/nodepool-rackspace?orgId=1 shows iad being sad | 20:55 |
clarkb | seems to be recovering now though | 20:55 |
corvus | i guess we can add that to the list of fires | 21:01 |
corvus | also, we should stop logging the "could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance." exception | 21:01 |
corvus | the builder logs are pretty unreadable | 21:01 |
clarkb | corvus: fwiw I think that may just be "normal" cloud things. Addressing that in nodepool will be complicated I think | 21:05 |
clarkb | (its hard to work around when the cloud isn't giving us accurate info) | 21:05 |
clarkb | but I can dig into that again monday and make sure there isn't something else going on | 21:05 |
corvus | clarkb: it would be good to have a clear idea of what's going on. we already expect openstack to lie to us about server deletions. if it's also lying about quotas, etc, it'd be good to know | 21:06 |
clarkb | ++ | 21:06 |
corvus | clarkb, mordred: is there a way to get at the docker logs from the previous run of a container? | 21:10 |
clarkb | corvus: if they go to systemd I think so | 21:10 |
clarkb | and I think they do by default /me looks | 21:10 |
clarkb | oh maybe it isn't default | 21:11 |
clarkb | corvus: internet says do docker logs with the container id | 21:12 |
clarkb | and i believe you can get historical container ids from dockerd logs | 21:12 |
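Roughly what that looks like in practice (the container name is the nb04 one from earlier):

```shell
# list every container, including exited ones, to find the id you need
docker ps -a --format '{{.ID}} {{.Names}} {{.Status}}'
# logs for a container that still exists
docker logs nodepoolbuildercompose_nodepool-builder_1
# once a container has been removed, fall back to the docker daemon's journal
journalctl -u docker.service
```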
clarkb | ok the distutils thing fixed the pip change | 21:13 |
clarkb | now we wait for it to gate | 21:16 |
*** hashar has joined #opendev | 21:20 | |
corvus | clarkb: ah, docker-compose down deletes the container, and once it's gone docker logs $containerid doesn't work | 21:23 |
corvus | but everything is going into the journal, so that'll do for now | 21:23 |
clarkb | oh good its in the journal anyway | 21:23 |
clarkb | corvus: how do you get it out of the journal? | 21:23 |
corvus | clarkb: i'm just using journalctl -u docker.service | 21:24 |
clarkb | thanks | 21:24 |
clarkb | (it's useful to know that bit of info) | 21:24 |
openstackgerrit | Merged opendev/system-config master: Remove semaphore from service-bridge https://review.opendev.org/720829 | 21:25 |
clarkb | mordred: ^ some progress | 21:28 |
mordred | clarkb: woot! | 21:44 |
*** hashar has quit IRC | 21:46 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 21:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:56 |
mordred | clarkb: I have verified that the docker-compose on review is the pip version and the docker compose file has both outstanding changes in it | 22:00 |
mordred | clarkb: so we should be well positioned to restart whenever we decide it's a good time to do that | 22:00 |
clarkb | mordred: cool | 22:01 |
clarkb | at this point I think my ability to debug more things is waning | 22:01 |
clarkb | wanting to wrap up the outstanding things | 22:01 |
mordred | totally | 22:02 |
mordred | I recorded that we're ready to do that whenever in the etherpad | 22:02 |
corvus | i'm digging through zk server logs and reading docs and bug reports to try to come up with a new hypothesis | 22:12 |
openstackgerrit | Merged opendev/system-config master: Install pip3 for docker-compose installation https://review.opendev.org/720820 | 22:16 |
* mordred is going to pay attention to those | 22:16 | |
mordred | clarkb: the list of services that need the pip/compose update in the etherpad is the list of jobs that just got triggered - so that particular thing should be done once this runs | 22:18 |
clarkb | mordred: cool and I'm around paying attention too | 22:18 |
openstackgerrit | Merged opendev/system-config master: Add install-docker and pip3 to files triggers https://review.opendev.org/720821 | 22:19 |
clarkb | nb04 looks happy now | 22:29 |
clarkb | also docker ps -a shows a lot of old docker containers there | 22:29 |
clarkb | I think we need to get in the habit of doing docker run --rm ? | 22:30 |
clarkb | mordred: ^ you probably have ideas on that | 22:30 |
mordred | clarkb: hrm | 22:32 |
mordred | clarkb: I wish I knew why that container was unhappy in the first place | 22:32 |
mordred | clarkb: oh - yeah - I always do --rm when I do run | 22:33 |
mordred | clarkb: think we shoudl clean those up real quick? | 22:33 |
clarkb | maybe? it could be part of corvus' debugging and we should have corvus confirm first? | 22:34 |
clarkb | but ya I think cleaning up would be a good idea | 22:34 |
mordred | clarkb: ++ - most of those look like utility images from weeks ago | 22:34 |
mordred | clarkb: docker ps -a | grep Exited | awk '{print $1}' | xargs -n1 docker rm | 22:34 |
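A shorter equivalent, if the goal is just "remove every stopped container", is docker's built-in prune (it removes them all, so only appropriate when none of them are wanted):

```shell
docker container prune -f
```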
clarkb | etherpad just restarted | 22:35 |
clarkb | https://etherpad.opendev.org/p/PzoWHp44yOP4K8LdXXrK is still working for me | 22:35 |
clarkb | this is all looking good \o/ | 22:35 |
clarkb | oh I forgot to remove meetpad from emergency | 22:39 |
clarkb | mordred: thoughts on ^ should I just remove it now or wait for money? | 22:39 |
clarkb | *monday. money is nice too | 22:39 |
clarkb | docker registry looks happy now too | 22:40 |
mordred | clarkb: I think we can remove it - I don't think there were any reasons not to | 22:40 |
clarkb | mordred: I guess my only concern is if there were other changes and they weren't happy at this point | 22:41 |
clarkb | but since the service isn't in prod its probably fine | 22:41 |
clarkb | I'll remove it now so I don't forget further | 22:41 |
mordred | yeah. and corvus acked that it was ok earlier | 22:41 |
corvus | i did not do any docker runs | 22:41 |
corvus | only exec | 22:41 |
clarkb | corvus: rgr so ya we should be able to clean up all those containers mordred | 22:41 |
mordred | kk. removing | 22:41 |
clarkb | meetpad01 has been removed from emergency file | 22:42 |
clarkb | I'll put further debugging of this nodepool "slowness" high on my list for monday | 22:43 |
clarkb | since people keep noticing it so its definitely frequent and painful | 22:43 |
clarkb | zookeeper play failed on AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4' | 22:44 |
mordred | hrm. that's weird | 22:44 |
mordred | investigating | 22:44 |
corvus | we may have run in a v6 only cloud? | 22:45 |
clarkb | corvus: this was against prod | 22:45 |
clarkb | /var/log/ansible/service-zookeeper.yaml.log for the logs | 22:45 |
clarkb | * on bridge | 22:45 |
clarkb | zuul-preview seems good though I'm trying to find a change I can confirm that with via zuul dashboard | 22:47 |
* clarkb looks for zuul website change | 22:47 | |
mordred | that var shows up when I run setup and is also in the fact cache for those hosts | 22:47 |
clarkb | mordred: is it maybe the lookup path? | 22:48 |
fungi | clarkb: did we already switch the zuul website preview to using the zuul-preview service? | 22:49 |
fungi | i thought it wasn't yet (at least as of a week-ish ago) | 22:49 |
clarkb | fungi: no I thought we did but the job artifact errors with bad urls | 22:49 |
clarkb | and its because its at ovh's swift root not zp01 | 22:49 |
corvus | re zk cluster probs: i think we're looking at a server issue of some kind. it seems like when we kill the leader, that the new leader begins a new 'epoch' (which i think appears as the first character of the zxid in the logs -- that's why 0xd00000000 showed up -- epoch 0xd); my limited understanding is that should become the first zxid committed after the leader election, and then all the followers | 22:49 |
corvus | should get that. we're seeing clients connect having seen that zxid, but then the followers they connect to don't seem to have it. | 22:49 |
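One way to spot-check that from outside is ZooKeeper's four-letter-word interface, which reports each server's role and last zxid (srvr is whitelisted by default; the third hostname is an assumption, only zk01/zk02 are named in the log):

```shell
# ask each cluster member for its mode and last seen zxid
for host in zk01.openstack.org zk02.openstack.org zk03.openstack.org; do
  echo "== $host"
  echo srvr | nc "$host" 2181 | grep -E 'Mode|Zxid'
done
```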
mordred | I can reproduce the ansible_default_ipv4 issue with a simple playbook - poking at combos to see what works and doesn't | 22:51 |
mordred | sigh | 22:51 |
clarkb | mnaser: if you happen to still be around did you have any zuul preview using changes we can test with (I thought you had something) | 22:52 |
mordred | so - if I run a playbook targeting zk01.openstack.org that wants to get zk02.openstack.org's hostvars but zk02 hasn't ever done anything in the playbook, it fails | 22:52 |
clarkb | mordred: oh so we should add an explicit setup call across those hosts maybe? | 22:52 |
mordred | but if I run something, _anything_ on zk02 first - the hostvars are there | 22:52 |
mordred | we don't even need a setup call | 22:52 |
mordred | a debug call suffices | 22:53 |
corvus | clarkb: try the zuul-website gatsby wip patch? | 22:53 |
clarkb | corvus: thats what I pulled up but the url there is for ovh swift roo | 22:53 |
clarkb | *root | 22:53 |
mordred | it doesn't need to fetch new facts | 22:53 |
clarkb | let me see if there was a different url I should use | 22:53 |
clarkb | https://zuul.opendev.org/t/zuul/build/925bfe37815144d0859f260605d5fb98 is the build for that I think | 22:54 |
clarkb | note the site preview url is straight to storage.gra.cloud.ovh.net | 22:54 |
mnaser | clarkb: the zuul website changes should be good for that | 22:55 |
mnaser | or single change. I haven’t gotten around finalizing that | 22:55 |
clarkb | mnaser: https://zuul.opendev.org/t/zuul/build/925bfe37815144d0859f260605d5fb98 is what I'm looking at for that is that wrong? | 22:55 |
mnaser | clarkb: yes that’s the right one | 22:55 |
clarkb | mnaser: ok the site preview for that is straight to the ovh swift files not zp | 22:56 |
clarkb | and that doesn't work (as expected) | 22:56 |
clarkb | maybe I need to manually construct the zp url? | 22:56 |
fungi | right, like i said, i don't think the zuul-web previews are using zuul-preview (yet) | 22:56 |
mnaser | clarkb: yeah I haven’t pushed up a patch to return that as an artifact. I have to return both | 22:56 |
clarkb | mnaser: gotcha, do you know what the url format is in that case? | 22:56 |
corvus | clarkb: http://site.925bfe37815144d0859f260605d5fb98.zuul.zuul-preview.opendev.org/ | 22:57 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run a noop on all zookeeper servers first https://review.opendev.org/720847 | 22:57 |
mnaser | ^^ | 22:57 |
mordred | corvus, clarkb : ^^ that should fix the zookeeper thing | 22:57 |
clarkb | mnaser: corvus thanks! and that seems to work for me so I think zp is good | 22:58 |
mordred | (the playbook being unhappy - not the important zk thing) | 22:58 |
mnaser | artifact_type.build_id.tenant_id.zuul-preview.opendev.org is the format. Thanks corvus | 22:58 |
clarkb | mordred: should you add !disabled to that? | 22:58 |
corvus | clarkb: there's a comment explaining why not :) | 22:58 |
mordred | clarkb: no - left that off on purpose (and wrote a comment explaining) | 22:58 |
clarkb | heh I should read | 22:59 |
corvus | mordred: that's super weird that it works with --limit | 22:59 |
mordred | corvus: I agree | 22:59 |
mordred | I think it's a super weird behavior in general | 22:59 |
corvus | mordred: i guess it's some sort of "well, since it's limited, we know we're not going to update the data, so we should just start with the cache" | 22:59 |
corvus | mordred: but also, it could just be "no one understands this" | 22:59 |
mordred | yeah | 22:59 |
mordred | fwiw - /root/foo.yaml on bridge is what I used to verify | 23:00 |
corvus | i'm pretty sure the zookeeper images on dockerhub are being rebuilt with the same tags | 23:01 |
corvus | 3.6.0 is still the only 3.6, but it's 11 hours old | 23:01 |
corvus | and i know we ran a 3.6.0 longer ago than that | 23:02 |
clarkb | everything succeeded but zk in that pass and zk failed for unrelated reasons and is already running newer docker-compose | 23:02 |
* clarkb updates etherpad but things seem happy now | 23:02 |
corvus | what happened the last time we tried 3.6.0? | 23:03 |
mordred | clarkb: what's the cantrip for making a fake rsa key for test data? | 23:03 |
clarkb | mordred: ssh-keygen -p'' ? | 23:03 |
clarkb | mordred: zuul quickstart should have it for gerrit things | 23:04 |
clarkb | corvus: I don't remember being around for that, but could it have been upgrade concerns? | 23:04 |
mordred | clarkb: thanks | 23:04 |
clarkb | like maybe 3.4 -> 3.6 isn't doable in rolling fashion? | 23:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 23:06 |
corvus | clarkb: it is doable, but something was preventing a quorum from forming on 3.6 | 23:07 |
mordred | corvus: I understand the applytest race condition. I think I can live with it until I rework that job | 23:09 |
fungi | mordred: clarkb: ssh-keygen -p'' just sets the private key to not encrypted. are you looking for something like gnupg's --debug-quick-random option for creating insecure test keys? | 23:10 |
fungi | or are you really just looking for a key which doesn't require a passphrase to unlock? | 23:11 |
corvus | i think maybe tomorrow we might want to do some testing-in-prod on the zk cluster, because i can't replicate the problem locally, nor do i see any problem running 3.6.0 (at least, the latest version of that image) | 23:16 |
*** tosky has quit IRC | 23:21 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Make applytest files outside of system-config https://review.opendev.org/720848 | 23:21 |
mordred | corvus: I support that | 23:21 |
mordred | corvus: also - I decided I was too annoyed by the applytest race - so that ^^ should fix it | 23:21 |
mordred | fungi: really just needed some rsa key data to put into the testing "private" key hostvars so that the role would write something to disk in the integration test jobs | 23:22 |
mordred | corvus: I mean - assuming that runs at all - I think it should fix the race :) | 23:22 |
mordred | fungi: if you have some brainpellets - 720848 could use some eyeball powder | 23:23 |
mordred | fungi: for context - we keep seeing occasional failures like: https://zuul.opendev.org/t/openstack/build/64e6d48f114d43979502b21ca6d626ac/log/applytest/puppetapplytest21.final.out.FAILED | 23:23 |
fungi | mordred: got it, so one-time key generation, not rapid/repetitive key generation in a job | 23:23 |
mordred | fungi: yeah | 23:24 |
mordred | ssh-keygen did the trick | 23:24 |
fungi | cool | 23:24 |
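For the record, generating a throwaway unencrypted RSA key for test fixtures is a one-liner (filename and comment are arbitrary):

```shell
# -N '' means no passphrase; this key is only ever test data
ssh-keygen -t rsa -b 2048 -N '' -C 'integration-test-only' -f ./test_id_rsa
```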
fungi | mordred: yeah, first instinct on that is some sort of race on directory creation/deletion | 23:27 |
openstackgerrit | Merged opendev/system-config master: Run a noop on all zookeeper servers first https://review.opendev.org/720847 | 23:42 |