openstackgerrit | Adam Coldrick proposed opendev/storyboard master: Add an author_id parameter to the events endpoint https://review.opendev.org/726264 | 00:13 |
---|---|---|
openstackgerrit | Merged opendev/system-config master: Organize zuul jobs in zuul.d/ dir https://review.opendev.org/722394 | 00:18 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: cabal-test: add build target job variable https://review.opendev.org/726266 | 00:22 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add build target job variable https://review.opendev.org/726267 | 00:22 |
ianw | the arm64v8/ubuntu image works, and has binaries in the littleaarch64 format | 00:25 |
ianw | i guess this must actually come from the alpine arm64 images? | 00:31 |
ianw | hrm, the python 3.7-slim container seems to work | 00:33 |
clarkb | ianw: but not 3.8? | 00:34 |
ianw | ahh, yeah 3.8 seems to work too ... just installing binary tools | 00:34 |
ianw | looks right : /bin/ls: file format elf64-littleaarch64 | 00:36 |
ianw | ahhhhh ... objdump is leading me astray | 00:39 |
ianw | Machine: AMD x86-64 | 00:39 |
ianw | elfutils sees it | 00:39 |
ianw | of course, the host objdump doesn't understand the e_machine type set in the elf header | 00:39 |
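For anyone retracing the debugging above: the host's objdump simply doesn't recognize the foreign e_machine value, so checking the ELF header with readelf or eu-readelf (which report the target machine even when it differs from the host) is more reliable. A minimal sketch, assuming an arm64-capable host or binfmt emulation; the image name is only an example:

```sh
# Inspect a known binary's ELF header inside the image under test.
docker run --rm arm64v8/python:3.8-slim sh -c \
  'apt-get update -qq >/dev/null && apt-get install -y -qq binutils >/dev/null && readelf -h /bin/ls | grep Machine'
# An arm64 image reports "Machine: AArch64"; an amd64 one reports
# "Machine: Advanced Micro Devices X86-64".
```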
openstackgerrit | Ian Wienand proposed opendev/system-config master: Build multi-arch python-base/python-builder https://review.opendev.org/726263 | 01:14 |
ianw | clarkb: hahaha " ianw in theory zk on arm should be easy because jvm at least" ... https://github.com/31z4/zookeeper-docker/pull/90 | 01:18 |
clarkb | hrm that kinda makes me think doing debian + zk tarballs might be easiest :/ | 01:42 |
ianw | #11 [linux/amd64 builder 3/3] RUN assemble uWSGI | 02:06 |
ianw | #11 592.5 Created wheel for uWSGI: filename=uWSGI-2.0.18-cp37-cp37m-linux_aarch64.whl size=529535 sha256=5ae5fc0c691bd90c6dda8730f5a746c6ae698db0a2d21dd3da42fdb2d701ae18 | 02:07 |
ianw | ummmm why would the amd64 build create an aarch64.whl ... | 02:07 |
ianw | here it swapped, the arm64 builder started using the amd64 image i think | 02:31 |
openstackgerrit | Merged opendev/system-config master: nodepool-builder: fix servername https://review.opendev.org/726035 | 03:13 |
*** ykarel|away is now known as ykare | 04:18 | |
ianw | mordred / corvus : so ... i've learnt a lot but not enough :) i've left comments in https://review.opendev.org/#/c/726263 | 05:28 |
ianw | in short, i think the intermediate registry is somewhat randomly returning either the amd64 or arm64 container | 05:29 |
ianw | https://storyboard.openstack.org/#!/story/2007642 has links to logs showing this | 05:29 |
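The symptom ianw describes (the intermediate registry handing back either the amd64 or arm64 image for the same tag) can be narrowed down by checking what a tag's manifest actually advertises. A rough sketch using skopeo; the registry host and repository are placeholders, and a real intermediate registry would also need credentials passed via --creds:

```sh
# Dump the raw manifest for a tag; a healthy multi-arch tag is a manifest list
# with one entry per platform, while a single-arch push shows only one.
skopeo inspect --raw --tls-verify=false \
  docker://insecure-ci-registry.example.org:5000/opendevorg/python-builder:latest \
  | python3 -m json.tool | grep -A 2 '"platform"'
```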
*** dpawlik has joined #opendev | 05:55 | |
*** ysandeep|afk is now known as ysandeep | 06:12 | |
*** DSpider has joined #opendev | 07:01 | |
*** tosky has joined #opendev | 07:17 | |
*** dtantsur|afk is now known as dtantsur | 07:39 | |
*** ralonsoh has joined #opendev | 07:44 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:02 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:08 |
*** bhagyashris|ruck has joined #opendev | 08:09 | |
openstackgerrit | Merged openstack/project-config master: Retire syntribos - Step 1 https://review.opendev.org/726237 | 08:10 |
*** tkajinam has quit IRC | 08:13 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:14 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:18 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:28 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:29 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 08:36 |
*** ysandeep is now known as ysandeep|lunch | 08:47 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 09:01 |
*** ykare is now known as ykarel | 09:03 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: DNM: return linenumber in matchplay https://review.opendev.org/726312 | 09:09 |
*** kevinz has joined #opendev | 09:11 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add upload-artifactory role https://review.opendev.org/725678 | 09:24 |
*** ysandeep|lunch is now known as ysandeep | 09:51 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add ansible-lint rule to check owner and group is not preserved https://review.opendev.org/724855 | 10:09 |
*** owalsh has quit IRC | 10:15 | |
*** owalsh has joined #opendev | 10:22 | |
*** ykarel is now known as ykarel|lunch | 10:23 | |
*** sshnaidm|afk is now known as sshnaidm|off | 10:31 | |
*** ykarel|lunch is now known as ykarel | 11:08 | |
*** Toshimichi-F82 has joined #opendev | 11:50 | |
*** Toshimichi-F82 has quit IRC | 11:50 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Build multi-arch python-base/python-builder https://review.opendev.org/726263 | 12:07 |
*** tkajinam has joined #opendev | 12:27 | |
*** lpetrut has joined #opendev | 12:33 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: cabal-test: add build target job variable https://review.opendev.org/726266 | 12:38 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add build target job variable https://review.opendev.org/726267 | 12:38 |
*** bolg has quit IRC | 12:49 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Explicitly target arm64 image on nb04 https://review.opendev.org/726376 | 13:00 |
*** ysandeep is now known as ysandeep|brb | 13:02 | |
*** ysandeep|brb is now known as ysandeep | 13:10 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run accessbot less frequently https://review.opendev.org/726379 | 13:15 |
*** ykarel is now known as ykarel|afk | 13:23 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: cabal-test: add build target job variable https://review.opendev.org/726266 | 13:38 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add build target job variable https://review.opendev.org/726267 | 13:38 |
fungi | disappearing for the weekly grocery pickup, but should hopefully be back before too long | 13:45 |
*** ysandeep is now known as ysandeep|away | 13:49 | |
*** mlavalle has joined #opendev | 13:57 | |
openstackgerrit | Merged opendev/system-config master: Run accessbot less frequently https://review.opendev.org/726379 | 14:02 |
*** lpetrut has quit IRC | 14:08 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add build target job variable https://review.opendev.org/726267 | 14:11 |
*** hashar has joined #opendev | 14:13 | |
*** hashar is now known as hasharAway | 14:14 | |
dmsimard | Where would I remove myself from infra-root emails like the crons ? | 14:24 |
mordred | dmsimard: it's in a private ansible var on bridge ... do you also want to not be infra-root? or just want to avoid emails? | 14:26 |
dmsimard | mordred: I sent an email recently: http://lists.openstack.org/pipermail/openstack-infra/2020-May/006627.html | 14:28 |
dmsimard | I'll be around but I won't be able to contribute meaningfully | 14:31 |
dmsimard | need to afk, be back in a bit | 14:31 |
*** ykarel|afk is now known as ykarel | 14:33 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Revert "ensure-tox: use venv to install" https://review.opendev.org/726404 | 14:36 |
mordred | dmsimard: ah - nod, yes. I think I was too sad you were going to register that we'd need to remove you from things :) | 14:39 |
mordred | dmsimard: I have removed you from root emails - will you have time to send in a patch to remove yourself from root in the various places in system-config? | 14:40 |
*** lpetrut has joined #opendev | 14:41 | |
mordred | corvus: when you have a second, feel like reviewing https://review.opendev.org/#/c/726263/ ? | 14:49 |
*** lpetrut_ has joined #opendev | 14:49 | |
*** lpetrut has quit IRC | 14:52 | |
openstackgerrit | Merged zuul/zuul-jobs master: Revert "ensure-tox: use venv to install" https://review.opendev.org/726404 | 14:58 |
dmsimard | mordred: yeah, I'll send patches and hope I don't forget anything | 14:58 |
dmsimard | Haven't got around to it yet | 14:59 |
mordred | dmsimard: kk. | 14:59 |
avass | tobiash: could you take a look at the callback config change whenever you have time? https://review.opendev.org/#/c/717260/ | 15:04 |
tobiash | avass: sure, I'll have a look at it later today | 15:04 |
avass | tobiash: I'll just continue nagging you about it if you forget it again :) | 15:05 |
avass | oh, oops missed I was in opendev and not in zuul, sorry for the noise | 15:07 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Revert "Revert "ensure-tox: use venv to install"" https://review.opendev.org/726413 | 15:12 |
*** lpetrut_ has quit IRC | 15:14 | |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Revert "Revert "ensure-tox: use venv to install"" https://review.opendev.org/726413 | 15:14 |
mordred | avass: make as much noise as you want ;) | 15:15 |
mordred | clarkb: when you're up and moving - I have 3 patches up with an amended idea of what we can/should do with arm nodepool builders: https://review.opendev.org/#/q/topic:arm64-specific-tags | 15:17 |
mordred | the first system-config patch I think we should do regardless - but I think the second two will actually be a better experience end to end | 15:17 |
*** hasharAway has quit IRC | 15:20 | |
*** hasharAway has joined #opendev | 15:21 | |
*** tkajinam has quit IRC | 15:22 | |
clarkb | looks like the zuul.d change landed? | 15:23 |
clarkb | I have a few changes to rebase now then :) | 15:23 |
clarkb | mordred: looking at https://review.opendev.org/#/c/726372/1/.zuul.yaml is that going to break the whole manifest thing? | 15:24 |
clarkb | because we'll push zuul/nodepool as arm64 only? | 15:24 |
clarkb | related should we fix the manifest mixups ianw was noticing before landing https://review.opendev.org/#/c/726263/3/zuul.d/docker-images/python.yaml ? | 15:25 |
clarkb | (a lot of how docker handles the metadata around this is still foreign to me which is why I'm asking questions) | 15:25 |
clarkb | ok I think I've answered my first question. We tag the arm64-specific builds as arm64. So my remaining concern is whether you can get that image by pulling latest | 15:30 |
clarkb | and I don't think you can, so we should be ok | 15:30 |
avass | mordred: why do we no_log looking for venv anyway? | 15:31 |
avass | I guess I did it again :) | 15:31 |
clarkb | mordred: given ^ should we maybe revert the multiarch build so that we don't mix things up there and have working x86 too? then separately build the arm64 tagged images as in https://review.opendev.org/#/c/726372/1/.zuul.yaml. Then once the mixups are fixed we can go back to multiarch for all of it? | 15:31 |
*** dpawlik has quit IRC | 15:32 | |
*** hasharAway has quit IRC | 15:32 | |
*** roman_g has joined #opendev | 15:33 | |
*** diablo_rojo has joined #opendev | 15:35 | |
diablo_rojo | corvus, I went to go make a meetpad and it seems there is an issue? It's throwing a 404: https://meetpad.opendev.org/virtual-ussuri-celebration. Apologies if I missed something about that in the channel logs. | 15:36 |
clarkb | https://etherpad.opendev.org/p/virtual-ussuri-celebration is working at least | 15:38 |
clarkb | [error] 225#225: *2719 open() "/usr/share/jitsi-meet/virtual-ussuri-celebration" failed (2: No such file or directory) | 15:40 |
clarkb | diablo_rojo: corvus: that looks like webserver misconfiguration? | 15:41 |
clarkb | our nginx config has 'root /usr/share/jitsi-meet;' then for / we just turn on ssi | 15:44 |
diablo_rojo | clarkb, interesting, so relatively easy fix? | 15:44 |
clarkb | diablo_rojo: maybe? I don't fully understand this yet | 15:44 |
diablo_rojo | clarkb, that makes two of us lol | 15:45 |
clarkb | we also have a location match for /[:alnum:]+ rewriting to / | 15:45 |
clarkb | which I would've expected this to hit | 15:45 |
clarkb | which should force everything through index.html and the js | 15:45 |
clarkb | it's weird to me that nginx is looking on disk for that path given ^ | 15:46 |
clarkb | jitsi's images were updated 18 hours ago | 15:47 |
clarkb | but our nginx container is 2 days old | 15:47 |
clarkb | maybe they got out of sync? corvus do you understand what nginx should be doing there? | 15:47 |
fungi | it was working after the http->https redirect config change merged, or so i thought | 15:48 |
fungi | i don't recall any further changes we made after that | 15:48 |
clarkb | fungi: did we test that a meetpad worked or just / ? it looks like / is working fine | 15:50 |
fungi | clarkb: diablo_rojo: it's something to do with the hyphens, i think | 15:50 |
clarkb | it's at the next step that it tries to load padnames off of disk | 15:50 |
fungi | room/pad names with no hyphens work fine for me, but with hyphens i get the 404 | 15:50 |
fungi | so maybe we're missing - in the redirect character set | 15:51 |
fungi | like /[:alnum:]+ probably doesn't cover - | 15:51 |
clarkb | fungi: ah ya the exact regex is location ~ ^/([a-zA-Z0-9=\?]+)$ { | 15:51 |
clarkb | which is slightly more than alnum but no - | 15:51 |
clarkb | I bet that is it | 15:51 |
fungi | so quick workaround is don't use hyphens, but we'll probably have a patch merged to add them in moments | 15:52 |
clarkb | fungi: we'll need to crosscheck that jitsi can actually handle hyphens on its side (we know etherpad can) | 15:52 |
fungi | good point | 15:52 |
fungi | easiest way is probably just to try. i mean, it's currently broken with hyphens anyway | 15:53 |
fungi | we can presumably hand-edit the config and restart the nginx container? | 15:53 |
clarkb | https://meet.jit.si/foo-bar-opendev seems valid | 15:53 |
clarkb | so ya I think we just edit that regex and it will be good | 15:53 |
fungi | your test there is good enough to convince me. do you have a change in progress or shall i start one? | 15:54 |
clarkb | I don't | 15:54 |
clarkb | you should go for it since you figured it out | 15:54 |
fungi | working on it now, in that case | 15:54 |
fungi | heh, that sounds like incentive for me to stop figuring things out ;) | 15:54 |
clarkb | fungi: it's the meet.conf file somewhere in system-config (I was looking on the prod server) | 15:54 |
diablo_rojo | This was an excellent exchange to follow along with :) | 15:55 |
fungi | we have docker/jitsi-meet/web/rootfs/defaults/meet.conf and playbooks/roles/jitsi-meet/files/meet.conf | 15:55 |
clarkb | fungi: the second one. (the first is image defaults which we inherited from upstream and second is our site config) | 15:55 |
clarkb | though if our site config update works we should update the docker image too | 15:56 |
fungi | aha, yep, that is the conclusion i came to as well after diffing | 15:56 |
*** ykarel is now known as ykarel|away | 15:57 | |
fungi | any reason not to just edit them both in the same change? | 15:57 |
fungi | any other characters we should add? maybe _ | 15:58 |
fungi | etherpad.opendev.org and meet.jot.si both support _ based on my testing | 16:00 |
fungi | s/jot/jit/ | 16:00 |
clarkb | fungi: only that if it doesn't work for some reason it's more untangling, but chances seem high it will work, and _ seems like a good addition too | 16:01 |
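The fix drafted below (https://review.opendev.org/726433) amounts to adding `-` and `_` to the character class in the location regex quoted above. Once deployed, a quick spot-check is to confirm that hyphenated and underscored room names no longer 404; the room names here are only examples:

```sh
# Each of these should return 200 (the jitsi-meet index) rather than 404 once
# the amended location regex is in place.
for room in plainroom virtual-ussuri-celebration room_with_underscores; do
  printf '%-30s ' "$room"
  curl -s -o /dev/null -w '%{http_code}\n' "https://meetpad.opendev.org/$room"
done
```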
*** roman_g has quit IRC | 16:02 | |
clarkb | fungi: I've actually got a related change I need to rebase after the reorg | 16:03 |
openstackgerrit | David Moreau Simard proposed opendev/system-config master: Remove dmsimard from infra-root https://review.opendev.org/726429 | 16:05 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run jobs prod test jobs when docker images update https://review.opendev.org/720030 | 16:05 |
clarkb | fungi: ^ that change will add a bit more testing. I don't think it's critical to land your change but we should try to get that in for future stuff | 16:05 |
openstackgerrit | David Moreau Simard proposed openstack/project-config master: Remove dmsimard from accessbot https://review.opendev.org/726431 | 16:07 |
fungi | clarkb: should i stack mine on that? | 16:08 |
fungi | happy to review and merge that asap if it's working | 16:08 |
clarkb | fungi: sure | 16:09 |
clarkb | dmsimard: one minor thing on https://review.opendev.org/#/c/726429/1 due to how the automation works | 16:09 |
dmsimard | yup, looking | 16:10 |
*** cmurphy is now known as cmorpheus | 16:10 | |
dmsimard | makes sense | 16:10 |
clarkb | dmsimard: I'll put that higher on my list to do | 16:11 |
clarkb | (you don't need to worry about doing that rotation) | 16:11 |
dmsimard | I had the feeling there was something like that because of the previous key named after a date | 16:11 |
dmsimard | I won't do anything mean even if it takes a bit, I promise <3 | 16:11 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Support hyphens and underscores for meetpad rooms https://review.opendev.org/726433 | 16:12 |
openstackgerrit | David Moreau Simard proposed opendev/system-config master: Remove dmsimard from infra-root https://review.opendev.org/726429 | 16:13 |
clarkb | dmsimard: +2 on both changes now. I'm sad to do that but understand at the same time. Hope your new endeavors go well | 16:22 |
fungi | clarkb: do you think it's safe to stop the dstat process on lists.o.o now? most recent oom was 2020-04-28 10:18:10 | 16:23 |
clarkb | fungi: yes I think we can consider that fixed now | 16:24 |
dmsimard | clarkb: thanks! | 16:24 |
* fungi updates the sign on the wall to "10 days since our last mailman oom" | 16:24 | |
fungi | #status log terminated dstat process on lists.o.o after 10 days with no oom | 16:26 |
openstackstatus | fungi: finished logging | 16:26 |
fungi | corvus: just a heads up you might want to take a look at 726433 for the meetpad service... i assume those weren't omitted intentionally, but would rather be sure | 16:30 |
fungi | (and alternatively, if there are other characters you think we shuold also add, i'm happy to amend the change to do that) | 16:31 |
clarkb | infra-root today is the container restart day | 16:32 |
clarkb | I think we should start with gerrit since that should be a quick one? I'm not fully up to speed on all the changes that were made to gerrit so it would be good if those that are can at least be around? | 16:32 |
fungi | i have some containers in the fridge with leftovers in them, happy to add them to the restarts | 16:32 |
clarkb | I think we want to do `docker-compose down && docker-compose up -d` as ansible should already have a pretty up to date image there? | 16:33 |
fungi | but yes, i think the gerrit restart will also mean we finally stop replicating to github | 16:33 |
clarkb | yup I believe that is the major change | 16:33 |
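For the record, the restart being discussed is just the compose down/up pair plus a couple of sanity checks; the /etc/gerrit-compose path and the replication.conf check match what actually gets done later in this log, while tailing the container log is simply one way to watch for the "gerrit is ready" message:

```sh
# Restart the gerrit container using the image ansible has already pulled.
cd /etc/gerrit-compose
sudo docker-compose down && sudo docker-compose up -d

# Sanity checks: no github remotes left in the replication config, and the
# container log eventually reports that gerrit is ready.
grep -i github /home/gerrit2/review_site/etc/replication.conf || echo "no github remotes"
sudo docker-compose logs -f
```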
clarkb | and for zuul I think we can get away with just restarting the scheduler, but can likely do a full restart of everything if that is helpful. I guess mergers would like the jemalloc update too? | 16:34 |
fungi | clarkb: was patchset 3 for 720030 just a rebase? | 16:35 |
fungi | (the interdiff is... large) | 16:35 |
clarkb | fungi: yes to deal with the zuul.d reorg | 16:35 |
fungi | clarkb: i expect that restarting all zuul services would be appreciated by the zuul maintainers, in preparation for finally tagging? | 16:36 |
clarkb | I expect I'll be in a good spot to help with all that at about 1800UTC. Kids have class soon then I'll be without that distraction | 16:36 |
clarkb | fungi: I think tagging was dependent on the zk tls stuff? testing for that is still in progress | 16:36 |
fungi | oh, okay. i thought that was still waiting on us | 16:36 |
clarkb | fungi: well it's testing of it in opendev | 16:37 |
fungi | maybe it's still waiting on us to implement our side of it | 16:37 |
clarkb | fungi: I think all of the config management stuff is now done except for nb03 and now its simply a matter of making tls work in opendev | 16:37 |
clarkb | "simply" | 16:37 |
fungi | in that case restarting all of zuul seems less urgent, but maybe the mergers at least | 16:37 |
fungi | in addition to scheduler | 16:37 |
fungi | we did the executors recently-ish | 16:38 |
* fungi checks | 16:38 | |
AJaeger | dmsimard: I suggest you remove yourself from the channels, one more change needed on https://review.opendev.org/#/c/726431/1. | 16:38 |
AJaeger | dmsimard: all the best and thanks! | 16:39 |
fungi | clarkb: oh, actually the last mass restart of executors was 2020-04-25 | 16:39 |
fungi | so nearly two weeks ago | 16:40 |
dmsimard | AJaeger: good catch, will fix, thanks for your ever vigilant reviews :D | 16:40 |
openstackgerrit | David Moreau Simard proposed openstack/project-config master: Remove dmsimard from accessbot https://review.opendev.org/726431 | 16:41 |
*** dtantsur is now known as dtantsur|afk | 16:42 | |
fungi | clarkb: though the last change to merge for anything under zuul/executor/ was on 2020-04-16 | 16:42 |
mordred | clarkb: no - we shouldn't revert the multi-arch - and the manifest mixups are a thing we should investigate but aren't immediately germane. the nodepool patch won't upload zuul/nodepool as arm64 only - it'll upload zuul/nodepool:latest as amd64+arm64 and then upload zuul/nodepool:arm64 as arm64 only. doing that will let us attempt actually getting rid of our arm64 control plane node and just using x86 for | 16:47 |
mordred | all of them. just chatted with corvus about an idea for a followup that I'll write up in just a sec | 16:47 |
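A minimal sketch of what the arch-specific tag buys, per mordred's description above; the zuul/nodepool:arm64 tag name is taken from that description, and a stock debian image is used for the architecture check only because it is safe to assume it can run uname:

```sh
# The arm64-only tag is a plain single-arch manifest, so pulling it on an x86
# host fetches the arm64 image without any --platform handling.
docker pull zuul/nodepool:arm64

# With qemu binfmt handlers registered on the x86 host, a foreign-arch image
# runs transparently under emulation.
docker run --rm arm64v8/debian:buster uname -m    # prints aarch64 on an x86_64 host
```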
*** slittle1 has quit IRC | 16:48 | |
AJaeger | clarkb: want to review https://review.opendev.org/726431 again? | 16:53 |
*** slittle1 has joined #opendev | 16:58 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Revert "Revert "ensure-tox: use venv to install"" https://review.opendev.org/726413 | 17:04 |
openstackgerrit | Merged openstack/project-config master: Remove dmsimard from accessbot https://review.opendev.org/726431 | 17:08 |
*** slittle1 has quit IRC | 17:08 | |
*** slittle1 has joined #opendev | 17:14 | |
clarkb | mordred: is there any concern that we won't be able to properly update our x86 nodepool launchers while we sort out the mixup? I'm trying to understand the breadth of this | 17:22 |
mordred | clarkb: nope. they should all be fine | 17:22 |
clarkb | mordred: eg restarting nb01.opendev.org right now might make it unhappy? which isn't the end of the world because we'll just use existing images for a while? | 17:22 |
clarkb | k, I need to pop out for a bit but will return to rereview those with new info and then hopefully do the planned service restarts? | 17:23 |
mordred | clarkb: it shouldn't - the only mixup in the mix may have something to do with copying images around between intermediate and buildset registries. or it may not - but the published images should be fine | 17:23 |
mordred | clarkb: yay service restarts | 17:23 |
clarkb | mordred: gotcha so pulling from dockerhub in production should be fine, it's the test infrastructure that gets confused | 17:23 |
clarkb | (so that inhibits our ability to know if things will work for prod but we can yolo if we want) | 17:24 |
mordred | clarkb: yeah - and there we don't actually know what the issue is - we only saw it that one time | 17:24 |
clarkb | ok I was worried docker hub had the same issue, understanding it does not or shouldn't helps clarify things for me | 17:24 |
mordred | ++ | 17:24 |
clarkb | alright back in a bout 20 minutes | 17:25 |
mordred | clarkb: there's also basically 2 different things in that stack | 17:25 |
mordred | the first is just making multi-arch python-base - because we DO install a platform-dependent thing in python-base, so we need to anyway. | 17:25 |
mordred | the second two are for attempting to try something new to ultimately allow us to get out of the business of running an arm64 control plane host - but we need to trial-run that still | 17:26 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Remove install-* roles https://review.opendev.org/719322 | 17:29 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Fail and direct user to use ensure-* version of roles https://review.opendev.org/726448 | 17:29 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Revert "Revert "ensure-tox: use venv to install"" https://review.opendev.org/726413 | 17:31 |
*** slittle1 has quit IRC | 17:32 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Remove install-* roles https://review.opendev.org/719322 | 17:33 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: DNM: return linenumber in matchplay https://review.opendev.org/726312 | 17:34 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Fix bad path in ansible-lint test job files https://review.opendev.org/726449 | 17:35 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: DNM: return linenumber in matchplay https://review.opendev.org/726312 | 17:37 |
*** slittle1 has joined #opendev | 17:38 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: ansible-lint-rules: Fix bad path and filename https://review.opendev.org/726449 | 17:39 |
clarkb | ok I was a bit slower than I expected but here now | 18:01 |
clarkb | infra-root does nowish work for gerrit restart? | 18:01 |
clarkb | I'm going to check that the configs updated on review for replication changes as expected | 18:01 |
clarkb | I see no github stuff in review.o.o:/home/gerrit2/review_site/etc/replication.conf | 18:02 |
clarkb | mordred: fungi corvus ^ if you think we are good to proceed I'll go notify the release team and can do the docker-compose down && docker-compose up -d as well. Mostly just hoping I'm not the only set of eyeballs if something goes sideways | 18:02 |
*** ralonsoh has quit IRC | 18:08 | |
mordred | clarkb: yes - I think we're good to proceed - and I'm here | 18:09 |
fungi | clarkb: yep, i'm around | 18:09 |
clarkb | mordred: cool, is my `docker-compose down && docker-compose up -d` the process you expect to be used as well? I don't think I need to pull as the latest image should already be there? | 18:09 |
fungi | sorry, was just rewiring the networking at my workbench for a minute | 18:09 |
clarkb | I've notified the release team, haven't heard any complaints, and I didn't see any release jobs when I checked 10 minutes ago either | 18:10 |
clarkb | I think I'm good to type the commands if those commands look good to you all | 18:10 |
mordred | clarkb: I also don't expect to need a pull | 18:12 |
mordred | clarkb: so - yes | 18:12 |
clarkb | ok I'm starting with those commands in review.o.o:/etc/gerrit-compose now (I'll sudo them too) | 18:12 |
*** roman_g has joined #opendev | 18:12 | |
clarkb | commands have been run; gerrit should be coming back up again now | 18:13 |
* fungi reloads browser impatiently | 18:14 | |
clarkb | logs report gerrit is ready | 18:14 |
fungi | there it is | 18:14 |
clarkb | apache seems to agree now as well | 18:14 |
fungi | lgtm | 18:14 |
clarkb | I guess the next thing to do is confirm no github replication? I don't know how to do that directly though | 18:15 |
clarkb | I'm mostly ok assuming the config file being up to date is sufficient :) | 18:15 |
fungi | event stream | 18:15 |
fungi | but yeah, i suspect it's fine | 18:15 |
clarkb | that takes us to zuul | 18:15 |
clarkb | mordred: earlier fungi and I were discussing if we want to do a full zuul restart or just mergers and scheduler to pick up the jemalloc removal in the containers | 18:16 |
clarkb | corvus: ^ you may also have thoughts on this | 18:16 |
mordred | clarkb: I think I defer to corvus on that one | 18:16 |
fungi | to reiterate, we did a mass executor restart on april 25 | 18:17 |
fungi | there haven't been any new changes merged under zuul/executor/ since april 16 | 18:17 |
openstackgerrit | Merged opendev/system-config master: Run jobs prod test jobs when docker images update https://review.opendev.org/720030 | 18:18 |
fungi | but maybe there are other reasons we might want a restart there | 18:18 |
clarkb | this is me thinking out loud here: maybe we should update our restart zuul globally playbook to handle the new container situation if not done already. Then just restart everything since we have a playbook to do it? | 18:18 |
clarkb | restarting the scheduler means we'll lose all running jobs anyway so we can't optimize for keeping jobs around | 18:19 |
clarkb | and usually that's the thing we optimize for with restarts on subsets of services iirc | 18:19 |
corvus | i can't think of a reason for executor restart, but a scheduler restart would be great | 18:20 |
corvus | (sorry, i'm not fully around atm) | 18:20 |
clarkb | zuul_restart.yaml has been updated for the new container situation | 18:21 |
mordred | clarkb: I believe it ... yes | 18:21 |
mordred | clarkb: I'm not 100% sure we've added queue saving and restoring there | 18:22 |
clarkb | mordred: ya that bit is still manual | 18:22 |
*** roman_g has quit IRC | 18:22 | |
clarkb | also I think there is a bug in the start.yaml playbook | 18:22 |
clarkb | we start scheduler before mergers | 18:22 |
clarkb | I believe mergers have to be first to enable the scheduler to load configs | 18:22 |
fungi | i don't see a ton of reason to spend time optimizing our restarts for queue preservation anyway, that effort's better spent getting the high-availability work finished so we never have to think about it again | 18:24 |
fungi | after all, we can use the periodic queue snapshots to manually reenqueue stuff after the scheduler restart occurs anyway. it's a fairly idempotent process | 18:25 |
* clarkb is trying to find the old playbook and derping with git. but double checking mergers first thought | 18:25 | |
clarkb | the old playbook used the current order so I expect it is actually fine as is | 18:26 |
clarkb | considering that, this seems like a good time to exercise the new playbook and just restart everything? | 18:26 |
clarkb | and now I've got to learn what the proper way to run ansible on bridge is (so ya this is a good exercise) | 18:30 |
clarkb | fungi: mordred: I've started a root screen on bridge (sorry about the terminal size) | 18:31 |
clarkb | I'm in /home/zuul/src/opendev.org/openstack/system-config/playbooks and I think I want to run ansible-playbook -f 20 ./zuul_restart.yaml. And before running that we need to grab queues so they can be restored? | 18:32 |
clarkb | maybe someone else can do the queues and I'll run the playbook if that plan sounds good? | 18:35 |
fungi | i've joined | 18:36 |
clarkb | hrm one more thing to check, if the playbook will wait for zuul executors to properly stop before starting them again | 18:37 |
*** avass has quit IRC | 18:37 | |
fungi | just a sec i'll take a look at the current queue state and work out what likely needs to be grabbed | 18:37 |
fungi | vexxhost has a couple changes in flight which will probably clear in the next few minutes | 18:38 |
clarkb | fungi: k I think I found a bug so we aren't in a rush :) | 18:38 |
fungi | other than that, only the openstack tenant has any real activity at the moment | 18:38 |
clarkb | on the start side we include_role zuul-executor then limit to start.yaml but there is no start.yaml unlike the other services | 18:39 |
clarkb | I think we can either add a start.yaml that just ensures service is running, or replace the include role with a task that does that | 18:39 |
clarkb | I'll write a change for start.yaml so it is symmetric with the other roles | 18:40 |
clarkb | then maybe we land that and do this after PDT lunch? | 18:41 |
openstackgerrit | Merged opendev/system-config master: Support hyphens and underscores for meetpad rooms https://review.opendev.org/726433 | 18:43 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add a start.yaml to zuul-executor role https://review.opendev.org/726453 | 18:44 |
clarkb | infra-root ^ I'm thinking getting to a point where we can use that playbook will be a good idea now :) | 18:45 |
clarkb | mordred: I'm thinking about docker builds again now. Re https://review.opendev.org/#/c/726372/1/.zuul.yaml that's purely to make testing work? | 18:46 |
clarkb | mordred: any concern that if we land https://review.opendev.org/#/c/726263/3/zuul.d/docker-images/python.yaml before fixing the mixups we'll break a lot more testing? | 18:46 |
clarkb | I think ^ is my big concern with that stack now since a lot of pythony things consume those base images now and we'd be potentially breaking the testing they do? | 18:47 |
fungi | yeah, after pdt lunch wfm, i'm probably going to go run another quick errand around 19:15z or so | 18:48 |
fungi | but should be back before 20z | 18:49 |
mordred | clarkb: no - actually ... | 18:49 |
mordred | clarkb: that is for reals - as in, it occurred to me that with the docker binfmt support, we do not actually need to run arm64 nodepool-builder on arm64 - so that's what that is aiming to allow us to do | 18:50 |
mordred | clarkb: for the other thing - let's circle up with corvus when he's back online | 18:51 |
mordred | clarkb: I was thinking it's not an issue - but you make a good point that if we don't understand what the issue is that we had there, we could be breaking the gate | 18:51 |
clarkb | mordred: because we can't override the arch on docker pull with binfmt but could with a specific tag? | 18:52 |
clarkb | mordred: ya that is my concern with python-base | 18:52 |
clarkb | bah can't type, started figuring out lunch | 18:52 |
mordred | clarkb: yes - that's right - you can tell docker to run a tag that's specifically a different arch | 18:52 |
mordred | clarkb: and if you have the binfmt stuff installed, it actually just works | 18:52 |
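One common way to get "the binfmt stuff installed" on an x86 host is shown below; this illustrates the mechanism and is not necessarily what the zuul-jobs roles or the system-config changes in this log actually do:

```sh
# Register qemu-user-static handlers for foreign ELF formats via binfmt_misc so
# that e.g. aarch64 binaries (and therefore aarch64 containers) run under
# emulation on an x86_64 host.
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Afterwards the kernel hands aarch64 ELF binaries to qemu-aarch64:
head -3 /proc/sys/fs/binfmt_misc/qemu-aarch64
```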
mordred | clarkb: once you're done with lunch, I'll tell you the next even crazier part :) | 18:53 |
corvus | i'm back from exercise for a few mins before lunch | 18:53 |
mordred | corvus: so - tl;dr ... | 18:53 |
mordred | corvus: the thing we were discussing this morning that we shelved for later ... | 18:53 |
mordred | corvus: clarkb has brought up a concern that with us not understanding the issue, and having many things consuming python-base images, that we could be introducing gate breakage | 18:54 |
mordred | corvus: but this might be an after-lunch issue | 18:54 |
corvus | the "sometimes something gets the wrong image arch issue"? | 18:54 |
mordred | yeah | 18:54 |
mordred | corvus: I'm digging in to try to understand that a bit | 18:55 |
corvus | so the suggestion is that we need to figure that out before we publish any multi-arch python-base images? | 18:55 |
mordred | corvus: yeah - out of fear that, since they are base for other things, we might break ourselves | 18:55 |
clarkb | and it could be that adding those images is the fix to what we were seeing | 18:56 |
clarkb | but understanding that first would be good | 18:56 |
corvus | okay. well, i was going to spend today trying to figure out why our zuul and nodepool gate tests don't do anything. | 18:57 |
corvus | but i could switch to this instead if folks think that's more important | 18:57 |
mordred | corvus: I'll dig in to this one for a bit first | 18:57 |
mordred | and see what I can learn while you work on the other thing | 18:57 |
clarkb | (I think the testing of existing stuff might be more important since that helps us with the existing things) | 18:58 |
corvus | okay, i'll plan on working on the gate tests after lunch then. | 18:58 |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Fail and direct user to use ensure-* version of roles https://review.opendev.org/726448 | 19:03 |
*** dpawlik has joined #opendev | 19:03 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Build multi-arch uwsgi images https://review.opendev.org/726458 | 19:06 |
fungi | i'm popping out for a quick errand, should be back in ~45 minutes | 19:11 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Retire x/pbrx - part 1 https://review.opendev.org/726461 | 19:16 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Retire x/pbrx - part 1 https://review.opendev.org/726461 | 19:21 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Finish retiring x/pbrx https://review.opendev.org/726463 | 19:21 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Retire x/pbrx - part 1 https://review.opendev.org/726461 | 19:32 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Finish retiring x/pbrx https://review.opendev.org/726463 | 19:32 |
AJaeger | mordred: and myself just discussed to retire this ^ | 19:33 |
clarkb | infra-root https://review.opendev.org/#/c/726453/ is the change that would be good to review and hopefully land for the zuul restart | 19:33 |
clarkb | as an alternative we can just do a quick scheduler restart and land ^ later | 19:47 |
fungi | back from errand and looking at that change now while scarfing 'za | 19:53 |
clarkb | fungi: I think I'm coming around to simply restarting the scheduler | 19:53 |
clarkb | everything else we can restart with minimal impact | 19:53 |
fungi | is the system-config-run-base-arm64 failure expected? | 19:53 |
clarkb | and its looking like a quiet friday afternoon | 19:54 |
clarkb | fungi: yes | 19:54 |
fungi | it's voting though, so we can't merge it normally | 19:54 |
clarkb | fungi: that pipeline doesn't vote | 19:55 |
clarkb | so its voting +0 basically | 19:55 |
clarkb | (or -0 depending on how you look at it :) ) | 19:55 |
clarkb | fungi: I think I've come around to just doing the scheduler; it's the important bit and would be good to get that behind us | 19:56 |
mordred | clarkb: typo | 19:56 |
fungi | oh... gertty's combining the pipeline reports | 19:56 |
fungi | ignore me | 19:56 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add a start.yaml to zuul-executor role https://review.opendev.org/726453 | 19:57 |
clarkb | mordred: ^ thanks | 19:57 |
*** avass has joined #opendev | 19:59 | |
mordred | corvus: ok - I think we've repeated the issue, and confirmed that it seems to be non-deterministic | 20:01 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/ce8d743c1fa04841a34de543366a3bf1/log/job-output.txt#918 | 20:01 |
corvus | mordred: i'm back | 20:01 |
mordred | corvus: if you look at that, you'll see it fetching the same shas for each arch | 20:01 |
corvus | mordred: what is that job doing specifically? | 20:01 |
mordred | corvus: and ... the other job for that build, https://zuul.opendev.org/t/openstack/build/65c6686dae3544d0b55f18f72199c1e3 - correctly fails due to lack of arm wheels | 20:02 |
mordred | corvus: using the uwsgi builds to test building something using the multi-arch python-base as a parent | 20:02 |
*** diablo_rojo has quit IRC | 20:02 | |
clarkb | fungi: `/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org` seems to grab all the queues, which we can then filter for re-enqueueing if necessary | 20:02 |
mordred | corvus: we should expect the arm builds to fail - because there are no wheels and the images don't install build tools | 20:02 |
clarkb | fungi: so I'm thinking we run that, then docker-compose down && docker-compose up -d in /etc/zuul-scheduler | 20:03 |
mordred | but they "succeed" sometimes - which looks like them saying they're running arm but are actually running x86 | 20:03 |
mordred | corvus: I have not yet sorted _why_ this is happening - that's next | 20:03 |
corvus | mordred: so that link you linked, that's buildkit fetching what should be a multi-arch python-base image from dockerhub, and it's getting the wrong arch? | 20:03 |
mordred | corvus: well - this is a child job of the job that built python-base multi-arch - so it actually should be fetching those images from the buildset registry | 20:04 |
corvus | mordred: :( that's tons of variables | 20:04 |
mordred | yeah. I'm not thrilled about it | 20:04 |
mordred | corvus: mostly just wanted to let you know I did reproduce the issue - albeit not yet in a consistent manner | 20:05 |
corvus | mordred: so to revise: that link is buildkit fetching what should be a multi-arch python-base image from the buildset registry, and it's getting the wrong arch. | 20:06 |
clarkb | infra-root: I think that is my plan now. I'm going to run `/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org > queues.sh ; cd /etc/zuul-scheduler; sudo docker-compose down ; sudo docker-compose up -d`; wait for the scheduler to be up, then execute bash queues.sh. this is straightforward and gets us onto a non-jemalloc scheduler image | 20:06 |
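The plan above, gathered into one sequence for readability (same commands as stated, with nothing added beyond the reminder to wait for the config reload before re-enqueueing):

```sh
# Snapshot the current pipeline contents as re-enqueue commands.
/opt/zuul/tools/zuul-changes.py https://zuul.opendev.org > queues.sh

# Restart the scheduler container on the already-pulled (jemalloc-free) image.
cd /etc/zuul-scheduler
sudo docker-compose down
sudo docker-compose up -d

# Once the scheduler has finished reloading its configuration (cat jobs done),
# re-enqueue everything that was in flight.
bash queues.sh
```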
*** avass has quit IRC | 20:07 | |
mordred | clarkb: yes | 20:07 |
mordred | gah | 20:07 |
mordred | corvus: yes | 20:07 |
mordred | clarkb: seems reasonable | 20:07 |
clarkb | I've notified release team of my intent to do that real soon now and I've checked they don't have any jobs running (they don't) | 20:08 |
clarkb | I'll give it a couple minutes for anyone to object otherwise I'm proceeding :) | 20:08 |
clarkb | fungi: ^ you were helping with things earlier so want to make sure you have a chance to see that too | 20:09 |
fungi | clarkb: catching back up... we've been running `python /opt/zuul/tools/zuul-changes.py http://zuul.opendev.org >queue.sh` | 20:09 |
fungi | that seems to still work as expected | 20:10 |
clarkb | fungi: ya that's basically the command i've got except for the python prefix (seems to work either way) | 20:10 |
fungi | okay, cool | 20:10 |
*** avass has joined #opendev | 20:11 | |
fungi | yeah, that plan looks sane | 20:11 |
clarkb | ok I'm proceeding with it now then | 20:11 |
fungi | i'm ready and have non-pizza hands now | 20:11 |
corvus | mordred: at this point, we're pretty sure that we can have buildkit push multi-arch to dockerhub. we also think we can have buildkit push multi-arch to the buildset registry, but i'm not sure we've fully tested that. we also haven't examined skopeo copying from the buildset registry to the intermediate registry. nor have we examined skopeo copying from the intermediate registry to the buildset registry. | 20:11 |
clarkb | I've saved the queues and am downing and upping zuul next | 20:12 |
clarkb | *zuul-scheduler | 20:12 |
*** dpawlik has quit IRC | 20:12 | |
corvus | mordred: so i guess we need to follow that sequence -- first make sure that the push from buildkit to the BR is okay, then see what skopeo does from BR to IR; then same from IR to BR | 20:12 |
*** hashar has joined #opendev | 20:13 | |
clarkb | it's doing the things I expect | 20:13 |
corvus | mordred: my guess though is that since we've basically not looked at skopeo at all, maybe it's not doing anything with multi-arch, and so there's an extra thing we need to do in the push-to-intermediate-registry and pull-from-intermediate registry roles. and maybe it's the same thing we need to do to both roles. | 20:13 |
fungi | and now cat jobs are underway | 20:13 |
corvus | mordred: i'm going to get started on the zuul/nodepool gate issue now | 20:15 |
clarkb | it is up now | 20:16 |
clarkb | running the queues.sh script | 20:16 |
fungi | yep, i see executor interaction | 20:16 |
clarkb | queues.sh is done and I think that's it? | 20:18 |
clarkb | jobs are starting | 20:18 |
clarkb | #status log Restarted gerrit container on review.opendev.org to pick up new replication config (no github, replication for github runs through zuul jobs now) | 20:20 |
openstackstatus | clarkb: finished logging | 20:20 |
corvus | the reason that the nodepool job is failing is because we didn't open iptables in the test | 20:21 |
clarkb | #status log Restarted zuul-scheduler container on zuul01 to pick up the jemalloc removal in the containers which seems to address python memory leaks. | 20:21 |
corvus | we have firewall rules for the production nodepool servers, but not in the gate | 20:21 |
openstackstatus | clarkb: finished logging | 20:21 |
corvus | can anyone think of a similar thing we have a gate test for i can model after? | 20:21 |
corvus | (we need the gate-test-fake-nl01 to be able to talk to the gate-test-fake-zk01) | 20:22 |
clarkb | corvus: easy but maybe not super correct mode would be the multinode roles | 20:22 |
clarkb | corvus: they open all traffic between hosts | 20:22 |
clarkb | (but then I guess our base iptables role may overwrite?) | 20:22 |
corvus | yeah, would prefer to just set the iptables rule based on the ansible inventory | 20:23 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Add --all to skopeo copy from insecure registry https://review.opendev.org/726469 | 20:23 |
mordred | corvus: ^^ it might just be that simple | 20:23 |
clarkb | corvus: I am not aware of any examples of that. I think if we set up /etc/hosts (using multinode roles?) then when the base iptables role runs it will configure things properly because /etc/hosts wins over dns? | 20:24 |
mordred | corvus: dealing with manifest lists is definitely a new thing for skopeo and there is a decent chunk of discussion about it - also, some of the early discussion was of the form "if there's a list just grab the best one" | 20:24 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Add --all to skopeo copy from insecure registry https://review.opendev.org/726469 | 20:25 |
fungi | "best" | 20:26 |
corvus | mordred: neat, i know nalin from way back :) | 20:27 |
mordred | clarkb, corvus : we could also have the iptables rules derived from the ansible inventory group ips | 20:27 |
mordred | so more like "open port X from the hosts in group zkclient" vs "open port X on this list of IPs" | 20:27 |
mordred | that's -- a big bit of ansible magic though | 20:27 |
clarkb | mordred: oh interesting | 20:27 |
mordred | but if we can figure it out and get it right (I still don't 100% know when ansible decides to not load up the group membership of something) - it's likely more maintainable long term? | 20:28 |
corvus | {% for addr in host.hostname | dns_a -%} | 20:29 |
corvus | that's what we currently do | 20:29 |
corvus | dns_a is a filter module we wrote | 20:29 |
corvus | # Note we use 'host' rather than something like | 20:29 |
corvus | # getaddrinfo so we actually query DNS and don't get any | 20:29 |
corvus | # local-only results from /etc/hosts | 20:29 |
clarkb | corvus: ha | 20:29 |
mordred | ha indeed | 20:29 |
clarkb | there goes my idea :) | 20:29 |
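For context, the template corvus quotes resolves each allowed hostname via DNS and emits one accept rule per address, so the generated rules end up roughly shaped like the lines below; the chain, addresses, and the standard ZooKeeper client port 2181 are illustrative rather than copied from the real template. The follow-up changes in this log replace the DNS lookups with inventory- and group-based lookups.

```sh
# Rough shape of what the template generates -- one accept per resolved address
# for each host allowed to reach the service (all values here are examples).
iptables  -A INPUT -p tcp -s 203.0.113.10 --dport 2181 -j ACCEPT
ip6tables -A INPUT -p tcp -s 2001:db8::10 --dport 2181 -j ACCEPT
```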
mordred | and we want dns values instead of the ips we've put in the ansible inventory? | 20:30 |
corvus | mordred: no we are just doing a dns lookup | 20:30 |
corvus | i kinda think replacing all that with your idea may be the way to go | 20:30 |
mordred | yeah, I'm wondering if we've changed anything about how we're organized now that might make rethinking that worthwhile | 20:30 |
mordred | yah | 20:30 |
corvus | my guess is it far, far, far predates how we are doing things now | 20:31 |
mordred | I think before we still had dynamic openstack inventory | 20:31 |
clarkb | mordred: re skopeo --all. Is the --all still scoped to the urls we are passing? eg it won't try to download literally everything on the intermediate registry will it? | 20:31 |
mordred | clarkb: it's about the specific image | 20:31 |
corvus | "Add a --all/-a flag to instruct us to attempt to copy all of the instances in the source image" | 20:31 |
clarkb | corvus: hrm that makes me wonder if we'll get all versions of zuul-scheduler and so on | 20:32 |
fungi | i have a feeling that dates back to limitations we had with matching in the openstack dynamic inventory | 20:32 |
corvus | clarkb: if zuul-scheduler is a multi-arch image.... yes, but that's what we want | 20:32 |
clarkb | corvus: right but all changes in the intermediate registry for zuul-scheduler? | 20:32 |
clarkb | or maybe it does the right thing and I'm reading that wrong | 20:32 |
corvus | clarkb: we're telling it to copy an image | 20:33 |
mordred | yeah - and a specific tag of an image at that | 20:33 |
corvus | it's not going to copy other images we're not asking it to copy | 20:33 |
mordred | yeah | 20:33 |
clarkb | k | 20:33 |
mordred | image in this case means repository:tag | 20:33 |
corvus | (but if the image is a list of images[ie, multiarch], it will copy every image in the list of images) | 20:33 |
corvus | see https://github.com/containers/skopeo/pull/741 | 20:33 |
clarkb | mordred: right I think I'm getting confused because "image" can mean repository or repository:tag/sha | 20:33 |
mordred | clarkb: yah - in this case it means the specific repository:tag | 20:34 |
clarkb | in this case we tell it repository:tag so should get all instances of that (eg each arch) tag | 20:34 |
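The fix in https://review.opendev.org/726469 boils down to adding --all to the role's skopeo copy so that every instance behind a manifest-list tag is copied, not just the one matching the local platform. A sketch of the invocation shape; the registry hosts are placeholders and the exact flags used in the role may differ:

```sh
# Copy a tag between registries, preserving the manifest list and all of its
# per-architecture images; --src-tls-verify=false matches the "insecure
# registry" (buildset/intermediate) case in the change subject.
skopeo copy --all --src-tls-verify=false \
  docker://insecure-ci-registry.example.org:5000/opendevorg/python-base:latest \
  docker://registry.example.org/opendevorg/python-base:latest
```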
*** avass has quit IRC | 20:34 | |
mordred | clarkb: our more general conversational usage of calling the repository an image is wrong | 20:34 |
mordred | clarkb: yes | 20:34 |
fungi | scheduler restart is looking good... we've got jobs succeeding and publishing logs | 20:35 |
clarkb | fungi: over the long term we'll want to monitor memory use too but ya things are looking good so far | 20:36 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all memory use graphs | 20:36 |
clarkb | memory use looks good so far but we tend to need at least several hours to several days of data there to say anything definitively | 20:38 |
fungi | agreed | 20:39 |
tobiash | I've also updated our zuul-web to py38 w/o jemalloc today and memory looks better so far | 20:40 |
mordred | tobiash: cool! | 20:40 |
corvus | mordred: i'm working on the iptables thing | 20:43 |
clarkb | tobiash: fwiw my completely uninvestigated and untested theory is that the bug is in jemalloc | 20:44 |
clarkb | tobiash: since glibc and jemalloc should be called the same by python in this case | 20:44 |
clarkb | tobiash: it's also a major version change between ubuntu xenial and buster for jemalloc | 20:44 |
tobiash | clarkb: probably, memory allocators are hard | 20:44 |
mordred | yah - especially memory allocators underneath dynamic language memory allocators :) | 20:46 |
clarkb | ya I think if we really wanted to dig in more we'd want to run zuul-web under valgrind for a bit and then send off that data to jemalloc | 20:46 |
clarkb | but that sounds like a lot of effort for minimal to no gain :) | 20:47 |
mordred | yah - also - python has been putting a lot of effort into the new dict impl in later pythons too | 20:47 |
fungi | especially if it turns out to be something they've already fixed | 20:47 |
mordred | so at this point just being on 3.8 is probably a big win over our initial 3.5 deployments | 20:47 |
mordred | I mean - remember in opendev we were running a patched python until this container rollout | 20:48 |
tobiash | you patched python? | 20:48 |
tobiash | awesome | 20:48 |
clarkb | I thought upstream eventually pulled the fixes in? | 20:48 |
clarkb | we were for a while though | 20:48 |
mordred | clarkb: we never switched | 20:48 |
clarkb | mordred: oh hah ok | 20:48 |
mordred | tobiash: we pulled a backport patch | 20:48 |
fungi | i'd rather drink than try to remember that | 20:48 |
mordred | tobiash: https://launchpad.net/~openstack-ci-core/+archive/ubuntu/python-bpo-27945-backport | 20:49 |
mordred | clarkb: to be fair - it's possible we did get newer pythons because of versioning | 20:49 |
tobiash | ah I think I remember segfault discussions during zuul v3 development | 20:49 |
mordred | clarkb: but we never stopped adding that ppa :) | 20:49 |
clarkb | gotcha | 20:49 |
mordred | tobiash: yah. they were "fun" | 20:49 |
clarkb | we had to debug similar python issues with 3.3 iirc | 20:50 |
clarkb | I feel like I did that in the HP seattle offices, been a while | 20:50 |
fungi | seems so very long ago now | 20:50 |
clarkb | segfaults in the python garbage collector | 20:50 |
clarkb | which were fixed in python upstream but we had to convince the distro to pull it in | 20:50 |
clarkb | thankfully "segfaults due to no user input or interaction" tends to be a bad enough bug they'll patch :) | 20:50 |
* mordred loves our new python base images overlords | 20:51 | |
clarkb | fungi: I'm all out of alcohol | 20:51 |
clarkb | unless I want some more iron butt | 20:51 |
clarkb | ok I'm going to context switch to PTG planning stuff as zuul continues to look happy. Send up a signal flare if I can help debug or look at anything else | 20:52 |
fungi | clarkb: my "errand" this afternoon was to restock my home bar with aged rum before the tourists clear out the shelves in a week | 20:54 |
fungi | so i suppose i'm now well equipped to remember running with python patches after all | 20:55 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Use inventory host lookup for iptables https://review.opendev.org/726472 | 20:55 |
clarkb | fungi: nice | 20:55 |
corvus | mordred: ^ that's step one that just does a 1:1 for our host based system. i'll build on top of that and switch to using groups | 20:55 |
clarkb | fungi: I've been debating picking up a scotch since I tend to drink that at a reasonable pace | 20:55 |
corvus | so we don't have to think about it when we add a new zk client | 20:55 |
fungi | clarkb: to be fair, it was a stop off at the liquor store on the way to pick up decent pizza, so booze was not the entire reason for leaving the house at least | 20:56 |
mordred | corvus: cool | 20:58 |
mordred | corvus: so - biggest question - do all of the hosts in the inventory show up in hosts[] even if they aren't used? | 20:59 |
mordred | corvus: (I like the look of that a lot) | 20:59 |
corvus | mordred: this worked for me: ansible-playbook -i ~/git/opendev/system-config/inventory/openstack.yaml -i ~/git/opendev/system-config/inventory/groups.yaml /tmp/test.yaml | 20:59 |
corvus | mordred: where the playbook was a simple debug with what you see in the change | 21:00 |
corvus | mordred: that seem like an effective test? | 21:00 |
corvus | mordred: (and the play was on hosts:localhost) | 21:00 |
mordred | corvus: yeah. cool! | 21:00 |
mordred | corvus: that's exactly the type of test I would think would show it | 21:01 |
corvus | mordred: i think the other weirdness maybe you are remembering is facts? | 21:01 |
corvus | but since this is inventory data.... | 21:01 |
mordred | corvus: similar with groups - I've got that play in there somewhere to do a debug statement on hosts: zookeeper so that zookeeper shows up in groups[] - or maybe it's facts that are the issue there | 21:01 |
mordred | corvus: https://opendev.org/opendev/system-config/src/branch/master/playbooks/service-zuul.yaml#L1-L9 <-- but that does explicitly say "hostvars" | 21:02 |
mordred | corvus: in any case - it's an angle to check that I'm sure you'll check anyway - and maybe find a way to make that chunk go away | 21:03 |
corvus | mordred: well, i'm planning on letting the gate tests check for me | 21:04 |
mordred | infra-root: I'm going to be out next tuesday afternoon so will miss the meeting | 21:04 |
mordred | corvus: \o/ | 21:04 |
mordred | clarkb: we had another IP conflict / host key issue in a job - are we collecting those? or just shrug? | 21:05 |
clarkb | mordred: usually they seem to happen in waves and the cloud cleans up after itself and we move on. If it is very persistent in a single provider we usually escalate to that provider | 21:06 |
mordred | nod. I'll just go with shrug for now | 21:07 |
corvus | what's zuul-executor vs zuul-executor-opendev? | 21:08 |
corvus | i don't see the zuul-executor-opendev group used anywhere | 21:09 |
clarkb | I think zuul-executor-opendev may have been a fork in the road for container'd executors | 21:10 |
clarkb | mordred: ^ | 21:10 |
mordred | corvus: it's not - I think it can be killed | 21:10 |
corvus | k will do | 21:10 |
mordred | corvus: there's a similar one in nodepool which is used but which can go away once we sort out the final puppet host | 21:10 |
mordred | maybe we should rename those to nodepool-builder and nodepool-builder-legacy at this point | 21:11 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add iptables_extra_allowed_groups https://review.opendev.org/726475 | 21:17 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add iptables_extra_allowed_groups https://review.opendev.org/726475 | 21:19 |
openstackgerrit | Merged zuul/zuul-jobs master: Add --all to skopeo copy from insecure registry https://review.opendev.org/726469 | 21:23 |
corvus | mordred: ^ recheckify? | 21:23 |
*** avass has joined #opendev | 21:24 | |
*** hashar has quit IRC | 21:24 | |
clarkb | https://etherpad.opendev.org/p/opendev-virtual-ptg-june-2020 is up and I sent email about it to the mailing list as well | 21:25 |
clarkb | feel free to add your thoughts. I'm trying to add mine as well | 21:25 |
mordred | corvus: I rechecked both the python-base change and the uwsgi-base change - because the copy to the intermediate can also be an issue | 21:27 |
corvus | i'm working on the zuul job failures now (they're different than the nodepool failures) | 21:31 |
corvus | mordred: we should standardize our docker-compose directories | 21:32 |
corvus | it's a little hard to find them on the various different hosts | 21:32 |
corvus | (and in our situation, it's a bit like not knowing the name of the init script to start a service) | 21:32 |
*** DSpider has quit IRC | 21:34 | |
mordred | corvus: I completely agree | 21:36 |
clarkb | that will change container names so we'll have to do a bit of a dance but I agree that will be a nice thing | 21:39 |
clarkb | maybe even /etc/docker-compose/$service | 21:39 |
clarkb | then we always know to look in /etc/docker-compose like /etc/init.d/ | 21:39 |
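A sketch of what that convention might look like on a host; the service names and compose subcommands below are illustrative, not a description of what is actually deployed:

    # One predictable directory per service, by analogy with /etc/init.d/.
    ls -d /etc/docker-compose/*
    #   /etc/docker-compose/gitea
    #   /etc/docker-compose/zuul-scheduler
    #   /etc/docker-compose/zuul-web

    # Operating any service then looks the same everywhere.
    cd /etc/docker-compose/zuul-web
    docker-compose ps
    docker-compose restart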
*** hashar has joined #opendev | 21:46 | |
fungi | corvus: mordred: agreed, even if they just all started with docker- instead of ending in -docker i could at least tab-complete something | 21:47 |
*** avass has quit IRC | 21:47 | |
*** hashar has quit IRC | 21:48 | |
fungi | though i suppose i could eventually get in the habit of doing ls -d /etc/*-docker | 21:49 |
clarkb | fungi: note that won't match the zuul services :) | 21:50 |
clarkb | I think that may be part of corvus' observation there | 21:50 |
fungi | oh, even more fun | 21:50 |
fungi | ahh, yep, /etc/zuul-scheduler and /etc/zuul-web there | 21:51 |
clarkb | fungi: I approved a service-discuss email you were cc'd on and responded to the list and directly. Just in case you were off in separate conversation land thought I would mention it | 21:51 |
fungi | i think i accidentally discarded it along with some spam when going over the moderation queue first thing this morning | 21:52 |
fungi | so i found the sender in the mailman logs and reached out to them directly asking them to re-send | 21:53 |
fungi | because i remembered seeing the queue notification for it late last night but then didn't remember seeing it in the moderation queue this morning when i was processing all the lists i moderate | 21:54 |
clarkb | ah they did resend and I got it through and responded to their question | 21:54 |
fungi | excellent, thank you! | 21:55 |
corvus | mordred: the problem with the zuul job is that the test node has a zuul user with id=1000 | 21:56 |
corvus | so all our careful work to make sure zuul=10001 on the real server doesn't apply in test | 21:56 |
mordred | corvus: yeah - the zuul user on the test nodes is really non-ideal for us isn't it? | 21:57 |
mordred | corvus: I almost wonder if we should make a different production user name other than zuul | 21:58 |
mordred | so that we don't overlap with the zuul test user :( | 21:58 |
mordred | corvus: we could also see if docker will work with uidmap | 21:59 |
corvus | mordred: yeah; though that would make a slight discontinuity between inside (zuul) and outside (newuser) container | 21:59 |
corvus | mordred: docker uidmap? | 22:00 |
mordred | corvus: https://docs.docker.com/engine/security/userns-remap/ | 22:00 |
mordred | corvus: looks like docker does support it | 22:00 |
corvus | mordred: i think i need you to spell out your idea for me | 22:00 |
mordred | corvus: basically - we configure a mapping so we can tell docker "please make uid 1234 on the host map to 10001 in the container" | 22:01 |
corvus | mordred: so would we then go back to our production hosts and set zuul to 1000 everywhere? | 22:01 |
mordred | corvus: and then whatever the uid of the zuul user is on the host translates through that mapping so that they don't have to _match_, we just have to know what the host uid is | 22:01 |
mordred | corvus: I think we could maybe even just have our code to set the mapping read the uid of the zuul user so that if it preexisted the mapping would say 1000:10001 and if it didn't it would say 10001:10001 | 22:02 |
corvus | mordred: oh.. so maybe let prod and test be different? and then uidmap $hostuid:10000 ? | 22:02 |
mordred | OR - we could also make the zuul user on production 1000 | 22:02 |
mordred | yeah- we could do either thing | 22:02 |
mordred | the ansible would be the same | 22:02 |
mordred | we could even stop setting a uid in the ansible so that it will just make one and we don't have to care because it'll configure a mapping | 22:03 |
mordred | the only place we'd care about the uid is in inside the container | 22:03 |
corvus | that has a certain attractiveness to it... clarkb, fungi: thoughts ^? | 22:03 |
mordred | I recommend reading that docker.com link if you haven't poked at subuid and this sort of mapping yet | 22:04 |
corvus | mordred: specifically what would that look like? what files/settings would we have to put where to make that happen? | 22:04 |
fungi | so, like, 1:1 nat for unix uids/gids? that's not a terrible way to keep test users from colliding with production | 22:04 |
mordred | corvus: we have to put an /etc/subuid and /etc/subgid file in place | 22:05 |
corvus | mordred: i have, but the only thing i understand about it is remapping entire ranges; i don't quite grok how you say "map this uid out here to that one in there" | 22:05 |
mordred | with an entry like zuul:1000:10001 | 22:05 |
mordred | then "userns-remap": "zuul" in daemon.json | 22:06 |
mordred | corvus: I might be reading this poorly | 22:07 |
corvus | mordred: i read that as "docker containers started by the zuul user have the inside uid of 0 mapped to the outside uid of 1000" | 22:07 |
mordred | corvus: yeah - nevermind. my idea sucks | 22:08 |
corvus | mordred: well, it was a good idea, just apparently not implemented | 22:08 |
clarkb | mordred: I think uids matter for logs too? | 22:08 |
mordred | clarkb: oh - they matter - i was just reading the construct differently than it was thinking it would let us do a mapping like fungi described | 22:08 |
clarkb | gotcha | 22:09 |
fungi | so userns-remap only remaps the inside uid=0 to some outside uid? | 22:09 |
corvus | fungi: it remaps a range, but that range starts at 0 on the inside | 22:09 |
mordred | corvus: oh ... yeah | 22:09 |
mordred | the range starts there | 22:10 |
fungi | oic... so, could we map to an unused uid range where inside 1000 happens to line up with outside 10001? | 22:10 |
mordred | yeah- except we want the inverse - we want outside 1000 to map to inside 10001 :) | 22:10 |
mordred | which I don't think we can do with this | 22:11 |
fungi | ohhh | 22:11 |
mordred | we could do the other thing | 22:11 |
corvus | reverse the polarity? | 22:11 |
fungi | where's a deflector grid when you need one? | 22:11 |
corvus | mordred: what's the other thing? | 22:11 |
fungi | what i was describing (lower outside uid to higher inside uid) | 22:12 |
mordred | corvus: I meant we could do what fungi suggested if we wanted the mapping to go the opposite direction | 22:12 |
mordred | yeah | 22:12 |
corvus | oh | 22:12 |
mordred | but alas that doesn't help :( | 22:12 |
corvus | so we're back to: (a) start the gate test job by renaming the zuul user; (b) start the gate test job by renumbering the zuul user; (c) change the username on our images; .... any other options? | 22:13 |
clarkb | use different user in gate jobs | 22:13 |
mordred | corvus: d) renumber zuul in prod to 1000 | 22:13 |
corvus | (e) change the zuul number on our images | 22:13 |
mordred | yeah | 22:13 |
corvus | clarkb: how do we use a different user in gate jobs? | 22:14 |
mordred | (f) run our images as pid 0 but suggest people use userns in prod | 22:14 |
corvus | mordred: how does renumbering zuul in prod to 1000 help? it's baked into the images as 10001... | 22:14 |
clarkb | I guess that would involve changing prod user too but run zuul as the zuul-prod user | 22:14 |
mordred | corvus: d and e would have to be merged | 22:14 |
clarkb | then the zuul user on test nodes can configure zuul-prod and they can be distinct | 22:14 |
fungi | yeah, i interpreted d as actually being e | 22:15 |
mordred | yeah - I think that's an option - make a user not named "zuul" in prod | 22:15 |
corvus | mordred: my suggestion for (e) is just change our diskimages so the zuul user we create is id=10001; nothing would use 1000 then. | 22:15 |
mordred | maybe we need to re-state these | 22:15 |
mordred | corvus: ah! | 22:15 |
mordred | too many images | 22:15 |
corvus | oh yeah, sorry, not suggesting we change zuul's docker image | 22:15 |
mordred | corvus: (e) is the easiest - but does mean we won't test creating the user because it'll pre-exist | 22:16 |
mordred | that's probably ok | 22:16 |
corvus | well, it'll take like a week to implement (e) :) | 22:16 |
mordred | yeah | 22:16 |
corvus | and we'll probably open a can of worms | 22:16 |
mordred | (g) rename zuul user in prod to zuul-prod | 22:16 |
mordred | (just capturing the thing clarkb was saying) | 22:16 |
mordred | and kill (d) | 22:17 |
mordred | corvus: b won't work - you can't renumber the zuul user because ansible will be running as it - trying to do so fails pretty directly | 22:17 |
mordred | unless we edit /etc/passwd and reboot the nodes | 22:18 |
corvus | sure you can, it's just harder | 22:18 |
mordred | fair | 22:18 |
corvus | you don't need to reboot | 22:18 |
corvus | you just can't use usermod | 22:18 |
mordred | ah - nod | 22:18 |
corvus | so edit the files, find/chown, then HUP the ssh connection | 22:18 |
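A rough sketch of that sequence, assuming the zuul user is moving from uid/gid 1000 to 10001; the exact chown scope and how the connection gets reset are details for the real change (which shows up later as https://review.opendev.org/726490):

    # Rewrite the uid/gid directly, since usermod refuses while the user
    # has running processes (ansible itself is connected as zuul).
    sudo sed -i 's/^zuul:x:1000:1000:/zuul:x:10001:10001:/' /etc/passwd
    sudo sed -i 's/^zuul:x:1000:/zuul:x:10001:/' /etc/group

    # Re-own everything the old ids owned.
    sudo find / -xdev -uid 1000 -exec chown -h 10001 {} +
    sudo find / -xdev -gid 1000 -exec chgrp -h 10001 {} +

    # Then drop the persistent ssh connection (meta: reset_connection in
    # ansible) so the next task logs in with the new uid.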
clarkb | I like splitting zuul the test user and zuul the service users as it helps make reasoning about this stuff easier | 22:19 |
mordred | corvus: I might actually be coming around to liking that one - it's fiddly at the top of the test job, but we have to fiddle with a few things to set things up properly anyway | 22:19 |
clarkb | but it's also likely somewhat involved to make that change | 22:19 |
mordred | but yeah - clarkb's point | 22:19 |
mordred | and followup | 22:19 |
mordred | I agree with both | 22:19 |
corvus | didn't we just *undo* the zuulcd user on bridge? :) | 22:20 |
mordred | if we do (b) - we could do (g) as a followup without too much trouble most likely and then remove the (b) remediation | 22:20 |
clarkb | corvus: we did | 22:20 |
mordred | yup | 22:20 |
clarkb | corvus: but that is a case of zuul the test user | 22:20 |
fungi | does having a different uid on the remote node vs on the executor create problems for log/artifact synchronization? | 22:20 |
corvus | fungi: we have one now | 22:21 |
clarkb | fungi: there were a bunch of recent changes around that in zuul jobs the answer is it did but now shouldn't | 22:21 |
clarkb | we stopped syncing ownership iirc | 22:21 |
clarkb | because someone other than us hit that | 22:21 |
corvus | the executors are running as 10001 today | 22:21 |
fungi | ahh, okay | 22:21 |
fungi | yeah, sounded familiar | 22:21 |
corvus | clarkb: i think that was for missing user names, so technically i don't think we hit that. | 22:21 |
clarkb | ah | 22:22 |
corvus | clarkb: but if we renamed to zuul-prod, then we would :) | 22:22 |
corvus | because the executors are not running in a container | 22:22 |
mordred | haha | 22:22 |
corvus | i think it would be great if the names and uids inside and outside the container matched | 22:22 |
clarkb | corvus: hrm thats a good point too and I agree | 22:23 |
corvus | (so at least our executors will match the rest of the system) | 22:23 |
clarkb | having zuul-prod on the outside and zuul on the inside would be weird | 22:23 |
mordred | yeah | 22:25 |
mordred | so I think that gets me back around to liking b the most | 22:25 |
corvus | should we do b on all hosts in the system-config-run jobs? or try to narrow it down to just the 'zuul' group hosts? | 22:26 |
mordred | or - actually - why not b - then also do e - because shrug | 22:26 |
mordred | corvus: I'd vote for all the hosts - so that it's a consistent "this is part of setting up the world" task | 22:26 |
clarkb | ya b + do it in system-config-run-base makes sense to me | 22:27 |
corvus | ok. i'll work on b for all hosts; then we can think about e and deprecating b | 22:27 |
clarkb | sounds good | 22:27 |
mordred | corvus: ooh carnage! https://zuul.opendev.org/t/openstack/build/590dcae1db64414a938eb0682f2a623c | 22:28 |
mordred | corvus: that --all flag did _not_ work | 22:28 |
corvus | mordred: maybe next time don't use an EN-DASH. | 22:29 |
corvus | no... weird | 22:29 |
corvus | it looks like the dash is correct in the code, it's just printing it out weird in the error message | 22:29 |
corvus | mordred: sorry, i guess that was a red herring | 22:30 |
corvus | mordred: my skopeo copy has a --all option | 22:31 |
corvus | no idea how to find out what version it is | 22:31 |
corvus | does it even have versions? | 22:31 |
mordred | corvus: I think our skopeo is too old | 22:31 |
corvus | skopeo version 0.1.40 | 22:31 |
corvus | that's what i have | 22:32 |
mordred | skopeo is skopeo version 0.1.37-dev | 22:32 |
clarkb | ours == what we have on the executors? | 22:32 |
mordred | yeah | 22:32 |
mordred | I think we're still installing them from the ppa and not from kubic? | 22:32 |
mordred | we'll install skopeo from kubic for focal | 22:33 |
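For reference, the version check and the copy flag being discussed; the registry URLs below are placeholders rather than the real intermediate registry, and per the versions quoted above --all apparently needs something newer than 0.1.37:

    # skopeo reports its version via a global flag.
    skopeo --version
    #   skopeo version 0.1.40

    # With a new enough skopeo, --all copies every image referenced by a
    # manifest list (all architectures) instead of only the one matching
    # the local platform.
    skopeo copy --all \
      docker://source.example.org/opendev/python-base:latest \
      docker://dest.example.org/opendev/python-base:latest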
*** mlavalle has quit IRC | 22:33 | |
corvus | mordred: i just apt-get installed skopeo on ze01 and it upgraded to 0.1.40 | 22:34 |
corvus | how about we just do that on all the executors real quick-like? | 22:34 |
mordred | oh - hrm. yeah | 22:34 |
corvus | mordred: mind doing that while i go back to the other thing? | 22:35 |
mordred | on it | 22:35 |
mordred | corvus: done | 22:36 |
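One way that quick pass over the executors could look from bridge, assuming the zuul-executor inventory group mentioned earlier and that the newer package is available from the configured apt sources (a guess at the mechanics, not a record of what was run):

    # Ad-hoc upgrade of skopeo on every host in the zuul-executor group.
    ansible zuul-executor --become -f 12 \
      -m apt -a 'name=skopeo state=latest update_cache=yes'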
*** mlavalle has joined #opendev | 22:37 | |
mordred | corvus: I have also rechecked the patches | 22:52 |
*** tosky has quit IRC | 23:11 | |
mordred | corvus: first patch worked with the skopeo --all ... waiting on uwsgi now | 23:17 |
openstackgerrit | Merged opendev/system-config master: Remove dmsimard from infra-root https://review.opendev.org/726429 | 23:25 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Change the zuul user id when running the base playbook https://review.opendev.org/726490 | 23:26 |
corvus | there's option b ^ | 23:26 |
mordred | corvus: yay! | 23:28 |
mordred | corvus: we don't need to do any business with zuul_console after reset_connection do we? | 23:29 |
corvus | mordred: unclear... it's still going to be running on the old uid.... i'm curious if we can just leave it alone | 23:29 |
mordred | corvus: we'll see | 23:32 |
clarkb | I expect the old uid for that is fine unless we have to restart it for some reason (and none of our jobs do a restart of it currently) | 23:43 |
*** mlavalle has quit IRC | 23:50 |