Wednesday, 2022-11-02

opendevreviewMichael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts  https://review.opendev.org/c/zuul/zuul-jobs/+/86179900:02
*** dviroel|rover|bbl is now known as dviroel|rover00:05
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts  https://review.opendev.org/c/zuul/zuul-jobs/+/86179900:14
opendevreviewMerged opendev/system-config master: Rebuild gitea images under new golang release  https://review.opendev.org/c/opendev/system-config/+/86317600:15
clarkbI expect that will start to deploy in about 20-25 minutes? I'll eat dinner then check in on it00:16
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts  https://review.opendev.org/c/zuul/zuul-jobs/+/86179900:25
opendevreviewMichael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts  https://review.opendev.org/c/zuul/zuul-jobs/+/86179900:31
*** dviroel|rover is now known as dviroel|rover|out00:38
*** dviroel|rover|out is now known as dviroel|holiday00:38
clarkbgitea01 is done updating. Seems to work. There is a definite slowness in accessing repos as things start up but that seems to go away after a minute or two00:44
clarkbok all 8 are done and I've spot checked them and they all look happy to me01:01
clarkbthe job should finish momentarily and I expect it to succeed01:01
clarkbsuccess confirmed. I think we're good01:03
opendevreviewJie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard  https://review.opendev.org/c/openstack/project-config/+/86316801:15
jieniuHi all,  I'm trying to apply repo in opendev, the pipeline failed, is it because the "-" should not be used in project name ? 02:10
jieniuThe following projects should be alphabetized: 02:10
jieniu+ cat projects_list.diff02:10
jieniu+ grep -e '> '02:10
jieniu> computing-force-network/cfn-overview02:10
jieniu> computing-force-network/computing-native02:10
jieniu> computing-force-network/computing-offload02:10
jieniu> computing-force-network/ubiquitous-computing-scheduling02:10
jieniu> computing-force-network/use-case-and-architecture02:10
jieniu+ exit 102:10
ianwwe have dashes in names ...02:24
ianwjieniu: the job is complaining because in https://review.opendev.org/c/openstack/project-config/+/863168/2/gerrit/projects.yaml the entries are at the end (out of alphabetical order)02:26
jieniuianw: so I need to insert these lines in alphabetical order instead of appending them to the end?03:03
ianwjieniu: yes03:14
jieniuthank you :)03:15
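The ordering rule discussed above can be sketched in Python. This is only an illustration, not the actual project-config check: the function names are hypothetical, and case-insensitive comparison is an assumption about how the real job sorts entries.

```python
import bisect


def sort_key(name):
    # Assumption: entries are compared case-insensitively.
    return name.lower()


def out_of_order(names):
    """Return every name that sorts before the entry directly preceding it."""
    return [
        cur
        for prev, cur in zip(names, names[1:])
        if sort_key(cur) < sort_key(prev)
    ]


def insert_sorted(names, new_name):
    """Return a copy of an already sorted list with new_name placed in order."""
    result = list(names)
    keys = [sort_key(n) for n in result]
    index = bisect.bisect_left(keys, sort_key(new_name))
    result.insert(index, new_name)
    return result
```

Appending a new project to the end of the list makes `out_of_order` report it, while `insert_sorted` puts it where the check expects it.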
*** yadnesh|away is now known as yadnesh04:40
yadneshhello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/86303905:39
*** ysandeep|out is now known as ysandeep05:40
opendevreviewJie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard  https://review.opendev.org/c/openstack/project-config/+/86316806:11
*** mnasiadka_ is now known as mnasiadka06:29
jieniuHi, all 06:32
jieniuI submitted a change to apply for repos on opendev and the CI pipeline failed. Could someone help me understand why this acl-config is not normalized, and how I should fix it? Much appreciated!06:32
jieniu [submit]06:32
jieniu-mergeContent = true06:32
jieniu\ No newline at end of file06:32
jieniu+mergeContent = true06:32
jieniuProject /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/cfn-use-case-and-architecture.config is not normalized!06:32
jieniu--- /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/ubiquitous-computing-scheduling.config2022-11-02 06:17:58.197768142 +000006:32
jieniu+++ /tmp/tmp.5Dr5nkebQl/normalized2022-11-02 06:19:38.774158834 +000006:32
jieniu@@ -8,4 +8,4 @@06:32
jieniu requireContributorAgreement = true06:32
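The diff pasted above shows the same `mergeContent = true` line removed and re-added, with "No newline at end of file" on the removed side, which suggests the file only lacks a trailing newline. A minimal Python sketch of that one fix follows. The function names are hypothetical, and the real normalization job may apply other rules that this does not cover.

```python
def ensure_trailing_newline(text):
    """Return text ending in exactly one newline; empty text stays empty."""
    if not text:
        return text
    return text.rstrip("\n") + "\n"


def fix_file(path):
    """Rewrite path in place if it lacks a final newline. Returns True if changed."""
    with open(path, "r", encoding="utf-8") as handle:
        original = handle.read()
    fixed = ensure_trailing_newline(original)
    if fixed != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(fixed)
        return True
    return False
```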
*** yadnesh is now known as yadnesh|afk07:23
*** elodilles_pto is now known as elodilles07:41
*** ysandeep is now known as ysandeep|lunch07:59
*** yadnesh|afk is now known as yadnesh08:22
*** jpena|off is now known as jpena08:36
yadneshhello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/86303909:19
yadnesho/ frickler can you please help me with this ^ 09:40
frickleryadnesh: did you try to reproduce locally? or try to create a patch to gather more debug output. I can also hold a node if you specify which of the failing jobs you want to look at, but that should be considered the last resort only10:08
yadneshfrickler, i couldn't reproduce it locally, i am not familiar with creating patch to capture more output but I can give that a try if you can guide me or share doc10:15
*** ysandeep|lunch is now known as ysandeep10:17
*** rlandy|out is now known as rlandy|rover10:40
*** yadnesh is now known as yadnesh|afk11:23
frickleryadnesh|afk: the general approach would most likely look at the delta between the last passing and first failing invocation of your job, then ponder which logs could be helpful in further assessing the issue and add a patch for the job definition to add those logs11:34
fricklerbut you can also finally let me know which job you want held and your ssh key and I'll set things up11:35
*** yadnesh|afk is now known as yadnesh12:22
yadneshfrickler, i need it for telemetry-dsvm-integration job, here's my public key https://paste.openstack.org/show/bfsQ5ivQqWHkTozZa7mF/12:25
*** ysandeep is now known as ysandeep|brb12:26
*** elodilles is now known as elodilles_afk12:44
*** ysandeep|brb is now known as ysandeep12:56
*** gthiemon1e is now known as gthiemonge13:04
*** elodilles_afk is now known as elodilles13:15
*** Guest202 is now known as dasm14:03
opendevreviewdasm proposed openstack/diskimage-builder master: Fix issue in extract image  https://review.opendev.org/c/openstack/diskimage-builder/+/85088214:21
JayFI was trying to show https://zuul.opendev.org/t/openstack/config-errors to some other contributors; but it seems to be busted today14:23
fricklerinfra-root: corvus: ^^ seems to be an issue in the js renderer? I can't find an error in the web log.14:33
*** yadnesh is now known as yadnesh|away14:33
Clark[m]It works if you click the bell icon. So ya server side is probably fine14:36
fricklerJayF: ^^ just wanted to write the same, no direct link, but still a way to view the errors14:36
JayFthat's fine by me14:37
clarkb9d2e1339ff9f5080cd23e9d29fcb08315a32e5e9 that commit might be the one that broke the errors. Though I'm not sure I understand why yet15:22
clarkbit modifies the error state in the js though and the error reported by my browser is that e.map isn't a function so something about that type change maybe15:23
clarkbyup I think that is exactly it15:26
clarkbI'll work on a change15:27
clarkbhrm I feel like I'm missing something with reacts state engine that would make this easier to understand15:38
clarkbremote:   https://review.opendev.org/c/zuul/zuul/+/863326 Fix config-errors dedicated page15:48
clarkbI'm not sure if that is a complete fix. I'm hoping that the preview site will help with further debugging15:49
clarkbinfra-root I have put zk04 - zk06 in the emergency file15:54
clarkbcorvus: when you are around and ready to start the upgrade process let me know15:54
frickleryadnesh|away: sorry for the delay, I've set up the hold now, but saw that in your latest PS the job is passing, so I didn't recheck as a passing job will not trigger it. let us know if you still want to debug this further15:54
clarkbinfra-root for clarity I modified the file on bridge01.opendev.org not old bridge15:55
corvusclarkb: ack, ready in a few mins.  frickler clarkb ack re config-errors will look at clarkb's change15:57
clarkbzk05 is still the leader and I've figured out how to get it to report the number of followers it sees (mntr command)16:02
clarkbcorvus: when you are ready maybe you can do the zuul side backup (nodepool too?) and then I'll update the zk04 docker compose file, pull, and down then up -d16:04
corvusclarkb: i will start that process now16:05
clarkbgreat, let me know when I should proceed with 0416:05
*** marios is now known as marios|out16:10
corvusclarkb: i think there are 2 backup commands we should do: nodepool, then zuul.16:11
corvuson nl01, i logged into the container and then ran `nodepool export-image-data /var/log/nodepool/nodepool-export.data`16:12
corvusi put it there because of the bind mount16:12
corvus(the command wants a path and doesn't understand - so i can't do it as a single docker exec and redirect; that's a potential future improvement)16:13
corvusthat file has the metadata for the dib images in nodepool, so that if something goes wrong, we don't have to spend 2 days rebuilding images because we forgot their ids16:16
corvusnow the next thing is the zuul secret keys... that's something we should probably be backing up periodically anyway, and i can't recall if we are16:17
clarkbcorvus: I believe there is a cronjob for that on the schedulers16:18
corvus(because if a meteor takes out zk, we can still rebuild those nodepool images, but not so with the zuul keys)16:18
corvusclarkb: i thought/hoped so.  i'll double check that it looks good and recent.16:18
clarkbyes I see it on zuul02 at least. Last entry in the root crontab16:18
clarkbthanks16:18
corvusokay, that file looks internally consistent, and is dated from midnight today.  we haven't merged any changes to project-config (which would generate new keys) since then, so i think that's good.  also, i checked on zuul01, so we have 2 backups.  :)16:23
clarkbexcellent. In that case do you think we are ready to proceed with zk04's upgrade?16:23
corvusyep, i think so16:23
clarkbok I'll start on that now16:24
clarkbI've run the pull and image is present. Time to down then up16:25
clarkbthat seems to have worked and quite quickly16:26
clarkbzk 3.6 mntr output is far more verbose than 3.5s16:26
clarkbcorvus: you think we wait here for a few minutes and check that the grafana graphs don't show anything unexpected?16:26
corvusagreed, monitoring looks good so far.16:26
corvusclarkb: because we basically kicked all the clients off of 04, it's not going to be doing any client servicing work until something else happens... so the surface area for seeing errors here is small.16:27
corvusclarkb: we could try restarting some zuul components to see if they connect to 04, or just proceed...16:28
clarkbcorvus: ya I'm mostly worried about the stats monitoring itself since the mntr output has changed16:28
clarkbbut my read of that script is that if it doesn't find what it is looking for it skips gracefully so I think we are safe on that end. I think we should proceed with zk05 which should push load to zk0416:28
clarkband then later we can add more info to the graphite stats via the script and update names if any of them have changed16:29
corvusclarkb: i think zk05 is leader, so zk06 is next?16:29
clarkboh yes good catch :)16:29
corvusand otherwise i agree, let's proceed with zk06 now.  the stats for everything i expect to see data for on zk04 still look good in grafana16:29
clarkbI've double checked that the zk04 upgraded didn't cause zk05 to stop being the leader. It is still the leader. zk06 is the next one.16:30
clarkbproceeding with zk0616:30
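The ordering used here (followers first, current leader last) can be written down as a small Python sketch. The function name and the shape of the input mapping are assumptions made for illustration, not part of any tooling mentioned in the log.

```python
def upgrade_order(states):
    """Order hosts for a rolling restart: followers first, the leader last.

    states maps host name to "leader" or "follower".
    """
    followers = sorted(host for host, role in states.items() if role != "leader")
    leaders = sorted(host for host, role in states.items() if role == "leader")
    return followers + leaders
```

With zk05 as leader this yields zk04, zk06, zk05, which matches the sequence being followed in the channel.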
corvuszk04 is now doing work16:32
clarkbyup and 06 came up just as quickly as 04 did16:32
clarkbso now we should be able to watch and wait for a few minutes to see 04 on 3.6 has no obvious issues16:32
corvussounds good16:33
clarkbis the spike in zuul event processing time something to be concerned about?16:34
corvusit started before this work16:34
clarkboh yup good point16:35
clarkba number of changes were proposed to tripleo and others16:36
corvuslooks like a bunch of openstack tenant reconfigs happened in rapid succession16:36
clarkbthe changes were pushed by the openstack proposal bot16:37
clarkbI suspect this is ok and just part of having the bot show up with a bunch of work16:37
corvusyeah, seems to happen every now and then16:37
clarkblet me know when you're happy with zk04. I think the cluster handling that burst of activity in the middle of the upgrade is a good indication that things are functioning16:39
corvusas far as performance metrics -- there is a significant increase in write time, but it corresponds with an increase in object count on the y axis, and it corresponds with the event surge on the time axis, so while the event surge means we can't do 1:1 before/after comparisons, so far it looks good to me.  i think we can proceed.16:41
clarkbI will proceed with zk05 now16:41
clarkb06 is the new leader and it says it has two synced followers16:43
clarkbas far as I can tell things are happy16:44
corvuswe may have a stats problem then since the graph shows 0 for all16:44
clarkbthe follower graph?16:44
corvusyep16:44
corvuseverything else looks good16:45
clarkbI'm skimming the new mntr output and it seems zk_followers may not exist anymore; it's zk_synced_followers?16:46
clarkbbut I haven't grepped to be sure yet16:46
clarkbya that key is gone after using grep16:47
clarkbI suspect that is the problem with our stats.16:47
corvusis zk_synced_followers the new one?16:47
clarkbcorvus: yes zk_synced_followers16:47
corvusoh interesting, we grab both16:47
corvusso this may be only a grafana change16:48
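A rough Python sketch of reading the follower count from `mntr` output in a way that tolerates either key name. It assumes the four-letter-word interface is enabled on the client port and that the output is one whitespace-separated key/value pair per line. The helper names are hypothetical, and this is not the stats script referred to above.

```python
import socket


def query_mntr(host, port=2181, timeout=5.0):
    """Send the mntr four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn 'key<whitespace>value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def follower_count(stats):
    """Prefer zk_synced_followers, fall back to zk_followers, else None."""
    for key in ("zk_synced_followers", "zk_followers"):
        if key in stats:
            return int(float(stats[key]))
    return None
```

Only the leader reports follower counts, so `follower_count` returns `None` for output taken from a follower.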
corvusclarkb: what zk version did we just upgrade to?  :)16:48
Clark[m]corvus: latest 3.616:49
opendevreviewJames E. Blair proposed openstack/project-config master: Update ZK followers graph  https://review.opendev.org/c/openstack/project-config/+/86341816:49
clarkbheh saw the message in matrix not irc16:49
corvusClark: ^ thx, maybe that will do it?16:49
clarkbyup I suspect so. I've gone ahead and approved it16:50
corvusasynchronous-synchronous communication :)16:50
clarkbcorvus: should I go ahead and approve the system-config change to update the docker compose files now?16:50
corvustechnically, i guess it's eventually-consistent-synchronous16:50
clarkbwell the update is a noop now but ya16:50
corvusclarkb: ++16:50
corvuswhen shall we 3.7?16:51
clarkbconsidering how well this went I'm tempted to say tomorrow :) But I've got errands I need to do tomorrow16:51
clarkbI can probably do it monday16:51
corvusokay.  also "right now" wfm if that's an option.  :)16:52
clarkboh hrm16:52
clarkblet me do a quick google search for any 3.6 to 3.7 concerns16:52
clarkbbut ya actually I think that is a good idea16:53
clarkbcorvus: lets push a change to run it through our system-config-run-* jobs for any major issues?16:53
clarkbbut if that passes proceed?16:53
corvusok16:53
clarkbthe docs say 3.6 to 3.7 upgrade should be as simple as this one16:54
corvusyeah: "The upgrade from 3.6.x to 3.7.0 can be executed as usual, no particular additional upgrade procedure is needed."16:54
opendevreviewClark Boylan proposed opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7  https://review.opendev.org/c/opendev/system-config/+/86341916:56
clarkbcorvus: ^ that should give us an indication of any major issues. If that passes CI I can proceed with doing what I just did but with 3.716:57
clarkband am keeping all the servers in the emergency file for now16:57
corvusclarkb: this takes a while, right?  should we reconvene in 30m?16:58
clarkbyup it will take a few minutes (I think it runs a zookeeper job and a zuul job)16:59
clarkbsee you in half an hour16:59
corvus++16:59
*** ysandeep is now known as ysandeep|out17:08
opendevreviewMerged openstack/project-config master: Update ZK followers graph  https://review.opendev.org/c/openstack/project-config/+/86341817:27
clarkbthe zookeeper job for the 3.7 change passed. The zuul jobs for that change should finish in about 10 minutes17:30
clarkbassuming the zuul job comes back green too I'll restart the process we just ran through but updating 3.6 to 3.7 this time. Also zk06 is the current leader so it will go last17:30
clarkbthe job to fix the graph should run in a few minutes too. May not be a bad idea to wait for that to update to make it easier to confirm things are working17:32
corvusi'm back17:32
clarkbcorvus: tldr is looking good so far but needs a few more minutes to finish up17:33
corvusgraphs look ok to me17:33
clarkbthe grafana update job has started17:37
*** jpena is now known as jpena|off17:38
clarkbhrm that job is likely to fail in retry failure17:38
clarkbI suspect that is related to the bridge update and not related to anything zookeeper17:39
clarkbI guess we can check follower count by hand for now :)17:39
clarkbya permission denied when connecting to bridge. Almost certainly a problem with the new bridge migration cc ianw17:41
corvusproceeding without graph sounds good to me17:42
clarkbianw: I suspect maybe something related to being triggered by project-config instead of system-config? infra-root we should hold off on adding new repos until we understand that also dns updates (I've got one pending) are likely to be affected17:42
corvusmaybe missed adding the project ssh key17:43
opendevreviewMerged opendev/system-config master: Upgrade our zookeeper cluster to 3.6  https://review.opendev.org/c/opendev/system-config/+/86308917:43
clarkbcorvus: https://review.opendev.org/c/opendev/system-config/+/863419 got a +1 from zuul so I'm happy to proceed with the 3.6 to 3.7 upgrade now. Look good to you as well?17:43
clarkbI'll do the upgrades in zk04 zk05 zk06 order since zk06 is now leader17:44
corvussounds good17:44
clarkbinfra-prod-service-zookeeper just started for 863089 but it should noop because the nodes are in the emergency file17:44
clarkbyup it nooped. Proceeding on zk04 now17:46
clarkbzk04 seems happy and zk06 shows two synced followers17:48
corvuswow looks like most everybody hopped to zk0617:48
clarkbthat should be random right?17:48
corvustbh i don't know the algorithm or if it's changed.  it just usually ends up roughly equitable.  everyone going to 06 is potentially strange.17:50
corvusmy inclination would be to proceed and see what happens after we upgrade 06. if we end up unbalanced (between 4 and 5) after that, look into it more.17:50
clarkback. I'll proceed with zk05 next then?17:50
corvusyep, i don't see anything else anomalous, and we know that 1 node can handle load fine.17:51
clarkbI think all connections are on zk06 now fwiw17:52
clarkbbut zk06 continues to report all are synced17:52
clarkbI'm going to proceed with zk06 now17:53
clarkbzk05 became leader and zk04 has connections too17:54
clarkband zk05 has 2 synced followers. I wonder if that is simply a rolling upgrade behavior17:55
corvusclarkb: can you clarify, what sequence have you upgraded?17:56
clarkbcorvus: at this point all three are upgraded. After I upgraded the last server (zk06) then zk05 became the leader and both zk05 and zk04 have connections now17:57
corvuscool, that's what i thought, but i could have read one of the earlier updates 2 ways so wanted to be sure :)17:57
corvusthe client count looks well distributed now17:57
clarkbalso zk05 (the current leader) reports 2 synced followers which is what we expect17:58
clarkbcorvus: I think if this looks good to you after your checks you should approve https://review.opendev.org/c/opendev/system-config/+/86341917:58
corvusclarkb: lgtm and done17:59
clarkbthanks!17:59
corvusthank you!17:59
clarkbonce that lands I'll remove the nodes from the emergency file18:03
clarkbI'll try to catch that update so that the triggered infra-prod run will actually run and noop properly due to content on disk matching expected state rather than nooping due to hosts being in the emergency file18:05
clarkbI've gone ahead and edited the emergency file as no jobs are running and that change should land momentarily. Then I'll check that it noops as expected. Then I'll status log the new situation18:45
opendevreviewMerged opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7  https://review.opendev.org/c/opendev/system-config/+/86341918:53
clarkbthe timestamp on the docker compose file did end up updating (a side effect of using the synchronize instead of copy module?) but the version didn't change and no containers were restarted. This concludes the zookeeper upgrades18:59
clarkb#status log Zuul's ZK cluster has been upgraded to 3.7 via 3.6.18:59
opendevstatusclarkb: finished logging18:59
clarkbTime for lunch but when I get back I'll probably look at the project config job retry limiting18:59
clarkbianw: corvus: I've confirmed that it seems we only have the one key present on bridge. I would have expected the base job running on the bridge to address that since we set the extra users list for bridge20:02
clarkbwow ok I think it is due to the usermod conflict saying the user is in use20:05
clarkbya it's failing at that point and everything afterwards doesn't run20:06
clarkbI'm not sure how to handle this since the user exists and we shouldn't need to bootstrap it again. But I guess ansible isn't smart enough to not run usermod and just lets it fail?20:08
clarkb"You must make certain that the named user is not executing any processes when this command is being executed if the user's numerical user ID, the user's name, or the user's home directory is being changed."20:09
clarkbthe problem is the uid20:10
clarkbianw: ^ zuul's uid on bridge does not match what we set via ansible so that triggers the error above20:10
clarkbwe probably need to pause infra-prod access to bridge, manually change the uid, chown everything that zuul touches, then unpause and see if it works20:11
clarkb?20:11
clarkblooks like the name and homedir matchup so only the uid is the problem20:12
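The kind of comparison described here, checking the uid on the host against the uid the configuration wants, could be sketched in Python as follows. The function names are hypothetical, and the uids in any example input are illustrative placeholders rather than values confirmed at this point in the log.

```python
def parse_passwd(text):
    """Map user name to (uid, home directory) from /etc/passwd-style text."""
    users = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # skip malformed entries
        users[fields[0]] = (int(fields[2]), fields[5])
    return users


def uid_mismatches(passwd_text, expected):
    """Return {user: (actual_uid, expected_uid)} for users whose uid differs.

    expected maps user name to the uid the configuration asks for.
    Users missing from passwd_text are ignored.
    """
    users = parse_passwd(passwd_text)
    problems = {}
    for name, wanted in expected.items():
        if name in users and users[name][0] != wanted:
            problems[name] = (users[name][0], wanted)
    return problems
```

A non-empty result is the situation in which `usermod` would try to change the uid and refuse if the user has running processes.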
clarkbI think the main things to be careful of are git repo ownership? Since git refuses to operate if ownership doesn't align anymore. Otherwise logging is largely written as root?20:27
ianwclarkb: hey, sorry, catching up21:03
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy  https://review.opendev.org/c/zuul/zuul-jobs/+/86309821:04
ianwi think that i have messed up the zuul user when bootstrapping the host21:05
clarkbya I think our config makes uid 3000 the next one by default21:05
clarkbwhich we've done because we hardcode in the 2000-2999 range21:05
clarkbso it grabbed the next free uid according to the config and went with it21:05
clarkbianw: the other thing I've realized is that we may need to double check if gerrit projects have been created properly if any changes to do that have landed. It is possible they ended up being created by the periodic jobs on system-config if so21:06
ianwi think that this bootstrapping issue is fixed by https://review.opendev.org/c/opendev/system-config/+/862845/2.  with that, if we start another bridge, we will apply the correct extra_users, etc. variables to it and create the user as specified21:07
ianwso yeah, i think it's a situation of mop up what is wrong with it now on this host, but going forward the same mistake shouldn't happen again21:07
clarkbI'm still on the fence about that fwiw. The base playbook should cover all that21:07
clarkbso it seems weird to double account for it all in a separate bootstrapping step? This is why I suggested we try to have launch node do it instead21:08
clarkbbut I also haven't fully digested that change so maybe it does what I'm thinking with launch and base.yaml21:09
ianwright, i agree the base playbook will cover all that, but the definition for the extra_users was in the group_vars/bridge.yaml definition -- which was restricted to only the *current* production host21:09
ianwso when i started a new one, it didn't apply the variables21:09
clarkboh I see21:10
ianwhowever, what i've realised is, we can have as many active bridge0X.opendev.org hosts in the inventory as we want21:10
ianwthe important thing is that the CI jobs just choose the "current" one to run the jobs on21:10
clarkbianw: and the zuul reboot cronjob only runs on one21:10
clarkbbut ya I think that is correct21:10
ianws/jobs/production jobs/21:10
ianwindeed, yes modulo any external cron-ish type things like that21:11
clarkbI think that is the only cronish thing we have currently due to the even more chicken and egg problems of restarting zuul with zuul :)21:11
clarkb(we can't restart the executor running the restart job without breaking the restart job)21:11
ianwso what that change does is changes things so that when we add hosts dynamically with "add_host" we put them in the prod_bridge group, and reference prod_bridge[0] in all the playbooks that setup and start the nested ansible21:12
clarkbok that helps a bit. I haven't managed to properly review that change yet because it is a bit mind bendy21:13
clarkband I'm not sure I'll get to it today. But if I put it on my todo list for first thing tomorrow there is a good chance I finally manage it21:13
ianwthe production playbooks shouldn't care about the bridge running them -- the one place they did (resetting the project-config checkout on bridge) was an issue for running parallel jobs anyway21:14
ianwso what falls out of that is that we can switch the testing jobs to bridge99.opendev.org just fine -- basically proving that we're not hardcoding the bridge name :)21:14
ianw(by making the prod_bridge group in testing have the single host bridge99.opendev.org -- while in the gate it will have bridge01.opendev.org)21:16
clarkbin the gate would still be bridge99?21:16
clarkbdo you mean in the deploy jobs?21:17
ianwsorry, yes, i mean the actual production jobs, not the gate21:19
clarkbmakes sense21:19
ianwthe post-gate deploy steps :)21:19
ianw... but ... to the problem at hand ... resetting the zuul uid to 203121:20
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy  https://review.opendev.org/c/zuul/zuul-jobs/+/86309821:21
clarkbianw: my hunch is that doing so should be relatively safe since zuul shouldn't own anything outside of its homedir21:22
clarkbso in theory just a matter of chowning and updating the uid in /etc/passwd?21:22
clarkbfungi and corvus may have thoughts on that21:22
clarkbthat should allow the full base.yaml playbook to run on bridge so might be worth double checking that it won't do anything we don't want yet21:23
clarkband also maybe we need to check if any new project additions have landed and are in limbo (just in case we need to take any intervention steps)21:23
ianwyeah, i think 1) change in passwd -- 2) chown -hvR 2031 /home/zuul 3) reboot for good measure21:24
ianw4) monitor base run21:25
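The ownership step of the plan above can be previewed with a small Python sketch before running anything destructive. It is a more conservative variant of `chown -hvR`: it only touches paths currently owned by the old uid, never follows symlinks, and defaults to a dry run. The function names are hypothetical.

```python
import os


def paths_owned_by(root, uid):
    """Yield root and every path below it whose owner is uid (no symlink following)."""
    if os.lstat(root).st_uid == uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                yield path


def rechown(root, old_uid, new_uid, dry_run=True):
    """Change the owner of paths owned by old_uid to new_uid; the group is untouched.

    Returns the list of affected paths. With dry_run=True nothing is modified.
    """
    affected = []
    for path in paths_owned_by(root, old_uid):
        if not dry_run:
            # lchown acts on a symlink itself, like chown -h; -1 keeps the group.
            os.lchown(path, new_uid, -1)
        affected.append(path)
    return affected
```

Running it once with `dry_run=True` gives a list to review; actually changing ownership requires root.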
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy  https://review.opendev.org/c/zuul/zuul-jobs/+/86309821:26
ianwi don't think i see anything in https://review.opendev.org/q/project:openstack/project-config since 2022-10-26 (my time) which was when the new host started running prod jobs21:27
clarkbthat simplifies things :)21:28
clarkbdns updates would also be affected but I don't think we've had any of those. Grafana is the other one affected but its impact is minimal21:28
ianwi just have to do quick school run, but plan to do it when i get back in about ~20 mins21:29
clarkbok, I don't think we're in a huge rush if there is anything else you think we should do to prepare21:29
clarkbNext week is the gerrit user summit for anyone interested in joining they will have remote access but schedule is on london time (so timezones may make it difficult). I plan to try and wake up early and participate a bit myself21:31
ianwi don't think so -- as soon as you mentioned it, it kind of clicked that zuul having this different uid was wrong21:31
clarkbthe thing that clicked for me was reading a modern man page for that utility21:37
clarkbgoogle returns old ones21:37
clarkband sure enough the caveats section clearly described why we were hitting the problem21:37
*** dasm is now known as dasm|off21:52
*** rlandy|rover is now known as rlandy|bbl22:05
clarkbmtreinish: I've seen behavior similar to https://paste.opendev.org/show/bIfaYeDOEgM6Zz8gqEry/ across python3.8 on focal and python3.10/3.11 on jammy when using stestr. Basically we end up with a test suite that appears to have run all tests according to the orphaned stestr record file but the python process doesn't exit. When I strace things it seems the child is waiting on the22:08
clarkbparent for something22:08
clarkbmtreinish: do you have any idea of what is going on there? I'm not familiar enough with stestr to know where the multiprocessing fork comes from. Maybe that's just how you spin up the concurrency and pass it a different load list?22:08
clarkbwhat is odd is that stestr and subunit haven't changed in at least a month but this seems very new (like last week at earliest)22:09
clarkbI think I've managed to hold the node that that paste was made from if we need to inspect more stuff22:12
ianwi've swapped the ownership -- bridge is quiet so i'm rebooting now22:18
clarkbianw: the hourly jobs are running 22:19
clarkbbut losing those has minimal impact22:19
clarkbit should go to the next jobs and then retry22:19
clarkbyup it just failed and should retry after22:20
ianwi'll do a base run limited to bridge manually to watch it closely22:28
ianwi have a root screen up22:28
ianwbridge01.opendev.org       : ok=65   changed=6    unreachable=0    failed=0    skipped=10   rescued=0    ignored=0 22:30
ianwclarkb: speaking of base, are you ok with https://review.opendev.org/c/opendev/system-config/+/862765 which removes the old ip address from sshd?22:33
clarkbianw: yup I think at this point we should roll forward22:34
clarkb+2'd but not approved as my ability to monitor is quickly declining22:34
clarkbianw: re the secrets management did that get resolved?22:34
ianwi thought it did but maybe not ...22:36
clarkbianw: I think running `edit-secrets` and ensuring that works as expected is the test for that? it was frickler who discovered it so can weigh in if it isn't doing what we expect yet22:37
clarkb(I'm mostly just trying to make sure things are happy that I've seen or have seen others notice)22:37
ianwyes, ok key material is there, but i get a prompt for gpg-agent that seems to randomly drop keypresses22:39
ianwi think a .emacs fixes this22:40
opendevreviewIan Wienand proposed opendev/system-config master: edit-secrets: configure gpg-agent/emacs  https://review.opendev.org/c/opendev/system-config/+/86344523:08
