opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:02 |
*** dviroel|rover|bbl is now known as dviroel|rover | 00:05 | |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:14 |
opendevreview | Merged opendev/system-config master: Rebuild gitea images under new golang release https://review.opendev.org/c/opendev/system-config/+/863176 | 00:15 |
clarkb | I expect that will start to deploy in about 20-25 minutes? I'll eat dinner then check in on it | 00:16 |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:25 |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:31 |
*** dviroel|rover is now known as dviroel|rover|out | 00:38 | |
*** dviroel|rover|out is now known as dviroel|holiday | 00:38 | |
clarkb | gitea01 is done updating. Seems to work. There is a definite slowness in accessing repos as things start up but that seems to go away after a minute or two | 00:44 |
clarkb | ok all 8 are done and I've spot checked them and they all look happy to me | 01:01 |
clarkb | the job should finish momentarily and I expect it to succeed | 01:01 |
clarkb | success confirmed. I think we're good | 01:03 |
opendevreview | Jie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard https://review.opendev.org/c/openstack/project-config/+/863168 | 01:15 |
jieniu | Hi all, I'm trying to apply for a repo in opendev, the pipeline failed, is it because "-" should not be used in project names? | 02:10 |
jieniu | The following projects should be alphabetized: | 02:10 |
jieniu | + cat projects_list.diff | 02:10 |
jieniu | + grep -e '> ' | 02:10 |
jieniu | > computing-force-network/cfn-overview | 02:10 |
jieniu | > computing-force-network/computing-native | 02:10 |
jieniu | > computing-force-network/computing-offload | 02:10 |
jieniu | > computing-force-network/ubiquitous-computing-scheduling | 02:10 |
jieniu | > computing-force-network/use-case-and-architecture | 02:10 |
jieniu | + exit 1 | 02:10 |
ianw | we have dashes in names ... | 02:24 |
ianw | jieniu: the job is complaining because in https://review.opendev.org/c/openstack/project-config/+/863168/2/gerrit/projects.yaml the entries are at the end (out of alphabetical order) | 02:26 |
jieniu | ianw: so I need to insert these lines in alphabetical order instead of appending them to the end? | 03:03 |
ianw | jieniu: yes | 03:14 |
jieniu | thank you :) | 03:15 |
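For anyone hitting the same failure: the job sorts the extracted project list and diffs it against the original. A minimal sketch of that check, assuming the `- project: name` layout of gerrit/projects.yaml (the real job's script may differ in detail):

```shell
# Extract project names from gerrit/projects.yaml and diff against a
# sorted copy; any "> name" lines in the diff are entries that are out
# of alphabetical order (e.g. appended at the end of the file).
awk '/^- project: /{print $3}' gerrit/projects.yaml > projects_list
LC_ALL=C sort projects_list > projects_list.sorted
if ! diff projects_list projects_list.sorted > projects_list.diff; then
  echo "The following projects should be alphabetized:"
  grep -e '> ' projects_list.diff
  exit 1
fi
```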
*** yadnesh|away is now known as yadnesh | 04:40 | |
yadnesh | hello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/863039 | 05:39 |
*** ysandeep|out is now known as ysandeep | 05:40 | |
opendevreview | Jie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard https://review.opendev.org/c/openstack/project-config/+/863168 | 06:11 |
*** mnasiadka_ is now known as mnasiadka | 06:29 | |
jieniu | Hi, all | 06:32 |
jieniu | I submitted a change to apply for a repo in opendev and the CI pipeline failed. Could someone help me understand why this ACL config is not normalized, and how I should fix it? Much appreciated! | 06:32 |
jieniu | [submit] | 06:32 |
jieniu | -mergeContent = true | 06:32 |
jieniu | \ No newline at end of file | 06:32 |
jieniu | +mergeContent = true | 06:32 |
jieniu | Project /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/cfn-use-case-and-architecture.config is not normalized! | 06:32 |
jieniu | --- /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/ubiquitous-computing-scheduling.config 2022-11-02 06:17:58.197768142 +0000 | 06:32 |
jieniu | +++ /tmp/tmp.5Dr5nkebQl/normalized 2022-11-02 06:19:38.774158834 +0000 | 06:32 |
jieniu | @@ -8,4 +8,4 @@ | 06:32 |
jieniu | requireContributorAgreement = true | 06:32 |
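The diff above is mostly noise around one real problem: the `\ No newline at end of file` marker, i.e. the ACL file is missing its trailing newline. A hedged sketch for finding and fixing that locally (file paths as in the failing change; the normalization script also checks other formatting):

```shell
# A file whose last byte is not "\n" makes tail -c1 print something,
# so the command substitution comes back non-empty.
for f in gerrit/acls/openinfra/*.config; do
  if [ -n "$(tail -c1 "$f")" ]; then
    echo "missing trailing newline: $f"
    printf '\n' >> "$f"   # append the newline in place
  fi
done
```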
*** yadnesh is now known as yadnesh|afk | 07:23 | |
*** elodilles_pto is now known as elodilles | 07:41 | |
*** ysandeep is now known as ysandeep|lunch | 07:59 | |
*** yadnesh|afk is now known as yadnesh | 08:22 | |
*** jpena|off is now known as jpena | 08:36 | |
yadnesh | hello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/863039 | 09:19 |
yadnesh | o/ frickler can you please help me with this ^ | 09:40 |
frickler | yadnesh: did you try to reproduce locally? or try to create a patch to gather more debug output. I can also hold a node if you specify which of the failing jobs you want to look at, but that should be considered the last resort only | 10:08 |
yadnesh | frickler, i couldn't reproduce it locally, i am not familiar with creating a patch to capture more output but I can give that a try if you can guide me or share a doc | 10:15 |
*** ysandeep|lunch is now known as ysandeep | 10:17 | |
*** rlandy|out is now known as rlandy|rover | 10:40 | |
*** yadnesh is now known as yadnesh|afk | 11:23 | |
frickler | yadnesh|afk: the general approach would most likely be to look at the delta between the last passing and first failing invocation of your job, then ponder which logs could be helpful in further assessing the issue and add a patch to the job definition to add those logs | 11:34 |
frickler | but you can also finally let me know which job you want held and your ssh key and I'll set things up | 11:35 |
*** yadnesh|afk is now known as yadnesh | 12:22 | |
yadnesh | frickler, i need it for telemetry-dsvm-integration job, here's my public key https://paste.openstack.org/show/bfsQ5ivQqWHkTozZa7mF/ | 12:25 |
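For reference, the hold frickler then set up is a Zuul autohold. A sketch of the operator-side command, with values taken from this request (the exact invocation on the schedulers may be wrapped in a container exec):

```shell
# Keep the node from the next failing build of this job so it can be
# sshed into for debugging; --count 1 releases the hold after one hit.
zuul autohold --tenant openstack \
    --project openstack/aodh \
    --job telemetry-dsvm-integration \
    --reason "yadnesh debugging aodh change 863039" \
    --count 1
```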
*** ysandeep is now known as ysandeep|brb | 12:26 | |
*** elodilles is now known as elodilles_afk | 12:44 | |
*** ysandeep|brb is now known as ysandeep | 12:56 | |
*** gthiemon1e is now known as gthiemonge | 13:04 | |
*** elodilles_afk is now known as elodilles | 13:15 | |
*** Guest202 is now known as dasm | 14:03 | |
opendevreview | dasm proposed openstack/diskimage-builder master: Fix issue in extract image https://review.opendev.org/c/openstack/diskimage-builder/+/850882 | 14:21 |
JayF | I was trying to show https://zuul.opendev.org/t/openstack/config-errors to some other contributors; but it seems to be busted today | 14:23 |
frickler | infra-root: corvus: ^^ seems to be an issue in the js renderer? I can't find an error in the web log. | 14:33 |
*** yadnesh is now known as yadnesh|away | 14:33 | |
Clark[m] | It works if you click the bell icon. So ya server side is probably fine | 14:36 |
frickler | JayF: ^^ just wanted to write the same, no direct link, but still a way to view the errors | 14:36 |
JayF | that's fine by me | 14:37 |
clarkb | 9d2e1339ff9f5080cd23e9d29fcb08315a32e5e9 that commit might be the one that broke the errors. Though I'm not sure I understand why yet | 15:22 |
clarkb | it modifies the error state in the js though and the error reported by my browser is that e.map isn't a function so something about that type change maybe | 15:23 |
clarkb | yup I think that is exactly it | 15:26 |
clarkb | I'll work on a change | 15:27 |
clarkb | hrm I feel like I'm missing something with reacts state engine that would make this easier to understand | 15:38 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/863326 Fix config-errors dedicated page | 15:48 |
clarkb | I'm not sure if that is a complete fix. I'm hoping that the preview site will help with further debugging | 15:49 |
clarkb | infra-root I have put zk04 - zk06 in the emergency file | 15:54 |
clarkb | corvus: when you are around and ready to start the upgrade process let me know | 15:54 |
frickler | yadnesh|away: sorry for the delay, I've set up the hold now, but saw that in your latest PS the job is passing, so I didn't recheck as a passing job will not trigger it. let us know if you still want to debug this further | 15:54 |
clarkb | infra-root for clarity I modified the file on bridge01.opendev.org not old bridge | 15:55 |
corvus | clarkb: ack, ready in a few mins. frickler clarkb ack re config-errors will look at clarkb's change | 15:57 |
clarkb | zk05 is still the leader and I've figured out how to get it to report the number of followers it sees (mntr command) | 16:02 |
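mntr is one of ZooKeeper's four-letter-word admin commands. A minimal sketch of the check clarkb describes, assuming the default client port and that mntr is on the server's command whitelist:

```shell
# Ask the local ZooKeeper member for its monitoring stats and pull out
# its role plus the follower counters (only the leader reports these).
echo mntr | nc localhost 2181 | grep -E 'zk_server_state|followers'
```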
clarkb | corvus: when you are ready maybe you can do the zuul side backup (nodepool too?) and then I'll update the zk04 docker compose file, pull, and down then up -d | 16:04 |
corvus | clarkb: i will start that process now | 16:05 |
clarkb | great, let me know when I should proceed with 04 | 16:05 |
*** marios is now known as marios|out | 16:10 | |
corvus | clarkb: i think there are 2 backup commands we should do: nodepool, then zuul. | 16:11 |
corvus | on nl01, i logged into the container and then ran `nodepool export-image-data /var/log/nodepool/nodepool-export.data` | 16:12 |
corvus | i put it there because of the bind mount | 16:12 |
corvus | (the command wants a path and doesn't understand - so i can't do it as a single docker exec and redirect; that's a potential future improvement) | 16:13 |
corvus | that file has the metadata for the dib images in nodepool, so that if something goes wrong, we don't have to spend 2 days rebuilding images because we forgot their ids | 16:16 |
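Put together, the backup step looks roughly like the following; the container name is an assumption, and the target path sits on a bind mount so the export also appears on the host:

```shell
# The command writes to a file path (it does not understand "-"), so a
# plain `docker exec ... > backup` stdout redirect would not capture it;
# writing to the bind-mounted log directory works instead.
docker exec nodepool-launcher \
    nodepool export-image-data /var/log/nodepool/nodepool-export.data
```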
corvus | now the next thing is the zuul secret keys... that's something we should probably be backing up periodically anyway, and i can't recall if we are | 16:17 |
clarkb | corvus: I believe there is a cronjob for that on the schedulers | 16:18 |
corvus | (because if a meteor takes out zk, we can still rebuild those nodepool images, but not so with the zuul keys) | 16:18 |
corvus | clarkb: i thought/hoped so. i'll double check that it looks good and recent. | 16:18 |
clarkb | yes I see it on zuul02 at least. Last entry in the root crontab | 16:18 |
clarkb | thanks | 16:18 |
corvus | okay, that file looks internally consistent, and is dated from midnight today. we haven't merged any changes to project-config (which would generate new keys) since then, so i think that's good. also, i checked on zuul01, so we have 2 backups. :) | 16:23 |
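The periodic backup being checked here is Zuul's project key export. A hedged sketch of what such a cron entry might run (container name and destination path are assumptions; depending on the Zuul release the entry point is `zuul` rather than `zuul-admin`):

```shell
# Dump all per-project secret/SSH keys from ZooKeeper to a file that
# can later be restored with the matching import-keys command.
docker exec zuul-scheduler \
    zuul-admin export-keys /var/lib/zuul/backup/keys-$(date +%F).export
```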
clarkb | excellent. In that case do you think we are ready to proceed with zk04's upgrade? | 16:23 |
corvus | yep, i think so | 16:23 |
clarkb | ok I'll start on that now | 16:24 |
clarkb | I've run the pull and image is present. Time to down then up | 16:25 |
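The per-node step being narrated, as a sketch; the compose directory is an assumption about where the zookeeper service is defined on these hosts:

```shell
# Rolling-upgrade a single cluster member: fetch the new image, then
# recreate the container; clients fail over to the remaining members.
cd /etc/zookeeper-compose   # assumed location of docker-compose.yaml
docker-compose pull         # fetch the image for the new version
docker-compose down         # stop and remove the old container
docker-compose up -d        # start the new one detached
```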
clarkb | that seems to have worked and quite quickly | 16:26 |
clarkb | zk 3.6 mntr output is far more verbose than 3.5's | 16:26 |
clarkb | corvus: you think we wait here for a few minutes and check that the grafana graphs don't show anything unexpected? | 16:26 |
corvus | agreed, monitoring looks good so far. | 16:26 |
corvus | clarkb: because we basically kicked all the clients off of 04, it's not going to be doing any client servicing work until something else happens... so the surface area for seeing errors here is small. | 16:27 |
corvus | clarkb: we could try restarting some zuul components to see if they connect to 04, or just proceed... | 16:28 |
clarkb | corvus: ya I'm mostly worried about the stats monitoring itself since the mntr output has changed | 16:28 |
clarkb | but my read of that script is that if it doesn't find what it is looking for it skips gracefully so I think we are safe on that end. I think we should proceed with zk05 which should push load to zk04 | 16:28 |
clarkb | and then later we can add more info to the graphite stats via the script and update names if any of them have changed | 16:29 |
corvus | clarkb: i think zk05 is leader, so zk06 is next? | 16:29 |
clarkb | oh yes good catch :) | 16:29 |
corvus | and otherwise i agree, let's proceed with zk06 now. the stats for everything i expect to see data for on zk04 still look good in grafana | 16:29 |
clarkb | I've double checked that the zk04 upgrade didn't cause zk05 to stop being the leader. It is still the leader. zk06 is the next one. | 16:30 |
clarkb | proceeding with zk06 | 16:30 |
corvus | zk04 is now doing work | 16:32 |
clarkb | yup and 06 came up just as quickly as 04 did | 16:32 |
clarkb | so now we should be able to watch and wait for a few minutes to see 04 on 3.6 has no obvious issues | 16:32 |
corvus | sounds good | 16:33 |
clarkb | is the spike in zuul event processing time something to be concerned about? | 16:34 |
corvus | it started before this work | 16:34 |
clarkb | oh yup good point | 16:35 |
clarkb | a number of changes were proposed to tripleo and others | 16:36 |
corvus | looks like a bunch of openstack tenant reconfigs happened in rapid succession | 16:36 |
clarkb | the changes were pushed by the openstack proposal bot | 16:37 |
clarkb | I suspect this is ok and just part of having the bot show up with a bunch of work | 16:37 |
corvus | yeah, seems to happen every now and then | 16:37 |
clarkb | let me know when you're happy with zk04. I think the cluster handling that burst of activity in the middle of the upgrade is a good indication that things are functioning | 16:39 |
corvus | as far as performance metrics -- there is a significant increase in write time, but it corresponds with an increase in object count on the y axis, and it corresponds with the event surge on the time axis; while the event surge means we can't do a 1:1 before/after comparison, so far it looks good to me. i think we can proceed. | 16:41 |
clarkb | I will proceed with zk05 now | 16:41 |
clarkb | 06 is the new leader and it says it has two synced followers | 16:43 |
clarkb | as far as I can tell things are happy | 16:44 |
corvus | we may have a stats problem then since the graph shows 0 for all | 16:44 |
clarkb | the follower graph? | 16:44 |
corvus | yep | 16:44 |
corvus | everything else looks good | 16:45 |
clarkb | I'm skimming the new mntr output and it seems zk_followers may not exist anymore; it's zk_synced_followers now? | 16:46 |
clarkb | but I haven't grepped to be sure yet | 16:46 |
clarkb | ya that key is gone after using grep | 16:47 |
clarkb | I suspect that is the problem with our stats. | 16:47 |
corvus | is zk_synced_followers the new one? | 16:47 |
clarkb | corvus: yes zk_synced_followers | 16:47 |
corvus | oh interesting, we grab both | 16:47 |
corvus | so this may be only a grafana change | 16:48 |
corvus | clarkb: what zk version did we just upgrade to? :) | 16:48 |
Clark[m] | corvus: latest 3.6 | 16:49 |
opendevreview | James E. Blair proposed openstack/project-config master: Update ZK followers graph https://review.opendev.org/c/openstack/project-config/+/863418 | 16:49 |
clarkb | heh saw the message in matrix not irc | 16:49 |
corvus | Clark: ^ thx, maybe that will do it? | 16:49 |
clarkb | yup I suspect so. I've gone ahead and approved it | 16:50 |
corvus | asynchronous-synchronous communication :) | 16:50 |
clarkb | corvus: should I go ahead and approve the system-config change to update the docker compose files now? | 16:50 |
corvus | technically, i guess it's eventually-consistent-synchronous | 16:50 |
clarkb | well the update is a noop now but ya | 16:50 |
corvus | clarkb: ++ | 16:50 |
corvus | when shall we 3.7? | 16:51 |
clarkb | considering how well this went I'm tempted to say tomorrow :) But I've got errands I need to do tomorrow | 16:51 |
clarkb | I can probably do it monday | 16:51 |
corvus | okay. also "right now" wfm if that's an option. :) | 16:52 |
clarkb | oh hrm | 16:52 |
clarkb | let me do a quick google search for any 3.6 to 3.7 concerns | 16:52 |
clarkb | but ya actually I think that is a good idea | 16:53 |
clarkb | corvus: lets push a change to run it through our system-config-run-* jobs for any major issues? | 16:53 |
clarkb | but if that passes proceed? | 16:53 |
corvus | ok | 16:53 |
clarkb | the docs say 3.6 to 3.7 upgrade should be as simple as this one | 16:54 |
corvus | yeah: "The upgrade from 3.6.x to 3.7.0 can be executed as usual, no particular additional upgrade procedure is needed." | 16:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7 https://review.opendev.org/c/opendev/system-config/+/863419 | 16:56 |
clarkb | corvus: ^ that should give us an indication of any major issues. If that passes CI I can proceed with doing what I just did but with 3.7 | 16:57 |
clarkb | and am keeping all the servers in the emergency file for now | 16:57 |
corvus | clarkb: this takes a while, right? should we reconvene in 30m? | 16:58 |
clarkb | yup it will take a few minutes (I think it runs a zookeeper job and a zuul job) | 16:59 |
clarkb | see you in half an hour | 16:59 |
corvus | ++ | 16:59 |
*** ysandeep is now known as ysandeep|out | 17:08 | |
opendevreview | Merged openstack/project-config master: Update ZK followers graph https://review.opendev.org/c/openstack/project-config/+/863418 | 17:27 |
clarkb | the zookeeper job for the 3.7 change passed. The zuul jobs for that change should finish in about 10 minutes | 17:30 |
clarkb | assuming the zuul job comes back green too I'll restart the process we just ran through but updating 3.6 to 3.7 this time. Also zk06 is the current leader so it will go last | 17:30 |
clarkb | the job to fix the graph should run in a few minutes too. May not be a bad idea to wait for that to update to make it easier to confirm things are working | 17:32 |
corvus | i'm back | 17:32 |
clarkb | corvus: tldr is looking good so far but needs a few more minutes to finish up | 17:33 |
corvus | graphs look ok to me | 17:33 |
clarkb | the grafana update job has started | 17:37 |
*** jpena is now known as jpena|off | 17:38 | |
clarkb | hrm that job is likely to end in retry failure | 17:38 |
clarkb | I suspect that is related to the bridge update and not related to anything zookeeper | 17:39 |
clarkb | I guess we can check follower count by hand for now :) | 17:39 |
clarkb | ya permission denied when connecting to bridge. Almost certainly a problem with the new bridge migration cc ianw | 17:41 |
corvus | proceeding without graph sounds good to me | 17:42 |
clarkb | ianw: I suspect maybe something related to being triggered by project-config instead of system-config? infra-root we should hold off on adding new repos until we understand that. Also dns updates (I've got one pending) are likely to be affected | 17:42 |
corvus | maybe missed adding the project ssh key | 17:43 |
opendevreview | Merged opendev/system-config master: Upgrade our zookeeper cluster to 3.6 https://review.opendev.org/c/opendev/system-config/+/863089 | 17:43 |
clarkb | corvus: https://review.opendev.org/c/opendev/system-config/+/863419 got a +1 from zuul so I'm happy to proceed with the 3.6 to 3.7 upgrade now. Look good to you as well? | 17:43 |
clarkb | I'll do the upgrades in zk04 zk05 zk06 order since zk06 is now leader | 17:44 |
corvus | sounds good | 17:44 |
clarkb | infra-prod-service-zookeeper just started for 863089 but it should noop because the nodes are in the emergency file | 17:44 |
clarkb | yup it nooped. Proceeding on zk04 now | 17:46 |
clarkb | zk04 seems happy and zk06 shows two synced followers | 17:48 |
corvus | wow looks like most everybody hopped to zk06 | 17:48 |
clarkb | that should be random right? | 17:48 |
corvus | tbh i don't know the algorithm or if it's changed. it just usually ends up roughly equitable. everyone going to 06 is potentially strange. | 17:50 |
corvus | my inclination would be to proceed and see what happens after we upgrade 06. if we end up unbalanced (between 4 and 5) after that, look into it more. | 17:50 |
clarkb | ack. I'll proceed with zk05 next then? | 17:50 |
corvus | yep, i don't see anything else anomalous, and we know that 1 node can handle load fine. | 17:51 |
clarkb | I think all connections are on zk06 now fwiw | 17:52 |
clarkb | but zk06 continues to report all are synced | 17:52 |
clarkb | I'm going to proceed with zk06 now | 17:53 |
clarkb | zk05 became leader and zk04 has connections too | 17:54 |
clarkb | and zk05 has 2 synced followers. I wonder if that is simply a rolling upgrade behavior | 17:55 |
corvus | clarkb: can you clarify, what sequence have you upgraded? | 17:56 |
clarkb | corvus: at this point all three are upgraded. After I upgraded the last server (zk06) then zk05 became the leader and both zk05 and zk04 have connections now | 17:57 |
corvus | cool, that's what i thought, but i could have read one of the earlier updates 2 ways so wanted to be sure :) | 17:57 |
corvus | the client count looks well distributed now | 17:57 |
clarkb | also zk05 (the current leader) reports 2 synced followers which is what we expect | 17:58 |
clarkb | corvus: I think if this looks good to you after your checks you should approve https://review.opendev.org/c/opendev/system-config/+/863419 | 17:58 |
corvus | clarkb: lgtm and done | 17:59 |
clarkb | thanks! | 17:59 |
corvus | thank you! | 17:59 |
clarkb | once that lands I'll remove the nodes from the emergency file | 18:03 |
clarkb | I'll try to catch that update so that the triggered infra-prod run will actually run and noop properly due to content on disk matching expected state rather than nooping due to hosts being in the emergency file | 18:05 |
clarkb | I've gone ahead and edited the emergency file as no jobs are running and that change should land momentarily. Then I'll check that it noops as expected. Then I'll status log the new situation | 18:45 |
opendevreview | Merged opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7 https://review.opendev.org/c/opendev/system-config/+/863419 | 18:53 |
clarkb | the timestamp on the docker compose file did end up updating (a side effect of using the synchronize module instead of copy?) but the version didn't change and no containers were restarted. This concludes the zookeeper upgrades | 18:59 |
clarkb | #status log Zuul's ZK cluster has been upgraded to 3.7 via 3.6. | 18:59 |
opendevstatus | clarkb: finished logging | 18:59 |
clarkb | Time for lunch but when I get back I'll probably look at the project config job retry limiting | 18:59 |
clarkb | ianw: corvus: I've confirmed that it seems we only have the one key present on bridge. I would have expected the base job running on the bridge to address that since we set the extra users list for bridge | 20:02 |
clarkb | wow ok I think it is due to the usermod conflict saying the user is in use | 20:05 |
clarkb | ya it's failing at that point and everything afterwards doesn't run | 20:06 |
clarkb | I'm not sure how to handle this since the user exists and we shouldn't need to bootstrap it again. But I guess ansible isn't smart enough to not run usermod and just lets it fail? | 20:08 |
clarkb | "You must make certain that the named user is not executing any processes when this command is being executed if the user's numerical user ID, the user's name, or the user's home directory is being changed." | 20:09 |
clarkb | the problem is the uid | 20:10 |
clarkb | ianw: ^ zuul's uid on bridge does not match what we set via ansible so that triggers the error above | 20:10 |
clarkb | we probably need to pause infra-prod access to bridge, manually change the uid, chown everything that zuul touches, then unpause and see if it works | 20:11 |
clarkb | ? | 20:11 |
clarkb | looks like the name and homedir match up so only the uid is the problem | 20:12 |
clarkb | I think the main things to be careful of are git repo ownership? Since git refuses to operate if ownership doesn't align anymore. Otherwise logging is largely written as root? | 20:27 |
ianw | clarkb: hey, sorry, catching up | 21:03 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:04 |
ianw | i think that i have messed up the zuul user when bootstrapping the host | 21:05 |
clarkb | ya I think our config makes uid 3000 the next one by default | 21:05 |
clarkb | which we've done because we hardcode in the 2000-2999 range | 21:05 |
clarkb | so it grabbed the next free uid according to the config and went with it | 21:05 |
clarkb | ianw: the other thing I've realized is that we may need to double check if gerrit projects have been created properly if any changes to do that have landed. It is possible they ended up being created by the periodic jobs on system-config if so | 21:06 |
ianw | i think that this bootstrapping issue is fixed by https://review.opendev.org/c/opendev/system-config/+/862845/2. with that, if we start another bridge, we will apply the correct extra_users, etc. variables to it and create the user as specified | 21:07 |
ianw | so yeah, i think it's a situation of mop up what is wrong with it now on this host, but going forward the same mistake shouldn't happen again | 21:07 |
clarkb | I'm still on the fence about that fwiw. The base playbook should cover all that | 21:07 |
clarkb | so it seems weird to double account for it all in a separate bootstrapping step? This is why I suggested we try to have launch node do it instead | 21:08 |
clarkb | but I also haven't fully digested that change so maybe it does what I'm thinking with launch and base.yaml | 21:09 |
ianw | right, i agree the base playbook will cover all that, but the definition for the extra_users was in the group_vars/bridge.yaml definition -- which was restricted to only the *current* production host | 21:09 |
ianw | so when i started a new one, it didn't apply the variables | 21:09 |
clarkb | oh I see | 21:10 |
ianw | however, what i've realised is, we can have as many active bridge0X.opendev.org hosts in the inventory as we want | 21:10 |
ianw | the important thing is that the CI jobs just choose the "current" one to run the jobs on | 21:10 |
clarkb | ianw: and the zuul reboot cronjob only runs on one | 21:10 |
clarkb | but ya I think that is correct | 21:10 |
ianw | s/jobs/production jobs/ | 21:10 |
ianw | indeed, yes modulo any external cron-ish type things like that | 21:11 |
clarkb | I think that is the only cron-ish thing we have currently due to the even more chicken and egg problems of restarting zuul with zuul :) | 21:11 |
clarkb | (we can't restart the executor running the restart job without breaking the restart job) | 21:11 |
ianw | so what that change does is changes things so that when we add hosts dynamically with "add_host" we put them in the prod_bridge group, and reference prod_bridge[0] in all the playbooks that setup and start the nested ansible | 21:12 |
clarkb | ok that helps a bit. I haven't managed to properly review that change yet because it is a bit mind-bendy | 21:13 |
clarkb | and I'm not sure I'll get to it today. But if I put it on my todo list for first thing tomorrow there is a good chance I finally manage it | 21:13 |
ianw | the production playbooks shouldn't care about the bridge running them -- the one place they did (resetting the project-config checkout on bridge) was an issue for running parallel jobs anyway | 21:14 |
ianw | so what falls out of that is that we can switch the testing jobs to bridge99.opendev.org just fine -- basically proving that we're not hardcoding the bridge name :) | 21:14 |
ianw | (by making the prod_bridge group in testing have the single host bridge99.opendev.org -- while in the gate it will have bridge01.opendev.org) | 21:16 |
clarkb | in the gate would still be bridge99? | 21:16 |
clarkb | do you mean in the deploy jobs? | 21:17 |
ianw | sorry, yes, i mean the actual production jobs, not the gate | 21:19 |
clarkb | makes sense | 21:19 |
ianw | the post-gate deploy steps :) | 21:19 |
ianw | ... but ... to the problem at hand ... resetting the zuul uid to 2031 | 21:20 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:21 |
clarkb | ianw: my hunch is that doing so should be relatively safe since zuul shouldn't own anything outside of its homedir | 21:22 |
clarkb | so in theory just a matter of chwoning and updating the uid in /etc/passwd? | 21:22 |
clarkb | fungi and corvus may have thoughts on that | 21:22 |
clarkb | that should allow the full base.yaml playbook to run on bridge so might be worth double checking that it won't do anything we don't want yet | 21:23 |
clarkb | and also maybe we need to check if any new project additions have landed and are in limbo (just in case we need to take any intervention steps) | 21:23 |
ianw | yeah, i think 1) change in passwd -- 2) chown -hvR 2031 /home/zuul 3) reboot for good measure | 21:24 |
ianw | 4) monitor base run | 21:25 |
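ianw's plan as commands, a sketch assuming 2031 is the uid the ansible group vars expect and that nothing is currently running as zuul (otherwise usermod refuses, per the caveat quoted earlier):

```shell
# 1) change the uid; usermod updates /etc/passwd and chowns /home/zuul
usermod -u 2031 zuul
# 2) belt and braces: re-own everything under the home dir, with -h so
#    symlink owners are fixed rather than their targets
chown -hvR 2031 /home/zuul
# 3) reboot for good measure
reboot
```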
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:26 |
ianw | i don't think i see anything in https://review.opendev.org/q/project:openstack/project-config since 2022-10-26 (my time) which was when the new host started running prod jobs | 21:27 |
clarkb | that simplifies things :) | 21:28 |
clarkb | dns updates would also be affected but I don't think we've had any of those. Grafana is the other one affected but its impact is minimal | 21:28 |
ianw | i just have to do quick school run, but plan to do it when i get back in about ~20 mins | 21:29 |
clarkb | ok, I don't think we're in a huge rush if there is anything else you think we should do to prepare | 21:29 |
clarkb | Next week is the Gerrit User Summit for anyone interested in joining. They will have remote access but the schedule is on London time (so timezones may make it difficult). I plan to try and wake up early and participate a bit myself | 21:31 |
ianw | i don't think so -- as soon as you mentioned it, it kind of clicked that zuul having this different uid was wrong | 21:31 |
clarkb | the thing that clicked for me was reading a modern man page for that utility | 21:37 |
clarkb | google returns old ones | 21:37 |
clarkb | and sure enough the caveats section clearly described why we were hitting the problem | 21:37 |
*** dasm is now known as dasm|off | 21:52 | |
*** rlandy|rover is now known as rlandy|bbl | 22:05 | |
clarkb | mtreinish: I've seen behavior similar to https://paste.opendev.org/show/bIfaYeDOEgM6Zz8gqEry/ across python3.8 on focal and python3.10/3.11 on jammy when using stestr. Basically we end up with a test suite that appears to have run all tests according to the orphaned stestr record file but the python process doesn't exit. When I strace things it seems the child is waiting on the | 22:08 |
clarkb | parent for something | 22:08 |
clarkb | mtreinish: do you have any idea of what is going on there? I'm not familiar enough with stestr to know where the multiprocessing fork comes from. Maybe that's just how you spin up the concurrency and pass it a different load list? | 22:08 |
clarkb | what is odd is that stestr and subunit haven't changed in at least a month but this seems very new (like last week at earliest) | 22:09 |
clarkb | I think I've managed to hold the node that that paste was made from if we need to inspect more stuff | 22:12 |
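For anyone poking at the held node, the inspection described above is just attaching strace to the leftover processes; pids are whatever ps reports for the orphaned workers:

```shell
# List the leftover stestr/python workers; the [s] trick stops grep
# from matching its own process entry.
ps -ef | grep '[s]testr'
# Attach to a hung child; -f follows any forked subprocesses.
strace -f -p <PID>
```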
ianw | i've swapped the ownership -- bridge is quiet so i'm rebooting now | 22:18 |
clarkb | ianw: the hourly jobs are running | 22:19 |
clarkb | but losing those has minimal impact | 22:19 |
clarkb | it should go to the next jobs and then retry | 22:19 |
clarkb | yup it just failed and should retry after | 22:20 |
ianw | i'll do a base run limited to bridge manually to watch it closely | 22:28 |
ianw | i have a root screen up | 22:28 |
ianw | bridge01.opendev.org : ok=65 changed=6 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0 | 22:30 |
ianw | clarkb: speaking of base, are you ok with https://review.opendev.org/c/opendev/system-config/+/862765 which removes the old ip address from sshd? | 22:33 |
clarkb | ianw: yup I think at this point we should roll forward | 22:34 |
clarkb | +2'd but not approved as my ability to monitor is quickly declining | 22:34 |
clarkb | ianw: re the secrets management did that get resolved? | 22:34 |
ianw | i thought it did but maybe not ... | 22:36 |
clarkb | ianw: I think running `edit-secrets` and ensuring that works as expected is the test for that? it was frickler who discovered it so can weigh in if it isn't doing what we expect yet | 22:37 |
clarkb | (I'm mostly just trying to make sure things are happy that I've seen or have seen others notice) | 22:37 |
ianw | yes, ok key material is there, but i get a prompt for gpg-agent that seems to randomly drop keypresses | 22:39 |
ianw | i think a .emacs fixes this | 22:40 |
opendevreview | Ian Wienand proposed opendev/system-config master: edit-secrets: configure gpg-agent/emacs https://review.opendev.org/c/opendev/system-config/+/863445 | 23:08 |