Thursday, 2023-04-13

opendevreview	Merged opendev/system-config master: Fix rax reverse DNS setup in launch https://review.opendev.org/c/opendev/system-config/+/879388	00:22
*** Trevor is now known as Guest10786		00:46
opendevreview	Ian Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252	01:23
opendevreview	Ian Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252	01:38
opendevreview	Merged opendev/system-config master: Refactor adns variables https://review.opendev.org/c/opendev/system-config/+/876936	02:31
ianw	openstack --os-cloud=openstackci-rax server list	03:38
ianw	Version 2 is deprecated, use alternative version 3 instead.	03:38
ianw	great	03:38
opendevreview	Ian Wienand proposed opendev/system-config master: launch: use apt to update packages https://review.opendev.org/c/opendev/system-config/+/880262	04:19
ianw	looks like the rdns setup in the launch node worked	04:20
ianw	but still i don't see the ssh keys printed out for adding	04:20
ianw	oh i see it. we're not passing the full domain	04:48
opendevreview	Ian Wienand proposed opendev/system-config master: launch: refactor https://review.opendev.org/c/opendev/system-config/+/880264	05:28
frickler	ianw: can you consider a new dib release with the latest openeuler fix?	05:28
frickler	also reviews on https://review.opendev.org/c/openstack/project-config/+/879196 would be nice to enable full debian testing in kolla	05:29
ianw	frickler: sure	05:36
opendevreview	Merged openstack/project-config master: Remove gerritbot from #openstack-charms https://review.opendev.org/c/openstack/project-config/+/879976	05:37
opendevreview	Merged opendev/zone-zuul-ci.org master: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/880213	05:45
opendevreview	Merged openstack/project-config master: Add nested-virt-debian-bullseye label to nodepool https://review.opendev.org/c/openstack/project-config/+/879196	05:49
ianw	frickler: pushed 3.29.0	05:55
frickler	ianw: thx	06:00
opendevreview	Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264	07:17
opendevreview	Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264	07:23
*** amoralej|off is now known as amoralej		07:25
ianw	clarkb: ^ i finally got sick of this never working. it mostly comes down to us not waiting for the host to come back up so we can scan its keys. but there was a fair bit of room for cleaning up. i think that makes things better, and the output is nicer	07:25
ianw	https://paste.opendev.org/show/b1MjiTvYr4E03GTeP56w/ is a sample	07:25
opendevreview	Merged openstack/project-config master: Retire patrole project: end project gating https://review.opendev.org/c/openstack/project-config/+/880013	07:34
opendevreview	Merged openstack/diskimage-builder master: Update satellite_repo labels + add env var https://review.opendev.org/c/openstack/diskimage-builder/+/879137	08:32
opendevreview	Martin Kopec proposed opendev/irc-meetings master: Update Interop meeting details https://review.opendev.org/c/opendev/irc-meetings/+/880302	13:05
genekuo	clarkb: I will take the python container image task, already marked my name on it.	13:58
Clark[m]	genekuo: sounds good. Let us know if you have any questions. Thank you!	14:02
opendevreview	Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.1 https://review.opendev.org/c/opendev/system-config/+/877541	14:58
opendevreview	Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181	14:58
clarkb	I've removed my old 1.19 hold and created a new one for this newer patchset	14:59
chandankumar	Hello infra	15:06
chandankumar	https://zuul.opendev.org/t/openstack/status jobs on multiple reviews are queued for more than 23 hours	15:07
chandankumar	can someone take a look? what is going on there	15:07
chandankumar	thank you :-)	15:08
clarkb	corvus: ^ possible fallout from nodepool's zk cache changes?	15:09
clarkb	it isn't specific to a single node type which rules out a problem with an image. That's an easy thing to check just from the zuul status page	15:10
clarkb	300-0020957863 and 300-0020957869 are the node requests that appear to belong to 877242 builds that are queued	15:12
clarkb	oh that info is exposed in the web ui now neat	15:12
opendevreview	Merged zuul/zuul-jobs master: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538	15:13
clarkb	nl04 appears to have processed 300-0020957869 trying to make sense of the logs now	15:14
clarkb	yes I think this is a nodepool bug working on a paste of relevant logs now	15:16
clarkb	corvus: https://paste.opendev.org/show/bBEKYQz0E2v7gZvYtvg6/ I think the cache implementation created the 0000 lock event but then saw that path wasn't present so it created a deleted event for that path. But the 0001 lock is/was held and we essentially deleted an unlocked unready node out from under the node creation process	15:18
clarkb	corvus: that must just be excellent timing between the deleted node working and our vision of locks?	15:19
clarkb	In theory we can get things moving again by manually removing the lock?	15:19
clarkb	but I'm not sure what sort of data consistency problems that might expose particularly for leaking nodes in zk for nodes that don't exist in clouds (I think those may be cleaned up though)	15:20
clarkb	oh! this occurs during a nodepool launcher restart	15:22
clarkb	I think the issue is that our consistency view of cache event ordering isn't atomic enough at startup for tasks like node deletions. Do we need to hold off on processing until the cache is coherent? I suspect that is the fix	15:23
clarkb	chandankumar: ^ tldr this is very likely a bug in nodepool itself due to a change in how caching was done (which should reduce zk load and be quicker overall but likely exposed us to this situation)	15:23
clarkb	I hesitate to blindly restart nodepool though since restarts seem to have triggered this problem	15:24
chandankumar	arxcruz: ^^	15:25
chandankumar	clarkb: thank you :-)	15:25
clarkb	infra-root if I hear no objections in the next little bit (I'm catching up on morning stuff right now) I intend on deleting static01.opendev.org and merging https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 to clean up its DNS records. static02 should be serving all the content at this point for almost a week I think it is happy	15:33
clarkb	corvus: looking at that node in question it is currently in a locked deleting state	15:36
clarkb	it seems to be stuck there which is why another provider isn't attempting to take it over. I'm not sure why it hasn't been deleted though if it is locked. Maybe the lock holder is not the deleter and they have mutually locked each other out of progressing through their respective state machines?	15:36
clarkb	I need to step away for a bit but I suspect the next steps are one looking at preventing the locking issues when restarting in the first place and two inspecting the zk nodes to determine how to get these particular requests moving forward again (possibly by removing locks and issuing manual delete requests?)	15:37
corvus	clarkb: i'm around; i can look into that	15:42
corvus	clarkb: i have an alternate read of those log lines: lock 0 was the old building lock and disappeared due to the restart. on reconnect, the cache generated a NONE event to cause it to refresh the node; we don't know the result of that (zk either said it exists or not; i don't think it's important). then we get a real DELETED event from zk (that probably means zk thought it did exist, and suggests that the restart sequence may have involved an	15:57
corvus	unclean shutdown so zk held the ephemeral node of lock 0 for the full 30 second timeout; by that time, the launcher had restarted and established a new watch. so when the timeout happens and zk deletes the lock 0 ephemeral node, we get a real deleted event). now we have an unlocked building node. the launcher notices that and locks it, creating lock 1. it then marks it for deletion, and starts the delete state machine.	15:57
corvus	clarkb: assuming all that, then i think the next question is why is that node stuck deleting? i didn't see any delete server api errors in the log.	15:59
Clark[m]	Oh that makes sense since we do remove building nodes on restart.	15:59
Clark[m]	It also doesn't seem to be logging that it is reattempting any delete actions, which I would expect if there were API errors	16:00
corvus	agreed	16:01
corvus	i'm going to see if i can figure out the right way to get an openstack cli prompt today :)	16:01
corvus	apparently --os-cloud doesn't work even though the error message i get when i try to use it says it should	16:02
corvus	env vars work	16:03
corvus	OS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack	16:03
corvus	it looks like 0033718669 does not exist	16:04
corvus	i mean the openstack server for that node	16:05
corvus	assuming OS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack server list|grep 0033718669 is the right place to be looking	16:05
Clark[m]	Yes I think that is the correct location	16:05
corvus	then i think we may want to look for an error in the server cache related to deletion...	16:06
corvus	(though obviously it's not completely broken...)	16:06
Clark[m]	Do the new np prefixed node names have the full ID number as the suffix? Might also search by uuid or double check the output of that listing without grep	16:08
corvus	yeah full number	16:08
corvus	example: np0033717123	16:08
corvus	i just double checked the uuid, no dice	16:10
Clark[m]	Ok very likely it was actually deleted then	16:10
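The lookup discussed above (selecting the cloud and region through environment variables, then searching the server listing for the node number) can be sketched in Python. This is only an illustration: the helper names `server_list_command`, `filter_servers`, and `lookup_node` are hypothetical, and the `-f value -c ID -c Name` output options are an assumption about how one might trim the listing, not something used in the log.

```python
import os
import subprocess


def server_list_command(cloud, region):
    """Build the environment and argv for an `openstack server list` call.

    The cloud and region are passed through OS_CLOUD / OS_REGION_NAME,
    since the log reports that env vars worked where --os-cloud did not.
    """
    env = dict(os.environ, OS_CLOUD=cloud, OS_REGION_NAME=region)
    # Assumed output options: one "ID Name" pair per line.
    cmd = ["openstack", "server", "list", "-f", "value", "-c", "ID", "-c", "Name"]
    return env, cmd


def filter_servers(lines, needle):
    """Return the listing lines that mention the node number or uuid."""
    return [line for line in lines if needle in line]


def lookup_node(cloud, region, needle):
    """Run the listing and filter it; needs the CLI and valid credentials."""
    env, cmd = server_list_command(cloud, region)
    out = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return filter_servers(out.stdout.splitlines(), needle)
```

An empty result from something like `lookup_node("openstackjenkins-ovh", "GRA1", "0033718669")` would match the "no dice" outcome above; searching by uuid works the same way because the ID column is included.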
corvus	i wonder if the lazyexecutorttlcache for servers on gra1 is stuck	16:11
corvus	yeah i don't think it's turning over any servers	16:12
corvus	looks like it stopped doing anything at 15:56 on april 12	16:13
corvus	so shortly after this restart we're looking at... or possibly even earlier	16:14
corvus	i'm going to get a thread dump	16:15
corvus	i don't see any stuck api calls. i do see 10 executor workers all waiting for work. also 2 extra ones, possibly from a previous unclean executor shutdown? i don't know what's going on there, but i doubt they are harmful. that makes me suspect a bug related to the cache invalidation in https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/utils.py#L498	16:25
corvus	yeah, there was an internal stop/start cycle around 06:53:43 on april 13, that's probably why we have extra executor threads. i think that means we have 2 adapters running. that should be okay -- the old one should continue to run until all its state machines are done.	16:38
corvus	whatever the problem is, it started before that internal reload anyway	16:38
corvus	and yeah, i see the stop thread running and waiting, so that confirms that.	16:39
corvus	there are 3 runDeleteStateMachine threads as expected (1 for the new gra1 and bhs1 adapters, and 1 for the old gra1 adapter)	16:41
corvus	now we're getting somewhere:	16:42
corvus	File "/usr/local/lib/python3.11/site-packages/nodepool/zk/zookeeper.py", line 2337, in unlockNode	16:42
corvus	with node._thread_lock:	16:42
corvus	i think there's a deadlock between the cleanupworker and the deletedstatemachinerunner	16:50
corvus	one of them holds the python local node._thread_lock while trying to unlock the zk node lock, and the other holds the zk node lock while trying to acquire the python local node._thread_lock	16:51
clarkb	that local node._thread_lock was recently added right?	16:52
corvus	yeah a few weeks ago	16:53
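The deadlock shape described above (each side holding one lock while waiting on the other) can be reproduced in a minimal sketch. This is not nodepool's code: both locks are modeled as plain `threading.Lock` objects, whereas the real ZooKeeper node lock is a distributed lock, and the worker names are only labels. Timeouts are used so the example reports the stall instead of hanging.

```python
import threading


def demonstrate_lock_inversion(timeout=0.2):
    """Run two threads that take the same two locks in opposite order.

    Returns a dict mapping each worker name to whether it acquired its
    second lock before the timeout expired.
    """
    thread_lock = threading.Lock()  # stand-in for node._thread_lock
    zk_lock = threading.Lock()      # stand-in for the ZooKeeper node lock
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            barrier.wait()  # both workers now hold their first lock
            acquired = second.acquire(timeout=timeout)
            results[name] = acquired
            barrier.wait()  # keep holding `first` until both attempts end
            if acquired:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("cleanup", thread_lock, zk_lock)),
        threading.Thread(target=worker, args=("delete", zk_lock, thread_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Neither worker gets its second lock here; without the timeout both would block forever. Acquiring the two locks in one agreed order everywhere is the usual way to rule this pattern out.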
clarkb	re static01 I've grown a sudden paranoia that we might host things there that are either not in ansible or afs. However looking at all of the apache config files and /var/www I think we're ok to delete that server as it basically serves as a frontend to fileservers and doesn't store any data itself (other than a few index.html files that ansible writes out). Debating if I should have	16:55
clarkb	ianw ACK that before deletion since ianw did a fair bit of the setup on that host	16:55
corvus	operationally, i think we are okay to just fix this by restarting the launcher. if it's not urgent, maybe let's wait a few more minutes in case i have any more questions	16:56
clarkb	corvus: it's been a day that changes have been stuck. I think we can wait a few more minutes	16:56
*** amoralej is now known as amoralej|off		17:18
clarkb	ianw: the launch node change lgtm but I did have one small nit/cleanup	17:26
clarkb	ianw: and left a concern on https://review.opendev.org/c/openstack/project-config/+/880115	17:38
clarkb	https://158.69.65.113:3081/opendev/system-config looks like a functional gitea 1.19.1 installation	17:46
opendevreview	Merged zuul/zuul-jobs master: containers : update test variable https://review.opendev.org/c/zuul/zuul-jobs/+/878175	18:02
opendevreview	Merged zuul/zuul-jobs master: container role docs : clarify requirements https://review.opendev.org/c/zuul/zuul-jobs/+/878176	18:05
clarkb	I am testing db transplant from etherpad01 to etherpad02 now. This should give us rough timing for how long the process will take too	18:06
clarkb	Currently running an online db dump of 01 which can be restored in 02. This isn't super fast so the downtime will be longer than I had previously anticipated	18:07
clarkb	When we do the actual prod move we'll do it with services stopped on both servers and dns can update while we do the data migration	18:07
corvus	clarkb: i think we're good to restart	18:29
corvus	i still have a term open there, i can do it	18:29
clarkb	thanks	18:33
clarkb	I'll review the change you pushed just as soon as I finish this etherpad02 data migration test	18:33
clarkb	dealing with largish databases is not fast	18:44
clarkb	if anyone is wondering	18:44
*** dhill is now known as Guest10899		18:47
fungi	catching up mid-vacation, a quick skim says the new contributor call was productive (lots of notes in the pad at least?) and nothing that needs urgent attention... yeah?	18:57
fungi	the nodepool thing today seems mostly sorted out?	18:58
clarkb	fungi: nothing urgent. I was going to delete static01 then got paranoid we might host actual data there but I checked and can't find evidence of that	18:58
clarkb	Am hoping another root can double check those assumptions though. I'm happy to wait for ianw on that. I'm also testing the data migration from etherpad01 to 02 in preparation for that move (probably next week at this point)	18:58
clarkb	looks like it will be at least an hour downtime	18:58
clarkb	enjoy your vacation nothing demands attention right now	18:59
fungi	cool. i have a couple hours before i need to go meet friends for margaritas, so just catching up on e-mail and irc backlog	18:59
fungi	i don't recall us having any locally-served content on static01, it replaced the old static.o.o which did have local content with a migration to the afs-only hosting model for files.o.o which it also replaced	19:01
clarkb	and ya for nodepool there is a change up I need to review. I've pulled it up but trying to get etherpad done in one go so I don't make a silly distracted mistake	19:01
clarkb	fungi: ya there are a couple of index.html files that ansible writes out (so is also on static02) and other than that I can't find local content	19:01
clarkb	everything else seems to be redirects or afs	19:01
fungi	etherpad migration is going to be slow because of moving from trove to local sql db? putting it on a cinder volume or rootfs?	19:02
clarkb	fungi: it is local sql db to local sql db. On a different cinder volume on each side. We could theoretically move the cinder volume instead, but I ruled against that because the current cinder volume is a bit small so the new host got a new larger volume instead	19:02
clarkb	the time is in dumping, copying, restoring the 30gb of sql database	19:03
fungi	sounds good to me. announcing an outage for pad migration should be fine	19:03
clarkb	I'm hoping I'll have the db running under 02 for functionality testing of the process in about 10 minutes	19:05
clarkb	this is effectively a fork though and shouldn't be relied on.	19:05
clarkb	I have used https://etherpad.opendev.org/p/opendev-contributor-bootstrap-202304 for testing. If you set /etc/hosts to point etherpad.opendev.org to etherpad02's IP address then you'll see newer edits that aren't there in prod	19:18
clarkb	as far as I can tell this is working so now just need to pick a time and announce it. I think it will take at least 90 minutes to do the data migration	19:19
clarkb	it's like 30 minutes to dump the db, a little longer to restore it with some time in the middle to copy and validate data as you go	19:19
clarkb	I will need to convert my notes into a document that can be shared for the process too (not an etherpad though I'll probably use paste)	19:22
opendevreview	James E. Blair proposed opendev/base-jobs master: Set default job vars for container image promote https://review.opendev.org/c/opendev/base-jobs/+/880362	19:58
clarkb	infra-root https://paste.opendev.org/show/brRuhPssVLSi4UnF5hcN/ there is the draft for production migration of etherpad data based on my notes testing this	20:05
*** Trevor is now known as Guest10911		20:15
ianw	clarkb: afaik there's nothing on static that's either not pushed from git or on afs	20:59
ianw	i've certainly always considered it as liable to disappear at any moment :)	20:59
ianw	don't think it's in the backup roster either, so lgtm for removal	20:59
opendevreview	Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264	21:05
ianw	clarkb: ^ thanks, that one should split the dns better	21:08
opendevreview	Merged openstack/diskimage-builder master: fix ifupdown pkg map for dhcp-all-interfaces of redhat family https://review.opendev.org/c/openstack/diskimage-builder/+/879537	21:46
opendevreview	Merged opendev/base-jobs master: Set default job vars for container image promote https://review.opendev.org/c/opendev/base-jobs/+/880362	21:49
opendevreview	Merged opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252	22:13
clarkb	ianw: thank you for confirming I'll delete static01 momentarily	22:58
clarkb	corvus: fyi you have a shell still on static01. I think this is from earlier this week when we debugged why it was apparently still serving zuul content	23:00
clarkb	also to double check it has no volumes attached (another indication it might be serving data outside of afs) that will need to be cleaned up	23:03
corvus	clarkb: shell closed	23:04
clarkb	corvus: thanks just in time for me to delete it	23:04
clarkb	#status log Deleted static01.opendev.org (ae2fe734-cf8f-4ead-91bf-5e4e627c8d2c) as it has been replaced by static02.opendev.org	23:06
opendevstatus	clarkb: finished logging	23:06
clarkb	https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 has been approved as well to reflect that in dns	23:07
opendevreview	Merged opendev/zone-opendev.org master: Remove old static01 records https://review.opendev.org/c/opendev/zone-opendev.org/+/879781	23:09
opendevreview	Clark Boylan proposed opendev/zone-opendev.org master: Update etherpad.o.o to point at etherpad02 https://review.opendev.org/c/opendev/zone-opendev.org/+/880168	23:11
opendevreview	Clark Boylan proposed opendev/zone-opendev.org master: Cleanup etherpad DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880169	23:11
clarkb	those are just rebases needed due to the previous change merging and updating the serial. I'll WIP them again	23:11
clarkb	ianw: it's probably a bit late to try the gitea upgrade today, but if you can review that change https://review.opendev.org/c/opendev/system-config/+/877541 (held node at https://158.69.65.113:3081/opendev/system-config) maybe we can upgrade gitea early next week	23:16
clarkb	and then I'll probably look to swap etherpad servers ~tuesday or wednesday next week if that works for others. Then I can send email about the outage a couple days in advance	23:16
ianw	++ will do	23:21
clarkb	side note the server show output is super verbose now. Not sure I'm a fan	23:32
ianw	i still see things like "Munch('...')" which i'm pretty sure isn't really supposed to be user-visible object-details	23:35
opendevreview	Merged opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264	23:51
opendevreview	Merged opendev/system-config master: launch: use apt to update packages https://review.opendev.org/c/opendev/system-config/+/880262	23:56
