clarkb | aha using uuid fixes that | 00:00 |
---|---|---|
clarkb | ok that shows that parakeet is the one | 00:01 |
clarkb | so the trick was to do a server show using uuid | 00:01 |
fungi | nice | 00:01 |
clarkb | I ran `openstack compute service set --disable --disable-reason "Only run the mirror on this hypervisor" amazing-parakeet.local nova-compute` then started the instance via our normal user creds | 00:03 |
clarkb | (because I wanted to make sure we could still start/stop etc as a normal user in this state) | 00:03 |
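A quick way to confirm the disable stuck could look like the following; these are standard openstackclient flags and need admin credentials on that cloud:

```bash
openstack compute service list --service nova-compute --long
# the amazing-parakeet.local row should show Status "disabled" with the
# "Only run the mirror on this hypervisor" reason; the other hypervisors stay enabled
```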
clarkb | I can ssh in | 00:04 |
fungi | oh cool, so it worked with the creds on the host then for sure | 00:04 |
clarkb | yes the admin stuff did | 00:04 |
fungi | yeah, mirror lgtm again | 00:04 |
fungi | happy to approve a revert of the revert of the revert of the zero max-servers or whatever. once it deploys and usage ramps up, we should be able to tell whether or not job nodes are landing there again | 00:05 |
clarkb | fungi: does that change exist yet or should I push one? | 00:05 |
fungi | it does not exist as far as i can recall | 00:05 |
opendevreview | Clark Boylan proposed openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3""" https://review.opendev.org/c/openstack/project-config/+/876652 | 00:07 |
clarkb | last call for the meeting agenda | 00:08 |
fungi | clarkb: not strictly an addition, but i guess let's make sure to talk about project renames (though we should likely conclude to push them to after the openstack release week) | 00:09 |
fungi | picking a date would be nice though | 00:09 |
clarkb | yes and I want to be done with the gitea stuff (which is closer every day) | 00:09
clarkb | speaking of I don't know why that reminded me but nebulous too | 00:11 |
clarkb | ok agenda sent | 00:15 |
fungi | oh, yeah | 00:15 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3""" https://review.opendev.org/c/openstack/project-config/+/876652 | 00:34 |
ianw | fungi: do you happen to remember why https://review.opendev.org/c/openstack/project-config/+/786686/2/gerrit/acls/openstack/release-test.config has PTL vote from the "Continuous Integration Tools" group? | 00:34 |
ianw | it seems like only zuul is in that group, and i can't see it would be leaving that vote | 00:36 |
fungi | ianw: i think you're looking for https://review.opendev.org/705990 | 00:42 |
fungi | in essence, if a ptl leaves a +1 on an openstack/releases change, there's a zuul job with a dedicated pipeline that triggers on comment-added events to see if it's from the ptl or a release liaison from the team for the deliverable the release is proposed for, and if it is then zuul leaves a +1 vote in that column | 00:43 |
fungi | as for why it's in the release-test acl, that can probably be cleaned up, it was simply copied from the releases acl | 00:44 |
fungi | since they both shared an acl up until the change you mentioned | 00:45 |
ianw | ahhh, thanks | 01:15 |
ianw | ... is:true != is:True in these acl files :/ | 02:31 |
*** Tengu7 is now known as Tengu | 02:53 | |
*** Tengu3 is now known as Tengu | 03:07 | |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : convert NoBlock to NoOp https://review.opendev.org/c/openstack/project-config/+/875993 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements https://review.opendev.org/c/openstack/project-config/+/875996 | 04:03 |
*** jpena|off is now known as jpena | 08:27 | |
opendevreview | Saggi Mizrahi proposed opendev/git-review master: Add worktree support https://review.opendev.org/c/opendev/git-review/+/876725 | 12:02 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 13:34 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 14:03 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 14:35 |
*** blarnath is now known as d34dh0r53 | 14:58 | |
fungi | graphs for rax-ord are still looking fairly unhealthy compared to the other regions | 15:24 |
fungi | as for inmotion, it looks like we're not booting servers there at all yet | 15:26 |
fungi | nodepool configuration was updated at 00:39z and has max-servers: 51 | 15:27 |
fungi | server list for the nodepool tenant shows 3 active servers and 3 deleted servers, all with creation dates from a week or more ago | 15:32 |
fungi | nodepool list knows about 4 of those, one of which is a held node and the others are deleting | 15:33 |
clarkb | fungi: any indication if nodepool even attempted to boot in inmotion? | 15:49 |
fungi | that's what i'm trying to figure out from the logs. i don't see it trying currently | 15:52 |
fungi | just it repeatedly trying to delete these three nodes that are stuck in deleting state | 15:52 |
fungi | looks like it's declining requests, trying to determine why | 15:55 |
clarkb | possible we've got the placement leaks again leading it to think the quota is all used up? | 15:55 |
clarkb | but ya nodepool logs should have reasonably concrete info to go off of | 15:56 |
fungi | it doesn't seem to say why it declined this request though | 15:57 |
fungi | it gives a ton of info about the request itself, but not the reason it got declined | 15:58 |
clarkb | there are a couple of paths that can take it to that point. The first is deciding it doesn't support the label (arm64 or extra large nodes etc) | 15:58 |
clarkb | another is failing three boots I think you need to look for those three failures to find why they failed | 15:59 |
fungi | this was a request for a single ubuntu-bionic node to run ansible-playbooks-tox-bandit | 15:59 |
clarkb | then ya you may need to look over the history of the request to see if there were three failing boots | 15:59 |
fungi | not finding any indication it's attempting to boot anything though | 16:00 |
fungi | and the only tracebacks are for the delete failures | 16:00 |
clarkb | hrm maybe there is another short circuit then? | 16:01 |
fungi | this is a sample from the debug log: https://paste.opendev.org/show/bkOR3P27zG4bVTD9HRaS/ | 16:02 |
clarkb | fungi: the grafana graphs still show max at 0 | 16:06 |
clarkb | I think the issue is the change to max hasn't applied | 16:08 |
clarkb | possibly the same deal with rax-ord? | 16:08 |
fungi | the config updates deployed (and i checked the files have the correct content), but yeah maybe the launchers don't notice file changes and/or any reload signal we send isn't working? | 16:12 |
clarkb | previously it just reloaded the config on every pass through the run loop but ya maybe it isn't doing that anymore? | 16:12
clarkb | we could try restarting the launcher to see if the behavior changes | 16:13
frickler | there are some other errors in the 03_03 log | 16:14 |
frickler | those might have made the statemachine get stuck, I would also suggest a restart | 16:14 |
fungi | what timestamp? | 16:14 |
frickler | oh, it even says this: 2023-03-03 18:15:14,368 DEBUG nodepool.StateMachineProvider.inmotion-iad3: Removing state machine from runner | 16:16 |
fungi | d'oh! | 16:16 |
frickler | just after some nova API error | 16:17 |
fungi | "Removing state machine from runner" seems to appear many times in all the logs though | 16:17 |
fungi | even back to the oldest retained log from 2023-02-25 | 16:18 |
fungi | 500 occurrences in launcher-debug.log.2023-02-25_05 | 16:18 |
fungi | i see it on all the launchers, not just 02 | 16:19 |
fungi | nodepool docs seem to indicate that it should be rereading its configuration automatically, yes | 16:24 |
clarkb | gitea05-08 are all out of the gerrit replication config (I just verified that the daily jobs synced us up) | 16:27 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/876470 is ready when we're happy with the new servers (this removes those 4 servers from config management) | 16:27
clarkb | after which they can be deleted | 16:28 |
clarkb | fungi: switching the grafana graphs for inmotion to a 24 hour timeframe it never has a max value >0 | 16:28 |
clarkb | is the exception frickler pointed out due to deletion not creation? | 16:28 |
fungi | yeah, there are a bunch of deletion failures with tracebacks, related to the three instances stuck in a deleting state | 16:29 |
fungi | the launcher keeps repeatedly trying to delete them | 16:30 |
clarkb | ya thats "normal" for that situation | 16:30 |
clarkb | it definitely looks like it hasn't noticed the max-servers value bumped up | 16:30 |
clarkb | maybe doing a thread dump to see if the run loop is stuck before it can reload its config then restart the service? | 16:31
fungi | oh, good call. yeah doing that now | 16:45 |
clarkb | remember you need to do a pair with a short pause in the middle otherwise the profiler remains on and things will be extra slow. Less important if you end up restarting things though | 16:47
fungi | yeah, i usually do a 60-second sleep between them | 16:47 |
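A rough sketch of that procedure, assuming the launcher process is visible from the host PID namespace and that SIGUSR2 toggles the stack dump and yappi profiler as described above:

```bash
pid=$(pgrep -f nodepool-launcher | head -1)
sudo kill -USR2 "$pid"   # first signal: dumps thread stacks and starts the profiler
sleep 60                 # give it a minute to collect profile data
sudo kill -USR2 "$pid"   # second signal: dumps stacks plus yappi stats and stops profiling
# output lands in the launcher's debug log, e.g. /var/log/nodepool/launcher-debug.log
```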
fungi | okay, dump with yappi stats is in nl02:~fungi/stack_dump.2023-03-07 | 16:53 |
clarkb | fungi: yup you can see PoolWorker.inmotion-iad3-main is blocking on a stop event | 16:58
clarkb | I suspect its been sitting there for a while | 16:58 |
clarkb | oh though its got a timeout listed | 16:59 |
clarkb | so maybe not? | 16:59 |
fungi | so the plan for nl02 is to stop the launcher container, try to clean up all the server instances in inmotion-iad3 which nodepool has no record of (i count 2 in an active state not matching any entries in nodepool list output), then start the container again? | 16:59 |
fungi | also would be good if we could identify why these other three nodes are undeletable | 17:00 |
clarkb | I think those three have been undeleteable for months. I looked at them when I did the placement cleanup and couldn't sort it out | 17:00 |
clarkb | I'm not sure you need to clean those up before starting the service again | 17:00
*** artom_ is now known as artom | 17:01 | |
clarkb | ok that run loop doesn't load new configs. I think an outer loop must do that | 17:01 |
fungi | oh, actually there are 4 nodes in an active state, three of which aren't in the nodepool list | 17:01 |
fungi | okay, it also isn't deleting those when i issue a server delete | 17:03 |
fungi | they're still in active state | 17:03 |
fungi | i wonder if oom killer might have terminated some openstack services on parakeet? | 17:05 |
clarkb | I don't think that is the issue. Those nodes are not on parakeet (only the mirror is) | 17:05 |
fungi | is there a clean way to restart everything? e.g. reboot the server? | 17:05 |
fungi | the nodes aren't on parakeet, but some of the api services are | 17:05 |
clarkb | and the setup runs docker with a force restart on everything iirc | 17:05 |
clarkb | fungi: I think you can do a docker ps -a to see if anything isn't running | 17:06 |
clarkb | I suspect this has to do with nova just losing track of things on its own | 17:06 |
clarkb | which we've already had to do cleanup for in the past. It's just that this subset is for a different reason that isn't understood yet | 17:07
frickler | likely not related, but there's also a lot of images stuck in deletion there | 17:08 |
fungi | harbor.imhadmin.net/deployment/node-agent:master and kolla/centos-binary-telegraf:victoria are in "restarting" status, harbor.imhadmin.net/deployment/fm-deploy:master is in exited status, all other containers on that server have an up status | 17:09 |
clarkb | fm-deploy is their deployment tool iirc and that is expected for it to exit once deployment is done | 17:09 |
clarkb | node-agent I can only assume is some sort of phone home service to the management system. | 17:10 |
clarkb | telegraf too. I think all the openstack services must be running then | 17:10 |
fungi | yeah, seems so | 17:10 |
fungi | leading me to wonder why `openstack server delete` of an active state instance does nothing, and the instance remains listed as active | 17:11 |
fungi | i suppose these and the stuck deleting instances could all be victims of the oom killer on parakeet | 17:11 |
fungi | mmm, nope they're spread across 3 different hostids | 17:13 |
clarkb | I think they long predate that | 17:15 |
clarkb | OOM killer is a relatively recent thing | 17:15 |
clarkb | Looking at the thread dump I'm not seeing the statemachine provider object for the inmotion cloud | 17:15 |
clarkb | ok PoolWorkers refer to the top level NodePool object to get their Provider objects. The Provider objects capture the configs and they should be replaced when configs update | 17:20
clarkb | we should log "Creating new ProviderManager object" when swapping them out /me looks | 17:22 |
clarkb | 2023-03-07 00:39:49,754 DEBUG nodepool.ProviderManager: Creating new ProviderManager object for inmotion-iad3 | 17:23 |
clarkb | so it does create the new provider manager when the new config shows up but then doesn't seem to reflect the new max-servers value? | 17:24
fungi | or else there's something else causing it to summarily decline requests, since it doesn't actually say why it declined them | 17:25 |
fungi | oh, though statsd is still reporting max:0 | 17:26 |
fungi | which seems to support your theory | 17:26 |
clarkb | it never stopped | 17:26 |
clarkb | grep "DEBUG nodepool.StateMachineProvider.inmotion-iad3: Stopp" /var/log/nodepool/launcher-debug.log.2023-03-0* | 17:26 |
*** jpena is now known as jpena|off | 17:26 | |
clarkb | that shows that on the third, when we last turned it off, the provider goes from stopping to stopped. But this time around it never does the transition | 17:26
clarkb | it waits for all launchers and deleters to stop | 17:27 |
clarkb | perhaps the deleters were stuck? | 17:27 |
* clarkb looks for this thread in the dump | 17:27 | |
clarkb | Thread: 139823599843072 NodePool d: False <- that's where the apparently stuck stop() call is sitting | 17:29
clarkb | fungi: so ya I believe a restart will fix this. I don't know where it is getting stuck yet | 17:31 |
fungi | on nl01 for comparison: | 17:31 |
fungi | 2023-03-06 17:43:29,343 DEBUG nodepool.StateMachineProvider.rax-ord: Stopping | 17:31 |
fungi | 2023-03-07 04:41:03,927 DEBUG nodepool.StateMachineProvider.rax-ord: Stopped | 17:31 |
fungi | that took 11 hours | 17:32 |
fungi | longer than i would have anticipated | 17:35 |
clarkb | ya based on the tracebacks it appears to be waiting on a launcher or deleter. However looking at the logs I see it running then reporting it ran 3 statemachines | 17:35 |
clarkb | oh! but the state machines have to report they have completed for them to be removed | 17:35 |
clarkb | corvus: ^ fyi I think we've found an issue with openstack + statemachine related to clouds being unhappy that results in us holding onto old configs for far too long | 17:36
clarkb | I suspect these three statemachines are never going into a completed state so we're never reloading our config | 17:36 |
clarkb | fungi: and you said there are three nodes in a deleting state? | 17:37 |
clarkb | it's curious that we were able to restart in this situation when we set max-servers to zero but not now | 17:38
fungi | well, there are three nodepool knows about that are in a deleting state and it keeps trying to delete them | 17:38 |
fungi | there are also three it doesn't seem to be tracking any longer which are stuck in an active state and won't transition to deleting in nova | 17:38 |
corvus | clarkb: https://review.opendev.org/875250 | 17:40 |
clarkb | aha thanks | 17:42 |
clarkb | fungi: ^ I think we can restart the service and review that change | 17:42 |
corvus | fungi: nodepool should continue to try to delete leaked nodes (assuming they have the correct metadata) without tracking them; so it should continue to try to delete them even if it doesn't have a zk record (probably worth double checking the logs to make sure that happens) | 17:43 |
fungi | here's the comparison of what nova and nodepool see for inmotion-iad3 btw: https://paste.opendev.org/show/bUgMuhc2IbvlFPNFWb0o/ | 17:44 |
fungi | and yeah, 875250 looks like what we're seeing on nl02 | 17:45 |
fungi | okay, downing and upping the launcher container on nl02 | 17:45 |
corvus | fungi: that paste looks good to me -- is there anything wrong with that? | 17:46 |
fungi | corvus: nothing wrong on the nodepool side, problems on the nova side | 17:46 |
corvus | fungi: ack. makes sense | 17:46 |
clarkb | corvus: what is weird to me looking at the log on nl02 is that we keep reporting that 3 tasks are running and then 3 tasks ran. It isn't clear to me why we never transition to completed and remove those three machines. Maybe because the deletions don't ultimately succeed? | 17:47
fungi | several instances stuck in delete for weeks which nodepool is no longer tracking, several more active which can't be deleted for some unknown reason (but which nodepool is repeatedly trying) | 17:47 |
corvus | clarkb: hrm | 17:47 |
fungi | clarkb: yeah, nova is still reporting the instances as "active" | 17:48 |
fungi | i had it backwards originally, but match up the uuids in my paste and you'll see | 17:48 |
clarkb | fungi: right and if you manually try to delete them it doesn't help. It's also a nova problem, but one that may be causing those tasks to hang around forever? | 17:49
fungi | the four nodes which nodepool knows about are in active state according to nova, one is for an autohold but the other three are needing to be deleted | 17:49 |
clarkb | I think ideally we'd want it to try and give up if being stopped and let the new provider manager handle it | 17:49 |
fungi | and nova is silently ignoring the delete attempts as far as i can see | 17:49 |
clarkb | with corvus' proposed change I think we may end up with both the old and new provider manager running threads attempting to delete the resources (not the end of the world) | 17:49
fungi | okay, so ready for a down/up of the launcher container on nl02? | 17:50 |
clarkb | I am but maybe corvus wants to look at it first? | 17:50 |
fungi | sure, i can wait for a bit | 17:50 |
corvus | clarkb: i think (haven't totally confirmed this) that we're seeing the behavior described in the commit message, but with deleting state machines -- perhaps it is removing the delete state machines after the delete timeout hits, but then the periodic leaked node cleanup is adding new ones | 17:51
corvus | fungi, clarkb: you're clear to restart; i think any further questions can prob be handled in logs | 17:51
clarkb | corvus: oh! that would be a fun interaction | 17:51 |
fungi | got it, proceeding | 17:51 |
corvus | clarkb: so anyway, yeah, that change should handle that by causing the new delete-state-machines to go to the new provider, and there will be 2 for a brief time (<5min) and then the old one finally disappears | 17:51 |
clarkb | corvus: yup | 17:52 |
clarkb | fungi: I was wrong df55493d-bae7-47b2-9069-1e488c09a2fd is only a few days old. | 17:52 |
clarkb | fungi: its task state is in deleting but it isn't actively deleting. I think we ran into this before | 17:52
fungi | #status log Restarted the nodepool-launcher container on nl02 in order to force a config reload, as a workaround until https://review.opendev.org/875250 is deployed | 17:52 |
opendevstatus | fungi: finished logging | 17:52 |
clarkb | fungi: basically nova sets the task state to deleting then fires off an event for that. If that event gets lost, it won't let you set the state to deleting by issuing another normal delete (which would refire the task) because it is already deleting | 17:53
clarkb | fungi: iirc melwitt linked to some process to double check things then you manually override the state back to acitve and issue another delete | 17:53 |
corvus | clarkb: https://paste.opendev.org/show/bh9N99Gx60aXEYwM7bCC/ here's an edited series of log messages that i think shows that sequence | 17:53 |
clarkb | I forget what that process was but we can probably just yolo | 17:53 |
clarkb | corvus: ++ | 17:54 |
fungi | clarkb: but df55493d-bae7-47b2-9069-1e488c09a2fd is stuck in "active" not "deleted" | 17:54 |
fungi | at least according to nova | 17:54 |
clarkb | fungi: its task_state is "deleting" | 17:54 |
clarkb | and if nova loses the event to actually make that happen you get stuck in limbo here | 17:55
clarkb | the fix is to reset task state back to active then reissue the delete command | 17:55 |
corvus | is that a user-level action or cloud-admin-level? | 17:55 |
clarkb | I found a purple link df55493d-bae7-47b2-9069-1e488c09a2fd | 17:55 |
clarkb | corvus: admin level | 17:55 |
corvus | that=reset task state | 17:55 |
corvus | ack | 17:55 |
fungi | oh, yep okay. OS-EXT-STS:task_state=deleting OS-EXT-STS:vm_state=active | 17:56 |
fungi | and server list reports the latter | 17:56 |
fungi | thus my confusion | 17:56 |
clarkb | and that server isn't on parakeet so unlikely related to OOMs | 17:57 |
fungi | right, i compared the hostids reported for these and they're scattered around the cluster | 17:57 |
clarkb | aha I found my notes | 17:57 |
clarkb | `openstack server set --state active $UUID` | 17:57 |
clarkb | then you can issue a normal delete | 17:58 |
clarkb | from my notes it looks like maybe only the orphaned allocations needed double checking but not the task state being stuck | 17:58
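A minimal sketch of that recovery sequence, using one of the stuck instances under discussion; the creds file and admin-only commands are the ones mentioned elsewhere in this log:

```bash
source /etc/kolla/admin-openrc.sh   # admin creds on the kolla host
UUID=df55493d-bae7-47b2-9069-1e488c09a2fd
# confirm it really is wedged: vm_state active but task_state stuck in deleting
openstack server show "$UUID" -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
# reset the state (admin only), then reissue a normal delete
openstack server set --state active "$UUID"
openstack server delete "$UUID"
```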
clarkb | fungi: do you want to issue that or should I? | 17:58 |
clarkb | I suspect that nodepool will then automatically clean them up for us | 17:58 |
fungi | i'll give it a go | 17:59 |
corvus | this is the cloud we have admin on? should we make a cron script that looks for active/deleting nodes and then issue that command? i sort of suspect that major cloud operators may do that -- based on the fact this seems to happen occasionally everywhere and then eventually they get cleaned up... | 17:59 |
clarkb | I would server show them as admin first to make sure they have the task_state in deleting | 17:59
fungi | clarkb: where was the clouds.yaml on the servers there? | 17:59 |
clarkb | corvus: yes and not a bad idea. I think we'd want some delay | 18:00 |
clarkb | fungi: source /etc/kolla/admin-openrc.sh | 18:00 |
fungi | aha, thanks, i'll source that | 18:00 |
clarkb | corvus: because there is a period of time where that set of states is valid. But if it goes for more than say an hour then we're probably good to clean it up | 18:00 |
fungi | clarkb: is that on all the servers? which one did you find it on? not seeing it on .230 | 18:01 |
clarkb | fungi: I'm on 229. But I thought it was deployed to all of them | 18:01 |
* clarkb looks | 18:01 | |
fungi | it's on .229 yeah, thanks | 18:01 |
clarkb | huh ya I guess they only deploy that to the first server | 18:02 |
fungi | and what's the path to openstackclient? | 18:02 |
corvus | clarkb: yep. is there a timestamp for the task_state? or would we just need to have persistence to detect it? | 18:02 |
fungi | # openstack server list | 18:02 |
fungi | -bash: openstack: command not found | 18:02 |
clarkb | fungi: /opt/kolla-ansible/victoria/ it's in that venv. You can activate it or just use the bin path | 18:02
corvus | (that might be worth looking at real quick before you reset these states :) | 18:02 |
fungi | aha, okay | 18:02 |
clarkb | corvus: it looks like the updated field is getting updated frequently. I did a show earlier and got a timestamp from a few minutes prior and a show now shows updated a few minutes prior to now | 18:03 |
clarkb | corvus: we probably need to keep track ourselves. Anything that is in that state goes into a list and if it is still in that state an hour later we update it | 18:04 |
fungi | and as admin, it's server list --all-projects otherwise you get a blank list due to it checking the admin tenant i guess | 18:04
corvus | clarkb: ack, good to know | 18:04 |
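A very rough sketch of what that periodic check might look like, with a crude persistence file for the one-hour delay; paths, column names, and the threshold are illustrative only, not a tested implementation:

```bash
#!/bin/bash
STATE=/var/lib/stuck-deletes.txt
touch "$STATE"
now=$(date +%s)
# as admin, find every instance whose task_state is wedged in "deleting"
openstack server list --all-projects --long -f value -c ID -c "Task State" |
awk '$2 == "deleting" {print $1}' |
while read -r uuid; do
    first=$(awk -v u="$uuid" '$1 == u {print $2}' "$STATE")
    if [ -z "$first" ]; then
        echo "$uuid $now" >> "$STATE"            # first sighting: start the clock
    elif [ $((now - first)) -gt 3600 ]; then     # still deleting an hour later
        openstack server set --state active "$uuid"
        openstack server delete "$uuid"
    fi
done
```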
fungi | OS-EXT-STS:task_state=None after i did that to d52712f4-80a6-4f9a-979c-3031972cd89f, so now to see if the launcher successfully deletes it | 18:06 |
clarkb | fungi: I see task state is deleting already | 18:07 |
clarkb | but it hasn't gone away | 18:07 |
clarkb | if it continues to not go away we may have to check compute logs on the host | 18:08 |
fungi | it was already in a deleting state from nodepool's perspective, i think it takes a little while to get back to retrying to delete it though | 18:08 |
clarkb | fungi: no I mean in nova | 18:09 |
clarkb | task_state is deleting | 18:09 |
clarkb | oh wait you did a different host | 18:09 |
clarkb | sorry | 18:09 |
fungi | yeah, i've only done d52712f4-80a6-4f9a-979c-3031972cd89f so far | 18:09 |
fungi | the first one in the nodepool list output | 18:10 |
fungi | if memory serves, it may be as much as 10 minutes before the launcher tries to delete things again | 18:10 |
clarkb | it looks like it is trying now | 18:11 |
fungi | the restart did get the config reread though, and looks like we're booting new nodes there again | 18:11 |
clarkb | the nodepool logs indicate "State machine for 0033156866 at deleting server" | 18:11 |
clarkb | but server show doesn't indicate a state change yet | 18:11 |
clarkb | and its gone | 18:13 |
fungi | yay! | 18:13 |
fungi | i'll reset the other two offenders | 18:13 |
fungi | now as for the ones nova indicates have been in a "deleted" state since last month, is there a chance the same state reset will work for them as well? | 18:14 |
clarkb | For that we may need to talk to nova. I'd worry that will cause nova to try and do things that can't happen since the nodes are actually gone | 18:15 |
clarkb | and then create more problems | 18:15 |
* clarkb reviews the nodepool fix now | 18:15 | |
fungi | yeah, server show doesn't indicate any related errors for them | 18:15 |
fungi | anyway, i'll switch to revisiting rax-ord | 18:16 |
fungi | still seeing "Timeout waiting for instance creation" | 18:20 |
clarkb | fungi: so that timeout is actually the one waiting for the instance to go active | 18:22 |
clarkb | which was already at 10 minutes | 18:22 |
clarkb | boot timeout is once active how long to wait for ssh to be up | 18:22 |
clarkb | so maybe we need to increase the launch timeout | 18:23 |
clarkb | re auto updating task_state we should also resync with openmetal and see if they are deploying more up to date openstack clouds now. Then redeploy openstack and get bugfixes | 18:24 |
clarkb | pretty sure melwitt said this issue should be addressed in like yoga? | 18:24 |
fungi | also plenty of "Timeout waiting for instance deletion" | 18:26 |
clarkb | ya the general slowness is why I suggested dialing back max-servers there to see if we were creating the noise | 18:27 |
fungi | definitely looks like the servers nodepool is giving up on are in an active state in nova, just spot-checking again at random | 18:29 |
clarkb | so they do eventually go active? that would imply our timeout is simply too short? | 18:30 |
fungi | looking at node 0033394230, journalctl reports sshd was fully started at 18:08:24, while the launcher log says it gave up waiting at 18:08:59 | 18:33 |
clarkb | fungi: that timeout is nodepool waiting for openstack to report the node is active though | 18:35 |
fungi | yeah, i just wanted to find something to compare against | 18:35 |
fungi | maybe nova is lagging reporting the status change in the server list | 18:35 |
clarkb | could be | 18:36 |
fungi | nodepool log says it started building the node at 17:52:30 and the state machine went to "submit creating server" at 17:59:05, that's also a bit of a lag? | 18:36 |
clarkb | fungi: that lag is possibly due to the rate limiter if we have a lot of requests to process in that cloud | 18:37
clarkb | this might be another good reason to drop max-servers as it should provide a cleaner picture of what is going on in the logs | 18:37 |
fungi | yeah, i'll push a change up to ~halve it for now | 18:37 |
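For reference, the per-node timeline being compared above can be pulled out of the launcher debug log with something like this; the node id is the one from the discussion and the match patterns are approximate:

```bash
grep 0033394230 /var/log/nodepool/launcher-debug.log | \
    grep -E "creating server|Timeout waiting|launch"
```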
opendevreview | Jeremy Stanley proposed openstack/project-config master: Try halving max-servers for rax-ord region https://review.opendev.org/c/openstack/project-config/+/876775 | 18:41 |
clarkb | fungi: note that we should've increased launch-timeout not boot timeout. I should have checked more closely yesterday | 18:45 |
clarkb | but I think we can do that increase next | 18:45 |
fungi | yeah, it was unclear to me that "launch" is the server create time and "boot" is the ssh access time | 18:52 |
opendevreview | Merged openstack/project-config master: Try halving max-servers for rax-ord region https://review.opendev.org/c/openstack/project-config/+/876775 | 19:11 |
fungi | checking up on the inmotion-iad3 environment now that things should have reached a steady state there, the host where the mirror vm is running is the only one with a single qemu-kvm process, the others have two or more, suggesting that the maintenance disable for nova is working | 19:17 |
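That spot check can be scripted along these lines; the hypervisor hostnames are placeholders:

```bash
for h in hv1 hv2 hv3; do
    printf '%s: ' "$h"
    ssh "$h" pgrep -c -f qemu-kvm   # running instance count per hypervisor
done
```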
fungi | looks like we've been setting auth = DEVELOPMENT_BECOME_ANY_ACCOUNT in the git-review tests since they were originally introduced in 2013 | 19:40 |
clarkb | fungi: ya so zuul quickstart sets that then logs in to create a first admin account. Then it configures the admin account and everything else from there | 19:42 |
clarkb | or something along those lines | 19:43
fungi | i've rechecked the change so i can remember what errors it was giving | 19:44 |
clarkb | looks like the upstream gerrit docker image bakes in gerrit admin:secret | 19:47 |
clarkb | maybe that is the default for dev mode? | 19:48 |
clarkb | looks like our system-config testing does the same thing somehow | 19:48 |
fungi | was also running into https://bugs.chromium.org/p/gerrit/issues/detail?id=16215 | 19:48 |
ianw | hrm, now i've found -> https://gerrit-review.googlesource.com/c/gerrit/+/339542/14/java/com/google/gerrit/server/schema/MigrateLabelFunctionsToSubmitRequirement.java | 19:49 |
ianw | for NoOp/NoBlock, that translates things to "NoBlock" (i've chose NoOp) and adds a s-r with "applicableIf: is:false" | 19:50 |
clarkb | I guess NoBlock is canonical? I'm fine with switching to it then | 19:51 |
clarkb | The applicableIf is weird. Basically says this submit requirement exists but does nothing | 19:51
clarkb | even less than it exists and always allows things to go through :) | 19:51 |
ianw | right, that makes it a "trigger" vote | 19:51 |
ianw | my attempt was doing "submittableIf = is:false" | 19:51 |
ianw | that makes it weird in the UI | 19:51 |
ianw | i think i am re-writing the change stack, again :) | 19:52 |
clarkb | looks like admin:secret is the default for local dev mode | 19:59 |
clarkb | fungi: ^ so ya you should be able to start in dev mode and then use that account | 19:59 |
clarkb | fungi: re the ssh issue I remember looking into that and wondering why it never seemed to affect us | 20:00
clarkb | but I would look at updating your git-review change to gerrit 3.6 maybe | 20:00
clarkb | since that has the updated mina that can handle ssh + sha2 | 20:00 |
fungi | well, i think some of the challenge was also staying compatible with the ssh and python on ubuntu bionic | 20:01 |
fungi | so latest gerrit we can run and test on a bionic node | 20:02 |
fungi | https://zuul.opendev.org/t/opendev/build/547c54234e6e400cbbdece0badaa7ac9 is a fresh failure for it | 20:02 |
ianw | what's the context here? switching the All-Projects testing? | 20:03 |
clarkb | I don't think anything should prevent gerrit 3.6 from running on bionic | 20:03 |
clarkb | ianw: git-review functional testing runs gerrit | 20:03
clarkb | but it's currently set up to run really old gerrit | 20:03
fungi | yeah, okay so it's the ssh key uploads it was stuck on | 20:04 |
fungi | RuntimeError: SSH key upload failed: <Response [401]> "Unauthorized" | 20:04 |
fungi | i want to say where i got to was determining that in newer gerrits, the DEVELOPMENT_BECOME_ANY_ACCOUNT mode didn't work for uploading ssh keys via the rest api any more | 20:07 |
fungi | oh, this may be the digest vs plain auth problem | 20:07 |
fungi | right now we're posting to the api with auth=requests.auth.HTTPDigestAuth('admin', 'secret') | 20:08 |
Clark[m] | It definitely works because both zuul and system-config use the rest API to set keys | 20:08 |
fungi | that needs to be plain in more recent gerrit, right? | 20:08 |
fungi | HTTPBasicAuth | 20:10 |
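A hedged shell equivalent of what the test harness is doing, switched to basic auth; the port and key path are placeholders, and admin:secret is the dev-mode default mentioned above:

```bash
curl -u admin:secret -X POST \
     -H "Content-Type: text/plain" \
     --data-binary @"$HOME/.ssh/id_rsa.pub" \
     http://localhost:8080/a/accounts/self/sshkeys
```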
opendevreview | Jeremy Stanley proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 20:12 |
fungi | guess we'll see | 20:12 |
clarkb | fungi: if you decide to stick to 3.4 you probably want to bump the version to 3.4.8 at least | 20:41 |
clarkb | looks like the latest error has to do with sending email saying the ssh keys were added | 20:43
clarkb | I think you can just disable email | 20:43 |
clarkb | ya sendemail.enable set to false disables all email sending | 20:44 |
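One way to flip that in a test site's config before starting gerrit; the site path is a placeholder for wherever the test harness initializes gerrit:

```bash
git config -f /path/to/gerrit-site/etc/gerrit.config sendemail.enable false
```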
fungi | well, to go past 3.4.4 we'll also need to implement a workaround for the unfixed bug report i linked, i think? | 20:44 |
fungi | or is that fixed and the devs just never closed the bug? | 20:44 |
fungi | basically openssh client on ubuntu bionic not working with the newer key exchange defaults on 3.4.5, seems like | 20:45 |
clarkb | oh it looks like 3.4.5 updated mina to 2.7.0 from 2.6.0 ya I don't think they updated beyond that | 20:47 |
fungi | so we'd need to do gerrit 3.5.x i guess to get past it? | 20:47 |
fungi | basically they introduced a regression in a 3.4 point release and then never bothered to fix it before dropping support for that series | 20:48 |
clarkb | well if mina 2.7.0 is the problem then I guess no newer gerrit would work | 20:48 |
fungi | it looks like https://issues.apache.org/jira/browse/SSHD-1163 is supposed to be fixed though | 20:49 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 20:49 |
fungi | supposedly fixed in mina-sshd 2.8.0 | 20:50 |
clarkb | in that case 3.5 or newer should work | 20:50 |
clarkb | since 2.8.0 got backported to 3.5 | 20:50 |
clarkb | I pushed a patch to disable email | 20:50 |
fungi | but yeah, first i wanted to see this work before rolling versions forward even more | 20:50 |
fungi | thanks! | 20:50 |
fungi | also there's some benefit to testing the oldest version of gerrit necessary for the features being added, since it helps avoid breaking git-review for users of somewhat older gerrit deployments (though at the expense of not catching bugs/regressions with newer gerrit releases) | 20:51 |
clarkb | yup | 20:52 |
clarkb | But part of the issue here is we stuck to old versions which grew stale, and that makes updates more difficult | 20:52
clarkb | I suspect that keeping up with the latest stable release would minimize deltas we have to sort out | 20:52 |
melwitt | clarkb: the doc you might be thinking of regarding task_state deleting might be this https://docs.openstack.org/nova/latest/admin/support-compute.html#reset-the-state-of-an-instance | 20:58 |
clarkb | melwitt: yup thanks. For servers that say they are deleted but haven't been removed is there something to look at? we were wondering if we should set them active and try to delete them similar to ^ | 20:59
clarkb | but worried that since they are actually deleted that might cause more problems | 20:59 |
melwitt | clarkb: do you mean that they are gone from the server list/server show but still have libvirt guests running? | 21:01 |
clarkb | melwitt: no they show up in server list as status DELETED and OS-EXT-STS:vm_state | deleted | 21:02 |
melwitt | fwiw I don't think it would hurt to reset state and try to delete them again. I'm not sure if it would help either | 21:03 |
clarkb | we want them to not show up in listings any longer (because nodepool thinks it leaked them and wants to delete them) | 21:03 |
clarkb | ok thats helpful if only because it gives us something new to try | 21:03 |
melwitt | I'm not sure I've seen that before where they are listed showing as DELETED when not using nova list --deleted | 21:06 |
fungi | it's showing up when we `openstack server list` as a normal account in the tenant (don't need to be an admin) | 21:07 |
clarkb | power state is NOSTATE and task_state is none | 21:07 |
clarkb | I think it may have actually deleted the instance everywhere but in the db | 21:07 |
melwitt | the instance record gets deleted on the compute host if that nova-compute service is "up". it's possible something failed before it got to the instance record deletion, that's pretty much the last step | 21:08 |
fungi | melwitt: looks like this to a normal user: https://paste.opendev.org/show/bRRe78SmoX2cls5Lovkl/ | 21:09 |
melwitt | it would have failed somewhere after the save() call here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3254-L3267 | 21:10 |
clarkb | ah ok so we could go hunting for logs maybe depending on how catastrophic the failure was | 21:10
melwitt | which apparently the main thing that happens there is deleting the allocation in the placement service | 21:10 |
fungi | which i guess eventually leads placement to thinking it can't add more servers if we accumulate too many of those? | 21:11 |
melwitt | so yeah, you would want to look at nova-compute.log at the time of the deletion. what might have happened is the allocation deletion failed for some reason. in the past a DELETE in placement was not "forced" and could fail for a 409 conflict | 21:11 |
melwitt | which some of us thought was wrong and we changed it to a "force" sometime semi recently. maybe a couple of years ago. not sure what version yall are using | 21:12 |
clarkb | this is victoria aiui | 21:13 |
clarkb | and ya I think one of the things we should do after server upgrades and gerrit 3.7 and mailman3 is redeploy this if they've got a modern version supported in their tool (they must) | 21:13 |
melwitt | ok, so this landed in the xena release https://review.opendev.org/c/openstack/nova/+/688802 but I'm making assumptions that the placement allocation delete failed. if it did, I think resetting the state and trying the delete again might work | 21:15 |
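Checking the compute log for a failed allocation delete could look roughly like this; the path is kolla-ansible's usual log location and the uuid is a placeholder:

```bash
# run on the hypervisor that hosted the instance
UUID=<instance-uuid>
sudo grep "$UUID" /var/log/kolla/nova/nova-compute.log | grep -iE "delete|allocation|placement"
```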
clarkb | fungi: my patch fixed the email thing. Now it's complaining about output from git-review based on responses from the server | 21:15
clarkb | fungi: so I think at this point it's just a matter of updating the tests to match modern gerrit behaviors | 21:15
fungi | melwitt: worth a shot, thanks for the deep dive! | 21:15 |
fungi | i reset the state on the three perpetually "deleted" instances there | 21:18 |
clarkb | interestingly it seems to be complaining that the change was committed before the commit hook so lacks the change id? | 21:18
clarkb | that implies git-review wasn't able to get the commit hook and auto apply it? | 21:18 |
clarkb | fungi: ack | 21:18 |
clarkb | but I think the gerrit install itself is functioning now | 21:20 |
fungi | yep, this is much progress. thanks! | 21:20 |
clarkb | I'm going to pop out for a bit to run that errand. https://review.opendev.org/c/opendev/system-config/+/876470/1 should be ready to go if ya'll have time to look at it | 21:25 |
fungi | melwitt: the status reset worked on the stuck deleted nodes as well, thanks again for the help! | 21:29
melwitt | \o/ | 21:29 |
melwitt | np | 21:30 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle submit requirements in normalise tool https://review.opendev.org/c/openstack/project-config/+/875992 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875993 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 22:21 |
ianw | clarkb: ^ after several attempts i think that's in its final state now. it should replicate what the upstream migration would want to do | 22:22
opendevreview | Merged opendev/system-config master: Remove gitea05-08 from configuration management https://review.opendev.org/c/opendev/system-config/+/876470 | 22:29 |
clarkb | ianw: ack I'll take a look. Is that migration in the 3.6 -> 3.7 path? I guess because submit requirements are more required now? It's just weird they waited so long for that | 22:41
ianw | clarkb: no, the change appears to only be on master. so i imagine that 3.7 -> 3.8 upgrade process will run this | 22:41 |
ianw | these changes should effectively isolate us from that. so we won't skew between what's running in gerrit and what our acls are on disk | 22:43
clarkb | weird. But also ++ to syncing up with them | 22:43 |
clarkb | yup | 22:43 |
clarkb | #status log Increased afs quotas for ubuntu-ports, debian-security, centos, and centos-stream mirrors | 22:51 |
opendevstatus | clarkb: finished logging | 22:51 |
clarkb | infra-root with https://review.opendev.org/c/opendev/system-config/+/876470 merged I'll go ahead and manually stop gitea services via docker-compose down on the four hosts and in a day or two we can delete them | 22:51 |
ianw | ++ | 22:52 |
clarkb | that's done. If we really need to turn them back on remember that they haven't been replicated to for a few days | 22:55
clarkb | But I don't expect that to come up | 22:55 |
opendevreview | Merged opendev/system-config master: mirror-update: drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 22:59 |
clarkb | ianw: fwiw I think we can make code-review normal in infra-specs if we want. But at this point maybe that's best as a followup | 23:00
ianw | clarkb: yeah, probably for infra specs, but at least for governance it seems to be a very conscious choice to allow anyone to +1/-1 for comments, but leave the merge up to the ptl | 23:02
clarkb | ya though I'm looking at https://review.opendev.org/c/openstack/project-config/+/875804/5/gerrit/acls/starlingx/governance.config and it looks like no submit requirements are needed? This isn't a regression either | 23:04 |
* clarkb pulls up a governance change to see what they actually rely on | 23:04 |
clarkb | just the workflow +1 I guess | 23:04 |
clarkb | ianw: small thing on https://review.opendev.org/c/openstack/project-config/+/875804 but let me review the rest of the stack before you push new patchsets | 23:05 |
ianw | yeah, i think the idea is that ultimately it's up to the ptl to tally and make the final choice | 23:05 |
ianw | doh, thanks, not sure how i missed that one | 23:06 |
clarkb | ianw: another followup idea. We could update our linter to only allow function = NoBlock when this is all done | 23:07
clarkb | that way we don't accidentally add any new NoOps or MaxWithBlock. We might even require function = NoBlock? | 23:08 |
ianw | yeah, i have a change sort of to do that | 23:08 |
ianw | the problem is it isn't really a linter as such, it's a transformer | 23:09
clarkb | ya | 23:09 |
clarkb | we look for a diff after normalizing it | 23:09 |
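A rough stand-alone guard along the lines clarkb suggests, separate from the normalize/transform tool (illustrative only): fail if any ACL declares a label function other than NoBlock.

```bash
if grep -rEn 'function *=' gerrit/acls/ | grep -vE 'function *= *NoBlock'; then
    echo "label functions other than NoBlock found" >&2
    exit 1
fi
```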
clarkb | ianw: I think the one I discovered in an earlier change is fixed in https://review.opendev.org/c/openstack/project-config/+/875995/ | 23:09 |
clarkb | I guess if you confirm that you can decide where you want it to be applied | 23:10 |
clarkb | (I kinda like modifying a single project all at once so that they get the potentially new behavior (which shouldn't happen) all at once for ease of debugging) | 23:10
clarkb | but ya the rest of the stack lgtm | 23:10 |
ianw | ok let me fix that in the lower change, i think that's the right place | 23:11 |
opendevreview | Merged opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 23:11 |
ianw | ^ i'll hold the lock and do the f35 removal and f37 addition as it will take a while | 23:12 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875993 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 23:13 |
ianw | clarkb: https://review.opendev.org/c/opendev/system-config/+/876237/3 and https://review.opendev.org/c/opendev/system-config/+/876236/2 are the other two that codify the all-projects change | 23:15 |
ianw | the proposed diff for all-projects is https://paste.opendev.org/show/brAj40R1mJbQZSXAXEQ5/ | 23:16 |
clarkb | ianw: note on https://review.opendev.org/c/opendev/system-config/+/876237 | 23:23 |
ianw | ahh, great point | 23:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements https://review.opendev.org/c/opendev/system-config/+/876237 | 23:25 |
clarkb | thanks that lgtm | 23:25 |
ianw | https://paste.opendev.org/show/b5flVdJsiNlLmY4lA6Yp/ is an updated diff against our all-projects | 23:27 |
ianw | i'm running the new fedora sync in a root screen on mirror-update | 23:34 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 23:41 |
clarkb | fungi: ^ Hopefully that is all that is necessary now | 23:41 |
fungi | oh, thanks! i hadn't gotten back to it yet | 23:43 |
clarkb | I had the test failures open and looked at them again. It was just a difference in the text newer gerrit returns to git-review which then returns to the user | 23:45 |
fungi | cool, that's pretty minor | 23:46 |
clarkb | and the email thing was actually non fatal. But nice to not have giant java tracebacks in the logs for failed tests | 23:46 |
fungi | yep, agreed | 23:48 |
clarkb | JayF: just a heads up that we've pencilled in April 7 (Good Friday, hopefully it is quiet as a result) for project renames | 23:48 |
clarkb | no specific time yet. We'll sort that out as we get closer to the day | 23:48 |
clarkb | heh now they are all failing on `which git-review` | 23:54 |
clarkb | how did this pass before? I didn't change anything in the git-review install path | 23:55
ianw | could it be a tox thing? | 23:55 |
clarkb | oh yup I just realized the older python versions will work I think | 23:55 |
clarkb | and it's because new tox doesn't do older python | 23:55
* clarkb looks at tox.ini | 23:56 | |
clarkb | I think it's the skip sdist step | 23:56
clarkb | that prevents install of the project into the venv so which doesn't find it | 23:56
fungi | mat also need to allowlist_externals which? | 23:57 |
fungi | s/mat/may/ | 23:57 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 23:57 |
clarkb | fungi: no, the which happens within the test suite | 23:57 |
clarkb | tox has no idea about that one | 23:57
fungi | oh, got it | 23:58 |
fungi | and the comment there was for xenial era i think which we no longer test on | 23:59