clarkb | aha using uuid fixes that | 00:00 |
---|---|---|
clarkb | ok that shows that parakeet is the one | 00:01 |
clarkb | so the trick was to do a server show using uuid | 00:01 |
fungi | nice | 00:01 |
clarkb | I ran `openstack compute service set --disable --disable-reason "Only run the mirror on this hypervisor" amazing-parakeet.local nova-compute` then started the instance via our normal user creds | 00:03 |
clarkb | (because I wanted to make sure we could still start/stop etc as a normal user in this state) | 00:03 |
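A quick way to confirm the disable stuck could look like the following; these are standard openstackclient flags and need admin credentials on that cloud:

```bash
openstack compute service list --service nova-compute --long
# the amazing-parakeet.local row should show Status "disabled" with the
# "Only run the mirror on this hypervisor" reason; the other hypervisors stay enabled
```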
clarkb | I can ssh in | 00:04 |
fungi | oh cool, so it worked with the creds on the host then for sure | 00:04 |
clarkb | yes the admin stuff did | 00:04 |
fungi | yeah, mirror lgtm again | 00:04 |
fungi | happy to approve a revert of the revert of the revert of the zero max-servers or whatever. once it deploys and usage ramps up, we should be able to tell whether or not job nodes are landing there again | 00:05 |
clarkb | fungi: does that change exist yet or should I push one? | 00:05 |
fungi | it does not exist as far as i can recall | 00:05 |
opendevreview | Clark Boylan proposed openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3""" https://review.opendev.org/c/openstack/project-config/+/876652 | 00:07 |
clarkb | last call for the meeting agenda | 00:08 |
fungi | clarkb: not strictly an addition, but i guess let's make sure to talk about project renames (though we should likely conclude to push them to after the openstack release week) | 00:09 |
fungi | picking a date would be nice though | 00:09 |
clarkb | yes and I want to be done with the gitea stuff (which is closer every day) | 00:09
clarkb | speaking of I don't know why that reminded me but nebulous too | 00:11 |
clarkb | ok agenda sent | 00:15 |
fungi | oh, yeah | 00:15 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3""" https://review.opendev.org/c/openstack/project-config/+/876652 | 00:34 |
ianw | fungi: do you happen to remember why https://review.opendev.org/c/openstack/project-config/+/786686/2/gerrit/acls/openstack/release-test.config has PTL vote from the "Continuous Integration Tools" group? | 00:34 |
ianw | it seems like only zuul is in that group, and i can't see it would be leaving that vote | 00:36 |
fungi | ianw: i think you're looking for https://review.opendev.org/705990 | 00:42 |
fungi | in essence, if a ptl leaves a +1 on an openstack/releases change, there's a zuul job with a dedicated pipeline that triggers on comment-added events to see if it's from the ptl or a release liaison from the team for the deliverable the release is proposed for, and if it is then zuul leaves a +1 vote in that column | 00:43 |
fungi | as for why it's in the release-test acl, that can probably be cleaned up, it was simply copied from the releases acl | 00:44 |
fungi | since they both shared an acl up until the change you mentioned | 00:45 |
ianw | ahhh, thanks | 01:15 |
ianw | ... is:true != is:True in these acl files :/ | 02:31 |
*** Tengu7 is now known as Tengu | 02:53 | |
*** Tengu3 is now known as Tengu | 03:07 | |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : convert NoBlock to NoOp https://review.opendev.org/c/openstack/project-config/+/875993 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 04:03 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements https://review.opendev.org/c/openstack/project-config/+/875996 | 04:03 |
*** jpena|off is now known as jpena | 08:27 | |
opendevreview | Saggi Mizrahi proposed opendev/git-review master: Add worktree support https://review.opendev.org/c/opendev/git-review/+/876725 | 12:02 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos https://review.opendev.org/c/openstack/project-config/+/876054 | 13:34 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 14:03 |
opendevreview | Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant https://review.opendev.org/c/openstack/project-config/+/876414 | 14:35 |
*** blarnath is now known as d34dh0r53 | 14:58 | |
fungi | graphs for rax-ord are still looking fairly unhealthy compared to the other regions | 15:24 |
fungi | as for inmotion, it looks like we're not booting servers there at all yet | 15:26 |
fungi | nodepool configuration was updated at 00:39z and has max-servers: 51 | 15:27 |
fungi | server list for the nodepool tenant shows 3 active servers and 3 deleted servers, all with creation dates from a week or more ago | 15:32 |
fungi | nodepool list knows about 4 of those, one of which is a held node and the others are deleting | 15:33 |
clarkb | fungi: any indication if nodepool even attempted to boot in inmotion? | 15:49 |
fungi | that's what i'm trying to figure out from the logs. i don't see it trying currently | 15:52 |
fungi | just it repeatedly trying to delete these three nodes that are stuck in deleting state | 15:52 |
fungi | looks like it's declining requests, trying to determine why | 15:55 |
clarkb | possible we've got the placement leaks again leading it to think the quota is all used up? | 15:55 |
clarkb | but ya nodepool logs should have reasonably concrete info to go off of | 15:56 |
fungi | it doesn't seem to say why it declined this request though | 15:57 |
fungi | it gives a ton of info about the request itself, but not the reason it got declined | 15:58 |
clarkb | there are a couple of paths that can take it to that point. The first is deciding it doesn't support the label (arm64 or extra large nodes etc) | 15:58 |
clarkb | another is failing three boots I think you need to look for those three failures to find why they failed | 15:59 |
fungi | this was a request for a single ubuntu-bionic node to run ansible-playbooks-tox-bandit | 15:59 |
clarkb | then ya you may need to look over the history of the request to see if there were three failing boots | 15:59 |
fungi | not finding any indication it's attempting to boot anything though | 16:00 |
fungi | and the only tracebacks are for the delete failures | 16:00 |
clarkb | hrm maybe there is another short circuit then? | 16:01 |
fungi | this is a sample from the debug log: https://paste.opendev.org/show/bkOR3P27zG4bVTD9HRaS/ | 16:02 |
clarkb | fungi: the grafana graphs still show max at 0 | 16:06 |
clarkb | I think the issue is the change to max hasn't applied | 16:08 |
clarkb | possibly the same deal with rax-ord? | 16:08 |
fungi | the config updates deployed (and i checked the files have the correct content), but yeah maybe the launchers don't notice file changes and/or any reload signal we send isn't working? | 16:12 |
clarkb | previously it just reloaded the config on every pass through the run loop but ya maybe it isn't doing that anymore? | 16:12
clarkb | we could try restarting the launcher to see if the behavior changes | 16:13
frickler | there are some other errors in the 03_03 log | 16:14 |
frickler | those might have made the statemachine get stuck, I would also suggest a restart | 16:14 |
fungi | what timestamp? | 16:14 |
frickler | oh, it even says this: 2023-03-03 18:15:14,368 DEBUG nodepool.StateMachineProvider.inmotion-iad3: Removing state machine from runner | 16:16 |
fungi | d'oh! | 16:16 |
frickler | just after some nova API error | 16:17 |
fungi | "Removing state machine from runner" seems to appear many times in all the logs though | 16:17 |
fungi | even back to the oldest retained log from 2023-02-25 | 16:18 |
fungi | 500 occurrences in launcher-debug.log.2023-02-25_05 | 16:18 |
fungi | i see it on all the launchers, not just 02 | 16:19 |
fungi | nodepool docs seem to indicate that it should be rereading its configuration automatically, yes | 16:24 |
clarkb | gitea05-08 are all out of the gerrit replication config (I just verified that the daily jobs synced us up) | 16:27 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/876470 is ready when we're happy with the new servers (this removes those 4 servers from config management) | 16:27
clarkb | after which they can be deleted | 16:28 |
clarkb | fungi: switching the grafana graphs for inmotion to a 24 hour timeframe it never has a max value >0 | 16:28 |
clarkb | is the exception frickler pointed out due to deletion not creation? | 16:28 |
fungi | yeah, there are a bunch of deletion failures with tracebacks, related to the three instances stuck in a deleting state | 16:29 |
fungi | the launcher keeps repeatedly trying to delete them | 16:30 |
clarkb | ya thats "normal" for that situation | 16:30 |
clarkb | it definitely looks like it hasn't noticed the max-servers value bumped up | 16:30 |
clarkb | maybe doing a thread dump to see if the run loop is stuck before it can reload its config then restart the service? | 16:31
fungi | oh, good call. yeah doing that now | 16:45 |
clarkb | remember you need to do a pair with a short pause in the middle otherwise the profiler remains on and things will be extra slow. Less important if you end up restarting things though | 16:47
fungi | yeah, i usually do a 60-second sleep between them | 16:47 |
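A rough sketch of that procedure, assuming the launcher process is visible from the host PID namespace and that SIGUSR2 toggles the stack dump and yappi profiler as described above:

```bash
pid=$(pgrep -f nodepool-launcher | head -1)
sudo kill -USR2 "$pid"   # first signal: dumps thread stacks and starts the profiler
sleep 60                 # give it a minute to collect profile data
sudo kill -USR2 "$pid"   # second signal: dumps stacks plus yappi stats and stops profiling
# output lands in the launcher's debug log, e.g. /var/log/nodepool/launcher-debug.log
```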
fungi | okay, dump with yappi stats is in nl02:~fungi/stack_dump.2023-03-07 | 16:53 |
clarkb | fungi: yup you can see PoolWorker.inmotion-iad3-main is blocking on a stop event | 16:58
clarkb | I suspect its been sitting there for a while | 16:58 |
clarkb | oh though its got a timeout listed | 16:59 |
clarkb | so maybe not? | 16:59 |
fungi | so the plan for nl02 is to stop the launcher container, try to clean up all the server instances in inmotion-iad3 which nodepool has no record of (i count 2 in an active state not matching any entries in nodepool list output), then start the container again? | 16:59 |
fungi | also would be good if we could identify why these other three nodes are undeletable | 17:00 |
clarkb | I think those three have been undeleteable for months. I looked at them when I did the placement cleanup and couldn't sort it out | 17:00 |
clarkb | I'm not sure you need to clean those up before starting the service again | 17:00
*** artom_ is now known as artom | 17:01 | |
clarkb | ok that run loop doesn't load new configs. I think an outer loop must do that | 17:01 |
fungi | oh, actually there are 4 nodes in an active state, three of which aren't in the nodepool list | 17:01 |
fungi | okay, it also isn't deleting those when i issue a server delete | 17:03 |
fungi | they're still in active state | 17:03 |
fungi | i wonder if oom killer might have terminated some openstack services on parakeet? | 17:05 |
clarkb | I don't think that is the issue. Those nodes are not on parakeet (only the mirror is) | 17:05 |
fungi | is there a clean way to restart everything? e.g. reboot the server? | 17:05 |
fungi | the nodes aren't on parakeet, but some of the api services are | 17:05 |
clarkb | and the setup runs docker with a force restart on everything iirc | 17:05 |
clarkb | fungi: I think you can do a docker ps -a to see if anything isn't running | 17:06 |
clarkb | I suspect this has to do with nova just losing track of things on its own | 17:06 |
clarkb | which we've already had to do cleanup for in the past. It's just that this subset is for a different reason that isn't understood yet | 17:07
frickler | likely not related, but there's also a lot of images stuck in deletion there | 17:08 |
fungi | harbor.imhadmin.net/deployment/node-agent:master and kolla/centos-binary-telegraf:victoria are in "restarting" status, harbor.imhadmin.net/deployment/fm-deploy:master is in exited status, all other containers on that server have an up status | 17:09 |
clarkb | fm-deploy is their deployment tool iirc and that is expected for it to exit once deployment is done | 17:09 |
clarkb | node-agent I can only assume is some sort of phone home service to the management system. | 17:10 |
clarkb | telegraf too. I think all the openstack services must be running then | 17:10 |
fungi | yeah, seems so | 17:10 |
fungi | leading me to wonder why `openstack server delete` of an active state instance does nothing, and the instance remains listed as active | 17:11 |
fungi | i suppose these and the stuck deleting instances could all be victims of the oom killer on parakeet | 17:11 |
fungi | mmm, nope they're spread across 3 different hostids | 17:13 |
clarkb | I think they long predate that | 17:15 |
clarkb | OOM killer is a relatively recent thing | 17:15 |
clarkb | Looking at the thread dump I'm not seeing the statemachine provider object for the inmotion cloud | 17:15 |
clarkb | ok PoolWorkers refer to the top level NodePool object to get their Provider objects. The Provider objects capture the configs and they should be replaced when configs update | 17:20
clarkb | we should log "Creating new ProviderManager object" when swapping them out /me looks | 17:22 |
clarkb | 2023-03-07 00:39:49,754 DEBUG nodepool.ProviderManager: Creating new ProviderManager object for inmotion-iad3 | 17:23 |
clarkb | so it does create the new provider manager when the new config shows up but then doesn't seem to reflect the new max-servers value? | 17:24
fungi | or else there's something else causing it to summarily decline requests, since it doesn't actually say why it declined them | 17:25 |
fungi | oh, though statsd is still reporting max:0 | 17:26 |
fungi | which seems to support your theory | 17:26 |
clarkb | it never stopped | 17:26 |
clarkb | grep "DEBUG nodepool.StateMachineProvider.inmotion-iad3: Stopp" /var/log/nodepool/launcher-debug.log.2023-03-0* | 17:26 |
*** jpena is now known as jpena|off | 17:26 | |
clarkb | that shows that on the third, when we last turned it off, the provider goes from stopping to stopped. But this time around it never does the transition | 17:26
clarkb | it waits for all launchers and deleters to stop | 17:27 |
clarkb | perhaps the deleters were stuck? | 17:27 |
* clarkb looks for this thread in the dump | 17:27 | |
clarkb | Thread: 139823599843072 NodePool d: False <- that's where the apparently stuck stop() call is sitting | 17:29
clarkb | fungi: so ya I believe a restart will fix this. I don't know where it is getting stuck yet | 17:31 |
fungi | on nl01 for comparison: | 17:31 |
fungi | 2023-03-06 17:43:29,343 DEBUG nodepool.StateMachineProvider.rax-ord: Stopping | 17:31 |
fungi | 2023-03-07 04:41:03,927 DEBUG nodepool.StateMachineProvider.rax-ord: Stopped | 17:31 |
fungi | that took 11 hours | 17:32 |
fungi | longer than i would have anticipated | 17:35 |
clarkb | ya based on the tracebacks it appears to be waiting on a launcher or deleter. However looking at the logs I see it running then reporting it ran 3 statemachines | 17:35 |
clarkb | oh! but the state machines have to report they have completed for them to be removed | 17:35 |
clarkb | corvus: ^ fyi I think we've found an issue with openstack + statemachine related to clouds being unhappy that results in us holding onto old configs for far too long | 17:36
clarkb | I suspect these three statemachines are never going into a completed state so we're never reloading our config | 17:36 |
clarkb | fungi: and you said there are three nodes in a deleting state? | 17:37 |
clarkb | it's curious that we were able to restart in this situation when we set max-servers to zero but not now | 17:38
fungi | well, there are three nodepool knows about that are in a deleting state and it keeps trying to delete them | 17:38 |
fungi | there are also three it doesn't seem to be tracking any longer which are stuck in an active state and won't transition to deleting in nova | 17:38 |
corvus | clarkb: https://review.opendev.org/875250 | 17:40 |
clarkb | aha thanks | 17:42 |
clarkb | fungi: ^ I think we can restart the service and review that change | 17:42 |
corvus | fungi: nodepool should continue to try to delete leaked nodes (assuming they have the correct metadata) without tracking them; so it should continue to try to delete them even if it doesn't have a zk record (probably worth double checking the logs to make sure that happens) | 17:43 |
fungi | here's the comparison of what nova and nodepool see for inmotion-iad3 btw: https://paste.opendev.org/show/bUgMuhc2IbvlFPNFWb0o/ | 17:44 |
fungi | and yeah, 875250 looks like what we're seeing on nl02 | 17:45 |
fungi | okay, downing and upping the launcher container on nl02 | 17:45 |
corvus | fungi: that paste looks good to me -- is there anything wrong with that? | 17:46 |
fungi | corvus: nothing wrong on the nodepool side, problems on the nova side | 17:46 |
corvus | fungi: ack. makes sense | 17:46 |
clarkb | corvus: what is weird to me looking at the log on nl02 is that we keep reporting that 3 tasks are running and then 3 tasks ran. It isn't clear to me why we never transition to completed and remove those three machines. Maybe because the deletions don't ultimately succeed? | 17:47
fungi | several instances stuck in delete for weeks which nodepool is no longer tracking, several more active which can't be deleted for some unknown reason (but which nodepool is repeatedly trying) | 17:47 |
corvus | clarkb: hrm | 17:47 |
fungi | clarkb: yeah, nova is still reporting the instances as "active" | 17:48 |
fungi | i had it backwards originally, but match up the uuids in my paste and you'll see | 17:48 |
clarkb | fungi: right and if you manually try to delete them it doesn't help. It's also a nova problem, but one that may be causing those tasks to hang around forever? | 17:49
fungi | the four nodes which nodepool knows about are in active state according to nova, one is for an autohold but the other three are needing to be deleted | 17:49 |
clarkb | I think ideally we'd want it to try and give up if being stopped and let the new provider manager handle it | 17:49 |
fungi | and nova is silently ignoring the delete attempts as far as i can see | 17:49 |
clarkb | with corvus' proposed change I think we may end up with both the old and new provider manager running threads attempting to delete the resources (not the end of the world) | 17:49
fungi | okay, so ready for a down/up of the launcher container on nl02? | 17:50 |
clarkb | I am but maybe corvus wants to look at it first? | 17:50 |
fungi | sure, i can wait for a bit | 17:50 |
corvus | clarkb: i think (haven't totally confirmed this) that we're seeing the behavior described in the commit message, but with deleting state machines -- perhaps it is removing the delete state machines after the delete timeout hits, but then the periodic leaked node cleanup is adding new ones | 17:51
corvus | fungi, clarkb: you're clear to restart; i think any further questions can prob be handled in logs | 17:51
clarkb | corvus: oh! that would be a fun interaction | 17:51 |
fungi | got it, proceeding | 17:51 |
corvus | clarkb: so anyway, yeah, that change should handle that by causing the new delete-state-machines to go to the new provider, and there will be 2 for a brief time (<5min) and then the old one finally disappears | 17:51 |
clarkb | corvus: yup | 17:52 |
clarkb | fungi: I was wrong df55493d-bae7-47b2-9069-1e488c09a2fd is only a few days old. | 17:52 |
clarkb | fungi: its task state is in deleting but it isn't actively deleting. I think we ran into this before | 17:52
fungi | #status log Restarted the nodepool-launcher container on nl02 in order to force a config reload, as a workaround until https://review.opendev.org/875250 is deployed | 17:52 |
opendevstatus | fungi: finished logging | 17:52 |
clarkb | fungi: basically nova sets the task state to deleting then fires off an event for that. If that event gets lost, it won't let you set the state to deleting by issuing another normal delete (which would refire the task) because it is already deleting | 17:53
clarkb | fungi: iirc melwitt linked to some process to double check things then you manually override the state back to acitve and issue another delete | 17:53 |
corvus | clarkb: https://paste.opendev.org/show/bh9N99Gx60aXEYwM7bCC/ here's an edited series of log messages that i think shows that sequence | 17:53 |
clarkb | I forget what that process was but we can probably just yolo | 17:53 |
clarkb | corvus: ++ | 17:54 |
fungi | clarkb: but df55493d-bae7-47b2-9069-1e488c09a2fd is stuck in "active" not "deleted" | 17:54 |
fungi | at least according to nova | 17:54 |
clarkb | fungi: its task_state is "deleting" | 17:54 |
clarkb | and if nova loses the event to actually make that happen you get stuck in limbo here | 17:55
clarkb | the fix is to reset task state back to active then reissue the delete command | 17:55 |
corvus | is that a user-level action or cloud-admin-level? | 17:55 |
clarkb | I found a purple link df55493d-bae7-47b2-9069-1e488c09a2fd | 17:55 |
clarkb | corvus: admin level | 17:55 |
corvus | that=reset task state | 17:55 |
corvus | ack | 17:55 |
fungi | oh, yep okay. OS-EXT-STS:task_state=deleting OS-EXT-STS:vm_state=active | 17:56 |
fungi | and server list reports the latter | 17:56 |
fungi | thus my confusion | 17:56 |
clarkb | and that server isn't on parakeet so unlikely related to OOMs | 17:57 |
fungi | right, i compared the hostids reported for these and they're scattered around the cluster | 17:57 |
clarkb | aha I found my notes | 17:57 |
clarkb | `openstack server set --state active $UUID` | 17:57 |
clarkb | then you can issue a normal delete | 17:58 |
clarkb | from my notes it looks like maybe only the orphaned allocations needed double checking but not the task state being stuck | 17:58
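A minimal sketch of that recovery sequence, using one of the stuck instances under discussion; the creds file and admin-only commands are the ones mentioned elsewhere in this log:

```bash
source /etc/kolla/admin-openrc.sh   # admin creds on the kolla host
UUID=df55493d-bae7-47b2-9069-1e488c09a2fd
# confirm it really is wedged: vm_state active but task_state stuck in deleting
openstack server show "$UUID" -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
# reset the state (admin only), then reissue a normal delete
openstack server set --state active "$UUID"
openstack server delete "$UUID"
```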
clarkb | fungi: do you want to issue that or should I? | 17:58 |
clarkb | I suspect that nodepool will then automatically clean them up for us | 17:58 |
fungi | i'll give it a go | 17:59 |
corvus | this is the cloud we have admin on? should we make a cron script that looks for active/deleting nodes and then issue that command? i sort of suspect that major cloud operators may do that -- based on the fact this seems to happen occasionally everywhere and then eventually they get cleaned up... | 17:59 |
clarkb | I would server show them as admin first to make sure they have the task_state in deleting | 17:59
fungi | clarkb: where was the clouds.yaml on the servers there? | 17:59 |
clarkb | corvus: yes and not a bad idea. I think we'd want some delay | 18:00 |
clarkb | fungi: source /etc/kolla/admin-openrc.sh | 18:00 |
fungi | aha, thanks, i'll source that | 18:00 |
clarkb | corvus: because there is a period of time where that set of states is valid. But if it goes for more than say an hour then we're probably good to clean it up | 18:00 |
fungi | clarkb: is that on all the servers? which one did you find it on? not seeing it on .230 | 18:01 |
clarkb | fungi: I'm on 229. But I thought it was deployed to all of them | 18:01 |
* clarkb looks | 18:01 | |
fungi | it's on .229 yeah, thanks | 18:01 |
clarkb | huh ya I guess they only deploy that to the first server | 18:02 |
fungi | and what's the path to openstackclient? | 18:02 |
corvus | clarkb: yep. is there a timestamp for the task_state? or would we just need to have persistence to detect it? | 18:02 |
fungi | # openstack server list | 18:02 |
fungi | -bash: openstack: command not found | 18:02 |
clarkb | fungi: /opt/kolla-ansible/victoria/ it's in that venv. You can activate it or just use the bin path | 18:02
corvus | (that might be worth looking at real quick before you reset these states :) | 18:02 |
fungi | aha, okay | 18:02 |
clarkb | corvus: it looks like the updated field is getting updated frequently. I did a show earlier and got a timestamp from a few minutes prior and a show now shows updated a few minutes prior to now | 18:03 |
clarkb | corvus: we probably need to keep track ourselves. Anything that is in that state goes into a list and if it is still in that state an hour later we update it | 18:04 |
fungi | and as admin, it's server list --all-projects otherwise you get a blank list due to it checking the admin tenant i guess | 18:04
corvus | clarkb: ack, good to know | 18:04 |
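A very rough sketch of what that periodic check might look like, with a crude persistence file for the one-hour delay; paths, column names, and the threshold are illustrative only, not a tested implementation:

```bash
#!/bin/bash
STATE=/var/lib/stuck-deletes.txt
touch "$STATE"
now=$(date +%s)
# as admin, find every instance whose task_state is wedged in "deleting"
openstack server list --all-projects --long -f value -c ID -c "Task State" |
awk '$2 == "deleting" {print $1}' |
while read -r uuid; do
    first=$(awk -v u="$uuid" '$1 == u {print $2}' "$STATE")
    if [ -z "$first" ]; then
        echo "$uuid $now" >> "$STATE"            # first sighting: start the clock
    elif [ $((now - first)) -gt 3600 ]; then     # still deleting an hour later
        openstack server set --state active "$uuid"
        openstack server delete "$uuid"
    fi
done
```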
fungi | OS-EXT-STS:task_state=None after i did that to d52712f4-80a6-4f9a-979c-3031972cd89f, so now to see if the launcher successfully deletes it | 18:06 |
clarkb | fungi: I see task state is deleting already | 18:07 |
clarkb | but it hasn't gone away | 18:07 |
clarkb | if it continues to not go away we may have to check compute logs on the host | 18:08 |
fungi | it was already in a deleting state from nodepool's perspective, i think it takes a little while to get back to retrying to delete it though | 18:08 |
clarkb | fungi: no I mean in nova | 18:09 |
clarkb | task_state is deleting | 18:09 |
clarkb | oh wait you did a different host | 18:09 |
clarkb | sorry | 18:09 |
fungi | yeah, i've only done d52712f4-80a6-4f9a-979c-3031972cd89f so far | 18:09 |
fungi | the first one in the nodepool list output | 18:10 |
fungi | if memory serves, it may be as much as 10 minutes before the launcher tries to delete things again | 18:10 |
clarkb | it looks like it is trying now | 18:11 |
fungi | the restart did get the config reread though, and looks like we're booting new nodes there again | 18:11 |
clarkb | the nodepool logs indicate "State machine for 0033156866 at deleting server" | 18:11 |
clarkb | but server show doesn't indicate a state change yet | 18:11 |
clarkb | and its gone | 18:13 |
fungi | yay! | 18:13 |
fungi | i'll reset the other two offenders | 18:13 |
fungi | now as for the ones nova indicates have been in a "deleted" state since last month, is there a chance the same state reset will work for them as well? | 18:14 |
clarkb | For that we may need to talk to nova. I'd worry that will cause nova to try and do things that can't happen since the nodes are actually gone | 18:15 |
clarkb | and then create more problems | 18:15 |
* clarkb reviews the nodepool fix now | 18:15 | |
fungi | yeah, server show doesn't indicate any related errors for them | 18:15 |
fungi | anyway, i'll switch to revisiting rax-ord | 18:16 |
fungi | still seeing "Timeout waiting for instance creation" | 18:20 |
clarkb | fungi: so that timeout is actually the one waiting for the instance to go active | 18:22 |
clarkb | which was already at 10 minutes | 18:22 |
clarkb | boot timeout is once active how long to wait for ssh to be up | 18:22 |
clarkb | so maybe we need to increase the launch timeout | 18:23 |
clarkb | re auto updating task_state we should also resync with openmetal and see if they are deploying more up to date openstack clouds now. Then redeploy openstack and get bugfixes | 18:24 |
clarkb | pretty sure melwitt said this issue should be addressed in like yoga? | 18:24 |
fungi | also plenty of "Timeout waiting for instance deletion" | 18:26 |
clarkb | ya the general slowness is why I suggested dialing back max-servers there to see if we were creating the noise | 18:27 |
fungi | definitely looks like the servers nodepool is giving up on are in an active state in nova, just spot-checking again at random | 18:29 |
clarkb | so they do eventually go active? that would imply our timeout is simply too short? | 18:30 |
fungi | looking at node 0033394230, journalctl reports sshd was fully started at 18:08:24, while the launcher log says it gave up waiting at 18:08:59 | 18:33 |
clarkb | fungi: that timeout is nodepool waiting for openstack to report the node is active though | 18:35 |
fungi | yeah, i just wanted to find something to compare against | 18:35 |
fungi | maybe nova is lagging reporting the status change in the server list | 18:35 |
clarkb | could be | 18:36 |
fungi | nodepool log says it started building the node at 17:52:30 and the state machine went to "submit creating server" at 17:59:05, that's also a bit of a lag? | 18:36 |
clarkb | fungi: that lag is possibly due to the rate limiter if we have a lot of requests to process in that cloud | 18:37
clarkb | this might be another good reason to drop max-servers as it should provide a cleaner picture of what is going on in the logs | 18:37 |
fungi | yeah, i'll push a change up to ~halve it for now | 18:37 |
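For reference, the per-node timeline being compared above can be pulled out of the launcher debug log with something like this; the node id is the one from the discussion and the match patterns are approximate:

```bash
grep 0033394230 /var/log/nodepool/launcher-debug.log | \
    grep -E "creating server|Timeout waiting|launch"
```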
opendevreview | Jeremy Stanley proposed openstack/project-config master: Try halving max-servers for rax-ord region https://review.opendev.org/c/openstack/project-config/+/876775 | 18:41 |
clarkb | fungi: note that we should've increased launch-timeout not boot timeout. I should have checked more closely yesterday | 18:45 |
clarkb | but I think we can do that increase next | 18:45 |
fungi | yeah, it was unclear to me that "launch" is the server create time and "boot" is the ssh access time | 18:52 |
opendevreview | Merged openstack/project-config master: Try halving max-servers for rax-ord region https://review.opendev.org/c/openstack/project-config/+/876775 | 19:11 |
fungi | checking up on the inmotion-iad3 environment now that things should have reached a steady state there, the host where the mirror vm is running is the only one with a single qemu-kvm process, the others have two or more, suggesting that the maintenance disable for nova is working | 19:17 |
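That spot check can be scripted along these lines; the hypervisor hostnames are placeholders:

```bash
for h in hv1 hv2 hv3; do
    printf '%s: ' "$h"
    ssh "$h" pgrep -c -f qemu-kvm   # running instance count per hypervisor
done
```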
fungi | looks like we've been setting auth = DEVELOPMENT_BECOME_ANY_ACCOUNT in the git-review tests since they were originally introduced in 2013 | 19:40 |
clarkb | fungi: ya so zuul quickstart sets that then logs in to create a first admin account. Then it configures the admin account and everything else from there | 19:42 |
clarkb | or something along those lines | 19:43
fungi | i've rechecked the change so i can remember what errors it was giving | 19:44 |
clarkb | looks like the upstream gerrit docker image bakes in gerrit admin:secret | 19:47 |
clarkb | maybe that is the default for dev mode? | 19:48 |
clarkb | looks like our system-config testing does the same thing somehow | 19:48 |
fungi | was also running into https://bugs.chromium.org/p/gerrit/issues/detail?id=16215 | 19:48 |
ianw | hrm, now i've found -> https://gerrit-review.googlesource.com/c/gerrit/+/339542/14/java/com/google/gerrit/server/schema/MigrateLabelFunctionsToSubmitRequirement.java | 19:49 |
ianw | for NoOp/NoBlock, that translates things to "NoBlock" (i've chose NoOp) and adds a s-r with "applicableIf: is:false" | 19:50 |
clarkb | I guess NoBlock is canonical? I'm fine with switching to it then | 19:51 |
clarkb | The applicableIf is weird. Basically says this submit requirement exists but does nothing | 19:51
clarkb | even less than it exists and always allows things to go through :) | 19:51 |
ianw | right, that makes it a "trigger" vote | 19:51 |
ianw | my attempt was doing "submittableIf = is:false" | 19:51 |
ianw | that makes it weird in the UI | 19:51 |
ianw | i think i am re-writing the change stack, again :) | 19:52 |
clarkb | looks like admin:secret is the default for local dev mode | 19:59 |
clarkb | fungi: ^ so ya you should be able to start in dev mode and then use that account | 19:59 |
clarkb | fungi: re the ssh issue I remember looking into that and wondering why it never seemed to affect us | 20:00
clarkb | but I would look at updating your git-review change to gerrit 3.6 maybe | 20:00
clarkb | since that has the updated mina that can handle ssh + sha2 | 20:00 |
fungi | well, i think some of the challenge was also staying compatible with the ssh and python on ubuntu bionic | 20:01 |
fungi | so latest gerrit we can run and test on a bionic node | 20:02 |
fungi | https://zuul.opendev.org/t/opendev/build/547c54234e6e400cbbdece0badaa7ac9 is a fresh failure for it | 20:02 |
ianw | what's the context here? switching the All-Projects testing? | 20:03 |
clarkb | I don't think anything should prevent gerrit 3.6 from running on bionic | 20:03 |
clarkb | ianw: git-review functional testing runs gerrit | 20:03
clarkb | but it's currently set up to run really old gerrit | 20:03
fungi | yeah, okay so it's the ssh key uploads it was stuck on | 20:04 |
fungi | RuntimeError: SSH key upload failed: <Response [401]> "Unauthorized" | 20:04 |
fungi | i want to say where i got to was determining that in newer gerrits, the DEVELOPMENT_BECOME_ANY_ACCOUNT mode didn't work for uploading ssh keys via the rest api any more | 20:07 |
fungi | oh, this may be the digest vs plain auth problem | 20:07 |
fungi | right now we're posting to the api with auth=requests.auth.HTTPDigestAuth('admin', 'secret') | 20:08 |
Clark[m] | It definitely works because both zuul and system-config use the rest API to set keys | 20:08 |
fungi | that needs to be plain in more recent gerrit, right? | 20:08 |
fungi | HTTPBasicAuth | 20:10 |
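A hedged shell equivalent of what the test harness is doing, switched to basic auth; the port and key path are placeholders, and admin:secret is the dev-mode default mentioned above:

```bash
curl -u admin:secret -X POST \
     -H "Content-Type: text/plain" \
     --data-binary @"$HOME/.ssh/id_rsa.pub" \
     http://localhost:8080/a/accounts/self/sshkeys
```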
opendevreview | Jeremy Stanley proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 20:12 |
fungi | guess we'll see | 20:12 |
clarkb | fungi: if you decide to stick to 3.4 you probably want to bump the version to 3.4.8 at least | 20:41 |
clarkb | looks like the latest error has to do with sending email saying the ssh keys were added | 20:43
clarkb | I think you can just disable email | 20:43 |
clarkb | ya sendemail.enable set to false disables all email sending | 20:44 |
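One way to flip that in a test site's config before starting gerrit; the site path is a placeholder for wherever the test harness initializes gerrit:

```bash
git config -f /path/to/gerrit-site/etc/gerrit.config sendemail.enable false
```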
fungi | well, to go past 3.4.4 we'll also need to implement a workaround for the unfixed bug report i linked, i think? | 20:44 |
fungi | or is that fixed and the devs just never closed the bug? | 20:44 |
fungi | basically openssh client on ubuntu bionic not working with the newer key exchange defaults on 3.4.5, seems like | 20:45 |
clarkb | oh it looks like 3.4.5 updated mina to 2.7.0 from 2.6.0 ya I don't think they updated beyond that | 20:47 |
fungi | so we'd need to do gerrit 3.5.x i guess to get past it? | 20:47 |
fungi | basically they introduced a regression in a 3.4 point release and then never bothered to fix it before dropping support for that series | 20:48 |
clarkb | well if mina 2.7.0 is the problem then I guess no newer gerrit would work | 20:48 |
fungi | it looks like https://issues.apache.org/jira/browse/SSHD-1163 is supposed to be fixed though | 20:49 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 20:49 |
fungi | supposedly fixed in mina-sshd 2.8.0 | 20:50 |
clarkb | in that case 3.5 or newer should work | 20:50 |
clarkb | since 2.8.0 got backported to 3.5 | 20:50 |
clarkb | I pushed a patch to disable email | 20:50 |
fungi | but yeah, first i wanted to see this work before rolling versions forward even more | 20:50 |
fungi | thanks! | 20:50 |
fungi | also there's some benefit to testing the oldest version of gerrit necessary for the features being added, since it helps avoid breaking git-review for users of somewhat older gerrit deployments (though at the expense of not catching bugs/regressions with newer gerrit releases) | 20:51 |
clarkb | yup | 20:52 |
clarkb | But part of the issue here is we stuck to old versions which grew stale, and that makes updates more difficult | 20:52
clarkb | I suspect that keeping up with the latest stable release would minimize deltas we have to sort out | 20:52 |
melwitt | clarkb: the doc you might be thinking of regarding task_state deleting might be this https://docs.openstack.org/nova/latest/admin/support-compute.html#reset-the-state-of-an-instance | 20:58 |
clarkb | melwitt: yup thanks. For servers that say they are deleted but haven't been removed is there something to look at? we were wondering if we should set them active and try to delete them similar to ^ | 20:59
clarkb | but worried that since they are actually deleted that might cause more problems | 20:59 |
melwitt | clarkb: do you mean that they are gone from the server list/server show but still have libvirt guests running? | 21:01 |
clarkb | melwitt: no they show up in server list as status DELETED and OS-EXT-STS:vm_state | deleted | 21:02 |
melwitt | fwiw I don't think it would hurt to reset state and try to delete them again. I'm not sure if it would help either | 21:03 |
clarkb | we want them to not show up in listings any longer (because nodepool thinks it leaked them and wants to delete them) | 21:03 |
clarkb | ok thats helpful if only because it gives us something new to try | 21:03 |
melwitt | I'm not sure I've seen that before where they are listed showing as DELETED when not using nova list --deleted | 21:06 |
fungi | it's showing up when we `openstack server list` as a normal account in the tenant (don't need to be an admin) | 21:07 |
clarkb | power state is NOSTATE and task_state is none | 21:07 |
clarkb | I think it may have actually deleted the instance everywhere but in the db | 21:07 |
melwitt | the instance record gets deleted on the compute host if that nova-compute service is "up". it's possible something failed before it got to the instance record deletion, that's pretty much the last step | 21:08 |
fungi | melwitt: looks like this to a normal user: https://paste.opendev.org/show/bRRe78SmoX2cls5Lovkl/ | 21:09 |
melwitt | it would have failed somewhere after the save() call here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3254-L3267 | 21:10 |
clarkb | ah ok so we could go hunting for logs maybe depending on how catastrophic the failure was | 21:10
melwitt | which apparently the main thing that happens there is deleting the allocation in the placement service | 21:10 |
fungi | which i guess eventually leads placement to thinking it can't add more servers if we accumulate too many of those? | 21:11 |
melwitt | so yeah, you would want to look at nova-compute.log at the time of the deletion. what might have happened is the allocation deletion failed for some reason. in the past a DELETE in placement was not "forced" and could fail for a 409 conflict | 21:11 |
melwitt | which some of us thought was wrong and we changed it to a "force" sometime semi recently. maybe a couple of years ago. not sure what version yall are using | 21:12 |
clarkb | this is victoria aiui | 21:13 |
clarkb | and ya I think one of the things we should do after server upgrades and gerrit 3.7 and mailman3 is redeploy this if they've got a modern version supported in their tool (they must) | 21:13 |
melwitt | ok, so this landed in the xena release https://review.opendev.org/c/openstack/nova/+/688802 but I'm making assumptions that the placement allocation delete failed. if it did, I think resetting the state and trying the delete again might work | 21:15 |
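Checking the compute log for a failed allocation delete could look roughly like this; the path is kolla-ansible's usual log location and the uuid is a placeholder:

```bash
# run on the hypervisor that hosted the instance
UUID=<instance-uuid>
sudo grep "$UUID" /var/log/kolla/nova/nova-compute.log | grep -iE "delete|allocation|placement"
```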
clarkb | fungi: my patch fixed the email thing. Now it's complaining about output from git-review based on responses from the server | 21:15
clarkb | fungi: so I think at this point it's just a matter of updating the tests to match modern gerrit behaviors | 21:15
fungi | melwitt: worth a shot, thanks for the deep dive! | 21:15 |
fungi | i reset the state on the three perpetually "deleted" instances there | 21:18 |
clarkb | interestingly it seems to be complaining that the change was committed before the commit hook so lacks the change id? | 21:18
clarkb | that implies git-review wasn't able to get the commit hook and auto apply it? | 21:18 |
clarkb | fungi: ack | 21:18 |
clarkb | but I think the gerrit install itself is functioning now | 21:20 |
fungi | yep, this is much progress. thanks! | 21:20 |
clarkb | I'm going to pop out for a bit to run that errand. https://review.opendev.org/c/opendev/system-config/+/876470/1 should be ready to go if ya'll have time to look at it | 21:25 |
fungi | melwitt: the status reset worked on the stuck deleted nodes as well, thanks again for the help! | 21:29
melwitt | \o/ | 21:29 |
melwitt | np | 21:30 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle submit requirements in normalise tool https://review.opendev.org/c/openstack/project-config/+/875992 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875993 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 22:21 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 22:21 |
ianw | clarkb: ^ after several attempts i think that's in its final state now. it should replicate what the upstream migration would want to do | 22:22
opendevreview | Merged opendev/system-config master: Remove gitea05-08 from configuration management https://review.opendev.org/c/opendev/system-config/+/876470 | 22:29 |
clarkb | ianw: ack I'll take a look. Is that migration in the 3.6 -> 3.7 path? I guess because submit requirements are more required now? It's just weird they waited so long for that | 22:41
ianw | clarkb: no, the change appears to only be on master. so i imagine that 3.7 -> 3.8 upgrade process will run this | 22:41 |
ianw | these changes should effectively isolate us from that. so we won't skew between what's running in gerrit and what our acls are on disk | 22:43
clarkb | weird. But also ++ to syncing up with them | 22:43 |
clarkb | yup | 22:43 |
clarkb | #status log Increased afs quotas for ubuntu-ports, debian-security, centos, and centos-stream mirrors | 22:51 |
opendevstatus | clarkb: finished logging | 22:51 |
clarkb | infra-root with https://review.opendev.org/c/opendev/system-config/+/876470 merged I'll go ahead and manually stop gitea services via docker-compose down on the four hosts and in a day or two we can delete them | 22:51 |
ianw | ++ | 22:52 |
clarkb | that's done. If we really need to turn them back on remember that they haven't been replicated to for a few days | 22:55
clarkb | But I don't expect that to come up | 22:55 |
opendevreview | Merged opendev/system-config master: mirror-update: drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 22:59 |
clarkb | ianw: fwiw I think we can make code-review normal in infra-specs if we want. But at this point maybe that's best as a followup | 23:00
ianw | clarkb: yeah, probably for infra specs, but at least for governance it seems to be a very conscious choice to allow anyone to +1/-1 for comments, but leave the merge up to the ptl | 23:02
clarkb | ya though I'm looking at https://review.opendev.org/c/openstack/project-config/+/875804/5/gerrit/acls/starlingx/governance.config and it looks like no submit requirements are needed? This isn't a regression either | 23:04 |
* clarkb pulls up a governance change to see what they actually rely on | 23:04 |
clarkb | just the workflow +1 I guess | 23:04 |
clarkb | ianw: small thing on https://review.opendev.org/c/openstack/project-config/+/875804 but let me review the rest of the stack before you push new patchsets | 23:05 |
ianw | yeah, i think the idea is that ultimately it's up to the ptl to tally and make the final choice | 23:05 |
ianw | doh, thanks, not sure how i missed that one | 23:06 |
clarkb | ianw: another followup idea. We could update our linter to only allow function = NoBlock when this is all done | 23:07
clarkb | that way we don't accidentally add any new NoOps or MaxWithBlock. We might even require function = NoBlock? | 23:08 |
ianw | yeah, i have a change sort of to do that | 23:08 |
ianw | the problem is it isn't really a linter as such, it's a transformer | 23:09
clarkb | ya | 23:09 |
clarkb | we look for a diff after normalizing it | 23:09 |
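A rough stand-alone guard along the lines clarkb suggests, separate from the normalize/transform tool (illustrative only): fail if any ACL declares a label function other than NoBlock.

```bash
if grep -rEn 'function *=' gerrit/acls/ | grep -vE 'function *= *NoBlock'; then
    echo "label functions other than NoBlock found" >&2
    exit 1
fi
```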
clarkb | ianw: I think the one I discovered in an earlier change is fixed in https://review.opendev.org/c/openstack/project-config/+/875995/ | 23:09 |
clarkb | I guess if you confirm that you can decide where you want it to be applied | 23:10 |
clarkb | (I kinda like modifying a single project all at once so that they get the potentially new behavior (which shouldn't happen) all at once for ease of debugging) | 23:10
clarkb | but ya the rest of the stack lgtm | 23:10 |
ianw | ok let me fix that in the lower change, i think that's the right place | 23:11 |
opendevreview | Merged opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 23:11 |
ianw | ^ i'll hold the lock and do the f35 removal and f37 addition as it will take a while | 23:12 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function https://review.opendev.org/c/openstack/project-config/+/875804 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875993 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple = https://review.opendev.org/c/openstack/project-config/+/875994 | 23:13 |
opendevreview | Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements https://review.opendev.org/c/openstack/project-config/+/875995 | 23:13 |
ianw | clarkb: https://review.opendev.org/c/opendev/system-config/+/876237/3 and https://review.opendev.org/c/opendev/system-config/+/876236/2 are the other two that codify the all-projects change | 23:15 |
ianw | the proposed diff for all-projects is https://paste.opendev.org/show/brAj40R1mJbQZSXAXEQ5/ | 23:16 |
clarkb | ianw: note on https://review.opendev.org/c/opendev/system-config/+/876237 | 23:23 |
ianw | ahh, great point | 23:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements https://review.opendev.org/c/opendev/system-config/+/876237 | 23:25 |
clarkb | thanks that lgtm | 23:25 |
ianw | https://paste.opendev.org/show/b5flVdJsiNlLmY4lA6Yp/ is an updated diff against our all-projects | 23:27 |
ianw | i'm running the new fedora sync in a root screen on mirror-update | 23:34 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 23:41 |
clarkb | fungi: ^ Hopefully that is all that is necessary now | 23:41 |
fungi | oh, thanks! i hadn't gotten back to it yet | 23:43 |
clarkb | I had the test failures open and looked at them again. It was just a difference in the text newer gerrit returns to git-review which then returns to the user | 23:45 |
fungi | cool, that's pretty minor | 23:46 |
clarkb | and the email thing was actually non fatal. But nice to not have giant java tracebacks in the logs for failed tests | 23:46 |
fungi | yep, agreed | 23:48 |
clarkb | JayF: just a heads up that we've pencilled in April 7 (Good Friday, hopefully it is quiet as a result) for project renames | 23:48 |
clarkb | no specific time yet. We'll sort that out as we get closer to the day | 23:48 |
clarkb | heh now they are all failing on `which git-review` | 23:54 |
clarkb | how did this pass before? I didn't change anything in the git-review install path | 23:55
ianw | could it be a tox thing? | 23:55 |
clarkb | oh yup I just realized the older python versions will work I think | 23:55 |
clarkb | and it's because new tox doesn't do older python | 23:55
* clarkb looks at tox.ini | 23:56 | |
clarkb | I think it's the skip sdist step | 23:56
clarkb | that prevents install of the project into the venv so which doesn't find it | 23:56
fungi | mat also need to allowlist_externals which? | 23:57 |
fungi | s/mat/may/ | 23:57 |
opendevreview | Clark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4 https://review.opendev.org/c/opendev/git-review/+/849419 | 23:57 |
clarkb | fungi: no, the which happens within the test suite | 23:57 |
clarkb | tox has no idea about that one | 23:57
fungi | oh, got it | 23:58 |
fungi | and the comment there was for xenial era i think which we no longer test on | 23:59