Tuesday, 2023-03-07

clarkbaha using uuid fixes that00:00
clarkbok that shows that parakeet is the one00:01
clarkbso the trick was to do a server show using uuid00:01
clarkbI ran `openstack compute service set --disable --disable-reason "Only run the mirror on this hypervisor" amazing-parakeet.local nova-compute` then started the instance via our normal user creds00:03
clarkb(because I wanted to make sure we could still start/stop etc as a normal user in this state)00:03
clarkbI can ssh in00:04
fungioh cool, so it worked with the creds on the host then for sure00:04
clarkbyes the admin stuff did00:04
fungiyeah, mirror lgtm again00:04
fungihappy to approve a revert of the revert of the revert of the zero max-servers or whatever. once it deploys and usage ramps up, we should be able to tell whether or not job nodes are landing there again00:05
clarkbfungi: does that change exist yet or should I push one?00:05
fungiit does not exist as far as i can recall00:05
opendevreviewClark Boylan proposed openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3"""  https://review.opendev.org/c/openstack/project-config/+/87665200:07
clarkblast call for the meeting agenda00:08
fungiclarkb: not strictly an addition, but i guess let's make sure to talk about project renames (though we should likely conclude to push them to after the openstack release week)00:09
fungipicking a date would be nice though00:09
clarkbyes and I want to be done with the gitea stuff (which is closer every day)00:09
clarkbspeaking of I don't know why that reminded me but nebulous too00:11
clarkbok agenda sent00:15
fungioh, yeah00:15
opendevreviewMerged openstack/project-config master: Revert "Revert "Revert "Temporarily stop booting nodes in inmotion iad3"""  https://review.opendev.org/c/openstack/project-config/+/87665200:34
ianwfungi: do you happen to remember why https://review.opendev.org/c/openstack/project-config/+/786686/2/gerrit/acls/openstack/release-test.config has PTL vote from the "Continuous Integration Tools" group?00:34
ianwit seems like only zuul is in that group, and i can't see it would be leaving that vote00:36
fungiianw: i think you're looking for https://review.opendev.org/70599000:42
fungiin essence, if a ptl leaves a +1 on an openstack/releases change, there's a zuul job with a dedicated pipeline that triggers on comment-added events to see if it's from the ptl or a release liaison from the team for the deliverable the release is proposed for, and if it is then zuul leaves a +1 vote in that column00:43
fungias for why it's in the release-test acl, that can probably be cleaned up, it was simply copied from the releases acl00:44
fungisince they both shared an acl up until the change you mentioned00:45
ianwahhh, thanks01:15
ianw... is:true != is:True in these acl files :/02:31
*** Tengu7 is now known as Tengu02:53
*** Tengu3 is now known as Tengu03:07
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : convert NoBlock to NoOp  https://review.opendev.org/c/openstack/project-config/+/87599304:03
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple =  https://review.opendev.org/c/openstack/project-config/+/87599404:03
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/87599504:03
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements  https://review.opendev.org/c/openstack/project-config/+/87599604:03
*** jpena|off is now known as jpena08:27
opendevreviewSaggi Mizrahi proposed opendev/git-review master: Add worktree support  https://review.opendev.org/c/opendev/git-review/+/87672512:02
opendevreviewRadosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos  https://review.opendev.org/c/openstack/project-config/+/87605413:34
opendevreviewRadosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant  https://review.opendev.org/c/openstack/project-config/+/87641414:03
opendevreviewRadosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant  https://review.opendev.org/c/openstack/project-config/+/87641414:35
*** blarnath is now known as d34dh0r5314:58
fungigraphs for rax-ord are still looking fairly unhealthy compared to the other regions15:24
fungias for inmotion, it looks like we're not booting servers there at all yet15:26
funginodepool configuration was updated at 00:39z and has max-servers: 5115:27
fungiserver list for the nodepool tenant shows 3 active servers and 3 deleted servers, all with creation dates from a week or more ago15:32
funginodepool list knows about 4 of those, one of which is a held node and the others are deleting15:33
clarkbfungi: any indication if nodepool even attempted to boot in inmotion?15:49
fungithat's what i'm trying to figure out from the logs. i don't see it trying currently15:52
fungijust it repeatedly trying to delete these three nodes that are stuck in deleting state15:52
fungilooks like it's declining requests, trying to determine why15:55
clarkbpossible we've got the placement leaks again leading it to think the quota is all used up?15:55
clarkbbut ya nodepool logs should have reasonably concrete info to go off of15:56
fungiit doesn't seem to say why it declined this request though15:57
fungiit gives a ton of info about the request itself, but not the reason it got declined15:58
clarkbthere are a couple of paths that can take it to that point. The first is deciding it doesn't support the label (arm64 or extra large nodes etc)15:58
clarkbanother is failing three boots. I think you need to look for those three failures to find why they failed15:59
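The two decline paths clarkb outlines (unsupported label, or three failed boots) can be sketched as pure logic. This is an illustrative simplification, not nodepool's real code; the names `handle_request` and `LAUNCH_RETRIES` are made up for the example (nodepool's actual config option is `launch-retries`, defaulting to 3):

```python
# Hypothetical sketch of a launcher's decline logic (illustrative names,
# not nodepool internals): a node request is declined either because the
# provider can't satisfy the label at all, or after repeated boot failures.

LAUNCH_RETRIES = 3  # mirrors nodepool's launch-retries default

def handle_request(labels_supported, request_labels, try_boot):
    """Return a 'declined: ...' reason or 'fulfilled' for a node request."""
    # Path 1: provider doesn't offer the requested label (e.g. arm64,
    # extra-large nodes)
    if not set(request_labels) <= set(labels_supported):
        return 'declined: unsupported label'
    # Path 2: retry the boot a fixed number of times before giving up
    for attempt in range(LAUNCH_RETRIES):
        if try_boot():
            return 'fulfilled'
    return 'declined: launch failures'
```

This is why the advice above is to search the logs for three boot failures: only path 2 leaves that trail, while path 1 declines immediately.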
fungithis was a request for a single ubuntu-bionic node to run ansible-playbooks-tox-bandit15:59
clarkbthen ya you may need to look over the history of the request to see if there were three failing boots15:59
funginot finding any indication it's attempting to boot anything though16:00
fungiand the only tracebacks are for the delete failures16:00
clarkbhrm maybe there is another short circuit then?16:01
fungithis is a sample from the debug log: https://paste.opendev.org/show/bkOR3P27zG4bVTD9HRaS/16:02
clarkbfungi: the grafana graphs still show max at 016:06
clarkbI think the issue is the change to max hasn't applied16:08
clarkbpossibly the same deal with rax-ord?16:08
fungithe config updates deployed (and i checked the files have the correct content), but yeah maybe the launchers don't notice file changes and/or any reload signal we send isn't working?16:12
clarkbpreviously it just reloaded the config on every pass through the run loop but ya maybe it isn't doing that anymore?16:12
clarkbwe could try restarting the launcher to see if the behavior changes16:13
fricklerthere are some other errors in the 03_03 log16:14
fricklerthose might have made the statemachine get stuck, I would also suggest a restart16:14
fungiwhat timestamp?16:14
frickleroh, it even says this: 2023-03-03 18:15:14,368 DEBUG nodepool.StateMachineProvider.inmotion-iad3: Removing state machine from runner16:16
fricklerjust after some nova API error16:17
fungi"Removing state machine from runner" seems to appear many times in all the logs though16:17
fungieven back to the oldest retained log from 2023-02-2516:18
fungi500 occurrences in launcher-debug.log.2023-02-25_0516:18
fungii see it on all the launchers, not just 0216:19
funginodepool docs seem to indicate that it should be rereading its configuration automatically, yes16:24
clarkbgitea05-08 are all out of the gerrit replication config (I just verified that the daily jobs synced us up)16:27
clarkbhttps://review.opendev.org/c/opendev/system-config/+/876470 is ready when we're happy with the new servers (this removes those 4 servers from config management)16:27
clarkbafter which they can be deleted16:28
clarkbfungi: switching the grafana graphs for inmotion to a 24 hour timeframe it never has a max value >016:28
clarkbis the exception frickler pointed out due to deletion not creation?16:28
fungiyeah, there are a bunch of deletion failures with tracebacks, related to the three instances stuck in a deleting state16:29
fungithe launcher keeps repeatedly trying to delete them16:30
clarkbya thats "normal" for that situation16:30
clarkbit definitely looks like it hasn't noticed the max-servers value bumped up16:30
clarkbmaybe doing a thread dump to see if the run loop is stuck before it can reload its config then restart the service?16:31
fungioh, good call. yeah doing that now16:45
clarkbremember you need to do a pair with a short pause in the middle otherwise the profiler remains on and things will be extra slow. Less important if you end up restarting things though16:47
fungiyeah, i usually do a 60-second sleep between them16:47
fungiokay, dump with yappi stats is in nl02:~fungi/stack_dump.2023-03-0716:53
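The stack-dump trick in play here can be approximated in plain Python. This is a rough sketch of the mechanism, not the actual zuul/nodepool handler (which also toggles yappi profiling on and off, hence the paired signals with a pause in between): on SIGUSR2, print a traceback for every live thread so you can see where a loop is blocked.

```python
# Illustrative SIGUSR2 stack-dump handler (not the real zuul/nodepool code).
import signal
import sys
import threading
import traceback

def dump_stacks(signum, frame):
    # Map of thread id -> current stack frame for every running thread
    frames = sys._current_frames()
    for thread in threading.enumerate():
        print(f"Thread: {thread.ident} {thread.name}")
        stack = frames.get(thread.ident)
        if stack is not None:
            # traceback output goes to stderr by default
            traceback.print_stack(stack)

# Register the handler; `kill -USR2 <pid>` then triggers a dump.
signal.signal(signal.SIGUSR2, dump_stacks)
```

Output like the `Thread: 139823599843072 NodePool` lines quoted below comes from exactly this kind of per-thread dump.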
clarkbfungi: yup you can see PoolWorker.inmotion-iad3-main is blocking on a stop event16:58
clarkbI suspect its been sitting there for a while16:58
clarkboh though its got a timeout listed16:59
clarkbso maybe not?16:59
fungiso the plan for nl02 is to stop the launcher container, try to clean up all the server instances in inmotion-iad3 which nodepool has no record of (i count 2 in an active state not matching any entries in nodepool list output), then start the container again?16:59
fungialso would be good if we could identify why these other three nodes are undeletable17:00
clarkbI think those three have been undeleteable for months. I looked at them when I did the placement cleanup and couldn't sort it out17:00
clarkbI'm not sure you need to clean those up before starting the service again17:00
*** artom_ is now known as artom17:01
clarkbok that run loop doesn't load new configs. I think an outer loop must do that17:01
fungioh, actually there are 4 nodes in an active state, three of which aren't in the nodepool list17:01
fungiokay, it also isn't deleting those when i issue a server delete17:03
fungithey're still in active state17:03
fungii wonder if oom killer might have terminated some openstack services on parakeet?17:05
clarkbI don't think that is the issue. Those nodes are not on parakeet (only the mirror is)17:05
fungiis there a clean way to restart everything? e.g. reboot the server?17:05
fungithe nodes aren't on parakeet, but some of the api services are17:05
clarkband the setup runs docker with a force restart on everything iirc17:05
clarkbfungi: I think you can do a docker ps -a to see if anything isn't running17:06
clarkbI suspect this has to do with nova just losing track of things on its own17:06
clarkbwhich we've already had to do cleanup for in the past. Its just that this subset is for a different reason that isn't understood yet17:07
fricklerlikely not related, but there's also a lot of images stuck in deletion there17:08
fungiharbor.imhadmin.net/deployment/node-agent:master and kolla/centos-binary-telegraf:victoria are in "restarting" status, harbor.imhadmin.net/deployment/fm-deploy:master is in exited status, all other containers on that server have an up status17:09
clarkbfm-deploy is their deployment tool iirc and that is expected for it to exit once deployment is done17:09
clarkbnode-agent I can only assume is some sort of phone home service to the management system.17:10
clarkbtelegraf too. I think all the openstack services must be running then17:10
fungiyeah, seems so17:10
fungileading me to wonder why `openstack server delete` of an active state instance does nothing, and the instance remains listed as active17:11
fungii suppose these and the stuck deleting instances could all be victims of the oom killer on parakeet17:11
fungimmm, nope they're spread across 3 different hostids17:13
clarkbI think they long predate that17:15
clarkbOOM killer is a relatively recent thing17:15
clarkbLooking at the thread dump I'm not seeing the statemachine provider object for the inmotion cloud17:15
clarkbok PoolWorkers refer to the top level NodePool object to get their Provider objects. The Provider objects capture the configs and they should be replaced when configs update17:20
clarkbwe should log "Creating new ProviderManager object" when swapping them out /me looks17:22
clarkb2023-03-07 00:39:49,754 DEBUG nodepool.ProviderManager: Creating new ProviderManager object for inmotion-iad317:23
clarkbso it does create the new provider manager when the new config shows up but then doesn't seem to reflect the new max-servers value?17:24
fungior else there's something else causing it to summarily decline requests, since it doesn't actually say why it declined them17:25
fungioh, though statsd is still reporting max:017:26
fungiwhich seems to support your theory17:26
clarkbit never stopped17:26
clarkbgrep "DEBUG nodepool.StateMachineProvider.inmotion-iad3: Stopp" /var/log/nodepool/launcher-debug.log.2023-03-0*17:26
*** jpena is now known as jpena|off17:26
clarkbthat shows on the third when we last turned it off the provider goes from stopping to stopped. But this time around it never does the transition17:26
clarkbit waits for all launchers and deleters to stop17:27
clarkbperhaps the deleters were stuck?17:27
* clarkb looks for this thread in the dump17:27
clarkbThread: 139823599843072 NodePool d: False <- that's where the apparently stuck stop() call is sitting17:29
clarkbfungi: so ya I believe a restart will fix this. I don't know where it is getting stuck yet17:31
fungion nl01 for comparison:17:31
fungi2023-03-06 17:43:29,343 DEBUG nodepool.StateMachineProvider.rax-ord: Stopping17:31
fungi2023-03-07 04:41:03,927 DEBUG nodepool.StateMachineProvider.rax-ord: Stopped17:31
fungithat took 11 hours17:32
fungilonger than i would have anticipated17:35
clarkbya based on the tracebacks it appears to be waiting on a launcher or deleter. However looking at the logs I see it running then reporting it ran 3 statemachines17:35
clarkboh! but the state machines have to report they have completed for them to be removed17:35
clarkbcorvus: ^ fyi I think we've found an issue with openstack + statemachine related to clouds being unhappy that results in us holding onto old configs for far too long17:36
clarkbI suspect these three statemachines are never going into a completed state so we're never reloading our config17:36
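The suspected failure mode, an old provider config pinned alive by never-completing state machines, can be modeled in a few lines. All class and method names here are invented for illustration; this is not nodepool's actual code:

```python
# Sketch of the bug being diagnosed: the old provider is only discarded
# once all of its state machines report completion, so a delete state
# machine that never completes pins the old max-servers value forever.

class Provider:
    def __init__(self, max_servers):
        self.max_servers = max_servers
        self.state_machines = []  # in-flight create/delete machines

class Launcher:
    def __init__(self, provider):
        self.provider = provider
        self.pending = None

    def reconfigure(self, new_provider):
        # A config change only queues the new provider...
        self.pending = new_provider

    def run_once(self):
        # ...which is swapped in only when the old one has drained
        if self.pending and not self.provider.state_machines:
            self.provider = self.pending
            self.pending = None
```

Under this model a stuck delete machine keeps `max_servers: 0` in effect indefinitely even though the file on disk says 51, matching what the statsd graphs showed.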
clarkbfungi: and you said there are three nodes in a deleting state?17:37
clarkbit's curious that we were able to restart in this situation when we set the max-servers to zero but not now17:38
fungiwell, there are three nodepool knows about that are in a deleting state and it keeps trying to delete them17:38
fungithere are also three it doesn't seem to be tracking any longer which are stuck in an active state and won't transition to deleting in nova17:38
corvusclarkb: https://review.opendev.org/87525017:40
clarkbaha thanks17:42
clarkbfungi: ^ I think we can restart the service and review that change17:42
corvusfungi: nodepool should continue to try to delete leaked nodes (assuming they have the correct metadata) without tracking them; so it should continue to try to delete them even if it doesn't have a zk record (probably worth double checking the logs to make sure that happens)17:43
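The leaked-node cleanup corvus describes can be sketched as a metadata comparison. Nodepool really does tag the servers it boots with metadata (including a `nodepool_node_id` key), but this function is an illustrative stand-in, not its actual implementation:

```python
# Illustrative leaked-node detection: any cloud server carrying nodepool's
# metadata but missing from the ZooKeeper records is treated as leaked
# and queued for deletion, even though nodepool no longer "tracks" it.

def find_leaked(cloud_servers, zk_node_ids):
    leaked = []
    for server in cloud_servers:
        meta = server.get('metadata', {})
        if 'nodepool_node_id' not in meta:
            continue  # not one of ours; never touch it
        if meta['nodepool_node_id'] not in zk_node_ids:
            leaked.append(server['id'])
    return leaked
```

The metadata guard is why manually-created servers (like the mirror) are safe from this sweep.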
fungihere's the comparison of what nova and nodepool see for inmotion-iad3 btw: https://paste.opendev.org/show/bUgMuhc2IbvlFPNFWb0o/17:44
fungiand yeah, 875250 looks like what we're seeing on nl0217:45
fungiokay, downing and upping the launcher container on nl0217:45
corvusfungi: that paste looks good to me -- is there anything wrong with that?17:46
fungicorvus: nothing wrong on the nodepool side, problems on the nova side17:46
corvusfungi: ack.  makes sense17:46
clarkbcorvus: what is weird to me looking at the log on nl02 is that we keep reporting that 3 tasks are running and then 3 tasks ran. It isn't clear to me why we never transition to completed and remove those three machines. Maybe because the deletions don't ultimately succeed?17:47
fungiseveral instances stuck in delete for weeks which nodepool is no longer tracking, several more active which can't be deleted for some unknown reason (but which nodepool is repeatedly trying)17:47
corvusclarkb: hrm17:47
fungiclarkb: yeah, nova is still reporting the instances as "active"17:48
fungii had it backwards originally, but match up the uuids in my paste and you'll see17:48
clarkbfungi: right and if you manually try to delete them it doesn't help. It's also a nova problem, but one that may be causing those tasks to hang around forever?17:49
fungithe four nodes which nodepool knows about are in active state according to nova, one is for an autohold but the other three are needing to be deleted17:49
clarkbI think ideally we'd want it to try and give up if being stopped and let the new provider manager handle it17:49
fungiand nova is silently ignoring the delete attempts as far as i can see17:49
clarkbwith corvus' proposed change I think we may end up with both the old and new provider manager running threads attempting to delete the resources (not the end of the world)17:49
fungiokay, so ready for a down/up of the launcher container on nl02?17:50
clarkbI am but maybe corvus wants to look at it first?17:50
fungisure, i can wait for a bit17:50
corvusclarkb: i think (haven't totally confirmed this) that we're seeing the behavior described in the commit message, but with deleting state machines -- perhaps it is removing the delete state machines after the delete timout hits, but then the periodic leaked node cleanup is adding new ones17:51
corvusfungiclarkb you're clear to restart; i think any further questions can prob be handled in logs17:51
clarkbcorvus: oh! that would be a fun interaction17:51
fungigot it, proceeding17:51
corvusclarkb: so anyway, yeah, that change should handle that by causing the new delete-state-machines to go to the new provider, and there will be 2 for a brief time (<5min) and then the old one finally disappears17:51
clarkbcorvus: yup17:52
clarkbfungi: I was wrong df55493d-bae7-47b2-9069-1e488c09a2fd is only a few days old.17:52
clarkbfungi: its task state is in deleting but it isn't actively deleting. I think we ran into this before17:52
fungi#status log Restarted the nodepool-launcher container on nl02 in order to force a config reload, as a workaround until https://review.opendev.org/875250 is deployed17:52
opendevstatusfungi: finished logging17:52
clarkbfungi: basically nova sets the task state to deleting then fires off an event for that. If that event gets lost it won't let you set the state to deleting by issuing another normal delete which would refire the task because it is already deleting17:53
clarkbfungi: iirc melwitt linked to some process to double check things then you manually override the state back to active and issue another delete17:53
corvusclarkb: https://paste.opendev.org/show/bh9N99Gx60aXEYwM7bCC/  here's an edited series of log messages that i think shows that sequence17:53
clarkbI forget what that process was but we can probably just yolo17:53
clarkbcorvus: ++17:54
fungiclarkb: but df55493d-bae7-47b2-9069-1e488c09a2fd is stuck in "active" not "deleted"17:54
fungiat least according to nova17:54
clarkbfungi: its task_state is "deleting"17:54
clarkband if nova loses the event to actually make that happen you get stuck in limbo here17:55
clarkbthe fix is to reset task state back to active then reissue the delete command17:55
corvusis that a user-level action or cloud-admin-level?17:55
clarkbI found a purple link df55493d-bae7-47b2-9069-1e488c09a2fd17:55
clarkbcorvus: admin level17:55
corvusthat=reset task state17:55
fungioh, yep okay. OS-EXT-STS:task_state=deleting OS-EXT-STS:vm_state=active17:56
fungiand server list reports the latter17:56
fungithus my confusion17:56
clarkband that server isn't on parakeet so unlikely related to OOMs17:57
fungiright, i compared the hostids reported for these and they're scattered around the cluster17:57
clarkbaha I found my notes17:57
clarkb`openstack server set --state active $UUID`17:57
clarkbthen you can issue a normal delete17:58
clarkbfrom my notes it looks like maybe only the orphaned allocations needed double checking but not the task state being stuck17:58
clarkbfungi: do you want to issue that or should I?17:58
clarkbI suspect that nodepool will then automatically clean them up for us17:58
fungii'll give it a go17:59
corvusthis is the cloud we have admin on?  should we make a cron script that looks for active/deleting nodes and then issue that command?  i sort of suspect that major cloud operators may do that -- based on the fact this seems to happen occasionally everywhere and then eventually they get cleaned up...17:59
clarkbI would server show them as admin first to make sure they have the task_state in deleting17:59
fungiclarkb: where was the clouds.yaml on the servers there?17:59
clarkbcorvus: yes and not a bad idea. I think we'd want some delay18:00
clarkbfungi: source /etc/kolla/admin-openrc.sh18:00
fungiaha, thanks, i'll source that18:00
clarkbcorvus: because there is a period of time where that set of states is valid. But if it goes for more than say an hour then we're probably good to clean it up18:00
fungiclarkb: is that on all the servers? which one did you find it on? not seeing it on .23018:01
clarkbfungi: I'm on 229. But I thought it was deployed to all of them18:01
* clarkb looks18:01
fungiit's on .229 yeah, thanks18:01
clarkbhuh ya I guess they only deploy that to the first server18:02
fungiand what's the path to openstackclient?18:02
corvusclarkb: yep.  is there a timestamp for the task_state?  or would we just need to have persistence to detect it?18:02
fungi# openstack server list18:02
fungi-bash: openstack: command not found18:02
clarkbfungi: /opt/kolla-ansible/victoria/ its in that venv. You can activate it or just use the bin path18:02
corvus(that might be worth looking at real quick before you reset these states :)18:02
fungiaha, okay18:02
clarkbcorvus: it looks like the updated field is getting updated frequently. I did a show earlier and got a timestamp from a few minutes prior and a show now shows updated a few minutes prior to now18:03
clarkbcorvus: we probably need to keep track ourselves. Anything that is in that state goes into a list and if it is still in that state an hour later we update it18:04
fungiand as admin, it's server list --all otherwise you get a blank list due to it checking the admin tenant i guess18:04
corvusclarkb: ack, good to know18:04
fungiOS-EXT-STS:task_state=None after i did that to d52712f4-80a6-4f9a-979c-3031972cd89f, so now to see if the launcher successfully deletes it18:06
clarkbfungi: I see task state is deleting already18:07
clarkbbut it hasn't gone away18:07
clarkbif it continues to not go away we may have to check compute logs on the host18:08
fungiit was already in a deleting state from nodepool's perspective, i think it takes a little while to get back to retrying to delete it though18:08
clarkbfungi: no I mean in nova18:09
clarkbtask_state is deleting18:09
clarkboh wait you did a different host18:09
fungiyeah, i've only done d52712f4-80a6-4f9a-979c-3031972cd89f so far18:09
fungithe first one in the nodepool list output18:10
fungiif memory serves, it may be as much as 10 minutes before the launcher tries to delete things again18:10
clarkbit looks like it is trying now18:11
fungithe restart did get the config reread though, and looks like we're booting new nodes there again18:11
clarkbthe nodepool logs indicate "State machine for 0033156866 at deleting server"18:11
clarkbbut server show doesn't indicate a state change yet18:11
clarkband its gone18:13
fungii'll reset the other two offenders18:13
funginow as for the ones nova indicates have been in a "deleted" state since last month, is there a chance the same state reset will work for them as well?18:14
clarkbFor that we may need to talk to nova. I'd worry that will cause nova to try and do things that can't happen since the nodes are actually gone18:15
clarkband then create more problems18:15
* clarkb reviews the nodepool fix now18:15
fungiyeah, server show doesn't indicate any related errors for them18:15
fungianyway, i'll switch to revisiting rax-ord18:16
fungistill seeing "Timeout waiting for instance creation"18:20
clarkbfungi: so that timeout is actually the one waiting for the instance to go active18:22
clarkbwhich was already at 10 minutes18:22
clarkbboot timeout is once active how long to wait for ssh to be up18:22
clarkbso maybe we need to increase the launch timeout18:23
clarkbre auto updating task_state we should also resync with openmetal and see if they are deploying more up to date openstack clouds now. Then redeploy openstack and get bugfixes18:24
clarkbpretty sure melwitt said this issue should be addressed in like yoga?18:24
fungialso plenty of "Timeout waiting for instance deletion"18:26
clarkbya the general slowness is why I suggested dialing back max-servers there to see if we were creating the noise18:27
fungidefinitely looks like the servers nodepool is giving up on are in an active state in nova, just spot-checking again at random18:29
clarkbso they do eventually go active? that would imply our timeout is simply too short?18:30
fungilooking at node 0033394230, journalctl reports sshd was fully started at 18:08:24, while the launcher log says it gave up waiting at 18:08:5918:33
clarkbfungi: that timeout is nodepool waiting for openstack to report the node is active though18:35
fungiyeah, i just wanted to find something to compare against18:35
fungimaybe nova is lagging reporting the status change in the server list18:35
clarkbcould be18:36
funginodepool log says it started building the node at 17:52:30 and the state machine went to "submit creating server" at 17:59:05, that's also a bit of a lag?18:36
clarkbfungi: that lag is possibly due to the rate limiter if we have a lot of requests to process in that cloud18:37
clarkbthis might be another good reason to drop max-servers as it should provide a cleaner picture of what is going on in the logs18:37
fungiyeah, i'll push a change up to ~halve it for now18:37
opendevreviewJeremy Stanley proposed openstack/project-config master: Try halving max-servers for rax-ord region  https://review.opendev.org/c/openstack/project-config/+/87677518:41
clarkbfungi: note that we should've increased launch-timeout not boot timeout. I should have checked more closely yesterday18:45
clarkbbut I think we can do that increase next18:45
fungiyeah, it was unclear to me that "launch" is the server create time and "boot" is the ssh access time18:52
opendevreviewMerged openstack/project-config master: Try halving max-servers for rax-ord region  https://review.opendev.org/c/openstack/project-config/+/87677519:11
fungichecking up on the inmotion-iad3 environment now that things should have reached a steady state there, the host where the mirror vm is running is the only one with a single qemu-kvm process, the others have two or more, suggesting that the maintenance disable for nova is working19:17
fungilooks like we've been setting auth = DEVELOPMENT_BECOME_ANY_ACCOUNT in the git-review tests since they were originally introduced in 201319:40
clarkbfungi: ya so zuul quickstart sets that then logs in to create a first admin account. Then it configures the admin account and everything else from there19:42
clarkbor somethign along those lines19:43
fungii've rechecked the change so i can remember what errors it was giving19:44
clarkblooks like the upstream gerrit docker image bakes in gerrit admin:secret19:47
clarkbmaybe that is the default for dev mode?19:48
clarkblooks like our system-config testing does the same thing somehow19:48
fungiwas also running into https://bugs.chromium.org/p/gerrit/issues/detail?id=1621519:48
ianwhrm, now i've found -> https://gerrit-review.googlesource.com/c/gerrit/+/339542/14/java/com/google/gerrit/server/schema/MigrateLabelFunctionsToSubmitRequirement.java19:49
ianwfor NoOp/NoBlock, that translates things to "NoBlock" (i've chosen NoOp) and adds a s-r with "applicableIf: is:false"19:50
clarkbI guess NoBlock is canonical? I'm fine with switching to it then19:51
clarkbThe applicableIf is weird. Basically says this submit requirement exists but does nothing19:51
clarkbeven less than it exists and always allows things to go through :)19:51
ianwright, that makes it a "trigger" vote19:51
ianwmy attempt was doing "submittableIf = is:false"19:51
ianwthat makes it weird in the UI19:51
ianwi think i am re-writing the change stack, again :)19:52
clarkblooks like admin:secret is the default for local dev mode19:59
clarkbfungi: ^ so ya you should be able to start in dev mode and then use that account19:59
clarkbfungi: re the ssh issue I remember looking into that and wondering why it never seemed to affect us20:00
clarkbbut I would look at updating your git-review change to gerrit 3.6 maybe20:00
clarkbsince that has the updated mina that can handle ssh + sha220:00
fungiwell, i think some of the challenge was also staying compatible with the ssh and python on ubuntu bionic20:01
fungiso latest gerrit we can run and test on a bionic node20:02
fungihttps://zuul.opendev.org/t/opendev/build/547c54234e6e400cbbdece0badaa7ac9 is a fresh failure for it20:02
ianwwhat's the context here?  switching the All-Projects testing?20:03
clarkbI don't think anything should prevent gerrit 3.6 from running on bionic20:03
clarkbianw: git-review functiona ltesting runs gerrit20:03
clarkbbut its currently set up to run realy old gerrit20:03
fungiyeah, okay so it's the ssh key uploads it was stuck on20:04
fungiRuntimeError: SSH key upload failed: <Response [401]> "Unauthorized"20:04
fungii want to say where i got to was determining that in newer gerrits, the DEVELOPMENT_BECOME_ANY_ACCOUNT mode didn't work for uploading ssh keys via the rest api any more20:07
fungioh, this may be the digest vs plain auth problem20:07
fungiright now we're posting to the api with auth=requests.auth.HTTPDigestAuth('admin', 'secret')20:08
Clark[m]It definitely works because both zuul and system-config use the rest API to set keys20:08
fungithat needs to be plain in more recent gerrit, right?20:08
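If the digest-vs-basic guess is right, the fix is swapping the auth class on the existing request. A minimal illustration (the URL is a placeholder for the test harness's Gerrit; `admin:secret` are the dev-mode defaults discussed above, and no request is actually sent here):

```python
# Newer Gerrit accepts only HTTP basic auth on /a/ REST endpoints
# (digest auth was dropped, reportedly around the 2.14 era), so the
# one-line change is HTTPDigestAuth -> HTTPBasicAuth.
import requests

url = 'http://localhost:8080/a/accounts/self/sshkeys'  # placeholder URL
auth = requests.auth.HTTPBasicAuth('admin', 'secret')  # not HTTPDigestAuth

req = requests.Request('POST', url, auth=auth,
                       data='ssh-rsa AAAA... fake-key')
prepared = req.prepare()
# Basic auth puts base64(user:pass) straight into the Authorization
# header; digest auth would instead wait for a 401 challenge, which is
# consistent with the 401 "Unauthorized" seen in the failing job.
print(prepared.headers['Authorization'])
```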
opendevreviewJeremy Stanley proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4  https://review.opendev.org/c/opendev/git-review/+/84941920:12
fungiguess we'll see20:12
clarkbfungi: if you decide to stick to 3.4 you probably want to bump the version to 3.4.8 at least20:41
clarkblooks like the latest error has to do with sending email that the ssh keys are added20:43
clarkbI think you can just disable email20:43
clarkbya sendemail.enable set to false disables all email sending20:44
fungiwell, to go past 3.4.4 we'll also need to implement a workaround for the unfixed bug report i linked, i think?20:44
fungior is that fixed and the devs just never closed the bug?20:44
fungibasically openssh client on ubuntu bionic not working with the newer key exchange defaults on 3.4.5, seems like20:45
clarkboh it looks like 3.4.5 updated mina to 2.7.0 from 2.6.0 ya I don't think they updated beyond that20:47
fungiso we'd need to do gerrit 3.5.x i guess to get past it?20:47
fungibasically they introduced a regression in a 3.4 point release and then never bothered to fix it before dropping support for that series20:48
clarkbwell if mina 2.7.0 is the problem then I guess no newer gerrit would work20:48
fungiit looks like https://issues.apache.org/jira/browse/SSHD-1163 is supposed to be fixed though20:49
opendevreviewClark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4  https://review.opendev.org/c/opendev/git-review/+/84941920:49
fungisupposedly fixed in mina-sshd 2.8.020:50
clarkbin that case 3.5 or newer should work20:50
clarkbsince 2.8.0 got backported to 3.520:50
clarkbI pushed a patch to disable email20:50
fungibut yeah, first i wanted to see this work before rolling versions forward even more20:50
fungialso there's some benefit to testing the oldest version of gerrit necessary for the features being added, since it helps avoid breaking git-review for users of somewhat older gerrit deployments (though at the expense of not catching bugs/regressions with newer gerrit releases)20:51
clarkbBut part of the issue here is we stuck to old versions which grew stale and leads to making updates more difficult20:52
clarkbI suspect that keeping up with the latest stable release would minimize deltas we have to sort out20:52
melwittclarkb: the doc you might be thinking of regarding task_state deleting might be this https://docs.openstack.org/nova/latest/admin/support-compute.html#reset-the-state-of-an-instance20:58
clarkbmelwitt: yup thanks. For servers that say they are deleted but haven't been removed is there something to look at? we were wondering if we should set them active and try to delete them similar to ^20:59
clarkbbut worried that since they are actually deleted that might cause more problems20:59
melwittclarkb: do you mean that they are gone from the server list/server show but still have libvirt guests running?21:01
clarkbmelwitt: no they show up in server list as status DELETED and OS-EXT-STS:vm_state                 | deleted21:02
melwittfwiw I don't think it would hurt to reset state and try to delete them again. I'm not sure if it would help either 21:03
clarkbwe want them to not show up in listings any longer (because nodepool thinks it leaked them and wants to delete them) 21:03
clarkbok that's helpful if only because it gives us something new to try 21:03
melwittI'm not sure I've seen that before where they are listed showing as DELETED when not using nova list --deleted21:06
fungiit's showing up when we `openstack server list` as a normal account in the tenant (don't need to be an admin)21:07
clarkbpower state is NOSTATE and task_state is none21:07
clarkbI think it may have actually deleted the instance everywhere but in the db21:07
melwittthe instance record gets deleted on the compute host if that nova-compute service is "up". it's possible something failed before it got to the instance record deletion, that's pretty much the last step21:08
fungimelwitt: looks like this to a normal user: https://paste.opendev.org/show/bRRe78SmoX2cls5Lovkl/21:09
melwittit would have failed somewhere after the save() call here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3254-L326721:10
clarkbah ok so we could go hunting for logs maybe depending on how catastrophic the failure was21:10
melwittwhich apparently the main thing that happens there is deleting the allocation in the placement service21:10
fungiwhich i guess eventually leads placement to thinking it can't add more servers if we accumulate too many of those?21:11
melwittso yeah, you would want to look at nova-compute.log at the time of the deletion. what might have happened is the allocation deletion failed for some reason. in the past a DELETE in placement was not "forced" and could fail for a 409 conflict21:11
melwittwhich some of us thought was wrong and we changed it to a "force" sometime semi recently. maybe a couple of years ago. not sure what version yall are using21:12
clarkbthis is victoria aiui21:13
clarkband ya I think one of the things we should do after server upgrades and gerrit 3.7 and mailman3 is redeploy this if they've got a modern version supported in their tool (they must)21:13
melwittok, so this landed in the xena release https://review.opendev.org/c/openstack/nova/+/688802 but I'm making assumptions that the placement allocation delete failed. if it did, I think resetting the state and trying the delete again might work21:15
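[editor's note: melwitt's suggestion as a hypothetical CLI session; `openstack server set --state` requires admin credentials, and `<uuid>` is a placeholder for the stuck instance's ID from `openstack server list`.]

```console
$ openstack server set --state active <uuid>   # resets vm_state via os-resetState
$ openstack server delete <uuid>               # re-runs the normal delete path
```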
clarkbfungi: my patch fixed the email thing. Now it's complaining about output from git-review based on responses from the server21:15
clarkbfungi: so I think at this point it's just a matter of updating the tests to match modern gerrit behaviors21:15
fungimelwitt: worth a shot, thanks for the deep dive!21:15
fungii reset the state on the three perpetually "deleted" instances there21:18
clarkbinterestingly it seems to be complaining that the change was committed before the commit hook so lacks the change id?21:18
clarkbthat implies git-review wasn't able to get the commit hook and auto apply it?21:18
clarkbfungi: ack21:18
clarkbbut I think the gerrit install itself is functioning now21:20
fungiyep, this is much progress. thanks!21:20
clarkbI'm going to pop out for a bit to run that errand. https://review.opendev.org/c/opendev/system-config/+/876470/1 should be ready to go if y'all have time to look at it21:25
fungimelwitt: the status reset worked on the stuck deleted nodes as well, thanks again for the help!21:29
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : handle submit requirements in normalise tool  https://review.opendev.org/c/openstack/project-config/+/87599222:21
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function  https://review.opendev.org/c/openstack/project-config/+/87580422:21
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/87599322:21
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple =  https://review.opendev.org/c/openstack/project-config/+/87599422:21
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/87599522:21
ianwclarkb: ^ after several attempts i think that's in its final state now.  it should replicate what the upstream migration would want to do22:22
opendevreviewMerged opendev/system-config master: Remove gitea05-08 from configuration management  https://review.opendev.org/c/opendev/system-config/+/87647022:29
clarkbianw: ack I'll take a look. Is that migration in the 3.6 -> 3.7 path? I guess because submit requirements are more required now? It's just weird they waited so long for that22:41
ianwclarkb: no, the change appears to only be on master.  so i imagine that 3.7 -> 3.8 upgrade process will run this22:41
ianwthese changes should effectively isolate us from that.  so we won't skew between what's running in gerrit and what our acls are on disk22:43
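[editor's note: an illustrative project.config fragment showing the shape these ACL conversions take; the `submittableIf` expression mirrors what upstream's MaxWithBlock-to-submit-requirements migration emits, but the label here is only an example.]

```
[label "Code-Review"]
    function = NoBlock
[submit-requirement "Code-Review"]
    submittableIf = label:Code-Review=MAX AND -label:Code-Review=MIN
```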
clarkbweird. But also ++ to syncing up with them22:43
clarkb#status log Increased afs quotas for ubuntu-ports, debian-security, centos, and centos-stream mirrors22:51
opendevstatusclarkb: finished logging22:51
clarkbinfra-root with https://review.opendev.org/c/opendev/system-config/+/876470 merged I'll go ahead and manually stop gitea services via docker-compose down on the four hosts and in a day or two we can delete them22:51
clarkbthat's done. If we really need to turn them back on remember that they haven't been replicated to for a few days22:55
clarkbBut I don't expect that to come up22:55
opendevreviewMerged opendev/system-config master: mirror-update: drop Fedora 35  https://review.opendev.org/c/opendev/system-config/+/87648622:59
clarkbianw: fwiw I think we can make code-review normal in infra-specs if we want. But at this point maybe that's best as a followup23:00
ianwclarkb: yeah, probably for infra specs, but at least for governance it seems to be very conscious choice to allow anyone to +1/-1 for comments, but leave the merge up to the ptl23:02
clarkbya though I'm looking at https://review.opendev.org/c/openstack/project-config/+/875804/5/gerrit/acls/starlingx/governance.config and it looks like no submit requirements are needed? This isn't a regression either23:04
* clarkb pulls up a governance change to see what they actually rely on23:04
clarkbjust the workflow +1 I guess23:04
clarkbianw: small thing on https://review.opendev.org/c/openstack/project-config/+/875804 but let me review the rest of the stack before you push new patchsets23:05
ianwyeah, i think the idea is that ultimately it's up to the ptl to tally and make the final choice23:05
ianwdoh, thanks, not sure how i missed that one23:06
clarkbianw: another followup idea. We could update our linter to only allow function = NoBlock when this is all done23:07
clarkbthat way we don't accidentally add any new NoOps or MaxWithBlock. We might even require function = NoBlock?23:08
ianwyeah, i have a change sort of to do that23:08
ianwthe problem is it isn't really a linter as such, it's a transformer23:09
clarkbwe look for a diff after normalizing it23:09
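[editor's note: a hypothetical sketch of the two ideas above — split ACL `key = value` lines on the FIRST `=` only (values can contain `=` themselves, per change 875994), and flag any label function other than NoBlock. `check_acl_lines` and `ALLOWED_FUNCTIONS` are invented names, not the real normalise tool.]

```python
ALLOWED_FUNCTIONS = {'NoBlock'}

def check_acl_lines(lines):
    """Return the disallowed label function values found in an ACL file."""
    problems = []
    for line in lines:
        if '=' not in line:
            continue
        # str.partition splits on the first '=' only, so a value that
        # itself contains '=' survives intact
        key, _, value = (part.strip() for part in line.partition('='))
        if key == 'function' and value not in ALLOWED_FUNCTIONS:
            problems.append(value)
    return problems

acl = [
    'submittableIf = label:Code-Review=MAX AND -label:Code-Review=MIN',
    'function = MaxWithBlock',
    'function = NoBlock',
]
print(check_acl_lines(acl))  # ['MaxWithBlock']
```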
clarkbianw: I think the one I discovered in an earlier change is fixed in https://review.opendev.org/c/openstack/project-config/+/875995/23:09
clarkbI guess if you confirm that you can decide where you want it to be applied23:10
clarkb(I kinda like modifying a single project all at once so that they get the potentially new behavior (which shouldn't happen) all at once for ease of debugging)23:10
clarkbbut ya the rest of the stack lgtm23:10
ianwok let me fix that in the lower change, i think that's the right place23:11
opendevreviewMerged opendev/system-config master: mirror-update: Add Fedora 37  https://review.opendev.org/c/opendev/system-config/+/87648723:11
ianw^ i'll hold the lock and do the f35 removal and f37 addition as it will take a while23:12
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function  https://review.opendev.org/c/openstack/project-config/+/87580423:13
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert Backport-Candidate to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/87599323:13
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple =  https://review.opendev.org/c/openstack/project-config/+/87599423:13
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert Review-Priority to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/87599523:13
ianwclarkb: https://review.opendev.org/c/opendev/system-config/+/876237/3 and https://review.opendev.org/c/opendev/system-config/+/876236/2 are the other two that codify the all-projects change 23:15
ianwthe proposed diff for all-projects is https://paste.opendev.org/show/brAj40R1mJbQZSXAXEQ5/23:16
clarkbianw: note on https://review.opendev.org/c/opendev/system-config/+/87623723:23
ianwahh, great point23:23
opendevreviewIan Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements  https://review.opendev.org/c/opendev/system-config/+/87623723:25
clarkbthanks that lgtm23:25
ianwhttps://paste.opendev.org/show/b5flVdJsiNlLmY4lA6Yp/ is an updated diff against our all-projects23:27
ianwi'm running the new fedora sync in a root screen on mirror-update23:34
opendevreviewClark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4  https://review.opendev.org/c/opendev/git-review/+/84941923:41
clarkbfungi: ^ Hopefully that is all that is necessary now23:41
fungioh, thanks! i hadn't gotten back to it yet23:43
clarkbI had the test failures open and looked at them again. It was just a difference in the text newer gerrit returns to git-review which then returns to the user23:45
fungicool, that's pretty minor23:46
clarkband the email thing was actually non-fatal. But nice to not have giant java tracebacks in the logs for failed tests23:46
fungiyep, agreed23:48
clarkbJayF: just a heads up that we've pencilled in April 7 (Good Friday, hopefully it is quiet as a result) for project renames23:48
clarkbno specific time yet. We'll sort that out as we get closer to the day23:48
clarkbheh now they are all failing on `which git-review`23:54
clarkbhow did this pass before. I didn't change anything in the git review install path23:55
ianwcould it be a tox thing?23:55
clarkboh yup I just realized the older python versions will work I think23:55
clarkband it's because new tox doesn't do older python23:55
* clarkb looks at tox.ini23:56
clarkbI think it's the skipsdist step23:56
clarkbthat prevents install of the project into the venv so which doesn't find it23:56
fungimay also need to allowlist_externals which?23:57
opendevreviewClark Boylan proposed opendev/git-review master: Upgrade testing to Gerrit 3.4.4  https://review.opendev.org/c/opendev/git-review/+/84941923:57
clarkbfungi: no, the which happens within the test suite23:57
clarkbtox has no idea about that one23:57
fungioh, got it23:58
fungiand the comment there was for xenial era i think which we no longer test on23:59
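[editor's note: a hypothetical tox.ini fragment illustrating the skipsdist interaction being debugged; the section contents are made up, only the skipsdist/usedevelop behavior is the point.]

```ini
# With skipsdist = true, tox never builds/installs the project into the
# venv, so the test suite's `which git-review` finds nothing.
[tox]
envlist = py3
# skipsdist = true    <- dropping this restores the package install

[testenv]
usedevelop = true     # alternatively: editable install bypasses sdist entirely
```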
