Thursday, 2022-12-01

*** dviroel|afk is now known as dviroel  00:08
<fungi> looks like jeepyb is rather bitrotten when it comes to testing  00:21
<fungi> also it came up on the debian-python ml (why are they packaging it anyway? i have no idea) that jeepyb's use of pyyaml isn't compliant with new versions that flip out and tell you you're going to get hacked if you don't use the "safe" parser options  00:22
<opendevreview> Ian Wienand proposed opendev/system-config master: launch: Automatically do RAX rdns updates when launching nodes
<opendevreview> Ian Wienand proposed opendev/system-config master: launch: remove local mode for sshfp records
<opendevreview> Ian Wienand proposed opendev/system-config master: launch: add ssh keys to inventory
<Clark[m]> fungi: that patch is super simple at least. I'm surprised they package it. I'd suggest they stop honestly  00:34
<fungi> it was my suggestion as well ;)  00:37
<opendevreview> Ian Wienand proposed opendev/jeepyb master: update_blueprint: handle recent gerrit arguments
<opendevreview> Ian Wienand proposed opendev/jeepyb master: Fix minor pep8 issues
<opendevreview> Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing
<opendevreview> Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing
<opendevreview> Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing
<opendevreview> Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing
*** dviroel is now known as dviroel|out  01:33
<opendevreview> OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml
<opendevreview> Ian Wienand proposed opendev/jeepyb master: Fix minor pep8 issues
<opendevreview> Ian Wienand proposed opendev/jeepyb master: update_blueprint: handle recent gerrit arguments
*** pojadhav|out is now known as pojadhav|ruck  03:00
*** yadnesh|away is now known as yadnesh  04:00
<ianw> i've patched maxUpdates to 5000 in the gerrit config, i'm going to restart, try the update on neutron  05:24
<ianw> ok, we only got the two errors about the missing change, that we know about  05:28
<ianw> and then weirdly, it reported 12 changes as updated  05:28
<ianw> rerunning it : Labels copied for 1 project(s) have impacted 0 change(s)  05:29
<ianw> it's gone back to updating nothing  05:29
<ianw> so that's a pretty successful test, i've got enough for a decent bug report now.  i'll remove the config and restart gerrit  05:30
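The maxUpdates bump ianw describes is an ordinary gerrit.config setting; a sketch of what the temporary change likely looked like (`change.maxUpdates` is the Gerrit option that caps the number of NoteDb updates a single change may accumulate, and changing it requires a Gerrit restart, as the log shows):

```ini
# etc/gerrit.config (excerpt); the default for maxUpdates is 1000
[change]
	maxUpdates = 5000
```

As noted above, the raised value was only kept long enough to test the label-copying update and was then reverted.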
<ianw> issue filed ->
<opendevreview> Cedric Jeanneret proposed opendev/system-config master: Correct how ansible-galaxy is proxified
*** ysandeep is now known as ysandeep|lunch  08:12
*** pojadhav|ruck is now known as pojadhav|lunch  08:26
*** jpena|off is now known as jpena  08:31
<opendevreview> Cedric Jeanneret proposed opendev/system-config master: Correct how ansible-galaxy is proxified
*** yadnesh is now known as yadnesh|afk  09:00
*** pojadhav|lunch is now known as pojadhav|ruck  09:00
*** ysandeep|lunch is now known as ysandeep  09:12
*** yadnesh|afk is now known as yadnesh  09:58
*** dviroel|out is now known as dviroel  10:43
*** rlandy|out is now known as rlandy  11:20
<opendevreview> Merged openstack/project-config master: Normalize projects.yaml
*** rlandy is now known as rlandy|brb  13:03
*** ysandeep is now known as ysandeep|afk  13:12
*** rlandy|brb is now known as rlandy  13:18
<opendevreview> Merged opendev/system-config master: Update iweb clouds.yaml for old and new openstacksdk
<opendevreview> Merged opendev/system-config master: Clean up an old raw IP address from our MTAs
*** frenzy_friday|rover is now known as frenzy_friday|rover|lunch  13:33
*** dasm|off is now known as dasm  13:46
*** yadnesh is now known as yadnesh|away  13:55
*** frenzy_friday|rover|lunch is now known as frenzy_friday|rover  14:01
*** ysandeep|afk is now known as ysandeep  14:07
*** pojadhav|ruck is now known as pojadhav|afk  14:42
<fungi> yay, infra-prod-service-lists3 finally ran successfully! is now returning an (expected) error page and has a valid https cert  14:49
<fungi> i'll proceed with sending the announcement  14:49
<opendevreview> Merged zuul/zuul-jobs master: Add support for nodejs testing on Debian
*** ysandeep is now known as ysandeep|out  15:10
<Clark[m]> Note the error page is because isn't in the acceptable list of mailman hosts. If you swap the name in the request out for it should be happier.  15:14
<fungi> right, though for now you'll need to override dns resolution, e.g. in your /etc/hosts  15:25
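The /etc/hosts override fungi mentions is a single line mapping the list site's name to the new server. Both values below are placeholders (203.0.113.10 is a documentation-range address, and the hostname is assumed from context, since the actual site URL was a link in the original log and did not survive extraction):

```
# /etc/hosts -- temporary override until DNS is cut over to the new server
203.0.113.10    lists.opendev.org
```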
*** marios is now known as marios|out  16:49
<clarkb> fungi: looks like at least one backup failed for the new lists server too. May need to follow up on that if it persists  17:16
<fungi> ml migration announcement sent  17:19
<clarkb> fungi: and its depends-on are the two additional changes needed to address update-blueprint problems  17:21
<fungi> aha, thanks  17:22
<clarkb> fungi: looks like backups just failed again so likely both sets are failing and it does need to be looked into  17:27
*** jpena is now known as jpena|off  17:27
<fungi> yep, will take a look  17:28
<opendevreview> Merged opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Correct how ansible-galaxy is proxified
<clarkb> I'm doing inmotion cleanups now. I removed a few leaked image uploads. A handful can't be deleted because they are in use by instances. I've moved on to cleaning up placement resource allocations. The process for this is largely documented at but I'm also doing server shows on all allocations and not deleting any placement for servers that successfully show  18:19
<clarkb> those that don't successfully show shouldn't exist so I can remove their placement allocations  18:19
<clarkb> also you have to be careful not to remove the allocation for the mirror (which I've found and ignored)  18:19
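clarkb's cross-check boils down to a set difference between two UUID lists: consumers that hold placement allocations, versus servers that `openstack server show` can still find. A minimal sketch of that comparison (the file names, and the idea that the two lists were dumped beforehand with `openstack resource provider` / `openstack server` commands, are assumptions for illustration, not commands from the log):

```shell
# allocations.txt: one consumer UUID per line, dumped from placement
# servers.txt: one UUID per line for servers that "openstack server show" finds
sort -u allocations.txt -o allocations.sorted
sort -u servers.txt -o servers.sorted
# UUIDs holding an allocation but with no live server are deletion candidates;
# long-lived nodes like the mirror must be excluded by hand first.
comm -23 allocations.sorted servers.sorted
```

`comm -23` prints only lines unique to the first (allocations) file, i.e. allocations whose server no longer exists.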
<clarkb> if I were to do it again I might disable the cloud in nodepool first too. That way it is easier to see everything is cleaned up (currently nodepool is booting things as I make room which makes that more difficult)  18:30
<clarkb> melwitt: ^ I've done that which has allowed some instances to boot now. But running openstack server list still produces a number of instances in a BUILD state that when I run server show against says does not exist. Any idea where that info might be coming from and how to clean it up?  18:39
<JayF> fungi: your email was providing me with new information: that the deletion+EOL-tagging of branches can happen within gerrit. Does this mean it's reviewable?  18:41
<clarkb> melwitt: spot checking virsh list against allocations after my cleanup does seem to show we don't have any actually leaked VMs (even the error nodes that show properly don't appear to be running).  18:43
<clarkb> melwitt: so I guess nova isn't dynamically populating that from libvirt  18:44
<clarkb> I guess a lot of this could be cached too and I need to wait?  18:44
<melwitt> clarkb: yes, that means a race or other failure happened while scheduling those instances and it didn't get far enough to create the DB records for the instances and the direct link to what needs to be done to clean it up is
<clarkb> melwitt: thanks!  18:46
<melwitt> we fixed that in queens/rocky/stein by creating the related records in a single DB transaction but more recently I know of a different corner case that can result in orphaned build requests that still needs to be addressed  18:47
<clarkb> and it looks like the uuid reported by virsh dominfo matches the nova uuid so I can cross check and make sure we haven't actually leaked any real vms before removing db rows  18:49
<clarkb> I need to take a break, but I'll look into that next  18:49
<opendevreview> Merged opendev/jeepyb master: Fix minor pep8 issues
<opendevreview> Merged opendev/jeepyb master: update_blueprint: handle recent gerrit arguments
<fungi> JayF: it's not reviewable as a change in gerrit. tags are pushed directly into gerrit's git backend, while branches are deleted through its rest api or webui  18:59
<fungi> permissions for those actions can be granted through gerrit acls  18:59
<fungi> what the openstack release managers do is abstract that with some additional structured data in another git repository which describes the actions, and then it's those descriptions which they review, which in turn drive the actions once the change merges  19:00
<melwitt> clarkb: yes that's right, the instance uuid can be used with the virsh commands  19:00
<JayF> Yeah; I've seen the script. I just had some hope to make even the manual stuff (for ironic bugfix branches) better guarded.  19:02
<clarkb> melwitt: in this situation only request_specs seems to have records and not build_requests. But I do see at least one of my leaked "instances" there  19:08
<clarkb> melwitt: I'm going to remove the record and see what happens I guess :)  19:09
<clarkb> melwitt: openstack server list is still listing it. Is this something that will be cached?  19:10
<melwitt> clarkb: no.. but it's possible you are seeing a different issue with link to cleanup steps
<clarkb> melwitt: hrm in that case do I need to put the request_spec back in?  19:15
<melwitt> clarkb: I don't think so  19:15
<melwitt> request_spec is used for scheduling so if we're just trying to delete something, it should be ok  19:15
<clarkb> ack. Let me try this  19:15
<clarkb> ok, confirmed there is a null cell id there  19:16
<clarkb> melwitt: that did it  19:23
<melwitt> ok cool  19:23
<clarkb> we've got a smaller number of ERROR instances that need cleaning up too, but I think that is a separate problem and the set is small enough I may punt on it for now. Just focusing on the stuck BUILD instances  19:27
<clarkb> melwitt: is there a tl;dr on why we have a default cell and a cell0 and all our nodes seem to be in the default? Or is that by chance and things can be balanced between them? (not urgent, I'm mostly just curious)  19:30
<melwitt> clarkb: cell0 is a "special" cell that contains only instances that failed to schedule. so having cell0 and a default cell is considered to be a "single cell deployment"  19:32
<melwitt> instances get mapped to cells once they land on a node, so if they don't end up on a node, they go into cell0  19:33
<melwitt> cell0 is a DB only thing really, it never has compute nodes  19:34
<fungi> cell0 is where your instance requests go to die  19:52
<opendevreview> Jay Faulkner proposed openstack/project-config master: Allow Ironic/Sushy cores to toggle WIP state
<clarkb> I've confirmed all 5 ERROR nodes have a cell id set so they have a different problem  19:55
<clarkb> there are only 5 so I'm inclined to ignore them for now and figure them out later  19:55
<melwitt> if there are only 5, you're doing good 😂  19:57
<clarkb> I have discovered that running show on the instance as an admin gives me more info though. Looks like a db connection error which is weird because I've been able to delete a bunch of other instances after repairing their db records  19:57
<clarkb> melwitt: is it possible the error I'm seeing for db connectivity is the original error that caused the instance to go into an ERROR state and now the db is corrupt for that instance so I can't delete now that db connectivity is working?  19:58
<melwitt> clarkb: it likely is. it's set at the same time as setting of ERROR state. you should be able to delete it if the DB is working now  19:59
<melwitt> it just means it failed to build or whatever action was tried due to a DB connection error. and that makes sense considering how many instances were stuck in BUILD; the most common way it happens is when the cluster is experiencing DB connection errors and is not in a nominal state  20:00
<clarkb> looks like it isn't able to delete that instance. I'll probably need to dig into logs to see why and then fix it up from there but it is time for lunch  20:01
<clarkb> also this is a million times more happy now after placement and build instance cleanup so I think we can leave it for a bit  20:01
<melwitt> we have implemented some fixes to make nova more robust but there is still at least one corner case out there  20:01
<melwitt> hm, can't delete it ... yeah we'll have to look at that a little more to find out what's wrong there  20:04
<clarkb> ya this is where my devstack debugging skills will be useful. Except I have to map them onto kolla  20:05
<melwitt> usually can't delete means cell_id isn't set or it's set to the wrong cell where the instance record is not  20:05
<clarkb> oh the cell id is set to the regular cell not cell0 but maybe it is in cell0 due to the error  20:05
<clarkb> I can try and take a look at that after lunch  20:05
<melwitt> yeah... IIRC I think that's possible if things race in a certain way  20:05
<melwitt> but it's been awhile since I've seen it. it's very similar to the aforementioned issues that were fixed or at least partially fixed  20:07
<ianw> clarkb: if we're happy has promoted new 3.5 images, i can restart it when it's more quiet to fix up the hook  20:09
<clarkb> ianw: I guess my only concern at this point is that jeepyb doesn't test those images, only builds them  20:12
<clarkb> if we want to be extra cautious we could have a system-config change rebuild the image which does test the image too  20:12
<opendevreview> Ian Wienand proposed opendev/system-config master: [dnm] Trigger gerrit image testing
<ianw> ^ good idea  20:18
<ianw> do you want to merge that change, so we use the image from it?  20:19
<Clark[m]> I'll send it in after lunch  20:22
<Clark[m]> Maybe update the commit message?  20:22
<opendevreview> Ian Wienand proposed opendev/system-config master: Trigger gerrit image testing
<ianw> ^ yep, done :)  20:23
<opendevreview> Merged opendev/system-config master: borg-backup-server: build borg users betterer
*** dviroel is now known as dviroel|out  21:02
<clarkb> ok the instance that won't delete is mapped to the cell that it appears to reside in so that wasn't it  21:11
<opendevreview> Merged opendev/system-config master: letsencrypt-request-certs: refactor certcheck list
<clarkb> melwitt: fwiw 'Instance is already in deleting state, ignoring this request' is what nova-api-wsgi.log says  21:15
<opendevreview> Merged opendev/system-config master: letsencrypt: build txt record lists betterer
<opendevreview> Merged opendev/system-config master: gitea-git-repos: remove #!/usr/bin/env python
<opendevreview> Merged opendev/system-config master: gitea-set-org-logos: use -T on mariadb command
<clarkb> `nova force-delete` seems to exist but I have no idea if that is appropriate here  21:16
<melwitt> oh :\  that seems silly to not let you delete it anyway cause the record is obviously there 😆 lemme check something  21:17
<clarkb> I think what happened was we tried to delete the instance when the db was not reachable which put the instance in an error state while trying to delete  21:18
<clarkb> and now nova is ignoring requests for deletion because it thinks it is already being deleted  21:18
<clarkb> the nova api for server deletion doesn't have a force option so I'm not sure what `nova force-delete` does (maybe it does backend operations directly rather than an api request)  21:19
<melwitt> yeah.. I think force-delete is for the "soft-delete" feature that no one uses, where when a user requests a delete, nova keeps it for a configured amount of time before deleting it  21:22
<melwitt> and you can "undelete" it if you're using that. and force-delete is for an admin to say "don't wait for the configured elapsed time, delete it now"  21:23
<melwitt> I feel like there was a change awhile back to accept delete regardless of the task_state  21:24
<melwitt> hm, maybe I dreamed it  21:26
<clarkb> I guess a super yolo option might be to toggle the state back to active or whatever steady state is and then try deleting it again  21:26
<clarkb> but then you have to hope it doesn't break because of steps already taken in the deletion processing being completed  21:28
<melwitt> yeah that is true. there is an API call to do that  21:28
<clarkb> oh is there?  21:28
<melwitt> yeah, the famous "reset state"  21:28
<melwitt> possibly only available via the legacy novaclient. I'm checking  21:28
<clarkb> `openstack server set --action active` maybe?  21:29
<clarkb> but does that modify task_state or vm_state?  21:30
<melwitt> oh yeah, that looks like it  21:30
<clarkb> vm_state is error and task_state is deleting  21:30
<clarkb> I think it is task_state that the api is short circuiting on  21:31
<melwitt> yeah, I think you're right  21:31
<clarkb> here's loic dachary doing the yolo db thing way back when  21:35
<melwitt> ok, looks like it sets task_state = None no matter what, so it "should" help  21:36
<clarkb> oh interesting. I guess I can try toggling it to active and toggle back to error if it doesn't change task_state  21:36
<clarkb> heh, policy doesn't allow it /me moves to admin shell  21:37
<melwitt> oh yeah, admin only for the yolo  21:38
<clarkb> ok that worked. Setting it to active cleared the task_state then a regular delete request as non admin deleted it  21:39
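The workaround clarkb lands on (clear the stuck task_state by resetting vm_state, then delete normally) can be sketched as a tiny helper. This is an assumption-laden sketch: `unstick_instance` is a hypothetical name, the first call needs admin credentials, and current openstackclient spells the reset flag `--state` rather than the `--action` spelling guessed earlier in the log:

```shell
# unstick an instance whose task_state is wedged at "deleting":
# reset vm_state to active (admin only), then issue a normal delete
unstick_instance() {
    openstack server set --state active "$1" &&
    openstack server delete "$1"
}
```

The reset corresponds to the legacy `nova reset-state --active`, which, as melwitt notes, sets task_state back to None regardless of its current value.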
<melwitt> ok good  21:39
<clarkb> all 5 appeared to be in a similar state so I've applied the same workaround to all of them and they deleted properly via the openstack server delete commands  21:45
<clarkb> I think we may have leaked some recordkeeping that tracks bfv image use in glance with those instances  21:46
<clarkb> hrm volume list doesn't show any volumes so maybe the side effect is different than bfv  21:51
<clarkb> who works on kolla these days? one bit of feedback I've got is that the upstream docker hub mariadb images stick credentials in the runtime environment for the database so you can exec in and connect easily  21:53
<clarkb> kolla doesn't do this  21:53
<clarkb> makes it slightly more annoying to manipulate the db by hand  21:53
<melwitt> that I don't know  22:02
<clarkb> yup not a problem. You've been a tremendous help today and I learned a ton. Thank you!  22:05
<clarkb> melwitt: ^  22:06
<melwitt> sure, np :)  22:06
*** dasm is now known as dasm|off  22:07
<opendevreview> Merged opendev/system-config master: Trigger gerrit image testing
<clarkb> cool, system-config-run-review-3.5 passed in check and gate with that image update. That gives me a lot more confidence :)  22:19
<ianw> ++ i can restart it later  22:27
<opendevreview> Merged opendev/system-config master: bridge: Disable writing known_hosts files

Generated by 2.17.3 by Marius Gedminas - find it at!