*** dviroel|afk is now known as dviroel | 00:08 | |
fungi | looks like jeepyb is rather bitrotten when it comes to testing | 00:21 |
---|---|---|
fungi | also it came up on the debian-python ml (why are they packaging it anyway? i have no idea) that jeepyb's use of pyyaml isn't compliant with new versions that flip out and tell you you're going to get hacked if you don't use the "safe" parser options | 00:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: Automatically do RAX rdns updates when launching nodes https://review.opendev.org/c/opendev/system-config/+/865478 | 00:32 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: remove local mode for sshfp records https://review.opendev.org/c/opendev/system-config/+/865495 | 00:32 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: add ssh keys to inventory https://review.opendev.org/c/opendev/system-config/+/865496 | 00:32 |
Clark[m] | fungi: that patch is super simple at least. I'm surprised they package it. I'd suggest they stop honestly | 00:34 |
fungi | it was my suggestion as well ;) | 00:37 |
opendevreview | Ian Wienand proposed opendev/jeepyb master: update_blueprint: handle recent gerrit arguments https://review.opendev.org/c/opendev/jeepyb/+/866237 | 00:40 |
opendevreview | Ian Wienand proposed opendev/jeepyb master: Fix minor pep8 issues https://review.opendev.org/c/opendev/jeepyb/+/866240 | 00:40 |
opendevreview | Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing https://review.opendev.org/c/opendev/gerritlib/+/866241 | 00:52 |
opendevreview | Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing https://review.opendev.org/c/opendev/gerritlib/+/866241 | 00:55 |
opendevreview | Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing https://review.opendev.org/c/opendev/gerritlib/+/866241 | 01:11 |
opendevreview | Ian Wienand proposed opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing https://review.opendev.org/c/opendev/gerritlib/+/866241 | 01:22 |
*** dviroel is now known as dviroel|out | 01:33 | |
opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/866243 | 02:22 |
opendevreview | Ian Wienand proposed opendev/jeepyb master: Fix minor pep8 issues https://review.opendev.org/c/opendev/jeepyb/+/866240 | 02:47 |
opendevreview | Ian Wienand proposed opendev/jeepyb master: update_blueprint: handle recent gerrit arguments https://review.opendev.org/c/opendev/jeepyb/+/866237 | 02:47 |
*** pojadhav|out is now known as pojadhav|ruck | 03:00 | |
*** yadnesh|away is now known as yadnesh | 04:00 | |
ianw | i've patched maxUpdates to 5000 in the gerrit config, i'm going to restart, try the update on neutron | 05:24 |
ianw | ok, we only got the two errors about the missing change, that we know about | 05:28 |
ianw | and then weirdly, it reported 12 changes as updated | 05:28 |
ianw | https://paste.opendev.org/show/b5pVsurgszVZdnWIER9o/ | 05:29 |
ianw | rerunning it : Labels copied for 1 project(s) have impacted 0 change(s) | 05:29 |
ianw | it's gone back to updating nothing | 05:29 |
ianw | so that's a pretty successful test, i've got enough for a decent bug report now. i'll remove the config and restart gerrit | 05:30 |
ianw | issue filed -> https://bugs.chromium.org/p/gerrit/issues/detail?id=16485 | 05:50 |
opendevreview | Cedric Jeanneret proposed opendev/system-config master: Correct how ansible-galaxy is proxified https://review.opendev.org/c/opendev/system-config/+/866175 | 07:56 |
*** ysandeep is now known as ysandeep|lunch | 08:12 | |
*** pojadhav|ruck is now known as pojadhav|lunch | 08:26 | |
*** jpena|off is now known as jpena | 08:31 | |
opendevreview | Cedric Jeanneret proposed opendev/system-config master: Correct how ansible-galaxy is proxified https://review.opendev.org/c/opendev/system-config/+/866175 | 08:55 |
*** yadnesh is now known as yadnesh|afk | 09:00 | |
*** pojadhav|lunch is now known as pojadhav|ruck | 09:00 | |
*** ysandeep|lunch is now known as ysandeep | 09:12 | |
*** yadnesh|afk is now known as yadnesh | 09:58 | |
*** dviroel|out is now known as dviroel | 10:43 | |
*** rlandy|out is now known as rlandy | 11:20 | |
opendevreview | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/866243 | 12:54 |
*** rlandy is now known as rlandy|brb | 13:03 | |
*** ysandeep is now known as ysandeep|afk | 13:12 | |
*** rlandy|brb is now known as rlandy | 13:18 | |
opendevreview | Merged opendev/system-config master: Update iweb clouds.yaml for old and new openstacksdk https://review.opendev.org/c/opendev/system-config/+/866220 | 13:30 |
opendevreview | Merged opendev/system-config master: Clean up an old raw IP address from our MTAs https://review.opendev.org/c/opendev/system-config/+/866203 | 13:30 |
*** frenzy_friday|rover is now known as frenzy_friday|rover|lunch | 13:33 | |
*** dasm|off is now known as dasm | 13:46 | |
*** yadnesh is now known as yadnesh|away | 13:55 | |
*** frenzy_friday|rover|lunch is now known as frenzy_friday|rover | 14:01 | |
*** ysandeep|afk is now known as ysandeep | 14:07 | |
*** pojadhav|ruck is now known as pojadhav|afk | 14:42 | |
fungi | yay, infra-prod-service-lists3 finally ran successfully! https://lists01.opendev.org/ is now returning an (expected) error page and has a valid https cert | 14:49 |
fungi | i'll proceed with sending the announcement | 14:49 |
opendevreview | Merged zuul/zuul-jobs master: Add support for nodejs testing on Debian https://review.opendev.org/c/zuul/zuul-jobs/+/865459 | 14:53 |
*** ysandeep is now known as ysandeep|out | 15:10 | |
Clark[m] | Note the error page is because lists01.opendev.org isn't in the acceptable list of mailman hosts. If you swap the name in the request out for lists.opendev.org it should be happier. | 15:14 |
fungi | right, though for now you'll need to override dns resolution, e.g. in your /etc/hosts | 15:25 |
*** marios is now known as marios|out | 16:49 | |
clarkb | fungi: looks like at least one backup failed for the new lists server too. May need to followup with that if it persists | 17:16 |
fungi | ml migration announcement sent | 17:19 |
clarkb | fungi: https://review.opendev.org/c/opendev/jeepyb/+/866240/ and its depends on are the two additional changes needed to address update-blueprint problems | 17:21 |
fungi | aha, thanks | 17:22 |
clarkb | fungi: looks like backups just failed again so likely both sets are failing and it does need to be looked into | 17:27 |
*** jpena is now known as jpena|off | 17:27 | |
fungi | yep, will take a look | 17:28 |
opendevreview | Merged opendev/gerritlib master: Drop Python < 3.8, add Python 3.10 testing https://review.opendev.org/c/opendev/gerritlib/+/866241 | 17:30 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Correct how ansible-galaxy is proxified https://review.opendev.org/c/opendev/system-config/+/866175 | 17:40 |
clarkb | I'm doing inmotion cleanups now. I removed a few leaked image uploads. A handful can't be deleted because they are in use by instances. I've moved on to cleaning up placement resource allocations. The process for this is largely documented at https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html but I'm also doing server shows on all allocations and not | 18:19 |
clarkb | deleting any placement for servers that successfully show | 18:19 |
clarkb | those that don't successfully show shouldn't exist so I can remove their placement allocations | 18:19 |
clarkb | also you have to be careful not to remove the allocation for the mirror (which I've found and ignored) | 18:19 |
clarkb | if I were to do it again I might disable the cloud in nodepool first too. That way it is easier to see everything is cleaned up (currently nodepool is booting things as I make room which makes that more difficult) | 18:30 |
clarkb | melwitt: ^ I've done that which has allowed some instances to boot now. But running openstack server list still produces a number of instances in a BUILD state that server show says do not exist when I run it against them. Any idea where that info might be coming from and how to clean it up? | 18:39 |
JayF | fungi: your email was providing me with new information: that the deletion+EOL-tagging of branches can happen within gerrit. Does this mean it's reviewable? | 18:41 |
clarkb | melwitt: spot checking virsh list against allocations after my cleanup does seem to show we don't have any actually leaked VMs (even the error nodes that show properly don't appear to be running). | 18:43 |
clarkb | melwitt: so I guess nova isn't dynamically populating that from libvirt | 18:44 |
clarkb | I guess a lot of this could be cached too and I need to wait? | 18:44 |
melwitt | clarkb: yes, that means a race or other failure happened while scheduling those instances and it didn't get far enough to create the DB records for the instances https://bugs.launchpad.net/nova/+bug/1784093 and the direct link to what needs to be done to clean it up is https://bugs.launchpad.net/nova/+bug/1784093/comments/9 | 18:45 |
clarkb | melwitt: thanks! | 18:46 |
melwitt | we fixed that in queens/rocky/stein by creating the related records in a single DB transaction but more recently I know of a different corner case that can result in orphaned build requests that still needs to be addressed | 18:47 |
clarkb | and it looks like the uuid reported by virsh dominfo matches the nova uuid so I can cross check and make sure we haven't actually leaked any real vms before removing db rows | 18:49 |
clarkb | I need to take a break, but I'll look into that next | 18:49 |
opendevreview | Merged opendev/jeepyb master: Fix minor pep8 issues https://review.opendev.org/c/opendev/jeepyb/+/866240 | 18:50 |
opendevreview | Merged opendev/jeepyb master: update_blueprint: handle recent gerrit arguments https://review.opendev.org/c/opendev/jeepyb/+/866237 | 18:52 |
fungi | JayF: it's not reviewable as a change in gerrit. tags are pushed directly into gerrit's git backend, while branches are deleted through its rest api or webui | 18:59 |
fungi | permissions for those actions can be granted through gerrit acls | 18:59 |
fungi | what the openstack release managers do is abstract that with some additional structured data in another git repository which describes the actions, and then it's those descriptions which they review, which in turn drive the actions once the change merges | 19:00 |
melwitt | clarkb: yes that's right, the instance uuid can be used with the virsh commands | 19:00 |
JayF | Yeah; I've seen the script. I just had some hope to make even the manual stuff (for ironic bugfix branches) better guarded. | 19:02 |
clarkb | melwitt: in this situation only request_specs seems to have records and not build_requests. But I do see at least one of my leaked "instances" there | 19:08 |
clarkb | melwitt: I'm going to remove the record and see what happens I guess :) | 19:09 |
clarkb | melwitt: openstack server list is still listing it. Is this something that will be cached? | 19:10 |
melwitt | clarkb: no.. but it's possible you are seeing a different issue https://bugs.launchpad.net/nova/+bug/1871925 with link to cleanup steps https://bugs.launchpad.net/nova/+bug/1871925/comments/4 | 19:13 |
clarkb | melwitt: hrm in that case do I need to put the request_spec back in? | 19:15 |
melwitt | clarkb: I don't think so | 19:15 |
melwitt | request_spec is used for scheduling so if we're just trying to delete something, it should be ok | 19:15 |
clarkb | ack. Let me try this | 19:15 |
clarkb | ok, confirmed there is a null cell id there | 19:16 |
clarkb | melwitt: that did it | 19:23 |
melwitt | ok cool | 19:23 |
clarkb | we've got a smaller number of ERROR instances that need cleaning up too, but I think that is a separate problem and the set is small enough I may punt on that for now. Just focusing on the stuck BUILD instances | 19:27 |
clarkb | melwitt: is there a tl;dr on why we have a default cell and a cell0 and all our nodes seem to be in the default? Or is that by chance and things can be balanced between them ? (not urgent I'm mostly just curious) | 19:30 |
melwitt | clarkb: cell0 is a "special" cell that contains only instances that failed to schedule. so having cell0 and a default cell is considered to be a "single cell deployment" | 19:32 |
melwitt | instances get mapped to cells once they land on a node, so if they don't end up on a node, they go into cell0 | 19:33 |
melwitt | cell0 is a DB only thing really, it never has compute nodes | 19:34 |
fungi | cell0 is where your instance requests go to die | 19:52 |
opendevreview | Jay Faulkner proposed openstack/project-config master: Allow Ironic/Sushy cores to toggle WIP state https://review.opendev.org/c/openstack/project-config/+/863931 | 19:54 |
clarkb | I've confirmed all 5 ERROR nodes have a cell id set so they have a different problem | 19:55 |
clarkb | there are only 5 so I'm inclined to ignore them for now and figure them out later | 19:55 |
melwitt | if there are only 5, you're doing good 😂 | 19:57 |
clarkb | I have discovered that running show on the instance as an admin gives me more info though. Looks like a db connection error which is weird because I've been able to delete a bunch of other instances after repairing their db records | 19:57 |
clarkb | melwitt: is it possible the error I'm seeing for db connectivity is the original error that caused the instance to go into an ERROR state and now the db is corrupt for that instance so I can't delete now that db connectivity is working? | 19:58 |
melwitt | clarkb: it likely is. it's set at the same time as setting of ERROR state. you should be able to delete it if the DB is working now | 19:59 |
melwitt | it just means it failed to build or whatever action was tried due to a DB connection error. and that makes sense considering how many instances stuck in BUILD, the most common way it happens is when the cluster is experiencing DB connection errors and is not in a nominal state | 20:00 |
clarkb | looks like it isn't able to delete that instance. I'll probably need to dig into logs to see why and then fix it up from there but it is time for lunch | 20:01 |
clarkb | also this is a million times more happy now after placement and build instance cleanup so I think we can leave it for a bit | 20:01 |
melwitt | we have implemented some fixes to make nova more robust but there is still at least one corner case out there | 20:01 |
melwitt | hm, can't delete it ... yeah we'll have to look at that a little more to find out what's wrong there | 20:04 |
clarkb | ya this is where my devstack debugging skills will be useful. Except I have to map them onto kolla | 20:05 |
melwitt | usually can't delete means cell_id isn't set or it's set to the wrong cell where the instance record is not | 20:05 |
clarkb | oh the cell id is set to the regular cell not cell0 but maybe it is in cell0 due to the error | 20:05 |
clarkb | I can try and take a look at that after lunch | 20:05 |
melwitt | yeah... IIRC I think that's possible if things race in a certain way | 20:05 |
melwitt | but it's been awhile since I've seen it. it's very similar to the aforementioned issues that were fixed or at least partially fixed | 20:07 |
ianw | clarkb: if we're happy https://review.opendev.org/c/opendev/jeepyb/+/866237 has promoted new 3.5 images, i can restart it when it's more quiet to fix up the hook | 20:09 |
clarkb | ianw: I guess my only concern at this point is that jeepyb doesn't test those images only builds them | 20:12 |
clarkb | if we want to be extra cautious we could have a system-config change rebuild the image which does test the image too | 20:12 |
opendevreview | Ian Wienand proposed opendev/system-config master: [dnm] Trigger gerrit image testing https://review.opendev.org/c/opendev/system-config/+/866397 | 20:18 |
ianw | ^ good idea | 20:18 |
ianw | do you want to merge that change, so we use the image from it? | 20:19 |
Clark[m] | ++ | 20:22 |
Clark[m] | I'll send it in after lunch | 20:22 |
Clark[m] | Maybe update the commit message? | 20:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: Trigger gerrit image testing https://review.opendev.org/c/opendev/system-config/+/866397 | 20:23 |
ianw | ^ yep, done :) | 20:23 |
opendevreview | Merged opendev/system-config master: borg-backup-server: build borg users betterer https://review.opendev.org/c/opendev/system-config/+/865202 | 20:41 |
*** dviroel is now known as dviroel|out | 21:02 | |
clarkb | ok the instance that won't delete is mapped to the cell that it appears to reside in so that wasn't it | 21:11 |
opendevreview | Merged opendev/system-config master: letsencrypt-request-certs: refactor certcheck list https://review.opendev.org/c/opendev/system-config/+/865218 | 21:14 |
clarkb | melwitt: fwiw 'Instance is already in deleting state, ignoring this request' is what nova-api-wsgi.log says | 21:15 |
opendevreview | Merged opendev/system-config master: letsencrypt: build txt record lists betterer https://review.opendev.org/c/opendev/system-config/+/865203 | 21:15 |
opendevreview | Merged opendev/system-config master: gitea-git-repos: remove #!/usr/bin/env python https://review.opendev.org/c/opendev/system-config/+/865224 | 21:15 |
opendevreview | Merged opendev/system-config master: gitea-set-org-logos: use -T on mariadb command https://review.opendev.org/c/opendev/system-config/+/865339 | 21:15 |
clarkb | `nova force-delete` seems to exist but I have no idea if that is appropriate here | 21:16 |
melwitt | oh :\ that seems silly to not let you delete it anyway cause the record is obviously there 😆 lemme check something | 21:17 |
clarkb | I think what happened was we tried to delete the instance when the db was not reachable which put the instance in an error state while trying to delete | 21:18 |
clarkb | and now nova is ignoring requests for deletion because it thinks it is already being deleted | 21:18 |
clarkb | the nova api for server deletion doesn't have a force option so I'm not sure what `nova force-delete` does (maybe does backend operations directly rather than api request) | 21:19 |
melwitt | yeah.. I think force-delete is for the "soft-delete" feature that no one uses where when a user requests a delete, nova keeps it for a configured amount of time before deleting it | 21:22 |
melwitt | and you can "undelete" it if you're using that. and force-delete is for an admin to say "don't wait for the configured elapsed time, delete it now" | 21:23 |
clarkb | ah | 21:23 |
melwitt | I feel like there was a change awhile back to accept delete regardless of the task_state | 21:24 |
melwitt | hm, maybe I dreamed it | 21:26 |
clarkb | I guess a super yolo option might be to toggle the state back to active or whatever steady state is and then try deleting it again | 21:26 |
clarkb | but then you have to hope it doesn't break because of steps already taken in the deletion processing being completed | 21:28 |
melwitt | yeah that is true. there is an API call to do that | 21:28 |
clarkb | oh is there? | 21:28 |
melwitt | yeah, the famous "reset state" | 21:28 |
melwitt | possibly only available via the legacy novaclient. I'm checking | 21:28 |
clarkb | `openstack server set --action active` maybe? | 21:29 |
clarkb | but does that modify task_state or vm_state? | 21:30 |
melwitt | oh yeah, that looks like it | 21:30 |
melwitt | vm_state | 21:30 |
clarkb | vm_state is error and task_state is deleting | 21:30 |
clarkb | I think it is task_state that the api is short circuiting on | 21:31 |
melwitt | yeah, I think you're right | 21:31 |
clarkb | https://blog.dachary.org/2014/05/02/reseting-an-instance-powervmtask_state-in-havana/ here's loic dachary doing the yolo db thing way back when | 21:35 |
melwitt | ok, looks like it sets task_state = None no matter what, so it "should" help | 21:36 |
clarkb | oh interesting. I guess I can try toggling it to active and toggle back to error if it doesn't change task_state | 21:36 |
melwitt | yeah | 21:36 |
clarkb | heh policy doesn't allow it /me moves to admin shell | 21:37 |
melwitt | oh yeah, admin only for the yolo | 21:38 |
clarkb | ok that worked. Setting it to active cleared the task_state then a regular delete request as non admin deleted it | 21:39 |
melwitt | ok good | 21:39 |
clarkb | all 5 appeared to be in a similar state so I've applied the same workaround to all of them and they deleted properly via the openstack server delete commands | 21:45 |
clarkb | I think we may have leaked some recordkeeping that tracks bfv image use in glance with those instances | 21:46 |
clarkb | hrm volume list doesn't show any volumes so maybe the side effect is different than bfv | 21:51 |
clarkb | who works on kolla these days? one bit of feedback I've got is that the upstream docker hub mariadb images stick credentials in the runtime environment for the database so you can exec in and connect easily | 21:53 |
clarkb | kolla doesn't do this | 21:53 |
clarkb | makes it slightly more annoying to manipulate the db by hand | 21:53 |
melwitt | that I don't know | 22:02 |
clarkb | yup not a problem. you've been a tremendous help today and I learned a ton. Thank you! | 22:05 |
clarkb | melwitt: ^ | 22:06 |
melwitt | sure, np :) | 22:06 |
*** dasm is now known as dasm|off | 22:07 | |
opendevreview | Merged opendev/system-config master: Trigger gerrit image testing https://review.opendev.org/c/opendev/system-config/+/866397 | 22:17 |
clarkb | cool system-config-run-review-3.5 passed in check and gate with that image update. That gives me a lot more confidence :) | 22:19 |
ianw | ++ i can restart it later | 22:27 |
opendevreview | Merged opendev/system-config master: bridge: Disable writing known_hosts files https://review.opendev.org/c/opendev/system-config/+/865092 | 22:54 |
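
The placement cleanup clarkb describes at 18:19 and 18:49 comes down to a set comparison: an allocation is only a removal candidate if nova cannot show the server and it is not the mirror, and each candidate is then checked against the UUIDs that `virsh dominfo` reports. A minimal Python sketch of that cross-check follows. The function names and the sample UUIDs are hypothetical, and gathering the inputs (placement consumers, server show results, virsh output) is assumed to happen elsewhere.

```python
def find_orphaned_allocations(consumer_uuids, showable_uuids, protected_uuids=()):
    """Return allocation consumers that nova cannot show and that are not protected.

    consumer_uuids:  consumers that hold placement allocations
    showable_uuids:  servers for which `openstack server show` succeeded
    protected_uuids: servers that must never be touched (e.g. the mirror)
    """
    showable = set(showable_uuids)
    protected = set(protected_uuids)
    return sorted(
        uuid for uuid in set(consumer_uuids)
        if uuid not in showable and uuid not in protected
    )


def parse_dominfo_uuid(dominfo_text):
    """Extract the UUID field from `virsh dominfo` output, or None if absent."""
    for line in dominfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "UUID":
            return value.strip()
    return None


def still_running(candidates, libvirt_uuids):
    """Return candidates that libvirt still knows about (real leaked VMs)."""
    running = set(libvirt_uuids)
    return sorted(uuid for uuid in candidates if uuid in running)


# Hypothetical sample data, for illustration only.
consumers = ["aaaa-1", "bbbb-2", "cccc-3", "mirror-0"]
showable = ["aaaa-1"]
orphans = find_orphaned_allocations(consumers, showable, protected_uuids=["mirror-0"])

sample_dominfo = "Id:             7\nName:           instance-0000002a\nUUID:           bbbb-2\nState:          running\n"
leaked = still_running(orphans, [parse_dominfo_uuid(sample_dominfo)])
```

Anything in `leaked` would still be a running VM and would need a closer look before its allocation or database rows are removed. In the log, the spot check found no such VMs.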
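
The fix for the five ERROR instances (21:26 to 21:45) worked as follows. The delete API ignored requests while `task_state` was `deleting`. Resetting the server state as admin cleared `task_state`, and a normal delete then succeeded. The Python sketch below only decides whether a server is in that stuck condition and builds the two commands; it runs nothing. It assumes `openstack server show -f json` output, and it accepts both `OS-EXT-STS:task_state` and `task_state` as key names since these differ between client versions. The reset is spelled `--state active` here, which is an assumption about the client's option name (the log guesses `--action`); check `openstack server set --help` before relying on it.

```python
import json


def needs_state_reset(server_show_json):
    """Return True if the server is stuck with task_state 'deleting'."""
    server = json.loads(server_show_json)
    # Key naming is an assumption: older clients prefix it with OS-EXT-STS.
    task_state = server.get("OS-EXT-STS:task_state", server.get("task_state"))
    return task_state == "deleting"


def recovery_commands(server_id):
    """Build (but do not run) the two-step workaround from the log."""
    return [
        # Admin-only: resetting the state also clears task_state.
        ["openstack", "server", "set", "--state", "active", server_id],
        # A regular delete is accepted once task_state is cleared.
        ["openstack", "server", "delete", server_id],
    ]


# Hypothetical server record, for illustration only.
stuck = '{"id": "bbbb-2", "OS-EXT-STS:vm_state": "error", "OS-EXT-STS:task_state": "deleting"}'
if needs_state_reset(stuck):
    for command in recovery_commands("bbbb-2"):
        print(" ".join(command))
```

As noted at 21:28, resetting the state is only safe if the deletion steps already taken do not break the second attempt. That is why the sketch prints the commands for review instead of executing them.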