*** dtantsur_ is now known as dtantsur | 00:08 | |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking https://review.opendev.org/c/openstack/project-config/+/894285 | 05:03 |
frickler | gtema: ^^ fyi amended to cover all repos, which I assume is what you intended? | 05:04 |
frickler | I also noticed that ansible-collections-openstack formally still belongs to the ansible sig, which iiuc has effectively disbanded. would it make sense to move it to sdks proper? | 05:05 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking https://review.opendev.org/c/openstack/project-config/+/894285 | 05:26 |
gtema | frickler: Thanks. Wrt ansible-collections-openstack: I guess yes, the move makes sense | 05:46 |
opendevreview | Martin Magr proposed openstack/project-config master: Add python-observabilityclient https://review.opendev.org/c/openstack/project-config/+/894541 | 13:47 |
clarkb | fungi: midday-ish my time may be good for https://review.opendev.org/c/opendev/system-config/+/894382 to update gitea to the latest version? I have a hopefully short (no pupil dilation) optometrist visit in about an hour, so I'm thinking after that | 15:16 |
clarkb | fungi: also today is the day we said we would do https://review.opendev.org/c/openstack/project-config/+/893963/1 and child to clean up fedora | 15:19 |
clarkb | maybe start with those if you have a moment to review them? | 15:20 |
opendevreview | Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 15:59 |
opendevreview | Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced projects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 16:33 |
fungi | clarkb: yep, sounds good. taking a look shortly | 16:47 |
clarkb | I'm back from getting my eyeballs examined. Happy to keep an eye on any of those three changes if they get approved | 17:34 |
clarkb | or address review comments if changes need to be made | 17:34 |
opendevreview | Merged openstack/project-config master: Remove fedora-35 and fedora-36 from nodepool providers https://review.opendev.org/c/openstack/project-config/+/893963 | 17:34 |
clarkb | we should approve https://review.opendev.org/c/openstack/project-config/+/893964/ after ^ looks good though | 17:34 |
fungi | i need to pick up a few things from the hardware store and grab lunch while i'm out, but can help test new gitea in a bit once i'm back (hour-ish) if that works? | 17:39 |
fungi | that'll probably also be enough time to know if we're ready to proceed with the fedora image removal step | 17:40 |
fungi | okay, headed out, back in about an hour | 17:41 |
clarkb | sounds good | 17:55 |
clarkb | sorry, I got distracted by stuff around the house, but I'm not going anywhere so that plan sounds good | 17:55 |
clarkb | I think 893963 has applied and nodepool is continuing to run happily with the new config. There are no fedora nodes either. Now to check if the images have been cleaned up from the cloud providers | 18:25 |
clarkb | on the nodepool side of things there is a single inmotion fedora image that appears to be failing to delete. I think we can probably just remove that image from the zk db and then figure out cleaning it up from the cloud another time | 18:27 |
clarkb | spot checking rax regions there are a handful of fedora images that show up there still that aren't in nodepool | 18:28 |
clarkb | so ya I think the next step is to clean up fedora-36-1662540204 for inmotion on the nodepool side, then we can merge the next change safely | 18:28 |
clarkb | then we can do manual cleanups of any remaining fedora images | 18:28 |
frickler | inmotion had a big bunch of very old images failing to delete last time I looked, maybe check those, too, when you're done with fedora | 18:31 |
clarkb | ack | 18:32 |
clarkb | chances are we have to log in as admins and forcefully delete some things, then nodepool will notice they are gone | 18:32 |
clarkb | /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 <- that appears to be the znode to remove from the zk db | 18:33 |
frickler | if you tell me how I can do that I can have a look tomorrow | 18:33 |
frickler | nodepool image-list|grep deleting|wc => 19, all in inmotion, some > 1y old | 18:33 |
clarkb | frickler: are you interested in the zookeeper bit or the inmotion thing? For zookeeper you log in to one of the three nodes and then use the zk-shell tool (I have it installed in a venv called venv in my homedir) to connect to the zk server. Then you can use simple commands like ls, cd, get, rm to manipulate the db | 18:35 |
frickler | I was talking about inmotion, sorry for the overlap | 18:35 |
clarkb | in this case what I've done is ls and cd around to find what looks like the correct znode (the path above), then ran `get that_path` on it to confirm the data inside the node. | 18:35 |
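A minimal sketch of that zk-shell workflow, assuming the venv location clarkb mentions and an illustrative zookeeper endpoint; the znode path is the one quoted above:

```bash
# connect with zk-shell from the venv noted above (host:port is illustrative)
~/venv/bin/zk-shell zk01.opendev.org:2181

# inside the zk-shell prompt: navigate, confirm the payload, then remove the znode
ls /nodepool/images/fedora-36/builds
cd /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images
get 0000000001   # should reference fedora-36-1662540204 before deleting anything
rmr /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001
```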
clarkb | ah | 18:35 |
clarkb | for inmotion we have ssh keys on the servers (I can check that yours is there) and we log in; it's a kolla setup. The kolla vars give us account details and there is an openrc to source if you use the cli tools | 18:36 |
clarkb | Usually what I do there is log in, source the appropriate admin bits, then start poking around and learning because I'm not a real openstack admin :) | 18:37 |
clarkb | in this case I suspect it will be doing openstack image/glance commands as admin to delete the image | 18:37 |
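A rough sketch of that flow; the controller address and the openrc path are assumptions based on typical kolla-ansible layouts, not confirmed for the inmotion deployment:

```bash
# ssh to a controller (address shared out of band), then load admin credentials
ssh root@<inmotion-controller>
source /etc/kolla/admin-openrc.sh   # typical kolla-ansible location; may differ here

# list and delete the leaked image as admin
openstack image list --private | grep fedora
openstack image delete <image-uuid>
```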
clarkb | fungi: I have not deleted that znode yet. I'm going to eat lunch soon but maybe you can look and see if it seems correct, then we can remove it | 18:38 |
frickler | well kolla is daily business for me, so if I can login, I hope that should be manageable | 18:38 |
clarkb | frickler: cool I see you aren't in the authorized keys list yet. I'll add you to the servers (this is openstack as a service so outside our normal ansible) and PM you the IP list | 18:39 |
clarkb | I'll use the same key that you have in system-config | 18:39 |
frickler | that should work, thx | 18:40 |
fungi | okay, back sorry | 19:12 |
fungi | took a few minutes longer than i projected | 19:12 |
fungi | i agree nodepool is looking no worse after the label removal | 19:13 |
clarkb | fungi: I think the main thing is confirming that znode should be deleted then deleting it. Then we can merge the second change to remove the diskimage config | 19:14 |
fungi | yep, looking now | 19:15 |
fungi | zk-shell json_cat that znode does indeed indicate that it's trying to delete an image called fedora-36-1662540204 | 19:18 |
fungi | so i'm good with manually removing that | 19:18 |
clarkb | fungi: ok do you want to do it or should I? | 19:19 |
fungi | i'm happy to | 19:19 |
clarkb | I think once that single znode is removed nodepool should clean up the other znodes related to that image? | 19:19 |
clarkb | go for it | 19:20 |
clarkb | unless we want corvus to weigh in first | 19:20 |
clarkb | I suppose there is some risk we break locking or something | 19:20 |
fungi | i did /zk-shell rmr /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 | 19:20 |
fungi | fingers crossed that didn't break anything | 19:21 |
clarkb | it's probably fine | 19:21 |
clarkb | I seem to recall doing this in the past for the same reason | 19:21 |
fungi | json_cat says "Path /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 doesn't exist" so it's definitely gone now | 19:21 |
clarkb | fungi: what about ls /nodepool/images/fedora-36/builds/0000000022 | 19:22 |
clarkb | since that's the bit that should go away once nodepool cleans up the image as a whole | 19:22 |
clarkb | fwiw nl02 does the launcher for inmotion and seems to be running happily | 19:22 |
fungi | "Path /nodepool/images/fedora-36/builds/0000000022 doesn't exist" | 19:22 |
clarkb | perfect that is what we want | 19:22 |
fungi | `zk-shell ls /nodepool/images/fedora-36/builds/` returns two uuids and a lock | 19:23 |
clarkb | ya I think those uuids may be really old? | 19:23 |
clarkb | they don't seem to hurt anything and image-list shows no images | 19:24 |
clarkb | I think we can proceed with the next change | 19:24 |
fungi | /nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/ has most of our provider regions listed but may just be cruft | 19:24 |
clarkb | fungi: I think fedora-34 etc still have entries at /nodepool/images/fedora-34 too | 19:25 |
clarkb | but we haven't had those images in a while. I think this may just be stuff nodepool doesn't fully clear out? | 19:25 |
fungi | but /nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/inmotion-iad3/images/ just has a lock in it, apparently | 19:25 |
fungi | taken as an example | 19:25 |
fungi | yeah, we even have fedora-31 there still | 19:26 |
clarkb | should we approve https://review.opendev.org/c/openstack/project-config/+/893964/ ? | 19:31 |
fungi | yeah, i think that's safe | 19:37 |
fungi | clarkb: how about 894382? i'm around to watch it | 19:38 |
clarkb | fungi: ya I think we can approve that one too | 19:38 |
clarkb | I'm around as well | 19:38 |
fungi | done | 19:38 |
opendevreview | Merged openstack/project-config master: Remove fedora image builds https://review.opendev.org/c/openstack/project-config/+/893964 | 19:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Cleanup the Fedora 36 mirror content https://review.opendev.org/c/opendev/system-config/+/894575 | 20:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove ara from source install option https://review.opendev.org/c/opendev/system-config/+/894576 | 20:05 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove ara from Zuul config https://review.opendev.org/c/openstack/project-config/+/894577 | 20:08 |
clarkb | I don't actually know if released ara can work with dev ansible, which may be why that is done | 20:08 |
clarkb | but I figure we can remove it anyway and if it is a problem we can install from source using a shallow clone or something along those lines (won't have depends-on integration but I don't think we need that for ara) | 20:08 |
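If installing ara from source does turn out to be needed, a shallow clone along the lines clarkb describes could look like this; the repo URL and default branch are assumptions to double-check against ara's docs:

```bash
# shallow clone keeps the fetch small since we don't need history or depends-on
git clone --depth 1 https://github.com/ansible-community/ara.git
pip install ./ara
```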
clarkb | there are no fedora disk images listed by nodepool dib-image-list now as well. I think that clean up is happy | 20:14 |
fungi | yeah, that looks right to me | 20:15 |
fungi | #status log Requested delisting for lists.katacontainers.io IPv4 address from SpamHaus PBL | 21:04 |
opendevstatus | fungi: finished logging | 21:04 |
opendevreview | Merged opendev/system-config master: Update to gitea 1.20.4 https://review.opendev.org/c/opendev/system-config/+/894382 | 21:15 |
fungi | watching for the deploy now | 21:16 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config is the url to watch and then in order up to gitea14 | 21:17 |
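A quick way to confirm each backend is serving again, using the same URL pattern clarkb mentions; this is just a status-code probe and does not verify the gitea version itself:

```bash
# probe each gitea backend on the 3081 port mentioned above
for n in 09 10 11 12 13 14; do
  curl -sI -o /dev/null -w "gitea$n: %{http_code}\n" \
    "https://gitea$n.opendev.org:3081/opendev/system-config"
done
```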
clarkb | and the first one (gitea09) is done | 21:21 |
clarkb | looks good at first glance | 21:21 |
clarkb | all are done now and look good | 21:34 |
clarkb | the deployment job reported success as well | 21:35 |
clarkb | does anyone know if we've got a change to set the nodepool image upload timeout now that it is configurable? | 21:38 |
clarkb | I've just made updates to the team meeting agenda. Please add anything that is missing and I'll send that out later today | 21:39 |
fungi | yeah, they're working for me | 21:44 |
clarkb | I've manually cleaned up fedora images across rax, ovh, inmotion, and vexxhost regions. There were no arm64 fedora images | 21:56 |
clarkb | there are three images that I couldn't remove. Two in ovh gra1 that are in a deleted state so can't transition to a deleting state and the one in inmotion that we identified earlier | 21:57 |
clarkb | the one in inmotion appears to fail because glance says it is in use so we may have a leaked node that we need to cleanup too | 21:57 |
clarkb | frickler: ^ fyi. Fwiw that node doesn't show up in nodepool listings so we should be able to more forcefully remove it, and then the image, on the inmotion side | 21:58 |
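A sketch of how one might chase down that in-use image on the inmotion side with admin credentials; the UUIDs are placeholders, and the server filter is just one way to find the suspected leaked node:

```bash
# confirm the stuck image and its status
openstack image show <stuck-image-uuid> -c id -c status

# look for any instance (in any project) still booted from that image
openstack server list --all-projects --image <stuck-image-uuid>

# once the leaked server is gone, the image delete should succeed
openstack server delete <leaked-server-uuid>
openstack image delete <stuck-image-uuid>
```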
fungi | would be great if glance had something like a "flag for cleanup" option so that images used for bfv could be automatically deleted once their reference count drops to 0 | 22:02 |
fungi | and then some way of indicating in image listings that the image will be cleaned up as soon as it is no longer in use | 22:02 |
clarkb | it would also be cool if openstack grew the idea of applying alerts to resources from the user side | 22:09 |
clarkb | then instead of needing to file a ticket nodepool could, after say 10 failed attempts to do $X, apply an alert to the resource and then the cloud could sweep through them periodically and take appropriate action | 22:10 |
clarkb | openstack server alert foo. then the cloud looks at foo, sees it is in a deleting state with 10 failed deletion requests, and goes ahead and makes that happen somehow | 22:10 |
JayF | clarkb: we had a downstream patch at [former purple employer] to nova, called 'breakfix'. If you issued a 'breakfix' against your instance, it filed a ticket in our systems to fix it and maintenance'd the underlying Ironic node with the reason you provided | 22:13 |
JayF | clarkb: so there is absolutely an audience for that kind of feedback mechanism | 22:13 |
JayF | clarkb: that's in the same spirit of how Ironic is starting to hook up project information to allow folks to self-serve some maintenance tasks from Ironic's (formerly admin-only) API | 22:13 |
clarkb | ya I think the tricky bit in designing that would be coming up with something general enough that it can reasonably and effectively tie into remediation systems that already exist within orgs | 22:15 |
clarkb | maybe that is as simple as a flag on the resource then you can do an api query or sql query to generate a list | 22:15 |
JayF | Arguably we already have the plumbing for this to be done as a sidecar in oslo.messaging notifications support | 22:17 |
JayF | we used that extensively at a couple of places to do reporting and failure detection | 22:17 |
fungi | a big counterargument to this is: these are cases where openstack has broken down, users shouldn't have to inform the cloud's operations team of that | 22:21 |
JayF | fungi: that's more of an argument for the notification-based approach | 22:22 |
fungi | images or servers in deleting+error state are quite clearly broken | 22:22 |
JayF | fungi: in either event, I think there's "space" for a sidecar project to try and help manage operations of OpenStack; I know because we've built one independently literally everywhere I've worked that's run it | 22:22 |
fungi | the user has asked to delete something, the cloud didn't refuse the delete request and rather went into an error state when the deletion failed | 22:22 |
JayF | but the question is basically whether or not such a sidecar project can be generally useful, or if they are useful *because* they were bespoke with business logic baked in | 22:23 |
fungi | yeah, that i don't know. it's more that shit clearly broke, waiting for users to tell ops that something broke is sort of backwards | 22:24 |
fungi | particularly when it's me opening a ticket that says something like "your api is telling me this broke, can you please do something to unbreak it" | 22:25 |
fungi | ideally the services would just notify them directly of these things and not have to wait for the user to pass that message along | 22:26 |
fungi | or, better still, not break. but i know that's probably asking a lot ;) | 22:27 |
JayF | > fungi | ideally the services would just notify them directly of these things and not have to wait for the user to pass that message along | 22:28 |
JayF | this is not the hard part | 22:28 |
JayF | we have those hooks *today* with oslo.messaging notifications | 22:28 |
fungi | it's one thing when it's our community infrastructure we don't actively monitor and volunteers patch up problems on a best effort basis, but something else entirely when it's a commercial product and their paying customers are having to reach out to let them know about a problem the software should have told them about before the customer even noticed | 22:28 |
JayF | the hard part is helping the cloud operator sort through those notifications and surface the 'real problems' over the noise | 22:29 |
fungi | yeah, makes sense | 22:29 |
JayF | it's just not clear to me (yet) if there is enough commonality to actually attack that problem versus it being a shop-to-shop sorta thing | 22:29 |
JayF | because I can tell you, what that looked like at Rackspace didn't look like it does at Yahoo, and neither looks like it does downstream here | 22:29 |
fungi | sure, i get that | 22:30 |
fungi | no two deployments are the same | 22:30 |
fungi | "there is no such thing as vanilla openstack" | 22:30 |
JayF | I think it's even deeper than that | 22:31 |
JayF | failure tolerance is a good example; in some use cases failures are "tolerated" by just papering over them with more infrastructure/redundancy elsewhere | 22:32 |
JayF | some places have a strong sense of "this has to run here" and try to enforce that even when we try to disallow it | 22:32 |
JayF | e.g. at Rackspace, a single provisioned ironic node going down was a big deal as an outage for a customer; but at an HPC shop it might be noise unless the failures get over a certain % | 22:32 |
fungi | i'm thinking more in terms of flagged error states for unusable-but-undeletable resources occupying the customer's quota | 22:33 |
fungi | if the user tells the cloud to delete something, and then the deletion fails and the resource remains in an indefinite "deleting" state, i'm not sure what decisions there are for the user to make at that point other than wait for the ops to notice or open a ticket asking they fix whatever resulted in the error state so that the resource deletion can proceed | 22:36 |
fungi | maybe that's an uncommon corner case, but it seems to happen a lot for us (with servers, images, sometimes fips or networks) | 22:36 |
JayF | That is a particularly painful case for many (and has some security-related badness around it), but I think that solving the general case of surfacing anomalies has value more so than pointing at obvious broken cases | 22:37 |
clarkb | fungi: ya I agree that the ideal state is that customers don't need to be pushing it, but maybe that sort of signal is a compromise between reality and the ideal | 22:50 |
JayF | clarkb: fungi: one thing that pushes even more in that direction is that not all failure cases are avoidable or fixable by openstack-the-software (e.g. network failures in a portion of the datacenter) .... but they still tend, IME, to be *blamed* on openstack-the-software | 22:52 |
JayF | I tell folks who operate OpenStack at scale that OpenStack becomes the messenger for all bad news in your environment. Unless you are doing perfect operational monitoring, you will find a large number of outages will be first seen by a user in an openstack error message. | 22:52 |
JayF | Pushing back against the negative perceptions that can create is difficult, too. | 22:53 |
fungi | yeah, it just makes it more likely for users to blame the software if they have to report persistent error states to the operators | 22:53 |
fungi | well, either blame the software or blame the people operating it, anyway | 22:54 |
clarkb | maybe openstack should provide tools for monitoring these unexpected state changes | 22:54 |
clarkb | kinda like what we did with the logstash rules and queries once upon a time | 22:54 |
clarkb | that is still being done but the idea you could run the same set of queries against real clouds has largely died out I think | 22:55 |
opendevreview | Goutham Pacha Ravi proposed openstack/project-config master: Add manila-core to osc/sdk repo config https://review.opendev.org/c/openstack/project-config/+/894605 | 23:21 |
clarkb | for ^ I wonder if we shouldn't make a new group called openstacksdk-reviewers, give that group +/-2 and then that group can add subgroups as necessary | 23:25 |
clarkb | then we don't need to be gatekeepers of those changes for sdks | 23:25 |
fungi | that was the original suggestion, but gtema expressed a preference for there being an audit log in public git | 23:48 |
fungi | and said something like "there won't be that many" | 23:49 |