Monday, 2023-09-11

*** dtantsur_ is now known as dtantsur		00:08
opendevreview	Dr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking https://review.opendev.org/c/openstack/project-config/+/894285	05:03
frickler	gtema: ^^ fyi amended to cover all repos, which I assume is what you intended?	05:04
frickler	I also noticed that ansible-collections-openstack formally still belongs to the ansible sig, which iiuc has effectively disbanded. would it make sense to move it to sdks proper?	05:05
opendevreview	Dr. Jens Harbott proposed openstack/project-config master: sdk/osc: Rollback to LaunchPad for issuetracking https://review.opendev.org/c/openstack/project-config/+/894285	05:26
gtema	Fricker. Thanks. Wrt ansible-collection-openstack: I guess yes, the move makes sense	05:46
opendevreview	Martin Magr proposed openstack/project-config master: Add python-observabilityclient https://review.opendev.org/c/openstack/project-config/+/894541	13:47
clarkb	fungi: midday ish my time may be good for https://review.opendev.org/c/opendev/system-config/+/894382 to update gitea to the latest version? I have a hopefully short without pupil dilation optometrist visit in about an hour so I'm thinking after that	15:16
clarkb	fungi: also today is the day we said we would do https://review.opendev.org/c/openstack/project-config/+/893963/1 and child to clean up fedora	15:19
clarkb	maybe start with those if you have a moment to review them ?	15:20
opendevreview	Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced pojects https://review.opendev.org/c/zuul/zuul-jobs/+/887917	15:59
opendevreview	Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced pojects https://review.opendev.org/c/zuul/zuul-jobs/+/887917	16:33
fungi	clarkb: yep, sounds good. taking a look shortly	16:47
clarkb	I'm back from getting my eyeballs examined. Happy to keep an eye on any of those three changes if they get approved	17:34
clarkb	or address review comments if changes need to be made	17:34
opendevreview	Merged openstack/project-config master: Remove fedora-35 and fedora-36 from nodepool providers https://review.opendev.org/c/openstack/project-config/+/893963	17:34
clarkb	we should approve https://review.opendev.org/c/openstack/project-config/+/893964/ after ^ looks good though	17:34
fungi	i need to pick up a few things from the hardware store and grab lunch while i'm out, but can help test new gitea in a bit once i'm back (hour-ish) if that works?	17:39
fungi	that'll probably also be enough time to know if we're ready to proceed with the fedora image removal step	17:40
fungi	okay, headed out, back in about an hour	17:41
clarkb	sounds good	17:55
clarkb	sorry I got distracted by stuff around the house btu I'll not going anywhere so that plan sounds good	17:55
clarkb	I think 893963 has applied and nodepool is continuing to run happily with the new config. There are no fedora nodes either. Now to check if the images have been cleaned up from the cloud providers	18:25
clarkb	on the nodepool side of things there is a single inmotion fedora image that appears to be failing to delete. I think we can probably just remove that image from the zk db and then figure out cleaning it up from the cloud another time	18:27
clarkb	spot checking rax regions there are a handful of fedora images that show up there still that aren't in nodepool	18:28
clarkb	so ya I think next step is clean up fedora-36-1662540204 for inmotion on the nodepool side then we can merge the next change safely	18:28
clarkb	then we can do manual cleanups of any remaining fedora images	18:28
frickler	inmotion had a big bunch of very old images failing to delete last time I looked, maybe check those, too, when you're done with fedora	18:31
clarkb	ack	18:32
clarkb	chances are we have to login as admins and forcefully delete some thinsg then nodepool will notice they are gone	18:32
clarkb	/nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 <- that appears to be the znode to remove from the zk db	18:33
frickler	if you tell me how I can do that I can have a look tomorrow	18:33
frickler	nodepool image-list\|grep deleting\|wc => 19, all in inmotion, some > 1y old	18:33
clarkb	frickler: are you interested in the zookeeper bit or the inmotion thing? For zookeeper you login to one of the three nodes and then use the zk-shell tool (I have it installed in a venv called venv in my homedir) to connect to the zk server. Then you can use simply commands like ls,cd,get,rm to manipulate the db	18:35
frickler	I was talking about inmotion, sorry for the overlap	18:35
clarkb	in this case what I've done is ls and cd around to find what looks like the correct node finding the path above. Then ran `get that_path` on it to confirm the data inside the node.	18:35
clarkb	ah	18:35
clarkb	for inmotion we have ssh keys on the servers (I can check that yours is there) and we login then its a kolla setup. The kolla vars give us account details and there is an openrc to source if you use the cli tools	18:36
clarkb	Usually what I do there is login, source the appropriate admin bits then start poking around and learning bceause I'm not a real opesntack admin :)	18:37
clarkb	in this case I suspect it will be doing openstack image/glance commands as admin to delete the image	18:37
clarkb	fungi: I have not deleted that znode yet. I'm going to eat lunch soon but maybe you can look to see it seems correct then we can remove it	18:38
frickler	well kolla is daily business for me, so if I can login, I hope that should be manageable	18:38
clarkb	frickler: cool I see you aren't in the authorized keys list yet. I'll add you to the servers (this is openstack as a service so outside our normal ansible) and PM you the IP list	18:39
clarkb	I'll use the same key that you have in system-config	18:39
frickler	that should work, thx	18:40
fungi	okay, back sorry	19:12
fungi	took a few minutes longer than i projected	19:12
fungi	i agree nodepool is looking no worse after then label removal	19:13
clarkb	fungi: I think the main thing is confirming that znode should be deleted then deleting it. Then we can merge the second chagne to remove teh diskimage config	19:14
fungi	yep, looking now	19:15
fungi	zk-shell json_cat that znode does indeed indicate that it's trying to delete an image called fedora-36-1662540204	19:18
fungi	so i'm good with manually removing that	19:18
clarkb	fungi: ok do you want to do it or should I?	19:19
fungi	i'm happy to	19:19
clarkb	I think once that single znode is removed nodepool should clean up the other znodes related to that image?	19:19
clarkb	got for it	19:20
clarkb	unless we want corvus to weigh in first	19:20
clarkb	I suppose there is some risk we break locking or something	19:20
fungi	i did /zk-shell rmr /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001	19:20
fungi	fingers crossed that didn't break anything	19:21
clarkb	its probably fine	19:21
clarkb	I seem to recall doing this in the past for the same reason	19:21
fungi	json_cat says "Path /nodepool/images/fedora-36/builds/0000000022/providers/inmotion-iad3/images/0000000001 doesn't exist" so it's definitely gone now	19:21
clarkb	fungi: what about ls /nodepool/images/fedora-36/builds/0000000022	19:22
clarkb	since thats the bit taht should go away ocne nodepool cleans up the image as a whole	19:22
clarkb	fwiw nl02 does the launcher for inmotion and seems to be running happily	19:22
fungi	"Path /nodepool/images/fedora-36/builds/0000000022 doesn't exist"	19:22
clarkb	perfect that is what we want	19:22
fungi	`zk-shell ls /nodepool/images/fedora-36/builds/` returns two uuids and a lock	19:23
clarkb	ya I think those uuids may be really old?	19:23
clarkb	they don't seem to hurt anything and image-list shows no images	19:24
clarkb	I think we can proceed with the next change	19:24
fungi	/nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/ has most of our provider regions listed but may just be cruft	19:24
clarkb	fungi: I think fedora-34 etc still have entries at /nodepool/images/fedora-34 too	19:25
clarkb	but we haven't had those images in a while. I think this may just be stuff nodepool doesn't fully claer out?	19:25
fungi	but /nodepool/images/fedora-36/builds/46e131aac17540bfa3b16945bfaeb72e/providers/inmotion-iad3/images/ just has a lock in it, apparently	19:25
fungi	taken as an example	19:25
fungi	yeah, we even have fedora-31 there still	19:26
clarkb	should we approve https://review.opendev.org/c/openstack/project-config/+/893964/ ?	19:31
fungi	yeah, i think that's safe	19:37
fungi	clarkb: how about 894382? i'm around to watch it	19:38
clarkb	fungi: ya I Thinkwe can approve that one too	19:38
clarkb	I'm around as well	19:38
fungi	done	19:38
opendevreview	Merged openstack/project-config master: Remove fedora image builds https://review.opendev.org/c/openstack/project-config/+/893964	19:49
opendevreview	Clark Boylan proposed opendev/system-config master: Cleanup the Fedora 36 mirror content https://review.opendev.org/c/opendev/system-config/+/894575	20:00
opendevreview	Clark Boylan proposed opendev/system-config master: Remove ara from source install option https://review.opendev.org/c/opendev/system-config/+/894576	20:05
opendevreview	Clark Boylan proposed openstack/project-config master: Remove ara from Zuul config https://review.opendev.org/c/openstack/project-config/+/894577	20:08
clarkb	I don't actually know if released ara can work with dev ansible which may be why that is done	20:08
clarkb	but I figure we can remoev it anyway and if it is a problem we can install from source using a shallow clone or something along those lines (won't have depends-on integration but I don't think we need that for ara)	20:08
clarkb	there are no fedora disk images listed by nodepool dib-image-list now as well. I think that clean up is happy	20:14
fungi	yeah, that looks right to me	20:15
fungi	#status log Requested delisting for lists.katacontainers.io IPv4 address from SpamHaus PBL	21:04
opendevstatus	fungi: finished logging	21:04
opendevreview	Merged opendev/system-config master: Update to gitea 1.20.4 https://review.opendev.org/c/opendev/system-config/+/894382	21:15
fungi	watching for the deploy now	21:16
clarkb	https://gitea09.opendev.org:3081/opendev/system-config is the url to watch and then in order up to gitea14	21:17
clarkb	and the first one (gitea09) is done	21:21
clarkb	looks good at first glance	21:21
clarkb	all are done now and look good	21:34
clarkb	the deployment job reported success as well	21:35
clarkb	does anyone know if we've got a change to set the nodepool image upload timeout now that it is configurable?	21:38
clarkb	I've just made updates to the team meeting agenda. Please add anything that is missing and I'll send that out later today	21:39
fungi	yeah, they're working for me	21:44
clarkb	I've manually cleaned up fedora images across rax, ovh, inmotion, and vexxhost regions. There were no arm64 fedora images	21:56
clarkb	there are three images that I couldn't remove. Two in ovh gra1 that are in a deleted state so can't transition to a deleting state and the one in inmotion that we identified earlier	21:57
clarkb	the one in inmotion appears to fail because glance says it is in use so we may have a leaked node that we need to cleanup too	21:57
clarkb	frickler: ^ fyi. Fwiw that node doesn't show up in nodepool listings so we should be able to more forcefully remove it then the image on the inmotion side	21:58
fungi	would be great if glance had something like a "flag for cleanup" option so that images used for bfv could be automatically deleted once their reference count drops to 0	22:02
fungi	and then some way of indicating in image listings that the image will be cleaned up as soon as it is no longer in use	22:02
clarkb	it would also be cool if openstack grew the idea of applying alerts to resources from the user side	22:09
clarkb	then instead of needign to file a ticket nodepool could after say 10 failed attempts to do $X apply an alert to the resource and then the cloud could sweep through them periodically and take appropriate action	22:10
clarkb	openstack server alert foo. then cloud looks at foo and sees it is in a deleting state and 10 deletion requests have been made that all failed so go ahead and make that happen somehow	22:10
JayF	clarkb: we had a downstream patch at [former purple employer] to nova, called 'breakfix'. If you issued a 'breakfix' against your instance, it filed a ticket in our systems to fix it and maintenance'd the underlying Ironic node with the reason you provided	22:13
JayF	clarkb: so there is absolutely an audience for that kind of feedback mechanism	22:13
JayF	clarkb: that's in the same spirit of how Ironic is starting to hook up project information to allow folks to self-serve some maintenance tasks from Ironic's (formerly admin-only) API	22:13
clarkb	ya I think the tricky bit in designing that would be coming up with something general enough taht it can reasonable and effectively tie into remediation systems that already exist within orgs	22:15
clarkb	maybe that is as simple as a flag on the resource then you can do an api query or sql query to generate a list	22:15
JayF	Arguably we already have the plumbing for this to be done as a sidecar in oslo.messaging notifications support	22:17
JayF	we used that extensively at a couple of places to do reporting and failure detection	22:17
fungi	a big counterargument to this is: these are cases where openstack has broken down, users shouldn't have to inform the cloud's operations team of that	22:21
JayF	fungi: that's more of an argument for the notification-based approach	22:22
fungi	images or servers in deleting+error state are quite clearly broken	22:22
JayF	fungi: in either event, I think there's "space" for a sidecar project to try and help manage operations of OpenStack; I know because we've built one indepedently literally everywhere I've worked that's run it	22:22
fungi	the user has asked to delete something, the cloud didn't refuse the delete request and rather went into an error state when the deletion failed	22:22
JayF	but the question is basically whether or not such a sidecar project can be generally useful, or if they are useful because they were bespoke with business logic baked in	22:23
fungi	yeah, that i don't know. it's more that shit clearly broke, waiting for users to tell ops that something broke is sort of backwards	22:24
fungi	particularly when it's me opening a ticket that says something like "your api is telling me this broke, can you please do something to unbreak it"	22:25
fungi	ideally the services would just notify them directly of these things and not have to wait for the user to pass that message along	22:26
fungi	or, better still, not break. but i know that's probably asking a lot ;)	22:27
JayF	> fungi \| ideally the services would just notify them directly of these things and not have to wait for the user to pass that message along	22:28
JayF	this is not the hard part	22:28
JayF	we have those hooks today with oslo.messaging notifications	22:28
fungi	it's one thing when it's our community infrastructure we don't actively monitor and volunteers patch up problems on a best effort basis, but something else entirely when it's a commercial product and their paying customers are having to reach out to let them know about a problem the software should have told them about before the customer even noticed	22:28
JayF	the hard part is helping the cloud operator sort through those notifications and surface the 'real problems' over the noise	22:29
fungi	yeah, makes sense	22:29
JayF	it's just not clear to me (yet) if there is enough commonality to actually attack that problem versus it being a shop-to-shop sorta thing	22:29
JayF	because I can tell you, what that looked like at Rackspace didn't look like it does at Yahoo neither look like it does downstream here	22:29
fungi	sure, i get that	22:30
fungi	no two deployments are the same	22:30
fungi	"there is no such thing as vanilla openstack"	22:30
JayF	I think it's even deeper than that	22:31
JayF	failure tolerance is a good example; some use cases failures are "tolerated" by just papering over it with more infrasturcutre/redundancy elsewhere	22:32
JayF	some places have a strong sense of "this has to run here" and try to enforce that even when we try to disallow i	22:32
JayF	*it	22:32
JayF	e.g. at Rackspace, a single provisioned ironic node going down was a big deal as an outage for a customer; but at an HPC shop it might be noise unless the failures get over a certain %	22:32
fungi	i'm thinking more in terms of flagged error states for unusable-but-undeletable resources occupying the customer's quota	22:33
fungi	if the user tells the cloud to delete something, and then the deletion fails and the resource remains in an indefinite "deleting" state, i'm not sure what decisions there are for the user to make at that point other than wait for the ops to notice or open a ticket asking they fix whatever resulted in the error state so that the resource deletion can proceed	22:36
fungi	maybe that's an uncommon corner case, but it seems to happen a lot for us (with servers, images, sometimes fips or networks)	22:36
JayF	That is a particularly painful case for many (and has some security-related badness around it), I think that solving the general case of surfacing anomolies has value moreso than pointing at obvious broken cases	22:37
clarkb	fungi: ya I agree that the idea lstate is that customers don't need to be pushing it but maybe that sort of signal is a compromise between realtiy and ideal	22:50
JayF	clarkb: fungi: one thing that pushes even more in that direction is that not all failure cases are avoidable or fixable by openstack-the-software (e.g. network failures in a portion of the datacenter) .... but they still tend, IME, to be blamed on openstack-the-software	22:52
JayF	I tell folks who operate OpenStack at scale that OpenStack becomes the messenger for all bad news in your environment. Unless you are doing perfect operational monitoring, you will find a large number of outages will be first seen by a user in an openstack error message.	22:52
JayF	Pushing back against the negative perceptions that can create is difficult, too.	22:53
fungi	yeah, it just makes it more likely for users to blame the software if they have to report persistent error states to the operators	22:53
fungi	well, either blame the software or blame the people operating it, anyway	22:54
clarkb	maybe openstack should provide tools for monitoring these unexpected state changes	22:54
clarkb	kinda like what we did with the logstash rules and queries once upon a time	22:54
clarkb	that is still being done but the idea you could run the same set of queries against real clouds has largely died out I think	22:55
opendevreview	Goutham Pacha Ravi proposed openstack/project-config master: Add manila-core to osc/sdk repo config https://review.opendev.org/c/openstack/project-config/+/894605	23:21
clarkb	for ^ I wonder if we shouldn't make a new group called openstacksdk-reviewers, give that group +/-2 and then that group can add subgroups as necessary	23:25
clarkb	then we don't need to be gatekeepers of those changes for sdks	23:25
fungi	that was the original suggestion, but gtema expressed a preference for there being an audit log in public git	23:48
fungi	and said something like "there won't be that many"	23:49

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!