frickler | ok, trixie build was successful, but it looks like we are overloading the poor arm environment when all builds are triggered in parallel, maybe we need to limit these using a semaphore? https://zuul.opendev.org/t/opendev/buildset/d2f7e3a55f7f415da144bb625df9c3ff | 05:51 |
frickler | btw. was there any update regarding the future of osuosl? | 05:51 |
frickler | looks like we have been having similar issues with the periodic builds for the past couple of days. Ramereth[m] are there any known issues with IO performance? https://zuul.opendev.org/t/opendev/buildset/4d8e2687b5f44637b342d4296bbbfce1 | 06:02 |
mnasiadka | frickler: I see you were faster - I wanted to ask about that since my centos10 builds on aarch64 are timing out | 06:18 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 06:39 |
frickler | clarkb: there's multiple things, like much less efficient use of screenspace in the browser, lack of flexible ordering, lack of shortcuts to directly access specific rooms, lack of greppable logging. also what fungi calls the "media-rich experience" is something I'm not comfortable with either | 06:55 |
mnasiadka | frickler: I assume there are no irssi-like matrix clients? (I've seen there's a 3rd party plugin for irssi but it's dead for a year or so and very limited) | 06:59 |
frickler | mnasiadka: fungi mentioned a weechat plugin, but that also seems to have limited functionality. also I do need the graphical environment for downstream usage and I also don't want to leave irssi for IRC, so adding yet another client? ... meh | 07:01 |
tonyb | clarkb: Re Matrix, I assume you and corvus have a good feel for the costs of hosting rooms on matrix, I did some research and it all boiled down to "call us". I did find: https://discussion.fedoraproject.org/t/fedora-council-tickets-ticket-487-budget-request-price-increase-for-hosted-matrix-chat-fp-o-by-element-matrix-services-ems/111759 which indicates that Fedora pays USD 25k/year | 07:20 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 08:05 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 09:01 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 09:42 |
fungi | frickler: tonyb: yeah, the weechat plugin i've been using is https://github.com/poljar/weechat-matrix-rs but for example it doesn't support session verification yet nor even joining rooms (so i have to join rooms with element instead), but it can send and receive messages and other basics | 12:24 |
fungi | frickler: osuosl got sufficient funding, but can always use more: https://lists.osuosl.org/pipermail/hosting/2025-May/000644.html | 12:30 |
fungi | i have some errands to run this morning, will probably be gone for an hour, maybe a little longer... back as soon as i can | 12:34 |
*** ralonsoh_ is now known as ralonsoh | 12:36 |
fungi | okay, that went quicker than anticipated | 13:36 |
fungi | #status log Renamed and hid the unused kolla-ansible-core group in Gerrit, per discussion in #openstack-kolla IRC | 13:46 |
opendevstatus | fungi: finished logging | 13:46 |
corvus | tonyb: since matrix is federated, the cost to do anything with it can be as low as zero. there are public homeservers like matrix.org (and others) where anyone can get an account for free, and anyone can create a room. opendev has a homeserver so that we can create rooms with names like "#zuul:opendev.org" to give them some sense of officiality. but because we're not big on user management, we don't host any users there (except our bots). | 13:51 |
corvus | the cost is not great, and helps support the open source ecosystem around the software. there are other low-cost providers who could provide that service as well, or we could do it ourselves on a volunteer basis. adding more rooms to our homeserver has zero marginal cost. iirc, fedora does provide user account services, so their cost is higher. | 13:51 |
corvus | regarding the element user interface: i create groups of rooms (element/matrix call them "spaces") for easy access to the ones i need most under different circumstances. i have spaces like "services", "zuul", "ansible". it's easy to click one of those spaces to find an old room i haven't used in a while. also, there is full-text search across all rooms which i have found quite helpful. it's certainly different, but at least in my use, i | 13:59 |
corvus | haven't found it deficient. | 13:59 |
Ramereth[m] | <frickler> "looks like we are having similar..." <- I have been adding more storage and making some other adjustments. It's likely related to that. I will see if we have an OSD causing the issues however. Is it still happening as of this morning? | 14:40 |
clarkb | mnasiadka: ^ you probably have better insight into ongoing issues since the opendev image builds are more like ~daily | 14:48 |
frickler | I can do a recheck on my patch to get some fresh data | 14:49 |
mnasiadka | Ramereth[m]: yes, it has been happening since my morning (so around 8 hours ago) - see https://zuul.opendev.org/t/openstack/build/5b22f318879a4448b2428cb0603fde71 | 14:50 |
clarkb | frickler: re matrix client pain I do agree on some of that. I prefer weechat's UX for sure, though I've managed to configure element into a state where I'm reasonably happy with it (you can set irc mode for better density and then disable previews for urls). The two things I'm trying to consider are that we seem to have better engagement in places like zuul that moved than we did in | 14:51 |
clarkb | the past. And for IRC a significant number of individuals seem to use matrix bridges which have become flaky recently due to lack of funding. In my mind it seems like we can better serve our existing audience if non bridged IRC users are willing to compromise and also potentially become more accessible via the richer web client environment matrix provides at the same time for people | 14:51 |
clarkb | who want that | 14:51 |
clarkb | really that's the balance I'm trying to achieve here. I'm hoping that IRC users are willing to compromise given that matrix preserves the federated network design and open source aspects of the communication system while not alienating people who are currently not on irc via direct connections | 15:57 |
frickler | well the arguments are all well known, but to me it seems difficult to ask volunteers that enjoy hanging around here in their spare time to "compromise" and move somewhere much less enjoyable | 15:13 |
frickler | I also still question how federated the network implementation really is in practice (in terms of user count) and the lack of open development of the matrix ecosystem | 15:14 |
JayF | If I have to stay connected to Matrix all day, I go from having to have 3 chat apps running during my day job to 4. It's not a matter of philosophical feelings about one or the other, it's a matter of the list seemingly growing unbounded; even if OSS/OpenDev/OpenStack would only represent 2 of those | 15:35 |
fungi | i do manage to connect to several irc networks, a multitude of slack instances, and matrix all from a single terminal-based client, but it's complex and sacrifices on some features (granted mostly on features i neither need nor want) | 15:39 |
JayF | I have put in the feature request a number of times for IRCCloud to support matrix, and they have no interest whatsoever :( | 15:41 |
clarkb | right we're between a rock and a hard place with people who rely on matrix to connect to irc (which is a seemingly large number) and those who want to avoid another system entirely. I sympathize with wanting to have volunteer time be easy and fit within personal preferences for tooling (I really dislike being on discord for gerrit for example). But we also offer services here and | 15:41 |
clarkb | ideally are reachable for our users | 15:41 |
clarkb | if we don't do something (regardless of what that choice is among the options I identified earlier) we're very likely going to have disjoint and confusing communications in the near future | 15:42 |
corvus | we've gotten some pretty consistent feedback that irc isn't a great experience for many new contributors, and for those of us for whom it is a great experience, it is made so with a lot of effort that is not easy for new users to put in. all of that (persistent sessions, authentication, etc) comes out of the box with matrix. | 15:48 |
corvus | it can federate to almost any other open system, which could actually help reduce the app fragmentation that's out there | 15:49 |
corvus | i think it's worth trying to find a place where everyone can communicate together, rather than accept the split of our communities into irc vs proprietary walled gardens | 15:50 |
corvus | so, yeah, i decided that for me, it was worth giving up my highly tuned irssi config to switch to matrix to try to walk the walk on universal communication and collaboration | 15:51 |
corvus | i want people to see that we can have the open source, open access collaboration we have with irc and all the conveniences of the proprietary systems without the proprietary systems | 15:52 |
corvus | if the alternative is bifurcated irc+slack+discord, then i think matrix is a better way | 15:53 |
JayF | We're going to lose people no matter which direction we go. I personally don't find matrix easier to set up or understand than IRC | 15:55 |
clarkb | right because we all understand chanserv and nickserv and how they are implemented differently on different networks. Matrix really simplifies all of that I think | 15:57 |
JayF | Well, there's a reason I point new folks at irccloud which completely removes those decoder rings; but I do understand. I don't think matrix is a higher bar; but I'm not convinced it's significantly easier for new folks. | 15:58 |
clarkb | basically if we are already comfortable with irc then yes matrix isn't really going to solve any problems for us. But I think it does address problems for newer users when it comes to creating and authenticating accounts, joining rooms, and maintaining historical scrollback | 15:58 |
JayF | As I said to spotz at SCaLE/OIF days: It takes a PhD in computer science to set up Matrix, at least when using IRC bridges. I'm not sure if the story gets much easier doing matrix-direct. | 15:59 |
corvus | i'm going to restart zuul-web to pick up some new api endpoints | 15:59 |
clarkb | JayF: I have not had the same experience | 15:59 |
clarkb | setting up matrix does not require a phd in anything | 16:00 |
clarkb | the bridges are more complicated and that is part of why I think we should consider using a native room | 16:00 |
clarkb | then all anyone should need is a url to the room and an account (no bridge management to go along with it) | 16:00 |
clarkb | corvus: that will enable zuul-launcher resource management via the api? | 16:01 |
fungi | yes, not using the bridges and connecting to a matrix-only room with element as a browser-based client is as simple as following a url and letting it walk you through either signing in or creating an account on matrix.org | 16:01 |
*** gthiemon1e is now known as gthiemonge | 16:02 |
JayF | Yeah, I should try with a matrix native room. I tried a while back (when they were reliable!) to switch to matrix+IRC bridge and I cannot recall a time when I felt similarly befuddled by something that seemed like it should be simple | 16:02 |
fungi | trying to use matrix as an irc client, in effect, is a good bit more complicated sure | 16:02 |
mnasiadka | I’d say irc bridges on matrix currently are in some weird shape, sort of unreliable - so I moved back to separate clients… | 16:02 |
JayF | mnasiadka: yeah, I think that's what spawned this conversation | 16:05 |
JayF | https://github.com/progval/matrix2051 this was just pointed in my direction by the irccloud folks | 16:06 |
clarkb | JayF: zuul and starlingx and openstack-ops all have native matrix rooms now if you need one to hop onto. | 16:06 |
JayF | might poke at doing an instance somewhere just for me (or GR-OSS) | 16:06 |
corvus | and of course there are folks who have the opposite experience | 16:06 |
corvus | there is nothing that will satisfy everyone, but i honestly think matrix is the closest thing to it, and it is our best hope for actually pushing back against proprietary balkanization of communication in open source communities. i think that's worth a lot. | 16:07 |
fungi | the openstack vmt has a private matrix room too, but a majority of the current vmt members preferred to stick with irc so i'm straddling the two for now in case that shifts in the future | 16:07 |
corvus | and account setup is just standard username/password webform | 16:09 |
corvus | clarkb: yep. not all the changes have merged yet, but i want to be able to get listings now. so i'll do another restart later likely | 16:10 |
corvus | yes, they had a bad 2 weeks. since they are effectively a standalone federated component, their reliability reflects their operator (not the network), and their operator has deprioritized them. so i agree it's good to consider not individually relying on them now, and avoiding relying on them makes us more self-sufficient. | 16:12 |
corvus | they=matrix-irc bridges | 16:13 |
corvus | (also, like everything matrix, anyone can run a public bridge if they want, but no one else wants to because of the work and coordination involved. but individual bridges are still plentiful) | 16:17 |
corvus | i think matrix2051 is a good design, i'd be interested to hear how it goes | 16:20 |
corvus | https://zuul.opendev.org/t/zuul/nodeset-requests exists now | 16:21 |
leonardo-serrano | Hi, we've noticed Zuul is running a bit slower at the moment. Is there any info about this? | 16:27 |
clarkb | leonardo-serrano: can you be more specific about what is slow? is the web ui loading more slowly? Are jobs not starting as quickly as you expect? Have specific job runtimes increased? | 16:30 |
corvus | the disk on nl05 is full | 16:32 |
leonardo-serrano | clarkb: A few starlingx reviews are taking longer than usual to get their zuul jobs started. One example is https://zuul.opendev.org/t/openstack/status?change=951778%2C2 | 16:33 |
corvus | the errors from right before the disk filled up suggest errors in rax-iad | 16:34 |
clarkb | leonardo-serrano: the first thing I like to check in that situation is https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-6h&to=now&timezone=utc to see if we are running at capacity (doesn't seem to be at node capacity or executor capacity) | 16:34 |
corvus | i'm going to delete some old logs then restart that launcher and see what it says | 16:35 |
clarkb | but I think corvus has already likely diagnosed the issue | 16:35 |
clarkb | nl05 provides ~60% of our capacity and if it is out of commission then we'll run fewer jobs | 16:35 |
clarkb | corvus: thanks! | 16:35 |
corvus | i don't think the problem is the launcher | 16:36 |
leonardo-serrano | Thanks for the tip and the quick responses guys. Cheers | 16:36 |
corvus | i think it's the cloud | 16:36 |
corvus | (very likely that the launcher can continue even with a full disk, we just can't see what it's doing) | 16:36 |
clarkb | ah | 16:37 |
clarkb | https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all ya that implies that rax-ord and rax-dfw are still doing work | 16:38 |
clarkb | maybe we should set iad to max-servers 0? | 16:38 |
clarkb | or max servers 5 as a canary? | 16:38 |
clarkb | but even one cloud that can't boot servers will delay many requests by ~15 minutes iirc so could be the source of the problem | 16:39 |
corvus | the disk is full due to a particularly insane stacktrace from python; honestly not sure how that was constructed | 16:40 |
corvus | the stacktrace is thousands of lines long, but it doesn't make sense because it's not actually recursive | 16:41 |
corvus | it's caused by a failure to connect to the workers | 16:41 |
corvus | so i think reducing iad servers is probably a good idea | 16:42 |
clarkb | "the workers" == the test nodes? | 16:42 |
clarkb | corvus: I'm good with that. I can quickly review a change if the disk situation is at least temporarily resolved so it can apply | 16:42 |
corvus | i'm seeing keyscan errors on multiple rax regions | 16:42 |
corvus | so maybe hold off on that max servers change idea | 16:42 |
corvus | ya | 16:43 |
clarkb | rax is where we rely on glean to configure the network statically based on their special config drive network config | 16:43 |
clarkb | if that doesn't work for some reason (image updates or cloud changes to config drive) then that may explain failed ssh keyscans? | 16:44 |
corvus | i restarted that launcher just in case there was something internal messed up | 16:44 |
corvus | openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://neutron.api.dfw3.rackspacecloud.com/v2.0/floatingips, Quota exceeded for resources: ['floatingip']. | 16:44 |
corvus | that's going on too | 16:44 |
clarkb | corvus: that's raxflex not classic so would be a separate issue I think | 16:44 |
fungi | i wonder if we've leaked fips in flex though | 16:45 |
corvus | oh good point | 16:45 |
clarkb | https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all nothing seems to be booting there right now | 16:45 |
clarkb | but those failures should process much more quickly if we're hitting quota errors | 16:45 |
corvus | maybe the fip leaks are niz related? | 16:46 |
clarkb | maybe? | 16:46 |
clarkb | like we're using the quota up on the niz side and nodepool doesn't know or something? | 16:46 |
corvus | yeah, i don't think fips are included in quota, so we don't actually know how many we have | 16:47 |
corvus | but i don't think we expect to have fewer fips than we can run instances, so that suggests a leak | 16:48 |
clarkb | corvus: openstack limits show --absolute seems to indicate we're using no floating ips in those two regions. So possibly a leak on the cloud side that we can't even see? | 16:50 |
clarkb | doing floating ip list does show floating ips though | 16:50 |
corvus | can you tell if they're unattached? have an example i can grep for in nodepool/zuul logs? | 16:51 |
clarkb | corvus: from DFW3: 02b6917c-a4bb-4534-8766-cef2018cd3ed | 50.56.157.247 | 16:51 |
clarkb | corvus: that does not have a fixed IP so I think it is unattached | 16:51 |
clarkb | I suspect that these have leaked and limits show --absolute is not accurately reporting the data back to us. | 16:52 |
clarkb | None of them have fixed ips according to floating ip list | 16:52 |
clarkb | (so they should all be unattached) | 16:52 |
clarkb | also max servers is 32 in both regions and I count 50 fips | 16:53 |
clarkb | I think we might be able to safely delete each of the floating IPs in sjc3 and dfw3; just be sure to do one listing and delete from that, as nodepool may quickly start booting new nodes creating new fips that would get listed if we list multiple times | 16:53 |
corvus | ack... you mind doing that? and save a list of them so we can try to backtrack and figure out when/where they leaked? | 16:54 |
clarkb | corvus: yes I'll start on that now | 16:54 |
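A minimal openstacksdk sketch of the one-pass cleanup clarkb describes above (take a single floating IP listing, then delete only the entries with no attachment). The clouds.yaml entry names and the exact "unattached" check are assumptions for illustration, not the commands that were actually run:

```python
import openstack

# Assumed clouds.yaml entries for the two Rackspace Flex regions discussed above.
for cloud_name in ("raxflex-dfw3", "raxflex-sjc3"):
    conn = openstack.connect(cloud=cloud_name)

    # Take exactly one listing, so floating IPs created for nodes that boot
    # during the cleanup are never considered for deletion.
    snapshot = list(conn.network.ips())

    for fip in snapshot:
        # A floating IP with no fixed IP and no port association is unattached.
        if fip.fixed_ip_address is None and fip.port_id is None:
            print(f"{cloud_name}: deleting leaked FIP {fip.floating_ip_address} ({fip.id})")
            conn.network.delete_ip(fip)
```

Writing `snapshot` out to a file before deleting anything would also satisfy the request above to keep a record for later backtracking.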
corvus | that ip doesn't show up in either nodepool or zuul launcher logs, but i did just delete a bunch of nodepool logs. | 16:55 |
corvus | (i'm looking in executor logs now to try to backtrack that way) | 16:55 |
clarkb | corvus: the listings are in bridge:~clarkb/notes/ | 16:55 |
fungi | does show tell us when they were created? might help to narrow down logs | 16:55 |
fungi | maybe most of them are prior to our retention | 16:55 |
clarkb | oh ya the example I gave was created in april | 16:56 |
corvus | good point, that would save a lot of time :) | 16:56 |
clarkb | b4c3634d-2dce-413d-8c12-11478409f02d | 174.143.59.149 | 16:56 |
clarkb | that one is from the 20th of may | 16:56 |
fungi | so over two weeks ago | 16:56 |
clarkb | I'll check a few more to see if I can find more recent ones | 16:56 |
clarkb | ec26d25d-1e92-434a-9bc7-04b28f2dfe7c | 50.56.157.32 is the 21st | 16:57 |
clarkb | but after checking ~5 more that was the most recent. It's possible that occurred far enough back in time that there aren't newer ones | 16:57 |
Ramereth[m] | So it seems some of the SSDs I added are of poor quality. I'm working on getting them out of rotation. Hopefully that will be completed by the end of today. I just ordered some replacements I know will perform well. | 16:58 |
clarkb | Ramereth[m]: thank you for the update | 16:58 |
clarkb | fungi: corvus I'm ready to start cleaning up fips in dfw3 should I proceed or do you want more data collection first? | 17:00 |
corvus | clarkb: so.... | 17:00 |
corvus | clarkb: both nodepool and niz require us to explicitly turn on floating-ip-cleanup | 17:00 |
corvus | because it's not a safe operation, because fips don't have metadata, and we can't tell if an unattached fip belongs to us or not | 17:00 |
fungi | clarkb: no, that's fine | 17:01 |
clarkb | corvus: so maybe this is just slow background leak and since we're running two competing systems we have that disabled? | 17:01 |
corvus | it's only "safe" to do in a situation where we know that nodepool (or zuul) are the only consumers | 17:01 |
corvus | i don't see that set in either the nodepool or zuul configurations for the new cloud | 17:01 |
clarkb | corvus: ya and 90% of the time (or whatever the rate is) the fips should delete properly | 17:02 |
clarkb | this is just for leaked cleanup handling which we cannot safely enable right now? | 17:02 |
corvus | is flex the only one with fips? and did we just not set that? or am i missing something? | 17:02 |
clarkb | in that case I can manually delete things now and get stuff working again and maybe we rinse and repeat again and eventually we'll be zuul-launcher only and can enable the leak cleanups? | 17:02 |
clarkb | corvus: yes flex is the only one right now | 17:02 |
clarkb | actually the osuosl cloud might be too, I'm not 100% certain on that one | 17:03 |
clarkb | but ovh, rax classic, and openmetal are all direct ip assignments | 17:03 |
corvus | so if flex is the only fip cloud, and we just "forgot" to turn on fip cleanup, then i think we can say mystery solved and we don't need to backtrack any more to find the problem | 17:03 |
clarkb | corvus: ++ I think that makes sense to me | 17:03 |
corvus | (we obviously know that fip leaking happens, so i don't think we need to assign fault to nodepool or zuul for that) | 17:04 |
clarkb | I'll proceed with the manual cleanup now then? | 17:04 |
corvus | okay, then yeah, i think we can delete the fips now. thanks. | 17:04 |
corvus | as for enabling that: because we are running in parallel, it's technically not safe. but maybe we should turn it on in nodepool, with the understanding that it might cause a niz error? | 17:04 |
corvus | or, we could leave it disabled and just delete another batch in a few weeks | 17:05 |
corvus | maybe that's the best idea? since it mostly works? :) | 17:05 |
clarkb | I think it could cause errors to both nodepool and niz | 17:05 |
clarkb | the race is when booting a new server and the fip is created but not attached. We could delete the fip after it has been attached? | 17:05 |
clarkb | then we'd have a broken node either during boot checks or possibly after zuul has received it. Most likely that happens quickly enough to be in pre-run and we rerun the job? | 17:06 |
clarkb | but yes I think we can just cleanup fips later again | 17:06 |
corvus | yeah, i think you're right... the implementation looks pretty naive and may not take into account current builds | 17:06 |
clarkb | I doubt neutron has a flag to delete only if detached | 17:07 |
clarkb | but that might fix this race for us if it did | 17:07 |
clarkb | dfw3 should be cleaned up now. | 17:07 |
corvus | clarkb: well, we "self._client.delete_unattached_floating_ips()" | 17:07 |
corvus | oh you mean a hypothetical flag that means "was previously attached but is not now" | 17:08 |
corvus | dear openstack, metadata on all objects is critical for tracking purposes. love, zuul | 17:09 |
fungi | i guess there's not a boot option to delete the fip when the server instance is deleted? | 17:09 |
clarkb | corvus: yes. Just some simple amount of info to know what the lifecycle state is | 17:09 |
clarkb | is it starting or ending | 17:09 |
clarkb | fungi: there is, and that is what we do. But it only works $PERCENT of the time | 17:09 |
clarkb | fungi: another option would be for neutron and nova to fix that. but its been like 15 years and this problem has persisted so.... | 17:10 |
clarkb | we still don't know why rax-iad is sad? | 17:10 |
corvus | it looks like our current deletion scheme is to "delete all fips attached to a server" and only once that is complete, to delete the server. | 17:10 |
corvus | so i think this failure scenario is the cloud saying "yes i deleted that fip" and then... it's not deleted. | 17:11 |
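Purely as an illustration of the verify-before-proceeding idea (this is not how nodepool or zuul actually implement the deletion scheme), a more defensive ordering could poll until the floating IPs really disappear from listings before deleting the server; the helper below and all of its names are assumptions:

```python
import time

import openstack


def delete_fips_then_server(conn, server, timeout=120):
    """Delete a server's floating IPs, confirm the cloud really removed
    them, then delete the server itself. Hypothetical sketch only."""
    fip_ids = []
    for port in conn.network.ports(device_id=server.id):
        for fip in conn.network.ips(port_id=port.id):
            conn.network.delete_ip(fip, ignore_missing=True)
            fip_ids.append(fip.id)

    # Don't trust the delete response: poll until the FIPs drop out of
    # listings, or give up and leave them for a later cleanup pass.
    deadline = time.time() + timeout
    while fip_ids and time.time() < deadline:
        fip_ids = [fid for fid in fip_ids
                   if conn.network.find_ip(fid, ignore_missing=True) is not None]
        if fip_ids:
            time.sleep(5)

    conn.compute.delete_server(server)
```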
clarkb | and now SJC3 has been cleaned up | 17:11 |
clarkb | neither region has tried to boot new nodes since the cleanup that I can see | 17:11 |
corvus | clarkb: i think the requests are in building, but none have completed, which does not bode well, even though it hasn't exploded yet. let me check one | 17:12 |
clarkb | there are fips in use in dfw3 and sjc3 now. Nodepool has marked them in use | 17:15 |
clarkb | corvus: its possible that max-servers 0 for iad is still appropriate for that specific issue | 17:15 |
corvus | 2025-06-04 17:11:27,430 WARNING nodepool.StateMachineNodeLauncher.rax-iad: [e: 2ad750856fce4f6d825980dda3184e11] [node_request: 200-0027179958] [node: 0041024577] Launch attempt 1/3 for node 0041024577, failed: Timeout waiting for instance creation | 17:15 |
corvus | yeah, many timeouts on iad, no success | 17:17 |
corvus | ord and dfw look ok | 17:17 |
corvus | so i think we should ramp down iad | 17:17 |
clarkb | 0041024620 is a dfw noble instance that just booted, went ready, then in use | 17:17 |
clarkb | I agree the problem seems iad specific | 17:17 |
corvus | clarkb: you want to type that and i approve it? | 17:17 |
clarkb | on it | 17:18 |
frickler | hmm, I can do "openstack floating ip create public --tag ZUUL" and then e.g. "openstack floating ip list --tags ZUUL", isn't that good enough as metadata? | 17:19 |
opendevreview | Clark Boylan proposed openstack/project-config master: Disable Nodepool's rax iad region https://review.opendev.org/c/openstack/project-config/+/951794 | 17:20 |
corvus | frickler: cool, if that exists now, we can use it. we just need to figure out if all our clouds support it, since i'm pretty sure that wasn't always the case. | 17:20 |
clarkb | also part of the problem may be in plumbing that with the auto fip handling? | 17:21 |
clarkb | but that is client side I'm guessing which is easier to modify | 17:21 |
clarkb | corvus: I looked at opendev/zuul-providers and don't see max servers there. Are we relying on quotas? | 17:21 |
corvus | i think zuul/nodepool manually create and delete fips | 17:21 |
clarkb | ah in that case we could tag them | 17:21 |
corvus | clarkb: it's spelled differently, 1 sec | 17:22 |
clarkb | https://review.opendev.org/c/openstack/project-config/+/951794 should cover nodepool though | 17:22 |
corvus | i guess if flex is the only fip cloud, then evaluating whether that's available is easy; and maybe osuosl right? | 17:23 |
corvus | frickler: what cloud did you test with? | 17:23 |
clarkb | corvus: yes should be easy to test. | 17:23 |
frickler | corvus: downstream cloud, should be at 2024.1 | 17:24 |
fungi | got it, so the cloud-side autocleanup for fips is unreliable, and the cleanup in nodepool/zuul is a workaround for when openstack breaks in the provider essentially | 17:24 |
clarkb | corvus: I found the limit stuff I'll get a change up for zuul-providers | 17:24 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-iad https://review.opendev.org/c/opendev/zuul-providers/+/951795 | 17:25 |
corvus | clarkb: ^ | 17:25 |
clarkb | ah you're on it that works too | 17:25 |
clarkb | +2 from me | 17:25 |
clarkb | I need to pop out now as mentioned yesterday | 17:26 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-iad https://review.opendev.org/c/opendev/zuul-providers/+/951795 | 17:26 |
fungi | approved both | 17:26 |
clarkb | I'll check in periodically via my matrix connection and should be back this afternoon | 17:26 |
corvus | https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L714 | 17:28 |
corvus | that is the api method zuul/nodepool use to create a fip; there's no metadata argument there | 17:29 |
corvus | i wonder how the cli adds that tag | 17:29 |
corvus | https://docs.openstack.org/api-ref/network/v2/index.html#floating-ips-floatingips | 17:31 |
corvus | don't see metadata there either | 17:32 |
corvus | there's an extension with a separate endpoint to add tags: https://docs.openstack.org/api-ref/network/v2/index.html#standard-attributes-tag-extension | 17:32 |
corvus | that is not as useful since it doesn't address race conditions. it can potentially reduce the problem, but won't eliminate it. | 17:33 |
frickler | "tags cannot be set when created, so tags need to be set later." https://opendev.org/openstack/python-openstackclient/src/branch/master/openstackclient/network/v2/floating_ip.py#L195-L196 | 17:36 |
corvus | :( | 17:37 |
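So any tagging scheme would have to mirror the two-step the CLI performs: create the floating IP, then tag it through the separate standard-attributes tag endpoint, which still leaves a short untagged window. A hedged openstacksdk sketch, reusing frickler's "public"/"ZUUL" examples and an assumed clouds.yaml entry name:

```python
import openstack

conn = openstack.connect(cloud="raxflex-dfw3")  # assumed clouds.yaml entry

# Step 1: create the floating IP (tags cannot be set at creation time).
net = conn.network.find_network("public")
fip = conn.network.create_ip(floating_network_id=net.id)

# Step 2: tag it via the standard-attributes tag extension.
conn.network.set_tags(fip, ["ZUUL"])

# A later cleanup pass could then limit itself to unattached IPs we tagged.
leaked = [ip for ip in conn.network.ips(tags="ZUUL") if ip.port_id is None]
```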
opendevreview | Merged openstack/project-config master: Disable Nodepool's rax iad region https://review.opendev.org/c/openstack/project-config/+/951794 | 17:42 |
fungi | i guess the promote-image-build pipeline is still a work in progress? https://zuul.opendev.org/t/opendev/buildset/b6bc802e3aff413c97ed06487b8e5402 | 18:07 |
fungi | looks like maybe those jobs either need to short-circuit or not run for changes that wouldn't produce new artifacts in the gate? | 18:11 |
frickler | yes, we noted that yesterday already when the change adding these jobs failed the same way | 18:21 |
frickler | Ramereth[m]: would it help if we disable the cloud until what I assume is ceph rebalancing has finished? | 18:23 |
corvus | yeah, the extra failures are not critical and should be an easy fix | 18:24 |
Ramereth[m] | frickler: yeah, that's okay | 18:26 |
opendevreview | Antoine Musso proposed zuul/zuul-jobs master: add-build-sshkey: slightly enhance tasks names https://review.opendev.org/c/zuul/zuul-jobs/+/951804 | 18:47 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 19:09 |
corvus | i put a little more effort into that so that it should be really clear what each buildset should be trying to do | 19:09 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 19:10 |
corvus | infra-root: fyi, zuul now has the ability to pause event processing (either incoming trigger events, or reporting), so the next time we perform gerrit maintenance and are concerned about the interaction with zuul, we can pause it if we want. | 20:51 |
fungi | ooh! | 20:52 |
fungi | that sounds great | 20:52 |
opendevreview | Merged openstack/project-config master: Remove redundant editHashtags config https://review.opendev.org/c/openstack/project-config/+/951603 | 20:54 |
opendevreview | Merged openstack/project-config master: Remove editHashtags config https://review.opendev.org/c/openstack/project-config/+/951604 | 20:57 |
opendevreview | Merged openstack/project-config master: Disallow editHashtags in acl configs https://review.opendev.org/c/openstack/project-config/+/951715 | 20:58 |
fungi | 951604 failed infra-prod-manage-projects in deploy: https://zuul.opendev.org/t/openstack/build/276bf41e3d044d03b4732c0c275a28b5 | 21:05 |
fungi | fatal: [gitea10.opendev.org]: FAILED! => ... No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root'] | 21:06 |
fungi | rootfs has filled up on gitea10 | 21:06 |
fungi | checking /var/gitea/data/gitea/repo-archive as the usual suspect | 21:07 |
fungi | 110G of the 155G fs | 21:09 |
corvus | seems to have 110G | 21:09 |
corvus | what's the cleanup procedure? can we just rm -rf that dir? | 21:11 |
fungi | i think we have to mash a button in the admin webui | 21:12 |
fungi | which i think is only reachable through 3000/tcp which we no longer allow through iptables, so ssh tunnel | 21:12 |
corvus | btw, have we tried making the repo-archive dir non-writeable? | 21:13 |
fungi | i think clarkb raised a concern that it would break the repo download feature, though that seems like a positive to me (too bad we can't make it inaccessible) | 21:14 |
corvus | fungi: are you planning on hitting that button or should i figure out how to do that? | 21:18 |
fungi | i'm muddling through it, have the ssh tunnel established now, just working out the path | 21:18 |
fungi | once the tunnel is in place, looks like next step is logging into https://localhost:3000/user/login | 21:19 |
corvus | thanks! | 21:21 |
fungi | ah, of course the db can't write my session | 21:22 |
fungi | so once i authenticate, i get a big fat 500 internal server error page | 21:23 |
corvus | maybe snipe a few individual files? | 21:23 |
fungi | yeah, working on vacuuming journald if nothing else | 21:23 |
fungi | okay, that worked, i'm logged in | 21:26 |
fungi | aha, the url changed at some point to include an extra -/ (i guess to avoid conflicting with repos whose names start with admin/) | 21:29 |
fungi | https://localhost:3000/-/admin/monitor/cron shows me a "Delete all repositories' archives (ZIP, TAR.GZ, etc..)" function that runs at 03:00 utc daily | 21:30 |
fungi | clicking the button, it showed me a message that said "Started Task: Delete all repositories' archives (ZIP, TAR.GZ, etc..)" | 21:31 |
fungi | and indeed, almost instantly we have 112gb free on / now | 21:31 |
corvus | great. i assume manage-projects will eventually consistentize itself | 21:32 |
fungi | #status log Manually cleared all repository archive files which had accumulated on gitea10 since 03:00 utc | 21:32 |
opendevstatus | fungi: finished logging | 21:32 |
fungi | i should also quickly reboot the server in case anything is stuck in a weird state after failing to write to / | 21:32 |
fungi | it's rebooting now | 21:33 |
fungi | it seems to be up and working again | 21:36 |
fungi | i'll also trigger a full re-replication from gerrit | 21:36 |
fungi | that's running now | 21:37 |
fungi | checking the other gitea servers for free space | 21:37 |
fungi | sadly gitea11, 13 and 14 are all in the same state 10 was (09 and 12 are fine though) | 21:39 |
fungi | i'll work on doing the same to all of them | 21:39 |
fungi | probably whatever was grabbing all the archive url links kept downing the backends in the lb one after another | 21:41 |
corvus | exactly like whack-a-mole | 21:43 |
fungi | whack-our-servers | 21:44 |
fungi | okay, the last of them is rebooting now | 21:44 |
fungi | and then i'll re-re-replicate from gerrit *sigh* | 21:44 |
fungi | i should have checked the others before i started that | 21:44 |
fungi | c'est la vie | 21:44 |
fungi | okay, all of them are up and reporting at least 100gb free on / | 21:45 |
fungi | #status log Repeated the same process on gitea11, 13 and 14 which were in a similar state, then rebooted them, then triggered full replication from Gerrit in case any objects were missing | 21:47 |
opendevstatus | fungi: finished logging | 21:48 |
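For future reference, the same cleanup could probably be scripted against Gitea's admin cron API rather than clicking through the web UI on each backend. A hedged sketch follows; the base URL, the admin token, and the `delete_repo_archives` task name are all assumptions that should be confirmed against `GET /api/v1/admin/cron` on our Gitea version first:

```python
import requests

TOKEN = "..."  # an admin-scoped Gitea API token; assumed to exist
HOSTS = ("gitea11.opendev.org", "gitea13.opendev.org", "gitea14.opendev.org")

for host in HOSTS:
    base = f"https://{host}:3000/api/v1"  # assumed reachable from the host or a tunnel
    headers = {"Authorization": f"token {TOKEN}"}

    # List the registered cron tasks to confirm the archive-cleanup task name.
    tasks = requests.get(f"{base}/admin/cron", headers=headers, timeout=30).json()
    print(host, [t["name"] for t in tasks])

    # Run the task corresponding to "Delete all repositories' archives".
    resp = requests.post(f"{base}/admin/cron/delete_repo_archives",
                         headers=headers, timeout=30)
    resp.raise_for_status()
```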
Clark[m] | thank you for handling that. Note I don't think you need to tunnel | 22:18 |
Clark[m] | I'm not sure how we would make that dir read only. Mount a ro tmpfs over the path? | 22:19 |
Clark[m] | Or maybe a ro volume mount over the path? But I'd be happy to test something like that on a held node | 22:19 |
fungi | yeah, in retrospect it was probably that i didn't know it had started needing the -/ in the admin url path | 22:20 |
fungi | looks like it works without the ssh tunnel if i use the correct url | 22:21 |
fungi | good call Clark[m]! | 22:21 |
fungi | i had at first assumed i needed to tunnel because it was returning a 404 | 22:22 |
fungi | then when i still got a 404 over the tunnel i rechecked gitea docs to find an example admin url | 22:22 |
fungi | but eventually i figured out that there was an "administrate" option in the user drop-down context menu for the logged-in session | 22:23 |
Clark[m] | I think I had a docs update for that change | 22:25 |
corvus | yeah, i was thinking some kind of volume mount | 22:25 |
Clark[m] | Change of the url I mean | 22:25 |
fungi | if we wanted to block access to urls starting in /user/login or /-/ we could simply exclude them from the proxy in the apache config | 22:25 |
fungi | Clark[m]: for making /var/gitea/data/gitea/repo-archive read-only, couldn't we just set the file permissions on the directory to e.g. 0444? | 22:29 |
corvus | could probably do that when we build the image, assuming it isn't a docker volume | 22:30 |
Clark[m] | I'd worry about gitea being smart and fixing that but we could try it | 22:30 |
Clark[m] | It is inside a docker volume | 22:31 |
Clark[m] | Iirc | 22:31 |
Clark[m] | But not the root of one | 22:31 |
corvus | oh right, we bind mount that to our system fs | 22:31 |
corvus | so yeah, presumably we could just chmod it there | 22:31 |
corvus | Clark: fungi https://review.opendev.org/951805 if you have a sec should address the image promote stuff -- since that's post-review i need to merge that to try it out | 22:33 |
fungi | corvus: okay, i have to admit that's really neat, is it our first use of a dispatch job in opendev? | 22:38 |
Clark[m] | corvus https://review.opendev.org/c/opendev/zuul-providers/+/951805/2/zuul.d/image-build-jobs.yaml line 71 should that be child_jobs not child_job? | 22:38 |
Clark[m] | The with_items loop refers to zuul.child_jobs | 22:38 |
corvus | fungi: i think it is | 22:40 |
fungi | playbooks/opendev-promote-diskimage/dispatch-inner.yaml line 1 also uses "{{ child_job }}" and set_fact appends it to the ret_child_jobs list | 22:40 |
fungi | child_job is the loop_var in playbooks/opendev-promote-diskimage/dispatch.yaml (line 11) | 22:41 |
Clark[m] | Ya I think it is correct in the inner playbook | 22:41 |
corvus | yeah, i think it should be child_job (ie, correct as written) because the inner task list is where it's interpolated and used | 22:41 |
corvus | the overarching structure is (pseudocode): for child_job in zuul.child_jobs: mutate the name of this child_job and get the builds for it | 22:42 |
fungi | right, zuul.d/image-build-jobs.yaml isn't interpolated, just passes the j2 strings into the context | 22:43 |
Clark[m] | Oh right we don't process the query until the inner playbook | 22:44 |
corvus | ++ | 22:44 |
Clark[m] | Ok LGTM but I can't vote right now | 22:44 |
opendevreview | Merged opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 22:45 |
fungi | i made a "proxy +2" comment for you when i approved it | 22:45 |
corvus | https://zuul.opendev.org/t/opendev/buildset/5e6ce0e55c8c489e885aa8ebc5dc9a3b | 23:10 |
corvus | that's a good sign | 23:10 |
tonyb | Do we support/have prior art for installing ansible collections, I guess on the ze? I'm hoping I can use openstack.cloud.* in some of the DIB testing. | 23:16 |
tonyb | if we can't trivially do that then I can of course use the openstack cli | 23:16 |
corvus | tonyb: the executor has the full ansible community edition thingy installed, which should have a bunch of cloud things including openstack. | 23:23 |
corvus | tonyb: but also, it might be safer/more flexible to just do it nested or use the cli | 23:23 |
corvus | in case you end up needing a newer version or something, which sounds plausible for this use case ;) | 23:24 |
tonyb | Okay. I'll stick with the cli | 23:28 |
corvus | tonyb: also, i'm not totally versed in the details, but i imagine there might be some tasks like "upload image to cloud" which we really want to run on the worker node, not the executor, so this avoids any complications like that | 23:30 |
tonyb | That's fair | 23:31 |
tonyb | I was writing some openstack commands and then I thought wait the ansible modules probably do all of this for me, but that may be a "do it never^Wlater" type thing | 23:32 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 23:50 |