frickler | ok, trixie build was successful, but it looks like we are overloading the poor arm environment when all builds are triggered in parallel, maybe we need to limit these using a semaphore? https://zuul.opendev.org/t/opendev/buildset/d2f7e3a55f7f415da144bb625df9c3ff | 05:51 |
frickler | btw. was there any update regarding the future of osuosl? | 05:51 |
frickler | looks like we have been having similar issues with the periodic builds for the past couple of days. Ramereth[m] are there any known issues with IO performance? https://zuul.opendev.org/t/opendev/buildset/4d8e2687b5f44637b342d4296bbbfce1 | 06:02 |
mnasiadka | frickler: I see you were faster - I wanted to ask about that since my centos10 builds on aarch64 are timing out | 06:18 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 06:39 |
frickler | clarkb: there's multiple things, like much less efficient use of screenspace in the browser, lack of flexible ordering, lack of shortcuts to directly access specific rooms, lack of greppable logging. also what fungi calls the "media-rich experience" is something I'm not comfortable with either | 06:55 |
mnasiadka | frickler: I assume there are no irssi-like matrix clients? (I've seen there's a 3rd party plugin for irssi but it's dead for a year or so and very limited) | 06:59 |
frickler | mnasiadka: fungi mentioned a weechat plugin, but that also seems to have limited functionality. also I do need the graphical environment for downstream usage and I also don't want to leave irssi for IRC, so adding yet another client? ... meh | 07:01 |
tonyb | clarkb: Re Matrix, I assume you and corvus have a good feel for the costs of hosting rooms on matrix, I did some research and it all boiled down to "call us". I did find: https://discussion.fedoraproject.org/t/fedora-council-tickets-ticket-487-budget-request-price-increase-for-hosted-matrix-chat-fp-o-by-element-matrix-services-ems/111759 which indicates that Fedora pays USD 25k/year | 07:20 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 08:05 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 09:01 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 09:42 |
fungi | frickler: tonyb: yeah, the weechat plugin i've been using is https://github.com/poljar/weechat-matrix-rs but for example it doesn't support session verification yet nor even joining rooms (so i have to join rooms with element instead), but it can send and receive messages and other basics | 12:24 |
fungi | frickler: osuosl got sufficient funding, but can always use more: https://lists.osuosl.org/pipermail/hosting/2025-May/000644.html | 12:30 |
fungi | i have some errands to run this morning, will probably be gone for an hour, maybe a little longer... back as soon as i can | 12:34 |
*** ralonsoh_ is now known as ralonsoh | 12:36 |
fungi | okay, that went quicker than anticipated | 13:36 |
fungi | #status log Renamed and hid the unused kolla-ansible-core group in Gerrit, per discussion in #openstack-kolla IRC | 13:46 |
opendevstatus | fungi: finished logging | 13:46 |
corvus | tonyb: since matrix is federated, the cost to do anything with it can be as low as zero. there are public homeservers like matrix.org (and others) where anyone can get an account for free, and anyone can create a room. opendev has a homeserver so that we can create rooms with names like "#zuul:opendev.org" to give them some sense of officiality. but because we're not big on user management, we don't host any users there (except our bots). | 13:51 |
corvus | the cost is not great, and helps support the open source ecosystem around the software. there are other low-cost providers who could provide that service as well, or we could do it ourselves on a volunteer basis. adding more rooms to our homeserver has zero marginal cost. iirc, fedora does provide user account services, so their cost is higher. | 13:51 |
corvus | regarding the element user interface: i create groups of rooms (element/matrix call them "spaces") for easy access to the ones i need most under different circumstances. i have spaces like "services", "zuul", "ansible". it's easy to click one of those spaces to find an old room i haven't used in a while. also, there is full-text search across all rooms which i have found quite helpful. it's certainly different, but at least in my use, i | 13:59 |
corvus | haven't found it deficient. | 13:59 |
Ramereth[m] | <frickler> "looks like we are having similar..." <- I have been adding more storage and making some other adjustments. It's likely related to that. I will see if we have an OSD causing the issues however. Is it still happening as of this morning? | 14:40 |
clarkb | mnasiadka: ^ you probably have better insight into ongoing issues since the opendev image builds are more like ~daily | 14:48 |
frickler | I can do a recheck on my patch to get some fresh data | 14:49 |
mnasiadka | Ramereth[m]: yes, it has been happening since my morning (so around 8 hours ago) - see https://zuul.opendev.org/t/openstack/build/5b22f318879a4448b2428cb0603fde71 | 14:50 |
clarkb | frickler: re matrix client pain I do agree on some of that. I prefer weechat's UX for sure, though I've managed to configure element into a state where I'm reasonably happy with it (you can set irc mode for better density and then disable previews for urls). The two things I'm trying to consider are that we seem to have better engagement in places like zuul that moved than we did in | 14:51 |
clarkb | the past. And for IRC a significant number of individuals seem to use matrix bridges which have become flaky recently due to lack of funding. In my mind it seems like we can better serve our existing audience if non bridged IRC users are willing to compromise and also potentially become more accessible via the richer web client environment matrix provides at the same time for people | 14:51 |
clarkb | who want that | 14:51 |
clarkb | really that's the balance I'm trying to achieve here. I'm hoping that IRC users are willing to compromise given that matrix preserves the federated network design and open source aspects of the communication system while not alienating people who are currently not on irc via direct connections | 15:57 |
frickler | well the arguments are all well known, but to me it seems difficult to ask volunteers that enjoy hanging around here in their spare time to "compromise" and move somewhere much less enjoyable | 15:13 |
frickler | I also still question how federated the network implementation really is in practice (in terms of user count) and the lack of open development of the matrix ecosystem | 15:14 |
JayF | If I have to stay connected to Matrix all day, I go from having to have 3 chat apps running during my day job to 4. It's not a matter of philosophical feelings about one or the other, it's a matter of the list seemingly growing unbounded; even if OSS/OpenDev/OpenStack would only represent 2 of those | 15:35 |
fungi | i do manage to connect to several irc networks, a multitude of slack instances, and matrix all from a single terminal-based client, but it's complex and sacrifices on some features (granted mostly on features i neither need nor want) | 15:39 |
JayF | I have put in the feature request a number of times for IRCCloud to support matrix, and they have no interest whatsoever :( | 15:41 |
clarkb | right we're between a rock and a hard place with people who rely on matrix to connect to irc (which is a seemingly large number) and those who want to avoid another system entirely. I sympathize with wanting to have volunteer time be easy and fit within personal preferences for tooling (I really dislike being on discord for gerrit for example). But we also offer services here and | 15:41 |
clarkb | ideally are reachable for our users | 15:41 |
clarkb | if we don't do something (regardless of what that choice is among the options I identified earlier) we're very likely going to have disjoint and confusing communications in the near future | 15:42 |
corvus | we've gotten some pretty consistent feedback that irc isn't a great experience for many new contributors, and for those of us for whom it is a great experience, it is made so with a lot of effort that is not easy for new users to put in. all of that (persistent sessions, authentication, etc) comes out of the box with matrix. | 15:48 |
corvus | it can federate to almost any other open system, which could actually help reduce the app fragmentation that's out there | 15:49 |
corvus | i think it's worth trying to find a place where everyone can communicate together, rather than accept the split of our communities into irc vs proprietary walled gardens | 15:50 |
corvus | so, yeah, i decided that for me, it was worth giving up my highly tuned irssi config to switch to matrix to try to walk the walk on universal communication and collaboration | 15:51 |
corvus | i want people to see that we can have the open source, open access collaboration we have with irc and all the conveniences of the proprietary systems without the proprietary systems | 15:52 |
corvus | if the alternative is bifurcated irc+slack+discord, then i think matrix is a better way | 15:53 |
JayF | We're going to lose people no matter which direction we go. I personally don't find matrix easier to set up or understand than IRC | 15:55 |
clarkb | right because we all understand chanserv and nickserv and how they are implemented differently on different networks. Matrix really simplifies all of that I think | 15:57 |
JayF | Well, there's a reason I point new folks at irccloud which completely removes those decoder rings; but I do understand. I don't think matrix is a higher bar; but I'm not convinced it's significantly easier for new folks. | 15:58 |
clarkb | basically if we are already comfortable with irc then yes matrix isn't really going to solve any problems for us. But I think it does address problems for newer users when it comes to creating and authenticating accounts, joining rooms, and maintaining historical scrollback | 15:58 |
JayF | As I said to spotz at SCaLE/OIF days: It takes a PhD in computer science to set up Matrix, at least when using IRC bridges. I'm not sure if the story gets much easier doing matrix-direct. | 15:59 |
corvus | i'm going to restart zuul-web to pick up some new api endpoints | 15:59 |
clarkb | JayF: I have not had the same experience | 15:59 |
clarkb | setting up matrix does not require a phd in anything | 16:00 |
clarkb | the bridges are more complicated and that is part of why I think we should consider using a native room | 16:00 |
clarkb | then all anyone should need is a url to the room and an account (no bridge management to go along with it) | 16:00 |
clarkb | corvus: that will enable zuul-launcher resource management via the api? | 16:01 |
fungi | yes, not using the bridges and connecting to a matrix-only room with element as a browser-based client is as simple as following a url and letting it walk you through either signing in or creating an account on matrix.org | 16:01 |
*** gthiemon1e is now known as gthiemonge | 16:02 |
JayF | Yeah, I should try with a matrix native room. I tried a while back (when they were reliable!) to switch to matrix+IRC bridge and I cannot recall a time when I felt similarly befuddled by something that seemed like it should be simple | 16:02 |
fungi | trying to use matrix as an irc client, in effect, is a good bit more complicated sure | 16:02 |
mnasiadka | I’d say irc bridges on matrix currently are in some weird shape, sort of unreliable - so I moved back to separate clients… | 16:02 |
JayF | mnasiadka: yeah, I think that's what spawned this conversation | 16:05 |
JayF | https://github.com/progval/matrix2051 this was just pointed in my direction by the irccloud folks | 16:06 |
clarkb | JayF: zuul and starlingx and openstack-ops all have native matrix rooms now if you need one to hop onto. | 16:06 |
JayF | might poke at doing an instance somewhere just for me (or GR-OSS) | 16:06 |
corvus | and of course there are folks who have the opposite experience | 16:06 |
corvus | there is nothing that will satisfy everyone, but i honestly think matrix is the closest thing to it, and it is our best hope for actually pushing back against proprietary balkanization of communication in open source communities. i think that's worth a lot. | 16:07 |
fungi | the openstack vmt has a private matrix room too, but a majority of the current vmt members preferred to stick with irc so i'm straddling the two for now in case that shifts in the future | 16:07 |
corvus | and account setup is just standard username/password webform | 16:09 |
corvus | clarkb: yep. not all the changes have merged yet, but i want to be able to get listings now. so i'll do another restart later likely | 16:10 |
corvus | yes, they had a bad 2 weeks. since they are effectively a standalone federated component, their reliability reflects their operator (not the network), and their operator has deprioritized them. so i agree it's good to consider not individually relying on them now, and avoiding relying on them makes us more self-sufficient. | 16:12 |
corvus | they=matrix-irc bridges | 16:13 |
corvus | (also, like everything matrix, anyone can run a public bridge if they want, but no one else wants to because of the work and coordination involved. but individual bridges are still plentiful) | 16:17 |
corvus | i think matrix2051 is a good design, i'd be interested to hear how it goes | 16:20 |
corvus | https://zuul.opendev.org/t/zuul/nodeset-requests exists now | 16:21 |
leonardo-serrano | Hi, we've noticed Zuul is running a bit slower at the moment. Is there any info about this? | 16:27 |
clarkb | leonardo-serrano: can you be more specific about what is slow? is the web ui loading more slowly? Are jobs not starting as quickly as you expect? Have specific job runtimes increased? | 16:30 |
corvus | the disk on nl05 is full | 16:32 |
leonardo-serrano | clarkb: A few starlingx reviews are taking longer than usual to get their zuul jobs started. One example is https://zuul.opendev.org/t/openstack/status?change=951778%2C2 | 16:33 |
corvus | the errors from right before the disk filled up suggest errors in rax-iad | 16:34 |
clarkb | leonardo-serrano: the first thing I like to check in that situation is https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-6h&to=now&timezone=utc to see if we are running at capacity (doesn't seem to be at node capacity or executor capacity) | 16:34 |
corvus | i'm going to delete some old logs then restart that launcher and see what it says | 16:35 |
clarkb | but I think corvus has already likely diagnosed the issue | 16:35 |
clarkb | nl05 provides ~60% of our capacity and if it is out of commission then we'll run fewer jobs | 16:35 |
clarkb | corvus: thanks! | 16:35 |
corvus | i don't think the problem is the launcher | 16:36 |
leonardo-serrano | Thanks for the tip and the quick responses guys. Cheers | 16:36 |
corvus | i think it's the cloud | 16:36 |
corvus | (very likely that the launcher can continue even with a full disk, we just can't see what it's doing) | 16:36 |
clarkb | ah | 16:37 |
clarkb | https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all ya that implies that rax-ord and rax-dfw are still doing work | 16:38 |
clarkb | maybe we should set iad to max-servers 0? | 16:38 |
clarkb | or max servers 5 as a canary? | 16:38 |
clarkb | but even one cloud that can't boot servers will delay many requests by ~15 minutes iirc so could be the source of the problem | 16:39 |
corvus | the disk is full due to a particularly insane stacktrace from python; honestly not sure how that was constructed | 16:40 |
corvus | the stacktrace is thousands of lines long, but it doesn't make sense because it's not actually recursive | 16:41 |
corvus | it's caused by a failure to connect to the workers | 16:41 |
corvus | so i think reducing iad servers is probably a good idea | 16:42 |
clarkb | "the workers" == the test nodes? | 16:42 |
clarkb | corvus: I'm good with that. I can quickly review a change if the disk situation is at least temporarily resolved so it can apply | 16:42 |
corvus | i'm seeing keyscan errors on multiple rax regions | 16:42 |
corvus | so maybe hold off on that max servers change idea | 16:42 |
corvus | ya | 16:43 |
clarkb | rax is where we rely on glean to configure the network statically based on their special config drive network config | 16:43 |
clarkb | if that doesn't work for some reason (image updates or cloud changes to config drive) then that may explain failed ssh keyscans? | 16:44 |
corvus | i restarted that launcher just in case there was something internal messed up | 16:44 |
corvus | openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://neutron.api.dfw3.rackspacecloud.com/v2.0/floatingips, Quota exceeded for resources: ['floatingip']. | 16:44 |
corvus | that's going on too | 16:44 |
clarkb | corvus: that's raxflex not classic so would be a separate issue I think | 16:44 |
fungi | i wonder if we've leaked fips in flex though | 16:45 |
corvus | oh good point | 16:45 |
clarkb | https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all nothing seems to be booting there right now | 16:45 |
clarkb | but those failures should process much more quickly if we're hitting quota errors | 16:45 |
corvus | maybe the fip leaks are niz related? | 16:46 |
clarkb | maybe? | 16:46 |
clarkb | like we're using the quota up on the niz side and nodepool doesn't know or something? | 16:46 |
corvus | yeah, i don't think fips are included in quota, so we don't actually know how many we have | 16:47 |
corvus | but i don't think we expect to have fewer fips than we can run instances, so that suggests a leak | 16:48 |
clarkb | corvus: openstack limits show --absolute seems to indicate we're using no floating ips in those two regions. So possibly a leak on the cloud side that we can't even see? | 16:50 |
clarkb | doing floating ip list does show floating ips though | 16:50 |
corvus | can you tell if they're unattached? have an example i can grep for in nodepool/zuul logs? | 16:51 |
clarkb | corvus: from DFW3: 02b6917c-a4bb-4534-8766-cef2018cd3ed | 50.56.157.247 | 16:51 |
clarkb | corvus: that does not have a fixed IP so I think it is unattached | 16:51 |
clarkb | I suspect that these have leaked and limits show --absolute is not accurately reporting the data back to us. | 16:52 |
clarkb | None of them have fixed ips according to floating ip list | 16:52 |
clarkb | (so they should all be unattached) | 16:52 |
clarkb | also max servers is 32 in both regions and I count 50 fips | 16:53 |
clarkb | I think we might be able to safely delete each of the floating IPs in sjc3 and dfw3; just be sure to do one listing and delete from that, as nodepool may quickly start booting new nodes creating new fips that would get listed if we list multiple times | 16:53 |
corvus | ack... you mind doing that? and save a list of them so we can try to backtrack and figure out when/where they leaked? | 16:54 |
clarkb | corvus: yes I'll start on that now | 16:54 |
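A minimal openstacksdk sketch of the one-pass cleanup clarkb describes above (take a single floating IP listing, then delete only the entries with no attachment). The clouds.yaml entry names and the exact "unattached" check are assumptions for illustration, not the commands that were actually run:

```python
import openstack

# Assumed clouds.yaml entries for the two Rackspace Flex regions discussed above.
for cloud_name in ("raxflex-dfw3", "raxflex-sjc3"):
    conn = openstack.connect(cloud=cloud_name)

    # Take exactly one listing, so floating IPs created for nodes that boot
    # during the cleanup are never considered for deletion.
    snapshot = list(conn.network.ips())

    for fip in snapshot:
        # A floating IP with no fixed IP and no port association is unattached.
        if fip.fixed_ip_address is None and fip.port_id is None:
            print(f"{cloud_name}: deleting leaked FIP {fip.floating_ip_address} ({fip.id})")
            conn.network.delete_ip(fip)
```

Writing `snapshot` out to a file before deleting anything would also satisfy the request above to keep a record for later backtracking.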
corvus | that ip doesn't show up in either nodepool or zuul launcher logs, but i did just delete a bunch of nodepool logs. | 16:55 |
corvus | (i'm looking in executor logs now to try to backtrack that way) | 16:55 |
clarkb | corvus: the listings are in bridge:~clarkb/notes/ | 16:55 |
fungi | does show tell us when they were created? might help to narrow down logs | 16:55 |
fungi | maybe most of them are prior to our retention | 16:55 |
clarkb | oh ya the example I gave was created in april | 16:56 |
corvus | good point, that would save a lot of time :) | 16:56 |
clarkb | b4c3634d-2dce-413d-8c12-11478409f02d | 174.143.59.149 | 16:56 |
clarkb | that one is from the 20th of may | 16:56 |
fungi | so over two weeks ago | 16:56 |
clarkb | I'll check a few more to see if I can find more recent ones | 16:56 |
clarkb | ec26d25d-1e92-434a-9bc7-04b28f2dfe7c | 50.56.157.32 is the 21st | 16:57 |
clarkb | but after checking ~5 more that was the most recent. It's possible that occurred far enough back in time that there aren't newer ones | 16:57 |
Ramereth[m] | So it seems some of the SSDs I added are of poor quality. I'm working on getting them out of rotation. Hopefully that will be completed by the end of today. I just ordered some replacements I know will perform well. | 16:58 |
clarkb | Ramereth[m]: thank you for the update | 16:58 |
clarkb | fungi: corvus I'm ready to start cleaning up fips in dfw3 should I proceed or do you want more data collection first? | 17:00 |
corvus | clarkb: so.... | 17:00 |
corvus | clarkb: both nodepool and niz require us to explicitly turn on floating-ip-cleanup | 17:00 |
corvus | because it's not a safe operation, because fips don't have metadata, and we can't tell if an unattached fip belongs to us or not | 17:00 |
fungi | clarkb: no, that's fine | 17:01 |
clarkb | corvus: so maybe this is just slow background leak and since we're running two competing systems we have that disabled? | 17:01 |
corvus | it's only "safe" to do in a situation where we know that nodepool (or zuul) are the only consumers | 17:01 |
corvus | i don't see that set in either the nodepool or zuul configurations for the new cloud | 17:01 |
clarkb | corvus: ya and 90% of the time (or whatever the rate is) the fips should delete properly | 17:02 |
clarkb | this is just for leaked cleanup handling which we cannot safely enable right now? | 17:02 |
corvus | is flex the only one with fips? and did we just not set that? or am i missing something? | 17:02 |
clarkb | in that case I can manually delete things now and get stuff working again and maybe we rinse and repeat again and eventually we'll be zuul-launcher only and can enable the leak cleanups? | 17:02 |
clarkb | corvus: yes flex is the only one right now | 17:02 |
clarkb | actually the osuosl cloud might be too, I'm not 100% certain on that one | 17:03 |
clarkb | but ovh, rax classic, and openmetal are all direct ip assignments | 17:03 |
corvus | so if flex is the only fip cloud, and we just "forgot" to turn on fip cleanup, then i think we can say mystery solved and we don't need to backtrack any more to find the problem | 17:03 |
clarkb | corvus: ++ I think that makes sense to me | 17:03 |
corvus | (we obviously know that fip leaking happens, so i don't think we need to assign fault to nodepool or zuul for that) | 17:04 |
clarkb | I'll proceed with the manual cleanup now then? | 17:04 |
corvus | okay, then yeah, i think we can delete the fips now. thanks. | 17:04 |
corvus | as for enabling that: because we are running in parallel, it's technically not safe. but maybe we should turn it on in nodepool, with the understanding that it might cause a niz error? | 17:04 |
corvus | or, we could leave it disabled and just delete another batch in a few weeks | 17:05 |
corvus | maybe that's the best idea? since it mostly works? :) | 17:05 |
clarkb | I think it could cause errors to both nodepool and niz | 17:05 |
clarkb | the race is when booting a new server and the fip is created but not attached. We could delete the fip after it has been attached? | 17:05 |
clarkb | then we'd have a broken node either during boot checks or possibly after zuul has received it. Most likely that happens quickly enough to be in pre-run and we rerun the job? | 17:06 |
clarkb | but yes I think we can just cleanup fips later again | 17:06 |
corvus | yeah, i think you're right... the implementation looks pretty naive and may not take into account current builds | 17:06 |
clarkb | I doubt neutron has a flag to delete only if detached | 17:07 |
clarkb | but that might fix this race for us if it did | 17:07 |
clarkb | dfw3 should be cleaned up now. | 17:07 |
corvus | clarkb: well, we "self._client.delete_unattached_floating_ips()" | 17:07 |
corvus | oh you mean a hypothetical flag that means "was previously attached but is not now" | 17:08 |
corvus | dear openstack, metadata on all objects is critical for tracking purposes. love, zuul | 17:09 |
fungi | i guess there's not a boot option to delete the fip when the server instance is deleted? | 17:09 |
clarkb | corvus: yes. Just some simple amount of info to know what the lifecycle state is | 17:09 |
clarkb | is it starting or ending | 17:09 |
clarkb | fungi: there is, and that is what we do. But it only works $PERCENT of the time | 17:09 |
clarkb | fungi: another option would be for neutron and nova to fix that. but its been like 15 years and this problem has persisted so.... | 17:10 |
clarkb | we still don't know why rax-iad is sad? | 17:10 |
corvus | it looks like our current deletion scheme is to "delete all fips attached to a server" and only once that is complete, to delete the server. | 17:10 |
corvus | so i think this failure scenario is the cloud saying "yes i deleted that fip" and then... it's not deleted. | 17:11 |
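Purely as an illustration of the verify-before-proceeding idea (this is not how nodepool or zuul actually implement the deletion scheme), a more defensive ordering could poll until the floating IPs really disappear from listings before deleting the server; the helper below and all of its names are assumptions:

```python
import time

import openstack


def delete_fips_then_server(conn, server, timeout=120):
    """Delete a server's floating IPs, confirm the cloud really removed
    them, then delete the server itself. Hypothetical sketch only."""
    fip_ids = []
    for port in conn.network.ports(device_id=server.id):
        for fip in conn.network.ips(port_id=port.id):
            conn.network.delete_ip(fip, ignore_missing=True)
            fip_ids.append(fip.id)

    # Don't trust the delete response: poll until the FIPs drop out of
    # listings, or give up and leave them for a later cleanup pass.
    deadline = time.time() + timeout
    while fip_ids and time.time() < deadline:
        fip_ids = [fid for fid in fip_ids
                   if conn.network.find_ip(fid, ignore_missing=True) is not None]
        if fip_ids:
            time.sleep(5)

    conn.compute.delete_server(server)
```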
clarkb | and now SJC3 has been cleaned up | 17:11 |
clarkb | neither region has tried to boot new nodes since the cleanup that I can see | 17:11 |
corvus | clarkb: i think the requests are in building, but none have completed, which does not bode well, even though it hasn't exploded yet. let me check one | 17:12 |
clarkb | there are fips in use in dfw3 and sjc3 now. Nodepool has marked them in use | 17:15 |
clarkb | corvus: its possible that max-servers 0 for iad is still appropriate for that specific issue | 17:15 |
corvus | 2025-06-04 17:11:27,430 WARNING nodepool.StateMachineNodeLauncher.rax-iad: [e: 2ad750856fce4f6d825980dda3184e11] [node_request: 200-0027179958] [node: 0041024577] Launch attempt 1/3 for node 0041024577, failed: Timeout waiting for instance creation | 17:15 |
corvus | yeah, many timeouts on iad, no success | 17:17 |
corvus | ord and dfw look ok | 17:17 |
corvus | so i think we should ramp down iad | 17:17 |
clarkb | 0041024620 is a dfw noble instance that just booted, went ready, then in use | 17:17 |
clarkb | I agree the problem seems iad specific | 17:17 |
corvus | clarkb: you want to type that and i approve it? | 17:17 |
clarkb | on it | 17:18 |
frickler | hmm, I can do "openstack floating ip create public --tag ZUUL" and then e.g. "openstack floating ip list --tags ZUUL", isn't that good enough as metadata? | 17:19 |
opendevreview | Clark Boylan proposed openstack/project-config master: Disable Nodepool's rax iad region https://review.opendev.org/c/openstack/project-config/+/951794 | 17:20 |
corvus | frickler: cool, if that exists now, we can use it. we just need to figure out if all our clouds support it, since i'm pretty sure that wasn't always the case. | 17:20 |
clarkb | also part of the problem may be in plumbing that with the auto fip handling? | 17:21 |
clarkb | but that is client side I'm guessing which is easier to modify | 17:21 |
clarkb | corvus: I looked at opendev/zuul-providers and don't see max servers there. Are we relying on quotas? | 17:21 |
corvus | i think zuul/nodepool manually create and delete fips | 17:21 |
clarkb | ah in that case we could tag them | 17:21 |
corvus | clarkb: it's spelled differently, 1 sec | 17:22 |
clarkb | https://review.opendev.org/c/openstack/project-config/+/951794 should cover nodepool though | 17:22 |
corvus | i guess if flex is the only fip cloud, then evaluating whether that's available is easy; and maybe osuosl right? | 17:23 |
corvus | frickler: what cloud did you test with? | 17:23 |
clarkb | corvus: yes should be easy to test. | 17:23 |
frickler | corvus: downstream cloud, should be at 2024.1 | 17:24 |
fungi | got it, so the cloud-side autocleanup for fips is unreliable, and the cleanup in nodepool/zuul is a workaround for when openstack breaks in the provider essentially | 17:24 |
clarkb | corvus: I found the limit stuff I'll get a change up for zuul-providers | 17:24 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-iad https://review.opendev.org/c/opendev/zuul-providers/+/951795 | 17:25 |
corvus | clarkb: ^ | 17:25 |
clarkb | ah you're on it that works too | 17:25 |
clarkb | +2 from me | 17:25 |
clarkb | I need to pop out now as mentioned yesterday | 17:26 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-iad https://review.opendev.org/c/opendev/zuul-providers/+/951795 | 17:26 |
fungi | approved both | 17:26 |
clarkb | I'll check in periodically via my matrix connection and should be back this afternoon | 17:26 |
corvus | https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L714 | 17:28 |
corvus | that is the api method zuul/nodepool use to create a fip; there's no metadata argument there | 17:29 |
corvus | i wonder how the cli adds that tag | 17:29 |
corvus | https://docs.openstack.org/api-ref/network/v2/index.html#floating-ips-floatingips | 17:31 |
corvus | don't see metadata there either | 17:32 |
corvus | there's an extension with a separate endpoint to add tags: https://docs.openstack.org/api-ref/network/v2/index.html#standard-attributes-tag-extension | 17:32 |
corvus | that is not as useful since it doesn't address race conditions. it can potentially reduce the problem, but won't eliminate it. | 17:33 |
frickler | "tags cannot be set when created, so tags need to be set later." https://opendev.org/openstack/python-openstackclient/src/branch/master/openstackclient/network/v2/floating_ip.py#L195-L196 | 17:36 |
corvus | :( | 17:37 |
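So any tagging scheme would have to mirror the two-step the CLI performs: create the floating IP, then tag it through the separate standard-attributes tag endpoint, which still leaves a short untagged window. A hedged openstacksdk sketch, reusing frickler's "public"/"ZUUL" examples and an assumed clouds.yaml entry name:

```python
import openstack

conn = openstack.connect(cloud="raxflex-dfw3")  # assumed clouds.yaml entry

# Step 1: create the floating IP (tags cannot be set at creation time).
net = conn.network.find_network("public")
fip = conn.network.create_ip(floating_network_id=net.id)

# Step 2: tag it via the standard-attributes tag extension.
conn.network.set_tags(fip, ["ZUUL"])

# A later cleanup pass could then limit itself to unattached IPs we tagged.
leaked = [ip for ip in conn.network.ips(tags="ZUUL") if ip.port_id is None]
```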
opendevreview | Merged openstack/project-config master: Disable Nodepool's rax iad region https://review.opendev.org/c/openstack/project-config/+/951794 | 17:42 |
fungi | i guess the promote-image-build pipeline is still a work in progress? https://zuul.opendev.org/t/opendev/buildset/b6bc802e3aff413c97ed06487b8e5402 | 18:07 |
fungi | looks like maybe those jobs either need to short-circuit or not run for changes that wouldn't produce new artifacts in the gate? | 18:11 |
frickler | yes, we noted that yesterday already when the change adding these jobs failed the same way | 18:21 |
frickler | Ramereth[m]: would it help if we disable the cloud until what I assume is ceph rebalancing has finished? | 18:23 |
corvus | yeah, the extra failures are not critical and should be an easy fix | 18:24 |
Ramereth[m] | frickler: yeah, that's okay | 18:26 |
opendevreview | Antoine Musso proposed zuul/zuul-jobs master: add-build-sshkey: slightly enhance tasks names https://review.opendev.org/c/zuul/zuul-jobs/+/951804 | 18:47 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 19:09 |
corvus | i put a little more effort into that so that it should be really clear what each buildset should be trying to do | 19:09 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 19:10 |
corvus | infra-root: fyi, zuul now has the ability to pause event processing (either incoming trigger events, or reporting), so the next time we perform gerrit maintenance and are concerned about the interaction with zuul, we can pause it if we want. | 20:51 |
fungi | ooh! | 20:52 |
fungi | that sounds great | 20:52 |
opendevreview | Merged openstack/project-config master: Remove redundant editHashtags config https://review.opendev.org/c/openstack/project-config/+/951603 | 20:54 |
opendevreview | Merged openstack/project-config master: Remove editHashtags config https://review.opendev.org/c/openstack/project-config/+/951604 | 20:57 |
opendevreview | Merged openstack/project-config master: Disallow editHashtags in acl configs https://review.opendev.org/c/openstack/project-config/+/951715 | 20:58 |
fungi | 951604 failed infra-prod-manage-projects in deploy: https://zuul.opendev.org/t/openstack/build/276bf41e3d044d03b4732c0c275a28b5 | 21:05 |
fungi | fatal: [gitea10.opendev.org]: FAILED! => ... No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root'] | 21:06 |
fungi | rootfs has filled up on gitea10 | 21:06 |
fungi | checking /var/gitea/data/gitea/repo-archive as the usual suspect | 21:07 |
fungi | 110G of the 155G fs | 21:09 |
corvus | seems to have 110G | 21:09 |
corvus | what's the cleanup procedure? can we just rm -rf that dir? | 21:11 |
fungi | i think we have to mash a button in the admin webui | 21:12 |
fungi | which i think is only reachable through 3000/tcp which we no longer allow through iptables, so ssh tunnel | 21:12 |
corvus | btw, have we tried making the repo-archive dir non-writeable? | 21:13 |
fungi | i think clarkb raised a concern that it would break the repo download feature, though that seems like a positive to me (too bad we can't make it inaccessible) | 21:14 |
corvus | fungi: are you planning on hitting that button or should i figure out how to do that? | 21:18 |
fungi | i'm muddling through it, have the ssh tunnel established now, just working out the path | 21:18 |
fungi | once the tunnel is in place, looks like next step is logging into https://localhost:3000/user/login | 21:19 |
corvus | thanks! | 21:21 |
fungi | ah, of course the db can't write my session | 21:22 |
fungi | so once i authenticate, i get a big fat 500 internal server error page | 21:23 |
corvus | maybe snipe a few individual files? | 21:23 |
fungi | yeah, working on vacuuming journald if nothing else | 21:23 |
fungi | okay, that worked, i'm logged in | 21:26 |
fungi | aha, the url changed at some point to include an extra -/ (i guess to avoid conflicting with repos whose names start with admin/) | 21:29 |
fungi | https://localhost:3000/-/admin/monitor/cron shows me a "Delete all repositories' archives (ZIP, TAR.GZ, etc..)" function that runs at 03:00 utc daily | 21:30 |
fungi | clicking the button, it showed me a message that said "Started Task: Delete all repositories' archives (ZIP, TAR.GZ, etc..)" | 21:31 |
fungi | and indeed, almost instantly we have 112gb free on / now | 21:31 |
corvus | great. i assume manage-projects will eventually consistentize itself | 21:32 |
fungi | #status log Manually cleared all repository archive files which had accumulated on gitea10 since 03:00 utc | 21:32 |
opendevstatus | fungi: finished logging | 21:32 |
fungi | i should also quickly reboot the server in case anything is stuck in a weird state after failing to write to / | 21:32 |
fungi | it's rebooting now | 21:33 |
fungi | it seems to be up and working again | 21:36 |
fungi | i'll also trigger a full re-replication from gerrit | 21:36 |
fungi | that's running now | 21:37 |
fungi | checking the other gitea servers for free space | 21:37 |
fungi | sadly gitea11, 13 and 14 are all in the same state 10 was (09 and 12 are fine though) | 21:39 |
fungi | i'll work on doing the same to all of them | 21:39 |
fungi | probably whatever was grabbing all the archive url links kept downing the backends in the lb one after another | 21:41 |
corvus | exactly like whack-a-mole | 21:43 |
fungi | whack-our-servers | 21:44 |
fungi | okay, the last of them is rebooting now | 21:44 |
fungi | and then i'll re-re-replicate from gerrit *sigh* | 21:44 |
fungi | i should have checked the others before i started that | 21:44 |
fungi | c'est la vie | 21:44 |
fungi | okay, all of them are up and reporting at least 100gb free on / | 21:45 |
fungi | #status log Repeated the same process on gitea11, 13 and 14 which were in a similar state, then rebooted them, then triggered full replication from Gerrit in case any objects were missing | 21:47 |
opendevstatus | fungi: finished logging | 21:48 |
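For future reference, the same cleanup could probably be scripted against Gitea's admin cron API rather than clicking through the web UI on each backend. A hedged sketch follows; the base URL, the admin token, and the `delete_repo_archives` task name are all assumptions that should be confirmed against `GET /api/v1/admin/cron` on our Gitea version first:

```python
import requests

TOKEN = "..."  # an admin-scoped Gitea API token; assumed to exist
HOSTS = ("gitea11.opendev.org", "gitea13.opendev.org", "gitea14.opendev.org")

for host in HOSTS:
    base = f"https://{host}:3000/api/v1"  # assumed reachable from the host or a tunnel
    headers = {"Authorization": f"token {TOKEN}"}

    # List the registered cron tasks to confirm the archive-cleanup task name.
    tasks = requests.get(f"{base}/admin/cron", headers=headers, timeout=30).json()
    print(host, [t["name"] for t in tasks])

    # Run the task corresponding to "Delete all repositories' archives".
    resp = requests.post(f"{base}/admin/cron/delete_repo_archives",
                         headers=headers, timeout=30)
    resp.raise_for_status()
```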
Clark[m] | thank you for handling that. Note I don't think you need to tunnel | 22:18 |
Clark[m] | I'm not sure how we would make that dir read only. Mount a ro tmpfs over the path? | 22:19 |
Clark[m] | Or maybe a ro volume mount over the path? But I'd be happy to test something like that on a held node | 22:19 |
fungi | yeah, in retrospect it was probably that i didn't know it had started needing the -/ in the admin url path | 22:20 |
fungi | looks like it works without the ssh tunnel if i use the correct url | 22:21 |
fungi | good call Clark[m]! | 22:21 |
fungi | i had at first assumed i needed to tunnel because it was returning a 404 | 22:22 |
fungi | then when i still got a 404 over the tunnel i rechecked gitea docs to find an example admin url | 22:22 |
fungi | but eventually i figured out that there was an "administrate" option in the user drop-down context menu for the logged-in session | 22:23 |
Clark[m] | I think I had a docs update for that change | 22:25 |
corvus | yeah, i was thinking some kind of volume mount | 22:25 |
Clark[m] | Change of the url I mean | 22:25 |
fungi | if we wanted to block access to urls starting in /user/login or /-/ we could simply exclude them from the proxy in the apache config | 22:25 |
fungi | Clark[m]: for making /var/gitea/data/gitea/repo-archive read-only, couldn't we just set the file permissions on the directory to e.g. 0444? | 22:29 |
corvus | could probably do that when we build the image, assuming it isn't a docker volume | 22:30 |
Clark[m] | I'd worry about gitea being smart and fixing that but we could try it | 22:30 |
Clark[m] | It is inside a docker volume | 22:31 |
Clark[m] | Iirc | 22:31 |
Clark[m] | But not the root of one | 22:31 |
corvus | oh right, we bind mount that to our system fs | 22:31 |
corvus | so yeah, presumably we could just chmod it there | 22:31 |
corvus | Clark: fungi https://review.opendev.org/951805 if you have a sec should address the image promote stuff -- since that's post-review i need to merge that to try it out | 22:33 |
fungi | corvus: okay, i have to admit that's really neat, is it our first use of a dispatch job in opendev? | 22:38 |
Clark[m] | corvus https://review.opendev.org/c/opendev/zuul-providers/+/951805/2/zuul.d/image-build-jobs.yaml line 71 should that be child_jobs not child_job? | 22:38 |
Clark[m] | The with_items loop refers to zuul.child_jobs | 22:38 |
corvus | fungi: i think it is | 22:40 |
fungi | playbooks/opendev-promote-diskimage/dispatch-inner.yaml line 1 also uses "{{ child_job }}" and set_fact appends it to the ret_child_jobs list | 22:40 |
fungi | child_job is the loop_var in playbooks/opendev-promote-diskimage/dispatch.yaml (line 11) | 22:41 |
Clark[m] | Ya I think it is correct in the inner playbook | 22:41 |
corvus | yeah, i think it should be child_job (ie, correct as written) because the inner task list is where it's interpolated and used | 22:41 |
corvus | the overarching structure is (pseudocode): for child_job in zuul.child_jobs: mutate the name of this child_job and get the builds for it | 22:42 |
fungi | right, zuul.d/image-build-jobs.yaml isn't interpolated, just passes the j2 strings into the context | 22:43 |
Clark[m] | Oh right we don't process the query until the inner playbook | 22:44 |
corvus | ++ | 22:44 |
Clark[m] | Ok LGTM but I can't vote right now | 22:44 |
opendevreview | Merged opendev/zuul-providers master: Add a dispatch job to the image promote pipeline https://review.opendev.org/c/opendev/zuul-providers/+/951805 | 22:45 |
fungi | i made a "proxy +2" comment for you when i approved it | 22:45 |
corvus | https://zuul.opendev.org/t/opendev/buildset/5e6ce0e55c8c489e885aa8ebc5dc9a3b | 23:10 |
corvus | that's a good sign | 23:10 |
tonyb | Do we support/have prior art for installing ansible collections, I guess on the ze? I'm hoping I can use openstack.cloud.* in some of the DIB testing. | 23:16 |
tonyb | if we can't trivially do that then I can of course use the openstack cli | 23:16 |
corvus | tonyb: the executor has the full ansible community edition thingy installed, which should have a bunch of cloud things including openstack. | 23:23 |
corvus | tonyb: but also, it might be safer/more flexible to just do it nested or use the cli | 23:23 |
corvus | in case you end up needing a newer version or something, which sounds plausible for this use case ;) | 23:24 |
tonyb | Okay. I'll stick with the cli | 23:28 |
corvus | tonyb: also, i'm not totally versed in the details, but i imagine there might be some tasks like "upload image to cloud" which we really want to run on the worker node, not the executor, so this avoids any complications like that | 23:30 |
tonyb | That's fair | 23:31 |
tonyb | I was writing some openstack commands and then I thought wait the ansible modules probably do all of this for me, but that may be a "do it never^Wlater" type thing | 23:32 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/949942 | 23:50 |