Wednesday, 2025-06-04

fricklerok, trixie build was successful, but it looks like we are overloading the poor arm environment when all builds are triggered in parallel, maybe we need to limit these using a semaphore? https://zuul.opendev.org/t/opendev/buildset/d2f7e3a55f7f415da144bb625df9c3ff05:51
fricklerbtw. was there any update regarding the future of osuosl?05:51
fricklerlooks like we are having similar issues with the periodic builds since a couple of days. Ramereth[m] are there any known issues with IO performance? https://zuul.opendev.org/t/opendev/buildset/4d8e2687b5f44637b342d4296bbbfce106:02
mnasiadkafrickler: I see you were faster - I wanted to ask about that since my centos10 builds on aarch64 are timing out06:18
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994206:39
fricklerclarkb: there's multiple things, like much less efficient use of screenspace in the browser, lack of flexible ordering, lack of shortcuts to directly access specific rooms, lack of greppable logging. also what fungi calls the "media-rich experience" is something I'm not comfortable with either06:55
mnasiadkafrickler: I assume there are none irssi-like matrix clients? (I've seen there's a 3rd party plugin for irssi but it's dead for a year or so and very limited)06:59
fricklermnasiadka: fungi mentioned a weechat plugin, but that also seems to have limited functionality. also I do need the graphical environment for downstream usage and I also don't want to leave irssi for IRC, so adding yet another client? ... meh07:01
tonybclarkb: Re Matrix, I assume you and corvus have a good feel for the costs of hosting rooms on matrix. I did some research and it all boiled down to "call us".  I did find: https://discussion.fedoraproject.org/t/fedora-council-tickets-ticket-487-budget-request-price-increase-for-hosted-matrix-chat-fp-o-by-element-matrix-services-ems/111759 which indicates that Fedora pays $USD 25k/year07:20
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994208:05
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994209:01
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994209:42
fungifrickler: tonyb: yeah, the weechat plugin i've been using is https://github.com/poljar/weechat-matrix-rs but for example it doesn't support session verification yet nor even joining rooms (so i have to join rooms with element instead), but it can send and receive messages and other basics12:24
fungifrickler: osuosl got sufficient funding, but can always use more: https://lists.osuosl.org/pipermail/hosting/2025-May/000644.html12:30
fungii have some errands to run this morning, will probably be gone for an hour, maybe a little longer... back as soon as i can12:34
*** ralonsoh_ is now known as ralonsoh12:36
fungiokay, that went quicker than anticipated13:36
fungi#status log Renamed and hid the unused kolla-ansible-core group in Gerrit, per discussion in #openstack-kolla IRC13:46
opendevstatusfungi: finished logging13:46
corvustonyb: since matrix is federated, the cost to do anything with it can be as low as zero.  there are public homeservers like matrix.org (and others) where anyone can get an account for free, and anyone can create a room.  opendev has a homeserver so that we can create rooms with names like "#zuul:opendev.org" to give them some sense of officiality.  but because we're not big on user management, we don't host any users there (except our bots).13:51
corvusthe cost is not great, and helps support the open source ecosystem around the software.  there are other low-cost providers who could provide that service as well, or we could do it ourselves on a volunteer basis.  adding more rooms to our homeserver has zero marginal cost.  iirc, fedora does provide user account services, so their cost is higher.13:51
corvusregarding the element user interface: i create groups of rooms (element/matrix call them "spaces") for easy access to the ones i need most under different circumstances.  i have spaces like "services", "zuul", "ansible".  it's easy to click one of those spaces to find an old room i haven't used in a while.  also, there is full-text search across all rooms which i have found quite helpful.  it's certainly different, but at least in my use, i13:59
corvushaven't found it deficient.13:59
Ramereth[m]<frickler> "looks like we are having similar..." <- I  have been adding more storage and making some other adjustments. It's likely related to that. I will see if we have an OSD causing the issues however. Is it still happening as of this morning?14:40
clarkbmnasiadka: ^ you probably have better insight into ongoing issues since the opendev image builds are more like ~daily14:48
fricklerI can do a recheck on my patch to get some fresh data14:49
mnasiadkaRamereth[m]: yes, it has been happening since my morning (so around 8 hours ago) - see https://zuul.opendev.org/t/openstack/build/5b22f318879a4448b2428cb0603fde7114:50
clarkbfrickler: re matrix client pain I do agree on some of that. I prefer weechat's UX for sure, though I've managed to configure element into a state where I'm reasonably happy with it (you can set irc mode for better density and then disable previews for urls). The two things I'm trying to consider are that we seem to have better engagement in places like zuul that moved than we did in14:51
clarkbthe past. And for IRC a significant number of individuals seem to use matrix bridges which have become flaky recently due to lack of funding. In my mind it seems like we can better serve our existing audience if non bridged IRC users are willing to compromise and also potentially become more accessible via the richer web client environment matrix provides at the same time for people14:51
clarkbwho want that14:51
clarkbreally thats the balance I'm trying to achieve here. I'm hoping that IRC users are willing to compromise given that matrix preserves the federated network design and open source aspects of the communication system while not alienating people who are currently not on irc via direct connections14:57
fricklerwell the arguments are all well known, but to me it seems difficult to ask volunteers that enjoy hanging around here in their spare time to "compromise" and move somewhere much less enjoyable15:13
fricklerI also still question the not very much federated (in terms of user count) network implementation and lack of open development of the matrix ecosystem15:14
JayFIf I have to stay connected to Matrix all day, I go from having to have 3 chat apps running during my day job to 4. It's not a matter of philosophical feelings about one or the other, it's a matter of the list seemingly growing unbounded; even if OSS/OpenDev/OpenStack would only represent 2 of those15:35
fungii do manage to connect to several irc networks, a multitude of slack instances, and matrix all from a single terminal-based client, but it's complex and sacrifices on some features (granted mostly on features i neither need nor want)15:39
JayFI have put in the feature request a number of times for IRCCloud to support matrix, and they have no interest whatsoever :( 15:41
clarkbright we're between a rock and a hard place with people who rely on matrix to connect to irc (which is a seemingly large number) and those who want to avoid another system entirely. I sympathize with wanting to have volunteer time be easy and fit within personal preferences for tooling (I really dislike being on discord for gerrit for example). But we also offer services here and15:41
clarkbideally are reachable for our users15:41
clarkbif we don't do something (regardless of what that choice is among the options I identified earlier) we're very likely going to have disjoint and confusing communications in the near future15:42
corvuswe've gotten some pretty consistent feedback that irc isn't a great experience for many new contributors, and for those of us for whom it is a great experience, it is made so with a lot of effort that is not easy for new users to put in.  all of that (persistent sessions, authentication, etc) comes out of the box with matrix.15:48
corvusit can federate to almost any other open system, which could actually help reduce the app fragmentation that's out there15:49
corvusi think it's worth trying to find a place where everyone can communicate together, rather than accept the split of our communities into irc vs proprietary walled gardens15:50
corvusso, yeah, i decided that for me, it was worth giving up my highly tuned irssi config to switch to matrix to try to walk the walk on universal communication and collaboration15:51
corvusi want people to see that we can have the open source, open access collaboration we have with irc and all the conveniences of the proprietary systems without the proprietary systems15:52
corvusif the alternative is bifurcated irc+slack+discord, then i think matrix is a better way15:53
JayFWe're going to lose people no matter which direction we go. I personally don't find matrix easier to setup or understand than IRC15:55
clarkbright because we all understand chanserv and nickserv and how they are implemented differently on different networks. Matrix really simplifies all of that I think15:57
JayFWell, there's a reason I point new folks at irccloud which completely removes those decoder rings; but I do understand. I don't think matrix is a higher bar; but I'm not convinced it's significantly easier for new folks.15:58
clarkbbasically if we are already comfortable with irc then yes matrix isn't really going to solve any problems for us. But I think it does address problems for newer users when it comes to creating and authenticating accounts, joining rooms, and maintaining historical scrollback15:58
JayFAs I said to spotz at SCaLE/OIF days: It takes a PhD in computer science to set up Matrix, at least when using IRC bridges. I'm not sure if the story gets much easier doing matrix-direct.15:59
corvusi'm going to restart zuul-web to pick up some new api endpoints15:59
clarkbJayF: I have not had the same experience15:59
clarkbsetting up matrix does not require a phd in anything16:00
clarkbthe bridges are more complicated and that is part of why I think we should consider using a native room16:00
clarkbthen all anyone should need is a url to the room and an account (no bridge management to go along with it)16:00
clarkbcorvus: that will enable zuul-launcher resource managment via the api?16:01
fungiyes, not using the bridges and connecting to a matrix-only room with element as a browser-based client is as simple as following a url and letting it walk you through either signing in or creating an account on matrix.org16:01
*** gthiemon1e is now known as gthiemonge16:02
JayFYeah, I should try with a matrix native room. I tried a while back (when they were reliable!) to switch to matrix+IRC bridge and I cannot recall a time when I felt similarly befuddled by something that seemed like it should be simple16:02
fungitrying to use matrix as an irc client, in effect, is a good bit more complicated sure16:02
mnasiadkaI’d say irc bridges on matrix currently are in some weird shape, sort of unreliable - so I moved back to separate clients…16:02
JayFmnasiadka: yeah, I think that's what spawned this conversation16:05
JayFhttps://github.com/progval/matrix2051 this was just pointed in my direction by the irccloud folks16:06
clarkbJayF: zuul and starlingx and openstack-ops all have native matrix rooms now if you need one to hop onto.16:06
JayFmight poke at doing an instance somewhere just for me (or GR-OSS)16:06
corvusand of course there are folks who have the opposite experience16:06
corvusthere is nothing that will satisfy everyone, but i honestly think matrix is the closest thing to it, and it is our best hope for actually pushing back against proprietary balkanization of communication in open source communities.  i think that's worth a lot.16:07
fungithe openstack vmt has a private matrix room too, but a majority of the current vmt members preferred to stick with irc so i'm straddling the two for now in case that shifts in the future16:07
corvusand account setup is just standard username/password webform16:09
corvusclarkb: yep.  not all the changes have merged yet, but i want to be able to get listings now.  so i'll do another restart later likely16:10
corvusyes, they had a bad 2 weeks.  since they are effectively a standalone federated component, their reliability reflects their operator (not the network), and their operator has deprioritized them.  so i agree it's good to consider not individually relying on them now, and avoiding relying on them makes us more self-sufficient.16:12
corvusthey=matrix-irc bridges16:13
corvus(also, like everything matrix, anyone can run a public bridge if they want, but no one else wants to because of the work and coordination involved.  but individual bridges are still plentiful)16:17
corvusi think matrix2051 is a good design, i'd be interested to hear how it goes16:20
corvushttps://zuul.opendev.org/t/zuul/nodeset-requests exists now16:21
leonardo-serranoHi, we've noticed Zuul is running a bit slower at the moment. Is there any info about this?16:27
clarkbleonardo-serrano: can you be more specific about what is slow? is the web ui loading more slowly? Are jobs not starting as quickly as you expect? Have specific job runtimes increased?16:30
corvusthe disk on nl05 is full16:32
leonardo-serranoclarkb: A few starlingx reviews are taking longer than usual to get their zuul jobs started. One example is https://zuul.opendev.org/t/openstack/status?change=951778%2C216:33
corvusthe errors from right before the disk filled up suggest errors in rax-iad16:34
clarkbleonardo-serrano: the first thing I like to check in that situation is https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-6h&to=now&timezone=utc to see if we are running at capacity (doesn't seem to be at node capacity or executor capacity)16:34
corvusi'm going to delete some old logs then restart that launcher and see what it says16:35
clarkbbut I think corvus has already likely diagnosed the issue16:35
clarkbnl05 provides ~60% of our capacity and if it is out of commission then we'll run fewer jobs16:35
clarkbcorvus: thanks!16:35
corvusi don't think the problem is the launcher16:36
leonardo-serranoThanks for the tip and the quick responses guys. Cheers16:36
corvusi think it's the cloud16:36
corvus(very likely that the launcher can continue even with a full disk, we just can't see what it's doing)16:36
clarkbah16:37
clarkbhttps://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all ya that implies that rax-ord and rax-dfw are still doing work16:38
clarkbmaybe we should set iad to max-servers 0?16:38
clarkbor max servers 5 as a canary?16:38
clarkbbut even one cloud that can't boot servers will delay many requests by ~15 minutes iirc so could be the source of the problem16:39
corvusthe disk is full due to a particularly insane stacktrace from python; honestly not sure how that was constructed16:40
corvusthe stacktrace is thousands of lines long, but it doesn't make sense because it's not actually recursive16:41
corvusit's caused by a failure to connect to the workers16:41
corvusso i think reducing iad servers is probably a good idea16:42
clarkb"the workers" == the test nodes?16:42
clarkbcorvus: I'm good with that. I can quickly review a change if the disk situation is at least temporarily resolved so it can apply16:42
corvusi'm seeing keyscan errors on multiple rax regions16:42
corvusso maybe hold off on that max servers change idea16:42
corvusya16:43
clarkbrax is where we rely on glean to configure the network statically based on their special config drive network config16:43
clarkbif that doesn't work for some reason (image updates or cloud changes to config drive) then that may explain failed ssh keyscans?16:44
corvusi restarted that launcher just in case there was something internal messed up16:44
corvusopenstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://neutron.api.dfw3.rackspacecloud.com/v2.0/floatingips, Quota exceeded for resources: ['floatingip'].16:44
corvusthat's going on too16:44
clarkbcorvus: thats raxflex not classic so would be a separate issue I think16:44
fungii wonder if we've leaked fips in flex though16:45
corvusoh good point16:45
clarkbhttps://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all nothing seems to be booting there right now16:45
clarkbbut those failures should process much more quickly if we're hitting quota errors16:45
corvusmaybe the fip leaks are niz related?16:46
clarkbmaybe?16:46
clarkblike we're using the quota up on the niz side and nodepool doesn't know or something?16:46
corvusyeah, i don't think fips are included in quota, so we don't actually know how many we have16:47
corvusbut i don't think we expect to have fewer fips than we can run instances, so that suggests a leak16:48
clarkbcorvus: openstack limits show --absolute seems to indicate we're using no floating ips in those two regions. So possibly a leak on the cloud side that we can't even see?16:50
clarkbdoing floating ip list does show floating ips though16:50
corvuscan you tell if they're unattached?  have an example i can grep for in nodepool/zuul logs?16:51
clarkbcorvus: from DFW3: 02b6917c-a4bb-4534-8766-cef2018cd3ed | 50.56.157.24716:51
clarkbcorvus: that does not have a fixed IP so I think it is unattached16:51
clarkbI suspect that these have leaked and limits show --absolute is not accurately reporting the data back to us.16:52
clarkbNone of them have fixed ips according to floating ip list16:52
clarkb(so they should all be unattached)16:52
clarkbalso max servers is 32 in both regions and I count 50 fips16:53
clarkbI think we might be able to safely delete each of the floating IPs in sjc3 and dfw3; just be sure to do one listing and delete from that, as nodepool may quickly start booting new nodes, creating new fips that would get listed if we list multiple times16:53
corvusack... you mind doing that?  and save a list of them so we can try to backtrack and figure out when/where they leaked?16:54
clarkbcorvus: yes I'll start on that now16:54
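(For context, a minimal sketch of the single-listing-then-delete approach described above, using openstacksdk. The cloud name and output filename are placeholders rather than the actual values used here, and this illustrates the idea rather than the exact commands run.)

    import json

    import openstack

    # "rax-flex-dfw3" is a placeholder clouds.yaml entry name.
    conn = openstack.connect(cloud="rax-flex-dfw3")

    # Take one listing up front; FIPs created for servers booting during the
    # cleanup would otherwise show up in later listings and risk deletion.
    leaked = [fip for fip in conn.network.ips() if fip.fixed_ip_address is None]

    # Save a record for later backtracking before deleting anything.
    with open("leaked-fips-dfw3.json", "w") as f:
        json.dump(
            [
                {
                    "id": fip.id,
                    "address": fip.floating_ip_address,
                    "created_at": fip.created_at,
                }
                for fip in leaked
            ],
            f,
            indent=2,
        )

    for fip in leaked:
        conn.network.delete_ip(fip)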
corvusthat ip doesn't show up in either nodepool or zuul launcher logs, but i did just delete a bunch of nodepool logs.16:55
corvus(i'm looking in executor logs now to try to backtrack that way)16:55
clarkbcorvus: the listings are in bridge:~clarkb/notes/16:55
fungidoes show tell us when they were created? might help to narrow down logs16:55
fungimaybe most of them are prior to our retention16:55
clarkboh ya the example I gave was created in april16:56
corvusgood point, that would save a lot of time :)16:56
clarkb b4c3634d-2dce-413d-8c12-11478409f02d | 174.143.59.14916:56
clarkbthat one is from the 20th of may16:56
fungiso over two weeks ago16:56
clarkbI'll check a few more to see if I can find more recent ones16:56
clarkbec26d25d-1e92-434a-9bc7-04b28f2dfe7c | 50.56.157.32 is the 21st16:57
clarkbbut after checking ~5 more that was the most recent. It's possible that occurred far enough back in time that there aren't newer ones16:57
Ramereth[m]So it seems some of the SSDs I added are of poor quality. I'm working on getting them out of rotation. Hopefully that will be completed by the end of today. I just ordered some replacements I know will perform well.16:58
clarkbRamereth[m]: thank you for the update16:58
clarkbfungi: corvus I'm ready to start cleaning up fips in dfw3 should I proceed or do you want more data collection first?17:00
corvusclarkb: so....17:00
corvusclarkb: both nodepool and niz require us to explicitly turn on floating-ip-cleanup17:00
corvusbecause it's not a safe operation, because fips don't have metadata, and we can't tell if an unattached fip belongs to us or not17:00
fungiclarkb: no, that's fine17:01
clarkbcorvus: so maybe this is just slow background leak and since we're running two competing systems we have that disabled?17:01
corvusit's only "safe" to do in a situation where we know that nodepool (or zuul) are the only consumers17:01
corvusi don't see that set in either the nodepool or zuul configurations for the new cloud17:01
clarkbcorvus: ya and 90% of the time (or whatever the rate is) the fips should delete properly17:02
clarkbthis is just for leaked cleanup handling which we cannot safely enable right now?17:02
corvusis flex the only one with fips?  and did we just not set that?  or am i missing something?17:02
clarkbin that case I can manually delete things now and get stuff working again and maybe we rinse and repeat again and eventually we'll be zuul-launcher only and can enable the leak cleanups?17:02
clarkbcorvus: yes flex is the only one right now17:02
clarkbactually the osuosl cloud might be too, I'm not 100% certain on that one17:03
clarkbbut ovh, rax classic, and openmetal are all direct ip assignments17:03
corvusso if flex is the only fip cloud, and we just "forgot" to turn on fip cleanup, then i think we can say mystery solved and we don't need to backtrack any more to find the problem17:03
clarkbcorvus: ++ I think that makes sense to me17:03
corvus(we obviously know that fip leaking happens, so i don't think we need to assign fault to nodepool or zuul for that)17:04
clarkbI'll proceed with the manual cleanup now then?17:04
corvusokay, then yeah, i think we can delete the fips now.  thanks.17:04
corvusas for enabling that: because we are running in parallel, it's technically not safe.  but maybe we should turn it on in nodepool, with the understanding that it might cause a niz error?17:04
corvusor, we could leave it disabled and just delete another batch in a few weeks17:05
corvusmaybe that's the best idea?  since it mostly works?  :)17:05
clarkbI think it could cause errors to both nodepool and niz17:05
clarkbthe race is when booting a new server and the fip is created but not attached. We could delete the fip after it has been attached?17:05
clarkbthen we'd have a broken node either during boot checks or possibly after zuul has received it. Most likely that happens quickly enough to be in pre-run and we rerun the job?17:06
clarkbbut yes I think we can just cleanup fips later again17:06
corvusyeah, i think you're right... the implementation looks pretty naive and may not take into account current builds17:06
clarkbI doubt neutron has a flag to delete only if detached17:07
clarkbbut that might fix this race for us if it did17:07
clarkbdfw3 should be cleaned up now.17:07
corvusclarkb: well, we "self._client.delete_unattached_floating_ips()"17:07
corvusoh you mean a hypothetical flag that means "was previously attached but is not now"17:08
corvusdear openstack, metadata on all objects is critical for tracking purposes. love, zuul17:09
fungii guess there's not a boot option to delete the fip when the server instance is deleted?17:09
clarkbcorvus: yes. Just some simple amount of info to know what the lifecycle state is17:09
clarkbis it starting or ending17:09
clarkbfungi: there is, and that is what we do. But it only works $PERCENT of the time17:09
clarkbfungi: another option would be for neutron and nova to fix that. but its been like 15 years and this problem has persisted so....17:10
clarkbwe still don't know why rax-iad is sad?17:10
corvusit looks like our current deletion scheme is to "delete all fips attached to a server" and only once that is complete, to delete the server.17:10
corvusso i think this failure scenario is the cloud saying "yes i deleted that fip" and then... it's not deleted.17:11
clarkband now SJC3 has been cleaned up17:11
clarkbneither region has tried to boot new nodes since the cleanup that I can see17:11
corvusclarkb: i think the requests are in building, but none have completed, which does not bode well, even though it hasn't exploded yet.  let me check one17:12
clarkbthere are fips in use in dfw3 and sjc3 now. Nodepool has marked them in use17:15
clarkbcorvus: its possible that max-servers 0 for iad is still appropriate for that specific issue17:15
corvus2025-06-04 17:11:27,430 WARNING nodepool.StateMachineNodeLauncher.rax-iad: [e: 2ad750856fce4f6d825980dda3184e11] [node_request: 200-0027179958] [node: 0041024577] Launch attempt 1/3 for node 0041024577, failed: Timeout waiting for instance creation17:15
corvusyeah, many timeouts on iad, no success17:17
corvusord and dfw look ok17:17
corvusso i think we should ramp down iad17:17
clarkb0041024620 is a dfw noble instance that just booted, went ready, then in use17:17
clarkbI agree the problem seems iad specific17:17
corvusclarkb: you want to type that and i approve it?17:17
clarkbon it17:18
fricklerhmm, I can do "openstack floating ip create public --tag ZUUL" and then e.g. "openstack floating ip list --tags ZUUL", isn't that good enough as metadata?17:19
opendevreviewClark Boylan proposed openstack/project-config master: Disable Nodepool's rax iad region  https://review.opendev.org/c/openstack/project-config/+/95179417:20
corvusfrickler: cool, if that exists now, we can use it.  we just need to figure out if all our clouds support it, since i'm pretty sure that wasn't always the case.17:20
clarkbalso part of the problem may be in plumbing that with the auto fip handling?17:21
clarkbbut that is client side I'm guessing which is easier to modify17:21
clarkbcorvus: I looked at opendev/zuul-providers and don't see max servers there. Are we relying on quotas?17:21
corvusi think zuul/nodepool manually create and delete fips17:21
clarkbah in that case we could tag them17:21
corvusclarkb: it's spelled differently, 1 sec17:22
clarkbhttps://review.opendev.org/c/openstack/project-config/+/951794 should cover nodepool though17:22
corvusi guess if flex is the only fip cloud, then evaluating whether that's available is easy; and maybe osuosl right?17:23
corvusfrickler: what cloud did you test with?17:23
clarkbcorvus: yes should be easy to test.17:23
fricklercorvus: downstream cloud, should be at 2024.117:24
fungigot it, so the cloud-side autocleanup for fips is unreliable, and the cleanup in nodepool/zuul is a workaround for when openstack breaks in the provider essentially17:24
clarkbcorvus: I found the limit stuff I'll get a change up for zuul-providers17:24
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Disable rax-iad  https://review.opendev.org/c/opendev/zuul-providers/+/95179517:25
corvusclarkb: ^17:25
clarkbah you're on it that works too17:25
clarkb+2 from me17:25
clarkbI need to pop out now as mentioned yesterday17:26
opendevreviewMerged opendev/zuul-providers master: Disable rax-iad  https://review.opendev.org/c/opendev/zuul-providers/+/95179517:26
fungiapproved both17:26
clarkbI'll check in periodically via my matrix connection and should be back this afternoon17:26
corvushttps://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_network_common.py#L71417:28
corvusthat is the api method zuul/nodepool use to create a fip; there's no metadata argument there17:29
corvusi wonder how the cli adds that tag17:29
corvushttps://docs.openstack.org/api-ref/network/v2/index.html#floating-ips-floatingips17:31
corvusdon't see metadata there either17:32
corvusthere's an extension with a separate endpoint to add tags: https://docs.openstack.org/api-ref/network/v2/index.html#standard-attributes-tag-extension17:32
corvusthat is not as useful since it doesn't address race conditions.  it can potentially reduce the problem, but won't eliminate it.17:33
frickler"tags cannot be set when created, so tags need to be set later." https://opendev.org/openstack/python-openstackclient/src/branch/master/openstackclient/network/v2/floating_ip.py#L195-L19617:36
corvus:(17:37
opendevreviewMerged openstack/project-config master: Disable Nodepool's rax iad region  https://review.opendev.org/c/openstack/project-config/+/95179417:42
fungii guess the promote-image-build pipeline is still a work in progress? https://zuul.opendev.org/t/opendev/buildset/b6bc802e3aff413c97ed06487b8e540218:07
fungilooks like maybe those jobs either need to short-circuit or not run for changes that wouldn't produce new artifacts in the gate?18:11
frickleryes, we noted that yesterday already when the change adding these jobs failed the same way18:21
fricklerRamereth[m]: would it help if we disable the cloud until what I assume is ceph rebalancing has finished?18:23
corvusyeah, the extra failures are not critical and should be an easy fix18:24
Ramereth[m]frickler: yeah, that's okay18:26
opendevreviewAntoine Musso proposed zuul/zuul-jobs master: add-build-sshkey: slightly enhance tasks names  https://review.opendev.org/c/zuul/zuul-jobs/+/95180418:47
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline  https://review.opendev.org/c/opendev/zuul-providers/+/95180519:09
corvusi put a little more effort into that so that it should be really clear what each buildset should be trying to do19:09
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Add a dispatch job to the image promote pipeline  https://review.opendev.org/c/opendev/zuul-providers/+/95180519:10
corvusinfra-root: fyi, zuul now has the ability to pause event processing (either incoming trigger events, or reporting), so the next time we perform gerrit maintenance and are concerned about the interaction with zuul, we can pause it if we want.20:51
fungiooh!20:52
fungithat sounds great20:52
opendevreviewMerged openstack/project-config master: Remove redundant editHashtags config  https://review.opendev.org/c/openstack/project-config/+/95160320:54
opendevreviewMerged openstack/project-config master: Remove editHashtags config  https://review.opendev.org/c/openstack/project-config/+/95160420:57
opendevreviewMerged openstack/project-config master: Disallow editHashtags in acl configs  https://review.opendev.org/c/openstack/project-config/+/95171520:58
fungi951604 failed infra-prod-manage-projects in deploy: https://zuul.opendev.org/t/openstack/build/276bf41e3d044d03b4732c0c275a28b521:05
fungifatal: [gitea10.opendev.org]: FAILED! => ... No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/root']21:06
fungirootfs has filled up on gitea1021:06
fungichecking /var/gitea/data/gitea/repo-archive as the usual suspect21:07
fungi110G of the 155G fs21:09
corvusseems to have 110G21:09
corvuswhat's the cleanup procedure?  can we just rm-fr that dir?21:11
fungii think we have to mash a button in the admin webui21:12
fungiwhich i think is only reachable through 3000/tcp which we no longer allow through iptables, so ssh tunnel21:12
corvusbtw, have we tried making the repo-archive dir non-writeable?21:13
fungii think clarkb raised a concern that it would break the repo download feature, though that seems like a positive to me (too bad we can't make it inaccessible)21:14
corvusfungi: are you planning on hitting that button or should i figure out how to do that?21:18
fungii'm muddling through it, have the ssh tunnel established now, just working out the path21:18
fungionce the tunnel is in place, looks like next step is logging into https://localhost:3000/user/login21:19
corvusthanks!21:21
fungiah, of course the db can't write my session21:22
fungiso once i authenticate, i get a big fat 500 internal server error page21:23
corvusmaybe snipe a few individual files?21:23
fungiyeah, working on vacuuming journald if nothing else21:23
fungiokay, that worked, i'm logged in21:26
fungiaha, the url changed at some point to include an extra -/ (i guess to avoid conflicting with repos whose names start with admin/)21:29
fungihttps://localhost:3000/-/admin/monitor/cron shows me a "Delete all repositories' archives (ZIP, TAR.GZ, etc..)" function that runs at 03:00 utc daily21:30
fungiclicking the button, it showed me a message that said "Started Task: Delete all repositories' archives (ZIP, TAR.GZ, etc..)"21:31
fungiand indeed, almost instantly we have 112gb free on / now21:31
corvusgreat.  i assume manage-projects will eventually consistentize itself21:32
fungi#status log Manually cleared all repository archive files which had accumulated on gitea10 since 03:00 utc21:32
opendevstatusfungi: finished logging21:32
fungii should also quickly reboot the server in case anything is stuck in a weird state after failing to write to /21:32
fungiit's rebooting now21:33
fungiit seems to be up and working again21:36
fungii'll also trigger a full re-replication from gerrit21:36
fungithat's running now21:37
fungichecking the other gitea servers for free space21:37
fungisadly gitea11, 13 and 14 are all in the same state 10 was (09 and 12 are fine though)21:39
fungii'll work on doing the same to all of them21:39
fungiprobably whatever was grabbing all the archive url links kept downing the backends in the lb one after another21:41
corvusexactly like whack-a-mole21:43
fungiwhack-our-servers21:44
fungiokay, the last of them is rebooting now21:44
fungiand then i'll re-re-replicate from gerrit *sigh*21:44
fungii should have checked the others before i started that21:44
fungic'est la vie21:44
fungiokay, all of them are up and reporting at least 100gb free on /21:45
fungi#status log Repeated the same process on gitea11, 13 and 14 which were in a similar state, then rebooted them, then triggered full replication from Gerrit in case any objects were missing21:47
opendevstatusfungi: finished logging21:48
Clark[m]thank you for handling that. Note I don't think you need to tunnel22:18
Clark[m]I'm not sure how we would make that dir read only. Mount a ro tmpfs over the path?22:19
Clark[m]Or maybe a ro volume mount over the path? But I'd be happy to test something like that on a held node 22:19
fungiyeah, in retrospect it was probably that i didn't know it had started needing the -/ in the admin url path22:20
fungilooks like it works without the ssh tunnel if i use the correct url22:21
fungigood call Clark[m]!22:21
fungii had at first assumed i needed to tunnel because it was returning a 40422:22
fungithen when i still got a 404 over the tunnel i rechecked gitea docs to find an example admin url22:22
fungibut eventually i figured out that there was an "administrate" option in the user drop-down context menu for the logged-in session22:23
Clark[m]I think I had a docs update for that change22:25
corvusyeah, i was thinking some kind of volume mount22:25
Clark[m]Change of the url I mean22:25
fungiif we wanted to block access to urls starting in /user/login or /-/ we could simply exclude them from the proxy in the apache config22:25
fungiClark[m]: for making /var/gitea/data/gitea/repo-archive read-only, couldn't we just set the file permissions on the directory to e.g. 0444?22:29
corvuscould probably do that when we build the image, assuming it isn't a docker volume22:30
Clark[m]I'd worry about gitea being smart and fixing that but we could try it22:30
Clark[m]It is inside a docker volume22:31
Clark[m]Iirc22:31
Clark[m]But not the root of one22:31
corvusoh right, we bind mount that to our system fs22:31
corvusso yeah, presumably we could just chmod it there22:31
corvusClark: fungi https://review.opendev.org/951805 if you have a sec should address the image promote stuff -- since that's post-review i need to merge that to try it out22:33
fungicorvus: okay, i have to admit that's really neat, is it our first use of a dispatch job in opendev?22:38
Clark[m]corvus https://review.opendev.org/c/opendev/zuul-providers/+/951805/2/zuul.d/image-build-jobs.yaml line 71 should that be child_jobs not child_job?22:38
Clark[m]The with_items loop refers to zuul.child_jobs22:38
corvusfungi: i think it is22:40
fungiplaybooks/opendev-promote-diskimage/dispatch-inner.yaml line 1 also uses "{{ child_job }}" and set_fact appends it to the ret_child_jobs list22:40
fungichild_job is the loop_var in playbooks/opendev-promote-diskimage/dispatch.yaml (line 11)22:41
Clark[m]Ya I think it is correct in the inner playbook22:41
corvusyeah, i think it should be child_job (ie, correct as written) because the inner task list is where it's interpolated and used22:41
corvusthe overarching structure is (pseudocode): for child_job in zuul.child_jobs: mutate the name of this child_job and get the builds for it22:42
fungiright, zuul.d/image-build-jobs.yaml isn't interpolated, just passes the j2 strings into the context22:43
Clark[m]Oh right we don't process the query until the inner playbook 22:44
corvus++22:44
Clark[m]Ok LGTM but I can't vote right now22:44
opendevreviewMerged opendev/zuul-providers master: Add a dispatch job to the image promote pipeline  https://review.opendev.org/c/opendev/zuul-providers/+/95180522:45
fungii made a "proxy +2" comment for you when i approved it22:45
corvushttps://zuul.opendev.org/t/opendev/buildset/5e6ce0e55c8c489e885aa8ebc5dc9a3b23:10
corvusthat's a good sign23:10
tonybDo we support/have prior art for installing ansible collections, I guess on the ze?   I'm hoping I can use openstack.cloud.* in some of the DIB testing.23:16
tonybif we can't trivially do that then I can of course use the openstack cli23:16
corvustonyb: the executor has the full ansible community edition thingy installed, which should have a bunch of cloud things including openstack.23:23
corvustonyb: but also, it might be safer/more flexible to just do it nested or use the cli23:23
corvusin case you end up needing a newer version or something, which sounds plausible for this use case ;)23:24
tonybOkay.  I'll stick with the cli23:28
corvustonyb: also, i'm not totally versed in the details, but i imagine there might be some tasks like "upload image to cloud" which we really want to run on the worker node, not the executor, so this avoids any complications like that23:30
tonybThat's fair23:31
tonybI was writing some openstack commands and then I thought wait the ansible modules probably do all of this for me, but that may be a "do it never^Wlater" type thing23:32
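(For reference, the openstack.cloud Ansible modules and the CLI both wrap roughly the same SDK call for the "upload image to cloud" step discussed above. A minimal sketch under assumed names; the cloud entry, image name, and file path are illustrative only, not part of the actual DIB job.)

    import openstack

    conn = openstack.connect(cloud="devstack")  # placeholder cloud name

    # Upload a locally built image; disk/container formats match a typical
    # DIB qcow2 output. Name and path are assumptions for illustration.
    image = conn.create_image(
        name="dib-functest-image",
        filename="/opt/dib/images/test.qcow2",
        disk_format="qcow2",
        container_format="bare",
        wait=True,
    )
    print(image.id, image.status)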
opendevreviewTony Breeds proposed openstack/diskimage-builder master: Add new openstack/devstack based functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/94994223:50

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!