Thursday, 2026-04-09

-@gerrit:opendev.org- Steve Baker proposed:05:02
- [openstack/diskimage-builder] 983813: Refactor 02-set-machine-id into an element https://review.opendev.org/c/openstack/diskimage-builder/+/983813
- [openstack/diskimage-builder] 983814: Refactor 03-reset-bls-entries into an element https://review.opendev.org/c/openstack/diskimage-builder/+/983814
- [openstack/diskimage-builder] 983815: Add tarball element for unprivileged container builds https://review.opendev.org/c/openstack/diskimage-builder/+/983815
@mnasiadka:matrix.orgAfter yesterday's changes the LB Grafana dashboard might need some love - https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-24h&to=now&timezone=utc07:26
@priteau:matrix.orgHello. Is something broken with Zuul? https://zuul.opendev.org is returning 40308:40
@sean-k-mooney:matrix.orgIt works for me just now08:45
@sean-k-mooney:matrix.orgI guess you’re signing in?08:45
@sean-k-mooney:matrix.orgRather than accessing it without auth08:46
@mnasiadka:matrix.orgWorks for me08:59
@mnasiadka:matrix.orgHmm, reprepro Ubuntu mirror script is failing on missing bionic "Error: packages database contains unused 'bionic-backports|main|amd64' database."09:03
@priteau:matrix.orgIt works in Chrome but not in Safari09:18
@priteau:matrix.orgUser agent filtering?09:18
@priteau:matrix.org* It works in Chrome and Firefox, but not in Safari09:18
@sean-k-mooney:matrix.orgso again not signed in but i just tested on safari and i can access it09:19
@sean-k-mooney:matrix.orgboth on mobile and on my macbook air09:20
@mnasiadka:matrix.orgPierre Riteau: might be, but my Safari works09:37
@priteau:matrix.orgMine is Safari 26.409:48
@priteau:matrix.orgIt works if I change the user agent to Chrome09:49
@priteau:matrix.orgUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.4 Safari/605.1.1510:02
@mnasiadka:matrix.orgYes, that one seems to be in Zuul Apache ua-filter.conf10:16
@priteau:matrix.orgSame issue to access https://static.openstack.org/project/opendev.org/docs/opendev/infra-manual/latest/matrix.html10:16
@priteau:matrix.orgAnd https://tarballs.openstack.org10:17
@priteau:matrix.orgBut https://docs.openstack.org works fine10:17
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 983845: Remove entry from ua-filter https://review.opendev.org/c/opendev/system-config/+/98384510:20
@mnasiadka:matrix.org^^ that should fix it, once it gets reviewed and merged10:22
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 983851: Promote mirror04.gra1.ovh https://review.opendev.org/c/opendev/zone-opendev.org/+/98385110:47
@mnasiadka:matrix.org#status log Replaced mirror02.bhs1.ovh.opendev.org (b2a0f48a-c850-485f-838c-0c896c2cfa5d, volume: 05e72250-a3af-46c6-a2b8-aa1b8b0da928) with mirror03.bhs1.ovh.opendev.org10:52
@status:opendev.org@mnasiadka:matrix.org: finished logging10:52
@harbott.osism.tech:regio.chatit just looks like there were some counter wraps/resets?11:53
@harbott.osism.tech:regio.chatalso the next attack wave seems to have started 10 mins ago :-/11:53
@mnasiadka:matrix.orgwell, opendev.org is responsive for me, maybe a bit slow - but not that bad11:59
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Power Reviewers group https://review.opendev.org/c/openstack/project-config/+/98192412:04
@fungicide:matrix.orgi'm getting 500 internal server error when trying to clone a repo from gitea at the moment12:17
@fungicide:matrix.orgso we may have tuned apache and haproxy up high enough that gitea can get overwhelmed now12:17
@fungicide:matrix.orgyeah, looking at several backends, apache responds quickly but server-status shows most workers in "sending reply" state so may be waiting for gitea to stream the requested content12:21
@fungicide:matrix.orggitea09 is handling roughly one POST for a git-upload-pack each second12:29
@fungicide:matrix.orgthe other backends are seeing a lot less, so this may be the effect of hashing by client address12:31
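The skew described above is what one would expect when a balancer picks a backend by hashing the client address: every request from a given address lands on the same backend, so a few heavy clients can concentrate on one server. A minimal Python sketch of that idea follows. The backend names are the ones mentioned in this log, but the hash function and the example addresses are assumptions for illustration and are not HAProxy's actual algorithm.

```python
import hashlib
from collections import Counter

# Backend names as they appear in this log.
BACKENDS = ["gitea09", "gitea10", "gitea11", "gitea12", "gitea13", "gitea14"]


def pick_backend(client_ip, backends=BACKENDS):
    """Map a client address to one backend, deterministically.

    Illustrative only: HAProxy's source hashing uses its own hash function.
    """
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def requests_per_backend(requests):
    """Count requests per backend for an iterable of client addresses."""
    return Counter(pick_backend(ip) for ip in requests)


# Hypothetical traffic: one heavy client (documentation address range)
# plus twenty light ones sending a single request each.
traffic = ["192.0.2.10"] * 1000 + ["198.51.100.%d" % i for i in range(1, 21)]
counts = requests_per_backend(traffic)
```

Whichever backend the heavy client hashes to receives at least its 1000 requests, while the remaining 20 requests are spread over all six backends.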
-@gerrit:opendev.org- Zuul merged on behalf of Michal Nasiadka: [opendev/system-config] 983845: Remove entry from ua-filter https://review.opendev.org/c/opendev/system-config/+/98384512:31
@fungicide:matrix.orgyeah, load average on gitea09 is hovering around 25 too12:31
@clarkb:matrix.orgThe 500 errors should be logged in /var/log/containers/docker-gitea.log iirc12:33
@fungicide:matrix.orgthe poor folks on the launchpad team at canonical are struggling again today too, based on discussion in their matrix room12:33
@clarkb:matrix.orgI got an error from gitea12. Looking at grafana graphs I suspect that the load balancer may be cycling the back ends if the healthz responses are slow or also 500ing12:34
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 983860: Add entries for mirror03.ord.rax https://review.opendev.org/c/opendev/zone-opendev.org/+/98386012:35
@clarkb:matrix.orgThe haproxy show stat command shows backend status iirc but I don't remember the exact command off the top of my head (we don't capture that in grafana)12:35
@fungicide:matrix.org`Apr  9 12:35:55 gitea09 docker-gitea[712]: 2026/04/09 12:35:54 services/context/repo.go:535:RepoAssignment() [E] GetReleaseCountByRepoID: Error 1040: Too many connections`12:36
@fungicide:matrix.orgquite a few of those12:36
@mnasiadka:matrix.orgClark: # watch 'echo "show stat" | sudo nc -U /var/lib/haproxy/run/stats | cut -d "," -f 1,2,3,4,5,6,8-10,18 | column -s, -t'12:36
@clarkb:matrix.orgOh we overwhelmed the database12:36
@mnasiadka:matrix.orgah, backend status - more columns :)12:36
@mnasiadka:matrix.orgactually not, it's in last column of my oneliner12:37
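The one-liner above keeps field 18 of the `show stat` CSV, which is the `status` column. For anyone who would rather post-process that output than read it in a terminal, here is a minimal Python sketch that parses already-captured `show stat` text and returns the status per server. The sample data, including the proxy name `balance_git_https`, is hypothetical and truncated to the first 18 columns; real output has many more.

```python
import csv
import io


def backend_status(show_stat_output):
    """Return {(proxy, server): status} from HAProxy "show stat" CSV text."""
    # The header line starts with "# "; strip it so DictReader sees clean names.
    text = show_stat_output.lstrip()
    if text.startswith("# "):
        text = text[2:]
    reader = csv.DictReader(io.StringIO(text))
    return {
        (row["pxname"], row["svname"]): row["status"]
        for row in reader
        if row.get("pxname")
    }


# Hypothetical sample, truncated to the first 18 columns of the CSV.
SAMPLE = (
    "# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,"
    "dreq,dresp,ereq,econ,eresp,wretr,wredis,status\n"
    "balance_git_https,gitea09,0,0,12,40,,900,1,2,,0,,0,0,0,0,UP\n"
    "balance_git_https,gitea14,0,0,0,35,,800,1,2,,0,,0,0,0,0,MAINT\n"
)

statuses = backend_status(SAMPLE)
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between HAProxy versions.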
@clarkb:matrix.orgfungi: error 1040 too many connections is the mariadb database complaining I think12:37
@clarkb:matrix.orgI'm not sure if we want to try increasing the mariadb limit and restart services or drop the haproxy front end limits back down. It looks like maybe the prior 8000 value must be right on the limit of either what our DB connection limit can do or what the botnet is doing since we're hovering around 8k connections now12:39
@clarkb:matrix.orgI believe there is a specific tuning config file for gitea mariadb connection limits already12:41
@clarkb:matrix.orgBut without access to gitea it's hard to find at the moment :)12:41
@clarkb:matrix.orgIf we want to increase DB connections I can probably get to a real computer and load ssh keys12:42
@mnasiadka:matrix.orgWell, it's a question of whether we allow the botnet to scrape what they want faster (and explore other bottlenecks) or go back to the previous limits and wait until it finishes?12:44
@clarkb:matrix.orgYes, though yesterday we were theorizing a good chunk of the traffic would 403. However if we're getting far enough to talk to the db then that may also have changed. Where the traffic is now actually processing the entire request and waiting for a response rather than short circuiting12:45
@clarkb:matrix.orgThat may also explain why at 8k ish connections we're seeing trouble12:46
@fungicide:matrix.orgthe balance comes from being able to reject enough bogus requests before they hit the db12:46
@fungicide:matrix.orgi'm definitely seeing a ton of the same sorts of mobile phone user agent strings associated with requests for /commit/ paths making it through to the backend still12:47
@vhasko:matrix.orghello guys, we at T Cloud Public (formerly OpenTelekomCloud) are also experiencing a 500 error on https://opendev.org/zuul/zuul-jobs/12:47
@fungicide:matrix.orgVladi: thanks, it's known12:47
@clarkb:matrix.orgSo I think our options are either to drop the front end limits down again or increase the mariadb connection limits. Status quo is less desirable as I suspect this can impact Gerrit replication stuff with the DB errors12:49
@fungicide:matrix.orgthough the majority of the crawlers getting through now are back to computer browser identifiers rather than mobile device ones12:50
@clarkb:matrix.orgYes, the traffic actually hitting the DB implies we're filtering less and returning fewer 403s. We may still be able to keep up if we stabilize via increased DB connection limits though12:51
@clarkb:matrix.orgOr the system load will skyrocket and it won't be useable ;) hard to say from the current vantage point 12:52
@fungicide:matrix.orgbut with the load average on gitea09 approaching 30, i expect other failure modes if we increase the db connections12:52
@clarkb:matrix.orgThat may be due to 09 being the only system up and then going down and the next system is hit. But yes not a good indicator12:53
@fungicide:matrix.orgmaybe it's time to add more backend servers? copying the data to them is time-consuming though12:53
@mnasiadka:matrix.orgAll gitea servers have around 30 load avg12:53
@mnasiadka:matrix.orgIt's not only 09 anymore12:53
@fungicide:matrix.orgyeah, i was using 09 as an example, though maybe a poor choice since it was handling a lot more git requests in the past hour12:54
@fungicide:matrix.orgor do we want to try to go ahead and land the anubis implementation?12:54
@mnasiadka:matrix.orgWell I don't think lowering frontend limits is going to help, it was sort of the same situation yesterday - it just was blocked on LB instead of hammering backend12:55
@clarkb:matrix.orgThe upside to blocking on the frontend is that Gerrit replication could continue to succeed12:55
@mnasiadka:matrix.orgRight12:56
@mnasiadka:matrix.orgSo - lower the LB frontend limit and land Anubis and see if it helps?12:56
@fungicide:matrix.orgpreviously i didn't expect anubis to help much because the requests had to go through haproxy and apache anyway, but now it's actually load on the gitea service/database killing us, so it might be the solution12:56
@clarkb:matrix.orgYa we may need to increase the limits again as the frontend will just get overpowered but in theory anubis should push back and allow the backend to keep up12:57
@clarkb:matrix.orgSo step 1 limit front end which won't fix anything for clients. Step 2 deploy anubis. Step 3 see if that provides enough return to sanity?12:57
@vhasko:matrix.orgI can confirm that implementing anubis saved our asses on our HelpCenter, kicked off many crawler bots and AI bots12:58
@clarkb:matrix.orgYes we've deployed it on another service12:58
@clarkb:matrix.orgIt has just been a bit more complicated and tricky to get into gitea particularly when we have had to run a fire drill every 6 hours12:59
@fungicide:matrix.orgthe main concern with using it in front of gitea is that the crawlers were overloading haproxy and apache before they would have reached anubis, but now we've tuned those to accept a lot more connections/requests12:59
@clarkb:matrix.orgfungi: Jens Harbott has a comment about why the latest anubis ps failed. Do we want to get that updated and work on the http vs https switchover in the interim?13:03
@clarkb:matrix.orgBut also setting the limit on the frontend back to 4k will likely help with the config updates on the backend 13:04
@fungicide:matrix.orglooking, i hadn't seen it yet13:04
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 983061: Apply Anubis to the Gitea backend servers https://review.opendev.org/c/opendev/system-config/+/98306113:07
@clarkb:matrix.orgI do half wonder if we want to try and manually cut over a single backend too13:07
@clarkb:matrix.orgBut as previously mentioned there is a fair bit of config to update by hand13:08
@fungicide:matrix.orgeasy enough to disable ansible deployment to the remaining backends, but then deploying to them later will require manually running ansible i guess?13:08
@clarkb:matrix.orgYes or reenqueue the buildset assuming we don't land anything else conflicting in the interim13:09
@clarkb:matrix.orgLooking at the http cutover the bulk of that change is in management code. But otherwise it's a fairly straightforward update of apache and gitea config and restarts of both services. Applying that by hand shouldn't be too bad. I can do that on say gitea14 after manually taking it out of the rotation if we think that would be good to verify. Then similarly anubis is docker compose update and apache update and could be done manually? Just to confirm it generally works and then we apply it via ansible? I dunno13:14
@clarkb:matrix.orgThere are no good easy straightforward options so just need to pick something and move forward I guess13:14
@fungicide:matrix.orgthe anubis change yes, though the parent change will be a bit more of a beast13:14
@clarkb:matrix.orgIt doesn't look too bad (the parent)13:15
@clarkb:matrix.orgThe main thing is that the management code that talks to https localhost will fail if it runs after a manual update. Which is probably ok let's just not create any projects right now?13:16
@priteau:matrix.orgThanks for fixing the UA filtering for https://zuul.opendev.org/13:16
@fungicide:matrix.orgyeah, i guess it's mainly just docker-compose.yaml, app.ini and gitea.vhost changes that need to get applied by hand, and then apache and containers restarted13:16
@priteau:matrix.orgI see there is still the same issue for https://opendev.org13:17
@clarkb:matrix.orgfungi: ya give me a few minutes to get situated at the computer but I'll remove gitea14 from the backend rotation then start on it13:18
@fungicide:matrix.orgPierre Riteau: yes, infra-prod-service-gitea failed in deploy for 98384513:18
@fungicide:matrix.orgprobably related to one or more of the servers being overloaded13:18
@fungicide:matrix.orgi'll need to check the log to confirm13:18
@fungicide:matrix.org`TASK [gitea : List keys again to ensure key ids are correct for deletion.] fatal: [gitea09.opendev.org]: FAILED! => "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"...`13:19
@fungicide:matrix.orgso probably yes13:20
@fungicide:matrix.orgzuul estimates another 45 minutes to finish system-config-run-gitea for checking the test update to 983061, and then it needs to get through the gate, but since other testing already succeeded on it earlier if we want to hand-apply it on one of the backends as a trial run i guess we could do it in the interim13:22
@clarkb:matrix.org14 has been pulled out of the haproxy rotation via manual haproxy socat commands13:24
@clarkb:matrix.orgI'm going to apply the two changes separately and we can confirm things look good between the two13:26
@fungicide:matrix.orgthanks. sorry i'm a little distracted, in a completely unrelated conference call and have another right after this one13:31
@clarkb:matrix.orgI'm staging everything in my homedir so that I'm not ninja editing and there is a bit of a paper trail13:32
@clarkb:matrix.orgjust in case anyone is wondering what is taking so long and/or if you want to look13:32
@clarkb:matrix.orgok gitea14 should be up after a manual transition to the http backend13:40
@clarkb:matrix.orgfungi: ^ if you want to test that too (I am about to test it. Haven't done any checks beyond looking at processes so far)13:40
@clarkb:matrix.orgweb seems to be working for me via the socks proxy through the load balancer13:41
@clarkb:matrix.orgI'm going to work on staging the anubis change next. That will take a bit of time giving time to test the http transition13:41
@fungicide:matrix.orghah, it's almost 100% git clients hitting the gitea-web interface now13:46
@fungicide:matrix.orgload average is running sub-1.013:47
@clarkb:matrix.orgfungi: on which host?13:47
@fungicide:matrix.orggitea1413:47
@clarkb:matrix.orggitea14 should be out of haproxy so I wouldn't expect anything talking to it?13:47
@clarkb:matrix.orgunless maybe those git clients are from before I took it out and haproxy lets them finish?13:47
@fungicide:matrix.orgoh, okay that's internal gitea connections13:48
@clarkb:matrix.organyway staging is taking me a minute to get my bearings13:48
@fungicide:matrix.org`GiteaHttpLib`13:48
@clarkb:matrix.orgso I haven't anubis'd anything yet13:48
@fungicide:matrix.orgthat makes more sense as to why there's so little activity13:50
@clarkb:matrix.orgok I'm about ready to restart gitea14 services again just a heads up that I'll be doing that and your tests may start to fail13:55
@fungicide:matrix.orgk13:56
@fungicide:matrix.orgwe're about 10 minutes out from system-config-run-gitea possibly moving into the gate if Jens Harbott's test suggestion worked13:57
@fungicide:matrix.orgooh, i see a firefox hit that made it through gitea13:58
@fungicide:matrix.orgoh, was that you Clark?13:58
@fungicide:matrix.orglooks like it used https://gitea14.opendev.org:3081/ as the url base13:59
@clarkb:matrix.orgyes that was me13:59
@clarkb:matrix.orgit isn't working I think due to the :3081 so I'm trying to fix that13:59
@fungicide:matrix.orgdoes hitting apache on 443 not work, or are you trying to exclude that layer?14:00
@clarkb:matrix.orgthe anubis redirect says not an allowed redirect domain14:00
@clarkb:matrix.orgI had gitea14.opendev.org in the redirect domains to match what the change should do but I think it needs the :3081 maybe?14:00
@clarkb:matrix.orgI'm going to restart services again14:00
@fungicide:matrix.orglooks like you got a 200 ok14:01
@clarkb:matrix.orgyes that seems to fix it if I add :3081. I don't think this affects production so I think we can proceed as is and then do a followup to fix it14:02
@clarkb:matrix.orgdo you think I should add gitea14.opendev.org back to haproxy? My concern is that we'll get the backends round robinning due to the other issues and either gitea14 will take an outsized amount of load or the anubis cookie may confuse things?14:02
@clarkb:matrix.orghowever, we can always take it back out of rotation if that is a problem I guess if we want to go for it14:02
@fungicide:matrix.orgyes, please do14:03
@fungicide:matrix.orgi'm watching the log14:03
@clarkb:matrix.orgdone14:03
@fungicide:matrix.orgi see normal traffic now14:03
@fungicide:matrix.organd as expected, it's almost all git traffic14:03
@fungicide:matrix.orglots and lots of it14:04
@fungicide:matrix.orgthough i do see some normal browsers getting through as well, they look like maybe more normal requests14:04
@clarkb:matrix.orgcool I guess I can do 13 next and so on down the line and race ansible14:05
@fungicide:matrix.orglooks like google gemini knows how to solve anubis challenges14:06
@clarkb:matrix.orgon 13 I can just go to the final state too now that we know it works so maybe it will be quicker14:06
@clarkb:matrix.orgfungi: can you monitor gitea14 while I do 13 as its load is steadily climbing14:07
@clarkb:matrix.orgI worry it's going to get the bulk of the traffic due to being functional14:07
@fungicide:matrix.orgyeah, we're up around 12 load average so far14:07
@fungicide:matrix.orgactually it's fallen a little14:07
@fungicide:matrix.orgrequests haven't stopped coming in though (at a very rapid clip), so it's not like it fell out of haproxy14:08
@fungicide:matrix.orgload average is now down around 614:09
@fungicide:matrix.orgso it may be reaching a steady state14:09
@clarkb:matrix.orgfungi: oh also if you can write a follow-up change to add the :3081 to the redirect domain stuff that would be great. You can look at the docker-compose.yaml on gitea14 to see what I mean specifically14:10
@fungicide:matrix.orgon it now14:10
@clarkb:matrix.orgotherwise I'll try to do that when I've either beaten ansible or ansible has won14:10
@clarkb:matrix.orgthanks!14:10
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 983875: Support proxy tunnel to Gitea Apache for testing https://review.opendev.org/c/opendev/system-config/+/98387514:14
@fungicide:matrix.orglike that ^ ?14:14
@clarkb:matrix.orgyes14:16
@fungicide:matrix.orgi see pip hitting gitea successfully on gitea1414:18
@fungicide:matrix.orgsomeone fetched /openstack/requirements/raw/branch/stable/2025.2/upper-constraints.txt with a 200 ok response14:19
@fungicide:matrix.orgload average is still down around 7-814:20
@clarkb:matrix.orgI've just got gitea13 up and running and I think it looks good. I'll give it a minute for anyone to object before I put it back into the load balancer14:20
@scott.little:matrix.orgIs there a time based element to these attacks?  Starlingx is finding that around 8:00 am Eastern, you guys are almost always not responding.  Builds at other times are much more likely to pass.14:21
@clarkb:matrix.orgscott.little: yes they come in waves you can see them at https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-2d&to=now&timezone=utc14:22
@fungicide:matrix.orgwhoever's in control of this particular bot army seems to kick off batches which then subside14:23
@clarkb:matrix.orgfungi: I've put 13 into the rotation now. I'll move onto 12 next14:24
@clarkb:matrix.org(if you're able to keep monitoring as I go and/or spot check configs and responses that is much appreciated)14:24
@fungicide:matrix.orgcan do, thanks14:25
@fungicide:matrix.orgi do see reasonable-looking requests getting through to 1314:26
@fungicide:matrix.orgincluding pip fetching constraints files14:26
@clarkb:matrix.org12 is up and appears to be working. I'll add it back to the lb shortly14:32
@tafkamax:matrix.orgTried opendev.org and got the waifu approval14:33
@tafkamax:matrix.orgseems to be working14:33
@fungicide:matrix.orgTaavi Ansper: yeah, for now it will be hit-or-miss depending on which backend you get routed to14:34
@clarkb:matrix.orgfungi: 12 should be back in the lb if you want to spot check it. I'm moving onto 11. It isn't quick because I'm trying to be careful and check things as I go. but I think this must already be helping based on what people are reporting back14:34
@clarkb:matrix.organd it's a bit faster now as I'm not staging each step I'm just going to the end result14:35
@clarkb:matrix.orggitea11 is done now. I've added it to the load balancer again. I noticed that on 12 and 13 I failed to remove the bind mount of the certs. Doesn't affect functionality, but is something that I'll clean up after 10 and 0914:42
@clarkb:matrix.orgfungi: I've also just realized that since I'm doing this manually our two stage deployment from ansible will undo anubis. This is probably fine? But I wanted to mention it as we may want to edit the emergency file appropriately. Let both changes deploy in a noop fashion, then remove hosts from the emergency file and then reenqueue?14:45
@clarkb:matrix.orgI think that makes sense to me but we have time to think about it14:45
@fungicide:matrix.org`RuntimeError: Cannot validate ip address '[::1]'`14:53
@clarkb:matrix.orgI'm also noticing that we do indeed need to rereplicate as not all gitea backends have mnasiadka's change to clean up the ua filter14:53
@mnasiadka:matrix.orgugh, sorry for that14:53
@fungicide:matrix.orglooks like system-config-run-gitea timed out on the anubis change, but the failure shows up in the child change14:54
@fungicide:matrix.orgit may need to be tcp6://?14:54
@clarkb:matrix.orgfungi: that is in the testinfra test? they do have docs that may have clues14:54
@fungicide:matrix.orgyeah, i'm digging14:55
@clarkb:matrix.org10 is back in the rotation. I'm moving to 09 next then will go back and fix the bind mount of certs on 13 and 1214:55
@clarkb:matrix.orgthen maybe we put hosts in the emergency file. Get this into mergeable shape. And trigger replication?14:55
@fungicide:matrix.orgwe could also just drop the `assert anubis.is_listening` check you asked for, we exercise anubis in subsequent tests so it has to be listening anyway for those to pass14:56
@clarkb:matrix.orgya that seems fine for now too14:56
@clarkb:matrix.orgcan add that later when we figure it out14:56
@fungicide:matrix.orgi could move it to a separate change so it doesn't block deployment14:56
@clarkb:matrix.org++14:56
@fungicide:matrix.orgi'll do that now, but fold your requested :3081 in14:57
@fungicide:matrix.orgnow that more of the backends are not burdened with bogus requests, load average across them has fallen drastically14:57
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed:15:02
- [opendev/system-config] 983134: Remove intermediate HTTPS layer for Gitea backends https://review.opendev.org/c/opendev/system-config/+/983134
- [opendev/system-config] 983061: Apply Anubis to the Gitea backend servers https://review.opendev.org/c/opendev/system-config/+/983061
@clarkb:matrix.orgok 09 is done. I'm going to clean up the bind mounts on 13 and 12 now so they will rotate out then back in15:03
@clarkb:matrix.orgfungi: do you want to put the gitea backends in the emergency file?15:03
@clarkb:matrix.orgI don't think that is a rush and I can do it when I'm done with 12 and 13 as well15:03
@fungicide:matrix.orghuh, why did i get a new patchset on 983134 i wonder15:06
@fungicide:matrix.orgweird, i didn't think i changed the parent base when i rebased15:07
@clarkb:matrix.orgok 12 and 13 are done15:07
@clarkb:matrix.orgso all backends should now be running anubis using a config that matches the proposed changes if I have done the manual edits correctly15:07
@clarkb:matrix.orgfungi: I can put things in the emergency file since I've run out of things that need doing immediately15:08
@fungicide:matrix.orgif we can land the two changes for that before daily jobs run, we shouldn't need them in there, right?15:09
@clarkb:matrix.orgfungi: we need them because the first change will undo anubis and then the second will readd it. I guess this may not be critical depending on the state of the system, but not flip flopping would be nice15:10
@fungicide:matrix.orgah, fair15:11
@clarkb:matrix.orgok they are in the emergencyfile15:12
@fungicide:matrix.orgso we would put them in emergency, wait for the deploy on both changes to skip those servers, then take them out of emergency and let the daily deploy cover both changes together15:12
@fungicide:matrix.org(or some subsequent deploy that runs the same job)15:13
@clarkb:matrix.orgyup or maybe even better is after both deployments happen remove the servers from the emergency file and reenqueue the deployment buildset for the second change15:13
@clarkb:matrix.orgso that we don't have to wait until 02:00 to discover if there was an important difference between what I did and what is in ansible15:13
@fungicide:matrix.orgfor that matter, we could take them out of emergency between the first and second deploy15:13
@clarkb:matrix.orgyup, though that may be a tight window15:13
@clarkb:matrix.orgthen once that is done I think we should attempt cleanup of the emergency ua filter stuff since we keep having false positives and see if anubis is sufficient (it should be based on current evidence)15:14
@clarkb:matrix.orgwe can't completely drop ua filters because not all services are behind anubis but we can start with the removal of those rules I think15:14
@fungicide:matrix.orgokay, looks like we do the v6 listening test for keycloak and it just knows that the final `:` is the port separator15:15
@fungicide:matrix.org`keycloak = host.socket("tcp://::1:8080")`15:15
@fungicide:matrix.orgi'll just reinclude it15:16
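The point about the final `:` being the port separator can be illustrated with a small parser. This is a hypothetical sketch of splitting a socket spec at its last colon, not testinfra's actual implementation; `split_socket_spec` is a name made up for this example.

```python
def split_socket_spec(spec):
    """Split "proto://host:port" into (proto, host, port).

    The port is whatever follows the last colon, so a bare IPv6 address
    such as ::1 can be written without brackets.
    """
    proto, sep, rest = spec.partition("://")
    if not sep:
        raise ValueError("missing '://' in %r" % spec)
    host, colon, port = rest.rpartition(":")
    if not colon or not port.isdigit():
        raise ValueError("missing numeric port in %r" % spec)
    return proto, host, int(port)


# The keycloak-style spec quoted above parses cleanly.
parsed = split_socket_spec("tcp://::1:8080")
# A bracketed address keeps its brackets in the host part.
bracketed = split_socket_spec("tcp://[::1]:3081")
```

Under this splitting rule a bracketed spec yields the host `[::1]`, brackets included, which would be consistent with the `Cannot validate ip address '[::1]'` error seen earlier if the host is later validated as a plain IP address.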
-@gerrit:opendev.org- Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [opendev/system-config] 983061: Apply Anubis to the Gitea backend servers https://review.opendev.org/c/opendev/system-config/+/98306115:16
@clarkb:matrix.orgI have approved both changes now15:16
@fungicide:matrix.orgthanks15:16
@fungicide:matrix.orggitea is working quickly and cleanly for me, browsing around and also doing git fetches. i did see the anubis screen flash up momentarily15:18
@mnasiadka:matrix.orgWorks well for me also, I can do some more testing in around an hour - if that’s useful15:19
@fungicide:matrix.org5-minute load averages on all backends are ranging from 0.5-4.5 at the moment15:19
@clarkb:matrix.orgya I think general monitoring for unexpected behaviors is good. I want to audit the manual work I did after a short break to reset the head (and maybe a shower)15:20
@clarkb:matrix.orgjust to make sure I didn't miss anything else like the certs bind mount removal15:20
@fungicide:matrix.orgi may have missed it in scrollback, but did full replication from gerrit get kicked off yet?15:21
@fungicide:matrix.orgif not i can do that next15:21
@clarkb:matrix.orgI did not start that. I think it would be good if you can do that. you can `gerrit show-queue -w` first to confirm it isn't already in progress15:21
@fungicide:matrix.orgonly 27 tasks running in gerrit's queue so i think it didn't15:21
@fungicide:matrix.orgyeah i just checked that15:22
@fungicide:matrix.orgi'll start it, unless there are reasons not to15:22
@clarkb:matrix.orgI don't think so at this point unless you want to audit the manual work I did first15:23
@fungicide:matrix.orglooks like the gerrit ssh api command is `replication start --all`15:23
@clarkb:matrix.orgbut I want a shower before I do that to reset the head and look at it with fresh eyes15:23
@fungicide:matrix.orgi'd rather reduce the window of time people might be pulling outdated git refs, then can look over the backends while that's in progress15:24
@fungicide:matrix.orgworst case we end up running it twice15:24
@clarkb:matrix.orgwfm15:24
@fungicide:matrix.orgrunning now15:26
@fungicide:matrix.org2175 tasks in the gerrit queue15:26
@clarkb:matrix.organy objections if I pop out now and get that shower? Then when I get back I'll review my work15:27
@fungicide:matrix.orgno objection15:28
@clarkb:matrix.orgfungi: if you filter for GiteaHttpLib in the /var/gitea/logs/access.log file you'll see all the internal requests that I think are related to replication15:29
@clarkb:matrix.orgthey appear to have 200 response codes so I think it is working15:29
@fungicide:matrix.orgstatus notice Anubis is now deployed on our Gitea backends, and things are back to working normally though you may notice an Anubis screen flash briefly when starting to browse opendev.org; any jobs which failed prior to 15:00 UTC today can be safely rechecked15:30
@fungicide:matrix.orgthat look reasonable?15:30
@clarkb:matrix.orgyes15:31
@clarkb:matrix.organd I'll pop out now for ~20 minutes or so15:31
@fungicide:matrix.org#status notice Anubis is now deployed on our Gitea backends, and things are back to working normally though you may notice an Anubis screen flash briefly when starting to browse opendev.org; any jobs which failed prior to 15:00 UTC today can be safely rechecked15:32
@status:opendev.org@fungicide:matrix.org: sending notice15:32
@clarkb:matrix.orgactually one last thought: looking at https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-6h&to=now&timezone=utc it is hard to say if the end of the crawling coincided with our anubis deployment though it falls off gradually so I think we were actually pushing back properly. Previous editions were like an on off switch iirc15:32
@clarkb:matrix.orgI guess the frontend apache logs may tell us if we really want to know15:33
@fungicide:matrix.orgmore the haproxy log if we're just trying to gauge request volume15:33
@clarkb:matrix.org++15:34
@clarkb:matrix.orgok really popping out now15:34
@mnasiadka:matrix.orgWell, tomorrow 14:00 CEST will tell us - weirdly it’s sort of start of work time in EST timezone?15:34
@fungicide:matrix.orggerrit's down to 8521 tasks in the queue now15:35
-@status:opendev.org- NOTICE: Anubis is now deployed on our Gitea backends, and things are back to working normally though you may notice an Anubis screen flash briefly when starting to browse opendev.org; any jobs which failed prior to 15:00 UTC today can be safely rechecked15:36
@status:opendev.org@fungicide:matrix.org: finished sending notice15:36
-@gerrit:opendev.org- Zuul merged on behalf of Gregory Thiemonge: [opendev/irc-meetings] 983191: Update meeting chair for Octavia https://review.opendev.org/c/opendev/irc-meetings/+/98319115:45
@fungicide:matrix.orgi'm still seeing a few requests that look like crawlers faking browser-type user agent strings which would have had to solve the anubis challenges, but the volume is low15:49
@fungicide:matrix.orgthough likely an indicator that this workaround will only get us by for so long before they adapt15:49
@fungicide:matrix.orgbut maybe it slows them down enough, in the case of the ones that need to run js to solve challenges and get/redistribute cookies15:50
@fungicide:matrix.orgreplication from gerrit to the gitea backends has completed15:52
@scott.little:matrix.orgWe are trying to find a way to get more reliable git downloads. Would switching from https: to ssh: help? Another suggestion was to pull from review.opendev.org rather than opendev.org.15:55
@fungicide:matrix.orgscott.little: we're trying to provide more reliable git downloads. please don't switch all your fetches to review.opendev.org or we'll end up overloading that server instead15:59
@fungicide:matrix.orgwe also don't have an ssh git endpoint15:59
@jim:acmegating.comremind me why jobs aren't just using zuul required-projects?16:00
@fungicide:matrix.orgthe request volume overall in the past few weeks has simply been too much for the services to keep up with, but we think we have a longer-term replacement in there now16:00
@fungicide:matrix.orgi'm assuming scott.little isn't talking about zuul jobs, but yes if this is in the context of jobs running in zuul then setting their required-projects list appropriately will make the git access local on the test nodes rather than over the internet16:02
@jim:acmegating.comack.  i know there were a lot of job failures.  i agree it's unclear if scott.little was asking in that context, but it seems regardless there may be at least some cases where the question may be relevant.16:03
@clarkb:matrix.orgright I think to summarize the OpenDev team is doing what it can to make the opendev.org git services as reliable as possible. The solution to them not being up to the level of reliability desired is to join us and help make things better. I tried explaining this earlier today in another channel. But I have something like 8 jobs/roles right now? OpenDev sysadmin, opendev service coordinator, zuul maintainer, zuul community manager, I have been massaging contribution metrics with the switch to lfx, I also do CI engineering (hand wave around that one). I maintain python packaging systems and container images that are critical to many of our software projects and services. The list goes on and on. I'm spread incredibly thin. I know fungi is too. The problem isn't that this can't be solved. The problem is that we are currently demanding far too much of the people who care enough to try16:04
@scott.little:matrix.orgnot zuul, just the 'repo sync' that our designers tend to do in the morning to update their 80ish gits16:04
@fungicide:matrix.orgyes, in openstack a lot of teams ended up recently merging a regression called "precommit" which apparently only knows how to pull things from remote git urls16:04
@clarkb:matrix.orgfor a concrete example anubis has been on our "let's do this" list for a few days but we simply haven't had any time to push it forward due to all the firefighting. Eventually things got bad enough that we said screw it and just applied it manually and are now going to sync up config management after the fact16:05
@jim:acmegating.comfungi: wow, that sounds very wrong.  is there an architecture document or spec or something?16:06
@fungicide:matrix.orglooking at what user agents are hitting the gitea backends now that anubis is in place, gitea09 and gitea13 apparently are handling very high volumes of requests from git clients (in the 15z hour, 20656 from "git/2.43.5" on gitea09 and 46400 from "git/2.47.3" on gitea13)16:06
@clarkb:matrix.orgOpenDev has always been built on the idea that the hosted projects would get involved and help maintain OpenDev too. OpenStack and Zuul have regularly had overlap with the OpenDev team but starlingx has not. If there is interest in getting more involved I'm happy to help point people in the right direction (that's the service coordinator hat I wear talking)16:06
@fungicide:matrix.orgthe other backends don't exhibit this pattern, so likely coming from one or a handful of ip addresses16:07
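The per-backend user agent tally described above can be sketched roughly as follows. This is a minimal illustration, assuming the backend access logs use an Apache combined-style format where the user agent is the last quoted field; the sample lines, IP addresses, and request paths are made up for the example and are not taken from the real logs.

```python
import re
from collections import Counter

# Assumption: combined-style log lines ending in "referer" "user-agent".
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user agent string across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample data for illustration only.
sample_lines = [
    '203.0.113.5 - - [09/Apr/2026:15:01:02 +0000] "GET /a/info/refs HTTP/1.1" 200 512 "-" "git/2.43.5"',
    '203.0.113.5 - - [09/Apr/2026:15:01:03 +0000] "POST /a/git-upload-pack HTTP/1.1" 200 9000 "-" "git/2.43.5"',
    '198.51.100.7 - - [09/Apr/2026:15:01:04 +0000] "GET /b HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
]

ua_counts = count_user_agents(sample_lines)
for agent, hits in ua_counts.most_common():
    print(hits, agent)
```

Running the same tally per backend and comparing the top entries is one way to spot a backend that receives an outsized share of requests from a single client version.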
@fungicide:matrix.orgcorvus: no, somebody pushed a bunch of changes to openstack repositories adding configuration for this "precommit" tool and then setting up tox to run that in order to do things like linting, and it does its own dependency installation. i think it originated in the golang ecosystem where everyone refers to dependencies with git urls16:08
@fungicide:matrix.orgi'm not that familiar with it, just trying to get to the bottom of what happened there16:09
@fungicide:matrix.orgscott.little: approximately what time of day are the designers running this repo sync task, and is there a central cache they share or are they doing it independently?16:10
@scott.little:matrix.orgoh boy, we are spread thin too, but if opendev is open to adding a head from the StarlingX pool of talent, I can put that forward at our next team meeting.16:10
@scott.little:matrix.orgindependently. no cache. The North American members would likely try to update around 8-9 am Eastern16:11
@jim:acmegating.comfungi: it seems like a dependency of an opendev/openstack job should always be installed via zuul, never directly from gitea or gerrit.  this seems like a major flaw in both job design and systems design.  we were very careful to avoid that situation when originally designing the jobs.16:12
@fungicide:matrix.orgwell, in this case they replaced pulling dependencies from pypi to pulling them from git16:13
@jim:acmegating.comthey were pulling openstack dependencies from pypi?16:14
@clarkb:matrix.orgcorvus: yes we had this discussion in the openstack tc channel yesterday16:14
@clarkb:matrix.orgcorvus: there was some indication that this isn't possible with precommit which I argued against as a git repo and git commit is consistent across locations16:14
@fungicide:matrix.orgcorvus: looks like it's generally just the "hacking" tool, e.g. https://opendev.org/openstack/nova/src/commit/622a015/.pre-commit-config.yaml#L4816:14
@jim:acmegating.comyeah, i mean, git repos can be at file urls...16:15
@mnasiadka:matrix.orgI didn’t add up to the discussion, but it sounds weird precommit can’t do file repo urls16:16
@jim:acmegating.commaybe we need to re-socialize some of the basic rules for setting up zuul jobs: never fetch from opendev.org and, really, really, never fetch from review.opendev.org.16:17
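As a rough illustration of the "git repos can be at file urls" point, the sketch below maps a remote repository URL to a file URL for a checkout that Zuul has already placed on the test node via required-projects. It assumes the checkouts live under a `src/<hostname>/<project>` tree; the `/home/zuul/src` root and the example URL are assumptions for illustration, and whether the precommit tool can consume such a URL is exactly the open question in this discussion.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_repo_url(remote_url, src_root="/home/zuul/src"):
    """Map a remote git URL to a file:// URL under a local source tree.

    src_root is an assumed location for the job's prepared checkouts.
    """
    parsed = urlparse(remote_url)
    path = parsed.path.lstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    local_path = PurePosixPath(src_root) / parsed.netloc / path
    return "file://" + str(local_path)


# Hypothetical example URL, used only to show the mapping.
example_local = local_repo_url("https://opendev.org/openstack/hacking")
print(example_local)
```

Because a git commit hash identifies the same content regardless of where the repository lives, a pinned revision stays valid after the URL is swapped for the local one.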
@fungicide:matrix.org(we do have some rare exceptions to the latter, like proposal jobs that need to check whether there's a pending change in gerrit that hasn't merged in order to decide whether to push a new change or a revision to the existing one)16:21
@jim:acmegating.comyep.  and those jobs and exceptions were designed in consultation with the folks running the service.  :)16:21
@fungicide:matrix.org(which we could maybe figure out from named refs in a full local mirror of the repo instead? not sure)16:21
@scott.little:matrix.orgis there a doc on how a project can get involved in maintaining opendev infrastructure?16:21
@clarkb:matrix.orgI've done my audits: all 6 backends have 5 containers running (gitea web, gitea ssh, anubis, mariadb, and memcached). All 6 giteas have what appears to be a consistent anubis config. All 6 giteas have the updated protocol http app.ini related changes. All 6 appear to have updated apache vhost configs. And all 6 have the anubis update to docker-compose.yaml and the correct bind mounts for gitea-web16:22
@clarkb:matrix.orgscott.little: https://docs.opendev.org/opendev/system-config/latest/open-infrastructure.html this is a good overview of how things are built (though it may be out of date at times) and includes some pointers at contributing from there16:23
@fungicide:matrix.orgother than the outsized volume of git requests getting balanced to the 09 and 13 backends, everything seems to be operating normally and even those two are still handling the increased volume gracefully16:23
@clarkb:matrix.orgwe also maintain a set of spec documents and help wanted info here: https://docs.opendev.org/opendev/infra-specs/latest/16:23
@clarkb:matrix.orgincluding a less formal etherpad link with lists of things we'd like to do that haven't been formalized or may not need to be formal because they are more straightforward16:23
@fungicide:matrix.orgi'm going to take this opportunity to grab a much-belated shower now that Clark is back16:24
@fungicide:matrix.orgbrb16:24
@mnasiadka:matrix.orgNow that the dust is sort of settled - Clark - I started working on the ord.rax mirror replacement/upgrade (what was funny the image that has OpenDev in the name was not booting, but the Cloud 24.04 image did)16:28
@jim:acmegating.comit's like we only have one shower16:28
@clarkb:matrix.orgmnasiadka: the opendev image needs to be booted with the --config-drive flag to launch node in that cloud16:29
@clarkb:matrix.orgmnasiadka: cloud init doesn't know how to use metadata in that cloud or they don't supply it or something. But config drive works16:29
@clarkb:matrix.orgthe gerrit 3.11->3.12 upgrade test job failed on the gitea http transition change16:30
@clarkb:matrix.orgIt looks like the project index lock wasn't removed when gerrit 3.11 shutdown so 3.12 couldn't reindex it on startup16:30
@clarkb:matrix.orgI haven't seen that before so I think we can probably recheck it16:30
@mnasiadka:matrix.orgClark: Is it fine with using RAX 24.04 image or should I resort back to the OpenDev one?16:31
@clarkb:matrix.orgmnasiadka: I suspect it is fine. In the past we have typically used the cloud supplied image but this time noble images were not uploaded quickly so we ended up uploading our own to all clouds16:32
@mnasiadka:matrix.orgAh, ok - so let me continue with that and see if there are any problems16:33
@clarkb:matrix.orgsounds good16:33
@clarkb:matrix.orgLooks like gerrit replication completed. I'm going to check each of the backends has the latest system-config commit now16:33
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 983911: Add mirror03.ord.rax https://review.opendev.org/c/opendev/system-config/+/98391116:35
@clarkb:matrix.orgyes https://opendev.org/opendev/system-config/commit/b3e229b67919c540809289838ffea48969f6b324 seems to be present on each backend now16:35
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 983912: Add gerrit plugin for openclaw https://review.opendev.org/c/openstack/project-config/+/98391216:36
@mnasiadka:matrix.orgClark: the mirror volume in ORD seems to be 256 compared to 200 in OVH - should it stay 256?16:36
@mnasiadka:matrix.org(size)16:36
@clarkb:matrix.orgmnasiadka: yes. The reason for that is that the http proxy cache pruner can't keep up on those volumes like it can on others so we give it more headroom (this is a classic case of tribal knowledge that should be written down somewhere probably)16:37
@clarkb:matrix.orgmnasiadka: I would mimic the existing server for this reason. Basically we need more headroom so that http cachecleaning works. And apologies for the tribal knowledge delta16:38
@fungicide:matrix.orgmordred: last week someone proposed adding an openclaw-as-a-service project to openstack too16:38
@mnasiadka:matrix.orgClark: I like tribal knowledge, it's everywhere :)16:38
@mnasiadka:matrix.orgOk, volume mounted to mirror03.ord.rax, server rebooting - https://review.opendev.org/c/opendev/system-config/+/983911 (and depends-on) need to be merged for it16:43
@clarkb:matrix.orgcool. I should probably take a minute to update my todo list with all the gitea related stuff and also changes like that.16:44
@clarkb:matrix.orgI'll try to get to them. But not sure where they are on the priority list at the moment16:44
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 983911: Add mirror03.ord.rax https://review.opendev.org/c/opendev/system-config/+/98391116:45
@mnasiadka:matrix.orgno worries, I just aim to be done with all mirrors this week16:46
@mordred:waterwanders.comfungi: hah, really? I mean, it's not a bad idea, I've been noodling on an aaS myself, but too many ideas not enough time16:47
@mordred:waterwanders.comfungi: I've got a really nice matrix+gerrit+zuul+openclaw setup going that I'm starting to think would make a good conference talk at some point. they all work together _really_ well16:48
@fungicide:matrix.orgmordred: nijaba is apparently working at the company behind openclaw these days, so proposed it16:48
@mnasiadka:matrix.orgAh, there's one more thing - the reprepro job for Ubuntu mirroring is failing on missing bionic things?16:49
"Error: packages database contains unused 'bionic-backports|main|amd64' database."
@mnasiadka:matrix.orgAnd that sort of broke kolla-build jobs due to some mirror inconsistency16:49
@clarkb:matrix.orgmnasiadka: I think that was fallout of removing the bionic reprepro config. There is a manual step we need to run. I thought it would just warn us but I guess it didn't16:50
@clarkb:matrix.orgmnasiadka: let me get some documentation and maybe that is something we want to work through together?16:50
@mnasiadka:matrix.orgClark: sure :)16:50
@mnasiadka:matrix.orgmore tribal knowledge, the better16:50
@clarkb:matrix.orgno this one is actually documented I just haven't had time to do it with all the fires16:51
@clarkb:matrix.orgbut I have to find the links first16:51
@clarkb:matrix.orgmnasiadka: we need to run the process described here for the ~3 reprepro repos that had bionic removed: https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components All of that can and should be done with the mirroring keytabs for authentication. But generally openafs can also be accessed directly with your own account: https://docs.opendev.org/opendev/system-config/latest/afs.html But I think you can ignore this for that particular issue16:52
@clarkb:matrix.orgI'll jump on mirror update now and see if I can grab the locks we need to grab. https://review.opendev.org/c/opendev/system-config/+/983221 shows the things that need cleanup16:53
@clarkb:matrix.orgactually there is a step there that involves manually deleting files using your own kinit aklog so we do need that16:55
@clarkb:matrix.orgok I have started a root screen on mirror-update03 and grabbed the apt-puppetlabs, ubuntu-cloud-archive, and ubuntu locks16:57
@clarkb:matrix.orgI'm going to start this process against apt-puppetlabs now16:59
@clarkb:matrix.orghrm the apt puppetlabs cleanup cleaned up more than I expected implying that the expectation that there be a warning not an error may be valid? Still the cleanup needs to be done anyway so I'll keep going17:02
@mnasiadka:matrix.orgOk, attached to screen17:10
@clarkb:matrix.orgmnasiadka: cool window 0 has the overview, windows 1,2,3 are where I held specific locks and are where I will perform operations for each of the mirrors. and i just logged into openafs in window 4 to do the next step of apt-puppetlabs cleanup17:12
@mnasiadka:matrix.orgSure, following :)17:13
@clarkb:matrix.orgI'm taking my time since I don't remember the last time I've done this :)17:14
@clarkb:matrix.orglooks like we have an expired key which is why that instantaneously returned to us17:19
@clarkb:matrix.orgfungi: are you back yet?17:19
@clarkb:matrix.orgI think we need to find whatever key apt puppetlabs is using now and add it to the keychain then use it17:20
@clarkb:matrix.orgThere are a number of keys at https://apt.puppetlabs.com/ and I'm not sure which is current. I think I may leave this one in this state for now and continue to cloud archive17:21
@clarkb:matrix.orgI suspect that if it was broken before and no one was complaining that this isn't urgent to fix17:21
@fungicide:matrix.orgthanks! i was planning to run the bionic mirror cleanup steps but obviously distracted17:22
@clarkb:matrix.orgfungi: maybe you want to look at the apt puppetlabs stuff? It looks like UCA was working until that change broke it so I think it is happier and it's good for me to practice17:23
@clarkb:matrix.orgfungi: see windows 0, 1, and 4 in the root screen on mirror update for where we ended up but tldr is after deleting everything including directly via rm after aklog the run of reprepro fails because the gpg key is expired17:24
@clarkb:matrix.orgso it needs a new key and a rerun I guess17:24
@clarkb:matrix.orgok UCA is done and it appears to have been happy. I'm going to proceed with the big one, Ubuntu proper, now17:33
@clarkb:matrix.orgfungi: also should I be doing an intermediate vos release for ubuntu if we've already downloaded stuff that isn't in a happy state?17:35
@clarkb:matrix.orgfungi: I'm worried the intermediate release may put us into a spot where things are broken until we do the final release. Maybe I should just do one release at the end?17:35
@fungicide:matrix.orgi would not manually run `vos release` for those volumes, no17:36
@fungicide:matrix.orgthe mirror script will skip it if reprepro failed17:37
@clarkb:matrix.orgok our docs at https://docs.opendev.org/opendev/system-config/latest/reprepro.html#removing-components say to do it. I'll skip the one in the middle and just run the mirror script after the cleanups and let it come to a fully happy state before proceeding17:37
@clarkb:matrix.orgnote I did do the intermediate release for apt-puppetlabs and for ubuntu-cloud-archive. Ubuntu-cloud-archive is completely done now and happy though and apt-puppetlabs is broken but I'm not sure how much that matters?17:38
@clarkb:matrix.organyway I'll proceed with that plan for ubuntu (no intermediate release after clearvanished and deleteunreferenced)17:38
@clarkb:matrix.orgmnasiadka: and I realize I'm sort of skimming the highlights here. If you want we can have a call or a focused discussion on how mirrors and openafs are done17:40
@clarkb:matrix.orgI just figured since you noted things were broken I should get around to doing this already17:40
@mnasiadka:matrix.orgClark: no worries, I've been noting stuff as you go and there are some docs - I think that's enough information for today, but happy to gain some more knowledge on some more peaceful day :)17:41
@clarkb:matrix.orgsounds good, and thanks again for putting up with the craziness. It has been a week17:41
@mnasiadka:matrix.orgHopefully Anubis should get us more planned activities and less fire drills17:42
@clarkb:matrix.org++17:42
@clarkb:matrix.orgthinking about apt-puppetlabs more: I'm wondering if we should just remove it? Do we know if it is used anywhere? Presumably it hasn't updated in some time if the gpg key expired17:57
@mnasiadka:matrix.orgClark: I commented on the Prometheus patch - sorry it took so long ;)17:58
@clarkb:matrix.orgthanks! and I won't throw stones :) I understand we've all been super distracted and busy. As you said hopefully we've made a positive change towards keeping this under control going forward17:58
@mnasiadka:matrix.orgOk, enough for today - it's 8pm here :)17:59
@clarkb:matrix.orggood night!17:59
@mnasiadka:matrix.orgI see the mirror script is running now, so hopefully everything should be fine17:59
@clarkb:matrix.orgyes it looks like it is doing what is expected of it. Finding new packages etc18:00
@clarkb:matrix.orgfungi: I think I will rerun the mirror script by hand for both uca and ubuntu just to be sure that they are happy with back to back syncs. Then I'm not sure what to do about apt-puppetlabs18:00
@fungicide:matrix.orgtkajinam likely knows if that mirror is still used18:00
@fungicide:matrix.orgthe apt-puppetlabs package mirror i mean18:01
@clarkb:matrix.orgwhat I have done for apt-puppetlabs is clear vanished, delete unreferenced, then manually rm the dists/ and lists/ content for stretch, xenial, and bionic as documented in our docs. Then did the intermediate vos_release. Running the actual script fails as the key is expected18:01
@clarkb:matrix.org* what I have done for apt-puppetlabs is clear vanished, delete unreferenced, then manually rm the dists/ and lists/ content for stretch, xenial, and bionic as documented in our docs. Then did the intermediate vos_release. Running the actual script fails as the key is expired18:01
@clarkb:matrix.orghttps://apt.puppetlabs.com/ lists a number of keys and I'm not sure which is valid now, so presumably we can have a change to update the key and it will automatically pick up from there. Or I can leave the lock held and we try to manually fix it then catch up with the system-config updates to match?18:02
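The order of operations described above can be written down as a small plan builder. This is only a sketch under stated assumptions: `clearvanished` and `deleteunreferenced` are real reprepro actions and `--confdir` is a real reprepro option, but the configuration directory shown is a hypothetical path, the commands are only listed and never executed, and the documented procedure (Kerberos/AFS authentication, locks, the exact files to remove) is not reproduced here.

```python
def cleanup_plan(confdir, removed_releases):
    """Return ordered (description, argv or None) steps for a release removal.

    Steps with argv=None are manual or handled by existing tooling.
    """
    steps = [
        ("drop package databases for distributions no longer configured",
         ["reprepro", "--confdir", confdir, "clearvanished"]),
        ("delete pool files no longer referenced by any distribution",
         ["reprepro", "--confdir", confdir, "deleteunreferenced"]),
    ]
    for release in removed_releases:
        steps.append(
            (f"manually remove dists/ and lists/ content for {release}", None)
        )
    steps.append(("rerun the regular mirror script", None))
    return steps


# Hypothetical confdir path, used for illustration only.
plan = cleanup_plan("/etc/reprepro/apt-puppetlabs", ["stretch", "xenial", "bionic"])
for number, (description, argv) in enumerate(plan, start=1):
    command = " ".join(argv) if argv else "(manual step)"
    print(f"{number}. {description}: {command}")
```

The intermediate volume release is deliberately left out of the plan, following the advice earlier in the discussion to let the mirror script handle releases itself.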
@clarkb:matrix.orglooks like Ubuntu is solving the different versions of rust problem by having 30 versions of rust available18:02
@fungicide:matrix.orgi'll see if i can track it down18:04
@clarkb:matrix.orgI want to say `/*stdin*\ : Read error (39) : premature end` is ok and expected?18:04
@clarkb:matrix.orgwe just got a few of those18:05
@fungicide:matrix.orgyeah, those are common and benign afaik18:05
@fungicide:matrix.orgunrelated, argh, system-config-run-gitea is probably going to time out on 983134 this time18:07
@clarkb:matrix.orgmaybe we direct enqueue it to the gate if that happens18:07
@clarkb:matrix.orgI wonder if anubis is making things slower though18:07
@clarkb:matrix.orgits an extra layer of processing for all the thousands of requests we make to create projects and so on18:08
@clarkb:matrix.orgit is running testinfra tests it will be close I bet18:09
@clarkb:matrix.orgfungi: it succeeded!18:12
@clarkb:matrix.orgtwo changes are now in the gate. The other thing to note is after anubis is in place we should monitor the first project-config update that creates a new project or modifies one. This is well covered in the gate testing, but it's worth making sure it is happy when we do one for the first time just due to how much editing around that was done18:14
@fungicide:matrix.orgjust barely in under the wire18:14
@clarkb:matrix.orgreprepro must be in its validate all the things are reachable stage and it isn't fast nor does it log any info about its progress18:22
@clarkb:matrix.orgoh we're swapping. That's not great. Multiple overlapping reprepro runs will do that I guess18:25
@clarkb:matrix.orgubuntu-ports and debian both seem to have started while ubuntu was going18:25
@clarkb:matrix.orgin theory it does things like this every few hours so its probably fine18:25
@fungicide:matrix.orgyes, i usually refrain from manually running more than one in parallel when doing a large catch-up sync, just because also that's a lot of churn in afs18:26
@clarkb:matrix.orgwell I am only manually running one. But cron started the other two18:26
@fungicide:matrix.orgaha, yes hopefully those are for things that won't take long18:27
@fungicide:matrix.orgif the delta is small18:27
@fungicide:matrix.orgi'm going to go grab a long-overdue lunch, but when i get back i'll try to track down the correct/newer apt-puppetlabs keys and we can try to reenqueue the anubis deploy buildset with the gitea backends out of the emergency file again?18:29
@fungicide:matrix.org(assuming those changes merge and don't have to get rechecked)18:29
@clarkb:matrix.orgsounds like a plan18:29
@fungicide:matrix.orgokay, back in a while18:30
@clarkb:matrix.orgI'm going to continue to try and finish up the reprepro bionic cleanups for UCA and Ubuntu in the meantime18:30
@clarkb:matrix.orgdebian completed as did debian security so now it is just ubuntu and ubuntu ports18:38
-@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [openstack/project-config] 983924: Add an OpenClaw plugin for Zuul integration https://review.opendev.org/c/openstack/project-config/+/98392418:45
@clarkb:matrix.orgmordred: maybe your new plugin will be the guinea pig after we get the anubis config applied via ansible atop what I did manually18:45
@mordred:waterwanders.comClark: I love being a guinea pig :)18:46
@mordred:waterwanders.comClark: I'm guessing that's more about the zuul api interaction and less about the gerrit one, yeah?18:47
@clarkb:matrix.orgits the gitea api interaction. We're in the middle of converting gitea to listen on http instead of https to make anubis simpler. I manually applied the entire anubis change to the giteas already due to a ddos, but the changes to apply to prod are happening soon (they are in the gate)18:48
@clarkb:matrix.organd all of the machinery to create projects and all that in gitea that was using https now needs to use http18:48
@clarkb:matrix.orgthere is good coverage of this in the gate too we actually create a bunch of empty projects based on the prod projects.yaml file so it should just work. but it's good to confirm and your new project should do that for us18:49
@clarkb:matrix.orgonly ubuntu mirroring is running now. All the others appear to have succeeded so I don't think we got OOMKillered or anything19:24
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 983134: Remove intermediate HTTPS layer for Gitea backends https://review.opendev.org/c/opendev/system-config/+/98313419:30
@clarkb:matrix.orgok here we go. The 6 backend nodes are in the emergency file and that should noop19:30
@clarkb:matrix.orglooks like we did enqueue manage-projects which should also noop I think19:31
@clarkb:matrix.orgyes the manage projects playbook respects disabled lists19:35
@mnasiadka:matrix.orgI forgot to tell - had a look in the Resolute build issues and yes, /opt/dib_tmp is full when the job ends up running in rax dfw (and maybe also somewhere else) - any ideas what to do other than making the image smaller?:)19:35
@clarkb:matrix.orgmnasiadka: we could restrict the builds to the nested virt labels. I think they all have larger disks without the extra ephemeral drive. But otherwise ya I think we're looking at optimizing the builds themselves. Maybe doing vhd conversion first if it isn't already first so that there isn't a qcow2 and a raw image already there while we do the two step vhd conversion19:37
@clarkb:matrix.orgok service-gitea succeeded very quickly and my tail of syslog on gitea09 showed no ansible access19:37
@clarkb:matrix.orgI think we may actually be able to remove the nodes from the emergency file if manage-projects completes before the second change merges19:38
@clarkb:matrix.orgfungi: I don't know if you are back yet, but I think I will do that if the timing works out. Just one less thing to wait for this way19:38
@clarkb:matrix.orgyup deploy for the first change just succeeded and the second has not merged yet. I'll remove the hosts from emergency now19:38
@clarkb:matrix.orgdone19:39
@clarkb:matrix.orgif the second change merges it should deploy normally19:39
@clarkb:matrix.orgubuntu reprepro completed and now the vos release is running19:39
@clarkb:matrix.orgI've never worked in a kitchen but I imagine this feels similar. I've got vos release simmering over there. I'm waiting for the sauce someone else is cooking to finish (zuul and the anubis change) so that I can then plate something else I've already prepped. Except I bet its a lot harder and more physically exhausting in a kitchen19:42
-@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 983061: Apply Anubis to the Gitea backend servers https://review.opendev.org/c/opendev/system-config/+/98306119:43
@clarkb:matrix.orgok here we go. I'm monitoring gitea0919:43
@clarkb:matrix.organsible has started running on 0919:46
@clarkb:matrix.orggitea09 and 10 have both had their ansible runs. We reloaded apache on both implying the vhost file wasn't an exact match. I tested gitea09 via socks proxy through the load balancer and it seems to work so I don't think this is an issue19:49
@clarkb:matrix.orgnone of the containers were restarted which is what I expected since app.ini and the images shouldn't have changed19:49
@clarkb:matrix.orgall that to say I think this is working as expected but I'll test 10 now then look at the vhost files19:49
@clarkb:matrix.orgyup 10 works too so I don't think the apache reload is a problem19:50
@clarkb:matrix.orgok I don't see anything obviously different with the vhost. Maybe just whitespace?19:51
@clarkb:matrix.orgonce the job completes I'll test the other 4 backends. Then I want to start pushing up some followups that I've been thinking about19:51
@clarkb:matrix.orgalso `Released volume mirror.ubuntu successfully` this happened. I'm going to manually rerun UCA now then when that is done I'll rerun ubuntu19:52
@clarkb:matrix.orgUCA is done and ubuntu is running now19:54
@clarkb:matrix.orghttps://zuul.opendev.org/t/openstack/buildset/aa4d885597f648e1a011322cfd83ed5b deployment reports success19:56
@clarkb:matrix.orgall backends work when accessed via socks tunnel19:56
@clarkb:matrix.organd the frontend seems to work for me19:57
@clarkb:matrix.orgall backends are up according to the haproxy show stats command as well19:57
@fungicide:matrix.orgokay, catching back up now20:00
@fungicide:matrix.organd yeah, looks like i just missed the exciting non-event of the changes rolling out in deployment?20:01
@fungicide:matrix.orggood deal20:01
@clarkb:matrix.orgfungi: once this ubuntu rerun finishes and if it is successful I will unlog -cell openstack.org and kdestroy in the screen window 4 if that is safe and won't affect the mirroring stuff. `tokens` only reports one token so I think it is20:01
@clarkb:matrix.orgyup it all seems to be working as expected according to my testing20:02
@clarkb:matrix.orgthe main thing we haven't checked is new project creation. Mordred has a new project proposal if we want to proceed with that now. Though I think I want to finish up with the mirror cleanups and then push some changes I have in mind so I don't forget20:02
@clarkb:matrix.orghttps://mirror.dfw.rax.opendev.org/ubuntu/dists/ but bionic is gone from here20:03
@fungicide:matrix.orgthe kitchen analogy is an apt one20:03
@clarkb:matrix.orglooking at https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc I think we freed around 200GB of disk20:03
@clarkb:matrix.orgI've also been collecting terminal windows like pokemon20:05
@fungicide:matrix.orggotta catch 'em all20:09
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 983929: Pin anubis container image to v1.25.0 https://review.opendev.org/c/opendev/system-config/+/98392920:11
@clarkb:matrix.orgok change 1 consider this an RFC20:11
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 983930: Cleanup Apache UA filters https://review.opendev.org/c/opendev/system-config/+/98393020:14
@clarkb:matrix.orgchange 2 again I'm happy to consider this an RFC20:14
@clarkb:matrix.orgI also don't think any of these are super urgent. I just want to have an easy reminder20:14
@fungicide:matrix.orgi agree we could stand to compress and maybe clear out a lot of our ua filter20:15
@fungicide:matrix.orgit's mostly a disorganized pile of rules added in haste20:15
@clarkb:matrix.orgya I did a compression and simplification at one point but it was nowhere near complete20:16
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 983931: Increase the gitea mysqld connection limit https://review.opendev.org/c/opendev/system-config/+/98393120:18
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 983932: Limit the gitea haproxy connection limit down to 10k https://review.opendev.org/c/opendev/system-config/+/98393220:22
@clarkb:matrix.orgok I'm looking for feedback on all of those. I don't feel strongly about them landing at all or landing with the specific values chosen20:22
@fungicide:matrix.orgso was it just the apt-puppetlabs repo that needed troubleshooting still?20:28
@clarkb:matrix.orgfungi: yes20:29
@clarkb:matrix.orgfungi: if you join the root screen on mirror update you can see what I did in window 1 and window 4 (though it's at the very beginning of window 4 scrollback)20:29
@clarkb:matrix.orgI followed our documentation up to the point where you have to rerun the normal reprepro mirror script and that immediately failed with a key is expired error20:29
@clarkb:matrix.orgyou can see that error in the usual log file location20:29
@clarkb:matrix.orgwindow 1 is also where I'm holding the lock20:30
@clarkb:matrix.orgfungi: I am still holding the uca and ubuntu locks. ubuntu is still rerunning its second pass. When that second pass finishes I will close windows 2 and 3 and drop the locks for them. Then I will unlog -cell openstack and kdestroy in window 4 to undo my auth there20:31
@clarkb:matrix.orgbut I can keep windows 0, 1, and 4 open if it helps with apt-puppetlabs debugging (and it will hold the lock for apt-puppetlabs)20:31
@fungicide:matrix.orgi think to avoid confusion, i'll wait until you clean up, feel free to exit the screen session and i'll hold a new lock under a fresh one just for the puppetlabs mirror20:44
@clarkb:matrix.orgok that works for me20:45
@clarkb:matrix.orgstill waiting for ubuntu to finish the second pass20:45
@fungicide:matrix.orgi mainly just need to check the signature verification error to see what key fingerprint it sees being used, find and double-check the official full copy of that key, maybe hand-apply and test it quickly on the server, then push a change up to add it once i'm sure it's what we need20:45
@clarkb:matrix.orgmnasiadka: so I'm looking at mirror03.ord.rax.opendev.org. IP addresses lgtm. I thought that there was no volume attached yet and checked via openstack volume list (which showed it). Then got really confused. Then realized that I was doing mount | grep mapper and cat /etc/fstab on mirror-update03. Notice that mirror03 and mirror-update03 look similar to tired eyes :) anyway that was my fault not yours. I think your changes look good and I'll +2 them once I confirm the changes themselves and not just the server20:48
@clarkb:matrix.orgactually just found a small but important issue with the dns change so I -1'd that one. But it's an easy fix20:52
@fungicide:matrix.orgchecking the end of /var/log/reprepro/apt-puppetlabs.log i see `VerifyRelease condition '9E61EF26' lists expired key '4528B6CD9E61EF26'.` so it's possible we're mirroring packages from an index signed by a key that's now expired and we merely need to do what's on the tin20:55
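The log line quoted above carries both the short key ID from the VerifyRelease condition and the full ID of the expired key. A minimal sketch in Python of pulling those two values out of such a line; the function name `parse_expired_key` is hypothetical, and the pattern is based only on the single message quoted here, so other reprepro messages may be worded differently.

```python
import re

# Pattern derived from the one log line quoted in the chat; this is an
# assumption, not a documented reprepro log format.
EXPIRED_KEY_RE = re.compile(
    r"VerifyRelease condition '([0-9A-Fa-f]+)' lists expired key '([0-9A-Fa-f]+)'"
)


def parse_expired_key(line):
    """Return (condition_id, expired_key_id), or None if the line does not match."""
    match = EXPIRED_KEY_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


sample = "VerifyRelease condition '9E61EF26' lists expired key '4528B6CD9E61EF26'."
result = parse_expired_key(sample)
if result is not None:
    condition_id, key_id = result
    # Check whether the short condition ID is the tail of the full key ID.
    print(condition_id, key_id, key_id.endswith(condition_id))
```

Comparing the short ID against the tail of the full ID is a quick sanity check that the configured condition and the reported key refer to the same key.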
@clarkb:matrix.org`com.rackspace.servermill.failed_reason='auth_failure', rax_service_level_automation='Build Error'` I also see this against server show mirror03.ord.rax.opendev.org. This is a mirror so I'm not super concerned about proceeding with it. But figured I'd mention it20:56
@clarkb:matrix.orgthe server is up and running so it didn't completely fail20:56
@clarkb:matrix.orgI just remembered that we're upgrading gerrit on sunday20:57
@clarkb:matrix.orgI'm like 95% certain the prep work I've done is fairly complete so I don't think that changes after the rest of this week. More of a "oh ya I have to do that too"20:57
@clarkb:matrix.orgthis reprepro export is taking longer than the last one. I think that has to do with the reprepro for ubuntu ports overlapping at this point in the process?21:04
@clarkb:matrix.orgI will attempt to practice patience21:04
@fungicide:matrix.orglikely. if i need to clock out before it completes i can pick up apt-puppetlabs in the morning too21:04
@clarkb:matrix.orgfungi: should I leave it locked in that case?21:05
@clarkb:matrix.orgI guess it's going to noop each time it runs anyway so probably fine to unlock21:05
@fungicide:matrix.orgit's fine to let it go back to running in the meantime21:06
@fungicide:matrix.orgit's also a very small repository so doesn't run for long21:06
@fungicide:matrix.orgi think puppetlabs didn't get the memo that gpg signatures are meant to verify the contents of your package repository haven't been tampered with, so serving that as https://apt.puppetlabs.com/pubkey.gpg is sort of silly21:11
@fungicide:matrix.orgbut easy to find i guess21:11
@clarkb:matrix.orgfungi: if you are still near that screen session: does the klist output show only my principal and then the tickets it has? or are those service principals for mirror/reprepro operations?21:14
@clarkb:matrix.org(if you know)21:14
@clarkb:matrix.orgmy concern is that if I kdestroy I might impact some running process on the mirror updater21:15
@fungicide:matrix.orgi really don't know. i guess you could wait ~20 hours until they expire21:16
@clarkb:matrix.orgya if I was smart I would've su'd back to my personal user21:16
@clarkb:matrix.orgso that there was a clear distinction in ownership of credentials21:16
@fungicide:matrix.orgthe `Default principal:` lists your admin account, so i think those are just you21:17
@fungicide:matrix.orgshould be safe?21:17
@clarkb:matrix.orgyes I think it is saying that default principal has these two service principal tickets?21:17
@clarkb:matrix.orgthe first I assume is from kinit and the second from aklog21:17
@fungicide:matrix.orgthat's how i interpret it, but even my beard is not grey enough to have 100% confidence in kerberos matters21:17
@clarkb:matrix.orgactually man klist says you can pass -k to look at specific keytabs and since we use keytabs for the mirror operations I think this is a correct interpretation21:18
@fungicide:matrix.orgworst case, some mirrors stop updating (again), which is unlikely but also not the end of the world21:18
@clarkb:matrix.orgso it should be safe to unlog -cell openstack.org and then kdestroy. I'll run klist after unlogging to see if that second item goes away too21:18
@clarkb:matrix.organd I'll wait for reprepro to finish up so there is nothing running when I do that to be extra safe21:19
@clarkb:matrix.orgok it's done21:37
@clarkb:matrix.orgbut ubuntu-ports is running so I'll wait for that before I do the unlog and kdestroy21:37
@fungicide:matrix.orgcool21:38
@clarkb:matrix.orgfungi: ok I've dropped my flocks and closed the screen21:48
@clarkb:matrix.orgafter doing the unlog and kdestroy so hopefully everything is happy after. I think it will be21:48
@fungicide:matrix.orgthanks! i was following along21:48
@clarkb:matrix.orgI captured the output of tokens and klist locally just in case that becomes useful later21:49
@clarkb:matrix.orgbut it's mostly about the timestamps I think21:49
@fungicide:matrix.orgi've checked the apt-puppetlabs upstream repo and there's a signing key from last year with a different fingerprint. sadly they don't seem to bother cross-signing keys so provenance is an issue21:49
@clarkb:matrix.orghttps://mirror.dfw.rax.opendev.org/apt-puppetlabs/timestamp.txt this is likely when the key expired I bet (a year ago)21:50
@clarkb:matrix.orggiven it has been a year I wonder if we can just drop the mirror entirely21:50
@fungicide:matrix.orgwe've got it configured to mirror focal puppet5 which no longer exists at https://apt.puppetlabs.com/dists/focal/21:51
@clarkb:matrix.orgfwiw I checked system-config and we pull from upstream in the few places we use puppet21:51
@clarkb:matrix.orgya in system-config we pull packages from their archive because things apparently go out of the apt repo21:51
@fungicide:matrix.orghttps://apt.puppetlabs.com/dists/focal/InRelease seems to be signed with 0x4528B6CD9E61EF26 which is the key in https://apt.puppetlabs.com/pubkey.gpg21:52
@fungicide:matrix.orgso i think if we just update to that it'll solve the key error21:53
@clarkb:matrix.orgthen we'll find the next error :) but that is still progress21:53
@fungicide:matrix.orgright, the next error i expect is no puppet5 index for focal21:53
@clarkb:matrix.orgfwiw my ssh keys have aged out which is my primary "you've worked a long day" signal. Why are you still around? (I mean that in a good way we should both probably go get some fresh air or something)21:54
@fungicide:matrix.orgoh, yes i can pick this up again tomorrow. great reminder21:54
@fungicide:matrix.orgalso if tkajinam happens to be around later, maybe he can tell us to just delete the whole thing and stop caring about it ;)21:55
@clarkb:matrix.orgthat would be an excellent outcome21:55
@fungicide:matrix.orgi do enjoy deleting things a lot more than fixing them21:55
@clarkb:matrix.orgfwiw grafana seems to show opendev.org is still happy and I can still browse it from here21:55
@clarkb:matrix.orgI do think we should consider when/how we want to add a new project to gerrit and gitea. Doing that sooner than later is probably a good idea21:56
@clarkb:matrix.organd then I'm hoping I can review gerrit upgrade things tomorrow and feel prepared for sunday21:56
@fungicide:matrix.orgdid we not approve mordred's new openclaw gerrit plugin project yet?21:56
@fungicide:matrix.orga lighter weight intermediate test might be approving a gerrit acl change like https://review.opendev.org/981924 since that'll still run manage-projects21:57
@fungicide:matrix.orgthough the gitea side of that should be all noop21:58
@clarkb:matrix.orgthat's a good point if we run the gitea side of manage projects in a noop fashion against an acl update that gives us an early signal if there are any errors21:58
@clarkb:matrix.orgthen we can follow up by landing mordred's new project21:59
@clarkb:matrix.orgI mean this should all work as we test it extensively in CI, but it's a big change we just made so good to pay attention21:59
@clarkb:matrix.orgcorvus: btw we now have ~300GB of space in the mirror.ubuntu volume if we want to start working on resolute mirroring. Maybe bump up that quota a bit just to be safe then adjust back when we see how big resolute is?22:16
@clarkb:matrix.orgthat is 300GB of free space according to our quota22:17
@clarkb:matrix.organd mnasiadka confirmed the issue building resolute is with disk space so we may need to figure out if the images are unnecessarily large or if we can optimize the builds one way or another etc22:18
@jim:acmegating.comok i was just looking at that, and i guess our dstat graph isn't telling the whole story22:19
@fungicide:matrix.orgi think we should increase the quota by at least 50gb22:19
@jim:acmegating.commaybe it's measuring the wrong disk?22:19
@jim:acmegating.combecause it sure looked like there was space at the end of the failing jobs22:19
@clarkb:matrix.orgyes on rax it would be / on one disk and /opt on another22:19
@clarkb:matrix.orgso maybe a mixup between those two?22:20
@jim:acmegating.comyeah -- maybe dstat is either recording only / and not opt, or it's recording the sum of the two, so we still can't see that opt is full22:20
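One way to check the theory above is to report usage per mount point rather than as a single figure. A rough sketch in Python, assuming the `/` and `/opt` layout clarkb describes for rax nodes; the helper name `usage_by_mount` is hypothetical, and this is not what the jobs or dstat currently do.

```python
import os
import shutil


def usage_by_mount(paths):
    """Map each existing path to the used fraction (0.0 to 1.0) of its filesystem."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # e.g. /opt may not exist on every node
        usage = shutil.disk_usage(path)
        report[path] = usage.used / usage.total if usage.total else 0.0
    return report


for mount, fraction in usage_by_mount(["/", "/opt"]).items():
    print(f"{mount}: {fraction:.0%} used")
```

If `/opt` is not a separate mount, `shutil.disk_usage` reports the filesystem that contains it, so the two figures would simply match.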
@fungicide:matrix.orgthe mirror.ubuntu volume was running at something like 97% or 98% full before bionic removal started, so i fully expect resolute to exceed our available capacity since successive releases seem to only ever get larger, not smaller22:20
@fungicide:matrix.orgalso we're going to need to substantially increase quota for mirror.ubuntu-ports if we're also going to mirror resolute for arm22:21
@clarkb:matrix.orgwe're already floating around 80% capacity with bionic removed and no resolute too22:21
@fungicide:matrix.orgsince that's close to full even long after dropping the bionic mirror there22:21
