Friday, 2025-02-21

clarkbthough the easiest thing might be to bounce through a proxy or vpn :(00:00
tonyb--- mirror01.ca-ymq-1.vexxhost.opendev.org ping statistics ---00:00
tonyb30 packets transmitted, 29 received, 3.33333% packet loss, time 29148ms00:00
tonybrtt min/avg/max/mdev = 2003.815/2693.335/3720.201/371.795 ms, pipe 400:00
ianwyeah i get a bit of packet loss to mirror.iad.rax.opendev.org too00:01
tonybAlso the reverse DNS is interesting00:01
tonybtony@thor:~$ host mirror.ca-ymq-1.vexxhost.opendev.org00:01
tonybmirror.ca-ymq-1.vexxhost.opendev.org is an alias for mirror01.ca-ymq-1.vexxhost.opendev.org.00:01
tonybmirror01.ca-ymq-1.vexxhost.opendev.org has address 199.204.45.14900:01
tonybmirror01.ca-ymq-1.vexxhost.opendev.org has IPv6 address 2604:e100:1:0:f816:3eff:fe0c:e2c000:01
tonybtony@thor:~$ host 199.204.45.14900:01
tonyb149.45.204.199.in-addr.arpa domain name pointer abla-4.albalisaude2.com.br.00:01
ianw7.14286% packet loss00:02
ianwPerhaps the USA has implemented a 10% tariff on incoming packets crossing the Gulf of America?00:03
tonybLOL00:03
clarkbre reverse DNS that might be something mnaser would want to look at but I don't think we're able to poke those bits ourselves00:04
clarkbeither tariffs or someone put an anchor in the wrong spot00:04
ianw... or the right spot depending on who's at the other end of the anchor :)00:05
tonybYeah, I figured the reverse DNS was beyond our control.00:05
clarkbI'm trying to ping stuff in australia and everything I try is either cloudflare or akamai so far00:11
tonybtry magni.bakeyoutnoodle.com00:11
clarkbthat doesn't resolve for me00:12
tonybor ozlabs.org00:12
clarkboh the "yout" should be "your" I think00:12
tonybYeah00:12
tonybIt's my home server00:13
tonybozlabs.org would be better as it's in a DC $somewhere in AU00:13
clarkbno packet loss from my west coast isp to either of those00:13
clarkbso it's probably not all of NA but some subset00:14
clarkbpdx.edu is close to me physically and route wise I think if you want to ping something here from there and confirm00:14
tonybI suspect pdx.edu is filtering as I got 100% packet loss00:15
tonybOSU? 00:16
clarkbtry cat.pdx.edu00:16
clarkbI get some packet loss to curtin.edu.au, which looks like it's in Amazon in Sydney, so ya it's probably specific routes that are struggling across the Pacific00:16
tonybcat.pdx.edu is better00:17
tonyb0% packet loss00:17
tonybWe can only assume that someone's pager is going off and it will be fixed ASAP00:18
clarkbianw: I'm guessing the network connectivity might make it tough but did you see the questions about why you would prefer to pull the inventory out of system-config on the executor rather than copy system-config over at that point if we have to copy it at some point anyway?00:27
ianwit's a good question and challenging my assumptions about what the bootstrapping bridge steps are doing00:27
clarkbconsidering that the repo sync step doesn't involve running ansible on the bridge itself I feel like it's ok either way. The "run ansible on bridge" step seems like a natural boundary00:29
clarkbbut I wanted to make sure we weren't missing something important there00:29
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L6900:32
ianwi don't think this works00:32
ianw[WARNING]: Could not match supplied host pattern, ignoring: prod_bastion00:32
ianwUsing /etc/ansible/ansible.cfg as config file00:32
ianwPLAY [prod_bastion[0]] *********************************************************00:32
ianwskipping: no hosts matched00:32
clarkbBRIDGE_INVENTORY doesn't have the host in it? Is that something that maybe works in CI but not prod or vice versa?00:35
clarkbBRIDGE_INVENTORY: '{{ "-i/home/zuul/bastion-inventory.ini" if root_rsa_key is defined else "" }}'00:35
clarkbya I wonder if the case you're seeing is the else "" there00:35
ianwright, it's confusing for sure.  the idea was that we could have multiple bastion hosts when we need to rotate00:36
ianw"prod_bastion" is a group of one host, the currently active bastion host.  the "bastion" group can have all the bastion hosts in in we want00:36
clarkbthe comment earlier in the file says that the root_rsa_key value should be set in prod secrets and in testing it's set with an ephemeral key00:37
ianwthat way we can bootstrap a new bridge by adding it, and the old bridge will actually go to the new bridge and deploy almost everything onto it00:37
clarkbso the intent seems to be that we would always have a root_rsa_key so I'm not sure why we have the else block there00:37
clarkboh I think I get it00:38
clarkbroot_rsa_key should only be set in the zuul ansible in CI jobs. If we aren't a ci job we still run ansible but it uses the built in inventory on the bridge00:39
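
A minimal sketch of the conditional under discussion (the BRIDGE_INVENTORY line is quoted from the chat above; the surrounding play and command are illustrative only, not the actual bootstrap-bridge.yaml):

    # Illustrative sketch only -- not the actual bootstrap-bridge.yaml.
    # In CI jobs Zuul defines root_rsa_key, so an ephemeral inventory is
    # passed with -i; outside CI the variable is undefined, the string is
    # empty, and ansible falls back to the built-in inventory on the
    # bridge, which has no prod_bastion group.
    - hosts: localhost
      vars:
        BRIDGE_INVENTORY: '{{ "-i/home/zuul/bastion-inventory.ini" if root_rsa_key is defined else "" }}'
      tasks:
        - name: Run the add-rootkey playbook against the bastion
          command: >-
            ansible-playbook {{ BRIDGE_INVENTORY }}
            playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml
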
ianwi think https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml#L1 should run on localhost00:39
clarkbI think that might fix it. Because system-config/inventory/service/groups.yaml doesn't have a prod_bastion group. That group is dynamically created within zuul jobs looks like00:40
clarkbso ya we never match on the production side00:40
clarkbhowever, this should only come into play when bootstrapping a new bastion right? so it's probably less urgent than the known_hosts thing but another good catch on something to fix there00:40
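
A sketch of the fix ianw is suggesting for that playbook, assuming the play currently targets prod_bastion[0]; the task body is a placeholder, not the real playbook contents:

    # run-production-bootstrap-bridge-add-rootkey.yaml (sketch of the idea only)
    # The production inventory has no prod_bastion group (it only exists in
    # CI jobs), so run the play on localhost -- i.e. the bridge itself.
    - hosts: localhost
      connection: local
      tasks:
        - name: Placeholder for the root key install tasks
          ansible.builtin.debug:
            msg: "install the root ssh key here"
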
ianwright, what got me thinking is that https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are running that all from the checkout of system-config 00:41
clarkboh I see ya so things need to be updated there too00:41
clarkbotherwise we'd run with the N-1 version of playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml00:42
clarkbor in the case of bootstrapping an entirely new server nothing would be present00:42
ianwso i'm just trying to think through the ordering.  i think we probably agree that if we want to run in parallel, we have to have a step to setup the executor so that it can log into the bastion host (done for every job) and a single step to update the source on the bastion host00:43
clarkbyes that makes sense to me there has to be at least one source update that happens before things consume the playbooks from sytem-config00:43
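
A rough sketch of what that single source-update step could look like (the group name and destination path are illustrative assumptions, not the actual job):

    # Hypothetical "update source on the bastion" step that everything else
    # would depend on; the group name and path are illustrative.
    - hosts: bastion
      tasks:
        - name: Sync system-config onto the bastion
          ansible.builtin.git:
            repo: https://opendev.org/opendev/system-config
            dest: /home/zuul/src/opendev.org/opendev/system-config
            version: master
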
clarkbI need to help with dinner momentarily but I'll try to catch up afterwards. Feel free to drop any further thoughts in email or here if gerrit is still painfully behind the packet loss barrier00:49
clarkband thank you for taking a look at this. It's helpful since you had so much of it paged in in the past00:50
ianwyeah sorry i'm trying to untangle it in my head again :)00:51
ianwthe difficult bit is remembering where the line between production and testing lies00:52
ianwi think that at https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are trying to install the root key from ansible on the bridge, because we want to access the root key from hiera01:07
ianw... on the production system.  in the testing case, the root key is in a file on disk.  and we want to use ansible's key deployment steps, rather than write our own script01:09
ianwthe problem is that it's not that much of a "bootstrap" step if it relies on an up-to-date system-config being in place01:09
ianwit is in the *testing* case which is the VAST majority of what runs, because the bridge is a node set up by zuul.  but still ...01:10
Clark[m]Ya but we also don't need to boil the ocean if we can make progress overall. The chicken and egg does feel a bit painful though01:46
Clark[m]But if there is minimal manual intervention that can still be a ein01:47
Clark[m]*win01:47
ianwyep i think the insight here is that there is no separate bootstrap and "update source" step ... it's the same thing.  for testing, it's a no-op 01:56
ianwclarkb: I think i'm ok with it, but I do think we should probably clear up those jobs in the LE section that aren't actually LE based03:25
ianwas I said in my too long comment -- can we get it so that the bootstrap job pauses in the same way the buildset-registry job does?  and then locks out another bootstrap job from running?  03:26
ianwi think the dependencies are in place such that you could clone the source just once in the bootstrap job03:30
corvusi think the deploy pipeline will only run one buildset at a time; so we shouldn't need to lock out another bootstrap?  (but we still have collisions with periodic).  having said that, having the bootstrap job run the full length of the buildset and used to lock out periodic jobs makes sense to me03:33
ianwyeah it's the periodic collisions that i think we need to lock out03:33
corvusokay, then yeah that sounds like an elegant solution, and basically no-cost since it's a zero-node job03:34
ianwi guess it's not a solution so much as an enhancement (or a RFE from me, haha) ... but it's getting us to being able to run prod jobs in parallel03:34
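
A minimal sketch of how a job can pause for the rest of the buildset the way the buildset-registry job does; the playbook contents are illustrative, but zuul_return with pause is the standard Zuul mechanism:

    # Illustrative run playbook for the bootstrap job: do the setup work,
    # then pause so the job keeps running (and holds the deploy "slot")
    # until every child job in the buildset has finished.
    - hosts: localhost
      tasks:
        - name: Do the bootstrap work (update source, keys, known_hosts, ...)
          ansible.builtin.debug:
            msg: "bootstrap work happens here"

        - name: Pause until dependent jobs complete
          zuul_return:
            data:
              zuul:
                pause: true
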
ianwi had it in my head that we'd have a separate src update job that everything else would depend on (that was what was reverted in https://review.opendev.org/c/opendev/system-config/+/820250)03:35
corvusi think doing that while we understand it makes a lot of sense :)03:36
ianwthe "bootstrap-bridge" stuff post-dates that, from when we did our most recent replacement of the actual bridge node.  that was how we got to the point that we run bridge99.opendev.org in testing -- as a deliberate thing to avoid too much hard-coding of bridge01 as much as possible03:38
ianwbut i think the work done there, to parent everything to the bootstrap bridge job, is a good basis for the next step03:39
Clark[m]ianw: corvus: catching up, it isn't clear to me if you think we need the pause block behavior as a prereq to fix the immediate issue or if that is a solution for the run things in parallel as a followup04:12
ianwi think probably we can pull the source in the bootstrap job as you suggest.  pausing and moving towards parallel running is an enhancement04:13
Clark[m]Ok cool. Then the only real update for the current change is to organize the jobs a bit better in project.yaml to make the dependencies clearer.04:13
ianwyes i think we should do that, having identified that04:14
Clark[m]Then we can followup with trying to parallelize stuff04:14
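
A hypothetical project.yaml fragment showing the kind of organization being discussed; the job names are made up for illustration only:

    # Hypothetical fragment: production jobs explicitly depend on the
    # bootstrap job so the ordering (and the single source update) is
    # visible in one place.
    - project:
        deploy:
          jobs:
            - infra-prod-bootstrap-bridge
            - infra-prod-service-example:
                dependencies:
                  - infra-prod-bootstrap-bridge
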
Clark[m]Thanks again for looking. Enjoy your weekend!04:15
ianwsomething is still funky talking to review from here04:35
opendevreviewIan Wienand proposed opendev/system-config master: DNM: Bootstrap as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243912:57
ianwclarkb: ^ that's the vague idea ... i'm not sure it can be tested outside of production because in testing, each job is its own little world, there's no global bootstrap job ...12:58
clarkbthanks I'll work on updating my change to address the organizational concerns then I can rebase ^ on top of it too15:45
clarkbhttps://docs.docker.com/docker-hub/usage/ says that the 10 pulls per hour per ip limit on docker hub goes into effect March 1. There is discussion here: https://news.ycombinator.com/item?id=43125089 where some people have even picked up on the issue with upstream blocking PRs to add mirrors for non docker hub registries (of course others have said this isn't an important feature...)15:46
clarkbthat discussion points out that https://gallery.ecr.aws already mirrors some official docker images so we may be able to reduce the amount of mirroring we do ourselves if we like15:48
clarkbwhat is weird is that we definitely saw a dramatic increase in docker hub rate limit errors late last year. I half wonder if we got A/B tested on the new limits15:49
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246615:49
clarkbbut we should probably expect things to get even worse in ~1week15:49
clarkba good reminder that there are changes up in zuul-jobs to fetch mirrored images from quay rather than from docker hub if anyone has a moment to review them15:50
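
For reference, the swap in those changes amounts to pointing image references at the mirrored copies; a docker-compose style sketch, assuming the image has been mirrored under quay.io/opendevmirror:

    # Sketch: pull the mirrored copy from quay.io instead of Docker Hub so
    # the pull does not count against Docker Hub's per-IP rate limit.
    services:
      web:
        # image: httpd:alpine                      # Docker Hub
        image: quay.io/opendevmirror/httpd:alpine  # quay.io mirror
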
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246615:58
mnaserif there's anything i can do to help with this... `Powered by Gitea Version: v1.23.3 Page: 22261ms Template: 53ms`16:16
mnaserI really want to try and use opendev.org to browse code but it's incredibly frustrating :-( 16:16
clarkbmnaser: we asked for feedback when we last debugged it and didn't get any...16:17
mnasersorry, it must have gotten lost in my buffer16:17
clarkbthe working theory is that the gitea caches may be becoming large enough that there is an impact on go gc. I restarted the gitea instance you were speaking to at the time (gitea13) as that would reset the internal memory state in hopes it would clear out gc issues16:17
mnaseroh its using sticky sessions?16:18
clarkbwe knew it would be temporary but it was hard to say one way or another if it actually helped at the time without more feedback16:18
clarkbyes git protocols don't let us load balance based on load16:18
clarkbthey are too stateful across requests/connections16:18
clarkbif all gitea was doing was web ui stuff then it wouldn't matter16:18
mnaseroh even for serving http/https?16:19
clarkbyes16:19
mnaserTIL!16:19
mnasersorry, i've just gotten a new pc and didn't have irccloud installed so i haven't been fully synced in yet16:19
mnaserit has started happening again for sure for me, is there a way i can share what backend i'm talking to?16:19
clarkbI think it has to do with where the refs are located, as they can be loose or in packfiles; if you talk to server A and it says grab this packfile, then hit server B where the ref is loose, your git fetch can fail regardless of transport protocol16:19
clarkbif you inspect the ssl cert one of the names is the backend name16:20
clarkbthat is probably the easiest method16:20
mnasernow that makes me curious how github and gitlab do it :)16:20
mnaserone sec16:20
mnasergitea13.opendev.org indeed16:20
clarkbI think github uses a single git repo for all forks of the repo16:21
mordredmnaser: I believe they have shared/coordinated storage clusters, as opposed to the shared-nothing setup we use16:21
clarkbvia shared filesystem or object storage or something16:21
mnaseryes they do, that's why you can do this weird link that points to a commit that doesn't exist in that repo16:21
clarkbyup which has led to many major security issues. Tradeoffs16:22
clarkbanyway give me a minute and I'll restart services on gitea13 to reset the memory state and hopefully we can get more feedback on whether or not this is helpful16:23
clarkb(side note I think the persistent requests from all the various ai crawler bots are what causes the cache to go crazy)16:23
clarkbbut this is all hypothesis at the moment16:23
mnaserthey can get their juice from github.com mirrors, maybe we just block them?16:23
mnaserim sure they already scraped that before us :)16:23
clarkbthey reproduce like rabbits16:25
clarkband not everything we host is replicated to github. It's a tricky thing to balance16:25
clarkbservices have restarted. I would expect things to be slowish while the valid things we care about get cached, then speed up with a tail off as everything gets cached over time (this was all discussed last time too)16:26
clarkbthe ideal is that the folks running the crawlers would realize that running a git clone against each repo and then crawling the commits locally would be far more efficient for us and them16:27
clarkbif anyone knows anyone at anthropic, openai, amazon, meta, google, etc I'm happy to give them this insight for a good deal16:27
clarkbmnaser: gitea13 seems to be performing reasonably well if I hit it directly via https://gitea13.opendev.org:3081/zuul/zuul for example16:28
clarkbany noticeable difference for your usage patterns?16:28
clarkbyou might want to double check that you're still hitting gitea13 too as I'm not sure how haproxy rebalancing works with the service flapping. I believe the algorithm is deterministic against your IP and the total number of backends so I would expect you to be back on the same server16:31
mnaserclarkb: i think it actually feels better now than it did a few seconds ago16:34
opendevreviewClark Boylan proposed opendev/system-config master: Update got Gitea 1.23.4  https://review.opendev.org/c/opendev/system-config/+/94247716:35
clarkbI don't think ^ will help based on the changelog but seems like something we should be doing anyway16:36
clarkbmnaser: ok, it might be helpful to know when it "feels" slow again so that we can gauge how long it takes to degrade16:36
clarkbbut ya the gitea caches are implemented as an internal go map. There is no maximum size on the map (though they do treat older data (based on timestamps) as invalid and will refresh it). I think what happens is things can just grow and grow with the crawlers fetching pages for every single commit on the service which results in very large maps. That apparently can create problems for16:39
clarkbthe go gc system which I suspect is creating halt-the-world problems like you'd expect from java programs 10 years ago. Unfortunately I don't think go binaries come with tools like the jvm provides to inspect this easily. You either run the entire process with gc tracing on or you grab the data from within your program (I'd be happy to hear there are better tools I don't know about16:39
clarkbthat I can look at)16:39
mnaseri will report that for sure16:40
clarkband maybe turning on gc tracing is the next step in debugging. Just need to be careful we don't accidentally fill disks and make git sad16:42
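
A minimal sketch of what enabling that tracing could look like, assuming gitea runs under docker-compose (the service definition and image tag are illustrative); GODEBUG=gctrace=1 makes the Go runtime print a summary line for every collection, which is where the disk-filling concern comes from:

    # Illustrative docker-compose fragment: turn on Go GC tracing for the
    # gitea process.  Every garbage collection logs a line to stderr, so
    # log rotation matters or the disk can fill up.
    services:
      gitea:
        image: gitea/gitea:1.23.3  # illustrative tag
        environment:
          - GODEBUG=gctrace=1
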
opendevreviewClark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos  https://review.opendev.org/c/opendev/system-config/+/94230716:55
clarkbok that is the promised small update to the fix for known_hosts generation16:56
opendevreviewClark Boylan proposed opendev/system-config master: DNM: Bootstrap as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243917:01
clarkband rebased ianw's DNM change to show what this might look like as a followon17:01
clarkbI think nb04 has filled its disk again17:13
clarkbconfirmed. I'll start a screen and begin deletion of content in the tmp dir17:14
opendevreviewMerged openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246618:00
clarkbhttps://review.opendev.org/c/zuul/zuul-jobs/+/942127 https://review.opendev.org/c/zuul/zuul-jobs/+/941992/ https://review.opendev.org/c/zuul/zuul-jobs/+/941970 and https://review.opendev.org/c/zuul/zuul-jobs/+/942008 are straightforward zuul-jobs updates that might be good for a friday afternoon if anyone has time21:39
clarkbthis will reduce our reliance on docker hub for zuul jobs21:39
corvusyou have at least one, usually 2, +2s on all those, but i didn't +w them22:21
clarkbthanks. I guess only the most recent one was devoid of all review until now. I can approve them (or a subset; I'll double check each one for comfort levels on a friday afternoon before doing so)22:29
clarkbah looks like fungi got them22:29
clarkbI can help debug if anything goes wrong (and they should all be relatively easy reverts if it comes to that)22:29
fungiyep22:29
fungii'm semi-around still too22:30
clarkbthen monday we should probably do the gitea upgrade and maybe the grafana upgrade22:30
fungithat'll be great22:30
clarkbthe nb04 disk cleanup is still in progress22:31
clarkbio there is not very quick unfortunately22:31
opendevreviewMerged zuul/zuul-jobs master: Use mirrored buildkit:buildx-stable-1 image  https://review.opendev.org/c/zuul/zuul-jobs/+/94199222:37
opendevreviewMerged zuul/zuul-jobs master: Use mirrored qemu-user-static image  https://review.opendev.org/c/zuul/zuul-jobs/+/94212722:37
opendevreviewMerged zuul/zuul-jobs master: Use registry:2 image mirrored to quay.io  https://review.opendev.org/c/zuul/zuul-jobs/+/94197022:37
opendevreviewMerged zuul/zuul-jobs master: Replace debian:testing with quay.io/opendevmirror/httpd:alpine  https://review.opendev.org/c/zuul/zuul-jobs/+/94200822:42
clarkbI do wonder if we'll need to force ipv4 access to docker hub after march 1, but we can cross that bridge when we get there. I think for most opendev and zuul stuff we've managed to shift everything but the images opendev produces for itself to quay at this point23:08
clarkbthat leaves a number of images still, but it is rare that we need many images all on one host with one ip at the same time23:08
clarkbI think jitsi meet is the main exception to both of those statements. We pull from upstream and they are on docker still and there are like 5 images or something23:09
clarkbbut most things will pull at most 2 images23:09
clarkbI think we're probably in a see-what-happens mode while still moving what we can over time to reduce the overall demand we create23:10
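
If forcing v4 ever does become necessary, one possible (purely hypothetical) approach sketched as an ansible task; the address shown is a documentation-range placeholder, not Docker Hub's real IP:

    # Hypothetical sketch: pin the registry hostname to an IPv4 address in
    # /etc/hosts so pulls never go out over v6.  203.0.113.10 is a
    # placeholder from the documentation address range.
    - name: Pin registry-1.docker.io to an IPv4 address
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "203.0.113.10 registry-1.docker.io"
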
