Friday, 2025-02-21

clarkbthough the easiest thing might be to bounce through a proxy or vpn :(00:00
tonyb--- mirror01.ca-ymq-1.vexxhost.opendev.org ping statistics ---00:00
tonyb30 packets transmitted, 29 received, 3.33333% packet loss, time 29148ms00:00
tonybrtt min/avg/max/mdev = 2003.815/2693.335/3720.201/371.795 ms, pipe 400:00
ianwyeah i get a bit of packet loss to mirror.iad.rax.opendev.org too00:01
tonybAlso the reverse DNS is interesting00:01
tonybtony@thor:~$ host mirror.ca-ymq-1.vexxhost.opendev.org00:01
tonybmirror.ca-ymq-1.vexxhost.opendev.org is an alias for mirror01.ca-ymq-1.vexxhost.opendev.org.00:01
tonybmirror01.ca-ymq-1.vexxhost.opendev.org has address 199.204.45.14900:01
tonybmirror01.ca-ymq-1.vexxhost.opendev.org has IPv6 address 2604:e100:1:0:f816:3eff:fe0c:e2c000:01
tonybtony@thor:~$ host 199.204.45.14900:01
tonyb149.45.204.199.in-addr.arpa domain name pointer abla-4.albalisaude2.com.br.00:01
ianw7.14286% packet loss00:02
ianwPerhaps the USA has implemented a 10% tariff on incoming packets crossing the Gulf of America?00:03
tonybLOL00:03
clarkbre reverse DNS that might be something mnaser would want to look at but I don't think we're able to poke those bits ourselves00:04
clarkbeither tariffs or someone put an anchor in the wrong spot00:04
ianw... or the right spot depending on who's at the other end of the anchor :)00:05
tonybYeah, I figured the reverse DNS was beyond our control.00:05
clarkbI'm trying to ping stuff in australia and everything I try is either cloudflare or akamai so far00:11
tonybtry magni.bakeyoutnoodle.com00:11
clarkbthat doesn't resolve for me00:12
tonybor ozlabs.org00:12
clarkboh the "yout" should be "your" I think00:12
tonybYeah00:12
tonybIt's my home server00:13
tonybozlabs.org would be better as it's in a DC $somewhere in AU00:13
clarkbno packet loss from my west coast isp to either of those00:13
clarkbso it's probably not all of NA but some subset00:14
clarkbpdx.edu is close to me physically and route wise I think if you want to ping something here from there and confirm00:14
tonybI suspect pdx.edu is filtering as I got 100% packet loss00:15
tonybOSU? 00:16
clarkbtry cat.pdx.edu00:16
clarkbI get some packet loss to curtin.edu.au, which looks like it's in Amazon in Sydney, so ya it's probably specific routes that are struggling across the Pacific00:16
tonybcat.pdx.edu is better00:17
tonyb0% packet loss00:17
tonybWe can only assume that someone's pager is going off and it will be fixed ASAP00:18
clarkbianw: I'm guessing the network connectivity might make it tough but did you see the questions about why you would prefer to pull the inventory out of system-config on the executor rather than copy system-config over at that point if we have to copy it at some point anyway?00:27
ianwit's a good question and challenging my assumptions about what the bootstrapping bridge steps are doing00:27
clarkbconsidering that the repo sync step doesn't involve running ansible on the bridge itself I feel like it's ok either way. The "run ansible on bridge" step seems like a natural boundary00:29
clarkbbut I wanted to make sure we weren't missing something important there00:29
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L6900:32
ianwi don't think this works00:32
ianw[WARNING]: Could not match supplied host pattern, ignoring: prod_bastion00:32
ianwUsing /etc/ansible/ansible.cfg as config file00:32
ianwPLAY [prod_bastion[0]] *********************************************************00:32
ianwskipping: no hosts matched00:32
clarkbBRIDGE_INVENTORY doesn't have the host in it? Is that something that maybe works in CI but not prod or vice versa?00:35
clarkbBRIDGE_INVENTORY: '{{ "-i/home/zuul/bastion-inventory.ini" if root_rsa_key is defined else "" }}'00:35
clarkbya I wonder if the case you're seeing is the else "" there00:35
ianwright, it's confusing for sure.  the idea was that we could have multiple bastion hosts when we need to rotate00:36
ianw"prod_bastion" is a group of one host, the currently active bastion host.  the "bastion" group can have all the bastion hosts in in we want00:36
clarkbthe comment earlier in the file says that the root_rsa_key value should be set in prod secrets and in testing it's set with an ephemeral key00:37
ianwthat way we can bootstrap a new bridge by adding it, and the old bridge will actually go to the new bridge and deploy almost everything onto it00:37
clarkbso the intent seems to be that we would always have a root_rsa_key so I'm not sure why we have the else block there00:37
clarkboh I think I get it00:38
clarkbroot_rsa_key should only be set in the zuul ansible in CI jobs. If we aren't a ci job we still run ansible but it uses the built in inventory on the bridge00:39
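
A minimal sketch of the conditional under discussion (the BRIDGE_INVENTORY line is quoted from the chat above; the surrounding play and command are illustrative only, not the actual bootstrap-bridge.yaml):

    # Illustrative sketch only -- not the actual bootstrap-bridge.yaml.
    # In CI jobs Zuul defines root_rsa_key, so an ephemeral inventory is
    # passed with -i; outside CI the variable is undefined, the string is
    # empty, and ansible falls back to the built-in inventory on the
    # bridge, which has no prod_bastion group.
    - hosts: localhost
      vars:
        BRIDGE_INVENTORY: '{{ "-i/home/zuul/bastion-inventory.ini" if root_rsa_key is defined else "" }}'
      tasks:
        - name: Run the add-rootkey playbook against the bastion
          command: >-
            ansible-playbook {{ BRIDGE_INVENTORY }}
            playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml
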
ianwi think https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml#L1 should run on localhost00:39
clarkbI think that might fix it. Because system-config/inventory/service/groups.yaml doesn't have a prod_bastion group. That group is dynamically created within zuul jobs looks like00:40
clarkbso ya we never match on the production side00:40
clarkbhowever, this should only come into play when bootstrapping a new bastion right? so it's probably less urgent than the known_hosts thing but another good catch on something to fix there00:40
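
A sketch of the fix ianw is suggesting for that playbook, assuming the play currently targets prod_bastion[0]; the task body is a placeholder, not the real playbook contents:

    # run-production-bootstrap-bridge-add-rootkey.yaml (sketch of the idea only)
    # The production inventory has no prod_bastion group (it only exists in
    # CI jobs), so run the play on localhost -- i.e. the bridge itself.
    - hosts: localhost
      connection: local
      tasks:
        - name: Placeholder for the root key install tasks
          ansible.builtin.debug:
            msg: "install the root ssh key here"
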
ianwright, what got me thinking is that https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are running that all from the checkout of system-config 00:41
clarkboh I see ya so things need to be updated there too00:41
clarkbotherwise we'd run with the N-1 version of playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml00:42
clarkbor in the case of bootstrapping an entirely new server nothing would be present00:42
ianwso i'm just trying to think through the ordering.  i think we probably agree that if we want to run in parallel, we have to have a step to setup the executor so that it can log into the bastion host (done for every job) and a single step to update the source on the bastion host00:43
clarkbyes that makes sense to me there has to be at least one source update that happens before things consume the playbooks from sytem-config00:43
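
A rough sketch of what that single source-update step could look like (the group name and destination path are illustrative assumptions, not the actual job):

    # Hypothetical "update source on the bastion" step that everything else
    # would depend on; the group name and path are illustrative.
    - hosts: bastion
      tasks:
        - name: Sync system-config onto the bastion
          ansible.builtin.git:
            repo: https://opendev.org/opendev/system-config
            dest: /home/zuul/src/opendev.org/opendev/system-config
            version: master
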
clarkbI need to help with dinner momentarily but I'll try to catch up afterwards. Feel free to drop any further thoughts in email or here if gerrit is still painfully behind the packet loss barrier00:49
clarkband thank you for taking a look at this. It's helpful since you had so much of it paged in in the past00:50
ianwyeah sorry i'm trying to untangle it in my head again :)00:51
ianwthe difficult bit is remembering where the line between production and testing lies00:52
ianwi think that at https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are trying to install the root key from ansible on the bridge, because we want to access the root key from hiera01:07
ianw... on the production system.  in the testing case, the root key is in a file on disk.  and we want to use ansible's key deployment steps, rather than write our own script01:09
ianwthe problem is that it's not that much of a "bootstrap" step if it relies on an up-to-date system-config being in place01:09
ianwit is in the *testing* case which is the VAST majority of what runs, because the bridge is a node set up by zuul.  but still ...01:10
Clark[m]Ya but we also don't need to boil the ocean if we can make progress overall. The chicken and egg does feel a bit painful though01:46
Clark[m]But if there is minimal manual intervention that can still be a ein01:47
Clark[m]*win01:47
ianwyep i think the insight here is that there is no separate bootstrap and "update source" step ... it's the same thing.  for testing, it's a no-op 01:56
ianwclarkb: I think i'm ok with it, but I do think we should probably clear up those jobs in the LE section that aren't actually LE based03:25
ianwas I said in my too long comment -- can we get it so that the bootstrap job pauses in the same way the buildset-registry job does?  and then locks out another bootstrap job from running?  03:26
ianwi think the dependencies are in place such that you could clone the source just once in the bootstrap job03:30
corvusi think the deploy pipeline will only run one buildset at a time; so we shouldn't need to lock out another bootstrap?  (but we still have collisions with periodic).  having said that, having the bootstrap job run the full length of the buildset and used to lock out periodic jobs makes sense to me03:33
ianwyeah it's the periodic collisions that i think we need to lock out03:33
corvusokay, then yeah that sounds like an elegant solution, and basically no-cost since it's a zero-node job03:34
ianwi guess it's not a solution so much as an enhancement (or a RFE from me, haha) ... but it's getting us to being able to run prod jobs in parallel03:34
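
A minimal sketch of how a job can pause for the rest of the buildset the way the buildset-registry job does; the playbook contents are illustrative, but zuul_return with pause is the standard Zuul mechanism:

    # Illustrative run playbook for the bootstrap job: do the setup work,
    # then pause so the job keeps running (and holds the deploy "slot")
    # until every child job in the buildset has finished.
    - hosts: localhost
      tasks:
        - name: Do the bootstrap work (update source, keys, known_hosts, ...)
          ansible.builtin.debug:
            msg: "bootstrap work happens here"

        - name: Pause until dependent jobs complete
          zuul_return:
            data:
              zuul:
                pause: true
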
ianwi had it in my head that we'd have a separate src update job that everything else would depend on (that was what was reverted in https://review.opendev.org/c/opendev/system-config/+/820250)03:35
corvusi think doing that while we understand it makes a lot of sense :)03:36
ianwthe "bootstrap-bridge" stuff post-dates that, from when we did our most recent replacement of the actual bridge node.  that was how we got to the point that we run bridge99.opendev.org in testing -- as a deliberate thing to avoid too much hard-coding of bridge01 as much as possible03:38
ianwbut i think the work done there, to parent everything to the bootstrap bridge job, is a good basis for the next step03:39
Clark[m]ianw: corvus: catching up, it isn't clear to me if you think we need the pause block behavior as a prereq to fix the immediate issue or if that is a solution for the run things in parallel as a followup04:12
ianwi think probably we can pull the source in the bootstrap job as you suggest.  pausing and moving towards parallel running is an enhancement04:13
Clark[m]Ok cool. Then the only real update for the current change is to organize the jobs a bit better in project.yaml to make the dependencies clearer.04:13
ianwyes i think we should do that, having identified that04:14
Clark[m]Then we can followup with trying to parallelize stuff04:14
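
A hypothetical project.yaml fragment showing the kind of organization being discussed; the job names are made up for illustration only:

    # Hypothetical fragment: production jobs explicitly depend on the
    # bootstrap job so the ordering (and the single source update) is
    # visible in one place.
    - project:
        deploy:
          jobs:
            - infra-prod-bootstrap-bridge
            - infra-prod-service-example:
                dependencies:
                  - infra-prod-bootstrap-bridge
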
Clark[m]Thanks again for looking. Enjoy your weekend!04:15
ianwsomething is still funky talking to review from here04:35
opendevreviewIan Wienand proposed opendev/system-config master: DNM: Bootstrap as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243912:57
ianwclarkb: ^ that's the vague idea ... i'm not sure it can be tested outside of production because in testing, each job is its own little world, there's no global bootstrap job ...12:58
clarkbthanks I'll work on updating my change to address the organizational concerns then I can rebase ^ on top of it too15:45
clarkbhttps://docs.docker.com/docker-hub/usage/ says that the 10 pulls per hour per ip limit on docker hub goes into effect March 1. There is discussion here: https://news.ycombinator.com/item?id=43125089 where some people have even picked up on the issue with upstream blocking PRs to add mirrors for non docker hub registries (of course others have said this isn't an important feature...)15:46
clarkbthat discussion points out that https://gallery.ecr.aws already mirrors some official docker images so we may be able to reduce the amount of mirroring we do ourselves if we like15:48
clarkbwhat is weird is that we definitely saw a dramatic increase in docker hub rate limit errors late last year. I half wonder if we got A/B tested on the new limits15:49
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246615:49
clarkbbut we should probably expect things to get even worse in ~1week15:49
clarkba good reminder that there are changes up in zuul-jobs to fetch mirrored images from quay rather than from docker hub if anyone has a moment to review them15:50
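
For reference, the swap in those changes amounts to pointing image references at the mirrored copies; a docker-compose style sketch, assuming the image has been mirrored under quay.io/opendevmirror:

    # Sketch: pull the mirrored copy from quay.io instead of Docker Hub so
    # the pull does not count against Docker Hub's per-IP rate limit.
    services:
      web:
        # image: httpd:alpine                      # Docker Hub
        image: quay.io/opendevmirror/httpd:alpine  # quay.io mirror
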
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246615:58
mnaserif there's anything i can do to help with this... `Powered by Gitea Version: v1.23.3 Page: 22261ms Template: 53ms`16:16
mnaserI really want to try and use opendev.org to browse code but it's incredibly frustrating :-( 16:16
clarkbmnaser: we asked for feedback when we last debugged it and didn't get any...16:17
mnasersorry, it must have gotten lost in my buffer16:17
clarkbthe working theory is that the gitea caches may be becoming large enough that there is an impact on go gc. I restarted the gitea instance you were speaking to at the time (gitea13) as that would reset the internal memory state in hopes it would clear out gc issues16:17
mnaseroh its using sticky sessions?16:18
clarkbwe knew it would be temporary but it was hard to say one way or another if it actually helped at the time without more feedback16:18
clarkbyes git protocols don't let us load balance based on load16:18
clarkbthey are too stateful across requests/connections16:18
clarkbif all gitea was doing was web ui stuff then it wouldn't matter16:18
mnaseroh even for serving http/https?16:19
clarkbyes16:19
mnaserTIL!16:19
mnasersorry, i've just gotten a new pc and didn't have irccloud installed so i haven't been fully synced in yet16:19
mnaserit has started happening again for sure for me, is there a way i can share what backend i'm talking to?16:19
clarkbI think it has to do with where the refs are located, as they can be loose or in packfiles; if you talk to server A and it says grab this packfile, then hit server B where the ref is loose, your git fetch can fail regardless of transport protocol16:19
clarkbif you inspect the ssl cert one of the names is the backend name16:20
clarkbthat is probably the easiest method16:20
mnasernow that makes me curious how github and gitlab do it :)16:20
mnaserone sec16:20
mnasergitea13.opendev.org indeed16:20
clarkbI think github uses a single git repo for all forks of the repo16:21
mordredmnaser: I believe they have shared/coordinated storage clusters, as opposed to the shared-nothing setup we use16:21
clarkbvia shared filesystem or object storage or something16:21
mnaseryes they do, that's why you can do this weird link that points to a commit that doesn't exist in that repo16:21
clarkbyup which has led to many major security issues. Tradeoffs16:22
clarkbanyway give me a minute and I'll restart services on gitea13 to reset the memory state and hopefully we can get more feedback on whether or not this is helpful16:23
clarkb(side note I think the persistent requests from all the various ai crawler bots are what causes the cache to go crazy)16:23
clarkbbut this is all hypothesis at the moment16:23
mnaserthey can get their juice from github.com mirrors, maybe we just block them?16:23
mnaserim sure they already scraped that before us :)16:23
clarkbthey reproduce like rabbits16:25
clarkband not everything we host is replicated to github. It's a tricky thing to balance16:25
clarkbservices have restarted. I would expect things to be slowish while the valid things we care about get cached, then speed up with a tail off as everything gets cached over time (this was all discussed last time too)16:26
clarkbthe ideal is that the folks running the crawlers would realize that running a git clone against each repo and then crawling the commits locally would be far more efficient for us and them16:27
clarkbif anyone knows anyone at anthropic, openai, amazon, meta, google, etc I'm happy to give them this insight for a good deal16:27
clarkbmnaser: gitea13 seems to be performing reasonably well if I hit it directly via https://gitea13.opendev.org:3081/zuul/zuul for example16:28
clarkbany noticeable difference for your usage patterns?16:28
clarkbyou might want to double check that you're still hitting gitea13 too as I'm not sure how haproxy rebalancing works with the service flapping. I believe the algorithm is deterministic against your IP and the total number of backends so I would expect you to be back on the same server16:31
mnaserclarkb: i think it actually feels better now than it did a few seconds ago16:34
opendevreviewClark Boylan proposed opendev/system-config master: Update got Gitea 1.23.4  https://review.opendev.org/c/opendev/system-config/+/94247716:35
clarkbI don't think ^ will help based on the changelog but seems like something we should be doing anyway16:36
clarkbmnaser: ok, it might be helpful to know when it "feels" slow again so that we can gauge how long it takes to degrade16:36
clarkbbut ya the gitea caches are implemented as an internal go map. There is no maximum size on the map (though they do treat older data (based on timestamps) as invalid and will refresh it). I think what happens is things can just grow and grow with the crawlers fetching pages for every single commit on the service which results in very large maps. That apparently can create problems for16:39
clarkbthe go gc system which I suspect is creating halt-the-world problems like you'd expect from java programs 10 years ago. Unfortunately I don't think go binaries come with tools like the jvm provides to inspect this easily. You either run the entire process with gc tracing on or you grab the data from within your program (I'd be happy to hear there are better tools I don't know about16:39
clarkbthat I can look at)16:39
mnaseri will report that for sure16:40
clarkband maybe turning on gc tracing is the next step in debugging. Just need to be careful we don't accidentally fill disks and make git sad16:42
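
A minimal sketch of what enabling that tracing could look like, assuming gitea runs under docker-compose (the service definition and image tag are illustrative); GODEBUG=gctrace=1 makes the Go runtime print a summary line for every collection, which is where the disk-filling concern comes from:

    # Illustrative docker-compose fragment: turn on Go GC tracing for the
    # gitea process.  Every garbage collection logs a line to stderr, so
    # log rotation matters or the disk can fill up.
    services:
      gitea:
        image: gitea/gitea:1.23.3  # illustrative tag
        environment:
          - GODEBUG=gctrace=1
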
opendevreviewClark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos  https://review.opendev.org/c/opendev/system-config/+/94230716:55
clarkbok that is the promised small update to the fix for known_hosts generation16:56
opendevreviewClark Boylan proposed opendev/system-config master: DNM: Bootstrap as top-level job  https://review.opendev.org/c/opendev/system-config/+/94243917:01
clarkband rebased ianw's DNM change to show what this might look like as a followon17:01
clarkbI think nb04 has filled its disk again17:13
clarkbconfirmed. I'll start a screen and begin deletion of content in the tmp dir17:14
opendevreviewMerged openstack/diskimage-builder master: [doc] Document expected environment for dib  https://review.opendev.org/c/openstack/diskimage-builder/+/94246618:00
clarkbhttps://review.opendev.org/c/zuul/zuul-jobs/+/942127 https://review.opendev.org/c/zuul/zuul-jobs/+/941992/ https://review.opendev.org/c/zuul/zuul-jobs/+/941970 and https://review.opendev.org/c/zuul/zuul-jobs/+/942008 are straightforward zuul-jobs updates that might be good for a friday afternoon if anyone has time21:39
clarkbthis will reduce our reliance on docker hub for zuul jobs21:39
corvusyou have at least one, usually 2, +2s on all those, but i didn't +w them22:21
clarkbthanks. I guess only the most recent one was devoid of all review until now. I can approve them (or a subset; I'll double check each one for comfort levels on a friday afternoon before doing so)22:29
clarkbah looks like fungi got them22:29
clarkbI can help debug if anything goes wrong (and they should all be relatively easy reverts if it comes to that)22:29
fungiyep22:29
fungii'm semi-around still too22:30
clarkbthen monday we should probably do the gitea upgrade and maybe the grafana upgrade22:30
fungithat'll be great22:30
clarkbthe nb04 disk cleanup is still in progress22:31
clarkbio there is not very quick unfortunately22:31
opendevreviewMerged zuul/zuul-jobs master: Use mirrored buildkit:buildx-stable-1 image  https://review.opendev.org/c/zuul/zuul-jobs/+/94199222:37
opendevreviewMerged zuul/zuul-jobs master: Use mirrored qemu-user-static image  https://review.opendev.org/c/zuul/zuul-jobs/+/94212722:37
opendevreviewMerged zuul/zuul-jobs master: Use registry:2 image mirrored to quay.io  https://review.opendev.org/c/zuul/zuul-jobs/+/94197022:37
opendevreviewMerged zuul/zuul-jobs master: Replace debian:testing with quay.io/opendevmirror/httpd:alpine  https://review.opendev.org/c/zuul/zuul-jobs/+/94200822:42
clarkbI do wonder if we'll need to force ipv4 access to docker hub after march 1, but we can cross that bridge when we get there. I think for most opendev and zuul stuff we've managed to shift everything but the images opendev produces for itself to quay at this point23:08
clarkbthat leaves a number of images still, but it is rare that we need many images all on one host with one ip at the same time23:08
clarkbI think jitsi meet is the main exception to both of those statements. We pull from upstream and they are on docker still and there are like 5 images or something23:09
clarkbbut most things will pull at most 2 images23:09
clarkbI think we're probably in a see-what-happens mode while still moving what we can over time to reduce the overall demand we create23:10
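
If forcing v4 ever does become necessary, one possible (purely hypothetical) approach sketched as an ansible task; the address shown is a documentation-range placeholder, not Docker Hub's real IP:

    # Hypothetical sketch: pin the registry hostname to an IPv4 address in
    # /etc/hosts so pulls never go out over v6.  203.0.113.10 is a
    # placeholder from the documentation address range.
    - name: Pin registry-1.docker.io to an IPv4 address
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "203.0.113.10 registry-1.docker.io"
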
