Wednesday, 2024-12-11

elibrokeitJayF: the pkg_resources assert is a pretty common mistake people make all over the place :D01:07
opendevreviewMerged openstack/diskimage-builder master: Temporarily disable OpenEuler functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/93746601:38
opendevreviewtzing proposed openstack/diskimage-builder master: Upgrade openEuler to 24.03 LTS  https://review.opendev.org/c/openstack/diskimage-builder/+/92746601:57
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:16
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:26
tkajinamhm there are a number of jobs in the tag queue that appear to be stuck https://zuul.opendev.org/t/openstack/status07:26
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:59
fricklertkajinam: looks like they are not stuck, but still processing after the bunch of eol stuff that merged yesterday https://zuul.opendev.org/t/openstack/builds?pipeline=tag&skip=008:15
tkajinamfrickler, ok. yeah the number is reducing slowly so we may have to just wait08:44
fricklertkajinam: yes, seems it goes down by about 10 jobs per hour, so should be hopefully done around 12:0008:58
fricklerinfra-root: seems the ubuntu mirror volume is unhealthy, grafana says last released 25d ago, mirror-update log has: Could not lock the VLDB entry for the volume 536870949.08:59
frickler25d is also the uptime of mirror-update02, maybe it was rebooted while a release was in progress? /me tries to find docs on handling this now09:02
*** mrunge_ is now known as mrunge11:01
opendevreviewAlbin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects  https://review.opendev.org/c/zuul/zuul-jobs/+/93682811:57
mnasiadkaclarkb: I can experiment a bit with force and let you know - but probably tomorrow12:06
mnasiadkaclarkb: on another front it seems https://review.opendev.org/c/openstack/diskimage-builder/+/937514 fixes the arm64 functests - noble is only in testing/unstable debootstrap package12:07
fungifrickler: huh, 25 days likely coincides with when i rebooted it for the openafs security fix (lastlog doesn't seem to have recorded the reboot?), but i did hold every lock and waited for running cronjobs to complete, so i'm surprised that one is stuck. if you need help with recovering it, let me know13:51
fricklerfungi: I totally got distracted by other stuff, so if you could look into this, that would be great13:55
fungion it, handled this situation quite a few times in the past13:56
fricklerunrelated: did we (or zuul) bump the retry_limit from 3 to 5? cf. https://zuul.opendev.org/t/openstack/buildset/cb2886a166b745cf96d29c0a7135f7ba13:56
fungii've used `vos unlock` on the volume and am attempting another mirror update in a root screen session on mirror-update0213:58
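As a rough reference, the AFS recovery steps fungi describes look something like the following (the numeric volume ID is taken from the earlier log message; the mirror.ubuntu volume name and the use of -localauth are assumptions about how the mirror-update host runs these commands):

    # confirm the VLDB entry is still marked locked from the interrupted release
    vos examine -id 536870949 -localauth
    # clear the stale lock
    vos unlock -id 536870949 -localauth
    # then kick off a fresh release of the mirror volume (name assumed)
    vos release -id mirror.ubuntu -localauth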
fungifrickler: that retry count is odd indeed, considering https://zuul.opendev.org/t/openstack/job/kolla-build-centos9s shows the job has retry attempts limited to 314:00
fungifrickler: aha, for some reason kolla-base does specify "attempts: 5"14:01
fungifrickler: https://opendev.org/openstack/kolla/src/branch/master/.zuul.d/base.yaml#L12814:02
fricklerok, thx, guess I never noticed that before, then. doesn't look new, either :-/ https://opendev.org/openstack/kolla/commit/98fb7dc8ff2136d658c9a79ed40f6e42e789366214:04
funginot sure why the zuul dashboard indicates it's 3 for the child job. https://zuul.opendev.org/t/openstack/job/kolla-base (the parent of kolla-build-centos9s) says 5 correctly14:04
fungihttps://opendev.org/openstack/kolla/src/branch/master/.zuul.d/centos.yaml#L14-L20 doesn't seem to override attempts either14:05
fungicorvus: ^ is that a known/intended behavior and i'm just interpreting the interface in a different way than it's intended?14:05
corvusfungi: 3 is the default value; the web ui does not indicate whether it's explicitly set or not, so that bit of info is missing.  once the value is overridden by the parent, a child would have to explicitly set it to override the value set by the parent (the default is not used to override an explicitly set value).14:49
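A hedged sketch of the inheritance behavior corvus describes, using job names modeled on the kolla ones discussed above (illustrative YAML, not the actual contents of kolla's .zuul.d files):

    - job:
        name: kolla-base
        attempts: 5            # explicitly set on the parent

    - job:
        name: kolla-build-centos9s
        parent: kolla-base
        # no "attempts" here: the per-job dashboard page shows the default (3),
        # but the frozen job inherits 5 from kolla-base when it actually runs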
fungigot it, so the dashboard is supposed to report the default there if not locally overridden, not any inherited value14:50
fungithanks for clarifying!14:50
fricklercorvus: still sounds weird to me for the dashboard not to show the value inherited from the parent, what's the rationale for that?14:52
corvusfrickler: because it hasn't been inherited yet.  the freeze-job page will show the actual value for the job as run15:00
fricklerhmm, ok, that makes sense. I still think it would be better if one could see the difference between "this job definition specifies retry_limit=3" and "this job definition doesn't specify a value for retry_limit, so it will use the value inherited from the parent or the default, and the latter happens to be 3"15:08
fungisame probably goes for other defaulted fields in the job documentation view15:10
fricklerseems the "timeout" field, in comparison, is only shown when the job actually specifies some value; that looks like a better way of handling this15:11
fungiinfra-root: pypi notified us that https://pypi.org/project/fusion/ is going to be reclaimed in a week if it remains unused. i don't recognize the other maintainer listed there, but looks like they created 8 projects between june and october of 2015 all of which are empty15:32
clarkbfungi: doesn't look like fusion has any package files. My inclination is to let pypi reassign it since nothing should be using the name15:51
clarkbinfra-root are we ready to land https://review.opendev.org/c/opendev/system-config/+/937289 https://review.opendev.org/c/opendev/system-config/+/937290/ and https://review.opendev.org/c/opendev/system-config/+/937465/ then do a gerrit restart later today to pick up the new 3.10 image build?15:52
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751415:57
clarkbmnasiadka: ^ I made a small edit to the readme in that change and I'll go ahead and approve it now15:58
clarkbfrickler: mnasiadka I would also ask that projects be careful bumping retries up like that particularly when said project is capable of using 1/5 of our quota with a single change16:05
clarkbretries are meant to mitigate problems that are infrequent and largely out of our control like the internet had a hiccup16:05
clarkbthree attempts seems plenty for that use case. Do we know why we are running with 5?16:05
clarkbcatching up on the tag pipeline backlog, it does look like that cleared out as expected (which is good; it means we don't need that to finish before doing a gerrit restart later)16:08
clarkbI used the suggested edit feature in Gerrit for the first time in 937514. I wouldn't call it the most intuitive system but it does work and I think that is neat16:15
clarkbfrickler: the cirros ssl cert expires in less than a week16:22
clarkbif there isn't anything we can do about that and the expectation is renewals happen right near the expiration point I wonder if we should remove the check for it from our list16:23
fricklerclarkb: I've been watching this and mostly hoped dreamhost would fix it on their own, since the cert is LE-based. but then I pinged smoser about it yesterday, however no response so far16:43
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API  https://review.opendev.org/c/openstack/diskimage-builder/+/90769116:44
clarkbfrickler: ya it is interesting how many people wait until things are about to expire to renew when LE themselves recommend renewal at 60 days into the 90 day validity period16:44
fricklerregarding retries I agree. the bump to 5 is 5 years old, I think we can revert it now and see how it goes with that16:44
clarkbDon't mean to pester but it's like the last major set of things I need to do post gerrit upgrade: would be great to drop the 3.9 images, add 3.11 images and testing, and then also drop the cronjob to manage log file deletion with find16:53
clarkbis anyone willing to review those changes nowish? topic:upgrade-gerrit-3.1016:53
fungiall lgtm. i approved the one that had another +217:02
clarkbthanks. Opinions on single core approvals for the rest in ~half an hour if there is no other input?17:04
fungiyeah, i can do that17:11
clarkbI can approve too if we're comfortable with that. I just wanted to make sure before we do that especially since this is for gerrit17:12
clarkbit has been half an hour I'll go ahead and approve those changes17:34
opendevreviewMerged opendev/system-config master: Drop Gerrit 3.9 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93728917:36
clarkbfor the one that stops pruning log files via a cronjob we should check on Friday if we're still keeping only 30 days of logs17:36
clarkbI'm writing myself a reminder to do that now17:36
fungisounds good, thanks!17:38
clarkbI'm also trying to shepherd some dib fixes in now that CI should be happier17:40
opendevreviewMerged openstack/diskimage-builder master: Followup: Ensure devuser-created dir has sane perms  https://review.opendev.org/c/openstack/diskimage-builder/+/93620617:54
opendevreviewMerged openstack/diskimage-builder master: Fix pyyaml install for svc-map  https://review.opendev.org/c/openstack/diskimage-builder/+/93719218:18
clarkbthe gerrit related changes should merge momentarily18:55
clarkbthen we check that the cronjob is removed and images promote18:55
opendevreviewMerged opendev/system-config master: Remove log cleanup cronjob from review  https://review.opendev.org/c/opendev/system-config/+/93727718:55
opendevreviewMerged opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729018:56
opendevreviewMerged opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11  https://review.opendev.org/c/opendev/system-config/+/93746518:56
clarkbcronjob was removed19:00
clarkband images just promoted too.19:00
clarkbI think we can decide what our restart schedule is19:00
clarkbI am going to try and pop out for a bike ride around 21:30 UTC before the rain arrives. Could probably do the restart in the next hour or two or after 23:30 UTC19:01
clarkb(I have to wait for the sun to warm things up as much as possible before going, hence not going now. It will still only be like 7C)19:02
fungii'm free if you want to arrange it now19:15
clarkbyup now works for me19:18
clarkbgive me a couple of minutes and I'll start a screen and we can do the pull, compare images and proceed19:19
fungicool19:19
clarkbscreen is started19:22
clarkbhow does this look19:22
clarkbstatus notice Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:22
fungiattached19:22
funginotice lgtm19:23
clarkb#status notice Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:24
opendevstatusclarkb: sending notice19:24
-opendevstatus- NOTICE: Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:24
clarkbok the docker compose pull hit the docker hub rate limit19:25
clarkbwhich seems impossible because we haven't pulled any images since we upgraded according to docker image list19:25
clarkbI'm beginning to suspect that the docker hub rate limits are not being accurately accounted19:25
fungino kidding19:26
clarkband unfortunately that makes the notice a misfire19:26
clarkbbut not much I can do about that right now19:26
fungithat's... absurd19:26
fungii wonder if they're counting all v6 addresses from the same /64 in a single bucket, and rackspace's address allocation policy is biting us19:27
clarkblooking at journalctl -u docker there are no obvious pulls recorded until the errors19:28
clarkbbut it may just not record pulls; either way, ya, this is something19:28
clarkblooks like docker hub did add ipv6 support in 202419:29
clarkbs/2024/2023/ I have that muscle memory19:29
fungiyeah, i basically just guessed based on the results of a dns lookup for hub.docker.com19:29
clarkbfungi: https://www.docker.com/blog/docker-hub-registry-ipv6-support-now-generally-available/ this seems to confirm your suspicion19:30
fungithat's a lovely discovery, and might explain why the rate limits seem to be hitting so hard in some providers19:30
clarkbthis would also explain the error rate we've seen19:30
clarkbya19:31
clarkbif I remember correctly there are multiple fqdns being redirected back and forth, so editing /etc/hosts may work but also isn't straightforward19:31
clarkbregistry-1.docker.io is the main entrypoint it seems like? and then either cloudfront or cloudflare in front of the actual data19:33
clarkbfungi: which name were you digging?19:34
clarkbI'm thinking maybe we add registry-1.docker.io and whatever you were looking at too into /etc/hosts with ipv4 addrs and try again?19:34
clarkbI was going to quickly do a docker pull locally but then realized I don't have ipv6 so can't identify everything that might be going to ipv6 easily. But I can look at all the ipv4 things and we can override those maybe19:36
clarkblet me try that19:36
clarkbhrm, how to filter all the other network traffic out though19:37
fungioh, i was looking up hub.docker.com19:37
fungii guess the registry has different records19:37
clarkbfungi: right, those names are just indexes; they then point at the actual data storage, which is like cloudflare or something19:38
clarkbI think the rate limits are being enforced on the index/manifest side of things, but I'm not sure whether the actual manifest data is also being handled by the storage system and the limits enforced there rather than at the registry itself19:39
fungiyeah, that would make sense, as imposing rate limits on the third-party object storage is likely more challenging19:40
clarkbI'm going to turn off my browsers (the pain) and tcpdump port 443 and do a docker pull19:40
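A minimal sketch of the kind of capture being described (interface choice is an assumption, and any image pull would do; the gerrit image tag is just the one referenced later in this log):

    sudo tcpdump -n -i any 'tcp port 443'
    # in another terminal:
    docker pull opendevorg/gerrit:3.10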
clarkboh actually I have an idea19:41
fungieasier than spinning up a vm anyway19:41
clarkbI have a held gerrit I can do the pull on that may have ipv619:41
fungioh, yep19:41
clarkbyup I do19:41
clarkb172.99.69.123 is the node but it also has a global ipv6 addr19:42
opendevreviewMerged openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751419:43
clarkbof course something is making a bunch of noise over https there on the host so I'm also going to stop apache19:43
clarkbthat seems quicker than making a complicated tcpdump rule19:43
clarkbheh I should've anticipated that this just means we're going to get a bunch of SYN packets19:44
fungihah19:48
clarkbarg there are not reverse records19:48
clarkbregistry-1.docker.io is the first thing we talk to over ipv619:49
clarkband then it hands over to a cloudflare ipv6 address. Trying to work out the name for that one now19:50
clarkbproduction.cloudflare.docker.com - it's this one19:50
clarkbyay for having this info in our mirror rules19:50
clarkbfungi: so I think if we put ipv4 forward records for those two names in /etc/hosts we have a chance19:50
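A sketch of the temporary pinning being proposed here. The real addresses were only shared out of band, so the ones below are placeholders from the documentation range; the dig lookups show how they would be obtained:

    dig +short A registry-1.docker.io
    dig +short A production.cloudflare.docker.com
    # append IPv4 pins so docker skips the shared-/64 IPv6 path (addresses illustrative)
    echo "203.0.113.10 registry-1.docker.io" | sudo tee -a /etc/hosts
    echo "203.0.113.11 production.cloudflare.docker.com" | sudo tee -a /etc/hosts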
fungionly have to reverse-engineer it once (or once they change)19:51
fungithat's what i would try first at least, yes19:51
clarkbhowever, if we feel this is too sketchy and want to avoid that then I understand19:51
fungiit's only temporary, and you're effectively just filtering dns responses after checking what the tool would ask for19:51
clarkbya I think my main concern is if I've done the manual mapping wrong and someone is set up to feed us bad data19:52
clarkbI PM'd you the two ip addrs so you can triple check things on your side before we do this19:52
clarkbI think that will give me enough confidence that we're doing something not too crazy19:52
fungiyeah, seems legit to me19:52
fungilooks like both match in screen19:57
clarkbcool I'll run a docker-compose pull now then?19:57
fungiyeah, let's see if this is a suitable workaround19:57
clarkbpull worked. I'm going to verify the images now; can you clean up /etc/hosts?19:58
fungiyeah, can do19:58
clarkbhttps://hub.docker.com/layers/opendevorg/gerrit/3.10/images/sha256-624738cddc8851e4a38eaaf093599d08edc43afd23d08b25ec31a6b3d48b6def?context=explore seems to match19:59
fungicleaned up by commenting the entries out for now and adding an explanation for their presence20:00
clarkbas does https://hub.docker.com/layers/library/mariadb/10.11/images/sha256-db739b1d86fd8606a383386d2077aa6441d1bdd1c817673be62c9b74bdd6e7f3?context=explore20:01
fungiconfirmed20:01
clarkbnote for mariadb it's multi-arch, so the sha in the url does not match, but on the page the index digest does match20:01
clarkbok should we proceed with normal restart process? do we want to send another notice first?20:01
clarkbthe previous notice never indicated it had completed, so I wonder if the bot had a sad20:02
funginah, i'd consider that one warning enough, even though there was some "lag"20:02
clarkbok if we're happy with these images I will proceed to down, mv waiting queue aside then up -d the service20:02
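Roughly the restart sequence being described, with assumed paths (the compose directory and the replication waiting-queue location are illustrative, not verified against the server):

    cd /etc/gerrit-compose                      # assumed compose dir on review
    docker-compose down
    # set the replication plugin's waiting queue aside before starting back up
    mv /home/gerrit2/review_site/data/replication/ref-updates/waiting \
       /home/gerrit2/tmp/waiting-$(date +%F)    # paths assumed
    docker-compose up -d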
fungiyes, looks good20:03
clarkbproceeding now20:03
clarkbINFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.10.3-12-g4e557ffe0e-dirty ready20:04
clarkbweb responds and reports that version too20:04
clarkbI'm still waiting on diffs to load20:05
clarkbonce diffs load I'm going to test login on an incognito browser tab just to sanity check that still works20:05
clarkbso now I think we want to update our CI role for docker setup to modify /etc/hosts with ipv4 addrs looked up when the jobs run?20:06
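One possible shape for that, as untested Ansible tasks (the role structure, task names, and list of hostnames are all assumptions):

    - name: Resolve an IPv4 address for each Docker Hub endpoint
      shell: "dig +short A {{ item }} | grep -E '^[0-9.]+' | head -n1"
      register: dockerhub_a_records
      loop:
        - registry-1.docker.io
        - production.cloudflare.docker.com

    - name: Pin Docker Hub endpoints to IPv4 in /etc/hosts
      become: true
      lineinfile:
        path: /etc/hosts
        line: "{{ item.stdout }} {{ item.item }}"
      loop: "{{ dockerhub_a_records.results }}"

Done this way the lookup happens at job run time, so the pins track whatever the records happen to be at that moment.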
clarkband I guess the feedback for raxflex is: if you implement ipv6, providing a /64 to each node is super helpful cc klamath 20:06
clarkbdiffs load for me now20:07
clarkbI'm going to test login20:07
fungii'm not so sure that editing /etc/hosts on the fly before calling out to docker is the way to go (seems over-complicated and fragile), but also setting rate limits based on /64 collation is as much docker's lack of awareness of how sites use ipv6 as anything20:08
clarkbI don't think they care unfortunately20:10
clarkbfungi: re fragility I agree but so is using a shared /64 apparently20:10
fungiwell, i mean, it's not like i thought they cared prior to this most recent revelation either20:10
clarkband I suspect that over time using a shared /64 will result in more broken job runs than looking up ipv4 addrs that might change once in a while20:11
clarkblooking them up when the job runs helps ensure we get the right locality if the dns records are location aware and mitigates the fact they may change over time20:11
fungiyeah maybe20:11
clarkblogin in an incognito tab worked and redirected me to my personal /dashboard/self page20:11
clarkbso I think we can consider the login redirects mostly fixed (they still don't return you to where you started, but that's less of a problem)20:12
jrosserit would be pretty normal to assign a /64 to a dual stacked openstack public network20:12
fungiwonder if calls to docker could just be wrapped in some preload that disables v6 address family or something20:12
jrosserI wonder if docker have attempted to deal with what the privacy extensions do20:12
jrosseras that might bypass the rate limit quite naturally for v620:13
fungiwell, it's not just privacy extensions. i'm sure the way they look at it is that clients within a network can pick any address they want to use within the router's advertised prefix, and see that as a way to skirt rate limits20:13
fungiwhich is also entirely possible over ipv4, except that address scarcity (the primary thing ipv6 was invented to solve) makes it unlikely20:14
fungiso it's more or less dockerhub ostensibly adding ipv6 support while really just making it no better than ipv4 from a rate limiting perspective20:15
clarkbanother approach may be to set up unbound rules to only return A records for docker.com and docker.io20:16
fungiit's like everyone on an ipv6 network is connecting over an overload nat to a single address, rate-limit wise20:16
clarkb(I don't know if that is possible. Google didn't return any results for LD_PRELOAD to disable ipv6 though we can probably write our own. That might be a fun exercise for someone)20:17
corvusalternatively -- mirror everything we use on dockerhub to somewhere else20:23
fungisomewhere... dare i say... reasonable20:23
corvuswoah let's not go crazy there20:24
clarkbheh yes also that role did merge so we can add a job now20:24
fungiokay, maybe reasonable is a strong word20:25
clarkbI'm marking the gerrit tasks done except for the followup to check pruning of log files is happening properly20:26
clarkbI'll brainstorm whether it's worth working around docker rate limits other than by mirroring content when I go on a bike ride. But first I need something to eat20:26
fungiyeah, it's a new chunk of information to be sure, but there are a variety of ways we could approach it20:27
fungi(including just disabling ipv6 in sysctl but that would be a loss)20:27
clarkboh I also have the held node I need to clean up when we're happy with this 3.10 update. /me makes another note on the todo list20:27
clarkbya I don't like that idea particularly since it means we can't use ipv6 only clouds20:28
clarkband maybe ^ is a good reason to just go with corvus' suggestion since an ipv6 only cloud will either have these ipv6 issues or funnel all ipv4 through a single ip20:28
clarkber I guess it could be set up to do a /64 per test node but still20:28
fungionly if the provider gives us a larger allocation to divvy up20:29
clarkbright and I don't know that any cloud has done that yet so it seems less likely20:29
clarkbovh and rax both give you a /128 iirc20:29
clarkbfungi: should I close the screen?20:31
fungiyeah, i think we're good20:31
clarkbI didn't turn on recording in it. Not sure if you want to capture anything first20:31
funginah, there was nothing we need for later as far as i saw20:32
clarkbok shutting it down20:32
fungithanks!20:32
fungiin unrelated news, the ubuntu mirror vos release is still underway, 6+ hours later20:34
fungiit needed to do a full re-sync of data to afs02.dfw20:34
fungihere's hoping it finishes before i'm done for the day20:35
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.22.5  https://review.opendev.org/c/opendev/system-config/+/93757423:28
clarkbbike ride was good for brainstorming. I think for the most part corvus' suggestion of focusing on using mirrored images is a good one. However, we hit this error in production, and we may be our own noisy neighbors there with our ci system eating up the rate limit quota, and it's unlikely we'd switch prod to pull from the mirrors for opendev in the short term.23:29
fungigood point23:29
clarkbI think if there is a straightforward method to mitigate the ipv6 rate limit problem in our zuul jobs that may be worth pursuing simply to make prod more reliable23:29
clarkbthe best idea I've got for that is still a hacky /etc/hosts override though23:30
clarkbsince it should be simple to implement and limit the scope of the ipv6 disablement to this one area23:30
fungialternatively, it's worth keeping in mind that if all nodes in a given region are effectively getting lumped into a single quota bucket, then it's no better than relying on our per-region proxies (potentially worse)23:31
clarkbyes except that we in theory have the out if we can force ipv423:32
clarkband I think it has helped in clouds with no ipv623:32
tonybI saw your idea about using an LD_PRELOAD to skip ipv6 addresses, that should work but how many places would we need to remember to use the PRELOAD, and to build/distribute it23:37
corvusclarkb: why not switch prod to mirrors?23:40
corvusclarkb: many of our images we build ourselves, so just shifting their publication to quay would address the issue.  for the rest, why not run them from our mirrored copies if we're mirroring them anyway?23:42
corvusonly downside i see is that mirror updates would be periodic, so there might be a slight delay between when an upstream image is updated and our copy is, but i feel like that's probably a non-issue for most of what we do.23:44
clarkbcorvus: we can but we have to switch away from docker the runtime to make that work with speculative testing which I'm assuming will take time23:46
clarkbcorvus: so basically I'm saying let's do all that too, but in the interim try and make interaction with docker proper less painful?23:47
clarkbthough this is the first time this has happened that I know of, so maybe the problem is very minor23:47
clarkbone workaround may be to publish to both quay and docker I suppose, then use docker for speculative stuff and quay in prod or something23:47
clarkbmaybe that is what you meant by publishing there ourselves?23:47
