Wednesday, 2024-12-11

elibrokeitJayF: the pkg_resources assert is a pretty common mistake people make all over the place :D01:07
opendevreviewMerged openstack/diskimage-builder master: Temporarily disable OpenEuler functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/93746601:38
opendevreviewtzing proposed openstack/diskimage-builder master: Upgrade openEuler to 24.03 LTS  https://review.opendev.org/c/openstack/diskimage-builder/+/92746601:57
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:16
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:26
tkajinamhm there are a number of jobs in the tag queue that appear to be stuck https://zuul.opendev.org/t/openstack/status07:26
opendevreviewMichal Nasiadka proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751407:59
fricklertkajinam: looks like they are not stuck, but still processing after the bunch of eol stuff that merged yesterday https://zuul.opendev.org/t/openstack/builds?pipeline=tag&skip=008:15
tkajinamfrickler, ok. yeah the number is reducing slowly so we may have to just wait08:44
fricklertkajinam: yes, seems it goes down by about 10 jobs per hour, so should be hopefully done around 12:0008:58
fricklerinfra-root: seems the ubuntu mirror volume is unhealthy, grafana says last released 25d ago, mirror-update log has: Could not lock the VLDB entry for the volume 536870949.08:59
frickler25d is also the uptime of mirror-update02, maybe it was rebooted while a release was in progress? /me tries to find docs on handling this now09:02
*** mrunge_ is now known as mrunge11:01
opendevreviewAlbin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects  https://review.opendev.org/c/zuul/zuul-jobs/+/93682811:57
mnasiadkaclarkb: I can experiment a bit with force and let you know - but probably tomorrow12:06
mnasiadkaclarkb: on another front it seems https://review.opendev.org/c/openstack/diskimage-builder/+/937514 fixes the arm64 functests - noble is only in testing/unstable debootstrap package12:07
fungifrickler: huh, 25 days likely coincides with when i rebooted it for the openafs security fix (lastlog doesn't seem to have recorded the reboot?), but i did hold every lock and waited for running cronjobs to complete, so i'm surprised that one is stuck. if you need help with recovering it, let me know13:51
fricklerfungi: I totally got distracted by other stuff, so if you could look into this, that would be great13:55
fungion it, handled this situation quite a few times in the past13:56
fricklerunrelated: did we (or zuul) bump the retry_limit from 3 to 5? cf. https://zuul.opendev.org/t/openstack/buildset/cb2886a166b745cf96d29c0a7135f7ba13:56
fungii've used `vos unlock` on the volume and am attempting another mirror update in a root screen session on mirror-update0213:58
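As a rough reference, the AFS recovery steps fungi describes look something like the following (the numeric volume ID is taken from the earlier log message; the mirror.ubuntu volume name and the use of -localauth are assumptions about how the mirror-update host runs these commands):

    # confirm the VLDB entry is still marked locked from the interrupted release
    vos examine -id 536870949 -localauth
    # clear the stale lock
    vos unlock -id 536870949 -localauth
    # then kick off a fresh release of the mirror volume (name assumed)
    vos release -id mirror.ubuntu -localauth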
fungifrickler: that retry count is odd indeed, considering https://zuul.opendev.org/t/openstack/job/kolla-build-centos9s shows the job has retry attempts limited to 314:00
fungifrickler: aha, for some reason kolla-base does specify "attempts: 5"14:01
fungifrickler: https://opendev.org/openstack/kolla/src/branch/master/.zuul.d/base.yaml#L12814:02
fricklerok, thx, guess I never noticed that before, then. doesn't look new, either :-/ https://opendev.org/openstack/kolla/commit/98fb7dc8ff2136d658c9a79ed40f6e42e789366214:04
funginot sure why the zuul dashboard indicates it's 3 for the child job. https://zuul.opendev.org/t/openstack/job/kolla-base (the parent of kolla-build-centos9s) says 5 correctly14:04
fungihttps://opendev.org/openstack/kolla/src/branch/master/.zuul.d/centos.yaml#L14-L20 doesn't seem to override attempts either14:05
fungicorvus: ^ is that a known/intended behavior and i'm just interpreting the interface in a different way than it's intended?14:05
corvusfungi: 3 is the default value; the web ui does not indicate whether it's explicitly set or not, so that bit of info is missing.  once the value is overridden by the parent, a child would have to explicitly set it to override the value set by the parent (the default is not used to override an explicitly set value).14:49
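A hedged sketch of the inheritance behavior corvus describes, using job names modeled on the kolla ones discussed above (illustrative YAML, not the actual contents of kolla's .zuul.d files):

    - job:
        name: kolla-base
        attempts: 5            # explicitly set on the parent

    - job:
        name: kolla-build-centos9s
        parent: kolla-base
        # no "attempts" here: the per-job dashboard page shows the default (3),
        # but the frozen job inherits 5 from kolla-base when it actually runs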
fungigot it, so the dashboard is supposed to report the default there if not locally overridden, not any inherited value14:50
fungithanks for clarifying!14:50
fricklercorvus: still sounds weird to me for the dashboard not to show the value inherited from the parent, what's the rationale for that?14:52
corvusfrickler: because it hasn't been inherited yet.  the freeze-job page will show the actual value for the job as run15:00
fricklerhmm, ok, that makes sense. I still think it would be better if one could see the difference between "this job definition specifies retry_limit=3" and "this job definition doesn't specify a value for retry_limit, so it will use the value inherited from the parent or the default, and the latter happens to be 3"15:08
fungisame probably goes for other defaulted fields in the job documentation view15:10
fricklerseems the "timeout" field, in comparison, is only shown when the job actually specifies some value; that looks like a better way of handling this15:11
fungiinfra-root: pypi notified us that https://pypi.org/project/fusion/ is going to be reclaimed in a week if it remains unused. i don't recognize the other maintainer listed there, but looks like they created 8 projects between june and october of 2015 all of which are empty15:32
clarkbfungi: doesn't look like fusion has any package files. My inclination is to let pypi reassign it since nothing should be using the name15:51
clarkbinfra-root are we ready to land https://review.opendev.org/c/opendev/system-config/+/937289 https://review.opendev.org/c/opendev/system-config/+/937290/ and https://review.opendev.org/c/opendev/system-config/+/937465/ then do a gerrit restart later today to pick up the new 3.10 image build?15:52
opendevreviewClark Boylan proposed openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751415:57
clarkbmnasiadka: ^ I made a small edit to the readme in that change and I'll go ahead and approve it now15:58
clarkbfrickler: mnasiadka I would also ask that projects be careful bumping retries up like that particularly when said project is capable of using 1/5 of our quota with a single change16:05
clarkbretries are meant to mitigate problems that are infrequent and largely out of our control like the internet had a hiccup16:05
clarkbthree attempts seems plenty for that use case. Do we know why we are running with 5?16:05
clarkbcatching up on the tag pipeline backlog, it does look like that cleared out as expected (which is good; it means we don't need that to finish before doing a gerrit restart later)16:08
clarkbI used the suggested edit feature in Gerrit for the first time in 937514. I wouldn't call it the most intuitive system but it does work and I think that is neat16:15
clarkbfrickler: the cirros ssl cert expires in less than a week16:22
clarkbif there isn't anything we can do about that and the expectation is renewals happen right near the expiration point I wonder if we should remove the check for it from our list16:23
fricklerclarkb: I've been watching this and mostly hoped dreamhost would fix it on their own, since the cert is LE-based. but then I pinged smoser about it yesterday, however no response so far16:43
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API  https://review.opendev.org/c/openstack/diskimage-builder/+/90769116:44
clarkbfrickler: ya it is interesting how many people wait until things are about to expire to renew when LE themselves recommend renewal at 60 days into the 90 day validity period16:44
fricklerregarding retries I agree. the bump to 5 is 5 years old, I think we can revert it now and see how it goes with that16:44
clarkbDon't mean to pester but it's like the last major set of things I need to do post gerrit upgrade: would be great to drop the 3.9 images, add 3.11 images and testing, and then also drop the cronjob to manage log file deletion with find16:53
clarkbis anyone willing to review those changes nowish? topic:upgrade-gerrit-3.1016:53
fungiall lgtm. i approved the one that had another +217:02
clarkbthanks. Opinions on single core approvals for the rest in ~half an hour if there is no other input?17:04
fungiyeah, i can do that17:11
clarkbI can approve too if we're comfortable with that. I just wanted to make sure before we do that especially since this is for gerrit17:12
clarkbit has been half an hour I'll go ahead and approve those changes17:34
opendevreviewMerged opendev/system-config master: Drop Gerrit 3.9 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93728917:36
clarkbfor the one that stops pruning log files via a cronjob we should check on Friday if we're still keeping only 30 days of logs17:36
clarkbI'm writing myself a reminder to do that now17:36
fungisounds good, thanks!17:38
clarkbI'm also trying to shepherd some dib fixes in now that CI should be happier17:40
opendevreviewMerged openstack/diskimage-builder master: Followup: Ensure devuser-created dir has sane perms  https://review.opendev.org/c/openstack/diskimage-builder/+/93620617:54
opendevreviewMerged openstack/diskimage-builder master: Fix pyyaml install for svc-map  https://review.opendev.org/c/openstack/diskimage-builder/+/93719218:18
clarkbthe gerrit related changes should merge momentarily18:55
clarkbthen we check that the cronjob is removed and images promote18:55
opendevreviewMerged opendev/system-config master: Remove log cleanup cronjob from review  https://review.opendev.org/c/opendev/system-config/+/93727718:55
opendevreviewMerged opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729018:56
opendevreviewMerged opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11  https://review.opendev.org/c/opendev/system-config/+/93746518:56
clarkbcronjob was removed19:00
clarkband images just promoted too.19:00
clarkbI think we can decide what our restart schedule is19:00
clarkbI am going to try and pop out for a bike ride around 21:30 UTC before the rain arrives. Could probably do the restart in the next hour or two or after 23:30 UTC19:01
clarkb(I have to wait for the sun to warm things up as much as possible before going, hence not going now. It will still only be like 7C)19:02
fungii'm free if you want to arrange it now19:15
clarkbyup now works for me19:18
clarkbgive me a couple of minutes and I'll start a screen and we can do the pull, compare images and proceed19:19
fungicool19:19
clarkbscreen is started19:22
clarkbhow does this look19:22
clarkbstatus notice Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:22
fungiattached19:22
funginotice lgtm19:23
clarkb#status notice Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:24
opendevstatusclarkb: sending notice19:24
-opendevstatus- NOTICE: Gerrit will undergo a short restart to pick up some bugfixes for the 3.10 release that we upgraded to.19:24
clarkbok the docker compose pull hit the docker hub rate limit19:25
clarkbwhich seems impossible because we haven't pulled any images since we upgraded according to docker image list19:25
clarkbI'm beginning to suspect that the docker hub rate limits are not being accurately accounted19:25
fungino kidding19:26
clarkband unfortunately that makes the notice a misfire19:26
clarkbbut not much I can do about that right now19:26
fungithat's... absurd19:26
fungii wonder if they're counting all v6 addresses from the same /64 in a single bucket, and rackspace's address allocation policy is biting us19:27
clarkblooking at journalctl -u docker there are no obvious pulls recorded until the errors19:28
clarkbbut it may just not record pulls; either way, ya, this is something19:28
clarkblooks like docker hub did add ipv6 support in 202419:29
clarkbs/2024/2023/ I have that muscle memory19:29
fungiyeah, i basically just guessed based on the results of a dns lookup for hub.docker.com19:29
clarkbfungi: https://www.docker.com/blog/docker-hub-registry-ipv6-support-now-generally-available/ this seems to confirm your suspicion19:30
fungithat's a lovely discovery, and might explain why the rate limits seem to be hitting so hard in some providers19:30
clarkbthis would also explain the error rate we've seen19:30
clarkbya19:31
clarkbif I remember correctly there are multiple fqdns being redirected back and forth, so editing /etc/hosts may work but also isn't straightforward19:31
clarkbregistry-1.docker.io is the main entrypoint it seems like? and then either cloudfront or cloudflare in front of the actual data19:33
clarkbfungi: which name were you digging?19:34
clarkbI'm thinking maybe we add registry-1.docker.io and whatever you were looking at too into /etc/hosts with ipv4 addrs and try again?19:34
clarkbI was going to quickly do a docker pull locally but then realized I don't have ipv6 so can't identify everything that might be going to ipv6 easily. But I can look at all the ipv4 things and we can override those maybe19:36
clarkblet me try that19:36
clarkbhrm, how to filter all the other network traffic out though19:37
fungioh, i was looking up hub.docker.com19:37
fungii guess the registry has different records19:37
clarkbfungi: right, those names are just indexes; they then point at the actual data storage, which is like cloudflare or something19:38
clarkbI think the rate limits are being enforced on the index/manifest side of things, but I'm not sure whether the actual manifest data is also being handled by the storage system and the limits enforced there rather than at the registry itself19:39
fungiyeah, that would make sense, as imposing rate limits on the third-party object storage is likely more challenging19:40
clarkbI'm going to turn off my browsers (the pain) and tcpdump port 443 and do a docker pull19:40
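A minimal sketch of the kind of capture being described (interface choice is an assumption, and any image pull would do; the gerrit image tag is just the one referenced later in this log):

    sudo tcpdump -n -i any 'tcp port 443'
    # in another terminal:
    docker pull opendevorg/gerrit:3.10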
clarkboh actually I have an idea19:41
fungieasier than spinning up a vm anyway19:41
clarkbI have a held gerrit I can do the pull on that may have ipv619:41
fungioh, yep19:41
clarkbyup I do19:41
clarkb172.99.69.123 is the node but it also has a global ipv6 addr19:42
opendevreviewMerged openstack/diskimage-builder master: Revert "dib-functests: run on bookworm"  https://review.opendev.org/c/openstack/diskimage-builder/+/93751419:43
clarkbof course something is making a bunch of noise over https there on the host so I'm also going to stop apache19:43
clarkbthat seems quicker than making a complicated tcpdump rule19:43
clarkbheh I should've anticipated that this just means we're going to get a bunch of SYN packets19:44
fungihah19:48
clarkbarg there are not reverse records19:48
clarkbregistry-1.docker.io is the first thing we talk to over ipv619:49
clarkband then it hands over to a cloudflare ipv6 address. Trying to work out the name for that one now19:50
clarkbproduction.cloudflare.docker.com - it's this one19:50
clarkbyay for having this info in our mirror rules19:50
clarkbfungi: so I think if we put ipv4 forward records for those two names in /etc/hosts we have a chance19:50
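A sketch of the temporary pinning being proposed here. The real addresses were only shared out of band, so the ones below are placeholders from the documentation range; the dig lookups show how they would be obtained:

    dig +short A registry-1.docker.io
    dig +short A production.cloudflare.docker.com
    # append IPv4 pins so docker skips the shared-/64 IPv6 path (addresses illustrative)
    echo "203.0.113.10 registry-1.docker.io" | sudo tee -a /etc/hosts
    echo "203.0.113.11 production.cloudflare.docker.com" | sudo tee -a /etc/hosts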
fungionly have to reverse-engineer it once (or once they change)19:51
fungithat's what i would try first at least, yes19:51
clarkbhowever, if we feel this is too sketchy and want to avoid that then I understand19:51
fungiit's only temporary, and you're effectively just filtering dns responses after checking what the tool would ask for19:51
clarkbya I think my main concern is if I've done the manual mapping wrong and someone is set up to feed us bad data19:52
clarkbI PM'd you the two ip addrs so you can triple check things on your side before we do this19:52
clarkbI think that will give me enough confidence that we're doing something not too crazy19:52
fungiyeah, seems legit to me19:52
fungilooks like both match in screen19:57
clarkbcool I'll run a docker-compose pull now then?19:57
fungiyeah, let's see if this is a suitable workaround19:57
clarkbpull worked. I'm going to verify the images now; can you clean up /etc/hosts?19:58
fungiyeah, can do19:58
clarkbhttps://hub.docker.com/layers/opendevorg/gerrit/3.10/images/sha256-624738cddc8851e4a38eaaf093599d08edc43afd23d08b25ec31a6b3d48b6def?context=explore seems to match19:59
fungicleaned up by commenting the entries out for now and adding an explanation for their presence20:00
clarkbas does https://hub.docker.com/layers/library/mariadb/10.11/images/sha256-db739b1d86fd8606a383386d2077aa6441d1bdd1c817673be62c9b74bdd6e7f3?context=explore20:01
fungiconfirmed20:01
clarkbnote for mariadb it's multi-arch, so the sha in the url does not match, but on the page the index digest does match20:01
clarkbok should we proceed with normal restart process? do we want to send another notice first?20:01
clarkbthe previous notice never indicated it had completed, so I wonder if the bot had a sad20:02
funginah, i'd consider that one warning enough, even though there was some "lag"20:02
clarkbok if we're happy with these images I will proceed to down, mv waiting queue aside then up -d the service20:02
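Roughly the restart sequence being described, with assumed paths (the compose directory and the replication waiting-queue location are illustrative, not verified against the server):

    cd /etc/gerrit-compose                      # assumed compose dir on review
    docker-compose down
    # set the replication plugin's waiting queue aside before starting back up
    mv /home/gerrit2/review_site/data/replication/ref-updates/waiting \
       /home/gerrit2/tmp/waiting-$(date +%F)    # paths assumed
    docker-compose up -d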
fungiyes, looks good20:03
clarkbproceeding now20:03
clarkbINFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.10.3-12-g4e557ffe0e-dirty ready20:04
clarkbweb responds and reports that version too20:04
clarkbI'm still waiting on diffs to load20:05
clarkbonce diffs load I'm going to test login on an incognito browser tab just to sanity check that still works20:05
clarkbso now I think we want to update our CI role for docker setup to modify /etc/hosts with ipv4 addrs looked up when the jobs run?20:06
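One possible shape for that, as untested Ansible tasks (the role structure, task names, and list of hostnames are all assumptions):

    - name: Resolve an IPv4 address for each Docker Hub endpoint
      shell: "dig +short A {{ item }} | grep -E '^[0-9.]+' | head -n1"
      register: dockerhub_a_records
      loop:
        - registry-1.docker.io
        - production.cloudflare.docker.com

    - name: Pin Docker Hub endpoints to IPv4 in /etc/hosts
      become: true
      lineinfile:
        path: /etc/hosts
        line: "{{ item.stdout }} {{ item.item }}"
      loop: "{{ dockerhub_a_records.results }}"

Done this way the lookup happens at job run time, so the pins track whatever the records happen to be at that moment.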
clarkband I guess the feedback for raxflex is: if you implement ipv6, providing a /64 to each node is super helpful cc klamath 20:06
clarkbdiffs load for me now20:07
clarkbI'm going to test login20:07
fungii'm not so sure that editing /etc/hosts on the fly before calling out to docker is the way to go (seems over-complicated and fragile), but also setting rate limits based on /64 collation is as much docker's lack of awareness of how sites use ipv6 as anything20:08
clarkbI don't think they care unfortunately20:10
clarkbfungi: re fragility I agree but so is using a shared /64 apparently20:10
fungiwell, i mean, it's not like i thought they cared prior to this most recent revelation either20:10
clarkband I suspect that over time using a shared /64 will result in more broken job runs than looking up ipv4 addrs that might change once in a while20:11
clarkblooking them up when the job runs helps ensure we get the right locality if the dns records are location aware and mitigates the fact they may change over time20:11
fungiyeah maybe20:11
clarkblogin in an incognito tab worked and redirected me to my personal /dashboard/self page20:11
clarkbso I think we can consider the login redirects mostly fixed (they still don't return you to where you started, but that's less of a problem)20:12
jrosserit would be pretty normal to assign a /64 to a dual stacked openstack public network20:12
fungiwonder if calls to docker could just be wrapped in some preload that disables v6 address family or something20:12
jrosserI wonder if docker have attempted to deal with what the privacy extensions do20:12
jrosseras that might bypass the rate limit quite naturally for v620:13
fungiwell, it's not just privacy extensions. i'm sure the way they look at it is that clients within a network can pick any address they want to use within the router's advertised prefix, and see that as a way to skirt rate limits20:13
fungiwhich is also entirely possible over ipv4, except that address scarcity (the primary thing ipv6 was invented to solve) makes it unlikely20:14
fungiso it's more or less dockerhub ostensibly adding ipv6 support while really just making it no better than ipv4 from a rate limiting perspective20:15
clarkbanother approach may be to set up unbound rules to only return A records for docker.com and docker.io20:16
fungiit's like everyone on an ipv6 network is connecting over an overload nat to a single address, rate-limit wise20:16
clarkb(I don't know if that is possible. Google didn't return any results for LD_PRELOAD to disable ipv6 though we can probably write our own. That might be a fun exercise for someone)20:17
corvusalternatively -- mirror everything we use on dockerhub to somewhere else20:23
fungisomewhere... dare i say... reasonable20:23
corvuswoah let's not go crazy there20:24
clarkbheh yes also that role did merge so we can add a job now20:24
fungiokay, maybe reasonable is a strong word20:25
clarkbI'm marking the gerrit tasks done except for the followup to check pruning of log files is happening properly20:26
clarkbI'll brainstorm whether it's worth working around docker rate limits other than by mirroring content when I go on a bike ride. But first I need something to eat20:26
fungiyeah, it's a new chunk of information to be sure, but there are a variety of ways we could approach it20:27
fungi(including just disabling ipv6 in sysctl but that would be a loss)20:27
clarkboh I also have the held node I need to clean up when we're happy with this 3.10 update. /me makes another note on the todo list20:27
clarkbya I don't like that idea particularly since it means we can't use ipv6 only clouds20:28
clarkband maybe ^ is a good reason to just go with corvus' suggestion since an ipv6 only cloud will either have these ipv6 issues or funnel all ipv4 through a single ip20:28
clarkber I guess it could be set up to do a /64 per test node but still20:28
fungionly if the provider gives us a larger allocation to divvy up20:29
clarkbright and I don't know that any cloud has done that yet so it seems less likely20:29
clarkbovh and rax both give you a /128 iirc20:29
clarkbfungi: should I close the screen?20:31
fungiyeah, i think we're good20:31
clarkbI didn't turn on recording in it. Not sure if you want to capture anything first20:31
funginah, there was nothing we need for later as far as i saw20:32
clarkbok shutting it down20:32
fungithanks!20:32
fungiin unrelated news, the ubuntu mirror vos release is still underway, 6+ hours later20:34
fungiit needed to do a full re-sync of data to afs02.dfw20:34
fungihere's hoping it finishes before i'm done for the day20:35
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.22.5  https://review.opendev.org/c/opendev/system-config/+/93757423:28
clarkbbike ride was good for brainstorming. I think for the most part corvus' suggestion of focusing on using mirrored images is a good one. However, we hit this error in production, and we may be our own noisy neighbors there with our ci system eating up the rate limit quota, and it's unlikely we'd switch prod to pull from the mirrors for opendev in the short term.23:29
fungigood point23:29
clarkbI think if there is a straightforward method to mitigate the ipv6 rate limit problem in our zuul jobs that may be worth pursuing simply to make prod more reliable23:29
clarkbthe best idea I've got for that is still a hacky /etc/hosts override though23:30
clarkbsince it should be simple to implement and limit the scope of the ipv6 disablement to this one area23:30
fungialternatively, it's worth keeping in mind that if all nodes in a given region are effectively getting lumped into a single quota bucket, then it's no better than relying on our per-region proxies (potentially worse)23:31
clarkbyes except that we in theory have the out if we can force ipv423:32
clarkband I think it has helped in clouds with no ipv623:32
tonybI saw your idea about using an LD_PRELOAD to skip ipv6 addresses, that should work but how many places would we need to remember to use the PRELOAD, and to build/distribute it23:37
corvusclarkb: why not switch prod to mirrors?23:40
corvusclarkb: many of our images we build ourselves, so just shifting their publication to quay would address the issue.  for the rest, why not run them from our mirrored copies if we're mirroring them anyway?23:42
corvusonly downside i see is that mirror updates would be periodic, so there might be a slight delay between when an upstream image is updated and our copy is, but i feel like that's probably a non-issue for most of what we do.23:44
clarkbcorvus: we can but we have to switch away from docker the runtime to make that work with speculative testing which I'm assuming will take time23:46
clarkbcorvus: so basically I'm saying let's do all that too, but in the interim try and make interaction with docker proper less painful?23:47
clarkbthough this is the first time this has happened that I know of, so maybe the problem is very minor23:47
clarkbone workaround may be to publish to both quay and docker I suppose, then use docker for speculative stuff and quay in prod or something23:47
clarkbmaybe that is what you meant by publishing there ourselves?23:47
