| priteau | Hello. I am seeing lots of CI errors while accessing mirror.dfw3.raxflex.opendev.org. It is refusing all connections. | 08:28 |
|---|---|---|
| tonyb | priteau: I'll look we did have issues with that server but we thought they were fixed | 08:29 |
| priteau | Thanks. For example here: https://9b62756387dc03e179ba-9f2ef8380661efedee2428ccb67a9720.ssl.cf2.rackcdn.com/openstack/ec48ad54ccda4860928611e9d04a946c/primary/ansible/install-pre-upgrade | 08:30 |
| tonyb | So that's about an hour ago right? | 08:31 |
| tonyb | priteau: Should be fixed | 08:35 |
| tonyb | #status log apache wasn't running on mirror.dfw3.raxflex.opendev.org with nothing obvious in the logs. Restarted. | 08:39 |
| opendevstatus | tonyb: finished logging | 08:39 |
| priteau | Thanks! | 08:39 |
| opendevreview | Merged openstack/project-config master: Grafana: add core graphs to rackspace-flex https://review.opendev.org/c/openstack/project-config/+/966981 | 10:59 |
| *** rlandy_ is now known as rlandy | 12:01 | |
| fungi | apache dying on mirror.dfw3.raxflex definitely seems odd. that's also the server i just brought back online yesterday after a multi-week outage | 14:30 |
| fungi | yeah, very odd. i booted mirror.dfw3.raxflex around 21:35 utc. i don't see any reference in journald for apache starting until tonyb presumably addressed it at 08:34:26 utc | 14:37 |
| fungi | when i tested i was so focused on making sure afs was working i'm not sure i confirmed apache was actually serving things, so maybe apache didn't actually start at boot for some reason? | 14:38 |
| fungi | maybe /var/cache/apache2 wasn't mounted when systemd tried to start apache? though i find no evidence of that in syslog either | 14:40 |
| clarkb | does journalctl -u apache2 indicate anything useful? | 15:48 |
| fungi | it did not sadly, no | 15:56 |
| fungi | that was the first thing i checked | 15:56 |
| fungi | shows it stopping when we tried to reboot back on the 3rd, and shows when tonyb started it manually, which is what's leading me to suspect it never started at boot | 15:57 |
| clarkb | for some reason I thought I checked ping and apache but maybe it was ping and ssh | 15:58 |
| clarkb | apache never starting on boot would certainly explain what you discovered so is probably the most likely case | 15:58 |
| fungi | and yeah i merely cat'ed files from the afs ro replica path in a shell there | 15:58 |
| fungi | it didn't dawn on me that apache might not start | 15:59 |
| fungi | but it certainly will from now on | 15:59 |
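For reference, a minimal sketch of the kind of post-boot check being discussed here, assuming a systemd host; the unit name and mountpoint are the ones mentioned above, everything else is illustrative:

```python
import os
import subprocess

def unit_property(unit: str, prop: str) -> str:
    """Read a single systemd unit property, e.g. ActiveState or UnitFileState."""
    return subprocess.run(
        ["systemctl", "show", "-p", prop, "--value", unit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# Did apache actually come up, is it enabled for boot, and is its cache mounted?
print("apache2 ActiveState:  ", unit_property("apache2", "ActiveState"))
print("apache2 UnitFileState:", unit_property("apache2", "UnitFileState"))
print("/var/cache/apache2 mounted:", os.path.ismount("/var/cache/apache2"))
```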
| clarkb | I'm going to catch up on my morning but then expect to approve the two gerrit update changes before too long. Chime in if there is reason to delay | 16:03 |
| fungi | i'll be steppout out momentarily but expect i'll be back before those deploy regardless | 16:03 |
| fungi | er, stepping out | 16:03 |
| clarkb | ya I don't think the real fun will start until 2100 so that tonyb can join in | 16:04 |
| fungi | okay, afk for a little while, probably back in an hour-ish | 16:15 |
| clarkb | enjoy | 16:16 |
| clarkb | gerrit changes have been approved | 17:19 |
| clarkb | I did some self re-review just to be sure I felt ready and can't come up with any reason not to proceed that isn't "gerrit restarts have been annoying recently" | 17:19 |
| fungi | cool, i'm back and ready to help with restarts later whenever | 17:30 |
| clarkb | idea that just occurred to me: should we consider reducing the rax classic limits in zuul relative to the amounts we're able to increase quotas elsewhere? Mostly thinking that cloud in particular doesn't support the newer instruction sets, so we can't run all images there, and the disk + fs setup makes some jobs slower to bootstrap there than elsewhere (specifically devstack style jobs). | 17:44 |
| clarkb | I think the way the launcher works it will give the rax classic provider proportionally more work because it is larger. We can tune it away from that by reducing its quota | 17:44 |
| clarkb | the downside to that is we'd have less quota than we can actually field when demand is high | 17:45 |
| fungi | or maybe we decide where the demarcation line is when we can declare that we have sufficient quota elsewhere to just stop using rax classic for job nodes | 17:46 |
| fungi | platform-specific challenges aside (xen, performance, et cetera), we also run into provider-oriented problems like slow api responses, accumulation of undeletable cruft, and so on | 17:47 |
| clarkb | another thought that occurred to me was adding some sort of prioritization to providers in the launcher so that classic would only be considered if everything else was full up | 17:48 |
| clarkb | but I'm not sure that is a generally useful feature or if that is just something that makes sense in this one circumstance which may not be worth solving directly like that | 17:49 |
| fungi | it's a common use case in other cloud-oriented applications. long ago we used to refer to it as "cloudbursting" for situations where you had low-cost/sunk-cost infrastructure like a private cloud that you preferred to run your workloads on, but wanted to "burst" into a public cloud when your backlog reached a set threshold, basically paying some outside provider to avoid falling farther behind | 17:51 |
| fungi | though if the idea is to prefer performant resources and then fall back on a slower/feature-limited alternative under load, i think that's a recipe for increasing job failure rates. the tests will become attuned to the non-overflow providers, and so grow assumptions that don't hold when they spill over into rax classic | 17:53 |
| clarkb | that is a good point | 17:54 |
| fungi | it happens already, but is kept somewhat in check by some percentage of those jobs always running there | 17:54 |
| clarkb | In my mind it's more about improving throughput. Preferring rax classic means we're choosing some of our slowest nodes more than others | 17:54 |
| clarkb | I don't really care too much about individual job performance as much as overall throughput in this context | 17:54 |
| clarkb | and ya that risk is fine if we think we'll never need those resources again, but if we do and we don't operate in that environment anymore it would be painful | 17:55 |
| fungi | if we can tune the ratio of assignments but still keep some jobs always running there, we probably mostly avoid the acclimitzation problem | 17:55 |
| fungi | if we have lengthy periods where no jobs run there, the risk is much higher | 17:56 |
| fungi | acclimatization | 17:56 |
| fungi | it's a hard word to type after lunch | 17:57 |
| corvus | nodepool has a priority feature for the provider pools.... zuul-launcher does not. i think it would be generally useful and we can probably accommodate it. | 18:20 |
| corvus | an interesting question, which fungi is getting at, is do we set it up so that it's strictly ordered (fill everything else first, then rax-classic), or slightly more random (give rax-classic a 1% chance of handling a request). | 18:22 |
| clarkb | I think I'm convinced we'd want to make it proportional like it is today but probably with more direct control of the proportions | 18:23 |
| clarkb | to ensure that you're semi-constantly checking that you can still operate in each environment | 18:23 |
| corvus | currently it works by going through a sorted list of providers and finding the first one that can fulfill the request, but they are sorted like this: | 18:24 |
| corvus | 1) unassigned nodes; 2) quota used percentage (quantized to 10% buckets); 3) random | 18:24 |
| corvus | so the primary sort is by unassigned nodes (so we grab those first) | 18:24 |
| corvus | then after that we use the provider that's least used | 18:25 |
| corvus | and among providers with the same usage, the order is randomized | 18:25 |
| corvus | (and by "same usage" i mean "within 10%" so that the randomization is meaningful) | 18:25 |
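For illustration, a rough Python sketch of the ordering corvus describes here; this is not the actual zuul-launcher code, and the provider fields and example names are made up:

```python
import random
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    unassigned_nodes: int   # ready-but-unassigned nodes this provider is holding
    quota_used_pct: float   # percentage of quota currently in use

def sort_key(p: Provider):
    # 1) providers holding unassigned nodes sort first (reuse before booting)
    # 2) then least-used quota, quantized to 10% buckets
    # 3) ties within a bucket are broken randomly
    return (-p.unassigned_nodes, int(p.quota_used_pct // 10), random.random())

providers = [
    Provider("rax-classic-dfw", unassigned_nodes=0, quota_used_pct=72),
    Provider("raxflex-dfw3", unassigned_nodes=2, quota_used_pct=40),
    Provider("ovh-bhs1", unassigned_nodes=0, quota_used_pct=38),
]

# The launcher then walks the sorted list and assigns the request to the
# first provider that can actually fulfill it (label, flavor, quota headroom).
for p in sorted(providers, key=sort_key):
    print(p.name)
```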
| opendevreview | Merged opendev/system-config master: Drop /opt/project-config/gerrit/projects.ini on review https://review.opendev.org/c/opendev/system-config/+/966083 | 18:26 |
| opendevreview | Merged opendev/system-config master: Update Gerrit images to 3.10.9 and 3.11.7 https://review.opendev.org/c/opendev/system-config/+/966084 | 18:26 |
| fungi | probably the ideal algorithm would be distributing across a tunable proportion of available providers; the real question is how backlogged requests can/should factor into those decisions (for example does a solution exist where we always run with x% distributed to that provider across available providers, but prefer to also delay satisfying some requests unless they reach a backlog threshold?) | 18:26 |
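For concreteness, a toy sketch of the overflow rule fungi is asking about: only consider a designated overflow provider once the request backlog crosses a threshold. This is purely hypothetical (as corvus notes a bit further down, nothing like this exists in zuul-launcher today), and all names and numbers here are invented:

```python
def eligible_providers(providers, pending_requests, overflow=("rax-classic-dfw",),
                       backlog_threshold=50):
    """Return the providers a new request may be assigned to right now."""
    if pending_requests > backlog_threshold:
        # backlog is high enough: burst into the overflow provider too
        return list(providers)
    return [p for p in providers if p not in overflow]

all_providers = ["ovh-bhs1", "raxflex-dfw3", "rax-classic-dfw"]
print(eligible_providers(all_providers, pending_requests=10))   # overflow excluded
print(eligible_providers(all_providers, pending_requests=120))  # overflow included
```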
| corvus | so if we did what clarkb suggests, one way to do that might be to tweak the step at #2 | 18:26 |
| fungi | yeah, i can imagine a couple of different tunable distribution methods, but i expect they directly conflict with one another and we'd need to decide which to support (or support multiple distribution algorithms, but that seems less likely to be worth the effort) | 18:28 |
| corvus | fungi: there isn't anything like that in z-l today -- eventually if all providers become full enough, it's just a random allocation among all of them | 18:28 |
| corvus | (except that we try to avoid processing requests that can't be fulfilled, so if things are working well and the clouds are not lying to us, we just wait for the next available provider) | 18:29 |
| fungi | right, algorithmically speaking, "random" could get a weighted distribution applied to it | 18:29 |
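To illustrate fungi's point, the random tiebreak in step 3 could become a weighted draw, e.g. giving a rax-classic provider a much smaller chance of winning a tie. This is a hypothetical tweak rather than an existing zuul-launcher option, and the weights and names are invented:

```python
import random

# hypothetical per-provider weights: 1.0 = normal, small value = rarely preferred
WEIGHTS = {
    "raxflex-dfw3": 1.0,
    "ovh-bhs1": 1.0,
    "rax-classic-dfw": 0.05,
}

def weighted_tiebreak(tied_providers):
    """Pick one provider from a group that tied on the earlier sort criteria."""
    weights = [WEIGHTS.get(name, 1.0) for name in tied_providers]
    return random.choices(tied_providers, weights=weights, k=1)[0]

# rax-classic still gets picked occasionally, so jobs keep exercising it,
# but most ties go elsewhere
print(weighted_tiebreak(["rax-classic-dfw", "ovh-bhs1"]))
```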
| clarkb | ya reusing unassigned nodes first regardless of where they are makes sense as the first criterion after filtering for the ability to fulfill the request | 18:29 |
| corvus | (usually the cloud lies to us though, and we end up bouncing over-quota requests around different providers) | 18:29 |
| fungi | oh, good point, "already available" vs "need to be booted" is another critical efficiency factor to figure in | 18:30 |
| clarkb | the naive approach of applying a preference ratio to the existing #2 criterion seems like it could starve clouds unexpectedly | 18:30 |
| corvus | yeah. there's a discontinuity in the algorithm at 90% though -- we quantize to 10% buckets below 90, but above 90%, it's actual values, so we try to land at exactly max quota for every provider. we could incorporate something like that into the proposed scaling. | 18:32 |
| clarkb | manage-projects is running now for the updated docker compose file for gerrit. Manage-projects was already bind mounting the homedir location for projects.ini so this should be a noop | 18:32 |
| corvus | oh, sorry, the discontinuity is actually at 100%. but still, point stands. | 18:33 |
| corvus | (above 100% is a launcher with a backlog) | 18:33 |
| corvus | s/launcher/provider/ | 18:33 |
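A sketch of the quantization corvus describes, including the discontinuity at 100% (illustrative only, not the real launcher code):

```python
def quota_sort_value(quota_used_pct: float) -> float:
    """Secondary sort key derived from a provider's quota usage."""
    if quota_used_pct < 100:
        # below full quota: collapse into 10% buckets so providers within
        # roughly 10% of each other tie and fall through to the random tiebreak
        return quota_used_pct // 10
    # at or above full quota (i.e. a provider with a backlog): use the exact
    # value so the least-backlogged provider sorts first
    return quota_used_pct

assert quota_sort_value(32) == quota_sort_value(38)   # same bucket -> tie
assert quota_sort_value(95) < quota_sort_value(105)   # buckets sort below backlogged values
assert quota_sort_value(105) < quota_sort_value(130)  # exact ordering above 100%
```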
| fungi | anyway, back to the original topic, my opinion is that we just need to decide when to stop booting job nodes in rackspace classic (by adjusting our configuration), probably based on our determination that we have sufficient quota in other providers. something more flexible could be nice, but is probably not necessary | 18:34 |
| fungi | i don't think *we* have a strong use case for such features in zuul-launcher, though we might take advantage of them if they existed | 18:36 |
| clarkb | ya manage projects logs lgtm. Lots of skipping due to matching shas | 18:36 |
| clarkb | that implies a noop, but also I don't think it would've gotten that far if there was a problem with the projects.ini config, which tells it where and how to connect to gerrit | 18:36 |
| fungi | (and others probably do have stronger use cases, so what's optimal for them should be more informative of actual development effort, if any) | 18:36 |
| clarkb | sorry for splitting the discussion I just want to make sure the gerrit stuff updates as expected | 18:37 |
| fungi | yep, appreciated, trying to keep tabs on it myself | 18:37 |
| clarkb | fungi: yes agreed as far as this not being a strong need | 18:37 |
| clarkb | I was mostly brainstorming if we can start managing the old cloud more intelligently in response to new resources | 18:38 |
| fungi | i do feel like we don't get a lot of complaints about being backlogged on change queues, but we do hear about performance/featureset problems with rackspace classic nodes somewhat regularly still | 18:39 |
| fungi | so i'd be willing to trade a bit more of the former for less of the latter | 18:39 |
| clarkb | both buildsets in deploy succeeded. I'm checking the docker-compose.yaml now | 18:40 |
| clarkb | ya projects.ini bind mount is gone from docker-compose.yaml. I realized the new images don't update docker-compose.yaml; for that I need to look at quay | 18:41 |
| fungi | perfect | 18:42 |
| clarkb | quay.io has a banner warning of a tls cert update for their cdn occurring on the 17th. The reason for the warning is the trust chain will change. But if you open the document explaining this it's subscriber content only :/ | 18:42 |
| clarkb | anyway I don't expect that to affect us and https://quay.io/repository/opendevorg/gerrit?tab=tags says 3.10 did update as expected | 18:42 |
| fungi | TheJulia: do you happen to know who at rh would make the decision to fix the fact that the tls certificate warning banner at the top of https://quay.io/ takes users to a red hat knowledgebase article that requires a red hat subscription? | 18:46 |
| fungi | seems like some wires got crossed wrt appropriate publication platforms for public-facing services | 18:46 |
| clarkb | fungi: tonyb: I think that around 2000 UTC we can start to get ready, pull and confirm the new image looks correct, prepare the suite of commands to clear caches and replication queues, then ~2100 UTC send it | 18:48 |
| fungi | sounds good, i'll be here | 18:48 |
| clarkb | maybe around 2000 we also send a status notice indicating our plan to take a short outage | 18:48 |
| clarkb | I'll go ahead and warn the openstack release team now | 18:49 |
| fungi | good idea | 18:49 |
| fungi | though looks like it's been about 5 hours since anyone there was approving new release requests | 18:49 |
| TheJulia | I suspect it is because the "easy" path for teams is the kb, which does require an account and possibly a subscription entitlement. I suspect the right thing to do would be to open an issue in issues.redhat.com in PROJQUAY, type bug, component set to quay.io | 18:50 |
| fungi | thanks! now to see if i have a "red hat account" so i can log into jira | 18:52 |
| TheJulia | I take it you don't at all... hmm | 18:58 |
| fungi | no, i do, took me a few minutes to find it since i don't use it often | 18:58 |
| fungi | now i'm just fumbling around the interface, shouldn't take me long | 18:59 |
| fungi | oh, that's fun. i have a preexisting account on the rh jira with the same e-mail address as my rh sso account, so it won't let me log into jira due to that conflict? | 19:01 |
| fungi | "SAML-Login failed: You have logged in with username [...], but you already have a Jira account with [...]. If you need assistance, email jiraconf-sd@redhat.com and provide tracker-id." | 19:01 |
| fungi | i guess this is where i hit the "too much effort required to bother reporting something so trivial" threshold | 19:02 |
| clarkb | its gerrit openid all over again | 19:02 |
| fungi | yeah, complicated in my case because the conflict is on an e-mail address that redhat.com's mail provider rejects messages from, so if i e-mail them to try and get it resolved i'd have to do it from a different address than the one for the account in question | 19:10 |
| clarkb | heh I've just realized my 2000 UTC math coincides with lunch. I should be able to make that work just fine though | 19:11 |
| clarkb | my mental timezone map post DST drop hasn't updated properly yet | 19:11 |
| TheJulia | fungi: opening an issue for you now | 19:12 |
| fungi | TheJulia: oh thank you! | 19:15 |
| TheJulia | PROJQUAY-9772 | 19:15 |
| fungi | appreciated. we don't think it impacts us at all, but suspect it may impact other users who don't have rh accounts to read the article explaining whether or not it does | 19:16 |
| TheJulia | It is kind of not great, tbh | 19:17 |
| TheJulia | and the only way for teams to sometimes know is when they triage such issues as a team | 19:18 |
| clarkb | how does this look #status notice The OpenDev team will be restarting Gerrit at approximately 2100 UTC in order to pick up the latest 3.10 bugfix release. | 20:00 |
| clarkb | tonyb: I'm happy to wait for your morning to start to do the on server actions so that you're able to follow along and/or be the typing driver | 20:02 |
| clarkb | and then I guess the other question is do we want to jump on meetpad to make coordination more synchronous or are we happy with irc? | 20:04 |
| corvus | clarkb: msg lgtm | 20:07 |
| fungi | yeah, i'm good with the wording | 20:07 |
| clarkb | oh looks like I said we'd get started at 2100 with actual restart happening at about 2130 when planning on Tuesday | 20:09 |
| clarkb | so maybe update that warning to 2130? but otherwise send it at 2030? | 20:09 |
| fungi | fine by me, better to stick with the prior plan yeah | 20:09 |
| clarkb | I then recorded that in my notes as 2100 hence the confusion. | 20:12 |
| clarkb | Mostly I want to make sure tonyb has a chance to participate | 20:12 |
| fungi | hah | 20:12 |
| fungi | yeah, it's all good with me, i'll be here regardless | 20:12 |
| clarkb | ok sending the notice now | 20:30 |
| clarkb | #status notice The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release. | 20:30 |
| opendevstatus | clarkb: sending notice | 20:30 |
| -opendevstatus- NOTICE: The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release. | 20:30 | |
| tonyb | I'll be at my laptop in a couple of minutes. | 20:31 |
| fungi | there's no rush | 20:31 |
| clarkb | ya I may take this as an opportunity for a short break before we dive in. I think the next steps are deciding if we want to use meetpad or not, then starting a screen session and pulling and verifying the image | 20:32 |
| clarkb | then we wait for ~2130 UTC and can proceed with restarting things at that time | 20:32 |
| opendevstatus | clarkb: finished sending notice | 20:33 |
| fungi | i'm happy to jump in a meetpad room if people want that (e.g. for screen sharing), but gnu screen multi-attach works fine as long as we're all infra-root | 20:34 |
| opendevreview | Merged openstack/project-config master: Set noop job for the governance-sigs repository https://review.opendev.org/c/openstack/project-config/+/966755 | 20:39 |
| tonyb | clarkb, fungi: ready when you are | 20:47 |
| clarkb | fungi: ya I was just thinking maybe synchronous voice chat would help explain what we're doing, but I'll let tonyb decide if that is helpful. I don't think we should use meetpad to screenshare and instead use screen to share the context | 20:52 |
| clarkb | tonyb: want to start a root screen we can attach to? Then we can take notes on what the current image is that we're running and fetch and check the new image | 20:53 |
| tonyb | Can do | 20:54 |
| clarkb | tonyb: if you do a docker ps -a you should see that the container's image matches that image there too | 20:58 |
| clarkb | ya so the container running gerrit (gerrit-compose-gerrit-1) says its image is quay.io/opendevorg/gerrit:3.10 which is currently 16e3e6710a12 | 21:00 |
| clarkb | this is mostly for historical record keeping and theoretical rollback paths | 21:00 |
| tonyb | Okay | 21:00 |
| clarkb | so now I think you can do `docker-compose pull` and that should only update images for us and not touch the running containers. Then we can cross check the new container image with what we have in quay.io | 21:01 |
| clarkb | or `docker compose pull` since the - is optional on noble nodes | 21:01 |
| tonyb | https://meetpad.opendev.org/gerrit-update if y'all want | 21:01 |
| clarkb | tonyb: so the new image can be found at https://quay.io/repository/opendevorg/gerrit?tab=tags | 21:04 |
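A small sketch (assuming the container and tag names used above) of this cross-check: compare the image ID backing the running container against the ID the freshly pulled tag points at:

```python
import subprocess

def docker_inspect(*args):
    """Run a docker inspect variant with a --format template and return the output."""
    return subprocess.run(["docker", *args], capture_output=True, text=True,
                          check=True).stdout.strip()

# image ID the running Gerrit container was created from
running = docker_inspect("inspect", "--format", "{{.Image}}", "gerrit-compose-gerrit-1")
# image ID the pulled tag now points at
pulled = docker_inspect("image", "inspect", "--format", "{{.Id}}",
                        "quay.io/opendevorg/gerrit:3.10")

# right after `docker compose pull` these may differ; once the container is
# restarted on the new image they should match
print("running:", running)
print("pulled: ", pulled)
print("match:  ", running == pulled)
```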
| opendevreview | James E. Blair proposed opendev/system-config master: Update zuul-client image location https://review.opendev.org/c/opendev/system-config/+/967111 | 21:22 |
| corvus | clarkb: tonyb zuul isn't super busy, but it's also not idle... if we want to be very conservative, we can pause event processing before you shut down gerrit. | 21:24 |
| corvus | lmk if you want to do that, i've got a command ready (which i can send to you if you want to run it) | 21:24 |
| clarkb | corvus: will that pause merge attempts for changes already in the gate? | 21:24 |
| corvus | yes | 21:24 |
| clarkb | I think we should go ahead and do that | 21:25 |
| corvus | (that would be the main reason to do it; i'm not worried about disconnections or events or anything like that) | 21:25 |
| corvus | our helper script has an error, which i worked on correcting in 967111; in the mean time, i made a local copy on zuul01 that is corrected | 21:26 |
| corvus | so, as root on zuul01: /root/zuul-client manage-events --all-tenants --reason "Gerrit upgrade in progress" pause-result | 21:26 |
| clarkb | thanks! | 21:26 |
| corvus | when finished: /root/zuul-client manage-events --all-tenants normal | 21:26 |
| clarkb | shutdown is proceeding now | 21:30 |
| clarkb | [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.10.9-4-g309e8c8861-dirty ready | 21:34 |
| clarkb | https://groups.google.com/g/repo-discuss/c/TxeJ9lvW6Is/m/NGLT2D5aAQAJ I think this is related to the error we see from the plugin manager preloader | 21:40 |
| clarkb | however the specific error is different | 21:40 |
| clarkb | https://gerrit.googlesource.com/plugins/webhooks | 21:42 |
| clarkb | https://gerrit-review.googlesource.com/Documentation/cmd-index-start.html | 21:47 |
| opendevreview | Clark Boylan proposed opendev/system-config master: DNM this is a change ot test gerrit functionality https://review.opendev.org/c/opendev/system-config/+/967131 | 21:49 |
| clarkb | this didn't make it into the IRC logs so I'll add it: corvus pointed out that at least one change would've reported during the zuul pause period. If we hadn't paused that could've broken a merge | 22:09 |
| clarkb | so that's a great little feature for us to remember we have now with zuul | 22:09 |
| clarkb | I'm going to delete my autoholds for the old gerrit 3.11 image (we just updated it to a new version) and etherpad 2.5.3 and gitea 1.25.1 | 22:17 |
| clarkb | I don't see any other autoholds in the openstack tenant | 22:17 |
| clarkb | usually I make note of this here so that I can delete other autoholds too but not necessary this time | 22:17 |
| *** dmellado9 is now known as dmellado | 22:18 | |
| clarkb | infra-root gerrit reindexing has completed and reports `Reindex changes to version 86 complete` with 3 failed changes (the expected count) | 22:27 |
| clarkb | I have requested new autoholds for gerrit 3.10 and 3.11 jobs that will hopefully finally be used to test the upgrade and get that 3.10 to 3.11 upgrade done | 23:48 |