cloudnull | just checked https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 - looks happy :D | 00:11 |
* cloudnull happy | 00:12 |
fungi | yep, thanks bunches cloudnull! | 00:35 |
cloudnull | sorry for the kerfuffle | 01:28 |
opendevreview | Thierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 10:07 |
opendevreview | Thierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 12:49 |
fungi | ttx: see inline comment on ^ in case you want to adjust before it's approved | 12:55 |
ttx | fungi: yeah, it's actually irrelevant (the next date is generated from this date, but it does not have to be the first occurrence) | 13:07 |
ttx | I tried to be more specific but then it would set the first date to April instead of March | 13:07 |
ttx | probably a bug somewhere in there | 13:08 |
fungi | neat | 13:12 |
opendevreview | Merged opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 13:16 |
opendevreview | Doug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP https://review.opendev.org/c/openstack/diskimage-builder/+/942215 | 15:56 |
clarkb | fungi: not sure if the hack in https://review.opendev.org/c/opendev/system-config/+/942155 is something you've seen before? In any case landing that then testing access to dfw3 seems like a good next step | 16:00 |
clarkb | and also https://review.opendev.org/c/opendev/system-config/+/941997 to make spinning up new servers quicker | 16:00 |
clarkb | and then we should also proceed with landing bindep changes and looking at mailman log rotation? | 16:04 |
opendevreview | Doug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP https://review.opendev.org/c/openstack/diskimage-builder/+/942215 | 16:07 |
cardoe | interesting... my diskimage-builder change comes here.. I was trying to find the team that would be responsible for it. | 16:13 |
clarkb | there is a dedicated dib team with a #openstack-dib channel but it overlaps a bit with us as we've relied on it heavily since tripleo half put it to pasture | 16:15 |
clarkb | cardoe: I'm curious why you wouldn't do the inverse in that change and just let systemd-networkd DHCP all interfaces if that is the goal, and just not set up the manual configuration for that | 16:15 |
cardoe | I'd be fine with that too. I actually put that in the bug report. | 16:19 |
cardoe | I just don't know how to prevent the dib/element from doing anything in that case. | 16:19 |
cardoe | the ironic-python-agent element pulls in dhcp-all-interfaces. I'm just not knowledgeable enough to make it conditional in that case. | 16:20 |
clarkb | diskimage_builder/elements/dhcp-all-interfaces/install.d/50-dhcp-all-interfaces seems to do the setup and already has an exclusion rule for gentoo I think because gentoo uses systemd-networkd | 16:21 |
clarkb | you might want to do something in the if init system == systemd block that checks if systemd-networkd is enabled and if so noop | 16:21 |
clarkb | but I'm not super familiar with that element. We use simple-init with glean then let config drive contents determine interface setup | 16:22 |
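A minimal sketch of the noop check suggested above, assuming the element's install script can simply exit early; this is illustrative logic, not the actual dhcp-all-interfaces code:

```shell
# Sketch (assumed logic): inside the "init system == systemd" block, skip
# the element's manual per-interface DHCP setup when systemd-networkd is
# already enabled and will DHCP the interfaces itself.
if [ "$(systemctl is-enabled systemd-networkd 2>/dev/null)" = "enabled" ]; then
    decision="skip"     # networkd manages interfaces; noop
else
    decision="install"  # fall back to the element's per-interface setup
fi
echo "dhcp-all-interfaces: $decision"
```

On a host without systemd the `systemctl` call fails quietly and the sketch falls through to the install path, which matches the element's current default behavior.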
clarkb | sounds like the gerrit meets presentation from luca about the gerrit 2025 roadmap will be streamed here: https://www.youtube.com/gerritforgetv | 16:24 |
clarkb | that occurs at 19:45 UTC if I've translated timezones properly | 16:25 |
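The timezone translation above can be double-checked with GNU date; the calendar date and the US Pacific zone below are placeholders (only the 19:45 UTC time comes from the log):

```shell
# Convert 19:45 UTC on a placeholder date into US Pacific wall-clock time
converted=$(TZ=America/Los_Angeles date -d '2025-02-20 19:45 UTC' '+%H:%M')
echo "$converted"   # 11:45 (Pacific is UTC-8 in February)
```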
mordred | <clarkb> "I think 942155 has the hack that..." <- looks reasonable to me | 16:26 |
clarkb | mordred: you've got a comment in openstacksdk indicating that a better way should be written but as far as I can tell that hasn't happened yet :) | 16:29 |
fungi | clarkb: commit message on 942155 is a bit confusing, but the solution looks fine to me, i agree it matches the internal sdk profiles | 16:31 |
clarkb | fungi: do you want me to edit the message? there are a few typos and words that are the wrong term | 16:33 |
fungi | basically that, after reading the change i think i understand what the commit message meant to say, so good enough | 16:33 |
clarkb | I was rushing to get that up and also to the service coordinator thing before the EOD deadline | 16:34 |
fungi | makes sense | 16:34 |
clarkb | clearly my typing drivers don't work so great when rushing | 16:34 |
clarkb | mirror-update is on the list of servers to replace. This server doesn't have any real disk space locally as it does everything with afs | 17:08 |
clarkb | I think we should be able to boot a new mirror-update, hold all the lock files on the old server, then merge a change to deploy new mirror update as a mirror-updater. Then delete the old server once we're satisfied with the new one. Any concerns with using "hold all the locks" as the conflict resolution method? | 17:09 |
clarkb | another option would be to shutdown the old server entirely | 17:09 |
clarkb | that might be safer since it would avoid problems with spontaneous reboots | 17:09 |
clarkb | corvus: any idea what the transition of state for tracing from an old to new server is? Maybe we're ok with losing the old state? | 17:12 |
clarkb | also this reminds me I think we're still pinning that container image and it's been on my todo list forever to try and debug that but deprioritized | 17:13 |
fungi | i guess we could put the old server in the emergency disable list, comment out all the long-running mirror cronjobs on it, wait for any of those to complete, comment out the static publishing vos release cronjob that runs every 5 minutes, make sure it's not in progress, then shut down the server and add the replacement to the inventory? | 17:15 |
fungi | but yeah, the mess you can get into with interrupted afs writes makes it a little complicated | 17:16 |
clarkb | oh ya I guess before we shut it down we would need to settle things gracefully first | 17:16 |
clarkb | so the shutdown approach implies holding all the locks first (and maybe disabling the vos release cron) | 17:17 |
clarkb | I don't know if that vos release cron has a lock | 17:17 |
clarkb | if it doesn't we should add one | 17:17 |
clarkb | it does have a lockfile but it isn't in the cron command. It is embedded in the python script | 17:18 |
clarkb | publish-mirror-logs doesn't have a lock but I'm not sure if it needs one. Seems it just writes to the afs filesystem then we vos release it as part of that other script? | 17:19 |
clarkb | so ya grab all locks, shutdown server, merge change for new server should be safe I think | 17:19 |
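A minimal demonstration of the "grab all locks" mechanism with util-linux `flock`: hold a lockfile from one process, then show a second non-blocking attempt is refused. The path is a stand-in for the real mirror-update lockfiles:

```shell
# Hold a lock in the background (stand-in for a long-held mirror lock)
lock=/tmp/demo-mirror-update.lock
flock "$lock" sleep 5 &
holder=$!
sleep 0.2   # give the background holder time to acquire the lock

# A cron job guarded the same way would skip its run here
if flock -n "$lock" true; then
    state="lock free"
else
    state="lock held"
fi
echo "$state"
kill "$holder" 2>/dev/null
```

The same `flock -n lockfile command` pattern is the usual way to guard a cron entry when the script itself doesn't embed its own lockfile.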
clarkb | for nodepool launchers we give each server a server specific config file. I think if we replace nl01 with nl05 the change to add nl05 to inventory would set max-servers to 0 on nl01 and let nl05 take over those providers. Then we delete nl01 and it goes away | 17:21 |
clarkb | those servers should be straightforward. Mirrors are straightforward too. I guess we can start with those two groups before worrying about mirror update and tracing | 17:22 |
clarkb | reminder I'd like to get https://review.opendev.org/c/opendev/system-config/+/941997 in before adding any more new servers. Want at least one person to sanity check my assumptions about testing there | 17:22 |
clarkb | but then I can look at replacing the nl servers | 17:22 |
opendevreview | Merged opendev/system-config master: Add DFW3 to raxflex cloud profiles on bridge https://review.opendev.org/c/opendev/system-config/+/942155 | 17:23 |
clarkb | that is deploying right now (hourly jobs are already done) | 17:24 |
fungi | looking at the mirror-update server, the only state not kept in afs seems to be in /var/log (afs-release, rsync-mirrors, reprepro), and it's probably not super critical we preserve those | 17:24 |
fungi | reprepro state databases are in afs | 17:24 |
clarkb | fungi: cloudnull: after 942155 I am still able to run server list against sjc3 for both of our projects/tenants but doing so against dfw3 says authentication is required | 17:26 |
clarkb | using the --debug flag I can confirm that we are talking to http://keystone.api.dfw3.rackspacecloud.com/v3/ when using the dfw3 region | 17:27 |
clarkb | is there additional account setup that is required to use the new region? | 17:27 |
fungi | i'm good with 941997, logic there makes sense | 17:28 |
clarkb | I'm not sure what surgery was done with the project ids yesterday either. Maybe each region has different ids? (that would make my assumption we can auth with one clouds.yaml provider a bad one I think) | 17:28 |
clarkb | fungi: thanks I'll go ahead and approve that now then. I just wanted at least one person to ensure that removing testing is appropriate in this case | 17:29 |
fungi | cloudnull mentioned via privmsg to me yesterday that the new project_id would be consistent across regions | 17:29 |
clarkb | it is easy to add back in if necessary too | 17:29 |
fungi | fwiw, testing the dfw3 region with my personal rackspace account, openstackclient reports "The request you have made requires authentication. (HTTP 401)" even though i'm using the federated project_name/project_domain_name options instead of project_id in my own clouds.yaml | 17:34 |
fungi | same config is working fine for me with sjc3 though | 17:34 |
clarkb | so thats similar to what I see with our accounts. I wonder if we just need that region to be enabled? | 17:35 |
fungi | or if it's like the first time we used sjc3 where the api wouldn't authenticate until i logged into skyline at least once | 17:38 |
clarkb | ya though I thought they said it shouldn't work that way (instead there was some heuristic to opt projects into the regions and we got missed?) | 17:38 |
clarkb | I dunno may be worth a shot | 17:38 |
fungi | i should be able to test that theory with my account in a few minutes | 17:38 |
clarkb | thanks | 17:38 |
fungi | and if it works, we can leave our opendev accounts in that state temporarily so raxfolx can look into it | 17:40 |
clarkb | ++ | 17:40 |
opendevreview | Merged opendev/system-config master: Adjust LE role file matchers on system-config-run-* jobs https://review.opendev.org/c/opendev/system-config/+/941997 | 17:46 |
fungi | huh, so with my personal account i seem to now have two tenants in sjc3, one under my NNNNNN_Flex project_name and the other under a new uuid-based project_name, the project_id of each of those is different | 17:48 |
fungi | but in dfw3 i only have that new uuid-based project_name (and its project_id is consistent with sjc3) | 17:50 |
cloudnull | clarkb accounts should be the same, are you able to login to skyline with the same credentials? | 17:50 |
fungi | cloudnull: we haven't tested skyline with our opendev accounts yet, i'm checking with my personal rackspace account first before i accidentally nudge a heisenbug or something | 17:51 |
corvus | cloudnull: opentelemetry/jaeger tracing? i say don't worry about it; i wouldn't bother trying to keep old data, and it'll be fine if the server just starts getting new data. | 17:51 |
corvus | gah | 17:51 |
corvus | clarkb: ^ sorry that was for you not cloudnull | 17:52 |
clarkb | corvus: ack thanks for confirming | 17:52 |
cloudnull | can you auth with no defined project, run something like openstack project list, with no project-id/name defined, that should list out the available projects for the tenant. | 17:54 |
fungi | cloudnull: looking at my personal account first, it looks like i now have two different tenants in sjc3 (one under my NNNNNN_Flex project_name which has a server instance in it, another with a uuid-based project_name that has no server instances). in dfw3 i only have one tenant (the same uuid-based project name as in sjc3, with a matching project_id from sjc3 too) | 17:54 |
cloudnull | if that works, it should seed the environment with your account. | 17:54 |
cloudnull | fungi ++ those NNNNNN_Flex projects were an early account type, they'll continue to exist in SJC but going forward we're not creating them. | 17:55 |
clarkb | I thought project was a required auth parameter? | 17:55 |
clarkb | like osc won't let you auth without it specified? | 17:55 |
cloudnull | it should let you get an unscoped token | 17:55 |
fungi | cloudnull: thanks, that explains it. wrt my personal account i'll just boot a new instance in the newer tenant and delete the old one in that case | 17:56 |
clarkb | does logging into skyline authenticate without a project? | 17:56 |
clarkb | I guess that could explain the heisenbug behavior we think we've seen | 17:56 |
fungi | clarkb: skyline seems to not ask for a project, right | 17:57 |
cloudnull | it does, it first pulls an unscoped token | 17:57 |
clarkb | ok so logging into skyline is probably the easiest way to do that as it doesn't involve copying or hacking up clouds.yaml files | 17:57 |
fungi | i'll go ahead and do it in that case since i've already got everything up in front of me | 17:57 |
clarkb | fungi: are you able to use osc against dfw3 with your personal account now to confirm that generally seems to work? | 17:58 |
clarkb | fungi: cool thanks | 17:58 |
cloudnull | https://gisty.link/087b568c73dc635a676190e09c6b13baa45cadc9 this is my unscoped clouds.yaml | 17:59 |
fungi | clarkb: i need to switch my clouds.yaml over to use the new tenant (which doesn't contain my existing server instance because that's in the old tenant), but looking at what's in skyline i expect it to just work | 17:59 |
cloudnull | the output https://gisty.link/b1fd7580d5d3cf2b3173cbf9521453dfa595c730 | 18:01 |
fungi | i can confirm that when i comment the project out of my clouds.yaml `openstack project list` gives me a list of the projects my account has access to | 18:03 |
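A sketch of the unscoped profile shape being tested here, modeled on cloudnull's gist; every value below is a placeholder, and the point is simply the absence of any project_* keys so keystone issues an unscoped token:

```shell
# Write a hypothetical unscoped clouds.yaml profile (placeholder values)
cat > /tmp/clouds-unscoped.yaml <<'EOF'
clouds:
  raxflex-unscoped:
    auth_type: v3password
    auth:
      auth_url: https://keystone.api.sjc3.rackspacecloud.com/v3/
      username: example-user
      password: example-secret
      user_domain_name: example-domain
      # intentionally no project_name / project_id, so auth is unscoped
EOF
# With that in place one would run (not executed here):
#   OS_CLIENT_CONFIG_FILE=/tmp/clouds-unscoped.yaml \
#     openstack --os-cloud raxflex-unscoped project list
test -s /tmp/clouds-unscoped.yaml && echo "unscoped profile written"
```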
clarkb | maybe that is new but I thought you always had to have a project/tenant even if it was the default one. Or maybe I'm thinking of the whole domain madness | 18:03 |
clarkb | ya actually it might be domain because I remember at one time the default changed from "default" to "Default" or vice versa and then you had to get extra verbose about it even when using the default | 18:04 |
fungi | you can leave project_name and project_id out of your clouds.yaml and pass --os-project-name or --os-project-id on the command line instead | 18:04 |
fungi | just tested that and it's working for me | 18:05 |
clarkb | ya there is a list of values that you can supply on the command line to override or supplement the clouds.yaml | 18:05 |
clarkb | (that doesn't work for all options iirc) | 18:05 |
clarkb | anyway let me know when I should test dfw3 again and I'll do that | 18:06 |
fungi | what's the undocumented magic to tell osc to use a different clouds.yaml file? | 18:10 |
clarkb | its something like OS_CLOUD_CONFIG_FILE=/path/here | 18:10 |
fungi | oh, right, there isn't a cli option | 18:11 |
clarkb | OS_CLIENT_CONFIG_FILE | 18:11 |
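For reference, openstacksdk's config loader honors that environment variable in place of a nonexistent CLI flag; path and cloud name below are placeholders:

```shell
# One-off override (not executed here; requires a real cloud):
#   OS_CLIENT_CONFIG_FILE=/tmp/alt-clouds.yaml openstack --os-cloud raxflex server list
# Demonstrate that the per-command env assignment reaches the child process:
shown=$(OS_CLIENT_CONFIG_FILE=/tmp/alt-clouds.yaml sh -c 'echo "$OS_CLIENT_CONFIG_FILE"')
echo "$shown"
```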
fungi | so the experiment was unsuccessful | 18:12 |
fungi | no, wait, it's still picking up a default project in my config somewhere | 18:12 |
fungi | okay, that's working | 18:13 |
clarkb | is this with your personal account? | 18:15 |
fungi | no, our opendev accounts now | 18:15 |
clarkb | ok those aren't working for me. Not sure what the difference is | 18:15 |
fungi | because i had already logged into dfw3 skyline with my personal account so it got synced up already | 18:16 |
clarkb | the server list commands I was running before continue to produce the same result of needing authentication | 18:16 |
fungi | yeah, i see the issue | 18:20 |
fungi | like my personal account does, our opendev accounts now have two projects in sjc3: an old one with an NNNNNN_Flex name and a new one with a uuid-based name | 18:21 |
fungi | the old NNNNNN_Flex projects contain our mirror server and nodepool nodes | 18:21 |
fungi | the new uuid-based projects are empty | 18:21 |
clarkb | and dfw3 only has the new one | 18:21 |
fungi | correct | 18:21 |
fungi | the new uuid-based project has a consistent name and id in both sjc3 and dfw3, but is not the one we're currently using | 18:22 |
clarkb | so maybe the "solution" here is to create a new raxflex clouds.yaml profile for both sjc3 and dfw3. Then use this as the motivation to rebuild sjc3 in the new project to fix the networking stack | 18:22 |
fungi | so maybe this is our push to... yeah exactly | 18:22 |
clarkb | that might get a little complicated with nodepool configs but the clouds name is referenced by nodepool config so I think we can set max-servers to zero and image to empty list and let nodepool clean things up. switch the cloud name then revert the shutdown steps to have it rebuild? | 18:23 |
fungi | we'll want to build a new mirror too | 18:23 |
clarkb | yes | 18:23 |
fungi | i guess we can just build new sjc3 and dfw3 mirror instances at the same time | 18:24 |
clarkb | I think the process is add new clouds.yaml profiles for sjc3 and dfw3 using the new tenant/project. Manage those new tenants in cloud launcher. Upload noble images if necessary. Boot new mirrors. Add dfw3 to nodepool. Gracefully shutdown old sjc3. Bring up new sjc3 in nodepool | 18:24 |
clarkb | then clean up the old profiles and secrets data | 18:25 |
clarkb | fungi: did you want to push up the changes to start reorging clouds.yaml content since you seem to have a good handle on it or should I give it a go and you can tell me what needs editing? | 18:28 |
fungi | yeah, just pulled the clouds.yaml template up in my editor | 18:28 |
clarkb | cool I'm ready to do my best to review the change(s) | 18:28 |
fungi | the hardest part is going to be naming them, as always | 18:28 |
clarkb | ya the old one has the "good" name :) | 18:28 |
clarkb | could do opendevci-rax-flex instead of opendevci-raxflex then eventually we'll delete the opendevci-raxflex profile and it won't be confusing | 18:29 |
clarkb | slight confusion until we reach that point | 18:29 |
fungi | sure, i dig it | 18:29 |
clarkb | I've started booting a tracing02 server | 18:44 |
clarkb | seems like it should be a straightforward swap and I'm all for getting more noble coverage from the easy things | 18:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 18:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to using a Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 18:44 |
fungi | the second change there is wip for now, more a placeholder to flesh out once we're ready | 18:45 |
clarkb | fungi: changes to the clouds.yaml files require dummy hostvars (or maybe groupvars) data so that the file can be templated out successfully iirc | 18:47 |
clarkb | if you do a git grep of the vars in that file you should find where the dummy values are set | 18:48 |
fungi | ah, yeah | 18:48 |
fungi | i did it in the second change but not the first | 18:48 |
clarkb | fungi: for the second change I was thinking more that we'd edit https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L199 (and the related builder config) to also point to a different profile name (rax-flex?) after we set max-servers to 0 and clean up the images in the old tenant | 18:49 |
clarkb | fungi: the reason for that is it would allow us to trivially add the dfw3 region while we sort through the sjc3 cleanup. If you do it the way you've proposed then we have to coordinate things more tightly (because suddenly sjc3 could stop working and/or orphan resources in the old tenant) | 18:50 |
fungi | can do | 18:51 |
fungi | just didn't want to lose the stats history in grafana | 18:51 |
clarkb | I don't think we will since only the credentials reference changes | 18:52 |
clarkb | fungi: the provider in nodepool remains the same we just tell it to use different credentials | 18:52 |
fungi | so keep the statsd prefix set the same? | 18:53 |
fungi | for both? | 18:53 |
clarkb | oh thats a clouds.yaml config hrm | 18:53 |
clarkb | ya I think so | 18:53 |
clarkb | in the case of dfw3 it will be scoped to that region and is fine. In the case of sjc3 we should be able to shut things down gracefully with old credentials then start things up again with new credentials and keep all the logical provider stuff the same including the statsd prefix | 18:54 |
fungi | other thing is this would be the first nodepool provider with a - in its name, while we've generally used - as the separator between the provider and region names | 18:54 |
clarkb | but only if we change the provider name? | 18:54 |
clarkb | I'm suggesting we only change the cloud: value | 18:54 |
fungi | ah, okay, yeah i suppose it wouldn't be consistent but would be good enough | 18:55 |
clarkb | though now that you mention it I'm not sure where we get the values for say mirror name construction maybe those are based on the clouds.yaml profile name? | 18:55 |
clarkb | I'm starting to feel like we've tried this before and it didn't work due to something like ^ | 18:56 |
clarkb | Another option would be to just start over entirely and orphan the existing grafana data | 18:56 |
clarkb | similar to when linaro changed names a couple of times. Maybe that is simplest | 18:57 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add tracing02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 | 18:57 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 18:58 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 18:58 |
fungi | we can debate the nodepool change as we work on the earlier steps | 18:59 |
fungi | also all the new private hostvars in those changes have been added on bridge with their correct values | 19:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 19:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 19:08 |
clarkb | forgot the depends on. Important in this case | 19:08 |
clarkb | fungi: if we take your initially proposed approach we would need to shutdown raxflex entirely first. Then bring it back up again in sjc3 and dfw3. That is probably the cleanest approach from a historical record keeping process but requires more coordinated effort and loss of ~32 test nodes while we work through it | 19:14 |
clarkb | I think I'm ok with that because there are also fewer questions about how to work through that process. We could end up with more work than anticipated cleaning up issues with a less careful approach that allows us to bring up dfw3 early | 19:15 |
clarkb | but before we get that far we can bring up sjc3 and dfw3 via cloud launcher, upload noble image, and spin up mirrors | 19:15 |
clarkb | then decide how we want to transition nodepool | 19:15 |
Clark[m] | The gerritforge Livestream on YouTube is about to start | 19:57 |
fungi | oh. also we can probably forego the extra network creation and floating-ip stuff if we like | 19:58 |
fungi | certainly for the mirrors at least | 19:58 |
Clark[m] | Due to direct attachment to the public net? | 20:00 |
fungi | yeah | 20:01 |
fungi | should i add the cloud-launcher config into 942230 in that case? | 20:05 |
fungi | or as a separate change? | 20:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 20:07 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 20:07 |
fungi | combined it for now but can split it out if needed | 20:07 |
Clark[m] | Same change is probably fine we'll just have a broken launcher if the initial setup still doesn't work | 20:11 |
Clark[m] | Gerrit 3.12 will require java 21 | 20:11 |
Clark[m] | 3.12 will update the H2 version for caches which is a breaking change | 20:13 |
fungi | ubuntu noble has openjdk-21-jre, if we want it on debian we need to wait for trixie | 20:13 |
fungi | but by the time we're ready to upgrade to gerrit 3.12 i expect it'll be plenty ready | 20:14 |
fungi | my best guess is sometime around june/july for trixie release | 20:15 |
Clark[m] | Ya I think it will be fine | 20:16 |
Clark[m] | In theory we upgrade to 3.11 on Java 17. Then update our images to Java 21 for 3.11 and 3.12 then upgrade to 3.12 | 20:20 |
Clark[m] | And if we stick to our existing timeline that will occur at the end of 2025 | 20:20 |
fungi | sounds about right | 20:29 |
Clark[m] | Gerrit 3.13 will formalize the ability to run one Gerrit server for the UI and a different headless Gerrit for the REST API and git protocols. This would allow you to tune and scale their JVMs separately | 20:34 |
fungi | oh neat. maybe we could scale down our gerrit(s) then | 20:36 |
fungi | down and out, that is | 20:36 |
fungi | granted, we're only really using half the ram on our current 128gb vm | 20:37 |
fungi | a quarter is active and a quarter is buffers/cache | 20:38 |
Clark[m] | Luca is talking about Gerrit 4 possibilities. One idea is to decouple the UI from the backend more so that you can build different code review systems on it or just use it as a git server | 20:40 |
Clark[m] | Support for PR like reviews (reviews of branches rather than specific commits) | 20:42 |
Clark[m] | Which he points out is technically possible through merge commit reviews but the UI isn't really useful in this capacity | 20:42 |
Clark[m] | He wants to see llm integration make it into core plugins rather than external plugins | 20:44 |
JayF | I wonder if that would help enable any potential future federation a la https://gitlab.com/gitlab-org/gitlab/-/issues/6468 | 20:53 |
JayF | Obviously that is not necessarily cross project yet or even exists at all yet, but it's nice to think about the possibility | 20:54 |
Clark[m] | It seems like it would be a prereq to federate with PR systems but figuring out federation with Gerrit first seems like a baseline need. That said I feel like zuul really addresses much of what people want out of federation | 21:04 |
Clark[m] | When I write bugfixes for Gerrit I push them upstream then downstream I set a depends on, rebuild our images, and test in opendev that our problem goes away | 21:05 |
Clark[m] | There is no formal federation but zuul talks to both and problem solved | 21:05 |
Clark[m] | And that works for Gerrit and GitHub and gitlab etc today | 21:06 |
clarkb | fungi: I have a question on https://review.opendev.org/c/opendev/system-config/+/942230 | 21:15 |
clarkb | the gerrit thing was informative. It seems like a lot of the interest/focus within the gerrit community is building a system that works well for enterprise software development in large companies with lots of large git repos. Not necessarily a bad thing for us but I personally think it would be neat if more effort went into the process of optimizing code review itself | 21:16 |
clarkb | and then tracing seems to be happy with its two changes https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 https://review.opendev.org/c/opendev/system-config/+/942235 if you have a moment | 21:17 |
clarkb | fungi: I'm happy to move forward with 942230 if that was intentional but didn't want it to get lost if it was an oversight | 21:20 |
fungi | clarkb: thanks for catching that, i meant to do both of course. fix incoming | 21:26 |
clarkb | oh and when we boot the mirrors we should check mtus | 21:26 |
fungi | yep | 21:26 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 21:27 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 21:27 |
fungi | clarkb: fwiw, the server instance i booted in my personal account in sjc3 is just attached directly to publicnet and has a 1500 byte mtu on its ens3 interface already | 21:28 |
fungi | so should be fine | 21:29 |
clarkb | perfect | 21:29 |
clarkb | fungi: I +2'd the first change and I think you can approve it when secret vars are in place for it | 21:30 |
fungi | they already are, were even before i pushed the initial patchset | 21:30 |
clarkb | extra perfect | 21:30 |
fungi | did those first thing | 21:30 |
clarkb | I did confirm that mirror_fqdn includes nodepool.cloud in it | 21:31 |
clarkb | that means we would have to have mirror.sjc3.rax-flex.opendev.org instead of mirror.sjc3.raxflex.opendev.org | 21:31 |
opendevreview | Merged opendev/zone-opendev.org master: Add tracing02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 | 21:31 |
clarkb | an alternative would be to do what you originally proposed and simply shut things down before turning anything new on | 21:32 |
fungi | yeah, i'm leaning toward that | 21:32 |
fungi | and keeping the old cloud name | 21:32 |
clarkb | wfm | 21:32 |
clarkb | fungi: I made a note on https://review.opendev.org/c/opendev/system-config/+/942231/ that we should have a change in project-config that cleans up the existing sjc3 resources in nodepool. Maybe it should be two changes. One to set max-servers to 0 then another to clean up all images in that cloud | 21:34 |
clarkb | that way we can launch the new clouds with the first change, spin up new mirrors, land the cleanup changes I just described ^ there and then land 942231 and spin up new sjc3 and dfw | 21:34 |
clarkb | oh also we can switch the mirror in sjc3 over to the new mirror before we shut things down if we end up keeping things up for some reason (its the same region just a different tenant which is no different than how we normally do things) | 21:35 |
fungi | yeah, sounds right | 21:35 |
clarkb | for noble image uploads if we can't download our existing image from glance then we may just need to go with whatever the latest image is. Looks like the vhd file is the only one on bridge anymore (due to disk constraints) and I'm not sure we can reliably convert a vhd back to a raw/qcow2 | 21:41 |
clarkb | but that should be fine. Maybe even preferable if it reduces the total number of packages we have to update when we launch new nodes | 21:41 |
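For the conversion in question, qemu-img's "vpc" driver reads VHD images, so mechanically the round trip would look like the command below (filenames are placeholders; whether Rackspace's VHD output converts back reliably was the open question above):

```shell
# Hypothetical command sketch only; not executed here because it needs a
# real VHD image on disk:
cmd='qemu-img convert -f vpc -O qcow2 ubuntu-noble.vhd ubuntu-noble.qcow2'
echo "$cmd"
```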
opendevreview | Merged opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 22:05 |
clarkb | that change is finally deploying now but the tracing job is near the end so may still be a while. I'm keeping an eye on it | 22:27 |
opendevreview | Merged opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 22:39 |
JayF | clarkb: for sure, gerrit<>gerrit would have to go first, but zuul covers zero of the use case I was thinking of -- my brain is always geared to "how to avoid lock-in", and getting all the various "forge" systems to collaborate is a potential path to get there | 22:42 |
clarkb | deployment failed because the base job failed because tracing02 was unreachable | 22:44 |
clarkb | I am able to reach it from my local system. Now to try from bridge | 22:45 |
clarkb | ssh worked from bridge too. Not sure why it failed | 22:45 |
clarkb | oh hrm it says host key verification failed. But I was able to ssh to it without doing anything with host keys. I wonder if that is a race between updating known hosts and trying to ssh to it? Bootstrap bridge must do the ssh key setup and base can run concurrently maybe? The -base run for 942230 should be a good indicator if this is still a problem. If not I can probably wait for daily jobs this evening | 22:47 |
clarkb | JayF: ok sorry, wanted to debug that problem. Git is already inherently distributed; you can pretty trivially avoid lock-in by taking your git repo from one forge to another | 22:48 |
clarkb | I think the real lock in problems are with all of the tooling surrounding a specific forge and federation doesn't help prevent lock in there | 22:48 |
clarkb | fungi: ya the run for 942230 managed to connect to tracing02 | 22:49 |
JayF | I mean, your comment is true in the most direct sense; but ignores the cost of retraining and migration. However, if you had something like a common PR-style gitlab/github flow that could federate, it gives companies an option to maintain existing workflows generally but migrate backends, moving things internally to another vendor. | 22:49 |
JayF | You are correct, however, in noting that ^^^ has a lot of "not-source-code" stuff rolled into it, like issue tracking and so on. | 22:49 |
clarkb | I think from my perspective lock in has to do with problems that federation doesn't solve | 22:50 |
clarkb | what federation theoretically solves is making it easy for me to go review a PR in one forge without creating new accounts or doing any extra work to bootstrap myself in that system | 22:50 |
JayF | clarkb: I guess I'm envisioning a world where, in the same way you can view a pixelfed post in mastodon, someone being able to use different UI/workflows to interact cross-forge. I do think you're right that federation /will not/ solve this problem, simply because I think incentives are misaligned for that ecosystem to embrace true mobility. | 22:51 |
clarkb | infra-root: bootstrap-bridge is a soft dependency of infra-prod-base. bootstrap-bridge runs the known hosts update. It did so before the base playbook ran according to zuul log timestamps. The task for that reported ok against 942233 which merged with the inventory update. The bootstrap job for 924420 reports changed. It's almost like we ran with the wrong git content | 22:55 |
clarkb | corvus: ^ that might be interesting to you from a "is zuul using the correct git state" perspective. | 22:55 |
clarkb | https://zuul.opendev.org/t/openstack/build/2f7584dccc0c40b689bd74cbae6dbfde/log/job-output.txt#269-270 where I expected it to change. Where we tried to use the updated value and failed: https://zuul.opendev.org/t/openstack/build/2af6e7d514e348f497a9458f5e0ded84/log/job-output.txt#132 And finally where it appears to have updated in the followup change: | 22:56 |
clarkb | https://zuul.opendev.org/t/openstack/build/76e874edc9fa4f94ae1f82af2332b50d/log/job-output.txt#269-270 | 22:56 |
clarkb | JayF: in the mastodon example you still have to edit the account from the hosting location right? I guess even in those examples you're still only doing high level communication over the top of the actual content | 22:57 |
JayF | clarkb: tbh my mental model of this was always "git handles the code federation bits" and that the communication about the code (e.g. merge requests and related feedback) would be the parts that need federation. | 22:58 |
JayF | but you're right it leads to an explosion of complexity when you consider caching and display on a frontend | 22:58 |
JayF | but let a man dream :D | 22:58 |
clarkb | looks like we load the inventory hosts.yaml file off of disk on bridge then use that to emit the known hosts. I'm not seeing where we update system-config before trying to update known hosts which would explain the problem. However, last week I didn't have any issues like this. And I'm pretty sure I did similar updates of just adding the node to inventory and letting it run | 23:01 |
clarkb | ya the base job runs the synchronize src repos to workspace directory tasks | 23:04 |
clarkb | which would update system-config but that doesn't appear to happen in bootstrap bridge. So how did this ever work before? | 23:04 |
clarkb | have we gotten lucky with hourly jobs running first which would update system-config, then we run the jobs for a specific deployment? | 23:05 |
clarkb | that would be one mechanism that would allow this to work I think | 23:05 |
clarkb | ianw: ^ if you happen to be around I'd be curious if you have any ideas as I think you set this up | 23:06 |
clarkb | https://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt is the job from when we added codesearch02. It too updated known hosts, but the change landed at 23:55 and the job ran at ~23:57, well after any hourly jobs would've run to update system-config for us | 23:11 |
clarkb | I think that rules out that possibility as the source for things working sometimes | 23:11 |
clarkb | ok I think I may have figured it out | 23:14 |
clarkb | whoever added build timelines to buildset info pages has my gratitude | 23:14 |
clarkb | infra-prod-service-gitea-lb ran concurrently with infra-prod-bootstrap-bridge when codesearch02 was added | 23:15 |
clarkb | https://zuul.opendev.org/t/openstack/build/d5043d5b453148f7b52b8158503ee457/log/job-output.txt#97-98 ran then https://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt#260-261 ran so it is a race | 23:16 |
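The race confirmed above boils down to a simple ordering problem: known_hosts is rendered from the hosts.yaml on disk on bridge, so if the repo sync lands after the known_hosts generation, the freshly added host is missing and a later ssh to it fails host key verification. A minimal model of that ordering (the inventory shape and helper name here are illustrative, not the actual system-config code):

```python
def emit_known_hosts(inventory):
    # Render one known_hosts-style line per inventory host.
    return {h["name"]: f'{h["name"]} {h["hostkey"]}' for h in inventory["hosts"]}

# On-disk checkout before "synchronize src repos to workspace" runs:
stale = {"hosts": [{"name": "bridge01", "hostkey": "ssh-ed25519 AAAA..."}]}
# Checkout after the inventory-adding change is synced:
fresh = {"hosts": stale["hosts"] + [{"name": "tracing02", "hostkey": "ssh-ed25519 BBBB..."}]}

# Known hosts generated from the stale checkout omit the new host,
# so the base playbook's ssh to it fails host key verification.
known = emit_known_hosts(stale)
missing_before_sync = "tracing02" not in known

# Generating after the sync picks the host up, which is why a later
# buildset (or the daily run) succeeds.
known = emit_known_hosts(fresh)
present_after_sync = "tracing02" in known
```

This is why the fix discussed below centers on forcing the repo sync to happen before the known_hosts update rather than racing with it.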
clarkb | now to figure out if we can safely fix this :/ | 23:18 |
clarkb | fungi: looks like the cloud launcher failed | 23:18 |
clarkb | I think this bug has subtly been hiding here since ianw refactored things to bootstrap the bridge ansible using zuul ansible | 23:25 |
clarkb | or maybe since known hosts addition was added if that is newer | 23:26 |
clarkb | because we need the git repos to be up to date to update known hosts | 23:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos https://review.opendev.org/c/opendev/system-config/+/942307 | 23:44 |
clarkb | ianw infra-root ^ I've tried to capture all that I've learned in that change. I suspect this is safe with all the extra belts and suspenders I added, but this probably deserves careful review | 23:45 |
clarkb | basically infra-prod-bootstrap-bridge should also synchronize the repos because it directly depends on that content being up to date. Then if we ever refactor things to run concurrently only that job will update git repos for us | 23:46 |
clarkb | everything else should depend on infra-prod-base which depends on infra-prod-bootstrap-bridge ensuring the git repos are in place for the current run | 23:47 |
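The reparenting described above would look roughly like the following Zuul job configuration. This is a hedged sketch, not the contents of 942307: the parent job name is hypothetical, and the actual change should be consulted for the real job names and the soft/hard dependency handling.

```yaml
# Hypothetical sketch of the fix in 942307: give the bootstrap job a
# parent whose pre-run playbook performs the "synchronize src repos to
# workspace" step, so the known_hosts update always sees the
# just-merged inventory instead of racing with a sibling job's sync.
- job:
    name: infra-prod-bootstrap-bridge
    parent: infra-prod-setup-src   # illustrative name, not the real job
```

Under this shape, every other infra-prod job inheriting (directly or via infra-prod-base) from a job that has already synced the repos is what removes the race.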
ianw | looking :) | 23:49 |
clarkb | I think another followup we could do is switch all the other infra-prod jobs including infra-prod-base to use the key-only update parent job. But if we do that we need to make the dependency on infra-prod-bootstrap-bridge a hard dependency (it is soft right now) and drop the file matchers in infra-prod-bootstrap-bridge to ensure it always runs to set up the git repos | 23:49 |
clarkb | ianw: thanks! | 23:49 |
clarkb | also I think digging into that melted my brain a little bit, so don't feel bad if it's a review that takes time to get through and maybe multiple passes | 23:50 |
ianw | trying to get these in parallel was a bit mind bending at the best of times | 23:56 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!