Wednesday, 2025-02-19

cloudnulljust checked https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 - looks happy :D00:11
* cloudnull happy 00:12
fungiyep, thanks bunches cloudnull!00:35
cloudnullsorry for the kerfuffle 01:28
opendevreviewThierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time  https://review.opendev.org/c/opendev/irc-meetings/+/94217210:07
opendevreviewThierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time  https://review.opendev.org/c/opendev/irc-meetings/+/94217212:49
fungittx: see inline comment on ^ in case you want to adjust before it's approved12:55
ttxfungi: yeah, it's actually irrelevant (the next date is generated from this date, but it does not have to be the first occurrence)13:07
ttxI tried to be more specific but then it would set the first date to April instead of March13:07
ttxprobably a bug somewhere in there13:08
fungineat13:12
opendevreviewMerged opendev/irc-meetings master: Move Large Scale SIG meeting time  https://review.opendev.org/c/opendev/irc-meetings/+/94217213:16
opendevreviewDoug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP  https://review.opendev.org/c/openstack/diskimage-builder/+/94221515:56
clarkbfungi: not sure if the hack in https://review.opendev.org/c/opendev/system-config/+/942155 is something you've seen before? In any case landing that then testing access to dfw3 seems like a good next step16:00
clarkband also https://review.opendev.org/c/opendev/system-config/+/941997 to make spinning up new servers quicker16:00
clarkband then we should also proceed with landing bindep changes and looking at mailman log rotation?16:04
opendevreviewDoug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP  https://review.opendev.org/c/openstack/diskimage-builder/+/94221516:07
cardoeinteresting... my diskimage-builder change comes here.. I was trying to find the team that would be responsible for it.16:13
clarkbthere is a dedicated dib team with a #openstack-dib channel but it overlaps a bit with us as we've relied on it heavily since tripleo half put it to pasture16:15
clarkbcardoe: I'm curious why you wouldn't do the inverse in that change and just let systemd-networkd dhcp all interfaces if that is the goal, and just not set up the manual configuration in that case16:15
cardoeI'd be fine with that too. I actually put that in the bug report.16:19
cardoeI just don't know how to prevent the dib/element from doing anything in that case.16:19
cardoethe ironic-python-agent element pulls in dhcp-all-interfaces. I'm just not knowledgeable enough to make it conditional in that case.16:20
clarkbdiskimage_builder/elements/dhcp-all-interfaces/install.d/50-dhcp-all-interfaces seems to do the setup and already has an exclusion rule for gentoo I think because gentoo uses systemd-networkd16:21
clarkbyou might want to do something in the if init system == systemd block that checks if systemd-networkd is enabled and if so noop16:21
clarkbbut I'm not super familiar with that element. We use simple-init with glean then let config drive contents determine interface setup16:22
clarkbsounds like the gerrit meets presentation from luca about the gerrit 2025 roadmap will be streamed here: https://www.youtube.com/gerritforgetv16:24
clarkbthat occurs at 19:45 UTC if I've translated timezones properly16:25
mordred<clarkb> "I think 942155 has the hack that..." <- looks reasonable to me16:26
clarkbmordred: you've got a comment in openstacksdk indicating that a better way should be written but as far as I can tell that hasn't happened yet :)16:29
fungiclarkb: commit message on 942155 is a bit confusing, but the solution looks fine to me, i agree it matches the internal sdk profiles16:31
clarkbfungi: do you want me to edit the message? there are a few typos and words that are the wrong term16:33
fungibasically that, after reading the change i think i understand what the commit message meant to say, so good enough16:33
clarkbI was rushing to get that up and also to the service coordinator thing before the EOD deadline16:34
fungimakes sense16:34
clarkbclearly my typing drivers don't work so great when rushing16:34
clarkbmirror-update is on the list of servers to replace. This server doesn't have any real disk space locally as it does everything with afs17:08
clarkbI think we should be able to boot a new mirror-update, hold all the lock files on the old server, then merge a change to deploy new mirror update as a mirror-updater. Then delete the old server once we're satisfied with the new one. Any concerns with using "hold all the locks" as the conflict resolution method?17:09
clarkbanother option would be to shutdown the old server entirely17:09
clarkbthat might be safer since it would avoid problems with spontaneous reboots17:09
clarkbcorvus: any idea what the transition of state for tracing from an old to new server is? Maybe we're ok with losing the old state?17:12
clarkbalso this reminds me I think we're still pinning that container image and its been on my todo list forever to try and debug that but deprioritized17:13
fungii guess we could put the old server in the emergency disable list, comment out all the long-running mirror cronjobs on it, wait for any of those to complete, comment out the static publishing vos release cronjob that runs every 5 minutes, make sure it's not in progress, then shut down the server and add the replacement to the inventory?17:15
fungibut yeah, the mess you can get into with interrupted afs writes makes it a little complicated17:16
clarkboh ya I guess before we shut it down we would need to settle things gracefully first17:16
clarkbso the shutdown approach implies holding all the locks first (and maybe disabling the vos release cron)17:17
clarkbI don't know if that vos release cron has a lock17:17
clarkbif it doesn't we should add one17:17
clarkbit does have a lockfile but it isn't in the cron command. It is embedded in the python script17:18
clarkbpublish-mirror-logs doesn't have a lock but I'm not sure if it needs one. Seems it just writes to the afs filesystem then we vos release it as part of that other script?17:19
clarkbso ya grab all locks, shutdown server, merge change for new server should be safe I think17:19
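[editor's note] The "grab all locks" step could look roughly like the sketch below. It assumes the mirror cron jobs serialize on flock-style advisory locks; the lock paths and the helper names hold_locks and is_locked are hypothetical and not taken from the actual mirror-update scripts.

```python
import fcntl


def hold_locks(paths):
    """Take an exclusive advisory lock on every path and keep the files open.

    Each flock call blocks until any job currently holding that lock is done,
    so returning from this function means no locked job is mid-run.
    """
    held = []
    for path in paths:
        handle = open(path, "a")  # creates the lock file if it is missing
        fcntl.flock(handle, fcntl.LOCK_EX)
        held.append(handle)
    return held


def is_locked(path):
    """Report whether some other open file currently holds the lock on path."""
    with open(path, "a") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(probe, fcntl.LOCK_UN)
        return False
```

The locks only last while the holding process keeps the files open, which is why a spontaneous reboot of the old server would silently release them.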
clarkbfor nodepool launchers we give each server a server specific config file. I think if we replace nl01 with nl05 the change to add nl05 to inventory would set max-servers to 0 on nl01 and let nl05 take over those providers. Then we delete nl01 and it goes away17:21
clarkbthose servers should be straightforward. Mirrors are straightforward too. I guess we can start with those two groups before worrying about mirror update and tracing17:22
clarkbreminder I'd like to get https://review.opendev.org/c/opendev/system-config/+/941997 in before adding any more new servers. Want at least one person to sanity check my assumptions about testing there17:22
clarkbbut then I can look at replacing the nl servers17:22
opendevreviewMerged opendev/system-config master: Add DFW3 to raxflex cloud profiles on bridge  https://review.opendev.org/c/opendev/system-config/+/94215517:23
clarkbthat is deploying right now (hourly jobs are already done)17:24
fungilooking at the mirror-update server, the only state not kept in afs seems to be in /var/log (afs-release, rsync-mirrors, reprepro), and it's probably not super critical we preserve those17:24
fungireprepro state databases are in afs17:24
clarkbfungi: cloudnull: after 942155 I am still able to run server list against sjc3 for both of our projects/tenants but doing so against dfw3 says authentication is required17:26
clarkbusing the --debug flag I can confirm that we are talking to http://keystone.api.dfw3.rackspacecloud.com/v3/ when using the dfw3 region17:27
clarkbis there additional account setup that is required to use the new region?17:27
fungii'm good with 941997, logic there makes sense17:28
clarkbI'm not sure what surgery was done with the project ids yesterday either. Maybe each region has different ids? (that would make my assumption we can auth with one clouds.yaml provider a bad one I think)17:28
clarkbfungi: thanks I'll go ahead and approve that now then. I just wanted at least one person to ensure that removing testing is appropriate in this case17:29
fungicloudnull mentioned via privmsg to me yesterday that the new project_id would be consistent across regions17:29
clarkbit is easy to add back in if necessary too17:29
fungifwiw, testing the dfw3 region with my personal rackspace account, openstackclient reports "The request you have made requires authentication. (HTTP 401)" even though i'm using the federated project_name/project_domain_name options instead of project_id in my own clouds.yaml17:34
fungisame config is working fine for me with sjc3 though17:34
clarkbso thats similar to what I see with our accounts. I wonder if we just need that region to be enabled?17:35
fungior if it's like the first time we used sjc3 where the api wouldn't authenticate until i logged into skyline at least once17:38
clarkbya though I thought they said it shouldn't work that way (instead there was some heuristic to opt projects into the regions and we got missed?)17:38
clarkbI dunno may be worth a shot17:38
fungii should be able to test that theory with my account in a few minutes17:38
clarkbthanks17:38
fungiand if it works, we can leave our opendev accounts in that state temporarily so raxfolx can look into it17:40
clarkb++17:40
opendevreviewMerged opendev/system-config master: Adjust LE role file matchers on system-config-run-* jobs  https://review.opendev.org/c/opendev/system-config/+/94199717:46
fungihuh, so with my personal account i seem to now have two tenants in sjc3, one under my NNNNNN_Flex project_name and the other under a new uuid-based project_name, the project_id of each of those is different17:48
fungibut in dfw3 i only have that new uuid-based project_name (and its project_id is consistent with sjc3)17:50
cloudnullclarkb accounts should be the same, are you able to login to skyline with the same credentials?17:50
fungicloudnull: we haven't tested skyline with our opendev accounts yet, i'm checking with my personal rackspace account first before i accidentally nudge a heisenbug or something17:51
corvuscloudnull: opentelemetry/jaeger tracing?  i say don't worry about it; i wouldn't bother trying to keep old data, and it'll be fine if the server just starts getting new data.17:51
corvusgah17:51
corvusclarkb: ^  sorry that was for you not cloudnull 17:52
clarkbcorvus: ack thanks for confirming17:52
cloudnullcan you auth with no defined project, run something like openstack project list, with no project-id/name defined, that should list out the available projects for the tenant. 17:54
fungicloudnull: looking at my personal account first, it looks like i now have two different tenants in sjc3 (one under my NNNNNN_Flex project_name which has a server instance in it, another with a uuid-based project_name that has no server instances). in dfw3 i only have one tenant (the same uuid-based project name as in sjc3, with a matching project_id from sjc3 too)17:54
cloudnullif that works, it should seed the environment with your account. 17:54
cloudnullfungi ++ those NNNNNN_Flex projects were an early account type, they'll continue to exist in SJC but going forward we're not creating them. 17:55
clarkbI thought project was a required auth parameter?17:55
clarkblike osc won't let you auth without it specified?17:55
cloudnullit should let you get an unscoped token17:55
fungicloudnull: thanks, that explains it. wrt my personal account i'll just boot a new instance in the newer tenant and delete the old one in that case17:56
clarkbdoes logging into skyline authenticate without a project?17:56
clarkbI guess that could explain the heisenbug behavior we think we've seen17:56
fungiclarkb: skyline seems to not ask for a project, right17:57
cloudnullit does, it first pulls an unscoped token 17:57
clarkbok so logging into skyline is probably the easiest way to do that as it doesn't involve copying or hacking up clouds.yaml files17:57
fungii'll go ahead and do it in that case since i've already got everything up in front of me17:57
clarkbfungi: are you able to use osc against dfw3 with your personal account now to confirm that generally seems to work?17:58
clarkbfungi: cool thanks17:58
cloudnullhttps://gisty.link/087b568c73dc635a676190e09c6b13baa45cadc9 this is my unscoped clouds.yaml17:59
fungiclarkb: i need to switch my clouds.yaml over to use the new tenant (which doesn't contain my existing server instance because that's in the old tenant), but looking at what's in skyline i expect it to just work17:59
cloudnullthe output https://gisty.link/b1fd7580d5d3cf2b3173cbf9521453dfa595c73018:01
fungii can confirm that when i comment the project out of my clouds.yaml `openstack project list` gives me a list of the projects my account has access to18:03
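[editor's note] A rough sketch of the "comment the project out" step, expressed as a transformation on an already-parsed clouds.yaml profile. The key names follow the usual clouds.yaml auth section; the helper name unscoped_profile is hypothetical, and dropping the project domain keys along with the project itself is an assumption rather than something stated in the channel.

```python
import copy

# Auth keys that scope a token to a project. Without them keystone can hand
# back an unscoped token, which is enough for `openstack project list`.
PROJECT_KEYS = (
    "project_id",
    "project_name",
    "project_domain_id",
    "project_domain_name",
)


def unscoped_profile(profile):
    """Return a copy of a clouds.yaml profile with project scoping removed."""
    result = copy.deepcopy(profile)
    auth = result.get("auth", {})
    for key in PROJECT_KEYS:
        auth.pop(key, None)
    return result
```

The original profile is left untouched, so the scoped and unscoped variants can live side by side in separate files.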
clarkbmaybe that is new but I thought you always had to have a project/tenant even if it was the default one. Or maybe I'm thinking of the whole domain madness18:03
clarkbya actually it might be domain because I remember at one time the default changed from "default" to "Default" or vice versa and then you had to get extra verbose about it even when using the default18:04
fungiyou can leave project_name and project_id out of your clouds.yaml and pass --os-project-name or --os-project-id on the command line instead18:04
fungijust tested that and it's working for me18:05
clarkbya there is a list of values that you can supply on the command line to override or supplement the clouds.yaml18:05
clarkb(that doesn't work for all options iirc)18:05
clarkbanyway let me know when I should test dfw3 again and I'll do that18:06
fungiwhat's the undocumented magic to tell osc to use a different clouds.yaml file?18:10
clarkbits something like OS_CLOUD_CONFIG_FILE=/path/here18:10
fungioh, right, there isn't a cli option18:11
clarkbOS_CLIENT_CONFIG_FILE18:11
fungiso the experiment was unsuccessful18:12
fungino, wait, it's still picking up a default project in my config somewhere18:12
fungiokay, that's working18:13
clarkbis this with your personal account?18:15
fungino, our opendev accounts now18:15
clarkbok those aren't working for me. Not sure what the difference is18:15
fungibecause i had already logged into dfw3 skyline with my personal account so it got synced up already18:16
clarkbthe server list commands I was running before continue to produce the same result of needing authentication18:16
fungiyeah, i see the issue18:20
fungilike my personal account does, our opendev accounts now have two projects in sjc3: an old one with an NNNNNN_Flex name and a new one with a uuid-based name18:21
fungithe old NNNNNN_Flex projects contain our mirror server and nodepool nodes18:21
fungithe new uuid-based projects are empty18:21
clarkband dfw3 only has the new one18:21
fungicorrect18:21
fungithe new uuid-based project has a consistent name and id in both sjc3 and dfw3, but is not the one we're currently using18:22
clarkbso maybe the "solution" here is to create a new raxflex clouds.yaml profile for both sjc3 and dfw3. Then use this as the motivation to rebuild sjc3 in the new project to fix the networking stack18:22
fungiso maybe this is our push to... yeah exactly18:22
clarkbthat might get a little complicated with nodepool configs but the clouds name is referenced by nodepool config so I think we can set max-servers to zero and image to empty list and let nodepool clean things up. switch the cloud name then revert the shutdown steps to have it rebuild?18:23
fungiwe'll want to build a new mirror too18:23
clarkbyes18:23
fungii guess we can just build new sjc3 and dfw3 mirror instances at the same time18:24
clarkbI think the process is add new clouds.yaml profiles for sjc3 and dfw3 using the new tenant/project. Manage those new tenants in cloud launcher. Upload noble images if necessary. Boot new mirrors. Add dfw3 to nodepool. Gracefully shutdown old sjc3. Bring up new sjc3 in nodepool18:24
clarkbthen clean up the old profiles and secrets data18:25
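[editor's note] The "set max-servers to zero and image to empty list" step mentioned above might be sketched as follows, operating on a parsed nodepool launcher config. The providers/pools/max-servers/diskimages layout is assumed from the usual openstack driver config; the helper name drain_provider is hypothetical, and any provider name passed in is only an example.

```python
import copy


def drain_provider(config, provider_name):
    """Return a copy of a nodepool config with one provider drained.

    Every pool of the named provider gets max-servers set to 0 so no new
    nodes are booted, and the provider's diskimages list is emptied so
    nodepool can delete the images it uploaded there.
    """
    drained = copy.deepcopy(config)
    for provider in drained.get("providers", []):
        if provider.get("name") != provider_name:
            continue
        for pool in provider.get("pools", []):
            pool["max-servers"] = 0
        provider["diskimages"] = []
    return drained
```

Reverting the drain is just a matter of restoring the original values, which matches the "revert the shutdown steps" idea in the discussion.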
clarkbfungi: did you want to push up the changes to start reorging clouds.yaml content since you seem to have a good handle on it or should I give it a go and you can tell me what needs editing?18:28
fungiyeah, just pulled the clouds.yaml template up in my editot18:28
fungieditor18:28
clarkbcool I'm ready to do my best to review the change(s)18:28
fungithe hardest part is going to be naming them, as always18:28
clarkbya the old one has the "good" name :)18:28
clarkbcould do opendevci-rax-flex instead of opendevci-raxflex then eventually we'll delete the opendevci-raxflex profile and it won't be confusing18:29
clarkbslight confusion until we reach that point18:29
fungisure, i dig it18:29
clarkbI've started booting a tracing02 server18:44
clarkbseems like it should be a straightforward swap and I'm all for getting more noble coverage from the easy things18:44
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects  https://review.opendev.org/c/opendev/system-config/+/94223018:44
opendevreviewJeremy Stanley proposed opendev/system-config master: Switch Nodepool to using a Rackspace Flex project  https://review.opendev.org/c/opendev/system-config/+/94223118:44
fungithe second change there is wip for now, more a placeholder to flesh out once we're ready18:45
clarkbfungi: changes to the clouds.yaml files require dummy hostvars (or maybe groupvars) data so that the file can be templated out successfully iirc18:47
clarkbif you do a git grep of the vars in that file you should find where the dummy values are set18:48
fungiah, yeah18:48
fungii did it in the second change but not the first18:48
clarkbfungi: for the second change I was thinking more that we'd edit https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L199 (and the related builder config) to also point to a different profile name (rax-flex?) after we set max-servers to 0 and clean up the images in the old tenant18:49
clarkbfungi: the reason for that is it would allow us to trivially add the dfw3 region while we sort through the sjc3 cleanup. If you do it the way you've proposed then we have to coordinate things more tightly (because suddenly sjc3 could stop working and/or orphan resources in the old tenant)18:50
fungican do18:51
fungijust didn't want to lose the stats history in grafana18:51
clarkbI don't think we will since only the credentials reference changes18:52
clarkbfungi: the provider in nodepool remains the same we just tell it to use different credentials18:52
fungiso keep the statsd prefix set the same?18:53
fungifor both?18:53
clarkboh thats a clouds.yaml config hrm18:53
clarkbya I think so18:53
clarkbin the case of dfw3 it will be scoped to that region and is fine. In the case of sjc3 we should be able to shut things down gracefully with old credentials then start things up again with new credentials and keep all the logical provider stuff the same including the statsd prefix18:54
fungiother thing is this would be the first nodepool provider with a - in its name, while we've generally used - as the separator between the provider and region names18:54
clarkbbut only if we change the provider name?18:54
clarkbI'm suggesting we only change the cloud: value18:54
fungiah, okay, yeah i suppose it wouldn't be consistent but would be good enough18:55
clarkbthough now that you mention it I'm not sure where we get the values for say mirror name construction maybe those are based on the clouds.yaml profile name?18:55
clarkbI'm starting to feel like we've tried this before and it didn't work due to something like ^18:56
clarkbAnother option would be to just start over entirely and orphan the existing grafana data18:56
clarkbsimilar to when linaro changed names a couple of times. Maybe that is simplest18:57
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add tracing02 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94223318:57
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects  https://review.opendev.org/c/opendev/system-config/+/94223018:58
opendevreviewJeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project  https://review.opendev.org/c/opendev/system-config/+/94223118:58
fungiwe can debate the nodepool change as we work on the earlier steps18:59
fungialso all the new private hostvars in those changes have been added on bridge with their correct values19:01
opendevreviewClark Boylan proposed opendev/system-config master: Add tracing02 to inventory  https://review.opendev.org/c/opendev/system-config/+/94223519:03
opendevreviewClark Boylan proposed opendev/system-config master: Add tracing02 to inventory  https://review.opendev.org/c/opendev/system-config/+/94223519:08
clarkbforgot the depends on. Important in this case19:08
clarkbfungi: if we take your initially proposed approach we would need to shutdown raxflex entirely first. Then bring it back up again in sjc3 and dfw3. That is probably the cleanest approach from a historical record keeping process but requires more coordinated effort and loss of ~32 test nodes while we work through it19:14
clarkbI think I'm ok with that because there are also fewer questions about how to work through that process. We could end up with more work than anticipated cleaning up issues with a less careful approach that allows us to bring up dfw3 early19:15
clarkbbut before we get that far we can bring up sjc3 and dfw3 via cloud launcher, upload noble image, and spin up mirrors19:15
clarkbthen decide how we want to transition nodepool19:15
Clark[m]The gerritforge Livestream on YouTube is about to start19:57
fungioh. also we can probably forego the extra network creation and floating-ip stuff if we like19:58
fungicertainly for the mirrors at least19:58
Clark[m]Due to direct attachment to the public net?20:00
fungiyeah20:01
fungishould i add the cloud-launcher config into 942230 in that case?20:05
fungior as a separate change?20:05
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects  https://review.opendev.org/c/opendev/system-config/+/94223020:07
opendevreviewJeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project  https://review.opendev.org/c/opendev/system-config/+/94223120:07
fungicombined it for now but can split it out if needed20:07
Clark[m]Same change is probably fine we'll just have broken launcher if the initial setup still doesn't work 20:11
Clark[m]Gerrit 3.12 will require java 2120:11
Clark[m]3.12 will update the H2 version for caches which is a breaking change20:13
fungiubuntu noble has openjdk-21-jre, if we want it on debian we need to wait for trixie20:13
fungibut by the time we're ready to upgrade to gerrit 3.12 i expect it'll be plenty ready20:14
fungimy best guess is sometime around june/july for trixie release20:15
Clark[m]Ya I think it will be fine20:16
Clark[m]In theory we upgrade to 3.11 on Java 17. Then update our images to Java 21 for 3.11 and 3.12 then upgrade to 3.12 20:20
Clark[m]And if we stick to our existing timeline that will occur at the end of 202520:20
fungisounds about right20:29
Clark[m]Gerrit 3.13 will formalize the ability to run a Gerrit server for the UI server and a different headless Gerrit for the rest API and git protocols. This would allow for you to tune and scale their jvms separately 20:34
fungioh neat. maybe we could scale down our gerrit(s) then20:36
fungidown and out, that is20:36
fungigranted, we're only really using half the ram on our current 128gb vm20:37
fungia quarter is active and a quarter is buffers/cache20:38
Clark[m]Luca is talking about Gerrit 4 possibilities. One idea is to decouple the UI from the backend more so that you can build different code review systems on it or just use it as a git server20:40
Clark[m]Support for PR like reviews (reviews of branches rather than specific commits)20:42
Clark[m]Which he points out is technically possible through merge commit reviews but the UI isn't really useful in this capacity 20:42
Clark[m]He wants to see llm integration make it into core plugins rather than external plugins20:44
JayFI wonder if that would help enable any potential future federation a la https://gitlab.com/gitlab-org/gitlab/-/issues/646820:53
JayFObviously that is not necessarily cross project yet or even exists at all yet, but it's nice to think about the possibility20:54
Clark[m]It seems like it would be a prereq to federate with PR systems but figuring out federation with Gerrit first seems like a baseline need. That said I feel like zuul really addresses much of what people want out of federation21:04
Clark[m]When I write bugfixes for Gerrit I push them upstream then downstream I set a depends on, rebuild our images, and test in opendev that our problem goes away21:05
Clark[m]There is no formal federation but zuul talks to both and problem solved21:05
Clark[m]And that works for Gerrit and GitHub and gitlab etc today21:06
clarkbfungi: I have a question on https://review.opendev.org/c/opendev/system-config/+/94223021:15
clarkbthe gerrit thing was informative. It seems like a lot of the interest/focus within the gerrit community is building a system that works well for enterprise software development in large companies with lots of large git repos. Not necessarily a bad thing for us but I personally think it would be neat if more effort went into the process of optimizing code review itself21:16
clarkband then tracing seems to be happy with its two changes https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 https://review.opendev.org/c/opendev/system-config/+/942235 if you have a moment21:17
clarkbfungi: I'm happy to move forward with 942230 if that was intentional but didn't want it to get lost if it was an oversight21:20
fungiclarkb: thanks for catching that, i meant to do both of course. fix incoming21:26
clarkboh and when we boot the mirrors we should check tmus21:26
clarkb*mtus21:26
fungiyep21:26
opendevreviewJeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects  https://review.opendev.org/c/opendev/system-config/+/94223021:27
opendevreviewJeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project  https://review.opendev.org/c/opendev/system-config/+/94223121:27
fungiclarkb: fwiw, the server instance i booted in my personal account in sjc3 is just attached directly to publicnet and has a 1500 byte mtu on its ens3 interface already21:28
fungiso should be fine21:29
clarkbperfect21:29
clarkbfungi: I +2'd the first change and I think you can approve it when secret vars are in place for it21:30
fungithey already are, were even before i pushed the initial patchset21:30
clarkbextra perfect21:30
fungidid those first thing21:30
clarkbI did confirm that mirror_fqdn includes nodepool.cloud in it21:31
clarkbthat means we would have to have mirror.sjc3.rax-flex.opendev.org instead of mirror.sjc3.raxflex.opendev.org21:31
opendevreviewMerged opendev/zone-opendev.org master: Add tracing02 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94223321:31
clarkban alternative would be to do what you originally proposed and simply shut things down before turning anything new on21:32
fungiyeah, i'm leaning toward that21:32
fungiand keeping the old cloud names21:32
funginame21:32
clarkbwfm21:32
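[editor's note] The naming concern above comes from the mirror hostname embedding the cloud name. A minimal sketch of that pattern, inferred only from the two hostnames quoted in the channel; the real mirror_fqdn template is not shown here, and the function signature and the lowercasing of the region are assumptions.

```python
def mirror_fqdn(region, cloud, domain="opendev.org"):
    """Build a mirror hostname of the form mirror.<region>.<cloud>.<domain>."""
    return f"mirror.{region.lower()}.{cloud}.{domain}"
```

With this shape, renaming the cloud from raxflex to rax-flex changes every mirror hostname, which is one reason keeping the old cloud name is the simpler path.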
clarkbfungi: I made a note on https://review.opendev.org/c/opendev/system-config/+/942231/ that we should have a change in project-config that cleans up the existing sjc3 resources in nodepool. Maybe it should be two changes. One to set max-servers to 0 then another to clean up all images in that cloud21:34
clarkbthat way we can launch the new clouds with the first change, spin up new mirrors, land the cleanup changes I just described ^ there and then land 942231 and spin up new sjc3 and dfw21:34
clarkboh also we can switch the mirror in sjc3 over to the new mirror before we shut things down if we end up keeping things up for some reason (its the same region just a different tenant which is no different than how we normally do things)21:35
fungiyeah, sounds right21:35
clarkbfor noble image uploads if we can't download our existing image from glance then we may just need to go with whatever the latest image is. Looks like the vhd file is the only one on bridge anymore (due to disk constraints) and I'm not sure we can reliably convert a vhd back to a raw/qcow221:41
clarkbbut that should be fine. Maybe even preferable if it reduces the total number of packages we have to update when we launch new nodes21:41
opendevreviewMerged opendev/system-config master: Add tracing02 to inventory  https://review.opendev.org/c/opendev/system-config/+/94223522:05
clarkbthat change is finally deploying now but the tracing job is near the end so may still be a while. I'm keeping an eye on it22:27
opendevreviewMerged opendev/system-config master: Add new Rackspace Flex projects  https://review.opendev.org/c/opendev/system-config/+/94223022:39
JayFclarkb: for sure, gerrit<>gerrit would have to go first, but zuul covers zero of the use case I was thinking of -- my brain is always geared to "how to avoid lock-in", and getting all the various "forge" systems to collaborate is a potential path to get there 22:42
clarkbdeployment failed because the base job failed because tracing02 was unreachable22:44
clarkbI am able to reach it from my local system. Now to try from bridge22:45
clarkbssh worked from bridge too. Not sure why it failed22:45
clarkboh hrm it says host key verification failed. But I was able to ssh to it without doing anything with host keys. I wonder if that is a race between updating known hosts and trying to ssh to it? Bootstrap bridge must do the ssh key setup and base can run concurrently maybe? The -base run for 942230 should be a good indicator if this is still a problem. If not I can probably wait for22:47
clarkbdaily jobs this evening22:47
clarkbJayF: ok sorry wanted to debug that problem. Git is already inherently distributed you can pretty trivially avoid lock in taking your git repo from one forge to another22:48
clarkbI think the real lock in problems are with all of the tooling surrounding a specific forge and federation doesn't help prevent lock in there22:48
clarkbfungi: ya the run for 942230 managed to connect to tracing0222:49
JayFI mean, your comment is true in the most direct sense; but ignores the cost of retraining and migration. However, if you had something like a common PR-style gitlab/github flow that could federate, it gives companies an option to maintain existing workflows generally but migrate backends, moving things internally to another vendor.22:49
JayFYou are correct, however, in noting that ^^^ has a lot of "not-source-code" stuff rolled into it, like issue tracking and so on.22:49
clarkbI think from my perspective lock in has to do with problems that federation doesn't solve22:50
clarkbwhat federation theoretically solves is making it easy for me to go review a PR in one forge without creating new accounts or doing any extra work to bootstrap myself in that system22:50
JayFclarkb: I guess I'm envisioning a world where, in the same way you can view a pixelfed post in mastodon, someone being able to use different UI/workflows to interact cross-forge. I do think you're right that federation /will not/ solve this problem, simply because I think incentives are misaligned for that ecosystem to embrace true mobility.22:51
clarkbinfra-root bootstrap-bridge is a soft dependency of the infra-prod-base. bootstrap-bridge runs the known hosts update. It did so before the base playbook ran according to zuul log timestamps. The task for that reported ok against 942233 which merged with the inventory update. The bootstrap job for 942230 reports changed. it's almost like we ran with the wrong git content22:55
clarkbcorvus: ^ that might be interesting to you from a "is zuul using the correct git state" perspective.22:55
clarkbhttps://zuul.opendev.org/t/openstack/build/2f7584dccc0c40b689bd74cbae6dbfde/log/job-output.txt#269-270 where I expected it to change. Where we tried to use the updated value and failed: https://zuul.opendev.org/t/openstack/build/2af6e7d514e348f497a9458f5e0ded84/log/job-output.txt#132  And finally where it appears to have updated in the followup change:22:56
clarkbhttps://zuul.opendev.org/t/openstack/build/76e874edc9fa4f94ae1f82af2332b50d/log/job-output.txt#269-27022:56
clarkbJayF: in the mastodon example you still have to edit the account from the hosting location right? I guess even in those examples you're still only doing high level communication over the top of the actual content22:57
JayFclarkb: tbh my mental model of this was always "git handles the code federation bits" and that the communication about the code (e.g. merge requests and related feedback) would be the parts that need federation.22:58
JayFbut you're right it leads to an explosion of complexity when you consider caching and display on a frontend22:58
JayFbut let a man dream :D 22:58
clarkblooks like we load the inventory hosts.yaml file off of disk on bridge then use that to emit the known hosts. I'm not seeing where we update system-config before trying to update known hosts which would explain the problem.  However, last week I didn't have any issues like this. And I'm pretty sure I did similar updates of just adding the node to inventory and letting it run23:01
clarkbya the base job runs the synchronize src repos to workspace directory tasks23:04
clarkbwhich would update system-config but that doesn't appear to happen in bootstrap bridge. So how did this ever work before?23:04
clarkbhave we gotten lucky with hourly jobs running first which would update system-config then we run the jobs for a specific deployment?23:05
clarkbthat would be one mechanism that would allow this to work I think23:05
clarkbianw: ^ if you happen to be around I'd be curious if you have any ideas as I think you set this up23:06
clarkbhttps://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt is the job from when we added codesearch02. It too updated known hosts but the change landed at 23:55 and the job ran at ~23:57 well after any hourly jobs would've run to update system-config for us23:11
clarkbI think that rules out that possibility as the source for things working sometimes23:11
clarkbok I think I may have figured it out23:14
clarkbwhoever added build timelines to buildset info pages has my gratitude23:14
clarkbinfra-prod-service-gitea-lb ran concurrently with infra-prod-bootstrap-bridge when codesearch02 was added23:15
clarkbhttps://zuul.opendev.org/t/openstack/build/d5043d5b453148f7b52b8158503ee457/log/job-output.txt#97-98 ran then https://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt#260-261 ran so it is a race23:16
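The race described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the function names and the inventory sets are hypothetical and are not taken from system-config. It shows that emitting known hosts from an on-disk checkout only picks up a newly added host if something else happened to synchronize the checkout first.

```python
# Hypothetical model of the race; none of these names come from system-config.

def emit_known_hosts(checkout_inventory):
    """Return the hosts that would be written to known_hosts from the on-disk checkout."""
    return sorted(checkout_inventory)


def run_deploy(merged_inventory, checkout_inventory, sync_ran_first):
    """Model one deployment: optionally sync the checkout, then emit known hosts."""
    if sync_ran_first:
        # Some other job synchronized the repos before the known hosts update ran.
        checkout_inventory = set(merged_inventory)
    return emit_known_hosts(checkout_inventory)


merged = {"codesearch02", "tracing02"}  # inventory as merged in git (illustrative)
stale = {"codesearch02"}                # checkout on disk before any sync (illustrative)

lucky = run_deploy(merged, stale, sync_ran_first=True)
unlucky = run_deploy(merged, stale, sync_ran_first=False)
print(lucky)
print(unlucky)
```

In this model the outcome depends only on whether a sync won the race, which matches the observation that the same kind of inventory update worked on one occasion and failed on another.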
clarkbnow to figure out if we can safely fix this :/23:18
clarkbfungi: looks like the cloud launcher failed23:18
clarkbI think this bug has subtly been hiding here since ianw refactored things to bootstrap the bridge ansible using zuul ansible23:25
clarkbor maybe since known hosts addition was added if that is newer23:26
clarkbbecause we need the git repos to be up to date to update known hosts23:26
opendevreviewClark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos  https://review.opendev.org/c/opendev/system-config/+/94230723:44
clarkbianw infra-root ^ I've tried to capture all that I've learned in that change. I suspect this is safe with all the extra belts and suspenders I added but this probably deserves careful review23:45
clarkbbasically infra-prod-bootstrap-bridge should also synchronize the repos because it directly depends on that content being up to date. Then if we ever refactor things to run concurrently only that job will update git repos for us23:46
clarkbeverything else should depend on infra-prod-base which depends on infra-prod-bootstrap-bridge ensuring the git repos are in place for the current run23:47
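The intended ordering can be sketched as a dependency graph. This is a minimal illustration, assuming a simplified graph with only the three job names mentioned in this discussion; the real Zuul configuration has many more jobs and this is not how Zuul itself computes ordering. It uses the standard library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each job maps to the set of jobs it depends on.
# Only the job names come from the discussion; the shape is a simplification.
deps = {
    "infra-prod-bootstrap-bridge": set(),                # syncs git repos, updates known hosts
    "infra-prod-base": {"infra-prod-bootstrap-bridge"},
    "infra-prod-service-gitea-lb": {"infra-prod-base"},  # stands in for "everything else"
}

order = list(TopologicalSorter(deps).static_order())
print(order)
```

With every other job hanging off infra-prod-base, the job that synchronizes the repos is always ordered first, so no later job can read stale git content.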
ianwlooking :)23:49
clarkbI think another followup we could do is switch all the other infra-prod jobs including infra-prod-base to use the key only update parent job. But if we do that we need to make the dependency to infra-prod-bootstrap-bridge a hard dependency (it is soft right now) and drop the file matchers in infra-prod-bootstrap-bridge to ensure it always runs to set up the git repos23:49
clarkbianw: thanks!23:49
clarkbalso I think digging into that melted my brain a little bit so don't feel bad if it's a review that takes time to get through and maybe multiple passes23:50
ianwtrying to get these in parallel was a bit mind bending at the best of times23:56
