cloudnull | just checked https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 - looks happy :D | 00:11 |
* cloudnull happy | 00:12 |
fungi | yep, thanks bunches cloudnull! | 00:35 |
cloudnull | sorry for the kerfuffle | 01:28 |
opendevreview | Thierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 10:07 |
opendevreview | Thierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 12:49 |
fungi | ttx: see inline comment on ^ in case you want to adjust before it's approved | 12:55 |
ttx | fungi: yeah, it's actually irrelevant (the next date is generated from this date, but it does not have to be the first occurrence) | 13:07 |
ttx | I tried to be more specific but then it would set the first date to April instead of March | 13:07 |
ttx | probably a bug somewhere in there | 13:08 |
fungi | neat | 13:12 |
opendevreview | Merged opendev/irc-meetings master: Move Large Scale SIG meeting time https://review.opendev.org/c/opendev/irc-meetings/+/942172 | 13:16 |
opendevreview | Doug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP https://review.opendev.org/c/openstack/diskimage-builder/+/942215 | 15:56 |
clarkb | fungi: not sure if the hack in https://review.opendev.org/c/opendev/system-config/+/942155 is something you've seen before? In any case landing that then testing access to dfw3 seems like a good next step | 16:00 |
clarkb | and also https://review.opendev.org/c/opendev/system-config/+/941997 to make spinning up new servers quicker | 16:00 |
clarkb | and then we should also proceed with landing bindep changes and looking at mailman log rotation? | 16:04 |
opendevreview | Doug Goldstein proposed openstack/diskimage-builder master: dhcp-all-interfaces: avoid systemd-networkd starting DHCP https://review.opendev.org/c/openstack/diskimage-builder/+/942215 | 16:07 |
cardoe | interesting... my diskimage-builder change comes here.. I was trying to find the team that would be responsible for it. | 16:13 |
clarkb | there is a dedicated dib team with a #openstack-dib channel but it overlaps a bit with us as we've relied on it heavily since tripleo half put it to pasture | 16:15 |
clarkb | cardoe: I'm curious why you wouldn't do the inverse in that change and just let systemd-networkd DHCP all interfaces if that is the goal, and just not set up the manual configuration for that | 16:15 |
cardoe | I'd be fine with that too. I actually put that in the bug report. | 16:19 |
cardoe | I just don't know how to prevent the dib/element from doing anything in that case. | 16:19 |
cardoe | the ironic-python-agent element pulls in dhcp-all-interfaces. I'm just not knowledgeable enough to make it conditional in that case. | 16:20 |
clarkb | diskimage_builder/elements/dhcp-all-interfaces/install.d/50-dhcp-all-interfaces seems to do the setup and already has an exclusion rule for gentoo I think because gentoo uses systemd-networkd | 16:21 |
clarkb | you might want to do something in the if init system == systemd block that checks if systemd-networkd is enabled and if so noop | 16:21 |
clarkb | but I'm not super familiar with that element. We use simple-init with glean then let config drive contents determine interface setup | 16:22 |
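A minimal sketch of the noop check suggested above, assuming the element's install script can simply exit early; this is illustrative logic, not the actual dhcp-all-interfaces code:

```shell
# Sketch (assumed logic): inside the "init system == systemd" block, skip
# the element's manual per-interface DHCP setup when systemd-networkd is
# already enabled and will DHCP the interfaces itself.
if [ "$(systemctl is-enabled systemd-networkd 2>/dev/null)" = "enabled" ]; then
    decision="skip"     # networkd manages interfaces; noop
else
    decision="install"  # fall back to the element's per-interface setup
fi
echo "dhcp-all-interfaces: $decision"
```

On a host without systemd the `systemctl` call fails quietly and the sketch falls through to the install path, which matches the element's current default behavior.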
clarkb | sounds like the gerrit meets presentation from luca about the gerrit 2025 roadmap will be streamed here: https://www.youtube.com/gerritforgetv | 16:24 |
clarkb | that occurs at 19:45 UTC if I've translated timezones properly | 16:25 |
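The timezone translation above can be double-checked with GNU date; the calendar date and the US Pacific zone below are placeholders (only the 19:45 UTC time comes from the log):

```shell
# Convert 19:45 UTC on a placeholder date into US Pacific wall-clock time
converted=$(TZ=America/Los_Angeles date -d '2025-02-20 19:45 UTC' '+%H:%M')
echo "$converted"   # 11:45 (Pacific is UTC-8 in February)
```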
mordred | <clarkb> "I think 942155 has the hack that..." <- looks reasonable to me | 16:26 |
clarkb | mordred: you've got a comment in openstacksdk indicating that a better way should be written but as far as I can tell that hasn't happened yet :) | 16:29 |
fungi | clarkb: commit message on 942155 is a bit confusing, but the solution looks fine to me, i agree it matches the internal sdk profiles | 16:31 |
clarkb | fungi: do you want me to edit the message? there are a few typos and words that are the wrong term | 16:33 |
fungi | basically that, after reading the change i think i understand what the commit message meant to say, so good enough | 16:33 |
clarkb | I was rushing to get that up and also to the service coordinator thing before the EOD deadline | 16:34 |
fungi | makes sense | 16:34 |
clarkb | clearly my typing drivers don't work so great when rushing | 16:34 |
clarkb | mirror-update is on the list of servers to replace. This server doesn't have any real disk space locally as it does everything with afs | 17:08 |
clarkb | I think we should be able to boot a new mirror-update, hold all the lock files on the old server, then merge a change to deploy new mirror update as a mirror-updater. Then delete the old server once we're satisfied with the new one. Any concerns with using "hold all the locks" as the conflict resolution method? | 17:09 |
clarkb | another option would be to shutdown the old server entirely | 17:09 |
clarkb | that might be safer since it would avoid problems with spontaneous reboots | 17:09 |
clarkb | corvus: any idea what the transition of state for tracing from an old to new server is? Maybe we're ok with losing the old state? | 17:12 |
clarkb | also this reminds me I think we're still pinning that container image and it's been on my todo list forever to try and debug that but deprioritized | 17:13 |
fungi | i guess we could put the old server in the emergency disable list, comment out all the long-running mirror cronjobs on it, wait for any of those to complete, comment out the static publishing vos release cronjob that runs every 5 minutes, make sure it's not in progress, then shut down the server and add the replacement to the inventory? | 17:15 |
fungi | but yeah, the mess you can get into with interrupted afs writes makes it a little complicated | 17:16 |
clarkb | oh ya I guess before we shut it down we would need to settle things gracefully first | 17:16 |
clarkb | so the shutdown approach implies holding all the locks first (and maybe disabling the vos release cron) | 17:17 |
clarkb | I don't know if that vos release cron has a lock | 17:17 |
clarkb | if it doesn't we should add one | 17:17 |
clarkb | it does have a lockfile but it isn't in the cron command. It is embedded in the python script | 17:18 |
clarkb | publish-mirror-logs doesn't have a lock but I'm not sure if it needs one. Seems it just writes to the afs filesystem then we vos release it as part of that other script? | 17:19 |
clarkb | so ya grab all locks, shutdown server, merge change for new server should be safe I think | 17:19 |
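A minimal demonstration of the "grab all locks" mechanism with util-linux `flock`: hold a lockfile from one process, then show a second non-blocking attempt is refused. The path is a stand-in for the real mirror-update lockfiles:

```shell
# Hold a lock in the background (stand-in for a long-held mirror lock)
lock=/tmp/demo-mirror-update.lock
flock "$lock" sleep 5 &
holder=$!
sleep 0.2   # give the background holder time to acquire the lock

# A cron job guarded the same way would skip its run here
if flock -n "$lock" true; then
    state="lock free"
else
    state="lock held"
fi
echo "$state"
kill "$holder" 2>/dev/null
```

The same `flock -n lockfile command` pattern is the usual way to guard a cron entry when the script itself doesn't embed its own lockfile.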
clarkb | for nodepool launchers we give each server a server specific config file. I think if we replace nl01 with nl05 the change to add nl05 to inventory would set max-servers to 0 on nl01 and let nl05 take over those providers. Then we delete nl01 and it goes away | 17:21 |
clarkb | those servers should be straightforward. Mirrors are straightforward too. I guess we can start with those two groups before worrying about mirror update and tracing | 17:22 |
clarkb | reminder I'd like to get https://review.opendev.org/c/opendev/system-config/+/941997 in before adding any more new servers. Want at least one person to sanity check my assumptions about testing there | 17:22 |
clarkb | but then I can look at replacing the nl servers | 17:22 |
opendevreview | Merged opendev/system-config master: Add DFW3 to raxflex cloud profiles on bridge https://review.opendev.org/c/opendev/system-config/+/942155 | 17:23 |
clarkb | that is deploying right now (hourly jobs are already done) | 17:24 |
fungi | looking at the mirror-update server, the only state not kept in afs seems to be in /var/log (afs-release, rsync-mirrors, reprepro), and it's probably not super critical we preserve those | 17:24 |
fungi | reprepro state databases are in afs | 17:24 |
clarkb | fungi: cloudnull: after 942155 I am still able to run server list against sjc3 for both of our projects/tenants but doing so against dfw3 says authentication is required | 17:26 |
clarkb | using the --debug flag I can confirm that we are talking to http://keystone.api.dfw3.rackspacecloud.com/v3/ when using the dfw3 region | 17:27 |
clarkb | is there additional account setup that is required to use the new region? | 17:27 |
fungi | i'm good with 941997, logic there makes sense | 17:28 |
clarkb | I'm not sure what surgery was done with the project ids yesterday either. Maybe each region has different ids? (that would make my assumption we can auth with one clouds.yaml provider a bad one I think) | 17:28 |
clarkb | fungi: thanks I'll go ahead and approve that now then. I just wanted at least one person to ensure that removing testing is appropriate in this case | 17:29 |
fungi | cloudnull mentioned via privmsg to me yesterday that the new project_id would be consistent across regions | 17:29 |
clarkb | it is easy to add back in if necessary too | 17:29 |
fungi | fwiw, testing the dfw3 region with my personal rackspace account, openstackclient reports "The request you have made requires authentication. (HTTP 401)" even though i'm using the federated project_name/project_domain_name options instead of project_id in my own clouds.yaml | 17:34 |
fungi | same config is working fine for me with sjc3 though | 17:34 |
clarkb | so thats similar to what I see with our accounts. I wonder if we just need that region to be enabled? | 17:35 |
fungi | or if it's like the first time we used sjc3 where the api wouldn't authenticate until i logged into skyline at least once | 17:38 |
clarkb | ya though I thought they said it shouldn't work that way (instead there was some heuristic to opt projects into the regions and we got missed?) | 17:38 |
clarkb | I dunno may be worth a shot | 17:38 |
fungi | i should be able to test that theory with my account in a few minutes | 17:38 |
clarkb | thanks | 17:38 |
fungi | and if it works, we can leave our opendev accounts in that state temporarily so raxfolx can look into it | 17:40 |
clarkb | ++ | 17:40 |
opendevreview | Merged opendev/system-config master: Adjust LE role file matchers on system-config-run-* jobs https://review.opendev.org/c/opendev/system-config/+/941997 | 17:46 |
fungi | huh, so with my personal account i seem to now have two tenants in sjc3, one under my NNNNNN_Flex project_name and the other under a new uuid-based project_name, the project_id of each of those is different | 17:48 |
fungi | but in dfw3 i only have that new uuid-based project_name (and its project_id is consistent with sjc3) | 17:50 |
cloudnull | clarkb accounts should be the same, are you able to login to skyline with the same credentials? | 17:50 |
fungi | cloudnull: we haven't tested skyline with our opendev accounts yet, i'm checking with my personal rackspace account first before i accidentally nudge a heisenbug or something | 17:51 |
corvus | cloudnull: opentelemetry/jaeger tracing? i say don't worry about it; i wouldn't bother trying to keep old data, and it'll be fine if the server just starts getting new data. | 17:51 |
corvus | gah | 17:51 |
corvus | clarkb: ^ sorry that was for you not cloudnull | 17:52 |
clarkb | corvus: ack thanks for confirming | 17:52 |
cloudnull | can you auth with no defined project, run something like openstack project list, with no project-id/name defined, that should list out the available projects for the tenant. | 17:54 |
fungi | cloudnull: looking at my personal account first, it looks like i now have two different tenants in sjc3 (one under my NNNNNN_Flex project_name which has a server instance in it, another with a uuid-based project_name that has no server instances). in dfw3 i only have one tenant (the same uuid-based project name as in sjc3, with a matching project_id from sjc3 too) | 17:54 |
cloudnull | if that works, it should seed the environment with your account. | 17:54 |
cloudnull | fungi ++ those NNNNNN_Flex projects were an early account type, they'll continue to exist in SJC but going forward we're not creating them. | 17:55 |
clarkb | I thought project was a required auth parameter? | 17:55 |
clarkb | like osc won't let you auth without it specified? | 17:55 |
cloudnull | it should let you get an unscoped token | 17:55 |
fungi | cloudnull: thanks, that explains it. wrt my personal account i'll just boot a new instance in the newer tenant and delete the old one in that case | 17:56 |
clarkb | does logging into skyline authenticate without a project? | 17:56 |
clarkb | I guess that could explain the heisenbug behavior we think we've seen | 17:56 |
fungi | clarkb: skyline seems to not ask for a project, right | 17:57 |
cloudnull | it does, it first pulls an unscoped token | 17:57 |
clarkb | ok so logging into skyline is probably the easiest way to do that as it doesn't involve copying or hacking up clouds.yaml files | 17:57 |
fungi | i'll go ahead and do it in that case since i've already got everything up in front of me | 17:57 |
clarkb | fungi: are you able to use osc against dfw3 with your personal account now to confirm that generally seems to work? | 17:58 |
clarkb | fungi: cool thanks | 17:58 |
cloudnull | https://gisty.link/087b568c73dc635a676190e09c6b13baa45cadc9 this is my unscoped clouds.yaml | 17:59 |
fungi | clarkb: i need to switch my clouds.yaml over to use the new tenant (which doesn't contain my existing server instance because that's in the old tenant), but looking at what's in skyline i expect it to just work | 17:59 |
cloudnull | the output https://gisty.link/b1fd7580d5d3cf2b3173cbf9521453dfa595c730 | 18:01 |
fungi | i can confirm that when i comment the project out of my clouds.yaml `openstack project list` gives me a list of the projects my account has access to | 18:03 |
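A sketch of the unscoped profile shape being tested here, modeled on cloudnull's gist; every value below is a placeholder, and the point is simply the absence of any project_* keys so keystone issues an unscoped token:

```shell
# Write a hypothetical unscoped clouds.yaml profile (placeholder values)
cat > /tmp/clouds-unscoped.yaml <<'EOF'
clouds:
  raxflex-unscoped:
    auth_type: v3password
    auth:
      auth_url: https://keystone.api.sjc3.rackspacecloud.com/v3/
      username: example-user
      password: example-secret
      user_domain_name: example-domain
      # intentionally no project_name / project_id, so auth is unscoped
EOF
# With that in place one would run (not executed here):
#   OS_CLIENT_CONFIG_FILE=/tmp/clouds-unscoped.yaml \
#     openstack --os-cloud raxflex-unscoped project list
test -s /tmp/clouds-unscoped.yaml && echo "unscoped profile written"
```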
clarkb | maybe that is new but I thought you always had to have a project/tenant even if it was the default one. Or maybe I'm thinking of the whole domain madness | 18:03 |
clarkb | ya actually it might be domain because I remember at one time the default changed from "default" to "Default" or vice versa and then you had to get extra verbose about it even when using the default | 18:04 |
fungi | you can leave project_name and project_id out of your clouds.yaml and pass --os-project-name or --os-project-id on the command line instead | 18:04 |
fungi | just tested that and it's working for me | 18:05 |
clarkb | ya there is a list of values that you can supply on the command line to override or supplement the clouds.yaml | 18:05 |
clarkb | (that doesn't work for all options iirc) | 18:05 |
clarkb | anyway let me know when I should test dfw3 again and I'll do that | 18:06 |
fungi | what's the undocumented magic to tell osc to use a different clouds.yaml file? | 18:10 |
clarkb | its something like OS_CLOUD_CONFIG_FILE=/path/here | 18:10 |
fungi | oh, right, there isn't a cli option | 18:11 |
clarkb | OS_CLIENT_CONFIG_FILE | 18:11 |
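For reference, openstacksdk's config loader honors that environment variable in place of a nonexistent CLI flag; path and cloud name below are placeholders:

```shell
# One-off override (not executed here; requires a real cloud):
#   OS_CLIENT_CONFIG_FILE=/tmp/alt-clouds.yaml openstack --os-cloud raxflex server list
# Demonstrate that the per-command env assignment reaches the child process:
shown=$(OS_CLIENT_CONFIG_FILE=/tmp/alt-clouds.yaml sh -c 'echo "$OS_CLIENT_CONFIG_FILE"')
echo "$shown"
```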
fungi | so the experiment was unsuccessful | 18:12 |
fungi | no, wait, it's still picking up a default project in my config somewhere | 18:12 |
fungi | okay, that's working | 18:13 |
clarkb | is this with your personal account? | 18:15 |
fungi | no, our opendev accounts now | 18:15 |
clarkb | ok those aren't working for me. Not sure what the difference is | 18:15 |
fungi | because i had already logged into dfw3 skyline with my personal account so it got synced up already | 18:16 |
clarkb | the server list commands I was running before continue to produce the same result of needing authentication | 18:16 |
fungi | yeah, i see the issue | 18:20 |
fungi | like my personal account does, our opendev accounts now have two projects in sjc3: an old one with an NNNNNN_Flex name and a new one with a uuid-based name | 18:21 |
fungi | the old NNNNNN_Flex projects contain our mirror server and nodepool nodes | 18:21 |
fungi | the new uuid-based projects are empty | 18:21 |
clarkb | and dfw3 only has the new one | 18:21 |
fungi | correct | 18:21 |
fungi | the new uuid-based project has a consistent name and id in both sjc3 and dfw3, but is not the one we're currently using | 18:22 |
clarkb | so maybe the "solution" here is to create a new raxflex clouds.yaml profile for both sjc3 and dfw3. Then use this as the motivation to rebuild sjc3 in the new project to fix the networking stack | 18:22 |
fungi | so maybe this is our push to... yeah exactly | 18:22 |
clarkb | that might get a little complicated with nodepool configs but the clouds name is referenced by nodepool config so I think we can set max-servers to zero and image to empty list and let nodepool clean things up. switch the cloud name then revert the shutdown steps to have it rebuild? | 18:23 |
fungi | we'll want to build a new mirror too | 18:23 |
clarkb | yes | 18:23 |
fungi | i guess we can just build new sjc3 and dfw3 mirror instances at the same time | 18:24 |
clarkb | I think the process is add new clouds.yaml profiles for sjc3 and dfw3 using the new tenant/project. Manage those new tenants in cloud launcher. Upload noble images if necessary. Boot new mirrors. Add dfw3 to nodepool. Gracefully shutdown old sjc3. Bring up new sjc3 in nodepool | 18:24 |
clarkb | then clean up the old profiles and secrets data | 18:25 |
clarkb | fungi: did you want to push up the changes to start reorging clouds.yaml content since you seem to have a good handle on it or should I give it a go and you can tell me what needs editing? | 18:28 |
fungi | yeah, just pulled the clouds.yaml template up in my editor | 18:28 |
clarkb | cool I'm ready to do my best to review the change(s) | 18:28 |
fungi | the hardest part is going to be naming them, as always | 18:28 |
clarkb | ya the old one has the "good" name :) | 18:28 |
clarkb | could do opendevci-rax-flex instead of opendevci-raxflex then eventually we'll delete the opendevci-raxflex profile and it won't be confusing | 18:29 |
clarkb | slight confusion until we reach that point | 18:29 |
fungi | sure, i dig it | 18:29 |
clarkb | I've started booting a tracing02 server | 18:44 |
clarkb | seems like it should be a straightforward swap and I'm all for getting more noble coverage from the easy things | 18:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 18:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to using a Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 18:44 |
fungi | the second change there is wip for now, more a placeholder to flesh out once we're ready | 18:45 |
clarkb | fungi: changes to the clouds.yaml files require dummy hostvars (or maybe groupvars) data so that the file can be templated out successfully iirc | 18:47 |
clarkb | if you do a git grep of the vars in that file you should find where the dummy values are set | 18:48 |
fungi | ah, yeah | 18:48 |
fungi | i did it in the second change but not the first | 18:48 |
clarkb | fungi: for the second change I was thinking more that we'd edit https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl01.opendev.org.yaml#L199 (and the related builder config) to also point to a different profile name (rax-flex?) after we set max-servers to 0 and clean up the images in the old tenant | 18:49 |
clarkb | fungi: the reason for that is it would allow us to trivially add the dfw3 region while we sort through the sjc3 cleanup. If you do it the way you've proposed then we have to coordinate things more tightly (because suddenly sjc3 could stop working and/or orphan resources in the old tenant) | 18:50 |
fungi | can do | 18:51 |
fungi | just didn't want to lose the stats history in grafana | 18:51 |
clarkb | I don't think we will since only the credentials reference changes | 18:52 |
clarkb | fungi: the provider in nodepool remains the same we just tell it to use different credentials | 18:52 |
fungi | so keep the statsd prefix set the same? | 18:53 |
fungi | for both? | 18:53 |
clarkb | oh thats a clouds.yaml config hrm | 18:53 |
clarkb | ya I think so | 18:53 |
clarkb | in the case of dfw3 it will be scoped to that region and is fine. In the case of sjc3 we should be able to shut things down gracefully with old credentials then start things up again with new credentials and keep all the logical provider stuff the same including the statsd prefix | 18:54 |
fungi | other thing is this would be the first nodepool provider with a - in its name, while we've generally used - as the separator between the provider and region names | 18:54 |
clarkb | but only if we change the provider name? | 18:54 |
clarkb | I'm suggesting we only change the cloud: value | 18:54 |
fungi | ah, okay, yeah i suppose it wouldn't be consistent but would be good enough | 18:55 |
clarkb | though now that you mention it I'm not sure where we get the values for say mirror name construction maybe those are based on the clouds.yaml profile name? | 18:55 |
clarkb | I'm starting to feel like we've tried this before and it didn't work due to something like ^ | 18:56 |
clarkb | Another option would be to just start over entirely and orphan the existing grafana data | 18:56 |
clarkb | similar to when linaro changed names a couple of times. Maybe that is simplest | 18:57 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add tracing02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 | 18:57 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 18:58 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 18:58 |
fungi | we can debate the nodepool change as we work on the earlier steps | 18:59 |
fungi | also all the new private hostvars in those changes have been added on bridge with their correct values | 19:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 19:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 19:08 |
clarkb | forgot the depends on. Important in this case | 19:08 |
clarkb | fungi: if we take your initially proposed approach we would need to shutdown raxflex entirely first. Then bring it back up again in sjc3 and dfw3. That is probably the cleanest approach from a historical record keeping process but requires more coordinated effort and loss of ~32 test nodes while we work through it | 19:14 |
clarkb | I think I'm ok with that because there are also fewer questions about how to work through that process. We could end up with more work than anticipated cleaning up issues with a less careful approach that allows us to bring up dfw3 early | 19:15 |
clarkb | but before we get that far we can bring up sjc3 and dfw3 via cloud launcher, upload noble image, and spin up mirrors | 19:15 |
clarkb | then decide how we want to transition nodepool | 19:15 |
Clark[m] | The gerritforge Livestream on YouTube is about to start | 19:57 |
fungi | oh. also we can probably forego the extra network creation and floating-ip stuff if we like | 19:58 |
fungi | certainly for the mirrors at least | 19:58 |
Clark[m] | Due to direct attachment to the public net? | 20:00 |
fungi | yeah | 20:01 |
fungi | should i add the cloud-launcher config into 942230 in that case? | 20:05 |
fungi | or as a separate change? | 20:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 20:07 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 20:07 |
fungi | combined it for now but can split it out if needed | 20:07 |
Clark[m] | Same change is probably fine we'll just have a broken launcher if the initial setup still doesn't work | 20:11 |
Clark[m] | Gerrit 3.12 will require java 21 | 20:11 |
Clark[m] | 3.12 will update the H2 version for caches which is a breaking change | 20:13 |
fungi | ubuntu noble has openjdk-21-jre, if we want it on debian we need to wait for trixie | 20:13 |
fungi | but by the time we're ready to upgrade to gerrit 3.12 i expect it'll be plenty ready | 20:14 |
fungi | my best guess is sometime around june/july for trixie release | 20:15 |
Clark[m] | Ya I think it will be fine | 20:16 |
Clark[m] | In theory we upgrade to 3.11 on Java 17. Then update our images to Java 21 for 3.11 and 3.12 then upgrade to 3.12 | 20:20 |
Clark[m] | And if we stick to our existing timeline that will occur at the end of 2025 | 20:20 |
fungi | sounds about right | 20:29 |
Clark[m] | Gerrit 3.13 will formalize the ability to run one Gerrit server for the UI and a different headless Gerrit for the REST API and git protocols. This would allow you to tune and scale their JVMs separately | 20:34 |
fungi | oh neat. maybe we could scale down our gerrit(s) then | 20:36 |
fungi | down and out, that is | 20:36 |
fungi | granted, we're only really using half the ram on our current 128gb vm | 20:37 |
fungi | a quarter is active and a quarter is buffers/cache | 20:38 |
Clark[m] | Luca is talking about Gerrit 4 possibilities. One idea is to decouple the UI from the backend more so that you can build different code review systems on it or just use it as a git server | 20:40 |
Clark[m] | Support for PR like reviews (reviews of branches rather than specific commits) | 20:42 |
Clark[m] | Which he points out is technically possible through merge commit reviews but the UI isn't really useful in this capacity | 20:42 |
Clark[m] | He wants to see llm integration make it into core plugins rather than external plugins | 20:44 |
JayF | I wonder if that would help enable any potential future federation a la https://gitlab.com/gitlab-org/gitlab/-/issues/6468 | 20:53 |
JayF | Obviously that is not necessarily cross project yet or even exists at all yet, but it's nice to think about the possibility | 20:54 |
Clark[m] | It seems like it would be a prereq to federate with PR systems but figuring out federation with Gerrit first seems like a baseline need. That said I feel like zuul really addresses much of what people want out of federation | 21:04 |
Clark[m] | When I write bugfixes for Gerrit I push them upstream then downstream I set a depends on, rebuild our images, and test in opendev that our problem goes away | 21:05 |
Clark[m] | There is no formal federation but zuul talks to both and problem solved | 21:05 |
Clark[m] | And that works for Gerrit and GitHub and gitlab etc today | 21:06 |
clarkb | fungi: I have a question on https://review.opendev.org/c/opendev/system-config/+/942230 | 21:15 |
clarkb | the gerrit thing was informative. It seems like a lot of the interest/focus within the gerrit community is building a system that works well for enterprise software development in large companies with lots of large git repos. Not necessarily a bad thing for us but I personally think it would be neat if more effort went into the process of optimizing code review itself | 21:16 |
clarkb | and then tracing seems to be happy with its two changes https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 https://review.opendev.org/c/opendev/system-config/+/942235 if you have a moment | 21:17 |
clarkb | fungi: I'm happy to move forward with 942230 if that was intentional but didn't want it to get lost if it was an oversight | 21:20 |
fungi | clarkb: thanks for catching that, i meant to do both of course. fix incoming | 21:26 |
clarkb | oh and when we boot the mirrors we should check mtus | 21:26 |
fungi | yep | 21:26 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 21:27 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch Nodepool to a new Rackspace Flex project https://review.opendev.org/c/opendev/system-config/+/942231 | 21:27 |
fungi | clarkb: fwiw, the server instance i booted in my personal account in sjc3 is just attached directly to publicnet and has a 1500 byte mtu on its ens3 interface already | 21:28 |
fungi | so should be fine | 21:29 |
clarkb | perfect | 21:29 |
clarkb | fungi: I +2'd the first change and I think you can approve it when secret vars are in place for it | 21:30 |
fungi | they already are, were even before i pushed the initial patchset | 21:30 |
clarkb | extra perfect | 21:30 |
fungi | did those first thing | 21:30 |
clarkb | I did confirm that mirror_fqdn includes nodepool.cloud in it | 21:31 |
clarkb | that means we would have to have mirror.sjc3.rax-flex.opendev.org instead of mirror.sjc3.raxflex.opendev.org | 21:31 |
opendevreview | Merged opendev/zone-opendev.org master: Add tracing02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/942233 | 21:31 |
clarkb | an alternative would be to do what you originally proposed and simply shut things down before turning anything new on | 21:32 |
fungi | yeah, i'm leaning toward that | 21:32 |
fungi | and keeping the old cloud name | 21:32 |
clarkb | wfm | 21:32 |
clarkb | fungi: I made a note on https://review.opendev.org/c/opendev/system-config/+/942231/ that we should have a change in project-config that cleans up the existing sjc3 resources in nodepool. Maybe it should be two changes. One to set max-servers to 0 then another to clean up all images in that cloud | 21:34 |
clarkb | that way we can launch the new clouds with the first change, spin up new mirrors, land the cleanup changes I just described ^ there and then land 942231 and spin up new sjc3 and dfw | 21:34 |
clarkb | oh also we can switch the mirror in sjc3 over to the new mirror before we shut things down if we end up keeping things up for some reason (its the same region just a different tenant which is no different than how we normally do things) | 21:35 |
fungi | yeah, sounds right | 21:35 |
clarkb | for noble image uploads if we can't download our existing image from glance then we may just need to go with whatever the latest image is. Looks like the vhd file is the only one on bridge anymore (due to disk constraints) and I'm not sure we can reliably convert a vhd back to a raw/qcow2 | 21:41 |
clarkb | but that should be fine. Maybe even preferable if it reduces the total number of packages we have to update when we launch new nodes | 21:41 |
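For the conversion in question, qemu-img's "vpc" driver reads VHD images, so mechanically the round trip would look like the command below (filenames are placeholders; whether Rackspace's VHD output converts back reliably was the open question above):

```shell
# Hypothetical command sketch only; not executed here because it needs a
# real VHD image on disk:
cmd='qemu-img convert -f vpc -O qcow2 ubuntu-noble.vhd ubuntu-noble.qcow2'
echo "$cmd"
```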
opendevreview | Merged opendev/system-config master: Add tracing02 to inventory https://review.opendev.org/c/opendev/system-config/+/942235 | 22:05 |
clarkb | that change is finally deploying now but the tracing job is near the end so may still be a while. I'm keeping an eye on it | 22:27 |
opendevreview | Merged opendev/system-config master: Add new Rackspace Flex projects https://review.opendev.org/c/opendev/system-config/+/942230 | 22:39 |
JayF | clarkb: for sure, gerrit<>gerrit would have to go first, but zuul covers zero of the use case I was thinking of -- my brain is always geared to "how to avoid lock-in", and getting all the various "forge" systems to collaborate is a potential path to get there | 22:42 |
clarkb | deployment failed because the base job failed because tracing02 was unreachable | 22:44 |
clarkb | I am able to reach it from my local system. Now to try from bridge | 22:45 |
clarkb | ssh worked from bridge too. Not sure why it failed | 22:45 |
clarkb | oh hrm it says host key verification failed. But I was able to ssh to it without doing anything with host keys. I wonder if that is a race between updating known hosts and trying to ssh to it? Bootstrap bridge must do the ssh key setup and base can run concurrently maybe? The -base run for 942230 should be a good indicator if this is still a problem. If not I can probably wait for daily jobs this evening | 22:47 |
clarkb | JayF: ok sorry, wanted to debug that problem. Git is already inherently distributed; you can pretty trivially avoid lock-in by taking your git repo from one forge to another | 22:48 |
clarkb | I think the real lock in problems are with all of the tooling surrounding a specific forge and federation doesn't help prevent lock in there | 22:48 |
clarkb | fungi: ya the run for 942230 managed to connect to tracing02 | 22:49 |
JayF | I mean, your comment is true in the most direct sense; but ignores the cost of retraining and migration. However, if you had something like a common PR-style gitlab/github flow that could federate, it gives companies an option to maintain existing workflows generally but migrate backends, moving things internally to another vendor. | 22:49 |
JayF | You are correct, however, in noting that ^^^ has a lot of "not-source-code" stuff rolled into it, like issue tracking and so on. | 22:49 |
clarkb | I think from my perspective lock in has to do with problems that federation doesn't solve | 22:50 |
clarkb | what federation theoretically solves is making it easy for me to go review a PR in one forge without creating new accounts or doing any extra work to bootstrap myself in that system | 22:50 |
JayF | clarkb: I guess I'm envisioning a world where, in the same way you can view a pixelfed post in mastodon, someone being able to use different UI/workflows to interact cross-forge. I do think you're right that federation /will not/ solve this problem, simply because I think incentives are misaligned for that ecosystem to embrace true mobility. | 22:51 |
clarkb | infra-root: bootstrap-bridge is a soft dependency of infra-prod-base. bootstrap-bridge runs the known hosts update. It did so before the base playbook ran according to zuul log timestamps. The task for that reported ok against 942233 which merged with the inventory update. The bootstrap job for 924420 reports changed. It's almost like we ran with the wrong git content | 22:55 |
clarkb | corvus: ^ that might be interesting to you from a "is zuul using the correct git state" perspective. | 22:55 |
clarkb | https://zuul.opendev.org/t/openstack/build/2f7584dccc0c40b689bd74cbae6dbfde/log/job-output.txt#269-270 where I expected it to change. Where we tried to use the updated value and failed: https://zuul.opendev.org/t/openstack/build/2af6e7d514e348f497a9458f5e0ded84/log/job-output.txt#132 And finally where it appears to have updated in the followup change: | 22:56 |
clarkb | https://zuul.opendev.org/t/openstack/build/76e874edc9fa4f94ae1f82af2332b50d/log/job-output.txt#269-270 | 22:56 |
clarkb | JayF: in the mastodon example you still have to edit the account from the hosting location right? I guess even in those examples you're still only doing high level communication over the top of the actual content | 22:57 |
JayF | clarkb: tbh my mental model of this was always "git handles the code federation bits" and that the communication about the code (e.g. merge requests and related feedback) would be the parts that need federation. | 22:58 |
JayF | but you're right it leads to an explosion of complexity when you consider caching and display on a frontend | 22:58 |
JayF | but let a man dream :D | 22:58 |
clarkb | looks like we load the inventory hosts.yaml file off of disk on bridge then use that to emit the known hosts. I'm not seeing where we update system-config before trying to update known hosts which would explain the problem. However, last week I didn't have any issues like this. And I'm pretty sure I did similar updates of just adding the node to inventory and letting it run | 23:01 |
clarkb | ya the base job runs the synchronize src repos to workspace directory tasks | 23:04 |
clarkb | which would update system-config but that doesn't appear to happen in bootstrap bridge. So how did this ever work before? | 23:04 |
clarkb | have we gotten lucky with hourly jobs running first which would update system-config, then we run the jobs for a specific deployment? | 23:05 |
clarkb | that would be one mechanism that would allow this to work I think | 23:05 |
clarkb | ianw: ^ if you happen to be around I'd be curious if you have any ideas as I think you set this up | 23:06 |
clarkb | https://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt is the job from when we added codesearch02. It too updated known hosts, but the change landed at 23:55 and the job ran at ~23:57, well after any hourly jobs would've run to update system-config for us | 23:11 |
clarkb | I think that rules out that possibility as the source for things working sometimes | 23:11 |
clarkb | ok I think I may have figured it out | 23:14 |
clarkb | whoever added build timelines to buildset info pages has my gratitude | 23:14 |
clarkb | infra-prod-service-gitea-lb ran concurrently with infra-prod-bootstrap-bridge when codesearch02 was added | 23:15 |
clarkb | https://zuul.opendev.org/t/openstack/build/d5043d5b453148f7b52b8158503ee457/log/job-output.txt#97-98 ran then https://zuul.opendev.org/t/openstack/build/108f4de4058246f9a71210365e8ce238/log/job-output.txt#260-261 ran so it is a race | 23:16 |
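The race confirmed above boils down to a simple ordering problem: known_hosts is rendered from the hosts.yaml on disk on bridge, so if the repo sync lands after the known_hosts generation, the freshly added host is missing and a later ssh to it fails host key verification. A minimal model of that ordering (the inventory shape and helper name here are illustrative, not the actual system-config code):

```python
def emit_known_hosts(inventory):
    # Render one known_hosts-style line per inventory host.
    return {h["name"]: f'{h["name"]} {h["hostkey"]}' for h in inventory["hosts"]}

# On-disk checkout before "synchronize src repos to workspace" runs:
stale = {"hosts": [{"name": "bridge01", "hostkey": "ssh-ed25519 AAAA..."}]}
# Checkout after the inventory-adding change is synced:
fresh = {"hosts": stale["hosts"] + [{"name": "tracing02", "hostkey": "ssh-ed25519 BBBB..."}]}

# Known hosts generated from the stale checkout omit the new host,
# so the base playbook's ssh to it fails host key verification.
known = emit_known_hosts(stale)
missing_before_sync = "tracing02" not in known

# Generating after the sync picks the host up, which is why a later
# buildset (or the daily run) succeeds.
known = emit_known_hosts(fresh)
present_after_sync = "tracing02" in known
```

This is why the fix discussed below centers on forcing the repo sync to happen before the known_hosts update rather than racing with it.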
clarkb | now to figure out if we can safely fix this :/ | 23:18 |
clarkb | fungi: looks like the cloud launcher failed | 23:18 |
clarkb | I think this bug has subtly been hiding here since ianw refactored things to bootstrap the bridge ansible using zuul ansible | 23:25 |
clarkb | or maybe since known hosts addition was added if that is newer | 23:26 |
clarkb | because we need the git repos to be up to date to update known hosts | 23:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos https://review.opendev.org/c/opendev/system-config/+/942307 | 23:44 |
clarkb | ianw infra-root ^ I've tried to capture all that I've learned in that change. I suspect this is safe with all the extra belts and suspenders I added, but this probably deserves careful review | 23:45 |
clarkb | basically infra-prod-bootstrap-bridge should also synchronize the repos because it directly depends on that content being up to date. Then if we ever refactor things to run concurrently only that job will update git repos for us | 23:46 |
clarkb | everything else should depend on infra-prod-base which depends on infra-prod-bootstrap-bridge ensuring the git repos are in place for the current run | 23:47 |
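The reparenting described above would look roughly like the following Zuul job configuration. This is a hedged sketch, not the contents of 942307: the parent job name is hypothetical, and the actual change should be consulted for the real job names and the soft/hard dependency handling.

```yaml
# Hypothetical sketch of the fix in 942307: give the bootstrap job a
# parent whose pre-run playbook performs the "synchronize src repos to
# workspace" step, so the known_hosts update always sees the
# just-merged inventory instead of racing with a sibling job's sync.
- job:
    name: infra-prod-bootstrap-bridge
    parent: infra-prod-setup-src   # illustrative name, not the real job
```

Under this shape, every other infra-prod job inheriting (directly or via infra-prod-base) from a job that has already synced the repos is what removes the race.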
ianw | looking :) | 23:49 |
clarkb | I think another followup we could do is switch all the other infra-prod jobs including infra-prod-base to use the key-only update parent job. But if we do that we need to make the dependency on infra-prod-bootstrap-bridge a hard dependency (it is soft right now) and drop the file matchers in infra-prod-bootstrap-bridge to ensure it always runs to set up the git repos | 23:49 |
clarkb | ianw: thanks! | 23:49 |
clarkb | also I think digging into that melted my brain a little bit, so don't feel bad if it's a review that takes time to get through and maybe multiple passes | 23:50 |
ianw | trying to get these in parallel was a bit mind bending at the best of times | 23:56 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!