opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:02 |
*** dviroel|rover|bbl is now known as dviroel|rover | 00:05 | |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:14 |
opendevreview | Merged opendev/system-config master: Rebuild gitea images under new golang release https://review.opendev.org/c/opendev/system-config/+/863176 | 00:15 |
clarkb | I expect that will start to deploy in about 20-25 minutes? I'll eat dinner then check in on it | 00:16 |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:25 |
opendevreview | Michael Kelly proposed zuul/zuul-jobs master: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/861799 | 00:31 |
*** dviroel|rover is now known as dviroel|rover|out | 00:38 | |
*** dviroel|rover|out is now known as dviroel|holiday | 00:38 | |
clarkb | gitea01 is done updating. Seems to work. There is a definite slowness in accessing repos as things start up but that seems to go away after a minute or two | 00:44 |
clarkb | ok all 8 are done and I've spot checked them and they all look happy to me | 01:01 |
clarkb | the job should finish momentarily and I expect it to succeed | 01:01 |
clarkb | success confirmed. I think we're good | 01:03 |
opendevreview | Jie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard https://review.opendev.org/c/openstack/project-config/+/863168 | 01:15 |
jieniu | Hi all, I'm trying to apply for a repo in opendev, the pipeline failed, is it because "-" should not be used in project names? | 02:10 |
jieniu | The following projects should be alphabetized: | 02:10 |
jieniu | + cat projects_list.diff | 02:10 |
jieniu | + grep -e '> ' | 02:10 |
jieniu | > computing-force-network/cfn-overview | 02:10 |
jieniu | > computing-force-network/computing-native | 02:10 |
jieniu | > computing-force-network/computing-offload | 02:10 |
jieniu | > computing-force-network/ubiquitous-computing-scheduling | 02:10 |
jieniu | > computing-force-network/use-case-and-architecture | 02:10 |
jieniu | + exit 1 | 02:10 |
ianw | we have dashes in names ... | 02:24 |
ianw | jieniu: the job is complaining because in https://review.opendev.org/c/openstack/project-config/+/863168/2/gerrit/projects.yaml the entries are at the end (out of alphabetical order) | 02:26 |
jieniu | ianw: so I need to insert these lines in alphabetical order instead of appending them to the end? | 03:03 |
ianw | jieniu: yes | 03:14 |
jieniu | thank you :) | 03:15 |
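For anyone hitting the same failure: the job sorts the extracted project list and diffs it against the original. A minimal sketch of that check, assuming the `- project: name` layout of gerrit/projects.yaml (the real job's script may differ in detail):

```shell
# Extract project names from gerrit/projects.yaml and diff against a
# sorted copy; any "> name" lines in the diff are entries that are out
# of alphabetical order (e.g. appended at the end of the file).
awk '/^- project: /{print $3}' gerrit/projects.yaml > projects_list
LC_ALL=C sort projects_list > projects_list.sorted
if ! diff projects_list projects_list.sorted > projects_list.diff; then
  echo "The following projects should be alphabetized:"
  grep -e '> ' projects_list.diff
  exit 1
fi
```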
*** yadnesh|away is now known as yadnesh | 04:40 | |
yadnesh | hello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/863039 | 05:39 |
*** ysandeep|out is now known as ysandeep | 05:40 | |
opendevreview | Jie Niu proposed openstack/project-config master: Apply cfn repository for code and storyboard https://review.opendev.org/c/openstack/project-config/+/863168 | 06:11 |
*** mnasiadka_ is now known as mnasiadka | 06:29 | |
jieniu | Hi, all | 06:32 |
jieniu | I submitted a change to apply for a repo in opendev and the CI pipeline failed. Could someone help me understand why this ACL config is not normalized, and how I should fix it? Much appreciated! | 06:32 |
jieniu | [submit] | 06:32 |
jieniu | -mergeContent = true | 06:32 |
jieniu | \ No newline at end of file | 06:32 |
jieniu | +mergeContent = true | 06:32 |
jieniu | Project /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/cfn-use-case-and-architecture.config is not normalized! | 06:32 |
jieniu | --- /home/zuul/src/opendev.org/openstack/project-config/gerrit/acls/openinfra/ubiquitous-computing-scheduling.config 2022-11-02 06:17:58.197768142 +0000 | 06:32 |
jieniu | +++ /tmp/tmp.5Dr5nkebQl/normalized 2022-11-02 06:19:38.774158834 +0000 | 06:32 |
jieniu | @@ -8,4 +8,4 @@ | 06:32 |
jieniu | requireContributorAgreement = true | 06:32 |
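The diff above is mostly noise around one real problem: the `\ No newline at end of file` marker, i.e. the ACL file is missing its trailing newline. A hedged sketch for finding and fixing that locally (file paths as in the failing change; the normalization script also checks other formatting):

```shell
# A file whose last byte is not "\n" makes tail -c1 print something,
# so the command substitution comes back non-empty.
for f in gerrit/acls/openinfra/*.config; do
  if [ -n "$(tail -c1 "$f")" ]; then
    echo "missing trailing newline: $f"
    printf '\n' >> "$f"   # append the newline in place
  fi
done
```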
*** yadnesh is now known as yadnesh|afk | 07:23 | |
*** elodilles_pto is now known as elodilles | 07:41 | |
*** ysandeep is now known as ysandeep|lunch | 07:59 | |
*** yadnesh|afk is now known as yadnesh | 08:22 | |
*** jpena|off is now known as jpena | 08:36 | |
yadnesh | hello all, can someone help me hold a node for https://review.opendev.org/c/openstack/aodh/+/863039 | 09:19 |
yadnesh | o/ frickler can you please help me with this ^ | 09:40 |
frickler | yadnesh: did you try to reproduce locally? or try to create a patch to gather more debug output. I can also hold a node if you specify which of the failing jobs you want to look at, but that should be considered the last resort only | 10:08 |
yadnesh | frickler, i couldn't reproduce it locally, i am not familiar with creating a patch to capture more output but I can give that a try if you can guide me or share a doc | 10:15 |
*** ysandeep|lunch is now known as ysandeep | 10:17 | |
*** rlandy|out is now known as rlandy|rover | 10:40 | |
*** yadnesh is now known as yadnesh|afk | 11:23 | |
frickler | yadnesh|afk: the general approach would most likely be to look at the delta between the last passing and first failing invocation of your job, then ponder which logs could be helpful in further assessing the issue and add a patch to the job definition to add those logs | 11:34 |
frickler | but you can also finally let me know which job you want held and your ssh key and I'll set things up | 11:35 |
*** yadnesh|afk is now known as yadnesh | 12:22 | |
yadnesh | frickler, i need it for telemetry-dsvm-integration job, here's my public key https://paste.openstack.org/show/bfsQ5ivQqWHkTozZa7mF/ | 12:25 |
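For reference, the hold frickler then set up is a Zuul autohold. A sketch of the operator-side command, with values taken from this request (the exact invocation on the schedulers may be wrapped in a container exec):

```shell
# Keep the node from the next failing build of this job so it can be
# sshed into for debugging; --count 1 releases the hold after one hit.
zuul autohold --tenant openstack \
    --project openstack/aodh \
    --job telemetry-dsvm-integration \
    --reason "yadnesh debugging aodh change 863039" \
    --count 1
```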
*** ysandeep is now known as ysandeep|brb | 12:26 | |
*** elodilles is now known as elodilles_afk | 12:44 | |
*** ysandeep|brb is now known as ysandeep | 12:56 | |
*** gthiemon1e is now known as gthiemonge | 13:04 | |
*** elodilles_afk is now known as elodilles | 13:15 | |
*** Guest202 is now known as dasm | 14:03 | |
opendevreview | dasm proposed openstack/diskimage-builder master: Fix issue in extract image https://review.opendev.org/c/openstack/diskimage-builder/+/850882 | 14:21 |
JayF | I was trying to show https://zuul.opendev.org/t/openstack/config-errors to some other contributors; but it seems to be busted today | 14:23 |
frickler | infra-root: corvus: ^^ seems to be an issue in the js renderer? I can't find an error in the web log. | 14:33 |
*** yadnesh is now known as yadnesh|away | 14:33 | |
Clark[m] | It works if you click the bell icon. So ya server side is probably fine | 14:36 |
frickler | JayF: ^^ just wanted to write the same, no direct link, but still a way to view the errors | 14:36 |
JayF | that's fine by me | 14:37 |
clarkb | 9d2e1339ff9f5080cd23e9d29fcb08315a32e5e9 that commit might be the one that broke the errors. Though I'm not sure I understand why yet | 15:22 |
clarkb | it modifies the error state in the js though and the error reported by my browser is that e.map isn't a function so something about that type change maybe | 15:23 |
clarkb | yup I think that is exactly it | 15:26 |
clarkb | I'll work on a change | 15:27 |
clarkb | hrm I feel like I'm missing something with reacts state engine that would make this easier to understand | 15:38 |
clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/863326 Fix config-errors dedicated page | 15:48 |
clarkb | I'm not sure if that is a complete fix. I'm hoping that the preview site will help with further debugging | 15:49 |
clarkb | infra-root I have put zk04 - zk06 in the emergency file | 15:54 |
clarkb | corvus: when you are around and ready to start the upgrade process let me know | 15:54 |
frickler | yadnesh|away: sorry for the delay, I've set up the hold now, but saw that in your latest PS the job is passing, so I didn't recheck as a passing job will not trigger it. let us know if you still want to debug this further | 15:54 |
clarkb | infra-root for clarity I modified the file on bridge01.opendev.org not old bridge | 15:55 |
corvus | clarkb: ack, ready in a few mins. frickler clarkb ack re config-errors will look at clarkb's change | 15:57 |
clarkb | zk05 is still the leader and I've figured out how to get it to report the number of followers it sees (mntr command) | 16:02 |
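mntr is one of ZooKeeper's four-letter-word admin commands. A minimal sketch of the check clarkb describes, assuming the default client port and that mntr is on the server's command whitelist:

```shell
# Ask the local ZooKeeper member for its monitoring stats and pull out
# its role plus the follower counters (only the leader reports these).
echo mntr | nc localhost 2181 | grep -E 'zk_server_state|followers'
```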
clarkb | corvus: when you are ready maybe you can do the zuul side backup (nodepool too?) and then I'll update the zk04 docker compose file, pull, and down then up -d | 16:04 |
corvus | clarkb: i will start that process now | 16:05 |
clarkb | great, let me know when I should proceed with 04 | 16:05 |
*** marios is now known as marios|out | 16:10 | |
corvus | clarkb: i think there are 2 backup commands we should do: nodepool, then zuul. | 16:11 |
corvus | on nl01, i logged into the container and then ran `nodepool export-image-data /var/log/nodepool/nodepool-export.data` | 16:12 |
corvus | i put it there because of the bind mount | 16:12 |
corvus | (the command wants a path and doesn't understand - so i can't do it as a single docker exec and redirect; that's a potential future improvement) | 16:13 |
corvus | that file has the metadata for the dib images in nodepool, so that if something goes wrong, we don't have to spend 2 days rebuilding images because we forgot their ids | 16:16 |
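Put together, the backup step looks roughly like the following; the container name is an assumption, and the target path sits on a bind mount so the export also appears on the host:

```shell
# The command writes to a file path (it does not understand "-"), so a
# plain `docker exec ... > backup` stdout redirect would not capture it;
# writing to the bind-mounted log directory works instead.
docker exec nodepool-launcher \
    nodepool export-image-data /var/log/nodepool/nodepool-export.data
```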
corvus | now the next thing is the zuul secret keys... that's something we should probably be backing up periodically anyway, and i can't recall if we are | 16:17 |
clarkb | corvus: I believe there is a cronjob for that on the schedulers | 16:18 |
corvus | (because if a meteor takes out zk, we can still rebuild those nodepool images, but not so with the zuul keys) | 16:18 |
corvus | clarkb: i thought/hoped so. i'll double check that it looks good and recent. | 16:18 |
clarkb | yes I see it on zuul02 at least. Last entry in the root crontab | 16:18 |
clarkb | thanks | 16:18 |
corvus | okay, that file looks internally consistent, and is dated from midnight today. we haven't merged any changes to project-config (which would generate new keys) since then, so i think that's good. also, i checked on zuul01, so we have 2 backups. :) | 16:23 |
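The periodic backup being checked here is Zuul's project key export. A hedged sketch of what such a cron entry might run (container name and destination path are assumptions; depending on the Zuul release the entry point is `zuul` rather than `zuul-admin`):

```shell
# Dump all per-project secret/SSH keys from ZooKeeper to a file that
# can later be restored with the matching import-keys command.
docker exec zuul-scheduler \
    zuul-admin export-keys /var/lib/zuul/backup/keys-$(date +%F).export
```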
clarkb | excellent. In that case do you think we are ready to proceed with zk04's upgrade? | 16:23 |
corvus | yep, i think so | 16:23 |
clarkb | ok I'll start on that now | 16:24 |
clarkb | I've run the pull and image is present. Time to down then up | 16:25 |
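The per-node step being narrated, as a sketch; the compose directory is an assumption about where the zookeeper service is defined on these hosts:

```shell
# Rolling-upgrade a single cluster member: fetch the new image, then
# recreate the container; clients fail over to the remaining members.
cd /etc/zookeeper-compose   # assumed location of docker-compose.yaml
docker-compose pull         # fetch the image for the new version
docker-compose down         # stop and remove the old container
docker-compose up -d        # start the new one detached
```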
clarkb | that seems to have worked and quite quickly | 16:26 |
clarkb | zk 3.6 mntr output is far more verbose than 3.5's | 16:26 |
clarkb | corvus: you think we wait here for a few minutes and check that the grafana graphs don't show anything unexpected? | 16:26 |
corvus | agreed, monitoring looks good so far. | 16:26 |
corvus | clarkb: because we basically kicked all the clients off of 04, it's not going to be doing any client servicing work until something else happens... so the surface area for seeing errors here is small. | 16:27 |
corvus | clarkb: we could try restarting some zuul components to see if they connect to 04, or just proceed... | 16:28 |
clarkb | corvus: ya I'm mostly worried about the stats monitoring itself since the mntr output has changed | 16:28 |
clarkb | but my read of that script is that if it doesn't find what it is looking for it skips gracefully so I think we are safe on that end. I think we should proceed with zk05 which should push load to zk04 | 16:28 |
clarkb | and then later we can add more info to the graphite stats via the script and update names if any of them have changed | 16:29 |
corvus | clarkb: i think zk05 is leader, so zk06 is next? | 16:29 |
clarkb | oh yes good catch :) | 16:29 |
corvus | and otherwise i agree, let's proceed with zk06 now. the stats for everything i expect to see data for on zk04 still look good in grafana | 16:29 |
clarkb | I've double checked that the zk04 upgrade didn't cause zk05 to stop being the leader. It is still the leader. zk06 is the next one. | 16:30 |
clarkb | proceeding with zk06 | 16:30 |
corvus | zk04 is now doing work | 16:32 |
clarkb | yup and 06 came up just as quickly as 04 did | 16:32 |
clarkb | so now we should be able to watch and wait for a few minutes to see 04 on 3.6 has no obvious issues | 16:32 |
corvus | sounds good | 16:33 |
clarkb | is the spike in zuul event processing time something to be concerned about? | 16:34 |
corvus | it started before this work | 16:34 |
clarkb | oh yup good point | 16:35 |
clarkb | a number of changes were proposed to tripleo and others | 16:36 |
corvus | looks like a bunch of openstack tenant reconfigs happened in rapid succession | 16:36 |
clarkb | the changes were pushed by the openstack proposal bot | 16:37 |
clarkb | I suspect this is ok and just part of having the bot show up with a bunch of work | 16:37 |
corvus | yeah, seems to happen every now and then | 16:37 |
clarkb | let me know when you're happy with zk04. I think the cluster handling that burst of activity in the middle of the upgrade is a good indication that things are functioning | 16:39 |
corvus | as far as performance metrics -- there is a significant increase in write time, but it corresponds with an increase in object count on the y axis, and it corresponds with the event surge on the time axis; while the event surge means we can't do a 1:1 before/after comparison, so far it looks good to me. i think we can proceed. | 16:41 |
clarkb | I will proceed with zk05 now | 16:41 |
clarkb | 06 is the new leader and it says it has two synced followers | 16:43 |
clarkb | as far as I can tell things are happy | 16:44 |
corvus | we may have a stats problem then since the graph shows 0 for all | 16:44 |
clarkb | the follower graph? | 16:44 |
corvus | yep | 16:44 |
corvus | everything else looks good | 16:45 |
clarkb | I'm skimming the new mntr output and it seems zk_followers may not exist anymore; it's zk_synced_followers now? | 16:46 |
clarkb | but I haven't grepped to be sure yet | 16:46 |
clarkb | ya that key is gone after using grep | 16:47 |
clarkb | I suspect that is the problem with our stats. | 16:47 |
corvus | is zk_synced_followers the new one? | 16:47 |
clarkb | corvus: yes zk_synced_followers | 16:47 |
corvus | oh interesting, we grab both | 16:47 |
corvus | so this may be only a grafana change | 16:48 |
corvus | clarkb: what zk version did we just upgrade to? :) | 16:48 |
Clark[m] | corvus: latest 3.6 | 16:49 |
opendevreview | James E. Blair proposed openstack/project-config master: Update ZK followers graph https://review.opendev.org/c/openstack/project-config/+/863418 | 16:49 |
clarkb | heh saw the message in matrix not irc | 16:49 |
corvus | Clark: ^ thx, maybe that will do it? | 16:49 |
clarkb | yup I suspect so. I've gone ahead and approved it | 16:50 |
corvus | asynchronous-synchronous communication :) | 16:50 |
clarkb | corvus: should I go ahead and approve the system-config change to update the docker compose files now? | 16:50 |
corvus | technically, i guess it's eventually-consistent-synchronous | 16:50 |
clarkb | well the update is a noop now but ya | 16:50 |
corvus | clarkb: ++ | 16:50 |
corvus | when shall we 3.7? | 16:51 |
clarkb | considering how well this went I'm tempted to say tomorrow :) But I've got errands I need to do tomorrow | 16:51 |
clarkb | I can probably do it monday | 16:51 |
corvus | okay. also "right now" wfm if that's an option. :) | 16:52 |
clarkb | oh hrm | 16:52 |
clarkb | let me do a quick google search for any 3.6 to 3.7 concerns | 16:52 |
clarkb | but ya actually I think that is a good idea | 16:53 |
clarkb | corvus: lets push a change to run it through our system-config-run-* jobs for any major issues? | 16:53 |
clarkb | but if that passes proceed? | 16:53 |
corvus | ok | 16:53 |
clarkb | the docs say 3.6 to 3.7 upgrade should be as simple as this one | 16:54 |
corvus | yeah: "The upgrade from 3.6.x to 3.7.0 can be executed as usual, no particular additional upgrade procedure is needed." | 16:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7 https://review.opendev.org/c/opendev/system-config/+/863419 | 16:56 |
clarkb | corvus: ^ that should give us an indication of any major issues. If that passes CI I can proceed with doing what I just did but with 3.7 | 16:57 |
clarkb | and am keeping all the servers in the emergency file for now | 16:57 |
corvus | clarkb: this takes a while, right? should we reconvene in 30m? | 16:58 |
clarkb | yup it will take a few minutes (I think it runs a zookeeper job and a zuul job) | 16:59 |
clarkb | see you in half an hour | 16:59 |
corvus | ++ | 16:59 |
*** ysandeep is now known as ysandeep|out | 17:08 | |
opendevreview | Merged openstack/project-config master: Update ZK followers graph https://review.opendev.org/c/openstack/project-config/+/863418 | 17:27 |
clarkb | the zookeeper job for the 3.7 change passed. The zuul jobs for that change should finish in about 10 minutes | 17:30 |
clarkb | assuming the zuul job comes back green too I'll restart the process we just ran through but updating 3.6 to 3.7 this time. Also zk06 is the current leader so it will go last | 17:30 |
clarkb | the job to fix the graph should run in a few minutes too. May not be a bad idea to wait for that to update to make it easier to confirm things are working | 17:32 |
corvus | i'm back | 17:32 |
clarkb | corvus: tldr is looking good so far but needs a few more minutes to finish up | 17:33 |
corvus | graphs look ok to me | 17:33 |
clarkb | the grafana update job has started | 17:37 |
*** jpena is now known as jpena|off | 17:38 | |
clarkb | hrm that job is likely to end in retry failure | 17:38 |
clarkb | I suspect that is related to the bridge update and not related to anything zookeeper | 17:39 |
clarkb | I guess we can check follower count by hand for now :) | 17:39 |
clarkb | ya permission denied when connecting to bridge. Almost certainly a problem with the new bridge migration cc ianw | 17:41 |
corvus | proceeding without graph sounds good to me | 17:42 |
clarkb | ianw: I suspect maybe something related to being triggered by project-config instead of system-config? infra-root we should hold off on adding new repos until we understand that. Also dns updates (I've got one pending) are likely to be affected | 17:42 |
corvus | maybe missed adding the project ssh key | 17:43 |
opendevreview | Merged opendev/system-config master: Upgrade our zookeeper cluster to 3.6 https://review.opendev.org/c/opendev/system-config/+/863089 | 17:43 |
clarkb | corvus: https://review.opendev.org/c/opendev/system-config/+/863419 got a +1 from zuul so I'm happy to proceed with the 3.6 to 3.7 upgrade now. Look good to you as well? | 17:43 |
clarkb | I'll do the upgrades in zk04 zk05 zk06 order since zk06 is now leader | 17:44 |
corvus | sounds good | 17:44 |
clarkb | infra-prod-service-zookeeper just started for 863089 but it should noop because the nodes are in the emergency file | 17:44 |
clarkb | yup it nooped. Proceeding on zk04 now | 17:46 |
clarkb | zk04 seems happy and zk06 shows two synced followers | 17:48 |
corvus | wow looks like most everybody hopped to zk06 | 17:48 |
clarkb | that should be random right? | 17:48 |
corvus | tbh i don't know the algorithm or if it's changed. it just usually ends up roughly equitable. everyone going to 06 is potentially strange. | 17:50 |
corvus | my inclination would be to proceed and see what happens after we upgrade 06. if we end up unbalanced (between 4 and 5) after that, look into it more. | 17:50 |
clarkb | ack. I'll proceed with zk05 next then? | 17:50 |
corvus | yep, i don't see anything else anomalous, and we know that 1 node can handle load fine. | 17:51 |
clarkb | I think all connections are on zk06 now fwiw | 17:52 |
clarkb | but zk06 continues to report all are synced | 17:52 |
clarkb | I'm going to proceed with zk06 now | 17:53 |
clarkb | zk05 became leader and zk04 has connections too | 17:54 |
clarkb | and zk05 has 2 synced followers. I wonder if that is simply a rolling upgrade behavior | 17:55 |
corvus | clarkb: can you clarify, what sequence have you upgraded? | 17:56 |
clarkb | corvus: at this point all three are upgraded. After I upgraded the last server (zk06) then zk05 became the leader and both zk05 and zk04 have connections now | 17:57 |
corvus | cool, that's what i thought, but i could have read one of the earlier updates 2 ways so wanted to be sure :) | 17:57 |
corvus | the client count looks well distributed now | 17:57 |
clarkb | also zk05 (the current leader) reports 2 synced followers which is what we expect | 17:58 |
clarkb | corvus: I think if this looks good to you after your checks you should approve https://review.opendev.org/c/opendev/system-config/+/863419 | 17:58 |
corvus | clarkb: lgtm and done | 17:59 |
clarkb | thanks! | 17:59 |
corvus | thank you! | 17:59 |
clarkb | once that lands I'll remove the nodes from the emergency file | 18:03 |
clarkb | I'll try to catch that update so that the triggered infra-prod run will actually run and noop properly due to content on disk matching expected state rather than nooping due to hosts being in the emergency file | 18:05 |
clarkb | I've gone ahead and edited the emergency file as no jobs are running and that change should land momentarily. Then I'll check that it noops as expected. Then I'll status log the new situation | 18:45 |
opendevreview | Merged opendev/system-config master: Upgrade zookeeper from 3.6 to 3.7 https://review.opendev.org/c/opendev/system-config/+/863419 | 18:53 |
clarkb | the timestamp on the docker compose file did end up updating (a side effect of using the synchronize module instead of copy?) but the version didn't change and no containers were restarted. This concludes the zookeeper upgrades | 18:59 |
clarkb | #status log Zuul's ZK cluster has been upgraded to 3.7 via 3.6. | 18:59 |
opendevstatus | clarkb: finished logging | 18:59 |
clarkb | Time for lunch but when I get back I'll probably look at the project config job retry limiting | 18:59 |
clarkb | ianw: corvus: I've confirmed that it seems we only have the one key present on bridge. I would have expected the base job running on the bridge to address that since we set the extra users list for bridge | 20:02 |
clarkb | wow ok I think it is due to the usermod conflict saying the user is in use | 20:05 |
clarkb | ya it's failing at that point and everything afterwards doesn't run | 20:06 |
clarkb | I'm not sure how to handle this since the user exists and we shouldn't need to bootstrap it again. But I guess ansible isn't smart enough to not run usermod and just lets it fail? | 20:08 |
clarkb | "You must make certain that the named user is not executing any processes when this command is being executed if the user's numerical user ID, the user's name, or the user's home directory is being changed." | 20:09 |
clarkb | the problem is the uid | 20:10 |
clarkb | ianw: ^ zuul's uid on bridge does not match what we set via ansible so that triggers the error above | 20:10 |
clarkb | we probably need to pause infra-prod access to bridge, manually change the uid, chown everything that zuul touches, then unpause and see if it works | 20:11 |
clarkb | ? | 20:11 |
clarkb | looks like the name and homedir match up so only the uid is the problem | 20:12 |
clarkb | I think the main things to be careful of are git repo ownership? Since git refuses to operate if ownership doesn't align anymore. Otherwise logging is largely written as root? | 20:27 |
ianw | clarkb: hey, sorry, catching up | 21:03 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:04 |
ianw | i think that i have messed up the zuul user when bootstrapping the host | 21:05 |
clarkb | ya I think our config makes uid 3000 the next one by default | 21:05 |
clarkb | which we've done because we hardcode in the 2000-2999 range | 21:05 |
clarkb | so it grabbed the next free uid according to the config and went with it | 21:05 |
clarkb | ianw: the other thing I've realized is that we may need to double check if gerrit projects have been created properly if any changes to do that have landed. It is possible they ended up being created by the periodic jobs on system-config if so | 21:06 |
ianw | i think that this bootstrapping issue is fixed by https://review.opendev.org/c/opendev/system-config/+/862845/2. with that, if we start another bridge, we will apply the correct extra_users, etc. variables to it and create the user as specified | 21:07 |
ianw | so yeah, i think it's a situation of mop up what is wrong with it now on this host, but going forward the same mistake shouldn't happen again | 21:07 |
clarkb | I'm still on the fence about that fwiw. The base playbook should cover all that | 21:07 |
clarkb | so it seems weird to double account for it all in a separate bootstrapping step? This is why I suggested we try to have launch node do it instead | 21:08 |
clarkb | but I also haven't fully digested that change so maybe it does what I'm thinking with launch and base.yaml | 21:09 |
ianw | right, i agree the base playbook will cover all that, but the definition for the extra_users was in the group_vars/bridge.yaml definition -- which was restricted to only the *current* production host | 21:09 |
ianw | so when i started a new one, it didn't apply the variables | 21:09 |
clarkb | oh I see | 21:10 |
ianw | however, what i've realised is, we can have as many active bridge0X.opendev.org hosts in the inventory as we want | 21:10 |
ianw | the important thing is that the CI jobs just choose the "current" one to run the jobs on | 21:10 |
clarkb | ianw: and the zuul reboot cronjob only runs on one | 21:10 |
clarkb | but ya I think that is correct | 21:10 |
ianw | s/jobs/production jobs/ | 21:10 |
ianw | indeed, yes modulo any external cron-ish type things like that | 21:11 |
clarkb | I think that is the only cron-ish thing we have currently due to the even more chicken and egg problems of restarting zuul with zuul :) | 21:11 |
clarkb | (we can't restart the executor running the restart job without breaking the restart job) | 21:11 |
ianw | so what that change does is changes things so that when we add hosts dynamically with "add_host" we put them in the prod_bridge group, and reference prod_bridge[0] in all the playbooks that setup and start the nested ansible | 21:12 |
clarkb | ok that helps a bit. I haven't managed to properly review that change yet because it is a bit mind-bendy | 21:13 |
clarkb | and I'm not sure I'll get to it today. But if I put it on my todo list for first thing tomorrow there is a good chance I finally manage it | 21:13 |
ianw | the production playbooks shouldn't care about the bridge running them -- the one place they did (resetting the project-config checkout on bridge) was an issue for running parallel jobs anyway | 21:14 |
ianw | so what falls out of that is that we can switch the testing jobs to bridge99.opendev.org just fine -- basically proving that we're not hardcoding the bridge name :) | 21:14 |
ianw | (by making the prod_bridge group in testing have the single host bridge99.opendev.org -- while in the gate it will have bridge01.opendev.org) | 21:16 |
clarkb | in the gate would still be bridge99? | 21:16 |
clarkb | do you mean in the deploy jobs? | 21:17 |
ianw | sorry, yes, i mean the actual production jobs, not the gate | 21:19 |
clarkb | makes sense | 21:19 |
ianw | the post-gate deploy steps :) | 21:19 |
ianw | ... but ... to the problem at hand ... resetting the zuul uid to 2031 | 21:20 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:21 |
clarkb | ianw: my hunch is that doing so should be relatively safe since zuul shouldn't own anything outside of its homedir | 21:22 |
clarkb | so in theory just a matter of chwoning and updating the uid in /etc/passwd? | 21:22 |
clarkb | fungi and corvus may have thoughts on that | 21:22 |
clarkb | that should allow the full base.yaml playbook to run on bridge so might be worth double checking that it won't do anything we don't want yet | 21:23 |
clarkb | and also maybe we need to check if any new project additions have landed and are in limbo (just in case we need to take any intervention steps) | 21:23 |
ianw | yeah, i think 1) change in passwd -- 2) chown -hvR 2031 /home/zuul 3) reboot for good measure | 21:24 |
ianw | 4) monitor base run | 21:25 |
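ianw's plan as commands, a sketch assuming 2031 is the uid the ansible group vars expect and that nothing is currently running as zuul (otherwise usermod refuses, per the caveat quoted earlier):

```shell
# 1) change the uid; usermod updates /etc/passwd and chowns /home/zuul
usermod -u 2031 zuul
# 2) belt and braces: re-own everything under the home dir, with -h so
#    symlink owners are fixed rather than their targets
chown -hvR 2031 /home/zuul
# 3) reboot for good measure
reboot
```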
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Fix check zone role for Jammy https://review.opendev.org/c/zuul/zuul-jobs/+/863098 | 21:26 |
ianw | i don't think i see anything in https://review.opendev.org/q/project:openstack/project-config since 2022-10-26 (my time) which was when the new host started running prod jobs | 21:27 |
clarkb | that simplifies things :) | 21:28 |
clarkb | dns updates would also be affected but I don't think we've had any of those. Grafana is the other one affected but its impact is minimal | 21:28 |
ianw | i just have to do quick school run, but plan to do it when i get back in about ~20 mins | 21:29 |
clarkb | ok, I don't think we're in a huge rush if there is anything else you think we should do to prepare | 21:29 |
clarkb | Next week is the Gerrit User Summit for anyone interested in joining. They will have remote access but the schedule is on London time (so timezones may make it difficult). I plan to try and wake up early and participate a bit myself | 21:31 |
ianw | i don't think so -- as soon as you mentioned it, it kind of clicked that zuul having this different uid was wrong | 21:31 |
clarkb | the thing that clicked for me was reading a modern man page for that utility | 21:37 |
clarkb | google returns old ones | 21:37 |
clarkb | and sure enough the caveats section clearly described why we were hitting the problem | 21:37 |
*** dasm is now known as dasm|off | 21:52 | |
*** rlandy|rover is now known as rlandy|bbl | 22:05 | |
clarkb | mtreinish: I've seen behavior similar to https://paste.opendev.org/show/bIfaYeDOEgM6Zz8gqEry/ across python3.8 on focal and python3.10/3.11 on jammy when using stestr. Basically we end up with a test suite that appears to have run all tests according to the orphaned stestr record file but the python process doesn't exit. When I strace things it seems the child is waiting on the | 22:08 |
clarkb | parent for something | 22:08 |
clarkb | mtreinish: do you have any idea of what is going on there? I'm not familiar enough with stestr to know where the multiprocessing fork comes from. Maybe that's just how you spin up the concurrency and pass it a different load list? | 22:08 |
clarkb | what is odd is that stestr and subunit haven't changed in at least a month but this seems very new (like last week at earliest) | 22:09 |
clarkb | I think I've managed to hold the node that that paste was made from if we need to inspect more stuff | 22:12 |
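For anyone poking at the held node, the inspection described above is just attaching strace to the leftover processes; pids are whatever ps reports for the orphaned workers:

```shell
# List the leftover stestr/python workers; the [s] trick stops grep
# from matching its own process entry.
ps -ef | grep '[s]testr'
# Attach to a hung child; -f follows any forked subprocesses.
strace -f -p <PID>
```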
ianw | i've swapped the ownership -- bridge is quiet so i'm rebooting now | 22:18 |
clarkb | ianw: the hourly jobs are running | 22:19 |
clarkb | but losing those has minimal impact | 22:19 |
clarkb | it should go to the next jobs and then retry | 22:19 |
clarkb | yup it just failed and should retry after | 22:20 |
ianw | i'll do a base run limited to bridge manually to watch it closely | 22:28 |
ianw | i have a root screen up | 22:28 |
ianw | bridge01.opendev.org : ok=65 changed=6 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0 | 22:30 |
ianw | clarkb: speaking of base, are you ok with https://review.opendev.org/c/opendev/system-config/+/862765 which removes the old ip address from sshd? | 22:33 |
clarkb | ianw: yup I think at this point we should roll forward | 22:34 |
clarkb | +2'd but not approved as my ability to monitor is quickly declining | 22:34 |
clarkb | ianw: re the secrets management did that get resolved? | 22:34 |
ianw | i thought it did but maybe not ... | 22:36 |
clarkb | ianw: I think running `edit-secrets` and ensuring that works as expected is the test for that? it was frickler who discovered it so can weigh in if it isn't doing what we expect yet | 22:37 |
clarkb | (I'm mostly just trying to make sure things are happy that I've seen or have seen others notice) | 22:37 |
ianw | yes, ok key material is there, but i get a prompt for gpg-agent that seems to randomly drop keypresses | 22:39 |
ianw | i think a .emacs fixes this | 22:40 |
opendevreview | Ian Wienand proposed opendev/system-config master: edit-secrets: configure gpg-agent/emacs https://review.opendev.org/c/opendev/system-config/+/863445 | 23:08 |