ianw | Effectively Scale Your Web Apps | IBM's Breakthrough in Quantum Computing | 00:03 |
ianw | VM -> container -> kubernetes -> istio -> quantum computing | 00:03 |
ianw | actually seems about right | 00:03 |
fungi | "quantum vapor" | 00:45 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 00:46 |
corvus | i guess it's good they skipped the blockchain step then | 00:47 |
fungi | zing! | 00:48 |
opendevreview | Merged opendev/system-config master: Make haproxy role more generic https://review.opendev.org/c/opendev/system-config/+/677903 | 00:50 |
ianw | junior developer wanted: full stack role (quarks to neutrons), experience with matter frameworks such as strong and weak force, familiar with gravity, radioactive decay or fusion experience a plus, javascript | 02:38 |
ianw | it looks like i've broken infra-prod deployment; "Unable to freeze job graph: Job opendev-infra-prod-setup-src is abstract and may not be directly run". investigating | 02:44 |
opendevreview | Ian Wienand proposed opendev/system-config master: infra-prod: fix name of clone source job https://review.opendev.org/c/opendev/system-config/+/819944 | 02:50 |
opendevreview | Merged opendev/system-config master: infra-prod: fix name of clone source job https://review.opendev.org/c/opendev/system-config/+/819944 | 03:49 |
*** ysandeep|out is now known as ysandeep|ruck | 04:03 | |
*** ysandeep|ruck is now known as ysandeep|afk | 05:00 | |
frickler | ianw: ^^ the deploy for that failed with a missing playbook, looks related to me at first glance | 05:43 |
ianw | frickler: thanks, yeah looking | 05:44 |
frickler | corvus: can we have a spec for keycloak? I think I'm missing some context both as to why this is needed and why you think keycloak might be the right solution. or would you plan to do that after you did some experimenting? | 05:49 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: infra-prod-setup-keys: drop inventory add https://review.opendev.org/c/opendev/base-jobs/+/818297 | 05:52 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: infra-prod: fix naming of playbook https://review.opendev.org/c/opendev/base-jobs/+/819953 | 05:52 |
ianw | frickler: i would say this is for authentication to zuul and the changes to add the ability to hold nodes, restart jobs, etc. to authorized users | 05:53 |
ianw | http://lists.zuul-ci.org/pipermail/zuul-discuss/2021-November/001749.html | 05:53 |
ianw | and the linked demo site | 05:54 |
ianw | frickler: if you think https://review.opendev.org/c/opendev/base-jobs/+/819953 is ok and can monitor it a bit, please do, but otherwise i'll get back to it first thing, i'm also out of time | 05:55 |
frickler | ianw: thx, I'll have a look | 05:59 |
*** ysandeep|afk is now known as ysandeep|ruck | 06:02 | |
opendevreview | Dr. Jens Harbott proposed opendev/base-jobs master: infra-prod: fix naming of playbook https://review.opendev.org/c/opendev/base-jobs/+/819953 | 06:24 |
opendevreview | Dr. Jens Harbott proposed opendev/base-jobs master: infra-prod-setup-keys: drop inventory add https://review.opendev.org/c/opendev/base-jobs/+/818297 | 07:02 |
*** ysandeep|ruck is now known as ysandeep|lunch | 07:33 | |
*** pojadhav is now known as pojadhav|afk | 08:06 | |
*** ysandeep|lunch is now known as ysandeep | 08:19 | |
*** ysandeep is now known as ysandeep|ruck | 08:28 | |
*** pojadhav|afk is now known as pojadhav | 08:36 | |
ianw | frickler: thanks for fixing! | 08:55 |
opendevreview | Merged opendev/base-jobs master: infra-prod: fix naming of playbook https://review.opendev.org/c/opendev/base-jobs/+/819953 | 10:09 |
*** jpena|off is now known as jpena | 10:10 | |
*** pojadhav is now known as pojadhav|afk | 10:19 | |
*** rlandy is now known as rlandy|ruck | 10:49 | |
*** pojadhav|afk is now known as pojadhav | 11:13 | |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:34 |
*** pojadhav is now known as pojadhav|brb | 11:37 | |
opendevreview | Marios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream https://review.opendev.org/c/opendev/base-jobs/+/820018 | 11:58 |
*** pojadhav|brb is now known as pojadhav | 12:08 | |
*** ysandeep|ruck is now known as ysandeep|mtg | 13:00 | |
fungi | frickler: the keycloak spec is https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html | 13:07 |
*** ysandeep|mtg is now known as ysandeep | 13:28 | |
*** ysandeep is now known as ysandeep|brb | 13:34 | |
*** pojadhav is now known as pojadhav|brb | 13:38 | |
*** ysandeep|brb is now known as ysandeep | 13:47 | |
frickler | fungi: ah, thx, now that I see it, I also remember it, but it seems it was already swapped out of readily accessible memory | 13:56 |
opendevreview | Merged opendev/base-jobs master: infra-prod-setup-keys: drop inventory add https://review.opendev.org/c/opendev/base-jobs/+/818297 | 14:07 |
fungi | frickler: happens to me all the time ;) | 14:29 |
mordred | yay. the CNCF has defined GitOps. guess what - it amazingly is a definition that doesn't mention git, and also doesn't describe anything we do! \o/ https://thenewstack.io/cncf-working-group-sets-some-standards-for-gitops/ | 14:38 |
*** pojadhav|brb is now known as pojadhav | 14:38 | |
fungi | excellent, i really don't want to be doing "gitops" anyway, whatever that is | 14:48 |
*** ysandeep is now known as ysandeep|out | 14:52 | |
fungi | sounds almost as offensive as "devops" | 14:52 |
*** pojadhav is now known as pojadhav|brb | 14:58 | |
mordred | right? | 15:03 |
mordred | I've been trying for a few years to come up with a similarly catchy term - "GateOps" is the best I came up with . but like - people who do gitops still think it's ok for a human to do a git push - which is of course silly | 15:04 |
mordred | but I think we can all agree that coming up with catchy names is not my strong suit | 15:04 |
fungi | sysops | 15:08 |
frickler | repoops has a good chance of being misread ;) | 15:22 |
fungi | or perhaps it simply has a dual meaning | 15:23 |
fungi | marios: just to clarify my point on 820018 we try to treat that template mostly as a shell script, the minimal amount of actual jinja templating happens at the top and is only used to provide some fallback defaults for the envvars we use throughout | 15:31 |
fungi | so we'd prefer to rely on shell logic and envvar evaluation in the body of the script rather than relying on jinja logic and ansible vars | 15:32 |
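fungi's point about treating the mirror template as shell-first can be illustrated with the standard parameter-expansion fallback pattern; the variable name and both URLs below are placeholders for illustration, not the real template contents:

```shell
# Sketch of the envvar-fallback pattern: ":=" assigns the default only
# when the variable is unset or empty, so shell logic (not jinja) decides.
# NODEPOOL_CENTOS_MIRROR and both URLs are illustrative placeholders.
unset NODEPOOL_CENTOS_MIRROR
: "${NODEPOOL_CENTOS_MIRROR:=http://mirror.example.org/centos}"
echo "$NODEPOOL_CENTOS_MIRROR"   # falls back to the default

NODEPOOL_CENTOS_MIRROR=http://mirror.regionone.example/centos
: "${NODEPOOL_CENTOS_MIRROR:=http://mirror.example.org/centos}"
echo "$NODEPOOL_CENTOS_MIRROR"   # a value set earlier wins over the fallback
```

Because the fallback lives in shell, the body of the script can stay free of jinja conditionals entirely.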
fungi | ianw: frickler: not sure if it's related to what you were talking about earlier, but do either of you happen to know why infra-prod-remote-puppet-else stopped running two days ago? | 15:39 |
fungi | i went about trying to figure out whether the storyboard-webclient fixes which merged yesterday had failed to deploy, and discovered that the job just hasn't been running: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-remote-puppet-else | 15:40 |
frickler | fungi: I can check after current meeting | 15:40 |
fungi | no worries, i'll continue digging, i just didn't know if it was related to what you had already looked into with the misnamed playbook issue | 15:41 |
fungi | i'm guessing it's unanticipated fallout from the parallel deploy work, i'll find it | 15:41 |
fungi | no obvious zuul config errors related to that as far as i can see | 15:44 |
fungi | Unknown projects: opendev/ansible-role-puppet | 15:45 |
fungi | it's listed as a required project for infra-prod-remote-puppet-else | 15:46 |
fungi | the repo seems to still exist: https://opendev.org/opendev/ansible-role-puppet | 15:46 |
mordred | I read "repoops" as "Re-Poops" rather than "Repo-Ops" | 15:47 |
fungi | opendev/ansible-role-puppet is in the untrusted-projects list for the openstack tenant in our config too | 15:48 |
fungi | oh, there are a ton of these, that's not the only project it's complaining about | 15:50 |
fungi | WARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo openstack/heat on branch stable/pike. The error was: Unknown projects: openstack/neutron-lbaas | 15:51 |
fungi | okay, that one's legit, we removed it from the tenant config | 15:52 |
fungi | the ones i'm most concerned about (because they are in the tenant config) are entries like... | 15:55 |
fungi | WARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo opendev/system-config on branch master. The error was: Unknown projects: opendev/puppet-openstack_infra_spec_helper, opendev/puppet-bugdaystats, opendev/puppet-mysql_backup, opendev/puppet-meetbot, opendev/puppet-pip, opendev/puppet-project_config, opendev/puppet-ethercalc, | 15:55 |
fungi | opendev/puppet-httpd, opendev/puppet-subunit2sql, opendev/puppet-reviewday, opendev/puppet-kibana, opendev/puppet-redis, opendev/puppet-zanata, opendev/puppet-logstash, opendev/puppet-mediawiki, opendev/puppet-tmpreaper, opendev/puppet-elastic_recheck, opendev/puppet-ulimit, opendev/puppet-logrotate, opendev/puppet-elasticsearch, opendev/puppet-storyboard, opendev/puppet-openstack_health, | 15:56 |
fungi | opendev/puppet-log_processor, opendev/puppet-simpleproxy, opendev/puppet-bup, opendev/puppet-pgsql_backup, opendev/puppet-ssh, opendev/puppet-user, opendev/puppet-jeepyb, opendev/puppet-vcsrepo | 15:56 |
fungi | oof, sorry, should have used paste.o.o, didn't expect it to be quite that long | 15:56 |
fungi | mmm, this could be coming from the system-config inclusion in the zuul tenant | 15:57 |
fungi | yeah, the ones i was worried about seem to be showing up shortly after it logs "WARNING zuul.ConfigLoader: 9 errors detected during zuul tenant configuration loading" | 15:59 |
fungi | so that's simply because opendev/system-config is included in the zuul tenant but defines jobs which require projects that aren't included there | 16:00 |
marios | fungi: thank you for review I will have a look and update | 16:08 |
marios | fungi: (820018) | 16:08 |
fungi | i see other periodics of ours not running for two days as well, e.g.: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul-preview | 16:08 |
fungi | stuff in opendev-prod-hourly is still triggering fine though | 16:11 |
fungi | yeah, nothing has been triggered for system-config in periodic for the past two days though: https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&pipeline=periodic | 16:12 |
clarkb | fungi: did one of the CD update changes land ~monday/sunday? | 16:14 |
clarkb | I know a couple landed yesterday/last night but that is too late to explain the periodic issues | 16:15 |
clarkb | fungi: the errors page in the zuul status should be tenant specific if you want to be sure those system-config issues are not in the openstack tenant | 16:16 |
clarkb | and indeed I don't see them there | 16:16 |
clarkb | where there == openstack tenant | 16:16 |
fungi | right, i did check there first-ish | 16:17 |
frickler | hmm, zuul said "Unable to freeze job graph: Job system-config-promote-image-haproxy-lb not defined" at the end of https://review.opendev.org/c/opendev/system-config/+/807672 (next-to-last comment) | 16:20 |
frickler | that might match "2d ago"? | 16:21 |
clarkb | it should be -haproxy instead of -lb | 16:21 |
clarkb | and ya that might be 2d ago relative to periodics | 16:22 |
clarkb | since they would run 0600 november 30 and 0600 december 1 | 16:22 |
clarkb | and that was ~2000 november 29 | 16:22 |
clarkb | er it should be -statsd instead of -lb I think | 16:22 |
clarkb | https://codesearch.opendev.org/?q=system-config-promote-image-haproxy&i=nope&literal=nope&files=&excludeFiles=&repos= | 16:22 |
fungi | bingo | 16:23 |
clarkb | frickler: did you want to push the fix up since you caught it? good eye. Also I wonder if that is in the openstack tenant error list | 16:23 |
clarkb | ^F isn't finding it. weird | 16:24 |
fungi | yeah, i couldn't find it | 16:24 |
frickler | clarkb: on it | 16:24 |
fungi | also system-config-promote-image-haproxy-lb isn't mentioned anywhere in the debug logs on zuul02 | 16:25 |
fungi | aha! | 16:25 |
fungi | i wonder if the config errors listed depend on which scheduler was queried? | 16:26 |
fungi | Exception: Job system-config-promote-image-haproxy-lb not defined | 16:26 |
fungi | that's in the debug log on *zuul01* | 16:26 |
clarkb | fungi: I thought they used an eventually consistent view of the configs | 16:26 |
clarkb | and that should be eventually consistent on the order of minutes not days | 16:26 |
fungi | eventually consistent view of the configs yes, but of the parser errors? | 16:26 |
clarkb | hrm its possible that we just fallback to a different config in that situation and then the errors get lost | 16:27 |
fungi | i guess zuul-web does try to load all the configs | 16:27 |
clarkb | I'd have to go look at the code again | 16:27 |
corvus | can someone summarize the question? | 16:27 |
fungi | corvus: wondering why the system-config-promote-image-haproxy-lb error isn't caught in the config errors listed on the zuul dashboard | 16:28 |
clarkb | corvus: we suspect that the error message at the end of https://review.opendev.org/c/opendev/system-config/+/807672 comments explains why zuul isn't running system-config periodic jobs. But that error does not show up in the openstack tenant configs | 16:28 |
clarkb | *openstack tenant error list | 16:28 |
fungi | it's logged in zuul01's debug log but not zuul02, presumably because zuul01 was the one parsing that particular pipeline | 16:28 |
corvus | yeah, only seeing it in one scheduler log makes sense | 16:28 |
*** pojadhav|brb is now known as pojadhav | 16:29 | |
fungi | but should it be listed in the tenant config errors? | 16:29 |
corvus | yeah, i would expect so; looking | 16:30 |
corvus | oh that's a job graph freeze error, not a config error | 16:31 |
corvus | not detected until runtime (versus config time) | 16:31 |
fungi | aha, okay | 16:31 |
*** marios is now known as marios|out | 16:31 | |
clarkb | got it | 16:32 |
fungi | at least now i've realized that hunting in the debug logs on just one scheduler isn't a good idea to run down such issues. i spent far too long looking on zuul02 but i should have checked zuul01 as well | 16:32 |
fungi | lesson learned | 16:32 |
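The lesson here, that an error may only surface in whichever scheduler handled the event, suggests always sweeping every scheduler's log; a minimal sketch, where the host names come from the discussion above but the ssh access and debug log path are assumptions:

```shell
# Check each scheduler for a job-graph freeze error. The grep line is
# commented out since it needs ssh access; the log path is a guess.
for host in zuul01 zuul02; do
  echo "== $host =="
  # ssh "$host" grep 'Unable to freeze job graph' /var/log/zuul/debug.log
done
```

A log-aggregation setup, as corvus notes below, would make this loop unnecessary.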
corvus | yeah. log aggregation wouldn't be a terrible idea for us :) | 16:32 |
opendevreview | Dr. Jens Harbott proposed opendev/system-config master: Fix name for haproxy-statsd dependency https://review.opendev.org/c/opendev/system-config/+/820047 | 16:32 |
fungi | should we have expected that original error to be mergeable? | 16:34 |
clarkb | fungi: ya because its a runtime error and was only affecting deploy and periodic pipelines which don't happen pre merge | 16:34 |
clarkb | if it affected check or gate we wouldn't be able to merge | 16:34 |
clarkb | (I think) | 16:35 |
corvus | yep | 16:35 |
*** rlandy is now known as rlandy|ruck | 16:35 | |
fungi | okay, makes sense | 16:36 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 16:37 |
fungi | i guess zuul will let us merge adding a nonexistent or otherwise invalid job addition to a pipeline which isn't evaluated in order to merge | 16:37 |
*** sshnaidm is now known as sshnaidm|afk | 16:46 | |
opendevreview | Merged opendev/system-config master: Fix name for haproxy-statsd dependency https://review.opendev.org/c/opendev/system-config/+/820047 | 16:53 |
clarkb | now what is a good system-config change to land and check that it is happy again | 16:55 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/818645 and https://review.opendev.org/c/opendev/system-config/+/819733 are possible options from my side, though I think I may be popping out for a bike ride in an hour or two so don't want to be on the hook for those if that is the case. I wonder if we've got some lower impact changes laying around | 16:56 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/819927 that one seems low impact | 16:57 |
fungi | i suppose i'll wait on 818826 until we know the deploy jobs are running | 17:25 |
fungi | i've approved 819927 now | 17:25 |
*** sshnaidm|afk is now known as sshnaidm | 17:29 | |
*** gmann is now known as gmann_afk | 17:34 | |
*** pojadhav is now known as pojadhav|out | 17:46 | |
*** jpena is now known as jpena|off | 18:18 | |
clarkb | the hourly deploy seems to be running though I don't think it was affected by the previous issue | 18:23 |
fungi | it was not, from what i saw | 18:24 |
opendevreview | Merged opendev/system-config master: haproxy: map in config as ro https://review.opendev.org/c/opendev/system-config/+/819927 | 18:45 |
clarkb | I don't see ^ in deploy | 18:48 |
clarkb | it still says Unable to freeze job graph: Job system-config-promote-image-haproxy-lb not defined | 18:48 |
clarkb | codesearch agrees that that string no longer appears in our configs after frickler's fix landed | 18:49 |
clarkb | corvus: ^ is this possibly a zuul caching bug? | 18:49 |
clarkb | I'm going to pop out momentarily for some exercise | 18:49 |
clarkb | but can help look closer after | 18:49 |
*** gmann_afk is now known as gmann | 18:53 | |
corvus | i'm taking a look | 19:12 |
fungi | mmm, yeah i'm eating still but wonder if the reconfigure failed | 19:14 |
corvus | i think it may have used a cache value when it shouldn't have: `2021-12-01 16:54:03,424 DEBUG zuul.TenantParser: Using files from cache for project opendev.org/opendev/system-config @master:` | 19:25 |
fungi | clarkb wins the pool then ;) | 19:26 |
ianw | o/ | 20:19 |
ianw | if i'm reading correctly, the base job thing is fixed, but a typo fix has now uncovered a caching bug in zuul? | 20:19 |
fungi | apparently so | 20:21 |
fungi | or at least that's how it's shaping up | 20:21 |
clarkb | I'm back also yay for finding bugs | 20:37 |
corvus | i believe i have made a test case with the sequence of events, but it's passing, so i'm digging in further | 20:38 |
corvus | oh! i think it may be a race with the tenant reconfiguration | 20:39 |
corvus | and... i found where it actually did not use the cached data, so that falsifies the cache theory | 20:41 |
corvus | (this project is in 2 tenants; the other tenant updated the cache, so it correctly used the cache the second time) | 20:42 |
corvus | i'm pretty sure this is just a race and is already fixed (was probably fixed a few seconds after that failure) | 20:42 |
corvus | i'll put together a timeline | 20:43 |
fungi | so we simply tried too soon after merging the config fix? | 20:44 |
corvus | oh, hrm, i think i was looking at the wrong change. what i described is why 820047 (the change that fixed the error) didn't run the job in promote... but i need to look at 819927 | 20:46 |
*** timburke_ is now known as timburke | 20:53 | |
frickler | was there any feedback for the us.linaro.cloud cert? down to 12 days remaining | 20:55 |
ianw | clarkb: would you mind looking at https://review.opendev.org/c/zuul/nodepool/+/818705 to update the testing of kernel parameters. i've made the dib bootloader fixup depend on that for testing | 20:55 |
clarkb | sure | 20:56 |
ianw | kevinz: ^^ although i did think i saw you say it was updated | 20:56 |
clarkb | frickler: kevinz acknowledged it, but may have forgotten since? | 20:56 |
frickler | might also be updated but missing a reload | 20:58 |
clarkb | if it isn't fixed by Friday I'll send a followup email | 21:08 |
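One way to verify what the server is actually serving (which also catches the "updated but missing a reload" case frickler mentions) is to pull the live certificate's end date and do the arithmetic; the openssl fetch is commented out since it needs network access, and the dates below are fixed sample values chosen to reproduce the "12 days" figure:

```shell
# A live check would fetch the served cert's expiry, e.g.:
#   enddate=$(echo | openssl s_client -connect us.linaro.cloud:443 2>/dev/null \
#             | openssl x509 -noout -enddate)
# Fixed sample values here so the arithmetic is reproducible:
enddate="notAfter=Dec 13 12:00:00 2021 GMT"
now_epoch=$(date -u -d "Dec 1 12:00:00 2021 GMT" +%s)   # stand-in for $(date -u +%s)
expiry_epoch=$(date -u -d "${enddate#notAfter=}" +%s)
echo "$(( (expiry_epoch - now_epoch) / 86400 )) days remaining"   # prints "12 days remaining"
```

If the days-remaining number disagrees with the cert installed on disk, a service reload is the likely missing step.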
clarkb | it is the time of year where my window and computer monitor and afternoon sun are all aligned such that I can't see it well | 21:49 |
corvus | okay, i think i found it -- the tenant reconfiguration failed, and i think it's because of a change cache error related to something in the openstack periodic pipeline | 21:55 |
corvus | we won't be updating the tenant layout for openstack any more; soon-ish might be a good time to do that disruptive restart. but first i want to collect more data. | 21:58 |
clarkb | noted, I'm around this afternoon and can help (just reviewing the keycloak change now) | 21:59 |
fungi | i'm sure we'd be fine leaving it, as long as config-core is aware that any changes approved which alter the tenant config won't take effect | 21:59 |
clarkb | fungi: well its not just our configs but also openstack proper. Probably a good idea to fix it | 21:59 |
fungi | (until the next restart) | 21:59 |
clarkb | but ya can wait a bit to collect data | 21:59 |
fungi | ahh, i may have conflated the tenant definitions with the layout | 22:00 |
fungi | but sure, having it not take effect in the short term doesn't seem particularly problematic as long as we're aware and can let users know | 22:00 |
clarkb | corvus: totally not urgent but I left some thoughts on the keycloak change | 22:03 |
ianw | i'm happy to help with a restart, and it would be good to pull the new gerrit images too | 22:09 |
clarkb | ianw: I don't know that the new gerrit images got built? https://review.opendev.org/c/opendev/system-config/+/819733 | 22:10 |
clarkb | I was wary of approving system-config changes when it became clear this morning we weren't running the jobs at all | 22:10 |
clarkb | I think in this case our promotion jobs won't run so we should do it after the restart :( | 22:10 |
clarkb | that's ok, gerrit restarts are quick | 22:11 |
ianw | ahh, indeed | 22:11 |
clarkb | otherwise I would agree | 22:11 |
clarkb | ianw: re gerrit user summit tomorrow they will be talking new features in 3.4 and 3.5. There is also a talk on the checks system. I'll do my best to take good notes | 22:12 |
ianw | thanks, it seems to run a bit early for .au | 22:16 |
*** rlandy|ruck is now known as rlandy|ruck|biab | 22:31 | |
corvus | ianw, clarkb, fungi : i'm satisfied with data collection (and see #zuul for a hypothesis). i think we can restart zuul now. would someone else be kind enough to do that so i can stay heads-down on the fix? please save queues, run "zuul delete-state" and restore, because of backwards incompat zk changes. | 22:54 |
corvus | oh in #zuul clarkb just suggested restarting after the fix... hrm... i think i like the idea of restarting now and then rolling-restart the fix into place... | 22:54 |
clarkb | wfm | 22:54 |
clarkb | corvus: for zuul delete-state what context do we run that in? | 22:55 |
clarkb | does that have to run in the docker container image? so we need something like docker run zuul-scheduler -- zuul delete-state ? | 22:55 |
corvus | clarkb: scheduler container: `docker-compose run --rm scheduler zuul delete-state` is what i used previously | 22:55 |
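Put together, the restart procedure being followed from here on looks roughly like the dry-run sketch below; only the `docker-compose run --rm scheduler zuul delete-state` command is verbatim from corvus, while the playbook and helper-script names are invented placeholders:

```shell
# Dry-run sketch of the full restart; run() prints instead of executing.
# Replace the function body with "$@" to actually run the commands.
run() { printf '+ %s\n' "$*"; }

run ansible-playbook zuul_pull.yaml        # 0. pull current images (placeholder name)
run python3 save_queues.py                 # 1. save (and hand-edit) the queues
run ansible-playbook zuul_stop.yaml        # 2. stop schedulers, web, etc.
run docker-compose run --rm scheduler zuul delete-state  # 3. clear incompatible ZK state
run ansible-playbook zuul_start.yaml       # 4. bring zuul02 fully up
run python3 restore_queues.py              # 5. re-enqueue the saved items
run docker-compose up -d scheduler         # 6. finally start zuul01's scheduler
```

The ordering matters: the saved queues are restored against a fully started zuul02 before the second scheduler comes up, matching what corvus describes below.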
clarkb | thanks | 22:55 |
clarkb | I see that in history on zuul02. | 22:56 |
clarkb | I can start the restart process in a couple of minutes | 22:56 |
ianw | i can also do it -- clarkb if you have more pressing things lmn | 22:56 |
clarkb | ianw: I don't just needed to get situated | 22:58 |
clarkb | step 0 is pulling images. I'll run that playbook on bridge | 22:58 |
clarkb | infra-root there are a number of buildsets in the openstack periodic pipelines. I seem to recall that restoring those results in a bunch of failures. I'll remove them from the restore script if there are no objections | 23:00 |
ianw | ++ agree | 23:00 |
clarkb | also I'm going to do a zuul stop playbook then run the delete state command then the zuul start playbook | 23:01 |
clarkb | ok queues are saved and edited to remove periodic jobs. I'm running the stop playbook now. Can someone let the openstack release team know just as a heads up? | 23:03 |
clarkb | it looks like the zuul stop playbook doesn't stop the scheduler on zuul01 | 23:03 |
clarkb | I'll manually stop it. Then I guess we manually start it after zuul02 is back up again. | 23:04 |
clarkb | corvus: ^ do we restore queues then start zuul01? | 23:04 |
johnsom | A known outage I am guessing "Service Unavailable (503) https://zuul.openstack.org/api/status" | 23:05 |
clarkb | johnsom: yes see scrollback | 23:05 |
clarkb | I'll run the delete state command now that zuul is offline | 23:05 |
clarkb | hrm should I stop zuul web too? | 23:05 |
clarkb | I'm going to since it talks to zk as well and not sure what it will do with the delete state | 23:05 |
clarkb | oh nevermind it seems to have stopped itself | 23:06 |
corvus | yeah stop and restart everything | 23:06 |
clarkb | delete state is running now | 23:07 |
clarkb | corvus: the only other thing I'm wondering about is when to up -d the scheduler on zuul01. But we're a little ways away from that | 23:07 |
corvus | i think safest is 02 completely up, then restore, then 01 | 23:07 |
corvus | we might be able to do something else but i haven't thought about it and that's what i've done previously :) | 23:08 |
clarkb | got it | 23:08 |
clarkb | delete state is done | 23:11 |
clarkb | I'm running the start playbook now | 23:11 |
clarkb | Zuul should be on its way back up again now | 23:12 |
fungi | thanks, sorry that i seem to have stepped away at the wrong moment | 23:12 |
clarkb | oh another thought. Should we consider leaving zuul01 off until we can fix this bug? | 23:14 |
clarkb | Then when it is fixed we can turn on zuul01 with the fix and rolling restart 02 on it. But if I understand correctly we can't hit this state with one scheduler | 23:14 |
fungi | is the bug specific to multi-scheduler interaction? | 23:14 |
fungi | ah, sounds like yes | 23:15 |
clarkb | that was my understanding of it. corvus can confirm | 23:15 |
fungi | seems like a reasonable precaution to me if so | 23:15 |
fungi | turning on the second scheduler is the final step anyway, so there's no urgency i guess | 23:16 |
clarkb | many merge requests are being submitted right now according to the debug log | 23:19 |
corvus | it's also really unlikely to happen | 23:23 |
clarkb | ya I guess if it happens again we're likely to have the fix ready to go and can restart at that point | 23:24 |
fungi | in that case, maybe making sure the second scheduler is running before the fix is deployed would be useful for further testing rolling upgrades | 23:24 |
clarkb | still loading configs if anyone is wondering | 23:32 |
clarkb | I think it is up but I can't get to zuul web to confirm yet | 23:34 |
clarkb | corvus: do I need to restart zuul-web now that the configs are loaded? | 23:35 |
clarkb | oh tailing the web debug log seems to indicate it also loads configs and that isn't done yet | 23:36 |
clarkb | ok its up. reenqueuing now | 23:37 |
fungi | yeah, the longer zuul-web restarts were a new reason to consider running more than one of those now | 23:42 |
clarkb | corvus: the reenqueue should be nearly complete. Anything else to check before starting the scheduler on zuul01? | 23:47 |
clarkb | corvus: also we've got the new debugging info you added https://zuul.opendev.org/t/openstack/build/f4f74f342cbd44b498314fb40bb79dea/log/zuul-info/inventory.yaml | 23:48 |
clarkb | alright reenqueue complete | 23:51 |
clarkb | I'll start the scheduler on zuul01 in a minute | 23:52 |
corvus | clarkb: i can't think of anything else to check once the re-enqueue is complete | 23:53 |
clarkb | thanks I went ahead and started it. I don't see anything indicating it might be unhappy | 23:53 |
clarkb | also note I did a pull prior to up -d since I wasn't sure if the pull playbook would update it | 23:53 |
clarkb | it seemed to be up to date so we were keeping up somehow | 23:53 |
corvus | ++ i think it does but good precaution | 23:53 |
corvus | i think the selectors are different in those playbooks | 23:54 |
clarkb | #status log Restarted all of Zuul on ac9b62e4b5fb2f3c7fecfc1ac29c84c50293dafe to correct wedged config state in the openstack tenant and install bug fixes. | 23:54 |
clarkb | I think that is the commit we're on | 23:54 |
opendevstatus | clarkb: finished logging | 23:54 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!