Wednesday, 2021-12-01

ianwEffectively Scale Your Web Apps | IBM's Breakthrough in Quantum Computing00:03
ianwVM -> container -> kubernetes -> istio -> quantum computing00:03
ianwactually seems about right00:03
fungi"quantum vapor"00:45
opendevreviewJames E. Blair proposed opendev/system-config master: Add a keycloak server
corvusi guess it's good they skipped the blockchain step then00:47
opendevreviewMerged opendev/system-config master: Make haproxy role more generic
ianwjunior developer wanted: full stack role (quarks to neutrons), experience with matter frameworks such as strong and weak force, familiar with gravity, radioactive decay or fusion experience a plus, javascript02:38
ianwit looks like i've broken infra-prod deployment; "Unable to freeze job graph: Job opendev-infra-prod-setup-src is abstract and may not be directly run".  investigating02:44
opendevreviewIan Wienand proposed opendev/system-config master: infra-prod: fix name of clone source job
opendevreviewMerged opendev/system-config master: infra-prod: fix name of clone source job
*** ysandeep|out is now known as ysandeep|ruck04:03
*** ysandeep|ruck is now known as ysandeep|afk05:00
fricklerianw: ^^ the deploy for that failed with a missing playbook, looks related to me at first glance05:43
ianwfrickler: thanks, yeah looking05:44
fricklercorvus: can we have a spec for keycloak? I think I'm missing some context both as to why this is needed and why you think keycloak might be the right solution. or would you plan to do that after you did some experimenting?05:49
opendevreviewIan Wienand proposed opendev/base-jobs master: infra-prod-setup-keys: drop inventory add
opendevreviewIan Wienand proposed opendev/base-jobs master: infra-prod: fix naming of playbook
ianwfrickler: i would say this is for authentication to zuul and the changes to add the ability to hold nodes, restart jobs, etc. to authorized users05:53
ianwand the linked demo site05:54
ianwfrickler: if you think it is ok and can monitor it a bit, please do, but otherwise i'll get back to it first thing, i'm also out of time05:55
fricklerianw: thx, I'll have a look05:59
*** ysandeep|afk is now known as ysandeep|ruck06:02
opendevreviewDr. Jens Harbott proposed opendev/base-jobs master: infra-prod: fix naming of playbook
opendevreviewDr. Jens Harbott proposed opendev/base-jobs master: infra-prod-setup-keys: drop inventory add
*** ysandeep|ruck is now known as ysandeep|lunch07:33
*** pojadhav is now known as pojadhav|afk08:06
*** ysandeep|lunch is now known as ysandeep08:19
*** ysandeep is now known as ysandeep|ruck08:28
*** pojadhav|afk is now known as pojadhav08:36
ianwfrickler: thanks for fixing!08:55
opendevreviewMerged opendev/base-jobs master: infra-prod: fix naming of playbook
*** jpena|off is now known as jpena10:10
*** pojadhav is now known as pojadhav|afk10:19
*** rlandy is now known as rlandy|ruck10:49
*** pojadhav|afk is now known as pojadhav11:13
opendevreviewMarios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream
*** pojadhav is now known as pojadhav|brb11:37
opendevreviewMarios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream
*** pojadhav|brb is now known as pojadhav12:08
*** ysandeep|ruck is now known as ysandeep|mtg13:00
fungifrickler: the keycloak spec is
*** ysandeep|mtg is now known as ysandeep13:28
*** ysandeep is now known as ysandeep|brb13:34
*** pojadhav is now known as pojadhav|brb13:38
*** ysandeep|brb is now known as ysandeep13:47
fricklerfungi: ah, thx, now that I see it, I also remember it, but it seems it was already swapped out of readily accessible memory13:56
opendevreviewMerged opendev/base-jobs master: infra-prod-setup-keys: drop inventory add
fungifrickler: happens to me all the time ;)14:29
mordredyay. the CNCF has defined GitOps. guess what - it amazingly is a definition that doesn't mention git, and also doesn't describe anything we do! \o/
*** pojadhav|brb is now known as pojadhav14:38
fungiexcellent, i really don't want to be doing "gitops" anyway, whatever that is14:48
*** ysandeep is now known as ysandeep|out14:52
fungisounds almost as offensive as "devops"14:52
*** pojadhav is now known as pojadhav|brb14:58
mordredI've been trying for a few years to come up with a similarly catchy term - "GateOps" is the best I came up with. but like - people who do gitops still think it's ok for a human to do a git push - which is of course silly15:04
mordredbut I think we can all agree that coming up with catchy names is not my strong suit15:04
fricklerrepoops has a good chance of being misread ;)15:22
fungior perhaps it simply has a dual meaning15:23
fungimarios: just to clarify my point on 820018 we try to treat that template mostly as a shell script, the minimal amount of actual jinja templating happens at the top and is only used to provide some fallback defaults for the envvars we use throughout15:31
fungiso we'd prefer to rely on shell logic and envvar evaluation in the body of the script rather than relying on jinja logic and ansible vars15:32
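[editor's note: the convention fungi describes (jinja only at the very top to seed fallback defaults, plain shell logic everywhere below) might look roughly like this sketch; the variable names, default mirror URL, and release logic here are made up for illustration and are not the actual 820018 template:]

```shell
#!/bin/bash
# Hypothetical sketch. In the real template, jinja would render only the
# fallback default on the next line, e.g.:
#   NODEPOOL_CENTOS_MIRROR="${NODEPOOL_CENTOS_MIRROR:-{{ default_mirror }}}"
# Everything after that point is plain shell, so it stays testable as a script.
NODEPOOL_CENTOS_MIRROR="${NODEPOOL_CENTOS_MIRROR:-https://mirror.example.org/centos}"

# Shell logic (not jinja conditionals) decides the stream-specific path,
# relying on envvar evaluation rather than ansible vars.
if [[ "${CENTOS_RELEASE:-8-stream}" == "9-stream" ]]; then
    MIRROR_PATH="${NODEPOOL_CENTOS_MIRROR}/9-stream"
else
    MIRROR_PATH="${NODEPOOL_CENTOS_MIRROR}/8-stream"
fi
echo "${MIRROR_PATH}"
```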
fungiianw: frickler: not sure if it's related to what you were talking about earlier, but do either of you happen to know why infra-prod-remote-puppet-else stopped running two days ago?15:39
fungii went about trying to figure out whether the storyboard-webclient fixes which merged yesterday had failed to deploy, and discovered that the job just hasn't been running:
fricklerfungi: I can check after current meeting15:40
fungino worries, i'll continue digging, i just didn't know if it was related to what you had already looked into with the misnamed playbook issue15:41
fungii'm guessing it's unanticipated fallout from the parallel deploy work, i'll find it15:41
fungino obvious zuul config errors related to that as far as i can see15:44
fungiUnknown projects: opendev/ansible-role-puppet15:45
fungiit's listed as a required project for infra-prod-remote-puppet-else15:46
fungithe repo seems to still exist:
mordredI read "repoops" as "Re-Poops" rather than "Repo-Ops"15:47
fungiopendev/ansible-role-puppet is in the untrusted-projects list for the openstack tenant in our config too15:48
fungioh, there are a ton of these, that's not the only project it's complaining about15:50
fungiWARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo openstack/heat on branch stable/pike.  The error was: Unknown projects: openstack/neutron-lbaas15:51
fungiokay, that one's legit, we removed it from the tenant config15:52
fungithe ones i'm most concerned about (because they are in the tenant config) are entries like...15:55
fungiWARNING zuul.ConfigLoader: Zuul encountered a syntax error while parsing its configuration in the repo opendev/system-config on branch master.  The error was: Unknown projects: opendev/puppet-openstack_infra_spec_helper, opendev/puppet-bugdaystats, opendev/puppet-mysql_backup, opendev/puppet-meetbot, opendev/puppet-pip, opendev/puppet-project_config, opendev/puppet-ethercalc,15:55
fungiopendev/puppet-httpd, opendev/puppet-subunit2sql, opendev/puppet-reviewday, opendev/puppet-kibana, opendev/puppet-redis, opendev/puppet-zanata, opendev/puppet-logstash, opendev/puppet-mediawiki, opendev/puppet-tmpreaper, opendev/puppet-elastic_recheck, opendev/puppet-ulimit, opendev/puppet-logrotate, opendev/puppet-elasticsearch, opendev/puppet-storyboard, opendev/puppet-openstack_health,15:56
fungiopendev/puppet-log_processor, opendev/puppet-simpleproxy, opendev/puppet-bup, opendev/puppet-pgsql_backup, opendev/puppet-ssh, opendev/puppet-user, opendev/puppet-jeepyb, opendev/puppet-vcsrepo15:56
fungioof, sorry, should have used paste.o.o, didn't expect it to be quite that long15:56
fungimmm, this could be coming from the system-config inclusion in the zuul tenant15:57
fungiyeah, the ones i was worried about seem to be showing up shortly after it logs "WARNING zuul.ConfigLoader: 9 errors detected during zuul tenant configuration loading"15:59
fungiso that's simply because opendev/system-config is included in the zuul tenant but defines jobs which require projects that aren't included there16:00
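[editor's note: the situation fungi diagnoses can be sketched roughly as below; both fragments are hypothetical illustrations of the shape of the problem, not the real opendev/system-config job or tenant config files:]

```yaml
# In a repo loaded into more than one tenant (e.g. opendev/system-config),
# a job names a required project:
- job:
    name: infra-prod-remote-puppet-else
    required-projects:
      - opendev/ansible-role-puppet

# In a second tenant that includes the repo but not the required project,
# parsing that job logs a warning like
# "Unknown projects: opendev/ansible-role-puppet":
- tenant:
    name: zuul
    source:
      gerrit:
        untrusted-projects:
          - opendev/system-config
          # opendev/ansible-role-puppet is absent here
```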
mariosfungi: thank you for review I will have a look and update16:08
mariosfungi: (820018)16:08
fungii see other periodics of ours not running since two days as well, e.g.:
fungistuff in opendev-prod-hourly is still triggering fine though16:11
fungiyeah, nothing has been triggered for system-config in periodic for the past two days though:
clarkbfungi: did one of the CD update changes land ~monday/sunday?16:14
clarkbI know a couple landed yesterday/last night but that is too late to explain the periodic issues16:15
clarkbfungi: the errors page in the zuul status should be tenant specific if you want to be sure those system-config issues are not in the openstack tenant16:16
clarkband indeed I don't see them there16:16
clarkbwhere there == openstack tenant16:16
fungiright, i did check there first-ish16:17
fricklerhmm, zuul said "Unable to freeze job graph: Job system-config-promote-image-haproxy-lb not defined" at the end of (next-to-last comment)16:20
fricklerthat might match "2d ago"?16:21
clarkbit should be -haproxy instead of -lb16:21
clarkband ya that might be 2d ago relative to periodics16:22
clarkbsince they would run 0600 november 30 and 0600 december 116:22
clarkband that was ~2000 november 2916:22
clarkber it should be -statsd instead of -lb I think16:22
clarkbfrickler: did you want to push the fix up since you caught it? good eye. Also I wonder if that is in the openstack tenant error list16:23
clarkb^F isn't finding it. weird16:24
fungiyeah, i couldn't find it16:24
fricklerclarkb: on it16:24
fungialso system-config-promote-image-haproxy-lb isn't mentioned anywhere in the debug logs on zuul0216:25
fungii wonder if the config errors listed depend on which scheduler was queried?16:26
fungiException: Job system-config-promote-image-haproxy-lb not defined16:26
fungithat's in the debug log on *zuul01*16:26
clarkbfungi: I thought they used an eventually consistent view of the configs16:26
clarkband that should be eventually consistent on the order of minutes not days16:26
fungieventually consistent view of the configs yes, but of the parser errors?16:26
clarkbhrm its possible that we just fallback to a different config in that situation and then the errors get lost16:27
fungii guess zuul-web does try to load all the configs16:27
clarkbI'd have to go look at the code again16:27
corvuscan someone summarize the question?16:27
fungicorvus: wondering why the system-config-promote-image-haproxy-lb error isn't caught in the config errors listed on the zuul dashboard16:28
clarkbcorvus: we suspect that the error message at the end of comments explains why zuul isn't running system-config periodic jobs. But that error does not show up in the openstack tenant configs16:28
clarkb*openstack tenant error list16:28
fungiit's logged in zuul01's debug log but not zuul02, presumably because zuul01 was the one parsing that particular pipeline16:28
corvusyeah, only seeing it in one scheduler log makes sense16:28
*** pojadhav|brb is now known as pojadhav16:29
fungibut should it be listed in the tenant config errors?16:29
corvusyeah, i would expect so; looking16:30
corvusoh that's a job graph freeze error, not a config error16:31
corvusnot detected until runtime (versus config time)16:31
fungiaha, okay16:31
*** marios is now known as marios|out16:31
clarkbgot it16:32
fungiat least now i've realized that hunting in the debug logs on just one scheduler isn't a good idea to run down such issues. i spent far too long looking on zuul02 but i should have checked zuul01 as well16:32
fungilesson learned16:32
corvusyeah.  log aggregation wouldn't be a terrible idea for us :)16:32
opendevreviewDr. Jens Harbott proposed opendev/system-config master: Fix name for haproxy-statsd dependency
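[editor's note: the rename frickler's fix makes is presumably of this shape; the project and job names below are a hypothetical sketch of a Zuul job-dependency reference, not the actual opendev/system-config stanza:]

```yaml
# Hypothetical sketch, not the real change.
- project:
    deploy:
      jobs:
        - infra-prod-service-lb:
            dependencies:
              # previously referenced "system-config-promote-image-haproxy-lb",
              # a job that is not defined; the bad name only surfaces when the
              # deploy/periodic pipeline tries to freeze the job graph at
              # runtime, not at config-parse time
              - system-config-promote-image-haproxy-statsd
```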
fungishould we have expected that original error to be mergeable?16:34
clarkbfungi: ya because its a runtime error and was only affecting deploy and periodic pipelines which don't happen pre merge16:34
clarkbif it affected check or gate we wouldn't be able to merge16:34
clarkb(I think)16:35
*** rlandy is now known as rlandy|ruck16:35
fungiokay, makes sense16:36
opendevreviewJames E. Blair proposed opendev/system-config master: Add a keycloak server
fungii guess zuul will let us merge adding a nonexistent or otherwise invalid job addition to a pipeline which isn't evaluated in order to merge16:37
*** sshnaidm is now known as sshnaidm|afk16:46
opendevreviewMerged opendev/system-config master: Fix name for haproxy-statsd dependency
clarkbnow what is a good system-config change to land and check that it is happy again16:55
clarkb and are possible options from my side, though I think I may be popping out for a bike ride in an hour or two so don't want to be on the hook for those if that is the case. I wonder if we've got some lower impact changes laying around16:56
clarkb that one seems low impact16:57
fungii suppose i'll wait on 818826 until we know the deploy jobs are running17:25
fungii've approved 819927 now17:25
*** sshnaidm|afk is now known as sshnaidm17:29
*** gmann is now known as gmann_afk17:34
*** pojadhav is now known as pojadhav|out17:46
*** jpena is now known as jpena|off18:18
clarkbthe hourly deploy seems to be running though I don't think it was affected by the previous issue18:23
fungiit was not, from what i saw18:24
opendevreviewMerged opendev/system-config master: haproxy: map in config as ro
clarkbI don't see ^ in deploy18:48
clarkbit still says Unable to freeze job graph: Job system-config-promote-image-haproxy-lb not defined18:48
clarkbcodesearch agrees that that string no longer appears in our configs after frickler's fix landed18:49
clarkbcorvus: ^ is this possibly a zuul caching bug?18:49
clarkbI'm going to pop out momentarily for some exercise18:49
clarkbbut can help look closer after18:49
*** gmann_afk is now known as gmann18:53
corvusi'm taking a look19:12
fungimmm, yeah i'm eating still but wonder if the reconfigure failed19:14
corvusi think it may have used a cache value when it shouldn't have: `2021-12-01 16:54:03,424 DEBUG zuul.TenantParser: Using files from cache for project @master:`19:25
fungiclarkb wins the pool then ;)19:26
ianwif i'm reading correctly, the base job thing is fixed, but a typo fix has now uncovered a caching bug in zuul?20:19
fungiapparently so20:21
fungior at least that's how it's shaping up20:21
clarkbI'm back also yay for finding bugs20:37
corvusi believe i have made a test case with the sequence of events, but it's passing, so i'm digging in further20:38
corvusoh!  i think it may be a race with the tenant reconfiguration20:39
corvusand... i found where it actually did not use the cached data, so that falsifies the cache theory20:41
corvus(this project is in 2 tenants; the other tenant updated the cache, so it correctly used the cache the second time)20:42
corvusi'm pretty sure this is just a race and is already fixed (was probably fixed a few seconds after that failure)20:42
corvusi'll put together a timeline20:43
fungiso we simply tried too soon after merging the config fix?20:44
corvusoh, hrm, i think i was looking at the wrong change.  what i described is why 820047 (the change that fixed the error) didn't run the job in promote... but i need to look at 81992720:46
*** timburke_ is now known as timburke20:53
fricklerwas there any feedback for the cert? down to 12 days remaining20:55
ianwclarkb: would you mind looking at to update the testing of kernel parameters.  i've made the dib bootloader fixup depend on that for testing20:55
ianwkevinz: ^^ although i did think i saw you say it was updated20:56
clarkbfrickler: kevinz acknowledged it, but may have forgotten since?20:56
fricklermight also be updated but missing a reload20:58
clarkbif it isn't fixed by Friday I'll send a followup email21:08
clarkbit is the time of year where my window and computer monitor and afternoon sun are all aligned such that I can't see it well21:49
corvusokay, i think i found it -- the tenant reconfiguration failed, and i think it's because of a change cache error related to something in the openstack periodic pipeline21:55
corvuswe won't be updating the tenant layout for openstack any more; soon-ish might be a good time to do that disruptive restart.  but first i want to collect more data.21:58
clarkbnoted, I'm aroudn this afternoon and can help (just reviewing the keycloak change now)21:59
fungii'm sure we'd be fine leaving it, as long as config-core is aware that any changes approved which alter the tenant config won't take effect21:59
clarkbfungi: well its not just our configs but also openstack proper. Probably a good idea to fix it21:59
fungi(until the next restart)21:59
clarkbbut ya can wait a bit to collect data21:59
fungiahh, i may have conflated the tenant definitions with the layout22:00
fungibut sure, having it not take effect in the short term doesn't seem particularly problematic as long as we're aware and can let users know22:00
clarkbcorvus: totally not urgent but I left some thoughts on the keycloak change22:03
ianwi'm happy to help with a restart, and it would be good to pull the new gerrit images too22:09
clarkbianw: I don't know that the new gerrit images got built?
clarkbI was wary of approving system-config changes when it became clear this morning we weren't running the jobs at all22:10
clarkbI think in this case our promotion jobs won't run so we should do it after the restart :(22:10
clarkbthats ok gerrit restarts are quick22:11
ianwahh, indeed22:11
clarkbotherwise I would agree22:11
clarkbianw: re gerrit user summit tomorrow they will be talking new features in 3.4 and 3.5. There is also a talk on the checks system. I'll do my best to take good notes22:12
ianwthanks, it seems to run a bit early for .au22:16
*** rlandy|ruck is now known as rlandy|ruck|biab22:31
corvusianw, clarkb, fungi : i'm satisfied with data collection (and see #zuul for a hypothesis). i think we can restart zuul now.  would someone else be kind enough to do that so i can stay heads-down on the fix?  please save queues, run "zuul delete-state" and restore, because of backwards incompat zk changes.22:54
corvusoh in #zuul clarkb just suggested restarting after the fix... hrm... i think i like the idea of restarting now and then rolling-restart the fix into place...22:54
clarkbcorvus: for zuul delete-state what context do we run that in?22:55
clarkbdoes that have to run in the docker container image? so we need something like docker run zuul-scheduler -- zuul delete-state ?22:55
corvusclarkb: scheduler container: `docker-compose run --rm scheduler zuul delete-state` is what i used previously22:55
clarkbI see that in history on zuul02.22:56
clarkbI can start the restart process in a couple of minutes22:56
ianwi can also do it -- clarkb if you have more pressing things lmn22:56
clarkbianw: I don't just needed to get situated22:58
clarkbstep 0 is pulling images. I'll run that playbook on bridge22:58
clarkbinfra-root there are a number of buildsets in the openstack periodic pipelines. I seem to recall that restoring those results in a bunch of failures. I'll remove them from the restore script if there are no objections23:00
ianw++ agree23:00
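[editor's note: trimming the periodic entries could be done with a filter like the sketch below; the saved-queue file format here (one `zuul enqueue` command per line) is an assumption for illustration, not the real output of the opendev queue-save tooling:]

```shell
#!/bin/bash
# Hypothetical sketch of dropping periodic entries from a saved-queue
# script before replaying it after a restart.
cat > /tmp/queues.sh <<'EOF'
zuul enqueue --tenant openstack --pipeline gate --project opendev/system-config --change 820047,1
zuul enqueue-ref --tenant openstack --pipeline periodic --project openstack/nova --ref refs/heads/master
zuul enqueue-ref --tenant openstack --pipeline periodic-stable --project openstack/nova --ref refs/heads/stable/xena
EOF
# Keep everything except entries destined for periodic pipelines; the
# substring match also catches periodic-stable and friends.
grep -v -- '--pipeline periodic' /tmp/queues.sh > /tmp/queues-filtered.sh
wc -l < /tmp/queues-filtered.sh
```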
clarkbalso I'm going to do a zuul stop playbook then run the delete state command then the zuul start playbook23:01
clarkbok queues are saved and edited to remove periodic jobs. I'm running the stop playbook now. Can someone let the openstack release team know just as a heads up?23:03
clarkbit looks like the zuul stop playbook doesn't stop the scheduler on zuul0123:03
clarkbI'll manually stop it. Then I guess we manually start it after zuul02 is back up again.23:04
clarkbcorvus: ^ do we restore queues then start zuul01?23:04
johnsomA known outage I am guessing "Service Unavailable (503)"23:05
clarkbjohnsom: yes see scrollback23:05
clarkbI'll run the delete state command now that zuul is offline23:05
clarkbhrm should I stop zuul web too?23:05
clarkbI'm going to since it talks to zk as well and not sure what it will do with the delete state23:05
clarkboh nevermind it seems to have stopped itself23:06
corvusyeah stop and restart everything23:06
clarkbdelete state is running now23:07
clarkbcorvus: the only other thing I'm wondering about is when to up -d the scheduler on zuul01. But we're a little ways away from that23:07
corvusi think safest is 02 completely up, then restore, then 0123:07
corvuswe might be able to do something else but i haven't thought about it and that's what i've done previously :)23:08
clarkbgot it23:08
clarkbdelete state is done23:11
clarkbI'm running the start playbook now23:11
clarkbZuul should be on its way back up again now23:12
fungithanks, sorry that i seem to have stepped away at the wrong moment23:12
clarkboh another thought. Should we consider leaving zuul01 off until we can fix this bug?23:14
clarkbThen when it is fixed we can turn on zuul01 with the fix and rolling restart 02 on it. But if I understand correctly we can't hit this state with one scheduler23:14
fungiis the bug specific to multi-scheduler interaction?23:14
fungiah, sounds like yes23:15
clarkbthat was my understanding of it. corvus can confirm23:15
fungiseems like a reasonable precaution to me if so23:15
fungiturning on the second scheduler is the final step anyway, so there's no urgency i guess23:16
clarkbmany merge requests are being submitted right now according to the debug log23:19
corvusit's also really unlikely to happen23:23
clarkbya I guess if it happens again we're likely to have the fix ready to go and can restart at that point23:24
fungiin that case, maybe making sure the second scheduler is running before the fix is deployed would be useful for further testing rolling upgrades23:24
clarkbstill loading configs if anyone is wondering23:32
clarkbI think it is up but I can't get to zuul web to confirm yet23:34
clarkbcorvus: do I need to restart zuul-web now that the configs are loaded?23:35
clarkboh tailing the web debug log seems to indicate it also loads configs and that isn't done yet23:36
clarkbok its up. reenqueuing now23:37
fungiyeah, the longer zuul-web restarts were a new reason to consider running more than one of those now23:42
clarkbcorvus: the reenqueue should be nearly complete. Anything else to check before starting the scheduler on zuul01?23:47
clarkbcorvus: also we've got the new debugging info you added
clarkbalright reenqueue complete23:51
clarkbI'll start the scheduler on zuul01 in a minute23:52
corvusclarkb: i can't think of anything else to check once the re-enqueue is complete23:53
clarkbthanks I went ahead and started it. I don't see anything indicating it might be unhappy23:53
clarkbalso note I did a pull prior to up -d since I wasn't sure if the pull playbook would update it23:53
clarkbit seemed to be up to date so we were keeping up somehow23:53
corvus++ i think it does but good precaution23:53
corvusi think the selectors are different in those playbooks23:54
clarkb#status log Restarted all of Zuul on ac9b62e4b5fb2f3c7fecfc1ac29c84c50293dafe to correct wedged config state in the openstack tenant and install bug fixes.23:54
clarkbI think that is the commit we're on23:54
opendevstatusclarkb: finished logging23:54
