*** rlandy|biab is now known as rlandy|out | 00:41 | |
Clark[m] | It is on to ze08 now | 01:18 |
fungi | probably still won't be through the remaining executors before i get to sleep | 01:20 |
ianw | BlaisePabon[m]: congratulations :) do you have to work in with existing deployment/IT things, or could you start out with a Zuul-based CI/CD for your control-plane from day 1? | 01:31 |
fungi | already stopping ze10 now, so definitely speeding up | 02:03 |
Clark[m] | Ya but periodic jobs just happened to slow us down again :) | 02:08 |
fungi | bah | 02:13 |
fungi | hopefully the periodics are mostly short-runners and don't involve any paused builds | 02:14 |
BlaisePabon[m] | <ianw> "Blaise Pabon: congratulations :)..." <- I can do whatever I want!! | 02:37 |
BlaisePabon[m] | In fact, the expectation is that I should start from a clean slate... and in fact, I don't have a choice because there is no CI and no CD at present. | 02:40 |
BlaisePabon[m] | (as in, 90's style, take the server offline, ssh as the user called `root` and then proceed to `yum install ...` and `npm build` for 90 mins) | 02:40 |
BlaisePabon[m] | So if ever anyone wanted to set up an exemplary zuul-ci configuration, this would be it. | 02:41 |
BlaisePabon[m] | fwiw, I'm rather comfortable with Docker, git, python and Kubernetes | 02:41 |
BlaisePabon[m] | btw, I figured out how to setup reverse proxies for the servers in my garage. A while back I had offered to make them available to the nodepool, so the offer still stands. | 02:45 |
BlaisePabon[m] | oh, and, in full disclosure, I'm not sure I know what you mean by `Zuul-based CI/CD for your control-plane` but whatever it is, I can do it. | 02:47 |
ianw | i mean that what we do is have Zuul actually deploy gerrit (and all our other services, including zuul itself :) | 02:59 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:02 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:04 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:10 |
ianw | i'm just spinning up a test in ovh and noticed a bunch of really old leaked images. i'll clean them up | 04:00 |
ianw | no servers; image uploads. they have timestamps starting with 15 | 04:01 |
*** ysandeep|out is now known as ysandeep | 04:12 | |
ianw | there's still a few in there in state "deleted" that don't seem to go away. don't think there's much we can do client side on that | 04:14 |
*** marios is now known as marios|ruck | 05:01 | |
*** bhagyashris is now known as bhagyashris|ruck | 05:08 | |
ykarel | frickler, can you please check https://review.opendev.org/c/openstack/devstack-gate/+/843148 | 05:55 |
mnasiadka | fungi, clarkb: a lot of jobs are being run only on changes to particular files, ceph jobs are non voting because they are sometimes failing due to their complexity (multinode, ceph deployment, openstack deployment, etc). We moved to use cephadm in Wallaby (and working more to improve failure rate), since Victoria is EM - we could remove those jobs (I was not aware they are failing so much). | 06:23 |
*** ysandeep is now known as ysandeep|afk | 06:31 | |
opendevreview | Merged openstack/project-config master: Add ops to openstack-ansible-sig channel https://review.opendev.org/c/openstack/project-config/+/843492 | 07:03 |
*** ysandeep|afk is now known as ysandeep | 07:22 | |
opendevreview | Ian Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration https://review.opendev.org/c/opendev/glean/+/843243 | 07:25 |
ianw | fungi/clarkb: ^ i've validated that on an OVH centos-9 node. more testing to do, but i think that is ~ what it will end up like | 07:25 |
opendevreview | Ian Wienand proposed opendev/glean master: _network_info: refactor to add ipv4 info at the end https://review.opendev.org/c/opendev/glean/+/843367 | 07:30 |
opendevreview | Ian Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration https://review.opendev.org/c/opendev/glean/+/843243 | 07:30 |
*** marios|ruck is now known as marios|ruck|afk | 08:44 | |
*** marios|ruck|afk is now known as marios|ruck | 09:38 | |
*** rlandy|out is now known as rlandy | 10:03 | |
*** soniya29 is now known as soniya29|afk | 10:07 | |
*** ysandeep is now known as ysandeep|afk | 10:30 | |
*** ysandeep|afk is now known as ysandeep | 10:43 | |
yoctozepto | infra-root: (cc frickler) hi! may I request a hold of the node used for the job that runs for change #843583 | 10:58 |
yoctozepto | the job name is kolla-ansible-ubuntu-source-zun-upgrade | 10:58 |
*** ysandeep is now known as ysandeep|afk | 11:06 | |
yoctozepto | my ssh key https://github.com/yoctozepto.keys | 11:07 |
*** soniya29|afk is now known as soniya29 | 11:08 | |
*** rlandy is now known as rlandy|PTOish | 11:09 | |
frickler | yoctozepto: no need to double-hilight me ;) will set it up now | 11:14 |
*** dviroel|out is now known as dviroel | 11:15 | |
frickler | and done | 11:16 |
fungi | as corvus predicted, we have merger graceful stopping problems. i'll leave the playbook in its present hung state for now, but essentially we upgraded all the executors and have been waiting for many hours for zm01 to gracefully stop (which it obviously has no intention of doing) | 11:34 |
fungi | we can probably manually swizzle the processes on the mergers in order to make the playbook think it stopped them so it will proceed through the rest of the list, but i'll sit on that idea until we have more folks on hand to tell me i'm crazy | 11:35 |
yoctozepto | frickler: thanks; the more, the merrier, or so they say :-) | 11:37 |
*** dviroel is now known as dviroel|afk | 13:07 | |
yoctozepto | frickler: somehow I cannot get onto that node, my key is rejected | 13:23 |
fungi | yoctozepto: as root? | 13:30 |
yoctozepto | fungi: root@23.253.20.188: Permission denied (publickey). | 13:30 |
Clark[m] | fungi: I'm not properly at a keyboard for a while yet but I think once the merger no longer shows up in the components list you can manually docker-compose down on that server to kill the container which will cause the playbook to continue. I can do that in about an hour and a half myself once at the proper keyboard | 13:37 |
fungi | Clark[m]: yeah, that's what i was thinking of doing, just didn't want to proceed until more folks are around since it'll churn through the remaining services rather quickly | 13:38 |
fungi | yoctozepto: i've added your ssh key to the held node now | 13:39 |
yoctozepto | thanks fungi, it works (cc frickler) | 13:44 |
corvus | fungi: Clark yes that's what i would suggest | 13:49 |
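For reference, the manual workaround being discussed would look roughly like the following on each stuck merger; the compose directory is an assumption about the merger hosts' layout, not a verified path.

```shell
# Rough sketch of the manual step (assumed paths, not verified): once the
# merger no longer shows in the components list, stop its container so the
# rolling-restart playbook's stop check can proceed.
ssh zm01.opendev.org
cd /etc/zuul-merger          # assumed location of the merger's docker-compose.yaml
sudo docker-compose down     # stops and removes the merger container
```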
frickler | yoctozepto: ah, I was about to add your key now, seems fungi already did that, thx | 14:10 |
yoctozepto | fungi, frickler: thx, I powered off that machine, it can be returned to the pool | 14:31 |
fungi | yoctozepto: thanks! i've released it | 14:35 |
*** ysandeep|afk is now known as ysandeep | 14:46 | |
fungi | Clark[m]: see the #openstack-cinder channel log for some discussion of more meetpad audio strife... digging around i ran across these which have a potential solution for chrom*'s autoplay permission and might also work around the problem in ff? https://github.com/jitsi/jitsi-meet/issues/10633 https://github.com/jitsi/jitsi-meet/issues/9528 | 14:47 |
fungi | specifically, adding a pre-join page so that users click on/enter something in the page is enough of a signal that the browser considers the user has given permission to auto-play for that session | 14:48 |
fungi | our config dumps people straight into the call without them needing to interact before the audio stream starts, which seems to maybe be the problem | 14:49 |
fungi | also ran across another comment buried in an issue suggesting to switch media.webrtc.hw.h264.enabled to true in about:config on ff | 14:53 |
fungi | (it's still false by default even in my ff 100) | 14:54 |
Clark[m] | Enabling the join page looks like a simple config update at least. Asking users to edit about:config is probably best avoided | 14:54 |
fungi | yeah, that was separate, for improving streaming performance on ff | 14:56 |
fungi | though it looks like it's probably a bad idea to switch on unless you've got at least ff 96 when they merged an updated libwebrtc | 14:56 |
fungi | but yes, i'm in favor of trying to add a pre-join page and seeing if that helps. i'll propose a change | 14:57 |
fungi | also more generally, it looks like the lack of simulcast support between jitsi-meet and firefox is likely to still create additional load on all participants the more firefox users join the call with video on | 14:58 |
fungi | since for firefox it ends up falling back on peer-to-peer streams | 14:59 |
fungi | or at least that's how i read the discussions | 14:59 |
Clark[m] | I think we explicitly disable peer to peer | 15:00 |
fungi | ahh, okay | 15:00 |
fungi | then maybe not for our case | 15:00 |
Clark[m] | The problem aiui is webrtc is expensive for video and just adding video bogs things down. Zoom web client which isn't webrtc does the same thing | 15:00 |
Clark[m] | Add in devices that thermal throttle (MacBooks) and problems abound :( | 15:01 |
fungi | for sure | 15:01 |
fungi | and yes, even on my workstation i end up setting zoom's in-browser client to disable incoming video | 15:02 |
fungi | okay, since it's getting to the point in the day where more people are going to be around, i'll start manually downing the docker containers for each merger as they disappear from the components page, one by one | 15:03 |
fungi | starting with zm01 now | 15:03 |
fungi | as soon as i did that, the system kicked me out for a reboot and the playbook progressed | 15:04 |
fungi | looks like it came back and zm02 is down now so doing the same for it | 15:04 |
fungi | i'll wait when it gets to zm08, so everybody's got warning when the scheduler/web containers are going down | 15:05 |
Clark[m] | ++ I should be home soon | 15:06 |
fungi | no need to rush | 15:07 |
fungi | all done except for zm08, and i've got the docker-compose down queued for that so ready to proceed when others are | 15:14 |
clarkb | I'm here now just without ssh keys loaded yet | 15:17 |
clarkb | and now that is done. We can probably proceed unless you wanted corvus to ack too | 15:18 |
ykarel | Hi is there some known issue with unbound on c9-stream fips jobs | 15:22 |
clarkb | ykarel: there is a race where ansible continues running job stuff after the fips reboot but before unbound is up and running | 15:23 |
fungi | ykarel: the fips setup reboots the machine, which seems to result in unbound coming undone. i think there was some work in progress to make the unbound startup wait for networking to be marked ready by systemd first | 15:23 |
clarkb | yup I think the idea was to encode all of that into a post reboot role in zuul-jobs. Then whether or not you are doing fips you can run that in your jobs to ensure the test node is ready before continuing | 15:23 |
fungi | and yeah, it's basically that the job proceeds after the reboot when resolution dns isn't working yet | 15:23 |
ykarel | clarkb, fungi okk so it's something known | 15:24 |
ykarel | Thanks | 15:24 |
ykarel | maybe after reboot it can wait for some time until unbound is up | 15:24 |
fungi | clarkb: i thought the idea was to change the service unit dependencies in the centos images to make sure sshd isn't running until unbound is fully started | 15:24 |
clarkb | fungi: no I suggested against that because then you need our images to run the tests successfully | 15:25 |
clarkb | I suggested that the test jobs themselves become smart enough to handle the distro behavior | 15:25 |
fungi | well, you need our images to run the tests successfully if you're using unbound for a local dns cache (which is a decision we made in our images) | 15:26 |
clarkb | you have to do a couple things post reboot like starting the console logger anyway so encoding all of that into an easy to use role makes sense | 15:26 |
clarkb | fungi: it's a decision we made in our images but we just install the normal distro package for it | 15:26 |
clarkb | it's not like this is a bug in our use of unbound. Distro systemd is allowing ssh connections before dns resolvers are up | 15:27 |
clarkb | and systemd is sort of designed to do that | 15:27 |
clarkb | (speed up boots even if you end up on a machine that can't do much for a few extra seconds and all that) | 15:27 |
fungi | but yes, i can see the logic in forcing the job to wait until the console stream is running again, so checking dns resolves successfully somehow is reasonable to do at the same time | 15:27 |
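A minimal sketch of what such a post-reboot wait could look like as Ansible tasks; the task names and the host used for the DNS probe are illustrative assumptions, not the actual zuul-jobs role.

```yaml
# Illustrative sketch only, not the real zuul-jobs role.
- name: Wait for the node to come back after the reboot
  wait_for_connection:
    timeout: 300

- name: Wait until DNS resolution works (unbound may start after sshd)
  command: python3 -c "import socket; socket.getaddrinfo('opendev.org', 443)"
  register: dns_probe
  until: dns_probe.rc == 0
  retries: 30
  delay: 5
  changed_when: false
```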
fungi | okay, downing the container on zm08 now at 15:30z | 15:30 |
fungi | and it's rebooting | 15:30 |
fungi | and the containers on zuul01 have stopped and it's rebooting now | 15:31 |
johnsom | ade_lee Has a patch been proposed for the zuul task to wait for DNS? | 15:31 |
fungi | containers are starting on zuul01 | 15:32 |
fungi | clarkb: is this benign? "[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: {{ components.status == 200 and components.content | from_json | json_query(scheduler_query) | length == 1 and components.content | from_json | json_query(scheduler_query) | first == 'running' }}" | 15:32 |
clarkb | fungi: yes I made note of that when I was testing this | 15:34 |
clarkb | it was the only way I could get the ? in the query var to not explode as a parse error | 15:34 |
fungi | thanks, looked familiar | 15:35 |
clarkb | fungi: I suspect this is a corner case of ansible jinja parsing where ansible really wants you to use the less syntax heavy version because it makes ansible look better but it isn't as expressive and can have issues as I found | 15:35 |
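For context, the warning above comes from an until-style check against the Zuul components API; the following is a hedged reconstruction of that kind of task. The URL, variable names, and JMESPath query are assumptions rather than the playbook's actual content, and json_query requires the community.general collection.

```yaml
# Hedged reconstruction of the sort of check that produces the warning.
- name: Wait for the scheduler component to report running
  uri:
    url: https://zuul.opendev.org/api/components
    return_content: true
  register: components
  vars:
    scheduler_query: "scheduler[?hostname=='zuul01.opendev.org'].state"
  # The '?' in the JMESPath query is what forces the fully-templated form:
  until: >-
    {{ components.status == 200 and
       components.content | from_json | json_query(scheduler_query) | length == 1 and
       components.content | from_json | json_query(scheduler_query) | first == 'running' }}
  retries: 60
  delay: 30
```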
clarkb | yoctozepto: note my response on https://review.opendev.org/c/openstack/kolla-ansible/+/843536 | 15:36 |
clarkb | fungi: corvus: one thing I wonder is if having web and scheduler fight over initializing in the db may cause the whole thing to be slower? I guess they might be slower individually but since we run them concurrently wall time should be less? | 15:38 |
clarkb | fungi: unrelated did you see https://storyboard.openstack.org/#!/story/2010054 I'm having a hard time understanding that one since all of our repos have / in their names too. I wonder if the actual repo dir has a / in it. We do openstack dir containing nova repo dir. Maybe they are doing something like openstack/nova is the repo dir (how you would convince a filesystem of that I | 15:40 |
clarkb | don't know) | 15:40 |
clarkb | oh you know what? I wonder if they need to edit their gerrit config | 15:40 |
clarkb | there is a way to tell it to encode nested slashes iirc | 15:41 |
fungi | i think we do that, yeah | 15:42 |
yoctozepto | clarkb: I'm on mobile atm but your comment looks reasonable, the other case is something we were not aware of, we will amend our ways then, thanks | 15:42 |
fungi | clarkb: we set it in the apache config actually | 15:43 |
clarkb | fungi: aha | 15:43 |
fungi | i'll link the example | 15:44 |
clarkb | yoctozepto: ya it's always retriable if it happens in pre-run regardless of the reason. But then in any phase it is retriable if ansible reports a network error (and for reasons filling the remote disk results in network errors) | 15:44 |
clarkb | corvus: 2022-05-27 15:55:47,506 ERROR zuul.Scheduler: voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version'] <- I think zuul01 is unhappy with the zuul tenant config | 15:56 |
clarkb | zuul01 still shows as initializing, but I think it is up? | 15:57 |
clarkb | could it be related to that error? | 15:57 |
*** marios|ruck is now known as marios|out | 15:57 | |
opendevreview | Clark Boylan proposed openstack/project-config master: The default ansible version in zuul config is a str not int https://review.opendev.org/c/openstack/project-config/+/843650 | 15:59 |
clarkb | I think ^ that will fix things based on the error message. However, I'm not sure if initializing as the current state is ideal for zuul to report if it is running otherwise. Maybe "degraded" ? | 15:59 |
clarkb | anyway I suspect that if we land 843650 zuul will switch over to running and the playbook will proceed but that is just a hunch | 16:01 |
clarkb | and we've got about 2.5 hours to do it before the playbook exits in error | 16:02 |
*** dviroel|afk is now known as dviroel | 16:08 | |
corvus | clarkb: approved 650 | 16:09 |
clarkb | looking at zuul01's scheduler log more closely I think degraded is not really accurate either | 16:09 |
clarkb | the process is up and running but it isn't processing pipelines | 16:09 |
clarkb | maybe an ERROR state would be best then? | 16:09 |
clarkb | it is just logging side effects caused by zuul02's operation if I am reading this correctly | 16:10 |
corvus | clarkb: i think it's restarted 2x | 16:13 |
clarkb | corvus: hrm is that something docker would've helpfully done for us? | 16:14 |
clarkb | it exited with error maybe so docker started it? | 16:14 |
corvus | 2022-05-27 15:32:06,508 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev34 | 16:14 |
corvus | 2022-05-27 15:55:47,506 ERROR zuul.Scheduler: voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version'] | 16:14 |
corvus | 2022-05-27 15:55:51,679 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev34 | 16:14 |
corvus | maybe? | 16:15 |
corvus | and yeah, it's getting data from zk now | 16:15 |
corvus | so i think we're in a 30m long startup loop | 16:15 |
corvus | which is great, actually; it means that 30m after 843650 lands it should succeed maybe hopefully? | 16:15 |
fungi | neat | 16:15 |
clarkb | corvus: that would be my expectation. If that comes in under the 3 hour total timeout wait period then zuul02 should get managed by the automated playbook too | 16:16 |
corvus | oh -- but only if ansible puts that file in place | 16:16 |
clarkb | corvus: the regular deploy job should do that | 16:16 |
corvus | cool | 16:16 |
fungi | and we haven't blocked deployments so that should happen | 16:16 |
corvus | wasn't sure how much was disabled (but i'm glad that isn't -- i don't think we need to) | 16:16 |
fungi | and i think we've got plenty of time before the timeout is reached, yeah | 16:17 |
clarkb | corvus: nothing is currently disabled | 16:17 |
corvus | (it should be fine to do a tenant reconfig during a rolling restart) | 16:17 |
corvus | (it would slow stuff down but shouldn't break) | 16:17 |
clarkb | if 02 doesn't get automatically handled I can take care of it after the fix lands. Then we can retry the automated playbook after merger stop is fixed and with https://review.opendev.org/c/opendev/system-config/+/843549 if we think that is a good idea | 16:18 |
corvus | clarkb: i +2d 549 ... will leave to you to +w | 16:19 |
clarkb | corvus: thanks. | 16:19 |
corvus | zuul01 just restarted again | 16:20 |
clarkb | if the timing estimates on the dashboard are accurate then the next restart should be happy | 16:20 |
corvus | so assuming 650 lands soon, probably a successful restart around 16:45 | 16:20 |
clarkb | (I expect the fix will land in a couple of minutes and then the hourly zuul deploy should run shortly after that) | 16:20 |
clarkb | then after hourly is done the deploy for 650 will run and noop | 16:21 |
opendevreview | Merged openstack/project-config master: The default ansible version in zuul config is a str not int https://review.opendev.org/c/openstack/project-config/+/843650 | 16:22 |
fungi | yeah, i'm happy approving 843549 any time, since we're manually running this anyway for now and the current run won't pick that up even if it merges in the middle since the playbook has already been read | 16:24 |
fungi | and i don't expect to run it again until we at least think we have clean merger stops | 16:24 |
clarkb | yup. We may need to restart the mergers after they are fixed to pick up the fix, but then we can run the automated playbook again and it should roll through without being paused | 16:25 |
clarkb | corvus: is the issue a thread that isn't exiting or isn't marked daemon? | 16:25 |
corvus | clarkb: unsure -- i'm planning on taking a look at that tomorrow. i'm sure it'll be something simple like that. | 16:26 |
clarkb | ok, no rush. I don't expect I'll be running this playbook over the weekend :) | 16:26 |
clarkb | good news is if we have another config error like this happen when we are all sleeping the playbook should timeout and error without proceeding to the second scheduler | 16:27 |
ade_lee | johnsom, yeah - not yet -- I've been trying to find time to create it | 16:31 |
ade_lee | johnsom, hopefully by early next week | 16:32 |
johnsom | ade_lee Ack, thanks for the update | 16:32 |
ade_lee | fungi, clarkb - do you guys know anything about this error here? https://zuul.opendev.org/t/openstack/build/041ccac8861442a192beaabb7c9ca500 | 16:32 |
ade_lee | fungi, clarkb something about oslo.log not being set correctly from upper constraints in train? | 16:33 |
clarkb | ade_lee: there are two problems that cause that. The first is trying to install a version of a library that sets a required python version that isn't compatible with the current python. The other is if pypi's CDN falls back to their backup backend and serves you a stale index without the new package present | 16:34 |
clarkb | ade_lee: in this case oslo.log==5.0.0 requires python>=3.8 and you appear to be using 3.6 | 16:35 |
clarkb | which means it is the first issue | 16:35 |
ade_lee | ah | 16:35 |
fungi | all installdeps: -chttps://releases.openstack.org/constraints/upper/master, -r/opt/stack/new/tempest/requirements.txt | 16:35 |
clarkb | fungi: corvus: the hourly zuul deploy did not update to the fixed version as I thought it might. We have to wait for the normal deployment to happen which should happen soon enough | 16:35 |
fungi | ade_lee: coming from /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L890 | 16:36 |
fungi | sudo -H -u tempest UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master TOX_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master tox -eall -- barbican --concurrency=4 | 16:36 |
clarkb | newer pip will actually tell you why it failed in this case rather than giving you a convoluted message | 16:36 |
fungi | so yes, i think this is the first case clarkb mentioned | 16:37 |
fungi | tempest's virtualenv is built with the python3 from ubuntu-bionic and is trying to install the master branch constraints | 16:37 |
ade_lee | clarkb, fungi thanks - so I need to switch the barbican gate to py38 | 16:37 |
clarkb | and stop using devstack-gate | 16:37 |
clarkb | oh wait you said train though so the actual fix may be more stable branch specific | 16:38 |
clarkb | like using an older tempest or something | 16:38 |
fungi | ade_lee: you might check with #openstack-qa, i think they noted some breakage to old stable branches from tempest et al dropping support for old python | 16:38 |
ade_lee | fungi, ack - will do | 16:38 |
fungi | basically you're supposed to install a tagged version of tempest i think, at least that's how it's been handled in the past | 16:38 |
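To illustrate the first failure mode clarkb described (a bionic node with python 3.6 pulling master constraints), here is a sketch; the exact pip output and the tempest pinning mechanism will differ per branch.

```shell
# Sketch of the mismatch: master constraints pin oslo.log to a release that
# declares Requires-Python >=3.8, which python 3.6 cannot satisfy.
python3.6 -m pip install 'oslo.log==5.0.0'   # fails on a bionic node

# A stable/train job would instead point tox at the train constraints, e.g.:
#   UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/train
#   TOX_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/train
# (along with a tempest tag known to still support that branch)
```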
clarkb | we just missed this restart so we have to wait for the next one | 16:43 |
clarkb | project-config is now updated. The next restart should work I hope | 16:46 |
clarkb | so about 35 minutes away? | 16:46 |
mgariepy | hello, can i have a hold on : --project=opendev.org/openstack/openstack-ansible --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 --ref=refs/changes/17/823417/31 | 16:51 |
mgariepy | to investigate a bootstrap issue on rocky ? | 16:51 |
fungi | mgariepy: sure, i'm curious to see this work while the schedulers are in the middle of a rolling restart. it could be an interesting test | 16:52 |
mgariepy | hehe :D | 16:54 |
fungi | mgariepy: it seems to be set successfully | 16:54 |
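The hold itself is presumably created with zuul-client's autohold command, roughly as below; the reason text and count are illustrative, not what was actually typed.

```shell
# Approximate form of the autohold that was just set (values illustrative):
sudo zuul-client autohold --tenant=openstack \
  --project=opendev.org/openstack/openstack-ansible \
  --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 \
  --ref=refs/changes/17/823417/31 \
  --reason="mgariepy investigating rocky bootstrap issue" \
  --count=1
```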
mgariepy | hopefully it will be ok to get it, the job just failed its 3rd attempt :/ | 16:55 |
fungi | if the build failed after i added the autohold at 16:54 utc then it should, otherwise it'll need a recheck | 16:56 |
mgariepy | completed at 2022-05-27 16:54:34 . | 16:56 |
fungi | zuul returns the nodes yep! just in time | 16:56 |
mgariepy | lol | 16:56 |
mgariepy | it was close ! | 16:57 |
fungi | what's the link to your ssh public key again? | 16:57 |
mgariepy | https://paste.openstack.org/show/bmEEcIcyQre3D8rn76hz/ | 16:57 |
fungi | mgariepy: ssh root@104.239.175.230 | 16:58 |
mgariepy | thanks | 16:58 |
fungi | yw | 16:58 |
fungi | clarkb: zuul-web seems to have come up on zuul01 | 17:00 |
clarkb | fungi: ya it doesn't care about the tenant configs | 17:01 |
fungi | oh, so it came up earlier i guess | 17:01 |
clarkb | it came up after the first restart | 17:01 |
fungi | got it | 17:01 |
clarkb | I think the current restart will fail then the next one will succeed since this current one started just before the fix was put in place on the server | 17:01 |
clarkb | but it does fail late in the process; maybe that means it loads the config late enough to see the fix? I don't think so | 17:02 |
fungi | still tons of time left in the timeout window anyway | 17:02 |
clarkb | ok it's restarting zuul02 now so it did actually load late enough | 17:08 |
clarkb | However I'm seeing a new error which may or may not be a problem for actual functionality | 17:08 |
clarkb | https://paste.opendev.org/show/bHNqoDi2M2f9ExF9s1NH/ | 17:10 |
clarkb | I think this is a zuul model upgrading problem | 17:10 |
clarkb | ya I think this has effectively paused zuul job running :/ | 17:11 |
clarkb | ya I see the issue | 17:12 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L2008 that attribute is added unconditionally | 17:13 |
opendevreview | Merged openstack/diskimage-builder master: Fix grub setup on Gentoo https://review.opendev.org/c/openstack/diskimage-builder/+/842856 | 17:13 |
clarkb | but it is part of the latest zuul model update so we've got old job content without that attribute and new trying to use it | 17:13 |
clarkb | I think new jobs are happy and old old jobs are happy | 17:14 |
clarkb | its just the jobs that were started in the interim period that are broken | 17:14 |
clarkb | considering that I'm somewhat inclined to let things roll for a bit. I don't think we'll get any worse. Then we should be able to evict and reenqueue any jobs that were caught in the middle? | 17:15 |
clarkb | that seems less impactful overall than doing a full restart and rollback to v6 | 17:15 |
fungi | yeah, agreed | 17:17 |
fungi | this is something other continuous deployments of zuul may need to be aware of | 17:18 |
mgariepy | thanks fungi you can remove the hold. | 17:18 |
mgariepy | and kill the instance :) | 17:18 |
clarkb | fungi: yes just left notes in the zuul matrix room | 17:19 |
fungi | mgariepy: done | 17:19 |
clarkb | note that a rollback to v6 may not actually be necessary | 17:19 |
clarkb | as long as we start on the new model api. What may become necessary, depending on whether or not we can dequeue changes, is stopping zuul, deleting zk state, then starting zuul again | 17:19 |
fungi | clarkb: do you still need the autohold labeled "Clarkb debugging jammy on devstack" or shall i clean it up while i'm in there? | 17:19 |
clarkb | fungi: you can clean it up | 17:20 |
fungi | done. thanks! | 17:20 |
clarkb | fungi: another possible option available to us is modifying those jobs in zk directly. But that seems extra dangerous | 17:21 |
fungi | mmm, yeah | 17:22 |
clarkb | the two affected jobs I see regularly in the log are our infra-prod-service-bridge from the hourly jobs and tripleo-ci-centos-9-undercloud-containers from I don't know what yet | 17:23 |
clarkb | fungi: if you are still in there do you want to try dequeing our hourly deploy buildset? | 17:23 |
clarkb | I'm going to try and identify where that tripleo-ci job is coming from so that we can evaluate if that is possible for it too | 17:23 |
clarkb | but I worry we won't be able to dequeue either due to this error | 17:24 |
clarkb | and we may need to stop the cluster and clear zk state to fix it | 17:24 |
clarkb | 843382,3 is the tripleo source of the problem I think | 17:25 |
fungi | should we wait to dequeue things until zuul02 is fully up? | 17:25 |
clarkb | so ya I wonder if we can dequeue that change and then reenqueue it and we'll be moving again | 17:25 |
fungi | or is that blocking the startup? | 17:25 |
clarkb | fungi: I don't think that will affect startup | 17:25 |
clarkb | this is an issue in pipeline processing | 17:26 |
clarkb | which happens after startup | 17:26 |
fungi | so only 843382,3 needs to be dequeued and enqueued again, or are there others? | 17:26 |
clarkb | that's the only one I've identified that needs to be dequeued and enqueued again. our hourly buildset needs to just be dequeued and we'll let the next hour enqueue it | 17:27 |
clarkb | Then if we still have trouble starting jobs we need to consider a full cluster shutdown, zk wipe, startup, reenqueue | 17:27 |
clarkb | (it isn't clear to me if we're starting any new jobs currently fwiw) | 17:28 |
fungi | i did `sudo zuul-client dequeue --tenant=openstack --pipeline=check --project=openstack/puppet-tripleo --change=843382,3` | 17:29 |
fungi | though it doesn't seem to have been processed yet | 17:29 |
fungi | there's a management event pending for the check pipeline according to the status page | 17:30 |
clarkb | and the zuul01 debug log is quite idle right now | 17:31 |
fungi | which i guess is this one? | 17:31 |
clarkb | zuul01 is the up one | 17:31 |
fungi | there it went | 17:32 |
fungi | okay, enqueuing it again now | 17:33 |
clarkb | looks like we're processing jobs too | 17:33 |
clarkb | fungi: can you do the same for our hourly deploy? | 17:33 |
clarkb | the playbook completed and looks like it succeeded | 17:34 |
clarkb | fungi: I think the pause was zuul01 and zuul02 synchronizing on the config as zuul02 came up | 17:35 |
clarkb | I also suspect that if we remove our hourly deploy then the upgrade issue with the deduplicate attribute will be gone in our install | 17:35 |
clarkb | but also that other jobs seem to be running our deployment so if we just leave it that way for corvus to inspect later we're probably good | 17:35 |
clarkb | though we also have logs of the problem and I pasted them above too so that seems overkill | 17:36 |
fungi | i did `sudo zuul-client dequeue --tenant=openstack --pipeline=opendev-prod-hourly --project=opendev/system-config --ref=refs/heads/master` | 17:36 |
fungi | that seems to have cleared it | 17:36 |
clarkb | #status log Upgraded all of Zuul to 6.0.1.dev34 b1311a590. There was a minor hiccup with the new deduplicate attribute on jobs that forced us to dequeue/enqueue two buildsets. Otherwise seems to be running. | 17:37 |
opendevstatus | clarkb: finished logging | 17:37 |
fungi | also 843382,3 is back in check and running new builds | 17:37 |
clarkb | fungi: ya so I think it was just those two jobs that had the mismatch in attributes as they raced the model update | 17:38 |
clarkb | clearing them out and reenqueuing allowed the tripleo buildset to reenqueue under the new model api version and it is happy | 17:38 |
clarkb | zuul itself will want to fix that for other people doing upgrades, but overall the impact was fairly minor once we took care of those | 17:39 |
clarkb | fungi: I think all of the problems the rebooting playbook ran into were external to itself | 17:41 |
clarkb | and those problems should be fixable which is great | 17:42 |
fungi | yep! | 17:43 |
clarkb | I think I've convinced myself that a revert to zuulv6 is not necessary if we continue to have problems. We're more likely to need to do a zk state clear and then starting on the current version is fine | 17:58 |
clarkb | since the problem is consistency of the ephemeral jobs in zk between different versions of zuul. Starting on a single version of zuul with clear zk state should be fine | 17:58 |
fungi | makes sense, yes | 17:59 |
fungi | i mean, that's what all zuul's own functional tests do anyway | 17:59 |
clarkb | I think https://review.opendev.org/c/openstack/tempest/+/843542/ is a good canary. It is about 20 minutes out from merging in the gate if its last build passes. | 18:02 |
clarkb | It would've started before the problem was introduced | 18:02 |
clarkb | I've got a small worry that jobs that started aren't as happy as they appear to be; however, I don't have real evidence of that yet | 18:03 |
clarkb | https://review.opendev.org/c/openstack/openstack-ansible/+/843483/ too | 18:03 |
clarkb | But if they do fail due to this they should get evicted in the gate and all their children will be reenqueued and fine | 18:04 |
clarkb | so again impact should be slight | 18:04 |
clarkb | https://review.opendev.org/c/openstack/openstack-ansible/+/843483/ merged I think my fears are unfounded | 18:20 |
fungi | lgtm, yep | 18:20 |
fungi | any reason to keep the screen session on bridge around now? | 18:21 |
fungi | if not, i'll shut it down | 18:21 |
fungi | wall clock time for that playbook was 1798m11.450s | 18:22 |
clarkb | fungi: the time data probably isn't very useful after the merger pause. Also it probably ended up in the log file | 18:22 |
fungi | right | 18:22 |
clarkb | I think we can stop the screen | 18:22 |
fungi | and done | 18:22 |
clarkb | if we guesstimate how long it took without the merger pause and without the config error it's probably about a day. Just over a day? | 18:26 |
clarkb | That is better than I anticipated | 18:26 |
clarkb | and even before it is fully automated we can run it manually when appropriate | 18:27 |
fungi | clarkb: looking at the example in https://github.com/jitsi/jitsi-meet/issues/10633 it's setting an enable flag inside a prejoinConfig array and the comment says it replaces prejoinPageEnabled, but our settings-config.js uses config.prejoinPageEnabled | 18:32 |
fungi | are we using an outdated config file format? | 18:32 |
clarkb | we copy the config out of their container image file and then edit it iirc | 18:33 |
clarkb | it is possible the content we copy out is out of date | 18:33 |
fungi | okay, and maybe we haven't done that in a while | 18:33 |
clarkb | https://review.opendev.org/c/openstack/tempest/+/843542/ has merged now too along with a whole stack of changes \o/ | 18:34 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js I believ that is the upstream file | 18:35 |
fungi | looks like https://review.opendev.org/781159 added that playbooks/roles/jitsi-meet/files/settings-config.js file over a year ago (march 2021), so was probably copied from the container around that time i would guess | 18:35 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L265 | 18:35 |
clarkb | and https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L11 | 18:36 |
fungi | yeah, that looks like what we have | 18:36 |
clarkb | so I think you just need to modify https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/jvb-env.j2 and https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/meet-env.j2 to set that flag | 18:36 |
fungi | agreed, that's the commit i've drafted, but i started to question it after looking back at the example in the issue | 18:37 |
fungi | i only did meet-env.j2 but i can also add it to jvb-env.j2 if you think it's necessary | 18:38 |
clarkb | It isn't strictly necessary since we don't run the web service on the jvbs | 18:38 |
clarkb | maybe best to leave it out to avoid confusion | 18:39 |
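The change being drafted likely amounts to a one-line addition to meet-env.j2 along these lines; ENABLE_PREJOIN_PAGE is the docker-jitsi-meet env knob that settings-config.js templates into the prejoin setting, though the exact variable name should be checked against the image version actually deployed.

```shell
# Likely shape of the meet-env.j2 addition (verify the variable name against
# the docker-jitsi-meet version in use):
ENABLE_PREJOIN_PAGE=true
```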
fungi | looks like they've updated the settings-config.js to default that to true though, so maybe we've diverged there after all | 18:40 |
fungi | ours defaults to false still | 18:40 |
clarkb | ya maybe we want to resync? compare the delta to make sure we haven't overridden anything in the settings-config.js (we should rely on the .env files for overrides) and then update? | 18:41 |
fungi | so might be better to re-sync their files to our repo, right | 18:41 |
fungi | i'll diff and see what's changed | 18:41 |
clarkb | fungi: I want to say we did a copy because there were some things we couldn't override via their config | 18:43 |
clarkb | another option is to stop supplying the overridden config entirely and rely on upstream's in the image if we have everything we need in the file now | 18:43 |
clarkb | but I'd need to look at file/git history to remember what exactly it was that was missing | 18:43 |
clarkb | useRoomAsSharedDocumentName and openSharedDocumentOnJoin according to c1bb5b52cfb00cb80555348614ee6ff1136c2f52 | 18:44 |
fungi | yep, gonna | 18:49 |
fungi | clarkb: any idea where the playbooks/roles/jitsi-meet/files/interface_config.js came from? | 18:55 |
fungi | i can't seem to find it in the docker-jitsi-meet repo | 18:55 |
clarkb | fungi: I think interface_config.js is the config for the app on the browser side | 18:59 |
fungi | anyway, for the settings-config.js, this is the diff from ours to theirs: https://paste.opendev.org/show/bOI75ISjM1Zr4nKTxVbw/ | 18:59 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L35-L37 upstream serves it by default | 18:59 |
fungi | ahh | 19:00 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/issues/275 I may have even fetched it out of the running container? | 19:02 |
fungi | oh neat | 19:05 |
fungi | i was mainly wondering if we should try to resync it from somewhere too | 19:05 |
clarkb | iirc they use some templating engine on container startup that writes out files like that | 19:05 |
clarkb | but I'm not seeing them in that repo | 19:05 |
clarkb | fungi: https://github.com/jitsi/jitsi-meet/blob/master/interface_config.js | 19:08 |
clarkb | I think the main jitsi source contains that | 19:08 |
fungi | oh okay | 19:08 |
fungi | and yeah, the difference is substantial: https://paste.opendev.org/show/bet8dgBO9tlXNaX0L9JX/ | 19:16 |
fungi | looks like maybe i should tell diff to be a bit smarter though | 19:16 |
fungi | patiencediff didn't do much better: https://paste.opendev.org/show/bedtf6bUc7W4hgNVVIfX/ | 19:24 |
fungi | looking at the git history for that file, it seems we edited it in order to disable the watermark which was overlapping the etherpad controls, took firefox out of the list of recommended browsers, and took out the background blur feature | 19:26 |
fungi | the nice thing is that a recent update has added a comment block indicating that file is deprecated and config options should move to config.js eventually. i'll make a note to see if those things we changed are configurable there now | 19:27 |
*** dviroel is now known as dviroel|afk | 19:50 | |
*** rlandy|PTOish is now known as rlandy | 20:01 | |
clarkb | fungi: we can probably sync the file then add in those extra bits too | 20:13 |
fungi | yeah, that's sort of where i'm headed, though i also want to update the env configs from the example in the upstream repo as it's also got new stuff in it corresponding to the service configs | 20:39 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/843549 so that it is ready for us when we are ready to rerun that playbook next | 20:40 |
fungi | thanks! | 20:41 |
opendevreview | Merged opendev/system-config master: Perform package upgrades prior to zuul cluster node reboots https://review.opendev.org/c/opendev/system-config/+/843549 | 21:01 |
johnsom | Hi infra neighbors. I think there might be something wrong with the log storage. | 21:42 |
johnsom | https://zuul.opendev.org/t/openstack/build/554a978fa1f346ddb89aea349cd4d76b | 21:42 |
johnsom | Is saying it has no logs, but the job just ran: https://review.opendev.org/c/openstack/designate-tempest-plugin/+/837180 | 21:42 |
jrosser_ | i am also seeing the same sort of thing here https://zuul.opendev.org/t/openstack/build/0c4ec03005f94771ad426ace70e869a4 | 21:44 |
johnsom | The interesting thing is the "download all logs" works | 21:44 |
johnsom | Yeah, the "View log" link works also, so it must be a zuul issue | 21:47 |
clarkb | which view log link? | 22:57 |
clarkb | oh there it is | 22:58 |
clarkb | ok so the raw data is there, but the web viewer isn't finding/rendering it | 22:58 |
clarkb | "Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing)" | 22:59 |
clarkb | it is a CORS issue | 22:59 |
clarkb | we should be setting CORS headers when we upload the objects to swift | 22:59 |
clarkb | looking at the response headers there are no CORS headers at all. | 22:59 |
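One quick way to confirm that from the command line (a sketch; some CDNs only emit CORS headers when the request includes an Origin header):

```shell
# Check whether the rackcdn-hosted object returns any CORS headers; the URL
# is the job-output.json from the build linked above.
curl -sI -H "Origin: https://zuul.opendev.org" \
  "https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json" \
  | grep -i '^access-control'
```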
clarkb | logs uploaded to ovh swift are fine. It appears related to rax swift | 23:01 |
clarkb | which makes me think it isn't something that changed on our side, but let me double check zuul-jobs to be sure | 23:01 |
clarkb | I don't see any changes to zuul-jobs' log uploading. It may be an update to whatever swift client we use as well | 23:02 |
clarkb | we use openstack sdk | 23:03 |
clarkb | openstack sdk did make a release on May 20 that we may have picked up with this latest restart | 23:03 |
clarkb | would specifically be the executors | 23:04 |
clarkb | I think this is either rax side or openstacksdk | 23:04 |
clarkb | I'm not seeing any likely changes in openstacksdk unless some very low level system is filtering out the headers we attempt to set (seems unlikely because the ovh containers seem fine? we do have https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L226-L227 which is likely non standard so maybe that gets filtered?) | 23:14 |
clarkb | Considering it's late friday on a holiday weekend and you can click the view raw logs button for now, I may punt on this | 23:15 |
clarkb | Anyway if someone else ends up looking at this my suspicion is either something cloud side (maybe we can mitm ourselves and verify what sdk ends up sending to the cloud?) or a change in openstacksdk that filters out the non standard headers that we need via ^ | 23:17 |
clarkb | I suppose we could test this by using sdk 0.99.0 and 0.61.0 and see if the behavior changes | 23:17 |
clarkb | and hopefully we don't need to deploy a proxy to fix it | 23:20 |
clarkb | that would be annoying | 23:20 |