*** rlandy|biab is now known as rlandy|out | 00:41 | |
Clark[m] | It is on to ze08 now | 01:18 |
fungi | probably still won't be through the remaining executors before i get to sleep | 01:20 |
ianw | BlaisePabon[m]: congratulations :) do you have to work in with existing deployment/IT things, or could you start out with a Zuul-based CI/CD for your control-plane from day 1? | 01:31 |
fungi | already stopping ze10 now, so definitely speeding up | 02:03 |
Clark[m] | Ya but periodic jobs just happened to slow us down again :) | 02:08 |
fungi | bah | 02:13 |
fungi | hopefully the periodics are mostly short-runners and don't involve any paused builds | 02:14 |
BlaisePabon[m] | <ianw> "Blaise Pabon: congratulations :)..." <- I can do whatever I want!! | 02:37 |
BlaisePabon[m] | In fact, the expectation is that I should start from a clean slate... and in fact, I don't have a choice because there is no CI and no CD at present. | 02:40 |
BlaisePabon[m] | (as in, 90's style, take the server offline, ssh as the user called `root` and then proceed to `yum install ...` and `npm build` for 90 mins) | 02:40 |
BlaisePabon[m] | So if ever anyone wanted to set up an exemplary zuul-ci configuration, this would be it. | 02:41 |
BlaisePabon[m] | fwiw, I'm rather comfortable with Docker, git, python and Kubernetes | 02:41 |
BlaisePabon[m] | btw, I figured out how to setup reverse proxies for the servers in my garage. A while back I had offered to make them available to the nodepool, so the offer still stands. | 02:45 |
BlaisePabon[m] | oh, and, in full disclosure, I'm not sure I know what you mean by `Zuul-based CI/CD for your control-plane` but whatever it is, I can do it. | 02:47 |
ianw | i mean that what we do is have Zuul actually deploy gerrit (and all our other services, including zuul itself :) | 02:59 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:02 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:04 |
opendevreview | Kenny Ho proposed zuul/zuul-jobs master: fetch-output-openshift: add option to specify container to fetch from https://review.opendev.org/c/zuul/zuul-jobs/+/843527 | 03:10 |
ianw | i'm just spinning up a test in ovh and noticed a bunch of really old leaked images. i'll clean them up | 04:00 |
ianw | no servers; image uploads. they have timestamps starting with 15 | 04:01 |
*** ysandeep|out is now known as ysandeep | 04:12 | |
ianw | there's still a few in there in state "deleted" that don't seem to go away. don't think there's much we can do client side on that | 04:14 |
*** marios is now known as marios|ruck | 05:01 | |
*** bhagyashris is now known as bhagyashris|ruck | 05:08 | |
ykarel | frickler, can you please check https://review.opendev.org/c/openstack/devstack-gate/+/843148 | 05:55 |
mnasiadka | fungi, clarkb: a lot of jobs are being run only on changes to particular files, ceph jobs are non voting because they are sometimes failing due to their complexity (multinode, ceph deployment, openstack deployment, etc). We moved to use cephadm in Wallaby (and working more to improve failure rate), since Victoria is EM - we could remove those jobs (I was not aware they are failing so much). | 06:23 |
*** ysandeep is now known as ysandeep|afk | 06:31 | |
opendevreview | Merged openstack/project-config master: Add ops to openstack-ansible-sig channel https://review.opendev.org/c/openstack/project-config/+/843492 | 07:03 |
*** ysandeep|afk is now known as ysandeep | 07:22 | |
opendevreview | Ian Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration https://review.opendev.org/c/opendev/glean/+/843243 | 07:25 |
ianw | fungi/clarkb: ^ i've validated that on an OVH centos-9 node. more testing to do, but i think that is ~ what it will end up like | 07:25 |
opendevreview | Ian Wienand proposed opendev/glean master: _network_info: refactor to add ipv4 info at the end https://review.opendev.org/c/opendev/glean/+/843367 | 07:30 |
opendevreview | Ian Wienand proposed opendev/glean master: redhat-ish platforms: write out ipv6 configuration https://review.opendev.org/c/opendev/glean/+/843243 | 07:30 |
*** marios|ruck is now known as marios|ruck|afk | 08:44 | |
*** marios|ruck|afk is now known as marios|ruck | 09:38 | |
*** rlandy|out is now known as rlandy | 10:03 | |
*** soniya29 is now known as soniya29|afk | 10:07 | |
*** ysandeep is now known as ysandeep|afk | 10:30 | |
*** ysandeep|afk is now known as ysandeep | 10:43 | |
yoctozepto | infra-root: (cc frickler) hi! may I request a hold of the node used for the job that runs for change #843583 | 10:58 |
yoctozepto | the job name is kolla-ansible-ubuntu-source-zun-upgrade | 10:58 |
*** ysandeep is now known as ysandeep|afk | 11:06 | |
yoctozepto | my ssh key https://github.com/yoctozepto.keys | 11:07 |
*** soniya29|afk is now known as soniya29 | 11:08 | |
*** rlandy is now known as rlandy|PTOish | 11:09 | |
frickler | yoctozepto: no need to double-hilight me ;) will set it up now | 11:14 |
*** dviroel|out is now known as dviroel | 11:15 | |
frickler | and done | 11:16 |
fungi | as corvus predicted, we have merger graceful stopping problems. i'll leave the playbook in its present hung state for now, but essentially we upgraded all the executors and have been waiting for many hours for zm01 to gracefully stop (which it obviously has no intention of doing) | 11:34 |
fungi | we can probably manually swizzle the processes on the mergers in order to make the playbook think it stopped them so it will proceed through the rest of the list, but i'll sit on that idea until we have more folks on hand to tell me i'm crazy | 11:35 |
yoctozepto | frickler: thanks; the more, the merrier, or so they say :-) | 11:37 |
*** dviroel is now known as dviroel|afk | 13:07 | |
yoctozepto | frickler: somehow I cannot get onto that node, my key is rejected | 13:23 |
fungi | yoctozepto: as root? | 13:30 |
yoctozepto | fungi: root@23.253.20.188: Permission denied (publickey). | 13:30 |
Clark[m] | fungi: I'm not properly at a keyboard for a while yet but I think once the merger no longer shows up in the components list you can manually docker-compose down on that server to kill the container which will cause the playbook to continue. I can do that in about an hour and a half myself once at the proper keyboard | 13:37 |
fungi | Clark[m]: yeah, that's what i was thinking of doing, just didn't want to proceed until more folks are around since it'll churn through the remaining services rather quickly | 13:38 |
fungi | yoctozepto: i've added your ssh key to the held node now | 13:39 |
yoctozepto | thanks fungi, it works (cc frickler) | 13:44 |
corvus | fungi: Clark yes that's what i would suggest | 13:49 |
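For reference, the manual workaround being discussed would look roughly like the following on each stuck merger; the compose directory is an assumption about the merger hosts' layout, not a verified path.

```shell
# Rough sketch of the manual step (assumed paths, not verified): once the
# merger no longer shows in the components list, stop its container so the
# rolling-restart playbook's stop check can proceed.
ssh zm01.opendev.org
cd /etc/zuul-merger          # assumed location of the merger's docker-compose.yaml
sudo docker-compose down     # stops and removes the merger container
```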
frickler | yoctozepto: ah, I was about to add your key now, seems fungi already did that, thx | 14:10 |
yoctozepto | fungi, frickler: thx, I powered off that machine, it can be returned to the pool | 14:31 |
fungi | yoctozepto: thanks! i've released it | 14:35 |
*** ysandeep|afk is now known as ysandeep | 14:46 | |
fungi | Clark[m]: see the #openstack-cinder channel log for some discussion of more meetpad audio strife... digging around i ran across these which have a potential solution for chrom*'s autoplay permission and might also work around the problem in ff? https://github.com/jitsi/jitsi-meet/issues/10633 https://github.com/jitsi/jitsi-meet/issues/9528 | 14:47 |
fungi | specifically, adding a pre-join page so that users click on/enter something in the page is enough of a signal that the browser considers the user has given permission to auto-play for that session | 14:48 |
fungi | our config dumps people straight into the call without them needing to interact before the audio stream starts, which seems to maybe be the problem | 14:49 |
fungi | also ran across another comment buried in an issue suggesting to switch media.webrtc.hw.h264.enabled to true in about:config on ff | 14:53 |
fungi | (it's still false by default even in my ff 100) | 14:54 |
Clark[m] | Enabling the join page looks like a simple config update at least. Asking users to edit about:config is probably best avoided | 14:54 |
fungi | yeah, that was separate, for improving streaming performance on ff | 14:56 |
fungi | though it looks like it's probably a bad idea to switch on unless you've got at least ff 96 when they merged an updated libwebrtc | 14:56 |
fungi | but yes, i'm in favor of trying to add a pre-join page and seeing if that helps. i'll propose a change | 14:57 |
fungi | also more generally, it looks like the lack of simulcast support between jitsi-meet and firefox is likely to still create additional load on all participants the more firefox users join the call with video on | 14:58 |
fungi | since for firefox it ends up falling back on peer-to-peer streams | 14:59 |
fungi | or at least that's how i read the discussions | 14:59 |
Clark[m] | I think we explicitly disable peer to peer | 15:00 |
fungi | ahh, okay | 15:00 |
fungi | then maybe not for our case | 15:00 |
Clark[m] | The problem aiui is webrtc is expensive for video and just adding video bogs things down. Zoom web client which isn't webrtc does the same thing | 15:00 |
Clark[m] | Add in devices that thermal throttle (MacBooks) and problems abound :( | 15:01 |
fungi | for sure | 15:01 |
fungi | and yes, even on my workstation i end up setting zoom's in-browser client to disable incoming video | 15:02 |
fungi | okay, since it's getting to the point in the day where more people are going to be around, i'll start manually downing the docker containers for each merger as they disappear from the components page, one by one | 15:03 |
fungi | starting with zm01 now | 15:03 |
fungi | as soon as i did that, the system kicked me out for a reboot and the playbook progressed | 15:04 |
fungi | looks like it came back and zm02 is down now so doing the same for it | 15:04 |
fungi | i'll wait when it gets to zm08, so everybody's got warning when the scheduler/web containers are going down | 15:05 |
Clark[m] | ++ I should be home soon | 15:06 |
fungi | no need to rush | 15:07 |
fungi | all done except for zm08, and i've got the docker-compose down queued for that so ready to proceed when others are | 15:14 |
clarkb | I'm here now just without ssh keys loaded yet | 15:17 |
clarkb | and now that is done. We can probably proceed unless you wanted corvus to ack too | 15:18 |
ykarel | Hi is there some known issue with unbound on c9-stream fips jobs | 15:22 |
clarkb | ykarel: there is a race where ansible continues running job stuff after the fips reboot but before unbound is up and running | 15:23 |
fungi | ykarel: the fips setup reboots the machine, which seems to result in unbound coming undone. i think there was some work in progress to make the unbound startup wait for networking to be marked ready by systemd first | 15:23 |
clarkb | yup I think the idea was to encode all of that into a post reboot role in zuul-jobs. Then whether or not you are doing fips you can run that in your jobs to ensure the test node is ready before continuing | 15:23 |
fungi | and yeah, it's basically that the job proceeds after the reboot when resolution dns isn't working yet | 15:23 |
ykarel | clarkb, fungi okk so it's something known | 15:24 |
ykarel | Thanks | 15:24 |
ykarel | maybe after reboot it can wait for some time until unbound is up | 15:24 |
fungi | clarkb: i thought the idea was to change the service unit dependencies in the centos images to make sure sshd isn't running until unbound is fully started | 15:24 |
clarkb | fungi: no I suggested against that because then you need our images to run the tests successfully | 15:25 |
clarkb | I suggested that the test jobs themselves become smart enough to handle the distro behavior | 15:25 |
fungi | well, you need our images to run the tests successfully if you're using unbound for a local dns cache (which is a decision we made in our images) | 15:26 |
clarkb | you have to do a couple things post reboot like starting the console logger anyway so encoding all of that into an easy to use role makes sense | 15:26 |
clarkb | fungi: it's a decision we made in our images but we just install the normal distro package for it | 15:26 |
clarkb | it's not like this is a bug in our use of unbound. Distro systemd is allowing ssh connections before dns resolvers are up | 15:27 |
clarkb | and systemd is sort of designed to do that | 15:27 |
clarkb | (speed up boots even if you end up on a machine that can't do much for a few extra seconds and all that) | 15:27 |
fungi | but yes, i can see the logic in forcing the job to wait until the console stream is running again, so checking dns resolves successfully somehow is reasonable to do at the same time | 15:27 |
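A minimal sketch of what such a post-reboot wait could look like as Ansible tasks; the task names and the host used for the DNS probe are illustrative assumptions, not the actual zuul-jobs role.

```yaml
# Illustrative sketch only, not the real zuul-jobs role.
- name: Wait for the node to come back after the reboot
  wait_for_connection:
    timeout: 300

- name: Wait until DNS resolution works (unbound may start after sshd)
  command: python3 -c "import socket; socket.getaddrinfo('opendev.org', 443)"
  register: dns_probe
  until: dns_probe.rc == 0
  retries: 30
  delay: 5
  changed_when: false
```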
fungi | okay, downing the container on zm08 now at 15:30z | 15:30 |
fungi | and it's rebooting | 15:30 |
fungi | and the containers on zuul01 have stopped and it's rebooting now | 15:31 |
johnsom | ade_lee Has a patch been proposed for the zuul task to wait for DNS? | 15:31 |
fungi | containers are starting on zuul01 | 15:32 |
fungi | clarkb: is this benign? "[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: {{ components.status == 200 and components.content | from_json | json_query(scheduler_query) | length == 1 and components.content | from_json | json_query(scheduler_query) | first == 'running' }}" | 15:32 |
clarkb | fungi: yes I made note of that when I was testing this | 15:34 |
clarkb | it was the only way I could get the ? in the query var to not explode as a parse error | 15:34 |
fungi | thanks, looked familiar | 15:35 |
clarkb | fungi: I suspect this is a corner case of ansible jinja parsing where ansible really wants you to use the less syntax heavy version because it makes ansible look better but it isn't as expressive and can have issues as I found | 15:35 |
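For context, the warning above comes from an until-style check against the Zuul components API; the following is a hedged reconstruction of that kind of task. The URL, variable names, and JMESPath query are assumptions rather than the playbook's actual content, and json_query requires the community.general collection.

```yaml
# Hedged reconstruction of the sort of check that produces the warning.
- name: Wait for the scheduler component to report running
  uri:
    url: https://zuul.opendev.org/api/components
    return_content: true
  register: components
  vars:
    scheduler_query: "scheduler[?hostname=='zuul01.opendev.org'].state"
  # The '?' in the JMESPath query is what forces the fully-templated form:
  until: >-
    {{ components.status == 200 and
       components.content | from_json | json_query(scheduler_query) | length == 1 and
       components.content | from_json | json_query(scheduler_query) | first == 'running' }}
  retries: 60
  delay: 30
```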
clarkb | yoctozepto: note my response on https://review.opendev.org/c/openstack/kolla-ansible/+/843536 | 15:36 |
clarkb | fungi: corvus: one thing I wonder is if having web and scheduler fight over initializing in the db may cause the whole thing to be slower? I guess they might be slower individually but since we run them concurrently wall time should be less? | 15:38 |
clarkb | fungi: unrelated did you see https://storyboard.openstack.org/#!/story/2010054 I'm having a hard time understanding that one since all of our repos have / in their names too. I wonder if the actual repo dir has a / in it. We do openstack dir containing nova repo dir. Maybe they are doing something like openstack/nova is the repo dir (how you would convince a filesystem of that I | 15:40 |
clarkb | don't know) | 15:40 |
clarkb | oh you know what? I wonder if they need to edit their gerrit config | 15:40 |
clarkb | there is a way to tell it to encode nested slashes iirc | 15:41 |
fungi | i think we do that, yeah | 15:42 |
yoctozepto | clarkb: I'm on mobile atm but your comment looks reasonable, the other case is something we were not aware of, we will amend our ways then, thanks | 15:42 |
fungi | clarkb: we set it in the apache config actually | 15:43 |
clarkb | fungi: aha | 15:43 |
fungi | i'll link the example | 15:44 |
clarkb | yoctozepto: ya it's always retriable if it happens in pre-run regardless of the reason. But then in any phase it is retriable if ansible reports a network error (and for reasons filling the remote disk results in network errors) | 15:44 |
clarkb | corvus: 2022-05-27 15:55:47,506 ERROR zuul.Scheduler: voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version'] <- I think zuul01 is unhappy with the zuul tenant config | 15:56 |
clarkb | zuul01 still shows as initializing, but I think it is up? | 15:57 |
clarkb | could it be related to that error? | 15:57 |
*** marios|ruck is now known as marios|out | 15:57 | |
opendevreview | Clark Boylan proposed openstack/project-config master: The default ansible version in zuul config is a str not int https://review.opendev.org/c/openstack/project-config/+/843650 | 15:59 |
clarkb | I think ^ that will fix things based on the error message. However, I'm not sure if initializing as the current state is ideal for zuul to report if it is running otherwise. Maybe "degraded" ? | 15:59 |
clarkb | anyway I suspect that if we land 843650 zuul will switch over to running and the playbook will proceed but that is just a hunch | 16:01 |
clarkb | and we've got about 2.5 hours to do it before the playbook exits in error | 16:02 |
*** dviroel|afk is now known as dviroel | 16:08 | |
corvus | clarkb: approved 650 | 16:09 |
clarkb | looking at zuul01's scheduler log more closely I think degraded is not really accurate either | 16:09 |
clarkb | the process is up and running but it isn't processing pipelines | 16:09 |
clarkb | maybe an ERROR state would be best then? | 16:09 |
clarkb | it is just logging side effects caused by zuul02's operation if I am reading this correctly | 16:10 |
corvus | clarkb: i think it's restarted 2x | 16:13 |
clarkb | corvus: hrm is that something docker would've helpfully done for us? | 16:14 |
clarkb | it exited with error maybe so docker started it? | 16:14 |
corvus | 2022-05-27 15:32:06,508 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev34 | 16:14 |
corvus | 2022-05-27 15:55:47,506 ERROR zuul.Scheduler: voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['default-ansible-version'] | 16:14 |
corvus | 2022-05-27 15:55:51,679 DEBUG zuul.Scheduler: Configured logging: 6.0.1.dev34 | 16:14 |
corvus | maybe? | 16:15 |
corvus | and yeah, it's getting data from zk now | 16:15 |
corvus | so i think we're in a 30m long startup loop | 16:15 |
corvus | which is great, actually; it means that 30m after 843650 lands it should succeed maybe hopefully? | 16:15 |
fungi | neat | 16:15 |
clarkb | corvus: that would be my expectation. If that comes in under the 3 hour total timeout wait period then zuul02 should get managed by the automated playbook too | 16:16 |
corvus | oh -- but only if ansible puts that file in place | 16:16 |
clarkb | corvus: the regular deploy job should do that | 16:16 |
corvus | cool | 16:16 |
fungi | and we haven't blocked deployments so that should happen | 16:16 |
corvus | wasn't sure how much was disabled (but i'm glad that isn't -- i don't think we need to) | 16:16 |
fungi | and i think we've got plenty of time before the timeout is reached, yeah | 16:17 |
clarkb | corvus: nothing is currently disabled | 16:17 |
corvus | (it should be fine to do a tenant reconfig during a rolling restart) | 16:17 |
corvus | (it would slow stuff down but shouldn't break) | 16:17 |
clarkb | if 02 doesn't get automatically handled I can take care of it after the fix lands. Then we can retry the automated playbook after merger stop is fixed and with https://review.opendev.org/c/opendev/system-config/+/843549 if we think that is a good idea | 16:18 |
corvus | clarkb: i +2d 549 ... will leave to you to +w | 16:19 |
clarkb | corvus: thanks. | 16:19 |
corvus | zuul01 just restarted again | 16:20 |
clarkb | if the timing estimates on the dashboard are accurate then the next restart should be happy | 16:20 |
corvus | so assuming 650 lands soon, probably a successful restart around 16:45 | 16:20 |
clarkb | (I expect the fix will land in a couple of minutes and then the hourly zuul deploy should run shortly after that) | 16:20 |
clarkb | then after hourly is done the deploy for 650 will run and noop | 16:21 |
opendevreview | Merged openstack/project-config master: The default ansible version in zuul config is a str not int https://review.opendev.org/c/openstack/project-config/+/843650 | 16:22 |
fungi | yeah, i'm happy approving 843549 any time, since we're manually running this anyway for now and the current run won't pick that up even if it merges in the middle since the playbook has already been read | 16:24 |
fungi | and i don't expect to run it again until we at least think we have clean merger stops | 16:24 |
clarkb | yup. We may need to restart the mergers after they are fixed to pick up the fix, but then we can run the automated playbook again and it should roll through without being paused | 16:25 |
clarkb | corvus: is the issue a thread that isn't exiting or isn't marked daemon? | 16:25 |
corvus | clarkb: unsure -- i'm planning on taking a look at that tomorrow. i'm sure it'll be something simple like that. | 16:26 |
clarkb | ok, no rush. I don't expect I'll be running this playbook over the weekend :) | 16:26 |
clarkb | good news is if we have another config error like this happen when we are all sleeping the playbook should timeout and error without proceeding to the second scheduler | 16:27 |
ade_lee | johnsom, yeah - not yet -- I've been trying to find time to create it | 16:31 |
ade_lee | johnsom, hopefully by early next week | 16:32 |
johnsom | ade_lee Ack, thanks for the update | 16:32 |
ade_lee | fungi, clarkb - do you guys know anything about this error here? https://zuul.opendev.org/t/openstack/build/041ccac8861442a192beaabb7c9ca500 | 16:32 |
ade_lee | fungi, clarkb something about oslo.log not being set correctly from upper constraints in train? | 16:33 |
clarkb | ade_lee: there are two problems that cause that. The first is trying to install a version of a library that sets a required python version that isn't compatible with the current python. The other is if pypi's CDN falls back to their backup backend and serves you a stale index without the new package present | 16:34 |
clarkb | ade_lee: in this case oslo.log==5.0.0 requires python>=3.8 and you appear to be using 3.6 | 16:35 |
clarkb | which means it is the first issue | 16:35 |
ade_lee | ah | 16:35 |
fungi | all installdeps: -chttps://releases.openstack.org/constraints/upper/master, -r/opt/stack/new/tempest/requirements.txt | 16:35 |
clarkb | fungi: corvus: the hourly zuul deploy did not update to the fixed version as I thought it might. We have to wait for the normal deployment to happen which should happen soon enough | 16:35 |
fungi | ade_lee: coming from /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L890 | 16:36 |
fungi | sudo -H -u tempest UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master TOX_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/master tox -eall -- barbican --concurrency=4 | 16:36 |
clarkb | newer pip will actually tell you why it failed in this case rather than giving you a convoluted message | 16:36 |
fungi | so yes, i think this is the first case clarkb mentioned | 16:37 |
fungi | tempest's virtualenv is built with the python3 from ubuntu-bionic and is trying to install the master branch constraints | 16:37 |
ade_lee | clarkb, fungi thanks - so I need to switch the barbican gate to py38 | 16:37 |
clarkb | and stop using devstack-gate | 16:37 |
clarkb | oh wait you said train though so the actual fix may be more stable branch specific | 16:38 |
clarkb | like using an older tempest or something | 16:38 |
fungi | ade_lee: you might check with #openstack-qa, i think they noted some breakage to old stable branches from tempest et al dropping support for old python | 16:38 |
ade_lee | fungi, ack - will do | 16:38 |
fungi | basically you're supposed to install a tagged version of tempest i think, at least that's how it's been handled in the past | 16:38 |
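To illustrate the first failure mode clarkb described (a bionic node with python 3.6 pulling master constraints), here is a sketch; the exact pip output and the tempest pinning mechanism will differ per branch.

```shell
# Sketch of the mismatch: master constraints pin oslo.log to a release that
# declares Requires-Python >=3.8, which python 3.6 cannot satisfy.
python3.6 -m pip install 'oslo.log==5.0.0'   # fails on a bionic node

# A stable/train job would instead point tox at the train constraints, e.g.:
#   UPPER_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/train
#   TOX_CONSTRAINTS_FILE=https://releases.openstack.org/constraints/upper/train
# (along with a tempest tag known to still support that branch)
```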
clarkb | we just missed this restart so we have to wait for the next one | 16:43 |
clarkb | project-config is now updated. The next restart should work I hope | 16:46 |
clarkb | so about 35 minutes away? | 16:46 |
mgariepy | hello, can i have a hold on : --project=opendev.org/openstack/openstack-ansible --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 --ref=refs/changes/17/823417/31 | 16:51 |
mgariepy | to investigate a bootstrap issue on rocky ? | 16:51 |
fungi | mgariepy: sure, i'm curious to see this work while the schedulers are in the middle of a rolling restart. it could be an interesting test | 16:52 |
mgariepy | hehe :D | 16:54 |
fungi | mgariepy: it seems to be set successfully | 16:54 |
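The hold itself is presumably created with zuul-client's autohold command, roughly as below; the reason text and count are illustrative, not what was actually typed.

```shell
# Approximate form of the autohold that was just set (values illustrative):
sudo zuul-client autohold --tenant=openstack \
  --project=opendev.org/openstack/openstack-ansible \
  --job=openstack-ansible-deploy-aio_lxc-rockylinux-8 \
  --ref=refs/changes/17/823417/31 \
  --reason="mgariepy investigating rocky bootstrap issue" \
  --count=1
```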
mgariepy | hopefully it will be ok to get it, the job just failed its 3rd attempt :/ | 16:55 |
fungi | if the build failed after i added the autohold at 16:54 utc then it should, otherwise it'll need a recheck | 16:56 |
mgariepy | completed at 2022-05-27 16:54:34 . | 16:56 |
fungi | zuul returns the nodes yep! just in time | 16:56 |
mgariepy | lol | 16:56 |
mgariepy | it was close ! | 16:57 |
fungi | what's the link to your ssh public key again? | 16:57 |
mgariepy | https://paste.openstack.org/show/bmEEcIcyQre3D8rn76hz/ | 16:57 |
fungi | mgariepy: ssh root@104.239.175.230 | 16:58 |
mgariepy | thanks | 16:58 |
fungi | yw | 16:58 |
fungi | clarkb: zuul-web seems to have come up on zuul01 | 17:00 |
clarkb | fungi: ya it doesn't care about the tenant configs | 17:01 |
fungi | oh, so it came up earlier i guess | 17:01 |
clarkb | it came up after the first restart | 17:01 |
fungi | got it | 17:01 |
clarkb | I think the current restart will fail then the next one will succeed since this current one started just before the fix was put in place on the server | 17:01 |
clarkb | but it does fail late in the process; maybe that means it loads the config late enough to see the fix? I don't think so | 17:02 |
fungi | still tons of time left in the timeout window anyway | 17:02 |
clarkb | ok it's restarting zuul02 now so it did actually load late enough | 17:08 |
clarkb | However I'm seeing a new error which may or may not be a problem for actual functionality | 17:08 |
clarkb | https://paste.opendev.org/show/bHNqoDi2M2f9ExF9s1NH/ | 17:10 |
clarkb | I think this is a zuul model upgrading problem | 17:10 |
clarkb | ya I think this has effectively paused zuul job running :/ | 17:11 |
clarkb | ya I see the issue | 17:12 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L2008 that attribute is added unconditionally | 17:13 |
opendevreview | Merged openstack/diskimage-builder master: Fix grub setup on Gentoo https://review.opendev.org/c/openstack/diskimage-builder/+/842856 | 17:13 |
clarkb | but it is part of the latest zuul model update so we've got old job content without that attribute and new trying to use it | 17:13 |
clarkb | I think new jobs are happy and old old jobs are happy | 17:14 |
clarkb | its just the jobs that were started in the interim period that are broken | 17:14 |
clarkb | considering that I'm somewhat inclined to let things roll for a bit. I don't think we'll get any worse. Then we should be able to evict and reenqueue any jobs that were caught in the middle? | 17:15 |
clarkb | that seems less impactful overall than doing a full restart and rollback to v6 | 17:15 |
fungi | yeah, agreed | 17:17 |
fungi | this is something other continuous deployments of zuul may need to be aware of | 17:18 |
mgariepy | thanks fungi you can remove the hold. | 17:18 |
mgariepy | and kill the instance :) | 17:18 |
clarkb | fungi: yes just left notes in the zuul matrix room | 17:19 |
fungi | mgariepy: done | 17:19 |
clarkb | note that a rollback to v6 may not actually be necessary | 17:19 |
clarkb | as long as we start on the new model api. What may become necessary, depending on whether or not we can dequeue changes, is stopping zuul, deleting zk state, then starting zuul again | 17:19 |
fungi | clarkb: do you still need the autohold labeled "Clarkb debugging jammy on devstack" or shall i clean it up while i'm in there? | 17:19 |
clarkb | fungi: you can clean it up | 17:20 |
fungi | done. thanks! | 17:20 |
clarkb | fungi: another possible option available to us is modifying those jobs in zk directly. But that seems extra dangerous | 17:21 |
fungi | mmm, yeah | 17:22 |
clarkb | the two affected jobs I see regularly in the log are our infra-prod-service-bridge from the hourly jobs and tripleo-ci-centos-9-undercloud-containers from I don't know what yet | 17:23 |
clarkb | fungi: if you are still in there do you want to try dequeing our hourly deploy buildset? | 17:23 |
clarkb | I'm going to try and identify where that tripleo-ci job is coming from so that we can evaluate if that is possible for it too | 17:23 |
clarkb | but I worry we won't be able to dequeue either due to this error | 17:24 |
clarkb | and we may need to stop the cluster and clear zk state to fix it | 17:24 |
clarkb | 843382,3 is the tripleo source of the problem I think | 17:25 |
fungi | should we wait to dequeue things until zuul02 is fully up? | 17:25 |
clarkb | so ya I wonder if we can dequeue that change and then reenqueue it and we'll be moving again | 17:25 |
fungi | or is that blocking the startup? | 17:25 |
clarkb | fungi: I don't think that will affect startup | 17:25 |
clarkb | this is an issue in pipeline processing | 17:26 |
clarkb | which happens after startup | 17:26 |
fungi | so only 843382,3 needs to be dequeued and enqueued again, or are there others? | 17:26 |
clarkb | that's the only one I've identified that needs to be dequeued and enqueued again. our hourly buildset needs to just be dequeued and we'll let the next hour enqueue it | 17:27 |
clarkb | Then if we still have trouble starting jobs we need to consider a full cluster shutdown, zk wipe, startup, reenqueue | 17:27 |
clarkb | (it isn't clear to me if we're starting any new jobs currently fwiw) | 17:28 |
fungi | i did `sudo zuul-client dequeue --tenant=openstack --pipeline=check --project=openstack/puppet-tripleo --change=843382,3` | 17:29 |
fungi | though it doesn't seem to have been processed yet | 17:29 |
fungi | there's a management event pending for the check pipeline according to the status page | 17:30 |
clarkb | and the zuul01 debug log is quite idle right now | 17:31 |
fungi | which i guess is this one? | 17:31 |
clarkb | zuul01 is the up one | 17:31 |
fungi | there it went | 17:32 |
fungi | okay, enqueuing it again now | 17:33 |
clarkb | looks like we're processing jobs too | 17:33 |
clarkb | fungi: can you do the same for our hourly deploy? | 17:33 |
clarkb | the playbook completed and looks like it succeeded | 17:34 |
clarkb | fungi: I think the pause was zuul01 and zuul02 synchronizing on the config as zuul02 came up | 17:35 |
clarkb | I also suspect that if we remove our hourly deploy then the upgrade issue with the deduplicate attribute will be gone in our install | 17:35 |
clarkb | but also that other jobs seem to be running our deployment so if we just leave it that way for corvus to inspect later we're probably good | 17:35 |
clarkb | though we also have logs of the problem and I pasted them above too so that seems overkill | 17:36 |
fungi | i did `sudo zuul-client dequeue --tenant=openstack --pipeline=opendev-prod-hourly --project=opendev/system-config --ref=refs/heads/master` | 17:36 |
fungi | that seems to have cleared it | 17:36 |
clarkb | #status log Upgraded all of Zuul to 6.0.1.dev34 b1311a590. There was a minor hiccup with the new deduplicate attribute on jobs that forced us to dequeue/enqueue two buildsets. Otherwise seems to be running. | 17:37 |
opendevstatus | clarkb: finished logging | 17:37 |
fungi | also 843382,3 is back in check and running new builds | 17:37 |
clarkb | fungi: ya so I think it was just those two jobs that had the mismatch in attributes as they raced the model update | 17:38 |
clarkb | clearing them out and reenqueuing allowed the tripleo buildset to reenqueue under the new model api version and it is happy | 17:38 |
clarkb | zuul itself will want to fix that for other people doing upgrades, but overall the impact was fairly minor once we took care of those | 17:39 |
clarkb | fungi: I think all of the problems the rebooting playbook ran into were external to itself | 17:41 |
clarkb | and those problems should be fixable which is great | 17:42 |
fungi | yep! | 17:43 |
clarkb | I think I've convinced myself that a revert to zuulv6 is not necessary if we continue to have problems. We're more likely to need to do a zk state clear and then starting on the current version is fine | 17:58 |
clarkb | since the problem is consistency of the ephemeral jobs in zk between different versions of zuul. Starting on a single version of zuul with clear zk state should be fine | 17:58 |
fungi | makes sense, yes | 17:59 |
fungi | i mean, that's what all zuul's own functional tests do anyway | 17:59 |
clarkb | I think https://review.opendev.org/c/openstack/tempest/+/843542/ is a good canary. It is about 20 minutes out from merging in the gate if its last build passes. | 18:02 |
clarkb | It would've started before the problem was introduced | 18:02 |
clarkb | I've got a small worry that jobs that started aren't as happy as they appear to be; however, I don't have real evidence of that yet | 18:03 |
clarkb | https://review.opendev.org/c/openstack/openstack-ansible/+/843483/ too | 18:03 |
clarkb | But if they do fail due to this they should get evicted in the gate and all their children will be reenqueued and fine | 18:04 |
clarkb | so again impact should be slight | 18:04 |
clarkb | https://review.opendev.org/c/openstack/openstack-ansible/+/843483/ merged I think my fears are unfounded | 18:20 |
fungi | lgtm, yep | 18:20 |
fungi | any reason to keep the screen session on bridge around now? | 18:21 |
fungi | if not, i'll shut it down | 18:21 |
fungi | wall clock time for that playbook was 1798m11.450s | 18:22 |
clarkb | fungi: the time data probably isn't very useful after the merger pause. Also it probably ended up in the log file | 18:22 |
fungi | right | 18:22 |
clarkb | I think we can stop the screen | 18:22 |
fungi | and done | 18:22 |
clarkb | if we guesstimate how long it took without the merger pause and without the config error it's probably about a day. Just over a day? | 18:26 |
clarkb | That is better than I anticipated | 18:26 |
clarkb | and even before it is fully automated we can run it manually when appropriate | 18:27 |
fungi | clarkb: looking at the example in https://github.com/jitsi/jitsi-meet/issues/10633 it's setting an enable flag inside a prejoinConfig array and the comment says it replaces prejoinPageEnabled, but our settings-config.js uses config.prejoinPageEnabled | 18:32 |
fungi | are we using an outdated config file format? | 18:32 |
clarkb | we copy the config out of their container image file and then edit it iirc | 18:33 |
clarkb | it is possible the content we copy out is out of date | 18:33 |
fungi | okay, and maybe we haven't done that in a while | 18:33 |
clarkb | https://review.opendev.org/c/openstack/tempest/+/843542/ has merged now too along with a whole stack of changes \o/ | 18:34 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js I believ that is the upstream file | 18:35 |
fungi | looks like https://review.opendev.org/781159 added that playbooks/roles/jitsi-meet/files/settings-config.js file over a year ago (march 2021), so was probably copied from the container around that time i would guess | 18:35 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L265 | 18:35 |
clarkb | and https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/settings-config.js#L11 | 18:36 |
fungi | yeah, that looks like what we have | 18:36 |
clarkb | so I think you just need to modify https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/jvb-env.j2 and https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/jitsi-meet/templates/meet-env.j2 to set that flag | 18:36 |
fungi | agreed, that's the commit i've drafted, but i started to question it after looking back at the example in the issue | 18:37 |
fungi | i only did meet-env.j2 but i can also add it to jvb-env.j2 if you think it's necessary | 18:38 |
clarkb | It isn't strictly necessary since we don't run the web service on the jvbs | 18:38 |
clarkb | maybe best to leave it out to avoid confusion | 18:39 |
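The change being drafted likely amounts to a one-line addition to meet-env.j2 along these lines; ENABLE_PREJOIN_PAGE is the docker-jitsi-meet env knob that settings-config.js templates into the prejoin setting, though the exact variable name should be checked against the image version actually deployed.

```shell
# Likely shape of the meet-env.j2 addition (verify the variable name against
# the docker-jitsi-meet version in use):
ENABLE_PREJOIN_PAGE=true
```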
fungi | looks like they've updated the settings-config.js to default that to true though, so maybe we've diverged there after all | 18:40 |
fungi | ours defaults to false still | 18:40 |
clarkb | ya maybe we want to resync? compare the delta to make sure we haven't overridden anything in the settings-config.js (we should rely on the .env files for overrides) and then update? | 18:41 |
fungi | so might be better to re-sync their files to our repo, right | 18:41 |
fungi | i'll diff and see what's changed | 18:41 |
clarkb | fungi: I want to say we did a copy because there were some things we couldn't override via their config | 18:43 |
clarkb | another option is to stop supplying the overridden config entirely and rely on upstream's in the image if we have everything we need in the file now | 18:43 |
clarkb | but I'd need to look at file/git history to remember what exactly it was that was missing | 18:43 |
clarkb | useRoomAsSharedDocumentName and openSharedDocumentOnJoin according to c1bb5b52cfb00cb80555348614ee6ff1136c2f52 | 18:44 |
fungi | yep, gonna | 18:49 |
fungi | clarkb: any idea where the playbooks/roles/jitsi-meet/files/interface_config.js came from? | 18:55 |
fungi | i can't seem to find it in the docker-jitsi-meet repo | 18:55 |
clarkb | fungi: I think interface_config.js is the config for the app on the browser side | 18:59 |
fungi | anyway, for the settings-config.js, this is the diff from ours to theirs: https://paste.opendev.org/show/bOI75ISjM1Zr4nKTxVbw/ | 18:59 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L35-L37 upstream serves it by default | 18:59 |
fungi | ahh | 19:00 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/issues/275 I may have even fetched it out of the running container? | 19:02 |
fungi | oh neat | 19:05 |
fungi | i was mainly wondering if we should try to resync it from somewhere too | 19:05 |
clarkb | iirc they use some templating engine on container startup that writes out files like that | 19:05 |
clarkb | but I'm not seeing them in that repo | 19:05 |
clarkb | fungi: https://github.com/jitsi/jitsi-meet/blob/master/interface_config.js | 19:08 |
clarkb | I think the main jitsi source contains that | 19:08 |
fungi | oh okay | 19:08 |
fungi | and yeah, the difference is substantial: https://paste.opendev.org/show/bet8dgBO9tlXNaX0L9JX/ | 19:16 |
fungi | looks like maybe i should tell diff to be a bit smarter though | 19:16 |
fungi | patiencediff didn't do much better: https://paste.opendev.org/show/bedtf6bUc7W4hgNVVIfX/ | 19:24 |
fungi | looking at the git history for that file, it seems we edited it in order to disable the watermark which was overlapping the etherpad controls, took firefox out of the list of recommended browsers, and took out the background blur feature | 19:26 |
fungi | the nice thing is that a recent update has added a comment block indicating that file is deprecated and config options should move to config.js eventually. i'll make a note to see if those things we changed are configurable there now | 19:27 |
*** dviroel is now known as dviroel|afk | 19:50 | |
*** rlandy|PTOish is now known as rlandy | 20:01 | |
clarkb | fungi: we can probably sync the file then add in those extra bits too | 20:13 |
fungi | yeah, that's sort of where i'm headed, though i also want to update the env configs from the example in the upstream repo as it's also got new stuff in it corresponding to the service configs | 20:39 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/843549 so that it is ready for us when we are ready to rerun that playbook next | 20:40 |
fungi | thanks! | 20:41 |
opendevreview | Merged opendev/system-config master: Perform package upgrades prior to zuul cluster node reboots https://review.opendev.org/c/opendev/system-config/+/843549 | 21:01 |
johnsom | Hi infra neighbors. I think there might be something wrong with the log storage. | 21:42 |
johnsom | https://zuul.opendev.org/t/openstack/build/554a978fa1f346ddb89aea349cd4d76b | 21:42 |
johnsom | Is saying it has no logs, but the job just ran: https://review.opendev.org/c/openstack/designate-tempest-plugin/+/837180 | 21:42 |
jrosser_ | i am also seeing the same sort of thing here https://zuul.opendev.org/t/openstack/build/0c4ec03005f94771ad426ace70e869a4 | 21:44 |
johnsom | The interesting thing is the "download all logs" works | 21:44 |
johnsom | Yeah, the "View log" link works also, so it must be a zuul issue | 21:47 |
clarkb | which view log link? | 22:57 |
clarkb | oh there it is | 22:58 |
clarkb | ok so the raw data is there, but the web viewer isn't finding/rendering it | 22:58 |
clarkb | "Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing)" | 22:59 |
clarkb | it is a CORS issue | 22:59 |
clarkb | we should be setting CORS headers when we upload the objects to swift | 22:59 |
clarkb | looking at the response headers there are no CORS headers at all. | 22:59 |
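One quick way to confirm that from the command line (a sketch; some CDNs only emit CORS headers when the request includes an Origin header):

```shell
# Check whether the rackcdn-hosted object returns any CORS headers; the URL
# is the job-output.json from the build linked above.
curl -sI -H "Origin: https://zuul.opendev.org" \
  "https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/job-output.json" \
  | grep -i '^access-control'
```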
clarkb | logs uploaded to ovh swift are fine. It appears related to rax swift | 23:01 |
clarkb | which makes me think it isn't something that changed on our side, but let me double check zuul-jobs to be sure | 23:01 |
clarkb | I don't see any changes to zuul-jobs' log uploading. It may be an update to whatever swift client we use as well | 23:02 |
clarkb | we use openstack sdk | 23:03 |
clarkb | openstack sdk did make a release on May 20 that we may have picked up with this latest restart | 23:03 |
clarkb | would specifically be the executors | 23:04 |
clarkb | I think this is either rax side or openstacksdk | 23:04 |
clarkb | I'm not seeing any likely changes in openstacksdk unless some very low level system is filtering out the headers we attempt to set (seems unlikely because the ovh containers seem fine? we do have https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L226-L227 which is likely non standard so maybe that gets filtered?) | 23:14 |
clarkb | Considering it's late friday on a holiday weekend and you can click the view raw logs button for now, I may punt on this | 23:15 |
clarkb | Anyway if someone else ends up looking at this my suspicion is either something cloud side (maybe we can mitm ourselves and verify what sdk ends up sending to the cloud?) or a change in openstacksdk that filters out the non standard headers that we need via ^ | 23:17 |
clarkb | I suppose we could test this by using sdk 0.99.0 and 0.61.0 and see if the behavior changes | 23:17 |
clarkb | and hopefully we don't need to deploy a proxy to fix it | 23:20 |
clarkb | that would be annoying | 23:20 |