Monday, 2023-10-16

Clark[m]That ended up taking far longer than I anticipated but I should be home soon and can approve it then00:40
clarkbI've approved it and will keep an eye on it to make sure nothing looks really wrong01:37
fungithanks again!02:06
opendevreviewMerged opendev/system-config master: Drop the mailman_copy Exim router
Clark[m]So that did not trigger the lists3 job. We'll need to fix that. But I think that's fine; we will let the daily application apply it in a few hours02:20
Clark[m]Hrm but those are already enqueued. I'm not sure I can debug the job stuff now though. Maybe tomorrow we can land a change that fixes this and triggers the jobs 02:24
fungii may not have bandwidth to write that change but can find time to review it02:31
Clark[m]The issue is we don't trigger the job on that group vars file updating. We do on the host vars file02:32
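A hedged sketch of the kind of fix being described here: adding the group vars path to the Zuul job's `files` matcher so changes to it trigger the deployment job. The file names below are illustrative placeholders, not the actual system-config layout:

```yaml
# Hypothetical Zuul job definition snippet. The real job name and
# paths in opendev/system-config may differ.
- job:
    name: infra-prod-service-lists3
    files:
      # Host vars updates already triggered the job:
      - inventory/service/host_vars/lists01.opendev.org.yaml
      # Previously missing: group vars updates did not trigger it,
      # so the Exim router change was not deployed on merge.
      - inventory/service/group_vars/mailman3.yaml
```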
Clark[m]I'll try to push something up before bed02:47
opendevreviewClark Boylan proposed opendev/system-config master: Fix the relevant files lists for lists3 jobs
clarkbfungi: ^ I think something like that should do it02:55
*** gthiemon1e is now known as gthiemonge06:54
opendevreviewJaromír Wysoglad proposed openstack/project-config master: Add infrawatch/sg-core to available repos
*** drannou_ is now known as drannou13:18
clarkbI've gone ahead and approved the mm3 change. I have to pop out in an hour or so for another dentist visit but don't expect any drilling or anything so I'll be around to check on it after15:14
opendevreviewMerged opendev/system-config master: Fix the relevant files lists for lists3 jobs
clarkbthe lists3 job is enqueued after that change merged15:53
clarkbfungi: do you know if exim will automatically give up on deliveries for the emails it is already trying to deliver once we remove the config or will it continue because those deliveries entered the queue needing to be delivered?15:55
clarkbthe letsencrypt job failed which caused the lists3 job to bail out and not run16:18
clarkbas mentioned before I have to pop out momentarily for a dentist appointment so won't be able to dig into that now.16:18
clarkbworst case we can probably run that playbook manually though if we get LE working before then we could also let the periodic runs do it later today16:20
clarkbok I'm back now. I'll dig into the LE thing next I guess18:26
clarkb`The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`18:32
clarkbwhat is interesting/weird is this is an ansible loop that ran many iterations before failing18:32
clarkboh I see it is saying one of the LE nodes didn't have that var set on its hostvars18:33
clarkbthat node was mirror01.sjc1.vexxhost.opendev.org18:33
clarkboh wait, no, that was the last one to loop successfully. It would be whatever comes after that node18:34
clarkbwhy does ansible not log the loop item when the loop fails?18:37
clarkbit logs it every iteration that succeeds but when you actually need it to debug you get nothing18:37
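For future reference, a minimal sketch (hypothetical task and variable names, not the actual playbook) of how this failure mode can be made both visible and survivable: `loop_control.label` keeps the item in the output, and a `default` filter on the hostvars lookup avoids the hard stop on an undefined `letsencrypt_certcheck_domains`:

```yaml
# Hypothetical task illustrating the pattern; the real role in
# system-config builds its certcheck list differently.
- name: Build certcheck domain list
  set_fact:
    certcheck_domains: >-
      {{ certcheck_domains | default([])
         + hostvars[item]['letsencrypt_certcheck_domains'] | default([]) }}
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    # Label each iteration so a failure report names the host
    # whose hostvars were missing the variable.
    label: "{{ item }}"
```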
tkajinamWe have a few changes proposed by release bot in puppet-openstack repos and zuul does not trigger jobs for some of these patches. wondering if anyone has any idea about potential causes.18:41
clarkbtkajinam: the first thing to check is if your project has any zuul config errors (I don't see any). The next is to check if any of your jobs run given that file being updated18:42
clarkbin this case it also appears to be on a new stable branch so you may need to check if any jobs are defined for that branch18:43
tkajinamclarkb, I believe the lint job should be triggered. actually that job is triggered in the same release patch though we are using the same job templates for (most of) all repos18:43
tkajinamI've seen similar problems after new Puppet OpenStack release, and it looked like some rate limit problem, though usually recheck should trigger the jobs properly.18:47
tkajinamI'm leaving soon and may try recheck during the day tomorrow (actually it's "today") but in case recheck does not work I might need some help.18:48
clarkb2023-10-16 18:40:12,966 DEBUG zuul.Pipeline.openstack.check: [e: 105beb10ecb744188c37206ef04a6501] No jobs for change <Change 0x7f306af40590 openstack/puppet-placement 898404,1>18:49
clarkbit thinks there are no jobs to run for some reason18:49
clarkbI'm not sure what rate limit you would be hitting18:49
clarkbI see build-openstack-puppet-tarball reported as not matching files but that doesn't explain why there were no jobs just that single job18:50
tkajinampuppet-nova and puppet-placement have exactly the same definition, but the job is triggered in the puppet-nova change while it is not in the puppet-placement change.18:51
clarkbthat is interesting because it says it is using a cached layout and refers to many branches but none are 2023.218:53
clarkbcorvus: ^ could this be a layout caching bug in zuul?18:53
clarkbcorvus: specifically around using stale layouts when new branches are created?18:53
tkajinamsounds like a very possible scenario18:54
opendevreviewClark Boylan proposed opendev/system-config master: Noop change to retrigger lists3 deployment
tkajinamI'll retry recheck some hours later and it may succeed if the problem is caused by caching. I'll update you tomorrow.18:56
clarkbinfra-root ^ fyi I am going to self approve that change to see if this LE issue is consistent. I've scanned through the list of hosts and they all seem to set a domains list value. Additionally we use the same group set when we set the values and read them back so I don't think we're mixing up the group membership between the two halves of this transaction18:58
clarkblooking at the build results in zuul I think we have about a 10% fail rate for the letsencrypt job.19:31
clarkbthis particular error only shows up in failures from today though hrm19:37
clarkbno recent ansible updates from what I can see19:38
clarkbyou know what I wonder if this is an "item" var loop name conflict19:59
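The suspected conflict is the sort of thing Ansible's `loop_var` exists for; a hypothetical illustration (task file names invented for the example):

```yaml
# If an included task file runs its own loop, the inner loop's
# "item" clobbers the outer one, so the outer loop body can end up
# reading the wrong value. Renaming the outer loop variable avoids it.
- include_tasks: per-host-certcheck.yaml
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    loop_var: le_host
# per-host-certcheck.yaml can then loop over "item" freely while
# still referencing "le_host" for the outer iteration.
```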
* clarkb starts there19:59
opendevreviewMerged opendev/system-config master: Noop change to retrigger lists3 deployment
opendevreviewClark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building
clarkbthat is an attempt at adding more debugging info20:09
clarkbok the most recent le run just passed. Lists3 is running now20:17
clarkbI still think we should consider 898475 to aid in future debugging20:17
clarkband now finally that exim config update should be applied. fungi if/when you get a chance to check on it that would be great20:20
clarkbI'm going to go ahead and self approve it; it's a straightforward job dependency update20:43
opendevreviewMerged openstack/project-config master: Update the jeepyb gerrit build jobs to match current base image
opendevreviewClark Boylan proposed opendev/system-config master: Fix job dependencies on old container images
opendevreviewClark Boylan proposed opendev/system-config master: Stop building python3.9 container images
clarkbzuul-registry and openstackclient are the last two things relying on python3.10 as well. Hopefully we can get them moved to 3.11 and clean up 3.10 soon20:54
clarkbfwiw after trying to validate that the exim config is the way we want it I think it did update properly during periodic updates last night. And the processes were restarted around then too20:59
opendevreviewClark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test
corvusclarkb: the most probable cause i can determine at this point is a race condition in the branches that gerrit reported to zuul when it queried for the current list after receiving the ref-updated event.  that seems strange to me because i think they should be fairly synchronous -- except -- are we still running a git server replica on our gerrit to offload traffic from gerrit?  i believe the zuul gerrit driver gets the branch list from that.22:05
corvusif that's still the case then i think that is highly likely to be the cause, and at this point the best resolution is probably to switch zuul to use the gerrit rest api to get branch listings from gerrit.22:05
corvus(^ is re openstack/puppet-placement)22:05
clarkbcorvus: we stopped running the local replica because the paths clashed with polygerrit (/p/)22:06
corvusokay, then let's just call that medium probability.  :)22:07
corvusunfortunately we don't log the returned values in order to confirm that, so i'm just going on behavior here.22:07
corvusi think what i would do at this point would be: 1) wait for the mass branch creation to finish; 2) issue a tenant reconfiguration for openstack and confirm that the new branch shows up for puppet-placement; 3) if not, issue a full-reconfiguration and do the same; 4) sometime in the next 6 months update the gerrit driver to use the rest api for branch listings22:09
clarkbI think 1) is probably done at least for the day since the release team is largely eu based. For 2) how do we check the branch shows up?22:10
corvusthat's based on my assessment of the urgency of this; if we want to pursue it with more vigor, then: 1) add debug info to zuul on branch listings; 2) try to reproduce it (probably by very heavily loading a test gerrit)22:10
clarkbya I don't think it is super urgent22:10
corvusclarkb: i think a tab for it should show up there ^22:10
clarkbthanks. I'll go ahead and ask for the tenant reconfiguration now then. Just have to figure out the right command22:11
corvus(i believe the absence of the tab there confirms the earlier observed behavior and suggests that it's not fixed now)22:11
clarkbis it `zuul-scheduler smart-reconfigure` ?22:11
corvusand looking at the logs, i suspect many project-branches may have been affected by this22:11
corvusclarkb: `zuul-scheduler tenant-reconfigure openstack`22:11
clarkbthat seems wrong because it isn't doing it in a container22:11
corvusand yeah, just run that in a container; so "docker exec -it" that or exec a shell and run that22:12
corvusinside a running scheduler container i mean22:12
clarkb`docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` is what I ran on zuul0222:13
corvusgood it looks like that invalidated the branch cache for all projects, so i think that's promising22:13
clarkbI don't see the branch on that page you linked yet. Not sure if I just need to wait for it to rebuild things22:15
corvusoh yeah it's slow22:15
corvusif you tail zuul02 you'll see it22:15
corvusit's on starlingx now22:16
clarkbunrelated: galaxy just changed its apis and stuff22:17
clarkbreally makes me feel like our mirroring of galaxy is more pain than it is worth22:17
corvusit's on the github projects now, so we'll see if that update worked :)22:18
clarkbthe new galaxy thing has versioned apis in its /api/ document22:18
clarkbbut it appears to have deleted v2 and only kept v1 and now a v322:18
clarkbso all of our test stuff that checks v2 data looks correct is currently failing22:18
corvusbranch queries done; cat jobs now.  it's the full set.22:19
corvusbut i think that's only, what, like 10m or something these days?22:19
clarkbseems like a complete reload was like 20 minutes so that seems right22:21
opendevreviewClark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test
clarkbafter being told I wasn't authenticated I gave up trying to figure out the new api22:27
clarkbif anyone cares about ansible galaxy proxy caching ^ fixing that up (in a followup I'd like to land this as is) would be great22:28
clarkbcorvus: I think the reconfiguration is done. But still not seeing that branch on the project page22:36
clarkb2023-10-16 22:34:50,750 INFO zuul.Scheduler: Reconfiguration complete (smart: False, tenants: ['openstack'], duration: 1307.959 seconds)22:39
clarkboh the branch shows up now22:39
clarkbmaybe there was another layer of caching in my browser or something that a hard refresh wasn't clearing22:40
clarkbso that did do it, just needed to be patient22:40
clarkbthank you for the help!22:40
clarkbI've put together a meeting agenda for tomorrow. I'll give it a bit before I actually send it out if anyone has anything to add or edit22:43
corvusclarkb: np -- probably the delay was that once zuul02 was done with the reconfiguration (and putting everything in the cache) the web servers and other schedulers still needed to update their in-memory layout based on the cache.  that's relatively fast, but maybe a few minutes.23:07
corvus(but as soon as zuul02 was done, that was what was in effect; no other component would act on the old data any more)23:08
opendevreviewClark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge
opendevreviewClark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building
clarkb898505 is only running the base job. We probably want it to run many more jobs than that /me looks at file lists23:40
clarkbwe could temporarily add the install ansible role to the file list for all the system-config-run-* jobs I guess23:42
clarkbI'll ponder that overnight23:42

Generated by irclog2html 2.17.3 by Marius Gedminas