Monday, 2023-10-16

Clark[m]That ended up taking far longer than I anticipated but I should be home soon and can approve it then00:40
clarkbI've approved it and will keep an eye on it to make sure nothing looks really wrong01:37
fungithanks again!02:06
opendevreviewMerged opendev/system-config master: Drop the mailman_copy Exim router
Clark[m]So that did not trigger the lists3 job. We'll need to fix that. But I think that's fine; we will let the daily application apply it in a few hours02:20
Clark[m]Hrm but those are already enqueued. I'm not sure I can debug the job stuff now though. Maybe tomorrow we can land a change that fixes this and triggers the jobs 02:24
fungii may not have bandwidth to write that change but can find time to review it02:31
Clark[m]The issue is we don't trigger the job on that group vars file updating. We do on the host vars file02:32
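A hedged sketch of the kind of fix being described here: adding the group vars path to the Zuul job's `files` matcher so changes to it trigger the deployment job. The file names below are illustrative placeholders, not the actual system-config layout:

```yaml
# Hypothetical Zuul job definition snippet. The real job name and
# paths in opendev/system-config may differ.
- job:
    name: infra-prod-service-lists3
    files:
      # Host vars updates already triggered the job:
      - inventory/service/host_vars/lists01.opendev.org.yaml
      # Previously missing: group vars updates did not trigger it,
      # so the Exim router change was not deployed on merge.
      - inventory/service/group_vars/mailman3.yaml
```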
Clark[m]I'll try to push something up before bed02:47
opendevreviewClark Boylan proposed opendev/system-config master: Fix the relevant files lists for lists3 jobs
clarkbfungi: ^ I think something like that should do it02:55
*** gthiemon1e is now known as gthiemonge06:54
opendevreviewJaromír Wysoglad proposed openstack/project-config master: Add infrawatch/sg-core to available repos
*** drannou_ is now known as drannou13:18
clarkbI've gone ahead and approved the mm3 change. I have to pop out in an hour or so for another dentist visit but don't expect any drilling or anything so I'll be around to check on it after15:14
opendevreviewMerged opendev/system-config master: Fix the relevant files lists for lists3 jobs
clarkbthe lists3 job is enqueued after that change merged15:53
clarkbfungi: do you know if exim will automatically give up on deliveries for the emails it is already trying to deliver once we remove the config or will it continue because those deliveries entered the queue needing to be delivered?15:55
clarkbthe letsencrypt job failed which caused the lists3 job to bail out and not run16:18
clarkbas mentioned before I have to pop out momentarily for a dentist appointment so won't be able to dig into that now.16:18
clarkbworst case we can probably run that playbook manually though if we get LE working before then we could also let the periodic runs do it later today16:20
clarkbok I'm back now. I'll dig into the LE thing next I guess18:26
clarkb`The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`18:32
clarkbwhat is interesting/weird is this is an ansible loop that ran many iterations before failing18:32
clarkboh I see it is saying one of the LE nodes didn't have that var set on its hostvars18:33
clarkbthat node was mirror01.sjc1.vexxhost.opendev.org18:33
clarkboh wait, no, that was the last one to loop successfully. It would be whatever comes after that node18:34
clarkbwhy does ansible not log the loop item when the loop fails?18:37
clarkbit logs it every iteration that succeeds but when you actually need it to debug you get nothing18:37
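For future reference, a minimal sketch (hypothetical task and variable names, not the actual playbook) of how this failure mode can be made both visible and survivable: `loop_control.label` keeps the item in the output, and a `default` filter on the hostvars lookup avoids the hard stop on an undefined `letsencrypt_certcheck_domains`:

```yaml
# Hypothetical task illustrating the pattern; the real role in
# system-config builds its certcheck list differently.
- name: Build certcheck domain list
  set_fact:
    certcheck_domains: >-
      {{ certcheck_domains | default([])
         + hostvars[item]['letsencrypt_certcheck_domains'] | default([]) }}
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    # Label each iteration so a failure report names the host
    # whose hostvars were missing the variable.
    label: "{{ item }}"
```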
tkajinamWe have a few changes proposed by release bot in puppet-openstack repos and zuul does not trigger jobs for some of these patches. wondering if anyone has any idea about potential causes.18:41
clarkbtkajinam: the first thing to check is if your project has any zuul config errors (I don't see any). The next is to check if any of your jobs run given that file being updated18:42
clarkbin this case it also appears to be on a new stable branch so you may need to check if any jobs are defined for that branch18:43
tkajinamclarkb, I believe the lint job should be triggered. actually that job is triggered in the same release patch though we are using the same job templates for (most of) all repos18:43
tkajinamI've seen similar problems after new Puppet OpenStack release, and it looked like some rate limit problem, though usually recheck should trigger the jobs properly.18:47
tkajinamI'm leaving soon and may try recheck during the day tomorrow (actually it's "today") but in case recheck does not work I might need some help.18:48
clarkb2023-10-16 18:40:12,966 DEBUG zuul.Pipeline.openstack.check: [e: 105beb10ecb744188c37206ef04a6501] No jobs for change <Change 0x7f306af40590 openstack/puppet-placement 898404,1>18:49
clarkbit thinks there are no jobs to run for some reason18:49
clarkbI'm not sure what rate limit you would be hitting18:49
clarkbI see build-openstack-puppet-tarball reported as not matching files but that doesn't explain why there were no jobs just that single job18:50
tkajinampuppet-nova and puppet-placement have exactly the same definition, but the job is triggered in the puppet-nova change while it is not in the puppet-placement change.18:51
clarkbthat is interesting because it says it is using a cached layout and refers to many branches but none are 2023.218:53
clarkbcorvus: ^ could this be a layout caching bug in zuul?18:53
clarkbcorvus: specifically around using stale layouts when new branches are created?18:53
tkajinamsounds like a very possible scenario18:54
opendevreviewClark Boylan proposed opendev/system-config master: Noop change to retrigger lists3 deployment
tkajinamI'll retry recheck some hours later and it may succeed if the problem is caused by caching. I'll update you tomorrow.18:56
clarkbinfra-root ^ fyi I am going to self approve that change to see if this LE issue is consistent. I've scanned through the list of hosts and they all seem to set a domains list value. Additionally we use the same group set when we set the values and read them back so I don't think we're mixing up the group membership between the two halves of this transaction18:58
clarkblooking at the build results in zuul I think we have about a 10% fail rate for the letsencrypt job.19:31
clarkbthis particular error only shows up in failures from today though hrm19:37
clarkbno recent ansible updates from what I can see19:38
clarkbyou know what I wonder if this is an "item" var loop name conflict19:59
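The suspected conflict is the sort of thing Ansible's `loop_var` exists for; a hypothetical illustration (task file names invented for the example):

```yaml
# If an included task file runs its own loop, the inner loop's
# "item" clobbers the outer one, so the outer loop body can end up
# reading the wrong value. Renaming the outer loop variable avoids it.
- include_tasks: per-host-certcheck.yaml
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    loop_var: le_host
# per-host-certcheck.yaml can then loop over "item" freely while
# still referencing "le_host" for the outer iteration.
```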
* clarkb starts there19:59
opendevreviewMerged opendev/system-config master: Noop change to retrigger lists3 deployment
opendevreviewClark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building
clarkbthat is an attempt at adding more debugging info20:09
clarkbok the most recent le run just passed. Lists3 is running now20:17
clarkbI still think we should consider 898475 to aid in future debugging20:17
clarkband now finally that exim config update should be applied. fungi if/when you get a chance to check on it that would be great20:20
clarkbI'm going to go ahead and self approve it; it's a straightforward job dependency update20:43
opendevreviewMerged openstack/project-config master: Update the jeepyb gerrit build jobs to match current base image
opendevreviewClark Boylan proposed opendev/system-config master: Fix job dependencies on old container images
opendevreviewClark Boylan proposed opendev/system-config master: Stop building python3.9 container images
clarkbzuul-registry and openstackclient are the last two things relying on python3.10 as well. Hopefully we can get them moved to 3.11 and clean up 3.10 soon20:54
clarkbfwiw after trying to validate that the exim config is the way we want it I think it did update properly during periodic updates last night. And the processes were restarted around then too20:59
opendevreviewClark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test
corvusclarkb: the most probable cause i can determine at this point is a race condition in the branches that gerrit reported to zuul when it queried for the current list after receiving the ref-updated event.  that seems strange to me because i think they should be fairly synchronous -- except -- are we still running a git server replica on our gerrit to offload traffic from gerrit?  i believe the zuul gerrit driver gets the branch list from that.22:05
corvusif that's still the case then i think that is highly likely to be the cause, and at this point the best resolution is probably to switch zuul to use the gerrit rest api to get branch listings from gerrit.22:05
corvus(^ is re openstack/puppet-placement)22:05
clarkbcorvus: we stopped running the local replica because the paths clashed with polygerrit (/p/)22:06
corvusokay, then let's just call that medium probability.  :)22:07
corvusunfortunately we don't log the returned values in order to confirm that, so i'm just going on behavior here.22:07
corvusi think what i would do at this point would be: 1) wait for the mass branch creation to finish; 2) issue a tenant reconfiguration for openstack and confirm that the new branch shows up for puppet-placement; 3) if not, issue a full-reconfiguration and do the same; 4) sometime in the next 6 months update the gerrit driver to use the rest api for branch listings22:09
clarkbI think 1) is probably done at least for the day since the release team is largely eu based. For 2) how do we check the branch shows up?22:10
corvusthat's based on my assessment of the urgency of this; if we want to pursue it with more vigor, then: 1) add debug info to zuul on branch listings; 2) try to reproduce it (probably by very heavily loading a test gerrit)22:10
clarkbya I don't think it is super urgent22:10
corvusclarkb: i think a tab for it should show up there ^22:10
clarkbthanks. I'll go ahead and ask for the tenant reconfiguration now then. Just have to figure out the right command22:11
corvus(i believe the absence of the tab there confirms the earlier observed behavior and suggests that it's not fixed now)22:11
clarkbis it `zuul-scheduler smart-reconfigure` ?22:11
corvusand looking at the logs, i suspect many project-branches may have been affected by this22:11
corvusclarkb: `zuul-scheduler tenant-reconfigure openstack`22:11
clarkbthat seems wrong because it isn't doing it in a container22:11
corvusand yeah, just run that in a container; so "docker exec -it" that or exec a shell and run that22:12
corvusinside a running scheduler container i mean22:12
clarkb`docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` is what I ran on zuul0222:13
corvusgood it looks like that invalidated the branch cache for all projects, so i think that's promising22:13
clarkbI don't see the branch on that page you linked yet. Not sure if I just need to wait for it to rebuild things22:15
corvusoh yeah it's slow22:15
corvusif you tail zuul02 you'll see it22:15
corvusit's on starlingx now22:16
clarkbunrelated: galaxy just changed its apis and stuff22:17
clarkbreally makes me feel like our mirroring of galaxy is more pain than it is worth22:17
corvusit's on the github projects now, so we'll see if that update worked :)22:18
clarkbthe new galaxy thing has versioned apis in its /api/ document22:18
clarkbbut it appears to have deleted v2 and only kept v1 and now a v322:18
clarkbso all of our test stuff that checks v2 data looks correct is currently failing22:18
corvusbranch queries done; cat jobs now.  it's the full set.22:19
corvusbut i think that's only, what, like 10m or something these days?22:19
clarkbseems like a complete reload was like 20 minutes so that seems right22:21
opendevreviewClark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test
clarkbafter being told I wasn't authenticated I gave up trying to figure out the new api22:27
clarkbif anyone cares about ansible galaxy proxy caching ^ fixing that up (in a followup I'd like to land this as is) would be great22:28
clarkbcorvus: I think the reconfiguration is done. But still not seeing that branch on the project page22:36
clarkb2023-10-16 22:34:50,750 INFO zuul.Scheduler: Reconfiguration complete (smart: False, tenants: ['openstack'], duration: 1307.959 seconds)22:39
clarkboh the branch shows up now22:39
clarkbmaybe there was another layer of caching in my browser or something that a hard refresh wasn't clearing22:40
clarkbso that did do it, just needed to be patient22:40
clarkbthank you for the help!22:40
clarkbI've put together a meeting agenda for tomorrow. I'll give it a bit before I actually send it out if anyone has anything to add or edit22:43
corvusclarkb: np -- probably the delay was that once zuul02 was done with the reconfiguration (and putting everything in the cache) the web servers and other schedulers still needed to update their in-memory layout based on the cache.  that's relatively fast, but maybe a few minutes.23:07
corvus(but as soon as zuul02 was done, that was what was in effect; no other component would act on the old data any more)23:08
opendevreviewClark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge
opendevreviewClark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building
clarkb898505 is only running the base job. We probably want it to run many more jobs than that /me looks at file lists23:40
clarkbwe could temporarily add the install ansible role to the file list for all the system-config-run-* jobs I guess23:42
clarkbI'll ponder that overnight23:42

Generated by irclog2html 2.17.3 by Marius Gedminas