Clark[m] | That ended up taking far longer than I anticipated but I should be home soon and can approve it then | 00:40 |
clarkb | I've approved it and will keep an eye on it to make sure nothing looks really wrong | 01:37 |
fungi | thanks again! | 02:06 |
opendevreview | Merged opendev/system-config master: Drop the mailman_copy Exim router https://review.opendev.org/c/opendev/system-config/+/898268 | 02:17 |
Clark[m] | So that did not trigger the lists3 job. We'll need to fix that. But I think that's fine; we will let the daily periodic application apply it in a few hours | 02:20 |
Clark[m] | Hrm but those are already enqueued. I'm not sure I can debug the job stuff now though. Maybe tomorrow we can land a change that fixes this and triggers the jobs | 02:24 |
fungi | i may not have bandwidth to write that change but can find time to review it | 02:31 |
Clark[m] | The issue is we don't trigger the job on that group vars file updating. We do on the host vars file | 02:32 |
Clark[m] | I'll try to push something up before bed | 02:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix the relevant files lists for lists3 jobs https://review.opendev.org/c/opendev/system-config/+/898280 | 02:55 |
clarkb | fungi: ^ I think something like that should do it | 02:55 |
fungi | lgtm | 03:01 |
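For context, Zuul only triggers a deploy job when a changed file matches that job's `files` list, which is why updating the group vars file did nothing. Below is a minimal sketch of the kind of entry 898280 adds; the job name and inventory paths are illustrative assumptions, not text copied from the change:

```yaml
# Illustrative only: job name and paths are assumptions, not the actual
# system-config definitions touched by 898280.
- job:
    name: infra-prod-service-lists3
    files:
      # host vars were already matched, so host_vars edits triggered the job
      - inventory/service/host_vars/lists01.opendev.org.yaml
      # the missing piece: also trigger on the group vars file
      - inventory/service/group_vars/mailman3.yaml
      - playbooks/service-lists3.yaml
```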
*** gthiemon1e is now known as gthiemonge | 06:54 | |
opendevreview | Jaromír Wysoglad proposed openstack/project-config master: Add infrawatch/sg-core to available repos https://review.opendev.org/c/openstack/project-config/+/898314 | 10:11 |
*** drannou_ is now known as drannou | 13:18 | |
clarkb | I've gone ahead and approved the mm3 change. I have to pop out in an hour or so for another dentist visit but don't expect any drilling or anything so I'll be around to check on it after | 15:14 |
opendevreview | Merged opendev/system-config master: Fix the relevant files lists for lists3 jobs https://review.opendev.org/c/opendev/system-config/+/898280 | 15:51 |
clarkb | the lists3 job is enqueued after that change merged | 15:53 |
clarkb | fungi: do you know if exim will automatically give up on deliveries for the emails it is already trying to deliver once we remove the config or will it continue because those deliveries entered the queue needing to be delivered? | 15:55 |
clarkb | the letsencrypt job failed which caused the lists3 job to bail out and not run | 16:18 |
clarkb | as mentioned before I have to pop out momentarily for a dentist appointment so won't be able to dig into that now. | 16:18 |
clarkb | worst case we can probably run that playbook manually, though if we get LE working before then we could also let the periodic runs do it later today | 16:20 |
clarkb | ok I'm back now. I'll dig into the LE thing next I guess | 18:26 |
clarkb | `The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'` | 18:32 |
clarkb | what is interesting/weird is this is an ansible loop that ran many iterations before failing | 18:32 |
clarkb | oh I see it is saying one of the LE nodes didn't have that var set on its hostvars | 18:33 |
clarkb | that node was mirror01.sjc1.vexxhost.opendev.org | 18:33 |
clarkb | oh wait, no, that was the last one to loop successfully. Would be whatever comes after that node | 18:34 |
clarkb | why does ansible not log the loop item when the loop fails? | 18:37 |
clarkb | it logs it every iteration that succeeds but when you actually need it to debug you get nothing | 18:37 |
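One way to get the failing item back into the output is an explicit `loop_control` label, so each iteration (including the one that blows up) is tagged with the host it was processing. A hedged sketch assuming a task shaped roughly like the certcheck list building; the task, group, and variable names are guesses based on the error above, not the real playbook:

```yaml
# Hedged sketch, not the actual system-config task: an explicit loop
# label makes failed iterations identify which host they were on.
- name: Build certcheck domain list
  set_fact:
    certcheck_domains: "{{ certcheck_domains | default([]) + hostvars[item]['letsencrypt_certcheck_domains'] }}"
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    label: "{{ item }}"
```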
tkajinam | We have a few changes proposed by release bot in puppet-openstack repos and zuul does not trigger jobs for some of these patches. wondering if anyone has any idea about potential causes. | 18:41 |
tkajinam | example: https://review.opendev.org/c/openstack/puppet-placement/+/898404/ | 18:41 |
clarkb | tkajinam: the first thing to check is if your project has any zuul config errors (I don't see any). The next is to check whether any of your jobs are configured to run for the file being updated | 18:42 |
clarkb | in this case it also appears to be on a new stable branch so you may need to check if any jobs are defined for that branch | 18:43 |
tkajinam | clarkb, I believe the lint job should be triggered. That job actually is triggered on the same kind of release patch in other repos, and we use the same job templates for (most of) all repos | 18:43 |
tkajinam | https://review.opendev.org/c/openstack/puppet-gnocchi/+/898352/1 | 18:44 |
tkajinam | I've seen similar problems after new Puppet OpenStack releases, and it looked like some rate limit problem, though usually recheck triggers the jobs properly. | 18:47 |
tkajinam | I'm leaving soon and may try recheck during the day tomorrow (actually it's "today") but in case recheck does not work I might need some help. | 18:48 |
clarkb | 2023-10-16 18:40:12,966 DEBUG zuul.Pipeline.openstack.check: [e: 105beb10ecb744188c37206ef04a6501] No jobs for change <Change 0x7f306af40590 openstack/puppet-placement 898404,1> | 18:49 |
clarkb | it thinks there are no jobs to run for some reason | 18:49 |
clarkb | I'm not sure what rate limit you would be hitting | 18:49 |
tkajinam | hmmm | 18:49 |
tkajinam | https://github.com/openstack/puppet-placement/blob/stable/2023.2/.zuul.yaml | 18:50 |
tkajinam | https://github.com/openstack/puppet-nova/blob/stable/2023.2/.zuul.yaml | 18:50 |
clarkb | I see build-openstack-puppet-tarball reported as not matching files but that doesn't explain why there were no jobs just that single job | 18:50 |
tkajinam | puppet-nova and puppet-placement have exactly the same definition but the job is triggered in the puppet-nova change while it is not in the puppet-placement change. | 18:51 |
clarkb | https://paste.opendev.org/show/bj1LukLlhu9RnXfvbFXi/ is interesting because it says it is using a cached layout and refers to many branches but none are 2023.2 | 18:53 |
clarkb | corvus: ^ could this be a layout caching bug in zuul? | 18:53 |
clarkb | corvus: specifically around using stale layouts when new branches are created? | 18:53 |
tkajinam | ah | 18:53 |
tkajinam | sounds like a very possible scenario | 18:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: Noop change to retrigger lists3 deployment https://review.opendev.org/c/opendev/system-config/+/898468 | 18:55 |
tkajinam | I'll retry recheck some hours later and it may succeed if the problem is caused by caching. I'll update you tomorrow. | 18:56 |
clarkb | infra-root ^ fyi I am going to self approve that change to see if this LE issue is consistent. I've scanned through the list of hosts and they all seem to set a domains list value. Additionally we use the same group set when we set the values and read them back so I don't think we're mixing up the group membership between the two halves of this transaction | 18:58 |
clarkb | looking at the build results in zuul I think we have about a 10% fail rate for the letsencrypt job. | 19:31 |
clarkb | this particular error only shows up in failures from today though, hrm | 19:37 |
clarkb | no recent ansible updates from what I can see | 19:38 |
clarkb | you know what I wonder if this is an "item" var loop name conflict | 19:59 |
* clarkb starts there | 19:59 | |
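The classic version of that conflict is an outer `include_tasks` loop and an inner loop both using the default `item` variable, so the inner loop clobbers the outer value. A hedged sketch of the pattern being checked for, with made-up file and group names:

```yaml
# Hedged sketch of the nested-loop pitfall: renaming the outer loop
# variable keeps it from colliding with any inner loop's "item".
- name: Process each letsencrypt host
  include_tasks: per-host.yaml
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    loop_var: le_host   # inner tasks can keep using "item" safely
```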
opendevreview | Merged opendev/system-config master: Noop change to retrigger lists3 deployment https://review.opendev.org/c/opendev/system-config/+/898468 | 20:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building https://review.opendev.org/c/opendev/system-config/+/898475 | 20:09 |
clarkb | that is an attempt at adding more debugging info | 20:09 |
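Something like the following is one way such debugging could look; this is only a guess at the shape of 898475, with assumed group and variable names, not the change itself:

```yaml
# Guess at the kind of extra output 898475 aims for; names are assumptions.
- name: Show which hosts define a certcheck domain list
  debug:
    msg: "{{ item }}: letsencrypt_certcheck_domains defined = {{ hostvars[item]['letsencrypt_certcheck_domains'] is defined }}"
  loop: "{{ groups['letsencrypt'] }}"
  loop_control:
    label: "{{ item }}"
```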
clarkb | ok the most recent le run just passed. Lists3 is running now | 20:17 |
clarkb | I still think we should consider 898475 to aid in future debugging | 20:17 |
clarkb | and now finally that exim config update should be applied. fungi if/when you get a chance to check on it that would be great | 20:20 |
clarkb | I'm going to go ahead and self approve https://review.opendev.org/c/openstack/project-config/+/897710/1/zuul.d/projects.yaml since it's a straightforward job dependency update | 20:43 |
opendevreview | Merged openstack/project-config master: Update the jeepyb gerrit build jobs to match current base image https://review.opendev.org/c/openstack/project-config/+/897710 | 20:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix job dependencies on old container images https://review.opendev.org/c/opendev/system-config/+/898479 | 20:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Stop building python3.9 container images https://review.opendev.org/c/opendev/system-config/+/898480 | 20:53 |
clarkb | zuul-registry and openstackclient are the last two things relying on python3.10 as well. Hopefully we can get them moved to 3.11 and clean up 3.10 soon | 20:54 |
clarkb | fwiw after trying to validate that the exim config is the way we want it I think it did update properly during periodic updates last night. And the processes were restarted around then too | 20:59 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test https://review.opendev.org/c/opendev/system-config/+/898502 | 21:41 |
corvus | clarkb: the most probable cause i can determine at this point is a race condition in the branches that gerrit reported to zuul when it queried for the current list after receiving the ref-updated event. that seems strange to me because i think they should be fairly synchronous -- except -- are we still running a git server replica on our gerrit to offload traffic from gerrit? i believe the zuul gerrit driver gets the branch list from that. | 22:05 |
corvus | if that's still the case then i think that is highly likely to be the cause, and at this point the best resolution is probably to switch zuul to use the gerrit rest api to get branch listings from gerrit. | 22:05 |
corvus | (^ is re openstack/puppet-placement) | 22:05 |
clarkb | corvus: we stopped running the local replica because the paths clashed with polygerrit (/p/) | 22:06 |
corvus | okay, then let's just call that medium probability. :) | 22:07 |
corvus | unfortunately we don't log the returned values in order to confirm that, so i'm just going on behavior here. | 22:07 |
corvus | i think what i would do at this point would be: 1) wait for the mass branch creation to finish; 2) issue a tenant reconfiguration for openstack and confirm that the new branch shows up for puppet-placement; 3) if not, issue a full-reconfiguration and do the same; 4) sometime in the next 6 months update the gerrit driver to use the rest api for branch listings | 22:09 |
clarkb | I think 1) is probably done at least for the day since the release team is largely eu based. For 2) how do we check the branch shows up? | 22:10 |
corvus | that's based on my assessment of the urgency of this; if we want to pursue it with more vigor, then: 1) add debug info to zuul on branch listings; 2) try to reproduce it (probably by very heavily loading a test gerrit) | 22:10 |
clarkb | ya I don't think it is super urgent | 22:10 |
corvus | https://zuul.opendev.org/t/openstack/project/opendev.org/openstack/puppet-placement | 22:10 |
corvus | clarkb: i think a tab for it should show up there ^ | 22:10 |
clarkb | thanks. I'll go ahead and ask for the tenant configuration now then. Just have to figure out the right command | 22:11 |
corvus | (i believe the absence of the tab there confirms the earlier observed behavior and suggests that it's not fixed now) | 22:11 |
corvus | cool | 22:11 |
clarkb | is it `zuul-scheduler smart-reconfigure` ? | 22:11 |
corvus | and looking at the logs, i suspect many project-branches may have been affected by this | 22:11 |
corvus | clarkb: `zuul-scheduler tenant-reconfigure openstack` | 22:11 |
clarkb | that seems wrong because it isn't doing it in a container | 22:11 |
clarkb | thanks | 22:12 |
corvus | and yeah, just run that in a container; so "docker exec -it" that or exec a shell and run that | 22:12 |
corvus | inside a running scheduler container i mean | 22:12 |
clarkb | `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` is what I ran on zuul02 | 22:13 |
corvus | good it looks like that invalidated the branch cache for all projects, so i think that's promising | 22:13 |
clarkb | I don't see the branch on that page you linked yet. Not sure if I just need to wait for it to rebuild things | 22:15 |
corvus | oh yeah it's slow | 22:15 |
corvus | if you tail zuul02 you'll see it | 22:15 |
corvus | it's on starlingx now | 22:16 |
clarkb | gotcha | 22:16 |
clarkb | unrelated: galaxy just changed its apis and stuff | 22:17 |
clarkb | really makes me feel like our mirroring of galaxy is more pain than it is worth | 22:17 |
corvus | ++ | 22:17 |
corvus | it's on the github projects now, so we'll see if that update worked :) | 22:18 |
clarkb | the new galaxy thing has versioned apis in its /api/ document | 22:18 |
clarkb | but it appears to have deleted v2 and only kept v1 and now a v3 | 22:18 |
corvus | oO | 22:18 |
clarkb | so all of our test stuff that checks v2 data looks correct is currently failing | 22:18 |
corvus | branch queries done; cat jobs now. it's the full set. | 22:19 |
corvus | but i think that's only, what, like 10m or something these days? | 22:19 |
clarkb | seems like a complete reload was like 20 minutes so that seems right | 22:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test https://review.opendev.org/c/opendev/system-config/+/898502 | 22:27 |
clarkb | after being told I wasn't authenticated I gave up trying to figure out the new api | 22:27 |
clarkb | if anyone cares about ansible galaxy proxy caching ^ fixing that up (in a followup I'd like to land this as is) would be great | 22:28 |
clarkb | corvus: I think the reconfiguration is done. But still not seeing that branch on the project page | 22:36 |
clarkb | 2023-10-16 22:34:50,750 INFO zuul.Scheduler: Reconfiguration complete (smart: False, tenants: ['openstack'], duration: 1307.959 seconds) | 22:39 |
clarkb | oh the branch shows up now | 22:39 |
clarkb | maybe there was another layer of caching in my browser or something that a hard refresh wasn't clearing | 22:40 |
clarkb | so that did do it, just needed to be patient | 22:40 |
clarkb | thank you for the help! | 22:40 |
clarkb | I've put together a meeting agenda for tomorrow. I'll give it a bit before I actually send it out in case anyone has anything to add or edit | 22:43 |
corvus | clarkb: np -- probably the delay was that once zuul02 was done with the reconfiguration (and putting everything in the cache) the web servers and other schedulers still needed to update their in-memory layout based on the cache. that's relatively fast, but maybe a few minutes. | 23:07 |
corvus | (but as soon as zuul02 was done, that was what was in effect; no other component would act on the old data any more) | 23:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge https://review.opendev.org/c/opendev/system-config/+/898505 | 23:38 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add debugging info to certcheck list building https://review.opendev.org/c/opendev/system-config/+/898475 | 23:38 |
clarkb | 898505 is only running the base job. We probably want it to run many more jobs than that /me looks at file lists | 23:40 |
clarkb | we could temporarily add the install ansible role to the file list for all the system-config-run-* jobs I guess | 23:42 |
clarkb | I'll ponder that overnight | 23:42 |