Wednesday, 2022-02-16

00:09 <clarkb> ianw: left a couple of notes on the first change in the wheel python stack
00:14 <ianw> clarkb: thanks; new version posted
00:43 <clarkb> ianw: ok left comments on the gpg encrypted log files stack too
00:47 <ianw> clarkb: it was a bit of a conscious choice to not tee the logs there, as i wondered if it was all quite long and not quite necessary in the console output, because it's not formatted that well for that context
00:50 <clarkb> ianw: hrm I tend to use them. But I know the ara is there as an alternative (as would the new log files)
00:50 <clarkb> I have an old habit of looking at the job output file before anything else. I can be convinced this is a bad idea :)
00:51 <ianw> as it changes the status quo, i can put in a "tee" there for now, and propose stopping doing this in a separate change
00:51 <clarkb> I'm happy to see if others have a preference
00:51 <clarkb> and just remove it if that is what people prefer
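The encrypt-file role under review wraps GnuPG. As a rough illustrative sketch only (the key id, filenames, and options below are placeholders invented here, not the role's actual contents), encrypting a log to a recipient key looks something like:

```shell
# Illustrative only -- not the actual encrypt-file role. Generate a
# throwaway recipient key in a temporary keyring, then encrypt a log
# file to it; a real deployment would import operators' public keys.
export GNUPGHOME=$(mktemp -d)
chmod 700 "$GNUPGHOME"
gpg --batch --pinentry-mode loopback --passphrase '' \
    --quick-generate-key log-recipient@example.org default default never
echo 'example production log line' > ansible.log
gpg --batch --yes --trust-model always \
    --recipient log-recipient@example.org \
    --output ansible.log.gpg --encrypt ansible.log
```

Only holders of the matching private key can then recover the log with `gpg --decrypt ansible.log.gpg`.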
00:52 *** rlandy|ruck is now known as rlandy|out
<opendevreview> Merged openstack/diskimage-builder master: Update platform support to describe stable testing
01:23 <ianw> thanks for the reviews, i will tweak things after some lunch
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
02:35 <Clark[m]> fungi: email indicates your GitHub fix is likely working
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
05:36 *** ysandeep|out is now known as ysandeep
<opendevreview> Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs
<opendevreview> Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log
06:49 *** ykarel_ is now known as ykarel
07:38 *** dwhite449 is now known as dwhite44
08:03 *** amoralej|off is now known as amoralej
08:33 *** jpena|off is now known as jpena
08:59 *** ysandeep is now known as ysandeep|lunch
08:59 <lourot> fungi, re: github mirroring fix, this worked, thanks a lot!
09:00 <fungi> perfect. at least the missing repos were created. content may not show up for them until a new change merges in each
10:03 *** ysandeep|lunch is now known as ysandeep
10:25 *** pojadhav- is now known as pojadhav
10:31 *** ykarel_ is now known as ykarel
11:08 *** rlandy|out is now known as rlandy|ruck
11:21 *** dviroel|out is now known as dviroel
<opendevreview> Merged openstack/project-config master: Add Rocky Linux to nodepool elements tooling
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
13:02 *** amoralej is now known as amoralej|lunch
13:18 *** pojadhav is now known as pojadhav|brb
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages
13:39 <frickler> kevinz_: hi, any update on the certificate? I'm still seeing the expiry warnings
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name
14:01 *** amoralej|lunch is now known as amoralej
14:16 <admin1> hi all.. i am setting up openstack-ansible and got this message
14:16 <admin1> You are not building wheels while running role against multiple hosts. This might result in DOS-ing OpenDev infrastructure servers. In order to proceed, please ensure that you have repo servers for selected OS version and architecture. If you want to avoid building wheel on purpose, ensure that you run playbook in serial manner. In case of causing
14:16 <admin1> unreasonable load on the git servers, your access may be blocked to protect other users and the OpenDev CI infrastructure which are reliant on this service.
14:19 <fungi> admin1: please ask in #openstack-ansible
14:20 <admin1> they sent me here :D
14:21 <fungi> all the opendev sysadmins know is that sometimes openstack-ansible users flood our git servers with repository clone requests from hundreds of systems at once and knock us offline, so the openstack-ansible maintainers added that warning after diagnosing the cause
<opendevreview> Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration
14:51 *** pojadhav|brb is now known as pojadhav
15:02 *** weechat1 is now known as amorin
15:21 *** pojadhav is now known as pojadhavdinner
15:21 *** pojadhavdinner is now known as pojadhav|dinner
15:26 *** ysandeep is now known as ysandeep|out
<fungi> another open source videoconferencing platform i hadn't seen before:
15:29 <fungi> looks like the underlying webrtc gateway implementation came from which i'd also never heard of
15:29 <fungi> part of the pandemic wfh vc explosion bubble, i guess
15:31 *** dviroel is now known as dviroel|lunch
16:00 *** ysandeep|out is now known as ysandeep
16:01 <fungi> nope, i guessed wrong: "Meetecho was born in 2009 as an official academic spin-off of the University of Napoli Federico II."
16:04 *** ysandeep is now known as ysandeep|out
16:22 *** rlandy|ruck is now known as rlandy|ruck|mtg
<opendevreview> Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name
16:37 *** rlandy|ruck|mtg is now known as rlandy|ruck
16:39 *** dviroel|lunch is now known as dviroel
17:04 *** marios is now known as marios|out
17:08 *** pojadhav|dinner is now known as pojadhav
17:31 <corvus> i'm going to look into the log streaming issue that ianw reported; anything new i should be aware of?
17:32 *** jpena is now known as jpena|off
17:33 <clarkb> not that I know of
17:33 <clarkb> I think we've all been distracted by other stuff and haven't had a chance to look at it closer
17:37 <clarkb> infra-root I'm approving and then will be ready to land. This last one could use one more review
17:46 <fungi> thanks for the heads up, looking now
17:49 <clarkb> fungi: re gerrit gitea I think I can technically land that change now. But historically davido has submitted changes for me when he has +2'd them so I'm thinking maybe he wants extra review on these?
17:50 <fungi> yeah, that's what i was wondering
17:50 <fungi> i didn't notice him adding other requested reviewers, but may have missed it
17:50 <clarkb> I can ping davido on their slack and get his opinion so that it is clear
<opendevreview> Merged opendev/system-config master: Remove configuration management for wiki servers
<opendevreview> Merged opendev/system-config master: Stop using puppet repos that will be retired
18:27 <clarkb> corvus: is non-responsive right now and zuul-web is spinning a cpu
18:27 *** pojadhav is now known as pojadhav|out
18:27 <clarkb> is this possibly related to your debugging? Should we restart zuul-web?
18:28 <fungi> or did he already restart zuul-web and it's still doing its smart-reconfig?
18:28 <clarkb> oh maybe?
18:29 <clarkb> the process is old but it does appear it is reloading its configs
18:29 <fungi> if memory serves, it takes zuul-web 15-20 minutes to restart now that it's been reimplemented as basically another scheduler
18:29 <clarkb> tailing the debug log shows it talking about config files
18:35 <corvus> i did sigusr2, so yappi is running... maybe that's slowing this all down
18:35 <clarkb> yes that could do it
18:35 <fungi> from the logs, i suspect we ended up doing a reconfigure with yappi going
18:35 <corvus> i hit it again, hopefully it speeds up now
18:36 <corvus> check out the grafana
18:36 <corvus> peaked at 200 queued requests
18:37 <corvus> back to normal now
19:00 *** rlandy|ruck is now known as rlandy|ruck|mtg
20:14 *** rlandy|ruck|mtg is now known as rlandy|ruck
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file
<opendevreview> Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
<opendevreview> Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
20:43 *** rlandy|ruck is now known as rlandy|ruck|mtg
20:48 <clarkb> ianw: if you get a chance can you review for improved haproxying with gitea?
20:51 <clarkb> fungi: looks like davido merged the gerrit changes
20:52 <clarkb> fungi: I think we can test it without the depends-on now
20:53 <fungi> oh, yep! on it
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Use Gitea for Gerrit's code browser URLs
<opendevreview> Jeremy Stanley proposed opendev/system-config master: DNM: Fail our Gerrit testing for an autohold
21:00 <fungi> i've set a fresh autohold for the dnm change there and released the previous hold
21:02 <clarkb> and then if we update for that we'll pull in the ls-members fix too
21:04 *** rlandy|ruck|mtg is now known as rlandy|ruck
21:10 *** amoralej is now known as amoralej|off
21:20 <ianw> clarkb: why do we need to "verify none" for the production case -- wouldn't we have valid SSL certificates there?
21:21 <clarkb> ianw: we do have valid ssl certs in prod. The problem is testing it since you have to provide the ca files as well. Mostly just worried that if we land the change we'll suddenly have no valid backends because testing is difficult
21:22 <ianw> clarkb: perhaps as a follow-up we should switch one to do ssl checks in production, and if it's ok, switch the rest?
21:22 <fungi> well, even in prod we'd have to point to a ca file
21:23 <clarkb> fungi: ya but if they are all verify none but one the ca-file should be pretty non impactful for the other 7
21:23 <clarkb> ianw: I think that approach works
21:23 <fungi> ianw: keep in mind that this isn't a regression over what we already had with the tcp checks, even before apache, and we're separately checking and alerting on cert validity anyway outside of haproxy
21:23 <clarkb> we need to bind mount the /etc/ssl/certs/ca-certificates.crt file into the container then set the config to use it
21:23 <clarkb> and if one backend has a sad that isn't the end of the world
21:24 <ianw> ahh, i had sort of assumed the LB container would have the certs set up for LE
21:24 <ianw> i'm fine with it btw, just thinking through what we could do
21:25 <clarkb> ianw: we consume haproxy from upstream and they don't seem to include any certs
21:25 <clarkb> which kinda makes sense for the target audience of the image I guess
21:25 <fungi> the odds that one of our gitea servers would spontaneously have an invalid cert while the others are fine is fairly low, and having the load balancer make decisions based on cert validity also increases complexity, thus the chances that it might decide to take all the backends out of the pool because of an error somewhere
21:25 <clarkb> but we can bind mount it in from the host
21:25 <clarkb> fungi: ya also that
21:25 <clarkb> it does add more complexity and that is always opportunity for unexpected failure
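For reference, the sort of backend stanza under discussion looks roughly like this. This is a hypothetical sketch, not the actual opendev config: the backend and server names are invented, and the two server lines just illustrate the trade-off between "verify none" and verifying against a bind-mounted CA bundle.

```
backend balance_git_https
    option httpchk GET /
    # as merged: layer-7 HTTPS check without certificate verification
    server gitea01 gitea01.example.org:3081 check check-ssl verify none
    # the discussed follow-up: verify one backend against the host CA
    # bundle bind-mounted into the haproxy container
    server gitea02 gitea02.example.org:3081 check check-ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt
```

With `check-ssl verify none` haproxy still exercises the full TLS handshake and HTTP check, it just skips validating the certificate chain.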
21:26 *** dviroel is now known as dviroel|out
21:26 <fungi> my primary fear with load balancers is that they'll hit a condition in their check logic which causes them to invalidate all backends
21:27 <ianw> fair points, i guess it's more about the LB knowing who it's talking to in the back-end
21:27 <fungi> and this is fear borne from experience managing very expensive commercial load balancers for decades in a past life. it definitely happens
21:29 <fungi> telling the customer that a minor change to their website caused the load balancer cluster to suddenly decide none of their servers were viable destinations was never fun, and inevitably led them to question why they were even using load balancers if they could cause the site to go offline rather than preventing it
21:32 <fungi> so, yeah, simpler checks are better. the complexity of the layer-7 check is a mitigation for the reverse-proxy causing the service to seem up for simpler tcp socket checks when it isn't, but i think we need to carefully weigh any increase in check complexity against the benefits provided
21:33 <fungi> in particular, we know that we take the gitea containers offline when the container images are replaced
21:33 <fungi> so i think not sending traffic to them during those (fairly frequent) outages is worth the added risk
21:35 <clarkb> ya I think we definitely want a check that covers both apache and gitea
21:35 <clarkb> currently we only have apache. The proposed change should also cover gitea
21:45 <clarkb> looking at the timing of the change landing it appears I'll be starting my walk to get kids from school around when the job should merge. infra-root if you'd like I can put the lb in the emergency file now and remove it and run the playbook when I return
21:46 <clarkb> Considering it is tested I'm not too worried about it. But I won't be able to fix it for a little bit after it lands if something goes wrong
<opendevreview> Merged zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled
21:59 <fungi> i should be done with dinner by then
22:04 <clarkb> cool I'll leave it be then
<opendevreview> Merged opendev/system-config master: Haproxy http checks for Gitea
22:09 <clarkb> it will end up behind the hourly jobs. Those take about half an hour iirc. I'll check when the school run is done
22:14 <ianw> i'm around, so can watch too
22:19 <rcastillo|rover> Hello. We're running into some issues with our tripleo centos 9 content provider jobs. They're failing on retry and the running jobs don't seem to print any logs
22:20 <rcastillo|rover> some issue with stream 9 nodes?
22:20 <ianw> rcastillo|rover: hrm, let me have a look
22:20 <ianw> let's check that one, seems the latest
22:24 <ianw> Adding node request <NodeRequest 300-0017269841 ['centos-9-stream']> for job <FrozenJob tripleo-ci-centos-9-content-provider> to item <QueueItem 4cd6063e4397459dabb0793b8f36afbf for <Change 0x7f1cc4175250 openstack/tripleo-ansible 828920,3> in check>
22:24 <ianw> 2022-02-16 22:05:04,927 DEBUG zuul.nodepool: [e: 1a7616b0e3c74112b17d45540797e1ab] Node request <NodeRequest 300-0017269841 ['centos-9-stream']> fulfilled
22:32 <ianw> 2022-02-16 22:05:04,884 DEBUG nodepool.driver.NodeRequestHandler[]: [e: 1a7616b0e3c74112b17d45540797e1ab] [node_request: 300-0017269841] Fulfilled node request
22:32 <ianw> nl04 fulfilled the request, and it went to ovh-bhs1
22:44 <ianw> 2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob: [e: 1a7616b0e3c74112b17d45540797e1ab] [build: e89d10ecb10a4812b1dd9e4ea50fdb1a] Exception while executing job
22:44 <ianw> on ze03 ... this might be a clue
22:48 <ianw> is the full error
22:49 <ianw> 2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob:   ValueError: SHA b'01e50ed7dac6cd25ce268d01cc457910633ccbf0' could not be resolved, git returned: b'01e50ed7dac6cd25ce268d01cc457910633ccbf0 missing'
22:49 <ianw> is the interesting bit
22:49 <fungi> corrupt repo cache?
22:49 <ianw> corvus: ^ might be able to short-cut my further investigations :)
22:49 <ianw> yeah, possibly
<ianw> 2022-02-16 22:04:24,899 DEBUG zuul.Repo.Ref: Create reference refs/heads/patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 at 01e50ed7dac6cd25ce268d01cc457910633ccbf0 in /var/lib/zuul/builds/e89d10ecb10a4812b1dd9e4ea50fdb1a/work/src/
22:50 <ianw> this is before it
22:52 <ianw> does not have a depends-on
22:52 <ianw> so that's something
22:53 <clarkb> looks like haproxy update hasn't happened yet and I'm back so can watch it
22:54 <ianw> root@ze03:/var/lib/zuul/executor-git/ git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:54 <ianw> fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:55 <clarkb> ianw: I think we need to check if that object is broken upstream too
22:55 <clarkb> if it is then there isn't much we can do. If it isn't then we likely need to take the executor out of rotation, remove the repo and let it repopulate after starting the executor back up again
22:55 <clarkb> considering this seems to be somewhat widespread I suspect this isn't a problem on our end but something we are managing to fetch from upstream (but we should confirm that)
22:56 <ianw> this must have run several times across executors you'd think
22:56 <ianw> yeah, what you said :)
22:56 <ianw> [iwienand@fedora19 community.general (main)]$ git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:56 <ianw> fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0
22:57 <clarkb> that is unfortunate for ansible, but I'm not sure we can do anything about it?
22:57 <clarkb> eg that should be fixed upstream
22:57 <ianw> i guess i will file an issue with ansible, i feel like github may have to get involved
22:58 <clarkb> ya it is possible
22:58 <clarkb> I guess on our end we might be able to better capture the error somehow
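A quieter way to perform the check above is `git cat-file -e`, which exits non-zero for a missing object instead of printing a fatal error. The sketch below demonstrates the technique against a throwaway repo it creates itself; to reproduce the actual check, point `REPO` at the executor's git cache (the path and SHA from this incident are just pasted in as the values of interest).

```shell
# Sketch: test whether a SHA resolves to a commit without parsing
# "git show" output. Demonstrated on a freshly created placeholder repo.
REPO=$(mktemp -d)
git -C "$REPO" init -q
git -C "$REPO" -c user.email=t@example.org -c user.name=t \
    commit -q --allow-empty -m 'placeholder commit'
SHA=01e50ed7dac6cd25ce268d01cc457910633ccbf0   # the object zuul could not resolve
if git -C "$REPO" cat-file -e "${SHA}^{commit}" 2>/dev/null; then
    echo "object present"
else
    echo "object missing"
fi
```

Run here it prints "object missing", matching what both ze03's cache and a fresh clone reported for that SHA.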
22:59 *** prometheanfire is now known as Guest2
23:02 <ianw> i'm not exactly sure how zuul came up with 01e50ed...
23:03 <clarkb> looking at the traceback it is doing a setRepoState which sets up all the refs/branches iirc
23:03 <ianw> grep -r '01e50ed' * in .git/refs doesn't match anything
23:04 <ianw> seems related
23:05 *** Guest2 is now known as prometheanfire
23:05 <clarkb> the commit for that PR seems to be 01250ed
23:05 <clarkb> if I can type it right :)
23:06 *** dhill is now known as Guest5
23:06 <ianw> the branch was deleted
23:06 *** Guest5 is now known as dhill_
23:07 <dhill_> rlandy|ruck, hi
23:07 <ianw> i just want to make sure before blaming github that this is not some generic error you see if *you* have an old ref
23:08 <clarkb> ianw: the thing Zuul is attempting to do there is set all the branches to the right state. So maybe we aren't handling a branch being deleted properly if that is what has happened? However what is odd is that the ref is in the repo so you would've expected to find it anyway
23:08 <rlandy|ruck> dhill_: hey
23:08 <clarkb> ianw: how do we know the branch was deleted? Does github tell us that?
23:08 <rlandy|ruck> join the party
23:09 <dhill_> rlandy|ruck, we have too many irc networks
23:09 <clarkb> or just that clicking on the branch in the PR is a 404?
23:09 <rlandy|ruck> discussion in progress
23:09 <ianw> $ git show aaaaaed7dac6cd25ce268d01cc457910633ccbf0
23:09 <ianw> fatal: bad object aaaaaed7dac6cd25ce268d01cc457910633ccbf0
23:09 <ianw> so we get the same error with a bogus ref anyway
23:09 <ianw> which makes me think it's zuul's fault here, somehow
23:09 <ianw> clarkb: i'm looking at the last comment in
23:09 <clarkb> ianw: aha thanks
23:10 <clarkb> ianw: as far as zuul doing something wrong. I think the expectation is that Zuul sets every branch to the ref that is pointed to upstream. If zuul somehow missed the delete branch event and then didn't notice the branch was gone when listing branches to set state on that could happen?
23:11 <clarkb> I think another thing that makes this confusing is this repo seems to use a rebase not merge method of committing PRs (the sha for the PR commit changed in the target branch and there is no merge commit)
23:11 <clarkb> basically the branch and the commit get discarded. And ya maybe it is possible for zuul to get confused in that state
23:12 <ianw> that is what ze03's git cache repo has
23:13 <ianw> oh, pr-4136 is ! 4208
23:13 <clarkb> and that is a valid branch
23:14 <clarkb> ianw: did it log which branch repo state setting it was angry about?
23:14 <ianw> let me go back and find the exception to see what's above it more
23:15 <clarkb> I'm curious if we asked zuul to try again and see if it works now and this is a race with deleting branches and trying to update them locally
23:15 <clarkb> Zuul checks repo and finds PR branch with sha foo, repo deletes branch with sha foo, zuul tries to set branch with sha foo and cannot find sha foo sort of ordering
23:16 <ianw> seems to happen between 21:53:16 and 22:04:41 so that's a big window
23:16 <ianw> like 8 retries
23:16 <dhill_> ianw, so if I "recheck" my failing patch we'll know ?
23:16 <ianw> dhill_: yes, it's worth trying at least please
23:17 <dhill_> ianw, I just rechecked this one
23:18 <ianw> clarkb: that's all we have above exception for this repo ->
23:18 <clarkb> 829610 appears to have been in the check queue for an hour and 16 minutes already
23:19 <dhill_> it fails here
23:19 <dhill_> tripleo-ci-centos-9-content-provider : RETRY_LIMIT in 23s
23:19 <ianw> felixfontein deleted the patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 branch 1 hour ago
23:19 <clarkb> dhill_: ya but that is from 21:54 and it is 23:19 now
23:19 <clarkb> dhill_: and it is in check for an hour and 16 minutes so ~22:02 ish?
23:21 <clarkb> if you hover the one hour ago it gives you a proper timestamp. I suspect that this is a race
23:21 <ianw> hovering that tells me it was 8:52am local time, which was ... 1 hour 28 minutes ago, so UTC 21:52
23:21 <clarkb> ya and the failure was a couple minutes after that. Depending on how long it took zuul to construct a state for the job (which isn't always quick depending on merges etc) we could hit this race
23:21 <clarkb> In a repo that merged commits this would be a non issue
23:22 <clarkb> you'd have the commit in the repo still and could checkout a branch to that (even if the branch shouldn't exist anymore)
23:22 <ianw> ok, that sort of makes sense, but why is it still failing on recheck
23:22 <clarkb> I think it is specifically the combo of deleting while zuul is processing state for jobs containing that repo and the repo must use lossy merge process
23:22 <clarkb> ianw: I don't think it is?
23:23 <clarkb> the change dhill_ linked to is in check and has been since ~22:02 ish
23:23 <dhill_> clarkb, yeah it failed with a retry
23:23 <clarkb> 829610 is currently in check and has been since 22:02
23:23 <clarkb> I think the recheck worked and it's been running?
23:23 <clarkb> I feel like I'm missing something
23:26 <ianw> yeah, i agree; tripleo-ci-centos-9-content-provider is and currently paused
23:27 <clarkb> infra-root the haproxy config update happened but we don't reload the haproxy config when that happens. We do reload when the docker compose file is updated. I'm going to manually run the graceful reload command. Then push up a change to fix our config management
23:27 <fungi> sounds good, thanks
23:29 <clarkb> the docker-compose kill -s HUP haproxy command has been run and there is a new process. The old ones go away when they are done serving content iirc. I can still reach so that's good
<opendevreview> Clark Boylan proposed opendev/system-config master: Reload haproxy when its config updates
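The fix pushed up amounts to handler wiring along these lines. This is a hypothetical sketch only: the host group, paths, and task names are invented for illustration rather than copied from system-config.

```yaml
# Sketch: re-render the haproxy config and gracefully reload the
# container only when the rendered file actually changed.
- hosts: gitea-lb
  tasks:
    - name: Write out the haproxy config
      template:
        src: haproxy.cfg.j2
        dest: /var/haproxy/etc/haproxy.cfg
      notify: Reload haproxy
  handlers:
    - name: Reload haproxy
      command: docker-compose -f /etc/haproxy-docker/docker-compose.yaml kill -s HUP haproxy
```

The `template` task only reports "changed" (and thus fires the handler) when the rendered content differs, so steady-state runs leave the running haproxy processes alone.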
23:36 *** diablo_rojo_phone is now known as Guest7
23:39 <clarkb> dhill_: rlandy|ruck rcastillo|rover ianw reporting some info back from the zuul channel. A typical solution to this would be setting however that only helps if the repo in question protects the branches you care about (and I'm not sure how to determine that)
23:40 <clarkb> is an alternative that would help us but it hasn't merged yet. You'd specify which branches you cared about in those repos regardless of their protection strategy
23:43 * rlandy|ruck reads back
23:45 <rlandy|ruck> hmmm ... good question
23:47 <clarkb> corvus notes that the second option presented by 804177 presents its own challenges because now you're getting a window on the repository and if you need something outside that window you may fail too
23:47 <clarkb> Really the best option would be to not delete commits :)
23:47 <clarkb> but I suspect we won't get very far making that suggestion to people
23:47 <rlandy|ruck> I am not sure I have a good answer on the branches
23:48 <dhill_> I don't understand anything of this lol
23:49 <rlandy|ruck> we'd need to set up something like:
23:49 <rlandy|ruck> exclude-unprotected-branches:
23:49 <rlandy|ruck> but be clear what exact branches we need
23:50 <rlandy|ruck> which is not always constant
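In tenant-config terms the two options being weighed look roughly like this. A sketch only: the tenant name and branch list are made up, and `include-branches` assumes the syntax from the still-unmerged change 804177.

```yaml
- tenant:
    name: example-tenant
    source:
      github:
        untrusted-projects:
          - ansible-collections/community.general:
              # only helps if upstream actually protects the branches
              # this tenant cares about
              exclude-unprotected-branches: true
              # proposed alternative (change 804177, not yet merged):
              # an explicit allow list of branches zuul will consider
              include-branches:
                - main
                - stable-4
```

Either way the burden shifts to keeping this list in sync with upstream, which is the sustainability concern raised below.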
23:50 <clarkb> dhill_: let me try to explain a different way. For every buildset (collection of jobs) that zuul runs it attempts to configure a consistent repository state for every repository the jobs know about. This can take some time as it has to inspect the git content of all the places the jobs are configured. If one of those repos (like the ansible collection repo) does lossy deletions while
23:50 <clarkb> zuul is processing that state this can happen.
23:50 <clarkb> Basically people in github are deleting content from their repo while zuul is trying to process that content and that results in brokenness. Rerunning does seem to work though
23:51 <clarkb> rlandy|ruck: ya that is what 804177 would add to zuul and ya you'd be potentially trying to keep up with upstream changes that way
23:51 <rlandy|ruck> which doesn't seem sustainable
23:51 <dhill_> no failures yet
23:51 <rlandy|ruck> considering the number of repos we include
23:52 <clarkb> In general I think my suggestion would be that people don't perform lossy operations on their git repos as a good first step. But people do it because they like the cleanliness of it
23:52 <rlandy|ruck> clarkb: ianw: fungi: thanks for following this through
23:52 <rlandy|ruck> and yeah - it's a possible hit for us on any repo
23:52 <rlandy|ruck> at any time
23:53 <clarkb> rlandy|ruck: well only those that allow lossy operations. Those in gerrit shouldn't
23:53 <clarkb> I suppose another option here is to stop using those repos via zuul and clone them from github instead
23:53 <rlandy|ruck> that has other challenges
23:53 <clarkb> you'd trade lossy operation races for network bw and failures
23:53 <rlandy|ruck> we have done that in the past
23:53 <fungi> and could no longer depends-on pull requests
23:53 <rlandy|ruck> and decided zuul was far more reliable
23:54 <rlandy|ruck> I think we go forward as is - seems like the most maintainable solution
23:54 <clarkb> if you talk to your upstreams they may be willing to protect the branches that you care about then you can set that flag
23:54 <clarkb> and/or suggest they stop deleting content :)
23:55 <rlandy|ruck> I'll pick it up tomorrow morning with our internal infra people
23:56 <rlandy|ruck> we need a standard line here wrt repos we intend to include
23:56 <rlandy|ruck> thanks again ... I'm out
23:56 *** rlandy|ruck is now known as rlandy|out

Generated by 2.17.3 by Marius Gedminas - find it at!