Wednesday, 2022-08-17

*** ysandeep|out is now known as ysandeep|ruck01:36
*** ysandeep|ruck is now known as ysandeep|ruck|afk02:32
*** ysandeep|ruck|afk is now known as ysandeep|ruck03:45
opendevreviewMerged opendev/system-config master: refstack: trigger image upload
*** jpena|off is now known as jpena07:34
*** ysandeep|ruck is now known as ysandeep|ruck|lunch08:11
gthiemongeHi Folks, gerrit is acting weird: shows no modification of the files, and I cannot download the patch with git review -d (however git fetch works)08:22
gibigthiemonge: yeah I see similar errors now in gerrit08:27
gibiafter couple of refresh now does not even load08:27
mnasiadkasame here08:29
mnasiadkafrickler: alive? (probably you're the only one in the EU timezone)08:29
gthiemongegibi: yeah it looks down now08:31
fricklerI can confirm that gerrit seems to have issues, checking logs now08:32
mariosah k already known then 08:34
mariosthanks frickler 08:34
*** ysandeep|ruck|lunch is now known as ysandeep|ruck08:39
fricklero.k., restarted gerrit. I'm seeing the "empty files section" issue for other patches, too, but it seems to resolve when waiting long enough (like a minute or so). maybe give gerrit a moment to get itself sorted again08:48
ykarelThanks frickler 08:53
ykarelstill not loading, likely need to wait some more08:54
gthiemongefrickler: thanks!08:56
gibiI get proxy error from gerrit now08:58
ykarelworking now09:04
gibiyep working here now too09:10
gibithank you frickler 09:10
fricklerseems to have recovered for now, please ping infra-root if you notice any further issues09:12
mnasiadkaSeems that rockylinux-9 nodes are not spawning? looks like those are stuck in "queued"10:15
*** ysandeep|ruck is now known as ysandeep|afk10:35
frickler#status log added slaweq to newly created whitebox-neutron-tempest-plugin-core group (
opendevstatusfrickler: finished logging10:45
*** ysandeep|afk is now known as ysandeep|ruck11:30
*** dviroel|afk is now known as dviroel11:32
opendevreviewMerged openstack/project-config master: Add StackHPC Ansible roles used by openstack/kayobe
fungimnasiadka: looking through the launcher logs, it seems the nodes are booting but nodepool is unable to gather ssh host keys from them and eventually gives up and tries to boot another11:46
fungiso something probably changed in rocky 9 with regard to networking, service startup, or openssh11:47
*** ysandeep|ruck is now known as ysandeep|ruck|mtg11:57
mnasiadkafungi: uhh12:11
mnasiadkaNeilHanlon: any idea? ^^12:11
*** ysandeep|ruck|mtg is now known as ysandeep|ruck12:24
NeilHanloni will take a look mnasiadka12:32
NeilHanlonfungi: could you remind me where (or if) I can poke at the launcher logs?12:45
Clark[m]frickler: ykarel: gthiemonge: gibi: tristanC: I think the current thought on the Gerrit issues there is that software factory's zuul is making enough git requests that we hit our http server limits. We've asked them to improve the spread of requests using zuul's jitter settings but not sure if that has happened yet.13:21
Clark[m]Restarting Gerrit won't necessarily fix the problem if sf continues to make a lot of requests 13:21
gibiClark[m]: thanks for the information13:22
Clark[m]We can also increase the number of http threads the Gerrit server will support (this is the limit we hit that causes your client requests to fail). But that will require maths to determine what a reasonable new value is. Considering the OpenDev zuul doesn't create these issues I'm hesitant to increase it for SF and instead continue to work with them to better tune zuul13:24
tristanCClark[m]: we recently integrated zuul 6.2 to enable the jitter option, but we need to schedule a release+upgrade to apply it to production.13:26
tristanCI think the issue is coming from one of these periodic pipeline:
tristanCfor example in this pipeline config, i think most jobs have many opendev project in their required-projects attribute:
tristanCwould adding jitter to this pipeline actually work? if i understand correctly, this constitutes a single buildset; does the jitter apply to the individual builds?13:45
Clark[m]That pipeline seems to start at 09:10, but we are seeing the issues at closer to 08:10. Are your clocks UTC on the zuul servers? Maybe they are UTC+1?13:47
Clark[m]tristanC the changes to jitter in 6.2 apply it to each buildset I think. Previously the entire pipeline was triggered with the same random offset but now each buildset (for each project+branch combo) has a separate offset13:49
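The difference Clark describes can be sketched in a few lines of Python. This is only an illustration of the two scheduling behaviours as stated in the message above, not Zuul's implementation; the function names, the buildset labels, and the `max_jitter_s` parameter are all hypothetical.

```python
import random


def pipeline_wide_offsets(buildsets, max_jitter_s, rng):
    """Older behaviour as described: one random offset shared by every buildset."""
    offset = rng.uniform(0, max_jitter_s)
    return {name: offset for name in buildsets}


def per_buildset_offsets(buildsets, max_jitter_s, rng):
    """Newer behaviour as described: each project+branch buildset draws its own offset."""
    return {name: rng.uniform(0, max_jitter_s) for name in buildsets}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs print the same numbers
    names = [f"project-{i}@master" for i in range(5)]  # hypothetical buildsets
    print(pipeline_wide_offsets(names, 600, rng))
    print(per_buildset_offsets(names, 600, rng))
```

Under this model a pipeline that consists of a single buildset receives exactly one offset either way, so its builds still start together.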
ykarelClark[m], Thanks for the info13:51
corvusthe gerrit folks are working on removing the extension point that the zuul-summary-plugin uses: (cc Clark ianw )13:53
tristanCClark[m]: alright, then I think the jitter won't help because rdo's pipelines mostly use a single buildset13:53
Clark[m]corvus: yes I responded to them on slack indicating the zuul summary plugin uses that. Nasser thought it should be kept as generally useful too13:54
Clark[m]tristanC ok, what do you suggest then?13:54
corvusClark: i know.  just pointing out that ben has a change outstanding13:55
corvusso is something that can be tracked now13:55
Clark[m]tristanC: I think on my end we have two options. Increasing Gerrit limits (and resource usage). Blocking SF's account/IP. Neither are great13:56
tristanCClark[m]: would it be possible to limit the number of connection made by the gerrit driver?14:06
tristanCotherwise we are looking for ways to split the pipeline to spread the build execution14:06
Clark[m]tristanC: The issue isn't the Gerrit driver but the mergers I expect. Though I guess there is relationship between them. The Gerrit driver should already reuse connections for api requests.14:07
Clark[m]Within the merger the git operations act as separate git processes so those connections cannot be reused14:09
Clark[m]corvus does have a change to reduce the number of Gerrit requests. I'll make time to review that today14:13
*** dasm|off is now known as dasm14:22
*** dviroel is now known as dviroel|lunch15:07
fungiNeilHanlon: i don't think we publish the logs from our launchers, historically there may have been some concern around the potential to leak cloud credentials into them, but we can revisit that decision. in this case the launcher logs aren't much help because they simply say that the builder sees the api report the node has booted, then the launcher is repeatedly trying to connect to the node's15:24
fungisshd in order to record the host keys from it, and eventually gives up raising "nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to on port 22" or similar15:24
fungiwe can turn on boot log collection, which might be the next step (either that or manually booting the image in a cloud and trying to work out why it's not responding)15:25
fungier, nova console log collection i mean15:25
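The failure mode fungi describes (the node is reported as booted, but nothing ever answers on port 22 before the launcher gives up) can be reproduced by hand with a small probe. The following is a rough Python sketch, not nodepool's code; the function name, timeout, and retry interval are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, timeout_s=60.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Each attempt is bounded by interval_s so a black-holed host cannot stall us.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: wait briefly and try again.
            time.sleep(interval_s)
    return False
```

A node that boots into emergency mode or never brings up networking would make this return `False`, which is the point where console log collection becomes useful.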
*** ysandeep|ruck is now known as ysandeep|ruck|afk15:26
johnsomHi everyone, we have been seeing an uptick in POST_FAILURE results recently:
johnsomHere is an example:
johnsomThe three-four I have seen have a similar connection closed log.15:27
*** marios is now known as marios|out15:32
fungijohnsom: is it always rsync complaining about connection closed during the fetch-output collection task, or at random points in the job?15:34
fungithe times also look somewhat clustered15:36
fungiwhich leads me to wonder whether there are issues (network connectivity or similar) impacting connections between the executors and job nodes15:37
fungimight be good to try and see if there's any correlation to a specific provider or to a specific executor15:37
fungi(i'm technically supposed to be on vacation this week, so will be in trouble with the wife if caught spending time looking into it)15:39
*** ysandeep|ruck|afk is now known as ysandeep|ruck15:45
johnsomfungi Enjoy your time! I will poke around in the logs and get some answers. We saw this in recent days as well, but figured it was a temporary bump.15:50
johnsomI have started to put together an etherpad:
*** dviroel|lunch is now known as dviroel16:12
johnsomI have collected five random occurrences there. Let me know if you would like more16:13
*** ysandeep|ruck is now known as ysandeep|out16:18
*** jpena is now known as jpena|off16:26
fungijohnsom: a quick skim suggests it's spread across our providers fairly evenly (as much as can be projected from the sample size anyway). seems to be all focal but that's probably just because vast majority of our builds are anyway. the other potentially useful bit of info would be which executor it ran from, since that's the end complaining about the connection failures:16:34
johnsomfungi Ack, I will collect that for the list16:34
fungiif it all maps back to one or two executors then that may point to a problem with them or with a small part of the local network in the provider we host them all in16:34
fungiif it's spread across the executors then it could still be networking in rax-dfw since that's where all the executors reside16:35
johnsomNot just one executor16:37
johnsomThough a few of them were on 0316:37
clarkbdue to the tripleo localhost ansible issues I'm really suspecting something on the ansible side of things not the infrastructure17:00
clarkbin their case it was the nested ansible installed by rpm, but zuul has a relatively up to date ansible install too that could be having trouble as well17:01
clarkbthe etherpad data also seems to point at it being unlikely that a specific provider is struggling17:01
clarkbthe error messages look almost identical to the tripleo ones fwiw. The difference is tripleo's are talking to from
clarkblooking at johnsom's examples though this is primarily with rsync. The tripleo examples were for normal ansible ssh to run tasks iirc17:03
johnsomThe port is 22 so it seems it's rsync over ssh17:04
clarkbya just pointing out that maybe it isn't ansible itself that is struggling (since I mentioned I suspected ansible previously)17:04
clarkbjohnsom: in the tripleo cases according to journald sshd was creating a session for the connection then closing it at the request of the client17:05
clarkbjohnsom: on the ansible side it basically logged what you are seeing but for regular ssh not rsync17:05
clarkb"connection closed by remote\nconnection closed by"17:06
clarkbIn both cases we were also able to run subsequent tasks without issue17:06
clarkbdviroel: rcastillo|rover: ^ fyi. Also it just occurred to me that your ansible must not be using ssh control persistence since journald was logging a new connection for every task. That might be worth changing regardless of what the problem is here17:07
clarkbI kinda want to focus on the tripleo examples because they are to localhost which eliminates a lot of variables17:09
clarkbansible 2.12.8 did release on Monday17:14
clarkbAnd we did redeploy executors around then17:14
clarkb(however it isn't clear to me how that ends up in the `ansible` package on pypi since pypi reports a release older than that as the latest)17:14
clarkbok ansible-core released on monday. Does `ansible` install `ansible-core`? /me checks17:16
clarkbyes it does17:17
clarkbok I'm now suspicious of ansible again :)17:18
clarkbBut we're running ansible-core 2.12.7 on ze0117:20
clarkbwe also saw similar with our testinfra stuff for system-config that also runs outside of zuul with a nested ansible between test nodes in the same provider region17:23
clarkbAll that to say I think whatever is causing this isn't specific to a provider or even a specific distro. It seems to affect centos 8 and 9 and ubuntu focal at least. It affects connections to localhost in addition to those between providers and within a provider17:24
clarkbthat still makes me suspect ansible because it is a common piece in all of that. SSH (via openssh not paramiko) is also a common piece, but its versions will vary across centos 8 and 9 and ubuntu focal17:24
NeilHanlonfungi: thank you very much and enjoy your vacation!!17:28
dviroelclarkb: ack - thanks for the heads up17:29
tristanCclarkb: corvus: thinking about the gerrit overload caused by rdo's zuul, would adding jitters between build within a periodic buildset be a solution?17:30
opendevreviewClark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests
clarkbtristanC: ^ fyi I think maybe we can give the httpd more threads to keep the web ui responsive while the git requests queue up17:33
clarkbI was worried about increasing things and having all the threads get consumed and not really changing anything, but if that works as I think it might this may be a reasonable workaround for now to keep the ui responsive for users17:34
opendevreviewClark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests
clarkbof course my first ps did the wrong thing though :)17:34
clarkbtristanC: in opendev I believe that happens naturally via the randomness imposed by nodepool allocations17:35
clarkbtristanC: last week I wondered if maybe you are using container based jobs that allocate immediately which negates that benefit17:36
clarkbOne of johnsom's examples did not run out of disk17:47
clarkb(just ruling that out as ansible ssh things don't fare well with full disks)17:48
clarkbplenty of inodes too17:48
tristanCclarkb: i meant, it looks like one buildset may contain many jobs using , and i wonder if the zuul scheduler is updating each project repeatedly, even before making node requests or starting the jobs17:51
clarkbtristanC: I think it should do it once to determine what builds to trigger for the buildset when the event arrives. Then each individual build will do independent merging (using the same input to the merger which should produce the same results). corvus can confirm or deny that though17:52
tristanCin other words, if a buildset contains 10 jobs that require the same 10 projects, does the scheduler perform 100 git update?17:53
clarkbI don't think so.17:54
clarkbthat was fun my laptop remounted fses RO after resuming from suspend19:10
clarkbI think it isn't starting the nvme device up properly19:10
clarkbjrosser_: do you think could be the source of our ansible issues above?19:21
clarkbjrosser_: is a specific example and I notice that rc 255 appears to be the code returned when it is recoverable in your bug19:22
clarkbhrm it seems the errors that ansible bubble up are different though19:24
clarkbapparently this error can happen if sshd isn't able to keep up with incoming ssh session starts (so maybe the scanners are having an impact? I'm surprised that openssh would fall over that easily though)19:29
clarkbsshd_config MaxStartups maybe needing to be edited?19:33
opendevreviewClark Boylan proposed openstack/project-config master: Tune sshd connections settings on test nodes
clarkbjohnsom: dviroel rcastillo|rover ^ fyi that may be the problem / fix?19:43
johnsomSo the IPs in the error is actually the nodepool instance? I.e. the executor is ssh/rsync to the instance under test.19:45
johnsomThat would be interesting and sad at same time if that is the issue...19:46
clarkbjohnsom: yes the issue is ansible on the executor is unable to establish a connection to the nodepool test node. In the tripleo case it was on the test node connecting to on the test node. What the tripleo logs showed was that ssh scanning/brute forcing was in progress at the same time too19:46
clarkbso ya it could be that we're just randomly colliding with the scanners19:46
clarkbIt would explain why it happens regardless of node and provider and platform and whether or not you cross actual networking19:46
clarkbif this is the fix it will certainly be one of the more unique issues I've looked into19:51
dviroelclarkb: oh nice, that may help us yes. Thanks for digging and proposing this patch. 19:57
dviroelclarkb: i saw just one failure like this as for today, nothing like yesterday19:58
dviroelwe do have a LP bug and a card to track this issue19:58
clarkbdviroel: only if you created one19:58
clarkbI'm not sure what a card would be19:58
dviroelclarkb: I mean, we already created a tripleo bug19:59
dviroeland an internal ticket to track this issue19:59
dviroelwe can provide feedback if the issue remains even after your patch gets merged20:00
clarkbthat would be great. If it is scanners it may go away after whoever is scanning slows down too. Either way hopefully this makes things happier20:00
johnsomI may create a canary job that logs the ssh connections, just to see what it looks like.20:05
clarkbjohnsom: ya that would probably be good data. The tripleo jobs were catching them in regular journal logs for sshd20:13
clarkbinfra-root I've manually applied 853536 to a centos stream 8 node and a focal node and restarted sshd. It seems it doesn't break anything at least20:13
clarkbps output for sshd on ubuntu shows 0 of 30-100 startups now instead of 0 of 10-100 so I think it is taking effect too20:13
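For context on why raising the first number matters: OpenSSH's `MaxStartups` takes the form `start:rate:full`, and sshd begins randomly refusing new connections once `start` unauthenticated connections are pending. The Python sketch below models that drop probability from the documented behaviour (refuse with probability `rate`/100 at `start`, rising linearly to certain refusal at `full`). The `rate` of 30 is OpenSSH's default and is an assumption here, since the log only shows the `start` and `full` values.

```python
def drop_probability(unauthenticated, start, rate, full):
    """Chance that sshd refuses a new connection, per the MaxStartups start:rate:full rule.

    unauthenticated: number of connections currently pending authentication.
    """
    if unauthenticated < start:
        return 0.0  # below the threshold nothing is dropped
    if unauthenticated >= full or rate >= 100:
        return 1.0  # at or above the hard limit every attempt is refused
    # Linear ramp from rate% at `start` up to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100.0


if __name__ == "__main__":
    for pending in (5, 10, 20, 30, 65, 100):
        before = drop_probability(pending, 10, 30, 100)  # 10-100, as in the old ps output
        after = drop_probability(pending, 30, 30, 100)   # 30-100, as in the new ps output
        print(f"{pending:3d} pending: before={before:.2f} after={after:.2f}")
```

With the threshold at 30 instead of 10, a burst of scanner connections has to be roughly three times larger before a legitimate ssh or rsync connection risks being dropped.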
clarkbNeilHanlon: fungi: I'm trying to boot a rocky 9 node in ovh bhs1 now to see what happens20:45
clarkbthere is a console log so we have a kernel. Not the grub issue again20:49
clarkbIt does not ping or ssh though20:49
clarkbaha but it eventually drops into emergency mode20:50
clarkbit's not setting the boot disk properly20:53
clarkbI'll get logs pasted shortly20:53
clarkbNeilHanlon: fungi: it is trying to boot off of the nodepool builder's disk by uuid20:56
clarkbdib image builds should be selecting the disk by label20:56
clarkbcloudimg-rootfs for ext4 and img-rootfs for xfs. We use ext4. I wonder why they differ20:59
clarkbah cloudimg-rootfs is what everyone used for forever then xfs became popular for rootfs and that label is too long for it21:00
*** dviroel is now known as dviroel|afk21:02
*** timburke_ is now known as timburke21:03
clarkbalso you should not pay attention to this fungi :) we'll get it sorted21:18
clarkbNeilHanlon: it almost seems like rocky grub is ignoring GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub21:28
clarkbwe set GRUB_DEVICE_LABEL=cloudimg-rootfs and GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub but the resulting grub config has the builder's root disk uuid21:29
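The mismatch described here (label-based settings in /etc/default/grub, but a UUID in the generated boot config) is easy to check mechanically. Below is a hedged Python sketch of such a check. Only the two variable names come from the discussion; the function names are hypothetical, and the assumption that kernel arguments appear on lines starting with `linux` or `options` is a simplification that may not cover every distro layout.

```python
import re
import shlex


def parse_grub_defaults(text):
    """Parse KEY=value lines in the style of /etc/default/grub into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)  # strips surrounding quotes, if any
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def root_arguments(boot_config_text):
    """Collect root=... kernel arguments from 'linux' or 'options' lines."""
    found = []
    for line in boot_config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("linux", "options")):
            found.extend(re.findall(r"\broot=(\S+)", stripped))
    return found


def uses_uuid_despite_label(defaults_text, boot_config_text):
    """True if label booting was requested but the generated config still uses a UUID."""
    settings = parse_grub_defaults(defaults_text)
    wants_label = (
        settings.get("GRUB_DISABLE_LINUX_UUID") == "true"
        and "GRUB_DEVICE_LABEL" in settings
    )
    return wants_label and any(r.startswith("UUID=") for r in root_arguments(boot_config_text))
```

Running a check like this against the image contents right after the build would flag the problem before the image is uploaded to any cloud.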
clarkbgrub packages don't look like they've updated super recently21:34
fungiyeah, i'm trying to pretend i'm not aware of all the broken, and resisting my innate urge to troubleshoot21:35
clarkbthe number of patches to grub is ... daunting21:38
fungithat's in rocky (and presumably copied from rhel)?21:42
corvusi'm going to restart zuul-web so we can see the job graph changes live21:42
fungithanks! excited to see that go live finally21:43
clarkbI think maybe we need GRUB_DISABLE_UUID too
clarkbianw: ^ you probably understand this much better than I do if you want to take a look when your day starts21:45
clarkbthe patch for GRUB_DISABLE_UUID is one of those 274 patches on rocky21:46
NeilHanloni'll take a read through this later... who knew booting an OS could be so friggin' long lol21:50
corvusokay!  deep links!
corvusthat's a job graph for what happens if you upload a change to system-config21:53
fungiooh, yes21:54
corvusand the frozen jobs are deep-linkable too:
*** dasm is now known as dasm|off22:00
ianwreading backwards; thanks for noting the tab extension point removal.  i guess that is marked as wip but i don't see any context about it.  anyway, it's good that it's been called out22:06
ianwif we see the slowdowns on gerrit probably worth checking the sshd log, that's where i saw the sf logins.  i don't think it's http?22:07
ianwit is weird that rocky would boot in the gate but not on hosts.  we've seen something similar before ... trying to remember.  it was something about how in the gate the host already had a labeled fs, and that was leaking through22:10
clarkbianw: httpd and sshd share a common thread pool limit though22:10
ianw was the change i'm thinking of22:11
clarkbI tried to capture the relationship in re gerrit22:11
clarkbbasically I think we may be starving the common thread pool and then it falls over from there? Idea is maybe increasing httpd max threads above the shared limit will allow the web ui to continue working though22:12
clarkbthat won't fix the git pushes and fetches being slow, but maybe it will make the UI responsive through it22:14
clarkband then we can bump up sshd.threads as well as httpd.maxThreads if that doesn't cause problems. Just trying to be conservative as we grow into the new server22:14
ianwsshd.threads -> "When SSH daemon is enabled then this setting also defines the max number of concurrent Git requests for interactive users over SSH and HTTP together." ... ok then?22:15
ianwmaybe it should have a different name then22:15
clarkbya I agree it is super confusing22:15
clarkbI stared at a bit this morning then went on a bike ride and eventually ended up with the change I linked22:15
clarkband I'm still not entirely sure it will do what I think it will do22:16
ianwclarkb: i wasn't aware of that one22:16
clarkbya so service users has another issue for us22:16
clarkbits binary and we really have three classes of user22:17
clarkbwe have humans, zuul, and all the other bots22:17
clarkbI brought this up with upstream a while back after the 3.5 or 3.4 upgrade (whichever put that in place and started making zuul stuff slow)22:17
clarkbwe ended up setting batch threads to 0 so that the pools are shared since we have the three classes of users and can't really separate them in a way that makes sense22:18
clarkbhowever, maybe putting sf in its own thread pool as a service user and having everything else mix with the humans would be an improvement?22:18
* clarkb is just thinking out loud here22:18
clarkbalso I'm melting into the couch so the brain may not be operating well22:21
ianwyeah, i think maybe starting with the thread increases as the simple thing and going from there would be good first steps22:23
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Support LVM thin provisioning
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add thin provisioning support to growvols
clarkbianw: frickler: that brings up a good point though that if/when this happens again we should see about running a show queue against gerrit22:24
clarkb`gerrit show-queue -w -q` that should give us quite a bit of good info on what is going on22:25
ianwyeah i did do that when i first saw it22:26
ianwi can't remember what i saw :/22:26
ianwmaybe i logged something22:26
fungimaybe it was so horrific you repressed that memory22:27
clarkband then is a completely unrelated but interesting change too22:27
clarkbfungi: ^ re the tripleo and other ssh issues22:27
ianwi didn't : ... i can't remember now.  i don't think i saw a smoking gun22:28
ianwanyway, yes ++ to running and taking more notice of the queue stats22:28
ianwinteresting ...22:30
clarkband also I'm not sure restarting gerrit helps anything. I don't think we are running out of resources. We're just hitting our artificial limits and everything is queueing behind that22:31
clarkbBut to double check that gerrit show-caches will emit memory info so that is another good command to try22:31
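To make it easier to capture both commands the next time Gerrit stalls, a small wrapper can run them over Gerrit's SSH interface and save the output. This is a minimal Python sketch; the host and user in the usage note are placeholders, 29418 is Gerrit's default SSH port and may differ per deployment, and the account used needs permission to view the queue and caches.

```python
import subprocess


def gerrit_ssh_command(host, user, *gerrit_args, port=29418):
    """Build the argv for running a gerrit subcommand over SSH."""
    return ["ssh", "-p", str(port), f"{user}@{host}", "gerrit", *gerrit_args]


def run_gerrit(host, user, *gerrit_args, port=29418, timeout_s=30):
    """Run the command and return its stdout; raises if ssh exits non-zero or hangs."""
    result = subprocess.run(
        gerrit_ssh_command(host, user, *gerrit_args, port=port),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

For example, `run_gerrit("gerrit.example.org", "admin", "show-queue", "-w", "-q")` followed by `run_gerrit("gerrit.example.org", "admin", "show-caches")` would collect the two outputs mentioned above; the timeout matters because an overloaded server may not answer at all.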
ianwyeah that was the conclusion that i came to at the time; it seemed like it had stopped responding to new requests but was still working22:31
ianwi wonder with the ssh thing, can you just instantly block anything trying password auth?22:32
ianwthis is just thinking out loud, i've never looked into it22:33
clarkbianw: we do already reject password auth, but I think the protocol itself allows you to try password auth and fail, then try key 1 and then key 2 and so on 22:33
ianwyeah, i meant maybe more than just reject it, if you try a password you get blocked.  but it's probably not really possible due to the negotiation phase22:34
fungiyes, ssh protocol is designed to accept multiple authentication mechanisms and "negotiate" between the client and server by trying different ones until something works22:34
ianwyeah even something like fail2ban is scanning rejection logs22:35
ianweven rate limiting ip's to 22 probably doesn't help, as it's multiple scanners i guess22:37
fungithough also we do have concurrent connection limits enforced by conntrack22:38
clarkband zuul can be pretty persistent22:38
clarkbthe reason we're seeing rsync fail is that it doesn't go through the controlpersist process22:38
clarkbdepending on how many rsyncs you do in a short period of time you could hit a rate limit22:39
clarkbI suspect we're right on the threshold too which is why we don't see this super frequently22:39
clarkbbut I could be wrong22:39
clarkbthis is half hunch at this point anyway22:39
clarkbbut it does explain what tripleo observed pretty neatly22:40
opendevreviewMerged openstack/project-config master: Tune sshd connections settings on test nodes
johnsomclarkb On a two hour tempest job, there were not a large number of SSH hits. Could just be the timing, but it's not an ongoing thing.23:45
johnsom56 SSH connections over that two hours, including the zuul related connections23:46
clarkbjohnsom: right I think that is why it affects so few jobs23:47
clarkbits all about getting lucky23:47
fungithat would also explain the clustering i saw for timestamps when looking at the zuul builds list23:53
fungiyou would get several failing around the same time, with gaps between23:53
clarkbya and it is possible I've misdiagnosed it too. But based on the tripleo logs I'm reasonably confident it is this23:54
clarkbit perfectly explains why ssh to localhost randomly doesn't work23:55
