*** ysandeep|out is now known as ysandeep|ruck | 01:36 | |
*** ysandeep|ruck is now known as ysandeep|ruck|afk | 02:32 | |
*** ysandeep|ruck|afk is now known as ysandeep|ruck | 03:45 | |
opendevreview | Merged opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/853251 | 04:02 |
*** jpena|off is now known as jpena | 07:34 | |
*** ysandeep|ruck is now known as ysandeep|ruck|lunch | 08:11 | |
gthiemonge | Hi Folks, gerrit is acting weird: https://review.opendev.org/c/openstack/octavia/+/850590 shows no modification of the files, and I cannot download the patch with git review -d (however git fetch works) | 08:22 |
gibi | gthiemonge: yeah I see similar errors now in gerrit | 08:27 |
gibi | after a couple of refreshes, review.opendev.org does not even load now | 08:27 |
mnasiadka | yup | 08:29 |
mnasiadka | same here | 08:29 |
mnasiadka | frickler: alive? (probably you're the only one in the EU timezone) | 08:29 |
gthiemonge | gibi: yeah it looks down now | 08:31 |
frickler | I can confirm that gerrit seems to have issues, checking logs now | 08:32 |
marios | ah k already known then | 08:34 |
marios | thanks frickler | 08:34 |
*** ysandeep|ruck|lunch is now known as ysandeep|ruck | 08:39 | |
frickler | o.k., restarted gerrit. I'm seeing the "empty files section" issue for other patches, too, but it seems to resolve when waiting long enough (like a minute or so). maybe give gerrit a moment to get itself sorted again | 08:48 |
ykarel | Thanks frickler | 08:53 |
ykarel | still not loading, likely need to wait some more | 08:54 |
gthiemonge | frickler: thanks! | 08:56 |
gibi | I get proxy error from gerrit now | 08:58 |
ykarel | working now | 09:04 |
gibi | yep working here now too | 09:10 |
gibi | thank you frickler | 09:10 |
frickler | seems to have recovered for now, please ping infra-root if you notice any further issues | 09:12 |
mnasiadka | Seems that rockylinux-9 nodes are not spawning? looks like those are stuck in "queued" | 10:15 |
*** ysandeep|ruck is now known as ysandeep|afk | 10:35 | |
frickler | #status log added slaweq to newly created whitebox-neutron-tempest-plugin-core group (https://review.opendev.org/c/openstack/project-config/+/851031) | 10:45 |
opendevstatus | frickler: finished logging | 10:45 |
*** ysandeep|afk is now known as ysandeep|ruck | 11:30 | |
*** dviroel|afk is now known as dviroel | 11:32 | |
opendevreview | Merged openstack/project-config master: Add StackHPC Ansible roles used by openstack/kayobe https://review.opendev.org/c/openstack/project-config/+/853003 | 11:39 |
fungi | mnasiadka: looking through the launcher logs, it seems the nodes are booting but nodepool is unable to gather ssh host keys from them and eventually gives up and tries to boot another | 11:46 |
fungi | so something probably changed in rocky 9 with regard to networking, service startup, or openssh | 11:47 |
*** ysandeep|ruck is now known as ysandeep|ruck|mtg | 11:57 | |
mnasiadka | fungi: uhh | 12:11 |
mnasiadka | NeilHanlon: any idea? ^^ | 12:11 |
*** ysandeep|ruck|mtg is now known as ysandeep|ruck | 12:24 | |
NeilHanlon | i will take a look mnasiadka | 12:32 |
NeilHanlon | fungi: could you remind me where (or if) I can poke at the launcher logs? | 12:45 |
Clark[m] | frickler: ykarel: gthiemonge: gibi: tristanC: I think the current thought on the Gerrit issues there is that software factory's zuul is making enough git requests that we hit our http server limits. We've asked them to improve the spread of requests using zuul's jitter settings but not sure if that has happened yet. | 13:21 |
Clark[m] | Restarting Gerrit won't necessarily fix the problem if sf continues to make a lot of requests | 13:21 |
gibi | Clark[m]: thanks for the information | 13:22 |
Clark[m] | We can also increase the number of http threads the Gerrit server will support (this is the limit we hit that causes your client requests to fail). But that will require maths to determine what a reasonable new value is. Considering the OpenDev zuul doesn't create these issues I'm hesitant to increase it for SF and instead continue to work with them to better tune zuul | 13:24 |
tristanC | Clark[m]: we recently integrated zuul 6.2 to enable the jitter option, but we need to schedule a release+upgrade to apply it to production. | 13:26 |
tristanC | I think the issue is coming from one of these periodic pipelines: https://review.rdoproject.org/cgit/config/tree/zuul.d/upstream.yaml#n96 | 13:27 |
tristanC | for example in this pipeline config, i think most jobs have many opendev projects in their required-projects attribute: https://review.rdoproject.org/cgit/rdo-jobs/tree/zuul.d/integration-pipeline-wallaby-centos9.yaml | 13:43 |
tristanC | would adding jitter to this pipeline actually work? if i understand correctly, this constitutes a single buildset; does the jitter apply to the individual builds? | 13:45 |
Clark[m] | That pipeline seems to start at 09:10, but we are seeing the issues at closer to 08:10. Are your clocks UTC on the zuul servers? Maybe they are UTC+1? | 13:47 |
Clark[m] | tristanC the changes to jitter in 6.2 apply it to each buildset I think. Previously the entire pipeline was triggered with the same random offset but now each buildset (for each project+branch combo) has a separate offset | 13:49 |
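For context, a minimal sketch of where such jitter would be configured, on the periodic pipeline's timer trigger; the `jitter` attribute name and its unit (maximum seconds of random delay) are assumptions to verify against the Zuul 6.2 timer driver documentation:

```yaml
# Hypothetical periodic pipeline definition; not copied from rdo's config.
- pipeline:
    name: openstack-periodic-integration
    manager: independent
    trigger:
      timer:
        # Fire daily at 09:10; the assumed "jitter" attribute would delay each
        # project+branch buildset by a random offset of up to 600 seconds.
        - time: '10 9 * * *'
          jitter: 600
```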
ykarel | Clark[m], Thanks for the info | 13:51 |
corvus | the gerrit folks are working on removing the extension point that the zuul-summary-plugin uses: https://gerrit-review.googlesource.com/c/gerrit/+/342836 (cc Clark ianw ) | 13:53 |
tristanC | Clark[m]: alright, then I think the jitter won't help because | 13:53 |
tristanC | rdo's pipelines mostly use a single buildset | 13:53 |
Clark[m] | corvus: yes I responded to them on slack indicating the zuul summary plugin uses that. Nasser thought it should be kept as generally useful too | 13:54 |
Clark[m] | tristanC ok, what do you suggest then? | 13:54 |
corvus | Clark: i know. just pointing out that ben has a change outstanding | 13:55 |
corvus | so it's something that can be tracked now | 13:55 |
Clark[m] | tristanC: I think on my end we have two options. Increasing Gerrit limits (and resource usage). Blocking SF's account/IP. Neither are great | 13:56 |
tristanC | Clark[m]: would it be possible to limit the number of connection made by the gerrit driver? | 14:06 |
tristanC | otherwise we are looking for ways to split the pipeline to spread the build execution | 14:06 |
Clark[m] | tristanC: The issue isn't the Gerrit driver but the mergers I expect. Though I guess there is a relationship between them. The Gerrit driver should already reuse connections for api requests. | 14:07 |
Clark[m] | Within the merger the git operations act as separate git processes so those connections cannot be reused | 14:09 |
Clark[m] | corvus does have a change to reduce the number of Gerrit requests. I'll make time to review that today | 14:13 |
*** dasm|off is now known as dasm | 14:22 | |
*** dviroel is now known as dviroel|lunch | 15:07 | |
fungi | NeilHanlon: i don't think we publish the logs from our launchers, historically there may have been some concern around the potential to leak cloud credentials into them, but we can revisit that decision. in this case the launcher logs aren't much help because they simply say that the builder sees the api report the node has booted, then the launcher is repeatedly trying to connect to the node's | 15:24 |
fungi | sshd in order to record the host keys from it, and eventually gives up raising "nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 198.72.124.26 on port 22" or similar | 15:24 |
fungi | we can turn on boot log collection, which might be the next step (either that or manually booting the image in a cloud and trying to work out why it's not responding) | 15:25 |
fungi | er, nova console log collection i mean | 15:25 |
*** ysandeep|ruck is now known as ysandeep|ruck|afk | 15:26 | |
johnsom | Hi everyone, we have been seeing an uptick in POST_FAILURE results recently: https://zuul.openstack.org/builds?result=POST_FAILURE&skip=0 | 15:26 |
johnsom | Here is an example: https://zuul.opendev.org/t/openstack/build/330d88ae18a144009a605ba94bcad56c/log/job-output.txt#5261 | 15:26 |
johnsom | The three or four I have seen have a similar connection-closed log. | 15:27 |
*** marios is now known as marios|out | 15:32 | |
fungi | johnsom: is it always rsync complaining about connection closed during the fetch-output collection task, or at random points in the job? | 15:34 |
fungi | the times also look somewhat clustered | 15:36 |
fungi | which leads me to wonder whether there are issues (network connectivity or similar) impacting connections between the executors and job nodes | 15:37 |
fungi | might be good to try and see if there's any correlation to a specific provider or to a specific executor | 15:37 |
fungi | (i'm technically supposed to be on vacation this week, so will be in trouble with the wife if caught spending time looking into it) | 15:39 |
*** ysandeep|ruck|afk is now known as ysandeep|ruck | 15:45 | |
johnsom | fungi Enjoy your time! I will poke around in the logs and get some answers. We saw this in recent days as well, but figured it was a temporary bump. | 15:50 |
johnsom | I have started to put together an etherpad: https://etherpad.opendev.org/p/zed-post-failure | 15:59 |
*** dviroel|lunch is now known as dviroel | 16:12 | |
johnsom | I have collected five random occurrences there. Let me know if you would like more | 16:13 |
*** ysandeep|ruck is now known as ysandeep|out | 16:18 | |
*** jpena is now known as jpena|off | 16:26 | |
fungi | johnsom: a quick skim suggests it's spread across our providers fairly evenly (as much as can be projected from the sample size anyway). seems to be all focal but that's probably just because the vast majority of our builds are anyway. the other potentially useful bit of info would be which executor it ran from, since that's the end complaining about the connection failures: | 16:34 |
fungi | https://zuul.opendev.org/t/openstack/build/330d88ae18a144009a605ba94bcad56c/log/zuul-info/inventory.yaml#66 | 16:34 |
johnsom | fungi Ack, I will collect that for the list | 16:34 |
fungi | if it all maps back to one or two executors then that may point to a problem with them or with a small part of the local network in the provider we host them all in | 16:34 |
fungi | if it's spread across the executors then it could still be networking in rax-dfw since that's where all the executors reside | 16:35 |
johnsom | Not just one executor | 16:37 |
johnsom | Though a few of them were on 03 | 16:37 |
clarkb | due to the tripleo localhost ansible issues I'm really suspecting something on the ansible side of things not the infrastructure | 17:00 |
clarkb | in their case it was the nested ansible installed by rpm, but zuul has a relatively up-to-date ansible install too that could be having trouble as well | 17:01 |
clarkb | the etherpad data also seems to point at it being unlikely that a specific provider is struggling | 17:01 |
clarkb | the error messages look almost identical to the tripleo ones fwiw. The difference is tripleo's are talking to 127.0.0.2 from 127.0.0.1 | 17:02 |
clarkb | looking at johnsom's examples though this is primarily with rsync. The tripleo examples were for normal ansible ssh to run tasks iirc | 17:03 |
johnsom | The port is 22 so it seems it's rsync over ssh | 17:04 |
clarkb | ya just pointing out that maybe it isn't ansible itself that is struggling (since I mentioned I suspected ansible previously) | 17:04 |
clarkb | johnsom: in the tripleo cases according to journald sshd was creating a session for the connection then closing it at the request of the client | 17:05 |
clarkb | johnsom: on the ansible side it basically logged what you are seeing but for regular ssh not rsync | 17:05 |
clarkb | "connection closed by remote\nconnection closed by 127.0.0.2" | 17:06 |
clarkb | In both cases we were also able to run subsequent tasks without issue | 17:06 |
clarkb | dviroel: rcastillo|rover: ^ fyi. Also it just occurred to me that your ansible must not be using ssh control persistence, since journald was logging a new connection for every task. That might be worth changing regardless of what the problem is here | 17:07 |
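As an aside on the control persistence point, a minimal sketch of enabling SSH connection reuse for a nested Ansible, assuming it currently runs with stock settings (file location and timeout are illustrative):

```ini
# ansible.cfg for the nested ansible on the test node (illustrative)
[ssh_connection]
# Multiplex tasks over one persistent master connection per host rather than
# opening a fresh ssh session (and sshd startup slot) for every task.
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
```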
clarkb | I kinda want to focus on the tripleo examples because they are to localhost which eliminates a lot of variables | 17:09 |
clarkb | ansible 2.12.8 did release on Monday | 17:14 |
clarkb | And we did redeploy executors around then | 17:14 |
clarkb | (however it isn't clear to me how that ends up in the `ansible` package on pypi since pypi reports a release older than that as the latest) | 17:14 |
clarkb | ok ansible-core released on monday. Does `ansible` install `ansible-core`? /me checks | 17:16 |
clarkb | yes it does | 17:17 |
clarkb | ok I'm now suspicious of ansible again :) | 17:18 |
clarkb | But we're running ansible-core 2.12.7 on ze01 | 17:20 |
clarkb | we also saw similar with our testinfra stuff for system-config that also runs outside of zuul with a nested ansible between test nodes in the same provider region | 17:23 |
clarkb | All that to say I think whatever is causing this isn't specific to a provider or even a specific distro. It seems to affect centos 8 and 9 and ubuntu focal at least. It affects connections to localhost in addition to those between providers and within a provider | 17:24 |
clarkb | that still makes me suspect ansible because it is a common piece in all of that. SSH (via openssh not paramiko) is also a common piece, but its versions will vary across centos 8 and 9 and ubuntu focal | 17:24 |
NeilHanlon | fungi: thank you very much and enjoy your vacation!! | 17:28 |
dviroel | clarkb: ack - thanks for the heads up | 17:29 |
tristanC | clarkb: corvus: thinking about the gerrit overload caused by rdo's zuul, would adding jitter between builds within a periodic buildset be a solution? | 17:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests https://review.opendev.org/c/opendev/system-config/+/853528 | 17:33 |
clarkb | tristanC: ^ fyi I think maybe we can give the httpd more threads to keep the web ui responsive while the git requests queue up | 17:33 |
clarkb | I was worried about increasing things and having all the threads get consumed and not really changing anything, but if that works as I think it might this may be a reasonable workaround for now to keep the ui responsive for users | 17:34 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests https://review.opendev.org/c/opendev/system-config/+/853528 | 17:34 |
clarkb | of course my first ps did the wrong thing though :) | 17:34 |
clarkb | tristanC: in opendev I believe that happens naturally via the randomness imposed by nodepool allocations | 17:35 |
clarkb | tristanC: last week I wondered if maybe you are using container based jobs that allocate immediately which negates that benefit | 17:36 |
clarkb | One of johnsom's examples did not run out of disk | 17:47 |
clarkb | (just ruling that out, as ansible ssh things don't fare well with full disks) | 17:48 |
clarkb | plenty of inodes too | 17:48 |
tristanC | clarkb: i meant, it looks like one buildset may contain many jobs using https://zuul.opendev.org/t/openstack/job/tripleo-ci-base-required-projects , and i wonder whether the zuul scheduler is updating each project repeatedly, even before making node requests or starting the jobs | 17:51 |
clarkb | tristanC: I think it should do it once to determine what builds to trigger for the buildset when the event arrives. Then each individual build will do independent merging (using the same input to the merger which should produce the same results). corvus can confirm or deny that though | 17:52 |
tristanC | in other words, if a buildset contains 10 jobs that require the same 10 projects, does the scheduler perform 100 git update? | 17:53 |
clarkb | I don't think so. | 17:54 |
clarkb | that was fun: my laptop remounted its filesystems read-only after resuming from suspend | 19:10 |
clarkb | I think it isn't starting the nvme device up properly | 19:10 |
clarkb | jrosser_: do you think https://github.com/ansible/ansible/issues/78344 could be the source of our ansible issues above? | 19:21 |
clarkb | jrosser_: https://zuul.opendev.org/t/openstack/build/06e949c509fc4fa7b83684573bbd8d91/log/job-output.txt#18570-18571 is a specific example, and I notice that rc 255 appears to be the code returned when it is recoverable in your bug | 19:22 |
clarkb | hrm it seems the errors that ansible bubble up are different though | 19:24 |
clarkb | apparently this error can happen if sshd isn't able to keep up with incoming ssh session starts (so maybe the scanners are having an impact? I'm surprised that openssh would fall over that easily though) | 19:29 |
clarkb | sshd_config MaxStartups maybe needing to be edited? | 19:33 |
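A sketch of the sshd_config directive in question; the exact values chosen in 853536 are an assumption here (the stock OpenSSH default is 10:30:100):

```
# /etc/ssh/sshd_config
# start:rate:full -- once "start" unauthenticated connections are pending,
# refuse new ones with probability "rate"%, and refuse all of them at "full".
MaxStartups 30:30:100
```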
opendevreview | Clark Boylan proposed openstack/project-config master: Tune sshd connections settings on test nodes https://review.opendev.org/c/openstack/project-config/+/853536 | 19:42 |
clarkb | johnsom: dviroel rcastillo|rover ^ fyi that may be the problem / fix? | 19:43 |
johnsom | So the IP in the error is actually the nodepool instance? I.e. the executor is ssh/rsyncing to the instance under test. | 19:45 |
johnsom | That would be interesting and sad at same time if that is the issue... | 19:46 |
clarkb | johnsom: yes the issue is ansible on the executor is unable to establish a connection to the nodepool test node. In the tripleo case it was 127.0.0.1 on the test node connecting to 127.0.0.2 on the test node. What the tripleo logs showed was that ssh scanning/brute forcing was in progress at the same time too | 19:46 |
clarkb | so ya it could be that we're just randomly colliding with the scanners | 19:46 |
clarkb | It would explain why it happens regardless of node and provider and platform and whether or not you cross actual networking | 19:46 |
clarkb | if this is the fix it will certainly be one of the more unique issues I've looked into | 19:51 |
dviroel | clarkb: oh nice, that may help us yes. Thanks for digging and proposing this patch. | 19:57 |
dviroel | clarkb: i saw just one failure like this as of today, nothing like yesterday | 19:58 |
dviroel | we do have a LP bug and a card to track this issue | 19:58 |
clarkb | dviroel: only if you created one | 19:58 |
clarkb | I'm not sure what a card would be | 19:58 |
dviroel | clarkb: I mean, we already created a tripleo bug | 19:59 |
dviroel | and an internal ticket to track this issue | 19:59 |
dviroel | we can provide feedback if the issue remains even after your patch gets merged | 20:00 |
clarkb | that would be great. If it is scanners it may go away after whoever is scanning slows down too. Either way hopefully this makes things happier | 20:00 |
dviroel | ++ | 20:01 |
johnsom | I may create a canary job that logs the ssh connections, just to see what it looks like. | 20:05 |
clarkb | johnsom: ya that would probably be good data. The tripleo jobs were catching them in regular journal logs for sshd | 20:13 |
clarkb | infra-root I've manually applied 853536 to a centos stream 8 node and a focal node and restarted sshd. It seems it doesn't break anything at least | 20:13 |
clarkb | ps output for sshd on ubuntu shows 0 of 30-100 startups now instead of 0 of 10-100 so I think it is taking effect too | 20:13 |
clarkb | NeilHanlon: fungi: I'm trying to boot a rocky 9 node in ovh bhs1 now to see what happens | 20:45 |
clarkb | there is a console log so we have a kernel. Not the grub issue again | 20:49 |
clarkb | It does not ping or ssh though | 20:49 |
clarkb | aha but it eventually drops into emergency mode | 20:50 |
clarkb | it's not setting the boot disk properly | 20:53 |
clarkb | I'll get logs pasted shortly | 20:53 |
clarkb | NeilHanlon: fungi: https://paste.opendev.org/show/bQdnCNQZvTK0IQRfDlL2/ it is trying to boot off of the nodepool builder's disk by uuid | 20:56 |
clarkb | dib image builds should be selecting the disk by label | 20:56 |
clarkb | cloudimg-rootfs for ext4 and img-rootfs for xfs. We use ext4. I wonder why they differ | 20:59 |
clarkb | ah, cloudimg-rootfs is what everyone used forever, then xfs became popular for rootfs and that label is too long for it | 21:00 |
*** dviroel is now known as dviroel|afk | 21:02 | |
*** timburke_ is now known as timburke | 21:03 | |
clarkb | also you should not pay attention to this fungi :) we'll get it sorted | 21:18 |
clarkb | NeilHanlon: it almost seems like rocky grub is ignoring GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub | 21:28 |
clarkb | we set GRUB_DEVICE_LABEL=cloudimg-rootfs and GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub but the resulting grub config has the builder's root disk uuid | 21:29 |
clarkb | grub packages don't look like they've updated super recently | 21:34 |
fungi | yeah, i'm trying to pretend i'm not aware of all the broken, and resisting my innate urge to troubleshoot | 21:35 |
clarkb | the number of patches to grub is ... daunting | 21:38 |
clarkb | 274 | 21:39 |
fungi | that's in rocky (and presumably copied from rhel)? | 21:42 |
clarkb | correct | 21:42 |
corvus | i'm going to restart zuul-web so we can see the job graph changes live | 21:42 |
fungi | thanks! excited to see that go live finally | 21:43 |
clarkb | I think maybe we need GRUB_DISABLE_UUID too https://lists.gnu.org/archive/html/grub-devel/2019-10/msg00007.html | 21:45 |
clarkb | ianw: ^ you probably understand this much better than I do if you want to take a look when your day starts | 21:45 |
clarkb | the patch for GRUB_DISABLE_UUID is one of those 274 patches on rocky | 21:46 |
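For reference, a sketch of the /etc/default/grub settings under discussion; the variable names come from the conversation above, and whether GRUB_DISABLE_UUID is honored depends on the distro's grub patch set:

```
# /etc/default/grub (sketch)
GRUB_DEVICE_LABEL=cloudimg-rootfs
# Keep root=LABEL=... on the kernel command line instead of the build
# host's filesystem UUID.
GRUB_DISABLE_LINUX_UUID=true
# From the downstream patch series: also stop grub-mkconfig from embedding
# filesystem UUIDs in its own search/root-setting commands.
GRUB_DISABLE_UUID=true
```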
NeilHanlon | i'll take a read through this later... who knew booting an OS could be so friggin' long lol | 21:50 |
corvus | okay! deep links! https://zuul.opendev.org/t/openstack/project/opendev.org/opendev/system-config?branch=master&pipeline=check | 21:52 |
corvus | that's a job graph for what happens if you upload a change to system-config | 21:53 |
clarkb | nice | 21:53 |
fungi | ooh, yes | 21:54 |
corvus | and the frozen jobs are deep-linkable too: https://zuul.opendev.org/t/openstack/freeze-job?pipeline=check&project=opendev%2Fsystem-config&job=opendev-buildset-registry&branch=master | 21:56 |
*** dasm is now known as dasm|off | 22:00 | |
ianw | reading backwards; thanks for noting the tab extension point removal. i guess that is marked as wip but i don't see any context about it. anyway, it's good that it's been called out | 22:06 |
ianw | if we see the slowdowns on gerrit probably worth checking the sshd log, that's where i saw the sf logins. i don't think it's http? | 22:07 |
ianw | it is weird that rocky would boot in the gate but not on hosts. we've seen something similar before ... trying to remember. it was something about how in the gate the host already had a labeled fs, and that was leaking through | 22:10 |
clarkb | ianw: httpd and sshd share a common thread pool limit though | 22:10 |
ianw | https://review.opendev.org/c/openstack/diskimage-builder/+/818851 was the change i'm thinking of | 22:11 |
clarkb | I tried to capture the relationship in https://review.opendev.org/c/opendev/system-config/+/853528 re gerrit | 22:11 |
clarkb | basically I think we may be starving the common thread pool and then it falls over from there? The idea is that maybe increasing httpd max threads above the shared limit will allow the web ui to continue working though | 22:12 |
clarkb | that won't fix the git pushes and fetches being slow, but maybe it will make the UI responsive through it | 22:14 |
clarkb | and then we can bump up sshd.threads as well as httpd.maxThreads if that doesn't cause problems. Just trying to be conservative as we grow into the new server | 22:14 |
ianw | sshd.threads -> "When SSH daemon is enabled then this setting also defines the max number of concurrent Git requests for interactive users over SSH and HTTP together." ... ok then? | 22:15 |
ianw | maybe it should have a different name then | 22:15 |
clarkb | ya I agree it is super confusing | 22:15 |
clarkb | I stared at it a bit this morning, then went on a bike ride and eventually ended up with the change I linked | 22:15 |
clarkb | and I'm still not entirely sure it will do what I think it will do | 22:16 |
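A sketch of the gerrit.config options in play; the numbers are placeholders, not the values proposed in 853528:

```ini
# Gerrit site etc/gerrit.config (illustrative values only)
[httpd]
  # Worker threads for the embedded HTTP server (web UI and git-over-HTTP).
  maxThreads = 100
[sshd]
  # Despite the name, also caps concurrent interactive git requests made
  # over SSH and HTTP together.
  threads = 100
  # 0 = no dedicated pool for service users; everyone shares the interactive pool.
  batchThreads = 0
```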
ianw | clarkb: https://gerrit-review.googlesource.com/Documentation/access-control.html#service_users i wasn't aware of that one | 22:16 |
clarkb | ya so service users has another issue for us | 22:16 |
clarkb | its binary and we really have three classes of user | 22:17 |
clarkb | we have humans, zuul, and all the other bots | 22:17 |
clarkb | I brought this up with upstream a while back after the 3.5 or 3.4 upgrade (whichever put that in place and started making zuul stuff slow) | 22:17 |
clarkb | we ended up setting batch threads to 0 so that the pools are shared since we have the three classes of users and can't really separate them in a way that makes sense | 22:18 |
clarkb | however, maybe putting sf in its own thread pool as a service user and having everything else mix with the humans would be an improvement? | 22:18 |
* clarkb is just thinking out loud here | 22:18 | |
clarkb | also I'm melting into the couch so the brain may not be operating well | 22:21 |
ianw | yeah, i think maybe starting with the thread increases as the simple thing and going from there would be good first steps | 22:23 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback https://review.opendev.org/c/openstack/diskimage-builder/+/847860 | 22:23 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Support LVM thin provisioning https://review.opendev.org/c/openstack/diskimage-builder/+/840144 | 22:23 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add thin provisioning support to growvols https://review.opendev.org/c/openstack/diskimage-builder/+/848688 | 22:23 |
clarkb | ianw: frickler: that brings up a good point though that if/when this happens again we should see about running a show queue against gerrit | 22:24 |
clarkb | `gerrit show-queue -w -q` that should give us quite a bit of good info on what is going on | 22:25 |
ianw | yeah i did do that when i first saw it | 22:26 |
ianw | i can't remember what i saw :/ | 22:26 |
ianw | maybe i logged something | 22:26 |
fungi | maybe it was so horrific you repressed that memory | 22:27 |
clarkb | and then https://review.opendev.org/c/openstack/project-config/+/853536 is a completely unrelated but interesting change too | 22:27 |
clarkb | fungi: ^ re the tripleo and other ssh issues | 22:27 |
ianw | i didn't : https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-08-11.log.html#t2022-08-11T01:26:37 ... i can't remember now. i don't think i saw a smoking gun | 22:28 |
ianw | anyway, yes ++ to running and taking more notice of the queue stats | 22:28 |
ianw | interesting ... | 22:30 |
clarkb | and also I'm not sure restarting gerrit helps anything. I don't think we are running out of resources. We're just hitting our artificial limits and everything is queueing behind that | 22:31 |
clarkb | But to double check that gerrit show-caches will emit memory info so that is another good command to try | 22:31 |
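Both are standard Gerrit ssh admin commands; a sketch of how they would be invoked (the account name is illustrative):

```shell
# Pending work queues, wide output, grouped by queue.
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
# Cache statistics plus JVM memory and thread pool information.
ssh -p 29418 admin@review.opendev.org gerrit show-caches --show-jvm --show-threads
```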
ianw | yeah that was the conclusion that i came to at the time; it seemed like it had stopped responding to new requests but was still working | 22:31 |
ianw | i wonder with the ssh thing, can you just instantly block anything trying password auth? | 22:32 |
ianw | this is just thinking out loud, i've never looked into it | 22:33 |
clarkb | ianw: we do already reject password auth, but I think the protocol itself allows you to try password auth and fail, then try key 1 and then key 2 and so on | 22:33 |
ianw | yeah, i meant maybe more than just reject it, if you try a password you get blocked. but it's probably not really possible due to the negotiation phase | 22:34 |
fungi | yes, ssh protocol is designed to accept multiple authentication mechanisms and "negotiate" between the client and server by trying different ones until something works | 22:34 |
ianw | yeah even something like fail2ban is scanning rejection logs | 22:35 |
ianw | even rate limiting IPs on port 22 probably doesn't help, as it's multiple scanners i guess | 22:37 |
fungi | though also we do have concurrent connection limits enforced by conntrack | 22:38 |
clarkb | and zuul can be pretty persistent | 22:38 |
clarkb | the reason we're seeing rsync fail is that it doesn't go through the controlpersist process | 22:38 |
clarkb | depending on how many rsyncs you do in a short period of time you could hit a rate limit | 22:39 |
clarkb | I suspect we're right on the threshold too which is why we don't see this super frequently | 22:39 |
clarkb | but I could be wrong | 22:39 |
clarkb | this is half hunch at this point anyway | 22:39 |
clarkb | but it does explain what tripleo observed pretty neatly | 22:40 |
opendevreview | Merged openstack/project-config master: Tune sshd connections settings on test nodes https://review.opendev.org/c/openstack/project-config/+/853536 | 22:49 |
johnsom | clarkb On a two hour tempest job, there were not a large number of SSH hits. Could just be the timing, but it's not an ongoing thing. | 23:45 |
johnsom | 56 SSH connections over that two hours, including the zuul related connections | 23:46 |
clarkb | johnsom: right I think that is why it affects so few jobs | 23:47 |
clarkb | its all about getting lucky | 23:47 |
fungi | that would also explain the clustering i saw for timestamps when looking at the zuul builds list | 23:53 |
fungi | you would get several failing around the same time, with gaps between | 23:53 |
clarkb | ya and it is possible I've misdiagnosed it too. But based on the tripleo logs I'm reasonably confident it is this | 23:54 |
clarkb | it perfectly explains why ssh to localhost randomly doesn't work | 23:55 |