Wednesday, 2022-08-17

*** ysandeep\|out is now known as ysandeep\|ruck		01:36
*** ysandeep\|ruck is now known as ysandeep\|ruck\|afk		02:32
*** ysandeep\|ruck\|afk is now known as ysandeep\|ruck		03:45
opendevreview	Merged opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/853251	04:02
*** jpena\|off is now known as jpena		07:34
*** ysandeep\|ruck is now known as ysandeep\|ruck\|lunch		08:11
gthiemonge	Hi Folks, gerrit is acting weird: https://review.opendev.org/c/openstack/octavia/+/850590 shows no modification of the files, and I cannot download the patch with git review -d (however git fetch works)	08:22
gibi	gthiemonge: yeah I see similar errors now in gerrit	08:27
gibi	after couple of refresh now review.opendev.org does not even load	08:27
mnasiadka	yup	08:29
mnasiadka	same here	08:29
mnasiadka	frickler: alive? (probably you're the only one in the EU timezone)	08:29
gthiemonge	gibi: yeah it looks down now	08:31
frickler	I can confirm that gerrit seems to have issues, checking logs now	08:32
marios	ah k already known then	08:34
marios	thanks frickler	08:34
*** ysandeep\|ruck\|lunch is now known as ysandeep\|ruck		08:39
frickler	o.k., restarted gerrit. I'm seeing the "empty files section" issue for other patches, too, but it seems to resolve when waiting long enough (like a minute or so). maybe give gerrit a moment to get itself sorted again	08:48
ykarel	Thanks frickler	08:53
ykarel	still not loading, likely need to wait some more	08:54
gthiemonge	frickler: thanks!	08:56
gibi	I get proxy error from gerrit now	08:58
ykarel	working now	09:04
gibi	yep working here now too	09:10
gibi	thank you frickler	09:10
frickler	seems to have recovered for now, please ping infra-root if you notice any further issues	09:12
mnasiadka	Seems that rockylinux-9 nodes are not spawning? looks like those are stuck in "queued"	10:15
*** ysandeep\|ruck is now known as ysandeep\|afk		10:35
frickler	#status log added slaweq to newly created whitebox-neutron-tempest-plugin-core group (https://review.opendev.org/c/openstack/project-config/+/851031)	10:45
opendevstatus	frickler: finished logging	10:45
*** ysandeep\|afk is now known as ysandeep\|ruck		11:30
*** dviroel\|afk is now known as dviroel		11:32
opendevreview	Merged openstack/project-config master: Add StackHPC Ansible roles used by openstack/kayobe https://review.opendev.org/c/openstack/project-config/+/853003	11:39
fungi	mnasiadka: looking through the launcher logs, it seems the nodes are booting but nodepool is unable to gather ssh host keys from them and eventually gives up and tries to boot another	11:46
fungi	so something probably changed in rocky 9 with regard to networking, service startup, or openssh	11:47
*** ysandeep\|ruck is now known as ysandeep\|ruck\|mtg		11:57
mnasiadka	fungi: uhh	12:11
mnasiadka	NeilHanlon: any idea? ^^	12:11
*** ysandeep\|ruck\|mtg is now known as ysandeep\|ruck		12:24
NeilHanlon	i will take a look mnasiadka	12:32
NeilHanlon	fungi: could you remind me where (or if) I can poke at the launcher logs?	12:45
Clark[m]	frickler: ykarel: gthiemonge: gibi: tristanC: I think the current thought on the Gerrit issues there is that software factory's zuul is making enough git requests that we hit our http server limits. We've asked them to improve the spread of requests using zuul's jitter settings but not sure if that has happened yet.	13:21
Clark[m]	Restarting Gerrit won't necessarily fix the problem if sf continues to make a lot of requests	13:21
gibi	Clark[m]: thanks for the information	13:22
Clark[m]	We can also increase the number of http threads the Gerrit server will support (this is the limit we hit that causes your client requests to fail). But that will require maths to determine what a reasonable new value is. Considering the OpenDev zuul doesn't create these issues I'm hesitant to increase it for SF and instead continue to work with them to better tune zuul	13:24
tristanC	Clark[m]: we recently integrated zuul 6.2 to enable the jitter option, but we need to schedule a release+upgrade to apply it to production.	13:26
tristanC	I think the issue is coming from one of these periodic pipeline: https://review.rdoproject.org/cgit/config/tree/zuul.d/upstream.yaml#n96	13:27
tristanC	for example in this pipeline config, i think most jobs have many opendev project in their required-projects attribute: https://review.rdoproject.org/cgit/rdo-jobs/tree/zuul.d/integration-pipeline-wallaby-centos9.yaml	13:43
tristanC	would adding jitter to this pipeline actually work? if i understand correctly, this constitutes a single buildset; does the jitter apply to the individual builds?	13:45
Clark[m]	That pipeline seems to start at 09:10, but we are seeing the issues at closer to 08:10. Are your clocks UTC on the zuul servers? Maybe they are UTC+1?	13:47
Clark[m]	tristanC the changes to jitter in 6.2 apply it to each buildset I think. Previously the entire pipeline was triggered with the same random offset but now each buildset (for each project+branch combo) has a separate offset	13:49
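Note: the difference Clark[m] describes (one random offset for the whole pipeline versus one offset per buildset) can be illustrated with a small simulation. This is only a sketch of the idea, not Zuul's actual code; the function names, the buildset labels and the 600-second jitter window are assumptions made up for the example.

```python
import random

def shared_offset_starts(buildsets, max_jitter, rng):
    """Older behaviour as described in the log: one offset for the whole pipeline."""
    offset = rng.uniform(0, max_jitter)
    return {name: offset for name in buildsets}

def per_buildset_starts(buildsets, max_jitter, rng):
    """Newer behaviour as described in the log: each buildset draws its own offset."""
    return {name: rng.uniform(0, max_jitter) for name in buildsets}

# Hypothetical project+branch combinations, each one a separate buildset.
buildsets = ["project-a/master", "project-b/master", "project-c/master"]
rng = random.Random(0)

shared = shared_offset_starts(buildsets, 600, rng)
spread = per_buildset_starts(buildsets, 600, rng)

print(len(set(shared.values())))  # all buildsets start together
print(len(set(spread.values())))  # start times are spread out
```

With a single buildset both functions yield exactly one start time, so per-buildset jitter spreads nothing out in that case.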
ykarel	Clark[m], Thanks for the info	13:51
corvus	the gerrit folks are working on removing the extension point that the zuul-summary-plugin uses: https://gerrit-review.googlesource.com/c/gerrit/+/342836 (cc Clark ianw )	13:53
tristanC	Clark[m]: alright, then I think the jitter won't help because rod	13:53
tristanC	rdo's pipeline are mostly using a single buildset	13:53
Clark[m]	corvus: yes I responded to them on slack indicating the zuul summary plugin uses that. Nasser thought it should be kept as generally useful too	13:54
Clark[m]	tristanC ok, what do you suggest then?	13:54
corvus	Clark: i know. just pointing out that ben has a change outstanding	13:55
corvus	so is something that can be tracked now	13:55
Clark[m]	tristanC: I think on my end we have two options. Increasing Gerrit limits (and resource usage). Blocking SF's account/IP. Neither are great	13:56
tristanC	Clark[m]: would it be possible to limit the number of connection made by the gerrit driver?	14:06
tristanC	otherwise we are looking for ways to split the pipeline to spread the build execution	14:06
Clark[m]	tristanC: The issue isn't the Gerrit driver but the mergers I expect. Though I guess there is relationship between them. The Gerrit driver should already reuse connections for api requests.	14:07
Clark[m]	Within the merger the git operations act as separate git processes so those connections cannot be reused	14:09
Clark[m]	corvus does have a change to reduce the number of Gerrit requests. I'll make time to review that today	14:13
*** dasm\|off is now known as dasm		14:22
*** dviroel is now known as dviroel\|lunch		15:07
fungi	NeilHanlon: i don't think we publish the logs from our launchers, historically there may have been some concern around the potential to leak cloud credentials into them, but we can revisit that decision. in this case the launcher logs aren't much help because they simply say that the builder sees the api report the node has booted, then the launcher is repeatedly trying to connect to the node's	15:24
fungi	sshd in order to record the host keys from it, and eventually gives up raising "nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 198.72.124.26 on port 22" or similar	15:24
fungi	we can turn on boot log collection, which might be the next step (either that or manually booting the image in a cloud and trying to work out why it's not responding)	15:25
fungi	er, nova console log collection i mean	15:25
*** ysandeep\|ruck is now known as ysandeep\|ruck\|afk		15:26
johnsom	Hi everyone, we have been seeing an uptick in POST_FAILURE results recently: https://zuul.openstack.org/builds?result=POST_FAILURE&skip=0	15:26
johnsom	Here is an example: https://zuul.opendev.org/t/openstack/build/330d88ae18a144009a605ba94bcad56c/log/job-output.txt#5261	15:26
johnsom	The three-four I have seen have a similar connection closed log.	15:27
*** marios is now known as marios\|out		15:32
fungi	johnsom: is it always rsync complaining about connection closed during the fetch-output collection task, or at random points in the job?	15:34
fungi	the times also look somewhat clustered	15:36
fungi	which leads me to wonder whether there are issues (network connectivity or similar) impacting connections between the executors and job nodes	15:37
fungi	might be good to try and see if there's any correlation to a specific provider or to a specific executor	15:37
fungi	(i'm technically supposed to be on vacation this week, so will be in trouble with the wife if caught spending time looking into it)	15:39
*** ysandeep\|ruck\|afk is now known as ysandeep\|ruck		15:45
johnsom	fungi Enjoy your time! I will poke around in the logs and get some answers. We saw this in recent days as well, but figured it was a temporary bump.	15:50
johnsom	I have started to put together an etherpad: https://etherpad.opendev.org/p/zed-post-failure	15:59
*** dviroel\|lunch is now known as dviroel		16:12
johnsom	I have collected five random occurrences there. Let me know if you would like more	16:13
*** ysandeep\|ruck is now known as ysandeep\|out		16:18
*** jpena is now known as jpena\|off		16:26
fungi	johnsom: a quick skim suggests it's spread across our providers fairly evenly (as much as can be projected from the sample size anyway). seems to be all focal but that's probably just because vast majority of our builds are anyway. the other potentially useful bit of info would be which executor it ran from, since that's the end complaining about the connection failures:	16:34
fungi	https://zuul.opendev.org/t/openstack/build/330d88ae18a144009a605ba94bcad56c/log/zuul-info/inventory.yaml#66	16:34
johnsom	fungi Ack, I will collect that for the list	16:34
fungi	if it all maps back to one or two executors then that may point to a problem with them or with a small part of the local network in the provider we host them all in	16:34
fungi	if it's spread across the executors then it could still be networking in rax-dfw since that's where all the executors reside	16:35
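Note: the correlation check fungi suggests (do the failures map back to one executor or one provider?) could be sketched as a simple tally. The sample records below are invented for illustration and are not taken from the etherpad; the field names, the helper names and the 50% threshold are likewise assumptions.

```python
from collections import Counter

def tally(failures, field):
    """Count failed builds per value of `field` (e.g. executor or provider)."""
    return Counter(build[field] for build in failures)

def concentrated(counts, share=0.5):
    """Return the names that account for more than `share` of all failures."""
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if total and n / total > share)

# Hypothetical failure records; real data would come from each build's inventory.
sample = [
    {"executor": "executor-03", "provider": "provider-a"},
    {"executor": "executor-03", "provider": "provider-b"},
    {"executor": "executor-01", "provider": "provider-c"},
    {"executor": "executor-05", "provider": "provider-a"},
    {"executor": "executor-03", "provider": "provider-c"},
]

print(tally(sample, "executor"))
print(concentrated(tally(sample, "executor")))
print(concentrated(tally(sample, "provider")))
```

A non-empty result for executors would point at those hosts, while an even spread across both fields would leave shared infrastructure or shared software as the more likely suspects.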
johnsom	Not just one executor	16:37
johnsom	Though a few of them were on 03	16:37
clarkb	due to the tripleo localhost ansible issues I'm really suspecting something on the ansible side of things not the infrastructure	17:00
clarkb	in their case it was the nested ansible installed by rpm, but zuul has a relatively up to date ansible install too that could be having trouble as well	17:01
clarkb	the etherpad data also seems to point at it being unlikely that a specific provider is struggling	17:01
clarkb	the error messages look almost identical to the tripleo ones fwiw. The difference is tripleo's are talking to 127.0.0.2 from 127.0.0.1	17:02
clarkb	looking at johnsom's examples though this is primarily with rsync. The tripleo examples were for normal ansible ssh to run tasks iirc	17:03
johnsom	The port is 22 so it seems it's rsync over ssh	17:04
clarkb	ya just pointing out that maybe it isn't ansible itself that is struggling (since I mentioned I suspected ansible previously)	17:04
clarkb	johnsom: in the tripleo cases according to journald sshd was creating a session for the connection then closing it at the request of the client	17:05
clarkb	johnsom: on the ansible side it basically logged what you are seeing but for regular ssh not rsync	17:05
clarkb	"connection closed by remote\nconnection closed by 127.0.0.2"	17:06
clarkb	In both cases we were also able to run subsequent tasks without issue	17:06
clarkb	dviroel: rcastillo\|rover: ^ fyi. Also it just occurred to me that your ansible must not be using ssh control persistence since journald was logging a new connection for every task. That might be worth changing regardless of what the problem is here	17:07
clarkb	I kinda want to focus on the tripleo examples because they are to localhost which eliminates a lot of variables	17:09
clarkb	ansible 2.12.8 did release on Monday	17:14
clarkb	And we did redeploy executors around then	17:14
clarkb	(however it isn't clear to me how that ends up in the `ansible` package on pypi since pypi reports a release older than that as the latest)	17:14
clarkb	ok ansible-core released on monday. Does `ansible` install `ansible-core`? /me checks	17:16
clarkb	yes it does	17:17
clarkb	ok I'm now suspicious of ansible again :)	17:18
clarkb	But we're running ansible-core 2.12.7 on ze01	17:20
clarkb	we also saw similar with our testinfra stuff for system-config that also runs outside of zuul with a nested ansible between test nodes in the same provider region	17:23
clarkb	All that to say I think whatever is causing this isn't specific to a provider or even a specific distro. It seems to affect centos 8 and 9 and ubuntu focal at least. It affects connections to localhost in addition to those between providers and within a provider	17:24
clarkb	that still makes me suspect ansible because it is a common piece in all of that. SSH (via openssh not paramiko) is also a common piece, but its versions will vary across centos 8 and 9 and ubuntu focal	17:24
NeilHanlon	fungi: thank you very much and enjoy your vacation!!	17:28
dviroel	clarkb: ack - thanks for the heads up	17:29
tristanC	clarkb: corvus: thinking about the gerrit overload caused by rdo's zuul, would adding jitter between builds within a periodic buildset be a solution?	17:30
opendevreview	Clark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests https://review.opendev.org/c/opendev/system-config/+/853528	17:33
clarkb	tristanC: ^ fyi I think maybe we can give the httpd more threads to keep the web ui responsive while the git requests queue up	17:33
clarkb	I was worried about increasing things and having all the threads get consumed and not really changing anything, but if that works as I think it might this may be a reasonable workaround for now to keep the ui responsive for users	17:34
opendevreview	Clark Boylan proposed opendev/system-config master: Increase the number of Gerrit threads for http requests https://review.opendev.org/c/opendev/system-config/+/853528	17:34
clarkb	of course my first ps did the wrong thing though :)	17:34
clarkb	tristanC: in opendev I believe that happens naturally via the randomness imposed by nodepool allocations	17:35
clarkb	tristanC: last week I wondered if maybe you are using container based jobs that allocate immediately which negates that benefit	17:36
clarkb	One of johnsom's examples did not run out of disk	17:47
clarkb	(just ruling that out as ansible ssh things don't fare well with full disks)	17:48
clarkb	plenty of inodes too	17:48
tristanC	clarkb: i meant, it looks like one buildset may contain many jobs using https://zuul.opendev.org/t/openstack/job/tripleo-ci-base-required-projects , and i wonder if the zuul scheduler is not updating each project repeatedly, even before making node request or starting the job	17:51
clarkb	tristanC: I think it should do it once to determine what builds to trigger for the buildset when the event arrives. Then each individual build will do independent merging (using the same input to the merger which should produce the same results). corvus can confirm or deny that though	17:52
tristanC	in other words, if a buildset contains 10 jobs that require the same 10 projects, does the scheduler perform 100 git updates?	17:53
clarkb	I don't think so.	17:54
clarkb	that was fun my laptop remounted fses RO after resuming from suspend	19:10
clarkb	I think it isn't starting the nvme device up properly	19:10
clarkb	jrosser_: do you think https://github.com/ansible/ansible/issues/78344 could be the source of our ansible issues above?	19:21
clarkb	jrosser_: https://zuul.opendev.org/t/openstack/build/06e949c509fc4fa7b83684573bbd8d91/log/job-output.txt#18570-18571 is a specific example and I notice that rc 255 appears to be the code returned when it is recoverable in your bug	19:22
clarkb	hrm it seems the errors that ansible bubble up are different though	19:24
clarkb	apparently this error can happen if sshd isn't able to keep up with incoming ssh session starts (so maybe the scanners are having an impact? I'm surprised that openssh would fall over that easily though)	19:29
clarkb	sshd_config MaxStartups maybe needing to be edited?	19:33
opendevreview	Clark Boylan proposed openstack/project-config master: Tune sshd connections settings on test nodes https://review.opendev.org/c/openstack/project-config/+/853536	19:42
clarkb	johnsom: dviroel rcastillo\|rover ^ fyi that may be the problem / fix?	19:43
johnsom	So the IPs in the error is actually the nodepool instance? I.e. the executor is ssh/rsync to the instance under test.	19:45
johnsom	That would be interesting and sad at same time if that is the issue...	19:46
clarkb	johnsom: yes the issue is ansible on the executor is unable to establish a connection to the nodepool test node. In the tripleo case it was 127.0.0.1 on the test node connecting to 127.0.0.2 on the test node. What the tripleo logs showed was that ssh scanning/brute forcing was in progress at the same time too	19:46
clarkb	so ya it could be that we're just randomly colliding with the scanners	19:46
clarkb	It would explain why it happens regardless of node and provider and platform and whether or not you cross actual networking	19:46
clarkb	if this is the fix it will certainly be one of the more unique issues I've looked into	19:51
dviroel	clarkb: oh nice, that may help us yes. Thanks for digging and proposing this patch.	19:57
dviroel	clarkb: i saw just one failure like this as for today, nothing like yesterday	19:58
dviroel	we do have a LP bug and a card to track this issue	19:58
clarkb	dviroel: only if you created one	19:58
clarkb	I'm not sure what a card would be	19:58
dviroel	clarkb: I mean, we already created a tripleo bug	19:59
dviroel	and an internal ticket to track this issue	19:59
dviroel	we can provide feedback if the issue remains even after your patch gets merged	20:00
clarkb	that would be great. If it is scanners it may go away after whoever is scanning slows down too. Either way hopefully this makes things happier	20:00
dviroel	++	20:01
johnsom	I may create a canary job that logs the ssh connections, just to see what it looks like.	20:05
clarkb	johnsom: ya that would probably be good data. The tripleo jobs were catching them in regular journal logs for sshd	20:13
clarkb	infra-root I've manually applied 853536 to a centos stream 8 node and a focal node and restarted sshd. It seems it doesn't break anything at least	20:13
clarkb	ps output for sshd on ubuntu shows 0 of 30-100 startups now instead of 0 of 10-100 so I think it is taking effect too	20:13
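Note: the "0 of 10-100" and "0 of 30-100" figures reflect the first and last values of sshd's `MaxStartups start:rate:full` setting. Per the sshd_config manual, new connections are refused with probability rate/100 once `start` unauthenticated connections are pending, rising linearly until every attempt is refused at `full`. The function below is a rough sketch of that ramp; the function name is made up, and the rate of 30 used as a default is the stock sshd value, not necessarily what change 853536 sets.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Approximate chance (0.0 to 1.0) that sshd refuses a new connection.

    `unauthenticated` is the number of connections that have not yet
    completed authentication when the new attempt arrives.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    ramp = (100 - rate) * (unauthenticated - start) / (full - start)
    return (rate + ramp) / 100.0

# Twenty pending unauthenticated connections, e.g. scanners plus job traffic.
print(drop_probability(20, start=10))  # some attempts are refused
print(drop_probability(20, start=30))  # below the raised threshold, none refused
```

Raising `start` therefore leaves more headroom before scanner traffic can cause legitimate connection attempts to be dropped.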
clarkb	NeilHanlon: fungi: I'm trying to boot a rocky 9 node in ovh bhs1 now to see what happens	20:45
clarkb	there is a console log so we have a kernel. Not the grub issue again	20:49
clarkb	It does not ping or ssh though	20:49
clarkb	aha but it eventually drops into emergency mode	20:50
clarkb	its not setting the boot disk properly	20:53
clarkb	I'll get logs pasted shortly	20:53
clarkb	NeilHanlon: fungi: https://paste.opendev.org/show/bQdnCNQZvTK0IQRfDlL2/ it is trying to boot off of the nodepool builder's disk by uuid	20:56
clarkb	dib image builds should be selecting the disk by label	20:56
clarkb	cloudimg-rootfs for ext4 and img-rootfs for xfs. We use ext4. I wonder why they differ	20:59
clarkb	ah cloudimg-rootfs is what everyone used for forever then xfs became popular for rootfs and that label is too long for it	21:00
*** dviroel is now known as dviroel\|afk		21:02
*** timburke_ is now known as timburke		21:03
clarkb	also you should not pay attention to this fungi :) we'll get it sorted	21:18
clarkb	NeilHanlon: it almost seems like rocky grub is ignoring GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub	21:28
clarkb	we set GRUB_DEVICE_LABEL=cloudimg-rootfs and GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub but the resulting grub config has the builder's root disk uuid	21:29
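Note: one way to confirm which kind of root device reference ended up in a generated grub config is to inspect the `root=` argument on the kernel command line. The helper below is a minimal sketch; the function name is hypothetical and the UUID in the example is a placeholder, not the value from the paste.

```python
import re

def root_spec(kernel_cmdline):
    """Classify the root= argument as 'label', 'uuid' or 'device'.

    Returns a (kind, value) tuple, or None when no root= argument is present.
    """
    match = re.search(r"(?:^|\s)root=(\S+)", kernel_cmdline)
    if match is None:
        return None
    value = match.group(1)
    if value.startswith("LABEL="):
        return ("label", value[len("LABEL="):])
    if value.startswith("UUID="):
        return ("uuid", value[len("UUID="):])
    return ("device", value)

# What the image build is meant to produce (label taken from the log).
print(root_spec("ro root=LABEL=cloudimg-rootfs console=tty0"))
# What a leaked builder disk reference would look like (placeholder UUID).
print(root_spec("ro root=UUID=00000000-0000-0000-0000-000000000000"))
```

A `uuid` result on an image built elsewhere suggests the builder's disk identity leaked into the config, which matches the emergency-mode boot described above.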
clarkb	grub packages don't look like they've updated super recently	21:34
fungi	yeah, i'm trying to pretend i'm not aware of all the broken, and resisting my innate urge to troubleshoot	21:35
clarkb	the number of patches to grub is ... daunting	21:38
clarkb	274	21:39
fungi	that's in rocky (and presumably copied from rhel)?	21:42
clarkb	correct	21:42
corvus	i'm going to restart zuul-web so we can see the job graph changes live	21:42
fungi	thanks! excited to see that go live finally	21:43
clarkb	I think maybe we need GRUB_DISABLE_UUID too https://lists.gnu.org/archive/html/grub-devel/2019-10/msg00007.html	21:45
clarkb	ianw: ^ you probably understand this much better than I do if you want to take a look when your day starts	21:45
clarkb	the patch for GRUB_DISABLE_UUID is one of those 274 patches on rocky	21:46
NeilHanlon	i'll take a read through this later... who knew booting an OS could be so friggin' long lol	21:50
corvus	okay! deep links! https://zuul.opendev.org/t/openstack/project/opendev.org/opendev/system-config?branch=master&pipeline=check	21:52
corvus	that's a job graph for what happens if you upload a change to system-config	21:53
clarkb	nice	21:53
fungi	ooh, yes	21:54
corvus	and the frozen jobs are deep-linkable too: https://zuul.opendev.org/t/openstack/freeze-job?pipeline=check&project=opendev%2Fsystem-config&job=opendev-buildset-registry&branch=master	21:56
*** dasm is now known as dasm\|off		22:00
ianw	reading backwards; thanks for noting the tab extension point removal. i guess that is marked as wip but i don't see any context about it. anyway, it's good that it's been called out	22:06
ianw	if we see the slowdowns on gerrit probably worth checking the sshd log, that's where i saw the sf logins. i don't think it's http?	22:07
ianw	it is weird that rocky would boot in the gate but not on hosts. we've seen something similar before ... trying to remember. it was something about how in the gate the host already had a labeled fs, and that was leaking through	22:10
clarkb	ianw: httpd and sshd share a common thread pool limit though	22:10
ianw	https://review.opendev.org/c/openstack/diskimage-builder/+/818851 was the change i'm thinking of	22:11
clarkb	I tried to capture the relationship in https://review.opendev.org/c/opendev/system-config/+/853528 re gerrit	22:11
clarkb	basically I think we may be starving the common thread pool and then it falls over from there? Idea is maybe increasing httpd max threads above the shared limit will allow the web ui to continue working though	22:12
clarkb	that won't fix the git pushes and fetches being slow, but maybe it will make the UI responsive through it	22:14
clarkb	and then we can bump up sshd.threads as well as httpd.maxThreads if that doesn't cause problems. Just trying to be conservative as we grow into the new server	22:14
ianw	sshd.threads -> "When SSH daemon is enabled then this setting also defines the max number of concurrent Git requests for interactive users over SSH and HTTP together." ... ok then?	22:15
ianw	maybe it should have a different name then	22:15
clarkb	ya I agree it is super confusing	22:15
clarkb	I stared at a bit this morning then went on a bike ride and eventually ended up with the change I linked	22:15
clarkb	and I'm still not entirely sure it will do what I think it will do	22:16
ianw	clarkb: https://gerrit-review.googlesource.com/Documentation/access-control.html#service_users i wasn't aware of that one	22:16
clarkb	ya so service users has another issue for us	22:16
clarkb	its binary and we really have three classes of user	22:17
clarkb	we have humans, zuul, and all the other bots	22:17
clarkb	I brought this up with upstream a while back after the 3.5 or 3.4 upgrade (whichever put that in place and started making zuul stuff slow)	22:17
clarkb	we ended up setting batch threads to 0 so that the pools are shared since we have the three classes of users and can't really separate them in a way that makes sense	22:18
clarkb	however, maybe putting sf in its own thread pool as a service user and having everything else mix with the humans would be an improvement?	22:18
* clarkb is just thinking out loud here		22:18
clarkb	also I'm melting into the couch so the brain may not be operating well	22:21
ianw	yeah, i think maybe starting with the thread increases as the simple thing and going from there would be good first steps	22:23
opendevreview	Steve Baker proposed openstack/diskimage-builder master: Do dmsetup remove device in rollback https://review.opendev.org/c/openstack/diskimage-builder/+/847860	22:23
opendevreview	Steve Baker proposed openstack/diskimage-builder master: Support LVM thin provisioning https://review.opendev.org/c/openstack/diskimage-builder/+/840144	22:23
opendevreview	Steve Baker proposed openstack/diskimage-builder master: Add thin provisioning support to growvols https://review.opendev.org/c/openstack/diskimage-builder/+/848688	22:23
clarkb	ianw: frickler: that brings up a good point though that if/when this happens again we should see about running a show queue against gerrit	22:24
clarkb	`gerrit show-queue -w -q` that should give us quite a bit of good info on what is going on	22:25
ianw	yeah i did do that when i first saw it	22:26
ianw	i can't remember what i saw :/	22:26
ianw	maybe i logged something	22:26
fungi	maybe it was so horrific you repressed that memory	22:27
clarkb	and then https://review.opendev.org/c/openstack/project-config/+/853536 is a completely unrelated but interesting change too	22:27
clarkb	fungi: ^ re the tripleo and other ssh issues	22:27
ianw	i didn't : https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-08-11.log.html#t2022-08-11T01:26:37 ... i can't remember now. i don't think i saw a smoking gun	22:28
ianw	anyway, yes ++ to running and taking more notice of the queue stats	22:28
ianw	interesting ...	22:30
clarkb	and also I'm not sure restarting gerrit helps anything. I don't think we are running out of resources. We're just hitting our artificial limits and everything is queueing behind that	22:31
clarkb	But to double check that gerrit show-caches will emit memory info so that is another good command to try	22:31
ianw	yeah that was the conclusion that i came to at the time; it seemed like it had stopped responding to new requests but was still working	22:31
ianw	i wonder with the ssh thing, can you just instantly block anything trying password auth?	22:32
ianw	this is just thinking out loud, i've never looked into it	22:33
clarkb	ianw: we do already reject password auth, but I think the protocol itself allows you to try password auth and fail, then try key 1 and then key 2 and so on	22:33
ianw	yeah, i meant maybe more than just reject it, if you try a password you get blocked. but it's probably not really possible due to the negotiation phase	22:34
fungi	yes, ssh protocol is designed to accept multiple authentication mechanisms and "negotiate" between the client and server by trying different ones until something works	22:34
ianw	yeah even something like fail2ban is scanning rejection logs	22:35
ianw	even rate limiting ip's to 22 probably doesn't help, as it's multiple scanners i guess	22:37
fungi	though also we do have concurrent connection limits enforced by conntrack	22:38
clarkb	and zuul can be pretty persistent	22:38
clarkb	the reason we're seeing rsync fail is that it doesn't go through the controlpersist process	22:38
clarkb	depending on how many rsyncs you do in a short period of time you could hit a rate limit	22:39
clarkb	I suspect we're right on the threshold too which is why we don't see this super frequently	22:39
clarkb	but I could be wrong	22:39
clarkb	this is half hunch at this point anyway	22:39
clarkb	but it does explain what tripleo observed pretty neatly	22:40
opendevreview	Merged openstack/project-config master: Tune sshd connections settings on test nodes https://review.opendev.org/c/openstack/project-config/+/853536	22:49
johnsom	clarkb On a two hour tempest job, there were not a large number of SSH hits. Could just be the timing, but it's not an ongoing thing.	23:45
johnsom	56 SSH connections over that two hours, including the zuul related connections	23:46
clarkb	johnsom: right I think that is why it affects so few jobs	23:47
clarkb	its all about getting lucky	23:47
fungi	that would also explain the clustering i saw for timestamps when looking at the zuul builds list	23:53
fungi	you would get several failing around the same time, with gaps between	23:53
clarkb	ya and it is possible I've misdiagnosed it too. But based on the tripleo logs I'm reasonably confident it is this	23:54
clarkb	it perfectly explains why ssh to localhost randomly doesn't work	23:55
