JayF | similarly failed | 00:00 |
---|---|---|
ianw | i can propose a change with enableChannelIdTracking=false. i think we should file another upstream bug report though, as i can't find a recent one | 00:01 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases https://review.opendev.org/c/opendev/system-config/+/869210 | 00:02 |
fungi | if it helps, i was able to push that ^ easily and quickly | 00:03 |
JayF | perhaps it's related to the size of the repo? | 00:03 |
JayF | I'm trying to push literally a three character logging fix for neutron lol | 00:03 |
fungi | openssh-client 1:9.1p1-2 from debian/sid | 00:03 |
ianw | JayF's push looks like | 00:04 |
ianw | git-upload-pack./openstack/neutron.git 4ms 13044ms '4ms 16ms 0ms 111ms 0ms 11950ms 12077ms -1 3658 5467 3462123' 0 - 751ms 740ms 152267816 | 00:04 |
ianw | i have no idea what all those numbers mean | 00:04 |
fungi | i was pushing a change for the same repo where ianw saw his error | 00:04 |
ianw | fungi: yeah, i re-pushed and didn't see the error -- but also that was a different error to jayf | 00:04 |
fungi | ahh, okay | 00:05 |
JayF | mtr to review.opendev.org makes it look like cogent:comcast links ipv6 are doing bad | 00:05 |
JayF | I wonder if this is internet shenanigans to a degree to impact this | 00:05 |
ianw | it does seem to suggest that slow (or maybe packet-dropping-ish) links and big repos trigger it | 00:05 |
* JayF puts a v4 address in his hosts file and tries a thing | 00:05 | |
ianw | can you try ipv4? | 00:05 |
clarkb | I don't think the wikimedia workaround is still a valid option (possibly due to the switch to MINA) | 00:06 |
JayF | looks like the issues persist over v4, but so do the packet drops upstream of me in MTR | 00:06 |
ianw | although in the overall list of people seeing this, there's plenty of ipv4 addresses | 00:06 |
ianw | clarkb: i thought that change was for mina -- not nio2 or whatever it used to be? | 00:07 |
* JayF stepping away, but will happily retry if ping'd | 00:07 | |
clarkb | ianw: hrm it may have been but looking at current valid sshd options for gerrit that flag is not in the list | 00:08 |
ianw | i think it's https://gerrit-review.googlesource.com/c/gerrit/+/238384 | 00:08 |
ianw | "For more safety, protect this experimental feature behind undocumented | 00:08 |
ianw | configuration option, but enable this option per default. | 00:08 |
ianw | " | 00:08 |
clarkb | heh I'm not sure I like that approach. Secret config options | 00:09 |
clarkb | but undocumented so we don't know what they do | 00:09 |
ianw | https://issues.apache.org/jira/browse/SSHD-942 seems to be upstream discussion | 00:11 |
clarkb | fwiw I can git remote update in neutron and nova without issue | 00:12 |
JayF | `git fetch gerrit` also worked for me immediately before I started having this issue | 00:12 |
ianw | "I thought some more about this and what I have suggested might "mask" out protocol problems that are not related to latency or out-of-order messages - e.g. a client that does not adhere to the protocol 100% and is indeed sending messages when they are not expected." | 00:13 |
ianw | i think we can rule out the client not adhering to the protocol | 00:13 |
ianw | so that really seems to leave out-of-order messages | 00:13 |
JayF | which matches my observation of packet loss in the internet route to gerrit | 00:14 |
clarkb | I don't cross cogent. I'm going through wholesail and beanfield | 00:14 |
JayF | ooooooh | 00:14 |
JayF | I'm going to tether. | 00:14 |
clarkb | thought I'd try due to adjacency to JayF but our ISPs seem to share completely different sets of peering | 00:14 |
clarkb | s/share/use/ | 00:14 |
ianw | i guess the question is -- is this an error that gerrit is quitting, or would it be failing anyway and gerrit is just telling us | 00:14 |
JayF | I have comcast and am currently on some kind of wait list for centurylink :( | 00:14 |
clarkb | ianw: with MINA its hard to say unless you pull out your ssh protocol rfc and start reading :( this is what ultimately turned me into a java dev | 00:15 |
clarkb | fair warning | 00:15 |
JayF | so, it worked | 00:16 |
JayF | from my tether | 00:16 |
ianw | ... interesting ... | 00:17 |
fungi | my v4 and v6 routes seem to go from my broadband provider, through beanfield, and into vexxhost. no other backbones | 00:17 |
ianw | JayF: if you have time, i wonder if we might catch a tcpdump of a failure to your IP? it might show us out-of-order things? | 00:18 |
JayF | so my hypothesis: gerrit is currently extremely intolerant of packet loss / out of order packets | 00:18 |
JayF | ianw: I can tell you with nearly 100% certainty from client behavior: I went from "some retransmits" to "zero retransmits" once I tethered | 00:18 |
fungi | the tcp/ip stack on the server should normally reassemble streams into the correct packet order | 00:18 |
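_(For reference, a hedged sketch of the kind of server-side capture ianw suggests; the interface name and client address are placeholders, not values from this discussion.)_

```shell
# Hedged sketch: capture Gerrit ssh traffic to one client so retransmits and
# out-of-order segments would be visible in the resulting pcap.
sudo tcpdump -i eth0 -w gerrit-ssh.pcap 'tcp port 29418 and host 203.0.113.10'
```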
JayF | I actually need to head out for now though, sorry you all, good luck o/ | 00:19 |
ianw | "This is old issue, but we are getting reports, that the problem shows up even with the custom UnknownChannelReferenceHandler. And even could disappear if using default DefaultUnknownChannelReferenceHandler instance according to this gerrit configuration option: | 00:19 |
ianw | sshd.enableChannelIdTracking = false." | 00:19 |
ianw | JayF: no worries, thanks :) | 00:20 |
fungi | but if there's massive packet loss too, just about anything will get cranky | 00:20 |
ianw | that's hard to parse, but it seems to be saying that "we're seeing this, and users are not seeing it if they turn off enableChannelIdTracking" | 00:20 |
clarkb | ianw: right, but what does that lose us and why does gerrit think it is very unsafe so much so to not document it | 00:21 |
clarkb | I'm good with trying it if we understand the impacts and the risks are reasonable | 00:21 |
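_(A hedged sketch of how the option quoted above could be set; the option name comes from the upstream quote, while the config path is the usual $GERRIT_SITE layout and is an assumption here. Gerrit would need a restart for sshd settings to take effect.)_

```shell
# Hedged sketch: disable MINA channel-id tracking via the undocumented option.
git config -f "$GERRIT_SITE/etc/gerrit.config" sshd.enableChannelIdTracking false
```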
clarkb | I need to switch gears to getting a meeting agenda out before dinner. Holler if this ends up needing more eyeballs | 00:23 |
ianw | yeah i'm just trying to parse all these changelogs etc | 00:28 |
clarkb | ok I've finally managed to update the wiki agenda contents. I'll give it a few if anyone wants to add anything | 00:55 |
*** dasm|rover is now known as dasm|out | 00:55 | |
clarkb | ianw: do we want the ssh thing on there? | 00:55 |
ianw | clarkb: umm, i'm writing up a bug report, we can add it to see what, if anything, gets said about it | 00:56 |
clarkb | sounds good | 00:56 |
fungi | 869091 is back to failing system-config-run-gitea even though 872801 merged a few hours ago | 01:00 |
clarkb | it passed on the apparmor fix change | 01:00 |
clarkb | the run-gitea job I mean (that doesn't get us your logo stuff though) | 01:00 |
fungi | yeah, i'll get logs pulled up here in a few and see why | 01:04 |
fungi | oh, nevermind, i was looking at old results i think | 01:06 |
fungi | it passed when i rechecked after the image fix landed | 01:06 |
fungi | clarkb: is this more what you were expecting? http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_0b6/869091/11/check/system-config-run-gitea/0b667a3/bridge99.opendev.org/screenshots/gitea-main.png | 01:08 |
clarkb | yup I think that flows a lot better | 01:08 |
clarkb | let me rereview that change | 01:08 |
clarkb | then once ianw has a bug report link I can add that to the agenda and get that sent | 01:09 |
fungi | once we get the donor logos addition landed, we're in a good place to talk about which other logos we might want to add to the list | 01:09 |
clarkb | I +2'd it | 01:10 |
fungi | there's a good case for adding osuosl since they host part of our control plane (nb04) | 01:11 |
clarkb | ++ | 01:11 |
ianw | ok, https://github.com/apache/mina-sshd/issues/319 | 01:21 |
ianw | happy for feedback if it makes sense ... | 01:22 |
ianw | after pulling that apart, i am unsure if turning the option off will make a difference | 01:22 |
ianw | it seems like it will just think the channel isn't open and fail. but i guess the only way would be to test | 01:23 |
clarkb | ok agenda sent | 01:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-skopeo: fixup some typos https://review.opendev.org/c/zuul/zuul-jobs/+/872733 | 03:04 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/872806 | 03:05 |
opendevreview | Merged zuul/zuul-jobs master: ansible-lint: uncap https://review.opendev.org/c/zuul/zuul-jobs/+/872495 | 03:29 |
opendevreview | Merged zuul/zuul-jobs master: build-docker-image: fix change prefix https://review.opendev.org/c/zuul/zuul-jobs/+/872258 | 03:39 |
opendevreview | Merged zuul/zuul-jobs master: container-roles-jobs: Update tests to jammy nodes https://review.opendev.org/c/zuul/zuul-jobs/+/872375 | 03:39 |
ianw | Feb 7 03:46:24 graphite02 systemd[1]: systemd-fsckd.service: Succeeded. | 03:45 |
ianw | Feb 7 03:39:23 graphite02 systemd-timesyncd[462]: Initial synchronization to time server [2620:2d:4000:1::40]:123 (ntp.ubuntu.com). | 03:45 |
ianw | time went a fair way backwards on graphite02 when i restarted it | 03:45 |
ianw | (i noticed this because the container was saying it was created "less than a second ago" for ... more than a second) | 03:46 |
ianw | i so far haven't noticed this on any other servers i've restarted ... it does suggest though that ntp wasn't working | 03:48 |
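_(A hedged aside: standard systemd tooling that would show whether timesyncd considers the clock synchronized and what it logged around the reboot; nothing here is specific to graphite02.)_

```shell
# Hedged sketch: check clock sync state and recent timesyncd activity.
timedatectl status
journalctl -u systemd-timesyncd -n 20 --no-pager
```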
opendevreview | Vishal Manchanda proposed openstack/project-config master: Retire xstatic-font-awesome: remove project from infra https://review.opendev.org/c/openstack/project-config/+/872835 | 03:58 |
*** yadnesh|away is now known as yadnesh | 04:07 | |
ianw | keycloak also wasn't happy on restart. i tried a turn-it-off-and-on-again approach and it seems ok now | 04:07 |
ianw | it's also happened on paste | 04:14 |
ianw | Feb 7 04:20:26 paste01 systemd[1]: systemd-fsckd.service: Succeeded. | 04:14 |
ianw | Feb 7 04:12:40 paste01 systemd-timesyncd[413]: Initial synchronization to time server [2620:2d:4000:1::41]:123 (ntp.ubuntu.com). | 04:14 |
ianw | so the gerrit promote failed | 05:01 |
ianw | https://review.opendev.org/c/opendev/system-config/+/872802 | 05:01 |
ianw | all failed in "promote-docker-image: Delete all change tags older than the cutoff" | 05:03 |
opendevreview | Merged zuul/zuul-jobs master: ensure-skopeo: fixup some typos https://review.opendev.org/c/zuul/zuul-jobs/+/872733 | 05:19 |
fungi | ianw: the tag delete happens after the new tag is done, right? if so, that's benign at least | 05:25 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 05:48 |
ianw | fungi: yep, that's right. i think something like ^ will help with this | 05:49 |
ianw | i'll restart gerrit with that under new docker | 05:50 |
ianw | (that is done) | 05:52 |
*** ysandeep is now known as ysandeep|ruck | 06:00 | |
ianw | https://etherpad.opendev.org/p/docker-23-prod is about half done. the rest are probably more suited to an ansible loop which i can look at later | 06:14 |
ianw | fungi: ^ maybe you could look at the mailman ones? | 06:15 |
opendevreview | Merged openstack/project-config master: nodepool: infra-package-needs; cleanup tox installs https://review.opendev.org/c/openstack/project-config/+/872478 | 06:26 |
*** jpena|off is now known as jpena | 08:24 | |
*** gthiemon1e is now known as gthiemonge | 10:38 | |
*** ysandeep__ is now known as ysandeep|break | 11:36 | |
*** soniya29 is now known as soniya29|afk | 11:39 | |
*** yadnesh is now known as yadnesh|away | 12:22 | |
*** ysandeep|break is now known as ysandeep|ruck | 12:22 | |
fungi | ianw: i have a held mailman 3 server i can test on first too | 12:45 |
fungi | the mailman 2 servers aren't dockerized anyway | 12:45 |
ysandeep|ruck | openstack-tox-py27 on many projects failing with ERROR: py27: InterpreterNotFound: python2.7 | 13:44 |
ysandeep|ruck | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&skip=0 | 13:44 |
*** dasm|out is now known as dasm|rover | 13:53 | |
fungi | ysandeep|ruck: when did it start? was it coincident with tox v4 releasing at the end of 2022, or when we switched our default nodeset to ubuntu-jammy earlier last year? | 14:06 |
fungi | or was it more recent? | 14:06 |
fungi | i assume this is for older openstack stable branches, since openstack hasn't supported python 2.7 for a number of releases now | 14:06 |
ysandeep|ruck | fungi, started yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=0 | 14:06 |
ysandeep|ruck | yes hitting on stable branches | 14:07 |
fungi | interesting, i should be able to take a closer look in a few minutes | 14:07 |
slittle1_ | StarlingX release scripts always have issues pushing new branches or tags across all our gits. It starts out ok, but eventually we start seeing 'connection reset by peer'. It feels like we are triggering some sort of spam/bot protection. | 14:15 |
slittle1_ | or .... | 14:16 |
slittle1_ | ssh_exchange_identification: read: Connection reset by peer | 14:16 |
slittle1_ | fatal: Could not read from remote repository. | 14:16 |
slittle1_ | just now | 14:16 |
fungi | slittle1_: we observed this from some locations starting late yesterday. it appears cogent (the internet backbone provider) is having some issues, and connections traversing their network are suffering | 14:53 |
fungi | my connection doesn't get routed through cogent and as a result things are quite snappy | 14:54 |
fungi | it's possible vexxhost (where those servers are housed) could do some massaging of bgp announcements to make cogent seem less attractive | 14:54 |
Clark[m] | But we do also have connection limits by IP and user account | 14:54 |
fungi | oh, yes that's also a potential culprit | 14:55 |
Clark[m] | fungi: ysandeep|ruck I think we may have dropped python2 from our images expecting bindep to cover that. | 14:55 |
fungi | i think it's 64 simultaneous connections from the same gerrit account or 100 concurrent sockets from the same ip address | 14:56 |
fungi | if you exceed either of those your authentication gets denied or your connection gets refused, respectively | 14:56 |
fungi | though the cogent issue could also be compounding that if sockets are getting left hanging | 14:56 |
fungi | slittle1_: by always have issues i assume you mean since longer than just the past 24 hours | 14:57 |
fungi | so Clark[m]'s point about connection limits seems more likely in this case | 14:57 |
*** dasm|rover is now known as dasm|afk | 15:15 | |
opendevreview | Merged opendev/system-config master: Feature our cloud donors on opendev.org https://review.opendev.org/c/opendev/system-config/+/869091 | 15:15 |
jrosser | fwiw some connectivity issue to opendev.org is recurring again now, i see the same very low throughput as i had a few weeks ago | 15:18 |
jrosser | and just the same going to another host with different peering but geographically similar, it's all good | 15:18 |
fungi | jrosser: yeah, JayF mentioned seeing similar at 23:37 utc | 15:23 |
fungi | presumably traversing cogent in your traceroute? | 15:23 |
jrosser | fungi: looks like zayo again for me as the closest thing to opendev.org | 15:26 |
ysandeep|ruck | Clark[m], >> I think we may have dropped python2 from our images expecting bindep to cover that. -- is it possible to confirm that theory? | 15:27 |
Clark[m] | Yes look at the change log for openstack/project-config to see if the proposed changes landed | 15:28 |
fungi | jrosser: it may be an asymmetric route too (packets to vexxhost via zayo, replies coming back across cogent). i'd have to traceroute to your address from the load balancer to confirm though | 15:28 |
fungi | or from the gerrit server i guess, though any server there likely follows the same routes | 15:28 |
ysandeep|ruck | Clark[m], i see https://github.com/openstack/project-config/commit/95ff98b54b1095dae6591ac273777ff79834887d | 15:28 |
fungi | might have been "nodepool: infra-package-needs; cleanup tox installs" https://review.opendev.org/c/openstack/project-config/+/872478 | 15:29 |
fungi | that merged 06:26:24 utc today | 15:29 |
fungi | would have taken effect the next time images were built | 15:30 |
ysandeep|ruck | but we are seeing issue since yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=0 | 15:31 |
Clark[m] | https://review.opendev.org/c/openstack/project-config/+/872476/2 this one | 15:31 |
fungi | oh, yep. we stopped installing python-dev | 15:32 |
fungi | and python-xml, either of which would have pulled in python2.7 | 15:32 |
ysandeep|ruck | dasm|afk, ^^ I will be out soon, please follow the chatter | 15:32 |
fungi | that merged 00:16:31 yesterday | 15:33 |
fungi | so images built yesterday would have no longer had python2.7 present on them by default | 15:33 |
clarkb | and confirmed that bindep.txt lacks python in it https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/bindep.txt | 15:35 |
clarkb | the glance and neutron bindep.txt files are similar | 15:36 |
clarkb | nova and swift are fine | 15:36 |
clarkb | I think we have ~3 options: A) revert or re-add the python2-dev installs B) ask projects to fix their broken bindep.txt files (something we've asked for for years at this point) or C) update -py27 jobs to explicitly install python2 for you | 15:37 |
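_(For option B, a hedged sketch of the kind of bindep.txt entries that would pull the interpreter back onto the test nodes; package names and profile selectors are assumptions and vary by branch and distro.)_

```shell
# Hedged sketch: append python2 packages to a project's bindep.txt.
cat >> bindep.txt <<'EOF'
python2.7 [platform:dpkg test]
python2.7-dev [platform:dpkg test]
EOF
```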
slittle1_ | fungi: By always I mean every 3-6 months when we want to create a branch or tag on all the StarlingX gits. | 15:39 |
clarkb | in that case there is a good chance you are hitting the IP address limits or user account limits if you run things in parallel | 15:43 |
fungi | slittle1_: how are you creating them and how many? are you doing it in parallel? with a single account? | 15:43 |
fungi | if they're all done from a single account and it's authenticating more than 64 operations at the same time, that could be it | 15:44 |
clarkb | and even if it never fully gets to the IP limit tcp stacks will often keep connections open for a bit once they are "done" | 15:45 |
clarkb | so you might only do 60 pushes and be fine from an auth perspective but if you've got a tcp stack holding the connections open for a few extra minutes then eventually you can trip over the tcp connection limit | 15:45 |
slittle1_ | We use 'repo forall' to iterate through our gits serially, perhaps 60 gits in all. For each there would be a 'git fetch --all', a 'git review -s', a 'git push ${review_remote} ${tag}', a 'ssh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag}' and a 'git review --yes --topic=${branch/\//.}' | 15:46 |
clarkb | repo isn't serial though right? | 15:47 |
clarkb | (I'm not super familiar with repo, but my understanding is that it tries to setup large systems like android in parallel. Something google can get away with but not our gerrit) | 15:47 |
slittle1_ | 'repo forall' is serial by default | 15:47 |
fungi | a quick count of starlingx/ namespace projects returns 57 at the moment, so the number that have branches being created is probably lower | 15:48 |
slittle1_ | We branch nearly all of them at the same time | 15:48 |
fungi | in that case it's more likely the ip connection limit (i'm pretty sure the connection refused response would point to that, as the account connection limit would be more likely to report a different error message) | 15:49 |
fungi | it may be that the connections are getting created too quickly in succession and not enough time is being allowed for the server's tcp/ip stack to close them down | 15:49 |
clarkb | fungi: yup that is my suspicion | 15:50 |
*** dasm|afk is now known as dasm|rover | 15:50 | |
clarkb | each of those commands will create new tcp connections and they may idle for a few minutes | 15:50 |
fungi | are those commands being issued by someone behind a nat shared by other clients that may also be connecting? if so that could also make it more likely | 15:50 |
clarkb | per repo its at least 5 connections (I think more) | 15:50 |
slittle1_ | script runtime would probably be 5 min if it could get through without errors | 15:50 |
clarkb | after 20 repos you'll be in ip blocking territory if things haven't cleaned up quickly enough | 15:51 |
ysandeep|ruck | Clark[m]: thanks, as the issue is affecting multiple repos and older stable branches, I will let infra decide which option is better atm before adding it in bindep.txt (for tripleo repos) | 15:51 |
slittle1_ | not sure about the 'nat' question. I'd have to chase down our own IT guys for that one | 15:51 |
clarkb | if your local IP isn't a publicly routable IP then most likely this is the case | 15:52 |
jrosser | is all of that over ssh? perhaps some ControlPersist could help? | 15:52 |
fungi | if you know what ip address it's coming from i may be able to find some evidence in syslog/journald (i can't remember of those conntrack limit rules log) | 15:52 |
fungi | s/of/if/ | 15:52 |
slittle1_ | there is also a retry loop, 45 sec delay, max 5 retries, when one of the above commands fail. | 15:52 |
slittle1_ | it's not publicly routable | 15:53 |
*** ysandeep|ruck is now known as ysandeep|out | 15:54 | |
fungi | slittle1_: i mean the public address it's getting rewritten to on exiting your network | 15:54 |
fungi | anyway, this is the rule we're suspecting is being tripped: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/review.yaml#L4 | 15:55 |
fungi | in theory it would stop blocking as soon as the tracked connections falls below 100 | 15:55 |
clarkb | jrosser's suggestion may be one to try since that will reduce total ssh connections. I'm not sure how to set that up with git though. Might need to use GIT_SSH? | 15:56 |
clarkb | or maybe you can edit your ssh config to manipulate that | 15:56 |
slittle1_ | probably 28.224.252.2 | 15:57 |
slittle1_ | 128.224.252.2 | 15:57 |
jrosser | google took me here https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/ | 15:57 |
fungi | or try throttling down the loop and see if that helps give earlier connections time to clean up | 15:57 |
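_(A hedged sketch of that throttling suggestion: the same per-repo commands slittle1_ described, with a pause between repos so earlier sockets can close. Variable names mirror the ones quoted above; the sleep value is a guess.)_

```shell
# Hedged sketch: serialize the per-repo operations and throttle between repos.
repo forall -c '
  git fetch --all
  git review -s
  git push "${review_remote}" "${tag}"
  ssh -p "${port}" "${host}" gerrit create-branch "${path}" "${branch}" "${tag}"
  git review --yes --topic="${branch/\//.}"
  sleep 15
'
```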
jrosser | if git is using the underlying ssh config (which i think it does) then you should be able to use any of the usual things in your ~/.ssh/config | 15:58 |
clarkb | ya it should | 15:58 |
fungi | we probably need to add -j LOG to that rule if we want to be able to find evidence of these events after the fact | 15:59 |
fungi | though i have no idea if there may be ssh brute-forcing attempts constantly hitting that which could fill the logs quickly | 16:00 |
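_(A hedged sketch of the kind of connlimit rule linked above, plus the LOG variant fungi mentions; the real rule lives in the system-config group_vars, and the chain, port, and limit here are assumptions.)_

```shell
# Hedged sketch: log and then reject new Gerrit ssh connections from a source
# address that already has more than 100 tracked connections.
iptables -A INPUT -p tcp --syn --dport 29418 -m connlimit --connlimit-above 100 \
    -j LOG --log-prefix "gerrit-connlimit: "
iptables -A INPUT -p tcp --syn --dport 29418 -m connlimit --connlimit-above 100 \
    -j REJECT --reject-with tcp-reset
```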
slittle1_ | So I try .... | 16:04 |
slittle1_ | Host *.opendev.org | 16:04 |
slittle1_ | ControlMaster auto | 16:04 |
slittle1_ | ControlPersist 600 | 16:04 |
slittle1_ | Sound ok ? | 16:04 |
fungi | netfilter docs mention /proc/net/nf_conntrack which doesn't seem to exist, i wonder if this kernel version exposes that somewhere else | 16:05 |
fungi | found /proc/sys/net/netfilter/nf_conntrack_* | 16:07 |
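_(Once the conntrack package is installed, a hedged example of the kind of check fungi is after; the address is the one from the discussion above.)_

```shell
# Hedged sketch: count conntrack entries originating from one source address.
sudo conntrack -L --src 128.224.252.2 2>/dev/null | wc -l
```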
clarkb | slittle1_: yes that seems correct, I'd probably not use a * though and keep in mind that this will be shared with all ssh things to the identified server (shouldn't be an issue but calling it out) | 16:37 |
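_(A hedged sketch of the narrower config clarkb suggests; note that ControlMaster generally does nothing without a ControlPath, so one is added here as an assumption worth checking.)_

```shell
# Hedged sketch: scope connection sharing to the Gerrit host and give the
# master a control socket path so multiplexing actually happens.
cat >> ~/.ssh/config <<'EOF'
Host review.opendev.org
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 600
EOF
```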
fungi | clarkb: where would be the best place to add the conntrack package (provides tools for inspecting the kernel's connection tracking tables since i can't find them in /proc) on the gerrit server? not the docker image obviously, it's for the underlying server. should we just install it on all our servers? there's the iptables role but it's hard-coded to only install one package and it's also multidistro which i doubt we need any longer | 16:37 |
clarkb | fungi: do we know if the jeepyb fix from yesterday was sufficient to make the daily job run? | 16:37 |
clarkb | fungi: the ansible gerrit role is where I'd do it if you want it specific to gerrit. Otherwise in our base package list for all servers | 16:38 |
fungi | oh, right, i was meaning to rerun manage-projects today | 16:39 |
clarkb | I'm not sure if we auto pull the gerrit image or not which is what would be needed to have the daily run pick it up | 16:39 |
clarkb | you don't need to restart gerrit on the new image to use the new image for manage-projects at least | 16:40 |
fungi | manage-projects for starlingx/public-keys seems to have worked this time | 16:41 |
fungi | it pushed to refs/meta/config successfully according to the log | 16:42 |
fungi | though failed after that: | 16:42 |
fungi | jeepyb.utils - INFO - Executing command: git --git-dir=/opt/lib/jeepyb/starlingx/public-keys/.git --work-tree=/opt/lib/jeepyb/starlingx/public-keys checkout master | 16:43 |
fungi | error: pathspec 'master' did not match any file(s) known to git | 16:43 |
clarkb | https://review.opendev.org/admin/repos/starlingx/public-keys,branches seems there in gerrit at least | 16:44 |
clarkb | is debian's git init creating main by default now maybe? | 16:45 |
fungi | when i `git init` on my up to date debian/sid workstation it says "hint: Using 'master' as the name for the initial branch. This default branch name is subject to change." | 16:48 |
clarkb | yup I did a git init in the gerrit image and got the same result | 16:49 |
clarkb | same == master is default | 16:49 |
clarkb | fungi: was there an earlier error? the checkout is in a finally block | 16:49 |
fungi | i'll see if i can find earlier errors but that was right after successfully pushing the refs/meta/config update. the log is at /var/log/manage-projects_2023-02-07T16\:39\:44+00\:00.log | 16:50 |
clarkb | fungi: the other thing to consider is we may not be git initing anymore because the repo exists in gerrit so we'd be cloning instead? | 16:51 |
clarkb | but still master exist in the gerrit repo as HEAD | 16:51 |
fungi | yeah, you can see the git clone | 16:52 |
clarkb | ya that is the problem I think. If you git clone it HEAD is master but there is nothing there | 16:52 |
clarkb | yup I've reproduced by cloning and trying to checkout master | 16:52 |
fungi | so the earlier failures left it in a dirty state | 16:52 |
clarkb | I think this is a bug in jeepyb where if we don't handle it with the initial git init locally to push the .gitreview file to the master branch we're stuck | 16:52 |
fungi | and rerunning won't recover from that | 16:52 |
clarkb | yup | 16:52 |
clarkb | a workaround would be to push the .gitreview file manually. Or maybe even just propose a change for it? | 16:53 |
clarkb | I'm not sure how to address this in jeepyb safely. maybe fallback to git init if we cannot checkout the default branch? | 16:54 |
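_(A hedged sketch of the manual workaround clarkb mentions: commit a .gitreview in the empty clone and push it straight to master. Pushing directly to refs/heads/master needs bootstrapper-level permissions, and `<user>` is a placeholder.)_

```shell
# Hedged sketch: bootstrap the empty project with a .gitreview on master.
git clone "ssh://<user>@review.opendev.org:29418/starlingx/public-keys"
cd public-keys
cat > .gitreview <<'EOF'
[gerrit]
host=review.opendev.org
port=29418
project=starlingx/public-keys.git
EOF
git add .gitreview
git commit -m "Add .gitreview"
git push origin HEAD:master
```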
clarkb | I need to pop out before my next round of meetings to eat breakfast | 16:56 |
fungi | wow, i can't push --force with my admin account unless it agrees to the icla | 17:06 |
fungi | i think that may be new? | 17:06 |
fungi | probably a missing permission in project bootstrappers | 17:06 |
fungi | i wonder how i agree to the icla with it since it has no webui access | 17:06 |
fungi | To ssh://review.opendev.org:29418/starlingx/public-keys | 17:07 |
fungi | * [new branch] master -> master | 17:07 |
fungi | i very briefly added my normal account to get around that | 17:08 |
fungi | i wanted to do it via a review, but git-review refused to use the master remote citing it did not exist (i guess because it was empty) | 17:08 |
fungi | https://opendev.org/starlingx/public-keys | 17:10 |
fungi | manage-projects runs clean now | 17:10 |
clarkb | fungi: you need to add yourself to bootsrappers but then I think you can push? | 17:11 |
fungi | right, you need to add an account that has agreed to the icla to bootstrappers | 17:11 |
clarkb | oh I thought any account in bootstrappers could do it. I guess maybe our creation account has signed it? | 17:12 |
fungi | we probably need to explicitly put our admin accounts in the system-cla group, or it's possible new gerrit has an additional perm we need to add to bypass cla enforcement | 17:12 |
clarkb | oh system-cla makes sense | 17:12 |
fungi | slittle1_: i've added your gerrit account as the initial member of the starlingx-public-keys-core group now. sorry about the delay | 17:14 |
fungi | group name seems to be 'System CLA' for reference | 17:15 |
fungi | right now the members are jenkins, openstack-project-creator, proposal-bot, release | 17:16 |
fungi | so that explains why manage-projects is normally able to do it | 17:16 |
*** jpena is now known as jpena|off | 17:17 | |
fungi | in future, i'll just add my fungi.admin account to that group and see if it solves the cla rejection | 17:17 |
opendevreview | You-Sheng Yang proposed opendev/git-review master: Allow specifying remote branch at listing changes https://review.opendev.org/c/opendev/git-review/+/872988 | 17:40 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Better diag for Gerrit server connection limit https://review.opendev.org/c/opendev/system-config/+/872989 | 17:55 |
noonedeadpunk | Hey there. We've got an issue with https://github.com/openstack/diskimage-builder/commit/2c4d230d7a09ad8940538338dfdc8bc3212fbc20 and rocky9 | 17:56 |
noonedeadpunk | I do see that cache-url should be included by cache-devstack and it seems present https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L186 | 17:57 |
noonedeadpunk | But couple of days ago we started seeing retry_limits as curl-minimal seems to be pre-installed there | 17:57 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update our base python images https://review.opendev.org/c/opendev/system-config/+/873012 | 17:59 |
fungi | noonedeadpunk: my first guess is this is related to the same base image cleanup changes which also stopped preinstalling python2.7 | 18:01 |
clarkb | there was a different change to stop installing curl because of rocky | 18:01 |
noonedeadpunk | aha | 18:01 |
clarkb | apparently rocky installs one of the curl packages and centos the other. And they conflict and installing one won't automatically replace the other | 18:01 |
clarkb | ianw's solution to this was to stop preinstalling anything and let the distro defaults decide | 18:02 |
noonedeadpunk | Yes, I like this solution | 18:02 |
clarkb | if your problems are with building images using dib then update to latest dib and I think you'll be good? | 18:02 |
noonedeadpunk | um. my problem is in CI images | 18:04 |
clarkb | ok you may need to describe the issue in more detail. Is curl-minimal (the rocky default iirc) not working for your jobs? | 18:04 |
noonedeadpunk | Aha, ok, I misunderstood the commit then | 18:04 |
clarkb | basically the dib change is saying don't explicitly try to install curl because the two rhel like distros we build conflict in their curl choice and the dib package management doesn't have a way to express "uninstall this first" I think | 18:05 |
noonedeadpunk | ok, yes, you're right. I was thinking some pre-install of curl still happens with that, but it's a wrong assumption. Thanks for clarifying that! | 18:06 |
noonedeadpunk | uh, why distros like to complicate things that much.... | 18:07 |
clarkb | heres fedora's justification https://fedoraproject.org/wiki/Changes/CurlMinimal_as_Default | 18:08 |
noonedeadpunk | 8mb on the diskspace... | 18:10 |
opendevreview | Elod Illes proposed openstack/project-config master: Remove add_master_python3_jobs.sh https://review.opendev.org/c/openstack/project-config/+/873016 | 18:12 |
clarkb | noonedeadpunk: what protocol do you use? | 18:26 |
clarkb | that doc implies the common ones should be in minimal | 18:26 |
fungi | looks like system-config-promote-image-assets and system-config-promote-image-gitea failed deploy for 869091 and so infra-prod-service-gitea was skipped | 18:30 |
noonedeadpunk | clarkb: shouldn't be anything outside of the minimal | 18:31 |
noonedeadpunk | I just misunderstood what that patch did | 18:31 |
noonedeadpunk | I assumed that there will be no curl as a result of it vs default provided one from minimal | 18:32 |
clarkb | I see | 18:32 |
fungi | okay, both failures were for the same thing ianw observed earlier: "Task Delete all change tags older than the cutoff failed running on host localhost" | 18:33 |
fungi | "the output has been hidden due to the fact that 'no_log: true' was specified for this result" but it's a uri call | 18:33 |
fungi | so maybe dockerhub changed their api or something | 18:33 |
clarkb | fungi: there was a change to zuul-jobs that I wrote to fix a linter rule to the calculation of the cutoff | 18:34 |
clarkb | fungi: its possible that is broken (and the linter was wrong?) | 18:34 |
clarkb | but that seems unlikely since all I did was remove the {{ }}s from a when: entry | 18:34 |
clarkb | and we successfully published gerrit images yesterday which was after the linter fix | 18:35 |
clarkb | (unless that job failed too and we didn't notice?) | 18:35 |
ianw | clarkb: i think https://review.opendev.org/c/zuul/zuul-jobs/+/872842 will help, but i need to look at the linter failure... | 19:08 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 19:10 |
opendevreview | Ade Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role https://review.opendev.org/c/zuul/zuul-jobs/+/866881 | 19:29 |
opendevreview | Merged opendev/system-config master: Update our base python images https://review.opendev.org/c/opendev/system-config/+/873012 | 19:52 |
ianw | https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 adds a pre playbook to install python/python-dev, which i think should get the 27 jobs back to where they were | 20:01 |
ianw | mea culpa on that one ... i wasn't even vaguely thinking about python2.7 | 20:02 |
fungi | you're living the dream! i'll be thrilled when i can stop thinking about python 2.7 | 20:02 |
fungi | envious | 20:02 |
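_(A hedged sketch of what that pre playbook boils down to on a debian-family node; the real change is in the linked review, and package availability depends on the node's distro release.)_

```shell
# Hedged sketch: put the python2.7 interpreter and headers back on the node.
sudo apt-get update
sudo apt-get install -y python2.7 python2.7-dev
```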
clarkb | apparently I'm not logged into gerrit so will have to +2 after lunch. I'm starving | 20:02 |
fungi | i'm testing the docker upgrade on the held mm3 node at 213.32.76.66 | 20:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 20:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: double-quote regexes https://review.opendev.org/c/zuul/zuul-jobs/+/873028 | 20:31 |
fungi | currently has docker-ce 5:20.10.23~3-0~ubuntu-jammy | 20:31 |
fungi | The following packages will be upgraded: docker-ce docker-ce-cli libssl3 openssl | 20:32 |
fungi | guess this was still missing the openssl security fix | 20:32 |
fungi | now running docker-ce 5:23.0.0-1~ubuntu.22.04~jammy | 20:33 |
fungi | the containers take a minute to get back to a working state | 20:33 |
fungi | still not up though. i wonder if i should have downed the containers before upgrading | 20:34 |
fungi | error: exec: "apparmor_parser": executable file not found in $PATH | 20:37 |
fungi | d'oh! | 20:37 |
fungi | working slightly better after installing apparmor | 20:37 |
fungi | ianw: your procedure on the pad starts with "stop docker" does that mean stopping dockerd or downing the containers or what? | 20:38 |
fungi | unfortunately, after restarting the containers on the held node, i'm getting a server error trying to browse the sites | 20:44 |
fungi | i'll need to dig the cause out of logs, i expect | 20:44 |
fungi | docker container logs indicate everything started up fine and apache logs say i'm getting a 200/ok response, but the page content is "Server error: An error occurred while processing your request." | 20:49 |
fungi | so i guess that's coming from the backend | 20:49 |
clarkb | fungi: I suspect downing the containers is sufficient. Then let the package upgrade restart the daemon | 20:50 |
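_(A hedged sketch of that ordering, folding in the apparmor package fungi found missing earlier; the compose directory is a placeholder.)_

```shell
# Hedged sketch: stop the service containers, upgrade docker (letting the
# package restart dockerd), then bring the services back up.
cd /path/to/compose-dir
docker-compose down
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli apparmor
docker-compose up -d
```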
fungi | https://paste.opendev.org/show/bBrbtHgLEhvJkwWELKB5/ | 20:54 |
fungi | this may be something still not quite right with the django site setting | 20:54 |
fungi | and maybe the container restart in the job isn't sufficient to exercise it | 20:55 |
fungi | i'll have to continue investigating after dinner or in the morning | 20:55 |
clarkb | ya that seems unlikely to be related to docker itself if the software is executing under docker | 20:55 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 20:57 |
clarkb | I guess https://review.opendev.org/c/opendev/system-config/+/873012 would've been hit by the broken promotions too? | 20:58 |
fungi | likely | 21:00 |
clarkb | ianw: fungi how new of an apt do you need for https://review.opendev.org/c/opendev/system-config/+/872808/1/playbooks/roles/install-docker/templates/sources.list.j2 to work? | 21:00 |
fungi | oh, right that may not work on xenial | 21:01 |
ianw | clarkb: i think it was 1.4, which is quite old | 21:01 |
ianw | do we have xenial docker? | 21:04 |
clarkb | I don't think so | 21:04 |
clarkb | so ya it may be fine. Also I think gerritbot may be out to lunch. I updated https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 to fix an issue and no notification here | 21:05 |
clarkb | or maybe its a different event for ps created by the web tool? | 21:05 |
opendevreview | Merged zuul/zuul-jobs master: promote-docker-image: double-quote regexes https://review.opendev.org/c/zuul/zuul-jobs/+/873028 | 21:08 |
slittle1_ | just got around to trying the modified ssh settings as suggested on this forum. No improvement | 21:10 |
slittle1_ | Running: git review -s | 21:10 |
slittle1_ | Problem running 'git remote update gerrit' | 21:10 |
slittle1_ | Fetching gerrit | 21:10 |
slittle1_ | kex_exchange_identification: read: Connection reset by peer | 21:10 |
slittle1_ | fatal: Could not read from remote repository. | 21:10 |
clarkb | thats a different error though right? | 21:11 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 21:11 |
clarkb | or maybe the earlier error was truncated to leave off the kex info | 21:11 |
clarkb | slittle1_: if the control persistence is working properly ps should show you an ssh process ControlMaster in its command line. Something like `ps -elf | grep ControlMaster` should find it | 21:14 |
slittle1_ | ssh_exchange_identification vs kex_exchange_identification, yes slightly different | 21:15 |
slittle1_ | ps -elf | grep ControlMaster ... nothing found | 21:16 |
clarkb | another thing to check is the number of connections you have open to gerrit `ss --tcp --numeric | grep 199.204.45.33` assuming ipv4 | 21:16 |
clarkb | ok I think it may not be using the ControlMaster then | 21:17 |
clarkb | what ControlMaster does is create a process that keeps a single ssh connection open, then subsequent ssh connections talk through that connection. It's also possible that MINA doesn't deal with multiple logical connections over one tcp connection properly. | 21:17 |
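_(A hedged one-liner to confirm whether a control master is actually active for the Gerrit destination; `<user>` is a placeholder, and this only reports usefully if a ControlPath is configured.)_

```shell
# Hedged sketch: ask ssh whether a master connection exists for this destination.
ssh -O check -p 29418 "<user>@review.opendev.org"
```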
slittle1_ | ss --tcp --numeric | grep 199.204.45.33 ... no output | 21:20 |
ianw | slittle1_ : are you coming from an ip starting with 128.224 ? | 21:20 |
slittle1_ | yes, probably 128.224.252.2 | 21:21 |
ianw | that's interesting | 21:21 |
ianw | $ ssh -p 29418 ianw.admin@localhost "gerrit show-connections -w" | grep 128.224.252.2 | wc -l | 21:21 |
ianw | 18 | 21:21 |
ianw | gerrit shows 18 open connections | 21:21 |
fungi | note that https://review.opendev.org/872989 is aimed at making that easier to debug too | 21:21 |
ianw | but, they have a blank user | 21:22 |
clarkb | fungi: that will log to syslog right? | 21:23 |
ianw | oh, i have deja vu on the blank user thing | 21:23 |
slittle1_ | hmmm, I have | 21:23 |
fungi | clarkb: yes | 21:23 |
slittle1_ | Host * | 21:23 |
slittle1_ | ServerAliveInterval 60 | 21:23 |
ianw | i think we see it when the session is open but not logged in | 21:23 |
slittle1_ | in .ssh/config | 21:23 |
fungi | netstat -nt reports 20 sockets from that address at the moment too | 21:23 |
clarkb | ianw: ya you need successful login for gerrit to attach the user | 21:23 |
slittle1_ | is that .ssh/config entry a concern ? | 21:24 |
clarkb | slittle1_: shouldn't be. The ServerAliveInterval creates extra packets on existing connections | 21:25 |
clarkb | but it shouldn't make new connections | 21:25 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 21:26 |
slittle1_ | clarkb: yes I believe so. Added it ages ago to keep ssh sessions alive a bit longer. | 21:26 |
clarkb | slittle1_: one side effect it could have though is keeping unneeded connections open for longer potentially | 21:27 |
clarkb | usually thats on the server side though as it wants to make sure it has received all of the bits from the client before shutting down completely | 21:28 |
slittle1_ | I wanted interactive ssh sessions to stay open when I took a call or was otherwise distracted for a bit | 21:28 |
clarkb | hrm I wonder if that means you've also got network gear that is shutting down connections between you and the remote | 21:29 |
clarkb | typically an ssh connection will stay open | 21:29 |
clarkb | but that could also impact the server side connection count if it thinks you've got a bunch of connections open that some firewall has killed without telling the server | 21:29 |
slittle1_ | I'll add a domain rather than applying to all connections | 21:29 |
slittle1_ | ok, that shouldn't impact opendev connections any longer | 21:31 |
slittle1_ | Gotta get kids, I'll test again tomorrow | 21:32 |
slittle1_ | Thanks for the help | 21:32 |
opendevreview | Jay Faulkner proposed opendev/infra-manual master: Add documentation on removing human user from pypi https://review.opendev.org/c/opendev/infra-manual/+/873033 | 21:35 |
clarkb | I have requested access to the gerrit community meeting agenda doc | 22:02 |
clarkb | fungi: for gitea I wonder if the easiest thing is to just land a noop change to the dockerfile? | 22:16 |
fungi | i recall netscreen firewalls by default silently dropped "idle" tcp sessions after 120 seconds, so on my machines at the office where they used one of those i set my ssh client configuration for a 60 second protocol level keepalive | 22:16 |
clarkb | similar to what I did for the python base images? | 22:16 |
clarkb | (and maybe push a similar change again for the python base images?) | 22:16 |
clarkb | though iwth the base images we don't need to coordinate the deploy pipeline so maybe we can just reenqueue | 22:16 |
fungi | though ssh keepalive also depends on the server side having support implemented. no idea if mina-sshd does | 22:16 |
fungi | clarkb: the donor logos deployment isn't urgent. i'm fine waiting for whatever we end up needing to merge to the gitea server images down the road to trigger a new deployment | 22:17 |
*** dasm|rover is now known as dasm|off | 22:40 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/872806 | 22:42 |
dasm|off | thanks fungi for the email update and "stable branches" py27 solution! | 22:44 |
fungi | dasm|off: ianw is to thank for pushing the change there | 22:46 |
fungi | i'm just trying to keep people apprised of what's going on | 22:46 |
ianw | well i also broke it ... so :) | 23:14 |