opendevreview | Dmitriy Rabotyagov proposed zuul/zuul-jobs master: Add role for uploading Ansible collections to Galaxy https://review.opendev.org/c/zuul/zuul-jobs/+/899230 | 08:12 |
pcheli | Hello, I'm setting up ThirdParty CI with jenkins and gerrit-trigger plugin. | 08:45 |
pcheli | Generally, it works. However, results posting fails with Too many concurrent connections (96) - max. allowed: 96. | 08:45 |
pcheli | Can anybody help with this? | 08:45 |
zigo | I'm really not sure what to do to re-trigger the puppet-nova release job and get the release notes in order ... can someone help? | 08:48 |
zigo | https://docs.openstack.org/releasenotes/puppet-heat/2023.2.html <--- 404 as well ... | 08:57 |
frickler | zigo: did you check the release jobs as clarkb suggested earlier? | 09:24 |
frickler | pcheli: seems you need to limit the number of connections your setup uses, no idea how to help with that. also please don't ask the same question in multiple channels if possible | 09:25 |
zigo | frickler: I'm really not sure how to do this ... :/ | 09:30 |
zigo | Altogether, we have puppet-{heat,nova,octavia} with broken release notes. | 09:31 |
tkajinam | frickler clarkb zigo, hmm it's strange that the promote job succeeded without any error after https://review.opendev.org/c/openstack/puppet-nova/+/898384 was merged | 10:45 |
zigo | Ah, thanks for looking into it! :) | 10:46 |
zigo | I had the same thinking and didn't get it too... | 10:46 |
tkajinam | I subscribe to the release-job-failures list but I've not seen any failures about these puppet repos, either | 10:47 |
tkajinam | (I mean release-job-failures@lists.openstack.org) | 10:47 |
tkajinam | https://zuul.opendev.org/t/openstack/builds?job_name=publish-openstack-releasenotes-python3&project=openstack%2Fpuppet-nova&skip=0 | 10:48 |
tkajinam | it looks like we have to trigger the job to build release notes from the latest master content to pick up the index change made by 898384, but I don't clearly understand why that hasn't happened | 10:50 |
tkajinam | sorry, I have to be disconnected for a while, but I'll check the status later (or tomorrow) | 10:52 |
frickler | seems the above publish job ran at the same time as the job for the update of the 2023.2 branch, which did not have the 2023.2 reno update yet https://review.opendev.org/c/openstack/puppet-nova/+/898383 | 11:24 |
frickler | so that may have overwritten the content from the master patch. I'm not sure whether we can simply reenqueue the promote job; another - maybe safer - solution would be to commit any new update to the release notes, like just a typo or formatting fix, which should cause the whole site to be republished in the correct form | 11:26 |
fungi | yes, the problem with those release notes jobs for different branches sharing the same file tree is that changes for different branches can race one another and publish content out of sequence compared to the order in which they were built/merged | 12:03 |
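As a rough sketch of the "reenqueue the promote job" option frickler mentions above, assuming operator access to the Zuul API; the pipeline name and revision handling are assumptions that would need to be checked against the real Zuul configuration:

```sh
# Hypothetical re-enqueue of the branch-tip job; requires Zuul tenant-admin
# credentials (auth configuration omitted here). Pipeline name and revision
# handling should be verified against the actual Zuul config before use.
NEWREV=$(git ls-remote https://opendev.org/openstack/puppet-nova refs/heads/master | cut -f1)
zuul-client --zuul-url https://zuul.opendev.org enqueue-ref \
    --tenant openstack \
    --pipeline promote \
    --project openstack/puppet-nova \
    --ref refs/heads/master \
    --newrev "$NEWREV"
```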
*** d34dh0r5- is now known as d34dh0r53 | 12:20 | |
*** Guest4496 is now known as diablo_rojo | 13:09 | |
clarkb | pcheli: I would use netstat/ss/lsof to determine how many connections you've got to gerrit from the Jenkins host. If it is a high number (near 96) then you'll need to debug the Jenkins server. If it is much smaller and you are traversing NAT then you may need to identify other sources of connections. | 13:57 |
clarkb | pcheli: however, I suspect they will be from the Jenkins server because the 96 connections limit is per username iirc and not per IP. We have a separate slightly higher limit for IPs | 13:57 |
fungi | yes, also it's likely you have a bug with something not correctly closing ssh sessions | 13:58 |
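To make clarkb's suggestion concrete, a minimal sketch for counting established connections from the Jenkins host to Gerrit's SSH API (port 29418, as seen in the netstat output below); either tool works if installed:

```sh
# Count established TCP connections from this host to Gerrit's SSH API
# (port 29418); run these on the Jenkins server.
ss -tn state established '( dport = :29418 )' | tail -n +2 | wc -l
# netstat equivalent, if ss is not available:
netstat -tn | awk '$6 == "ESTABLISHED" && $5 ~ /:29418$/' | wc -l
```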
pcheli | clarkb: I've found only one connection. tcp6 0 0 xxxx:34254 199.204.45.33:29418 ESTABLISHED 9210/java | 13:58 |
pcheli | that's why I'm asking :) | 13:59 |
fungi | 96 open connections to gerrit's ssh api is unlikely to represent normal behavior | 13:59 |
fungi | pcheli: is it possible you have a firewall in front of your jenkins server that is uncleanly dropping "idle" ssh connections? if it doesn't cleanly terminate the connection by sending a tcp/rst or fin on behalf of the client, then the gerrit server will assume those old connections are still open | 14:00 |
fungi | we can manually close them, but they'll just pile back up again if the problem isn't addressed | 14:00 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/templates/gerrit.config.j2#L56 this is where the limit comes from and it is configured by user account | 14:01 |
clarkb | (just to be sure the 96 limit wasn't our IP limit) | 14:01 |
fungi | yeah, the connections per ip address limit we set with conntrack in iptables is 100, and if you hit that you'll start getting icmp port-unreachable errors rather than error messages from the api itself | 14:04 |
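For context, a connlimit-style iptables rule matching fungi's description could look roughly like the following; the exact numbers and the real rule on the production server may differ:

```sh
# Sketch of a per-source-IP connection cap as described above: once a single
# address exceeds the limit, new SYNs are rejected with icmp-port-unreachable.
iptables -A INPUT -p tcp --syn --dport 29418 \
    -m connlimit --connlimit-above 100 --connlimit-mask 32 \
    -j REJECT --reject-with icmp-port-unreachable
```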
pcheli | I've found the same issue in the mailing list, resolved by Clark Boylan by killing stale connections. May I ask you to do the same? | 14:04 |
fungi | like i said, doing that may temporarily stop the errors, but unless you know what caused you to end up with so many unclosed connections (like a poorly-configured firewall, for example) then it will start happening again at some point | 14:05 |
pcheli | fungi: I've updated gerrit trigger plugin. Hopefully, it will resolve the issue. | 14:16 |
fungi | pcheli: if it has an ssh keepalive option, or dead peer detection feature, make sure those are turned on | 14:17 |
pcheli | Hm, nothing like this. | 14:19 |
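For tooling that goes through the regular OpenSSH client (the Jenkins gerrit-trigger plugin uses its own SSH library, so this may not apply there), keepalives along the lines fungi describes would look like this; "exampleuser" is a placeholder account:

```sh
# OpenSSH client keepalives: probe every 60s and give up after 3 missed
# replies, so half-open sessions get torn down instead of lingering on
# the server side. "gerrit version" is just a harmless test command.
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
    -p 29418 exampleuser@review.opendev.org gerrit version
```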
fungi | looks like the only account with 96 established ssh sessions is a/33746 | 14:21 |
pcheli | yep, this is mine | 14:23 |
fungi | i've got a loop going telling gerrit to close all those now | 14:23 |
fungi | this will take a few minutes to complete | 14:24 |
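A hedged sketch of what such a cleanup loop could look like using Gerrit's admin SSH commands; "admin" and "stale-user" are placeholders, and the awk column used for the username is an assumption about show-connections' output layout:

```sh
# Hypothetical cleanup loop: list open SSH sessions, pick the ones owned by
# one user, and ask Gerrit to close each session ID. Requires a Gerrit
# administrator account.
GERRIT="ssh -p 29418 admin@review.opendev.org"
$GERRIT gerrit show-connections -w | awk '$4 == "stale-user" {print $1}' |
while read -r session; do
    $GERRIT gerrit close-connection "$session"
done
```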
fungi | #status log Manually closed 96 stale SSH connections to Gerrit for account 33746 | 14:25 |
opendevstatus | fungi: finished logging | 14:25 |
fungi | pcheli: there's just 1 established session for that account now | 14:25 |
pcheli | fungi: can you check again pls? | 14:27 |
pcheli | just to be sure that everything is fine | 14:27 |
fungi | pcheli: still only 1 session for that account at the moment | 14:27 |
pcheli | Great | 14:28 |
pcheli | #thanks fungi | 14:28 |
fungi | i'll check again later in the day and see if the count starts to climb | 14:28 |
opendevstatus | pcheli: Added your thanks to Thanks page (https://wiki.openstack.org/wiki/Thanks) | 14:28 |
opendevreview | Merged opendev/system-config master: Stop building python3.9 container images https://review.opendev.org/c/opendev/system-config/+/898480 | 14:52 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/898989 is ready for review, and there is a link in the comments of that change to a held test node where you can see that the conversion appears to be working | 14:52 |
clarkb | fungi: and I've marked the secondary email lookup thing in gerrit as a non-issue since the tools only use primary emails | 14:53 |
clarkb | fungi: if you are back today https://review.opendev.org/c/opendev/system-config/+/898505 might be a good one to try and get in. I've intentionally been waiting until more people are around so will defer on others' availability | 15:15 |
fungi | yep, i'm around enough today, parents are headed home but i have a repair tech coming to try to fix my washing machine | 15:20 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add OpenInfra EU mailing lists https://review.opendev.org/c/opendev/system-config/+/898846 | 15:33 |
clarkb | fungi: for the ansible 8 change do you want to review it? | 15:52 |
clarkb | fungi: there are notes about the testing done in comments there as well | 15:52 |
fungi | clarkb: yep, i just approved it | 15:53 |
fungi | hoping it will also fix infra-prod-run-cloud-launcher | 15:53 |
clarkb | cool so the thing to check is that the virtualenv updates properly (it should) | 15:53 |
clarkb | and then monitor jobs | 15:53 |
fungi | yep | 15:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Revert "Cap ruamel.yaml install for ARA" https://review.opendev.org/c/opendev/system-config/+/899283 | 16:05 |
clarkb | testing if that cap is no longer necessary after some updates were made to ruamel.yaml | 16:05 |
fungi | oh, did they roll some stuff back or fix regressions? | 16:09 |
clarkb | fungi: they replaced a sys.exit() call with an exception throw | 16:10 |
clarkb | apparently they were hard crashing things previously by exiting 1 from inside the library... | 16:11 |
fungi | ouch | 16:12 |
fungi | yeah, sys.exit() is really never appropriate in a library | 16:13 |
opendevreview | Merged opendev/system-config master: Update to Ansible 8 on bridge https://review.opendev.org/c/opendev/system-config/+/898505 | 16:25 |
clarkb | ansible==8.5.0 | 16:31 |
clarkb | I believe the upgrade of ansible in the venv worked | 16:31 |
fungi | that was fast! | 16:31 |
clarkb | fungi: the merge for the list creation will probably be the first thing that runs under ansible 8 just fyi | 16:32 |
clarkb | I can execute ansible-playbook --version successfully as well so the install seems to be good | 16:33 |
clarkb | https://zuul.opendev.org/t/openstack/build/d095cf5cd898428982a71742f30a7c74/log/bridge99.opendev.org/ansible/install-root-key.2023-10-25T16:17:50.log this log shows the ruamel thing is no longer fatal (the rest of the playbook runs rather than stopping) | 16:36 |
clarkb | and we get an ara report https://44e79568cedacd253db2-e38ecce2b4446ed6b5d96caa6af2a2c7.ssl.cf2.rackcdn.com/899283/1/check/system-config-run-base/d095cf5/bridge99.opendev.org/ara-report/ | 16:36 |
fungi | oh nice | 16:36 |
clarkb | so ara is still working. I guess that isn't a super critical piece of code? | 16:36 |
clarkb | (I think it is in the ara server path which we don't really use maybe) | 16:36 |
clarkb | so ya https://review.opendev.org/c/opendev/system-config/+/899283 should be safe to merge | 16:39 |
opendevreview | Merged opendev/system-config master: Add OpenInfra EU mailing lists https://review.opendev.org/c/opendev/system-config/+/898846 | 16:42 |
clarkb | fungi: the lists playbook is running now | 16:59 |
fungi | thanks! looks like it worked | 17:03 |
clarkb | ya I see the public list that was created | 17:04 |
clarkb | there are a number of gerrit 3.8 changes that affect theming plugins and general ui plugins. https://217.182.143.183/c/x/test-project/+/3?tab=change-view-tab-header-zuul-results-summary looks fine though | 17:47 |
clarkb | I'll do some grepping of the removed/renamed methods across the two plugins we run to see if there are any hits but I suspect all that is a non issue based on the held node's behavior | 17:48 |
clarkb | fungi: can you check my notes for 358975 in https://etherpad.opendev.org/p/gerrit-upgrade-3.8? I think this is something we don't really care about, but it's a big enough change that I want another set of eyeballs on it. I tried to summarize the behavior change as well as my interpretation of why this doesn't affect us | 18:40 |
clarkb | If we can cross that one off then the commentlinks change is the only one out of that list to take action on. I'll have to look at the other changes listed next (the non-breaking but still called-out changes) | 18:42 |
fungi | clarkb: yeah, i think it'll be fine. if anything, tooling we have that queries such things may be able to drop some error checks because now they'll get well-formed empty responses | 18:43 |
clarkb | thanks I've struck it out. Leaving just commentlinks so far as something we need to address pre upgrade | 18:46 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases https://review.opendev.org/c/opendev/system-config/+/899300 | 19:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 19:46 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars https://review.opendev.org/c/opendev/system-config/+/899305 | 19:52 |
fungi | infra-root: ^ more post-migration changes for mailman v3 | 19:53 |
fungi | not urgent, just trying to make sure they didn't fall off my plate while it's still fresh in my mind | 19:54 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 20:11 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars https://review.opendev.org/c/opendev/system-config/+/899305 | 20:16 |
clarkb | fungi: I'm not seeing any special upgrade steps between these versions of the mm3 components. Is that your read too? | 20:25 |
clarkb | basically we stop the containers, then start the containers, which will run db migrations as necessary, and that should be it? (those steps are automated too iirc) | 20:25 |
fungi | right | 20:25 |
fungi | just like last upgrade | 20:26 |
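A minimal sketch of the stop/start cycle described above, assuming a docker-compose deployment; the directory and any service names are assumptions based on the compose project name seen in the job logs, and in production this is driven by the Ansible playbooks rather than run by hand:

```sh
# Manual equivalent of the upgrade flow described above.
cd /etc/mailman-compose
docker-compose pull        # fetch the updated images
docker-compose down        # stop the running containers
docker-compose up -d       # start again; pending db migrations run on startup
docker-compose logs -f     # watch the startup/migration output
```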
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 20:47 |
clarkb | fungi: looks like the upgrade change is failing on the db check for the auth user table being present | 20:51 |
clarkb | I wonder if that table has a new name | 20:51 |
fungi | the db container log shows auth errors | 20:52 |
fungi | still digging | 20:52 |
fungi | hard to tell from the console log what the real timestamps are for when it started and stopped trying to check for that table | 20:57 |
fungi | these are suspicious: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/lists99.opendev.org/docker/mailman-compose_database_1.txt#87-116 | 20:59 |
clarkb | fungi: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/job-output.txt#17764 that is connecting as the mailman user | 21:04 |
clarkb | fungi: looking in ara it seems to be saying we never get any stdout which implies to me that the database table just doesn't exist | 21:07 |
clarkb | could be something isn't creating it because there is an error or the db table was renamed | 21:07 |
fungi | yeah, i'll probably have to hold a node and inspect the db, or add a mysqldump | 21:27 |
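On a held node, the inspection fungi mentions could look roughly like this; the container name comes from the collected docker logs above, while the database name, credentials, and table name are assumptions to verify against the deployment's settings:

```sh
# Check whether the expected table exists at all inside the db container.
docker exec -it mailman-compose_database_1 \
    mysql -u mailman -p mailman -e "SHOW TABLES LIKE 'auth_user';"
# List every table if the name guess turns out to be wrong:
docker exec -it mailman-compose_database_1 \
    mysql -u mailman -p mailman -e "SHOW TABLES;"
```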
TheJulia | o/ Regarding glean, is the testing just image builds, or do we try to boot the image with say, static network config via configuration drive? | 21:52 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 22:01 |
clarkb | TheJulia: the integration testing with nodepool and dib does a full build, boot, and ssh-into-the-node test | 22:02 |
fungi | dib-nodepool-functional-openstack-centos-9-stream-src et cetera | 22:02 |
clarkb | TheJulia: the unittests simply rely on that os detection library to mock out /etc/os-release stuff and then we check output results for the config files | 22:02 |
TheJulia | clarkb: but do those nodes operate with full static metadata, or are we just doing dhcp? I ask because at least on centos9, I've noticed the static config isn't necessarily being applied on instance boot, which has me raising my eyebrow | 22:08 |
clarkb | TheJulia: oh is the question whether or not dhcp is used or static config? I'm not sure. It could be default dhcp. We would need to look at the nodepool config for the provider | 22:11 |
clarkb | also I think openstack actually makes it difficult to not do dhcp. Which makes the fact that multiple public clouds fail at dhcp all the more surprising | 22:11 |
TheJulia | Okay, I ask because I have been working on an advanced ironic job without dhcp | 22:11 |
TheJulia | and expecting simple-init/glean to just work, and it thinks it does things, but doesn't seem to | 22:11 |
TheJulia | At least, with the instance image, which is still a bit curious. | 22:12 |
clarkb | fwiw glean does work without dhcp on our images because they all boot in rackspace | 22:12 |
TheJulia | Yeah, that is a good data point | 22:12 |
TheJulia | I know this worked in the past, but maybe something changed. Dunno. It is also weird it just works with the ramdisk I boot, but not again when I reboot | 22:13 |
TheJulia | I can see it doing what it expects, I might just have to reproduce it locally | 22:13 |
clarkb | with centos 9 you have to use network manager with glean but I thought that was automatic when using simple-init | 22:16 |
TheJulia | ... yeah, that is what I was thinking as well. | 22:17 |
TheJulia | I might be grazing upon some problematic case | 22:17 |
TheJulia | so in my stack of changes, I can see where I explicitly re-run glean to extract the configuration, and then trigger networkmanager to refresh and it does the needful, it is an instance image though that fails | 22:21 |
TheJulia | which is built very similarly | 22:21 |
TheJulia | hmmmmm | 22:21 |
TheJulia | I wonder if this is centos vs centos-minimal... | 22:23 |
TheJulia | err, that makes no sense | 22:23 |
* TheJulia will look deeper tomorrow | 22:23 | |
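A rough sketch of manually re-applying glean's config-drive network configuration on a CentOS 9 Stream node, along the lines TheJulia describes; the interface name, systemd unit name, and connection name are assumptions to verify against the installed glean version:

```sh
# Re-run glean for one interface and ask NetworkManager to pick up the result.
systemctl restart "glean@ens3.service"   # regenerate config from the config drive
nmcli connection reload                  # re-read keyfiles from disk
nmcli connection up ens3                 # activate the (re)written connection
```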
clarkb | diablo_rojo: tonyb: the ptgbot etherpad for tomorrow doesn't have any agenda. IIRC that was a session frickler was interested in but requested a meetpad location instead of zoom? | 22:59 |
clarkb | I was planning to be there but wanted to call that out to make sure everyone could attend | 23:00 |
tonyb | I think it's on Friday sometime? | 23:04 |
diablo_rojo_phone | Heh I guess i don't remember signing up for that time but okay lol. | 23:05 |
diablo_rojo_phone | Yes we can definitely do meetpad instead. | 23:05 |
diablo_rojo_phone | I am happy to meet there instead. | 23:05 |
tonyb | https://meetpad.opendev.org/oct2023-ptg-ptgbot registered for tomorrow | 23:08 |
diablo_rojo_phone | Perfect. | 23:09 |
diablo_rojo_phone | frickler: should we do an hour earlier so you don't miss tc stuff? | 23:11 |
diablo_rojo_phone | Assuming that works for you clarkb and you tonyb | 23:11 |
diablo_rojo_phone | fungi: too. | 23:12 |
clarkb | that is fine with me. But I'm not sure if frickler is attending tc things due to zoom? | 23:12 |
clarkb | I don't mind either way | 23:12 |
tonyb | I thought the TC agreed to use meetpad rather than zoom | 23:16 |
clarkb | if they did it isn't in the schedule. The previous tc sessions were on zoom not meetpad | 23:16 |
tonyb | But that's not what's in the bot | 23:16 |
tonyb | so I guess I imagined it | 23:16 |
tonyb | diablo_rojo_phone: an hour earlier would be good for me as I'd like to be in the "leaderless projects retro/discussion" | 23:17 |
fungi | an hour earlier will conflict with openstack qa rather than tc, not sure if frickler wanted to attend both | 23:20 |
clarkb | also apologies if I misremembered frickler's interest in that session. I swear that was one frickler said they would attend if it were held on meetpad though | 23:21 |
fungi | two hours earlier wouldn't conflict with either one, but might be early for folks in pdt | 23:21 |