ianw | tonyb: https://paste.opendev.org/show/bJeCVfdS9SUfEmRdWpun/ is the failure for 902008 ... looks like cacti failed trying to build the certcheck domains for mirror03.gra1.ovh.opendev.org | 05:12 |
ianw | it's worked in the next run | 05:15 |
ianw | oh i think i see it | 05:19 |
ianw | TASK [letsencrypt-request-certs : Run acme.sh driver for certificate issue] **** | 05:19 |
ianw | fatal: [mirror01.iad.rax.opendev.org]: FAILED! => { | 05:19 |
ianw | "changed": false, | 05:19 |
ianw | "rc": -13 | 05:19 |
ianw | } | 05:19 |
ianw | it looks like that failed. that aborted the letsencrypt run on mirror01, so it never got to the next step, where it populates the hostvar with the certificates to put into the certcheck list | 05:20 |
ianw | so it was looping around making that list -- mirror03.gra1.ovh.opendev.org was the one *before* this ... it asked for hostvars[mirror01.iad...]['letsencrypt_certcheck_domains'] which didn't exist because letsencrypt-request-certs bailed out before it got set | 05:21 |
ianw | why acme.sh failed -13 i don't know :) | 05:23 |
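For reference, a minimal Ansible sketch of the aggregation pattern ianw describes above. This is not the actual system-config task: the group name is hypothetical, and only the letsencrypt_certcheck_domains hostvar name comes from the log.

```yaml
# Sketch only: build the certcheck list by reading a hostvar that each host
# sets once letsencrypt-request-certs succeeds. If that role bails on one
# host (as it did on mirror01.iad), the hostvars lookup for that host is
# undefined and the loop iteration for the *next* host's aggregation fails.
- name: Aggregate certcheck domains from all letsencrypt hosts
  set_fact:
    certcheck_domains: >-
      {{ certcheck_domains | default([])
         + hostvars[item]['letsencrypt_certcheck_domains'] }}
  loop: "{{ groups['letsencrypt'] | default([]) }}"  # hypothetical group name
  # A guard like this would skip hosts where the earlier role bailed out:
  # when: hostvars[item]['letsencrypt_certcheck_domains'] is defined
```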
tonyb | ianw: thanks. that makes sense. the fact that the error looks like mirror03 rather than mirror01.iad is a little confusing. | 06:24 |
tonyb | I'll grab acme.sh tomorrow and see what exits with -13 | 06:25 |
tonyb | .... where I guess tomorrow is actually later today | 06:25 |
ykarel | frickler, fungi i have reported https://bugs.launchpad.net/neutron/+bug/2045383 to track that xena job issue | 07:50 |
opendevreview | Steve Brasier proposed openstack/diskimage-builder master: allow passing arbitrary build options in containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/902366 | 10:24 |
*** ykarel is now known as ykarel|away | 13:51 | |
fungi | ykarel|away: it should be fixed when zuul restarts tomorrow, since https://review.opendev.org/902316 has already merged | 14:51 |
clarkb | Looks like my message about that ansible thing didn't make it to irc | 16:08 |
clarkb | tonyb: ianw: https://review.opendev.org/c/opendev/system-config/+/898475 is related to the certcheck domains list building issue | 16:08 |
clarkb | I thought it was an ansible bug that may have been fixed by ansible 8 because it seemed to go away after we upgraded, but I guess not | 16:09 |
clarkb | we can resurrect that change if we think the extra debug info would still be useful | 16:09 |
clarkb | tonyb: frickler: thoughts on proceeding with https://review.opendev.org/c/opendev/system-config/+/902315/ and child to rotate the replication key? | 16:09 |
clarkb | I'm pretty sure I can be around for a gerrit restart at some point today if we want to try and land both of them (we should ensure the gitea one lands and applies first though just to avoid any errors there) | 16:11 |
fungi | same | 16:11 |
clarkb | the new/replacement/fixed gerrit 3.9.x release should be out sometime today | 16:35 |
clarkb | I'm hoping it will be based on rc6 this time around and we already ran tests against that rc and they were good | 16:35 |
clarkb | fungi: half wondering if we should proceed at least with the gitea update since the risk for that one is lower? (though technically if I got the key cleanup code wrong it could remove the key we are relying on, I suppose. This seems unlikely because it didn't remove the key yet) | 16:41 |
tonyb | clarkb: reviewing 902315 now | 16:47 |
clarkb | thanks! | 16:48 |
tonyb | clarkb: I'm happy with both of those. I agree we should land the gitea one first; pause/check and then land the gerrit one | 17:03 |
tonyb | I'm going to be in and out a bit today | 17:03 |
tonyb | clarkb: I think restoring https://review.opendev.org/c/opendev/system-config/+/898475 is a good idea. | 17:06 |
clarkb | tonyb: thanks I approved the gitea change and restored the ansible debugging change | 17:23 |
tonyb | Awesome | 17:26 |
tonyb | doing a little more digging. the -13 that ianw logged is coming from ansible NOT acme.sh. the shell module had an error | 17:27 |
tonyb | MODULE FAILURE | 17:27 |
tonyb | See stdout/stderr for the exact error | 17:27 |
tonyb | looking at logs on mirror01.iad, indeed LE never ran there | 17:28 |
tonyb | so STDOUT would go to the LE log that I've been looking at. do we lose STDERR? | 17:30 |
clarkb | tonyb: this is why I suspected an ansible bug | 17:39 |
clarkb | basically nothing ansible is doing there makes any sense. They don't log the iteration value, they don't log any outputs, it's just a -13 error | 17:40 |
clarkb | my hunch is that we aren't loading the host vars for the other hosts properly due to some weird logic bug | 17:40 |
clarkb | tonyb: you might want to look at the periodic run from overnight to see if it managed to succeed | 17:41 |
clarkb | part of the issue is we don't know what host is actually causing the error because it doesn't log failed loop iterations only successful ones (this definitely seems like an ansible bug in my opinion) | 17:41 |
tonyb | I think in this case, specifically for the building of the certcheck list, I'm pretty confident that we do know, because earlier mirror01.iad failed with -13 (which means that it didn't run driver.sh on that node, and therefore also didn't create the hostvar), which caused the ultimate failure | 17:44 |
clarkb | oh I see so this is a variant of the case I ran into I think. Unless I had simply missed that there was an earlier failure | 17:45 |
tonyb | in run-production-playbook we do 'ansible-playbook ..... > log' | 17:45 |
clarkb | in the situation I ran into I thought the only error was in the iteration itself not previously but it is entirely possible I simply missed that | 17:45 |
clarkb | still ansible should log the iteration item | 17:45 |
tonyb | where would stderr from that command go? | 17:45 |
tonyb | Yes I agree totally. I think the change you restored is a good thing regardless | 17:46 |
clarkb | it should be in the job log | 17:46 |
clarkb | https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/log/job-output.txt#139 | 17:47 |
tonyb | Okay I was hoping to see more. | 17:47 |
clarkb | there is no stderr there or at https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/console#2/1/2/bridge01.opendev.org | 17:47 |
clarkb | we may need more -v's to get that info but probably also want to redirect stderr to stdout if we do that since that may expose lots of inner workings we don't want out in the clear | 17:48 |
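A minimal sketch of the redirection being suggested, assuming the invocation in run-production-playbook follows the 'ansible-playbook ..... > log' shape tonyb quoted; the playbook and log paths here are hypothetical.

```shell
# hypothetical paths; the point is only that 2>&1 keeps stderr in the same
# (non-public) log file instead of losing it entirely
ansible-playbook -v playbooks/service-letsencrypt.yaml \
  > /var/log/ansible/service-letsencrypt.yaml.log 2>&1
```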
clarkb | tonyb: and on the host do you have an acme.sh log? I guess the -13 implies that ansible broke and didn't run the shell at all though | 17:49 |
clarkb | since acme.sh should return 0 1 or 2 iirc | 17:49 |
tonyb | Hmm okay. | 17:50 |
tonyb | Correct acme.sh didn't run at that time. | 17:50 |
tonyb | it has since and worked just fine | 17:50 |
tonyb | https://github.com/ansible/ansible/issues/57323#issuecomment-498732320 is about the file module but seems to say that the shell module got a SIGPIPE before running on mirror01.iad | 17:51 |
clarkb | https://github.com/ansible/ansible/issues/81777 | 17:54 |
clarkb | I'm betting it is this issue | 17:56 |
tonyb | Yup | 17:56 |
tonyb | So I think this is something to watch for, but there isn't anything we can do right now. | 17:58 |
clarkb | agreed. I think the one thing we might be able to do is increase the 60s control persist timeout to extend the amount of time things can be idle? That doesn't seem like a great workaround though a minute is an awfully long time | 17:59 |
corvus | do i understand correctly that this bug would be triggered if the time between tasks on a given host is 60 seconds (ie, the control persist timeout). and that this does not happen when tasks are running? | 17:59 |
clarkb | corvus: that was my reading of it. basically if you run a task and then try to run a task exactly 60 seconds later the new task execution races with connection persistence in ssh | 18:00 |
clarkb | my hunch is that statistically the delta between task executions is most likely in the 0-5 second range with a long long tail that extends out beyond there | 18:01 |
clarkb | if we set the persistence timeout to 120s instead of 60s we're probably several orders of magnitude less likely to hit this race? | 18:02 |
corvus | yeah. agreed. so the question is whether it's worth it to bump that to, say, 120 for the nested ansible running on bridge. | 18:02 |
corvus | heh, we picked the same number. :) | 18:02 |
corvus | yeah, i sort of lean towards that since it's been characterized in our environment that 60s intervals are apparently common enough to be occasionally noticed. | 18:03 |
clarkb | I'd be fine with that. I don't know that I'd go much beyond say 5 minutes because we could end up with a lot of useless ssh processes that way. but 120s is probably safe | 18:03 |
corvus | (and to be clear, i think this is an okay band-aid for nested ansible on bridge but my feelings for the similar setting in zuul are very different) | 18:04 |
tonyb | I agree. | 18:04 |
tonyb | Yeah very different for zuul itself | 18:05 |
clarkb | I think 60s might be ansible's default? I don't see where we set that explicitly today | 18:08 |
fungi | i'm going to go grab lunch real quick, but available to help with gitea/gerrit in about an hour | 18:08 |
corvus | i don't see it being set either. but zuul sets it in ansible.cfg with ssh_args = -o ControlMaster=auto -o ControlPersist=60s | 18:11 |
clarkb | https://docs.ansible.com/ansible/latest/network/getting_started/network_connection_options.html seems to be the more ansible way? | 18:23 |
clarkb | unless that is just a regular timeout unrelated to ssh persistence | 18:24 |
clarkb | ya I think that is unrelated to control persistence, just a happy google collision | 18:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s https://review.opendev.org/c/opendev/system-config/+/902436 | 18:29 |
clarkb | I believe that should be self testing | 18:29 |
clarkb | and it is easy to revert (by hand if we have to) if it goes wrong | 18:29 |
corvus | clarkb: want to put the ansible bug in a code comment? that could be helpful for unwinding it | 18:31 |
clarkb | sure one sec | 18:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s https://review.opendev.org/c/opendev/system-config/+/902436 | 18:32 |
opendevreview | Merged opendev/system-config master: Rotate the new Gitea replication key into Gitea config https://review.opendev.org/c/opendev/system-config/+/902315 | 18:55 |
clarkb | I've got a tail -F going on /var/log/ansible/service-gitea.log so I should see when that adds the second key | 18:56 |
clarkb | I'm pretty sure the key was properly added to gitea09 based on the logs | 19:17 |
clarkb | we no_log a bunch of stuff so I'm going by "ok" and "skipping" output | 19:17 |
clarkb | assuming the job succeeds I think we can proceed with approving the gerrit side change | 19:18 |
clarkb | gitea14 also looks good. not sure we need to check all of them; if those two are fine the chance that something went wrong on another and the job still passed is low | 19:25 |
clarkb | I'll let someone else approve the gerrit change if they agree and think we are ready. I'm going to figure out breakfast/lunch things nowish | 19:25 |
fungi | i'm looking now | 19:36 |
fungi | back for the remainder of the day, so we can take this as slowly as we like | 19:36 |
tonyb | clarkb: 902436 looks good. holding off +W until the gitea/Gerrit key rotation is done | 19:42 |
*** dviroel__ is now known as dviroel | 19:46 | |
fungi | deploy succeeded about 30 minutes ago (19:23:31) | 19:52 |
fungi | for 902315 i mean | 19:53 |
clarkb | and looks like the gerrit update has been approved | 19:54 |
fungi | i went ahead and approved 902169, so once it deploys successfully we'll arrange a quick gerrit restart and make sure replication is still working | 19:54 |
tonyb | I was thinking of 902169 and the associated Gerrit restart | 19:55 |
fungi | also probably worth triggering a full re-replication once we're sure it is | 19:55 |
clarkb | ++ also I think we can temporarily move id_rsa aside when we restart gerrit to ensure the new key is being used. I think that is the easiest way. If it doesn't get used we'll need to move it back and restart gerrit again but that is better now than later I suspect | 19:55 |
fungi | since we keep saying we want to do that and then i keep forgetting to actually do it | 19:55 |
clarkb | as I'm not sure there is any logging on either side that would be clearly definitive about which key is used | 19:56 |
fungi | a very good point | 19:56 |
fungi | at least if gerrit no longer has access to the old key, we'll know | 19:56 |
tonyb | yeah. we could ssh -vv but that isn't guaranteed to be the same | 19:57 |
clarkb | the issue is gerrit uses mina ssh client | 19:58 |
clarkb | and its logging is in the depths of java logging somewhere. We probably can figure out a way to get debug logs out of it but I'm not confident even that would identify the specific key | 19:58 |
fungi | i blame java | 19:58 |
fungi | could probably even blame log4j | 19:59 |
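A hedged sketch of the "move id_rsa aside" check clarkb describes; the paths are hypothetical and the restart mechanism is whatever is normally used on review02. The idea is just that if replication still works while the old key is unavailable, the new key must be the one in use.

```shell
# hypothetical paths -- adapt to where the gerrit user's old replication
# key actually lives
mv /path/to/gerrit/.ssh/id_rsa /path/to/gerrit/.ssh/id_rsa.disabled
# restart gerrit the usual way, then watch replication; if it breaks,
# restore the key and restart again
mv /path/to/gerrit/.ssh/id_rsa.disabled /path/to/gerrit/.ssh/id_rsa
```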
fungi | looks like system-config-run-review-3.8 encountered some testinfra failures | 20:55 |
fungi | the thread plugin-manager-preloader exceptions in error_log are expected, right? | 20:56 |
Clark[m] | Yes | 20:56 |
fungi | oh, looks like maybe the sshd on the test node started closing connections | 20:57 |
fungi | "Connection closed by remote host\r\nConnection closed by 104.130.74.118 port 22" | 20:57 |
fungi | https://c495624553d21df6092e-b1e6fa2ff0876057a2da9deb6ccb589b.ssl.cf5.rackcdn.com/902169/3/gate/system-config-run-review-3.8/63d072e/bridge99.opendev.org/test-results.html?sort=result | 20:57 |
fungi | nothing related jumping out at me in syslog though | 20:59 |
Clark[m] | Could it be the arp fight we've seen? | 21:00 |
fungi | possibly, though it's just connection closed during key exchange rather than a hostkey mismatch | 21:03 |
Clark[m] | Maybe recheck and see if it persists? The ssh config I wrote out should only affect the gerrit user not zuul and testinfra | 21:06 |
Clark[m] | And even then it's scoped to gitea* which aren't in play for testinfra on review02 | 21:07 |
tonyb | yeah seems unlikely the change is responsible | 21:07 |
fungi | yeah, i agree it's more likely the rogue vm problem | 21:09 |
fungi | once zuul reports on the change in a sec i'll recheck it | 21:13 |
fungi | looks like the last job is wrapping up now | 21:13 |
fungi | and rechecked now | 21:20 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gerrit 3.9 image builds https://review.opendev.org/c/opendev/system-config/+/901468 | 22:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gerrit 3.8 to 3.9 upgrade testing https://review.opendev.org/c/opendev/system-config/+/901469 | 22:02 |
clarkb | infra-root Gerrit 3.9.1 has been released to address the problems with the pulled 3.9.0 release. I expect the above changes will pass testing and they should be mergeable as is now | 22:03 |
clarkb | note that since this doesn't affect production (only our testing of future upgrades) we don't need any special monitoring either | 22:03 |
clarkb | looks like we are an hour away from the gerrit ssh config update landing | 22:55 |
clarkb | I'll still be around and can do a restart if others are | 22:55 |
tonyb | I won't be around at that point (not that I'm really around now) | 22:57 |
fungi | i'll be around for a while still, sure | 23:19 |
clarkb | fungi: it's even later where you are :) maybe we should just do it first thing next week | 23:28 |
clarkb | up to you :) | 23:29 |
clarkb | the gerrit 3.9.1 changes did pass so that's good | 23:34 |