mordred | BUT- if it starts growing its own set of dependencies - then it might want to have its own container image | 00:00 |
---|---|---|
mordred | ircbot and statusbot are using the same libs right? | 00:00 |
clarkb | no, one is limnoria, the other is python irclib or whatever it is called | 00:01 |
clarkb | there may be some overlap in other libs like ssl things | 00:01 |
ianw | mordred: i mean not really; ircbot is really a limnoria container. we'd *like* statusbot to be a limnoria plugin but it isn't | 00:01 |
mordred | ah. I mean - from my pov - that sounds like two different images. but- if it's not a problem to co-install right now, then shrug - wait until it's a problem, yeah? | 00:02 |
ianw | i mean we do have to override the entrypoint, which is a bit ugly | 00:02 |
opendevreview | Merged opendev/system-config master: Remove special x/ handling patch in gerrit https://review.opendev.org/c/opendev/system-config/+/791995 | 00:03 |
mordred | any reason to avoid making a new image? just the overhead of the dockerfiles and the jobs and whatnot? | 00:03 |
*** odyssey4me has quit IRC | 00:03 | |
ianw | yeah, it's mostly me wanting to be done with this post-haste. but i think i will just make the separate image to avoid confusion | 00:04 |
mordred | I hear you :) | 00:06 |
*** odyssey4me has joined #opendev | 00:12 | |
clarkb | ianw: fungi re 791995 we should be covering that in testing with https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_gerrit.py#L29-L33 | 00:13 |
clarkb | and https://gerrit.googlesource.com/gerrit/+/b1f4115304a3820be434a6201da57e4508862f82 is the upstream commit to fix things | 00:14 |
ianw | ++ | 00:15 |
clarkb | I had to go and check tripleo to convince myself it is probably ok | 00:15 |
ianw | i'll probably look at pulling/restarting in ~4 hours? | 00:16 |
clarkb | wfm, but I'll not be around :) I'm also happy to try and get it done tomorrow if you end up busy | 00:16 |
fungi | i may be around then, but also may not be very useful if so | 00:18 |
clarkb | ianw: I would take note of the image that is currently running so that a rollback is straightforward if necessary | 00:18 |
clarkb | I checked via our config mgmt stuff to see if we clean up images like we do with zuul and we don't appear to do that with gerrit so we should be able to easily rollback if necessary by overriding the image in the docker-compose file then reverting the change above and updating our image promotion | 00:19 |
ianw | ++ | 00:24 |
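What clarkb describes boils down to something like the following (a sketch only; the container and service names are assumptions, not taken from the log):

```shell
# Record the digest of the image the running container was started from,
# so a rollback target is known (container name assumed):
docker inspect --format '{{.Image}}' gerrit-compose_gerrit_1

# To roll back, pin that digest in the docker-compose file instead of the
# floating 3.2 tag, e.g.
#   image: docker.io/opendevorg/gerrit@sha256:<recorded digest>
# then pull and restart the service (service name assumed):
docker-compose pull gerrit
docker-compose up -d gerrit
```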
opendevreview | Ian Wienand proposed opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 00:37 |
*** ysandeep has joined #opendev | 00:41 | |
*** ysandeep is now known as ysandeep|ruck | 00:41 | |
ianw | mnaser: ^ we know f34 containerfile boots but after that ... i'm sure you'll let us know of any issues :) note that to use containerfile you'll want to make sure /var/lib/containers is mounted; see https://opendev.org/zuul/nodepool/src/branch/master/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j2 | 00:42 |
ianw | it's all extremely new so documentation and bug reports all welcome | 00:42 |
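In rough terms, the mount ianw points at looks like this; the linked nodepool docker-compose template is the real reference, and the image name and extra flags here are assumptions:

```shell
# The containerfile element drives podman inside the builder, so the builder
# container needs /var/lib/containers persisted from the host; roughly:
docker run -d --privileged \
  -v /var/lib/containers:/var/lib/containers \
  docker.io/zuul/nodepool-builder:latest
```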
opendevreview | Ian Wienand proposed opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 00:57 |
ianw | Temporary failure resolving 'debian.map.fastlydns.net' Could not connect to deb.debian.org:80 (151.101.250.132), connection timed out | 01:25 |
ianw | i wonder if there's still issues | 01:25 |
opendevreview | Merged opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 01:51 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 02:19 |
*** ysandeep|ruck is now known as ysandeep|away | 02:53 | |
*** ykarel|away has joined #opendev | 03:30 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 03:36 |
ianw | ok the current container on review is "Image": "sha256:57df55aec1eb7835bf80fa6990459ed4f3399ee57f65b07f56cabb09f1b5e455", | 03:48 |
ianw | the latest docker.io/opendevorg/gerrit:3.2 is https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671?context=explore that was pushed a few hours ago, that lines up | 03:50 |
ianw | i've pulled the latest image now | 03:54 |
ianw | reporting itself as 3.2.10-21-g6ce7d261e1-dirty | 03:58 |
ianw | opendevorg/gerrit 3.2 sha256:9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671 7a921f417d3c 4 hours ago 811MB | 03:59 |
ianw | "Image": "sha256:7a921f417d3cf7f9e1aa602e934fb22e8c0064017d3bf4f5694aafd3ed8d163c", | 04:00 |
ianw | ergo the container is running the image that has the same digest as the upstream tag. i.e. we're done :) | 04:00 |
ianw | #status restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/791995 | 04:01 |
opendevstatus | ianw: unknown command | 04:01 |
ianw | #status log restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/791995 | 04:01 |
opendevstatus | ianw: finished logging | 04:01 |
*** ysandeep|away has quit IRC | 04:18 | |
*** ysandeep|away has joined #opendev | 04:18 | |
opendevreview | Ian Wienand proposed opendev/statusbot master: Dockerfile: correct config command line https://review.opendev.org/c/opendev/statusbot/+/795473 | 04:29 |
*** odyssey4me has quit IRC | 04:31 | |
*** ysandeep|away is now known as ysandeep|ruck | 04:36 | |
opendevreview | Merged opendev/statusbot master: Dockerfile: correct config command line https://review.opendev.org/c/opendev/statusbot/+/795473 | 04:47 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 05:06 |
*** marios has joined #opendev | 05:10 | |
opendevreview | Merged openstack/project-config master: Move devstack-gate jobs list to in-tree https://review.opendev.org/c/openstack/project-config/+/795379 | 05:30 |
*** ykarel|away has quit IRC | 05:41 | |
*** ykarel|away has joined #opendev | 05:41 | |
*** ykarel|away is now known as ykarel | 05:41 | |
opendevreview | Merged opendev/system-config master: Cleanup ask.openstack.org https://review.opendev.org/c/opendev/system-config/+/795207 | 05:42 |
*** ykarel is now known as ykarel|mtg | 06:07 | |
*** whoami-rajat has joined #opendev | 06:12 | |
*** ralonsoh has joined #opendev | 06:29 | |
*** osmanlic- has joined #opendev | 06:45 | |
*** osmanlicilegi has quit IRC | 06:45 | |
*** odyssey4me has joined #opendev | 06:55 | |
*** hashar has joined #opendev | 06:58 | |
*** dklyle has quit IRC | 06:58 | |
*** osmanlic- has quit IRC | 07:01 | |
*** osmanlicilegi has joined #opendev | 07:05 | |
*** amoralej|off is now known as amoralej | 07:13 | |
*** andrewbonney has joined #opendev | 07:16 | |
*** jpena|off is now known as jpena | 07:31 | |
*** rav has joined #opendev | 07:49 | |
rav | How do i rebase master branch to another branch in my repo? | 07:50 |
*** tosky has joined #opendev | 07:53 | |
frickler | rav: what would be the use case for that? you usually rebase other branches onto master (likely before merging them into master), but rebasing master seems very unusual? also, is this a general git question or somehow specific to opendev? | 07:57 |
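In plain git terms, the usual direction frickler describes is to rebase the feature branch onto master, not the other way around; for example:

```shell
# "my-feature" is a placeholder branch name
git fetch origin
git checkout my-feature
git rebase origin/master   # replay my-feature's commits on top of master
```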
*** ysandeep|ruck is now known as ysandeep|lunch | 08:05 | |
rav | So I did a clone, then I did git checkout "stable/wallaby", then I made some changes, but when I pushed them the changes went to master instead of stable/wallaby. In general git I know how to merge to a branch, but opendev seems to not work the same way for me | 08:08 |
*** lucasagomes has joined #opendev | 08:10 | |
tosky | when you say "push", do you mean "git review"? And when you did git checkout stable/wallaby, did git really create a local stable/wallaby branch which points to the remote stable/wallaby branch? | 08:21 |
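A quick way to answer tosky's second question (assuming the remote is the usual "origin"):

```shell
# Show local branches and what they track; stable/wallaby should track
# origin/stable/wallaby rather than being a stray branch off master:
git branch -vv

# If no such local branch exists yet, create one that tracks the remote:
git checkout -b stable/wallaby origin/stable/wallaby
```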
*** ykarel|mtg is now known as ykarel | 08:30 | |
rav | Yes its git review | 08:39 |
rav | I think its creating a local branch | 08:41 |
*** ysandeep|lunch is now known as ysandeep | 08:50 | |
*** ysandeep is now known as ysandeep|ruck | 08:50 | |
*** boistordu_ex has joined #opendev | 08:55 | |
ysandeep|ruck | #opendev is this the right channel to report if some jobs are stuck on zuul? | 09:21 |
*** rpittau|afk is now known as rpittau | 09:22 | |
ysandeep|ruck | the top jobs in the tripleo queue are stuck even though they are not running | 09:24 |
ysandeep|ruck | i think i should try my luck on #zuul | 09:25 |
rav | How to delete a branch in opendev?? | 09:28 |
tobiash[m] | ysandeep|ruck: looks like something might be stuck in the queue processing there. I guess an infra-root needs to look at that. (however most are located in US timezones) | 09:29 |
ysandeep|ruck | tobiash[m]: thank you so much for sharing that, I will reping in a few hours then.. | 09:31 |
*** swest has joined #opendev | 09:32 | |
*** hjensas has joined #opendev | 09:35 | |
frickler | tobiash[m]: ysandeep|ruck: we did a gerrit restart earlier, maybe that broke something in zuul processing | 09:48 |
rav | Can someone tell me how to delete a branch in OpenDev.org? TIA | 09:51 |
*** hashar has quit IRC | 09:52 | |
tobiash[m] | ysandeep|ruck: if you need to unblock urgently you could abandon/restore the top change, but I guess it would be helpful for the analysis to leave it until someone had a chance to look at that if it's not urgent | 09:52 |
ysandeep|ruck | tobiash[m], frickler thanks! we can wait for a few hours, so that infra-root can check and fix the issue permanently. | 09:54 |
frickler | rav: likely an admin would have to do that, which branch in particular are you referring to? | 09:54 |
rav | stable/wallaby | 09:55 |
rav | I have the admin rights | 09:55 |
rav | in my own repo | 09:55 |
frickler | rav: sorry, I should have been more specific: which repo? | 09:55 |
rav | https://opendev.org/x/networking-infoblox/src/branch/master | 09:55 |
rav | frickler: this is the branch | 09:58 |
frickler | rav: hmm, o.k., I'm actually not sure what our procedure is for dealing with this type of issue in the x/ tree, so I'll have to defer to some other infra-root, should be around in a couple of hours | 10:02 |
rav | ok | 10:02 |
frickler | infra-root: checking zuul logs I found these errors, I see no direct relation to the tripleo stuckness, but likely worth looking at anyhow http://paste.openstack.org/show/806483/ | 10:06 |
frickler | these look even more suspicious: Exception while removing nodeset from build <Build 60b0ee70b6f34543a85719b673929635 of n | 10:07 |
frickler | ova-tox-functional-py38 ... No job nodeset for nova-tox-functional-py38 | 10:08 |
tobiash[m] | frickler: can you paste the stack trace? | 10:08 |
frickler | tobiash[m]: http://paste.openstack.org/show/806484/ I hope I selected the proper context | 10:11 |
frickler | to me it looks like zuul is in some weird broken state in general | 10:11 |
tobiash[m] | do you see some recurring exceptions in the live log of the scheduler? | 10:12 |
frickler | I've also earlier wondered why https://review.opendev.org/c/openstack/devstack-gate/+/795383 doesn't go into gate | 10:12 |
frickler | tobiash[m]: that seems to repeat every minute or so for multiple jobs, yes | 10:12 |
tobiash[m] | hrm, that exception is cancel related, are there other recurring exceptions as well other than the one in removeJobNodeSet? | 10:13 |
tobiash[m] | maybe something mentioning the run handler | 10:14 |
frickler | tobiash[m]: well there are those about semaphores I posted before that http://paste.openstack.org/show/806483/ | 10:14 |
tobiash[m] | is there anything suspicious when grepping the log for 4ef4a09fbb50478f8c7f6bfee2fb3926 (which is the event id of the stuck item in tripleo) | 10:17 |
frickler | tobiash[m]: not in today's log, checking yesterday now | 10:20 |
*** ysandeep|ruck is now known as ysandeep|afk | 10:33 | |
frickler | so the semaphore errors have been present for some time and thus are likely unrelated. found nothing else yet, will check back later | 10:41 |
*** ysandeep|afk is now known as ysandeep|ruck | 11:06 | |
*** jpena is now known as jpena|lunch | 11:44 | |
fungi | rav: make sure the .gitreview file on that branch mentions the correct branch as its defaultbranch, or alternatively use the first command-line argument with git review to tell it what branch you're proposing for | 12:03 |
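Concretely, either of the two options fungi mentions looks roughly like this (the .gitreview keys other than defaultbranch are elided):

```shell
# Option 1: .gitreview on the stable/wallaby branch should carry something like
#   [gerrit]
#   ...
#   defaultbranch=stable/wallaby
#
# Option 2: name the target branch explicitly when proposing the change:
git review stable/wallaby
```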
fungi | rav: branch deletion is a permission which can be delegated in the project acl, but generally yes what frickler said, you want to be careful with branch deletions, and should probably tag them before deleting if they had anything useful on them | 12:04 |
fungi | ysandeep|ruck: frickler: tobiash[m]: the recent common cause i've seen for stuck items in pipelines is that one of the nodepool launchers has managed to not unlock the node request after thinking it's either satisfied or declined it | 12:05 |
fungi | if you can find which launcher last tried to service those node requests, then restarting its container frees up the locks | 12:06 |
*** ysandeep|ruck is now known as ysandeep|mtg | 12:07 | |
fungi | i have yet to be able to identify the cause of that behavior in nodepool, but it seems to happen most when there are lots of boot failures, so likely some sort of race around that | 12:07 |
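A sketch of that recovery procedure; the log path, compose directory, and service name are assumptions about the usual opendev launcher deployment, not taken from the log:

```shell
# On each launcher, check whether it last handled the stuck node request
# (substitute the real request id from the scheduler log for the placeholder):
grep '<node-request-id>' /var/log/nodepool/launcher-debug.log

# On the launcher holding the stale lock, bounce its container so the
# zookeeper lock is released:
cd /etc/nodepool-launcher   # wherever the compose file lives (assumption)
docker-compose restart nodepool-launcher
```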
fungi | though this case looks different, i see the stuck builds aren't waiting for nodes | 12:18 |
*** amoralej is now known as amoralej|lunch | 12:19 | |
fungi | the "Exception: No job nodeset for" occurrences, while vast in number, are likely unrelated since they stretch back at least several days in the scheduler's logs at similar levels... 338913 so far today, 530287 last thursday | 12:26 |
*** artom has quit IRC | 12:28 | |
fungi | whatever's happening seems to stall after result processing... looking at the in progress tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates build for 795302,3 currently at the top of the check pipeline for the openstack tenant, this is the last thing the scheduler logged about it: | 12:30 |
fungi | 2021-06-09 01:10:26,804 DEBUG zuul.Scheduler: Processing result event <zuul.model.BuildStatusEvent object at 0x7f267ce9f940> for build fa45a5d15d404267b912be4bc14b3734 | 12:30 |
fungi | so that build seems to have completed over 11 hours ago but hasn't been fully marked as such and so the buildset hasn't reported | 12:31 |
fungi | interestingly, a different build (7fbf525fe42b4f11849933cb606212a3) for that same nodeset completed a minute later and seems to have gone on to register a "Build complete, result ABORTED" in the log | 12:37 |
fungi | er, for that same buildset i mean | 12:37 |
fungi | tobiash[m]: as for your speculation in #zuul about lost result events, this does seem like it could be a case of it | 12:38 |
*** weshay|ruck has joined #opendev | 12:44 | |
*** dviroel has joined #opendev | 12:47 | |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 12:48 |
*** jpena|lunch is now known as jpena | 12:49 | |
*** dviroel is now known as dviroel|brb | 13:02 | |
*** arxcruz|rover is now known as arxcruz | 13:03 | |
*** amoralej|lunch is now known as amoralej | 13:07 | |
*** ykarel has quit IRC | 13:19 | |
fungi | i could try to manually dequeue the affected items, or worst case perform a scheduler restart, but would prefer to wait for corvus and clarkb to be around in case the running state provides us some clues as to the cause which we could otherwise lose | 13:23 |
fungi | also hoping tobiash[m] has input, since he seemed to have suspicions as to the cause already (the affected build was seen as generating a result event but the scheduler never logged processing it, and the results queue is showing 0, suggesting the result was indeed lost somewhere along the way) | 13:25 |
*** rav has quit IRC | 13:26 | |
tobiash[m] | fungi: dequeue should be sufficient | 13:26 |
fungi | er, rather, the scheduler logged that it was processing it, but never continued to log what the result actually was | 13:26 |
tobiash[m] | fungi: afaik the result events now go through zk so maybe there was a zk session loss during that time? | 13:27 |
fungi | ahh, good point, i'll check the logs for the zk cluster members | 13:27 |
tobiash[m] | fungi: I'd start with grepping the scheduler logs for 'ZooKeeper connection' | 13:30 |
fungi | interestingly we seem to mount /var/zookeeper/logs as /logs in the zk containers, but the directory is completely empty | 13:30 |
fungi | tobiash[m]: yeah, not finding any signs of a zk connection problem in the scheduler debug log, at least | 13:33 |
tobiash[m] | ok, that's good, so there is maybe some edge case during result event processing | 13:33 |
fungi | i'm going to need to step away soon for a bit to take care of some errands, but hopefully shouldn't be gone more than an hour-ish | 13:35 |
*** ysandeep|mtg is now known as ysandeep | 13:43 | |
corvus | o/ | 13:43 |
fungi | looks like zk is just logging to stdout/stderr, so `docker-compose logs` has it, though quite verbose | 13:43 |
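The two log checks being discussed amount to roughly this (paths and the compose service name are assumptions about the usual deployment):

```shell
# On the zuul scheduler: look for session/connection state changes around
# the incident window (log path assumed):
grep 'ZooKeeper connection' /var/log/zuul/debug.log

# On a zk cluster member: the server logs only to stdout/stderr, so read it
# through compose and filter for disconnects and exceptions:
docker-compose logs --timestamps zk | grep -iE 'exception|connection reset'
```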
fungi | mornin' corvus! | 13:44 |
fungi | we have what looks like a lost result for build fa45a5d15d404267b912be4bc14b3734 and possibly others | 13:44 |
fungi | trying to work out what could have precipitated it losing track of that | 13:45 |
fungi | it's basically stuck in a permanent running state because the scheduler didn't evaluate the result reported back in gearman | 13:46 |
fungi | looking in the zk cluster member logs, there was some sort of connection event around 01:28 which all of them logged, though that was 18 minutes after the build returned a result. i guess if the scheduler had a backlog in the results queue at that point due to a reconfiguration event or something then maybe the writes to zk were lost "somehow" ? (i'm a bit lost really) | 13:57 |
*** ykarel has joined #opendev | 14:00 | |
fungi | okay, taking a quick break now to run errands, but should return soon | 14:05 |
*** tcervi has joined #opendev | 14:13 | |
*** tosky has quit IRC | 14:16 | |
*** tosky has joined #opendev | 14:18 | |
*** tcervi has quit IRC | 14:42 | |
ysandeep | #opendev are you still investigating the stuck tripleo gate? tobiash[m] earlier mentioned that if we abandon/restore the top patch in the gate queue it would clear the queue. Should we go that route? | 14:52 |
ysandeep | weshay|ruck, fyi.. ^^ | 14:53 |
weshay|ruck | happy to abandon all the patches in gate if that will help infra | 14:53 |
tobiash[m] | you'd just need to abandon/restore the top item, that should cause a gate reset and re-run all the other items | 14:54 |
*** hashar has joined #opendev | 14:57 | |
weshay|ruck | tobiash[m], ysandeep just confirming it was just the top patch that needs a reset? https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794634/ | 15:00 |
weshay|ruck | just FYI.. if I had known that earlier.. I would have done that happily :) | 15:00 |
*** dviroel|brb is now known as dviroel | 15:01 | |
ysandeep | weshay|ruck, tobiash[m] suggested we wait for infra-root to debug first.. let's see what infra-root says | 15:01 |
tobiash[m] | weshay|ruck: yes, that worked, feel free to restore it again | 15:02 |
ysandeep | tobiash[m], thanks! | 15:02 |
frickler | corvus: ^^ since IIUC clarkb is familying, it would be up to you to decide whether you want to debug further or let folks try to get their queue unstuck | 15:03 |
*** dklyle has joined #opendev | 15:04 | |
clarkb | I'm sort of here but ya slow start today | 15:05 |
clarkb | looks like they may have gone ahead and done the abandon and restore? the top of the tripleo queue appears to be moving at least | 15:10 |
ysandeep | clarkb, yes weshay|ruck abandoned/restored the top one .. the gate queue was getting really long | 15:14 |
fungi | ysandeep: weshay|ruck: sorry, i had to take a break from investigating to run some errands, and was hoping we could keep it in that state to make it easier for others to possibly identify the cause so we can hopefully keep it from happening again in the future | 15:15 |
weshay|ruck | ah k | 15:15 |
fungi | but maybe the logs we have will still lend a clue | 15:16 |
weshay|ruck | roger that.. hope I didn't mess you up | 15:16 |
weshay|ruck | that was not clear to me | 15:16 |
fungi | it's fine, wouldn't expect you to read dozens of lines of scrollback | 15:16 |
weshay|ruck | ya.. my friggin bouncer is down :( | 15:16 |
weshay|ruck | hardware :( | 15:16 |
weshay|ruck | fungi, so if it gets stuck again.. do you want us to hold it there? | 15:17 |
weshay|ruck | for investigation? | 15:17 |
fungi | weshay|ruck: at least give us a heads up again before resetting it, yeah | 15:18 |
weshay|ruck | k.. I'll make sure we're communicating that.. thanks for the help / support :) | 15:19 |
fungi | yw | 15:21 |
fungi | and thanks for bringing it to our attention earlier | 15:21 |
clarkb | fungi: does the zk blip line up with the gerrit 500s reported in -infra time wise? | 15:22 |
clarkb | I wonder if there was a disturbance in the force | 15:22 |
fungi | it's possible there are more examples still stuck, i'll check for them once i get done eating | 15:22 |
clarkb | a network flap in that region could explain a number of these issues people are reporting possibly | 15:22 |
clarkb | and even if the times don't line up maybe there was some ongoing or rolling set of network updates | 15:23 |
frickler | clarkb: fungi: another example to look at may be https://review.opendev.org/c/openstack/devstack-gate/+/795383 which doesn't start gating even after I rebased and re-workflowed it. different error at first, but possibly related | 15:25 |
fungi | clarkb: no, the lost result came into the results queue at 01:10 utc, the zk logs i was looking at were from 01:28 utc, the reports of gerrit and gitea problems were around 10:10-10:15 utc | 15:25 |
frickler | and with that I'm mostly off for today | 15:25 |
fungi | thanks frickler! | 15:25 |
fungi | and yeah, i started to look at that d-g change, the last thing zuul logged was that it could not merge | 15:25 |
fungi | so possibly unrelated | 15:25 |
clarkb | fungi: also I think the zk connection status goes into syslog if you need an easy way to check that in the future | 15:26 |
clarkb | s/status/state changes/ | 15:27 |
fungi | it didn't seem to be in syslog that i could spot at least | 15:27 |
fungi | but docker-compose was capturing it all | 15:27 |
clarkb | the +A on the d-g change is not from when tripleo had a sad either, looks like | 15:28 |
clarkb | fungi: frickler I reapproved 795383 and it is enqueued now | 15:38 |
*** ysandeep is now known as ysandeep|away | 15:50 | |
opendevreview | Hervé Beraud proposed openstack/project-config master: Temporarly reverting project template to finalize train-em https://review.opendev.org/c/openstack/project-config/+/795583 | 15:57 |
clarkb | fungi: frickler: is it possible zuul also lost connectivity to the gerrit event stream and missed that approval event? | 16:01 |
tobiash[m] | fungi, clarkb, corvus : this is another (still) live item that looks stuck as well and doesn't block a gate pipeline: https://zuul.opendev.org/t/openstack/status/change/795302,3 | 16:01 |
tobiash[m] | so if you need to live debug I guess this can be used as well | 16:01 |
tobiash[m] | same goes for the first few items in check | 16:01 |
clarkb | tobiash[m]: ya looks like everything that is older than ~14 hours in check? | 16:02 |
tobiash[m] | so to me it looks like there was some kind of event 14-16h ago that led to a handful of stuck items | 16:02 |
clarkb | tobiash[m]: we have evidence of network issues between gerrit and its database at a different time (about 6 hours ago) | 16:02 |
tobiash[m] | clarkb: yes | 16:02 |
opendevreview | Hervé Beraud proposed openstack/project-config master: Dropping revert https://review.opendev.org/c/openstack/project-config/+/795586 | 16:03 |
clarkb | I'm starting to think that there may have been widespread/rolling network problems and this is all fallout from that | 16:03 |
tobiash[m] | maybe fallout from the fastly cdn issue | 16:03 |
clarkb | apparently zoom is having trouble too | 16:04 |
tobiash[m] | at least network issues would be the best case since that would mean that we didn't introduce a regression in event handling | 16:04 |
clarkb | I'm also seeing incredibly slow downloads from my suse mirror for system updates | 16:06 |
corvus | i'm back from breakfast and will continue looking into this now | 16:06 |
clarkb | corvus: thanks, note the changes in openstack's check pipeline if you want to see some in action | 16:06 |
*** ysandeep|away has quit IRC | 16:06 | |
*** hjensas is now known as hjensas|afk | 16:06 | |
clarkb | I'll hold off on trying to reset those until you give a go ahead | 16:06 |
clarkb | (the ones older than 14 hours are likely to be hit by this) | 16:06 |
corvus | yes, thanks; especially when something is "stuck" it's helpful to have it stay stuck to try to figure out why | 16:07 |
*** ykarel has quit IRC | 16:07 | |
corvus | i'm going to ignore the previous gate issue for now and focus on 795302 instead | 16:07 |
fungi | the current zk logs on zk04 logged a rather lengthy java.io.IOException backtrace at 2021-06-09 01:28:33,550 | 16:11 |
fungi | seems to be a "Connection reset by peer" event | 16:11 |
fungi | the uptime of all 3 zk servers is a couple months though, so none of them rebooted | 16:11 |
clarkb | fungi: zk will reset connections if there is an internal ping timeout | 16:12 |
clarkb | and those internal ping timeouts could be caused by network instability | 16:13 |
fungi | similar one on zk05 at 01:28:31,083 | 16:13 |
fungi | (though it also has a prior one in its log from 2021-06-04 13:33:34,950) | 16:13 |
*** lucasagomes has quit IRC | 16:14 | |
*** marios is now known as marios|out | 16:15 | |
fungi | none in the logs on zk06 however | 16:15 |
fungi | so yes, more anecdotal evidence of "network instability" but no smoking gun | 16:15 |
*** rpittau is now known as rpittau|afk | 16:16 | |
clarkb | fungi: if you look at the client side (a lot more work since we have a few) we might see evidence that those talking to 04 and 05 disconnected but those connected to 06 were ok? | 16:17 |
fungi | got it, so this could be client disconnects not intra-cluster | 16:20 |
corvus | looking at the zk watches graph, i don't see any shift happening around 13:30 | 16:21 |
corvus | if there were client disconnects i would expect that graph to rebalance | 16:21 |
corvus | oh i'm looking at the wrong time; 13:30 is from a long time ago | 16:22 |
corvus | 1:30 is the relevant time | 16:22 |
fungi | yep, seems it could be clients... immediately after the disconnects i see nb01.opendev.org and nb03.opendev.org | 16:23 |
*** marios|out has quit IRC | 16:23 | |
fungi | reauthenticating | 16:23 |
fungi | also nb02.opendev.org | 16:24 |
fungi | so all the builders reconnected within seconds after the 3 disconnects. before the disruption one was connected to zk04 and two to zk05, but on reconnecting one of them moved from 05 to 06 | 16:25 |
fungi | at least that's what i infer from the zk logs anyway | 16:25 |
fungi | nice that we have recognizable cn fields in the client auth certs | 16:26 |
corvus | we haven't seen any indication the scheduler had a problem connecting with zk though, right? | 16:27 |
fungi | i found none, no | 16:28 |
fungi | nothing in the scheduler logs to indicate it, and i found only 3 connection reset exceptions in the zk server logs all of which are accounted for by the nb servers reconnecting | 16:29 |
clarkb | the build result events are recorded by executors now though right? | 16:29 |
clarkb | maybe the executor for those jobs failed to write to zk properly and the scheduler never saw things update? | 16:29 |
corvus | yeah, i'm heading over to the executor to look now | 16:30 |
corvus | fa45a5d15d404267b912be4bc14b3734 on ze02 | 16:31 |
corvus | 16:32:02 up 14:26, 1 user, load average: 5.25, 4.70, 4.88 | 16:32 |
corvus | that's suspicious | 16:32 |
corvus | possible executor hard reboot around the time | 16:32 |
corvus | we may be looking at a case of an improperly closed tcp connection, so gearman didn't see the disconnect? | 16:33 |
tobiash[m] | I thought I had implemented a two way keep alive years ago | 16:34 |
corvus | 2021-06-09 01:28:42,748 is the last log entry on ze02 before: | 16:35 |
corvus | 2021-06-09 02:06:17,890 DEBUG zuul.Executor: Configured logging: 4.4.1.dev9 | 16:35 |
corvus | (which is a startup log message) | 16:35 |
corvus | also there's some binary garbage between the two | 16:35 |
corvus | tobiash: me too | 16:36 |
tobiash[m] | did it reboot or got stuck? | 16:37 |
clarkb | the uptime indicates a reboot | 16:37 |
clarkb | 16:32:02 up 14:26 | 16:37 |
tobiash[m] | I think I've occasionally seen stuck executors that kept the gearman connection but didn't do anything, but that was long ago | 16:37 |
*** jpena is now known as jpena|off | 16:38 | |
corvus | definitely a reboot; and no syslogs preceding it, so i'm assuming a forced reboot from the cloud | 16:38 |
corvus | like crash+boot | 16:38 |
corvus | i don't see any extra TCP connections between the scheduler and ze02 | 16:39 |
corvus | the gearman server log is empty | 16:40 |
corvus | geard should have detected that and sent a work_fail packet to the scheduler | 16:45 |
corvus | i don't know why it didn't, but the fact that we have zero geard logs will make that very hard to debug | 16:45 |
clarkb | corvus: did the geard logs possibly end up in the debug log? | 16:46 |
clarkb | corvus: fungi: I finally managed to check up on emails and both ze02 and the gerrit mysql had host problems | 16:48 |
clarkb | I guess that means not widespread networking issues but hosting problems. Good to have that info to point at though showing it didn't happen in a vacuum | 16:49 |
fungi | clarkb: ooh, good thinking, i hadn't checked that inbox yet today | 16:49 |
fungi | but in the "widespread network disruption bucket" this failure just got mentioned in #openstack-release https://zuul.opendev.org/t/openstack/build/8f1a900189f547c688259da9fcafa712 | 16:49 |
corvus | clarkb: i don't see any geard logs there, but i did find that the loop that's supposed to detect and clean up this condition is broken: | 16:49 |
corvus | http://paste.openstack.org/show/806500/ | 16:50 |
corvus | that is repeated X times in the log... I'm guessing X == the number of stuck jobs right now | 16:50 |
corvus | 11 times in the logs | 16:50 |
corvus | i count 9 stuck jobs | 16:51 |
fungi | seems like about the right order of magnitude, yeah | 16:51 |
fungi | one we know was reset by abandon/restore | 16:52 |
fungi | maybe the other got a new patchset in the meantime | 16:52 |
corvus | good point; i don't know if the abandoned job is still in that list; could be | 16:52 |
*** amoralej is now known as amoralej|off | 16:54 | |
corvus | oh i see | 16:55 |
corvus | we're right between two v5 ZK changes: we have result events in zk, but build requests in gearman | 16:55 |
corvus | a client disconnect is a gearman result event | 16:56 |
corvus | and they're effectively ignored | 16:56 |
fungi | oh neat! | 16:56 |
corvus | so it's very likely that geard did send the WORK_FAIL packet; and the scheduler ignored it since "real" results come from zk now | 16:56 |
corvus | tobiash: ^ | 16:57 |
fungi | yeah, that seems like a pretty good explanation of exactly what we saw | 16:57 |
corvus | the cleanup in http://paste.openstack.org/show/806500/ should have caught it even so, but it has a fatal flaw | 16:59 |
corvus | let me see if i can patch that real quick | 16:59 |
opendevreview | Michael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot https://review.opendev.org/c/opendev/system-config/+/795596 | 16:59 |
corvus | patch in #zuul | 17:03 |
fungi | thanks! | 17:04 |
*** artom has joined #opendev | 17:04 | |
opendevreview | Michael Johnson proposed openstack/project-config master: Removing openstack-state-management from gerritbot https://review.opendev.org/c/openstack/project-config/+/795599 | 17:04 |
*** andrewbonney has quit IRC | 17:05 | |
johnsom | Hi opendev neighbors. I have posted the follow up patches to retire the openstack-state-management channel in favor of using openstack-oslo as discussed on the discuss mailing list. I don't have op privilege on the openstack-state-management channel to update the topic. | 17:06 |
johnsom | Can someone update the topic to something similar to "The channel is retired. Please join us in #openstack-oslo" ? | 17:07 |
fungi | we can take care of it, sure. i'll do that in a moment | 17:07 |
johnsom | Thank you! | 17:07 |
fungi | johnsom: can you also take a look at https://docs.opendev.org/opendev/system-config/latest/irc.html#renaming-an-irc-channel and comment it out in the accessbot config? | 17:09 |
fungi | johnsom: i've adjusted the topic now | 17:09 |
johnsom | fungi Ok, I wasn't sure on that part. Updating... | 17:10 |
fungi | it's new, capabilities on oftc differ from freenode so we had to adjust our channel renaming process, and decided to start aging out our channel registrations | 17:10 |
opendevreview | Michael Johnson proposed openstack/project-config master: Removing openstack-state-management from the bots https://review.opendev.org/c/openstack/project-config/+/795599 | 17:11 |
johnsom | Updated | 17:11 |
fungi | appreciated! | 17:12 |
opendevreview | Mohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors https://review.opendev.org/c/opendev/system-config/+/795602 | 17:27 |
clarkb | mnaser: question on ^ | 17:29 |
mnaser | yes, yes we do clarkb :) | 17:29 |
opendevreview | Mohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors https://review.opendev.org/c/opendev/system-config/+/795602 | 17:30 |
mnaser | i will update nodepool to build f34 | 17:30 |
mnaser | and then move f32 to f34 | 17:31 |
*** ralonsoh has quit IRC | 17:34 | |
opendevreview | Mohammed Naser proposed openstack/project-config master: Build images for Fedora 34 https://review.opendev.org/c/openstack/project-config/+/795604 | 17:35 |
*** amoralej|off has quit IRC | 17:42 | |
*** whoami-rajat has quit IRC | 18:13 | |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: Switch jobs to use fedora-34 nodes https://review.opendev.org/c/zuul/zuul-jobs/+/795636 | 18:15 |
opendevreview | Mohammed Naser proposed opendev/base-jobs master: Switch fedora-latest to use fedora-34 https://review.opendev.org/c/opendev/base-jobs/+/795639 | 18:18 |
opendevreview | Mohammed Naser proposed openstack/project-config master: Stop launch fedora-32 nodes nodepool https://review.opendev.org/c/openstack/project-config/+/795643 | 18:30 |
opendevreview | Mohammed Naser proposed openstack/project-config master: Remove fedora-32 disk image config https://review.opendev.org/c/openstack/project-config/+/795644 | 18:30 |
mnaser | infra-core: i just pushed up everything needed to get rid of f32 | 18:30 |
mnaser | might need some rechecks here and there because of the dependencies | 18:30 |
mnaser | it's all covered here - https://review.opendev.org/q/hashtag:%22fedora-34%22+(status:open%20OR%20status:merged) | 18:31 |
clarkb | mnaser: thanks, I'll try to review them | 18:32 |
*** amoralej|off has joined #opendev | 18:54 | |
*** timburke_ is now known as timburke | 18:58 | |
*** amoralej|off has quit IRC | 19:11 | |
*** hashar has quit IRC | 19:39 | |
*** amoralej|off has joined #opendev | 19:50 | |
*** dviroel is now known as dviroel|brb | 20:23 | |
*** amoralej|off has quit IRC | 20:53 | |
*** dviroel|brb is now known as dviroel | 21:09 | |
ianw | mnaser: thanks! | 22:11 |
ianw | one thing is that i think we'll need to update our builders to mount /var/lib/containers | 22:11 |
ianw | to use the containerfile element | 22:11 |
clarkb | ianw: do we also need to mount in cgroup stuff from sysfs? | 22:12 |
clarkb | or is that automagic since the host container depends on it? | 22:12 |
ianw | yeah i've not had to do that in local testing, and we don't do that in the gate test and it works | 22:13 |
ianw | i should probably qualify that with "for now" :) | 22:13 |
mnaser | ianw, clarkb: i've not had to do any cgroup stuff either and once i mounted /var/lib/containers -- i got a clean build that boots | 22:13 |
ianw | fungi/clarkb: not sure if you saw https://review.opendev.org/c/opendev/system-config/+/795213 but i went ahead and made a statusbot container and deploy it with that. it's really just a mechanical patch now deploying the config file | 22:15 |
clarkb | ianw: I haven't yet | 22:16 |
ianw | if you don't have any objections, i can maybe try moving meetbot to limnoria again this afternoon when it's quiet | 22:17 |
clarkb | no objections from me. fungi ^ anything to consider when doing that? maybe double checking the meeting schedule to ensure it doesn't overlap with a meeting? | 22:18 |
ianw | beforehand i'll sort out syncs of the logs | 22:18 |
ianw | from e.openstack.org -> e.opendev.org | 22:18 |
fungi | ianw: go for it. sorry i'm not much help, internet outage here, i'm limping along on a wireless modem for the moment | 22:19 |
ianw | no worries :) i'm surprised i still have power/internet, it's crazy here | 22:21 |
ianw | https://www.abc.net.au/news/2021-06-10/wild-weather-batters-victoria/100203532 | 22:22 |
fungi | yikes | 22:22 |
fungi | stay safe! | 22:22 |
ianw | settling down now but last night you could have thought you were on the Pequod at times :) | 22:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: nodepool-builder: add volume for /var/lib/containers https://review.opendev.org/c/opendev/system-config/+/795707 | 22:41 |
ianw | mnaser / infra-root: ^ that mirrors what we do in nodepool gate for production | 22:41 |
*** tosky has quit IRC | 22:47 | |
ianw | -09 22:42:25.839331 | LOOP [build-docker-image : Copy sibling source directories] | 22:50 |
ianw | 2021-06-09 22:42:26.548311 | ubuntu-focal | cp: cannot stat 'opendev.org/opendev/meetbot': No such file or directory | 22:50 |
ianw | i wonder why the ircbot build would fail in the gate like that :/ | 22:50 |
ianw | https://zuul.opendev.org/t/openstack/build/1529ed5e1b0242e39eabc2bc3c86b79a/log/job-output.txt#279 is the prepare-workspace from the check job | 22:55 |
ianw | the gate job only cloned system-config | 22:55 |
clarkb | missing required project? | 22:55 |
ianw | maybe but why would it work in check? | 22:56 |
ianw | https://1aa2de7fd6bc4a6a901d-a3c1233a0305e644b60ccc0279f1954b.ssl.cf1.rackcdn.com/793704/24/gate/system-config-upload-image-ircbot/ec9c181/job-output.txt is the failed job | 22:56 |
ianw | ohhh, i guess it's the "upload-image" job ... not "build-image" | 22:57 |
ianw | they should be sharing required jobs via yaml tags, but something must have gone wrong | 22:57 |
ianw | required projects i mean | 22:57 |
*** whoami-rajat has joined #opendev | 22:58 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Create ircbot container https://review.opendev.org/c/opendev/system-config/+/793704 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: limnoria/meetbot setup on eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/793719 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Move meetbot config to eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795007 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Cleanup eavesdrop puppet references https://review.opendev.org/c/opendev/system-config/+/795014 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 23:04 |
ianw | https://zuul.opendev.org/t/openstack/build/6804e9a6b18d44cd947f03280b9921be failed with POST_FAILURE -- WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! | 23:30 |
clarkb | we've seen that if there is an arp fight for an IP | 23:31 |
clarkb | usually it will get bad for a day or two and then go away as the cloud notices and cleans those up ? | 23:32 |
clarkb | er :/ | 23:32 |
ianw | rax-iad in that one | 23:32 |