*** ysandeep|away is now known as ysandeep | 05:45 | |
*** ykarel_ is now known as ykarel|afk | 06:54 | |
*** jpena|off is now known as jpena | 07:27 | |
lourot | o/ I think I need another +2 on https://review.opendev.org/c/openstack/project-config/+/808190 - thanks! | 07:41 |
*** rpittau|afk is now known as rpittau | 07:55 | |
*** ykarel|afk is now known as ykarel | 07:58 | |
*** ykarel_ is now known as ykarel | 08:09 | |
*** ysandeep is now known as ysandeep|lunch | 08:53 | |
*** ykarel_ is now known as ykarel|lunch | 10:10 | |
*** ysandeep|lunch is now known as ysandeep | 10:35 | |
*** ykarel|lunch is now known as ykarel | 11:12 | |
*** jpena is now known as jpena|lunch | 11:29 | |
*** dviroel|out is now known as dviroel | 11:36 | |
*** ysandeep is now known as ysandeep|afk | 11:37 | |
*** jpena|lunch is now known as jpena | 12:22 | |
lourot | thanks fungi :) | 12:38 |
fungi | yw | 12:39 |
fungi | sorry we missed that in the prior review | 12:39 |
opendevreview | Merged openstack/project-config master: Fix missing charm-interface-ironic-api https://review.opendev.org/c/openstack/project-config/+/808190 | 12:47 |
*** ysandeep|afk is now known as ysandeep | 12:47 | |
opendevreview | Yuriy Shyyan proposed openstack/project-config master: Scaling down proposed instance count. https://review.opendev.org/c/openstack/project-config/+/810009 | 13:03 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin https://review.opendev.org/c/opendev/system-config/+/810013 | 13:26 |
opendevreview | Merged openstack/project-config master: Scaling down proposed instance count. https://review.opendev.org/c/openstack/project-config/+/810009 | 13:33 |
opendevreview | daniel.pawlik proposed opendev/puppet-log_processor master: Add capability with python3; add log request cert verify https://review.opendev.org/c/opendev/puppet-log_processor/+/809424 | 13:43 |
opendevreview | daniel.pawlik proposed opendev/base-jobs master: Submit a log processing job to the RDO infra logstash workers https://review.opendev.org/c/opendev/base-jobs/+/809765 | 14:00 |
*** ykarel is now known as ykarel|away | 14:39 | |
ricolin | got a 502 error when trying to generate content on https://paste.opendev.org/ | 15:13 |
fungi | i've seen that when its database connection times out... is it consistent when you retry? | 15:18 |
noonedeadpunk | seems like it is | 15:44 |
noonedeadpunk | and the db connection issue looked different | 15:45 |
noonedeadpunk | not a proxy error... but dunno | 15:45 |
clarkb | we got an email ~40 minutes ago from the cloud provider saying that the backing volume had an issue? | 15:46 |
clarkb | I suspect it is related to that | 15:46 |
clarkb | fungi: ^ fyi. | 15:46 |
clarkb | they suggest a fsck. Hopefully mariadb doesn't need more than that. /me needs breakfast but can help look at that after | 15:46 |
fungi | i'll take a look now, thanks, i hadn't seen that ticket from the provider yet | 15:50 |
fungi | yeah, i guess we added a cinder volume for /var/lib/lodgeit/mariadb ? | 15:51 |
fungi | i'll stop the containers, then umount and fsck that | 15:51 |
fungi | looks like the server got rebooted 1.75 hours ago | 15:52 |
fungi | and ~0.25 hours after that it started logging i/o errors for that volume | 15:52 |
fungi | fsck ran clean. i'll reboot the server again too for good measure, since we were still logging i/o errors for the device as recently as a few minutes ago | 15:54 |
fungi | noonedeadpunk: ricolin: is it working better now? | 15:56 |
fungi | #status log Ran filesystem checks on database volume for the paste.opendev.org service, and rebooted the server in order to attempt to clear any lingering I/O error state | 16:00 |
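For reference, the recovery sequence fungi walks through above (stop the containers, umount the filesystem, fsck it, reboot, bring the containers back) would look roughly like the sketch below. The compose directory and block device names are illustrative assumptions; only the mount point /var/lib/lodgeit/mariadb comes from the log.

```shell
# Minimal sketch of the paste recovery steps described above.
# /etc/lodgeit-docker and /dev/vdb are assumptions for illustration.
cd /etc/lodgeit-docker
docker-compose down                     # stop the lodgeit + mariadb containers
umount /var/lib/lodgeit/mariadb         # release the cinder-backed filesystem
fsck /dev/vdb                           # check the backing device (reported clean here)
reboot                                  # clear any lingering I/O error state
# after the server comes back up:
cd /etc/lodgeit-docker && docker-compose up -d
```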
opendevstatus | fungi: finished logging | 16:00 |
noonedeadpunk | works for me | 16:00 |
fungi | thanks for confirming | 16:00 |
*** rpittau is now known as rpittau|afk | 16:00 | |
fungi | it was because clouds | 16:01 |
clarkb | a random paste loads for me now too, thanks for doing the cleanup | 16:04 |
fungi | i'll get the ticket closed out too | 16:06 |
*** marios is now known as marios|out | 16:09 | |
fungi | now that lists.o.o has been running upgraded for a while, it's clear memory utilization is improved and more stable: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=219&rra_id=all | 16:11 |
clarkb | fungi: over the weekend it occurred to me that an easy way to replace the server would be to boot a new instance and have ansible configure it in test mode, so no one is notified but all the lists still get created. | 16:15 |
clarkb | Then we can just copy over the membership and archives? | 16:15 |
clarkb | However, it may make more sense to deploy the new server as mm3 and migrate into that? | 16:15 |
fungi | yes, i wanted to push up an mm3 migration spec late last week but other fires keep coming up | 16:16 |
fungi | i've got it half drafted | 16:16 |
clarkb | cool | 16:16 |
fungi | okay, so in addition to two tickets about paste volume outages, there was a hypervisor host issue with paste.o.o (that explains the short uptime when i first logged in), and also one for zm06 | 16:19 |
fungi | zm06 looks like it's got a merger running on it now, and grafana shows a full complement of mergers as well as the short dip while zm06 was offline | 16:21 |
fungi | so i think there's nothing to worry about there | 16:21 |
*** ysandeep is now known as ysandeep|dinner | 16:25 | |
clarkb | and zuul should be resilient to losing that node when the issue occurred (though a job or two may have failed?) | 16:26 |
fungi | hopefully not, but who knows | 16:26 |
clarkb | looks like the integrated openstack gate isn't doing any work | 16:28 |
clarkb | does anyone understand what might be happening there? corvus ^ fyi | 16:28 |
clarkb | I'm going to start looking in logs | 16:29 |
clarkb | I see it started ~12 builds at around 12:07-12:12 UTC | 16:32 |
clarkb | then around 12:50 UTC the builds are canceled/aborted (I assume due to a gate reset) | 16:33 |
clarkb | No new jobs are started after that point in time | 16:33 |
clarkb | corvus: what is intended to submit new build requests after the previous ones get canceled? | 16:34 |
corvus | clarkb: o/ catching up | 16:34 |
fungi | hopefully it's not related to the tripleo-ci change in check with 44 depends-on footers | 16:34 |
fungi | that's an interesting stress-test | 16:35 |
corvus | clarkb: if jobs are canceled for a gate reset, they should be immediately replaced by new jobs from the same run of the main scheduler loop | 16:35 |
clarkb | corvus: it doesn't look like that is happening based on the logs | 16:36 |
fungi | it looks similar to when mergers haven't reported back on changes; the entries don't yet expand to any jobs | 16:36 |
clarkb | It goes very quiet on that event/change after the cancels happen | 16:36 |
clarkb | fungi: or perhaps zm06 restarting tripped this up? Do you know when zm06 rebooted? | 16:36 |
clarkb | I think ~12:50 +/- a minute or two is the window we're looking at here | 16:37 |
clarkb | 8bb2e0f74b1144cca9ba41ddea58a1dc is the event id fwiw | 16:37 |
fungi | reboot system boot Mon Sep 20 13:20 | 16:37 |
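That line reads like output from the standard wtmp query; a quick sketch of the kind of check likely used here (my assumption, the exact commands are not shown in the log):

```shell
last reboot | head -1   # most recent boot record from wtmp
uptime                  # how long the server has currently been up
```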
corvus | it's plausible there's an outstanding merge job that it's waiting on | 16:37 |
clarkb | Submitting job request to ZooKeeper <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1 that one maybe? | 16:38 |
corvus | i'm going to look for the first merge job uuid submitted after the reset; let me know if you find it first | 16:38 |
corvus | ok thx :) | 16:38 |
clarkb | er no, it says it removed that request later | 16:38 |
clarkb | and I don't see any requests logged after that one | 16:39 |
fungi | the tripleo-ci change with all the deps would have been enqueued around 12:53z when it was last rechecked | 16:39 |
clarkb | I wonder why they keep rechecking that change when a bunch of the depends-on are not mergeable | 16:40 |
fungi | looks like zm06 might have died around 12:30z and came back at 13:20z | 16:41 |
fungi | trying to correlate cacti against syslog entries now | 16:42 |
fungi | yeah, the gap in cacti graphs bears that out | 16:42 |
clarkb | 578fe1e26cf84ba7a86454315c1b4eb1 claims it completed, though I'm not sure we record its status in the scheduler logs | 16:42 |
corvus | 2021-09-20 13:12:08,815 DEBUG zuul.MergerApi: [e: 8bb2e0f74b1144cca9ba41ddea58a1dc] [build: 578fe1e26cf84ba7a86454315c1b4eb1] Updating request <MergeRequest 578fe1e26cf84ba7a86454315c1b4eb1, job_type=merge, state=completed, path=/zuul/merger/requests/578fe1e26cf84ba7a86454315c1b4eb1> | 16:43 |
corvus | state=completed | 16:43 |
clarkb | `grep 8bb2e0f74b1144cca9ba41ddea58a1dc /var/log/zuul/debug.log | grep -v 'Checking for changes needed by' | grep -v 'No changes needed' | less` is the sort of thing I'm looking at it and shows the event went very idle after this stuff | 16:43 |
corvus | so i think that job is good | 16:43 |
clarkb | corvus: ya | 16:43 |
corvus | it'll be cool when we have the pipeline state in zk; will be a lot easier to inspect | 16:44 |
corvus | it looks like this was kicked off when the change ahead of 809336 failed, and 336 became the head of the queue | 16:45 |
corvus | that was the event at 12:50 that caused the jobs to cancel and spawned the 578 merge job | 16:46 |
clarkb | based on the log it appears we're stuck between having completed the merge to determine new speculative state and starting jobs? | 16:47 |
corvus | yes -- somewhat complicated by the fact that since then there has been another re-enqueue event (13:47:14,916); i don't know the details of that | 16:47 |
corvus | that appears to be a tenant reconfig event | 16:49 |
corvus | that probably should just be a noop | 16:51 |
corvus | okay, i'm going to fire up the repl and see if i can inspect the queue item state | 16:51 |
clarkb | ok | 16:51 |
clarkb | fungi: unrelated to ^ https://review.opendev.org/c/opendev/system-config/+/809483 should be safe to land as we pull and restart on new nodepool builder images when available and https://review.opendev.org/c/zuul/nodepool/+/805743 has merged | 16:53 |
corvus | <BuildSet item: <QueueItem a31bac49a9a0474a8094b4313bb307e1 for <Change 0x7f829f42e430 openstack/placement 809366,1> in gate> #builds: 0 merge state: PENDING> | 16:54 |
corvus | i'm suspecting that the merge request was deleted by the cleanup process | 16:59 |
corvus | working on confirming that now | 16:59 |
clarkb | interesting. The log message did indicate they overlapped each other | 17:00 |
corvus | yeah, we need a 'finished' log msg for the cleanup :/ | 17:00 |
corvus | clarkb: did you find where 578fe1e26cf84ba7a86454315c1b4eb1 ran? | 17:01 |
clarkb | corvus: I did not | 17:02 |
corvus | i'll look | 17:02 |
corvus | apparently nowhere | 17:04 |
corvus | sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/merger-debug.log' 'zm*' | 17:04 |
corvus | sudo ansible -f 20 -m shell -a 'grep 578fe1e26cf84ba7a86454315c1b4eb1 /var/log/zuul/executor-debug.log' 'ze*' | 17:04 |
corvus | both return nothing | 17:04 |
*** ysandeep|dinner is now known as ysandeep | 17:05 | |
clarkb | I agree I grepped the same uuid across mergers and executors and got no results | 17:05 |
clarkb | need to double check file rotation though | 17:05 |
clarkb | looks like merger's .log.1 is from yesterday UTC so wouldn't have it | 17:06 |
clarkb | same thing with executors | 17:06 |
corvus | i'm pretty sure we log the job id right as we pull it off the queue | 17:07 |
fungi | unrelated, looks like mirror.iad3.inmotion.opendev.org is down or unreachable, i see multiple jobs retried due to pre-run failures trying to update package lists | 17:08 |
fungi | can't ping it at all at the moment | 17:08 |
corvus | perhaps we've had other merger connectivity/uptime issues | 17:08 |
clarkb | fungi: is the server down? it may have crashed when we overloaded the hypervisors? If so doing a hard reboot is probably what it needs | 17:09 |
fungi | clarkb: yeah, not sure, i'm checking the nova api now | 17:09 |
fungi | to see what state it thinks the instance is in | 17:09 |
clarkb | corvus: is it possible that the cancel happened such that the completion occurred all on the scheduler side before a merger ever saw it? | 17:09 |
fungi | aha, it's in shutoff state, starting it and hopefully it will boot | 17:10 |
corvus | clarkb: i don't think we canceled 578fe1e26cf84ba7a86454315c1b4eb1 | 17:11 |
clarkb | got it | 17:11 |
fungi | #status log Started mirror02.iad3.inmotion.opendev.org via Nova API as it was in a SHUTOFF state | 17:11 |
opendevstatus | fungi: finished logging | 17:11 |
fungi | er, yes that seems to have gotten it back online again | 17:12 |
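A rough sketch of the Nova API interaction described above, using the standard OpenStack CLI; the --os-cloud name is an assumption, the server name is from the log:

```shell
# See what state nova thinks the instance is in
openstack --os-cloud openstackci-inmotion server show \
    mirror02.iad3.inmotion.opendev.org -c status

# It reported SHUTOFF, so start it again
openstack --os-cloud openstackci-inmotion server start \
    mirror02.iad3.inmotion.opendev.org
```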
corvus | only zm06 has uptime < x days, so i wouldn't expect log entry data loss from anyone else | 17:13 |
corvus | oh, zm06 has log entries from as late as 12:47; our merge job was enqueued at 12:50... that's pretty close | 17:14 |
corvus | like, close enough it may have still been semi-functional at 12:50? | 17:14 |
fungi | entirely possible | 17:14 |
fungi | 12:25 was the last i saw anything in syslog but i didn't check the merger log | 17:14 |
fungi | and it's hard to tell from cacti since i haven't been able to zoom the timescale on the graphs for a while now | 17:15 |
corvus | functional enough to pull a job off the queue, update its state to running, and then crash before finishing the job or flushing out log buffer pages to disk | 17:15 |
corvus | 12:47:30,773 is the last merger log entry | 17:15 |
clarkb | that seems plausible | 17:16 |
corvus | it's not conclusive, but it fits the facts we have... | 17:16 |
clarkb | why would it be marked completed though? | 17:16 |
corvus | that's part of what the cleanup does... | 17:16 |
clarkb | ah | 17:16 |
corvus | so i think we've learned 2 things: 1) the cleanup routine works (yay). 2) we need a way to poke the scheduler and tell it that a merge job it was expecting failed. | 17:17 |
corvus | i think we may need to do something similar to what we do with build jobs and put a fake complete event into the queue | 17:17 |
clarkb | corvus: have the cleanup routine send that event you mean? | 17:18 |
clarkb | that makes sense to me | 17:18 |
corvus | yep; that's what the build cleanup does | 17:18 |
corvus | clarkb: i think we have the info we need and at this point remediation is dequeue/enqueue. | 17:18 |
clarkb | corvus: cool I can do the dequeue enqueue | 17:18 |
corvus | i'll dig into fixing | 17:19 |
*** jpena is now known as jpena|off | 17:19 | |
fungi | would promote probably solve it? | 17:19 |
fungi | just reorder the queue (might even not actually have to reorder it if you tell it to simply promote the first item) | 17:20 |
fungi | solve as in get builds going again i mean | 17:20 |
fungi | clarkb: ^ | 17:20 |
clarkb | it may have. I've done the dequeue enqueue and stuck it at the end of the queue | 17:21 |
clarkb | or am attempting to anyway, it hasn't finished the enqueue yet | 17:21 |
clarkb | I think that is because it is busy handling the events for the other changes that suddenly became viable | 17:22 |
fungi | yeah | 17:22 |
clarkb | yup there it goes all done | 17:22 |
fungi | thanks! | 17:22 |
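For reference, the remediation discussed above (dequeue/enqueue the stuck gate item, or alternatively promote it) maps onto the zuul client roughly as sketched below. This assumes the client is run on the scheduler; the change number is taken from the repl output earlier, and the exact project spelling and option details are assumptions.

```shell
# Drop the stuck item from the gate and re-add it, forcing a fresh merge and job launch
zuul dequeue --tenant openstack --pipeline gate \
    --project opendev.org/openstack/placement --change 809366,1
zuul enqueue --tenant openstack --pipeline gate \
    --project opendev.org/openstack/placement --change 809366,1

# Alternative mentioned above: promote the item, which reorders the queue and
# also re-triggers job launches for it
zuul promote --tenant openstack --pipeline gate --changes 809366,1
```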
opendevreview | Merged opendev/system-config master: Run daily backups of nodepool zk image data https://review.opendev.org/c/opendev/system-config/+/809483 | 17:34 |
*** ysandeep is now known as ysandeep|away | 18:04 | |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/809488/ is the next step in the logo asset management process. I think the earlier changes in that stack are likely to be safe to land? Definitely review them for correctness and potential release impacts I guess | 18:22 |
fungi | sure, i could use a break anyway, will take a look while i have queries underway | 18:26 |
opendevreview | Merged opendev/system-config master: gitea: cleanup logo assets https://review.opendev.org/c/opendev/system-config/+/809488 | 19:42 |
opendevreview | Merged opendev/system-config master: docs: use logo from assets https://review.opendev.org/c/opendev/system-config/+/809489 | 19:42 |
clarkb | corvus: fungi: it just occurred to me that putting the nodepool image data backups in /var/log/nodepool means we'll serve them with the http server on those hosts. There isn't anything sensitive in that though right? Should we move them in case that changes? | 19:55 |
clarkb | I'm running one of the exports into /tmp in the container to look at it | 19:57 |
clarkb | fungi: corvus: ianw: ya looking at the data it has info like when it was built, the python path, the username associated with the image, the cloud side uuid and name, and the image format | 20:01 |
clarkb | I think that is ok to have exposed. But if you all can double check that that would be great | 20:01 |
clarkb | oh and we may only serve the image build logs not the actual service logs so this may be all for nothing anyway | 20:03 |
corvus | clarkb: i think we should not serve it. i don't expect there to be any sensitive info, but i don't want to commit to that | 20:04 |
clarkb | corvus: ok I'm double checking what we actually serve now | 20:04 |
corvus | (for example, i'd like to move more authn info into zk for nodes; that might end up doing the same for images too (probably not, but it's conceivable)) | 20:04 |
clarkb | we serve content of /var/log/nodepool/builds and /opt/nodepool_dib. Currently the backups go to /var/log/nodepool/nodepool-image-backup.json | 20:05 |
clarkb | I think that means we are good | 20:05 |
corvus | kk | 20:05 |
fungi | yeah, seems fine then | 20:25 |
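A minimal sketch of how the backup and the served paths relate, assuming the export is done with nodepool's export-image-data subcommand inside the builder container (the container name and exact invocation are assumptions on my part; the paths are from the discussion above):

```shell
# Export the ZooKeeper image metadata to the backup location
# (container name is an assumption for illustration)
docker exec nodepool-builder \
    nodepool export-image-data /var/log/nodepool/nodepool-image-backup.json

# Served by the builder's webserver: /var/log/nodepool/builds and /opt/nodepool_dib
# Backup destination (under neither served path): /var/log/nodepool/nodepool-image-backup.json
```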
clarkb | One of the things on my todo list is restarting gerrit to pick up the replication timeouts. We have a fair number of leaked replications again which I'll clean up now. Any sense for when a good time to restart gerrit would be? | 20:26 |
clarkb | I think we should wait until we've got caught up replication queues too | 20:28 |
clarkb | also I'm noticing they always seem to happen around 1300-1800 UTC | 20:32 |
clarkb | either I'm just checking hours after they got stuck and timed out, or something specific about those times is causing us issues? | 20:32 |
clarkb | I'm thinking it might be best to do the restart when we can monitor gerrit and ideally through that period of time. Maybe do a restart early tomorrow as a result? | 20:35 |
clarkb | the backlog is cleared now fwiw | 20:36 |
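A rough sketch of how the replication backlog can be inspected before picking a restart window, using Gerrit's standard show-queue SSH command (the account used here is an assumption):

```shell
# Leaked replication pushes show up as long-lived entries in the work queues
ssh -p 29418 admin@review.opendev.org gerrit show-queue --wide --by-queue
```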
*** dviroel is now known as dviroel|out | 21:09 | |
clarkb | Note it appears the paste server is having trouble again | 21:32 |
clarkb | cloud reports they are investigating and will let us know when we can expect to recover things | 21:33 |
fungi | thanks | 21:34 |
clarkb | Last time they sent us a nice email suggesting fsck and a related recovery path, so I expect we'll get something similar this time around too | 21:35 |
fungi | yeah, basic steps are 1. down the containers, 2. umount the fs, 3. fsck it, 4. reboot, 5. log back in and up the containers again | 21:55 |
clarkb | ok email saying it should be happy has arrived | 22:12 |
clarkb | fungi: just `fsck $path` after umounting? | 22:14 |
fungi | yup | 22:14 |
fungi | it'll probably report clean | 22:14 |
fungi | fsck $devpath | 22:14 |
clarkb | yup it did | 22:14 |
clarkb | I'll reboot now then up the containers again | 22:15 |
clarkb | seems to be working. I can load an older paste and create a new one | 22:17 |
fungi | perfect | 22:17 |
TheJulia | out of curiosity, does opendev track/trend cpu utilization/iowait hit on test vms? | 23:26 |
ianw | TheJulia: devstack runs a dstat service in the background whose output can be graphed, but we don't track/trend it generally | 23:26 |
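As a generic illustration of the kind of data such a dstat service collects (this is not the exact devstack service definition, just an assumed example invocation):

```shell
# Sample CPU, disk, I/O, memory and load every 10 seconds and also write a CSV
# that can be graphed afterwards
dstat --time --cpu --disk --io --mem --load --output dstat.csv 10
```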
TheJulia | i ask because I've got some cases where I've spotted unit tests taking 1000 seconds where my 10 year old desktop runs them in <200 seconds | 23:27 |
fungi | were you able to correlate those runs to a particular provider? | 23:27 |
fungi | i get the impression some of our providers may sort of oversubscribe our disk i/o bandwidth in particular, and when our test volumes get really high all our test nodes in that location wind up fighting one another over limited disk bandwidth | 23:29 |
TheJulia | one moment | 23:29 |
TheJulia | inmotion-iad3, 600 seconds on one job, i think the other one was like a thousand seconds | 23:31 |
TheJulia | examples are slim since one is a gate fix with neutron I'm trying to get merged and...well.. yeah :( | 23:33 |
TheJulia | at least, a workaround for ironic to keep moving forward | 23:33 |
fungi | we did just ramp up the usage on that provider over the weekend, so it's untested territory. early today we were oom'ing bits of the infrastructure there but scaled it back as of roughly 14:00 utc | 23:34 |
fungi | it seemed more stable after that at least | 23:34 |
fungi | what time did the really long test duration happen? | 23:34 |
TheJulia | uhh, with-in the last 2 hours | 23:37 |
TheJulia | https://zuul.opendev.org/t/openstack/status#809895 | 23:38 |
fungi | okay, thanks that helps | 23:39 |
* TheJulia steps away for a little while to ponder dinner | 23:39 |