*** diablo_rojo is now known as Guest2326 | 00:47 | |
ianw | fungi: been debugging, with a little input from #openafs ... eventually i figured out that systemd is timing out; which tries to unload the module while afsd is still in a tight loop of ioctls to it | 01:05 |
ianw | after upping the timeout, the service takes ~4:30 to start, but it does eventually start | 01:06 |
fungi | aha! | 01:08 |
fungi | so the real trigger for the oops is the attempted rmmod call then? | 01:08 |
fungi | and we've just got a slow vm | 01:09 |
fungi | and the one in linaro-us is fast enough to not hit that timeout? | 01:09 |
ianw | yeah, Ramereth ^ maybe I do need to summon your help :) | 01:11 |
ianw | is the node backing mirror01.regionone.osuosl.opendev.org under particularly heavy load? | 01:11 |
fungi | i guess it would help to know what's being slow there at openafs-client startup... creating the cachedirs? | 01:11 |
fungi | wondering if it's i/o contention, storage bandwidth, cpu hitching, memory pressure... | 01:12 |
ianw | yeah i have strace, it first goes through and stats everything, but then spends a long period of time making sequential ioctl calls which i am guessing is doing something like registering every cache directory | 01:12 |
fungi | my guess is something to do with the cinder volume | 01:12 |
ianw | it doesn't seem like it's stuck at one thing | 01:12 |
ianw | i'm going to clear the cache, reboot and see again | 01:13 |
fungi | also maybe see if startup is faster when the cachedir is sanely created from a previous clean start | 01:13 |
ianw | i think the time is all this stating and ioctl-ing to register the directories; my hunch is actually that a more populated directory would take longer | 01:14 |
ianw | i can run afsd under strace with timing, but that will only further perturb things i guess | 01:15 |
fungi | yeah | 01:16 |
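A sketch of the kind of timed strace run being described; the exact afsd options aren't shown in the log, so the invocation below is an illustrative assumption:

```sh
# -f follows afsd's forked helper processes, -tt timestamps each syscall,
# -T reports how long each syscall took, -o keeps the output out of the way.
# <afsd-options> stands in for whatever arguments the distro's openafs-client
# configuration normally passes to afsd.
strace -f -tt -T -o /tmp/afsd.strace /sbin/afsd <afsd-options>
```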
ianw | 0.01user 0.00system 1:38.27elapsed 0%CPU (0avgtext+0avgdata 6376maxresident)k | 01:18 |
ianw | that's right on the edge of a default 1:30 timeout i guess | 01:21 |
* Ramereth has been summoned | 01:22 | |
Ramereth | let me take a look at the load and see what's going on | 01:22 |
fungi | i always enjoy *appears in a puff of smoke* | 01:23 |
Ramereth | doesn't appear to be network issues | 01:29 |
Ramereth | which node is having the issue? the mirror01 one? | 01:29 |
ianw | yeah, f0f1ed9d-98a8-4713-8d74-69b168c4c996 | 01:30 |
ianw | i dunno if we can 100% pin it as an issue. starting openafs is probably the only time it pegs itself at 100%, stating and doing all these ioctls | 01:31 |
ianw | if the timeout for the service starting is at 1:30 and we're at something like 1:38 ... then maybe we've just been lucky and slipped under the timeout previously | 01:32 |
ianw | i.e. the performance is roughly the same as it's always been | 01:32 |
ianw | but if the backing node looks particularly overcommitted, or something else, that would also explain things | 01:33 |
Ramereth | the hypervisor looks the same as it has been all week | 01:38 |
Ramereth | nothing looks different out of the last 24hr for the ceph nodes either | 01:39 |
Ramereth | I need to go but I can look more into it tomorrow. Feel free to send an email to support with additional details if you think it is something on our end | 01:40 |
ianw | Ramereth: thanks for looking! | 01:40 |
ianw | it may well be that the only time we hit this is on reboot; we barely ever reboot so we don't really know what the usual time to start up is | 01:41 |
ianw | i'll propose a timeout override and i think we can just continue to monitor | 01:41 |
opendevreview | Ian Wienand proposed opendev/system-config master: openafs-client: add service timeout override https://review.opendev.org/c/opendev/system-config/+/796578 | 01:51 |
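For reference, this sort of start-timeout bump is usually delivered as a systemd drop-in rather than an edit to the packaged unit; a minimal sketch, where the file path and the 10-minute value are illustrative assumptions rather than what the linked change necessarily uses:

```ini
# /etc/systemd/system/openafs-client.service.d/override.conf (hypothetical path)
[Service]
# The default TimeoutStartSec of 90s is what afsd's ~1:38 cache scan was tripping over;
# give it generous headroom.
TimeoutStartSec=10min
```

After dropping the file in place, `systemctl daemon-reload` makes systemd pick up the override.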
fungi | yeah, also the reboot where we first hit this was last week, we weren't having much luck figuring out the problem and disabled booting job nodes there on friday | 01:53 |
ianw | probably hadn't been rebooted since it was first started | 02:08 |
ianw | stdout: Looking in indexes: https://mirror.regionone.limestone.opendev.org/pypi/simple, https://mirror.regionone.limestone.opendev.org/wheel/ubuntu-18.04-x86_64 | 03:29 |
ianw | :stderr: ERROR: Could not find a version that satisfies the requirement ara[server] (from versions: none) | 03:29 |
ianw | are we having limestone issues? | 03:29 |
ianw | mirror_443_access.log:2607:ff68:100:54:f816:3eff:feb4:c4a3 - - [2021-06-16 01:55:55.115] "GET /pypi/simple/ara/ HTTP/1.1" 200 15079 cache miss: attempting entity save "-" | 03:43 |
ianw | that must have been the query? timestamp doesn't quite line up so maybe it didn't hit the mirror? | 03:43 |
ianw | E: Failed to fetch https://mirror.regionone.limestone.opendev.org/debian/pool/main/k/kerberos-configs/krb5-config_2.6_all.deb Connection failed [IP: 216.245.200.130 443] | 03:44 |
ianw | similar error here | 03:44 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 03:57 |
opendevreview | Ian Wienand proposed opendev/system-config master: review02 : switch reviewdb to mariadb_container type https://review.opendev.org/c/opendev/system-config/+/795192 | 03:57 |
opendevreview | Ian Wienand proposed openstack/project-config master: infra-package-needs: don't start ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796584 | 04:08 |
opendevreview | Ian Wienand proposed openstack/project-config master: Revert "Disable the osuosl arm64 cloud" https://review.opendev.org/c/openstack/project-config/+/796585 | 04:18 |
opendevreview | Merged openstack/project-config master: infra-package-needs: don't start ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796584 | 04:54 |
*** marios is now known as marios|ruck | 05:04 | |
gibi | hi infra! do you see some package mirror issue in the limestone region? We have this bug https://bugs.launchpad.net/nova/+bug/1931864 that only appears in limestone and produces nonsensical requirement conflicts, seemingly randomly | 05:25 |
ianw | gibi: i've also seen some issues with limestone that don't seem repeatable. we might have to disable it | 05:41 |
fungi | ianw: gibi: earlier there were two htcacheclean processes going at the same time driving system load up to 100 due to excessive iowait, and after i killed one of the htcachecleans it seemed to calm down | 06:16 |
fungi | it's possible the apache cache on that server is just getting larger than the i/o performance makes it possible to prune | 06:16 |
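If the cache really is outgrowing what a single pruning pass can keep up with, the usual levers are htcacheclean's size limit and niceness flags; a hedged example, where the cache path and limit are illustrative assumptions rather than the mirror's actual configuration:

```sh
# -n  be "nice": sleep between chunks of work to reduce I/O pressure
# -t  remove empty directories left behind after pruning
# -p  the proxy cache root, -l the target total size limit
htcacheclean -n -t -p /var/cache/apache2/proxy -l 70G
```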
gibi | ianw, fungi: thanks for replying. we just got a fresh hit of the req conflict https://1b38eb07519f5fe2ed36-da9f1bb46fd216e97fa5e10d4af58222.ssl.cf5.rackcdn.com/796255/2/check/openstack-tox-py36/3c59c9d/job-output.txt | 06:24 |
ianw | fungi: that sounds exactly like the performance issues we disabled it for before | 06:40 |
ianw | i think we'll have to disable it. i think clarkb worked with logan-, but i think something is still up with storage | 06:40 |
opendevreview | Ian Wienand proposed openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues"""" https://review.opendev.org/c/openstack/project-config/+/796590 | 06:45 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues"""" https://review.opendev.org/c/openstack/project-config/+/796590 | 07:14 |
*** rpittau|afk is now known as rpittau | 07:15 | |
*** jpena|off is now known as jpena | 07:33 | |
*** elodilles is now known as elodilles_afk | 08:10 | |
*** gthiemon1e is now known as gthiemonge | 08:23 | |
*** amoralej|off is now known as amoralej | 08:24 | |
*** ykarel is now known as ykarel|lunch | 08:31 | |
*** ysandeep|out is now known as ysandeep | 08:45 | |
*** sshnaidm|afk is now known as sshnaidm | 08:50 | |
*** ykarel|lunch is now known as ykarel | 09:38 | |
*** bhagyashris_ is now known as bhagyashris | 10:27 | |
*** ykarel_ is now known as ykarel | 10:54 | |
*** jpena is now known as jpena|lunch | 11:32 | |
*** amoralej is now known as amoralej|lunch | 12:10 | |
*** hashar is now known as Guest2387 | 12:24 | |
*** hashar is now known as Guest2388 | 12:26 | |
*** jpena|lunch is now known as jpena | 12:30 | |
*** diablo_rojo__ is now known as diablo_rojo | 12:35 | |
*** elodilles_afk is now known as elodilles | 13:04 | |
*** amoralej|lunch is now known as amoralej | 13:10 | |
*** ysandeep is now known as ysandeep|brb | 13:33 | |
mordred | fungi: http://exple.tive.org/blarg/2020/03/06/brace-for-impact/ <-- this is from our friends at mozilla (part of a series of blog posts about their move) ... apparently matrix has federated ban lists | 13:38 |
fungi | oh neat | 13:53 |
mordred | yeah - seems like a really neat way for folks to collaborate in managing bad actors | 14:07 |
*** ysandeep|brb is now known as ysandeep | 14:15 | |
*** gthiemon1e is now known as gthiemonge | 14:37 | |
*** ykarel is now known as ykarel|away | 14:39 | |
fungi | okay, so i'm looking at tagging a new version of gear, as requested in #zuul | 15:17 |
fungi | the last release was 0.15.1 in february 2020 | 15:18 |
fungi | there are 9 non-merge commits since then | 15:19 |
fungi | doesn't look like there are any backward-incompatible changes, but we did increase some dependency versions and add support for more python interpreters, so this should probably be 0.16.0 (unless we want to be bold and make it 1.0.0, but keep in mind we're likely to no longer use it after zuul v5) | 15:20 |
fungi | when i was trying to release it earlier in the year we got stuck solving the ssl/tls changes for python 3.9, but that's working now | 15:21 |
*** dklyle is now known as dklyle__ | 15:31 | |
mordred | fungi: I think 0.16 seems reasonable | 15:41 |
*** marios|ruck is now known as marios|out | 15:45 | |
fungi | i just realized i still need to update the package metadata to list newer python releases, so working on that patch now | 15:46 |
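Listing newer interpreters is normally just a matter of extending the trove classifiers in setup.cfg (gear uses pbr, hence the singular `classifier` key); the list below is a guess at what the change adds, not a quote from it:

```ini
[metadata]
classifier =
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
```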
opendevreview | Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info https://review.opendev.org/c/opendev/gear/+/796704 | 16:07 |
fungi | mordred: ^ | 16:07 |
mordred | fungi: I think you could claim 3.10 too? (wasn't that what mhuin was poking at?) | 16:08 |
fungi | well, we're not actively testing gear changes with 3.10, he tested zuul with 3.10 which happens to exercise gear | 16:08 |
mordred | ah - nod. fair | 16:09 |
fungi | but also 3.10 is still in beta anyway | 16:09 |
fungi | and won't be released for months yet | 16:09 |
mhuin | fungi, I can add py310 testing if you'd like, I've done it on a local clone at some point | 16:10 |
*** rpittau is now known as rpittau|afk | 16:11 | |
fungi | mhuin: not urgent, at this stage i still think it would be premature to have a release of gear claiming it supports 3.10 when 3.10 hasn't been finalized, so the metadata could retroactively wind up incorrect if we need another release to fix it for things which change in 3.10 before it's final | 16:12 |
fungi | "gear 0.16.0 claims it supports python 3.10, but you really need to install gear 0.16.1 if you want to use it with python 3.10" would be an unfortunate situation | 16:13 |
mhuin | fungi, fair enough. I'll still do it anyway as a [DNM], it's going to be useful for packaging python-gear on Fedora Rawhide | 16:13 |
fungi | sure, sounds like a great idea | 16:13 |
opendevreview | Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info https://review.opendev.org/c/opendev/gear/+/796704 | 16:16 |
fungi | minor fixup to a line in the setup.cfg ^ | 16:16 |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:20 |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:26 |
*** ysandeep is now known as ysandeep|out | 16:28 | |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:29 |
*** amoralej is now known as amoralej|off | 16:34 | |
corvus | fungi: whoah please don't tag gear | 16:38 |
fungi | corvus: holding off in that case | 16:47 |
corvus | fungi: see followup in #zuul | 16:48 |
fungi | thanks! | 16:51 |
*** jpena is now known as jpena|off | 16:59 | |
corvus | infra-root: i've gotten a report of a memory leak in zuul; opendev doesn't appear to be suffering from it now, but it may have been last week; i'd like to do some investigation with the repl. i've developed a less-intrusive technique of getting object graphs, so i think i can do this with little or no visible impact, but it's all new, so things could go wrong. | 17:46 |
*** ricolin_ is now known as ricolin | 17:49 | |
fungi | corvus: sounds good, thanks for the heads up... nice to not be the first ones to hit a memory leak for a change! ;) | 17:53 |
corvus | yeah, hopefully i can find it before it becomes a bigger problem | 17:57 |
*** mordred[m] is now known as mordred | 17:59 | |
fungi | gmann: looks like archive.org didn't know to grab a copy of ptg.json, so we'll likely have to extract data from the server if you still need it: https://web.archive.org/web/*/http://ptg.openstack.org/* | 18:02 |
fungi | the etherpads.html is unfortunately not statically served, it's just javascript pulling and rendering a json blob | 18:03 |
tobiash[m] | fungi: we were the ones hit by the memleak ;) | 18:19 |
fungi | tobiash[m]: oh no! sorry to hear that | 18:21 |
tobiash[m] | took one day to go oom with 30gb of memory... | 18:22 |
fungi | ouch | 18:24 |
fungi | surprised we haven't hit the same then | 18:25 |
fungi | but yeah, our memory usage has been fairly flat since we restarted ~5 days ago: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70194&rra_id=all | 18:27 |
corvus | unfortunately, so far, all i've managed to do is confirm that we don't have any leaked layout objects :( | 18:27 |
fungi | we did seem to be on track for runaway memory utilization back on june 10 | 18:27 |
fungi | looks like it began very suddenly after having been running for more than a week | 18:28 |
fungi | so maybe we did experience it as well, but just happened to restart before we exhausted available ram | 18:29 |
corvus | fungi: yeah, which is part of why i suspect even if we're not seeing it now, we may still be susceptible. that was the old version without the gear job fix, but the mem increase happened before the executor crashed, so i don't think that was the cause | 18:29 |
fungi | makes sense | 18:30 |
gmann | fungi: ohk, I think we can try bringing the site up and see if it is there, otherwise leave it - http://ptg.openstack.org/ | 18:45 |
fungi | yeah, the main gotcha is going to be that the old irc bots may start up and connect, and might be smart enough to ghost the production bots | 18:48 |
fungi | since that server houses gerritbot and meetbot too | 18:49 |
opendevreview | Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts https://review.opendev.org/c/opendev/git-review/+/796727 | 19:03 |
fungi | ooh, a patch from florian! | 19:09 |
JayF | I hope someone exclaims with glee when they see me post a patch somewhere :) | 19:17 |
fungi | i always do | 19:17 |
opendevreview | Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts https://review.opendev.org/c/opendev/git-review/+/796727 | 19:18 |
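For anyone unfamiliar with the option the patch is handling: `core.hooksPath` redirects Git's hook lookup away from `.git/hooks`, which is why git-review's commit-msg hook installation needs to be aware of it. A minimal illustration (the `.githooks` directory name is just an example):

```sh
# Point Git at a shared hooks directory instead of .git/hooks
mkdir -p .githooks
git config core.hooksPath .githooks
# With this set, hooks such as Gerrit's commit-msg hook are looked up
# under .githooks/ rather than .git/hooks/.
```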
fungi | so looking through the current state with our arm nodes, we seem to once again have a number of stale node request locks, i'm going to restart the launcher container on nl03.o.o | 19:19 |
fungi | #status log Restarted nodepool launcher on nl03 to free stale node request locks for arm64 nodes | 19:21 |
opendevstatus | fungi: finished logging | 19:21 |
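The launcher restart mentioned above generally amounts to bouncing the docker-compose managed container; a hedged sketch, where the compose directory is an assumption about nl03's layout rather than a verified path:

```sh
# Hypothetical compose location; adjust to match the host's actual layout.
cd /etc/nodepool-launcher-compose/
docker-compose down
docker-compose up -d
```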
fungi | also, aside from the problem getting the mirror server working again in osuosl, we seem to be deleting quite a few nodes there, one of which has been stuck in a deleting state for two months | 19:25 |
fungi | i'm not sure what to make of this traceback: http://paste.openstack.org/show/806712 | 19:28 |
fungi | anyway, need to take a dinner break, back in a while | 19:29 |
corvus | fungi: there may have been an exception deleting a server | 19:46 |
corvus | fungi: apparently that really means "timeout of 10 minutes exceeded while waiting for server deletion" | 19:48 |
corvus | fungi: so assuming instance id 1c8d118d-756f-4e8c-bf57-0c6b8ad690ea still exists in the cloud, it's likely that nodepool did try to delete that and it persists in showing up in the server list | 19:50 |
corvus | i'm closing the repl for now; i've found no evidence of any leak of layout or pipeline objects :/ | 20:02 |
*** ChanServ sets mode: +o corvus | 21:17 | |
*** corvus was kicked by corvus (Kicked by @jim:acmegating.com : Removing stale matrix bridge puppet user) | 21:18 | |
fungi | corvus: thanks, and yeah that was also the only interpretation i could come up with. i'm used to seeing api errors for delete failures, so the lack of api response logged suggested no response | 21:43 |
fungi | or at least no response before it gave up waiting for one | 21:44 |
*** diablo_rojo is now known as Guest2446 | 21:51 | |
ianw | fungi: i just got into the usual yak shaving exercise with the timeout for the openafs-client role; since the last time we ran, centos8 has pulled in some changes, so we need to update our rpms to unbreak the gate | 21:59 |
ianw | of course, the pre-release has a slightly different naming convention, breaking the build scripts ... | 22:00 |
fungi | of course it does! | 22:03 |
fungi | why would you keep something like that consistent? | 22:03 |
opendevreview | Merged opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 23:14 |
*** diablo_rojo is now known as Guest2457 | 23:20 | |
ianw | fungi: https://review.opendev.org/c/opendev/system-config/+/796578 now overrides the openafs timeout, which i've already manually applied to osuosl mirror. if you're ok, i can unapply, merge that, and we can just leave the mirror there on 1.8.8pre1 | 23:48 |
ianw | it can essentially be a 1.8.8 tester, and i'll keep an eye for release and we can jump straight to that in the main PPA repo | 23:49 |
fungi | reviewing | 23:49 |
fungi | ianw: lgtm, approved, go ahead | 23:50 |
ianw | fungi: thanks; did you say you cleared out some stuck nodes? | 23:51 |
fungi | ianw: no, it's not clear to me why they're stuck in delete state in osuosl | 23:52 |
fungi | might be undeletable and need attention from the admin end | 23:52 |
ianw | ahh, i feel like we've had that before | 23:52 |
ianw | Ramereth deleted them for me ^ | 23:52 |
fungi | for linaro-us, we seem to be managing to get 2 in-use nodes at a time right now | 23:53 |
fungi | 15 nodes in delete state in linaro-us with 4 building, 2 ready and 2 in-use | 23:54 |
ianw | i feel like that led me down the path of https://review.opendev.org/c/zuul/nodepool/+/785821 | 23:54 |
ianw | the mariadb container for gerrit seemed to have failed in deploy with letsencrypt : https://review.opendev.org/c/opendev/system-config/+/775961 | 23:55 |
fungi | many of the deleting nodes in linaro-us have been there for two months | 23:55 |
ianw | i'll look into that. it's annoying as i wanted to be 100% sure it didn't modify production before moving on to deploy it on review02 | 23:56 |
ianw | fungi: i guess we need to ping kevinz on that ... email might be better | 23:56 |
fungi | we've got that e-mail thread with him clarkb started yesterday | 23:56 |
fungi | where he cleared some stuck building nodes i think? | 23:57 |
ianw | i'll double check and reply | 23:59 |
ianw | fedora-34 should also have built | 23:59 |