opendevreview | Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container https://review.opendev.org/c/opendev/lodgeit/+/798411 | 00:33 |
*** odyssey4me is now known as Guest1231 | 01:12 | |
opendevreview | Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container https://review.opendev.org/c/opendev/lodgeit/+/798411 | 01:16 |
*** ysandeep|out is now known as ysandeep | 01:48 | |
*** ysandeep is now known as ysandeep|afk | 02:11 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys https://review.opendev.org/c/openstack/diskimage-builder/+/799126 | 03:22 |
*** ysandeep|afk is now known as ysandeep | 03:59 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys https://review.opendev.org/c/openstack/diskimage-builder/+/799126 | 04:19 |
*** ykarel|away is now known as ykarel | 05:34 | |
kopecmartin | ianw: oh, I thought you'd already done it, because I noticed yesterday that the server stopped downloading guidelines and was throwing errors so I merged the interop change - https://review.opendev.org/c/osf/interop/+/796413 | 06:22 |
kopecmartin | everything is working now which is very weird if the container hasn't been pulled yet | 06:23 |
*** gthiemon1e is now known as gthiemonge | 06:32 | |
*** jpena|off is now known as jpena | 06:52 | |
*** amoralej|off is now known as amoralej | 06:56 | |
*** ysandeep is now known as ysandeep|lunch | 08:30 | |
*** ykarel is now known as ykarel|lunch | 08:31 | |
*** ysandeep|lunch is now known as ysandeep | 09:34 | |
*** ykarel|lunch is now known as ykarel | 09:52 | |
ricolin | ianw, fungi clarkb found this error in https://nb03.opendev.org/debian-bullseye-arm64-0000029245.log | 10:11 |
ricolin | Exit code: 1 | 10:11 |
ricolin | "/usr/local/lib/python3.7/site-packages/diskimage_builder/lib/disk-image-create: line 145: cannot create temp file for here-document: No space left on device" | 10:13 |
ricolin | currently all debian-bullseye-arm64 jobs are queued for days | 10:14 |
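The "No space left on device" failure ricolin pasted comes from diskimage-builder running out of room in its temp area. A minimal stdlib-only preflight check like the following sketch (not part of diskimage-builder; the path and the 20 GiB threshold are assumptions) would surface the condition before a build even starts:

```python
import shutil

def enough_space(path="/opt/dib_tmp", need_bytes=20 * 1024**3):
    """Return True if `path` has at least `need_bytes` free (assumed threshold)."""
    return shutil.disk_usage(path).free >= need_bytes
```

A builder wrapper could refuse to start a build when this returns False, failing fast instead of dying mid-build on a here-document write.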
*** frenzy_friday is now known as frenzyfriday|afk | 11:00 | |
*** ysandeep is now known as ysandeep|afk | 11:01 | |
*** dviroel|out is now known as dviroel | 11:34 | |
*** jpena is now known as jpena|lunch | 11:36 | |
*** bhagyashris_ is now known as bhagyashris|ruck | 12:15 | |
*** ysandeep|afk is now known as ysandeep | 12:26 | |
*** ysandeep is now known as ysandeep|mtg | 12:29 | |
*** jpena|lunch is now known as jpena | 12:36 | |
*** ysandeep|mtg is now known as ysandeep | 12:38 | |
*** amoralej is now known as amoralej|lunch | 12:45 | |
*** ysandeep is now known as ysandeep|mtg | 13:00 | |
*** amoralej|lunch is now known as amoralej | 13:41 | |
fungi | ricolin: thanks for the heads up, i wonder if we're having growroot issues on those specific images | 13:46 |
fungi | ricolin: oh! that's in the build log, so we've likely filled up the disk on that builder, i'll check it | 13:47 |
fungi | we may need to shut down the builder container on it and clean up the disk | 13:47 |
fungi | /dev/mapper/main-main 787G 787G 0 100% /opt | 13:47 |
fungi | bingo | 13:47 |
fungi | we basically haven't been building any new arm64 images | 13:48 |
fungi | ricolin: the backlog may be unrelated, i know we were also waiting on the linaro-us cloud to fix an expired ssl cert, i need to see if it's been replaced yet | 13:49 |
fungi | the ssl cert for the api endpoint expired some days ago | 13:49 |
fungi | the full disk on nb03 might actually be related to that if it's been struggling and failing to upload new images there | 13:50 |
fungi | i've downed the nodepool-builder container on nb03.opendev.org now | 13:50 |
corvus | i'd like to restart zuul to see how the zk executor api changes perform | 14:04 |
fungi | corvus: seems like a good day for it. also we'll get the zuul vars back in the inventory.yaml file after that | 14:15 |
*** ysandeep|mtg is now known as ysandeep | 14:17 | |
corvus | ya | 14:17 |
corvus | restarting now | 14:21 |
corvus | #status log restarted all of zuul on commit cc3ab7ee3512421d7b2a6c78745ca618aa79fc52 (includes zk executor api and zuul vars changes) | 14:22 |
opendevstatus | corvus: finished logging | 14:22 |
fungi | i let the openstack release team know, they were about to start approving some patches in their meeting | 14:28 |
corvus | oh sorry, i thought they were typically idle on friday; i will re-evaluate my assumptions | 14:29 |
corvus | it's up again, and jobs are running | 14:29 |
fungi | no worries, i told them i would give them a heads up when we were starting, but no harm done | 14:29 |
corvus | re-enqueue in progress | 14:29 |
fungi | thanks! | 14:30 |
corvus | jobs seem to be running, so that's a good sign | 14:30 |
corvus | there are significantly more ephemeral nodes in zk | 14:32 |
corvus | also significantly less data (probably compression) | 14:33 |
corvus | we've added about 2k nodes (for a total of 39k) but dropped from 21.5mb to 14.9mb | 14:34 |
corvus | oh, interesting, the data size has gone back up; i guess that metric lagged a bit? | 14:35 |
tobiash[m] | has the scheduler startup time been impacted (due to mergers via zk)? | 14:36 |
corvus | tobiash: it didn't seem significant; let me see if i can get a number | 14:36 |
corvus | tobiash: almost exactly average. our mean of 4 reconfigurations in the last month was 378 seconds (range from 357-403), today's was 375 | 14:40 |
tobiash[m] | great | 14:41 |
fungi | that's great news | 14:42 |
corvus | the executors seem to have reached their nominal capacity for builds fairly quickly | 14:43 |
corvus | i wonder if we need a stats adjustment for the executors and executor queue though; those graphs appear to have flatlined | 14:43 |
fungi | okay, release team meeting has wrapped up and i'm back to looking at nb03 to see what we need to clean up | 14:45 |
tobiash[m] | the queued jobs stat still counts the gearman queue | 14:46 |
fungi | i expect the contents of /opt/dib_tmp are all leaked trash at this point | 14:46 |
tobiash[m] | as it looks like | 14:46 |
*** ysandeep is now known as ysandeep|dinner | 14:46 | |
corvus | i think for the executors graph we need to add "unzoned" | 14:46 |
fungi | ooh, yep, none of our executors are zoned | 14:47 |
fungi | so if it's treed by zone now that would make sense we'd have to adjust the stat we're polling | 14:47 |
tobiash[m] | corvus: I wonder why the running jobs graph still works given that the stat seems to still count the gearman queue: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L363 | 14:49 |
corvus | i'm confused, i think the plain zuul.executors.accepting stat should work; we shouldn't need to switch to unzoned yet | 14:50 |
tobiash[m] | the accepting should work | 14:51 |
*** dviroel is now known as dviroel|lunch | 14:51 | |
corvus | yet it doesn't; and the running should not work, yet it does | 14:52 |
tobiash[m] | that's weird | 14:52 |
tobiash[m] | the running might be taken from the per executor metric | 14:54 |
tobiash[m] | which should not have changed | 14:54 |
tobiash[m] | ah I think I got it, the "Executor Queue" graph is taken from the queue metrics from the scheduler, which are broken now and flatlined | 14:56 |
tobiash[m] | the "Running Builds" graph uses the executor stats and works | 14:56 |
corvus | ah yep, that's it | 14:57 |
tobiash[m] | which leaves the "Executors" graph to be checked | 14:57 |
tobiash[m] | which I think should continue to work | 14:57 |
corvus | though, we still have the mystery of why zuul.executors.accepting isn't working but zuul.executors.unzoned.accepting is | 14:59 |
fungi | apparently nl02 got caught up in a hypervisor host problem earlier in the week and was rebooted, per a ticket from rackspace | 15:01 |
fungi | but looks like it's running okay currently | 15:01 |
tobiash[m] | corvus: there is a bug: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L329 | 15:04 |
tobiash[m] | that's not taking the accepting into account | 15:04 |
corvus | aha | 15:05 |
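The bug tobiash linked is that the scheduler's gauge computation doesn't take the accepting flag into account when emitting the top-level executor stats. As a hedged illustration only (the `Executor` record, helper, and aggregation here are a sketch, not Zuul's actual code), the top-level `zuul.executors.*` gauges should aggregate over all executors while the `unzoned.*` variants filter to executors without a zone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Executor:
    zone: Optional[str]  # None means the executor is unzoned
    accepting: bool      # whether it is currently accepting builds

def executor_gauges(executors):
    """Compute the gauge values; names mirror the stats discussed above."""
    unzoned = [e for e in executors if e.zone is None]
    return {
        "zuul.executors.online": len(executors),
        "zuul.executors.accepting": sum(e.accepting for e in executors),
        "zuul.executors.unzoned.online": len(unzoned),
        "zuul.executors.unzoned.accepting": sum(e.accepting for e in unzoned),
    }
```

With this shape, a deployment whose executors are all unzoned (like the one discussed here) gets identical values for the zoned and unzoned gauges, so the top-level graph keeps working without switching the dashboard to `unzoned`.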
*** ysandeep|dinner is now known as ysandeep | 15:10 | |
*** ysandeep is now known as ysandeep|away | 15:24 | |
*** ykarel is now known as ykarel|away | 15:26 | |
* clarkb is catching up | 15:35 | |
clarkb | Sounds like things are working other than stats reporting? not bad considering | 15:37 |
clarkb | fungi: re nb03 the builders all do that. I suspect it is partially related to us updating the docker container images forcefully. But ianw thought that the issue on the dib side that let that happen had been addressed | 15:38 |
clarkb | fungi: one thought I had after the last cleanup was that we could run a simple find in cron to clean those up based on what the current build is (basically find a way to ignore the current build) | 15:38 |
fungi | i suppose we could hold a lock on a file in the tempdir and then check that known filename for open handles before removing the containing directory? | 15:46 |
clarkb | fungi: ya that should probably do it. I think you can also find the random string in the current build log or in the process tree (I suppose your idea is to look it up from the process tree) | 15:47 |
fungi | nah, i mean actually stat the known filename inside each tempdir and then if it has no open file handles we know it's been leaked... but that assumes the process grows or is wrapped in a script with the feature to hold that lock until the process terminates | 15:50 |
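fungi's lock-file idea can be sketched with flock: the build wrapper holds an exclusive lock on a sentinel file inside its tempdir for the life of the build, and a cleanup cron job treats any tempdir whose sentinel lock it can grab as leaked. This is a hypothetical sketch under the assumption fungi states (the builder grows or is wrapped to hold the lock until it exits); the sentinel filename and directory layout are invented:

```python
import fcntl
import os
import shutil

SENTINEL = ".in-use.lock"  # assumed sentinel filename

def hold_lock(tempdir):
    """Called by the build wrapper; keep the returned fd open until exit."""
    fd = os.open(os.path.join(tempdir, SENTINEL), os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def is_leaked(tempdir):
    """True only if a sentinel exists and no live process holds its lock."""
    path = os.path.join(tempdir, SENTINEL)
    if not os.path.exists(path):
        return False  # no sentinel: can't tell, leave the dir alone
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # still locked by a running build
    finally:
        os.close(fd)
    return True

def clean_leaked(root="/opt/dib_tmp"):
    """Cron entry point: remove tempdirs whose builds are gone."""
    for entry in os.scandir(root):
        if entry.is_dir() and is_leaked(entry.path):
            shutil.rmtree(entry.path, ignore_errors=True)
```

Because flock locks are released automatically when the holding process dies, a crashed build's tempdir becomes reclaimable without any bookkeeping, while a running build's tempdir is never touched.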
clarkb | fungi: also I started thinking about the gerrit account cleanup and realized that the last set of data was generated long enough ago that if I disabled accounts today that suddenly started being active again in the last 2 months that would be sadness. I don't expect a large delta but I think I should regenerate all the outputs of our scripts around this (redo the config check in | 15:52 |
clarkb | gerrit, feed that into the audit, compare the audit from nowish to a couple of months ago) before retiring accounts | 15:52 |
clarkb | I suspect we'll get zero delta and we can proceed without much extra checking beyond that, but if there is a delta it should be small and we can accommodate it | 15:53 |
*** jpena is now known as jpena|off | 15:53 | |
*** amoralej is now known as amoralej|off | 15:55 | |
fungi | yeah, that's a great point | 15:58 |
fungi | trying to run du over /opt/dib_tmp on nb03 is taking a very long time to return | 16:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gerrit image to v3.2.11 https://review.opendev.org/c/opendev/system-config/+/799225 | 16:01 |
clarkb | melwitt: fungi: ^ re gerrit update | 16:01 |
*** dviroel|lunch is now known as dviroel | 16:16 | |
fungi | clarkb: i'm beginning to think du is never going to finish counting the contents of /opt/dib_tmp, is it safe just to empty that while the builder is stopped? | 16:25 |
clarkb | fungi: yes all of the data there is temporary. One suggestion though is that you down the builder container, then reboot to clear out any stale mounts that may exist for those entries (hopefully would only be for the running build that dies due to the stop), then cleanup and start the process again | 16:26 |
fungi | clarkb: there is nothing mounted currently anyway, at least not according to df/mount commands | 16:27 |
fungi | just the normal system mounts and a /run/user mount for my session | 16:28 |
clarkb | in that case should be totally fine without a reboot | 16:28 |
fungi | okay, wiping out everything inside /opt/dib_tmp in that case | 16:29 |
fungi | it's been an hour of deleting and freed ~120GiB so far, but i have a feeling it's still going to be deleting for a while | 17:26 |
JayF | I'm pretty reliably getting 400 errors from storyboard trying to submit a new story. error is a red box popup saying "400: POST /api/v1/stories/2009026: Invalid input for field/attribute story. Value: '2009026'. unable to convert to Story" | 17:26 |
clarkb | JayF: https://storyboard.openstack.org/#!/story/2009026 I think that is because it was already created | 17:28 |
JayF | Got a new browser window and it... oh | 17:28 |
clarkb | I suspect you had a non-fatal error on the initial creation then subsequent attempts result in that error you posted above | 17:28 |
JayF | Well, it worked in a new browser window. Now it's obvious as to why. | 17:28 |
clarkb | The timestamps for creation are from about 7 minutes ago | 17:29 |
JayF | yeah, it matches | 17:29 |
JayF | weird but glad it's all fine, I'll clean up my dupe | 17:29 |
fungi | JayF: also it can do that if you try to add two initial tasks in the story creation dialog, known bug | 17:40 |
JayF | That is /exactly/ what I did. | 17:40 |
fungi | the task creations seem to try to happen in overlapping transactions | 17:41 |
JayF | Thanks for closing the loop on it, that matches b/c I got a different error the first time (but didn't recall it) and then got this one every other step | 17:41 |
fungi | and the api call to add the second task fails on a lock | 17:41 |
fungi | 2.5 hours into cleanup we've deleted 300GiB from /opt/dib_tmp on nb03 | 18:58 |
clarkb | infra-root I'm going to run a gerrit config consistency check now to get an up to date list of conflicts that I can use to rerun an audit with. Though at this rate I probably won't get to that today as I think the zuul stuff is going to take priority | 19:16 |
clarkb | consistency check hasn't changed since we last ran it (good that is expected). Now I need to run the audit to see if user interactions have changed | 19:31 |
opendevreview | Goutham Pacha Ravi proposed openstack/project-config master: Add feature branch notifications to openstack-sdks https://review.opendev.org/c/openstack/project-config/+/799323 | 19:32 |
clarkb | I have a user audit running now | 19:45 |
fungi | /opt/dib_tmp on nb03 is finally empty. 356GiB available now. is there anything else i should clean up before starting the nodepool-builder container there again? | 19:46 |
fungi | it has 22 base images plus their variants and checksums in /opt/nodepool_dib which is probably reasonable | 19:47 |
clarkb | fungi: you can check if we have leaked images in /opt/nodepool_dib, but you can clean those up safely while the process is running | 19:47 |
clarkb | fungi: on the x86 builders you occasionally see the intermediate vhd file get lost | 19:48 |
clarkb | but we don't build vhds for arm64 so that shouldn't happen; I suspect that cleanup is fairly complete | 19:48 |
fungi | yeah, no vhd files in there | 19:48 |
fungi | starting the container again in that case | 19:48 |
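The leaked-image check clarkb described (stray intermediate files like a lost vhd in /opt/nodepool_dib) could be automated with a small helper. This is a hypothetical sketch; the extension list and directory are assumptions based on the conversation, not nodepool's actual cleanup logic:

```python
import os

# assumed extensions of intermediate artifacts that should not linger
INTERMEDIATE_EXTS = {".vhd", ".tmp"}

def leaked_images(root="/opt/nodepool_dib"):
    """Return paths of files that look like leaked intermediate images."""
    leaked = []
    for entry in os.scandir(root):
        ext = os.path.splitext(entry.name)[1]
        if entry.is_file() and ext in INTERMEDIATE_EXTS:
            leaked.append(entry.path)
    return leaked
```

On an arm64 builder like nb03, which never produces vhd output, anything this returns is almost certainly safe to delete even while the builder is running, matching clarkb's note above.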
*** dviroel is now known as dviroel|brb | 19:50 | |
clarkb | I'm glad I decided to rerun the audit. There is at least one account that had gone from inactive for the last three years to active (not sure it was on the cleanup list yet, but there was certainly enough churn to make double checking a good idea) | 20:13 |
fungi | yep | 20:14 |
clarkb | doesn't look like it was on the chopping block (good means that my methods are not completely terrible) | 20:15 |
clarkb | but now I have a pretty good indication I can put the other account related to this user on the chopping block. However I was going to save those for when we got to the ~80 I think we have remaining and reach out to people about it first | 20:16 |
*** dviroel|brb is now known as dviroel | 20:34 | |
clarkb | fungi: fwiw I'm going through the existing proposals for my own peace of mind. I'm flagging any that seem more dangerous than others and I may ask you to take a look at those and double check them. If we want we can trim them out or if they look safe we can proceed with them. | 20:36 |
clarkb | Once I've gotten through this I'll push up files like we already have on review but with newer timestamps | 20:37 |
fungi | thanks | 20:55 |
*** dviroel is now known as dviroel|out | 21:03 | |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:24 |
opendevreview | Merged zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:45 |
clarkb | fungi: there are three files in review:~clarkb/gerrit_user_cleanups/notes/ audit-results.yaml.20210702 is the output of the audit, which you can refer to to see what data was used to make decisions. proposed-cleanups.20210702 is the list of accounts that we will retire, then later the email associated with the external id conflicts that will be deleted on the retired accounts. | 22:05 |
clarkb | And finally doublecheck.20210702 a subset of those in the previous file which I have identified as riskier because the other side of the conflict was somewhat recently used | 22:05 |
clarkb | if you can take a look at those files and doublecheck the double check list I think we're just about ready to retire the accounts identified in the proposed-cleanups.20210702 file | 22:06 |
fungi | lookin' | 22:06 |
clarkb | I probably won't do that today because the way that script is set up it takes a long time and I have to acknowledge use of my ssh key (though I could temporarily turn that off). But I should definitely be able to run that tuesday | 22:06 |
fungi | so 36 high-risk | 22:08 |
clarkb | ya and even then I think those are relatively low risk because for each of them its pretty clear which is used more recently | 22:08 |
clarkb | but if we are going to run into problems I suspect it would be with that set. Maybe they are using the second account in some way that is harder to measure for example | 22:09 |
fungi | low-high-risk ;) | 22:09 |
fungi | huh... i only just noticed that the poetry readme uses oslo.utils as its example of a challenging dep solver problem: https://pypi.org/project/poetry/ | 22:55 |
opendevreview | arkady kanevsky proposed opendev/irc-meetings master: Changed Interop WG meeting time for the summer 2 hours earlier. https://review.opendev.org/c/opendev/irc-meetings/+/799337 | 23:44 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!