Friday, 2022-07-15

opendevreview	Clark Boylan proposed openstack/diskimage-builder master: Add Rockylinux 9 build configuration and update jobs for 8 and 9 https://review.opendev.org/c/openstack/diskimage-builder/+/848901	00:03
clarkb	fungi: I'm going to need to pop out for evening things momentarily. Happy to help with the cinder stuff in the morning. I think we'll be fine until then	00:04
fungi	yeah, i'm contemplating adding the new volume this evening or waiting for friday	00:09
Clark[m]	Tomorrow works for me	00:11
fungi	yeah, that's probably better anyway	00:14
*** rlandy\|bbl is now known as rlandy\|out		01:10
*** dviroel\|rover\|afk is now known as dviroel\|rover		01:17
*** ysandeep\|out is now known as ysandeep		01:37
*** join_subline is now known as \join_subline		01:47
opendevreview	Merged opendev/system-config master: production-playbook logs : don't use ansible_date_time https://review.opendev.org/c/opendev/system-config/+/849784	01:54
opendevreview	Merged opendev/system-config master: production-playbook logs : move to post-run step https://review.opendev.org/c/opendev/system-config/+/849785	01:54
opendevreview	OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/849909	02:25
ianw	i don't think run-production-playbook-post.yaml is working; but it also doesn't error	02:33
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook-post: add bridge for playbook https://review.opendev.org/c/opendev/system-config/+/849919	02:48
*** ysandeep is now known as ysandeep\|afk		03:45
opendevreview	Merged opendev/system-config master: run-production-playbook-post: add bridge for playbook https://review.opendev.org/c/opendev/system-config/+/849919	03:48
ianw	hrm, i guess the periodic jobs that are running have checked themselves out on a repo prior to ^	05:20
ianw	i'll merge the limnoria change and watch that. it's quiet now, no meetings	05:21
*** ysandeep\|afk is now known as ysandeep		06:16
ianw	oh interesting, system-config-run-eavesdrop timed_out in gate	06:21
ianw	perhaps it was actually during log upload?	06:24
ianw	it looks like it ran ok	06:24
ianw	https://zuul.opendev.org/t/openstack/build/bcc13270b9f14dd28af70882949b6579/logs	06:24
ianw	one data point ... this has an ara report too. the one we looked at the other day did as well	06:27
ianw	that's a lot of small files ...	06:27
ianw	that wouldn't really describe the infra-prod timeouts though; they don't have ara reports	06:41
ianw	(only the system-config-run ones do)	06:41
*** chandankumar is now known as chkumar\|rover		06:59
opendevreview	Merged opendev/system-config master: Install Limnoria from upstream https://review.opendev.org/c/opendev/system-config/+/821331	07:41
opendevreview	lixuehai proposed openstack/diskimage-builder master: Fix build rockylinux baremetal image error https://review.opendev.org/c/openstack/diskimage-builder/+/849947	07:46
*** rlandy\|out is now known as rlandy		10:31
opendevreview	Tobias Henkel proposed zuul/zuul-jobs master: Allow overriding the buildset registry image https://review.opendev.org/c/zuul/zuul-jobs/+/849989	11:04
opendevreview	Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/849909	11:21
fungi	ianw: if they were timing out during log collection/upload, the build result would have been post_timeout	11:25
*** dviroel\|rover is now known as dviroel		11:28
*** ysandeep is now known as ysandeep\|afk		11:56
ianw	fungi: yeah, this is true. it seems to be something else going on with the ansible on the bastion host just ... stopping	12:04
opendevreview	James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996	12:18
opendevreview	James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996	12:19
opendevreview	James Page proposed openstack/project-config master: Add OpenStack K8S charms https://review.opendev.org/c/openstack/project-config/+/849996	12:21
fungi	trying to dig the reason for the post_failure on https://zuul.opendev.org/t/openstack/build/69dc9e01301848a9a8a3f45fd952a24b out of the debug log on ze10 and not having much luck	13:27
fungi	no ansible tasks are returning failed in the post phase	13:28
Clark[m]	fungi: very likely related to splitting the log management into post-run for those jobs. But I'm not sure why it would report failure if no tasks fail	13:31
Clark[m]	Perhaps related to the inventory fix here: https://review.opendev.org/c/opendev/system-config/+/849919 though as noted there if no nodes match inventory you just skip successfully	13:31
*** dasm\|off is now known as dasm\|ruck		13:33
fungi	oh. perhaps. i'll look for that signature	14:02
fungi	2022-07-15 12:22:17,931 DEBUG zuul.AnsibleJob.output: [e: 3ae9a107a3364e86bbfc5d0c7e59c499] [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible output: b'changed: [localhost] => {"add_host": {"groups": [], "host_name": "bridge.openstack.org", "host_vars": {"ansible_host": "bridge.openstack.org", "ansible_port": 22, "ansible_python_interpreter": "python3", "ansible_user": "zuul"}}, "changed":	14:04
fungi	true}'	14:04
fungi	seems like it ran (that was the output for "TASK [Add bridge.o.o to inventory for playbook name=bridge.openstack.org, ansible_python_interpreter=python3, ansible_user=zuul, ansible_host=bridge.openstack.org, ansible_port=22]")	14:04
Clark[m]	And did the subsequent tasks in that playbook manage to run with the updated inventory?	14:08
fungi	the play recap only mentions localhost	14:15
fungi	maybe you can't update a playbook's inventory from within the playbook itself?	14:16
*** ysandeep\|afk is now known as ysandeep		14:16
fungi	though that would be silly since it would mean there's no point in ansible's add_host task	14:17
fungi	there are some "no hosts matched" warnings toward the end of the log	14:20
Clark[m]	That should be how the run playbook functions	14:21
Clark[m]	Maybe something about the new add host is different and missing bits?	14:21
fungi	after "TASK [Register quick-download link]" is recapped showing it applied to localhost, it starts to run the cleanup playbook and says "[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'"	14:23
fungi	i think the failure must be prior to that though	14:25
fungi	i just can't seem to find it	14:26
fungi	as soon as the openstack release meeting wraps up in a few minutes, i need to run a couple of quick errands (probably 30 minutes) and then i'll get started adding the new cinder volume to review.o.o and migrating onto it	14:38
Clark[m]	Sounds good. I can look at that post failure more closely after I've fully booted my morning	14:38
fungi	it's probably not urgent, the job seems to have done the deployment things it was supposed to and collected/published logs, just ended with an as-of-yet inexplicable post_failure	14:41
fungi	popping out for quick errands now, bbiab	14:45
clarkb	I wonder if these timeouts are related to the problem albin is debugging with ansible and glibc	14:49
clarkb	https://github.com/ansible/ansible/issues/78270 is the upstream bug report albin filed	14:49
clarkb	we wouldn't have seen them before with ansible 2.9 I think but now we default to ansible 5	14:49
corvus	clarkb: i think Albin Vass said it was happening in older versions too? (iirc, an earlier disproved hypothesis was that ansible 5 might fix it)	15:09
corvus	clarkb: but the recent container bump to 3.10 almost certainly changed our glibc, right?	15:10
clarkb	corvus: no the 3.10 image and the 3.9 image are both based on debian bullseye and should have the same glibc	15:11
corvus	ah	15:12
*** dviroel is now known as dviroel\|lunch		15:16
clarkb	fungi: [build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible timeout exceeded: 1800	15:19
clarkb	its the timeout issue that ianw was looking at. In post run we run every playbook I think so the failure happened before you were looking. It happened in the encrypt steps. Maybe a lack of entropy?	15:20
clarkb	I think one possibility is the glibc problem albin discovered (maybe ansible 5 trips it far more often?) or something related to the leaking zuul console log files in /tmp	15:24
clarkb	corvus: ^ re the leaking console log files is that a side effect of not running the console log streamer daemon?	15:24
clarkb	corvus: we should be able to safely delete those files I think, but also i wonder if we can have a zuul option to not write them in the first place?	15:24
clarkb	just doing a random sampling of those files they all appear to be ansible task text that would end up in the console log. I'm pretty confident we can just delete the lot of them	15:26
*** ysandeep is now known as ysandeep\|out		15:28
AlbinVass[m]	clarkb: we had two different issues, one in Ansible 2.9 getting stuck in the pwd library (i believe), and Ansible 5 getting stuck in grantpty with glibc 2.31 which is installed in the official zuul images (v6.1)	15:35
clarkb	AlbinVass[m]: ok, that is good to know. I'm not sure that this problem is related to what you've seen but the symptoms seem similar. The job just seems to stop and then eventually timeout	15:36
clarkb	AlbinVass[m]: I'd be willing to land your updated glibc change and see if it helps :)	15:37
clarkb	let me go and review it now while it is fresh	15:38
AlbinVass[m]	The only reason we detected it was because we had jobs hanging forever in cleanup :)	15:38
clarkb	ya that is similar to what we are seeing now	15:39
fungi	clarkb: thanks! i always forget to search for task timeouts, which should be the first thing i think of when i can't find any failed tasks (because the timed out task never gets a recap)	15:39
*** rlandy is now known as rlandy\|brb		15:42
clarkb	AlbinVass[m]: ok left some comments on that change. I think it is basically there though and worth a try	15:46
clarkb	Otherwise we may want to undo the ansible 5 default update? or override it for system-config jobs at least	15:46
clarkb	it looks like just about every infra-prod job is hitting those post failures now	15:48
clarkb	I don't think that is the end of the world while we work through the gerrit thing. But be aware of that if landing changes to prod as they may not deploy as expected	15:48
clarkb	I think we should finish up the gerrit thread then look at cleaning up /tmp on bridge to see if that helps then possibly do AlbinVass[m]'s zuul image update or force ansible 2.9 on the infra-prod jobs. I'll push up a change for infra-prod jobs now so that it is ready if we wish to do that	15:49
clarkb	actually nevermind lets try the /tmp cleanup first	15:49
fungi	clarkb: you noted the timeout was after the file encryption tasks, maybe we're consistently starving that machine for entropy when repeatedly running those?	15:51
clarkb	fungi: yes that was another thought	15:51
fungi	could be entirely unrelated to our ansible version	15:51
clarkb	is there a way to check that?	15:51
fungi	we could report the size of the entropy pool before tasks we think are likely to try to use it	15:52
fungi	or we could check it with cacti or something	15:52
clarkb	`find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30` for a listing of what we might want to delete	15:53
clarkb	/proc/sys/kernel/random/entropy_avail appears to report good numbers currently	15:54
fungi	fungi@bridge:~$ cat /proc/sys/kernel/random/entropy_avail	15:54
fungi	3413	15:54
fungi	yeah, not bad for now	15:54
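A minimal sketch of the "report the size of the entropy pool before tasks" idea from above. The function name `report_entropy`, the `ENTROPY_FILE` override, and the label are illustrative assumptions; only the `/proc/sys/kernel/random/entropy_avail` path comes from the discussion.

```shell
#!/bin/sh
# Path taken from the discussion; the override exists only so the sketch
# can be tried against a plain file (an assumption, not part of any job).
ENTROPY_FILE="${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}"

# Print the kernel's available-entropy estimate with a caller-supplied label,
# so the value can be logged just before a task suspected of blocking.
report_entropy() {
    label="$1"
    if [ -r "$ENTROPY_FILE" ]; then
        printf '%s entropy_avail=%s\n' "$label" "$(cat "$ENTROPY_FILE")"
    else
        printf '%s entropy_avail=unknown\n' "$label"
    fi
}

# Hypothetical label for the step that appeared to hang.
report_entropy "before-encrypt-file"
```

Logging the value at the suspect step, rather than sampling it by hand afterwards, would show whether the pool was actually low at the moment the task stalled.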
fungi	bigger question is whether we're doing more key generation and similar tasks on there during jobs recently	15:55
fungi	the server seems to have a reasonable amount of available entropy, but the jobs could have started demanding an unreasonable amount	15:56
clarkb	I don't think the total has changed. We only shifted where it ran (that was in response to this problem too so the issue was preexisting)	15:57
*** marios is now known as marios\|out		15:57
clarkb	that is the main reason why I suspected ansible 5 + glibc since it was a recent change	15:57
clarkb	basically if we look at what changed ansible 5 is it and AlbinVass[m] has evidence of a very similar problem happening for them	15:57
fungi	yeah, i agree that's a stronger theory	15:58
clarkb	I think we can try something like `find /tmp -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` if others agree those files look safe to prune. But don't expect that to help	15:59
clarkb	(good cleanup either way)	16:00
clarkb	But also removing the old index back up on review and expanding available disk there might be the higher priority? There are too many things and I don't want to try and do too many things at once :)	16:00
clarkb	actually you know what why don't we just flip infra-prod to ansible 2.9 again. its a dirt cheap thing to change and we can toggle it back too	16:03
clarkb	but that would give us a really good idea if it is related to ansible 5. change incoming	16:03
fungi	sure	16:03
fungi	sounds good to me	16:03
*** rlandy\|brb is now known as rlandy		16:05
opendevreview	Clark Boylan proposed opendev/system-config master: Force ansible 2.9 on infra-prod jobs https://review.opendev.org/c/opendev/system-config/+/850027	16:07
fungi	i count 172663 console logs matching the glob pattern suggested above	16:08
clarkb	ya there are just over 200k if you drop the mtime filter	16:08
fungi	all the glob matches also match the regex ^/tmp/console-[^/]*-bridgeopenstackorg.log$	16:09
fungi	so i agree that looks safe	16:09
clarkb	fungi: do you agree the content of those files isn't worth keeping after 30 days?	16:10
clarkb	I'm not even sure it is worth keeping 30 days of them but figure we can start there	16:10
clarkb	I guess the other thing we can do is limit the depth on the search so that subdirs that happen to have the same filename pattern don't get files deleted but that also seems unlikely	16:12
fungi	yeah, spot-checking some at random, the contents don't really seem that useful to begin with	16:12
clarkb	`find /tmp -maxdepth 1 -name 'console-*-bridgeopenstackorg.log' -mtime +30 -delete` ?	16:13
fungi	yes, that still works	16:16
fungi	and is slightly safer, i suppose	16:16
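A minimal sketch of the pruning command above, wrapped in a function and exercised against a scratch directory instead of the real /tmp. The function name `prune_console_logs` and the sample file names are illustrative assumptions, and `touch -d` assumes GNU coreutils.

```shell
#!/bin/sh
# Delete console logs older than N days from the top level of a directory.
# Same flags as the command in the discussion, with the directory and age
# passed in as arguments.
prune_console_logs() {
    dir="$1"
    days="${2:-30}"
    find "$dir" -maxdepth 1 \
        -name 'console-*-bridgeopenstackorg.log' \
        -mtime "+$days" -delete
}

# Try it on throwaway files (names are made up for the demonstration).
scratch="$(mktemp -d)"
mkdir "$scratch/subdir"
touch -d '45 days ago' "$scratch/console-aaaa-bridgeopenstackorg.log"         # old, top level: pruned
touch -d '45 days ago' "$scratch/subdir/console-bbbb-bridgeopenstackorg.log"  # old, nested: kept by -maxdepth 1
touch "$scratch/console-cccc-bridgeopenstackorg.log"                          # recent: kept by -mtime
touch -d '45 days ago' "$scratch/other.log"                                   # old, other name: kept by -name

prune_console_logs "$scratch" 30
ls -R "$scratch"
```

Running it against a scratch directory first is a cheap way to confirm that `-maxdepth 1` protects same-named files in subdirectories before pointing the command at a live host.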
clarkb	ya as there are some subdirs. I guess I'll spin up a screen and get that ready to go?	16:16
clarkb	fungi: screen started if you want to join and help double check	16:17
clarkb	should I add a -print?	16:17
clarkb	I think I've decided that is going to be too much noise with -print	16:18
AlbinVass[m]	mind that we had other issues with ansible 2.9, I'm not sure why I haven't seen those before because the root cause for the deadlocks are the same (multithreaded process using fork)	16:26
fungi	clarkb: sorry, stepped away for a moment, but i'm in the screen session now and that looks great	16:28
AlbinVass[m]	* seen those when running zuul with ansible 2.9 before because	16:28
fungi	clarkb: and i agree, -print is worthless for this unless you're just trying to see if it's hung	16:28
clarkb	fungi: ok I'll run that now	16:30
fungi	thanks	16:31
clarkb	its done. Still 38k entries for the last 30 days but much cleaner	16:31
clarkb	and ya I don't expect that to help. But maybe it will	16:31
fungi	yeah, that looks reasonable	16:31
*** dviroel\|lunch is now known as dviroel		16:32
clarkb	AlbinVass[m]: ya at this point I think its fine for us to do some process of elimination. We ran under 2.9 for a while and it was fine there for us iirc	16:33
clarkb	AlbinVass[m]: could be a timing issue or one environment being more likely to trip specific issues than others	16:33
corvus	clarkb: nothing deletes the zuul console log files; so if we're talking bridge, we should have some kind of tmp cleaner	16:59
fungi	corvus: bridge, correct	17:00
fungi	these are in /tmp on bridge.o.o	17:00
fungi	also /tmp was nowhere near filling up from a block or inode standpoint	17:01
opendevreview	Jeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept https://review.opendev.org/c/opendev/git-review/+/850054	17:28
clarkb	corvus: thank you for confirming I'll work on putting something together for that I guess	17:35
opendevreview	Clark Boylan proposed opendev/system-config master: Add cronjobs to cleanup Zuul console log files https://review.opendev.org/c/opendev/system-config/+/850059	17:48
clarkb	fungi: I'm going to pop out soon for lunch, but let me know if I can be helpful re gerrit and I'll adjust timing as necessary	18:20
opendevreview	Jeremy Stanley proposed opendev/git-review master: Clarify that test rebases are not kept https://review.opendev.org/c/opendev/git-review/+/850054	18:26
opendevreview	Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061	18:26
fungi	clarkb: thanks, i'm switching to that now, since i'm done with the git-review distraction for the moment	18:27
fungi	[Fri Jul 15 18:42:54 2022] virtio_blk virtio5: [vdc] 2147483648 512-byte logical blocks (1.10 TB/1.00 TiB)	18:46
fungi	/dev/vdc1 lvm2 --- <1024.00g <1024.00g	18:47
fungi	i have `pvmove /dev/vdb1 /dev/vdc1` running in a root screen session on review.o.o now	18:48
fungi	that's after creating and attaching the cinder volume, creating a partition table on it with one large partition, writing a pv signature to the partition, and adding that pv to the existing "main" vg on the server	18:50
fungi	i did make sure it was set as --type=nvme too	18:50
fungi	(openstack volume show confirms it's the case too)	18:51
fungi	block migration is already 13% done, so it's going quickly	18:51
fungi	hopefully the additional i/o isn't impacting gerrit performance too badly	18:51
clarkb	cool. I think where I always get confused is the relationship between physical volumes, volume groups, and logical volumes	18:51
clarkb	physical volumes are aggregated into volume groups to provide logical volumes that may be larger than any one physical volume. Is that basically it?	18:52
fungi	[lvx][--lvy--][-lvz]	18:52
fungi	[-------vg0--------]	18:53
fungi	[---pva---][--pvb--]	18:53
clarkb	ok ya that helps	18:53
fungi	er, except for my indentation on that last line	18:53
fungi	but yeah, a vg is just an aggregation of whatever block devices the kernel knows about, then you create logical volumes within a vg	18:54
clarkb	and in this case we do the pvmove so that we can remove the other physical volume later. But then we have to expand the lv to stretch it out to match the larger size. Then after that the fs itself	18:54
fungi	optional but yes	18:55
fungi	the process is basically 1. add a new pv to the vg, 2. move the extents for the lv from the old pv to the new pv, 3. remove the old pv from the vg	18:55
fungi	then later you can also 4. increase the size of the lv, 5. resize the fs on that lv, 6. profit	18:56
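The six steps above can be sketched as a dry-run plan that only prints the commands. The function name `plan_pv_migration` is hypothetical. The device, vg, and lv names are the ones used in this session. `pvcreate` and `vgextend` are an assumed reading of "writing a pv signature" and "adding that pv to the vg"; the other commands appear in the log.

```shell
#!/bin/sh
# Print (do not run) the LVM commands for migrating a logical volume from an
# old physical volume to a new one and then growing into the new space.
# Arguments: old pv, new pv, volume group, logical volume.
plan_pv_migration() {
    old_pv="$1"; new_pv="$2"; vg="$3"; lv="$4"
    # Steps 1-3: add the new pv, move the extents, drop the old pv.
    # Steps 4-5: grow the lv into the free extents, then resize the fs.
    cat <<EOF
pvcreate $new_pv
vgextend $vg $new_pv
pvmove $old_pv $new_pv
vgreduce $vg $old_pv
pvremove $old_pv
lvextend --extents=+100%FREE $vg/$lv
resize2fs /dev/$vg/$lv
EOF
}

plan_pv_migration /dev/vdb1 /dev/vdc1 main gerrit
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere. On a real server each command would be run as root and checked with `pvs`, `vgs`, and `lvs` before moving to the next.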
clarkb	fungi: are vdb and vdc pvs or lvs?	18:56
clarkb	(they are pvs I think?)	18:57
fungi	we could also instead have done 1. add a new pv to the vg, 2. increase the size of the lv, 3. resize the fs	18:57
fungi	pv (so-called "physical volume") is the kernel block devices for the disks, so vdb and vdc in this case	18:57
clarkb	but then we would have to keep both cinder volumes attached (which I think is less ideal long term)	18:57
fungi	lv is the "logical volume" which spans one or more physical volumes within the volume group	18:58
clarkb	and they are located under the /dev/mapper paths	18:59
fungi	and yes, the second option i described would have been more if we just wanted to incrementally add to the existing physical volume set rather than migrating data from one to another. in the real world where you may be plugging another actual hunk of spinning rust into a server that's more typical, but with the virtual block devices in a cloud it usually makes more sense to migrate data to a	19:00
fungi	larger pv where possible	19:00
fungi	also some people simply don't know about pvmove and so don't realize it's possible to do this	19:00
clarkb	fungi: last question, why do you need to add it to the volume group if we're essentially cloning an existing member of the vg onto the new pv? I guess just a bookkeeping activity?	19:02
fungi	it's just how lvm is designed. the extents of an lv can be mapped to any pv in their vg. when you do a pvmove it's taking advantage of the copy-on-write/snapshot capabilities in lvm to update the address for an extent to a new location and then cleaning up the old one, so during any arbitrary point in the pvmove that lv is spread across the old and new pvs	19:04
clarkb	got it	19:05
fungi	since an lv can already span multiple pvs, it's almost a side effect of the other features	19:05
clarkb	right	19:05
fungi	but to have an lv spread across multiple pvs, they have to be in its vg, hence the initial step of adding the new pv to the vg	19:06
fungi	technically we're "extending" the vg to include the new pv, and then later "reducing" the vg off of the old pv once there are no longer any lv extents in use on it	19:07
clarkb	makes sense	19:07
fungi	but since it's all happening underneath the lv abstraction, it's essentially invisible to the filesystem activity	19:08
fungi	just (possibly lots of) added i/o activity for those devices while in progress	19:09
clarkb	once the pvmove is done we could theoretically expand to ~1.25TB on the lv since the other pv will still be in the vg right? So you need to be careful with maths or remove the other pv first?	19:10
fungi	yeah, that's why i suggested removing the old pv first before resizing the lv	19:10
fungi	at the moment we have .25tb in use and 1tb free out of 1.25tb in the vg, so as long as i don't tell the lv to use more space before i remove the old pv, we'll be down to .25tb used and .75tb free out of 1tb in the vg, then i'll lvextend the lv to use the additional space and resize the fs after that	19:12
fungi	pvmove is done now, and if you look at the pvs output /dev/vdb1 has 250.00g of 250.00g free, while /dev/vdc1 has 774.00g of 1024.00g free	19:13
clarkb	I'll trust you on that (as I'm juggling lunch right now)	19:13
fungi	so this is the point where i vgreduce off of /dev/vdb1	19:13
clarkb	but ya that all makes sense. The hard part is retaining this info for the future	19:13
fungi	we have most of the commands documented at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management	19:14
fungi	but also the lvm manpage is helpful	19:14
clarkb	When we do the fs resize does that add more inodes (in theory we need them?)	19:15
clarkb	iirc it will add them at the same ratio of blocks to inodes that the existing fs has and we do get new ones but we can't change the ratio?	19:15
fungi	i think that depends on the fs, but ext4 should increase the inode count	19:15
fungi	if memory serves, ext3 did not increase inode count relative to block count	19:15
fungi	now i've done `vgreduce main /dev/vdb1` to take the old pv out of our "main" vg	19:16
fungi	and `pvremove /dev/vdb1` to wipe the pv signature from the partition	19:16
fungi	so `pvs` currently reports only one pv on the server, which is a member of the main vg, with 774.00g free	19:17
fungi	and `vgs` similarly reports the vg has 774.00g available	19:18
fungi	now to detach the old volume, which is always the iffy part of this process	19:18
*** tobias-urdin is now known as tobias-urdin_pto		19:19
fungi	i used `openstack server remove volume "review02.opendev.org" "review02.opendev.org/gerrit01"` and after a minute it returned. now `openstack volume list` reports review02.opendev.org/gerrit01 is available rather than in-use	19:21
clarkb	and /dev on review02 probably doesn't show that device anymore?	19:21
fungi	correct, we have vda and vdc but nothing between	19:22
fungi	interestingly, nothing gets logged in dmesg during that hot unplug action	19:22
fungi	and now i've done `openstack volume delete review02.opendev.org/gerrit01` so we're cleaned up on the cloud side of things	19:23
fungi	next is the lv and fs resizing	19:23
fungi	i've run `lvextend --extents=100%FREE main/gerrit` which tells it to grow the gerrit lv into the remaining available extents of the main vg	19:25
fungi	"Size of logical volume main/gerrit changed from <250.00 GiB (63999 extents) to 774.00 GiB (198144 extents)."	19:26
clarkb	shouldn't it be 1TB instead of 774?	19:26
fungi	d'oh, yep!	19:27
fungi	i missed a +	19:27
fungi	thanks for spotting that	19:27
fungi	lvextend --extents=+100%FREE main/gerrit	19:27
fungi	Size of logical volume main/gerrit changed from 774.00 GiB (198144 extents) to <1024.00 GiB (262143 extents).	19:27
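A worked check of why the missing `+` produced 774 GiB, using the extent counts lvextend printed above. The 4 MiB extent size is inferred from "774.00 GiB (198144 extents)", and the variable names are illustrative.

```shell
#!/bin/sh
extent_mib=4        # inferred: 198144 extents * 4 MiB = 774 GiB
total=262143        # extents in the vg (final lv size, with 0 free)
original=63999      # gerrit lv before resizing (<250.00 GiB)

free=$((total - original))                      # 198144 extents free

# --extents=100%FREE is absolute: the lv size becomes the free extent count.
after_absolute=$free                            # 198144
free_left=$((total - after_absolute))           # 63999 still unallocated

# --extents=+100%FREE is relative: the remaining free extents are added.
after_relative=$((after_absolute + free_left))  # 262143

echo "100%FREE  -> $after_absolute extents ($((after_absolute * extent_mib / 1024)) GiB)"
echo "+100%FREE -> $after_relative extents ($((after_relative * extent_mib)) MiB)"
```

The final size is 4 MiB short of 1 TiB (1048576 MiB), which is consistent with lvextend printing "<1024.00 GiB".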
clarkb	out of curiosity would it have errored if the 100%FREE was less than the current consumed size?	19:28
clarkb	(or did we just get really lucky?)	19:28
fungi	it would have errored, yes. lvextend can only increase. you have to use lvreduce to decrease	19:28
clarkb	the command is extend so ya	19:28
clarkb	makes sense	19:28
fungi	now pvs and vgs show 0 free blocks, and lvs shows the gerrit lv is 1tb	19:28
corvus	glad they retired lvbork	19:29
fungi	the swedish chef was bummed though	19:29
fungi	resize2fs /dev/main/gerrit	19:30
fungi	The filesystem on /dev/main/gerrit is now 268434432 (4k) blocks long.	19:30
fungi	Filesystem Size Used Avail Use% Mounted on	19:30
fungi	/dev/mapper/main-gerrit 1007G 220G 788G 22% /home/gerrit2	19:30
clarkb	any idea if the inode count changed with that expansion?	19:31
fungi	#status log Replaced the 250GB volume for Gerrit data on review.opendev.org with a 1TB volume, to handle future cache growth	19:31
opendevstatus	fungi: finished logging	19:32
clarkb	(I don't really expect inodes to be a problem considering why we expanded. Mostly just curious)	19:32
fungi	df -i says we have 1% of inodes in use on that fs	19:32
clarkb	cool. I feel like I have learned things	19:32
fungi	so while i think it did increase available inodes x4, it's unlikely to matter	19:33
fungi	now i just need to do two more of these on the ord backup server and one on the ord afs server before the end of the month	19:33
fungi	(for upcoming announced storage maintenance in that region)	19:34
clarkb	cacti reflects the changes too fwiw	19:34
fungi	awesome	19:34
fungi	i've closed out my root screen session on review.o.o now	19:38
opendevreview	Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061	19:39
clarkb	and now we monitor the cache disk usage	19:39
clarkb	I marked the change to disable the caches as WIP so we know that is a fallback we aren't ready for yet	19:39
fungi	thanks	19:43
*** dviroel is now known as dviroel\|biab		20:09
opendevreview	Jeremy Stanley proposed opendev/git-review master: Don't keep incomplete rebase state by default https://review.opendev.org/c/opendev/git-review/+/850061	20:23
*** dviroel\|biab is now known as dviroel		21:08
*** dasm\|ruck is now known as dasm\|off		21:37
*** dviroel is now known as dviroel\|out		21:37
clarkb	fungi: dpawlik1 can we abandon https://review.opendev.org/c/opendev/system-config/+/833264 ?	21:40
fungi	yeah, there's info at https://governance.openstack.org/sigs/tact-sig.html#opensearch and https://docs.opendev.org/opendev/infra-manual/latest/developers.html#automated-testing has been cleaned up. https://docs.openstack.org/project-team-guide/testing.html#automatic-test-failure-identification probably does need updating though	21:45
fungi	i've abandoned it, thanks for the reminder	21:47
clarkb	thanks	21:47
fungi	sure thing	21:50
ianw	clarkb: thanks for looking into that. the 2.9 check would probably be good	22:24
clarkb	ianw: if 2.9 is better then I strongly suspect AlbinVass[m] did all the work :)	22:25
ianw	i've approved it and will check in on it after the weekend	22:25
ianw	deadlock sounds like it. i poked at all the processes i could see, and nothing seemed to be doing or waiting for anything obvious	22:26
clarkb	One thing I wonder about is whether or not we should revert the changes to the run playbook. But I think this is still strictly an improvement to separate the logging from the actions	22:26
ianw	i guess that was a response to the timeouts i had already randomly seen -- so it was happening before that	22:27
clarkb	yes I think it was happening before, just more reliably now	22:27
ianw	i'm not sure the encryption jobs really require entropy, as they're using pre-generated keys. but it's a thread to pull	22:28
ianw	i'd also believe something about gpg agents, sockets, <insert hand wavy actions here>...	22:29
clarkb	ianw: my suspicion is that the new order of operations just happens to trip the deadlock more reliably	22:29
clarkb	it was happening before less reliably	22:30
clarkb	but now we've managed to align timing/whatever to make it happen 100% of the time	22:30
opendevreview	Merged opendev/system-config master: Force ansible 2.9 on infra-prod jobs https://review.opendev.org/c/opendev/system-config/+/850027	22:30
ianw	oh annoying, the POST failure is different	22:32
corvus	clarkb: to be clear -- 850027 is temporary for data collection?	22:32
clarkb	corvus: yes	22:32
ianw	https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&skip=0	22:32
clarkb	corvus: if that makes it better then I think we should strongly consider AlbinVass[m]'s docker image update	22:33
corvus	when will we have data?	22:33
clarkb	corvus: jobs will be triggered in 27 minutes in the deploy pipeline	22:33
corvus	and one buildset is sufficient?	22:34
clarkb	corvus: yes I think so. we had a 100% failure on the last hour's buildset	22:34
clarkb	er sorry its the opendev-prod-hourly pipeline not deploy	22:34
clarkb	but ya runs hourly. Last hour was 100% failure. In ~26 minutes the first job that indicates whether or not this is happier should run	22:35
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars https://review.opendev.org/c/opendev/system-config/+/850077	22:36
clarkb	ianw: oh does that need to land before we get consistent results?	22:36
clarkb	and/or should we just revert back to the way it was for simplicity then work from there?	22:37
ianw	i dunno, reverting back means we get no logs, which is also unhelpful	22:37
clarkb	ianw: well if 2.9 fixes it we would get logs	22:37
clarkb	ianw: I'm not sure your fix will fix it since this is a separate playbook run	22:38
clarkb	is _log_timestamp cached?	22:38
ianw	now i've sent that i'm wondering if the way we dynamically add bridge might make that not work	22:38
clarkb	ya I'm not sure that will work	22:38
ianw	it would work across playbooks, but i'm not sure with the dynamic adding of the host	22:39
ianw	hrm	22:39
clarkb	I think you can just get a new timestamp value in the post-run playbook	22:41
clarkb	since the two uses in the run playbook and the post-run playbook are detached from each other (though them lining up is a bit nice it isn't necessary)	22:42
ianw	yeah i can do that for now, and try passing the variable around later when it's all working	22:42
clarkb	I think if the jobs fail because that var is undefined that is a good indication that ansible 2.9 fixes the other problem at least	22:42
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook-post: use timestamp from hostvars https://review.opendev.org/c/opendev/system-config/+/850077	22:45
ianw	right, because they all got far enough to try renaming the file, rather than hanging	22:45
ianw	well actually, all those ran under 2.12?	22:46
clarkb	the example I looked at earlier today didn't get that far. It failed due to a timeout right around encrypting things and it ran under 2.12	22:47
clarkb	I guess I haven't looked at the others we may have 100% failure due to either one of the things	22:47
clarkb	corvus: ^ related we seem to report POST_FAILURE when we timeout in post-run	22:47
clarkb	ianw: ah yup seems at least some of the failures are due to the missing var. corvus I was mistaken I assumed 100% of the failures were due to the timeouts but that isn't the case so we won't know as quickly as I hoped	22:48
clarkb	(sorry I was juggling like 4 different issues this morning...)	22:49
ianw	right, i think we a) fix the missing var and b) i'll check in on it on monday with fresh eyes and see if we're still seeing timeouts	22:49
ianw	https://zuul.openstack.org/build/304cacd67c2e459cbb69e0af5b24963f was the job i was looking at last night that looked stuck (at the time, and eventually timed out)	22:49
ianw	for reference, https://paste.opendev.org/show/bvelHlomXJwkNpCmKO6r/ is the console log which i had open on it	22:51
ianw	i agree the last thing before the timeout was	22:51
ianw	2022-07-15 09:36:45.327346 \| TASK [encrypt-file : Check for existing key]	22:51
ianw	this means it actually ran all the playbook and has done everything up to setting up logs for storage	22:52
ianw	that is not a very exciting task -> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/encrypt-file/tasks/import-key.yaml#L3 ; just calls gpg --list-keys in a command: \| block	22:53
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook-post: generate timestamp https://review.opendev.org/c/opendev/system-config/+/850077	22:54
ianw	clarkb: ^ fixed that change typo to avoid confusion	22:54
clarkb	+2 thanks	22:55
clarkb	and ya once that one lands we should get much clearer info	22:55
ianw	i've run gpg --list-keys in a loop on bridge as "zuul" and haven't seen anything hang, so doesn't seem as obvious as gpg being silly	22:56
ianw	and last night, when i was poking, i did not see a gpg process attached to any of the ansible; all that was running was an "ansiballz" python glob thing afaics	22:57
corvus	clarkb: i'm not sure reporting post_failure when a post playbook times out is wrong?	23:11
corvus	the idea of post_failure is that it's saying "the main thing worked, but the post-run stuff did not"	23:12
corvus	ie, "your change is not broken, but your job cleanup is"	23:13
clarkb	corvus: would POST_TIMEOUT make sense though?	23:13
clarkb	similar to how we have TIMEOUT and POST_FAILURE	23:13
corvus	i guess if that's useful?	23:13
clarkb	I think for me its mostly to reduce confusion over an actual failure or a timeout in post since we have that in run	23:13
corvus	hrm	23:13
clarkb	in this case it would have distinguished between the two different failures we are currently seeing	23:14
corvus	i sort of disagree -- a post-run script should not timeout	23:14
corvus	yeah, but i mean that's just a happenstance of these 2 failures	23:14
corvus	a test run timeout tells you maybe your unit tests run too long	23:14
clarkb	ya I'll have to think about it a bit more. Current thought is that it is mostly surprising considering the differentiation elsewhere	23:15
clarkb	I think it may have tripped fungi up too when trying to determine why the job we debugged earlier today had failed	23:16
corvus	it does have some signal to it -- it tells you something like "your job broke in post run because your system is completely hosed" vs "your job broke in post run because it didn't produce a file that was expected"	23:16
corvus	but i personally am not convinced it's enough signal to make it a result so that you can scan the build reports for it	23:17
fungi	yeah, we've had a few different chronic situations recently which caused some task to timeout in post-run. i should just be more diligent about mining executor logs for "timeout" in addition to failed=[^0]	23:34
fungi	i think to check for task failures, but then forget that if there aren't any it could be because a task timed out and so we didn't get output from it	23:35
fungi	or even a play recap	23:35
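A minimal sketch of the log check described above: search one build's executor log lines for a failed task and for a task timeout in the same pass. The function name `scan_build_log` and the sample log lines are made up for illustration. Only the "Ansible timeout exceeded" message, the `[build: ...]` tag, and the `failed=[^0]` pattern come from the discussion.

```shell
#!/bin/sh
# Show lines for one build that indicate either a failed task in a play
# recap (failed=<non-zero>) or an Ansible timeout.
# Arguments: log file, build id.
scan_build_log() {
    log="$1"; build="$2"
    grep -F "build: $build" "$log" |
        grep -E 'failed=[^0]|Ansible timeout exceeded'
}

# Illustrative sample, shortened from the shape of real executor log lines.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible output: b'localhost : ok=3 changed=1 failed=0'
[build: 69dc9e01301848a9a8a3f45fd952a24b] Ansible timeout exceeded: 1800
EOF

scan_build_log "$sample" 69dc9e01301848a9a8a3f45fd952a24b
```

Matching both signatures at once avoids the trap described above, where a clean set of play recaps hides a task that timed out and never produced a recap.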
