*** ysandeep is now known as ysandeep|afk | 00:09 | |
*** dviroel|rover is now known as dviroel|out | 00:10 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 00:29 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 00:34 |
ianw | hrm, not sure how to test the upload-pypi role | 00:46 |
ianw | i can make a limited api key that can only update one project on test.pypi.org and we can assume that is public, and use it in zuul-jobs | 00:47 |
ianw | i mean to say test it automatically, rather than just a one-off manual approach | 00:48 |
ianw | i think the best approach might be to test upload-pypi in zuul jobs with an api key separately and manually, before committing. then we can have the switch ready in project-config and merge it just before we do something that will push to pypi like a dib release, and monitor it closely | 00:51 |
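(A note on the mechanics being added here: PyPI API tokens are supplied to twine as a password with the fixed username __token__. A minimal sketch of the upload the role would perform against test.pypi.org, with the token environment variable name purely illustrative:)
    # upload built artifacts to test.pypi.org with a project-scoped API token
    twine upload \
        --repository-url https://test.pypi.org/legacy/ \
        --username __token__ \
        --password "$TEST_PYPI_TOKEN" \
        dist/*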
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:04 |
fungi | maybe uploads of opendev/sandbox? | 01:11 |
fungi | though we may be missing a lot of the necessary bits for that | 01:12 |
*** ysandeep|afk is now known as ysandeep | 01:16 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 01:20 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:20 |
ianw | yeah, it would be a bit of a pain to make something that increases its version number on every gate check | 01:21 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:38 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:47 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 01:52 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 01:57 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:04 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:32 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 02:50 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 02:50 |
*** ysandeep is now known as ysandeep|afk | 03:19 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 04:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 04:27 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 04:27 |
*** ysandeep|afk is now known as ysandeep | 04:44 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 05:02 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 05:03 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: support API token upload https://review.opendev.org/c/zuul/zuul-jobs/+/849589 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-twine: make python3 default, ensure pip installed https://review.opendev.org/c/zuul/zuul-jobs/+/849598 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: basic testing https://review.opendev.org/c/zuul/zuul-jobs/+/849593 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 05:18 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:07 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:20 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [wip] test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 06:27 |
*** ysandeep is now known as ysandeep|afk | 06:44 | |
*** soniya is now known as soniya|ruck | 06:48 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 07:01 |
*** ysandeep|afk is now known as ysandeep | 07:40 | |
*** ysandeep is now known as ysandeep|lunch | 08:25 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 08:34 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: upload-pypi: test sandbox upload https://review.opendev.org/c/zuul/zuul-jobs/+/849597 | 08:53 |
*** anbanerj is now known as frenzy_friday | 09:17 | |
*** soniya|ruck is now known as soniya|ruck|lunch | 09:41 | |
*** soniya|ruck|lunch is now known as soniya|ruck | 10:09 | |
*** soniya|ruck is now known as soniya|ruck|afk | 10:11 | |
*** rlandy|out is now known as rlandy | 10:26 | |
*** ysandeep|lunch is now known as ysandeep | 10:40 | |
ianw | fungi: https://review.opendev.org/q/topic:upload-pypi-api is the base work for pypi api upload | 10:57 |
*** soniya|ruck|afk is now known as soniya|ruck | 11:07 | |
*** rlandy is now known as rlandy|rover | 11:15 | |
*** dviroel is now known as dviroel|rover | 12:12 | |
*** rlandy|rover is now known as rlandy | 12:23 | |
*** ysandeep is now known as ysandeep|afk | 12:59 | |
*** ysandeep|afk is now known as ysandeep | 13:31 | |
mnaser | infra-root: https://tarballs.opendev.org is returning forbidden | 14:27 |
fungi | looking | 14:28 |
fungi | may be an afs outage | 14:28 |
mnaser | thank you fungi ! | 14:28 |
fungi | apache throwing lots of kernel oopses in dmesg | 14:29 |
*** dasm|off is now known as dasm | 14:29 | |
fungi | [Wed Jul 13 13:15:38 2022] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server) | 14:29 |
mnaser | that'll do it | 14:30 |
mnaser | 104.130.138.161 is not pingable | 14:30 |
fungi | time reported by dmesg may also not be accurate so that may be more recent than an hour ago | 14:30 |
mnaser | so maybe afs could be the real issue here (unless that ip is not pingable by icmp) | 14:30 |
fungi | yeah, that's afs01.dfw.openstack.org | 14:30 |
fungi | trying to ssh into it now but it's hanging | 14:31 |
fungi | i'll check the oob console | 14:31 |
mnaser | is afs02 a replica for afs01 ? | 14:31 |
fungi | yes, for most things anyway | 14:31 |
mnaser | im wondering why it didnt fall back to that | 14:31 |
fungi | it did for some volumes, but doesn't seem to have for tarballs | 14:32 |
fungi | possible something is wrong/stuck with the replica for it | 14:32 |
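(A hedged aside on checking replica state: vos examine shows a volume's RW site, RO sites and last-release status, and vos listvldb lists what lives on a given fileserver. The volume name below is a guess, not taken from the log:)
    # does the tarballs volume have a current RO replica, and where?
    vos examine project.tarballs
    # which volumes are hosted on the unreachable fileserver?
    vos listvldb -server afs01.dfw.openstack.org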
fungi | for now i'm going to dig into what's happening to afs01.dfw though | 14:33 |
fungi | infra-root: ^ heads up, and also i have a conference call i have to jump to in 25 minutes, just fyi | 14:33 |
fungi | ticket from rackspace: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on alerted our monitoring systems at 2022-07-13T13:29:01.300633. We are currently investigating the issue and will update you as soon as we have additional information regarding what is causing the alert. | 14:36 |
mnaser | ah | 14:36 |
fungi | followup: This message is to inform you that the host your cloud server, afs01.dfw.openstack.org, resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts. | 14:36 |
fungi | that followup was stamped roughly an hour ago | 14:37 |
fungi | so i guess the instance didn't come back when the host rebooted | 14:37 |
fungi | yeah, the api reports it in an "error" state | 14:38 |
fungi | fault | {'message': 'Storage error: Reached maximum number of retries trying to unplug VBD OpaqueRef:6d2337f7-aa1d-46b3-5da6-209ac49fd06b', 'code': 500, 'created': '2022-04-28T20:06:53Z'} | 14:40 |
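(That status/fault pair is the sort of output the compute API returns; roughly, with the exact invocation assumed:)
    # inspect the instance state and the last fault nova recorded
    openstack server show afs01.dfw.openstack.org -c status -c fault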
mnaser | the date of that fault seems to show that its unrelated | 14:41 |
mnaser | (also wow what a throwback to see 'OpaqueRef', old school xenserver code) | 14:41 |
mnaser | afaik the nova api will let you hard reboot if vm is in error state | 14:41 |
fungi | afs01.dfw has four volumes in cinder, all in-use, none of which match that uuid | 14:43 |
mnaser | that is an internal uuid used by xenserver | 14:43 |
fungi | ahh | 14:44 |
fungi | so no clue which cinder volume it might be | 14:44 |
fungi | anyway, yeah, i'll try a hard reboot and hope we don't corrupt any filesystems | 14:44 |
fungi | fault | {'message': 'Failure', 'code': 500, 'created': '2022-07-13T14:45:08Z'} | 14:46 |
fungi | that's less than helpful | 14:46 |
fungi | putting it into shutoff for a minute and then trying a server start | 14:47 |
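(A sketch of that recovery sequence with the standard OpenStack CLI commands:)
    # a hard reboot will sometimes clear an ERROR state
    openstack server reboot --hard afs01.dfw.openstack.org
    # failing that, force it off and attempt a clean start
    openstack server stop afs01.dfw.openstack.org
    openstack server start afs01.dfw.openstack.org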
fungi | it went into shutoff state fine, but server start seems to be getting ignored now | 14:49 |
mnaser | i think the hypervisor feels borked :( | 14:49 |
fungi | yeah, i'll follow up on the ticket they opened about the host reboot | 14:50 |
fungi | in the meantime we can see if the read-only replica for tarballs can be brought online | 14:50 |
fungi | #status notice Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery | 14:51 |
opendevstatus | fungi: sending notice | 14:51 |
-opendevstatus- NOTICE: Due to an incident in our hosting provider, the tarballs.opendev.org site (and possibly other sites served from static.opendev.org) is offline while we attempt recovery | 14:52 | |
*** dviroel|rover is now known as dviroel|rover|biab | 14:53 | |
opendevstatus | fungi: finished sending notice | 14:54 |
fungi | infra-root: i've updated the ticket (220713-ord-0002114) and am awaiting further response from rackspace support | 14:55 |
fungi | i probably don't have time to dig into what's preventing failover for the tarballs volume to the read-only replica before my call in a few minutes, but can try to poke at it some. also we should disable afs volume releases in the meantime and work on doing a full switchover to afs02.dfw | 14:57 |
*** soniya is now known as soniya|ruck | 15:01 | |
*** ysandeep is now known as ysandeep|out | 15:04 | |
Clark[m] | I'm getting my morning started but need to do quick system updates first. | 15:12 |
Clark[m] | fungi: are we serving the RW path on static? | 15:12 |
jrosser | would this be related? https://mirror-int.dfw.rax.opendev.org/ubuntu/dists/bionic/universe/binary-amd64/Packages 403 Forbidden [IP: 10.209.161.66 443] | 15:16
clarkb | jrosser: likely yes | 15:21 |
clarkb | static's dmesg has a number of tracebacks involving afs after losing contact with the server. mirror.dfw does not | 15:24
clarkb | all three afsdb0X servers report they are running happily according to bos status so I'm not sure why failover wouldn't have happened except for maybe talking to RW paths instead of RO paths | 15:27 |
clarkb | or maybe the kernel tracebacks crashed afs hard enough to prevent failover on the client side? | 15:27
clarkb | looking at /var/www/mirror on mirror.dfw I think some volumes failed over and others did not | 15:29 |
clarkb | https://mirror.ord.rax.opendev.org/ubuntu/dists/bionic/universe/ has content so ya this may be ~luck of the draw on individual clients for handling failovers. | 15:29 |
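(For reference, the usual client-side nudges when a cache manager is stuck on a dead fileserver are roughly the following, assuming the Debian/Ubuntu service name and an illustrative path:)
    # ask the cache manager to re-probe which fileservers are up
    fs checkservers
    # drop cached data for a path that keeps returning failures
    fs flushvolume /afs/openstack.org/mirror/ubuntu
    # heavier hammer: restart the client outright
    sudo systemctl restart openafs-client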
clarkb | I'm trying to restart openafs on mirror.iad3.inmotion | 15:31 |
clarkb | it isn't going very quickly | 15:31 |
clarkb | ok that was slow but it didn't seem to break anything. I'll try that on mirror.dfw now | 15:33 |
clarkb | before I did that I simply navigated to the path on the fs and now it seems happy on dfw? | 15:35 |
clarkb | I wonder if we cached a failed lookup in apache and then apache stopped trying to hit the fs to refresh the failover? | 15:35
fungi | yeah, apache restarts might help, i suppose | 15:36 |
fungi | and sorry, confcall is pretty distracting | 15:36 |
fungi | but should be free again in ~25 minutes | 15:36 |
clarkb | ok ya I think mirror.dfw is good now simply by manually traversing the path on afs | 15:36 |
clarkb | I'll check the other mirrors first (I know that static is probably what people want more but I feel like I'm learning and mirrors are far less stressful) | 15:37 |
rlandy | hi ... Failed to fetch https://mirror.iad3.inmotion.opendev.org/ubuntu/dists/focal/main/binary-amd64/Packages 403 Forbidden [IP: 173.231.253.126 443] | 15:37 |
rlandy | mirror.iad3.inmotion.opendev.org seems to be the failing mirror now for us | 15:38 |
clarkb | rlandy: it is working now I think. Note timestamps and also links to failures are always useful. But ya I think that particular mirror as well as dfw is happy now | 15:38
clarkb | (it could be that failure occurred when I restarted openafs) | 15:38
rlandy | clarkb: thanks - will watch that | 15:39 |
clarkb | if we see failures after this point in time for mirror.dfw and mirror.iad3 let us know. And now I'm looking at the others | 15:39 |
rlandy | failures are probably from an hour back | 15:39 |
clarkb | mirror.mtl01.iweb appears happy | 15:40 |
clarkb | mirror.ord and mirror.iad as well. None of them have the tracebacks like static does | 15:40 |
clarkb | both ovh mirrors are similarly happy from what I see. No tracebacks either | 15:42 |
*** marios is now known as marios|out | 15:44 | |
clarkb | ya all the mirrors appear ok now based on filesystem listings against /var/www/mirror | 15:45 |
clarkb | none contain the dmesg tracebacks that static shows | 15:45 |
rlandy | thanks for checking | 15:47 |
clarkb | looks like tarballs is up too? I wonder if taking the unhappy fileserver down was what we needed to failover | 15:47 |
clarkb | fungi: ^ | 15:47 |
fungi | i can check in a few | 15:48 |
clarkb | I *think* we're in a good state now via failover to RO volumes on afs02.dfw | 15:48 |
clarkb | I think the next steps are likely going to be disabling any vos releases so that we don't possibly replicate corrupted RW volumes on afs01 to RO volumes on afs02 when 01 comes back (openafs likely protects against this but I'm not sure) | 15:49 |
clarkb | then we can bring back afs01 and convert its volumes to RO and switch 02 to RW then enable releases in the other direction? | 15:50 |
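(A hedged sketch of what that switchover could look like with vos; the volume name and partition are placeholders, and this assumes afs01's RW copies really are unusable:)
    # promote the RO copy on afs02 to RW (only when the original RW is lost)
    vos convertROtoRW -server afs02.dfw.openstack.org -partition vicepa -id mirror.ubuntu
    # later, re-add a replica site on afs01 and release to it
    vos addsite -server afs01.dfw.openstack.org -partition vicepa -id mirror.ubuntu
    vos release mirror.ubuntu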
fungi | also possible 01 came back up, i haven't checked yet | 15:54 |
clarkb | it doesn't ping and there is no ssh | 15:55 |
clarkb | anyway I didn't really have to do anything on the servers other than navigate their /afs/openstack.org/mirror and /afs/openstack.org/project paths and that seemed to make things happy. Either that or the shutdown of afs01 caused the afs db servers to finally notice it is down and fail over | 15:56 |
clarkb | I believe we are in a RO state with content being served. I've notified the release team to not do releases and updated the mailing list thread with this info | 15:56
clarkb | I'm going to take a break now as I haven't had breakfast yet and I have a bunch of email to catch up on after being out for a few days | 15:57 |
fungi | thanks! i'm freeing up again now for a bit, but will have an errand to run soon as well, so will see what i can get done on this in the meantime | 16:00 |
clarkb | fungi: I think holding locks/commenting out vos release cron jobs so that we control how, when and what syncs when afs01 is back is the next thing | 16:01 |
clarkb | and then it is probably just a matter of monitoring and seeing what rax says? I guess we could try booting a recovery instance to inspect why it is failing | 16:01 |
clarkb | But I really need food | 16:02 |
fungi | go eat! | 16:04 |
*** dviroel|rover|biab is now known as dviroel|rover | 16:11 | |
fungi | i've added mirror-update02.opendev.org to the emergency disable list | 16:15 |
fungi | i've also temporarily commented out all lines in the root crontab on that server | 16:16 |
clarkb | fungi: I think docs and tarballs etc are released via a cronjob elsewhere? Worth double checking | 16:23 |
fungi | those are handled by the release-volumes.py cronjob on that server, as far as i'm aware | 16:24 |
fungi | which runs every 5 minutes | 16:24 |
fungi | or did, until i commented it out | 16:24 |
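(The pause itself is just hand-editing root's crontab on mirror-update02, e.g.:)
    # list, then comment out the release-volumes.py / mirror-update entries
    sudo crontab -u root -l
    sudo crontab -u root -e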
fungi | we had separate mirror-update servers which reprepro and rsync mirroring was split between for a while, but that's been consolidated onto the newer server more recently | 16:25 |
clarkb | aha | 16:26 |
clarkb | looks like there is an update to our ticket? I'm not in a good place to login and check that yet | 16:26 |
clarkb | (I've got a post road trip todo list a mile long too :( ...) | 16:26 |
*** dviroel__ is now known as dviroel|rover|biab | 16:42 | |
fungi | i can only imagine | 16:44 |
fungi | the ticket updates were "The query regarding unable to boot afs01.dfw.openstack.org has been received. I am currently reviewing this ticket and I will update you with more information as it becomes available." followed by "I will now escalate to appropriate team for further review." | 16:46 |
fungi | so i guess we're waiting for an appropriate team | 16:47 |
clarkb | good to know it has been seen at least | 16:47 |
fungi | i need to go run some errands, but will make them as quick as possible. shouldn't be more than an hour i hope | 16:47 |
clarkb | I think we've done what we can until we hear back form them short of booting a recovery instance | 16:48 |
clarkb | and it is probably better to let them poke at it now that they have seen it | 16:48
fungi | yep | 16:50 |
*** dviroel|rover|biab is now known as dviroel|rover | 17:21 | |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 17:33 |
fungi | racker todd is my new hero! "the volume afs01.dfw.opendev.org/main03 eafb4d8d-19e2-453e-8657-013c4da7acb6 lost it's iscsi connection to the Compute host... Detaching and reattaching it did the trick." | 18:08 |
fungi | reboot system boot Wed Jul 13 18:08 | 18:08 |
fungi | i think afs01.dfw is back in business now, but need to double-check all the volumes to make sure everything's copacetic before i can say with any certainty | 18:10 |
fungi | i've gone ahead and closed out the ticket with much thanks, since we can at least take it from here | 18:12 |
clarkb | excellent | 18:12 |
fungi | for future reference, i suppose we can try detaching/reattaching through cinder | 18:13 |
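(For the record, that detach/reattach maps to these client commands, using the volume ID from the ticket; whether nova can complete it while the iSCSI session is wedged is another matter:)
    openstack server remove volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6
    openstack server add volume afs01.dfw.openstack.org eafb4d8d-19e2-453e-8657-013c4da7acb6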
fungi | i've got a narrow window to try and catch up on yardwork, but may be able to poke at checking those over on breaks or once i finish | 18:17 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 18:20 |
fungi | per discussion in #openstack-infra a zuul job successfully wrote to the docs rw volume, so i'm going to uncomment the vos release cronjob for that next and see if we have any new problems there | 19:10 |
fungi | i'm tailing /var/log/afs-release/afs-release.log on mirror-update and should hopefully see it kick off in ~2 minutes | 19:13 |
clarkb | thanks | 19:15 |
fungi | looks like all releases were successful, including tarballs | 19:16 |
fungi | we dodged a bullet there | 19:16 |
opendevreview | James E. Blair proposed opendev/system-config master: WIP: Build a nodepool image https://review.opendev.org/c/opendev/system-config/+/848792 | 19:16 |
fungi | clarkb: any objections to me uncommenting the other cronjobs and taking mirror-update02 out of the emergency disable list now? | 19:18 |
clarkb | fungi: no, I probably would've done the mirrors first myself since they are all upstream data :) | 19:19 |
clarkb | I think if tarballs et al are happy then mirrors are good to go | 19:19 |
fungi | fair, but there was a request to rerun a docs job so i took the opportunity | 19:19 |
clarkb | ya | 19:19 |
fungi | okay, undoing the rest | 19:19 |
fungi | and done | 19:19 |
fungi | i'll hold my hopes until we see if there are mirror volumes remaining stale, but i think we can status log a conclusion (i only did status notice earlier, not alert) | 19:20 |
fungi | #status log The afs01.dfw server is back in full operation and writes are successfully replicating once more | 19:21 |
opendevstatus | fungi: finished logging | 19:21 |
fungi | i'll let #openstack-release know too | 19:21 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Update to Gitea 1.17.0-rc1 https://review.opendev.org/c/opendev/system-config/+/847204 | 20:45 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.16.9 https://review.opendev.org/c/opendev/system-config/+/849754 | 20:45 |
clarkb | There is a new gitea bugfix release. I put that update between the testing update and the 1.17.0 rc change | 20:45 |
clarkb | Hopefully we can land the testing update and the 1.16.9 update soon. But as always please review the changelog and template updates | 20:45
clarkb | git diff didn't show me any template changes between .8 and .9 for the three templates we override | 20:47 |
ianw | o/ ... is the short story one of the afs volumes went away for a bit? | 20:56 |
ianw | it seems we didn't need to fsck, which is good | 20:57 |
clarkb | ianw: the entire fileserver went away due to one of the cinder volumes going away | 20:57 |
clarkb | I think that may have impacted all of the afs volumes due to lvm? | 20:58 |
clarkb | but ya it seems to have come back | 20:58 |
clarkb | ianw: while we are doing catchup thank you for updating our default ansible version in zuul (I should've set myself a calendar reminder for that and just spaced it). Also looks like we updated ansible to v5 on bridge too? | 21:00
ianw | umm, i didn't touch the ansible version on bridge, i don't think | 21:00 |
ianw | i guess /vicepa reports itself as clean ... how it survived that I don't know :) | 21:04 |
clarkb | ianw: oh maybe I misread something over the last week | 21:04 |
clarkb | I may have just smashed together the zuul update and bridge update in my head | 21:05 |
corvus | ianw: clarkb fungi for any who are interested, https://review.opendev.org/848792 is an image-build-in-zuul-job which has 2 successful runs -- one at 1 hour, one in 38 minutes. i believe that further improvement in runtime is possible with better use of the cached data already on the nodes. it does use the existing git repo cache (but then fetches updates, which is a little slow. it also copies it twice, and i feel like we should be able to | 21:13 |
corvus | avoid that somehow, but that requires some detailed thought about what's mounted where and when). it doesn't use any of the devstack/tarball/blob cache on the host, so those files are all being fetched each time; that could obviously be improved. anyway, i think that's a useful starting point, and it could be used to test out the containerfile stuff ianw was looking at. i'm currently working on a new spec for nodepool/zuul, and i wanted to | 21:13
corvus | get an idea of what a job like that would look like. feel free to take that change and modify it or copy it or whatever if you have any ideas you want to explore; i'm basically done with that for right now (it answered my questions). | 21:13 |
clarkb | corvus: re caching off the host I think the existing dib caching knows how to check for updates to those files we just have to copy/link them into the right locations in the dib build path? | 21:15
corvus | clarkb: maybe -- but it also has some shasum hash thing it does and i think that's only in the /opt/dib_cache dir, so i don't think we have all that data on the host (which in this case is one of our built images) | 21:16
clarkb | ya the dib_cache dir isn't copied into the zuul runtime images | 21:17 |
clarkb | but we could probably update things to leak that across assuming it isn't very large and is also useful | 21:17 |
clarkb | I'd have to think about that a bit more | 21:17 |
corvus | yeah | 21:17 |
corvus | at least, the theoretical problem of "we have foo.img, let's update it iff it needs updating" seems solveable :) | 21:18 |
corvus | (i went ahead and put a bit of effort into the git repo cache already though since i knew that was the big thing) | 21:18 |
fungi | ianw: clarkb: a more accurate summary would be the primary afs server went away because the hypervisor host went away, but then we couldn't boot it back up for hours because the host got confused when it lost contact with the iscsi backend for one of the attached volumes | 21:19 |
clarkb | fungi: thanks | 21:19 |
fungi | so it was a bit of a cascade failure | 21:19 |
*** dmitriis is now known as Guest4934 | 21:20 | |
fungi | also we didn't manage to automatically fail over serving the ro replica for something (tarballs volume at least) and needed to intercede | 21:20 |
clarkb | fungi: was the server off for all those hours then? If so then I think the idea that shutting it down caused failover to happen is unlikely (and more likely that my manual navigation of paths made it happier) | 21:20
fungi | the server was offline until 18:08 yes | 21:20 |
fungi | and the outage started around 13:something | 21:21 |
clarkb | ok that helps. For some reason i had it in my head that the server was up but with sad openafs and that may have confused the afs dbs | 21:21 |
fungi | tarballs.o.o didn't start serving content until somewhere in between those times | 21:21
clarkb | ya I suspect more that my manual navigation of afs paths on static forced openafs there to try again and it started working? | 21:22 |
fungi | possibly, though i also did that earlier in the outage | 21:22 |
clarkb | or maybe we cached the bad results for a couple of hours and that timing just lined up where the caching timed out | 21:22 |
fungi | just as part of inspecting things to see what was actually down | 21:23 |
clarkb | fungi: if you have time can you take a look at https://review.opendev.org/c/opendev/system-config/+/849754 you've already reviewed its parent. | 21:58
clarkb | ianw: ^ if you get a chance to look too that would be great | 21:59 |
clarkb | CI results on the child should be back momentarily | 21:59 |
opendevreview | Clark Boylan proposed opendev/system-config master: Install Limnoria from upstream https://review.opendev.org/c/opendev/system-config/+/821331 | 22:01 |
clarkb | infra-root ^ is a change that keeps ending up stale because there is never a good time to land it :/ I think Fridays are generally quiet with meetings if we want to try and land it this friday (seems like the last time I picked a day there was a big fire that distracted me) | 22:02
*** dasm is now known as dasm|off | 22:08 | |
fungi | clarkb: lgtm. unrelated, a review of 849576 and its child would be awesome when you have time | 22:13 |
clarkb | fungi: I've +2'd both but didn't approve in case you wanted to respond to ianw first. Feel free to self approve | 22:15 |
ianw | oh i assume that it is all in order | 22:16 |
clarkb | I've approved the update to gitea testing. I think I'll hold off on gitea upgrade proper until tomorrow though as I'm still getting distracted by all the "home after a week away" problems | 22:18 |
clarkb | feel free to land the gitea upgrade if you're able to monitor it, but I'm happy to do that tomorrow | 22:18 |
ianw | i can monitor it, can merge in a few hours when it all slows down | 22:18 |
opendevreview | Ian Wienand proposed openstack/project-config master: Remove testpypi references https://review.opendev.org/c/openstack/project-config/+/849757 | 22:19 |
fungi | ianw: did i not respond to ianw? maybe i missed something | 22:20 |
fungi | er clarkb ^ | 22:20 |
ianw | oh you did, about the handbook v the guide v the open way v the four opens etc. | 22:21 |
fungi | looking back, i left a review comment in reply to an inline comment, rather than replying with an inline comment, sorry! | 22:23 |
fungi | and yeah, they're intentionally distinct | 22:24 |
fungi | (we debated the option of putting them together or not at great length) | 22:24 |
fungi | i was personally in favor of fewer repos, but one more repo wasn't that great of a cost to appease those who disagreed with my position on the matter | 22:25 |
opendevreview | Ian Wienand proposed openstack/project-config master: twine: default to python3 install https://review.opendev.org/c/openstack/project-config/+/849758 | 22:27 |
clarkb | fungi: hrm this is the problem of not responding to the inline comment directly so it doesn't show up as a response on the file | 22:30
clarkb | https://review.opendev.org/c/openstack/project-config/+/849576/1/gerrit/projects.yaml basically that doesn't show a response | 22:31 |
clarkb | but ya I see it now | 22:31 |
fungi | well, in this case i missed that it was an inline comment so i made a normal review comment instead. was my bad | 22:39 |
clarkb | with the web ui if you click reply it automatically attaches it to the correct place. I wonder if gertty could grow a similar functionality | 22:45 |
clarkb | or maybe it does and I just haven't used it in long enough to have forgotten | 22:45 |
*** dviroel|rover is now known as dviroel|rover|Afk | 22:58 | |
jrosser | i might have a very long running logs upload in progress here https://zuul.opendev.org/t/openstack/stream/915484832105431892e804fb86abc2d3?logfile=console.log | 23:08 |
clarkb | hrm doesn't look like we've ported the base-test updates to log the target to the production base job? | 23:09 |
clarkb | or if we have I'm not seeing it in that log yet | 23:09 |
jrosser | it's from 847991 | 23:09 |
opendevreview | Merged opendev/system-config master: Move gitea partial clone test https://review.opendev.org/c/opendev/system-config/+/848174 | 23:09 |
jrosser | no i don't think we have merged that yet | 23:09 |
* clarkb makes a note to catch back up on that tomorrow | 23:10 | |
jrosser | i have a patch to do that but it needs updating | 23:10 |
jrosser | i saw one POST FAILURE earlier, and just noticed that one apparently stuck | 23:10
ianw | lsof on that shows connections to ... | 23:19 |
ianw | 142.44.227.102 | 23:19 |
ianw | OVH Hosting Inc. | 23:20 |
ianw | looking at it in strace, it doesn't seem to be doing anything | 23:21 |
fungi | clarkb: in this case it wasn't a gertty failing, i replied to the review comment which contained the inline comment rather than replying to the inline comment itself | 23:21 |
clarkb | ianw: is it waiting on a read or a write? (might point to which side is idling) | 23:23 |
ianw | https://paste.opendev.org/show/bzXL1q1f2G0e4d4dQgvA/ | 23:23 |
ianw | looks to me stuck in a bunch of reads | 23:24 |
clarkb | to me that implies something about the remote end being unhappy | 23:26 |
clarkb | we're waiting for ovh to respond to us? | 23:26 |
clarkb | could be something on the network between as well | 23:26 |
ianw | pinging it from ze02 seems fine | 23:27 |
ianw | it really just looks like those threads are sitting there waiting for something | 23:27 |
clarkb | might be something amorin could help with | 23:28 |
clarkb | (to check on the ovh side to see if there is any obvious reason for the pause) | 23:28 |
ianw | i've had that under strace for a while and nothing has got any data or timed out either | 23:30 |
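(The inspection described above amounts to roughly the following, with the executor-side PID as a placeholder:)
    # which remote endpoints does the stuck upload process hold open?
    lsof -p <pid> -a -i
    # are its threads blocked in read()/recv on those sockets?
    strace -f -p <pid> -e trace=network,read
    # socket-level view, including send/receive queue depths
    ss -tpn | grep 'pid=<pid>'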
fungi | when was the connection initiated? | 23:31 |
ianw | https://paste.opendev.org/show/bdRYt3Lbz7PZZoEuovxE/ | 23:33 |
ianw | it would indicate Jul 13 23:33 i guess | 23:34 |
ianw | although, that's 1 minute ago? | 23:34 |
ianw | ... and it's gone ... | 23:35 |
ianw | did it get killed? | 23:35 |
ianw | 2022-07-13 20:34:38.552391 | TASK [upload-logs-swift : Upload logs to swift] | 23:37 |
ianw | 2022-07-13 23:34:33.029640 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master] | 23:37 |
ianw | yes | 23:37 |
jrosser | still an html ara report there which needs to be got rid of | 23:38
ianw | i guess the time of the file in /proc/<pid>/fd is the time that the kernel made the virtual file in response to the dirent or whatever (i.e. when you "ls" it), not the file creation time. not sure i've ever considered the timestamp of it before | 23:40 |
ianw | anyway, that's a data point i guess? it was ovh, and it was all the thread stuck in read() calls | 23:41 |
clarkb | ++ someone like timburke might know what portion of the upload is doing reads too. Though it may just be waiting for a status result from the http server | 23:46 |
clarkb | and the problem is processing/storing the data on the remote | 23:46 |
ianw | my next thought was a backtrace on one of those threads, but they disappeared | 23:48
opendevreview | Ian Wienand proposed openstack/project-config master: pypi: use API token for upload https://review.opendev.org/c/openstack/project-config/+/849763 | 23:54 |
ianw | does "Job publish-service-types-authority not defined" in project-config ring any bells? | 23:56 |
clarkb | service types authority is the thing that is published for keystone ? I think its the static json blob | 23:57 |
*** dviroel|rover|Afk is now known as dviroel|rover | 23:57 | |
ianw | https://review.opendev.org/c/openstack/project-config/+/708518 removed the job in feb 2020 | 23:58 |
clarkb | ianw: did you get that error from the zuul scheduler log? | 23:58 |
clarkb | all it said in gerrit was the change depends on a change with invalid config | 23:58 |
ianw | we have a reference @ https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5018 | 23:59 |
clarkb | https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5017-L5018 ya I just found that too | 23:59 |
clarkb | maybe we need to clean that up? | 23:59 |
ianw | https://review.opendev.org/c/openstack/project-config/+/849757 gives me a zuul error | 23:59 |
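(Tracking down such stale references is just a grep through the zuul config in project-config, e.g.:)
    git grep -n publish-service-types-authority -- zuul.d/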