clarkb | I'm going to need to step out for dinner shortly. I'll check in on docs later, but it's still locked :/ | 00:03 |
clarkb | `vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs02.dfw.openstack.org -frompartition vicepa -localauth` is the command we need to run (I was running them all from screen on afsdb01). If I don't manage to run that today I'll try to get it done in the morning | 00:03 |
opendevreview | Merged opendev/system-config master: Update eavesdrop deploy job https://review.opendev.org/c/opendev/system-config/+/796006 | 00:08 |
opendevreview | Merged opendev/system-config master: ircbot: update limnoria https://review.opendev.org/c/opendev/system-config/+/796323 | 00:08 |
ianw | ok i haven't been following closely on that one sorry. i also have to go afk for a bit for vaccine shots, so back at an indeterminate time | 00:39 |
clarkb | ianw: tldr is we wanted to do reboots and to do those without impacting afs users I did afs02.dfw and afs01.ord first. Then today I held locks and moved RW volumes around and rebooted afs01.dfw. Now I'm just trying to restore things to the way they were before as far as volume assignments go | 00:43 |
clarkb | and it is looking like I won't get to move docs back tonight. I'll check it in the morning | 00:50 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 02:16 |
*** diablo_rojo is now known as Guest2181 | 03:19 | |
ianw | and back. i'll look at getting nb containers up with that mount and f34 | 03:52 |
*** ykarel|away is now known as ykarel | 04:18 | |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 04:33 |
opendevreview | Merged opendev/system-config master: nodepool-builder: add volume for /var/lib/containers https://review.opendev.org/c/opendev/system-config/+/795707 | 04:38 |
opendevreview | Merged opendev/system-config master: statusbot: don't prefix with extra # for testing https://review.opendev.org/c/opendev/system-config/+/796009 | 04:38 |
*** marios is now known as marios|ruck | 05:16 | |
opendevreview | Merged openstack/project-config master: Build images for Fedora 34 https://review.opendev.org/c/openstack/project-config/+/795604 | 06:03 |
*** ysandeep|out is now known as ysandeep | 06:09 | |
*** iurygregory_ is now known as iurygregory | 06:19 | |
*** jpena|off is now known as jpena | 06:34 | |
opendevreview | Merged openstack/project-config master: Removing openstack-state-management from the bots https://review.opendev.org/c/openstack/project-config/+/795599 | 06:37 |
*** rpittau|afk is now known as rpittau | 07:13 | |
*** ykarel is now known as ykarel|lunch | 07:38 | |
opendevreview | Ian Wienand proposed openstack/project-config master: infra-package-needs: stub ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796414 | 09:45 |
ianw | mnaser: getting closer ... ^ | 09:45 |
*** ykarel|lunch is now known as ykarel | 09:57 | |
dtantsur | FYI freenode has reset all registrations apparently. so those of us who stayed to redirect people can no longer join the +r channels | 09:59 |
avass[m] | is https://review.opendev.org having issues? | 10:16 |
avass[m] | oh now it loads but that took way longer than usual | 10:18 |
ianw | dtantsur: yeah, i had left a connection too, hoping some sort of sanity may prevail. but alas I just closed everything too, the complete end of an era. | 10:38 |
dtantsur | indeed | 10:40 |
*** sshnaidm|afk is now known as sshnaidm | 11:21 | |
sshnaidm | I see there are problems with limestone proxies, fyi https://zuul.opendev.org/t/openstack/build/e8b283b37245414284966011a2d0e49e/log/job-output.txt#1245 | 11:23 |
sshnaidm | clarkb, fungi ^ | 11:23 |
sshnaidm | marios|ruck, ^ | 11:23 |
*** ysandeep is now known as ysandeep|afk | 11:24 | |
*** jpena is now known as jpena|lunch | 11:25 | |
marios|ruck | sshnaidm: ack yes quite a few this morning but last couple hours have not seen more | 11:34 |
marios|ruck | sshnaidm: i filed that earlier https://bugs.launchpad.net/tripleo/+bug/1931975 and was waiting for afternoon when the folks you pinged would be around | 11:35 |
*** amoralej is now known as amoralej|lunch | 11:50 | |
rav | Hi All | 11:51 |
rav | Needed info on how i can add a gpg key | 11:51 |
rav | i want to trigger a pypi release and tag the version using "git tag -s <versionnumber>" | 11:52 |
rav | but the tag fails with error gpg skipped: no secret key | 11:52 |
rav | I have added my gpg key in my github | 11:52 |
rav | but how do i add it on opendev? | 11:52 |
fungi | sshnaidm: looks like it timed out while pulling a cryptography wheel through the proxy from pypi, yeah. must be intermittent because i can pull that same file from the same mirror, though it's also possible the local network between the job nodes and mirror server in limestone are having trouble... have you noticed more than one incident like this today? | 12:04 |
fungi | rav: you don't need to upload your gpg key anywhere for that to work, you just need to have it available locally where you're running git | 12:05 |
sshnaidm | fungi, yeah, much more, that was just one example | 12:05 |
sshnaidm | fungi, mostly timeouts, for example marios|ruck submitted an issue with other examples: https://bugs.launchpad.net/tripleo/+bug/1931975 | 12:06 |
fungi | rav: running `gpg --list-secret-keys` locally where you use git should list your key if you have one | 12:06 |
fungi | sshnaidm: were they all errors pulling files from pypi, or other stuff too? | 12:07 |
rav | got it working | 12:08 |
sshnaidm | fungi, in the issues above the example is with centos repositories | 12:08 |
fungi | sshnaidm: thanks, so probably not just a problem between the mirror server and pypi/fastly | 12:12 |
fungi | centos packages would be getting served from the afs cache on the server | 12:12 |
fungi | seems more likely it's a local networking issue between the mirror server and the job nodes in that case | 12:13 |
sshnaidm | yeah | 12:14 |
fungi | i also don't seem to be able to get the xml file mentioned in that bug | 12:15 |
fungi | sshnaidm: oh, maybe not, that's a proxy to https://trunk.rdoproject.org/ | 12:17 |
*** jpena|lunch is now known as jpena | 12:17 | |
fungi | we're not mirroring that, just caching/forwarding requests for it | 12:17 |
sshnaidm | fungi, yeah, though the request that died went to http://mirror.regionone.limestone.opendev.org:8080 | 12:18 |
sshnaidm | together with all pypi mirror failures | 12:18 |
fungi | and i was finally able to get that xml file to pull through the proxy, so maybe this is a performance problem with the cache | 12:18 |
fungi | oh yeah it took me several seconds just to touch a new file in /var/cache/apache2/ | 12:19 |
fungi | load average on the server is nearly 100 | 12:20 |
sshnaidm | fungi, here an example with centos/8-stream repo: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e0d/795019/2/gate/tripleo-ci-centos-8-content-provider/e0d29df/job-output.txt | 12:20 |
fungi | half its cpu is consumed by iowait | 12:20 |
sshnaidm | fungi, that explains a lot :) | 12:20 |
fungi | i'm going to try stopping and starting apache completely there | 12:21 |
rav | fungi: fatal: 'gerrit' does not appear to be a git repository. I'm getting this error when i run "git push gerrit <tag>" | 12:23 |
fungi | sshnaidm: looks like the problem is htcacheclean being unable to keep up | 12:24 |
fungi | and there was more than one htcacheclean running, one had been going since friday, the second started today at 08:00 utc | 12:25 |
fungi | i killed the older one and started apache back up for now, but we're going to need to keep a close eye on this... the cache volume is around 80% full when htcacheclean normally wants to keep it closer to 50% | 12:25 |
rav | ok this works. thanks | 12:26 |
fungi | rav: see the third bullet in the note section at https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#tagging-a-release | 12:26 |
rav | Yes, had missed it | 12:26 |
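For reference, a minimal sketch of the signed-tag-and-push workflow discussed above, assuming the gpg key and the `gerrit` remote still need to be set up locally; the key id, version number, username and repository path are all placeholders:

```bash
# Confirm a secret key is available where git runs
gpg --list-secret-keys --keyid-format long

# Tell git which key to sign with (key id is a placeholder)
git config user.signingkey 0xDEADBEEFCAFEF00D

# Create the signed tag (version number is a placeholder)
git tag -s 1.2.3 -m "Release 1.2.3"

# Add the gerrit remote if it is missing, then push the tag
git remote add gerrit ssh://<username>@review.opendev.org:29418/<namespace>/<project>
git push gerrit 1.2.3
```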
fungi | #status log Killed a long-running extra htcacheclean process on mirror.regionone.limestone which was driving system load up around 100 from exceptional iowait contention and causing file retrieval problems for jobs run there | 12:29 |
opendevstatus | fungi: finished logging | 12:29 |
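For context, a hedged sketch of the sort of manual check and one-shot cleanup involved in the htcacheclean incident above; the cache path and size limit are illustrative and should come from the server's actual CacheRoot and htcacheclean configuration:

```bash
# How full is the apache cache volume?
df -h /var/cache/apache2

# Are there stacked-up htcacheclean processes, and how long have they run?
ps -eo pid,etime,cmd | grep '[h]tcacheclean'

# One-shot cleanup pass (-n be nice to other I/O, -t remove empty directories);
# the path and limit here are placeholders for the real CacheRoot and target size
sudo htcacheclean -n -t -p /var/cache/apache2/mod_cache_disk -l 20G
```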
*** ysandeep|afk is now known as ysandeep | 12:33 | |
*** amoralej|lunch is now known as amoralej | 12:51 | |
*** diablo_rojo__ is now known as diablo_rojo | 13:13 | |
fungi | minotaur launch in a few minutes hopefully, at the spaceport just up the beach from here | 13:32 |
fungi | too cloudy here for us to be able to see it clear the horizon, unfortunately | 13:33 |
fungi | nasa tv has the livestream from the wallops island facility though | 13:34 |
fungi | launch successful! | 13:37 |
fungi | pointed out in #openstack-infra a few minutes ago, channel logging seems to be spotty | 14:06 |
fungi | on eavesdrop, /var/lib/limnoria/opendev/logs/ChannelLogger/oftc/#openstack-infra/#openstack-infra.2021-06-15.log is entirely empty | 14:09 |
fungi | and #opendev/#opendev.2021-06-15.log is truncated after 12:26:25 today | 14:10 |
fungi | the /var/lib/limnoria volume isn't anywhere close to being out of space or inodes | 14:12 |
corvus | it might be easier to write a quick matrix log bot instead of figuring out and fixing whatever ails limnoria | 14:13 |
fungi | it hasn't logged any errors related to writing channel logs | 14:15 |
corvus | iiuc, limnoria itself just logs the plaintext; the html formatting is a separate program. so a log bot doesn't need to do anything other than write out everything it receives. | 14:23 |
fungi | yep, potentially filtering (for example we filter the join/part messages so that we don't publish logs of everyone's ip addresses) | 14:25 |
*** rpittau is now known as rpittau|afk | 14:29 | |
gibi | btw, #openstack-neutron logging also seems to have stopped today | 14:30 |
gibi | and the #openstack-nova log is empty today | 14:30 |
fungi | i'll try restarting the bot during a lull between meetings | 14:39 |
clarkb | fungi: re limestone, I think the idea was if we caught it in that state again we could have logan- look at the running host. (Disable the cloud in nodepool if necessary) | 14:44 |
clarkb | gibi: we found a bug in limnoria, which we fixed but are trying to get deployed | 14:44 |
fungi | clarkb: the new problem seems to be unrelated to the previous limnoria bug you patched | 14:44 |
clarkb | fungi: oh? | 14:44 |
clarkb | fun | 14:45 |
fungi | also it looks like the qa and neutron meetings finished a few minutes ago, so i'll stop/start the container before the next meeting block begins | 14:45 |
clarkb | fungi: is it possible that ianw's fixes to update the image simply undid our manual patching? | 14:45 |
fungi | clarkb: nope, this doesn't appear related to joining channels, the bot is just suddenly no longer logging conversations for channels it's in | 14:46 |
clarkb | infra-root it looks like the afs docs volume is currently locked. Is there a way for me to see the age of the lock and/or release progress? I want to see if I can narrow down when I might be able to move the RW copy or to see if there is something wrong? | 14:46 |
clarkb | fungi: that is unfortunate | 14:46 |
clarkb | https://grafana.opendev.org/d/T5zTt6PGk/afs?viewPanel=36&orgId=1 makes me suspicious | 14:51 |
clarkb | is it possible that whatever held that lock has died and the lock is stale? This is a vos release that happens in a cron? | 14:52 |
clarkb | hrm do we do those vos releases on afs01.dfw? and when I rebooted it maybe I killed a release in progress? | 14:54 |
clarkb | (we need to update our docs on server reboots if so) | 14:54 |
opendevreview | Merged openstack/project-config master: Re-add publish jobs after panko deprecation https://review.opendev.org/c/openstack/project-config/+/793891 | 14:55 |
fungi | clarkb: new detail on the limnoria bug... seems it's just not getting around to flushing its files. when i stopped the container it wrote out all the many hours of missing lines to the logfiles | 14:55 |
*** ykarel_ is now known as ykarel | 14:56 | |
mordred | well that's nice | 14:58 |
clarkb | at least that is likely to be an easy fix? | 14:59 |
fungi | yeah, probably, and we haven't lost any data seems like | 14:59 |
fungi | clarkb: there's a command to list volume locks, lemme see if i can find it when i get a spare moment | 15:00 |
fungi | oh, and yes the vos releases for static volumes happen via ssh from mirror-update to afs01.dfw | 15:00 |
fungi | with a cronjob | 15:00 |
clarkb | ya so I bet what happened was I rebooted in the middle of that vos release (and our docs need an update) | 15:02 |
clarkb | after meetings I guess I should try to confirm that better, break the lock, do a release, then do the move? | 15:03 |
corvus | clarkb, fungi: there's a setting in the limnoria config | 15:03 |
corvus | to flush | 15:03 |
fungi | oh, maybe that's just missing or wrong | 15:04 |
fungi | supybot.plugins.ChannelLogger.flushImmediately: False | 15:08 |
fungi | "Determines whether channel logfiles will be flushed anytime they're written to, rather than being buffered by the operating system." | 15:08 |
fungi | apparently that buffering can be a day or more, even | 15:08 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: ircbot: flush channel logging continuously https://review.opendev.org/c/opendev/system-config/+/796513 | 15:13 |
fungi | corvus: thanks! clarkb: gibi: ^ | 15:13 |
fungi | hopefully that takes care of it | 15:13 |
gibi | thanks | 15:17 |
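For reference, a hedged sketch of what the fix amounts to: the registry key quoted above gets flipped to True, and the standard supybot/limnoria owner interface should let someone with owner capability apply the same thing to a running bot without a restart (bot nick is a placeholder):

```bash
# Registry line flipped by the change above (in the bot's limnoria config file):
#   supybot.plugins.ChannelLogger.flushImmediately: True

# The same setting can usually be applied live by the bot owner via private
# message, followed by a manual flush of buffered data:
#   /msg <botnick> config supybot.plugins.ChannelLogger.flushImmediately True
#   /msg <botnick> flush
```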
*** jpena is now known as jpena|off | 15:18 | |
fungi | clarkb: yeah, so vos examine on the docs rw volume says "Volume is locked for a release operation" and we normally only release it via cronjob on mirror-update.o.o | 15:20 |
clarkb | oh is the cron on mirror-update? I rebooted that server last week | 15:21 |
clarkb | could the problem be that I rebooted the fileserver during the release instead? | 15:22 |
clarkb | regardless of where the release is running? | 15:22 |
fungi | it tried to release the docs volume yesterday at 19:20:02 running `ssh -T -i /root/.ssh/id_vos_release vos_release@afs01.dfw.openstack.org -- vos release docs` and recorded a failure | 15:22 |
clarkb | ah ok the actual release happens remotely on afs01 and that is approximately when I rebooted iirc | 15:22 |
clarkb | (though `last reboot` can confirm as can irc logs) | 15:23 |
clarkb | I rebooted around 21:00UTC | 15:23 |
clarkb | depending on how long that release was taking I could have rebooted during the release | 15:24 |
mordred | corvus: gtema reports being able to set the channel logo in #openstack-ansible-sig without being op there | 15:24 |
fungi | it last successfully released the docs volume at 18:30:21, subsequent checks showed no vos release needed for that volume up until 19:20:02 when it tried and failed | 15:24 |
clarkb | fungi: I guess in that case we confirm there is no current release happening on mirror-update, hold a lock on mirror-update to prevent it trying again, manually break the lock, then manually rerun the release? | 15:25 |
clarkb | fungi: then after that is done we can move the volume again? | 15:25 |
fungi | correct, and be prepared for it to decide this needs to be a full re-release and take days because of the ro replica in ord | 15:25 |
clarkb | ok after this meeting I'll look into starting that | 15:26 |
mordred | corvus: nevermind - he had elevated privs but wasn't aware of it | 15:26 |
clarkb | I'll put everything in screen too as I'll be out the next 1.5 days starting after our meeting today | 15:26 |
fungi | clarkb: the vos release very well may take that long if it decides to retransfer all the data from dfw to ord | 15:27 |
corvus | mordred: yeah, i was gonna say, matrix says he's a mod :) | 15:27 |
clarkb | fungi: I think the docs need to be updated to note that rebooting RO servers is not safe if vos releases can happen? | 15:27 |
clarkb | we need to hold locks for anything that may vos release essentially regardless of the RO or RW hosting state of the fileserver | 15:28 |
fungi | clarkb: yeah, it should probably mention temporarily disabling the release-volumes.py job in the root crontab on mirror-update.o.o | 15:28 |
clarkb | fungi: and holding the mirror locks too | 15:28 |
fungi | right | 15:28 |
clarkb | currently it basically says you can reboot an RO server safely | 15:29 |
clarkb | which is why i moved all the volumes | 15:29 |
fungi | anything that's remotely calling vos release on the fileservers so they can use -localauth | 15:29 |
clarkb | oh I see it is specifically due to us running the vos release from afs01 | 15:29 |
clarkb | so not generic that RO isn't safe during a reboot, but that the controlling release process needs to be safe. Got it | 15:29 |
fungi | yeah, the script on mirror-update does an ssh to afs01.dfw and calls vos release in a shell there, in order to get around the kerberos ticket timeouts | 15:30 |
clarkb | yup | 15:30 |
fungi | and your reboot may have killed the vos release process | 15:30 |
clarkb | ya I think that must've been it | 15:30 |
clarkb | but I need to ssh into the mirror and check that no other release is running and then figure out how to break the lock etc | 15:30 |
clarkb | (doesn't look like breaking locks is in our docs?) | 15:31 |
fungi | vos unlock | 15:31 |
clarkb | https://docs.openafs.org/AdminGuide/HDRWQ247.html yup google found it | 15:31 |
fungi | i've done it like `vos unlock -localauth docs` | 15:32 |
fungi | as root on afs01.dfw | 15:32 |
clarkb | thanks. I'll get that moving once i check the mirror-update side after this meeting. I'll use a screen session on mirror-update and afs01.dfw too so that others can check on it if necessary | 15:32 |
fungi | and then try to `vos release -localauth docs` after that, but you'll want a screen session for that since it may take a long time | 15:33 |
clarkb | yup ++ | 15:33 |
*** ysandeep is now known as ysandeep|out | 15:41 | |
opendevreview | Merged openstack/project-config master: Revert "Accessbot OFTC channel list stopgap" https://review.opendev.org/c/openstack/project-config/+/792857 | 15:42 |
clarkb | fungi: looks like /opt/afs-release/release-volumes.py is the script run on mirror-update to do the release | 15:44 |
fungi | correct | 15:44 |
fungi | i guess if you want to temporarily comment out the cronjob, that implies disabling ansible for the mirror-update server as well so it doesn't get readded | 15:45 |
clarkb | /var/run/release-volumes.lock is the lockfile that script appears to try and hold. I'm going to try and hold that lockfile instead | 15:45 |
clarkb | `flock -n /var/run/release-volumes.lock bash` seems to be running successfully in my screen now implying I have the lock | 15:47 |
clarkb | fungi: ^ if you think that is sufficient I'll do the `vos unlock -localauth docs` next | 15:47 |
fungi | yeah, that looks right | 15:49 |
fungi | and probably simpler than fiddling the crontab | 15:49 |
clarkb | lock has been released I am going to start the vos release now | 15:50 |
clarkb | for record keeping purposes there is a root screen on mirror-update that is holding the release-volumes.lock file with flock -n /var/run/release-volumes.lock bash. Then on afs01.dfw.openstack.org I have another root screen that is running the vos release | 15:51 |
clarkb | Once the vos release on afs01.dfw.openstack.org completes we can exit the processes on mirror-update to release the lock. Then later we can move the volume. I'm not sure I'll be around when the release completes, but I can do the followups when I get back on thursday if no one beats me to it | 15:52 |
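For reference, a consolidated sketch of the recovery sequence described above; the lockfile path and the vos commands are the ones quoted in the discussion, and both steps are assumed to run as root inside screen:

```bash
# On mirror-update: hold the release-volumes lock so the cron job cannot start
# a competing release while we work; keep this shell open in screen
flock -n /var/run/release-volumes.lock bash

# On afs01.dfw: clear the stale lock and re-run the release
vos unlock docs -localauth     # remove the lock left by the interrupted release
vos release docs -localauth    # first pass completes the interrupted transfers
vos release docs -localauth    # second pass catches the RO sites up to current
vos examine docs -localauth    # confirm the volume is no longer locked
```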
*** amoralej is now known as amoralej|off | 15:53 | |
*** marios|ruck is now known as marios|out | 15:57 | |
opendevreview | Slawek Kaplonski proposed openstack/project-config master: Add members of the neutron drivers team as ops in neutron channel https://review.opendev.org/c/openstack/project-config/+/796521 | 15:59 |
clarkb | fungi: it claims it is done already actually. Can you double check it? I'll keep holding the lock so that i can do the move without contention but then drop the lock when the move is done if it looks good to you | 16:00 |
fungi | oh, sure. that was the other possibility: that it decided it could do an incremental completion safely | 16:01 |
fungi | looking | 16:01 |
fungi | did you try a second vos release yet, btw? | 16:01 |
fungi | usually if one is interrupted, your first vos release will only attempt to complete the interrupted transfers | 16:01 |
clarkb | I have not. I can do that first | 16:02 |
fungi | then a second one will catch up to current | 16:02 |
clarkb | Only the one has completed. Will do second one now | 16:02 |
fungi | vos examine is claiming it's locked, but that's probably you | 16:03 |
clarkb | yup since I just started another release | 16:04 |
clarkb | fungi: that release shows it is done now though so you can check it again for no lock and I have done two releases in succession | 16:04 |
fungi | clarkb: yep, looks great, not locked, all ro volumes are current | 16:04 |
fungi | should be safe to proceed with the move | 16:05 |
clarkb | doing so now, thanks for checking | 16:07 |
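A sketch of the move-back plus the verification fungi does below; the vos move command is the one clarkb quoted at the top of this log, run somewhere with -localauth access:

```bash
# Move the RW docs volume back to afs01.dfw
vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa \
    -fromserver afs02.dfw.openstack.org -frompartition vicepa -localauth

# Verify placement and that nothing is left locked
vos listvldb -name docs -localauth
vos examine docs -localauth
```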
gmann | clarkb: fungi; is http://ptg.openstack.org/ down? I was dumping the Xena etherpads to wiki and seems it is not working | 16:24 |
*** ricolin_ is now known as ricolin | 16:26 | |
clarkb | I think the foundation runs that? | 16:27 |
* clarkb looks | 16:27 | |
clarkb | oh no it's on eavesdrop still | 16:27 |
clarkb | gmann: the issue is that we are migrating irc services from the old eavesdrop server to a new one and we must've turned off apache? | 16:28 |
clarkb | gmann: ianw has been working on that migration | 16:28 |
gmann | ohk | 16:28 |
fungi | also any content on that site would have been about the last ptg not the next one | 16:29 |
clarkb | fungi: should we go ahead and approve https://review.opendev.org/c/opendev/system-config/+/796513/ | 16:29 |
fungi | clarkb: yeah, old eavesdrop server is completely offline | 16:29 |
gmann | yeah, I want to get the Xena etherpad list. I hope that is still there | 16:29 |
gmann | Xena PTG | 16:29 |
fungi | clarkb: for the limnoria config change, i guess that won't redeploy/restart the container? | 16:29 |
clarkb | fungi: I am not sure | 16:30 |
clarkb | gmann: I am not sure about your question either | 16:30 |
gmann | clarkb: I mean this one http://ptg.openstack.org/etherpads.html | 16:30 |
fungi | we may need to restore a copy of the ptgbot sqlite db somewhere to get that data | 16:30 |
fungi | "The Internet Archive's sites are temporarily offline due to a planned power outage." | 16:31 |
fungi | that's unfortunate, i was going to see if they had a copy | 16:31 |
gmann | ok, thanks | 16:32 |
fungi | gmann: but no, we haven't deleted the old server yet, merely turned it off | 16:33 |
fungi | if it's not urgent let's see if there's an archive of the site content at archive.org once they're back up, otherwise we can try to temporarily turn the ptg.o.o site back on or try to restore its data from backups | 16:34 |
gmann | we do not have all the etherpads listed anywhere, so I thought of dumping them on the wiki for future ref. but it's not so urgent, so we can go with 'ptg.o.o site back on or try' | 16:37 |
gmann | maybe we can automate dumping the etherpads to the wiki right after the ptg. | 16:38 |
fungi | yeah, all good ideas | 16:39 |
clarkb | fungi: the docs RW volume move has completed. Can you double check vos listvldb and if it looks good to you I'll stop holding the lock on mirror-update? | 16:48 |
clarkb | fwiw the listing lgtm but want to double check so that I'm not competing with some automated process to relock and fix things | 16:48 |
* clarkb makes breakfast | 16:48 | |
fungi | clarkb: yep, all looks good, nothing locked, nothing old, only ro sites on afs02.dfw and afs01.ord | 16:51 |
fungi | all rw volumes are on afs01.dfw | 16:51 |
clarkb | fungi: looks like no RO site on afs02, it is afs01.dfw and afs01.ord | 16:56 |
clarkb | but that is the way it was before aiui | 16:56 |
clarkb | fungi: you good with me dropping the lock now? | 16:56 |
fungi | yeah, should be safe | 16:59 |
fungi | go for it | 16:59 |
fungi | there are 40 ro sites on afs02.dfw by the way | 17:00 |
fungi | that is what i meant by "only ro sites on afs02.dfw and afs01.ord" | 17:01 |
clarkb | ah | 17:05 |
clarkb | I have closed down all the screens and released the lock. We should be back to normal now | 17:06 |
Alex_Gaynor | 👋 it looks like maybe linaro is having issues this morning. We're seeing delays in arm64 jobs starting, and https://grafana.opendev.org/d/S1zTp6EGz/nodepool-linaro?orgId=1 seems to say that there's been 13 nodes stuck in "deleting" for 6 hours? | 17:12 |
clarkb | fungi: ^ you restarted the launcher for linaro to unstick some nodes too? | 17:15 |
clarkb | Alex_Gaynor: I think there was an issue that got addressed, but it is possible that allowed new issues further down the pipeline to pop up | 17:15 |
Alex_Gaynor | clarkb: So it's possible either there's still an issue, or this is just the backlog slowly clearing out? | 17:15 |
clarkb | Alex_Gaynor: ya. I'm taking a quick look at the logs and status the service reports directly to see if we can infer more | 17:16 |
Alex_Gaynor | 👍 | 17:17 |
clarkb | Alex_Gaynor: there are definitely a number of deleting instances from far too long ago. Next step for those will be to check the cloud side and determine if we leaked or they leaked. I do see building instances and ready instances though so I think it is moving. But possibly slowly due to backlog | 17:18 |
Alex_Gaynor | clarkb: it looks like there's a handful of instances that have been sitting in "ready" for a while, is there some reason they wouldn't be assigned to jobs? | 17:18 |
clarkb | Alex_Gaynor: the two most common things are that they are part of a larger nodeset that hasn't managed to fully allocate yet or min-ready caused them to boot without direct demand. That said it is also possible there is an issue (like the one that instigated the earlier restart) that prevents those from being assigned | 17:19 |
clarkb | fungi: ^ would know better what that looked like as I wasn't awake when it happened | 17:20 |
clarkb | Alex_Gaynor: I do see an instance that went into use 3 minutes ago so something is making progress | 17:20 |
clarkb | oh and another is under a minute old | 17:20 |
Alex_Gaynor | 👏 | 17:21 |
clarkb | ok the 55 day old deletes are just noise I think. They are labeled under a different provider (they are linaro-us not linaro-us-regionone). We need to clean those up in our db but they don't appear to be in the actual cloud (no quota consumption) | 17:23 |
clarkb | I suspect for the other deletions their builds timed out because the cloud shows them in a BUILD state | 17:24 |
Alex_Gaynor | that sounds like something that would be consuming quota? | 17:25 |
clarkb | yes the ones in a BUILD state on the cloud side are most likely consuming quota | 17:25 |
clarkb | nodepool appears to have been attempting to delete at least some of those instances for some time | 17:26 |
clarkb | I think we may need kevinz to look into this since the cloud isn't honoring our requests | 17:26 |
clarkb | and also we should see if ianw remembers what the magic was to get the osuosl mirror running again | 17:26 |
clarkb | (then we can enable that cloud again) | 17:26 |
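A hedged sketch of the cross-check being described: compare what nodepool thinks is stuck deleting against what the cloud actually reports; the provider/cloud names and grep pattern are illustrative:

```bash
# On the launcher: nodes nodepool has been trying to delete in this provider
nodepool list | grep linaro-us-regionone | grep deleting

# Against the cloud: are those instances really gone, or wedged in BUILD and
# still consuming quota? (cloud name is whatever clouds.yaml calls this provider)
openstack --os-cloud linaro-us server list --status BUILD
```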
fungi | clarkb: sorry, in a meeting right now, but yes i restarted the launcher responsible for linaro-us to free some stale locks on node requests | 17:27 |
fungi | there were a few changes which had been sitting waiting on node assignments which the launcher thought it had declined | 17:28 |
clarkb | fungi: fwiw it seems like things are moving there but we've got a number of nodepool deleting state instances that show BUILD state in the cloud and seem to refuse to go away. I think we need to reach out to kevinz | 17:28 |
fungi | yes, it seems like that's probably constricting the available quota (and having the osuosl cloud down isn't helping matters) | 17:29 |
clarkb | we appear to have room for 5 job nodes right now. There are two "ready" nodes. If those come from min-ready we might try setting those values to 0 to get 5 bumped to 7 | 17:31 |
clarkb | after this meeting I can write an email to kevinz | 17:31 |
*** ricolin_ is now known as ricolin | 17:32 | |
clarkb | we are up to 7 in use now. slightly better than my earlier counts | 17:32 |
fungi | did it end up consuming the two in ready state? | 17:37 |
clarkb | fungi: no I suspect those are min-ready allocations | 17:38 |
clarkb | I'm drafting up the email now | 17:38 |
fungi | thanks! | 17:40 |
clarkb | Alex_Gaynor: fungi: email sent to kevinz (I also pointed out we are on OFTC now in case IRC confusion happens). Hopefully that can be addressed later today when kevinz's day starts | 17:45 |
Alex_Gaynor | 👍 | 17:45 |
clarkb | fungi: maybe we should try some osuosl mirror reboots just for fun? I've got to prep for the infra meeting that is in just over an hour though | 17:46 |
fungi | yeah, i'm about to resurface from my meeting-induced urgent task backlog, and can try a few more things | 17:48 |
fungi | working on our startup race theory, i've got openafs-client.service disabled and am waiting for the server to reboot, then will try to modprobe the openafs lkm before manually starting openafs-client | 18:07 |
fungi | hopefully that will give us a good idea of whether it's racing the module loading | 18:08 |
fungi | manually did `sudo modprobe openafs` | 18:16 |
fungi | now lsmod confirms the openafs lkm is loaded | 18:16 |
fungi | trying to `sudo systemctl start openafs-client` | 18:16 |
fungi | not looking good | 18:17 |
clarkb | was worth a shot anyway | 18:19 |
fungi | yeah, no dice, same kernel oops | 18:19 |
fungi | claims it's got the same openafs package versions installed as the linaro-us mirror server though | 18:20 |
fungi | i'll try to remember to pick ianw's brain on this after the meeting | 18:22 |
ianw | for reference, it appears what happens is that ptgbot writes out a .json file which is read by some static html that uses handlebars (https://handlebarsjs.com/) templates. so no cgi/wsgi | 19:22 |
clarkb | ah in that case we'd want to grab the json | 19:23 |
fungi | but not bothering until after we find out if https://web.archive.org/web/*/ptg.openstack.org has it | 19:26 |
clarkb | ya | 19:26 |
corvus | i think we missed a content-type issue with eavesdrop logs; this used to serve text/plain: https://meetings.opendev.org/irclogs/%23opendev-meeting/%23opendev-meeting.2021-06-15.log | 19:48 |
corvus | but now it's application/log | 19:48 |
opendevreview | Merged opendev/system-config master: ircbot: flush channel logging continuously https://review.opendev.org/c/opendev/system-config/+/796513 | 19:48 |
fungi | corvus: thanks! should be easy enough to fix in the vhost config | 19:50 |
ianw | we might as well update the ppa to 1.8.7 as a first thing | 19:50 |
fungi | not a terrible idea | 19:50 |
fungi | at least if we're going to debug it, let's debug the latest version | 19:51 |
ianw | i don't see any updates in https://launchpad.net/~openafs/+archive/ubuntu/stable | 19:51 |
fungi | it may be stalled by the debian release freeze | 19:52 |
fungi | if they're just slurping in latest debian packages | 19:52 |
fungi | yeah, sid's still at 1.8.6-5 | 19:53 |
ianw | hrm, i think that 1.8.7 only has those fixes for the rollover things which are in 1.8.6-5 | 19:53 |
ianw | i.e. they're essentially the same thing | 19:53 |
fungi | yeah, and there's nothing newer in experimental either | 19:53 |
ianw | there's been no other release | 19:53 |
clarkb | ok with the meeting ended I am now being dragged out the door. I'll see ya'll on Thursday | 19:53 |
fungi | have a good time clarkb! | 19:54 |
clarkb | thanks! | 19:54 |
ianw | i'm just going back in my notes; i'm trying to remember the context | 19:54 |
ianw | i did not seem to note sending that email | 19:55 |
ianw | my only entry is | 19:55 |
ianw | * linaro mirror | 19:56 |
ianw | * broken, reboot, clear afs cache | 19:56 |
ianw | osuosl has moved to libera, but Ramereth is particularly helpful if we find it's an infra thing. trying to log in now | 19:56 |
* Ramereth perks up | 20:02 | |
fungi | ianw: ^ ;) | 20:02 |
fungi | and yeah, not sure yet it's an infrastructure layer issue | 20:03 |
fungi | it could be as simple as differences in the kernel rev between the osuosl and linaro-us mirror servers | 20:03 |
ianw | so the suggested fix, https://gerrit.openafs.org/14093, i think got a bit lost | 20:03 |
ianw | it was in 1.8.7pre1, but then 1.8.7 got taken over as an emergency release for the rollover stuff | 20:04 |
fungi | nevermind, both are on 5.4.0-74-generic #83-Ubuntu SMP | 20:04 |
ianw | if we want that, we should either patch our 1.8.6-5 (== 1.8.7) or try 1.8.8pre1 which is just released | 20:04 |
ianw | http://www.openafs.org/dl/openafs/candidate/1.8.8pre1/RELNOTES-1.8.8pre1 seems safe-ish | 20:05 |
ianw | preferences? i could try either | 20:05 |
fungi | i'd go with the 1.8.8pre1 | 20:15 |
fungi | but i'm a glutton for punishment | 20:15 |
ianw | it is a bit more punishment figuring out which patches have been applied, etc. but otoh it seems to have quite a few things we want | 20:16 |
fungi | the server's already out of commission anyway, so 1.8.8pre1 is at least an opportunity to also give the openafs maintainers some feedback | 20:20 |
opendevreview | Michael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot https://review.opendev.org/c/opendev/system-config/+/795596 | 20:22 |
johnsom | Rebased after statusbot config moved | 20:23 |
Ramereth | ianw: did you need something from me? | 20:26 |
ianw | Ramereth: no, thank you :) i think we're hitting openafs race issues, i think arm64/osuosl is just somehow very lucky in triggering them | 20:27 |
Ramereth | I feel so lucky | 20:33 |
ianw | ok, i've been through debian/patches and pulled out all those applied to 1.8.8pre1 | 20:34 |
ianw | i should probably make a separate ppa to avoid overwriting existing packages | 20:36 |
ianw | https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs-1.8.8pre1 is building so we'll see how that goes | 20:45 |
ianw | we can manually pull that onto the osuosl mirror and test. if we like it, i can upload it to our main ppa for everything | 20:46 |
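A hedged sketch of manually pulling the test build onto the mirror; the PPA is the one linked above, the package set is the standard openafs client trio, and the cache path is the distro default (check /etc/openafs/cacheinfo if it points elsewhere):

```bash
# On the osuosl mirror: add the one-off test PPA and install the 1.8.8pre1 build
sudo add-apt-repository ppa:openstack-ci-core/openafs-1.8.8pre1
sudo apt-get update
sudo apt-get install openafs-client openafs-krb5 openafs-modules-dkms

# Clear the client cache for good measure, then reboot
sudo rm -rf /var/cache/openafs/*
sudo reboot
```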
*** sshnaidm is now known as sshnaidm|afk | 21:15 | |
fungi | clarkb: proposed-cleanups.20210416 is missing a bit of context for me... there are some duplicate addresses in that list though most seem to be singular... also there's a handful of addresses for the same account ids.... and not quite sure if all the blank lines toward the end are relevant or just cruft | 21:22 |
ianw | openafs builds look ok, it's publishing now | 21:22 |
fungi | awesome | 21:23 |
fungi | thanks so much for updating that | 21:23 |
ianw | haha don't thank me until it at least meets the bar of not making things worse :) | 21:24 |
fungi | that's a rather high bar in these here parts ;) | 21:27 |
ianw | mordred: as the author of the comment about caveats to upgrading ansible on bridge; would you have a second to double check https://review.opendev.org/c/opendev/system-config/+/792866/4/playbooks/install-ansible.yaml that i believe notes the issues are solved with the ansible 4.0 package? | 21:33 |
opendevreview | Merged openstack/project-config master: infra-package-needs: stub ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796414 | 21:34 |
ianw | i'm installing 1.8.8pre1 on the osuosl mirror now. i'll clear the cache directory for good measure and reboot | 21:34 |
fungi | ansible 4 is already a thing? i've clearly not been paying close enough attention | 21:34 |
fungi | ianw: be prepared for the reboot to take a while since systemd will wait quite a long time for openafs-client to fail to stop before it gives up | 21:36 |
ianw | fungi: sort of. it's what we'd think of as 2.11 with collections bolted on | 21:37 |
fungi | ahh | 21:38 |
fungi | what happened to ansible 3 i wonder | 21:38 |
ianw | it is there and works in a similar fashion, we just skipped it | 21:41 |
fungi | heh | 21:42 |
fungi | did they decide to just go with single-component versions then? | 21:42 |
ianw | if you mean are all collections tagged at one version; i think the answer is no. each "add-on" keeps its own version and somehow the build magic pulls in its latest release | 21:45 |
ianw | and then the "ansible" package depends on a version of ansible-base (3 is >2.10<2.11, 4 is >2.11<2.12, etc) | 21:46 |
fungi | okay, so ansible-base is versioned independently of ansible now | 21:49 |
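A quick, hedged way to check which combination is actually installed on a host like bridge; whether the engine package shows up as ansible-base or ansible-core depends on the series:

```bash
# The "ansible" meta-package and what engine version it pulled in
pip show ansible | grep -E '^(Version|Requires)'
pip show ansible-core 2>/dev/null || pip show ansible-base

# What the engine itself reports
ansible --version | head -1
```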
ianw | well, that sucks, seems like i got the same oops starting the client | 21:56 |
fungi | :( | 21:59 |
fungi | you're sure dkms rebuilt the lkm? | 21:59 |
ianw | ii openafs-client 1.8.8pre1-1~focal arm64 AFS distributed filesystem client support | 22:06 |
ianw | ii openafs-krb5 1.8.8pre1-1~focal arm64 AFS distributed filesystem Kerberos 5 integration | 22:06 |
ianw | ii openafs-modules-dkms 1.8.8pre1-1~focal all AFS distributed filesystem kernel module DKMS source | 22:06 |
ianw | pretty sure | 22:06 |
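Beyond the package versions above, a hedged sketch of confirming that dkms really built the new module against the running kernel:

```bash
# Was 1.8.8pre1 built and installed for the kernel that is running right now?
dkms status openafs
uname -r

# Does the module on disk report the new version and a matching vermagic?
modinfo openafs | grep -iE '^(version|vermagic)'
```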
mordred | <ianw "mordred: as the author of the co"> Lgtm ... That was long ago ... I believe that has been sorted now | 22:44 |
mordred | ianw: I +2'd - but left off the +A - because that's clearly a pay-attention kind of thing | 22:47 |
mordred | actually - I'm old ... do we use install-ansible as part of the integration tests? or is that one of the things that's different in our live test environment? | 22:48 |
mordred | looks like we do | 22:49 |
mordred | so I believe our integration tests also verify that ansible v4 should be fine | 22:49 |
Alex_Gaynor | Hmm, it looks like there's now 0 linaro arm64 jobs running, even though there's a bunch queued. Has the linaro situation regressed? | 22:52 |
ianw | Alex_Gaynor: umm ... maybe. sorry i'm currently trying to debug the oops we're seeing on the osuosl mirror server | 23:14 |
ianw | we've got a bunch of hosts in build status there | 23:17 |