Tuesday, 2021-06-15

clarkbI'm going to need to step out for dinner shortly. I'll check in on docs later, but it's still locked :/00:03
clarkb`vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs02.dfw.openstack.org -frompartition vicepa -localauth` is the command we need to run (I was running them all from screen on afsdb01). If I don't manage to run that today I'll try to get it done in the morning00:03
opendevreviewMerged opendev/system-config master: Update eavesdrop deploy job  https://review.opendev.org/c/opendev/system-config/+/79600600:08
opendevreviewMerged opendev/system-config master: ircbot: update limnoria  https://review.opendev.org/c/opendev/system-config/+/79632300:08
ianwok i haven't been following closely on that one sorry.  i also have to go afk for a bit for vaccine shots, so back at an indeterminate time00:39
clarkbianw: tldr is we wanted to do reboots and to do those without impacting afs users I did afs02.dfw and afs01.ord first. Then today I held locks and moved RW volumes around and rebooted afs01.dfw. Now I'm just trying to restore things to the way they were before as far as volume assignments go00:43
clarkband it is looking like I won't get to move docs back tonight. I'll check it in the morning00:50
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes  https://review.opendev.org/c/openstack/diskimage-builder/+/79108302:16
*** diablo_rojo is now known as Guest218103:19
ianwand back.  i'll look at getting nb containers up with that mount and f3403:52
*** ykarel|away is now known as ykarel04:18
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes  https://review.opendev.org/c/openstack/diskimage-builder/+/79108304:33
opendevreviewMerged opendev/system-config master: nodepool-builder: add volume for /var/lib/containers  https://review.opendev.org/c/opendev/system-config/+/79570704:38
opendevreviewMerged opendev/system-config master: statusbot: don't prefix with extra # for testing  https://review.opendev.org/c/opendev/system-config/+/79600904:38
*** marios is now known as marios|ruck05:16
opendevreviewMerged openstack/project-config master: Build images for Fedora 34  https://review.opendev.org/c/openstack/project-config/+/79560406:03
*** ysandeep|out is now known as ysandeep06:09
*** iurygregory_ is now known as iurygregory06:19
*** jpena|off is now known as jpena06:34
opendevreviewMerged openstack/project-config master: Removing openstack-state-management from the bots  https://review.opendev.org/c/openstack/project-config/+/79559906:37
*** rpittau|afk is now known as rpittau07:13
*** ykarel is now known as ykarel|lunch07:38
opendevreviewIan Wienand proposed openstack/project-config master: infra-package-needs: stub ntp for Fedora  https://review.opendev.org/c/openstack/project-config/+/79641409:45
ianwmnaser: getting closer ... ^09:45
*** ykarel|lunch is now known as ykarel09:57
dtantsurFYI freenode has reset all registrations apparently. so those of us who stayed to redirect people can no longer join the +r channels09:59
avass[m]is https://review.opendev.org having issues?10:16
avass[m]oh now it loads but that took way longer than usual10:18
ianwdtantsur: yeah, i had left a connection too, hoping some sort of sanity may prevail.  but alas I just closed everything too, the complete end of an era.10:38
*** sshnaidm|afk is now known as sshnaidm11:21
sshnaidmI see there are problems with limestone proxies, fyi https://zuul.opendev.org/t/openstack/build/e8b283b37245414284966011a2d0e49e/log/job-output.txt#124511:23
sshnaidmclarkb, fungi ^11:23
sshnaidmmarios|ruck, ^11:23
*** ysandeep is now known as ysandeep|afk11:24
*** jpena is now known as jpena|lunch11:25
marios|rucksshnaidm: ack yes quite a few this morning but last couple hours have not seen more11:34
marios|rucksshnaidm: i filed that earlier https://bugs.launchpad.net/tripleo/+bug/1931975 and was waiting for afternoon when the folks you pinged would be around 11:35
*** amoralej is now known as amoralej|lunch11:50
ravHi All11:51
ravNeeded info on how i can add gpgkey11:51
ravi want to trigger a pypi release and tag the version using `git tag -s "versionnumber"`11:52
ravbut the tag fails with error gpg skipped: no secret key11:52
ravI have added gpg in my github11:52
ravbut how do i add it on opendev?11:52
fungisshnaidm: looks like it timed out while pulling a cryptography wheel through the proxy from pypi, yeah. must be intermittent because i can pull that same file from the same mirror, though it's also possible the local network between the job nodes and mirror server in limestone are having trouble... have you noticed more than one incident like this today?12:04
fungirav: you don't need to upload your gpg key anywhere for that to work, you just need to have it available locally where you're running git12:05
sshnaidmfungi, yeah, much more, that was just one example12:05
sshnaidmfungi, mostly timeouts, for example marios|ruck submitted an issue with other examples: https://bugs.launchpad.net/tripleo/+bug/193197512:06
fungirav: running `gpg --list-secret-keys` locally where you use git should list your key if you have one12:06
fungisshnaidm: were they all errors pulling files from pypi, or other stuff too?12:07
ravgot it working12:08
sshnaidmfungi, in the issues above the example is with centos repositories12:08
fungisshnaidm: thanks, so probably not just a problem between the mirror server and pypi/fastly12:12
fungicentos packages would be getting served from the afs cache on the server12:12
fungiseems more likely it's a local networking issue between the mirror server and the job nodes in that case12:13
fungii also don't seem to be able to get the xml file mentioned in that bug12:15
fungisshnaidm: oh, maybe not, that's a proxy to https://trunk.rdoproject.org/12:17
*** jpena|lunch is now known as jpena12:17
fungiwe're not mirroring that, just caching/forwarding requests for it12:17
sshnaidmfungi, yeah, though request has died to http://mirror.regionone.limestone.opendev.org:808012:18
sshnaidmtogether with all pypi mirror failures12:18
fungiand i was finally able to get that xml file to pull through the proxy, so maybe this is a performance problem with the cache12:18
fungioh yeah it took me several seconds just to touch a new file in /var/cache/apache2/12:19
fungiload average on the server is nearly 10012:20
sshnaidmfungi, here an example with centos/8-stream repo: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e0d/795019/2/gate/tripleo-ci-centos-8-content-provider/e0d29df/job-output.txt12:20
fungihalf its cpu is consumed by iowait12:20
sshnaidmfungi, that explains a lot :)12:20
fungii'm going to try stopping and starting apache completely there12:21
ravfungi: fatal: 'gerrit' does not appear to be a git repository. I'm getting this error when i run "git push gerrit <tag>"12:23
fungisshnaidm: looks like the problem is htcacheclean being unable to keep up12:24
fungiand there was more than one htcacheclean running, one had been going since friday, the second started today at 08:00 utc12:25
fungii killed the older one and started apache back up for now, but we're going to need to keep a close eye on this... the cache volume is around 80% full when htcacheclean normally wants to keep it closer to 50%12:25
ravok this works. thanks12:26
fungirav: see the third bullet in the note section at https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#tagging-a-release12:26
ravYes, had missed it12:26
fungi#status log Killed a long-running extra htcacheclean process on mirror.regionone.limestone which was driving system load up around 100 from exceptional iowait contention and causing file retrieval problems for jobs run there12:29
opendevstatusfungi: finished logging12:29
*** ysandeep|afk is now known as ysandeep12:33
*** amoralej|lunch is now known as amoralej12:51
*** diablo_rojo__ is now known as diablo_rojo13:13
fungiminotaur launch in a few minutes hopefully, at the spaceport just up the beach from here13:32
fungitoo cloudy here for us to be able to see it clear the horizon, unfortunately13:33
funginasa tv has the livestream from the wallops island facility though13:34
fungilaunch successful!13:37
fungipointed out in #openstack-infra a few minutes ago, channel logging seems to be spotty14:06
fungion eavesdrop, /var/lib/limnoria/opendev/logs/ChannelLogger/oftc/#openstack-infra/#openstack-infra.2021-06-15.log is entirely empty14:09
fungiand #opendev/#opendev.2021-06-15.log is truncated after 12:26:25 today14:10
fungithe /var/lib/limnoria volume isn't anywhere close to being out of space or inodes14:12
corvusit might be easier to write a quick matrix log bot instead of figuring out and fixing whatever ails limnoria14:13
fungiit hasn't logged any errors related to writing channel logs14:15
corvusiiuc, limnoria itself just logs the plaintext; the html formatting is a separate program.  so a log bot doesn't need to do anything other than write out every thing it receives.14:23
fungiyep, potentially filtering (for example we filter the join/part messages so that we don't publish logs of everyone's ip addresses)14:25
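A minimal sketch of the log bot idea discussed above: write out every received line as plaintext, but drop join/part events so that host or IP details are never published. The event shape `(kind, nick, text)`, the `HIDDEN_KINDS` set, and the sample nick and address are hypothetical and only for illustration; a real IRC or Matrix client library would deliver events differently.

```python
import io

# Hypothetical event kinds that should never reach the published log.
HIDDEN_KINDS = {"join", "part", "quit"}

def write_log(events, out):
    """Write each event as one plaintext line, skipping join/part/quit."""
    written = 0
    for kind, nick, text in events:
        if kind in HIDDEN_KINDS:
            continue  # these lines would expose users' host/IP details
        out.write(f"<{nick}> {text}\n")
        written += 1
    return written

# Illustrative input only; the nick and address are made up.
events = [
    ("join", "alice", "alice!~alice@203.0.113.7 has joined"),
    ("message", "alice", "hello"),
    ("part", "alice", "alice!~alice@203.0.113.7 has left"),
]
buf = io.StringIO()
count = write_log(events, buf)
```

Filtering at write time, rather than when rendering HTML, keeps the sensitive lines out of the plaintext files as well.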
*** rpittau is now known as rpittau|afk14:29
gibibtw, #openstack-neutron logging also seems to be stopped during today14:30
gibiand the #openstack-nova log is empty today14:30
fungii'll try restarting the bot during a lull between meetings14:39
clarkbfungi: re limestone, I think the idea was if we caught it in that state again we could have logan- look at the running host. (Disable the cloud in nodepool if necessary)14:44
clarkbgibi: we found a bug in limnoria, which we fixed but are trying to get deployed14:44
fungiclarkb: the new problem seems to be unrelated to the previous limnoria bug you patched14:44
clarkbfungi: oh?14:44
fungialso it looks like the qa and neutron meetings finished a few minutes ago, so i'll stop/start the container before the next meeting block begins14:45
clarkbfungi: is it possible that ianw's fixes to update the image simply undid our manual patching?14:45
fungiclarkb: nope, this doesn't appear related to joining channels, the bot is just suddenly no longer logging conversations for channels it's in14:46
clarkbinfra-root it looks like the afs docs volume is currently locked. Is there a way for me to see the age of the lock and/or release progress? I want to see if I can narrow down when I might be able to move the RW copy or to see if there is something wrong?14:46
clarkbfungi: that is unfortunate14:46
clarkbhttps://grafana.opendev.org/d/T5zTt6PGk/afs?viewPanel=36&orgId=1 makes me suspicious14:51
clarkbis it possible that whatever held that lock has died and the lock is stale? This is a vos release that happens in a cron?14:52
clarkbhrm do we do those vos releases on afs01.dfw? and when I rebooted it maybe I killed a release in progress?14:54
clarkb(we need to update our docs on server reboots if so)14:54
opendevreviewMerged openstack/project-config master: Re-add publish jobs after panko deprecation  https://review.opendev.org/c/openstack/project-config/+/79389114:55
fungiclarkb: new detail on the limnoria bug... seems it's just not getting around to flushing its files. when i stopped the container it wrote out all the many hours of missing lines to the logfiles14:55
*** ykarel_ is now known as ykarel14:56
mordredwell that's nice14:58
clarkbat least that is likely to be an easy fix?14:59
fungiyeah, probably, and we haven't lost any data seems like14:59
fungiclarkb: there's a command to list volume locks, lemme see if i can find it when i get a spare moment15:00
fungioh, and yes the vos releases for static volumes happen via ssh from mirror-update to afs01.dfw15:00
fungiwith a cronjob15:00
clarkbya so I bet what happened was I rebooted in the middle of that vos release (and our docs need an update)15:02
clarkbafter meetings I guess I should try to confirm that better, break the lock, do a release, then do the move?15:03
corvusclarkb, fungi: there's a setting in the limnoria config15:03
corvusto flush15:03
fungioh, maybe that's just missing or wrong15:04
fungisupybot.plugins.ChannelLogger.flushImmediately: False15:08
fungi"Determines whether channel logfiles will be flushed anytime they're written to, rather than being buffered by the operating system."15:08
fungiapparently that buffering can be a day or more, even15:08
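The symptom described here (lines only appearing once the container was stopped) is what buffered file writes look like in general. The sketch below is not limnoria code; it only illustrates, using Python's own file buffering as an assumed stand-in, how a written line can stay invisible on disk until an explicit flush or close. The path and the sample line are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "channel.log")

# A large buffer so that one short line stays in process memory until flushed.
log = open(path, "w", buffering=1024 * 1024)
log.write("<alice> hello\n")
size_before_flush = os.path.getsize(path)  # the line has not reached the file yet

log.flush()  # roughly what a flush-on-every-write setting does after each line
size_after_flush = os.path.getsize(path)
log.close()
```

On a quiet channel the buffer may take a long time to fill, which would be consistent with hours of missing lines showing up all at once on shutdown.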
opendevreviewJeremy Stanley proposed opendev/system-config master: ircbot: flush channel logging continuously  https://review.opendev.org/c/opendev/system-config/+/79651315:13
fungicorvus: thanks! clarkb: gibi: ^15:13
fungihopefully that takes care of it15:13
*** jpena is now known as jpena|off15:18
fungiclarkb: yeah, so vos examine on the docs rw volume says "Volume is locked for a release operation" and we normally only release it via cronjob on mirror-update.o.o15:20
clarkboh is the cron on mirror-update? I rebooted that server last week15:21
clarkbcould the problem be that I rebooted the fileserver during the release instead?15:22
clarkbregardless of where the release is running?15:22
fungiit tried to release the docs volume yesterday at 19:20:02 running `ssh -T -i /root/.ssh/id_vos_release vos_release@afs01.dfw.openstack.org -- vos release docs` and recorded a failure15:22
clarkbah ok the actual release happens remotely on afs01 and that is approximately when I rebooted iirc15:22
clarkb(though `last reboot` can confirm as can irc logs)15:23
clarkbI rebooted around 21:00UTC15:23
clarkbdepending on how long that release was taking I could have rebooted during the release15:24
mordredcorvus: gtema reports being able to set the channel logo in #openstack-ansible-sig without being op there15:24
fungiit last successfully released the docs volume at 18:30:21, subsequent checks showed no vos release needed for that volume up until 19:20:02 when it tried and failed15:24
clarkbfungi: I guess in that case we confirm there is no current release happening on mirror-update, hold a lock on mirror-update to prevent it trying again, manually break the lock, then manually rerun the release?15:25
clarkbfungi: then after that is done we can move the volume again?15:25
fungicorrect, and be prepared for it to decide this needs to be a full re-release and take days because of the ro replica in ord15:25
clarkbok after this meeting I'll look into starting that15:26
mordredcorvus: nevermind - he had elevated privs but wasn't aware of it15:26
clarkbI'll put everything in screen too as I'll be out the next 1.5 days starting after our meeting today15:26
fungiclarkb: the vos release very well may take that long if it decides to retransfer all the data from dfw to ord15:27
corvusmordred: yeah, i was gonna say, matrix says he's a mod :)15:27
clarkbfungi: I think the docs need to be updated to note that rebooting RO servers is not safe if vos releases can happen?15:27
clarkbwe need to hold locks for anything that may vos release essentially regardless of the RO or RW hosting state of the fileserver15:28
fungiclarkb: yeah, it should probably mention temporarily disabling the release-volumes.py job in the root crontab on mirror-update.o.o15:28
clarkbfungi: and holding the mirror locks too15:28
clarkbcurrently it basically says you can reboot an RO server safely15:29
clarkbwhich is why i moved all the volumes15:29
fungianything that's remotely calling vos release on the fileservers so they can use -localauth15:29
clarkboh I see it is specifically due to us running the vos release from afs0115:29
clarkbso not generic that RO isn't safe during a reboot, but that the controlling release process needs to be safe. Got it15:29
fungiyeah, the script does an ssh from mirror-update to afs01.dfw and calls vos release in a shell on it, in order to get around the kerberos ticket timeouts15:30
fungiand your reboot may have killed the vos release process15:30
clarkbya I think it must've15:30
clarkbbut I need to ssh into the mirror and check that no other release is running and then figure out how to break the lock etc15:30
clarkb(doesn't look like breaking locks is in our docs?)15:31
fungivos unlock15:31
clarkbhttps://docs.openafs.org/AdminGuide/HDRWQ247.html yup google found it15:31
fungii've done it like `vos unlock -localauth docs`15:32
fungias root on afs01.dfw15:32
clarkbthanks. I'll get that moving once i check the mirror-update side after this meeting. I'll use a screen session on mirror-update and afs01.dfw too so that others can check on it if necessary15:32
fungiand then try to `vos release -localauth docs` after that, but you'll want a screen session for that since it may take a long time15:33
clarkbyup ++15:33
*** ysandeep is now known as ysandeep|out15:41
opendevreviewMerged openstack/project-config master: Revert "Accessbot OFTC channel list stopgap"  https://review.opendev.org/c/openstack/project-config/+/79285715:42
clarkbfungi: looks like /opt/afs-release/release-volumes.py is the script run on mirror-update to do the release15:44
fungii guess if you want to temporarily comment out the cronjob, that implies disabling ansible for the mirror-update server as well so it doesn't get readded15:45
clarkb/var/run/release-volumes.lock is the lockfile that script appears to try and hold. I'm going to try and hold that lockfile instead15:45
clarkb`flock -n /var/run/release-volumes.lock bash` seems to be running successfully in my screen now implying I have the lock15:47
clarkbfungi: ^ if you think that is sufficient I'll do the `vos unlock -localauth docs` next15:47
fungiyeah, that looks right15:49
fungiand probably simpler than fiddling the crontab15:49
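A sketch of why holding the lockfile by hand keeps the cron job out of the way, assuming (as the conversation suggests but does not confirm) that release-volumes.py takes a non-blocking exclusive `flock` on that file. The `try_lock` helper and the temporary path are hypothetical; the real lockfile is /var/run/release-volumes.lock.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

lock_path = os.path.join(tempfile.mkdtemp(), "release-volumes.lock")

operator = try_lock(lock_path)  # plays the role of the manual `flock -n ... bash`
cron_run = try_lock(lock_path)  # a second attempt, as a scheduled run would make

operator.close()                # exiting the shell drops the lock
later_run = try_lock(lock_path) # the next scheduled run can proceed again
```

Because the lock is tied to the open file, it disappears as soon as the holding process exits, so there is no stale lockfile to clean up afterwards.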
clarkblock has been released I am going to start the vos release now15:50
clarkbfor record keeping purposes there is a root screen on mirror-update that is holding the release-volumes.lock file with flock -n /var/run/release-volumes.lock bash. Then on afs01.dfw.openstack.org I have another root screen that is running the vos release15:51
clarkbOnce the vos release on afs01.dfw.openstack.org completes we can exit the processes on mirror-update to release the lock. Then later we can move the volume. I'm not sure I'll be around when the release completes, but I can do the followups when I get back on thursday if no one beats me to it15:52
*** amoralej is now known as amoralej|off15:53
*** marios|ruck is now known as marios|out15:57
opendevreviewSlawek Kaplonski proposed openstack/project-config master: Add members of the neutron drivers team as ops in neutron channel  https://review.opendev.org/c/openstack/project-config/+/79652115:59
clarkbfungi: it claims it is done already actually. Can you double check it? I'll keep holding the lock so that i can do the move without contention but then drop the lock when the move is done if it looks good to you16:00
fungioh, sure. that was the other possibility: that it decided it could do an incremental completion safely16:01
fungidid you try a second vos release yet, btw?16:01
fungiusually if one is interrupted, your first vos release will only attempt to complete the interrupted transfers16:01
clarkbI have not. I can do that first16:02
fungithen a second one will catch up to current16:02
clarkbOnly the one has completed. Will do second one now16:02
fungivos examine is claiming it's locked, but that's probably you16:03
clarkbyup since I just started another release16:04
clarkbfungi: that release shows it is done now though so you can check it again for no lock and I have done two releases in succession16:04
fungiclarkb: yep, looks great, not locked, all ro volumes are current16:04
fungishould be safe to proceed with the move16:05
clarkbdoing so now, thanks for checking16:07
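The recovery sequence worked out above (unlock, release twice, then move the RW volume back) can be written down as an ordered plan. This sketch only builds and prints the commands quoted in the conversation; it executes nothing, and the `recovery_commands` helper is hypothetical. The commands are meant to be run as root on afs01.dfw, ideally inside a screen session, while the release-volumes lock is held on mirror-update.

```python
import shlex

def recovery_commands(volume, from_server, to_server, partition="vicepa"):
    """Return the argv lists, in order: unlock, two releases, then move."""
    release = ["vos", "release", "-localauth", volume]
    return [
        ["vos", "unlock", "-localauth", volume],
        release,        # first release finishes the interrupted transfer
        list(release),  # second release catches the replicas up to current
        ["vos", "move", "-id", volume,
         "-toserver", to_server, "-topartition", partition,
         "-fromserver", from_server, "-frompartition", partition,
         "-localauth"],
    ]

plan = recovery_commands("docs", "afs02.dfw.openstack.org", "afs01.dfw.openstack.org")
for argv in plan:
    print(shlex.join(argv))  # printed for review only; nothing is executed
```

Checking the volume state between steps, as was done here with vos examine and vos listvldb, remains a manual step in this sketch.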
gmannclarkb: fungi; is http://ptg.openstack.org/ down? I was dumping the Xena etherpads to wiki and seems it is not working16:24
*** ricolin_ is now known as ricolin16:26
clarkbI think the foundation runs that?16:27
* clarkb looks16:27
clarkboh no its on eavesdrop til16:27
clarkbgmann: the issue is that we are migrating irc services from the old eavesdrop server to a new one and we must've turned off apache?16:28
clarkbgmann: ianw has been working on that migration16:28
fungialso any content on that site would have been about the last ptg not the next one16:29
clarkbfungi: should we go ahead and approve https://review.opendev.org/c/opendev/system-config/+/796513/16:29
fungiclarkb: yeah, old eavesdrop server is completely offline16:29
gmannyeah, I want to get the Xena etherpad list. I hope that is still there16:29
gmannXena PTG16:29
fungiclarkb: for the limnoria config change, i guess that won't redeploy/restart the container?16:29
clarkbfungi: I am not sure16:30
clarkbgmann: I am not sure about your question either16:30
gmannclarkb: I mean this one http://ptg.openstack.org/etherpads.html16:30
fungiwe may need to restore a copy of the ptgbot sqlite db somewhere to get that data16:30
fungi"The Internet Archive's sites are temporarily offline due to a planned power outage."16:31
fungithat's unfortunate, i was going to see if they had a copy16:31
gmannok, thanks16:32
fungigmann: but no, we haven't deleted the old server yet, merely turned it off16:33
fungiif it's not urgent let's see if there's an archive of the site content at archive.org once they're back up, otherwise we can try to temporarily turn the ptg.o.o site back on or try to restore its data from backups16:34
gmannwe do not have all the etherpads anywhere so thought of dumping it on wiki for future ref. but not so urgent so we can go with  'ptg.o.o site back on or try'16:37
gmannmay be we can automate dumping etherpad to wiki or so right after ptg. 16:38
fungiyeah, all good ideas16:39
clarkbfungi: the docs RW volume move has completed. Can you double check vos listvldb and if it looks good to you I'll stop holding the lock on mirror-update?16:48
clarkbfwiw the listing lgtm but want to double check so that I'm not competing with some automated process to relock and fix things16:48
* clarkb makes breakfast16:48
fungiclarkb: yep, all looks good, nothing locked, nothing old, only ro sites on afs02.dfw and afs01.ord16:51
fungiall rw volumes are on afs01.dfw16:51
clarkbfungi: looks like no RO site on afs02, it is afs01.dfw and afs01.ord16:56
clarkbbut that is the way it was before aiui16:56
clarkbfungi: you good with me dropping the lock now?16:56
fungiyeah, should be safe16:59
fungigo for it16:59
fungithere are 40 ro sites on afs02.dfw by the way17:00
fungithat is what i meant by "only ro sites on afs02.dfw and afs01.ord"17:01
clarkbI have closed down all the screens and released the lock. We should be back to normal now17:06
Alex_Gaynor 👋 it looks like maybe linaro is having issues this morning. We're seeing delays in arm64 jobs starting, and https://grafana.opendev.org/d/S1zTp6EGz/nodepool-linaro?orgId=1 seems to say that there's been 13 nodes stuck in "deleting" for 6 hours?17:12
clarkbfungi: ^ you restarted the launcher for linaro to unstick some nodes too?17:15
clarkbAlex_Gaynor: I think there was an issue that got addressed, but it is possible that allowed new issues further down the pipeline to pop up17:15
Alex_Gaynorclarkb: So it's possible either there's still an issue, or this is just the backlog slowly clearing out?17:15
clarkbAlex_Gaynor: ya. I'm taking a quick look at the logs and status the service reports directly to see if we can infer more17:16
clarkbAlex_Gaynor: there are definitely a number of deleting instances from far too long ago. Next step for those will be to check the cloud side and determine if we leaked or they leaked. I do see building instances and ready instances though so I think it is moving. But possibly slowly due to backlog17:18
Alex_Gaynorclarkb: it looks like there's a handful of instance that have been sitting in "ready" for a while, is there some reason they wouldn't be assigned to jobs?17:18
clarkbAlex_Gaynor: the two most common things are that they are part of a larger nodeset that hasn't managed to fully allocate yet or min-ready caused them to boot without direct demand. That said it is also possible there is an issue (like the one that instigated the earlier restart) that prevents those from being assigned17:19
clarkbfungi: ^ would know better what that looked like as I wasn't awake when it happened17:20
clarkbAlex_Gaynor: I do see an instance that went into use 3 minutes ago so something is making progress17:20
clarkboh and another is under a minute old17:20
clarkbok the 55 day old deletes are just noise I think. They are labeled under a different provider (they are linaro-us not linaro-us-regionone). We need to clean those up in our db but they don't appear to be in the actual cloud (no quota consumption)17:23
clarkbI suspect for the other deletions their  builds timed out because the cloud shows them in a BUILD state17:24
Alex_Gaynorthat sounds like something that would be consuming quota?17:25
clarkbyes the ones in a BUILD state on the cloud side are most likely consuming quota17:25
clarkbnodepool appears to have been attempting to delete at least some of those instances for some time17:26
clarkbI think we may need kevinz to look into this since the cloud isn't honoring our requests17:26
clarkband also we should see if ianw remembers what the magic was to get the osuosl mirror running again17:26
clarkb(then we can enable that cloud again)17:26
fungiclarkb: sorry, in a meeting right now, but yes i restarted the launcher responsible for linaro-us to free some stale locks on node requests17:27
fungithere were a few changes which had been sitting waiting on node assignments which the launcher thought it had declined17:28
clarkbfungi: fwiw it seems like things are moving there but we've got a number of nodepool deleting state instances that show BUILD state in the cloud and seem to refuse to go away. I think we need to reach out to kevinz17:28
fungiyes, it seems like that's probably constricting the available quota (and having the osuosl cloud down isn't helping matters)17:29
clarkbwe appear to have room for 5 job nodes right now. There are two "ready" nodes. If those come from min-ready we might try setting those values to 0 to get 5 bumped to 717:31
clarkbafter this meeting I can write an email to kevinz17:31
*** ricolin_ is now known as ricolin17:32
clarkbwe are up to 7 in use now. slightly better than my earlier counts17:32
fungidid it end up consuming the two in ready state?17:37
clarkbfungi: no I suspect those are min-ready allocations17:38
clarkbI'm drafting up the email now17:38
clarkbAlex_Gaynor: fungi: email sent to kevinz (I also pointed out we are on OFTC now in case IRC confusion happens). Hopefully that can be addressed later today when kevinz's day starts17:45
clarkbfungi: maybe we should try some osuosl mirror reboots just for fun? I've got to prep for the infra meeting that is in just over an hour though17:46
fungiyeah, i'm about to resurface from my meeting-induced urgent task backlog, and can try a few more things17:48
fungiworking on our startup race theory, i've got openafs-client.service disabled and am waiting for the server to reboot, then will try to modprobe the openafs lkm before manually starting openafs-client18:07
fungihopefully that will give us a good idea of whether it's racing the module loading18:08
fungimanually did `sudo modprobe openafs`18:16
funginow lsmod confirms the openafs lkm is loaded18:16
fungitrying to `sudo systemctl start openafs-client`18:16
funginot looking good18:17
clarkbwas worth a shot anyway18:19
fungiyeah, no dice, same kernel oops18:19
fungiclaims it's got the same openafs package versions installed as the linaro-us mirror server though18:20
fungii'll try to remember to pick ianw's brain on this after the meeting18:22
ianwfor reference, it appears what happens is that ptgbot writes out a .json file which is read by some static html that uses handlebars (https://handlebarsjs.com/) templates.  so no cgi/wsgi19:22
clarkbah in that case we'd want to grab the json19:23
fungibut not bothering until after we find out if https://web.archive.org/web/*/ptg.openstack.org has it19:26
corvusi think we missed a content-type issue with eavesdrop logs; this used to serve text/plain: https://meetings.opendev.org/irclogs/%23opendev-meeting/%23opendev-meeting.2021-06-15.log19:48
corvusbut now it's application/log19:48
opendevreviewMerged opendev/system-config master: ircbot: flush channel logging continuously  https://review.opendev.org/c/opendev/system-config/+/79651319:48
fungicorvus: thanks! should be easy enough to fix in the vhost config19:50
ianwwe might as well update the ppa to 1.8.7 as a first thing19:50
funginot a terrible idea19:50
fungiat least if we're going to debug it, let's debug the latest version19:51
ianwi don't see any updates in https://launchpad.net/~openafs/+archive/ubuntu/stable19:51
fungiit may be stalled by the debian release freeze19:52
fungiif they're just slurping in latest debian packages19:52
fungiyeah, sid's still at 1.8.6-519:53
ianwhrm, i think that 1.8.7 only has those fixes for the rollover things which are in 1.8.6-519:53
ianwi.e. they're essentially the same thing19:53
fungiyeah, and there's nothing newer in experimental either19:53
ianwthere's been no other release19:53
clarkbok with the meeting ended I am now being dragged out the door. I'll see ya'll on Thursday19:53
fungihave a good time clarkb!19:54
ianwi'm just going back in my notes; i'm trying to remember the context19:54
ianwi did not seem to note sending that email19:55
ianwmy only entry is19:55
ianw* linaro mirror19:56
ianw  * broken, reboot, clear afs cache19:56
ianwosuosl has moved to libera, but Ramereth is particularly helpful if we find it's an infra thing.  trying to log in now19:56
* Ramereth perks up20:02
fungiianw: ^ ;)20:02
fungiand yeah, not sure yet it's an infrastructure layer issue20:03
fungiit could be as simple as differences in the kernel rev between the osuosl and linaro-us mirror servers20:03
ianwso the suggested fix, https://gerrit.openafs.org/14093, i think got a bit lost20:03
ianwit was in 1.8.7pre1, but then 1.8.7 got taken over as an emergency release for the rollover stuff20:04
funginevermind, both are on 5.4.0-74-generic #83-Ubuntu SMP20:04
ianwif we want that, we should either patch our 1.8.6-5 (== 1.8.7) or try 1.8.8pre1 which is just released20:04
ianwhttp://www.openafs.org/dl/openafs/candidate/1.8.8pre1/RELNOTES-1.8.8pre1 seems safe-ish20:05
ianwpreferences?  i could try either20:05
fungii'd go with the 1.8.8pre120:15
fungibut i'm a glutton for punishment20:15
ianwit is a bit more punishment figuring out which patches have been applied, etc.  but otoh it seems to have quite a few things we want20:16
fungithe server's already out of commission anyway, so 1.8.8pre1 is at least an opportunity to also give the openafs maintainers some feedback20:20
opendevreviewMichael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot  https://review.opendev.org/c/opendev/system-config/+/79559620:22
johnsomRebased after statusbot config moved20:23
Ramerethianw: did you need something from me?20:26
ianwRamereth: no, thank you :)  i think we're hitting openafs race issues, i think arm64/osuosl is just somehow very lucky in triggering them20:27
RamerethI feel so lucky20:33
ianwok, i've been through debian/patches and pulled out all those applied to 1.8.8pre120:34
ianwi should probably make a separate ppa to avoid overwriting existing packages20:36
ianwhttps://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs-1.8.8pre1 is building so we'll see how that goes20:45
ianwwe can manually pull that onto the osuosl mirror and test.  if we like it, i can upload it to our main ppa for everything20:46
*** sshnaidm is now known as sshnaidm|afk21:15
fungiclarkb: proposed-cleanups.20210416 is missing a bit of context for me... there are some duplicate addresses in that list though most seem to be singular... also there's a handful of addresses for the same account ids.... and not quite sure if all the blank lines toward the end are relevant or just cruft21:22
ianwopenafs builds look ok, it's publishing now21:22
fungithanks so much for updating that21:23
ianwhaha don't thank me until it at least meets the bar of not making things worse :)21:24
fungithat's a rather high bar in these here parts ;)21:27
ianwmordred: as the author of the comment about caveats to upgrading ansible on bridge; would you have a second to double check https://review.opendev.org/c/opendev/system-config/+/792866/4/playbooks/install-ansible.yaml that i believe notes the issues are solved with the ansible 4.0 package?21:33
opendevreviewMerged openstack/project-config master: infra-package-needs: stub ntp for Fedora  https://review.opendev.org/c/openstack/project-config/+/79641421:34
ianwi'm installing 1.8.8pre1 on osuosl mirror now.  i'll clear the cache directory for good measure and reboot21:34
fungiansible 4 is already a thing? i've clearly not been paying close enough attention21:34
fungiianw: be prepared for the reboot to take a while since systemd will wait quite a long time for openafs-client to fail to stop before it gives up21:36
ianwfungi: sort of.  it's what we'd think of as 2.11 with collections bolted on21:37
fungiwhat happened to ansible 3 i wonder21:38
ianwit is there and works in a similar fashion, we just skipped it21:41
fungidid they decide to just go with single-component versions then?21:42
ianwif you mean are all collections tagged at one version; i think the answer is no.  each "add-on" keeps its own version and somehow the build magic pulls in its latest release21:45
ianwand then the "ansible" package depends on a version of ansible-base (3 is >2.10<2.11, 4 is >2.11<2.12, etc)21:46
fungiokay, so ansible-base is versioned independently of ansible now21:49
ianwwell, that sucks, seems like i got the same oops starting the client21:56
fungiyou're sure dkms rebuilt the lkm?21:59
ianwii  openafs-client                 1.8.8pre1-1~focal                  arm64        AFS distributed filesystem client support22:06
ianwii  openafs-krb5                   1.8.8pre1-1~focal                  arm64        AFS distributed filesystem Kerberos 5 integration22:06
ianwii  openafs-modules-dkms           1.8.8pre1-1~focal                  all          AFS distributed filesystem kernel module DKMS source22:06
ianwpretty sure22:06
mordred<ianw "mordred: as the author of the co"> Lgtm ... That was long ago ... I believe that has been sorted now22:44
mordredianw: I +2'd - but left off the +A - because that's clearly a pay-attention kind of thing22:47
mordredactually - I'm old ... do we use install-ansible as part of the integration tests? or is that one of the things that's different in our live test environment?22:48
mordredlooks like we do22:49
mordredso I believe our integration tests also verify that ansible v4 should be fine22:49
Alex_GaynorHmm, it looks like there's now 0 linaro arm64 jobs running, even though there's a bunch queued. Has the linaro situation regressed?22:52
ianwAlex_Gaynor: umm ... maybe.  sorry i'm currently trying to debug the oops we're seeing on the osuosl mirror server23:14
ianwwe've got a bunch of hosts in build status there23:17
