clarkb | I'm going to need to step out for dinner shortly. I'll check in on docs later, but it's still locked :/ | 00:03 |
clarkb | `vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs02.dfw.openstack.org -frompartition vicepa -localauth` is the command we need to run (I was running them all from screen on afsdb01). If I don't manage to run that today I'll try to get it done in the morning | 00:03 |
opendevreview | Merged opendev/system-config master: Update eavesdrop deploy job https://review.opendev.org/c/opendev/system-config/+/796006 | 00:08 |
opendevreview | Merged opendev/system-config master: ircbot: update limnoria https://review.opendev.org/c/opendev/system-config/+/796323 | 00:08 |
ianw | ok i haven't been following closely on that one sorry. i also have to go afk for a bit for vaccine shots, so back at an indeterminate time | 00:39 |
clarkb | ianw: tldr is we wanted to do reboots and to do those without impacting afs users I did afs02.dfw and afs01.ord first. Then today I held locks and moved RW volumes around and rebooted afs01.dfw. Now I'm just trying to restore things to the way they were before as far as volume assignments go | 00:43 |
clarkb | and it is looking like I won't get to move docs back tonight. I'll check it in the morning | 00:50 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 02:16 |
*** diablo_rojo is now known as Guest2181 | 03:19 | |
ianw | and back. i'll look at getting nb containers up with that mount and f34 | 03:52 |
*** ykarel|away is now known as ykarel | 04:18 | |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 04:33 |
opendevreview | Merged opendev/system-config master: nodepool-builder: add volume for /var/lib/containers https://review.opendev.org/c/opendev/system-config/+/795707 | 04:38 |
opendevreview | Merged opendev/system-config master: statusbot: don't prefix with extra # for testing https://review.opendev.org/c/opendev/system-config/+/796009 | 04:38 |
*** marios is now known as marios|ruck | 05:16 | |
opendevreview | Merged openstack/project-config master: Build images for Fedora 34 https://review.opendev.org/c/openstack/project-config/+/795604 | 06:03 |
*** ysandeep|out is now known as ysandeep | 06:09 | |
*** iurygregory_ is now known as iurygregory | 06:19 | |
*** jpena|off is now known as jpena | 06:34 | |
opendevreview | Merged openstack/project-config master: Removing openstack-state-management from the bots https://review.opendev.org/c/openstack/project-config/+/795599 | 06:37 |
*** rpittau|afk is now known as rpittau | 07:13 | |
*** ykarel is now known as ykarel|lunch | 07:38 | |
opendevreview | Ian Wienand proposed openstack/project-config master: infra-package-needs: stub ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796414 | 09:45 |
ianw | mnaser: getting closer ... ^ | 09:45 |
*** ykarel|lunch is now known as ykarel | 09:57 | |
dtantsur | FYI freenode has reset all registrations apparently. so those of us who stayed to redirect people can no longer join the +r channels | 09:59 |
avass[m] | is https://review.opendev.org having issues? | 10:16 |
avass[m] | oh now it loads but that took way longer than usual | 10:18 |
ianw | dtantsur: yeah, i had left a connection too, hoping some sort of sanity may prevail. but alas I just closed everything too, the complete end of an era. | 10:38 |
dtantsur | indeed | 10:40 |
*** sshnaidm|afk is now known as sshnaidm | 11:21 | |
sshnaidm | I see there are problems with limestone proxies, fyi https://zuul.opendev.org/t/openstack/build/e8b283b37245414284966011a2d0e49e/log/job-output.txt#1245 | 11:23 |
sshnaidm | clarkb, fungi ^ | 11:23 |
sshnaidm | marios|ruck, ^ | 11:23 |
*** ysandeep is now known as ysandeep|afk | 11:24 | |
*** jpena is now known as jpena|lunch | 11:25 | |
marios|ruck | sshnaidm: ack yes quite a few this morning but last couple hours have not seen more | 11:34 |
marios|ruck | sshnaidm: i filed that earlier https://bugs.launchpad.net/tripleo/+bug/1931975 and was waiting for afternoon when the folks you pinged would be around | 11:35 |
*** amoralej is now known as amoralej|lunch | 11:50 | |
rav | Hi All | 11:51 |
rav | Needed info on how i can add a gpg key | 11:51 |
rav | i want to trigger a pypi release and tag the version using "git tag -s <versionnumber>" | 11:52 |
rav | but the tag fails with error gpg skipped: no secret key | 11:52 |
rav | I have added my gpg key in my github | 11:52 |
rav | but how do i add it on opendev? | 11:52 |
fungi | sshnaidm: looks like it timed out while pulling a cryptography wheel through the proxy from pypi, yeah. must be intermittent because i can pull that same file from the same mirror, though it's also possible the local network between the job nodes and mirror server in limestone are having trouble... have you noticed more than one incident like this today? | 12:04 |
fungi | rav: you don't need to upload your gpg key anywhere for that to work, you just need to have it available locally where you're running git | 12:05 |
sshnaidm | fungi, yeah, much more, that was just one example | 12:05 |
sshnaidm | fungi, mostly timeouts, for example marios|ruck submitted an issue with other examples: https://bugs.launchpad.net/tripleo/+bug/1931975 | 12:06 |
fungi | rav: running `gpg --list-secret-keys` locally where you use git should list your key if you have one | 12:06 |
fungi | sshnaidm: were they all errors pulling files from pypi, or other stuff too? | 12:07 |
rav | got it working | 12:08 |
sshnaidm | fungi, in the issues above the example is with centos repositories | 12:08 |
fungi | sshnaidm: thanks, so probably not just a problem between the mirror server and pypi/fastly | 12:12 |
fungi | centos packages would be getting served from the afs cache on the server | 12:12 |
fungi | seems more likely it's a local networking issue between the mirror server and the job nodes in that case | 12:13 |
sshnaidm | yeah | 12:14 |
fungi | i also don't seem to be able to get the xml file mentioned in that bug | 12:15 |
fungi | sshnaidm: oh, maybe not, that's a proxy to https://trunk.rdoproject.org/ | 12:17 |
*** jpena|lunch is now known as jpena | 12:17 | |
fungi | we're not mirroring that, just caching/forwarding requests for it | 12:17 |
sshnaidm | fungi, yeah, though the request that died went to http://mirror.regionone.limestone.opendev.org:8080 | 12:18 |
sshnaidm | together with all pypi mirror failures | 12:18 |
fungi | and i was finally able to get that xml file to pull through the proxy, so maybe this is a performance problem with the cache | 12:18 |
fungi | oh yeah it took me several seconds just to touch a new file in /var/cache/apache2/ | 12:19 |
fungi | load average on the server is nearly 100 | 12:20 |
sshnaidm | fungi, here an example with centos/8-stream repo: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e0d/795019/2/gate/tripleo-ci-centos-8-content-provider/e0d29df/job-output.txt | 12:20 |
fungi | half its cpu is consumed by iowait | 12:20 |
sshnaidm | fungi, that explains a lot :) | 12:20 |
fungi | i'm going to try stopping and starting apache completely there | 12:21 |
rav | fungi: fatal: 'gerrit' does not appear to be a git repository. I'm getting this error when i run "git push gerrit <tag>" | 12:23 |
fungi | sshnaidm: looks like the problem is htcacheclean being unable to keep up | 12:24 |
fungi | and there was more than one htcacheclean running, one had been going since friday, the second started today at 08:00 utc | 12:25 |
fungi | i killed the older one and started apache back up for now, but we're going to need to keep a close eye on this... the cache volume is around 80% full when htcacheclean normally wants to keep it closer to 50% | 12:25 |
rav | ok this works. thanks | 12:26 |
fungi | rav: see the third bullet in the note section at https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#tagging-a-release | 12:26 |
rav | Yes, had missed it | 12:26 |
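For reference, a minimal sketch of the signed-tag-and-push workflow discussed above, assuming the gpg key and the `gerrit` remote still need to be set up locally; the key id, version number, username and repository path are all placeholders:

```bash
# Confirm a secret key is available where git runs
gpg --list-secret-keys --keyid-format long

# Tell git which key to sign with (key id is a placeholder)
git config user.signingkey 0xDEADBEEFCAFEF00D

# Create the signed tag (version number is a placeholder)
git tag -s 1.2.3 -m "Release 1.2.3"

# Add the gerrit remote if it is missing, then push the tag
git remote add gerrit ssh://<username>@review.opendev.org:29418/<namespace>/<project>
git push gerrit 1.2.3
```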
fungi | #status log Killed a long-running extra htcacheclean process on mirror.regionone.limestone which was driving system load up around 100 from exceptional iowait contention and causing file retrieval problems for jobs run there | 12:29 |
opendevstatus | fungi: finished logging | 12:29 |
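For context, a hedged sketch of the sort of manual check and one-shot cleanup involved in the htcacheclean incident above; the cache path and size limit are illustrative and should come from the server's actual CacheRoot and htcacheclean configuration:

```bash
# How full is the apache cache volume?
df -h /var/cache/apache2

# Are there stacked-up htcacheclean processes, and how long have they run?
ps -eo pid,etime,cmd | grep '[h]tcacheclean'

# One-shot cleanup pass (-n be nice to other I/O, -t remove empty directories);
# the path and limit here are placeholders for the real CacheRoot and target size
sudo htcacheclean -n -t -p /var/cache/apache2/mod_cache_disk -l 20G
```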
*** ysandeep|afk is now known as ysandeep | 12:33 | |
*** amoralej|lunch is now known as amoralej | 12:51 | |
*** diablo_rojo__ is now known as diablo_rojo | 13:13 | |
fungi | minotaur launch in a few minutes hopefully, at the spaceport just up the beach from here | 13:32 |
fungi | too cloudy here for us to be able to see it clear the horizon, unfortunately | 13:33 |
fungi | nasa tv has the livestream from the wallops island facility though | 13:34 |
fungi | launch successful! | 13:37 |
fungi | pointed out in #openstack-infra a few minutes ago, channel logging seems to be spotty | 14:06 |
fungi | on eavesdrop, /var/lib/limnoria/opendev/logs/ChannelLogger/oftc/#openstack-infra/#openstack-infra.2021-06-15.log is entirely empty | 14:09 |
fungi | and #opendev/#opendev.2021-06-15.log is truncated after 12:26:25 today | 14:10 |
fungi | the /var/lib/limnoria volume isn't anywhere close to being out of space or inodes | 14:12 |
corvus | it might be easier to write a quick matrix log bot instead of figuring out and fixing whatever ails limnoria | 14:13 |
fungi | it hasn't logged any errors related to writing channel logs | 14:15 |
corvus | iiuc, limnoria itself just logs the plaintext; the html formatting is a separate program. so a log bot doesn't need to do anything other than write out everything it receives. | 14:23 |
fungi | yep, potentially filtering (for example we filter the join/part messages so that we don't publish logs of everyone's ip addresses) | 14:25 |
*** rpittau is now known as rpittau|afk | 14:29 | |
gibi | btw, #openstack-neutron logging also seems to have stopped today | 14:30 |
gibi | and the #openstack-nova log is empty today | 14:30 |
fungi | i'll try restarting the bot during a lull between meetings | 14:39 |
clarkb | fungi: re limestone, I think the idea was if we caught it in that state again we could have logan- look at the running host. (Disable the cloud in nodepool if necessary) | 14:44 |
clarkb | gibi: we found a bug in limnoria, which we fixed but are trying to get deployed | 14:44 |
fungi | clarkb: the new problem seems to be unrelated to the previous limnoria bug you patched | 14:44 |
clarkb | fungi: oh? | 14:44 |
clarkb | fun | 14:45 |
fungi | also it looks like the qa and neutron meetings finished a few minutes ago, so i'll stop/start the container before the next meeting block begins | 14:45 |
clarkb | fungi: is it possible that ianw's fixes to update the image simply undid our manual patching? | 14:45 |
fungi | clarkb: nope, this doesn't appear related to joining channels, the bot is just suddenly no longer logging conversations for channels it's in | 14:46 |
clarkb | infra-root it looks like the afs docs volume is currently locked. Is there a way for me to see the age of the lock and/or release progress? I want to see if I can narrow down when I might be able to move the RW copy or to see if there is something wrong? | 14:46 |
clarkb | fungi: that is unfortunate | 14:46 |
clarkb | https://grafana.opendev.org/d/T5zTt6PGk/afs?viewPanel=36&orgId=1 makes me suspicious | 14:51 |
clarkb | is it possible that whatever held that lock has died and the lock is stale? This is a vos release that happens in a cron? | 14:52 |
clarkb | hrm do we do those vos releases on afs01.dfw? and when I rebooted it maybe I killed a release in progress? | 14:54 |
clarkb | (we need to update our docs on server reboots if so) | 14:54 |
opendevreview | Merged openstack/project-config master: Re-add publish jobs after panko deprecation https://review.opendev.org/c/openstack/project-config/+/793891 | 14:55 |
fungi | clarkb: new detail on the limnoria bug... seems it's just not getting around to flushing its files. when i stopped the container it wrote out all the many hours of missing lines to the logfiles | 14:55 |
*** ykarel_ is now known as ykarel | 14:56 | |
mordred | well that's nice | 14:58 |
clarkb | at least that is likely to be an easy fix? | 14:59 |
fungi | yeah, probably, and we haven't lost any data seems like | 14:59 |
fungi | clarkb: there's a command to list volume locks, lemme see if i can find it when i get a spare moment | 15:00 |
fungi | oh, and yes the vos releases for static volumes happen via ssh from mirror-update to afs01.dfw | 15:00 |
fungi | with a cronjob | 15:00 |
clarkb | ya so I bet what happened was I rebooted in the middle of that vos release (and our docs need an update) | 15:02 |
clarkb | after meetings I guess I should try to confirm that better, break the lock, do a release, then do the move? | 15:03 |
corvus | clarkb, fungi: there's a setting in the limnoria config | 15:03 |
corvus | to flush | 15:03 |
fungi | oh, maybe that's just missing or wrong | 15:04 |
fungi | supybot.plugins.ChannelLogger.flushImmediately: False | 15:08 |
fungi | "Determines whether channel logfiles will be flushed anytime they're written to, rather than being buffered by the operating system." | 15:08 |
fungi | apparently that buffering can be a day or more, even | 15:08 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: ircbot: flush channel logging continuously https://review.opendev.org/c/opendev/system-config/+/796513 | 15:13 |
fungi | corvus: thanks! clarkb: gibi: ^ | 15:13 |
fungi | hopefully that takes care of it | 15:13 |
gibi | thanks | 15:17 |
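For reference, a hedged sketch of what the fix amounts to: the registry key quoted above gets flipped to True, and the standard supybot/limnoria owner interface should let someone with owner capability apply the same thing to a running bot without a restart (bot nick is a placeholder):

```bash
# Registry line flipped by the change above (in the bot's limnoria config file):
#   supybot.plugins.ChannelLogger.flushImmediately: True

# The same setting can usually be applied live by the bot owner via private
# message, followed by a manual flush of buffered data:
#   /msg <botnick> config supybot.plugins.ChannelLogger.flushImmediately True
#   /msg <botnick> flush
```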
*** jpena is now known as jpena|off | 15:18 | |
fungi | clarkb: yeah, so vos examine on the docs rw volume says "Volume is locked for a release operation" and we normally only release it via cronjob on mirror-update.o.o | 15:20 |
clarkb | oh is the cron on mirror-update? I rebooted that server last week | 15:21 |
clarkb | could the problem be that I rebooted the fileserver during the release instead? | 15:22 |
clarkb | regardless of where the release is running? | 15:22 |
fungi | it tried to release the docs volume yesterday at 19:20:02 running `ssh -T -i /root/.ssh/id_vos_release vos_release@afs01.dfw.openstack.org -- vos release docs` and recorded a failure | 15:22 |
clarkb | ah ok the actual release happens remotely on afs01 and that is approximately when I rebooted iirc | 15:22 |
clarkb | (though `last reboot` can confirm as can irc logs) | 15:23 |
clarkb | I rebooted around 21:00UTC | 15:23 |
clarkb | depending on how long that release was taking I could have rebooted during the release | 15:24 |
mordred | corvus: gtema reports being able to set the channel logo in #openstack-ansible-sig without being op there | 15:24 |
fungi | it last successfully released the docs volume at 18:30:21, subsequent checks showed no vos release needed for that volume up until 19:20:02 when it tried and failed | 15:24 |
clarkb | fungi: I guess in that case we confirm there is no current release happening on mirror-update, hold a lock on mirror-update to prevent it trying again, manually break the lock, then manually rerun the release? | 15:25 |
clarkb | fungi: then after that is done we can move the volume again? | 15:25 |
fungi | correct, and be prepared for it to decide this needs to be a full re-release and take days because of the ro replica in ord | 15:25 |
clarkb | ok after this meeting I'll look into starting that | 15:26 |
mordred | corvus: nevermind - he had elevated privs but wasn't aware of it | 15:26 |
clarkb | I'll put everything in screen too as I'll be out the next 1.5 days starting after our meeting today | 15:26 |
fungi | clarkb: the vos release very well may take that long if it decides to retransfer all the data from dfw to ord | 15:27 |
corvus | mordred: yeah, i was gonna say, matrix says he's a mod :) | 15:27 |
clarkb | fungi: I think the docs need to be updated to note that rebooting RO servers is not safe if vos releases can happen? | 15:27 |
clarkb | we need to hold locks for anything that may vos release essentially regardless of the RO or RW hosting state of the fileserver | 15:28 |
fungi | clarkb: yeah, it should probably mention temporarily disabling the release-volumes.py job in the root crontab on mirror-update.o.o | 15:28 |
clarkb | fungi: and holding the mirror locks too | 15:28 |
fungi | right | 15:28 |
clarkb | currently it basically says you can reboot an RO server safely | 15:29 |
clarkb | which is why i moved all the volumes | 15:29 |
fungi | anything that's remotely calling vos release on the fileservers so they can use -localauth | 15:29 |
clarkb | oh I see it is specifically due to us running the vos release from afs01 | 15:29 |
clarkb | so not generic that RO isn't safe during a reboot, but that the controlling release process needs to be safe. Got it | 15:29 |
fungi | yeah, the script on mirror-update does an ssh to afs01.dfw and calls vos release in a shell there, in order to get around the kerberos ticket timeouts | 15:30 |
clarkb | yup | 15:30 |
fungi | and your reboot may have killed the vos release process | 15:30 |
clarkb | ya I think that must've been it | 15:30 |
clarkb | but I need to ssh into the mirror and check that no other release is running and then figure out how to break the lock etc | 15:30 |
clarkb | (doesn't look like breaking locks is in our docs?) | 15:31 |
fungi | vos unlock | 15:31 |
clarkb | https://docs.openafs.org/AdminGuide/HDRWQ247.html yup google found it | 15:31 |
fungi | i've done it like `vos unlock -localauth docs` | 15:32 |
fungi | as root on afs01.dfw | 15:32 |
clarkb | thanks. I'll get that moving once i check the mirror-update side after this meeting. I'll use a screen session on mirror-update and afs01.dfw too so that others can check on it if necessary | 15:32 |
fungi | and then try to `vos release -localauth docs` after that, but you'll want a screen session for that since it may take a long time | 15:33 |
clarkb | yup ++ | 15:33 |
*** ysandeep is now known as ysandeep|out | 15:41 | |
opendevreview | Merged openstack/project-config master: Revert "Accessbot OFTC channel list stopgap" https://review.opendev.org/c/openstack/project-config/+/792857 | 15:42 |
clarkb | fungi: looks like /opt/afs-release/release-volumes.py is the script run on mirror-update to do the release | 15:44 |
fungi | correct | 15:44 |
fungi | i guess if you want to temporarily comment out the cronjob, that implies disabling ansible for the mirror-update server as well so it doesn't get readded | 15:45 |
clarkb | /var/run/release-volumes.lock is the lockfile that script appears to try and hold. I'm going to try and hold that lockfile instead | 15:45 |
clarkb | `flock -n /var/run/release-volumes.lock bash` seems to be running successfully in my screen now implying I have the lock | 15:47 |
clarkb | fungi: ^ if you think that is sufficient I'll do the `vos unlock -localauth docs` next | 15:47 |
fungi | yeah, that looks right | 15:49 |
fungi | and probably simpler than fiddling the crontab | 15:49 |
clarkb | lock has been released I am going to start the vos release now | 15:50 |
clarkb | for record keeping purposes there is a root screen on mirror-update that is holding the release-volumes.lock file with flock -n /var/run/release-volumes.lock bash. Then on afs01.dfw.openstack.org I have another root screen that is running the vos release | 15:51 |
clarkb | Once the vos release on afs01.dfw.openstack.org completes we can exit the processes on mirror-update to release the lock. Then later we can move the volume. I'm not sure I'll be around when the release completes, but I can do the followups when I get back on thursday if no one beats me to it | 15:52 |
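For reference, a consolidated sketch of the recovery sequence described above; the lockfile path and the vos commands are the ones quoted in the discussion, and both steps are assumed to run as root inside screen:

```bash
# On mirror-update: hold the release-volumes lock so the cron job cannot start
# a competing release while we work; keep this shell open in screen
flock -n /var/run/release-volumes.lock bash

# On afs01.dfw: clear the stale lock and re-run the release
vos unlock docs -localauth     # remove the lock left by the interrupted release
vos release docs -localauth    # first pass completes the interrupted transfers
vos release docs -localauth    # second pass catches the RO sites up to current
vos examine docs -localauth    # confirm the volume is no longer locked
```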
*** amoralej is now known as amoralej|off | 15:53 | |
*** marios|ruck is now known as marios|out | 15:57 | |
opendevreview | Slawek Kaplonski proposed openstack/project-config master: Add members of the neutron drivers team as ops in neutron channel https://review.opendev.org/c/openstack/project-config/+/796521 | 15:59 |
clarkb | fungi: it claims it is done already actually. Can you double check it? I'll keep holding the lock so that i can do the move without contention but then drop the lock when the move is done if it looks good to you | 16:00 |
fungi | oh, sure. that was the other possibility: that it decided it could do an incremental completion safely | 16:01 |
fungi | looking | 16:01 |
fungi | did you try a second vos release yet, btw? | 16:01 |
fungi | usually if one is interrupted, your first vos release will only attempt to complete the interrupted transfers | 16:01 |
clarkb | I have not. I can do that first | 16:02 |
fungi | then a second one will catch up to current | 16:02 |
clarkb | Only the one has completed. Will do second one now | 16:02 |
fungi | vos examine is claiming it's locked, but that's probably you | 16:03 |
clarkb | yup since I just started another release | 16:04 |
clarkb | fungi: that release shows it is done now though so you can check it again for no lock and I have done two releases in succession | 16:04 |
fungi | clarkb: yep, looks great, not locked, all ro volumes are current | 16:04 |
fungi | should be safe to proceed with the move | 16:05 |
clarkb | doing so now, thanks for checking | 16:07 |
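A sketch of the move-back plus the verification fungi does below; the vos move command is the one clarkb quoted at the top of this log, run somewhere with -localauth access:

```bash
# Move the RW docs volume back to afs01.dfw
vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa \
    -fromserver afs02.dfw.openstack.org -frompartition vicepa -localauth

# Verify placement and that nothing is left locked
vos listvldb -name docs -localauth
vos examine docs -localauth
```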
gmann | clarkb: fungi; is http://ptg.openstack.org/ down? I was dumping the Xena etherpads to wiki and seems it is not working | 16:24 |
*** ricolin_ is now known as ricolin | 16:26 | |
clarkb | I think the foundation runs that? | 16:27 |
* clarkb looks | 16:27 | |
clarkb | oh no it's on eavesdrop still | 16:27 |
clarkb | gmann: the issue is that we are migrating irc services from the old eavesdrop server to a new one and we must've turned off apache? | 16:28 |
clarkb | gmann: ianw has been working on that migration | 16:28 |
gmann | ohk | 16:28 |
fungi | also any content on that site would have been about the last ptg not the next one | 16:29 |
clarkb | fungi: should we go ahead and approve https://review.opendev.org/c/opendev/system-config/+/796513/ | 16:29 |
fungi | clarkb: yeah, old eavesdrop server is completely offline | 16:29 |
gmann | yeah, I want to get the Xena etherpad list. I hope that is still there | 16:29 |
gmann | Xena PTG | 16:29 |
fungi | clarkb: for the limnoria config change, i guess that won't redeploy/restart the container? | 16:29 |
clarkb | fungi: I am not sure | 16:30 |
clarkb | gmann: I am not sure about your question either | 16:30 |
gmann | clarkb: I mean this one http://ptg.openstack.org/etherpads.html | 16:30 |
fungi | we may need to restore a copy of the ptgbot sqlite db somewhere to get that data | 16:30 |
fungi | "The Internet Archive's sites are temporarily offline due to a planned power outage." | 16:31 |
fungi | that's unfortunate, i was going to see if they had a copy | 16:31 |
gmann | ok, thanks | 16:32 |
fungi | gmann: but no, we haven't deleted the old server yet, merely turned it off | 16:33 |
fungi | if it's not urgent let's see if there's an archive of the site content at archive.org once they're back up, otherwise we can try to temporarily turn the ptg.o.o site back on or try to restore its data from backups | 16:34 |
gmann | we do not have all the etherpads listed anywhere, so I thought of dumping them on the wiki for future ref. but it's not so urgent, so we can go with 'ptg.o.o site back on or try' | 16:37 |
gmann | maybe we can automate dumping the etherpads to the wiki right after the ptg. | 16:38 |
fungi | yeah, all good ideas | 16:39 |
clarkb | fungi: the docs RW volume move has completed. Can you double check vos listvldb and if it looks good to you I'll stop holding the lock on mirror-update? | 16:48 |
clarkb | fwiw the listing lgtm but want to double check so that I'm not competing with some automated process to relock and fix things | 16:48 |
* clarkb makes breakfast | 16:48 | |
fungi | clarkb: yep, all looks good, nothing locked, nothing old, only ro sites on afs02.dfw and afs01.ord | 16:51 |
fungi | all rw volumes are on afs01.dfw | 16:51 |
clarkb | fungi: looks like no RO site on afs02, it is afs01.dfw and afs01.ord | 16:56 |
clarkb | but that is the way it was before aiui | 16:56 |
clarkb | fungi: you good with me dropping the lock now? | 16:56 |
fungi | yeah, should be safe | 16:59 |
fungi | go for it | 16:59 |
fungi | there are 40 ro sites on afs02.dfw by the way | 17:00 |
fungi | that is what i meant by "only ro sites on afs02.dfw and afs01.ord" | 17:01 |
clarkb | ah | 17:05 |
clarkb | I have closed down all the screens and released the lock. We should be back to normal now | 17:06 |
Alex_Gaynor | 👋 it looks like maybe linaro is having issues this morning. We're seeing delays in arm64 jobs starting, and https://grafana.opendev.org/d/S1zTp6EGz/nodepool-linaro?orgId=1 seems to say that there's been 13 nodes stuck in "deleting" for 6 hours? | 17:12 |
clarkb | fungi: ^ you restarted the launcher for linaro to unstick some nodes too? | 17:15 |
clarkb | Alex_Gaynor: I think there was an issue that got addressed, but it is possible that allowed new issues further down the pipeline to pop up | 17:15 |
Alex_Gaynor | clarkb: So it's possible either there's still an issue, or this is just the backlog slowly clearing out? | 17:15 |
clarkb | Alex_Gaynor: ya. I'm taking a quick look at the logs and status the service reports directly to see if we can infer more | 17:16 |
Alex_Gaynor | 👍 | 17:17 |
clarkb | Alex_Gaynor: there are definitely a number of deleting instances from far too long ago. Next step for those will be to check the cloud side and determine if we leaked or they leaked. I do see building instances and ready instances though so I think it is moving. But possibly slowly due to backlog | 17:18 |
Alex_Gaynor | clarkb: it looks like there's a handful of instances that have been sitting in "ready" for a while, is there some reason they wouldn't be assigned to jobs? | 17:18 |
clarkb | Alex_Gaynor: the two most common things are that they are part of a larger nodeset that hasn't managed to fully allocate yet or min-ready caused them to boot without direct demand. That said it is also possible there is an issue (like the one that instigated the earlier restart) that prevents those from being assigned | 17:19 |
clarkb | fungi: ^ would know better what that looked like as I wasn't awake when it happened | 17:20 |
clarkb | Alex_Gaynor: I do see an instance that went into use 3 minutes ago so something is making progress | 17:20 |
clarkb | oh and another is under a minute old | 17:20 |
Alex_Gaynor | 👏 | 17:21 |
clarkb | ok the 55 day old deletes are just noise I think. They are labeled under a different provider (they are linaro-us not linaro-us-regionone). We need to clean those up in our db but they don't appear to be in the actual cloud (no quota consumption) | 17:23 |
clarkb | I suspect for the other deletions their builds timed out because the cloud shows them in a BUILD state | 17:24 |
Alex_Gaynor | that sounds like something that would be consuming quota? | 17:25 |
clarkb | yes the ones in a BUILD state on the cloud side are most likely consuming quota | 17:25 |
clarkb | nodepool appears to have been attempting to delete at least some of those instances for some time | 17:26 |
clarkb | I think we may need kevinz to look into this since the cloud isn't honoring our requests | 17:26 |
clarkb | and also we should see if ianw remembers what the magic was to get the osuosl mirror running again | 17:26 |
clarkb | (then we can enable that cloud again) | 17:26 |
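A hedged sketch of the cross-check being described: compare what nodepool thinks is stuck deleting against what the cloud actually reports; the provider/cloud names and grep pattern are illustrative:

```bash
# On the launcher: nodes nodepool has been trying to delete in this provider
nodepool list | grep linaro-us-regionone | grep deleting

# Against the cloud: are those instances really gone, or wedged in BUILD and
# still consuming quota? (cloud name is whatever clouds.yaml calls this provider)
openstack --os-cloud linaro-us server list --status BUILD
```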
fungi | clarkb: sorry, in a meeting right now, but yes i restarted the launcher responsible for linaro-us to free some stale locks on node requests | 17:27 |
fungi | there were a few changes which had been sitting waiting on node assignments which the launcher thought it had declined | 17:28 |
clarkb | fungi: fwiw it seems like things are moving there but we've got a number of nodepool deleting state instances that show BUILD state in the cloud and seem to refuse to go away. I think we need to reach out to kevinz | 17:28 |
fungi | yes, it seems like that's probably constricting the available quota (and having the osuosl cloud down isn't helping matters) | 17:29 |
clarkb | we appear to have room for 5 job nodes right now. There are two "ready" nodes. If those come from min-ready we might try setting those values to 0 to get 5 bumped to 7 | 17:31 |
clarkb | after this meeting I can write an email to kevinz | 17:31 |
*** ricolin_ is now known as ricolin | 17:32 | |
clarkb | we are up to 7 in use now. slightly better than my earlier counts | 17:32 |
fungi | did it end up consuming the two in ready state? | 17:37 |
clarkb | fungi: no I suspect those are min-ready allocations | 17:38 |
clarkb | I'm drafting up the email now | 17:38 |
fungi | thanks! | 17:40 |
clarkb | Alex_Gaynor: fungi: email sent to kevinz (I also pointed out we are on OFTC now in case IRC confusion happens). Hopefully that can be addressed later today when kevinz's day starts | 17:45 |
Alex_Gaynor | 👍 | 17:45 |
clarkb | fungi: maybe we should try some osuosl mirror reboots just for fun? I've got to prep for the infra meeting that is in just over an hour though | 17:46 |
fungi | yeah, i'm about to resurface from my meeting-induced urgent task backlog, and can try a few more things | 17:48 |
fungi | working on our startup race theory, i've got openafs-client.service disabled and am waiting for the server to reboot, then will try to modprobe the openafs lkm before manually starting openafs-client | 18:07 |
fungi | hopefully that will give us a good idea of whether it's racing the module loading | 18:08 |
fungi | manually did `sudo modprobe openafs` | 18:16 |
fungi | now lsmod confirms the openafs lkm is loaded | 18:16 |
fungi | trying to `sudo systemctl start openafs-client` | 18:16 |
fungi | not looking good | 18:17 |
clarkb | was worth a shot anyway | 18:19 |
fungi | yeah, no dice, same kernel oops | 18:19 |
fungi | claims it's got the same openafs package versions installed as the linaro-us mirror server though | 18:20 |
fungi | i'll try to remember to pick ianw's brain on this after the meeting | 18:22 |
ianw | for reference, it appears what happens is that ptgbot writes out a .json file which is read by some static html that uses handlebars (https://handlebarsjs.com/) templates. so no cgi/wsgi | 19:22 |
clarkb | ah in that case we'd want to grab the json | 19:23 |
fungi | but not bothering until after we find out if https://web.archive.org/web/*/ptg.openstack.org has it | 19:26 |
clarkb | ya | 19:26 |
corvus | i think we missed a content-type issue with eavesdrop logs; this used to serve text/plain: https://meetings.opendev.org/irclogs/%23opendev-meeting/%23opendev-meeting.2021-06-15.log | 19:48 |
corvus | but now it's application/log | 19:48 |
opendevreview | Merged opendev/system-config master: ircbot: flush channel logging continuously https://review.opendev.org/c/opendev/system-config/+/796513 | 19:48 |
fungi | corvus: thanks! should be easy enough to fix in the vhost config | 19:50 |
ianw | we might as well update the ppa to 1.8.7 as a first thing | 19:50 |
fungi | not a terrible idea | 19:50 |
fungi | at least if we're going to debug it, let's debug the latest version | 19:51 |
ianw | i don't see any updates in https://launchpad.net/~openafs/+archive/ubuntu/stable | 19:51 |
fungi | it may be stalled by the debian release freeze | 19:52 |
fungi | if they're just slurping in latest debian packages | 19:52 |
fungi | yeah, sid's still at 1.8.6-5 | 19:53 |
ianw | hrm, i think that 1.8.7 only has those fixes for the rollover things which are in 1.8.6-5 | 19:53 |
ianw | i.e. they're essentially the same thing | 19:53 |
fungi | yeah, and there's nothing newer in experimental either | 19:53 |
ianw | there's been no other release | 19:53 |
clarkb | ok with the meeting ended I am now being dragged out the door. I'll see ya'll on Thursday | 19:53 |
fungi | have a good time clarkb! | 19:54 |
clarkb | thanks! | 19:54 |
ianw | i'm just going back in my notes; i'm trying to remember the context | 19:54 |
ianw | i did not seem to note sending that email | 19:55 |
ianw | my only entry is | 19:55 |
ianw | * linaro mirror | 19:56 |
ianw | * broken, reboot, clear afs cache | 19:56 |
ianw | osuosl has moved to libera, but Ramereth is particularly helpful if we find it's an infra thing. trying to log in now | 19:56 |
* Ramereth perks up | 20:02 | |
fungi | ianw: ^ ;) | 20:02 |
fungi | and yeah, not sure yet it's an infrastructure layer issue | 20:03 |
fungi | it could be as simple as differences in the kernel rev between the osuosl and linaro-us mirror servers | 20:03 |
ianw | so the suggested fix, https://gerrit.openafs.org/14093, i think got a bit lost | 20:03 |
ianw | it was in 1.8.7pre1, but then 1.8.7 got taken over as an emergency release for the rollover stuff | 20:04 |
fungi | nevermind, both are on 5.4.0-74-generic #83-Ubuntu SMP | 20:04 |
ianw | if we want that, we should either patch our 1.8.6-5 (== 1.8.7) or try 1.8.8pre1 which is just released | 20:04 |
ianw | http://www.openafs.org/dl/openafs/candidate/1.8.8pre1/RELNOTES-1.8.8pre1 seems safe-ish | 20:05 |
ianw | preferences? i could try either | 20:05 |
fungi | i'd go with the 1.8.8pre1 | 20:15 |
fungi | but i'm a glutton for punishment | 20:15 |
ianw | it is a bit more punishment figuring out which patches have been applied, etc. but otoh it seems to have quite a few things we want | 20:16 |
fungi | the server's already out of commission anyway, so 1.8.8pre1 is at least an opportunity to also give the openafs maintainers some feedback | 20:20 |
opendevreview | Michael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot https://review.opendev.org/c/opendev/system-config/+/795596 | 20:22 |
johnsom | Rebased after statusbot config moved | 20:23 |
Ramereth | ianw: did you need something from me? | 20:26 |
ianw | Ramereth: no, thank you :) i think we're hitting openafs race issues, i think arm64/osuosl is just somehow very lucky in triggering them | 20:27 |
Ramereth | I feel so lucky | 20:33 |
ianw | ok, i've been through debian/patches and pulled out all those applied to 1.8.8pre1 | 20:34 |
ianw | i should probably make a separate ppa to avoid overwriting existing packages | 20:36 |
ianw | https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs-1.8.8pre1 is building so we'll see how that goes | 20:45 |
ianw | we can manually pull that onto the osuosl mirror and test. if we like it, i can upload it to our main ppa for everything | 20:46 |
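A hedged sketch of manually pulling the test build onto the mirror; the PPA is the one linked above, the package set is the standard openafs client trio, and the cache path is the distro default (check /etc/openafs/cacheinfo if it points elsewhere):

```bash
# On the osuosl mirror: add the one-off test PPA and install the 1.8.8pre1 build
sudo add-apt-repository ppa:openstack-ci-core/openafs-1.8.8pre1
sudo apt-get update
sudo apt-get install openafs-client openafs-krb5 openafs-modules-dkms

# Clear the client cache for good measure, then reboot
sudo rm -rf /var/cache/openafs/*
sudo reboot
```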
*** sshnaidm is now known as sshnaidm|afk | 21:15 | |
fungi | clarkb: proposed-cleanups.20210416 is missing a bit of context for me... there are some duplicate addresses in that list though most seem to be singular... also there's a handful of addresses for the same account ids.... and not quite sure if all the blank lines toward the end are relevant or just cruft | 21:22 |
ianw | openafs builds look ok, it's publishing now | 21:22 |
fungi | awesome | 21:23 |
fungi | thanks so much for updating that | 21:23 |
ianw | haha don't thank me until it at least meets the bar of not making things worse :) | 21:24 |
fungi | that's a rather high bar in these here parts ;) | 21:27 |
ianw | mordred: as the author of the comment about caveats to upgrading ansible on bridge; would you have a second to double check https://review.opendev.org/c/opendev/system-config/+/792866/4/playbooks/install-ansible.yaml that i believe notes the issues are solved with the ansible 4.0 package? | 21:33 |
opendevreview | Merged openstack/project-config master: infra-package-needs: stub ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796414 | 21:34 |
ianw | i'm installing 1.8.8pre1 on the osuosl mirror now. i'll clear the cache directory for good measure and reboot | 21:34 |
fungi | ansible 4 is already a thing? i've clearly not been paying close enough attention | 21:34 |
fungi | ianw: be prepared for the reboot to take a while since systemd will wait quite a long time for openafs-client to fail to stop before it gives up | 21:36 |
ianw | fungi: sort of. it's what we'd think of as 2.11 with collections bolted on | 21:37 |
fungi | ahh | 21:38 |
fungi | what happened to ansible 3 i wonder | 21:38 |
ianw | it is there and works in a similar fashion, we just skipped it | 21:41 |
fungi | heh | 21:42 |
fungi | did they decide to just go with single-component versions then? | 21:42 |
ianw | if you mean are all collections tagged at one version; i think the answer is no. each "add-on" keeps its own version and somehow the build magic pulls in its latest release | 21:45 |
ianw | and then the "ansible" package depends on a version of ansible-base (3 is >2.10<2.11, 4 is >2.11<2.12, etc) | 21:46 |
fungi | okay, so ansible-base is versioned independently of ansible now | 21:49 |
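A quick, hedged way to check which combination is actually installed on a host like bridge; whether the engine package shows up as ansible-base or ansible-core depends on the series:

```bash
# The "ansible" meta-package and what engine version it pulled in
pip show ansible | grep -E '^(Version|Requires)'
pip show ansible-core 2>/dev/null || pip show ansible-base

# What the engine itself reports
ansible --version | head -1
```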
ianw | well, that sucks, seems like i got the same oops starting the client | 21:56 |
fungi | :( | 21:59 |
fungi | you're sure dkms rebuilt the lkm? | 21:59 |
ianw | ii openafs-client 1.8.8pre1-1~focal arm64 AFS distributed filesystem client support | 22:06 |
ianw | ii openafs-krb5 1.8.8pre1-1~focal arm64 AFS distributed filesystem Kerberos 5 integration | 22:06 |
ianw | ii openafs-modules-dkms 1.8.8pre1-1~focal all AFS distributed filesystem kernel module DKMS source | 22:06 |
ianw | pretty sure | 22:06 |
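Beyond the package versions above, a hedged sketch of confirming that dkms really built the new module against the running kernel:

```bash
# Was 1.8.8pre1 built and installed for the kernel that is running right now?
dkms status openafs
uname -r

# Does the module on disk report the new version and a matching vermagic?
modinfo openafs | grep -iE '^(version|vermagic)'
```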
mordred | <ianw "mordred: as the author of the co"> Lgtm ... That was long ago ... I believe that has been sorted now | 22:44 |
mordred | ianw: I +2'd - but left off the +A - because that's clearly a pay-attention kind of thing | 22:47 |
mordred | actually - I'm old ... do we use install-ansible as part of the integration tests? or is that one of the things that's different in our live test environment? | 22:48 |
mordred | looks like we do | 22:49 |
mordred | so I believe our integration tests also verify that ansible v4 should be fine | 22:49 |
Alex_Gaynor | Hmm, it looks like there's now 0 linaro arm64 jobs running, even though there's a bunch queued. Has the linaro situation regressed? | 22:52 |
ianw | Alex_Gaynor: umm ... maybe. sorry i'm currently trying to debug the oops we're seeing on the osuosl mirror server | 23:14 |
ianw | we've got a bunch of hosts in build status there | 23:17 |