Monday, 2021-01-18

ianwok, that's all cleaned up on ORD00:09
ianwi'm feeling like we must have deployed /etc/openafs/server/rxkad.keytab manually because i'm not seeing it deployed in config mgmt00:18
ianwi've put in emergency and am going to try the manual 1.8 update on it00:24
ianwi think the debian scripts must create KeyFileExt automatically00:46
ianw# vos status -localauth -server afs01.ord.openstack.org01:14
ianwvos: host '' not found in host table01:14
ianwis the only weird thing so far.  it can't seem to query itself01:14
ianwok, it appears to be filtering out all local addresses, meaning it can't talk to itself02:12
ianwso it seems 100% intentional 1.8 tools filter out loopback addresses; although the error message is a little unclear (says it can't find the host, rather than "i found the host, but it's all loopback so i chose to ignore it")03:00
ianwthe upshot is probably that you can't run the commands on the servers themselves.  which is probably fine, i've just been used to being able to do that with 1.603:01
openstackgerritIan Wienand proposed opendev/system-config master: openafs-server : add ansible roles for OpenAFS servers
openstackgerritIan Wienand proposed opendev/system-config master: openafs-server : add ansible roles for OpenAFS servers
ianwclarkb / fungi: update dump related to ^04:54
ianwI cleaned up a bunch of old volumes that appeared to be abandoned on (readonly pypi and npm mirrors, etc.)04:55
ianwafter that i put it in emergency and manually installed from our ppa openafs 1.8.6-5 packages, and restarted everything04:55
ianwit seems to be working04:56
ianwi have started on moving this to ansible with 77115904:56
ianwif you could review that, i think what I would like to do is run that manually to ensure it is deploying ok on afs01.ord04:57
ianwthen we can merge and do afs01/02.dfw04:57
ianwfungi it looks like the tarballs transaction finished, but i note all those other transactions marked deleted are still running04:59
ianwmaybe they clear out, maybe not04:59
ianwit might be worth considering running vos release sequentially05:01
prometheanfireianw: for the removed at end of image build, I guess not, I modeled it after the repo stuff05:14
prometheanfirenot sure what removes it for them either05:14
auristorianw: I believe you referenced the wrong commit in your e-mail to openafs-info.   The loopback filtering was introduced by dc2a4fe4e949c250ca25708aa5a6dd575878fd7e05:17
ianwauristor: ahh yeah, that was the one i was looking at, i guess i pasted the wrong thing :)05:19
ianwanyway, i get that "don't use the tools on the server" might be the most practical answer05:19
auristorthe intent of that code is to find a non-local address when localhost or is specified05:20
ianwsure -- and we've (and by that I mean "I") have screwed up before by getting in the volume db05:22
auristorvos partinfo localhost -localauth is a reasonable command to execute05:22
auristorLooks like I asked the relevant question back in 2013 but the conclusion in 2014 was that preventing loopback addresses from getting pushed into the VLDB was more important than resolving edge cases.05:34
ianwprobably the right decision if it took 8 years for anyone to notice :)05:41
ianwalthough separating the error for "host not found" from "host filtered" would probably have tipped me off without having to go to the source :)05:43
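A minimal sketch of the distinction ianw suggests above, written in Python purely for illustration. This is not OpenAFS code; the exception names and the `usable_addresses` helper are hypothetical, and name resolution is left out so that the example only shows the filtering step and the two separate error paths.

```python
import ipaddress


class HostNotFound(Exception):
    """No addresses were resolved for the host at all."""


class HostFiltered(Exception):
    """Addresses were resolved, but every one of them was loopback."""


def usable_addresses(host, resolved):
    """Return the non-loopback addresses from an already-resolved list.

    `resolved` is a list of address strings (hypothetical input; a real
    tool would obtain it from the resolver).
    """
    if not resolved:
        raise HostNotFound(f"host '{host}' not found in host table")
    # Drop anything in 127.0.0.0/8 or ::1, mirroring the filtering idea.
    usable = [a for a in resolved if not ipaddress.ip_address(a).is_loopback]
    if not usable:
        raise HostFiltered(
            f"host '{host}' resolved to {resolved}, but all addresses are loopback"
        )
    return usable
```

With separate exceptions, a server whose own name maps only to 127.0.1.1 would report "filtered" rather than "not found".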
ianwfungi: i approved a couple of other dib things.  i think it's fine for a 3.6.0 release if you like, or i can do it tomorrow when things merge05:48
openstackgerritIan Wienand proposed opendev/system-config master: openafs-server : add ansible roles for OpenAFS servers
openstackgerritMerged openstack/diskimage-builder master: Fix centos 8.3 partition image building error with element iscsi-boot
openstackgerritMerged openstack/diskimage-builder master: Fix building error with element dracut-regenerate
AJaegerinfra-root, gives a 403 - forbidden. Is that a known issue?07:27
ykareli see same in ^ and asked in #openstack-infra07:29
fricklerfrickler@static01:~$ ls -l /afs/
fricklerls: cannot access '/afs/': Connection timed out07:30
fricklerseems is down07:32
fricklerat least some sites are still working, so I won't touch anything at this point. ianw and fungi did some work on it last night.07:39
fricklermaybe we should discuss whether afs is still the right tool to use for this07:42
ykarelalso seeing in some of our jobs:- Status code: 403 for (IP:
AJaegerfrickler: did anybody send a #status alert?07:46
ykareljpena|off, danpawlik5 fyi ^07:48
danpawlik5AJaeger on thursday or friday07:49
ykarelseems issue is cleared, sites are visible now.]]07:49
fricklerafs01 is back, uptime 12min. not sure what happened there.07:56
mrungegood morning and happy new week08:02
mrungefrickler, it seems setting the swap size did work in an unintended way (disabled swap?)08:03
mrungethat's the change08:03
mrungeis the log, swap is 0 . hmmm08:04
ykarelmrunge, seems u are adding it wrongly08:14
ykarelalso need to add to
ykarelso it used in master branches08:15
fricklermrunge: IIUC host-info is collected before any task runs, so no swap to be seen there. check the log output for the configure-swap task and then what ykarel says.08:31
mrungeykarel, what do you mean?08:32
ykarelas there are job variants for different branches, you need to add vars to all variants as needed08:34
mrungeI was hoping all this would be inherited08:34
mrungebut yes.08:35
mrungethanks, let's see if the gate is more happy now08:35
mrungereducing memory in xz job did not have any effect it seems08:35
ykarelno, it doesn't get inherited like this; if you see all vars are common you can use yaml anchors08:36
fricklerinfra-root: / on bridge is at 91%, seems to be going on pretty linearly over the last year, but more stable recently, so not sure whether we need to take any action
akahat|roverwe are facing an issue with mirrors.09:18
openstackLaunchpad bug 1912177 in tripleo "Issue with upstream mirrors, Jobs failing with Error: Failed to download metadata for repo 'quickstart-centos-base': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried"." [Critical,Triaged]09:18
bhagyashrisakahat|rover, ykarel raised one issue few times before not sure this same or not09:25
bhagyashris<ykarel> also seeing in some of our jobs:- Status code: 403 for (IP:
bhagyashris<ykarel> seems issue is cleared, sites are visible now.]]09:25
ysandeepbhagyashris, can you try opening locally, its not working for me.09:26
akahat|roverbhagyashris, ok.. but not sure this will get resolve with the time. :|09:27
bhagyashrisysandeep, not working at my end09:28
fboHi our monitoring is reporting a sync date older than 5 days from the opendev mirror. Is there any ongoing issue with opendev mirrors ? thanks in advance.09:39
*** jcapitao has joined #opendev09:55
*** DSpider has joined #opendev10:13
lourot^ seeing issue with that regionone mirror again as well, see
lourotfrickler, are you around maybe?10:51
*** dpawlik7 has joined #opendev11:09
fricklerafs01 now has an uptime of 6min, not sure whether it is crashing or maybe rackspace is having some issues11:24
*** ysandeep|afk is now known as ysandeep11:55
*** jpena is now known as jpena|lunch12:28
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker
fungifrickler: i'll check the root inbox, maybe rackspace opened a ticket12:57
fungiwoah, 24 messages from rackspace support12:58
fungijust since i went to sleep12:58
fungi"This message is to inform you that the host your cloud server,, 'adfea02f-6229-464a-99e5-7617fb00caef' resides on alerted our monitoring systems at 2021-01-18T06:46:50.457633."12:58
fungiyeah, so hypervisor host outage which happened to impact afs01.dfw while we were still trying to recover the replicas on afs02.dfw12:59
fungithe instance was also impacted13:01
fungiper subsequent messages there was a hardware failure on the host, and then they had to initiate an offline migration of the server instances to a new host, hence the extended downtime13:02
fungichecking various static sites backed by afs, they seem to be returning content now13:19
fungia vos release of the docs volume is underway since 11:30 utc13:25
*** jpena|lunch is now known as jpena13:28
openstackgerritTobias Henkel proposed zuul/zuul-jobs master: Fail mirror-workspace-git-repos if checkout failed
fungiaha, the docs volume is served from afs01.dfw and, not afs02.dfw13:33
fungiso the vos release in progress for it is (re?)populating the replica on afs01.ord13:34
fungiianw: ^ expected, i suppose?13:34
fungithat may take a while to complete13:34
fungicacti says it's doing roughly 8Mbps inbound on eth0, and the volume has 23GiB of content, so should expect it to run for roughly 7 hours, maybe finishing around 18:30 utc13:49
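The estimate above can be checked with a short back-of-the-envelope calculation. This is only a sketch: it assumes a constant transfer rate, treats Mbps as 10^6 bits per second and GiB as 2^30 bytes, and the function name is made up for the example.

```python
def transfer_hours(size_gib, rate_mbps):
    """Hours needed to move size_gib gibibytes at rate_mbps megabits/second."""
    bits = size_gib * 2**30 * 8               # GiB -> bytes -> bits
    seconds = bits / (rate_mbps * 1_000_000)  # Mbps -> bits per second
    return seconds / 3600


# 23 GiB at 8 Mbps, the figures quoted from cacti above
print(round(transfer_hours(23, 8), 1))  # about 6.9 hours
```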
*** mordred has joined #opendev13:55
*** Eighth_Doctor has joined #opendev13:57
openstackgerritMerged zuul/zuul-jobs master: Fail mirror-workspace-git-repos if checkout failed
*** d34dh0r53 has quit IRC14:17
*** zoharm1 has quit IRC14:23
*** sboyron has quit IRC14:27
openstackgerritMerged zuul/zuul-jobs master: Pass environment variables to 'tox envlist config' task
fungiugh... 13:48:21 "At this time we are still in the process of migrating your cloud migration. At this time we are still in the process of migrating your cloud server,, to a new host. Your server is currently online but will experience some intermittent downtime during the migration process. We will notify you once the migration is complete and we have verified that your cloud server is14:34
fungionline. Please do not attempt to access or modify [the server] during this process."14:34
fungiso i guess we should expect it to reboot and abort the in-progress vos release for the docs volume14:35
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: DNM: negative test
*** ysandeep|afk is now known as ysandeep14:45
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: DNM: negative test
*** ysandeep is now known as ysandeep|away15:32
openstackgerritCole Walker proposed openstack/project-config master: Add PTP Notification app to StarlingX
clarkbfungi: is afs01.dfw running 1.8? if not is 1.6 functioning?15:43
fungiit's running 1.6 and seems to be functioning15:46
fungiafs01.ord is 1.8 i believe15:54
fungiyeah, 1.8.6-5ubuntu1~xenial2 on afs01.ord, 1.6.15-1ubuntu1 on afs01.dfw and afs02.dfw still15:55
openstackgerritCole Walker proposed openstack/project-config master: Add PTP Notification app to StarlingX
openstackgerritClark Boylan proposed opendev/system-config master: run-selenium: run selenium on a node
openstackgerritClark Boylan proposed opendev/system-config master: gerrit: Initalize in testing
openstackgerritClark Boylan proposed opendev/system-config master: gerrit: move plugins to common code
openstackgerritClark Boylan proposed opendev/system-config master: bazelisk-build: specify targets as list
openstackgerritClark Boylan proposed opendev/system-config master: gerrit: get files from bazel build dir
openstackgerritClark Boylan proposed opendev/system-config master: gerrit: Install zuul-summary-results plugin
openstackgerritClark Boylan proposed opendev/system-config master: Fix review01's fqdn in infratesting
smcginnisclarkb, fungi: Semi-random question, but how difficult would it be to stand up a site?15:59
clarkbsmcginnis: we can barely run the current wiki we've got, it needs to be upgraded. Meanwhile dansmith is complaining that gerrit is slow and openafs decided last week that january 14-31 were going to be an inoperable period for it16:02
smcginnisOK, so difficulty == high. ;)16:02
clarkbI think at this point for new stuff we really need new help to go along with it16:05
smcginnisYeah... not even just the new stuff though.16:05
clarkbwell that too16:06
fungismcginnis: ttx found a great hosted wiki option which is open source though, i forget the name just now, something maintained by ow2 i think?16:09
fungiwe've talked about possibly convincing projects to use that so we can retire the wiki service we're currently running16:10
smcginnisI was also thinking may be an option. Paid service, but free for communities and the code itself appears to be open source.16:10
fungii thought it was more like an etherpad, but i've rarely used it16:10
smcginnisSync with GitHub. I wonder how difficult it would be to reverse mirror into our own.16:10
smcginnisHackMD is very similar to etherpad, but with a lot more features. And I think maybe a little more appealing to those that aren't used to plain text based collaboration (thinking board members here).16:11
clarkbit's a straightforward piece of software to install with a simple upgrade process16:13
clarkbmigrating off of etherpad to a hosted $othertool isn't going to save us16:13
fungiyeah for something which runs as a nodejs process, it's really not half terrible to operate16:14
smcginnisDefinitely not proposing that. I'm just brainstorming options for moving Open Infra BOD stuff off of
ttxfungi: xwiki16:15
smcginnisNot like it's critical, but just thinking the perception switch from "OpenStack + a few others" to "set of projects, including OpenStack"16:15
openstackgerritAlfredo Moralejo proposed zuul/zuul-jobs master: Set zuul-jobs-test-base-roles-gentoo-17-0-systemd non-voting
openstackgerritAlfredo Moralejo proposed zuul/zuul-jobs master: Rename config repos file config for CentOS Stream
openstackgerritAlfredo Moralejo proposed zuul/zuul-jobs master: Rename config repos file config for CentOS Stream
clarkbfungi: did zuul get restarted?16:28
clarkbfungi: frickler bridge disk use is likely to be due to ansible logging16:35
clarkbI don't know if we got something in place to clean up old aged out logs16:35
clarkb(we rotate them directly in the jobs and that may not clean up older sets?)16:36
fungiclarkb: yes, zuul scheduler was restarted late sunday my time (i status logged it so can probably get a more exact time from there)16:38
fungier, sorry, late saturday/early sunday i mean16:38
clarkbawesome, I was looking at the zuul queues take off and wondered if we got that out of the way already16:39
fungii ended up having to restart zuul-web too because the api socket became unresponsive after the scheduler restart and remained unusable for a good five minutes after all the cat jobs returned, so i gave up waiting for it to possibly come back on its own16:51
*** marios|out has quit IRC16:51
clarkbya I think that is a known issue (though seems like it doesn't fail 100% of the time?)16:52
clarkbfungi: ianw: I think the arm64 afs stuff isn't working properly as the arm64 zuul job integration test things are failing with afs not being readable17:09
clarkbit does look like we successfully built 1.8.6-5 arm64 packages in our ppa so we shouldn't be using the old unpatched version17:10
fungiwe probably haven't upgraded the packages on the arm64 mirror17:10
fungioh, maybe17:10
* fungi checks it17:10
clarkbI think the arm64 mirror is fine17:11
clarkbbut possibly because it hasn't restarted with the new version? is a job log showing the mirror seems fine then it tries to afs locally and breaks17:12
fungi`ls /afs/` is taking a while to return for me from there17:13
fungistill trying17:13
fungi1.8.6-5ubuntu1~focal1 is what's installed17:14
fungirebooted 3 days ago17:14
clarkbI suppose it is possible that it is a networking issue rather than afs itself17:14
fungi`vos status -server` works from it17:15
fungiokay, after initially not returning, i now get content in /afs/ there17:15
clarkbI guess we can recheck those changes that failed and see if it persists17:16
fungi[Mon Jan 18 17:10:49 2021] afs: Lost contact with volume location server in cell (code -1)17:16
fungi[Mon Jan 18 17:11:17 2021] afs: volume location server in cell is back up (code 0)17:17
clarkbthe job failure was ~16:58-17:01 ish17:17
fungiand again a few seconds later, and then same for
fungiso yeah, could be network related17:17
clarkbin that case I'll wait for things to finish testing then can recheck and see if it persists17:18
clarkbfungi: it hit an amd64 job17:35
clarkbI wonder if the "connection timed out" msg there is really ansible saying it can't connect to the remote host anymore?17:35
clarkbdebian and centos != ubuntu17:36
clarkbso ya I think this is the old packages not working anymore17:36
fungiindeed, so they're still testing with broken packages, i expect17:36
clarkbthat clears things up17:36
fungilooks like there's a 1.8.6-5 in sid and bullseye now, but buster still has an older 1.8.2-1 which is probably not patched17:37
clarkb is where we get the centos package too17:37
clarkband I bet we haven't updated our centos builds of the openafs package17:38
clarkbI think I see a quick fix for centos17:38
fungiyeah, debian's 1.8.6-5 was pulling in the bitmask patches:
clarkbremote: Build openafs 1.8.7 on centos17:41
clarkbthe fix for centos may be that simple?17:41
*** artom has quit IRC17:42
clarkbfungi: yup we just repackaged that version for ubuntu17:42
*** mrunge_ is now known as mrunge18:05
fungii've followed up to asking if anyone's working on backporting the fixes to buster. i'll push up a merge request in salsa for it if not18:06
openstackDebian bug 980115 in openafs-client "connection failure when rx initialized after 08:25:36 GMT 14 Jan 2021" [Grave,Fixed]18:06
*** jpena is now known as jpena|off18:07
clarkbfungi: thanks18:11
fungioh, hey! the vos release for docs finished at 18:10:29 so roughly in line with what i projected18:17
fungithat means our static site volumes should all be caught up now, and i can start releasing the locks i've been holding for mirror volume updates, one at a time18:19
fungithe full restore of the docs volume ran from 11:30:01 until 17:36:35, then a subsequent vos release caught it up to current18:22
fungii'll release the yum-puppetlabs lock first and test the waters18:23
fungii've got a mirror update running for it in a root screen session on now18:25
fungiStarting ForwardMulti from 536871036 to 536871036 on (full release).18:25
fungihopefully won't take too long as this volume isn't huge18:25
funginot exactly tiny (~200GiB of data according to our grafana dashboard) but it's the smallest one we're graphing (other than epel, which i fear shouldn't be updated until centos/fedora have been)18:28
fungialso i'm worried that the live migration of afs01.dfw is still in progress, so we may see rackspace reboot it at any time18:30
fungioh, i guess not. yay! 16:18:24 "This message is to inform you that your cloud server,, is online."18:33
*** zoharm has quit IRC19:05
clarkbfungi: looks like 1.8.7 is present at now19:13
clarkbThat should just leave debian before we can recheck those19:13
clarkbfungi: is that something that debian might backport quickly?19:14
clarkbnot sure how debian prioritizes those against risk19:14
fungithey did set the urgency to "emergency" on the sid upload so it migrated to bullseye within a day, but not sure if fixing it in buster is on their radar which is why i pinged the bug about it19:23
fungithe bug is already tagged as affecting buster, so hopefully they're working on it? but i wouldn't count on it being solved straight away19:23
fungiwe could probably set our tests to use our ubuntu ppa19:24
*** andrewbonney has quit IRC19:25
fungithe focal build would probably work, but we can try the bionic one if not19:25
fungiproblem is bionic released with 1.8.0 and focal with 1.8.4, so neither is particularly in sync with buster19:25
fungibut odds are our package builds for either would run on it just fine19:26
clarkbour packages are built from the debian package too19:43
clarkbfungi: to do that we just enable the ppa on debian? then apt-get update and install?19:44
clarkbalso reading our role to install openafs client it seems we may not install those packages on xenial19:49
clarkbwhich would affect the zuul executors? we may need to double check those have updated?19:49
openstackgerritClark Boylan proposed opendev/system-config master: Use our ubuntu openafs ppa on debian
clarkbMaybe something like ^ will work19:52
fungiclarkb: we checked the executors and ansible updated them within a few hours of publishing to the ppa19:54
fungiwe even tested rebooting ze01 with it just to make sure19:55
clarkbhrm I wonder if that logic there is wrong or did we already update the executors?19:55
fungiii  openafs-modules-dkms                   1.8.6-5ubuntu1~xenial219:55
clarkbI wonder if I'm missing something with how that module is used or maybe the condition on it is buggy19:56
clarkbI think we want xenial to update fwiw19:56
clarkbbut the code says not ( ansible_distribution_version is version('16.04', '==') and ansible_architecture == 'x86_64' )19:56
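To make the quoted condition easier to read, here is a small sketch of its truth table. It assumes plain string equality stands in for Ansible's `version` test, and it says nothing about which task the condition guards, since the log does not show that; the function name is invented for the example.

```python
def condition(distribution_version, architecture):
    """Mirror of: not (version == '16.04' and architecture == 'x86_64')."""
    return not (distribution_version == "16.04" and architecture == "x86_64")


# Evaluate a few combinations; only xenial on x86_64 yields False.
for version, arch in [("16.04", "x86_64"), ("16.04", "aarch64"), ("18.04", "x86_64")]:
    print(version, arch, condition(version, arch))
```

Under that reading, whatever the condition guards is skipped only on 16.04 x86_64 hosts, which matches clarkb's concern that xenial might be excluded.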
fungioof, we have a ton of cruft in /etc/apt/sources.list.d on ze01 (and probably elsewhere)19:57
fungilooks like we add in openafs.list, openstack-ci-core-ubuntu-openafs-amd64-hwe-xenial.list, and ppa_openstack_ci_core_openafs_xenial.list19:59
fungii expect only one of those is what we call the file these days and we've renamed it and not cleaned up the old filenames (at least) twice19:59
clarkbwe use the ansible apt_repository module without an explicit filename so not sure which is the most current20:00
clarkbbut also I wonder if maybe all of those are stale from puppet and the current ansible isn't trying to add it in20:00
fungiyuck. i guess then the module has changed how it works multiple times and doesn't clean up after itself20:00
fungioh, or could be that too, yep20:00
fungicomparing to ze12 we have the same files there too20:01
fungibut i suppose they all pre-date the ansibification20:01
clarkbI think they do20:01
clarkb771268 should give us more insight as it runs on xenial in addition to debian and centos20:02
clarkbfungi: thinking out loud would it be better to pull in from sid on buster for openafs instead of our ppa?20:02
clarkbI guess it probably doesn't matter much as long as the kernel is happy with our ppa too20:02
fungiit's likely the sid/bullseye build will run just fine on buster too, but that requires doing some apt pinning if we don't want to accidentally upgrade buster to bullseye once it releases (the default pin priorities will still cause buster packages to be preferred while bullseye is still the testing suite at least)20:04
clarkbwith the ppa it's only the packages in the ppa we have to worry about so don't need to pin. That makes sense20:06
fungithe thing our ppa has going for it is that only the openafs packages are present, so if we add that ppa we won't accidentally install any other packages from outside buster20:06
fungiyeah, that20:06
fungiwe could probably also add a buster build to the ppa, i don't recall whether the system there is restricted to only holding ubuntu-targeted builds20:07
clarkbfungi: if you've got a second can you take a quick look at the meeting agenda for tomorrow? want to make sure I didn't miss anything important with everything that went on last week20:12
clarkbfungi: the centos jobs appear to work now after the 1.8.7 update for their package repo. Debian is failing though. "E: Unable to correct problems, you have held broken packages." I guess it didn't like the ppa?20:15
clarkbthat isn't very verbose about what the broken package is20:17
clarkbare we missing an apt-get update?20:17
clarkbthe timing for the ppa addition implies it did more than just write a file20:19
clarkbya ansible docs  say it should do an update by default after adding a new repo20:24
clarkbthe zuul render from the js may include better debugging info20:25
clarkber, from the json20:25
clarkbdigging in that json by hand isn't much fun I've found20:25
fungiyeah, looks like we don't have the output from apt-get20:26
fungiif it's in the json, that may help20:26
fungiopenafs-client : Depends: libcrypt1 (>= 1:4.1.0) but it is not installable20:27
clarkbyup its in the json20:27
clarkbyou found it too20:27
fungiindeed, flamel is indispensable20:27
fungiand that was using the focal build instead of the bionic one?20:28
clarkb says that package doesn't exist at all?20:28
fungiyeah, introduced in bullseye20:28
clarkbbut ubuntu is happy with it?20:29
clarkbhrm bionic doesn't have it either so maybe we edited the dep list for bionic and xenial?20:30
clarkbI'll try to use the bionic ppa to see if that fixes it20:30
fungiwell, deps may be automatically inferred from what libraries the linker pulls in20:30
clarkboh interesting20:30
fungiso assuming autotools, it probably detected libcrypt was present and used it20:31
clarkbI didn't realize debian packaging was doing that20:31
openstackgerritClark Boylan proposed opendev/system-config master: Use our ubuntu openafs ppa on debian
fungithough libcrypt1 replaced libxcrypt20:31
clarkbthere is bionic20:31
fungiyeah, debhelper allows you to embed placeholders in the dependencies which get auto-filled at package build time20:32
clarkbI'll send out the meeting agenda in about half an hour. If you get a chance to look that over first that would be great20:32
fungiDepends: ${shlibs:Depends}, ${misc:Depends}, lsb-base (>= 3.0-6)20:34
fungithat's what it looks like in the packaging config20:34
fungiso shlibs:Depends gets filled in by debhelper's dh_shlibdeps20:35
fungiand `man dh_shlibdeps` explains every painful detail20:35
fungiso to conclude, the bionic build will likely depend on libxcrypt instead of libcrypt1, and work fine on buster (assuming no other dependencies had different breaking transitions between when bionic was snapshotted from sid and when buster froze)20:37
fungilooking over the agenda now20:37
fungimnaser did get his agenda item added, good20:40
fungiwe should probably talk about project renames sometime soon, if not this week. have we done any since the upgrade? the past months are a blur. is going to need a rename once they decide on a namespace, not sure if there are others waiting who don't know to add themselves to the agenda20:41
fungiclarkb: i don't see anything missing from the current agenda wiki article, no20:43
clarkbthanks for checking I'll get that sent out shortly20:43
clarkbfungi: looks like the debian jobs are passing now when switched to bionic from focal20:45
clarkbthank you for looking at that with me. I learned a bit more about packaging there ;)20:45
clarkbI think that role is only used on debian for wheel builds? is that right?20:46
clarkbif so the impact to making that change should be fairly small, but maybe a good idea to have ianw double check it too?20:46
fungisounds right20:47
fungiit's getting to be the time where he often appears, so happy to wait20:47
ianwsorry catching up20:52
clarkbianw: no rush, basically I picked up the gerrit thing to help out on that side and discovered it runs the openafs-client testing from zuul-jobs which needs working openafs on debian and centos. We fixed centos by updating the job that builds the package to build 1.8.7 instead of 1.8.620:54
clarkbthe change above proposes we install our ppa package on buster as the fix for debian20:54
ianwahh yes i meant to do that20:55
clarkbianw: left a few notes on one of which needs fixing (so I -1'd)20:55
fungiianw: probably the most exciting thing to note is that afs01.dfw suffered from a hypervisor host outage in rackspace, and was rebooted after many hours offline20:56
ianwfungi: yeah, so basically we need to get that to 1.8 now?20:56
fungialso i discovered that the docs volume is replicated to afs01.ord, so was replacing it there at a crawl20:57
fungibut it completed, i've since started removing locks i was holding for mirror servers, but so far just the smallest one (mirror.yum-puppetlabs which is now done)20:58
fungii can hold off allowing other mirrors to update if we want to avoid any lengthy vos releases in progress20:58
fungimirror.yum-puppetlabs finished roughly an hour ago20:59
ianwfungi: yeah, basically docs is the only thing on ORD20:59
ianwi think we got rid of everything mirroring to it because it was so slow?20:59
fungiunfortunately afs01.dfw went offline while docs wasn't done being rebuilt on afs01.ord so the site ended up offline20:59
fungibut now it should be okay21:01
fungiyeah, restoration of the docs volume from dfw to ord took ~7.5 hours at 8Mbps to replicate all 23GiB of content21:02
ianwbut basically afs01.dfw is now running 1.6 with the bad timestamp issue, since it got restarted, right?21:02
fungiwe suspect so, yes21:02
fungithough i haven't really seen any issues stemming from that so far today21:02
ianwok, well ORD seemed to go OK with its update21:04
ianwi'll just look at 771159; my initial thought was to run that against ORD and make sure it's idempotent21:04
fungiis it just me or is the executor not running on ze01?21:08
openstackgerritIan Wienand proposed opendev/system-config master: openafs-server : add ansible roles for OpenAFS servers
fungiyeah i think it maybe didn't start when we rebooted the server a few days ago for the openafs update?21:09
fungistarting it now21:09
ianwi agree, doesn't look like it crashed21:09
fungigraphite was reporting only 11 executors too, so i expect the others are still running21:09
fungiwhat's the easiest way to tell if the console stream interface is still running on an executor? whether or not there's a process listening on 7900/tcp?21:12
ianwyep i think so.  oom messages in dmesg is usually a smoking gun21:12
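One way to script the check discussed above is to attempt a TCP connection to the port. This is a generic sketch, not a Zuul tool: the helper name is hypothetical, and it only tells you that something accepts connections on the port, not that it is the console streamer.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Run on the executor itself, `port_is_listening("localhost", 7900)` would correspond to the 7900/tcp check mentioned in the conversation.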
ianwusing bionic afs on debian should be self-testing with the integration tests right?21:13
fungii think so?21:13
fungitrying the focal builds was at least21:13
fungiwhich is why it got switched to bionic21:14
ianwoh, in those 5 seconds zuul reported21:14
ianwyeah, that tests afs client install and access, so must be good21:14
fungiso should i hold off letting any other mirrors resync to afs02.dfw just yet? i guess we don't want to be in the middle of what could be a multi-day replication while we're upgrading openafs packages/rebooting things?21:21
ianwfungi: sorry, eating toast, back now21:33
ianwumm, i think yes maybe we should upgrade and just get it done21:33
ianwthe other thing to consider is to release sequentially21:34
fungithe other thing to consider about those mirror volumes, since they're not current on afs02.dfw, they won't be used, so if afs01.dfw is offline then any requests to our mirrors will result in a 40321:43
fungi(we saw that happen during the afs01.dfw outage earlier today as well)21:43
fungii suppose we could snapshot zuul's queues, restart the scheduler, and then restore them to avoid a bazillion builds failing due to our mirrors all going offline21:44
ianwhrm, i guess we should upgrade 02, sync volumes then do 0121:45
fungisyncing the outstanding mirror volumes is going to require the better part of a week, just keep that in mind21:46
fungiit's certainly a conundrum21:47
ianwfungi: i'm thinking through a plan v2 on
fungiit looks like we were able to sync the mirror.yum-puppetlabs volume to afs02.dfw at around 300Mbps,21:51
ianwwe've got the db servers too21:51
fungiso we copied 200GiB in 1.5 hours give or take21:52
fungimaybe the other volumes won't take *too* terribly long21:52
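As a quick sanity check of the figures quoted above, the implied average throughput can be computed from the volume size and elapsed time. This sketch assumes GiB means 2^30 bytes and Mbps means 10^6 bits per second; the function name is invented for the example.

```python
def implied_mbps(size_gib, hours):
    """Average rate in megabits/second for size_gib transferred in hours."""
    bits = size_gib * 2**30 * 8
    return bits / (hours * 3600) / 1_000_000


# 200 GiB in roughly 1.5 hours, as reported for mirror.yum-puppetlabs
print(round(implied_mbps(200, 1.5)))  # about 318, in line with "around 300Mbps"
```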
openstackgerritMerged opendev/system-config master: Use our ubuntu openafs ppa on debian
ianwi'm running 771159 manually now and watching it21:58
ianwoh!  i need to put the base64 keys in22:00
fungiinto /etc/openafs/server/KeyFileExt?22:06
ianwinto bridge hiera so it gets deployed22:10
ianwalright, i confirmed with a manual run that doesn't do anything crazy to afs01.ord22:30
ianwand is also restricted to that host22:30
ianwfungi: you ok to merge with that?22:30
openstackgerritIan Wienand proposed opendev/system-config master: openafs-server: ensure vos_release keys installed on new servers
openstackgerritIan Wienand proposed opendev/system-config master: Move to afs-1.8 group
fungiianw: yep, approving now, thanks for giving it a trial run!22:44
ianwok, when that is in, we can do a manual upgrade of afs02.dfw and switch that to ansible control.  i think we should try some small releases and then decide what to do next22:48
ianwi think i'll start work on the db side now22:48
fungiyeah, looking over the numbers now23:07
fungicentos may be the next one to unleash23:07
fungior opensuse is about the same size23:08
fungiwe're not graphing the ceph mirror volumes though, they may actually be much smaller23:08
* fungi checks23:08
fungioh, yeah those will go quickly... ceph-deb-nautilus has 2.9GiB of data and ceph-deb-octopus 7.7GiB23:09
ianwhrm, it may be that i reorganised ceph and didn't update the graphs?  can't quite remember23:09
ianwi thought i consolidated those to one volume, but maybe it was just that, a thought :)23:10
ianwi think maybe i just cleaned up some old ones23:10
fungiwe still update and vos release them separately in the cronjobs, at least23:10
ianwit looks like for the afsdb servers we basically just extra-install openafs-dbserver23:11
fungianyway i can drop the lock and manually run the reprepro script for mirror.deb-nautilus which in theory will only take a few minutes to complete23:12
ianwi guess the "bos" setup for fileserver v db server isn't recorded in config mgmt23:12
ianwfungi: ++23:12
fungionce you're ready for testing afs02.dfw23:12
fungii'll be around for at least a few more hours23:12
openstackgerritMerged opendev/system-config master: openafs-server : add ansible roles for OpenAFS servers
ianwfungi: ^ let's let that deploy, then we can emergency afs02.dfw and manual upgrade; do a release test, and when good merge 771285 to manage it with ansible23:15
ianwgiven that 1) a lot of things aren't 100% captured by config mgmt, 2) afs doesn't like ip addresses changing and 3) generally not that many complex dependencies I'm thinking that in-place upgrades of these hosts from xenial->focal is the best idea23:16
ianwinfra-prod-install-ansible (3. attempt)23:21
ianwi wonder why that's on its 3rd try23:21
ianwsigh, looks like the streamer is dead
ianwwarning: failed to remove launch/__pycache__/sshclient.cpython-36.pyc: Permission denied23:24
ianwi guess that must be related23:24
corvusmordred, ianw: we have a 'yamlgroup' ansible inventory plugin in system-config; is that published anywhere else?  if not, any plans to do so?23:27
ianwcorvus: i feel like maybe that was written in rush mode after we were having issues with the openstack inventory plugin that was querying all the clouds?23:29
ianwiirc git blame might at some point show me adding regex support to it23:29
ianwthat feels like all i know :)23:29
corvusianw: yes, there are 2 commits, one from you and one from mordred :)23:30
corvusmordred's commit explains why we created it; and i think it's a good generally useful plugin23:30
corvusi'm working on an install where i'd actually like to use it in concert with the openstack inventory plugin23:30
ianwyeah, so clearly mordred was the driver; i imagine there was no specific reason to not publish it more widely23:31
corvusokay, maybe next time mordred is around we can see if there's a good home for it (maybe publish it as a collection or something?)23:32
ianwcorvus: ++23:34
ianwinfra-prod-ansible stopped working @   on 2021-01-18 15:01:5423:34
ianwit appears to be this bit23:34
ianw# Clean is needed because we pushed to a non-bare repo23:34
ianwgit clean -xdf23:34
ianwi wonder if that is new23:34
ianwhrm ...
ianwFail mirror-workspace-git-repos if checkout failed23:36
ianwthat change looks correct23:37
ianwi think we must have run the launch script out of /home/zuul/src/.../system-config/launch as root which created these pyc files23:38
ianwi can remove them, but i wonder if we can stop that tree from creating them23:38
ianwi don't think you can, easily, without setting environment variables23:41
openstackgerritIan Wienand proposed opendev/system-config master: : don't write bytecode
fungithat seems like a reasonable precaution23:50
ianwok, i'm going to run the clean as root, which should get things working, and ^ should help minimise harm in the future23:50
fungiyeah, i think that's the right choice23:51
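The log does not show how the proposed change suppresses bytecode. Besides the environment variable route ianw mentions (Python honours PYTHONDONTWRITEBYTECODE), a script can set sys.dont_write_bytecode itself before importing its local modules. The following is a minimal sketch of that second approach, not the actual change; the module name sshclient_demo and the helper are made up for illustration.

```python
import importlib
import os
import sys
import tempfile


def import_without_bytecode(module_name, search_dir):
    """Import module_name from search_dir without writing a .pyc file."""
    # Must be set before the import; it has no effect on modules
    # that were already imported and cached.
    sys.dont_write_bytecode = True
    sys.path.insert(0, search_dir)
    importlib.invalidate_caches()
    try:
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(search_dir)


# Demonstration in a throwaway directory (hypothetical module name).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "sshclient_demo.py"), "w") as f:
    f.write("VALUE = 42\n")

module = import_without_bytecode("sshclient_demo", workdir)
pycache_exists = os.path.exists(os.path.join(workdir, "__pycache__"))
print("imported value:", module.VALUE)
print("__pycache__ created:", pycache_exists)
```

Setting the flag inside the script protects the tree even when someone runs it as root without the environment variable set, which is the situation described above.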
ianwthere was also playbooks/filter_plugins/__pycache__/23:52
ianwwe'll have to see if that comes back, not sure how that's getting generated :/23:52
fungitimestamps on the files might yield some indication of age23:53
fungiand thus possible frequency23:53
fungilike if they're years old, then maybe not done by automation23:54
ianwthe launch ones were old; unfortunately i didn't realise filter_plugins would go till i ran git clean -dxf and saw it23:54
fungiahh, yeah23:54
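fungi's suggestion of checking timestamps (and, given the permission error, ownership) could be done before running git clean with something like the sketch below. It only uses os.walk and os.stat; the function name is hypothetical and the demo runs against a temporary directory rather than the real system-config checkout.

```python
import os
import tempfile
import time


def find_bytecode(root):
    """Yield (path, owner_uid, age_days) for each .pyc file under root."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                yield path, st.st_uid, (now - st.st_mtime) / 86400


# Demonstration: fake an old bytecode file in a throwaway tree.
demo_root = tempfile.mkdtemp()
cache_dir = os.path.join(demo_root, "launch", "__pycache__")
os.makedirs(cache_dir)
pyc = os.path.join(cache_dir, "sshclient.cpython-36.pyc")
open(pyc, "wb").close()
two_years_ago = time.time() - 2 * 365 * 86400
os.utime(pyc, (two_years_ago, two_years_ago))

results = list(find_bytecode(demo_root))
for path, uid, age_days in results:
    print(f"{path} uid={uid} age={age_days:.0f}d")
```

A uid of 0 on a file in a tree otherwise owned by zuul would point at a manual run as root, and an age of years rather than hours would suggest it is not regenerated by automation.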
ianwi think at this point, the plan is a manual 1.8 upgrade of afs02.dfw and then merge to get it into ansible, and i'll watch the deployment which hopefully can complete23:54
ianwi'll put it in emergency and start the module builds23:55

