clarkb | there are 31 neutron changes in the queue too | 00:05 |
clarkb | which is about 20% of the queue size but likely significantly more when measured in node time | 00:07 |
clarkb | does anyone know why the tripleo-ci-centos-8-standalone jobs in the tripleo gate queue all seem to be waiting? I wonder if they are waiting on resources by a parent job? | 00:09 |
clarkb | that seems to be a big reason for why that queue isn't flushing its contents quickly | 00:09 |
clarkb | well that and the general slowness of the jobs that are running but that is normal | 00:09 |
corvus | infra-root: it looks like perhaps afs release jobs are stuck or slow; is this known? | 00:15 |
ianw | corvus: i believe due to a prior failure of afs02 everything is doing a full release | 00:16 |
corvus | neat | 00:16 |
corvus | it came to my attention because docs volumes aren't updating; from the process list, i'm guessing they have completed their release, but due to the way the cron works, they won't get another release until everything else is done | 00:17 |
clarkb | I think the tripleo jobs that aren't running depend on tripleo-ci-centos-8-content-provider which zuul says has a > 3 hour estimated runtime | 00:17 |
clarkb | so that is likely another place that can be optimized | 00:17 |
clarkb | oh I wonder if that job pauses though so the long runtime is related to waiting for the other jobs to run | 00:18 |
corvus | let me revise that: it looks like the docs release script is waiting only on the release of project.tarballs | 00:20 |
fungi | corvus: yep, i mentioned in #zuul earlier, still keeping an eye on it | 00:20 |
clarkb | https://opendev.org/openstack/tripleo-ci/src/branch/master/zuul.d/standalone-jobs.yaml#L1092 that seems to be a job that builds container images. I wonder if every tripleo content-provider job rebuilds all tripleo images which isn't quick, then other jobs sit around waiting for that to complete | 00:21 |
corvus | fungi: oh, sorry missed that | 00:21 |
fungi | all the mirror volumes are trying to do full releases to afs02.dfw which has immensely slowed the tarballs volume release, and the zuul site release is in a serialized script behind the tarballs full rerelease | 00:21 |
fungi | recovery has been underway since roughly 17:20 utc | 00:22 |
clarkb | the neutron changes also run that tripleo content-provider job as a dep for the tripleo jobs run against neutron changes | 00:23 |
clarkb | I'm running out of steam to do a zuul throughput debug but we may need to talk to the openstack TC tomorrow if this persists or gets worse | 00:24 |
clarkb | it does seem like we've got at least a couple of setups that are making it bad | 00:24 |
clarkb | I expect the build the images step is a response to docker hub doing rate limits | 00:25 |
corvus | fungi, ianw: are we sure afs01.dfw is okay? | 00:25 |
clarkb | quay.io says they won't do image download rate limits (though if you make too many lookups they rate limit those) ? I wonder if we need to work with tripleo to look at quay again | 00:25 |
corvus | i'm seeing 'vos examine' hang on volumes that are on that server | 00:26 |
clarkb | corvus: I think it stopped complaining about its cinder volume after rebooting, but I'm not sure if further verification of health was made | 00:26 |
fungi | corvus: not entirely sure since the server itself hung last week and had to be rebooted. it looked like things had recovered but possible i missed something | 00:26 |
fungi | clarkb: afs01 not 02 | 00:26 |
clarkb | fungi: oh got it | 00:27 |
fungi | for afs01 the server hung last week and had to be hard rebooted to recover (likely something related to live migration in rax) | 00:27 |
fungi | for afs02 one of its cinder volumes was on a host which became unresponsive, so the /vicepa partition remained in a read-only state until i rebooted the server | 00:28 |
fungi | corvus: where did you see hung vos examine processes? possible those were somewhere i missed after the reboot last week | 00:29 |
ianw | afs01 seems up and no bad messages as step 1 | 00:29 |
corvus | fungi: in my terminal | 00:29 |
fungi | ahh | 00:29 |
corvus | vos examine test.fedora | 00:29 |
fungi | yeah, it takes a while for me to get a response on that as well (maybe forever, hasn't returned yet) | 00:30 |
fungi | corvus: looks like there's a hung `vos examine project.tarballs` from 10 minutes ago on mirror-update.opendev.org too, is that you as root? | 00:31 |
ianw | nothing bad in the afs logs, but agree it's not looking healthy | 00:32 |
corvus | yep | 00:32 |
fungi | could it be that having so many volumes doing full releases in parallel has overwhelmed it? | 00:32 |
corvus | could be? maybe the volserver has a limited queue for answering requests and it's full of release txns? | 00:34 |
fungi | project.tarballs, mirror.yum-puppetlabs, mirror.debian, mirror.opensuse, mirror.debian-security, mirror.ubuntu-ports, mirror.epel, mirror.centos, mirror.fedora, mirror.deb-octopus, mirror.deb-docker, mirror.ubuntu-cloud, mirror.apt-puppetlabs, mirror.deb-nautilus, mirror.ubuntu | 00:34 |
fungi | those are all running simultaneously | 00:34 |
corvus | since we're not seeing any errors, probably makes sense to just leave it be until the backlog clears a bit | 00:34 |
fungi | basically the static volume releases kicked off and those went fine until it got up to project.tarballs which needed a full release and is something like 175GB of data | 00:35 |
fungi | roughly an hour into that, mirror volumes started getting vos release runs and i expect many if not most of them decided they needed a full release too | 00:35 |
fungi | and they piled up before i noticed | 00:35 |
corvus | would be cool if we could shard that by date :) | 00:36 |
corvus | next time we design a tarballs archive.... | 00:36 |
fungi | indeed | 00:36 |
*** tosky has quit IRC | 00:36 | |
fungi | in theory at the date of data transfer i'm seeing the project.tarballs full release would have only needed two hours to complete | 00:37 |
fungi | except about halfway into that it started having to compete with full releases of mirror volumes | 00:37 |
fungi | er, rate of data transfer i'm seeing | 00:37 |
fungi | we're clocking ~50Mbps into eth0 on afs02.dfw | 00:38 |
ianw | iotop on afs01 shows a lot of reads, and a lot of writes on afs02 ... it does appear to be doing *something*, although i agree none of the status commands return | 00:39 |
fungi | er, actually bad math on my part. more like 9 hours to complete a full release of 175GiB at 45Mbps (closer to what the graph indicates) | 00:41 |
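fungi's revised estimate can be sanity-checked with a quick back-of-the-envelope calculation using the figures from the discussion above (a 175 GiB volume at a sustained ~45 Mbps into afs02.dfw):

```python
# Back-of-the-envelope: how long does a full release of the
# project.tarballs volume take at the observed transfer rate?
volume_bytes = 175 * 2**30        # ~175 GiB of data, per fungi
rate_bps = 45 * 10**6             # ~45 Mbps sustained into afs02.dfw

seconds = volume_bytes * 8 / rate_bps
print(f"{seconds / 3600:.1f} hours")  # -> 9.3 hours
```

which agrees with the "more like 9 hours" figure above, before the competing mirror volume releases are factored in.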
fungi | i'm increasingly tempted to kill the mirror volume releases and hold locks for them, then remove the locks in an orderly fashion once static site volumes are done | 00:42 |
fungi | when i thought the tarballs volume could finish at any time i was less concerned about it, but at this point i'm quite sure the mirror volumes are going to need faaar longer and drag out the tarballs site finishing by a lot more than i originally estimated | 00:43 |
ianw | fungi: i once had a change out to put a global stop to mirror releases ... | 00:44 |
ianw | https://review.opendev.org/c/opendev/system-config/+/680586 | 00:45 |
fungi | thinking through this, the main risk is if i kill an in progress vos release, the volume it was releasing might need a full release instead of an incremental one. the problem volumes are likely ones doing full releases anyway and i bet they're not very far along... so maybe if i check the logs and only kill the mirror volumes doing a full release, then not much progress is lost? | 00:45 |
fungi | any of the ones where their last loglines are currently something like "Starting ForwardMulti from .* to .* on afs02.dfw.openstack.org (full release)." | 00:47 |
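A sketch of how those in-progress full releases could be tallied from the per-volume logs. Only the ForwardMulti log line comes from the discussion above; the log directory layout, filenames, and function name are illustrative assumptions:

```shell
# count_full_releases DIR
#   Print each mirror volume log in DIR whose *last* line indicates an
#   in-progress full release to afs02.dfw. Only the final line matters:
#   once a full release finishes, later log lines follow the
#   ForwardMulti one. (Directory layout is an assumption.)
count_full_releases() {
    dir=$1
    for f in "$dir"/mirror.*.log; do
        [ -e "$f" ] || continue
        tail -n 1 "$f" | \
            grep -q 'ForwardMulti from .* on afs02\.dfw\.openstack\.org (full release)' \
            && echo "$f"
    done
}
```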
auristor | fungi: the most likely reason that "vos status afs01.dfw.openstack.org" is failing to return is that there are no available threads to process the request. | 00:48 |
fungi | auristor: sounds likely, thanks! | 00:49 |
fungi | yeah, currently 14 mirror volumes all in the midst of a full release to afs02.dfw according to their respective logs | 00:49 |
auristor | although rxdebug reports that there are currently 10 calls waiting for a thread and two threads are idle | 00:49 |
fungi | plus the project.tarballs volume | 00:49 |
*** d34dh0r53 has quit IRC | 00:50 | |
ianw | the other thing is, we'd be in a much better position to put in a sequential lock for the mirror volume releases now that they're all done from the same server | 00:50 |
fungi | great point | 00:51 |
fungi | well, except for the wheel volumes i think? | 00:52 |
fungi | but those aren't a huge problem | 00:52 |
*** d34dh0r53 has joined #opendev | 00:54 | |
*** auristor has quit IRC | 00:56 | |
openstackgerrit | melanie witt proposed opendev/elastic-recheck master: Add query for bug 1911574 https://review.opendev.org/c/opendev/elastic-recheck/+/770688 | 00:59 |
openstack | bug 1911574 in OpenStack-Gate "SSH to guest sometimes fails publickey authentication: AuthenticationException: Authentication failed." [Undecided,New] https://launchpad.net/bugs/1911574 | 00:59 |
fungi | actually a more careful analysis of the mirror update logs indicates there are only 8 with a full release in progress so i'll stop those first and put locks in place, then see which others queue up and whether they want full releases too | 01:00 |
*** auristor has joined #opendev | 01:07 | |
*** auristor has quit IRC | 01:19 | |
*** auristor has joined #opendev | 01:20 | |
fungi | okay, i've stopped releases underway for the following and held locks in a root screen session on mirror-update.o.o: centos, epel, fedora, opensuse, yum-puppetlabs, debian-security, debian, ubuntu-ports | 01:21 |
auristor | what do you mean by "stopped releases"? do you mean you killed the vos processes? that isn't going to cancel the transfers between afs01.dfw and afs02.dfw. it will result in the transferred data being discarded after the transfer completes. | 01:25 |
fungi | yeah, i still need to do something about the transactions, presumably | 01:25 |
fungi | just trying to make sure our scripts don't restart them once i have | 01:25 |
auristor | until there are available threads to process incoming rpcs there is nothing that can be done. | 01:26 |
fungi | ahh, even the vos endtrans calls will queue up i guess | 01:26 |
fungi | so need to wait for at least one of them to complete | 01:26 |
auristor | one of the design flaws in openafs that was addressed by auristorfs is that you cannot terminate the transfer between the volservers | 01:27 |
auristor | also, I strongly recommend that no more than 5 volume operations be permitted in flight at a time. | 01:27 |
fungi | thanks, sounds like our plan to start setting up a semaphore of some sort with these is a good one in that case | 01:28 |
fungi | usually it's not a problem, but if something happens to the server and most of the volumes have to get full releases they'll pile up | 01:29 |
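auristor's "no more than 5 volume operations in flight" advice could be enforced with something as simple as a pool of flock(1) slots around each vos invocation. A minimal sketch; the function name, lock directory, and retry interval are illustrative assumptions:

```shell
# Run a vos operation while holding one of N lock slots, so at most
# N volume operations are ever in flight (N=5 per the advice above).
# Lock directory and slot count are illustrative.
with_vos_slot() {
    slots=${VOS_SLOTS:-5}
    lockdir=${VOS_LOCKDIR:-/var/run/vos-slots}
    mkdir -p "$lockdir"
    while :; do
        for i in $(seq 1 "$slots"); do
            # -n: fail fast if this slot is held; -E 200 distinguishes
            # "slot busy" from the wrapped command's own exit status.
            flock -n -E 200 "$lockdir/slot$i.lock" "$@"
            rc=$?
            [ "$rc" -ne 200 ] && return "$rc"
        done
        sleep 10   # every slot busy; wait and retry
    done
}

# e.g.: with_vos_slot vos release mirror.centos -localauth
```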
*** cloudnull has quit IRC | 01:29 | |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Fix building error with element dracut-regenerate https://review.opendev.org/c/openstack/diskimage-builder/+/770241 | 01:53 |
*** chateaulav has joined #opendev | 02:29 | |
*** chateaulav has quit IRC | 02:33 | |
*** mlavalle has quit IRC | 02:34 | |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Fix centos 8.3 partition image building error with element iscsi-boot https://review.opendev.org/c/openstack/diskimage-builder/+/770701 | 03:00 |
openstackgerrit | Jeremy Stanley proposed opendev/engagement master: Initial commit https://review.opendev.org/c/opendev/engagement/+/729293 | 03:01 |
openstackgerrit | Jeremy Stanley proposed opendev/engagement master: Initial commit https://review.opendev.org/c/opendev/engagement/+/729293 | 03:04 |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Add rhel support for iscsi-boot https://review.opendev.org/c/openstack/diskimage-builder/+/770702 | 03:25 |
*** brinzhang0 has joined #opendev | 03:27 | |
*** brinzhang0 has quit IRC | 03:29 | |
*** brinzhang0 has joined #opendev | 03:29 | |
*** brinzhang_ has quit IRC | 03:30 | |
*** brinzhang0 has quit IRC | 03:30 | |
*** brinzhang0 has joined #opendev | 03:31 | |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Add aarch64 support for rhel https://review.opendev.org/c/openstack/diskimage-builder/+/770703 | 03:37 |
openstackgerrit | Merged opendev/system-config master: Publish structured data listing our ML archives https://review.opendev.org/c/opendev/system-config/+/751125 | 03:38 |
*** whoami-rajat__ has joined #opendev | 04:22 | |
*** artom has quit IRC | 04:24 | |
*** amotoki has quit IRC | 04:44 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: vos-release: implement sequential release lock https://review.opendev.org/c/opendev/system-config/+/770705 | 04:44 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: mirror-update: create common timeout set https://review.opendev.org/c/opendev/system-config/+/770706 | 04:44 |
*** amotoki has joined #opendev | 04:44 | |
ianw | fungi: ^ | 04:45 |
ianw | mnaser: so we have two vexxhost backups; the old bup one @ backup01.sjc-1.vexxhost.opendev.org and the new borg one @ backup02.sjc-1.vexxhost.opendev.org | 04:53 |
ianw | mnaser: the old one has 3tb attached, 2.4 taken, so i can't take any drives out of that. the new one only has 1tb attached and is at 98% | 04:54 |
ianw | mnaser: i think ideally we'd like to keep the old bup 3tb of backups for a while, just as ... well ... a backup. but we're out of quota to add more disk to the new server | 04:55 |
ianw | mnaser: let me know what you think. if adding extra is an issue, we could probably drop the old bup backups on vexxhost and just leave one copy in RAX; i.e. effectively free up the 3tb of the old server and allocate to new | 04:56 |
*** ykarel has joined #opendev | 05:22 | |
*** ykarel has quit IRC | 05:39 | |
*** ykarel has joined #opendev | 05:41 | |
*** ysandeep|out is now known as ysandeep|afk | 05:43 | |
*** ykarel_ has joined #opendev | 05:54 | |
*** ykarel has quit IRC | 05:57 | |
ianw | fungi: ooohhhh ... hrm 770705 & 770706 actually have a little dependency on production in testing. it does a no-op test of the vos-release script in the testinfra | 06:03 |
ianw | fungi: the problem being that it does a vos examine (or whatever) on the volumes to determine if they need to have release run on them. that is timing out due to the prior discussion, failing the test | 06:04 |
ianw | i dunno what to do about that. probably just leaving it till everything settles down is the best course of action at this point | 06:04 |
*** ykarel__ has joined #opendev | 06:05 | |
*** marios has joined #opendev | 06:07 | |
*** ykarel__ is now known as ykarel | 06:07 | |
*** ykarel_ has quit IRC | 06:08 | |
*** brinzhang_ has joined #opendev | 06:25 | |
*** brinzhang0 has quit IRC | 06:28 | |
*** auristor has quit IRC | 06:43 | |
*** ysandeep|afk is now known as ysandeep | 07:21 | |
*** eolivare has joined #opendev | 07:42 | |
*** jaicaa has quit IRC | 07:44 | |
*** jaicaa has joined #opendev | 07:45 | |
*** rpittau|afk is now known as rpittau | 07:47 | |
*** ralonsoh has joined #opendev | 07:51 | |
*** jpena|off is now known as jpena | 07:52 | |
*** ralonsoh_ has joined #opendev | 07:56 | |
*** ralonsoh has quit IRC | 07:59 | |
*** slaweq has joined #opendev | 08:03 | |
*** fressi has joined #opendev | 08:05 | |
danpawlik | ianw, fungi: Hey, do you have some issue with AFS mirror? | 08:05 |
danpawlik | seems that the release dates have not been updated for a day for all distros | 08:05 |
*** sgw has quit IRC | 08:11 | |
*** zoharm has joined #opendev | 08:14 | |
*** andrewbonney has joined #opendev | 08:17 | |
*** hashar has joined #opendev | 08:22 | |
*** ralonsoh has joined #opendev | 08:28 | |
*** ralonsoh_ has quit IRC | 08:28 | |
zoharm | Hi all, would like to ask here for some pointers regarding setting up Cinder volume backend driver 3rd party CI. | 08:31 |
zoharm | We currently have devstack streamlined to run with our storage backend and are able to launch successful tempest runs. My question/request is, what are some useful resources documenting the integration points needed for Gerrit, initiating assigned CI runs, and publishing results? | 08:31 |
zoharm | And any recommendations for the setup architecture would be greatly appreciated! Thank you! | 08:32 |
*** fp4 has joined #opendev | 08:35 | |
*** fp4 has quit IRC | 08:39 | |
*** fp4 has joined #opendev | 08:41 | |
*** brinzhang0 has joined #opendev | 08:45 | |
*** brinzhang_ has quit IRC | 08:48 | |
*** tosky has joined #opendev | 08:49 | |
openstackgerrit | Merged zuul/zuul-jobs master: upload-artifactory: no_log upload task https://review.opendev.org/c/zuul/zuul-jobs/+/768111 | 08:51 |
*** sgw has joined #opendev | 08:51 | |
*** slaweq has quit IRC | 08:55 | |
frickler | danpawlik: yes, one afs node crashed yesterday and we are still trying to get things back into sync | 09:00 |
*** slaweq has joined #opendev | 09:00 | |
zbr | Am i the only one who quite often finds gerrit stuck forever with "Loading..."? | 09:08 |
zbr | i have to reload the page again to fully load it, as when it happens it almost never finishes. | 09:09 |
zbr | it did happen in the past but during the last week it has become very common | 09:09 |
*** ykarel_ has joined #opendev | 09:13 | |
zbr | frickler: low hanging, https://review.opendev.org/c/opendev/git-review/+/770556 thanks. | 09:14 |
*** ykarel has quit IRC | 09:16 | |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 09:29 |
*** fressi has quit IRC | 09:39 | |
*** fressi has joined #opendev | 09:40 | |
*** ykarel_ is now known as ykarel | 10:06 | |
openstackgerrit | Andy Ladjadj proposed zuul/zuul-jobs master: [ensure-python] install python version only if not present https://review.opendev.org/c/zuul/zuul-jobs/+/770656 | 10:14 |
*** sgw1 has joined #opendev | 10:15 | |
*** sgw has quit IRC | 10:16 | |
*** lpetrut has joined #opendev | 10:33 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 10:40 | |
openstackgerrit | Andy Ladjadj proposed zuul/zuul-jobs master: [ensure-python] install python version only if not present https://review.opendev.org/c/zuul/zuul-jobs/+/770656 | 10:56 |
*** dtantsur|afk is now known as dtantsur | 10:56 | |
*** brinzhang has joined #opendev | 10:57 | |
*** brinzhang has quit IRC | 10:58 | |
*** brinzhang has joined #opendev | 10:58 | |
*** brinzhang0 has quit IRC | 10:59 | |
mnaser | ianw: how long would you keep it up? | 11:03 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul-jobs master: [ensure-python] install python version only if not present https://review.opendev.org/c/zuul/zuul-jobs/+/770656 | 11:06 |
openstackgerrit | Andy Ladjadj proposed zuul/zuul-jobs master: [ensure-python] install python version only if not present https://review.opendev.org/c/zuul/zuul-jobs/+/770656 | 11:07 |
*** ysandeep is now known as ysandeep|afk | 11:17 | |
*** DSpider has joined #opendev | 11:19 | |
*** brinzhang has quit IRC | 11:25 | |
*** brinzhang has joined #opendev | 11:25 | |
*** brinzhang_ has joined #opendev | 11:26 | |
*** brinzhang_ has quit IRC | 11:28 | |
*** brinzhang_ has joined #opendev | 11:29 | |
*** brinzhang has quit IRC | 11:30 | |
*** fressi has quit IRC | 11:32 | |
*** fressi has joined #opendev | 11:39 | |
*** hashar is now known as hasharLunch | 11:54 | |
*** fressi has quit IRC | 11:54 | |
*** brinzhang0 has joined #opendev | 12:03 | |
*** brinzhang_ has quit IRC | 12:05 | |
openstackgerrit | Andy Ladjadj proposed zuul/zuul-jobs master: [ensure-python] install python version only if not present https://review.opendev.org/c/zuul/zuul-jobs/+/770656 | 12:21 |
*** hasharLunch is now known as hashar | 12:22 | |
*** jpena is now known as jpena|lunch | 12:33 | |
*** auristor has joined #opendev | 12:39 | |
*** ysandeep|afk is now known as ysandeep | 12:52 | |
*** michael-mcaleer has joined #opendev | 12:58 | |
michael-mcaleer | Hi OpenDev team, I have a question around editing watchlists in review.opendev.org. The docs say to go through Settings > Watched Projects but since the recent changes to gerrit I seem to have lost the ability to add/remove watched tags or projects. I am trying to unsubscribe from tags from cinder after moving teams | 13:01 |
michael-mcaleer | Can you help point me in the right direction here? Thanks! | 13:01 |
frickler | michael-mcaleer: I think https://review.opendev.org/settings/#Notifications should be what you are looking for | 13:08 |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 13:10 |
michael-mcaleer | Thanks frickler, I was able to find what I was looking for | 13:13 |
michael-mcaleer | I also needed to remove it from launchpad, that was where I was going wrong | 13:13 |
*** brinzhang0 has quit IRC | 13:13 | |
*** brinzhang0 has joined #opendev | 13:14 | |
*** chateaulav has joined #opendev | 13:16 | |
*** sboyron has joined #opendev | 13:26 | |
*** jpena|lunch is now known as jpena | 13:29 | |
*** artom has joined #opendev | 13:30 | |
*** zul has quit IRC | 13:49 | |
*** ykarel has quit IRC | 14:02 | |
slaweq | hi fungi and other infra-root guys, can You take a look at | 14:04 |
slaweq | https://review.opendev.org/c/zuul/zuul-jobs/+/762650? | 14:04 |
slaweq | thx in advance | 14:04 |
*** ysandeep is now known as ysandeep|cinder_ | 14:14 | |
*** ysandeep|cinder_ is now known as ysandeep|session | 14:14 | |
openstackgerrit | Alfredo Moralejo proposed zuul/zuul-jobs master: Rename config repos file config for CentOS Stream https://review.opendev.org/c/zuul/zuul-jobs/+/770815 | 14:16 |
*** fressi has joined #opendev | 14:16 | |
openstackgerrit | Benedikt Löffler proposed zuul/zuul-jobs master: Pass environment variables to 'tox envlist config' task https://review.opendev.org/c/zuul/zuul-jobs/+/770819 | 14:25 |
*** fressi has quit IRC | 14:32 | |
*** fbo has quit IRC | 14:45 | |
*** fbo has joined #opendev | 14:47 | |
fungi | danpawlik: yes, there was a catastrophic storage failure for afs02.dfw due to a cinder outage in the provider, it's being rewritten very slowly but volume releases are delayed due to limited bandwidth to complete that | 14:50 |
guillaumec | zuul-promote-docs doesn't seem to update Zuul Documentation https://zuul-ci.org/docs/zuul/index.html, recent https://review.opendev.org/644927 and https://review.opendev.org/732066 doc update aren't online | 14:50 |
fungi | oh, i see frickler replied further down in my scrollback | 14:50 |
danpawlik | fungi, frickler: thanks for the information | 14:51 |
fungi | guillaumec: yes, we're roughly a day backlogged on replicating afs volumes due to a catastrophic cinder failure in rackspace | 14:51 |
guillaumec | fungi, ok | 14:51 |
fungi | and that site is served out of a read-only afs volume replica | 14:51 |
fungi | it will update once all the replication catches back up | 14:52 |
frickler | slaweq: can you consider ianw's remark? would it be possible to just rerun this task when you need it instead of using the new script? | 14:54 |
*** cloudnull has joined #opendev | 14:55 | |
slaweq | frickler: but can I run this ansible task from devstack directly? | 15:00 |
slaweq | currently the problem is that: | 15:00 |
slaweq | 1. zuul installs ovs and configures bridges for infra connectivity | 15:01 |
*** sgw1 has left #opendev | 15:01 | |
slaweq | 2. devstack runs and if ovn module is used there, it removes ovs installed previously from packages and installs everything from source | 15:01 |
*** ykarel has joined #opendev | 15:02 | |
slaweq | 3. and then all those settings made by zuul and this role are gone | 15:02 |
slaweq | frickler: that's why I wanted to have a simple script which I can call from the devstack plugin | 15:02 |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 15:07 |
*** michael-mcaleer has quit IRC | 15:17 | |
*** sboyron has quit IRC | 15:31 | |
*** hashar has quit IRC | 15:33 | |
*** lpetrut has quit IRC | 15:33 | |
*** mtreinish has joined #opendev | 15:41 | |
*** JayF has joined #opendev | 15:50 | |
openstackgerrit | Matthew Thode proposed openstack/project-config master: update gentoo from python 3.6 to python 3.8 https://review.opendev.org/c/openstack/project-config/+/770828 | 15:55 |
*** ysandeep|session is now known as ysandeep | 15:58 | |
*** cloudnull has quit IRC | 15:59 | |
clarkb | slaweq: are ovn and ovs not able to coexist because they share the same kernel configuration? | 16:01 |
clarkb | or maybe the same module in the kernel | 16:01 |
slaweq | clarkb: it's not that they can't coexist, ovn requires ovs to be running | 16:02 |
slaweq | clarkb: but the problem in our case is that ovs installed from packages is using a different ovsdb file, different sockets, etc. | 16:02 |
clarkb | slaweq: right I undersatnd that, I think I'm trying to understand why you must flush our existing config | 16:02 |
clarkb | we've intentionally tried to set it up such that it uses very high vxlan numbers and is otherwise out of the way | 16:02 |
openstackgerrit | Matthew Thode proposed openstack/project-config master: update gentoo from python 3.6 to python 3.8 https://review.opendev.org/c/openstack/project-config/+/770828 | 16:03 |
slaweq | so when we are installing new ovs from source, it doesn't see anything that was created earlier when "packaged ovs" was running | 16:03 |
clarkb | I see it completely ignores the preexisting config | 16:03 |
slaweq | clarkb: config is flushed by reinstallation of ovs from source | 16:03 |
clarkb | my concern with adding scripts there is that they won't be tested 99% of the time. I think if we are going to go that route then the role should run those scripts as its setup too | 16:03 |
clarkb | I know others will object to replacing ansible with bash (though I'm personally less concerned about that) | 16:04 |
slaweq | clarkb: I can do that of course but as You said, will others be happy with that? | 16:04 |
*** cloudnull has joined #opendev | 16:05 | |
slaweq | I just need a simple way to "reconfigure" that br-infra bridge again, after ovs is installed from source by the ovn devstack plugin | 16:05 |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 16:05 |
clarkb | slaweq: I think if we are going to supply a script to do this but don't run it as part of the role it is very likely to regress the next time we make a change to the role | 16:05 |
clarkb | which is why I think we should run the scripts in the role too. Maybe ask in #zuul and see if others have strong opinions about replacing ansible with bash | 16:06 |
slaweq | clarkb: ok, I will ask | 16:06 |
slaweq | but not today as I'm almost done for today | 16:06 |
fungi | right, it's really a zuul project, not an opendev project | 16:06 |
clarkb | slaweq: yes enjoy your evening | 16:06 |
slaweq | thx for checking that | 16:06 |
fungi | (zuul-jobs is the zuul standard library in essence) | 16:06 |
clarkb | fungi: did the backlog of changes in zuul come up in the tc meeting? | 16:07 |
clarkb | looks like it has grown significantly since I last looked :/ | 16:08 |
clarkb | looks like pylint caused neutron to reset the gate recently too | 16:08 |
clarkb | I didn't realize any openstack projects used pylint (for this very reason) | 16:09 |
fungi | it did not come up, no | 16:09 |
fungi | wasn't on the agenda i don't think | 16:09 |
*** ykarel is now known as ykarel|away | 16:24 | |
clarkb | fungi: for the afs release is the list of volumes to release shrinking ? | 16:28 |
*** mlavalle has joined #opendev | 16:29 | |
*** hashar has joined #opendev | 16:31 | |
*** zbr3 has joined #opendev | 16:38 | |
*** zbr3 has quit IRC | 16:39 | |
*** zbr9 has joined #opendev | 16:40 | |
*** zbr has quit IRC | 16:40 | |
*** zbr9 is now known as zbr | 16:40 | |
*** ykarel|away has quit IRC | 16:46 | |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 16:49 |
fungi | clarkb: not really, no, because they're all volumes which would require multiple days to do a full release and all got triggered on top of each other | 16:49 |
clarkb | got it | 16:50 |
fungi | and as was subsequently pointed out, killing the vos release processes doesn't terminate the data transfer transactions, just makes them get discarded when they complete, so there's still no open rpc slots to actually tell afs to do anything else | 16:50 |
clarkb | oh TIL | 16:51 |
*** marios has quit IRC | 17:03 | |
*** diablo_rojo has joined #opendev | 17:04 | |
JayF | https://lists.openafs.org/pipermail/openafs-info/2021-January/043013.html in case you were not aware; IDK what version of OpenAFS you all run, but this is particularly nasty. | 17:06 |
*** jpena is now known as jpena|off | 17:07 | |
*** eolivare has quit IRC | 17:07 | |
clarkb | fungi: ^ I think our restarts were prior to that date. But I assume subsequent restarts will completely break us | 17:09 |
clarkb | thats a joy | 17:09 |
fungi | looks like we're using 1.8.6 on focal | 17:09 |
clarkb | fungi: our servers likely pose a bigger issue they are still 1.6 iirc | 17:09 |
JayF | folks were talking about it in #lopsa -- apparently almost any change, client or server side, can trigger it | 17:10 |
fungi | also using 1.8.6 on bionic | 17:10 |
clarkb | oh our servers are 1.8 ? I guess that is "good" | 17:11 |
fungi | at least some of our mirrors are bionic and focal, not sure which ones might be xenial still | 17:11 |
clarkb | I didn't think we had upgraded the servers | 17:11 |
clarkb | fungi: the afs servers themselves | 17:11 |
fungi | ahh | 17:11 |
fungi | yeah, afs01.dfw at least is still using 1.6.15 | 17:12 |
clarkb | I see that email says <1.8 might not be affected in the same way "further research needed" | 17:13 |
fungi | all three fileservers are 1.6.15 | 17:13 |
clarkb | also specifically calls out unauthenticated there | 17:13 |
clarkb | but I thnik our servers talk to each other authenticated? | 17:13 |
fungi | zuul executors are also using 1.8.6 (but on xenial) | 17:14 |
clarkb | fungi: I think the 1.8 packages we install are out of our own ppa | 17:15 |
clarkb | so in theory we can apply those patches and rebuild in the ppa | 17:15 |
fungi | yeah | 17:15 |
fungi | that's a great point | 17:15 |
clarkb | JayF: fungi ya reading the earlier messages it sounds like any action like a vos * starts a new rx stack and can hit this | 17:16 |
clarkb | so it isn't just if we restart services or reload the kernel module | 17:16 |
*** chateaulav has quit IRC | 17:16 | |
clarkb | looking at the change in gerrit it says authenticated calls will fail because it detects the mismatch between the id and the auth state | 17:18 |
*** rjcv has joined #opendev | 17:22 | |
*** rjcv has quit IRC | 17:25 | |
fungi | unrelated to the bug fix, but... | 17:28 |
fungi | infra-root: probably the biggest impact to users while we're still waiting to get vos releases back on track is the static site content volumes... what if we temporarily patched the vhost configs on static.o.o to hit the read-write volume path instead of the read-only path? is that a terrible idea? keep in mind we're probably looking at some time next week before everything is back to normal | 17:28 |
clarkb | I think we have done it in the past | 17:29 |
clarkb | fungi: I don't know if doing that will trip over that bug though | 17:29 |
clarkb | it's possible that changing the target will cause a new rx stack to be created and then we'll break there too? | 17:29 |
clarkb | fungi: I'm not opposed to the idea, I think the biggest risk with it is if it trips on the afs bug. Maybe we can attempt to test things somehow with different hosts? | 17:33 |
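For context, the switch fungi is proposing amounts to a one-line vhost change per site. A minimal sketch, assuming the usual AFS convention that a dot prefix on the cell name selects the read-write path, and using an illustrative site directory:

```apache
# Normal operation: serve from the read-only replica
# DocumentRoot /afs/openstack.org/project/static.example.org
# Temporary workaround: the dot-prefixed mount is the read-write volume
DocumentRoot /afs/.openstack.org/project/static.example.org
```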
clarkb | though this may not be a problem until there is >1 new connection since they won't share the ids until there is >1 | 17:34 |
* JayF feels like he played the role of messenger of pain this morning | 17:34 | |
fungi | JayF: nah, appreciate the heads up | 17:35 |
*** ysandeep is now known as ysandeep|out | 17:35 | |
clarkb | https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs is what we use to install 1.8 on the clients | 17:37 |
clarkb | so I guess we can put those two patches into that then do a forced update on openafs on the openafs clients? | 17:38 |
clarkb | then figure out what it means for the servers? | 17:38 |
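The forced update clarkb mentions would be the standard apt upgrade-in-place once the PPA publishes; a sketch, assuming the usual OpenAFS client package names:

```console
# apt-get update
# apt-get install --only-upgrade openafs-client openafs-krb5 openafs-modules-dkms
```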
fungi | well, once it builds | 17:38 |
clarkb | JayF: you haven't come across any more concrete info for 1.6 have you? | 17:39 |
clarkb | I'm kind of reading it like it doesn't have the specific issue but lacks randomness so can hit this in other ways | 17:39 |
JayF | The only reason I even know about this is a couple of folks (billings and ENOMAD) were talking about the problem in #lopsa | 17:39 |
clarkb | JayF: thanks | 17:39 |
JayF | Might not hurt to drop in there and ask billings? but I wouldn't have high expectations he has an answer | 17:39 |
clarkb | ya I think we can work off of what is published now then go digging when we get to that point | 17:40 |
clarkb | (to avoid bugging people who are likely triaging their stuff) | 17:40 |
JayF | it does sound to me like having 1.6 server doesn't solve it though; 1.8 clients can trigger it just on their own | 17:40 |
JayF | but IMBW, and I haven't worked with AFS in ~10 years | 17:40 |
fungi | right, i think we need to push the fixes into our lp forks for now and then trigger ppa rebuilds | 17:41 |
clarkb | fungi: ya I think that is step 0 | 17:41 |
fungi | might be worth considering updating our servers to 1.8.6 from the ppa as well i guess | 17:41 |
clarkb | reading the comments on the fixes the first fix only makes this work 50% of the time. Need the second fix to get it much better than that | 17:41 |
clarkb | fungi: the reason we haven't done that is the 1.6 -> 1.8 migration requires an outage aiui | 17:42 |
clarkb | but maybe if we're gonna have an outage anyway this is our opportunity | 17:42 |
clarkb | but ya ++ to updating the ppa | 17:42 |
*** rpittau is now known as rpittau|afk | 17:42 | |
clarkb | fungi: the first fix has merged but the second has not yet | 17:43 |
clarkb | but we likely need both (not sure if we want to wait for them to merge the second) | 17:43 |
*** dtantsur is now known as dtantsur|afk | 18:10 | |
*** andrewbonney has quit IRC | 18:17 | |
*** hashar is now known as hasharAway | 18:29 | |
*** diablo_rojo has quit IRC | 18:34 | |
*** hasharAway has quit IRC | 18:34 | |
*** slaweq has quit IRC | 18:38 | |
*** ralonsoh has quit IRC | 18:41 | |
*** diablo_rojo has joined #opendev | 18:43 | |
fungi | okay, lemme hydrate real quick and then i can look at ppa updates for realz | 18:53 |
fungi | also ianw might have suggestions once he's around | 18:53 |
fungi | he seems to have done the most recent uploads for it | 18:54 |
clarkb | fwiw it looks like the second change to fix things has merged so should be good to grab that version | 18:54 |
fungi | oh, awesome, that was fast | 18:54 |
clarkb | I've also been poking at figuring out build stuff in a container but it is slow going as I haven't done this in like 15 years | 18:55 |
clarkb | learning a lot about various things like quilt :) right now I'm just trying to do a local build of the existing package source though | 18:55 |
clarkb | then will try to figure out quilt for applying the patches then rebuild. I'm not in a good spot to test this before uploading though because my container is running on suse with different kernels and all that | 18:55 |
clarkb | all that to say fungi please don't stop looking at it too :) | 18:55 |
fungi | i'll get some pbuilder chroots set up for xenial, bionic and focal on my workstation but that'll need a few minutes | 18:56 |
fungi | creating ubuntu chroots is its own rabbit hole | 19:07 |
clarkb | I used docker which may or may not present problems too :) | 19:09 |
clarkb | I'm sure if you do debian packaging this all makes immediate sense via looking at the debian/ dir in the source but what is the best way to say I want this git diff to be a quilt patch? | 19:10 |
clarkb | my local build succeeds but tests fail. I'm now going to fiddle with patches and see if I can make it do a thing | 19:12 |
fungi | if you do debian packaging regularly you already have clean chroots on hand with the dev tools installed into them and your gpg keys mapped in | 19:18 |
fungi | mmdebstrap was throwing weird gpg errors trying to create ubuntu chroots (and it's not that i'm missing the ubuntu archive keys, those are definitely already installed) so i'm updating the packages on my workstation to rule out something being stale/behind | 19:19 |
*** whoami-rajat__ has quit IRC | 19:20 | |
fungi | yeesh, our node request backlog is up over 4k now | 19:27 |
clarkb | ya nova neutron glance etc are backed up >24 hours now | 19:29 |
clarkb | I brought it up with the tc and called out a couple of the things I noticed and sounds like people are looking at it now | 19:29 |
fungi | clarkb: probably easiest way to get those patches now: https://git.openafs.org/?p=openafs.git;a=patch;h=a3bc7ff1501d51ceb3b39d9caed62c530a804473 https://git.openafs.org/?p=openafs.git;a=patch;h=2c0a3901cbfcb231b7b67eb0899a3133516f33c8 | 19:30 |
clarkb | fungi: thanks, I did manage to get them from gerrit directly eventually | 19:30 |
clarkb | dumping them into the patches dir and updating the series file by hand seems to work when I run quilt push -a | 19:31 |
clarkb | now i'm looking to verify the diff between that state and the old state (but I didn't record the old state first so now learning how to revert with quilt) | 19:31 |
fungi | right, that's easier than learning to use quilt push/pop | 19:31 |
fungi | and apply | 19:31 |
fungi | clarkb: the debdiff tool can compare two source packages | 19:32 |
fungi | oh, though that's only for the filenames themselves | 19:32 |
clarkb | quilt pop worked for reverting | 19:33 |
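The by-hand quilt workflow clarkb describes boils down to a few commands; the patch filenames here are illustrative:

```console
$ cp 14491.patch 14492.patch debian/patches/
$ printf '14491.patch\n14492.patch\n' >> debian/patches/series
$ quilt push -a    # apply every patch in series order
$ quilt pop -a     # unwind them again to compare against the pristine tree
```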
fungi | debdiff manpage recommends diffoscope tool for deeper comparison of packages | 19:33 |
clarkb | and the diff looks good. I guess now I rerun the build and see if tests pass (if they don't I'm not sure what I can do next as I suspect it may be cranky about my kernel?) | 19:33 |
fungi | yeah, can't hurt, but you also might just try building the source package and pushing to lp and letting the ppa builder complain if it doesn't really build | 19:34 |
clarkb | ya but if it builds we'll start pulling it automatically in like 10 hours? | 19:35 |
clarkb | I'm just paranoid I will accidentally push an unhappy package (also I have to figure out signing still) | 19:36 |
fungi | ahh | 19:36 |
clarkb | do you know if lp will run those same tests on build? | 19:36 |
clarkb | its rebuilding with the patches applied now so should find out if the test failures are related to that soon I guess | 19:37 |
fungi | i believe it will run any autotests defined in the package build | 19:38 |
clarkb | cool. It also ran them automatically when I ran debuild | 19:39 |
clarkb | seems likely it would run them in lp too | 19:39 |
mordred | yah- the ppa builds should run the in-package tests | 19:40 |
clarkb | oh I bet I need to bump the package version too otherwise it won't upgrade? | 19:41 |
fungi | yeah, you can use the dch command or just add an entry to the debian/changelog file | 19:42 |
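A dch entry for this would look something like the following; the version string, uploader identity, and date are illustrative:

```
openafs (1.8.6-1ubuntu2~focal1) focal; urgency=medium

  * Apply upstream fixes for the January 2021 rx connection ID bug.

 -- Example Uploader <uploader@example.org>  Fri, 15 Jan 2021 20:00:00 +0000
```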
clarkb | also I'm told lunch will be ready shortly so I need to pop out for that (please don't rely on me to figure this out quickest, really going through it as a learning exercise and expecting people like fungi that understand debs better to get to the end first) | 19:42 |
fungi | you may still beat me to it, my chroot creation rabbit hole has turned up what i think may be a corrupt gnupg trustdb for apt-key | 19:43 |
fungi | can't create any ubuntu chroots until i solve why i'm getting "gpg: [don't know]: invalid packet (ctb=2d)" | 19:44 |
fungi | which is a marvellously clear error message, lemme tell you | 19:44 |
clarkb | alright that builds successfully now and creates a ton of .debs | 19:48 |
clarkb | the versioning stuff is really confusing to me. The changelogs for the ppa package are different for xenial and focal | 19:52 |
clarkb | but only in their versions | 19:52 |
clarkb | does this mean ianw uploaded version specific source packages for each one or is lp doing something smart? | 19:52 |
fungi | not sure if you can tell a ppa to backport a source package to another release, searching now | 19:54 |
mordred | clarkb: usually version specific source packages | 19:54 |
mordred | fungi: you can also do that in the LP user interface but there are times when it doesn't work awesomely | 19:55 |
clarkb | got it | 19:55 |
mordred | when I've done this before I've always created release-specific source packages - usually only different in the version | 19:55 |
clarkb | 1.8.6-1ubuntu2 is the version number I should use ya? | 19:55 |
clarkb | actually no the current version is 1.8.6-1ubuntu1~focal1 so it would be 1.8.6-1ubuntu1~focal2 ? | 19:55 |
fungi | clarkb: yeah, or you could make it 1.8.6-1ubuntu2~focal1 | 19:56 |
fungi | since you're adding patches which are not focal-specific that's probably technically more correct, but either should work | 19:56 |
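As a quick sanity check on that ordering, GNU `sort -V` treats `~` the same way dpkg does (it sorts before anything, including end of string), so it can approximate which candidate version apt would consider newest:

```shell
# '~' sorts before everything, so a ~focalN backport always sorts below
# the version it was derived from; bumping either component upgrades cleanly.
printf '%s\n' \
  '1.8.6-1ubuntu1~focal1' \
  '1.8.6-1ubuntu2~focal1' \
  '1.8.6-1ubuntu1~focal2' | sort -V
# -> 1.8.6-1ubuntu1~focal1, then 1.8.6-1ubuntu1~focal2, then 1.8.6-1ubuntu2~focal1
```

For the authoritative answer `dpkg --compare-versions` can be used on a Debian-family host.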
ianw | o/ | 19:56 |
clarkb | fungi: thanks | 19:57 |
clarkb | ianw: ohai https://lists.openafs.org/pipermail/openafs-info/2021-January/043013.html is what we are poking at | 19:57 |
fungi | there's also the fact that these are built from 1.8.6-1 in debian but now debian has a 1.8.6-4 which ubuntu has imported into its archives, may be worth taking a look at the changelog for that too | 19:57 |
clarkb | ianw: I'm learning me a debian right now the hard way but also need to pop out to lunch | 19:57 |
ianw | so we need new afs packages post haste? | 19:57 |
fungi | ianw: with upstream patches applied | 19:57 |
clarkb | ianw: yes with two upstream gerrit patches applied. Then we need to figure out how/if 1.6 on our openafs fileservers is affected | 19:57 |
fungi | not-yet-released | 19:57 |
ianw | hrm; clients not too hard as we pull from launchpad. the servers don't | 19:58 |
clarkb | ianw: ya and 1.6 doesn't have any patches yet I think. In that email I linked auristor says that investigation is necessary. One (potentially crazy) idea I threw out was the reason we haven't gone to 1.8 on the fileservers yet is that it requires a downtime iirc | 19:59 |
clarkb | but if this ends up forcing downtime maybe we do that with the patched version | 19:59 |
clarkb | ianw: also totally feel free to sort this out more quickly than me. I'm doing this as a good exercise to learn but doubt I'll be quickest | 19:59 |
ianw | :/ | 19:59 |
fungi | unrelated, but also weighing the possibility of switching static.o.o vhost configs to serve from the read-write path as we're unlikely to get the read-only replicas updating regularly again before some time next week | 20:00 |
clarkb | ianw: fwiw it almost sounds like unauthenticated client connections on 1.6 are expected to be affected but authenticated may not be? it is possible that our servers will be fine because they talk to each other auth'd? | 20:00 |
ianw | fungi: that seems very sane, iirc we've done that before in recovery situations | 20:01 |
clarkb | and its only our leaf nodes like the mirrors etc that are unauth'd | 20:01 |
fungi | executors use auth too as they're primarily write not read | 20:01 |
fungi | so static.o.o and the mirror servers are the main unauthed systems i guess | 20:01 |
clarkb | looks like there may be a third patch https://gerrit.openafs.org/14495 | 20:02 |
clarkb | which has been abandoned in favor of https://gerrit.openafs.org/14496 | 20:02 |
ianw | fungi: that one looks abandoned | 20:03 |
ianw | oh, what clarkb said :) | 20:03 |
clarkb | also people on the mailing list are reporting that 1.8.6 clients talking to 1.6 servers are failing even after being patched | 20:03 |
ianw | there's also centos to consider, but that's only for wheels | 20:03 |
clarkb | but they indicate 1.6 clients seem to be ok? | 20:03 |
clarkb | it sort of does seem like 1.6 might be less affected | 20:04 |
ianw | although we're all 1.8 clients -> 1.6 servers | 20:04 |
clarkb | correct | 20:04 |
clarkb | but the 1.6 servers also talk to each other aiui | 20:04 |
clarkb | just pointing out that the comms between servers may end up being ok (though I have no concrete assertion for that) | 20:04 |
ianw | ahh, although i guess that's kind of moot if no client can talk correctly to the servers :) | 20:05 |
clarkb | https://lists.openafs.org/pipermail/openafs-info/2021-January/043015.html | 20:06 |
fungi | turn back all the clocks to 2020? ;) | 20:06 |
fungi | i guess that won't work unless you can globally turn back the world | 20:06 |
fungi | and i don't think the world wants another 2020 | 20:06 |
ianw | maybe i should turn the clock back an hour and go back to bed :) | 20:06 |
clarkb | can I do that but for 5 hours? | 20:07 |
fungi | ianw: or forward 36 hours and start your weekend? | 20:07 |
clarkb | I do think we should try and be careful about testing this if we can. Like maybe we upload to another ppa or just copy a deb around and install it? | 20:07 |
clarkb | before we push to the normal ppa that servers will automatically update from | 20:07 |
clarkb | since that may restart services and such and then actually break us if the fix is not sufficient? | 20:08 |
clarkb | of course if things properly break then it will be moot at that point anyway | 20:08 |
clarkb | I have been told food is waiting for me. back in a bit | 20:08 |
auristor | there will not be patches for openafs 1.6 | 20:08 |
ianw | clarkb: should we spin up a common ubuntu/debian vm? | 20:08 |
clarkb | auristor: I didn't necessarily expect them, but it wasn't clear reading the mailing list stuff if it is suffering from the same problems | 20:08 |
clarkb | auristor: as at least one person indicates a 1.6 client can talk to 1.6 servers just fine? | 20:09 |
clarkb | ianw: ya maybe that is a good next step | 20:09 |
ianw | we want to be a bit careful with launchpad as it likes to only keep one version of the package | 20:09 |
auristor | its a related but different problem with all pre-1.8 openafs and non-AuriStorFS and non-Linux rxrpc clients | 20:09 |
clarkb | auristor: and does that related but different problem affect us if only the server side of afs is 1.6 ? | 20:10 |
auristor | The 1.6 issue is not as simple as "if you restart today it won't work" | 20:10 |
ianw | mnaser: i'd have to run that by infra-root, not sure we have a strict policy on keeping the old backups. maybe 6 months or so i guess? | 20:10 |
*** lpetrut has joined #opendev | 20:10 | |
auristor | the problems are always in the rx initiator. so client connections to fileserver / vlserver; fileserver to vlserver and cache managers; volserver to volserver, etc | 20:11 |
clarkb | and volserver to volserver etc is where our 1.6 side of things would be using the buggy rx initiator | 20:12 |
*** zoharm has quit IRC | 20:12 | |
clarkb | auristor: is https://gerrit.openafs.org/#/c/14496 expected to fix https://lists.openafs.org/pipermail/openafs-info/2021-January/043015.html ? | 20:14 |
clarkb | if so I guess we continue to focus on getting patched packages built, then limp along with 1.6 with its not 100% failure while we update all our 1.8 clients. Then sort out a 1.6 -> 1.8 upgrade | 20:14 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Temporarily serve static sites from AFS R+W vols https://review.opendev.org/c/opendev/system-config/+/770856 | 20:15 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Revert "Temporarily serve static sites from AFS R+W vols" https://review.opendev.org/c/opendev/system-config/+/770857 | 20:15 |
ianw | i'm just spinning up a common server to work on now | 20:17 |
clarkb | thanks. I need to pop out before my lunch gets cold | 20:17 |
clarkb | trying to start a new local build with that third patch just to see if it passes testing | 20:18 |
fungi | the good news is the fixes stack on the existing quilt patches in the source package without conflict and pass the autopkgtests | 20:18 |
ianw | np, do that :) | 20:18 |
fungi | at least the first two merged fixes do anyway | 20:18 |
clarkb | yup the third applies cleanly too | 20:19 |
*** slaweq has joined #opendev | 20:19 | |
clarkb | I just started a build with the third and will go eat now | 20:20 |
clarkb | (that will tell us if the tests pass) | 20:20 |
auristor | pretty much try not to restart any openafs clients or servers (except for patching 1.8) from now to the end of the month | 20:21 |
ianw | root@10.209.39.226 | 20:21 |
ianw | we don't tend to restart the afs servers ... but when it does happen it's usually not our choice and something wrong on the cloud provider side | 20:22 |
fungi | yup, so far it's been last week the provider decided they needed to live-migrate one server and it hung the kernel such that a hard reboot was the only option, then earlier this week they had a problem with a system serving one of the iscsi volumes which make up the lvm underlying our /vicepa on another afs server, and i had to restart it to fsck the fs and get it mounted writeable again | 20:26 |
fungi | it's not been a good couple of weeks stability-wise for that segment of our services | 20:27 |
*** slaweq has quit IRC | 20:27 | |
fungi | infra-root: anybody else feel strongly in favor or against switching to serving static site content from the rw path temporarily? https://review.opendev.org/770856 | 20:29 |
clarkb | I guess not | 20:34 |
*** sgw has joined #opendev | 20:35 | |
clarkb | ianw: thats a 10/8 address :) | 20:36 |
ianw | ok, how about 104.239.144.149 :) | 20:37 |
fungi | now that i can reach | 20:37 |
clarkb | I can hit it and need to reload my keys. Will do that once lunch is more properly finished | 20:38 |
ianw | i'm just putting 14491/2 in the patches; so far this is nothing unique over what anyone else has done right? | 20:38 |
ianw | i.e. we know that works | 20:38 |
clarkb | ianw: correct. Looks like my 14496 build succeeded locally | 20:39 |
clarkb | running the build without these patches failed for me locally | 20:39 |
clarkb | my rough setup was pull the source and deps, build clean, that failed. Apply patches, rebuild that succeeded | 20:40 |
fungi | ianw: in debian/patches and list them in debian/patches/series | 20:45 |
fungi | to make sure quilt will apply them | 20:45 |
ianw | yep, i'm running a build on that host in a screen too, just for sanity | 20:45 |
fungi | and then dch to add a new build version to the debian/changelog | 20:45 |
ianw | dpkg-source: info: applying 0011-14491.patch | 20:45 |
ianw | dpkg-source: info: applying 0012-14492.patch | 20:45 |
ianw | agree they apply clean | 20:45 |
ianw | fungi: there we differ, i tend to use the emacs mode :) | 20:46 |
fungi | ianw: sure, whatever works | 20:47 |
fungi | though if you set VISUAL or EDITOR to emacs then dch will respect that too | 20:48 |
ianw | so yeah, the only trick is the source needs to be signed by a key accepted by launchpad, and you do need to upload for each release | 20:48 |
ianw | there's no clicky button to build for different releases | 20:48 |
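The per-release upload ianw describes is a signed source-only build followed by one dput per target series; a sketch using dput's `ppa:` shorthand, with an illustrative key id and changes filename:

```console
$ debuild -S -sa -k0xDEADBEEF    # build and sign the source package
$ dput ppa:openstack-ci-core/openafs ../openafs_1.8.6-1ubuntu2~focal1_source.changes
```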
*** hashar has joined #opendev | 20:54 | |
ianw | looks like kaduk has pushed 1.8.6-5 to debian | 20:56 |
ianw | and there's talk of making 1.8.7 release | 20:56 |
clarkb | ianw: fungi before we push to lp we can install our built package on that server and then check it works? | 20:58 |
clarkb | then if it works figure out signing and push to lp? | 20:58 |
ianw | and jsbillings has already done rpm's @ https://copr.fedorainfracloud.org/coprs/jsbillings/openafs/package/openafs/ which we can import | 20:59 |
clarkb | ianw: we should make sure that they include the third fix too if we start using the upstream packaging stuff | 21:00 |
clarkb | at least I had completely missed the third change until very recently | 21:00 |
ianw | that being 14496? | 21:02 |
clarkb | yes | 21:02 |
ianw | https://gerrit.openafs.org/#/q/status:merged+project:openafs+branch:openafs-stable-1_8_x now has the backports of those 3 too | 21:02 |
ianw | i didn't have 14496 on the server, rebuilding now | 21:05 |
ianw | (the server being 104.239.144.149) | 21:06 |
*** lpetrut has quit IRC | 21:07 | |
fungi | clarkb: if you built binary packages from your patched source package, then yes you should be able to `apt install ./somepackage.deb ./otherpackage.deb` or use dpkg -i to do it for that matter | 21:09 |
clarkb | fungi: yes, except I'm in a ubuntu container on suse and my kernel is too new I think | 21:09 |
fungi | ahh | 21:09 |
clarkb | but on ianw's server we should be able to do that | 21:09 |
fungi | yeah, that will also exercise the dkms bits | 21:10 |
*** Alex_Gaynor has left #opendev | 21:10 | |
ianw | really need to add --parallel to the rules :/ | 21:10 |
clarkb | ianw: slight nit, I think fungi said the more correct package version was 1.8.6-1ubuntu2~focal1 not 1.8.6-1ubuntu1~focal2 | 21:10 |
clarkb | but both should work for now | 21:11 |
ianw | yeah i'm fairly tempted to take the 1.8.6-5 packages | 21:13 |
fungi | agreed, we could in theory just nab them from debian/sid | 21:14 |
fungi | once they land in the archive anyway | 21:14 |
ianw | we've only kept our own ppa because a) for a while there packages were way behind and we had much more bespoke work and b) iirc for whatever reason they weren't building for arm64 | 21:14 |
fungi | 1.8.6-5 is unlikely to end up in xenial at all, and may see a lengthy delay getting to bionic | 21:15 |
clarkb | when you say take the 1.8.6-5 you mean backport them to xenial and bionic from whatever is at 1.8.6-5? | 21:15 |
clarkb | fungi: ok that answers some of my question | 21:15 |
fungi | but we can grab the source packages and stuff them into our ppa, right | 21:15 |
ianw | clarkb: yeah, basically stuff those packages into our ppa | 21:15 |
fungi | even landing in focal proper will take some time because ubuntu is trickling those packages in from debian and probably not with significant urgency | 21:16 |
ianw | i imagine it just has a few more of those patches from the 1.8 stable branch | 21:16 |
fungi | yes, it will be just more stuff in the debian/patches dir | 21:16 |
ianw | ... which also all might be moot if there's a 1.8.7 release with all them too | 21:17 |
clarkb | https://salsa.debian.org/debian/openafs/-/blob/master/debian/patches/0014-Remove-overflow-check-from-update_nextCid.patch seems to show that debian's package has all three fixes in 1.8.6-5. The 0013 and 0012 patches are the other two | 21:17 |
clarkb | so ya that would probably work too assuming it also compiles for xenial and bionic and focal | 21:17 |
clarkb | (and I guess arm64) | 21:18 |
ianw | it *should* if no new build-deps, since we've been ok with all the prior versions | 21:18 |
ianw | ok, the .debs on 104.239.144.149 have all three patches. we can install them and try if we like | 21:19 |
ianw | what i'm still not clear on is if our 1.6 servers are just now broken and can't be fixed | 21:19 |
clarkb | ianw: should we install openafs-client on 104.239.144.149 first and see if we can navigate /afs/ ? | 21:19 |
fungi | since they're keeping the packaging in git on salsa, you can probably just look at the history there to see if anything about the package itself has changed other than the quilt patches | 21:20 |
clarkb | ianw: it sounds like 1.6 servers are broken in a similar way but not exactly the same and that means they aren't 100% broken like 1.8 is | 21:20 |
clarkb | ianw: and if we don't restart 1.6 servers we'll maybe be ok? | 21:20 |
clarkb | fungi: https://salsa.debian.org/debian/openafs/-/commit/be72605900a4820ce613a3c3b2bce372a203d2c6 no, that's it | 21:20 |
clarkb | at least in the last 2 months | 21:21 |
fungi | testing against our servers at the moment may be intractable because they're overrun with vos release activity in progress from yesterday and not accepting new rpc calls | 21:21 |
clarkb | there are other updates in the debain package we don't have in our ppa | 21:21 |
clarkb | fungi: well they should be able to do reads right? or are those slots also full up? | 21:21 |
fungi | you should be able to test that you can reach files, right | 21:21 |
fungi | i assumed you meant testing vos commands | 21:22 |
ianw | building modules now | 21:22 |
clarkb | ya I guess I don't actually know enough about afs to know how to sufficiently test this | 21:22 |
*** hamalq has joined #opendev | 21:22 | |
clarkb | the example on the mailing list seem to be primarily vos * commands | 21:23 |
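The vos-level checks being discussed on the list look roughly like the following; volume and server names are illustrative, and each of these opens a fresh connection, which is exactly the path that could trip the bug:

```console
$ vos status afs01.dfw.openstack.org -localauth   # transactions in progress on a server
$ vos examine mirror.ubuntu -localauth            # per-volume release state
$ vos release docs -localauth                     # the kind of call that has been failing
```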
clarkb | ianw: oh also at least for 1.8 it seems to be based on server start time | 21:24 |
clarkb | ianw: so 1.6 may not actually exhibit any problems until we restart | 21:24 |
clarkb | and then we'd be completely hosed potentially | 21:24 |
ianw | i think that's fine, but yeah, the restart may not be our choice and recent stability hasn't been reassuring in that regard | 21:25 |
fungi | looks like the other major changes in the newer package version are support for more recent linux kernels (5.8, 5.9) | 21:26 |
*** hashar has quit IRC | 21:26 | |
fungi | so yeah, honestly i would just clone https://salsa.debian.org/debian/openafs/ and debuild from that | 21:29 |
*** jrosser has quit IRC | 21:29 | |
*** ildikov has quit IRC | 21:30 | |
clarkb | lsmod shows openafs on ianw's server now | 21:30 |
ianw | # ls /afs/openstack.org/ | 21:30 |
ianw | developer-docs docs docs-old mirror project service user | 21:30 |
ianw | is promising | 21:30 |
*** ildikov has joined #opendev | 21:31 | |
*** jrosser has joined #opendev | 21:32 | |
ianw | a "find . -type f -exec md5sum {} \;" stress test is seeming good | 21:32 |
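That stress test can be reproduced against any tree; a self-contained version using a scratch directory as a stand-in for /afs/openstack.org:

```shell
# Hash every file under a tree, as in the /afs read stress test above.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/index.html"
find "$tmp" -type f -exec md5sum {} \;   # one md5 line per file
rm -rf "$tmp"
```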
clarkb | do we try to do some more admin type commands next? | 21:32 |
clarkb | though as noted before those might hang because of full up slots for all those releases? | 21:32 |
ianw | also i haven't setup the kerberosy things | 21:33 |
clarkb | I think we document how to do that via command line switches without setting up default domains and all that | 21:33 |
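Getting an authenticated token purely from the command line, without a configured default realm, is roughly the kinit-then-aklog pair below; the principal, cell, and realm names are illustrative:

```console
$ kinit someuser.admin@OPENSTACK.ORG       # krb5 ticket, realm given explicitly
$ aklog -c openstack.org -k OPENSTACK.ORG  # exchange it for an AFS token
$ tokens                                   # verify the token was issued
```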
clarkb | but ya it seems like at least for reads this is working (which is going to be what the mirrors and static are all doing, zuul will do writes too) | 21:34 |
*** fp4 has quit IRC | 21:37 | |
ianw | ok, i have 1.8.6-5 focal/bionic/xenial packages i can upload to the ppa to build if we like | 21:40 |
clarkb | ianw: do we want to build the focal one on your test server first? or treat the testing of the three patches we've already done as good enough? | 21:41 |
clarkb | but I agree using 1.8.6-5 seems like a good option rather than maintaining an older fork | 21:41 |
ianw | the only two extra patches that has over what we just tested are for 5.8 & 5.9 kernels | 21:42 |
clarkb | probably pretty safe to push as is then | 21:42 |
ianw | sorry there is also one other for afsmonitor, a tool i don't think we use | 21:43 |
ianw | if nobody has any other opinions, i'm happy to dput these and get the ppa building them | 21:47 |
clarkb | I think that sounds like a reasonable next step. fungi ^ | 21:47 |
*** diablo_rojo has quit IRC | 21:48 | |
clarkb | looks like jproulx was hit by this too | 22:00 |
clarkb | maybe jproulx has hints on the 1.6 -> 1.8 upgrade | 22:00 |
*** fp4 has joined #opendev | 22:01 | |
mnaser | does anyone know if there's some sort of setting i'm missing, our ci is not reporting to zuul-jobs, ijust generated a new set of http credentials and im getting a 403 when its reporting | 22:21 |
mnaser | https://www.irccloud.com/pastebin/n3ZCIyP5/ | 22:22 |
mnaser | ssh streaming works because it does get enqueued | 22:22 |
clarkb | mnaser: basic auth was the only thing we changed for opendev zuul after the gerrit upgrade | 22:23 |
mnaser | yeah thats why i changed auth_type=basic | 22:23 |
clarkb | mnaser: however we just dropped the digest setting and let it default to basic | 22:23 |
clarkb | maybe explicitly setting it to basic doesn't work? | 22:23 |
mnaser | ok, dropping the basic and trying again | 22:24 |
fungi | sorry, dinner pulled me away, catching back up | 22:33 |
fungi | ianw: clarkb: dput them, yes please | 22:33 |
clarkb | fungi: I think ianw just did that | 22:34 |
ianw | yep, hitting some backport package errors with debhelper-compat packages | 22:37 |
clarkb | ianw: did you only push xenial? I wonder if the other two have the right compat level | 22:37 |
ianw | yeah, i've pushed bionic and we'll see if that works | 22:38 |
ianw | sbuild-build-depends-openafs-dummy : Depends: debhelper-compat (= 12) | 22:39 |
ianw | nope on bionic. i feel like i just dropped this previously and they built ok anyway | 22:39 |
clarkb | ianw: ya looking at the ppa debian/control files there doesn't seem to be a debhelper-compat listed but the debian/control file in the upstream salsa repo has it | 22:41 |
clarkb | (so I assume that means you dropped it in those other ones) | 22:41 |
*** DSpider has quit IRC | 22:42 | |
clarkb | https://manpages.debian.org/testing/debhelper/debhelper.7.en.html#Supported_compatibility_levels has more info | 22:42 |
fungi | good point, dh-compat needs to list a dh version available in the target distro release | 22:43 |
ianw | i think i hacked it back to 9 and it "just worked" | 22:43 |
clarkb | v9 seems to be what xenial has if you want to just switch it to 9 | 22:43 |
ianw | i also think i very stupidly didn't note that in the changelog | 22:43 |
fungi | it probably will if the package isn't relying on newer dh features | 22:43 |
clarkb | ianw: debhelper (>= 9.20160114~) is what I see in the xenial and focal control files from our ppa | 22:44 |
ianw | yep, that is what i put in | 22:44 |
clarkb | fungi: it must because it worked before and sounds like the delta between our current package and the upstream package is all in the patching side? | 22:45 |
clarkb | or maybe people were only looking at the patch dir? | 22:45 |
fungi | clarkb: yes, i agree with that logic | 22:45 |
ianw | ok, focal looks like it's building ok | 22:46 |
*** fp4 has quit IRC | 22:46 | |
fungi | git diff --stat debian/1.8.6-1..HEAD | 22:47 |
fungi | Standards-Version was increased from 4.1.3 to 4.5.0 in debian/control | 22:48 |
fungi | some make var assignment methods were tweaked in debian/rules | 22:49 |
fungi | the debian/watch file was updated to track upstream source via https instead of http | 22:50 |
fungi | otherwise, just more quilt patches | 22:50 |
fungi | i don't see any changes which would have impacted debhelper use | 22:50 |
clarkb | cool | 22:51 |
ianw | i just have to wait a bit for the old failed ones to delete | 22:51 |
ianw | well that's brilliant | 22:54 |
ianw | deleting the failed builds appears to have also removed the prior good xenial and bionic builds | 22:54 |
ianw | no sorry, i see. the old builds have been superseded by the failed builds | 22:55 |
fungi | good news! i got a response from a vos status call finally | 22:56 |
clarkb | ianw: I don't see xenial and bionic at https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs/+packages anymore fwiw | 22:56 |
clarkb | they are still at http://ppa.launchpad.net/openstack-ci-core/openafs/ubuntu/pool/main/o/openafs/ though | 22:56 |
clarkb | so maybe a weird UI thing? | 22:56 |
ianw | clarkb: yeah, they're not listed as their status is "Superseded" | 22:57 |
ianw | i should probably upload xenial/bionic as "2" with the debhelper revert explicitly | 22:58 |
clarkb | wfm | 22:58 |
*** redrobot1 has joined #opendev | 22:58 | |
*** redrobot has quit IRC | 23:02 | |
*** redrobot1 is now known as redrobot | 23:02 | |
clarkb | ianw: looks like bionic builds just started | 23:05 |
ianw | yep, so that 2 package just has the debhelper manually reverted to 9, no other changes | 23:06 |
fungi | sounds fine to me | 23:09 |
ianw | gosh i will be happy to have xenial gone | 23:12 |
ianw | btw we have 96gb of ram available in the rax ci tenant | 23:12 |
fungi | good to know | 23:12 |
fungi | sounds like enough for a gerrit replacement in that case | 23:13 |
ianw | yeah, enough for a 60g replacement for a little | 23:13 |
ianw | we've got ticks on the bionic x86-64/i386 packages; that's good | 23:25 |
ianw | fungi: is it reliably returning? | 23:26 |
ianw | fungi: thinking about upgrades for the server -- i think we're in enough of a pickle to manually get to 1.8.6, and then prioritise actual ansible etc. for replacement non-xenial servers | 23:27 |
fungi | ianw: no, i got a call in between some vos release completing and another kicking off, i think | 23:27 |
ianw | my thought was the best thing to do is to probably vos dump the important volumes before attempting such a thing | 23:27 |
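The vos dump approach ianw suggests can be sketched as below. This is untested here; the volume name and output path are illustrative, and `-time 0` requests a full (rather than incremental) dump:

```shell
# Sketch (not run here): full dump of an important volume before the
# in-place upgrade. Volume name and paths are illustrative.
vos dump -id project.tarballs -time 0 \
    -file /opt/backups/project.tarballs.dump -localauth

# If the upgrade goes badly, the dump can be restored, possibly onto a
# different server/partition:
# vos restore -server afs01.dfw -partition vicepa \
#     -name project.tarballs -file /opt/backups/project.tarballs.dump -localauth
```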
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 23:28 |
fungi | so i went and held locks for all the remaining mirror updates now to hopefully prevent any additional vos release calls which are waiting from actually getting serviced, not sure if it will help | 23:28 |
ianw | the other options seem to be an openstack-side snapshot of the volumes attached to the server, or posix-level rsync-type copies of vicepa somewhere | 23:28 |
fungi | we could add another cinder volume as a pv in that vg and then make an lvm snapshot | 23:29 |
fungi | we probably don't have sufficient available extents on the current pvs to comfortably snapshot the volume | 23:30 |
fungi | since lvm snapshots are essentially cow we shouldn't need enough room for a full extra copy that way | 23:30 |
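fungi's plan (new cinder volume as a PV, then a copy-on-write snapshot) might look like the following untested sketch; the device name, VG/LV names, and snapshot size are all illustrative:

```shell
# Sketch (not run here): grow the VG with a freshly attached cinder
# volume, then take a COW snapshot of the vicepa LV.
pvcreate /dev/xvdc                # new cinder volume becomes a PV
vgextend main /dev/xvdc           # add it to the existing VG

# The snapshot only needs room for blocks that change while it exists
# (COW), not a full copy of the origin LV.
lvcreate --snapshot --size 50G --name vicepa-snap main/vicepa

# After a failed upgrade, the origin can be rolled back:
# lvconvert --merge main/vicepa-snap
```

Since only changed extents consume snapshot space, the 50G here is a guess at how much churn the volume sees during the upgrade window, not the size of vicepa itself.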
ianw | yeah, i think we waste a lot of time though backing up the mirrors | 23:30 |
clarkb | ya the mirrors are like 95% of the disk use | 23:31 |
fungi | lvm snapshot is instantaneous | 23:31 |
fungi | (effectively) | 23:31 |
clarkb | ah til | 23:31 |
ianw | oh good point, yeah cow | 23:31 |
ianw | i mentioned this in #openafs | 23:32 |
clarkb | those servers appear to still be puppeted fwiw | 23:32 |
ianw | i've been pointed at akeyconvert | 23:32 |
fungi | see manpage for lvcreate if you're interested in details | 23:32 |
ianw | yes, we'll need to ansiblise this. it feels like we should do that and recreate them as focal nodes | 23:32 |
clarkb | and puppet == xenial for us | 23:32 |
clarkb | ianw: note I believe we did the last server upgrades in place via do-release-upgrade (or whatever the ubuntu command is) because afs doesn't like ip addresses changing | 23:33 |
clarkb | that said I agree getting all the way up to focal for a lot of things sounds like a great idea :) | 23:33 |
fungi | our kerberos servers also probably fall into the category of related infrastructure we'll want to upgrade around the same time | 23:34 |
ianw | http://manpages.ubuntu.com/manpages/bionic/man8/akeyconvert.8.html | 23:34 |
ianw | The akeyconvert command is used when upgrading an AFS cell from the 1.6.x release series | 23:34 |
ianw | to the 1.8.x release series. | 23:34 |
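Per that manpage, the key migration step on each upgraded server would be roughly the following untested sketch (exact flags should be checked against the installed version):

```shell
# Sketch (not run here): after upgrading a server from 1.6.x to 1.8.x,
# akeyconvert copies the cell keys from the old KeyFile/rxkad.keytab
# into the new KeyFileExt format.
akeyconvert

# Then confirm the keys are visible to the 1.8 tools:
asetkey list
```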
clarkb | one thing to keep in mind is that bionic is supposedly getting 10 years of support. I don't know if that is true for focal too | 23:37 |
clarkb | but if not it may make sense to consider whether sitting in place on bionic for the remaining 7 years or whatever it is is a better option | 23:38 |
openstackgerrit | Martin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker https://review.opendev.org/c/opendev/system-config/+/705258 | 23:39 |
ianw | yeah, if we're going to replace i'd say focal -- i mean already the packages don't build on bionic | 23:39 |
clarkb | that's a good point | 23:39 |
clarkb | looks like all LTS are 10 years now | 23:41 |
clarkb | sorry not all | 23:41 |
clarkb | bionic and newer | 23:41 |
clarkb | ianw: fungi I think the packages are all done now except for arm64 | 23:44 |
clarkb | do we want to install the new package on a node and see if it works? | 23:44 |
ianw | we can upgrade a mirror node, yep | 23:45 |
clarkb | note the ml seems to indicate that some sort of forceful restart of things may be required (lots of people were just doing reboots) | 23:46 |
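Trial-upgrading one mirror node might look like this untested sketch; the package set is an assumption (the usual Ubuntu openafs client packages), and the forced restart reflects the mailing-list reports that a plain upgrade isn't always enough:

```shell
# Sketch (not run here): pull the new 1.8.6 packages from the PPA onto
# one mirror node and cleanly restart the client.
apt-get update
apt-get install --only-upgrade \
    openafs-client openafs-krb5 openafs-modules-dkms

# Per the ML, a full restart of the client (or a reboot) may be needed
# for the new kernel module to take effect.
systemctl restart openafs-client || reboot
```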
ianw | starting to track things in https://etherpad.opendev.org/p/infra-openafs-1.8 | 23:47 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 23:49 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!