Thursday, 2021-06-17

ianwand i need to fix the base deployment issues00:00
Ramerethianw: I've been trying to catch those instances and clean them up. Do we have some right now?00:01
ianwi'm seeing 5 in ERROR status from our client00:01
ianw fault                       | {'message': 'libvirtError', 'code': 500, 'created': '2021-06-16T23:56:47Z'}00:02
ianwlooks like they've all got libvirt error00:03
ianwfungi: in openstackzuul-linaro-us  i only see active servers?00:10
Ramerethyup, more qemu-kvm processes in a defunct state :/00:16
RamerethI wish I knew why that keeps happening. I don't see this on our ppc or x86 clusters00:16
Ramerethi'll have to work on cleaning those up tomorrow as I'm about to head out for the day00:21
opendevreviewMerged opendev/system-config master: openafs-client: add service timeout override
ianwRamereth: no problems.  i'll likely re-enable osuosl once ^ deploys00:22
fungiianw: nodepool list when run on nl03 at least shows 15 nodes in a deleting state for linaro-us for me00:22
ianwahhh, so ZK is out of sync with the cloud i guess00:23
ianwi was looking via openstack client00:23
fungicould be, yeah00:23
fungipossible these are cluttering up zk and so nodepool thinks they're occupying quota?00:24
ianwthat could be true.  i can delete the nodes by hand00:24
ianw(the ZK nodes ... too much overloading of node :)00:24
fungiwe've got quite the backlog of queued arm64 builds and linaro-us now has only deleting and two ready nodes which are over a day old00:25
Ramerethianw: how long until that deploys? Maybe I should go ahead and do it now00:26
funginot sure why we'd have a ready ubuntu-focal-arm64 and ubuntu-bionic-arm64 node in linaro-us for more than a day00:26
fungiwould have expected those to get used sooner, but maybe all the arm64 testing is centos/fedora00:26
ianwRamereth: probably an hour, but then i also have to merge the change to re-enable.  i think it can wait00:26
ianwhrm, system-config at a minimum should slurp those up you'd think00:27
ianwbut most of the testing is tox-ish things as well, that i think would all be focal00:27
ianwok, i'm into the zk shell just correlating now00:31
ianwfungi: hrm, i don't see lots of deleted nodes.  i wonder if we're getting launch failures00:33
ianwno i tell a lie, sorry, they don't have images attached so weren't matching arm64 grep00:34
ianwok, i've cleared the ZK nodes, nodepool list is no longer showing them00:39
ianw2021-06-17 00:39:51,268 DEBUG nodepool.PoolWorker.linaro-us-regionone-main: Active requests: []00:40
ianwit doesn't feel like it thinks it has anything to do00:40
ianwi might have to restart nl03, i feel like i've seen this before00:41
ianw2021-06-17 00:43:31,423 DEBUG nodepool.PoolWorker.linaro-us-regionone-main: Active requests: ['300-0014437271', '300-0014440634', '300-0014440637', '300-0014441811', '300-0014455060', '300-0014455555', '300-0014437862', '300-0014437863', '300-0014454786', '300-0014455047', '300-0014462863', '300-0014462864', '300-0014454788', '300-0014454789', '300-0014455322', '300-0014455325', '300-0014463009', '300-0014463010', '300-0014454915', '300-0014409073', 00:43
ianw'300-0014422609', '300-0014433950', '300-0014434049']00:43
ianwthat's more like it00:43
fungiahh, excellent, thanks00:43
ianwwe now have 25 servers building in linaro00:44
fungiyes, that sounds like what we should normally see00:45
ianwi think the fact that nodepool didn't notice those VM's were gone *and* thought that there were no requests despite the long queue are related.  i'm not sure how though00:50
Ramerethianw: alright, I cleared those out. Heading out for real now00:50
ianwRamereth: thank you!00:51
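The mismatch ianw and fungi worked through above (node records still held in ZooKeeper for servers the cloud no longer reports) can be sketched as a set difference. This is a minimal illustration, not nodepool's own code: `NodeRecord`, `find_stale_records`, and the sample IDs are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical stand-in for one node record held in ZooKeeper."""
    node_id: str
    provider: str
    state: str
    external_id: str  # server ID the record points at in the cloud


def find_stale_records(records, cloud_server_ids, provider):
    """Return records for `provider` whose server is absent from the cloud listing."""
    live = set(cloud_server_ids)
    return [
        r for r in records
        if r.provider == provider and r.external_id not in live
    ]


if __name__ == "__main__":
    # Made-up sample data: one record points at a server the cloud no longer has.
    records = [
        NodeRecord("0000000001", "linaro-us", "deleting", "uuid-a"),
        NodeRecord("0000000002", "linaro-us", "ready", "uuid-b"),
        NodeRecord("0000000003", "other-cloud", "deleting", "uuid-c"),
    ]
    for r in find_stale_records(records, ["uuid-b"], "linaro-us"):
        print(r.node_id, r.state)
```

Comparing both listings before deleting anything by hand avoids removing a record whose server is merely slow to appear in the cloud API.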
ianwi've enabled the openafs service and am rebooting the osuosl mirror.  the timeout has applied "A start job is running for OpenAFS client (1min 2s / 8min 8s)"01:07
ianw2min 54s / 8min 8s and it's up01:09
opendevreviewMerged openstack/project-config master: Revert "Disable the osuosl arm64 cloud"
ianwi think the hosts missing the EFI mounts are limited01:42
ianwand  since we're not starting bionic nodes, i think i'll just fix this by hand01:43
melwitt converts blueprint integration in jeepyb to the gerrit API if anyone interested. here's the system-config change to re-enable blueprint integration for patchset-created
*** ykarel|away is now known as ykarel04:14
*** zbr is now known as Guest248805:03
opendevreviewchandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job
opendevreviewDr. Jens Harbott proposed opendev/git-review master: Fix nodeset selections for zuul jobs
*** marios is now known as marios|ruck05:44
*** raukadah is now known as chandankumar05:46
opendevreviewMerged opendev/system-config master: bridge: upgrade to Ansible 4.0.0
opendevreviewFlorian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts
*** rpittau|afk is now known as rpittau07:22
*** jpena|off is now known as jpena07:31
*** raukadah is now known as chandankumar08:28
hjensasHi, the job at the top of the queue for TripleO, it's been sitting there with that on tripleo-ci-centos-8-undercloud-upgrade-victoria "queued" for 12ish hours. See:
hjensasAlmost all the jobs behind it in the queue have finished, but with the job at the top of the queue stuck the whole queue seems stuck? Any chance of poking so that the job in "queued" is started?08:36
*** ykarel is now known as ykarel|lunch08:38
hjensas#opendev anyone around who can take a look at TripleO's stuck gate queue?
*** ykarel|lunch is now known as ykarel09:45
fricklerhjensas: for a fast workaround, you could likely drop that patch from the queue. not sure we'd get to any further debugging before the event next week, but for that you'd likely have to wait a couple of hours for corvus to show up09:46
hjensasfrickler: yeah, we may just have to abandon that patch and restore it. Any idea when corvus usually shows up?10:02
hjensasfyi, TripleO abandoned the blocking change, and restored it to get the queue unstuck.10:09
*** jpena is now known as jpena|lunch11:34
*** amoralej is now known as amoralej|lunch12:00
*** whayutin is now known as weshay|ruck12:10
opendevreviewAnanya Banerjee proposed opendev/elastic-recheck master: Run elastic-recheck container
opendevreviewAnanya Banerjee proposed opendev/elastic-recheck master: Run elastic-recheck container
*** jpena|lunch is now known as jpena12:32
fungihjensas: frickler: it does look more generally like there may be some stuck node requests across the board though... for example an openstack/ovsdbapp has been sitting in check for 137 hours waiting for nodes for some of its unit and static analysis jobs12:33
fungitrying to track those down now and probably restart launchers to free any of the locks they're holding on those node requests12:34
hjensasfungi: in case you didn't see, we abandoned the stuck change in the tripleo queue. So it is resolved there, but thanks for investigating! :)12:47
fungihjensas: yep, thanks, i've got plenty of other candidates to serve as examples this time12:48
fungihere's the node request i'm hunting down:12:48
fungi2021-06-11 19:39:33,602 DEBUG zuul.Pipeline.openstack.check: [e: 08ab26431732453d87d673e2aaaf138e] Adding node request <NodeRequest 300-0014404756 <NodeSet ubuntu-focal [<Node None ('ubuntu-focal',):ubuntu-focal>]>> for job openstack-tox-lower-constraints to item <QueueItem af28c45f00f441238e6bbf3099a4a98c for <Change 0x7fe02f02eeb0 openstack/ovsdbapp 795892,1> in check>12:48
fungilooks like it was taken by nl04:12:51
fungi2021-06-11 19:40:43,357 INFO nodepool.NodeLauncher: [e: 08ab26431732453d87d673e2aaaf138e] [node_request: 300-0014404756] [node: 0025081525] Node is ready12:52
fungibut it never unlocked the request12:52
fungiso same symptoms we've been finding in other cases12:52
fungii wonder if a thread dump will prove useful, i'll trigger one12:53
*** amoralej|lunch is now known as amoralej13:02
fungiokay, i've sent 12 (sigusr2) to the child launcher process twice now roughly a minute apart and the dumps are in /var/log/nodepool/launcher-debug.log at 2021-06-17 13:11:43,516-13:12:55,30413:13
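As a generic illustration of the step fungi describes (sending SIGUSR2 to a process twice, some time apart), here is a small Python sketch. It signals its own process and only counts deliveries; what the launcher actually does on SIGUSR2 is not shown here, and `on_sigusr2` and `send_twice` are hypothetical helper names. Using the symbolic `signal.SIGUSR2` rather than the literal 12 matters because the number differs between platforms.

```python
import os
import signal
import time

received = []


def on_sigusr2(signum, frame):
    """Record each delivery; a real daemon might dump thread stacks here."""
    received.append(signum)


def send_twice(pid, interval_seconds):
    """Send SIGUSR2 to `pid` twice, pausing between the two deliveries."""
    os.kill(pid, signal.SIGUSR2)
    time.sleep(interval_seconds)
    os.kill(pid, signal.SIGUSR2)


if __name__ == "__main__":
    signal.signal(signal.SIGUSR2, on_sigusr2)
    # Signal this process for demonstration; the log above used roughly a minute.
    send_twice(os.getpid(), interval_seconds=0.1)
    time.sleep(0.1)  # give the Python-level handler a chance to run
    print(len(received))
```

Two dumps taken some time apart make it possible to tell a thread that is genuinely stuck from one that was only briefly busy.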
fungii'm going to restart the container now to release the locks on those node requests13:14
fungi#status log Restarted the nodepool-launcher container on to release stale node request locks13:15
opendevstatusfungi: finished logging13:15
fungiand now i'm seeing many of the stuck builds transition from queued to running13:17
fungicorvus: probably not particularly urgent as it's not that debilitating and we've been seeing it for months off and on so no idea when it started, but do you have any good suggestions for how to try to track this down?13:34
fungimost of the time it seems to happen amid flurries of node launch failures, though that perception could also be selection bias on my part13:35
fungisometimes it's a decline after three launch failures, sometimes it's a successful node launch, but the commonality is that it either doesn't get communicated back to the scheduler or the scheduler loses the event somehow13:37
funginot entirely sure which13:37
fungialso the trimmed up thread dumps are now in
corvusfungi: which nl has that debug log ^?13:38
corvusheh :)13:38
fungiif catching a launcher actively in this state would help, i can avoid restarting next time it comes up13:41
fungibut generally when it happens it's blocking more than just a few changes13:41
*** marios|ruck is now known as marios|call14:02
corvusfungi: there was a zookeeper connection suspension between when the server finished booting and when nodepool should have marked the request fulfilled and unlocked the nodes.  it was only a suspension, which means that it should have been able to pick up without losing anything.  additionally, that node request and several others all disappeared from the list of current requests without a log entry; that's not supposed to be possible.14:18
corvusi don't think keeping the launcher in that state would provide more info.  i think that's enough clues to figure out what debug info we're missing14:19
fungithanks, and i don't recall seeing it prior to maybe february, but i suppose it could have been lurking there for as long as we've been handling node requests via zk... when we're regularly restarting the launchers it tends to just solve itself14:21
*** marios|call is now known as marios|ruck14:44
*** ykarel is now known as ykarel|away15:10
*** ysandeep is now known as ysandeep|out15:33
clarkblooks like the osuosl mirror got sorted out, that is great news15:34
clarkbfungi: re gerrit yes I should page that back in then ask for review on data that I'm back up to speed with. I'll see if I can get to that today or tomorrow15:34
*** rpittau is now known as rpittau|afk16:09
*** sshnaidm is now known as sshnaidm|afk16:16
*** marios|ruck is now known as marios|out16:28
clarkbI'm looking into why the LE certs didn't refresh after 60 days as expected for the names we got emails about16:34
clarkbit appears that failed for some reason in the last couple of LE playbook runs which meant the certs haven't updated anywhere.16:34
clarkblooks like a full /opt on nb03 is causing that.16:34
*** amoralej is now known as amoralej|off16:35
*** jpena is now known as jpena|off16:36
clarkb/opt/nodepool_dib (where we store the images that get uploaded) is only about 1/3 of the disk use. This implies we've leaked a bunch in the dib_tmp dir. I'll stop the daemon, clear out dib_tmp, then restart things16:38
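The comparison clarkb makes here (how much of a full disk each directory accounts for) can be sketched with the standard library alone. This is an illustrative helper, not a tool used on nb03; `dir_size_bytes` and `usage_by_subdir` are hypothetical names, and the figures are apparent file sizes rather than allocated disk blocks.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.lstat(full).st_size
            except OSError:
                # File vanished or is unreadable; ignore it for the estimate.
                pass
    return total


def usage_by_subdir(parent):
    """Map each immediate subdirectory of `parent` to its total size in bytes."""
    return {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(parent)
        if entry.is_dir(follow_symlinks=False)
    }
```

Calling `usage_by_subdir("/opt")` on a builder would return a dictionary that can be sorted by value to see which directory dominates.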
opendevreviewClark Boylan proposed opendev/system-config master: Add note about afs01's mirror-update vos releases to docs
clarkbinfra-root ^ that is an update to the openafs docs based on what I discovered the hard way earlier this week :)16:54
fricklerclarkb: fungi: I tried to fix the nodeset selection for git-review jobs, which I think succeeded, but now there are some issues with the test instance of gerrit not starting, see 17:07
fricklerseems to be some kind of race condition or similar, as it hits only one out of 5 jobs17:08
clarkbagreed and seems to have only hit 8 tests in that job17:10
clarkbwe don't seem to collect the gerrit startup logs in that case to see what went wrong though17:11
clarkbre next step it is probably modifying the test suite to grab the gerrit error log file to see why it breaks17:12
fungii thought we did collect the gerrit logs17:13
clarkbfungi: I don't see them on that job or in the job-output.txt17:14
rosmaitawhen someone has a few minutes, i'm trying to make sure a job has python 3.6 available, but the change i made to .zuul.yaml on this patch isn't working, it's still reporting "InterpreterNotFound: python3.6":
rosmaitai'm using "python_version" in the job vars, do i need something else?17:31
clarkbrosmaita: you have to run the job on a platform that has the python version you want17:31
fungimake sure it's ubuntu-bionic17:31
clarkbI expect that job is running on focal which does not have python3.6. Bionic is probably what you want17:32
fungiif you don't specify a nodeset, right now you get ubuntu-focal which is too new17:32
rosmaitaoh, ok17:32
rosmaitaand i guess python_version is still a good idea?17:32
fungiour abstract py36 jobs set the nodeset so that child jobs will inherit that, but if you're creating a job which doesn't inherit then you need a nodeset which has the version of python you need or you need to specify an alternative means of installing it17:33
rosmaitaok, thanks, will revise my patch17:34
fungisetting python_version is fine for jobs which switch on that, though it's more useful for selecting a non-default interpreter on whatever node label you're running against17:34
fungie.g. ubuntu-focal defaults to python 3.8 but has 3.9 packages available, so you might want to use python_version to tell the job to use 3.9 on focal17:35
rosmaitathat's helpful, i understand this better now17:36
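The relationship between node label and `python_version` explained above can be modelled in a few lines. This is only a toy model of the reasoning, not Zuul's or tox's logic: the table holds just the versions mentioned in the conversation, and `labels_with` and `resolve` are hypothetical names.

```python
# Interpreter availability as stated in the conversation; deliberately
# incomplete. The first entry for each label is treated as its default.
PYTHONS_BY_LABEL = {
    "ubuntu-bionic": ["3.6"],
    "ubuntu-focal": ["3.8", "3.9"],
}


def labels_with(python_version):
    """Return the node labels in the table that offer `python_version`."""
    return [
        label for label, versions in PYTHONS_BY_LABEL.items()
        if python_version in versions
    ]


def resolve(label, python_version=None):
    """Return the interpreter version a job would use on `label`.

    Without `python_version` the label's default is used; asking for a
    version the label lacks raises LookupError, which mirrors the
    "InterpreterNotFound" failure reported above.
    """
    versions = PYTHONS_BY_LABEL[label]
    if python_version is None:
        return versions[0]
    if python_version not in versions:
        raise LookupError(f"python{python_version} not available on {label}")
    return python_version
```

In this model `python_version` selects among what a label already offers, so a job that needs 3.6 has to change the nodeset rather than the variable.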
clarkbcorvus: reading the matrix spec and one thing that stands out to me is that the homeserver is responsible for maintaining scrollback (until the end of time I guess?). While this makes sense I wonder what sort of storage needs are required? it's basically the entire channel log with a pointer for each user's client indicating where they were last caught up?17:43
clarkbcorvus: any idea how that scales over time? do we need a potentially ever growing data store for this information (not that channel logs tend to be large but trying to understand what is involved there)17:44
fungiclarkb: if it helps, the channel logs for most of the channels we'd be talking about, as recorded by the meetbot, presently occupy 17gb on disk and this includes the htmlified copies as well17:48
fungitechnically that's ever-growing but we've never really come close to running out of space for it17:49
clarkbfungi: that is good info. Probably a decent estimate for matrix disk needs (though I expect matrix adds significantly more metadata, the order of magnitude is probably similar)17:49
corvusclarkb: that's my understanding -- at least for the users on that homeserver that are in that room.  of course, we will almost certainly have at least one "user" in each room on our homeserver in the form of a bot.17:51
fungithough i agree, having someone with a homeserver which has channel history provide an estimate for how much room logging needs by comparison to our meetbot logging might be good17:51
fungilike x% the size of a plain text log for the equivalent timeframe17:51
clarkbanother thing that comes to mind is how upgrades are done. Can those be done without taking chat down?17:52
clarkbI suspect we'd be required to run multiple homeservers to accomplish that?17:52
corvusclarkb: it's possible there are ways to reduce or truncate the data since our bots wouldn't need history.  however, it seems likely that we might want the homeserver to provide history as a service for new folks connecting.  so i think we should expect to keep indefinite history, not as a technical requirement, but as a good service.17:52
clarkbscrollback is one of the key features people always talk about so keeping that around makes sense to me17:53
corvusclarkb: synapse is a python program with a database; i'd expect minimal downtime between upgrades.  remember that users won't see that downtime, except maybe in just a little bit of extra lag, as their homeserver queues updates.17:53
corvusclarkb: i have stopped and restarted my homeserver several times during conversations with you :)17:54
JayFfungi: holy @#$%, I've been using `skipdist` in about every tox file I've created since the beginning of time. I was wondering why it always seemed so ineffective17:54
JayFthank you for that email :D17:54
corvusclarkb: i have no idea how accurate this is:
clarkbcorvus: ya I guess the impact would be due to prolonged outages, short outages for updates should go unnoticed17:54
corvusclarkb: but it is certainly a thing you can put numbers in and get other numbers out of :)17:54
fungiJayF: yw17:55
corvusclarkb: i'd have to check the actual Matrix Specification for this, but it may be the case that only bots and new users would notice homeserver downtime.17:56
corvus(given that we're talking of only using our homeserver for bots)17:56
clarkbnew users because they wouldn't be able to join the channels in that moment?17:57
corvusclarkb: the high-level thing under "how does it work" at leads me to believe that's the case17:59
corvusoh we should steal that "next" animation idea for zuul17:59
corvusessentially, our bots would get an eventually-consistent view of the room after the homeserver came back up18:01
clarkbI've brought the subject up over at the foundation. Pointed out the oddity in potential pricing for element hosted homeserver for us and asked if there was any reason to not reach out to them now and start a conversation to understand what element can do better. I also mentioned corvus  and mordred are willing to meet up and talk about it in more depth18:02
corvus(also, fwiw, i think new users could technically join the room if any other homeserver involved chose to publish it at an alternate address)18:02
clarkbI'm also trying to build a model in my head for what hosting a homeserver looks like if we do it ourselves. Seems like we should expect some disk and network bw.18:03
clarkbAnd for upgrades maybe we can have docker compose simply update to the latest image constantly?18:03
corvus<clarkb "I'm also trying to build a model"> yep; minimal for #zuul, potentially significant for other communities18:04
corvus<clarkb "And for upgrades maybe we can ha"> I haven't tried an upgrade yet; that sounds reasonable, but it's definitely a wildcard in my mind.18:04
*** dviroel is now known as dviroel|away18:05
clarkbhaving a plan for those as well as understanding what server upgrades look like is probably important before going down the run our own path18:05
clarkbin particular for the host server I wonder if changing IPs will be a problem or if we can spin up a new one alongside and do a failover of sorts, etc18:05
clarkba sync + failover seems like the sort of thing matrix may just do out of the box given how it does federation18:06
fungiyeah, i asked similar questions about server redundancy in the zuul spec18:06
*** slittle1 is now known as Guest255218:06
fungiseems like something we need a little more research around18:06
corvusclarkb: also potentially interesting info for the foundation is that we have contacts in ansible, fedora, gnome, and mozilla orgs all of whom are in various stages of the same process (most/all of whom are choosing to use EMS hosting) and they seem happy to help with the process18:07
corvuschanging ips will not be a problem18:07
clarkbcorvus: oh ya it would be interesting to get info about EMS experiences from them I bet18:08
clarkbOn the subject of history collection I wonder if we can simply expose those logs somehow as the canonical historical record without needing an account and authentication (just so we don't end up with a copy of them all on the homeserver and another copy where the bot lives)18:11
fungithat might get tricky to filter if the homeserver has history from any private channels for some reason18:13
fungithough if we can be sure the only history it has is safe to publish, then perhaps easier18:14
clarkbfungi: good point. Point to point private comms are e2e encrypted so it would just be private channels that have this issue I think18:15
clarkbbut definitely something to check on if we explore that option further18:15
corvusclarkb: i don't think they're natively stored in a way that's usable for that purpose (you know, directed graph structure and all).  doing that with a web app is probably computationally expensive.  my guess is for purposes of search engine indexing, etc, just having flat files is best.  but maybe there's a way to export a room history to flat file to avoid needing the bot.18:15
clarkbcorvus: gotcha18:15
fungior a way to make a plugin which does that in more lightweight ways than a typical "bot"18:16
corvusfungi:  yeah, an app service may be able to do stuff like that, but i'm not very familiar18:17
clarkbcorvus: mordred: do you know if is the best way to reach out to EMS?18:19
clarkbor rather, do you have a better contact? if not then ^ is probably easiest18:19
corvusclarkb: i think so18:28
corvuser, i think that's the best way to get started; i don't have a better contact :)18:29
opendevreviewmelanie witt proposed opendev/jeepyb master: Convert update_blueprint to use the Gerrit REST API
*** dviroel|away is now known as dviroel18:53
clarkbmelwitt: I left a couple of comments on The first one would probably be a good followup and the other is more of a "is this even possible question"19:41
fungioh, thanks for the reminder, i was going to review that today19:44
clarkbok cleanup on nb03 has completed and I have restarted the builder there19:48
clarkbhopefully the periodic LE job tonight runs successfully and we don't get warnings about expiring certs tomorrow19:48
fungiunrelated, gitea01 has been reporting backup problems. i saw the e-mails for the past ~week but haven't found time to look into it yet19:49
clarkbfungi: those backups are primarily there to keep db updates around and db updates are primarily important for project renames. If we don't do a project rename until that is fixed it may not be urgent. Of course understanding why it broke would be nice19:52
fungiyeah, more or less what i was thinking, i just wanted to at least mention it so i know i'm not the only one aware there's a problem19:54
melwittclarkb: ok thanks, will look20:09
y2kennyWith OpenDev's Zuul, is it possible for organizations to add additional nodepool and/or executors?  If so, what is the process?20:38
y2kenny(or perhaps "attach" additional nodepool/executor is better wording...)20:40
fungiy2kenny: we haven't worked out a process for adding provider-specific builders/launchers/executors, so far we've only got central services connecting to publicly accessible cloud apis with public addressing (in some cases ipv6-only) to reach the nodes:
y2kennyfungi: Is it something that's in the plan or is that something that will be tricky?20:46
y2kennyfungi: This is still very much on the drawing board right now, but the kind of testing resources I am thinking of is not generally available in the cloud (baremetal HW with GPUs.)20:47
fungiy2kenny: we've talked about adding a zoned executor for one prospective donor who has no ipv6 connectivity to their environments and extremely limited ipv4 capacity, but we haven't talked about dedicated builders or launchers at all20:48
fungiand even an environment with the dedicated executor would be a slight reduction in functionality since we'd need to attach floating ips to give developers remote access to held nodes20:49
y2kennyfungi: ok... so is this something that you guys would be interested in starting a conversation on or is it too complex to do any time soon?20:50
fungiy2kenny: i guess we'd need to talk about what the architecture would look like and what amount and sort of capacity we're talking about, to determine whether the engineering work needed to support it would be sufficiently offset, since there's just not that many people helping design and maintain our control plane these days21:16
y2kennyfungi: ok, I see.21:17
y2kennyfungi: what is the right place to continue the discussion?  ( ?)21:22
fungiy2kenny: sure, here or there, though the mailing list is better for longer-term asynchronous discussion21:33
fungithe handful of sysadmins who are active on opendev are scattered around the globe, so not all awake/around right now21:33
y2kennyfungi: understood.21:39
opendevreviewmelanie witt proposed opendev/jeepyb master: Convert update_blueprint to use the Gerrit REST API
ianwclarkb: ahh, thanks for finding the letsencrypt job failure21:53
ianwalso, i just manually added fstab entries for EFI on our rax bionic hosts21:54
ianwyesterday on base i was seeing almost random "-13" errors; the only thing i could find was an ancient devstack-gate launchpad bug from yourself that mentioned iptables randomly returning -13 (no permissions)21:54
ianwlooks like last night failed to install updated ansible on bridge; looking21:56
ianw### ERROR ###21:57
ianwUpgrading directly from ansible-2.9 or less to ansible-2.10 or greater with pip is21:57
ianw                known to cause problems.  Please uninstall the old version found at:21:57
ianwi guess this is not worth coding for.  i'll manually uninstall and re-install ansible 4.0.0 once to get around this.  we don't see this in the gate with fresh installs22:00
ianw"Ansible will require Python 3.8 or newer on the controller starting with Ansible 2.12." ... we should think about bridge upgrade process too22:04
ianw#status log manually performed uninstall/reinstall for bridge ansible upgrade from
opendevstatusianw: finished logging22:05
clarkbianw: that is an interesting move by ansible since they have historically kept compat with really ancient python. I guess they decided that isn't sustainable (good for them)22:46
clarkbfungi: y2kenny: also note that there are specs up to zuul to make that sort of allocation easier on a per tenant basis22:47
opendevreviewMerged opendev/system-config master: review02 : switch reviewdb to mariadb_container type
ianwclarkb: it is only on the controller side though23:03
clarkbianw: ah23:14
fungiyeah, what version of python interpreter ansible can be installed with, not what version it can orchestrate23:18
opendevreviewMerged openstack/project-config master: Add gmann to IRC accessbot
opendevreviewMerged openstack/project-config master: Added publish-openstack-python-tarball job
corvusclarkb: i just upgraded my homeserver to matrixdotorg/synapse:latest; it automatically applied 1 db schema update23:42
corvusthat was just a pull followed by docker-compose up -d23:43
fungiso if i want to run a homeserver should i work out the docker orchestration, or do you expect the matrix-synapse 1.36.0 package in debian would suffice?23:47
opendevreviewGhanshyam proposed openstack/project-config master: End project gating for retiring arch-wg repo
corvusfungi: i think that would be fine.  it's basically a single python daemon, at least unless you want to start running app services like a slack bridge; so if the debian packages have worked out all the python deps already, i don't see a big advantage.  it uses sqlite by default for ease of testing, but they recommend postgres for production use.23:53
corvusi'm about to migrate from sqlite to postgres; that's going to take a couple of minutes.  so i'll go silent for a bit and... hopefully... let you know how it goes :)23:53
corvusfungi, clarkb apparently it worked fine :)23:57
