Wednesday, 2021-06-16

00:47 *** diablo_rojo is now known as Guest2326
01:05 <ianw> fungi: been debugging, with a little input from #openafs ... eventually i figured out that systemd is timing out, and then tries to unload the module while afsd is still in a tight loop of ioctls to it
01:06 <ianw> after upping the timeout, the service takes ~4:30 to start, but it does eventually start
01:08 <fungi> so the real trigger for the oops is the attempted rmmod call then?
01:09 <fungi> and we've just got a slow vm
01:09 <fungi> and the one in linaro-us is fast enough to not hit that timeout?
01:11 <ianw> yeah, Ramereth ^ maybe I do need to summon your help :)
01:11 <ianw> is the backing node under particularly heavy load?
01:11 <fungi> i guess it would help to know what's being slow there at openafs-client startup... creating the cachedirs?
01:12 <fungi> wondering if it's i/o contention, storage bandwidth, cpu hitching, memory pressure...
01:12 <ianw> yeah i have strace; it first goes through and stats everything, but then spends a long period of time making sequential ioctl calls, which i am guessing is doing something like registering every cache directory
01:12 <fungi> my guess is something to do with the cinder volume
01:12 <ianw> it doesn't seem like it's stuck at one thing
01:13 <ianw> i'm going to clear the cache, reboot and see again
01:13 <fungi> also maybe see if startup is faster when the cachedir is sanely created from a previous clean start
01:14 <ianw> i think the time is all this stat-ing and ioctl-ing to register the directories; my hunch is actually that a more populated directory would take longer
01:15 <ianw> i can run afsd under strace with timing, but that will only further perturb things i guess
01:18 <ianw> 0.01user 0.00system 1:38.27elapsed 0%CPU (0avgtext+0avgdata 6376maxresident)k
01:21 <ianw> that's right on the edge of a default 1:30 timeout i guess
01:22 * Ramereth has been summoned
01:22 <Ramereth> let me take a look at the load and see what's going on
01:23 <fungi> i always enjoy *appears in a puff of smoke*
01:29 <Ramereth> doesn't appear to be network issues
01:29 <Ramereth> which node is having the issue? the mirror01 one?
01:30 <ianw> yeah, f0f1ed9d-98a8-4713-8d74-69b168c4c996
01:31 <ianw> i dunno if we can 100% pin it as issues.  starting openafs is probably the only time it pegs itself at 100% stat-ing and doing all these ioctls
01:32 <ianw> if the timeout for the service starting is at 1:30 and we're at something like 1:38 ... then maybe we've just been lucky and slipped under the timeout previously
01:32 <ianw> i.e. the performance is roughly the same as it's always been
01:33 <ianw> but if the backing node looks particularly overcommitted, or something else, that would also explain things
01:38 <Ramereth> the hypervisor looks the same as it has been all week
01:39 <Ramereth> nothing looks different over the last 24hr for the ceph nodes either
01:40 <Ramereth> I need to go but I can look more into it tomorrow. Feel free to send an email to support with additional details if you think it is something on our end
01:40 <ianw> Ramereth: thanks for looking!
01:41 <ianw> it may well be that the only time we hit this is on reboot; we barely ever reboot so we don't really know what the usual time to start up is
01:41 <ianw> i'll propose a timeout override and i think we can just continue to monitor
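The override ianw mentions would typically be a systemd drop-in along these lines; this is only a sketch, and the file path and timeout value here are assumptions, not the actual proposed change:

```ini
# /etc/systemd/system/openafs-client.service.d/timeout.conf (hypothetical path)
[Service]
# systemd's default TimeoutStartSec is 90s; afsd startup was measured above
# at ~1:38 elapsed, so give it generous headroom
TimeoutStartSec=10min
```

A drop-in like this takes effect after `systemctl daemon-reload`, without editing the packaged unit file.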
<opendevreview> Ian Wienand proposed opendev/system-config master: openafs-client: add service timeout override
01:53 <fungi> yeah, also the reboot where we first hit this was last week, we weren't having much luck figuring out the problem and disabled booting job nodes there on friday
02:08 <ianw> probably hadn't been rebooted since it was first started
<ianw> stdout: Looking in indexes:,
03:29 <ianw> :stderr: ERROR: Could not find a version that satisfies the requirement ara[server] (from versions: none)
03:29 <ianw> are we having limestone issues?
03:43 <ianw> mirror_443_access.log:2607:ff68:100:54:f816:3eff:feb4:c4a3 - - [2021-06-16 01:55:55.115] "GET /pypi/simple/ara/ HTTP/1.1" 200 15079 cache miss: attempting entity save "-"
03:43 <ianw> that must have been the query?  timestamp doesn't quite line up so maybe it didn't hit the mirror?
03:44 <ianw> E: Failed to fetch  Connection failed [IP: 443]
03:44 <ianw> similar error here
<opendevreview> Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option
<opendevreview> Ian Wienand proposed opendev/system-config master: review02: switch reviewdb to mariadb_container type
<opendevreview> Ian Wienand proposed openstack/project-config master: infra-package-needs: don't start ntp for Fedora
<opendevreview> Ian Wienand proposed openstack/project-config master: Revert "Disable the osuosl arm64 cloud"
<opendevreview> Merged openstack/project-config master: infra-package-needs: don't start ntp for Fedora
05:04 *** marios is now known as marios|ruck
05:25 <gibi> hi infra! do you see some package mirror issue in the limestone region? We have this bug that only appears in limestone and produces nonsensical requirement conflicts seemingly randomly
05:41 <ianw> gibi: i've also seen some issues with limestone that don't seem repeatable.  we might have to disable it
06:16 <fungi> ianw: gibi: earlier there were two htcacheclean processes going at the same time driving system load up to 100 due to excessive iowait, and after i killed one of the htcachecleans it seemed to calm down
06:16 <fungi> it's possible the apache cache on that server is just getting larger than the i/o performance makes it possible to prune
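One common guard against the overlapping-htcacheclean situation fungi describes is to serialize runs with flock. A hypothetical cron entry, sketched here with an assumed cache path and size limit (not the server's actual configuration):

```
# /etc/cron.d/htcacheclean (hypothetical): flock -n makes a new run exit
# immediately if the previous one is still going, so two cleaners never
# compete for the same disk i/o
*/30 * * * * root flock -n /var/lock/htcacheclean.lock htcacheclean -n -t -p /var/cache/apache2/proxy -l 70G
```

Here `-n` runs nicely (slowly), `-t` removes empty directories, `-p` names the cache root, and `-l` sets the target size limit.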
<gibi> ianw, fungi: thanks for replying. we just got a fresh hit of the req conflict
06:40 <ianw> fungi: that sounds exactly like the performance issues we disabled it for before
06:40 <ianw> i think we'll have to disable it.  i think clarkb worked with logan-, but i think something is still up with storage
<opendevreview> Ian Wienand proposed openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues""""
<opendevreview> Merged openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues""""
07:15 *** rpittau|afk is now known as rpittau
07:33 *** jpena|off is now known as jpena
08:10 *** elodilles is now known as elodilles_afk
08:23 *** gthiemon1e is now known as gthiemonge
08:24 *** amoralej|off is now known as amoralej
08:31 *** ykarel is now known as ykarel|lunch
08:45 *** ysandeep|out is now known as ysandeep
08:50 *** sshnaidm|afk is now known as sshnaidm
09:38 *** ykarel|lunch is now known as ykarel
10:27 *** bhagyashris_ is now known as bhagyashris
10:54 *** ykarel_ is now known as ykarel
11:32 *** jpena is now known as jpena|lunch
12:10 *** amoralej is now known as amoralej|lunch
12:24 *** hashar is now known as Guest2387
12:26 *** hashar is now known as Guest2388
12:30 *** jpena|lunch is now known as jpena
12:35 *** diablo_rojo__ is now known as diablo_rojo
13:04 *** elodilles_afk is now known as elodilles
13:10 *** amoralej|lunch is now known as amoralej
13:33 *** ysandeep is now known as ysandeep|brb
13:38 <mordred> fungi: <-- this is from our friends at mozilla (part of a series of blog posts about their move) ... apparently matrix has federated ban lists
13:53 <fungi> oh neat
14:07 <mordred> yeah - seems like a really neat way for folks to collaborate in managing bad actors
14:15 *** ysandeep|brb is now known as ysandeep
14:37 *** gthiemon1e is now known as gthiemonge
14:39 *** ykarel is now known as ykarel|away
15:17 <fungi> okay, so i'm looking at tagging a new version of gear, as requested in #zuul
15:18 <fungi> the last release was 0.15.1 in february 2020
15:19 <fungi> there are 9 non-merge commits since then
15:20 <fungi> doesn't look like there are any backward-incompatible changes, but we did increase some dependency versions and add support for more python interpreters, so this should probably be 0.16.0 (unless we want to be bold and make it 1.0.0, but keep in mind we're likely to no longer use it after zuul v5)
15:21 <fungi> when i was trying to release it earlier in the year we got stuck solving the ssl/tls changes for python 3.9, but that's working now
15:31 *** dklyle is now known as dklyle__
15:41 <mordred> fungi: I think 0.16 seems reasonable
15:45 *** marios|ruck is now known as marios|out
15:46 <fungi> i just realized i still need to update the package metadata to list newer python releases, so working on that patch now
<opendevreview> Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info
16:07 <fungi> mordred: ^
16:08 <mordred> fungi: I think you could claim 3.10 too? (wasn't that what mhuin was poking at?)
16:08 <fungi> well, we're not actively testing gear changes with 3.10; he tested zuul with 3.10, which happens to exercise gear
16:09 <mordred> ah - nod. fair
16:09 <fungi> but also 3.10 is still in beta anyway
16:09 <fungi> and won't be released for months yet
16:10 <mhuin> fungi, I can add py310 testing if you'd like, I've done it on a local clone at some point
16:11 *** rpittau is now known as rpittau|afk
16:12 <fungi> mhuin: not urgent; at this stage i still think it would be premature to have a release of gear claiming it supports 3.10 when 3.10 hasn't been finalized, so the metadata could retroactively wind up incorrect if we need another release to fix it for things which change in 3.10 before it's final
16:13 <fungi> "gear 0.16.0 claims it supports python 3.10, but you really need to install gear 0.16.1 if you want to use it with python 3.10" would be an unfortunate situation
16:13 <mhuin> fungi, fair enough. I'll still do it anyway as a [DNM], it's going to be useful for packaging python-gear on Fedora Rawhide
16:13 <fungi> sure, sounds like a great idea
<opendevreview> Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info
16:16 <fungi> minor fixup to a line in the setup.cfg ^
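The python-version metadata being discussed lives in the trove classifier list of gear's setup.cfg, which pbr reads at build time. A sketch of what such an update looks like; the exact versions claimed by the real patch are an assumption here:

```ini
# setup.cfg fragment (sketch; pbr reads these trove classifiers)
[metadata]
classifier =
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
```

As discussed below, 3.10 is deliberately left out until it is finalized, since a wrong claim here can only be corrected by another release.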
<opendevreview> Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing
<opendevreview> Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing
16:28 *** ysandeep is now known as ysandeep|out
<opendevreview> Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing
16:34 *** amoralej is now known as amoralej|off
16:38 <corvus> fungi: whoah please don't tag gear
16:47 <fungi> corvus: holding off in that case
16:48 <corvus> fungi: see followup in #zuul
16:59 *** jpena is now known as jpena|off
17:46 <corvus> infra-root: i've gotten a report of a memory leak in zuul; opendev doesn't appear to be suffering from it now, but it may have been last week; i'd like to do some investigation with the repl. i've developed a less-intrusive technique of getting object graphs, so i think i can do this with little or no visible impact, but it's all new, so things could go wrong.
17:49 *** ricolin_ is now known as ricolin
17:53 <fungi> corvus: sounds good, thanks for the heads up... nice to not be the first ones to hit a memory leak for a change! ;)
17:57 <corvus> yeah, hopefully i can find it before it becomes a bigger problem
17:59 *** mordred[m] is now known as mordred
18:02 <fungi> gmann: looks like didn't know to grab a copy of ptg.json, so we'll likely have to extract data from the server if you still need it:*/*
18:03 <fungi> the etherpads.html is unfortunately not statically served, it's just javascript pulling and rendering a json blob
18:19 <tobiash[m]> fungi: we were the ones hit by the memleak ;)
18:21 <fungi> tobiash[m]: oh no! sorry to hear that
18:22 <tobiash[m]> took one day to go oom with 30gb of memory...
18:25 <fungi> surprised we haven't hit the same then
<fungi> but yeah, our memory usage has been fairly flat since we restarted ~5 days ago:
18:27 <corvus> unfortunately, so far, all i've managed to do is confirm that we don't have any leaked layout objects :(
18:27 <fungi> we did seem to be on track for runaway memory utilization back on june 10
18:28 <fungi> looks like it began very suddenly after having been running for more than a week
18:29 <fungi> so maybe we did experience it as well, but just happened to restart before we exhausted available ram
18:29 <corvus> fungi: yeah, which is part of why i suspect even if we're not seeing it now, we may still be susceptible. that was the old version without the gear job fix, but the mem increase happened before the executor crashed, so i don't think that was the cause
18:30 <fungi> makes sense
<gmann> fungi: ohk, I think we can try by bringing the site up and see if it is there, otherwise leave -
18:48 <fungi> yeah, the main gotcha is going to be that the old irc bots may start up and connect, and might be smart enough to ghost the production bots
18:49 <fungi> since that server houses gerritbot and meetbot too
<opendevreview> Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts
19:09 <fungi> ooh, a patch from florian!
19:17 <JayF> I hope someone exclaims with glee when they see me post a patch somewhere :)
19:17 <fungi> i always do
<opendevreview> Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts
19:19 <fungi> so looking through the current state with our arm nodes, we seem to once again have a number of stale node request locks; i'm going to restart the launcher container on nl03.o.o
19:21 <fungi> #status log Restarted nodepool launcher on nl03 to free stale node request locks for arm64 nodes
19:21 <opendevstatus> fungi: finished logging
19:25 <fungi> also, aside from the problem getting the mirror server working again in osuosl, we seem to be deleting quite a few nodes there, one of which has been stuck in a deleting state for two months
<fungi> i'm not sure what to make of this traceback:
19:29 <fungi> anyway, need to take a dinner break, back in a while
19:46 <corvus> fungi: there may have been an exception deleting a server
19:48 <corvus> fungi: apparently that really means "timeout of 10 minutes exceeded while waiting for server deletion"
19:50 <corvus> fungi: so assuming instance id 1c8d118d-756f-4e8c-bf57-0c6b8ad690ea still exists in the cloud, it's likely that nodepool did try to delete that and it persists in showing up in the server list
20:02 <corvus> i'm closing the repl for now; i've found no evidence of any leak of layout or pipeline objects :/
21:17 *** ChanServ sets mode: +o corvus
21:18 *** corvus was kicked by corvus (Kicked by : Removing stale matrix bridge puppet user)
21:43 <fungi> corvus: thanks, and yeah that was also the only interpretation i could come up with. i'm used to seeing api errors for delete failures, so the lack of api response logged suggested no response
21:44 <fungi> or at least no response before it gave up waiting for one
21:51 *** diablo_rojo is now known as Guest2446
21:59 <ianw> fungi: i just got into a usual yak shaving exercise with the timeout for the openafs-client role; since the last time we ran, centos8 has pulled in some changes, so we need to update our rpms to unbreak the gate
22:00 <ianw> of course, the pre-release has a slightly different naming convention, breaking the build scripts ...
22:03 <fungi> of course it does!
22:03 <fungi> why would you keep something like that consistent?
<opendevreview> Merged opendev/system-config master: gerrit: add mariadb_container option
23:20 *** diablo_rojo is now known as Guest2457
23:48 <ianw> fungi: now overrides the openafs timeout, which i've already manually applied to the osuosl mirror. if you're ok, i can unapply, merge that, and we can just leave the mirror there on 1.8.8pre1
23:49 <ianw> it can essentially be a 1.8.8 tester, and i'll keep an eye out for the release and we can jump straight to that in the main PPA repo
23:50 <fungi> ianw: lgtm, approved, go ahead
23:51 <ianw> fungi: thanks; did you say you cleared out some stuck nodes?
23:52 <fungi> ianw: no, it's not clear to me why they're stuck in delete state in osuosl
23:52 <fungi> might be undeletable and need attention from the admin end
23:52 <ianw> ahh, i feel like we've had that before
23:52 <ianw> Ramereth deleted them for me ^
23:53 <fungi> for linaro-us, we seem to be managing to get 2 in-use nodes at a time right now
23:54 <fungi> 15 nodes in delete state in linaro-us with 4 building, 2 ready and 2 in-use
<ianw> i feel like that led me down the path of
<ianw> the mariadb container for gerrit seemed to have failed in deploy with letsencrypt :
23:55 <fungi> many of the deleting nodes in linaro-us have been there for two months
23:56 <ianw> i'll look into that.  it's annoying as i wanted to be 100% sure it didn't modify production before moving on to deploy it on review02
23:56 <ianw> fungi: i guess we need to ping kevinz on that ... email might be better
23:56 <fungi> we've got that e-mail thread with him that clarkb started yesterday
23:57 <fungi> where he cleared some stuck building nodes i think?
23:59 <ianw> i'll double check and reply
23:59 <ianw> fedora-34 should also have built

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at!