*** diablo_rojo is now known as Guest2326 | 00:47 | |
ianw | fungi: been debugging, with a little input from #openafs ... eventually i figured out that systemd is timing out; which tries to unload the module while afsd is still in a tight loop of ioctls to it | 01:05 |
ianw | after upping the timeout, the service takes ~4:30 to start, but it does eventually start | 01:06 |
fungi | aha! | 01:08 |
fungi | so the real trigger for the oops is the attempted rmmod call then? | 01:08 |
fungi | and we've just got a slow vm | 01:09 |
fungi | and the one in linaro-us is fast enough to not hit that timeout? | 01:09 |
ianw | yeah, Ramereth ^ maybe I do need to summon your help :) | 01:11 |
ianw | is the node backing mirror01.regionone.osuosl.opendev.org under particularly heavy load? | 01:11 |
fungi | i guess it would help to know what's being slow there at openafs-client startup... creating the cachedirs? | 01:11 |
fungi | wondering if it's i/o contention, storage bandwidth, cpu hitching, memory pressure... | 01:12 |
ianw | yeah i have strace, it first goes through and stats everything, but then spends a long period of time making sequential ioctl calls which i am guessing is doing something like registering every cache directory | 01:12 |
fungi | my guess is something to do with the cinder volume | 01:12 |
ianw | it doesn't seem like it's stuck at one thing | 01:12 |
ianw | i'm going to clear the cache, reboot and see again | 01:13 |
fungi | also maybe see if startup is faster when the cachedir is sanely created from a previous clean start | 01:13 |
ianw | i think the time is all this stating and ioctl-ing to register the directories; my hunch is actually that a more populated directory would take longer | 01:14 |
ianw | i can run afsd under strace with timing, but that will only further perturb things i guess | 01:15 |
fungi | yeah | 01:16 |
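A sketch of the kind of timed strace run being described; the exact afsd options aren't shown in the log, so the invocation below is an illustrative assumption:

```sh
# -f follows afsd's forked helper processes, -tt timestamps each syscall,
# -T reports how long each syscall took, -o keeps the output out of the way.
# <afsd-options> stands in for whatever arguments the distro's openafs-client
# configuration normally passes to afsd.
strace -f -tt -T -o /tmp/afsd.strace /sbin/afsd <afsd-options>
```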
ianw | 0.01user 0.00system 1:38.27elapsed 0%CPU (0avgtext+0avgdata 6376maxresident)k | 01:18 |
ianw | that's right on the edge of a default 1:30 timeout i guess | 01:21 |
* Ramereth has been summoned | 01:22 | |
Ramereth | let me take a look at the load and see what's going on | 01:22 |
fungi | i always enjoy *appears in a puff of smoke* | 01:23 |
Ramereth | doesn't appear to be network issues | 01:29 |
Ramereth | which node is having the issue? the mirror01 one? | 01:29 |
ianw | yeah, f0f1ed9d-98a8-4713-8d74-69b168c4c996 | 01:30 |
ianw | i dunno if we can 100% pin it as an issue. starting openafs is probably the only time it pegs itself at 100%, stating and doing all these ioctls | 01:31 |
ianw | if the timeout for the service starting is at 1:30 and we're at something like 1:38 ... then maybe we've just been lucky and slipped under the timeout previously | 01:32 |
ianw | i.e. the performance is roughly the same as it's always been | 01:32 |
ianw | but if the backing node looks particularly overcommitted, or something else, that would also explain things | 01:33 |
Ramereth | the hypervisor looks the same as it has been all week | 01:38 |
Ramereth | nothing looks different out of the last 24hr for the ceph nodes either | 01:39 |
Ramereth | I need to go but I can look more into it tomorrow. Feel free to send an email to support with additional details if you think it is something on our end | 01:40 |
ianw | Ramereth: thanks for looking! | 01:40 |
ianw | it may well be that the only time we hit this is on reboot; we barely ever reboot so we don't really know what the usual time to start up is | 01:41 |
ianw | i'll propose a timeout override and i think we can just continue to monitor | 01:41 |
opendevreview | Ian Wienand proposed opendev/system-config master: openafs-client: add service timeout override https://review.opendev.org/c/opendev/system-config/+/796578 | 01:51 |
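For reference, this sort of start-timeout bump is usually delivered as a systemd drop-in rather than an edit to the packaged unit; a minimal sketch, where the file path and the 10-minute value are illustrative assumptions rather than what the linked change necessarily uses:

```ini
# /etc/systemd/system/openafs-client.service.d/override.conf (hypothetical path)
[Service]
# The default TimeoutStartSec of 90s is what afsd's ~1:38 cache scan was tripping over;
# give it generous headroom.
TimeoutStartSec=10min
```

After dropping the file in place, `systemctl daemon-reload` makes systemd pick up the override.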
fungi | yeah, also the reboot where we first hit this was last week, we weren't having much luck figuring out the problem and disabled booting job nodes there on friday | 01:53 |
ianw | probably hadn't been rebooted since it was first started | 02:08 |
ianw | stdout: Looking in indexes: https://mirror.regionone.limestone.opendev.org/pypi/simple, https://mirror.regionone.limestone.opendev.org/wheel/ubuntu-18.04-x86_64 | 03:29 |
ianw | :stderr: ERROR: Could not find a version that satisfies the requirement ara[server] (from versions: none) | 03:29 |
ianw | are we having limestone issues? | 03:29 |
ianw | mirror_443_access.log:2607:ff68:100:54:f816:3eff:feb4:c4a3 - - [2021-06-16 01:55:55.115] "GET /pypi/simple/ara/ HTTP/1.1" 200 15079 cache miss: attempting entity save "-" | 03:43 |
ianw | that must have been the query? timestamp doesn't quite line up so maybe it didn't hit the mirror? | 03:43 |
ianw | E: Failed to fetch https://mirror.regionone.limestone.opendev.org/debian/pool/main/k/kerberos-configs/krb5-config_2.6_all.deb Connection failed [IP: 216.245.200.130 443] | 03:44 |
ianw | similar error here | 03:44 |
opendevreview | Ian Wienand proposed opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 03:57 |
opendevreview | Ian Wienand proposed opendev/system-config master: review02 : switch reviewdb to mariadb_container type https://review.opendev.org/c/opendev/system-config/+/795192 | 03:57 |
opendevreview | Ian Wienand proposed openstack/project-config master: infra-package-needs: don't start ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796584 | 04:08 |
opendevreview | Ian Wienand proposed openstack/project-config master: Revert "Disable the osuosl arm64 cloud" https://review.opendev.org/c/openstack/project-config/+/796585 | 04:18 |
opendevreview | Merged openstack/project-config master: infra-package-needs: don't start ntp for Fedora https://review.opendev.org/c/openstack/project-config/+/796584 | 04:54 |
*** marios is now known as marios|ruck | 05:04 | |
gibi | hi infra! do you see some package mirror issue in the limestone region? We have this bug https://bugs.launchpad.net/nova/+bug/1931864 that only appears in limestone and produces nonsensical requirement conflicts, seemingly randomly | 05:25 |
ianw | gibi: i've also seen some issues with limestone that don't seem repeatable. we might have to disable it | 05:41 |
fungi | ianw: gibi: earlier there were two htcacheclean processes going at the same time driving system load up to 100 due to excessive iowait, and after i killed one of the htcachecleans it seemed to calm down | 06:16 |
fungi | it's possible the apache cache on that server is just getting larger than the i/o performance makes it possible to prune | 06:16 |
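If the cache really is outgrowing what a single pruning pass can keep up with, the usual levers are htcacheclean's size limit and niceness flags; a hedged example, where the cache path and limit are illustrative assumptions rather than the mirror's actual configuration:

```sh
# -n  be "nice": sleep between chunks of work to reduce I/O pressure
# -t  remove empty directories left behind after pruning
# -p  the proxy cache root, -l the target total size limit
htcacheclean -n -t -p /var/cache/apache2/proxy -l 70G
```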
gibi | ianw, fungi: thanks for replying. we just got a fresh hit of the req conflict https://1b38eb07519f5fe2ed36-da9f1bb46fd216e97fa5e10d4af58222.ssl.cf5.rackcdn.com/796255/2/check/openstack-tox-py36/3c59c9d/job-output.txt | 06:24 |
ianw | fungi: that sounds exactly like the performance issues we disabled it for before | 06:40 |
ianw | i think we'll have to disable it. i think clarkb worked with logan-, but i think something is still up with storage | 06:40 |
opendevreview | Ian Wienand proposed openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues"""" https://review.opendev.org/c/openstack/project-config/+/796590 | 06:45 |
opendevreview | Merged openstack/project-config master: Revert "Revert "Revert "Revert "Disable limestone due to mirror issues"""" https://review.opendev.org/c/openstack/project-config/+/796590 | 07:14 |
*** rpittau|afk is now known as rpittau | 07:15 | |
*** jpena|off is now known as jpena | 07:33 | |
*** elodilles is now known as elodilles_afk | 08:10 | |
*** gthiemon1e is now known as gthiemonge | 08:23 | |
*** amoralej|off is now known as amoralej | 08:24 | |
*** ykarel is now known as ykarel|lunch | 08:31 | |
*** ysandeep|out is now known as ysandeep | 08:45 | |
*** sshnaidm|afk is now known as sshnaidm | 08:50 | |
*** ykarel|lunch is now known as ykarel | 09:38 | |
*** bhagyashris_ is now known as bhagyashris | 10:27 | |
*** ykarel_ is now known as ykarel | 10:54 | |
*** jpena is now known as jpena|lunch | 11:32 | |
*** amoralej is now known as amoralej|lunch | 12:10 | |
*** hashar is now known as Guest2387 | 12:24 | |
*** hashar is now known as Guest2388 | 12:26 | |
*** jpena|lunch is now known as jpena | 12:30 | |
*** diablo_rojo__ is now known as diablo_rojo | 12:35 | |
*** elodilles_afk is now known as elodilles | 13:04 | |
*** amoralej|lunch is now known as amoralej | 13:10 | |
*** ysandeep is now known as ysandeep|brb | 13:33 | |
mordred | fungi: http://exple.tive.org/blarg/2020/03/06/brace-for-impact/ <-- this is from our friends at mozilla (part of a series of blog posts about their move) ... apparently matrix has federated ban lists | 13:38 |
fungi | oh neat | 13:53 |
mordred | yeah - seems like a really neat way for folks to collaborate in managing bad actors | 14:07 |
*** ysandeep|brb is now known as ysandeep | 14:15 | |
*** gthiemon1e is now known as gthiemonge | 14:37 | |
*** ykarel is now known as ykarel|away | 14:39 | |
fungi | okay, so i'm looking at tagging a new version of gear, as requested in #zuul | 15:17 |
fungi | the last release was 0.15.1 in february 2020 | 15:18 |
fungi | there are 9 non-merge commits since then | 15:19 |
fungi | doesn't look like there are any backward-incompatible changes, but we did increase some dependency versions and add support for more python interpreters, so this should probably be 0.16.0 (unless we want to be bold and make it 1.0.0, but keep in mind we're likely to no longer use it after zuul v5) | 15:20 |
fungi | when i was trying to release it earlier in the year we got stuck solving the ssl/tls changes for python 3.9, but that's working now | 15:21 |
*** dklyle is now known as dklyle__ | 15:31 | |
mordred | fungi: I think 0.16 seems reasonable | 15:41 |
*** marios|ruck is now known as marios|out | 15:45 | |
fungi | i just realized i still need to update the package metadata to list newer python releases, so working on that patch now | 15:46 |
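Listing newer interpreters is normally just a matter of extending the trove classifiers in setup.cfg (gear uses pbr, hence the singular `classifier` key); the list below is a guess at what the change adds, not a quote from it:

```ini
[metadata]
classifier =
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.6
    Programming Language :: Python :: 3.7
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
```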
opendevreview | Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info https://review.opendev.org/c/opendev/gear/+/796704 | 16:07 |
fungi | mordred: ^ | 16:07 |
mordred | fungi: I think you could claim 3.10 too? (wasn't that what mhuin was poking at?) | 16:08 |
fungi | well, we're not actively testing gear changes with 3.10, he tested zuul with 3.10 which happens to exercise gear | 16:08 |
mordred | ah - nod. fair | 16:09 |
fungi | but also 3.10 is still in beta anyway | 16:09 |
fungi | and won't be released for months yet | 16:09 |
mhuin | fungi, I can add py310 testing if you'd like, I've done it on a local clone at some point | 16:10 |
*** rpittau is now known as rpittau|afk | 16:11 | |
fungi | mhuin: not urgent, at this stage i still think it would be premature to have a release of gear claiming it supports 3.10 when 3.10 hasn't been finalized, so the metadata could retroactively wind up incorrect if we need another release to fix it for things which change in 3.10 before it's final | 16:12 |
fungi | "gear 0.16.0 claims it supports python 3.10, but you really need to install gear 0.16.1 if you want to use it with python 3.10" would be an unfortunate situation | 16:13 |
mhuin | fungi, fair enough. I'll still do it anyway as a [DNM], it's going to be useful for packaging python-gear on Fedora Rawhide | 16:13 |
fungi | sure, sounds like a great idea | 16:13 |
opendevreview | Jeremy Stanley proposed opendev/gear master: Overhaul package metadata and contributor info https://review.opendev.org/c/opendev/gear/+/796704 | 16:16 |
fungi | minor fixup to a line in the setup.cfg ^ | 16:16 |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:20 |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:26 |
*** ysandeep is now known as ysandeep|out | 16:28 | |
opendevreview | Matthieu Huin proposed opendev/gear master: [DNM] add python 3.10 testing https://review.opendev.org/c/opendev/gear/+/796705 | 16:29 |
*** amoralej is now known as amoralej|off | 16:34 | |
corvus | fungi: whoah please don't tag gear | 16:38 |
fungi | corvus: holding off in that case | 16:47 |
corvus | fungi: see followup in #zuul | 16:48 |
fungi | thanks! | 16:51 |
*** jpena is now known as jpena|off | 16:59 | |
corvus | infra-root: i've gotten a report of a memory leak in zuul; opendev doesn't appear to be suffering from it now, but it may have been last week; i'd like to do some investigation with the repl. i've developed a less-intrusive technique of getting object graphs, so i think i can do this with little or no visible impact, but it's all new, so things could go wrong. | 17:46 |
*** ricolin_ is now known as ricolin | 17:49 | |
fungi | corvus: sounds good, thanks for the heads up... nice to not be the first ones to hit a memory leak for a change! ;) | 17:53 |
corvus | yeah, hopefully i can find it before it becomes a bigger problem | 17:57 |
*** mordred[m] is now known as mordred | 17:59 | |
fungi | gmann: looks like archive.org didn't know to grab a copy of ptg.json, so we'll likely have to extract data from the server if you still need it: https://web.archive.org/web/*/http://ptg.openstack.org/* | 18:02 |
fungi | the etherpads.html is unfortunately not statically served, it's just javascript pulling and rendering a json blob | 18:03 |
tobiash[m] | fungi: we were the ones hit by the memleak ;) | 18:19 |
fungi | tobiash[m]: oh no! sorry to hear that | 18:21 |
tobiash[m] | took one day to go oom with 30gb of memory... | 18:22 |
fungi | ouch | 18:24 |
fungi | surprised we haven't hit the same then | 18:25 |
fungi | but yeah, our memory usage has been fairly flat since we restarted ~5 days ago: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70194&rra_id=all | 18:27 |
corvus | unfortunately, so far, all i've managed to do is confirm that we don't have any leaked layout objects :( | 18:27 |
fungi | we did seem to be on track for runaway memory utilization back on june 10 | 18:27 |
fungi | looks like it began very suddenly after having been running for more than a week | 18:28 |
fungi | so maybe we did experience it as well, but just happened to restart before we exhausted available ram | 18:29 |
corvus | fungi: yeah, which is part of why i suspect even if we're not seeing it now, we may still be susceptible. that was the old version without the gear job fix, but the mem increase happened before the executor crashed, so i don't think that was the cause | 18:29 |
fungi | makes sense | 18:30 |
gmann | fungi: ohk, I think we can try bringing the site up and see if it is there, otherwise leave it - http://ptg.openstack.org/ | 18:45 |
fungi | yeah, the main gotcha is going to be that the old irc bots may start up and connect, and might be smart enough to ghost the production bots | 18:48 |
fungi | since that server houses gerritbot and meetbot too | 18:49 |
opendevreview | Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts https://review.opendev.org/c/opendev/git-review/+/796727 | 19:03 |
fungi | ooh, a patch from florian! | 19:09 |
JayF | I hope someone exclaims with glee when they see me post a patch somewhere :) | 19:17 |
fungi | i always do | 19:17 |
opendevreview | Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts https://review.opendev.org/c/opendev/git-review/+/796727 | 19:18 |
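For anyone unfamiliar with the option the patch is handling: `core.hooksPath` redirects Git's hook lookup away from `.git/hooks`, which is why git-review's commit-msg hook installation needs to be aware of it. A minimal illustration (the `.githooks` directory name is just an example):

```sh
# Point Git at a shared hooks directory instead of .git/hooks
mkdir -p .githooks
git config core.hooksPath .githooks
# With this set, hooks such as Gerrit's commit-msg hook are looked up
# under .githooks/ rather than .git/hooks/.
```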
fungi | so looking through the current state with our arm nodes, we seem to once again have a number of stale node request locks, i'm going to restart the launcher container on nl03.o.o | 19:19 |
fungi | #status log Restarted nodepool launcher on nl03 to free stale node request locks for arm64 nodes | 19:21 |
opendevstatus | fungi: finished logging | 19:21 |
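The launcher restart mentioned above generally amounts to bouncing the docker-compose managed container; a hedged sketch, where the compose directory is an assumption about nl03's layout rather than a verified path:

```sh
# Hypothetical compose location; adjust to match the host's actual layout.
cd /etc/nodepool-launcher-compose/
docker-compose down
docker-compose up -d
```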
fungi | also, aside from the problem getting the mirror server working again in osuosl, we seem to be deleting quite a few nodes there, one of which has been stuck in a deleting state for two months | 19:25 |
fungi | i'm not sure what to make of this traceback: http://paste.openstack.org/show/806712 | 19:28 |
fungi | anyway, need to take a dinner break, back in a while | 19:29 |
corvus | fungi: there may have been an exception deleting a server | 19:46 |
corvus | fungi: apparently that really means "timeout of 10 minutes exceeded while waiting for server deletion" | 19:48 |
corvus | fungi: so assuming instance id 1c8d118d-756f-4e8c-bf57-0c6b8ad690ea still exists in the cloud, it's likely that nodepool did try to delete that and it persists in showing up in the server list | 19:50 |
corvus | i'm closing the repl for now; i've found no evidence of any leak of layout or pipeline objects :/ | 20:02 |
*** ChanServ sets mode: +o corvus | 21:17 | |
*** corvus was kicked by corvus (Kicked by @jim:acmegating.com : Removing stale matrix bridge puppet user) | 21:18 | |
fungi | corvus: thanks, and yeah that was also the only interpretation i could come up with. i'm used to seeing api errors for delete failures, so the lack of api response logged suggested no response | 21:43 |
fungi | or at least no response before it gave up waiting for one | 21:44 |
*** diablo_rojo is now known as Guest2446 | 21:51 | |
ianw | fungi: i just got into the usual yak shaving exercise with the timeout for the openafs-client role; since the last time we ran, centos8 has pulled in some changes, so we need to update our rpms to unbreak the gate | 21:59 |
ianw | of course, the pre-release has a slightly different naming convention, breaking the build scripts ... | 22:00 |
fungi | of course it does! | 22:03 |
fungi | why would you keep something like that consistent? | 22:03 |
opendevreview | Merged opendev/system-config master: gerrit: add mariadb_container option https://review.opendev.org/c/opendev/system-config/+/775961 | 23:14 |
*** diablo_rojo is now known as Guest2457 | 23:20 | |
ianw | fungi: https://review.opendev.org/c/opendev/system-config/+/796578 now overrides the openafs timeout, which i've already manually applied to osuosl mirror. if you're ok, i can unapply, merge that, and we can just leave the mirror there on 1.8.8pre1 | 23:48 |
ianw | it can essentially be a 1.8.8 tester, and i'll keep an eye for release and we can jump straight to that in the main PPA repo | 23:49 |
fungi | reviewing | 23:49 |
fungi | ianw: lgtm, approved, go ahead | 23:50 |
ianw | fungi: thanks; did you say you cleared out some stuck nodes? | 23:51 |
fungi | ianw: no, it's not clear to me why they're stuck in delete state in osuosl | 23:52 |
fungi | might be undeletable and need attention from the admin end | 23:52 |
ianw | ahh, i feel like we've had that before | 23:52 |
ianw | Ramereth deleted them for me ^ | 23:52 |
fungi | for linaro-us, we seem to be managing to get 2 in-use nodes at a time right now | 23:53 |
fungi | 15 nodes in delete state in linaro-us with 4 building, 2 ready and 2 in-use | 23:54 |
ianw | i feel like that led me down the path of https://review.opendev.org/c/zuul/nodepool/+/785821 | 23:54 |
ianw | the mariadb container for gerrit seemed to have failed in deploy with letsencrypt : https://review.opendev.org/c/opendev/system-config/+/775961 | 23:55 |
fungi | many of the deleting nodes in linaro-us have been there for two months | 23:55 |
ianw | i'll look into that. it's annoying as i wanted to be 100% sure it didn't modify production before moving on to deploy it on review02 | 23:56 |
ianw | fungi: i guess we need to ping kevinz on that ... email might be better | 23:56 |
fungi | we've got that e-mail thread with him clarkb started yesterday | 23:56 |
fungi | where he cleared some stuck building nodes i think? | 23:57 |
ianw | i'll double check and reply | 23:59 |
ianw | fedora-34 should also have built | 23:59 |