opendevreview | Merged openstack/project-config master: Add puppet-manila-core https://review.opendev.org/c/openstack/project-config/+/834318 | 02:31 |
*** ysandeep|out is now known as ysandeep | 06:45 | |
*** pojadhav- is now known as pojadhav|rover | 08:28 | |
*** jpena|off is now known as jpena | 08:38 | |
*** ysandeep is now known as ysandeep|afk | 08:45 | |
*** ysandeep|afk is now known as ysandeep | 09:04 | |
*** pojadhav- is now known as pojadhav|rover | 09:16 | |
opendevreview | Michal Nasiadka proposed opendev/irc-meetings master: kolla: move meeting one hour backward (DST) https://review.opendev.org/c/opendev/irc-meetings/+/835020 | 09:44 |
*** rlandy|out is now known as rlandy | 10:21 | |
*** ysandeep is now known as ysandeep|afk | 11:10 | |
*** ysandeep|afk is now known as ysandeep | 11:14 | |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:21 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:26 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:29 |
opendevreview | Merged opendev/irc-meetings master: kolla: move meeting one hour backward (DST) https://review.opendev.org/c/opendev/irc-meetings/+/835020 | 11:52 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 12:25 |
*** ysandeep is now known as ysandeep|afk | 12:58 | |
frickler | infra-root: regarding iweb, I checked the "nodepool list" output and there is a high number of nodes in a deleting state there, and checking some of those I found more nodes with an uptime >10d. those don't necessarily correlate to the IP conflicts, but still, maybe we just want to shut that cloud down right now instead of doing further debugging? | 13:31 |
fungi | nodes stuck in a deleting state simply subtract from the available quota and cause the launchers to make additional delete calls | 13:32 |
fungi | it would be nice if we could safely ride it out through next wednesday so we have that additional capacity for openstack yoga release day | 13:33 |
*** ysandeep|afk is now known as ysandeep | 13:40 | |
*** pojadhav|rover is now known as pojadhav|afk | 14:12 | |
frickler | I think "safely" is the key point here. iiuc there is noone left operating that cloud, without the stuck nodes we have no control about how many further IP conflicts may happen. and the risk of jobs falsely passing because they run on the wrong node type can't be excluded, which I consider pretty dangerous in particular for release day | 14:20 |
frickler | *without cleaning up the stuck nodes | 14:20 |
fungi | yep, i guess we can take a closer look at our current utilization as a projection for next week, and if we decide we're unlikely to be overloaded then just plan for the release to take longer to run all the jobs it will need | 14:21 |
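A minimal sketch of the check frickler describes above, assuming the `nodepool list` CLI is available on the launcher and that its tabular output contains the node state as plain text; the parsing below is an illustrative assumption, not a documented output format.

```python
# Rough sketch only: shell out to "nodepool list" and flag nodes whose row
# mentions the deleting state. Assumes the nodepool CLI is on PATH and that
# the state appears verbatim in its tabular output.
import subprocess

def stuck_deleting_nodes():
    output = subprocess.run(
        ["nodepool", "list"], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in output.splitlines() if "deleting" in line]

if __name__ == "__main__":
    rows = stuck_deleting_nodes()
    print(f"{len(rows)} node rows currently report a deleting state")
    for row in rows:
        print(row)
```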
*** Guest86 is now known as diablo_rojo_phone | 14:53 | |
mgagne | frickler: fungi: I can take a look | 14:59 |
fungi | thanks mgagne! didn't know you still had access | 15:01 |
mgagne | I 100% have access and our team "owns" the infra. | 15:02 |
fungi | to summarize, there are a bunch of undeletable nodes, and also some which nova seems to have lost track of so they're still up but their addresses are also getting assigned to newly-booted nodes | 15:02 |
fungi | the undeletable nodes are mainly just impacting available capacity slightly; it's the rogue servers that are presenting more of a problem, since we're getting some builds which end up connecting to them and either run on the wrong, already-dirtied node or fail with host key mismatches | 15:04 |
mgagne | I'll take care of the undeletable nodes. I'll take a look at the rogue vms too. | 15:07 |
fungi | thanks so much! | 15:10 |
frickler | mgagne: that's great, if you let me know when done, I can double check the nodepool list. did you see the reference to the two rogue nodes we powered off in #openstack-infra? | 15:17 |
mgagne | Was an IP or UUID provided? I guess I missed it in the backlog. | 15:18 |
frickler | mgagne: I can get the IPs for you, one moment | 15:18 |
frickler | mgagne: ubuntu-focal-iweb-mtl01-0028848617 198.72.124.130 and ubuntu-focal-iweb-mtl01-0028797665 198.72.124.111 | 15:21 |
fungi | those are the ones we're aware of so far, anyway. there may be others we haven't spotted | 15:22 |
mgagne | thanks, looking into it | 15:22 |
mgagne | I'm not able to find any "rogue" VMs, everything matches between Nova and the compute nodes. | 15:34 |
fungi | maybe they got cleaned up after we logged in and issued shutdown commands to stop them from interfering with new nodes | 15:56 |
frickler | yeah, that's likely. or you didn't find them because they are shut down? anyway, we can ping you once we find another instance | 15:58 |
frickler | fungi: another idea I just had: nodepool could possibly detect such a situation by checking for the correct hostname to be seen when it confirms ssh connectivity, do you think that would be worthwhile to add as a feature? | 15:59 |
fungi | frickler: i don't think nodepool currently logs into the nodes it boots, just checks to see that ssh is responding? | 16:00 |
frickler | fungi: it should at least verify that ssh access is working, shouldn't it? but I must admit I'm not sure about the details either | 16:01 |
fungi | also that doesn't really address the issue, since the situation you get into generally is an arp conflict, where 50% of connections go to one node and 50% to the other depending on timing and the state of the gateway's arp table | 16:01 |
fungi | so it would at best discover 50% of them | 16:02 |
frickler | depends on timing I guess, if the older node is faster at responding, it could win 100% of the time. at least with the current incident I've always been connected to the older node only | 16:03 |
frickler | maybe the better option would be for nova to finally learn to allow ssh hostkey verification to happen via the API somehow. again I'm not sure how that would be possible, but it sure would be a cool feature | 16:06 |
corvus | nodepool does not log into the node. but zuul does, so if you wanted to check that, you could do so in zuul. | 16:06 |
corvus | (by "in zuul" i mean in a pre-run playbook in a job) | 16:07 |
fungi | right, we have some similar checks which happen in our base job already, so could extend that | 16:27 |
fungi | but still, it'll only catch situations where that playbook happens to hit the "wrong" server, so won't catch them all | 16:27 |
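As a rough illustration of the check discussed above: the real thing would be an Ansible task in the base job's pre-run playbook, as corvus suggests, but the same idea expressed as illustrative Python with paramiko looks roughly like the sketch below, where the username, key path and example node details are placeholder assumptions.

```python
# Illustrative only: verify that the host we reach over SSH is really the
# node nodepool thinks it booted, by comparing the reported hostname with
# the expected nodepool node name. In OpenDev this check would live in a
# pre-run playbook; the connection details below are placeholders.
import paramiko

def hostname_matches(ip, expected_name, username="zuul",
                     key_filename="/var/lib/zuul/ssh/id_rsa"):
    client = paramiko.SSHClient()
    # The rogue-node scenario shows up as a host key surprise, so a real
    # check would also compare against the key recorded at boot instead of
    # blindly accepting whatever key the host presents.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username=username, key_filename=key_filename)
    try:
        _, stdout, _ = client.exec_command("hostname")
        actual = stdout.read().decode().strip()
    finally:
        client.close()
    return actual == expected_name

# Example using the node frickler mentioned earlier:
# hostname_matches("198.72.124.130", "ubuntu-focal-iweb-mtl01-0028848617")
```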
dpawlik | fungi, clarkb: hey, I added a few visualizations in OpenSearch. Right now they are just simple visualizations, but maybe they will be helpful for someone. I also added some new logs to be pushed: https://review.opendev.org/c/openstack/ci-log-processing/+/833624/12/logscraper/config.yaml.sample . Others will be added soon! | 16:35 |
*** ysandeep is now known as ysandeep|out | 16:36 | |
dpawlik | fungi: I will add the whitespace changes tomorrow to the PS https://review.opendev.org/c/opendev/system-config/+/833264 . Need to go now (just leaving a message because I saw a ping on the openstack-tc channel) | 16:38 |
*** rlandy is now known as rlandy|PTO | 16:46 | |
*** jpena is now known as jpena|off | 17:28 | |
opendevreview | Merged openstack/project-config master: Add Istio app to StarlingX https://review.opendev.org/c/openstack/project-config/+/834896 | 19:43 |
*** diablo_rojo_phone is now known as Guest250 | 20:16 | |
*** dviroel is now known as dviroel|pto | 20:45 | |
priteau | Good evening. I am investigating a doc job failure in blazar, seen in stable/wallaby but I assume master is affected too. The failure is caused by the latest Jinja2 3.1.0 being installed, which is incompatible with Sphinx 3.5.2 from wallaby upper constraints. | 21:34 |
priteau | Actually master may not be affected, since the Sphinx u-c is different | 21:35 |
priteau | Jinja2 comes from Flask, which is a blazar requirement; I am wondering what could be causing it to be installed at that version even with upper constraints in use? | 21:36 |
priteau | Failure log: https://8abbfecd00d9996fdb0c-5c4643ce01a9b304087712e3e08013b4.ssl.cf1.rackcdn.com/835057/1/check/openstack-tox-docs/8c5e9c0/job-output.txt | 21:36 |
priteau | I can actually reproduce it locally, it doesn't affect master or stable/xena | 21:40 |
corvus | the ara package has a similar error: cannot import name 'Markup' from 'jinja2' | 21:43 |
priteau | Workaround is to pin Jinja2<3.1.0 | 21:45 |
priteau | If you use Jinja2 directly, the proper fix is to update usage, see https://jinja.palletsprojects.com/en/3.1.x/changes/ | 21:45 |
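For context, the error corvus quoted comes from Jinja2 3.1.0 dropping its long-deprecated re-export of `Markup`; per the linked changelog, the usage update is to import it from markupsafe instead, roughly:

```python
# Jinja2 < 3.1.0 re-exported Markup (and escape) from markupsafe, so code
# doing "from jinja2 import Markup" now fails with the ImportError above.
# The forward-compatible import comes straight from markupsafe:
from markupsafe import Markup

safe = Markup("<em>already-escaped markup</em>")
print(safe)  # rendered as-is, without further escaping
```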
corvus | ah looks like they released 3.1.0 today | 21:46 |
priteau | Anyway, I can work around it in blazar, but I am curious why tox -e docs is installing blazar without using u-c? | 21:46 |
priteau | It is the develop-inst step which does this | 21:52 |
priteau | Sorry for thinking out loud, I found the solution | 22:02 |
priteau | docs tox env should be using `skip_install = True` | 22:02 |
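A sketch of what that might look like in a tox.ini docs environment; this follows the usual OpenStack pattern rather than blazar's actual file, and the constraints URL and requirements path are illustrative assumptions:

```ini
# Illustrative docs env: skip_install avoids the unconstrained develop-inst
# of the project itself, so only the constrained deps below get installed.
[testenv:docs]
skip_install = True
deps =
  -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/wallaby}
  -r{toxinidir}/doc/requirements.txt
commands = sphinx-build -W -b html doc/source doc/build/html
```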
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 22:30 |
fungi | setuptools 61.0.0 was released roughly 2 hours ago. i haven't looked through the changelog yet | 22:49 |
corvus | i'm going to start a rolling restart of zuul | 23:14 |
corvus | first 6 executors are gracefully stopping | 23:18 |
fungi | thanks! i'm around if i can be of help | 23:21 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 23:29 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf https://review.opendev.org/c/zuul/zuul-jobs/+/835162 | 23:29 |
corvus | fungi: the executor restart is running in screen on bridge; i think whenever they're all back online we can do the schedulers/web | 23:41 |
fungi | ahh, awesome, connecting | 23:43 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf https://review.opendev.org/c/zuul/zuul-jobs/+/835162 | 23:47 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 23:47 |
fungi | looks like there's a second screen session with the results of a merger restart in it. i'll leave it for now in case someone's still doing something with that | 23:48 |