opendevreview | Merged openstack/project-config master: Add puppet-manila-core https://review.opendev.org/c/openstack/project-config/+/834318 | 02:31 |
*** ysandeep|out is now known as ysandeep | 06:45 | |
*** pojadhav- is now known as pojadhav|rover | 08:28 | |
*** jpena|off is now known as jpena | 08:38 | |
*** ysandeep is now known as ysandeep|afk | 08:45 | |
*** ysandeep|afk is now known as ysandeep | 09:04 | |
*** pojadhav- is now known as pojadhav|rover | 09:16 | |
opendevreview | Michal Nasiadka proposed opendev/irc-meetings master: kolla: move meeting one hour backward (DST) https://review.opendev.org/c/opendev/irc-meetings/+/835020 | 09:44 |
*** rlandy|out is now known as rlandy | 10:21 | |
*** ysandeep is now known as ysandeep|afk | 11:10 | |
*** ysandeep|afk is now known as ysandeep | 11:14 | |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:21 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:26 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 11:29 |
opendevreview | Merged opendev/irc-meetings master: kolla: move meeting one hour backward (DST) https://review.opendev.org/c/opendev/irc-meetings/+/835020 | 11:52 |
opendevreview | Simon Westphahl proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 12:25 |
*** ysandeep is now known as ysandeep|afk | 12:58 | |
frickler | infra-root: regarding iweb, I checked the "nodepool list" output and there is a high number of nodes in a deleting state there, and checking some of those I found more nodes with an uptime >10d. those don't necessarily correlate to the IP conflicts, but still, maybe we just want to shut that cloud down right now instead of doing further debugging? | 13:31 |
fungi | nodes stuck in a deleting state simply subtract from the available quota and cause the launchers to make additional delete calls | 13:32 |
fungi | it would be nice if we could safely ride it out through next wednesday so we have that additional capacity for openstack yoga release day | 13:33 |
*** ysandeep|afk is now known as ysandeep | 13:40 | |
*** pojadhav|rover is now known as pojadhav|afk | 14:12 | |
frickler | I think "safely" is the key point here. iiuc there is noone left operating that cloud, without the stuck nodes we have no control about how many further IP conflicts may happen. and the risk of jobs falsely passing because they run on the wrong node type can't be excluded, which I consider pretty dangerous in particular for release day | 14:20 |
frickler | *without cleaning up the stuck nodes | 14:20 |
fungi | yep, i guess we can take a closer look at our current utilization as a projection for next week, and if we decide we're unlikely to be overloaded then just plan for the release to take longer to run all the jobs it will need | 14:21 |
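A minimal sketch of the check frickler describes above, assuming the `nodepool list` CLI is available on the launcher and that its tabular output contains the node state as plain text; the parsing below is an illustrative assumption, not a documented output format.

```python
# Rough sketch only: shell out to "nodepool list" and flag nodes whose row
# mentions the deleting state. Assumes the nodepool CLI is on PATH and that
# the state appears verbatim in its tabular output.
import subprocess

def stuck_deleting_nodes():
    output = subprocess.run(
        ["nodepool", "list"], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in output.splitlines() if "deleting" in line]

if __name__ == "__main__":
    rows = stuck_deleting_nodes()
    print(f"{len(rows)} node rows currently report a deleting state")
    for row in rows:
        print(row)
```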
*** Guest86 is now known as diablo_rojo_phone | 14:53 | |
mgagne | frickler: fungi: I can take a look | 14:59 |
fungi | thanks mgagne! didn't know you still had access | 15:01 |
mgagne | I 100% have access and our team "owns" the infra. | 15:02 |
fungi | to summarize, there are a bunch of undeletable nodes, and also some which nova seems to have lost track of so they're still up but their addresses are also getting assigned to newly-booted nodes | 15:02 |
fungi | the undeletable nodes are mainly just impacting available capacity slightly; it's the rogue servers that are presenting more of a problem, since we're getting some builds which end up connecting to them and either run on the wrong, already-dirtied node or fail with host key mismatches | 15:04 |
mgagne | I'll take care of the undeletable nodes. I'll take a look at the rogue vms too. | 15:07 |
fungi | thanks so much! | 15:10 |
frickler | mgagne: that's great, if you let me know when done, I can double check the nodepool list. did you see the reference to the two rogue nodes we powered off in #openstack-infra? | 15:17 |
mgagne | Was an IP or UUID provided? I guess I missed it in the backlog. | 15:18 |
frickler | mgagne: I can get the IPs for you, one moment | 15:18 |
frickler | mgagne: ubuntu-focal-iweb-mtl01-0028848617 198.72.124.130 and ubuntu-focal-iweb-mtl01-0028797665 198.72.124.111 | 15:21 |
fungi | those are the ones we're aware of so far, anyway. there may be others we haven't spotted | 15:22 |
mgagne | thanks, looking into it | 15:22 |
mgagne | I'm not able to find any "rogue" VMs, everything matches between Nova and the compute nodes. | 15:34 |
fungi | maybe they got cleaned up after we logged in and issued shutdown commands to stop them from interfering with new nodes | 15:56 |
frickler | yeah, that's likely. or you didn't find them because they are shut down? anyway, we can ping you once we find another instance | 15:58 |
frickler | fungi: another idea I just had: nodepool could possibly detect such a situation by checking for the correct hostname to be seen when it confirms ssh connectivity, do you think that would be worthwhile to add as a feature? | 15:59 |
fungi | frickler: i don't think nodepool currently logs into the nodes it boots, just checks to see that ssh is responding? | 16:00 |
frickler | fungi: it should at least verify that ssh access is working, shouldn't it? but I must admit I'm not sure about the details either | 16:01 |
fungi | also that doesn't really address the issue, since the situation you get into generally is an arp conflict, where 50% of connections go to one node and 50% to the other depending on timing and the state of the gateway's arp table | 16:01 |
fungi | so it would at best discover 50% of them | 16:02 |
frickler | depends on timing I guess, if the older node is faster at responding, it could win 100% of the time. at least with the current incident I've always been connected to the older node only | 16:03 |
frickler | maybe the better option would be for nova to finally learn to allow ssh hostkey verification to happen via the API somehow. again I'm not sure how that would be possible, but it sure would be a cool feature | 16:06 |
corvus | nodepool does not log into the node. but zuul does, so if you wanted to check that, you could do so in zuul. | 16:06 |
corvus | (by "in zuul" i mean in a pre-run playbook in a job) | 16:07 |
fungi | right, we have some similar checks which happen in our base job already, so could extend that | 16:27 |
fungi | but still, it'll only catch situations where that playbook happens to hit the "wrong" server, so won't catch them all | 16:27 |
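As a rough illustration of the check discussed above: the real thing would be an Ansible task in the base job's pre-run playbook, as corvus suggests, but the same idea expressed as illustrative Python with paramiko looks roughly like the sketch below, where the username, key path and example node details are placeholder assumptions.

```python
# Illustrative only: verify that the host we reach over SSH is really the
# node nodepool thinks it booted, by comparing the reported hostname with
# the expected nodepool node name. In OpenDev this check would live in a
# pre-run playbook; the connection details below are placeholders.
import paramiko

def hostname_matches(ip, expected_name, username="zuul",
                     key_filename="/var/lib/zuul/ssh/id_rsa"):
    client = paramiko.SSHClient()
    # The rogue-node scenario shows up as a host key surprise, so a real
    # check would also compare against the key recorded at boot instead of
    # blindly accepting whatever key the host presents.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username=username, key_filename=key_filename)
    try:
        _, stdout, _ = client.exec_command("hostname")
        actual = stdout.read().decode().strip()
    finally:
        client.close()
    return actual == expected_name

# Example using the node frickler mentioned earlier:
# hostname_matches("198.72.124.130", "ubuntu-focal-iweb-mtl01-0028848617")
```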
dpawlik | fungi, clarkb: hey, I added a few visualizations in OpenSearch. Right now they are just simple visualizations, but maybe they will be helpful for someone. I also added some new logs to be pushed: https://review.opendev.org/c/openstack/ci-log-processing/+/833624/12/logscraper/config.yaml.sample . Others will be added soon! | 16:35 |
*** ysandeep is now known as ysandeep|out | 16:36 | |
dpawlik | fungi: I will add the whitespace changes tomorrow to the PS https://review.opendev.org/c/opendev/system-config/+/833264 . Need to go now (just leaving a message because I saw a ping on the openstack-tc channel) | 16:38 |
*** rlandy is now known as rlandy|PTO | 16:46 | |
*** jpena is now known as jpena|off | 17:28 | |
opendevreview | Merged openstack/project-config master: Add Istio app to StarlingX https://review.opendev.org/c/openstack/project-config/+/834896 | 19:43 |
*** diablo_rojo_phone is now known as Guest250 | 20:16 | |
*** dviroel is now known as dviroel|pto | 20:45 | |
priteau | Good evening. I am investigating a doc job failure in blazar, seen in stable/wallaby but I assume master is affected too. The failure is caused by the latest Jinja2 3.1.0 being installed, which is incompatible with Sphinx 3.5.2 from wallaby upper constraints. | 21:34 |
priteau | Actually master may not be affected, since the Sphinx u-c is different | 21:35 |
priteau | Jinja2 comes from Flask, which is a blazar requirement; I am wondering what could be causing it to be installed at that version even with upper constraints in use? | 21:36 |
priteau | Failure log: https://8abbfecd00d9996fdb0c-5c4643ce01a9b304087712e3e08013b4.ssl.cf1.rackcdn.com/835057/1/check/openstack-tox-docs/8c5e9c0/job-output.txt | 21:36 |
priteau | I can actually reproduce it locally, it doesn't affect master or stable/xena | 21:40 |
corvus | the ara package has a similar error: cannot import name 'Markup' from 'jinja2' | 21:43 |
priteau | Workaround is to pin Jinja2<3.1.0 | 21:45 |
priteau | If you use Jinja2 directly, the proper fix is to update usage, see https://jinja.palletsprojects.com/en/3.1.x/changes/ | 21:45 |
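For context, the error corvus quoted comes from Jinja2 3.1.0 dropping its long-deprecated re-export of `Markup`; per the linked changelog, the usage update is to import it from markupsafe instead, roughly:

```python
# Jinja2 < 3.1.0 re-exported Markup (and escape) from markupsafe, so code
# doing "from jinja2 import Markup" now fails with the ImportError above.
# The forward-compatible import comes straight from markupsafe:
from markupsafe import Markup

safe = Markup("<em>already-escaped markup</em>")
print(safe)  # rendered as-is, without further escaping
```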
corvus | ah looks like they released 3.1.0 today | 21:46 |
priteau | Anyway, I can work around it in blazar, but I am curious why tox -e docs is installing blazar without using u-c? | 21:46 |
priteau | It is the develop-inst step which does this | 21:52 |
priteau | Sorry for thinking out loud, I found the solution | 22:02 |
priteau | docs tox env should be using `skip_install = True` | 22:02 |
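A sketch of what that might look like in a tox.ini docs environment; this follows the usual OpenStack pattern rather than blazar's actual file, and the constraints URL and requirements path are illustrative assumptions:

```ini
# Illustrative docs env: skip_install avoids the unconstrained develop-inst
# of the project itself, so only the constrained deps below get installed.
[testenv:docs]
skip_install = True
deps =
  -c{env:TOX_CONSTRAINTS_FILE:https://releases.openstack.org/constraints/upper/wallaby}
  -r{toxinidir}/doc/requirements.txt
commands = sphinx-build -W -b html doc/source doc/build/html
```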
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 22:30 |
fungi | setuptools 61.0.0 was released roughly 2 hours ago. i haven't looked through the changelog yet | 22:49 |
corvus | i'm going to start a rolling restart of zuul | 23:14 |
corvus | first 6 executors are gracefully stopping | 23:18 |
fungi | thanks! i'm around if i can be of help | 23:21 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 23:29 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf https://review.opendev.org/c/zuul/zuul-jobs/+/835162 | 23:29 |
corvus | fungi: the executor restart is running in screen on bridge; i think whenever they're all back online we can do the schedulers/web | 23:41 |
fungi | ahh, awesome, connecting | 23:43 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: ensure-kubernetes: fix missing 02-crio.conf https://review.opendev.org/c/zuul/zuul-jobs/+/835162 | 23:47 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: run-buildset-registry: Drop extra install packages task https://review.opendev.org/c/zuul/zuul-jobs/+/835156 | 23:47 |
fungi | looks like there's a second screen session with the results of a merger restart in it. i'll leave it for now in case someone's still doing something with that | 23:48 |