Friday, 2023-03-03

opendevreview	Merged openstack/nova master: Add service version for Antelope https://review.opendev.org/c/openstack/nova/+/874932	06:15
opendevreview	Jorge San Emeterio proposed openstack/nova master: Have host look for CPU controller of cgroupsv2 location. https://review.opendev.org/c/openstack/nova/+/873127	08:01
bauzas	looks like I shouldn't gamble with poker	08:41
bauzas	all my rechecks were failing, while the new ones last night (thanks dansmith) eventually got merged	08:42
opendevreview	Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339	09:30
admin1	hi all .. what exactly does this mean ( which came during yoga -> zed upgrade) and now nova is down -- https://gist.githubusercontent.com/a1git/6f25cfb53feb2cb3b6d122da5664b462/raw/5bb0563f46c05bcea47fbc2a04f607f1f045f4ab/gistfile1.txt	09:42
admin1	Details: Current Nova version does not support computes older than Yoga but the minimum compute service level in your system is 60 and the oldest supported service level is 61	09:45
bauzas	admin1: are you sure all your computes are supporting at least Yoga ?	10:08
admin1	yes	10:08
admin1	those were upgraded 3 days prior and i had a canary deployed in all of them	10:08
bauzas	something detected a compute having a 60 service version	10:08
admin1	now yoga -> zed, ( using openstack ansible) it just failed	10:08
bauzas	admin1: can you please create a bug report ?	10:09
bauzas	I'll try to look at it	10:09
admin1	bauzas, is it possible to somehow bypass or disable this check for a bit ?	10:09
bauzas	indeed, sec	10:09
bauzas	admin1: https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.disable_compute_service_check_for_ffu	10:11
bauzas	but you'll need to find which compute is still using the older service version	10:12
bauzas	maybe some of them weren't restarted	10:12
bauzas	(after upgrading)	10:12
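The option bauzas links sits in the [workarounds] group of nova.conf. As a minimal sketch, assuming a hand-written config fragment and using Python's stdlib configparser as a stand-in for nova's real loader (oslo.config), the following checks whether the workaround is switched on. The helper name ffu_check_disabled and the fragment itself are hypothetical; only the option name comes from the linked documentation.

```python
import configparser

# Hypothetical nova.conf fragment; only the group and option name
# are taken from the documentation linked above.
NOVA_CONF = """
[workarounds]
disable_compute_service_check_for_ffu = True
"""


def ffu_check_disabled(conf_text: str) -> bool:
    """Return True if the workaround option is enabled in conf_text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=False covers both a missing section and a missing option.
    return parser.getboolean(
        "workarounds",
        "disable_compute_service_check_for_ffu",
        fallback=False,
    )


if __name__ == "__main__":
    print(ffu_check_disabled(NOVA_CONF))
```

As the discussion notes, this only bypasses the check; the compute still reporting the older service version has to be found and fixed.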
admin1	output of select host,version from nova.services => https://gist.githubusercontent.com/a1git/13ceb2e181dab9532a5b229b2915b478/raw/eee560226905d21159808f36a038c9201d8e2d23/gistfile1.txt	10:14
admin1	i think some are affected	10:14
bauzas	admin1: what's strange is that 60 is an interim service version	10:18
bauzas	did you use a milestone or something for some computes ?	10:19
bauzas	https://github.com/openstack/nova/blob/master/nova/objects/service.py#L260-L261 Xena is 57 and Yoga is 61	10:19
bauzas	https://github.com/openstack/nova/blob/master/nova/objects/service.py#L212-L215 and that's what was changed by the RPC version for the 60 service version	10:20
bauzas	unless you deploy with master, of course	10:20
admin1	bauzas, so the servers that are 60 were not online ..	10:25
admin1	they were off temporarily	10:25
admin1	but this blocked the whole upgrade process	10:25
bauzas	the compute state isn't and shouldn't be checked for safety reasons	10:26
admin1	it did .. temporarily what i did was update nova.services set version=61 where version=60 and trying to run the playbook again	10:26
admin1	if it works, then i am all good .. else i have to report it here again	10:27
admin1	if this works, then i can open a bug report saying unavailable compute node blocked upgrade	10:27
bauzas	https://docs.openstack.org/nova/latest/cli/nova-status.html#nova-status-checks helps to test your upgrade	10:27
bauzas	admin1: again, that's by design that we don't allow non-upgraded compute to be left registered	10:28
bauzas	admin1: and that's why we have the workaround option for that intent	10:28
bauzas	https://github.com/openstack/nova/blob/master/nova/cmd/status.py#L251	10:29
bauzas	which exactly tests the support contract before you upgrade https://github.com/openstack/nova/blob/59f7a524fd4ded3c17b10abcedb0baff769c3a8a/nova/utils.py#L1052	10:30
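The idea behind the check can be sketched in a few lines. This is an illustration only, not nova's implementation: it assumes rows shaped like the `select host,version from nova.services` output above, already filtered to compute services, and the host names and the function find_outdated_computes are made up. The version numbers (Xena 57, Yoga 61) are the ones quoted in the discussion.

```python
from typing import Iterable, List, Tuple

# Service versions quoted in the discussion: Xena is 57, Yoga is 61.
OLDEST_SUPPORTED_SERVICE_VERSION = 61


def find_outdated_computes(
    rows: Iterable[Tuple[str, int]],
    oldest_supported: int = OLDEST_SUPPORTED_SERVICE_VERSION,
) -> List[Tuple[str, int]]:
    """Return (host, version) pairs below the oldest supported version.

    Like the check described above, this ignores whether a service is
    currently up or down: a registered record is enough to count.
    """
    return sorted(
        (host, version) for host, version in rows if version < oldest_supported
    )


# Made-up rows; real ones would come from the services table.
SAMPLE_ROWS = [("compute01", 61), ("compute02", 60), ("compute03", 61)]

if __name__ == "__main__":
    for host, version in find_outdated_computes(SAMPLE_ROWS):
        print(f"{host}: version {version} < {OLDEST_SUPPORTED_SERVICE_VERSION}")
```

A single stale record is enough to trip the check, which matches what admin1 saw with powered-off hosts.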
sean-k-mooney	admin1: having the server be off is expected to block the upgrade process	11:07
sean-k-mooney	that would not be a bug	11:08
sean-k-mooney	because if the service version is 60 it means they never started with the official yoga release (61)	11:08
opendevreview	Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339	11:09
sean-k-mooney	if you had started them with yoga and then they were stopped it would not cause this issue	11:09
admin1	i think they were off 2 days before when i did wallaby -> yoga	11:09
admin1	but wallaby -> yoga did not complain about this .. was this check added in zed ?	11:09
sean-k-mooney	this was always a requirement and we decided to start enforcing it in yoga because of operators violating the upgrade contract	11:10
sean-k-mooney	and filing bugs :)	11:10
admin1	:D	11:10
sean-k-mooney	nova before 2023.1/2024.1 only allows n to n+1 upgrades	11:11
sean-k-mooney	we put the workaround option in place as an escape hatch	11:11
sean-k-mooney	so if you want to run nova in an unsupported state you can but it should never be required if you are upgrading within the upgrade contract	11:12
admin1	i understand now .. will make sure no computes are down next time we upgrade	11:13
sean-k-mooney	provided we do not do an rpc bump it's generally possible for > n->n+1 to function but the first time we tested that was yoga to antelope (2023.1) as a dry run for 2023.1->2024.1	11:15
sean-k-mooney	we will be officially testing that going forward, in case you are not aware of this change https://governance.openstack.org/tc/resolutions/20220210-release-cadence-adjustment.html	11:16
opendevreview	Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339	11:19
dansmith	bauzas: \o/	14:24
dansmith	bauzas: are you cooking up the rc1 patch?	14:33
bauzas	I was waiting for the revert to arrive	14:34
dansmith	ah okay, I see it's close	14:34
dansmith	sorry I didn't recheck that because I thought we were punting	14:34
bauzas	shit no	14:34
bauzas	https://zuul.openstack.org/status#873584	14:34
dansmith	oh yep	14:34
bauzas	post_failure	14:34
bauzas	ok, so I'll skip it	14:35
bauzas	gibi: sean-k-mooney: for your information, I'm gonna branch RC1 without the logging revert	14:35
bauzas	we'll backport the revert later after GA	14:35
bauzas	hmmmm	14:36
bauzas	dansmith: actually, it looks like the release team is granting us a grace period for branching RC1	14:37
bauzas	(they're on meeting now)	14:37
dansmith	okay	14:37
gibi	the revert is in the gate queue now	14:38
sean-k-mooney	ok we modified it to be safe in production even so having it in RC1 is not terrible	14:38
bauzas	gibi: yup, but failing	14:38
gibi	so if we are lucky it might merge today	14:38
gibi	ohh	14:38
gibi	sh*t	14:38
bauzas	we were so close	14:38
sean-k-mooney	but we can ask them to requeue it	14:38
bauzas	I'll claim for a RC1 patch on Monday	14:38
sean-k-mooney	if there is a long delay	14:38
bauzas	and I'll recheck this revert within the next 4 mins	14:38
dansmith	18 other things in the gate right now	14:39
gibi	bauzas: I can shepherd the patch during Saturday and a bit on Sunday as well.	14:39
dansmith	so it'll be a bit if it re-runs, but it's also not a critical patch	14:39
sean-k-mooney	post_failure from nova-next. unfortunate	14:40
bauzas	dansmith: I don't disagree	14:40
bauzas	but it will be a bit of a pain to backport the revert if we go	14:40
bauzas	if the release team says they're OK with releasing on Monday, then meh, we gonna try this weekend	14:40
bauzas	gibi: last time you were way luckier than me	14:41
sean-k-mooney	we technically didn't run out of memory but it got pretty close memory_tracker low_point: 730	14:42
sean-k-mooney	* memory_tracker low_point: 7308	14:43
dansmith	sean-k-mooney: has nova-next been OOMing?	14:43
sean-k-mooney	MemAvailable: 9152 kB	14:43
sean-k-mooney	i think it's been surviving because of swap	14:43
sean-k-mooney	Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapTotal: 4194300 kB	14:44
sean-k-mooney	Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapFree: 0 kB	14:44
sean-k-mooney	https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#3011	14:44
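The two memory_tracker lines quoted above are enough to work out how full swap was. A small sketch, assuming the log keeps the `Name: value kB` layout shown in the paste; the function name swap_used_fraction is made up for illustration.

```python
import re

# The two lines pasted in the discussion above.
SAMPLE = """\
Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapTotal: 4194300 kB
Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapFree: 0 kB
"""

FIELD_RE = re.compile(r"\b(SwapTotal|SwapFree):\s+(\d+) kB")


def swap_used_fraction(log_text: str) -> float:
    """Return used swap as a fraction of total swap (0.0 to 1.0).

    If the text holds several samples, the last value of each field wins.
    """
    fields = {name: int(value) for name, value in FIELD_RE.findall(log_text)}
    total = fields["SwapTotal"]
    if total == 0:
        return 0.0
    return (total - fields["SwapFree"]) / total


if __name__ == "__main__":
    print(f"swap used: {swap_used_fraction(SAMPLE):.0%}")
```

With SwapFree at 0 kB the whole 4 GB of swap was in use at that sample, which fits the picture of the node surviving only on swap.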
sean-k-mooney	dansmith: are you wondering if this is related to the mariadb tweaks ye did	14:44
sean-k-mooney	keystone was giving 503s	14:45
dansmith	sean-k-mooney: those tweaks are disabled by default in devstack right now	14:45
dansmith	I'm just saying if we're memory constrained on that job, we might want to enable those tweaks	14:45
sean-k-mooney	and when i see that it's often because of the db/service getting oom killed	14:45
dansmith	it seems to have done well for the ceph one	14:46
sean-k-mooney	yep	14:46
sean-k-mooney	im just looking to see if i can confirm that in the logs	14:46
sean-k-mooney	but that is why i was checking the memory tracker, i think we are running very close to out of memory if we have not hit it	14:46
dansmith	on the ceph job my tweaks dropped mysql to half of what it was using (~800m to ~400m)	14:47
sean-k-mooney	Mar 03 13:53:09 np0033355853 kernel: sshd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0	14:47
bauzas	https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b	14:47
dansmith	but, less memory usage could impair performance and make other things worse of course	14:47
sean-k-mooney	so yes we are	14:47
sean-k-mooney	https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/syslog.txt#5871	14:47
bauzas	we had two problems	14:47
bauzas	an unresponsive API	14:48
bauzas	and some leaked allocs	14:48
sean-k-mooney	the api issues are because mysql got killed	14:48
sean-k-mooney	Mar 03 13:53:09 np0033355853 kernel: Out of memory: Killed process 47910 (mysqld) total-vm:5223564kB, anon-rss:328112kB, file-rss:0kB, shmem-rss:0kB, UID:116 pgtables:2648kB oom_score_adj:0	14:48
dansmith	yeah that's usually how it works	14:49
dansmith	mysql is killed and then we stop being able to talk to keystone (et al)	14:49
sean-k-mooney	yep	14:49
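Finding the victim of an OOM kill in a syslog can be scripted. The sketch below parses the kernel line pasted above; the regular expression is written against that one sample, so treat the field layout as an assumption for other kernels, and the function name parse_oom_kill as illustrative.

```python
import re
from typing import Dict, Optional, Union

# Kernel line quoted in the discussion above.
LINE = (
    "Mar 03 13:53:09 np0033355853 kernel: Out of memory: Killed process "
    "47910 (mysqld) total-vm:5223564kB, anon-rss:328112kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:116 pgtables:2648kB oom_score_adj:0"
)

OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line: str) -> Optional[Dict[str, Union[int, str]]]:
    """Extract pid, process name and memory figures, or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kb": int(match["total_vm_kb"]),
        "anon_rss_kb": int(match["anon_rss_kb"]),
    }


if __name__ == "__main__":
    print(parse_oom_kill(LINE))
```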
sean-k-mooney	so 1 we should enable that devstack feature for nova-next 2 we should consider doing it by default	14:50
sean-k-mooney	assuming it does not regress overall job time too much	14:50
dansmith	we were going to do it by default after everyone is branched, to see if it is reasonable across the board, but not to break anyone before release	14:50
sean-k-mooney	* default in jobs	14:50
sean-k-mooney	not sure about defaulting in devstack	14:50
sean-k-mooney	ack	14:50
dansmith	we have it enabled in some jobs, so we should do that for -next if it can't make it worse	14:51
dansmith	and then maybe we'll get the defaulting in a month or so	14:51
dansmith	I can post a patch for -next	14:51
sean-k-mooney	sounds good	14:51
bauzas	cool	14:51
sean-k-mooney	if we are swapping that hard it's going to really slow down the job too	14:52
sean-k-mooney	is the memory reduction much overall	14:52
dansmith	like I said above, about 800m to 400m rss for mysql	14:53
bauzas	that's not big	14:53
bauzas	mysqld seems to be a canary	14:53
sean-k-mooney	400m is a lot when we only have 8192mb of ram in the vms	14:54
opendevreview	Dan Smith proposed openstack/nova master: Make nova-next reduce mysql memory https://review.opendev.org/c/openstack/nova/+/876391	14:54
dansmith	yeah, it's a lot :)	14:54
sean-k-mooney	its like 5%	14:54
dansmith	it's the single biggest user	14:54
dansmith	and it puts it down closer to many of the other users like rabbit	14:55
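The "like 5%" figure follows directly from the numbers mentioned: roughly 400 MB saved on a VM with 8192 MB of RAM. A quick check, with share_of_ram as a made-up helper name:

```python
def share_of_ram(saved_mb: float, total_mb: float) -> float:
    """Return saved memory as a percentage of total RAM."""
    return 100.0 * saved_mb / total_mb


if __name__ == "__main__":
    # ~400 MB saved (mysql going from ~800m to ~400m) on an 8192 MB VM.
    print(f"{share_of_ram(400, 8192):.1f}% of RAM")
```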
sean-k-mooney	i just checked and we are already limiting nova to 2 workers as well so we likely won't get much from limiting that more	14:55
dansmith	yeah	14:55
bauzas	true but neutron-api takes its own big piece of cake https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#1993	14:56
sean-k-mooney	sure but it looks like they are also limiting to two workers	14:58
sean-k-mooney	and the reduction in mysql is around the same as both of them combined	14:58
dansmith	bauzas: if you can find any other 50% reductions and 400m of free ram, please do let me know :)	14:58
sean-k-mooney	also the neutron server acts both as the api and conductor for neutron	14:58
sean-k-mooney	and scheduler	14:58
sean-k-mooney	as in it implements everything the controller would do for neutron	14:59
bauzas	dansmith: fwiw, I +2d your patch,	14:59
bauzas	so I'm not debating it :)	14:59
dansmith	I suspect other gains to be had will be much smaller and much harder to enact :)	14:59
bauzas	true	14:59
dansmith	bauzas: I know, you're for it, you're just not impressed, I get it	14:59
dansmith	I'll just go cry in the corner	14:59
bauzas	:)	14:59
sean-k-mooney	i sent it to the ci	14:59
bauzas	I wish I had a magic wand	14:59
sean-k-mooney	what i have wanted to try for a while is enabling zswap	15:00
bauzas	Bibbidi-Bobbidi-Boo !	15:00
sean-k-mooney	that should speed up swap usage a bit and help a little with swap size too	15:00
bauzas	(shit, doesn't work)	15:00
dansmith	sean-k-mooney: yeah that might be a thing, but we're also legitimately timing out a lot of jobs, so I'm concerned about slowing anything down with a memory boost causing more of those	15:01
sean-k-mooney	https://www.omgubuntu.co.uk/2022/01/ubuntu-on-raspberry-pi-4-2gb-zswap	15:01
dansmith	if we're thrashing I think it will slow us down a lot, if we're stashing bloat we never reference, then it will help	15:01
* bauzas is currently investigating https://bugs.launchpad.net/tempest/+bug/1999893		15:01
sean-k-mooney	once we swap to 22.04 it will be better	15:01
bauzas	and I suspect this may be related	15:01
dansmith	sean-k-mooney: what will?	15:01
sean-k-mooney	dansmith: its much simpler to enable zswap in 22.04	15:02
dansmith	oh	15:02
sean-k-mooney	and they did some performance optimisations	15:02
sean-k-mooney	https://waldorf.waveform.org.uk/2021/6-months-with-the-pi-desktop.html	15:02
dansmith	but still, you're trading cpu for memory	15:02
sean-k-mooney	cpu vs disk io performance really	15:03
dansmith	and right now in addition to running against the memory barrier, we're also running against the cpu barrier	15:03
sean-k-mooney	compression effectively makes it like swapping to faster storage	15:03
sean-k-mooney	if you have the cpu to do it	15:03
dansmith	sure, if you're io constrained it will help there, but it's all trading for cpu	15:03
sean-k-mooney	yep	15:04
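The trade-off being discussed (spend CPU on compression to shrink what has to be written to swap) can be illustrated in userspace. This is only an analogy, not how zswap works internally: it uses Python's zlib as a stand-in compressor on synthetic, highly repetitive 4 KiB pages, and the name compress_pages is made up. Real memory pages will compress less well than this sample.

```python
import time
import zlib
from typing import List, Tuple

PAGE_SIZE = 4096  # bytes, a common page size


def compress_pages(pages: List[bytes], level: int = 1) -> Tuple[float, float]:
    """Compress each page; return (compression ratio, CPU seconds spent)."""
    start = time.process_time()
    compressed = [zlib.compress(page, level) for page in pages]
    cpu_seconds = time.process_time() - start
    raw_bytes = sum(len(page) for page in pages)
    packed_bytes = sum(len(blob) for blob in compressed)
    return raw_bytes / packed_bytes, cpu_seconds


# Synthetic pages: repetitive text, truncated to exactly one page each.
PAGES = [((b"instance-%04d " % i) * 400)[:PAGE_SIZE] for i in range(256)]

if __name__ == "__main__":
    ratio, cpu_seconds = compress_pages(PAGES)
    print(f"ratio {ratio:.1f}x for {cpu_seconds * 1000:.2f} ms of CPU")
```

The ratio is the memory or IO saved; the CPU time is the price, which is the concern dansmith raises for jobs that are already CPU bound.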
sean-k-mooney	i looked into this a while ago when those blogs first came out	15:04
sean-k-mooney	it's probably worth doing an experiment with at least	15:04
dansmith	yeah, any shorter but fatter jobs will benefit from it for sure, and it could reduce our own noisy neighbor effects	15:05
sean-k-mooney	honestly just moving to 16G vms in ci would be the better solution	15:05
sean-k-mooney	but since that is not really an option	15:05
dansmith	I'm just worried about slowing the tests down at all because we're already hitting widespread legit timeouts	15:05
dansmith	on slower nodes	15:05
sean-k-mooney	ya which is a concern	15:05
sean-k-mooney	wow gerrit found https://review.opendev.org/c/openstack/devstack/+/828639 because i mentioned zswap in my last comment on the top level	15:08
* sean-k-mooney was hoping i already coded that up		15:08
* bauzas disappears for taxi reasons		15:16
* bauzas is back		15:46
darkhorse	Hi team - I read about service tokens in cinder documentation which is really cool. Can service tokens be used in nova and other services too?	16:46
dansmith	wow, the nova-next mysql patch passed check the first time through	17:08
dansmith	bauzas: sean-k-mooney ^	17:08
dansmith	melwitt: is this something you saw in the gate? https://review.opendev.org/c/openstack/nova/+/875991	17:39
dansmith	I don't see that error message in opensearch	17:39
