opendevreview | Merged openstack/nova master: Add service version for Antelope https://review.opendev.org/c/openstack/nova/+/874932 | 06:15 |
---|---|---|
opendevreview | Jorge San Emeterio proposed openstack/nova master: Have host look for CPU controller of cgroupsv2 location. https://review.opendev.org/c/openstack/nova/+/873127 | 08:01 |
bauzas | looks like I shouldn't gamble with poker | 08:41 |
bauzas | all my rechecks were failing, while the new ones this night (thanks dansmith) got eventually merged | 08:42 |
opendevreview | Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339 | 09:30 |
admin1 | hi all .. what exactly does this mean ( which came during yoga -> zed upgrade) and now nova is down -- https://gist.githubusercontent.com/a1git/6f25cfb53feb2cb3b6d122da5664b462/raw/5bb0563f46c05bcea47fbc2a04f607f1f045f4ab/gistfile1.txt | 09:42 |
admin1 | Details: Current Nova version does not support computes older than Yoga but the minimum compute service level in your system is 60 and the oldest supported service level is 61 | 09:45 |
bauzas | admin1: are you sure *all* your computes are supporting at least Yoga ? | 10:08 |
admin1 | yes | 10:08 |
admin1 | those were upgraded 3 days prior and i had a canary deployed in all of them | 10:08 |
bauzas | something detected a compute having a 60 service version | 10:08 |
admin1 | now yoga -> zed, ( using openstack ansible) it just failed | 10:08 |
bauzas | admin1: can you please create a bug report ? | 10:09 |
bauzas | I'll try to look at it | 10:09 |
admin1 | bauzas, is it possible to somehow bypass or disable this check for a bit ? | 10:09 |
bauzas | indeed, sec | 10:09 |
bauzas | admin1: https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.disable_compute_service_check_for_ffu | 10:11 |
bauzas | but you'll need to find which compute is still using the older service version | 10:12 |
bauzas | maybe some of them weren't restarted | 10:12 |
bauzas | (after upgrading) | 10:12 |
admin1 | output of select host,version from nova.services => https://gist.githubusercontent.com/a1git/13ceb2e181dab9532a5b229b2915b478/raw/eee560226905d21159808f36a038c9201d8e2d23/gistfile1.txt | 10:14 |
admin1 | i think some are affected | 10:14 |
bauzas | admin1: what's strange is that 60 is an interim service version | 10:18 |
bauzas | did you use a milestone or something for some computes ? | 10:19 |
bauzas | https://github.com/openstack/nova/blob/master/nova/objects/service.py#L260-L261 Xena is 57 and Yoga is 61 | 10:19 |
bauzas | https://github.com/openstack/nova/blob/master/nova/objects/service.py#L212-L215 and that's what was changed by the RPC version for the 60 service version | 10:20 |
bauzas | unless you deploy with master, of course | 10:20 |
admin1 | bauzas, so the servers that are 60 were not online .. | 10:25 |
admin1 | they were off temporarily | 10:25 |
admin1 | but this blocked the whole upgrade process | 10:25 |
bauzas | the compute state isn't and shouldn't be checked for safety reasons | 10:26 |
admin1 | it did .. temporarily what i did was update nova.services set version=61 where version=60 and trying to run the playbook again | 10:26 |
admin1 | if it works, then i am all good .. else i have to report it here again | 10:27 |
admin1 | if this works, then i can open a bug report saying unavailable compute node blocked upgrade | 10:27 |
bauzas | https://docs.openstack.org/nova/latest/cli/nova-status.html#nova-status-checks helps to test your upgrade | 10:27 |
bauzas | admin1: again, that's by design that we don't allow non-upgraded compute to be left registered | 10:28 |
bauzas | admin1: and that's why we have the workaround option for that intent | 10:28 |
bauzas | https://github.com/openstack/nova/blob/master/nova/cmd/status.py#L251 | 10:29 |
bauzas | which exactly tests the support contract *before* you upgrade https://github.com/openstack/nova/blob/59f7a524fd4ded3c17b10abcedb0baff769c3a8a/nova/utils.py#L1052 | 10:30 |
sean-k-mooney | admin1: having the server be off is expected to block the upgrade process | 11:07 |
sean-k-mooney | that would not be a bug | 11:08 |
sean-k-mooney | because if the service version is 60 it means they never started with the official yoga release (61) | 11:08 |
opendevreview | Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339 | 11:09 |
sean-k-mooney | if you had started them with yoga and then they were stopped it would not cause this issue | 11:09 |
admin1 | i think they were off 2 days before when i did wallaby -> yoga | 11:09 |
admin1 | but wallaby -> yoga did not complain about this .. was this check added in zed ? | 11:09 |
sean-k-mooney | this was always a requirement and we decided to start enforcing it in yoga because of operators violating the upgrade contract | 11:10 |
sean-k-mooney | and filing bugs :) | 11:10 |
admin1 | :D | 11:10 |
sean-k-mooney | nova before 2023.1/2024.1 only allows n to n+1 upgrades | 11:11 |
sean-k-mooney | we put the workaround option in place as an escape hatch | 11:11 |
sean-k-mooney | so if you want to run nova in an unsupported state you can but it should never be required if you are upgrading within the upgrade contract | 11:12 |
admin1 | i understand now .. will make sure no computes are down next time we upgrade | 11:13 |
sean-k-mooney | provided we do not do an rpc bump it's generally possible for > n->n+1 to function but the first time we tested that was yoga to antelope (2023.1) as a dry run for 2023.1->2024.1 | 11:15 |
sean-k-mooney | we will be officially testing that going forward in case you are not aware of this change https://governance.openstack.org/tc/resolutions/20220210-release-cadence-adjustment.html | 11:16 |
opendevreview | Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339 | 11:19 |
dansmith | bauzas: \o/ | 14:24 |
dansmith | bauzas: are you cooking up the rc1 patch? | 14:33 |
bauzas | I was waiting for the revert to arrive | 14:34 |
dansmith | ah okay, I see it's close | 14:34 |
dansmith | sorry I didn't recheck that because I thought we were punting | 14:34 |
bauzas | shit no | 14:34 |
bauzas | https://zuul.openstack.org/status#873584 | 14:34 |
dansmith | oh yep | 14:34 |
bauzas | post_failure | 14:34 |
bauzas | ok, so I'll skip it | 14:35 |
bauzas | gibi: sean-k-mooney: for your sake of knowledge, I'm gonna branch RC1 without the logging revert | 14:35 |
bauzas | we'll backport the revert later after GA | 14:35 |
bauzas | hmmmm | 14:36 |
bauzas | dansmith: actually, it looks like the release team agrees us some graceful extra period for branching RC1 | 14:37 |
bauzas | (they're on meeting now) | 14:37 |
dansmith | okay | 14:37 |
gibi | the revert is in the gate queue now | 14:38 |
sean-k-mooney | ok we modified it to be safe in production even so having it in RC1 is not terrible | 14:38 |
bauzas | gibi: yup, but failing | 14:38 |
gibi | so if we are lucky it might merge today | 14:38 |
gibi | ohh | 14:38 |
gibi | sh*t | 14:38 |
bauzas | we were so close | 14:38 |
sean-k-mooney | but we can ask them to requeue it | 14:38 |
bauzas | I'll claim for a RC1 patch on Monday | 14:38 |
sean-k-mooney | if there is a long delay | 14:38 |
bauzas | and I'll recheck this revert by the next 4 mins | 14:38 |
dansmith | 18 other things in the gate right now | 14:39 |
gibi | bauzas: I can shepherd the patch during Saturday and a bit on Sunday as well. | 14:39 |
dansmith | so it'll be a bit if it re-runs, but it's also not a critical patch | 14:39 |
sean-k-mooney | post_failure from nova next. unfortunate | 14:40 |
bauzas | dansmith: I don't disagree | 14:40 |
bauzas | but it will be a bit of a pain to backport the revert if we go | 14:40 |
bauzas | if the release team says they're OK with releasing on Monday, then meh, we gonna try this weekend | 14:40 |
bauzas | gibi: last time you were way luckier than me | 14:41 |
sean-k-mooney | we technically didn't run out of memory but it got pretty close memory_tracker low_point: 730 | 14:42 |
sean-k-mooney | * memory_tracker low_point: 7308 | 14:43 |
dansmith | sean-k-mooney: has nova-next been OOMing? | 14:43 |
sean-k-mooney | MemAvailable: 9152 kB | 14:43 |
sean-k-mooney | i think it's been surviving because of swap | 14:43 |
sean-k-mooney | Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapTotal: 4194300 kB | 14:44 |
sean-k-mooney | Mar 03 13:53:10.114492 np0033355853 memory_tracker.sh[131948]: SwapFree: 0 kB | 14:44 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#3011 | 14:44 |
sean-k-mooney | dansmith: are you wondering if this is related to the mariadb tweaks ye did | 14:44 |
sean-k-mooney | keystone was giving 503s | 14:45 |
dansmith | sean-k-mooney: those tweaks are disabled by default in devstack right now | 14:45 |
dansmith | I'm just saying if we're memory constrained on that job, we might want to enable those tweaks | 14:45 |
sean-k-mooney | and when i see that it's often because of the db/service getting oom killed | 14:45 |
dansmith | it seem to have done well for the ceph one | 14:46 |
sean-k-mooney | yep | 14:46 |
sean-k-mooney | i'm just looking to see if i can confirm that in the logs | 14:46 |
sean-k-mooney | but that is why i was checking the memory tracker i think we are running very close to out of memory if we have not hit it | 14:46 |
dansmith | on the ceph job my tweaks dropped mysql to half of what it was using (~800m to ~400m) | 14:47 |
sean-k-mooney | Mar 03 13:53:09 np0033355853 kernel: sshd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 | 14:47 |
bauzas | https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b | 14:47 |
dansmith | but, less memory usage could impair performance and make other things worse of course | 14:47 |
sean-k-mooney | so yes we are | 14:47 |
sean-k-mooney | https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/syslog.txt#5871 | 14:47 |
bauzas | we had two problems | 14:47 |
bauzas | a unresponsive API | 14:48 |
bauzas | and some leaked allocs | 14:48 |
sean-k-mooney | the api issue are because mysql got killed | 14:48 |
sean-k-mooney | Mar 03 13:53:09 np0033355853 kernel: Out of memory: Killed process 47910 (mysqld) total-vm:5223564kB, anon-rss:328112kB, file-rss:0kB, shmem-rss:0kB, UID:116 pgtables:2648kB oom_score_adj:0 | 14:48 |
dansmith | yeah that's usually how it works | 14:49 |
dansmith | mysql is killed and then we stop being able to talk to keystone (et al) | 14:49 |
sean-k-mooney | yep | 14:49 |
sean-k-mooney | so 1 we should enable that devstack feature for nova-next 2 we should consider doing it by default | 14:50 |
sean-k-mooney | assuming it does not regress over all job time too much | 14:50 |
dansmith | we were going to do it by default after everyone is branched, to see if it is reasonable across the board, but not to break anyone before release | 14:50 |
sean-k-mooney | * default in jobs | 14:50 |
sean-k-mooney | not sure about defaulting in devstack | 14:50 |
sean-k-mooney | ack | 14:50 |
dansmith | we have it enabled in some jobs, so we should do that for -next if it can't make it worse | 14:51 |
dansmith | and then maybe we'll get the defaulting in a month or so | 14:51 |
dansmith | I can post a patch for -next | 14:51 |
sean-k-mooney | sounds good | 14:51 |
bauzas | cool | 14:51 |
sean-k-mooney | if we are swapping that hard it's going to really slow down the job too | 14:52 |
sean-k-mooney | is the memory reduction much overall | 14:52 |
dansmith | like I said above, about 800m to 400m rss for mysql | 14:53 |
bauzas | that's not big | 14:53 |
bauzas | mysqld seems to be a canary | 14:53 |
sean-k-mooney | 400m is a lot when we only have 8192mb of ram in the vms | 14:54 |
opendevreview | Dan Smith proposed openstack/nova master: Make nova-next reduce mysql memory https://review.opendev.org/c/openstack/nova/+/876391 | 14:54 |
dansmith | yeah, it's a lot :) | 14:54 |
sean-k-mooney | its like 5% | 14:54 |
dansmith | it's the single biggest user | 14:54 |
dansmith | and it puts it down closer to many of the other users like rabbit | 14:55 |
sean-k-mooney | i just checked and we are already limiting nova to 2 workers as well so we likely won't get much from limiting that more | 14:55 |
dansmith | yeah | 14:55 |
bauzas | true but neutron-api takes its own big piece of cake https://zuul.opendev.org/t/openstack/build/ec34b5fa7a354e19a6919d167268cb8b/log/controller/logs/screen-memory_tracker.txt#1993 | 14:56 |
sean-k-mooney | sure but it looks like they are also limiting to two workers | 14:58 |
sean-k-mooney | and the reduction in mysql is around the same as both of them combined | 14:58 |
dansmith | bauzas: if you can find any other 50% reductions and 400m of free ram, please do let me know :) | 14:58 |
sean-k-mooney | also the neutron server acts both as the api and conductor for neutron | 14:58 |
sean-k-mooney | and scheduler | 14:58 |
sean-k-mooney | as in it implements everything the controller would do for neutron | 14:59 |
bauzas | dansmith: fwiw, I +2d your patch, | 14:59 |
bauzas | so I'm not debating it :) | 14:59 |
dansmith | I suspect other gains to be had will be much smaller and much harder to enact :) | 14:59 |
bauzas | true | 14:59 |
dansmith | bauzas: I know, you're for it, you're just not impressed, I get it | 14:59 |
dansmith | I'll just go cry in the corner | 14:59 |
bauzas | :) | 14:59 |
sean-k-mooney | i sent it to the ci | 14:59 |
bauzas | I wish I would have a magic wand | 14:59 |
sean-k-mooney | what i have wanted to try for a while is to enable zswap | 15:00 |
bauzas | Bibbidi-Bobbidi-Boo ! | 15:00 |
sean-k-mooney | that should speed up swap usage a bit and help a little with swap size too | 15:00 |
bauzas | (shit, doesn't work) | 15:00 |
dansmith | sean-k-mooney: yeah that might be a thing, but we're also legitimately timing out a lot of jobs, so I'm concerned about slowing anything down with a memory boost causing more of those | 15:01 |
sean-k-mooney | https://www.omgubuntu.co.uk/2022/01/ubuntu-on-raspberry-pi-4-2gb-zswap | 15:01 |
dansmith | if we're thrashing I think it will slow us down a lot, if we're stashing bloat we never reference, then it will help | 15:01 |
* bauzas | is currently investigating https://bugs.launchpad.net/tempest/+bug/1999893 | 15:01 |
sean-k-mooney | once we swap to 22.04 it will be better | 15:01 |
bauzas | and I suspect this may be related | 15:01 |
dansmith | sean-k-mooney: what will? | 15:01 |
sean-k-mooney | dansmith: its much simpler to enable zswap in 22.04 | 15:02 |
dansmith | oh | 15:02 |
sean-k-mooney | and they did some performance optimisations | 15:02 |
sean-k-mooney | https://waldorf.waveform.org.uk/2021/6-months-with-the-pi-desktop.html | 15:02 |
dansmith | but still, you're trading cpu for memory | 15:02 |
sean-k-mooney | cpu vs disk io performance really | 15:03 |
dansmith | and right now in addition to running against the memory barrier, we're also running against the cpu barrier | 15:03 |
sean-k-mooney | compression effectively makes it like swapping to faster storage | 15:03 |
sean-k-mooney | if you have the cpu to do it | 15:03 |
dansmith | sure, if you're io constrained it will help there, but it's all trading for cpu | 15:03 |
sean-k-mooney | yep | 15:04 |
sean-k-mooney | i looked into this a while ago when those blogs first came out | 15:04 |
sean-k-mooney | it's probably worth doing an experiment with at least | 15:04 |
dansmith | yeah, any shorter but fatter jobs will benefit from it for sure, and it could reduce our own noisy neighbor effects | 15:05 |
sean-k-mooney | honestly just moving to 16G vms in ci would be the better solution | 15:05 |
sean-k-mooney | but since that is not really an option | 15:05 |
dansmith | I'm just worried about slowing the tests down at all because we're already hitting widespread legit timeouts | 15:05 |
dansmith | on slower nodes | 15:05 |
sean-k-mooney | ya which is a concern | 15:05 |
sean-k-mooney | wow gerrit found https://review.opendev.org/c/openstack/devstack/+/828639 because i mentioned zswap in my last comment on the top level | 15:08 |
* sean-k-mooney | was hoping i already coded that up | 15:08 |
* bauzas | disappears for taxi reasons | 15:16 |
* bauzas | is back | 15:46 |
darkhorse | Hi team - I read about service tokens in cinder documentation which is really cool. Can service tokens be used in nova and other services too? | 16:46 |
dansmith | wow, the nova-next mysql patch passed check the first time through | 17:08 |
dansmith | bauzas: sean-k-mooney ^ | 17:08 |
dansmith | melwitt: is this something you saw in the gate? https://review.opendev.org/c/openstack/nova/+/875991 | 17:39 |
dansmith | I don't see that error message in opensearch | 17:39 |
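
The 10:14 exchange, where admin1 lists `host` and `version` from `nova.services` to find the computes still reporting 60, can be sketched roughly as below. This is only an illustration and not nova's actual check: `ServiceRecord`, `find_outdated`, and the host names are hypothetical, and the threshold of 61 is simply the Yoga service version quoted in the log.

```python
from typing import Iterable, List, NamedTuple


class ServiceRecord(NamedTuple):
    """Hypothetical stand-in for one (host, version) row of nova.services."""
    host: str
    version: int


# Assumption taken from the log: Yoga is service version 61, and a Zed
# control plane refuses computes that report anything older.
OLDEST_SUPPORTED_VERSION = 61


def find_outdated(services: Iterable[ServiceRecord],
                  oldest_supported: int = OLDEST_SUPPORTED_VERSION) -> List[str]:
    """Return the sorted hosts whose service version is below the minimum.

    The service state (up or down) is deliberately ignored, matching the
    point made in the log that powered-off computes still count.
    """
    return sorted(s.host for s in services if s.version < oldest_supported)


# Made-up example rows; real data would come from the nova database.
services = [
    ServiceRecord("compute-a", 61),
    ServiceRecord("compute-b", 60),  # never restarted on the Yoga release
    ServiceRecord("compute-c", 61),
]
outdated = find_outdated(services)
print(outdated)
```

Any host this returns would need to be started on the supported release (or removed) before the upgrade, rather than having its version edited in the database.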