*** gibi_off is now known as gibi | 07:43 | |
* gibi is back | 07:45 | |
gibi | let me know if there are urgent+important things I should be aware of | 07:46 |
elodilles | gibi: welcome back o/ maybe a nice clean cherry pick for an easy start? :D o:) https://review.opendev.org/c/openstack/nova/+/904371 | 08:17 |
gibi | elodilles: thanks. Easy win indeed. Done | 08:24 |
elodilles | \o/ | 08:24 |
elodilles | thanks, too | 08:24 |
gibi | ping me when the next round of backport from that is up | 08:28 |
elodilles | sure thing :) | 08:31 |
opendevreview | Merged openstack/nova stable/2023.2: Reproduce bug #2025480 in a functional test https://review.opendev.org/c/openstack/nova/+/904370 | 08:58 |
*** Continuity__ is now known as Continuity | 09:35 | |
bauzas | gibi: want some paperwork change ? sure https://review.opendev.org/c/openstack/nova-specs/+/905342 :p | 09:43 |
gibi | bauzas: sure. done. | 09:45 |
bauzas | <3 | 09:45 |
* bauzas hugs | 09:45 | |
opendevreview | Merged openstack/nova-specs master: add the blueprint link to the specs https://review.opendev.org/c/openstack/nova-specs/+/905342 | 09:58 |
opendevreview | Merged openstack/nova stable/2023.2: Do not untrack resources of a server being unshelved https://review.opendev.org/c/openstack/nova/+/904371 | 10:24 |
gibi | gate seems to be healthy on stable | 10:45 |
elodilles | gibi: yepp, it looks good \o/ and the 2023.2 version of the backport has merged ;) | 10:48 |
elodilles | so I've +2'd the patch on 2023.1 | 10:49 |
gibi | elodilles: lets see how the 2023.1 gate performs :) | 11:07 |
bauzas | fwiw, back in 2024 and trying to find why we race for https://42ba7385992f85feadcc-7de3ea29d2988c6d3fb53b5f30580c1f.ssl.cf2.rackcdn.com/904177/5/check/nova-tox-functional-py38/b878af0/testr_results.html | 11:09 |
bauzas | 3 weeks after testing, I don't remember why | 11:09 |
sean-k-mooney[m] | bauzas: is it because you added the test before https://review.opendev.org/c/openstack/nova/+/904209/4 | 11:26 |
sean-k-mooney[m] | it also looks like whatever fixture you're using to provide the mdevs likely is not taking effect | 11:30 |
opendevreview | Amit Uniyal proposed openstack/nova master: enforce remote console shutdown https://review.opendev.org/c/openstack/nova/+/901824 | 12:32 |
opendevreview | Amit Uniyal proposed openstack/nova master: enforce remote console shutdown https://review.opendev.org/c/openstack/nova/+/901824 | 12:36 |
opendevreview | Merged openstack/nova stable/2023.1: Reproduce bug #2025480 in a functional test https://review.opendev.org/c/openstack/nova/+/904372 | 12:46 |
*** elodilles is now known as elodilles_afk | 12:55 | |
opendevreview | Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero https://review.opendev.org/c/openstack/nova/+/857339 | 13:01 |
opendevreview | Amit Uniyal proposed openstack/nova master: WIP: temp commit for tests https://review.opendev.org/c/openstack/nova/+/905597 | 13:01 |
bauzas | sean-k-mooney: nope, sorry but you're wrong, the test works correctly | 13:25 |
bauzas | I'm running stestr with --until-failure for 2 hours now on all functional tests with mdevs, no problem for the moment | 13:26 |
bauzas | stestr --test-path=./nova/tests/functional run nova.tests.functional.libvirt.test_vgpu --until-failure | 13:27 |
bauzas | ====== | 13:27 |
bauzas | Totals | 13:27 |
bauzas | ====== | 13:27 |
bauzas | Ran: 3924 tests in 8192.6080 sec. | 13:27 |
bauzas | - Passed: 3924 | 13:27 |
bauzas | - Skipped: 0 | 13:27 |
bauzas | - Expected Fail: 0 | 13:27 |
bauzas | - Unexpected Success: 0 | 13:27 |
bauzas | - Failed: 0 | 13:27 |
bauzas | Sum of execute time for each test: 35255.4411 sec. | 13:27 |
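The `--until-failure` loop bauzas is running can be sketched in plain Python for any flaky check. This is an illustration of the technique, not stestr itself; `Flaky` and `run_until_failure` are hypothetical names:

```python
class Flaky:
    """Deterministic stand-in for a flaky test: passes until the
    fail_on-th call, then fails once."""

    def __init__(self, fail_on):
        self.calls = 0
        self.fail_on = fail_on

    def __call__(self):
        self.calls += 1
        return self.calls != self.fail_on


def run_until_failure(test, max_runs=10_000):
    """Re-run `test` until it fails or the run budget is exhausted.

    Returns (passed_count, failed): how many runs succeeded before the
    first failure, and whether a failure was observed at all. This
    mirrors what `stestr run --until-failure` does for a whole test
    selection.
    """
    for run in range(max_runs):
        if not test():
            return run, True  # failed on run index `run`
    return max_runs, False


passed, failed = run_until_failure(Flaky(fail_on=37))
# passed == 36, failed is True
```

Like bauzas' 2-hour stestr run, a clean result here only bounds the failure rate; it never proves the race is gone.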
*** elodilles_afk is now known as elodilles | 13:33 | |
*** elodilles is now known as elodilles_afk | 14:21 | |
* frickler also has a simple stable paperwork change to offer https://review.opendev.org/c/openstack/nova/+/897297 | 16:12 | |
*** elodilles_afk is now known as elodilles | 16:27 | |
kashyap | zigo: Or anyone: Do you know if #ubuntu-kernel on Libera is still the place to talk to Ubuntu kernel maintainers? | 16:49 |
clarkb | kashyap: the ubuntu wiki says that is the correct location and the ubuntu security team is in a similarly named channel on libera | 16:50 |
kashyap | clarkb: Thank you! Mid last year I had a chat with 'juergh' about a certain kernel panic. I was hoping to catch them again, but the channel right now has about 4 folks. | 16:58 |
sean-k-mooney | http://tinyurl.com/bdfk6j9e | 17:04 |
kashyap | melwitt: dansmith: Hi, so even with CirrOS 0.5.3 (with the updated stable kernel), we're seeing these sort of failures, yeah? - https://bugs.launchpad.net/nova/+bug/2018612 | 17:04 |
sean-k-mooney | so for context we have had 226 kernel panics in the last 10 ish days | 17:05 |
kashyap | (Meta note: if so, we have to file a separate issue as per the above; I can do it) | 17:05 |
sean-k-mooney | many have been in the cinder project jobs | 17:05 |
kashyap | sean-k-mooney: Yeah, I was just about to run a query in OpenSearch too; thanks for the link | 17:05 |
melwitt | kashyap: we're using cirros 0.6.2 and 0.6.1 | 17:05 |
melwitt | orly? ok, that's good to know | 17:05 |
kashyap | melwitt: Ah, thanks! I'm running on an outdated brain model | 17:06 |
melwitt | :) | 17:06 |
sean-k-mooney | they are in neutron, glance, cinder, nova | 17:06 |
dansmith | the first one i see in that stack from sean-k-mooney is not like the one from the bug kashyap | 17:06 |
dansmith | it's failing to load userspace libs, which is what sean-k-mooney was mentioning, but is different from the page fault in the bug.. similar time in boot, but different behavior | 17:07 |
sean-k-mooney | " Kernel panic - not syncing: IO-APIC + timer " | 17:07 |
sean-k-mooney | that is from using cirros 5.x | 17:07 |
sean-k-mooney | it's a fixed kernel bug that was just not in the cirros images, so we can ignore those | 17:07 |
kashyap | dansmith: Yeah, the one I just searched for is this (also in melwitt's CI list): | 17:09 |
kashyap | "Kernel panic - not syncing: Attempted to kill init! exitcode=0x00001000" | 17:09 |
sean-k-mooney | right, the "/sbin/init: can't load library 'libtirpc.so.3'" is the main panic we still see after the recent changes we made | 17:09 |
kashyap | The above panic has 36 global hits (12 Tempest) in the last 7 days. Nova and Cinder projects affected | 17:09 |
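The signature query kashyap is running in OpenSearch can be approximated locally by scanning console logs for known panic strings. A minimal sketch: the signatures are the ones quoted in this thread, but the helper names and the sample log lines are made up for illustration:

```python
import re
from collections import Counter

# Panic signatures mentioned in this discussion.
SIGNATURES = {
    "io-apic": re.compile(r"Kernel panic - not syncing: IO-APIC \+ timer"),
    "kill-init": re.compile(
        r"Kernel panic - not syncing: Attempted to kill init! exitcode=\S+"),
    "libtirpc": re.compile(r"/sbin/init: can't load library 'libtirpc\.so\.3'"),
}


def tally_panics(lines):
    """Count how many log lines match each known panic signature."""
    counts = Counter()
    for line in lines:
        for name, pat in SIGNATURES.items():
            if pat.search(line):
                counts[name] += 1
    return counts


# Fabricated sample console output, not real CI logs.
sample_log = [
    "[ 3.14] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00001000",
    "[ 2.71] /sbin/init: can't load library 'libtirpc.so.3'",
    "[ 1.61] Kernel panic - not syncing: IO-APIC + timer doesn't work!",
    "[ 0.99] systemd[1]: Started Journal Service.",
]
```

Telling the signatures apart matters here, since the IO-APIC panic is a known-fixed cirros 0.5.2 kernel bug while the libtirpc one is the panic still being chased.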
dansmith | sean-k-mooney: meaning we see that even with the split boot right? | 17:10 |
dansmith | I hadn't seen that manifestation until just now | 17:10 |
melwitt | I guess that's "good" that we can still get some samples for showing kernel devs if we can get some help | 17:10 |
dansmith | or hadn't noticed. I've mostly been in "see a panic, recheck" mode for a while now | 17:10 |
kashyap | sean-k-mooney: IIUC, "5.x" needs to be 0.5.3 or higher | 17:10 |
sean-k-mooney | dansmith: yep, but from what i have seen it's mostly (perhaps always) been a boot from volume test | 17:11 |
sean-k-mooney | kashyap: i don't think the io-apic issue was ever fixed in 0.5.3 | 17:11 |
sean-k-mooney | i tried to get it rebuilt in the past but it didn't happen | 17:12 |
sean-k-mooney | if 0.5.3 happened in the last year then it might have the updated kernel | 17:12 |
sean-k-mooney | if not then it is still using the old ubuntu package | 17:12 |
dansmith | sean-k-mooney: ack, but it seems like it's still in initramfs at this point so still seems like bfv "shouldn't matter" | 17:13 |
kashyap | By the "io-apic issue" you mean the IO-APIC-related "no_timer_check" patch that was at some point added to a CirrOS image? | 17:13 |
sean-k-mooney | kashyap: no | 17:14 |
sean-k-mooney | the reason this happens in 5.2 for example is ubuntu had already fixed the kernel bug but cirros was not rebuilt with the fixed kernel package | 17:14 |
kashyap | sean-k-mooney: CirrOS 0.5.3 happened in May 2023 (after the bug Dan filed): https://github.com/cirros-dev/cirros/releases/tag/0.5.3 | 17:14 |
sean-k-mooney | even though it was released | 17:14 |
sean-k-mooney | ok so it did happen in the last year | 17:15 |
kashyap | Yep; I filed this request after Dan's bug above: https://github.com/cirros-dev/cirros/issues/102 | 17:16 |
sean-k-mooney | i asked to do this 2-3 years ago and they said no at the time | 17:16 |
kashyap | Maybe because at that time the older kernel was still "in support" | 17:16 |
sean-k-mooney | in support didn't matter | 17:17 |
sean-k-mooney | it was pinned to a specific deb package | 17:17 |
sean-k-mooney | so it was receiving no backports | 17:17 |
sean-k-mooney | it was using 5.3.0-26.28~18.04.1 explicitly | 17:17 |
sean-k-mooney | anyway, ya, if we are using cirros 5.2 anywhere those should be updated to 5.3 | 17:18 |
sean-k-mooney | https://codesearch.opendev.org/?q=cirros-0.5.2&i=nope&literal=nope&files=&excludeFiles=&repos= | 17:20 |
sean-k-mooney | so unfortunately we are using 0.5.2 in some places | 17:20 |
sean-k-mooney | obviously on master we should be using 6.x | 17:20 |
sean-k-mooney | but we likely should fix that if nova is pinned, and inform others to update their repos to avoid the io-apic issue | 17:21 |
sean-k-mooney | the cinder-tempest-plugin-lvm-lio-barbican-centos-9-stream is using 0.5.2 for example | 17:21 |
sean-k-mooney | or at least having the io-apic panic | 17:22 |
sean-k-mooney | nova is using it for the arm job | 17:22 |
sean-k-mooney | that's the only usage of 5.2 that affects us directly | 17:23 |
sean-k-mooney | i'm not sure the kernel bug exists in the arm builds | 17:23 |
kashyap | melwitt: sean-k-mooney: I'll start a "kernel panics" only Etherpad to track some of these in one place | 17:27 |
melwitt | thanks kashyap | 17:27 |
sean-k-mooney | i'm in the middle of something else, but dansmith melwitt do you want me to select one of the jobs and go back to the normal image and larger nodeset label | 17:29 |
sean-k-mooney | or will ye have time to do that | 17:29 |
dansmith | not just larger nodeset, but also with a larger flavor | 17:30 |
sean-k-mooney | melwitt: also i had hoped to start reviewing your series today but it's looking like it will be wednesday | 17:30 |
sean-k-mooney | dansmith: ya, that's what i meant but didn't say | 17:30 |
dansmith | I'm actually supposed to be off doing something else right now too, but I can work on that later if you never get to it | 17:30 |
dansmith | (later this week) | 17:31 |
sean-k-mooney | ack, i'll see if i have time before i finish for today | 17:31 |
sean-k-mooney | to that end i'll go back to my ansible patch | 17:31 |
kashyap | sean-k-mooney: melwitt: A quick start here: https://etherpad.opendev.org/p/Kernel-panics-in-Nova-CI | 17:33 |
gibi | why does a privsep-decorated function like https://github.com/openstack/nova/blob/master/nova/virt/libvirt/cpu/core.py#L63-L66 not get escalated privileges? I see https://paste.opendev.org/show/bFZ9WgE2828iGamQoDaB/ and sure, /sys/devices/system/cpu/cpu1/online is only writable by root, but I would assume that is why we have the privsep decorator on the call | 17:44 |
sean-k-mooney | it should be elevated | 17:46 |
sean-k-mooney | that is a file not found issue | 17:46 |
gibi | no the file is visible | 17:47 |
gibi | [root@edpm-compute-1 ~]# podman exec -it nova_compute /bin/bash | 17:47 |
gibi | bash-5.1$ ls -alh /sys/devices/system/cpu/cpu1/online | 17:47 |
gibi | -rw-r--r--. 1 root root 4.0K Jan 15 14:38 /sys/devices/system/cpu/cpu1/online | 17:47 |
sean-k-mooney | can you write to it with sudo? | 17:47 |
sean-k-mooney | i'm wondering if it's selinux | 17:47 |
sean-k-mooney | i'm guessing you're on centos because of py 3.9 | 17:48 |
gibi | yes I can | 17:48 |
gibi | [root@edpm-compute-1 ~]# podman exec -it -u root nova_compute /bin/bash | 17:48 |
gibi | [root@edpm-compute-1 /]# echo "online" > /sys/devices/system/cpu/cpu1/online | 17:48 |
gibi | [root@edpm-compute-1 /]# echo "offline" > /sys/devices/system/cpu/cpu1/online | 17:48 |
gibi | [root@edpm-compute-1 /]# cat /sys/devices/system/cpu/cpu1/online | 17:48 |
gibi | 0 | 17:48 |
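The sysfs interface gibi is exercising above can be wrapped in a small helper. This sketch parameterizes the sysfs root so it can be demoed against a plain directory (writing the real file needs root); the helper names are made up. Note the kernel's `kstrtobool` parser appears to look only at the leading characters of the written value, which is why echoing "online"/"offline" in the paste above also works:

```python
import tempfile
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu_id, online, sysfs_root=SYSFS_CPU):
    """Write 1/0 to cpuN/online (needs root against the real /sys)."""
    path = Path(sysfs_root) / f"cpu{cpu_id}" / "online"
    path.write_text("1\n" if online else "0\n")


def cpu_is_online(cpu_id, sysfs_root=SYSFS_CPU):
    """Read back cpuN/online and report whether the CPU is up."""
    path = Path(sysfs_root) / f"cpu{cpu_id}" / "online"
    return path.read_text().strip() == "1"


# Demo against a fake sysfs tree so this runs unprivileged.
fake_sys = Path(tempfile.mkdtemp())
(fake_sys / "cpu1").mkdir()
(fake_sys / "cpu1" / "online").write_text("1\n")
set_cpu_online(1, False, sysfs_root=fake_sys)
```

In nova the equivalent write lives behind the privsep decorator discussed above, precisely because only root may write this file.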
sean-k-mooney | odd | 17:49 |
sean-k-mooney | anything in the audit log? | 17:49 |
sean-k-mooney | i'm wondering if the permission issue is due to user/group, or due to selinux, or something else | 17:49 |
gibi | no related log in audit.log | 17:51 |
gibi | wouldn't I should see the code that escalates privileges in the stack trace? | 17:52 |
gibi | * shouldn't I see ... | 17:52 |
sean-k-mooney | well it should be running via privsep | 17:52 |
sean-k-mooney | so you should see some privsep debug logs | 17:52 |
sean-k-mooney | i'm not sure why this is saying ERROR oslo_service.service | 17:53 |
sean-k-mooney | that's the wrong module logger | 17:54 |
gibi | there is no debug log about privsep either | 17:54 |
gibi | but we have `oslo.privsep.daemon=INFO` in the config | 17:54 |
sean-k-mooney | you should see a message from nova saying the daemon was started | 17:55 |
sean-k-mooney | even at info | 17:55 |
sean-k-mooney | perhaps not the detailed logs, but you should see it start | 17:55 |
gibi | something is very strange | 18:01 |
gibi | at the first nova-compute startup the privsep daemon logs that it is started, but on subsequent container restarts it does not log it | 18:02 |
gibi | I'm going to redeploy the computes to see if I can reproduce it | 18:03 |
sean-k-mooney | this is the nova compute container? | 18:04 |
sean-k-mooney | hum ok, i wonder if we are leaking the unix socket | 18:04 |
sean-k-mooney | or whatever we are using to detect the start between container starts | 18:05 |
sean-k-mooney | and then not recreating the privsep daemon | 18:05 |
gibi | I pushed revert https://github.com/openstack-k8s-operators/nova-operator/pull/650 to unblock Andrew and the folks testing cpu pinning | 18:09 |
sean-k-mooney | hehe ok, i'm not sure that is super relevant to upstream nova folks but perhaps we have a bug | 18:10 |
gibi | nah, that repo is also upstream :D | 18:10 |
sean-k-mooney | i suppose | 18:10 |
gibi | spreading the word that a new deployer engine exists ;) | 18:11 |
sean-k-mooney | anyway i suspect we need to look into how privsep is being executed | 18:11 |
sean-k-mooney | i didn't look at that previously because i was expecting it to run the same as tripleo did | 18:11 |
sean-k-mooney | but perhaps something has changed there | 18:11 |
gibi | I will redeploy tomorrow and check for the privsep process before touching anything | 18:12 |
gibi | then I will try a container restart to see if privsep also restarted | 18:12 |
sean-k-mooney | so you might need to do an operation that needs privsep before it's started | 18:13 |
sean-k-mooney | so you might need to boot a vm or something like that | 18:13 |
sean-k-mooney | i think it's lazy loaded | 18:13 |
sean-k-mooney | we are not using the fork model i believe | 18:13 |
sean-k-mooney | there are 2 ways to run privsep but i think we are using the privsep-helper way | 18:13 |
sean-k-mooney | via sudo/rootwrap | 18:13 |
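The lazy-start behaviour sean-k-mooney describes can be sketched with a toy context object. This is an illustration of the pattern, not the real oslo.privsep API; all names here are invented. The point is that the daemon is only spawned on the first decorated call, so a restarted service logs nothing about privsep until something actually needs elevation:

```python
import functools


class PrivContext:
    """Toy stand-in for a privsep context: decorated functions are
    dispatched through a 'daemon' that is started lazily, on the first
    call rather than at import or service start."""

    def __init__(self):
        self.daemon_started = False
        self.log = []

    def _start_daemon(self):
        # Real privsep forks a root helper (or execs one via
        # sudo/rootwrap); here we just record that startup happened.
        self.daemon_started = True
        self.log.append("privsep daemon starting")

    def entrypoint(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not self.daemon_started:
                self._start_daemon()
            # Real privsep serializes the call over a unix socket to the
            # escalated daemon; we call the function in-process instead.
            return func(*args, **kwargs)
        return wrapper


ctx = PrivContext()


@ctx.entrypoint
def power_up(cpu_id):
    return f"cpu{cpu_id} online"
```

This also shows why the escalation machinery need not appear in a traceback: the decorator is a thin dispatch wrapper, and the actual privileged work happens in a separate daemon process.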
gibi | I will check tomorrow and report back | 18:14 |
sean-k-mooney | cool. are you almost done for today | 18:14 |
sean-k-mooney | i'm going to be around until the top of the hour but i approved the revert quickly to unblock folks | 18:14 |
gibi | I'm done for today | 18:15 |
gibi | see you all tomorrow | 18:15 |
sean-k-mooney | o/ | 18:15 |
opendevreview | Merged openstack/nova stable/2023.2: Fix URLs in status check results https://review.opendev.org/c/openstack/nova/+/897297 | 19:07 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] debug kernel paincs https://review.opendev.org/c/openstack/nova/+/905628 | 19:31 |
sean-k-mooney | dansmith: it looks like the 16G nodepool labels are not available so i used the 32G ones ^ i can see about re-adding the 16G flavors but i just want to see if that is correct first. the vexxhost flavor still exists as we have 16G labels but with bionic | 19:32 |
sean-k-mooney | so we just need a 16G label with jammy | 19:33 |
sean-k-mooney | with that said, i'm not sure, but it looks like vexxhost might be providing 32G vms by default now https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L147-L187 | 19:35 |
opendevreview | Merged openstack/nova stable/2023.1: Do not untrack resources of a server being unshelved https://review.opendev.org/c/openstack/nova/+/904373 | 20:06 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] debug kernel paincs https://review.opendev.org/c/openstack/nova/+/905628 | 21:51 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!