sean-k-mooney[m] | well it's not exactly from 0 since i started it like a year ago https://review.opendev.org/q/topic:%22alpine%22 | 00:01 |
---|---|---|
sean-k-mooney[m] | but ya we don't use much in the images and honestly i kind of wish tempest only used busybox or similar tools | 00:01 |
sean-k-mooney[m] | tinycore might be a bit too spartan but it would be worth a try | 00:01 |
sean-k-mooney[m] | my alpine image does not have what's needed for growpart | 00:02 |
clarkb | fwiw I don't expect the ubuntu kernel built for cirros to be that drastically different than another tiny kernel. They use the same source but rebuild to make it smaller and lightweight iirc | 00:02 |
sean-k-mooney[m] | so that was causing things to fail | 00:02 |
JayF | alpine does use musl libc, which can cause changes in behavior FWIW; but as I say that I realize I don't know if tinycore uses glibc :) | 00:02 |
sean-k-mooney[m] | clarkb: hum ok i thought they just unpacked the deb file without rebuilding | 00:02 |
clarkb | sean-k-mooney[m]: there is a whole massive build process for those images. I thought they were rebuilding the kernels. frickler can probably confirm or deny | 00:03 |
sean-k-mooney[m] | JayF: the use of musl is part of why alpine was appealing | 00:03 |
clarkb | sean-k-mooney[m]: fwiw I think the ubuntu kernel packages hosted for ubuntu are several times larger than an entire cirros image | 00:03 |
JayF | Ah. I would consider it quite the opposite since all our other testing is on glibc, and most target distributions for openstack are glibc. Seems like a place where we could find weird breakages that we might not be well-suited to fix. | 00:03 |
sean-k-mooney[m] | clarkb: ack, in this case the kernel messages seem to be indicating that we are running out of memory for memory mapped io to allocate to the pcie devices | 00:03 |
clarkb | maybe even an entire order of magnitude | 00:03 |
clarkb | JayF: also python runs poorly on musl | 00:04 |
clarkb | though most of our jobs don't really need to run python in the nested VMs | 00:04 |
sean-k-mooney[m] | JayF: nothing about openstack should care if the guest has glibc or not | 00:04 |
sean-k-mooney[m] | if it does that's a problem we should fix | 00:04 |
JayF | sean-k-mooney[m]: aha, you don't run anything inside the images; Ironic does, I forgot that :D | 00:04 |
sean-k-mooney[m] | we basically use touch, ls, ssh and a few other very minimal commands | 00:05 |
sean-k-mooney[m] | hence why busybox + mkfs.fat32 | 00:05 |
sean-k-mooney[m] | would cover most things | 00:05 |
clarkb | I think the most complicated stuff we do in the nodes is related to networking to confirm network functionality | 00:05 |
sean-k-mooney[m] | assuming glean or cloud-init work for metadata/ssh keys | 00:05 |
JayF | sean-k-mooney[m]: you really should see how far you could get with commenting out 99% of our tinyipa build script and tossing the resulting image into a CI job | 00:06 |
JayF | I guess really the hard part is CI infra, and it wouldn't be able to reuse the IPA-B stuff because it's not really IPA-adjacent | 00:06 |
JayF | https://opendev.org/openstack/ironic-python-agent-builder/src/branch/master/tinyipa/build-tinyipa.sh is the entrypoint for our tinyipa ramdisk build, fwiw | 00:07 |
sean-k-mooney[m] | why is it pulling from linux.dell.com? | 00:07 |
JayF | to get a source release of biosdevname, which isn't shipped in tinyipa | 00:08 |
sean-k-mooney[m] | ya but i assume that is not the upstream source repo for that | 00:08 |
JayF | you actually assume wrong, surprisingly | 00:08 |
JayF | but that was basically my knee jerk review feedback too LOL | 00:08 |
sean-k-mooney[m] | hum the more you know | 00:09 |
sean-k-mooney[m] | does the ipa image have glean or cloud init? | 00:10 |
JayF | for purposes of tinyipa, I think we only test that on dhcp images, but I'm not 100% sure | 00:10 |
JayF | rpittau is the expert on those | 00:10 |
* JayF reading it to be sure | 00:10 | |
sean-k-mooney[m] | it looks like it's configuring dhcp using udhcp ya | 00:13 |
JayF | sean-k-mooney[m]: yep; based on our docs I'd assume we don't glean/cloud-init it outta the box (but I'll note that TheJulia worked on upstream glean support for tinycore so it's likely just automation away): https://docs.openstack.org/ironic-python-agent-builder/latest/admin/tinyipa.html#enabling-disabling-ssh-access-to-the-ramdisk | 00:13 |
sean-k-mooney[m] | so the gap would be that the image today does not have anything to reach out and download the ssh key from metadata so that we can log in | 00:13 |
JayF | our relationship with glean in IPA is weird, because we run it on demand, not automatically | 00:13 |
sean-k-mooney[m] | we could fall back to password login but tempest assumes a “cloud image” i.e. one with a cloud-init equivalent | 00:14 |
JayF | you literally just need to wait for https://review.opendev.org/c/opendev/glean/+/899219 to land | 00:14 |
JayF | and add a stanza to install glean in the image | 00:14 |
sean-k-mooney[m] | good to know. | 00:15 |
JayF | I'll summarize this all with: I don't care what you use | 00:15 |
JayF | I don't even know why we used tinycore for this image | 00:15 |
JayF | but we've used it literally for years and years, it mostly just works | 00:16 |
JayF | only real downside is the longstanding open bug we have where tinycore just does not want to host their sources over https | 00:16 |
JayF | which is infuriating, but ... acceptably infuriating for a CI-only tool | 00:16 |
TheJulia | Yeah, and that is huge sadness | 00:16 |
* JayF -> EOD; good luck sean! | 00:17 | |
sean-k-mooney[m] | melwitt: so in nova next we allocate 24 pci root ports https://opendev.org/openstack/nova/src/branch/master/.zuul.yaml#L414 we should drop that to something more reasonable like 12-16 | 00:18 |
sean-k-mooney[m] | that will significantly reduce the pci mmio space required in the guest | 00:19 |
melwitt | sean-k-mooney[m]: ok, sweet. glad there is something we can try | 00:19 |
sean-k-mooney[m] | i'm not entirely sure that the actual issue is the io space assignment. looking at the kernel trace the error seems to be coming from an irq interrupt handler but it's also referencing ? exc_page_fault+0x89/0x170 so i would wonder if this is still a memory issue | 00:39 |
sean-k-mooney[m] | its saying Kernel panic - not syncing: Attempted to kill init! | 00:39 |
sean-k-mooney[m] | as the main panic message | 00:39 |
sean-k-mooney[m] | and it looks like we were switching from the initramdisk to the root file system at the time the interrupt was happening | 00:39 |
sean-k-mooney[m] | so i'm just wondering if the interrupt handler is trying to allocate memory at a time we are very close to the limit and failing occasionally | 00:40 |
sean-k-mooney[m] | if that is the case allocating a very small amount of swap in the nova flavor might also help but reducing the pcie root ports would still be my first step | 00:41 |
sean-k-mooney[m] | i think each unused pci root port is using like a mb of ram so 24 mb of our 128 total just for the pcie slots | 00:41 |
sean-k-mooney[m] | and we don't actually use more than a handful of them for volumes/ports | 00:42 |
melwitt | ok, I see | 00:42 |
sean-k-mooney[m] | the last messages before the panic are | 00:44 |
sean-k-mooney[m] | info: initramfs loading root from /dev/vda1 | 00:44 |
sean-k-mooney[m] | /sbin/init: can't load library 'libtirpc.so.3' | 00:44 |
melwitt | I noticed that too but didn't know what it means | 00:44 |
melwitt | that message is also just before the panic in the non-q35 job | 00:46 |
sean-k-mooney[m] | so this is where we are changing from running in the kernel ramdisk to running from the actual root disk i think and in this case init (which i believe is sysvinit not systemd in this case) is loading a dynamic lib into memory | 00:46 |
melwitt | maybe the common denominator? | 00:46 |
sean-k-mooney[m] | well if we get an interrupt at that point i'm guessing the kernel oom killer tries to kill init. i'll admit this is a bit beyond my understanding of early boot so i'm not sure either | 00:48 |
sean-k-mooney[m] | neutron are seeing the same error for what its worth | 00:49 |
sean-k-mooney[m] | https://bugs.launchpad.net/neutron/+bug/2039940 | 00:49 |
sean-k-mooney[m] | ya cirros has their own init script https://github.com/cirros-dev/cirros/blob/main/src/init | 00:53 |
sean-k-mooney[m] | it's failing here in the switch_root call https://github.com/cirros-dev/cirros/blob/main/src/init#L86 | 00:54 |
melwitt | nice find | 00:56 |
sean-k-mooney[m] | i feel like this is busybox init | 00:57 |
sean-k-mooney[m] | or rather its calling into busybox init | 00:58 |
sean-k-mooney[m] | based on https://github.com/cirros-dev/cirros/blob/46a1162787f669ad8d6065cb6bbe477654b4327f/conf/busybox.config#L701 | 00:58 |
sean-k-mooney[m] | i believe switch_root is at least being provided by busybox even if we are not using busybox init directly | 00:58 |
sean-k-mooney[m] | anyway im going to go to sleep now o/ | 01:04 |
melwitt | sean-k-mooney[m]: thanks for helping with this, have a good night o/ | 01:05 |
melwitt | (leaving a message for tomorrow) bauzas, gibi, sean-k-mooney: I made a list of some of the CI failures I saw on one of dan's patches to help me keep track and linked to this ^ discussion ... https://etherpad.opendev.org/p/nova-ci-failures-minimal in case anyone might be interested | 01:20 |
melwitt | I made a new etherpad bc I have trouble parsing etherpads with larger amounts of text | 01:23 |
tkajinam | wondering if that "please report a bug" message is beneficial these days . I agree it was in the past when OpenStack was not yet common but I doubt it still is considering its current status | 08:20 |
tkajinam | recent reports are mostly caused by unrelated problems (wrong configuration, broken rabbitmq, etc) and personally I've not seen any reports related to actual logic bug in nova for a while | 08:22 |
frickler | clarkb: sean-k-mooney[m] is right, cirros is using a stock ubuntu kernel, which is why that makes up about half of the complete image size. but that's also why I'm not convinced that this is the source for the issues. did someone check for possible general OOM situations for these failures? | 08:33 |
jkulik | to me that kernel-panic topic sounds like something is wrong with the image or with loading it. I'd interpret it as /sbin/init running into an error and the kernel thus panicking. /sbin/init: can't load library 'libtirpc.so.3' | 08:48 |
jkulik | [ 13.568826] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00001000 | 08:48 |
jkulik | this bug report seems related https://bugs.launchpad.net/tempest/+bug/1888224 which ends in "It looks like rootfs image is corrupted." | 08:49 |
jkulik | I'm not sure how the test infra works, but if it only happens sporadically - could it be that some workers have a corrupt image and others don't? | 08:51 |
frickler | the cirros image should be baked into the cloud image we boot our test instances from, so if there really is some data corruption happening, it would need to be at a very low level and would likely lead to a more diverse spread of error patterns | 09:52 |
bauzas | when looking at the kernel panics, yeah I'm not sure other OSes would work better | 10:00 |
sean-k-mooney[m] | frickler: i am thinking its a general OOM issue which is why i was suggesting reducint the pcie ports we allocate and or adding swap to the tempest flavor to avoid continually adding more ram to the flavors. say add 64mb of swap in the flavor | 10:27 |
sean-k-mooney[m] | *reduceing | 10:42 |
sean-k-mooney[m] | frickler: by the way how would you feel about building busybox in statically linked mode so we can avoid dynamically loading these shared objects in general? https://github.com/cirros-dev/cirros/blob/main/conf/busybox.config#L43 | 11:01 |
jkulik | hm ... yeah, could be OOM since the kernel-panic is in `exc_page_fault` ... maybe the kernel tries to load the shared library into memory but doesn't have the space? | 11:40 |
sean-k-mooney[m] | we have 128mb total. the kernel ramdisk was using 32mb of that and we allocated 24 pcie root devices, each of which uses about 1mb of ram, so we are using about half our ram just in those two. the initramfs can be unloaded once we complete the switch_root to the root block device but that's what failed. so ya it's not entirely clear if it OOM’d or not, partly due to how early this is in boot, but it's worth exploring | 12:01 |
elodilles | bauzas: hi, you've asked me on the meeting to ping you today about the nova stable releases o:) ( https://review.opendev.org/q/project:openstack/releases+is:open+intopic:nova ) | 12:37 |
elodilles | bauzas: let me know if for any reason any of the release patches needs an update (hopefully the version bumps are correct, and hashes should be the latest) | 12:39 |
opendevreview | Danylo Vodopianov proposed openstack/nova master: Packed virtqueue support was added. https://review.opendev.org/c/openstack/nova/+/876075 | 12:40 |
frickler | sean-k-mooney[m]: we can test that, would have to check how the effect on image size looks like | 13:03 |
sean-k-mooney | frickler: ya i'm not sure if it would go up or down (both on disk and at runtime) | 13:13 |
sean-k-mooney | frickler: are there any videos or blog posts on how cirros is built and its design goals? | 13:14 |
sean-k-mooney | i have too many other pulls on my time right now but i'm kind of interested in either improving cirros to address these problems or replacing it eventually | 13:15 |
sean-k-mooney | that's why i was looking at alpine as a replacement. tinycore ala ironic ipa style would be fine too, but if we can tweak cirros to make it more stable in this regard and continue to use it then that's fine in my book too | 13:17 |
dvo-plv | Hello, sean-k-mooney Are you here ? | 13:38 |
sean-k-mooney | i'm listening to an internal call but yep | 13:39 |
frickler | sean-k-mooney: I'm not aware of any kind of docs except for what is inside the repo itself. if you want to talk to smoser directly, you can find us in #cirros on liberachat. iiuc the design goal is to have a minimal image for testing purposes, which matches pretty much what openstack CI does with it | 13:40 |
dvo-plv | I currently have time to work on the nova patch | 13:41 |
dvo-plv | I've resolved some of your comments | 13:41 |
dvo-plv | but I have an open question | 13:41 |
dvo-plv | https://review.opendev.org/c/openstack/nova/+/876075/28/nova/virt/libvirt/config.py#1815 | 13:41 |
sean-k-mooney | frickler: ack i skimmed the docs briefly last night but have not actually looked at how the build system works or the current content of the image in any depth | 13:44 |
sean-k-mooney | frickler: so i was considering trying the static build myself | 13:44 |
sean-k-mooney | dvo-plv: so "self.driver_packed is True" is a common python mistake | 13:46 |
sean-k-mooney | you should not use "is" to compare if something is True | 13:46 |
sean-k-mooney | "is" should only be used to test against singletons like None or to check whether two names refer to the same object | 13:47 |
sean-k-mooney | so " self.driver_packed is True or" should just be "self.driver_packed or" | 13:47 |
dvo-plv | oh, I see what you mean, you would like to rewrite it to just "if self.driver_packed" | 13:48 |
sean-k-mooney | yep | 13:48 |
sean-k-mooney | just remove "is True" | 13:48 |
dvo-plv | I thought that you wanted some other statement here, okay, I will do it asap | 13:48 |
sean-k-mooney | cool | 13:49 |
sean-k-mooney | i normally give a code example when i ask for changes like this but i didnt last year | 13:50 |
sean-k-mooney | *night | 13:50 |
opendevreview | Danylo Vodopianov proposed openstack/nova master: Packed virtqueue support was added. https://review.opendev.org/c/openstack/nova/+/876075 | 14:07 |
dvo-plv | it's okay, I just got your comment wrong on my side | 14:09 |
opendevreview | Elod Illes proposed openstack/nova stable/zed: add a regression test for all compute RPCAPI 6.x pinnings for rebuild https://review.opendev.org/c/openstack/nova/+/900307 | 14:20 |
opendevreview | Elod Illes proposed openstack/nova stable/zed: Fix rebuild compute RPC API exception for rolling-upgrades https://review.opendev.org/c/openstack/nova/+/900341 | 14:20 |
opendevreview | Elod Illes proposed openstack/nova stable/zed: Adding server actions tests to grenade-multinode https://review.opendev.org/c/openstack/nova/+/900342 | 14:20 |
elodilles | bauzas: i've updated the zed version of the 'RPC backports', please review if you want them included in the zed release ^^^ | 14:29 |
bauzas | sure | 14:29 |
bauzas | gibi: do you want https://review.opendev.org/c/openstack/nova/+/901656 to be in the first Bobcat z release ? | 14:29 |
gibi | bauzas: ohh we can try a recheck but I don't want to hold the release | 14:35 |
bauzas | just rechecked | 14:37 |
gibi | me too :D | 14:37 |
gibi | thanks for the ping | 14:37 |
bauzas | gibi: looks to me it wasn't due to an ssh issue | 14:37 |
bauzas | oh, you looked at the other job failure | 14:38 |
bauzas | my bad | 14:38 |
gibi | no worries :) | 14:48 |
dvo-plv | gibi, Maybe you will have a chance to review this one ? | 14:56 |
dvo-plv | https://review.opendev.org/c/openstack/nova-specs/+/895924 | 14:56 |
dvo-plv | and this | 14:56 |
dvo-plv | https://review.opendev.org/c/openstack/nova/+/876075 | 14:56 |
*** blarnath is now known as d34dh0r53 | 15:00 | |
bauzas | gibi: sean-k-mooney: before I leave, so about https://review.opendev.org/c/openstack/nova/+/902084/1 | 16:54 |
bauzas | we have two possibilities : | 16:54 |
bauzas | 1/ use a star like I did | 16:54 |
bauzas | 2/ let device_addresses be optional | 16:54 |
bauzas | this way, | 16:55 |
bauzas | https://paste.opendev.org/show/bbw8uYTcDqasoSSD54WD/ would be acceptable | 16:55 |
sean-k-mooney | if we have fixed https://review.opendev.org/c/openstack/nova/+/899406/2 | 16:55 |
sean-k-mooney | then i think it can be optional | 16:55 |
bauzas | that's a bit different | 16:56 |
sean-k-mooney | that is fixing the fact that if you have just enabled_mdev_types = nvidia-35 | 16:56 |
sean-k-mooney | device_addresses | 16:56 |
bauzas | pas-ha[m] change is accepting to only use one type but with a section | 16:56 |
bauzas | yeah | 16:56 |
sean-k-mooney | ya so | 16:56 |
sean-k-mooney | without that i don't think device_addresses being optional is really a good approach | 16:57 |
bauzas | so, my WIP (if we make dev_add optional) would change the existing behaviour | 16:57 |
sean-k-mooney | but if we fix that then i'm fine with it | 16:57 |
bauzas | which is that if you use two types but only one got a section, then none of them use any section | 16:57 |
bauzas | not sure we could backport that | 16:58 |
sean-k-mooney | thats a bug right | 16:58 |
bauzas | not really | 16:58 |
sean-k-mooney | it kind of is | 16:58 |
sean-k-mooney | im aware of the legacy behavior for the first type | 16:58 |
bauzas | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7938-L7951 | 16:58 |
sean-k-mooney | but the second and subsequent types are meant to have a section | 16:59 |
bauzas | this conditional makes all types needing to have groups for each of them | 16:59 |
sean-k-mooney | and we should not ignore them if they are defined just because the first type does not have one | 16:59 |
bauzas | sure, I'm just saying this is a bit of a behavioural change | 16:59 |
bauzas | which couldn't be acceptable for a backport | 16:59 |
sean-k-mooney | so i do think all types should have a section in general | 17:00 |
sean-k-mooney | however that only makes sense if there is something in it | 17:00 |
bauzas | you do or don't ? | 17:00 |
sean-k-mooney | i think we should require a section for each mdev type unconditionally | 17:00 |
sean-k-mooney | personally | 17:00 |
sean-k-mooney | but whether device_addresses is required or not is a separate question | 17:01 |
bauzas | so you'd prefer to have https://paste.opendev.org/show/bKc2SjB6l7LILYE0onzG/ | 17:01 |
sean-k-mooney | kind of yes although i would also like to consider if max_instances is required or not | 17:02 |
bauzas | if so, that simplifies my change and then we can backport it :) | 17:02 |
sean-k-mooney | to me https://paste.opendev.org/show/bKc2SjB6l7LILYE0onzG/ | 17:02 |
sean-k-mooney | is you intentionally not setting device_addresses | 17:03 |
bauzas | I wouldn't require max_instance or only unless device_addresses isn't set | 17:03 |
sean-k-mooney | and as a result opting into the wildcard behavior | 17:03 |
bauzas | sean-k-mooney: correct, you're saying 'this is a default type for all found GPUs' | 17:03 |
bauzas | okay, then I'll change my current non-uploaded patch and I'll provide it tomorrow | 17:04 |
bauzas | thanks sean-k-mooney | 17:04 |
sean-k-mooney | ack | 17:04 |
sean-k-mooney | gibi: ^ when you have time does that work for you | 17:04 |
bauzas | sean-k-mooney: fwiw, https://review.opendev.org/c/openstack/nova-specs/+/900636 is open for reviews :) | 17:04 |
sean-k-mooney | hehe ya i need to do a review day soon | 17:05 |
sean-k-mooney | maybe tomorrow | 17:05 |
bauzas | you're lucky | 17:05 |
bauzas | my review karma is bad those days | 17:05 |
* bauzas raises hand up at some n... company | 17:05 | |
sean-k-mooney | https://www.stackalytics.io/report/contribution?module=nova-group&project_type=openstack&days=30 i mean mine is not much better for the last little bit | 17:06 |
gibi | bauzas: sean-k-mooney: sounds good to me | 17:07 |
bauzas | thanks gibi | 17:07 |
bauzas | will upload the patch quickly then | 17:07 |
opendevreview | melanie witt proposed openstack/nova master: Lower num_pcie_ports to 12 in the nova-next job https://review.opendev.org/c/openstack/nova/+/902175 | 17:29 |
melwitt | sean-k-mooney: from our convo yesterday ^ | 17:30 |
sean-k-mooney | cool let's see if that passes | 18:06 |
sean-k-mooney | it would only fail if we need more than 12 pci devices in a tempest test | 18:06 |
melwitt | kk | 18:07 |
sean-k-mooney | we use about 4-6 by default and i don't think we are attaching enough volumes or ports to cause an issue | 18:07 |
opendevreview | Merged openstack/python-novaclient master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/python-novaclient/+/899950 | 18:26 |
opendevreview | Merged openstack/nova stable/2023.2: Allow enabling cpu_power_management with 0 dedicated CPUs https://review.opendev.org/c/openstack/nova/+/901656 | 19:15 |
melwitt | argh, it already failed for the non-q35 guest kernel panic https://4a4a1510e17776a8b793-89f5aa2f0368a55b3a90e6b26173438f.ssl.cf2.rackcdn.com/902175/1/check/nova-multi-cell/c2bacb0/testr_results.html | 19:32 |
melwitt | it meaning the patch, unrelated to the change itself | 19:33 |
sean-k-mooney | that is a slightly different panic | 19:35 |
sean-k-mooney | it panicked at the same phase, switching_root | 19:35 |
melwitt | right | 19:35 |
sean-k-mooney | but instead of being in a page fault handler it was in the apic timer handler | 19:35 |
sean-k-mooney | it still could be OOM related | 19:36 |
sean-k-mooney | in that case we also don't have any of the io mapping errors since it's machine_type=pc and using pci not pcie as a result | 19:37 |
melwitt | that would be my guess | 19:37 |
melwitt | right | 19:37 |
sean-k-mooney | so the patch will help with one of the 2 issues but perhaps we should just add 64 mb of swap to the tempest flavors | 19:38 |
melwitt | I've been comparing a non kernel panic console log with a panic one and one difference I see leading up to it is that the non-panic has this line "info: copying initramfs to /dev/vda1" but the panic one does not | 19:38 |
sean-k-mooney | https://github.com/openstack/devstack/blob/master/lib/tempest#L293-L306 | 19:39 |
sean-k-mooney | so from here https://github.com/cirros-dev/cirros/blob/46a1162787f669ad8d6065cb6bbe477654b4327f/src/init#L63 | 19:40 |
melwitt | hm ok. so maybe I can try adding swap there and recheck a bunch of times to see if panic happens? | 19:40 |
melwitt | yes | 19:40 |
melwitt | so that code block is skipped in the cases where it panics. dunno why | 19:41 |
sean-k-mooney | so its failing here https://github.com/cirros-dev/cirros/blob/46a1162787f669ad8d6065cb6bbe477654b4327f/src/init#L85 | 19:41 |
melwitt | right | 19:42 |
sean-k-mooney | that's after where the copy should print | 19:42 |
sean-k-mooney | do you see the GPT header messages in the one that passed | 19:42 |
melwitt | yes | 19:43 |
sean-k-mooney | [ 11.405476] GPT:Primary header thinks Alt. header is not at the end of the disk. | 19:43 |
sean-k-mooney | [ 11.406051] GPT:229375 != 2097151 | 19:43 |
sean-k-mooney | [ 11.406370] GPT:Alternate GPT header not at the end of the disk. | 19:43 |
sean-k-mooney | [ 11.406743] GPT:229375 != 2097151 | 19:43 |
sean-k-mooney | [ 11.407018] GPT: Use GNU Parted to correct GPT errors. | 19:43 |
sean-k-mooney | ok so it's not related to that | 19:43 |
sean-k-mooney | if we don't see "info: copying initramfs to /dev/vda1" | 19:43 |
sean-k-mooney | that implies $ROOT is undefined | 19:43 |
melwitt | yeah.. I kinda wondered if the lack of "copying initramfs" is related to not being able to load 'libtirpc.so.3' | 19:44 |
sean-k-mooney | well if we have not copied the ramdisk to the root file system | 19:44 |
sean-k-mooney | then the dynamic loader won't be able to find it | 19:44 |
sean-k-mooney | so looking at https://github.com/cirros-dev/cirros/blob/46a1162787f669ad8d6065cb6bbe477654b4327f/src/init#L44-L52 | 19:45 |
melwitt | either $ROOT is undefined or if search_for_blank "$rootspec" rw "$NEWROOT_MP" was false | 19:45 |
sean-k-mooney | ya i think in the failing case we are taking the else branch on line 50 | 19:45 |
melwitt | I wonder how we could enable debug logging in there | 19:46 |
melwitt | debug 1 "did not find a device matching $rootspec" but we don't see it | 19:46 |
sean-k-mooney | we likely could do it by passing kernel command line args | 19:47 |
sean-k-mooney | but i partly wonder if that would fix it | 19:47 |
sean-k-mooney | we would have to change from the full disk image to the one with the split out kernel and initramfs | 19:47 |
sean-k-mooney | i know that neutron use that in one of their jobs | 19:48 |
sean-k-mooney | to avoid some kernel panics. | 19:48 |
melwitt | oh huh | 19:48 |
sean-k-mooney | specifically this one with the apic timer | 19:48 |
melwitt | interesting | 19:48 |
sean-k-mooney | any unrecognised parameter on the kernel command line is set as an ENV var in the init process by the kernel | 19:49 |
sean-k-mooney | in case you didn't know that. so if debug is based on an env var you should just add that env var to the kernel command line | 19:49 |
melwitt | so theoretically the num_pcie_ports might fix the nova-next panics and the split disk image might fix the rest ... seems it would be too good to be true | 19:49 |
melwitt | ok yeah I didn't know that | 19:50 |
sean-k-mooney | well i'm 99% sure that the split disk will fix the apic one | 19:50 |
sean-k-mooney | the only downside to that is that we lose coverage of testing full disk images if we use that everywhere | 19:50 |
sean-k-mooney | i'll check quickly if i can see where they enable it in their job | 19:51 |
melwitt | k. I wonder if we could get away with using the split image everywhere except nova-next | 19:52 |
opendevreview | alisafari proposed openstack/nova master: Fix traits to cpu flags mapping https://review.opendev.org/c/openstack/nova/+/902183 | 19:53 |
sean-k-mooney | probably | 19:53 |
melwitt | I feel like I'm seeing these panics all the time. not sure what changed bc they used to be so rare in the past. maybe OOM like you said or maybe if in the past everything was using the split image, that I don't know | 19:54 |
sean-k-mooney | no we were definitely using the whole disk image for several releases | 19:56 |
sean-k-mooney | i don't know if we ever used the split image by default | 19:56 |
melwitt | ack | 19:56 |
sean-k-mooney | we may have | 19:56 |
sean-k-mooney | ah found it | 20:10 |
sean-k-mooney | https://github.com/openstack/neutron/commit/e04bd8fbdfa56320d16870b1f294b2cb62b8a828 | 20:10 |
sean-k-mooney | so if we add | 20:11 |
sean-k-mooney | CIRROS_VERSION: 0.6.2 | 20:11 |
sean-k-mooney | DEFAULT_IMAGE_NAME: cirros-0.6.2-x86_64-uec | 20:11 |
sean-k-mooney | DEFAULT_IMAGE_FILE_NAME: cirros-0.6.2-x86_64-uec.tar.gz | 20:11 |
sean-k-mooney | that should fix it | 20:12 |
sean-k-mooney | the other way to do that is move the job to use nested virt | 20:12 |
sean-k-mooney | https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/821067 | 20:12 |
sean-k-mooney | melwitt: what we could do is swap to the uec image for our default jobs | 20:13 |
sean-k-mooney | and move nova-next to nested virt with the whole disk image | 20:14 |
melwitt | sounds worth a try | 20:14 |
sean-k-mooney | although we would want to use the nested virt jammy label in our nodeset | 20:15 |
sean-k-mooney | https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl04.opendev.org.yaml#L41-L42 | 20:15 |
melwitt | ok | 20:17 |
sean-k-mooney | do you know how to make those changes? i'm going to finish for today and get dinner but i can update the jobs tomorrow if you haven't and we can see if it works | 20:18 |
melwitt | I think I can do it given the examples you linked. but yeah sounds good, if I do it wrong feel free to redo it or whatever tomorrow | 20:19 |
sean-k-mooney | we in theory have nested virt capability for 2 vexxhost clouds and ovh so i don't really feel as bad about using it in nova-next as i did when it was only one provider | 20:21 |
melwitt | ah ok | 20:22 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be restarting momentarily for a patch update to address a recently observed regression preventing some changes from merging | 21:09 |
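
The review comment at 13:46 to 13:48 (drop `is True` and rely on truthiness) can be illustrated with a small sketch. `FakeDiskConfig` and both helper functions are hypothetical stand-ins invented for this example; they are not the actual nova `config.py` code, and only the attribute name `driver_packed` comes from the discussion.

```python
class FakeDiskConfig:
    """Hypothetical stand-in for the config object under review."""

    def __init__(self, driver_packed=None):
        self.driver_packed = driver_packed


def wants_packed_identity(cfg):
    # Identity check: only the bool singleton True passes.
    return cfg.driver_packed is True


def wants_packed_truthy(cfg):
    # Truthiness check, as suggested in the review: any truthy value passes.
    return bool(cfg.driver_packed)


for value in (True, False, None, 1):
    cfg = FakeDiskConfig(driver_packed=value)
    print(repr(value), wants_packed_identity(cfg), wants_packed_truthy(cfg))
```

The two forms agree for `True`, `False` and `None`, and differ only for truthy values that are not the `True` singleton (such as `1`), which is why the plain truthiness test is usually the simpler choice when the field is expected to hold a boolean or `None`.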
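
The memory estimate discussed at 00:41 and 12:01 can be written out as a rough back-of-the-envelope calculation. The figures (128 MB guest RAM, about 32 MB for the initramfs, about 1 MB per PCIe root port) are the approximate numbers quoted in the conversation, not measurements, and the function name is made up for this sketch.

```python
# Rough figures quoted in the discussion; treat them as estimates.
TOTAL_RAM_MB = 128       # guest RAM in the tempest flavor
INITRAMFS_MB = 32        # approximate size of the kernel ramdisk
MB_PER_ROOT_PORT = 1     # approximate cost of each PCIe root port


def remaining_ram_mb(num_pcie_ports):
    """RAM left after the initramfs and the PCIe root ports are accounted for."""
    return TOTAL_RAM_MB - INITRAMFS_MB - num_pcie_ports * MB_PER_ROOT_PORT


for ports in (24, 16, 12):
    print(f"{ports} root ports -> about {remaining_ram_mb(ports)} MB left")
```

Under these assumptions, going from 24 to 12 root ports frees roughly 12 MB during early boot, which is small in absolute terms but close to 10% of the guest's total RAM.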